Apache Hive

Overview

Apache Hive (TM)

The Apache Hive (TM) data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Built on top of Apache Hadoop (TM), it provides:

  • Tools to enable easy access to data via SQL, thus enabling data warehousing tasks such as extract/transform/load (ETL), reporting, and data analysis

  • A mechanism to impose structure on a variety of data formats

  • Access to files stored either directly in Apache HDFS (TM) or in other data storage systems such as Apache HBase (TM)

  • Query execution using Apache Hadoop MapReduce, Apache Tez or Apache Spark frameworks.

Hive provides standard SQL functionality, including many of the later SQL:2003 and SQL:2011 features for analytics. These include OLAP functions, subqueries, common table expressions, and more. Hive's SQL can also be extended with user code via user defined functions (UDFs), user defined aggregates (UDAFs), and user defined table functions (UDTFs).
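
To make the UDF extension point concrete, here is a minimal sketch of a scalar UDF built on the classic org.apache.hadoop.hive.ql.exec.UDF base class; the class name ReverseUDF and the package are illustrative, not part of Hive:

    // A minimal scalar UDF sketch using the classic UDF base class.
    // Hive resolves evaluate() by reflection on the argument types.
    package example;

    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;

    public final class ReverseUDF extends UDF {
      public Text evaluate(Text input) {
        if (input == null) {
          return null; // SQL NULL in, SQL NULL out
        }
        return new Text(new StringBuilder(input.toString()).reverse().toString());
      }
    }

Once packaged into a jar, such a function is registered in a session with ADD JAR and CREATE TEMPORARY FUNCTION before it can be called from queries.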

Hive users have a choice of 3 runtimes when executing SQL queries. Users can choose between the Apache Hadoop MapReduce, Apache Tez, or Apache Spark frameworks as their execution backend. MapReduce is a mature framework that is proven at large scales. However, MapReduce is a purely batch framework, and queries using it may experience higher latencies (tens of seconds), even over small datasets. Apache Tez is designed for interactive query and has substantially reduced overheads versus MapReduce. Apache Spark is a cluster computing framework built outside of MapReduce but on top of HDFS, with a notion of composable and transformable distributed collections of items called Resilient Distributed Datasets (RDDs), which allows processing and analysis without the traditional intermediate stages that MapReduce introduces.

Users are free to switch back and forth between these frameworks at any time. In each case, Hive is best suited for use cases where the amount of data processed is large enough to require a distributed system.
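
As a minimal sketch of such a switch, the JDBC session below sets hive.execution.engine before running a query; the HiveServer2 address and table name are placeholder assumptions:

    // Sketch: per-session engine switching through HiveServer2 JDBC.
    // Requires the hive-jdbc driver on the classpath; the host, port,
    // and table name below are placeholders.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class EngineSwitchDemo {
      public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default");
             Statement stmt = conn.createStatement()) {
          // Valid values are mr, tez, or spark, subject to what the
          // cluster and Hive version actually support.
          stmt.execute("SET hive.execution.engine=tez");
          try (ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM my_table")) {
            while (rs.next()) {
              System.out.println(rs.getLong(1));
            }
          }
        }
      }
    }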

Hive is not designed for online transaction processing. It is best used for traditional data warehousing tasks. Hive is designed to maximize scalability (scale out with more machines added dynamically to the Hadoop cluster), performance, extensibility, fault-tolerance, and loose-coupling with its input formats.

General Info

For the latest information about Hive, please visit our website at:

http://hive.apache.org/

Getting Started

Requirements

Java

Hive Version    Java Version
Hive 1.0        Java 6
Hive 1.1        Java 6
Hive 1.2        Java 7
Hive 2.x        Java 7
Hive 3.x        Java 8
Hive 4.x        Java 8

Hadoop

  • Hadoop 1.x, 2.x
  • Hadoop 3.x (Hive 3.x)

Upgrading from older versions of Hive

  • Hive includes changes to the MetaStore schema. If you are upgrading from an earlier version of Hive it is imperative that you upgrade the MetaStore schema by running the appropriate schema upgrade scripts located in the scripts/metastore/upgrade directory.

  • We have provided upgrade scripts for MySQL, PostgreSQL, Oracle, Microsoft SQL Server, and Derby databases. If you are using a different database for your MetaStore you will need to provide your own upgrade script.

Useful mailing lists

  1. [email protected] - To discuss and ask usage questions. Send an empty email to [email protected] in order to subscribe to this mailing list.

  2. [email protected] - For discussions about code, design and features. Send an empty email to [email protected] in order to subscribe to this mailing list.

  3. [email protected] - In order to monitor commits to the source repository. Send an empty email to [email protected] in order to subscribe to this mailing list.

Comments
  •  HIVE-21737: Upgrade Avro to version 1.10.1

    Co-Authored-By: Ismaël Mejía [email protected]

    What changes were proposed in this pull request?

    Why are the changes needed?

    Does this PR introduce any user-facing change?

    How was this patch tested?

    opened by iemejia
  • HIVE-24035: Add Jenkinsfile for branch-2.3

    What changes were proposed in this pull request?

    Enable precommit tests for github PR against branch-2.3.

    Why are the changes needed?

    Adding a new Jenkinsfile for the repo. This is almost the same file as used on the master branch, except that the timeout is changed from 6 to 12 hours.

    Does this PR introduce any user-facing change?

    No

    How was this patch tested?

    N/A

    opened by sunchao
  • HIVE-24316. Upgrade ORC from 1.5.6 to 1.5.8 in branch-3.1

    What changes were proposed in this pull request?

    This PR aims to upgrade Apache ORC from 1.5.6 to 1.5.8.

    Why are the changes needed?

    This will bring eleven bug fixes.

    • ORC 1.5.7: https://issues.apache.org/jira/projects/ORC/versions/12345702
    • ORC 1.5.8: https://issues.apache.org/jira/projects/ORC/versions/12346462

    Does this PR introduce any user-facing change?

    No.

    How was this patch tested?

    Pass the CI with the existing test cases.

    opened by dongjoon-hyun
  • HIVE-23998: Upgrade guava to 27 for Hive 2.3 branch

    What changes were proposed in this pull request?

    This PR proposes to upgrade Guava to 27 in Hive 2.3 branch.

    Why are the changes needed?

    While trying to upgrade Guava in Spark, we found the following error. A Guava method became package-private in Guava 20, so there is an incompatibility with Guava versions > 19.0.

    sbt.ForkMain$ForkError: sbt.ForkMain$ForkError: java.lang.IllegalAccessError: tried to access method com.google.common.collect.Iterators.emptyIterator()Lcom/google/common/collect/UnmodifiableIterator; from class org.apache.hadoop.hive.ql.exec.FetchOperator
    	at org.apache.hadoop.hive.ql.exec.FetchOperator.<init>(FetchOperator.java:108)
    	at org.apache.hadoop.hive.ql.exec.FetchTask.initialize(FetchTask.java:87)
    	at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:541)
    	at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1317)
    	at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1457)
    	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
    	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
    

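    For context, a hedged sketch of the kind of fix for this incompatibility: replacing the now package-private Guava call with its JDK equivalent. This is illustrative only; the actual change lives in Hive's FetchOperator.

    // Illustrative before/after for the IllegalAccessError above.
    // Iterators.emptyIterator() became package-private in Guava 20.
    import java.util.Collections;
    import java.util.Iterator;

    class EmptyIteratorFix {
      static Iterator<Object> empty() {
        // Before (fails at runtime against Guava >= 20):
        //   return com.google.common.collect.Iterators.emptyIterator();
        // After: the JDK equivalent, available since Java 7.
        return Collections.emptyIterator();
      }
    }
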
    Does this PR introduce any user-facing change?

    Yes. This upgrades Guava to 27.

    How was this patch tested?

    Built Hive locally.

    opened by viirya
  • HIVE-23553: Upgrade ORC version to 1.6.7

    What changes were proposed in this pull request?

    Bump the Apache ORC version to the latest 1.6 release (1.6.7).

    Why are the changes needed?

    So Apache Hive can take advantage of the latest features and bug fixes.

    Does this PR introduce any user-facing change?

    • Integer to Timestamp conversion uses seconds, while ORC 1.5 used milliseconds. This results in different expected output in queries like: schema_evol_orc_nonvec_part_all_primitive.q

    Non user-facing changes:

    • CacheWriter bufferSize is now decoupled from llap.max.alloc, with the former being 8 MB and the latter 16 MB
    • ZeroCopy tests for ORC files are disabled until ORC-701

    How was this patch tested?

    Internal tests + q files

    opened by pgaref
  • HIVE-23998: Upgrade guava to 27 for Hive branch-2

    What changes were proposed in this pull request?

    This PR proposes to upgrade Guava to 27 in Hive branch-2. This is basically used to trigger tests for #1394.

    Why are the changes needed?

    While trying to upgrade Guava in Spark, we found the following error. A Guava method became package-private in Guava 20, so there is an incompatibility with Guava versions > 19.0.

    sbt.ForkMain$ForkError: sbt.ForkMain$ForkError: java.lang.IllegalAccessError: tried to access method com.google.common.collect.Iterators.emptyIterator()Lcom/google/common/collect/UnmodifiableIterator; from class org.apache.hadoop.hive.ql.exec.FetchOperator
    	at org.apache.hadoop.hive.ql.exec.FetchOperator.<init>(FetchOperator.java:108)
    	at org.apache.hadoop.hive.ql.exec.FetchTask.initialize(FetchTask.java:87)
    	at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:541)
    	at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1317)
    	at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1457)
    	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
    	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
    

    Does this PR introduce any user-facing change?

    Yes. This upgrades Guava to 27.

    How was this patch tested?

    Built Hive locally.

    opened by viirya
  • HIVE-25522: NullPointerException in TxnHandler

    What changes were proposed in this pull request?

    • This fixes https://issues.apache.org/jira/browse/HIVE-25522
    • There are two options: either make the initialization static and kill HMS if there is an error, or keep it lazy. I went with the second approach, as there seem to be DB connections that are taken and don't need to be if nobody uses any TxnHandler methods.
    • Make the initialization in setConf idempotent by checking each of the static variables it sets, so that it can resume if a particular variable is not set (a sketch of the pattern follows this list).
    • Also some refactoring to push verbose catch blocks down as much as possible.
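
    A hedged sketch of that idempotent, resumable initialization pattern; all names here are illustrative stand-ins, not TxnHandler's actual fields:

    // Each static field is checked before being set, so a failed or partial
    // earlier setConf() call can be resumed instead of leaving the handler
    // permanently broken. Names are illustrative, not Hive's.
    public class LazyHandler {
      private static volatile Object connPool;  // stand-in for handler state
      private static volatile Object mutexAPI;

      public void setConf(Object conf) {
        synchronized (LazyHandler.class) {
          if (connPool == null) {
            connPool = createConnPool(conf);    // only set when missing
          }
          if (mutexAPI == null) {
            mutexAPI = createMutexAPI(conf);    // resumes after partial failure
          }
        }
      }

      private static Object createConnPool(Object conf) { return new Object(); }
      private static Object createMutexAPI(Object conf) { return new Object(); }
    }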

    Why are the changes needed?

    • See https://issues.apache.org/jira/browse/HIVE-25522

    Does this PR introduce any user-facing change?

    No

    How was this patch tested?

    Running unit tests

    opened by szehon-ho
  • HIVE-23980: Shade Guava from hive-exec in Hive 2.3

    What changes were proposed in this pull request?

    This PR proposes to shade Guava from hive-exec in Hive 2.3 branch.

    Why are the changes needed?

    While trying to upgrade Guava in Spark, we found the following error. A Guava method became package-private in Guava 20, so there is an incompatibility with Guava versions > 19.0.

    sbt.ForkMain$ForkError: sbt.ForkMain$ForkError: java.lang.IllegalAccessError: tried to access method com.google.common.collect.Iterators.emptyIterator()Lcom/google/common/collect/UnmodifiableIterator; from class org.apache.hadoop.hive.ql.exec.FetchOperator
    	at org.apache.hadoop.hive.ql.exec.FetchOperator.<init>(FetchOperator.java:108)
    	at org.apache.hadoop.hive.ql.exec.FetchTask.initialize(FetchTask.java:87)
    	at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:541)
    	at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1317)
    	at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1457)
    	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
    	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
    

    This is a problem for downstream clients. The Hive project noticed this problem too in HIVE-22126; however, that fix only targets 4.0.0. It would be nicer if we could also shade Guava from current Hive versions, e.g. the Hive 2.3 line.

    Does this PR introduce any user-facing change?

    Yes. Guava will be shaded from hive-exec.

    How was this patch tested?

    Built Hive locally and checked jar content.

    opened by viirya
  • [HIVE-13482][UDF] Explicitly define str_to_map args as regex

    Successor to https://github.com/apache/spark/pull/23888

    See discussion there for some more details about the Hive side of this, in particular my comment here about existing StackOverflow answers and here:

    My conclusion is that it's eminently ambiguous whether the intended behavior in either Hive or SparkSQL is to treat the delimiters as regular expressions.

    BUT the behavior has been around for 8 years, and, at least going off of the SO answers, it seems to be accepted as "known" behavior, so things will probably break if we change it.

    Thus, this PR intends to solidify the interpretation of delimiter1 and delimiter2 as regular expressions once and for all.

    If the non-regexp behavior is strongly desired, eventually there could be a fixed: bool argument that behaves like the identically-named argument in R regular expression functions like gsub and strsplit...
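
    To see why the regex interpretation matters, here is a standalone Java analogy (not Hive code): String.split also treats its delimiter as a regular expression, so metacharacters behave surprisingly unless escaped:

    // Demonstrates the regex-delimiter pitfall with String.split, which,
    // like the str_to_map behavior described above, treats the delimiter
    // argument as a regular expression.
    import java.util.Arrays;

    public class RegexDelimiterDemo {
      public static void main(String[] args) {
        String row = "a=1|b=2";
        // "|" is regex alternation, so the unescaped split matches the empty
        // string and splits between every character (on Java 8+):
        System.out.println(Arrays.toString(row.split("|")));
        // -> [a, =, 1, |, b, =, 2]
        // Escaping the delimiter restores the intended behavior:
        System.out.println(Arrays.toString(row.split("\\|")));
        // -> [a=1, b=2]
      }
    }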

    opened by MichaelChirico
  • HIVE-26254: upgrade calcite to 1.26.0 due to CVE

    What changes were proposed in this pull request?

    Upgrade calcite version due to CVE - https://issues.apache.org/jira/browse/HIVE-26254

    Why are the changes needed?

    Does this PR introduce any user-facing change?

    How was this patch tested?

    opened by pjfanning
  • HIVE-24484: Upgrade Hadoop to 3.3.1

    What changes were proposed in this pull request?

    Why are the changes needed?

    Does this PR introduce any user-facing change?

    How was this patch tested?

    opened by belugabehr
  • HIVE-26892: Backport HIVE-25243 to 3.2.0: Handle nested values in null struct.

    What changes were proposed in this pull request?

    HIVE-26892: Backport HIVE-25243 to 3.2.0: Handle nested values in null struct.

    This differs from the original patch as follows:

    • There are no changes in TestJdbcWithMiniLlapVectorArrow. This file does not exist in branch-3. The original patch was only a whitespace change though.
    • I have not included the new test suite TestMiniLlapVectorArrowWithLlapIODisabled. This test suite depends on an additional feature that is only present in master: HIVE-20300 / https://github.com/apache/hive/commit/a8ef2147fad5aeaaf01279230da9c584db6a2337. This is a fairly large patch that I have not chosen to try backporting at this time.

    Why are the changes needed?

    On branch-3, we've seen a failure in TestArrowColumnarBatchSerDe while trying to serialize a row of null values. It fails while trying to serialize the fields of a null struct. This was fixed in 4.0 by HIVE-25243. This issue tracks a backport to branch-3.

    Does this PR introduce any user-facing change?

    Null structs are now serialized to Arrow format correctly without error.

    How was this patch tested?

    Unfortunately, we are left with no specific new tests as part of this backport. However, I applied this patch locally in combination with HIVE-26840 / #3859. After that, all tests in TestArrowColumnarBatchSerDe are passing.

    opened by cnauroth
  • HIVE-26891: Fix TestArrowColumnarBatchSerDe test failures in branch-3

    What changes were proposed in this pull request?

    Because of the Jackson upgrade to 2.12.0, there are unit test failures in TestArrowColumnarBatchSerDe. We cannot directly upgrade the Arrow version on its own, as that was giving compilation errors. These commits are interdependent, hence all 6 commits are required.

    Why are the changes needed?

    There are two ways to fix these tests:

    1. Downgrade the Jackson version to 2.10.0.
    2. Upgrade the Arrow version from 0.8.0 to 0.10.0. We are going ahead with this option.

    Does this PR introduce any user-facing change?

    No

    How was this patch tested?

    On my local machine

    opened by Aggarwal-Raghav
  • HIVE-26889 - Implement array_join udf to concatenate the elements of an array with a specified delimiter

    What changes were proposed in this pull request?

    Implement array_join function in Hive

    Why are the changes needed?

    This enhancement is already implemented in Spark

    Does this PR introduce any user-facing change?

    How was this patch tested?

    Created Junit tests as well as qtests as part of this change

    opened by tarak271
  • HIVE-26890 : Disable TestSSL (Done as part of HIVE-21456 in oss/master)

    This was done as part of HIVE-21456 on oss/master. This test also fails with the same error in the Hive 3.1.3 release, so we can safely ignore this test on branch-3. Please find the JIRA link: https://issues.apache.org/jira/browse/HIVE-26890

    opened by amanraj2520
  • HIVE-26887 Make sure dirPath has the correct permissions

    HIVE-26887: In the constructor of class QueryResultsCache there is the following code segment:

      private QueryResultsCache(HiveConf configuration) throws IOException {
        ......
        FileSystem fs = cacheDirPath.getFileSystem(conf);
        FsPermission fsPermission = new FsPermission("700");
        fs.mkdirs(cacheDirPath, fsPermission);
        ......
      }
    

    It can be seen that the function uses mkdirs to create cacheDirPath, passing in the path variable cacheDirPath and the permission 700. But we have not confirmed whether that permission is actually applied to the directory.

    The above question arises because Hadoop has two mkdirs functions, mkdirs(Path f, FsPermission permission) and mkdirs(FileSystem fs, Path dir, FsPermission permission), and the first one is used here. The permission set by this function is affected by the underlying umask. Although 700 will hardly be affected by umask, from a rigorous point of view we should add a permission check and, if needed, an explicit permission grant here. So I check whether the permission was correctly granted by computing the effect of umask on it, and if not, grant the correct permission through setPermission.

        if (!permission.equals(permission.applyUMask(FsPermission.getUMask(conf)))) {
            fs.setPermission(dir, permission);
        }
    

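    A self-contained sketch of this create-then-verify pattern, assuming the standard Hadoop FileSystem API; the helper name mkdirsWithPermission is hypothetical, not Hive code:

    // Create a directory, then verify the requested permission survives the
    // process umask, falling back to an explicit setPermission if not.
    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsPermission;

    public final class DirUtil {
      public static void mkdirsWithPermission(FileSystem fs, Path dir,
          FsPermission permission, Configuration conf) throws IOException {
        if (!fs.mkdirs(dir, permission)) {
          throw new IOException("Cannot make directory: " + dir);
        }
        // mkdirs(Path, FsPermission) applies the umask; if the masked
        // permission differs from the requested one, grant it explicitly.
        if (!permission.equals(permission.applyUMask(FsPermission.getUMask(conf)))) {
          fs.setPermission(dir, permission);
        }
      }
    }
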
    I found the same issue in three other methods. In class Context

    private Path getScratchDir(String scheme, String authority,
          boolean mkdir, String scratchDir) {
              ......
              FileSystem fs = dirPath.getFileSystem(conf);
              dirPath = new Path(fs.makeQualified(dirPath).toString());
              FsPermission fsPermission = new FsPermission(scratchDirPermission);
    
              if (!fs.mkdirs(dirPath, fsPermission)) {
                throw new RuntimeException("Cannot make directory: "
                    + dirPath.toString());
              ......
      }
    

    In class SessionState

      static void createPath(HiveConf conf, Path path, String permission, boolean isLocal,
          boolean isCleanUp) throws IOException {
        FsPermission fsPermission = new FsPermission(permission);
        FileSystem fs;
        ......
        if (!fs.mkdirs(path, fsPermission)) {
          throw new IOException("Failed to create directory " + path + " on fs " + fs.getUri());
        }
        ......
      }
    

    and in class TezSessionState

    private Path createTezDir(String sessionId, String suffix) throws IOException {
        ......
        Path tezDir = new Path(hdfsScratchDir, TEZ_DIR);
        FileSystem fs = tezDir.getFileSystem(conf);
        FsPermission fsPermission = new FsPermission(HiveConf.getVar(conf, HiveConf.ConfVars.SCRATCHDIRPERMISSION));
        fs.mkdirs(tezDir, fsPermission);
        ......
      }
    
    opened by skysiders
  • HIVE-26774 - Implement array_slice UDF to get the subset of elements from an array (subarray)

    What changes were proposed in this pull request?

    Implement array_slice function in Hive

    Why are the changes needed?

    This enhancement is already implemented in Spark

    Does this PR introduce any user-facing change?

    No

    How was this patch tested?

    Created Junit tests as well as qtests as part of this change

    opened by tarak271