Apache Hive

Overview

Apache Hive (TM)

The Apache Hive (TM) data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Built on top of Apache Hadoop (TM), it provides:

  • Tools to enable easy access to data via SQL, thus enabling data warehousing tasks such as extract/transform/load (ETL), reporting, and data analysis

  • A mechanism to impose structure on a variety of data formats (see the external-table sketch after this list)

  • Access to files stored either directly in Apache HDFS (TM) or in other data storage systems such as Apache HBase (TM)

  • Query execution using Apache Hadoop MapReduce, Apache Tez or Apache Spark frameworks.
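
For instance, imposing structure on delimited files that already live in HDFS is typically done with an external table. A minimal sketch, using a hypothetical tab-separated log directory (the table name, columns, and path are illustrative, not taken from this document):

    -- Point a table definition at existing files without moving them
    CREATE EXTERNAL TABLE web_logs (
      ip  STRING,
      ts  TIMESTAMP,
      url STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE
    LOCATION '/data/web_logs';  -- hypothetical HDFS directory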

Hive provides standard SQL functionality, including many of the later SQL:2003 and SQL:2011 features for analytics. These include OLAP window functions, subqueries, common table expressions, and more. Hive's SQL can also be extended with user code via user-defined functions (UDFs), user-defined aggregates (UDAFs), and user-defined table functions (UDTFs).
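
As an illustration of those analytics features, a query combining a common table expression with an OLAP window function might look like the following sketch (it reuses the hypothetical web_logs table from above):

    -- Rank each URL by hits within each day
    WITH daily AS (
      SELECT url, to_date(ts) AS day, count(*) AS hits
      FROM web_logs
      GROUP BY url, to_date(ts)
    )
    SELECT url, day, hits,
           rank() OVER (PARTITION BY day ORDER BY hits DESC) AS day_rank
    FROM daily;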

Hive users have a choice of three runtimes when executing SQL queries. Users can choose between the Apache Hadoop MapReduce, Apache Tez, or Apache Spark frameworks as their execution backend. MapReduce is a mature framework that is proven at large scales. However, MapReduce is a purely batch framework, and queries using it may experience higher latencies (tens of seconds), even over small datasets. Apache Tez is designed for interactive query and has substantially reduced overheads versus MapReduce. Apache Spark is a cluster computing framework built outside of MapReduce but on top of HDFS, around a composable, transformable distributed collection of items called a Resilient Distributed Dataset (RDD), which allows processing and analysis without the intermediate stages that MapReduce introduces.

Users are free to switch back and forth between these frameworks at any time. In each case, Hive is best suited for use cases where the amount of data processed is large enough to require a distributed system.
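
Switching engines is a per-session setting. Assuming the chosen engine is installed and configured on the cluster, selecting a backend is roughly this sketch (available values depend on the Hive release, and MapReduce is deprecated in newer releases):

    SET hive.execution.engine=tez;   -- accepted values: mr, tez, spark
    SELECT count(*) FROM web_logs;   -- this query now runs on Tez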

Hive is not designed for online transaction processing. It is best used for traditional data warehousing tasks. Hive is designed to maximize scalability (scale out with more machines added dynamically to the Hadoop cluster), performance, extensibility, fault-tolerance, and loose coupling with its input formats.

General Info

For the latest information about Hive, please visit our website at:

http://hive.apache.org/

Getting Started

Requirements

Java

Hive Version    Java Version
Hive 1.0        Java 6
Hive 1.1        Java 6
Hive 1.2        Java 7
Hive 2.x        Java 7
Hive 3.x        Java 8
Hive 4.x        Java 8

Hadoop

  • Hadoop 1.x, 2.x
  • Hadoop 3.x (Hive 3.x)

Upgrading from older versions of Hive

  • Hive includes changes to the MetaStore schema. If you are upgrading from an earlier version of Hive it is imperative that you upgrade the MetaStore schema by running the appropriate schema upgrade scripts located in the scripts/metastore/upgrade directory.

  • We have provided upgrade scripts for MySQL, PostgreSQL, Oracle, Microsoft SQL Server, and Derby databases. If you are using a different database for your MetaStore you will need to provide your own upgrade script.
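
Hive also ships a schematool utility that wraps these scripts. A typical upgrade invocation is sketched below; treat the exact flags as indicative, since they can vary between releases, and the -dbType value must match your MetaStore database:

    $ schematool -dbType mysql -upgradeSchema
    $ schematool -dbType mysql -info   # verify the resulting schema version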

Useful mailing lists

  1. [email protected] - To discuss and ask usage questions. Send an empty email to [email protected] in order to subscribe to this mailing list.

  2. [email protected] - For discussions about code, design and features. Send an empty email to [email protected] in order to subscribe to this mailing list.

  3. [email protected] - To monitor commits to the source repository. Send an empty email to [email protected] in order to subscribe to this mailing list.

Comments
  •  HIVE-21737: Upgrade Avro to version 1.10.1

    Co-Authored-By: Ismaël Mejía [email protected]

    Opened by iemejia, 28 comments (labels: tests unstable, stale)
  • HIVE-24035: Add Jenkinsfile for branch-2.3

    What changes were proposed in this pull request?

    Enable precommit tests for github PR against branch-2.3.

    Why are the changes needed?

    Adding a new Jenkinsfile for the repo. This is almost the same file used in the master branch, except that the timeout is changed from 6 to 12 hours.

    Does this PR introduce any user-facing change?

    No

    How was this patch tested?

    N/A

    Opened by sunchao, 25 comments (labels: tests unstable)
  • HIVE-24316. Upgrade ORC from 1.5.6 to 1.5.8 in branch-3.1

    What changes were proposed in this pull request?

    This PR aims to upgrade Apache ORC from 1.5.6 to 1.5.8.

    Why are the changes needed?

    This will bring eleven bug fixes.

    • ORC 1.5.7: https://issues.apache.org/jira/projects/ORC/versions/12345702
    • ORC 1.5.8: https://issues.apache.org/jira/projects/ORC/versions/12346462

    Does this PR introduce any user-facing change?

    No.

    How was this patch tested?

    Pass the CI with the existing test cases.

    Opened by dongjoon-hyun, 23 comments (labels: tests unstable)
  • HIVE-23998: Upgrade guava to 27 for Hive 2.3 branch

    What changes were proposed in this pull request?

    This PR proposes to upgrade Guava to 27 in Hive 2.3 branch.

    Why are the changes needed?

    While trying to upgrade Guava in Spark, we found the following error. A Guava method became package-private as of Guava version 20, so there is an incompatibility with Guava versions newer than 19.0.

    sbt.ForkMain$ForkError: sbt.ForkMain$ForkError: java.lang.IllegalAccessError: tried to access method com.google.common.collect.Iterators.emptyIterator()Lcom/google/common/collect/UnmodifiableIterator; from class org.apache.hadoop.hive.ql.exec.FetchOperator
    	at org.apache.hadoop.hive.ql.exec.FetchOperator.<init>(FetchOperator.java:108)
    	at org.apache.hadoop.hive.ql.exec.FetchTask.initialize(FetchTask.java:87)
    	at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:541)
    	at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1317)
    	at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1457)
    	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
    	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
    

    Does this PR introduce any user-facing change?

    Yes. This upgrades Guava to 27.

    How was this patch tested?

    Built Hive locally.

    Opened by viirya, 18 comments
  • HIVE-23553: Upgrade ORC version to 1.6.7

    What changes were proposed in this pull request?

    Bump the Apache ORC version to the latest 1.6 release (1.6.7).

    Why are the changes needed?

    So Apache Hive can take advantage of the latest features and bug fixes.

    Does this PR introduce any user-facing change?

    • Integer-to-Timestamp conversion uses seconds, while ORC 1.5 used milliseconds. This results in different expected output in queries like schema_evol_orc_nonvec_part_all_primitive.q.

    Non user-facing changes:

    • CacheWritter bufferSize is now decoupled from llap.max.alloc, with the former being 8 MB and the latter 16 MB
    • ZeroCopy Tests for ORC files disabled until ORC-701

    How was this patch tested?

    Internal tests + q files

    Opened by pgaref, 16 comments (labels: tests passed)
  • HIVE-23998: Upgrade guava to 27 for Hive branch-2

    What changes were proposed in this pull request?

    This PR proposes to upgrade Guava to 27 in Hive branch-2. It is basically being used to trigger tests for #1394.

    Why are the changes needed?

    While trying to upgrade Guava in Spark, we found the following error. A Guava method became package-private as of Guava version 20, so there is an incompatibility with Guava versions newer than 19.0.

    sbt.ForkMain$ForkError: sbt.ForkMain$ForkError: java.lang.IllegalAccessError: tried to access method com.google.common.collect.Iterators.emptyIterator()Lcom/google/common/collect/UnmodifiableIterator; from class org.apache.hadoop.hive.ql.exec.FetchOperator
    	at org.apache.hadoop.hive.ql.exec.FetchOperator.<init>(FetchOperator.java:108)
    	at org.apache.hadoop.hive.ql.exec.FetchTask.initialize(FetchTask.java:87)
    	at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:541)
    	at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1317)
    	at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1457)
    	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
    	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
    

    Does this PR introduce any user-facing change?

    Yes. This upgrades Guava to 27.

    How was this patch tested?

    Built Hive locally.

    Opened by viirya, 16 comments (labels: tests failed, stale)
  • HIVE-25522: NullPointerException in TxnHandler

    What changes were proposed in this pull request?

    • This fixes https://issues.apache.org/jira/browse/HIVE-25522
    • There are two options: either make the initialization static and kill HMS if there is an error, or keep it lazy. We went with the second approach, since database connections would otherwise be opened even when nobody uses any TxnHandler methods.
    • Make the setConf initialization method idempotent by checking each of the static variables it sets, so that it can resume if a particular variable is not set.
    • Also some refactoring to push verbose catch blocks down as far as possible.

    Why are the changes needed?

    • See https://issues.apache.org/jira/browse/HIVE-25522

    Does this PR introduce any user-facing change?

    No

    How was this patch tested?

    Running unit tests

    Opened by szehon-ho, 15 comments (labels: tests failed)
  • HIVE-23980: Shade Guava from hive-exec in Hive 2.3

    What changes were proposed in this pull request?

    This PR proposes to shade Guava from hive-exec in Hive 2.3 branch.

    Why are the changes needed?

    While trying to upgrade Guava in Spark, we found the following error. A Guava method became package-private as of Guava version 20, so there is an incompatibility with Guava versions newer than 19.0.

    sbt.ForkMain$ForkError: sbt.ForkMain$ForkError: java.lang.IllegalAccessError: tried to access method com.google.common.collect.Iterators.emptyIterator()Lcom/google/common/collect/UnmodifiableIterator; from class org.apache.hadoop.hive.ql.exec.FetchOperator
    	at org.apache.hadoop.hive.ql.exec.FetchOperator.<init>(FetchOperator.java:108)
    	at org.apache.hadoop.hive.ql.exec.FetchTask.initialize(FetchTask.java:87)
    	at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:541)
    	at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1317)
    	at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1457)
    	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
    	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
    

    This is a problem for downstream clients. The Hive project noticed the problem too in HIVE-22126; however, that fix only targets 4.0.0. It would be nicer if we could also shade Guava in current Hive versions, e.g. the Hive 2.3 line.

    Does this PR introduce any user-facing change?

    Yes. Guava will be shaded from hive-exec.

    How was this patch tested?

    Built Hive locally and checked jar content.

    Opened by viirya, 15 comments (labels: tests failed)
  • [HIVE-13482][UDF] Explicitly define str_to_map args as regex

    Successor to https://github.com/apache/spark/pull/23888

    See the discussion there for more details about the Hive side of this, in particular my comments about the existing StackOverflow answers:

    My conclusion is that it's eminently ambiguous whether the intended behavior in either Hive or SparkSQL is to treat the delimiters as regular expressions.

    BUT the behavior has been around for 8 years, and, at least going off of the SO answers, it seems to be accepted as "known" behavior, so things will probably break if we change it.

    Thus, this PR intends to solidify the interpretation of delimiter1 and delimiter2 as regular expressions once and for all.

    If the non-regexp behavior is strongly desired, eventually there could be a fixed: bool argument that behaves like the identically-named argument in R regular expression functions like gsub and strsplit...
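
    As a sketch of the regex interpretation this PR pins down (the outputs shown as comments are what the regex reading implies; exact map formatting varies by client):

        SELECT str_to_map('a=1,b=2', ',', '=');
        -- {"a":"1","b":"2"}
        -- Because delimiter1 is treated as a regular expression, a '.'
        -- separator must be escaped, or it splits on every character:
        SELECT str_to_map('a=1.b=2', '\\.', '=');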

    Opened by MichaelChirico, 15 comments (labels: tests unstable)
  • HIVE-26254: upgrade calcite to 1.26.0 due to CVE

    What changes were proposed in this pull request?

    Upgrade the Calcite version due to a CVE - https://issues.apache.org/jira/browse/HIVE-26254

    Opened by pjfanning, 14 comments (labels: tests failed, stale)
  • HIVE-24484: Upgrade Hadoop to 3.3.1

    Opened by belugabehr, 14 comments (labels: tests failed, stale)
  • Bump jettison from 1.5.1 to 1.5.2

    Bumps jettison from 1.5.1 to 1.5.2.

    Release notes

    Sourced from jettison's releases.

    Jettison 1.5.2

    What's Changed

    Full Changelog: https://github.com/jettison-json/jettison/compare/jettison-1.5.1...jettison-1.5.2

    Commits
    • 6dc73a0 [maven-release-plugin] prepare release jettison-1.5.2
    • 19ae19f Fixing StackOverflow error
    • 325b51b Bump woodstox-core from 6.2.8 to 6.4.0
    • 81d3786 [maven-release-plugin] prepare for next development iteration

    Opened by dependabot[bot], 1 comment (labels: tests passed, dependencies)
  • HIVE-26899 : Upgrade arrow to 0.11.0 in branch-3

    Please refer to this JIRA: https://issues.apache.org/jira/browse/HIVE-26899

    Opened by amanraj2520, 2 comments (labels: tests unstable)
  • HIVE-26896 : Test fixes for lineage3.q and load_static_ptn_into_bucketed_table.q

    These tests were fixed in branch-3.1 as part of the Hive 3.1.3 release; this backports the same fixes to branch-3.

    Opened by amanraj2520, 0 comments (labels: tests unstable)
  • HIVE-26860: Entries in HMS tables are not deleted upon drop partition table with skewed columns

    What changes were proposed in this pull request?

    1. Fixed a cast exception in the dropStorageDescriptors method while iterating over the result fetched for SKEWED_VALUES.STRING_LIST_ID_EID.
    2. Fixed a cast exception in the dropDanglingColumnDescriptors method while iterating over the result fetched for SDS.CD_ID (this exception is observed with the Oracle DB).

    Why are the changes needed?

    When a partitioned table with skewed columns is dropped, the corresponding entries in the HMS tables, i.e. SDS, SERDES, SERDE_PARAMS, SKEWED_COL_NAMES, SKEWED_COL_VALUE_LOC_MAP, and SKEWED_VALUES, are not deleted. Issue link: HIVE-26860. These entries would otherwise remain in the backing datastore forever.

    Does this PR introduce any user-facing change?

    No

    How was this patch tested?

    Manually verified from the database client UI for the entries to be deleted upon drop table.

    Opened by VenuReddy2103, 1 comment (labels: tests passed)
  • HIVE-26892: Backport HIVE-25243 to 3.2.0: Handle nested values in null struct.

    What changes were proposed in this pull request?

    This differs from the original patch as follows:

    • There are no changes in TestJdbcWithMiniLlapVectorArrow. This file does not exist in branch-3. The original patch was only a whitespace change though.
    • I have not included the new test suite TestMiniLlapVectorArrowWithLlapIODisabled. This test suite depends on an additional feature that is only present in master: HIVE-20300 / https://github.com/apache/hive/commit/a8ef2147fad5aeaaf01279230da9c584db6a2337. This is a fairly large patch that I have not chosen to try backporting at this time.

    Why are the changes needed?

    On branch-3, we've seen a failure in TestArrowColumnarBatchSerDe while trying to serialize a row of null values. It fails while trying to serialize the fields of a null struct. This was fixed in 4.0 by HIVE-25243. This issue tracks a backport to branch-3.

    Does this PR introduce any user-facing change?

    Null structs are now serialized to Arrow format correctly without error.

    How was this patch tested?

    Unfortunately, we are left with no specific new tests as part of this backport. However, I applied this patch locally in combination with HIVE-26840 / #3859. After that, all tests in TestArrowColumnarBatchSerDe are passing.

    Opened by cnauroth, 0 comments (labels: tests unstable)