The official home of the Presto distributed SQL query engine for big data

Overview


Presto is a distributed SQL query engine for big data.

See the User Manual for deployment instructions and end user documentation.

Requirements

  • Mac OS X or Linux
  • Java 8 Update 151 or higher (8u151+), 64-bit. Both Oracle JDK and OpenJDK are supported.
  • Maven 3.3.9+ (for building)
  • Python 2.4+ (for running with the launcher script)

Building Presto

Presto is a standard Maven project. Simply run the following command from the project root directory:

./mvnw clean install

On the first build, Maven will download all the dependencies from the internet and cache them in the local repository (~/.m2/repository), which can take a considerable amount of time. Subsequent builds will be faster.
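
Once the dependencies are cached, you can also build without network access by adding Maven's offline flag (an optional convenience, not a required step):

./mvnw -o clean install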

Presto has a comprehensive set of unit tests that can take several minutes to run. You can disable the tests when building:

./mvnw clean install -DskipTests

Running Presto in your IDE

Overview

After building Presto for the first time, you can load the project into your IDE and run the server. We recommend using IntelliJ IDEA. Because Presto is a standard Maven project, you can import it into your IDE using the root pom.xml file. In IntelliJ, choose Open Project from the Quick Start box or choose Open from the File menu and select the root pom.xml file.

After opening the project in IntelliJ, double check that the Java SDK is properly configured for the project:

  • Open the File menu and select Project Structure
  • In the SDKs section, ensure that a 1.8 JDK is selected (create one if none exist)
  • In the Project section, ensure the Project language level is set to 8.0 as Presto makes use of several Java 8 language features

Presto comes with sample configuration that should work out-of-the-box for development. Use the following options to create a run configuration:

  • Main Class: com.facebook.presto.server.PrestoServer
  • VM Options: -ea -XX:+UseG1GC -XX:G1HeapRegionSize=32M -XX:+UseGCOverheadLimit -XX:+ExplicitGCInvokesConcurrent -Xmx2G -Dconfig=etc/config.properties -Dlog.levels-file=etc/log.properties
  • Working directory: $MODULE_DIR$
  • Use classpath of module: presto-main

The working directory should be the presto-main subdirectory. In IntelliJ, using $MODULE_DIR$ accomplishes this automatically.

Additionally, the Hive plugin must be configured with the location of your Hive metastore Thrift service. Add the following to the list of VM options, replacing localhost:9083 with the correct host and port (or use the value below if you do not have a Hive metastore):

-Dhive.metastore.uri=thrift://localhost:9083

Using SOCKS for Hive or HDFS

If your Hive metastore or HDFS cluster is not directly accessible to your local machine, you can use SSH port forwarding to access it. Set up a dynamic SOCKS proxy with SSH listening on local port 1080:

ssh -v -N -D 1080 server

Then add the following to the list of VM options:

-Dhive.metastore.thrift.client.socks-proxy=localhost:1080

Running the CLI

Start the CLI to connect to the server and run SQL queries:

presto-cli/target/presto-cli-*-executable.jar
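
The CLI also accepts connection options; for example, to point it at a local development server with the hive catalog as the default (the flag values below are illustrative and match the sample development configuration):

presto-cli/target/presto-cli-*-executable.jar --server localhost:8080 --catalog hive --schema default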

Run a query to see the nodes in the cluster:

SELECT * FROM system.runtime.nodes;

In the sample configuration, the Hive connector is mounted in the hive catalog, so you can run the following queries to show the tables in the Hive database default:

SHOW TABLES FROM hive.default;
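
Once you know a table name, you can query it with its fully qualified name; for example (the table name here is hypothetical):

SELECT * FROM hive.default.orders LIMIT 10;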

Code Style

We recommend you use IntelliJ as your IDE. The code style template for the project can be found in the codestyle repository along with our general programming and Java guidelines. In addition to those you should also adhere to the following:

  • Alphabetize sections in the documentation source files (both in table of contents files and other regular documentation files). In general, alphabetize methods/variables/sections if such ordering already exists in the surrounding code.
  • When appropriate, use the Java 8 stream API. However, note that the stream implementation does not perform well, so avoid using it in inner loops or otherwise performance-sensitive sections.
  • Categorize errors when throwing exceptions. For example, PrestoException takes an error code as an argument: PrestoException(HIVE_TOO_MANY_OPEN_PARTITIONS). This categorization lets you generate reports so you can monitor the frequency of various failures (see the sketch after this list).
  • Ensure that all files have the appropriate license header; you can generate the license by running mvn license:format.
  • Consider using String formatting (printf style formatting using the Java Formatter class): format("Session property %s is invalid: %s", name, value) (note that format() should always be statically imported). Sometimes, if you only need to append something, consider using the + operator.
  • Avoid using the ternary operator except for trivial expressions.
  • Use an assertion from Airlift's Assertions class if there is one that covers your case rather than writing the assertion by hand. Over time we may move over to more fluent assertions like AssertJ.
  • When writing a Git commit message, follow these guidelines.
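
The sketch below illustrates the error categorization and format() conventions above; the variables and the exact import paths are approximate and shown for illustration only:

import com.facebook.presto.spi.PrestoException;

import static com.facebook.presto.hive.HiveErrorCode.HIVE_TOO_MANY_OPEN_PARTITIONS;
import static java.lang.String.format;

// Illustrative only: throw a categorized exception with a formatted message
if (openPartitions > maxOpenPartitions) {
    throw new PrestoException(HIVE_TOO_MANY_OPEN_PARTITIONS,
            format("Too many open partitions: %s (max %s)", openPartitions, maxOpenPartitions));
}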

Building the Web UI

The Presto Web UI is composed of several React components and is written in JSX and ES6. This source code is compiled and packaged into browser-compatible JavaScript, which is then checked in to the Presto source code (in the dist folder). You must have Node.js and Yarn installed to execute these commands. To update this folder after making changes, simply run:

yarn --cwd presto-main/src/main/resources/webapp/src install

If no JavaScript dependencies have changed (i.e., no changes to package.json), it is faster to run:

yarn --cwd presto-main/src/main/resources/webapp/src run package

To simplify iteration, you can also run in watch mode, which automatically re-compiles when changes to source files are detected:

yarn --cwd presto-main/src/main/resources/webapp/src run watch

To iterate quickly, simply re-build the project in IntelliJ after packaging is complete. Project resources will be hot-reloaded and changes are reflected on browser refresh.

Release Notes

When authoring a pull request, the PR description should include its relevant release notes. Follow Release Notes Guidelines when authoring release notes.
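
For reference, the release notes section of a PR description uses the block format seen in the PRs below, for example:

== RELEASE NOTES ==

Hive Changes
* Add Amazon S3 Select pushdown for JSON files.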

Comments
  • Fix optimized parquet reader complex hive types processing

    Fix optimized parquet reader complex hive types processing

    • Fix reading repeated fields when the Parquet file consists of multiple pages, so the beginning of a field can be on one page and its end on the next page.

    • Support reading empty arrays

    • Determine null values of optional fields

    • Add tests for hive complex types: arrays, maps and structs

    • Rewrite tests to read Parquet files consisting of multiple pages

    • Add TestDataWritableWriter with a patch for empty arrays and empty maps, because the bug https://issues.apache.org/jira/browse/HIVE-13632 is already fixed in the current Hive version, so Presto should be able to read empty arrays too

    CLA Signed 
    opened by kgalieva 77
  • Add support for prepared statements in JDBC driver

    Add support for prepared statements in JDBC driver

    I'm using presto-jdbc-0.66-SNAPSHOT.jar and trying to execute a Presto query against presto-server from my Java application.

    The sample code below, using a JDBC Statement, works well.

        Class.forName("com.facebook.presto.jdbc.PrestoDriver");
        Connection connection = DriverManager.getConnection("jdbc:presto://192.168.33.33:8080/hive/default", "hive", "hive");
    
        Statement statement = connection.createStatement();
        ResultSet rs = statement.executeQuery("SHOW TABLES");
        while(rs.next()) {
            System.out.println(rs.getString(1));
        }
    

    However, using a JDBC PreparedStatement throws an exception. Does presto-jdbc not support PreparedStatement yet? Here's my test code and exception info.

    Test Code :

        Class.forName("com.facebook.presto.jdbc.PrestoDriver");
        Connection connection = DriverManager.getConnection("jdbc:presto://192.168.33.33:8080/hive/default", "hive", "hive");
    
        PreparedStatement ps = connection.prepareStatement("SHOW TABLES");
        ResultSet rs = ps.executeQuery();
        while(rs.next()) {
            System.out.println(rs.getString(1));
        }
    

    Exception Info :

        java.lang.UnsupportedOperationException: PreparedStatement
    at com.facebook.presto.jdbc.PrestoPreparedStatement.<init>(PrestoPreparedStatement.java:44)
    at com.facebook.presto.jdbc.PrestoConnection.prepareStatement(PrestoConnection.java:93)
    at com.nsouls.frescott.hive.mapper.PrestoConnectionTest.testShowTable(PrestoConnectionTest.java:37)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
    at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
    at org.springframework.test.context.junit4.statements.RunBeforeTestMethodCallbacks.evaluate(RunBeforeTestMethodCallbacks.java:74)
    at org.springframework.test.context.junit4.statements.RunAfterTestMethodCallbacks.evaluate(RunAfterTestMethodCallbacks.java:83)
    at org.springframework.test.context.junit4.statements.SpringRepeat.evaluate(SpringRepeat.java:72)
    at org.springframework.test.context.junit4.SpringJUnit4ClassRunner.runChild(SpringJUnit4ClassRunner.java:231)
    at org.springframework.test.context.junit4.SpringJUnit4ClassRunner.runChild(SpringJUnit4ClassRunner.java:88)
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
    at org.springframework.test.context.junit4.statements.RunBeforeTestClassCallbacks.evaluate(RunBeforeTestClassCallbacks.java:61)
    at org.springframework.test.context.junit4.statements.RunAfterTestClassCallbacks.evaluate(RunAfterTestClassCallbacks.java:71)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
    at org.springframework.test.context.junit4.SpringJUnit4ClassRunner.run(SpringJUnit4ClassRunner.java:174)
    at org.junit.runner.JUnitCore.run(JUnitCore.java:157)
    at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:74)
    at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:202)
    at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:65)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)
    
    opened by felika 46
  • Add support for query pushdown to S3 using S3 Select

    Add support for query pushdown to S3 using S3 Select

    This change will allow Presto users to improve the performance of their queries using S3SelectPushdown. It pushes down projections and predicate evaluations to S3. As a result Presto doesn't need to download full S3 objects and only data required to answer the user's query is returned to Presto, thereby improving performance.

    S3SelectPushdown Technical Document: S3SelectPushdown.pdf

    This PR is a continuation of https://github.com/prestodb/presto/pull/11033.

    CLA Signed 
    opened by same3r 42
  • Implement EXPLAIN ANALYZE

    Implement EXPLAIN ANALYZE

    This should work similarly to PostgreSQL (http://www.postgresql.org/docs/9.4/static/sql-explain.html), by executing the query, recording stats, and then rendering the stats along with the plan. A first pass at implementing this could probably be to render similarly to EXPLAIN (TYPE DISTRIBUTED) with the stage & operator stats inserted.
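
    For illustration, the requested usage would look something like this (the table name is hypothetical):

    EXPLAIN ANALYZE SELECT orderstatus, count(*) FROM orders GROUP BY orderstatus;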

    enhancement 
    opened by cberner 42
  • Performance Regressions in Presto 0.206?

    Performance Regressions in Presto 0.206?

    I was recently benchmarking Presto 0.206 vs 0.172. The tests are run on Parquet datasets stored on S3.

    We found that while Presto 0.206 was generally faster on smaller datasets, there were some significant performance regressions on larger datasets. The CPU time reported by EXPLAIN ANALYZE was lower in 0.206 than in 0.172, but the wall time was much longer.

    This possibly indicates either stragglers or some sort of scheduling bug that adversely affects parallelism. Note that the concurrency settings like task.concurrency are the same in both clusters.

    For instance, on the TPCH scale 1000 dataset, query #7 slowed down by a factor of 2x in wall time. The query was:

    SELECT supp_nation,
           cust_nation,
           l_year,
           sum(volume) AS revenue
    FROM
      (SELECT n1.n_name AS supp_nation,
              n2.n_name AS cust_nation,
              substr(l_shipdate, 1, 4) AS l_year,
              l_extendedprice * (1 - l_discount) AS volume
       FROM lineitem_parq,
            orders_parq,
            customer_parq,
            supplier_parq,
            nation_parq n1,
            nation_parq n2
       WHERE s_suppkey = l_suppkey
         AND o_orderkey = l_orderkey
         AND c_custkey = o_custkey
         AND s_nationkey = n1.n_nationkey
         AND c_nationkey = n2.n_nationkey
         AND ((n1.n_name = 'KENYA'
               AND n2.n_name = 'PERU')
              OR (n1.n_name = 'PERU'
                  AND n2.n_name = 'KENYA'))
         AND l_shipdate BETWEEN '1995-01-01' AND '1996-12-31' ) AS shipping
    GROUP BY supp_nation,
             cust_nation,
             l_year
    ORDER BY supp_nation,
             cust_nation,
             l_year;
    

    I compared the output of EXPLAIN ANALYZE from both versions of Presto and cannot find anything that could explain this. Here are some observations:

    • The CPU time reported by each stage was usually lower in 0.206. This probably rules out operator performance regressions.
    • Some of the leaf stages were using ScanProject in 0.172, but they use ScanFilterProject in 0.205. This actually reduces the output rows and leads to drastically lower CPU usage in upper stages of the query tree. This is a big improvement and should have led to faster query processing.

    References

    • Explain analyze from 0.206 - https://gist.github.com/anoopj/40eea820c1c310dff72139d495ac98b0
    • Explain analyze from 0.172 - https://gist.github.com/anoopj/01985fe0ad298dad4c22b1444e1f1e21
    opened by anoopj 39
  • [native] PrestoCpp build from source pipeline

    [native] PrestoCpp build from source pipeline

    A fully automated build-from-source process proposal for presto-native-execution (PrestoCpp and Velox). A README file is added for clarification. We appreciate any and all feedback.

    Prestissimo - Dockerfile build

    💡 PrestoDB repository: Presto - https://github.com/prestodb/presto

    💡 Velox repository: Velox - https://github.com/facebookincubator/velox

    Practical Velox implementation using PrestoCpp

    📝 Note: This README and the build process were adapted from an internal pipeline. You can e-mail the author at [email protected] if you have questions.

    Prestissimo, found in the PrestoDB GitHub repository as 'presto-native-execution', is an effort to make PrestoDB even better using the Velox library as a starting point. Both PrestoCpp and Velox are mainly written in low-level C and C++17, which makes the build-from-scratch process enormously complicated. To simplify this process, the Intel Cloud Native Data Services Team is introducing a 3-stage, fully automated Docker build process based on the unmodified project GitHub repository.

    Quick Start

    1. Clone this repository

    git clone https://github.com/prestodb/presto prestodb
    

    2. (Optional) Define and export Docker registry, image name and image tag variables

    📝 Note: Remember to end your IMAGE_REGISTRY with a / as this is required for full tag generation.

    💡 Tip: Depending on your configuration, you may need to run all of the commands below as the root user; to switch, type sudo su as your first command.

    💡 Tip: If IMAGE_REGISTRY is not specified, IMAGE_PUSH should be set to '0' or the Docker image push stage will fail.

    Type in your console, changing variable values to meet your needs:

    # defaults to 'avx', more info on Velox GitHub
    export CPU_TARGET="avx"
    # defaults to 'presto/prestissimo-${CPU_TARGET}-centos'
    export IMAGE_NAME='presto/prestissimo-${CPU_TARGET}-centos'
    # defaults to 'latest'
    export IMAGE_TAG='latest'
    # defaults to ''
    export IMAGE_REGISTRY='https://my_docker_registry.com/'
    # defaults to '0'
    export IMAGE_PUSH='0'
    

    3. Make sure Docker daemon is running

    (Ubuntu users) Type in your console:

    systemctl status docker
    

    4. Build Dockerfile repo

    Type in your console:

    cd prestodb/presto-native-execution
    make runtime-container
    

    The process is fully automated and requires no user interaction. Building the images for the first time can take up to a couple of hours (~1-2 hours using 10 processor cores).

    5. Run container

    📝 Note: Remember that you should start the Presto Java server first

    Depending on the values you have set, the container tag is defined as

    PRESTO_CPP_TAG="${IMAGE_REGISTRY}${IMAGE_NAME}:${IMAGE_TAG}"

    For the default values, this will be just:

    PRESTO_CPP_TAG=presto/prestissimo-avx-centos:latest

    To run a container built with the default tag, execute:

    docker run "presto/prestissimo-avx-centos:latest" \
                --use-env-params \
                --discovery-uri=http://localhost:8080 \
                --http-server-port=8080
    

    To run the container interactively, without executing the entrypoint file:

    docker run -it --entrypoint=/bin/bash "presto/prestissimo-avx-centos:latest"
    

    Container manual build

    For a manual build outside the Intel network, or without access to the Cloud Native Data Services Poland Docker registry, follow the steps below. In your terminal - in the same session in which you want to build the images - define and export environment variables:

    export CPU_TARGET="avx"
    export IMAGE_NAME='presto/prestissimo-${CPU_TARGET}-centos'
    export IMAGE_TAG='latest'
    export IMAGE_REGISTRY='some-registry.my-domain.com/'
    export IMAGE_PUSH='0'
    export PRESTODB_REPOSITORY=$(git config --get remote.origin.url)
    export PRESTODB_CHECKOUT=$(git show -s --format="%H" HEAD)
    

    Here IMAGE_NAME and IMAGE_TAG will be the Prestissimo release image name and tag, and IMAGE_REGISTRY will be the registry that the image will be tagged with and which will be used to download the images from previous stages if there are no cached images locally. CPU_TARGET will be unchanged in most cases; for more info read the Velox documentation. PRESTODB_REPOSITORY and PRESTODB_CHECKOUT will be used as the build repository and branch inside the container. You can set them manually or populate them using the git commands shown.

    Then, for example, to build the containers from behind a proxy server, change to the Dockerfile directory and type:

    cd presto-native-execution/scripts/release-centos-dockerfile
    docker build \
        --network=host \
        --build-arg http_proxy  \
        --build-arg https_proxy \
        --build-arg no_proxy    \
        --build-arg CPU_TARGET  \
        --build-arg PRESTODB_REPOSITORY \
        --build-arg PRESTODB_CHECKOUT \
        --tag "${IMAGE_REGISTRY}${IMAGE_NAME}:${IMAGE_TAG}" .
    

    Build process - more info - prestissimo (with artifacts ~35 GB, without ~10 GB)

    Most of the runtime and build-time dependencies are downloaded, configured and installed in this step. The result of this step is a starting point for both the second and third stages. This container will be built 'once per breaking change' in either repository, and can be used as a starting point for CI/CD-integrated systems. This step installs Maven, Java 8, Python3-Dev, libboost-dev and lots of other large frameworks, libraries and applications, and ensures that all steps from stage 2 will run with no errors.

    On top of the container from step 1, the repository is initialized, Velox and its submodules are updated, and adapters, connectors and side dependencies are built and configured. A full native PrestoDB repository build is done using the Meta Maven wrapper mvnw. After all of those partial steps, make and build are run for PrestoCpp and Velox with Parquet, ORC, and the Hive connector with Thrift, with the S3-EMRFS filesystem implementation (scheme s3://) and the Hadoop filesystem implementation.

    ### DIRECTORY AND MAIN BUILD ARTIFACTS
    ## Native Presto JAVA build artifacts:
    /root/.m2/
    
    ## Build, third party dependencies, mostly for adapters
    /opt/dependency/
    /opt/dependency/aws-sdk-cpp
    /opt/dependency/install/
    /opt/dependency/install/run/
    /opt/dependency/install/bin/
    /opt/dependency/install/lib64/
    
    ## Root PrestoDB application directory
    /opt/presto/
    
    ## Root GitHub clone of PrestoDB repository
    /opt/presto/_repo/
    
    ## Root PrestoCpp subdirectory
    /opt/presto/_repo/presto-native-execution/
    
    ## Root Velox GitHub repository directory, as PrestoDB submodule
    /opt/presto/_repo/presto-native-execution/Velox
    
    ## Root build results directory for PrestoCpp with Velox
    /opt/presto/_repo/presto-native-execution/_build/release/
    /opt/presto/_repo/presto-native-execution/_build/release/velox/
    /opt/presto/_repo/presto-native-execution/_build/release/presto_cpp/
    

    The release container build contains mostly just the must-have runtime files, including the presto_server executable and some libraries. What is included in the final released container depends on user needs and can be adjusted.

    Prestissimo - runtime configuration and settings

    ⚠️ Notice: The presto-native-execution binary requires 32 GB of RAM at runtime to start (default settings). To override this and overcome the runtime error, add the line node.memory_gb=8 to node.properties.

    The Presto server with all dependencies can be found inside /opt/presto/; the runtime name is presto_server. There are two ways of starting PrestoCpp using the provided entry point /opt/entrypoint.sh.

    1) Quick start - pass parameters to entrypoint

    This is valid both when running with Docker and with Kubernetes. Using this method is not advised; users should prefer mounting configuration files using Kubernetes.

    "/opt/entrypoint.sh --use-env-params --discovery-uri=http://presto-coordinaator.default.svc.cluster.local:8080 --http-server-port=8080"
    

    2) Using in Kubernetes environment:

    Mount a config file inside the container as /opt/presto/node.properties.template. Replace each variable with your configuration values or leave it as is:

    Notice: set the same values for the Java coordinator as for PrestoCpp - version, location and environment should be the same or you will get connection errors.

    presto.version=0.273.3
    node.location=datacenter-warsaw
    node.environment=test-environment
    node.data-dir=/var/presto/data
    catalog.config-dir=/opt/presto/catalog
    plugin.dir=/opt/presto/plugin
    # node.id is generated and filled during machine startup if not specified
    

    Mount a config file inside the container as /opt/presto/config.properties.template. Replace each variable with your configuration values:

    coordinator=false
    http-server.http.port=8080
    discovery.uri=http://presto-coordinaator.default.svc.cluster.local:8080
    

    3) Hive-Metastore connector and S3 configuration:

    For the minimum required configuration, just mount the file /opt/presto/catalog/hive.properties inside the container at the given path (fill hive.metastore.uri with your metastore endpoint address):

    connector.name=hive-hadoop2
    hive.metastore.uri=thrift://hive-metastore-service.default.svc:9098
    hive.pushdown-filter-enabled=true
    cache.enabled=true
    

    Settings required by the S3 connector and the Velox query engine; replace with your values and refer to the Presto Hive connector settings documentation:

    hive.s3.path-style-access={{ isPathstyle }}
    hive.s3.endpoint={{ scheme }}://{{ serviceFqdnTpl . }}:{{ portRest }}
    hive.s3.aws-access-key={{ accessKey }}
    hive.s3.aws-secret-key={{ secretKey }}
    hive.s3.ssl.enabled={{ sslEnabled }}
    hive.s3select-pushdown.enabled={{ s3selectPushdownFilterEnabled }}
    hive.parquet.pushdown-filter-enabled={{ parquetPushdownFilterEnabled }}
    


    Signed-off-by: Linkiewicz, Milosz [email protected]

    opened by Mionsz 36
  • Add support for query pushdown to S3 using S3Select

    Add support for query pushdown to S3 using S3Select

    This change will allow Presto users to improve the performance of their queries using S3SelectPushdown. It pushes down projections and predicate evaluations to S3. As a result Presto doesn't need to download full S3 objects and only data required to answer the user's query is returned to Presto, thereby improving performance.

    S3SelectPushdown Technical Document: S3SelectPushdown.pdf

    PR UPDATE: Closed this PR as it was slow to work with due to the large volume of comments. Created a new PR to continue the work: https://github.com/prestodb/presto/pull/11970

    CLA Signed 
    opened by same3r 36
  • Support multiple columns in IN predicate

    Support multiple columns in IN predicate

    Support queries like:

    presto:sf1> select count(*) from lineitem where (orderkey, linenumber) IN (SELECT orderkey, linenumber from lineitem);
    Query 20161018_062422_00016_uqzsf failed: line 1:60: Multiple columns returned by subquery are not yet supported. Found 2
    

    It should be easy to implement once https://github.com/prestodb/presto/issues/6384 gets implemented.

    opened by kokosing 36
  • Prune Nested Fields for Parquet Columns

    Prune Nested Fields for Parquet Columns

    Read only the necessary fields for Parquet nested columns. Currently, Presto will read all the fields in a struct for Parquet columns, e.g.

    select s.a, s.b
    from t
    

    If it is a Parquet file with a struct column s: {a int, b double, c long, d float}, current Presto will read a, b, c, d from s and output just a and b.

    For columnar storage such as Parquet or ORC, we could do better by reading just the necessary fields. In the previous example, just read {a int, b double} from s, skipping the other fields to save IO.

    This patch introduces an optional NestedFields in ColumnHandle. When optimizing the plan, the PruneNestedColumns optimizer visits expressions and puts candidate nested fields into the ColumnHandle. When scanning Parquet files, the record reader can use NestedFields to read only the necessary fields.

    This has a dependency on @jxiang's https://github.com/prestodb/presto/pull/4714, which gives us the flexibility to specify metastore schemas differently from Parquet file schemas.

    @dain @martint @electrum @cberner @erichwang any comments are appreciated

    CLA Signed 
    opened by zhenxiao 36
  • Cassandra connector IN query very slow planning on large list

    Cassandra connector IN query very slow planning on large list

    A query like -

    select col1
    from table
    where col2 in (<long list of integers>)
    and col3 in (<long list of string>)
    and col4 in (<another long list of integers>)
    and col1 is not null
    group by col1;
    

    takes more than 5 minutes just in planning. The Cassandra table being queried has a lot of partitions, and the IN list lengths I was experimenting with were anywhere between 50 and 200. <col2, col3, col4> together form the partition key, so I don't expect a full table scan to take place during planning or execution. Any ideas?

    opened by aandis 34
  • Add InMemory connector

    Add InMemory connector

    Add connector that stores all data in memory on the workers.

    The rationale behind it is to serve as storage for SQL query benchmarking. Writing JMH unit benchmarks from scratch is time-consuming to set up; it's often much easier to write a query against TPCH. Previous benchmarks had the significant drawback that generating data in the TPCH connector used most of the CPU time. With the InMemory connector that's no longer the case.

    The connector is based on BlackHole, and the first commit is just a copy/paste with some renames.

    CLA Signed ready-to-merge 
    opened by pnowojski 33
  • folly compiled failed when running setup-macos.sh

    folly compiled failed when running setup-macos.sh

    When running setup-macos.sh in presto-native-execution/scripts, I got the following errors while it compiled the folly project.

    • run_and_time install_folly
    • install_folly
    • github_checkout facebook/folly v2022.07.11.00
    • local REPO=facebook/folly
    • shift
    • local VERSION=v2022.07.11.00
    • shift
    • local GIT_CLONE_PARAMS= ++ basename facebook/folly
    • local DIRNAME=folly
    • cd /Users/ericdoug/Documents/mydev/presto/presto-native-execution/scripts
    • '[' -z folly ']'
    • '[' -d folly ']'
    • prompt 'folly already exists. Delete?' folly already exists. Delete? [Y, n] Y
    • rm -rf folly
    • '[' '!' -d folly ']'
    • git clone -q -b v2022.07.11.00 [email protected]:facebook/folly.git Note: switching to '4ba3bfed38ad14d0951d82b154c44235d380f59b'.

    You are in 'detached HEAD' state. You can look around, make experimental changes and commit them, and you can discard any commits you make in this state without impacting any branches by switching back to a branch.

    If you want to create a new branch to retain commits you create, you may do so (now or later) by using -c with the switch command. Example:

    git switch -c

    Or undo this operation with:

    git switch -

    Turn off this advice by setting config variable advice.detachedHead to false

    • cd folly ++ brew --prefix [email protected]
    • OPENSSL_ROOT_DIR=/usr/local/opt/[email protected]
    • cmake_install -DBUILD_TESTS=OFF +++ pwd ++ basename /Users/ericdoug/Documents/mydev/presto/presto-native-execution/scripts/folly
    • local NAME=folly
    • local BINARY_DIR=_build
    • '[' -d _build ']'
    • mkdir -p _build
    • CPU_TARGET=avx ++ get_cxx_flags avx ++ local CPU_ARCH=avx ++ local OS +++ uname ++ OS=Darwin ++ local MACHINE +++ uname -m ++ MACHINE=x86_64 ++ ADDITIONAL_FLAGS= ++ '[' -z avx ']' ++ case $CPU_ARCH in ++ echo -n '-mavx2 -mfma -mavx -mf16c -mlzcnt -std=c++17 -mbmi2 '
    • COMPILER_FLAGS='-mavx2 -mfma -mavx -mf16c -mlzcnt -std=c++17 -mbmi2 '
    • cmake -Wno-dev -B_build -GNinja -DCMAKE_POSITION_INDEPENDENT_CODE=ON -DCMAKE_CXX_STANDARD=17 '' '' '-DCMAKE_CXX_FLAGS=-mavx2 -mfma -mavx -mf16c -mlzcnt -std=c++17 -mbmi2 ' -DBUILD_TESTING=OFF -DBUILD_TESTS=OFF CMake Warning: Ignoring empty string ("") provided on the command line.

    CMake Warning: Ignoring empty string ("") provided on the command line.

    -- The CXX compiler identification is AppleClang 14.0.0.14000029 -- The C compiler identification is AppleClang 14.0.0.14000029 -- The ASM compiler identification is Clang with GNU-like command-line -- Found assembler: /Library/Developer/CommandLineTools/usr/bin/cc -- Detecting CXX compiler ABI info -- Detecting CXX compiler ABI info - done -- Check for working CXX compiler: /Library/Developer/CommandLineTools/usr/bin/c++ - skipped -- Detecting CXX compile features -- Detecting CXX compile features - done -- Detecting C compiler ABI info -- Detecting C compiler ABI info - done -- Check for working C compiler: /Library/Developer/CommandLineTools/usr/bin/cc - skipped -- Detecting C compile features -- Detecting C compile features - done -- Performing Test CMAKE_HAVE_LIBC_PTHREAD -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success -- Found Threads: TRUE
    -- Found Boost: /usr/local/lib/cmake/Boost-1.79.0/BoostConfig.cmake (found suitable version "1.79.0", minimum required is "1.51.0") found components: context filesystem program_options regex system thread -- Found DoubleConversion: /usr/local/lib/libdouble-conversion.a
    -- Could NOT find gflags (missing: LIBGFLAGS_LIBRARY LIBGFLAGS_INCLUDE_DIR) -- Could NOT find Glog (missing: GLOG_LIBRARY GLOG_INCLUDE_DIR) -- Found libevent: /usr/local/lib/libevent.dylib -- Found ZLIB: /Library/Developer/CommandLineTools/SDKs/MacOSX13.1.sdk/usr/lib/libz.tbd (found version "1.2.11") -- Found OpenSSL: /usr/local/opt/[email protected]/lib/libcrypto.dylib (found suitable version "1.1.1q", minimum required is "1.1.1")
    -- Looking for ASN1_TIME_diff -- Looking for ASN1_TIME_diff - not found -- Found BZip2: /Library/Developer/CommandLineTools/SDKs/MacOSX13.1.sdk/usr/lib/libbz2.tbd (found version "1.0.8") -- Looking for BZ2_bzCompressInit -- Looking for BZ2_bzCompressInit - not found -- Looking for lzma_auto_decoder in /Library/Developer/CommandLineTools/SDKs/MacOSX13.1.sdk/usr/lib/liblzma.tbd -- Looking for lzma_auto_decoder in /Library/Developer/CommandLineTools/SDKs/MacOSX13.1.sdk/usr/lib/liblzma.tbd - not found -- Looking for lzma_easy_encoder in /Library/Developer/CommandLineTools/SDKs/MacOSX13.1.sdk/usr/lib/liblzma.tbd -- Looking for lzma_easy_encoder in /Library/Developer/CommandLineTools/SDKs/MacOSX13.1.sdk/usr/lib/liblzma.tbd - not found -- Looking for lzma_lzma_preset in /Library/Developer/CommandLineTools/SDKs/MacOSX13.1.sdk/usr/lib/liblzma.tbd -- Looking for lzma_lzma_preset in /Library/Developer/CommandLineTools/SDKs/MacOSX13.1.sdk/usr/lib/liblzma.tbd - not found -- Could NOT find LibLZMA (missing: LIBLZMA_HAS_AUTO_DECODER LIBLZMA_HAS_EASY_ENCODER LIBLZMA_HAS_LZMA_PRESET) (found version "5.2.5") -- Found LZ4: /usr/local/lib/liblz4.dylib
    -- Found LZ4: /usr/local/lib/liblz4.dylib -- Could NOT find ZSTD (missing: ZSTD_LIBRARY ZSTD_INCLUDE_DIR) -- Could NOT find SNAPPY (missing: SNAPPY_LIBRARY SNAPPY_INCLUDE_DIR) -- Could NOT find LIBDWARF (missing: LIBDWARF_LIBRARY LIBDWARF_INCLUDE_DIR) -- Could NOT find LIBIBERTY (missing: LIBIBERTY_LIBRARY LIBIBERTY_INCLUDE_DIR) -- Could NOT find LIBAIO (missing: LIBAIO_LIBRARY LIBAIO_INCLUDE_DIR) -- Could NOT find LIBURING (missing: LIBURING_LIBRARY LIBURING_INCLUDE_DIR) -- Found LIBSODIUM: /usr/local/lib/libsodium.dylib
    -- Found Libsodium: /usr/local/lib/libsodium.dylib -- Could NOT find LIBUNWIND (missing: LIBUNWIND_LIBRARY) -- Looking for swapcontext -- Looking for swapcontext - found -- Looking for C++ include elf.h -- Looking for C++ include elf.h - not found -- Looking for backtrace -- Looking for backtrace - found -- backtrace facility detected in default set of libraries -- Found Backtrace: /Library/Developer/CommandLineTools/SDKs/MacOSX13.1.sdk/usr/include
    -- Setting FOLLY_USE_SYMBOLIZER: OFF -- Setting FOLLY_HAVE_ELF: -- Setting FOLLY_HAVE_DWARF: FALSE -- Performing Test FOLLY_CPP_ATOMIC_BUILTIN -- Performing Test FOLLY_CPP_ATOMIC_BUILTIN - Success -- Performing Test FOLLY_STDLIB_LIBSTDCXX -- Performing Test FOLLY_STDLIB_LIBSTDCXX - Failed -- Performing Test FOLLY_STDLIB_LIBSTDCXX_GE_9 -- Performing Test FOLLY_STDLIB_LIBSTDCXX_GE_9 - Failed -- Performing Test FOLLY_STDLIB_LIBCXX -- Performing Test FOLLY_STDLIB_LIBCXX - Success -- Performing Test FOLLY_STDLIB_LIBCXX_GE_9 -- Performing Test FOLLY_STDLIB_LIBCXX_GE_9 - Success -- Performing Test FOLLY_STDLIB_LIBCPP -- Performing Test FOLLY_STDLIB_LIBCPP - Failed -- Looking for C++ include jemalloc/jemalloc.h -- Looking for C++ include jemalloc/jemalloc.h - not found -- Performing Test COMPILER_HAS_UNKNOWN_WARNING_OPTION -- Performing Test COMPILER_HAS_UNKNOWN_WARNING_OPTION - Success -- Performing Test COMPILER_HAS_W_SHADOW_LOCAL -- Performing Test COMPILER_HAS_W_SHADOW_LOCAL - Failed -- Performing Test COMPILER_HAS_W_SHADOW_COMPATIBLE_LOCAL -- Performing Test COMPILER_HAS_W_SHADOW_COMPATIBLE_LOCAL - Failed -- Performing Test COMPILER_HAS_W_NOEXCEPT_TYPE -- Performing Test COMPILER_HAS_W_NOEXCEPT_TYPE - Success -- Performing Test COMPILER_HAS_W_NULLABILITY_COMPLETENESS -- Performing Test COMPILER_HAS_W_NULLABILITY_COMPLETENESS - Success -- Performing Test COMPILER_HAS_W_INCONSISTENT_MISSING_OVERRIDE -- Performing Test COMPILER_HAS_W_INCONSISTENT_MISSING_OVERRIDE - Success -- Performing Test COMPILER_HAS_F_ALIGNED_NEW -- Performing Test COMPILER_HAS_F_ALIGNED_NEW - Success -- Performing Test COMPILER_HAS_F_OPENMP -- Performing Test COMPILER_HAS_F_OPENMP - Failed -- Looking for pthread_atfork -- Looking for pthread_atfork - found -- Looking for accept4 -- Looking for accept4 - not found -- Looking for getrandom -- Looking for getrandom - not found -- Looking for preadv -- Looking for preadv - found -- Looking for pwritev -- Looking for pwritev - found -- Looking for clock_gettime -- Looking for clock_gettime - found -- Looking for pipe2 -- Looking for pipe2 - not found -- Looking for sendmmsg -- Looking for sendmmsg - not found -- Looking for recvmmsg -- Looking for recvmmsg - not found -- Looking for malloc_usable_size -- Looking for malloc_usable_size - not found -- Performing Test FOLLY_HAVE_IFUNC -- Performing Test FOLLY_HAVE_IFUNC - Failed -- Performing Test FOLLY_HAVE_STD__IS_TRIVIALLY_COPYABLE -- Performing Test FOLLY_HAVE_STD__IS_TRIVIALLY_COPYABLE - Success -- Performing Test FOLLY_HAVE_UNALIGNED_ACCESS -- Performing Test FOLLY_HAVE_UNALIGNED_ACCESS - Success -- Performing Test FOLLY_HAVE_VLA -- Performing Test FOLLY_HAVE_VLA - Success -- Performing Test FOLLY_HAVE_WEAK_SYMBOLS -- Performing Test FOLLY_HAVE_WEAK_SYMBOLS - Failed -- Performing Test FOLLY_HAVE_LINUX_VDSO -- Performing Test FOLLY_HAVE_LINUX_VDSO - Failed -- Performing Test FOLLY_HAVE_WCHAR_SUPPORT -- Performing Test FOLLY_HAVE_WCHAR_SUPPORT - Success -- Performing Test FOLLY_HAVE_EXTRANDOM_SFMT19937 -- Performing Test FOLLY_HAVE_EXTRANDOM_SFMT19937 - Failed -- Performing Test HAVE_VSNPRINTF_ERRORS -- Performing Test HAVE_VSNPRINTF_ERRORS - Failed -- arch does not match x86_64, skipping setting SSE2/AVX2 compile flags for LtHash SIMD code -- Performing Test COMPILER_HAS_M_PCLMUL -- Performing Test COMPILER_HAS_M_PCLMUL - Success -- compiler has flag pclmul, setting compile flag for 
/Users/ericdoug/Documents/mydev/presto/presto-native-execution/scripts/folly/folly/hash/detail/ChecksumDetail.cpp;/Users/ericdoug/Documents/mydev/presto/presto-native-execution/scripts/folly/folly/hash/detail/Crc32CombineDetail.cpp;/Users/ericdoug/Documents/mydev/presto/presto-native-execution/scripts/folly/folly/hash/detail/Crc32cDetail.cpp -- Configuring done CMake Error in CMakeLists.txt: Target "folly_deps" INTERFACE_INCLUDE_DIRECTORIES property contains path:

    "/Users/ericdoug/Documents/mydev/presto/presto-native-execution/scripts/folly/GLOG_INCLUDE_DIR-NOTFOUND"
    

    which is prefixed in the source directory.

    -- Generating done CMake Warning: Manually-specified variables were not used by the project:

    BUILD_TESTING
    

    CMake Generate step failed. Build files cannot be regenerated correctly.

    opened by eric-doug 1
  • Reduce TaskHandle lock contention

    Reduce TaskHandle lock contention

    Avoids synchronizing on TaskHandle when checking whether it has been destroyed by making the destroyed flag volatile instead. The previously synchronized isDestroyed method is called once at the end of each driver processing interval by worker threads to check whether the split is finished, which could create unnecessary lock contention when all threads are active on tasks with many splits that frequently block or complete quickly.

    Also includes a minor improvement to PrioritizedSplitRunner#process() that avoids redundant calls to System.nanoTime() and reduces logging overhead in TaskExecutor by avoiding system calls and string concatenation when debug logging is not enabled.
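
    A minimal sketch of the volatile-flag approach described above (class shape and names are illustrative, not the actual Presto code):

        // Illustrative sketch only, assuming a simplified TaskHandle
        class TaskHandle
        {
            // volatile makes the flag visible to worker threads without locking
            private volatile boolean destroyed;

            // hot path: read the flag without acquiring the monitor
            public boolean isDestroyed()
            {
                return destroyed;
            }

            // destruction still synchronizes to coordinate other state changes
            public synchronized void destroy()
            {
                destroyed = true;
                // ... release splits and other resources
            }
        }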

    == NO RELEASE NOTE ==
    
    opened by pettyjamesm 0
  • WIP: Add S3 Select pushdown JSON tests

    WIP: Add S3 Select pushdown JSON tests

    Test plan

    Locally tested:

    [INFO] --- maven-surefire-plugin:3.0.0-M7:test (default-test) @ presto-hive-hadoop2 ---
    [INFO] Tests will run in random order. To reproduce ordering use flag -Dsurefire.runOrder.random.seed=114273716603401
    [INFO] Using auto detected provider org.apache.maven.surefire.testng.TestNGProvider
    [INFO] 
    [INFO] -------------------------------------------------------
    [INFO]  T E S T S
    [INFO] -------------------------------------------------------
    [INFO] Running com.facebook.presto.hive.s3select.TestHiveFileSystemS3SelectJsonPushdown
    WARNING: An illegal reflective access operation has occurred
    WARNING: Illegal reflective access by org.apache.hadoop.fs.HadoopExtendedFileSystemCache (file:/Users/dnnanuti/.m2/repository/com/facebook/presto/presto-hive-common/0.279-SNAPSHOT/presto-hive-common-0.279-SNAPSHOT.jar) to field java.lang.reflect.Field.modifiers
    WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.fs.HadoopExtendedFileSystemCache
    WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
    WARNING: All illegal access operations will be denied in a future release
    2023-01-05T17:29:59.421-0600 WARNING Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    2023-01-05T17:29:59.523-0600 INFO Successfully loaded & initialized native-bzip2 library system-native
    2023-01-05T17:29:59.528-0600 INFO Successfully loaded & initialized native-zlib library
    2023-01-05T17:30:01.824-0600 WARNING NoSuchMethodException was thrown when disabling normalizeUri. This indicates you are using an old version (< 4.5.8) of Apache http client. It is recommended to use http client version >= 4.5.9 to avoid the breaking change introduced in apache client 4.5.7 and the latency in exception handling. See https://github.com/aws/aws-sdk-java/issues/1919 for more information
    2023-01-05T17:30:03.178-0600 INFO io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
    2023-01-05T17:30:03.397-0600 INFO Got brand-new decompressor [.bz2]
    2023-01-05T17:30:03.498-0600 INFO Got brand-new decompressor [.bz2]
    2023-01-05T17:30:03.834-0600 INFO Got brand-new decompressor [.gz]
    [INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 7.838 s - in com.facebook.presto.hive.s3select.TestHiveFileSystemS3SelectJsonPushdown
    [INFO] 
    [INFO] Results:
    [INFO] 
    [INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0
    [INFO] 
    [INFO] ------------------------------------------------------------------------
    [INFO] BUILD SUCCESS
    [INFO] ------------------------------------------------------------------------
    [INFO] Total time:  28.323 s
    [INFO] Finished at: 2023-01-05T23:30:04Z
    [INFO] ------------------------------------------------------------------------
    + EXIT_CODE=0
    + set -e
    + popd
    ~/workplace/connectors/github-repositories/presto
    + cleanup_docker_containers
    + docker-compose -f ./presto-hive-hadoop2/bin/../conf/docker-compose.yml down
    Stopping conf_hadoop-master_1 ... done
    Removing conf_hadoop-master_1 ... done
    Removing network conf_default
    + wait
    + exit 0
    
    

    Needs rebasing and will be merged after: https://github.com/prestodb/presto/pull/18901

    == NO RELEASE NOTE ==
    
    opened by dnanuti 0
  • WIP: Add S3 Select pushdown for JSON files

    WIP: Add S3 Select pushdown for JSON files

    Add S3 Select pushdown for JSON files

    • Small refactoring for IonSqlQueryBuilder to support query generation for JSON
    • Pushdown logic works for base columns, same as for CSV
    • Tests for JSON support will be added in a separate PR, as they require more changes: https://github.com/prestodb/presto/pull/18902

    Needs rebasing and will be merged after: https://github.com/prestodb/presto/pull/18786 https://github.com/prestodb/presto/pull/18798

    == RELEASE NOTES ==
    
    Hive Changes
    * Add Amazon S3 Select pushdown for JSON files.
    
    opened by dnanuti 0
  • find_first UDF cannot distinguish between NULL returned as a value and NULL returned because of no match

    find_first UDF cannot distinguish between NULL returned as a value and NULL returned because of no match

    The find_first function added in https://github.com/prestodb/presto/pull/18316 returns NULL if no match is found. However, that NULL cannot be distinguished from a NULL returned as a value.

    For example, both SELECT FIND_FIRST(ARRAY[NULL, 1], x->x is NULL) is NULL; and SELECT FIND_FIRST(ARRAY[1], x->x is NULL) is NULL; return true.

    opened by feilong-liu 1