The official home of the Presto distributed SQL query engine for big data

Overview

Presto is a distributed SQL query engine for big data.

See the User Manual for deployment instructions and end user documentation.

Requirements

  • Mac OS X or Linux
  • Java 8 Update 151 or higher (8u151+), 64-bit. Both Oracle JDK and OpenJDK are supported.
  • Maven 3.3.9+ (for building)
  • Python 2.4+ (for running with the launcher script)

Building Presto

Presto is a standard Maven project. Simply run the following command from the project root directory:

./mvnw clean install

On the first build, Maven will download all the dependencies from the internet and cache them in the local repository (~/.m2/repository), which can take a considerable amount of time. Subsequent builds will be faster.

Presto has a comprehensive set of unit tests that can take several minutes to run. You can disable the tests when building:

./mvnw clean install -DskipTests
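
If you only need to rebuild or test a single module, the standard Maven reactor options can narrow the build. This is a general Maven sketch rather than Presto-specific documentation; the module and test class names below are illustrative:

./mvnw install -pl presto-main -am -DskipTests
./mvnw test -pl presto-main -Dtest=TestExampleQueries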

Running Presto in your IDE

Overview

After building Presto for the first time, you can load the project into your IDE and run the server. We recommend using IntelliJ IDEA. Because Presto is a standard Maven project, you can import it into your IDE using the root pom.xml file. In IntelliJ, choose Open Project from the Quick Start box or choose Open from the File menu and select the root pom.xml file.

After opening the project in IntelliJ, double check that the Java SDK is properly configured for the project:

  • Open the File menu and select Project Structure
  • In the SDKs section, ensure that a 1.8 JDK is selected (create one if none exist)
  • In the Project section, ensure the Project language level is set to 8.0 as Presto makes use of several Java 8 language features

Presto comes with sample configuration that should work out-of-the-box for development. Use the following options to create a run configuration:

  • Main Class: com.facebook.presto.server.PrestoServer
  • VM Options: -ea -XX:+UseG1GC -XX:G1HeapRegionSize=32M -XX:+UseGCOverheadLimit -XX:+ExplicitGCInvokesConcurrent -Xmx2G -Dconfig=etc/config.properties -Dlog.levels-file=etc/log.properties
  • Working directory: $MODULE_DIR$
  • Use classpath of module: presto-main

The working directory should be the presto-main subdirectory. In IntelliJ, using $MODULE_DIR$ accomplishes this automatically.

Additionally, the Hive plugin must be configured with the location of your Hive metastore Thrift service. Add the following to the list of VM options, replacing localhost:9083 with the correct host and port (or use the below value if you do not have a Hive metastore):

-Dhive.metastore.uri=thrift://localhost:9083

Using SOCKS for Hive or HDFS

If your Hive metastore or HDFS cluster is not directly accessible to your local machine, you can use SSH port forwarding to access it. Set up a dynamic SOCKS proxy with SSH listening on local port 1080:

ssh -v -N -D 1080 server

Then add the following to the list of VM options:

-Dhive.metastore.thrift.client.socks-proxy=localhost:1080

Running the CLI

Start the CLI to connect to the server and run SQL queries:

presto-cli/target/presto-cli-*-executable.jar
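
If the server is not running locally on the default port, the CLI's standard options can point it at the right coordinator, catalog, and schema; the values below are illustrative:

presto-cli/target/presto-cli-*-executable.jar --server localhost:8080 --catalog hive --schema default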

Run a query to see the nodes in the cluster:

SELECT * FROM system.runtime.nodes;

In the sample configuration, the Hive connector is mounted in the hive catalog, so you can run the following queries to show the tables in the Hive database default:

SHOW TABLES FROM hive.default;
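
You can then query a table by its fully qualified name; the table in this sketch is hypothetical:

SELECT * FROM hive.default.my_table LIMIT 10;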

Code Style

We recommend you use IntelliJ as your IDE. The code style template for the project can be found in the codestyle repository along with our general programming and Java guidelines. In addition to those you should also adhere to the following:

  • Alphabetize sections in the documentation source files (both in table of contents files and other regular documentation files). In general, alphabetize methods/variables/sections if such ordering already exists in the surrounding code.
  • When appropriate, use the Java 8 stream API. However, note that the stream implementation does not perform well so avoid using it in inner loops or otherwise performance sensitive sections.
  • Categorize errors when throwing exceptions. For example, PrestoException takes an error code as an argument, PrestoException(HIVE_TOO_MANY_OPEN_PARTITIONS). This categorization lets you generate reports so you can monitor the frequency of various failures (see the sketch after this list).
  • Ensure that all files have the appropriate license header; you can generate the license by running mvn license:format.
  • Consider using String formatting (printf style formatting using the Java Formatter class): format("Session property %s is invalid: %s", name, value) (note that format() should always be statically imported). Sometimes, if you only need to append something, consider using the + operator.
  • Avoid using the ternary operator except for trivial expressions.
  • Use an assertion from Airlift's Assertions class if there is one that covers your case rather than writing the assertion by hand. Over time we may move over to more fluent assertions like AssertJ.
  • When writing a Git commit message, follow these guidelines.
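
As a minimal sketch of the exception-categorization and string-formatting guidance above (the error code, message, and class names are illustrative, and import paths may differ from the actual codebase):

import com.facebook.presto.spi.PrestoException;

import static com.facebook.presto.hive.HiveErrorCode.HIVE_TOO_MANY_OPEN_PARTITIONS;
import static java.lang.String.format;

public class PartitionLimitChecker
{
    private static final int MAX_OPEN_PARTITIONS = 100;

    public static void checkOpenPartitions(String tableName, int openPartitions)
    {
        if (openPartitions > MAX_OPEN_PARTITIONS) {
            // Categorize the failure with an error code so its frequency can be monitored,
            // and build the message with a statically imported format()
            throw new PrestoException(HIVE_TOO_MANY_OPEN_PARTITIONS,
                    format("Too many open partitions for table %s: %s", tableName, openPartitions));
        }
    }
}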

Building the Web UI

The Presto Web UI is composed of several React components and is written in JSX and ES6. This source code is compiled and packaged into browser-compatible JavaScript, which is then checked in to the Presto source code (in the dist folder). You must have Node.js and Yarn installed to execute these commands. To update this folder after making changes, simply run:

yarn --cwd presto-main/src/main/resources/webapp/src install

If no JavaScript dependencies have changed (i.e., no changes to package.json), it is faster to run:

yarn --cwd presto-main/src/main/resources/webapp/src run package

To simplify iteration, you can also run in watch mode, which automatically re-compiles when changes to source files are detected:

yarn --cwd presto-main/src/main/resources/webapp/src run watch

To iterate quickly, simply re-build the project in IntelliJ after packaging is complete. Project resources will be hot-reloaded and changes are reflected on browser refresh.

Release Notes

When authoring a pull request, the PR description should include its relevant release notes. Follow Release Notes Guidelines when authoring release notes.

Comments
  • Fix optimized parquet reader complex hive types processing

    • Fix reading repeated fields when a Parquet file consists of multiple pages, so the beginning of a field can be on one page and its end on the next page.

    • Support reading empty arrays

    • Determine null values of optional fields

    • Add tests for hive complex types: arrays, maps and structs

    • Rewrite tests to read Parquet files consisting of multiple pages

    • Add TestDataWritableWriter with a patch for empty arrays and empty maps, because the bug https://issues.apache.org/jira/browse/HIVE-13632 is already fixed in the current Hive version, so Presto should be able to read empty arrays too

    CLA Signed 
    opened by kgalieva 77
  • Add support for prepared statements in JDBC driver

    I'm using presto-jdbc-0.66-SNAPSHOT.jar and trying to execute Presto queries against presto-server from my Java application.

    The sample code below, using a JDBC Statement, works well.

        Class.forName("com.facebook.presto.jdbc.PrestoDriver");
        Connection connection = DriverManager.getConnection("jdbc:presto://192.168.33.33:8080/hive/default", "hive", "hive");
    
        Statement statement = connection.createStatement();
        ResultSet rs = statement.executeQuery("SHOW TABLES");
        while(rs.next()) {
            System.out.println(rs.getString(1));
        }
    

    However, using a JDBC PreparedStatement throws an exception. Does presto-jdbc not yet support PreparedStatement? Here's my test code and exception info.

    Test Code :

        Class.forName("com.facebook.presto.jdbc.PrestoDriver");
        Connection connection = DriverManager.getConnection("jdbc:presto://192.168.33.33:8080/hive/default", "hive", "hive");
    
        PreparedStatement ps = connection.prepareStatement("SHOW TABLES");
        ResultSet rs = ps.executeQuery();
        while(rs.next()) {
            System.out.println(rs.getString(1));
        }
    

    Exception Info :

        java.lang.UnsupportedOperationException: PreparedStatement
    at com.facebook.presto.jdbc.PrestoPreparedStatement.<init>(PrestoPreparedStatement.java:44)
    at com.facebook.presto.jdbc.PrestoConnection.prepareStatement(PrestoConnection.java:93)
    at com.nsouls.frescott.hive.mapper.PrestoConnectionTest.testShowTable(PrestoConnectionTest.java:37)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
    at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
    at org.springframework.test.context.junit4.statements.RunBeforeTestMethodCallbacks.evaluate(RunBeforeTestMethodCallbacks.java:74)
    at org.springframework.test.context.junit4.statements.RunAfterTestMethodCallbacks.evaluate(RunAfterTestMethodCallbacks.java:83)
    at org.springframework.test.context.junit4.statements.SpringRepeat.evaluate(SpringRepeat.java:72)
    at org.springframework.test.context.junit4.SpringJUnit4ClassRunner.runChild(SpringJUnit4ClassRunner.java:231)
    at org.springframework.test.context.junit4.SpringJUnit4ClassRunner.runChild(SpringJUnit4ClassRunner.java:88)
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
    at org.springframework.test.context.junit4.statements.RunBeforeTestClassCallbacks.evaluate(RunBeforeTestClassCallbacks.java:61)
    at org.springframework.test.context.junit4.statements.RunAfterTestClassCallbacks.evaluate(RunAfterTestClassCallbacks.java:71)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
    at org.springframework.test.context.junit4.SpringJUnit4ClassRunner.run(SpringJUnit4ClassRunner.java:174)
    at org.junit.runner.JUnitCore.run(JUnitCore.java:157)
    at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:74)
    at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:202)
    at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:65)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)
    
    opened by felika 46
  • Add support for query pushdown to S3 using S3 Select

    This change will allow Presto users to improve the performance of their queries using S3SelectPushdown. It pushes down projections and predicate evaluations to S3, so Presto doesn't need to download full S3 objects; only the data required to answer the user's query is returned to Presto.
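
    As a rough sketch, enabling the pushdown in a Hive catalog properties file would look something like the following; the property name matches the one shown in the Prestissimo configuration example further down this page, and the rest of the catalog file is omitted:

    connector.name=hive-hadoop2
    hive.s3select-pushdown.enabled=true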

    S3SelectPushdown Technical Document: S3SelectPushdown.pdf

    This PR is a continuation of https://github.com/prestodb/presto/pull/11033.

    CLA Signed 
    opened by same3r 42
  • Implement EXPLAIN ANALYZE

    This should work similarly to PostgreSQL (http://www.postgresql.org/docs/9.4/static/sql-explain.html), by executing the query, recording stats, and then rendering the stats along with the plan. A first pass at implementing this could probably be to render similarly to EXPLAIN (TYPE DISTRIBUTED) with the stage and operator stats inserted.
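
    A sketch of the intended usage, mirroring the PostgreSQL behavior described above (the query itself is illustrative):

    EXPLAIN ANALYZE SELECT orderkey, count(*) FROM lineitem GROUP BY orderkey;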

    enhancement 
    opened by cberner 42
  • Performance Regressions in Presto 0.206?

    I was recently benchmarking Presto 0.206 vs 0.172. The tests are run on Parquet datasets stored on S3.

    We found that while Presto 0.206 was generally faster on smaller datasets, there were some significant performance regressions on larger datasets. The CPU time reported by EXPLAIN ANALYZE was lower in 0.206 than in 0.172, but the wall time was much longer.

    This possibly indicates either stragglers or some sort of scheduling bug that adversely affects parallelism. Note that the concurrency settings like task.concurrency are the same in both clusters.

    For instance, on the TPCH scale 1000 dataset, query #7 slowed down by 2x in wall time. The query was:

    SELECT supp_nation,
           cust_nation,
           l_year,
           sum(volume) AS revenue
    FROM
      (SELECT n1.n_name AS supp_nation,
              n2.n_name AS cust_nation,
              substr(l_shipdate, 1, 4) AS l_year,
              l_extendedprice * (1 - l_discount) AS volume
       FROM lineitem_parq,
            orders_parq,
            customer_parq,
            supplier_parq,
            nation_parq n1,
            nation_parq n2
       WHERE s_suppkey = l_suppkey
         AND o_orderkey = l_orderkey
         AND c_custkey = o_custkey
         AND s_nationkey = n1.n_nationkey
         AND c_nationkey = n2.n_nationkey
         AND ((n1.n_name = 'KENYA'
               AND n2.n_name = 'PERU')
              OR (n1.n_name = 'PERU'
                  AND n2.n_name = 'KENYA'))
         AND l_shipdate BETWEEN '1995-01-01' AND '1996-12-31' ) AS shipping
    GROUP BY supp_nation,
             cust_nation,
             l_year
    ORDER BY supp_nation,
             cust_nation,
             l_year;
    

    I compared the output of EXPLAIN ANALYZE from both versions of Presto and cannot find anything that could explain this. Here are some observations:

    • The CPU time reported by each stage was usually lower in 0.206. This probably rules out operator performance regressions.
    • Some of the leaf stages were using ScanProject in 0.172, but they use ScanFilterProject in 0.205. This actually reduces the output rows and leads to drastically lower CPU usage in upper stages of the query tree. This is a big improvement and should have led to faster query processing.

    References

    • Explain analyze from 0.206 - https://gist.github.com/anoopj/40eea820c1c310dff72139d495ac98b0
    • Explain analyze from 0.172 - https://gist.github.com/anoopj/01985fe0ad298dad4c22b1444e1f1e21
    opened by anoopj 39
  • [native] PrestoCpp build from source pipeline

    A fully automated build-from-source process proposal for presto-native-execution (PrestoCpp and Velox). A README file has been added for clarification. I appreciate any and all feedback.

    Prestissimo - Dockerfile build

    💡 PrestoDB repository: Presto - https://github.com/prestodb/presto

    💡 Velox repository: Velox - https://github.com/facebookincubator/velox

    Practical Velox implementation using PrestoCpp

    📝 Note: This README and the build process were adapted from an internal pipeline. You can e-mail the author if you have questions: [email protected]

    Prestissimo, known in the PrestoDB GitHub repository as 'presto-native-execution', is an effort to make PrestoDB even better using the Velox library as a starting point. Both of these - PrestoCpp and Velox - are mainly written in low-level C and C++17, which makes the build-from-scratch process enormously complicated. To make this process simple, the Intel Cloud Native Data Services Team is introducing a 3-stage, fully automated Docker build process based on the unmodified project GitHub repository.

    Quick Start

    1. Clone this repository

    git clone https://github.com/prestodb/presto prestodb
    

    2. (Optional) Define and export Docker registry, image name and image tag variables

    📝 Note: Remember to end your IMAGE_REGISTRY with / as this is required for full tag generation.

    💡 Tip: Depending on your configuration you may need to run all the commands below as the root user; to switch, type sudo su as your first command.

    💡 Tip: If IMAGE_REGISTRY is not specified, IMAGE_PUSH should be set to '0' or the docker image push stage will fail.

    Type in your console, changing the variable values to meet your needs:

    # defaults to 'avx', more info on Velox GitHub
    export CPU_TARGET="avx"
    # defaults to 'presto/prestissimo-${CPU_TARGET}-centos'
    export IMAGE_NAME='presto/prestissimo-${CPU_TARGET}-centos'
    # defaults to 'latest'
    export IMAGE_TAG='latest'
    # defaults to ''
    export IMAGE_REGISTRY='https://my_docker_registry.com/'
    # defaults to '0'
    export IMAGE_PUSH='0'
    

    3. Make sure Docker daemon is running

    (Ubuntu users) Type in your console:

    systemctl status docker
    

    4. Build Dockerfile repo

    Type in your console:

    cd prestodb/presto-native-execution
    make runtime-container
    

    The process is fully automated and requires no user interaction. Building the images for the first time can take up to a couple of hours (~1-2 h using 10 processor cores).

    5. Run container

    📝 Note: Remember that you should start the Presto Java server first.

    Depending on the values you have set, the container tag is defined as:

    PRESTO_CPP_TAG="${IMAGE_REGISTRY}${IMAGE_NAME}:${IMAGE_TAG}"

    For the default values this will be just:

    PRESTO_CPP_TAG=presto/prestissimo-avx-centos:latest

    To run a container built with the default tag, execute:

    docker run "presto/prestissimo-avx-centos:latest" \
                --use-env-params \
                --discovery-uri=http://localhost:8080 \
                --http-server-port=8080
    

    To run the container interactively, without executing the entrypoint file:

    docker run -it --entrypoint=/bin/bash "presto/prestissimo-avx-centos:latest"
    

    Container manual build

    For a manual build outside the Intel network, or without access to the Cloud Native Data Services Poland Docker registry, follow the steps below. In your terminal - in the same session that you will use to build the images - define and export the environment variables:

    export CPU_TARGET="avx"
    export IMAGE_NAME='presto/prestissimo-${CPU_TARGET}-centos'
    export IMAGE_TAG='latest'
    export IMAGE_REGISTRY='some-registry.my-domain.com/'
    export IMAGE_PUSH='0'
    export PRESTODB_REPOSITORY=$(git config --get remote.origin.url)
    export PRESTODB_CHECKOUT=$(git show -s --format="%H" HEAD)
    

    Here IMAGE_NAME and IMAGE_TAG will be the Prestissimo release image name and tag, and IMAGE_REGISTRY will be the registry that the image is tagged with and which will be used to download the images from previous stages if there are no cached images locally. CPU_TARGET will be unchanged for most cases; for more info read the Velox documentation. PRESTODB_REPOSITORY and PRESTODB_CHECKOUT will be used as the build repository and branch inside the container. You can set them manually or derive them using the git commands shown above.

    Then, for example, to build the containers when behind a proxy server, change directory and type:

    cd presto-native-execution/scripts/release-centos-dockerfile
    docker build \
        --network=host \
        --build-arg http_proxy  \
        --build-arg https_proxy \
        --build-arg no_proxy    \
        --build-arg CPU_TARGET  \
        --build-arg PRESTODB_REPOSITORY \
        --build-arg PRESTODB_CHECKOUT \
        --tag "${IMAGE_REGISTRY}${IMAGE_NAME}:${IMAGE_TAG}" .
    

    Build process - more info - prestissimo (with artifacts ~35 GB, without ~10 GB)

    Most of the runtime and build-time dependencies are downloaded, configured, and installed in this step. The result of this step is a starting point for both the second and third stages. This container will be built 'once per breaking change' in any of the repositories. It can be used as a starting point for CI/CD-integrated systems. This step installs Maven, Java 8, Python3-dev, libboost-dev, and lots of other large frameworks, libraries, and applications, and ensures that all of the steps from stage 2 will run with no errors.

    On top of the container from step 1, the repository is initialized, Velox and submodules are updated, and adapters, connectors, and side dependencies are built and configured. A full native PrestoDB repository build is done using the Meta Maven wrapper mvnw. After all of those partial steps, make and build are run for PrestoCpp and Velox with Parquet, ORC, and the Hive connector with Thrift, with the S3-EMRFS filesystem implementation (scheme s3://) and the Hadoop filesystem implementation.

    ### DIRECTORY AND MAIN BUILD ARTIFACTS
    ## Native Presto JAVA build artifacts:
    /root/.m2/
    
    ## Build, third party dependencies, mostly for adapters
    /opt/dependency/
    /opt/dependency/aws-sdk-cpp
    /opt/dependency/install/
    /opt/dependency/install/run/
    /opt/dependency/install/bin/
    /opt/dependency/install/lib64/
    
    ## Root PrestoDB application directory
    /opt/presto/
    
    ## Root GitHub clone of PrestoDB repository
    /opt/presto/_repo/
    
    ## Root PrestoCpp subdirectory
    /opt/presto/_repo/presto-native-execution/
    
    ## Root Velox GitHub repository directory, as PrestoDB submodule
    /opt/presto/_repo/presto-native-execution/Velox
    
    ## Root build results directory for PrestoCpp with Velox
    /opt/presto/_repo/presto-native-execution/_build/release/
    /opt/presto/_repo/presto-native-execution/_build/release/velox/
    /opt/presto/_repo/presto-native-execution/_build/release/presto_cpp/
    

    The release container build contains mostly only the must-have runtime files, including the presto_server executable from the build and some libraries. What ends up in the final released container depends on user needs and can be adjusted.

    Prestissimo - runtime configuration and settings

    ⚠️ Notice: The presto-native-execution binary requires 32 GB of RAM at runtime to start (default settings). To override this and avoid the runtime error, add a node.memory_gb=8 line to node.properties.
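
    A minimal node.properties override sketch based on that notice (placement is illustrative; merge the line into your existing node.properties):

    # lower the default runtime memory requirement
    node.memory_gb=8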

    The Presto server with all dependencies can be found inside /opt/presto/; the runtime name is presto_server. There are two ways of starting PrestoCpp using the provided entry point /opt/entrypoint.sh.

    1) Quick start - pass parameters to entrypoint

    This is valid when running with Docker as well as with Kubernetes. This method is not advised; users should prefer mounting configuration files in Kubernetes.

    "/opt/entrypoint.sh --use-env-params --discovery-uri=http://presto-coordinaator.default.svc.cluster.local:8080 --http-server-port=8080"
    

    2) Using a Kubernetes environment:

    Mount a config file inside the container as /opt/presto/node.properties.template. Replace each variable with your configuration values or leave it as is:

    Notice: set the same values for the Java coordinator as for PrestoCpp - version, location, and environment should be the same or you will get connection errors.

    presto.version=0.273.3
    node.location=datacenter-warsaw
    node.environment=test-environment
    node.data-dir=/var/presto/data
    catalog.config-dir=/opt/presto/catalog
    plugin.dir=/opt/presto/plugin
    # node.id is generated and filled during machine startup if not specified
    

    Mount a config file inside the container as /opt/presto/config.properties.template. Replace each variable with your configuration values:

    coordinator=false
    http-server.http.port=8080
    discovery.uri=http://presto-coordinaator.default.svc.cluster.local:8080
    

    3) Hive-Metastore connector and S3 configuration:

    For the minimum required configuration, just mount the file /opt/presto/catalog/hive.properties inside the container at the given path (fill hive.metastore.uri with your metastore endpoint address):

    connector.name=hive-hadoop2
    hive.metastore.uri=thrift://hive-metastore-service.default.svc:9098
    hive.pushdown-filter-enabled=true
    cache.enabled=true
    

    Settings required by the S3 connector and the Velox query engine; replace with your values and refer to the Presto Hive connector settings documentation:

    hive.s3.path-style-access={{ isPathstyle }}
    hive.s3.endpoint={{ scheme }}://{{ serviceFqdnTpl . }}:{{ portRest }}
    hive.s3.aws-access-key={{ accessKey }}
    hive.s3.aws-secret-key={{ secretKey }}
    hive.s3.ssl.enabled={{ sslEnabled }}
    hive.s3select-pushdown.enabled={{ s3selectPushdownFilterEnabled }}
    hive.parquet.pushdown-filter-enabled={{ parquetPushdownFilterEnabled }}
    

    Signed-off-by: Linkiewicz, Milosz [email protected]

    opened by Mionsz 36
  • Add support for query pushdown to S3 using S3Select

    This change will allow Presto users to improve the performance of their queries using S3SelectPushdown. It pushes down projections and predicate evaluations to S3, so Presto doesn't need to download full S3 objects; only the data required to answer the user's query is returned to Presto.

    S3SelectPushdown Technical Document: S3SelectPushdown.pdf

    PR UPDATE: Closed this PR as it was slow to work with due to the large volume of comments. Created a new PR to continue the work: https://github.com/prestodb/presto/pull/11970

    CLA Signed 
    opened by same3r 36
  • Support multiple columns in IN predicate

    Support queries like:

    presto:sf1> select count(*) from lineitem where (orderkey, linenumber) IN (SELECT orderkey, linenumber from lineitem);
    Query 20161018_062422_00016_uqzsf failed: line 1:60: Multiple columns returned by subquery are not yet supported. Found 2
    

    It should be easy to implement once https://github.com/prestodb/presto/issues/6384 is implemented.
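
    Until then, one workaround sketch is to emulate the multi-column IN with a join against the distinct key pairs (equivalent for non-null keys; the query shape follows the example above):

    select count(*)
    from lineitem l
    join (select distinct orderkey, linenumber from lineitem) r
      on l.orderkey = r.orderkey and l.linenumber = r.linenumber;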

    opened by kokosing 36
  • Prune Nested Fields for Parquet Columns

    Read only the necessary fields for Parquet nested columns. Currently, Presto will read all the fields in a struct for Parquet columns, e.g.

    select s.a, s.b
    from t
    

    If it is a Parquet file with struct column s: {a int, b double, c long, d float}, current Presto will read a, b, c, and d from s, and output just a and b.

    For columnar storage such as Parquet or ORC, we could do better by reading only the necessary fields. In the previous example, just read {a int, b double} from s and skip the other fields to save IO.

    This patch introduces an optional NestedFields in ColumnHandle. When optimizing the plan, the PruneNestedColumns optimizer visits expressions and puts candidate nested fields into the ColumnHandle. When scanning Parquet files, the record reader can use NestedFields to read only the necessary fields.

    This has a dependency on @jxiang's https://github.com/prestodb/presto/pull/4714, which gives us the flexibility to specify metastore schemas differently from Parquet file schemas.

    @dain @martint @electrum @cberner @erichwang any comments are appreciated

    CLA Signed 
    opened by zhenxiao 36
  • Cassandra connector IN query very slow planning on large list

    A query like -

    select col1
    from table
    where col2 in (<long list of integers>)
    and col3 in (<long list of string>)
    and col4 in (<another long list of integers>)
    and col1 is not null
    group by col1;
    

    takes more than 5 minutes just for planning. The Cassandra table being queried has a lot of partitions, and the list length for the IN query I was experimenting with was anywhere between 50 and 200. <col2, col3, col4> together form the partition key, so I don't imagine a full table scan takes place during planning or execution. Any ideas?

    opened by aandis 34
  • Add InMemory connector

    Add connector that stores all data in memory on the workers.

    The rationale behind it is to serve as storage for SQL query benchmarking. Setting up JMH unit benchmarks from scratch is time consuming; it's often much easier to write a query against TPCH. Previous benchmarks had the significant drawback that generating data in the TPCH connector was using most of the CPU time. With the InMemory connector that's no longer the case.

    The connector is based on BlackHole, and the first commit is just a copy/paste with some renames.

    CLA Signed ready-to-merge 
    opened by pnowojski 33
  • Reading Delta Lake table from S3 is slow if _last_checkpoint does not exist

    There is a performance issue when I read a newly created Delta Lake table stored on S3. Each Delta table access takes more than 15 seconds.

    The issue is caused by the loadMetadataFromFile(...) method in the Delta Lake Connectors project. The method expects a FileNotFoundException if the _last_checkpoint file is missing, which is the case for tables with fewer than 10 updates. See Checkpoints.scala#L133

    I created the issue https://github.com/delta-io/connectors/issues/437 in the Delta Lake Connectors repo, where it was suggested to fix it in Presto by throwing FileNotFoundException if the file does not exist on S3.

    I created the PR https://github.com/prestodb/presto/pull/18307, which was merged, but unfortunately it has not solved the issue. Obviously I missed something that needed to be changed, i.e. the PR is not complete.

    Error log before the PR:

    2022-09-07T13:51:16.040Z        WARN    Query-20220907_135115_00006_up6te-593   io.delta.standalone.internal.DeltaLogImpl       Failed to parse s3a://data-warehouse/my_database/my_table/_delta_log/_last_checkpoint. This may happen if there was an error during read operation, or a file appears to be partial. Sleeping and trying again.
    com.facebook.presto.hive.s3.PrestoS3FileSystem$UnrecoverableS3OperationException: com.amazonaws.services.s3.model.AmazonS3Exception: The specified key does not exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: 1712984EA7667136; S3 Extended Request ID: 0aa37532-b31a-403d-9632-343d5975c4e9), S3 Extended Request ID: 0aa37532-b31a-403d-9632-343d5975c4e9 (Path: s3a://data-warehouse/my_database/my_table/_delta_log/_last_checkpoint)
            at com.facebook.presto.hive.s3.PrestoS3FileSystem$PrestoS3InputStream.lambda$openStream$2(PrestoS3FileSystem.java:1114)
            at com.facebook.presto.hive.RetryDriver.run(RetryDriver.java:139)
            at com.facebook.presto.hive.s3.PrestoS3FileSystem$PrestoS3InputStream.openStream(PrestoS3FileSystem.java:1099)
    		...
    

    Error log after the PR:

    2023-01-04T16:09:56.528Z        WARN    20230104_160933_13352_s8vjw.2.0.0-0-105 io.delta.standalone.internal.DeltaLogImpl       Failed to parse s3a://data-warehouse/my_database/my_table/_delta_log/_last_checkpoint. This may happen if there was an error during read operation, or a file appears to be partial. Sleeping and trying again.
    java.io.UncheckedIOException: java.io.FileNotFoundException: File does not exist: s3a://data-warehouse/my_database/my_table/_delta_log/_last_checkpoint
            at io.delta.storage.LineCloseableIterator.hasNext(LineCloseableIterator.java:72)
            at io.delta.storage.LineCloseableIterator.next(LineCloseableIterator.java:78)
            at io.delta.storage.LineCloseableIterator.next(LineCloseableIterator.java:29)
    		...
    2023-01-04T16:09:57.532Z        WARN    20230104_160933_13352_s8vjw.2.0.0-0-105 io.delta.standalone.internal.DeltaLogImpl       s3a://data-warehouse/my_database/my_table/_delta_log/_last_checkpoint is corrupted. Will search the checkpoint files directly
    java.io.UncheckedIOException: java.io.FileNotFoundException: File does not exist: s3a://data-warehouse/my_database/my_table/_delta_log/_last_checkpoint
            at io.delta.storage.LineCloseableIterator.hasNext(LineCloseableIterator.java:72)
            at io.delta.storage.LineCloseableIterator.next(LineCloseableIterator.java:78)
            at io.delta.storage.LineCloseableIterator.next(LineCloseableIterator.java:29)
    		...
    
    opened by dnskr 1
  • [native] Fix coredump bug for VeloxQueryPlanConverter.

    When running TPC-DS query 12, like the SQL below:

    --q12.sql--
    
     select i_item_id,
      i_item_desc, i_category, i_class, i_current_price,
      sum(ws_ext_sales_price) as itemrevenue,
      sum(ws_ext_sales_price)*100/sum(sum(ws_ext_sales_price)) over
              (partition by i_class) as revenueratio
     from
    	web_sales, item, date_dim
     where
    	ws_item_sk = i_item_sk
      	and i_category in ('Sports', 'Books', 'Home')
      	and ws_sold_date_sk = d_date_sk
        and cast(d_date as varchar) between '1999-02-22' and '1999-03-24'
     group by
    	i_item_id, i_item_desc, i_category, i_class, i_current_price
     order by
    	i_category, i_class, i_item_id, i_item_desc, revenueratio
     LIMIT 100
                
    ;
    
    

    PrestoCpp will crash and generate a coredump whose stack trace looks like the one below:

    (gdb) where
    #0  std::vector<facebook::presto::protocol::Ordering, std::allocator<facebook::presto::protocol::Ordering> >::size (this=<optimized out>, this=<optimized out>) at /usr/include/c++/9/bits/stl_vector.h:94
    #1  std::vector<facebook::presto::protocol::Ordering, std::allocator<facebook::presto::protocol::Ordering> >::vector (__x=..., this=<optimized out>) at /usr/include/c++/9/bits/stl_vector.h:552
    #2  facebook::presto::VeloxQueryPlanConverter::toVeloxQueryPlan (this=0x7f4870e26ca0, node=std::shared_ptr<const facebook::presto::protocol::WindowNode> (use count 3, weak count 0) = {...},
        tableWriteInfo=std::shared_ptr<facebook::presto::protocol::TableWriteInfo> (use count 1, weak count 0) = {...}, taskId="20230104_062438_00004_na9h2.1.0.0")
        at ../presto_cpp/main/types/PrestoToVeloxQueryPlan.cpp:1858
    #3  0x0000555be3bdcef6 in facebook::presto::VeloxQueryPlanConverter::toVeloxQueryPlan (this=0x7f4870e26ca0, node=std::shared_ptr<const facebook::presto::protocol::PlanNode> (use count 3, weak count 0) = {...},
        tableWriteInfo=std::shared_ptr<facebook::presto::protocol::TableWriteInfo> (use count 1, weak count 0) = {...}, taskId="20230104_062438_00004_na9h2.1.0.0")
        at ../presto_cpp/main/types/PrestoToVeloxQueryPlan.cpp:1960
    #4  0x0000555be3be2411 in facebook::presto::VeloxQueryPlanConverter::toVeloxQueryPlan (this=0x7f4870e26ca0,
        node=std::shared_ptr<const facebook::presto::protocol::ProjectNode> (use count 3, weak count 0) = {...},
        tableWriteInfo=std::shared_ptr<facebook::presto::protocol::TableWriteInfo> (use count 1, weak count 0) = {...}, taskId="20230104_062438_00004_na9h2.1.0.0") at /usr/include/c++/9/ext/atomicity.h:96
    #5  0x0000555be3bdc656 in facebook::presto::VeloxQueryPlanConverter::toVeloxQueryPlan (this=0x7f4870e26ca0, node=std::shared_ptr<const facebook::presto::protocol::PlanNode> (use count 3, weak count 0) = {...},
        tableWriteInfo=std::shared_ptr<facebook::presto::protocol::TableWriteInfo> (use count 1, weak count 0) = {...}, taskId="20230104_062438_00004_na9h2.1.0.0")
        at ../presto_cpp/main/types/PrestoToVeloxQueryPlan.cpp:1899
    

    And the code at PrestoToVeloxQueryPlan.cpp:1858 will generate a coredump when the specification's orderingScheme is not initialized.


    Test plan - (Please fill in how you tested your changes)

    Existing tests, no new features are added.


    == NO RELEASE NOTE ==
    
    opened by wuxueyang96 0
  • fix(sec): upgrade org.testng:testng to

    What happened?

    There is 1 security vulnerability found in org.testng:testng 7.5.

    What did I do?

    Upgrade org.testng:testng from 7.5 to for vulnerability fix

    What did you expect to happen?

    Ideally, no insecure libs should be used.

    The specification of the pull request

    PR Specification from OSCS

    Signed-off-by:pen4[email protected]

    opened by pen4 0
  • Revert reduce hdfs config copies changes

    The changes

    seem to be causing issues, likely due to some lost config values, during our testing of the 277 branch. Other changes have also had to be made by other contributors to avoid this issue.

    So, I am proposing to revert all the above changes in this pull request.

    opened by ugurmeet 2