Mirror of Apache SystemML

Overview

Apache SystemDS

Overview: SystemDS is a versatile system for the end-to-end data science lifecycle, from data integration, cleaning, and feature engineering, through efficient local and distributed ML model training, to deployment and serving. To this end, we aim to provide a stack of declarative languages with R-like syntax for (1) the different tasks of the data science lifecycle, and (2) users with different expertise. These high-level scripts are compiled into hybrid execution plans of local, in-memory CPU and GPU operations, as well as distributed operations on Apache Spark. In contrast to existing systems, which provide either homogeneous tensors or 2D datasets, and in order to serve the entire data science lifecycle, the underlying data model is the DataTensor, i.e., a tensor (multi-dimensional array) whose first dimension may have a heterogeneous and nested schema.

Quick Start: Install, Quick Start and Hello World

Documentation: SystemDS Documentation

Python Documentation: Python SystemDS Documentation

Issue Tracker: Jira Dashboard

Status and Build: SystemDS was renamed from SystemML, which is an Apache Top Level Project. To build from source, see SystemDS Install from source.

Comments
  • [SYSTEMML-769] [WIP] Adding support for native BLAS

    Based on the discussion with @frreiss and @bertholdreinwald, I am proposing to switch our default BLAS from Java-based BLAS to native BLAS. We will recommend Intel MKL and provide optional support for other BLAS libraries such as OpenBLAS (and possibly Accelerate). If no native BLAS is installed, we will fall back to Java-based BLAS. This future-proofs SystemML by letting it benefit from hardware improvements made in other BLAS libraries, and it also simplifies testing.
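
    The fallback order described above can be sketched in a few lines. This is only an illustration in Python, not the mechanism implemented in this PR: the candidate library names (`mkl_rt`, `openblas`) and the helper `choose_blas` are assumptions made for the example; `ctypes.util.find_library` is the only real API used.

```python
from ctypes.util import find_library

# Candidate native libraries in order of preference (names are assumptions).
CANDIDATES = ["mkl_rt", "openblas"]

def choose_blas(candidates=CANDIDATES, finder=find_library):
    """Return the first native BLAS that can be located, else 'java'."""
    for name in candidates:
        if finder(name) is not None:
            return name
    # No native library found: fall back to the Java-based implementation.
    return "java"
```

    Passing the lookup function as a parameter keeps the selection logic testable on machines where no native BLAS is installed.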

    The proposed solution in this jar will work:

    1. In distributed setting with no additional dependency other than BLAS.
    2. On hybrid cluster with different types of BLAS.
    3. With Parfor.
    4. Without an additional artifact: systemml-accelerator.jar (like https://github.com/apache/incubator-systemml/pull/307).

    Since we are not including any external dependency (such as BLAS), this PR adds no overhead to the release process. Also, once we add support for TSMM, SYSTEMML-1166 will be resolved, which again future-proofs SystemML against related issues.

    The initial performance numbers are the same as those of https://github.com/apache/incubator-systemml/pull/307

    @mboehm7 @dusenberrymw @nakul02 @lresende @deroneriksson @asurve @fschueler

    opened by niketanpansare 97
  • [SYSTEMML-294] Print matrix capability

    A matrix object can now be passed to the print statement.

    • More complex expressions such as print(X[1:10, 1:10]) do not work; the correct sequence of instructions is not being generated.
    • Java-style string concatenation does not work either: print(x + X) fails (x = scalar, X = matrix).

    I realize the matrix printing may not be the best format. Here is what it looks like for now:

    For this DML code:

    X = rand(rows=1000, cols=1000, min=0, max=4, pdf="uniform", sparsity=0.2)
    Z = X[1:10,1:10]
    print(Z)
    

    [Screenshot: printed output of Z]

    @deroneriksson, @niketanpansare, @mboehm7 - thoughts?

    opened by nakul02 53
  • [Scala Pipeline API] Add scala logisticRegression api for spark pipeline

    I wrote a Scala ML pipeline wrapper for the LogisticRegression model as an example for Scala users.
    I think the APIs (Java, Python, Scala) should be put into separate jars, but this time I just followed the existing layout.

    Regards. Wenpei.

    opened by Wenpei 52
  • [SYSTEMML-1451][Phase 2] Decouple Scripts and HDFS support

    Please refer to https://issues.apache.org/jira/browse/SYSTEMML-1451 for more details.

    • [x] Decouple systemml-spark-submit.py
    • [x] Decouple systemml-standalone.py
    • [x] Refactor perf test suite to accept args like debug, stats, config, etc.
    • [x] Add HDFS support
    • [x] Google Docs support
    • [x] Compare SystemML with previous versions
    • [x] Pylint, Comment
    • [x] Extra arguments configuration Test
    • [x] Windows Test
    • [x] Doc update
    • [x] systemml standalone comments
    • [x] systemml spark submit comments
    opened by krishnakalyan3 33
  • [SYSTEMML-1451][Phase 1] Automate performance suite and report performance numbers

    Please refer to https://issues.apache.org/jira/browse/SYSTEMML-1451 for more details.

    Phase 1:

    • [x] Generate Data
    • [x] Test all algorithms with singlenode
    • [x] Test all algorithms spark-hybrid execution mode
    • [x] Capture Time Taken
    • [x] Generate a full set of plain text reports
    • [x] Test automatic benchmark end to end

    Error Handling and Reporting:

    • [x] Test End to End in standalone
    • [x] Test End to End in spark_hybrid
    • [x] If data already exists do not generate the data again
    • [x] Fix time to be taken from stdout
    • [x] Execution function to return failure in case the job fails
    • [x] If data not present do not execute train or predict. (Minor)
    • [x] Remove unused imports
    • [x] Log stdout and stderr
    • [x] Proper Reporting of Metrics (Minor)
    • [x] Add current time to log
    • [x] User Guide (This will be a separate google doc file)

    Current status of family

    • [x] Clustering
    • [x] Binomial
    • [x] Multinomial
    • [x] Regression1
    • [x] Regression2
    • [x] Stats1
    • [x] Stats2

    To test this script please navigate to the gist below

    https://gist.github.com/krishnakalyan3/26f5578b7b342bd4e14d986a9889a42e
    

    Local Machine Configuration

    Operating System: OSX 10.2
    Ram: 16GB DDR 3 @ 1600 MHz
    Speed: 2.5 GHz
    Processor: Intel Core i5
    

    Standalone Configuration

    JVM Memory Settings: -Xmx8g -Xms4g -Xmn1g
    

    Spark Configuration

    Number of Executors: 2
    Memory Size (Driver): 5g
    Memory of Executor: 2g
    Executor Cores: 1
    Spark Master Threads: 4
    

    Performance Test Conducted on the following configs with all families (This includes all algorithms).

    data size: 10k_100
    execution mode: standalone
    matrix_type: dense, sparse
    Log: https://gist.github.com/krishnakalyan3/a07d404d7192261691584123fd69140a
    (More than 5 hours)
    
    data size: 10k_100
    execution mode: hybrid_spark
    matrix_type: dense, sparse
    Log: https://gist.github.com/krishnakalyan3/213bcca9792addee62e8a4177fbba996
    (Less than one hour)
    
    
    opened by krishnakalyan3 33
  • [SYSTEMML-451] Python embedded DSL

    To make progress on the adoption front, I think we need to support an embedded Python DSL, so I have created this PR as an initial proposal. I have added a subset of builtin functions (namely, full/rand.*/mlprint/dot), binary/unary operators for matrices, and parfor as an example of a language-level construct.

    Please use this PR to facilitate further discussion on this topic. Here are some initial points:

    1. What is the preferred usage scenario for a Python data scientist?
       a. Scikit-learn-like library. In this case, we can call the DML algorithms using MLContext.
       b. External DSL approach: the data scientist writes her code in PyDML and then uses MLContext.
       c. Embedded DSL approach.
    2. Even in the embedded DSL approach, there are two possible implementation choices:
       a. Execute code with a context (as in this PR). Pros: simple and elegant push-down mechanism and no redundant computation. Cons: difficult to implement mixed code.
       b. Add lazy data structures which will be executed only when certain actions are invoked on them.
    3. Push-down of language-level constructs (will skip this in this commit):
       a. No modification to the Python implementation, but slightly inelegant usage through functions (as in this PR). See the parfor invocation in the code below.
       b. Modify the Python parser to support push-down. This requires users to rebuild our version of Python from scratch.
    4. Whether to add standalone support, as it requires reimplementing the Py4J bridge of PySpark.
    5. APIs of the built-in functions: should they be similar to PyDML?

    If, after the discussion, we decide to go in a different direction, I will delete this PR.

    If we agree to go in this direction, I can create the following tasks on JIRA:

    1. Make Python embedded DSL feature complete.
    2. Add documentation and examples for Python embedded DSL.
    3. Create .py files for all our algorithms using Python embedded DSL. We can keep interface similar to scikit-learn.
    4. Add additional input/output mechanism (other than DataFrame), for example: binary blocked RDD, string rdd, MLLib's BlockMatrix, NumPy arrays, etc.
    5. Add Py4J support to allow standalone usage and create pip installer for systemml.

    Please note: I am using the older MLContext approach, and we can update it to the newer MLContext once it is delivered.

    An example script supported by this commit:

    # wget https://sparktc.ibmcloud.com/repo/latest/systemml-0.11.0-incubating-SNAPSHOT.jar
    # pyspark --master local[*] --driver-class-path systemml-0.11.0-incubating-SNAPSHOT.jar
    
    >>> import SystemML as sml
    >>> import numpy as np
    >>> sml.setSparkContext(sc)
    
    Welcome to Apache SystemML!
    
    >>> m1 = sml.matrix(np.ones((3,3)) + 2)
    >>> m2 = sml.matrix(np.ones((3,3)) + 3)
    >>> m2 = m1 * (m2 + m1)
    >>> m4 = 1.0 - m2
    >>> m4
    # This matrix (mVar5) is backed by below given PyDML script (which is not yet evaluated). To fetch the data of this matrix, invoke toNumPyArray() or toDataFrame() or toPandas() methods.
    mVar1 = load(" ", format="csv")
    mVar2 = load(" ", format="csv")
    mVar3 = mVar2 + mVar1
    mVar4 = mVar1 * mVar3
    mVar5 = 1.0 - mVar4
    save(mVar5, " ")
    
    <SystemML.defmatrix.matrix object>
    >>> m2.eval()
    >>> m2
    # This matrix (mVar4) is backed by NumPy array. To fetch the NumPy array, invoke toNumPyArray() method.
    <SystemML.defmatrix.matrix object>
    >>> m4
    # This matrix (mVar5) is backed by below given PyDML script (which is not yet evaluated). To fetch the data of this matrix, invoke toNumPyArray() or toDataFrame() or toPandas() methods.
    mVar4 = load(" ", format="csv")
    mVar5 = 1.0 - mVar4
    save(mVar5, " ")
    
    <SystemML.defmatrix.matrix object>
    >>> m4.sum(axis=1).toNumPyArray()
    array([[-60.],
           [-60.],
           [-60.]])
    >>>
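
    The deferred evaluation visible in the session above (operations are recorded and only executed when eval() or toNumPyArray() is called) can be illustrated with a small stand-alone sketch. It is not the code in this PR: `LazyMatrix` and its methods are hypothetical names, plain nested lists stand in for NumPy arrays, and no PyDML script is generated.

```python
import operator

def _elementwise(op, a, b):
    """Apply op elementwise; either operand may be a scalar or a list of rows."""
    if isinstance(a, list) and isinstance(b, list):
        return [[op(x, y) for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]
    if isinstance(a, list):
        return [[op(x, b) for x in ra] for ra in a]
    if isinstance(b, list):
        return [[op(a, y) for y in rb] for rb in b]
    return op(a, b)

class LazyMatrix:
    """A matrix whose value is computed only when eval() is called."""

    def __init__(self, compute, description):
        self._compute = compute          # zero-argument function producing the value
        self._description = description  # readable form of the recorded expression
        self._value = None               # cached result after evaluation

    @classmethod
    def constant(cls, value, name):
        m = cls(lambda: value, name)
        m._value = value                 # constants are already materialized
        return m

    def _binary(self, other, op, symbol, reflected=False):
        if not isinstance(other, LazyMatrix):
            other = LazyMatrix.constant(other, repr(other))
        left, right = (other, self) if reflected else (self, other)
        return LazyMatrix(
            lambda: _elementwise(op, left.eval(), right.eval()),
            f"({left._description} {symbol} {right._description})",
        )

    def __add__(self, other): return self._binary(other, operator.add, "+")
    def __mul__(self, other): return self._binary(other, operator.mul, "*")
    def __sub__(self, other): return self._binary(other, operator.sub, "-")
    def __rsub__(self, other): return self._binary(other, operator.sub, "-", reflected=True)

    @property
    def is_evaluated(self):
        return self._value is not None

    def eval(self):
        if self._value is None:
            self._value = self._compute()
        return self._value

    def __repr__(self):
        state = "evaluated" if self.is_evaluated else "not yet evaluated"
        return f"<LazyMatrix {self._description} ({state})>"

# Same arithmetic as the session above: 3x3 matrices filled with 3 and 4.
m1 = LazyMatrix.constant([[3.0] * 3 for _ in range(3)], "m1")
m2 = LazyMatrix.constant([[4.0] * 3 for _ in range(3)], "m2")
m2 = m1 * (m2 + m1)
m4 = 1.0 - m2
print(m4)                                   # shows the expression; nothing computed yet
row_sums = [sum(row) for row in m4.eval()]  # evaluation happens here
print(row_sums)                             # [-60.0, -60.0, -60.0]
```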
    

    @mboehm7 @frreiss @bertholdreinwald @nakul02 @dusenberrymw @deroneriksson

    opened by niketanpansare 33
  • [SYSTEMML-540] Extended Caffe2DML to support image segmentation problems

    • This PR only extends Caffe2DML to support image segmentation problems.
    • I will add support for depthwise deconvolution (when number of filters = number of groups) in a separate PR.
    • I have also added a couple of bug fixes for loading a Caffe model using Caffe2DML (for the case when Caffe is not installed).
    • This PR also fixes a few bugs related to loading a caffemodel.
    • Additionally, I have added a summary() method to Caffe2DML to print the network:
    >>> lenet.summary()
    +-----+---------------+--------------+------------+---------+-----------+---------+
    | Name|           Type|        Output|      Weight|     Bias|        Top|   Bottom|
    +-----+---------------+--------------+------------+---------+-----------+---------+
    |mnist|           Data| (, 1, 28, 28)|            |         |mnist,mnist|         |
    |conv1|    Convolution|(, 32, 28, 28)|   [32 X 25]| [32 X 1]|      conv1|    mnist|
    |relu1|           ReLU|(, 32, 28, 28)|            |         |      relu1|    conv1|
    |pool1|        Pooling|(, 32, 14, 14)|            |         |      pool1|    relu1|
    |conv2|    Convolution|(, 64, 14, 14)|  [64 X 800]| [64 X 1]|      conv2|    pool1|
    |relu2|           ReLU|(, 64, 14, 14)|            |         |      relu2|    conv2|
    |pool2|        Pooling|  (, 64, 7, 7)|            |         |      pool2|    relu2|
    |  ip1|   InnerProduct| (, 512, 1, 1)|[3136 X 512]|[1 X 512]|        ip1|    pool2|
    |relu3|           ReLU| (, 512, 1, 1)|            |         |      relu3|      ip1|
    |drop1|        Dropout| (, 512, 1, 1)|            |         |      drop1|    relu3|
    |  ip2|   InnerProduct|  (, 10, 1, 1)|  [512 X 10]| [1 X 10]|        ip2|    drop1|
    | loss|SoftmaxWithLoss|  (, 10, 1, 1)|            |         |       loss|ip2,mnist|
    +-----+---------------+--------------+------------+---------+-----------+---------+
    
    

    @prithvirajsen @bertholdreinwald Can you please review this PR ?

    opened by niketanpansare 31
  • [SYSTEMML-1437][WIP] factorization machines

    Factorization machines (FMs) are general predictors that can capture interactions between all features in a feature matrix. The feature matrices pertinent to recommendation systems are highly sparse. SystemML's highly efficient distributed sparse matrix operations can be leveraged to implement FMs in a scalable fashion.

    Formulae:

    • [x] 1. core fm module
    • [x] 2. regression
    • [x] 3. binary classification
    • [ ] 4. Ranking
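
    For orientation, the standard second-order FM model scores a feature vector x as w0 + sum_i w_i x_i + sum_{i<j} <v_i, v_j> x_i x_j, and the pairwise term can be computed in time linear in the number of features. The sketch below shows that computation in plain Python under this standard formulation; it is not the DML code of this PR, and the names `fm_predict`, `w0`, `w`, and `V` are chosen for the example.

```python
def fm_predict(x, w0, w, V):
    """Second-order factorization machine score for one feature vector.

    x: list of n feature values, w0: bias, w: list of n linear weights,
    V: n rows of k latent factors (one row per feature).
    """
    n, k = len(x), len(V[0])
    linear = sum(wi * xi for wi, xi in zip(w, x))
    # Pairwise term via the identity
    #   sum_{i<j} <v_i, v_j> x_i x_j
    #     = 0.5 * sum_f [ (sum_i v_if x_i)^2 - sum_i (v_if x_i)^2 ]
    pairwise = 0.0
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(n))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(n))
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise
```

    Because every sum runs over the entries of x, zero entries contribute nothing, which is why the model suits the sparse feature matrices mentioned above.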
    opened by j143-zz 28
  • [WIP][SYSTEMML-1216] implement local svd( ) function

    Please merge this PR for the local svd() function. Implementing distributed svd may take some time (I will open that in a separate PR), and by then my branch may become stale. Thanks.

    @niketanpansare @mboehm7

    opened by j143-zz 25
  • [SYSTEMML-1563] Adding a distributed synchronous SGD MNIST LeNet example.

    In order to exploit a multi-GPU setup, this implements synchronous distributed SGD. More specifically, this (1) sets the degree of parallelism (hardcoded for now), (2) adjusts the inner for loop (previously for iterations) to extract out groups of mini-batches, where the number of mini-batches is equal to the degree of parallelism, (3) adds an inner parfor to parallelize the computation of gradients over a group of mini-batches (i.e. one mini-batch per GPU) and store the gradients, and (4) updates the model after the parfor loop using the aggregated (averaged) gradients.
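
    The four steps above can be illustrated with a small stand-alone sketch. It is not the DML script from this PR: it trains a linear least-squares model instead of LeNet, an ordinary Python loop stands in for the parfor, and the names `sync_sgd`, `parallelism`, and `lr` are assumptions made for the example.

```python
def gradient(w, batch):
    """Gradient of 0.5 * mean((w.x - y)^2) over a mini-batch of (x, y) pairs."""
    g = [0.0] * len(w)
    for x, y in batch:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for j, xj in enumerate(x):
            g[j] += err * xj / len(batch)
    return g

def sync_sgd(data, dim, batch_size=4, parallelism=2, lr=0.1, epochs=200):
    w = [0.0] * dim
    group_size = batch_size * parallelism              # (1) degree of parallelism
    for _ in range(epochs):
        for start in range(0, len(data), group_size):  # (2) one group of mini-batches
            group = data[start:start + group_size]
            batches = [group[b:b + batch_size]
                       for b in range(0, len(group), batch_size)]
            # (3) one gradient per mini-batch; this loop stands in for the parfor
            grads = [gradient(w, batch) for batch in batches]
            # (4) synchronous update with the averaged gradient
            avg = [sum(g[j] for g in grads) / len(grads) for j in range(dim)]
            w = [wj - lr * aj for wj, aj in zip(w, avg)]
    return w
```

    Because all gradients in a group are computed from the same model state and averaged before the update, the result matches a single step on a mini-batch that is `parallelism` times larger.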

    To run, follow the instructions here.

    Reference:

    • https://arxiv.org/abs/1604.00981
    opened by dusenberrymw 24
  • [SYSTEMML-769] Support for native BLAS and simplify deployment for GPU backend

    This is a standing PR to facilitate discussion on whether we should support native BLAS in SystemML. After discussion and after resolving the issues with deployment, we can decide whether to turn on this feature by default. Since I wanted feedback from the community before proceeding, I did not complete the PR. The remaining tasks are:

    • Generalize to other BLAS, not just MKL. This would also involve completing the CMake file.
    • Add other operations: conv2d_backward_*, etc.

    I ran some preliminary performance experiments comparing conv2d with/without sparse+caching and with/without native BLAS. I provided a fairly large memory budget (-Xmx20g -Xms20g -Xmn2048m -server) and used OpenJDK 1.8 64-Bit Server VM. The script tested the performance of conv2d using four commonly used setups for 1000 iterations:

    max_iterations = 1000
    setup = $2
    numFilters = -1
    numChannels = -1
    filterSize = -1
    pad = -1
    if(setup == 1) {
            numFilters = 20
            numChannels = 1
            filterSize = 5
            pad = 0
    }
    else if(setup == 2) {
            numFilters = 50
            numChannels = 20
            filterSize = 5
            pad = 0
    }
    else if(setup == 3) {
            numFilters = 20
            numChannels = 1
            filterSize = 3
            pad = 1
    }
    else if(setup == 4) {
            numFilters = 50
            numChannels = 20
            filterSize = 3
            pad = 1
    }
    else {
            stop('Incorrect setup (needs to be [1, 4]).')
    }
    imgSize = 28
    n = 60000
    X = rand(rows=n, cols=numChannels*imgSize*imgSize)
    batch_size = 64
    w = rand(rows=numFilters, cols=numChannels*filterSize*filterSize)
    P = (imgSize + 2 * pad - filterSize)  + 1
    foo = matrix(0, rows=n, cols=numFilters*P*P)
    for(iter in 1:max_iterations) {
            beg = (iter * batch_size) %% n + 1
            end = min(n, beg + batch_size)
            X_batch = X[beg:end, ]
            n_batch = nrow(X_batch)
            convOut_1 = conv2d(X_batch, w, input_shape=[n_batch,numChannels,imgSize,imgSize], filter_shape=[numFilters,numChannels,filterSize,filterSize], padding=[pad,pad], stride=[1,1])
            foo = convOut_1
    }
    print(sum(foo))
    

    To compile the native SystemML library, please use:

    export MKLROOT=/opt/intel/mkl
    export JAVA_HOME=....
    # Please go to https://software.intel.com/en-us/articles/intel-mkl-link-line-advisor to find LINKER_OPTIONS and COMPILER_OPTIONS
    export LINKER_OPTIONS=" -L${MKLROOT}/lib/intel64 -Wl,--no-as-needed -lmkl_rt -lpthread -lm -ldl"
    export COMPILER_OPTIONS=" -m64 -I${MKLROOT}/include"
    g++ -shared -fPIC -o libsystemml.so systemml.cpp -I. -I$JAVA_HOME/include -I$JAVA_HOME/include/linux -lm -fopenmp -O3 $LINKER_OPTIONS $COMPILER_OPTIONS 
    

    Please see below the results of the experiments. Both sparse and caching are disabled for the setup SystemML_native and SystemML_CP.

    ,Number of Iterations, Setup, Time in seconds
    SystemML_native,1000,1,7.103096398
    SystemML_CP,1000,1,6.498525426
    SystemML_CP_WithCacheNSparseEnabled,1000,1,7.195620854
    Tensorflow,1000,1,4.071731716
    SystemML_native,1000,2,31.315343223
    SystemML_CP,1000,2,81.769984552
    SystemML_CP_WithCacheNSparseEnabled,1000,2,101.274622939
    Tensorflow,1000,2,33.476548341
    SystemML_native,1000,3,7.662274848
    SystemML_CP,1000,3,6.355272119
    SystemML_CP_WithCacheNSparseEnabled,1000,3,7.607337158
    Tensorflow,1000,3,3.837932081
    SystemML_native,1000,4,26.638438614
    SystemML_CP,1000,4,49.716594505
    SystemML_CP_WithCacheNSparseEnabled,1000,4,71.542244484
    Tensorflow,1000,4,26.395180006
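
    To make the comparison easier to read, the ratio of SystemML_CP time to SystemML_native time can be computed per setup from the rows above. The snippet is only a reading aid written for this summary (the helper name `speedups` is made up); it shows native BLAS ahead by roughly 2.6x and 1.9x on setups 2 and 4, and slightly behind on the small setups 1 and 3.

```python
import csv
from io import StringIO

# Rows copied from the results above (Tensorflow and the cache/sparse
# variant are left out because only native vs. CP is compared here).
RESULTS = """\
SystemML_native,1000,1,7.103096398
SystemML_CP,1000,1,6.498525426
SystemML_native,1000,2,31.315343223
SystemML_CP,1000,2,81.769984552
SystemML_native,1000,3,7.662274848
SystemML_CP,1000,3,6.355272119
SystemML_native,1000,4,26.638438614
SystemML_CP,1000,4,49.716594505
"""

def speedups(text):
    """Map each setup to time(SystemML_CP) / time(SystemML_native)."""
    times = {}
    for system, _iterations, setup, seconds in csv.reader(StringIO(text)):
        times[(system, int(setup))] = float(seconds)
    setups = sorted({setup for _system, setup in times})
    return {s: times[("SystemML_CP", s)] / times[("SystemML_native", s)]
            for s in setups}

for setup, ratio in speedups(RESULTS).items():
    print(f"setup {setup}: CP time / native time = {ratio:.2f}")
```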
    

    There are some additional overhead costs (such as initial compilation/validation, reuse of previously allocated but non-zeroed arrays, dynamic recompilation, GC, etc.) which we have not yet optimized. These costs are beyond the scope of this PR and some of them are inherent to our design principles. We can work on them in a separate PR :)

    @mboehm7 @bertholdreinwald @dusenberrymw @frreiss @prithvirajsen @fschueler @nakul02 @asurve @deroneriksson I understand the above experiments might not be sufficient to accept the change and would welcome your feedback on additional experiments/setups. I would also appreciate it if some of you are willing to help me with these experiments ;)

    Here are the shapes of the matrix multiplications for the four setups:

    Setup 1:
    64 parallel matrix multiplications of shape (20, 25) %*% (25, 576) executed 1000 times.

    Setup 2:
    64 parallel matrix multiplications of shape (50, 500) %*% (500, 576) executed 1000 times.

    Setup 3:
    64 parallel matrix multiplications of shape (20, 9) %*% (9, 784) executed 1000 times.

    Setup 4:
    64 parallel matrix multiplications of shape (50, 180) %*% (180, 784) executed 1000 times.
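
    These shapes can be derived from the parameters in the benchmark script, assuming each image's convolution is lowered to a single im2col-style matrix multiplication: the filter matrix is numFilters x (numChannels * filterSize^2) and the im2col matrix is (numChannels * filterSize^2) x (P * P). For the 3x3 setups the inner dimension is therefore numChannels * 9, i.e. 9 and 180. The helper name `conv2d_matmul_shapes` is hypothetical.

```python
def conv2d_matmul_shapes(num_filters, num_channels, filter_size, pad,
                         img_size=28, stride=1):
    """Return (filter_matrix_shape, im2col_matrix_shape) for one image."""
    out = (img_size + 2 * pad - filter_size) // stride + 1  # P in the script
    k = num_channels * filter_size * filter_size            # shared inner dimension
    return (num_filters, k), (k, out * out)

# Parameters of the four setups, copied from the benchmark script.
SETUPS = {
    1: dict(num_filters=20, num_channels=1, filter_size=5, pad=0),
    2: dict(num_filters=50, num_channels=20, filter_size=5, pad=0),
    3: dict(num_filters=20, num_channels=1, filter_size=3, pad=1),
    4: dict(num_filters=50, num_channels=20, filter_size=3, pad=1),
}

for setup, params in SETUPS.items():
    lhs, rhs = conv2d_matmul_shapes(**params)
    print(f"Setup {setup}: {lhs} %*% {rhs}")
```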
    

    I will provide an update soon comparing the results of the above matrix multiplications. If you are interested, here are the respective code paths for the matrix multiplications:

    • CP: https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/runtime/matrix/data/LibMatrixDNN.java#L327

    • Native: https://github.com/niketanpansare/incubator-systemml/blob/for_cpp/src/main/cpp/systemml.cpp#L163

    opened by niketanpansare 24
  • [SYSTEMDS-3481] FrameFromMatrix Improvements

    This commit makes the conversion from MatrixBlock to frame more efficient. The previous implementation already had cache-friendly blocking and allocation for the direct double-to-double conversion; this PR simply adds the same support for conversions into other types, such as boolean.

    Converting a 64k x 2k matrix to a boolean frame:

    After:

    22/12/21 19:56:11 ERROR frame.FrameFromMatrixBlockTest: 1055.994364
    22/12/21 19:56:12 ERROR frame.FrameFromMatrixBlockTest: 1039.756463
    22/12/21 19:56:13 ERROR frame.FrameFromMatrixBlockTest: 946.029085
    22/12/21 19:56:14 ERROR frame.FrameFromMatrixBlockTest: 928.161053
    22/12/21 19:56:15 ERROR frame.FrameFromMatrixBlockTest: 943.132151
    22/12/21 19:56:16 ERROR frame.FrameFromMatrixBlockTest: 950.212744
    22/12/21 19:56:17 ERROR frame.FrameFromMatrixBlockTest: 964.515222
    22/12/21 19:56:17 ERROR frame.FrameFromMatrixBlockTest: 966.944032
    22/12/21 19:56:18 ERROR frame.FrameFromMatrixBlockTest: 965.85695
    22/12/21 19:56:19 ERROR frame.FrameFromMatrixBlockTest: 956.783357

    Before:

    22/12/21 19:59:56 ERROR frame.FrameFromMatrixBlockTest: 2199.846241
    22/12/21 19:59:58 ERROR frame.FrameFromMatrixBlockTest: 2373.381971
    22/12/21 20:00:01 ERROR frame.FrameFromMatrixBlockTest: 2270.362306
    22/12/21 20:00:03 ERROR frame.FrameFromMatrixBlockTest: 2324.07255
    22/12/21 20:00:05 ERROR frame.FrameFromMatrixBlockTest: 2294.39046
    22/12/21 20:00:08 ERROR frame.FrameFromMatrixBlockTest: 2284.978142
    22/12/21 20:00:10 ERROR frame.FrameFromMatrixBlockTest: 2295.71655
    22/12/21 20:00:12 ERROR frame.FrameFromMatrixBlockTest: 2297.712022
    22/12/21 20:00:14 ERROR frame.FrameFromMatrixBlockTest: 2311.518135
    22/12/21 20:00:17 ERROR frame.FrameFromMatrixBlockTest: 2467.055097

    opened by Baunsgaard 3
  • [SYSTEMDS-3468] Adding the outline for the survey of builtin functions

    This patch proposes a comparison of builtin functions based on a grouping by ML lifecycle phase and ML task. This patch presents only the basic outline; subsequent patches will do the actual comparison.

    opened by BACtaki 2
  • [SYSTEMDS-3421] Adds missing read/writeExternal to ColumnEncoderUDF.

    The read/writeExternal functions for ColumnEncoderUDF were missing. This also adds the necessary switch case in the EncoderFactory and a test that currently does nothing.

    Closes #1681

    opened by Baunsgaard 0
  • [GIO] New Implementation of IOGEN

    This PR is a new implementation of GIO (generating readers for custom datasets). The new implementation removes all hard-coded implementations for flat datasets and replaces them with code generation. One of the primary goals of GIO is to support single- and multi-row representations of tuples in source datasets, and this PR supports both. An outline of the PR:

    • It supports Matrix and Frame.
    • It supports nested and flat datasets.
    • It supports non-standard datasets, i.e., an incomplete number of columns in a CSV file.

    opened by fathollahzadeh 0
  • [SYSTEMDS-3375] CUDA 11.x / CUDNN 8.x support

    This changeset brings CUDA11 and CUDNN8 support to SystemDS. All required API changes to successfully compile & run have been applied. A few library functions are still on the ToDo list. They continue to work under a deprecation notice for the time being. Tested on CUDA 11.6.1 and CUDNN 8.4

    opened by corepointer 2
Releases(3.0.0-rc2)
  • 3.0.0-rc2(Jun 26, 2022)

    What's Changed

    • [SYSTEMDS-3193] Promote svn repo dev/systemds/2.x-rc to release/systemds/2.x by @j143 in https://github.com/apache/systemds/pull/1427
    • [SYSTEMDS-3190] Do not run workflow for empty checks by @j143 in https://github.com/apache/systemds/pull/1426
    • [SYSTEMDS-2599] Development branch name change to main by @j143 in https://github.com/apache/systemds/pull/1429
    • [MINOR] Update java language level by @Baunsgaard in https://github.com/apache/systemds/pull/1430
    • [MINOR] Update docs to 2.3.0-SNAPSHOT by @j143 in https://github.com/apache/systemds/pull/1433
    • [SYSTEMDS-3196] Update org.apache parent pom version to 24 by @j143 in https://github.com/apache/systemds/pull/1436
    • [SYSTEMDS-3212] Move Prefetch threadpool to CommonThreadPool by @phaniarnab in https://github.com/apache/systemds/pull/1450
    • [MINOR] Upgrade junit version to 4.13.2 by @j143 in https://github.com/apache/systemds/pull/1451
    • [MINOR] Silence mvn package download info by @j143 in https://github.com/apache/systemds/pull/1465
    • [MINOR] Fix imports in BuiltinUnaryGPUInstructionTest by @j143 in https://github.com/apache/systemds/pull/1467
    • [MINOR][PYTHON-DOC] Update Multiple Federated Environments Example Result by @fathollahzadeh in https://github.com/apache/systemds/pull/1469
    • [SYSTEMDS-3236] Cache-friendly Apply phase for sparse target matrix by @phaniarnab in https://github.com/apache/systemds/pull/1473
    • [SYSTEMDS-3239] Train and predict in different Contexts Python by @Baunsgaard in https://github.com/apache/systemds/pull/1474
    • [SYSTEMDS-3240] Fix IOGEN test path if the iogen directory doesn't exist by @fathollahzadeh in https://github.com/apache/systemds/pull/1475
    • [SYSTEMDS-3187] Add documentation for the release scripts by @j143 in https://github.com/apache/systemds/pull/1478
    • [MINOR] Cleaning Pipelines cleanups by @Shafaq-Siddiqi in https://github.com/apache/systemds/pull/1493
    • Action for building docker images automatically by @j143 in https://github.com/apache/systemds/pull/1441
    • [SYSTEMDS-3268] Publish Docker images on schedule by @j143 in https://github.com/apache/systemds/pull/1500
    • [SYSTEMDS-3270][DOCS] Docker usage documentation by @j143 in https://github.com/apache/systemds/pull/1501
    • [SYSTEMDS-3271] Publish docker test image when requested by @Baunsgaard in https://github.com/apache/systemds/pull/1502
    • [SYSTEMDS-3273] Federated Timeout by @Baunsgaard in https://github.com/apache/systemds/pull/1508
    • Bump actions/setup-python from 1 to 2.3.1 by @dependabot in https://github.com/apache/systemds/pull/1512
    • [MINOR][DOC] xgboost function y parameter correct usage by @j143 in https://github.com/apache/systemds/pull/1532
    • Bump actions/setup-python from 2.3.1 to 2.3.2 by @dependabot in https://github.com/apache/systemds/pull/1538
    • [MINOR] Refactoring input and output parameters of dbscanApply and db… by @Shafaq-Siddiqi in https://github.com/apache/systemds/pull/1541
    • [SYSTEMDS-3288] CLA SDC isolated DefaultTuple by @Baunsgaard in https://github.com/apache/systemds/pull/1533
    • [SYSTEMDS-3184] Builtin for computing information gain using entropy and gini by @morf1us in https://github.com/apache/systemds/pull/1520
    • [DOCS] Add missing data files in docs by @j143 in https://github.com/apache/systemds/pull/1548
    • Bump actions/setup-python from 2.3.2 to 3 by @dependabot in https://github.com/apache/systemds/pull/1552
    • Bump actions/checkout from 2 to 3 by @dependabot in https://github.com/apache/systemds/pull/1553
    • [MINOR] Performance improvements in cleaning pipelines by @Shafaq-Siddiqi in https://github.com/apache/systemds/pull/1556
    • Bump actions/cache from 2 to 3 by @dependabot in https://github.com/apache/systemds/pull/1570
    • [Cleaning Pipelines] MINOR improvements in allocation of resources by @Shafaq-Siddiqi in https://github.com/apache/systemds/pull/1576
    • keep release scripts concise by @j143 in https://github.com/apache/systemds/pull/1549
    • Bump actions/setup-java from 2 to 3 by @dependabot in https://github.com/apache/systemds/pull/1582
    • Create CITATION.cff by @j143 in https://github.com/apache/systemds/pull/1449
    • [SYSTEMDS-3345] Add support for as.boolean(X) by @Baunsgaard in https://github.com/apache/systemds/pull/1579
    • [SYSTEMDS-3022] Avoid cudaMemset() where possible by @corepointer in https://github.com/apache/systemds/pull/1572
    • Bump jackson-databind from 2.12.3 to 2.12.6.1 by @dependabot in https://github.com/apache/systemds/pull/1577
    • [MINOR] Github actions use Java OpenJDK adopt-hotspot by @Baunsgaard in https://github.com/apache/systemds/pull/1598
    • Bump docker/login-action from 1 to 2 by @dependabot in https://github.com/apache/systemds/pull/1606
    • Bump docker/build-push-action from 2 to 3 by @dependabot in https://github.com/apache/systemds/pull/1605
    • Bump docker/setup-buildx-action from 1 to 2 by @dependabot in https://github.com/apache/systemds/pull/1604
    • Bump docker/metadata-action from 3 to 4 by @dependabot in https://github.com/apache/systemds/pull/1607
    • generate proto file by @j143 in https://github.com/apache/systemds/pull/1620
    • [MINOR] Add protobuf parameter for consistency by @j143 in https://github.com/apache/systemds/pull/1621
    • Bump actions/setup-python from 3 to 4 by @dependabot in https://github.com/apache/systemds/pull/1633
    • [MINOR] Minor fixes i.e., validation checks, formatting e.t.c. by @Shafaq-Siddiqi in https://github.com/apache/systemds/pull/1636

    New Contributors

    • @fathollahzadeh made their first contribution in https://github.com/apache/systemds/pull/1469
    • @morf1us made their first contribution in https://github.com/apache/systemds/pull/1520

    Full Changelog: https://github.com/apache/systemds/compare/2.2.0-rc1...3.0.0-rc2

  • 2.2.2-rc1(Jun 26, 2022)

  • 2.2.1-rc3(Dec 7, 2021)

  • 2.2.0-rc1(Nov 2, 2021)

  • 2.1.0-rc3(Jul 6, 2021)

Owner
The Apache Software Foundation