This is a standing PR to facilitate discussion on whether we should support native BLAS in SystemML. After discussion, and after resolving the deployment issues, we can decide whether to turn this feature on by default. Since I wanted feedback from the community before proceeding, I have not completed the PR. The remaining tasks are:
- Generalize to other BLAS, not just MKL. This would also involve completing the CMake file.
- Add other operations: conv2d_backward_*, etc.
I ran some preliminary performance experiments comparing conv2d with/without sparse support and caching, and with/without native BLAS. I provided a fairly large memory budget (`-Xmx20g -Xms20g -Xmn2048m -server`) and used the OpenJDK 1.8 64-Bit Server VM. The script below tests the performance of conv2d for four commonly used setups over 1000 iterations:
```
max_iterations = 1000
setup = $2
numFilters = -1
numChannels = -1
filterSize = -1
pad = -1
if(setup == 1) {
  numFilters = 20
  numChannels = 1
  filterSize = 5
  pad = 0
}
else if(setup == 2) {
  numFilters = 50
  numChannels = 20
  filterSize = 5
  pad = 0
}
else if(setup == 3) {
  numFilters = 20
  numChannels = 1
  filterSize = 3
  pad = 1
}
else if(setup == 4) {
  numFilters = 50
  numChannels = 20
  filterSize = 3
  pad = 1
}
else {
  stop('Incorrect setup (needs to be [1, 4]).')
}
imgSize = 28
n = 60000
X = rand(rows=n, cols=numChannels*imgSize*imgSize)
batch_size = 64
w = rand(rows=numFilters, cols=numChannels*filterSize*filterSize)
P = (imgSize + 2 * pad - filterSize) + 1
foo = matrix(0, rows=n, cols=numFilters*P*P)
for(iter in 1:max_iterations) {
  beg = (iter * batch_size) %% n + 1
  end = min(n, beg + batch_size)
  X_batch = X[beg:end, ]
  n_batch = nrow(X_batch)
  convOut_1 = conv2d(X_batch, w, input_shape=[n_batch,numChannels,imgSize,imgSize],
                     filter_shape=[numFilters,numChannels,filterSize,filterSize],
                     padding=[pad,pad], stride=[1,1])
  foo = convOut_1
}
print(sum(foo))
```
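To make the mini-batch selection in the loop above concrete, here is a small Python sketch of the index arithmetic (the helper name `batch_bounds` is mine, not part of the script; note that DML row slicing is 1-based and end-inclusive):

```python
# Sketch of the mini-batch index arithmetic used in the DML loop above.
# DML row indexing is 1-based and end-inclusive, so X[beg:end, ] selects
# end - beg + 1 rows.
n = 60000        # total rows in X
batch_size = 64

def batch_bounds(iter):
    """Mimics: beg = (iter * batch_size) %% n + 1; end = min(n, beg + batch_size)."""
    beg = (iter * batch_size) % n + 1
    end = min(n, beg + batch_size)
    return beg, end

for it in [1, 2, 937]:
    beg, end = batch_bounds(it)
    # The slice spans batch_size + 1 rows, except when truncated at row n.
    print(it, beg, end, end - beg + 1)
```

So the batches slide through X in a wrap-around fashion, with the last batch before the wrap truncated at row n.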
To compile the native SystemML library, please use:
```sh
export MKLROOT=/opt/intel/mkl
export JAVA_HOME=....
# Please go to https://software.intel.com/en-us/articles/intel-mkl-link-line-advisor to find LINKER_OPTIONS and COMPILER_OPTIONS
export LINKER_OPTIONS=" -L${MKLROOT}/lib/intel64 -Wl,--no-as-needed -lmkl_rt -lpthread -lm -ldl"
export COMPILER_OPTIONS=" -m64 -I${MKLROOT}/include"
g++ -shared -fPIC -o libsystemml.so systemml.cpp -I. -I$JAVA_HOME/include -I$JAVA_HOME/include/linux -lm -fopenmp -O3 $LINKER_OPTIONS $COMPILER_OPTIONS
```
Please see the results of the experiments below. Both sparse support and caching are disabled for the SystemML_native and SystemML_CP setups.
| Configuration | Number of iterations | Setup | Time in seconds |
|---------------|----------------------|-------|-----------------|
| SystemML_native | 1000 | 1 | 7.103096398 |
| SystemML_CP | 1000 | 1 | 6.498525426 |
| SystemML_CP_WithCacheNSparseEnabled | 1000 | 1 | 7.195620854 |
| Tensorflow | 1000 | 1 | 4.071731716 |
| SystemML_native | 1000 | 2 | 31.315343223 |
| SystemML_CP | 1000 | 2 | 81.769984552 |
| SystemML_CP_WithCacheNSparseEnabled | 1000 | 2 | 101.274622939 |
| Tensorflow | 1000 | 2 | 33.476548341 |
| SystemML_native | 1000 | 3 | 7.662274848 |
| SystemML_CP | 1000 | 3 | 6.355272119 |
| SystemML_CP_WithCacheNSparseEnabled | 1000 | 3 | 7.607337158 |
| Tensorflow | 1000 | 3 | 3.837932081 |
| SystemML_native | 1000 | 4 | 26.638438614 |
| SystemML_CP | 1000 | 4 | 49.716594505 |
| SystemML_CP_WithCacheNSparseEnabled | 1000 | 4 | 71.542244484 |
| Tensorflow | 1000 | 4 | 26.395180006 |
There are some additional overhead costs (such as initial compilation/validation, reuse of previously allocated but non-zeroed arrays, dynamic recompilation, GC, etc.) that we have not yet optimized. These costs are beyond the scope of this PR, and some of them are inherent to our design principles. We can work on them in a separate PR :)
@mboehm7 @bertholdreinwald @dusenberrymw @frreiss @prithvirajsen @fschueler @nakul02 @asurve @deroneriksson I understand the above experiments might not be sufficient to accept the change, and I would welcome your feedback on additional experiments/setups. I would also appreciate it if some of you are willing to help me with these experiments ;)
Here are the shapes of the matrix multiplications for the four setups:
- Setup 1: 64 parallel matrix multiplications of shape (20, 25) %*% (25, 576), executed 1000 times.
- Setup 2: 64 parallel matrix multiplications of shape (50, 500) %*% (500, 576), executed 1000 times.
- Setup 3: 64 parallel matrix multiplications of shape (20, 25) %*% (25, 784), executed 1000 times.
- Setup 4: 64 parallel matrix multiplications of shape (50, 500) %*% (500, 784), executed 1000 times.
I will provide an update soon comparing the results of the above matrix multiplications. If you are interested, here are the respective code paths for the matrix multiplications:
- CP: https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/runtime/matrix/data/LibMatrixDNN.java#L327
- Native: https://github.com/niketanpansare/incubator-systemml/blob/for_cpp/src/main/cpp/systemml.cpp#L163