Statistical Machine Intelligence & Learning Engine

Overview

Join the chat at https://gitter.im/haifengl/smile.

Smile (Statistical Machine Intelligence and Learning Engine) is a fast and comprehensive machine learning, NLP, linear algebra, graph, interpolation, and visualization system in Java and Scala. With advanced data structures and algorithms, Smile delivers state-of-the-art performance. Smile is well documented; please check out the project website for programming guides and more information.

Smile covers every aspect of machine learning, including classification, regression, clustering, association rule mining, feature selection, manifold learning, multidimensional scaling, genetic algorithms, missing value imputation, efficient nearest neighbor search, etc.

Smile implements the following major machine learning algorithms:

  • Classification: Support Vector Machines, Decision Trees, AdaBoost, Gradient Boosting, Random Forest, Logistic Regression, Neural Networks, RBF Networks, Maximum Entropy Classifier, KNN, Naïve Bayesian, Fisher/Linear/Quadratic/Regularized Discriminant Analysis.

  • Regression: Support Vector Regression, Gaussian Process, Regression Trees, Gradient Boosting, Random Forest, RBF Networks, OLS, LASSO, ElasticNet, Ridge Regression.

  • Feature Selection: Genetic Algorithm based Feature Selection, Ensemble Learning based Feature Selection, TreeSHAP, Signal Noise ratio, Sum Squares ratio.

  • Clustering: BIRCH, CLARANS, DBSCAN, DENCLUE, Deterministic Annealing, K-Means, X-Means, G-Means, Neural Gas, Growing Neural Gas, Hierarchical Clustering, Sequential Information Bottleneck, Self-Organizing Maps, Spectral Clustering, Minimum Entropy Clustering.

  • Association Rule & Frequent Itemset Mining: FP-growth mining algorithm.

  • Manifold Learning: IsoMap, LLE, Laplacian Eigenmap, t-SNE, UMAP, PCA, Kernel PCA, Probabilistic PCA, GHA, Random Projection, ICA.

  • Multi-Dimensional Scaling: Classical MDS, Isotonic MDS, Sammon Mapping.

  • Nearest Neighbor Search: BK-Tree, Cover Tree, KD-Tree, SimHash, LSH.

  • Sequence Learning: Hidden Markov Model, Conditional Random Field.

  • Natural Language Processing: Sentence Splitter and Tokenizer, Bigram Statistical Test, Phrase Extractor, Keyword Extractor, Stemmer, POS Tagging, Relevance Ranking.
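
For a quick taste of the API, here is a minimal Java classification sketch. This is an illustrative example rather than official documentation: it assumes the Smile 2.x DataFrame API (Read.arff, Formula, RandomForest.fit), and the file path is hypothetical.

    import smile.classification.RandomForest;
    import smile.data.DataFrame;
    import smile.data.formula.Formula;
    import smile.io.Read;

    public class QuickStart {
        public static void main(String[] args) throws Exception {
            // Load a dataset with a nominal "class" column (path is illustrative).
            DataFrame iris = Read.arff("data/iris.arff");
            // Fit a random forest with "class" as the response and all other columns as features.
            RandomForest model = RandomForest.fit(Formula.lhs("class"), iris);
            // Predict the label of the first row.
            System.out.println("predicted class: " + model.predict(iris.get(0)));
        }
    }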

You can use the libraries through the Maven central repository by adding the following to your project's pom.xml file.

    <dependency>
      <groupId>com.github.haifengl</groupId>
      <artifactId>smile-core</artifactId>
      <version>2.6.0</version>
    </dependency>

For NLP, use the artifactId smile-nlp.
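
For example, the corresponding Maven coordinates would be:

    <dependency>
      <groupId>com.github.haifengl</groupId>
      <artifactId>smile-nlp</artifactId>
      <version>2.6.0</version>
    </dependency>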

For the Scala API, please use

    libraryDependencies += "com.github.haifengl" %% "smile-scala" % "2.6.0"

For the Kotlin API, add the following to the dependencies section of your Gradle build script.

    implementation("com.github.haifengl:smile-kotlin:2.6.0")

For the Clojure API, add the following dependency to your project or build file:

    [org.clojars.haifengl/smile "2.6.0"]

Some algorithms rely on BLAS and LAPACK (e.g. manifold learning, some clustering algorithms, Gaussian Process regression, MLP, etc.). To use these algorithms, include OpenBLAS for optimized matrix computation:

    libraryDependencies ++= Seq(
      "org.bytedeco" % "javacpp"   % "1.5.4"        classifier "macosx-x86_64" classifier "windows-x86_64" classifier "linux-x86_64" classifier "linux-arm64" classifier "linux-ppc64le" classifier "android-arm64" classifier "ios-arm64",
      "org.bytedeco" % "openblas"  % "0.3.10-1.5.4" classifier "macosx-x86_64" classifier "windows-x86_64" classifier "linux-x86_64" classifier "linux-arm64" classifier "linux-ppc64le" classifier "android-arm64" classifier "ios-arm64",
      "org.bytedeco" % "arpack-ng" % "3.7.0-1.5.4"  classifier "macosx-x86_64" classifier "windows-x86_64" classifier "linux-x86_64" classifier "linux-arm64" classifier "linux-ppc64le"
    )

In this example, we include all supported 64-bit platforms and exclude 32-bit platforms. Include only the platforms you need to save space.

If you prefer another BLAS implementation, you can use any library found on the "java.library.path" or on the class path by specifying it with the "org.bytedeco.openblas.load" system property. For example, to use the BLAS library from the Accelerate framework on Mac OS X, pass options such as -Djava.library.path=/usr/lib/ -Dorg.bytedeco.openblas.load=blas.
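
For example, a launch command using the Accelerate BLAS might look like the following sketch (the jar and main class names are placeholders):

    java -Djava.library.path=/usr/lib/ -Dorg.bytedeco.openblas.load=blas -cp myapp.jar com.example.Main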

For a default installation of MKL, that would be -Dorg.bytedeco.openblas.load=mkl_rt. Alternatively, you may simply include the smile-mkl module in your project, which bundles the MKL binaries. With smile-mkl on the class path, Smile will automatically switch to MKL.

    libraryDependencies += "com.github.haifengl" %% "smile-mkl" % "2.6.0"

Shell

Smile comes with interactive shells for Java, Scala and Kotlin. Download the pre-packaged Smile distribution from the releases page. In the Smile home directory, type

    ./bin/smile

to enter the Scala shell. You can run any valid Scala expression in the shell; in the simplest case, you can use it as a calculator. In addition, all high-level Smile operators are predefined in the shell. By default, the shell uses up to 75% of the available memory. If you need more memory to handle large data, use the option -J-Xmx or -XX:MaxRAMPercentage. For example,

    ./bin/smile -J-Xmx30G

You can also modify the configuration file ./conf/smile.ini for the memory and other JVM settings.

To use Java's JShell, type

    ./bin/jshell.sh

which has Smile's jars in the classpath. Similarly, run

    ./bin/kotlin.sh

to enter Kotlin REPL.

Model Serialization

Most models support the Java Serializable interface (all classifiers do), so that you can use them in Spark. For reading/writing models in non-Java code, we suggest XStream (https://github.com/x-stream/xstream) to serialize trained models. XStream is a simple library to serialize objects to XML and back again; it is easy to use and requires no mappings (indeed, no modifications to objects at all). Protostuff is a nice alternative that supports forward-backward compatibility (schema evolution) and validation. Beyond XML, Protostuff supports many other formats such as JSON, YAML, protobuf, etc.
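
Below is a minimal sketch of both serialization routes, assuming a trained model object and the XStream library on the class path; the file name is illustrative.

    import java.io.FileOutputStream;
    import java.io.ObjectOutputStream;
    import com.thoughtworks.xstream.XStream;

    public class SerializeModel {
        public static void save(Object model) throws Exception {
            // Java serialization: works for any Smile model that implements Serializable.
            try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream("model.ser"))) {
                out.writeObject(model);
            }

            // XStream serialization: writes the model as XML, readable outside the JVM.
            XStream xstream = new XStream();
            String xml = xstream.toXML(model);
            Object restored = xstream.fromXML(xml);
            System.out.println("round-tripped: " + restored.getClass().getName());
        }
    }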

Visualization

Smile provides a Swing-based data visualization library SmilePlot, which provides scatter plot, line plot, staircase plot, bar plot, box plot, histogram, 3D histogram, dendrogram, heatmap, hexmap, QQ plot, contour plot, surface, and wireframe.

To use SmilePlot, add the following to your dependencies:

    <dependency>
      <groupId>com.github.haifengl</groupId>
      <artifactId>smile-plot</artifactId>
      <version>2.6.0</version>
    </dependency>
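
A minimal usage sketch, assuming the smile.plot.swing API of the 2.x line; the data array is made up for illustration.

    import java.awt.Color;
    import smile.plot.swing.ScatterPlot;

    public class PlotDemo {
        public static void main(String[] args) throws Exception {
            // Each row is an (x, y) point.
            double[][] points = {{0.1, 0.2}, {0.4, 0.8}, {0.9, 0.3}, {0.5, 0.5}};
            // Build a scatter plot and show it in a Swing window.
            ScatterPlot.of(points, '*', Color.BLUE).canvas().window();
        }
    }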

Smile also supports data visualization in a declarative approach. With the smile.plot.vega package, we can create a specification that describes visualizations as mappings from data to properties of graphical marks (e.g., points or bars). The specification is based on Vega-Lite. The Vega-Lite compiler automatically produces visualization components including axes, legends, and scales, and then determines the properties of these components based on a set of carefully designed rules.

Gallery

  • Kernel PCA
  • IsoMap
  • Multi-Dimensional Scaling
  • SOM
  • Neural Network
  • SVM
  • Agglomerative Clustering
  • X-Means
  • DBSCAN
  • Neural Gas
  • Wavelet
  • Exponential Family Mixture

Comments
  • Sparse Data Memory Consumption

    Do I have to create a sparse matrix to train a classifier in Smile? I am trying to train a logistic regression model with 300,000 examples and 40,000 features. This requires a new double[300000][40000], which consumes approximately 96 GB of memory. However, I am able to train a logistic regression model with StanfordNLP on the same data using dense Counter objects.

    How can I train smile with dense vectors?

    question 
    opened by hrzafer 50
  • ManiFoldLearning-ISOMap; java.lang.ArrayIndexOutOfBoundsException

    Hi Hai, I am running into this issue while running IsoMap:

        [main] INFO smile.manifold.IsoMap - IsoMap: 2 connected components, largest one has 986 samples.
        Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 10
            at smile.math.matrix.EigenValueDecomposition.tql2(EigenValueDecomposition.java:1404)
            at smile.math.matrix.EigenValueDecomposition.decompose(EigenValueDecomposition.java:629)
            at smile.math.matrix.EigenValueDecomposition.decompose(EigenValueDecomposition.java:422)
            at smile.math.Math.eigen(Math.java:4316)
            at smile.manifold.IsoMap.<init>(IsoMap.java:179)
            at com.smile.dimensionality.reduction.IsoMapLearner.learn(IsoMapLearner.java:80)
            at com.smile.dimensionality.reduction.ManifoldLearningFunction.execute(ManifoldLearningFunction.java:85)
            at com.common.ModelingEngine.main(ModelingEngine.java:81)

    invalid 
    opened by aminaaslam 47
  • QRDecomposition is slow

    I discovered that OLS regression of 120k data points with 1200 dimensions per point takes an extremely long time (I'm not sure how long, it has been running for many minutes, with one core at 100% -- much longer than expected). The culprit is QRDecomposition, which has O(n^2) time complexity.

    I don't know if this can be improved -- the following doc may help: https://www.math.kth.se/na/SF2524/matber15/qrmethod.pdf

    Or maybe OLS can be performed in some other way that doesn't use QRDecomposition? Or is there an approximate / SGD-based way of performing approximate OLS regression?

    If not, QRDecomposition could be parallelized to some degree; that would help.

    enhancement 
    opened by lukehutch 43
  • DecisionTree and RegressionTree uses too much memory space

    I found that the DecisionTree and RegressionTree implementations use too much heap space when constructing trees as the split method call depth grows, because trueSamples and falseSamples are not released for GC until each split method call returns.

    https://github.com/haifengl/smile/blob/master/core/src/main/java/smile/classification/DecisionTree.java#L655 https://github.com/haifengl/smile/blob/master/core/src/main/java/smile/regression/RegressionTree.java#L599

    Other RandomForest implementations use bag[N], in which bag[i] represents samples[bag[i]]++, instead of samples[N], and bag.length is reduced as the tree depth grows. In Smile, N is the number of training examples and split consumes O(2N * depth).

    Releasing samples and trueSamples as soon as possible helps a little. https://github.com/myui/hivemall/pull/259/files

    Here is a case where OOM happened due to recursive node splits:

        at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:705)
        at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
        at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
        at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:714)
        at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:714)
        at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
        at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:714)
        at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:714)
        at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:714)
        at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
        at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
        at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
        at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:714)
        at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
        at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
        at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
        at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
        at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
        at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
        at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
        at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
        at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
        at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
        at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
        at smile.regression.RegressionTree.<init>(RegressionTree.java:799)
    

    Using SparseIntArray for samples is an option, but it'll consume more time for tree construction. @haifengl What do you think?

    enhancement 
    opened by myui 32
  • RandomForest: need correlation table before using factors ?

    Describe the bug
    When using the randomForest classifier, it seems we need a "correlation table" to recover the real values of our predicted variable.

    Expected behavior
    It would be easier to be able to directly use the real values of the predicted variable.

    Actual behavior
    We need to convert the predicted variable values into factors ourselves in order to keep the correlation table between factors and "real values", then use the model, and then convert the resulting factors back to the "real values". Do I understand well?

    opened by j3r3m1 29
  • PCA:Array Index Out of Bounds

    Hi Hai, I am running PCA (Principal Component Analysis) on my data set with the following parameters: Correlation: true, Dimensions: 2. But I am running into the issue below. This is the same issue that I am getting on IsoMap and LLE. Does this mean this numeric stability issue will affect all of your algorithms? Do you have any time frame for fixing this problem? Thanks, Amina

        SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
        SLF4J: Defaulting to no-operation (NOP) logger implementation
        SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
        Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 194
            at smile.math.matrix.EigenValueDecomposition.tql2(EigenValueDecomposition.java:1382)
            at smile.math.matrix.EigenValueDecomposition.decompose(EigenValueDecomposition.java:992)
            at smile.math.matrix.EigenValueDecomposition.decompose(EigenValueDecomposition.java:960)
            at smile.projection.PCA.<init>(PCA.java:170)
            at com.baesystems.ai.analytics.smile.dimensionality.reduction.PCAProjection.reduce(PCAProjection.java:73)
            at com.baesystems.ai.analytics.smile.dimensionality.reduction.DimensionReductionFunction.execute(DimensionReductionFunction.java:89)
            at com.baesystems.ai.analytics.common.ModelingEngine.main(ModelingEngine.java:81)

    invalid 
    opened by aminaaslam 23
  • inverse of a densematrix throws segmentation fault

    Expected behaviour

        smile> val a = randn(100,100)
        a: DenseMatrix =
          -0.2015   0.0481   0.0432  -0.6100  -1.0286  -0.7054   0.1032 ...
           0.9704   0.3677  -0.0011   1.5735  -0.7982   1.1519  -0.3758 ...
           2.7269  -0.7253  -0.2067  -0.8212   1.8176  -0.7403  -0.1626 ...
          -0.1460  -1.6094   1.0685  -0.7500  -2.9544   0.9874  -0.7922 ...
           1.8722   0.6541  -1.5475   1.7311   1.5635   1.2492  -0.6935 ...
           0.4959   0.6218   1.8119   2.1485   0.0260  -0.2688   0.0886 ...
           0.5539  -1.8473   0.8755  -0.2812   1.3386  -0.3521   0.3232 ...
          ...

    The expected output is the matrix inverse of DenseMatrix a.

    Actual behaviour

        smile> val b = inv(a)
        Nov 03, 2017 1:13:28 AM com.github.fommil.jni.JniLoader liberalLoad
        INFO: successfully loaded /tmp/jniloader7844851871842582915netlib-native_system-linux-x86_64.so
        Segmentation fault

    Code snippet

        val a = randn(100,100)
        val b = inv(a)

    Input data

    randn(100,100)

    Information

    • What Java (OpenJDK, Oracle JDK, etc.) are you using and which Java version? openjdk 1.8.0_131

    • Which Smile version? 1.5.0 RC2

    • What is your build system (e.g. Ubuntu, MacOS, Windows, Debian)? Ubuntu 16.04.3 LTS x64, 32 GB RAM, i7 6700

    help wanted 
    opened by sudarsun 22
  • RandomForest Example

    Hello, I'm trying to train a RandomForest model, but I'm getting the same result for each test (about 300 entries). Here's the Java code:

    RandomForest model = new RandomForest(X, Y, 500);

    For the same data, Python's sklearn works as expected:

    clf = RandomForestClassifier(n_estimators=500)
    clf = clf.fit(X, Y)

    What am I missing?

    Thanks

    question 
    opened by Mega4alik 22
  • add simhash

    Hi Haifeng, I added the LSH search based on the SimHash signature. During the implementation, I found that there is a lot of repeated code and we cannot reuse some code that has already been implemented. I think we may extract the hash functions as shared components so that all LSH searches could use them. But it's not quite easy: when I tried, even the Euclidean hash functions used by LSH and MPLSH turned out to be slightly different from each other. It will take time to do this refactoring. BTW, do you have some recommended data sets to test this commit?

    commit summary:

    • add simhash (simply using weights [1,1,1...1]) and the LSH search for signatures
    new feature 
    opened by kid1412z 22
  • Headless Plotting

    Is it possible to generate a plot and save it to file without needing to open a window? I'm trying to generate a few thousand plot images and I'm blocked at the moment.

    (0 until 2) foreach { zone =>
    
      val myData = // Generate data
    
      val canvas = ScatterPlot.plot(myData, '*', Color.BLACK)
      val plotFile = s"/tmp/channelPlots/$zone.png".toFile
    
      // (1)
      canvas.save(plotFile.toJava)
    
      // (2)
      val imageSize = 750
      val bi = new BufferedImage(imageSize, imageSize, BufferedImage.TYPE_INT_ARGB)
      val g2d = bi.createGraphics()
      canvas.print(g2d)
      g2d.dispose()
    
      ImageIO.write(bi, "png", plotFile.toJava)
    
      // (3)
      plot(myData)
    
    }
    
    1. Fails with: [error] Exception in thread "main" java.lang.IllegalArgumentException: Width (0) and height (0) cannot be <= 0
    2. Results in a solid colored image written to file
    3. Works fine but requires an opened window

    Any help would be greatly appreciated!

    new feature 
    opened by cranst0n 21
  • ensemble tree shap

    implementation for https://github.com/haifengl/smile/issues/515

    similar (although not exactly the same) Python result for the Boston housing dataset:

    CRIM      4.7090670e-01
    ZN        1.1862384e-03
    INDUS     3.6680367e-02
    CHAS      0.0000000e+00
    NOX       2.9865596e-01
    RM        1.6684741e+00
    AGE       3.4423586e-02
    DIS       2.4549793e-01
    RAD       1.1467161e-02
    TAX       4.9007054e-02
    PTRATIO   9.5596902e-02
    B         8.1507929e-02
    LSTAT     2.5720918e+00
    


    opened by rayeaster 20
  • UMAP does not output the same amount of instances as input

    Describe the bug
    When running UMAP on a data array with 600 rows and 3 columns of uniform samples, UMAP returns an array with 29 rows and 2 columns, or 31 rows (a nondeterministic number of rows).

    Expected behavior
    I would expect it to return the same number of rows (600) so that there is a 1-to-1 mapping from input samples to output projection.

    Actual behavior
    Returns 29 or 28 or 31 rows (probably depending on the randomness of the samples).

    Code snippet

    public static void main(String[] args) {
        double[][] data = new double[600][3];
        for (int i = 0; i < data.length; i++) {
            for (int j = 0; j < 3; j++) {
                data[i][j] = Math.random();
            }
        }
        UMAP umap = UMAP.of(data, 2);
        if (umap.coordinates.length != data.length)
            throw new IllegalStateException("non bijective mapping");
    }
    

    Input data is generated by the code snippet above.

    Additional context

    • What Java (OpenJDK, Oracle JDK, etc.) are you using and which Java version
      • Adopt OpenJDK version 11 (LTS)
    • Which Smile version
      • <groupId>com.github.haifengl</groupId><artifactId>smile-core</artifactId><version>2.5.0</version>
    • What is your build system (e.g. Ubuntu, MacOS, Windows, Debian )
      • Windows 10
    • Add any other context about the problem here.
    opened by hageldave 5
  • GradientTreeBoost : OnlineRegression

    Is your feature request related to a problem? Please describe.
    GradientTreeBoost is a powerful machine learning algorithm, but it is difficult and painful to find good parameters. We have to make multiple attempts, which can be slow. But there is one parameter that could be analysed differently and efficiently: ntrees (the number of trees).

    Describe the solution you'd like
    It would be nice to adapt the fitting method to allow the caller to test the model at each iteration, comparing the evolution of the RMSE (for example) on the training and validation datasets to see the effect of ntrees, and then be able to detect when the model is overfitting. This would avoid testing with ntrees=100, then ntrees=200, etc., which is not efficient. So, in Smile vocabulary, it consists of making GradientTreeBoost an OnlineRegression with an update method.

    This mechanism could also allow the caller to monitor the progress of the training (UI with progress bar, etc.) and to stop it if it takes too long.

    opened by olivbrau 2
  • [Feature proposal] Dataframe merge by ID

    I've got a few different dataframes that I'd like to merge when calculating some regression, and right now I do so by converting to a matrix of doubles, aligning the rows by id, and then rebuilding a dataframe. Spark and pandas have utility methods that allow you to merge dataframes with a by option to specify which column is used to match the data.

    Describe the solution you'd like
    Extend the merge method with either a simple by option to specify the key to merge on, add a mergeWith method, or a MergeOptions parameter that contains information such as by (the key to join on) and mergeType (inner vs. outer join, left vs. right join).

    https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html

    opened by adamsar 2
  • Extend FeatureRanking interface for regression tasks

    It may be useful to have a feature ranking procedure applicable not only to classification tasks (e.g. SignalNoiseRatio and SumSquaresRatio, implementing the FeatureRanking interface), but also to regression tasks. At the moment the FeatureRanking interface only accepts integer target vectors for calculating the feature rank.

    opened by sabbatinif 2
  • TreeSHAP Values are inconsistent

    Describe the bug
    When calculating TreeSHAP values for random forest classification, they don't add up. I would expect that the prediction from .vote() minus the respective SHAP values gives me the base value, which is constant and should be the same for different observations. Note that this is the behaviour we observe in Lundberg's Python module. Also, it would be really handy if there were a function that just calculates the base value (expected_value in Python) for me.

    Expected behavior
    When calculating TreeSHAP values, I expect them to add up, together with the base value, to the predicted probability.

    Actual behavior
    Calculated base values vary even for observations from the same class.

    Code snippet

    val iris = read.arff("../data/weka/iris.arff")
    
    val formula: Formula = "class" ~
    val x = formula.x(iris).toArray
    val y = formula.y(iris).toIntArray
    
    val model = smile.classification.randomForest(formula,iris)
    
    val arr50 = new Array[Double](3)
    val arr52 = new Array[Double](3)
    
    model.vote(iris(50),arr50)
    model.vote(iris(52),arr52)
    
    val shap_50 = model.shap(iris(50))
    val shap_52 = model.shap(iris(52))
    
    arr50(1)-shap_50.indices.filter(x => (x+2) % 3 == 0).map(shap_50).sum
    // res15: Double = 0.41123849878987584
    arr52(1)-shap_52.indices.filter(x => (x+2) % 3 == 0).map(shap_52).sum
    // res16: Double = 0.4571260444068466
    

    Input data Iris data set

    Additional context

    • using smile from the Try-It-Online binder
    opened by ntrost-targ 11
  • HDBSCAN

    Would be very useful to have the HDBSCAN clustering algorithm in addition to the regular DBSCAN. HDBSCAN really is state of the art, and having to use an additional ML library just for this is very inconvenient.

    new feature 
    opened by telhoc 0
Releases(v3.0.0)
  • v3.0.0(Dec 15, 2022)

    1. Java Module friendly with auto module name
    2. Redesigned feature engineering packages (missing value imputation, transform, selection, extraction, importance)
    3. One-class SVM
    4. Isolation forest
    5. Feature Hashing
    6. One-way ANOVA
    7. BigMatrix supporting more than 2 billion elements
    8. Latin hypercube sampling
    9. CLI supports training, batch prediction, endpoint, etc.
    10. Bug fixes
  • v2.6.0(Dec 5, 2020)

    • Spark integration (thanks Pierre Nodet)
    • t-SNE is 6X faster (thanks Brault Olivier-O)
    • Fully redesigned Gaussian Process Regression with HPO
    • L-BFGS-B
    • Matern kernel and composed kernels
    • Fully redesigned model validation facilities and metrics
    • Various optimization and bug fixes
  • v2.5.3(Sep 19, 2020)

  • v2.5.2(Sep 6, 2020)

  • v2.5.1(Aug 17, 2020)

  • v2.5.0(Jul 23, 2020)

  • v2.4.0(May 1, 2020)

    • All new declarative data visualization
    • TreeSHAP (contributed by Ray Ma @rayeaster)
    • UMAP (contributed by Ray Ma @rayeaster)
    • Levenberg-Marquardt algorithm
    • The packages smile-cas and smile-vega are merged into the smile-scala package
    • Spark integration in smile-spark
    • NLP in Kotlin
    • Grid search and random search for hyperparameter tuning
    • Bug fixes
    • Smile Shell is based on Scala REPL (2.13.2) again
    • DataFrame and Tuple -> JSON
    • Kotlin and Clojure notebooks

    Kudos to Ray Ma @rayeaster for great contributions!

  • v2.3.0(Apr 1, 2020)

  • v2.2.2(Mar 12, 2020)

  • v2.2.0(Feb 29, 2020)

    The CAS module is a computer algebra system that has the ability to manipulate mathematical expressions in a way similar to the traditional manual computations of mathematicians and scientists.

    The symbolic manipulations supported include:

    • simplification to a smaller expression or some standard form, including automatic simplification with assumptions and simplification with constraints

    • substitution of symbols or numeric values for certain expressions

    • change of form of expressions: expanding products and powers, partial and full factorization, rewriting as partial fractions, constraint satisfaction, rewriting trigonometric functions as exponentials, transforming logic expressions, etc.

    • partial and total differentiation

    • matrix operations including products, inverses, etc.

  • v2.1.0(Jan 29, 2020)

  • v2.0.0(Nov 22, 2019)

    Smile has been fully rewritten with more than 150,000 lines changed.

    • Fully redesigned API. It is leaner, simpler and even more friendly.
    • Faster implementation and memory optimization. Many algorithms are fully reimplemented. RandomForest is 8X faster than XGBoost on large benchmark data (10MM samples).
    • New parallelism mechanism
    • All new DataFrame and Formula
    • New algorithms such as ICA, error reduction pruning, quantile loss, TWCNB, etc.
    • Support for arbitrary class labels.
    • Enhanced and hardened numeric computations.
    • Support Parquet, SAS, Arrow, Avro, etc.
    • Bug fixes.
  • v1.5.3(Jun 2, 2019)

  • v1.5.2(Oct 15, 2018)

  • v1.5.1(Feb 26, 2018)

  • v1.5.0(Nov 10, 2017)

    1. DataFrame
    2. New Shell for Mac and Linux
    3. Shell improvement for Windows
    4. Out of box support of native LAPACK for Windows
    5. Scala functions to export AttributeDataset, double[][], double[] to ARFF or CSV
    6. Scala functions for validation measures
    7. Refactor feature transformation and generation classes
    8. NeuralNetwork for regression
    9. Recursive least squares
    10. Refactor Scala NLP API
    11. Bug fixes
  • v1.4.0(Aug 6, 2017)

    1. Add smile-netlib module that leverages native BLAS/LAPACK for matrix computation. See below for details on how to enable it.
    2. Add t-SNE implementation.
    3. Improve LLE and Laplacian Eigenmaps performance.
    4. Export DecisionTree and RegressionTree to Graphviz dot file for visualization.
    5. Smile shell is now based on Scala 2.12.
    6. Bug fixes.

    To enable machine-optimized matrix computation, add the dependency on smile-netlib:

        <dependency>
          <groupId>com.github.haifengl</groupId>
          <artifactId>smile-netlib</artifactId>
          <version>1.4.0</version>
        </dependency>
    

    and also make your machine-optimised libblas3 (CBLAS) and liblapack3 (Fortran) available as shared libraries at runtime.

    OS X

    Apple OS X requires no further setup as it ships with the veclib framework.

    Linux

    Generically-tuned ATLAS and OpenBLAS are available with most distributions and must be enabled explicitly using the package-manager. For example,

    • sudo apt-get install libatlas3-base libopenblas-base
    • sudo update-alternatives --config libblas.so
    • sudo update-alternatives --config libblas.so.3
    • sudo update-alternatives --config liblapack.so
    • sudo update-alternatives --config liblapack.so.3

    However, these are only generic pre-tuned builds. If you have an Intel MKL licence, you could also create symbolic links from libblas.so.3 and liblapack.so.3 to libmkl_rt.so or use Debian's alternatives system.

    Windows

    The native_system builds expect to find libblas3.dll and liblapack3.dll on the %PATH% (or in the current working directory). Besides vendor-supplied implementations, OpenBLAS provides generically tuned binaries, and it is possible to build ATLAS.

  • v1.3.0(Mar 27, 2017)

  • v1.2.1(Dec 2, 2016)

  • v1.2.0(Aug 16, 2016)

    The key features of the 1.2.0 release are:

    • Headless plot. Smile's plot functions depend on Java Swing. In server applications, we need to generate plots without creating Swing windows. With headless plot (enabled by the -Djava.awt.headless=true JVM option), we can create plots as follows:

          val canvas = ScatterPlot.plot(x, '.')

          val headless = new Headless(canvas)
          headless.pack()
          headless.setVisible(true)

          canvas.save(new java.io.File("zone.png"))

    • All classification and regression models can be serialized by
    write(model) // Java serialization
    

    or

    write.xstream(model) // XStream serialization
    
    • Refactor of smile.io Scala API.
      • Parsers are in smile.read object.
      • Parse JDBC ResultSet to AttributeDataset.
      • Model serialization methods in smile.write object.
    • Platt scaling for SVM
    • Smile NLP tokenizers are unicode-aware.
    • Least squares can now handle rank-deficient problems.
    • Various code improvements.
  • v1.1.0(Mar 17, 2016)
