Statistical Machine Intelligence & Learning Engine

Overview

Join the chat at https://gitter.im/haifengl/smile.

Smile (Statistical Machine Intelligence and Learning Engine) is a fast and comprehensive machine learning, NLP, linear algebra, graph, interpolation, and visualization system in Java and Scala. With advanced data structures and algorithms, Smile delivers state-of-the-art performance. Smile is well documented; please check out the project website for programming guides and more information.

Smile covers every aspect of machine learning, including classification, regression, clustering, association rule mining, feature selection, manifold learning, multidimensional scaling, genetic algorithms, missing value imputation, efficient nearest neighbor search, etc.

Smile implements the following major machine learning algorithms:

  • Classification: Support Vector Machines, Decision Trees, AdaBoost, Gradient Boosting, Random Forest, Logistic Regression, Neural Networks, RBF Networks, Maximum Entropy Classifier, KNN, Naïve Bayesian, Fisher/Linear/Quadratic/Regularized Discriminant Analysis.

  • Regression: Support Vector Regression, Gaussian Process, Regression Trees, Gradient Boosting, Random Forest, RBF Networks, OLS, LASSO, ElasticNet, Ridge Regression.

  • Feature Selection: Genetic Algorithm based Feature Selection, Ensemble Learning based Feature Selection, TreeSHAP, Signal Noise ratio, Sum Squares ratio.

  • Clustering: BIRCH, CLARANS, DBSCAN, DENCLUE, Deterministic Annealing, K-Means, X-Means, G-Means, Neural Gas, Growing Neural Gas, Hierarchical Clustering, Sequential Information Bottleneck, Self-Organizing Maps, Spectral Clustering, Minimum Entropy Clustering.

  • Association Rule & Frequent Itemset Mining: FP-growth mining algorithm.

  • Manifold Learning: IsoMap, LLE, Laplacian Eigenmap, t-SNE, UMAP, PCA, Kernel PCA, Probabilistic PCA, GHA, Random Projection, ICA.

  • Multi-Dimensional Scaling: Classical MDS, Isotonic MDS, Sammon Mapping.

  • Nearest Neighbor Search: BK-Tree, Cover Tree, KD-Tree, SimHash, LSH.

  • Sequence Learning: Hidden Markov Model, Conditional Random Field.

  • Natural Language Processing: Sentence Splitter and Tokenizer, Bigram Statistical Test, Phrase Extractor, Keyword Extractor, Stemmer, POS Tagging, Relevance Ranking.
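
For a quick taste of the API, here is a minimal Java classification sketch. This is an illustrative example rather than official documentation: it assumes the Smile 2.x DataFrame API (Read.arff, Formula, RandomForest.fit), and the file path is hypothetical.

    import smile.classification.RandomForest;
    import smile.data.DataFrame;
    import smile.data.formula.Formula;
    import smile.io.Read;

    public class QuickStart {
        public static void main(String[] args) throws Exception {
            // Load a dataset with a nominal "class" column (path is illustrative).
            DataFrame iris = Read.arff("data/iris.arff");
            // Fit a random forest with "class" as the response and all other columns as features.
            RandomForest model = RandomForest.fit(Formula.lhs("class"), iris);
            // Predict the label of the first row.
            System.out.println("predicted class: " + model.predict(iris.get(0)));
        }
    }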

You can use the libraries through the Maven central repository by adding the following to your project's pom.xml file.

    <dependency>
      <groupId>com.github.haifengl</groupId>
      <artifactId>smile-core</artifactId>
      <version>2.6.0</version>
    </dependency>

For NLP, use the artifactId smile-nlp.
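
For example, the corresponding Maven coordinates would be:

    <dependency>
      <groupId>com.github.haifengl</groupId>
      <artifactId>smile-nlp</artifactId>
      <version>2.6.0</version>
    </dependency>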

For the Scala API, please use

    libraryDependencies += "com.github.haifengl" %% "smile-scala" % "2.6.0"

For the Kotlin API, add the following to the dependencies section of your Gradle build script.

    implementation("com.github.haifengl:smile-kotlin:2.6.0")

For the Clojure API, add the following dependency to your project or build file:

    [org.clojars.haifengl/smile "2.6.0"]

Some algorithms rely on BLAS and LAPACK (e.g. manifold learning, some clustering algorithms, Gaussian Process regression, MLP, etc.). To use these algorithms, include OpenBLAS for optimized matrix computation:

    libraryDependencies ++= Seq(
      "org.bytedeco" % "javacpp"   % "1.5.4"        classifier "macosx-x86_64" classifier "windows-x86_64" classifier "linux-x86_64" classifier "linux-arm64" classifier "linux-ppc64le" classifier "android-arm64" classifier "ios-arm64",
      "org.bytedeco" % "openblas"  % "0.3.10-1.5.4" classifier "macosx-x86_64" classifier "windows-x86_64" classifier "linux-x86_64" classifier "linux-arm64" classifier "linux-ppc64le" classifier "android-arm64" classifier "ios-arm64",
      "org.bytedeco" % "arpack-ng" % "3.7.0-1.5.4"  classifier "macosx-x86_64" classifier "windows-x86_64" classifier "linux-x86_64" classifier "linux-arm64" classifier "linux-ppc64le"
    )

In this example, we include all supported 64-bit platforms and exclude 32-bit platforms. Include only the platforms you need to save space.

If you prefer another BLAS implementation, you can use any library found on the "java.library.path" or on the class path by specifying it with the "org.bytedeco.openblas.load" system property. For example, to use the BLAS library from the Accelerate framework on Mac OS X, pass options such as -Djava.library.path=/usr/lib/ -Dorg.bytedeco.openblas.load=blas.
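
For example, a launch command using the Accelerate BLAS might look like the following sketch (the jar and main class names are placeholders):

    java -Djava.library.path=/usr/lib/ -Dorg.bytedeco.openblas.load=blas -cp myapp.jar com.example.Main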

For a default installation of MKL, that would be -Dorg.bytedeco.openblas.load=mkl_rt. Alternatively, you may simply include the smile-mkl module in your project, which bundles the MKL binaries. With smile-mkl on the class path, Smile will automatically switch to MKL.

    libraryDependencies += "com.github.haifengl" %% "smile-mkl" % "2.6.0"

Shell

Smile comes with interactive shells for Java, Scala and Kotlin. Download the pre-packaged Smile distribution from the releases page. In the Smile home directory, type

    ./bin/smile

to enter the Scala shell. You can run any valid Scala expression in the shell; in the simplest case, you can use it as a calculator. In addition, all high-level Smile operators are predefined in the shell. By default, the shell uses up to 75% of the available memory. If you need more memory to handle large data, use the option -J-Xmx or -XX:MaxRAMPercentage. For example,

    ./bin/smile -J-Xmx30G

You can also modify the configuration file ./conf/smile.ini for the memory and other JVM settings.

To use Java's JShell, type

    ./bin/jshell.sh

which has Smile's jars in the classpath. Similarly, run

    ./bin/kotlin.sh

to enter Kotlin REPL.

Model Serialization

Most models support the Java Serializable interface (all classifiers do), so that you can use them in Spark. For reading/writing models in non-Java code, we suggest XStream (https://github.com/x-stream/xstream) to serialize trained models. XStream is a simple library to serialize objects to XML and back again; it is easy to use and requires no mappings (indeed, no modifications to objects at all). Protostuff is a nice alternative that supports forward-backward compatibility (schema evolution) and validation. Beyond XML, Protostuff supports many other formats such as JSON, YAML, protobuf, etc.
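
Below is a minimal sketch of both serialization routes, assuming a trained model object and the XStream library on the class path; the file name is illustrative.

    import java.io.FileOutputStream;
    import java.io.ObjectOutputStream;
    import com.thoughtworks.xstream.XStream;

    public class SerializeModel {
        public static void save(Object model) throws Exception {
            // Java serialization: works for any Smile model that implements Serializable.
            try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream("model.ser"))) {
                out.writeObject(model);
            }

            // XStream serialization: writes the model as XML, readable outside the JVM.
            XStream xstream = new XStream();
            String xml = xstream.toXML(model);
            Object restored = xstream.fromXML(xml);
            System.out.println("round-tripped: " + restored.getClass().getName());
        }
    }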

Visualization

Smile provides a Swing-based data visualization library SmilePlot, which provides scatter plot, line plot, staircase plot, bar plot, box plot, histogram, 3D histogram, dendrogram, heatmap, hexmap, QQ plot, contour plot, surface, and wireframe.

To use SmilePlot, add the following to your dependencies:

    <dependency>
      <groupId>com.github.haifengl</groupId>
      <artifactId>smile-plot</artifactId>
      <version>2.6.0</version>
    </dependency>
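
A minimal usage sketch, assuming the smile.plot.swing API of the 2.x line; the data array is made up for illustration.

    import java.awt.Color;
    import smile.plot.swing.ScatterPlot;

    public class PlotDemo {
        public static void main(String[] args) throws Exception {
            // Each row is an (x, y) point.
            double[][] points = {{0.1, 0.2}, {0.4, 0.8}, {0.9, 0.3}, {0.5, 0.5}};
            // Build a scatter plot and show it in a Swing window.
            ScatterPlot.of(points, '*', Color.BLUE).canvas().window();
        }
    }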

Smile also supports data visualization in a declarative approach. With the smile.plot.vega package, we can create a specification that describes visualizations as mappings from data to properties of graphical marks (e.g., points or bars). The specification is based on Vega-Lite. The Vega-Lite compiler automatically produces visualization components including axes, legends, and scales, and then determines the properties of these components based on a set of carefully designed rules.

Gallery

  • Kernel PCA
  • IsoMap
  • Multi-Dimensional Scaling
  • SOM
  • Neural Network
  • SVM
  • Agglomerative Clustering
  • X-Means
  • DBSCAN
  • Neural Gas
  • Wavelet
  • Exponential Family Mixture

Comments
  • Sparse Data Memory Consumption

    Do I have to create a sparse matrix to train a classifier in Smile? I am trying to train a logistic regression model with 300,000 examples and 40,000 features. This requires a new double[300000][40000], which consumes approximately 96 GB of memory. However, I am able to train a logistic regression model with StanfordNLP on the same data using dense Counter objects.

    How can I train smile with dense vectors?

    question 
    opened by hrzafer 50
  • ManiFoldLearning-ISOMap; java.lang.ArrayIndexOutOfBoundsException

    Hi Hai, I am running into this issue while running IsoMap:

        [main] INFO smile.manifold.IsoMap - IsoMap: 2 connected components, largest one has 986 samples.
        Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 10
            at smile.math.matrix.EigenValueDecomposition.tql2(EigenValueDecomposition.java:1404)
            at smile.math.matrix.EigenValueDecomposition.decompose(EigenValueDecomposition.java:629)
            at smile.math.matrix.EigenValueDecomposition.decompose(EigenValueDecomposition.java:422)
            at smile.math.Math.eigen(Math.java:4316)
            at smile.manifold.IsoMap.<init>(IsoMap.java:179)
            at com.smile.dimensionality.reduction.IsoMapLearner.learn(IsoMapLearner.java:80)
            at com.smile.dimensionality.reduction.ManifoldLearningFunction.execute(ManifoldLearningFunction.java:85)
            at com.common.ModelingEngine.main(ModelingEngine.java:81)

    invalid 
    opened by aminaaslam 47
  • QRDecomposition is slow

    I discovered that OLS regression of 120k data points with 1200 dimensions per point takes an extremely long time (I'm not sure how long, it has been running for many minutes, with one core at 100% -- much longer than expected). The culprit is QRDecomposition, which has O(n^2) time complexity.

    I don't know if this can be improved -- the following doc may help: https://www.math.kth.se/na/SF2524/matber15/qrmethod.pdf

    Or maybe OLS can be performed in some other way that doesn't use QRDecomposition? Or is there an approximate / SGD-based way of performing approximate OLS regression?

    If not, QRDecomposition could be parallelized to some degree; that would help.

    enhancement 
    opened by lukehutch 43
  • DecisionTree and RegressionTree uses too much memory space

    I found that the DecisionTree and RegressionTree implementations use too much heap space when constructing trees as the split method call depth grows, because trueSamples and falseSamples are not released for GC until each split method call returns.

    https://github.com/haifengl/smile/blob/master/core/src/main/java/smile/classification/DecisionTree.java#L655 https://github.com/haifengl/smile/blob/master/core/src/main/java/smile/regression/RegressionTree.java#L599

    Other RandomForest implementations use bag[N], in which bag[i] represents samples[bag[i]]++, instead of samples[N], and bag.length is reduced as the tree depth grows. In Smile, N is the number of training examples and split consumes O(2N * depth).

    Releasing samples and trueSamples as soon as possible helps a little. https://github.com/myui/hivemall/pull/259/files

    Here is a case where OOM happened due to recursive node splits:

        at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:705)
        at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
        at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
        at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:714)
        at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:714)
        at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
        at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:714)
        at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:714)
        at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:714)
        at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
        at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
        at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
        at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:714)
        at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
        at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
        at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
        at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
        at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
        at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
        at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
        at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
        at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
        at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
        at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
        at smile.regression.RegressionTree.<init>(RegressionTree.java:799)
    

    Using SparseIntArray for samples is an option, but it'll consume more time for tree construction. @haifengl What do you think?

    enhancement 
    opened by myui 32
  • RandomForest: need correlation table before using factors ?

    Describe the bug
    When using the randomForest classifier, it seems we need a "correlation table" to recover the real values of our predicted variable.

    Expected behavior
    It would be easier to be able to directly use the real values of the predicted variable.

    Actual behavior
    We need to convert the predicted variable values into factors ourselves in order to keep the correlation table between factors and "real values", then use the model, and then convert the resulting factors back to the "real values". Do I understand well?

    opened by j3r3m1 29
  • PCA:Array Index Out of Bounds

    Hi Hai, I am running PCA (Principal Component Analysis) on my data set with the following parameters: Correlation: true, Dimensions: 2. But I am running into the issue below. This is the same issue that I am getting on IsoMap and LLE. Does this mean this numeric stability issue will affect all of your algorithms? Do you have any time frame for fixing this problem? Thanks, Amina

        SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
        SLF4J: Defaulting to no-operation (NOP) logger implementation
        SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
        Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 194
            at smile.math.matrix.EigenValueDecomposition.tql2(EigenValueDecomposition.java:1382)
            at smile.math.matrix.EigenValueDecomposition.decompose(EigenValueDecomposition.java:992)
            at smile.math.matrix.EigenValueDecomposition.decompose(EigenValueDecomposition.java:960)
            at smile.projection.PCA.<init>(PCA.java:170)
            at com.baesystems.ai.analytics.smile.dimensionality.reduction.PCAProjection.reduce(PCAProjection.java:73)
            at com.baesystems.ai.analytics.smile.dimensionality.reduction.DimensionReductionFunction.execute(DimensionReductionFunction.java:89)
            at com.baesystems.ai.analytics.common.ModelingEngine.main(ModelingEngine.java:81)

    invalid 
    opened by aminaaslam 23
  • inverse of a densematrix throws segmentation fault

    Expected behaviour

        smile> val a = randn(100,100)
        a: DenseMatrix =
          -0.2015   0.0481   0.0432  -0.6100  -1.0286  -0.7054   0.1032 ...
           0.9704   0.3677  -0.0011   1.5735  -0.7982   1.1519  -0.3758 ...
           2.7269  -0.7253  -0.2067  -0.8212   1.8176  -0.7403  -0.1626 ...
          -0.1460  -1.6094   1.0685  -0.7500  -2.9544   0.9874  -0.7922 ...
           1.8722   0.6541  -1.5475   1.7311   1.5635   1.2492  -0.6935 ...
           0.4959   0.6218   1.8119   2.1485   0.0260  -0.2688   0.0886 ...
           0.5539  -1.8473   0.8755  -0.2812   1.3386  -0.3521   0.3232 ...
          ...

    The expected output is the matrix inverse of DenseMatrix a.

    Actual behaviour

        smile> val b = inv(a)
        Nov 03, 2017 1:13:28 AM com.github.fommil.jni.JniLoader liberalLoad
        INFO: successfully loaded /tmp/jniloader7844851871842582915netlib-native_system-linux-x86_64.so
        Segmentation fault

    Code snippet

        val a = randn(100,100)
        val b = inv(a)

    Input data

    randn(100,100)

    Information

    • What Java (OpenJDK, Oracle JDK, etc.) are you using and which Java version? openjdk 1.8.0_131

    • Which Smile version? 1.5.0 RC2

    • What is your build system (e.g. Ubuntu, MacOS, Windows, Debian)? Ubuntu 16.04.3 LTS x64, 32 GB RAM, i7 6700

    help wanted 
    opened by sudarsun 22
  • RandomForest Example

    Hello, I'm trying to train a RandomForest model, but I'm getting the same result for each test (about 300 entries). Here's the Java code:

    RandomForest model = new RandomForest(X, Y, 500);

    For the same data, Python's sklearn works as expected:

    clf = RandomForestClassifier(n_estimators=500)
    clf = clf.fit(X, Y)

    What am I missing?

    Thanks

    question 
    opened by Mega4alik 22
  • add simhash

    Hi Haifeng, I added the LSH search based on the SimHash signature. During the implementation, I found that there is a lot of repeated code and we cannot reuse some code that has already been implemented. I think we may extract the hash functions as shared components so that all LSH searches could use them. But it's not quite easy: when I tried, even the Euclidean hash functions used by LSH and MPLSH turned out to be slightly different from each other. It will take time to do this refactoring. BTW, do you have some recommended data sets to test this commit?

    commit summary:

    • add simhash (simply using weights [1,1,1...1]) and the LSH search for signatures
    new feature 
    opened by kid1412z 22
  • Headless Plotting

    Is it possible to generate a plot and save it to file without needing to open a window? I'm trying to generate a few thousand plot images and I'm blocked at the moment.

    (0 until 2) foreach { zone =>
    
      val myData = // Generate data
    
      val canvas = ScatterPlot.plot(myData, '*', Color.BLACK)
      val plotFile = s"/tmp/channelPlots/$zone.png".toFile
    
      // (1)
      canvas.save(plotFile.toJava)
    
      // (2)
      val imageSize = 750
      val bi = new BufferedImage(imageSize, imageSize, BufferedImage.TYPE_INT_ARGB)
      val g2d = bi.createGraphics()
      canvas.print(g2d)
      g2d.dispose()
    
      ImageIO.write(bi, "png", plotFile.toJava)
    
      // (3)
      plot(myData)
    
    }
    
    1. Fails with: [error] Exception in thread "main" java.lang.IllegalArgumentException: Width (0) and height (0) cannot be <= 0
    2. Results in a solid colored image written to file
    3. Works fine but requires an opened window

    Any help would be greatly appreciated!

    new feature 
    opened by cranst0n 21
  • ensemble tree shap

    implementation for https://github.com/haifengl/smile/issues/515

    similar (although not exactly the same) Python result for the Boston housing dataset:

    CRIM      4.7090670e-01
    ZN        1.1862384e-03
    INDUS     3.6680367e-02
    CHAS      0.0000000e+00
    NOX       2.9865596e-01
    RM        1.6684741e+00
    AGE       3.4423586e-02
    DIS       2.4549793e-01
    RAD       1.1467161e-02
    TAX       4.9007054e-02
    PTRATIO   9.5596902e-02
    B         8.1507929e-02
    LSTAT     2.5720918e+00
    


    opened by rayeaster 20
  • UMAP does not output the same amount of instances as input

    Describe the bug
    When running UMAP on a data array with 600 rows and 3 columns of uniform samples, UMAP returns an array with 29 rows and 2 columns, or 31 rows (a nondeterministic number of rows).

    Expected behavior
    I would expect it to return the same number of rows (600) so that there is a 1-to-1 mapping from input samples to output projection.

    Actual behavior
    Returns 29 or 28 or 31 rows (probably depending on the randomness of the samples).

    Code snippet

    public static void main(String[] args) {
        double[][] data = new double[600][3];
        for (int i = 0; i < data.length; i++) {
            for (int j = 0; j < 3; j++) {
                data[i][j] = Math.random();
            }
        }
        UMAP umap = UMAP.of(data, 2);
        if (umap.coordinates.length != data.length)
            throw new IllegalStateException("non bijective mapping");
    }
    

    Input data is generated by the code snippet above.

    Additional context

    • What Java (OpenJDK, Oracle JDK, etc.) are you using and which Java version
      • Adopt OpenJDK version 11 (LTS)
    • Which Smile version
      • <groupId>com.github.haifengl</groupId><artifactId>smile-core</artifactId><version>2.5.0</version>
    • What is your build system (e.g. Ubuntu, MacOS, Windows, Debian )
      • Windows 10
    • Add any other context about the problem here.
    opened by hageldave 5
  • GradientTreeBoost : OnlineRegression

    Is your feature request related to a problem? Please describe.
    GradientTreeBoost is a powerful machine learning algorithm, but it is difficult and painful to find good parameters. We have to make multiple attempts, which can be slow. But there is one parameter that could be analysed differently and efficiently: ntrees (the number of trees).

    Describe the solution you'd like
    It would be nice to adapt the fitting method to allow the caller to test the model at each iteration, comparing the evolution of the RMSE (for example) on the training and validation datasets to see the effect of ntrees, and then be able to detect when the model is overfitting. This would avoid testing with ntrees=100, then ntrees=200, etc., which is not efficient. So, in Smile vocabulary, it consists of making GradientTreeBoost an OnlineRegression with an update method.

    This mechanism could also allow the caller to monitor the progress of the training (UI with progress bar, etc.) and to stop it if it takes too long.

    opened by olivbrau 2
  • [Feature proposal] Dataframe merge by ID

    I've got a few different dataframes that I'd like to merge when calculating some regression, and right now I do so by converting to a matrix of doubles, aligning the rows by id, and then rebuilding a dataframe. Spark and pandas have utility methods that allow you to merge dataframes with a by option to specify which column is used to match the data.

    Describe the solution you'd like
    Extend the merge method with either a simple by option to specify the key to merge on, add a mergeWith method, or a MergeOptions parameter that contains information such as by (the key to join on) and mergeType (inner vs. outer join, left vs. right join).

    https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html

    opened by adamsar 2
  • Extend FeatureRanking interface for regression tasks

    It may be useful to have a feature ranking procedure applicable not only to classification tasks (e.g. SignalNoiseRatio and SumSquaresRatio, implementing the FeatureRanking interface), but also to regression tasks. At the moment the FeatureRanking interface only accepts integer target vectors for calculating the feature rank.

    opened by sabbatinif 2
  • TreeSHAP Values are inconsistent

    Describe the bug
    When calculating TreeSHAP values for random forest classification, they don't add up. I would expect that the prediction from .vote() minus the respective SHAP values gives me the base value, which is constant and should be the same for different observations. Note that this is the behaviour we observe in Lundberg's Python module. Also, it would be really handy if there were a function that just calculates the base value (expected_value in Python) for me.

    Expected behavior
    When calculating TreeSHAP values, I expect them to add up, together with the base value, to the predicted probability.

    Actual behavior
    Calculated base values vary even for observations from the same class.

    Code snippet

    val iris = read.arff("../data/weka/iris.arff")
    
    val formula: Formula = "class" ~
    val x = formula.x(iris).toArray
    val y = formula.y(iris).toIntArray
    
    val model = smile.classification.randomForest(formula,iris)
    
    val arr50 = new Array[Double](3)
    val arr52 = new Array[Double](3)
    
    model.vote(iris(50),arr50)
    model.vote(iris(52),arr52)
    
    val shap_50 = model.shap(iris(50))
    val shap_52 = model.shap(iris(52))
    
    arr50(1)-shap_50.indices.filter(x => (x+2) % 3 == 0).map(shap_50).sum
    // res15: Double = 0.41123849878987584
    arr52(1)-shap_52.indices.filter(x => (x+2) % 3 == 0).map(shap_52).sum
    // res16: Double = 0.4571260444068466
    

    Input data Iris data set

    Additional context

    • using smile from the Try-It-Online binder
    opened by ntrost-targ 11
  • HDBSCAN

    Would be very useful to have the HDBSCAN clustering algorithm in addition to the regular DBSCAN. HDBSCAN really is state of the art, and having to use an additional ML library just for this is very inconvenient.

    new feature 
    opened by telhoc 0
Releases(v3.0.0)
  • v3.0.0(Dec 15, 2022)

    1. Java Module friendly with auto module name
    2. Redesigned feature engineering packages (missing value imputation, transform, selection, extraction, importance)
    3. One-class SVM
    4. Isolation forest
    5. Feature Hashing
    6. One-way ANOVA
    7. BigMatrix supporting more than 2 billion elements
    8. Latin hypercube sampling
    9. CLI supports training, batch prediction, endpoint, etc.
    10. Bug fixes
  • v2.6.0(Dec 5, 2020)

    • Spark integration (thanks Pierre Nodet)
    • t-SNE is 6X faster (thanks Brault Olivier-O)
    • Fully redesigned Gaussian Process Regression with HPO
    • L-BFGS-B
    • Matern kernel and composed kernels
    • Fully redesigned model validation facilities and metrics
    • Various optimization and bug fixes
  • v2.5.3(Sep 19, 2020)

  • v2.5.2(Sep 6, 2020)

  • v2.5.1(Aug 17, 2020)

  • v2.5.0(Jul 23, 2020)

  • v2.4.0(May 1, 2020)

    • All new declarative data visualization
    • TreeSHAP (contributed by Ray Ma @rayeaster)
    • UMAP (contributed by Ray Ma @rayeaster)
    • Levenberg-Marquardt algorithm
    • The packages smile-cas and smile-vega are merged into the smile-scala package
    • Spark integration in smile-spark
    • NLP in Kotlin
    • Grid search and random search for hyperparameter tuning
    • Bug fixes
    • Smile Shell is based on Scala REPL (2.13.2) again
    • DataFrame and Tuple -> JSON
    • Kotlin and Clojure notebooks

    Kudos to Ray Ma @rayeaster for great contributions!

  • v2.3.0(Apr 1, 2020)

  • v2.2.2(Mar 12, 2020)

  • v2.2.0(Feb 29, 2020)

    The CAS module is a computer algebra system that has the ability to manipulate mathematical expressions in a way similar to the traditional manual computations of mathematicians and scientists.

    The symbolic manipulations supported include:

    • simplification to a smaller expression or some standard form, including automatic simplification with assumptions and simplification with constraints

    • substitution of symbols or numeric values for certain expressions

    • change of form of expressions: expanding products and powers, partial and full factorization, rewriting as partial fractions, constraint satisfaction, rewriting trigonometric functions as exponentials, transforming logic expressions, etc.

    • partial and total differentiation

    • matrix operations including products, inverses, etc.

  • v2.1.0(Jan 29, 2020)

  • v2.0.0(Nov 22, 2019)

    Smile has been fully rewritten with more than 150,000 lines changed.

    • Fully redesigned API. It is leaner, simpler and even more friendly.
    • Faster implementation and memory optimization. Many algorithms are fully reimplemented. RandomForest is 8X faster than XGBoost on large benchmark data (10MM samples).
    • New parallelism mechanism
    • All new DataFrame and Formula
    • New algorithms such as ICA, error reduction pruning, quantile loss, TWCNB, etc.
    • Support for arbitrary class labels.
    • Enhanced and hardened numeric computations.
    • Support Parquet, SAS, Arrow, Avro, etc.
    • Bug fixes.
  • v1.5.3(Jun 2, 2019)

  • v1.5.2(Oct 15, 2018)

  • v1.5.1(Feb 26, 2018)

  • v1.5.0(Nov 10, 2017)

    1. DataFrame
    2. New Shell for Mac and Linux
    3. Shell improvement for Windows
    4. Out of box support of native LAPACK for Windows
    5. Scala functions to export AttributeDataset, double[][], double[] to ARFF or CSV
    6. Scala functions for validation measures
    7. Refactor feature transformation and generation classes
    8. NeuralNetwork for regression
    9. Recursive least squares
    10. Refactor Scala NLP API
    11. Bug fixes
  • v1.4.0(Aug 6, 2017)

    1. Add smile-netlib module that leverages native BLAS/LAPACK for matrix computation. See below for details on how to enable it.
    2. Add t-SNE implementation.
    3. Improve LLE and Laplacian Eigenmaps performance.
    4. Export DecisionTree and RegressionTree to Graphviz dot file for visualization.
    5. Smile shell is now based on Scala 2.12.
    6. Bug fixes.

    To enable machine-optimized matrix computation, add the dependency on smile-netlib:

        <dependency>
          <groupId>com.github.haifengl</groupId>
          <artifactId>smile-netlib</artifactId>
          <version>1.4.0</version>
        </dependency>
    

    and also make your machine-optimised libblas3 (CBLAS) and liblapack3 (Fortran) available as shared libraries at runtime.

    OS X

    Apple OS X requires no further setup as it ships with the veclib framework.

    Linux

    Generically-tuned ATLAS and OpenBLAS are available with most distributions and must be enabled explicitly using the package-manager. For example,

    • sudo apt-get install libatlas3-base libopenblas-base
    • sudo update-alternatives --config libblas.so
    • sudo update-alternatives --config libblas.so.3
    • sudo update-alternatives --config liblapack.so
    • sudo update-alternatives --config liblapack.so.3

    However, these are only generic pre-tuned builds. If you have an Intel MKL licence, you could also create symbolic links from libblas.so.3 and liblapack.so.3 to libmkl_rt.so or use Debian's alternatives system.

    Windows

    The native_system builds expect to find libblas3.dll and liblapack3.dll on the %PATH% (or in the current working directory). Besides vendor-supplied implementations, OpenBLAS provides generically tuned binaries, and it is possible to build ATLAS.

  • v1.3.0(Mar 27, 2017)

  • v1.2.1(Dec 2, 2016)

  • v1.2.0(Aug 16, 2016)

    The key features of the 1.2.0 release are:

    • Headless plot. Smile's plot functions depend on Java Swing. In server applications, we need to generate plots without creating Swing windows. With headless plot (enabled by the -Djava.awt.headless=true JVM option), we can create plots as follows:

          val canvas = ScatterPlot.plot(x, '.')

          val headless = new Headless(canvas)
          headless.pack()
          headless.setVisible(true)

          canvas.save(new java.io.File("zone.png"))

    • All classification and regression models can be serialized by
    write(model) // Java serialization
    

    or

    write.xstream(model) // XStream serialization
    
    • Refactor of smile.io Scala API.
      • Parsers are in smile.read object.
      • Parse JDBC ResultSet to AttributeDataset.
      • Model serialization methods in smile.write object.
    • Platt scaling for SVM
    • Smile NLP tokenizers are unicode-aware.
    • Least squares can now handle rank-deficient problems.
    • Various code improvements.
  • v1.1.0(Mar 17, 2016)
