Tribuo - A Java machine learning library

Overview

Tribuo - A Java prediction library (v4.2)

Tribuo is a machine learning library in Java that provides multi-class classification, regression, clustering, anomaly detection and multi-label classification. Tribuo provides implementations of popular ML algorithms and also wraps other libraries to provide a unified interface. Tribuo contains all the code necessary to load, featurise and transform data. Additionally, it includes the evaluation classes for all supported prediction types. Development is led by Oracle Labs' Machine Learning Research Group; we welcome community contributions.

All trainers are configurable using the OLCUT configuration system. This allows a user to define a trainer in an XML file and repeatably build models. Example configurations for each of the supplied Trainers can be found in the config folder of each package. These configuration files can also be written in JSON or edn by using the appropriate OLCUT configuration dependency. Models and datasets are serializable using Java serialization.
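
As a minimal sketch (the file name and component name here are illustrative, assuming a config file which defines a trainer component), loading a configured trainer looks like:

import com.oracle.labs.mlrg.olcut.config.ConfigurationManager;
import org.tribuo.Trainer;
import org.tribuo.classification.Label;

// Load the OLCUT configuration file.
var cm = new ConfigurationManager("example-config.xml");
// Look up a configured trainer by its component name, casting to the configured type.
var trainer = (Trainer<Label>) cm.lookup("my-trainer");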

All models and evaluations include a serializable provenance object which records the creation time of the model or evaluation, the identity of the data and any transformations applied to it, as well as the hyperparameters of the trainer. In the case of evaluations, this provenance information also includes the specific model used. Provenance information can be extracted as JSON, or serialised directly using Java serialisation. For production deployments, provenance information can be redacted and replaced with a hash to provide model tracking through an external system. Many Tribuo models can be exported in ONNX format for deployment in other languages, platforms or cloud services.
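
For example (a minimal sketch, assuming a trained model in scope), the provenance can be inspected directly:

import org.tribuo.provenance.ModelProvenance;

ModelProvenance provenance = model.getProvenance();
System.out.println(provenance.getTrainerProvenance()); // trainer hyperparameters
System.out.println(provenance.getDatasetProvenance()); // data identity and transformations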

Tribuo runs on Java 8+, and we test on LTS versions of Java along with the latest release. Tribuo itself is a pure Java library and is supported on all Java platforms; however, some of our interfaces require native code and are thus supported only where there is native library support. We test on x86_64 architectures on Windows 10, macOS and Linux (RHEL/OL/CentOS 7+), as these are supported platforms for the native libraries with which we interface. If you're interested in another platform and wish to use one of the native library interfaces (ONNX Runtime, TensorFlow, and XGBoost), we recommend reaching out to the developers of those libraries. Note the reproducibility package requires Java 17, and as such is not part of the tribuo-all Maven Central deployment.

Documentation

Tutorials

Tutorial notebooks, including examples of Classification, Clustering, Regression, Anomaly Detection, TensorFlow, document classification, columnar data loading, working with externally trained models, and the configuration system, can be found in the tutorials. These use the IJava Jupyter notebook kernel, and work with Java 10+, except the reproducibility tutorial which requires Java 17. To convert the tutorials' code back to Java 8, in most cases simply replace the var keyword with the appropriate types.

Algorithms

General predictors

Tribuo includes implementations of several algorithms suitable for a wide range of prediction tasks:

Algorithm Implementation Notes
Bagging Tribuo Can use any Tribuo trainer as the base learner
Random Forest Tribuo For both classification and regression
Extra Trees Tribuo For both classification and regression
K-NN Tribuo Includes options for several parallel backends, as well as a single threaded backend
Neural Networks TensorFlow Train a neural network in TensorFlow via the Tribuo wrapper. Models can be deployed using the ONNX interface or the TF interface

The ensembles and K-NN use a combination function to produce their output. These combiners are prediction task specific, but the ensemble & K-NN implementations are task agnostic. We provide voting and averaging combiners for multi-class classification, multi-label classification and regression tasks.
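
For instance, a bagged ensemble of classification trees using the voting combiner can be constructed like this (a minimal sketch; the hyperparameters are illustrative):

import org.tribuo.classification.dtree.CARTClassificationTrainer;
import org.tribuo.classification.ensemble.VotingCombiner;
import org.tribuo.ensemble.BaggingTrainer;

// 10 CART trees whose predictions are aggregated by majority vote.
var ensemble = new BaggingTrainer<>(new CARTClassificationTrainer(), new VotingCombiner(), 10);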

Classification

Tribuo has implementations or interfaces for:

Algorithm Implementation Notes
Linear models Tribuo Uses SGD and allows any gradient optimizer
Factorization Machines Tribuo Uses SGD and allows any gradient optimizer
CART Tribuo
SVM-SGD Tribuo An implementation of the Pegasos algorithm
Adaboost.SAMME Tribuo Can use any Tribuo classification trainer as the base learner
Multinomial Naive Bayes Tribuo
Regularised Linear Models LibLinear
SVM LibSVM or LibLinear LibLinear only supports linear SVMs
Gradient Boosted Decision Trees XGBoost

Tribuo also supplies a linear chain CRF for sequence classification tasks. This CRF is trained via SGD using any of Tribuo's gradient optimizers.

To explain classifier predictions there is an implementation of the LIME algorithm. Tribuo's implementation allows the mixing of text and tabular data, along with the use of any sparse model as an explainer (e.g., regression trees, lasso, etc.); however, it does not support images.
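
As a minimal end-to-end sketch of training and evaluating one of the classifiers above (the file name and response column name are illustrative), logistic regression looks like:

import java.nio.file.Paths;
import org.tribuo.MutableDataset;
import org.tribuo.classification.LabelFactory;
import org.tribuo.classification.evaluation.LabelEvaluator;
import org.tribuo.classification.sgd.linear.LogisticRegressionTrainer;
import org.tribuo.data.csv.CSVLoader;

// Load a CSV file whose response column is named "response".
var csvLoader = new CSVLoader<>(new LabelFactory());
var dataset = new MutableDataset<>(csvLoader.loadDataSource(Paths.get("train.csv"), "response"));

// Train a linear model via SGD and print the evaluation.
var model = new LogisticRegressionTrainer().train(dataset);
System.out.println(new LabelEvaluator().evaluate(model, dataset));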

Regression

Tribuo's regression algorithms are multidimensional by default. Single dimensional implementations are wrapped in order to produce multidimensional output (see the sketch after the table below).

Algorithm Implementation Notes
Linear models Tribuo Uses SGD and allows any gradient optimizer
Factorization Machines Tribuo Uses SGD and allows any gradient optimizer
CART Tribuo
Lasso Tribuo Using the LARS algorithm
Elastic Net Tribuo Using the co-ordinate descent algorithm
Regularised Linear Models LibLinear
SVM LibSVM or LibLinear LibLinear only supports linear SVMs
Gradient Boosted Decision Trees XGBoost
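
As a sketch of the multidimensional output (dimension names are illustrative, and trainData/testData are assumed to be in scope), a Regressor holds named dimensions and the evaluation reports per-dimension metrics:

import org.tribuo.regression.Regressor;
import org.tribuo.regression.evaluation.RegressionEvaluator;
import org.tribuo.regression.rtree.CARTRegressionTrainer;

// A two dimensional regression target.
var target = new Regressor(new String[]{"dim-0","dim-1"}, new double[]{0.5, 1.5});

// Train a regression tree and evaluate it, printing per-dimension metrics.
var model = new CARTRegressionTrainer().train(trainData);
System.out.println(new RegressionEvaluator().evaluate(model, testData));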

Clustering

Tribuo includes infrastructure for clustering and supplies two clustering algorithm implementations (see the sketch after the table below). We expect to implement additional algorithms over time.

Algorithm Implementation Notes
HDBSCAN* Tribuo
K-Means Tribuo Includes both sequential and parallel backends, and the K-Means++ initialisation algorithm
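
As a minimal sketch (assuming the v4.2 constructor signature and a clustering dataset in scope; all hyperparameters are illustrative), K-Means can be used like this:

import org.tribuo.clustering.kmeans.KMeansTrainer;
import org.tribuo.clustering.kmeans.KMeansTrainer.Distance;

// 5 centroids, 10 iterations, Euclidean distance, 4 threads, RNG seed 1.
var trainer = new KMeansTrainer(5, 10, Distance.EUCLIDEAN, 4, 1);
var model = trainer.train(dataset);
System.out.println(model.getCentroids());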

Anomaly Detection

Tribuo offers infrastructure for anomaly detection tasks. We expect to add new implementations over time.

Algorithm Implementation Notes
One-class SVM LibSVM
One-class linear SVM LibLinear

Multi-label classification

Tribuo offers infrastructure for multi-label classification, along with a wrapper which converts any of Tribuo's multi-class classification algorithms into a multi-label classification algorithm (see the sketch after the table below). We expect to add more multi-label specific implementations over time.

Algorithm Implementation Notes
Independent wrapper Tribuo Converts a multi-class classification algorithm into a multi-label one by producing a separate classifier for each label
Classifier Chains Tribuo Provides classifier chains and randomized classifier chain ensembles using any of Tribuo's multi-class classification algorithms
Linear models Tribuo Uses SGD and allows any gradient optimizer
Factorization Machines Tribuo Uses SGD and allows any gradient optimizer
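
A minimal sketch of the independent wrapper (assuming a multi-label dataset in scope):

import org.tribuo.classification.sgd.linear.LogisticRegressionTrainer;
import org.tribuo.multilabel.baseline.IndependentMultiLabelTrainer;

// Builds one binary logistic regression classifier per label.
var trainer = new IndependentMultiLabelTrainer(new LogisticRegressionTrainer());
var model = trainer.train(multiLabelDataset);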

Interfaces

In addition to our own implementations of Machine Learning algorithms, Tribuo also provides a common interface to popular ML tools on the JVM. If you're interested in contributing a new interface, open a GitHub Issue, and we can discuss how it would fit into Tribuo.

Currently we have interfaces to:

  • LibLinear - via the LibLinear-java port of the original LibLinear (v2.43).
  • LibSVM - using the pure Java transformed version of the C++ implementation (v3.25).
  • ONNX Runtime - via the Java API contributed by our group (v1.9.0).
  • TensorFlow - Using TensorFlow Java v0.4.0 (based on TensorFlow v2.7.0). This allows the training and deployment of TensorFlow models entirely in Java.
  • XGBoost - via the built-in XGBoost4J API (v1.5.0).

Binaries

Binaries are available on Maven Central, using groupId org.tribuo. To pull all of Tribuo, including the bindings for TensorFlow, ONNX Runtime and XGBoost (which are native libraries), use:

Maven:

<dependency>
    <groupId>org.tribuo</groupId>
    <artifactId>tribuo-all</artifactId>
    <version>4.2.0</version>
    <type>pom</type>
</dependency>

or from Gradle:

implementation ("org.tribuo:tribuo-all:4.2.0@pom") {
    transitive = true // for build.gradle (i.e., Groovy)
    // isTransitive = true // for build.gradle.kts (i.e., Kotlin)
}

The tribuo-all dependency is a pom which depends on all the Tribuo subprojects except for the reproducibility project which requires Java 17.

Most of Tribuo is pure Java and thus cross-platform, however some of the interfaces link to libraries which use native code. Those interfaces (TensorFlow, ONNX Runtime and XGBoost) only run on supported platforms for the respective published binaries, and Tribuo has no control over which binaries are supplied. If you need support for a specific platform, reach out to the maintainers of those projects. As of the 4.1 release these native packages all provide x86_64 binaries for Windows, macOS and Linux. It is also possible to compile each package for macOS ARM64 (i.e., Apple Silicon), though there are no binaries available on Maven Central for that platform. When developing on an ARM platform you can select the arm profile in Tribuo's pom.xml to disable the native library tests.

Individual jars are published for each Tribuo module. It is preferable to depend only on the modules necessary for the specific project. This prevents your code from unnecessarily pulling in large dependencies like TensorFlow.
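
For example, to depend only on the linear SGD classifiers, pull in just that module:

<dependency>
    <groupId>org.tribuo</groupId>
    <artifactId>tribuo-classification-sgd</artifactId>
    <version>4.2.0</version>
</dependency>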

Compiling from source

Tribuo uses Apache Maven v3.5 or higher to build. Tribuo is compatible with Java 8+, and we test on LTS versions of Java along with the latest release. To build, simply run mvn clean package. All of Tribuo's dependencies should be available on Maven Central. Please file an issue if you run into build problems (though do check whether you're missing proxy settings for Maven first, as that's a common cause of build failures, and out of our control).

Repository Layout

Development happens on the main branch, which has the version number of the next Tribuo release with "-SNAPSHOT" appended to it. Tribuo major and minor releases will be tagged on the main branch, and then have a branch named vA.B.X-release-branch (for release vA.B.0) branched from the tagged release commit for any point releases (i.e., vA.B.1, vA.B.2 etc) following from that major/minor release. Those point releases are tagged on the specific release branch e.g., v4.0.2 is tagged on the v4.0.X-release-branch.

Contributing

We welcome contributions! See our contribution guidelines.

We have a discussion mailing list [email protected], archived here. We're investigating different options for real-time chat; check back in the future. For bug reports, feature requests or other issues, please file a GitHub Issue.

Security issues should follow our reporting guidelines.

License

Tribuo is licensed under the Apache 2.0 License.

Release Notes:

  • v4.2.0 - Added factorization machines, classifier chains, HDBSCAN. Added ONNX export and OCI Data Science integration. Added reproducibility framework. Various other small fixes and improvements, including the regression fixes from v4.1.1. Filled out the remaining javadoc, added 4 new tutorials (onnx export, multi-label classification, reproducibility, hdbscan), expanded existing tutorials.
  • v4.1.1 - Bug fixes for multi-output regression, multi-label evaluation, and KMeans & KNN with a SecurityManager, plus an update to TF-Java 0.4.0.
  • v4.1.0 - Added TensorFlow training support, a BERT feature extractor, ExtraTrees, K-Means++, many linear model & CRF performance improvements, new tutorials on TF and document classification. Many bug fixes & documentation improvements.
  • v4.0.2 - Many bug fixes (CSVDataSource, JsonDataSource, RowProcessor, LibSVMTrainer, Evaluations, Regressor serialization). Improved javadoc and documentation. Added two new tutorials (columnar data and external models).
  • v4.0.1 - Bugfix for CSVReader to cope with blank lines, added IDXDataSource to allow loading of native MNIST format data.
  • v4.0.0 - Initial public release.
  • v3 - Added the provenance system, external model support and ONNX integrations.
  • v2 - Expanded beyond a classification system, to support regression, clustering and multi-label classification.
  • v1 - Initial internal release. This release only supported multi-class classification.
Comments
  • Documentation: Proper DataSource format and usage for K-Means Clustering

    Is your feature request related to a problem? Please describe. Still a newbie to this library, so thanks for bearing with me.

    Right now, the documentation shows how to run K-Means clustering on an auto-generated data set of Gaussian clusters. This is great, as it shows K-Means is possible, but (unless I'm missing something) it does not show the steps to input real data. (It mentions "You can also use any of the standard data loaders to pull in clustering data." but I don't see where that's documented.)

    I've figured out how to load a CSV file of features and metadata (thanks to your new Columnar tutorial), but I can't seem to infer how to connect this data with KMeansTrainer, or if that's even the right approach.

    Describe the solution you'd like A clear and concise description/example of how to load real-world (non-autogenerated) data into the K-Means algorithm.

    Describe alternatives you've considered Looking through JavaDocs, but having trouble knowing what to focus on.

    Additional context (screenshot attached)

    enhancement 
    opened by lincolnthree 32
  • Tribuo KmeansTrainer doesn't work with Java 11 (or any later version) + SecurityManager because of using ForkJoinPool

    Describe the bug From Java 11 onwards, security policies don't get propagated to ForkJoinPool worker threads if a SecurityManager is present (https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/concurrent/ForkJoinPool.html). KMeans depends on ForkJoinPool to run the training tasks: https://github.com/oracle/tribuo/blob/v4.1.0/Clustering/KMeans/src/main/java/org/tribuo/clustering/kmeans/KMeansTrainer.java#L197 So any Java application relying on a SecurityManager and security policies won't be able to work with Tribuo KMeans because of an AccessControlException.

    To Reproduce We have a Java application with a security manager and security policy; it works with Tribuo KMeans on Java 8. But when the Java application (with Tribuo KMeans) runs on Java 11, Java 14 or Java 15, there is an AccessControlException: java.security.AccessControlException: access denied ("java.lang.RuntimePermission" "modifyThreadGroup")

    Expected behaviour The security policy defined by the Java application should be applied to any threads.

    System information:

    • OS: [e.g., Linux]
    • Java Version: [e.g., 11, 14 & 15]
    • JDK Vendor: [e.g., Oracle & OpenJDK ]

    Additional context Reference URLs: https://stackoverflow.com/questions/55333290/jdk11-security-policies-doesn-t-get-propagated-to-forkjoinpool-worker-threads http://5.9.10.113/66435138/how-to-perform-the-parallel-execution-under-a-collection-when-securitymanager-is http://mail.openjdk.java.net/pipermail/core-libs-dev/2013-December/024044.html https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/concurrent/ForkJoinPool.html

    bug 
    opened by spbjss 25
  • It's not a bug: can't import LogisticRegression trainer and other packages?

    It's not a bug, I know, but I'm unable to import all the packages after successfully downloading all the dependencies from Maven. LogisticRegression is not present on the classpath, and other evaluation methods are not importing; I don't know why. Some are already imported but others are not. Why? Preprocessing classes like GridSearchCV are also not present.

    dependency problem 
    opened by nikkisingh111333 21
  • HDBSCAN testing

    We have synthetically generated training and (anomaly) test data (attached herewith) for testing Tribuo HDBSCAN. We first trained a model with the Python HDBSCAN and predicted the anomalies from the synthetic test/anomaly data.

    Then we used the same data to create a Tribuo HDBSCAN model following its standard workflow (the crux of the code flow is attached herewith for reference) and used the test data for predictions. However, the Tribuo model did not predict any anomalies against the test data, including when I played around with changing the HDBSCAN parameters. All the predicted scores were pretty low, like 0.10809591, using the getScore() method from the prediction. My understanding from the Tribuo tutorial is that low predicted scores like 0.1, 0.2, etc. are not anomalies and scores >= 0.8 are anomalies. Is that correct?

    Also, in general w.r.t. Tribuo, I was able to test that Tribuo HDBSCAN model training and prediction work fine with the real-world, large cardholder labelled dataset mentioned as a benchmark against Python HDBSCAN in the Tribuo HDBSCAN paper, and I was able to verify its labelled anomalies. Further, the smaller toy dataset in the Tribuo repo also works fine.

    Based on this, my question is: what might be the possible reasons that I am unable to get anomaly outputs from Tribuo on my test data vis-à-vis Python HDBSCAN? Does Tribuo's HDBSCAN require more training data than its Python counterpart, or does it depend on any characteristics of the data?

    Attachments:-

    train and test data tribuo-syn-test-data.zip

    crux of the hdbscan code flow. tribuo_code_crux.txt

    question 
    opened by nsankar 15
  • transforming data?

    Hey, I want a good tutorial on data transformation. I'm following the docs but getting confused by similar terminology like TransformerMap, transform, Transformation, TransformationMap, etc. Can you point me to a quick start for dataset transformation, so that I can easily apply normalization, binning, vectorization and all kinds of data processing?

    documentation 
    opened by nikkisingh111333 13
  • fix: addressing indeterminate Map orderings in tests

    Description

    Using NonDex, when running the command mvn edu.illinois:nondex-maven-plugin:1.1.2:nondex in the tests directory after installing all dependencies, several tests in CSVSaverWithMultiOutputsTest.java and EnsembleExportTest.java produce flaky results (they nondeterministically pass or fail). The reason is that the test outputs depend on iterators over HashMaps, which do not guarantee a fixed order by default. (From the Javadoc: "HashMap class makes no guarantees as to the order of the map; in particular, it does not guarantee that the order will remain constant over time.") For example, in the function save(Path csvPath, Dataset<T> dataset, Set<String> responseNames) in CSVSaver.java, the code uses the iterator defined for ImmutableFeatureMap, which deals with a Collection of values in a HashMap. Therefore, there is no guaranteed order for the entries in headerLine[], which leads to possible test flakiness.

    To resolve the problem, I changed the code to construct a sorted ArrayList for all keys in the referenced HashMap to ensure determinate iteration order.

    Motivation

    The change is necessary as related functionalities / tests may fail when, for example, the Java version upgrades in the future, or when the code is run in a different environment.

    OCA

    https://oca.opensource.oracle.com/api/v1/oca-requests/6710/documents/6730/download

    OCA signed 
    opened by kaiyaok2 12
  • [Question] Why is true negative represented by 'n' in the classification matrix?

    Describe the bug

    In the confusion matrix:

    Class                           n          tp          fn          fp      recall        prec          f1
    Iris-versicolor                16          15           1           0       0.938       1.000       0.968
    Iris-virginica                 15          15           0           1       1.000       0.938       0.968
    Iris-setosa                    14          14           0           0       1.000       1.000       1.000
    Total                          45          44           1           1
    

    The title/label for True negative is shown as n instead of tn

    Expected behaviour

    Most documentation on confusion matrices that I have seen so far represents it as tn.

    It might lead to doubts for those who are aware of the standard representations, especially since the dependent metrics like recall, precision, f1, accuracy, etc. are made up of these base metrics (and true negatives is one of them).

    documentation 
    opened by neomatrix369 12
  • [Docs] Discrepancy in Gradle instruction between docs on Tribuo.org and GitHub README.md

    Describe the bug

    The Gradle tab in https://tribuo.org/#gradle says:

    implementation ("org.tribuo:tribuo-all:4.0.1@pom") {
        transitive = true // for Groovy
        // isTransitive = true // for Kotlin
    }
    

    while the README on GitHub says:

    api 'org.tribuo:tribuo-all:4.0.0@pom'
    

    The former works in Gradle 6.5; the latter does not, and gives the error below:

    * What went wrong:
    A problem occurred evaluating root project 'tribuo-classification'.
    > Could not find method api() for arguments [org.tribuo:tribuo-all:4.0.1@pom] on object of 
    type org.gradle.api.internal.artifacts.dsl.dependencies.DefaultDependencyHandler.
    
    

    Expected behaviour

    Both docs should refer to the same information. It would also help if there were a bit more context in the Gradle docs, i.e.:

    dependencies {
        implementation ("org.tribuo:tribuo-all:4.0.1@pom")
        ....
    }
    
    bug documentation 
    opened by neomatrix369 12
  • Tutorial: evaluation of the XGBoost ensemble training fails

    Describe the bug

    Running the below cell in the Regression tutorial fails:

    var xgbModel = train("XGBoost",xgb,trainData);
    evaluate(xgbModel,evalData);
    

    with these results:

    ---------------------------------------------------------------------------
    java.lang.NoClassDefFoundError: Could not initialize class ml.dmlc.xgboost4j.java.XGBoostJNI
    	at ml.dmlc.xgboost4j.java.DMatrix.<init>(DMatrix.java:109)
    	at org.tribuo.common.xgboost.XGBoostTrainer.convertExamples(XGBoostTrainer.java:309)
    	at org.tribuo.regression.xgboost.XGBoostRegressionTrainer.train(XGBoostRegressionTrainer.java:174)
    	at org.tribuo.regression.xgboost.XGBoostRegressionTrainer.train(XGBoostRegressionTrainer.java:64)
    	at org.tribuo.Trainer.train(Trainer.java:44)
    	at .train(#45:4)
    	at .do_it$Aux(#57:1)
    	at .(#57:1)
    
    

    To Reproduce

    Run the notebook in a Docker container using these steps at https://github.com/neomatrix369/awesome-ai-ml-dl/tree/master/examples/tribuo:

    git clone https://github.com/neomatrix369/awesome-ai-ml-dl/tree/master/
    cd awesome-ai-ml-dl/examples/tribuo
    ./docker-runner.sh --notebookMode --runContainer
    
    ### wait while it downloads the container and the browser opens up
    ### or open the browser at http://localhost:8888/notebooks/tribuo/tutorials/regression-tribuo-v4.ipynb
    

    Expected behaviour

    Should have shown these results:

    Training XGBoost took (00:00:00:375)
    Evaluation (train):
      RMSE 0.143871
      MAE 0.097167
      R^2 0.968252
    Evaluation (test):
      RMSE 0.599478
      MAE 0.426673
      R^2 0.447378
    

    Screenshots

    (screenshot attached)

    System information:

    • OS: Linux 6a5b46663314 4.19.76-linuxkit #1 SMP Tue May 26 11:42:35 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
    • Java Version: 11
    • JDK Vendor: openjdk version "11.0.5" 2019-10-15 OpenJDK Runtime Environment (build 11.0.5+10-jvmci-19.3-b05-LTS) OpenJDK 64-Bit GraalVM CE 19.3.0 (build 11.0.5+10-jvmci-19.3-b05-LTS, mixed mode, sharing)

    ** Jar versions **

    tribuo-classification-experiments-4.0.0-jar-with-dependencies.jar
    tribuo-core-4.0.0.jar
    tribuo-json-4.0.0-jar-with-dependencies.jar
    tribuo-regression-sgd-4.0.0-jar-with-dependencies.jar
    tribuo-regression-tree-4.0.0-jar-with-dependencies.jar
    tribuo-regression-xgboost-4.0.0-jar-with-dependencies.jar
    
    bug 
    opened by neomatrix369 12
  • About the documentation.

    https://tribuo.org/learn/4.0/docs/

    Isn't the documentation and implementation code listed at the above URL wrong?

    For example, the variable definition is 'irisData', but the variable reference says 'irisesSource'. This will result in a compilation error.

    Here's what else is going on:

    For example, calls to constructors that LibSVMDataSource does not have. https://qiita.com/jashika/items/aa8ca340deb81a59cbea

    Calls to methods that do not exist, class names that are likely to be different, etc. https://qiita.com/jashika/items/1ba0cad613ec919adaa7

    documentation enhancement 
    opened by jashika-oss 12
  • Shut down ForkJoinPool in KMeansTrainer

    Describe the bug

    KMeansTrainer creates a ForkJoinPool to train the KMeans model in parallel; code link: https://github.com/oracle/tribuo/blob/main/Clustering/KMeans/src/main/java/org/tribuo/clustering/kmeans/KMeansTrainer.java#L284-L293

    But the ForkJoinPool isn't shut down when training completes. As KMeansTrainer doesn't set a keepAliveTime for the ForkJoinPool, the default keep-alive time is 60 seconds. Refer to https://docs.oracle.com/javase/9/docs/api/java/util/concurrent/ForkJoinPool.html

    keepAliveTime - the elapsed time since last use before a thread is terminated (and then later replaced if needed). For the default value, use 60, TimeUnit.SECONDS.

    It's risky when K-Means training requests come in at high volume within a short time (< 60 seconds).

    bug 
    opened by ylwu-amzn 11
  • getClusters in HDBSCAN transformed model

    The question is: how do you get the cluster information from a transformed model in HDBSCAN?

    I have a transformed HDBSCAN model (linear transformation):

    TransformTrainer transformed = new TransformTrainer(HDBSCANTrainer, transformations);
    TransformedModel transformedModel = (TransformedModel) transformed.train(dataset);

    The transformedModel does not have a method to get the clusters (getClusters). I tried using getInnerModel to treat it as a normal HDBSCAN model as below, but the output is an empty list:

    HdbscanModel hdbscanInner = (HdbscanModel) transformedModel.getInnerModel();
    System.out.println(hdbscanInner.getClusters().size());

    Wondering how to call getClusters() on the transformed model. Kindly suggest.

    question 
    opened by nsankar 2
  • Add a simple way to use preprocessors with SimpleTextDataSource

    I just tried using SimpleTextDataSource with a custom preprocessor but ended up creating a custom subclass of TextDataSource just for this.

    I think it would be great if the class SimpleTextDataSource accepted a list of DocumentPreprocessors in its constructors' arguments. The parent class has these arguments.

    enhancement 
    opened by olivierceulemans 1
  • is HDBSCAN model export possible?

    Ask the question Is it possible to export an HDBSCAN model via ONNX? No such functionality exists in Python as far as I am aware.

    Is your question about a specific ML algorithm or approach? This is about clustering, specifically the HDBSCAN method

    Is your question about a specific Tribuo class? HdbscanModel.java

    System details

    • Tribuo version
    • Java version (if appropriate)
    • OS/Architecture (if appropriate)

    Additional context It would be great if I could get some general pointers on how export/import of HDBSCAN could be achieved.

    question 
    opened by kalpit-konverge 1
  • Option to use MMIO for large Datasets

    Is your feature request related to a problem? Please describe. I'm frequently frustrated by OOMEs when building (or deserializing) Datasets that don't fit in heap memory (or even physical memory).

    Describe the solution you'd like I'd like to see an option in (or around) org.tribuo.Dataset to use mapped memory for storing Examples rather than on-heap.

    Describe alternatives you've considered I've considered subclassing Dataset and reimplementing everything that makes use of the data member, replacing it with an instance of Jan Kotek's MapDB, and using the existing protobuf implementations to marshall the Examples to/from storage. I also considered rolling my own MMIO-backed ISAM instead of MapDB, given how simple the use case is.

    The reason I've not yet done these is that my Datasets are computationally expensive to prepare; I need to serialize and deserialize them when spinning processes up and down, and the existing protobuf-based implementations all instantiate Datasets with on-heap storage.

    I've also considered buying a ton of physical memory. ;)

    enhancement 
    opened by handshape 3
  • Monitoring convergence in gradient-based optimization

    Is your feature request related to a problem? Please describe.

    Currently, model training via gradient-based optimization in Tribuo terminates after a fixed number of epochs. The main problem with maximum iteration number as a stopping criterion is that there is no relation between the stopping criterion and the optimality of the current iterate. It is difficult to know a priori how many epochs will be sufficient for a given training problem, and there are costs to over- or under-estimating this number (but especially underestimation).

    Describe the solution you'd like

    Ideally, for iterative gradient-based optimization we would be able to use problem-specific stopping criteria such as a threshold on relative reduction of the loss or the norm of the gradient. Typically these are accompanied by a (large) max-epoch cutoff to bound computation time and catch cases where the loss diverges. For stochastic algorithms we could also consider early stopping rules, for example based on the loss on a held-out validation set.

    Are there any plans to implement zero- or first-order stopping criteria for optimizers extending AbstractSGDTrainer? Are there other workarounds for checking convergence of the optimizer in the case of linear and logistic regression?

    Describe alternatives you've considered

    An alternative to implementing new stopping criteria could be to (optionally) report some metric(s) relevant to the specific training problem after training is "completed" according to the max-epoch rule. These could include the norm of the gradient or a sequence of loss values at each epoch.

    One alternative that does not work in general is to change the optimization algorithm from the standard SGD. All optimizers implement some form of iterative, gradient-based optimization, so they all face the same problem of enforcing an appropriate stopping criterion.

    enhancement 
    opened by greaa-aws 8
Releases (v4.3.1)
  • v4.3.1 (Dec 23, 2022)

    Small patch release to bump some dependencies and pull in minor fixes. The most notable fix allows CART trees to generate pure nodes, which previously they had been prevented from doing. This will likely improve the classification tree performance both for single trees and when used in an ensemble like RandomForests.

    • FeatureHasher should have an option to not hash the values, and TokenPipeline should default to not hashing the values (#309).
    • Improving the documentation for loading multi-label data with CSVLoader (#306).
    • Allows Example.densify to add arbitrary features (#304).
    • Adds accessors to ClassifierChainModel and IndependentMultiLabelModel so the individual models can be accessed (#302).
    • Allows CART trees to create pure leaves (#303).
    • Bumping jackson-core to 2.14.1, jackson-databind to 2.14.1, OpenCSV to 5.7.1 (pulling in the fixed commons-text 1.10.0).

  • v4.2.2 (Oct 25, 2022)

    Small patch release to bump some dependencies and pull in minor fixes:

    • Validate hash salt during object creation (#237).
    • Fix XGBoost parameter overriding (#239).
    • Add some necessary accessors to TransformedModel (#244).
    • Bumping TF-Java to v0.4.2 (#281).
    • Fixes for test failures when running in a path with spaces in (#287).
    • Fix documentation links to the OCA.
    • Bumping jackson-core to 2.13.4, jackson-databind to 2.13.4.2, protobuf-java to 3.19.6, OpenCSV to 5.7.1 (pulling in the fixed commons-text 1.10.0).

  • v4.3.0 (Oct 7, 2022)

    Tribuo v4.3 adds feature selection for classification problems, support for guided generation of model cards, and protobuf serialization for all serializable classes. In addition, there is a new interface for distance-based computations which can now use a kd-tree or brute-force comparisons, the sparse linear model package has been rewritten to use Tribuo's linear algebra system (improving speed and reducing memory consumption), and we've added some more tutorials.

    Note this is likely the last feature release of Tribuo to support Java 8. The next major version of Tribuo will require Java 17. In addition, support for using java.io.Serializable for serialization will be removed in the next major release, and Tribuo will exclusively use protobuf based serialization.

    Feature Selection

    In this release we've added support for feature selection algorithms to the dataset and provenance systems, along with implementations of four information-theoretic feature selection algorithms for use in classification problems. The algorithms (MIM, CMIM, mRMR and JMI) are described in this review paper. Continuous inputs are discretised into a fixed number of equal-width bins before the mutual information is computed. These algorithms are a useful feature selection baseline, and we welcome contributions to extend the set of supported algorithms.

    • Feature selection algorithms #254.

    Model Card Support

    Model Cards are a popular way of describing a model, its training data, expected applications and any use cases that should be avoided. In this release we've added guided generation of model cards, where many fields are automatically generated from the provenance information inside each Tribuo model. Fields which require user input (such as the expected use cases for a model, or its license) can be added via a CLI program, and the resulting model card can be saved in JSON format.

    At the moment, the automatic data extraction fails on some kinds of nested ensemble models which are generated without using a Tribuo Trainer class; in the future we'll look at improving the data extraction for this case.

    Protobuf Serialization

    In this release we've added protocol buffer definitions for serializing all of Tribuo's serializable types, along with the necessary code to interact with those definitions. This effort has improved the validation of serialized data, and will allow Tribuo models to be upwards compatible across major versions of Tribuo. Any serialized model or dataset from Tribuo v4.2 or earlier can be loaded in and saved out into the new format which will ensure compatibility with the next major version of Tribuo.

    • Protobuf support for core types (#226, #255, #262, #264).
    • Protobuf support for models (Multinomial Naive Bayes #267, Sparse linear models #269, XGBoost #270, OCI, ONNX and TF #271, LibSVM #272, LibLinear #273, SGD #275, Clustering models #276, Baseline models and ensembles #277, Trees #278).
    • Docs and supporting programs (#279).

    Smaller improvements

    We added an interface for querying the nearest neighbours of a vector, and updated HDBSCAN, K-Means and K-NN to use the new interface. The old implementation has been renamed the "brute force" search operator, and a new implementation which uses a kd-tree has been added.

    We migrated off Apache Commons Math, which necessitated adding several methods to Tribuo's math library. In the process we refactored the sparse linear model code, removing redundant matrix operations and greatly improving the speed of LASSO.

    • Refactor sparse linear models and remove Apache Commons Math (#241).

    The ONNX export support has been refactored to allow the use of different ONNX opsets, and custom ONNX operations. This allows users of Tribuo's ONNX export support to supply their own operations, and increases the flexibility of the ONNX support on the JVM.

    • ONNX operator refactor (#245).

    ONNX Runtime has been upgraded to v1.12.1, which includes Linux ARM64 and macOS ARM64 binaries. As a result we've removed the ONNX tests from the arm Maven profile, and so those tests will execute on Linux & macOS ARM64 platforms.

    • ONNX Runtime upgrade (#256).

    Small improvements

    • Improved the assignment to the noise cluster in HDBSCAN (#222).
    • Upgrade liblinear-java to v2.44 (#228).
    • Added accessors for the HDBSCAN cluster exemplars (#229).
    • Improve validation of salts when hashing feature names (#237).
    • Added accessors to TransformedModel for the wrapped model (#244).
    • Added a regex text preprocessor (#247).
    • Upgrade OpenCSV to v5.6 (#259).
    • Added a builder to RowProcessor to make it less confusing (#263).
    • Upgrade TF-Java to v0.4.2 (#281).
    • Upgrade OCI Java SDK to v2.46.0, protobuf-java to 3.19.6, XGBoost to 1.6.2, jackson to 2.14.0-rc1 (#288).

    Bug Fixes

    • Fix for HDBSCAN small cluster generation (#236).
    • XGBoost provenance capture (#239).

  • v4.2.1 (May 4, 2022)

    Small patch release for three issues:

    • Ensure K-Means thread pools shut down when training completes (#224)
    • Fix issues where ONNX export of ensembles, K-Means initialization and several tests relied upon HashSet iteration order (#220,#225)
    • Upgrade to TF-Java 0.4.1 which includes an upgrade to TF 2.7.1 which brings in several fixes for native crashes operating on malformed or malicious models (#228)

    OLCUT is updated to 5.2.1 to pull in updated versions of jackson & protobuf (#234). Also includes some docs and a small update for K-Means' toString (#209, #211, #212).

  • v4.2.0 (Dec 20, 2021)

    Tribuo 4.2 adds new models, ONNX export for several types of models, a reproducibility framework for recreating Tribuo models, easy deployment of Tribuo models on Oracle Cloud, along with several smaller improvements and bug fixes. We've added more tutorials covering the new features along with multi-label classification, and further expanded the javadoc to cover all public methods.

    In Tribuo 4.1.0 and earlier there is a severe bug in multi-dimensional regression models (i.e., regression tasks with multiple output dimensions). Models other than LinearSGDModel and SparseLinearModel (apart from when using the ElasticNetCDTrainer) have a bug in how the output dimension indices are constructed, and may produce incorrect outputs for all dimensions (as the output will be for a different dimension than the one named in the Regressor object). This has been fixed, and loading in models trained in earlier versions of Tribuo will patch the model to rearrange the dimensions appropriately. Unfortunately this fix cannot be applied to tree based models, and so all multi-output regression tree based models should be retrained using Tribuo 4.2 as they are irretrievably corrupt. Additionally when using standardization in multi-output regression LibSVM models dimensions past the first dimension have the model improperly stored and will also need to be retrained with Tribuo 4.2. See #177 for more details.

    Note the KMeans implementation had several internal changes to support running with a java.lang.SecurityManager which will break any subclasses of KMeansTrainer. In most cases changing the signature of any overridden mStep method to match the new signature, and allowing the fjp argument to be null in single threaded execution will fix the subclass.

    New models

    In this release we've added Factorization Machines, Classifier Chains and HDBSCAN*. Factorization machines are a powerful non-linear predictor which uses a factorized approximation to learn a per output feature-feature interaction term in addition to a linear model. We've added Factorization Machines for multi-class classification, multi-label classification and regression. Classifier chains are an ensemble approach to multi-label classification which given a specific ordering of the labels learns a chain of classifiers where each classifier gets the features along with the predicted labels from earlier in the chain. We also added ensembles of randomly ordered classifier chains which work well in situations when the ground truth label ordering is unknown (i.e., most of the time). HDBSCAN is a hierarchical density based clustering algorithm which chooses the number of clusters based on properties of the data rather than as a hyperparameter. The Tribuo implementation can cluster a dataset, and then at prediction time it provides the cluster the given datapoint would be in without modifying the cluster structure.

    • Classifier Chains (#149), which also adds the jaccard score as a multi-label evaluation metric, and a multi-label voting combiner for use in multi-label ensembles.
    • Factorization machines (#179).
    • HDBSCAN (#196).

    ONNX Export

    The ONNX format is a cross-platform and cross-library model exchange format. Tribuo can already serve ONNX models via its ONNX Runtime interface, and now has the ability to export models in ONNX format for serving on edge devices, in cloud services, or in other languages like Python or C#.

    In this release Tribuo supports exporting linear models (multi-class classification, multi-label classification and regression), sparse linear regression models, factorization machines (multi-class classification, multi-label classification and regression), LibLinear models (multi-class classification and regression), LibSVM models (multi-class classification and regression), along with ensembles of those models, including arbitrary levels of ensemble nesting. We plan to expand this coverage to more models over time, however for TensorFlow we recommend users export those models as a Saved Model and use the Python tf2onnx converter.

    Tribuo models exported in ONNX format preserve their provenance information in a metadata field which is accessible when the ONNX model is loaded back into Tribuo. The provenance is stored as a protobuf so could be read from other libraries or platforms if necessary.

    The ONNX export support is in a separate module with no dependencies, and could be used elsewhere on the JVM to support generating ONNX graphs. We welcome contributions to build out the ONNX support in that module.
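
    As a minimal sketch (assuming the v4.2 ONNXExportable API; the domain string, version number and output path are illustrative), exporting a model looks like:

    import java.nio.file.Paths;
    import org.tribuo.ONNXExportable;

    // Models which support ONNX export implement the ONNXExportable interface.
    if (model instanceof ONNXExportable) {
        ((ONNXExportable) model).saveONNXModel("org.example", 1L, Paths.get("model.onnx"));
    }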

    • ONNX export for LinearSGDModels (#154), which also adds a multi-label output transformer for scoring multi-label ONNX models.
    • ONNX export for SparseLinearModel (#163).
    • Add provenance to ONNX exported models (#182).
    • Refactor ONNX tensor creation (#187).
    • ONNX ensemble export support (#186).
    • ONNX export for LibSVM and LibLinear (#191).
    • Refactor ONNX support to improve type safety (#199).
    • Extract ONNX support into separate module (#TBD).

    Reproducibility Framework

    Tribuo has strong model metadata support via its provenance system which records how models, datasets and evaluations are created. In this release we enhance this support by adding a push-button reproduction framework which accepts either a model provenance or a model object and rebuilds the complete training pipeline, ensuring consistent usage of RNGs and other mutable state.

    This allows Tribuo to easily rebuild models to see if updated datasets could change performance, or even if the model is actually reproducible (which may be required for regulatory reasons). Over time we hope to expand this support into a full experimental framework, allowing models to be rebuilt with hyperparameter or data changes as part of the data science process or for debugging models in production.
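
    A minimal sketch, assuming the ReproUtil class from the reproducibility tutorial and a trained model in scope:

    import org.tribuo.reproducibility.ReproUtil;

    // Rebuilds the training pipeline recorded in the model's provenance.
    var repro = new ReproUtil(model);
    var reproduced = repro.reproduceFromModel();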

    This framework was written by Joseph Wonsil and Prof. Margo Seltzer at the University of British Columbia as part of a collaboration between Prof. Seltzer and Oracle Labs. We're excited to continue working with Joe, Margo and the rest of the lab at UBC, as this is excellent work.

    Note the reproducibility framework module requires Java 16 or greater, and is thus not included in the tribuo-all meta-module.

    • Reproducibility framework (#185, with minor changes in #189 and #190).

    OCI Data Science Integration

    Oracle Cloud Data Science is a platform for building and deploying models in Oracle Cloud. The model deployment functionality wraps models in a Python runtime and deploys them with an auto-scaler at a REST endpoint. In this release we've added support for deploying Tribuo models which are ONNX exportable directly to OCI DS, allowing scale-out deployments of models from the JVM. We also added an OCIModel wrapper which scores Tribuo Example objects using a deployed model's REST endpoint, allowing easy use of cloud resources for ML on the JVM.

    • Oracle Cloud Data Science integration (#200).

    Small improvements

    • Date field processor and locale support in metadata extractors (#148)
    • Multi-output response processor allowing loading different formats of multi-label and multi-dimensional regression datasets (#150)
    • ARM dev profile for compiling Tribuo on ARM platforms (#152)
    • Refactor CSVLoader so it uses CSVDataSource and parses CSV files using RowProcessor, allowing an easy transition to more complex columnar extraction (#153)
    • Configurable anomaly demo data source (#160)
    • Configurable clustering demo data source (#161)
    • Configurable classification demo data source (#162)
    • Multi-label tutorial and configurable multi-label demo data source (#166), plus a fix in #168 after #167
    • Add javadoc for all public methods and fields (#175) (also fixes a bug in Util.vectorNorm)
    • Add hooks for model equality checks to trees and LibSVM models (#183) (also fixes a bug in liblinear get top features)
    • XGBoost 1.5.0 (#192)
    • TensorFlow Java 0.4.0 (#195) (note this changes Tribuo's TF API slightly as TF-Java 0.4.0 has a different method of initializing the session)
    • KMeans now uses dense vectors when appropriate, speeding up training (#201)
    • Documentation updates, ONNX and reproducibility tutorials (#205)

    Bug fixes

    • NPE fix for LIME explanations using models which don't support per class weights (#157)
    • Fixing a bug in multi-label evaluation which swapped FP for FN (#167)
    • Persist CSVDataSource headers in the provenance (#171)
    • Fixing LibSVM and LibLinear so they have reproducible behaviour (#172)
    • Provenance fix for TransformTrainer and an extra factory for XGBoostExternalModel so you can make them from an in memory booster (#176)
    • Fix multidimensional regression (#177) (fixes regression ids, fixes libsvm so it emits correct standardized models, adds support for per dimension feature weights in XGBoostRegressionModel)
    • Fix provenance generation for FieldResponseProcessor and BinaryResponseProcessor (#178)
    • Normalize LibSVMDataSource paths consistently in the provenance (#181)
    • KMeans and KNN now run correctly when using OpenSearch's SecurityManager (#197)

    What's Changed

    • Bumping to 4.2.0-SNAPSHOT for new development by @Craigacp in https://github.com/oracle/tribuo/pull/143
    • Adding release notes for the earlier v4 releases by @Craigacp in https://github.com/oracle/tribuo/pull/146
    • Adds classifier chains as a generic multi-label classifier by @Craigacp in https://github.com/oracle/tribuo/pull/149
    • Adds a field processor which operates on dates by @Craigacp in https://github.com/oracle/tribuo/pull/148
    • Added support for multioutputs to ResponseProcesser, with tests. by @JackSullivan in https://github.com/oracle/tribuo/pull/150
    • Adding an ARM maven profile which skips the native library tests by @Craigacp in https://github.com/oracle/tribuo/pull/152
    • Fixing an NPE in LIMEExplanation.getActiveFeatures() by @Craigacp in https://github.com/oracle/tribuo/pull/157
    • CSVLoader refactor by @Craigacp in https://github.com/oracle/tribuo/pull/153
    • Adds a ConfigurableDataSource data generator for AnomalyDetection by @Craigacp in https://github.com/oracle/tribuo/pull/160
    • Adds ONNX export support to Tribuo's LinearSGDModels by @Craigacp in https://github.com/oracle/tribuo/pull/154
    • Adds a ConfigurableDataSource data generator for Clustering by @Craigacp in https://github.com/oracle/tribuo/pull/161
    • Adds ConfigurableDataSource data generators for Classification by @Craigacp in https://github.com/oracle/tribuo/pull/162
    • Adds a tutorial on multi-label problems and a configurable data source generator for multi-label demos. by @Craigacp in https://github.com/oracle/tribuo/pull/166
    • Bumping LibSVM version by @Craigacp in https://github.com/oracle/tribuo/pull/170
    • Adds ONNX export support to the sparse linear models by @Craigacp in https://github.com/oracle/tribuo/pull/163
    • Fixing a bug where MultiLabelConfusionMatrix swapped FP for FN by @Craigacp in https://github.com/oracle/tribuo/pull/167
    • Updating the multi-label tutorial after the evaluation bug fix by @Craigacp in https://github.com/oracle/tribuo/pull/168
    • CSVDataSource should persist the headers in the provenance by @Craigacp in https://github.com/oracle/tribuo/pull/171
    • LibLinear and LibSVM have unmanaged global RNGs by @Craigacp in https://github.com/oracle/tribuo/pull/172
    • Add Javadoc for all remaining undocumented public methods and fields. by @Craigacp in https://github.com/oracle/tribuo/pull/175
    • Two small fixes for provenance. by @Craigacp in https://github.com/oracle/tribuo/pull/176
    • Fixes multidimensional regression by @Craigacp in https://github.com/oracle/tribuo/pull/177
    • Normalizes the URL created from paths in LibSVMDataSource by @Craigacp in https://github.com/oracle/tribuo/pull/181
    • Factorization machines by @Craigacp in https://github.com/oracle/tribuo/pull/179
    • ResponseProcessor State-setting and Tests by @JackSullivan in https://github.com/oracle/tribuo/pull/178
    • Adding some accessors to allow deeper model equality checks by @Craigacp in https://github.com/oracle/tribuo/pull/183
    • Adds Tribuo provenance as a metadata field to exported ONNX models by @Craigacp in https://github.com/oracle/tribuo/pull/182
    • Bumping CI to Java 17 by @Craigacp in https://github.com/oracle/tribuo/pull/188
    • Refactor onnx math by @JackSullivan in https://github.com/oracle/tribuo/pull/187
    • Addition of a Reproducibility Framework by @jwons in https://github.com/oracle/tribuo/pull/185
    • Updates for the reproducibility changes in the rest of Tribuo by @Craigacp in https://github.com/oracle/tribuo/pull/190
    • ONNX ensemble support by @Craigacp in https://github.com/oracle/tribuo/pull/186
    • Bump XGBoost to 1.5.0 by @Craigacp in https://github.com/oracle/tribuo/pull/192
    • Reproducibility generics cleanup by @Craigacp in https://github.com/oracle/tribuo/pull/189
    • ONNX export support for LibLinear and LibSVM by @Craigacp in https://github.com/oracle/tribuo/pull/191
    • Update bug_report.md by @Craigacp in https://github.com/oracle/tribuo/pull/194
    • These are the changes for an implementation of HDBSCAN* by @geoffreydstewart in https://github.com/oracle/tribuo/pull/196
    • Tensorflow-Java 0.4.0 update by @Craigacp in https://github.com/oracle/tribuo/pull/195
    • Single threaded K-Means training no longer uses a ForkJoinPool by @Craigacp in https://github.com/oracle/tribuo/pull/197
    • Adds setInvocationCount to HdbscanTrainer by @Craigacp in https://github.com/oracle/tribuo/pull/198
    • Refactor Java ONNX Interface by @JackSullivan in https://github.com/oracle/tribuo/pull/199
    • Add the HDBSCAN* clustering tutorial, and add a small fix for predictions by @geoffreydstewart in https://github.com/oracle/tribuo/pull/202
    • Cleaned up some doc formatting and some typos by @jhalexand in https://github.com/oracle/tribuo/pull/204
    • Moving ONNX export utils out into a separate module by @Craigacp in https://github.com/oracle/tribuo/pull/203
    • Oracle Cloud Data Science interop by @Craigacp in https://github.com/oracle/tribuo/pull/200
    • KMeans DenseVector support by @Craigacp in https://github.com/oracle/tribuo/pull/201
    • Documentation updates for 4.2 by @Craigacp in https://github.com/oracle/tribuo/pull/205
    • Tribuo v4.2 release by @Craigacp in https://github.com/oracle/tribuo/pull/206

    New Contributors

    • @jwons made their first contribution in https://github.com/oracle/tribuo/pull/185
    • @geoffreydstewart made their first contribution in https://github.com/oracle/tribuo/pull/196

    Full Changelog: https://github.com/oracle/tribuo/compare/v4.1.0...v4.2.0

  • v4.1.1 (Dec 10, 2021)

    This is the first patch release for Tribuo v4.1. The main fixes in this release are to the multi-dimensional output regression support, and to support the use of KMeans and KNN models when running under a restrictive SecurityManager. Additionally this release pulls in TensorFlow-Java 0.4.0 which upgrades the TensorFlow native library to 2.7.0 fixing several CVEs. Note those CVEs may not be applicable to TensorFlow-Java, as many of them relate to Python codepaths which are not included in TensorFlow-Java. Also note the TensorFlow upgrade is a breaking API change for Tribuo's TF API as graph initialization is handled differently in this release, which causes unavoidable changes.

    Multi-dimensional Regression fix

    In Tribuo 4.1.0 and earlier there is a severe bug in multi-dimensional regression models (i.e., regression tasks with multiple output dimensions). Models other than LinearSGDModel and SparseLinearModel (apart from when using the ElasticNetCDTrainer) have a bug in how the output dimension indices are constructed, and may produce incorrect outputs for all dimensions (as the output will be for a different dimension than the one named in the Regressor object). This has been fixed, and loading in models trained in earlier versions of Tribuo will patch the model to rearrange the dimensions appropriately. Unfortunately this fix cannot be applied to tree based models, and so all multi-output regression tree based models should be retrained using Tribuo 4.1.1 or newer as they are irretrievably corrupt. Additionally when using standardization in multi-output regression LibSVM models dimensions past the first dimension have the model improperly stored and will also need to be retrained with Tribuo 4.1.1 or newer. See #177 for more details.

    Bug fixes

    • NPE fix for LIME explanations using models which don't support per class weights (#157).
    • Fixing a bug in multi-label evaluation which swapped FP for FN (#167).
    • Fixing LibSVM and LibLinear so they have reproducible behaviour (#172).
    • Provenance fix for TransformTrainer and an extra factory for XGBoostExternalModel so you can make them from an in memory booster (#176)
    • Fix multidimensional regression (#177) (fixes regression ids, fixes libsvm so it emits correct standardized models, adds support for per dimension feature weights in XGBoostRegressionModel).
    • Normalize LibSVMDataSource paths consistently in the provenance (#181).
    • KMeans and KNN now run correctly when using OpenSearch's SecurityManager (#197).
    • TensorFlow-Java 0.4.0 (#195).

    Full Changelog: https://github.com/oracle/tribuo/compare/v4.1.0...v4.1.1

  • v4.1.0(May 26, 2021)

    Tribuo v4.1 Release Notes

    Tribuo 4.1 is the first feature release after the initial open source release. We've added new models, new parameters for some models, improvements to data loading, documentation and transformations, and speedups for our CRF and linear models, along with a large update to the TensorFlow interface. We've also revised the tutorials and added two new ones covering TensorFlow and document classification.

    TensorFlow support

    Migrated to TensorFlow Java 0.3.1, which allows the specification and training of models in Java (#134). TensorFlow models can be saved in two formats: TensorFlow's checkpoint format or Tribuo's native model serialization. They can also be exported as TensorFlow Saved Models for interop with other TensorFlow platforms. Tribuo can now load TF v2 Saved Models and serve them alongside TF v1 frozen graphs with its external model loader.

    We also added a TensorFlow tutorial which walks through the creation of a simple regression MLP, a classification MLP and a classification CNN, before exporting the model as a TensorFlow Saved Model and importing it back into Tribuo.
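
    A rough sketch of defining and training an MLP through the new interface, following the pattern from the tutorial (numFeatures, numLabels and trainData are assumed to be in scope, and the exact constructor arguments may differ between versions):

        import java.util.Map;
        import org.tribuo.classification.Label;
        import org.tribuo.interop.tensorflow.DenseFeatureConverter;
        import org.tribuo.interop.tensorflow.GradientOptimiser;
        import org.tribuo.interop.tensorflow.LabelConverter;
        import org.tribuo.interop.tensorflow.TensorFlowTrainer;
        import org.tribuo.interop.tensorflow.example.MLPExamples;

        // Build a two hidden layer MLP graph and train it with AdaGrad.
        var inputName = "MLP_INPUT";
        var mlpGraph = MLPExamples.buildMLPGraph(inputName, numFeatures, new int[]{50, 50}, numLabels);
        var trainer = new TensorFlowTrainer<Label>(mlpGraph.graphDef,
                mlpGraph.outputName,
                GradientOptimiser.ADAGRAD,
                Map.of("learningRate", 0.1f, "initialAccumulatorValue", 0.01f),
                new DenseFeatureConverter(inputName),
                new LabelConverter(),
                16,   // training batch size
                5,    // number of epochs
                16,   // test batch size
                -1);  // disable logging
        var model = trainer.train(trainData);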

    New models

    • Added extremely randomized trees, i.e., ExtraTrees (#51).
    • Added an SGD based linear model for multi-label classification (#106).
    • Added liblinear's linear SVM anomaly detector (#114).
    • Added arbitrary ensemble creation from existing models (#129).

    New features

    • Added K-Means++ (#34); see the sketch after this list.
    • Added XGBoost feature importance metrics (#52).
    • Added OffsetDateTimeExtractor to the columnar data package (#66).
    • Added an empty response processor for use with clustering datasets (#99).
    • Added IDFTransformation for generating TF-IDF features (#104).
    • Exposed more parameters for XGBoost models (#107).
    • Added a Wordpiece tokenizer (#111).
    • Added optional output standardisation to LibSVM regressors (#113).
    • Added a BERT feature extractor for text data (#116). This can load in ONNX format BERT (and BERT style) models from HuggingFace Transformers, and use them as part of Tribuo's text feature extraction package.
    • Added a configurable version of AggregateDataSource, and added iteration order parameters to both forms of AggregateDataSource (#125).
    • Added an option to RowProcessor which passes through newlines (#137).
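
    A minimal sketch of the new K-Means++ initialisation (the constructor argument order is an assumption based on the 4.1 API, and dataset is an assumed Dataset<ClusterID>):

        import org.tribuo.clustering.kmeans.KMeansTrainer;

        // 5 centroids, 10 iterations, Euclidean distance, K-Means++ seeding,
        // single threaded, RNG seed of 1.
        var trainer = new KMeansTrainer(5, 10,
                KMeansTrainer.Distance.EUCLIDEAN,
                KMeansTrainer.Initialisation.PLUSPLUS,
                1, 1);
        var model = trainer.train(dataset);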

    Other improvements

    • Removed redundant computation in tree construction (#63).
    • Added better accessors for the centroids of a K-Means model (#98).
    • Improved the speed of the feature transformation infrastructure (#104).
    • Refactored the SGD models to reduce redundant code and allow models to share upcoming improvements (#106, #134).
    • Added many performance optimisations to the linear SGD and CRF models, allowing the automatic use of dense feature spaces (#112). This also adds specialisations to the math library for dense vectors and matrices, improving the performance of the CRF model even when operating on sparse feature sets.
    • Added provenance tracking of the Java version, OS and CPU architecture (#115).
    • Changed the behaviour of sparse features under transformations to expose additional behaviour (#122).
    • Improved MultiLabelEvaluation.toString() (#136).
    • Added a document classification tutorial which shows the various text feature extraction techniques available in Tribuo.
    • Expanded javadoc coverage.
    • Upgraded ONNX Runtime to 1.7.0, XGBoost to 1.4.1, TensorFlow to 0.3.1, liblinear-java to 2.43, OLCUT to 5.1.6, OpenCSV to 5.4.
    • Miscellaneous small bug fixes.

  • v4.0.2(Nov 5, 2020)

    This is the first Tribuo point release after the initial public announcement. It fixes many of the issues our early users found, and improves the documentation in the areas those users flagged. We also added a couple of small new methods as part of fixing the bugs, and added two new tutorials: one on columnar data loading and one on external model loading (i.e., XGBoost and ONNX models).
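
    For example, the simple CSV loading path covered alongside the new columnar tutorial looks roughly like this sketch (the file name and response column name are assumptions):

        import java.nio.file.Paths;
        import org.tribuo.MutableDataset;
        import org.tribuo.classification.LabelFactory;
        import org.tribuo.data.csv.CSVLoader;

        // Load a numeric CSV whose "response" column holds the class label.
        var loader = new CSVLoader<>(new LabelFactory());
        var source = loader.loadDataSource(Paths.get("example.csv"), "response");
        var dataset = new MutableDataset<>(source);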

    Bugs fixed:

    • Fixed a locale issue in the evaluation tests.
    • Fixed issues with RowProcessor (expand regexes not being called, improper provenance capture).
    • IDXDataSource now throws FileNotFoundException rather than a mysterious NullPointerException when it can't find the file.
    • Fixed issues in JsonDataSource (consistent exceptions thrown, proper termination of reading in several cases).
    • Fixed an issue where regression models couldn't be serialized due to a non-serializable lambda.
    • Fixed UTF-8 BOM issues in CSV loading.
    • Fixed an issue where LibSVMTrainer didn't track state between repeated calls to train.
    • Fixed issues in the evaluators to ensure consistent exception throwing when discovering unlabelled or unknown ground truth outputs.
    • Fixed a bug in the ONNX LabelTransformer where it wouldn't read PyTorch outputs properly.
    • Bumped to OLCUT 5.1.5 to fix a provenance -> configuration conversion issue.

    New additions:

    • Added a method which converts a Jackson ObjectNode into a Map suitable for the RowProcessor.
    • Added missing serialization tests to all the models.
    • Added a getInnerModels method to LibSVMModel, LibLinearModel and XGBoostModel to allow users to access a copy of the internal models (see the sketch after this list).
    • More documentation.
    • Columnar data loading tutorial.
    • External model (XGBoost & ONNX) tutorial.
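
    A minimal sketch of the new accessor on an XGBoost classification model (model is assumed to be in scope, and the List<Booster> return type is an assumption):

        import java.util.List;
        import ml.dmlc.xgboost4j.java.Booster;
        import org.tribuo.classification.Label;
        import org.tribuo.common.xgboost.XGBoostModel;

        // getInnerModels returns copies, so inspecting them can't corrupt the Tribuo model.
        XGBoostModel<Label> xgbModel = (XGBoostModel<Label>) model;
        List<Booster> boosters = xgbModel.getInnerModels();
        System.out.println("Contains " + boosters.size() + " booster(s)");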

    Dependency updates:

    • OLCUT 5.1.5 (brings in jline 3.16.0 and jackson 2.11.3).
  • v4.0.1(Sep 1, 2020)

    • Fixed an issue where the CSVReader wouldn't read files with extraneous newlines at the end.
    • Added an IDXDataSource so Tribuo can read IDX (i.e., MNIST) formatted datasets; see the sketch below.
    • Updated the configuration tutorial to read MNIST from IDX files rather than libsvm format files.
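
    A minimal sketch of the new IDXDataSource reading MNIST (the file names are assumptions):

        import java.nio.file.Paths;
        import org.tribuo.MutableDataset;
        import org.tribuo.classification.LabelFactory;
        import org.tribuo.datasource.IDXDataSource;

        // Reads features and labels from (optionally gzipped) IDX format files.
        var source = new IDXDataSource<>(Paths.get("train-images-idx3-ubyte.gz"),
                Paths.get("train-labels-idx1-ubyte.gz"),
                new LabelFactory());
        var train = new MutableDataset<>(source);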
  • v4.0.0(Aug 13, 2020)

    This is the first public release of the Tribuo Java Machine Learning library. Tribuo provides classification, regression, clustering and anomaly detection algorithms along with data loading, transformation and model evaluation code. Tribuo also provides support for loading external ONNX models and scoring them in Java as well as support for training and evaluating deep learning models using TensorFlow.
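
    As a flavour of the API, a minimal train-and-evaluate sketch (data loading is elided; trainSet and testSet are assumed Dataset<Label> instances):

        import org.tribuo.Model;
        import org.tribuo.classification.Label;
        import org.tribuo.classification.evaluation.LabelEvaluation;
        import org.tribuo.classification.evaluation.LabelEvaluator;
        import org.tribuo.classification.sgd.linear.LogisticRegressionTrainer;

        // Train a logistic regression and print the standard classification metrics.
        var trainer = new LogisticRegressionTrainer();
        Model<Label> model = trainer.train(trainSet);
        LabelEvaluation evaluation = new LabelEvaluator().evaluate(model, testSet);
        System.out.println(evaluation);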

    Tribuo's development started in 2016 led by Oracle Labs' Machine Learning Research Group, and has been in production inside Oracle since 2017. It's now available under an Apache 2.0 license, and we'll continue to develop it in the open, including accepting community PRs under the Oracle Contributor Agreement.

    See tribuo.org for a project overview, or explore the docs here on GitHub for more details. We have Jupyter notebook based tutorials demonstrating various features of the library.
