Datumbox is an open-source Machine Learning framework written in Java which allows the rapid development of Machine Learning and Statistical applications.

Overview

Datumbox Machine Learning Framework

The Datumbox Machine Learning Framework is an open-source framework written in Java which allows the rapid development of Machine Learning and Statistical applications. The main focus of the framework is to include a large number of machine learning algorithms and statistical methods and to be able to handle large datasets.

Copyright & License

Copyright (C) 2013-2020 Vasilis Vryniotis.

The code is licensed under the Apache License, Version 2.0.

Installation & Versioning

Datumbox Framework is available on Maven Central Repository.

The latest stable version of the framework is 0.8.2 (Build 20200805). To use it, add the following snippet to your pom.xml:

    <dependency>
        <groupId>com.datumbox</groupId>
        <artifactId>datumbox-framework-lib</artifactId>
        <version>0.8.2</version>
    </dependency>

The latest snapshot version of the framework is 0.8.3-SNAPSHOT (Build 20201014). To test it, update your pom.xml as follows:

    <repository>
        <id>sonatype-snapshots</id>
        <name>sonatype snapshots repo</name>
        <url>https://oss.sonatype.org/content/repositories/snapshots</url>
    </repository>

    <dependency>
        <groupId>com.datumbox</groupId>
        <artifactId>datumbox-framework-lib</artifactId>
        <version>0.8.3-SNAPSHOT</version>
    </dependency>

The develop branch is the development branch (the default GitHub branch), while the master branch contains the latest stable version of the framework. All stable releases are marked with tags.

The releases of the framework follow the Semantic Versioning approach. For detailed information about the various releases check out the Changelog.
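As a rough illustration of how the version strings above are ordered under Semantic Versioning, the sketch below compares the MAJOR.MINOR.PATCH numbers and then treats a version carrying a pre-release suffix (such as -SNAPSHOT) as older than the same version without one. The class and method names are hypothetical and not part of Datumbox, and the comparison of pre-release identifiers against each other is deliberately left out.

```java
final class SemVerSketch {
    /**
     * Compares two versions of the form MAJOR.MINOR.PATCH[-suffix].
     * Returns a negative number, zero, or a positive number.
     * Simplification: two pre-releases of the same core version compare as equal.
     */
    static int compare(String a, String b) {
        String[] partsA = a.split("-", 2);
        String[] partsB = b.split("-", 2);
        String[] coreA = partsA[0].split("\\.");
        String[] coreB = partsB[0].split("\\.");
        // Compare MAJOR, MINOR and PATCH numerically, left to right.
        for (int i = 0; i < 3; i++) {
            int c = Integer.compare(Integer.parseInt(coreA[i]), Integer.parseInt(coreB[i]));
            if (c != 0) {
                return c;
            }
        }
        boolean preA = partsA.length > 1;
        boolean preB = partsB.length > 1;
        // Same core version: a pre-release ranks below the plain release.
        return Boolean.compare(preB, preA);
    }

    public static void main(String[] args) {
        System.out.println(compare("0.8.2", "0.8.3-SNAPSHOT"));
        System.out.println(compare("0.8.3-SNAPSHOT", "0.8.3"));
    }
}
```

Under this ordering the stable 0.8.2 release sorts before 0.8.3-SNAPSHOT, which in turn sorts before an eventual 0.8.3 release.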

Documentation and Code Examples

All the public methods and classes of the Framework are documented with Javadoc comments. Moreover, for every model there is a JUnit test which shows how to train and use it. Finally, for more examples of how to use the framework, check out the Code Examples or the official Blog.

Pre-trained Models

Datumbox comes with a large number of pre-trained models which allow you to perform Sentiment Analysis (Document & Twitter), Subjectivity Analysis, Topic Classification, Spam Detection, Adult Content Detection, Language Detection, Commercial Detection, Educational Detection and Gender Detection. To get the binary models check out the Datumbox Zoo.

Which methods/algorithms are supported?

The Framework currently supports performing multiple parametric and non-parametric statistical tests, calculating descriptive statistics on censored and uncensored data, performing ANOVA, Cluster Analysis, Dimension Reduction, Regression Analysis, Time Series Analysis, Sampling and the calculation of probabilities from the most common discrete and continuous distributions. In addition, it provides several implemented algorithms including Max Entropy, Naive Bayes, SVM, Bootstrap Aggregating, Adaboost, Kmeans, Hierarchical Clustering, Dirichlet Process Mixture Models, Softmax Regression, Ordinal Regression, Linear Regression, Stepwise Regression, PCA and several other techniques that can be used for feature selection, ensemble learning, linear programming solving and recommender systems.

Bug Reports

Although parts of the Framework have been used in commercial applications, not all classes are equally used or tested. The framework is currently in Alpha, so you should expect some changes to the public APIs in future versions. If you spot a bug, please submit it as an Issue on the official GitHub repository.

Contributing

The Framework can be improved in many ways, so any contribution is welcome. By far the most important feature missing from the Framework is the ability to use it from the command line or from other languages such as Python. Other important enhancements include improving the documentation, the test coverage and the examples, improving the architecture of the framework and supporting more Machine Learning and Statistical Models. If you make any useful changes to the code, please consider contributing them by sending a pull request.

Acknowledgements

Many thanks to Eleftherios Bampaletakis for his invaluable input on improving the architecture of the Framework. Also many thanks to ej-technologies GmbH for providing a license for their Java Profiler and to JetBrains for providing a license for their Java IDE.

Comments
  • Possible Error in Shapiro-Wilk P-Value

    Hi,

    I tried out your Shapiro-Wilk implementation as I needed to calculate some values for a paper-submission. I did cross reference it with the Real-Statistics Excel Plugin (http://www.real-statistics.com/) as well as several online tools that allow use of Shapiro-Wilk.

    If you run it with the following values: 488.0, 486.0, 492.0, 490.0, 489.0, 491.0, 488.0, 490.0, 496.0, 487.0, 487.0, 493.0 The results should be: W -> 0.944486 P -> 0.55826969...

    However the P value is 0.44173030948

    The offending line is here: https://github.com/datumbox/datumbox-framework/blob/957560901f3c87d3e9f6760263644a4d70b0a3b8/datumbox-framework-core/src/main/java/com/datumbox/framework/core/statistics/nonparametrics/onesample/ShapiroWilk.java#L257

    It should actually be m-y:

        pw=ContinuousDistributions.gaussCdf((m-y)/s)

    bug 
    opened by oliver-krauss 9
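The two p-values quoted in the Shapiro-Wilk report above (0.4417... and 0.5583...) add up to one, which is what a flipped sign in the argument of the normal CDF produces, because Φ(-z) = 1 - Φ(z). The following sketch demonstrates that symmetry in plain Java. It does not call Datumbox's ContinuousDistributions.gaussCdf; the class name, the erf approximation (Abramowitz and Stegun formula 7.1.26) and the value of z are assumptions chosen for illustration only.

```java
final class GaussCdfSymmetry {
    /** Approximates erf(x) with Abramowitz & Stegun 7.1.26 (absolute error about 1.5e-7). */
    static double erf(double x) {
        double sign = x < 0 ? -1.0 : 1.0;
        double ax = Math.abs(x);
        double t = 1.0 / (1.0 + 0.3275911 * ax);
        // Horner evaluation of the degree-5 polynomial in t.
        double poly = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741
                + t * (-1.453152027 + t * 1.061405429))));
        return sign * (1.0 - poly * Math.exp(-ax * ax));
    }

    /** Standard normal cumulative distribution function. */
    static double gaussCdf(double z) {
        return 0.5 * (1.0 + erf(z / Math.sqrt(2.0)));
    }

    public static void main(String[] args) {
        double z = 0.1465; // arbitrary illustrative value, not taken from the library
        double pPlus = gaussCdf(z);
        double pMinus = gaussCdf(-z);
        // Swapping the sign of the argument yields the complementary probability.
        System.out.println("cdf(z)  = " + pPlus);
        System.out.println("cdf(-z) = " + pMinus);
        System.out.println("sum     = " + (pPlus + pMinus));
    }
}
```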
  • FlatDataList with null values gets an exception when trying to calculate the variance

    As discussed in the pull request, here is a minimal piece of code that triggers the exception:

        public void testClassifier() {
            RandomGenerator.setGlobalSeed(42L);
            Configuration configuration = Configuration.getConfiguration();
            InMemoryConfiguration memConfiguration = new InMemoryConfiguration();
            final File f = new File(Datumbox.class.getProtectionDomain().getCodeSource().getLocation().getPath());
            memConfiguration.setDirectory(f.getAbsolutePath());
            configuration.setStorageConfiguration(memConfiguration);
    
            // List of positive and negative sentences, for training
            List<String> positives = new ArrayList<>();
            List<String> negatives = new ArrayList<>();
            positives.add("the rock is destined to be the 21st century's new \" conan \" and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .");
            positives.add("the gorgeously elaborate continuation of \" the lord of the rings \" trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .");
            positives.add("effective but too-tepid biopic");
            negatives.add("simplistic , silly and tedious .");
            negatives.add("it's so laddish and juvenile , only teenage boys could possibly find it funny .");
            negatives.add("exploitative and largely devoid of the depth or sophistication that would make watching such a graphic treatment of the crimes bearable .");
    
            // Construct the training parameters for the classifier
            TextClassifier.TrainingParameters trainingParameters = new TextClassifier.TrainingParameters();
            trainingParameters.setNumericalScalerTrainingParameters(new StandardScaler.TrainingParameters());
            trainingParameters.setCategoricalEncoderTrainingParameters(new CornerConstraintsEncoder.TrainingParameters());
            trainingParameters.setFeatureSelectorTrainingParametersList(Arrays.asList(new ChisquareSelect.TrainingParameters()));
            trainingParameters.setTextExtractorParameters(new NgramsExtractor.Parameters());
            trainingParameters.setModelerTrainingParameters(new BernoulliNaiveBayes.TrainingParameters());
    
            // Construct list of records from lists of positives/negatives
            AbstractTextExtractor textExtractor = AbstractTextExtractor.newInstance(trainingParameters.getTextExtractorParameters());
            List<Record> records = new ArrayList<>();
            for (String positive: positives) {
                AssociativeArray xData = new AssociativeArray(textExtractor.extract(StringCleaner.clear(positive)));
                records.add(new Record(xData, "positive"));
            }
            for (String negative: negatives) {
                AssociativeArray xData = new AssociativeArray(textExtractor.extract(StringCleaner.clear(negative)));
                records.add(new Record(xData, "negative"));
            }
    
            // Construct training dataframe
            Dataframe trainingData = new Dataframe(configuration);
            for (Record r: records)
                trainingData.set(trainingData.size(), r);
    
            // Construct and train the classifier
            TextClassifier classifier = MLBuilder.create(trainingParameters, configuration);
            classifier.fit(trainingData); // Here, you can follow the trace to see the problem
        }
    
    bug 
    opened by jluis2k10 5
  • Workaround when creating a FlatDataList with nulled values.

    Sometimes when creating a FlatDataList, null values are added to the list, and consequently the library fails with an exception when it tries to retrieve those values. For example, the method variance() in core.statistics.descriptivestatistics.Descriptive fails in line 293 (double v = it.next()) when some of the values in flatDataCollection are null.

    opened by jluis2k10 4
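The failure described in the two reports above comes from auto-unboxing: assigning a null Double to a primitive double throws a NullPointerException. A minimal sketch of one way to guard against it is shown below, in plain Java without any Datumbox classes. The class and method names are hypothetical, and silently skipping nulls is an assumption made for illustration; it is not necessarily the fix adopted by the library.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class NullSafeVariance {
    /** Sample variance (n - 1 denominator) over the non-null values, using Welford's update. */
    static double sampleVariance(Iterable<Double> values) {
        int n = 0;
        double mean = 0.0;
        double m2 = 0.0; // running sum of squared deviations from the mean
        Iterator<Double> it = values.iterator();
        while (it.hasNext()) {
            Double boxed = it.next(); // keep it boxed: "double v = it.next()" would throw on null
            if (boxed == null) {
                continue; // assumption: missing values are ignored
            }
            n++;
            double delta = boxed - mean;
            mean += delta / n;
            m2 += delta * (boxed - mean);
        }
        if (n < 2) {
            throw new IllegalArgumentException("need at least two non-null values");
        }
        return m2 / (n - 1);
    }

    public static void main(String[] args) {
        List<Double> data = Arrays.asList(1.0, null, 2.0, 3.0, null, 4.0);
        System.out.println(sampleVariance(data));
    }
}
```

Whether nulls should be skipped, replaced, or rejected with a clearer exception depends on what a null means in the dataset, so the skip rule here is only one option.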
  • How to Set configs so that I can read Training Data from Disk?

    Hello, I'm new to machine learning and Datumbox is the first ML framework I'm working with, but I did not find any documentation on setting config properties to read training data from disk. Please share a code example of reading training data sets from disk and setting up the config properties.

    question 
    opened by dilipbobby 4
  • Serialize Dataframe

    How do I serialize a dataframe efficiently (records in bulk) onto disk with mapdb? My use case is that I have a large dataset for text classification, and it takes a long time to deserialize and tokenize the text. I want to run multiple experiments without having to tokenize again to convert the text to Record instances.

    enhancement 
    opened by shoubhik 4
  • How to use Pretrained Models in Datumbox Framework

    Hi, I'm new to machine learning and I'm building a small project for analyzing data with "Sentiment Analysis, Content Readability, Content Quality, Adult Content and Spam Detection" features, so I would like some help using the Pre-trained Models with the Framework.

    question 
    opened by versatilevimal 3
  • Bump junit from 4.13 to 4.13.1

    Bumps junit from 4.13 to 4.13.1.

    Release notes

    Sourced from junit's releases.

    JUnit 4.13.1

    Please refer to the release notes for details.

    Changelog

    Sourced from junit's changelog.

    Summary of changes in version 4.13.1

    Rules

    Security fix: TemporaryFolder now limits access to temporary folders on Java 1.7 or later

    A local information disclosure vulnerability in TemporaryFolder has been fixed. See the published security advisory for details.

    Test Runners

    Pull request #1669: Make FrameworkField constructor public

    Prior to this change, custom runners could make FrameworkMethod instances, but not FrameworkField instances. This small change allows for both now, because FrameworkField's constructor has been promoted from package-private to public.

    Commits
    • 1b683f4 [maven-release-plugin] prepare release r4.13.1
    • ce6ce3a Draft 4.13.1 release notes
    • c29dd82 Change version to 4.13.1-SNAPSHOT
    • 1d17486 Add a link to assertThrows in exception testing
    • 543905d Use separate line for annotation in Javadoc
    • 510e906 Add sub headlines to class Javadoc
    • 610155b Merge pull request from GHSA-269g-pwp5-87pp
    • b6cfd1e Explicitly wrap float parameter for consistency (#1671)
    • a5d205c Fix GitHub link in FAQ (#1672)
    • 3a5c6b4 Deprecated since jdk9 replacing constructor instance of Double and Float (#1660)
    • Additional commits viewable in compare view

    dependencies 
    opened by dependabot[bot] 2
  • SVM example for text classification

    Hello,

    I need to classify the text. I already tried your awesome example based on MultinomialNaiveBayes classifier https://github.com/datumbox/datumbox-framework-examples/blob/develop/src/main/java/com/datumbox/examples/TextClassification.java

    I'd like to also test another algorithm - SVM. Could you please show an example how to transform the mentioned sample class in order to use SVM?

    Is it as simple as changing from

        trainingParameters.setModelerTrainingParameters(new MultinomialNaiveBayes.TrainingParameters());

    to

        trainingParameters.setModelerTrainingParameters(new SupportVectorMachine.TrainingParameters());

    or do I need to change something else as well?

    Also, can I use the same text files for my SVM model that I used previously for MNB?

    question 
    opened by Artgit 2
  • Created model is giving slow response?

    Hi, I created my own sentiment model for Twitter. To prepare the model, I collected 4 MB of data files (pos, neg, neu).

    When I created the model, its size grew to 191 MB, and when I ran sentiment analysis after loading that 191 MB model, it took 40 seconds to produce the output.

    To load the 191 MB model in Tomcat, I increased the Java -Xmx value to 2 GB. I want to know why it takes 40 seconds to produce the output. Also, when I created the model I did not find any fs0 folder. Does that make any difference?

    Please give suggestions on the following: do I need to look at the hardware, or do I need to reduce the size of the raw files and recreate the model? Your twitter-sentiment model in the Datumbox Zoo is much smaller (just 1.5 MB), unlike mine.

    question 
    opened by dilipbobby 2
  • Will this work on Android

    I just need the Naive Bayes Classifier, and wanted to know if there are any imports or code (I haven't checked, though) that would prevent it from working on Android (Dalvik or ART Java VMs).

    question 
    opened by ohenepee 2
  • Cross Validation in Datumbox for parameter selection

    Does Datumbox support cross-validation internally to tune the parameters? I see in this post (http://blog.datumbox.com/how-to-build-your-own-twitter-sentiment-analysis-tool/) you talk about a 10-fold cross-validation. Do we need to do it on our own? I could not find any example of this. Currently, what is the best way to tune parameters in Datumbox?

    question 
    opened by shoubhik 2
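For readers unfamiliar with the 10-fold cross-validation mentioned in the question above, the sketch below shows the underlying idea in plain Java: the record indices are shuffled and dealt into k folds, and each fold serves once as the validation set while the remaining folds form the training set. This is a generic illustration under assumed names (KFoldSketch, split); it is not the implementation of any Datumbox class.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class KFoldSketch {
    /** Splits the indices 0..n-1 into k shuffled folds whose sizes differ by at most one. */
    static List<List<Integer>> split(int n, int k, long seed) {
        if (k < 2 || k > n) {
            throw new IllegalArgumentException("need 2 <= k <= n");
        }
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, new Random(seed)); // fixed seed for reproducible folds
        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) {
            folds.add(new ArrayList<>());
        }
        // Deal the shuffled indices round-robin into the k folds.
        for (int i = 0; i < n; i++) {
            folds.get(i % k).add(indices.get(i));
        }
        return folds;
    }

    public static void main(String[] args) {
        int n = 25;
        List<List<Integer>> folds = split(n, 10, 42L);
        for (int f = 0; f < folds.size(); f++) {
            // Fold f is held out for validation; the other folds are used for training.
            int validationSize = folds.get(f).size();
            System.out.println("fold " + f + ": validate on " + validationSize
                    + ", train on " + (n - validationSize));
        }
    }
}
```

To tune a parameter with this scheme, one would train and score a model once per fold for each candidate value and keep the value with the best average score.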
  • How to setLogPriors for Naive Bayes model during cross validation?

    I am using Cross Validation to estimate the performance of my model. Right now the way I am using it is:

        ClassificationMetrics vm = new Validator<>(ClassificationMetrics.class, configuration).validate(new KFoldSplitter(10).split(trainingDataframe), new MultinomialNaiveBayes.TrainingParameters());

    in the com.datumbox.framework.core.machinelearning.common.abstracts.algorithms.AbstractNaiveBayes, I see there's a setLogPriors function which can probably be used to tune the model. (I want to create a DET graph for the model performance, by playing around with the prior probability). Is there a way to set the prior probability of different labels for cross validation? Thanks.

    enhancement 
    opened by jltchiu 1
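The request above relies on the fact that a Naive Bayes classifier picks the label with the highest sum of log prior and log-likelihood, so changing the priors shifts the decision boundary and traces out different operating points. The sketch below illustrates this in plain Java for a two-class problem. It does not use AbstractNaiveBayes or setLogPriors; the class name, method names and all numbers are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class LogPriorSketch {
    /** Returns the label with the highest (log prior + log-likelihood) score. */
    static String predict(Map<String, Double> logPriors, Map<String, Double> logLikelihoods) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : logLikelihoods.entrySet()) {
            double score = logPriors.get(e.getKey()) + e.getValue();
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    /** Builds the log priors of a two-class problem from P(positive). */
    static Map<String, Double> priors(double pPositive) {
        Map<String, Double> logPriors = new LinkedHashMap<>();
        logPriors.put("positive", Math.log(pPositive));
        logPriors.put("negative", Math.log(1.0 - pPositive));
        return logPriors;
    }

    public static void main(String[] args) {
        // Hypothetical log-likelihoods of one document under each class.
        Map<String, Double> logLikelihoods = new LinkedHashMap<>();
        logLikelihoods.put("positive", -10.0);
        logLikelihoods.put("negative", -9.5);
        // Sweeping the prior moves the decision for this borderline document.
        for (double p : new double[] {0.1, 0.5, 0.9}) {
            System.out.println("P(positive)=" + p + " -> " + predict(priors(p), logLikelihoods));
        }
    }
}
```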
Owner
Vasilis Vryniotis
Data Scientist, Software Engineer, author of Datumbox Machine Learning Framework and proud geek.