A machine learning package built for humans.

Airbnb

Last update: Dec 30, 2022

Overview

aerosolve

Machine learning for humans.

What is it?

A machine learning library designed from the ground up to be human friendly. It is different from other machine learning libraries in the following ways:

A thrift-based feature representation that enables pairwise ranking loss and a single-context, multiple-item representation.
A feature transform language that gives the user a lot of control over the features
Human-friendly, debuggable models
Separate lightweight Java inference code
Scala code for training
Simple image content analysis code suitable for ordering or ranking images

This library is meant to be used with sparse, interpretable features such as those that commonly occur in search (search keywords, filters) or pricing (number of rooms, location, price). It is less interpretable on problems with very dense features that are not human-interpretable, such as raw pixels or audio samples.

There are a few reasons to focus on interpretability:

Your corpus is new and not fully defined, and you want more insight into it.
Having interpretable models lets you iterate quickly. Figure out where the model disagrees most and gain insight into what kind of new features are needed.
Debugging noisy features. By plotting the feature weights you can discover buggy features, or fit them to splines and discover features that are unexpectedly complex (which usually indicates overfitting).
You can discover relationships between different variables and your target prediction. For example, for the Airbnb demand model, plotting graphs of reviews and 3-star reviews is more interpretable than many nested if-then-else rules.

How to get started?

The artifacts for aerosolve are hosted on bintray. If you use Maven, SBT or Gradle you can just point to bintray as a repository and automatically fetch the artifacts.

Check out the Image Impressionism Demo, where you can learn how to teach the algorithm to paint in the pointillism style.

There is also an Income Prediction Demo based on a popular machine learning benchmark.

Feature Representation

This section dives into the thrift based feature representation.

Features are grouped into logical groups called families of features. The reason for this is so we can express transformations on an entire feature family at once or interact two different families of features together to create a new feature family.

There are three kinds of features per FeatureVector:

stringFeatures - this is a map of feature family to binary feature strings. For example "GEO" -> { "San Francisco", "CA", "USA" }
floatFeatures - this is a map of feature family to feature name and value. For example "LOC" -> { "Latitude" : 37.75, "Longitude" : -122.43 }
denseFeatures - this is a map of feature family to a dense array of floats. Not really used except for the image content analysis code.
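The three feature maps can be pictured with plain Java collections. This is only a sketch: `FeatureVectorSketch` is a hypothetical stand-in for the thrift-generated FeatureVector, whose actual field types may differ, and the "IMAGE" family and its values are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the thrift-generated FeatureVector.
final class FeatureVectorSketch {
    // feature family -> set of binary feature strings
    final Map<String, Set<String>> stringFeatures = new HashMap<>();
    // feature family -> (feature name -> value)
    final Map<String, Map<String, Double>> floatFeatures = new HashMap<>();
    // feature family -> dense array of floats
    final Map<String, double[]> denseFeatures = new HashMap<>();

    // Builds the examples given in the text above.
    static FeatureVectorSketch sample() {
        FeatureVectorSketch fv = new FeatureVectorSketch();
        fv.stringFeatures.put("GEO",
            new HashSet<>(Arrays.asList("San Francisco", "CA", "USA")));
        Map<String, Double> loc = new HashMap<>();
        loc.put("Latitude", 37.75);
        loc.put("Longitude", -122.43);
        fv.floatFeatures.put("LOC", loc);
        // Assumption: an illustrative dense family, not taken from the library.
        fv.denseFeatures.put("IMAGE", new double[] {0.1, 0.2, 0.3});
        return fv;
    }

    public static void main(String[] args) {
        FeatureVectorSketch fv = sample();
        System.out.println(fv.stringFeatures);
        System.out.println(fv.floatFeatures);
    }
}
```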

Example Representation

Examples are the basic unit of creating training data and scoring. A single example is composed of:

context - this is a FeatureVector that occurs once in the example. It could hold the features representing a search session, e.g. "Keyword" -> "Free parking"
example(0..N) - this is a repeated list of FeatureVectors that represent the items being scored. These can correspond to documents in a search session, e.g. "LISTING CITY" -> "San Francisco"

The reasons for having this structure are:

having one context for hundreds of items saves a lot of space during RPCs or even on disk
you can compute the transforms for the context once, then apply the transformed context repeatedly in conjunction with each item
having a list of items allows the use of list based loss functions such as pairwise ranking loss, domination loss etc where we evaluate multiple items at once
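As a rough illustration of the one-context, many-items layout, the sketch below merges a shared context into a copy of each item. It models string features only, and `ExampleSketch`, `features` and `combine` are hypothetical names chosen for this example rather than aerosolve APIs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of "one context, many items"; string features only.
final class ExampleSketch {
    // Builds a feature map holding a single family with a single value.
    static Map<String, Set<String>> features(String family, String value) {
        Map<String, Set<String>> fv = new HashMap<>();
        Set<String> values = new HashSet<>();
        values.add(value);
        fv.put(family, values);
        return fv;
    }

    // Merges the shared context into a copy of each item; inputs are not modified.
    static List<Map<String, Set<String>>> combine(
            Map<String, Set<String>> context,
            List<Map<String, Set<String>>> items) {
        List<Map<String, Set<String>>> combined = new ArrayList<>();
        for (Map<String, Set<String>> item : items) {
            Map<String, Set<String>> merged = new HashMap<>();
            for (Map.Entry<String, Set<String>> e : item.entrySet()) {
                merged.put(e.getKey(), new HashSet<>(e.getValue()));
            }
            for (Map.Entry<String, Set<String>> e : context.entrySet()) {
                merged.computeIfAbsent(e.getKey(), k -> new HashSet<>())
                      .addAll(e.getValue());
            }
            combined.add(merged);
        }
        return combined;
    }

    public static void main(String[] args) {
        List<Map<String, Set<String>>> items = new ArrayList<>();
        items.add(features("LISTING CITY", "San Francisco"));
        items.add(features("LISTING CITY", "Oakland"));
        System.out.println(combine(features("Keyword", "Free parking"), items));
    }
}
```

The context is stored once and only expanded when each item is scored, which is where the space saving described above comes from.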

Feature Transform language

This section dives into the feature transform language.

Feature transforms are applied with a separate transformer module that is decoupled from the model. This allows the user to break transforms apart or to transform data ahead of scoring. For example, in an application the items in a corpus may be transformed ahead of time and stored, while the context is not known until runtime. Then at runtime, one can transform the context and combine it with each transformed item to get the final feature vector that is then fed to the models.

Feature transforms allow us to modify FeatureVectors on the fly. This allows engineers to iterate on feature engineering rapidly and in a controlled way.

Here are some examples of feature transforms that are commonly used:

List transform. A meta transform that specifies other transforms to be applied
Cross transform. Operates only on stringFeatures. Allows interactions between two different string feature families. e.g. "Keyword" cross "LISTING CITY" creates the new feature family "Keyword_x_city" -> "Free parking^San Francisco"
Multiscale grid transform. Constructs multiple nested grids for 2D coordinates. Useful for modelling geography.
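The cross and multiscale grid ideas can be sketched in a few lines of Java. This is not the library's transform code: `TransformSketch` is a hypothetical class, the `^` separator follows the example above, and the `bucket:(x,y)` cell naming is an assumption made for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketches of a string cross and a multiscale grid.
final class TransformSketch {
    // Crosses every value of one string family with every value of another.
    static Set<String> cross(Set<String> left, Set<String> right) {
        Set<String> out = new TreeSet<>();
        for (String l : left) {
            for (String r : right) {
                out.add(l + "^" + r);
            }
        }
        return out;
    }

    // Emits one grid cell per bucket size; coarser and finer cells nest.
    // Assumption: the cell naming scheme is invented for this example.
    static List<String> multiscaleGrid(double x, double y, double[] bucketSizes) {
        List<String> cells = new ArrayList<>();
        for (double bucket : bucketSizes) {
            long cx = (long) Math.floor(x / bucket);
            long cy = (long) Math.floor(y / bucket);
            cells.add(bucket + ":(" + cx + "," + cy + ")");
        }
        return cells;
    }

    public static void main(String[] args) {
        Set<String> keyword = new TreeSet<>(Arrays.asList("Free parking"));
        Set<String> city = new TreeSet<>(Arrays.asList("San Francisco"));
        System.out.println(cross(keyword, city));
        System.out.println(multiscaleGrid(37.75, -122.43, new double[] {1.0, 0.1}));
    }
}
```

Each grid cell becomes a binary string feature, so a linear model can learn one weight per cell at every scale.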

Please see the corresponding unit tests to learn what these transforms do, what kind of features they operate on and what kind of config they expect.

Models

This section covers debuggable models.

Although there are several models in the model directory only two are the main debuggable models. The rest are experimental or sub-models that create transforms for the interpretable models.

Linear model. Supports hinge, logistic, epsilon-insensitive regression and ranking loss functions. Only operates on stringFeatures. The label for the task is stored in a special feature family specified by rank_key in the config. See the linear model unit tests for how to set up the models. Note that in conjunction with quantization and crosses you can get an incredible amount of complexity from the "linear" model, so it is not actually your regular linear model but something more complex that can be thought of as a bushy, very wide decision tree with millions of branches.

Spline model. A general additive piecewise linear spline model. The training is done at a higher resolution, specified by num_buckets, between the min and max of a feature's range. At the end of each iteration we attempt to project the piecewise linear spline into a lower dimensional function such as a polynomial spline with Dirac delta endpoints. If the RMSE of the projection is above threshold, we leave the spline alone in the high resolution piecewise linear mode. This allows us to debug the spline model for features that are buggy or unexpectedly complex (e.g. jumping up and down when we expect some kind of smoothness).
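To make the high-resolution mode concrete, here is a minimal sketch of evaluating a piecewise linear function with evenly spaced weights between a feature's min and max, plus the RMSE check that decides whether a projection is kept. `PiecewiseLinearSketch` and its method names are hypothetical, and the projection step itself is not implemented here.

```java
// Hypothetical sketch of a piecewise linear function over [min, max].
final class PiecewiseLinearSketch {
    private final double min;
    private final double max;
    private final double[] weights; // evenly spaced control points

    PiecewiseLinearSketch(double min, double max, double[] weights) {
        if (weights.length < 2 || max <= min) {
            throw new IllegalArgumentException("need at least 2 weights and max > min");
        }
        this.min = min;
        this.max = max;
        this.weights = weights.clone();
    }

    // Linearly interpolates between the two nearest control points.
    double evaluate(double x) {
        int n = weights.length;
        double clamped = Math.max(min, Math.min(max, x));
        double pos = (clamped - min) / (max - min) * (n - 1);
        int lo = (int) Math.floor(pos);
        int hi = Math.min(lo + 1, n - 1);
        double t = pos - lo;
        return (1 - t) * weights[lo] + t * weights[hi];
    }

    // Root mean squared error between two equally sized weight arrays.
    static double rmse(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum / a.length);
    }

    // True when the projection is too lossy and the high resolution form should stay.
    static boolean keepHighResolution(double[] highRes, double[] projected, double threshold) {
        return rmse(highRes, projected) > threshold;
    }

    public static void main(String[] args) {
        PiecewiseLinearSketch f = new PiecewiseLinearSketch(0.0, 2.0, new double[] {0, 10, 20});
        System.out.println(f.evaluate(0.5));
    }
}
```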

Boosted stumps model - small compact model. Not very interpretable but at small sizes useful for feature selection.
Decision tree model - in memory only. Mostly used to generate transforms for the linear or spline model.
Maxout neural network model. Experimental and mostly used as a comparison baseline.

IDE

If you use IntelliJ, try building first so that the thrift classes are available. To fix the Spark compile error inside IntelliJ, press command+;, click Dependencies, and change the related dependencies, such as org.apache.spark and org.apache.hadoop:hadoop-common, from test to compile. We keep the Gradle config as testCompile to reduce the jar file size.

Support

Hackpad

Dev group

User group

In the wild

Organizations and projects using aerosolve can list themselves here.

Comments

Optimize AdditiveModelTrainer using sparse representation
This is a huge one. Thanks for reviewing it in advance. You might want to examine commit by commit and also add &w=1 or ?w=1 to the URL to ignore white spaces in the diff due to reformatting.

Aside from a few minor optimizations mentioned in the commits, the major changes are:

Add SparseLabeledPoint for space efficient representation of feature vector

Add validation split into model trainer for better understanding of model convergence

Sparse Representation

Note that this implementation is not ideal in any sense; it is a compromise that best fits Aerosolve's existing architecture without rewriting all the transforms.

SparseLabeledPoint stores the feature vector in the form of

public final int[] indices;
public final float[] values;
public final int[] denseIndices;
public final float[][] denseValues;

where float and string features are stored in indices, values and dense features are stored in denseIndices, denseValues. The index of each feature is backed by featureIndexer in AdditiveModel of type Map<String, Map<String, Integer>>.

So instead of

Map(
  'family1' -> Map('feature1' -> value1, 'feature4' -> value2, ...),
  ...
)

we store features as

indices: [1, 4, ...]
values: [1, 2, ...]

Furthermore, to support efficient model updates with sparsely represented feature vectors, we collect all features into Function[] weightVector, where features are directly indexed by their corresponding index in the featureIndexer.
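The indexing scheme can be sketched as follows. This is a simplified illustration, not the PR's code: `SparseSketch` and `Point` are hypothetical names, only float features are encoded, and plain `double` weights stand in for the `Function[] weightVector` mentioned above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of encoding nested feature maps as parallel arrays.
final class SparseSketch {
    // family -> (feature name -> index), same shape as the featureIndexer above
    final Map<String, Map<String, Integer>> featureIndexer = new HashMap<>();
    private int nextIndex = 0;

    static final class Point {
        final int[] indices;
        final float[] values;

        Point(int[] indices, float[] values) {
            this.indices = indices;
            this.values = values;
        }
    }

    // Returns the index of a feature, assigning a fresh one on first sight.
    int indexOf(String family, String feature) {
        Map<String, Integer> byName =
            featureIndexer.computeIfAbsent(family, k -> new HashMap<>());
        Integer index = byName.get(feature);
        if (index == null) {
            index = nextIndex++;
            byName.put(feature, index);
        }
        return index;
    }

    // Flattens family -> (feature -> value) into parallel index/value arrays.
    Point encode(Map<String, Map<String, Float>> features) {
        int size = 0;
        for (Map<String, Float> family : features.values()) {
            size += family.size();
        }
        int[] indices = new int[size];
        float[] values = new float[size];
        int i = 0;
        for (Map.Entry<String, Map<String, Float>> family : features.entrySet()) {
            for (Map.Entry<String, Float> f : family.getValue().entrySet()) {
                indices[i] = indexOf(family.getKey(), f.getKey());
                values[i] = f.getValue();
                i++;
            }
        }
        return new Point(indices, values);
    }

    // Simplification: a dot product with one scalar weight per feature index.
    static double score(Point p, double[] weights) {
        double sum = 0.0;
        for (int i = 0; i < p.indices.length; i++) {
            sum += weights[p.indices[i]] * p.values[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        SparseSketch sketch = new SparseSketch();
        Map<String, Float> family = new HashMap<>();
        family.put("feature1", 1.0f);
        Map<String, Map<String, Float>> features = new HashMap<>();
        features.put("family1", family);
        System.out.println(sketch.encode(features).indices.length);
    }
}
```

Once features are integer-indexed, a model update touches only the array slots named in `indices` instead of walking nested maps.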

Validation Split

Because SGD loss is rather jittery over a short window and we average model weights, among other regularizations, during each iteration, training loss appears to be a poor proxy for model convergence. This PR adds an optional isTraining function to split the data into training and validation sets. After each iteration, validation loss is calculated without updating the model itself. This gives much clearer guidance on the learning rate, its appropriate decay rate, early stopping, and whether the model at any iteration might have overfitted. The final model is picked from the iteration with the lowest validation loss instead.
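A minimal sketch of the two pieces described here: a deterministic training/validation predicate and the selection of the iteration with the lowest validation loss. The hash-based split and the names `ValidationSketch`, `isTraining(String, int)` and `bestIteration` are assumptions for illustration; the PR's isTraining function is user-supplied and may work differently.

```java
// Hypothetical sketch of a validation split and best-iteration selection.
final class ValidationSketch {
    // Assumption: split on a stable hash of an example id, so that the same
    // example always lands on the same side across iterations.
    static boolean isTraining(String exampleId, int validationPercent) {
        return Math.floorMod(exampleId.hashCode(), 100) >= validationPercent;
    }

    // Returns the index of the iteration with the lowest validation loss.
    static int bestIteration(double[] validationLoss) {
        if (validationLoss.length == 0) {
            throw new IllegalArgumentException("no iterations recorded");
        }
        int best = 0;
        for (int i = 1; i < validationLoss.length; i++) {
            if (validationLoss[i] < validationLoss[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] losses = {0.9, 0.6, 0.5, 0.55, 0.7};
        System.out.println(bestIteration(losses));
    }
}
```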

Speed Optimization

With validation loss, we were able to experiment with parameters and sampling techniques with much better confidence. This PR also adds the option sample_by_partition, which takes a sample of partitions. This plays nicely when there are lots of partitions and the data is randomly distributed when they are pulled. It appears the model can converge much faster when a smaller sample is chosen and run with a few more iterations.

@deerzq @jq @timyitong

p.s. I will update hackpad once this is merged to outline a few techniques and findings.
opened by saurfang 9
add evaluation data source

The training and evaluation datasets are usually different and in a lot of cases need to be separated by time, not by random sample, e.g. training data is from 2015 and eval data is from 2016.

The current Aerosolve config doesn't support adding an evaluation dataset. This PR adds that option for the user. As suggested by @deerzq, I refactored makeTraningrun to make it more generic so that it produces both the training dataset and the (optional) evaluation dataset.

@jq @saurfang

opened by Patrick1860 6
additive model
Implement AdditiveModel, which takes more flexible forms of functions. In addition to Spline, it also takes linear functions.

Add an abstract function class; Spline and Linear both inherit from it.

Add unit test, fix a small bug in LinearModelTest.java

Implement additive model trainer

@hectorgon
opened by deerzq 6
AdditiveModelTrainer Refactoring and Documentation

This started as an attempt to refactor AdditiveModelTrainer to make it much faster. However, I have come to the conclusion that the trainer implementation is not the bottleneck; rather, the inefficient Example class is the culprit for the poor performance. It takes roughly 5~10x as long to load Example data as to run the actual optimization routine at each iteration.

Before moving to a more efficient representation of feature vectors, I want to merge in the initial refactoring, cleanup and documentation I added for the analysis. There should be no performance impact or logic changes in this PR.

@jq @deerzq @yolken

Let me know if you have any comments or suggestions on the documentation. You will probably find it easier to review commit by commit.

opened by saurfang 5
Add string tokenizer transform

This is nowhere close to being finished, just wanted to push it up so that we can more easily discuss the design of the transform, whether an explicit output should be generated, and whether repeated tokens are desired (since the output is currently a set). We can talk about this tomorrow morning. @hectorgon @yolken

opened by christhetree 5

Add eval_test option

eval_test was an option in the income pipeline but not part of Genericpipeline. The goal is to be able to evaluate the performance of a model on separate test data, as specified in eval_test, rather than on the hold_out data.

Example usage:

make_testing {
  hive_query : ${generic_hive_query}" from "${training_table}${testing_filter}
  num_shards : 20
  output : ${testing_data}
}

eval_test {
  input : ${testing_data}
  subsample : ${eval_subsample}
  bins : 100
  model_config : ${model_config}
  is_probability : false
  is_regression : false
  metric_to_maximize : "!HOLD_F1"
  model_name : ${model_name}
}

opened by spritehc 4

Add FloatCrossFloatTransform

to: @jq cc: @deerzq

Content: This is just a quick foundation of what I envisioned a float cross float transform might look like (essentially a MoveFloatToString transform followed by a StringCrossFloat transform, but without deleting the original float feature or creating an intermediate string feature which would occur if these two transforms were applied back to back). The ability to specify which keys to cross could be added as well instead of crossing the entire family.

Please let me know what you think / if this is what you need. Once we're on the same page I'll add comments and tests. Thanks!

opened by christhetree 4
combineContextAndItems(Example examples)

While trying to use aerosolve for the Kaggle Airbnb competition, I noticed in com.airbnb.aerosolve.core.transforms.Transformer.java that only string features are copied from the context to the examples. Is this behavior on purpose? In general, is the differentiation between context and item features only a memory optimization, or does it impact the algorithm's ranking task? Thanks in advance for any advice.

opened by randombishop 4
dynamic bucketing in AUC calculation

What

Optimize the ROC-AUC calculation for datasets with unevenly distributed scores. AUC is calculated by first dividing the whole area into 100 small vertical strips, approximating the area of each strip and summing them up. In our existing implementation, we first get the min and max of all scores (probabilities) and evenly divide the space into 100 intervals, no matter how many records are in each interval.

This makes the AUC inaccurate for score distributions that are not uniform. During my experiment with short term labeling, our existing implementation returned an AUC of 0.74. I took a sample offline and used sklearn to calculate it; the result was 0.87. With this PR, the result is 0.85, which is far more reasonable.

For our base models, the error introduced by this approximation is much smaller.

How

Divide the score space unevenly according to the true distribution. More specifically, first sort the records by score, use zipWithIndex to get the rank of each score, then use rank/bucketSize to get the bucket.

I also changed the number of buckets from 100 to 10000; it turns out to make little difference. But I think 10000 is better since we do want a more accurate estimate and can afford the computation resources.
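The rank-based bucketing can be illustrated on in-memory arrays. This is a sketch of the idea only, not the PR's Spark code: `AucSketch` is a hypothetical class, records are sorted by descending score, every bucket holds the same number of records, and the ROC curve is integrated with the trapezoid rule at bucket boundaries.

```java
import java.util.Arrays;

// Hypothetical sketch of AUC with equal-count (rank-based) buckets.
final class AucSketch {
    static double auc(double[] scores, boolean[] labels, int numBuckets) {
        int n = scores.length;
        Integer[] order = new Integer[n];
        long totalPos = 0;
        long totalNeg = 0;
        for (int i = 0; i < n; i++) {
            order[i] = i;
            if (labels[i]) {
                totalPos++;
            } else {
                totalNeg++;
            }
        }
        if (totalPos == 0 || totalNeg == 0) {
            throw new IllegalArgumentException("need both positive and negative labels");
        }
        // Rank records from highest to lowest score.
        Arrays.sort(order, (a, b) -> Double.compare(scores[b], scores[a]));
        // bucket = rank / bucketSize, so each bucket holds the same record count.
        int bucketSize = Math.max(1, (n + numBuckets - 1) / numBuckets);

        double area = 0.0;
        long tp = 0;
        long fp = 0;
        double prevTpr = 0.0;
        double prevFpr = 0.0;
        for (int start = 0; start < n; start += bucketSize) {
            int end = Math.min(n, start + bucketSize);
            for (int rank = start; rank < end; rank++) {
                if (labels[order[rank]]) {
                    tp++;
                } else {
                    fp++;
                }
            }
            double tpr = tp / (double) totalPos;
            double fpr = fp / (double) totalNeg;
            // Trapezoid between the previous and the current ROC point.
            area += (fpr - prevFpr) * (tpr + prevTpr) / 2.0;
            prevTpr = tpr;
            prevFpr = fpr;
        }
        return area;
    }

    public static void main(String[] args) {
        double[] scores = {0.9, 0.8, 0.7, 0.6};
        boolean[] labels = {true, false, true, false};
        System.out.println(auc(scores, labels, 4));
    }
}
```

With as many buckets as records this reduces to the usual trapezoidal ROC-AUC; fewer buckets trade accuracy for less work.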

@deerzq @saurfang

opened by luanjunyi 3
Perform Evaluation on RDD

Revert the code cleanup that merged the evaluation of records from RDD and List. Instead of merging the code paths using List, perform evaluation on RDD to handle large datasets.

There is no reliable way to get a SparkContext, especially with the way we set up our tests. Since there is really no need to run evaluation on a List of records, and lists can always be parallelized with a single function call, we are removing evaluation on lists altogether.

Related to #237 #231

@jq @deerzq @luanjunyi

opened by saurfang 3
Allow kdtree bounds update

Adds an updateBounds function to KDTree so the boundary of a KDTree node can be updated after a KDTree model is trained. Splits are not affected by this update; the boundary information is purely informational.

Reformatted code and applied static code analysis suggestions.

@jq @spencerde

opened by saurfang 3
Your project airbnb aerosolve is using buggy third-party libraries [WARNING]
Hi, there!

We are a research team working on third-party library analysis. We have found that some widely used third-party libraries in your project have major/critical bugs, which will degrade the quality of your project. We highly recommend that you update those libraries to new versions.

We have attached the buggy third-party libraries and the corresponding Jira issue links below for more detailed information. We analyzed the API calls related to the following libraries and found one library for which your project uses an API call that might invoke buggy methods in past versions of the library.

1. commons-codec commons-codec version: 1.4
API call in your project: org.apache.commons.codec.binary.Base64.setInitialBuffer(byte[],int,int)

Jira issues:
Base64InputStream#read(byte[]) incorrectly returns 0 at end of any stream which is multiple of 3 bytes long version:1.4
ArrayIndexOutOfBoundsException when doing multiple reads() on encoding Base64InputStream version:1.4
Base64 encoding issue for larger avi files version:1.4
org.apache.commons.codec.net.URLCodec.ESCAPE_CHAR isn't final but should be version:1.2;1.3;1.4
org.apache.commons.codec.language.RefinedSoundex.US_ENGLISH_MAPPING should be package protected MALICIOUS_CODE version:1.4
org.apache.commons.codec.language.Soundex.US_ENGLISH_MAPPING should be package protected MALICIOUS_CODE version:1.4
Caverphone encodes names starting and ending with "mb" incorrectly. version:1.4
All links to fixed bugs in the "Changes Report" http://commons.apache.org/codec/changes-report.html point nowhere; e.g. http://issues.apache.org/jira/browse/34157. Looks as if all JIRA tickets were renumbered. version:1.1;1.2;1.3;1.4
Regression: Base64.encode(chunk=true) has bug when input length is multiple of 76 version:1.4
DigestUtils: MD5 checksum is not calculated correctly on linux64-platforms version:1.3;1.4
new Base64().encode() appends a CRLF; and chunks results into 76 character lines version:1.4
Base64 encode() method is no longer thread-safe; breaking clients using it as a shared BinaryEncoder version:1.4
Base64 default constructor behaviour changed to enable chunking in 1.4 version:1.4
Base64InputStream causes NullPointerException on some input version:1.4
Base64.encodeBase64String() shouldn't chunk version:1.4

2. org.apache.commons commons-lang3 version: 3.4

Jira issues:
TypeUtils.ParameterizedType#equals doesn't work with wildcard types version:3.3.2;3.4
DateUtilsTest.testLang530 fails for some timezones version:3.4
StringUtils.stripAccents from "Ł" and "ł" version:3.4
No release notes for version 3.4 version:3.4
JsonToStringStyle doesn't handle chars and objects correctly version:3.4
ReflectionToStringBuilder doesn't throw IllegalArgumentException when the constructor's object param is null version:3.4
StrLookup.systemPropertiesLookup() no longer reacts on changes on system properties version:3.4
StringUtils#capitalize: Javadoc says toTitleCase; code uses toUpperCase version:3.4
Multiple calls of org.apache.commons.lang3.concurrent.LazyInitializer.initialize() are possible version:3.4;3.5
EnumUtils *BitVector issue with more than 32 values Enum version:3.4
StringUtils#equals fails with Index OOBE on non-Strings with identical leading prefix version:3.4
There are no tests for CharSequenceUtils.regionMatches version:3.4
ArrayUtils.removeAll(Object array; int... indices) should do the clone; not its callers version:3.4
TypeUtils.isAssignable throws NullPointerException when fromType has type variables and toType generic superclass specifies type variable version:3.4
FastDateFormat does not support the week-year component (uppercase 'Y') version:3.4
ordinalIndexOf("abc"; "ab"; 1) gives incorrect answer of -1 (correct answer should be 0) version:3.4
Fix implementation of StringUtils.getJaroWinklerDistance() version:3.4
parseDateStrictly does't pass specified locale version:3.4
ClassUtils.getClass(ClassLoader; String) fails for "void" version:3.4
NumberUtils.isNumber bug version:3.4
FastDateFormat doesn't respect summer daylight in localized strings version:3.4
StringUtils#normalizeSpace does not trim the string anymore version:3.4
DiffBuilder: Add null check on fieldName when appending Object or Object[] version:3.4
FastDatePrinter Memory allocation regression version:3.4
SerializationUtils.ClassLoaderAwareObjectInputStream should use static initializer to initialize primitiveTypes map. version:3.2;3.3;3.4
NumberUtils.isNumber and NumberUtils.createNumber resolve inconsistently version:3.4
ArrayUtils.contains returns false for instances of subtypes version:3.4
CompareToBuilder.append(Object;Object;Comparator) method is too big to be inlined version:3.4
StrBuilder#replaceAll ArrayIndexOutOfBoundsException version:3.2.1;3.4;3.5
NumberUtils#createNumber() returns positive BigDecimal when negative Float is expected version:3.x

Sincerely~
FDU Software Engineering Lab
March 14th, 2019
opened by FDUSELAB2 0
Error when $ gradle shadowjar --info
:core:compileJava (Thread[Task worker for ':',5,main]) started.

Task :core:compileJava
Putting task artifact state for task ':core:compileJava' into context took 0.0 secs.
Executing task ':core:compileJava' (up-to-date check took 0.006 secs) due to:
  Task has failed previously.
All input files are considered out-of-date for incremental task ':core:compileJava'.
Compiling with JDK Java compiler API.
/Users/maximebodereau/Documents/Projects/Ux AI/aerosolve/core/src/main/java/com/airbnb/aerosolve/core/util/Weibull.java:13: error: cannot find symbol
  public WeibullBuilder defaultBuilder() {
         ^
  symbol: class WeibullBuilder
  location: class Weibull
1 error

:core:compileJava (Thread[Task worker for ':',5,main]) completed. Took 0.262 secs.

FAILURE: Build failed with an exception.

What went wrong: Execution failed for task ':core:compileJava'.

java.lang.NoSuchFieldError: pid
opened by maximebodereau 0
Blocking issue with running Aerosolve demos
Hi, I'm a Stanford MS student trying to run the image impressionism and income classification demos. When running gradle shadowjar --info, I get multiple errors of the following type during the execution of the task :core:compileJava:

/Users/ei5h4/Documents/aerosolve/core/build/gen-java/com/airbnb/aerosolve/core/ModelRecord.java:1075: error: method hashCode in class Object cannot be applied to given types;
  hashCode = hashCode * 8191 + org.apache.thrift.TBaseHelper.hashCode(featureWeight);
  ^
  required: no arguments
  found: double
  reason: actual and formal argument lists differ in length

My thrift version is 0.10.0. I tried downloading and installing an older version of thrift (0.9.0) from source since this demo is old and might rely on an older thrift (just a hypothesis). But that turned out to have some roadblocks as well since the older thrift uses some C code namespace tr1 that is no longer supported by C++11 on my OSX El Capitan. So I couldn't verify if thrift is the issue or something else. Basically I thought the hashCode function in the error above might have a changed signature from 0.9.0 to 0.10.0.

I think anyone else attempting to build the demo will run into this issue as well. Really hope to get this running on my machine soon. Aerosolve is super exciting!
opened by jaypatelh 2

Owner

Airbnb

GitHub http://airbnb.github.io/aerosolve/
