Java time series machine learning tools in a Weka-compatible toolkit

UEA Time Series Classification

Build status badge: https://travis-ci.com/uea-machine-learning/tsml.svg?branch=master

A Weka-compatible Java toolbox for time series classification, clustering and transformation. For the Python sklearn-compatible version, see sktime.

Find out more about our broader work and the dataset hosting for the UCR univariate and UEA multivariate time series classification archives on our website.

This codebase is actively developed for our research. The dev branch contains the most up-to-date, but still stable, code.

Installation

We are looking into deploying this project on Maven or Gradle in the future. For now, there are two options:

  • download the jar file and include it as a dependency in your project, or run experiments through the command line (see the examples on running experiments), or
  • fork or download the source files and include them in a project in your favourite IDE; you can then construct your own experiments (see our examples) and implement your own classifiers.
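For example, a single command-line run might look like this (a sketch only: the paths are placeholders, and the flag names are assumed to match those accepted by experiments.Experiments, i.e. -dp data path, -rp results path, -cn classifier name, -dn dataset name, -f fold/resample id):

```shell
# Hypothetical invocation; adjust the jar name and paths to your setup.
java -jar tsml.jar \
  -dp=/path/to/datasets/ \
  -rp=/path/to/results/ \
  -cn=TSF -dn=ItalyPowerDemand -f=1
```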

Overview

This codebase mainly represents the implementation of different algorithms in a common framework; in the period leading up to the Great Time Series Classification Bake Off in particular, the lack of such a framework was a real problem, with implementations scattered across Python, C/C++, Matlab, R, Java, etc., or even combinations thereof.

We therefore mainly provide implementations of different classifiers, as well as experimental and results-analysis pipelines, in the hope of promoting open-source, easily comparable, and easily reproducible results, specifically within the TSC space.

While deep learning methods are obviously very important to study, we will very likely not be implementing any of them in this codebase, leaving them rightfully in the land of languages and libraries optimised for them, such as sktime-dl, the Keras-enabled extension to sktime.

Our examples run through the basics of using the code. The basic layout of the codebase is as follows:

evaluation/
contains classes for generating, storing and analysing the results of your experiments
experiments/
contains classes specifying the experimental pipelines we utilise, and lists of classifier and dataset specifications. The 'main' class is Experiments.java, however other experiments classes exist for running on simulation datasets or for generating transforms of time series for later classification, such as with the Shapelet Transform.
tsml/ and multivariate_timeseriesweka/
contain the TSC algorithms we have implemented, for univariate and multivariate classification respectively.
machine_learning/
contains extra algorithm implementations that are not specific to TSC, such as generalised ensembles or classifier tuners.

Implemented Algorithms

Classifiers

The list of implemented TSC algorithms will continue to grow over time. These are all in addition to the standard Weka classifiers and the non-TSC algorithms defined under the machine_learning package.

We have implemented the following bespoke classifiers for univariate, equal length time series classification:

| Distance Based | Dictionary Based | Kernel Based | Shapelet Based | Interval Based | Hybrids |
| --- | --- | --- | --- | --- | --- |
| DD_DTW | BOSS | Arsenal | LearnShapelets | TSF | HIVE-COTE |
| DTD_C | cBOSS | ROCKET | ShapeletTransform | TSBF | Catch22 |
| ElasticEnsemble | TDE | | FastShapelets | LPS | |
| NN_CID | WEASEL | | ShapeletTree | CIF | |
| SAX_1NN | SAXVSM | | | DrCIF | |
| ProximityForest | SpatialBOSS | | | RISE | |
| DTW_kNN | SAX_1NN | | | STSF | |
| FastDTW | BagOfPatterns... | | | | |
| FastElasticEn... | BOSSC45 | | | | |
| ShapeDTW_1NN | BoTSWEnsemble | | | | |
| ShapeDTW_SVM | BOSSSpatialPy... | | | | |
| SlowDTW_1NN | | | | | |
| KNN | | | | | |

And we have implemented the following bespoke classifiers for multivariate, equal length time series classification:

NN_ED_D
NN_ED_I
NN_DTW_D
NN_DTW_I
STC_D
NN_DTW_A
MultivariateShapeletTransform
ConcatenateClassifier
MultivariateHiveCote
WEASEL+MUSE
MultivariateSingleEnsemble
MultivariateAbstractClassifier
MultivariateAbstractEnsemble

Clusterers

Currently quite limited, aside from those already shipped with Weka.

UnsupervisedShapelets  
K-Shape  
DictClusterer  
TTC  
AbstractTimeSeriesClusterer

Filters

SimpleBatchFilters that take an Instances object (the set of time series), transform it, and return a new Instances object.

ACF ACF_PACF ARMA
BagOfPatternsFilter BinaryTransform Clipping
Correlation Cosine DerivativeFilter
Differences FFT Hilbert
MatrixProfile NormalizeAttribute NormalizeCase
PAA PACF PowerCepstrum
PowerSpectrum RankOrder RunLength
SAX Sine SummaryStats

Transformers

We will be shifting over to a bespoke Transformer interface.

ShapeletTransform  
catch22  

Paper-Supporting Branches

This project acts as the general open-source codebase for our research, especially the Great Time Series Classification Bake Off. We are also trialling a process of creating stable branches in support of specific outputs.

Current branches of this type are:

Contributors

Lead: Anthony Bagnall (@TonyBagnall, @tony_bagnall, [email protected])

We welcome anyone who would like to contribute their algorithms!

License

GNU General Public License v3.0

Comments
  • Beginning of ROCKET implementation

    implemented random kernel construction

    using this as the python reference: https://github.com/angus924/rocket/blob/master/code/rocket_functions.py

    #434

    opened by ABostrom 11
  • Experiments cmdline parameters

    Idea for dealing with the current ClassifierLists issue of handling parameters. Defining a name with the parameter, e.g. "RotF<100/200/500>" becomes clunky with a large number of parameter options.

    Currently there are two use cases:

    1. I have some predetermined hyper-parameters that I want to set, e.g. numTrees, and I want to pass this via cmdline.
    2. I have some predetermined dependent parameters that I want to set via cmdline, say build time limit, e.g. 1m, 2m, 5m, 10m, 30m, 1h. Each of these is dependent on the previous and it is redundant to have 6 different cluster jobs, one for each time limit, as work would be repeated. It would be better to run one job which contracts the classifier for 1m, records results, 2m, records results, etc...

    Current thoughts on how this would work on the cmdline is:

    1. pass the parameter key/values as a string separated by spaces, e.g. "-p param1 val1 param2 val2".
    2. pass incremental parameter sets in the same way, but each increment is a separate cmdline option, e.g. "-ip trainContract 60 -ip trainContract 120 -ip trainContract 300".
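The "-p" form above could be parsed along these lines (a hypothetical sketch with made-up names, not the actual Experiments/jcommander code):

```java
import java.util.*;

public class ParamParse {
    // Collect "-p key val key val ..." pairs into an ordered map.
    public static Map<String, String> parse(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        for (int i = 0; i < args.length; i++) {
            if ("-p".equals(args[i])) {
                int j = i + 1;
                // consume key/value pairs until the next flag or the end of args
                while (j + 1 < args.length && !args[j].startsWith("-")) {
                    params.put(args[j], args[j + 1]);
                    j += 2;
                }
                i = j - 1;
            }
        }
        return params;
    }
}
```

The "-ip" incremental sets could be handled the same way, appending each occurrence to a list of parameter sets instead of a single map.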

    Benefits:

    • easy parameter passing for experiments, everything can be passed from the bash script which makes it easier to loop over parameters
    • no need to define a new classifier id (e.g. RotF200) per parameter you're examining
    • future proofing, every parameter for a classifier can be passed through this interface so there's no need to add more and more jcommander options as parameters grow

    I have a hacky version on the ee branch currently: https://github.com/uea-machine-learning/tsml/blob/a0278c0f11ae2fa5e8812bbef229debd69cdfdac/src/main/java/experiments/Experiments.java#L102

    opened by goastler 11
  • TSReader can't read data files that have new line characters in form \r\n

    TSReader cannot read in data files when the new line format is CRLF (\r\n). It only works if the new line format is LF (\n). I believe it is an issue with StreamTokenizer.

    I've found that the problem line is 255 in TSReader (getLastToken). The ttype value is -3 (TT_WORD). Then when it reaches the \r of the first line in the .ts file, it becomes 13. However from the documentation, the ttype value apparently cannot be 13. I've debugged the tsml code and it isn't setting it to 13 anywhere, hence why I believe the issue to be with StreamTokenizer itself.

    The exact code where it sets the ttype value to 13 is m_Tokenizer.nextToken(). Before this code, ttype was -3, and after it becomes 13.
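For what it's worth, this matches the documented StreamTokenizer contract: when a character is declared *ordinary*, nextToken() returns it with ttype set to the character value itself, so 13 is simply the '\r'. A small demonstration, plus one possible fix (hypothetical class name; assumes the tokenizer's syntax table can be adjusted where TSReader configures it):

```java
import java.io.*;

public class CrlfDemo {
    // Reproduce the reported behaviour: with '\r' as an ordinary character,
    // nextToken() returns ttype == 13 (the char value of '\r').
    public static int[] ttypesWithOrdinaryCR(String text) {
        try {
            StreamTokenizer st = new StreamTokenizer(new StringReader(text));
            st.ordinaryChar('\r');
            return new int[]{st.nextToken(), st.nextToken()};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Possible fix: declare '\r' as whitespace so CRLF and LF files
    // tokenize identically.
    public static String tokensWithCRAsWhitespace(String text) {
        try {
            StreamTokenizer st = new StreamTokenizer(new StringReader(text));
            st.whitespaceChars('\r', '\r');
            StringBuilder sb = new StringBuilder();
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (sb.length() > 0) sb.append(' ');
                sb.append(st.sval);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```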

    opened by c-eg 10
  • This shapelet is strange, help me interpret it, please!

    Hi,

    I'm trying to analyse the shapelets that have been extracted from the Chinatown dataset. To do so, I used the following steps.

    1. Train ST on the dataset to extract the shapelets; here is the code snippet
        ShapeletTransform transform = new ShapeletTransform();
        transform.setClassValue(new BinaryClassValue());
        transform.setSubSeqDistance(new SubSeqDistance());
        transform.setShapeletMinAndMax(min, max);
        transform.setLengthIncrement(lenghtIncrement);
        transform.useCandidatePruning();
        transform.setNumberOfShapelets(train.numInstances() * 10);
        transform.setQualityMeasure(ShapeletQualityChoice.INFORMATION_GAIN);
        transform.setLogOutputFile(shapeletFile);
        transform.supressOutput();
    
    2. Since shapelets are z-normalised, I try to recover the original values of each shapelet shp as follows: x = y * std(shp) + mean(shp). This formula is applied to each value y of the considered shapelet shp.

    3. Then I plot the recovered shapelet and its corresponding time series on the same axes (attached plot: "Strange - Shapelet - Chinatown").

    As you can see in the plot, the shapelet deviates too much from the original series, which is very strange: shapelets are supposed to be subsequences of a time series in the dataset.

    I'm wondering whether this is due to floating point computation, a mistake in my process, or something else.
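As an aside on step 2: the mean and standard deviation of an already z-normalised shapelet are approximately 0 and 1, so applying x = y * std(shp) + mean(shp) to the normalised values leaves them essentially unchanged; inverting the normalisation requires the statistics of the original subsequence. A minimal sketch (hypothetical helper class, not tsml code):

```java
public class ZNormDemo {
    // z-normalise: subtract the mean, divide by the (population) std dev.
    public static double[] zNormalise(double[] x) {
        double mean = 0.0, sq = 0.0;
        for (double v : x) mean += v;
        mean /= x.length;
        for (double v : x) sq += (v - mean) * (v - mean);
        double std = Math.sqrt(sq / x.length);
        double[] y = new double[x.length];
        for (int i = 0; i < x.length; i++) y[i] = (x[i] - mean) / std;
        return y;
    }

    // Invert using the mean/std of the ORIGINAL subsequence, not of the
    // normalised values (whose mean/std are ~0 and ~1).
    public static double[] invert(double[] y, double mean, double std) {
        double[] x = new double[y.length];
        for (int i = 0; i < y.length; i++) x[i] = y[i] * std + mean;
        return x;
    }
}
```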

    Thanks

    opened by frankl1 10
  • setClassifier and horribleGlobalPath

    hi, James has this to be backward compatible with my hack for reading CAWPE from file. I propose we remove classic and my global strings, and just have the switch inside setClassifier(ExperimentalArguments a).

    Any reason not to do this that anyone knows of? The only downside is that you need to assume the results files for CAWPEFROMFILE will all be in the same directory, and that will have to be where CAWPEFROMFILE writes to (if you use setClassifier to run this). I have no problem with that.

    opened by TonyBagnall 10
  • Standard tie-breaking practice

    Based on a small discussion we had in the office, do we have a standard practice for tie-breaking? Do all classifiers use the same method and what method should be used? i.e. first item, random selection, weighted random selection using class distribution. Mostly in the context of class selection for classification but could be extended to other situations with ties. Mainly for discussion, if this is a non-issue feel free to close.
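For discussion purposes, one of the named options, breaking ties uniformly at random among the argmax classes, can be sketched as follows (hypothetical helper, not tsml's actual policy):

```java
import java.util.*;

public class TieBreak {
    // Index of the highest probability; ties broken uniformly at random.
    public static int argMaxRandomTie(double[] dist, Random rng) {
        List<Integer> best = new ArrayList<>();
        double max = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < dist.length; i++) {
            if (dist[i] > max) {
                max = dist[i];
                best.clear();
                best.add(i);
            } else if (dist[i] == max) {
                best.add(i);
            }
        }
        return best.get(rng.nextInt(best.size()));
    }
}
```

Seeding the Random makes the choice reproducible across runs, which matters for comparable experiments.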

    good first issue 
    opened by MatthewMiddlehurst 10
  • Branch protection

    We need some form of branch protection for the dev branch. @MatthewMiddlehurst accidentally deleted the dev branch earlier this week and managed to recover it (phew!). To prevent this from happening again, we need some kind of protection over the important branches. Master is already protected.

    On github, to protect a branch you have to make a branch protection rule. I made a rule for master which is why you can't push directly to it. We have two options for dev:

    1. make an empty rule which stops accidental deletion / renaming / etc
    2. protect dev in the same way as master, preventing pushing unless via pull request. This is a popular option to maintain a clean dev version, but comes at the cost / pain of having to do pull requests every time (although I think it's a good idea)

    We at the very least need option 1, would like some feedback on option 2 :)

    meta 
    opened by goastler 9
  • Difference in Algorithms Results on TSC website

    While using /java/tsml/examples/ClassificationExamples.java class to run all the classifiers, I found my results close but not the same as mentioned on the website.

    One significant discovery was that Published Results and Recreated Results had a very large difference on the website. For example,

    Published Result for Fast Shapelets on GunPoint = 0.061
    Recreated Result for Fast Shapelets on GunPoint = 0.9296

    Published Result for Shapelet Transform on GunPoint = 0.013
    Recreated Result for Shapelet Transform on GunPoint = 0.9987

    I am worried that I am misinterpreting the Published Results data on the website as accuracy. I went on studying other algorithms and found that Published results are 0.0XX for other algorithms too.

    Another discovery I made was the difference in Recreated Results and the Results I calculated. For example,

    Recreated Result for COTE on GunPoint = 0.9919
    Reproduced Result for COTE on GunPoint (by me) = 0.98

    Recreated Result for LPS on GunPoint = 0.9719
    Reproduced Result for LPS on GunPoint (by me) = 1.0

    I am unable to understand why I am not being able to get the results mentioned on the website. I am using MacBook Air with 8GB RAM & 1.8GHz Dual-core i5 processor.

    opened by ThisIsNSH 8
  • HiveCote.main() results in java.lang.UnsupportedOperationException

    When I start the main function for the HIVE-COTE classifier (Master c3e7215cd58d96d5fdb7c36e7a719b76f3bbb9f6) it results in the following exception:

        training (group a): TSF
        java.lang.UnsupportedOperationException: getTrainAcc not implemented in class timeseriesweka.classifiers.interval_based.TSF
            at timeseriesweka.classifiers.TrainAccuracyEstimator.getTrainAcc(TrainAccuracyEstimator.java:72)
            at timeseriesweka.classifiers.hybrids.HiveCote.buildClassifier(HiveCote.java:193)
            at at.outfisltisl.classification.AbstractClassifierWrapper.performStep(AbstractClassifierWrapper.java:69)
            at at.outfisltisl.pipeline.Pipeline.runPipeline(Pipeline.java:114)
            at at.outfisltisl.app.RunClassificationComparison.runSingleClassifierExperiment(RunClassificationComparison.java:57)
            at at.outfisltisl.app.RunClassificationComparison.runExperimentsForDataset(RunClassificationComparison.java:40)
            at at.outfisltisl.app.OutfisltislApp.main(OutfisltislApp.java:341)

    The problem here is that the function getTrainAcc() is not implemented for timeseriesweka.classifiers.interval_based.TSF. By the way: I like the usage of default methods within the interface timeseriesweka.classifiers.TrainAccuracyEstimator ;-)

    opened by davcem 8
  • timeseriesweka.filters.shapelet_transforms

    yikes! I will have a little tidy up (GraceShapeletTransform) but I think this needs a bottom-up reconstruction. Perhaps do the python version first. I think James has a start on the slimmed-down version?

    opened by TonyBagnall 8
  • ShapeletTransformClassifier train fold bug

    ShapeletTransformClassifiers train files generated by Experiments currently only have usable results for the first cross validation fold. Subsequent folds only output probabilities of 0 for all classes. As James noted this is most likely due to buildClassifier being called multiple times during the Experiments CV, with calls past the first retaining information and somehow breaking.

    STC should either be changed to fully reset upon subsequent calls of buildClassifier, or be made a TrainAccuracyEstimator to create its own results (or both).
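The first suggested fix, a full reset at the top of buildClassifier, follows this general pattern (a sketch with hypothetical names, not the actual STC code):

```java
import java.util.ArrayList;
import java.util.List;

public class ResettableClassifier {
    private final List<Double> trainEstimates = new ArrayList<>();

    // Clearing all accumulated state first makes repeated calls independent,
    // so a second build cannot retain information from the first.
    public void buildClassifier(double[][] data) {
        trainEstimates.clear();
        for (double[] inst : data) {
            trainEstimates.add(inst[0]); // stand-in for the real training work
        }
    }

    public int numEstimates() {
        return trainEstimates.size();
    }
}
```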

    bug 
    opened by MatthewMiddlehurst 7
  • Question on UCR Datasets

    Hi, I am trying to follow some Shapelets work in time series classification, such as

    I notice that following datasets are widely used in these work:

    • DP_Little
    • DP_Middle
    • DP_Thumb
    • MP_Little
    • MP_Middle
    • PP_Little
    • PP_Middle
    • PP_Thumb

    However, I did not find them on http://www.timeseriesclassification.com/ (although most of the authors state that the datasets came from the UCR archive). I would like to know whether these datasets have been renamed, or where I could get them.

    Can anyone help me? Thank you for your assistance.

    opened by Wwwwei 1
  • How to recreate 97,03% from ItalyPowerDemand, COTE, plus exception.

    For jar - http://timeseriesclassification.com/Downloads/tsml11_3_2020.jar

    Running:

        java -jar tsml11_3_2020.jar -dp=C:\Recreate\Univariate_arff\ -rp=C:\Recreate\Temp\ -gtf=true -cn=FlatCote -dn=ItalyPowerDemand -f=100 --force=true -tb=true

    Getting:

        Raw args: -dp=C:\Recreate\Univariate_arff -rp=C:\Recreate\Temp -gtf=true -cn=FlatCote -dn=ItalyPowerDemand -f=100 --force=true -tb=true

        Exception in thread "main" java.lang.NoSuchMethodError: weka.classifiers.functions.SMO.setBuildLogisticModels(Z)V
            at machine_learning.classifiers.ensembles.CAWPE.setupDefaultEnsembleSettings(CAWPE.java:138)
            at machine_learning.classifiers.ensembles.AbstractEnsemble.<init>(AbstractEnsemble.java:148)
            at machine_learning.classifiers.ensembles.CAWPE.<init>(CAWPE.java:111)
            at tsml.classifiers.legacy.COTE.FlatCote.buildClassifier(FlatCote.java:102)
            at evaluation.evaluators.SingleTestSetEvaluator.evaluate(SingleTestSetEvaluator.java:97)
            at evaluation.evaluators.CrossValidationEvaluator.lambda$crossValidateWithStats$0(CrossValidationEvaluator.java:145)
            at evaluation.evaluators.CrossValidationEvaluator.crossValidateWithStats(CrossValidationEvaluator.java:154)
            at evaluation.evaluators.CrossValidationEvaluator.crossValidateWithStats(CrossValidationEvaluator.java:87)
            at experiments.Experiments.findExternalTrainEstimate(Experiments.java:564)
            at experiments.Experiments.runExperiment(Experiments.java:356)
            at experiments.Experiments.setupAndRunExperiment(Experiments.java:293)
            at experiments.Experiments.main(Experiments.java:143)

    The idea is: I want to recreate the results at http://timeseriesclassification.com/description.php?Dataset=ItalyPowerDemand, meaning 97.03% with COTE.

    1. From http://timeseriesclassification.com/Resamples.csv (from http://timeseriesclassification.com/results.php, Legacy results) I think COTE is FlatCote. Right?
    2. How does the program get 100 folds from the 67 entries in the ItalyPowerDemand train set? I see I can run the jar with "-f=100", as when running -cn=HiveCote, but how are the folds created? I ask because in http://timeseriesclassification.com/AllSplits.zip there are 100 columns of results for ItalyPowerDemand for FlatCote. Or was the program run 100 times on the whole training set of 67 entries without any folds? How was row 38 of Flat-COTE.csv, i.e. ItalyPowerDemand | 0.961127308 | 0.939747328 | ... and so on, created?

       Based on the statistics of row 38 (ItalyPowerDemand), sum | min | max = 97.034985425 | 0.93877551 | 0.980563654. The 97.03% was derived from here, but what are those 0.93877551, 0.980563654, ...?

    3. How do I deal with this exception? Or maybe I should have run the jar in some other way?
    opened by Blackisher 0
  • Datasets on the UCR Archive

    Hi, I am a student working on time series classification. I downloaded the UCR Archive from http://www.timeseriesclassification.com/. I read the paper "The UCR Time Series Archive" and understand that the datasets in the UCR Archive have undergone some preprocessing. But I do not know whether the time series in these datasets were obtained by regular (equidistant) sampling. Could you please give me some help? Thanks a lot.

    opened by peter943 0
  • Feature requests for HIVE-COTE

    Change the HC modules to AbstractClassifiers, so that their capabilities can be checked in HC and matched to the minimum of the components, especially with regard to multivariate/relational data.

    opened by TonyBagnall 2