Java time series machine learning tools in a Weka-compatible toolkit

UEA Time Series Classification

Build status badge: https://travis-ci.com/uea-machine-learning/tsml.svg?branch=master

A Weka-compatible Java toolbox for time series classification, clustering and transformation. For the Python sklearn-compatible version, see sktime.

Find out more about our broader work and the dataset hosting for the UCR univariate and UEA multivariate time series classification archives on our website.

This codebase is actively developed for our research. The dev branch contains the most up-to-date, but still stable, code.

Installation

We are looking into deploying this project on Maven or Gradle in the future. For now, there are two options:

  • download the jar file and include it as a dependency in your project, or run experiments through the command line (see the examples on running experiments), or
  • fork or download the source files and include them in a project in your favourite IDE; you can then construct your own experiments (see our examples) and implement your own classifiers.
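For example, a single command-line run might look like this (a sketch only: the paths are placeholders, and the flag names are assumed to match those accepted by experiments.Experiments, i.e. -dp data path, -rp results path, -cn classifier name, -dn dataset name, -f fold/resample id):

```shell
# Hypothetical invocation; adjust the jar name and paths to your setup.
java -jar tsml.jar \
  -dp=/path/to/datasets/ \
  -rp=/path/to/results/ \
  -cn=TSF -dn=ItalyPowerDemand -f=1
```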

Overview

This codebase mainly represents the implementation of different algorithms in a common framework; in the period leading up to the Great Time Series Classification Bake Off in particular, the lack of such a framework was a real problem, with implementations scattered across Python, C/C++, Matlab, R, Java, etc., or even combinations thereof.

We therefore mainly provide implementations of different classifiers, as well as experimental and results-analysis pipelines, in the hope of promoting open-source, easily comparable, and easily reproducible results, specifically within the TSC space.

While deep learning methods are obviously very important to study, we will very likely not be implementing any of them in this codebase, leaving them rightfully in the land of languages and libraries optimised for them, such as sktime-dl, the Keras-enabled extension to sktime.

Our examples run through the basics of using the code. The basic layout of the codebase is as follows:

evaluation/
contains classes for generating, storing and analysing the results of your experiments
experiments/
contains classes specifying the experimental pipelines we utilise, and lists of classifier and dataset specifications. The 'main' class is Experiments.java, however other experiments classes exist for running on simulation datasets or for generating transforms of time series for later classification, such as with the Shapelet Transform.
tsml/ and multivariate_timeseriesweka/
contain the TSC algorithms we have implemented, for univariate and multivariate classification respectively.
machine_learning/
contains extra algorithm implementations that are not specific to TSC, such as generalised ensembles or classifier tuners.

Implemented Algorithms

Classifiers

The list of implemented TSC algorithms will continue to grow over time. These are all in addition to the standard Weka classifiers and the non-TSC algorithms defined under the machine_learning package.

We have implemented the following bespoke classifiers for univariate, equal length time series classification:

| Distance Based | Dictionary Based | Kernel Based | Shapelet Based | Interval Based | Hybrids |
| --- | --- | --- | --- | --- | --- |
| DD_DTW | BOSS | Arsenal | LearnShapelets | TSF | HIVE-COTE |
| DTD_C | cBOSS | ROCKET | ShapeletTransform | TSBF | Catch22 |
| ElasticEnsemble | TDE | | FastShapelets | LPS | |
| NN_CID | WEASEL | | ShapeletTree | CIF | |
| SAX_1NN | SAXVSM | | | DrCIF | |
| ProximityForest | SpatialBOSS | | | RISE | |
| DTW_kNN | SAX_1NN | | | STSF | |
| FastDTW | BagOfPatterns... | | | | |
| FastElasticEn... | BOSSC45 | | | | |
| ShapeDTW_1NN | BoTSWEnsemble | | | | |
| ShapeDTW_SVM | BOSSSpatialPy... | | | | |
| SlowDTW_1NN | | | | | |
| KNN | | | | | |

And we have implemented the following bespoke classifiers for multivariate, equal length time series classification:

NN_ED_D
NN_ED_I
NN_DTW_D
NN_DTW_I
STC_D
NN_DTW_A
MultivariateShapeletTransform
ConcatenateClassifier
MultivariateHiveCote
WEASEL+MUSE
MultivariateSingleEnsemble
MultivariateAbstractClassifier
MultivariateAbstractEnsemble

Clusterers

Currently quite limited, aside from those already shipped with Weka.

UnsupervisedShapelets  
K-Shape  
DictClusterer  
TTC  
AbstractTimeSeriesClusterer

Filters

SimpleBatchFilters that take an Instances object (the set of time series), transform it, and return a new Instances object.

ACF ACF_PACF ARMA
BagOfPatternsFilter BinaryTransform Clipping
Correlation Cosine DerivativeFilter
Differences FFT Hilbert
MatrixProfile NormalizeAttribute NormalizeCase
PAA PACF PowerCepstrum
PowerSpectrum RankOrder RunLength
SAX Sine SummaryStats

Transformers

We will be shifting over to a bespoke Transformer interface.

ShapeletTransform  
catch22  

Paper-Supporting Branches

This project acts as the general open-source codebase for our research, especially the Great Time Series Classification Bake Off. We are also trialling a process of creating stable branches in support of specific outputs.

Current branches of this type are:

Contributors

Lead: Anthony Bagnall (@TonyBagnall, @tony_bagnall, [email protected])

We welcome anyone who would like to contribute their algorithms!

License

GNU General Public License v3.0

Comments
  • Beginning of ROCKET implementation

    implemented random kernel construction

    using this as the python reference: https://github.com/angus924/rocket/blob/master/code/rocket_functions.py

    #434

    opened by ABostrom 11
  • Experiments cmdline parameters

    Idea for dealing with the current ClassifierLists issue of handling parameters. Defining a name with the parameter, e.g. "RotF<100/200/500>" becomes clunky with a large number of parameter options.

    Currently there are two use cases:

    1. I have some predetermined hyper-parameters that I want to set, e.g. numTrees, and I want to pass this via cmdline.
    2. I have some predetermined dependent parameters that I want to set via cmdline, say build time limit, e.g. 1m, 2m, 5m, 10m, 30m, 1h. Each of these is dependent on the previous and it is redundant to have 6 different cluster jobs, one for each time limit, as work would be repeated. It would be better to run one job which contracts the classifier for 1m, records results, 2m, records results, etc...

    Current thoughts on how this would work on the cmdline is:

    1. pass the parameter key/values as a string separated by spaces, e.g. "-p param1 val1 param2 val2".
    2. pass incremental parameter sets in the same way, but each increment is a separate cmdline option, e.g. "-ip trainContract 60 -ip trainContract 120 -ip trainContract 300".
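The "-p" form above could be parsed along these lines (a hypothetical sketch with made-up names, not the actual Experiments/jcommander code):

```java
import java.util.*;

public class ParamParse {
    // Collect "-p key val key val ..." pairs into an ordered map.
    public static Map<String, String> parse(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        for (int i = 0; i < args.length; i++) {
            if ("-p".equals(args[i])) {
                int j = i + 1;
                // consume key/value pairs until the next flag or the end of args
                while (j + 1 < args.length && !args[j].startsWith("-")) {
                    params.put(args[j], args[j + 1]);
                    j += 2;
                }
                i = j - 1;
            }
        }
        return params;
    }
}
```

The "-ip" incremental sets could be handled the same way, appending each occurrence to a list of parameter sets instead of a single map.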

    Benefits:

    • easy parameter passing for experiments, everything can be passed from the bash script which makes it easier to loop over parameters
    • no need to define a new classifier id (e.g. RotF200) per parameter you're examining
    • future proofing, every parameter for a classifier can be passed through this interface so there's no need to add more and more jcommander options as parameters grow

    I have a hacky version on the ee branch currently: https://github.com/uea-machine-learning/tsml/blob/a0278c0f11ae2fa5e8812bbef229debd69cdfdac/src/main/java/experiments/Experiments.java#L102

    opened by goastler 11
  • TSReader can't read data files that have new line characters in form \r\n

    TSReader cannot read in data files when the new line format is CRLF (\r\n). It only works if the new line format is LF (\n). I believe it is an issue with StreamTokenizer.

    I've found that the problem line is 255 in TSReader (getLastToken). The ttype value is -3 (TT_WORD). Then when it reaches the \r of the first line in the .ts file, it becomes 13. However from the documentation, the ttype value apparently cannot be 13. I've debugged the tsml code and it isn't setting it to 13 anywhere, hence why I believe the issue to be with StreamTokenizer itself.

    The exact code where it sets the ttype value to 13 is m_Tokenizer.nextToken(). Before this code, ttype was -3, and after it becomes 13.
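For what it's worth, this matches the documented StreamTokenizer contract: when a character is declared *ordinary*, nextToken() returns it with ttype set to the character value itself, so 13 is simply the '\r'. A small demonstration, plus one possible fix (hypothetical class name; assumes the tokenizer's syntax table can be adjusted where TSReader configures it):

```java
import java.io.*;

public class CrlfDemo {
    // Reproduce the reported behaviour: with '\r' as an ordinary character,
    // nextToken() returns ttype == 13 (the char value of '\r').
    public static int[] ttypesWithOrdinaryCR(String text) {
        try {
            StreamTokenizer st = new StreamTokenizer(new StringReader(text));
            st.ordinaryChar('\r');
            return new int[]{st.nextToken(), st.nextToken()};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Possible fix: declare '\r' as whitespace so CRLF and LF files
    // tokenize identically.
    public static String tokensWithCRAsWhitespace(String text) {
        try {
            StreamTokenizer st = new StreamTokenizer(new StringReader(text));
            st.whitespaceChars('\r', '\r');
            StringBuilder sb = new StringBuilder();
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (sb.length() > 0) sb.append(' ');
                sb.append(st.sval);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```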

    opened by c-eg 10
  • This shapelet is strange, help me interpret it, please!

    Hi,

    I'm trying to analyse the shapelets that have been extracted from the Chinatown dataset. To do so, I used the following steps.

    1. Train ST on the dataset to extract the shapelets; here is the code snippet
        ShapeletTransform transform = new ShapeletTransform();
        transform.setClassValue(new BinaryClassValue());
        transform.setSubSeqDistance(new SubSeqDistance());
        transform.setShapeletMinAndMax(min, max);
        transform.setLengthIncrement(lenghtIncrement);
        transform.useCandidatePruning();
        transform.setNumberOfShapelets(train.numInstances() * 10);
        transform.setQualityMeasure(ShapeletQualityChoice.INFORMATION_GAIN);
        transform.setLogOutputFile(shapeletFile);
        transform.supressOutput();
    
    2. Since shapelets are z-normalised, I try to recover the original values of each shapelet shp as follows: x = y * std(shp) + mean(shp). This formula is applied to each value y of the considered shapelet shp.

    3. Then I plot the recovered shapelet and its corresponding time series on the same axes (attached plot: "Strange - Shapelet - Chinatown").

    As you can see in the plot, the shapelet deviates too much from the original series, which is very strange: shapelets are supposed to be subsequences of a time series in the dataset.

    I'm wondering whether this is due to floating point computation, a mistake in my process, or something else.
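As an aside on step 2: the mean and standard deviation of an already z-normalised shapelet are approximately 0 and 1, so applying x = y * std(shp) + mean(shp) to the normalised values leaves them essentially unchanged; inverting the normalisation requires the statistics of the original subsequence. A minimal sketch (hypothetical helper class, not tsml code):

```java
public class ZNormDemo {
    // z-normalise: subtract the mean, divide by the (population) std dev.
    public static double[] zNormalise(double[] x) {
        double mean = 0.0, sq = 0.0;
        for (double v : x) mean += v;
        mean /= x.length;
        for (double v : x) sq += (v - mean) * (v - mean);
        double std = Math.sqrt(sq / x.length);
        double[] y = new double[x.length];
        for (int i = 0; i < x.length; i++) y[i] = (x[i] - mean) / std;
        return y;
    }

    // Invert using the mean/std of the ORIGINAL subsequence, not of the
    // normalised values (whose mean/std are ~0 and ~1).
    public static double[] invert(double[] y, double mean, double std) {
        double[] x = new double[y.length];
        for (int i = 0; i < y.length; i++) x[i] = y[i] * std + mean;
        return x;
    }
}
```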

    Thanks

    opened by frankl1 10
  • setClassifier and horribleGlobalPath

    hi, James has this to be backward compatible with my hack for reading CAWPE from file. I propose we remove classic and my global strings, and just have the switch inside setClassifier(ExperimentalArguments a).

    Any reason not to do this that anyone knows of? The only downside is that you need to assume the results files for CAWPEFROMFILE will all be in the same directory, and that will have to be where CAWPEFROMFILE writes to (if you use setClassifier to run this). I have no problem with that.

    opened by TonyBagnall 10
  • Standard tie-breaking practice

    Based on a small discussion we had in the office, do we have a standard practice for tie-breaking? Do all classifiers use the same method and what method should be used? i.e. first item, random selection, weighted random selection using class distribution. Mostly in the context of class selection for classification but could be extended to other situations with ties. Mainly for discussion, if this is a non-issue feel free to close.
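For discussion purposes, one of the named options, breaking ties uniformly at random among the argmax classes, can be sketched as follows (hypothetical helper, not tsml's actual policy):

```java
import java.util.*;

public class TieBreak {
    // Index of the highest probability; ties broken uniformly at random.
    public static int argMaxRandomTie(double[] dist, Random rng) {
        List<Integer> best = new ArrayList<>();
        double max = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < dist.length; i++) {
            if (dist[i] > max) {
                max = dist[i];
                best.clear();
                best.add(i);
            } else if (dist[i] == max) {
                best.add(i);
            }
        }
        return best.get(rng.nextInt(best.size()));
    }
}
```

Seeding the Random makes the choice reproducible across runs, which matters for comparable experiments.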

    good first issue 
    opened by MatthewMiddlehurst 10
  • Branch protection

    We need some form of branch protection for the dev branch. @MatthewMiddlehurst accidentally deleted the dev branch earlier this week and managed to recover it (phew!). To prevent this from happening again, we need some kind of protection over the important branches. Master is already protected.

    On github, to protect a branch you have to make a branch protection rule. I made a rule for master which is why you can't push directly to it. We have two options for dev:

    1. make an empty rule which stops accidental deletion / renaming / etc
    2. protect dev in the same way as master, preventing pushing unless via pull request. This is a popular option to maintain a clean dev version, but comes at the cost / pain of having to do pull requests every time (although I think it's a good idea)

    We at the very least need option 1, would like some feedback on option 2 :)

    meta 
    opened by goastler 9
  • Difference in Algorithms Results on TSC website

    While using /java/tsml/examples/ClassificationExamples.java class to run all the classifiers, I found my results close but not the same as mentioned on the website.

    One significant discovery was that Published Results and Recreated Results had a very large difference on the website. For example,

    Published Result for Fast Shapelets on GunPoint = 0.061
    Recreated Result for Fast Shapelets on GunPoint = 0.9296

    Published Result for Shapelet Transform on GunPoint = 0.013
    Recreated Result for Shapelet Transform on GunPoint = 0.9987

    I am worried that I am misinterpreting the Published Results data on the website as accuracy. I went on studying other algorithms and found that Published results are 0.0XX for other algorithms too.

    Another discovery I made was the difference in Recreated Results and the Results I calculated. For example,

    Recreated Result for COTE on GunPoint = 0.9919
    Reproduced Result for COTE on GunPoint (by me) = 0.98

    Recreated Result for LPS on GunPoint = 0.9719
    Reproduced Result for LPS on GunPoint (by me) = 1.0

    I am unable to understand why I am not being able to get the results mentioned on the website. I am using MacBook Air with 8GB RAM & 1.8GHz Dual-core i5 processor.

    opened by ThisIsNSH 8
  • HiveCote.main() results in java.lang.UnsupportedOperationException

    When I start the main function for the HIVE-COTE classifier (Master c3e7215cd58d96d5fdb7c36e7a719b76f3bbb9f6) it results in the following exception:

        training (group a): TSF
        java.lang.UnsupportedOperationException: getTrainAcc not implemented in class timeseriesweka.classifiers.interval_based.TSF
            at timeseriesweka.classifiers.TrainAccuracyEstimator.getTrainAcc(TrainAccuracyEstimator.java:72)
            at timeseriesweka.classifiers.hybrids.HiveCote.buildClassifier(HiveCote.java:193)
            at at.outfisltisl.classification.AbstractClassifierWrapper.performStep(AbstractClassifierWrapper.java:69)
            at at.outfisltisl.pipeline.Pipeline.runPipeline(Pipeline.java:114)
            at at.outfisltisl.app.RunClassificationComparison.runSingleClassifierExperiment(RunClassificationComparison.java:57)
            at at.outfisltisl.app.RunClassificationComparison.runExperimentsForDataset(RunClassificationComparison.java:40)
            at at.outfisltisl.app.OutfisltislApp.main(OutfisltislApp.java:341)

    The problem here is that the function getTrainAcc() is not implemented for timeseriesweka.classifiers.interval_based.TSF. By the way: I like the usage of default methods within the interface timeseriesweka.classifiers.TrainAccuracyEstimator ;-)

    opened by davcem 8
  • timeseriesweka.filters.shapelet_transforms

    yikes! I will have a little tidy up (GraceShapeletTransform) but I think this needs a bottom-up reconstruction. Perhaps do the python version first. I think James has a start on the slimmed-down version?

    opened by TonyBagnall 8
  • ShapeletTransformClassifier train fold bug

    ShapeletTransformClassifiers train files generated by Experiments currently only have usable results for the first cross validation fold. Subsequent folds only output probabilities of 0 for all classes. As James noted this is most likely due to buildClassifier being called multiple times during the Experiments CV, with calls past the first retaining information and somehow breaking.

    STC should either be changed to fully reset upon subsequent calls of buildClassifier, or be made a TrainAccuracyEstimator to create its own results (or both).
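The first suggested fix, a full reset at the top of buildClassifier, follows this general pattern (a sketch with hypothetical names, not the actual STC code):

```java
import java.util.ArrayList;
import java.util.List;

public class ResettableClassifier {
    private final List<Double> trainEstimates = new ArrayList<>();

    // Clearing all accumulated state first makes repeated calls independent,
    // so a second build cannot retain information from the first.
    public void buildClassifier(double[][] data) {
        trainEstimates.clear();
        for (double[] inst : data) {
            trainEstimates.add(inst[0]); // stand-in for the real training work
        }
    }

    public int numEstimates() {
        return trainEstimates.size();
    }
}
```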

    bug 
    opened by MatthewMiddlehurst 7
  • Question on UCR Datasets

    Hi, I am trying to follow some Shapelets work in time series classification, such as

    I notice that following datasets are widely used in these work:

    • DP_Little
    • DP_Middle
    • DP_Thumb
    • MP_Little
    • MP_Middle
    • PP_Little
    • PP_Middle
    • PP_Thumb

    However, I did not find them on http://www.timeseriesclassification.com/ (although most of the authors state that the datasets came from the UCR archive). I would like to know whether these datasets have been renamed, or where I could get them.

    Can anyone help me? Thank you for your assistance.

    opened by Wwwwei 1
  • How to recreate 97,03% from ItalyPowerDemand, COTE, plus exception.

    For jar - http://timeseriesclassification.com/Downloads/tsml11_3_2020.jar

    Running:

        java -jar tsml11_3_2020.jar -dp=C:\Recreate\Univariate_arff\ -rp=C:\Recreate\Temp\ -gtf=true -cn=FlatCote -dn=ItalyPowerDemand -f=100 --force=true -tb=true

    Getting:

        Raw args: -dp=C:\Recreate\Univariate_arff -rp=C:\Recreate\Temp -gtf=true -cn=FlatCote -dn=ItalyPowerDemand -f=100 --force=true -tb=true

        Exception in thread "main" java.lang.NoSuchMethodError: weka.classifiers.functions.SMO.setBuildLogisticModels(Z)V
            at machine_learning.classifiers.ensembles.CAWPE.setupDefaultEnsembleSettings(CAWPE.java:138)
            at machine_learning.classifiers.ensembles.AbstractEnsemble.<init>(AbstractEnsemble.java:148)
            at machine_learning.classifiers.ensembles.CAWPE.<init>(CAWPE.java:111)
            at tsml.classifiers.legacy.COTE.FlatCote.buildClassifier(FlatCote.java:102)
            at evaluation.evaluators.SingleTestSetEvaluator.evaluate(SingleTestSetEvaluator.java:97)
            at evaluation.evaluators.CrossValidationEvaluator.lambda$crossValidateWithStats$0(CrossValidationEvaluator.java:145)
            at evaluation.evaluators.CrossValidationEvaluator.crossValidateWithStats(CrossValidationEvaluator.java:154)
            at evaluation.evaluators.CrossValidationEvaluator.crossValidateWithStats(CrossValidationEvaluator.java:87)
            at experiments.Experiments.findExternalTrainEstimate(Experiments.java:564)
            at experiments.Experiments.runExperiment(Experiments.java:356)
            at experiments.Experiments.setupAndRunExperiment(Experiments.java:293)
            at experiments.Experiments.main(Experiments.java:143)

    The idea is: I want to recreate the results at http://timeseriesclassification.com/description.php?Dataset=ItalyPowerDemand, meaning 97.03% with COTE.

    1. From http://timeseriesclassification.com/Resamples.csv (from http://timeseriesclassification.com/results.php, Legacy results) I think COTE is FlatCote. Right?
    2. How does the program get 100 folds from the 67 entries in the ItalyPowerDemand train set? I see I can run the jar with "-f=100", as when running -cn=HiveCote, but how are the folds created? I ask because in http://timeseriesclassification.com/AllSplits.zip there are 100 columns of results for ItalyPowerDemand for FlatCote. Or was the program run 100 times on the whole training set of 67 entries without any folds? How was row 38 of Flat-COTE.csv, i.e. ItalyPowerDemand | 0.961127308 | 0.939747328 | ... and so on, created?

       Based on the statistics of row 38 (ItalyPowerDemand), sum | min | max = 97.034985425 | 0.93877551 | 0.980563654. The 97.03% was derived from here, but what are those 0.93877551, 0.980563654, ...?

    3. How do I deal with this exception? Or maybe I should have run the jar in some other way?
    opened by Blackisher 0
  • Datasets on the UCR Archive

    Hi, I am a student working on time series classification. I downloaded the UCR Archive from http://www.timeseriesclassification.com/. I read the paper "The UCR Time Series Archive" and understand that the datasets in the UCR Archive have undergone some preprocessing. But I do not know whether the time series in these datasets were obtained by regular (equidistant) sampling. Could you please give me some help? Thanks a lot.

    opened by peter943 0
  • Feature requests for HIVE-COTE

    Change the HC modules to AbstractClassifiers, so that their capabilities can be checked in HC and matched to the minimum of the components, especially with regard to multivariate/relational data.

    opened by TonyBagnall 2