Now-redundant weka mirror. Visit https://github.com/Waikato/weka-trunk for the real deal.

Overview

weka (mirror)

Computing and Mathematical Sciences at the University of Waikato now has an official GitHub organization, including a read-only git mirror of Weka's subversion repository. Therefore, this repo is no longer necessary and will one day be removed. In the meantime, please follow the Waikato repo for the most up-to-date and official changes. Additionally, Waikato also maintains a curated list of repositories you may find interesting. Enjoy.

The official git mirror: https://github.com/Waikato/weka-trunk

(Official README) WEKA (developer version)

Read-only git mirror of Weka's subversion repository.

Source code

The official WEKA source code of the developer version is available from this URL:

https://svn.cms.waikato.ac.nz/svn/weka/trunk/

Contributions/Bug fixes

Contributions and bug fixes can be submitted as patch files posted to the WEKA mailing list.


A few notes from the unofficial mirror

NOTE The owner of this repository has no affiliation with the official WEKA project. This repo is periodically updated as a kindness to others who have shown interest in it. It can take several hours to check out the full official WEKA subversion repository and several minutes just to update it with any new commits. Therefore, this repo exists to provide an easy way to access and peruse the WEKA source using git, nothing more.


This is a git mirror of The University of Waikato machine learning project WEKA.

Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.

The official WEKA source code is hosted using subversion at the Waikato SVN server.

The current version of WEKA is licensed under the GNU General Public License version 3.0.

Use the weka-trunk branch to follow the official code base under active development at the University of Waikato.

Comments
  • Using EM clustering with weka in my JAVA code?

    I applied EM clustering in Weka to cluster some points (x, y, z). This is the EM setup in my Java code:

    EM em = new EM();
    em.setDebug(false);
    em.setDisplayModelInOldFormat(false);
    em.setMaxIterations(100);
    em.setMinStdDev(0.000001);
    em.buildClusterer(data_to_use);

    The last line (building the clusterer) throws an error, which may be because only one cluster is found. How can I fix this error?

    opened by Alexcsu 1
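As a rough illustration (not Weka's EM implementation), the sketch below shows the E-step "responsibility" computation at the heart of EM for a hypothetical two-component 1-D Gaussian mixture. When one component absorbs nearly every point, the other component's statistics degenerate, which is the kind of single-cluster failure the question describes.

```java
public class EMStepSketch {
    // Density of a 1-D Gaussian with the given mean and standard deviation.
    static double gaussian(double x, double mean, double std) {
        double z = (x - mean) / std;
        return Math.exp(-0.5 * z * z) / (std * Math.sqrt(2 * Math.PI));
    }

    // E-step: p(component 0 | x) for a two-component mixture with equal weights.
    static double responsibility(double x, double m0, double s0, double m1, double s1) {
        double p0 = gaussian(x, m0, s0);
        double p1 = gaussian(x, m1, s1);
        return p0 / (p0 + p1);
    }

    public static void main(String[] args) {
        // A point halfway between two identical components is shared 50/50.
        System.out.println(responsibility(0.5, 0.0, 1.0, 1.0, 1.0)); // 0.5
        // A point far from one component is claimed almost entirely by the other.
        System.out.println(responsibility(-3.0, 0.0, 1.0, 5.0, 1.0)); // ~1.0
    }
}
```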
  • WEKA HierarchicalClusterer class always return 2 clusters

    Here is my code:

    import weka.clusterers.ClusterEvaluation;
    import weka.clusterers.HierarchicalClusterer;
    import weka.clusterers.EM;
    import weka.core.converters.CSVLoader;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.core.neighboursearch.PerformanceStats;
    
    import java.io.File;
    import java.io.IOException;
    import java.text.ParseException;
    import java.util.ArrayList;
    import java.util.Enumeration;
    
    import weka.core.*;
    
    public class WEKASample1 {
    
    public static void main(String[] args) {
    
        Instances data = null;
        CSVLoader csvLoader = new CSVLoader();
        try {
            csvLoader.setSource(new File("D:\\WEKA\\numbers.csv"));
    
            data = csvLoader.getDataSet();
                    HierarchicalClusterer h = new HierarchicalClusterer();
    
                DistanceFunction d = new DistanceFunction() {
    
            @Override
            public void setOptions(String[] arg0) throws Exception {
    
            }
    
            @Override
            public Enumeration listOptions() {
                return null;
            }
    
            @Override
            public String[] getOptions() {
                return null;
            }
    
            @Override
            public void update(Instance arg0) {
    
            }
    
            @Override
            public void setInvertSelection(boolean arg0) {
    
            }
    
            @Override
            public void setInstances(Instances arg0) {
    
            }
    
            @Override
            public void setAttributeIndices(String arg0) {
    
            }
    
            @Override
            public void postProcessDistances(double[] arg0) {
    
            }
    
            @Override
            public boolean getInvertSelection() {
                return false;
            }
    
            @Override
            public Instances getInstances() {
                return null;
            }
    
            @Override
            public String getAttributeIndices() {
                return null;
            }
    
            @Override
            public double distance(Instance arg0, Instance arg1, double arg2,
                    PerformanceStats arg3) {
                return 0;
            }
    
            @Override
            public double distance(Instance arg0, Instance arg1, double arg2) {
                return 0;
            }
    
            @Override
            public double distance(Instance arg0, Instance arg1, PerformanceStats arg2)
                    throws Exception {
                return 0;
            }
    
            @Override
            public double distance(Instance arg0, Instance arg1) {
    
                double s1 = arg0.value(0);
                double s2 = arg1.value(0);
    
                return Double.POSITIVE_INFINITY;
            }
        };
    
        h.setDistanceFunction(d);
        SelectedTag s = new SelectedTag(1, HierarchicalClusterer.TAGS_LINK_TYPE);
        h.setLinkType(s);
    
        h.buildClusterer(data);
    
    
    //      double[] arr;
    //      for (int i = 0; i < data.size(); i++) {
    //          arr = h.distributionForInstance(data.get(i));
    //          for (int j = 0; j < arr.length; j++)
    //              System.out.print(arr[j] + ",");
    //          System.out.println();
    //      }
    
            System.out.println(h.numberOfClusters());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }  // end of main
    }  // end of class WEKASample1

    Now, the output for the number of clusters generated is always 2, even if I modify the distance function. How do I know which instance is in which cluster? When I uncomment the code above that is written to get the distribution for the instances, I get an ArrayIndexOutOfBoundsException.

    But in general, can anyone explain how WEKA performs the hierarchical clustering here?

    opened by GloryHank 1
  • Create MyClusterer

    I would suggest that yes, you should just recommend other songs drawn from the same cluster as the current song. From the way you've phrased your question, it appears you weren't aware of this, but Weka exposes its own API, containing all of the same classes available internally within the GUI. For classes related to clustering, I'd suggest you take a look at EM, XMeans, and Cobweb, although there are other clustering algorithms you could use as well. The clustering classes all share a fairly consistent design: there is usually a buildClusterer() method that builds the clusterer, and a clusterInstance() method that retrieves the cluster ID for a given song in the database. I actually built a small Java-based clustering demo project a few months ago in an attempt to improve my skills in both Java and Weka at the same time. Feel free to take a look at the source code if you feel it will help.

    opened by GloryHank 1
  • Does it have to be K-means?

    Does it have to be K-means? Another possible approach is to transform your data into a network first, then apply graph clustering. I am the author of MCL, an algorithm used quite often in bioinformatics. The implementation linked to should easily scale up to networks with millions of nodes - your example would have 300K nodes, assuming that you have 100K attributes. With this approach, the data will be naturally pruned in the data transformation step - and that step will quite likely become the bottleneck. How do you compute the distance between two vectors? In the applications that I have dealt with I used the Pearson or Spearman correlation, and MCL is shipped with software to efficiently perform this computation on large scale data (it can utilise multiple CPUs and multiple machines).

    There is still an issue with the data size, as most clustering algorithms will require you to at least perform all pairwise comparisons at least once. Is your data really stored as a giant matrix? Do you have many zeros in the input? Alternatively, do you have a way of discarding smaller elements? Do you have access to more than one machine in order to distribute these computations?

    opened by GloryHank 1
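The answer above suggests computing vector similarity with the Pearson correlation before building the network. As a minimal plain-Java sketch (not MCL's optimized, parallel implementation):

```java
public class Pearson {
    // Pearson correlation of two equal-length vectors:
    // covariance divided by the product of the standard deviations.
    static double correlation(double[] x, double[] y) {
        int n = x.length;
        double mx = 0, my = 0;
        for (int i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
        mx /= n; my /= n;
        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - mx) * (y[i] - my);
            sxx += (x[i] - mx) * (x[i] - mx);
            syy += (y[i] - my) * (y[i] - my);
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    public static void main(String[] args) {
        System.out.println(correlation(new double[]{1, 2, 3}, new double[]{2, 4, 6})); // 1.0
        System.out.println(correlation(new double[]{1, 2, 3}, new double[]{6, 4, 2})); // -1.0
    }
}
```

For 100K-attribute vectors this pairwise step is O(n²·d), which is exactly why the answer warns it tends to become the bottleneck.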
  • MOA's StreamKM clustering doesn't return any result

    I'm currently trying to cluster a great amount of data points into a given amount of clusters and I wanted to try MOA's streaming based k-means StreamKM. A very simple example of what I'm trying to do using random data looks as follows:

    StreamKM streamKM = new StreamKM();
    streamKM.numClustersOption.setValue(5); // default setting
    streamKM.widthOption.setValue(100000); // default setting
    streamKM.prepareForUse();
    for (int i = 0; i < 150000; i++) {
        streamKM.trainOnInstanceImpl(randomInstance(2));
    }
    Clustering result = streamKM.getClusteringResult();
    System.out.println("size = " + result.size());
    System.out.println("dimension = " + result.dimension());
    

    The random instances are created as follows:

    static DenseInstance randomInstance(int size) {
        DenseInstance instance = new DenseInstance(size);
        for (int idx = 0; idx < size; idx++) {
        instance.setValue(idx, Math.random());
        }
        return instance;
    }
    

    However, when running the given code, no clusters seem to be created:

    System.out.println("size = " + result.size()); // size = 0
    System.out.println("dimension = " + result.dimension()); // NPE
    

    Is there anything else I need to take care of, or do I have a fundamental misunderstanding of the MOA clustering concepts?

    opened by lalalaqixiaofei 0
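Independently of MOA's internals, the streaming idea itself can be sketched in plain Java: each arriving point moves its nearest centroid by a shrinking step. This is sequential (online) k-means, not StreamKM's coreset algorithm, and is shown only to illustrate the streaming update rule:

```java
import java.util.Random;

public class OnlineKMeansSketch {
    final double[] centroids;
    final int[] counts;

    OnlineKMeansSketch(double[] initialCentroids) {
        centroids = initialCentroids.clone();
        counts = new int[initialCentroids.length];
    }

    // Sequential k-means: move the nearest centroid toward each new point
    // by 1/n of the gap, where n is how many points it has absorbed.
    void update(double x) {
        int best = 0;
        for (int j = 1; j < centroids.length; j++)
            if (Math.abs(x - centroids[j]) < Math.abs(x - centroids[best])) best = j;
        counts[best]++;
        centroids[best] += (x - centroids[best]) / counts[best];
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        OnlineKMeansSketch km = new OnlineKMeansSketch(new double[]{0.0, 10.0});
        for (int i = 0; i < 10000; i++) {
            // Two well-separated 1-D clusters around 1 and 9.
            km.update((i % 2 == 0 ? 1.0 : 9.0) + rnd.nextGaussian() * 0.5);
        }
        // Both centroids end up near 1.0 and 9.0.
        System.out.println(km.centroids[0] + " " + km.centroids[1]);
    }
}
```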
  • Clustering many sentence using weka lib in java

    I have 5 text files, which I merged into one file containing about 60 sentences. I want to cluster that file into 5 clusters using Weka.

    public static void doClustering(String pathSentences, int numberCluster) throws IOException {

        Helper.deleteAllFileInFolder("results");

        // number of clusters = number of sentences in the file / average sentences per file
        HashMap<Integer, String> sentences = new HashMap<>();
        HashMap<Integer, Integer> clustering = new HashMap<>();
        try {
            StringToWordVector filter = new StringToWordVector();
            SimpleKMeans kmeans = new SimpleKMeans();
            FastVector atts = new FastVector(5);
            atts.addElement(new Attribute("text", (FastVector) null));
            Instances docs = new Instances("text_files", atts, 0);
            Scanner sc = new Scanner(new File(pathSentences));
            int count = 0;
            while (sc.hasNextLine()) {
                String content = sc.nextLine();
                double[] newInst = new double[1];
                newInst[0] = (double) docs.attribute(0).addStringValue(content);
                docs.add(new SparseInstance(1.0, newInst));
                sentences.put(sentences.size(), content);
                clustering.put(clustering.size(), -1);
            }
            NGramTokenizer tokenizer = new NGramTokenizer();
            tokenizer.setNGramMinSize(10);
            tokenizer.setNGramMaxSize(10);
            tokenizer.setDelimiters("\\W");
            filter.setTokenizer(tokenizer);
            filter.setInputFormat(docs);
            filter.setLowerCaseTokens(true);
            filter.setWordsToKeep(1);
            Instances filteredData = Filter.useFilter(docs, filter);
            kmeans.setPreserveInstancesOrder(true);
            kmeans.setNumClusters(numberCluster);
            kmeans.buildClusterer(filteredData);
            int[] assignments = kmeans.getAssignments();

            int i = 0;
            for (int clusterNum : assignments) {
                clustering.put(i, clusterNum);
                i++;
            }
            PrintWriter[] pw = new PrintWriter[numberCluster];
            for (int j = 0; j < numberCluster; j++) {
                pw[j] = new PrintWriter(new File("results/result" + j + ".txt"));
            }
            sentences.entrySet().stream().forEach((entry) -> {
                Integer key = entry.getKey();
                String value = entry.getValue();
                Integer cluster = clustering.get(key);
                pw[cluster].println(value);
            });
            for (int j = 0; j < numberCluster; j++) {
                pw[j].close();
            }
        } catch (Exception e) {
            System.out.println("Error K means " + e);
        }
    }
    

    When I change the order of the input file, the clustering results also vary. Can you help me fix this? Thank you so much.

    opened by lalalaqixiaofei 0
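The variation described above is typical of k-means: the result depends on the random initialization, which in turn can depend on instance order. In Weka's SimpleKMeans one would typically fix the random seed via setSeed(); the plain-Java sketch below (hypothetical, not Weka code) illustrates the underlying remedy by making initialization order-independent: sort the data and seed the centroids from quantiles, so identical point sets always cluster identically.

```java
import java.util.Arrays;

public class DeterministicKMeans {
    // 1-D Lloyd's k-means made order-independent: canonicalize the input by
    // sorting, then seed centroids from quantiles instead of random instances.
    static double[] cluster(double[] data, int k, int iters) {
        double[] x = data.clone();
        Arrays.sort(x);                                   // canonical order
        double[] c = new double[k];
        for (int j = 0; j < k; j++)                       // quantile seeding
            c[j] = x[(int) ((j + 0.5) * x.length / k)];
        for (int it = 0; it < iters; it++) {
            double[] sum = new double[k];
            int[] n = new int[k];
            for (double v : x) {                          // assignment step
                int best = 0;
                for (int j = 1; j < k; j++)
                    if (Math.abs(v - c[j]) < Math.abs(v - c[best])) best = j;
                sum[best] += v;
                n[best]++;
            }
            for (int j = 0; j < k; j++)                   // update step
                if (n[j] > 0) c[j] = sum[j] / n[j];
        }
        return c;
    }

    public static void main(String[] args) {
        double[] a = {9, 1, 10, 2, 1.5, 9.5};
        double[] b = {1, 1.5, 2, 9, 9.5, 10};  // same points, different order
        System.out.println(Arrays.toString(cluster(a, 2, 10))); // [1.5, 9.5]
        System.out.println(Arrays.toString(cluster(b, 2, 10))); // [1.5, 9.5]
    }
}
```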
  • Is it meaningful to get the centroid of clusters generated by CobWeb clustering algorithm?

    I am using the Cobweb clusterer provided by Weka, but there is no function that provides the centroids of the clusters. I wonder if it is meaningful to get the centroids? If not, which points should I pick to represent each cluster?

    opened by lalalaqixiaofei 0
  • Finding probability of belonging to its subcluster and its class

    I have 3000 samples and 10 classes, and I want to cluster them, because I need the sub-clusters of each class. I will use the probability of belonging to a sub-cluster and the probability of belonging to a class. My teacher told me I should use hierarchical clustering to find the number of clusters, but I cannot get good results: when I increase the number of clusters, the samples collect in only a few clusters. I then tried the elbow method with K-Means, looking at the within-cluster sum of squared errors.

    opened by lalalaqixiaofei 0
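The elbow method mentioned above plots the within-cluster sum of squared errors (WCSS) against the number of clusters and picks the bend. Computing WCSS itself is simple; a minimal 1-D sketch in plain Java (illustrative, not Weka's implementation):

```java
public class Wcss {
    // Within-cluster sum of squared errors for 1-D points,
    // given each point's cluster assignment and the cluster centroids.
    static double wcss(double[] x, int[] assign, double[] centroids) {
        double total = 0;
        for (int i = 0; i < x.length; i++) {
            double d = x[i] - centroids[assign[i]];
            total += d * d;
        }
        return total;
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 9, 10};
        int[] assign = {0, 0, 1, 1};
        double[] c = {1.5, 9.5};
        System.out.println(wcss(x, assign, c)); // 1.0
    }
}
```

For the elbow method, one would compute this for k = 1, 2, 3, ... and look for the k after which the decrease flattens out.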
  • Unable to use XMeans in Weka 3-7-5 After Installing Via Package Manager

    I am trying to use a clustering algorithm that lets me choose the initial seeds, so I decided to try Weka's XMeans through the Weka GUI. However, when I install XMeans using Weka's package manager, it remains greyed out in the GUI, and I am unable to start clustering even after loading one of Weka's provided test .arff files. Can anyone point me in the right direction or suggest another program or Java library to accomplish such a task?

    opened by lalalaqixiaofei 0
  • Distance/proximity matrix in hierarchical clustering

    I'm new to Weka and I am trying to do hierarchical clustering. I have a symmetric distance/proximity matrix like this:

        a      b      c      d
    a   0      0.1    0.3    0.2
    b   0.1    0      0.7    0.4
    c   0.3    0.7    0      0.9
    d   0.2    0.4    0.9    0
    

    I want to do hierarchical agglomerative clustering with these instances (a, b, c, d, ...). I installed Weka 3.6.11, but I couldn't find any way to pass this distance/proximity matrix in the Cluster tab. Can anyone help me? Is there any easy way in other environments for this purpose? Thanks in advance.

    opened by lalalaqixiaofei 0
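Weka's HierarchicalClusterer is built around instances plus a DistanceFunction rather than a precomputed matrix, which is why the Cluster tab offers no obvious way to feed the matrix in. For illustration only, single-linkage agglomerative clustering over the matrix above can be done directly in a few lines of plain Java:

```java
import java.util.ArrayList;
import java.util.List;

public class SingleLinkage {
    // Agglomerative clustering over an explicit distance matrix: repeatedly
    // merge the two clusters whose closest members are nearest (single linkage),
    // logging each merge as "left+right@distance".
    static List<String> merges(String[] names, double[][] d) {
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < names.length; i++)
            clusters.add(new ArrayList<>(List.of(i)));
        List<String> log = new ArrayList<>();
        while (clusters.size() > 1) {
            int bi = 0, bj = 1;
            double best = Double.POSITIVE_INFINITY;
            for (int i = 0; i < clusters.size(); i++)
                for (int j = i + 1; j < clusters.size(); j++) {
                    double min = Double.POSITIVE_INFINITY;  // closest cross-pair
                    for (int p : clusters.get(i))
                        for (int q : clusters.get(j))
                            min = Math.min(min, d[p][q]);
                    if (min < best) { best = min; bi = i; bj = j; }
                }
            log.add(label(clusters.get(bi), names) + "+"
                    + label(clusters.get(bj), names) + "@" + best);
            clusters.get(bi).addAll(clusters.remove(bj));
        }
        return log;
    }

    static String label(List<Integer> c, String[] names) {
        StringBuilder sb = new StringBuilder();
        for (int i : c) sb.append(names[i]);
        return sb.toString();
    }

    public static void main(String[] args) {
        double[][] d = {
            {0,   0.1, 0.3, 0.2},
            {0.1, 0,   0.7, 0.4},
            {0.3, 0.7, 0,   0.9},
            {0.2, 0.4, 0.9, 0}
        };
        for (String m : merges(new String[]{"a", "b", "c", "d"}, d))
            System.out.println(m); // a+b@0.1, ab+d@0.2, abd+c@0.3
    }
}
```

On the matrix from the question, a and b merge first at 0.1, then d joins them at 0.2, and c joins last at 0.3, giving the full dendrogram.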