Overview

Mallet

Website: https://mimno.github.io/Mallet/

MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.

MALLET includes sophisticated tools for document classification: efficient routines for converting text to "features", a wide variety of algorithms (including Naïve Bayes, Maximum Entropy, and Decision Trees), and code for evaluating classifier performance using several commonly used metrics.
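
As a rough sketch of how this looks from the Java API (the MaxEnt choice, the 90/10 split, and the "positive" label below are illustrative assumptions, not MALLET defaults):

    import cc.mallet.classify.Classifier;
    import cc.mallet.classify.MaxEntTrainer;
    import cc.mallet.classify.Trial;
    import cc.mallet.types.InstanceList;
    import cc.mallet.util.Randoms;

    public class ClassifyExample {
        // 'instances' is assumed to hold labeled documents, e.g. built with the
        // pipe example shown later in this overview.
        public static void trainAndEvaluate(InstanceList instances) {
            // Hold out 10% of the data for evaluation (proportions are illustrative).
            InstanceList[] split = instances.split(new Randoms(), new double[] {0.9, 0.1, 0.0});

            // Train a Maximum Entropy (multinomial logistic regression) classifier.
            Classifier classifier = new MaxEntTrainer().train(split[0]);

            // Trial pairs a classifier with a held-out set and exposes common metrics.
            Trial trial = new Trial(classifier, split[1]);
            System.out.println("Accuracy: " + trial.getAccuracy());
            System.out.println("F1 for label \"positive\": " + trial.getF1("positive"));
        }
    }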

In addition to classification, MALLET includes tools for sequence tagging for applications such as named-entity extraction from text. Algorithms include Hidden Markov Models, Maximum Entropy Markov Models, and Conditional Random Fields. These methods are implemented in an extensible system for finite state transducers.
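
A minimal sketch of training a linear-chain CRF with this API, assuming trainingData is an InstanceList of token sequences paired with label sequences (the prior variance and iteration count are illustrative):

    import cc.mallet.fst.CRF;
    import cc.mallet.fst.CRFTrainerByLabelLikelihood;
    import cc.mallet.types.InstanceList;

    public class TaggerExample {
        public static CRF trainCrf(InstanceList trainingData) {
            // Build a linear-chain CRF whose states and transitions mirror the
            // label sequences observed in the training data.
            CRF crf = new CRF(trainingData.getPipe(), null);
            crf.addStatesForLabelsConnectedAsIn(trainingData);

            // Maximize the conditional label likelihood (L-BFGS under the hood).
            CRFTrainerByLabelLikelihood trainer = new CRFTrainerByLabelLikelihood(crf);
            trainer.setGaussianPriorVariance(10.0);
            trainer.train(trainingData, 500);
            return crf;
        }
    }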

Topic models are useful for analyzing large collections of unlabeled text. The MALLET topic modeling toolkit contains efficient, sampling-based implementations of Latent Dirichlet Allocation, Pachinko Allocation, and Hierarchical LDA.
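
From Java, the parallel LDA implementation can be driven roughly as follows (a sketch modeled on the MALLET developer documentation; the topic count, hyperparameters, and thread count are illustrative):

    import cc.mallet.topics.ParallelTopicModel;
    import cc.mallet.types.InstanceList;

    import java.util.Arrays;

    public class TopicModelExample {
        public static void run(InstanceList instances) throws Exception {
            // 50 topics, alphaSum = 1.0, beta = 0.01 (illustrative hyperparameters).
            ParallelTopicModel model = new ParallelTopicModel(50, 1.0, 0.01);
            model.addInstances(instances);
            model.setNumThreads(2);       // sample with two worker threads
            model.setNumIterations(1000);
            model.estimate();             // run collapsed Gibbs sampling

            // Print the ten highest-probability words for each topic.
            Object[][] topWords = model.getTopWords(10);
            for (int topic = 0; topic < topWords.length; topic++) {
                System.out.println("Topic " + topic + ": " + Arrays.toString(topWords[topic]));
            }
        }
    }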

Many of the algorithms in MALLET depend on numerical optimization. MALLET includes an efficient implementation of Limited Memory BFGS, among many other optimization methods.
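
The optimizers operate on objects implementing the Optimizable interfaces and maximize the supplied value. A minimal sketch with a toy quadratic objective (the objective itself is purely illustrative):

    import cc.mallet.optimize.LimitedMemoryBFGS;
    import cc.mallet.optimize.Optimizable;

    public class OptimizerExample implements Optimizable.ByGradientValue {
        // Maximize f(x, y) = -(x - 1)^2 - (y - 2)^2, whose optimum is at (1, 2).
        private final double[] params = new double[] {0.0, 0.0};

        public double getValue() {
            double dx = params[0] - 1.0, dy = params[1] - 2.0;
            return -(dx * dx) - (dy * dy);
        }
        public void getValueGradient(double[] gradient) {
            gradient[0] = -2.0 * (params[0] - 1.0);
            gradient[1] = -2.0 * (params[1] - 2.0);
        }
        public int getNumParameters() { return params.length; }
        public double getParameter(int i) { return params[i]; }
        public void getParameters(double[] buffer) { System.arraycopy(params, 0, buffer, 0, params.length); }
        public void setParameter(int i, double value) { params[i] = value; }
        public void setParameters(double[] newParams) { System.arraycopy(newParams, 0, params, 0, params.length); }

        public static void main(String[] args) {
            OptimizerExample objective = new OptimizerExample();
            LimitedMemoryBFGS optimizer = new LimitedMemoryBFGS(objective);
            optimizer.optimize();   // runs L-BFGS until convergence
            System.out.println("x = " + objective.params[0] + ", y = " + objective.params[1]);
        }
    }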

In addition to sophisticated Machine Learning applications, MALLET includes routines for transforming text documents into numerical representations that can then be processed efficiently. This process is implemented through a flexible system of "pipes", which handle distinct tasks such as tokenizing strings, removing stopwords, and converting sequences into count vectors.
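
A pipe sequence for importing one-document-per-line text might be assembled roughly like this (a sketch following the import example in the MALLET developer documentation; the stoplist path and the line-format regex are assumptions about the input):

    import cc.mallet.pipe.*;
    import cc.mallet.pipe.iterator.CsvIterator;
    import cc.mallet.types.InstanceList;

    import java.io.*;
    import java.util.ArrayList;
    import java.util.regex.Pattern;

    public class ImportExample {
        public static InstanceList importFile(String path) throws IOException {
            ArrayList<Pipe> pipes = new ArrayList<Pipe>();
            pipes.add(new CharSequenceLowercase());                        // lowercase the raw text
            pipes.add(new CharSequence2TokenSequence(
                    Pattern.compile("\\p{L}[\\p{L}\\p{P}]+\\p{L}")));      // tokenize
            pipes.add(new TokenSequenceRemoveStopwords(
                    new File("stoplists/en.txt"), "UTF-8", false, false, false)); // drop stopwords
            pipes.add(new TokenSequence2FeatureSequence());                // tokens -> feature indices

            InstanceList instances = new InstanceList(new SerialPipes(pipes));

            // Each input line is expected to look like "name label text...".
            Reader reader = new InputStreamReader(new FileInputStream(path), "UTF-8");
            instances.addThruPipe(new CsvIterator(reader,
                    Pattern.compile("^(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$"),
                    3, 2, 1));   // group 3 = data, group 2 = label, group 1 = name
            return instances;
        }
    }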

An add-on package to MALLET, called GRMM, contains support for inference in general graphical models, and training of CRFs with arbitrary graphical structure.

Installation

To build a Mallet 2.0 development release, you must have the Apache ant build tool installed. From the command prompt, first change to the mallet directory, and then type ant

If ant finishes with "BUILD SUCCESSFUL", Mallet is now ready to use.

If you would like to deploy Mallet as part of a larger application, it is helpful to create a single ".jar" file that contains all of the compiled code. Once you have compiled the individual Mallet class files, use the command: ant jar

This process will create a file "mallet.jar" in the "dist" directory within Mallet.

Usage

Once you have installed Mallet, you can run it with the following command:

bin/mallet [command] --option value --option value ...

Type bin/mallet to get a list of commands, and use the option --help with any command to get a description of valid options.

For details about the commands, please see the API documentation and website at: https://mimno.github.io/Mallet/

List of Algorithms:

  • Topic Modelling
    • LDA
    • Parallel LDA
    • DMR LDA
    • Hierarchical LDA
    • Labeled LDA
    • Polylingual Topic Model
    • Hierarchical Pachinko Allocation Model (PAM)
    • Weighted Topic Model
    • LDA with integrated phrase discovery
    • Word Embeddings (word2vec) using skip-gram with negative sampling
  • Classification
    • AdaBoost
    • Bagging
    • Winnow
    • C45 Decision Tree
    • Ensemble Trainer
    • Maximum Entropy Classifier (Multinomial Logistic Regression)
    • Naive Bayes
    • Rank Maximum Entropy Classifier
    • Posterior Regularization Auxiliary Model
  • Clustering
    • Greedy Agglomerative
    • Hill Climbing
    • K-Means
    • K-Best
  • Sequence Prediction Models
    • Conditional Random Fields
    • Maximum Entropy Markov Models
    • Hidden Markov Models
    • Semi-Supervised Sequence Prediction Models
  • Linear Regression
Comments
  • Ant build fails: error: unmappable character for encoding ASCII

    Ant build fails: error: unmappable character for encoding ASCII

    Buildfile: /usr/ports/science/mallet/work/Mallet-2.0.8-122-g47d690a/build.xml
    
    init:
        [mkdir] Created dir: /usr/ports/science/mallet/work/Mallet-2.0.8-122-g47d690a/class
         [copy] Copying 1 file to /usr/ports/science/mallet/work/Mallet-2.0.8-122-g47d690a/class
        [mkdir] Created dir: /usr/ports/science/mallet/work/Mallet-2.0.8-122-g47d690a/dist
         [copy] Copying 1 file to /usr/ports/science/mallet/work/Mallet-2.0.8-122-g47d690a/dist
        [mkdir] Created dir: /usr/ports/science/mallet/work/Mallet-2.0.8-122-g47d690a/test
    
    compile:
        [javac] /usr/ports/science/mallet/work/Mallet-2.0.8-122-g47d690a/build.xml:61: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
        [javac] Compiling 617 source files to /usr/ports/science/mallet/work/Mallet-2.0.8-122-g47d690a/class
        [javac] /usr/ports/science/mallet/work/Mallet-2.0.8-122-g47d690a/src/cc/mallet/pipe/tests/TestCharSequenceNoDiacritics.java:25: error: unmappable character for encoding ASCII
        [javac]     assertEquals("medecin", oneWordCleansing("m??decin"));
        [javac]                                                ^
        [javac] /usr/ports/science/mallet/work/Mallet-2.0.8-122-g47d690a/src/cc/mallet/pipe/tests/TestCharSequenceNoDiacritics.java:25: error: unmappable character for encoding ASCII
        [javac]     assertEquals("medecin", oneWordCleansing("m??decin"));
        [javac]                                                 ^
        [javac] /usr/ports/science/mallet/work/Mallet-2.0.8-122-g47d690a/src/cc/mallet/pipe/tests/TestCharSequenceNoDiacritics.java:26: error: unmappable character for encoding ASCII
    
    opened by yurivict 7
  • Fix misleading help text

    Fix misleading help text

    There is some confusion about the hyperparameter optimization of the Dirichlet priors. See https://stackoverflow.com/questions/52099379/mallet-hyperparameter-optimization

    opened by jonaschn 6
  • Add support for numeric features in GRMM

    Add support for numeric features in GRMM

    • Code changes were outlined in this mailing list conversation: http://t47962.ai-mallet-development.aitalk.info/continuous-observations-for-dcrf-t47962.html.
    • Special thanks to Ming Tu.
    opened by nrockweiler 4
  • Topic Modelling for Non-text data

    Topic Modelling for Non-text data

    Hi,

    I am trying to run topic modelling on a user-modelling task. I have a list of user activities and I am treating each activity (an item) as a word. The items are encoded as integers.

    I am trying to run LDA on this data. I tried an approach similar to running topic modelling on a directory with multiple text files, but LDA throws NaN values when I run it.

    Is the tool limited to textual input? Any suggestions would be appreciated.

    opened by shashankg7 3
  • puzzling results with simple test code, replicates on gensim -- possibly because of very short documents?

    puzzling results with simple test code, replicates on gensim -- possibly because of very short documents?

    Hello Mallet Gurus,

    I hesitate to bring up this sort of basic question, but we've been getting some unusual results with mallet, using the "import-file" method, and we're not sure what we're doing wrong. We're using mallet 2.0. Here's what I think is a minimal example.

    First, the input data. That's this "AskP.txt" file here:

    AskP.txt

    We then run

    mallet import-file --input AskP.txt --output AskP.mallet --keep-sequence

    to get the mallet-input style file.

    We then run

    mallet train-topics --random-seed 1 --num-iterations 1000 --input AskP.mallet --num-topics 2 --num-top-words 100 --output-doc-topics doc-topics.txt --output-topic-keys topic-keys.txt --topic-word-weights-file word-topics.txt

    which gives us the following output files:

    doc-topics.txt topic-keys.txt word-topics.txt

    These are pretty strange! In particular, consider the first document. It has only the word "certainly" in it. The model says it is equally weighted on topic 0 and topic 1. However, this doesn't make sense, because topic 1 is more heavily weighted on the word certainly (as can be seen by inspecting the word-topics.txt file).

    Even stranger, documents 7, 8, and 9 are all the same (the word "necessarily", alone). However, the output gives document 8 a different topic decomposition (0.64, 0.36) than it does documents 7 and 9 (which both get 0.5, 0.5).

    Can anyone help us understand what is going on? (I am leaving my collaborator anonymous here, so he doesn't get embarrassed by what is likely a simple error on my side.)

    We've never noticed problems like this before, but perhaps that's because we had been running with documents that were generally much longer?

    Update: my collaborator ran MALLET via the gensim interface; he has a completely different installation. His output is slightly different, despite using the same random seed, but still shows a lot of the same issues:

    MALLET_via_GENSIM

    and meanwhile, LdaModel from GENSIM seems to give more normal answers:

    LdaModel_via_GENSIM

    Can anyone help us make sense of this? We're super-happy to run any tests here on our end as well, and happy to try experiments.

    opened by sdedeo 2
  • mallet bash wrapper script misses an option to set Java heap size like most other scripts.

    mallet bash wrapper script misses an option to set Java heap size like most other scripts.

    Mallet bash wrapper script misses an option to set Java heap size like most other scripts.

    Other scripts have something like:

    # Default Java heap size.  Change with -Xmx800m as the first argument.
    mem=200m
    
    # If first argument is something like -Xmx900m, process appropriately
    arg=`echo "$1" | sed -e 's/-Xmx//'`
    if test $1 != $arg ; then
      mem=$arg
      shift 
    fi
    

    Another way to allow specifying memory would be:

    # Set 1GB of memory if ${MALLET_MEMORY} is not defined
      mem="${MALLET_MEMORY:-1g}"
    
    # Run Mallet with 10GB of heap memory.
    MALLET_MEMORY=10g ./bin/mallet
    
    opened by ghuls 2
  • Not clear if `trainingProportions` may be `null`

    Not clear if `trainingProportions` may be `null`

    The third parameter of CRFTrainerByLabelLikelihood#train(InstanceList, int, double[]) is documented as follows: https://github.com/mimno/Mallet/blob/12487de1aa6433bdcf5af0ee0a17b368e64c7acf/src/cc/mallet/fst/CRFTrainerByLabelLikelihood.java#L182-L184 “If non-null” implies that setting it to null is a valid alternative. However, doing so causes a NullPointerException. Either the method should treat null like new double[] {1.0}, or the documentation should clarify that null is not allowed.
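
    Until the code or the Javadoc is changed, a sketch of the workaround implied above is to pass an explicit single-element array instead of null (names here are illustrative):

    // 'trainer' is a CRFTrainerByLabelLikelihood, 'training' an InstanceList.
    // Passing {1.0} trains on the full data in one round, which is presumably
    // what a caller passing null intends.
    trainer.train(training, 500, new double[] {1.0});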

    opened by dscorbett 2
  • Can only have 2147483647 (max int) tokens in corpus.

    Can only have 2147483647 (max int) tokens in corpus.

    I have a pretty large corpus. After filtering out common and rare tokens I end up with 3270690345 words. (Not unique, just counting the total words in the corpus.)

    If I try to use Mallet for topic modelling I get the following output after which nothing else happens (no CPU usage).

    Mallet LDA: 90 topics, 7 topic bits, 1111111 topic mask
    Data loaded.
    max tokens: 1214
    total tokens: -1024276951
    

    The obvious indication that something is wrong is the total tokens: -1024276951 line. This is an obvious integer overflow, as 3270690345 is too large for the signed int that seems to store the number of tokens.

    I have tested some values below and above 2147483647 (max int): every time I am above that limit it fails, and when I am below it works.

    Mallet LDA: 90 topics, 7 topic bits, 1111111 topic mask
    Data loaded.
    max tokens: 1145
    total tokens: 2147483646
    <10> LL/token: -11,88627
    <20> LL/token: -10,03233
    ...
    

    I am currently trying to use as many documents as possible, but I cannot train on the full corpus.

    This may be a simple fix, changing the datatype of the total-tokens counter to long, but I haven't looked at the internals.
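
    The reported value is consistent with a plain 32-bit overflow; a quick illustrative check (not MALLET code):

    public class OverflowCheck {
        public static void main(String[] args) {
            long actualTokens = 3270690345L;   // token count from this report
            int stored = (int) actualTokens;   // what a signed int counter would hold
            System.out.println(stored);        // prints -1024276951, matching the log above
        }
    }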

    opened by UnfinishedArchitect 2
  • Doesn't PolylingualTopicModel.java include any code for --evaluator-filename?

    Doesn't PolylingualTopicModel.java include any code for --evaluator-filename?

    I passed --evaluator-filename to PolylingualTopicModel, but the evaluator file was not created. I checked PolylingualTopicModel.java and couldn't find any mention of --evaluator-filename.

    I want to use --evaluator-filename with PolylingualTopicModel. How can I use this option with PolylingualTopicModel in Mallet?

    opened by i-kohey 2
  • Mallet command line: how to determine convergence?

    Mallet command line: how to determine convergence?

    We are testing the Mallet command line and found a diagnostics file after the run. How can we check sampling convergence? Where is the report, or how can we check this measure? And could you please explain briefly how to interpret it?
    Many thanks

    opened by findgit123 2
  • Maven build fails: src/cc/mallet/topics/DMRTopicModel.java:[325,33] cannot find symbol method sumTypeTopicCounts(cc.mallet.topics.DMRRunnable[])

    Maven build fails: src/cc/mallet/topics/DMRTopicModel.java:[325,33] cannot find symbol method sumTypeTopicCounts(cc.mallet.topics.DMRRunnable[])

    [INFO] -------------------------------------------------------------
    [WARNING] COMPILATION WARNING : 
    [INFO] -------------------------------------------------------------
    [WARNING] /usr/ports/science/mallet/work/Mallet-2.0.8-122-g47d690a/src/cc/mallet/types/FeatureVector.java: Some input files use or override a deprecated API.
    [WARNING] /usr/ports/science/mallet/work/Mallet-2.0.8-122-g47d690a/src/cc/mallet/types/FeatureVector.java: Recompile with -Xlint:deprecation for details.
    [WARNING] /usr/ports/science/mallet/work/Mallet-2.0.8-122-g47d690a/src/cc/mallet/types/Alphabet.java: Some input files use unchecked or unsafe operations.
    [WARNING] /usr/ports/science/mallet/work/Mallet-2.0.8-122-g47d690a/src/cc/mallet/types/Alphabet.java: Recompile with -Xlint:unchecked for details.
    [INFO] 4 warnings 
    [INFO] -------------------------------------------------------------
    [INFO] -------------------------------------------------------------
    [ERROR] COMPILATION ERROR : 
    [INFO] -------------------------------------------------------------
    [ERROR] /usr/ports/science/mallet/work/Mallet-2.0.8-122-g47d690a/src/cc/mallet/topics/DMRTopicModel.java:[325,33] cannot find symbol
      symbol:   method sumTypeTopicCounts(cc.mallet.topics.DMRRunnable[])
      location: class cc.mallet.topics.DMRTopicModel
    [INFO] 1 error
    [INFO] -------------------------------------------------------------
    [INFO] 
    [INFO] ------------------------------------------------------------------------
    [INFO] Skipping MAchine Learning for LanguagE Toolkit (MALLET)
    [INFO] This project has been banned from the build due to previous failures.
    [INFO] ------------------------------------------------------------------------
    [INFO] ------------------------------------------------------------------------
    [INFO] BUILD FAILURE
    [INFO] ------------------------------------------------------------------------
    [INFO] Total time:  21.474 s
    [INFO] Finished at: 2019-06-08T16:49:16-07:00
    [INFO] ------------------------------------------------------------------------
    [ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on project mallet: Compilation failure
    [ERROR] /usr/ports/science/mallet/work/Mallet-2.0.8-122-g47d690a/src/cc/mallet/topics/DMRTopicModel.java:[325,33] cannot find symbol
    [ERROR]   symbol:   method sumTypeTopicCounts(cc.mallet.topics.DMRRunnable[])
    [ERROR]   location: class cc.mallet.topics.DMRTopicModel
    [ERROR] 
    [ERROR] -> [Help 1]
    [ERROR] 
    

    Revision: 2.0.8-122-g47d690a. OS: FreeBSD 12 amd64, openjdk8-8.212.4.1.

    opened by yurivict 2
  • Lemmatizing with Mallet

    Lemmatizing with Mallet

    Hi there,

    First off: a great solution that has helped me a lot in the past. I am currently preparing to do topic modeling via Mallet and have finished pulling the raw datasets. Before I import and start modeling, I need to take some steps to clean and streamline the texts. What I am a little fuzzy about is stemming and lemmatizing: not the concept itself, but rather what the best approach would be.

    To be specific, here is what I need to do:

    • standardize inconsistencies in spelling, e.g. topicmodeling -> topic modeling
    • remove extra whitespaces from words, e.g. two whitespaces in a row
    • stem and lemmatize

    I realize that this is not exactly an issue with Mallet, but I was hoping that someone could, based on experience, recommend an approach for how best to tackle this?

    Many thanks in advance!

    opened by Glorifier85 1
  • Auto-correlation between samples (Binkley et al.)

    Auto-correlation between samples (Binkley et al.)

    I recently found this paper by Binkley et al.

    A short extract from this paper follows:

    • b – the number of burn-in iterations
    • n – the number of samples (random variates)
    • si – the sampling interval

    If si is large enough, the observations are practically independent. However, too small a value risks unwanted correlation. To summarize the effect of b, n, and si: if any of these settings are too low, then the Gibbs sampler will produce inaccurate or inadequate information; if any of these settings are too high, then the only penalty is wasted computational effort. Unfortunately, as described in Section 6, support for extracting interval-separated observations is limited in existing LDA tools. For example, Mallet provides this capability but appears to suffer from a local maxima problem.

    with a footnote linking to http://www.cs.loyola.edu/~binkley/topic_models/additional-images/mallet-fixation/

    Does this problem still exist?

    Reference: Binkley, D., Heinz, D., Lawrie, D., & Overfelt, J. (2014). Understanding LDA in source code analysis. 22nd International Conference on Program Comprehension, ICPC 2014 - Proceedings, 26–36. https://doi.org/10.1145/2597008.2597150

    opened by jonaschn 0
  • Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded

    Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded

    I have this issue when importing the data into the format for LDA. I tried enlarging MALLET_MEMORY to 128G (the memory of my server is also 128G), but it still does not work.
    My data contains 6,712,484 documents in one .txt file and its size is 3.07G. I sampled 100 documents to test the import script and it works well, but this error message keeps appearing when importing my entire dataset. Could you please help me figure out the problem? I really appreciate your help!

    opened by jialu-stellar-xia 1
  • Computing Perplexity

    Computing Perplexity

    Is there any way to return perplexity with respect to the number of iterations? In case we want to optimise the number of iterations and avoid getting into burn-in periods in future executions. A way I found to do that is to add a list attribute that is iteratively filled with the perplexity corresponding to that iteration. Perplexity is computed as exp(-(modelLogLikelihood() / totalTokens)). Any chance I can submit a pull request?
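
    For reference, a sketch of that computation against ParallelTopicModel (assuming totalTokens holds the corpus token count, e.g. summed over the training FeatureSequences):

    import cc.mallet.topics.ParallelTopicModel;

    public class PerplexityExample {
        // Perplexity = exp(-LL / N), where N is the total number of tokens in the corpus.
        public static double perplexity(ParallelTopicModel model, long totalTokens) {
            return Math.exp(-model.modelLogLikelihood() / totalTokens);
        }
    }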

    opened by waelbenamara 1
Releases (v202108)
  • v202108 (Aug 11, 2021)

    This is a serialization-breaking release due to the switch to HPPC, which affects feature alphabets.

    Added

    • Nonnegative Matrix Factorization
    • Word embeddings (word2vec clone)
    • PagedInstanceList supports iteration correctly
    • lebiathan added stratified sampling of InstanceList
    • This file!

    Changed

    • All merging and propagation of sampling statistics for topic modeling is now multi-threaded (if num-threads is more than 1), leading to a 5-10% speed boost.
    • The primitive collections library (for example mapping String to int) has been changed from GNU Trove to Carrotsearch HPPC. This change removes all GNU dependencies.
    • The license has been changed from CPL to Apache.
    • Use of VMID for unique identifier for serialized objects. (Breaks serialization!)
    • Many small fixes suggested by ErrorProne.
    • Unneeded imports removed.

    Removed

    • The Matrix2 class has been removed.
    • GRMM has been moved to a separate package.

    Fixed

    • Te Rutherford fixed a bug where non-String instance IDs were being cast as Strings.
    • The import functions (Csv2Vectors, Text2Vectors) have a case-sensitive flag, but this was not being passed to the stopword remover.
    Assets: Source code (tar.gz), Source code (zip), Mallet-202108-bin.tar.gz (6.85 MB), Mallet-202108-bin.zip (7.97 MB)
  • v2.1-alpha (Jun 13, 2019)

    This is a serialization-breaking release due to the switch to HPPC, which affects feature alphabets.

    Added

    • Nonnegative Matrix Factorization
    • Word embeddings (word2vec clone)
    • PagedInstanceList supports iteration correctly
    • lebiathan added stratified sampling of InstanceList
    • This file!

    Changed

    • All merging and propagation of sampling statistics for topic modeling is now multi-threaded (if num-threads is more than 1), leading to a 5-10% speed boost.
    • The primitive collections library (for example mapping String to int) has been changed from GNU Trove to Carrotsearch HPPC. This change removes all GNU dependencies.
    • The license has been changed from CPL to Apache.
    • Use of VMID for unique identifier for serialized objects. (Breaks serialization!)
    • Many small fixes suggested by ErrorProne.
    • Unneeded imports removed.

    Removed

    • The Matrix2 class has been removed.
    • GRMM has been moved to a separate package.

    Fixed

    • Te Rutherford fixed a bug where non-String instance IDs were being cast as Strings.
    Assets: Source code (tar.gz), Source code (zip)
  • v2.0.8RC3 (Nov 11, 2015)

  • v2.0.8RC2 (Jun 19, 2015)

  • v2.0.8RC1 (Dec 10, 2014)
