Overview

Mallet

Website: https://mimno.github.io/Mallet/

MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.

MALLET includes sophisticated tools for document classification: efficient routines for converting text to "features", a wide variety of algorithms (including Naïve Bayes, Maximum Entropy, and Decision Trees), and code for evaluating classifier performance using several commonly used metrics.
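
As a rough sketch of how this looks from the Java API (the MaxEnt choice, the 90/10 split, and the "positive" label below are illustrative assumptions, not MALLET defaults):

    import cc.mallet.classify.Classifier;
    import cc.mallet.classify.MaxEntTrainer;
    import cc.mallet.classify.Trial;
    import cc.mallet.types.InstanceList;
    import cc.mallet.util.Randoms;

    public class ClassifyExample {
        // 'instances' is assumed to hold labeled documents, e.g. built with the
        // pipe example shown later in this overview.
        public static void trainAndEvaluate(InstanceList instances) {
            // Hold out 10% of the data for evaluation (proportions are illustrative).
            InstanceList[] split = instances.split(new Randoms(), new double[] {0.9, 0.1, 0.0});

            // Train a Maximum Entropy (multinomial logistic regression) classifier.
            Classifier classifier = new MaxEntTrainer().train(split[0]);

            // Trial pairs a classifier with a held-out set and exposes common metrics.
            Trial trial = new Trial(classifier, split[1]);
            System.out.println("Accuracy: " + trial.getAccuracy());
            System.out.println("F1 for label \"positive\": " + trial.getF1("positive"));
        }
    }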

In addition to classification, MALLET includes tools for sequence tagging for applications such as named-entity extraction from text. Algorithms include Hidden Markov Models, Maximum Entropy Markov Models, and Conditional Random Fields. These methods are implemented in an extensible system for finite state transducers.
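
A minimal sketch of training a linear-chain CRF with this API, assuming trainingData is an InstanceList of token sequences paired with label sequences (the prior variance and iteration count are illustrative):

    import cc.mallet.fst.CRF;
    import cc.mallet.fst.CRFTrainerByLabelLikelihood;
    import cc.mallet.types.InstanceList;

    public class TaggerExample {
        public static CRF trainCrf(InstanceList trainingData) {
            // Build a linear-chain CRF whose states and transitions mirror the
            // label sequences observed in the training data.
            CRF crf = new CRF(trainingData.getPipe(), null);
            crf.addStatesForLabelsConnectedAsIn(trainingData);

            // Maximize the conditional label likelihood (L-BFGS under the hood).
            CRFTrainerByLabelLikelihood trainer = new CRFTrainerByLabelLikelihood(crf);
            trainer.setGaussianPriorVariance(10.0);
            trainer.train(trainingData, 500);
            return crf;
        }
    }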

Topic models are useful for analyzing large collections of unlabeled text. The MALLET topic modeling toolkit contains efficient, sampling-based implementations of Latent Dirichlet Allocation, Pachinko Allocation, and Hierarchical LDA.
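
From Java, the parallel LDA implementation can be driven roughly as follows (a sketch modeled on the MALLET developer documentation; the topic count, hyperparameters, and thread count are illustrative):

    import cc.mallet.topics.ParallelTopicModel;
    import cc.mallet.types.InstanceList;

    import java.util.Arrays;

    public class TopicModelExample {
        public static void run(InstanceList instances) throws Exception {
            // 50 topics, alphaSum = 1.0, beta = 0.01 (illustrative hyperparameters).
            ParallelTopicModel model = new ParallelTopicModel(50, 1.0, 0.01);
            model.addInstances(instances);
            model.setNumThreads(2);       // sample with two worker threads
            model.setNumIterations(1000);
            model.estimate();             // run collapsed Gibbs sampling

            // Print the ten highest-probability words for each topic.
            Object[][] topWords = model.getTopWords(10);
            for (int topic = 0; topic < topWords.length; topic++) {
                System.out.println("Topic " + topic + ": " + Arrays.toString(topWords[topic]));
            }
        }
    }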

Many of the algorithms in MALLET depend on numerical optimization. MALLET includes an efficient implementation of Limited Memory BFGS, among many other optimization methods.
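
The optimizers operate on objects implementing the Optimizable interfaces and maximize the supplied value. A minimal sketch with a toy quadratic objective (the objective itself is purely illustrative):

    import cc.mallet.optimize.LimitedMemoryBFGS;
    import cc.mallet.optimize.Optimizable;

    public class OptimizerExample implements Optimizable.ByGradientValue {
        // Maximize f(x, y) = -(x - 1)^2 - (y - 2)^2, whose optimum is at (1, 2).
        private final double[] params = new double[] {0.0, 0.0};

        public double getValue() {
            double dx = params[0] - 1.0, dy = params[1] - 2.0;
            return -(dx * dx) - (dy * dy);
        }
        public void getValueGradient(double[] gradient) {
            gradient[0] = -2.0 * (params[0] - 1.0);
            gradient[1] = -2.0 * (params[1] - 2.0);
        }
        public int getNumParameters() { return params.length; }
        public double getParameter(int i) { return params[i]; }
        public void getParameters(double[] buffer) { System.arraycopy(params, 0, buffer, 0, params.length); }
        public void setParameter(int i, double value) { params[i] = value; }
        public void setParameters(double[] newParams) { System.arraycopy(newParams, 0, params, 0, params.length); }

        public static void main(String[] args) {
            OptimizerExample objective = new OptimizerExample();
            LimitedMemoryBFGS optimizer = new LimitedMemoryBFGS(objective);
            optimizer.optimize();   // runs L-BFGS until convergence
            System.out.println("x = " + objective.params[0] + ", y = " + objective.params[1]);
        }
    }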

In addition to sophisticated Machine Learning applications, MALLET includes routines for transforming text documents into numerical representations that can then be processed efficiently. This process is implemented through a flexible system of "pipes", which handle distinct tasks such as tokenizing strings, removing stopwords, and converting sequences into count vectors.
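
A pipe sequence for importing one-document-per-line text might be assembled roughly like this (a sketch following the import example in the MALLET developer documentation; the stoplist path and the line-format regex are assumptions about the input):

    import cc.mallet.pipe.*;
    import cc.mallet.pipe.iterator.CsvIterator;
    import cc.mallet.types.InstanceList;

    import java.io.*;
    import java.util.ArrayList;
    import java.util.regex.Pattern;

    public class ImportExample {
        public static InstanceList importFile(String path) throws IOException {
            ArrayList<Pipe> pipes = new ArrayList<Pipe>();
            pipes.add(new CharSequenceLowercase());                        // lowercase the raw text
            pipes.add(new CharSequence2TokenSequence(
                    Pattern.compile("\\p{L}[\\p{L}\\p{P}]+\\p{L}")));      // tokenize
            pipes.add(new TokenSequenceRemoveStopwords(
                    new File("stoplists/en.txt"), "UTF-8", false, false, false)); // drop stopwords
            pipes.add(new TokenSequence2FeatureSequence());                // tokens -> feature indices

            InstanceList instances = new InstanceList(new SerialPipes(pipes));

            // Each input line is expected to look like "name label text...".
            Reader reader = new InputStreamReader(new FileInputStream(path), "UTF-8");
            instances.addThruPipe(new CsvIterator(reader,
                    Pattern.compile("^(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$"),
                    3, 2, 1));   // group 3 = data, group 2 = label, group 1 = name
            return instances;
        }
    }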

An add-on package to MALLET, called GRMM, contains support for inference in general graphical models, and training of CRFs with arbitrary graphical structure.

Installation

To build a Mallet 2.0 development release, you must have the Apache ant build tool installed. From the command prompt, first change to the mallet directory, and then type ant

If ant finishes with "BUILD SUCCESSFUL", Mallet is now ready to use.

If you would like to deploy Mallet as part of a larger application, it is helpful to create a single ".jar" file that contains all of the compiled code. Once you have compiled the individual Mallet class files, use the command: ant jar

This process will create a file "mallet.jar" in the "dist" directory within Mallet.

Usage

Once you have installed Mallet, you can run it with the following command:

bin/mallet [command] --option value --option value ...

Type bin/mallet to get a list of commands, and use the option --help with any command to get a description of valid options.

For details about the commands, please see the API documentation and website at: https://mimno.github.io/Mallet/

List of Algorithms:

  • Topic Modelling
    • LDA
    • Parallel LDA
    • DMR LDA
    • Hierarchical LDA
    • Labeled LDA
    • Polylingual Topic Model
    • Hierarchical Pachinko Allocation Model (PAM)
    • Weighted Topic Model
    • LDA with integrated phrase discovery
    • Word Embeddings (word2vec) using skip-gram with negative sampling
  • Classification
    • AdaBoost
    • Bagging
    • Winnow
    • C45 Decision Tree
    • Ensemble Trainer
    • Maximum Entropy Classifier (Multinomial Logistic Regression)
    • Naive Bayes
    • Rank Maximum Entropy Classifier
    • Posterior Regularization Auxiliary Model
  • Clustering
    • Greedy Agglomerative
    • Hill Climbing
    • K-Means
    • K-Best
  • Sequence Prediction Models
    • Conditional Random Fields
    • Maximum Entropy Markov Models
    • Hidden Markov Models
    • Semi-Supervised Sequence Prediction Models
  • Linear Regression
Comments
  • Ant build fails: error: unmappable character for encoding ASCII

    Ant build fails: error: unmappable character for encoding ASCII

    Buildfile: /usr/ports/science/mallet/work/Mallet-2.0.8-122-g47d690a/build.xml
    
    init:
        [mkdir] Created dir: /usr/ports/science/mallet/work/Mallet-2.0.8-122-g47d690a/class
         [copy] Copying 1 file to /usr/ports/science/mallet/work/Mallet-2.0.8-122-g47d690a/class
        [mkdir] Created dir: /usr/ports/science/mallet/work/Mallet-2.0.8-122-g47d690a/dist
         [copy] Copying 1 file to /usr/ports/science/mallet/work/Mallet-2.0.8-122-g47d690a/dist
        [mkdir] Created dir: /usr/ports/science/mallet/work/Mallet-2.0.8-122-g47d690a/test
    
    compile:
        [javac] /usr/ports/science/mallet/work/Mallet-2.0.8-122-g47d690a/build.xml:61: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
        [javac] Compiling 617 source files to /usr/ports/science/mallet/work/Mallet-2.0.8-122-g47d690a/class
        [javac] /usr/ports/science/mallet/work/Mallet-2.0.8-122-g47d690a/src/cc/mallet/pipe/tests/TestCharSequenceNoDiacritics.java:25: error: unmappable character for encoding ASCII
        [javac]     assertEquals("medecin", oneWordCleansing("m??decin"));
        [javac]                                                ^
        [javac] /usr/ports/science/mallet/work/Mallet-2.0.8-122-g47d690a/src/cc/mallet/pipe/tests/TestCharSequenceNoDiacritics.java:25: error: unmappable character for encoding ASCII
        [javac]     assertEquals("medecin", oneWordCleansing("m??decin"));
        [javac]                                                 ^
        [javac] /usr/ports/science/mallet/work/Mallet-2.0.8-122-g47d690a/src/cc/mallet/pipe/tests/TestCharSequenceNoDiacritics.java:26: error: unmappable character for encoding ASCII
    
    opened by yurivict 7
  • Fix misleading help text

    Fix misleading help text

    There is some confusion about the hyperparameter optimization of the Dirichlet priors. See https://stackoverflow.com/questions/52099379/mallet-hyperparameter-optimization

    opened by jonaschn 6
  • Add support for numeric features in GRMM

    Add support for numeric features in GRMM

    • Code changes were outlined in this mailing list conversation: http://t47962.ai-mallet-development.aitalk.info/continuous-observations-for-dcrf-t47962.html.
    • Special thanks to Ming Tu.
    opened by nrockweiler 4
  • Topic Modelling for Non-text data

    Topic Modelling for Non-text data

    Hi,

    I am trying to run topic modelling on a user-modelling task. I have a list of user activities and I am treating each activity (an item) as a word. The items are encoded as integers.

    I am trying to run LDA on this data. I tried an approach similar to running topic modelling on a directory with multiple text files, but LDA throws NaN values when I run it.

    Is the tool limited to textual input? Any suggestions would be appreciated.

    opened by shashankg7 3
  • puzzling results with simple test code, replicates on gensim -- possibly because of very short documents?

    puzzling results with simple test code, replicates on gensim -- possibly because of very short documents?

    Hello Mallet Gurus,

    I hesitate to bring up this sort of basic question, but we've been getting some unusual results with mallet, using the "import-file" method, and we're not sure what we're doing wrong. We're using mallet 2.0. Here's what I think is a minimal example.

    First, the input data. That's this "AskP.txt" file here:

    AskP.txt

    We then run

    mallet import-file --input AskP.txt --output AskP.mallet --keep-sequence

    to get the mallet-input style file.

    We then run

    mallet train-topics --random-seed 1 --num-iterations 1000 --input AskP.mallet --num-topics 2 --num-top-words 100 --output-doc-topics doc-topics.txt --output-topic-keys topic-keys.txt --topic-word-weights-file word-topics.txt

    which gives us the following output files:

    doc-topics.txt topic-keys.txt word-topics.txt

    These are pretty strange! In particular, consider the first document. It has only the word "certainly" in it. The model says it is equally weighted on topic 0 and topic 1. However, this doesn't make sense, because topic 1 is more heavily weighted on the word certainly (as can be seen by inspecting the word-topics.txt file).

    Even stranger, documents 7, 8, and 9 are all the same (the word "necessarily", alone). However, the output gives document 8 a different topic decomposition (0.64, 0.36) than it does documents 7 and 9 (which both get 0.5, 0.5).

    Can anyone help us understand what is going on? (I am leaving my collaborator anonymous here, so he doesn't get embarrassed by what is likely a simple error on my side.)

    We've never noticed problems like this before, but perhaps that's because we had been running with documents that were generally much longer?

    Update: my collaborator ran MALLET via the gensim interface; he has a completely different installation. His output is slightly different, despite using the same random seed, but still shows a lot of the same issues:

    MALLET_via_GENSIM

    and meanwhile, LdaModel from GENSIM seems to give more normal answers:

    LdaModel_via_GENSIM

    Can anyone help us make sense of this? We're super-happy to run any tests here on our end as well, and happy to try experiments.

    opened by sdedeo 2
  • mallet bash wrapper script misses an option to set Java heap size like most other scripts.

    mallet bash wrapper script misses an option to set Java heap size like most other scripts.

    Mallet bash wrapper script misses an option to set Java heap size like most other scripts.

    Other scripts have something like:

    # Default Java heap size.  Change with -Xmx800m as the first argument.
    mem=200m
    
    # If first argument is something like -Xmx900m, process appropriately
    arg=`echo "$1" | sed -e 's/-Xmx//'`
    if test $1 != $arg ; then
      mem=$arg
      shift 
    fi
    

    Another way to allow specifying memory would be:

    # Set 1GB of memory if ${MALLET_MEMORY} is not defined
      mem="${MALLET_MEMORY:-1g}"
    
    # Run Mallet with 10GB of heap memory.
    MALLET_MEMORY=10g ./bin/mallet
    
    opened by ghuls 2
  • Not clear if `trainingProportions` may be `null`

    Not clear if `trainingProportions` may be `null`

    The third parameter of CRFTrainerByLabelLikelihood#train(InstanceList, int, double[]) is documented as follows: https://github.com/mimno/Mallet/blob/12487de1aa6433bdcf5af0ee0a17b368e64c7acf/src/cc/mallet/fst/CRFTrainerByLabelLikelihood.java#L182-L184 “If non-null” implies that setting it to null is a valid alternative. However, doing so causes a NullPointerException. Either the method should treat null like new double[] {1.0}, or the documentation should clarify that null is not allowed.
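
    Until the code or the Javadoc is changed, a sketch of the workaround implied above is to pass an explicit single-element array instead of null (names here are illustrative):

    // 'trainer' is a CRFTrainerByLabelLikelihood, 'training' an InstanceList.
    // Passing {1.0} trains on the full data in one round, which is presumably
    // what a caller passing null intends.
    trainer.train(training, 500, new double[] {1.0});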

    opened by dscorbett 2
  • Can only have 2147483647 (max int) tokens in corpus.

    Can only have 2147483647 (max int) tokens in corpus.

    I have a pretty large corpus. After filtering out common and rare tokens I end up with 3270690345 words. (Not unique, just counting the total words in the corpus.)

    If I try to use Mallet for topic modelling I get the following output after which nothing else happens (no CPU usage).

    Mallet LDA: 90 topics, 7 topic bits, 1111111 topic mask
    Data loaded.
    max tokens: 1214
    total tokens: -1024276951
    

    The obvious indication that something is wrong is the total tokens: -1024276951 line. This is an obvious integer overflow, as 3270690345 is too large for the signed int that seems to store the number of tokens.

    I have tested some values below and above 2147483647 (max int): every time I am above that limit it fails, and when I am below it works.

    Mallet LDA: 90 topics, 7 topic bits, 1111111 topic mask
    Data loaded.
    max tokens: 1145
    total tokens: 2147483646
    <10> LL/token: -11,88627
    <20> LL/token: -10,03233
    ...
    

    I am currently trying to use as many documents as possible, but I cannot train on the full corpus.

    This may be a simple fix, changing the datatype of the total-tokens counter to long, but I haven't looked at the internals.
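
    The reported value is consistent with a plain 32-bit overflow; a quick illustrative check (not MALLET code):

    public class OverflowCheck {
        public static void main(String[] args) {
            long actualTokens = 3270690345L;   // token count from this report
            int stored = (int) actualTokens;   // what a signed int counter would hold
            System.out.println(stored);        // prints -1024276951, matching the log above
        }
    }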

    opened by UnfinishedArchitect 2
  • Doesn't PolylingualTopicModel.java include any code for --evaluator-filename?

    Doesn't PolylingualTopicModel.java include any code for --evaluator-filename?

    I passed --evaluator-filename to PolylingualTopicModel, but the evaluator file was not created. I checked PolylingualTopicModel.java and couldn't find any mention of --evaluator-filename.

    I want to use --evaluator-filename with PolylingualTopicModel. How can I use this option with PolylingualTopicModel in Mallet?

    opened by i-kohey 2
  • Mallet command line: how to determine convergence?

    Mallet command line: how to determine convergence?

    We are testing the Mallet command line and found a diagnostics file after the run. How can we check sampling convergence? Where is the report, or how can we check this measure? And could you please explain briefly how to interpret it?
    Many thanks

    opened by findgit123 2
  • Maven build fails: src/cc/mallet/topics/DMRTopicModel.java:[325,33] cannot find symbol method sumTypeTopicCounts(cc.mallet.topics.DMRRunnable[])

    Maven build fails: src/cc/mallet/topics/DMRTopicModel.java:[325,33] cannot find symbol method sumTypeTopicCounts(cc.mallet.topics.DMRRunnable[])

    [INFO] -------------------------------------------------------------
    [WARNING] COMPILATION WARNING : 
    [INFO] -------------------------------------------------------------
    [WARNING] /usr/ports/science/mallet/work/Mallet-2.0.8-122-g47d690a/src/cc/mallet/types/FeatureVector.java: Some input files use or override a deprecated API.
    [WARNING] /usr/ports/science/mallet/work/Mallet-2.0.8-122-g47d690a/src/cc/mallet/types/FeatureVector.java: Recompile with -Xlint:deprecation for details.
    [WARNING] /usr/ports/science/mallet/work/Mallet-2.0.8-122-g47d690a/src/cc/mallet/types/Alphabet.java: Some input files use unchecked or unsafe operations.
    [WARNING] /usr/ports/science/mallet/work/Mallet-2.0.8-122-g47d690a/src/cc/mallet/types/Alphabet.java: Recompile with -Xlint:unchecked for details.
    [INFO] 4 warnings 
    [INFO] -------------------------------------------------------------
    [INFO] -------------------------------------------------------------
    [ERROR] COMPILATION ERROR : 
    [INFO] -------------------------------------------------------------
    [ERROR] /usr/ports/science/mallet/work/Mallet-2.0.8-122-g47d690a/src/cc/mallet/topics/DMRTopicModel.java:[325,33] cannot find symbol
      symbol:   method sumTypeTopicCounts(cc.mallet.topics.DMRRunnable[])
      location: class cc.mallet.topics.DMRTopicModel
    [INFO] 1 error
    [INFO] -------------------------------------------------------------
    [INFO] 
    [INFO] ------------------------------------------------------------------------
    [INFO] Skipping MAchine Learning for LanguagE Toolkit (MALLET)
    [INFO] This project has been banned from the build due to previous failures.
    [INFO] ------------------------------------------------------------------------
    [INFO] ------------------------------------------------------------------------
    [INFO] BUILD FAILURE
    [INFO] ------------------------------------------------------------------------
    [INFO] Total time:  21.474 s
    [INFO] Finished at: 2019-06-08T16:49:16-07:00
    [INFO] ------------------------------------------------------------------------
    [ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on project mallet: Compilation failure
    [ERROR] /usr/ports/science/mallet/work/Mallet-2.0.8-122-g47d690a/src/cc/mallet/topics/DMRTopicModel.java:[325,33] cannot find symbol
    [ERROR]   symbol:   method sumTypeTopicCounts(cc.mallet.topics.DMRRunnable[])
    [ERROR]   location: class cc.mallet.topics.DMRTopicModel
    [ERROR] 
    [ERROR] -> [Help 1]
    [ERROR] 
    

    Revision: 2.0.8-122-g47d690a. OS: FreeBSD 12 amd64, openjdk8-8.212.4.1.

    opened by yurivict 2
  • Lemmatizing with Mallet

    Lemmatizing with Mallet

    Hi there,

    First off: a great solution that has helped me a lot in the past. I am currently preparing to do topic modeling via Mallet and have finished pulling the raw datasets. Before I import and start modeling, I need to take some steps to clean and streamline the texts. What I am a little fuzzy about is stemming and lemmatizing: not the concept itself, but rather what the best approach would be.

    To be specific, here is what I need to do:

    • standardize inconsistencies in spelling, e.g. topicmodeling -> topic modeling
    • remove extra whitespaces from words, e.g. two whitespaces in a row
    • stem and lemmatize

    I realize that this is not exactly an issue with Mallet, but I was hoping that someone could, based on experience, recommend an approach for how best to tackle this?

    Many thanks in advance!

    opened by Glorifier85 1
  • Auto-correlation between samples (Binkley et al.)

    Auto-correlation between samples (Binkley et al.)

    I recently found this paper by Binkley et al.

    A short extract from this paper follows:

    • b – the number of burn-in iterations
    • n – the number of samples (random variates)
    • si – the sampling interval

    If si is large enough, the observations are practically independent. However, too small a value risks unwanted correlation. To summarize the effect of b, n, and si: if any of these settings are too low, then the Gibbs sampler will produce inaccurate or inadequate information; if any of these settings are too high, then the only penalty is wasted computational effort. Unfortunately, as described in Section 6, support for extracting interval-separated observations is limited in existing LDA tools. For example, Mallet provides this capability but appears to suffer from a local maxima problem.

    with a footnote linking to http://www.cs.loyola.edu/~binkley/topic_models/additional-images/mallet-fixation/

    Does this problem still exist?

    Reference: Binkley, D., Heinz, D., Lawrie, D., & Overfelt, J. (2014). Understanding LDA in source code analysis. 22nd International Conference on Program Comprehension, ICPC 2014 - Proceedings, 26–36. https://doi.org/10.1145/2597008.2597150

    opened by jonaschn 0
  • Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded

    Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded

    I have this issue when importing the data into the format for LDA. I tried enlarging MALLET_MEMORY to 128G (the memory of my server is also 128G), but it still does not work.
    My data contains 6,712,484 documents in one .txt file and its size is 3.07G. I sampled 100 documents to test the import script and it works well, but this error message keeps appearing when importing my entire dataset. Could you please help me figure out the problem? I really appreciate your help!

    opened by jialu-stellar-xia 1
  • Computing Perplexity

    Computing Perplexity

    Is there any way to return perplexity with respect to the number of iterations? In case we want to optimise the number of iterations and avoid getting into burn-in periods in future executions. A way I found to do that is to add a list attribute that is iteratively filled with the perplexity corresponding to that iteration. Perplexity is computed as exp(-(modelLogLikelihood() / totalTokens)). Any chance I can submit a pull request?
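
    For reference, a sketch of that computation against ParallelTopicModel (assuming totalTokens holds the corpus token count, e.g. summed over the training FeatureSequences):

    import cc.mallet.topics.ParallelTopicModel;

    public class PerplexityExample {
        // Perplexity = exp(-LL / N), where N is the total number of tokens in the corpus.
        public static double perplexity(ParallelTopicModel model, long totalTokens) {
            return Math.exp(-model.modelLogLikelihood() / totalTokens);
        }
    }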

    opened by waelbenamara 1
Releases (v202108)
  • v202108 (Aug 11, 2021)

    This is a serialization-breaking release due to the switch to HPPC, which affects feature alphabets.

    Added

    • Nonnegative Matrix Factorization
    • Word embeddings (word2vec clone)
    • PagedInstanceList supports iteration correctly
    • lebiathan added stratified sampling of InstanceList
    • This file!

    Changed

    • All merging and propagation of sampling statistics for topic modeling is now multi-threaded (if num-threads is more than 1), leading to a 5-10% speed boost.
    • The primitive collections library (for example mapping String to int) has been changed from GNU Trove to Carrotsearch HPPC. This change removes all GNU dependencies.
    • The license has been changed from CPL to Apache.
    • Use of VMID for unique identifier for serialized objects. (Breaks serialization!)
    • Many small fixes suggested by ErrorProne.
    • Unneeded imports removed.

    Removed

    • The Matrix2 class has been removed.
    • GRMM has been moved to a separate package.

    Fixed

    • Te Rutherford fixed a bug where non-String instance IDs were being cast as Strings.
    • The import functions (Csv2Vectors, Text2Vectors) have a case-sensitive flag, but this was not being passed to the stopword remover.
    Assets: Source code (tar.gz), Source code (zip), Mallet-202108-bin.tar.gz (6.85 MB), Mallet-202108-bin.zip (7.97 MB)
  • v2.1-alpha (Jun 13, 2019)

    This is a serialization-breaking release due to the switch to HPPC, which affects feature alphabets.

    Added

    • Nonnegative Matrix Factorization
    • Word embeddings (word2vec clone)
    • PagedInstanceList supports iteration correctly
    • lebiathan added stratified sampling of InstanceList
    • This file!

    Changed

    • All merging and propagation of sampling statistics for topic modeling is now multi-threaded (if num-threads is more than 1), leading to a 5-10% speed boost.
    • The primitive collections library (for example mapping String to int) has been changed from GNU Trove to Carrotsearch HPPC. This change removes all GNU dependencies.
    • The license has been changed from CPL to Apache.
    • Use of VMID for unique identifier for serialized objects. (Breaks serialization!)
    • Many small fixes suggested by ErrorProne.
    • Unneeded imports removed.

    Removed

    • The Matrix2 class has been removed.
    • GRMM has been moved to a separate package.

    Fixed

    • Te Rutherford fixed a bug where non-String instance IDs were being cast as Strings.
    Assets: Source code (tar.gz), Source code (zip)
  • v2.0.8RC3 (Nov 11, 2015)

  • v2.0.8RC2 (Jun 19, 2015)

  • v2.0.8RC1 (Dec 10, 2014)
