Machine learning components for Apache UIMA

Overview

Introduction

ClearTK provides a framework for developing statistical natural language processing (NLP) components in Java and is built on top of Apache UIMA. It is developed by the Center for Computational Language and Education Research (CLEAR) at the University of Colorado at Boulder.

ClearTK is built with Maven, and we recommend that projects depending on ClearTK also build with Maven. This lets you declare dependencies on only the parts of ClearTK you are interested in, and automatically pull in only the libraries those parts require. The zip file you have downloaded is provided as a convenience for those who are unable to build with Maven. It provides jar files for each of ClearTK's sub-projects as well as all of the dependencies each sub-project uses. To use ClearTK in your Java project, simply add all of these jar files to your classpath. If you are only interested in one (or a few) of ClearTK's sub-projects, then you may not want to add every jar file provided here. Please consult the Maven build files to determine which jar files are required for the parts of ClearTK you want to use.
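For example, a Maven project that only needs one sub-project can declare a single dependency and let Maven resolve the rest; the version number below is a placeholder, so check the ClearTK build files for the actual coordinates:

```xml
<!-- Illustrative dependency on a single ClearTK sub-project;
     the version is a placeholder, not a recommendation. -->
<dependency>
  <groupId>org.cleartk</groupId>
  <artifactId>cleartk-ml</artifactId>
  <version>2.0.0</version>
</dependency>
```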

Please see the section titled "Dependencies" below for important licensing information.

License

Copyright (c) 2007-2014, Regents of the University of Colorado All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. Neither the name of the University of Colorado at Boulder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Dependencies

ClearTK depends on a variety of different open source libraries that are redistributed here subject to the respective licensing terms provided by each library. We have been careful to use only libraries that are commercially friendly. Please see the notes below for exceptions. For a complete listing of the dependencies and their respective licenses please see the file licenses/index.html.

GPL Dependencies

ClearTK has two sub-projects that depend on GPL licensed libraries:

  • cleartk-syntax-berkeley
  • cleartk-stanford-corenlp

Neither of these projects nor their dependencies are provided in this release. To obtain these projects, please manually download them from our googlecode-hosted Maven repository:

  • http://cleartk.googlecode.com/svn/repo/org/cleartk/cleartk-syntax-berkeley/
  • http://cleartk.googlecode.com/svn/repo/org/cleartk/cleartk-stanford-corenlp/

SVMLIGHT

ClearTK also has two projects, cleartk-ml-svmlight and cleartk-ml-tksvmlight, which have special licensing considerations. The ClearTK project does not redistribute SVMlight. ClearTK does, however, facilitate the building of SVMlight models via the ClassifierBuilder interface. To use the implementations of this interface to good effect, you will need SVMlight installed on your machine: the ClassifierBuilders for SVMlight simply call the executable "svm_learn" provided by the SVMlight distribution. ClearTK does not use SVMlight at classification time; it only uses the models that are built by SVMlight. Instead, ClearTK provides its own code for classification that makes use of an SVMlight-generated model. This code is provided with ClearTK and is available under the above BSD license, as is all of the other code written for ClearTK. Be advised, therefore, that while ClearTK is not required (or compelled) to redistribute the code or license of SVMlight or to comply with it (i.e. the noncommercial license provided by SVMlight is not compatible with our BSD license), it would be very difficult to use the SVMlight wrappers we provide in a commercial setting without obtaining a license for SVMlight directly from its authors.
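To make the division of labor concrete, the following is a hypothetical sketch of how a builder might assemble the "svm_learn" command line for training; the class and method names here are illustrative assumptions, not ClearTK's actual code.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: assemble the "svm_learn" command line the way an
// SVMlight ClassifierBuilder might; names and structure are illustrative.
public class SvmlightTrainSketch {

  static List<String> buildCommand(File trainingData, File modelFile, String... extraArgs) {
    List<String> command = new ArrayList<>();
    command.add("svm_learn");                // executable from the SVMlight distribution
    for (String arg : extraArgs) {
      command.add(arg);                      // e.g. "-c", "1.0" for the margin trade-off
    }
    command.add(trainingData.getPath());     // training file written by the DataWriter
    command.add(modelFile.getPath());        // model file read back at classification time
    return command;
  }

  public static void main(String[] args) {
    List<String> cmd = buildCommand(
        new File("training-data.svmlight"), new File("model.svmlight"), "-c", "1.0");
    System.out.println(String.join(" ", cmd));
    // new ProcessBuilder(cmd).inheritIO().start().waitFor(); // would launch svm_learn
  }
}
```

Classification then happens entirely in ClearTK's own BSD-licensed code, which only reads the model file that svm_learn produced.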

LGPL

The cleartk-ml-mallet project depends on Mallet (http://mallet.cs.umass.edu/), which depends on trove4j (http://trove.starlight-systems.com/), which is released under the LGPL license. If you do not need Mallet classifiers and would like to avoid the LGPL license, you can omit the cleartk-ml-mallet dependency.

Comments
  • Implement viterbi for Classifier_ImplBase.classifySequence()

    Implement viterbi for Classifier_ImplBase.classifySequence()

    Original issue 76 created by ClearTK on 2009-03-20T17:22:01.000Z:

We need to provide a mechanism for returning the K-best results from a classifier for sequential tagging tasks. There are two primary drivers for this: (1) a Viterbi-style search through the space of possible solutions for a sequential tagging task such as part-of-speech tagging or chunking is very common, because it improves the accuracy of the top choice that is returned; (2) it is often the case that you want to pass the K-best sequences along to downstream components. Right now we assume that a non-sequential tagger performing a sequential tagging task can return the best result by simply making the best decision it can at each decision point and returning the resulting sequence.

    I was just reading the Ratnaparkhi parsing paper which describes the basic algorithm that the OpenNLP parser is based on. Here is an excerpt:

"It should be emphasized that if K > 1 (beam size), the parser does not commit to a single POS or chunk assignment for the input sentence before building constituent structure. All three of the passes described in section 2 are integrated in the search, i.e. when parsing a test sentence, the input to the second pass consists of K of the best distinct POS tag assignments for the input sentence."

Reimplementing this parser in ClearTK would be really difficult right now. We should think about how to provide search capability for sequential tagging tasks and a mechanism to "return" the K-best sequences. The former may not make sense for all non-sequential classifiers (e.g. LIBSVM doesn't provide probabilities with a classification) but should be quite straightforward for e.g. Maxent.

    This issue was briefly raised in # 57 but was not addressed there because we used that issue to address a related problem of providing OutcomeFeatureExtractor.
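For reference, the kind of search being proposed can be stated very compactly. This is a minimal, self-contained Viterbi sketch over per-position outcome scores; the array-based API is an assumption for illustration, not ClearTK's interface.

```java
import java.util.LinkedList;
import java.util.List;

// Minimal Viterbi sketch (illustrative, not ClearTK's API): given per-position
// log-scores over k outcomes (emit[t][j]) and log transition scores
// (trans[i][j] for outcome i followed by outcome j), recover the single best
// outcome sequence. A K-best variant would keep K back-pointers per cell.
public class ViterbiSketch {

  static List<Integer> viterbi(double[][] emit, double[][] trans) {
    int n = emit.length, k = emit[0].length;
    double[][] score = new double[n][k];
    int[][] back = new int[n][k];
    score[0] = emit[0].clone();
    for (int t = 1; t < n; t++) {
      for (int j = 0; j < k; j++) {
        double best = Double.NEGATIVE_INFINITY;
        int arg = 0;
        for (int i = 0; i < k; i++) {
          double s = score[t - 1][i] + trans[i][j];
          if (s > best) { best = s; arg = i; }
        }
        score[t][j] = best + emit[t][j];
        back[t][j] = arg;
      }
    }
    // Find the best final outcome, then follow back-pointers to the start.
    int argBest = 0;
    for (int j = 1; j < k; j++) {
      if (score[n - 1][j] > score[n - 1][argBest]) argBest = j;
    }
    LinkedList<Integer> path = new LinkedList<>();
    path.addFirst(argBest);
    for (int t = n - 1; t > 0; t--) {
      argBest = back[t][argBest];
      path.addFirst(argBest);
    }
    return path;
  }

  public static void main(String[] args) {
    double[][] emit = {{0, -1}, {-1, 0}};   // outcome 0 best at t=0, outcome 1 at t=1
    double[][] trans = {{0, 0}, {0, 0}};    // uniform transitions
    System.out.println(viterbi(emit, trans)); // prints [0, 1]
  }
}
```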

    🐛 Bug Priority-Critical 
    opened by bethard 32
  • consider moving evaluation out of cleartk-ml into cleartk-ml-eval

    consider moving evaluation out of cleartk-ml into cleartk-ml-eval

    Original issue 228 created by ClearTK on 2011-02-07T21:14:35.000Z:

So I've been playing around with the evaluation package, and it feels pretty clunky to me in a number of ways. I don't like the idea of labeling it as 0.9.9 (which is what it will get if it stays in cleartk-ml), because I think that may give the impression that the APIs are pretty much the way we want them. Could we move this out into a new module, say cleartk-ml-eval, and label it with 0.5.0 or 0.7.0 or something a little less committal than 0.9.9?

    Some things that I think are clunky:

    • I don't like that EngineFactory forces you to use Descriptions. In many cases, it may be possible to use a single AnalysisEngine for both data writing and classifying by simply flipping the isTraining flag (e.g. via a few setConfigurationParameterValue calls and a reconfigure call). But to support this, the EngineFactory methods would need to return AnalysisEngines instead of AnalysisEngineDescriptions, so that you could have a single instance and just reconfigure it when the interface methods were called.
    • There aren't any default implementations of anything, which makes it hard to be confident we have the abstraction right. Some things I think we should implement before we declare the APIs satisfactory:
      • A FilesCorpusFactory that takes a list of training files and a list of testing files (to be read in as plain text using a FilesCollectionReader) and fills in all the CorpusFactory methods using this information
      • An AnnotationSpanEvaluationFactory that takes an annotation type, and compares the spans of the gold annotations of this type to the model annotations using precision, recall and F-measure. And probably similarly for an AnnotationAttributeEvaluationFactory using accuracy.

    Don't get me wrong - I don't think the evaluation package is horribly broken or anything. I just think we should label it as something less than a 0.9.9 release until we've pushed its boundaries a bit.

    🐛 Bug Priority-Medium 
    opened by bethard 30
  • feature extraction vs. feature encoding

    feature extraction vs. feature encoding

    Original issue 55 created by ClearTK on 2009-02-18T17:54:29.000Z:

    [Steve]

    Consider this scenario:

    • I'm working on a document classification task where I want to combine, say, word-based features with citation-based features.
    • I want to apply Euclidean normalization to each group of features, that is, I want to normalize the word-based features separately from the citation-based features.

    How should I go about this?

    • Normalization is currently done in FeaturesEncoders, but FeaturesEncoders just see an Iterable of Features - they have no concept of groups.
    • I know which features are in which groups in the AnnotationHandler, so maybe I should do the normalization by hand there? I'd basically have to re-implement FeatureVector.l2Norm() to work over Feature objects though...

    Anyone have any better ideas?
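One possible shape for the per-group normalization being discussed, sketched outside the encoder machinery; the Feature record and its group field here are illustrative assumptions, not ClearTK's Feature class.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of per-group Euclidean (L2) normalization. The Feature
// record with an explicit group name is an assumption for this example;
// ClearTK's Feature objects don't carry group information.
public class GroupNormalizer {

  record Feature(String group, String name, double value) {}

  static List<Feature> l2NormalizeByGroup(List<Feature> features) {
    // First pass: sum of squares per group.
    Map<String, Double> sumSquares = new HashMap<>();
    for (Feature f : features) {
      sumSquares.merge(f.group(), f.value() * f.value(), Double::sum);
    }
    // Second pass: divide each value by its group's L2 norm.
    List<Feature> normalized = new ArrayList<>();
    for (Feature f : features) {
      double norm = Math.sqrt(sumSquares.get(f.group()));
      normalized.add(new Feature(f.group(), f.name(), norm == 0 ? 0 : f.value() / norm));
    }
    return normalized;
  }

  public static void main(String[] args) {
    List<Feature> features = List.of(
        new Feature("word", "w1", 3), new Feature("word", "w2", 4),
        new Feature("citation", "c1", 5));
    System.out.println(l2NormalizeByGroup(features));
  }
}
```

The word-based features (3, 4) are normalized against each other to (0.6, 0.8), while the single citation feature is normalized independently to 1.0.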

    🐛 Bug Priority-Medium 
    opened by bethard 29
  • #442 - Upgrade to UIMAv3

    #442 - Upgrade to UIMAv3

    • [x] Pull UIMA and uimaFIT versions up to 3.3.0
    • [x] All tests are successful
    • [x] Odd problem with uima.tcas.Annotation not being in the type system (slot 21 = null) in TimeML module
    ⚙️ Refactoring 
    opened by reckart 23
  • Multiple Classifier CleartkAnnotator

    Multiple Classifier CleartkAnnotator

    Original issue 226 created by ClearTK on 2011-02-03T18:31:48.000Z:

    [Lee]

    I was thinking since I have to do it anyway for my work, I should go ahead and write a CleartkMultiAnnotator class. The question is what can one reasonably expect from such a class? It looks like CleartkAnnotator doesn't provide much save for initialization of the classifier and data writer, and I was thinking CleartkMultiAnnotator would just provide help in dealing with collections of classifiers/ data writers. Based on conversations with Philip and my own needs, I was thinking that the pieces that differentiate CleartkMultiAnnotator from CleartkAnnotator would look something like this:

    public abstract class CleartkMultiAnnotator<KEY_TYPE, OUTCOME_TYPE> extends JCasAnnotator_ImplBase implements Initializable {

      protected Map<KEY_TYPE, Classifier<OUTCOME_TYPE>> classifiers;
      protected Map<KEY_TYPE, DataWriter<OUTCOME_TYPE>> dataWriters;

      // Retrieves or creates the classifier as needed
      protected abstract Classifier<OUTCOME_TYPE> getClassifier(KEY_TYPE key);

      // Retrieves or creates the dataWriter as needed
      protected abstract DataWriter<OUTCOME_TYPE> getDataWriter(KEY_TYPE key);

    }

    Can you think of anything else that would be useful? I know when Philip implemented something similar for his own purposes he had an output directory configuration parameter so that he could squirrel away the classifiers and data writers in jar files.

    [Steve] Thanks, this sounds like a great contribution!

    This looks like basically what I had in mind, except I would just drop the generics for KEY_TYPE and use String everywhere. Then you can have a String[] PARAM_CLASSIFIER_NAMES @ConfigurationParameter, and the DataWriters can all delegate to PARAM_OUTPUT_DIRECTORY/name. Within those subdirectories, you should be able to package and unpackage jars using the usual ClearTK machinery.

    By the way, I think you probably don't need the getter methods if the Maps are protected. No one but subclasses should really be asking for the DataWriters and Classifiers here, right?

    [Philip] The getters can be protected. In my solution, I used getClassifier(keyType) to get a classifier if it was already available. If not, then it would initialize the classifier. I like doing this because you may not need every classifier that's been created for/from the training data when you are running it on your evaluation data because you may (perhaps inadvertently) have some key types that are very sparse and never appear in your evaluation data.

    I agree that KEY_TYPE may be unnecessary and String may suffice. Although, I can see that a boolean would be quite reasonable for some scenarios.

    I would rather not use the word "key" or "type" for the naming the key of the map as they are both overloaded terms. Maybe "name"?

    Finally, it would be nice if we were able to abstract out the output directory as a parameter to this analysis engine as we do with CleartkAnnotator. I don't want to wire in our jar-based solution into this new class.

    [Lee] I'm leaning toward just using String like Steve suggested. If someone really needed the boolean condition they could just create appropriate strings to describe their classifiers.

    As per the getters, I go back and forth on whether or not they are needed. They are handy for the on-the-fly creation like Philip described, but they also muddle the semantics when using these classifier/datawriter maps. It's not clear when writing an extension of this proposed class whether one should use the maps directly or should only access them through the getters.

    Lastly, does anyone have a good name for this class?

    ⭐️ Enhancement Component-ml Priority-Medium 
    opened by bethard 22
  • set up continuous integration server

    set up continuous integration server

    Original issue 197 created by ClearTK on 2011-01-07T21:36:41.000Z:

So, it seems like it's time to get a continuous integration server set up for this project, so that we know the code is compiling and the tests are passing. This project is already too big to be doing this manually and assuming that we are staying up-to-date. I have never done this before but am willing to help out with this anyway.

I don't know much about the tooling space for this, but my understanding is that Hudson is popular and easy to use. It's what Richard uses for uimaFIT; he might be someone to ask. He gets Cobertura test coverage reports generated too.

    Priority-Medium Type-Task 
    opened by bethard 18
  • Sequential classifier wrapper for non-sequential classifiers

    Sequential classifier wrapper for non-sequential classifiers

    Original issue 57 created by ClearTK on 2009-02-21T00:36:40.000Z:

I think we have a problem in how we are thinking about using non-sequential classifiers for sequential tagging tasks. For starters, the notion that you are going to do separate feature extraction for the sequential case vs. the non-sequential case is just silly. The only difference between the two is that in the latter case you want to add features based on the classifications of the previous instances. It would be easy enough to provide a generic wrapper for all non-sequential classifiers so that we aren't constantly breaking our code up into "sequential" and "non-sequential" modes.

    Equally silly, though, is the notion of doing a sequential tagging task with a non-sequential classifier without a viterbi beam search of some kind. This is pretty basic.

    I suggest that we create a sequential classifier that wraps any non-sequential classifier so that it:

    • does something sensible with converting previous classifications into features
    • provides a mechanism to specify a Viterbi-style search of some kind. We should be able to come up with a handful of these (one per sequential classifier?) to cover all of our sequential classifiers.
    🐛 Bug Priority-High 
    opened by bethard 18
  • Introduce parent project for maven building

    Introduce parent project for maven building

    Original issue 181 created by ClearTK on 2011-01-05T18:13:00.000Z:

    One thing that is going to get annoying very quickly about the current arrangement is that building all of the projects requires some effort. It sure would be nice if we had a parent pom.xml file that we could run to compile, test, and possibly install all of the projects. I think uimaFIT does a nice job with this and we can model our approach on that. Any thoughts?
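A minimal aggregator pom.xml along the lines of what uimaFIT does might look like the following sketch; the module names and version here are purely illustrative, not the actual project layout:

```xml
<!-- Hypothetical parent/aggregator POM; module names and version are illustrative. -->
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.cleartk</groupId>
  <artifactId>cleartk-parent</artifactId>
  <version>0.1.0-SNAPSHOT</version>
  <packaging>pom</packaging>
  <modules>
    <module>cleartk-util</module>
    <module>cleartk-ml</module>
    <!-- one entry per sub-project -->
  </modules>
</project>
```

With something like this in place, a single `mvn install` from the parent directory would compile, test, and install every module.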

    🐛 Bug Priority-Medium 
    opened by bethard 14
  • allow ClassifierFactory to create Classifiers with resources relative to the CleartkAnnotator class

    allow ClassifierFactory to create Classifiers with resources relative to the CleartkAnnotator class

    Original issue 218 created by ClearTK on 2011-01-31T09:22:46.000Z:

    So, assuming we do the refactoring of cleartk-ml proposed in Issue 217, I think there is no longer any use for ClassifierFactory. It was originally introduced for Issue 135, where the problem was that Classifiers had to take a JarFile in the constructor, and Philip needed to create a test Classifier without a JarFile. That problem goes away with the refactoring in Issue 217 because Classifiers can now have any constructor parameters they want.

    But why not just leave well enough alone? ;-) Because ClassifierFactory prevents us from easily using models as Resources on the classpath. JarClassifierFactory (the only real ClassifierFactory we have) currently does this:

    InputStream stream;
    try {
      stream = new URL(this.classifierJarPath).openStream();
    } catch (MalformedURLException e) {
      stream = new FileInputStream(this.classifierJarPath);
    }
    stream = new BufferedInputStream(stream);
    JarInputStream modelStream = new JarInputStream(stream);
    ClassifierBuilder<?> builder = JarClassifierBuilder.fromManifest(modelStream.getManifest());
    try {
      return superClass.cast(builder.loadClassifier(modelStream));
    } finally {
      stream.close();
    }
    

    It would really be nice if before we try the FileInputStream we could try something like:

    stream = annotatorClass.getResourceAsStream(this.classifierJarPath);
    

    But this won't work because ClassifierFactory objects don't have access to the annotator class. If we get rid of ClassifierFactory and just move the bit of code above to CleartkAnnotator, we can easily make this work using this.getClass().
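Putting the pieces together, the proposed fallback order (classpath resource relative to the annotator class, then URL, then file) might look like this sketch; the helper class is hypothetical, not actual ClearTK code.

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical sketch of the proposed lookup order: classpath resource
// relative to the annotator class, then URL, then plain file path.
public class ModelStreamSketch {

  static InputStream openModel(Class<?> annotatorClass, String path) throws IOException {
    InputStream stream = annotatorClass.getResourceAsStream(path);
    if (stream == null) {
      try {
        stream = new URL(path).openStream();
      } catch (MalformedURLException e) {
        stream = new FileInputStream(path);
      }
    }
    return new BufferedInputStream(stream);
  }

  public static void main(String[] args) throws IOException {
    // Demonstrate the file fallback with a temporary file.
    File model = File.createTempFile("model", ".bin");
    java.nio.file.Files.write(model.toPath(), new byte[] { 42 });
    try (InputStream in = openModel(ModelStreamSketch.class, model.getPath())) {
      System.out.println(in.read()); // prints 42
    }
  }
}
```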

    What do you think?

    ⭐️ Enhancement Priority-Medium 
    opened by bethard 12
  • unit tests that contain material that should not be redistributed

    unit tests that contain material that should not be redistributed

    Original issue 7 created by ClearTK on 2008-12-05T18:21:14.000Z:

    The following unit tests have data in the text of the .java file that should not be redistributed (e.g. song lyrics, treebank data, etc.) These unit tests are only available on the old repository and either need to be refactored or replaced.

    Parent directory /ClearTK/test/src/

    Affected Files org/cleartk/classifier/encoder/features/string/StringFeatureEncoderTests.java

    🐛 Bug Priority-Medium Component-Tests 
    opened by bethard 12
  • Document feature extraction library

    Document feature extraction library

    Original issue 3 created by ClearTK on 2008-12-05T17:48:34.000Z:


    There is currently no documentation for the different feature extractors. We need to both complete the JavaDocs, and, more importantly, give a more tutorial-like overview of what is available and how it can be used.

    Priority-Critical 📖 Documentation Component-ml 
    opened by bethard 11
  • added featurename from named sub-extractor

    added featurename from named sub-extractor

Previously, feature names from sub-extractors in CleartkExtractor were silently discarded, for reasons I can't discern (presumably accidental?). Compare with the out-of-bounds features a few lines below, which preserve the name from the sub-extractor. Please advise if this was in fact a principled decision (and I'll abandon the PR). If not, I can provide any other supporting changes you need for this very simple patch.

    🐛 Bug 
    opened by admackin 0
  • SimpleFeatureExtractor does not exist in the current version

    SimpleFeatureExtractor does not exist in the current version

In the tutorial, the class SimpleFeatureExtractor is part of the sample code, but such a class does not seem to exist in the codebase (at least in the newest version).

    🐛 Bug 📖 Documentation 
    opened by mmahdavim 0
  • The Berkeley parser wrapper adds duplicate annotations to a document.

    The Berkeley parser wrapper adds duplicate annotations to a document.

The parser adds two TopTreebankNode annotations for each sentence, and for each TerminalTreebankNode annotation it creates one extra TreebankNode annotation.

    🐛 Bug 
    opened by mjlaali 0