Stanford CoreNLP: A Java suite of core NLP tools.

Overview

Stanford CoreNLP


Stanford CoreNLP provides a set of natural language analysis tools written in Java. It can take raw human language text input and give the base forms of words, their parts of speech, whether they are names of companies, people, etc., normalize and interpret dates, times, and numeric quantities, mark up the structure of sentences in terms of phrases or word dependencies, and indicate which noun phrases refer to the same entities. It was originally developed for English, but now also provides varying levels of support for (Modern Standard) Arabic, (mainland) Chinese, French, German, and Spanish.

Stanford CoreNLP is an integrated framework, which makes it very easy to apply a bunch of language analysis tools to a piece of text. Starting from plain text, you can run all the tools with just two lines of code. Its analyses provide the foundational building blocks for higher-level and domain-specific text understanding applications. Stanford CoreNLP is a set of stable and well-tested natural language processing tools, widely used by various groups in academia, industry, and government. The tools variously use rule-based, probabilistic machine learning, and deep learning components.
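
For example, here is a minimal sketch of such a pipeline (the annotator list and sample text are illustrative choices, not the only possible configuration):

    import java.util.Properties;
    import edu.stanford.nlp.pipeline.CoreDocument;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;

    public class PipelineExample {
        public static void main(String[] args) {
            // configure annotators, then build the pipeline
            Properties props = new Properties();
            props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

            // annotate a document and print token-level analyses
            CoreDocument document = new CoreDocument("Stanford University is located in California.");
            pipeline.annotate(document);
            document.tokens().forEach(token ->
                System.out.println(token.word() + "\t" + token.tag() + "\t" + token.ner()));
        }
    }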

The Stanford CoreNLP code is written in Java and licensed under the GNU General Public License (v3 or later). Note that this is the full GPL, which allows many free uses, but not its use in proprietary software that you distribute to others.

Build Instructions

Several times a year we distribute a new version of the software, which corresponds to a stable commit.

During the time between releases, one can always use the latest, under-development version of our code.

Here are some helpful instructions to use the latest code:

Provided build

Sometimes we will provide updated jars here which have the latest version of the code.

At present, the jar provided here is simply our most recent released jar, though you can always build the very latest from GitHub HEAD yourself.

Build with Ant

  1. Make sure you have Ant installed, details here: http://ant.apache.org/
  2. Compile the code with this command: cd CoreNLP ; ant
  3. Then run this command to build a jar with the latest version of the code: cd CoreNLP/classes ; jar -cf ../stanford-corenlp.jar edu
  4. This will create a new jar called stanford-corenlp.jar in the CoreNLP folder which contains the latest code
  5. The dependencies that work with the latest code are in CoreNLP/lib and CoreNLP/liblocal, so make sure to include those in your CLASSPATH.
  6. When using the latest version of the code make sure to download the latest versions of the corenlp-models, english-models, and english-models-kbp and include them in your CLASSPATH. If you are processing languages other than English, make sure to download the latest version of the models jar for the language you are interested in.

Build with Maven

  1. Make sure you have Maven installed, details here: https://maven.apache.org/
  2. If you run this command in the CoreNLP directory: mvn package, it should run the tests and build this jar file: CoreNLP/target/stanford-corenlp-4.0.0.jar
  3. When using the latest version of the code make sure to download the latest versions of the corenlp-models, english-models, and english-models-kbp and include them in your CLASSPATH. If you are processing languages other than English, make sure to download the latest version of the models jar for the language you are interested in.
  4. If you want to use Stanford CoreNLP as part of a Maven project you need to install the models jars into your Maven repository. Below is a sample command for installing the Spanish models jar. For other languages just change the language name in the command. To install stanford-corenlp-models-current.jar you will need to set -Dclassifier=models. Here is the sample command for Spanish: mvn install:install-file -Dfile=/location/of/stanford-spanish-corenlp-models-current.jar -DgroupId=edu.stanford.nlp -DartifactId=stanford-corenlp -Dversion=4.0.0 -Dclassifier=models-spanish -Dpackaging=jar

Models

The models jars that correspond to the latest code can be found in the table below.

Some of the larger (English) models -- like the shift-reduce parser and WikiDict -- are not distributed with our default models jar. These require downloading the English (extra) and English (kbp) jars. Resources for other languages require usage of the corresponding models jar.

Language Model Jar Last Updated
Arabic download 4.2.0
Chinese download 4.2.0
English (default) download 4.2.0
English (extra) download 4.2.0
English (kbp) download 4.2.0
French download 4.2.0
German download 4.2.0
Spanish download 4.2.0

Useful resources

You can find releases of Stanford CoreNLP on Maven Central.

You can find more explanation and documentation on the Stanford CoreNLP homepage.

For information about making contributions to Stanford CoreNLP, see the file CONTRIBUTING.md.

Questions about CoreNLP can either be posted on StackOverflow with the tag stanford-nlp, or on the mailing lists.

Comments
  • An Issue in importing StanfordCoreNLP library in an Android Studio project

    I am developing an Android application (I am a beginner). I want to use the Stanford CoreNLP 3.8.0 library in my app to extract parts of speech, lemmas, parses, and so on from the user's sentences. I have tried a simple Java program in NetBeans by following this YouTube tutorial https://www.youtube.com/watch?v=9IZsBmHpK3Y, and it is working perfectly. The jar files that I imported into the NetBeans project are: stanford-corenlp-3.8.0.jar and stanford-corenlp-3.8.0-models.jar.

    And this is the java source code:

    import edu.stanford.nlp.ling.CoreAnnotations;
    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.util.CoreMap;
    
    import java.util.List;
    import java.util.Properties;
    
    public class CoreNlpExample {
    
        public static void main(String[] args) {
    
            // creates a StanfordCoreNLP object, with POS tagging, lemmatization, NER, parsing, and coreference resolution
            Properties props = new Properties();
            props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    
            // read some text in the text variable
            String text = "What is the Weather in Bangalore right now?";
    
            // create an empty Annotation just with the given text
            Annotation document = new Annotation(text);
    
            // run all Annotators on this text
            pipeline.annotate(document);
    
            List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
    
            for (CoreMap sentence : sentences) {
                // traversing the words in the current sentence
                // a CoreLabel is a CoreMap with additional token-specific methods
                for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                    // this is the text of the token
                    String word = token.get(CoreAnnotations.TextAnnotation.class);
                    // this is the POS tag of the token
                    String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
                    // this is the NER label of the token
                    String ne = token.get(CoreAnnotations.NamedEntityTagAnnotation.class);
    
                    System.out.println(String.format("Print: word: [%s] pos: [%s] ne: [%s]", word, pos, ne));
                }
            }
        }
    }
    

    I wanted to try the same code in Android Studio and display the result in a TextView, but I am facing a problem with adding these external libraries to my Android Studio 3.0.1 project.

    I have read on some websites that I need to reduce the size of the jar files, and I did that and made sure that the reduced jars still work fine in the NetBeans project. But I am still facing problems in Android Studio, and this is the error that I am getting:

    java.lang.VerifyError: Rejecting class edu.stanford.nlp.pipeline.StanfordCoreNLP that attempts to sub-type erroneous class edu.stanford.nlp.pipeline.AnnotationPipeline (declaration of 'edu.stanford.nlp.pipeline.StanfordCoreNLP' appears in /data/app/com.example.fatimah.nlpapplication-bhlUJOCUwLhSbkWE7NBERA==/split_lib_dependencies_apk.apk)

    Any suggestions on how I can fix this and import Stanford library successfully?

    opened by ftoom235 52
  • Use JaFaMa for faster math, and optimize critical code paths

    These changes substantially cut down processing time: by several hours when I process all of Wikipedia. Feel free to benchmark on your own data.

    The first commit uses JaFaMa instead of java.lang.Math, which is 2-3x faster for exp, log: http://blog.element84.com/improving-java-math-perf-with-jafama.html In some places I switched back to log1p, because the runtime of log and log1p in JaFaMa are similar, and log1p offers better precision for small values of x than log(1+x).

    The other patches optimize the crucial code around the Viterbi algorithm:

    • HotSpot optimizes better if large functions with multiple loops are split into multiple methods (as they can be recompiled independently).
    • It pays off to save repeated nested array lookups (e.g. array[i][j] in a loop over j; move array_i = array[i] outside of the loop and use array_i[j] inside).
    • I also add a cache to avoid recomputing the open tags set in TTags.

    All of these may appear to be trivial changes, but once you benchmark you will see how much this improves the run time.
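
    As a minimal illustration of the nested-array point above (hypothetical names, not the actual CoreNLP code):

        // Hoist the row references out of the inner loop instead of
        // re-indexing array[i][j] and weights[i][j] on every iteration.
        static double dotRow(double[][] array, double[][] weights, int i) {
            double[] arrayI = array[i];
            double[] weightsI = weights[i];
            double sum = 0.0;
            for (int j = 0; j < arrayI.length; j++) {
                sum += arrayI[j] * weightsI[j];
            }
            return sum;
        }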

    Processing the first 20000 articles with tokenize,ssplit,pos, doing some further processing such as my own lemmatization based on hunspell, and then loading them into a Lucene index took 08:51 minutes with the CoreNLP master branch, and only 04:38 minutes with my patches (sloppy benchmark only). I consider this a substantial speedup, because Wikipedia is 5.3 million articles: it still needed 19 hours to build the full-text index, but it used to take almost two days...

    opened by kno10 39
  • Could the project switch to using log4j for logs?

    I see a lot of logs printed to System.out or System.err. Would it be possible to use a library like log4j http://logging.apache.org/log4j/2.x/ and use log.error, log.warn, log.info, log.debug instead? That would make it easier for users of StanfordCoreNLP to manage which logs should be printed by choosing the log level of the project.
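
    For instance, a sketch of the kind of leveled logging this would enable, using the standard log4j 2.x API (the class name here is hypothetical):

        import org.apache.logging.log4j.LogManager;
        import org.apache.logging.log4j.Logger;

        public class TaggerLoader {
            private static final Logger log = LogManager.getLogger(TaggerLoader.class);

            public void loadModel(String path) {
                log.info("Loading model from {}", path);    // shown at INFO level and below
                log.debug("Resolved model path: {}", path); // only shown when DEBUG is enabled
            }
        }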

    enhancement 
    opened by Asimov4 33
  • Quote Annotation - AnnotationException StringIndexOutOfBoundsException

    Hello,

    I had a situation with text that had this: ""=

    It seems to throw an error when I try running the pipeline with quote annotation on this small fragment. Just wanted to verify that it was an issue.

    Thank you.

    opened by allenkim 29
  • Parsing fails on AssertionError when using OpenIE (v3.9.2)

    Happens with the following sentence, under version 3.9.2, only when adding openIE annotator:

    It was a long and stern face, but with eyes that twinkled in a kindly way.

    stack trace:

    java.lang.AssertionError
        at edu.stanford.nlp.naturalli.Util.cleanTree(Util.java:324)
        at edu.stanford.nlp.naturalli.OpenIE.annotateSentence(OpenIE.java:463)
        at edu.stanford.nlp.naturalli.OpenIE.lambda$annotate$2(OpenIE.java:547)
        at java.base/java.util.ArrayList.forEach(ArrayList.java:1540)
        at edu.stanford.nlp.naturalli.OpenIE.annotate(OpenIE.java:547)
        at edu.stanford.nlp.pipeline.AnnotationPipeline.annotate(AnnotationPipeline.java:76)
        at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:637)
        at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:629)

    to replicate:

            Properties props = new Properties();
            props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,depparse,natlog,openie");
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    
            String text = "It was a long and stern face, but with eyes that twinkled in a kindly way.";
    
            CoreDocument document = new CoreDocument(text);
            pipeline.annotate(document);
    

    It works fine if openie is disabled, with other sentences, or when using https://corenlp.run/, so it looks like this is fixed in later versions, but I have not verified that locally as I can't upgrade at the moment anyway.

    advice much appreciated

    opened by manzurola 29
  • Stanford CoreNLP server not responding

    I have been trying to use the CoreNLP server via various Python packages, including Stanza. I always run into the same problem: I never hear back from the server.

    I downloaded a copy of CoreNLP from the website. I then try to start a server from the terminal and go to my localhost as described here. Based on the documentation I should see something when I go to http://localhost:9000/, but nothing loads up.

    Here are the commands I use:

    cd stanford-corenlp-full-2018-10-05/
    java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000
    

    Here is the output of running the commands above:

    Samarths-MacBook-Pro-2:stanford-corenlp-full-2018-10-05 samarthbhandari$ java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000
    [main] INFO CoreNLP - --- StanfordCoreNLPServer#main() called ---
    [main] INFO CoreNLP - setting default constituency parser
    [main] INFO CoreNLP - warning: cannot find edu/stanford/nlp/models/srparser/englishSR.ser.gz
    [main] INFO CoreNLP - using: edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz instead
    [main] INFO CoreNLP - to use shift reduce parser download English models jar from:
    [main] INFO CoreNLP - http://stanfordnlp.github.io/CoreNLP/download.html
    [main] INFO CoreNLP -     Threads: 8
    [main] INFO CoreNLP - Starting server...
    [main] INFO CoreNLP - StanfordCoreNLPServer listening at /0:0:0:0:0:0:0:0:9000
    

    I then go to http://localhost:9000/, and nothing loads. As I mentioned above, I originally tried to do the same thing using some of the Python packages and observed similar behavior.

    Here is a stack overflow post related to server not responding using Stanza.

    OS: macOS 10.15.4
    Python: 3.7.7
    Java: 1.8

    cantreproduce 
    opened by samarth12 25
  • [MEMORY] Possibly use float instead of double in models/weights

    double arrays are a large portion of the heap.

    There are some places with 2d double arrays with dimensions like

    • 345k × 16, 150k × 24, 80k × 46: CRFClassifier.weights
    • 100k × 1000: Classifier.saved in DependencyParser
    • 60k × 50: Classifier.E, .eg2E
    • 1000 × 2400: Classifier.W1, .wg2W1

    Most are weights of some sort, making me wonder if they could be stored in fewer than 64 bits each.

    The obvious step would be to use float[], halving the memory use of this portion.

    Another would be to encode weights as something else, for example a small integer that is scaled back into a float when the weight is used.

    Machine learning models often use fp16 or even fp8 to store weights, and there are Java implementations of float -> short -> float (with fp16 semantics stored in a 16-bit short)

    like https://android.googlesource.com/platform/frameworks/base/+/master/core/java/android/util/Half.java with https://android.googlesource.com/platform/libcore/+/master/luni/src/main/java/libcore/util/FP16.java

    or https://stackoverflow.com/questions/6162651/half-precision-floating-point-in-java

    The latter approach would need some performance testing, as each time a weight is used it would have to be converted first.


    I saw that some models serialize themselves using ObjectStreams; those would need an adapter to deserialize to double[] first and then convert the arrays to float[].

    Like in CRFClassifier.loadClassifier
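
    A minimal sketch of such an adapter (hypothetical helper, not existing CoreNLP code):

        // Deserialize as double[][] (the current on-disk format), then
        // down-convert to float[][] to halve the in-memory footprint.
        static float[][] toFloat(double[][] weights) {
            float[][] out = new float[weights.length][];
            for (int i = 0; i < weights.length; i++) {
                double[] row = weights[i];
                float[] fRow = new float[row.length];
                for (int j = 0; j < row.length; j++) {
                    fRow[j] = (float) row[j]; // 64-bit -> 32-bit, trading precision for memory
                }
                out[i] = fRow;
            }
            return out;
        }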

    opened by lambdaupb 25
  • TokenSequenceParser ignoring tail of patterns mentioned in rules

    The following function in the TokenSequenceParser class ignores the tail of patterns defined in rules for tokensregex:

    private String getStringFromTokens(Token head, Token tail, boolean includeSpecial) {
        StringBuilder sb = new StringBuilder();
        // stops when p == tail, so the tail token's image is never appended
        for (Token p = head; p != tail; p = p.next) {
            if (includeSpecial) {
                appendSpecialTokens(sb, p.specialToken);
            }
            sb.append(p.image);
        }
        return sb.toString();
    }

    E.g. ([{lemma:/([a-zA-Z]{2,}_)?[a-zA-Z]{2,}[0-9]{2,}/}]) gets converted to ([{lemma:/([a-zA-Z]{2,}_)?[a-zA-Z]{2,}[0-9]{2,}/}] while reading, and doesn't produce the intended matches.

    opened by ankitsingh2 23
  • Exception thrown for operation attempted on unknown vertex

    CoreNLP version 4.5.0, using pos, lemma, and depparse. I run the pipeline within Spark (Scala). I lazily initialise the CoreNLP pipeline and broadcast it to each executor, wrapped in a case object. I also force the text fragment not to be sentence-split, as it is already intended to be a single sentence. The objective is to do dependency analysis on the sentence and run some semgraph rules against it. We got a case where it throws an exception like this:

    Caused by: edu.stanford.nlp.semgraph.UnknownVertexException: Operation attempted on unknown vertex happens/VBZ'''' in graph -> observed/VBD (root)
      -> 24/CD (nsubj)
        -> response/NN (nmod:in)
          -> In/IN (case)
          -> CoV/NNP (nmod:to)
            -> to/IN (case)
            -> SARS/NNP (compound)
            -> ‐/SYM (dep)
            -> ‐/SYM (dep)
            -> peptides/NNS (dep)
              -> 2/CD (nummod)
      -> ,/, (punct)
      -> we/PRP (nsubj)
      -> unexpectedly/RB (advmod)
      -> associated/VBN (ccomp)
        -> that/IN (mark)
        -> sirolimus/NN (nsubj:pass)
        -> was/VBD (aux:pass)
        -> significantly/RB (advmod)
        -> release/NN (obl:with)
          -> with/IN (case)
          -> a/DT (det)
          -> proinflammatory/JJ (amod)
          -> cytokine/NN (compound)
          -> levels/NNS (nmod:including)
            -> including/VBG (case)
            -> higher/JJR (amod)
            -> α/NN (nmod:of)
              -> of/IN (case)
              -> TNF/NN (compound)
              -> ‐/SYM (dep)
              -> IL/NN (conj:and)
                -> and/CC (cc)
            -> IL/NN (nmod:of)
            -> 1β/NN (nmod)
              -> ‐/SYM (dep)
      -> ./. (punct)
    
    	at edu.stanford.nlp.semgraph.SemanticGraph.parentPairs(SemanticGraph.java:730)
    	at edu.stanford.nlp.semgraph.semgrex.GraphRelation$DEPENDENT$1.advance(GraphRelation.java:325)
    	at edu.stanford.nlp.semgraph.semgrex.GraphRelation$SearchNodeIterator.initialize(GraphRelation.java:1103)
    	at edu.stanford.nlp.semgraph.semgrex.GraphRelation$SearchNodeIterator.<init>(GraphRelation.java:1084)
    	at edu.stanford.nlp.semgraph.semgrex.GraphRelation$DEPENDENT$1.<init>(GraphRelation.java:310)
    	at edu.stanford.nlp.semgraph.semgrex.GraphRelation$DEPENDENT.searchNodeIterator(GraphRelation.java:310)
    	at edu.stanford.nlp.semgraph.semgrex.NodePattern$NodeMatcher.resetChildIter(NodePattern.java:339)
    	at edu.stanford.nlp.semgraph.semgrex.SemgrexMatcher.resetChildIter(SemgrexMatcher.java:80)
    	at edu.stanford.nlp.semgraph.semgrex.CoordinationPattern$CoordinationMatcher.resetChildIter(CoordinationPattern.java:168)
    	at edu.stanford.nlp.semgraph.semgrex.CoordinationPattern$CoordinationMatcher.resetChildIter(CoordinationPattern.java:168)
    	at edu.stanford.nlp.semgraph.semgrex.CoordinationPattern$CoordinationMatcher.resetChildIter(CoordinationPattern.java:168)
    	at edu.stanford.nlp.semgraph.semgrex.NodePattern$NodeMatcher.resetChild(NodePattern.java:363)
    	at edu.stanford.nlp.semgraph.semgrex.NodePattern$NodeMatcher.goToNextNodeMatch(NodePattern.java:457)
    	at edu.stanford.nlp.semgraph.semgrex.NodePattern$NodeMatcher.matches(NodePattern.java:574)
    	at edu.stanford.nlp.semgraph.semgrex.SemgrexMatcher.find(SemgrexMatcher.java:193)
    	at az.bikg.nlp.etl.common.nlp.Pattern.go$3(Pattern.scala:200)
    	at az.bikg.nlp.etl.common.nlp.Pattern.$anonfun$findCauseEffectMatches$6(Pattern.scala:268)
    	at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293)
    	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    	at scala.collection.TraversableLike.flatMap(TraversableLike.scala:293)
    	at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:290)
    	at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
    	at az.bikg.nlp.etl.common.nlp.Pattern.findCauseEffectMatches(Pattern.scala:266)
    	at az.bikg.nlp.etl.steps.ERs$.findRelations(ERs.scala:107)
    	at az.bikg.nlp.etl.steps.ERs$.findRelationsSpark(ERs.scala:229)
    	at az.bikg.nlp.etl.steps.ERs$.$anonfun$extractERs$1(ERs.scala:242)
    	... 28 more
    

    Am I doing anything wrong that could cause this exception? It didn't happen with version 4.4.0.

    opened by mkarmona 22
  • Are the latest Chinese models significantly worse than the Stanford online parser?

    I tested the latest Chinese CoreNLP 3.9.2 models and found the results quite horrible. Here are a few examples:

    • 我的朋友: always tags "我的" as one NN token.
    • 我的狗吃苹果: "我的狗" tagged as one NN token.
    • 他的狗吃苹果: "狗吃" tagged as one NN token.
    • 高质量就业成时代: "就业" tagged as VV.

    When I compared them with the results from http://nlp.stanford.edu:8080/parser/index.jsp, surprisingly, the results for those examples are all correct. Why is that? Are the models different? Is there a bug in the new 3.9.2 models?

    opened by lingvisa 21
  • pos-tagger cannot load models from stanford-corenlp-3.5.2-models.jar

    I use Stanford CoreNLP in Java as a Maven dependency. I want to use the MaxentTagger with a model supplied in the stanford-corenlp-3.5.2-models package. The problem is that I cannot access this model through the classpath.

    My code is

    tagger = new MaxentTagger("/edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger");
    

    The file "/edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger" exists in the jar and should be loaded through classpath, but the following exception is thrown

    Exception in thread "main" edu.stanford.nlp.io.RuntimeIOException: Unrecoverable error while loading a tagger model
        at edu.stanford.nlp.tagger.maxent.MaxentTagger.readModelAndInit(MaxentTagger.java:770)
        at edu.stanford.nlp.tagger.maxent.MaxentTagger.<init>(MaxentTagger.java:298)
        at edu.stanford.nlp.tagger.maxent.MaxentTagger.<init>(MaxentTagger.java:263)
        at cz.zcu.kiv.nlp.semeval.cwi.features.POSFeature.<init>(POSFeature.java:24)
        at cz.zcu.kiv.nlp.semeval.cwi.CWIModel.train(CWIModel.java:60)
        at cz.zcu.kiv.nlp.semeval.cwi.TrainingCrossValidation.main(TrainingCrossValidation.java:51)
    Caused by: java.io.IOException: Unable to resolve "/edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger" as either class path, filename or URL
        at  edu.stanford.nlp.io.IOUtils.getInputStreamFromURLOrClasspathOrFileSystem(IOUtils.java:481)
        at edu.stanford.nlp.tagger.maxent.MaxentTagger.readModelAndInit(MaxentTagger.java:765)
        ... 5 more
    

    If I copy the model out of the jar and use e.g.

    tagger = new MaxentTagger("./english-left3words-distsim.tagger");
    

    then everything works perfectly.

    The problem is probably in the class IOUtils, method findStreamInClasspathOrFileSystem(String name).

    In the line

    InputStream is = IOUtils.class.getClassLoader().getResourceAsStream(name);
    

    the returned classloader is probably the JarClassLoader which loaded the library (stanford-corenlp-3.5.2.jar) and it does not have access to other libraries.

    This theory is supported by the following code

    InputStream stream = POSFeature.class.getResourceAsStream("/edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger");
    System.out.println("Stream == null: " + (stream == null));
    
    tagger = new MaxentTagger("/edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger");
    

    which outputs

    Stream == null: false
    Exception in thread "main" edu.stanford.nlp.io.RuntimeIOException: Unrecoverable error while loading a tagger model
    ...
    Caused by: java.io.IOException: Unable to resolve "/edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger" as either class path, filename or URL
    
    opened by konkol 21
  • Why is there no description of how to set up the models jar with a build tool?

    The README describes how to install the models jars manually, but not how to set them up with a build tool (e.g. Gradle). However, I tried this approach ( https://stackoverflow.com/a/68859054/3809427 ) and it worked. Why not document this useful method?

    opened by lamrongol 3
  • EntityMentions returns null instead of empty list

    I ran into an issue where, if an empty string is passed in, getting the entityMentions returns null instead of an empty list, which I figured would be standard practice.

    Example code:

    StanfordCoreNLP nlpProcessor = new StanfordCoreNLP(props);
    CoreDocument nlpDocument = new CoreDocument("");

    nlpProcessor.annotate(nlpDocument);
    List<CoreEntityMention> entities = nlpDocument.entityMentions(); // <== returns null
    

    Just wanted to know if any light can be shed on this. If this is expected behavior, then I will do my best to document it.
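
    In the meantime, a defensive workaround might look like this (a sketch, assuming the null behavior described above):

        List<CoreEntityMention> entities = nlpDocument.entityMentions();
        if (entities == null) {
            entities = java.util.Collections.emptyList(); // guard against empty-input documents
        }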

    opened by cholojuanito 2
  • 'email' tokenizing as 'em, ail, and '

    In the following sentence (from Twitter), 'email' is being tokenized as 'em, ail, and '. This is obviously incorrect. What can be done to stop this split?

    • It's official (according to the AP) it's 'email' not 'e-mail' and 'website' not 'web-site'!

    I have the following parameters set:

    tokenize.language: English
    tokenize.whitespace: false (because we want tokens like it's to separate into it and 's)
    tokenize.keepeol: false
    tokenize.verbose: false
    tokenize.options: invertible=true,splitAssimilations=false,splitHyphenated=false,splitForwardSlash=true,untokenizable=allKeep,strictTreebank3=true,normalizeSpace=false,ellipses=original
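
    For reference, a sketch of passing the same options through pipeline properties (the option string is copied from above; the annotator list is illustrative):

        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit");
        props.setProperty("tokenize.language", "English");
        props.setProperty("tokenize.options",
            "invertible=true,splitAssimilations=false,splitHyphenated=false,"
            + "splitForwardSlash=true,untokenizable=allKeep,strictTreebank3=true,"
            + "normalizeSpace=false,ellipses=original");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);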

    opened by saxtell-cb 6
  • parsing '`'

    curl 'http://localhost:9000/?properties={%22annotators%22%3A%22lemma%22%2C%22outputFormat%22%3A%22json%22}' -d '`'
    

    Gives me:

    {
      "sentences": [
        {
          "index": 0,
          "tokens": [
            {
              "index": 1,
              "word": "`",
              "originalText": "`",
              "lemma": "`",
              "characterOffsetBegin": 0,
              "characterOffsetEnd": 1,
              "pos": "``",
              "before": "",
              "after": ""
            }
          ]
        }
      ]
    }
    

    With the standard English model. Is this expected? I'm particularly surprised at the POS.

    opened by AntonOfTheWoods 5
  • Stanford CoreNLP slower after upgrade from 3.7.0 to 4.5.1

    A unit test that runs in a loop calling pipeline.annotate(document) appears to be taking about 50% longer. Our configuration properties didn't change during the upgrade, but maybe some new properties have been added in 4.5.1? Below is what we have. Is there a way to determine which annotator is using more time now?

    customAnnotatorClass.tokensregex=edu.stanford.nlp.pipeline.TokensRegexAnnotator
    sutime.binders=0
    tokensregex.rules= .... (omitted)
    ssplit.eolonly=false
    customAnnotatorClass.tokenOverride_en= .... (omitted)
    annotators=tokenize, ssplit, tokenOverride_en, pos, lemmaOverride_en, ner, tokensregex, entitymentions, parse
    language=en
    tokenize.whitespace=false
    customAnnotatorClass.lemmaOverride_en=.... (omitted)
    tokenize.options=untokenizable=allKeep,americanize=false
    ssplit.isOneSentence=true
    nermention.acronyms=true
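
    On the last question, one place to start may be the pipeline's accumulated timing report; a sketch (this assumes StanfordCoreNLP.timingInformation(), which has been available in recent versions, but verify it against 4.5.1):

        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        for (CoreDocument document : documents) { // documents: your test inputs
            pipeline.annotate(document);
        }
        // Prints per-annotator timing accumulated across annotate() calls.
        System.out.println(pipeline.timingInformation());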

    opened by dsbanks99 1
Releases (v4.5.1)
  • v4.5.1(Aug 30, 2022)

    CoreNLP 4.5.1

    Bugfixes!

    • Fix tokenizer regression: 4.5.0 will tokenize ",5" as one word https://github.com/stanfordnlp/CoreNLP/commit/974383ab7336a254d260264885186dd77df0cf81
    • Use a LinkedHashMap in the PTBTokenizer instead of Properties. Keeps the option processing order predictable. https://github.com/stanfordnlp/CoreNLP/issues/1289 https://github.com/stanfordnlp/CoreNLP/commit/655018895e2f2870ce721de42d31b845fa991335
    • Fix \r\n not being properly processed on Windows: #1291 https://github.com/stanfordnlp/CoreNLP/commit/9889f4ef4ee9feb8b70f577db8353c8d6c896ae3
    • Handle one half of surrogate character pairs in the tokenizer w/o crashing https://github.com/stanfordnlp/CoreNLP/issues/1298 https://github.com/stanfordnlp/CoreNLP/commit/1b12faa64b9ea85f808b27ab74ccf9f79ccb01f4
    • Attempt to fix semgrex "Unknown vertex" errors which have plagued CoreNLP for years in hard to track down circumstances: https://github.com/stanfordnlp/CoreNLP/issues/1296 https://github.com/stanfordnlp/CoreNLP/issues/1229 https://github.com/stanfordnlp/CoreNLP/issues/1169 https://github.com/stanfordnlp/CoreNLP/commit/f99b5ab87f073118a971c4d1e39df85ab9abbab1
    Source code(tar.gz)
    Source code(zip)
  • v4.5.0(Jul 22, 2022)

    CoreNLP 4.5.0

    Main features are improved lemmatization of English, improved tokenization of both English and non-English flex-based languages, and some updates to tregex, tsurgeon, and semgrex.

    • All PTB and German tokens normalized now in PTBLexer (previously only German umlauts). This makes the tokenizer 2% slower, but should avoid issues with resume' for example https://github.com/stanfordnlp/CoreNLP/commit/d46fecd93c6964f635efe85d9b7c327ee8880fb9

    • log4j removed entirely from public CoreNLP (internal "research" branch still has a use) https://github.com/stanfordnlp/CoreNLP/commit/f05cb54ec0a4f3c90395771817f44a81eb549baf

    • Fix NumberFormatException showing up in NER models: https://github.com/stanfordnlp/CoreNLP/issues/547 https://github.com/stanfordnlp/CoreNLP/commit/5ee2c391104109a338a28f35c647b7684b00ad41

    • Fix "seconds" in the lemmatizer: https://github.com/stanfordnlp/CoreNLP/commit/e7a073bde9ba7bbdb40ba81ed96d379455629e44

    • Fix double escaping of & in the online demos: https://github.com/stanfordnlp/CoreNLP/commit/8413fa1fc432aa2a13cbb4a296352bb9bad4d0cb

    • Report the cause of an error if "tregex" is asked for but no parse annotator is added: https://github.com/stanfordnlp/CoreNLP/commit/4db80c051322697c983ecda873d8d38f808cb96c

    • Merge ssplit and cleanxml into the tokenize annotator (done in a backwards compatible manner): https://github.com/stanfordnlp/CoreNLP/pull/1259

    • Custom tregex pattern, ROOT tregex pattern, and tsurgeon operation for simultaneously moving a subtree and pruning anything left behind, used for processing the Italian VIT treebank in stanza: https://github.com/stanfordnlp/CoreNLP/pull/1263

    • Refactor tokenization of punctuation, filenames, and other entities common to all languages, not just English: https://github.com/stanfordnlp/CoreNLP/commit/3c40ba32ca51af02936b907d03406e2158883f7b https://github.com/stanfordnlp/CoreNLP/commit/58a2288239f631df47fac3eed105fe78c08b1a5d https://github.com/stanfordnlp/CoreNLP/commit/8b97d64e48e6d4161f62a8635d2bb4cee2e95553

    • Improved tokenization of number patterns, names with apostrophes such as Sh'reyan, non-American phone numbers, invisible commas https://github.com/stanfordnlp/CoreNLP/commit/9476a8eb724e01df4b05bce38789dd8a7e61397c https://github.com/stanfordnlp/CoreNLP/commit/6193934af8ae0abb0b4c6a2522d7efdfa426e5b3 https://github.com/stanfordnlp/CoreNLP/commit/afb1ea89c874acd58bab584f1e29a059c44dfd20 https://github.com/stanfordnlp/CoreNLP/commit/7c84960df4ac9d391ef37855572e2f8bc301ee17

    • Significant lemmatizer improvements: adjectives & adverbs, along with some various other special cases https://github.com/stanfordnlp/CoreNLP/pull/1266

    • Include graph & semgrex indices in the results for a semgrex query (will make the results more usable) https://github.com/stanfordnlp/CoreNLP/commit/45b47e245c367663bba2e81a26ea7c29262ad0d8

    • Trim words in the NER training process. Spaces can still be inside a word, but random whitespace won't ruin the performance of the models https://github.com/stanfordnlp/CoreNLP/commit/0d9e9c829bfa75bb661cccea03fc682a0f955f0d

    • Fix NBSP in the Chinese segmenter https://github.com/stanfordnlp/stanza/issues/1052 https://github.com/stanfordnlp/CoreNLP/pull/1279

    Source code(tar.gz)
    Source code(zip)
  • v4.4.0(Jan 25, 2022)

    Enhancements

    • added -preTokenized option, which assumes text should be tokenized on whitespace and sentence-split on newlines

    • tsurgeon CLI - python side added to stanza
      https://github.com/stanfordnlp/CoreNLP/pull/1240

    • sutime WORKDAY definition https://github.com/stanfordnlp/CoreNLP/commit/0dfb11817c2b46a532985c24289e128fbb81a2c0

    Fixes

    • rebuilt Italian dependency parser using CoreNLP predicted tags

    • XML security issue: https://github.com/stanfordnlp/CoreNLP/pull/1241

    • NER server security issue: https://github.com/stanfordnlp/CoreNLP/commit/5ee097dbede547023e88f60ed3f430ff09398b87

    • fix infinite loop in tregex: https://github.com/stanfordnlp/CoreNLP/pull/1238

    • json utf-8 output on windows https://github.com/stanfordnlp/CoreNLP/pull/1231 https://github.com/stanfordnlp/stanza/issues/894

    • fix openie crash in certain unusual graphs https://github.com/stanfordnlp/CoreNLP/pull/1230 https://github.com/stanfordnlp/CoreNLP/issues/1082

    • fix nondeterministic results in certain SemanticGraph structures https://github.com/stanfordnlp/CoreNLP/pull/1228 https://github.com/stanfordnlp/CoreNLP/commit/cc806f265292977b69fd55f36408fe5ad3a695a0

    • workaround for NLTK sending % unescaped to the server https://github.com/stanfordnlp/CoreNLP/issues/1226 https://github.com/stanfordnlp/CoreNLP/commit/20fe1e996455b1c1434022d6e7f0b8524f41f253

    • make TimingTest function on Windows https://github.com/stanfordnlp/CoreNLP/commit/4aafb84f6ea5c0102c921a503cbfb8e3d34f3e22

    Source code(tar.gz)
    Source code(zip)
  • v4.3.2(Nov 18, 2021)

  • v4.3.1(Oct 22, 2021)

    Fixes

    • character offset issue with StatTok
    • fixes path issue with default Hungarian properties
    • adds Hungarian and Italian to demo
    • fixes umlaut issue
    Source code(tar.gz)
    Source code(zip)
  • v4.3.0(Oct 6, 2021)

    Overview

    This release adds new European languages, improvements to the parsers and tokenizers, and other misc. fixes.

    Enhancements

    • Hungarian pipeline
    • Italian pipeline
    • Improvements to English tokenizer
    • Better memory usage by dependency parser

    Fixes

    • issue with umlaut handling in German #1184
    Source code(tar.gz)
    Source code(zip)
  • v4.2.2(May 14, 2021)

    This release includes some small fixes to version 4.2.1.

    It includes:

    • demo fixes for 4.2.2, resolving cache issues with demo resources
    • small fix to RegexNERSequenceClassifier issue allowing AnswerAnnotation to be overwritten
    Source code(tar.gz)
    Source code(zip)
  • v4.2.1(May 5, 2021)

    • Fix the server having some links http instead of https https://github.com/stanfordnlp/CoreNLP/issues/1146

    • Improve MWE expressions in the enhanced dependency conversion https://github.com/stanfordnlp/CoreNLP/commit/1ef9ef9c75e6948eed10092bf6d1c49c49cfabaa

    • Add the ability for the command line semgrex processor to handle multiple calls in one process https://github.com/stanfordnlp/CoreNLP/commit/c9d50ef9cb2e1851257d06cda55b1456d69145b7

    • Fix interaction between discarding tokens in ssplit and assigning NER tags https://github.com/stanfordnlp/CoreNLP/commit/a803bc357c32841beb3919f2e4dc22a1375dca4d

    • Reduce the size of the sr parser models (not a huge amount, but some) https://github.com/stanfordnlp/CoreNLP/pull/1142

    • Various QuoteAnnotator bug fixes https://github.com/stanfordnlp/CoreNLP/pull/1135 https://github.com/stanfordnlp/CoreNLP/issues/1134 https://github.com/stanfordnlp/CoreNLP/pull/1121 https://github.com/stanfordnlp/CoreNLP/issues/1118 https://github.com/stanfordnlp/CoreNLP/commit/9f1b015ea91f1db6dce6ab7f35aacb9cdc33e463 https://github.com/stanfordnlp/CoreNLP/issues/1147

    • Switch to newer istack implementation https://github.com/stanfordnlp/CoreNLP/pull/1133

    • Newer protobuf https://github.com/stanfordnlp/CoreNLP/pull/1150

    • Add a conllu output format to some of the segmenter code, useful for testing with the official test scripts https://github.com/stanfordnlp/CoreNLP/commit/c70ddec9736e9d3c7effd4660f63e363caeb333d

    • Fix Turkish locale enums https://github.com/stanfordnlp/CoreNLP/pull/1126 https://github.com/stanfordnlp/stanza/issues/580

    • Use StringBuilder instead of StringBuffer where possible https://github.com/stanfordnlp/CoreNLP/pull/1010

    Source code(tar.gz)
    Source code(zip)
  • v4.2.0(Nov 17, 2020)

    Overview

    This release features a collection of small bug fixes and updates. It is the first release built directly from the GitHub repo.

    Enhancements

    • Upgrade libraries (EJML, JUnit, JFlex)
    • Add character offsets to Tregex responses from server
    • Improve cleaning of treebanks for English models
    • Speed up loading of Wikidict annotator
    • New utility for tagging CoNLL-U files in place
    • Command line tool for processing TokensRegex

    Fixes

    • Output single token NER entities in inline XML output format
    • Add currency symbol part of speech training data
    • Fix issues with tree binarizing
    Source code(tar.gz)
    Source code(zip)
  • v4.0.0(May 4, 2020)

    Overview

    The latest release of Stanford CoreNLP includes a major overhaul of tokenization and a large collection of new parsing and tagging models. There are also miscellaneous enhancements and fixes.

    Enhancements

    • UD v2.0 tokenization standard for English, French, German, and Spanish. That means "new" LDC tokenization for English (splitting on most hyphens) and not escaping parentheses or turning quotes etc. into ASCII sequences by default.
    • Upgrade options for normalizing special chars (quotes, parentheses, etc.) in PTBTokenizer
    • Have WhitespaceTokenizer support same newline processing as PTBTokenizer
    • New mwt annotator for handling multiword tokens in French, German, and Spanish.
    • New models with more training data and better performance for tagging and parsing in English, French, German, and Spanish.
    • Add French NER
    • New Chinese segmentation based off CTB9
    • Improved handling of double codepoint characters
    • Easier syntax for specifying language specific pipelines and NER pipeline properties
    • Improved CoNLL-U processing
    • Improved speed and memory performance for CRF training
    • Tregex support in CoreSentence
    • Updated library dependencies

    Fixes

    • NPE while simultaneously tokenizing on whitespace and sentence splitting on newlines
    • NPE in EntityMentionsAnnotator during language check
    • NPE in CorefMentionAnnotator while aligning coref mentions with titles and entity mentions
    • NPE in NERCombinerAnnotator in certain configurations of models on/off
    • Incorrect handling of eolonly option in ArabicSegmenterAnnotator
    • Apply named entity granularity change prior to coref mention detection
    • Incorrect handling of keeping newline tokens when using Chinese segmenter on Windows
    • Incorrect handling of reading in German treebank files
    • SR parser crashes when given bad training input
    • New PTBTokenizer known abbreviations: "Tech.", "Amb.". Fix legacy tokenizer hack special casing 'Alex.' for 'Alex. Brown'
    • Fix ancient bug in printing constituency tree with multiple roots.
    • Fix parser from failing on word "STOP" because it treated it as a special word
    Source code(tar.gz)
    Source code(zip)