A fast and accurate POS and morphological tagging toolkit (EACL 2014)

Overview

RDRPOSTagger

RDRPOSTagger is a robust and easy-to-use toolkit for POS and morphological tagging. It employs an error-driven approach to automatically construct tagging rules in the form of a binary tree.

  • RDRPOSTagger tags very quickly and achieves accuracy competitive with state-of-the-art results. See our AI Communications article for experimental results, including tagging speed and accuracy, on 13 languages.

  • RDRPOSTagger now supports pre-trained UPOS, XPOS and morphological tagging models for about 80 languages. See folder Models for more details.
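
To make the idea of tagging rules arranged in a binary tree more concrete, here is a minimal sketch of a ripple-down-rules style tree. The names `RuleNode`, `classify` and `retag`, the context keys, and the two toy rules are hypothetical illustrations; they are not RDRPOSTagger's actual data structures or learned rules.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical context for one token, e.g.
# {"word": "can", "prev_tag": "DT", "init_tag": "MD"}
Context = dict


@dataclass
class RuleNode:
    condition: Callable[[Context], bool]
    conclusion: str
    except_child: Optional["RuleNode"] = None  # visited when this rule fires
    if_not_child: Optional["RuleNode"] = None  # visited when it does not fire


def classify(root: Optional[RuleNode], ctx: Context) -> Optional[str]:
    """Return the conclusion of the last rule that fired on the path."""
    node, result = root, None
    while node is not None:
        if node.condition(ctx):
            result = node.conclusion
            node = node.except_child
        else:
            node = node.if_not_child
    return result


# Toy tree: keep "MD" for modals, except after a determiner, where "NN" wins.
TOY_TREE = RuleNode(
    condition=lambda c: c["init_tag"] == "MD",
    conclusion="MD",
    except_child=RuleNode(
        condition=lambda c: c["prev_tag"] == "DT",
        conclusion="NN",
    ),
)


def retag(ctx: Context) -> str:
    """Fall back to the initial tag when no rule fires."""
    return classify(TOY_TREE, ctx) or ctx["init_tag"]
```

Each node has two children, one followed when its condition holds and one followed when it does not, which is what makes the structure a binary tree.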

The general architecture and experimental results of RDRPOSTagger are described in our EACL 2014 and AI Communications (AICom) papers.

Please CITE either the EACL or the AICom paper whenever RDRPOSTagger is used to produce published results or incorporated into other software.

The current release (a 41MB .zip file containing about 330 pre-trained tagging models) is available for download at: https://github.com/datquocnguyen/RDRPOSTagger/archive/master.zip

Find more information about RDRPOSTagger at: http://rdrpostagger.sourceforge.net/

In addition, you might want to try my neural network-based toolkit jPTDP for joint POS tagging and dependency parsing.

Comments
  • Certain tokens receive null POS and token is not outputted

    Some tokens, such as âManli, are tagged as:

    ''/null
    

    I couldn't find anything in the documentation about a null POS being returned in any case, so I figure this is unintended behavior. Either way, the output should include the original token.

    opened by matgrioni 5
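
A quick way to locate affected tokens is to scan the tagger output for the `null` tag. This is a minimal sketch that assumes the output is whitespace-separated `word/tag` tokens, as in the `''/null` example above; the function name `find_null_tags` is hypothetical.

```python
def find_null_tags(tagged_line, null_tag="null"):
    """Return (position, word) pairs for tokens whose tag equals null_tag."""
    hits = []
    for position, token in enumerate(tagged_line.split()):
        # Split on the last slash so words containing "/" stay intact.
        word, sep, tag = token.rpartition("/")
        if sep and tag == null_tag:
            hits.append((position, word))
    return hits
```

The returned positions can then be matched against the raw input line to see which original tokens were affected.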
  • RDRPOSTagger.py returns blank error

    I am using the following command within RDRPOSTagger/pSCRDRtagger:

    python RDRPOSTagger.py ../Models/UniPOS/UD_Latin/la-upos.RDR ../Models/UniPOS/UD_Latin/la-upos.DICT rawDataPath
    

    For some of the files I run it on, it works as expected. For others, such as the one attached, the error output is as follows:

    => Read a POS tagging model from ../Models/UniPOS/UD_Latin/la-upos.RDR
    => Read a lexicon from ../Models/UniPOS/UD_Latin/la-upos.DICT
    => Perform POS tagging on /home/grioni.2/NER/Preprocessing/Preprocessed/UNKNOWN/Tacitus.txt
    ERROR ==>  "''"
    ===== Usage =====
    #1: To train RDRPOSTagger on a gold standard training corpus:
    python RDRPOSTagger.py train PATH-TO-GOLD-STANDARD-TRAINING-CORPUS
    Example: python RDRPOSTagger.py train ../data/goldTrain
    #2: To use the trained model for POS tagging on a raw text corpus:
    python RDRPOSTagger.py tag PATH-TO-TRAINED-MODEL PATH-TO-LEXICON PATH-TO-RAW-TEXT-CORPUS
    Example: python RDRPOSTagger.py tag ../data/goldTrain.RDR ../data/goldTrain.DICT ../data/rawTest
    #3: Find the full usage at http://rdrpostagger.sourceforge.net !

    I'm not sure where this error is coming from, as its message is essentially blank. This problem does not occur with the Java implementation, however, so:

    java RDRPOSTagger ../Models/UniPOS/UD_Latin/la-upos.RDR ../Models/UniPOS/UD_Latin/la-upos.DICT rawDataPath
    

    works for the same file.

    Alexander_Severus.txt

    opened by matgrioni 5
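
One possible reading of the near-empty message, offered as an assumption that the report itself does not confirm: in Python, converting a `KeyError` to a string yields only the `repr` of the missing key, so a failed lookup of the key `''` would print exactly `"''"`. The sketch below reproduces that formatting with a hypothetical lookup table and shows a more informative way to print the same exception.

```python
def describe(exc):
    """Include the exception type so a bare key is not mistaken for a blank."""
    return f"{type(exc).__name__}: {exc}"


# Hypothetical stand-in for a tag lookup table that lacks an entry for ''.
lexicon = {"NOUN": "NOUN"}

try:
    lexicon["''"]
except KeyError as exc:
    terse = "ERROR ==>  " + str(exc)          # only the repr of the key
    verbose = "ERROR ==>  " + describe(exc)   # names the exception type too

print(terse)
print(verbose)
```

If this reading is right, checking whether the lexicon file contains an entry for `''` would be a reasonable next step.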
  • Getting "string index out of range" error while trying to train

    Thanks a lot for making available this open source package!

    I am trying to train RDRPOSTagger for POS tagging of the Tamil language. I have converted the POS-tagged text to the same format as your goldTrain file. Each line in my input training corpus is a tokenized/word-segmented sentence. I was able to train using a small test file (after some trial and error) and to generate the .RDR and .DICT files.

    However, when running on a larger file, I get a "string index out of range" error. This happens immediately after the lexicon generation starts, so the next three steps (extracting the raw text corpus, POS tagging, and learning the tree model of rules) do not happen. I am using an Ubuntu Linux box. The command I am using to train is: python RDRPOSTagger.py train [my POS tagged formatted file path and name]

    Any tips on how to identify the source of this error will be very helpful.

    opened by AshokR 5
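
When training fails this early, a malformed token in the corpus is one plausible cause. The sketch below lists tokens that are not of the form `word/tag`; it assumes the gold corpus uses whitespace-separated `word/tag` tokens, and the function name `find_malformed_tokens` is hypothetical. It does not claim to reproduce the toolkit's own parsing.

```python
def find_malformed_tokens(lines):
    """Return (line_number, token) pairs that are not of the form word/tag."""
    problems = []
    for line_number, line in enumerate(lines, start=1):
        for token in line.split():
            # Split on the last slash; a valid token has text on both sides.
            word, sep, tag = token.rpartition("/")
            if not sep or not word or not tag:
                problems.append((line_number, token))
    return problems
```

Running it over the training file (for example `find_malformed_tokens(open(path, encoding="utf-8"))`) narrows the search to specific lines.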
  • Successfully trained the RDRPOSTagger in Tamil

    I am happy to report that, after extensive tweaking of my gold standard training corpus, I have successfully trained the tagger with a corpus of about 200,000 Tamil words. I used 80% of the corpus for training and 20% for testing. I see a difference of about 15% from my gold standard testing corpus.

    It would be great if you could take a look at my corpus and let me know whether there is anything I can do to improve it.

    opened by AshokR 3
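
The 80/20 evaluation described above can be sketched as follows. This is an illustration under the assumption of whitespace-separated `word/tag` tokens; `split_corpus`, `tags_of` and `tagging_accuracy` are hypothetical helper names, not part of the toolkit.

```python
def split_corpus(lines, train_fraction=0.8):
    """Split sentences into a training part and a held-out test part."""
    cut = int(len(lines) * train_fraction)
    return lines[:cut], lines[cut:]


def tags_of(line):
    """Extract the tag after the last slash of every token."""
    return [token.rpartition("/")[2] for token in line.split()]


def tagging_accuracy(gold_lines, predicted_lines):
    """Token-level accuracy between gold and predicted tagged sentences."""
    correct = total = 0
    for gold, predicted in zip(gold_lines, predicted_lines):
        gold_tags, predicted_tags = tags_of(gold), tags_of(predicted)
        if len(gold_tags) != len(predicted_tags):
            raise ValueError("token count mismatch between gold and prediction")
        correct += sum(g == p for g, p in zip(gold_tags, predicted_tags))
        total += len(gold_tags)
    return correct / total if total else 0.0
```

Shuffling the sentences before splitting is worth considering when the corpus is ordered by source or topic.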
  • Single double quote, ", becomes two single quotes in output.

    When there is a double quote in the source, the output consists of two single quotes. The line in question is

    multos aut affectatio alienae fortunae aut suae querella querella Madvig : qua A : cura Haase . detinuit ; plerosque nihil certum sequentis vaga et inconstans et sibi displicens levitas per nova consilia iactavit ; quibusdam nihil , quo cursum derigant , placet , sed marcentis oscitantisque fata deprendunt , adeo ut quod apud maximum poetarum more oraculi dictum est , verum esse non dubitem : " Exigua pars est vitae , qua vivimus .  " Ceterum quidem omne spatium non vita sed tempus est .
    

    whose tokens have been space separated ;). In the output the double quote after dubitem is two single quotes, which is a problem for my purposes.

    opened by matgrioni 2
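
Until the output preserves the original characters, one workaround is to pair each output tag with the token from the raw input. The sketch assumes the tagger emits exactly one `word/tag` token per whitespace-separated input token, in the same order; that assumption and the function name `restore_tokens` are not taken from the toolkit.

```python
def restore_tokens(raw_line, tagged_line):
    """Rebuild tagged output using the raw tokens and the predicted tags."""
    raw_tokens = raw_line.split()
    tagged_tokens = tagged_line.split()
    if len(raw_tokens) != len(tagged_tokens):
        raise ValueError("raw and tagged lines have different token counts")
    restored = []
    for raw, tagged in zip(raw_tokens, tagged_tokens):
        tag = tagged.rpartition("/")[2]  # text after the last slash
        restored.append(f"{raw}/{tag}")
    return " ".join(restored)
```

The length check makes the function fail loudly rather than misalign tags if a token was dropped.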
  • Porting to Python 3

    Hello, I've ported the script to Python 3. It's here; you may want to link to it or pull it in as a git branch.

    Unfortunately, the code no longer works on Python 2.

    opened by jacopofar 2
  • Slimmed down version of RDRPOSTagger

    Hi! I'm planning to make a slimmed-down, easy-to-use version of RDRPOSTagger. The idea is to simplify the project so that it only deals with UniPOS tagging using those models, and to remove everything else, including training and non-UniPOS models. I will also remove the Java version.

    I will work on it as a fork of this project, hosted here: https://github.com/EmilStenstrom/RDRPOSTagger/

    My idea is that this will make the code easier to read, and by simplifying it also easier to use. I have an idea of how I can use this commercially at the company I work with.

    Are you ok with this? I see you have chosen the GNU license, so it looks that way, but I just want to be sure.

    opened by EmilStenstrom 2
  • Creating a lexicon

    Could you please add a few lines on how to use the LexiconCreator.py script? I am not sure what parameters to set to the function createLexicon. Is corpusFilePath the path to the universal dependencies file? What about fullLexicon?

    opened by senisioi 1
  • Caseless/case-insensitive POS tagging?

    Thanks for this awesome project.

    My project background: step 1) Audio -> Text; step 2) Text -> gather POS tags and identify named entities in the text. In the speech-to-text step, nearly all of the text is converted to lower case.

    I'd like to know how to make the model classify words based on context rather than on whether the word is capitalized.

    If the sentence is "I ate an apple while sitting inside the apple headquarters",
    I want RDRPOSTagger to identify the first apple as a fruit and the second apple as an organization. At present it identifies Apple as an organisation only when it is capitalized. RDRPOSTagger.py identifies headquarters as an NNP, but RDRPOSTagger4en.py identifies it as an NN.

    Thanks.

    opened by StanSilas 1
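
A common workaround for lower-cased input is to train a separate model on a lower-cased copy of the gold corpus, so that training and tagging see the same casing. The sketch below only prepares such a corpus; it assumes `word/tag` tokens, the function name `lowercase_words` is hypothetical, and whether this improves accuracy for a given dataset is untested here.

```python
def lowercase_words(tagged_line):
    """Lower-case the word part of each word/tag token, leaving tags intact."""
    converted = []
    for token in tagged_line.split():
        word, sep, tag = token.rpartition("/")
        if sep:
            converted.append(f"{word.lower()}/{tag}")
        else:
            # No slash found: treat the whole token as a word.
            converted.append(token.lower())
    return " ".join(converted)
```

Note that POS tags such as NN and NNP distinguish common from proper nouns; telling a fruit from an organization is a named-entity task beyond what the tags encode.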
  • Tagging process takes such a long time for Thai language

    Thank you for the great work you have done. I have run into an issue that I have no idea how to fix.

    I was able to tag Thai text with a very small input (a 1 MB Thai text file), and it worked very well (I got an output file with POS-tagged words). Then I used a bigger input (a 4.5 GB Thai text file) with the exact same code and directory, but the program produced no result and kept running without finishing (10 hours and more).

    Is there any way to solve this waiting problem, or is it normal for the tagger to take that long (more than 10 hours for 4.5 GB of text)?

    Thai text input as an example, "โครงการ พี่น้อง วิก พี เดียด เนิน มูลนิธิ วิก มีเดีย องค์กร แสวง ผลกำไร ผู้ดำเนินการ ภาษา อื่น ดาราศาสตร์ ดาราศาสตร์ วิชา วิทยาศาสตร์ ศึกษา วัตถุ ท้องฟ้า อาทิ ดาวฤกษ์ ดาวเคราะห์ ดาวหาง ดารา จักร รวมทั้ง ปรากฏการณ์ ทางธรรมชาติ ต่าง ที่เกิด ขึ้น ชั้น บรรยากาศ โลก ศึกษา เกี่ยวกับ วิวัฒนาการ ลักษณะ ทางกายภาพ เคมี ทาง อุตุนิยมวิทยา และ เคลื่อนที่ วัตถุ ท้องฟ้า ตลอดจน การกำ นิด และ วิวัฒนาการ ของ เอกภพ ดาราศาสตร์ เป็นหนึ่ง สาขา วิทยาศาสตร์ เก่าแก่ ที่สุด นัก ดาราศาสตร์ วัฒนธรรม โบราณ สังเกตการณ์ ดวงดาว ท้องฟ้า ใน เวลา กลางคืน วัตถุ ดาราศาสตร์ หลายอย่าง ก็ได้ ถูก ค้นพบ เรื่อย ตาม ยุคสมัย กล้องโทรทรรศน์ สิ่งประดิษฐ์ จำเป็น ก่อนที่จะ การพัฒนา มา เป็น วิทยาศาสตร์ สมัยใหม่ อดีตกาล ดาราศาสตร์ ประกอบ สาขา ที่ หลากหลาย วัด ตำแหน่ง ดาว การเดินเรือ ดาราศาสตร์ ดาราศาสตร์ เชิง สังเกตการณ์ การ สร้าง ปฏิทิน และ รวมทั้ง โหราศาสตร์ ดาราศาสตร์ ทุกวันนี้ ถูก จัด มีความหมาย เหมือนกับ ฟิสิกส์ ดาราศาสตร์ ตั้งแต่ คริสต์ ศตวรรษ ที่ เป็นต้นมา ดาราศาสตร์ ออก เป็น สอง สาขา ดาราศาสตร์ เชิง สังเกตการณ์ และ ดาราศาสตร์ เชิงทฤษฎี ดาราศาสตร์ เชิง สังเกตการณ์ ให้ความสำคัญ ไป ที่ การ เก็บ และ การ วิเคราะห์ ข้อมูล การ ความรู้ ทางกายภาพ เบื้องต้น เป็นหลัก ส่วน ดาราศาสตร์ เชิงทฤษฎี ให้ความสำคัญ ไป ที่ การพัฒนา คอมพิวเตอร์ แบบจำลอง เชิง วิเคราะห์ อธิบาย วัตถุ ท้องฟ้า และ ปรากฏการณ์ ต่าง ทั้งสอง สาขา เป็น องค์ประกอบ ซึ่งกันและกัน กล่าวคือ ดาราศาสตร์ เชิงทฤษฎี ใช้ อธิบาย ผล การ สังเกตการณ์ และ ดาราศาสตร์ เชิง สังเกตการณ์ ใช้ ใน การ รับรอง ผล จาก ทางทฤษฎี"

    Cheers,

    opened by vvorakit 1
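
One way to make a run of this size observable is to split the input into smaller files and tag them one at a time, so that progress is visible and a failure costs only one chunk. This sketch does not diagnose the slowdown; the helper names `iter_chunks` and `split_file`, the chunk size, and the `.partNNNNN` naming scheme are all hypothetical choices.

```python
from itertools import islice


def iter_chunks(lines, lines_per_chunk):
    """Yield lists of at most lines_per_chunk lines from any iterable."""
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, lines_per_chunk))
        if not chunk:
            return
        yield chunk


def split_file(path, lines_per_chunk=100_000):
    """Write consecutive chunks of a text file to numbered part files."""
    part_paths = []
    with open(path, encoding="utf-8") as source:
        for index, chunk in enumerate(iter_chunks(source, lines_per_chunk)):
            part_path = f"{path}.part{index:05d}"
            with open(part_path, "w", encoding="utf-8") as part:
                part.writelines(chunk)
            part_paths.append(part_path)
    return part_paths
```

Because the file is read line by line, memory use stays bounded by the chunk size rather than the 4.5 GB total.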
  • Add a setup.py file so "pip" can be used to install

    I was looking to quickly try your library (which looks great) by installing it with pip. The problem is that this doesn't work unless the project has a setup.py file that describes it.

    Adding a file is easy, here's an example from one of my projects: https://github.com/EmilStenstrom/python-nutshell/blob/master/setup.py

    After this is done your project can be installed by typing:

    pip install https://github.com/datquocnguyen/RDRPOSTagger/archive/master.zip
    

    If you also want to reserve a name and make your project even easier to install you could publish it on PyPI. Then installation would be as easy as:

    pip install RDRPOSTagger
    

    Let me know if I can do anything to help you.

    opened by EmilStenstrom 1
  • Get RDRPOSTagger installed on python

    Hi,

    I've tried to download the zip file, but then could not figure out how to install RDRPOSTagger into Python. Could you help with this? I had a look at your website but couldn't find a help section.

    Many thanks!

    opened by sangohanfly 1
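
Since no installer is described, one option is to unzip the release and call the script from Python as an external command. The sketch follows the `tag` usage line quoted in the error output earlier on this page and the `pSCRDRtagger` directory mentioned there; the helper names, the default interpreter name `python`, and the directory argument are assumptions to adapt to the local setup.

```python
import subprocess


def build_tag_command(model_path, lexicon_path, raw_text_path, python_exe="python"):
    """Build the argument list for the documented 'tag' invocation."""
    return [
        python_exe,
        "RDRPOSTagger.py",
        "tag",
        model_path,
        lexicon_path,
        raw_text_path,
    ]


def run_tagger(model_path, lexicon_path, raw_text_path, tagger_dir):
    """Run the tagger script from inside its own directory."""
    command = build_tag_command(model_path, lexicon_path, raw_text_path)
    # check=True raises CalledProcessError if the script exits with an error.
    subprocess.run(command, cwd=tagger_dir, check=True)
```

A call might look like `run_tagger("../data/goldTrain.RDR", "../data/goldTrain.DICT", "../data/rawTest", "RDRPOSTagger/pSCRDRtagger")`, using the example paths from the usage text.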
  • Convert FullUsage.html to Markdown

    • Convert FullUsage.html to Markdown (USAGE.md) for easy access and editing, directly from within GitHub
      • See example at https://github.com/bact/RDRPOSTagger/blob/markdown-usage/USAGE.md
    • Add badges to README.md, as well as a link to USAGE.md
    • Line wrapping license.txt for easy reading on web browser
    opened by bact 0
  • Follows PEP 8 Python code convention and format

    • Replace comparisons with None from == None to is None and from != None to is not None
    • Drop unnecessary comparisons when obvious, follows Python convention that if a value is not None/False/zero/empty, it is True in a boolean test
    • Drop ; semicolons at the end of some statements
    • Formatted with black, remove trailing spaces
    • Use with context manager for open files
    opened by bact 0
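
The conventions listed in this change can be illustrated with a small function. It is a generic example written for this page, not code from the pull request, and the name `first_nonblank_line` is hypothetical.

```python
def first_nonblank_line(path=None, default=""):
    """Return the first non-blank line of a file, or default if there is none."""
    if path is None:  # identity test, instead of: path == None
        return default
    # The context manager closes the file even if an exception is raised.
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            stripped = line.strip()
            if stripped:  # truthiness test, instead of: stripped != ""
                return stripped
    return default
```

The comments mark where `is None`, a plain truthiness test, and `with open(...)` replace the older idioms the pull request removes.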