A fast and accurate POS and morphological tagging toolkit (EACL 2014)

Last update: Sep 9, 2022

Overview

RDRPOSTagger

RDRPOSTagger is a robust and easy-to-use toolkit for POS and morphological tagging. It employs an error-driven approach to automatically construct tagging rules in the form of a binary tree.
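
To make the binary-tree idea concrete, here is a minimal sketch of a single-classification ripple-down rules tree. It is not the toolkit's implementation: the names Node, classify and tag_token, the context keys, and the toy rules are all assumptions chosen for illustration. Each node holds a condition and a conclusion; when a rule fires, evaluation moves to its "except" child, otherwise to its "if-not" child, and the last rule that fired decides the tag.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    """One rule in a toy ripple-down rules tree (illustrative only)."""
    condition: Callable[[dict], bool]        # test on the token's context
    conclusion: Optional[str]                # tag proposed when the rule fires
    except_child: Optional["Node"] = None    # visited when this rule fires
    if_not_child: Optional["Node"] = None    # visited when this rule does not fire


def classify(root: Node, context: dict) -> Optional[str]:
    """Return the conclusion of the last rule that fired on the path."""
    node, result = root, None
    while node is not None:
        if node.condition(context):
            if node.conclusion is not None:
                result = node.conclusion
            node = node.except_child
        else:
            node = node.if_not_child
    return result


def tag_token(root: Node, context: dict) -> str:
    """Keep the initial tag unless some rule in the tree overrides it."""
    result = classify(root, context)
    return result if result is not None else context["tag"]


# Toy tree: the default rule always fires and concludes nothing.
# Its exception handles tokens initially tagged VB, and that rule's own
# exception re-tags them as NN when the previous tag is a determiner.
root = Node(lambda c: True, None)
root.except_child = Node(lambda c: c["tag"] == "VB", "VB")
root.except_child.except_child = Node(lambda c: c["prev_tag"] == "DT", "NN")

print(tag_token(root, {"word": "walk", "tag": "VB", "prev_tag": "DT"}))
```

In an error-driven setting, new exception nodes like the last one are added where the current tree mislabels training tokens, so existing rules are refined without being rewritten.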

RDRPOSTagger tags text very quickly and achieves accuracy competitive with state-of-the-art results. See our AI Communications article for experimental results, including tagging speed and accuracy, for 13 languages.
RDRPOSTagger now supports pre-trained UPOS, XPOS and morphological tagging models for about 80 languages. See folder Models for more details.

The general architecture and experimental results of RDRPOSTagger can be found in our following papers:

Dat Quoc Nguyen, Dai Quoc Nguyen, Dang Duc Pham and Son Bao Pham. RDRPOSTagger: A Ripple Down Rules-based Part-Of-Speech Tagger. In Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2014, pp. 17-20, 2014.
Dat Quoc Nguyen, Dai Quoc Nguyen, Dang Duc Pham and Son Bao Pham. A Robust Transformation-Based Learning Approach Using Ripple Down Rules for Part-Of-Speech Tagging. AI Communications (AICom), vol. 29, no. 3, pp. 409-422, 2016.

Please CITE either the EACL or the AICom paper whenever RDRPOSTagger is used to produce published results or incorporated into other software.

Current release (41MB .zip file containing about 330 pre-trained tagging models) is available to download at: https://github.com/datquocnguyen/RDRPOSTagger/archive/master.zip

Find more information about RDRPOSTagger at: http://rdrpostagger.sourceforge.net/

In addition, you might want to try my neural network-based toolkit jPTDP for joint POS tagging and dependency parsing.

Comments

Certain tokens receive a null POS tag and the token is not output
Some tokens, such as âManli, are tagged as:

''/null

I couldn't find anything in the documentation about a null POS tag being returned in any case, so I assume this is unintended behavior. Either way, the original token should be included in the output.
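
A quick way to locate such items in tagged output is to scan each line for entries whose tag is null or whose word part is missing. The sketch below assumes the output is a whitespace-separated sequence of word/tag items, with the last slash separating word from tag; find_suspect_tokens is a hypothetical helper, not part of the toolkit.

```python
def find_suspect_tokens(tagged_line):
    """Return (position, item) pairs whose tag is 'null' or that are malformed.

    Assumes whitespace-separated word/tag items (an assumption about the
    output format, based on the example in this report).
    """
    suspects = []
    for position, item in enumerate(tagged_line.split()):
        word, separator, tag = item.rpartition("/")
        if not separator or not word or tag in ("", "null"):
            suspects.append((position, item))
    return suspects


print(find_suspect_tokens("Roma/PROPN ''/null est/AUX"))
```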
opened by matgrioni 5

RDRPOSTagger.py returns blank error

I am using the following command within RDRPOSTagger/pSCRDRtagger:

python RDRPOSTagger.py ../Models/UniPOS/UD_Latin/la-upos.RDR ../Models/UniPOS/UD_Latin/la-upos.DICT rawDataPath

For some of the files I run it on, it works as expected. For others, such as the one attached, the output is the following error:

=> Read a POS tagging model from ../Models/UniPOS/UD_Latin/la-upos.RDR

=> Read a lexicon from ../Models/UniPOS/UD_Latin/la-upos.DICT

=> Perform POS tagging on /home/grioni.2/NER/Preprocessing/Preprocessed/UNKNOWN/Tacitus.txt

ERROR ==>  "''"

===== Usage =====

#1: To train RDRPOSTagger on a gold standard training corpus:

python RDRPOSTagger.py train PATH-TO-GOLD-STANDARD-TRAINING-CORPUS

Example: python RDRPOSTagger.py train ../data/goldTrain

#2: To use the trained model for POS tagging on a raw text corpus:

python RDRPOSTagger.py tag PATH-TO-TRAINED-MODEL PATH-TO-LEXICON PATH-TO-RAW-TEXT-CORPUS

Example: python RDRPOSTagger.py tag ../data/goldTrain.RDR ../data/goldTrain.DICT ../data/rawTest

#3: Find the full usage at http://rdrpostagger.sourceforge.net !

I'm not sure where this error comes from, since its message is blank. The problem does not occur with the Java implementation, however, so:

java RDRPOSTagger ../Models/UniPOS/UD_Latin/la-upos.RDR ../Models/UniPOS/UD_Latin/la-upos.DICT rawDataPath

works for the same file.

Alexander_Severus.txt

opened by matgrioni 5

Getting "string index out of range" error while trying to train

Thanks a lot for making available this open source package!

I am trying to train RDRPOSTagger for POS tagging of the Tamil language. I have converted the POS-tagged text into the same format as your goldTrain file. Each line in my input training corpus is a tokenized/word-segmented sentence. I was able to train on a small test file (after some trial and error) and to generate the .RDR and .DICT files.

However, when I run it on a larger file, I get a "string index out of range" error. This happens immediately after lexicon generation starts, so the next three steps (extracting the raw text corpus, POS tagging, and learning the tree model of rules) do not happen. I am using an Ubuntu Linux box. The command I am using to train is: python RDRPOSTagger.py train [my POS tagged formatted file path and name]

Any tips on how to identify the source of this error will be very helpful.
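
One way to narrow down such an error is to check the training corpus for malformed items before training. The sketch below assumes each line is a whitespace-separated sequence of word/tag items with the last slash as the separator; that format, and the helper name find_malformed_tokens, are assumptions rather than documented behavior of the toolkit. Whether any of these cases is what actually triggers the error is not confirmed here.

```python
def find_malformed_tokens(lines):
    """Report items that do not look like word/tag, with 1-based line numbers."""
    problems = []
    for line_number, line in enumerate(lines, start=1):
        items = line.split()
        if not items:
            problems.append((line_number, "", "blank line"))
            continue
        for item in items:
            word, separator, tag = item.rpartition("/")
            if not separator:
                problems.append((line_number, item, "missing '/' separator"))
            elif not word:
                problems.append((line_number, item, "empty word"))
            elif not tag:
                problems.append((line_number, item, "empty tag"))
    return problems


sample = ["The/DT dog/NN", "barks", "/VBZ ./.", ""]
for problem in find_malformed_tokens(sample):
    print(problem)
```

For a real corpus, the lines can be read with open(path, encoding="utf-8") and passed to the same function.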

opened by AshokR 5
Successfully trained the RDRPOSTagger in Tamil

I am happy to report that, after extensive tweaking of my gold standard training corpus, I have successfully trained the tagger with a corpus of about 200,000 Tamil words. I used 80% of the corpus for training and 20% for testing. I see a difference of about 15% from my gold standard testing corpus.
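
For reference, an 80/20 split and a per-token accuracy figure of the kind described above can be computed roughly as follows. This is a sketch under the assumption that both gold and predicted files hold one sentence per line in word/tag format with identical tokenization; split_corpus and tagging_accuracy are hypothetical helper names.

```python
def split_corpus(sentences, train_fraction=0.8):
    """Split a list of sentences into (train, test) without shuffling."""
    cut = int(len(sentences) * train_fraction)
    return sentences[:cut], sentences[cut:]


def tags_of(line):
    """Extract the tag of every word/tag item on a line."""
    return [item.rpartition("/")[2] for item in line.split()]


def tagging_accuracy(gold_lines, predicted_lines):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for gold, predicted in zip(gold_lines, predicted_lines):
        gold_tags, predicted_tags = tags_of(gold), tags_of(predicted)
        if len(gold_tags) != len(predicted_tags):
            raise ValueError("token counts differ; cannot align tags")
        total += len(gold_tags)
        correct += sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / total if total else 0.0


gold = ["a/X b/Y", "c/Z d/W"]
predicted = ["a/X b/Q", "c/Z d/W"]
print(tagging_accuracy(gold, predicted))
```

A difference of about 15% from the gold test corpus would correspond to a value near 0.85 from tagging_accuracy.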

It would be great if you could take a look at my corpus and let me know whether there is anything I can do to improve it.

opened by AshokR 3

Single double quote, ", becomes two single quotes in output.

When there is a double quote in the source, the output consists of two single quotes. The line in question is

multos aut affectatio alienae fortunae aut suae querella querella Madvig : qua A : cura Haase . detinuit ; plerosque nihil certum sequentis vaga et inconstans et sibi displicens levitas per nova consilia iactavit ; quibusdam nihil , quo cursum derigant , placet , sed marcentis oscitantisque fata deprendunt , adeo ut quod apud maximum poetarum more oraculi dictum est , verum esse non dubitem : " Exigua pars est vitae , qua vivimus .  " Ceterum quidem omne spatium non vita sed tempus est .

whose tokens have been space separated ;). In the output the double quote after dubitem is two single quotes, which is a problem for my purposes.
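
A possible workaround on the caller's side is to pair each output tag with the corresponding original token. The sketch below assumes the tagger emits exactly one word/tag item per input token, in the same order (an assumption, not documented behavior); restore_tokens is a hypothetical helper, and the tags in the example are made up.

```python
def restore_tokens(raw_line, tagged_line):
    """Rebuild word/tag output using the original input tokens.

    Raises ValueError when the two lines cannot be aligned one-to-one.
    """
    raw_tokens = raw_line.split()
    tagged_items = tagged_line.split()
    if len(raw_tokens) != len(tagged_items):
        raise ValueError("input and output token counts differ")
    restored = []
    for raw, item in zip(raw_tokens, tagged_items):
        tag = item.rpartition("/")[2]  # text after the last slash
        restored.append(f"{raw}/{tag}")
    return " ".join(restored)


raw = 'dubitem : " Exigua'
tagged = "dubitem/VERB :/PUNCT ''/PUNCT Exigua/ADJ"
print(restore_tokens(raw, tagged))
```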

opened by matgrioni 2

Porting to Python 3

Hello, I've ported the script to Python 3. It's here; you may want to link to it or pull it back in as a git branch.

Unfortunately, the ported code does not work on Python 2.

opened by jacopofar 2
Slimmed down version of RDRPOSTagger

Hi! I'm planning to make a slimmed-down, easy-to-use version of RDRPOSTagger. The idea is to simplify the project so that it only handles UniPOS tagging with the UniPOS models, and to remove everything else, including training and the non-UniPOS models. I will also remove the Java version.

I will work on it as a fork of this project, hosted here: https://github.com/EmilStenstrom/RDRPOSTagger/

My idea is that this will make the code easier to read, and by simplifying it also easier to use. I have an idea of how I can use this commercially at the company I work with.

Are you ok with this? I see you have chosen the GNU license, so it looks that way, but I just want to be sure.

opened by EmilStenstrom 2
Creating a lexicon

Could you please add a few lines on how to use the LexiconCreator.py script? I am not sure what arguments to pass to the createLexicon function. Is corpusFilePath the path to the Universal Dependencies file? What about fullLexicon?
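
As background only, a tagging lexicon is commonly a mapping from each word to its most frequent tag in the training corpus. The sketch below illustrates that general idea on word/tag lines. It is not the createLexicon function from LexiconCreator.py, it does not show what corpusFilePath or fullLexicon do, and the name build_lexicon is hypothetical.

```python
from collections import Counter, defaultdict


def build_lexicon(tagged_lines):
    """Map each word to its most frequent tag (illustrative, not the toolkit's code)."""
    counts = defaultdict(Counter)
    for line in tagged_lines:
        for item in line.split():
            word, separator, tag = item.rpartition("/")
            if separator and word and tag:  # skip malformed items
                counts[word][tag] += 1
    return {word: tag_counts.most_common(1)[0][0]
            for word, tag_counts in counts.items()}


corpus = ["the/DT run/NN", "they/PRP run/VB", "a/DT run/NN"]
print(build_lexicon(corpus))
```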

opened by senisioi 1
Caseless/case-insensitive POS tagging?

Thanks for this awesome project.

My project background: step 1) audio -> text; step 2) text -> gather POS tags and identify named entities in the text. In the speech-to-text step, nearly all of the text is converted to lower case.

I'd like to know how to make the model classify words based on context rather than on whether the word is capitalized.

Given the sentence "I ate an apple while sitting inside the apple headquarters",
I want RDRPOSTagger to identify the first apple as a fruit and the second apple as an organization. At present it identifies Apple as an organization only when it is capitalized. RDRPOSTagger.py identifies headquarters as NNP, but RDRPOSTagger4en.py identifies it as NN.
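
One common approach to caseless tagging is to train a separate model on a lowercased copy of the gold corpus, so that the model never sees capitalization as a cue. The sketch below shows only the preprocessing step, assuming word/tag training lines; lowercase_words is a hypothetical helper. Note that POS tags such as NN and NNP distinguish common from proper nouns; telling a fruit from an organization is a named-entity or word-sense decision that a POS tagger alone does not make.

```python
def lowercase_words(tagged_line):
    """Lowercase the word part of each word/tag item, leaving tags untouched."""
    converted = []
    for item in tagged_line.split():
        word, separator, tag = item.rpartition("/")
        if separator:
            converted.append(f"{word.lower()}/{tag}")
        else:
            # No tag present: treat the whole item as a raw word.
            converted.append(item.lower())
    return " ".join(converted)


print(lowercase_words("I/PRP visited/VBD Apple/NNP headquarters/NNS"))
```

The same function can be applied to raw, untagged text before tagging, since items without a slash are simply lowercased.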

Thanks.

opened by StanSilas 1
Tagging process takes a very long time for Thai language

Thank you for the great work you have done. I have run into an issue that I don't know how to fix.

I was able to tag a very small Thai input (a 1 MB text file), and it worked very well: I got an output file with POS-tagged words. I then used a larger input (a 4.5 GB Thai text file) with exactly the same code and directory, but the program produced no result and kept running with no sign of finishing (10 hours and more).

Is there any way to solve this, or is the program actually still working after that long (10 hours for 4.5 GB of text)?
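
A 4.5 GB file is roughly 4,500 times the size of the 1 MB sample, so even with linear scaling the run would take thousands of times longer. One way to get visible progress is to split the input into smaller files and tag them one at a time. The sketch below streams the input line by line so it never holds the whole file in memory; split_into_chunks, the chunk file naming, and the default chunk size are all assumptions for illustration.

```python
from pathlib import Path


def split_into_chunks(input_path, output_dir, lines_per_chunk=100_000):
    """Write consecutive groups of lines to numbered chunk files; return their paths."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths = []
    chunk_file = None
    try:
        with open(input_path, encoding="utf-8") as source:
            for line_number, line in enumerate(source):
                if line_number % lines_per_chunk == 0:
                    # Start a new chunk file every lines_per_chunk lines.
                    if chunk_file is not None:
                        chunk_file.close()
                    path = output_dir / f"chunk_{len(chunk_paths):05d}.txt"
                    chunk_file = open(path, "w", encoding="utf-8")
                    chunk_paths.append(path)
                chunk_file.write(line)
    finally:
        if chunk_file is not None:
            chunk_file.close()
    return chunk_paths
```

Each chunk can then be tagged separately, which shows how far the job has progressed and allows it to resume after an interruption.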

Thai text input as an example, "โครงการ พี่น้อง วิก พี เดียด เนิน มูลนิธิ วิก มีเดีย องค์กร แสวง ผลกำไร ผู้ดำเนินการ ภาษา อื่น ดาราศาสตร์ ดาราศาสตร์ วิชา วิทยาศาสตร์ ศึกษา วัตถุ ท้องฟ้า อาทิ ดาวฤกษ์ ดาวเคราะห์ ดาวหาง ดารา จักร รวมทั้ง ปรากฏการณ์ ทางธรรมชาติ ต่าง ที่เกิด ขึ้น ชั้น บรรยากาศ โลก ศึกษา เกี่ยวกับ วิวัฒนาการ ลักษณะ ทางกายภาพ เคมี ทาง อุตุนิยมวิทยา และ เคลื่อนที่ วัตถุ ท้องฟ้า ตลอดจน การกำ นิด และ วิวัฒนาการ ของ เอกภพ ดาราศาสตร์ เป็นหนึ่ง สาขา วิทยาศาสตร์ เก่าแก่ ที่สุด นัก ดาราศาสตร์ วัฒนธรรม โบราณ สังเกตการณ์ ดวงดาว ท้องฟ้า ใน เวลา กลางคืน วัตถุ ดาราศาสตร์ หลายอย่าง ก็ได้ ถูก ค้นพบ เรื่อย ตาม ยุคสมัย กล้องโทรทรรศน์ สิ่งประดิษฐ์ จำเป็น ก่อนที่จะ การพัฒนา มา เป็น วิทยาศาสตร์ สมัยใหม่ อดีตกาล ดาราศาสตร์ ประกอบ สาขา ที่ หลากหลาย วัด ตำแหน่ง ดาว การเดินเรือ ดาราศาสตร์ ดาราศาสตร์ เชิง สังเกตการณ์ การ สร้าง ปฏิทิน และ รวมทั้ง โหราศาสตร์ ดาราศาสตร์ ทุกวันนี้ ถูก จัด มีความหมาย เหมือนกับ ฟิสิกส์ ดาราศาสตร์ ตั้งแต่ คริสต์ ศตวรรษ ที่ เป็นต้นมา ดาราศาสตร์ ออก เป็น สอง สาขา ดาราศาสตร์ เชิง สังเกตการณ์ และ ดาราศาสตร์ เชิงทฤษฎี ดาราศาสตร์ เชิง สังเกตการณ์ ให้ความสำคัญ ไป ที่ การ เก็บ และ การ วิเคราะห์ ข้อมูล การ ความรู้ ทางกายภาพ เบื้องต้น เป็นหลัก ส่วน ดาราศาสตร์ เชิงทฤษฎี ให้ความสำคัญ ไป ที่ การพัฒนา คอมพิวเตอร์ แบบจำลอง เชิง วิเคราะห์ อธิบาย วัตถุ ท้องฟ้า และ ปรากฏการณ์ ต่าง ทั้งสอง สาขา เป็น องค์ประกอบ ซึ่งกันและกัน กล่าวคือ ดาราศาสตร์ เชิงทฤษฎี ใช้ อธิบาย ผล การ สังเกตการณ์ และ ดาราศาสตร์ เชิง สังเกตการณ์ ใช้ ใน การ รับรอง ผล จาก ทางทฤษฎี"

Cheers,

opened by vvorakit 1
Add a setup.py file so "pip" can be used to install
I was looking to quickly try your library (which looks great) by installing it with pip. The problem is that this doesn't work unless the project has a setup.py file that describes it.

Adding a file is easy, here's an example from one of my projects: https://github.com/EmilStenstrom/python-nutshell/blob/master/setup.py

After this is done your project can be installed by typing:

pip install https://github.com/datquocnguyen/RDRPOSTagger/archive/master.zip

If you also want to reserve a name and make your project even easier to install you could publish it on PyPI. Then installation would be as easy as:

pip install RDRPOSTagger

Let me know if I can do anything to help you.
opened by EmilStenstrom 1
Get RDRPOSTagger installed on python

Hi,

I've tried to download the zip file, but I could not figure out how to install RDRPOSTagger into Python. Could you help with this? I had a look at your website but couldn't find a help section.
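
Until the package is installable, one option is to call the script from another Python program as a subprocess. The sketch below builds the command in the form shown in the usage message quoted earlier on this page (python RDRPOSTagger.py tag MODEL LEXICON CORPUS). The helper names build_tag_command and run_tagger are hypothetical, and the assumption that the script must be run from the pSCRDRtagger directory of the unzipped release comes from the commands in the reports above, not from official documentation.

```python
import subprocess


def build_tag_command(model_path, lexicon_path, corpus_path, python="python"):
    """Build the argument list for the script's tag mode."""
    return [python, "RDRPOSTagger.py", "tag", model_path, lexicon_path, corpus_path]


def run_tagger(tagger_dir, model_path, lexicon_path, corpus_path):
    """Run the tagger script from tagger_dir; raises CalledProcessError on failure."""
    command = build_tag_command(model_path, lexicon_path, corpus_path)
    return subprocess.run(command, cwd=tagger_dir, check=True)


print(build_tag_command("../data/goldTrain.RDR", "../data/goldTrain.DICT", "../data/rawTest"))
```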

Many thanks!

opened by sangohanfly 1
Convert FullUsage.html to Markdown
Convert FullUsage.html to Markdown (USAGE.md) for easy access and editing, directly from within GitHub

See example at https://github.com/bact/RDRPOSTagger/blob/markdown-usage/USAGE.md

Add badges to README.md, as well as a link to USAGE.md

Wrap lines in license.txt for easier reading in a web browser
opened by bact 0
Follows PEP 8 Python code convention and format
Replace comparisons with None from == None to is None and from != None to is not None

Drop unnecessary comparisons where the intent is obvious, following the Python convention that any value that is not None, False, zero, or empty is true in a boolean test

Drop ; semicolons at the end of some statements

Format with black and remove trailing spaces

Use with context manager for open files
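
The conventions listed above can be illustrated with a small, self-contained function. It is a hypothetical example written for this summary, not code taken from the pull request.

```python
def read_nonempty_lines(path, encoding=None):
    """Return the non-blank lines of a text file without trailing newlines."""
    if encoding is None:  # identity test rather than `encoding == None`
        encoding = "utf-8"
    # The context manager closes the file even if reading raises an error.
    with open(path, encoding=encoding) as handle:
        # `if line.strip()` relies on truthiness rather than `!= ""`.
        return [line.rstrip("\n") for line in handle if line.strip()]
```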
opened by bact 0

Owner

Dat Quoc Nguyen

GitHub http://rdrpostagger.sourceforge.net

