An efficient and flexible token-based regular expression language and engine.

Overview

OpenRegex

OpenRegex is written by Michael Schmitz at the Turing Center http://turing.cs.washington.edu/. It is licensed under the lesser GPL. Please see the LICENSE file for more details.

Introduction

OpenRegex is an efficient and flexible token-based regular expression language and engine. Most regular expression implementations are closed to run only over characters. Although this is the the most common application for regular expressions, OpenRegex does not have this restriction. OpenRegex is open to any sequences of user-defined objects.

Applied to Natural Language

For example, OpenRegex is used in the R2A2 extension to ReVerb, an open-domain information extractor, to determine argument boundaries. In this case, tokens are words in English sentences with additional information (the string of the word, the part-of-speech tag, and the chunk tag).

case class WordToken(string: String, postag: String, chunk: String)

Now that we have defined our token, we can build up a sentence (a NLP library such as OpenNLP can help out here). We will also need to define a way to translate each token in the expression (text between ) into an expression that can be applied to a word token.

  def compile(string: String): RegularExpression[WordToken] = {
    // create a parser for regular expression language that have
    // the same token representation
    val parser =
      new RegularExpressionParser[WordToken]() {
        // Translate an string "part=value" into a BaseExpression that
        // checks whether the part of a WordToken has value 'value'.
        override def factory(string: String): BaseExpression[WordToken] = {
          new BaseExpression[WordToken](string) {
            val Array(part, quotedValue) = string.split("=")
            val value = quotedValue.drop(1).take(quotedValue.size - 2)
            override def apply(entity: WordToken) = {
              part match {
                case "string" => entity.string equalsIgnoreCase value
                case "postag" => entity.postag equalsIgnoreCase value
                case "chunk" => entity.chunk equalsIgnoreCase value
              }
            }
          }
        }
      }

    parser.parse(string)
  }

Now we can compile a regular expression and apply it to a sentence. Consider the following pattern. The first line defines a non-matching group that matches a determiner ("a", "an", or "the"). The second line matches a sequence of part-of-speech tags ("JJ" is adjective, "NNP" is proper noun, and "NN" is common noun).

(?:<string='a'> | <string='an'> | <string='the'>)?
<postag="JJ">* <postag='NNP'>+ <postag='NN'>+ <postag='NNP'>+

We can try applying it to a couple of sentences.

  1. The US president Barack Obama is travelling to Mexico.
    regex.find(sentence).groups.get(0) matches "The US president Barack Obama"
  1. If all the ice melted from the frigid Earth continent Antarctica, sea levels would rise hundreds of feet.
    regex.find(sentence).groups.get(0) matches "the frigid Earth continent Antarctica"

We may want to pull out the text from certain parts of our match. We can do this with either named or unnamed groups. Consider the following new form of the pattern and the sentence in example 2.

      (?:<string="a"> | <string="an"> | <string="the">)? <postag="JJ">*
      (<arg1>:<postag='NNP'>+) (<rel>:<postag='NN'>+) (<arg2>:<postag='NNP'>+)

      regex.find(sentence).groups.get(0) matches "the frigid Earth continent Antarctica"
      regex.find(sentence).groups.get(1) matches "Earth"
      regex.find(sentence).groups.get(2) matches "continent"
      regex.find(sentence).groups.get(2) matches "Antarctica"

      regex.find(sentence).group("arg1") matches "Earth"
      regex.find(sentence).group("rel")  matches "continent"
      regex.find(sentence).group("arg2") matches "Antarctica"

Supported Constructs

The regular expression library supports the following constructs.

    | alternation
    ? option
    * Kleene-star
    + plus
    ^ beginning
    $ end
    {x,y}     match at least x but not more than y times
    ()        matching groups
    (?:)      non-matching groups
    (<name>:) named groups

Most of these operators work the same as in java.util.regex. Presently, however, alternation binds to its immediate neighbors. This means that <a> <b> | <c> means <a> (?:<b> | <c>) whereas in Java it would mean (?:<a> <b>) | <c>. This may change in a future release so it is advised that the alternation arguments be made explicit with non-matching groups.

All operators are greedy, and there are no non-greedy counterparts. Backreferences are not supported because the underlying representation only supports regular languages (backreferences are not regular).

Simple Java Example

The NLP example is rather complex but it shows the power of OpenRegex. For a simpler example, look at RegularExpressions.word. This is a static factory method for a simple word-based regular expression where only the string is considered. This factory is used in the test cases.

You can also play around with RegularExpressions.word by running the main method in RegularExpression and specifying an expression with arg1.

sbt 'run-main edu.washington.cs.knowitall.regex.RegularExpression "<the> <fat>* <cows> <are> <mooing> (?:<loudly>)?"'

Logic Expressions

Included is an engine for parsing and evaluating logic expressions. For example, you might want to extend the NLP regular expression language to be able to check multiple fields in a single regular expression token. If you assumed each regular expression token to be a logic expression, you could write patterns such as the following.

    <string="the" & postag="DT"> <postag="JJ"> <string="earth" | postag="NNP">

Extending the regular expression in this way is easy. It only involves rewriting the apply method in BaseExpression inside the compile method. Most of the code below existed before--now it's just moved outside the apply method.

    val logic = new LogicExpressionParser[WordToken] {
      override def factory(expr: String) = {
        new Arg.Pred[WordToken](expr) {
          val Array(part, quotedValue) = expr.split("=")
          val value = quotedValue.drop(1).take(quotedValue.size - 2)
          override def apply(entity: WordToken) = part match {
            case "string" => entity.string == value
            case "postag" => entity.postag == value
            case "chunk" => entity.chunk == value
          }
        }
      }
    }.parse(value)

    override def apply(entity: WordToken) = {
      logic.apply(entity)
    }

Play around with logic expression by using the main method in LogicExpression.

sbt 'run-main edu.washington.cs.knowitall.logic.LogicExpression'

You can enter logic expressions such as "true & false" or "true | false" and have them evaluated interactively.

Implementation

Regular expressions are evaluated using Thomson NFA, which is fast and does not have the pathological cases that most regular expression libraries have. For more information about Thomson NFA in comparison to recursive backtracking, read http://swtch.com/~rsc/regexp/regexp1.html. Future work may involve compiling NFAs to DFAs.

Future Work

  1. Compile to DFA.
  2. Use parser combinators for parsing regular expressions.
Comments
  • idea for performance optimization

    idea for performance optimization

    Here's an idea for an optimization: RegularExpression.find() contains this loop:

    for (int i = start; i < tokens.size(); i++) ...
    

    This seems to loop even when no match is possible. For example, if my pattern is "abc", the minimum length of a text that can match is 3. So if my input text is shorter than the minimum pattern length, I can stop immediately. More general, I don't have to iterate up to tokens.size() but only to tokens.size()-minPatternLen.

    Finding the minimum pattern length shouldn't be too difficult, e.g. Expression.Star is always 0, Expression.Or is the minimum of both sub expressions etc.

    opened by danielnaber 7
  • Min/max a.k.a {x,y}

    Min/max a.k.a {x,y}

    I needed a min/max expression, so I wrote one. I'm not sure if my approach is the best one, but it seems to work for me. I know the use of short as a data type is unusual for Java, you might want to re-consider that. Sorry for being too lazy to write a Scala test case.

    opened by danielnaber 7
  • Fix tokenization of tokens

    Fix tokenization of tokens

    The tokenization code is presently one big function. It needs to be broken up into smaller functions. Particularly so those function can be re-used by others (see https://github.com/knowitall/taggers/pull/13).

    Also, the special case code for quotations ' and " should be customizable.

    opened by schmmd 2
  • Match.endIndex() is exclusive

    Match.endIndex() is exclusive

    When Match.endIndex() returns 3, it's actually the token at index 2 which is the last matching token. In this context, I think it would be more natural if the endIndex really pointed to the last matching token. Of course, then endIndex - startIndex will be zero for matches of length one...

    Alternatively, the javadoc could be reworded, it currently says: "the index of the last token matched."

    opened by danielnaber 2
  • Add efficiency check to determine if a pattern could possible match n tokens

    Add efficiency check to determine if a pattern could possible match n tokens

    If the pattern couldn't possible match n tokens, abort without running the automaton. This pull request adds a minMatchingLength method to the parts of the automaton so this check can be performed. This addresses https://github.com/knowitall/openregex/issues/8.

    opened by schmmd 2
  • add a new constructor for users who don't have a string representation

    add a new constructor for users who don't have a string representation

    Maybe I'm missing something, but this constructor seems useful for users who don't have a string representation of their expression. For example, I already have a List of objects that I can convert into a List of Expressions.

    opened by danielnaber 2
  • RegularExpressionParsers  should be public?

    RegularExpressionParsers should be public?

    I assume that word and character in RegularExpressionParsers are intended to be used as part of the public interface? It seems that since RegularExpressionParsers is itself not marked public, it makes it impossible to access the word and character fields. At least it does when trying to access it via clojure.

    opened by akhudek 1
  • backreferences?

    backreferences?

    Is it correct that OpenRegex does not support backreferences and that this cannot easily be added due to restrictions of the underlying formalism? If so, maybe this should be added to the documentation. Similar, are the expressions greedy or not? This might only be obvious for people familiar with regular expressions.

    opened by danielnaber 1
  • Configurable and modular parsing code

    Configurable and modular parsing code

    Split the parsing code from the compiling/execution code so that it is configurable. Also, improve default tokenization to handle escaped quotations.

    opened by schmmd 0
  • Help with Openregex constructs

    Help with Openregex constructs

    1. Can we use ? construct as reluctant quantifier (as it is in normal java regular expressions) too?
    2. Why does the following regular expression doesn't pick "is found" from the specified string. I am assuming that <string='.'>* means zero or more string of any length. Regular expression: < string='but' > < string='now' > < string='.' > * ( < pos='VB.?' > * ) Sentence: He was lost , but now he is found.
    opened by p6jain 2
Releases(v1.1.1)
Owner
KnowItAll
KnowItAll
For English vocabulary analysis and sentence analysis in natural language, model trainin

Sword Come ?? For English vocabulary analysis and sentence analysis in natural language, model training, intelligent response and emotion analysis rea

James Zow 2 Apr 9, 2022
A fast and accurate POS and morphological tagging toolkit (EACL 2014)

RDRPOSTagger RDRPOSTagger is a robust and easy-to-use toolkit for POS and morphological tagging. It employs an error-driven approach to automatically

Dat Quoc Nguyen 137 Sep 9, 2022
Twitter Text Libraries. This code is used at Twitter to tokenize and parse text to meet the expectations for what can be used on the platform.

twitter-text This repository is a collection of libraries and conformance tests to standardize parsing of Tweet text. It synchronizes development, tes

Twitter 2.9k Jan 8, 2023
ReDoSHunter: A Combined Static and Dynamic Approach for Regular Expression DoS Detection

ReDoSHunter ReDoSHunter is a combined static and dynamic approach for regular expression DoS detection. LATEST NOTE (updated at 2021.09.13): ReDoSHunt

Yeting Li 43 Dec 23, 2022
Spring Boot Refresh Token using JWT example - Expire and Renew JWT Token

Spring Boot Refresh Token with JWT example Build JWT Refresh Token in the Java Spring Boot Application. You can know how to expire the JWT, then renew

null 152 Dec 28, 2022
lazy-language-loader improves loading times when changing your language by only reloading the language instead of all the game resources!

lazy-language-loader lazy-language-loader improves loading times when changing your language by only reloading the language instead of all the game re

Shalom Ademuwagun 7 Sep 7, 2022
Converting Infix to Postfix and Evaluating Postfix Expression. (Java & Scala).

GUI-Calculator-Infix-to-Postfix Converting Infix to Postfix and Evaluating Postfix Expression. (Java & Scala). Converting Infix to Postfix and Evaluat

Mohamed Abdalazez 4 Sep 22, 2022
EvalEx is a handy expression evaluator for Java, that allows to evaluate expressions.

EvalEx - Java Expression Evaluator EvalEx is a handy expression evaluator for Java, that allows to parse and evaluate expression strings. Key Features

null 18 Sep 18, 2022
Drools is a rule engine, DMN engine and complex event processing (CEP) engine for Java.

An open source rule engine, DMN engine and complex event processing (CEP) engine for Java™ and the JVM Platform. Drools is a business rule management

KIE (Drools, OptaPlanner and jBPM) 4.9k Dec 31, 2022
Java 8 optimized, memory efficient, speedy template engine producing statically typed, plain java objects

Rocker Templates by Fizzed Fizzed, Inc. (Follow on Twitter: @fizzed_inc) Sponsored by Rocker is proudly sponsored by Greenback. We love the service an

Fizzed, Inc. 669 Dec 29, 2022
An extension for Keycloak, that enables web-based sign in with Apple and token exchange

Apple Identity Provider for Keycloak ?? This repository represents an extension for Keycloak, which enables Sign in with Apple for web-based applicati

Klaus Betz 58 Dec 29, 2022
Java rate limiting library based on token/leaky-bucket algorithm.

Java rate-limiting library based on token-bucket algorithm. Advantages of Bucket4j Implemented on top of ideas of well known algorithm, which are by d

Vladimir Bukhtoyarov 1.7k Jan 8, 2023
A complete and performing library to highlight text snippets (EditText, SpannableString and TextView) using Spannable with Regular Expressions (Regex) for Android.

Highlight A complete and performing library to highlight text snippets (EditText/Editable and TextView) using Spannable with Regular Expressions (Rege

Irineu A. Silva 16 Dec 22, 2022
Annotation processor to create immutable objects and builders. Feels like Guava's immutable collections but for regular value objects. JSON, Jackson, Gson, JAX-RS integrations included

Read full documentation at http://immutables.org // Define abstract value type using interface, abstract class or annotation @Value.Immutable public i

Immutables 3.2k Dec 31, 2022
Java regular expressions made easy.

JavaVerbalExpressions VerbalExpressions is a Java library that helps to construct difficult regular expressions. Getting Started Maven Dependency: <de

null 2.6k Dec 30, 2022
Human friendly alternative to Regular Expressions

Cucumber Expressions Cucumber Expressions is an alternative to Regular Expressions with a more intuitive syntax. Try Cucumber Expressions in your brow

Cucumber 86 Jan 8, 2023
Easy-to-use placeholder library to focus on replacement of regular expressions inside a text.

WPlaceholder Flexible library to replace placeholders inside messages/texts with Regex expressions. For what it's good for Placeholder replacement is

Luiz Otávio 7 Oct 11, 2022
A plugin of Grasscutter for send regular notice.

MeaNotice - Grasscutter Regular Notice Plugin MeaNotice is a plugin of Grasscutter, you can use this plugin to publish notifications in-game regularly

ButterCookies 39 Oct 17, 2022
This application can recognize the sign language alphabets and help people who do not understand sign language to communicate with the speech and hearing impaired.

Sign Language Recognition App This application can recognize the sign language alphabets and help people who do not understand sign language to commun

Mihir Gandhi 12 Oct 7, 2021
Kotlin-decompiled - (Almost) every single language construct of the Kotlin programming language compiled to JVM bytecode and then decompiled to Java again for better readability

Kotlin: Decompiled (Almost) every single language construct of the Kotlin programming language compiled to JVM bytecode and then decompiled to Java ag

The Self-Taught Software Engineer 27 Dec 14, 2022