An efficient and flexible token-based regular expression language and engine.

Last update: Jul 12, 2022

Related tags

Overview

OpenRegex

OpenRegex is written by Michael Schmitz at the Turing Center http://turing.cs.washington.edu/. It is licensed under the lesser GPL. Please see the LICENSE file for more details.

Introduction

OpenRegex is an efficient and flexible token-based regular expression language and engine. Most regular expression implementations are closed to run only over characters. Although this is the the most common application for regular expressions, OpenRegex does not have this restriction. OpenRegex is open to any sequences of user-defined objects.

Applied to Natural Language

For example, OpenRegex is used in the R2A2 extension to ReVerb, an open-domain information extractor, to determine argument boundaries. In this case, tokens are words in English sentences with additional information (the string of the word, the part-of-speech tag, and the chunk tag).

case class WordToken(string: String, postag: String, chunk: String)

Now that we have defined our token, we can build up a sentence (a NLP library such as OpenNLP can help out here). We will also need to define a way to translate each token in the expression (text between ) into an expression that can be applied to a word token.

  def compile(string: String): RegularExpression[WordToken] = {
    // create a parser for regular expression language that have
    // the same token representation
    val parser =
      new RegularExpressionParser[WordToken]() {
        // Translate an string "part=value" into a BaseExpression that
        // checks whether the part of a WordToken has value 'value'.
        override def factory(string: String): BaseExpression[WordToken] = {
          new BaseExpression[WordToken](string) {
            val Array(part, quotedValue) = string.split("=")
            val value = quotedValue.drop(1).take(quotedValue.size - 2)
            override def apply(entity: WordToken) = {
              part match {
                case "string" => entity.string equalsIgnoreCase value
                case "postag" => entity.postag equalsIgnoreCase value
                case "chunk" => entity.chunk equalsIgnoreCase value
              }
            }
          }
        }
      }

    parser.parse(string)
  }

Now we can compile a regular expression and apply it to a sentence. Consider the following pattern. The first line defines a non-matching group that matches a determiner ("a", "an", or "the"). The second line matches a sequence of part-of-speech tags ("JJ" is adjective, "NNP" is proper noun, and "NN" is common noun).

(?:<string='a'> | <string='an'> | <string='the'>)?
<postag="JJ">* <postag='NNP'>+ <postag='NN'>+ <postag='NNP'>+

We can try applying it to a couple of sentences.

The US president Barack Obama is travelling to Mexico.

    regex.find(sentence).groups.get(0) matches "The US president Barack Obama"

If all the ice melted from the frigid Earth continent Antarctica, sea levels would rise hundreds of feet.

    regex.find(sentence).groups.get(0) matches "the frigid Earth continent Antarctica"

We may want to pull out the text from certain parts of our match. We can do this with either named or unnamed groups. Consider the following new form of the pattern and the sentence in example 2.

      (?:<string="a"> | <string="an"> | <string="the">)? <postag="JJ">*
      (<arg1>:<postag='NNP'>+) (<rel>:<postag='NN'>+) (<arg2>:<postag='NNP'>+)

      regex.find(sentence).groups.get(0) matches "the frigid Earth continent Antarctica"
      regex.find(sentence).groups.get(1) matches "Earth"
      regex.find(sentence).groups.get(2) matches "continent"
      regex.find(sentence).groups.get(2) matches "Antarctica"

      regex.find(sentence).group("arg1") matches "Earth"
      regex.find(sentence).group("rel")  matches "continent"
      regex.find(sentence).group("arg2") matches "Antarctica"

Supported Constructs

The regular expression library supports the following constructs.

    | alternation
    ? option
    * Kleene-star
    + plus
    ^ beginning
    $ end
    {x,y}     match at least x but not more than y times
    ()        matching groups
    (?:)      non-matching groups
    (<name>:) named groups

Most of these operators work the same as in java.util.regex. Presently, however, alternation binds to its immediate neighbors. This means that <a> <b> | <c> means <a> (?:<b> | <c>) whereas in Java it would mean (?:<a> <b>) | <c>. This may change in a future release so it is advised that the alternation arguments be made explicit with non-matching groups.

All operators are greedy, and there are no non-greedy counterparts. Backreferences are not supported because the underlying representation only supports regular languages (backreferences are not regular).

Simple Java Example

The NLP example is rather complex but it shows the power of OpenRegex. For a simpler example, look at RegularExpressions.word. This is a static factory method for a simple word-based regular expression where only the string is considered. This factory is used in the test cases.

You can also play around with RegularExpressions.word by running the main method in RegularExpression and specifying an expression with arg1.

sbt 'run-main edu.washington.cs.knowitall.regex.RegularExpression "<the> <fat>* <cows> <are> <mooing> (?:<loudly>)?"'

Logic Expressions

Included is an engine for parsing and evaluating logic expressions. For example, you might want to extend the NLP regular expression language to be able to check multiple fields in a single regular expression token. If you assumed each regular expression token to be a logic expression, you could write patterns such as the following.

    <string="the" & postag="DT"> <postag="JJ"> <string="earth" | postag="NNP">

Extending the regular expression in this way is easy. It only involves rewriting the apply method in BaseExpression inside the compile method. Most of the code below existed before--now it's just moved outside the apply method.

    val logic = new LogicExpressionParser[WordToken] {
      override def factory(expr: String) = {
        new Arg.Pred[WordToken](expr) {
          val Array(part, quotedValue) = expr.split("=")
          val value = quotedValue.drop(1).take(quotedValue.size - 2)
          override def apply(entity: WordToken) = part match {
            case "string" => entity.string == value
            case "postag" => entity.postag == value
            case "chunk" => entity.chunk == value
          }
        }
      }
    }.parse(value)

    override def apply(entity: WordToken) = {
      logic.apply(entity)
    }

Play around with logic expression by using the main method in LogicExpression.

sbt 'run-main edu.washington.cs.knowitall.logic.LogicExpression'

You can enter logic expressions such as "true & false" or "true | false" and have them evaluated interactively.

Implementation

Regular expressions are evaluated using Thomson NFA, which is fast and does not have the pathological cases that most regular expression libraries have. For more information about Thomson NFA in comparison to recursive backtracking, read http://swtch.com/~rsc/regexp/regexp1.html. Future work may involve compiling NFAs to DFAs.

Future Work

Compile to DFA.
Use parser combinators for parsing regular expressions.

Comments

idea for performance optimization
Here's an idea for an optimization: RegularExpression.find() contains this loop:

for (int i = start; i < tokens.size(); i++) ...

This seems to loop even when no match is possible. For example, if my pattern is "abc", the minimum length of a text that can match is 3. So if my input text is shorter than the minimum pattern length, I can stop immediately. More general, I don't have to iterate up to tokens.size() but only to tokens.size()-minPatternLen.

Finding the minimum pattern length shouldn't be too difficult, e.g. Expression.Star is always 0, Expression.Or is the minimum of both sub expressions etc.
opened by danielnaber 7
Min/max a.k.a {x,y}

I needed a min/max expression, so I wrote one. I'm not sure if my approach is the best one, but it seems to work for me. I know the use of short as a data type is unusual for Java, you might want to re-consider that. Sorry for being too lazy to write a Scala test case.

opened by danielnaber 7
Fix tokenization of tokens

The tokenization code is presently one big function. It needs to be broken up into smaller functions. Particularly so those function can be re-used by others (see https://github.com/knowitall/taggers/pull/13).

Also, the special case code for quotations ' and " should be customizable.

opened by schmmd 2
Match.endIndex() is exclusive

When Match.endIndex() returns 3, it's actually the token at index 2 which is the last matching token. In this context, I think it would be more natural if the endIndex really pointed to the last matching token. Of course, then endIndex - startIndex will be zero for matches of length one...

Alternatively, the javadoc could be reworded, it currently says: "the index of the last token matched."

opened by danielnaber 2
Add efficiency check to determine if a pattern could possible match n tokens

If the pattern couldn't possible match n tokens, abort without running the automaton. This pull request adds a minMatchingLength method to the parts of the automaton so this check can be performed. This addresses https://github.com/knowitall/openregex/issues/8.

opened by schmmd 2
add a new constructor for users who don't have a string representation

Maybe I'm missing something, but this constructor seems useful for users who don't have a string representation of their expression. For example, I already have a List of objects that I can convert into a List of Expressions.

opened by danielnaber 2
RegularExpressionParsers should be public?

I assume that word and character in RegularExpressionParsers are intended to be used as part of the public interface? It seems that since RegularExpressionParsers is itself not marked public, it makes it impossible to access the word and character fields. At least it does when trying to access it via clojure.

opened by akhudek 1
backreferences?

Is it correct that OpenRegex does not support backreferences and that this cannot easily be added due to restrictions of the underlying formalism? If so, maybe this should be added to the documentation. Similar, are the expressions greedy or not? This might only be obvious for people familiar with regular expressions.

opened by danielnaber 1
Configurable and modular parsing code

Split the parsing code from the compiling/execution code so that it is configurable. Also, improve default tokenization to handle escaped quotations.

opened by schmmd 0
Help with Openregex constructs
Can we use ? construct as reluctant quantifier (as it is in normal java regular expressions) too?

Why does the following regular expression doesn't pick "is found" from the specified string. I am assuming that <string='.'>* means zero or more string of any length. Regular expression: < string='but' > < string='now' > < string='.' > * ( < pos='VB.?' > * ) Sentence: He was lost , but now he is found.
opened by p6jain 2

Releases(v1.1.1)

v1.1.1(Oct 8, 2013)
Fix regular expressions for identifying quotations.

Source code(tar.gz)
Source code(zip)

An efficient and flexible token-based regular expression language and engine.

Related tags

Overview

OpenRegex

Introduction

Applied to Natural Language

Supported Constructs

Simple Java Example

Logic Expressions

Implementation

Future Work

Comments

Releases(v1.1.1)

v1.1.1(Oct 8, 2013)

Owner

KnowItAll

For English vocabulary analysis and sentence analysis in natural language, model trainin

A fast and accurate POS and morphological tagging toolkit (EACL 2014)

Twitter Text Libraries. This code is used at Twitter to tokenize and parse text to meet the expectations for what can be used on the platform.

ReDoSHunter: A Combined Static and Dynamic Approach for Regular Expression DoS Detection

Spring Boot Refresh Token using JWT example - Expire and Renew JWT Token

lazy-language-loader improves loading times when changing your language by only reloading the language instead of all the game resources!

Converting Infix to Postfix and Evaluating Postfix Expression. (Java & Scala).

EvalEx is a handy expression evaluator for Java, that allows to evaluate expressions.

Drools is a rule engine, DMN engine and complex event processing (CEP) engine for Java.

Java 8 optimized, memory efficient, speedy template engine producing statically typed, plain java objects

An extension for Keycloak, that enables web-based sign in with Apple and token exchange

Java rate limiting library based on token/leaky-bucket algorithm.

A complete and performing library to highlight text snippets (EditText, SpannableString and TextView) using Spannable with Regular Expressions (Regex) for Android.

Annotation processor to create immutable objects and builders. Feels like Guava's immutable collections but for regular value objects. JSON, Jackson, Gson, JAX-RS integrations included

Java regular expressions made easy.

Human friendly alternative to Regular Expressions

Easy-to-use placeholder library to focus on replacement of regular expressions inside a text.

A plugin of Grasscutter for send regular notice.

This application can recognize the sign language alphabets and help people who do not understand sign language to communicate with the speech and hearing impaired.

Kotlin-decompiled - (Almost) every single language construct of the Kotlin programming language compiled to JVM bytecode and then decompiled to Java again for better readability