Beagle helps you identify keywords, phrases, regexes, and complex search queries of interest in streams of text documents.

Overview

Beagle


Beagle is a detector of interesting things in text. Its intended use is in-stream search applications. Suppose you need to monitor a stream of text documents such as web crawl results, chat messages, or corporate documents in order to identify keywords, phrases, regexes, and complex search queries of interest. With Beagle you can quickly be up and running with such a system, allowing you to focus on productively monitoring your documents.

Beagle is based on the Lucene monitor library, which in turn is based on Luwak.

Components

Phrase Annotator Usage

(require '[beagle.phrases :as phrases])

(let [dictionary [{:text "to be annotated" :id "1"}]
      highlighter-fn (phrases/highlighter dictionary)]
  (highlighter-fn "before annotated to be annotated after annotated"))
=> ({:text "to be annotated", :type "LABEL", :dict-entry-id "1", :meta {}, :begin-offset 17, :end-offset 32})

;; Case sensitivity is controlled per dictionary entry 
(let [dictionary [{:text "TO BE ANNOTATED" :id "1" :case-sensitive? false}]
      highlighter-fn (phrases/highlighter dictionary)]
  (highlighter-fn "before annotated to be annotated after annotated"))
=> ({:text "to be annotated", :type "LABEL", :dict-entry-id "1", :meta {}, :begin-offset 17, :end-offset 32})

;; ASCII folding is controlled per dictionary entry
(let [dictionary [{:text "TÖ BE ÄNNÖTÄTED" :id "1" :case-sensitive? false :ascii-fold? true}]
      highlighter-fn (phrases/highlighter dictionary)]
  (highlighter-fn "before annotated to be annotated after annotated"))
=> ({:text "to be annotated", :type "LABEL", :dict-entry-id "1", :meta {}, :begin-offset 17, :end-offset 32})

;; Stemming is supported for multiple languages per dictionary entry
(let [dictionary [{:text "Kaunas" :id "1" :stem? true :stemmer :lithuanian}]
      highlighter-fn (phrases/highlighter dictionary)]
  (highlighter-fn "Kauno miestas"))
=> ({:text "Kauno", :type "PHRASE", :dict-entry-id "1", :meta {}, :begin-offset 0, :end-offset 5})

;; Phrases also support slop (i.e. the edit distance between terms) per dictionary entry
(let [txt "before start and end after"
      dictionary [{:text "start end" :id "1" :slop 1}]
      highlighter-fn (phrases/highlighter dictionary)]
  (highlighter-fn txt))
=> ({:text "start and end", :type "PHRASE", :dict-entry-id "1", :meta {}, :begin-offset 7, :end-offset 20})

;; Every phrase can specify which tokenizer to use
(let [txt "[URGENT!] Do this immediately!"
      dictionary [{:text "[URGENT!]" :id "a" :tokenizer :whitespace}
                  {:text "[URGENT!]" :id "b" :tokenizer :standard}]
      highlighter-fn (phrases/highlighter dictionary)]
  (clojure.pprint/pprint (highlighter-fn txt)))
=> 
({:text "[URGENT!]",
  :type "PHRASE",
  :dict-entry-id "a",
  :meta {},
  :begin-offset 0,
  :end-offset 9}
 {:text "URGENT",
  :type "PHRASE",
  :dict-entry-id "b",
  :meta {},
  :begin-offset 1,
  :end-offset 7})

;; Ensure that phrase terms are matched in the provided order
;; e.g. NOT preserving order (default)
(let [txt "Mill Token"
      dictionary [{:text "Token Mill" :slop 2 :in-order? false}]
      highlighter-fn (phrases/highlighter dictionary)]
  (highlighter-fn txt))
=> [{:text "Mill Token" :type "PHRASE" :dict-entry-id "0" :meta {} :begin-offset 0 :end-offset 10}]
;; e.g. Preserving order
(let [txt "Mill Token"
      dictionary [{:text "Token Mill" :slop 2 :in-order? true}]
      highlighter-fn (phrases/highlighter dictionary)]
  (highlighter-fn txt))
=> ()

Java Interface to the Phrase Highlighter

Example:

import lt.tokenmill.beagle.phrases.Annotation;
import lt.tokenmill.beagle.phrases.Annotator;
import lt.tokenmill.beagle.phrases.DictionaryEntry;

import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;

public class Main {
    public static void main(String[] args) {
        DictionaryEntry dictionaryEntry = new DictionaryEntry("test phrase");
        Annotator annotator = new Annotator(Arrays.asList(dictionaryEntry));
        Collection<Annotation> annotations = annotator.annotate("This is my test phrase");
        annotations.forEach(s -> System.out.println("Annotated: \'" + s.text() + "\' at offset: " + s.beginOffset() + ":" + s.endOffset()));
    }
}

// => Annotated: 'test phrase' at offset: 11:22

The available options for the Java API are explained with examples in the Java Interface for Phrase Highlighting wiki page.

All the options available in the Clojure interface are also available in Java. Just convert the Clojure keywords to Java strings, e.g.

:case-sensitive? => "case-sensitive?"

Project Setup with Maven

The library is deployed in the Maven Central Repository and you can just add the beagle dependency to your pom.xml:

<dependency>
    <groupId>lt.tokenmill</groupId>
    <artifactId>beagle</artifactId>
    <version>0.3.1</version>
</dependency>

Lucene Query Support

Examples:

(require '[beagle.lucene-alpha :as lucene])

(let [txt "some text this other that"
      dictionary [{:text "this AND that" :id "1" :slop 1}]
      annotator-fn (lucene/annotator dictionary)]
  (annotator-fn txt {}))
=> ({:text "this AND that", :type "QUERY", :dict-entry-id "1", :meta {}})

Performance

The performance was measured on a desktop PC with Ubuntu 19.04 and 8-core Ryzen 1700.

The test setup used news articles and a dictionary made up of names of cities in the USA.

Code and data for benchmarking and more benchmarks can be found here.

Single-thread

Average time spent per document ranged from 1.58 ms for a dictionary of 5k phrases to 4.58 ms for 80k phrases.


Throughput of docs analyzed ranged from 626 docs/sec for a dictionary of 5k phrases to 210 docs/sec for 80k phrases.


Max time spent per document has a couple of spikes where processing a document takes ~1000 ms. These spikes were most likely caused either by GC pauses or by JVM deoptimizations. Aside from those spikes, max time grows steadily from 15 ms to 72 ms as the dictionary size grows.

Min time spent per document is fairly stable for any dictionary size and is about 0.45 ms. Most likely these are the cases when the Presearcher hasn't found any candidate queries to run against the document.


Multi-threaded

Using a core.async pipeline, the time spent per document ranged from 3.38 ms for a dictionary of 5k phrases to 15.34 ms for 80k phrases.


Total time spent to process all 10k docs ranged from 2412 ms for a dictionary of 5k phrases to 12595 ms for 80k phrases.


Throughput of docs analyzed ranged from 4143 docs/sec for a dictionary of 5k phrases to 793 docs/sec for 80k phrases.


Max time spent per document rose fairly steadily from 24.15 ms for a dictionary of 10k phrases to 113.45 ms for 60k phrases.

Min time spent per document varied from 0.6 ms for a dictionary of 10k phrases to 1.1 ms for 55k phrases.


Conclusions about Performance

On average, processing a single document is roughly 3x faster in single-thread mode than in multi-threaded mode, but even in multi-threaded mode a document rarely takes more than 10 ms.

In multi-threaded mode, throughput grows almost linearly with the number of CPU cores: 4143/8 = 518 docs per core per sec in multi-threaded mode versus 626 docs per core per sec in single-thread mode.
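
The arithmetic behind these figures can be re-checked with a few lines of Java. This is only a sketch: the class and method names are made up for illustration, and the inputs are the 5k-phrase numbers reported above (10k docs in 2412 ms, 4143 docs/sec on 8 cores, 626 docs/sec single-thread).

```java
public class ThroughputCheck {

    // Documents per second for a batch of `docs` processed in `totalMillis`.
    static double docsPerSecond(int docs, double totalMillis) {
        return docs / (totalMillis / 1000.0);
    }

    // Throughput attributed to each core when the work is spread over `cores`.
    static double perCore(double docsPerSecond, int cores) {
        return docsPerSecond / cores;
    }

    public static void main(String[] args) {
        // Multi-threaded run, 5k phrases: 10k docs in 2412 ms.
        System.out.printf("multi-threaded: %.0f docs/sec%n", docsPerSecond(10_000, 2412));

        // Reported multi-threaded throughput spread over 8 cores.
        double multiPerCore = perCore(4143, 8);
        System.out.printf("per core: %.0f docs/sec%n", multiPerCore);

        // Share of the single-thread rate (626 docs/sec) retained per core.
        System.out.printf("scaling efficiency: %.0f%%%n", 100 * multiPerCore / 626);
    }
}
```

The small gap between the throughput recomputed from total time and the reported 4143 docs/sec is presumably due to rounding or measurement overhead in the benchmark.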

Dictionary Readers

Three file formats are supported: csv, edn, json.

CSV Dictionary Format

Separator: ","

Escape: """

The first line MUST be a header.

Supported header keys: ["text" "type" "id" "synonyms" "case-sensitive?" "ascii-fold?" "meta"]

Order is not important.

Under synonyms, there should be a list of strings separated by ";". Under meta, there should also be a list of strings separated by ";". An even number of strings is expected; if the number is odd, the last string is ignored.
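
To make the cell conventions concrete, here is a minimal Java sketch of how the synonyms and meta cells could be interpreted. It is not Beagle's reader: the class and method names are hypothetical, treating the meta strings as alternating keys and values is an assumption based on the even-count rule, and skipping empty synonym parts is a choice made for this sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CsvCellSketch {

    // Splits a "synonyms" cell such as "NYC;New York City" on ";".
    // Empty parts are skipped (a choice made for this sketch).
    static List<String> parseSynonyms(String cell) {
        List<String> synonyms = new ArrayList<>();
        if (cell == null || cell.isEmpty()) {
            return synonyms;
        }
        for (String part : cell.split(";")) {
            if (!part.isEmpty()) {
                synonyms.add(part);
            }
        }
        return synonyms;
    }

    // Reads a "meta" cell as alternating keys and values (an assumption).
    // A trailing string without a partner is ignored.
    static Map<String, String> parseMeta(String cell) {
        Map<String, String> meta = new LinkedHashMap<>();
        if (cell == null || cell.isEmpty()) {
            return meta;
        }
        String[] parts = cell.split(";");
        for (int i = 0; i + 1 < parts.length; i += 2) {
            meta.put(parts[i], parts[i + 1]);
        }
        return meta;
    }

    public static void main(String[] args) {
        System.out.println(parseSynonyms("NYC;New York City"));
        // prints [NYC, New York City]
        System.out.println(parseMeta("country;USA;state;NY;orphan"));
        // prints {country=USA, state=NY}
    }
}
```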

Dictionary Validator

Accepts any number of dictionaries to validate as long as they are provided in pairs as '"/path/to/dictionary/file" "file-type"'

Supported File Types

  • csv
  • json
  • edn

Output

  • If any dictionary is invalid, an exception is thrown and the process exits with status 1

Usage

Clojure

To use the validator directly, execute the command: clj -m beagle.validator "/path/to/dictionary/file" "file-type" "/path/to/dictionary/file2" "file-type" & ...

Example:
clj -m beagle.validator "your-dict.csv" "csv" "your-other-dict.json" "json"

Docker

Example in Gitlab CI:

validate-dictionaries:
  stage: dictionary-validation
  when: always
  image: tokenmill/beagle-dictionary-validator
  script:
    - >
      dictionary-validator
      /path/to/dict.csv csv
      /path/to/dict.json json
      /path/to/dict.edn edn

Dictionary Optimizer

Supported optimizations:

  • Remove duplicate dictionary entries
  • Merge synonyms
  • Synonyms and text equality check

There are cases when dictionary entries can't be merged:

  • Differences in text analysis

Examples:

(require '[beagle.dictionary-optimizer :as optimizer])

; Remove duplicates
(let [dictionary [{:text "TO BE ANNOTATED" :id "1"}
                  {:text "TO BE ANNOTATED"}]]
  (optimizer/optimize dictionary))
=> ({:text "TO BE ANNOTATED", :id "1"})

; Merge synonyms
(let [dictionary [{:text "TO BE ANNOTATED" :synonyms ["ONE"]}
                  {:text "TO BE ANNOTATED" :synonyms ["TWO"]}]]
  (optimizer/optimize dictionary))
=> ({:text "TO BE ANNOTATED", :synonyms ("TWO" "ONE")})

; Synonyms and text equality check
(let [dictionary [{:text "TO BE ANNOTATED" :synonyms ["TO BE ANNOTATED"]}]]
  (optimizer/optimize dictionary))
=> ({:text "TO BE ANNOTATED", :synonyms ["TO BE ANNOTATED"]})

; Can't be merged because of differences in text analysis
(let [dictionary [{:text "TO BE ANNOTATED" :case-sensitive? true}
                  {:text "TO BE ANNOTATED" :case-sensitive? false}]]
  (optimizer/optimize dictionary))
=> ({:text "TO BE ANNOTATED", :case-sensitive? true} {:text "TO BE ANNOTATED", :case-sensitive? false})
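
The merge rule illustrated above can be sketched in Java as grouping entries by text plus analysis settings and taking the union of their synonyms. This is an illustration only, not the optimizer's implementation: the Entry class is hypothetical, only case sensitivity is modelled as an analysis setting, and synonym order follows insertion order rather than the order shown in the Clojure output.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SynonymMergeSketch {

    // Hypothetical, simplified dictionary entry.
    static final class Entry {
        final String text;
        final boolean caseSensitive;
        final Set<String> synonyms;

        Entry(String text, boolean caseSensitive, Collection<String> synonyms) {
            this.text = text;
            this.caseSensitive = caseSensitive;
            this.synonyms = new LinkedHashSet<>(synonyms);
        }
    }

    // Entries with the same text AND the same analysis settings are merged;
    // entries that differ in analysis settings are kept apart.
    static List<Entry> optimize(List<Entry> entries) {
        Map<String, Entry> merged = new LinkedHashMap<>();
        for (Entry e : entries) {
            String key = e.text + "\u0000" + e.caseSensitive;
            Entry existing = merged.get(key);
            if (existing == null) {
                merged.put(key, new Entry(e.text, e.caseSensitive, e.synonyms));
            } else {
                existing.synonyms.addAll(e.synonyms);
            }
        }
        return new ArrayList<>(merged.values());
    }

    public static void main(String[] args) {
        List<Entry> dictionary = List.of(
                new Entry("TO BE ANNOTATED", true, List.of("ONE")),
                new Entry("TO BE ANNOTATED", true, List.of("TWO")),
                new Entry("TO BE ANNOTATED", false, List.of()));
        for (Entry e : optimize(dictionary)) {
            System.out.println(e.text + " case-sensitive=" + e.caseSensitive + " " + e.synonyms);
        }
    }
}
```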

Annotation Merger

Only annotations of the same type are merged.

Handled cases:

  • Duplicate annotations
  • Nested annotations

Examples:

(require '[beagle.annotation-merger :as merger]
         '[beagle.phrases :as phrases])

(let [dictionary [{:text "TEST"}
                  {:text "This TEST is"}]
      highlighter-fn (phrases/highlighter dictionary)
      annotations (highlighter-fn "This TEST is")]
  (println "Annotations: " annotations)
  (merger/merge-same-type-annotations annotations))
Annotations:  ({:text TEST, :type PHRASE, :dict-entry-id 0, :meta {}, :begin-offset 5, :end-offset 9} {:text This TEST is, :type PHRASE, :dict-entry-id 1, :meta {}, :begin-offset 0, :end-offset 12})
=> ({:text "This TEST is", :type "PHRASE", :dict-entry-id "1", :meta {}, :begin-offset 0, :end-offset 12})

;; You can also request merging inline by passing an option to the highlighter
(let [dictionary [{:text "TEST"}
                  {:text "This TEST is"}]
      highlighter-fn (phrases/highlighter dictionary)]
  (highlighter-fn "This TEST is" {:merge-annotations? true}))
=> ({:text "This TEST is", :type "PHRASE", :dict-entry-id "1", :meta {}, :begin-offset 0, :end-offset 12})
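
One plausible reading of the two handled cases, sketched in Java: an annotation is dropped when another annotation of the same type covers its span, and of several exact duplicates only the first is kept. The Span class and the mergeSameType method are hypothetical and are not part of Beagle's API; the actual merger may differ in details such as tie-breaking.

```java
import java.util.ArrayList;
import java.util.List;

public class NestedAnnotationSketch {

    // Hypothetical minimal annotation: a type plus character offsets.
    static final class Span {
        final String type;
        final int begin;
        final int end;

        Span(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }

        // True when this span has the same type as `other` and lies inside (or equals) it.
        boolean coveredBy(Span other) {
            return type.equals(other.type) && other.begin <= begin && end <= other.end;
        }

        boolean sameOffsets(Span other) {
            return begin == other.begin && end == other.end;
        }

        @Override
        public String toString() {
            return type + "[" + begin + "," + end + ")";
        }
    }

    static List<Span> mergeSameType(List<Span> spans) {
        List<Span> kept = new ArrayList<>();
        for (int i = 0; i < spans.size(); i++) {
            Span candidate = spans.get(i);
            boolean redundant = false;
            for (int j = 0; j < spans.size() && !redundant; j++) {
                if (i == j || !candidate.coveredBy(spans.get(j))) {
                    continue;
                }
                // Strictly nested spans are always dropped;
                // of exact duplicates only the first occurrence survives.
                redundant = !candidate.sameOffsets(spans.get(j)) || j < i;
            }
            if (!redundant) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Span> annotations = List.of(
                new Span("PHRASE", 5, 9),    // "TEST", nested in the next span
                new Span("PHRASE", 0, 12),   // "This TEST is"
                new Span("PHRASE", 0, 12),   // duplicate of the previous span
                new Span("QUERY", 5, 9));    // different type, left alone
        System.out.println(mergeSameType(annotations));
        // prints [PHRASE[0,12), QUERY[5,9)]
    }
}
```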

License

Copyright © 2019 TokenMill UAB.

Distributed under the Apache License, Version 2.0.
