Web-Scale Open Information Extraction

Overview

ReVerb

ReVerb is a program that automatically identifies and extracts binary relationships from English sentences. ReVerb is designed for Web-scale information extraction, where the target relations cannot be specified in advance and speed is important.

ReVerb takes raw text as input, and outputs (argument1, relation phrase, argument2) triples. For example, given the sentence "Bananas are an excellent source of potassium," ReVerb will extract the triple (bananas, be source of, potassium).

More information is available at the ReVerb homepage: http://reverb.cs.washington.edu

Quick Start

If you want to run ReVerb on a small amount of text without modifying its source code, we provide an executable jar file that can be run from the command line. Follow these steps to get started:

  1. Download the latest ReVerb jar from http://reverb.cs.washington.edu/reverb-latest.jar

  2. Run java -Xmx512m -jar reverb-latest.jar yourfile.txt.

  3. Run java -Xmx512m -jar reverb-latest.jar -h for more options.
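
If you would rather launch the jar from an existing Java program than from a shell, the command from step 2 can be started as a child process. The following is a minimal sketch, not part of ReVerb: the class name ReVerbJarRunner is hypothetical, and it assumes that extractions arrive on standard output and that the output is UTF-8 encoded, neither of which is stated above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that runs the ReVerb jar as a child process. */
final class ReVerbJarRunner {

    /** Builds the same command line as step 2 of the quick start. */
    static List<String> buildCommand(String jarPath, String inputFile) {
        return Arrays.asList("java", "-Xmx512m", "-jar", jarPath, inputFile);
    }

    /** Runs the jar and returns every line it writes to standard output. */
    static List<String> run(String jarPath, String inputFile)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(buildCommand(jarPath, inputFile));
        // Pass the child's standard error straight through to our console.
        builder.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = builder.start();

        List<String> lines = new ArrayList<>();
        // Assumption: the output is UTF-8 text, one record per line.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }

        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("ReVerb exited with status " + exitCode);
        }
        return lines;
    }
}
```

Calling ReVerbJarRunner.run("reverb-latest.jar", "yourfile.txt") pays the start-up cost on every call, so this approach only suits occasional runs.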

Building

Building ReVerb from source requires Apache Maven (http://maven.apache.org). Run this command to download the required dependencies, compile, and create a single executable jar file.

mvn clean compile assembly:single

The compiled class files will be put in the target/classes directory. The single executable jar file will be written to target/reverb-core-*-jar-with-dependencies.jar where * is replaced with the version number.

Command Line Interface

Once you have built ReVerb, you can run it from the command line.

The command line interface to ReVerb takes plain text or HTML as input and outputs a tab-separated table. Each row represents a single extracted (argument1, relation phrase, argument2) triple, plus metadata. The output has the following columns:

  1. The filename (or stdin if the source is standard input)
  2. The sentence number this extraction came from.
  3. Argument1 words, space-separated.
  4. Relation phrase words, space-separated.
  5. Argument2 words, space-separated.
  6. The start index of argument1 in the sentence, counted from zero. For example, if the value is i, then the first word of argument1 is the (i+1)th word in the sentence.
  7. The end index of argument1 in the sentence (exclusive). For example, if the value is j, then the last word of argument1 is the jth word in the sentence.
  8. The start index of relation phrase.
  9. The end index of relation phrase.
  10. The start index of argument2.
  11. The end index of argument2.
  12. The confidence that this extraction is correct. The higher the number, the more trustworthy this extraction is.
  13. The words of the sentence this extraction came from, space-separated.
  14. The part-of-speech tags for the sentence words, space-separated.
  15. The chunk tags for the sentence words, space-separated. These represent a shallow parse of the sentence.
  16. A normalized version of arg1. See the BinaryExtractionNormalizer javadoc for details about how the normalization is done.
  17. A normalized version of rel.
  18. A normalized version of arg2.

For example:

$ echo "Bananas are an excellent source of potassium." | 
    ./reverb -q | tr '\t' '\n' | cat -n
 1  stdin
 2  1
 3  Bananas
 4  are an excellent source of
 5  potassium
 6  0
 7  1
 8  1
 9  6
10  6
11  7
12  0.9999999997341693
13  Bananas are an excellent source of potassium .
14  NNS VBP DT JJ NN IN NN .
15  B-NP B-VP B-NP I-NP I-NP I-NP I-NP O
16  bananas
17  be source of
18  potassium

For a list of options to the command line interface to ReVerb, run reverb -h.
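
To consume this output from Java, each row can be split on tabs and mapped onto the columns listed above. The sketch below is a hypothetical helper, not part of ReVerb. It assumes exactly 18 columns per row and, judging from the sample output (where argument1 "Bananas" has start 0 and end 1), treats start indices as zero-based and end indices as exclusive.

```java
import java.util.Arrays;
import java.util.List;

/** One row of ReVerb's tab-separated output (hypothetical helper class). */
final class ReVerbRow {
    final String source;              // column 1: filename or "stdin"
    final int sentenceNumber;         // column 2
    final String arg1, rel, arg2;     // columns 3-5
    final int arg1Start, arg1End;     // columns 6-7
    final int relStart, relEnd;       // columns 8-9
    final int arg2Start, arg2End;     // columns 10-11
    final double confidence;          // column 12
    final List<String> sentenceWords; // column 13, split on spaces

    private ReVerbRow(String[] f) {
        source = f[0];
        sentenceNumber = Integer.parseInt(f[1]);
        arg1 = f[2];
        rel = f[3];
        arg2 = f[4];
        arg1Start = Integer.parseInt(f[5]);
        arg1End = Integer.parseInt(f[6]);
        relStart = Integer.parseInt(f[7]);
        relEnd = Integer.parseInt(f[8]);
        arg2Start = Integer.parseInt(f[9]);
        arg2End = Integer.parseInt(f[10]);
        confidence = Double.parseDouble(f[11]);
        sentenceWords = Arrays.asList(f[12].split(" "));
    }

    /** Parses one output line; the limit -1 keeps empty trailing columns. */
    static ReVerbRow parse(String line) {
        String[] fields = line.split("\t", -1);
        if (fields.length != 18) {
            throw new IllegalArgumentException(
                    "expected 18 columns, got " + fields.length);
        }
        return new ReVerbRow(fields);
    }

    /** Returns the sentence words in [start, end), joined by spaces. */
    String span(int start, int end) {
        return String.join(" ", sentenceWords.subList(start, end));
    }
}
```

For the sample row above, span(relStart, relEnd) gives back the relation phrase words, which is a quick way to check the index convention against your own output.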

Examples

Running ReVerb on a small set of files

./reverb file1 file2 file3 ...

Running ReVerb on standard input

./reverb < input

Running ReVerb on HTML files

The --strip-html flag (short version: -s) removes tags from the input before running ReVerb.

./reverb --strip-html myfile.html

Running ReVerb on a list of files

You may have an entire directory structure that you want to run ReVerb on. ReVerb takes approximately 10 seconds to initialize, so it is not efficient to start a new process for each file. To pass ReVerb a list of paths, use the -f switch:

# Run ReVerb on all files under mydir/
find mydir/ -type f | ./reverb -f
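
If find is not available (for example on Windows), the same list of paths can be produced from Java. This is a minimal sketch using only the standard library; the class name FileLister is hypothetical. Its main method prints one regular file per line, mirroring find mydir/ -type f, so its output can be piped to ./reverb -f in the same way.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper that lists regular files under a directory. */
final class FileLister {

    /** Returns all regular files below root, in sorted order. */
    static List<Path> regularFilesUnder(Path root) {
        // Files.walk visits root and everything beneath it; the stream
        // must be closed, hence the try-with-resources block.
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .sorted()
                        .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Prints one path per line for the directory given as args[0]. */
    public static void main(String[] args) {
        for (Path file : regularFilesUnder(Paths.get(args[0]))) {
            System.out.println(file);
        }
    }
}
```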

Java Interface

To include ReVerb as a library in your own project, please take a look at the example class ReVerbExample in the src/main/java/edu/washington/cs/knowitall/examples directory.

When running code that calls ReVerb, make sure to increase the Java Virtual Machine heap size by passing the argument -Xmx512m to java. ReVerb loads multiple models into memory, and will be significantly slower if the heap size is not large enough.
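
A program that embeds ReVerb can check at start-up whether it was launched with enough heap. The sketch below uses only the standard Runtime API; the class name HeapCheck and the 400 MiB threshold are assumptions chosen for illustration. The threshold is set below 512 MiB because Runtime.maxMemory() may report somewhat less than the -Xmx value on some JVMs.

```java
/** Hypothetical start-up check for the heap size recommended above. */
final class HeapCheck {

    // Assumption: anything below roughly 400 MiB means -Xmx512m was not set.
    static final long MIN_HEAP_BYTES = 400L * 1024 * 1024;

    /** Returns true if the given maximum heap size looks large enough. */
    static boolean isHeapSufficient(long maxHeapBytes) {
        return maxHeapBytes >= MIN_HEAP_BYTES;
    }

    /** Prints a warning to standard error if the JVM heap looks too small. */
    static void warnIfHeapTooSmall() {
        long maxHeap = Runtime.getRuntime().maxMemory();
        if (!isHeapSufficient(maxHeap)) {
            System.err.printf(
                    "Warning: max heap is %d MiB; consider running java with -Xmx512m%n",
                    maxHeap / (1024 * 1024));
        }
    }
}
```

Calling HeapCheck.warnIfHeapTooSmall() before loading any models makes a missing -Xmx512m argument visible immediately instead of showing up later as slow extraction.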

Using Eclipse

To modify the ReVerb source code in Eclipse, use Apache Maven to create the appropriate project files:

mvn eclipse:eclipse

Then start Eclipse and navigate to File > Import. Under General, select "Existing Projects into Workspace", and point Eclipse to the main ReVerb directory.

Including ReVerb as a Dependency

If you want to start a new project that depends on ReVerb, first create a new skeleton project using Maven. The following command will ask you to fill in the details of your project name, etc.:

mvn archetype:generate

Next, add ReVerb as a dependency by adding the following XML under the <project> element. To make sure you are using the latest version of ReVerb, consult Maven Central.

<dependencies>
  <dependency>
    <groupId>edu.washington.cs.knowitall</groupId>
    <artifactId>reverb-core</artifactId>
    <version>1.4.1</version>
  </dependency>
</dependencies>

Your final pom.xml file should look something like this:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>mygroup</groupId>
  <artifactId>myartifact</artifactId>
  <version>1.0-SNAPSHOT</version>
  <packaging>jar</packaging>

  <name>myartifact</name>
  <url>http://maven.apache.org</url>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>

  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>edu.washington.cs.knowitall</groupId>
      <artifactId>reverb-core</artifactId>
      <version>1.4.1</version>
    </dependency>
  </dependencies>
</project>

You should be able to include ReVerb in your code now. You can try this out by including import edu.washington.cs.knowitall.extractor.ReVerbExtractor in your program.

Retraining the Confidence Function

ReVerb includes a class for training new confidence functions, given a list of labeled examples, called ReVerbClassifierTrainer. Example code for training a new confidence function confFunction is shown below. The non-trivial part is likely to be converting your labeled data to an Iterable<LabeledBinaryExtraction>.

Example Pseudocode:

// Provide your labeled data here
Iterable<LabeledBinaryExtraction> myLabeledData = ??? 
ReVerbClassifierTrainer trainer = 
    new ReVerbClassifierTrainer(myLabeledData);
Logistic classifier = trainer.getClassifier();
ReVerbConfFunction confFunction = new ReVerbConfFunction(classifier);
// confFunction is ready to use here.
double conf = confFunction.getConf(extraction);

If you already have a list of binary labeled ReVerb extractions, it should be easy to convert them to ChunkedBinaryExtraction objects, and then to LabeledBinaryExtraction objects (see the constructors for these classes). Also note that ReVerb includes a LabeledBinaryExtractionReader and Writer class. You may wish to (re-)serialize your data using LabeledBinaryExtractionWriter. This will put it in the same format as all previous data used to train ReVerb confidence functions, and it will be easy to read in the future with LabeledBinaryExtractionReader.

Help and Contact

For more information, please visit the ReVerb homepage at the University of Washington: http://reverb.cs.washington.edu.

FAQ

  1. How fast is ReVerb?

    You should really benchmark ReVerb yourself, but on my computer (a new computer in 2011) ReVerb processed 5000 high-quality web sentences in 21 s, or 238 sentences per second, in a single thread. ReVerb is easily parallelizable by processing different sentences concurrently.
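
    One way to process different sentences concurrently is a fixed thread pool, as in the minimal sketch below. It is not ReVerb code: the class name ParallelSentences is hypothetical, and the extract function is a stand-in for whatever per-sentence extraction call you use. Whether a single ReVerb extractor instance can be shared between threads is not stated here, so a cautious caller would give each thread its own instance.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical helper that applies a per-sentence function in parallel. */
final class ParallelSentences {

    /**
     * Applies extract to every sentence using the given number of threads.
     * Results are returned in the same order as the input sentences.
     */
    static <R> List<R> processAll(List<String> sentences,
                                  Function<String, R> extract,
                                  int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one task per sentence and remember its future.
            List<Future<R>> futures = new ArrayList<>();
            for (String sentence : sentences) {
                Callable<R> task = () -> extract.apply(sentence);
                futures.add(pool.submit(task));
            }
            // Collect results in submission order.
            List<R> results = new ArrayList<>();
            for (Future<R> future : futures) {
                results.add(future.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

    Submitting one task per sentence keeps the sketch short. For very large inputs, batching several sentences per task would reduce scheduling overhead.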

Contributors

Citing ReVerb

If you use ReVerb in your academic work, please cite ReVerb with the following BibTeX citation:

@inproceedings{ReVerb2011,
  author =   {Anthony Fader and Stephen Soderland and Oren Etzioni},
  title =    {Identifying Relations for Open Information Extraction},
  booktitle =    {Proceedings of the Conference of Empirical Methods
                  in Natural Language Processing ({EMNLP} '11)},
  year =     {2011},
  month =    {July 27-31},
  address =  {Edinburgh, Scotland, UK}
}
Comments
  • License?

    Could I encourage you to place some open source license on this? I'm interested in using this, but cannot do so without clear understanding of my legal rights to do so.

    opened by temujin9 4
  • How to use our own data set for Open Information Extraction? A text corpus?

    I have a text corpus from the agriculture domain. I need to use ReVerb to extract relationships between certain concepts in the corpus and use them for another purpose in Java code. Is there a way to do this using Java? The readme.md file shows how to read a list of files under a certain directory; how can this be done from Java with ReVerb?

    opened by Akila94 1
  • --argLearner error

    Hi, I'm running ReVerb on http://en.wikipedia.org/wiki/Vacuum_tube with --argLearner, using the jar version.

    I get:

    Initializing ReVerb+ArgLearner extractor...Couldn't open cc.mallet.util.MalletLogger resources/logging.properties file.
    Perhaps the 'resources' directories weren't copied into the 'class' directory.
    Continuing.
    Done.
    Initializing confidence function...Done.
    Initializing NLP tools...Done.
    Starting extraction.
    Extracting from Vacuum_tube
    Exception in thread "main" java.lang.NoClassDefFoundError: bsh/Interpreter
        at java.lang.ClassLoader.defineClass1(Native Method)
        at java.lang.ClassLoader.defineClass(ClassLoader.java:791)
        at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
        at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
        at cc.mallet.util.CommandOption.(CommandOption.java:62)
        at cc.mallet.util.CommandOption$Double.(CommandOption.java:483)
        at cc.mallet.fst.SimpleTagger.(SimpleTagger.java:170)
        at cc.mallet.fst.SimpleTagger$SimpleTaggerSentence2FeatureVectorSequence.pipe(SimpleTagger.java:158)
        at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:294)
        at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:282)
        at cc.mallet.types.InstanceList.addThruPipe(InstanceList.java:267)
        at edu.washington.cs.knowitall.argumentidentifier.ArgSubstructureClassifier.applyCRF(ArgSubstructureClassifier.java:67)
        at edu.washington.cs.knowitall.argumentidentifier.ArgSubstructureClassifier.classifyData(ArgSubstructureClassifier.java:151)
        at edu.washington.cs.knowitall.argumentidentifier.ArgSubstructureClassifier.getArgBound(ArgSubstructureClassifier.java:202)
        at edu.washington.cs.knowitall.argumentidentifier.ArgLearner.getArg1LeftBound(ArgLearner.java:163)
        at edu.washington.cs.knowitall.argumentidentifier.ArgLearner.getArg1(ArgLearner.java:141)
        at edu.washington.cs.knowitall.argumentidentifier.ArgLearner.extractCandidates(ArgLearner.java:65)
        at edu.washington.cs.knowitall.argumentidentifier.ArgLearner.extractCandidates(ArgLearner.java:20)
        at edu.washington.cs.knowitall.extractor.Extractor.extract(Extractor.java:89)
        at edu.washington.cs.knowitall.extractor.RelationFirstNpChunkExtractor.extractCandidates(RelationFirstNpChunkExtractor.java:124)
        at edu.washington.cs.knowitall.extractor.RelationFirstNpChunkExtractor.extractCandidates(RelationFirstNpChunkExtractor.java:36)
        at edu.washington.cs.knowitall.extractor.Extractor.extract(Extractor.java:89)
        at edu.washington.cs.knowitall.util.CommandLineReVerb.extractFromSentReader(CommandLineReVerb.java:357)
        at edu.washington.cs.knowitall.util.CommandLineReVerb.extractFromNextFile(CommandLineReVerb.java:313)
        at edu.washington.cs.knowitall.util.CommandLineReVerb.runExtractor(CommandLineReVerb.java:241)
        at edu.washington.cs.knowitall.util.CommandLineReVerb.main(CommandLineReVerb.java:129)
    Caused by: java.lang.ClassNotFoundException: bsh.Interpreter
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
        ... 34 more

    opened by ontologiae 1
  • Pull request for the unary bug fix

    Tony

    I have added a fix that removes the IndexOutOfBounds exception for some unary extractions.

    The hack for creating a unary relation had an incorrect range for the missing argument.

    I have added a test case for this bug.

    ~ Niranjan.

    opened by niranjanb 1
  • Remove some unused classes

    Previously there were two BooleanFeatureSets, one of which was not actually used. There were also two instances of some other classes left over from a rename.

    opened by schmmd 0
  • Hi Tony, I thought that in your sick state you should try out merging a pull request.

    Using Joiner.on to format output instead of String.format for minor performance reasons.

    This pull request is really to give an example of a pull request.

    Previously formatting the strings was taking a small percentage of the total running time of the program and now that percentage is much smaller.

    opened by schmmd 0
  • quick start

    I executed the quick start as described in the guide. Where is the output?

    C:\Users\pc>java -Xmx512m -jar  reverb-latest.jar  test.txt
    Initializing ReVerb extractor...Done.
    Initializing confidence function...Done.
    Initializing NLP tools...Done.
    Starting extraction.
    Extracting from test.txt
    Done with extraction.
    Summary: 0 extractions, 2 sentences, 1 files, 0 seconds
    
    opened by cdhx 1
  • the problem of building Reverb

    Hello! I followed your guide to build ReVerb, but when I run the command mvn clean compile assembly:single, there are many warnings such as:

    [WARNING]
    [WARNING] Some problems were encountered while building the effective model for edu.washington.cs.knowitall:reverb-core:jar:1.4.3-SNAPSHOT
    [WARNING] 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-source-plugin is missing. @ edu.washington.cs.knowitall:knowitall-oss:1.0.0, D:\Software\maven\LocalWarehouse\edu\washington\cs\knowitall\knowitall-oss\1.0.0\knowitall-oss-1.0.0.pom, line 41, column 15
    [WARNING] 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-javadoc-plugin is missing. @ edu.washington.cs.knowitall:knowitall-oss:1.0.0, D:\Software\maven\LocalWarehouse\edu\washington\cs\knowitall\knowitall-oss\1.0.0\knowitall-oss-1.0.0.pom, line 29, column 15
    [WARNING]
    [WARNING] It is highly recommended to fix these problems because they threaten the stability of your build.
    [WARNING]
    [WARNING] For this reason, future Maven versions might no longer support building such malformed projects.
    [WARNING]

    The build ends with "Build success", but I couldn't run the command ./reverb file1. Could you give me some advice? Thank you!

    opened by hedyHe 0
  • Is it possible to use ReVerb with a large dataset (11 GB)?

    I wanted to try the ReVerb relation extractor with a large dataset, as mentioned in the code section of http://knowitall.cs.washington.edu/paralex/. Apart from using the jar file directly, how can I create a new jar that works with the large lexicon and questions datasets?

    opened by karimkhanp 0