DeepDive

Overview

As of 2017, the DeepDive project is in maintenance mode and no longer under active development. The user community remains active, but the original project members can no longer promise exciting new features/improvements or respond to requests. For more up-to-date research, please see the Snorkel Project or Ce Zhang's Projects.

See deepdive.stanford.edu or doc/ to learn how to install and use DeepDive.

Or, just start with this one-liner command:

bash <(curl -fsSL git.io/getdeepdive)

Read the DeepDive developer's guide to learn more about this source tree and how to contribute.

Licensed under Apache License, Version 2.0.

Comments
  • Sequence assignment is major performance bottleneck

    On greenplum (raiders4), the following query is executed while running our rates extractor:

    SELECT fast_seqassign('rates', 0)

    Throughput is 2.4 MB/sec.

    This means it takes 3h just to assign IDs to the candidates, and another 3h+ (still running) to assign IDs to the features.

    Any ideas on how we could speed up sequence assignment?
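
    One generic alternative would be to assign IDs in bulk with a window function instead of a per-row sequence. This is only a hedged sketch (it is not DeepDive's fast_seqassign, the column list is illustrative, and on Greenplum a single row_number() stream still serializes the work, so a per-segment offset scheme would be needed for real parallelism):

    # assumes direct psql access to the same database
    psql "$DBNAME" -c "
      CREATE TABLE rates_numbered AS
        SELECT row_number() OVER () - 1 AS id,  -- contiguous IDs starting at 0
               t.doc_id, t.rate                 -- placeholder columns; list the real schema here
        FROM rates t;
    "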

    opened by raphaelhoffmann 30
  • DeepDive execution plan compiler

    This not-so-small PR adds a new implementation of DeepDive that literally compiles an execution plan to run the app in a much more efficient way in both human time and machine time/space. I invite everyone to review the code and give it a try on your existing app (especially @feiranwang @SenWu @ajratner @zhangce @raphaelhoffmann @alldefector @zifeishan @xiaoling @Colossus @ThomasPalomares @juhanaka). The plan is to get feedback over the next few days while I update the documentation and carve out release v0.8.0.

    An execution plan is basically a set of shell scripts that tell what to run for user-defined extractors as well as the built-in processes for grounding the factor graph and performing learning and inference, plus a Makefile that describes the complete dependencies among them. These are compiled from the app's deepdive.conf, app.ddlog, and schema.json, mainly by a series of JSON transformations implemented as jq programs with a little bit of help from bash, all of which resides under compiler/ in the source tree.

    It implements most of the existing functionality provided by the current Scala implementation, except for a few things, e.g., Greenplum parallel unloading/loading, which won't be difficult to add given the much more modular architecture. On the other hand, exciting new features and improvements have been added. To highlight just a few:

    • Full Dependency Support with Selective Execution. It's now possible to selectively run, repeat, or skip certain parts of the app's data flow or extraction pipeline without being aware of all the dependencies between them. (fixes #431, closes #427, closes #273) The user has full control over every step of the execution plan. Not only that, but grounding is also broken down into smaller processes, so it's possible to just change or add one inference rule and update the grounded factor graph without having to recompute everything from scratch. (fixes #280)
    • Zero Footprint Extraction. tsv_extractors now have nearly zero footprint on the filesystem. The data is streamed from the database through the UDFs and back to the database. mkmimo is used with named pipes to make the connection between the database and the parallel UDF processes (cf. $DEEPDIVE_NUM_PROCESSES). (fixes #428) It doesn't support other extractor styles yet, but we can probably drop them unless there's a compelling reason. (closes #384)
    • Compute Drivers. A new compute driver architecture for executing such extractors is now in place, so it's now clear where to extend to support remote execution or clusters with a job scheduler, such as Hadoop/YARN, SLURM, Torque/GridEngine/PBS. (#426) The local execution driver is what implements the streaming tsv_extractor mentioned above. Moreover, the grounding processes as well as the user-defined extractors also make use of the compute drivers, so some parts of the grounding will automatically take advantage of such extensions.
    • Zero Footprint Grounding. The grounding processes also minimize footprint on the filesystem. No data for the factor graph is duplicated. Instead of creating concatenated copies of factors, weights, and variables, they are merged as the sampler loads them. Also, the binary format conversion is done on the fly as the grounded rows are unloaded from the database, so no ephemeral textual form is ever stored anywhere. In fact, only a few line changes to the compiler can compress the binary forms and shrink the factor graph's footprint on the filesystem by an order of magnitude (not included in this PR).
    • More User Commands. The deepdive.pipeline config is now obsolete, as is the deepdive run command, although both still work the same as before. Now, the user can simply state the goal of the execution with the deepdive do command, e.g., deepdive do model/calibration-plot or deepdive do data/has_spouse_features, as many times as necessary once the app is compiled with the deepdive compile command (see the usage sketch after this list). Supporting commands such as deepdive plan, deepdive mark, and deepdive redo are there to speed up typical user workflows with DeepDive apps. deepdive initdb and deepdive load have been improved to be more useful, and are used by a few compiled processes. (fixes #351, fixes #357)
    • Error Checking. Errors in the app (and of course in the compiler itself) are caught at compile time, and checks can also be run by the user with the deepdive check command. The checkers are modular and isolated, so many useful checks can be quickly added, such as test-firing UDF commands. Only basic checks have been implemented so far. (fixes #349, fixes #1)
    • Simpler, Efficient Multinomial Factors. Multinomial factors no longer materialize unnecessary data and use VIEWs as much as possible, e.g., for the dd_*_cardinality tables and dd_graph_weights. Also, nearly no code has been duplicated to support them, so that's good news for developers.
    • Bundled Runtime Dependencies. DeepDive now builds essential runtime dependencies, such as bash, coreutils, jq, bc, graphviz, so no more Mac vs. Linux or software installation version issues will pop up. (fixes #441 as documentation also ended up in this PR)
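
    For concreteness, here is a usage sketch of the selective-execution commands mentioned above (the target names are the examples from this description):

    deepdive compile                        # compile deepdive.conf / app.ddlog into an execution plan
    deepdive plan model/calibration-plot    # show the steps that would run, without running them
    deepdive do data/has_spouse_features    # run only what this target needs
    deepdive redo data/has_spouse_features  # force this step to run again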

    Also, some good stuff that comes with a clean rewrite:

    • Closes #412 by doing a reasonable job and making the base relations clear to the user, and also closes #421.
    • Closes #110 as no more SQL parsing is done.
    • Closes #383 as logging has been completely redone.
    • Closes #361, closes #20 as JDBC is no longer used.
    • Closes #329 as Scala implementation will be simply dropped in a future release (v0.8.1 or v0.9?).
    opened by netj 22
  • We should output "supervised" data points as well

    Looks like data points with filled is_correct values get excluded from Gibbs sampling. But the expectations on those data points are useful for both the dev process and the final data product.

    Right now, people would have to cumbersomely make copies of those supervised data points to force DD to run inference on them (@ajratner @zhangce @raphaelhoffmann for example). I think I brought this up a while ago.

    Do you agree that we should always include supervised data points in inference? I realize that we may also need to change code of the sampler for this. Would that be a lot of work?

    opened by alldefector 22
  • Installer scripts with test cleanup

    util/install.sh is the master installer script that detects the operating system, loads platform-specific installers, and runs them in batch or interactive fashion. Now, installing DeepDive can be done with a one-liner:

    bash <(curl -fsSL deepdive.stanford.edu/install)
    

    @raphaelhoffmann's scripts from #302 for installing DeepDive dependencies as well as Postgres-XL on Ubuntu have been slightly modified to fit into this modular, extensible structure. A minimal script for Mac has been added mainly to show how the installer supports multiple platforms. Each OS-specific install script can define as many installable components as it wants (by enumerating them from list_installers) and keep dirty details in separate files under install/ (loaded with source_script even over GitHub! See how install.Ubuntu.pgxl.sh is run).
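
    As a rough illustration only (the function bodies, component names, and file layout here are assumptions, not the actual util/install.sh contract), an OS-specific installer module in this style might look like:

    # install.Ubuntu.sh -- hypothetical sketch of a per-platform installer module
    install_deepdive_build_deps() {
        sudo apt-get update
        sudo apt-get install -y build-essential git curl
    }
    install_postgres() {
        sudo apt-get install -y postgresql
    }
    list_installers() {
        # enumerate the components this platform can install, for batch or interactive selection
        echo deepdive_build_deps
        echo postgres
    }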

    The Makefile and Travis config are rewritten to use the identical scripts, which will unify the once-diverging codebase across local make, Travis, Docker, Shippable, AMI, etc. As a side effect, tests were cleaned up, and the broken-link check running on Travis has been factored out as make -C doc/ linkcheck, available to doc/ editors to check all external links in a minute or two. Also, this cleanup is supposed to fix the always-broken-Shippable issue #304, but the chunking test is not passing due to a strange file-permission error.

    opened by netj 22
  • Clean up unicode handling in pg text dump to json

    The old combination of (a) text-formatted psql output and (b) Python 2 unicode handling was causing a mess. This PR:

    1. Ports pgtsv_to_json to Python 3 for cleaner unicode handling.
    2. Changes its expected input to CSV.
    3. Changes the call to pgtsv_to_json to provide CSV input.
    4. Renames pgtsv_to_json to pgcsv_to_json to avoid confusion.
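
    A small sketch of the motivation (the query and table here are placeholders): psql's CSV output has well-defined quoting and escaping, so a downstream converter no longer has to special-case the text format's escape sequences.

    psql "$DBNAME" -c "\copy (SELECT doc_id, content FROM articles LIMIT 3) TO STDOUT WITH CSV HEADER"
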
    opened by shahin 18
  • deepdive run with ddlog is not idempotent

    Suppose you have a ddlog program with the following structure:

    table1(
    ...
    )
    function f over () returns ... implementation ...
    table1 += f(...) :- ...
    

    Running deepdive run multiple times is not idempotent on such programs. More rows will be appended to table1 in each run.

    Is there a way to overwrite or clear the tables? Using '=' instead of '+=' throws a compiler error.
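
    A hedged workaround sketch, assuming it is acceptable to simply clear the derived rows before re-running (table1 is the placeholder name from the snippet above):

    deepdive sql "TRUNCATE table1"   # drop previously appended rows
    deepdive run                     # re-run; table1 is repopulated exactly once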

    opened by raphaelhoffmann 18
  • wrong results when passing array of variables to factor function

    I'm getting different output when passing variables to a factor function explicitly vs. as an array. The results when using the array are wrong.

    Here's a simple example:

    deepdive {
    
      schema.variables {
        candidates.is_true: Boolean
      }
    
      pipeline.run: debug
    
      pipeline.pipelines {
        debug = [ create_candidates, one_link_is_true_broken ]
        #debug = [ create_candidates, one_link_is_true_works ]
      }
    
      extraction.extractors {
        create_candidates: {
            style: sql_extractor
            sql: """DROP TABLE IF EXISTS candidates CASCADE;
                    CREATE TABLE candidates (id BIGINT, group_id INTEGER, target INTEGER, is_true BOOLEAN);
                    INSERT INTO candidates VALUES (NULL, 0, 0, FALSE);
                    INSERT INTO candidates VALUES (NULL, 0, 1, NULL);
                   """
        }
      }
    
      inference.factors {
    
        one_link_is_true_broken {
          input_query = """
            SELECT array_agg(l.id order by l.id) as "candidates.id", array_agg(l.is_true order by l.id) as "candidates.is_true"
            FROM candidates l
            GROUP BY l.group_id
             """
          function: "And(candidates.is_true)"
          weight: 3
        }
    
        one_link_is_true_works {
         input_query = """
           SELECT c1.id as "candidates.c1.id", c1.is_true as "candidates.c1.is_true",
                  c2.id as "candidates.c2.id", c2.is_true as "candidates.c2.is_true"
           FROM candidates c1, candidates c2
           WHERE c1.group_id = 0 AND c1.target = 0
           AND c2.group_id = 0 AND c2.target = 1
            """
         function: "And(candidates.c1.is_true, candidates.c2.is_true)"
         weight: 3
        }
      }
    }
    

    I'm expecting an expectation of ~0.5 for the unknown variable. When using the array version, the expectation is ~1.0.

    opened by raphaelhoffmann 17
  • Extractor parallel loading

    • Do not use the environment variable PARALLEL_LOADING any more; use extraction.parallel_loading instead
    • Updated the docs for parallel loading
    • Updated BiasedCoin test sampler arguments for more stability
    opened by zifeishan 15
  • Crash when run has spouse example with signalmedia-1m.jsonl

    Hi all, I'm trying to run DeepDive and it goes pretty well with the small dataset from the has spouse example in the tutorial. I think DeepDive should support running larger datasets, so I downloaded the signalmedia-1m dataset (around 1GB of data) and used articles.tsv.sh (customized a little) to extract all the content into a full articles-1m.tsv file (around 210MB). With that file, I tried to run DeepDive again:

    • remove old stuff: drop the database, remove the 'run' folder, and run 'deepdive compile'
    • then run "deepdive do spouse_feature" directly. It ran for around 5 minutes and then crashed (my PC restarted).

    I think the source data is rather small (210MB) compared to my PC's configuration (macOS, 16GB RAM, Core i7), but surprisingly it crashed quite soon (in the document-parsing step, I think while creating the sentences table, because lots of Java processes run and each consumes a lot of RAM, around 2.2GB per process, with 5 or 6 Java processes in total).
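
    A hedged mitigation sketch, assuming the blow-up comes from too many parallel NLP/UDF processes (DEEPDIVE_NUM_PROCESSES is the knob mentioned elsewhere on this page; 2 is just an illustrative value):

    export DEEPDIVE_NUM_PROCESSES=2   # cap parallel UDF processes to bound total RAM use
    deepdive do spouse_feature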

    Is this the bug mentioned in #478? Do you have a plan to fix it?

    Thanks in advance.

    PS: Attached is a screenshot of my Postgres database; not much data there.

    bug 
    opened by lanphan 14
  • Greenplum DB driver ID assignment is buggy for categorical variables

    We ran the chunking example with DD 0.8 on Greenplum (with 20 segments) but always get this error:

    2016-02-21 00:24:11.464530 ################################################
    2016-02-21 00:24:11.661727 LOADED VARIABLES: #270052
    2016-02-21 00:24:11.661813          N_QUERY: #49389
    2016-02-21 00:24:11.661828          N_EVID : #220663
    2016-02-21 00:24:11.722552 LOADED WEIGHTS: #268970
    2016-02-21 00:24:12.756042 LOADED FACTORS: #1049594
    2016-02-21 00:24:12.995216 sampler-dw.bin: src/dstruct/factor_graph/factor_graph.cpp:375: void dd::FactorGraph::safety_check(): Assertion `this->weights[i].id == i' failed.
    2016-02-21 00:24:13.171646 process/model/learning/run.sh: line 22: 184415 Aborted                 (core dumped) 
    

    The corresponding code verifies that the weight vector is loaded in the order of id = 0, 1, 2, ...: https://github.com/HazyResearch/sampler/blob/master/src/dstruct/factor_graph/factor_graph.cpp#L395

    The dump_weights/run.sh scripts run queries similar to

    select * from dd_weightsmulti_inf_istrue_tag limit 10;
    

    While the above query returns results consistently in the same order on PG, the order can vary if we run it on GP again and again. So the weight vector ordering seems to have relied on an implementation detail of Postgres, and now it's broken in Greenplum (we are running the latest GP release as of today).

    Interestingly, we could run the spouse example on GP. With spouse, the first column is isfixed -- which seems to be always f. With chunking, the first column of dd_weightsmulti_inf_istrue_tag is dd_weight_column_0, which could take on different values. And GP by default distributes on the first column. So that seems to explain the ordering issue:

    # select * from dd_weightsmulti_inf_istrue_tag limit 3;
     dd_weight_column_0 | isfixed | initvalue | id  | categories 
    --------------------+---------+-----------+-----+------------
     word=Results       | f       |         0 | 296 | 0
     word=obliged       | f       |         0 | 307 | 0
     word=Eurocom       | f       |         0 | 318 | 0
    (3 rows)
    
    # select * from dd_weightsmulti_inf_istrue_tag limit 3;
     dd_weight_column_0 | isfixed | initvalue | id  | categories 
    --------------------+---------+-----------+-----+------------
     word=Leahy         | f       |         0 | 170 | 0
     word=banana        | f       |         0 | 176 | 0
     word=9.875         | f       |         0 | 182 | 0
    (3 rows)
    
    # select * from dd_weightsmulti_inf_istrue_tag limit 3;
     dd_weight_column_0 | isfixed | initvalue | id  | categories 
    --------------------+---------+-----------+-----+------------
     word=Results       | f       |         0 | 296 | 0
     word=obliged       | f       |         0 | 307 | 0
     word=Eurocom       | f       |         0 | 318 | 0
    (3 rows)
    

    Between adding ORDER BY to the dump_weights code and revising the sampler, which would be a better solution?
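
    For reference, the first option would amount to something like the following (a sketch only; the real dump_weights presumably unloads via a copy, and the table name is taken from the example above):

    psql "$DBNAME" -c "SELECT * FROM dd_weightsmulti_inf_istrue_tag ORDER BY id"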

    bug 
    opened by alldefector 13
  • Add support for functional dependencies within a variable table

    This is the front-end support for the multinomial variable type in the sampler.

    This is critical for classification targets such as entity linking, where the table schema is <object, class> and for each value of 'object', only one 'class' can be true in any state. If we don't support this constraint, we'd have to use the pairwise exclusivity rule to emulate it -- which would result in dramatic blowups in the factor graph.

    enhancement 
    opened by alldefector 13
Releases
  • UNSTABLE(Feb 25, 2016)

  • v0.8-STABLE(Feb 25, 2016)

  • v0.8.0(Feb 19, 2016)

    A completely re-architected version of DeepDive is here. Now the system compiles an execution plan ahead of time, checkpoints at a much finer granularity, and gives users full visibility and control of the execution, so any parts of the computation can be flexibly repeated, resumed, or optimized later. The new architecture naturally enforces modularity and extensibility, which enables us to innovate most parts independently without having to understand every possible combination of the entire code. The abstraction layers that encapsulate database operations as well as compute resources are now clearly established, giving a stable ground for extensions in the future that support more types of database engines and compute clusters such as Hadoop/YARN and ones with traditional job schedulers.

    As an artifact of this redesign, exciting performance improvements are now observed:

    • The database drivers show more than 20x higher throughput (2MB/s -> 50MB/s, per connection) with zero storage footprint by streaming data in and out of UDFs.
    • The grounded factor graphs save up to 100x storage space (12GB -> 180MB) by employing compression during the factor graph's grounding and loading, incurring less than 10% overhead in time (400s -> 460s, measuring only the dumping and loading, hence a much smaller fraction in practice).

    See the issues and pull requests for this milestone on GitHub (most notably #445) for further details.

    New commands and features

    An array of new commands have been added to deepdive, and existing ones have been rewritten, such as deepdive initdb and deepdive run.

    To learn more about an individual deepdive COMMAND, use the deepdive help command:

    deepdive help COMMAND
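
    For example, assuming the command names listed in these notes:

    deepdive help do         # usage of the selective-execution command
    deepdive help compile    # how an app is compiled into an execution plan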
    

    Dropped and deprecated features

    The Scala code base has been completely dropped and rewritten in Bash and jq. Many superfluous features have been dropped, and others are deprecated and will be dropped later, as summarized below:

    • All extractor styles other than tsv_extractor, sql_extractor, and cmd_extractor have been dropped, namely:
      • plpy_extractor
      • piggy_extractor
      • json_extractor
    • Manually writing deepdive.conf is strongly discouraged, as filling in more fields such as dependencies: and input_relations: has become mandatory. Rewriting such apps in DDlog is strongly recommended.
    • Database configuration in deepdive.db.default is completely ignored. db.url must be used instead.
    • deepdive.extraction.extractors.*.input in deepdive.conf should always be a SQL query. TSV(filename.tsv) and CSV(filename.csv) are no longer supported.
    Source code(tar.gz)
    Source code(zip)
    deepdive-v0.8.0-Darwin.tar.gz(78.32 MB)
    deepdive-v0.8.0-Linux.tar.gz(90.09 MB)
  • v0.7.1(Sep 28, 2015)

    • Adds better support for applications written in DDlog. deepdive run now runs DDlog-based applications (app.ddlog).
    • Makes PL/Python extension no longer necessary for PostgreSQL. It is still needed for Greenplum and PostgreSQL-XL.
    • Adds format=json support to the deepdive sql eval command (see the usage sketch after this list).
    • Adds deepdive load command for loading TSV and CSV data.
    • Adds deepdive help command for quick usage instructions for deepdive command.
    • Includes the latest Mindbender with the Search GUI for browsing data produced by DeepDive.
    • Adds various bug fixes and improvements.
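
    A hedged usage sketch of the deepdive sql eval and deepdive load commands above (relation and file names are placeholders):

    deepdive sql eval "SELECT doc_id, sentence FROM sentences LIMIT 3" format=json
    deepdive load sentences input/sentences.tsv
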
    Source code(tar.gz)
    Source code(zip)
    deepdive-v0.7.1-Darwin.tar.gz(101.34 MB)
    deepdive-v0.7.1-Linux.tar.gz(152.17 MB)
  • v0.7.0(Jul 13, 2015)

    • Provides a new command-line interface deepdive with a new standard DeepDive application layout.
      • No more installation/configuration complication: Users run everything through the single deepdive command, and everything just works in any environment. The only possible failure mode is not being able to run the deepdive command, e.g., by not setting up the PATH environment variable correctly.
      • No more pathname/environment clutter in apps: repeated settings for DEEPDIVE_HOME, APP_HOME, PYTHONPATH, LD_LIBRARY_PATH, PGHOST, PGPORT, ... in run.sh, env.sh, env_local.sh, env_db.sh, etc. are gone. Path names (e.g., extractor udf) in application.conf are all relative to the application root, and brittle relative paths are no longer used in any of the examples.
      • Clear separation of app code from infrastructure code, as well as source code from object code: No more confusing the deepdive source tree with the binary/executable/shared-library distribution or temporary/log/output directories.
      • Binary releases can be built with make package.
    • Here is a summary of changes visible to users:
      • Application settings are now kept in the deepdive.conf file instead of application.conf.
      • Database settings are now configured by putting everything (host, port, user, password, database name) into a single URL in the db.url file (see the sketch after this list).
      • Path names (e.g., extractor udf) in deepdive.conf are all relative to the application root unless they are absolute paths.
      • SQL queries against the database can be run easily with deepdive sql command when run under an application.
      • The database schema is now put in the schema.sql file, and optional initial data loading can be done by a script, input/init.sh. Input data is recommended to be kept under input/.
      • By passing the pipeline name as an extra argument to the deepdive run command, different pipelines can be run very easily: No more application.conf editing.
      • Logs and outputs are placed under application root, under snapshot/.
    • Adds the piggy extractor, which replaces the now-deprecated plpy extractor.
    • Includes the latest DDlog compiler with extended syntax support for writing more real world applications.
    • Includes the latest Mindbender with Dashboard GUI for producing summary reports after each DeepDive run and interactively analyzing data products.
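
    A hedged sketch of what the new layout and database URL look like in practice (paths, credentials, and the query are illustrative):

    cd my_app                                  # app root: deepdive.conf, schema.sql, input/, udf/
    echo "postgresql://user:pass@localhost:5432/mydb" > db.url
    deepdive sql "SELECT 1"                    # runs against the database named in db.url
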
    Source code(tar.gz)
    Source code(zip)
    deepdive-v0.7.0-Darwin.tar.gz(88.18 MB)
    deepdive-v0.7.0-Linux.tar.gz(91.66 MB)
  • v0.6.0(Jun 17, 2015)

    • Adds DDlog for writing applications in Datalog-like syntax.
    • Adds support for incremental development cycles.
    • Adds preliminary support for Postgres-XL backend.
    • Simplifies installation on Ubuntu and Mac with a quick installer that takes care of all dependencies.
    • Drops maintenance of AMI favoring the new quick installer.
    • Fixes sampler correctness issues.
    • Drops "FeatureStatsView" view due to performance issues.
    • Corrects various issues.
    • Starts using Semantic Versioning for consistent and meaningful version numbers for all future releases.
    Source code(tar.gz)
    Source code(zip)
  • 0.05-RELEASE(Feb 9, 2015)

    Changelog for release 0.0.5-alpha (02/08/2015)

    • Added support for building Docker images for DeepDive. See the README.md for more.
    • Added the SQL "FeatureStatsView" view, populated with feature statistics; useful for debugging.
    • Added a few fixes to the Greenplum docs.
    • Added parallel Greenplum loading for extractor data.
    • A few miscellaneous bug fixes.
    Source code(tar.gz)
    Source code(zip)
  • 0.04.1-RELEASE(Nov 25, 2014)

    Changelog for release 0.0.4.1-alpha (11/25/2014)

    This release focuses mostly on bug fixing and minor new features.

    • Improve handling of failures in extractors and inference rules.
    • Add support for running tests on GreenPlum.
    • Add support for -q, --quiet in the DimmWitted sampler, which makes it possible to reduce the verbosity of the output.
    • Remove some dead code.
    • Fix a small bug in the spouse_example test.
    Source code(tar.gz)
    Source code(zip)
  • 0.04-RELEASE(Nov 20, 2014)

    Changelog for release 0.0.4-alpha (11/19/2014)

    This release focuses mostly on new features and bug fixing.

    • Added experimental support for MariaDB / MySQL / MySQL Cluster. See Using DeepDive with MySQL for details, including limitations of the current support. The code base was refactored to make it much easier to add support for additional DBMS in the future.
    • Ported Tuffy to DeepDive. It is now possible to run Tuffy programs for Markov Logic Networks on DeepDive. See Markov Logic Networks for details.
    • Added a graphical interface called Mindtagger to label data products for estimating precision/recall. See Labeling Data Products of DeepDive and files under examples/labeling/ in the source tree.
    • Added support for the DEEPDIVE_HOME environment variable. It's now possible to run applications from any location when this variable is set. See Installation for details.
    • Added support for -c datacopies to the DimmWitted sampler (Linux only!). This allows controlling the number of replications of the data. It is useful for performing inference on very large factor graphs while leveraging NUMA. See The DimmWitted High-Speed Sampler for details.
    • Fixed an integer overflow bug (and use of scientific notation) in tobinary.py. This allows using DeepDive for inference on very large factor graphs.
    • Fixed a bug when using multinomial variables with Greenplum: the mapping between weight ID and weight description was not consistent.
    • Fixed various bugs (including a known JDBC bug) that prevented DeepDive from performing inference on very large factor graphs.
    Source code(tar.gz)
    Source code(zip)
  • v0.0.3-alpha.1(May 26, 2014)

    Changelog for version 0.0.3-alpha.1 (05/25/2014)

    • Updated example walkthrough and spouse_example code
    • Added the Python utility ddlib for text manipulation (requires exporting PYTHONPATH; see its pydoc for usage)
    • Added the utility script util/extractor_input_writer.py to sample extractor inputs
    • Updated the nlp_extractor format (use sentence_offset, textual sentence_id)
    • Cleaned up unused datastore code
    • Updated templates
    • Bug fixes

    Changelog for version 0.0.3-alpha (05/07/2014)

    • Non-backward-compatible syntax change: Developers must include an id column with type bigint in any table containing variables, but they MUST NOT use this column anywhere. This column is reserved for learning and inference, and all its values will be erased and reassigned in the grounding phase.
    • Non-backward-compatible functionality change: DeepDive is no longer responsible for any automatic assignment of sequential variable IDs. You may use examples/spouse_example/scripts/fill_sequence.sh for this task.
    • Updated dependency requirement: requires JDK 7 or higher.
    • Supported four new types of extractors; see the documentation for details.
    • Even faster factor graph grounding and serialization using better optimized SQL.
    • The previous default Java sampler is no longer supported; the C++ sampler is now the default.
    • New configuration supported: pipeline.relearn_from to skip extraction and grounding, only perform learning and inference with a previous version. Useful for tuning sampler arguments.
    • New configuration supported: inference.skip_learning to use weights learned in the last execution.
    • New configuration supported: inference.weight_table to fix factor weights in a table and skip learning. The table is specified by factor description and weights. This table can be results from one execution of DeepDive, or manually assigned, or a combination of them. It is useful for learning once and using learned model for later inference tasks.
    • Supported manual holdout by a holdout query.
    • Updated spouse_example with implementations in different styles of extractors.
    • The nlp_extractor example has changed table requirements and usage; see the documentation for details.
    • In the db.default configuration, users should define dbname, host, port, and user. If not defined, the system will by default use the environment variables DBNAME, PGHOST, PGPORT, and PGUSER accordingly.
    • Fixed all examples.
    • Updated documentation.
    • Print SQL query execution plans for extractor inputs.
    • Skip grounding, learning and inference if no factors are active.
    • If using Greenplum, users should add a DISTRIBUTED BY clause to all CREATE TABLE commands. Do not use the variable id column as the distribution key, and do not use a distribution key that is not initially assigned (a sketch follows below).
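
    A hedged illustration of the id-column rule and the Greenplum distribution rule together (the table and columns are made up for this example):

    psql "$DBNAME" -c "
      CREATE TABLE has_spouse_candidates (
        id      bigint,    -- reserved for DeepDive; do not read or write it yourself
        person1 text,
        person2 text,
        is_true boolean
      ) DISTRIBUTED BY (person1);  -- any stable column other than the variable id
    "
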
    Source code(tar.gz)
    Source code(zip)
Owner
HazyResearch
We are a CS research group led by Prof. Chris Ré.