We ran the chunking example with DD 0.8 on Greenplum (with 20 segments), but it always fails with this error:
2016-02-21 00:24:11.464530 ################################################
2016-02-21 00:24:11.661727 LOADED VARIABLES: #270052
2016-02-21 00:24:11.661813 N_QUERY: #49389
2016-02-21 00:24:11.661828 N_EVID : #220663
2016-02-21 00:24:11.722552 LOADED WEIGHTS: #268970
2016-02-21 00:24:12.756042 LOADED FACTORS: #1049594
2016-02-21 00:24:12.995216 sampler-dw.bin: src/dstruct/factor_graph/factor_graph.cpp:375: void dd::FactorGraph::safety_check(): Assertion `this->weights[i].id == i' failed.
2016-02-21 00:24:13.171646 process/model/learning/run.sh: line 22: 184415 Aborted (core dumped)
The corresponding code verifies that the weight vector is loaded in the order of id = 0, 1, 2, ...:
https://github.com/HazyResearch/sampler/blob/master/src/dstruct/factor_graph/factor_graph.cpp#L395
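To make the failing condition concrete, here is a minimal sketch of the kind of check the assertion performs. The `Weight` struct and the `ids_match_positions` function are hypothetical stand-ins written for this issue, not the sampler's actual definitions; only the condition `weights[i].id == i` is taken from the assertion message above.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the sampler's weight record; only `id` matters here.
struct Weight {
    std::size_t id;
    double value;
    bool is_fixed;
};

// Returns true when weights[i].id == i holds for every position i,
// which is the condition tested by the failing assertion.
bool ids_match_positions(const std::vector<Weight>& weights) {
    for (std::size_t i = 0; i < weights.size(); ++i) {
        if (weights[i].id != i) {
            return false;  // a weight was loaded out of id order
        }
    }
    return true;
}
```

Under this sketch, any dump that emits rows in an order other than id = 0, 1, 2, ... makes the check fail, even if every id is present exactly once.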
The dump_weights/run.sh scripts run queries similar to
select * from dd_weightsmulti_inf_istrue_tag limit 10;
On PG, the above query consistently returns rows in the same order, but on GP the order can vary from run to run. SQL does not guarantee any row order without an ORDER BY clause, so the weight vector ordering seems to have relied on an implementation detail of Postgres, and it now breaks on Greenplum (we are running the latest GP release as of today).
Interestingly, we could run the spouse example on GP. With spouse, the first column is isfixed -- which seems to be always f. With chunking, the first column of dd_weightsmulti_inf_istrue_tag is dd_weight_column_0, which could take on different values. And GP by default distributes on the first column. So that seems to explain the ordering issue:
# select * from dd_weightsmulti_inf_istrue_tag limit 3;
dd_weight_column_0 | isfixed | initvalue | id | categories
--------------------+---------+-----------+-----+------------
word=Results | f | 0 | 296 | 0
word=obliged | f | 0 | 307 | 0
word=Eurocom | f | 0 | 318 | 0
(3 rows)
# select * from dd_weightsmulti_inf_istrue_tag limit 3;
dd_weight_column_0 | isfixed | initvalue | id | categories
--------------------+---------+-----------+-----+------------
word=Leahy | f | 0 | 170 | 0
word=banana | f | 0 | 176 | 0
word=9.875 | f | 0 | 182 | 0
(3 rows)
# select * from dd_weightsmulti_inf_istrue_tag limit 3;
dd_weight_column_0 | isfixed | initvalue | id | categories
--------------------+---------+-----------+-----+------------
word=Results | f | 0 | 296 | 0
word=obliged | f | 0 | 307 | 0
word=Eurocom | f | 0 | 318 | 0
(3 rows)
Between adding ORDER BY to the dump_weights code and revising the sampler, which would be a better solution?
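For comparison, the sampler-side option could look roughly like the sketch below: sort the loaded weights by id before the safety check, so that the dump order no longer matters. This is only an illustration under assumptions. The `Weight` struct and `reorder_by_id` are hypothetical names, and the real sampler may load and store weights differently. The SQL-side option would instead add an ORDER BY on the id column to the dump query.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the sampler's weight record.
struct Weight {
    std::size_t id;
    double value;
    bool is_fixed;
};

// Sorts the weights by id so that position i holds the weight with id i.
// Returns false if the ids are not exactly 0..n-1 (a gap or a duplicate),
// in which case the input is invalid regardless of row order.
bool reorder_by_id(std::vector<Weight>& weights) {
    std::sort(weights.begin(), weights.end(),
              [](const Weight& a, const Weight& b) { return a.id < b.id; });
    for (std::size_t i = 0; i < weights.size(); ++i) {
        if (weights[i].id != i) {
            return false;
        }
    }
    return true;
}
```

The trade-off in this sketch is an extra O(n log n) sort at load time in exchange for not depending on the row order produced by the database.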