Permuting columns when computing arc strength via bootstrap resampling
When learning the structure of a Bayesian network, the order of the columns in the data should not matter. However,
there is solid evidence in the literature that it does: for instance, Colombo's work on PC Stable
(link) or Kitson's work on the impact of variable ordering
on structural stability (link). Hence the question: does permuting
the columns in the data we pass to boot.strength()
at the same time as we draw the bootstrap samples
improve the averaged network structures produced by averaged.network
? This would mean that each of the
network structures learned by boot.strength()
is learned both from a different set of observations and
from data with a different ordering of the columns.
A simple simulation to contrast the current behaviour (columns in the data are always in the same order) with permuting columns in different bootstrap samples:
> library(bnlearn, quietly = TRUE, warn.conflict = FALSE) > > data.file = commandArgs()[6] > data = readRDS(data.file) > out.file = commandArgs()[7] > truebn = attr(data, "truebn") > > netlist.orig = vector(100, mode = "list") > netlist.perm = vector(100, mode = "list") > > data = data[, sample(ncol(data), ncol(data)), drop = FALSE] > > for (i in seq_along(netlist.orig)) { + bootsample = data[sample(nrow(data), nrow(data)), , drop = FALSE] + netlist.orig[[i]] = tabu(bootsample) + bootsample = bootsample[, sample(ncol(data), ncol(data)), drop = FALSE] + netlist.perm[[i]] = tabu(bootsample) + }#FOR > > avg.orig = averaged.network(custom.strength(netlist.orig, names(data))) > avg.perm = averaged.network(custom.strength(netlist.perm, names(data))) > shd.orig = shd(avg.orig, bn.net(truebn)) > shd.perm = shd(avg.perm, bn.net(truebn)) > > res = list(avg.orig = avg.orig, avg.perm = avg.perm, shd.orig = shd.orig, + shd.perm = shd.perm, data = attr(data, "label"), np = attr(data, "np"))
We run structure learning with and without permutating the columns on the same bootstrap samples to minimize
simulation variability. The input data come from nine different networks and span sample sizes between
0.1 * nparams(bn)
and 5 * nparams(bn)
. The SHD between the network structures learned
with/without permuting the columns and the true network structure is as follows:
While the impact of permuting varies, it seems clear that it is worthwhile: for all nine networks,
tabu()
and averaged.network()
learn networks with lower SHD (green points) much more often
than they learn networks with higher SHD (the red points). Sometimes not by much (see MILDEW and INSURANCE),
sometimes overwhelmingly (see PIGS, LINK and ANDES)
Interestingly, the sample size (relative the number of parameters of the true network) seems to have an impact: in the large-sample-size regime (at least as many observations as parameters) permuting the columns seems to give better results than in the small-sample-size regime (fewer observations that the number of parameters of the true network). Model averaging works best when models are independent of each other, or at least as unrelated as possible: structure learning always returns the same network (or very similar networks) when the sample size is large, and permuting the columns introduces some noisiness in the process that helps with that.
There does not seem to be any particular pattern of behaviour across the networks we used with respect to the number of nodes, of arcs, of parameters. The ratio between the SHD for the network learned without permuting the columns and that for the network without learned while permuting the columns for the same data is shown in the y-axis below.
Note that we really need to perform an initial permutation of the columns in the data, instead of keeping them in
the order in which they are generated from the network, to provide a fair assessment in the simulation study. The nodes
in the bn.fit
objects from the Bayesian network repository are typically stored in topological order, so
the variables in the data would also be stored in topological order, giving an unfair advantage to the "no permutations"
scenario. The difference in the results is stark, as can be seen below.
The PC Stable algorithm is designed to be more robust against the order of the conditional independence tests it
performs to assess arcs, which is determined by the ordering of the variables in the data in pc.stable()
.
However, the order of the tests used to discover v-structures and the order in which Meek's rules are applied are still
determined by the ordering of the variables in the data. Even so, permuting the columns does not have as large an effect
as with tabu()
: in many simulations we end up with equally accurate network structures, and in no
simulation there is a preponderance of improved SHD values. LINK and PIGS would take several weeks of compute time to
investigate with pc.stable()
, so we dropped them for convenience.
Thu Dec 22 02:45:21 2022
with bnlearn
4.9-20221220
and R version 4.2.2 (2022-10-31)
.