Permuting columns when computing arc strength via bootstrap resampling

When learning the structure of a Bayesian network, the order of the columns in the data should not matter. However, there is solid evidence in the literature that it does: for instance, Colombo's work on PC Stable (link) or Kitson's work on the impact of variable ordering on structural stability (link). Hence the question: does permuting the columns of the data we pass to boot.strength(), at the same time as we draw the bootstrap samples, improve the averaged network structures produced by averaged.network()? This would mean that each of the network structures learned by boot.strength() is learned both from a different set of observations and from a different ordering of the columns.
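
For reference, the current behaviour is the standard boot.strength() plus averaged.network() pipeline; a minimal sketch on the learning.test data set that ships with bnlearn, where R = 100 and tabu() are just illustrative choices:

> library(bnlearn)
> strength = boot.strength(learning.test, R = 100, algorithm = "tabu")
> avg = averaged.network(strength)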

A simple simulation to contrast the current behaviour (the columns in the data are always in the same order) with permuting the columns independently in each bootstrap sample:

> library(bnlearn, quietly = TRUE, warn.conflicts = FALSE)
> 
> data.file = commandArgs()[6]
> data = readRDS(data.file)
> out.file = commandArgs()[7]
> truebn = attr(data, "truebn")
> 
> netlist.orig = vector(100, mode = "list")
> netlist.perm = vector(100, mode = "list")
> 
> # perform an initial permutation of the columns, so that they do not sit in
> # the topological ordering induced by the true network (see below).
> data = data[, sample(ncol(data), ncol(data)), drop = FALSE]
> 
> for (i in seq_along(netlist.orig)) {

+   # draw the bootstrap sample with replacement, and learn a network from it
+   # with the columns in their current order...
+   bootsample = data[sample(nrow(data), nrow(data), replace = TRUE), , drop = FALSE]
+   netlist.orig[[i]] = tabu(bootsample)
+   # ... and again after permuting its columns.
+   bootsample = bootsample[, sample(ncol(data), ncol(data)), drop = FALSE]
+   netlist.perm[[i]] = tabu(bootsample)

+ }#FOR
> 
> avg.orig = averaged.network(custom.strength(netlist.orig, names(data)))
> avg.perm = averaged.network(custom.strength(netlist.perm, names(data)))
> shd.orig = shd(avg.orig, bn.net(truebn))
> shd.perm = shd(avg.perm, bn.net(truebn))
> 
> res = list(avg.orig = avg.orig, avg.perm = avg.perm, shd.orig = shd.orig,
+         shd.perm = shd.perm, data = attr(data, "label"), np = attr(data, "np"))
> saveRDS(res, file = out.file)
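
For completeness, the input files this script reads can be generated along these lines. This is a minimal sketch: the file names, the 0.5 multiplier and the exact contents of the "label" and "np" attributes are illustrative assumptions, not the code used in the study.

> bn = readRDS("andes.rds")        # a bn.fit object from the BN repository
> n = round(0.5 * nparams(bn))     # sample size relative to the number of parameters
> data = rbn(bn, n = n)            # generate the data from the true network
> attr(data, "truebn") = bn        # store the attributes the script above expects
> attr(data, "label") = "ANDES"
> attr(data, "np") = n / nparams(bn)
> saveRDS(data, file = "andes-0.5.rds")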

We run structure learning with and without permuting the columns on the same bootstrap samples to minimize simulation variability. The input data come from nine different networks and span sample sizes between 0.1 * nparams(bn) and 5 * nparams(bn). The SHD between the network structures learned with/without permuting the columns and the true network structure is as follows:

[figure: SHD from the true network for the structures learned with and without permuting the columns, for each of the nine networks]

While the impact of permuting varies, it seems clear that it is worthwhile: for all nine networks, tabu() and averaged.network() learn networks with lower SHD (green points) much more often than networks with higher SHD (red points), sometimes not by much (see MILDEW and INSURANCE), sometimes overwhelmingly (see PIGS, LINK and ANDES).

Interestingly, the sample size (relative to the number of parameters of the true network) seems to have an impact: in the large-sample-size regime (at least as many observations as parameters), permuting the columns seems to give better results than in the small-sample-size regime (fewer observations than the number of parameters of the true network). Model averaging works best when the models are independent of each other, or at least as unrelated as possible: when the sample size is large, structure learning returns the same network (or very similar networks) from every bootstrap sample, and permuting the columns introduces some noisiness in the process that helps with that.

[figures: the same SHD comparison, split into the small- and large-sample-size regimes]
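
This matches the intuition above: in the large-sample-size regime, structure learning returns (nearly) the same network from every bootstrap sample. A quick check, not part of the simulation itself:

> nets = replicate(10, tabu(data[sample(nrow(data), nrow(data), replace = TRUE), , drop = FALSE]),
+          simplify = FALSE)
> # for large samples, the average SHD between the learned networks is close to zero.
> mean(sapply(nets[-1], shd, true = nets[[1]]))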

There does not seem to be any particular pattern of behaviour across the networks we used with respect to their numbers of nodes, arcs or parameters. The ratio between the SHD of the network learned without permuting the columns and that of the network learned while permuting the columns, for the same data, is shown on the y-axis below.
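
Each point can be computed from one of the res objects saved by the script above; the file name in this sketch is an illustrative assumption:

> res = readRDS("andes-0.5-out.rds")
> res$shd.orig / res$shd.perm   # the SHD ratio plotted on the y-axis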

[figure: the SHD ratio plotted against the numbers of nodes, arcs and parameters of the nine networks]

Note that we really need to perform an initial permutation of the columns in the data, instead of keeping them in the order in which they are generated from the network, to provide a fair assessment in the simulation study. The nodes in the bn.fit objects from the Bayesian network repository are typically stored in topological order, so the variables in the data would also be stored in topological order, giving an unfair advantage to the "no permutations" scenario. The difference in the results is stark, as can be seen below.
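
As a quick sanity check of the topological-ordering claim, run before the initial permutation (imperfect, since a network generally admits more than one topological ordering and node.ordering() returns only one of them):

> # TRUE when the columns follow the node ordering used to generate the data.
> identical(names(data), node.ordering(truebn))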

[figure: the same comparison without the initial permutation of the columns]

The PC Stable algorithm is designed to be more robust against the order of the conditional independence tests it performs to assess arcs, which is determined by the ordering of the variables in the data passed to pc.stable(). However, the order of the tests used to discover v-structures and the order in which Meek's rules are applied are still determined by the ordering of the variables in the data. Even so, permuting the columns does not have as large an effect as with tabu(): in many simulations we end up with equally accurate network structures, and in no simulation is there a preponderance of improved SHD values. LINK and PIGS would take several weeks of compute time to investigate with pc.stable(), so we dropped them for convenience.
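
These runs use the same simulation as above, only swapping pc.stable() for tabu() in the loop; a minimal sketch, using the default conditional independence test and alpha:

> for (i in seq_along(netlist.orig)) {
+   bootsample = data[sample(nrow(data), nrow(data), replace = TRUE), , drop = FALSE]
+   netlist.orig[[i]] = pc.stable(bootsample)
+   bootsample = bootsample[, sample(ncol(data), ncol(data)), drop = FALSE]
+   netlist.perm[[i]] = pc.stable(bootsample)
+ }#FOR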

[figure: SHD comparison for the networks learned by pc.stable() with and without permuting the columns]

Last updated on Thu Dec 22 02:45:21 2022 with bnlearn 4.9-20221220 and R version 4.2.2 (2022-10-31).