## Interfacing with the parallel and snow R packages

The **parallel** package and its predecessor **snow** provide a basic implementation of the master-slave parallel programming model, and are the *de facto* standard way of doing parallel computing in R. Since most tasks in the application of Bayesian networks are computationally intensive, many functions in **bnlearn** have a `cluster` argument that takes the cluster returned by **parallel**'s `makeCluster()` and use it to split their computation among the available slave processes.

The same is true for the clusters provided by the older **snow** package, since **parallel** is a drop-in replacement for **snow**.
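
For instance (a minimal sketch, assuming the **snow** package is installed; `cl.snow` is just an illustrative name), an equivalent cluster can be created directly with **snow** and passed to the same `cluster` arguments:

> library(snow)
> cl.snow = makeCluster(2, type = "SOCK")
> stopCluster(cl.snow)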

### Parallel structure learning

All constraint-based structure learning functions can be parallelised following Scutari (2017); Markov blankets and neighbourhoods of different nodes can be learned independently from each other, and later merged and reconciled to produce a coherent Bayesian network. First, we need to load the **parallel** package and initialise the cluster of slave processes, called `cl` below.

> library(parallel)
> library(bnlearn)

> cl = makeCluster(2, type = "PSOCK")
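
The number of slave processes (two above) should be matched to the available hardware; `detectCores()` from **parallel** reports the number of CPU cores as a reference point (its output obviously depends on the machine, so none is shown here):

> detectCores()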

Note that there are various `type`s of clusters: they use different inter-process communication mechanisms to pass R code and objects between the master process and the slaves, and some may be more appropriate than others depending on the hardware. In **parallel**, random number generators in the slaves are properly initialised to give independent streams of random numbers. (This is not the case for **snow**.)

> rand = clusterEvalQ(cl, runif(10))
> rand

[[1]]
 [1] 0.2653976 0.2307654 0.4760748 0.7102806 0.2927784 0.8210281 0.9616853
 [8] 0.4075056 0.1576284 0.1940886

[[2]]
 [1] 0.7155422 0.7610362 0.7223570 0.2094305 0.9278226 0.1082685 0.6748543
 [8] 0.1427551 0.4874325 0.7536883

> cor(rand[[1]], rand[[2]])

[1] -0.4465836

Alternatively, it is always possible to set a random seed with `clusterSetRNGStream()` to get reproducible results across the whole cluster.

> clusterSetRNGStream(cl, 42)
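
As a quick sanity check (a sketch; `r1` and `r2` are illustrative names), resetting the stream to the same seed makes the slaves reproduce exactly the same draws:

> clusterSetRNGStream(cl, 42)
> r1 = clusterEvalQ(cl, runif(3))
> clusterSetRNGStream(cl, 42)
> r2 = clusterEvalQ(cl, runif(3))
> identical(r1, r2)

[1] TRUE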

After the cluster is set up, performing parallel structure learning just means passing the `cl` object to, for instance, the `si.hiton.pc()` function.

> data(learning.test)
> pdag = si.hiton.pc(learning.test, cluster = cl)
> clusterEvalQ(cl, test.counter())

[[1]]
[1] 15

[[2]]
[1] 15

> test.counter()

[1] 0

Using `test.counter()`, we find that each slave executed about half of the conditional independence tests, while the master process executed almost none (none at all in this run).
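
To gauge the actual speed-up (a sketch; the timings depend on the hardware and on the communication overhead between master and slaves, so no output is shown), we can time the serial and parallel runs:

> system.time(si.hiton.pc(learning.test))
> system.time(si.hiton.pc(learning.test, cluster = cl))

For a data set as small as `learning.test` the overhead may offset most of the gains; larger data sets benefit more.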

Finally, we may want to stop the cluster and kill the slave processes when we are done; note that the sections below keep using `cl`, so the cluster should only be stopped after running all the examples of interest.

> stopCluster(cl)

### Parallel parameter learning

Parameter learning is embarrassingly parallel by construction, since each local distribution is independent of the others. Again, we just need to pass `cl` to `bn.fit()` to spread the computations over the slave processes.

> dag = cextend(pdag)
> fitted = bn.fit(dag, learning.test, cluster = cl)
> fitted$B

  Parameters of node B (multinomial distribution)

Conditional probability table:

   E
B            a          b          c
  a 0.73364245 0.46483590 0.15581098
  b 0.07521896 0.10113865 0.17305236
  c 0.19113859 0.43402545 0.67113665
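
Since these are maximum likelihood estimates, the parallel fit is deterministic and must coincide with a serial one; a quick check (a sketch, comparing one conditional probability table extracted with `coef()`):

> fitted.serial = bn.fit(dag, learning.test)
> all.equal(coef(fitted$B), coef(fitted.serial$B))

[1] TRUE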

### Parallel cross-validation

Structure learning can also be parallelised in a different way: when learning a Bayesian network with *k*-fold cross-validation we are in fact learning *k* networks independently from different subsets of the data (identified by the fold we are withholding to use for validation). Hence we can meaningfully use a number of slaves up to the number of folds and learn the networks in parallel by passing `cl` to `bn.cv()`.

> bn.cv(learning.test, "hc", k = 10, cluster = cl)

  k-fold cross-validation for Bayesian networks

  target learning algorithm:             Hill-Climbing
  number of folds:                       10
  loss function:                         Log-Likelihood Loss (disc.)
  expected loss:                         4.775614

This is also true for other kinds of cross-validation, like `hold-out` and `custom-folds`.
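
For example (a sketch; here `k` is the number of splits and `m` the size of each hold-out test set), hold-out cross-validation is parallelised in exactly the same way:

> bn.cv(learning.test, "hc", method = "hold-out", k = 10, m = 100, cluster = cl)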

### Parallel bootstrap

Another classic technique that is embarrassingly parallel is the bootstrap: each bootstrap sample is extracted independently from the original sample, and the statistic of interest is then estimated independently from each bootstrap sample. This being the case, both `bn.boot()` and `boot.strength()` have a `cluster` argument to perform computations in parallel.

For instance, we can use `bn.boot()` to compute the number of arcs (`narcs`) of the networks learned with a particular structure learning algorithm (`hc()`) and a sample size of `100`.

> A = bn.boot(learning.test, algorithm = "hc", cluster = cl,
+       statistic = narcs, R = 10, m = 100)
> unlist(A)

[1] 4 4 5 4 4 4 5 4 4 5
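
The bootstrapped statistics can then be summarised as usual; for instance, averaging them estimates the expected number of arcs:

> mean(unlist(A))

[1] 4.3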

Or we can use `boot.strength()` to compute the confidence in the presence and direction of each arc.

> confidence = boot.strength(learning.test, algorithm = "hc", cluster = cl)
> confidence[(confidence$strength > 0.50) & (confidence$direction >= 0.50), ]

   from to strength direction
1     A  B        1    0.5000
3     A  D        1    1.0000
6     B  A        1    0.5000
9     B  E        1    0.9925
13    C  D        1    1.0000
30    F  E        1    0.9925
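
These strength and direction estimates can then be combined into a consensus structure with `averaged.network()` (a brief follow-on; the `0.50` threshold mirrors the filtering above):

> avg = averaged.network(confidence, threshold = 0.50)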

### Parallel approximate inference

Approximate inference techniques for Bayesian networks are typically embarrassingly parallel, since they are built on particle-based simulation methods that use independent particles. This makes it possible to generate particles in parallel in the slaves, to compute an estimate of the desired (conditional) probability in each slave, and then to return the average of those estimates as the overall value.

Below we do this with `cpquery()` for both logic sampling and likelihood weighting.

> cpquery(fitted, event = (A == "a") & (B == "b"),
+     evidence = (F == "a"), method = "ls", cluster = cl)

[1] 0.01238686

> cpquery(fitted, event = (A == "a") & (B == "b"),
+     evidence = list(F = "a"), method = "lw", cluster = cl)

[1] 0.008436615
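
Both estimates are stochastic, so their variability can be reduced by increasing the number of particles through the `n` argument (a sketch; the particles are still generated in parallel on the slaves, and no output is shown since it varies between runs):

> cpquery(fitted, event = (A == "a") & (B == "b"),
+     evidence = list(F = "a"), method = "lw", n = 10^6, cluster = cl)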

In the same spirit, `cpdist()` can also generate random samples in parallel.

> cpdist(fitted, nodes = "A", evidence = (F == "a"), n = 10,
+     method = "ls", cluster = cl)$A

[1] c c c b b c
Levels: a b c

> cpdist(fitted, nodes = "A", evidence = list(F = "a"), n = 10,
+     method = "lw", cluster = cl)$A

 [1] a b a b b b a c b c
Levels: a b c
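
The generated samples can in turn be tabulated into an empirical distribution (a sketch; `samples` is an illustrative name, and logic sampling is used so that the samples are unweighted):

> samples = cpdist(fitted, nodes = "A", evidence = (F == "a"), n = 10^4,
+     method = "ls", cluster = cl)
> prop.table(table(samples$A))

Once all the examples are done, the cluster should again be released with `stopCluster(cl)`.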

`Tue Nov 14 15:47:06 2017` with **bnlearn** `4.3-20171001` and `R version 3.0.2 (2013-09-25)`.