Structure learning benchmarks in Scutari, Graafland and Gutiérrez, International Journal of Approximate Reasoning (2019)

This is a short HOWTO describing the simulation setup in “Who Learns Better Bayesian Network Structures: Accuracy and Speed of Structure Learning Algorithms” by Scutari, Graafland and Gutiérrez (2019, IJAR), which is an extended version of “Who Learns Better Bayesian Network Structures: Constraint-Based, Score-Based or Hybrid Algorithms?” by the same authors (2018, PGM). The code below includes several incremental improvements over that used for the PGM 2018 paper.

The makefile

An abridged version of the makefile that drives the simulation study is below.

SHELL := /bin/bash
REPS := 20
NETWORKS := $(sort $(patsubst %.rds,%.rds.done,$(wildcard networks/*.rds)))

.PHONY: setup plots clean learn-all

# perform the whole simulation.
learn-all: pc-bict ...

setup: $(NETWORKS)

# generating data.
networks/%.rds.done: networks/%.rds
	mkdir -p data results
	Rscript --vanilla setup.R $< $(REPS)
	touch $@

# learn the network using the BIC test and PC.
pc-bict: setup
	@for DATA in `ls -1 data`; \
	do \
	  TARGET="$$TARGET results/`sed -e s/rds/pc.bict.rds/ <<< $$DATA`"; \
	done; \
	$(MAKE) $$TARGET;

results/%.pc.bict.rds: data/%.rds
	Rscript pc.bict.R $< $@

...

# summarise.
summaries.rds: pc-bict ...
	Rscript tabulate.R

# plot.
plots: summaries.rds
	Rscript plot.R

# clean up.
clean:
	rm -rf data results networks/*.done

The targets perform the following actions:

  1. learn-all: runs all the structure learning algorithms.
  2. networks/%.rds.done: generates 20 data sets for each sample size, from each of the networks.
  3. pc-bict: runs the PC algorithm with the BIC test described in the paper. It dynamically generates the call to the pattern target that follows (results/%.pc.bict.rds), populating it with the file names of the individual simulations (one for each combination of network, sample size and rep).
  4. summaries.rds: extracts the statistics of interest from the RDS files containing the results of the individual simulations.
  5. plots: creates the figures in the paper from the file created by summaries.rds.
  6. clean: removes the results and the generated data.

There is one target like pc-bict (identical in all but the name and the R script it calls) for each of the algorithms in the simulation study. Calling $(MAKE) with a dynamically generated set of targets makes it possible to run individual simulations in parallel; and saving each of them in a separate file makes it easy to interrupt and resume the simulation without losing too much work.
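The shell loop in the pc-bict recipe can be tried outside of make. The sketch below reproduces it with hypothetical data files (the demo/ paths are not part of the study) and prints the target list that would then be handed to $(MAKE); passing -j to make builds those targets in parallel. The only change is that the bash here-string (<<<) in the recipe is replaced by a pipe, for portability to plain sh.

```shell
# a minimal sketch of the dynamic target-list construction in the pc-bict
# recipe, using hypothetical data files A.01.rds and A.02.rds.
mkdir -p demo/data
touch demo/data/A.01.rds demo/data/A.02.rds

TARGET=""
for DATA in $(ls -1 demo/data); do
  # turn each data file name into the matching results file name,
  # as sed -e s/rds/pc.bict.rds/ <<< $DATA does in the makefile.
  TARGET="$TARGET results/$(echo "$DATA" | sed -e s/rds/pc.bict.rds/)"
done

echo $TARGET
# prints: results/A.01.pc.bict.rds results/A.02.pc.bict.rds

rm -r demo
```

Because each results/%.pc.bict.rds file is an ordinary make target, re-running the same command after an interruption skips the files that already exist and resumes from where the simulation stopped.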

Generating the data

The setup.R script reads the RDS files with the reference networks from the Bayesian network repository (stored in the networks directory) and generates 20 data sets for each sample size, computed as ceiling(np * nparams(bn)) for each ratio np.

> library(bnlearn, quietly = TRUE, warn.conflicts = FALSE)
> 
> rds.file = commandArgs()[7]
> reps = as.integer(commandArgs()[8])
> np.ratios = c(0.1, 0.2, 0.5, 1, 2, 5)
> 
> bn = readRDS(rds.file)
> label = toupper(gsub("networks/(.+)\\.rds$", "\\1", rds.file))
> 
> for (np in np.ratios) {

+   for (r in seq(reps)) {

+     id = sprintf("%02d", r)
+     npr = sprintf("%.1f", np)
+     filename = paste("data/", label, "[", npr, "]", ".", id, ".rds", sep = "")

+     if (file.exists(filename))
+       next

+     data = rbn(bn, ceiling(np * nparams(bn)))
+     attr(data, "network") = rds.file
+     attr(data, "label") = label
+     attr(data, "np") = np
+     saveRDS(data, file = filename)

+   }#FOR

+ }#FOR
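For scale, ALARM has 509 parameters in the repository, so the six np ratios translate into the sample sizes printed below. The snippet is a stand-alone sketch of the ceiling(np * nparams(bn)) computation, using awk in place of R.

```shell
# sample sizes ceiling(np * nparams(bn)) for a network with 509 parameters
# (the size of ALARM), one line per np ratio.
for np in 0.1 0.2 0.5 1 2 5; do
  awk -v np="$np" 'BEGIN { n = np * 509; c = int(n); if (c < n) c = c + 1; print c }'
done
# prints: 51, 102, 255, 509, 1018, 2545
```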

Performing structure learning

The scripts that implement structure learning all follow the same template; pc.bict.R is shown below as an example. In addition to saving the bn object for the learned network, they store the sample size ratio and the network label to make it easier to label the results later in tabulate.R.

> library(bnlearn, quietly = TRUE, warn.conflicts = FALSE)
> 
> data.file = commandArgs()[6]
> data = readRDS(data.file)
> out.file = commandArgs()[7]
> 
> if (bnlearn:::data.type(data) == "factor")
+   test = "bic-t"
> if (bnlearn:::data.type(data) == "continuous")
+   test = "bic-gt"
> 
> net = pc.stable(data, test = test)
> net$learning$from = attr(data, "network")
> net$learning$label = attr(data, "label")
> net$learning$np = attr(data, "np")
> 
> saveRDS(net, file = out.file)

The Greedy Equivalence Search (GES) is performed using the rcausal package (github), which wraps the TETRAD Bayesian network suite. The fges.bde.R script is as follows; note that the number of model comparisons is measured separately by instrumenting the TETRAD code.

> library(bnlearn, quietly = TRUE, warn.conflicts = FALSE)
> library(rcausal)
> 
> data.file = commandArgs()[6]
> data = readRDS(data.file)
> out.file = commandArgs()[7]
> 
> if (bnlearn:::data.type(data) == "factor") {

+   sink("/dev/null")
+   net = fges.discrete(data, structurePrior = 1, samplePrior = 1)
+   sink()

+   directed = do.call(rbind, strsplit(grep("-->", net$edges, value = TRUE), " --> "))
+   undirected = do.call(rbind, strsplit(grep("---", net$edges, value = TRUE), " --- "))

+   net = empty.graph(names(data))
+   arcs(net) = rbind(directed, undirected, undirected[, 2:1])
+   net$learning$from = attr(data, "network")
+   net$learning$label = attr(data, "label")
+   net$learning$np = attr(data, "np")
+   net$learning$algo = "fges"
+   net$learning$test = "bde"

+   saveRDS(net, file = out.file)

+ }#THEN
> 
> if (bnlearn:::data.type(data) == "continuous")
+   saveRDS(NA, file = out.file)

Simulated annealing is implemented using the catnet package in sann.bic.R. After a set of networks is learned with cnSearchSA() and the best-scoring one is selected with cnFindBIC(), we extract its node ordering and re-learn the structure with tabu() from bnlearn, constrained to respect that ordering through ordering2blacklist(). The number of model comparisons is again measured separately by instrumenting the code in catnet.

> library(bnlearn, quietly = TRUE, warn.conflicts = FALSE)
> library(catnet)
> 
> data.file = commandArgs()[6]
> data = readRDS(data.file)
> out.file = commandArgs()[7]
> 
> if (bnlearn:::data.type(data) == "factor")
+   score = "bic"
> if (bnlearn:::data.type(data) == "continuous")
+   score = "bic-g"
> 
> netlist = cnSearchSA(data = data, maxParentSet = 1,
+             tempCheckOrders = 10, maxIter = 10, numThreads = 10)
> net = cnFindBIC(netlist)
> arcs = bnlearn:::elist2arcs(cnEdges(net))
> net = empty.graph(names(data))
> arcs(net) = arcs
> net = tabu(data, score = score,
+         blacklist = ordering2blacklist(node.ordering(net)))
> 
> net$learning$from = attr(data, "network")
> net$learning$label = attr(data, "label")
> net$learning$np = attr(data, "np")
> net$learning$algo = "sann"
> net$learning$test = score
> 
> saveRDS(net, file = out.file)

Collecting summary statistics

The script tabulate.R creates a data frame with the statistics of interest: the SHD from the true network structure and the number of model evaluations. It iterates over all the files with the simulation results stored in the results directory and appends the statistics of each individual simulation to the data frame, which admittedly is not the most efficient way of going about it.

> library(bnlearn)
> 
> shd = data.frame(
+   label = character(0),
+   np = character(0),
+   algo = character(0),
+   criterion = character(0),
+   shd = numeric(0)
+ )
> 
> learned = list.files("results", full.names = TRUE)
> 
> for (l in learned) {

+   bn = readRDS(l)
+   true = bn.net(readRDS(bn$learning$from))

+   shd = rbind(shd, data.frame(
+     label = bn$learning$label,
+     np = as.character(bn$learning$np),
+     algo = bn$learning$algo,
+     criterion = gsub("-.*", "", bn$learning$test),
+     shd = shd(bn, true)
+   ))

+ }#FOR
> 
> saveRDS(shd, file = "summaries.rds")

Generating the figures

The figures in the paper are generated with the lattice package from the tabulated statistics stored in summaries.rds.

> library(lattice)
> library(latticeExtra)
> 
> summaries = readRDS("summaries.rds")
> 
> # separate discrete and continuous data sets.
> discrete = !grepl("MAGIC", summaries$label)
> bic = grepl("bic", summaries$criterion)
> dsummaries = summaries[discrete & bic, ]
> gsummaries = summaries[!discrete & bic, ]
> 
> p1 = xyplot(shd ~ np | label, groups = droplevels(interaction(algo, criterion)),
+        data = dsummaries, auto.key = TRUE, as.table = TRUE,
+        scales = list(alternating = FALSE, y = list(relation = "free")),
+        panel = function(x, y, groups, ...) {

+   panel.xyplot(x = x, y = y, groups = groups, ..., pch = 19, alpha = 0.10)
+   panel.superpose(x = x, y = y, groups = groups, panel.group = "panel.smoother",
+     ..., span = 0.99)

+ })
> 
> p2 = xyplot(shd ~ np | label, groups = droplevels(interaction(algo, criterion)),
+        data = gsummaries, auto.key = TRUE, as.table = TRUE,
+        scales = list(alternating = FALSE, y = list(relation = "free")),
+        panel = function(x, y, groups, ...) {

+   panel.xyplot(x = x, y = y, groups = groups, ..., pch = 19, alpha = 0.10)
+   panel.superpose(x = x, y = y, groups = groups, panel.group = "panel.smoother",
+     ..., span = 0.99)

+ })
> 
> pdf(file = "shd-discrete-bic.pdf", width = 9, height = 12)
> print(p1)
> dev.off()
> 
> pdf(file = "shd-gaussian-bic.pdf", width = 9, height = 12)
> print(p2)
> dev.off()
Last updated on Thu Aug 13 13:16:10 2020 with bnlearn 4.6-20200410 and R version 4.0.2 (2020-06-22).