Benchmarking optimizations in Scutari, Vitolo and Tucker, Statistics and Computing (2019)

This is a short HOWTO describing the simulation studies presented in “Learning Bayesian Networks from Big Data with Greedy Search: Computational Complexity and Efficient Implementation” by Scutari, Vitolo and Tucker (Statistics and Computing, 2019). The code benchmarks different optimizations in the implementation of BIC available in bnlearn, as well as the predictive log-likelihood score originally proposed by Chickering and Heckerman (available since bnlearn 4.5).

Benchmarks on the MEHRA network

The MEHRA network originates from a previous paper by the same authors (the supplementary material is here) and in now available from the Bayesian network repository (here). It is a conditional Gaussian Bayesian network containing 24 nodes, 71 arcs and 325k parameters.

First of all, we loaded the MEHRA network from the repository and extracted its DAG with bn.net().

> library(bnlearn)
> mehra.bn = readRDS("https://www.bnlearn.com/bnrepository/mehra/mehra-complete.rds")
> mehra.dag = bn.net(mehra.bn)

We then generated 5 data sets for each of sample sizes 1, 2, 5, 10, 20 and 50 millions.

> for (size in c(1, 2, 5, 10, 20, 50)) {

+   data = vector(5, mode = "list")
+   for (rep in 1:5)
+     data[[rep]] = rbn(mehra.bn, size * 10^6)

+ }#FOR
> 
> saveRDS(data, file = paste0("mehra", size, ".rds"))

For the data sets stored in each file (with name defined by the variable file in the code below), we first benchmarked BIC.

> library(bnlearn)
> data = readRDS(file)
> mehra.bn = readRDS("https://www.bnlearn.com/bnrepository/mehra/mehra-complete.rds")
> mehra.dag = bn.net(mehra.bn)
> 
> timing = numeric(5)
> distance = numeric(5)
> for (i in seq_along(data)) {

+   timing[i] = system.time(net <<- hc(data[[i]], score = "bic"))
+   distance[i] = shd(mehra.dag, net)

+ }#FOR
> 
> print(data.frame(timing, distance))

We selected the different levels of optimization of the BIC score by commenting those we did not want to use in the c_ols() function in src/least.squares.c and in the c_cls() function in src/conditional.least.squares.c in the source code of bnlearn. Both functions look like this:

  if (ncol == 0) {

    ...

  }/*THEN*/
  else if (ncol == 1) {

    ...

  }/*THEN*/
  else if (ncol == 2) {

    ...

  }/*THEN*/
  else {

    ...

  }/*ELSE*/

Commenting the ncol == 1 and ncol == 2 branches gives the QR setting in the paper; commenting the ncol == 2 branch gives the 1P setting; and not commenting any code gives the 2P setting.

The predictive log-likelihood score was benchmarked along the same lines with the code below.

> library(bnlearn)
> data = readRDS(file)
> mehra.bn = readRDS("https://www.bnlearn.com/bnrepository/mehra/mehra-complete.rds")
> mehra.dag = bn.net(mehra.bn)
> 
> timing = numeric(5)
> distance = numeric(5)
> for (i in seq_along(data)) {

+   training.set = sample(nrow(data[[i]]), round(nrow(data[[i]]) * 0.25))
+   train = data[[i]][training.set]
+   test = data[[i]][-training.set]
+   timing[i] = system.time(net <<- hc(train, score = "pred-loglik-cg", newdata = test))
+   distance[i] = shd(mehra.dag, net)

+ }#FOR
> 
> print(data.frame(timing, distance))

Benchmarks on other large data sets

The same code, minus the random sampling in the first snippet, was used to benchmark the data sets from the UCI Machine Learning Repository and the JSM Data Exposition session. The pre-processed files used in these benchmarks (hosted on Google Drive) are:

Last updated on Mon Oct 18 09:11:43 2021 with bnlearn 4.8-20211005 and R version 4.1.1 (2021-08-10).