Predicting new observations from a Bayesian network

Predicting the values of one or more variables is the prototypical task of a machine learning model. In the context of Bayesian networks, we take it to involve just a single variable: producing the best possible instantiation of multiple variables is a most probable explanation (MPE) problem that deserves a separate implementation. Prediction is simpler and can be optimized more effectively.

bnlearn provides a predict() function (documented here) for the fitted Bayesian networks returned by bn.fit() (illustrated here). predict() implements several prediction methods with different trade-offs: "parents", "bayes-lw" and "exact". For all methods, predict() takes

  1. a bn.fit object encoding the Bayesian network;
  2. a node that we would like to predict;
  3. a data set we will predict it from;
  4. a logical argument, prob, to return the prediction probabilities for discrete Bayesian networks;

and returns a vector with the predictions. Note that one or more predicted values may be NA if the fitted Bayesian network contains NA parameters, which can be avoided by fitting it using posterior estimators or using maximum likelihood estimators with replace.unidentifiable = TRUE.
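
For instance, with the dag and dtraining.set objects defined in the next section, either of the following calls produces a fitted network with no NA parameters (a sketch; the estimator settings are illustrative):

> # posterior (Bayesian) estimators do not produce NA parameters
> dfitted = bn.fit(dag, dtraining.set, method = "bayes", iss = 1)
> # maximum likelihood estimators, with unidentifiable parameters replaced
> # as described above so that predictions are never NA
> dfitted = bn.fit(dag, dtraining.set, method = "mle", replace.unidentifiable = TRUE)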

Predicting from the parents

The most computationally-efficient way of computing predictions for a variable is to use its local distribution and to find its most probable value conditional on the values of its parents. This does not require any inference: it boils down to a look-up in the conditional probability table (for a discrete node in a discrete or conditional Gaussian network) or to a vector product (for a continuous node in a Gaussian or conditional Gaussian network). However, it does not use the Bayesian network and the information in the observation being predicted to the fullest because it disregards the rest of the Markov blanket of the node that is the target of the prediction.

Consider a Bayesian network learned and fitted from a training set, dtraining.set, with an associated validation set, dvalidation.set, for which we want to predict node E.

> library(bnlearn)
> dtraining.set = learning.test[1:4000, ]
> dvalidation.set = learning.test[4001:5000, ]
> dag = model2network("[A][C][F][B|A][D|A:C][E|B:F]")
> dfitted = bn.fit(dag, dtraining.set)

method = "parents" is the default prediction method, and it has no optional arguments.

> pred = predict(dfitted, node = "E", data = dvalidation.set, method = "parents")
> head(pred)
[1] c b b a c b
Levels: a b c
> head(dvalidation.set$E)
[1] c a c b c b
Levels: a b c
> table(dvalidation.set$E, pred)
   pred
      a   b   c
  a 176 156  30
  b  26 258  36
  c  23  99 196
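
We can verify the look-up by hand: extracting the conditional probability table of E with coef() and picking the most probable value of E given the observed values of its parents reproduces the first prediction (a sketch, not how predict() is implemented internally).

> # the prediction for the first observation is the mode of the
> # conditional distribution of E given the observed values of B and F
> cpt = coef(dfitted$E)
> obs = dvalidation.set[1, ]
> names(which.max(cpt[, as.character(obs$B), as.character(obs$F)]))
[1] "c"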

As with the other prediction methods, if the prob argument is set to TRUE and the network is a discrete Bayesian network, the prediction probabilities for all values of the target variable are attached to the predictions as an attribute. The columns correspond to the observations in dvalidation.set and the rows to the values of E.

> pred = predict(dfitted, node = "E", data = dvalidation.set, prob = TRUE)
> attr(pred, "prob")[, 1:5]
       [,1]      [,2]      [,3]       [,4]      [,5]
a 0.2301587 0.4125133 0.2345079 0.81066946 0.1191646
b 0.1785714 0.4772004 0.5018226 0.09309623 0.1117936
c 0.5912698 0.1102863 0.2636695 0.09623431 0.7690418

Similarly, for continuous data (this time predicting node F).

> gtraining.set = gaussian.test[1:4000, ]
> gvalidation.set = gaussian.test[4001:5000, ]
> dag = model2network("[A][B][E][G][C|A:B][D|B][F|A:D:E:G]")
> gfitted = bn.fit(dag, gtraining.set)
> pred = predict(gfitted, node = "F", data = gvalidation.set, method = "parents")
> head(pred)
[1] 18.57683 17.26517 21.57541 28.03751 29.75274 24.87890
> head(gvalidation.set$F)
[1] 19.61011 16.12044 19.88823 26.99181 29.55846 25.65805
> cor(gvalidation.set$F, pred)
[1] 0.9869819
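
Again, we can reproduce the first prediction by hand: the local distribution of F is a linear regression on its parents, so the predicted value is the intercept plus the regression coefficients multiplied by the observed parent values (a sketch of the vector product mentioned above).

> # evaluate the fitted linear regression of F on its parents at the
> # observed parent values of the first observation
> cf = coef(gfitted$F)
> obs = unlist(gvalidation.set[1, c("A", "D", "E", "G")])
> unname(cf["(Intercept)"] + sum(cf[c("A", "D", "E", "G")] * obs))
[1] 18.57683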

The main shortcoming of predicting from parents is that it fails to produce meaningful predictions for root nodes, since they have no parents. In the case of a discrete node, all predictions will be equal to the mode of its distribution.

> table(predict(dfitted, node = "A", data = dvalidation.set, method = "parents"))

   a    b    c 
   0 1000    0

In the case of a continuous node, all predictions will be equal to the expected value of its distribution.

> unique(predict(gfitted, node = "A", data = gvalidation.set, method = "parents"))
[1] 1.002117
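
This single value is simply the intercept of the (empty) linear regression in the local distribution of A:

> # with no parents, the local distribution of A reduces to its marginal
> # mean, the intercept of an empty regression
> coef(gfitted$A)["(Intercept)"]
(Intercept) 
   1.002117 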

Predicting with Monte Carlo posterior inference

A better approach to prediction is to use some form of inference and predict from more nodes than just the parents of the target node. Likelihood weighting is better suited for this task than logic sampling because it guarantees that none of the particles it generates are discarded, no matter how unlikely the values taken by the predictors are. (However, they must be observable.) This is what method = "bayes-lw" does for each observation being predicted:

  1. generate a sufficiently large number of particles and the associated likelihood weights using the predictors as the evidence;
  2. compute the value with the largest weight mass (the posterior mode, for a discrete target node) or the weighted average of the particles (the posterior expectation, for a continuous target node).
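
A rough sketch of these two steps for a single observation, using cpdist() to generate the weighted particles (the object names are just illustrative):

> # step 1: particles and likelihood weights, using the predictors of the
> # first observation as the evidence
> ev = lapply(dvalidation.set[1, c("A", "B", "C", "D", "F")], as.character)
> particles = cpdist(dfitted, nodes = "E", evidence = ev, method = "lw", n = 5000)
> w = attr(particles, "weights")
> # step 2: the prediction is the value of E with the largest weight mass
> names(which.max(tapply(w, particles$E, sum)))
[1] "c"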

Therefore, predict() takes two optional arguments when method = "bayes-lw": n, the number of particles to generate for each observation, and from, the nodes to use as predictors.

> pred = predict(dfitted, node = "E", data = dvalidation.set, method = "bayes-lw")
> head(pred)
[1] c b b a c b
Levels: a b c
> head(dvalidation.set$E)
[1] c a c b c b
Levels: a b c
> table(dvalidation.set$E, pred)
   pred
      a   b   c
  a 180 152  30
  b  34 250  36
  c  24  98 196

The predicted values are computed from the empirical posterior distribution of the particles, so in general they may be different each time we invoke predict(). Generating a larger number of particles makes the predictions more stable, at the cost of longer running times.

> system.time(predict(dfitted, node = "E", data = dvalidation.set, method = "bayes-lw", n = 500))
   user  system elapsed 
  0.075   0.001   0.076
> system.time(predict(dfitted, node = "E", data = dvalidation.set, method = "bayes-lw", n = 5000))
   user  system elapsed 
  0.569   0.000   0.569
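
Since the particles are generated with R's random number generator, we can also fix the seed to make the predictions reproducible (a minimal sketch; the seed value is arbitrary):

> set.seed(42)
> pred1 = predict(dfitted, node = "E", data = dvalidation.set, method = "bayes-lw")
> set.seed(42)
> pred2 = predict(dfitted, node = "E", data = dvalidation.set, method = "bayes-lw")
> identical(pred1, pred2)
[1] TRUE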

As for the predictors, from should at a minimum contain the nodes in the Markov blanket of the target node to achieve the best possible accuracy. By default, from contains all the nodes in the network apart from the target node itself, which minimizes the computational effort of generating the particles (all the predictors are fixed by the evidence, so only the target node needs to be sampled). However, we can choose any valid set of predictors depending on the application.
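
The Markov blanket of the target node can be extracted directly from the fitted network with mb():

> # the nodes that from should contain to achieve the best accuracy
> mb(dfitted, node = "E")
[1] "B" "F"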

> pred = predict(dfitted, node = "E", data = dvalidation.set, method = "bayes-lw", from = c("A", "F"))
> head(pred)
[1] c b b a c b
Levels: a b c

Predicting with exact inference

Similarly, we can use exact inference instead of approximate inference to compute predictions. On the one hand, the predicted values will not have any simulation variability because we no longer rely on Monte Carlo simulations. On the other hand, exact inference is often more computationally expensive than approximate inference, especially for large Bayesian networks.

method = "exact" leverages either junction trees (interfacing with the gRain package) or the closed-form properties of the multivariate normal distribution to produce predictions for discrete and Gaussian Bayesian networks; conditional Gaussian networks are not supported. Like "bayes-lw", it has an optional from argument, which by default contains the nodes in the Markov blanket of the target node.

> pred = predict(dfitted, node = "E", data = dvalidation.set, method = "exact")
Loading required namespace: gRain

Attaching package: 'gRbase'
The following objects are masked from 'package:bnlearn':

    ancestors, children, nodes, parents
> head(pred)
[1] c b b a c b
Levels: a b c

The same works for Gaussian Bayesian networks. Here the predictions coincide with those from method = "parents": since F has no children, its Markov blanket reduces to its parents.

> pred = predict(gfitted, node = "F", data = gvalidation.set, method = "exact")
> head(pred)
[1] 18.57683 17.26517 21.57541 28.03751 29.75274 24.87890

The number of nodes in from and their position in the network relative to the target node have a strong influence on prediction speed. In fact, they can make method = "exact" faster than method = "bayes-lw".

> system.time(predict(dfitted, node = "E", data = dvalidation.set, method = "exact"))
   user  system elapsed 
  0.025   0.000   0.025
> system.time(predict(dfitted, node = "E", data = dvalidation.set, method = "bayes-lw"))
   user  system elapsed 
  0.071   0.000   0.072
> system.time(predict(dfitted, node = "E", data = dvalidation.set, method = "exact", from = c("A", "F", "B")))
   user  system elapsed 
  0.024   0.000   0.024
> system.time(predict(dfitted, node = "E", data = dvalidation.set, method = "bayes-lw", from = c("A", "F", "B")))
   user  system elapsed 
  0.075   0.000   0.075

Predictions in Bayesian network classifiers

Bayesian network classifiers like naive.bayes() and tree.bayes() (which implements TAN) have separate predict() methods that use the closed-form expressions available for those models. Note that these methods have no method argument to choose how the predicted values are computed, but they still accept the prob argument, as illustrated here.
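
For example, a minimal sketch with a naive Bayes classifier for A, reusing the discrete training and validation sets from above:

> # a naive Bayes classifier with A as the target (training) variable
> nb = naive.bayes(dtraining.set, training = "A")
> nbfitted = bn.fit(nb, dtraining.set)
> # predict() dispatches to the classifier-specific method; prob = TRUE
> # attaches the prediction probabilities as before
> nbpred = predict(nbfitted, data = dvalidation.set, prob = TRUE)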

Last updated on Sat Feb 17 23:24:54 2024 with bnlearn 5.0-20240208 and R version 4.3.2 (2023-10-31).