## Imputing missing values from a Bayesian network

Imputing missing values is essential for applying methods designed for complete data (that is, most of them) to
incomplete data. Conceptually, imputation is similar to prediction: both are *most probable explanation* queries in
which we observe a subset of the variables in the data and we infer the values of some of the remaining variables. For
this reason, the implementation of `impute()` in **bnlearn** has the same interface as `predict()` (both are
documented in the package manual).

Like `predict()` (illustrated in a separate example), `impute()` takes as arguments a fitted Bayesian network, a data
frame with missing data to impute and the label of the method used to perform the imputation. Available methods are
the same as in `predict()`: `"parents"`, `"bayes-lw"` and `"exact"`.
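To make the shared interface concrete, here is a minimal sketch that calls both functions on the `learning.test` data set shipped with **bnlearn**. The network structure, the choice of node `E` and the ten blanked cells are assumptions made for illustration only.

```r
library(bnlearn)

# Illustrative network over the learning.test variables (an assumption of this sketch).
dag = model2network("[A][C][F][B|A][D|A:C][E|B:F]")
dfitted = bn.fit(dag, learning.test)

# predict() returns the predicted values for a single node...
pred = predict(dfitted, node = "E", data = learning.test, method = "parents")

# ... while impute() returns the whole data frame with the missing cells filled in.
incomplete = learning.test
incomplete[1:10, "E"] = NA
completed = impute(dfitted, data = incomplete, method = "parents")

head(pred)
head(completed)
```

The two calls take the same fitted network, a data frame and a method label; the main visible difference is that `predict()` also needs the label of the target node.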

### Imputing from the parents

With `method = "parents"`, the missing values in each variable are imputed from the parents of that variable in the
Bayesian network. The imputation is performed in topological order, starting from the root nodes, so that the parents
are completed by the time they are needed to impute their children.

    > library(bnlearn)
    >
    > dag = model2network("[A][C][F][B|A][D|A:C][E|B:F]")
    > dfitted = bn.fit(dag, learning.test)
    >
    > incomplete = learning.test
    > missing = matrix(c(sample(nrow(incomplete), 1000),
    +                    sample(ncol(incomplete), 1000, replace = TRUE)),
    +             ncol = 2)
    > incomplete[missing] = NA
    > head(incomplete, n = 10)

       A B    C    D E    F
    1  b c    b    a b    b
    2  b a    c    a b <NA>
    3  a a    a    a a    a
    4  a a    a    a b    b
    5  a a    b    c a    a
    6  c c    a    c c    a
    7  c c    b    c c    a
    8  b b    a <NA> b    b
    9  b b    b    a c    a
    10 b a <NA>    a a    a

    > completed = impute(dfitted, data = incomplete, method = "parents")
    > all(complete.cases(completed))

    [1] TRUE
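The topological order mentioned above can be inspected directly. The sketch below uses `node.ordering()`, `root.nodes()` and `parents()` purely as an illustration of what "parents before children" means for this network; it is not confirmed to be the exact ordering that `impute()` uses internally.

```r
library(bnlearn)

dag = model2network("[A][C][F][B|A][D|A:C][E|B:F]")

# One topological ordering of the nodes: every node appears after its parents.
ordering = node.ordering(dag)
print(ordering)

# Root nodes have no parents, so they come first in any such ordering.
print(root.nodes(dag))

# Parents of each node, that is, the variables used to impute it with method = "parents".
print(sapply(nodes(dag), function(node) parents(dag, node), simplify = FALSE))
```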

As in `predict()`, `method = "parents"` is ill-suited to impute missing values in root nodes. Those nodes do not have
any parents, so all their missing values are imputed with the same value: either the mean (for Gaussian nodes) or the
mode (for discrete nodes) of the respective local distributions. This issue also affects the quality of the imputation
of other variables that are descendants of those nodes.
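A small check can make this limitation visible. The sketch below blanks out cells in the root node `A` only and tabulates what `method = "parents"` fills in; the seed, the number of blanked cells and the helper names `blanked` and `mode.A` are arbitrary choices for illustration.

```r
library(bnlearn)
set.seed(1)  # arbitrary seed, only to make the sketch repeatable

dag = model2network("[A][C][F][B|A][D|A:C][E|B:F]")
dfitted = bn.fit(dag, learning.test)

# Blank out some cells of the root node A only.
incomplete = learning.test
blanked = sample(nrow(incomplete), 200)
incomplete[blanked, "A"] = NA

completed = impute(dfitted, data = incomplete, method = "parents")

# Mode of the fitted marginal distribution of A.
probs = dfitted$A$prob
mode.A = dimnames(probs)[[1]][which.max(probs)]
print(mode.A)

# Values imputed for A: according to the text above, they should all be the mode.
print(table(completed[blanked, "A"]))
```

If every blanked cell receives the same level, any child of `A` that is imputed afterwards is conditioned on that single value, which is how the problem propagates to descendants.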

### Imputing with Monte Carlo posterior inference

With `method = "bayes-lw"`, the missing values in each observation are imputed from their joint posterior distribution
conditional on the variables that are observed. The posterior distribution is estimated empirically using likelihood
weighting, and the imputed values are either the mean or the mode of that distribution. Because the estimate relies on
random sampling, the imputed values may vary between different runs of `impute()`. As in `predict()`,
`method = "bayes-lw"` takes as an optional argument the number `n` of particles produced by likelihood weighting for
each observation in the data.

    > completed = impute(dfitted, data = incomplete, method = "bayes-lw")
    > completed = impute(dfitted, data = incomplete, method = "bayes-lw", n = 5000)
    > head(completed, n = 10)

       A B C D E F
    1  b c b a b b
    2  b a c a b b
    3  a a a a a a
    4  a a a a b b
    5  a a b c a a
    6  c c a c c a
    7  c c b c c a
    8  b b a b b b
    9  b b b a c a
    10 b a b a a a
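The run-to-run variability of likelihood weighting can be quantified by imputing the same data twice and counting the cells that differ. The sketch below is only illustrative: the seed is arbitrary, the names `run1` to `run4` are hypothetical, and the expectation that a larger `n` reduces the number of disagreements is an assumption to verify locally rather than a guarantee.

```r
library(bnlearn)
set.seed(123)  # arbitrary seed, fixes the pattern of missing cells

dag = model2network("[A][C][F][B|A][D|A:C][E|B:F]")
dfitted = bn.fit(dag, learning.test)

incomplete = learning.test
missing = matrix(c(sample(nrow(incomplete), 1000),
                   sample(ncol(incomplete), 1000, replace = TRUE)),
            ncol = 2)
incomplete[missing] = NA

# Two runs with the default number of particles may disagree in some cells.
run1 = impute(dfitted, data = incomplete, method = "bayes-lw")
run2 = impute(dfitted, data = incomplete, method = "bayes-lw")
print(sum(as.matrix(run1) != as.matrix(run2), na.rm = TRUE))

# Repeat with more particles per observation and count the disagreements again.
run3 = impute(dfitted, data = incomplete, method = "bayes-lw", n = 5000)
run4 = impute(dfitted, data = incomplete, method = "bayes-lw", n = 5000)
print(sum(as.matrix(run3) != as.matrix(run4), na.rm = TRUE))
```

Comparing through `as.matrix()` avoids relying on the factor levels of the two results being identical.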

### Imputing with exact inference

Similarly, with `method = "exact"` the missing values in each observation are imputed from their joint posterior
distribution conditional on the variables that are observed. However, the posterior distribution is reconstructed using
exact inference and therefore the imputed values have no simulation variability.

    > completed = impute(dfitted, data = incomplete, method = "exact")
    > head(completed, n = 10)

       A B C D E F
    1  b c b a b b
    2  b a c a b b
    3  a a a a a a
    4  a a a a b b
    5  a a b c a a
    6  c c a c c a
    7  c c b c c a
    8  b b a b b b
    9  b b b a c a
    10 b a b a a a
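Since the missing values in this example were introduced artificially, the imputed cells can be compared with the original ones. The following sketch computes the proportion of correctly recovered cells for each method. It assumes a **bnlearn** version that supports all three method labels, and `method = "exact"` may need additional packages depending on the installation, so errors are caught and reported as `NA`. The names `truth` and `accuracy` are hypothetical.

```r
library(bnlearn)
set.seed(123)  # arbitrary seed, fixes the pattern of missing cells

dag = model2network("[A][C][F][B|A][D|A:C][E|B:F]")
dfitted = bn.fit(dag, learning.test)

incomplete = learning.test
missing = matrix(c(sample(nrow(incomplete), 1000),
                   sample(ncol(incomplete), 1000, replace = TRUE)),
            ncol = 2)
incomplete[missing] = NA

# True values of the cells that were blanked out.
truth = as.matrix(learning.test)[missing]

# Proportion of blanked cells recovered by each method (NA if the method fails to run).
accuracy = sapply(c("parents", "bayes-lw", "exact"), function(method) {
  tryCatch({
    completed = impute(dfitted, data = incomplete, method = method)
    mean(as.matrix(completed)[missing] == truth, na.rm = TRUE)
  }, error = function(e) NA_real_)
})
print(accuracy)
```

The network here was fitted on the same complete data that serve as ground truth, so these proportions are likely optimistic compared with a real incomplete data set.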

Last updated on `Fri Nov 11 18:55:34 2022` with **bnlearn** `4.9-20221107` and `R version 4.2.2 (2022-10-31)`.