Log-likelihood of data with missing values

The behaviour of logLik() when data are complete is described here here. When data are incomplete, logLik() will return NA for all observations containing missing values if by.sample = TRUE.

> library(bnlearn)
> dag = model2network("[A][C][F][B|A][D|A:C][E|B:F]")
> bn = bn.fit(dag, learning.test)
> incomplete = rbn(bn, 20)
> incomplete[1, "A"] = NA
> incomplete[2, "B"] = NA
> incomplete[3, "C"] = NA
> incomplete[4, "D"] = NA
> incomplete[5, "E"] = NA
> incomplete[6, "F"] = NA
> logLik(bn, incomplete, by.sample = TRUE, debug = TRUE)
> computing the log-likelihood of a discrete network.
* processing node A.
* processing node B.
* processing node C.
* processing node D.
* processing node E.
* processing node F.
 [1]        NA        NA        NA        NA        NA        NA -5.764190
 [8] -5.215685 -6.484397 -4.790550 -6.272148 -3.329707 -5.228377 -7.568037
[15] -4.011499 -6.474869 -6.126178 -4.121276 -3.254190 -3.181380

If we are computing the log-likelihood only for a subset of nodes, logLik() will only return NA for those observations that are not locally complete (that is, either the nodes or some of their parents are missing).

> logLik(bn, incomplete, by.sample = TRUE, nodes = "B")
 [1]         NA         NA -0.8098829 -1.0962199 -0.8098829 -0.1553504
 [7] -0.1553504 -2.1635035 -0.2349458 -0.1553504 -2.3595312 -0.8098829
[13] -0.8098829 -0.1553504 -0.2349458 -3.6817110 -0.1553504 -0.1553504
[19] -0.2349458 -0.1553504

If by.sample = FALSE (the default) and we are computing the log-likelihood of the whole sample, logLik() will return NA if the data are incomplete at all.

> logLik(bn, incomplete, by.sample = FALSE, debug = TRUE)
> computing the log-likelihood of a discrete network.
* incomplete data for node A, the log-likelihood is NA.
[1] NA

We can prevent logLik() from returning NA when by.sample = FALSE by setting na.rm = TRUE. logLik() will then compute the log-likelihood as the node-average log-likelihood scaled by the sample size for each node.

> logLik(bn, incomplete, by.sample = FALSE, na.rm = TRUE, debug = TRUE)
> computing the log-likelihood of a discrete network.
* processing node A.
  > 19 locally-complete observations out of 20.
  > log-likelihood is -21.967650.
* processing node B.
  > 18 locally-complete observations out of 20.
  > log-likelihood is -15.925319.
* processing node C.
  > 19 locally-complete observations out of 20.
  > log-likelihood is -16.876785.
* processing node D.
  > 17 locally-complete observations out of 20.
  > log-likelihood is -10.077951.
* processing node E.
  > 17 locally-complete observations out of 20.
  > log-likelihood is -20.721341.
* processing node F.
  > 19 locally-complete observations out of 20.
  > log-likelihood is -13.851705.
[1] -99.42075

In practice, this means that:

> logLik(bn, incomplete, node = "B", na.rm = TRUE)
[1] -15.92532
> locally.complete = complete.cases(incomplete[, c("B", parents(bn, "B"))])
> ncomplete = sum(locally.complete)
> logLik(bn, incomplete[locally.complete, ], node = "B") / ncomplete * nrow(incomplete)
[1] -15.92532

Note that the log-likelihood can still be NA if any of the parameters of the Bayesian network involved in its computation is equal to NA: na.rm = TRUE only prevents the missing values in the data from propagating, not those in the parameters.

Last updated on Sat Feb 17 23:40:14 2024 with bnlearn 5.0-20240208 and R version 4.3.2 (2023-10-31).