bn.cv {bnlearn}

R Documentation

Cross-validation for Bayesian networks

Description

Perform a k-fold or hold-out cross-validation for a learning algorithm or a fixed network structure.

Usage

bn.cv(data, bn, loss = NULL, ..., algorithm.args = list(),
  loss.args = list(), fit, fit.args = list(), method = "k-fold",
  cluster, debug = FALSE)

## S3 method for class 'bn.kcv'
plot(x, ..., main, xlab, ylab, connect = FALSE)
## S3 method for class 'bn.kcv.list'
plot(x, ..., main, xlab, ylab, connect = FALSE)

loss(x)

Arguments

`data`	a data frame containing the variables in the model.
`bn`	either a character string (the label of the learning algorithm to be applied to the training data in each iteration) or an object of class `bn` (a fixed network structure).
`loss`	a character string, the label of a loss function. If none is specified, the default loss function is the Classification Error for Bayesian network classifiers; otherwise, the Log-Likelihood Loss for both discrete and continuous data sets. See below for additional details.
`algorithm.args`	a list of extra arguments to be passed to the learning algorithm.
`loss.args`	a list of extra arguments to be passed to the loss function specified by `loss`.
`fit`	a character string, the label of the method used to fit the parameters of the network. See `bn.fit` for details.
`fit.args`	additional arguments for the parameter estimation procedure, see again `bn.fit` for details.
`method`	a character string, either `k-fold`, `custom-folds` or `hold-out`. See below for details.
`cluster`	an optional cluster object from package parallel.
`debug`	a boolean value. If `TRUE` a lot of debugging output is printed; otherwise the function is completely silent.
`x`	an object of class `bn.kcv` or `bn.kcv.list` returned by `bn.cv()`.
`...`	additional objects of class `bn.kcv` or `bn.kcv.list` to plot alongside the first.
`main`, `xlab`, `ylab`	the title of the plot, an array of labels for the boxplot, the label for the y axis.
`connect`	a logical value. If `TRUE`, the median points in the boxplots will be connected by a segmented line.

Value

bn.cv() returns an object of class bn.kcv.list if runs is at least 2, an object of class bn.kcv if runs is equal to 1.

loss() returns a numeric vector with a length equal to runs.
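
The relationship between runs and the returned class can be checked with a short sketch like the one below. It assumes the bnlearn package is installed and uses its bundled learning.test data set; the variable names single and multiple are made up for this illustration.

```r
library(bnlearn)

# One run of cross-validation (runs is left at its default).
single <- bn.cv(learning.test, "hc")
class(single)

# Three runs: the result collects one cross-validation object per run.
multiple <- bn.cv(learning.test, "hc", runs = 3)
class(multiple)

# loss() extracts one loss estimate per run.
length(loss(single))
length(loss(multiple))
```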

Cross-Validation Strategies

The following cross-validation methods are implemented:

k-fold: the data are split into k subsets of equal size. For each subset in turn, bn is fitted (and possibly learned as well) on the other k - 1 subsets and the loss function is then computed using that subset. Loss estimates for each of the k subsets are then combined to give an overall loss for data.
custom-folds: the data are manually partitioned by the user into subsets, which are then used as in k-fold cross-validation. Subsets are not constrained to have the same size, and every observation must be assigned to one subset.
hold-out: k subsamples of size m are sampled independently without replacement from the data. For each subsample, bn is fitted (and possibly learned) on the remaining nrow(data) - m samples and the loss function is computed on the m observations in the subsample. The overall loss estimate is the average of the k loss estimates from the subsamples.

If cross-validation is used with multiple runs, the overall loss is the average of the loss estimates from the different runs.
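
The difference between the k-fold and hold-out splits can be sketched in base R as follows. This only mirrors the description above on toy row indices; it is not the code bn.cv() uses internally, and all object names (kfold.test, holdout.test and so on) are assumptions made for the illustration.

```r
set.seed(42)   # only to make the sketch reproducible
n <- 20        # number of observations (toy value)
k <- 4         # number of folds, or number of hold-out subsamples
m <- 5         # size of each hold-out test set

# k-fold: shuffle the row indices and deal them into k groups of equal size.
kfold.test <- split(sample(n), rep(seq_len(k), length.out = n))
# Each fold is the test set once; the remaining rows form the training set.
kfold.train <- lapply(kfold.test, function(test) setdiff(seq_len(n), test))

# hold-out: draw k independent test sets of size m without replacement.
holdout.test <- replicate(k, sample(n, m), simplify = FALSE)
holdout.train <- lapply(holdout.test, function(test) setdiff(seq_len(n), test))

lengths(kfold.train)    # n - n / k = 15 rows in each training set
lengths(holdout.train)  # n - m = 15 rows in each training set
```

Note that the k-fold test sets are disjoint and cover every row exactly once, while the hold-out test sets are drawn independently and may overlap.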

To clarify, cross-validation methods accept the following optional arguments:

k: a positive integer number, the number of groups into which the data will be split (in k-fold cross-validation) or the number of times the data will be split in training and test samples (in hold-out cross-validation).
m: a positive integer number, the size of the test set in hold-out cross-validation.
runs: a positive integer number, the number of times k-fold or hold-out cross-validation will be run.
folds: a list in which each element corresponds to one fold and contains the indices of the observations that are included in that fold; or a list with an element for each run, in which each element is itself a list of the folds to be used for that run.
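
Since custom folds must assign every observation to one subset, it may help to check a folds list before passing it on. The helper below, is.partition(), is hypothetical and not part of bnlearn; it is a minimal base R sketch of such a check for both the single-run and the multiple-run layout.

```r
# Hypothetical helper (not part of bnlearn): TRUE if `folds` assigns every
# row index in 1..n to exactly one fold.
is.partition <- function(folds, n) {
  idx <- unlist(folds, use.names = FALSE)
  length(idx) == n && setequal(idx, seq_len(n))
}

n <- 50

# Single run: one list element per fold; fold sizes may differ.
folds <- list(1:10, 11:35, 36:50)
is.partition(folds, n)

# Multiple runs: one list of folds per run.
set.seed(1)
runs <- replicate(3, split(sample(n), rep(1:5, length.out = n)),
                  simplify = FALSE)
all(sapply(runs, is.partition, n = n))

# An incomplete assignment (rows 41 to 50 are missing) is rejected.
is.partition(list(1:20, 21:40), n)
```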

Loss Functions

The following loss functions are implemented:

Log-Likelihood Loss (logl): also known as negative entropy or negentropy, it is the negated expected log-likelihood of the test set for the Bayesian network fitted from the training set. Lower values are better.
Classification Error (pred): the prediction error for a single discrete node. Lower values are better.
Exact Classification Error (pred-exact): closed-form exact posterior predictions are available for Bayesian network classifiers. Lower values are better.
Predictive Correlation (cor): the correlation between the observed and the predicted values for a single continuous node. Higher values are better.
Mean Squared Error (mse): the mean squared error between the observed and the predicted values for a single continuous node. Lower values are better.
F1 score (f1): the F1 score between observed and predicted values for both binary and multiclass target variables. Higher values are better.
AUROC (auroc): the area under the ROC curve for both binary and multiclass target variables. The multiclass AUROC score is computed as one-vs-rest by averaging the AUROC for each level of the target variable. Higher values are better.
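
For the single-node losses, the quantities involved can be illustrated on toy vectors in base R. This is a sketch of the textbook definitions on made-up data, not a reproduction of how bn.cv() computes them internally (for instance, it ignores missing values and the averaging over folds).

```r
# Toy observed and predicted values (made up for this sketch).
obs.class  <- factor(c("a", "b", "b", "a", "c"), levels = c("a", "b", "c"))
pred.class <- factor(c("a", "b", "a", "a", "c"), levels = c("a", "b", "c"))
obs.num  <- c(1.0, 2.0, 3.0, 4.0)
pred.num <- c(1.5, 2.0, 2.5, 4.0)

# Classification error: proportion of mismatched labels (1 out of 5).
err <- mean(obs.class != pred.class)
err   # 0.2

# Mean squared error: (0.25 + 0 + 0.25 + 0) / 4.
mse <- mean((obs.num - pred.num)^2)
mse   # 0.125

# Predictive correlation between observed and predicted values.
rho <- cor(obs.num, pred.num)
rho
```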

Optional arguments that can be specified in loss.args are:

predict: a character string, the label of the method used to predict the observations in the test set. The default is "parents". Other possible values are the same as in predict().
predict.args: a list containing the optional arguments for the prediction method. See the documentation for predict() for more details.
target: a character string, the label of the target node for prediction in all loss functions but logl, logl-g and logl-cg.
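
As a rough usage sketch, the target node is passed through loss.args for the prediction-based losses. The calls below assume the bnlearn package is installed and use its bundled gaussian.test data set, in which F is a continuous node; the object names cv.mse and cv.cor are made up for this example.

```r
library(bnlearn)

# Mean squared error for node F, predicted with the default method.
cv.mse <- bn.cv(gaussian.test, "hc", loss = "mse",
                loss.args = list(target = "F"))

# Predictive correlation for the same target node.
cv.cor <- bn.cv(gaussian.test, "hc", loss = "cor",
                loss.args = list(target = "F"))

loss(cv.mse)   # lower is better
loss(cv.cor)   # higher is better
```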

Plotting Results from Cross-Validation

Both plot methods accept any combination of objects of class bn.kcv or bn.kcv.list (the first as the x argument, the remaining as the ... argument) and plot the respective expected loss values side by side. For a bn.kcv object, this means a single point; for a bn.kcv.list object this means a boxplot.

Author(s)

Marco Scutari

References

Koller D, Friedman N (2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press.

Examples

bn.cv(learning.test, 'hc', loss = "pred",
  loss.args = list(predict = "bayes-lw", target = "F"))

folds = list(1:2000, 2001:3000, 3001:5000)
bn.cv(learning.test, 'hc', loss = "logl", method = "custom-folds",
  folds = folds)

xval = bn.cv(gaussian.test, 'mmhc', method = "hold-out",
         k = 5, m = 50, runs = 2)
xval
loss(xval)

## Not run: 
# comparing algorithms with multiple runs of cross-validation.
gaussian.subset = gaussian.test[1:50, ]
cv.gs = bn.cv(gaussian.subset, 'gs', runs = 10)
cv.iamb = bn.cv(gaussian.subset, 'iamb', runs = 10)
cv.inter = bn.cv(gaussian.subset, 'inter.iamb', runs = 10)
plot(cv.gs, cv.iamb, cv.inter,
  xlab = c("Grow-Shrink", "IAMB", "Inter-IAMB"), connect = TRUE)

# use custom folds.
folds = split(sample(nrow(gaussian.subset)), seq(5))
bn.cv(gaussian.subset, "hc", method = "custom-folds", folds = folds)

# multiple runs, with custom folds.
folds = replicate(5, split(sample(nrow(gaussian.subset)), seq(5)),
          simplify = FALSE)
bn.cv(gaussian.subset, "hc", method = "custom-folds", folds = folds)

## End(Not run)

[Package bnlearn version 5.1-20250224 Index]