data preprocessing {bnlearn} | R Documentation |
Pre-process data to better learn Bayesian networks
Description
Screen and transform the data to make them more suitable for structure and parameter learning.
Usage
# discretize continuous data into factors.
discretize(data, method, breaks = 3, ordered = FALSE, ..., debug = FALSE)
# screen continuous data for highly correlated pairs of variables.
dedup(data, threshold, debug = FALSE)
Arguments
data |
a data frame containing numeric columns (for |
threshold |
a numeric value between zero and one, the absolute correlation used as a threshold in screening highly correlated pairs. |
method |
a character string, either |
breaks |
an integer number, the number of levels the variables will be discretized into; or a vector of integer numbers, one for each column of the data set, specifying the number of levels for each variable. |
ordered |
a boolean value. If |
... |
additional tuning parameters, see below. |
debug |
a boolean value. If |
Details
discretize() takes a data frame as its first argument and returns a second data frame of discrete variables, transformed using one of three methods: interval, quantile or hartemink. Discrete variables are left unchanged.
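As a minimal sketch of the two marginal methods, the calls below discretize the gaussian.test data set shipped with bnlearn into three levels per variable, once with equal-width intervals and once with quantiles. The object names d.interval and d.quantile are illustrative only.

```r
library(bnlearn)
data(gaussian.test)

# Equal-width bins: the range of each variable is split into 3 intervals.
d.interval = discretize(gaussian.test, method = "interval", breaks = 3)
# Equal-frequency bins: the cut points are empirical quantiles.
d.quantile = discretize(gaussian.test, method = "quantile", breaks = 3)

# Compare how the observations of variable A are spread across the levels.
table(d.interval$A)
table(d.quantile$A)
```

Interval bins tend to be unevenly populated for bell-shaped variables, while quantile bins hold roughly the same number of observations each.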
The hartemink method has two additional tuning parameters:
- idisc: the method used for the initial marginal discretization of the variables, either interval or quantile.
- ibreaks: the number of levels the variables are initially discretized into, in the same format as in the breaks argument.
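The two tuning parameters can be set explicitly, as in the sketch below. The values (20 initial levels reduced to 3) are arbitrary choices for illustration, not recommendations.

```r
library(bnlearn)
data(gaussian.test)

# Start from 20 quantile-based levels per variable, then let Hartemink's
# method collapse them until 3 levels per variable remain.
d = discretize(gaussian.test, method = "hartemink", breaks = 3,
      ibreaks = 20, idisc = "quantile")

# Number of levels of each discretized variable.
sapply(d, nlevels)
```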
The quantile method sometimes cannot discretize one or more variables in the data without generating zero-length intervals because the quantiles are not unique. If method = "quantile", discretize() will produce an error. If method = "hartemink" and idisc = "quantile", discretize() will try to lower the number of breaks set by the ibreaks argument until quantiles are distinct. If this is not possible without making ibreaks smaller than breaks, discretize() will produce an error.
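The non-unique quantile problem can be reproduced in base R without bnlearn. The sketch below is only an illustration of the underlying issue, not the code discretize() runs internally; the variable x is made-up data in which most observations share the same value.

```r
# A zero-inflated variable: 80 of the 100 observations are exactly 0.
x = c(rep(0, 80), 1:20)

# Cut points for 3 equal-frequency levels: the minimum, two terciles, the maximum.
cuts = quantile(x, probs = seq(0, 1, length.out = 4))
print(cuts)

# Duplicated cut points would define zero-length intervals.
has.ties = anyDuplicated(cuts) > 0

# Base R's cut() rejects non-unique breaks; capture the error message.
msg = tryCatch({
  cut(x, breaks = cuts, include.lowest = TRUE)
  "no error"
}, error = function(e) conditionMessage(e))
print(msg)
```

Variables like x typically need fewer levels, a different method, or manual cut points.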
dedup() screens the data for pairs of highly correlated variables, and discards one in each pair.
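The idea behind this kind of screening can be sketched in base R as follows. This is a simplified illustration and not the implementation of dedup(): the toy data, the 0.95 cut-off and the rule of dropping the later variable of each pair are all assumptions made for the example.

```r
set.seed(42)
n = 200
x = rnorm(n)
toy = data.frame(
  x = x,
  y = x + rnorm(n, sd = 0.05),  # nearly a copy of x
  z = rnorm(n)                  # unrelated to x and y
)

threshold = 0.95  # hypothetical cut-off chosen for this toy data

# Absolute pairwise correlations; pairwise deletion tolerates missing values.
r = abs(cor(toy, use = "pairwise.complete.obs"))

# Consider each pair once by keeping only the upper triangle.
r[lower.tri(r, diag = TRUE)] = 0

# Drop the later variable of every pair above the threshold.
to.drop = colnames(r)[apply(r > threshold, 2, any)]
screened = toy[, setdiff(names(toy), to.drop), drop = FALSE]
names(screened)
```

Following the signature in the Usage section, the analogous bnlearn call on the same data would be dedup(toy, threshold = 0.95).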
Both discretize() and dedup() accept data with missing values.
Value
discretize() returns a data frame with the same structure (number of columns, column names, etc.) as data, containing the discretized variables.
dedup() returns a data frame with a subset of the columns of data.
Author(s)
Marco Scutari
References
Hartemink A (2001). Principled Computational Methods for the Validation and Discovery of Genetic Regulatory Networks. Ph.D. thesis, School of Electrical Engineering and Computer Science, Massachusetts Institute of Technology.
Examples
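data(gaussian.test)
# discretize with Hartemink's method: 10 initial levels reduced to 4.
d = discretize(gaussian.test, method = 'hartemink', breaks = 4, ibreaks = 10)
# learn a network structure from the discretized data and plot it.
plot(hc(d))
# discard one variable from each pair of highly correlated variables.
d2 = dedup(gaussian.test)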