Plotting arc strengths with Rgraphviz and lattice

Measuring the strength of the relationship represented by each arc in a Bayesian network is a fundamental tool to investigate and validate its structure. Arc strengths are mostly used programmatically for both purposes; nevertheless it is sometimes useful to explore them visually.

We will consider all the different measures of arc strengths implemented in bnlearn:

  1. p-values from the conditional independence tests that would remove individual arcs present in a network;
    > library(bnlearn)
    > dag = model2network("[A][C][F|C:A][B|A][D|A:C][E|B:F]")
    > pvalues = arc.strength(dag, data = learning.test, criterion = "x2")
    > head(pvalues)
    
      from to      strength
    1    C  F  6.564063e-01
    2    A  F  7.470134e-01
    3    A  B  0.000000e+00
    4    A  D  0.000000e+00
    5    C  D  0.000000e+00
    6    B  E 6.226919e-289
    
  2. differences in network score from removing individual arcs present in a network;
    > score.deltas = arc.strength(dag, data = learning.test, criterion = "bic")
    > head(score.deltas)
    
      from to    strength
    1    C  F    23.47069
    2    A  F    23.81276
    3    A  B -1153.87956
    4    A  D -1938.94993
    5    C  D  -823.76049
    6    B  E  -720.82662
    
  3. probabilities of inclusion of all possible arcs, and their directions;
    > bagging = boot.strength(learning.test, algorithm = "tabu", R = 200)
    > head(bagging)
    
      from to strength direction
    1    A  B        1       0.5
    2    A  C        0       0.0
    3    A  D        1       1.0
    4    A  E        0       0.0
    5    A  F        0       0.0
    6    B  A        1       0.5
    
  4. probabilities of inclusion of individual arcs present in a network, and their directions.
    > bayes.factors = bf.strength(dag, data = learning.test)
    
    Loading required namespace: Rmpfr
    
    > head(bayes.factors)
    
      from to     strength    direction
    1    A  B 1.000000e+00 5.000000e-01
    2    A  C 3.251115e-08 5.000000e-01
    3    A  D 1.000000e+00 1.000000e+00
    4    A  E 2.667283e-54 1.000000e+00
    5    A  F 1.016754e-04 2.065548e-09
    6    B  A 1.000000e+00 5.000000e-01
    

As far as plotting is concerned, the bn.strength objects produced above differ in two key ways:

  • pvalues and score.deltas contain an entry for each of the arcs in dag, and provide the confidence level in each arc's presence (in the strength column);
  • bagging and bayes.factors contain an entry for every possible arc (that is, for every pair of nodes), and provide the confidence level in each arc's presence (in the strength column) and direction (in the direction column).

The strength values in pvalues, bagging and bayes.factors are bound in the [0, 1] interval; and the same is true for the direction values in bagging and bayes.factors. The strength values in score.deltas are real numbers.

Plotting the distribution of arc strengths

The bagging and bayes.factors encode the complete distribution of the arc strengths, since they cover all possible arcs. This makes it possible to plot the empirical cumulative distribution function (ECDF) of the confidence in the arcs' presence (modulo their directions) using the values stored in the strength column. This is what the default plot() method does for the bn.strength objects produced by boot.strength() and bf.strength().

> plot(bagging)
plot of chunk unnamed-chunk-6

The plot is defined in [0, 1]2 because both the arc strengths and their cumulative probabilities are defined in [0, 1]. At the left-hand side, the vertical gap between 0.0 and the first point gives the proportion of the possible arcs that have strength equal to zero: they are completely unsupported by the data. At the right hand side, the gap between the horizontal line and 1.0 gives the proportion of arcs that have strength equal to one, the maximum possible. The vertical dashed line is the default threshold for significance computed by inclusion.threshold() (we can choose not to draw it by setting draw.threshold = FALSE). Points to the right of the threshold represent arcs that would be included in the Bayesian network produced by averaged.network().

If structure learning is stable, that is, it reliably learns the same set of arcs, then the ECDF looks like the above: arcs cluster around strength zero and around strength one, with a large gap between the two groups. If, on the other hand, structure learning is noisy then the points in the ECDF are more uniformly distributed along the x-axis to represent arcs with varying strength values. Look, for instance, at how the ECDF changes if we reduce the sample size from 5000 to 50.

> weak = boot.strength(learning.test[1:50, ], algorithm = "tabu", R = 200)
> plot(weak)
plot of chunk unnamed-chunk-7

The pvalues and score.deltas objects typically contain information on just a few arcs, so it is not meaningful to plot their ECDF.

Plotting the network structure together with the arc strengths

Another useful type of plot is overlaying arc strengths on a network to get a visual impression of the layout of the arcs and their presence at the same time. strength.plot() implements this by increasing each arc's thickness in proportion to its strength, and by drawing weak arcs with dashed lines. Modes and arcs are laid out using the Rgraphviz package as in graphviz.plot().

When arc strengths are estimated using p-values, by default they are grouped in the following intervals to determine the thickness of the arcs:

> c(0, 0.05 / c(10, 5, 2, 1.5, 1), 1)
[1] 0.00000000 0.00500000 0.01000000 0.02500000 0.03333333 0.05000000 1.00000000

The value of the significance threshold is the 0.05 above by default, but it can be changed using the threshold argument of strength.plot(). Arcs with strengths in the right-most interval are associated with p-values larger than the threshold and are plotted with dashed lines. Other arcs are drawn with increasingly thick lines as we move intervals from right to left, that is, the closer to zero the interval the pvalues fall in is.

Arc strengths estimated using scores are discretized in a similar way by default. Since the scale of the score differences depends on the specific network and on the data they were learned from, intervals are defined using the c(0.50, 0.75, 0.90, 0.95, 1) empirical quantiles of positive score differences. (They correspond to arcs that are well supported by the data: if we remove one of those arcs, the network score increases.) The significance threshold is equal to zero by default, but it can be changed with the threshold argument.

When arc strength is estimated using boot.strength() or bf.strength(), it takes values in [0, 1] again but this time higher values are better. (With p-values, lower values are better.) Hence intervals by default are defined as:

> c(0, (1 - threshold) / c(10, 5, 2, 1.5, 1), 1)

The resulting plots for pvalues, score.deltas, bagging and bayes.factors are below.

> par(mfrow = c(2, 2))
> strength.plot(dag, strength = pvalues, main = "pvalues")
Loading required namespace: Rgraphviz
> strength.plot(dag, strength = score.deltas, main = "score.deltas")
> strength.plot(dag, strength = bagging, main = "bagging")
> strength.plot(dag, strength = bayes.factors, main = "bayes.factors")
plot of chunk unnamed-chunk-10

The plots produced by strength.plot() can be customized much in the same way as those produced by graphviz.plot(), which we illustrated here. An example:

> strength.plot(dag, strength = bagging, main = "bagging", shape = "rectangle",
+   highlight = list(nodes = "F", arcs = incident.arcs(dag, "F"),
+                 col = "darkblue", fill = "lightblue"))
plot of chunk unnamed-chunk-11

Obviously, setting either lwd or lty in the highlight argument will reset the formatting introduced by strength.plot() and should be avoided. In addition, we can manually specify both the threshold of significance and the cutpoints used to bin the arc strengths into intervals.

> strength.plot(dag, strength = pvalues,
+   threshold = 0.01, cutpoints = c(0, 1e-300, 1e-250, 1e-200, 0.01, 1))
plot of chunk unnamed-chunk-12
Last updated on Sat Feb 17 23:23:35 2024 with bnlearn 5.0-20240208 and R version 4.3.2 (2023-10-31).