Commit 99dd3d3e by Gilles Kratzer

### update: website with SHD

parent 96757e50
Pipeline #2547 passed with stage in 3 seconds

Below is the summary of the MCMC run:

for (i in 1:nb.dag) {
  dag <- u.list.dag[, , order(tab, decreasing = TRUE)[i]]
  colnames(dag) <- rownames(dag) <- names(dist.asia)
  fabn <- fitabn(dag.m = dag, data.df = asia,
                 data.dists = dist.asia, method = "bayes")
  scores.dags[i] <- fabn$mlik
  num.arcs[i] <- sum(dag)
  shd[i] <- compareDag(ref = u.list.dag[, , order(tab, decreasing = TRUE)[1]],
                       u.list.dag[, , order(tab, decreasing = TRUE)[i]])$`Hamming-distance`
}
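The distance extracted above can also be approximated directly from the adjacency matrices. A minimal sketch (naive.shd is a hypothetical helper, not part of abn; note that this naive count treats a reversed arc as two differing entries, whereas a strict structural Hamming distance may count it as one):

```r
# Naive proxy for the structural Hamming distance between two DAGs,
# assuming square 0/1 adjacency matrices of identical dimension and
# identical row/column ordering.
naive.shd <- function(dag1, dag2) sum(abs(dag1 - dag2))
```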

Let us plot the MCMC run

plot(out)

Let us plot the diversity plot, where the x1-axis is the number of arcs of each structure, the x2-axis is the structural Hamming distance (a proxy for how different two DAGs are), the y1-axis is the occurrence of the structures and the y2-axis is the scores. As one can see, half of the posterior distribution is represented by one DAG.

plot(1:nb.dag, sort(tab, decreasing = TRUE)[1:nb.dag], type = 'n',
     ylab = "", xlab = "Number of arcs", xaxt = "n", yaxt = "n",
     ylim = c(0, 2000))
axis(2, at = c(0, 1000, 2000), labels = c("0.0%", "50%", "100%"),
     col.axis = "#4393C3")
mtext("Occurrence of DAGs", side = 2, line = 2, col = "#4393C3")
plot(x = 1:nb.dag,y = scores.dags,col="red", type = 'b', lwd=2, axes = FALSE, xlab = "",ylab="")
axis(4, col.axis = 'red')
mtext("DAGs scores", side=4, line=1.5, col="red")
axis(1, col.axis = 'black', at = 1:nb.dag, labels = num.arcs)
axis(3, col.axis = 'orange', at = 1:nb.dag, labels = shd)
mtext("Structural Hamming distances", side = 3, line = 2, col = "orange")


One can also print the summary of the MCMC run:


mcmcabn: An R Package for sampling DAGs using structural MCMC

(PUBLIC) !!! UNSTABLE VERSION !!!

mcmcabn is a one-man show (me!) made of more than 10'000 lines of code, which are not bug-free! So use it with caution and awareness.

Quick start

To install mcmcabn you need two R packages, abn and gRbase, which require libraries stored not on CRAN but on Bioconductor. Hence you must install these packages before installing mcmcabn:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("RBGL", "Rgraphviz", "graph"), version = "3.8")

install.packages("mcmcabn", dependencies = TRUE)

The three main problems addressed by this R package are:

• selecting the most probable structure based on a cache of pre-computed scores.
• controlling for overfitting.
• sampling the landscape of high scoring structures.

The latter can be very useful from an applied perspective, to avoid reducing the richness of Bayesian network modelling to reporting only one structure. Indeed, it allows users to quantify the marginal impact of relationships of interest by marginalising out over structures or nuisance dependencies. Structural MCMC is a very elegant and natural way to estimate the true marginal impact, so one can determine whether its magnitude is big enough to be considered a worthwhile intervention.
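As a hypothetical illustration of marginalising over structures, one can average an arc indicator over the sampled DAGs. The array layout mirrors the u.list.dag usage elsewhere on this site; arc.support is an illustrative helper, not part of the package, and the (child, parent) orientation of the adjacency matrices is an assumption:

```r
# Posterior support of a single arc, averaged over sampled structures.
# dags:   n x n x m array of 0/1 adjacency matrices (one per sampled DAG)
# counts: how often each DAG occurred in the MCMC sample
# The row/column orientation (child, parent) is an assumption.
arc.support <- function(dags, counts, child, parent) {
  sum(dags[child, parent, ] * counts) / sum(counts)
}
```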

Installation

install.packages("https://git.math.uzh.ch/gkratz/mcmcabn/raw/master/mcmcabn_0.3.tar.gz", repo=NULL, type="source")

CRAN: https://CRAN.R-project.org/package=mcmcabn
Website: https://www.math.uzh.ch/pages/mcmcabn/

Description

mcmcabn is a flexible implementation of a structural MCMC sampler for Directed Acyclic Graphs (DAGs). It supports the new edge reversal move from Grzegorczyk and Husmeier (2008) https://doi.org/10.1007/s10994-008-5057-7 and the Markov blanket resampling from Su and Borsuk (2016) http://jmlr.org/papers/v17/su16a.html. It supports three priors: a prior controlling for structure complexity from Koivisto and Sood (2004) http://dl.acm.org/citation.cfm?id=1005332.1005352, an uninformative prior and a user-defined prior. The three main problems that can be addressed by this R package are selecting the most probable structure based on a cache of pre-computed scores, controlling for overfitting, and sampling the landscape of high-scoring structures. It allows users to quantify the marginal impact of relationships of interest by marginalising out over structures or nuisance dependencies. Structural MCMC is a very elegant and natural way to estimate the true marginal impact, so one can determine whether its magnitude is big enough to be considered a worthwhile intervention.

What’s New

mcmcabn is developed and maintained by Gilles Kratzer and Prof. Dr. Reinhard Furrer of the Applied Statistics Group at the University of Zurich.

Future implementations (ordered by urgency)

Dev status

mcmcabn 0.3:
• update mcmcabn to make it compatible with constraints imported from the cache of scores. Heating parameter to increase or decrease acceptance probability.
• new article

Examples

## This data set was generated using the following code:
library("bnlearn") # for the dataset
#>
#> Attaching package: ‘bnlearn’
#> The following object is masked from ‘package:stats’:
#>
#>     sigma
library("abn") # for the cache of score function
#>
#> Attaching package: ‘abn’
#> The following object is masked from ‘package:bnlearn’:
#>
#>     mb

# Renaming columns of the dataset
colnames(asia) <- c("Asia", "Smoking", "Tuberculosis", "LungCancer",
                    "Bronchitis", "Either", "XRay", "Dyspnea")

# Defining the distribution list
dist.asia <- list(Asia = "binomial", Smoking = "binomial",
                  Tuberculosis = "binomial", LungCancer = "binomial",
                  Bronchitis = "binomial", Either = "binomial",
                  XRay = "binomial", Dyspnea = "binomial")

bsc.compute.asia <- buildscorecache(data.df = asia,
                                    data.dists = dist.asia,
                                    max.parents = 2)

Function to plot mcmcabn class objects

print(<summary.mcmcabn>)

Methods for printing the summary of mcmcabn objects

print(<mcmcabn>)

# Setup as above (libraries, renamed columns, dist.asia, bsc.compute.asia), then:
mcmc.out.asia <- mcmcabn(score.cache = bsc.compute.asia,
                         score = "mlik",
                         data.dists = dist.asia,
                         ...

The procedure runs a structural Monte Carlo Markov Chain Model Choice, (MC)^3, to find the most probable posterior network (DAG). The default algorithm is based on three MCMC moves: edge addition, edge deletion and edge reversal. This algorithm, known as (MC)^3, is known to mix slowly and to get stuck in low-probability regions. Indeed, changing Markov equivalence region often requires multiple MCMC moves. Therefore, large-scale MCMC moves are implemented, whose relative frequency can be set by the user: the new edge reversal move (REV) from Grzegorczyk and Husmeier (2008) and the Markov blanket resampling (MBR) from Su and Borsuk (2016). The classical reversal move depends on the global configuration of the parents and children and fails to propose MCMC jumps that produce valid but very different DAGs in a unique move. The REV move samples a new set of parents globally. The MBR move applies the same idea to the entire Markov blanket of a randomly chosen node.

The classical (MC)^3 is unbiased but inefficient in mixing; the two radical alternative MCMC moves are known to massively accelerate mixing without introducing bias. However, those moves are computationally expensive, so low frequencies are advised. The REV move is not necessarily ergodic, so it should not be used alone.
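A sketch of a call with low frequencies for the expensive global moves, per the advice above. Argument names are taken from the examples elsewhere on this site; the reading of mcmc.scheme as (returned steps, thinning, burn-in) is an assumption:

```r
# 3% REV, 3% MBR, the remaining 94% classical MC3 moves
mcmc.out <- mcmcabn(score.cache = bsc.compute.asia,
                    score = "mlik",
                    data.dists = dist.asia,
                    max.parents = 2,
                    mcmc.scheme = c(1000, 0, 0),
                    prob.rev = 0.03,
                    prob.mbr = 0.03,
                    prior.choice = 2)
```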

The parameter start.dag can be "random", "hc" or user-defined. If the user selects "random", a random valid DAG is selected; the routine used favours low-density structures. If "hc" (for hill-climber: searchHeuristic), a DAG is selected using 100 different searches with 500 optimization steps. A user-defined DAG can be provided; it should be a named square matrix containing only zeros and ones, and the DAG should be valid (i.e. acyclic).

The parameter prior.choice determines the prior used within each individual node for a given choice of parent combination. In Koivisto and Sood (2004), p. 554, a form of prior is used which assumes that the prior probabilities for parent combinations comprising the same number of parents are all equal. Specifically, the prior probability for a parent set G with cardinality |G| is proportional to 1/[n-1 choose |G|], where there are n total nodes. Note that this favours parent combinations with either very low or very high cardinality, which may not be appropriate. This prior is used when prior.choice = 2. When prior.choice = 1, an uninformative prior is used where parent combinations of all cardinalities are equally likely. When prior.choice = 3, a user-defined prior is used, defined by prior.dag. It is given by an adjacency matrix (square and of the same size as the number of nodes) where entries ranging from zero to one give the user's prior belief. A hyperparameter defining the global user belief in the prior is given by prior.lambda.

MCMC samplers come with asymptotic statistical guarantees. Therefore it is highly advised to run multiple, long enough chains. The burn-in phase length (i.e. the number of discarded initial MCMC iterations) should be adequately chosen.

The argument data.dists must be a list with named arguments, one for each of the variables in data.df, where each entry is either "poisson", "binomial", or "gaussian".
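The Koivisto prior described above can be written as a one-line helper. A minimal sketch (koivisto.weight is a hypothetical name, not part of the package; weights are unnormalised):

```r
# Unnormalised Koivisto prior weight for a parent set of cardinality g
# in a network of n nodes: proportional to 1/choose(n - 1, g).
koivisto.weight <- function(n, g) 1 / choose(n - 1, g)

koivisto.weight(8, 0)  # 1    -> empty parent sets favoured
koivisto.weight(8, 3)  # 1/35 -> mid-cardinality sets penalised
koivisto.weight(8, 7)  # 1    -> full parent sets equally favoured
```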

The parameter heating can improve convergence. It should be a positive real number. If smaller than one, it is a tuning parameter which transforms the score by raising it to this power; one is neutral, and the smaller the value, the more probable it is that any move is accepted. If larger than one, it indicates a number of returned steps over which an exponentially decreasing heating scheme is applied; after this number of steps, the heating parameter is set to one.
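An illustrative sketch of why heating < 1 flattens the score landscape. This is an assumption about the internal Metropolis-Hastings rule, shown for intuition only (accept.prob is a hypothetical helper, not part of the package):

```r
# Acceptance probability of a move between two log-scores, with the
# Metropolis-Hastings ratio raised to the power h (the assumed effect
# of heating < 1); h = 1 is the neutral, untransformed rule.
accept.prob <- function(log.score.new, log.score.old, h = 1) {
  min(1, exp(h * (log.score.new - log.score.old)))
}

accept.prob(-11200, -11150)            # ~0: the downhill move is rejected
accept.prob(-11200, -11150, h = 0.01)  # ~0.61: the same move is often accepted
```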

Value


References

For the implementation of the function:

Kratzer, G., Furrer, R. "Is a single unique Bayesian network enough to accurately represent your data?". arXiv preprint arXiv:1902.06641.

For the new edge reversal:

Grzegorczyk, M., Husmeier, D. (2008). "Improving the structure MCMC sampler for Bayesian networks by introducing a new edge reversal move", Machine Learning, vol. 71(2-3), 265.

For the Markov Blanket resampling move:

Su, C., Borsuk, M. E. (2016). "Improving structure MCMC for Bayesian networks through Markov blanket resampling", The Journal of Machine Learning Research, vol. 17(1), 4042-4061.

For the Koivisto prior:

Koivisto, M. V. (2004). Exact Structure Discovery in Bayesian Networks, Journal of Machine Learning Research, vol 5, 549-573.

For the user defined prior:

Werhli, A. V., Husmeier, D. (2007). "Reconstructing gene regulatory networks with Bayesian networks by combining expression data with multiple sources of prior knowledge". Statistical Applications in Genetics and Molecular Biology, 6 (Article 15).

Imoto, S., Higuchi, T., Goto, T., Tashiro, K., Kuhara, S., Miyano, S. (2003). Using Bayesian networks for estimating gene networks from microarrays and biological knowledge. In Proceedings of the European Conference on Computational Biology.

For the asia dataset:

Scutari, M. (2010). Learning Bayesian Networks with the bnlearn R Package. Journal of Statistical Software, 35(3), 1-22. doi:http://dx.doi.org/10.18637/jss.v035.i03.

Examples

## Example from the asia dataset from Lauritzen and Spiegelhalter (1988)
## provided by Scutari (2010)

# The number of MCMC runs is deliberately chosen too small (computing time),
# no thinning (usually not recommended),
# no burn-in (usually not recommended,
# even if not supported by any theoretical arguments)
data("mcmc_run_asia")

# Let us run: 0.03 REV, 0.03 MBR, 0.94 MC3 MCMC jumps
# with a random DAG as starting point
mcmc.out.asia.small <- mcmcabn(score.cache = bsc.compute.asia,
                               ...
                               prior.choice = 2)
summary(mcmc.out.asia.small)
#> MCMC summary:
#> Number of burn-in steps: 0
#> Number of MCMC steps: 100
#> Thinning: 0
#> ...
#>       0         1         2         3         4         5         6         7
#> acf   1 0.9115666 0.8232014 0.7162509 0.6092764 0.5204256 0.4313169 0.3309841
#>             8         9        10
#> acf 0.2299462 0.1679331  0.105142

# Defining a starting DAG
startDag <- matrix(data = c(0, 0, 0, 1, 0, 0, 0, 0,
                            0, 0, 1, 0, 0, 0, 0, 0,
                            1, 0, 0, 0, 0, 0, 0, 0,
                            ...

mcmc.out.asia.small <- mcmcabn(score.cache = bsc.compute.asia,
                               ...
                               prior.choice = 1)
summary(mcmc.out.asia.small)
#> MCMC summary:
#> Number of burn-in steps: 0
#> Number of MCMC steps: 100
#> Thinning: 0
#> ...

mcmc.out.asia.small <- mcmcabn(score.cache = bsc.compute.asia,
                               ...
                               prior.dag = priorDag)
summary(mcmc.out.asia.small)
#> MCMC summary:
#> Number of burn-in steps: 0
#> Number of MCMC steps: 100
#> Thinning: 0
#> ...

mcmc.out.asia.small <- mcmcabn(score.cache = bsc.compute.asia,
                               score = "mlik",
                               data.dists = dist.asia,
                               max.parents = 2,
                               mcmc.scheme = c(150, 0, 0),
                               seed = 41242,
                               verbose = FALSE,
                               start.dag = "random",
                               prob.rev = 0.03,
                               prob.mbr = 0.03,
                               prior.choice = 2,
                               heating = 20)
summary(mcmc.out.asia.small)
#> MCMC summary:
#> Number of burn-in steps: 0
#> Number of MCMC steps: 150
#> Thinning: 0
#>
#> Maximum score: -11158.05
#> Empirical mean: -11743.41
#> Empirical standard deviation: 1145.562
#> Quantiles of the posterior network score:
#>              0.025     0.25       0.5      0.75     0.975
#> BN score -15122.56 -11795.9 -11278.11 -11165.74 -11158.05
#>
#> Global acceptance rate: 0.2317881
#>     Accepted Rejected
#> MBR        2        2
#> MC3       32      111
#> REV        1        3
#>
#> Sample size adjusted for autocorrelation: 2.998704
#>
#> Autocorrelations by lag:
#>     0         1         2         3       4        5         6         7
#> acf 1 0.9607992 0.9215685 0.8768169 0.83133 0.779314 0.7273621 0.6703544
#>            8         9        10
#> acf 0.612575 0.5547808 0.4964446
H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.

Alboukadel Kassambara (2018). ggpubr: 'ggplot2' Based Publication Ready Plots. R package version 0.2. https://CRAN.R-project.org/package=ggpubr

Claus O. Wilke (2019). cowplot: Streamlined Plot Theme and Plot Annotations for 'ggplot2'. R package version 0.9.4. https://CRAN.R-project.org/package=cowplot

Data: Scutari, M. (2010). Learning Bayesian Networks with the bnlearn R Package. Journal of Statistical Software, 35(3), 1-22. doi:http://dx.doi.org/10.18637/jss.v035.i03.

Examples

## Example from the asia dataset from Lauritzen and Spiegelhalter (1988)
## provided by Scutari (2010)
data("mcmc_run_asia")

# plot the mcmc run
plot(mcmc.out.asia)

# plot cumulative max score
plot(mcmc.out.asia, max.score = TRUE)