@@ -59,14 +59,31 @@ Binary variables must be declared as factors with two levels, and the argument d

\author{Gilles Kratzer}

\references{Kratzer, G. Furrer, R. "Is a single unique Bayesian network enough to accurately represent your data?". arXiv preprint arXiv:1902.06641.

\references{

For the implementation of the function:

Kratzer, G. Furrer, R. "Is a single unique Bayesian network enough to accurately represent your data?". arXiv preprint arXiv:1902.06641.

For the new edge reversal:

Grzegorczyk, M.Husmeier, D. "Improving the structure MCMC sampler for Bayesian networks by introducing a new edge reversal move", Machine Learning, vol. 71(2-3), pp. 265, 2008.

For the Markov Blanket resampling move:

Su, C.Borsuk, M. E. "Improving structure MCMC for Bayesian networks through Markov blanket resampling", The Journal of Machine Learning Research, vol. 17(1), pp. 4042-4061, 2016.

For the Koivisto prior:

Koivisto, M. V. (2004). Exact Structure Discovery in Bayesian Networks, Journal of Machine Learning Research, vol 5, 549-573.

For the user defined prior:

Werhli, A. V., & Husmeier, D. (2007). "Reconstructing gene regulatory networks with Bayesian networks by combining expression data with multiple sources of prior knowledge". Statistical Applications in Genetics and Molecular Biology, 6 (Article 15).

Imoto, S., Higuchi, T., Goto, T., Tashiro, K., Kuhara, S., & Miyano, S. (2003). Using Bayesian networks for estimating gene networks from microarrays and biological knowledge. In Proceedings of the European Conference on Computational Biology.

Scutari, M. (2010). Learning Bayesian Networks with the bnlearn R Package. Journal of Statistical Software, 35(3), 1 - 22. doi:http://dx.doi.org/10.18637/jss.v035.i03.

@@ -202,7 +210,21 @@ terms, `.` replaces all the variables in name. Additionnaly, when one want to ex

## Technical fundations

This part aimed at describing the implemented methodology in *mcmcabn*. The rational for mcmcabn is that MCMC over DAGs allow the user to query the familly of high scoring DAGs and then draw more general conclusion than using only one DAG. The rational for using structural algorithms for sampling the DAG space and not the classical node odering MCMC described by Friedman and Koller (2003) is that it biased. Additionaly, it is not transparent regarding the prior. One characteristic of systems epidemiology dataset is that data are scarce, then in a Bayesian setting the prior play a major role to model complex systems. Finally, structural approaches allows user defined prior which is of great interest.

*Note: in the following text the terms: Bayesian networks, networks, structures and graphs are considered as synonyms.*

This part aimed at describing the implemented methodology in *mcmcabn*. The rational for mcmcabn is that MCMC over DAGs allow the user to query the familly of high scoring DAGs and then draw more general conclusion than using only one DAG. The rational for using structural algorithms for sampling the DAG space and not the classical node odering MCMC described by Friedman and Koller (2003) is that it biased. Additionaly, it is not transparent regarding the prior. One characteristic of systems epidemiology dataset is that data are scarce, then in a Bayesian setting the prior play a major role. Thus it is of high importance to be trasnparent about our prior belief. Finally, structural approaches allows user defined prior which is of great interest.

Form a mathematical perspective, one want to samples overall possible structures in moving from structure to

structure according to its support by the data. The posterior distribution of the structures is given by:

Where G is a Bayesian network, D the data and $z^*$ is a normalization factor. In a score based approache, the likelihood i.e $p(D|G)$ are precisely the scores. In mcmcabn they have been pre-computed using abn and stored in a cache of scores.Once one got an estimate of the posterior distribution, one could compute the posterior probability of any structural feature (f) from a Markov chain sample computed from graph structures by

$$E[f|D] \approx \frac{1}{S} \sum_{s=1}^{S}f(G^s), \text{ where } G^s \sim p(G|D)$$

where $f$ is any structural query (`formula` statement in the `query()` function) and $S$ is the set of visited structures $G$.

Classically, structure MCMC is done using the algorithm described by Madigan and York (1995) and Giudici and Castelo (2003). It is called the Monte Carlo Markov Chain Model Choice (MC^3). It is an algorithm based on a single edge move: deletion, addition or reversal. This algorithm is known to be slow in mixing and easily stuck in local maxima. Then updates have been proposed. The ideas is to perform a drastic change in the proposed DAG in order to cross low probability regions easily.

...

...

@@ -212,11 +234,39 @@ Two structurally transparent workarounds have been proposed: the new edge revers

We believe that having those three algorithms in a unique function with user adjustable relative frequencies could only lead to better results. Indeed, those three algorithms work very differently. Indeed, the MC^3 is very stable and sample efficiently the nearby region. The REV and MBR could produce large MCMC jumps but different then possibly complementary.

### Prior

The general method is the Metropolis Hasting algorithm. Essentially, it is a sequential application of two steps:

1. a new DAG is proposed from some proposal distribution Q

2. the proposed DAG is accepted with some acceptance probability A

The distribution Q have to be smart. Closer it is from the posterior faster will the algorithm be. In order to compute the acceptance probability A one need to compute the posterior probability of each DAG. In this step, the prior on the index structure play a major role.

### Priors

Three priors are implemented in mcmcabn R package. The parameter `prior.choice` determines the prior used within each individual node for a given choice of parent combination. In Koivisto and Sood (2004) p.554 a form of prior, called Koivisto prior, is used which assumes that the prior probability for parent combinations comprising of the same number of parents are all equal. Specifically, that the prior probability for parent set G with cardinality $|G_n|$ is proportional to 1/[N-1 choose |G|] where there are N nodes in total.

As we are in a MCMC setting and we will compute an Hasting ratio, the normalizing constant $z$ will cancelled out. Then we will never explicitelly compute it.

Note that this favours parent combinations with either very low or very high cardinality which may not be appropriate. This prior is used when `prior.choice=2`. When `prior.choice=1` an uninformative prior is used where parent combinations of all cardinalities are equally likely.

When `prior.choice=3` a user defined prior is used, defined by `prior.dag`. It is given by an adjacency matrix (squared and same size as number of nodes) where entries ranging from zero to one give the user prior belief. An hyper parameter defining the global user belief in the prior is given by `prior.lambda`. In the vein of Werhli and Husmeier (2007) we defined the biological prior knowledge matrix B where rows are parent set and column children, each entries $B_{ij} \in [0,1]$ such that:

- $B_{ij}$ = 0.5, indicates no prior knowledge about the presence or absence of an arc between node i and j.

- $B_{ij} \in [0,0.5[$ indicates a prior belief that there is no arc between node i and j. The evidence get stronger as $B_{ij}$ get closer to 0.

- $B_{ij} \in ]0.5,1]$ indicates a prior belief that there is an arc between node i and j. The evidence get stronger as $B_{ij}$ get closer to 1.

Such a matrix is still not a prior as we need to introduce a proper normalization procedure. To do so, let us define the energy of a structure as:

$$E(G) = \sum_{i,j=1}^N |B_{i,j} −G_{i,j}|$$

Three priors are implemented in mcmcabn R package. The parameter `prior.choice` determines the prior used within each individual node for a given choice of parent combination. In Koivisto and Sood (2004) p.554 a form of prior, called Koivisto prior, is used which assumes that the prior probability for parent combinations comprising of the same number of parents are all equal. Specifically, that the prior probability for parent set G with cardinality |G| is proportional to 1/[n-1 choose |G|] where there are n total nodes. Note that this favours parent combinations with either very low or very high cardinality which may not be appropriate. This prior is used when `prior.choice=2`. When `prior.choice=1` an uninformative prior is used where parent combinations of all cardinalities are equally likely.

The energy E is zero for a perfect match between the prior knowledge B and the actual network G, while increasing values of E indicate an increasing mismatch between B and G. Following, Imoto et al. (2003) we define the prior belief on graph G by

When `prior.choice=3` a user defined prior is used, defined by `prior.dag`. It is given by an adjacency matrix (squared and same size as number of nodes) where entries ranging from zero to one give the user prior belief. An hyper parameter defining the global user belief in the prior is given by `prior.lambda`. As

$$p(G|\beta) = \frac{\exp^{-\lambda E(G)}}{z}$$

where $z^*$ is a normalizing constant and $\lambda$ is an hyperparameter defined by `prior.lambda`. Again, as we are in a MCMC setting and we will compute an Hasting ratio, the normalizing constant will cancelled out. Then we will never explicitelly compute them.

In statistical physics, in a Gibbs distribution the hyperparameter $\lambda$ is the inverese of temperature. It can be interpreted as a proxy indicating the strength of the influence of the prior over the data. Thus the strength of the user belief in the given priro. For $\lambda \rightarrow 0$ the prior distribution becomes flatter then uninformative about the structure. Inversly, for $\lambda \rightarrow \infty$, the prior distribution becomes sharply peaked at the network with the lowest energy.