# Full text of "Exact and approximate inference in graphical models: variable elimination and beyond"

## See other formats

arXivil506.08544V 1 [stat.ML] 29Jun2015 Exact and approximate inference in graphical models: variable elimination and beyond Nathalie Peyrard®, Simon de Givry“, Alain Franc^, Stephane Robin*^’^, Regis Sabbadin®, Thomas Schiex®, Matthieu Vignes® “ INRA UR 875, Unite de Mathematiques et Informatique Appliquees, Chemin de Borde Rouge, 31326 Castanet-Tolosan, France ^ INRA UMR 1202, Biodiversite, Genes et Communautes, 69, route d’Arcachon, Pierroton, 33612 Cestas Cedex, France ^ AgroParisTech, UMR 518 MIA, Paris 5e, France ^ INRA, UM R518 MIA, Paris 5e, France June 30, 2015 Abstract Probabilistic graphical models offer a powerful framework to account for the dependence structure between variables, which can be represented as a graph. The dependence between variables may render inference tasks such as computing normalizing constant, marginalization or optimization intractable. The objective of this paper is to review techniques exploiting the graph structure for exact inference borrowed from optimization and computer science. They are not yet standard in the statistician toolkit, and we specify under which conditions they are efficient in practice. They are huilt on the principle of variable elimination whose complex¬ ity is dictated in an intricate way by the order in which variables are eliminated in the graph. The so-called treewidth of the graph characterizes this algorithmic complexity: low-treewidth graphs can be processed efficiently. Algorithmic solutions derived from variable elimination and the notion of treewidth are illustrated on problems of treewidth computation and inference in challenging benchmarks from optimization competitions. We also review how efficient tech¬ niques for approximate inference such as loopy belief propagation and variational approaches can be linked to variable elimination and we illustrate them in the context of Expectation- Maximisation procedures for parameter estimation in coupled Hidden Markov Models. Keywords: graphical model, computational inference, treewidth, message passing, variational approximations 1 1 Introduction Most real complex systems are made up or modeled by elementary objects that locally interact with each other. Graphical models ( Bishop[ 2006 [ Koller and Friedman] 2009 [ Murphy[ 20121 are formed by variables linked to each other by stochastic relationships. They enable to model dependencies in possibly high-dimensional heterogeneous data and to capture uncertainty. Graph¬ ical models have been applied in a wide range of areas like image analysis, speech recognition, bioinformatics, ecology to name a few. In real applications a large number of random variables with a complex dependency structure are involved. As a consequence, inference tasks such as the calculation of a normalization con¬ stant, a marginal distribution or the mode of the joint distribution are challenging. Three main approaches exist to evaluate such quantities for a given distribution p defining a graphical model: (a) compute them in an exact manner; (6) use a stochastic algorithm to sample from the distribu¬ tion p to get (unbiased) estimates; (c) derive an approximation of p for which the exact calculation is possible. Even if appealing, exact computation on p often leads to very time and memory con¬ suming procedures, since the number of elements to store or elementary operations to perform increase exponentially with n the number of random variables. The second approach is probably the most widely used by statisticians and modelers. Stochastic algorithms such as Monte-Carlo Markov Chains (MCMC, Robert and Casella 20041, Gibbs sampling ( Casella and Georgej 19921 and particle filtering (Gordon and Smith, 1993) have become standard tools in many fields of ap¬ plication using statistical models. The last approach includes variational approximation techniques ( [Wainwright and Jordan[ |2008| ), which are starting to become common practice in computational statistics. In essence, approaches of type (6) provide an approximate answer to an exact problem whereas approaches of type (c) provide an exact answer to an approximate problem. In this paper we focus on approaches of type (a) and (c), and we will review techniques for exact or approximate inference in graphical models borrowed from both optimization and computer science. They are computationally efficient, yet not standard in the statistician toolkit. Our purpose is to show that the characterization of the structure of the graph G associated to a graphical model (precise definitions are given in Section]^ enables both to determine if the exact calculation of the quantities of interest (marginal distribution, normalization constant, mode) can be implemented efficiently and to derive a class of operational algorithms. When the answer is no, the same analysis enables to design algorithms to compute an approximation of the desired quantities for which an acceptable complexity can be obtained. The central algorithmic tool is the variable elimination concept ( Bertele and Brioshi] , 1972| ). In Section we adopt a unified algebraic presentation of the different inference tasks (marginal¬ ization, normalizing constant or mode evaluation) to emphasize that all of them can be solved as particular cases of variable elimination. This implies that if variable elimination is efficient for one task it will also be efficient for the other ones. The key ingredient to design efficient algorithms based on variable elimination is the clever use of distributivity between algebraic operators. For instance distributivity of the product (x) over the sum (-f) enables to write (a x b) + (a x c) = a X (b + c) and evaluating the left-hand side of this equality requires two multiplications and one addition while evaluating the right-hand side requires one multiplication and one addition. Simi¬ larly since max(a + b, a + c) = a -h max(b, c) it is more efficient to compute the right-hand side from an algorithmic point of view. Distributivity enables to minimize the number of operations. 2 Associativity and commutativity are also required and the algebra behind is the semi-ring eate- gory (from whieh some notations will be borrowed). Inferenee algorithms using the distributivity property have been known and published in the Artifieial Intelligenee and Maehine Learning liter¬ ature under different names, sueh as sum-prod, or max-sum ( Pearl[ 1988; Bishop[ 20061 and are examples of variable elimination. Variable elimination relies on the ehoiee of an order of elimination of the variables (either by marginalization or by maximization). This eorresponds to the ordering ealeulations are performed when applying distributivity. The topology of the graph G provides key information to organize the ealeulations for an optimal use of distributivity, i.e. to minimize the number of elementary operations to perform. For example, when the graph is a tree, the most efficient elimination order eorresponds to eliminating reeursively the vertices of degree one, starting from the leaves towards the root. For an arbitrary graph, the notion of an optimal elimination order for inference in a graph- ieal model is elosely linked to the notion of treewidth of the assoeiated graph G. We will see in Seetion the reason why inference algorithms based on variable elimination with the best elimi¬ nation order are of eomplexity linear in n but exponential in the treewidth. Therefore treewidth is the key eharaeterization of G to determine if exaet inferenee is possible in praetiee or not. The eoneept of treewidth has been proposed in parallel in computer scienee (Bodlaender 19941 and in diserete mathematies and graph minor theory (see Robertson and Seymour[ 1986 Lovasz 20051. Diserete mathematies existenee theorems ([Robertson and Seymour, 19861 establish that there exists an algorithm for eomputing the treewidth of any graph with eomplexity polynomial in n (but exponential in the treewidth), and even the degree of the polynomial is given. However this result does not tell how to derive and implement the algorithm, apart from some specifie eases (as trees, chordal graphs, and series-parallel graphs, see |Duffin] ( |1965| ). So we will also present in Section several state-of-the-art algorithms for approximate evaluation of the treewidth and illustrate their behavior on benchmarks borrowed from optimization eompetitions. Variable elimination has also lead to message passing algorithms ( Pearl[ 19881 whieh are now eornmon tools in computer seience or machine learning. More reeently these algorithms have been reinterpreted as re-parametrization tools ( Roller and Friedman[ 2009). We will explain in Seetion how re-parametrization ean be used as a pre-processing tool to transform the original graphieal model into an equivalent one for whieh inference may been simpler. Message passing is not the only way to perform re-parametrization and we will diseuss alternative effieient algorithms that have been proposed in the eontext of eonstraint satisfaetion problems (CSP, see ([Rossi et aL| 20061) and that have not yet been exploited in the eontext of graphieal models. As emphasized above, effieient exaet inferenee algorithms ean only be designed for graphieal models with limited treewidth (mueh less than the number of vertices), whieh is a far from being the general ease. But the prineiple of variable elimination and message passing for a tree ean still be applied to any graph leading then to heuristie inferenee algorithms. The most famous heuristies is the Loopy Belief Propagation algorithm ( Ksehisehang et akj 2001) We reeall in Section the result that establishes LBP as a variational approximation method. Variational methods rely on the ehoiee of a distribution whieh renders inference easier, to approximate the original eomplex graphieal model p. The approximate distribution is chosen within a class of models for which efficient inferenee algorithms exist, that is models with small treewidth (0, 1 or 2 in practice). We review some of the standard ehoiees and we illustrate on the problem of parameter estimation in eoupled Hidden Markov Model (Ghahramani and Jordan 1997[) how variational methods have 3 been applied in praetiee with different approximate distributions, each of them corresponding to a different underlying treewidth (Section |^. 2 Graphical Models 2.1 Models definition Consider a a stochastic system defined by a set of random variables X = (Xi,..., X„). Each variable Xj takes values in Aj. Then, a realization of X is a set x = (xi,..., x„), with Xi G Aj. The set of all possible realizations is called the state space, and is denoted A = nr=i ^ is a subset of n}, X^, xa and A^ are respectively the subset of random variables {Xi,i G A}, the set of realizations {xi,i G A} and the state space of Xa- If p is the joint probability distribution of X on A, we denote Va; G A, p{x) = p{X = x). Note that we focus here on discrete variables (we will discuss inference in the case of contin¬ uous variables on examples in Section [^. A joint distribution p on A is said to be a probabilistic graphical model ( Lauritzen] 1996^ [Bishop[ |2006[ Koller and Friedman] |2009[ ) indexed on a set B of parts of V if there exists a set ih = {V’s} bgb of maps from A^ to M+, called potential functions, indexed by B such that p can be expressed in the following factorized form: p{x )=7 n '^b{xb), ( 1 ) B&B where Z = Yhx&k Y\b&b'^b{.xb) is the normalizing constant, also called partition function. The elements B E B are the scopes of the potential functions and |i?| is the arity of the potential function The set of scopes of all the potential functions involving variable Xj is denoted Bi = \jBeB{B}. One desirable property of graphical models is that of Markov local independence: if p(X = x) can be expressed as Q then a variable X, is (stochastically) independent of all others in X con¬ ditionally to the set of variables B)\i- This set is called the Markov blanket of X,, or its neighborhood. We will denote it Xj. These conditional independences can be represented graph¬ ically, by a graph with one vertex per variable in X. The question of encoding the independence properties associated with a given distribution into a graph structure is well described in ( [Koller and 2009| ), and we will not discuss it here. We will consider the classical graph G = (V, E) to a decomposition of the form ([^ where an edge is drawn between two vertices i and i if there exists B E B such that i and j are in 5. Such a representation of a graphical model is actually not as rich as the representation ([^. For instance, if n = 3, the two cases B = {{1, 2, 3}} and B = {{1, 2}, {2, 3}, {3,1}} are represented by the same graph G, namely a clique of size 3. The factor graph representation goes beyond this limit: this graphical representation is a bipartite graph with one vertex per potential function and one vertex per variable, edges are only between functions and variables. An edge is present between a function vertex (also called factor vertex) Friedman associated 4 Figure 1: From left to right: (a) Graphieal representation of a direeted graphieal model where po¬ tential funetions define the eonditional probability of eaeh variable given its parents values; (b) The corresponding factor graph where every potential function is represented as a factor (square ver¬ tex) connected to the variables that are involved in it; (c) Graphical representation of an undirected graphical model. It is impossible from this graph to distinguish between a graphical model defined by a unique potential function on vertices 3, 4 and 5 from a model defined by 3 pairwise poten¬ tial functions over each pair (3,4), (3,5) and (4,5); (d) The corresponding factor graph, which unambiguously defines these potential functions (here three pairwise potential functions) and a variable vertex if the variable is in the scope of the potential function. Figure displays examples of the two graphical representations. There exists several families of probabilistic graphical models ( jKoller and Friedman , 2009 Murphy[ |2012|). They can be grouped into directed and undirected ones. The most classical directed framework is that of Bayesian network (Pearl 1988; Jensen and Nielsen| , 2007 1 ). In a Bayesian network, an edge is directed from a parent vertex to a child vertex and potential func¬ tions are conditional probabilities of a variable given its parents in the graph (see Figure (a)). In such models, trivially Z = 1. Undirected probabilistic graphical models (see Figure [^(c)) are equivalent to Markov Random Fields (Li, 20011 as soon as the potential functions are in IR+*. In a Markov random field (MRF), a potential function is not necessarily a probability distribution: ips is not required to be normalized. Deterministic Graphical models. Although the terminology of ’Graphical Models’ is often used to refer to stochastic graphical models, the idea of describing a joint distribution on a set of variables through local functions has also been used in Artificial Intelligence to concisely describe Boolean functions or cost functions, with no normalization constraint. In a graphical model with only Boolean (0/1) potential functions, each potential function describes a constraint between vari¬ ables. If the potential function takes value 1, the corresponding realization is said to satisfy the constraint. The graphical model is known as a ’Constraint Network’. It describes a joint Boolean function on all variables that takes value 1 if and only if all constraints are satisfied. The problem of finding a realization that satisfies all the constraints, called a solution of the network, is the ’Constraint Satisfaction Problem’ (CSP) (|Rossi et al.[ 120061). This framework is used to model and solve combinatorial optimization problems and there is a variety of software tools to solve it. 5 When variables are Boolean too and when the Boolean functions are described as disjunctions of variables or of their negation, the CSP reduces to the ’Boolean Satisfiability’ problem (or SAT), the seminal NP-complete problem (Cook 19711. CSP have been extended to describe joint cost functions, decomposed as a sum of local cost functions in the ‘Weighted Constraint Network’ ( Rossi et aL| [2006 ) or ‘Cost Function Network’. In this case, potential functions take finite or infinite integer or rational values: infinity enables to express hard constraints while finite values encode costs for unsatisfied soft constraints. The problem of finding a realization of minimum cost is the ’Weighted Constraint Satisfaction Prob¬ lem’ (WCSP), which is also NP-hard. It is easy to observe that any stochastic graphical model can be translated in a weighted constraint network using a simple — log(-) transformation. With this equivalence, it becomes possible to use exact WCSP resolution algorithms that have been developed in this field for mode evaluation in stochastic graphical model. 2.2 Inference tasks in probabilistic graphical models Computations on probabilities and potentials rely on two fundamental types of operations. Firstly, multiplication (or addition in the log domain) is used to combine potentials to define a joint potential distribution. Secondly, sum or max/min can be used to eliminate variables and com¬ pute marginals or modes of the joint distribution on subsets of variables. The precise identity of these two basic operations is not crucial for the inference algorithms considered in this pa¬ per. We denote as © the combination operator and as © the elimination operator. The algo¬ rithms just require that (IR+, ©, 0) defines a commutative semi-ring. Specifically, the semi-ring algebra offers distributivity: (a ©6) ©(a 0 c) = a©(6©c). This corresponds to distributivity of product over sum since (a x 6) + (a x c) = a x (6 + c) or distributivity of max over sum since max(a + b, a + c) = a + max(6, c), or again distributivity of max over product since max(a x b, a x c) = a x (max(6, c)). These two abstract operators can be defined to be applied to potential functions, as follows: Combine operator: the combination of two potential functions ^jJA and i/’s is a new function from Aaub to M + defined as ^ja © 'iIjb{.xavjb) = i>A{xA) 0 i>B{xB)- Elimination operator: the elimination of variable Xi,i G B from a potential function -0^ is a new function (©^^^ 0 b) from AB\{q to M+ defined as {®xi^B){xB\{i}) = ®xi{i’B{xB\{i}, Xi)). For © = +, (©^, 0B)(a:B\{i}) represents Y^xi i’BixB\{i}, Xi). We can now describe classical counting and optimization tasks in graphical models in terms of these two operators. For simplicity, we denote by ©a;^, where B C V a sequence of eliminations ©a;, for all i & B, the result being insensitive to the order in a commutative semi-ring. Similarly, ©BeB represents the successive combination of all potential functions 0 b such that B G B. Counting tasks. Under this name we group all tasks that involve summing over the state space of a subset of variables in X. This includes the computation of the partition function Z or of any 6 marginal distribution, as well as entropy evaluation. For A C V and A = V \ A, the marginal distribution pA of Xa assoeiated to the joint distribution p is defined as: Pa(xa) = ^ p(^a,Xa) = ^ ^ ( 2 ) B^}3 The funetion pa then satisfies: paQ z = paQ { O i’B)) = (0(0 -06)) V XV B&B / ^ B&B / where © eombines funetions using x and © eliminates variables using +. Marginal evaluation is also interesting in the ease where some variables are observed. If the val¬ ues of some variables xq {O C V) have been observed, we ean eompute the marginal eonditional distribution by restrieting the domains of variables Xq to the observed value. The entropy i/ of a probabilistie graphieal model p is defined as H{p) = -E[\n{p{x))], (3) where E[-] denotes the mathematieal expeetation. In the ease of a graphieal model, by linearity of the expeetation, the entropy is equal to H{p) = ln(Z) - E E p{xb) \n{^jJB{xB))■ B&Bxb&Ab This expression is an alternation of use of © and © operators (for p{xb) evaluation, for eaeh B and xb)- Optimization task An optimization task in a graphieal model eorresponds to the evaluation of the most probable state x* of the random veetor X, defined as X = aigmaxpix) = argmax IT ipBixB) = argmaxN hiipBixB)- xSA t(=A -I--I- BeB (4) The maximum itself is 0a;,., Obsb In '4’b{xb) with © set to max and © to +. The eomputation of the mode x* does not require the eomputation of the normalizing eonstant Z, however eomputing the mode probability p{x*) does. Therefore eounting and optimization tasks ean be interpreted as two instantiations of the same eomputational task expressed in terms of eombination and elimination operators, namely 0a;^ O b&b'4’b where A C V. When the eombination operator © and the elimination operator © are respeetively set to X and +, this eomputational problem is known as a sum-produet problem in the Artifieial Intelligenee literature ( |Pear4[T988| ),( |Bishop[ |2006[ ehapter 8). When © is set to max and © to the sum operator it is a max-sum problem (Bishop 2006[ ehapter 8). We will see in Seetion that there exists an exaet algorithm solving this general task that exploits the distributivity of the eombination and elimination operators to perform operations in a smart order. From this generie algorithm, known as variable elimination (Bertele and Brioshi 19721 or bueket elimination ( Deehter[ 19991, one ean deduee exaet algorithms to solve eounting and optimization tasks in a graphieal model, by instantiating the operators © and ©. 7 ©--®--(5)-<7) @ @ ® Figure 2: Graphical representation of a HMM. Hidden variables correspond to vertices 1, 3, 5, 7, and observed variables to vertices 2, 4, 6, 8. Deterministic Graphical models : the Constraint Satisfaction Problem is a V-A problem as it can can be defined using V (logical ’or’) as the elimination operator and A (logical ’and’) as the combination operator over Booleans. The weighted CSP is a min-+ as it uses min as the elimination operator and + (or bounded variants of +) as the combination operator. Several other variants exist (] ^ossi et al.[ 20061, including generic algebraic variants ( Schiex et al. 1995 Bistarelli et al. 1997 Cooper 2004 Pralet et al.[|2007 Kohlasl 2003|). 3 Variable elimination for exact inference We describe now the principle of variable elimination. We first recall the Viterbi algorithm for Hidden Markov Chains, a classical example of variable elimination for optimization. Then we formally describe the variable elimination procedure in the general graphical model framework. The key element is the choice of an ordering for the sequential elimination of the variables. It is closely linked to the notion of treewidth of the graphical representation of the graphical model. As it will be shown, the complexity of the variable elimination is fully characterized by this notion. Conversely, the treewidth can be bounded from above from a given variable elimination scheme. 3.1 An example: hidden Markov chain models As an introduction to exact inference on graphical models by variable elimination, we consider a well studied stochastic process: the discrete Hidden Markov Chain model (HMC). A HMC (Figure is defined by two sequences of random variables O and H of the same length, T. A realization o = (oi,... ot) of the variables O = (Oi,... Ot) is observed, while the states of variables H = {Hi,... Ht) are unknown. In the HMC model the assumption is made that Oi is independent of Hv\{i} and Ov\{i} given the hidden variable Hi. These independences are modeled by pairwise potential functions 1 < * < T’- Furthermore, hidden vari¬ able Hi is independent of i7i,..., and Oi,, Oi-i given the hidden variable These independences are modeled by pairwise potential functions 1 < i < T. Then the model is fully defined by specifying an additional potential function 'ij)Hi{hi) to model the initial distribution. In the classical HMC formulation (Rabiner, 19891, these potential functions are nor¬ malized conditional probability distributions i.e., ipHi_-i_,Hi{hi-i, hi) = p{Hi = hi\Hi_i = hj_i), '4^Oi,Hi{0i, hi) = p{Oi = Oi\Hi = hi) and = PiHi = hi). As a consequence, the normal¬ izing constant Z is equal to 1, as in any Bayesian network. 8 A classical inference task for HMC is to identify the most likely value of the variables H given a realization o of the variables O. The problem is to eompute argmax/ip(if = h\0 = o) or equivalently the argument of: max hi T {hi_u hi)'ilJo,,HXoh hi)) i=2 (5) The number of possible realizations of H is exponential in T. Nevertheless this optimization problem ean be solved in a number of operations linear in T using the well-known Viterbi al¬ gorithm ( |Rabiner| 19891. This algorithm, based on dynamie programming, performs sueeessive eliminations (by maximization) of all hidden variables, starting with Ht, then Ht-i, and finishing by Hi, to eompute the most likely sequenee of hidden variables. By using distributivity between the max and the produet operators, the elimination of variable Ht ean be done by rewriting equa¬ tion @ as: max h\,...,hT-\ - T-l tpHAhi)tpOuHAoi,hi TT hi)'ipOr,HXoi-: hi).v[va-x'ipHT-i,HT{hT-i-, hT)i>OT,HT{oT-, hr )) i=2 V-^. New potential function The new potential funetion ereated by maximizing on Ht depends only on variable Ht-i, so that variables Ht, Ot and potential funetions involving them have been removed from the opti¬ mization problem. This is a simple applieation of the general variable elimination algorithm that we deseribe in the next seetion. 3.2 General principle of variable elimination In Seetion we have seen that eounting and optimization tasks ean be formalized by the same generie algebraie formulation 0( O '0b) (6) XA -BeB where A <zV. The triek behind variable elimination ( |Bertele and Brioshi] \\912) relies on a elever use of the distributivity property. Indeed, evaluating (a © 6) ©(a 0 c) as a 0(6 © c) requires fewer operations. Sinee distributivity applies both for eounting and optimizing tasks, variable elimination ean be applied to both tasks. It also means that if variable elimination is effieient for one task it will also be effieient for the other one. As in the HMC example, the prineiple of the variable elimination algorithm for eounting or optimizing eonsists in eliminating variables one by one in expression Q. Elimination of the first variable, say Xi,i G A, is performed by merging all potential funetions involving Xj and applying operator to these potential funetions. Using eommutativity and assoeiativity equation Q ean be rewritten as follows: 0(O'0b)= 0 0 O '0b)( 0 ^b) XA BeB Xi \ B&Bi 9 Figure 3: Elimination of variable Xi replaces the four pairwise potential functions involving vari¬ able Xi with a new potential involving the four neighbors of vertex 1 in the original graph. The new edges created between these four vertices are called fill-in edges (dashed edges in the middle figure). Then using distributivity we obtain 0( O V's) xa BgB 0 ( O '0b) © (0 O tpB) Xi BeBi ' -V-' New potential function ipNi This shows that the elimination of Xi results in a new graphical model where the variable Xi and the potential functions ipBiB e Bi have been removed and replaced by a new potential which does not involve Xi, but its neighboring vertices. The graph associated to the new graphical model is similar to the graph of the original model except that vertex Xi has been removed and that the neighbors Ni of Xi are now connected together in a clique. The new edges between the neighbors of Xi are called edges. For instance, when eliminating variable Xi in the graph of Figure (left), potential functions 01^2, "01,3, 0i,4 and -01^5 are replaced by 02,3,4,5 = ©11 (01,2 0 01,3 © 01,4 © 01,4)- The new graph is shown in Figure]^ right part. When the first elimination step is applied with © = -f and 0 = x, the probability distribution defined by this new graphical model is the marginal distribution p{xv\{i}) of the original distri¬ bution p. The complete elimination can be obtained by successively eliminating all variables in Xa- The result is a graphical model over Xv\a which is the marginal distribution p{xv\a)- When A = V, the result is a model with a single constant potential function with value Z. If instead © is max, and 0 = x (or + with a log transformation of the potential functions) and A = V, the last potential function obtained after elimination of the last variable is equal to the maximum of the non normali z ed distribution. So evaluating Z or the maximal probability of a graphical model can be both obtained with the same variable elimination algorithm, just changing the meaning of © and ©. However, if one is interested in the mode itself, an additional simple computation is required. The mode is obtained by induction: if is the mode of the graphical model obtained after the elimination of the first variable, Xi, then the mode of p can be defined as x*) where x* is a value in Aj that maximizes Qbgb Xi). This maximization is straightforward to derive because Xi can take only |Ai| values. itself is obtained by completing the mode 10 of the graphical model obtained after elimination the second variable, and so on. We stress here that the procedure requires to keep the intermediary potential functions generated during the successive eliminations. When eliminating a variable Xj, the task that can be computationally expensive is the com¬ putation of the intermediate ipNi- It requires to compute the product QBeBii’BixB) of several potential functions for all elements of AAr.u{i}, the state space of The time and space complexity of the operation are entirely determined by the cardinality |iVi| of the set of indices Xj. If K = maxjgy |Aj| is the maximum domain size of a variable, the time complexity (i.e. number of elementary operations performed) is in and space complexity (i.e. memory space needed) is in Complexity is therefore exponential in |Xj|, the number of neighbors of the eliminated variable in the current graphical model. The total complexity of the variable elim¬ ination is then exponential in the maximum cardinality \Ni\ over all successive eliminations (but linear in n). Because the graphical model changes at elimination each step, this number usually depends on the order in which variables are eliminated. As a consequence, the prerequisite to apply variable elimination is to decide for an ordering of elimination of the variables. As illustrated in Figure]^ two different orders can lead to two different Ni subsets. The key message is that the choice of the order is crucial/ it dictates the efficiency of the variable elimination procedure. We will now illustrate and formalize this intuition. 3.3 When is variable elimination efficient ? We can understand why the Viterbi algorithm is an efficient algorithm for mode evaluation in a HMC. The graph associated to a HMC is comb-shaped: the hidden variables form a line and each observed variable is a leaf in the comb (see Figure]^. So it is possible to design an elimination order where the current variable to eliminate has a unique neighbor in the graphical representation of the current model: for instance Ot > > Ot-i > Ht-i, ■ ■ ■ > Oi > Hi (the first eliminated variable is the largest according to this ordering). Following this elimination order, when eliminat¬ ing a variable using ©, the resulting graphical model has one fewer vertex than the previous one and no fill-in edge. Indeed, the new potential function di function of a single variable since m = 1 . More generally, variable elimination is very efficient, i.e. leads to intermediate Ni sets of small cardinality, on graphical models whose graph representation is a tree, again because it is always possible to design an elimination order where the current variable to eliminate has only one neighbor in the graphical representation of the current model. Another situation where variable elimination can be efficient is when the graph associated to the graphical model is chordal (any cycle of length 4 or more has a chord i.e., an edge connecting two non adjacent vertices in the cycle), the size of the largest clique being low. The reason is the following. In Figure 2, new edges are created between neighbors of the eliminated vertex. If this neighborhood is a clique, no new edge is added. A vertex whose neighborhood is a clique is called a simplicial vertex. Chordal graphs have the property that there exists an elimination order of the vertices such that every vertex during elimination process is simplicial. Then, there exists an elimination order such that no fill-in edges are created. Thus, the largest size of is no more than the size of a clique, and is equal to or less than the size of the largest clique in the graph. Let 11 Figure 4: A graph and two elimination orders. Left, the graph; middle, induced graph associated to the elimination order (7>6>5>4>3>2>1). Vertices are eliminated from the largest to the smallest. The maximum size of A* sets created during elimination is 2 (maximum number of outgoing edges) and only one (dashed) fill-in edge is added when vertex 4 is eliminated; right, induced graph associated to the elimination order (7>5>3>1>6>4>2). The maximum size of Ni sets created during elimination is 3 and 5 (dashed) fill-in edges are used. us note that a tree is a chordal graph in which all edges and only edges are cliques. Hence, for a tree, simplicial vertices are vertices of degree one. Then, elimination of degree one vertices on a tree is an example of simplicial elimination on a chordal graph. For arbitrary graphs, if the maximal scope size of the intermediate functions created during variable elimination is too large, then memory and time required for the storage and computation quickly exceed computer capacities. Depending on the chosen elimination order, this maximal scope can be reasonable from a computational point of view, or too large. So again, the choice of the elimination order is crucial. 3.4 The treewidth to characterize variable elimination complexity The lowest complexity achievable when performing variable elimination is characterized by a pa¬ rameter called the treewidth of the graph associated to the original graphical model. This concept has been repeatedly discovered and redefined. The treewidth of a graph is sometimes called its induced width ([Dechter and Pearl] |1988|), its minimum front size (|Liul |1992[), its fc-tree num¬ ber (Amborg 1985), its dimension (Bertele and Brioshi 1972) and is also equal to the min-max clique number of G minus one ( Arnborgj 1985). The treewidth is also a key notion in the theory of graph minors (see Robertson and Seymour|, 1986 Lovasz 20051. We insist here on two definitions. The first one from (Bodlaender 1994) relies on the notion of induced graph and the link between fill-in edges and the intermediate A* sets created during variable elimination is straightforward. The second (Robertson and Seymour[ [T986[ Bodlaend^ 1994) is the commonly used characterization of the treewidth using so-called tree decompositions, also known as junction trees which are key tools to derive variable elimination algorithms. It underlies the block-by-block elimination procedure described in Section [T5j 12 Definition 1 (induced graph) Let G = {y,E) be a graph defined by a set of vertices indexed on V and a set E of edges. Given an ordering tt of the vertices of G, the induced graph Gf^ is obtained as follows. G and Gf'‘^ have same vertices. Then to each edge in E corresponds an oriented edge in Gf'^ going from the first of the two nodes according to tt toward the second. Then each vertex i of V is considered one after the other following the order defined by n. When vertex i is treated, an oriented edge is created between all pairs of neighbors ofi in G that follows i according to tt. Again the edge is going from the first of the two nodes according to vr toward the second. The induced graph G'f''^ is also called the fill graph of G and the process of computing it is sometimes referred to as “playing the elimination game” on G, as it just simulates elimination on G using the variable ordering tt. This graph is chordal (Vandenberghe and Andersen 20141. It is known that every chordal graph G has at least one vertex ordering vr such that Gf"^ = G, called a perfect elimination ordering (Fulkerson and Gross 1965[). The second notion that enables to define the treewidth is the notion of tree decomposition. Intuitively, a tree decomposition of a graph G organizes the vertices of G in clusters of vertices which are linked by edges such that the graph obtained is a tree. Specific constraints on the way vertices of G are associated to clusters in the decomposition tree are demanded. These contraints ensure properties to tree decomposition usefull for building variable elimination algorithms. Definition 2 (tree decomposition) Given a graph G = (V, E), a tree decomposition T is a tree {C, Et), where C = {Gi ,..., Gi} is a family of subsets ofV (called clusters), and E^ is a set of edges between the subsets Gi, satisfying the following properties: • The union of all clusters Gk equals V (each vertex is associated with at least one vertex of T). • For every edge {i,j) in E, there is at least one cluster Gk that contains both i and j. • If clusters Gk and Gi both contain a vertex i of G, then all clusters Gg ofT in the (unique) path between Gk and Gi contain i as well: clusters containing vertex i form a connected subset ofT. This is known as the running intersection property). The concept of tree decomposition is illustrated in Figure Definition 3 (treewidth) The two following definitions of the treewidth derived respectively from the notion of induce graph and from that of tree decomposition are equivalent (but this is not trivial to establish): • The treewidth TIF,r(G') of a graph Gfor the ordering vr is the maximum number of outgoing edges of a vertex in the induced graph Glf^. The treewidth TW (G) of a graph G is the minimum treewidth over all possible orderings vr. • The width of a tree decomposition (C, Et) is the size of the largest Gi G C. and the treewidth TW (G) of a graph is the minimum width among all its tree decompositions. 13 Figure 5: Left: graphical representation of a graphical model. Right: tree decomposition over clusters Ci = {1, 2,4}, C 2 = {1, 3,4}, C 3 = {3,4, 5}, C 4 = {5, 6 } and C 5 = {5, 7}. Each edge between two clusters is labeled by their common variables. TWt,{G) is exactly the cardinality of the largest set Ni created during variable elimination with elimination order tt. For example, in Figure Q the middle and right graphs are the two induced graphs for two different orderings. TWt,{G) is equal to 2 with the first ordering and to 3 with the second. It is easy to see that in this example TW{G) = 2. The treewidth of the graph of the HMC model, and of any tree is equal to 1 . It has been established that finding a minimum treewidth ordering tt for a graph G, finding a minimum treewidth tree decomposition or computing the treewidth of a graph are of equivalent complexity. For an arbitrary graph, computing the treewidth is not an easy task. Section is dedicated to this question, from a theoretical and a practical point of view. The treewidth is therefore a key indicator to answer the driving subject of this review: will variable elimination be efficient for a given graphical model? For instance, the principle of variable elimination have been applied to the exact computation of the normalizing constant of a Markov random field on a small r by c lattice in ( [Reeves and Pettittl 20041. For this regular graph, it is known that the treewidth is equal to min(r, c). So exact computation through variable elimination is only possible for lattices with a small value for min(r, c). It is however well beyond computer capacities for real challenging problems in image analysis. In this case variable elimination can be used to define heuristic computational solutions, such as the algorithm of (Friel et al. 2009) which relies on exact computations on small sub-lattices of the original lattice. 3.5 Tree decomposition and block by block elimination Given a graphical model and a tree decomposition of its graph, a possible alternative to solve counting or optimization tasks is to eliminate variables in successive blocks instead of one after the other. To do so, the block by block elimination procedure (Bertele and Brioshi, I972| ) relies on the tree decomposition characterization of the treewidth. The underlying idea is to apply the variable elimination procedure on the tree decomposition, eliminating one cluster of the tree at each step. First a root cluster G (7 is chosen and used to define an order of elimination of the clusters, by progressing from the leaves toward the roots, such that every eliminated cluster corresponds to a leaf of the current intermediate tree. Then each potential function "05 is assigned 14 to the cluster Ci in C such that B <Z Ci which is the closest to the root. Sucha cluster always exists from the properties of a tree decomposition and the fact that a potential function is associated to a clique in G). The procedure starts with the elimination of any leaf cluster Ci of T, with parent Cj in T. Let us note B{Ci) = {B e assigned to Ci}. Here again, commutativity and distributivity are used to rewrite expression ([^ (with A = V) follows: 0 0 V’B = 0 xv B&B XY\(Ci-Cj) O ^bO( 0 0 V’b) xc^-Cj B(^B{Ci) '• -V-^ New potential function Note that only variables with indices in C* \ = C, fl (1/ \ Cj) are eliminated, even if it is common to say that the cluster has been eliminated. For instance, for the graph of Figure]^ if the first eliminated cluster is Ci, the new potential function is X 2 )i’ 2 , 4 :i^ 2 , X 4 ), it depends only on variables Xi and X 4 . Cluster elimination continues until no cluster is left. The interest of this procedure is that the intermediate potential function created after each cluster elimination may have a scope much smaller than the treewidth, leading to better space complexity (Bertele and Brioshi \912\ chapter 4). However, the time complexity is increased. In summmary, the lowest complexity achievable when performing variable elimination is for elimination orders whose cardinalities of the intermediate Ni sets are lower of equal to the treewidth of G. This treewidth can be determined by considering clusters sizes in tree decompositions of G. Furthermore, a tree decomposition T can be used to build an elimination order and vice versa. Indeed, an elimination order can be defined by using a cluster elimination order based on T and choosing an arbitrary order to eliminate variables with indices in the subsets Ci \ Cj. Conversely, it is easy to build a tree decomposition from a given vertex ordering vr. Since the induced graph is chordal, its maximum cliques can be identified in polynomial time. Each such clique defines a cluster Ci of the tree decomposition. Edges of T can be identified as the edges of any maximum spanning tree in the graph with vertices Ci and edges (G*, Cj) weighed by |Gj fl Cj\. Deterministic Graphical Models : to our knowledge, the notion of treewidth and its properties have been first identified in combinatorial optimization in (Bertele and Brioshi 1972) where it was called “dimension”, a graph parameter which has been shown equivalent to the treewidth (Bodlaen- der 1998). Variable elimination itself is related to Eourier-Motzkin elimination (Eourier, 18271, a variable elimination algorithm that benefits from the linearity of the handled formulas. Variable elimination has been repeatedly rediscovered, as non-serial dynamic programming (Bertele and Brioshi 1972), in the David-Putnam procedure for boolean satisfiability problems (SAT, Davis and Putnam 1960| ), as Bucket elimination for the CSP and WCSP ( |Dechter[|1999| ), in the Viterbi and Eorward-Backward algorithms for HMM (Rabiner 19891 and many more. Theree exists other situations where the choice of an elimination order has a deep impact on the complexity of the computations as in Gauss elimination scheme for a system of linear equations, or Choleski factorization of very large sparse matrices, and where equivalence between elimination and decomposition have been used (see|Bodlaender et aL||1995[). 15 4 Treewidth computation and approximation As already mentioned, the complexity of the counting and the optimization tasks on graphical models is heavily linked to the treewidth TW{G) of the underlying graph G. If one could guess the optimal vertex ordering, vr*, leading to TW.„*{G) = TW{G), then, one would be able to achieve the “optimal complexity” for solving exactly these tasks (we recall that K is the maximal domain size of a variable in the graphical model). However, the problem is that one cannot easily evaluate the treewidth of a given graph. The treewidth computation problem is known to be NP-hard (Amborg et aL| 19871. In the following subsections we provide a short presentation of the state-of-the-art theoretical and experimental results concerning the exact computation of the treewidth of a graph, and the computation of suboptimal vertex orderings providing approximations of the treewidth in the form of an upper bound. 4.1 Exact solution algorithms Several exponential time exact algorithms have been proposed to compute the treewidth. These algorithms compute the treewidth in time exponential in n. The algorithm with the best complex¬ ity bound has been proposed by ( Fomin and Villanger[ 2012). They provide an exact algorithm for computing the treewidth, which run in time 0(2.6151"^) (using polynomial space), or in time 0(1.7549"), using exponential (memory) space. Since the treewidth of a network can be quite small (compared to n) in practice, there has been a great deal of interest in finding exact algorithms with time complexity exponential in TW{G) and potentially only polynomial in n. Some of these algorithms even have complexity linear in n ( Bodlaenderj 1996; [Perkovic and Reed[ 2000). In Bodlaender ( 1996| ), an algorithm is pro¬ posed to compute the treewidth (it also provides an associated tree decomposition) of G in time n). If this algorithm is used to compute the treewidth of graphs in a family of graphs whose treewidth is uniformly bounded, then computing the treewidth would become of time complexity linear in n (however, even for a small bound on the treewidth, the constant can be huge). Moreover, in the general case, there is no way to bound the treewidth a priori. 4.2 Approximation of the treewidth with guarantee Now, recall that even though crucial, finding a “good” tree decomposition of the graph G is only one element in the computation of quantities of interest in graphical models. If one has to spend more time on finding an optimal vertex ordering than on computing probabilities on the underlying graphical model using an easy-to-compute suboptimal ordering, the utility of exact treewidth com¬ putation becomes limited. Therefore, an alternative line of search is to look for algorithms com¬ puting a vertex ordering vr leading to a suboptimal width, TWt,{G) > TW{G), but more efficient in terms of computational time. When defining such approximation algorithms, one is particularly interested in polynomial time (in n) algorithms, finding a vertex ordering vr that approaches the optimal ordering within a constant multiplicative factor : TW^^iG) < aTW(G), a > 1. 16 However, the existenee of sueh eonstant-faetor approximation algorithms is not guaranteed for all NP-hard problems. Some NP-hard problems are even known not to admit polynomial time ap¬ proximation algorithms. As far as treewidth approximation is eoneemed, we are in the interesting ease where it is not yet known whether or not a polynomial time approximation algorithm does exist (Austria et al. 20121. Finally, there have been a variety of proposed algorithms, trading off approximation quality and running time eomplexity ( [Robertson and Seymour[ 1986[ Lagergreen] , 1996[[Aimr| 2010[ Bodlaen- der et al.i|2013| ). Table [T](extraeted from |Bodlaender et al.i|2013] ) summarizes the results in terms of approximation guarantee and time eomplexity for these algorithms. Algorithm Approximation guarantee Time eomplexil f{TW{G)) -y gin) Robertson and Seymour . \D 00 o^ ATWiG) + 3 srwiG) + 7 3.Q7TW{G) 3rW{G) + 4 ^3TW(G) 2 TW{G)logTW{G) 23.69S2TW (G) 3 2TW{G) n log^ n v? n\ogn Lagergreen (1996 Amir (2010 Bodlaender et al.^( 12013^ Table 1: Approximation guarantee and time eomplexity of state-of-the-art treewidth approximation algorithms. Eaeh algorithm provides a vertex ordering vr sueh that TWt,{G) is upper bounded by the approximation guarantee indieated in eolumn 2. The time eomplexity of these algorithms is is 0{f{TW{G)).g{n)) where n is the number of vertiees in G. The theoretieal results about the eomplexity and approximability of treewidth eomputation are interesting by the insight they give about the diffieulties of finding good, if not optimal, vertex ordering. They are also interesting in that they offer worst-ease guarantees, i.e., the approximation quality is guaranteed to be at least that promised by the algorithm. Furthermore, the inerease in eomputation time is also upper-bounded, allowing to get some guarantees that the approximation ean be obtained in “reasonable” time. However, the main drawbaek of these worst-ease based approaehes, is that they ean be domi¬ nated, empirieally, by heuristie approaehes, on most instanees. Indeed, several algorithms, working well in praetiee, even though without worst-ease eomplexity/quality bounds have been proposed. We deseribe some of these approaehes in the following seetion. 4.3 Treewidth in practice A broad elass of heuristie approaehes is that of greedy algorithms (Bodlaender and Koster, 20101. They use the same iterative approaeh as the variable elimination algorithm (Seetion]^ exeept that they manipulate the graph strueture only and do not perform any aetual eombination/elimination eomputation. Starting from an empty vertex ordering and an initial graph G, they repeatedly seleet the next vertex to add in the ordering by loeally optimizing one of the following eriteria: • seleet a vertex with minimum degree in the eurrent graph ; • seleet a vertex with minimum number of fill-in edges in the eurrent graph. 17 After each vertex selection, the current graph is modified by removing the selected vertex and making a clique on its neighbors. The new edges added by this clique are fill-in edges. A ver¬ tex with no fill-in edges is called a simplicial vertex. Fast implementations of minimum degree algorithms have been developed, see e.g., AMD ( [Amestoy et al.[ 19961 with time complexity in 0{nm) (Heggernes et al. 20011 for an input graph G with n vertices and m edges. The minimum fill-in heuristic tends to be slower to compute but yields slightly better treewidth approximations in practice. Moreover, it will find a perfect elimination ordering (i.e., adding no fill-in edges) if it exists, thus recognizing chordal graphs and it returns the optimal treewidth in this particular case ((this can be easily established from results in [Bodlaender et al.[ 2005). Notice that there exists linear time 0{n -I- m) algorithms to detect chordal graphs as the Maxi¬ mum Cardinality Search (MCS) greedy algorithm ( Tarjan and Yannakakis[ 19841 but the treewidth approximation they return is usually worse than the previous heuristic approaches. A simple way to improve the treewidth bound found by these greedy algorithms is to break ties for the selected criterion using a second criterion, such as minimum fill-in first and then maximum degree, or to break ties at random and to iterate on the resulting randomized algorithms as done in Kask et ar] ( |2011| ). We compared the mean treewidth bound found by these four approaches (minimum degree, minimum fill-in, MCS and randomized iterative minimum fill-in) on a set of five CSP and MRF benchmarks used as combinatorial optimization problems in various solver competitions. Par- ityLeaming is an optimization variant of the minimal disagreement parity CSP problem originally contributed to the DIMACS benchmark and used in the Minizinc challenge (Optimization Re search Group[, 20121. Linkage is a genetic linkage analysis benchmark (Elidan and Globerson 20101. GeomSurf and SceneDecomp are respectively geometric surface labeling and scene de¬ composition problems in computer vision ( Andres et al.[ |2013[ ). The number of instances per problem as well as their mean characteristics are given in TableResults are reported in Figure (Left).The randomized iterative minimum fill-in algorithm used a maximum of 30, 000 iterations or 180 seconds (respectively 10, 000 iterations and 60 seconds for ParityLeaming and Linkage), compared to a maximum of 0.37 second used by the non-iterative approaches. The minimum fill- in algorithm (using maximum degree for ties breaking) performed better than the other greedy approaches, being slightly improved by its randomized iterative version. Problem Type/Name Nb of instances Mean nb of vertices Mean nb of potential functions CSP/Parity Learning 7 659 1246 MRF/Linkage 22 917 1560 MRF/GeomSurf-3 300 505 2140 MRF/GeomSurf-7 300 505 2140 MRF/SceneDecomp 715 183 672 Table 2: Characteristics of the five optimization problems of the benchmark. For a given prob¬ lem, several instances are available, corresponding to differents number of variables (equal to the number of vertices in the underlying graph) and different numbers of potential functions. 18 On the same benchmark, we also compared three exact methods for the task of mode evaluation that exploit either minimum fill-in ordering or its randomized iterative version: variable elimina¬ tion (ELIM), BTD ( |de Givry et al.[ 20061 and AND/OR search (Marinescu and Dechter 20061. Elim and BTD exploit the minimum fill-in ordering while AND/OR search used its randomized iterative version. In addition, BTD and AND/OR Search exploit a tree decomposition during a Depth Eirst Branch and Bound method in order to get a good trade-off between memory space and search effort. As variable elimination, they have a worst-case time complexity exponential in the treewidth. All methods were allocated a maximum of 1 hour and 4 GB of RAM on an AMD Operon 6176 at 2.3 GHz. The results, as reported in Eigure (Right) shown that BTD was able to solve more problems than the two other methods for a fixed CPU time. However, on a given problem, the best method heavily depends on the problem category. On ParityEeaming, EEIM was the fastest method, but it ran out of memory on 83% of the total set of instances, while BTD (resp. AND/OR search) used less than 1.7 GB (resp. 4GB). The randomized iterative minimum fill-in heuristic used by AND/OR search in preprocessing consumed a fixed amount of time (~ 180 seconds, included in the CPU time measurements) larger than the cost of a simple minimum fill- in heuristic. BDT was faster than AND/OR search to solve most of the instances except on two problem categories (ParityEeaming and Einkage). To perform this comparison, we ran the followinf implementation of each method. The version of EEIM was the one implemented in the combinatorial optimization solver TOOLBAR 2.3 (options -i -T3, availableatmulcyber.toulouse.inra.fr/projects/toolbar). The ver¬ sion of BTD was the one implemented in the combinatorial optimization solver T0ULBAR2 0.9.7 (options -B=l -0=-3 -nopre. Toulbar2 is available at mulcyber . toulouse . inra . f r/ pro jects/toulbar2. This software won the UAI 2010 ([Elidan and Globersonl [20T0]) and 2014 (Gogate, 2014) Inference Competitions on the MAP task. AND/OR search was the version implemented in the open-source version 1.1.2 of DAOOPT (Otten et al. 20121 (options -y -1 35 --slsX=20 --slsT=10 --Ids 1 -m 4000 -t 30000 --orderTime=18 0forbench- marks from computer vision and -y -1 25 --slsX=10 --slsT=6 --Ids 1 -m 4000 -t 10000 --orderTime=60 for the other benchmarks) which won the Probabilistic Infer¬ ence Challenge 2011 ( Elidan and Globerson[ 2011), albeit with a different closed-source ver¬ sion (lOtten et al. 1 120121). 5 From Variable Elimination to Message Passing Message passing algorithms make use of messages, which can be described as potential functions which are external to the definition of graphical models. On tree-structured graphical models mes¬ sage passing algorithms extend the variable elimination algorithm by efficiently computing every marginals (or modes) simultaneously, when variable elimination only computes one. On general graphical models, message passing algorithms can still be applied but either provide approximate results efficiently or have an exponential running cost. We present how it may be conceptually interesting to view these algorithms as performing a re-parametrization of the original graphical model i.e., modifications of the potentials, instead of producing external messages, which are not easy to interpret by themselves. 19 Figure 6: Left: Comparison of treewidth bounds provided by MCS (red), minimum degree (green), minimum fill-in (blue) and randomized iteralive minimum fill-in (cyan) for the 5 cate¬ gories of problems Right: Mode evaluation by three exact methods exploiting minimum fill-in ordering or its randomized iterative version. Number of instances solved (x-axis) within a given CPU time (loglO scale y-axis) of Elim (red), BTD (green), and AND/OR SEARCH (blue). 5.1 Message passing and belief propagation Message passing algorithms over trees can be described as an extension of variable elimination, where the marginals of all variables are computed in a double pass of the algorithm (instead of one variable in classical variable elimination). Instead of eliminating a leaf Xi and the potential functions involving Xi, we just mark the leaf Xi as “processed" and consider that the new potential -0AT. is a “message” sent from Xi to Xpa(i) (the parents of Xi in the tree), denoted as ^i^pa{i)- This message is a potential function over Xpa(i) only. We can iterate this process, always applying it to a leaf in the subgraph defined by unmarked variables, handling already computed messages as unary potentials. When only one variable remains unmarked (defining the root of the tree), the combination of all the functions on this variable (messages and possibly original potential function) will be equal to the marginal unnormalized distribution on this variable. This results directly from the fact that the operations are equivalent to variable elimination. The root of the tree defines a directed tree where the root is at the top, descendants are below and messages are flowing upwards, to the root. To compute the marginal of another variable, one can redirect the tree using this new root. Then some subtrees will remain unchanged (in terms of direction from the root of the subtree to the leaves) in this new tree and the messages in these subtrees do not need to be recomputed. It turns out that in a tree, one can organize all these computations cleverly so that only two messages are computed for each edge, one for each possible direction of the edge. 20 Formally, in the Sum-product algorithm over a tree (V, E), messages are defined for eaeh edge (f, j) G E (there are 2\E\ sueh messages, one for eaeh edge direetion) in a leaves-to-root-to- leaves order. Messages pi^j are funetions of Xj, whieh are eomputed iteratively, by the following algorithm: 1. First, messages leaving the leaves of the tree are initialized: Vz G V, where z is a leaf of the tree, Vj, s.t. (z, j) G E,W{xi,Xj) G A* X Aj,pi^j{xj) ^ 1 Mark all leaves as proeessed. 2. Then, messages are sent upward through all edges. Message updates are performed itera¬ tively, from marked nodes z to their only unmarked neighbor j through edge (z, j) G E. Message updates take the following form: Wxj e Aj,pi^j{xj) ^ ( 7 ) where K = ^l:ij{x[, Xj)iJi{x',) Pk^i{x'i). Mark node j as processed. See Figure |7] for an illustration. 3. It remains to send the latter messages downward (from root to leaves). This seeond phase of message updates takes the following form: • f/nmark root node. • While there remains a marked node, send update (|7]) from an unmarked node to one of its marked neighbors, unmark the eorresponding neighbor. 4. After the two above steps, messages have been transmitted through all edges in both diree- tions. Finally, marginal distributions over variables and pairs of variables (linked by an edge) are eomputed as follows: Pi{Xi) ^ Pj^i{Xi),\/Xi G Ai, Pij{Xi,Xj) ^ ■j^'llJij{Xi,Xj) Y[kf^j,{k,i)&E dk^i{Xi) Y[l^i,{l,j)eE ' Ki and K 2 are suitable normalizing eonstants. When the graph of the original graphieal model is not a tree, the two-pass message passing al¬ gorithm ean no more be applied. Still, for general graphieal models, this message passing approaeh ean be generalized in two different ways. One ean eompute a tree deeomposition, as previously shown. Message passing ean then be applied on the resulting eluster tree, handling eaeh eluster as a eross-produet of variables following a bloek-by-bloek approaeh. This yields an exaet algorithm, for whieh eomputa- tions ean be expensive (exponential in the treewidth) and spaee intensive (exponential in the separator size). A typieal example of sueh algorithm is the algebraie exaet message passing algorithm of [Shafer and Sheno^ (|1988|) ; jShenoy and Shaferj (|1990|) . 21 Figure 7: Example of message update on a tree. In this example, nodes t, v and w are marked, while node s is still unmarked, fit^s is a funetion of all the ineoming messages to node t, exeept Alternatively, the Loopy Belief Propagation algorithm ( Frey and Mae Kay] 1998| ) is another extension of Message Passing in which messages updates are repeated, in arbitrary order through all edges (possibly many times through each edge), until a termination condition is met. The algorithm returns approximations of the marginal probabilities (over variables and pairs of variables). The quality of the approximation and the convergence to steady state messages are not guaranteed (hence, the importance of the termination condition). However, it has been observed that EBP often provides good estimates of the marginals, in practice. A deeper analysis of message-passing algorithms will be provided in Section]^ We have described above the Sum-product algorithm. Max-product, Max-sum algorithms can be equivalently defined, for exact computation or approximation of the max-marginal of a joint distribution or its logarithm. In algebraic language, updates like defined in formula (|7]) take the general form: yxj e = ®iJij{x[,Xj)ilJi{x'i} © Hk^i{x'i). As for sum-product, the resulting algorithm computes exact ©-marginals on a tree-structured graphical model from which the mode of the distribution can be computed while on general graph¬ ical models, it provides only approximations. 5.2 Message Passing and Re-parametrization It is possible to use message passing on trees as a re-parametrization technique. Instead of com¬ puting external messages, message passing can reformulate the original tree-structured graphical model in a new equivalent tree-structured graphical model. By “equivalent” we mean that the resulting tree defines exactly the same joint distribution as the original graphical model. In the re-parameterized problem, information of interest (marginals) can be directly read in the potential functions (IKoller and Friedmanl 120091). 22 The idea behind re-parametrization is eoneeptually very simple: when a message is eom- puted, instead of keeping it as a message, it is possible to multiply any potential funetion involving Xj by using ©. To preserve the joint distribution defined by the graphieal model, we need to divide another potential funetion involving Xj by the same message using the inverse of ©j^ One possibility is to ineorporate the messages in the binary potentials: we replaee V’ij by ^jJij © ^i^j © while ijji is divided by {ij^i and V'j is divided by ^i^j. In this ease, eaeh pairwise potential ean be shown to be equal to the marginal of the joint potential on {Xj, Xj}. The resulting tree-struetured MRF is said to be calibrated to emphasize the faet that all pairs of binary potentials sharing a eommon variable agree on their marginals: © il^ij = © iJik X, The main advantage of a ealibrated re-parametrization is that it ean be used instead for the original model for any further proeessing. This is useful in the eontext of ineremental updates, where new evidenee is introdueed inerementally and eaeh reealibration is simpler than a new eali- bration ( |Koller and Friedmanj |2009| ) . Message passing based re-parameterizations ean be generalized to eyelie graphs. If an exaet approaeh using tree deeompositions is followed, messages may have a size exponential in the interseetion of pairs of elusters and the re-parametrization will ereate new potentials of this size. If these messages are multiplied inside the elusters, eaeh resulting eluster will be the marginal of the joint distribution on the eluster variables. The tree-deeomposition is ealibrated and any two interseeting elusters agree on their marginals. This is exploited in the Lauritzen-Spiegelhalter and Jensen sum-produet-divide algorithms ( Lauritzen and Spiegelhalter(|1988[ Jensen et'aL| 1990| ). In this eontext, besides ineremental updates, a ealibrated tree deeomposition allows also to loeally eompute exaet marginals for any set of variables in the same eluster. If a loeal “loopy” approaeh is used instead, re-parameterizations do not ehange seopes but pro¬ vide a re-parameterized model where estimates of the marginals ean be direetly read. For MAP, sueh re-parameterizations ean follow elever update rules to provide eonvergent re-parameterizations maximizing a well defined eriterion. Typieal examples of this sehema are the sequential version of the tree reweighted algorithm (TRWS, Kolmogorov[ , 20061, or the Max Produet Linear Program¬ ming algorithm (MPLP, |Globerson and Jaakkola[ 20081 whieh try to optimize a bound on the non- normalized probability of the mode. A seminal referenee, published in Russian is ( Sehlesinger[ 1976|). These algorithms ean be exaet on graphieal models with loops, provided the potential funetions are all submodular (often deseribed as the diserete version of eonvexity). Deterministic graphical models : message passing algorithms have also been used in deter- ministie graphieal models where they are known as “loeal eonsisteney” enforeing or eonstraint propagation algorithms. A loeal eonsisteney property defines the targeted ealibration property and 'Zeros in potential can be dealt with by a proper extension of the algebraic operations, including an inverse for zero. If the algebraic structure equipped with © is not a group but only a semi-group or monoid, suitable pseudo inverses can often be defined. See (Cooper and Schiex 2004 Gondran and Minoux 20081. 23 the enforcing algorithm allows to transform the original network into an equivalent network (defin¬ ing the same joint function) that satisfies the desired calibration/local consistency property. Similar to LBP, Arc Consistency |Waltz| ( |1972| ); [Rossi et al.| ( |2006[ ) is the most usual form of local consis¬ tency and is related to Unit Propagation in SAT Biere et al. (2009). Arc consistency is exact on trees and is usually incrementally maintained during an exact tree search, using re-parametrization. Because of the idempotency of logical operators, local consistencies always converge to a unique fix-point. Local consistency properties and algorithms for the Weighted CSP are very closely related to message passing for MAP. They however are always convergent, thanks to suitable calibration properties ( |Schiex[ |2000[ [Cooper and Schiex] [2004[ [Cooper et al.[ [2010[ ) and may also solve tree structured or fully submodular problems. 6 Heuristics and approximations for inference We mainly discussed methods for exact inference in graphical models. They are useful if an or¬ der for variable elimination with small treewidth is available. In real life applications, interaction network are seldom tree-shaped, and their treewidth can be large (e.g. a grid of pixel in image analysis) and exact methods cannot be applied anymore. However, they are starting points to derive heuristic methods for inference that can be applied to any graphical model. By heuristic method, we mean an algorithm that is (a priori) not derived from the optimization of a particular criterion, as opposed to what we will call approximation methods. Nevertheless, we shall alleviate this distinction and show that good performing heursitics can sometimes be interpreted as approx¬ imate methods. For the marginalization task, the most widespread heuristics derived from vari¬ able elimination and message passing principles is the Loopy Belief Propagation algorithm (LBP, Kschischang et al.[ [2001b described in Section [5.1[ and numerous extensions (e.g. Generalized BP, Yedidia et al. 20051 have been proposed since then. In the last decade, a better understanding of these heuristics has been reached and they can now be reinterpreted as particular instances of variational approximation methods Wainwright and Jordan (2008). A variational approximation of a distribution p is defined as the best approximation of p in a class Q of tractable distributions (for inference), according to the Kiillback-Leibler divergence. Depending of the application (e.g. discrete or continuous variables), several choices for Q have been considered. We are apparently far from variable elimination principles and treewidth issues. However, as we just emphasized, LBP can be cast in the variational framework. The treewidth of the chosen variational distribution depends on the nature of the variables: i) in the case of discrete variables the treewidth is low: the class Q is in the majority of cases that of independent variables (mean field approximation), with associated treewidth of 0, and some works consider a class Q with associated treewidth of 1; ii) in the case of continuous variables, the treewidth of the variational distribution is the same as in the original model: Q is in general a class of multivariate Gaussian distributions, for which numerous inference tools are available. We will illustrate these remarks in Section [^ Before that in the remainder of this section, we recall the two key component for a variational approximation method: the Kiillback-Leibler divergence and the choice of a class of tractable distributions. We also explain how LBP can be interpreted as a variational approximation method. 24 6.1 Variational approximations The Kiillback-Leibler divergence KL{q\\p) = q{x) log ^ measures the dissimilarity between two probability distributions p and q. KL is positive, and it is null if and only if p and q are equal. Let us consider now that q is constrained to belong to a family Q which does not include p. The solution q* of aigming^Q KL{p\\q) is then the best approximation of p in Q according to diver¬ gence KL. If Q is a set of distributions tractable for inference, marginals, mode or normalizing constant of q* can be used as approximations of the same quantities on p. For example, let us consider a binary Potts model on n vertices whose joint distribution is p(a:) = ^ n + X] hjXiXj) i We can derive its so called mean field approximation, corresponding to the class of fully factorized distributions (i.e. an associated graph of treewidth equal to 0): = {q s.t. q{x) = Since variables are binary corresponds to joint distributions of independent Bernoulli variables with respective parameters g* = qi{l), namely for all q in q{x) = ~ qiY~^\ The optimal approximation (in terms of Kiillback-Leibler divergence) within this class of distributions is characterized by the set of g/s which minimize KL{q\\p). Denoting Eg the expectation with respect to g, KL{q\\p) — log Z is Eg I ^ [Xi logg* -f (1 - Xi) log(l - g*)] - ^ ^ hjXiXj | \ * * (*j)6S / = [qi log g* + (1 - qi) log(l - qi)] - Y “ Y i i This expectation has a simple form because of the specific structure of g. Minimizing it with respect to g* gives the fixed-point relation that each optimal gf^^’s must satisfy: log =ai+ Y bijqf^. j:{i,j)eE leading to Qi ME qY^ is equal the conditional probability that Xj = 1 given that all other variables are fixed to their mean field expected values under distribution g, which explain the name of mean field approxima¬ tion. Note that in general g* is not equal to the marginal p(Xj = 1). The choice of the class Q is indeed a critical trade-off with opposite desirable properties: it must be large enough to guarantee a good approximation and small enough to contain only man¬ ageable distributions. We will focus in the next section on a particular choice for Q, the Bethe class, that will enable to link the LBP heuristics to variational methods. Other choices are possible 25 and have been used. For instanee, the Chow-Liu algorithm (Chow and Liu 1968) eomputes the minimum of KL{q\ \p) for q a distribution whose assoeiated graph is a spanning tree of the graph of p. This amounts to eomputing the best approximation of p among graphieal models with treewidth equal to 1. In the struetured mean field approximation ( Ghahramani and Jordan^ 1997t Wain wright and Jordanf |2008| ) the distribution of a faetorial Hidden Markov Model is approximated in a variational approaeh: the multivariate hidden state is deeoupled and the variational distribution of the eonditional distribution of hidden states is that of independent Markov ehains (here again, the treewidth is equal to 1). Finally, an alternative to treewidth reduetion is to ehoose the varia¬ tional approximation in the class of exponential distributions This has been applied for Gaussian process classification (Kim and Ghahramani[ 2006) using a multivariate Gaussian approximation of the posterior distribution of the hidden field. This relies on the use of the EP algorithm (Minka 20011. Note that in this algorithm, KL{p\\q) is minimized instead of KL{q\\p). 6.2 LBP heuristics as a variational method The mean field approximation is the most naive approximation among the so-called Kikuchi approximations from statistical mechanics, also known as Cluster Variational Methods (CVM, Kikuchi 1951| ). Originally, they are not defined by a minimization of the Kiillback-Leibler diver¬ gence, but as an approximation of the minimum of the free energy H{q), W(9) = -E q{x) log n -f ^g(x) logg(x). X BgB X The two problems are equivalent since H(q) is equal to KL(q\\p) — \ogZ and is minimum when p = q.lfp and q are pairwise MRF whose associated graph G = (V, E) is the same and is a tree, then q{x) = > where {q{xi,Xj)} and {q{xi)} coherent sets of order 1 and order 2 marginals of q, and di is the degree of vertex i in the tree. In this particular case the Bethe free energy, defined as H{q) is expressed as (see Heskes et'H^ |2004[ [Yedidia et akf 20051 H{q) = - q{xi,Xj) \ogi>{xi,Xj) - LL q{xi) \ogi>{xi) iev Xi E E q{xi,Xj)\ogq{xi,Xj) + E(“'<-i) E q{xi) logg(a;i) (* Xi,Xj i&V The Bethe approximation consists in applying to an arbitrary graphical model the same formula of the free energy than for a tree and minimizing it over the variables {q{xi, Xj)} and {q{xi)} under the constraint that they are probability distributions and that q{xi) is the marginal of q{xi, Xj). By extension the Bethe approximation can be interpreted as a variational method associated to the with family of unnormalized distributions that can be expressed as q{x) = {q{xi, Xj)} and {q{xi)} coherent sets of order 1 and order 2 marginals. Yli^V It has been established by Yedidia et al. (20051 that the fixed points of LBP (when they exist. convergence is still not well understood, see |Weiss| ( |2000| ) and |Mooij and Kappen] ( |2007| )) are stationary points of the problem of minimizing the Bethe free energy, or equivalently KL{q\\p) with q in the class of distributions. 26 Furthermore Yedidia et al. ( |2005 ) showed that for any elass of distributions Q eorresponding to a partieular CVM method, it is possible to define a generalized BP algorithm whose fixed points are stationary points of the problem of minimizing KL{q\\p) in Q. The drawbaek of the LBP algorithm and its extensions (Yedidia et al4 2005) it that they are not assoeiated with any theoretieal bound on the error made on the marginals approximations. Nevertheless, this algorithm is inereasingly used for inferenee in graphieal models for its good behavior in praetiee ( [Murphy et ah , 19991. It is implemented in several pieees of software for inferenee in graphieal models like libDAI ( Mooij[[2010| ) or OpenGM2 (Andres et al. 2012). 7 Illustration with coupled HMM We now illustrate how the the proeedures we have deseribed have been used for parameter es¬ timation in a elaborated example: eoupled HMM. Consider a set of / signals observed at times t G {1,... T} and denote 01 the variable eorresponding to the observed signal i at time t. HMM models assume that the distribution of eaeh 01 is eonditional to some hidden state HI, where the series {Hl)t=i^...T is a Markov ehain. Coupled HMM further assumes that the hidden states display a eorrelation at eaeh time (see jJordEi] , 2004[ Wainwright and Jordan 20081, resulting in the graph¬ ieal model displayed in Figure]^ Sueh models have been eonsidered in a series of domains sueh as bioinformaties ( Choi et ^[2013 1, eleetroeneephalogram analysis ( jZhong and Chosh 2002) or speeeh reeognition ( Noek and Ostendorf[|2003 ). More eomplex versions are sometimes eonsid¬ ered, assuming dependeney between two times series at two consecutive time steps. For this model, the joint distribution of the hidden variables H = and observed variables O = {ODi^t fac¬ torizes as p{h,o)QC h]) X ( 8 ) where ht = {hl)i stands the vector of all hidden variables at time t and where -0^ encodes the Markovian dependency of the hidden variables within a series, encodes the coupling between the hidden variables of all series at a given time and encodes the emission of the observed signal given the corresponding hidden state. A fairly comprehensive exploration of these models can be found in ( Murphy [|2002[ ). 7.1 Exact EM algorithm for coupled HMM Coupled HMM are examples of incomplete data models, as they involve variables (O, H) where only the variables O are observed. Maximum likelihood inference for such a model aims at finding the value of the parameter 6 that maximizes the (log-)likelihood of the observed data o, that is to solve maxelogp® The most popular algorithm to achieve this task is the EM algorithm. ([Dempster et al.[ 1977), that can be rephrased in the following way. Observe that logp®(o) = E{\ogp^{o,H)\o) — E {log p^{H\o)\o) = maxF(6*,g), <? 27 Figure 8: Graphical representation of p{h, 6) for a coupled HMM where F{e,q) = E,i\ogp\o,H)) - E,{\ogq{H)) = \ogp\o) - KL{q{H)\\p\H\o)), and q stands for any distribution on the hidden variables H, E stands for the expectation under the true distribution p and Eg under the arbitrary distribution q. The EM algorithm consists in alternatively maximizing E{9, q) with respect to q (E-step) and to 9 (M-step). The solution of the E-step is q{h) = p{h\o) since the Kullback-Leibler divergence is minimal (and null) in this case. Exact computation of p{h\o) can be performed by observing that ([^ can be rewritten as p{h,o) Qc ht)^ X where encodes both the Markovian dependency and the coupling of the hidden variables within a given time step. This writing is equivalent to merging all hidden variables of a given time step and corresponds to the graphical model given in Figure Denoting K the number of possible values for each hidden variables, we end up with a regular hidden Markov model with possible hidden states. Both p{h\o) (and its mode) can then be computed in a exact manner with either the forward-backward recursion or the Viterbi algorithm, which have the same complexity: 0(T^^). The exact calculation can therefore be achieved provided that remains small enough, but becomes intractable when the number of signals / exceeds few tens. 7.2 Approximate E step for the EM algorithm A first approach to derive an approximate E step it to seek for a variational approximation of p{h\o) assuming that q{h) is restricted to a family Q of tractable distributions, as described in Section 6.1[ This approach results in the maximization of a lower bound of the original log-likelihood. The choice of Q is critical and is a balance between approximation accuracy and computation efficiency. Choosing Q typically amounts to breaking down some dependencies in the original 28 Figure 9: Graphical representation of p(/i, o) for a coupled HMM when merging hidden variables at each time step distribution to end up with some tractable distribution. In the case of coupled HMM, the simplest distribution is the class of fully factorized distributions (i.e. mean field approximation), that is Qo = {q- q{h) = Y[Y[qit{hl)}. i t Such an approximation corresponds to the graphical model of Figure [T^ Intuitively, this approxi¬ mation replaces the stochastic influence existing between the hidden variables by its mean value. Figure 10: Graphical representation for the independent mean-field approximation of p{h, o) in a coupled HMM. Observed variables are indicated in light gray since they are not part of the variational distribution which is a distribution only on the hidden variables. As suggested in |Wainwright and Jordan] ( |2008[ ), a less drastic approximation can be obtained using the distribution family of independent heterogeneous Markov chains: Qm = {g : q{h) = \\_W_qit{.h\\h\_^)} i t which is consistent with the graphical representation of independent HMM, as depicted in Fig¬ ure [HI An alternative to the approximate maximization of max^ F(9, q) consists in seeking for the maximum of an approximation F{9,q) of F{9,q) which involves only marginals of p{h\o) on 29 Figure 11: Graphical representation for the independent Markov approximation of p(/i, o) in a cou¬ pled HMM. Observed variables are indicated in light gray since they are not part of the variational distribution which is a distribution only on the hidden variables. subsets of variables in H of limited size. Then the LBP algorithm can be used to provide an approximation of these marginals. This approach has been proposed in Heskes et aH ( 2004] ) where the authors approximated the negative entropy term Eq{\og q{H)) in F{9, q) by its so-called Bethe approximation as follows (the first term in F by definition depends only on marginals of variables involved in the potential functions ^ 'ip'", E E ft’,) log ftj) + E'i^('>*) iog9^('!<) it t + EE {hi, ol) log q^{h],o]) - EE Iq{hl) log q{hl) it it because each hidden variable F[l has degree J -f 1 in the original graphical model given in Figure The advantage of this approach compared to the variational approximations based on fa mi lies Qo and Qm is that it provides an approximation of the joint conditional distribution of all hidden variables within the same time step. 8 Conclusion This tutorial on variable elimination for exact and approximate inference is an introduction to the basic concepts of variable elimination, message passing and their links with variational methods. It introduces these fields to statisticians confronted with inference in graphical models. The main message is that exact inference should not be systematically considered as out of reach. Before looking for an efficient approximate method, a wise advice would be to know the treewidth of the graphical model. In practice this question is not easy to answer exactly but several implementations of the presented algorithms exist and provide an upper bound of the treewidth. Obviously this tutorial is not exhaustive, since we have chosen to focus on the fundamental concepts. While many important results on treewidth and graphical models have several decades 30 in age, the area is still lively and we now broaden our diseussion to a few reeent works whieh taekle some ehallenges related to the eomputation of the treewidth. Beeause they offer effieient algorithms, graphieal models with a bounded treewidth offer an attraetive target when the aim is to learn a model that best represents some given sample. In (Kumar and Baeh 20121, the problem of learning the strueture of an undireeted graphieal model with bounded treewidth is approximated by a eonvex optimization problem. The resulting algorithm has a polynomial time eomplexity. As diseussed in ( Kumar and Baeh[ 2012), this algorithm is useful to derive traetable eandidate distributions in a variational approaeh, enabling to go beyond the usual variational distributions with treewidth zero or 1. For optimization (MAP), other exaet teehniques are offered by tree seareh algorithms sueh as Braneh and Bound ( [Lawler and Wood| |1966| ), that reeursively eonsider possible eonditioning of variables. These teehniques often exploit limited variable elimination proeessing to prevent exhaustive seareh, either using message-passing like algorithms (Cooper et al. 2010) to eompute bounds that ean be used for pruning, or by performing “on-the-fly” elimination of variables with small degree ( |Larrosa[ [2000[ ). Beyond pairwise potential funetions, the time needed for simple update rules of message pass¬ ing beeomes exponential in the size of the seope of the potential funetions. However, for speeifie potential funetions involving many (or all) variables, exaet messages ean be eomputed in reason¬ able time, even in the eontext of eonvergent message passing for optimization. This ean be done using poly time graph optimization algorithms sueh as shortest path or mineost flow algorithms. Sueh funetions are known as global potential funetions ( Vieente et al.[ [2008 Wemer[ 2008) in stoehastie graphieal models, and as global eost funetions ( Lee and Leung , [2009 Allouehe et al. 2012 Lee and Leung[ [20T2 ) in deterministie Cost Funetion Networks. Different problems appears with eontinuous variables, where eounting requires integration of funetions. Here again, for speeifie families of distributions, exaet (analytie) eomputations ean be obtained for distributions with eonjugate distributions. For message passing, several solutions have been proposed. For instanee, a reeent message passing seheme proposed by |Noorshams and Wain wright ( [2013[ ) relies on the eombination of orthogonal series approximation of the messages, and the use of stoehastie updates. We refer the reader to referenees in (Noorshams and Wainwrightj 20131 for a state-of-the-art of alternative methods dealing with eontinuous variables message pass¬ ing. Finally, we have exeluded Monte-Carlo methods from the seope of our review. But reeent sampling algorithms have been proposed that use exaet optimization algorithms to sample points with high probability in the eontext of estimating the partition funetion. Additional eontrol in the sampling method is needed to avoid biased estimations: this may be hashing funetions enforeing a fair sampling (Ermon et al. 2014) or randomly perturbed potential funetions using a suitable noise distribution (Hazan et al. 2013). We hope this review will enable more cross-fertili z ations of this sort, combining statistics and computer science, stochastic and deterministic algorithms for inference in graphical models. 31 References D. Allouche, C. Bessiere, P. Boizumault, S. de Givry, P. Gutierrez, S. Loudni, J.P Metivier, and T. Sehiex. Deeomposing Global Cost Funetions. In Proceedings ofAAAI’ll, pages 407-413, 2012 . P Amestoy, T. A. Davis, and I. S. Duff. An approximate minimum degree ordering algorithm. SIAM Journal on Matrix Analysis and Applications, 17(4):886-905, 1996. E. Amir. Approximation algorithms for treewidth. Algorithmica, 56:448-479, 2010. B. Andres, Beier T., and J. H. Kappes. OpenGM: A C++ library for diserete graphieal models. ArXiv e-prints, 2012. URL http://hci.iwr.uni-heidelberg.de/opengm2/, Bjoem Andres, Thorsten Beier, and Joerg H. Kappes. Open gm benehmark - evpr’2013 seetion. See http : / / hci . iwr . uni-heidelberg. de/opengm2 / ? 10=benchmark, 2013. S. Amborg. Effieient algorithms for eombinatorial problems on graphs with bounded deeompos- ability — a survey. BIT, 25:2-23, 1985. S. Arnborg, D. G. Cornell, and A. Proskurowski. Complexity of finding embeddings in a fc-tree. SIAM J. Algebraic Discrete Methods, 8:277-284, 1987. P. Austria, T. Pitassi, and Y. Wu. Inapproximability of treewidth, one-shot pebbling, and related layout problems. In International Workshop on Approximation Algorithms for Combinatorial Optimization Problems (APPROX),, pages 13-24, Boston, USA, 2012. U. Bertele and F. Brioshi. Nonserial Dynamic Programming. Aeademie Press, 1972. Armin Biere, Marijn Heule, and Hans van Maaren. Handbook of satisfiability, volume 185. los press, 2009. C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006. S. Bistarelli, U. Montanari, and F. Rossi. Semiring based eonstraint solving and optimization. Journal of the ACM, 44(2):201-236, 1997. H. F. Bodlaender. A tourist guide through treewidth. Developments in Theoretical Computer Science, 1, 1994. H. F. Bodlaender. A linear time algorithm for finding tree-deeompositions of small treewidth. SIAM Journal on Computing, 25:1305-1317, 1996. H. F. Bodlaender and A. M. C. A. Koster. Treewidth eomputations I. upper bounds. Information and Computation, 208(3):259-275, 2010. H. F. Bodlaender, J. R. Gilbert, H. Hafsteinsson, and T. Kloks. Approximating treewidth, path- width, frontsize, and shortest elimination tree. Journal of Algorithms, 18:238-255, 1995. 32 H. L. Bodlaender, A. Koster, and F. van den Eijkhof. Preprocessing rules for triangulation of probabilistic networks. Computational Intelligence, 21(3):286-305, 2005. H. L. Bodlaender, P. Drange, M. S. Dregi, F. V. Fomin, D. Lokshtanov, and M. Pilipczuk. A c^n 5-approximation algorithm for treewidth. In IEEE Symposium on Eoundations of Computer Science, pages 499-508, 2013. Hans L Bodlaender. A partial k-arboretum of graphs with bounded treewidth. Theoretical computer science, 209(1): 1-45, 1998. G. Casella and E. I. George. Explaining the Gibbs sampler. The American Statistician, 46(3): 167-174, 1992. H. Choi, D. Eermin, A. Nesvizhskii, D. Ghosh, and Z. Qin. Sparsely correlated hidden Markov models with application to genome-wide location studies. Bioinformatics, 29(5):533-541, 2013. C. K. Chow and C.N. Eiu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14(3):462-467, 1968. S.A. Cook. The complexity of theorem proving procedures. In 3’’'^ ACM symp. on theory of computing, pages I5I-I58, 1971. M C. Cooper. Cyclic consistency: a local reduction operation for binary valued constraints. Artifi¬ cial Intelligence, 155(l-2):69-92, 2004. M C. Cooper and T. Schiex. Arc consistency for soft constraints. Artificial Intelligence, 154(1-2): 199-227, 2004. M. C. Cooper, S. de Givry, M. Sanchez, T. Schiex, M. Zytnicki, and T. Werner. Soft arc consistency revisited. Artificial Intelligence, 174:449-478, 2010. M. Davis and H. Putnam. A computing procedure for quantification theory. Journal of the ACM, 7(3):210-215, 1960. S. de Givry, T. Schiex, and G. Verfaillie. Exploiting Tree Decomposition and Soft Eocal Consis¬ tency in Weighted CSP. In Proceedings of the National Conference on Artificial Intelligence, AAAI-2006, pages 22-27, 2006. R. Dechter. Bucket elimination: A unifying framework for reasoning. Artificial Intelligence, 113 (l-2):41-85, 1999. R. Dechter and J. Pearl. Network-based heuristics for constraint satisfaction problems. In E. Kanal and V. Kumar, editors. Search in Artificial Intelligence, chapter II, pages 370-425. Springer- Verlag, 1988. A. P. Dempster, N. M. Eaird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39:1-38, 1977. R. J. Duffin. Topology of series-parallel networks. Journal of Mathematical Analysis and Appli¬ cation, 10(2):303-313, 1965. 33 G. Elidan and A. Globerson. UAI inference challenge 2010. See www.cs.huji.ac.il/ pro ject/UAIlO, 2010. G. Elidan and A. Globerson. The probabilistic inference challenge. See http://www.es. hu j i . ac. il/pro ject/PASCAL/ index . php, 2011. S. Ermon, C. Gomes, A. Sabharwal, and B. Selman. Eow-density parity constraints for hashing- based discrete integration. In Proceedings of the 31st International Conference on Machine Learning, pages 271-279, 2014. E. V. Eomin and Y. Villanger. Treewidth computation and extremal combinatorics. Combinatorica, 32(3):289-308, 2012. J. Eourier. Memoires de I’Academie des sciences de I’Institut de France 7, chapter Histoire de r Academic, partie mathematique (1824). Gauthier-Villars., 1827. B. Erey and D. MacKay. A revolution: Belief propagation in graphs with cycles. In Advances in Neural Information Processing Systems, pages 479-485. MIT Press, 1998. N. Eriel, A. N. Pettitt, R. Reeves, and E. Wit. Bayesian inference in hidden Markov random fields for binary data defined on large lattices. Journal of Computational and Graphical Statistics, 18: 243-261, 2009. D. Eulkerson and O. Gross. Incidence matrices and interval graphs. Pacific journal of mathematics, 15(3):835-855, 1965. Z. Ghahramani and M. Jordan. Eactorial hidden Markov models. Machine learning, 29(2-3): 245-273, 1997. A; Globerson and T. Jaakkola. Eixing max-product: Convergent message passing algorithms for map Ip-relaxations. In Advances in Neural Information Processing Systems, pages 553-560, 2008. V. Gogate. UAI 2014 inference competition. Seewww.hlt.utdallas.edu/~vgogate/ uail4-competition, 2014. M. Gondran and M. Minoux. Graphs, Dioids and Semirings, volume 41 of Operations Re¬ search/Computer Science Interfaces Series. Springer, 2008. D. J. Gordon, N. J.and Salmond and A. E. M. Smith. Novel approach to nonlinear/non-gaussian bayesian state estimation. lEE Proceedings on Radar and Signal Processing, 140(2): 107-113, 1993. T. Kazan, S. Maji, and T. Jaakkola. On sampling from the Gibbs distribution with random maxi¬ mum a-posteriori perturbations. In Advances in Neural Information Processing Systems, pages 1268-1276, 2013. P. Heggernes, S. Eisenstat, G. Kumfert, and A. Pothen. The computational complexity of the minimum degree algorithm. In 14th Norwegian Computer Science Conference, Troms, Norway, 2001 . 34 T. Heskes, O. Zoeter, and W. Wiegerinck. Approximate expectation maximization. Advances in Neural Information Processing Systems, 16:353-360, 2004. F. Jensen, K. Olesen, and S. Andersen. An algebra of bayesian belief universes for knowledge- based systems. Networks, 20(5):637-659, 1990. F. V. Jensen and T. D. Nielsen. Bayesian Networks and Decision Graphs. Springer Publishing Company, Incorporated, 2nd edition, 2007. M. Jordan. Graphical models. Statistical Science, pages 140-155, 2004. K. Kask, A. Gelfand, L. Otten, and R. Dechter. Pushing the power of stochastic greedy ordering schemes for inference in graphical models. In AAAI, 2011. R. Kikuchi. A theory of cooperative phenomena. Physical Review, 81(6):988-1003, 1951. H.C. Kim and Z. Ghahramani. Bayesian gaussian process classification with the EM-EP algorithm. IEEE Transactions on Pattern Analyis and Machine Intelligence, 28(12): 1948-1959, 2006. J. Kohlas. Information algebras: Generic structures for inference. Springer Science & Business Media, 2003. D. Roller and N. Eriedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009. V. Kolmogorov. Convergent tree-reweighted message passing for energy minimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10): 1568-1583, 2006. E R. Kschischang, B. J. Erey, and H-A. Eoeliger. Eactor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2):498 -519, 2001. K. S. Sesh Kumar and E. Bach. Convex relaxations for learning bounded treewidth decomposable graphs. In Proceedings of the International Conference on Machine Learning (ICML), Atlanta, United States, 2012. J. Eagergreen. Efficient parallel algorithms for graphs of bounded treewidth. Journal of Algo¬ rithms, 20:20-44, 1996. J. Earrosa. Boosting search with variable elimination. In Principles and Practice of Constraint Programming - CP 2000, volume 1894 of LNCS, pages 291-305, Singapore, September 2000. S. E. Eauritzen. Graphical Models. Clarendon Press, 1996. S.E. Eauritzen and D.J. Spiegelhalter. Eocal computations with probabilities on graphical struc¬ tures and their application to expert systems. Journal of the Royal Statistical Society - Series B, 50:157-224, 1988. E. Eawler and D. Wood. Branch-and-bound methods: A survey. Operations Research, 14(4): 699-719, 1966. 35 J. Lee and K. L. Leung. Towards effieient eonsisteney enforeement for global eonstraints in weighted eonstraint satisfaetion. In International Conference on Artificial Intelligence, vol¬ ume 9, pages 559-565, 2009. J. H. M. Lee and K. L. Leung. Consisteney Teehniques for Global Cost Funetions in Weighted Constraint Satisfaetion. Journal of Artificial Intelligence Research, 43:257-292, 2012. S. Z. Li. Markov random field modeling in image analysis. Springer-Verlag, 2001. J. W. H. Liu. The multifrontal method for sparse matrix solution: Theory and praetiee. SIAM Review, 34:82-109, 1992. First papers on the multifrontal teehnique go baek to 1983. L. Lovasz. Graph minor theory. Bulletin of the American Mathematical Society, 43:75-86, 2005. R. Marineseu and R. Deehter. Memory intensive braneh-and-bound seareh for graphieal models. In proceedings of the National Conference on Artificial Intelligence, AAAI-2006, pages 1200- 1205, 2006. T. Minka. A family of algorithms for approximate Bayesian inference. PhD thesis, MIT, 2001. J. M. Mooij. libDAI: A free and open souree C-i-i- library for diserete approximate inferenee in graphieal models. Journal of Machine Learning Research, 11:2169-2173, August 2010. URL https://staff.fnwi.uva.nl/j.m.mooij/libDAI/, J. M. Mooij and H. J. Kappen. Suffieient eonditions for eonvergenee of the sum-produet algorithm. IEEE Transactions on Information Theory, 53(12):4422-4437, 2007. K. Murphy. Dynamic bayesian networks: representation, inference and learning. PhD thesis. University of California, Berkeley, 2002. K. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012. K. Murphy, Y. Weiss, and M. Jordan. Loopy belief propagation for approximate inferenee: An empirieal study. In Proceedings of the 15th conference on Uncertainty in Artificial Intelligence, pages 467-475, 1999. H. Noek and M. Ostendorf. Parameter reduetion sehemes for loosely eoupled HMMs. Computer Speech & Language, 17(2):233-262, 2003. N. Noorshams and M. J. Wainwright. Belief propagation for eontinuous state spaees: stoehastie message-passing with quantitative guarantees. Journal of Machine Learning Research, 14(1): 2799-2835, 2013. NICTA Optimization Researeh Group. Minizine ehallenge 2012. See http://www. minizinc.org/challenge2012/challenge.html, 2012. L. Otten, A. Ihler, K. Kask, and R. Deehter. Winning the PASCAL 2011 MAP ehallenge with enhaneed AND/OR braneh-and-bound. In NIPS DISCML Workshop, Lake Tahoe, USA, 2012. 36 J. Pearl. Probabilistic Reasoning in Intelligent Systems, Networks of Plausible Inference. Morgan Kaufmann, Palo Alto, 1988. L. Perkovic and B. Reed. An improved algorithm for finding tree decompositions of small treewidth. InternationalJournal of Foundations of Computer Science, 11:365-371, 2000. C. Pralet, G. Verfaillie, and T. Schiex. An algebraic graphical model for decision with uncertainties, feasibilities, and utilities. Journal of Artificial Intelligence Research, pages 421-489, 2007. L. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257-286, 1989. R. Reeves and A. N. Pettitt. Efficient recursions for general factorisable models. Biometrika, 91: 751-757, 2004. C.P Robert and G. Casella. Monte Carlo Statistical Methods. Springer- Verlag, New York, 2004. N. Robertson and P. Seymour. Graph minors, ii. algorithmic aspects of tree-width. Journal of Algorithms, 7(3):309-322, 1986. F. Rossi, P. van Beek, and T. Walsh, editors. Handbook of Constraint Programming. Elsevier, 2006. T. Schiex. Arc consistency for soft constraints. In Principles and Practice of Constraint Program¬ ming - CP 2000, volume 1894 of LNCS, pages 411-424, Singapore, September 2000. T. Schiex, H. Fargier, and G. Verfaillie. Valued constraint satisfaction problems: hard and easy problems. In Proc. of the IJCAI, pages 631-637, Montreal, Canada, August 1995. M. I. Schlesinger. Sintaksicheskiy analiz dvumernykh zritelnikh signalov v usloviyakh pomekh (Syntactic analysis of two-dimensional visual signals in noisy conditions). Kibernetika, 4:113- 130, 1976. G. Shafer and P. Shenoy. Focal computations in hyper-trees. Working paper 201, School of business. University of Kansas, 1988. P. Shenoy and G. Shafer. Axioms for probability and belief-function propagation. In Proceed¬ ings of the 6th Conference on Uncertainty in Artificial Intelligence, pages I69-I98, Cambridge, USA, 1990. R. Tarjan and M. Yannakakis. Simple linear-time algorithms to test chordality of graphs, test acyclicity of hypergraphs and selectively reduce acyclic hypergraphs. SIAM Journal of Comput¬ ing, 13(3):566-579, 1984. F. Vandenberghe and M. S. Andersen. Chordal graphs and semidefinite optimization. Foundations and Trends in Optimization, l(4):241-433, 2014. S. Vicente, V. Kolmogorov, and C. Rother. Graph cut based image segmentation with connectivity priors. In Computer Vision and Pattern Recognition, CVPR 2008, pages 1-8, Alaska, USA, 2008. 37 M. J. Wainwright and M. 1. Jordan. Graphical models, exponential families, and variational infer¬ ence. Foundations and Trends in Machine Learning, 1(1-2): 1-305, 2008. D. L. Waltz. Generating semantic descriptions from drawings of scenes with shadows. Technical Report A1271, M.I.T., Cambridge MA, 1972. Y. Weiss. Correctness of local probability propagation in graphical models with loops. Neural Computation, 12(1), 2000. T. Werner. High-arity interactions, polyhedral relaxations, and cutting plane algorithm for soft constraint optimisation (map-mrf). In Computer Vision and Pattern Recognition, CVPR 2008, pages 1-8, Alaska, USA, 2008. J. Yedidia, W. Freeman, and Y. Weiss. Constructing free energy approximations and generalized belief propagation algorithms. IEEE Transactions on Information Theory, 51(7):2282-2312, 2005. S. Zhong and J. Ghosh. HMMs and coupled HMMs for multi-channel EEG classification. In Proceedings of the IEEE International Joint Conference on Neural Networks, volume 2, pages 1254-1159, Honolulu, Hawai, 2002. 38