Skip to main content

Full text of "Exact and approximate inference in graphical models: variable elimination and beyond"

See other formats

arXivil506.08544V 1 [stat.ML] 29Jun2015 

Exact and approximate inference in graphical models: 
variable elimination and beyond 

Nathalie Peyrard®, Simon de Givry“, Alain Franc^, Stephane Robin*^’^, 
Regis Sabbadin®, Thomas Schiex®, Matthieu Vignes® 

“ INRA UR 875, Unite de Mathematiques et Informatique Appliquees, 
Chemin de Borde Rouge, 31326 Castanet-Tolosan, France 
^ INRA UMR 1202, Biodiversite, Genes et Communautes, 

69, route d’Arcachon, Pierroton, 33612 Cestas 
Cedex, France 

^ AgroParisTech, UMR 518 MIA, Paris 5e, France 
^ INRA, UM R518 MIA, Paris 5e, France 

June 30, 2015 


Probabilistic graphical models offer a powerful framework to account for the dependence 
structure between variables, which can be represented as a graph. The dependence between 
variables may render inference tasks such as computing normalizing constant, marginalization 
or optimization intractable. The objective of this paper is to review techniques exploiting the 
graph structure for exact inference borrowed from optimization and computer science. They 
are not yet standard in the statistician toolkit, and we specify under which conditions they are 
efficient in practice. They are huilt on the principle of variable elimination whose complex¬ 
ity is dictated in an intricate way by the order in which variables are eliminated in the graph. 

The so-called treewidth of the graph characterizes this algorithmic complexity: low-treewidth 
graphs can be processed efficiently. Algorithmic solutions derived from variable elimination 
and the notion of treewidth are illustrated on problems of treewidth computation and inference 
in challenging benchmarks from optimization competitions. We also review how efficient tech¬ 
niques for approximate inference such as loopy belief propagation and variational approaches 
can be linked to variable elimination and we illustrate them in the context of Expectation- 
Maximisation procedures for parameter estimation in coupled Hidden Markov Models. 

Keywords: graphical model, computational inference, treewidth, message passing, variational 


1 Introduction 

Most real complex systems are made up or modeled by elementary objects that locally interact 
with each other. Graphical models ( Bishop[ 2006 [ Koller and Friedman] 2009 [ Murphy[ 20121 
are formed by variables linked to each other by stochastic relationships. They enable to model 
dependencies in possibly high-dimensional heterogeneous data and to capture uncertainty. Graph¬ 
ical models have been applied in a wide range of areas like image analysis, speech recognition, 
bioinformatics, ecology to name a few. 

In real applications a large number of random variables with a complex dependency structure 
are involved. As a consequence, inference tasks such as the calculation of a normalization con¬ 
stant, a marginal distribution or the mode of the joint distribution are challenging. Three main 
approaches exist to evaluate such quantities for a given distribution p defining a graphical model: 
(a) compute them in an exact manner; (6) use a stochastic algorithm to sample from the distribu¬ 
tion p to get (unbiased) estimates; (c) derive an approximation of p for which the exact calculation 
is possible. Even if appealing, exact computation on p often leads to very time and memory con¬ 
suming procedures, since the number of elements to store or elementary operations to perform 
increase exponentially with n the number of random variables. The second approach is probably 
the most widely used by statisticians and modelers. Stochastic algorithms such as Monte-Carlo 
Markov Chains (MCMC, Robert and Casella 20041, Gibbs sampling ( Casella and Georgej 19921 
and particle filtering (Gordon and Smith, 1993) have become standard tools in many fields of ap¬ 
plication using statistical models. The last approach includes variational approximation techniques 
( [Wainwright and Jordan[ |2008| ), which are starting to become common practice in computational 
statistics. In essence, approaches of type (6) provide an approximate answer to an exact problem 
whereas approaches of type (c) provide an exact answer to an approximate problem. 

In this paper we focus on approaches of type (a) and (c), and we will review techniques for 
exact or approximate inference in graphical models borrowed from both optimization and computer 
science. They are computationally efficient, yet not standard in the statistician toolkit. Our purpose 
is to show that the characterization of the structure of the graph G associated to a graphical model 
(precise definitions are given in Section]^ enables both to determine if the exact calculation of the 
quantities of interest (marginal distribution, normalization constant, mode) can be implemented 
efficiently and to derive a class of operational algorithms. When the answer is no, the same analysis 
enables to design algorithms to compute an approximation of the desired quantities for which an 
acceptable complexity can be obtained. 

The central algorithmic tool is the variable elimination concept ( Bertele and Brioshi] , 1972| ). 
In Section we adopt a unified algebraic presentation of the different inference tasks (marginal¬ 
ization, normalizing constant or mode evaluation) to emphasize that all of them can be solved as 
particular cases of variable elimination. This implies that if variable elimination is efficient for one 
task it will also be efficient for the other ones. The key ingredient to design efficient algorithms 
based on variable elimination is the clever use of distributivity between algebraic operators. For 
instance distributivity of the product (x) over the sum (-f) enables to write (a x b) + (a x c) = 
a X (b + c) and evaluating the left-hand side of this equality requires two multiplications and one 
addition while evaluating the right-hand side requires one multiplication and one addition. Simi¬ 
larly since max(a + b, a + c) = a -h max(b, c) it is more efficient to compute the right-hand side 
from an algorithmic point of view. Distributivity enables to minimize the number of operations. 


Associativity and commutativity are also required and the algebra behind is the semi-ring eate- 
gory (from whieh some notations will be borrowed). Inferenee algorithms using the distributivity 
property have been known and published in the Artifieial Intelligenee and Maehine Learning liter¬ 
ature under different names, sueh as sum-prod, or max-sum ( Pearl[ 1988; Bishop[ 20061 and are 
examples of variable elimination. 

Variable elimination relies on the ehoiee of an order of elimination of the variables (either by 
marginalization or by maximization). This eorresponds to the ordering ealeulations are performed 
when applying distributivity. The topology of the graph G provides key information to organize 
the ealeulations for an optimal use of distributivity, i.e. to minimize the number of elementary 
operations to perform. For example, when the graph is a tree, the most efficient elimination order 
eorresponds to eliminating reeursively the vertices of degree one, starting from the leaves towards 
the root. For an arbitrary graph, the notion of an optimal elimination order for inference in a graph- 
ieal model is elosely linked to the notion of treewidth of the assoeiated graph G. We will see in 
Seetion the reason why inference algorithms based on variable elimination with the best elimi¬ 
nation order are of eomplexity linear in n but exponential in the treewidth. Therefore treewidth is 
the key eharaeterization of G to determine if exaet inferenee is possible in praetiee or not. 

The eoneept of treewidth has been proposed in parallel in computer scienee (Bodlaender 19941 

and in diserete mathematies and graph minor theory (see Robertson and Seymour[ 1986 Lovasz 

20051. Diserete mathematies existenee theorems ([Robertson and Seymour, 19861 establish that 

there exists an algorithm for eomputing the treewidth of any graph with eomplexity polynomial 
in n (but exponential in the treewidth), and even the degree of the polynomial is given. However 
this result does not tell how to derive and implement the algorithm, apart from some specifie eases 
(as trees, chordal graphs, and series-parallel graphs, see |Duffin] ( |1965| ). So we will also present 
in Section several state-of-the-art algorithms for approximate evaluation of the treewidth and 
illustrate their behavior on benchmarks borrowed from optimization eompetitions. 

Variable elimination has also lead to message passing algorithms ( Pearl[ 19881 whieh are now 
eornmon tools in computer seience or machine learning. More reeently these algorithms have 
been reinterpreted as re-parametrization tools ( Roller and Friedman[ 2009). We will explain in 
Seetion how re-parametrization ean be used as a pre-processing tool to transform the original 
graphieal model into an equivalent one for whieh inference may been simpler. Message passing is 
not the only way to perform re-parametrization and we will diseuss alternative effieient algorithms 
that have been proposed in the eontext of eonstraint satisfaetion problems (CSP, see ([Rossi et aL| 

20061) and that have not yet been exploited in the eontext of graphieal models. 

As emphasized above, effieient exaet inferenee algorithms ean only be designed for graphieal 
models with limited treewidth (mueh less than the number of vertices), whieh is a far from being 
the general ease. But the prineiple of variable elimination and message passing for a tree ean still 
be applied to any graph leading then to heuristie inferenee algorithms. The most famous heuristies 
is the Loopy Belief Propagation algorithm ( Ksehisehang et akj 2001) We reeall in Section the 
result that establishes LBP as a variational approximation method. Variational methods rely on 
the ehoiee of a distribution whieh renders inference easier, to approximate the original eomplex 
graphieal model p. The approximate distribution is chosen within a class of models for which 
efficient inferenee algorithms exist, that is models with small treewidth (0, 1 or 2 in practice). 
We review some of the standard ehoiees and we illustrate on the problem of parameter estimation 
in eoupled Hidden Markov Model (Ghahramani and Jordan 1997[) how variational methods have 


been applied in praetiee with different approximate distributions, each of them corresponding to a 
different underlying treewidth (Section |^. 

2 Graphical Models 

2.1 Models definition 

Consider a a stochastic system defined by a set of random variables X = (Xi,..., X„). Each 
variable Xj takes values in Aj. Then, a realization of X is a set x = (xi,..., x„), with Xi G Aj. 
The set of all possible realizations is called the state space, and is denoted A = nr=i ^ 

is a subset of n}, X^, xa and A^ are respectively the subset of random variables 

{Xi,i G A}, the set of realizations {xi,i G A} and the state space of Xa- If p is the joint 
probability distribution of X on A, we denote 

Va; G A, p{x) = p{X = x). 

Note that we focus here on discrete variables (we will discuss inference in the case of contin¬ 
uous variables on examples in Section [^. A joint distribution p on A is said to be a probabilistic 
graphical model ( Lauritzen] 1996^ [Bishop[ |2006[ Koller and Friedman] |2009[ ) indexed on a set B 
of parts of V if there exists a set ih = {V’s} bgb of maps from A^ to M+, called potential functions, 
indexed by B such that p can be expressed in the following factorized form: 

p{x )=7 n '^b{xb), 

( 1 ) 


where Z = Yhx&k Y\b&b'^b{.xb) is the normalizing constant, also called partition function. The 
elements B E B are the scopes of the potential functions and |i?| is the arity of the potential 
function The set of scopes of all the potential functions involving variable Xj is denoted 

Bi = \jBeB{B}. 

One desirable property of graphical models is that of Markov local independence: if p(X = x) 
can be expressed as Q then a variable X, is (stochastically) independent of all others in X con¬ 
ditionally to the set of variables B)\i- This set is called the Markov blanket of X,, or its 

neighborhood. We will denote it Xj. These conditional independences can be represented graph¬ 
ically, by a graph with one vertex per variable in X. The question of encoding the independence 
properties associated with a given distribution into a graph structure is well described in ( [Koller and 
2009| ), and we will not discuss it here. We will consider the classical graph G = (V, E) 
to a decomposition of the form ([^ where an edge is drawn between two vertices i and 
i if there exists B E B such that i and j are in 5. Such a representation of a graphical model is 
actually not as rich as the representation ([^. For instance, if n = 3, the two cases B = {{1, 2, 3}} 
and B = {{1, 2}, {2, 3}, {3,1}} are represented by the same graph G, namely a clique of size 3. 
The factor graph representation goes beyond this limit: this graphical representation is a bipartite 
graph with one vertex per potential function and one vertex per variable, edges are only between 
functions and variables. An edge is present between a function vertex (also called factor vertex) 




Figure 1: From left to right: (a) Graphieal representation of a direeted graphieal model where po¬ 
tential funetions define the eonditional probability of eaeh variable given its parents values; (b) The 
corresponding factor graph where every potential function is represented as a factor (square ver¬ 
tex) connected to the variables that are involved in it; (c) Graphical representation of an undirected 
graphical model. It is impossible from this graph to distinguish between a graphical model defined 
by a unique potential function on vertices 3, 4 and 5 from a model defined by 3 pairwise poten¬ 
tial functions over each pair (3,4), (3,5) and (4,5); (d) The corresponding factor graph, which 
unambiguously defines these potential functions (here three pairwise potential functions) 

and a variable vertex if the variable is in the scope of the potential function. Figure displays 
examples of the two graphical representations. 

There exists several families of probabilistic graphical models ( jKoller and Friedman , 2009 
Murphy[ |2012|). They can be grouped into directed and undirected ones. The most classical 

directed framework is that of Bayesian network (Pearl 1988; Jensen and Nielsen| , 2007 1 ). In a 
Bayesian network, an edge is directed from a parent vertex to a child vertex and potential func¬ 
tions are conditional probabilities of a variable given its parents in the graph (see Figure (a)). 
In such models, trivially Z = 1. Undirected probabilistic graphical models (see Figure [^(c)) are 
equivalent to Markov Random Fields (Li, 20011 as soon as the potential functions are in IR+*. In a 
Markov random field (MRF), a potential function is not necessarily a probability distribution: ips 
is not required to be normalized. 

Deterministic Graphical models. Although the terminology of ’Graphical Models’ is often 
used to refer to stochastic graphical models, the idea of describing a joint distribution on a set of 
variables through local functions has also been used in Artificial Intelligence to concisely describe 
Boolean functions or cost functions, with no normalization constraint. In a graphical model with 
only Boolean (0/1) potential functions, each potential function describes a constraint between vari¬ 
ables. If the potential function takes value 1, the corresponding realization is said to satisfy the 
constraint. The graphical model is known as a ’Constraint Network’. It describes a joint Boolean 
function on all variables that takes value 1 if and only if all constraints are satisfied. The problem 
of finding a realization that satisfies all the constraints, called a solution of the network, is the 
’Constraint Satisfaction Problem’ (CSP) (|Rossi et al.[ 120061). This framework is used to model 

and solve combinatorial optimization problems and there is a variety of software tools to solve it. 


When variables are Boolean too and when the Boolean functions are described as disjunctions of 
variables or of their negation, the CSP reduces to the ’Boolean Satisfiability’ problem (or SAT), 

the seminal NP-complete problem (Cook 19711. 

CSP have been extended to describe joint cost functions, decomposed as a sum of local cost 
functions in the ‘Weighted Constraint Network’ ( Rossi et aL| [2006 ) or ‘Cost Function Network’. 
In this case, potential functions take finite or infinite integer or rational values: infinity enables 
to express hard constraints while finite values encode costs for unsatisfied soft constraints. The 
problem of finding a realization of minimum cost is the ’Weighted Constraint Satisfaction Prob¬ 
lem’ (WCSP), which is also NP-hard. It is easy to observe that any stochastic graphical model 
can be translated in a weighted constraint network using a simple — log(-) transformation. With 
this equivalence, it becomes possible to use exact WCSP resolution algorithms that have been 
developed in this field for mode evaluation in stochastic graphical model. 

2.2 Inference tasks in probabilistic graphical models 

Computations on probabilities and potentials rely on two fundamental types of operations. Firstly, 
multiplication (or addition in the log domain) is used to combine potentials to define a joint 
potential distribution. Secondly, sum or max/min can be used to eliminate variables and com¬ 
pute marginals or modes of the joint distribution on subsets of variables. The precise identity 
of these two basic operations is not crucial for the inference algorithms considered in this pa¬ 
per. We denote as © the combination operator and as © the elimination operator. The algo¬ 
rithms just require that (IR+, ©, 0) defines a commutative semi-ring. Specifically, the semi-ring 
algebra offers distributivity: (a ©6) ©(a 0 c) = a©(6©c). This corresponds to distributivity 
of product over sum since (a x 6) + (a x c) = a x (6 + c) or distributivity of max over sum 
since max(a + b, a + c) = a + max(6, c), or again distributivity of max over product since 
max(a x b, a x c) = a x (max(6, c)). These two abstract operators can be defined to be applied to 
potential functions, as follows: 

Combine operator: the combination of two potential functions ^jJA and i/’s is a new function 
from Aaub to M + defined as ^ja © 'iIjb{.xavjb) = i>A{xA) 0 i>B{xB)- 

Elimination operator: the elimination of variable Xi,i G B from a potential function -0^ is a new 
function (©^^^ 0 b) from AB\{q to M+ defined as {®xi^B){xB\{i}) = ®xi{i’B{xB\{i}, Xi)). 
For © = +, (©^, 0B)(a:B\{i}) represents Y^xi i’BixB\{i}, Xi). 

We can now describe classical counting and optimization tasks in graphical models in terms of 
these two operators. For simplicity, we denote by ©a;^, where B C V a sequence of eliminations 
©a;, for all i & B, the result being insensitive to the order in a commutative semi-ring. Similarly, 
©BeB represents the successive combination of all potential functions 0 b such that B G B. 

Counting tasks. Under this name we group all tasks that involve summing over the state space 
of a subset of variables in X. This includes the computation of the partition function Z or of any 


marginal distribution, as well as entropy evaluation. For A C V and A = V \ A, the marginal 
distribution pA of Xa assoeiated to the joint distribution p is defined as: 

Pa(xa) = ^ p(^a,Xa) = ^ ^ ( 2 ) 


The funetion pa then satisfies: 

paQ z = paQ { O i’B)) = (0(0 -06)) 

V XV B&B / ^ B&B / 

where © eombines funetions using x and © eliminates variables using +. 

Marginal evaluation is also interesting in the ease where some variables are observed. If the val¬ 
ues of some variables xq {O C V) have been observed, we ean eompute the marginal eonditional 
distribution by restrieting the domains of variables Xq to the observed value. 

The entropy i/ of a probabilistie graphieal model p is defined as 

H{p) = -E[\n{p{x))], (3) 

where E[-] denotes the mathematieal expeetation. In the ease of a graphieal model, by linearity of 
the expeetation, the entropy is equal to 

H{p) = ln(Z) - E E p{xb) \n{^jJB{xB))■ 


This expression is an alternation of use of © and © operators (for p{xb) evaluation, for eaeh B 
and xb)- 

Optimization task An optimization task in a graphieal model eorresponds to the evaluation of 
the most probable state x* of the random veetor X, defined as 


= aigmaxpix) = argmax IT ipBixB) = argmaxN hiipBixB)- 
xSA t(=A -I--I- 



The maximum itself is 0a;,., Obsb In '4’b{xb) with © set to max and © to +. The eomputation of 
the mode x* does not require the eomputation of the normalizing eonstant Z, however eomputing 
the mode probability p{x*) does. 

Therefore eounting and optimization tasks ean be interpreted as two instantiations of the same 
eomputational task expressed in terms of eombination and elimination operators, namely 0a;^ O b&b'4’b 
where A C V. When the eombination operator © and the elimination operator © are respeetively 
set to X and +, this eomputational problem is known as a sum-produet problem in the Artifieial 
Intelligenee literature ( |Pear4[T988| ),( |Bishop[ |2006[ ehapter 8). When © is set to max and © to the 
sum operator it is a max-sum problem (Bishop 2006[ ehapter 8). 

We will see in Seetion that there exists an exaet algorithm solving this general task that 
exploits the distributivity of the eombination and elimination operators to perform operations in 

a smart order. From this generie algorithm, known as variable elimination (Bertele and Brioshi 

19721 or bueket elimination ( Deehter[ 19991, one ean deduee exaet algorithms to solve eounting 
and optimization tasks in a graphieal model, by instantiating the operators © and ©. 



@ @ ® 

Figure 2: Graphical representation of a HMM. Hidden variables correspond to vertices 1, 3, 5, 7, 
and observed variables to vertices 2, 4, 6, 8. 

Deterministic Graphical models : the Constraint Satisfaction Problem is a V-A problem 
as it can can be defined using V (logical ’or’) as the elimination operator and A (logical ’and’) 
as the combination operator over Booleans. The weighted CSP is a min-+ as it uses min as 
the elimination operator and + (or bounded variants of +) as the combination operator. Several 

other variants exist (] 

^ossi et al.[ 20061, including generic algebraic variants ( 

Schiex et al. 


Bistarelli et al. 




Pralet et al.[|2007 Kohlasl 2003|). 

3 Variable elimination for exact inference 

We describe now the principle of variable elimination. We first recall the Viterbi algorithm for 
Hidden Markov Chains, a classical example of variable elimination for optimization. Then we 
formally describe the variable elimination procedure in the general graphical model framework. 
The key element is the choice of an ordering for the sequential elimination of the variables. It is 
closely linked to the notion of treewidth of the graphical representation of the graphical model. As 
it will be shown, the complexity of the variable elimination is fully characterized by this notion. 
Conversely, the treewidth can be bounded from above from a given variable elimination scheme. 

3.1 An example: hidden Markov chain models 

As an introduction to exact inference on graphical models by variable elimination, we consider a 
well studied stochastic process: the discrete Hidden Markov Chain model (HMC). 

A HMC (Figure is defined by two sequences of random variables O and H of the same 
length, T. A realization o = (oi,... ot) of the variables O = (Oi,... Ot) is observed, while the 
states of variables H = {Hi,... Ht) are unknown. In the HMC model the assumption is made 
that Oi is independent of Hv\{i} and Ov\{i} given the hidden variable Hi. These independences 
are modeled by pairwise potential functions 1 < * < T’- Furthermore, hidden vari¬ 

able Hi is independent of i7i,..., and Oi,, Oi-i given the hidden variable These 
independences are modeled by pairwise potential functions 1 < i < T. Then the 

model is fully defined by specifying an additional potential function 'ij)Hi{hi) to model the initial 
distribution. In the classical HMC formulation (Rabiner, 19891, these potential functions are nor¬ 
malized conditional probability distributions i.e., ipHi_-i_,Hi{hi-i, hi) = p{Hi = hi\Hi_i = hj_i), 
'4^Oi,Hi{0i, hi) = p{Oi = Oi\Hi = hi) and = PiHi = hi). As a consequence, the normal¬ 

izing constant Z is equal to 1, as in any Bayesian network. 


A classical inference task for HMC is to identify the most likely value of the variables H given 
a realization o of the variables O. The problem is to eompute argmax/ip(if = h\0 = o) or 
equivalently the argument of: 




{hi_u hi)'ilJo,,HXoh hi)) 



The number of possible realizations of H is exponential in T. Nevertheless this optimization 
problem ean be solved in a number of operations linear in T using the well-known Viterbi al¬ 
gorithm ( |Rabiner| 19891. This algorithm, based on dynamie programming, performs sueeessive 
eliminations (by maximization) of all hidden variables, starting with Ht, then Ht-i, and finishing 
by Hi, to eompute the most likely sequenee of hidden variables. By using distributivity between 
the max and the produet operators, the elimination of variable Ht ean be done by rewriting equa¬ 
tion @ as: 


h\,...,hT-\ - 



TT hi)'ipOr,HXoi-: hi).v[va-x'ipHT-i,HT{hT-i-, hT)i>OT,HT{oT-, hr )) 

i=2 V-^. 

New potential function 

The new potential funetion ereated by maximizing on Ht depends only on variable Ht-i, so 
that variables Ht, Ot and potential funetions involving them have been removed from the opti¬ 
mization problem. This is a simple applieation of the general variable elimination algorithm that 
we deseribe in the next seetion. 

3.2 General principle of variable elimination 

In Seetion we have seen that eounting and optimization tasks ean be formalized by the same 
generie algebraie formulation 

0( O '0b) (6) 

XA -BeB 

where A <zV. 

The triek behind variable elimination ( |Bertele and Brioshi] \\912) relies on a elever use of the 
distributivity property. Indeed, evaluating (a © 6) ©(a 0 c) as a 0(6 © c) requires fewer operations. 
Sinee distributivity applies both for eounting and optimizing tasks, variable elimination ean be 
applied to both tasks. It also means that if variable elimination is effieient for one task it will also 
be effieient for the other one. As in the HMC example, the prineiple of the variable elimination 
algorithm for eounting or optimizing eonsists in eliminating variables one by one in expression Q. 

Elimination of the first variable, say Xi,i G A, is performed by merging all potential funetions 
involving Xj and applying operator to these potential funetions. Using eommutativity and 
assoeiativity equation Q ean be rewritten as follows: 

0(O'0b)= 0 0 O '0b)( 0 ^b) 
XA BeB Xi \ B&Bi 


Figure 3: Elimination of variable Xi replaces the four pairwise potential functions involving vari¬ 
able Xi with a new potential involving the four neighbors of vertex 1 in the original graph. 
The new edges created between these four vertices are called fill-in edges (dashed edges in the 
middle figure). 

Then using distributivity we obtain 

0( O V's) 

xa BgB 

0 ( O '0b) © 

(0 O tpB) 

Xi BeBi 

' -V-' 

New potential function ipNi 

This shows that the elimination of Xi results in a new graphical model where the variable Xi 
and the potential functions ipBiB e Bi have been removed and replaced by a new potential 
which does not involve Xi, but its neighboring vertices. The graph associated to the new graphical 
model is similar to the graph of the original model except that vertex Xi has been removed and 
that the neighbors Ni of Xi are now connected together in a clique. The new edges between 
the neighbors of Xi are called edges. For instance, when eliminating variable Xi in the 
graph of Figure (left), potential functions 01^2, "01,3, 0i,4 and -01^5 are replaced by 02,3,4,5 = 
©11 (01,2 0 01,3 © 01,4 © 01,4)- The new graph is shown in Figure]^ right part. 

When the first elimination step is applied with © = -f and 0 = x, the probability distribution 
defined by this new graphical model is the marginal distribution p{xv\{i}) of the original distri¬ 
bution p. The complete elimination can be obtained by successively eliminating all variables in 
Xa- The result is a graphical model over Xv\a which is the marginal distribution p{xv\a)- When 
A = V, the result is a model with a single constant potential function with value Z. 

If instead © is max, and 0 = x (or + with a log transformation of the potential functions) 
and A = V, the last potential function obtained after elimination of the last variable is equal to 
the maximum of the non normali z ed distribution. So evaluating Z or the maximal probability of a 
graphical model can be both obtained with the same variable elimination algorithm, just changing 
the meaning of © and ©. 

However, if one is interested in the mode itself, an additional simple computation is required. 
The mode is obtained by induction: if is the mode of the graphical model obtained after 

the elimination of the first variable, Xi, then the mode of p can be defined as x*) where 

x* is a value in Aj that maximizes Qbgb Xi). This maximization is straightforward 

to derive because Xi can take only |Ai| values. itself is obtained by completing the mode 


of the graphical model obtained after elimination the second variable, and so on. We stress here 
that the procedure requires to keep the intermediary potential functions generated during the 
successive eliminations. 

When eliminating a variable Xj, the task that can be computationally expensive is the com¬ 
putation of the intermediate ipNi- It requires to compute the product QBeBii’BixB) of several 
potential functions for all elements of AAr.u{i}, the state space of The time and space 

complexity of the operation are entirely determined by the cardinality |iVi| of the set of indices Xj. 
If K = maxjgy |Aj| is the maximum domain size of a variable, the time complexity (i.e. number 
of elementary operations performed) is in and space complexity (i.e. memory space 

needed) is in Complexity is therefore exponential in |Xj|, the number of neighbors of 

the eliminated variable in the current graphical model. The total complexity of the variable elim¬ 
ination is then exponential in the maximum cardinality \Ni\ over all successive eliminations (but 
linear in n). Because the graphical model changes at elimination each step, this number usually 
depends on the order in which variables are eliminated. 

As a consequence, the prerequisite to apply variable elimination is to decide for an ordering of 
elimination of the variables. As illustrated in Figure]^ two different orders can lead to two different 
Ni subsets. The key message is that the choice of the order is crucial/ it dictates the efficiency of 
the variable elimination procedure. We will now illustrate and formalize this intuition. 

3.3 When is variable elimination efficient ? 

We can understand why the Viterbi algorithm is an efficient algorithm for mode evaluation in a 
HMC. The graph associated to a HMC is comb-shaped: the hidden variables form a line and each 
observed variable is a leaf in the comb (see Figure]^. So it is possible to design an elimination 
order where the current variable to eliminate has a unique neighbor in the graphical representation 
of the current model: for instance Ot > > Ot-i > Ht-i, ■ ■ ■ > Oi > Hi (the first eliminated 

variable is the largest according to this ordering). Following this elimination order, when eliminat¬ 
ing a variable using ©, the resulting graphical model has one fewer vertex than the previous one 
and no fill-in edge. Indeed, the new potential function di function of a single variable since 

m = 1 . 

More generally, variable elimination is very efficient, i.e. leads to intermediate Ni sets of 
small cardinality, on graphical models whose graph representation is a tree, again because it is 
always possible to design an elimination order where the current variable to eliminate has only one 
neighbor in the graphical representation of the current model. 

Another situation where variable elimination can be efficient is when the graph associated to 
the graphical model is chordal (any cycle of length 4 or more has a chord i.e., an edge connecting 
two non adjacent vertices in the cycle), the size of the largest clique being low. The reason is 
the following. In Figure 2, new edges are created between neighbors of the eliminated vertex. If 
this neighborhood is a clique, no new edge is added. A vertex whose neighborhood is a clique is 
called a simplicial vertex. Chordal graphs have the property that there exists an elimination order 
of the vertices such that every vertex during elimination process is simplicial. Then, there exists 
an elimination order such that no fill-in edges are created. Thus, the largest size of is no more 
than the size of a clique, and is equal to or less than the size of the largest clique in the graph. Let 


Figure 4: A graph and two elimination orders. Left, the graph; middle, induced graph associated 
to the elimination order (7>6>5>4>3>2>1). Vertices are eliminated from the largest 
to the smallest. The maximum size of A* sets created during elimination is 2 (maximum number 
of outgoing edges) and only one (dashed) fill-in edge is added when vertex 4 is eliminated; right, 
induced graph associated to the elimination order (7>5>3>1>6>4>2). The maximum 
size of Ni sets created during elimination is 3 and 5 (dashed) fill-in edges are used. 

us note that a tree is a chordal graph in which all edges and only edges are cliques. Hence, for a 
tree, simplicial vertices are vertices of degree one. Then, elimination of degree one vertices on a 
tree is an example of simplicial elimination on a chordal graph. 

For arbitrary graphs, if the maximal scope size of the intermediate functions created during 

variable elimination is too large, then memory and time required for the storage and computation 
quickly exceed computer capacities. Depending on the chosen elimination order, this maximal 
scope can be reasonable from a computational point of view, or too large. So again, the choice of 
the elimination order is crucial. 

3.4 The treewidth to characterize variable elimination complexity 

The lowest complexity achievable when performing variable elimination is characterized by a pa¬ 
rameter called the treewidth of the graph associated to the original graphical model. This concept 
has been repeatedly discovered and redefined. The treewidth of a graph is sometimes called its 
induced width ([Dechter and Pearl] |1988|), its minimum front size (|Liul |1992[), its fc-tree num¬ 

ber (Amborg 1985), its dimension (Bertele and Brioshi 1972) and is also equal to the min-max 
clique number of G minus one ( Arnborgj 1985). The treewidth is also a key notion in the theory 
of graph minors (see Robertson and Seymour|, 1986 Lovasz 20051. 

We insist here on two definitions. The first one from (Bodlaender 1994) relies on the notion 

of induced graph and the link between fill-in edges and the intermediate A* sets created during 
variable elimination is straightforward. The second (Robertson and Seymour[ [T986[ Bodlaend^ 

1994) is the commonly used characterization of the treewidth using so-called tree decompositions, 
also known as junction trees which are key tools to derive variable elimination algorithms. It 
underlies the block-by-block elimination procedure described in Section [T5j 


Definition 1 (induced graph) Let G = {y,E) be a graph defined by a set of vertices indexed 
on V and a set E of edges. Given an ordering tt of the vertices of G, the induced graph Gf^ 
is obtained as follows. G and Gf'‘^ have same vertices. Then to each edge in E corresponds an 
oriented edge in Gf'^ going from the first of the two nodes according to tt toward the second. Then 
each vertex i of V is considered one after the other following the order defined by n. When vertex 
i is treated, an oriented edge is created between all pairs of neighbors ofi in G that follows i 
according to tt. Again the edge is going from the first of the two nodes according to vr toward the 

The induced graph G'f''^ is also called the fill graph of G and the process of computing it is 
sometimes referred to as “playing the elimination game” on G, as it just simulates elimination on 
G using the variable ordering tt. This graph is chordal (Vandenberghe and Andersen 20141. It is 
known that every chordal graph G has at least one vertex ordering vr such that Gf"^ = G, called a 
perfect elimination ordering (Fulkerson and Gross 1965[). 

The second notion that enables to define the treewidth is the notion of tree decomposition. 
Intuitively, a tree decomposition of a graph G organizes the vertices of G in clusters of vertices 
which are linked by edges such that the graph obtained is a tree. Specific constraints on the way 
vertices of G are associated to clusters in the decomposition tree are demanded. These contraints 
ensure properties to tree decomposition usefull for building variable elimination algorithms. 

Definition 2 (tree decomposition) Given a graph G = (V, E), a tree decomposition T is a tree 
{C, Et), where C = {Gi ,..., Gi} is a family of subsets ofV (called clusters), and E^ is a set of 
edges between the subsets Gi, satisfying the following properties: 

• The union of all clusters Gk equals V (each vertex is associated with at least one vertex of 

• For every edge {i,j) in E, there is at least one cluster Gk that contains both i and j. 

• If clusters Gk and Gi both contain a vertex i of G, then all clusters Gg ofT in the (unique) 
path between Gk and Gi contain i as well: clusters containing vertex i form a connected 
subset ofT. This is known as the running intersection property). 

The concept of tree decomposition is illustrated in Figure 

Definition 3 (treewidth) The two following definitions of the treewidth derived respectively from 
the notion of induce graph and from that of tree decomposition are equivalent (but this is not trivial 
to establish): 

• The treewidth TIF,r(G') of a graph Gfor the ordering vr is the maximum number of outgoing 
edges of a vertex in the induced graph Glf^. The treewidth TW (G) of a graph G is the 
minimum treewidth over all possible orderings vr. 

• The width of a tree decomposition (C, Et) is the size of the largest Gi G C. and the treewidth 
TW (G) of a graph is the minimum width among all its tree decompositions. 


Figure 5: Left: graphical representation of a graphical model. Right: tree decomposition over 
clusters Ci = {1, 2,4}, C 2 = {1, 3,4}, C 3 = {3,4, 5}, C 4 = {5, 6 } and C 5 = {5, 7}. Each edge 
between two clusters is labeled by their common variables. 

TWt,{G) is exactly the cardinality of the largest set Ni created during variable elimination with 
elimination order tt. For example, in Figure Q the middle and right graphs are the two induced 
graphs for two different orderings. TWt,{G) is equal to 2 with the first ordering and to 3 with the 
second. It is easy to see that in this example TW{G) = 2. The treewidth of the graph of the HMC 
model, and of any tree is equal to 1 . 

It has been established that finding a minimum treewidth ordering tt for a graph G, finding a 
minimum treewidth tree decomposition or computing the treewidth of a graph are of equivalent 
complexity. For an arbitrary graph, computing the treewidth is not an easy task. Section is 
dedicated to this question, from a theoretical and a practical point of view. 

The treewidth is therefore a key indicator to answer the driving subject of this review: will 
variable elimination be efficient for a given graphical model? For instance, the principle of variable 
elimination have been applied to the exact computation of the normalizing constant of a Markov 
random field on a small r by c lattice in ( [Reeves and Pettittl 20041. For this regular graph, it is 
known that the treewidth is equal to min(r, c). So exact computation through variable elimination 
is only possible for lattices with a small value for min(r, c). It is however well beyond computer 
capacities for real challenging problems in image analysis. In this case variable elimination can be 
used to define heuristic computational solutions, such as the algorithm of (Friel et al. 2009) which 
relies on exact computations on small sub-lattices of the original lattice. 

3.5 Tree decomposition and block by block elimination 

Given a graphical model and a tree decomposition of its graph, a possible alternative to solve 
counting or optimization tasks is to eliminate variables in successive blocks instead of one after 
the other. To do so, the block by block elimination procedure (Bertele and Brioshi, I972| ) relies 
on the tree decomposition characterization of the treewidth. The underlying idea is to apply the 
variable elimination procedure on the tree decomposition, eliminating one cluster of the tree at 
each step. First a root cluster G (7 is chosen and used to define an order of elimination of 
the clusters, by progressing from the leaves toward the roots, such that every eliminated cluster 
corresponds to a leaf of the current intermediate tree. Then each potential function "05 is assigned 


to the cluster Ci in C such that B <Z Ci which is the closest to the root. Sucha cluster always exists 
from the properties of a tree decomposition and the fact that a potential function is associated to 
a clique in G). The procedure starts with the elimination of any leaf cluster Ci of T, with parent 
Cj in T. Let us note B{Ci) = {B e assigned to Ci}. Here again, commutativity and 

distributivity are used to rewrite expression ([^ (with A = V) follows: 

0 0 V’B = 0 

xv B&B XY\(Ci-Cj) 

O ^bO( 0 0 V’b) 

xc^-Cj B(^B{Ci) 

'• -V-^ 

New potential function 

Note that only variables with indices in C* \ = C, fl (1/ \ Cj) are eliminated, even if it is 

common to say that the cluster has been eliminated. For instance, for the graph of Figure]^ if the 
first eliminated cluster is Ci, the new potential function is X 2 )i’ 2 , 4 :i^ 2 , X 4 ), it depends 

only on variables Xi and X 4 . Cluster elimination continues until no cluster is left. The interest 
of this procedure is that the intermediate potential function created after each cluster elimination 

may have a scope much smaller than the treewidth, leading to better space complexity (Bertele and 
Brioshi \912\ chapter 4). However, the time complexity is increased. 

In summmary, the lowest complexity achievable when performing variable elimination is for 
elimination orders whose cardinalities of the intermediate Ni sets are lower of equal to the treewidth 
of G. This treewidth can be determined by considering clusters sizes in tree decompositions of G. 
Furthermore, a tree decomposition T can be used to build an elimination order and vice versa. 
Indeed, an elimination order can be defined by using a cluster elimination order based on T and 
choosing an arbitrary order to eliminate variables with indices in the subsets Ci \ Cj. Conversely, it 
is easy to build a tree decomposition from a given vertex ordering vr. Since the induced graph 
is chordal, its maximum cliques can be identified in polynomial time. Each such clique defines a 
cluster Ci of the tree decomposition. Edges of T can be identified as the edges of any maximum 
spanning tree in the graph with vertices Ci and edges (G*, Cj) weighed by |Gj fl Cj\. 

Deterministic Graphical Models : to our knowledge, the notion of treewidth and its properties 

have been first identified in combinatorial optimization in (Bertele and Brioshi 1972) where it was 

called “dimension”, a graph parameter which has been shown equivalent to the treewidth (Bodlaen- 

der 1998). Variable elimination itself is related to Eourier-Motzkin elimination (Eourier, 18271, 

a variable elimination algorithm that benefits from the linearity of the handled formulas. Variable 

elimination has been repeatedly rediscovered, as non-serial dynamic programming (Bertele and 

Brioshi 1972), in the David-Putnam procedure for boolean satisfiability problems (SAT, Davis 
and Putnam 1960| ), as Bucket elimination for the CSP and WCSP ( |Dechter[|1999| ), in the Viterbi 

and Eorward-Backward algorithms for HMM (Rabiner 19891 and many more. 

Theree exists other situations where the choice of an elimination order has a deep impact on the 
complexity of the computations as in Gauss elimination scheme for a system of linear equations, or 
Choleski factorization of very large sparse matrices, and where equivalence between elimination 
and decomposition have been used (see|Bodlaender et aL||1995[). 


4 Treewidth computation and approximation 

As already mentioned, the complexity of the counting and the optimization tasks on graphical 
models is heavily linked to the treewidth TW{G) of the underlying graph G. If one could guess 
the optimal vertex ordering, vr*, leading to TW.„*{G) = TW{G), then, one would be able to 
achieve the “optimal complexity” for solving exactly these tasks (we recall that K 

is the maximal domain size of a variable in the graphical model). However, the problem is that 
one cannot easily evaluate the treewidth of a given graph. The treewidth computation problem is 

known to be NP-hard (Amborg et aL| 19871. 

In the following subsections we provide a short presentation of the state-of-the-art theoretical 
and experimental results concerning the exact computation of the treewidth of a graph, and the 
computation of suboptimal vertex orderings providing approximations of the treewidth in the form 
of an upper bound. 

4.1 Exact solution algorithms 

Several exponential time exact algorithms have been proposed to compute the treewidth. These 
algorithms compute the treewidth in time exponential in n. The algorithm with the best complex¬ 
ity bound has been proposed by ( Fomin and Villanger[ 2012). They provide an exact algorithm 
for computing the treewidth, which run in time 0(2.6151"^) (using polynomial space), or in time 
0(1.7549"), using exponential (memory) space. 

Since the treewidth of a network can be quite small (compared to n) in practice, there has been 
a great deal of interest in finding exact algorithms with time complexity exponential in TW{G) 
and potentially only polynomial in n. Some of these algorithms even have complexity linear 
in n ( Bodlaenderj 1996; [Perkovic and Reed[ 2000). In Bodlaender ( 1996| ), an algorithm is pro¬ 
posed to compute the treewidth (it also provides an associated tree decomposition) of G in time 
n). If this algorithm is used to compute the treewidth of graphs in a family 
of graphs whose treewidth is uniformly bounded, then computing the treewidth would become of 
time complexity linear in n (however, even for a small bound on the treewidth, the constant can be 
huge). Moreover, in the general case, there is no way to bound the treewidth a priori. 

4.2 Approximation of the treewidth with guarantee 

Now, recall that even though crucial, finding a “good” tree decomposition of the graph G is only 
one element in the computation of quantities of interest in graphical models. If one has to spend 
more time on finding an optimal vertex ordering than on computing probabilities on the underlying 
graphical model using an easy-to-compute suboptimal ordering, the utility of exact treewidth com¬ 
putation becomes limited. Therefore, an alternative line of search is to look for algorithms com¬ 
puting a vertex ordering vr leading to a suboptimal width, TWt,{G) > TW{G), but more efficient 
in terms of computational time. When defining such approximation algorithms, one is particularly 
interested in polynomial time (in n) algorithms, finding a vertex ordering vr that approaches the 
optimal ordering within a constant multiplicative factor : TW^^iG) < aTW(G), a > 1. 


However, the existenee of sueh eonstant-faetor approximation algorithms is not guaranteed for 
all NP-hard problems. Some NP-hard problems are even known not to admit polynomial time ap¬ 
proximation algorithms. As far as treewidth approximation is eoneemed, we are in the interesting 
ease where it is not yet known whether or not a polynomial time approximation algorithm does 

exist (Austria et al. 20121. 

Finally, there have been a variety of proposed algorithms, trading off approximation quality and 
running time eomplexity ( [Robertson and Seymour[ 1986[ Lagergreen] , 1996[[Aimr| 2010[ Bodlaen- 
der et al.i|2013| ). Table [T](extraeted from |Bodlaender et al.i|2013] ) summarizes the results in terms 
of approximation guarantee and time eomplexity for these algorithms. 




Time eomplexil 



Robertson and Seymour 





ATWiG) + 3 
srwiG) + 7 
3rW{G) + 4 


2 TW{G)logTW{G) 
23.69S2TW (G) 3 


n log^ n 


Lagergreen (1996 

Amir (2010 

Bodlaender et al.^( 


Table 1: Approximation guarantee and time eomplexity of state-of-the-art treewidth approximation 
algorithms. Eaeh algorithm provides a vertex ordering vr sueh that TWt,{G) is upper bounded by 
the approximation guarantee indieated in eolumn 2. The time eomplexity of these algorithms is is 
0{f{TW{G)).g{n)) where n is the number of vertiees in G. 

The theoretieal results about the eomplexity and approximability of treewidth eomputation are 
interesting by the insight they give about the diffieulties of finding good, if not optimal, vertex 
ordering. They are also interesting in that they offer worst-ease guarantees, i.e., the approximation 
quality is guaranteed to be at least that promised by the algorithm. Furthermore, the inerease in 
eomputation time is also upper-bounded, allowing to get some guarantees that the approximation 
ean be obtained in “reasonable” time. 

However, the main drawbaek of these worst-ease based approaehes, is that they ean be domi¬ 
nated, empirieally, by heuristie approaehes, on most instanees. Indeed, several algorithms, working 
well in praetiee, even though without worst-ease eomplexity/quality bounds have been proposed. 
We deseribe some of these approaehes in the following seetion. 

4.3 Treewidth in practice 

A broad elass of heuristie approaehes is that of greedy algorithms (Bodlaender and Koster, 20101. 
They use the same iterative approaeh as the variable elimination algorithm (Seetion]^ exeept that 
they manipulate the graph strueture only and do not perform any aetual eombination/elimination 
eomputation. Starting from an empty vertex ordering and an initial graph G, they repeatedly seleet 
the next vertex to add in the ordering by loeally optimizing one of the following eriteria: 

• seleet a vertex with minimum degree in the eurrent graph ; 

• seleet a vertex with minimum number of fill-in edges in the eurrent graph. 


After each vertex selection, the current graph is modified by removing the selected vertex and 
making a clique on its neighbors. The new edges added by this clique are fill-in edges. A ver¬ 
tex with no fill-in edges is called a simplicial vertex. Fast implementations of minimum degree 
algorithms have been developed, see e.g., AMD ( [Amestoy et al.[ 19961 with time complexity in 
0{nm) (Heggernes et al. 20011 for an input graph G with n vertices and m edges. The minimum 

fill-in heuristic tends to be slower to compute but yields slightly better treewidth approximations 
in practice. Moreover, it will find a perfect elimination ordering (i.e., adding no fill-in edges) if it 
exists, thus recognizing chordal graphs and it returns the optimal treewidth in this particular case 
((this can be easily established from results in [Bodlaender et al.[ 2005). 

Notice that there exists linear time 0{n -I- m) algorithms to detect chordal graphs as the Maxi¬ 
mum Cardinality Search (MCS) greedy algorithm ( Tarjan and Yannakakis[ 19841 but the treewidth 
approximation they return is usually worse than the previous heuristic approaches. 

A simple way to improve the treewidth bound found by these greedy algorithms is to break ties 
for the selected criterion using a second criterion, such as minimum fill-in first and then maximum 
degree, or to break ties at random and to iterate on the resulting randomized algorithms as done in 
Kask et ar] ( |2011| ). 

We compared the mean treewidth bound found by these four approaches (minimum degree, 
minimum fill-in, MCS and randomized iterative minimum fill-in) on a set of five CSP and MRF 
benchmarks used as combinatorial optimization problems in various solver competitions. Par- 
ityLeaming is an optimization variant of the minimal disagreement parity CSP problem originally 

contributed to the DIMACS benchmark and used in the Minizinc challenge (Optimization Re 

search Group[, 20121. Linkage is a genetic linkage analysis benchmark (Elidan and Globerson 

20101. GeomSurf and SceneDecomp are respectively geometric surface labeling and scene de¬ 
composition problems in computer vision ( Andres et al.[ |2013[ ). The number of instances per 
problem as well as their mean characteristics are given in TableResults are reported in Figure 
(Left).The randomized iterative minimum fill-in algorithm used a maximum of 30, 000 iterations 
or 180 seconds (respectively 10, 000 iterations and 60 seconds for ParityLeaming and Linkage), 
compared to a maximum of 0.37 second used by the non-iterative approaches. The minimum fill- 
in algorithm (using maximum degree for ties breaking) performed better than the other greedy 
approaches, being slightly improved by its randomized iterative version. 




of instances 

Mean nb 
of vertices 

Mean nb 

of potential functions 

CSP/Parity Learning 




















Table 2: Characteristics of the five optimization problems of the benchmark. For a given prob¬ 
lem, several instances are available, corresponding to differents number of variables (equal to the 
number of vertices in the underlying graph) and different numbers of potential functions. 


On the same benchmark, we also compared three exact methods for the task of mode evaluation 
that exploit either minimum fill-in ordering or its randomized iterative version: variable elimina¬ 
tion (ELIM), BTD ( |de Givry et al.[ 20061 and AND/OR search (Marinescu and Dechter 20061. 
Elim and BTD exploit the minimum fill-in ordering while AND/OR search used its randomized 
iterative version. In addition, BTD and AND/OR Search exploit a tree decomposition during a 
Depth Eirst Branch and Bound method in order to get a good trade-off between memory space 
and search effort. As variable elimination, they have a worst-case time complexity exponential in 
the treewidth. All methods were allocated a maximum of 1 hour and 4 GB of RAM on an AMD 
Operon 6176 at 2.3 GHz. The results, as reported in Eigure (Right) shown that BTD was able 
to solve more problems than the two other methods for a fixed CPU time. However, on a given 
problem, the best method heavily depends on the problem category. On ParityEeaming, EEIM 
was the fastest method, but it ran out of memory on 83% of the total set of instances, while BTD 
(resp. AND/OR search) used less than 1.7 GB (resp. 4GB). The randomized iterative minimum 
fill-in heuristic used by AND/OR search in preprocessing consumed a fixed amount of time (~ 180 
seconds, included in the CPU time measurements) larger than the cost of a simple minimum fill- 
in heuristic. BDT was faster than AND/OR search to solve most of the instances except on two 
problem categories (ParityEeaming and Einkage). 

To perform this comparison, we ran the followinf implementation of each method. The version 
of EEIM was the one implemented in the combinatorial optimization solver TOOLBAR 2.3 (options 
-i -T3, The ver¬ 
sion of BTD was the one implemented in the combinatorial optimization solver T0ULBAR2 0.9.7 
(options -B=l -0=-3 -nopre. Toulbar2 is available at mulcyber . toulouse . inra . f r/ 
pro jects/toulbar2. This software won the UAI 2010 ([Elidan and Globersonl [20T0]) and 

2014 (Gogate, 2014) Inference Competitions on the MAP task. AND/OR search was the version 
implemented in the open-source version 1.1.2 of DAOOPT (Otten et al. 20121 (options -y -1 35 
--slsX=20 --slsT=10 --Ids 1 -m 4000 -t 30000 --orderTime=18 0forbench- 
marks from computer vision and -y -1 25 --slsX=10 --slsT=6 --Ids 1 -m 4000 
-t 10000 --orderTime=60 for the other benchmarks) which won the Probabilistic Infer¬ 
ence Challenge 2011 ( Elidan and Globerson[ 2011), albeit with a different closed-source ver¬ 
sion (lOtten et al. 1 120121). 

5 From Variable Elimination to Message Passing 

Message passing algorithms make use of messages, which can be described as potential functions 
which are external to the definition of graphical models. On tree-structured graphical models mes¬ 
sage passing algorithms extend the variable elimination algorithm by efficiently computing every 
marginals (or modes) simultaneously, when variable elimination only computes one. On general 
graphical models, message passing algorithms can still be applied but either provide approximate 
results efficiently or have an exponential running cost. 

We present how it may be conceptually interesting to view these algorithms as performing a 
re-parametrization of the original graphical model i.e., modifications of the potentials, instead of 
producing external messages, which are not easy to interpret by themselves. 


Figure 6: Left: Comparison of treewidth bounds provided by MCS (red), minimum degree 
(green), minimum fill-in (blue) and randomized iteralive minimum fill-in (cyan) for the 5 cate¬ 
gories of problems Right: Mode evaluation by three exact methods exploiting minimum fill-in 
ordering or its randomized iterative version. Number of instances solved (x-axis) within a given 
CPU time (loglO scale y-axis) of Elim (red), BTD (green), and AND/OR SEARCH (blue). 

5.1 Message passing and belief propagation 

Message passing algorithms over trees can be described as an extension of variable elimination, 
where the marginals of all variables are computed in a double pass of the algorithm (instead of 
one variable in classical variable elimination). Instead of eliminating a leaf Xi and the potential 
functions involving Xi, we just mark the leaf Xi as “processed" and consider that the new potential 
-0AT. is a “message” sent from Xi to Xpa(i) (the parents of Xi in the tree), denoted as ^i^pa{i)- This 
message is a potential function over Xpa(i) only. We can iterate this process, always applying it 
to a leaf in the subgraph defined by unmarked variables, handling already computed messages as 
unary potentials. 

When only one variable remains unmarked (defining the root of the tree), the combination of 
all the functions on this variable (messages and possibly original potential function) will be equal 
to the marginal unnormalized distribution on this variable. This results directly from the fact that 
the operations are equivalent to variable elimination. The root of the tree defines a directed tree 
where the root is at the top, descendants are below and messages are flowing upwards, to the root. 

To compute the marginal of another variable, one can redirect the tree using this new root. 
Then some subtrees will remain unchanged (in terms of direction from the root of the subtree to 
the leaves) in this new tree and the messages in these subtrees do not need to be recomputed. 

It turns out that in a tree, one can organize all these computations cleverly so that only two 
messages are computed for each edge, one for each possible direction of the edge. 


Formally, in the Sum-product algorithm over a tree (V, E), messages are defined for eaeh 
edge (f, j) G E (there are 2\E\ sueh messages, one for eaeh edge direetion) in a leaves-to-root-to- 
leaves order. Messages pi^j are funetions of Xj, whieh are eomputed iteratively, by the following 

1. First, messages leaving the leaves of the tree are initialized: Vz G V, where z is a leaf of the 

Vj, s.t. (z, j) G E,W{xi,Xj) G A* X Aj,pi^j{xj) ^ 1 

Mark all leaves as proeessed. 

2. Then, messages are sent upward through all edges. Message updates are performed itera¬ 
tively, from marked nodes z to their only unmarked neighbor j through edge (z, j) G E. 
Message updates take the following form: 

Wxj e Aj,pi^j{xj) ^ ( 7 ) 

where K = ^l:ij{x[, Xj)iJi{x',) Pk^i{x'i). 

Mark node j as processed. See Figure |7] for an illustration. 

3. It remains to send the latter messages downward (from root to leaves). This seeond phase of 
message updates takes the following form: 

• f/nmark root node. 

• While there remains a marked node, send update (|7]) from an unmarked node to one of 
its marked neighbors, unmark the eorresponding neighbor. 

4. After the two above steps, messages have been transmitted through all edges in both diree- 
tions. Finally, marginal distributions over variables and pairs of variables (linked by an edge) 
are eomputed as follows: 

Pi{Xi) ^ Pj^i{Xi),\/Xi G Ai, 

Pij{Xi,Xj) ^ ■j^'llJij{Xi,Xj) Y[kf^j,{k,i)&E dk^i{Xi) Y[l^i,{l,j)eE ' 

Ki and K 2 are suitable normalizing eonstants. 

When the graph of the original graphieal model is not a tree, the two-pass message passing al¬ 
gorithm ean no more be applied. Still, for general graphieal models, this message passing approaeh 
ean be generalized in two different ways. 

One ean eompute a tree deeomposition, as previously shown. Message passing ean then be 
applied on the resulting eluster tree, handling eaeh eluster as a eross-produet of variables 
following a bloek-by-bloek approaeh. This yields an exaet algorithm, for whieh eomputa- 
tions ean be expensive (exponential in the treewidth) and spaee intensive (exponential in the 
separator size). A typieal example of sueh algorithm is the algebraie exaet message passing 
algorithm of [Shafer and Sheno^ (|1988|) ; jShenoy and Shaferj (|1990|) . 


Figure 7: Example of message update on a tree. In this example, nodes t, v and w are marked, 
while node s is still unmarked, fit^s is a funetion of all the ineoming messages to node t, exeept 

Alternatively, the Loopy Belief Propagation algorithm ( Frey and Mae Kay] 1998| ) is another 
extension of Message Passing in which messages updates are repeated, in arbitrary order 
through all edges (possibly many times through each edge), until a termination condition is 
met. The algorithm returns approximations of the marginal probabilities (over variables and 
pairs of variables). The quality of the approximation and the convergence to steady state 
messages are not guaranteed (hence, the importance of the termination condition). However, 
it has been observed that EBP often provides good estimates of the marginals, in practice. A 
deeper analysis of message-passing algorithms will be provided in Section]^ 

We have described above the Sum-product algorithm. Max-product, Max-sum algorithms can 
be equivalently defined, for exact computation or approximation of the max-marginal of a joint 
distribution or its logarithm. In algebraic language, updates like defined in formula (|7]) take the 
general form: 

yxj e = ®iJij{x[,Xj)ilJi{x'i} © Hk^i{x'i). 

As for sum-product, the resulting algorithm computes exact ©-marginals on a tree-structured 
graphical model from which the mode of the distribution can be computed while on general graph¬ 
ical models, it provides only approximations. 

5.2 Message Passing and Re-parametrization 

It is possible to use message passing on trees as a re-parametrization technique. Instead of com¬ 
puting external messages, message passing can reformulate the original tree-structured graphical 
model in a new equivalent tree-structured graphical model. By “equivalent” we mean that the 
resulting tree defines exactly the same joint distribution as the original graphical model. In the 
re-parameterized problem, information of interest (marginals) can be directly read in the potential 
functions (IKoller and Friedmanl 120091). 


The idea behind re-parametrization is eoneeptually very simple: when a message is eom- 
puted, instead of keeping it as a message, it is possible to multiply any potential funetion involving 
Xj by using ©. To preserve the joint distribution defined by the graphieal model, we need to 
divide another potential funetion involving Xj by the same message using the inverse of ©j^ 

One possibility is to ineorporate the messages in the binary potentials: we replaee V’ij by 
^jJij © ^i^j © while ijji is divided by {ij^i and V'j is divided by ^i^j. In this ease, eaeh pairwise 

potential ean be shown to be equal to the marginal of the joint potential on {Xj, Xj}. 

The resulting tree-struetured MRF is said to be calibrated to emphasize the faet that all pairs 
of binary potentials sharing a eommon variable agree on their marginals: 

© il^ij = © iJik 


The main advantage of a ealibrated re-parametrization is that it ean be used instead for the 
original model for any further proeessing. This is useful in the eontext of ineremental updates, 
where new evidenee is introdueed inerementally and eaeh reealibration is simpler than a new eali- 
bration ( |Koller and Friedmanj |2009| ) . 

Message passing based re-parameterizations ean be generalized to eyelie graphs. If an exaet 
approaeh using tree deeompositions is followed, messages may have a size exponential in the 
interseetion of pairs of elusters and the re-parametrization will ereate new potentials of this size. 
If these messages are multiplied inside the elusters, eaeh resulting eluster will be the marginal of 
the joint distribution on the eluster variables. The tree-deeomposition is ealibrated and any two 
interseeting elusters agree on their marginals. This is exploited in the Lauritzen-Spiegelhalter and 
Jensen sum-produet-divide algorithms ( Lauritzen and Spiegelhalter(|1988[ Jensen et'aL| 1990| ). In 
this eontext, besides ineremental updates, a ealibrated tree deeomposition allows also to loeally 
eompute exaet marginals for any set of variables in the same eluster. 

If a loeal “loopy” approaeh is used instead, re-parameterizations do not ehange seopes but pro¬ 
vide a re-parameterized model where estimates of the marginals ean be direetly read. For MAP, 
sueh re-parameterizations ean follow elever update rules to provide eonvergent re-parameterizations 
maximizing a well defined eriterion. Typieal examples of this sehema are the sequential version of 
the tree reweighted algorithm (TRWS, Kolmogorov[ , 20061, or the Max Produet Linear Program¬ 
ming algorithm (MPLP, |Globerson and Jaakkola[ 20081 whieh try to optimize a bound on the non- 
normalized probability of the mode. A seminal referenee, published in Russian is ( Sehlesinger[ 
1976|). These algorithms ean be exaet on graphieal models with loops, provided the potential 

funetions are all submodular (often deseribed as the diserete version of eonvexity). 

Deterministic graphical models : message passing algorithms have also been used in deter- 
ministie graphieal models where they are known as “loeal eonsisteney” enforeing or eonstraint 
propagation algorithms. A loeal eonsisteney property defines the targeted ealibration property and 

'Zeros in potential can be dealt with by a proper extension of the algebraic operations, including an inverse for 
zero. If the algebraic structure equipped with © is not a group but only a semi-group or monoid, suitable pseudo 
inverses can often be defined. See (Cooper and Schiex 2004 Gondran and Minoux 20081. 


the enforcing algorithm allows to transform the original network into an equivalent network (defin¬ 
ing the same joint function) that satisfies the desired calibration/local consistency property. Similar 
to LBP, Arc Consistency |Waltz| ( |1972| ); [Rossi et al.| ( |2006[ ) is the most usual form of local consis¬ 
tency and is related to Unit Propagation in SAT Biere et al. (2009). Arc consistency is exact on 
trees and is usually incrementally maintained during an exact tree search, using re-parametrization. 
Because of the idempotency of logical operators, local consistencies always converge to a unique 

Local consistency properties and algorithms for the Weighted CSP are very closely related to 
message passing for MAP. They however are always convergent, thanks to suitable calibration 
properties ( |Schiex[ |2000[ [Cooper and Schiex] [2004[ [Cooper et al.[ [2010[ ) and may also solve tree 
structured or fully submodular problems. 

6 Heuristics and approximations for inference 

We mainly discussed methods for exact inference in graphical models. They are useful if an or¬ 
der for variable elimination with small treewidth is available. In real life applications, interaction 
network are seldom tree-shaped, and their treewidth can be large (e.g. a grid of pixel in image 
analysis) and exact methods cannot be applied anymore. However, they are starting points to 
derive heuristic methods for inference that can be applied to any graphical model. By heuristic 
method, we mean an algorithm that is (a priori) not derived from the optimization of a particular 
criterion, as opposed to what we will call approximation methods. Nevertheless, we shall alleviate 
this distinction and show that good performing heursitics can sometimes be interpreted as approx¬ 
imate methods. For the marginalization task, the most widespread heuristics derived from vari¬ 
able elimination and message passing principles is the Loopy Belief Propagation algorithm (LBP, 
Kschischang et al.[ [2001b described in Section [5.1[ and numerous extensions (e.g. Generalized 

BP, Yedidia et al. 20051 have been proposed since then. In the last decade, a better understanding 
of these heuristics has been reached and they can now be reinterpreted as particular instances of 
variational approximation methods Wainwright and Jordan (2008). A variational approximation 
of a distribution p is defined as the best approximation of p in a class Q of tractable distributions 
(for inference), according to the Kiillback-Leibler divergence. Depending of the application (e.g. 
discrete or continuous variables), several choices for Q have been considered. We are apparently 
far from variable elimination principles and treewidth issues. However, as we just emphasized, 
LBP can be cast in the variational framework. The treewidth of the chosen variational distribution 
depends on the nature of the variables: i) in the case of discrete variables the treewidth is low: the 
class Q is in the majority of cases that of independent variables (mean field approximation), with 
associated treewidth of 0, and some works consider a class Q with associated treewidth of 1; ii) in 
the case of continuous variables, the treewidth of the variational distribution is the same as in the 
original model: Q is in general a class of multivariate Gaussian distributions, for which numerous 
inference tools are available. 

We will illustrate these remarks in Section [^ Before that in the remainder of this section, 
we recall the two key component for a variational approximation method: the Kiillback-Leibler 
divergence and the choice of a class of tractable distributions. We also explain how LBP can be 
interpreted as a variational approximation method. 


6.1 Variational approximations 

The Kiillback-Leibler divergence KL{q\\p) = q{x) log ^ measures the dissimilarity between 
two probability distributions p and q. KL is positive, and it is null if and only if p and q are equal. 
Let us consider now that q is constrained to belong to a family Q which does not include p. The 
solution q* of aigming^Q KL{p\\q) is then the best approximation of p in Q according to diver¬ 
gence KL. If Q is a set of distributions tractable for inference, marginals, mode or normalizing 
constant of q* can be used as approximations of the same quantities on p. 

For example, let us consider a binary Potts model on n vertices whose joint distribution is 

p(a:) = ^ n + X] hjXiXj) 


We can derive its so called mean field approximation, corresponding to the class of fully 
factorized distributions (i.e. an associated graph of treewidth equal to 0): = {q s.t. q{x) = 

Since variables are binary corresponds to joint distributions of independent Bernoulli 
variables with respective parameters g* = qi{l), namely for all q in q{x) = ~ 

qiY~^\ The optimal approximation (in terms of Kiillback-Leibler divergence) within this class 
of distributions is characterized by the set of g/s which minimize KL{q\\p). Denoting Eg the 
expectation with respect to g, KL{q\\p) — log Z is 

Eg I ^ [Xi logg* -f (1 - Xi) log(l - g*)] - ^ ^ hjXiXj | 

\ * * (*j)6S / 

= [qi log g* + (1 - qi) log(l - qi)] - Y “ Y 

i i 

This expectation has a simple form because of the specific structure of g. Minimizing it with 
respect to g* gives the fixed-point relation that each optimal gf^^’s must satisfy: 

log =ai+ Y bijqf^. 


leading to 



qY^ is equal the conditional probability that Xj = 1 given that all other variables are fixed to their 
mean field expected values under distribution g, which explain the name of mean field approxima¬ 
tion. Note that in general g* is not equal to the marginal p(Xj = 1). 

The choice of the class Q is indeed a critical trade-off with opposite desirable properties: it 
must be large enough to guarantee a good approximation and small enough to contain only man¬ 
ageable distributions. We will focus in the next section on a particular choice for Q, the Bethe 
class, that will enable to link the LBP heuristics to variational methods. Other choices are possible 


and have been used. For instanee, the Chow-Liu algorithm (Chow and Liu 1968) eomputes the 

minimum of KL{q\ \p) for q a distribution whose assoeiated graph is a spanning tree of the graph of 
p. This amounts to eomputing the best approximation of p among graphieal models with treewidth 
equal to 1. In the struetured mean field approximation ( Ghahramani and Jordan^ 1997t Wain 
wright and Jordanf |2008| ) the distribution of a faetorial Hidden Markov Model is approximated in 
a variational approaeh: the multivariate hidden state is deeoupled and the variational distribution 
of the eonditional distribution of hidden states is that of independent Markov ehains (here again, 
the treewidth is equal to 1). Finally, an alternative to treewidth reduetion is to ehoose the varia¬ 
tional approximation in the class of exponential distributions This has been applied for Gaussian 
process classification (Kim and Ghahramani[ 2006) using a multivariate Gaussian approximation 

of the posterior distribution of the hidden field. This relies on the use of the EP algorithm (Minka 

20011. Note that in this algorithm, KL{p\\q) is minimized instead of KL{q\\p). 

6.2 LBP heuristics as a variational method 

The mean field approximation is the most naive approximation among the so-called Kikuchi 
approximations from statistical mechanics, also known as Cluster Variational Methods (CVM, 
Kikuchi 1951| ). Originally, they are not defined by a minimization of the Kiillback-Leibler diver¬ 
gence, but as an approximation of the minimum of the free energy H{q), 

W(9) = -E q{x) log n -f ^g(x) logg(x). 

X BgB X 

The two problems are equivalent since H(q) is equal to KL(q\\p) — \ogZ and is minimum when 
p = q.lfp and q are pairwise MRF whose associated graph G = (V, E) is the same and is a tree, 

then q{x) = > where {q{xi,Xj)} and {q{xi)} coherent sets of order 1 and order 2 

marginals of q, and di is the degree of vertex i in the tree. In this particular case the Bethe free 
energy, defined as H{q) is expressed as (see Heskes et'H^ |2004[ [Yedidia et akf 20051 

H{q) = - 

q{xi,Xj) \ogi>{xi,Xj) - LL q{xi) \ogi>{xi) 

iev Xi 

E E q{xi,Xj)\ogq{xi,Xj) + E(“'<-i) E q{xi) logg(a;i) 

(* Xi,Xj 


The Bethe approximation consists in applying to an arbitrary graphical model the same formula of 
the free energy than for a tree and minimizing it over the variables {q{xi, Xj)} and {q{xi)} under 
the constraint that they are probability distributions and that q{xi) is the marginal of q{xi, Xj). By 
extension the Bethe approximation can be interpreted as a variational method associated to the 


family of unnormalized distributions that can be expressed as q{x) = 

{q{xi, Xj)} and {q{xi)} coherent sets of order 1 and order 2 marginals. 


It has been established by Yedidia et al. (20051 that the fixed points of LBP (when they exist. 

convergence is still not well understood, see |Weiss| ( |2000| ) and |Mooij and Kappen] ( |2007| )) are 
stationary points of the problem of minimizing the Bethe free energy, or equivalently KL{q\\p) 
with q in the class of distributions. 


Furthermore Yedidia et al. ( |2005 ) showed that for any elass of distributions Q eorresponding 
to a partieular CVM method, it is possible to define a generalized BP algorithm whose fixed points 
are stationary points of the problem of minimizing KL{q\\p) in Q. 

The drawbaek of the LBP algorithm and its extensions (Yedidia et al4 2005) it that they are 

not assoeiated with any theoretieal bound on the error made on the marginals approximations. 
Nevertheless, this algorithm is inereasingly used for inferenee in graphieal models for its good 
behavior in praetiee ( [Murphy et ah , 19991. It is implemented in several pieees of software for 
inferenee in graphieal models like libDAI ( Mooij[[2010| ) or OpenGM2 (Andres et al. 2012). 

7 Illustration with coupled HMM 

We now illustrate how the the proeedures we have deseribed have been used for parameter es¬ 
timation in a elaborated example: eoupled HMM. Consider a set of / signals observed at times 
t G {1,... T} and denote 01 the variable eorresponding to the observed signal i at time t. HMM 
models assume that the distribution of eaeh 01 is eonditional to some hidden state HI, where the 
series {Hl)t=i^...T is a Markov ehain. Coupled HMM further assumes that the hidden states display 
a eorrelation at eaeh time (see jJordEi] , 2004[ Wainwright and Jordan 20081, resulting in the graph¬ 
ieal model displayed in Figure]^ Sueh models have been eonsidered in a series of domains sueh 
as bioinformaties ( Choi et ^[2013 1, eleetroeneephalogram analysis ( jZhong and Chosh 2002) or 
speeeh reeognition ( Noek and Ostendorf[|2003 ). More eomplex versions are sometimes eonsid¬ 
ered, assuming dependeney between two times series at two consecutive time steps. For this model, 
the joint distribution of the hidden variables H = and observed variables O = {ODi^t fac¬ 

torizes as 

p{h,o)QC h]) 


( 8 ) 

where ht = {hl)i stands the vector of all hidden variables at time t and where -0^ encodes the 
Markovian dependency of the hidden variables within a series, encodes the coupling between 
the hidden variables of all series at a given time and encodes the emission of the observed 
signal given the corresponding hidden state. A fairly comprehensive exploration of these models 
can be found in ( Murphy [|2002[ ). 

7.1 Exact EM algorithm for coupled HMM 

Coupled HMM are examples of incomplete data models, as they involve variables (O, H) where 
only the variables O are observed. Maximum likelihood inference for such a model aims at finding 
the value of the parameter 6 that maximizes the (log-)likelihood of the observed data o, that is 

to solve maxelogp® 

The most popular algorithm to achieve this task is the EM algorithm. 

([Dempster et al.[ 1977), that can be rephrased in the following way. Observe that 

logp®(o) = E{\ogp^{o,H)\o) — E {log p^{H\o)\o) = maxF(6*,g), 



Figure 8: Graphical representation of p{h, 6) for a coupled HMM 


F{e,q) = E,i\ogp\o,H)) - E,{\ogq{H)) = \ogp\o) - KL{q{H)\\p\H\o)), 

and q stands for any distribution on the hidden variables H, E stands for the expectation under 
the true distribution p and Eg under the arbitrary distribution q. The EM algorithm consists in 
alternatively maximizing E{9, q) with respect to q (E-step) and to 9 (M-step). The solution of the 
E-step is q{h) = p{h\o) since the Kullback-Leibler divergence is minimal (and null) in this case. 

Exact computation of p{h\o) can be performed by observing that ([^ can be rewritten as 

p{h,o) Qc ht)^ X 

where encodes both the Markovian dependency and the coupling of the hidden variables 
within a given time step. This writing is equivalent to merging all hidden variables of a given 
time step and corresponds to the graphical model given in Figure Denoting K the number of 
possible values for each hidden variables, we end up with a regular hidden Markov model with 
possible hidden states. Both p{h\o) (and its mode) can then be computed in a exact manner with 
either the forward-backward recursion or the Viterbi algorithm, which have the same complexity: 
0(T^^). The exact calculation can therefore be achieved provided that remains small enough, 
but becomes intractable when the number of signals / exceeds few tens. 

7.2 Approximate E step for the EM algorithm 

A first approach to derive an approximate E step it to seek for a variational approximation of p{h\o) 
assuming that q{h) is restricted to a family Q of tractable distributions, as described in Section 
6.1[ This approach results in the maximization of a lower bound of the original log-likelihood. 

The choice of Q is critical and is a balance between approximation accuracy and computation 
efficiency. Choosing Q typically amounts to breaking down some dependencies in the original 


Figure 9: Graphical representation of p(/i, o) for a coupled HMM when merging hidden variables 
at each time step 

distribution to end up with some tractable distribution. In the case of coupled HMM, the simplest 
distribution is the class of fully factorized distributions (i.e. mean field approximation), that is 

Qo = {q- q{h) = Y[Y[qit{hl)}. 

i t 

Such an approximation corresponds to the graphical model of Figure [T^ Intuitively, this approxi¬ 
mation replaces the stochastic influence existing between the hidden variables by its mean value. 

Figure 10: Graphical representation for the independent mean-field approximation of p{h, o) in 
a coupled HMM. Observed variables are indicated in light gray since they are not part of the 
variational distribution which is a distribution only on the hidden variables. 

As suggested in |Wainwright and Jordan] ( |2008[ ), a less drastic approximation can be obtained 
using the distribution family of independent heterogeneous Markov chains: 

Qm = {g : q{h) = \\_W_qit{.h\\h\_^)} 

i t 

which is consistent with the graphical representation of independent HMM, as depicted in Fig¬ 
ure [HI 

An alternative to the approximate maximization of max^ F(9, q) consists in seeking for the 
maximum of an approximation F{9,q) of F{9,q) which involves only marginals of p{h\o) on 


Figure 11: Graphical representation for the independent Markov approximation of p(/i, o) in a cou¬ 
pled HMM. Observed variables are indicated in light gray since they are not part of the variational 
distribution which is a distribution only on the hidden variables. 

subsets of variables in H of limited size. Then the LBP algorithm can be used to provide an 
approximation of these marginals. This approach has been proposed in Heskes et aH ( 2004] ) where 
the authors approximated the negative entropy term Eq{\og q{H)) in F{9, q) by its so-called Bethe 
approximation as follows (the first term in F by definition depends only on marginals of variables 
involved in the potential functions ^ 'ip'", 

E E ft’,) log ftj) + E'i^('>*) iog9^('!<) 

it t 

+ EE {hi, ol) log q^{h],o]) - EE Iq{hl) log q{hl) 

it it 

because each hidden variable F[l has degree J -f 1 in the original graphical model given in Figure 
The advantage of this approach compared to the variational approximations based on fa mi lies 
Qo and Qm is that it provides an approximation of the joint conditional distribution of all hidden 
variables within the same time step. 

8 Conclusion 

This tutorial on variable elimination for exact and approximate inference is an introduction to the 
basic concepts of variable elimination, message passing and their links with variational methods. 
It introduces these fields to statisticians confronted with inference in graphical models. The main 
message is that exact inference should not be systematically considered as out of reach. Before 
looking for an efficient approximate method, a wise advice would be to know the treewidth of the 
graphical model. In practice this question is not easy to answer exactly but several implementations 
of the presented algorithms exist and provide an upper bound of the treewidth. 

Obviously this tutorial is not exhaustive, since we have chosen to focus on the fundamental 
concepts. While many important results on treewidth and graphical models have several decades 


in age, the area is still lively and we now broaden our diseussion to a few reeent works whieh taekle 
some ehallenges related to the eomputation of the treewidth. 

Beeause they offer effieient algorithms, graphieal models with a bounded treewidth offer an 

attraetive target when the aim is to learn a model that best represents some given sample. In (Kumar 

and Baeh 20121, the problem of learning the strueture of an undireeted graphieal model with 

bounded treewidth is approximated by a eonvex optimization problem. The resulting algorithm 
has a polynomial time eomplexity. As diseussed in ( Kumar and Baeh[ 2012), this algorithm is 
useful to derive traetable eandidate distributions in a variational approaeh, enabling to go beyond 
the usual variational distributions with treewidth zero or 1. 

For optimization (MAP), other exaet teehniques are offered by tree seareh algorithms sueh 
as Braneh and Bound ( [Lawler and Wood| |1966| ), that reeursively eonsider possible eonditioning 
of variables. These teehniques often exploit limited variable elimination proeessing to prevent 

exhaustive seareh, either using message-passing like algorithms (Cooper et al. 2010) to eompute 

bounds that ean be used for pruning, or by performing “on-the-fly” elimination of variables with 
small degree ( |Larrosa[ [2000[ ). 

Beyond pairwise potential funetions, the time needed for simple update rules of message pass¬ 
ing beeomes exponential in the size of the seope of the potential funetions. However, for speeifie 
potential funetions involving many (or all) variables, exaet messages ean be eomputed in reason¬ 
able time, even in the eontext of eonvergent message passing for optimization. This ean be done 
using poly time graph optimization algorithms sueh as shortest path or mineost flow algorithms. 
Sueh funetions are known as global potential funetions ( Vieente et al.[ [2008 Wemer[ 2008) in 
stoehastie graphieal models, and as global eost funetions ( Lee and Leung , [2009 Allouehe et al. 
2012 Lee and Leung[ [20T2 ) in deterministie Cost Funetion Networks. 

Different problems appears with eontinuous variables, where eounting requires integration of 
funetions. Here again, for speeifie families of distributions, exaet (analytie) eomputations ean be 
obtained for distributions with eonjugate distributions. For message passing, several solutions have 

been proposed. For instanee, a reeent message passing seheme proposed by |Noorshams and Wain 
wright ( [2013[ ) relies on the eombination of orthogonal series approximation of the messages, and 
the use of stoehastie updates. We refer the reader to referenees in (Noorshams and Wainwrightj 

20131 for a state-of-the-art of alternative methods dealing with eontinuous variables message pass¬ 

Finally, we have exeluded Monte-Carlo methods from the seope of our review. But reeent 
sampling algorithms have been proposed that use exaet optimization algorithms to sample points 
with high probability in the eontext of estimating the partition funetion. Additional eontrol in the 
sampling method is needed to avoid biased estimations: this may be hashing funetions enforeing a 

fair sampling (Ermon et al. 2014) or randomly perturbed potential funetions using a suitable noise 

distribution (Hazan et al. 2013). 

We hope this review will enable more cross-fertili z ations of this sort, combining statistics and 
computer science, stochastic and deterministic algorithms for inference in graphical models. 



D. Allouche, C. Bessiere, P. Boizumault, S. de Givry, P. Gutierrez, S. Loudni, J.P Metivier, and 
T. Sehiex. Deeomposing Global Cost Funetions. In Proceedings ofAAAI’ll, pages 407-413, 
2012 . 

P Amestoy, T. A. Davis, and I. S. Duff. An approximate minimum degree ordering algorithm. 
SIAM Journal on Matrix Analysis and Applications, 17(4):886-905, 1996. 

E. Amir. Approximation algorithms for treewidth. Algorithmica, 56:448-479, 2010. 

B. Andres, Beier T., and J. H. Kappes. OpenGM: A C++ library for diserete graphieal models. 
ArXiv e-prints, 2012. URL, 

Bjoem Andres, Thorsten Beier, and Joerg H. Kappes. Open gm benehmark - evpr’2013 seetion. 

See http : / / hci . iwr . uni-heidelberg. de/opengm2 / ? 10=benchmark, 2013. 

S. Amborg. Effieient algorithms for eombinatorial problems on graphs with bounded deeompos- 
ability — a survey. BIT, 25:2-23, 1985. 

S. Arnborg, D. G. Cornell, and A. Proskurowski. Complexity of finding embeddings in a fc-tree. 
SIAM J. Algebraic Discrete Methods, 8:277-284, 1987. 

P. Austria, T. Pitassi, and Y. Wu. Inapproximability of treewidth, one-shot pebbling, and related 
layout problems. In International Workshop on Approximation Algorithms for Combinatorial 
Optimization Problems (APPROX),, pages 13-24, Boston, USA, 2012. 

U. Bertele and F. Brioshi. Nonserial Dynamic Programming. Aeademie Press, 1972. 

Armin Biere, Marijn Heule, and Hans van Maaren. Handbook of satisfiability, volume 185. los 
press, 2009. 

C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006. 

S. Bistarelli, U. Montanari, and F. Rossi. Semiring based eonstraint solving and optimization. 
Journal of the ACM, 44(2):201-236, 1997. 

H. F. Bodlaender. A tourist guide through treewidth. Developments in Theoretical Computer 
Science, 1, 1994. 

H. F. Bodlaender. A linear time algorithm for finding tree-deeompositions of small treewidth. 
SIAM Journal on Computing, 25:1305-1317, 1996. 

H. F. Bodlaender and A. M. C. A. Koster. Treewidth eomputations I. upper bounds. Information 
and Computation, 208(3):259-275, 2010. 

H. F. Bodlaender, J. R. Gilbert, H. Hafsteinsson, and T. Kloks. Approximating treewidth, path- 
width, frontsize, and shortest elimination tree. Journal of Algorithms, 18:238-255, 1995. 


H. L. Bodlaender, A. Koster, and F. van den Eijkhof. Preprocessing rules for triangulation of 
probabilistic networks. Computational Intelligence, 21(3):286-305, 2005. 

H. L. Bodlaender, P. Drange, M. S. Dregi, F. V. Fomin, D. Lokshtanov, and M. Pilipczuk. A 
c^n 5-approximation algorithm for treewidth. In IEEE Symposium on Eoundations of Computer 
Science, pages 499-508, 2013. 

Hans L Bodlaender. A partial k-arboretum of graphs with bounded treewidth. Theoretical computer 
science, 209(1): 1-45, 1998. 

G. Casella and E. I. George. Explaining the Gibbs sampler. The American Statistician, 46(3): 
167-174, 1992. 

H. Choi, D. Eermin, A. Nesvizhskii, D. Ghosh, and Z. Qin. Sparsely correlated hidden Markov 
models with application to genome-wide location studies. Bioinformatics, 29(5):533-541, 2013. 

C. K. Chow and C.N. Eiu. Approximating discrete probability distributions with dependence trees. 
IEEE Transactions on Information Theory, 14(3):462-467, 1968. 

S.A. Cook. The complexity of theorem proving procedures. In 3’’'^ ACM symp. on theory of 
computing, pages I5I-I58, 1971. 

M C. Cooper. Cyclic consistency: a local reduction operation for binary valued constraints. Artifi¬ 
cial Intelligence, 155(l-2):69-92, 2004. 

M C. Cooper and T. Schiex. Arc consistency for soft constraints. Artificial Intelligence, 154(1-2): 
199-227, 2004. 

M. C. Cooper, S. de Givry, M. Sanchez, T. Schiex, M. Zytnicki, and T. Werner. Soft arc consistency 
revisited. Artificial Intelligence, 174:449-478, 2010. 

M. Davis and H. Putnam. A computing procedure for quantification theory. Journal of the ACM, 
7(3):210-215, 1960. 

S. de Givry, T. Schiex, and G. Verfaillie. Exploiting Tree Decomposition and Soft Eocal Consis¬ 
tency in Weighted CSP. In Proceedings of the National Conference on Artificial Intelligence, 
AAAI-2006, pages 22-27, 2006. 

R. Dechter. Bucket elimination: A unifying framework for reasoning. Artificial Intelligence, 113 
(l-2):41-85, 1999. 

R. Dechter and J. Pearl. Network-based heuristics for constraint satisfaction problems. In E. Kanal 
and V. Kumar, editors. Search in Artificial Intelligence, chapter II, pages 370-425. Springer- 
Verlag, 1988. 

A. P. Dempster, N. M. Eaird, and D. B. Rubin. Maximum likelihood from incomplete data via the 
EM algorithm. Journal of the Royal Statistical Society B, 39:1-38, 1977. 

R. J. Duffin. Topology of series-parallel networks. Journal of Mathematical Analysis and Appli¬ 
cation, 10(2):303-313, 1965. 


G. Elidan and A. Globerson. UAI inference challenge 2010. See 

pro ject/UAIlO, 2010. 

G. Elidan and A. Globerson. The probabilistic inference challenge. See 

hu j i . ac. il/pro ject/PASCAL/ index . php, 2011. 

S. Ermon, C. Gomes, A. Sabharwal, and B. Selman. Eow-density parity constraints for hashing- 
based discrete integration. In Proceedings of the 31st International Conference on Machine 
Learning, pages 271-279, 2014. 

E. V. Eomin and Y. Villanger. Treewidth computation and extremal combinatorics. Combinatorica, 
32(3):289-308, 2012. 

J. Eourier. Memoires de I’Academie des sciences de I’Institut de France 7, chapter Histoire de 
r Academic, partie mathematique (1824). Gauthier-Villars., 1827. 

B. Erey and D. MacKay. A revolution: Belief propagation in graphs with cycles. In Advances in 
Neural Information Processing Systems, pages 479-485. MIT Press, 1998. 

N. Eriel, A. N. Pettitt, R. Reeves, and E. Wit. Bayesian inference in hidden Markov random fields 
for binary data defined on large lattices. Journal of Computational and Graphical Statistics, 18: 
243-261, 2009. 

D. Eulkerson and O. Gross. Incidence matrices and interval graphs. Pacific journal of mathematics, 
15(3):835-855, 1965. 

Z. Ghahramani and M. Jordan. Eactorial hidden Markov models. Machine learning, 29(2-3): 
245-273, 1997. 

A; Globerson and T. Jaakkola. Eixing max-product: Convergent message passing algorithms for 
map Ip-relaxations. In Advances in Neural Information Processing Systems, pages 553-560, 

V. Gogate. UAI 2014 inference competition. 
uail4-competition, 2014. 

M. Gondran and M. Minoux. Graphs, Dioids and Semirings, volume 41 of Operations Re¬ 
search/Computer Science Interfaces Series. Springer, 2008. 

D. J. Gordon, N. J.and Salmond and A. E. M. Smith. Novel approach to nonlinear/non-gaussian 
bayesian state estimation. lEE Proceedings on Radar and Signal Processing, 140(2): 107-113, 

T. Kazan, S. Maji, and T. Jaakkola. On sampling from the Gibbs distribution with random maxi¬ 
mum a-posteriori perturbations. In Advances in Neural Information Processing Systems, pages 
1268-1276, 2013. 

P. Heggernes, S. Eisenstat, G. Kumfert, and A. Pothen. The computational complexity of the 
minimum degree algorithm. In 14th Norwegian Computer Science Conference, Troms, Norway, 
2001 . 


T. Heskes, O. Zoeter, and W. Wiegerinck. Approximate expectation maximization. Advances in 
Neural Information Processing Systems, 16:353-360, 2004. 

F. Jensen, K. Olesen, and S. Andersen. An algebra of bayesian belief universes for knowledge- 
based systems. Networks, 20(5):637-659, 1990. 

F. V. Jensen and T. D. Nielsen. Bayesian Networks and Decision Graphs. Springer Publishing 
Company, Incorporated, 2nd edition, 2007. 

M. Jordan. Graphical models. Statistical Science, pages 140-155, 2004. 

K. Kask, A. Gelfand, L. Otten, and R. Dechter. Pushing the power of stochastic greedy ordering 
schemes for inference in graphical models. In AAAI, 2011. 

R. Kikuchi. A theory of cooperative phenomena. Physical Review, 81(6):988-1003, 1951. 

H.C. Kim and Z. Ghahramani. Bayesian gaussian process classification with the EM-EP algorithm. 
IEEE Transactions on Pattern Analyis and Machine Intelligence, 28(12): 1948-1959, 2006. 

J. Kohlas. Information algebras: Generic structures for inference. Springer Science & Business 
Media, 2003. 

D. Roller and N. Eriedman. Probabilistic Graphical Models: Principles and Techniques. MIT 
Press, 2009. 

V. Kolmogorov. Convergent tree-reweighted message passing for energy minimization. IEEE 
Transactions on Pattern Analysis and Machine Intelligence, 28(10): 1568-1583, 2006. 

E R. Kschischang, B. J. Erey, and H-A. Eoeliger. Eactor graphs and the sum-product algorithm. 
IEEE Transactions on Information Theory, 47(2):498 -519, 2001. 

K. S. Sesh Kumar and E. Bach. Convex relaxations for learning bounded treewidth decomposable 
graphs. In Proceedings of the International Conference on Machine Learning (ICML), Atlanta, 
United States, 2012. 

J. Eagergreen. Efficient parallel algorithms for graphs of bounded treewidth. Journal of Algo¬ 
rithms, 20:20-44, 1996. 

J. Earrosa. Boosting search with variable elimination. In Principles and Practice of Constraint 
Programming - CP 2000, volume 1894 of LNCS, pages 291-305, Singapore, September 2000. 

S. E. Eauritzen. Graphical Models. Clarendon Press, 1996. 

S.E. Eauritzen and D.J. Spiegelhalter. Eocal computations with probabilities on graphical struc¬ 
tures and their application to expert systems. Journal of the Royal Statistical Society - Series B, 
50:157-224, 1988. 

E. Eawler and D. Wood. Branch-and-bound methods: A survey. Operations Research, 14(4): 
699-719, 1966. 


J. Lee and K. L. Leung. Towards effieient eonsisteney enforeement for global eonstraints in 
weighted eonstraint satisfaetion. In International Conference on Artificial Intelligence, vol¬ 
ume 9, pages 559-565, 2009. 

J. H. M. Lee and K. L. Leung. Consisteney Teehniques for Global Cost Funetions in Weighted 
Constraint Satisfaetion. Journal of Artificial Intelligence Research, 43:257-292, 2012. 

S. Z. Li. Markov random field modeling in image analysis. Springer-Verlag, 2001. 

J. W. H. Liu. The multifrontal method for sparse matrix solution: Theory and praetiee. SIAM 
Review, 34:82-109, 1992. First papers on the multifrontal teehnique go baek to 1983. 

L. Lovasz. Graph minor theory. Bulletin of the American Mathematical Society, 43:75-86, 2005. 

R. Marineseu and R. Deehter. Memory intensive braneh-and-bound seareh for graphieal models. 
In proceedings of the National Conference on Artificial Intelligence, AAAI-2006, pages 1200- 
1205, 2006. 

T. Minka. A family of algorithms for approximate Bayesian inference. PhD thesis, MIT, 2001. 

J. M. Mooij. libDAI: A free and open souree C-i-i- library for diserete approximate inferenee in 
graphieal models. Journal of Machine Learning Research, 11:2169-2173, August 2010. URL, 

J. M. Mooij and H. J. Kappen. Suffieient eonditions for eonvergenee of the sum-produet algorithm. 
IEEE Transactions on Information Theory, 53(12):4422-4437, 2007. 

K. Murphy. Dynamic bayesian networks: representation, inference and learning. PhD thesis. 
University of California, Berkeley, 2002. 

K. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012. 

K. Murphy, Y. Weiss, and M. Jordan. Loopy belief propagation for approximate inferenee: An 
empirieal study. In Proceedings of the 15th conference on Uncertainty in Artificial Intelligence, 
pages 467-475, 1999. 

H. Noek and M. Ostendorf. Parameter reduetion sehemes for loosely eoupled HMMs. Computer 
Speech & Language, 17(2):233-262, 2003. 

N. Noorshams and M. J. Wainwright. Belief propagation for eontinuous state spaees: stoehastie 
message-passing with quantitative guarantees. Journal of Machine Learning Research, 14(1): 
2799-2835, 2013. 

NICTA Optimization Researeh Group. Minizine ehallenge 2012. See http://www., 2012. 

L. Otten, A. Ihler, K. Kask, and R. Deehter. Winning the PASCAL 2011 MAP ehallenge with 
enhaneed AND/OR braneh-and-bound. In NIPS DISCML Workshop, Lake Tahoe, USA, 2012. 


J. Pearl. Probabilistic Reasoning in Intelligent Systems, Networks of Plausible Inference. Morgan 
Kaufmann, Palo Alto, 1988. 

L. Perkovic and B. Reed. An improved algorithm for finding tree decompositions of small 
treewidth. InternationalJournal of Foundations of Computer Science, 11:365-371, 2000. 

C. Pralet, G. Verfaillie, and T. Schiex. An algebraic graphical model for decision with uncertainties, 
feasibilities, and utilities. Journal of Artificial Intelligence Research, pages 421-489, 2007. 

L. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. 
Proceedings of the IEEE, 77(2):257-286, 1989. 

R. Reeves and A. N. Pettitt. Efficient recursions for general factorisable models. Biometrika, 91: 
751-757, 2004. 

C.P Robert and G. Casella. Monte Carlo Statistical Methods. Springer- Verlag, New York, 2004. 

N. Robertson and P. Seymour. Graph minors, ii. algorithmic aspects of tree-width. Journal of 
Algorithms, 7(3):309-322, 1986. 

F. Rossi, P. van Beek, and T. Walsh, editors. Handbook of Constraint Programming. Elsevier, 

T. Schiex. Arc consistency for soft constraints. In Principles and Practice of Constraint Program¬ 
ming - CP 2000, volume 1894 of LNCS, pages 411-424, Singapore, September 2000. 

T. Schiex, H. Fargier, and G. Verfaillie. Valued constraint satisfaction problems: hard and easy 
problems. In Proc. of the IJCAI, pages 631-637, Montreal, Canada, August 1995. 

M. I. Schlesinger. Sintaksicheskiy analiz dvumernykh zritelnikh signalov v usloviyakh pomekh 
(Syntactic analysis of two-dimensional visual signals in noisy conditions). Kibernetika, 4:113- 
130, 1976. 

G. Shafer and P. Shenoy. Focal computations in hyper-trees. Working paper 201, School of 
business. University of Kansas, 1988. 

P. Shenoy and G. Shafer. Axioms for probability and belief-function propagation. In Proceed¬ 
ings of the 6th Conference on Uncertainty in Artificial Intelligence, pages I69-I98, Cambridge, 
USA, 1990. 

R. Tarjan and M. Yannakakis. Simple linear-time algorithms to test chordality of graphs, test 
acyclicity of hypergraphs and selectively reduce acyclic hypergraphs. SIAM Journal of Comput¬ 
ing, 13(3):566-579, 1984. 

F. Vandenberghe and M. S. Andersen. Chordal graphs and semidefinite optimization. Foundations 
and Trends in Optimization, l(4):241-433, 2014. 

S. Vicente, V. Kolmogorov, and C. Rother. Graph cut based image segmentation with connectivity 
priors. In Computer Vision and Pattern Recognition, CVPR 2008, pages 1-8, Alaska, USA, 2008. 


M. J. Wainwright and M. 1. Jordan. Graphical models, exponential families, and variational infer¬ 
ence. Foundations and Trends in Machine Learning, 1(1-2): 1-305, 2008. 

D. L. Waltz. Generating semantic descriptions from drawings of scenes with shadows. Technical 
Report A1271, M.I.T., Cambridge MA, 1972. 

Y. Weiss. Correctness of local probability propagation in graphical models with loops. Neural 
Computation, 12(1), 2000. 

T. Werner. High-arity interactions, polyhedral relaxations, and cutting plane algorithm for soft 
constraint optimisation (map-mrf). In Computer Vision and Pattern Recognition, CVPR 2008, 
pages 1-8, Alaska, USA, 2008. 

J. Yedidia, W. Freeman, and Y. Weiss. Constructing free energy approximations and generalized 
belief propagation algorithms. IEEE Transactions on Information Theory, 51(7):2282-2312, 

S. Zhong and J. Ghosh. HMMs and coupled HMMs for multi-channel EEG classification. In 
Proceedings of the IEEE International Joint Conference on Neural Networks, volume 2, pages 
1254-1159, Honolulu, Hawai, 2002.