# Simulated and Deterministic Annealing

## Maximum Common Subgraph

**Motivation**. Chemical compounds exhibit structural patters shared between compounds of the same family. One of these patterns is the circular (hexagonal) structure formed by carbon (C) atoms linked by single or double bonds (see {numref}`Aspirin`-Left and Center) where the non-labeled vertices denote C atoms. In this example, the presence of the hexagonal ring in the popular [Aspirin](https://pubchem.ncbi.nlm.nih.gov/compound/Aspirin) ({numref}`Aspirin`-Right) is well known. 

The development of AI pattern matching techniques for detetecting substructures has powered the field of Bioinformatics. Nowadays, the analysis of exponentially growing datasets of proteins such as the Protein Data Bank [PDB](https://www.wwpdb.org/) or of chemical compounds such as [PubChem](https://pubchem.ncbi.nlm.nih.gov/) is a must in AI. 

```{figure} ./images/Topic2/MCS-Aspirin-Photoroom.png
---
name: Aspirin
width: 800px
align: center
height: 300px
---
Structure of the Aspirin.
```

**MCS**. Comparing substructures is not an easy problem. Whereas comparing substrings is polynomial, comparing substrutures is a well-known NP problem called the <span style="color:#f88146">Maximum Common Subgraph</span> or MCS: 

*Given two graphs $G_1=(V_1,E_1)$  and $G_2=(V_2,E_2)$, what is the graph $H=(V,E)$ that is *common* to $G_1$ and $G_2$ with the maximum number of nodes*?

Herein, *common* means the following: $V = V\subseteq V_1$ and $E\subseteq E_1$ and exists an injective (one-to-one) mapping $f:V\rightarrow V_2$ such that $(i,j)\in E$ iff $(f(i),f(j))\in E_2$. 

This <span style="color:#f88146">**subsetness flavor**</span> clarifies the fact that <span style="color:#f88146">$\text{MCS}\in \text{NP}$</span>. Actually, MCS does not only search for one common subset but for *the subset with the maximum number of nodes*. 

This is quite clear in the following *Breaking Bad* problem: <span style="color:#f88146">What is the structural coincidence between Ectasy and Amphetamine?</span>. 
- According to PubChem, [Ectasy](https://pubchem.ncbi.nlm.nih.gov/compound/3_4-Methylenedioxymethamphetamine) is derived from the [Amphetamine](https://pubchem.ncbi.nlm.nih.gov/compound/Amphetamine).
- Looking at {numref}`Drugs`-Center (Ectasy)
comes from removing an Hidrogen atom from Amphetamine ({numref}`Drugs`-Right), adding a carbon and an oxigen-carbon pentagonal cycle. 
- The MCS is in {numref}`Drugs`-Left.

```{figure} ./images/Topic2/Drugs-Photoroom.png
---
name: Drugs
width: 800px
align: center
height: 200px
---
Maximal Common Substructure (left) of two design drugs (center and right).
```

All the figures have been generated by the [RDKit: Open-Source Cheminformatics Software](https://www.rdkit.org/). 


## Graph Matching
Solving MCS via a polynomial approximation requires *finding the injective function* $f$ between the vertices $V$ of small graph $X=(V,E)$ and those $V'$ of larger graph $Y=(V',E')$ in polynomial time. 

### Assignments and Matchings

**Node assignments**. Consider the (undirected) graphs: 

$$
\begin{aligned}
&\begin{array}{cll}
\text{Graph}  & \text{Nodes} & \text{Edges}\\
X & V=\{a,b,c\} & E=\{(a,b),(b,c)\}\\
Y & V'=\{1,2,3,4,5\} & E=\{(1,2),(1,3),(1,4),(3,4)\}\\
\end{array}
\end{aligned}
$$

As we have $n=|V'|=5$ and $r=|V|=3$ we have $P(n,r)=n\cdot (n-1)\cdot (n-r + 1)$ **r-permutations** i.e. one-to-one functions $f$ or <span style="color:#f88146">**node assignments**</span> that can be potential solutions to the MCS problem. In this case, we have $P(5,3)=5\cdot 4\cdot 3=60$ different  assignments. Then, <span style="color:#f88146">what assignments provide the MCS between $X$ and $Y$</span>? 

For the graphs $X$ and $Y$ the size of the MCS is $2$, since $X$ has only two edges and if it is a subgraph of $Y$ (as it is really is), the size of the MCS has $|V|=3$ nodes and |E|=2$ edges. 

However, this depends on how good is the injective function $f$. For the one in {numref}`Matching` we have: 

$$
f(a)=1,\; f(b)=2\;\text{and}\; f(c)=3\;.
$$

As we can see, $(a,b)\in E$ and $(1=f(a),2=f(b))\in E'$. However, for the other edge $(b,c)\in E$ we have that $(2=f(b),3=f(c))\not\in E'$: As a result $X$ and $Y$ <span style="color:#f88146">**do only have one common edge for $f$**</span> (in magenta). 


```{figure} ./images/Topic2/Matching1-Photoroom.png
---
name: Matching
width: 600px
align: center
height: 400px
---
non-Maximal Common Substructure.
```

Consider now the assignment: 

$$
f(a)=1,\; f(b)=4\;\text{and}\; f(c)=5\;.
$$

which indeed leads to a MCS: 

```{figure} ./images/Topic2/Matching2-Photoroom.png
---
name: Matching2
width: 600px
align: center
height: 400px
---
A Maximal Common Substructure.
```

However, this MCS assignment **is not unique**. Actually we have ${5 \choose 3}=10$ subgraphs of $Y$ where we can *embed* $X$ but only $8$ of them lead to MCS assignments. In the above table, we encode each MCS injective function $f_i$ as a list of <span style="color:#f88146"> **pairwise matchings** $(v\in V, f_i(v)\in V')$</span>: 

$$
\begin{aligned}
&\begin{array}{cl}
\text{Injection} & \text{MCS Assignment} \\
f_1 & [(a, 1), \mathbf{(b, 4)}, (c, 5)]\\
f_2 & [(a, 2), \mathbf{(b, 1)}, (c, 3)]\\
f_3 & [(a, 2), \mathbf{(b, 1)}, (c, 4)]\\
f_4 & [(a, 3), \mathbf{(b, 1)}, (c, 2)]\\
f_5 & [(a, 3), \mathbf{(b, 1)}, (c, 4)]\\
f_6 & [(a, 4), \mathbf{(b, 1)}, (c, 2)]\\
f_7 & [(a, 4), \mathbf{(b, 1)}, (c, 3)]\\
f_8 & [(a, 5), \mathbf{(b, 4)}, (c, 1)]\\
\end{array}
\end{aligned}
$$

**Notable vertices**. In the above table, note that MCS assignments usually contain pairwise matchings between nodes with large degrees (notable vertices) in both graphs, namely $\mathbf{b}$ and $\mathbf{1}$ or $\mathbf{4}$ (in bold). These matchings are $\mathbf{(b,1)}$ and $\mathbf{(b,4)}$.

In the following exercise we emphasize the link between the subgraphs of $Y$ and the MCS assignments, which is a fundamental concept in this topic. 
<br></br>
<span style="color:#d94f0b"> 
**Exercise**. Identify the ${5 \choose 3}=10$ subgraphs of size $3$ in $Y$ (indicating both nodes and edges) and:\
**a)** Detect the ones which do correspond to a complete  embedding of $X$ in $Y$.**b)** Identify these subgraphs in the $8$ MCS assigmnets of the above table. 
<br></br>
**NOTE**. Express always the edges in **lexicographical order** i.e. $(i,j)$ if $i<j$. 
<br></br>
Answer. **a)** The subgraphs are the $10$ combinations of the $5$ nodes $V'=\{1,2,3,4,5\}$ in groups of $3$ nodes (because $|V|=3$). However, only those subgraphs wich are connected can lead to a complete embeddding of $X$ in $Y$: 
</span>
<br></br>
<span style="color:#d94f0b"> 
$
\begin{aligned}
&\begin{array}{cclc}
\text{Subgraph} & \text{Nodes} & \text{Edges} & \text{is Connected?}\\
s_1    & V_1'=(1, 2, 3) & E_1'=\{(1,2),(1,3)\} & \checkmark \\
s_2    & V_2'=(1, 2, 4) & E_2'=\{(1,2),(1,3)\} & \checkmark \\
s_3    & V_3'=(1, 2, 5) & E_3'=\{(1,2)\}&\\
s_4    & V_4'=(1, 3, 4) & E_4'=\{(1,3),(1,4)\}& \checkmark\\
s_5    & V_5'=(1, 3, 5) & E_5'=\{(1,3)\}&\\
s_6    & V_6'=(1, 4, 5) & E_6'=\{(1,4),(1,5)\}& \checkmark\\
s_7    & V_7'=(2, 3, 4) & E_7'=\emptyset &\\
s_8    & V_8'=(2, 3, 5) & E_8'=\emptyset &\\
s_9    & V_9'=(2, 4, 5) & E_9'=\{(4,5)\} &\\
s_{10} & V_{10}'=(3, 4, 5) & E_{10}'=\{(4,5)\} &\\
\end{array}
\end{aligned}
$
</span>
<br></br>
<span style="color:#d94f0b">
**b)** We have that only subgraphs $s_1$, $s_2$, $s_4$ and  $s_6$ can accomodate the full graph $X$. In graph theory we say that these subgraphs are **isomorphic** to $Y$. Note that each of these subgraphs leads to $2$ MCS assignments in the injection table above: 
</span>
<br></br>
<span style="color:#d94f0b">
$
\begin{aligned}
V_1'&=(1,2,3)\rightarrow f_2=[(a,2),(b,1),(c,3)],\; f_4=[(a,3),(b,1),(c,2)]\\
V_2'&=(1,2,4)\rightarrow f_3=[(a,2),(b,1),(c,4)],\; f_6=[(a,4),(b,1),(c,2)]\\
V_4'&=(1,3,4)\rightarrow f_5=[(a,3),(b,1),(c,4)],\; f_7=[(a,4),(b,1),(c,3)]\\
V_6'&=(1,4,5)\rightarrow f_1=[(a,1),(b,4),(c,5)],\; f_5=[(a,5),(b,4),(c,1)]\\
\end{aligned}
$
</span>
<br></br>

### Rectangle rule: Cost Function 
So far, solving MCS revolves around the idea of <span style="color:#f88146">**maximizing the number of pairwise matchings**</span> between graphs $X$ and $Y$. 

For a human, is quite clear to adopt a *greedy strategy*:
1. Identify notable vertices in $V$ and $V'$.
2. For each matching pair between notable nodes: 
  - Extend the matching pairs to neighbors.
  - Increment the number of rectables accordingly. 

Machines need a more precise specification. Maximizing the number of rectangles, however, arises from the following procedure: 

*Given $a\in V$ and $i\in V'$, a rectangle appears if:* 
 1. *We match $a$ and $i$*
 2. *We match $b$ and $j$, which are respective neighbors of $a$ and $i$: $b\in {\cal N}_a$ and $j\in {\cal N}_i$.*

 In other words, a rectangle happens when we match two nodes and their respective neighbors do also match. This is exacly why we prefer pairwise matchings between notable vertices: because they are more likely to produce more rectangles. 

 This rationale leads to the <span style="color:#f88146">**second-order flavor**</span> of MCS, also known as <span style="color:#f88146">**Graph Matching**</span> inside the Pattern Recognition community. Indeed, posing the problem in matricial terms, we have:
 - The adjacency matrices $\mathbf{X}\in\{0,1\}^{m\times m}$ and $\mathbf{Y}\in\{0,1\}^{n\times n}$ of graphs $X$ and $Y$.
 - The unknown <span style="color:#f88146">**matching matrix**</span> $\mathbf{M}\in\{0,1\}^{n\times m}$, where $\mathbf{M}_{ia}=1$ means that $i\in V$ matches $a\in V'$.

Therefore, a rectagle exists if *its four sides do exist*, i.e. if

$$
\exists a,b\in V, \exists i,j\in V':\; \mathbf{M}_{ai}\mathbf{M}_{bj}\mathbf{X}_{ab}\mathbf{Y}_{ij}>0\;,
$$

where $\mathbf{X}_{ab}$ and $\mathbf{Y}_{ij}$ encode the respective neighboring rules (sides of the rectangle) as we show in {numref}`Rectangle`. 


```{figure} ./images/Topic2/Rectangle-Photoroom.png
---
name: Rectangle
width: 600px
align: center
height: 400px
---
The Rectangle Rule in Graph Matching.
```

**Quadratic Cost function**. A function that counts the number of rectangles given by a matching matrix $\mathbf{M}$ is simply: 

$$
F(\mathbf{M}) = \frac{1}{2}\sum_{a\in V}\sum_{i\in V'}\sum_{b\in V}\sum_{j\in V'}\mathbf{M}_{ai}\mathbf{M}_{bj}\mathbf{X}_{ab}\mathbf{Y}_{ij}\;,
$$

where $1/2$ factor removes rectangles counted twice (see the explanation to follow). 

Consider for instance the graphs $X$ and $Y$ in the previous section (see {numref}`Matching`). Let us count the number of rectangles por a particular matching using the matricial formulation and the above function. 

Firstly, we have the adjacency matrices

$$
\begin{align}
\mathbf{X} = 
\begin{array}{l}
a\\
b\\
c\\
\end{array}
\begin{bmatrix}
0 & 1 & 0 \\
1 & 0 & 1 \\
0 & 1 & 0 \\
\end{bmatrix}
\;\; 
\mathbf{Y} = 
\begin{array}{l}
1\\
2\\
3\\
4\\
5\\
\end{array}
\begin{bmatrix}
0 & 1 & 1 & 1 & 0\\
1 & 0 & 0 & 0 & 0\\
1 & 0 & 0 & 0 & 0\\
1 & 0 & 0 & 0 & 1\\
0 & 0 & 0 & 1 & 0\\
\end{bmatrix}
\end{align}
$$

The matching matrix for the assignment in {numref}`Matching2` is given by 

$$
\begin{align}
\mathbf{M} = 
\begin{array}{l}
a\\
b\\
c\\
\end{array}
\begin{bmatrix}
1 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 1 & 0\\
0 & 0 & 0 & 0 & 1\\
\end{bmatrix}
\;\;\text{where}\;\;
\mathbf{M}^T = 
\begin{array}{l}
1\\
2\\
3\\
4\\
5\\
\end{array}
\begin{bmatrix}
1 & 0 & 0 \\
0 & 0 & 0 \\
0 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1 \\
\end{bmatrix}
\end{align}
$$

Let us fix the matricial version of $F(\mathbf{M})$.
1) **Mapping**. We commence by analyzing the product $\mathbf{X}\mathbf{M}$ which maps the nodes in $X$ to those in $Y$ via the matching matrix $\mathbf{M}$: 

$$
(\mathbf{X}\mathbf{M})_{ai} = \mathbf{X}_{a:}\mathbf{M}_{:i} = \sum_{a}\sum_{b}\mathbf{X}_{ab}\mathbf{M}_{bi}
$$

In our example: 

$$
\mathbf{X}\mathbf{M} = 
\begin{array}{l}
a\\
b\\
c\\
\end{array}
\begin{bmatrix}
0 & 1 & 0 \\
1 & 0 & 1 \\
0 & 1 & 0 \\
\end{bmatrix}
\cdot
\begin{bmatrix}
1 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 1 & 0\\
0 & 0 & 0 & 0 & 1\\
\end{bmatrix}
=
\begin{bmatrix}
0 & 0 & 0 & 1 & 0\\
1 & 0 & 0 & 0 & 1\\
0 & 0 & 0 & 1 & 0\\
\end{bmatrix}
$$

Each column of $\mathbf{M}$ has *at most* one $1$, we have: 

$$
(\mathbf{X}\mathbf{M})_{ai} = \mathbf{X}_{ab^{\ast}}\mathbf{M}_{b^{\ast}i}
$$

where $b^{\ast}\in {\cal N}_a$ is the neighbor of $a$ that matches $i$, if any. Therefore: 

$$
(\mathbf{X}\mathbf{M})_{ai} = \mathbf{X}_{ab^{\ast}}\mathbf{M}_{b^{\ast}i}=1\Rightarrow \exists\; \text{path}\; a\rightarrow b^{\ast}\rightarrow i\in {\cal M}\;,
$$

where ${\cal M}$ is the <span style="color:#f88146">**matching graph**</span> (includes links of $X$, $Y$ and $M$ as in {numref}`Matching2`). 

Therefore, $\mathbf{X}\mathbf{M}$ has the following paths in ${\cal M}$ (one per each $1$ in the matrix): 

$$
\begin{align}
(\mathbf{X}\mathbf{M})_{a4}&=a\rightarrow b\rightarrow 4\\
(\mathbf{X}\mathbf{M})_{b1}&=b\rightarrow a\rightarrow 1\\
(\mathbf{X}\mathbf{M})_{b5}&=b\rightarrow c\rightarrow 5\\
(\mathbf{X}\mathbf{M})_{c4}&=c\rightarrow b\rightarrow 4\\
\end{align}
$$

For instance, we show the above paths in {numref}`MPaths1` and {numref}`MPaths2`. 

```{figure} ./images/Topic2/MatchingPaths1-Photoroom.png
---
name: MPaths1
width: 600px
align: center
height: 400px
---
Matching paths $b\rightarrow a\rightarrow 1$ and $b\rightarrow c\rightarrow 5$.
```

```{figure} ./images/Topic2/MatchingPaths2-Photoroom.png
---
name: MPaths2
width: 600px
align: center
height: 400px
---
Matching paths $a\rightarrow b\rightarrow 4$ and $c\rightarrow b\rightarrow 4$.
```

2) **Paths**. We continue by analyzing the product $(\mathbf{X}\mathbf{M})\mathbf{Y}$. Intuition: since $\mathbf{X}\mathbf{M}$ provides paths in ${\cal M}$ between adjacent nodes in $X$ and then crosses through the matching $\mathbf{M}$ reaching nodes in $Y$, the product $(\mathbf{X}\mathbf{M})\mathbf{Y}$ just *continues these paths*: some of them will end in $Y$ *defining $3/4$ sides of a rectangle* (see below) and some others just travel through $Y$ without defining any rectangle. 

In our example: 

$$
(\mathbf{X}\mathbf{M})\mathbf{Y} = 
\begin{array}{l}
a\\
b\\
c\\
\end{array}
\begin{bmatrix}
0 & 0 & 0 & 1 & 0\\
1 & 0 & 0 & 0 & 1\\
0 & 0 & 0 & 1 & 0\\
\end{bmatrix}
\cdot\;
\begin{array}{l}
1\\
2\\
3\\
4\\
5\\
\end{array}
\begin{bmatrix}
0 & 1 & 1 & 1 & 0\\
1 & 0 & 0 & 0 & 0\\
1 & 0 & 0 & 0 & 0\\
1 & 0 & 0 & 0 & 1\\
0 & 0 & 0 & 1 & 0\\
\end{bmatrix} = 
\begin{bmatrix}
\mathbf{1} & 0 & 0 & 0 & 1\\
0 & 1 & 1 & \mathbf{2} & 0\\
1 & 0 & 0 & 0 & \mathbf{1}\\
\end{bmatrix}
$$

Then, the following paths **do** form $3/4$ of a rectangle (in bold in $(\mathbf{X}\mathbf{M})\mathbf{Y}$): 

$$
\begin{align}
((\mathbf{X}\mathbf{M})\mathbf{Y})_{a1} &= a\rightarrow b\rightarrow 4\rightarrow 1\\
((\mathbf{X}\mathbf{M})\mathbf{Y})_{b4} &= b\rightarrow a\rightarrow 1\rightarrow 4 + b\rightarrow c\rightarrow 5\rightarrow 4\\
((\mathbf{X}\mathbf{M})\mathbf{Y})_{a1} &= a\rightarrow b\rightarrow 4\rightarrow 1\\
\end{align}
$$

where it is significant the $2$ paths of $((\mathbf{X}\mathbf{M})\mathbf{Y})_{b4}$ between nodes $b\in V$ and $4\in V'$ (see {numref}`MPaths3`): 

```{figure} ./images/Topic2/MatchingPaths3-Photoroom.png
---
name: MPaths3
width: 600px
align: center
height: 400px
---
Two $3/4$ matching paths $b\rightarrow a\rightarrow 1\rightarrow 4$ and $b\rightarrow c\rightarrow 5\rightarrow 4$.
```


On the other hand, theses other paths **do not** form $3/4$ of a rectangle:

$$
\begin{align}
((\mathbf{X}\mathbf{M})\mathbf{Y})_{a5} &= a\rightarrow b\rightarrow 4\rightarrow 5\\
((\mathbf{X}\mathbf{M})\mathbf{Y})_{b2} &= b\rightarrow a\rightarrow 1\rightarrow 2\\
((\mathbf{X}\mathbf{M})\mathbf{Y})_{b3} &= b\rightarrow a\rightarrow 1\rightarrow 3\\
((\mathbf{X}\mathbf{M})\mathbf{Y})_{c1} &= c\rightarrow b\rightarrow 4\rightarrow 1\\
\end{align}
$$

For instance, paths $a\gg b\rightarrow 4\gg 5$ and $c\gg b\rightarrow 4\gg 1$ diverge from rectangles
where the notation $\gg$ denotes the first and last edges of the paths (see {numref}`MPaths4`). 

```{figure} ./images/Topic2/MPaths4-Photoroom.png
---
name: MPaths4
width: 600px
align: center
height: 400px
---
Paths $a\gg b\rightarrow 4\gg 5$ and $c\gg b\rightarrow 4\gg 1$ diverge from rectangles.
```

Similarly, paths $b\gg a\rightarrow 1\gg 2$ and $b\gg a\rightarrow 1\gg 3$ diverge even more clearly from ending up in rectangles (see {numref}`MPaths5`).

```{figure} ./images/Topic2/MPaths5-Photoroom.png
---
name: MPaths5
width: 600px
align: center
height: 400px
---
Remaining diverging paths in ${\cal M}$.
```

3) **Closing rectangles**. Once we have $(\mathbf{X}\mathbf{M})\mathbf{Y}$ we simply perform a *pointwise* multiplication (Hadamard product) to close the rectangle: 

$$
((\mathbf{X}\mathbf{M})\mathbf{Y})\odot \mathbf{M}\;.
$$

In our example: 

$$
((\mathbf{X}\mathbf{M})\mathbf{Y})\odot \mathbf{M} = 
\begin{bmatrix}
\mathbf{1} & 0 & 0 & 0 & 1\\
0 & 1 & 1 & \mathbf{2} & 0\\
1 & 0 & 0 & 0 & \mathbf{1}\\
\end{bmatrix}\odot 
\begin{bmatrix}
1 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 1 & 0\\
0 & 0 & 0 & 0 & 1\\
\end{bmatrix} = 
\begin{bmatrix}
\mathbf{1} & 0 & 0 & 0 & 0\\
0 & 0 & 0 & \mathbf{2} & 0\\
0 & 0 & 0 & 0 & \mathbf{1}\\
\end{bmatrix}
$$

Note that $a$ closes $1$ rectangle, $c$ closes $1$ and $b$ closes $2$ (one for $a$ and another one for $c$). Therefore the multiplication $((\mathbf{X}\mathbf{M})\mathbf{Y})\odot \mathbf{M}$ simply **filters out** impossible rectangles. The result is in {numref}`MPaths6`.

```{figure} ./images/Topic2/MPaths6-Photoroom.png
---
name: MPaths6
width: 600px
align: center
height: 400px
---
Final rectangles in ${\cal M}$.
```

4) **Counting**. Note that the number of rectangles comes naturally from 

$$
F(\mathbf{M}) = \frac{1}{2}\sum_{a}\sum_{i}(((\mathbf{X}\mathbf{M})\mathbf{Y})\odot \mathbf{M})_{ai}\;,
$$

which is $2$ in this case, as expected! Let us review these concepts in the following exercise: 

<br></br>
<span style="color:#d94f0b"> 
**Exercise**. Given the following graphs $X$ and $Y$:
</span>
<br></br>
<span style="color:#d94f0b"> 
$
\begin{aligned}
&\begin{array}{cll}
\text{Graph}  & \text{Nodes} & \text{Edges}\\
X & V=\{a,b,c\} & E=\{(a,b),(b,c)\}\\
Y & V'=\{1,2,3,4,5\} & E=\{(1,2),(1,3),(2,4),(3,4),(1,5),(2,5)\}\\
\end{array}
\end{aligned}
$
</span>
<br></br>
<span style="color:#d94f0b">
**a)** Draw them so that the assignment $f=[(a,1),(b,2),(c,3)]$ is clearly visisble. Is $X$ a subgraph of $Y$? If so, what is the maximal number of rectangles to close in ${\cal M}(f^{\ast})$ and find an optimal assignment $f^{\ast}$. 
</span>
<br></br>
```{figure} ./images/Topic2/MiniMatch-Photoroom.png
---
name: Mini
width: 600px
align: center
height: 400px
---
Matching graph ${\cal M}$ for $f=[(a,1),(b,2),(c,3)]$.
```
<br></br>
<span style="color:#d94f0b">
In {numref}`Mini`, we show a 3D projection of both graphs where the assignment is clearly visible. Clearly, $X$ is a subgraph of $Y$. As a result, we can map all vertices and edges of $X$ to $Y$, closing at most $3$ rectangles. 
</span>
<br></br>
<span style="color:#d94f0b">
**b)** Is that optimal assignment unique? Compute the maximum number of rectangles by matricially evaluating $M^{\ast}$, the matching matrix corresponding to an optimal assignment. Indicate the **matching paths** in each step. 
</span>
<br></br>
<span style="color:#d94f0b">
In this case the optimal assignment is unique: it comes from matching the notable vertices between both graphs and it is $f=[(a,1),(b,2),(c,5)]$ since this is the only way to embed $X$ in $Y$ completely (see {numref}`Mini2`). 
</span>
<br></br>
```{figure} ./images/Topic2/MiniMatch2-Photoroom.png
---
name: Mini2
width: 600px
align: center
height: 400px
---
Matching graph ${\cal M}$ for $f=[(a,1),(b,2),(c,5)]$.
```
<br></br>
<span style="color:#d94f0b">
Matricially, we have
</span>
<br></br>
<span style="color:#d94f0b">
$
\begin{align}
\mathbf{X} = 
\begin{array}{l}
a\\
b\\
c\\
\end{array}
\begin{bmatrix}
0 & 1 & 1 \\
1 & 0 & 1 \\
1 & 1 & 0 \\
\end{bmatrix}
\;\; 
\mathbf{Y} = 
\begin{array}{l}
1\\
2\\
3\\
4\\
5\\
\end{array}
\begin{bmatrix}
0 & 1 & 1 & 0 & 1\\
1 & 0 & 0 & 1 & 1\\
1 & 0 & 0 & 1 & 0\\
0 & 1 & 1 & 0 & 0\\
1 & 1 & 0 & 0 & 0\\
\end{bmatrix}
\;\;\text{and}\;\;
\mathbf{M}^{\ast} = \begin{array}{l}
a\\
b\\
c\\
\end{array}
\begin{bmatrix}
1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 \\
\end{bmatrix}
\end{align}
$
</span>
<br></br>
<span style="color:#d94f0b">
Then, we proceed to evaluate 
</span>
<br></br>
<span style="color:#d94f0b">
 $
F(\mathbf{M}^{\ast}) = \frac{1}{2}\sum_{a}\sum_{i}(((\mathbf{X}\mathbf{M}^{\ast})\mathbf{Y})\odot \mathbf{M}^{\ast})_{ai}\;,
 $ 
</span>
<br></br>
<span style="color:#d94f0b">
**Step 1:** Compute $\mathbf{X}\mathbf{M}^{\ast}$: 
</span>
<br></br>
<span style="color:#d94f0b">
$
\begin{align}
\mathbf{X}\mathbf{M}^{\ast} = 
\begin{array}{l}
a\\
b\\
c\\
\end{array}
\begin{bmatrix}
0 & 1 & 1 \\
1 & 0 & 1 \\
1 & 1 & 0 \\
\end{bmatrix}
\;\cdot \;
\begin{array}{l}
a\\
b\\
c\\
\end{array}
\begin{bmatrix}
1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 \\
\end{bmatrix} = 
\begin{bmatrix}
0 & 1 & 0 & 0 & 1\\
1 & 0 & 0 & 0 & 1\\
1 & 1 & 0 & 0 & 0\\
\end{bmatrix}
\end{align}
$
</span>
<br></br>
<span style="color:#d94f0b">
And the paths so far are: 
</span>
<br></br>
<span style="color:#d94f0b">
$
\begin{align}
(\mathbf{X}\mathbf{M}^{\ast})_{a2} &= a\rightarrow b\rightarrow 2\\
(\mathbf{X}\mathbf{M}^{\ast})_{a5} &= a\rightarrow c\rightarrow 5\\
(\mathbf{X}\mathbf{M}^{\ast})_{b1} &= b\rightarrow a\rightarrow 1\\
(\mathbf{X}\mathbf{M}^{\ast})_{b5} &= b\rightarrow c\rightarrow 5\\
(\mathbf{X}\mathbf{M}^{\ast})_{c1} &= c\rightarrow a\rightarrow 1\\
(\mathbf{X}\mathbf{M}^{\ast})_{c2} &= c\rightarrow b\rightarrow 2\\
\end{align}
$
</span>
<br></br>
<span style="color:#d94f0b">
**Step 2:** Compute $(\mathbf{X}\mathbf{M}^{\ast})\mathbf{Y}$: 
</span>
<br></br>
<span style="color:#d94f0b">
$
\begin{align}
(\mathbf{X}\mathbf{M}^{\ast})\mathbf{Y} = 
\begin{array}{l}
a\\
b\\
c\\
\end{array}
\begin{bmatrix}
0 & 1 & 0 & 0 & 1\\
1 & 0 & 0 & 0 & 1\\
1 & 1 & 0 & 0 & 0\\
\end{bmatrix}
\cdot\; 
\begin{array}{l}
1\\
2\\
3\\
4\\
5\\
\end{array}
\begin{bmatrix}
0 & 1 & 1 & 0 & 1\\
1 & 0 & 0 & 1 & 1\\
1 & 0 & 0 & 1 & 0\\
0 & 1 & 1 & 0 & 0\\
1 & 1 & 0 & 0 & 0\\
\end{bmatrix} = 
\begin{bmatrix}
2 & 1 & 0 & 1 & 1\\
1 & 2 & 1 & 0 & 1\\
1 & 1 & 1 & 1 & 2\\
\end{bmatrix}
\end{align}
$
</span>
<br></br>
<span style="color:#d94f0b">
And the $3/4$-rectangle paths are: 
</span>
<br></br>
<span style="color:#d94f0b">
$
\begin{align}
((\mathbf{X}\mathbf{M}^{\ast})\mathbf{Y})_{a1} &= a\rightarrow b\rightarrow 2\rightarrow 1 + a\rightarrow  c\rightarrow 5\rightarrow 1\\
((\mathbf{X}\mathbf{M}^{\ast})\mathbf{Y})_{a2} &= a\rightarrow c\rightarrow 5\rightarrow 2\\
((\mathbf{X}\mathbf{M}^{\ast})\mathbf{Y})_{a4} &= a\rightarrow b\rightarrow 2\rightarrow 4\\
((\mathbf{X}\mathbf{M}^{\ast})\mathbf{Y})_{a5} &= a\rightarrow b\rightarrow 2\rightarrow 5\\
((\mathbf{X}\mathbf{M}^{\ast})\mathbf{Y})_{b1} &= b\rightarrow c\rightarrow 5\rightarrow 1\\
((\mathbf{X}\mathbf{M}^{\ast})\mathbf{Y})_{b2} &= b\rightarrow a\rightarrow 1\rightarrow 2 + 
b\rightarrow c\rightarrow 5\rightarrow 2\\
((\mathbf{X}\mathbf{M}^{\ast})\mathbf{Y})_{b3} &= b\rightarrow a\rightarrow 1\rightarrow 3\\
((\mathbf{X}\mathbf{M}^{\ast})\mathbf{Y})_{b5} &= b\rightarrow a\rightarrow 1\rightarrow 5\\
((\mathbf{X}\mathbf{M}^{\ast})\mathbf{Y})_{c1} &= c\rightarrow b\rightarrow 2\rightarrow 1\\
((\mathbf{X}\mathbf{M}^{\ast})\mathbf{Y})_{c2} &= c\rightarrow a\rightarrow 1\rightarrow 2\\
((\mathbf{X}\mathbf{M}^{\ast})\mathbf{Y})_{c3} &= c\rightarrow a\rightarrow 1\rightarrow 3\\
((\mathbf{X}\mathbf{M}^{\ast})\mathbf{Y})_{c4} &= c\rightarrow b\rightarrow 2\rightarrow 4\\
((\mathbf{X}\mathbf{M}^{\ast})\mathbf{Y})_{c5} &= c\rightarrow a\rightarrow 1\rightarrow 5 + 
c\rightarrow b\rightarrow 2\rightarrow 5\\
\end{align}
$
</span>
<br><br>
<span style="color:#d94f0b">
**Step 3**. We higlight the correct matches and apply Hadamard product: 
</span>
<br></br>
<span style="color:#d94f0b">
$
\begin{align}
((\mathbf{X}\mathbf{M}^{\ast})\mathbf{Y})\odot \mathbf{M}^{\ast} = 
\begin{bmatrix}
\mathbf{2} & 1 & 0 & 1 & 1\\
1 & \mathbf{2} & 1 & 0 & 1\\
1 & 1 & 1 & 1 & \mathbf{2}\\
\end{bmatrix}
\odot\; 
\begin{bmatrix}
1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 \\
\end{bmatrix}
\end{align}
$
</span>
<br></br>
<span style="color:#d94f0b">
Finally, if we sum all the components of the Hadamard product and divide by $2$ we have $3$ rectangles as shown in {numref}`Mini2`. We are done!
</span>

### Integer Constraints: QAP
We have clarified an objective function of MCS (graph matching) $F(\mathbf{M})$ interpretation for a particular $\mathbf{M}\in\{0,1\}^{n\times m}$ (number of rectangles). The function is <span style="color:#f88146">**quadratic** in $\mathbf{M}$</span> since it requires two multiplications of the matching matrix for its evaluation, but evaluating it takes $O(n^4)$.

In addition, the structure of $\mathbf{M}$ is not simply binary. Since each candidate assignment $f$ encoded by this matrix must be an injective (one-to-one) function $f:V\rightarrow V'$ we have that for $m\le n$: 

$$
\forall a\in V:\sum_{i\in V'}\mathbf{M}_{ai}\le 1\;\text{and}\; \forall i\in V':\sum_{a\in V}\mathbf{M}_{ai}\le 1\;,
$$

i.e. each row/column cannot have more than one $1$. Some precisions: 
- Formally, all row/column may be all-zeros but if they are, then we do not have a solution. 
- In particular, we need to match a coumple of vertices to close a single rectangle. Therefore we need at least to columns with a single one placed in different rows.  
- Thus, the above notation means that if one row/column is all-zeros there is only one of them which is not! 
- Ideally, for $m\le n$ we should have $m$ columns with a $1$ all placed in different rows. 

This constrained nature of the so called <span style="color:#f88146">**Quadratic Assignment Problem (QAP)**</span> is what makes it NP:

$$
\begin{align}
\mathbf{M}^{\ast}= & \arg\max_{{\cal P}}F(\mathbf{M}) = \frac{1}{2}\sum_{a\in V}\sum_{i\in V'}\sum_{b\in V}\sum_{j\in V'}\mathbf{M}_{ai}\mathbf{M}_{bj}\mathbf{X}_{ab}\mathbf{Y}_{ij}\\
{\cal P}=&\left\{\mathbf{M}\in \{0,1\}^{m\times n}:\forall a\in V:\sum_{i\in V'}\mathbf{M}_{ai}\le 1\;\text{and}\; \forall i\in V':\sum_{a\in V}\mathbf{M}_{ai}\le 1\;\right\}
\end{align}
$$

Therefore, the QAP (we use this name in the following) is originally posed in terms of <span style="color:#f88146">**Integer Programming (IP)**</span> 
since the *unknowns* to discover are binary $m\times n$ matrices $\mathbf{M}\in\{0,1\}^{m\times n}$ subject to be in ${\cal P}$. 

In particular, if $m=n$ (both graphs have the same number of nodes), then:
- ${\cal P}$ becomes $\Pi_n$, the space of *permutations* of order $n$ and we have $n!$ *bijections* $f=\pi:V\rightarrow V'$. 
- Each matrix $\mathbf{M}$ encodes a permutation $\pi(a),\pi(b),\ldots$, where $V=\{a,b,\ldots\}$ with $|V|=n$ are the rows of $\mathbf{M}$ (each one with a unique $1$ at different columns.
- Then $V'=\{\pi(i):i\in\{a,b,c,\ldots\}\}$ with $|V'|=n$ and $\pi^{-1}(i)\in V$ does exist and it is unique.

More formally: 

$$
\begin{align}
\Pi_n=&\left\{\mathbf{M}\in \{0,1\}^{m\times n}:\forall a\in V:\sum_{i\in V'}\mathbf{M}_{ai}= 1\;\text{and}\; \forall i\in V':\sum_{a\in V}\mathbf{M}_{ai}= 1\;\right\}
\end{align}
$$

where we have transformed the inequality constraints into equality ones!

The [Permutohedron](https://en.wikipedia.org/wiki/Permutohedron) is a useful means of visualizing the space of solutions $\Pi_n$. Each point is a permutation $\pi$ and it has $n-1$ neighbors. Given $\pi$, its neighboring permutations $\pi'\in {\cal N}_{\pi}$ differ from $\pi$ in *one transposition*, i.e. one interchange between two components. 

For instance, in {numref}`PiN` we show $\Pi_4$. It has $n!=24$ nodes with $n-1=3$ neighbors each. For instance, the neighbors of $\pi=[3,4,2,1]$ are ${\cal N}_{\pi}=\{[2,4,3,1],[3,4,1,2],[4,3,2,1]\}$, where  $\pi'=[2,4,3,1]$ is the result of interchanging (transposing) $2$ and $3$ in $\pi=[3,4,2,1]$. 

```{figure} ./images/Topic2/PiN-Photoroom.png
---
name: PiN
width: 600px
align: center
height: 500px
---
Permutohedron for of order $n=4$. Please define the unlabeled nodes. 
```

## Simulated Annealing 

The Permutohedron provides a glimpse of the <span style="color:#f88146">**discrete seach space**</span>, specially because it highlighs its topology, i.e. its local structure. 

With this instrument to hand, it is natural to <span style="color:#f88146">envision **intelligent search** as a random walk between an initial permutation $\pi^0$ and a final one $\pi^{\ast}$</span>:
- The *intelligence* is partially provided by the **objective function** $F(\pi)$ that discriminates between good and bad solutions. 
- The *search strategy* is what exploits the knowledge provided by the objective function. One of the simplest strategies is that of a **random walk (RW)** $\pi^0\rightarrow\pi^1\rightarrow\ldots\rightarrow \pi^{\ast}$ inside $\Pi_n$.

[Simulated Annealing (SA)](https://en.wikipedia.org/wiki/Simulated_annealing) is a principled way of combining intelligence and RW-search. It works as follows: 

Let $\Omega$ be the **search space** and $\omega\in \Omega$ a **discrete state**, then we define a **Markov chain** $X_0,X_1,\ldots$ for a *maximization problem* as follows:

$$
p(X_{t+1}=\omega'|X_{t}=\omega) =
\begin{cases}
     1 &\;\text{if}\; \Delta F\ge 0 \\[2ex]
     \exp(\beta(t)\cdot \Delta F)&\;\text{otherwise}\\[2ex]
\end{cases}
$$

where $\Delta F:=F(\omega')-F(\omega)$ and $\beta(t):=\frac{1}{T(t)}$ where $T$ is the **computational temperature**. Some explanations: 

- Since we are maximizing $F$, having $\Delta F:=F(\omega')-F(\omega)\ge 0$ for $\omega'\in {\cal N}_{\omega}$ means that $F(\omega')\ge F(\omega)$, i.e. that we are in the *correct way* (or at least not in the wrong way!). In this case, we should **accept** (with probability $1$) $\omega'$ as the new state of the search. 
- If $F(\omega')< F(\omega)$, a **greedy method** or a **branch-and-bound** should reject $\omega'$ as a promising direction. However, SA gives a **chance of acceptance** which is proportional to how far is $\omega'$ of being promising, which is $\Delta F = F(\omega')-F(\omega)< 0$. 

However, if $F(\omega')< F(\omega)$, the chance of accepting $\omega'$ must depend of the **stage of the search process** where we take the decision. This stage is parameterized by the computational temperature $\beta(t)=\frac{1}{T(t)}$

Suppose that the computational temperature is $T(t)$ is a **monotonic decreasing function** with $t$ such as $T(t)=\frac{1}{\log(1+t)}$. This is why this is called an **annealing schedule** Then: 

- At $t\rightarrow 0$ (early stages of the search) the chance should be hight. Then, we usually set $T(0)\rightarrow\infty$ (high temperature). As $\Delta F<0$, then $\exp(\beta(t)\cdot \Delta F)$ does not decays too much, since it is attenuated. As a result, solutions $\omega'$ with $\Delta F<0$ are typically accepted if $|\Delta F<0|$ is small enough. The larger $|\Delta F<0|$ the more probable that $\omega'$ is consider an **extremal event** (part of the queue of the exponential) and thus discarded. 

- However, for $t\rightarrow \infty$, we have $T(t)\rightarrow 0$ (low temperature). The same decrement $|\Delta F<0|$ that was admissible for smaller values of $t$ is not yet accepted because SA behaves like a greedy method. 

[//]: https://sphinx-proof.readthedocs.io/en/latest/syntax.html

[//]: https://pypi.org/project/sphinxcontrib-pseudocode/


```{prf:algorithm} Metropolis-Hastings
:label: Metropolis

**Inputs** Given a function $F(\omega)$, an annealing schedule $T(t)$ and an inital state $\omega_0$\
**Output** $\omega^{\ast}=\arg\max_{\omega\in\Omega} F(\omega)$

1. convegence = False
2. $\omega_{old}\leftarrow \omega_0$
3. $t\leftarrow 0$
4. **while** $\neg$ convergence:
    
    1. Select $\omega'\in {\cal N}_{\omega}$
    2. Calculate $\Delta F=F(\omega')-F(\omega_{old})$
    3. **if** $\Delta F\ge 0$ **then** $\omega_{new}\leftarrow \omega'$
    4. **else**:
          1. Draw $p\in [0,1]$
          2. $q\leftarrow \exp\left(\frac{\Delta F}{T(t)}\right)$
          2. **if** $q\ge p$ **then** $\omega_{new}\leftarrow \omega'$
    5. $t\leftarrow t+1$
    6. $T(t)\leftarrow \frac{1}{\log(1 + t)}$
    7. convergence = $(T(t)<T_{min})$ **or** $(t>t_{max})$
4. **return** $\omega_{new}$
```

### Approximating QAP with SA 
Let us apply SA to approximate the QAP. To that end, we are going to **match the Aspirin's structure** (see {numref}`SA-Aspirin`-Left) with a random permutation of its vertices ({numref}`SA-Aspirin`-Right). The left graph with nodes $V=\{a,b,\ldots,m\}$ and links, mimics the structure of the Aspirin. In this first approach we discard, for the moment, both nodes (atom) labels ($\text{C}$,$\text{O}$ and $\text{OH}$) and edges (bonds) labels (single, double).

On the right, the graph has nodes $V'=\{1,2,\ldots,13\}$. Since $|V|=|V'|=13$, there **state space** $\Omega$ is given by the $\Pi_{n=13}$ Permutohedron, with  $n!\approx 6\times 10^{9}$ states, each one with $n-1=12$ edges.  

We know that the optimal permutation $\pi^{\ast}$ is given by $\pi^{\ast}(a)=1,\pi^{\ast}(b)=2,\ldots, \pi^{\ast}(m)=13$, i.e. $\mathbf{M}^{\ast}=\mathbf{I}_n$ is the identity matrix of dimension $n$. However, we assume a realistic case where the input permutation $\pi^0$ is rather different from the optimal one: 

$$
\pi^0 = 
\begin{bmatrix}
\overbrace{7}^a & \overbrace{12}^b & \overbrace{5}^{c} & \overbrace{11}^{d} & \overbrace{3}^e & \overbrace{9}^f & \overbrace{2}^g & \overbrace{8}^h & \overbrace{10}^i & \overbrace{4}^j & \overbrace{1}^k & \overbrace{6}^l & \overbrace{13}^m\\
\end{bmatrix}
$$

**1) Initial Matching**. However, instead of starting SA with a random $\pi^0$, it is a good practice to compute a <span style="color:#f88146">**greedy approximation**</span>. In the case of graph matching or QAP, a greedy algorithm may rely on matching $a,b,c,\ldots$ with nodes $1,2,3,\ldots$ in such a way that it is <span style="color:#f88146">*preferred to match nodes with the largest number of neighbors (degree) as possible*</span>. 

Then, the greedy matching leads to 

$$
\pi^0 = 
\begin{bmatrix}
\overbrace{3}^a & \overbrace{4}^b & \underline{\overbrace{5}^{c}} & \overbrace{9}^{d} & \underline{\overbrace{1}^e} & \overbrace{6}^f & \overbrace{8}^g & \overbrace{10}^h & \overbrace{12}^i & \overbrace{2}^j & \overbrace{7}^k & \overbrace{10}^l & \overbrace{13}^m\\
\end{bmatrix}
$$

We check {numref}`SA-Aspirin`, where we plot the *final approximation of SA* with $10$ closing rectangles, that the greedy $\pi^0$ is consistent with matching nodes with mutually largerst degrees:
- We underlined the matchings with largest degrees (maximum degree is $3$): $c\rightarrow 5$, $e\rightarrow 1$.
- One of the risks of greedy matching is that we may be **stuck in a local maximum**, which really happens (SA here recovers from $e\rightarrow 1$ but not from $c\rightarrow 5$).
- However, the proper annealing scheduling typically **takes the search out of local optima** in the early stages of the search.  

**2) Neighboring**. The second aspect to define when applying SA to the QAP is how to <span style="color:#f88146">**ensure proper neighborhoods ${\cal N}_{\omega}$**</span>. In other words, if given $\omega$ we choose as $\omega'$ an arbitrarily large  random variation, we *<span style="color:#f88146">may miss the irreducibility property of the Markov chain (all states must be reached)*</span>. 

Then, a good way of navigating properly through the search space (the Permutohedron $\Pi_n$ in this case) is to: 
1) **Choose** randomly if we are going to interchange rows or colums in $\mathbf{M}_{\omega}$. 
2) **Interchange** two rows (or columns) whose indexes are randomly chosen. The result is $\mathbf{M}_{\omega'}$

This is **equivalent** to perform a transposition in the Permutohedron $\Pi_n$. 

**3)Annealing Schedule**. The last ingredient to set is a <span style="color:#f88146">**proper temperature cooling (annealing schedule)**</span> and the specific values $T_0$ and $T_f$. 
- In this example, we simply make $T(t+1)=T(t)-\alpha\cdot T(t)=(1-\alpha)\cdot T(t)$ with $\alpha\ll 1$. 
- This leads to a neg-exponential decrease since: 

$$
\begin{align}
T(1)&=(1-\alpha)\cdot T(0)\\
T(2)&=(1-\alpha)\cdot T(1)=(1-\alpha)^2\cdot T(0)\\
    &\vdots\\
T(n)&=(1-\alpha)\cdot T(n-1)=(1-\alpha)^n\cdot T(0)\\
\end{align}
$$

This succession is decreasing since we only multiply $T(0)$ by a decreasing fraction. Actually, we have set $T(0)=1/5=0.2$ and $\alpha = 1/1.075\approx 0.93$. The exponential decrease can be seen by looking at the **logarithmic transformation** (exponentials are lines in log space): 

$$
T(n) = (1-\alpha)^n\cdot T(0)\Rightarrow \log(T(n))=n\cdot log(1-\alpha) + \log T(0)\;.
$$

Since $(1-\alpha)<1$ then its log is negative!

This schedule is a **bit agressive** but playing a bit with $\alpha$ we can close $10-11$ rectangles from $13$ possible (this is the mumber of edges of the Aspirin graph). 

In {numref}`SA-Curves` we play with different values of $\alpha$. For the sake of clarity, herein we only represent the **best cost obtained so far**. Note that for different regimes we get different results, in this case between $0$ and $11$ rectangles. It is interesting to note that:

- The best result is obtained with one of the smallest $\alpha$s ($\alpha=0.85$)
- The annealing schedules tend to increase more the objective function $F$ in the early stages of the seach where SA is more free to explore!
- Note that the most strict schedule ($\alpha=0.93$) reaches $9$ rectangles in the middle stages and finaly peaks $10$ rectangles. This is the solution showed in {numref}`SA-Aspirin`.

The right image shows the negligible differences betweeen the annealing schedules in the log-log domain. 

```{figure} ./images/Topic2/SA-Curves-Photoroom.png
---
name: SA-Curves
width: 800px
align: center
height: 400px
---
Playing with $\alpha$ for different annealing schedules. 
```

**4)Understanding the solution**. Last, but by no  means least, we have to interpret to what extent our $F$ reaches or not an *interesting solution*. Herein, <span style="color:#f88146">*interesting means* **global**</span>. For instance: 
- The solution in {numref}`SA-Aspirin` captures mainly the central hexagon, although under a rotation around the $g-i$ axis and a flipping or torsion wrt the vertical: edge $f-e$ becomes $8-10$, but edge $h-i$ becomes $3-9$.
- The top part of the molecule in the Left  matches a bottom part of that of the Right, indicating a global deformation. 
- Also the bottom-left part of the molecule in the Left matches the top part of that in the Right (again a deformation).
- Overall, the largest subgraphs are correctly matched but the smallest ones are misplaced. 

Therefore, althoug the solution obtained seems **quantitatively global** it is partially **qualitatively global**. This reveals a need of making the objective function <span style="color:#f88146">**more informative**</span> (we will come back to this point later on).


```{figure} ./images/Topic2/SA-Aspirin-Photoroom.png
---
name: SA-Aspirin
width: 800px
align: center
height: 600px
---
Approximating the Aspirin's Isomorphism with SA. 
```



### Interpretation of SA 

**Intuitive interpretation**. From an intuitive point of view, SA can be described as follows [Hoffmann and Buhmann 1997](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=645602): 
1) Cost differences between neighboring states act as a **force field**. 
2) The effect of the computational temperature can be interpreted as a random force with an **amplitude proportional to T**.
3) Valleys and peaks with a cost difference less than T are smeared out and **vanish** in the stochastic search. Vanishing is more likely to happen in the early stages of the search where we need to scape from local (subobtimal) solutions. 

**Methodological Interpretation**. In the AI jargon, $F$ is considered an <span style="color:#f88146">**heuristic**</span> and SA is considered a <span style="color:#f88146">**meta-heuristic**</span>: 
- [Heuristic](https://en.wikipedia.org/wiki/Heuristic). This term was introduced by Herbert Simon and it loosely means "seach shortcut", i.e. some mechanism for deciding what solution is interesting and what is not in a potentially large space of solutions. This concept is our subject. Herein, it is the number of rectangles. 
- [Meta-Heuristic](https://en.wikipedia.org/wiki/Metaheuristic) are procedures that "modulate the application of a heuristic". SA is a metaheuristic because it decides when the search is more stochastic and when it becomes more greedy. 


**Formal interpretation**. Remember that the [Metropolis Algorithm](https://arxiv.org/pdf/1504.01896) implements an irreducible Markov chain $X_0,X_1,\ldots,X_t$ along $\Omega$. This Markov chain is **stationary** i.e. it converges to a given probability distribution $G$ which means that after a given time step $t$ if $X_{t}\sim G(\omega)$ then also $X_{t+1}\sim G(\omega)$. The stationary distribution is 

$$
\lim_{t\rightarrow\infty}p(X^{t+1}|X^{t})=G(\omega)\;,
$$

independently of the choose of $X_0$. In this case, the stationary distribution $G(\omega)$ is

$$
G(\omega) = \frac{\exp(\beta\cdot F(\omega))}{Z},\;\; \text{with}\;\; Z = \sum_{\omega\in\Omega}\exp(\beta\cdot F(\omega))\;,
$$

which is known as the <span style="color:#f88146">**Gibbs distribution**</span>. Look, for instance, at the $q$ variable at the step $4.2.$ of the Metropolis-Hastings algorithm: 

$$
q\leftarrow \exp\left(\frac{\Delta F}{T(t)}\right) = 
\exp\left(\beta(t)\Delta F\right) = \frac{\exp\left(\beta(t)F(\omega')\right)}{\exp\left(\beta(t) F(\omega_{old})\right)} = \frac{\frac{\exp\left(\beta(t)F(\omega')\right)}{Z(t)}}{\frac{\left(\beta(t) F(\omega_{old})\right)}{Z(t)}}=\frac{G(\omega')}{G(\omega_{old})}\;.
$$

Therefore, the so-called **acceptance probability** is a ratio between two "Gibbsian" probabilities!. Note that in the Gibbs distribution, for a given $\beta$ the states $\omega\in \Omega$ with larger $F(\omega)$ are more probable. As a result, $q$ is the largest possible when $F(\omega')\gg F(\omega_{old})$ (of course for this $\beta$). 

Consequently, SA can be seen as a <span style="color:#f88146">**sampling process** that converges to the global optimal $\omega^{\ast}$ of $F(\omega)$</span> if the annealing schedule is a proper one. However, what is the **fomal role** of the temperature (or its inverse: $\beta$)? Solving this question is a good excuse to dive into **Information Theory**, a mathematical discipline which is built on top of Probability and it is fundamental to undestanding intelligent search. 

### Gibbs Sampling
SA shows that intelligent search can be seen as deploying random walks for exploring combinatorial spaces such as $\Pi_n$. For this section, we need a key probabilistic concept: the expectation of a random variable.   

Remember that the **expectation** of a discrete random variable $\omega$ is defined as follows: 

$$
E(X)=\sum_{\omega\in \Omega}\omega\cdot p(X=\omega)\;.
$$

Expectation gives us an idea of the **spatial concentration** of the values $\omega\in\Omega$ attending to $p(X=\omega)$. Actually $E(X)$ can be seen as *the most representative* of this values (i.e. the *most likely* of them). 

For determining $E(X)$ it is mandatory to know all the probabilities $p(X=\omega)$.

However, <span style="color:#f88146">what if $p(X=\omega)=G(X)$, i.e. if $X$ is **Gibbsian**?</span>. This is the case of $X=F(\omega)$, the value of the cost function. In this latter case, 

$$
E(X) = \sum_{\omega\in \Omega}\omega\cdot G(X=\omega) = \sum_{\omega\in \Omega}\omega\cdot \frac{\exp\left(\beta\cdot F(\omega)\right)}{Z}\;,
$$

is the **expected cost**. Can we really calculate it? Well, it is impossible to do it analitically because:

1) We should know all the values $F(\omega)$. This is equivalent to solve the problem encoded by this cost function by means of **brute force**. Remember that for the QAP this implies to evaluate **all the states of the Permutohedron**, i.e. $n!$ states whose individual evaluation takes $O(n^4)$.

2) The Gibbs distribution seems to be restricted to a given $\beta$. We should evaluate $Z=\sum_{\omega\in\Omega}G(\omega)$, a very large sum, for that particular $\beta$. Therefore, $E(X)$ seems to be $\beta-$dependent as well. 

Instead, the Metropolis-Hastings algorithm allows us to **approximate** $E(X)$ as follows: 

$$
E(X)\approx \frac{1}{R}\sum_{r=1}^{R}F(\omega_r)\;,
$$

where $\omega_1,\omega_2,\ldots, \omega_r$ are the states visited by the algorithm when it follows an **admisible anneling schedule** such as $T(t)=\frac{C}{\log(1+t)}$ where $C$ is a constant. This means that $\omega_r$ is **accepted** at $T(r)$. 

Some related considerations:
1) The sequence $\omega_1,\omega_2,\ldots,$ is a **random walk** along $\Omega$ which is driven by $F(\omega)$ and converges to the Gibbs probability. 

2) The lower the temperature $T(t)$ the closer we become to the Gibbs distribution. Therefore <span style="color:#f88146">the computational temperature works as a **Lagrange parameter** to enforce the convergence</span> towards the Gibbs distribution. 

3) Formally, not all the **samples** $\omega_1,\omega_2,\ldots,\omega_r$ are generated by $G(\omega)$ since every random walk has a **transient period** (unstable values) and we do not know exactly at what time instant we have the convergence, i.e. $\omega_t$ become proper samples of $G(\omega)$. 

4) Following [Understanding Gibbs Sampling](https://eml.berkeley.edu/reprints/misc/understanding.pdf), a good strategy is to start different random walks at different $X_0$s (initial permutations in QAP) and track the samples until we can observe some **consensus**. 

In {numref}`SA-Expectations`, we show three random walks generated from different $X_0$s, for the admisible annealing (inverse log) with $C=3$. None of them reaches the global optimum in $R=7,000$ iterations. Actually their approximate expectations $E(X)$ are respectively $8.637$, $8.178$ and 
$9.371$ if we remove the first $1,000$ samples (considered as transient regime), whereas we have $9.051$, $8.466$, and $9.791$ if all the samples are included. 

What is interesting here is that <span style="color:#f88146">**it is quite difficult to reach the global optimum**</span> since:
1) There are fewer states with that value in general and most of them are sub-optimal. This is indiated by the fact that $E(X)\ll \omega^{\ast}$ for $X=F(\omega)$. 
2) In combinatorial problems such as QAP, the cost function has a discrete number of **energy levels** and this usually leads to **plateaus** (regions of constast cost) as cooling progresses. For instance, in {numref}`SA-Expectations` random search seems blocked in a sub-optimal state, at least for such number of iterations $R$. 


```{figure} ./images/Topic2/SA-Expectations-Photoroom.png
---
name: SA-Expectations
width: 800px
align: center
height: 600px
---
Three random walks along $\Pi_n$ with differernt $X_0$s. 
```

Note also that herein we have highlighted the main features of SA. Please check the beautiful paper of the Geman's brothers: Stochastic Relaxation, [Gibss Distribution and the Bayesian Restoration of Images](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=4767596) one of the most influential papers in all times.

### Limitations of SA
SA is appealing for AI since it allows us to track the global optimum in a sort of random walk. However:

1) The quality or "intelligence" of the random walk relies on <span style="color:#f88146">**how informative**</span> is the cost function. How is information measured? We will address this question later on.    
2) The <span style="color:#f88146">global optimum is only guaranteed for a **slow** annealing</span> $T(t)=\frac{C}{\log(1+t)}\rightarrow 0$. SA has to be slow enough to not missing that global optimum; otherwise SA becomes greedy prematurely. This is an **important limitation** of SA: remember that for QAP we have to evaluate $F(\omega)$ at each time $t$ and it takes $O(n^4)$.


## Central Clustering
**Deterministic Annealing** (DA) emerges as a faster method than SA. In order to motivate this technique it is more convenient to visit another NP problem. This problem is **central clustering** (or <span style="color:#f88146">**$k-$centrer clustering**</span>) and it is formulated as follows: 

*Given a set ${\cal X}=\{\mathbf{x}_1,\mathbf{x_2},\ldots,\mathbf{x}_m\}\subset \mathbb{R}^d$ ($d-$dimensional points)  and an integer parameter $k>1$, *find a set of $k$ centers* ${\cal C}=\{\mathbf{c}_1,\mathbf{c}_2,\ldots,\mathbf{c}_k\}\subset \mathbb{R}^d$ such that so that the Euclidean distance $D(\mathbf{x},\mathbf{c})$ of each point $\mathbf{x}_a$ to its closest center $\mathbf{c}_i$ is minimal.*

### Clusters and Prototypes
A more intuitive formulation is as follows. Let 

$$
\text{cluster}(\mathbf{c}_i) = \{\mathbf{x}_a\in{\cal X}: D(\mathbf{x}_a,\mathbf{c}_i)\le D(\mathbf{x}_a,\mathbf{c}_j),j\neq i\}\;,
$$

be a <span style="color:#f88146">**cluster**</span>, i.e. a subset of ${\cal X}$ where its elements $\mathbf{x}_a$ are closer to the cluster's <span style="color:#f88146">**prototype**</span> $\mathbf{c}_i$ than to any other. Then, the problem is also formulated as follows: 

*Partition ${\cal X}$ in $k$ (disjoint) clusters such that the sum of distances between the points and the prototypes is minimal.*

Some clarifications: 
1) The problem is a <span style="color:#f88146">**chicken-and-egg**</span> problem: a) We need to know the $k$ centers $\mathbf{c}_i$ beforehand in order to determine the clusters $\text{cluster}(\mathbf{c}_i)$ but b) the prototypes have to be determined once the clusters are done according with the criterion of minimal distances. 

2) The NP-hardness of the problem is given by the <span style="color:#f88146">**partitioning flavor**</span>. Actually, the worst case complexity is $O(k^m)$ since each of the $m$ different points can be exclusively assigned to $k$ centers (**r-permutations with repetition** $P_{\sigma}(m,k)=k^m)$. 

3) The prototypes $\mathbf{c}_i$ <span style="color:#f88146">**do not necessarily belong to**</span> ${\cal X}$ and they must be discovered. 

See for instance a well defined problem in {numref}`KM-clusters` where the red dots are ideal centers $\mathbf{c}_i$ which are the expectations $\mu_i$ of bidimensional Gaussians $N(\mu_i,\sigma\mathbf{I})$ with $\sigma=0.5$.

```{figure} ./images/Topic2/KM-clusters-Photoroom.png
---
name: KM-clusters
width: 800px
align: center
height: 600px
---
Clusters and estimated prototypes in $\mathbb{R}^2$. 
```

### Cost Functions
#### Attractors
A first compact cost function could be:  

$$
F({\cal C}) = \sum_{a=1}^m\min_{1\le i\le k}D(\mathbf{x}_a,\mathbf{c}_i)\;. 
$$

where we are interested in ${\cal C}^{\ast}$, the set $k$ centers/prototypes  <span style="color:#f88146">**minimizing the total distance**</span>. 

In {numref}`KM-clusters`, two centers $\mathbf{c}_1=[-1,-1]^T$ and $\mathbf{c}_2=[1,-1]^T$ and mutually closer than the third one $\mathbf{c}_3=[0,2]^T$. Then, if we represent the individual cost of each point $\mathbf{x}_a$ wrt each center, i.e. $min_{1\le i\le k}D(\mathbf{x}_a,\mathbf{c}_i)$ (see {numref}`KM-dist`) note that there is well-defined minimum at $\mathbf{c}_1=[-1,-1]^T$ which we interpret as an <span style="color:#f88146">**attractor** or **meta-stable state**</span>, since it distorts the locations of the other two minima. Note also, that the cost function is clearly **non convex** and it has many **saddle points** (see for instance the 3D view in {numref}`KM-dist3D`)


```{figure} ./images/Topic2/KM-dist-removebg-preview.png
---
name: KM-dist
width: 800px
align: center
height: 600px
---
Individual cost wrt closest centers. 
```

```{figure} ./images/Topic2/KM-dist3D-removebg-preview.png
---
name: KM-dist3D
width: 800px
align: center
height: 600px
---
Individual cost wrt closest centers (3D view). 
```

However, if we also include the point-to-cluster <span style="color:#f88146">**assignment variables**</span> $\mathbf{M}_{ai}$, we aim to **jointly** minimize:   

$$
F({\cal C},\mathbf{M}) = \sum_{a=1}^m\mathbf{M}_{ai}D(\mathbf{x}_a,\mathbf{c}_i)\;,\text{s.t.}\;\sum_{i=1}^k\mathbf{M}_{ai}=1\;,\forall a\in \{1,\ldots,m\}\;,
$$

where each point can be only assigned to a single cluster and all points must be assigned (each row has a single $1$).


#### Independence 
Note that the Gibbs distribution for $F({\cal C},\mathbf{M})$ is

$$
G({\cal C},\mathbf{M})=\frac{\exp(-\beta F({\cal C},\mathbf{M}))}{Z}\;\;\text{with}\;\;\; Z=\sum_{{\cal C},\mathbf{M}}\exp(-\beta F({\cal C},\mathbf{M}))\;,
$$

where the negative exponential means that <span style="color:#f88146">the probability of a solution **decays exponentially with the distances**</span> beween the points and the cluster centers or prototypes.


Instead of sampling $\Omega={\cal C}\times{\cal P}$ with SA, where

$$
{\cal P} =\left\{\mathbf{M}\in \{0,1\}^{m\times k}:\forall a\in\{1,\ldots,m\}:\sum_{i=1}^k\mathbf{M}_{ai}=1,\right\} 
$$

we are going to <span style="color:#f88146">assume that the variables ${\cal C}$ and $\mathbf{M}$ are **statitically indepedendent**</span>. Then, following the **Appendix**, we compute the **marginal** of the Gibbs distribution wrt the assignment matrices: 

$$
\begin{align}
G({\cal C})&=\sum_{\mathbf{M}\in{\cal P}} G({\cal C},\mathbf{M})\\
           &=\frac{1}{Z}\sum_{\mathbf{M}\in{\cal P}}\exp(-\beta F({\cal C},\mathbf{M}))\\
           &=\frac{1}{Z}\sum_{\mathbf{M}\in{\cal P}}\exp\left(-\beta \sum_{a=1}^m\sum_{i=1}^k\mathbf{M}_{ai}D(\mathbf{x}_a,\mathbf{c}_i)\right)\\
           &=\frac{1}{Z}\sum_{\mathbf{M}\in{\cal P}}Q(\mathbf{M})\;.
\end{align}
$$

Now, looking at the sum of neg-exponentials, we observe that the sum is over the <span style="color:#f88146">**legal assignment matrices**</span> $\mathbf{M}\in{\cal P}$ (each point must be assigned to a unique prototype). This means that any **legal** matrix will be a collection of rows ratisfying:   

$$
\text{rows}(\mathbf{M})\in {\cal C}_k=\overbrace{\{\underbrace{(1,0,\ldots,0)}_{I_1:k\;\text{elements}},\underbrace{(0,1,\ldots,0)}_{I_2:k\;\text{elements}},\ldots,\underbrace{(0,0,\ldots,1)}_{I_k:k\;\text{elements}}\}}^{m\;\text{elements}}\;.
$$

This means that for each of the $m$ rows of $\mathbf{M}$ we have $k$ options for $m$ positions, leading to $k^m$ **total legal matrices**, i.e. terms in $\sum_{\mathbf{M}\in {\cal P}}Q(\mathbf{M})$. However, this number can be reduced if we perform a <span style="color:#f88146">**second independence assumption**: the assignmnet of a point to a cluster is independent to that of another point</span> (logically this is not true for nearby points but this simplifies the calculations).

This allows us to express the neg-exp sum as the following sum, implementing a logical $\lor$:

$$
\sum_{\mathbf{M}\in{\cal P}} Q(\mathbf{M})=\sum_{\text{rows}(\mathbf{M}^{(1)})}\sum_{\text{rows}(\mathbf{M}^{(2)})}\ldots\sum_{\text{rows}(\mathbf{M}^{(m)})}Q(\mathbf{M})
$$

where $\mathbf{M}^{(1)},\mathbf{M}^{(2)},\ldots,\mathbf{M}^{(m)}$ are the assignment matrices that can be chosen for any particular point $\mathbf{x}_1,\mathbf{x}_2,\ldots,\mathbf{x}_m$. 

Since **independence means factorization** we have $\sum\sum\ldots\sum = \sum\cdot\sum\ldots\sum$ as in multiple integration, i.e.: 

$$
Z = \sum_{\mathbf{M}\in{\cal P}} Q(\mathbf{M})=\prod_{a=1}^m\sum_{\text{rows}(\mathbf{M}^{(a)})}Q(\mathbf{M}^{(a)}) = \prod_{a=1}^m\sum_{\text{rows}(\mathbf{M}^{(a)})}\exp\left(-\beta\sum_{i=1}^k\mathbf{M}^{(a)}_{ai}D(\mathbf{x}_a,\mathbf{c}_i)\right)\;.
$$

where the product $\prod_{a=1}^m$ absorbs the sums $\sum_{a=1}^m$ inside each exponential.  

Now *pay attention!* This the point where we use the fact that $\sum_{i=1}^k\mathbf{M}^{(a)}_{ai}=1,\forall a$ (for each row!), i.e. only one cluster will be chosen. Actually, we have an XOR! This leads us to: 

$$
Z=\sum_{\mathbf{M}\in{\cal P}} Q(\mathbf{M})=\prod_{a=1}^m\sum_{\text{rows}(\mathbf{M}^{(a)})}Q(\mathbf{M}^{(a)}) = \prod_{a=1}^m\sum_{i=1}^k\exp\left(-\beta D(\mathbf{x}_a,\mathbf{c}_i)\right)\;.
$$

**Final result**. Then, in order to capture the above factorization, the original cost function $F({\cal C},\mathbf{M})$ must be transformed as follows

$$
F_{\beta}({\cal C}) = -\frac{1}{\beta}\log Z = -\frac{1}{\beta}\sum_{a=1}^m \log\sum_{i=1}^k \exp\left(-\beta D(\mathbf{x}_a,\mathbf{c}_i)\right)\;.
$$

which can be computed in $O(nk)$. Actually, $F_{\beta} = -\frac{1}{\beta}\log Z$ is usally called the [Free Energy](https://en.wikipedia.org/wiki/Gibbs_free_energy#:~:text=In%20traditional%20use%2C%20the%20term,non%2Dpressure%2Dvolume%20work). 

#### Derivation 
The below DA algorithm for central clustering results from the **minimization of the Free energy** as follows: 

Let us compute the derivative of 

$$
F_{\beta}({\cal C})=-\frac{1}{\beta}\sum_{a=1}^m \log\underbrace{\sum_{i=1}^k \exp\left(-\beta D(\mathbf{x}_a,\mathbf{c}_i)\right)}_{Q(\mathbf{x}_a)}
$$ 


wrt each center $\mathbf{c}_i$: 

$$
\begin{align}
\frac{\partial F_{\beta}}{\partial \mathbf{c}_i} &= -\frac{1}{\beta} \sum_{a=1}^m\frac{\partial \log Q(\mathbf{x}_a)}{\partial \mathbf{c}_i}\\
&= -\frac{1}{\beta}\sum_{a=1}^m\frac{1}{Q(\mathbf{x}_a)}\cdot \frac{\partial Q(\mathbf{x}_a)}{\partial \mathbf{c}_i}\\
&= -\frac{1}{\beta}\sum_{a=1}^m\frac{1}{Q(\mathbf{x}_a)}\cdot \left(-e^{-\beta D(\mathbf{x}_a,\mathbf{c}_i)}\right)\cdot\frac{\partial D(\mathbf{x}_a,\mathbf{c}_i)}{\mathbf{c}_i}\;.\\
\end{align}
$$

For $D(\mathbf{x}_a,\mathbf{c}_i)=||\mathbf{x}_a-\mathbf{c}_i||^2$ we have: 

$$
\begin{align}
\frac{\partial D(\mathbf{x}_a,\mathbf{c}_i)}{\mathbf{c}_i}=2(\mathbf{x}_a-\mathbf{c}_i)
\end{align}
$$

Then, plugging this derivative in the main formula, and expanding the $Q(\mathbf{x}_a)$, we obtain: 

$$
\begin{align}
\frac{\partial F_{\beta}}{\partial \mathbf{c}_i} &= -\frac{1}{\beta}\sum_{a=1}^m\frac{1}{Q(\mathbf{x}_a)}\cdot \left(-e^{-\beta D(\mathbf{x}_a,\mathbf{c}_i)}\right)\cdot 2(\mathbf{x}_a-\mathbf{c}_i)\\
&= \frac{2}{\beta}\sum_{a=1}^m\frac{e^{-\beta D(\mathbf{x}_a,\mathbf{c}_i)}}{\sum_{i'=1}^k e^{-\beta D(\mathbf{x}_a,\mathbf{c}_{i'})}}\cdot (\mathbf{x}_a-\mathbf{c}_i)\;.
\end{align}
$$

Note that 

$$
\langle \mathbf{M}_{ai} \rangle = \frac{e^{-\beta D(\mathbf{x}_a,\mathbf{c}_i)}}{\sum_{i'=1}^k e^{-\beta D(\mathbf{x}_a,\mathbf{c}_{i'})}}\;.
$$

Then, 

$$
\begin{align}
\frac{\partial F_{\beta}}{\partial \mathbf{c}_i} &= 
\frac{2}{\beta}\sum_{a=1}^m\langle \mathbf{M}_{ai} \rangle\cdot (\mathbf{x}_a-\mathbf{c}_i)\;.
\end{align}
$$

Since the optimum is obtained when the gradient is zero, we have: 

$$
\frac{\partial F_{\beta}}{\partial \mathbf{c}_i} = \mathbf{0}\Rightarrow \sum_{a=1}^m\langle \mathbf{M}_{ai} \rangle\cdot (\mathbf{x}_a-\mathbf{c}_i)=\mathbf{0}\Rightarrow 
\mathbf{c}_i = \frac{1}{\sum_{a=1}^m\langle \mathbf{M}_{ai} \rangle}\sum_{a=1}^m\mathbf{x}_a\langle \mathbf{M}_{ai} \rangle\;.
$$

Therefore, the DA algorithm for central clustering is simply derived from the gradient of the free energy. Summarizing, DA consists in <span style="color:#f88146">**performing gradient descent** wrt to each inverse temperature $\beta$</span>. 


### Deterministic Annealing
[Deterministic Annealing (DA)](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=726788) is a search technique easily applicable to the central clustering problem (see more details in [Keneth Rose's PhD Thesis](https://thesis.library.caltech.edu/2858/1/Rose_k_1991.pdf)). 

The basic idea of DA is that for each value of $\beta=\frac{1}{T}$ we <span style="color:#f88146">**interate two phases**</span> for minimizing
the free energy $F_{\beta}({\cal C})$:

1) **Expectation**. For <ins>fixed centers</ins> $\mathbf{c}_i$, we estimate the <span style="color:#f88146">**probability**</span> that any point $\mathbf{x}_a$ "belongs" to each cluster ("membership"). We denote such probabilities as $\langle \mathbf{M}_{ai} \rangle$ and they are given by: 

$$
\langle \mathbf{M}_{ai} \rangle = \frac{\exp(-\beta D(\mathbf{x}_a,\mathbf{c}_i))}{\sum_{i'=1}^k \exp(-\beta D(\mathbf{x}_a,\mathbf{c}_{i'}))}\;.
$$

2) **Update**. For <ins>fixed probabilities</ins> $\langle \mathbf{M}_{ai} \rangle$, we update the <span style="color:#f88146">**centers**</span>. If $D(\mathbf{x}_a,\mathbf{c}_i)=||\mathbf{x}_q-\mathbf{c}_i||^2$ (squared Euclidean distance), then we have 

$$
\mathbf{c}_i =  \frac{\sum_{a=1}^m \mathbf{x}_a \langle \mathbf{M}_{ai}\rangle}{\sum_{a=1}^m\langle \mathbf{M}_{ai}\rangle}\;,
$$

i.e. the new ("fuzzy") centers are the expected centers according to the fixed probabilities or ("memberships"). 


```{prf:algorithm} Deterministic Annealing [Clustering]
:label: DA-alg

**Inputs** Given a set of points ${\cal X}=\{\mathbf{x}_1,\ldots,\mathbf{x}_m\}$, the free energy $F_{\beta}({\cal C})$, an annealing schedule $\beta =1/T(t)$ and an inital state ${\cal C}^0=(\mathbf{c}^0_1,\ldots,\mathbf{c}^0_k)$\
**Outputs** ${\cal C}^{\ast}=\arg\min_{{\cal C}\in\Omega} F_{\beta\rightarrow\infty}({\cal C})$ and $\langle \mathbf{M}^{\ast}_{ai}\rangle$

1. convegence = False
2. $\langle \mathbf{M}^{old}_{ai}\rangle\leftarrow 1$
3. $t\leftarrow 0$
4. **while** $\neg$ convergence:
    
    1. **for** $\mathbf{x}_a\in {\cal X}$, $\mathbf{c}_i\in {\cal C}$: 

       $
       \langle \mathbf{M}_{ai} \rangle = \frac{\exp(-\beta D(\mathbf{x}_a,\mathbf{c}_i))}{\sum_{i'=1}^k \exp(-\beta D(\mathbf{x}_a,\mathbf{c}_{i'}))}
       $

    2. **for** $\mathbf{c}_i\in {\cal C}$: 

       $
       \mathbf{c}_i =  \frac{\sum_{a=1}^m \mathbf{x}_a \langle \mathbf{M}_{ai}\rangle}{\sum_{a=1}^m\langle \mathbf{M}_{ai}\rangle}
       $

    3. $t\leftarrow t+1$
    4. $T(t)\leftarrow \frac{1}{\log(1 + t)}$
    5. convergence = $(T(t)<T_{min})$ **or** $(t>t_{max})$ **or** $(\sum_{ai}|\langle \mathbf{M}_{ai}\rangle - \langle \mathbf{M}^{old}_{ai}\rangle|\le\epsilon )$
    6. $\langle \mathbf{M}^{old}_{ai}\rangle\leftarrow \langle \mathbf{M}_{ai}\rangle$
5. **return** ${\cal C}^{\ast}$, $\langle M^{\ast}\rangle$
```



Some considerations about the above algorithm: 
- It is **deterministic**, i.e. we do not draw random numbers at any step of the algoorithm.
- The **initial (inverse) temperature** $\beta=1/T_{max}$, where $T_{max}=T(0)$, is used as in SA, to make all the assignments almost equally probable: note that $e^{-\beta D(\mathbf{x}_a,\mathbf{c}_i)}\rightarrow 1$, if $\beta\rightarrow 0$. Under these conditions, we have $\langle M_{ai}\rangle\approx 1/k$.
- **Initialization**. $T_{max}$ is set to the maximum variance $\sigma^2_{max}$ of  the data. If the data is multidimensional, as it happens usually, $\sigma^2_{max}$ has an spectral interpretation (PCA).
- However, as the algorithm evolves, $\beta$ increases and the **exponential decay becomes more selective**: given $D(\mathbf{x}_a,\mathbf{c}_i)\le D(\mathbf{x}_a,\mathbf{c}_j)+\alpha$, the second distance decays exponentially faster than the first as $\beta$ increases.
- **Alternating Expectation and Update**. Given the probabilities $\langle M_{ai}\rangle$ we update the centers $\mathbf{c}_i$ and then we re-compute the probabilities until a "fixed point" (stable assignment/centers) is reached. 
- **Convervence**. The algorithm converges to the nearest local optimum of the free energy to the initialization point ${\cal C}^0$. We add the condition $\sum_{ai}|\langle \mathbf{M}_{ai}\rangle - \langle \mathbf{M}^{old}_{ai}\rangle|\le\epsilon$ with $\epsilon>0$, which means that the algorithm stops if the $\langle \mathbf{M}_{ai}\rangle$ are stable enough. For instance, if we start by setting $\langle M_{ai}\rangle\approx 1/k$, the algorithm only performs a single iteration. Why? Because the centers in step 4.2. are not modified at all.
- **Complexity**. Each iteration takes $O(mk)$ and the number of iterations can be accelerated by a faster annealing schedule. 
- **Outputs**. The algorithm returns the best centers ${\cal C}^{\ast}$ and the optimal assignments $\langle \mathbf{M}_{ai}^{\ast}\rangle$ where a point $\mathbf{x}_a$ is assigned to the cluster centered at $\mathbf{c}_{i'}$ if: 

$$
i' = \arg\max_{1\le i\le k}\langle \mathbf{M}_{ai}^{\ast}\rangle\;.
$$

We show how the algorithm works in one-dimensional points in the following exercise: 
<br></br>
<span style="color:#d94f0b"> 
**Exercise**. Given the following points ${\cal X}=\{1,2,3,7,7.5,8.25\}$ in $\mathbb{R}$, cluster them with $k=2$, starting from a unique cluster. Use the following inverse-temperature scheduling: $\beta(t+1)=\beta(t)\beta_r$, with $\beta_r = 1.075$. 
<br><br>
Answer. We have then a $(m=6,k=2)$ instance of central clustering. Firstly, we analize the data. The mean and standard deviation of the data are respectively $\mu=4.79$ and $\sigma=2.87$. We may use the mean to encode the unique center, for instance: $c^0_1=c^0_2=\mu$. However, this would result in terminating the algorithm in a single iteration with $\langle \mathbf{M}_{ai}\rangle = 0.5,\forall a,i$. 
<br></br>
A simple way to avoid the $50/50$ setting is to make $c^0_1=\mu + 0.01,\;\; c^0_2=\mu$. Then the initial centers are $c_1^0 = \mathbf{4.801}$ and $c_2^0 = \mathbf{4.791}$.
<br></br>
Regarding $\beta(0)=1/T_{max}$ we make $T_{max}=\sigma$ which results in $\beta(0)=1/2.87=0.34$. 
</span>
<br></br>
<span style="color:#d94f0b">
<ins>Iteration 1</ins>. At each iteration we recompute the (transposed) **squared distances** matrix $\mathbf{D}$ where $\mathbf{D}_{ia}=(c_i-x_a)^2$ with $a\in\{1,\ldots,6\}$ and $i\in\{1,2\}$. We transpose it to better visualize how far is each center from each data point. Better seen in tabular form: 
</span>
<br></br>
<span style="color:#d94f0b">
$
\begin{aligned}
\begin{array}{c|ccc}
  \mathbf{D}^{(1)} & x_1     & x_2    & x_3    & x_4   & x_5   & x_6\\
  \hline
  c_1        & 14.452  & 7.849  &  3.246 & 4.832 & 7.281 & 11.891 \\
  c_2        & 14.376  & 7.793  &  3.210 & 4.876 & 7.335 & 11.960 \\  
  \hline
\end{array}
\end{aligned}
$
</span>
<br></br>
<span style="color:#d94f0b">
Then we calculate the (transposed) neg-exponentials and the sum of each column for later normalization:
<br></br>
<span style="color:#d94f0b">
$
\begin{aligned}
\begin{array}{c|cccccc}
  \exp(-\beta(0)\mathbf{D}^{(1)}) & x_1     & x_2    & x_3    & x_4   & x_5   & x_6\\
  \hline
  c_1        & 0.007  & 0.069 & 0.331 & 0.193 & 0.084 & 0.017 \\
  c_2        & 0.007  & 0.070 & 0.335 & 0.190 & 0.082 & 0.017 \\  
  \hline
  \sum       & 0.014  & 0.139 & 0.666 & 0.383 & 0.166 & 0.034 \\
\end{array}
\end{aligned}
$
</span>
<br></br>
<span style="color:#d94f0b">
Then we proceed to normalize the columns, and also sum the resuling rows for preparing the computation of the new centers: 
</span>
<br></br>
<span style="color:#d94f0b">
$
\begin{aligned}
\begin{array}{c|cccccc|c}
 \langle \mathbf{M}^{(1)}\rangle  & x_1     & x_2    & x_3    & x_4   & x_5   & x_6 &\sum_{a=1}^m  \langle \mathbf{M}^{(1)}_{ai}\rangle\\
  \hline
  c_1        & 0.493  & 0.495 & 0.496 & 0.503 & 0.504 & 0.505 &  2.996 \\
  c_2        & 0.506  & 0.504 & 0.503 & 0.496 & 0.495 & 0.494 &  2.998 \\  
  \hline
\end{array}
\end{aligned}
$
</span>
<br></br>
<span style="color:#d94f0b">
Then the new centers become: 
</span>
<br></br>
<span style="color:#d94f0b">
$
\begin{align}
\mathbf{c}_1^{(1)} &= \frac{1}{2.996}\left(\mathbf{1}\cdot 0.493 + \mathbf{2}\cdot 0.495 + \ldots + \mathbf{8.25}\cdot 0.505\right) = \mathbf{4.819}\\
\mathbf{c}_2^{(1)} &= \frac{1}{2.998}\left(\mathbf{1}\cdot 0.506 + \mathbf{2}\cdot 0.504 + \ldots + \mathbf{8.25}\cdot 0.494\right) = \mathbf{4.763}
\end{align}
$
</span>
<br></br>
<span style="color:#d94f0b">
where the first center $c_1$ starts to attract the largest values of ${\cal X}$ and $c_2$ attracts the lowest ones. 
</span>
<br></br>
<span style="color:#d94f0b">
<ins>Iteration 2</ins>. Given the new centers, we recompute their distances wrt  all the points in ${\cal X}$. The new (transposed) distance is: 
</span>
<br></br>
<span style="color:#d94f0b">
$
\begin{aligned}
\begin{array}{c|cccccc}
  \mathbf{D}^{(2)} & x_1     & x_2    & x_3    & x_4   & x_5   & x_6\\
  \hline
  c_1        & 14.590  & 7.950  & 3.311  & 4.753  & 7.183  & 11.766 \\
  c_2        & 14.164  & 7.637  & 3.110  & 5.001  & 7.487  & 12.155 \\  
  \hline
\end{array}
\end{aligned}
$
</span>
<br></br>
<span style="color:#d94f0b">
Then, the normalized neg-exponetials for $\beta(2)=0.365$ are: 
</span>
<br></br>
<span style="color:#d94f0b">
$
\begin{aligned}
\begin{array}{c|cccccc|c}
 \langle \mathbf{M}^{(2)}\rangle  & x_1     & x_2    & x_3    & x_4   & x_5   & x_6 &\sum_{a=1}^m  \langle \mathbf{M}^{(2)}_{ai}\rangle\\
  \hline
  c_1        & 0.461 & 0.471 & 0.481 & 0.522 & 0.527 & 0.535 &  2.997 \\
  c_2        & 0.538 & 0.528 & 0.518 & 0.477 & 0.472 & 0.464 &  2.997 \\  
  \hline
\end{array}
\end{aligned}
$
</span>
<br></br>
<span style="color:#d94f0b">
Then the new centers become: 
</span>
<br></br>
<span style="color:#d94f0b">
$
\begin{align}
\mathbf{c}_1^{(2)} &= \frac{1}{2.997}\left(\mathbf{1}\cdot 0.461 + \mathbf{2}\cdot 0.471 + \ldots + \mathbf{8.25}\cdot 0.535\right) = \mathbf{4.960}\\
\mathbf{c}_2^{(2)} &= \frac{1}{2.997}\left(\mathbf{1}\cdot 0.538 + \mathbf{2}\cdot 0.528 + \ldots + \mathbf{8.25}\cdot 0.464\right) = \mathbf{4.622}
\end{align}
$
</span>
<br></br>
<span style="color:#d94f0b">
<ins>Iteration 3</ins>. This is the **most informative iteration** so far because the two centers are clearly separated. They define what we later call a **phase change**. The distances give a hint of this: 
</span>
<br></br>
<span style="color:#d94f0b">
$
\begin{aligned}
\begin{array}{c|cccccc}
  \mathbf{D}^{(3)} & x_1     & x_2    & x_3    & x_4   & x_5   & x_6\\
  \hline
  c_1        & 15.689  & 8.767  & 3.845  & 4.157  & \mathbf{6.446}  &  10.817\\
  c_2        & 13.121  & \mathbf{6.876}  & 2.632  & 5.653  & 8.280  &  13.159\\  
  \hline
\end{array}
\end{aligned}
$
</span>
<br></br>
<span style="color:#d94f0b">
where the largest values are attracted by $c_2$ and the smallest ones by $c_2$ (see for instances the data in bold).
Then, the normalized neg-exponetials for $\beta(3)=0.392$ are: 
</span>
<br></br>
<span style="color:#d94f0b">
$
\begin{aligned}
\begin{array}{c|cccccc|c}
 \langle \mathbf{M}^{(3)}\rangle  & x_1     & x_2    & x_3    & x_4   & x_5   & x_6 &\sum_{a=1}^m  \langle \mathbf{M}^{(3)}_{ai}\rangle\\
  \hline
  c_1        & 0.267 & 0.322 & 0.383 & 0.642 & \mathbf{0.672} & 0.715 &  3.001 \\
  c_2        & 0.732 & \mathbf{0.677} & 0.616 & 0.357 & 0.327 & 0.284 &  2.993 \\  
  \hline
\end{array}
\end{aligned}
$
</span>
<br></br>
<span style="color:#d94f0b">
and then the new centers more "polarized": 
</span>
<br></br>
<span style="color:#d94f0b">
$
\begin{align}
\mathbf{c}_1^{(3)} &= \frac{1}{3.001}\left(\mathbf{1}\cdot 0.267 + \mathbf{2}\cdot 0.322 + \ldots + \mathbf{8.25}\cdot 0.715\right) = \mathbf{5.828}\\
\mathbf{c}_2^{(3)} &= \frac{1}{2.993}\left(\mathbf{1}\cdot 0.732 + \mathbf{2}\cdot 0.677 + \ldots + \mathbf{8.25}\cdot 0.284\right) = \mathbf{3.752}
\end{align}
$
</span>
<br></br>
<span style="color:#d94f0b">
<ins>Iterations 4 and 5</ins>. In these iterations, "intra-cluster" distances in bold) get small and stabilized  
</span>
<br></br>
<span style="color:#d94f0b">
$
\begin{aligned}
\begin{array}{c|cccccc}
  \mathbf{D}^{(4)} & x_1     & x_2    & x_3    & x_4   & x_5   & x_6\\
  \hline
  c_1        & 23.317  & 14.659  & 8.002  & \mathbf{1.371}   & \mathbf{2.792}  &  \mathbf{5.862} \\
  c_2        & \mathbf{7.575}   & \mathbf{3.070}   & \mathbf{0.565}  & 10.547  & 14.045  & 20.229\\  
  \hline
\end{array}
\end{aligned}
$
</span>
<br></br>
<span style="color:#d94f0b">
$
\begin{aligned}
\begin{array}{c|cccccc}
  \mathbf{D}^{(5)} & x_1     & x_2    & x_3    & x_4   & x_5   & x_6\\
  \hline
  c_1        & 42.347  & 30.332  & 20.317  & \mathbf{0.257}   & \mathbf{0.000}  &  \mathbf{0.551} \\
  c_2        & \mathbf{1.0841}  & \mathbf{0.001}   & \mathbf{0.919}   & 24.589   & 29.798  & 38.548  \\  
  \hline
\end{array}
\end{aligned}
$
</span>
<br></br>
<span style="color:#d94f0b">
and the centers are respectively
</span>
<br></br>
<span style="color:#d94f0b">
$
\begin{align}
c_1^{(4)}&=7.507,\; c_2^{(4)}=2.041\\
c_1^{(5)}&=7.583,\; c_2^{(5)}=1.999\\
\end{align}
$
<span>
<br></br>
<span style="color:#d94f0b">
<ins>Last iteration</ins>. Interestingly, at the end of iteration 5 we have the following assignment matrix: 
</span>
<br></br>
<span style="color:#d94f0b">
$
\begin{aligned}
\begin{array}{c|cccccc|c}
 \langle \mathbf{M}^{(5)}\rangle  & x_1     & x_2    & x_3    & x_4   & x_5   & x_6 &\sum_{a=1}^m  \langle \mathbf{M}^{(5)}_{ai}\rangle\\
  \hline
  c_1        & 0.000 & 0.000 & 0.000 & 0.999 & 0.999 & 0.999 & 2.997  \\
  c_2        & 0.999 & 0.999 & 0.999 & 0.000 & 0.000 & 0.000 & 2.997  \\  
  \hline
\end{array}
\end{aligned}
$
</span>
<br></br>
<span style="color:#d94f0b">
Note that the first 3 points belong to cluster 2 and the second 3 points belong to cluster 1. Is this enough to converge? Let us see. First of all we recompute the distances: 
</span>
<br></br>
$
\begin{aligned}
\begin{array}{c|cccccc}
  \mathbf{D}^{(6)} & x_1     & x_2    & x_3    & x_4   & x_5   & x_6\\
  \hline
  c_1        & 43.337  & 31.171  & 21.004  & 0.340    &  0.006   &  0.444 \\
  c_2        & 0.999   & 0.000   & 1.000   & 25.000   &  30.250  &  39.062 \\  
  \hline
\end{array}
\end{aligned}
$
</span>
<br></br>
<span style="color:#d94f0b">
Then, the normalized neg-exponentials for $\beta(6)=0.488$ are: 
</span>
<br></br>
<span style="color:#d94f0b">
$
\begin{aligned}
\begin{array}{c|cccccc|c}
 \langle \mathbf{M}^{(6)}\rangle  & x_1     & x_2    & x_3    & x_4   & x_5   & x_6 &\sum_{a=1}^m  \langle \mathbf{M}^{(3)}_{ai}\rangle\\
  \hline
  c_1        & 0.000 & 0.000 & 0.000 & 0.999 & 0.999  & 0.999  & 2.997   \\
  c_2        & 0.999 & 0.999 & 0.999 & 0.000 & 0.000  &  0.000 & 2.997   \\  
  \hline
\end{array}
\end{aligned}
$
</span>
<br></br>
<span style="color:#d94f0b">
Which is clearly "stabilized" wrt $\langle \mathbf{M}^{(5)}\rangle$ (it is identical). As a result, **the cluster centers will be identical** and the algorithm **converges** returning these cluster centers $c_1^{(6)}=c_1^{(5)}$, $c_2^{(6)}=c_2^{(5)}$ and $\langle \mathbf{M}^{\ast}\rangle=\langle \mathbf{M}^{(6)}\rangle$.
</span>
<br></br>


#### Entropy and Free Energy
Deterministic Annealing (DA) relies on the following **theorem**: 

*Minimizing the Free Energy $F(\omega)=-\frac{1}{\beta}\log Z$ is equivalent to maximizing the Entropy $H(G)$ of the Gibbs distribution $G(\omega)=\exp(-\beta{\cal E}(\omega))/Z$, where ${\cal E}(\omega)$ is a cost (energy) function, as follows:*

$$
F(\omega) = \langle {\cal E}(\omega) \rangle - \frac{1}{\beta}H(G)\;.  
$$

**Proof**. We start by expanding the definition of entropy (see the **Appendix**): 

$$
\begin{align}
H(G) &:=-\sum_{\omega}G(\omega)\log G(\omega)\\
     &= -\sum_{\omega}\frac{\exp(-\beta{\cal E}(\omega))}{Z}\log \frac{\exp(-\beta{\cal E}(\omega))}{Z}\\
     &= -\sum_{\omega}\frac{\exp(-\beta{\cal E}(\omega))}{Z}\left[-\beta {\cal E}(\omega) - \log Z\right]\\
     &= -\sum_{\omega}G(\omega)\left[-\beta {\cal E}(\omega) - \log Z\right]\;.\\
\end{align}
$$

Since, $\langle {\cal E}(\omega) \rangle$ is the expectation

$$
\langle {\cal E}(\omega) \rangle = \sum_{\omega}G(\omega){\cal E}(\omega)\;,
$$

we have

$$
\begin{align}
H(G) &= -\sum_{\omega}G(\omega)\left[-\beta {\cal E}(\omega) - \log Z\right]\\
     &= \beta\langle {\cal E}(\omega) \rangle + \sum_{\omega}G(\omega)\log Z\;.\\
\end{align}
$$

Since $G(\omega)$ is a probability distribution, we have $\sum_{\omega}G(\omega)=1$ and therefore 

$$
H(G) = \beta\langle {\cal E}(\omega) \rangle + \log Z\;,
$$

and finally, 

$$
\begin{align}
-\log Z &= \beta\langle {\cal E}(\omega) \rangle - H(G)\\
-\frac{1}{\beta}\log Z &= \langle {\cal E}(\omega) \rangle -\frac{1}{\beta}H(G)\\
F(\omega) &= \langle {\cal E}(\omega)\rangle -\frac{1}{\beta}H(G)\;.\\
\end{align}
$$

Consequently, since $H(G)\ge 0$, we have that minimizing $F(\omega)$ implies minimizing  $\langle {\cal E}(\omega)\rangle$ (known as <span style="color:#f88146">**average cost**</span> or <span style="color:#f88146">**average distortion**</span>) as well as maximizing (i.e. -minimizing) the entropy $H(G)$ if the average cost is kept constant. 

#### Maximum Entropy 
The expansion of Free energy $F(\omega)$ as 

$$
F(\omega) = \langle {\cal E}(\omega) \rangle - \frac{1}{\beta}H(G)\;.  
$$

suggests how it is minimized:
1) In the beginning, where we have $T_{max}$ and $\beta\rightarrow 0$, we have that the dominant term is the negative of entropy $-H(G)$. This means that in the early stages of the algorithm, **entropy is maximized**. 

2) [Maximum Entropy](https://en.wikipedia.org/wiki/Principle_of_maximum_entropy) is a fundamental principle in Information Theory. It states that: *<span style="color:#f88146">of all the probability distributions that satisfy a given set of constraints, choose the one that maximizes the entropy*</span>.

3) Membership probabilities, $\langle\mathbf{M}_{ai}\rangle$ can be interpreted as **conditional probabilities**  

$$
\langle\mathbf{M}_{ai}\rangle = p(\mathbf{x}_a\in \text{cluster}(\mathbf{c}_i))=p(C=
i|X=a)\;,
$$ 

i.e., given a point $\mathbf{x}_a$, $p(C=
i|X=a)$ denotes <span style="color:#f88146">*how likely is $\mathbf{c}_i$ to be the true center for this point*</span>. This is usually called <span style="color:#f88146">**likelihood**</span>: how good is this center for be a <span style="color:#f88146">**codeword**</span> of that point. 

4) As a result, the **conditional entropy** $H(C|X)$ is maximized in the early stages of DA: 

$$
\begin{align}
H(C|X)&= \sum_{\mathbf{x}_a}p(X=a)H(C|X=a)\\
      &\approx \frac{1}{m}\sum_{\mathbf{x}_a}H(C|X=a)\\
      &= -\frac{1}{m}\sum_{\mathbf{x}_a}\sum_{\mathbf{c}_i}p(C=i|X=a)\log p(C=i|X=a)\\
      &= -\frac{1}{m}\sum_{a}\sum_{i}\langle\mathbf{M}_{ai}\rangle\log \langle\mathbf{M}_{ai}\rangle\\
\end{align}
$$

However, should be interesting to compute also the conditional entropy $H(X|C)$ so that we can visualize the **entropy per cluster** instead of the **entropy per point** as in $H(C|X)$. To this end, look at the Bayes theorem:

$$
p(X|C)=\frac{p(C|X)p(X)}{p(C)}\;.
$$

We assume $p(X=a)=\frac{1}{m}$ since we don't know how probable is a specific data in advance. But, how to estimate $p(C)$ we exploit the following property: 

$$
\begin{align}
p(C=i) &= \sum_{\mathbf{x}_a}p(X=a)p(C=i|X=a)\\ 
       &= \frac{1}{m}\sum_{\mathbf{x}_a}p(C=i|X=a)\\
       &= \frac{1}{m}\sum_{a}\langle\mathbf{M}_{ai}\rangle\\
\end{align}
$$

Then, we have that

$$
p(X=a|C=i)=\frac{\frac{1}{m}\langle\mathbf{M}_{ai}\rangle}{\frac{1}{m}\sum_{a}\langle\mathbf{M}_{ai}\rangle} = \frac{\langle\mathbf{M}_{ai}\rangle}{\sum_{a}\langle\mathbf{M}_{ai}\rangle}
$$

And finally

$$
\begin{align}
H(X|C) &= \sum_{\mathbf{c}_i}p(C=i)H(X|C=i)\\
       &\approx \frac{1}{m}\sum_{\mathbf{c}_i}\sum_{a}\langle\mathbf{M}_{ai}\rangle H(X|C=i)\;,\\
\end{align}
$$

where

$$
H(X|C=i) &= -\sum_{\mathbf{x}_a}p(X=a|C=i)\log p(X=a|C=i)\\
         &= -\sum_{a}\frac{\langle\mathbf{M}_{ai}\rangle}{\sum_{a}\langle\mathbf{M}_{ai}\rangle}\log \frac{\langle\mathbf{M}_{ai}\rangle}{\sum_{a}\langle\mathbf{M}_{ai}\rangle}\;.
$$

is the <span style="color:#f88146">**entropy per cluster** $i$</span> and it is a good quantity to analyze the **convergence of the algorithm**. 

Look, for instance, at {numref}`KM-entropy-example` for the solutions of the above exercise where $\langle M_{ai}\rangle$ behave like Bernouilli variables when we have two clusters. At the last iteration, the **maximal entropy per cluster** is lower than the initial one: it is around $1$ bit (i.e. $\log 3=1.09$ nats) because half of the points belong to each cluster and half of them do not. This **indicates good convergence**

```{figure} ./images/Topic2/KM-entropy-example-removebg-preview.png
---
name: KM-entropy-example
width: 800px
align: center
height: 600px
---
DA as entropy minimization in the exercise example. 
```

However, as we show in {numref}`KM-entropy-example-K3` in the motivating example for $k=3$ clusters, the final **maximal entropy per cluster** is around $4-5$ bits (i.e. $\log 102 = 4.62$ nats). This indicates that the data is <span style="color:#f88146">**not well quantized**</span> with $k=3$ centers.


```{figure} ./images/Topic2/KM-entropy-example-K3-removebg-preview.png
---
name: KM-entropy-example-K3
width: 800px
align: center
height: 600px
---
DA as entropy minimization the motivating example (K=3). 
```

However, as we show in {numref}`KM-entropy-example-K2` in the motivating example for $k=3$ clusters, the final **maximal entropy per cluster** is around $5$ bits (i.e. $\log 201 = 5.30$ nats). This indicates that the data is <span style="color:#f88146">**better quantized**</span> with $k=2$ centers: <span style="color:#f88146">we get (nearly) the same entropy with less centers!</span>

```{figure} ./images/Topic2/KM-entropy-example-K2-removebg-preview.png
---
name: KM-entropy-example-K2
width: 800px
align: center
height: 600px
---
DA as entropy minimization the motivating example (K=2). 
```

Actually, the new centers are $\mathbf{c}_1=[0,-1]^T$ and $\mathbf{c}_2=[0,2]^T$ (see {numref}`KM-clusters-K2`). 

```{figure} ./images/Topic2/KM-clusters-K2-removebg-preview.png
---
name: KM-clusters-K2
width: 800px
align: center
height: 600px
---
Clusters and estimated prototypes in $\mathbb{R}^2$ (K=2). 
```

As a result, <span style="color:#f88146">**entropy is crucial** for the interpretation of clustering problems!</span>


## SoftMax for Graph Matching

### SoftMaxing 
#### Continuation Methods
Back to the graph matching problem or MCS (Maximum Common Subgraph), we have formulated it in 
<span style="color:#f88146">**discrete terms** via SA</span>: 
1) The search space ${\cal P}$ is the set of **binary matrices** 
$\mathbf{M}\in\{0,1\}^{m\times n}$ whose rows and columns have a unique $1$. When $m=n$ we have that the search space is 
the Permuthoedron $\Pi_n$, i.e. the set of **permutation matrices**. 
2) SA explores the search space via **random walks** that eventually hit the **global maximum** 
under a **slow annealing schedule**. Each iteration takes $O(n^4)$ for evaluating the 
quadratic cost function $F(\mathbf{M})$. 

We have seen that central clustering can be posed, however, in <span style="color:#f88146">**continuous terms** via DA</span>:
1) The search space is $\mathbb{R}^d\times {\cal P}$: we have to jointly find $k$ centers 
$\mathbf{c}_i\in\mathbb{R}^d$ and a binary matrix $\mathbf{M}\in\{0,1\}^{m\times k}$ indicating 
how the $m$ points $\mathbf{x}_a\in\mathbb{R}^d$ are assigned to the most probable cluster.

2) DA **decouples the optimization** of the Free energy $F({\cal C},\mathbf{M})$: given a high-entropic 
(almost uniform) initial assignment we alternate **expectation** (update the assignments while
the centers are fixed) and **update** the centers. Each iteration takes $O(m\times k)$ and 
it **allows faster annealing schedules**. 

3) <span style="color:#f88146">**Remember**</span> that DA's success relies on the possibility of evaluating $Z$ 
(the Gibbs partition function) thanks to the **independence assumption**: points are assumed to 
be independently assigned to any cluster.  


In the [Gurobi framework ](https://gurobi-optimods.readthedocs.io/en/latest/index.html), 
for optimization, where graph matching is seen as a Maximmum Clique, many combinatorial algorithm share 
the following features with DA: 

1) **Transforming** the original discrete problem into a continuous one.

2) **Optimize** the objective function in the continuous space by means of a polynomial method.

3) **Revert** the result to the discrete space through a *clean-up* heuristic. 
 
These methods are called <span style="color:#f88146">**continuation method**</span>. 

#### SoftMax operator
In 1996, [Steven Gold and Anand Rangarajan](https://www.cise.ufl.edu/~anand/pdf/pamigm3.pdf) 
exploited the following **simple algorithm** for finding the maximum element in a list 
of numbers:

```{prf:algorithm} Soft Maximum
:label: Softmax-alg

**Inputs** A list of elements $L=[X_1,\ldots,X_m]$ with $X_i\in\mathbb{R}$, $\beta_0>0$\
**Outputs** $m_i>0$ if $X_i=\max_{L}$ and $0$ otherwise.

1. Initialize $\beta\leftarrow \beta_0$
2. **while** $\beta < \beta_f$:  

    1. **for** $i=1,\ldots, m$:
    
          $m_i\leftarrow \exp(\beta X_i)$
    
    2. **for** $i=1,\ldots, m$:
    
          $m_i\leftarrow \frac{m_i}{\sum_{i=1}^m m_i}$

    3. Increase $\beta$

3. **return** $\{m_i\}$ where $\sum_{i=1}^m m_i=1$
```

This algorithm is equivalent to give the $\{m_i\}$ which maximize 
$\sum_{i=1}^m m_iX_i$, now formulated in continuous terms via the **control parameter** 
$\beta>0$. This is the **softmax** operator: 

$$
m_j = \frac{\exp(\beta X_j)}{\sum_{i=1}^m\exp(\beta X_i)}\;,
$$

which is the classical mechanism to select class-membership in the last layer of a neural network!

We show this mechanism in {numref}`DAExample`. As we increase $\beta$, the $m_i$ corresponding to the maximum $X_i$ 
(note that there may be many maximal values) tend to $1$, whereas the $m_i$ for the non-maximal values drop to $0$. 
In the neural-network jargon, this mechanism is dubbed as **winner takes all** or **WTA**.

```{figure} ./images/Topic2/DAExample-Photoroom.png
---
name: DAExample
width: 800px
align: center
height: 800px
---
Softmaxing a list of real-valued elements.
```

### Graduated Assignment
#### Linear Assignment 
Linear Assignment (LA) is polinomial problem with many applications in Computer Science and AI. 
Given a **bipartite graph** $G=(V,E)$ where: a) $V = V_1\cup V_2$,$V_1\cap V_2=\emptyset$ is 
a vertex partition and b) $E=V_1\times V_2$, i.e. there are only edges between $V_1$ and $V_2$. 
In addition, the graph is **weighted** since we are given a matrix of variables 
$\mathbf{X}\in\mathbb{R}^{m\times n}$, with $m=|V_1|$, $n=|V_2|$, where $\mathbf{X}_{ai}$ measures the **affinity** 
between $i\in V_1$ and $j\in V_2$ and its meaning depends on the problem (compatibility between processes and 
CPUs, local similarity between points of two shapes, etc).  

The <span style="color:#f88146">**discrete** (original) version</span> of LA is kind of obvious: 

$$
\begin{align}
\mathbf{M}^{\ast} &= \arg\max_{{\cal P}} E(\mathbf{M})=\sum_{a\in V_1}\sum_{i\in V_2}\mathbf{M}_{ai}\mathbf{X}_{ai}\\
{\cal P}=&\left\{\mathbf{M}\in \{0,1\}^{m\times n}:\forall a\in V_1:\sum_{i\in V_2}\mathbf{M}_{ai}\le 1\;\text{and}\; \forall i\in V_2:\sum_{a\in V_1}\mathbf{M}_{ai}\le 1\;\right\}
\end{align}
$$

In other words, <span style="color:#f88146">LA is the **linear version** of QAP</span> where $\mathbf{X}$ is **real-valued** 
instead of being an adjacency matrix. Concerning the search space, if $m=n$ then ${\cal P}=\Pi_n$ 
space is the Permuthoedron. However, it is not in NP and actually it can be solved in $O(n^3)$ with 
the so called [Hungarian algorithm](https://en.wikipedia.org/wiki/Hungarian_algorithm). 

However, herein we are going to introduce the <span style="color:#f88146">**soft/continuous** version</span> which 
is basically an extension of **SoftMax** incorporating: 
- **Continuous assingments**. Instead of $\mathbf{M}\in \{0,1\}^{m\times n}$ we have $\mathbf{M}\in [0,1]^{m\times n}$.
- **Two-way constraints**. Each $i\in V_1$ has a **probability** of being assigned to $a\in V_2$ and viceversa. Therefore:

$$
\forall a:\sum_{i=1}^n\mathbf{M}_{ai}=1\;\;\text{and}\;\;\forall i:\sum_{a=1}^m\mathbf{M}_{ai}=1\;.
$$

i.e. our search space is a matricial generalization of the <span style="color:#f88146">(vectorial) **simplex**</span>: 

$$
\Delta_n =\left\{\{m_i\}\in\mathbb{R}^n:\sum_{i=1}^nm_i=1\right\}\;.
$$

Actually, if $m=n$ the search space is not the Permuthoedron $\Pi_n$ but the 
[Birkhoff polytope](https://en.wikipedia.org/wiki/Doubly_stochastic_matrix) $\mathbb{B}_n$. In matricial 
terms, $\mathbb{B}_n$ is the set of **doubly stochastic** matrices (matrices whose rows and columns add $1$). 
In this regard, remember that $\Pi_n\subset \mathbb{B}_n$ since every permutation matrix is doubly stochastic. 
Therefore, the vertices of the Birkhoff polytope are the elements of $\Pi_n$ (solutions to the discrete problem)
but during the resolution of the problem <span style="color:#f88146">**continuous solutions** lying in the 
edges and sides of the polytope are **allowed**</span>. 

Algorithmically, the **SoftLA** needs to **enforce the two-way constraints** in every iteration as follows: 

```{prf:algorithm} SoftLA
:label: SoftLA-alg

**Inputs** Affinity matrix $\mathbf{X}\in\mathbb{R}^{m\times n}$, $\beta_0>0$\
**Outputs** $\mathbf{M}^{\ast}$ maximal assignment.

1. Initialize $\beta\leftarrow \beta_0$
2. **while** $\beta < \beta_f$:  

    1. $\mathbf{M}_{ai}\leftarrow \exp(\beta \mathbf{X}_{ai})$
    
    2. $\mathbf{M}\leftarrow\text{Sinkhorn}(\mathbf{M})$
   
    3. Increase $\beta$

3. **return** $\mathbf{M}^{\ast}$ optimal doubly-stochastic matrix. 
```
where $\text{Sinkhorn}(\mathbf{M})$ is an iterative row-col normalization process which 
converges to a stochastic matrix: 

```{prf:algorithm} Sinkhorn
:label: Sinkhorn-alg

**Inputs** Exponential assignment matrix: $\mathbf{M}=\exp(\beta\mathbf{X})$\
**Outputs** $\mathbf{M}$ doubly-stochastic matrix.

1. $t\leftarrow 0$

2. convergence$\leftarrow$ False

3. **while** $\neg$convergence: 

    1. $\mathbf{M}^{old}_{ai}\leftarrow \mathbf{M}_{ai}$ 

    2. **for** $a=1,2,\ldots,m$ (Normalize rows)

        $\mathbf{M}_{ai}\leftarrow \frac{\mathbf{M}_{ai}}{\sum_{i=1}^n \mathbf{M}_{ai}}$
    
    3. **for** $i=1,2,\ldots,n$ (Normalize cols)

        $\mathbf{M}_{ai}\leftarrow \frac{\mathbf{M}_{ai}}{\sum_{i=a}^m \mathbf{M}_{ai}}$

    4. $t\leftarrow t + 1$
    5. convergence = $(t>t_{max})$ **or** $(\sum_{ai}|\mathbf{M}_{ai}- \mathbf{M}^{old}_{ai}|\le\epsilon )$
  
  

3. **return** $\mathbf{M}$ doubly-stochastic matrix. 
```

Interestingly, the structure of $\text{SoftLa}$ is rather simple and essentially is the similar to that 
of $\text{SoftMax}$ but for matrices, taking $O(n^2)$ instead of $O(n^3)$ as the Hungarian algorithm. 

In {numref}`Hungarian` we show how $\text{SoftLa}$ solves a problem in a couple of 
iterations

```{figure} ./images/Topic2/SA-Hungarian-removebg-preview.png
---
name: Hungarian
width: 800px
align: center
height: 600px
---
SoftLA solution for a $m=n=5$ instance with random costs.
```

Note that: 
1) The general initialization is $\mathbf{M}^0=\mathbf{1}^T\mathbf{1}+ \epsilon\mathbf{I}$, 
i.e. the matrix of ones with a slight perturbation: $\mathbf{M}^0_{ai}=1+\epsilon$. This 
initialization is known as **baricenter** or **neutral**. 
2) In the above example, $\mathbf{X}$ is drawn from $n\times n$ **random** integers between $0$ and $n^2-1$.
3) The role of $\text{Sinkhorn}$ is to enforce the two-way constraints and sometimes it may lead 
to a very entropic doubly-stochastic matrix as we show in the following exercise. 
<br></br>
<span style="color:#d94f0b"> 
**Exercise**. Prove that applying $\text{Sinkhorn}$ to the exponentiation of the above matrix 
converges to a maximum entropy assignment. Do not use $\beta$. Explain **why**.
</span>
<br></br>
<span style="color:#d94f0b"> 
$
\mathbf{X}=
\begin{bmatrix}
x          & x+\epsilon\\
x-\epsilon & x\\
\end{bmatrix}
$
</span>
<br></br>
<span style="color:#d94f0b">
A priori, it seems that the optimal assignment is $(a=1)\rightarrow (i=2)$ and $(a=2)\rightarrow (i=2)$
but this is not possible due to the two-way constraint. Let us apply the exponential:
</span>
<br></br>
<span style="color:#d94f0b"> 
$
\mathbf{M}=
\begin{bmatrix}
e^x          & e^{x+\epsilon}\\
e^{x-\epsilon} & e^x\\
\end{bmatrix}
$
</span>
<br></br>
<span style="color:#d94f0b">
<ins>Row normalization</ins> makes things independent of $x$:
</span>
<br></br>
<span style="color:#d94f0b"> 
$
\begin{align}
\mathbf{M} &=
\begin{bmatrix}
\frac{e^x}{e^x + e^{x+\epsilon}}   & \frac{e^{x+\epsilon}}{e^x + e^{x+\epsilon}}\\
\frac{e^{x-\epsilon}}{e^x + e^{x-\epsilon}} & \frac{e^x}{e^x + e^{x-\epsilon}}\\
\end{bmatrix}
=\begin{bmatrix}
\frac{1}{1 + e^{\epsilon}}   & \frac{e^{\epsilon}}{1 + e^{\epsilon}}\\
\frac{e^{-\epsilon}}{1 + e^{-\epsilon}} & \frac{1}{1 + e^{-\epsilon}}\\
\end{bmatrix}
\end{align}
$
</span>
<br></br>
<span style="color:#d94f0b">
<ins>Column normalization</ins>:
</span>
<br></br>
<span style="color:#d94f0b"> 
$
\begin{align}
\mathbf{M}_{:1} &= \frac{1}{1 + e^{\epsilon}}  + \frac{e^{-\epsilon}}{1 + e^{-\epsilon}} 
                 = \frac{(1 + e^{-\epsilon})+e^{-\epsilon}(1 + e^{\epsilon})}{(1 + e^{\epsilon})(1 + e^{-\epsilon})}
                 = \frac{(1 + e^{-\epsilon}) + (1 + e^{-\epsilon})}{(1 + e^{\epsilon})(1 + e^{-\epsilon})}
                 = \frac{2}{1 + e^{\epsilon}}\\
\mathbf{M}_{:2} &= \frac{e^{\epsilon}}{1 + e^{\epsilon}} + \frac{1}{1 + e^{-\epsilon}}
                 = \frac{e^{\epsilon}(1 + e^{-\epsilon}) + (1 + e^{\epsilon})}{(1 + e^{\epsilon})(1 + e^{-\epsilon})}
                 = \frac{(1 + e^{\epsilon}) + (1 + e^{\epsilon})}{(1 + e^{\epsilon})(1 + e^{-\epsilon})}
                 = \frac{2}{1 + e^{-\epsilon}}
\end{align}   
$
</span>
<br></br>
<span style="color:#d94f0b"> 
$
\begin{align}
\mathbf{M} &=
\begin{bmatrix}
\frac{\mathbf{M}_{a1}}{\mathbf{M}_{:1}}   & \frac{\mathbf{M}_{a2}}{\mathbf{M}_{:2}} \\
\frac{\mathbf{M}_{b1}}{\mathbf{M}_{:1}}   & \frac{\mathbf{M}_{b2}}{\mathbf{M}_{:2}} \\
\end{bmatrix} 
= \begin{bmatrix}
\frac{1}{1 + e^{\epsilon}}:\frac{2}{1 + e^{\epsilon}}              & \frac{e^{\epsilon}}{1 + e^{\epsilon}}:\frac{2}{1 + e^{-\epsilon}}\\
\frac{e^{-\epsilon}}{1 + e^{-\epsilon}}:\frac{2}{1 + e^{\epsilon}} & \frac{1}{1 + e^{-\epsilon}}:\frac{2}{1 + e^{-\epsilon}}\\
\end{bmatrix}
= \begin{bmatrix}
\frac{1}{2}   & \frac{1}{2}\\
\frac{1}{2}   & \frac{1}{2}\\
\end{bmatrix}
\end{align}
$
</span>
<br></br>
<span style="color:#d94f0b">
which proves the statement. **Why**? Well, adding an $\epsilon$ in one column and 
subracting it in the opposite column leads to maximal uncertainty wrt the diagonal 
which is preserved. This is not obvious, and even **counter-ituitive**, but it is a bit clarified by the removal of 
$x$ after row normalization. 
</span>
<br></br>

#### SoftAssign
The extension of $\text{SoftLa}$ to graph matching is almost straight. Now, instead of 
having a linear cost function, the graph-matching cost function is **quadratic** in $\mathbf{M}$ 
(it is the QAP-Quadratic Assignment Problem). Then, the <span style="color:#f88146">**relaxed QAP**</span> becomes 

$$
\begin{align}
\mathbf{M}^{\ast}= & \arg\max_{{\cal B}}F(\mathbf{M}) = \frac{1}{2}\sum_{a\in V}\sum_{i\in V'}\sum_{b\in V}\sum_{j\in V'}\mathbf{M}_{ai}\mathbf{M}_{bj}\mathbf{X}_{ab}\mathbf{Y}_{ij}\\
{\cal B}=&\left\{\mathbf{M}\in [0,1]^{(m+1)\times (n+1)}:\forall a\in V:\sum_{i\in V'}\mathbf{M}_{ai}=1\;\text{and}\; \forall i\in V':\sum_{a\in V}\mathbf{M}_{ai}=1\;\right\}
\end{align}
$$

where
1) ${\cal B}$ is the set of doubly-stochastic matrices of dimension $(m+1)\times (n+1)$, and 
the Birkhoff polytope or order $n+1$, i.e. ${\cal B}_{n+1}$, if $m=n$. 

2) Why an **extra dimension**? If $m<n$, for instance, we can only have $m$ one-to-one assignments 
and $n-m$ un-assigned nodes. In order to deal with this, we assume that these un-asigned nodes 
match a <span style="color:#f88146">**slack** or **virtual** node</span>. 
Therefore, adding an extra dimension accomodates two virtual 
nodes (one per graph). Similar reasoning for $m>n$. 

Note that $\text{SoftLA}$ is just a Deterministic Annealing (DA) applied to linear assignment. 
If so, one of the driving features of DA is to <span style="color:#f88146">assume **assignment independence**</span>, 
i.e. we may assume that the *assignment of $a\in V$ to $i\in V'$ is independent of that of 
another $a'\in V$ to $i'\in V'$*. 

**Taylor Expansion**. [Steven Gold and Anand Rangarajan](https://www.cise.ufl.edu/~anand/pdf/pamigm3.pdf) used a similar principle 
to approximate the quadratic cost function. Note that $f(x)\approx f(a) + f'(a)(x-a)$ is 
the first order Taylor expansion. Then  

$$
F(\mathbf{M})\approx F(\mathbf{M}^0) + F'(\mathbf{M}^0)\cdot\sum_{a\in V}\sum_{i\in V}(\mathbf{M}_{ai}-\mathbf{M}^0_{ai})\;.
$$

where 

$$
F'(\mathbf{M}^0):={\frac{\partial F}{\partial\mathbf{M}_{a_i}}}\bigg|_{\mathbf{M}^0}=\underbrace{\sum_{b\in V}\sum_{j\in V'}\mathbf{M}_{bj}\mathbf{X}_{ab}\mathbf{Y}_{ij}}_{\mathbf{Q}_{ai}}\\
$$

i.e. the derivative of a quadratic function is a linear one! Then, for a fixed $\mathbf{M}^0$, we 
have that 

$$
\max F(\mathbf{M}) = \max \sum_{a\in V}\sum_{i\in V}\mathbf{M}_{ai}\mathbf{Q}_{ai}\;,
$$

i.e. a **linear assignment problem**. 
In other words: 
1) $\mathbf{Q}_{ai}=\sum_{b\in V}\sum_{j\in V'}\mathbf{M}_{bj}\mathbf{X}_{ab}\mathbf{Y}_{ij}$ works as 
a **temporary affinity matrix** for each $\beta$.
2) Given $\mathbf{Q}_{ai}$ we iterate to get the best (maximal) assignment $\mathbf{M}_{ai}$ for this 
$\beta$, i.e. we solve a **new assignment problem**. 
3) Increase $\beta$ and start again!

This is the typical **decoupled structure** of DA problems and it leads to the 
$\text{SoftAssign}$ algorithm detailed below. 

```{prf:algorithm} SoftAssign
:label: SoftAssign-alg

**Inputs** Adjacency matrices $\mathbf{X}\in\{0,1\}^{m\times m}$, $\mathbf{Y}\in\{0,1\}^{n\times n}$,  $\beta_0>0$\
**Outputs** $\hat{\mathbf{M}}^{\ast}$ (extended) maximal quadratic assignment.

1. Initialize $\beta\leftarrow \beta_0$,  $\hat{\mathbf{M}}_{ai}\leftarrow 1+\epsilon$
2. **while** $\beta < \beta_f$:  
    
    1. $t\leftarrow 0$

    2. convergence$\leftarrow$ False

    3. **while** $\neg$convergence: 

        1. $\hat{\mathbf{M}}^{old}_{ai}\leftarrow \hat{\mathbf{M}}_{ai}$

        2. $\mathbf{Q}_{ai}\leftarrow \sum_{b=1}^m\sum_{j=1}^n\mathbf{M}_{bj}\mathbf{X}_{ab}\mathbf{Y}_{ij}$
      
        3. $\mathbf{M}^0_{ai}\leftarrow \exp(\beta \mathbf{Q}_{ai})$
    
        4. Expand $\mathbf{M}^0\rightarrow \hat{\mathbf{M}}^0$

        5. $\hat{\mathbf{M}}^0\leftarrow\text{Sinkhorn}(\hat{\mathbf{M}}^0)$

        6. $t\leftarrow t + 1$

        7. convergence = $(t>t_{max})$ **or** $(\sum_{ai}|\hat{\mathbf{M}}_{ai}- \hat{\mathbf{M}}^{old}_{ai}|\le\epsilon' )$
   
    3. Increase $\beta$

3. **return** $\hat{\mathbf{M}}^{\ast}$ optimal doubly-stochastic matrix. 
```

Some notes:

1) If $m\neq n$ we always assume $m\le n$, i.e. $\mathbf{X}$ is the smallest graph.

2) Given the expanded $\hat{\mathbf{M}}$ in step $1$, we implicitly use its 
$\mathbf{M}$ $m\times n$ matrix in steps $3.2$ and $3.3$. 

2) In step $3.4$ we expand explicitly $\mathbf{M}^0$ by incorporating an additiona row and column 
of zeros. Note that $\text{Sinkhorn}$ **always needs** a $(m+1)\times (n+1)$ matrix if $m\neq n$!

**Complexity**. It is not difficult to see that the complexity of $\text{SoftAssign}$ is 
roughtly $O(n^3)$, or more precisely $O(|E|n)$, if $m=n$ and $|E|$ is the number of edges.
Such a complexity is due to step 3.2 (the computation of $\mathbf{Q}$ which involves the 
product of 3 matrices).

#### Cleanup
Remember that continuation methods are **solved in the continuum** but our **original 
is discrete**. Given the $\text{SoftAssign}$ solution $\hat{\mathbf{M}}^{\ast}$, which can 
be very entropic, we must recover the closest discrete matrix (which is not unique, in general). 
This task is performed by a <span style="color:#f88146">**cleanup heuristic** or algorithm</span> like the one detailed 
below:


```{prf:algorithm} Cleanup
:label: Cleanup-alg

**Inputs** $\hat{\mathbf{M}}\in[0,1]^{(m+1)\times (n+1)}$ optimal doubly-stochastic matrix\
**Outputs** Closest discrete matrix $\mathbf{M}\in\{0,1\}^{(m+1)\times (n+1)}$

1. yet_asigned $\leftarrow 0$
2. $\mathbf{M}[a,i]\leftarrow 0$
3. $\epsilon \leftarrow 10^{-6}$

4. **while** yet_asigned$< m$:  
    
    1. changes_done$\leftarrow$ False

    2. **for** $a\in\{1,2,\ldots,m+1\}$:

        1. candidates $\leftarrow \emptyset$

        2. max_row $\leftarrow \max\:\hat{\mathbf{M}}[a,:]$

        3. row_capacity $\leftarrow$ max_row $- \epsilon$

        4. **for** $i\in\{1,2,\ldots,n+1\}$:
        
            1. max_col $\leftarrow \max\;\hat{\mathbf{M}}[i,:]$

            2. col_capacity $\leftarrow$ max_col $- \epsilon$

            3. **if** $(\mathbf{M}[a,i]>$row_capacity$)$ **and** $(\mathbf{M}[a,i]>$col_capacity$)$:

                candidates $\leftarrow$ candidates $\cup \{i\}$ 
        
        5. **if** $|$candidates$|$>0: 

            1. max_val $\leftarrow$ $\max \mathbf{M}[a,$candidates$]$

            2. selection $\leftarrow$ $\arg\max$ $\mathbf{M}[a,$candidates$]$

            3. col $\leftarrow$ candidates[selection]

            4. yet_assigned $\leftarrow$ yet_assigned + 1

            5. $\mathbf{M}[a,$col$]\leftarrow$ 1 

            6. $\mathbf{M}[a,:]\leftarrow-\infty$

            7. $\mathbf{M}[:,col]\leftarrow-\infty$

            8. changes_done $\leftarrow$ True
    
    3. **if** $\neg$ changes_done:

          **break**

5. **return** $\mathbf{M}$ binary matrix. 
```

The algorithm, basically a <span style="color:#f88146">**greedy** version of the Hungarian algorithm</span>, proceeds as follows: 
1) <ins>Steps 2.1-2.4</ins>. For every row $a$, we build a list of *candidates* keeping track of both the row and col 
capacities. 
2) <ins>Step 2.5</ins>. If the list of candidates is not empty, we proceed to get the best of them (step 5.5) and then we proceed 
to exclude the corresponding row and column (steps 5.6 and 5.7). 

This takes $O(n^2)$ and could be replaced by a run of $\text{SoftLA}$. Remember that if the algorithm 
finds any match to a slack vertex these machings should be cleared. 

**Results**. Coming back to the Aspirin Isomorphism, we use the setting recommended 
in Gold and Rangarajan's paper: $\epsilon = 0.1$, $\epsilon'=0.5$, $\beta_0=0.5$,
$\beta_f=10.0$, $\beta_r=1.075$ and $I_0=4$, $I_1=30$, where the last two numbers 
are the maximal iterations of the internal **while** loop and $\text{Sinkhorn}$, 
respectively. 

The result in {numref}`DA-Aspirin` is nearly perfect and we can claim that Aspirin 
is isomorphic wrt to a permutation of itself!

```{figure} ./images/Topic2/DA-Aspirnin-removebg-preview.png
---
name: DA-Aspirin
width: 800px
align: center
height: 600px
---
Detecting the Aspirin's Isomorphism with DA. 
```

We leave to the practice lessons the extension of the algorithm for incorporating 
both <span style="color:#f88146">node and edge **attributtes**</span>.


## Appendix
### Independence
**Statistical independence**. It is well known that two random variables $A$ and $B$ are **independent** iff: 

$$
p(A,B) = p(A)\cdot p(B)\;.
$$

Or, in other words, applying the Bayes theorem we have that $A$ and $B$ are independent iff 

$$
p(A,B) = p(B|A)p(A) = p(A|B)p(B)\Rightarrow p(B|A)=p(B)\;\text{and}\;p(A|B)=p(B)\;.
$$

In other words: 
- Independence implies factorizing the joint distribution $p(A,B)$
- $A$ and $B$ are indendent if the knowledge of one of them does not influence the probability of the other, i.e. $p(A|B)=p(A)$ and $p(B|A)=p(B)$. 

**Marginalization**. Alternatively, we may define statistical independence by checking the marginal probabilities. Let $a_1,a_2,\ldots,a_m$ the discrete values of $A$ and $b_1,b_2,\ldots,b_n$ similarly for $B$. Then: 

$$
\begin{align}
&&\sum_{a_i}\sum_{b_j}p(A=a_i,B=b_j)=1\;\text{total probability}\\
&& p(A=a_i)=\sum_{b_j}p(A=a_i,B=b_j)\;\text{marginal wrt}\; A\\
&& p(B=b_i)=\sum_{a_i}p(A=a_i,B=b_j)\;\text{marginal wrt}\; B\\
\end{align}
$$

Then, $A$ and $B$ are independent if we can write a joint distribution as the product of two marginal distributions:

$$
p(A=a_i,B=b_j) = p(A=a_i)\cdot p(B=b_j)\;\forall a_i,b_j\;.
$$

Let us practice a bit this concept in the following exercise.

<br></br>
<span style="color:#d94f0b"> 
**Exercise**. Given the above table representing the joint distribution of two discrete random variables $A$ and $B$, determine where they are indepedent or not. **a)** Usign the marginals. **b)** Using the conditional probabities.
</span>
<span style="color:#d94f0b">
$
\begin{aligned}
\begin{array}{c|ccc}
         & B=b_1 & B=b_2 & B=b_3 \\
  \hline
  A=a_1  & 0.1   & 0.2   & 0.3\\
  A=a_2  & 0.1   & 0.2   & 0.1\\  
  \hline
\end{array}
\end{aligned}
$
</span>
<br></br>
<span style="color:#d94f0b">
**a)** First all, note that $\sum_{a_i,b_j}p(A=a_i,B=b_j)=1$. Now, let us compute the marginals summing the rows for $A$ and the columns for $B$:
</span>
<br></br>
<span style="color:#d94f0b">
$
\begin{align}
p(A=a_1) &= \sum_{b_j}p(A=a_1,B=b_j)=0.1 + 0.2 + 0.3 = \mathbf{0.6}\\
p(A=a_2) &= \sum_{b_j}p(A=a_2,B=b_j)=0.1 + 0.2 + 0.1 = \mathbf{0.4}\\
p(B=b_1) &= \sum_{a_i}p(A=a_i,B=b_1)=0.1 + 0.1 = \mathbf{0.2}\\
p(B=b_2) &= \sum_{a_i}p(A=a_i,B=b_2)=0.2 + 0.2 = \mathbf{0.4}\\
p(B=b_3) &= \sum_{a_i}p(A=a_i,B=b_3)=0.3 + 0.1 = \mathbf{0.4}\\
\end{align}
$
</span>
<br></br>
<span style="color:#d94f0b">
Now we check independence: 
</span>
<br></br>
<span style="color:#d94f0b">
$
\begin{align}
p(A=a_1,B=b_1) &=0.1\neq p(A=a_1)p(B=b_1) = \mathbf{0.6}\times\mathbf{0.2}=0.12\\
p(A=a_1,B=b_2) &=0.2\neq p(A=a_1)p(B=b_2) = \mathbf{0.6}\times\mathbf{0.4}=0.24\\
p(A=a_1,B=b_3) &=0.3\neq p(A=a_1)p(B=b_3) = \mathbf{0.6}\times\mathbf{0.4}=0.24\\
p(A=a_2,B=b_1) &=0.1\neq p(A=a_2)p(B=b_1) = \mathbf{0.4}\times\mathbf{0.2}=0.08\\
p(A=a_2,B=b_2) &=0.2\neq p(A=a_2)p(B=b_2) = \mathbf{0.4}\times\mathbf{0.4}=0.16\\
p(A=a_2,B=b_3) &=0.1\neq p(A=a_2)p(B=b_3) = \mathbf{0.4}\times\mathbf{0.4}=0.16\\
\end{align}
$
</span>
<br></br>
<span style="color:#d94f0b">
However, note that in all cases $p(A=a_i,B=b_j)\approx p(A=a_i)p(B=b_j)$. 
</span>
<br></br>
<span style="color:#d94f0b">
**b)** Now, we can compute the conditional probabilities using the Bayes theorem: $p(A=a_i|B=b_j)=p(A=a_i,B=b_j)/p(B=b_j)$ 
</span>
<br></br>
<span style="color:#d94f0b">
$
\begin{aligned}
\begin{array}{c|ccc}
                  & B=b_1 & B=b_2 & B=b_3 \\
  \hline
  p(A=a_1|B=b_j)  & \frac{0.1}{0.2}=0.5   & \frac{0.2}{0.4}=0.5  & \frac{0.3}{0.4}=0.75\\
  p(A=a_2|B=b_j)  & \frac{0.1}{0.2}=0.5   & \frac{0.2}{0.4}=0.5   & \frac{0.1}{0.4}=0.25\\  
  \hline
                & \sum = 1 & \sum = 1 & \sum=1\\
\end{array}
\end{aligned}
$
</span>
<br></br>
<span style="color:#d94f0b">
and $p(B=b_j|A=a_i)=p(A=a_i,B=b_j)/p(A=a_i)$ 
</span>
<br></br>
<span style="color:#d94f0b">
$
\begin{aligned}
\begin{array}{c|ccc|c}
                  & p(B=b_1|A=a_i) & p(B=b_2|A=a_i) & p(B=b_3|A=a_i) \\
  \hline
  A=a_1  & \frac{0.1}{0.6}=0.17   & \frac{0.2}{0.6}=0.33  & \frac{0.3}{0.6}=0.5 & \sum = 1\\
  A=a_2  & \frac{0.1}{0.4}=0.25   & \frac{0.2}{0.4}=0.5   & \frac{0.1}{0.4}=0.25 & \sum = 1\\  
  \hline
\end{array}
\end{aligned}
$
</span>
<br></br>
<span style="color:#d94f0b">
Note that the conditional probabilities adds $1$ wrt the "influenced" variable, i.e.
</span>
<br></br>
<span style="color:#d94f0b">
$
\forall a_i: \sum_{b_j} p(A=a_i|B=b_j)= 1\;\text{and}\; 
\forall b_j: \sum_{a_i} p(B=b_j|A=a_i)= 1
$
</span>
<br></br>
### Entropy 
This is the **first point** in the subject where we talk about <span style="color:#f88146">**Information Theory**</span>, the main auxiliary maths behind this course. In this regard, we refer you to the [author's book on the subject](https://link.springer.com/book/10.1007/978-1-84882-297-9).

#### Definition and properties
The above exercise reveals one esential property of DA: <span style="color:#f88146">we cannot converge without keeping the **maximizing the entropy**</span>. But, what is entropy? 

**Entropy** is the ultimate measure of uncertainty defined by [Shannon](https://people.math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf) as follows: 

Given a discrete random variable $X$, with possible values $\{x_1,\ldots,x_n\}$, its **entropy** is given by: 

$$
H(X) = -\sum_{i=1}^n p(X=x_i)\log p(X=x_i)\;.
$$

Similarly, we can express entropy in terms of the probabilities of the **discrete events** $p_i=p(X=x_i)$, i.e. 

$$
H(p_1,p_2,\ldots,p_n)=-\sum_{i=1}^n p_i\log p_i\;.
$$ 

Then, we have the following **properties**:
1) $H$ is continuous, concave and non-negative in $p_i$. 
2) If all the $p_i$ are equally likely, i.e. $p_i=1/n$, then entropy is **maximal** for this $n$ and equal to $\log n$. The larger $n$ the larger the **uncertainty** of the equally likely events. 
3) If we have $\log_2$ entropy is measured in **bits**, whereas if it is $\ln$ it is measured in  **nats**. 
4) Since $E(f(x))=\sum_x x\cdot f(x)$ is the **expectation** of $f(x)$, then entropy is the expectation of the logarithm. 

**Entropy of Bernouilli**. An interesting example is the measure of the entropy of $X\sim \text{Bernouilli}(p)$. 

$X$ has value $x=1$ with probability $p(X=1)=p$ and $x=0$ with probability $p(X=0)=q=1-p$.
Then: 

$$
\begin{align}
H(X) &= -\sum_{x}p(X=x)\log p(X=x)\\
     &= -p(X=0)\log p(X=0) - p(X=1)\log p(X=1)\\
     &= -p\log p - q\log q\\
     &= -p\log p - (1-p)\log (1-p)\;.
\end{align}
$$

Therefore, the maximal entropy (uncertainty) of a Bernouilli variable is when $p=q=1/2$ (for instance the probability of a fair coin): 

$$
H(X) = -p\log p -p\log p = -2p\log p = -\log\frac{1}{2} = -(\log 1 - \log 2) = \log 2\;.
$$

which is $\log_2 2=1$ bits as we can see in {numref}`KM-entropy-Bern`. Note that the function is concave wrt $p$. Minimal entropy is achieved at the two extreme points: $p=0$ and $p=1$, where the **uncertainty is minimal** (completely biased coin).

```{figure} ./images/Topic2/KM-entropy-Bern-removebg-preview.png
---
name: KM-entropy-Bern
width: 800px
align: center
height: 600px
---
Entropy of Bernouilli wrt $p$. 
```

#### Entropy and Coding 
The relationship between entropy and codelength is key to understand why <span style="color:#f88146">entropy measures information content</span>. 

The [Morse code](https://en.wikipedia.org/wiki/Morse_code) follows the principle of *assigning less number of bits to more frequent letters than to less frequent ones*. Herein, the bits are clear: $0$ for the dits "." and $1$ for the 
dash ".". For instance, following the frequencies of letters in English, the shorter "codewords" are those for $E=.$ and $T=-$. The longer ones are those for the digits $0-9$ (five bits all of them). The table below shows those lengths and probabilities. 


$$
\begin{aligned}
&\begin{array}{|c|c|c|c|}
\hline
\text{Letter} & \text{Morse code} & \text{codelength} & \text{Probability} \\
\hline
A & .- & 2 & 0.06381 \\
B & -... & 4 & 0.01166 \\
C & -.-. & 4 & 0.02173 \\
D & -.. & 3 & 0.03323 \\
E & . & 1 & 0.09924 \\
F & ..-. & 4 & 0.01741 \\
G & --. & 3 & 0.01574 \\
H & .... & 4 & 0.04761 \\
I & .. & 2 & 0.05442 \\
J & .--- & 4 & 0.00120 \\
K & -.- & 3 & 0.00603 \\
L & .-.. & 4 & 0.03145 \\
M & -- & 2 & 0.01880 \\
N & -. & 2 & 0.05273 \\
O & --- & 3 & 0.05865 \\
P & .--. & 4 & 0.01507 \\
Q & --.- & 4 & 0.00074 \\
R & .-. & 3 & 0.04677 \\
S & ... & 3 & 0.04943 \\
T & - & 1 & 0.07075 \\
U & ..- & 3 & 0.02155 \\
V & ...- & 4 & 0.00764 \\
W & .-- & 3 & 0.01844 \\
X & -..- & 4 & 0.00117 \\
Y & -.-- & 4 & 0.01542 \\
Z & --.. & 4 & 0.00058 \\
0 & ----- & 5 & 0.01563 \\
1 & .---- & 5 & 0.05469 \\
2 & ..--- & 5 & 0.02344 \\
3 & ...-- & 5 & 0.02344 \\
4 & ....- & 5 & 0.02344 \\
5 & ..... & 5 & 0.01563 \\
6 & -.... & 5 & 0.01563 \\
7 & --... & 5 & 0.01563 \\
8 & ---.. & 5 & 0.01563 \\
9 & ----. & 5 & 0.01563 \\
\hline
\end{array}
\end{aligned}
$$

Note that, due to traditional reasons it is not always happening that shorter codes correspond to highest frequencies (probabilities) but this is the global trend.  

Of course, each message should have a space ' ' between letters in order to be decoded properly. For instance: ".... . .-.. .-.. ---   .-- --- .-. .-.. -.." means H E L L O W O R L D in Morse.


[//]: https://courses.grainger.illinois.edu/cs573/fa2012/lec/lec/25_entropy.pdf

**Codes and Trees**. One interesting way of characterizing codes (binary codes in particular) is to try to put the symbols to code in a tree. See for instance {numref}`Morse-tree`, where the root is marked as "\#'. We point out the following: 
- <span style="color:#f88146">**Decoding means**</span> navigating through the tree top-to-bottom, making decisions (dit or slash for binary codes), i.e. building a "path", until the desired symbol is reached. Such a path is the "codeword". 
- <span style="color:#f88146">**The codeword length**</span> $l_i$ is the number of decision taken to decode it, i.e. the depth of the symbol in the tree. 
- <span style="color:#f88146">**A code is comma-like**</span> if all the symbols are placed at the leaves of the tree. This is clearly not the case of Morse. 
- <span style="color:#f88146">**A code is prefix-like**</span> if any symbol is preceeded by a unique sequence of other codes (the prefix). For instace, to decode "O" in Morse, we should decode "T" and "M". Thus, the prefix of "O" is "TM". 


In the Morse code few letters are placed at the leaves of tree. Notably, they are the ones with the largest codeword lengths. Herein, we follow the recipe that assigning less frequent letters to deeper levels of the tree. The purpose of such strategy is twofold: (a) saving storage space, and (b) fast decoding. 

**Expected codelength**. Suppose that we have $n$ letters, each one with frequency $p_i$ and codelenth $l_i$, $i=1,\ldots,n$. Then, the expected codelength is 

$$
E(l) = \sum_{i}l_i p_i = l_1p_1 + l_2p_2 + \ldots + l_np_n\;.
$$

For the Morse code, $E_{Morse}(l)=3.0794539323542267$. Now, the tree in {numref}`Morse-tree` is built under the constraint of placing more probable letters in the top of the (binary) tree. In other words, we may assume that 

$$
p_i \approx \frac{1}{2^{l_i}}\;. 
$$

If so, $E(l)\approx H(p_1,p_2,\ldots,p_n)$ since

$$
\log_2 p_i \approx \log_2 \frac{1}{2^{l_i}} = \log_2 1 - \log_2 2^{l_i} = - \log_2 2^{l_i} = - l_i\;.
$$

Then

$$
-p_i\log_2 p_i \approx \frac{1}{2^{l_i}}l_i = \frac{l_i}{2^{l_i}}\;.
$$

Roughly speaking, <span style="color:#f88146">**the expected codelegth matches entropy**</span>! 

More precisely, for comma-like codes we have $E(l)\le H(p_1,p_2,\ldots,p_n)$. Although Morse is not comma-like it satisfies this bound since $E_{Morse}(l) = 3.0794539323542267\le H_{Morse} = 4.713102340698242$. 

```{figure} ./images/Topic2/Morse-Photoroom.png
---
name: Morse-tree
width: 800px
align: center
height: 600px
---
Decoding tree for the Morse code. 
```

[//]: https://stackoverflow.com/questions/21853101/why-huffman-coding-is-good

**Huffman Codes**. As an alternative code, herein we highlight a <span style="color:#f88146">**prefix code**</span> (no code is the prefix of another). Prefix codes are easily decodable and, in addition, they allow us (by definition) to suppress the space " " between words (see the Table below). 

$$
\begin{aligned}
\begin{array}{|c|c|c|c|}
\hline
\text{Letter} & \text{Huffman Code} & \text{Length} & \text{Probability} \\
\hline
A & 1010 & 4 & 0.06381 \\
B & 001110 & 6 & 0.01166 \\
C & 111001 & 6 & 0.02173 \\
D & 10111 & 5 & 0.03323 \\
E & 000 & 3 & 0.09924 \\
F & 110101 & 6 & 0.01741 \\
G & 100111 & 6 & 0.01574 \\
H & 11111 & 5 & 0.04761 \\
I & 0101 & 4 & 0.05442 \\
J & 1101001011 & 10 & 0.00120 \\
K & 11010011 & 8 & 0.00603 \\
L & 10110 & 5 & 0.03145 \\
M & 110111 & 6 & 0.01880 \\
N & 0100 & 4 & 0.05273 \\
O & 0111 & 4 & 0.05865 \\
P & 001111 & 6 & 0.01507 \\
Q & 1101001001 & 10 & 0.00074 \\
R & 11101 & 5 & 0.04677 \\
S & 0010 & 4 & 0.04943 \\
T & 1100 & 4 & 0.07075 \\
U & 111000 & 6 & 0.02155 \\
V & 1101000 & 7 & 0.00764 \\
W & 110110 & 6 & 0.01844 \\
X & 1101001010 & 10 & 0.00117 \\
Y & 100000 & 6 & 0.01542 \\
Z & 1101001000 & 10 & 0.00058 \\
0 & 100001 & 6 & 0.01563 \\
1 & 0110 & 4 & 0.05469 \\
2 & 111100 & 6 & 0.02344 \\
3 & 111101 & 6 & 0.02344 \\
4 & 00110 & 5 & 0.02344 \\
5 & 100011 & 6 & 0.01563 \\
6 & 100010 & 6 & 0.01563 \\
7 & 100110 & 6 & 0.01563 \\
8 & 100100 & 6 & 0.01563 \\
9 & 100101 & 6 & 0.01563 \\
\hline
\end{array}
\end{aligned}
$$

For instance, the Huffman code for HELLOW WORLD is 

$$
1111100010110101100111\;1101100111111011011010111
$$

Concerning $E_{Huffman}(l)$ and $H_{Huffman}=$, they are identical: $4.7$. See the encoding tree in {numref}`Huffman-tree`

```{figure} ./images/Topic2/Huffman-Photoroom.png
---
name: Huffman-tree
width: 800px
align: center
height: 600px
---
Decoding tree for the Huffman code of the Morse frequencies. 
```

Finally, this code is not only useful for English but for any laguage such as Chinese. 


#### Joint vs Conditional entropy
Entropy is defined for any (discrete) probability function, and in particular for the **joint probability**. Thus, if $A$ and $B$ are two discrete variables with respective values $a_1,\ldots, a_m$ and $b_1,\ldots,b_n$ and does exists the probability $p(A,B)$, the <span style="color:#f88146">**joint entropy**</span> $H(A,B)$ is defined as follows: 

$$
H(A,B) = -\sum_{a_i}\sum_{b_j}p(A=a_i,B=b_j)\log p(A=a_i,B=b_j)\;.
$$

Remember that the Bayes theorem leads to $p(A,B)=p(A|B)p(B)=p(B|A)p(A)$. 
As a result, we may define entropies for $p(A|B)$ and $p(B|A)$, i.e. <span style="color:#f88146">**conditional entropies**</span>. 

Remember that *entropy is an expectation*, then  

$$
H(A|B=b_j) = -\sum_{a_i}p(A=a_i|B=b_j)\log p(A=a_i|B=b_j)\;,
$$

is the <span style="color:#f88146">**entropy of the conditional expectation** of $A$ wrt $B=b_j$</span>. Now, if we average this quantity wrt all values  $b_j$, we have that the *conditional entropy* is given by following weigthed sum: 

$$
H(A|B) = \sum_{b_j}p(B=b_j)\cdot H(A|B=b_j)\;,
$$

and similarly for $H(B|A)$. Actually we have several **properties** linking conditional entropies and joint entropies:
1) **Zero-value**: $H(A|B)=0$ if the value of $A$ is *completely determined* by $B$. In other words, $H(A|B)$ measures how much uncertainty adds $B$ to $A$. Then, if $p(A|B)=1$ it is expected that knowing $B$ is the same as knowing $A$ itself.   
2) **Independence**: $H(A|B)=H(A)$ and $H(B|A)=H(A)$ iff $A$ and $B$ are *independent*, as it happens with $p(A|B)=p(A)$ and $p(B|A)=p(A)$.  
3) **Chain rule**: the conditional entropies can be derived from the joint entropy $H(A,B)$ by subtracting the entropy of the conditioning variable: 

$$
H(A|B) = H(A,B)-H(B)\;\;\text{and}\;\; H(B|A) = H(A,B)-H(A)\;.
$$
<br></br>
<span style="color:#d94f0b"> 
**Exercise**. Given the discrete random variales $A$ and $B$, in the previous exercise: **a)** determine the joint entropy and **b)** the conditional entropies. **c)** Interpret what is the uncertainty added or removed by each variable wrt to the other and the uncertainty removed from the joint entropy. **d)** Are $A$ and $B$ independent attending to their entropies? 
</span>
<br></br>
<span style="color:#d94f0b">
**a)** First all, for the joint entropy we need the joint distribution:
</span>
<br></br>
<span style="color:#d94f0b">
$
\begin{aligned}
\begin{array}{c|ccc}
         & B=b_1 & B=b_2 & B=b_3 \\
  \hline
  A=a_1  & 0.1   & 0.2   & 0.3\\
  A=a_2  & 0.1   & 0.2   & 0.1\\  
  \hline
\end{array}
\end{aligned}
$
</span>
<br></br>
<span style="color:#d94f0b">
$
\begin{align}
H(A,B) &= -\sum_{a_i}\sum_{b_j}p(A=a_i,B=b_j)\log p(A=a_i,B=b_j)\\
       &= -[3(0.1\log 0.1) + 2(0.2\log 0.2) + 0.3\log 0.3]\\
       &= \mathbf{1.695}\;\text{nats}\;.
\end{align}
$
</span>
<br></br>
<span style="color:#d94f0b">
Note that the maximum entropy (in this case with $m\cdot n=6$ events) is given when 
all events are equiprobable, i.e. $p(A=a_i,B=b_j)=\frac{1}{m\cdot n}=\frac{1}{2\cdot 3}=\frac{1}{6}$, and it is $H_{max}(A,B)=\log n = \log 6$
</span>
<br></br>
<span style="color:#d94f0b">
$
\begin{align}
H_{max}(A,B) &= -\sum_{a_i}\sum_{b_j}\frac{1}{m\cdot n}\log \frac{1}{m\cdot n}\\
       &= -6\cdot \frac{1}{6}\log\frac{1}{6}\\
       &= -\log\frac{1}{6}\\
       &= -(\log 1 - \log 6)\\
       &= \log 6 = 1.791\;\text{nats}\;.
\end{align}
$
</span>
<br></br>
<span style="color:#d94f0b">
Then $H(A,B)=1.695$ is pretty close to $H_{max}(A,B)=1.791$. 
</span>
<br></br>
<span style="color:#d94f0b">
**b1)** For $H(A|B)$ we need the conditional distribution $p(A|B)$:
</span>
<br></br>
<span style="color:#d94f0b">
$
\begin{aligned}
\begin{array}{c|ccc}
                  & B=b_1 & B=b_2 & B=b_3 \\
  \hline
  p(A=a_1|B=b_j)  & \frac{0.1}{0.2}=0.5   & \frac{0.2}{0.4}=0.5  & \frac{0.3}{0.4}=0.75\\
  p(A=a_2|B=b_j)  & \frac{0.1}{0.2}=0.5   & \frac{0.2}{0.4}=0.5   & \frac{0.1}{0.4}=0.25\\  
  \hline
                & \sum = 1 & \sum = 1 & \sum=1\\
\end{array}
\end{aligned}
$
</span>
<br></br>
<span style="color:#d94f0b">
Then, attending to the definition $H(A|B)=\sum_{b_j}p(B=b_j)H(A|B=b_j)$, we have: 
</span>
<span style="color:#d94f0b">
$
\begin{align}
H(A|B=b_1) &=  -\sum_{a_i}p(A=a_i|B=b_1)\log p(A=a_i|B=b_1)\\
           &=  -(0.5\log 0.5 + 0.5\log 0.5) = 0.693\\
H(A|B=b_2) &=  -\sum_{a_i}p(A=a_i|B=b_2)\log p(A=a_i|B=b_2)\\
           &=  -(0.5\log 0.5 + 0.5\log 0.5) = 0.693\\
H(A|B=b_3) &=  -\sum_{a_i}p(A=a_i|B=b_3)\log p(A=a_i|B=b_3)\\
           &=  -(0.75\log 0.75 + 0.25\log 0.25) = 0.562\\
\end{align}
$
</span>
<br></br>
<span style="color:#d94f0b">
Then, $H(A|B)$ is given by a weighted average, where we need the marginals $p(B=b_j)=[0.2\;0.4\;0.4]$ (see the corresponding exercise): 
</span>
<br></br>
<span style="color:#d94f0b">
$
\begin{align}
H(A|B) &= \sum_{b_j}p(B=b_j)H(A|B=b_j)\\
       &= p(B=b_1)H(A|B=b_1) + p(B=b_2)H(A|B=b_2) + p(B=b_3)H(A|B=b_3)\\
       &= 0.2\cdot H(A|B=b_1) + 0.4\cdot H(A|B=b_2) + 0.4\cdot H(A|B=b_3)\\
       &= 0.2\cdot 0.693 + 0.4\cdot 0.693 + 0.4\cdot 0.562\\
       &= \mathbf{0.640}\;.
\end{align}
$
</span>
<br></br>
<span style="color:#d94f0b">
**b2)** For $H(B|A)$ we need the conditional distribution $p(B|A)$:
</span>
<br></br>
<span style="color:#d94f0b">
$
\begin{aligned}
\begin{array}{c|ccc|c}
                  & p(B=b_1|A=a_i) & p(B=b_2|A=a_i) & p(B=b_3|A=a_i) \\
  \hline
  A=a_1  & \frac{0.1}{0.6}=0.17   & \frac{0.2}{0.6}=0.33  & \frac{0.3}{0.6}=0.5 & \sum = 1\\
  A=a_2  & \frac{0.1}{0.4}=0.25   & \frac{0.2}{0.4}=0.5   & \frac{0.1}{0.4}=0.25 & \sum = 1\\  
  \hline
\end{array}
\end{aligned}
$
</span>
<br></br>
<span style="color:#d94f0b">
Then, attending to the definition $H(B|A)=\sum_{a_i}p(A=a_i)H(B|A=a_i)$, we have: 
</span>
<span style="color:#d94f0b">
$
\begin{align}
H(B|A=a_1) &=  -\sum_{b_j}p(B=b_j|A=a_1)\log p(B=b_j|A=a_1)\\
           &=  -(0.17\log 0.17 + 0.33\log 0.33 + 0.5\log 0.5) = 1.013\\
H(B|A=a_2) &=  -\sum_{b_j}p(B=b_j|A=a_2)\log p(B=b_j|A=a_2)\\
           &=  -(0.25\log 0.25 + 0.5\log 0.5 + 0.25\log 0.25) = 1.039\\
\end{align}
$
</span>
<br></br>
<span style="color:#d94f0b">
Then, $H(B|A)$ is given by a weighted average, where we need the marginals $p(A=a_i)=[0.6\;0.4]$ (see the corresponding exercise): 
</span>
<br></br>
<span style="color:#d94f0b">
$
\begin{align}
H(B|A) &= \sum_{a_i}p(A=a_i)H(B|B=a_i)\\
       &= p(A=a_1)H(B|A=a_1) + p(A=a_2)H(B|A=a_2)\\
       &= 0.6\cdot H(B|A=a_1) + 0.4\cdot H(B|A=a_2)\\
       &= 0.6\cdot 1.013 + 0.4\cdot 1.039 \\
       &= \mathbf{1.023}\;.
\end{align}
$
</span>
<br></br>
<span style="color:#d94f0b">
**c)** As a result $H(B|A)=\mathbf{1.023}>H(A|B)=\mathbf{0.640}$. This means that **$B$ holds more information about $A$ than $A$ holds about $B$**.
<br></br>
Finally, let us compute the individual entropies $H(A)$ and $H(B)$:
</span>
<br></br>
<span style="color:#d94f0b">
$
\begin{align}
H(A) & = -\sum_{a_i}p(A=a_i)\log p(A=a_i)\\
       & = -(0.6\log 0.6 + 0.4\log 0.4)\\
       & = \mathbf{0.673}\;.\\
H(B) & = -\sum_{b_j}p(B=b_j)\log p(B=b_j)\\
       & = -(0.2\log 0.2 + 0.4\log 0.4 + 0.4\log 0.4)\\
       & = \mathbf{1.054}\;.\\
\end{align}
$
</span>
<br></br>
<span style="color:#d94f0b">
Since $H(A|B)=H(A,B)-H(B)$ and $H(B|A)=H(A,B)-H(A)$ and $H(B)>H(A)$ ($B$ is more uncertain than $A$), then $H(B)$ reduces more the uncertainty of $H(A,B)$ than $H(A)$. This explains why $H(A|B)\ll H(B|A)$. 
<br><br>
Let us check the calculus: 
</span>
<span style="color:#d94f0b">
$
H(A|B) = \mathbf{0.640} = 1.695 - 1.054\;\;\text{and}\;\; H(B|A) = \mathbf{1.023} = 1.695 - 0.673\;.
$
</span>
<br></br>
<span style="color:#d94f0b">
**d)** Since $H(A|B)=0.640\approx H(A)= 0.673$ and $H(B|A)=1.023\approx H(B)=1.054$, we conclude that $A$ and $B$ are *close to be independent, although they are not*. This is consistent with the previous exercise on marginals. 
<br></br>
<span style="color:#d94f0b">
Then, in this exercise we have illustrated how to compute the conditional entropies either directly or through the individual entropies and the joint entropy. However, the key of this exercixse is to **interpret correctly what a conditional entropy is**: *a quantification of how much information has one variable about the other*.
</span>






