Entropy

Basic Ideas

"Entropy is the minimum descriptive complexity of a random variable"
"Mutual information is the communication rate in the presence of noise"

In communication there is a minimum achievable data compression rate (the entropy) and a maximum achievable transmission rate (the channel capacity).

Kolmogorov Complexity: the idea that the complexity of a string of data can be defined by the length of the shortest binary computer program for computing the string.

Entropy

Entropy is a measure of the uncertainty of a random variable.
Let X be a discrete random variable with alphabet (set of all possible outcomes) $\mathcal{X}$ and probability mass function $p(x) = \Pr\{X = x\}$.

Entropy H(X) is defined by

$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x)$$

The entropy H(X) is the theoretical lower bound, in bits, on how efficiently you can compress the outcomes of a random variable X assuming you’re coding them in binary and want lossless reconstruction.

Entropy of fair coin toss:

$$\mathcal{X} = \{\text{heads}, \text{tails}\}$$
$$H(X) = -p(\text{heads})\log_2(p(\text{heads})) - p(\text{tails})\log_2(p(\text{tails})) = -\tfrac{1}{2}(-1) - \tfrac{1}{2}(-1) = 1$$

Interpretation: A fair coin toss carries 1 bit of information, meaning it’s maximally uncertain—you gain 1 full bit of information every time you observe the outcome.
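
A minimal Python sketch of this calculation, assuming base-2 logs so the result is in bits (the helper name `entropy_bits` and the biased-coin probabilities are my own, not from the text):

```python
from math import log2

def entropy_bits(pmf):
    """H(X) = -sum_x p(x) log2 p(x); outcomes with p(x) == 0 contribute nothing."""
    return -sum(p * log2(p) for p in pmf.values() if p > 0)

fair_coin   = {"heads": 0.5, "tails": 0.5}
biased_coin = {"heads": 0.9, "tails": 0.1}

print(entropy_bits(fair_coin))    # 1.0 bit: maximally uncertain
print(entropy_bits(biased_coin))  # ~0.47 bits: more predictable
```

The biased coin is more predictable, so each toss carries less than one bit and its outcomes compress better.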

For other logarithm bases $b$ we denote the entropy by $H_b(X)$.

Entropy as Expectation

If $X \sim p(x)$, the expected value of the random variable $g(X)$ is:

$$E_p\, g(X) = \sum_{x \in \mathcal{X}} g(x)\, p(x)$$

Using $g(X) = \log \frac{1}{p(X)}$ we can interpret the entropy of $X$ as an expectation.

$$H(X) = E_p \log \frac{1}{p(X)}$$
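
Because $H(X)$ is an expectation under $p$, it can also be approximated by a Monte Carlo average of $\log_2 \frac{1}{p(x)}$ over samples drawn from $p$. A small sketch (the pmf and sample size are arbitrary choices of mine):

```python
import random
from math import log2

pmf = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}  # arbitrary example distribution

# Exact entropy from the definition: H(X) = sum_x p(x) log2(1/p(x)) = 1.75 bits.
exact = sum(p * log2(1 / p) for p in pmf.values())

# Monte Carlo estimate: draw x ~ p(x) and average log2(1/p(x)).
random.seed(0)
outcomes, weights = zip(*pmf.items())
samples = random.choices(outcomes, weights=weights, k=100_000)
estimate = sum(log2(1 / pmf[x]) for x in samples) / len(samples)

print(exact, estimate)  # the estimate should be close to 1.75
```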

Immediate Properties

  1. $0 \le p(x) \le 1 \implies \log \frac{1}{p(x)} \ge 0$, hence $H(X) \ge 0$
  2. $\log_b p = (\log_b a)(\log_a p) \implies H_b(X) = (\log_b a)\, H_a(X)$ (both properties are checked numerically in the sketch below)
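
A minimal Python sketch of both properties, using an arbitrary three-outcome pmf of my own and comparing bits (base 2) against nats (base $e$), where $H_e(X) = (\ln 2)\, H_2(X)$:

```python
from math import log, log2, isclose

pmf = [0.7, 0.2, 0.1]  # arbitrary example distribution

h_bits = -sum(p * log2(p) for p in pmf)  # H_2(X), entropy in bits
h_nats = -sum(p * log(p) for p in pmf)   # H_e(X), entropy in nats

assert h_bits >= 0                       # property 1: entropy is non-negative
assert isclose(h_nats, log(2) * h_bits)  # property 2: H_e(X) = (ln 2) * H_2(X)
```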

Joint Entropy and Conditional Entropy

Joint Entropy

Let X,Y be a pair of discrete random variables with joint distribution p(x,y). Their joint entropy is:

$$H(X,Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log p(x,y)$$
$$H(X,Y) = -E \log p(X,Y)$$
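
A short sketch evaluating this sum over a small joint pmf table (the table values are made-up numbers for illustration):

```python
from math import log2

# Joint pmf p(x, y) for two binary variables; the numbers are made up for illustration.
p_xy = {
    (0, 0): 0.25, (0, 1): 0.25,
    (1, 0): 0.40, (1, 1): 0.10,
}

# H(X, Y) = -sum_{x, y} p(x, y) log2 p(x, y)
h_xy = -sum(p * log2(p) for p in p_xy.values() if p > 0)
print(h_xy)  # ~1.86 bits
```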

Conditional Entropy

If $(X,Y) \sim p(x,y)$, the conditional entropy is:

$$\begin{aligned}
H(Y|X) &= \sum_{x \in \mathcal{X}} p(x)\, H(Y|X=x) \\
&= -\sum_{x \in \mathcal{X}} p(x) \sum_{y \in \mathcal{Y}} p(y|x) \log p(y|x) \\
&= -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log p(y|x) \\
&= -E \log p(Y|X)
\end{aligned}$$
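
The equivalent forms above can be checked on the same kind of toy table; this sketch computes $H(Y|X)$ both as $\sum_x p(x)\, H(Y|X=x)$ and as $-\sum_{x,y} p(x,y) \log p(y|x)$ (the joint pmf values are again made up):

```python
from math import log2, isclose

# Toy joint pmf; the numbers are made up for illustration.
p_xy = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.40, (1, 1): 0.10}
p_x = {0: 0.50, 1: 0.50}  # marginal p(x) = sum_y p(x, y)

def h(probs):
    """Entropy in bits of a collection of probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Form 1: H(Y|X) = sum_x p(x) * H(Y | X = x)
form1 = sum(p_x[x] * h([p_xy[(x, y)] / p_x[x] for y in (0, 1)]) for x in (0, 1))

# Form 2: H(Y|X) = -sum_{x, y} p(x, y) log2 p(y|x), with p(y|x) = p(x, y) / p(x)
form2 = -sum(p * log2(p / p_x[x]) for (x, y), p in p_xy.items() if p > 0)

assert isclose(form1, form2)
print(form1)  # ~0.86 bits
```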

Information theoretic measure of correlation:

$$\rho = 1 - \frac{H(X|Y)}{H(Y)}$$

Chain Rule

$$H(X,Y) = H(X) + H(Y|X)$$

Proof:

$$\begin{aligned}
H(X,Y) &= -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log p(x,y) \\
&= -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log \big(p(x)\, p(y|x)\big) \\
&= -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log p(x) - \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log p(y|x) \\
&= -\sum_{x \in \mathcal{X}} p(x) \log p(x) - \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log p(y|x) \\
&= H(X) + H(Y|X).
\end{aligned}$$

Equivalently, $\log p(X,Y) = \log p(X) + \log p(Y|X)$,
and $H(X,Y|Z) = H(X|Z) + H(Y|X,Z)$.

Note that $H(Y|X) \neq H(X|Y)$ in general; however, $H(X) - H(X|Y) = H(Y) - H(Y|X)$.
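
A quick numerical check of the chain rule and of the identity $H(X) - H(X|Y) = H(Y) - H(Y|X)$, using a made-up joint pmf:

```python
from math import log2, isclose

p_xy = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.40, (1, 1): 0.10}  # made-up joint pmf

def h(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

p_x = {x: sum(p for (xx, _), p in p_xy.items() if xx == x) for x in (0, 1)}
p_y = {y: sum(p for (_, yy), p in p_xy.items() if yy == y) for y in (0, 1)}

h_xy = h(p_xy.values())
h_x, h_y = h(p_x.values()), h(p_y.values())
h_y_given_x = -sum(p * log2(p / p_x[x]) for (x, y), p in p_xy.items() if p > 0)
h_x_given_y = -sum(p * log2(p / p_y[y]) for (x, y), p in p_xy.items() if p > 0)

assert isclose(h_xy, h_x + h_y_given_x)               # chain rule
assert isclose(h_x - h_x_given_y, h_y - h_y_given_x)  # both sides equal I(X;Y)
print(h_y_given_x, h_x_given_y)                       # not equal in general
```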

Relative Entropy

The relative entropy, also called the Kullback-Leibler distance, between two probability mass functions p(x) and q(x) is:

$$D(p\,\|\,q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)} = E_p \log \frac{p(X)}{q(X)}$$

However, it is not symmetric and does not satisfy the triangle inequality, so it is not a true distance (metric).
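
A minimal sketch of $D(p\,\|\,q)$ in bits that also shows the asymmetry (the two distributions and the name `kl_bits` are my own choices):

```python
from math import log2

def kl_bits(p, q):
    """D(p || q) = sum_x p(x) log2(p(x)/q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * log2(px / q[x]) for x, px in p.items() if px > 0)

p = {"a": 0.5, "b": 0.5}
q = {"a": 0.9, "b": 0.1}

print(kl_bits(p, q))  # ~0.74 bits
print(kl_bits(q, p))  # ~0.53 bits: D(p||q) != D(q||p)
```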

Mutual Information

Let $(X,Y) \sim p(x,y)$.
The mutual information is the relative entropy between the joint distribution and the product distribution p(x)p(y):

$$I(X;Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)} = D\big(p(x,y)\,\|\,p(x)\,p(y)\big) = E_{p(x,y)} \log \frac{p(X,Y)}{p(X)\,p(Y)}$$

Note that $D(p\,\|\,q) \neq D(q\,\|\,p)$ in general.

$$I(X;Y) = E_{p(x,y)}\!\left[\log \frac{p(X|Y)}{p(X)}\right] = E_{p(x,y)}\!\left[\log \frac{p(Y|X)}{p(Y)}\right]$$
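
This sketch computes $I(X;Y)$ from a made-up joint pmf and checks that it equals $H(X) - H(X|Y)$:

```python
from math import log2, isclose

p_xy = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.40, (1, 1): 0.10}  # made-up joint pmf
p_x = {x: sum(p for (xx, _), p in p_xy.items() if xx == x) for x in (0, 1)}
p_y = {y: sum(p for (_, yy), p in p_xy.items() if yy == y) for y in (0, 1)}

# I(X;Y) = sum_{x, y} p(x, y) log2[ p(x, y) / (p(x) p(y)) ]
i_xy = sum(p * log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items() if p > 0)

# Equivalent form: I(X;Y) = H(X) - H(X|Y)
h_x = -sum(p * log2(p) for p in p_x.values() if p > 0)
h_x_given_y = -sum(p * log2(p / p_y[y]) for (x, y), p in p_xy.items() if p > 0)

assert isclose(i_xy, h_x - h_x_given_y)
print(i_xy)  # ~0.07 bits: weak dependence in this example
```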

For jointly Gaussian variables X and Y with Pearson Correlation ρ, the mutual information is:

$$I(X;Y) = -\frac{1}{2}\log(1 - \rho^2)$$
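
A worked example, assuming base-2 logs so the answer is in bits: for $\rho = 0.5$, $I(X;Y) = -\tfrac{1}{2}\log_2(1 - 0.25) \approx 0.21$ bits. A tiny sketch evaluating the formula for a few values of $\rho$:

```python
from math import log2

def gaussian_mi_bits(rho):
    """I(X;Y) = -1/2 * log2(1 - rho^2), in bits, for jointly Gaussian X, Y."""
    return -0.5 * log2(1 - rho ** 2)

for rho in (0.0, 0.5, 0.9, 0.99):
    print(rho, gaussian_mi_bits(rho))
# 0.0 -> 0 bits (independence), 0.5 -> ~0.21, 0.9 -> ~1.20, 0.99 -> ~2.83
# The mutual information grows without bound as |rho| -> 1.
```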