Entropy

Basic Ideas

"Entropy is the minimum descriptive complexity of a random variable"
"Mutual information is the communication rate in the presence of noise"

In communication there is a minimum achievable data compression rate (the entropy) and a maximum achievable transmission rate (the channel capacity).

Kolmogorov Complexity: the idea that the complexity of a string of data can be defined by the length of the shortest binary computer program for computing the string.

Entropy

Entropy is a measure of the uncertainty of a random variable.
Let X be a discrete random variable with alphabet (set of all possible outcomes) $\mathcal{X}$ and probability mass function $p(x) = \Pr\{X = x\}$.

Entropy H(X) is defined by

$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x)$$

The entropy H(X) is the theoretical lower bound, in bits, on how efficiently you can compress the outcomes of a random variable X assuming you’re coding them in binary and want lossless reconstruction.

Entropy of fair coin toss:

$$\mathcal{X} = \{\text{heads}, \text{tails}\}$$
$$H(X) = -p(\text{heads})\log_2(p(\text{heads})) - p(\text{tails})\log_2(p(\text{tails})) = -\tfrac{1}{2}(-1) - \tfrac{1}{2}(-1) = 1$$

Interpretation: A fair coin toss carries 1 bit of information, meaning it’s maximally uncertain—you gain 1 full bit of information every time you observe the outcome.
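
A minimal Python sketch of this calculation, assuming base-2 logs so the result is in bits (the helper name `entropy_bits` and the biased-coin probabilities are my own, not from the text):

```python
from math import log2

def entropy_bits(pmf):
    """H(X) = -sum_x p(x) log2 p(x); outcomes with p(x) == 0 contribute nothing."""
    return -sum(p * log2(p) for p in pmf.values() if p > 0)

fair_coin   = {"heads": 0.5, "tails": 0.5}
biased_coin = {"heads": 0.9, "tails": 0.1}

print(entropy_bits(fair_coin))    # 1.0 bit: maximally uncertain
print(entropy_bits(biased_coin))  # ~0.47 bits: more predictable
```

The biased coin is more predictable, so each toss carries less than one bit and its outcomes compress better.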

For other logarithm bases $b$ we denote the entropy by $H_b(X)$.

Entropy as Expectation

If $X \sim p(x)$, the expected value of the random variable $g(X)$ is:

$$E_p\, g(X) = \sum_{x \in \mathcal{X}} g(x)\, p(x)$$

Using $g(X) = \log \frac{1}{p(X)}$ we can interpret the entropy of $X$ as an expectation.

$$H(X) = E_p \log \frac{1}{p(X)}$$
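
Because $H(X)$ is an expectation under $p$, it can also be approximated by a Monte Carlo average of $\log_2 \frac{1}{p(x)}$ over samples drawn from $p$. A small sketch (the pmf and sample size are arbitrary choices of mine):

```python
import random
from math import log2

pmf = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}  # arbitrary example distribution

# Exact entropy from the definition: H(X) = sum_x p(x) log2(1/p(x)) = 1.75 bits.
exact = sum(p * log2(1 / p) for p in pmf.values())

# Monte Carlo estimate: draw x ~ p(x) and average log2(1/p(x)).
random.seed(0)
outcomes, weights = zip(*pmf.items())
samples = random.choices(outcomes, weights=weights, k=100_000)
estimate = sum(log2(1 / pmf[x]) for x in samples) / len(samples)

print(exact, estimate)  # the estimate should be close to 1.75
```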

Immediate Properties

  1. $0 \le p(x) \le 1 \implies \log \frac{1}{p(x)} \ge 0$, hence $H(X) \ge 0$
  2. $\log_b p = (\log_b a)(\log_a p) \implies H_b(X) = (\log_b a)\, H_a(X)$ (both properties are checked numerically in the sketch below)
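
A minimal Python sketch of both properties, using an arbitrary three-outcome pmf of my own and comparing bits (base 2) against nats (base $e$), where $H_e(X) = (\ln 2)\, H_2(X)$:

```python
from math import log, log2, isclose

pmf = [0.7, 0.2, 0.1]  # arbitrary example distribution

h_bits = -sum(p * log2(p) for p in pmf)  # H_2(X), entropy in bits
h_nats = -sum(p * log(p) for p in pmf)   # H_e(X), entropy in nats

assert h_bits >= 0                       # property 1: entropy is non-negative
assert isclose(h_nats, log(2) * h_bits)  # property 2: H_e(X) = (ln 2) * H_2(X)
```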

Joint Entropy and Conditional Entropy

Joint Entropy

Let X,Y be a pair of discrete random variables with joint distribution p(x,y). Their joint entropy is:

$$H(X,Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log p(x,y)$$
$$H(X,Y) = -E \log p(X,Y)$$
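
A short sketch evaluating this sum over a small joint pmf table (the table values are made-up numbers for illustration):

```python
from math import log2

# Joint pmf p(x, y) for two binary variables; the numbers are made up for illustration.
p_xy = {
    (0, 0): 0.25, (0, 1): 0.25,
    (1, 0): 0.40, (1, 1): 0.10,
}

# H(X, Y) = -sum_{x, y} p(x, y) log2 p(x, y)
h_xy = -sum(p * log2(p) for p in p_xy.values() if p > 0)
print(h_xy)  # ~1.86 bits
```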

Conditional Entropy

If $(X,Y) \sim p(x,y)$, the conditional entropy is:

$$\begin{aligned}
H(Y|X) &= \sum_{x \in \mathcal{X}} p(x)\, H(Y|X=x) \\
&= -\sum_{x \in \mathcal{X}} p(x) \sum_{y \in \mathcal{Y}} p(y|x) \log p(y|x) \\
&= -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log p(y|x) \\
&= -E \log p(Y|X)
\end{aligned}$$
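
The equivalent forms above can be checked on the same kind of toy table; this sketch computes $H(Y|X)$ both as $\sum_x p(x)\, H(Y|X=x)$ and as $-\sum_{x,y} p(x,y) \log p(y|x)$ (the joint pmf values are again made up):

```python
from math import log2, isclose

# Toy joint pmf; the numbers are made up for illustration.
p_xy = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.40, (1, 1): 0.10}
p_x = {0: 0.50, 1: 0.50}  # marginal p(x) = sum_y p(x, y)

def h(probs):
    """Entropy in bits of a collection of probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Form 1: H(Y|X) = sum_x p(x) * H(Y | X = x)
form1 = sum(p_x[x] * h([p_xy[(x, y)] / p_x[x] for y in (0, 1)]) for x in (0, 1))

# Form 2: H(Y|X) = -sum_{x, y} p(x, y) log2 p(y|x), with p(y|x) = p(x, y) / p(x)
form2 = -sum(p * log2(p / p_x[x]) for (x, y), p in p_xy.items() if p > 0)

assert isclose(form1, form2)
print(form1)  # ~0.86 bits
```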

Information theoretic measure of correlation:

$$\rho = 1 - \frac{H(X|Y)}{H(Y)}$$

Chain Rule

$$H(X,Y) = H(X) + H(Y|X)$$

Proof:

$$\begin{aligned}
H(X,Y) &= -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log p(x,y) \\
&= -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log \big(p(x)\, p(y|x)\big) \\
&= -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log p(x) - \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log p(y|x) \\
&= -\sum_{x \in \mathcal{X}} p(x) \log p(x) - \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log p(y|x) \\
&= H(X) + H(Y|X).
\end{aligned}$$

Equivalently, $\log p(X,Y) = \log p(X) + \log p(Y|X)$,
and $H(X,Y|Z) = H(X|Z) + H(Y|X,Z)$.

Note that $H(Y|X) \neq H(X|Y)$ in general; however, $H(X) - H(X|Y) = H(Y) - H(Y|X)$.
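
A quick numerical check of the chain rule and of the identity $H(X) - H(X|Y) = H(Y) - H(Y|X)$, using a made-up joint pmf:

```python
from math import log2, isclose

p_xy = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.40, (1, 1): 0.10}  # made-up joint pmf

def h(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

p_x = {x: sum(p for (xx, _), p in p_xy.items() if xx == x) for x in (0, 1)}
p_y = {y: sum(p for (_, yy), p in p_xy.items() if yy == y) for y in (0, 1)}

h_xy = h(p_xy.values())
h_x, h_y = h(p_x.values()), h(p_y.values())
h_y_given_x = -sum(p * log2(p / p_x[x]) for (x, y), p in p_xy.items() if p > 0)
h_x_given_y = -sum(p * log2(p / p_y[y]) for (x, y), p in p_xy.items() if p > 0)

assert isclose(h_xy, h_x + h_y_given_x)               # chain rule
assert isclose(h_x - h_x_given_y, h_y - h_y_given_x)  # both sides equal I(X;Y)
print(h_y_given_x, h_x_given_y)                       # not equal in general
```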

Relative Entropy

The relative entropy, also called the Kullback-Leibler distance, between two probability mass functions p(x) and q(x) is:

$$D(p\,\|\,q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)} = E_p \log \frac{p(X)}{q(X)}$$

However, it is not symmetric and does not satisfy the triangle inequality, so it is not a true distance (metric).
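
A minimal sketch of $D(p\,\|\,q)$ in bits that also shows the asymmetry (the two distributions and the name `kl_bits` are my own choices):

```python
from math import log2

def kl_bits(p, q):
    """D(p || q) = sum_x p(x) log2(p(x)/q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * log2(px / q[x]) for x, px in p.items() if px > 0)

p = {"a": 0.5, "b": 0.5}
q = {"a": 0.9, "b": 0.1}

print(kl_bits(p, q))  # ~0.74 bits
print(kl_bits(q, p))  # ~0.53 bits: D(p||q) != D(q||p)
```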

Mutual Information

Let $(X,Y) \sim p(x,y)$.
The mutual information is the relative entropy between the joint distribution and the product distribution p(x)p(y):

$$I(X;Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)} = D\big(p(x,y)\,\|\,p(x)\,p(y)\big) = E_{p(x,y)} \log \frac{p(X,Y)}{p(X)\,p(Y)}$$

Note that $D(p\,\|\,q) \neq D(q\,\|\,p)$ in general.

$$I(X;Y) = E_{p(x,y)}\!\left[\log \frac{p(X|Y)}{p(X)}\right] = E_{p(x,y)}\!\left[\log \frac{p(Y|X)}{p(Y)}\right]$$
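
This sketch computes $I(X;Y)$ from a made-up joint pmf and checks that it equals $H(X) - H(X|Y)$:

```python
from math import log2, isclose

p_xy = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.40, (1, 1): 0.10}  # made-up joint pmf
p_x = {x: sum(p for (xx, _), p in p_xy.items() if xx == x) for x in (0, 1)}
p_y = {y: sum(p for (_, yy), p in p_xy.items() if yy == y) for y in (0, 1)}

# I(X;Y) = sum_{x, y} p(x, y) log2[ p(x, y) / (p(x) p(y)) ]
i_xy = sum(p * log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items() if p > 0)

# Equivalent form: I(X;Y) = H(X) - H(X|Y)
h_x = -sum(p * log2(p) for p in p_x.values() if p > 0)
h_x_given_y = -sum(p * log2(p / p_y[y]) for (x, y), p in p_xy.items() if p > 0)

assert isclose(i_xy, h_x - h_x_given_y)
print(i_xy)  # ~0.07 bits: weak dependence in this example
```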

For jointly Gaussian variables X and Y with Pearson Correlation ρ, the mutual information is:

$$I(X;Y) = -\frac{1}{2}\log(1 - \rho^2)$$
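
A worked example, assuming base-2 logs so the answer is in bits: for $\rho = 0.5$, $I(X;Y) = -\tfrac{1}{2}\log_2(1 - 0.25) \approx 0.21$ bits. A tiny sketch evaluating the formula for a few values of $\rho$:

```python
from math import log2

def gaussian_mi_bits(rho):
    """I(X;Y) = -1/2 * log2(1 - rho^2), in bits, for jointly Gaussian X, Y."""
    return -0.5 * log2(1 - rho ** 2)

for rho in (0.0, 0.5, 0.9, 0.99):
    print(rho, gaussian_mi_bits(rho))
# 0.0 -> 0 bits (independence), 0.5 -> ~0.21, 0.9 -> ~1.20, 0.99 -> ~2.83
# The mutual information grows without bound as |rho| -> 1.
```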