
Information Theory Exam Notes

Info Content

  • h(x) = log₂(1/P(x))
  • Measures uncertainty of an event in bits
  • Higher probability = lower information content

Entropy

  • H(X) = Σ P(x) log₂(1/P(x))
  • Always non-negative
  • Higher entropy = more uncertainty
  • Measures expected info content of a source
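
A minimal Python sketch of both formulas (the example distribution is made up for illustration):

```python
import math

def info_content(p):
    """Information content h(x) = log2(1/P(x)) in bits."""
    return math.log2(1 / p)

def entropy(probs):
    """Entropy H(X) = sum of P(x) * log2(1/P(x)), skipping zero-probability outcomes."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

print(info_content(0.25))          # 2.0 bits
print(entropy([0.5, 0.25, 0.25]))  # 1.5 bits
```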

Source Coding Theorem

  • N independent outcomes from source X can be compressed into roughly NH(X) bits, and no fewer on average without loss

Joint Entropy

  • H(X,Y) = Σ P(x,y) log₂(1/P(x,y))
  • H(X,Y) = H(X) + H(Y) only if X and Y are independent
  • To compute: list every non-zero P(X=x, Y=y), calculate P(x,y) × log₂(1/P(x,y)) for each term, then sum all terms
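
A small sketch of this procedure; the joint distribution is hypothetical:

```python
import math

def joint_entropy(joint):
    """H(X,Y) = sum over non-zero P(x,y) of P(x,y) * log2(1/P(x,y))."""
    return sum(p * math.log2(1 / p) for p in joint.values() if p > 0)

# Hypothetical joint distribution P(X=x, Y=y)
pxy = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.125, (1, 1): 0.125}
print(joint_entropy(pxy))  # 1.75 bits
```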

Relative Entropy

  • D(P||Q) = Σ P(x) log₂(P(x)/Q(x))

Marginal Probabilities

  • P(X=0) = Sum of P for all cases where X=0, e.g. P(X=0,Y=0) + P(X=0,Y=1)

Conditional Entropy

  • H(Y|X) < H(X|Y) means there is less remaining uncertainty about Y once X is known than about X once Y is known; in that sense X is a better predictor of Y than Y is of X
  • Average uncertainty about Y when X is known
  • H(Y|X) = ∑ P(x)H(Y|X=x) = -∑∑ P(x,y)log₂(P(y|x))
  • H(Y|X) ≠ H(X|Y)

Using Conditional Distributions:

  1. Find marginal probability P(x) for each x
  2. Find conditional distributions P(y|x) for each x: P(y|x) = P(x,y) / P(x)
  3. Calculate H(Y|X=x) for each x: H(Y|X=x) = Σ P(y|x) log₂(1/P(y|x))
  4. Calculate weighted average: H(Y|X) = Σ P(x) × H(Y|X=x)

Using Joint Probabilities:

  1. For each (x,y) pair, calculate P(y|x) = P(x,y)/P(x)
  2. Calculate -P(x,y) × log₂(P(y|x)) for each pair
  3. Sum all terms
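
A sketch of both routes to H(Y|X) from a joint distribution (the numbers are illustrative); the two results should agree:

```python
import math

pxy = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.125, (1, 1): 0.125}

# Marginal P(x)
px = {}
for (x, y), p in pxy.items():
    px[x] = px.get(x, 0) + p

# Route 1: weighted average of H(Y|X=x) over the conditional distributions
h_cond = 0.0
for x in px:
    cond = {y: pxy[(x, y)] / px[x] for (xx, y) in pxy if xx == x}
    h_x = sum(p * math.log2(1 / p) for p in cond.values() if p > 0)
    h_cond += px[x] * h_x

# Route 2: -sum over (x,y) of P(x,y) * log2 P(y|x)
h_cond2 = -sum(p * math.log2(p / px[x]) for (x, y), p in pxy.items() if p > 0)

print(h_cond, h_cond2)  # both ≈ 0.939 bits
```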

Chain Rule for Entropy

  • H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)

Mutual Information

  • I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)

Bernoulli Distribution

  • Entropy: H(p) = p log₂(1/p) + (1-p) log₂(1/(1-p)); max at p = 0.5 (H = 1 bit), min at p = 0 or 1 (H = 0)
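
A tiny sketch of the binary entropy function:

```python
import math

def binary_entropy(p):
    """Entropy of a Bernoulli(p) source in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))

print(binary_entropy(0.5))  # 1.0 bit (maximum)
print(binary_entropy(0.1))  # ≈ 0.469 bits
```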

Binary Symmetric Channel

Transition probabilities P(y|x) with crossover probability 0.1:

|       | Y = 0 | Y = 1 |
|-------|-------|-------|
| X = 0 | 0.9   | 0.1   |
| X = 1 | 0.1   | 0.9   |
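
A sketch computing I(X;Y) = H(Y) - H(Y|X) for this table; the uniform input distribution is an assumption (the notes don't specify one):

```python
import math

def h(probs):
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

px = {0: 0.5, 1: 0.5}                              # assumed uniform input
py_x = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.1, 1: 0.9}}  # P(y|x) from the table

# Output distribution P(y) = sum over x of P(x) * P(y|x)
py = {y: sum(px[x] * py_x[x][y] for x in px) for y in (0, 1)}

h_y = h(py.values())
h_y_given_x = sum(px[x] * h(py_x[x].values()) for x in px)
print(h_y - h_y_given_x)   # ≈ 0.531 bits (1 - H(0.1), the BSC capacity at p = 0.1)
```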

Prefix-Free Codes

  • A code is prefix-free if for any two different codewords, one is never the beginning (prefix) of another.
  • Huffman codes are always prefix-free

Information Divergence

How much extra information (in bits) is needed on average to encode data from distribution P when using a code optimized for distribution Q

True Distance

A function d(x,y) is a true distance/metric if it satisfies three properties:

  1. Non-negativity: d(x,y) ≥ 0, with equality iff x = y
  2. Symmetry: d(x,y) = d(y,x)
  3. Triangle inequality: d(x,z) ≤ d(x,y) + d(y,z)

Triangle Inequality

For any three points A, B, C, the direct distance from A to C cannot exceed the sum of distances A→B→C

Relative Entropy (KL Divergence)

  • Formula: \(D(P||Q) = \sum_x P(x) \log_2 \frac{P(x)}{Q(x)}\)
  • Measures information divergence between two distributions P and Q
  • Properties:
      • Always non-negative: \(D(P||Q) \geq 0\)
      • Not symmetric: \(D(P||Q) \neq D(Q||P)\)
      • Not a true distance (doesn't satisfy triangle inequality)
  • Conventions: \(0 \log \frac{0}{0} = 0\); if \(P(x) > 0\) and \(Q(x) = 0\) then \(D(P||Q) = \infty\)
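
A minimal sketch implementing D(P||Q) with the conventions above (P and Q are arbitrary illustrative distributions):

```python
import math

def kl_divergence(p, q):
    """D(P||Q) = sum of P(x) * log2(P(x)/Q(x)) over the support of P."""
    d = 0.0
    for x in p:
        if p[x] == 0:
            continue                      # convention: 0 * log(0/q) = 0
        if q.get(x, 0) == 0:
            return float('inf')           # convention: P(x) > 0 and Q(x) = 0 => infinity
        d += p[x] * math.log2(p[x] / q[x])
    return d

P = {'a': 0.5, 'b': 0.5}
Q = {'a': 0.75, 'b': 0.25}
print(kl_divergence(P, Q))  # ≈ 0.208
print(kl_divergence(Q, P))  # ≈ 0.189 (not symmetric)
```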

Jensen's Inequality

  • For convex function \(f\): \(E[f(X)] \geq f(E[X])\)
  • For concave function \(f\): \(E[f(X)] \leq f(E[X])\)
  • Used to prove KL divergence is non-negative (since \(\log\) is strictly concave)

Information Inequality (Proof)

Key steps using Jensen's inequality, where \(A = \{x : P(x) > 0\}\) is the support of P: \(-D(P||Q) = \sum_{x \in A} P(x) \log \frac{Q(x)}{P(x)} \leq \log \sum_{x \in A} P(x) \frac{Q(x)}{P(x)} = \log \sum_{x \in A} Q(x) \leq \log 1 = 0\)

Maximum Entropy Theorem

  • Formula: \(\log |X| - H(X) \geq 0\) where \(|X|\) is alphabet size
  • Therefore: \(H(X) \leq \log |X|\)
  • Maximum entropy is achieved by uniform distribution
  • Equality holds if and only if \(P(x) = \frac{1}{|X|}\) for all \(x\)

Coding Algorithms

  • Expected length: L(X) = Σ P(x) × l(x), where l(x) is the length of the codeword for x
  • Efficiency = H(X) / L(X)

Huffman Coding

  1. List all symbols with their probabilities
  2. Combine the two least probable nodes into a new node whose probability is their sum
  3. Repeat until one node remains
  4. Assign codes by traversing the tree: left = 0, right = 1
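
A compact sketch of this procedure using a heap; the probabilities are illustrative, and the 0/1 assignment may differ from a hand-drawn tree (any consistent assignment gives the same codeword lengths):

```python
import heapq

def huffman_code(probs):
    """Build a prefix-free code by repeatedly merging the two least probable nodes."""
    # Each heap entry: (probability, tie-breaker, {symbol: partial codeword})
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p1, _, codes1 = heapq.heappop(heap)   # least probable node
        p2, _, codes2 = heapq.heappop(heap)   # second least probable node
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (p1 + p2, count, merged))
        count += 1
    return heap[0][2]

probs = {'A': 0.5, 'B': 0.25, 'C': 0.125, 'D': 0.125}
code = huffman_code(probs)
L = sum(probs[s] * len(code[s]) for s in probs)   # expected length
print(code, L)   # e.g. lengths 1, 2, 3, 3 -> L = 1.75 bits/symbol
```

For these dyadic probabilities L equals H(X) = 1.75 bits, so the efficiency H(X)/L(X) is 1.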

Shannon-Fano

  1. Sort symbols by probability in decreasing order
  2. Divide symbols into two groups with probabilities as close to equal as possible
  3. Assign bits: First group gets '0', second group gets '1'
  4. Repeat recursively for each group until each group has only one symbol
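
A rough sketch of the recursive split (probabilities illustrative; the split point here greedily balances the two group probabilities, which is one common formulation):

```python
def shannon_fano(symbols):
    """symbols: list of (symbol, probability) sorted by decreasing probability."""
    if len(symbols) == 1:
        return {symbols[0][0]: ""}
    total = sum(p for _, p in symbols)
    # Find the split point that makes the two group probabilities as equal as possible
    best_i, best_diff = 1, float('inf')
    running = 0.0
    for i in range(1, len(symbols)):
        running += symbols[i - 1][1]
        diff = abs(running - (total - running))
        if diff < best_diff:
            best_i, best_diff = i, diff
    codes = {}
    for sym, code in shannon_fano(symbols[:best_i]).items():
        codes[sym] = "0" + code           # first group gets '0'
    for sym, code in shannon_fano(symbols[best_i:]).items():
        codes[sym] = "1" + code           # second group gets '1'
    return codes

syms = [('A', 0.4), ('B', 0.3), ('C', 0.2), ('D', 0.1)]
print(shannon_fano(syms))  # e.g. {'A': '0', 'B': '10', 'C': '110', 'D': '111'}
```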

Shannon-Fano-Elias

  1. Calculate F̄(x) - Midpoint of Each Symbol's Interval

    F̄(A) = P(A)/2
    F̄(B) = P(A) + P(B)/2
    F̄(C) = P(A) + P(B) + P(C)/2
    ...

  2. Convert to Binary

  3. Calculate Codeword Lengths

    l(x) = ⌈log₂(1/P(x))⌉ + 1
    e.g. ⌈1.58⌉ = 2, because 2 is the smallest integer greater than or equal to 1.58

  4. Extract Codewords (First l(x) bits after decimal)
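
A sketch of steps 1-4; the symbol order and probabilities are illustrative:

```python
import math

def sfe_code(probs):
    """Shannon-Fano-Elias: codeword = first l(x) bits of the binary expansion of F-bar(x)."""
    codes = {}
    cum = 0.0
    for sym, p in probs.items():           # fixed symbol order defines the intervals
        fbar = cum + p / 2                  # midpoint of the symbol's interval
        length = math.ceil(math.log2(1 / p)) + 1
        # binary expansion of the fractional part, taking 'length' bits
        bits, frac = "", fbar
        for _ in range(length):
            frac *= 2
            if frac >= 1:
                bits += "1"
                frac -= 1
            else:
                bits += "0"
        codes[sym] = bits
        cum += p
    return codes

probs = {'A': 0.25, 'B': 0.5, 'C': 0.25}
print(sfe_code(probs))  # {'A': '001', 'B': '10', 'C': '111'}
```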

Decimal to Fractional Binary

  1. Take the decimal part
  2. Multiply by 2
  3. If result ≥ 1: write down "1", subtract 1 from result
  4. If result < 1: write down "0"
  5. Repeat with new fractional part

Fractional Binary to Decimal

  1. Write down the binary number
  2. Starting from the leftmost digit after the point, multiply each digit by 2 raised to the power of its position index (starting at -1, then -2, -3, ...)
  3. Sum all the products
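
A sketch of both conversions (the bit count for the forward direction is an arbitrary choice):

```python
def decimal_to_binary_fraction(x, n_bits=8):
    """Binary expansion of a fraction 0 <= x < 1: multiply by 2, emit the integer part."""
    bits = ""
    for _ in range(n_bits):
        x *= 2
        if x >= 1:
            bits += "1"
            x -= 1
        else:
            bits += "0"
    return bits

def binary_fraction_to_decimal(bits):
    """Sum of bit_i * 2^(-i), with i starting at 1 for the leftmost digit."""
    return sum(int(b) * 2 ** -(i + 1) for i, b in enumerate(bits))

print(decimal_to_binary_fraction(0.625, 4))  # '1010'
print(binary_fraction_to_decimal('101'))     # 0.625
```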

Arithmetic Coding

Encoding

  1. Set up cumulative probabilities
  2. Start with the interval [0, 1)
  3. For each symbol, narrow the interval using:
     new_low = low + (high-low) × cum_prob_start
     new_high = low + (high-low) × cum_prob_end
  4. Calculate the tag (midpoint) and determine the codeword length
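
A sketch of the interval-narrowing loop; the alphabet, probabilities, message, and the termination symbol '!' are all hypothetical, and the length rule used is one common choice:

```python
import math

probs = {'A': 0.5, 'B': 0.25, '!': 0.25}     # '!' as a hypothetical termination symbol

# Cumulative probability intervals [start, end) for each symbol
cum, intervals = 0.0, {}
for sym, p in probs.items():
    intervals[sym] = (cum, cum + p)
    cum += p

low, high = 0.0, 1.0
for sym in "AB!":                            # message to encode
    start, end = intervals[sym]
    width = high - low
    low, high = low + width * start, low + width * end

tag = (low + high) / 2                       # midpoint of the final interval
n_bits = math.ceil(math.log2(1 / (high - low))) + 1
print(low, high, tag, n_bits)                # 0.34375 0.375 0.359375 6
```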

Decoding

  1. Convert the binary codeword to a decimal value
  2. Set up cumulative probability intervals
  3. Decode iteratively:
     • Check which interval the value falls into and output that symbol
     • New range: [low, high), width = high - low
     • Rescale: new value = (value - low) / width
  4. Repeat until the termination symbol is reached
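
A matching sketch of the iterative decode-and-rescale loop, reusing the hypothetical model from the encoding sketch:

```python
probs = {'A': 0.5, 'B': 0.25, '!': 0.25}     # same hypothetical model as the encoder
cum, intervals = 0.0, {}
for sym, p in probs.items():
    intervals[sym] = (cum, cum + p)
    cum += p

value = 0.359375                             # decimal value of the received codeword (the tag above)
decoded = ""
while True:
    for sym, (low, high) in intervals.items():
        if low <= value < high:              # which interval the value falls into
            decoded += sym
            value = (value - low) / (high - low)   # rescale back into [0, 1)
            break
    if decoded.endswith('!'):                # stop at the termination symbol
        break
print(decoded)  # 'AB!'
```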

When to Stop

  1. Calculate the final interval width by multiplying the widths of all the symbol intervals.
  2. The interval width must be greater than 1/2ⁿ, where n is the number of bits in the codeword (e.g. 010 has n = 3).

Markov Chains

  • Transitions: p(j|i) means probability of going from state i to state j
  • Transition graph: 2 circles labeled 0 and 1, with arrows showing transitions
  • Transition matrix (row i lists the transitions out of state i): \(P = \begin{pmatrix} p(0|0) & p(1|0) \\ p(0|1) & p(1|1) \end{pmatrix}\)

Stationary Distribution

  1. Let π = [π₀ π₁] be the stationary distribution
  2. Need: π·P = π and π₀ + π₁ = 1
  3. Write the system of equations:
     π₀ = π₀p(0|0) + π₁p(0|1)
     π₁ = π₀p(1|0) + π₁p(1|1)
     π₀ + π₁ = 1
  4. Solve for π₀ and π₁
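
A small sketch for a two-state chain with illustrative transition probabilities; substituting π₁ = 1 - π₀ into the first equation gives the closed form π₀ = p(0|1) / (p(1|0) + p(0|1)):

```python
# p[i][j] = p(j|i): probability of going from state i to state j (illustrative values)
p = {0: {0: 0.9, 1: 0.1},
     1: {0: 0.3, 1: 0.7}}

# For a two-state chain, pi·P = pi together with pi0 + pi1 = 1 solves to:
pi0 = p[1][0] / (p[0][1] + p[1][0])
pi1 = 1 - pi0
print(pi0, pi1)                              # 0.75 0.25

# Check: pi0 = pi0*p(0|0) + pi1*p(0|1)
print(pi0 * p[0][0] + pi1 * p[1][0], pi0)    # both 0.75
```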

Entropy Rate

H(X) = -∑ᵢ∑ⱼ πᵢpᵢⱼlog₂(pᵢⱼ)

  • π₀p(0|0)log₂(p(0|0))
  • π₀p(1|0)log₂(p(1|0))
  • π₁p(0|1)log₂(p(0|1))
  • π₁p(1|1)log₂(p(1|1))
  • Sum all terms with negative sign
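
Continuing the same illustrative chain, a sketch of the entropy-rate sum:

```python
import math

p = {0: {0: 0.9, 1: 0.1},        # p[i][j] = p(j|i), same illustrative chain as above
     1: {0: 0.3, 1: 0.7}}
pi = {0: 0.75, 1: 0.25}          # stationary distribution from the previous sketch

# H = -sum_i sum_j pi_i * p(j|i) * log2 p(j|i)
H = -sum(pi[i] * p[i][j] * math.log2(p[i][j])
         for i in p for j in p[i] if p[i][j] > 0)
print(H)  # ≈ 0.572 bits per symbol
```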

Ternary Channel

Calculate Mutual Information

  1. List transition probabilities
  2. Calculate the output distribution P(Y) using the law of total probability:
     P(M) = P(M|a)P(a) + P(M|b)P(b) + P(M|c)P(c)
     P(N) = P(N|a)P(a) + P(N|b)P(b) + P(N|c)P(c)
  3. Calculate the output entropy: H(Y) = P(M)log₂(1/P(M)) + P(N)log₂(1/P(N))
  4. Calculate H(Y|X=x) for each input x, then take the weighted sum: H(Y|X) = Σₓ P(x)H(Y|X=x)
  5. Calculate mutual information: I(X;Y) = H(Y) - H(Y|X)
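
A sketch following these five steps; both the input distribution and the transition probabilities are made-up values, since the notes don't fix them:

```python
import math

def h(probs):
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

px = {'a': 1/3, 'b': 1/3, 'c': 1/3}                 # assumed input distribution
py_x = {'a': {'M': 0.8, 'N': 0.2},                  # assumed transition probabilities P(y|x)
        'b': {'M': 0.5, 'N': 0.5},
        'c': {'M': 0.1, 'N': 0.9}}

# Step 2: output distribution by the law of total probability
py = {y: sum(px[x] * py_x[x][y] for x in px) for y in ('M', 'N')}

# Steps 3-5: H(Y), H(Y|X), then I(X;Y) = H(Y) - H(Y|X)
h_y = h(py.values())
h_y_given_x = sum(px[x] * h(py_x[x].values()) for x in px)
print(h_y - h_y_given_x)   # ≈ 0.27 bits for these assumed numbers
```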

(7, 4) Hamming Code

  • In the three-circle Venn diagram, the 4 intersection regions hold the information bits (s₁, s₂, s₃, s₄)
  • The 3 outer regions (one per circle) hold the parity bits (p₁, p₂, p₃)
  • Codeword format: C = s₁ s₂ s₃ s₄ p₁ p₂ p₃

Encoding

  1. Place the information bits in the intersection regions
  2. Calculate each parity bit so that its circle contains an even number of 1s (even parity)

Decoding

  1. Extract the bits from the received codeword and place them in the diagram
  2. Check the parity of each circle
  3. The bit lying in exactly the circles whose parity check fails is in error; flip it to correct it

Calculating Parity Bits

  1. Count the number of 1s in your data bits.
  2. If the count is even, the parity bit is 0 for even parity, 1 for odd parity.
  3. If the count is odd, the parity bit is 1 for even parity, 0 for odd parity.
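
A sketch of the encode/decode flow. The assignment of data bits to parity circles below (p₁ covers s₁s₂s₃, p₂ covers s₂s₃s₄, p₃ covers s₁s₃s₄) is an assumed layout; the actual one depends on the Venn diagram used in the course:

```python
# Assumed layout: which data-bit positions (0-based s1..s4) each parity circle covers.
CIRCLES = {'p1': [0, 1, 2],   # p1's circle contains s1, s2, s3
           'p2': [1, 2, 3],   # p2's circle contains s2, s3, s4
           'p3': [0, 2, 3]}   # p3's circle contains s1, s3, s4

def encode(s):
    """s = [s1, s2, s3, s4] -> codeword s1 s2 s3 s4 p1 p2 p3 with even parity per circle."""
    return s + [sum(s[i] for i in CIRCLES[c]) % 2 for c in ('p1', 'p2', 'p3')]

def decode(r):
    """Single-error correction: flip the data bit lying in exactly the failed circles."""
    s, p = list(r[:4]), list(r[4:])
    failed = {c for k, c in enumerate(('p1', 'p2', 'p3'))
              if (sum(s[i] for i in CIRCLES[c]) + p[k]) % 2 == 1}
    for i in range(4):
        # A data bit is in error iff it lies in exactly the circles whose parity check failed.
        if failed and {c for c in CIRCLES if i in CIRCLES[c]} == failed:
            s[i] ^= 1
    return s

cw = encode([1, 0, 1, 1])
print(cw)                              # [1, 0, 1, 1, 0, 0, 1]
received = cw[:]
received[2] ^= 1                       # flip s3 in transit
print(decode(received))                # recovers [1, 0, 1, 1]
```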