
Information Theory Exam Notes

Info Content

  • h(x) = log₂(1/P(x))
  • Measures uncertainty of an event in bits
  • Higher probability = lower information content

Entropy

  • H(X) = Σ P(x) log₂(1/P(x))
  • Always non-negative
  • Higher entropy = more uncertainty
  • Measures expected info content of a source
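
A minimal Python sketch of both formulas (the example distribution is made up for illustration):

```python
import math

def info_content(p):
    """Information content h(x) = log2(1/P(x)) in bits."""
    return math.log2(1 / p)

def entropy(probs):
    """Entropy H(X) = sum of P(x) * log2(1/P(x)), skipping zero-probability outcomes."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

print(info_content(0.25))          # 2.0 bits
print(entropy([0.5, 0.25, 0.25]))  # 1.5 bits
```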

Source Coding Theorem

  • N independent outcomes from source X can be compressed into roughly NH(X) bits, and no fewer on average without loss

Joint Entropy

  • H(X,Y) = Σ P(x,y) log₂(1/P(x,y))
  • H(X,Y) = H(X) + H(Y) only if X and Y are independent
  • To compute: list every non-zero P(X=x, Y=y), calculate P(x,y) × log₂(1/P(x,y)) for each term, then sum all terms
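
A small sketch of this procedure; the joint distribution is hypothetical:

```python
import math

def joint_entropy(joint):
    """H(X,Y) = sum over non-zero P(x,y) of P(x,y) * log2(1/P(x,y))."""
    return sum(p * math.log2(1 / p) for p in joint.values() if p > 0)

# Hypothetical joint distribution P(X=x, Y=y)
pxy = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.125, (1, 1): 0.125}
print(joint_entropy(pxy))  # 1.75 bits
```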

Relative Entropy

  • D(P||Q) = Σ P(x) log₂(P(x)/Q(x))

Marginal Probabilities

  • P(X=0) = Sum of P for all cases where X=0, e.g. P(X=0,Y=0) + P(X=0,Y=1)

Conditional Entropy

  • H(Y|X) < H(X|Y) means there is less remaining uncertainty about Y once X is known than about X once Y is known; in that sense X is a better predictor of Y than Y is of X
  • Average uncertainty about Y when X is known
  • H(Y|X) = ∑ P(x)H(Y|X=x) = -∑∑ P(x,y)log₂(P(y|x))
  • H(Y|X) ≠ H(X|Y)

Using Conditional Distributions:

  1. Find marginal probability P(x) for each x
  2. Find conditional distributions P(y|x) for each x: P(y|x) = P(x,y) / P(x)
  3. Calculate H(Y|X=x) for each x: H(Y|X=x) = Σ P(y|x) log₂(1/P(y|x))
  4. Calculate weighted average: H(Y|X) = Σ P(x) × H(Y|X=x)

Using Joint Probabilities:

  1. For each (x,y) pair, calculate P(y|x) = P(x,y)/P(x)
  2. Calculate -P(x,y) × log₂(P(y|x)) for each pair
  3. Sum all terms
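
A sketch of both routes to H(Y|X) from a joint distribution (the numbers are illustrative); the two results should agree:

```python
import math

pxy = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.125, (1, 1): 0.125}

# Marginal P(x)
px = {}
for (x, y), p in pxy.items():
    px[x] = px.get(x, 0) + p

# Route 1: weighted average of H(Y|X=x) over the conditional distributions
h_cond = 0.0
for x in px:
    cond = {y: pxy[(x, y)] / px[x] for (xx, y) in pxy if xx == x}
    h_x = sum(p * math.log2(1 / p) for p in cond.values() if p > 0)
    h_cond += px[x] * h_x

# Route 2: -sum over (x,y) of P(x,y) * log2 P(y|x)
h_cond2 = -sum(p * math.log2(p / px[x]) for (x, y), p in pxy.items() if p > 0)

print(h_cond, h_cond2)  # both ≈ 0.939 bits
```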

Chain Rule for Entropy

  • H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)

Mutual Information

  • I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)

Bernoulli Distribution

  • Entropy: H(p) = p log₂(1/p) + (1-p) log₂(1/(1-p)); max at p = 0.5 (H = 1 bit), min at p = 0 or 1 (H = 0)
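
A tiny sketch of the binary entropy function:

```python
import math

def binary_entropy(p):
    """Entropy of a Bernoulli(p) source in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))

print(binary_entropy(0.5))  # 1.0 bit (maximum)
print(binary_entropy(0.1))  # ≈ 0.469 bits
```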

Binary Symmetric Channel

Transition probabilities P(y|x) with crossover probability 0.1:

|       | Y = 0 | Y = 1 |
|-------|-------|-------|
| X = 0 | 0.9   | 0.1   |
| X = 1 | 0.1   | 0.9   |
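
A sketch computing I(X;Y) = H(Y) - H(Y|X) for this table; the uniform input distribution is an assumption (the notes don't specify one):

```python
import math

def h(probs):
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

px = {0: 0.5, 1: 0.5}                              # assumed uniform input
py_x = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.1, 1: 0.9}}  # P(y|x) from the table

# Output distribution P(y) = sum over x of P(x) * P(y|x)
py = {y: sum(px[x] * py_x[x][y] for x in px) for y in (0, 1)}

h_y = h(py.values())
h_y_given_x = sum(px[x] * h(py_x[x].values()) for x in px)
print(h_y - h_y_given_x)   # ≈ 0.531 bits (1 - H(0.1), the BSC capacity at p = 0.1)
```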

Prefix-Free Codes

  • A code is prefix-free if for any two different codewords, one is never the beginning (prefix) of another.
  • Huffman codes are always prefix-free

Information Divergence

How much extra information (in bits) is needed on average to encode data from distribution P when using a code optimized for distribution Q

True Distance

A function d(x,y) is a true distance/metric if it satisfies three properties:

  1. Non-negativity: d(x,y) ≥ 0, with equality iff x = y
  2. Symmetry: d(x,y) = d(y,x)
  3. Triangle inequality: d(x,z) ≤ d(x,y) + d(y,z)

Triangle Inequality

For any three points A, B, C, the direct distance from A to C cannot exceed the sum of distances A→B→C

Relative Entropy (KL Divergence)

  • Formula: \(D(P||Q) = \sum_x P(x) \log_2 \frac{P(x)}{Q(x)}\)
  • Measures information divergence between two distributions P and Q
  • Properties:
      • Always non-negative: \(D(P||Q) \geq 0\)
      • Not symmetric: \(D(P||Q) \neq D(Q||P)\)
      • Not a true distance (doesn't satisfy triangle inequality)
  • Conventions: \(0 \log \frac{0}{0} = 0\); if \(P(x) > 0\) and \(Q(x) = 0\) then \(D(P||Q) = \infty\)
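
A minimal sketch implementing D(P||Q) with the conventions above (P and Q are arbitrary illustrative distributions):

```python
import math

def kl_divergence(p, q):
    """D(P||Q) = sum of P(x) * log2(P(x)/Q(x)) over the support of P."""
    d = 0.0
    for x in p:
        if p[x] == 0:
            continue                      # convention: 0 * log(0/q) = 0
        if q.get(x, 0) == 0:
            return float('inf')           # convention: P(x) > 0 and Q(x) = 0 => infinity
        d += p[x] * math.log2(p[x] / q[x])
    return d

P = {'a': 0.5, 'b': 0.5}
Q = {'a': 0.75, 'b': 0.25}
print(kl_divergence(P, Q))  # ≈ 0.208
print(kl_divergence(Q, P))  # ≈ 0.189 (not symmetric)
```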

Jensen's Inequality

  • For convex function \(f\): \(E[f(X)] \geq f(E[X])\)
  • For concave function \(f\): \(E[f(X)] \leq f(E[X])\)
  • Used to prove KL divergence is non-negative (since \(\log\) is strictly concave)

Information Inequality (Proof)

Key steps using Jensen's inequality, where \(A = \{x : P(x) > 0\}\) is the support of P: \(-D(P||Q) = \sum_{x \in A} P(x) \log \frac{Q(x)}{P(x)} \leq \log \sum_{x \in A} P(x) \frac{Q(x)}{P(x)} = \log \sum_{x \in A} Q(x) \leq \log 1 = 0\)

Maximum Entropy Theorem

  • Formula: \(\log |X| - H(X) \geq 0\) where \(|X|\) is alphabet size
  • Therefore: \(H(X) \leq \log |X|\)
  • Maximum entropy is achieved by uniform distribution
  • Equality holds if and only if \(P(x) = \frac{1}{|X|}\) for all \(x\)

Coding Algorithms

  • Expected length: L(X) = Σ P(x) × l(x), where l(x) is the length of the codeword for x
  • Efficiency = H(X) / L(X)

Huffman Coding

  1. List all symbols with their probabilities
  2. Combine the two least probable nodes into a new node whose probability is their sum
  3. Repeat until one node remains
  4. Assign codes by traversing the tree: left = 0, right = 1
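
A compact sketch of this procedure using a heap; the probabilities are illustrative, and the 0/1 assignment may differ from a hand-drawn tree (any consistent assignment gives the same codeword lengths):

```python
import heapq

def huffman_code(probs):
    """Build a prefix-free code by repeatedly merging the two least probable nodes."""
    # Each heap entry: (probability, tie-breaker, {symbol: partial codeword})
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p1, _, codes1 = heapq.heappop(heap)   # least probable node
        p2, _, codes2 = heapq.heappop(heap)   # second least probable node
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (p1 + p2, count, merged))
        count += 1
    return heap[0][2]

probs = {'A': 0.5, 'B': 0.25, 'C': 0.125, 'D': 0.125}
code = huffman_code(probs)
L = sum(probs[s] * len(code[s]) for s in probs)   # expected length
print(code, L)   # e.g. lengths 1, 2, 3, 3 -> L = 1.75 bits/symbol
```

For these dyadic probabilities L equals H(X) = 1.75 bits, so the efficiency H(X)/L(X) is 1.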

Shannon-Fano

  1. Sort symbols by probability in decreasing order
  2. Divide symbols into two groups with probabilities as close to equal as possible
  3. Assign bits: First group gets '0', second group gets '1'
  4. Repeat recursively for each group until each group has only one symbol
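
A rough sketch of the recursive split (probabilities illustrative; the split point here greedily balances the two group probabilities, which is one common formulation):

```python
def shannon_fano(symbols):
    """symbols: list of (symbol, probability) sorted by decreasing probability."""
    if len(symbols) == 1:
        return {symbols[0][0]: ""}
    total = sum(p for _, p in symbols)
    # Find the split point that makes the two group probabilities as equal as possible
    best_i, best_diff = 1, float('inf')
    running = 0.0
    for i in range(1, len(symbols)):
        running += symbols[i - 1][1]
        diff = abs(running - (total - running))
        if diff < best_diff:
            best_i, best_diff = i, diff
    codes = {}
    for sym, code in shannon_fano(symbols[:best_i]).items():
        codes[sym] = "0" + code           # first group gets '0'
    for sym, code in shannon_fano(symbols[best_i:]).items():
        codes[sym] = "1" + code           # second group gets '1'
    return codes

syms = [('A', 0.4), ('B', 0.3), ('C', 0.2), ('D', 0.1)]
print(shannon_fano(syms))  # e.g. {'A': '0', 'B': '10', 'C': '110', 'D': '111'}
```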

Shannon-Fano-Elias

  1. Calculate F̄(x) - Midpoint of Each Symbol's Interval

    F̄(A) = P(A)/2
    F̄(B) = P(A) + P(B)/2
    F̄(C) = P(A) + P(B) + P(C)/2
    ...

  2. Convert to Binary

  3. Calculate Codeword Lengths

    l(x) = ⌈log₂(1/P(x))⌉ + 1
    e.g. ⌈1.58⌉ = 2, because 2 is the smallest integer greater than or equal to 1.58

  4. Extract Codewords (First l(x) bits after decimal)
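
A sketch of steps 1-4; the symbol order and probabilities are illustrative:

```python
import math

def sfe_code(probs):
    """Shannon-Fano-Elias: codeword = first l(x) bits of the binary expansion of F-bar(x)."""
    codes = {}
    cum = 0.0
    for sym, p in probs.items():           # fixed symbol order defines the intervals
        fbar = cum + p / 2                  # midpoint of the symbol's interval
        length = math.ceil(math.log2(1 / p)) + 1
        # binary expansion of the fractional part, taking 'length' bits
        bits, frac = "", fbar
        for _ in range(length):
            frac *= 2
            if frac >= 1:
                bits += "1"
                frac -= 1
            else:
                bits += "0"
        codes[sym] = bits
        cum += p
    return codes

probs = {'A': 0.25, 'B': 0.5, 'C': 0.25}
print(sfe_code(probs))  # {'A': '001', 'B': '10', 'C': '111'}
```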

Decimal to Fractional Binary

  1. Take the decimal part
  2. Multiply by 2
  3. If result ≥ 1: write down "1", subtract 1 from result
  4. If result < 1: write down "0"
  5. Repeat with new fractional part

Fractional Binary to Decimal

  1. Write down the binary number
  2. Starting from the leftmost digit after the point, multiply each digit by 2 raised to the power of its position index (starting at -1, then -2, -3, ...)
  3. Sum all the products
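
A sketch of both conversions (the bit count for the forward direction is an arbitrary choice):

```python
def decimal_to_binary_fraction(x, n_bits=8):
    """Binary expansion of a fraction 0 <= x < 1: multiply by 2, emit the integer part."""
    bits = ""
    for _ in range(n_bits):
        x *= 2
        if x >= 1:
            bits += "1"
            x -= 1
        else:
            bits += "0"
    return bits

def binary_fraction_to_decimal(bits):
    """Sum of bit_i * 2^(-i), with i starting at 1 for the leftmost digit."""
    return sum(int(b) * 2 ** -(i + 1) for i, b in enumerate(bits))

print(decimal_to_binary_fraction(0.625, 4))  # '1010'
print(binary_fraction_to_decimal('101'))     # 0.625
```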

Arithmetic Coding

Encoding

  1. Set up cumulative probabilities
  2. Start with the interval [0, 1)
  3. For each symbol, narrow the interval using:
     new_low = low + (high-low) × cum_prob_start
     new_high = low + (high-low) × cum_prob_end
  4. Calculate the tag (midpoint) and determine the codeword length
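
A sketch of the interval-narrowing loop; the alphabet, probabilities, message, and the termination symbol '!' are all hypothetical, and the length rule used is one common choice:

```python
import math

probs = {'A': 0.5, 'B': 0.25, '!': 0.25}     # '!' as a hypothetical termination symbol

# Cumulative probability intervals [start, end) for each symbol
cum, intervals = 0.0, {}
for sym, p in probs.items():
    intervals[sym] = (cum, cum + p)
    cum += p

low, high = 0.0, 1.0
for sym in "AB!":                            # message to encode
    start, end = intervals[sym]
    width = high - low
    low, high = low + width * start, low + width * end

tag = (low + high) / 2                       # midpoint of the final interval
n_bits = math.ceil(math.log2(1 / (high - low))) + 1
print(low, high, tag, n_bits)                # 0.34375 0.375 0.359375 6
```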

Decoding

  1. Convert the binary codeword to a decimal value
  2. Set up cumulative probability intervals
  3. Decode iteratively:
     • Check which interval the value falls into and output that symbol
     • New range: [low, high), width = high - low
     • Rescale: new value = (value - low) / width
  4. Repeat until the termination symbol is reached
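
A matching sketch of the iterative decode-and-rescale loop, reusing the hypothetical model from the encoding sketch:

```python
probs = {'A': 0.5, 'B': 0.25, '!': 0.25}     # same hypothetical model as the encoder
cum, intervals = 0.0, {}
for sym, p in probs.items():
    intervals[sym] = (cum, cum + p)
    cum += p

value = 0.359375                             # decimal value of the received codeword (the tag above)
decoded = ""
while True:
    for sym, (low, high) in intervals.items():
        if low <= value < high:              # which interval the value falls into
            decoded += sym
            value = (value - low) / (high - low)   # rescale back into [0, 1)
            break
    if decoded.endswith('!'):                # stop at the termination symbol
        break
print(decoded)  # 'AB!'
```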

When to Stop

  1. Calculate the final interval width by multiplying the widths of all the symbol intervals.
  2. The interval width must be greater than 1/2ⁿ, where n is the number of bits in the codeword (e.g. 010 has n = 3).

Markov Chains

  • Transitions: p(j|i) means probability of going from state i to state j
  • Transition graph: 2 circles labeled 0 and 1, with arrows showing transitions
  • Transition matrix (row i lists the transitions out of state i): \(P = \begin{pmatrix} p(0|0) & p(1|0) \\ p(0|1) & p(1|1) \end{pmatrix}\)

Stationary Distribution

  1. Let π = [π₀ π₁] be the stationary distribution
  2. Need: π·P = π and π₀ + π₁ = 1
  3. Write the system of equations:
     π₀ = π₀p(0|0) + π₁p(0|1)
     π₁ = π₀p(1|0) + π₁p(1|1)
     π₀ + π₁ = 1
  4. Solve for π₀ and π₁
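
A small sketch for a two-state chain with illustrative transition probabilities; substituting π₁ = 1 - π₀ into the first equation gives the closed form π₀ = p(0|1) / (p(1|0) + p(0|1)):

```python
# p[i][j] = p(j|i): probability of going from state i to state j (illustrative values)
p = {0: {0: 0.9, 1: 0.1},
     1: {0: 0.3, 1: 0.7}}

# For a two-state chain, pi·P = pi together with pi0 + pi1 = 1 solves to:
pi0 = p[1][0] / (p[0][1] + p[1][0])
pi1 = 1 - pi0
print(pi0, pi1)                              # 0.75 0.25

# Check: pi0 = pi0*p(0|0) + pi1*p(0|1)
print(pi0 * p[0][0] + pi1 * p[1][0], pi0)    # both 0.75
```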

Entropy Rate

H(X) = -∑ᵢ∑ⱼ πᵢpᵢⱼlog₂(pᵢⱼ)

  • π₀p(0|0)log₂(p(0|0))
  • π₀p(1|0)log₂(p(1|0))
  • π₁p(0|1)log₂(p(0|1))
  • π₁p(1|1)log₂(p(1|1))
  • Sum all terms with negative sign
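
Continuing the same illustrative chain, a sketch of the entropy-rate sum:

```python
import math

p = {0: {0: 0.9, 1: 0.1},        # p[i][j] = p(j|i), same illustrative chain as above
     1: {0: 0.3, 1: 0.7}}
pi = {0: 0.75, 1: 0.25}          # stationary distribution from the previous sketch

# H = -sum_i sum_j pi_i * p(j|i) * log2 p(j|i)
H = -sum(pi[i] * p[i][j] * math.log2(p[i][j])
         for i in p for j in p[i] if p[i][j] > 0)
print(H)  # ≈ 0.572 bits per symbol
```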

Ternary Channel

Calculate Mutual Information

  1. List transition probabilities
  2. Calculate the output distribution P(Y) using the law of total probability:
     P(M) = P(M|a)P(a) + P(M|b)P(b) + P(M|c)P(c)
     P(N) = P(N|a)P(a) + P(N|b)P(b) + P(N|c)P(c)
  3. Calculate the output entropy: H(Y) = P(M)log₂(1/P(M)) + P(N)log₂(1/P(N))
  4. Calculate H(Y|X=x) for each input x, then take the weighted sum: H(Y|X) = Σₓ P(x)H(Y|X=x)
  5. Calculate mutual information: I(X;Y) = H(Y) - H(Y|X)
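
A sketch following these five steps; both the input distribution and the transition probabilities are made-up values, since the notes don't fix them:

```python
import math

def h(probs):
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

px = {'a': 1/3, 'b': 1/3, 'c': 1/3}                 # assumed input distribution
py_x = {'a': {'M': 0.8, 'N': 0.2},                  # assumed transition probabilities P(y|x)
        'b': {'M': 0.5, 'N': 0.5},
        'c': {'M': 0.1, 'N': 0.9}}

# Step 2: output distribution by the law of total probability
py = {y: sum(px[x] * py_x[x][y] for x in px) for y in ('M', 'N')}

# Steps 3-5: H(Y), H(Y|X), then I(X;Y) = H(Y) - H(Y|X)
h_y = h(py.values())
h_y_given_x = sum(px[x] * h(py_x[x].values()) for x in px)
print(h_y - h_y_given_x)   # ≈ 0.27 bits for these assumed numbers
```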

(7, 4) Hamming Code

  • In the three-circle Venn diagram, the 4 intersection regions hold the information bits (s₁, s₂, s₃, s₄)
  • The 3 outer regions (one per circle) hold the parity bits (p₁, p₂, p₃)
  • Codeword format: C = s₁ s₂ s₃ s₄ p₁ p₂ p₃

Encoding

  1. Place the information bits in the intersection regions
  2. Calculate each parity bit so that its circle contains an even number of 1s (even parity)

Decoding

  1. Extract the bits from the received codeword and place them in the diagram
  2. Check the parity of each circle
  3. The bit lying in exactly the circles whose parity check fails is in error; flip it to correct it

Calculating Parity Bits

  1. Count the number of 1s in your data bits.
  2. If the count is even, the parity bit is 0 for even parity, 1 for odd parity.
  3. If the count is odd, the parity bit is 1 for even parity, 0 for odd parity.
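
A sketch of the encode/decode flow. The assignment of data bits to parity circles below (p₁ covers s₁s₂s₃, p₂ covers s₂s₃s₄, p₃ covers s₁s₃s₄) is an assumed layout; the actual one depends on the Venn diagram used in the course:

```python
# Assumed layout: which data-bit positions (0-based s1..s4) each parity circle covers.
CIRCLES = {'p1': [0, 1, 2],   # p1's circle contains s1, s2, s3
           'p2': [1, 2, 3],   # p2's circle contains s2, s3, s4
           'p3': [0, 2, 3]}   # p3's circle contains s1, s3, s4

def encode(s):
    """s = [s1, s2, s3, s4] -> codeword s1 s2 s3 s4 p1 p2 p3 with even parity per circle."""
    return s + [sum(s[i] for i in CIRCLES[c]) % 2 for c in ('p1', 'p2', 'p3')]

def decode(r):
    """Single-error correction: flip the data bit lying in exactly the failed circles."""
    s, p = list(r[:4]), list(r[4:])
    failed = {c for k, c in enumerate(('p1', 'p2', 'p3'))
              if (sum(s[i] for i in CIRCLES[c]) + p[k]) % 2 == 1}
    for i in range(4):
        # A data bit is in error iff it lies in exactly the circles whose parity check failed.
        if failed and {c for c in CIRCLES if i in CIRCLES[c]} == failed:
            s[i] ^= 1
    return s

cw = encode([1, 0, 1, 1])
print(cw)                              # [1, 0, 1, 1, 0, 0, 1]
received = cw[:]
received[2] ^= 1                       # flip s3 in transit
print(decode(received))                # recovers [1, 0, 1, 1]
```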