Information Theory Exam Notes
Info Content
- h(x) = log₂(1/P(x))
- Measures uncertainty of an event in bits
- Higher probability = lower information content
Entropy
- H(X) = Σ P(x) log₂(1/P(x))
- Always non-negative
- Higher entropy = more uncertainty
- Measures expected info content of a source
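A minimal Python sketch of the entropy formula, assuming the probabilities form a complete distribution:

```python
# Entropy H(X) = sum over x of P(x) * log2(1/P(x)); zero-probability terms contribute 0.
from math import log2

def entropy(probs):
    return sum(p * log2(1 / p) for p in probs if p > 0)

print(entropy([0.5, 0.25, 0.25]))  # 1.5 bits
```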
Source Coding Theorem
- N outcomes from source X can be compressed into roughly NH(X) bits
Joint Entropy
- H(X,Y) = Σ P(x,y) log₂(1/P(x,y))
- H(X,Y) = H(X) + H(Y) only if X and Y are independent
- List every P(X=x, Y=y) that is non-zero.
- Calculate P(x,y) × log₂(1/P(x,y))
- Sum all terms
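A quick sketch of the joint-entropy recipe above; the joint probabilities are made-up example values:

```python
# Joint entropy from a table of non-zero joint probabilities P(X=x, Y=y).
from math import log2

joint = {
    (0, 0): 0.5,
    (0, 1): 0.25,
    (1, 0): 0.25,
}

# Sum P(x,y) * log2(1/P(x,y)) over all non-zero entries.
H_XY = sum(p * log2(1 / p) for p in joint.values() if p > 0)
print(H_XY)  # 1.5 bits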
Relative Entropy
- D(P||Q) = Σ P(x) log₂(P(x)/Q(x))
Marginal Probabilities
- P(X=0) = Sum of P for all cases where X=0, e.g. P(X=0,Y=0) + P(X=0,Y=1)
Conditional Entropy
- H(Y|X) < H(X|Y) means that knowing X reduces the uncertainty about Y more than knowing Y reduces the uncertainty about X; equivalently, X is a better predictor of Y than Y is of X
- Average uncertainty about Y when X is known
- H(Y|X) = ∑ P(x)H(Y|X=x) = -∑∑ P(x,y)log₂(P(y|x))
- H(Y|X) ≠ H(X|Y)
Using Conditional Distributions:
- Find marginal probability P(x) for each x
- Find conditional distributions P(y|x) for each x: P(y|x) = P(x,y) / P(x)
- Calculate H(Y|X=x) for each x: H(Y|X=x) = Σ P(y|x) log₂(1/P(y|x))
- Calculate weighted average: H(Y|X) = Σ P(x) × H(Y|X=x)
Using Joint Probabilities:
- For each (x,y) pair, calculate P(y|x) = P(x,y)/P(x)
- Calculate -P(x,y) × log₂(P(y|x)) for each pair
- Sum all terms
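Both H(Y|X) recipes above in one short Python sketch, using a hypothetical joint distribution:

```python
from math import log2
from collections import defaultdict

joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}   # P(X=x, Y=y)

# Marginals P(x)
p_x = defaultdict(float)
for (x, y), p in joint.items():
    p_x[x] += p

# Method 1: conditional distributions, H(Y|X) = sum_x P(x) H(Y|X=x)
H_YgX = 0.0
for x in p_x:
    cond = [p / p_x[x] for (xx, _), p in joint.items() if xx == x]   # P(y|x)
    H_YgX += p_x[x] * sum(q * log2(1 / q) for q in cond if q > 0)

# Method 2: joint probabilities, H(Y|X) = -sum_{x,y} P(x,y) log2 P(y|x)
H_YgX_2 = -sum(p * log2(p / p_x[x]) for (x, y), p in joint.items() if p > 0)

print(H_YgX, H_YgX_2)   # both ≈ 0.7219 bits
```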
Chain Rule for Entropy
- H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)
Mutual Information
- I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)
Bernoulli Distribution
- Entropy: Max at p=0.5 (H=1 bit), min at p=0 or 1 (H=0)
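A small sketch of the Bernoulli (binary) entropy function, illustrating the maximum at p = 0.5:

```python
from math import log2

def binary_entropy(p):
    # H(p) = p log2(1/p) + (1-p) log2(1/(1-p)), with H(0) = H(1) = 0
    if p in (0.0, 1.0):
        return 0.0
    return p * log2(1 / p) + (1 - p) * log2(1 / (1 - p))

print(binary_entropy(0.5))  # 1.0 bit (maximum)
print(binary_entropy(0.1))  # ≈ 0.469 bits
```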
Binary Symmetric Channel
| P(Y|X) | Y = 0 | Y = 1 |
|---|---|---|
| X = 0 | 0.9 | 0.1 |
| X = 1 | 0.1 | 0.9 |
Prefix-Free Codes
- A code is prefix-free if for any two different codewords, one is never the beginning (prefix) of another.
- Huffman codes are always prefix-free
Information Divergence
How much extra information (in bits) is needed on average to encode data from distribution P when using a code optimized for distribution Q
True Distance
A function d(x,y) is a true distance/metric if it satisfies three properties:
- Non-negativity: d(x,y) ≥ 0, with equality iff x = y
- Symmetry: d(x,y) = d(y,x)
- Triangle inequality: d(x,z) ≤ d(x,y) + d(y,z)
Triangle Inequality
For any three points A, B, C, the direct distance from A to C cannot exceed the sum of distances A→B→C
Relative Entropy (KL Divergence)
- Formula: \(D(P||Q) = \sum_x P(x) \log_2 \frac{P(x)}{Q(x)}\)
- Measures information divergence between two distributions P and Q
- Properties:
- Always non-negative: \(D(P||Q) \geq 0\)
- Not symmetric: \(D(P||Q) \neq D(Q||P)\)
- Not a true distance (doesn't satisfy triangle inequality)
- Conventions: \(0 \log \frac{0}{0} = 0\), if \(P(x) > 0\) and \(Q(x) = 0\) then \(D(P||Q) = \infty\)
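A sketch of D(P||Q) that follows the conventions above; the example distributions are arbitrary:

```python
from math import log2, inf

def kl_divergence(p, q):
    d = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue          # 0 * log(0/q) = 0 by convention
        if qx == 0:
            return inf        # P(x) > 0 but Q(x) = 0
        d += px * log2(px / qx)
    return d

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, q), kl_divergence(q, p))  # asymmetric: ≈ 0.737 vs ≈ 0.531
```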
Jensen's Inequality
- For convex function \(f\): \(E[f(X)] \geq f(E[X])\)
- For concave function \(f\): \(E[f(X)] \leq f(E[X])\)
- Used to prove KL divergence is non-negative (since \(\log\) is strictly concave)
Information Inequality (Proof)
Key steps using Jensen's inequality, where \(A = \{x : P(x) > 0\}\): \(-D(P||Q) = \sum_{x \in A} P(x) \log_2 \frac{Q(x)}{P(x)} \leq \log_2 \sum_{x \in A} P(x) \frac{Q(x)}{P(x)} = \log_2 \sum_{x \in A} Q(x) \leq \log_2 1 = 0\)
Maximum Entropy Theorem
- Formula: \(\log |X| - H(X) \geq 0\) where \(|X|\) is alphabet size
- Therefore: \(H(X) \leq \log |X|\)
- Maximum entropy is achieved by uniform distribution
- Equality holds if and only if \(P(x) = \frac{1}{|X|}\) for all \(x\)
Coding Algorithms
- Expected length: Sum of probabilities × codeword length
- Efficiency = H(X) / L(X)
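A tiny sketch of expected length and efficiency for an assumed example code (codewords 0, 10, 11):

```python
from math import log2

probs   = {'A': 0.5, 'B': 0.25, 'C': 0.25}
lengths = {'A': 1,   'B': 2,    'C': 2}    # codeword lengths for 0, 10, 11

L = sum(probs[s] * lengths[s] for s in probs)            # expected length L(X)
H = sum(p * log2(1 / p) for p in probs.values())         # source entropy H(X)
print(L, H, H / L)  # 1.5, 1.5, efficiency = 1.0
```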
Huffman Coding
- List all symbols with probabilities
- Combine the two least probable symbols into a new node
- Repeat until one node remains
- Assign codes by traversing the tree: left = 0, right = 1
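A rough Python sketch of the Huffman procedure above using `heapq`; tie-breaking between equal probabilities is arbitrary, so the exact 0/1 labels may differ from a hand-built tree:

```python
import heapq

def huffman(probs):
    # Heap entries are (probability, counter, tree); the counter avoids comparing trees on ties.
    heap = [(p, i, sym) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)    # two least probable nodes
        p2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, counter, (left, right)))
        counter += 1
    codes = {}
    def assign(node, prefix):
        if isinstance(node, tuple):          # internal node: left = 0, right = 1
            assign(node[0], prefix + "0")
            assign(node[1], prefix + "1")
        else:                                # leaf: record the codeword
            codes[node] = prefix or "0"
    assign(heap[0][2], "")
    return codes

print(huffman({'A': 0.5, 'B': 0.25, 'C': 0.125, 'D': 0.125}))
# e.g. {'A': '0', 'B': '10', 'C': '110', 'D': '111'}
```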
Shannon-Fano
- Sort symbols by probability in decreasing order
- Divide symbols into two groups with probabilities as close to equal as possible
- Assign bits: First group gets '0', second group gets '1'
- Repeat recursively for each group until each group has only one symbol
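A sketch of the Shannon-Fano recursion above; the rule used to pick the split point (minimizing the probability difference between the two groups) is one simple choice:

```python
def shannon_fano(symbols):
    # symbols: list of (symbol, probability), sorted in decreasing order of probability
    symbols = sorted(symbols, key=lambda sp: sp[1], reverse=True)
    codes = {s: "" for s, _ in symbols}

    def split(group):
        if len(group) <= 1:
            return
        total = sum(p for _, p in group)
        # Choose the cut that makes the two groups' probabilities as close to equal as possible.
        best_cut, best_diff, running = 1, float("inf"), 0.0
        for i in range(1, len(group)):
            running += group[i - 1][1]
            diff = abs(2 * running - total)
            if diff < best_diff:
                best_cut, best_diff = i, diff
        for s, _ in group[:best_cut]:
            codes[s] += "0"                 # first group gets '0'
        for s, _ in group[best_cut:]:
            codes[s] += "1"                 # second group gets '1'
        split(group[:best_cut])
        split(group[best_cut:])

    split(symbols)
    return codes

print(shannon_fano([('A', 0.4), ('B', 0.3), ('C', 0.2), ('D', 0.1)]))
# e.g. {'A': '0', 'B': '10', 'C': '110', 'D': '111'}
```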
Shannon-Fano-Elias
- Calculate F̄(x), the midpoint of each symbol's interval:
  F̄(A) = P(A)/2, F̄(B) = P(A) + P(B)/2, F̄(C) = P(A) + P(B) + P(C)/2, ...
- Convert F̄(x) to fractional binary
- Calculate codeword lengths: l(x) = ⌈log₂(1/P(x))⌉ + 1 (e.g. ⌈1.58⌉ = 2, because 2 is the smallest integer greater than 1.58)
- Extract codewords: the first l(x) bits after the binary point of F̄(x)
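A sketch of the Shannon-Fano-Elias steps above (midpoint, codeword length, fractional-binary expansion), assuming symbols are processed in the given order:

```python
from math import ceil, log2

def sfe_code(probs):
    codes, cumulative = {}, 0.0
    for symbol, p in probs.items():
        f_bar = cumulative + p / 2            # midpoint of this symbol's interval
        length = ceil(log2(1 / p)) + 1        # l(x)
        bits, frac = "", f_bar
        for _ in range(length):               # first l(x) bits of F̄(x) in binary
            frac *= 2
            if frac >= 1:
                bits += "1"
                frac -= 1
            else:
                bits += "0"
        codes[symbol] = bits
        cumulative += p
    return codes

print(sfe_code({'A': 0.5, 'B': 0.25, 'C': 0.25}))
# {'A': '01', 'B': '101', 'C': '111'}
```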
Decimal to Fractional Binary
- Take the decimal part
- Multiply by 2
- If result ≥ 1: write down "1", subtract 1 from result
- If result < 1: write down "0"
- Repeat with new fractional part
Fractional Binary to Decimal
- Write down the binary number
- Starting from the leftmost digit after the binary point, multiply each digit by 2⁻ᵏ, where k is the digit's position (k = 1, 2, 3, …)
- Sum all the products
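Both conversions above as small helper functions (fractional part only):

```python
def decimal_to_binary_fraction(x, bits):
    out = ""
    for _ in range(bits):
        x *= 2                 # multiply the fractional part by 2
        if x >= 1:
            out += "1"         # result >= 1: write 1 ...
            x -= 1             # ... and subtract 1
        else:
            out += "0"         # result < 1: write 0
    return out

def binary_fraction_to_decimal(bits):
    # digit k (0-indexed from the left) contributes digit * 2^-(k+1)
    return sum(int(b) * 2 ** -(k + 1) for k, b in enumerate(bits))

print(decimal_to_binary_fraction(0.625, 3))   # '101'
print(binary_fraction_to_decimal("101"))      # 0.625
```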
Arithmetic Coding
Encoding
- Set up cumulative probabilities
- Start with [0, 1)
- For each symbol, narrow the interval using:
- new_low = low + (high-low) × cum_prob_start
- new_high = low + (high-low) × cum_prob_end
- Calculate the tag (midpoint) and determine codeword length
Decoding
- Convert binary to decimal
- Set up cumulative probability intervals
- Decode iteratively:
- Check which interval the value falls into
- New range: [low, high), width = high - low
- Rescale: (value - low) / width = new value
- Repeat until reach termination symbol
When to Stop
- Calculate the final interval width by multiplying the widths of all the symbol intervals.
- The interval width must be greater than 1/2ⁿ, where n is the number of bits in the codeword (e.g. 010 has n = 3 bits).
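A sketch of the arithmetic-coding encoder above for a hypothetical 3-symbol source, picking a codeword length that guarantees 1/2ⁿ is smaller than the final interval width (the stopping rule above):

```python
from math import ceil, log2

probs = {'A': 0.5, 'B': 0.25, 'C': 0.25}        # 'C' plays the role of the terminator here (assumption)

# Cumulative probability intervals [cum_prob_start, cum_prob_end) for each symbol
cum, start = {}, 0.0
for sym, p in probs.items():
    cum[sym] = (start, start + p)
    start += p

low, high = 0.0, 1.0
for symbol in "ABC":                             # message to encode, ending with the terminator
    lo_frac, hi_frac = cum[symbol]
    low, high = low + (high - low) * lo_frac, low + (high - low) * hi_frac

width = high - low
tag = (low + high) / 2                           # midpoint of the final interval
n = ceil(log2(1 / width)) + 1                    # guarantees 1/2**n < width

bits, frac = "", tag                             # fractional-binary expansion of the tag
for _ in range(n):
    frac *= 2
    if frac >= 1:
        bits += "1"
        frac -= 1
    else:
        bits += "0"
print(low, high, tag, bits)                      # [0.34375, 0.375), tag 0.359375 -> '010111'
```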
Markov Chains
- Transitions: p(j|i) means probability of going from state i to state j
- Transition graph: 2 circles labeled 0 and 1, with arrows showing transitions
- Transition matrix (row i lists the transition probabilities out of state i):
  P = [p(0|0) p(1|0)]
      [p(0|1) p(1|1)]
Stationary Distribution
- Let π = [π₀ π₁] be the stationary distribution
- Need: π·P = π and π₀ + π₁ = 1
- Write the system of equations:
  π₀ = π₀p(0|0) + π₁p(0|1)
  π₁ = π₀p(1|0) + π₁p(1|1)
  π₀ + π₁ = 1
- Solve for π₀ and π₁
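A sketch of the stationary-distribution calculation, assuming numpy is available; the transition probabilities are made-up example values:

```python
import numpy as np

P = np.array([[0.9, 0.1],     # row i = [p(0|i), p(1|i)]
              [0.2, 0.8]])

# Solve pi · P = pi, i.e. (P^T - I) pi = 0, together with pi0 + pi1 = 1 (least squares).
A = np.vstack([P.T - np.eye(2), np.ones(2)])
b = np.array([0.0, 0.0, 1.0])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print(pi)  # ≈ [0.6667, 0.3333], i.e. [2/3, 1/3]
```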
Entropy Rate
H(X) = -∑ᵢ∑ⱼ πᵢpᵢⱼlog₂(pᵢⱼ)
- π₀p(0|0)log₂(p(0|0))
- π₀p(1|0)log₂(p(1|0))
- π₁p(0|1)log₂(p(0|1))
- π₁p(1|1)log₂(p(1|1))
- Sum all terms with negative sign
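The entropy-rate formula applied to the same hypothetical chain as the previous sketch:

```python
from math import log2

P = [[0.9, 0.1],              # row i = [p(0|i), p(1|i)]
     [0.2, 0.8]]
pi = [2 / 3, 1 / 3]           # stationary distribution from the previous sketch

# H = -sum_i sum_j pi_i * p(j|i) * log2 p(j|i)
H_rate = -sum(pi[i] * P[i][j] * log2(P[i][j])
              for i in range(2) for j in range(2) if P[i][j] > 0)
print(H_rate)  # ≈ 0.553 bits per symbol
```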
Ternary Channel
Calculate Mutual Information
- List transition probabilities
- Calculate output distribution P(Y) using the law of total probability:
  P(M) = P(M|a)P(a) + P(M|b)P(b) + P(M|c)P(c)
  P(N) = P(N|a)P(a) + P(N|b)P(b) + P(N|c)P(c)
- Calculate entropy H(Y): H(Y) = P(M)log₂(1/P(M)) + P(N)log₂(1/P(N))
- Calculate H(Y|X=x) for each input x then calculate the sum: H(Y|X) = Σₓ P(x)H(Y|X=x)
- Calculate mutual information: I(X;Y) = H(Y) - H(Y|X)
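The channel recipe above as a short sketch; the input distribution and transition probabilities are made-up, with inputs a, b, c and outputs labelled M and N as in the notes:

```python
from math import log2

p_x = {'a': 0.5, 'b': 0.25, 'c': 0.25}                     # input distribution P(x)
p_y_given_x = {'a': {'M': 0.9, 'N': 0.1},                  # transition probabilities P(y|x)
               'b': {'M': 0.5, 'N': 0.5},
               'c': {'M': 0.1, 'N': 0.9}}

def H(dist):
    return sum(p * log2(1 / p) for p in dist.values() if p > 0)

# Output distribution via the law of total probability
p_y = {y: sum(p_x[x] * p_y_given_x[x][y] for x in p_x) for y in ('M', 'N')}

H_Y = H(p_y)
H_Y_given_X = sum(p_x[x] * H(p_y_given_x[x]) for x in p_x)
I_XY = H_Y - H_Y_given_X
print(p_y, H_Y, H_Y_given_X, I_XY)   # I(X;Y) ≈ 0.369 bits
```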
(7, 4) Hamming Code
- In the three-circle Venn diagram, the 4 intersection regions hold the information bits (s₁, s₂, s₃, s₄)
- 3 outer regions hold the parity bits (p₁, p₂, p₃)
- Codeword format: C = s₁ s₂ s₃ s₄ p₁ p₂ p₃
Encoding
- Place information bits in intersection regions
- Calculate even parity
Decoding
- Extract the information and parity bits from the received codeword
- Recompute the parity of each circle; flip the single bit that lies in every failed circle and in no satisfied circle
Calculating Parity Bits
- Count the number of 1s in your data bits.
- If the count is even, the parity bit is 0 for even parity, 1 for odd parity.
- If the count is odd, the parity bit is 1 for even parity, 0 for odd parity.
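A sketch of (7,4) Hamming encoding and single-error decoding with even parity. Which three information bits each parity bit covers is an assumption here (one common convention); the course's Venn diagram may assign them differently:

```python
COVER = [(0, 1, 2), (1, 2, 3), (0, 2, 3)]    # which of s1..s4 each parity bit covers (assumption)

def encode(s):                                # s = [s1, s2, s3, s4]
    parity = [sum(s[i] for i in c) % 2 for c in COVER]   # even parity over each circle
    return s + parity                         # codeword s1 s2 s3 s4 p1 p2 p3

def decode(r):                                # r = received 7-bit list, at most one bit flipped
    s, p = r[:4], r[4:]
    # A check fails if the recomputed parity disagrees with the received parity bit.
    failed = [k for k, c in enumerate(COVER) if sum(s[i] for i in c) % 2 != p[k]]
    if failed:
        # The erroneous information bit is the one inside every failed circle and no passing circle;
        # if no such bit exists, the error hit a parity bit and the information bits are already correct.
        common = set(COVER[failed[0]])
        for k in failed[1:]:
            common &= set(COVER[k])
        for k in range(len(COVER)):
            if k not in failed:
                common -= set(COVER[k])
        if len(common) == 1:
            s[common.pop()] ^= 1              # flip the identified bit
    return s

c = encode([1, 0, 1, 1])
c[2] ^= 1                                     # introduce a single bit error
print(decode(c))                              # [1, 0, 1, 1]
```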