14. Huffman Codes
14.1 Codes
14.1.1 Binary Codes
Fixed-length binary codes
Widely used codes such as ASCII are fixed-length binary codes, in which every symbol is encoded with the same number of bits. For a four-symbol alphabet, such a code is: A 00; B 01; C 10; D 11.
Variable-length codes
A variable-length code allows codewords of different lengths, for example: A 0; B 01; C 10; D 1.
However, this representation is ambiguous: in an encoded message all bits are consecutive, with no separators between codewords. For example, the string 01 could be decoded either as B or as AD. To resolve this problem, we need a new coding pattern.
14.1.3 Prefix-Free Codes
First, where does the ambiguity come from? It comes from one codeword being a prefix of another (in the code above, the encoding of A is a prefix of that of B). A code is prefix-free if, for every pair of distinct symbols $a, b \in \Sigma$, the encoding of $a$ is not a prefix of that of $b$.
Thus, we propose the following code: A 0; B 10; C 110; D 111.
This code is prefix-free (no codeword is a prefix of any other), so it eliminates the ambiguity: every encoded string decodes in exactly one way.
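To see why prefix-freeness removes the ambiguity, decode a bit string left to right: the moment the bits read so far match a codeword, that match is the only possibility. A minimal Python sketch (the code table is from the text; the sample bit string is illustrative):

```python
# The prefix-free code from the text: A 0; B 10; C 110; D 111.
CODE = {"A": "0", "B": "10", "C": "110", "D": "111"}
DECODE = {bits: sym for sym, bits in CODE.items()}

def decode(bits):
    """Scan left to right; emit a symbol as soon as the bits read match a codeword."""
    out, current = [], ""
    for b in bits:
        current += b
        if current in DECODE:          # prefix-freeness guarantees a unique match
            out.append(DECODE[current])
            current = ""
    return "".join(out)

print(decode("0110111"))  # -> "ACD"
```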
14.1.4 The Benefits of Prefix-Free Codes
Variable-length prefix-free codes can be more efficient than fixed-length codes when the symbols have very different frequencies.
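For instance, suppose the four symbols occur with frequencies 60%, 25%, 10%, and 5% (these numbers are an illustrative assumption, not from the text). Then the prefix-free code above beats the 2-bit fixed-length code on average:

```python
# Hypothetical symbol frequencies (illustrative assumption).
freqs = {"A": 0.60, "B": 0.25, "C": 0.10, "D": 0.05}
fixed_len = {s: 2 for s in freqs}             # fixed-length code: A 00; B 01; C 10; D 11
var_len = {"A": 1, "B": 2, "C": 3, "D": 3}    # prefix-free code: A 0; B 10; C 110; D 111

avg_fixed = sum(freqs[s] * fixed_len[s] for s in freqs)
avg_var = sum(freqs[s] * var_len[s] for s in freqs)
print(avg_fixed, avg_var)  # about 2.0 vs 1.55 bits per symbol on average
```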
14.1.5 Problem Definition
Problem: Optimal Prefix-Free Codes
Input: A nonnegative frequency $p_a$ for each symbol $a$ of an alphabet $\Sigma$ of size $n \ge 2$.
Output: The prefix-free binary code with minimum-possible average encoding length:
$$\sum_{a\in \Sigma} p_a \cdot (\text{number of bits used to encode } a)$$
14.2 Codes as Trees
14.2.1 Prefix-free Variable-length Code
More generally, every binary code can be represented as a binary tree whose left and right child edges are labeled 0 and 1, respectively. The number of edges on the path from the root to a node equals the number of bits used to encode the corresponding symbol, so we have the following proposition.
Proposition (Encoding Length and Tree Depth) For every binary code, the encoding length in bits of a symbol $a \in \Sigma$ equals the depth of the node with label $a$ in the corresponding tree.
The encoding of a symbol $a$ is a prefix of that of another symbol $b$ if and only if the node labeled $a$ is an ancestor of the node labeled $b$. Since no leaf is an ancestor of another, a tree whose labels appear only at the leaves defines a prefix-free code.
14.2.3 Problem Definition
With the help of the coding tree, we can define $L(T,p)$ as the average depth of a leaf in $T$, where each leaf's contribution is weighted by the frequency of its label:
$$L(T,p) = \sum_{a\in \Sigma} p_a \cdot (\text{depth of the leaf labeled } a \text{ in } T)$$
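As a sanity check, $L(T,p)$ can be computed by a simple recursion over the tree. In the sketch below, a tree is a nested pair (internal node) or a symbol string (leaf); the frequencies are illustrative assumptions, and the tree shown corresponds to the prefix-free code A 0; B 10; C 110; D 111:

```python
# Average leaf depth L(T, p) of a Sigma-tree, with trees as nested pairs.
def avg_leaf_depth(tree, p, depth=0):
    if isinstance(tree, str):          # leaf labeled with a symbol
        return p[tree] * depth
    left, right = tree                 # internal node with two subtrees
    return (avg_leaf_depth(left, p, depth + 1)
            + avg_leaf_depth(right, p, depth + 1))

# Tree for the code A 0; B 10; C 110; D 111; hypothetical frequencies.
T = ("A", ("B", ("C", "D")))
p = {"A": 0.60, "B": 0.25, "C": 0.10, "D": 0.05}
print(avg_leaf_depth(T, p))  # about 1.55
```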
Thus, the problem will be defined as:
Problem: Optimal Prefix-free Codes
Input: A nonnegative frequency $p_a$ for each symbol $a$ of an alphabet $\Sigma$ of size $n \ge 2$.
Output: A $\Sigma$-tree with minimum-possible average leaf depth.
14.3 Huffman’s Greedy Algorithm
14.3.1 Building Trees Through Successive Mergers
The big idea, which dates back to 1951, is to tackle the optimal prefix-free code problem with a bottom-up approach: build the tree through successive mergers of smaller trees.
For example, with four symbols A, B, C, D, the tree can be built by a sequence of mergers: first merge C and D, then merge B with the C-D tree, and finally merge A with the B-C-D tree.
In general, Huffman’s greedy algorithm maintains a forest, which is a collection of one or more binary trees.
Thus, Huffman’s greedy algorithm performs $n - 1$ mergers before halting.
14.3.2 Huffman’s Greedy Criterion
Each merger increments the depths of all the leaves in the two participating trees.
Every merger therefore increases the objective function that we want to minimize: merging trees $T_1$ and $T_2$ increases the average leaf depth by $P(T_1) + P(T_2)$, the sum of the frequencies of the symbols in the two participating trees.
Every iteration of Huffman’s greedy algorithm myopically performs the merge that least increases this objective function.
Huffman’s Greedy Criterion: Merge the pair of trees that causes the minimum-possible increase in the average leaf depth.
14.3.3 Pseudocode
Huffman

# Initialization
for a in Sigma:
    Ta := tree containing one node, labeled "a"
    P(Ta) := pa
F := {Ta : a in Sigma}   # invariant

# Main loop
while F contains at least two trees:
    T1 := argmin_{T in F} P(T)             # min frequency sum
    T2 := argmin_{T in F, T != T1} P(T)    # second-smallest
    remove T1 and T2 from F
    # roots of T1, T2 become left, right children of a new internal node
    T3 := merger of T1 and T2
    P(T3) := P(T1) + P(T2)
    add T3 to F
return the unique tree in F
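The pseudocode translates almost directly into Python. The sketch below uses the standard-library heapq to extract the two trees with the smallest frequency sums, represents a merged tree as a nested pair, and then reads the codewords off the final tree; the frequencies in the usage example are illustrative:

```python
import heapq

def huffman(freqs):
    """Huffman's greedy algorithm: {symbol: frequency} -> {symbol: codeword}.
    A min-heap plays the role of the forest F; the counter breaks frequency
    ties so the heap never has to compare two trees directly."""
    heap = [(p, i, sym) for i, (sym, p) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) >= 2:
        p1, _, t1 = heapq.heappop(heap)   # tree with the smallest frequency sum
        p2, _, t2 = heapq.heappop(heap)   # tree with the second-smallest sum
        heapq.heappush(heap, (p1 + p2, counter, (t1, t2)))  # merge into one tree
        counter += 1
    codes = {}
    def assign(tree, prefix):
        if isinstance(tree, tuple):        # internal node: recurse with 0/1 labels
            assign(tree[0], prefix + "0")
            assign(tree[1], prefix + "1")
        else:                              # leaf labeled with a symbol
            codes[tree] = prefix or "0"    # guard for a one-symbol alphabet
    assign(heap[0][2], "")
    return codes

codes = huffman({"A": 0.60, "B": 0.25, "C": 0.10, "D": 0.05})
print(codes)  # codeword lengths 1, 2, 3, 3 for A, B, C, D
```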
14.3.6 Running Time
A straightforward implementation of the Huffman algorithm runs in $O(n^2)$ time.
Each iteration is responsible for identifying the two current trees with the smallest sums of symbol frequencies, which can be done by exhaustive search over the $O(n)$ trees of $F$.
However, a heap data structure speeds up these repeated minimum computations, so that each takes logarithmic rather than linear time and the overall running time drops to $O(n \log n)$.
Moreover, the algorithm can be implemented by first sorting the symbols in order of increasing frequency and then performing only a linear amount of additional processing; when the frequencies permit it, RadixSort handles the sorting step in linear time. This approach eschews heaps in favor of two queues.
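A sketch of the two-queue variant, assuming the symbols arrive already sorted by increasing frequency (the symbol names and frequencies in the usage example are illustrative). The key observation is that merged trees are produced in nondecreasing order of frequency sum, so both queues stay sorted and each minimum sits at a queue front:

```python
from collections import deque

def huffman_two_queues(sorted_items):
    """Huffman via two FIFO queues; input is [(symbol, frequency)] sorted by
    increasing frequency. Leaves wait in q1; merged trees are appended to q2
    in nondecreasing frequency order, so each pop is an O(1) front comparison."""
    q1 = deque((p, sym) for sym, p in sorted_items)   # (frequency sum, tree)
    q2 = deque()

    def pop_min():
        # Take from whichever queue has the smaller frequency sum at its front.
        if not q2 or (q1 and q1[0][0] <= q2[0][0]):
            return q1.popleft()
        return q2.popleft()

    while len(q1) + len(q2) > 1:
        p1, t1 = pop_min()
        p2, t2 = pop_min()
        q2.append((p1 + p2, (t1, t2)))    # merged sums only ever grow
    return (q2 or q1)[0][1]               # the unique remaining tree

tree = huffman_two_queues([("D", 0.05), ("C", 0.10), ("B", 0.25), ("A", 0.60)])
print(tree)
```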