Algorithm 14 – Huffman Coding


14. Huffman Codes

14.1 Codes

14.1.1 Binary Codes

Fixed-length binary codes

Widely used encodings such as ASCII are fixed-length binary codes, in which every symbol is encoded with the same number of bits. For a four-symbol alphabet, such a code can be depicted as follows: A 00; B 01; C 10; D 11.
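As a quick illustration, here is a minimal Python sketch (the function names are my own) of encoding and decoding with this fixed-length code; the decoder simply reads the bit string in two-bit chunks, so no ambiguity can arise.

# Fixed-length (2-bit) code for the alphabet {A, B, C, D}.
FIXED = {"A": "00", "B": "01", "C": "10", "D": "11"}

def encode(text, code=FIXED):
    return "".join(code[ch] for ch in text)

def decode_fixed(bits, code=FIXED, width=2):
    inverse = {w: ch for ch, w in code.items()}
    # Read the bit string in fixed-size chunks; every chunk is exactly one symbol.
    return "".join(inverse[bits[i:i + width]] for i in range(0, len(bits), width))

assert decode_fixed(encode("BAD")) == "BAD"   # 01 00 11 -> BAD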

14.1.2 Variable-Length Codes

A variable-length code allows different symbols to use different numbers of bits; one such pattern is depicted as: A 0; B 01; C 10; D 1.

However, this representation is ambiguous: in a bit stream all bits are consecutive, with no separators between the encodings of successive symbols. To resolve this problem, we have to propose a new coding pattern.
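To see the ambiguity concretely, the following sketch (my own illustration, not from the original) enumerates every way to parse a bit string under the code above; even the three-bit string 001 already has two distinct decodings.

VARIABLE = {"A": "0", "B": "01", "C": "10", "D": "1"}

def all_decodings(bits, code=VARIABLE):
    # Return every symbol sequence whose encoding is exactly `bits`.
    if bits == "":
        return [""]
    results = []
    for symbol, word in code.items():
        if bits.startswith(word):
            results += [symbol + rest for rest in all_decodings(bits[len(word):], code)]
    return results

print(all_decodings("001"))   # ['AAD', 'AB'] -- two valid parses, so decoding is ambiguous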

14.1.3 Prefix-Free Codes

First, we need a basic understanding of where the ambiguity comes from: one symbol's encoding being a prefix of another's.

A code is prefix-free if, for every pair of distinct symbols $a, b \in \Sigma$, the encoding of $a$ is not a prefix of that of $b$.

Thus, we propose a coding pattern: A 0; B 10; C 110; D 111.

This pattern eliminates the ambiguity: since no codeword is a prefix of any other, a bit stream can be decoded symbol by symbol without separators.
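With a prefix-free code, decoding is unambiguous and greedy: scan left to right and emit a symbol the moment the accumulated bits match a codeword. A minimal sketch (identifiers are my own):

PREFIX_FREE = {"A": "0", "B": "10", "C": "110", "D": "111"}

def decode_prefix_free(bits, code=PREFIX_FREE):
    inverse = {w: ch for ch, w in code.items()}
    out, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in inverse:      # at most one codeword can match, by prefix-freeness
            out.append(inverse[buffer])
            buffer = ""
    assert buffer == "", "input ended in the middle of a codeword"
    return "".join(out)

print(decode_prefix_free("0110111"))   # ACD -- parsed as 0 | 110 | 111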

14.1.4 The Benefits of Prefix-Free Codes

Variable-length prefix-free codes can be more efficient than fixed-length codes when the symbols have very different frequencies.
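For instance, suppose the frequencies of A, B, C, and D are 60%, 25%, 10%, and 5% (illustrative numbers). The fixed-length code always spends 2 bits per symbol, while the prefix-free code above averages only

$$0.60 \cdot 1 + 0.25 \cdot 2 + 0.10 \cdot 3 + 0.05 \cdot 3 = 1.55$$

bits per symbol, a saving of 22.5%.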

14.1.5 Problem Definition

Problem: Optimal Prefix-Free Codes

Input: A nonnegative frequency $p_a$ for each symbol $a$ of an alphabet $\Sigma$ of size $n \ge 2$.

Output: The prefix-free binary code with minimum-possible average encoding length:

$$\sum_{a\in \Sigma} p_a \cdot (\text{number of bits used to encode } a)$$
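In code, this objective is a one-line weighted sum. A sketch (the frequencies are the illustrative ones from above):

def average_encoding_length(code, freq):
    # The objective: sum of p_a * (number of bits used to encode a).
    return sum(freq[a] * len(word) for a, word in code.items())

freq = {"A": 0.60, "B": 0.25, "C": 0.10, "D": 0.05}   # illustrative frequencies
print(average_encoding_length({"A": "00", "B": "01", "C": "10", "D": "11"}, freq))    # 2.0
print(average_encoding_length({"A": "0", "B": "10", "C": "110", "D": "111"}, freq))   # 1.55 (up to float rounding)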

14.2 Codes as Trees

14.2.1 Prefix-Free Variable-Length Codes

More generally, every binary code can be represented as a binary tree whose left and right child edges are labeled 0 and 1, respectively. The number of edges in a root-to-node path then equals the number of bits used to encode the corresponding symbol, so we have the following proposition.

Proposition (Encoding Length and Tree Depth) For every binary code, the encoding length in bits of a symbol $a \in \Sigma$ equals the depth of the node with label $a$ in the corresponding tree.

The encoding of a symbol $a$ is a prefix of that of another symbol $b$ if and only if the node labeled $a$ is an ancestor of the node labeled $b$. Since no leaf can be an ancestor of another leaf, a tree with labels only at the leaves defines a prefix-free code.
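This correspondence is easy to make concrete. In the sketch below (my own representation, not prescribed by the text), a leaf carries a symbol and an internal node has two children; reading off root-to-leaf paths, with 0 for left edges and 1 for right edges, recovers the codewords, whose lengths equal the leaf depths.

class Node:
    def __init__(self, symbol=None, left=None, right=None):
        self.symbol, self.left, self.right = symbol, left, right   # leaf iff symbol is not None

def codewords(node, path=""):
    # Collect the 0/1 root-to-leaf path of every labeled leaf.
    if node.symbol is not None:
        return {node.symbol: path}
    return {**codewords(node.left, path + "0"), **codewords(node.right, path + "1")}

# The tree for the prefix-free code A 0; B 10; C 110; D 111.
tree = Node(left=Node("A"),
            right=Node(left=Node("B"),
                       right=Node(left=Node("C"), right=Node("D"))))
print(codewords(tree))   # {'A': '0', 'B': '10', 'C': '110', 'D': '111'}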

14.2.3 Problem Definition

With the help of code trees, we can define $L(T,p)$ as the average depth of a leaf in $T$, with the contribution of each leaf weighted according to the frequency of its label:

$$L(T,p) = \sum_{a\in \Sigma} p_a \cdot (\text{depth of the leaf labeled } a \text{ in } T)$$
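Reusing the Node class and example tree from the sketch in Section 14.2.1, $L(T,p)$ can be computed by a single traversal (again a sketch with my own helper name):

def average_leaf_depth(node, freq, depth=0):
    # L(T, p): each leaf contributes p_a times its depth.
    if node.symbol is not None:
        return freq[node.symbol] * depth
    return (average_leaf_depth(node.left, freq, depth + 1)
            + average_leaf_depth(node.right, freq, depth + 1))

# With `tree` from the previous sketch: 0.60*1 + 0.25*2 + 0.10*3 + 0.05*3 = 1.55.
print(average_leaf_depth(tree, {"A": 0.60, "B": 0.25, "C": 0.10, "D": 0.05}))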

Thus, the problem can be restated as:

Problem: Optimal Prefix-Free Codes

Input: A nonnegative frequency $p_a$ for each symbol $a$ of an alphabet $\Sigma$ of size $n \ge 2$.

Output: A $\Sigma$-tree with minimum-possible average leaf depth.

14.3 Huffman’s Greedy Algorithm

14.3.1 Building Trees Through Successive Mergers

The big idea, which dates back to 1951, is to tackle the optimal prefix-free code problem with a bottom-up approach: build the tree through successive mergers of smaller trees.

For example, with the four-symbol alphabet above, the first two mergers join C with D and then B with the C-D tree. The final merger then joins A with the B-C-D tree, producing (up to the labeling of left and right children) exactly the tree of the prefix-free code A 0; B 10; C 110; D 111.

In general, Huffman’s greedy algorithm maintains a forest, which is a collection of one or more binary trees.

Since each merger decreases the number of trees in the forest by one, and the forest starts with $n$ single-node trees, Huffman’s greedy algorithm performs $n - 1$ mergers before halting.

14.3.2 Huffman’s Greedy Criterion

Each merger increments the depths of all the leaves in the two participating trees, so every merger increases the objective function that we want to minimize (the average leaf depth). Every iteration of Huffman’s greedy algorithm therefore myopically performs the merger that least increases this objective function.

Huffman’s Greedy Criterion: Merge the pair of trees that causes the minimum-possible increase in the average leaf depth.

14.3.3 Pseudocode


Huffman

# Initialization
for each a in Sigma:
    T_a := tree containing one node, labeled "a"
    P(T_a) := p_a
F := the set of all trees T_a   # invariant: F is the forest of trees built so far

# Main loop
while F contains at least two trees:
    T1 := argmin_{T in F} P(T)             # tree with the smallest frequency sum
    T2 := argmin_{T in F, T != T1} P(T)    # tree with the second-smallest frequency sum
    remove T1 and T2 from F
    # roots of T1 and T2 become the left and right children of a new internal node
    T3 := merger of T1 and T2
    P(T3) := P(T1) + P(T2)
    add T3 to F

return the unique tree in F
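A direct Python translation of this pseudocode might look as follows (a sketch with my own names; it represents each tree simply by the partial code of its leaves, and it searches for the two smallest trees exhaustively, exactly as the pseudocode does):

def huffman_naive(freq):
    # freq: dict from symbol to nonnegative frequency; requires at least two symbols.
    # Each forest entry is (frequency sum, partial code for that tree's leaves).
    forest = [(p, {a: ""}) for a, p in freq.items()]
    while len(forest) >= 2:
        forest.sort(key=lambda entry: entry[0])   # exhaustive search for the two smallest
        (p1, c1), (p2, c2) = forest[0], forest[1]
        forest = forest[2:]
        # Merging prepends one bit: 0 for the first tree's leaves, 1 for the second's.
        merged = {a: "0" + w for a, w in c1.items()}
        merged.update({a: "1" + w for a, w in c2.items()})
        forest.append((p1 + p2, merged))
    return forest[0][1]

print(huffman_naive({"A": 0.60, "B": 0.25, "C": 0.10, "D": 0.05}))
# {'D': '000', 'C': '001', 'B': '01', 'A': '1'} -- average length 1.55, optimal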

14.3.6 Running Time

A straightforward implementation of the Huffman algorithm runs in $O(n^2)$ time.

Each iteration is responsible for identifying the two current trees with the smallest sums of symbol frequencies, which can be done by exhaustive search over the $O(n)$ trees in $F$.

However, the heap data structure speeds up the repeated minimum computations so that each one takes logarithmic rather than linear time, improving the overall running time to $O(n \log n)$.
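A heap-based version might look like this (a sketch, not a reference implementation; heapq performs the two extract-mins per iteration in logarithmic time):

import heapq

def huffman_heap(freq):
    # Heap entries: (frequency sum, tie-breaking id, partial code). O(n log n) overall.
    heap = [(p, i, {a: ""}) for i, (a, p) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)   # unique ids keep tuple comparisons away from the dicts
    while len(heap) >= 2:
        p1, _, c1 = heapq.heappop(heap)   # extract-min in O(log n)
        p2, _, c2 = heapq.heappop(heap)   # second extract-min in O(log n)
        merged = {a: "0" + w for a, w in c1.items()}
        merged.update({a: "1" + w for a, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, next_id, merged))
        next_id += 1
    return heap[0][2]

print(huffman_heap({"A": 0.60, "B": 0.25, "C": 0.10, "D": 0.05}))   # same optimal code as before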

Moreover, the algorithm can be implemented by first sorting the symbols in order of increasing frequency and then performing only a linear amount of additional processing. This calls for a sort such as RadixSort, which runs in linear time on suitable (e.g., small-integer) frequencies, and it lets us eschew heaps in favor of two queues.
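One possible rendering of the two-queue idea (a sketch under the stated assumptions; Python's sorted() stands in for the linear-time RadixSort mentioned above):

from collections import deque

def huffman_two_queues(freq):
    # Queue 1: single-symbol trees in increasing frequency order (from the initial sort).
    leaves = deque((p, {a: ""}) for a, p in sorted(freq.items(), key=lambda kv: kv[1]))
    # Queue 2: merged trees. Key fact: mergers produce trees in nondecreasing weight
    # order, so this queue stays sorted with no further sorting.
    merged_trees = deque()

    def pop_smallest():
        # Both queues are sorted, so the lightest tree sits at one of the two fronts.
        if not merged_trees or (leaves and leaves[0][0] <= merged_trees[0][0]):
            return leaves.popleft()
        return merged_trees.popleft()

    remaining = len(leaves)
    while remaining >= 2:
        p1, c1 = pop_smallest()
        p2, c2 = pop_smallest()
        code = {a: "0" + w for a, w in c1.items()}
        code.update({a: "1" + w for a, w in c2.items()})
        merged_trees.append((p1 + p2, code))
        remaining -= 1
    return pop_smallest()[1]

print(huffman_two_queues({"A": 0.60, "B": 0.25, "C": 0.10, "D": 0.05}))   # optimal again

After the initial sort, each of the $n - 1$ iterations does only constant work, so the post-sorting processing is linear, as claimed.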
