Tutorial 3 for CSC 310

This week's tutorial is aimed at the idea that by coding in blocks,
one can compress to arbitrarily close to the entropy, but that the
size of block required may be too large to make Huffman coding for
blocks feasible.  (This motivates looking at other methods such as
arithmetic coding.)

This can be seen in the simplest case of an alphabet with only two
symbols, with probabilities p and 1-p.  To make the arithmetic easier,
let's make p = 2^(-k), and concentrate on cases where p is small (ie,
k is big).

The first thing to do is find the entropy.  Doing so is also an
exercise in how to approximate things.  The entropy is

    H = p log_2(1/p) + (1-p) log_2(1/(1-p))

where log_2 means log base two.  You should get them to simplify the
first term to pk.  To simplify the second term, they will need to
realize first that log_2(x) = log_e(x) / log_e(2), where log_e is the
natural log.  They can then convert the second term to

    (1-p) log_e(1/(1-p)) / log_e(2)

Now then need to realize that for small p, 1/(1-p) is approximately
1+p, and log_e(1+p) is approximately p.  So the second term in the
entropy is approximately

    (1-p) p / log_e(2)

Now they need to realize that when p is small, 1-p is approximately 1,
so we can just drop that factor.

The final result is that the entropy is approximately

     pk + p/log_e(2) = p (k + 1/log_e(2))

In particular, when p is very small (hence k is big), the entropy is
almost entirely due to the first term - ie, almost all the information
arrives when the rare symbol occassionally occurs.  (I think this is
interesting to know in itself.)

Now that we know (approximately) what the entropy is when p is small,
we can ask how large the block size (N) has to be in order for the
compressed size to be close (in relative terms) to the entropy.

The key point is that the code for a block can't be any shorter than
one bit.  So the average code length per symbol has to be at least
1/N.  To get close to the entropy, 1/N must therefore be in the
vicinity of pk (ignoring the 1/log_e(2) part).  It follows that N must
be around 1/(pk) = 2^k / k.  For instance, if p = 1/1024 (ie, k=10),
we'll need to use blocks of size about 1024/10, which is about 100.

You could then talk about how feasible it is to come up with a good
code for such large blocks.  You can consider both Huffman codes and
Shannon-Fano codes.  Shannon-Fano codes have codeword lengths that are
easy to figure out for any particular block, but once we have these
lengths, which will satisfy the Kraft-McMillan inequality, we still
have to find a code with those lengths.  Actually, we have to find the
codeword for the block we're trying to code WITHOUT finding all the
other codewords, since there are too many of them.

They could try out various ideas.  I think if they succeed, they will
have essentially re-invented arithmetic coding.  But who knows, maybe
there is some other good scheme!