Tutorial 3 for CSC 310 This week's tutorial is aimed at the idea that by coding in blocks, one can compress to arbitrarily close to the entropy, but that the size of block required may be too large to make Huffman coding for blocks feasible. (This motivates looking at other methods such as arithmetic coding.) This can be seen in the simplest case of an alphabet with only two symbols, with probabilities p and 1-p. To make the arithmetic easier, let's make p = 2^(-k), and concentrate on cases where p is small (ie, k is big). The first thing to do is find the entropy. Doing so is also an exercise in how to approximate things. The entropy is H = p log_2(1/p) + (1-p) log_2(1/(1-p)) where log_2 means log base two. You should get them to simplify the first term to pk. To simplify the second term, they will need to realize first that log_2(x) = log_e(x) / log_e(2), where log_e is the natural log. They can then convert the second term to (1-p) log_e(1/(1-p)) / log_e(2) Now then need to realize that for small p, 1/(1-p) is approximately 1+p, and log_e(1+p) is approximately p. So the second term in the entropy is approximately (1-p) p / log_e(2) Now they need to realize that when p is small, 1-p is approximately 1, so we can just drop that factor. The final result is that the entropy is approximately pk + p/log_e(2) = p (k + 1/log_e(2)) In particular, when p is very small (hence k is big), the entropy is almost entirely due to the first term - ie, almost all the information arrives when the rare symbol occassionally occurs. (I think this is interesting to know in itself.) Now that we know (approximately) what the entropy is when p is small, we can ask how large the block size (N) has to be in order for the compressed size to be close (in relative terms) to the entropy. The key point is that the code for a block can't be any shorter than one bit. So the average code length per symbol has to be at least 1/N. To get close to the entropy, 1/N must therefore be in the vicinity of pk (ignoring the 1/log_e(2) part). It follows that N must be around 1/(pk) = 2^k / k. For instance, if p = 1/1024 (ie, k=10), we'll need to use blocks of size about 1024/10, which is about 100. You could then talk about how feasible it is to come up with a good code for such large blocks. You can consider both Huffman codes and Shannon-Fano codes. Shannon-Fano codes have codeword lengths that are easy to figure out for any particular block, but once we have these lengths, which will satisfy the Kraft-McMillan inequality, we still have to find a code with those lengths. Actually, we have to find the codeword for the block we're trying to code WITHOUT finding all the other codewords, since there are too many of them. They could try out various ideas. I think if they succeed, they will have essentially re-invented arithmetic coding. But who knows, maybe there is some other good scheme!