===========================================================================
CSC 310H                Lecture Summary for Week  1               Fall 2007
===========================================================================

------
Topics
------

In this course, we will study the problems of data compression and data
communication over a noisy channel.

In the context of data compression, a source produces an input X (which
could be a string, a computer file, an image, a movie) and we need to
encode X into another string Z that is as small as possible, but in such a
way that a decoder can later compute X by looking only at Z. If we require
that X be correctly retrieved all the time, we say that the compression
scheme is "lossless" (e.g. zip, compress). If we only require that X be
retrieved "with high probability", the compression scheme is "lossy" (e.g.
mp3, mpeg).

In the context of data communication over a noisy channel, we again start
with a source that produces an input X. We need to transmit X across a
communication channel (e.g. network cable, radio signal, CD) that might
introduce errors in the transmission. In order to boost the chance that X
is transmitted correctly, we encode X by adding "redundancy" into a string
Z. We then transmit Z across the noisy channel and we retrieve Z' at the
other end. Finally, we remove the redundancy from Z' and we obtain X'.
Ideally, we want to have X'=X. Unlike the case of data compression, we
cannot always enforce this, because there is always a small chance that the
communication channel alters Z a lot. So we settle for asking that X'=X
with high probability.

In this course, we will study:

 -  The mathematical theory behind compression and communication problems,
    including two famous theorems proved by Claude Shannon in 1948 that
    give limits on how well *any* method can do.

 -  Practical algorithms for compression and communication, which appeared
    much later.

 -  Modelling issues involved in transforming a real-world problem into a
    mathematical form so that analytical tools can be applied to it.

-------------------
Compression example
-------------------

Suppose we are given as input a black/white image of size 10x50 and we
would like to compress it. In particular, consider the following image:

..................................................
..XX...XX...........XX.....XX.....................
..XX...XX...........XXX...XXX.....................
..XX...XX...X.......XX.XXX.XX.....................
..XXXXXXX...........XX..X..XX.....................
..XXXXXXX...X.......XX.....XX....XX....XXXXX.XXX..
..XX...XX...X.......XX.....XX...X..X...XX...X...X.
..XX...XX...X.......XX.....XX...X..X...XX...X...X.
..XX...XX...X.......XX.....XX....XX....XX...X...X.
..................................................

Uncompressed, it takes 10x50 = 500 bits.

If both encoder and decoder know that the image is actually a piece of
English text and they also know the specific font that was used to generate
the image, it would be enough to encode the 6 characters: "H", "i", space,
"m", "o", "m". There are 26x2=52 letters plus a few punctuation symbols,
say at most 64 different symbols. We could encode each one in binary with a
6-bit string. Thus, we would only send 6x6 = 36 bits! This is a huge
improvement on the 500 bits required by the raw image, but it relies
heavily on "extra information" shared by the encoder and the decoder. Not
every 10x50 bw image can be encoded by this scheme.
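
As a rough sketch of what such a character-level code could look like (the
64-symbol alphabet below is made up for illustration; the lecture does not
fix a particular one, and the function names are mine):

    import string

    # A hypothetical alphabet of at most 64 symbols: upper/lower case
    # letters, space, and a few punctuation marks.
    ALPHABET = string.ascii_uppercase + string.ascii_lowercase + " .,!?'-\"#$%"

    def encode_text(text):
        """Encode each character as a fixed-length 6-bit codeword."""
        return "".join(format(ALPHABET.index(c), "06b") for c in text)

    def decode_text(bits):
        """Read the codewords back 6 bits at a time."""
        return "".join(ALPHABET[int(bits[i:i+6], 2)]
                       for i in range(0, len(bits), 6))

    bits = encode_text("Hi mom")
    print(len(bits))           # 36 bits, as computed above
    print(decode_text(bits))   # Hi mom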

Another idea: going left to right, if a column is identical to the previous
one, send a bit that signals this situation. If the new column is not
identical to the previous one, simply include it as is in the encoding.
Note that we still transmit the signalling bit in this case as well; it now
says "this column is not identical to the previous one".

The image above would be encoded as follows:

[0,0000000000],[1],[0,0111111110],[1],[0,0000110000],[1],[1] ...

Above, I included the delimiters "[", ",", "]" in order to make it clear
how the encoding is being constructed. These extra symbols would not be
part of the encoding.

How many bits does it take to encode the image? We use 1 bit for every
column identical to the one before it, and 11 bits for a new column. One
can check that the image contains 25 columns identical to the one before
them, and 25 new ones. Thus, the image would be encoded in 1x25+11x25 =
12x25 = 300 bits. That's pretty good compared to the raw 500 bits.
Furthermore, note that unlike the "English letters" scheme before, this
scheme can actually encode any 10x50 bw image. In the worst case, we will
use 11 bits for every column, so any encoding has at most 550 bits.
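
Here is a minimal sketch of this column scheme in Python, assuming the image
is given as a list of 10 strings of '.' and 'X' like the one drawn above
(this representation and the function names are my own choices):

    def encode_columns(image):
        """Encode a b/w image column by column: a '1' bit if the column
        repeats the previous one, otherwise a '0' bit followed by the 10
        column bits."""
        rows, cols = len(image), len(image[0])
        bits, prev = [], None
        for c in range(cols):
            col = "".join("1" if image[r][c] == "X" else "0"
                          for r in range(rows))
            if col == prev:
                bits.append("1")          # 1 bit for a repeated column
            else:
                bits.append("0" + col)    # 1 + 10 bits for a new column
            prev = col
        return "".join(bits)

    def decode_columns(bits, rows=10):
        """Invert encode_columns, then transpose the columns back to rows."""
        cols, i, prev = [], 0, None
        while i < len(bits):
            if bits[i] == "1":
                col, i = prev, i + 1
            else:
                col, i = bits[i+1:i+1+rows], i + 1 + rows
            cols.append(col)
            prev = col
        return ["".join("X" if col[r] == "1" else "." for col in cols)
                for r in range(rows)]

Fed the "Hi mom" image, encode_columns produces the 300-bit encoding whose
first few blocks were written out above (without the delimiters), and
decode_columns inverts it.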

One more idea: encode each black rectangle as a 4-tuple that specifies the
row and column of its top left and bottom right corners. We need 4 bits to
specify a row and 6 bits to specify a column, so 2x(4+6) = 20 bits per
black rectangle. The image above contains 20 of them, so it would be
encoded in 20x20 = 400 bits. Again, this scheme can encode any image. If
the black rectangles are all very small (say, only 1 pixel), we end up
using 20 bits per each such rectangle. In the worst case, the input looks
like a chess board with 500/2 = 250 single black pixels, which would get
encoded into 250x20 = 5000 bits.
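
Here is a sketch of just the bit-packing step of this scheme, assuming the
list of black rectangles has already been found somehow (the coordinate
convention and the names are mine):

    def encode_rectangles(rects):
        """Pack each black rectangle, given as (top, left, bottom, right)
        with 0-based coordinates, into 4 + 6 + 4 + 6 = 20 bits: 4 bits per
        row index (10 rows) and 6 bits per column index (50 columns)."""
        return "".join(format(top, "04b") + format(left, "06b") +
                       format(bottom, "04b") + format(right, "06b")
                       for top, left, bottom, right in rects)

    # For instance, the left vertical bar of the "H" occupies rows 1..8 and
    # columns 2..3 (counting from 0), and costs 20 bits:
    print(len(encode_rectangles([(1, 2, 8, 3)])))   # 20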

--------------------
Moral of the example
--------------------

Which scheme is better? It ultimately depends on what kind of image we
expect to receive as input. The more we know about the source that
generates the input, the better we will be at encoding it.

The way we incorporate knowledge about the source in our scheme is to build
a probabilistic model for the inputs that we expect to receive. A simple
such model would be "the probability that any pixel is white is .7,
independently of the colours of the pixels around it." A more elaborate
model could specify
the probability a pixel has colour C given the colours of the pixels above
and to the left of it. Based on this information, we could tune our
encoding scheme to perform well on these types of inputs.

If we know these probabilities before we even look at the image, that would
be a static model. If, instead, we learn these probabilities as we go
through the image, that would correspond to an adaptive model.
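
As a tiny illustration of the difference, deliberately using the crudest
possible model (a single probability that a pixel is white, ignoring its
neighbours; the names are mine):

    def static_white_prob():
        """Static model: the probability a pixel is white is fixed in
        advance and known to both encoder and decoder."""
        return 0.7

    def adaptive_white_prob(pixels_seen):
        """Adaptive model: the probability is re-estimated from the pixels
        scanned so far (with a +1/+2 correction so it is never 0 or 1)."""
        whites = sum(1 for p in pixels_seen if p == ".")
        return (whites + 1) / (len(pixels_seen) + 2)

Both encoder and decoder can maintain the adaptive estimate in lockstep,
since it depends only on pixels that have already been coded.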

----------------------------------
Shannon's Noiseless Coding Theorem
----------------------------------

How much can we compress the data coming from a given source?

Shannon proved in 1948 that there is a lower limit on the average length of
the encoding, called the entropy of the source, below which no lossless
compression scheme can go.
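
The theorem itself is only quoted here. Just to get a feel for the numbers,
for the simple pixel model mentioned above (each pixel white with
probability .7, independently of the rest), the entropy works out to about
.88 bits per pixel, so no lossless scheme can encode images from that source
in fewer than about 441 bits on average. A quick sanity check in Python (the
function name is mine):

    from math import log2

    def entropy(probs):
        """Entropy in bits: H = -sum of p*log2(p) over the probabilities."""
        return -sum(p * log2(p) for p in probs if p > 0)

    print(entropy([0.7, 0.3]))         # about 0.881 bits per pixel
    print(500 * entropy([0.7, 0.3]))   # about 441 bits for a 500-pixel image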

---------------------
Communication example
---------------------

Suppose we have a communication channel that flips every bit passing
through it independently with probability .1. We want to send an 8-bit
message X through this channel.

First, let's try sending X as-is. What is the probability the entire
message goes through with no errors? It's (.9)^8 = .430. That's pretty bad:
more than half the time, the message will contain errors.

What if we try sending each bit k times? Say, k=3. Then, to decode the
received string, we take the majority of the bits received for every source
bit.

Example:

X  =  0   0   1   1   1   0   1   1
Z  =  000 000 111 111 111 000 111 111
Z' =  000 010 111 100 110 000 111 011
X' =  0   0   1   0   1   0   1   1

As you can see, we were not able to fix the error in the 4th bit because
2 out of the 3 bits sent for that source bit were corrupted by the channel.
What is the probability the entire message goes through with no errors?

Pr[X_i = X'_i] = Pr[Z_i1 = Z'_i1 & Z_i2 = Z'_i2 & Z_i3 = Z'_i3]
                 + Pr[2 of Z_i1, Z_i2, Z_i3 are correct and 1 is wrong]
               = (.9)^3 + (3 choose 2) * (.9)^2 * (.1)
               = .972

Pr[X = X'] = (.972)^8 = .797, much better than .430!

However, we are paying some price for this increase. Define the "rate" of
the code to be |X|/|Z|. The first code has rate 1, the second one has rate
1/3 = .333.
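
Here is a small simulation of the k=3 repetition scheme, to check these
numbers empirically (a sketch; the function names are mine, and the channel
is simulated by flipping each bit independently with probability .1):

    import random

    def encode(x, k=3):
        """Repeat every source bit k times."""
        return [b for b in x for _ in range(k)]

    def channel(z, p=0.1):
        """The noisy channel: flip each bit independently with prob. p."""
        return [b ^ 1 if random.random() < p else b for b in z]

    def decode(zp, k=3):
        """Take the majority of every group of k received bits."""
        return [1 if sum(zp[i:i+k]) > k // 2 else 0
                for i in range(0, len(zp), k)]

    random.seed(0)
    x = [0, 0, 1, 1, 1, 0, 1, 1]
    trials = 100000
    ok = sum(decode(channel(encode(x))) == x for _ in range(trials))
    print(ok / trials)   # close to (.972)^8 = .797; the rate is 1/3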

What if we want the probability that X goes through to be arbitrarily close
to 1? It shouldn't be hard to see that we would need k --> infinity, so
rate --> 0. Is this unavoidable? Surprisingly: NO.

------------------------------
Shannon's Noisy Coding Theorem
------------------------------

There exists a non-zero constant, depending only on the channel, called the
capacity of the channel, such that there are encoding schemes that transmit
messages with arbitrarily small probability of error at rates arbitrarily
close to this capacity.
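
Just to attach a number to the example channel above (flip probability .1):
for this kind of channel the capacity formula, quoted here without proof, is
C = 1 - H(p), where H is the binary entropy function. A quick computation
(the function name is mine):

    from math import log2

    def bsc_capacity(p):
        """Capacity of the channel that flips each bit with probability p:
        C = 1 - H(p), with H the binary entropy function."""
        return 1 - (-p * log2(p) - (1 - p) * log2(1 - p))

    print(bsc_capacity(0.1))   # about 0.531 bits per transmitted bit

So for this channel, reliable transmission is possible at rates around .53,
rather than at rates tending to 0.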