CSC 310 Fall 2007 — Project
Due:
November 26th. See the course webpage for up-to-date information on the due date.
Worth:
20% of your course grade.
Part 1
Using either C, C++, Java or Matlab, write an implementation of Arithmetic Coding using
integer-based operations (not floating-point).
Many of the details needed for this implementation can be found in lecture notes 8-10.
Here are the assumptions and constraints you may rely on:
- For the encoder, the input will be presented in a file, to be read one character at a time.
The input alphabet consists of 8-bit characters, that is, numbers from 0 to 255 inclusive.
You may use a special character (say, with code 256) to signal the end of the encoding.
This character will not appear in the input file; rather, the moment your program detects
the end of the file, it appends this extra character to the encoding.
The decoder stops the moment it decodes this character and does not include it in the output file.
- The output alphabet of the encoder will be bits 0 and 1, but you may represent these
with characters '0' and '1'.
Normally, you would pack 8 output bits in each output character, but we will not do that.
However, when comparing the size of the source string with the size of the encoding, remember to
count in the same units, that is, divide the size of the encoding by 8.
- For the symbol probabilities, use the dynamic Laplace model, which estimates probabilities from counts.
Increment a character's count each time you encounter that character in the input string.
Make sure you never use zero probabilities. The decoder must update its counts in exactly the same way.
- You may assume that the sum of all symbol counts in the input file will always fit in a 25-bit
unsigned integer. Use this information to make a decision about the size of the integers
you will use in your program.
Using your program, compress the following files:
[ file 1 ],
[ file 2 ],
[ file 3 ],
[ file 4 ],
[ file 5 ].
Decompress them and make sure the output of your decompressor matches the original file.
For each of these files, report the size of the original file in bytes, the size of the encoding in bits,
the average number of bits per symbol,
and the compression ratio achieved (for the ratio, make sure you compare both sizes in the same unit, bits or bytes).
Moreover, using a separate program, compute the counts of each input symbol in each input file.
Compute the entropy of the set of probabilities you observe. Note that for this step there is no need
to adjust the zero counts as in the encoding part.
Compare this entropy with the average number of bits per symbol your encoder achieves.
Repeat this analysis with one more input file: the encoding of any one of the files above.
That is, try to compress the same thing twice and see how it goes.
What to hand in:
- The analysis described above for each of the 5+1 input files.
- Your code for the Arithmetic encoder and decoder.
We will probably arrange for an office hour with a TA where you must demonstrate
how your encoder works on a few input files.
The work you submit must be your own.
Part 2
Using either C, C++, Java or Matlab, write an implementation of an encoder and a decoder
for the [15,11] Hamming code given by the matrices below.
For this part of the project, you might find Matlab more useful,
as it contains built-in operations for dealing with matrices.
Details:
- Both the channel encoder and the channel decoder should read their input from a file
and write their output to a file.
- You may assume all files contain only characters '0' and '1'.
For example, this is the case with two of the
input files for part 1, as well as with the encodings you get in part 1,
provided you follow the suggestion of representing bits 0 and 1
with the respective 8-bit characters.
- The channel encoder should read in 11 characters at a time, compute the 15 character encoding,
and write it to the output file before reading in the next batch.
Again, write only characters '0' and '1', no line endings, no spaces.
If the input file contains a number of symbols not divisible by 11,
it is OK to simply discard the last incomplete 11-character block.
Of course, you wouldn't want to do this in practice, but for this project that is ok.
- The channel decoder should operate in a similar manner, reading in 15 characters at a time,
and writing 11 characters to the output before reading in the next batch.
- The generator matrix for our code is
[ G ],
the parity check matrix is
[ H ],
and the following matrix can be used to transform G and H into systematic form
[ Q ] (that is, by multiplying them on the right).
You can read one of these in Matlab with the command:
fid=fopen('G','r');G=fscanf(fid,'%d',[15,11]);G=G';fclose(fid);
- Apart from reading and writing, the channel encoder should be trivial.
Do not encode with G*Q (that is, with the equivalent code in systematic form).
You will only use Q for decoding.
- Apart from reading and writing, the channel decoder should consist of two main parts.
One is the codeword correction, the other is the codeword decoding.
For the correction part, compute the syndrome of the string received, and use
the fact that H is in a special form to decide which bit to correct, if any.
For the decoding part, note that for every input string x, Q*(G^T)*x = Q*w, where w = (G^T)*x is the corresponding codeword.
Use this hint to quickly recover x from w.
Along with your channel encoder and decoder, use the following program to simulate a BSC
with a given flip probability:
[ bsc.c ].
Compile it and use it as follows:
bsc 0.2 <file.in >file.out
Furthermore, use the following program to count the number of blocks you get right and wrong:
[ block-diff.c ].
Compile it and use it as follows:
block-diff file-before-encoding file-after-decoding
Use your channel encoder on the following 2 channel input files:
[ file 5 ]
(also used before),
and its arithmetic coding
(that is, pass it through your arithmetic coder, and apply the channel coder to the
file you get).
If you did not finish your Arithmetic coder (and only then), use
[ file 4 ]
as your second example for channel coding.
At this point, you should have 2 channel encoding files.
Take the 2 channel encodings and simulate their transmission through the BSC
once with flip probability 0.05 and once with 0.2.
At this point you should have 4 channel transmission files.
Clarification: Do not apply the BSC twice in a row!
Apply it to the first channel encoding with flip probability 0.05 and save the result;
then apply it again to the same original encoding (not to the noisy output) with flip probability 0.2, and save that result.
Repeat this for the second channel encoding file.
Pass each of the 4 channel transmission files through your channel decoder.
At this point, you should have 4 channel decoding files.
Compare each of your channel decodings with its respective channel input file,
using the provided block-based comparison program.
What to hand in:
- A discussion of how well your channel coding scheme is doing
on the 2 channel input files under each of the 2 flip probabilities.
Give the counts obtained in your simulation and explain why some settings are better than others.
- Your code for your channel coder and decoder. The work you submit must be your own.