===========================================================================
CSC 310H                  Lecture 8 Summary                      Fall 2007
===========================================================================

------
Topics
------

- Problems with block Huffman coding
- Intervals as codes
- Bit strings as intervals
- Encode blocks of symbols: Arithmetic coding, take 1.
- References: sections 5.6, 6.2, lectures 7-8 from 2006 course by prof
  Roweis.

----------------------------------
Problems with block Huffman coding
----------------------------------

Ideas:

1. In order to get the average number of bits per symbol close to the
   entropy, we need to consider large blocks of N symbols.

   a. If the source has I symbols, its N-th extension has I^N symbols. To
      compute a Huffman code for this extension, we need to consider all
      those very improbable strings and assign them codewords. The number
      of symbols we have to deal with thus grows exponentially with N!

   b. In order to emit a codeword, we need to read in all N symbols of the
      source block. This is not as big an issue as a. above, but it is
      still a consideration.

2. There is no easy way to alter the symbol probabilities once the
   encoding starts. For example, suppose that, as we read an input bit
   string where we assumed 0 occurs with probability 90%, we realize the
   probability of a 0 is really much closer to 80%. The only way to
   incorporate this knowledge is to recompute the entire Huffman code,
   which is expensive.

Reference: section 5.6.

------------------
Intervals as codes
------------------

Ideas:

- An ID code tree divides the available codespace into subintervals of the
  [0,1) interval, where each subinterval has a length which is a power
  of 2.

- If the symbol probabilities are not powers of 2, there is no way an ID
  code tree can assign codewords with the corresponding lengths.

- For now, try to use the intervals themselves as codewords.
  For example, if {a_1, a_2, a_3, a_4} have probabilities
  {1/3, 1/6, 1/6, 1/3}, assign the codewords:

    a_1 -> [0, 1/3)
    a_2 -> [1/3, 1/2)
    a_3 -> [1/2, 2/3)
    a_4 -> [2/3, 1)

------------------------
Bit strings as intervals
------------------------

- However, how do we transmit an interval? Its endpoints can be
  rational/real numbers.

- All the encoder needs to do is transmit a rational number that lies
  inside one of these intervals. The decoder can then figure out which
  symbol was encoded. For the example above, we could send 0 or 1/8 for
  a_1, as they both lie inside the interval corresponding to a_1.

- We will actually transmit a smaller interval which lies completely
  inside one of the codeword intervals, as follows.

- For a bit string b = b_1, .., b_k, define r(b) to be the real number
  whose binary representation is 0.b_1 .. b_k. For example, if b = 00101,
  r(b) = 0.00101 = 1*2^{-3} + 1*2^{-5} = 5/32.

- For a bit string b, define int(b) to be the real number interval
  [ r(b), r(b+1) ), where b+1 is the bit string b incremented by 1 as a
  binary number. For example, if b = 011,
  int(b) = [0.011, 0.100) = [3/8, 1/2).

- Note that the length of int(b) is 2^{-k}, where k is the length of b.

- So, upon seeing a_i, the encoder sends a bit string b such that int(b)
  is a subset of the interval corresponding to a_i. For example, with the
  code above, upon seeing a_2, the encoder can send b = 011, as
  [3/8, 1/2) is a subinterval of [1/3, 1/2). It could also send b = 0110,
  which corresponds to [3/8, 7/16).

---------------------------------------------------
Encode blocks of symbols: Arithmetic coding, take 1
---------------------------------------------------

Idea:

- Instead of encoding single source symbols, encode blocks of N symbols.

- Subdivide the interval corresponding to a_4 into intervals
  corresponding to a_4 a_1, .., a_4 a_4, in proportion to the symbol
  probabilities, and so on for longer blocks.
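The definitions of r(b) and int(b) above, and the search for a bit string whose interval fits inside a given codeword interval, can be sketched in code. A minimal sketch in Python (the function names are my own; `Fraction` keeps the arithmetic exact, and the search is brute force for clarity, not efficiency):

```python
from fractions import Fraction

def r(b):
    """r(b): the real number with binary expansion 0.b_1 b_2 .. b_k."""
    return sum(Fraction(int(bit), 2 ** (i + 1)) for i, bit in enumerate(b))

def interval(b):
    """int(b) = [ r(b), r(b) + 2^{-k} ), where k = len(b)."""
    lo = r(b)
    return (lo, lo + Fraction(1, 2 ** len(b)))

def shortest_bitstring(lo, hi):
    """Shortest b with int(b) a subset of [lo, hi), by brute force."""
    k = 1
    while True:
        for n in range(2 ** k):
            b = format(n, '0{}b'.format(k))   # all length-k bit strings
            blo, bhi = interval(b)
            if lo <= blo and bhi <= hi:
                return b
        k += 1
```

With the code above, `shortest_bitstring(Fraction(1, 3), Fraction(1, 2))` returns "011", matching the a_2 example: [3/8, 1/2) is the shortest dyadic interval inside [1/3, 1/2).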
Algo: Arithmetic Coding, take 1:

  (The input block is a_{i_1} a_{i_2} .. a_{i_N}.)

  u^0 <- 0
  v^0 <- 1
  for k <- 1 to N
      u^k <- u^{k-1} + (v^{k-1} - u^{k-1}) * sum_{j=1}^{i_k - 1}{ p_j }
      v^k <- u^k + (v^{k-1} - u^{k-1}) * p_{i_k}
  endfor
  send a bit string b such that int(b) is a subset of [u^N, v^N)
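The take-1 algorithm can be sketched end to end as follows. This is a Python sketch under my own naming, using exact fractions and a brute-force search for the final bit string; a decoder is included to show that the interval narrowing is reversible, though the lecture only states the encoder:

```python
from fractions import Fraction

def encode_block(indices, probs):
    """Arithmetic coding, take 1: narrow [u, v) once per symbol, then
    send the shortest bit string b with int(b) a subset of [u^N, v^N).

    indices: the block i_1 .. i_N, as 0-based positions into probs.
    probs:   symbol probabilities p_1 .. p_I, as Fractions summing to 1.
    """
    u, v = Fraction(0), Fraction(1)
    for i in indices:
        width = v - u
        u = u + width * sum(probs[:i], Fraction(0))  # sum of p_j for j < i
        v = u + width * probs[i]
    # Find the shortest b such that [r(b), r(b) + 2^{-k}) lies in [u, v).
    k = 0
    while True:
        k += 1
        step = Fraction(1, 2 ** k)
        for n in range(2 ** k):        # r(b) for length-k b is n * 2^{-k}
            lo = n * step
            if u <= lo and lo + step <= v:
                return format(n, '0{}b'.format(k))

def decode_block(bits, probs, n_symbols):
    """Recover the block: see which subinterval contains r(b), repeat."""
    x = sum(Fraction(int(bit), 2 ** (j + 1)) for j, bit in enumerate(bits))
    u, v = Fraction(0), Fraction(1)
    out = []
    for _ in range(n_symbols):
        width = v - u
        cum = Fraction(0)
        for i, p in enumerate(probs):
            if u + width * cum <= x < u + width * (cum + p):
                out.append(i)
                u, v = u + width * cum, u + width * (cum + p)
                break
            cum += p
    return out
```

For example, with probs = [1/3, 1/6, 1/6, 1/3] as above, encoding the single-symbol block a_2 (index 1) yields "011", agreeing with the earlier hand computation. Note that this take still suffers from a flaw the "take 1" label hints at: no bits are emitted until the entire block has been read.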