PROCEDURES FOR ARITHMETIC CODING AND ADAPTIVE MODELING Radford M. Neal, 2002/2004 This directory contains a "code" module for performing arithmetic coding, a "freq" module for maintaining symbol frequencies in an adaptive model, and three example programs, demonstrating the the use of these modules for compressing arbitrary files (encode/decode) and for compressing black/white images (encpic/decpic, encmono/decmono). THE ARITHMETIC CODING MODULE The "code" module allows a driver program to encode or decode a sequence of symbols, using arithmetic coding based on cumulative symbol frequencies that are supplied by the caller. Symbols from different alphabets may be intermixed. Special procedures are provided for encoding bits, which are faster than using the general procedures for alphabets of arbitrary size. The encoding data is written to standard output. Data to be decoded is read from standard input. Programs using these procedures should include "code.h", which declares prototypes for the procedures in this module. The precision used in performing arithmetic coding is specified by constants and typedefs in code.h. The number of bits used for coding intervals (Code_bits) must be at least two more than the number of bits used for symbol frequencies (Freq_bits). This difference should be larger if possible, for better compression. The data type freq_value, defined in code.h, should be used for storing symbol frequencies. For the general encode/decode procedures, the cumulative symbol frequencies must be supplied, in the form of an array whose first element (indexed by 0) is the total count, followed by the count for all except the first symbol, the count for all except the first and second symbols, etc. The last entry of this array, indexed by the total number of symbols, must be zero. Coding efficiency is improved by reordering the symbols so that the one with the largest frequency comes first, but the improvement is slight if Code_bits-Freq_bits is fairly large (eg, five or more). The procedures for encoding are as follows: void start_outputing_bits (void) This procedure must be called to initialize bit output before calling start_encoding (below). (Bits could be output by the program directly, using the procedures in the bit-io module, but this is not documented here.) void start_encoding (void) This procedure must be called before attempting to encode anything with encode_symbol or encode_bit. void encode_symbol (int symbol, freq_value cum_freq[]) Encodes the symbol identified by its first argument, which must be an integer from 1 (NOT 0) up to the number of symbols. The cumulative symbol frequencies are stored in the second argument, with cum_freq[i] being the total frequency for all symbols greater than i. Consequently, cum_freq[0] will be the total frequency for all symbols. void encode_bit (int bit, int freq0, int freq1) The is a specialized procedure for encoding bits (encode_symbol could also be used for this, but would be slower). The bit to encode (0 or 1) is the first argument, followed by the frequency of 0 bits and the frequency of 1 bits. The bit is encoded on the basis that the probability of a 1 is freq1/(freq0+freq1). void done_encoding (void) This procedure must be called after encoding the last symbol. It outputs any final bits needed for unambiguous decoding. void done_outputing_bits (void) This procedure must be called once all bits have been output (eg, after calling done_encoding), to flush the last bits in the buffer. The procedures for decoding are as follows: void start_inputing_bits (void) This procedure must be called to initialize bit input before calling start_decoding (below). void start_decoding (void) This procedure must be called before attempting to decode anything with decode_symbol or decode_bit. int decode_symbol (freq_value cum_freq[]) This procedure is passed an array of cumulative frequencies, which must have the same contents as it had for the corresponding call of encode_symbol in the encode program. The value returned is the next symbol, which is represented as an integer from 1 to the number of symbols. int decode_bit (int freq0, int freq1) Decodes a bit encoded by encode_bit. It must be passed the same frequencies for 0 and 1 bits that were passed to the corresponding call of encode_bit. MODULE FOR MAINTAINING SYMBOL FREQUENCIES The "freq" module provides a data structure and procedures for maintaining frequencies of symbols in various contexts, which are updated as new symbols are encountered. The symbols are also re-ordered so that they are in order of decreasing frequency, which decreases the time neeed for searching, and also slightly improves arithmetic coding efficiency. Programs using these procedures should include "freq.h", which declares prototypes for the procedures in this module and the data type "frequencies" for holding frequencies in a context. The actual symbols are represented as integers from 0 up to No_of_symbols-1. No_of_symbols may be changed as required, but is set up at present to be 257, which allows for any 8-bit byte plus the special EOF_symbol. Symbols are often referred to by indexes, which range from 1 to No_of_symbols. The translation between actual symbols and symbol indexes is performed by the tables symbol_to_index and index_to_symbol in the "frequencies" structure. These translation tables are automatically updated to keep the order of symbols by index in decreasing order of frequencies. The "freq" and "cum_freq" arrays in a "frequencies" structure record the counts and cumulative counts for symbols. They are indexed by the INDEX of a symbol, not the actual symbol. The cum_freq array is suitable for use with encode_symbol and decode_symbol, but note that one should therefore pass to encode_symbol the INDEX of the symbol, not the symbol itself. Similarly, if the "freq" module is used to maintain frequencies, decode_symbol will be passed the cum_freq array, and will return a symbol INDEX, not the symbol itself. The maximum total frequency allowed is Freq_full, defined in code.h. If the total frequency becomes larger than that, all frequencies are scaled down by a factor of two (by f' = (f+1)/2, so non-zeros will stay non-zero). Programs can maintain multiple "frequencies" structures, which are used in different contexts. Procedures in this module: void initialize_frequencies (frequencies *f) Initializes the frequencies structure pointed to by its argument to set the frequencies of all symbols to one. The mapping of symbols to indexes is initialized so symbol i has index i+1. void update_frequencies (frequencies *f, int index) Adds one to the frequency for the symbol with the given index. The mapping of symbols to indexes may change to keep symbols in decreasing order of frequency. If this increment takes the total frequency of all symbols above Freq_full, they are all scaled down. EXAMPLE PROGRAMS. The "encode" and "decode" programs encode and decode arbitrary files of 8-bit bytes, using an adaptive model with a single contexts. The "encpic" and "decpic" programs encode and decode black-and-white images, stored as ASCII characters in the format illustrated by the file "HiMom". These programs demonstrate the specialized procedures for encoding and decoding bits. They do not use the "freq" module, since maintaining counts for just two symbols is easily done directly. These programs do maintain multiple contexts to increase the coding efficiency. The "encmono" and "decmono" programs are the same as encpic and decpic except that they read and write images that are stored one bit per pixel with eight bits packed in a byte (with the low-order bit of a byte coming first). REFERENCES These procedures are derived from the program code and algorithms described in the following references: Witten, I. H., Neal, R. M., and Cleary, J. G. (1987) ``Arithmetic coding for data compression'', Communications of the ACM, vol. 30, pp. 520-540. Moffat, A., Neal, R. M., and Witten, I. H. (1998) ``Arithmetic coding revisited'', ACM Transactions on Information Systems, vol. 16, pp. 256-294. See these references for more information on the way the programs work. The arithmetic coding is done as described in the second reference, but the symbol frequencies are maintained in the way described in the first reference.