David MacKay
.

Information Theory, Inference, and Learning Algorithms



Search :

.

`An instant classic, covering everything from Shannon's fundamental theorems to the postmodern theory of LDPC codes. You'll want two copies of this astonishing book, one for the office and one for the fireside at home.'

Bob McEliece, California Institute of Technology

Price: £30.00 / $50.00 from |CUP UK/USA| |amazon.co.uk/.com/.ca/.co.jp| |Barnes and Noble USA | fetchbook.info | allbookstores | biggerbooks | blackwells | kalahari.net (South Africa, R464) |

Best buys (US) amazon.com, $39.50

NEW for teachers: all the figures available for download (as well as the whole book).

   
David J.C. MacKay

Information Theory, Inference, and Learning Algorithms

newcover10
 
Cambridge University Press, 2003

0521642981
How does it compare with Harry Potter?

Information Theory, Inference, and Learning Algorithms

Textbook by David J.C. MacKay

Published by Cambridge University Press, 2003.

[want to choose a nearer web site? | Cambridge, Europe | Toronto, North America |]

In 2003 this book will be published by CUP.
It will continue to be available from this website for on-screen viewing.

Want to give feedback on the book, or report typos?

Great, to ask a question about the book please use this FAQ; to report a typo, mail me. THANK YOU!
Corrections for draft 3.1415

Cambridge students might wish to look at the course website for Information Theory, Pattern Recognition, and Neural Networks

Want to ask a question?

- please click to submit a query to the FAQ about the book, or the software.

Information Theory, Inference, and Learning Algorithms

(Hardback, 640 pages, Published September 2003)

Order your copy

Price: £30.00 / $50.00 from |CUP UK/USA| |amazon.co.uk/.com/.ca/.co.jp| |Barnes and Noble USA | fetchbook.info | allbookstores | biggerbooks | blackwells | kalahari.net (South Africa, R464) |

Best buys (US) amazon.com, $39.50


Download the book too

You may download The book in one file (640 pages):

U.K. english Canada canada South Africa south africa
PDF (A4) pdf (9M) pdf pdf
Postscript (A4) postscript (third printing, September 2004) (5M) postscript postscript
DJVU djvu file (6M) djvu file djvu file
(djvu information | Download djView)
Just the words (latex) [provided for convenient searching] (2.4M) (latex) (latex)
Just the figures NEW All in one file (postscript) [provided for use of teachers] (2M)
(pdf) (5M) (ps.gz) (pdf)
In individual eps files
Individual chapters postscript and pdf available from this page mirror mirror

Notes:

  • Version 6.0 was released Thu 26/6/03; the book is finished. You are welcome to view the book on-screen. Version 6.0 was used for the first printing, published by C.U.P. September 2003.
  • Version 6.6 was released Mon 22/12/03; it will be used for the second printing, to be released January 2004. In this second printing, a small number of typographical errors were corrected, and the design of the book was altered slightly. Page-numbering generally remains unchanged, except in chapters 1, 6, and 28, where a few paragraphs, figures, and equations have moved around. All equation, section, and exercise numbers are unchanged.
  • Version 7.0 is the third printing (November 2004). Its only differences from the 2nd printing are a number of corrections, and the renaming of the chapter `Correlated Random Variables' to `Dependent Random Variables'.

Copyright issues: The book is copyright (c) Cambridge University Press. It has been available in bookstores since September 2003. The cover price is 30 pounds (UK) and $50 (USA).

Now the book is published, these files will remain viewable on this website. The same copyright rules will apply to the online copy of the book as apply to normal books. [e.g., copying the whole book onto paper is not permitted.]

History:
Draft 1.1.1 - March 14 1997.
Draft 1.2.1 - April 4 1997.
Draft 1.2.3 - April 9 1997.
Draft 1.2.4 - April 10 1997. Margins altered so as to print better on Northamerican paper
Draft 1.3.0 - December 23 1997.
Draft 1.9.0 - Feb 1 1999.
Draft 2.0.0 - Jan 6 2000. New page layout.
Draft 2.2.0 - Dec 23 2000. Fresh draft.
Draft 3.1415 - Jan 12 2003. Nearly finished.
Draft 4.0 - April 15 2003. Chapter sequence finalized.
Draft 4.1 - April 18 2003. Adding new frontmatter (Preface etc)to book. Corrected printing of letter version.
Version 6.0 - Thu 26 June 2003. Final version.
Version 6.6 - Mon 22 December 2003. Second printing.

  • Here is my method for converting to two-up under linux:
    pstops '4:0L@.67(20cm,1cm)+1L@.67(20cm,15cm),3R@.67(1cm,15.25cm)\
    +2R@.67(1cm,29.25cm)' $*.ps $*.dps 

The Back Cover

Information theory and inference, often taught separately, are here united in one entertaining textbook. These topics lie at the heart of many exciting areas of contemporary science and engineering - communication, signal processing, data mining, machine learning, pattern recognition, computational neuroscience, bioinformatics, and cryptography.

This textbook introduces theory in tandem with applications. Information theory is taught alongside practical communication systems, such as arithmetic coding for data compression and sparse-graph codes for error-correction. A toolbox of inference techniques, including message-passing algorithms, Monte Carlo methods, and variational approximations, are developed alongside applications of these tools to clustering, convolutional codes, independent component analysis, and neural networks.

The final part of the book describes the state of the art in error-correcting codes, including low-density parity-check codes, turbo codes, and digital fountain codes -- the twenty-first century standards for satellite communications, disk drives, and data broadcast.

Richly illustrated, filled with worked examples and over 400 exercises, some with detailed solutions, David MacKay's groundbreaking book is ideal for self-learning and for undergraduate or graduate courses. Interludes on crosswords, evolution, and sex provide entertainment along the way.

In sum, this is a textbook on information, communication, and coding for a new generation of students, and an unparalleled entry point into these subjects for professionals in areas as diverse as computational biology, financial engineering, and machine learning.

Endorsements by famous people

This is an extraordinary and important book, generous with insight and rich with detail in statistics, information theory, and probabilistic modeling across a wide swathe of standard, creatively original, and delightfully quirky topics. David MacKay is an uncompromisingly lucid thinker, from whom students, faculty and practitioners all can learn.

Zoubin Ghahramani and Peter Dayan
Gatsby Computational Neuroscience Unit, University College London.

An utterly original book that shows the connections between such disparate fields as information theory and coding, inference, and statistical physics.

Dave Forney, M.I.T.

An instant classic, covering everything from Shannon's fundamental theorems to the postmodern theory of LDPC codes. You'll want two copies of this astonishing book, one for the office and one for the fireside at home.

Bob McEliece, California Institute of Technology

(Shorter versions of the jacket blurb)


Endorsements by famous people

This is an extraordinary and important book, generous with insight and rich with detail in statistics, information theory, and probabilistic modeling across a wide swathe of standard, creatively original, and delightfully quirky topics. David MacKay is an uncompromisingly lucid thinker, from whom students, faculty and practitioners all can learn.

Zoubin Ghahramani and Peter Dayan
Gatsby Computational Neuroscience Unit, University College London.

An utterly original book that shows the connections between such disparate fields as information theory and coding, inference, and statistical physics.

Dave Forney, M.I.T.

An instant classic, covering everything from Shannon's fundamental theorems to the postmodern theory of LDPC codes. You'll want two copies of this astonishing book, one for the office and one for the fireside at home.

Bob McEliece, California Institute of Technology

Endorsements by users

I am involved with teaching undergraduate information theory at Royal Holloway. We aim to make this book a reference text in our related courses.

I have found the book very enjoyable to read and I think that it is far easier for students to relate to as the maths is explicit and easy to follow, with excellent examples.

Congratulations on an inspirational book! This is by far the best book I have read in years!

David Lindsay, Computer Learning Research Centre (www.clrc.rhul.ac.uk)


I am compelled to state categorically that this is one of the finest text books I have read on these subjects. I have even pilfered some of the material for use in my classes.

Samir Chettri, UMBC


One of the best technical books ever written, period. It's a joy to read and students I have shown it to are attracted to it like bees to honey.

Alpan Raval, Keck Graduate Institute & Claremont Graduate University



Full blurb from back cover

Alternate blurbs

Jacket blurb (215 words)

Information theory and inference, often taught separately, are here united in one entertaining textbook. These topics lie at the heart of many exciting areas of contemporary science and engineering -- communication, signal processing, data mining, machine learning, pattern recognition, computational neuroscience, bioinformatics, and cryptography.

This textbook introduces theory in tandem with applications. Information theory is taught alongside practical communication systems, such as arithmetic coding for data compression and sparse-graph codes for error-correction. A toolbox of inference techniques, including message-passing algorithms, Monte Carlo methods, and variational approximations, are developed alongside applications of these tools to clustering, convolutional codes, independent component analysis, and neural networks.

The final part of the book describes the state of the art in error-correcting codes, including low-density parity-check codes, turbo codes, and digital fountain codes -- the twentyfirst-century standards for satellite communications, disk drives, and data broadcast.

Richly illustrated, filled with worked examples and over 400 exercises, some with detailed solutions, the book is ideal for self-learning and for undergraduate or graduate courses. Interludes on crosswords, evolution, and sex provide entertainment along the way.

The result is a textbook on information, communication, and coding for a new generation of students, and an unparalleled entry point into these subjects for professionals in areas as diverse as computational biology, financial engineering, and machine learning.

old jacket blurb

Information theory, probabilistic reasoning, coding theory and algorithmics lie at the heart of some of the most exciting areas of contemporary science and engineering. They are integral to such areas as communication, signal processing, data mining, machine learning, pattern recognition, computational neuroscience, bioinformatics, and cryptography. David MacKay breaks new ground in this exciting and entertaining textbook by introducing mathematical technology in tandem with applications, providing at once both motivation and hands-on guidance for problem-solving and modelling. For example, he covers not only the theoretical foundations of information theory, but practical methods for communication systems, including arithmetic coding for practical data compression, and low-density parity-check codes for reliable communication over noisy channels. The duality between communication and machine learning is illustrated through data modelling and compression; machine learning is also represented by clustering, classification, feedforward networks, Hopfield networks, Boltzmann machines and independent component analysis. A toolbox of probabilistic techniques are covered in detail: message-passing, Monte Carlo, and variational methods. The final part of the book, on sparse graph codes, describes the state-of-the-art in error-correcting codes, including chapters on low-density parity-check codes, turbo codes, and digital fountain codes. There are over 390 exercises, some with full solutions, which, together with worked examples, extend the text and enhance technique and understanding. A profusion of illustrations enliven and complement the text. Interludes on crosswords, evolution, and sex provide entertaining glimpses on unconventional applications. In sum, this is a textbook for courses in information, communication and coding for a new generation of students, and an unparalleled entry point to these subjects for professionals working in areas as diverse as computational biology, data mining, financial engineering and machine learning.

150 word blurb

Information theory, probability, coding and algorithmics lie at the heart of some of the most dynamic areas of contemporary science and engineering. David MacKay breaks new ground in this exciting and entertaining textbook by introducing mathematical technology in tandem with applications, providing simultaneously both motivation and hands-on guidance for problem-solving and modelling. For example, he covers the theoretical foundations of information theory, and practical methods for communication systems. Communication and machine learning are linked through data modelling and compression. Over 390 exercises, some with full solutions, and nearly 40 worked examples, extend the text and enhance technique and understanding. Enlivening and enlightening illustrations abound. In sum, this is a textbook for courses in information, communication and coding for a new generation of students, and an unparalleled entry point to these subjects for professionals working in areas as diverse as computational biology, data mining, financial engineering and machine learning.

90 word blurb

Information theory, probabilistic reasoning, coding theory and algorithmics underpin contemporary science and engineering. David MacKay breaks new ground in this exciting and entertaining textbook by introducing mathematics in tandem with applications. Over 390 exercises, some with full solutions, extend the text and enhance technique and understanding. Enlivening and enlightening illustrations abound. It's ideal for courses in information, communication and coding for a new generation of students, and an unparalleled entry point to these subjects for professionals working in areas as diverse as computational biology, datamining, financial engineering and machine learning.

50 word blurb

This exciting and entertaining textbook is ideal for courses in information, communication and coding for a new generation of students, and an unparalleled entry point to these subjects for professionals working in areas as diverse as computational biology, datamining, financial engineering and machine learning.

Another old short blurb

This textbook offers comprehensive coverage of Shannon's theory of information as well as the theory of neural networks and probabilistic data modelling. It includes explanations of Shannon's important source encoding theorem and noisy channel theorem as well as descriptions of practical data compression systems. Many examples and exercises make the book ideal for students to use as a class textbook, or as a resource for researchers who need to work with neural networks or state-of-the-art error-correcting codes.

Yet another old blurb

This textbook offers comprehensive coverage of Shannon's theory of information as well as the theory of neural networks and probabilistic data modeling. Shannon's source coding theorem and noisy channel theorem are explained and proved. Accompanying these theoretical results are descriptions of practical data compression systems including the Huffman coding algorithm and the less well known arithmetic coding algorithm. The treatment of neural networks is approached from two perspectives. On the one hand, the information-theoretic capabilities of some neural network algorithms are examined, and on the other hand, neural networks are motivated as statistical models. With many examples and exercises, this book is ideal for students to use as the text for a course, or as a resource for researchers who need to work with neural networks or state-of-the-art error correcting codes.

Contents
  1Introduction to Information Theory
2Probability, Entropy, and Inference
3More about Inference
Part IData Compression
4The Source Coding Theorem
5Symbol Codes
6Stream Codes
7Codes for Integers
Part IINoisy-Channel Coding
8Dependent Random Variables
9Communication over a Noisy Channel
10The Noisy-Channel Coding Theorem
11Error-Correcting Codes and Real Channels
Part III Further Topics in Information Theory
12Hash Codes: Codes for Efficient Information Retrieval
13Binary Codes
14Very Good Linear Codes Exist
15Further Exercises on Information Theory
16Message Passing
17Communication over Constrained Noiseless Channels
18Crosswords and Codebreaking
19Why have Sex? Information Acquisition and Evolution
Part IV Probabilities and Inference
20An Example Inference Task: Clustering
21Exact Inference by Complete Enumeration
22Maximum Likelihood and Clustering
23Useful Probability Distributions
24Exact Marginalization
25Exact Marginalization in Trellises
26Exact Marginalization in Graphs
27Laplace's Method
28Model Comparison and Occam's Razor
29Monte Carlo Methods
30Efficient Monte Carlo Methods
31Ising Models
32Exact Monte Carlo Sampling
33Variational Methods
34Independent Component Analysis and Latent Variable Modelling
35Random Inference Topics
36Decision Theory
37Bayesian Inference and Sampling Theory
Part V Neural networks
38Introduction to Neural Networks
39The Single Neuron as a Classifier
40Capacity of a Single Neuron
41Learning as Inference
42Hopfield Networks
43Boltzmann Machines
44Supervised Learning in Multilayer Networks
45Gaussian Processes
46Deconvolution
Part VI Sparse Graph Codes
47Low-Density Parity-Check Codes
48Convolutional Codes and Turbo Codes
49Repeat-Accumulate Codes
50Digital Fountain Codes
Part VII Appendices
Notation; Some Physics; Some Mathematics

newcover10

newcover4 newcover2
newcover16 newcover3
newcover10 newcover17
newcover12 newcover13
newcover14 newcover15
newcover11 newcover9
newcover18 newcover19
newcover2 newcover3
newcover4 newcover5
newcover6 newcover7
newcover8
Draft covers for the book by DJCM.

Errors in the book

I would be grateful to hear about errors in the book.

A list of corrections is provided.

Please send me new corrections by email. Thank you!

If you have a query about the book that you think other people might also ask, please use one of the following links to submit your query through metaFAQ; you may find the answer is already there:
Query categories
  • Any other query

  • Software, demonstrations, etc.

    Hint: To force download to file, use shift-click

    The tcl programs and other links marked (*) can be run directly in netscape if you have got the tcl/tk plugin. You can also run the tcl programs in the normal way if you have a linux machine with X windows and tcl and you download the tcl files then execute them (this is preferable to the netscape approach, I think). [The top four lines of each tcl file contain flags which change the program from working with the netscape plugin to working alone under X.] The other programs can be used by people on unix systems that have the relevant software (perl, gnuplot, octave.)

    Please select a software category from the sidebar.



    Hint: To force download to file, use shift-click

    The tcl programs and other links marked (*) can be run directly in netscape if you have got the tcl/tk plugin. You can also run the tcl programs in the normal way if you have a linux machine with X windows and tcl and you download the tcl files then execute them (this is preferable to the netscape approach, I think). [The top four lines of each tcl file contain flags which change the program from working with the netscape plugin to working alone under X.] The other programs can be used by people on unix systems that have the relevant software (perl, gnuplot, octave.)


    Inference methods


    Hint: To force download to file, use shift-click

    The tcl programs and other links marked (*) can be run directly in netscape if you have got the tcl/tk plugin. You can also run the tcl programs in the normal way if you have a linux machine with X windows and tcl and you download the tcl files then execute them (this is preferable to the netscape approach, I think). [The top four lines of each tcl file contain flags which change the program from working with the netscape plugin to working alone under X.] The other programs can be used by people on unix systems that have the relevant software (perl, gnuplot, octave.)


    Online Hamiltonian Monte Carlo Demo

    (formerly known as `Hybrid Monte Carlo')

    Task: sample from the posterior distribution of a bivariate Gaussian distribution given ten data points. The left-hand image shows the details underlying a sequence of HMC transitions. The right-hand image shows 50 of the successive states produced by the Markov chain - the endpoints of the trajectories.
    Hamiltonian Monte Carlo demonstration details  Hamiltonian Monte Carlo demonstration 
    fit.tar tar file giving octave source code.



    Hint: To force download to file, use shift-click

    The tcl programs and other links marked (*) can be run directly in netscape if you have got the tcl/tk plugin. You can also run the tcl programs in the normal way if you have a linux machine with X windows and tcl and you download the tcl files then execute them (this is preferable to the netscape approach, I think). [The top four lines of each tcl file contain flags which change the program from working with the netscape plugin to working alone under X.] The other programs can be used by people on unix systems that have the relevant software (perl, gnuplot, octave.)



    Hint: To force download to file, use shift-click

    The tcl programs and other links marked (*) can be run directly in netscape if you have got the tcl/tk plugin. You can also run the tcl programs in the normal way if you have a linux machine with X windows and tcl and you download the tcl files then execute them (this is preferable to the netscape approach, I think). [The top four lines of each tcl file contain flags which change the program from working with the netscape plugin to working alone under X.] The other programs can be used by people on unix systems that have the relevant software (perl, gnuplot, octave.)


    Sparse Graph Codes


    Other files

    • Images for Chromatic abberation demo: | 1 | 2 |
    chromatic aberration demo by David MacKay

    Dedication

    My textbook is dedicated to the Campaign Against the Arms Trade

    Food for thought...


    Extra materials

    Additional materials associated with the book.


    Feel free to make links to this website!

    To make a link like this:
    Information Theory, Inference, and Learning Algorithms Information Theory, Inference, and Learning Algorithms
    by David MacKay
    use this html:
    <table cellpadding=2 cellspacing=2 bgcolor="#47aeba"><tr><td>
    <table bgcolor="#ffffee" cellpadding=2><tr><td>
    <a href="http://www.inference.phy.cam.ac.uk/mackay/itila/"><img
    border=0 align=right
    src="http://www.inference.phy.cam.ac.uk/mackay/itila/images/Sept2003Cover25.gif"
    alt="Information Theory, Inference, and Learning Algorithms"
    width="137" height="180"></a>
    </td><td width=180><small><a
    href="http://www.inference.phy.cam.ac.uk/mackay/itila/">Information Theory,
    Inference, and Learning Algorithms</a><br>
     by David MacKay</small>
    </td></tr></table>
    </td></tr></table>
    

    All the figures

    For the use of teachers, all the figures are provided, one figure per page, in one 343-page file (postscript) | pdf. You can also select individual figures (in encapsulated postscript files) from this directory.

    FAQ

    How do I take one page of your postscript file and get the bounding box of the figure to be correct?
    I have tried to do this for you in this directory, using the two methods described below.
    Alternatively - Use this perl script by Dick Furnstahl: bbox_add.pl | bbox_add.pl (which google found here).
    You can see example output of this here (applied to figures 7 and 93): example7.ps | example93.ps.
    How it works
    It invokes
        gs -sDEVICE=bbox -dNOPAUSE -dBATCH file.ps
    
    to print the bounding box information for file.ps (and then exit gs). The perl script will automatically find the bounding box and correct the postscript file (or add the info if there is none). Be sure to make it executable:
        chmod +x bbox_add.pl
    
    An alternative answer is get DJCM to make the page using dvips thus:
     dvips -E -p 7 -l 7 ../figures.dvi -o ex7.eps
    
    - is this eps output any better for you?
    How do I convert PostScript to EPS?
    Use pstoepsi or ps2epsi (this didn't actually work for me). Or ps2eps | ps2eps.
    DESCRIPTION
    ps2epsi uses gs(1) to process a PostScript(tm) file and generate as output a new file which conforms to Adobe's Encapsulated PostScript Interchange (EPSI) format. EPSI is a special form of encapsulated PostScript (EPS) which adds to the beginning of the file in the form of PostScript comments a bitmapped version of the final displayed page. Programs which understand EPSI (usually word processors or DTP programs) can use this bitmap to give a preview version on screen of the PostScript. The displayed quality is often not very good (e.g., low resolution, no colours), but the final printed version uses the real PostScript, and thus has the normal PostScript quality.
    USAGE
    On Unix systems invoke ps2epsi like this:
                ps2epsi infile.ps [ outfile.epsi ]
    	    


    Extra exercises

    Information theory

    1. Solve the weighing problem for N coins of which one is odd, and known to be lighter. You may find it interesting to minimize the average number of weighings required, (rather than the maximum number) for the cases N=6 and N=7. [D. G. Mead, The average number of weighings to locate a counterfeit coin, IEEE Trans IT Sept 1979, vol 25, no 5, 616-617]
    2. Mutual information. Consider a channel with input $x$ and output $y$. Let |Y| = some integer greater than 5. Let |X| = |Y| choose 2.

      Let the marginal distribution of X be uniform. Let the transition matrix from X to Y be a weight-two matrix like this: (with every 1 entry replaced by 0.5)

      1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      1 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
      0 1 0 0 0 0 1 0 0 0 0 1 1 1 1 0 0 0 0 0 0
      0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 1 1 1 0 0 0
      0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 1 0 0 1 1 0
      0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 1 0 1 0 1
      0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 1 0 1 1
      
      [rows are labelled by y, cols by x.]

      Now, assume that one input $x$ is selected randomly and transmitted { twice\/} over the channel. Let $y_1$ and $y_2$ be the two outputs. Find the mutual informations $I(X;Y_1)$ and $I(Y_1;Y_2)$.

      Solution:

       I(Y1:Y2) = log|Y| - ( 1 + 0.5 log(|Y|-1) )
               I(X:Y1) - H(Y1|X) = log|Y| - 2
      	 
    3. A correlation puzzle [Hard]
    4. Investigate the two codes used on CDs, and quantify how far each one is from the capacity of an appropriate channel model. [CDs use an error-correcting code and a run-length limiting code. The ECC is composed of three interleaved reed-solomon codes; typical parameters in bytes are (N,K)=(28,24) and (32,28). The inner run-length-limiting code is known as "EFM", which stands for eight-to-fourteen modulation; this code converts every eight bits into a string of seventeen bits (yes, seventeen, not fourteen!) in such a way that the smallest run length is 3 bits, the average run length is 7 bits, and the maximum runlength is 11 bits.]
    5. Entropy inequalities The order-$r$ {R\'enyi} entropy is
      H^{(r)}() = FRACTION {1}{1-r} log ( p_i^r ) , {\ for $r 1$}.
      Using Jensen's inequality, prove that, for any $r$ this function of $$ is a bound on the Shannon entropy of $$. [Hint: I found the case $r=2$ easiest to start with.]
    6. Compression One simple way to encode a binary file of length N=10,000 in which a fraction f=0.01 of the bits are 1s is to send the 100 or so positions of the 1s using 14-bit integers. This code has redundancy because there are many ways (roughly 100! = 100x99x98x...x1) of encoding one file, since the positions can be conveyed in any order. In the bits back coding method, this redundancy is exploited by piggy-backing other data onto the transmission. In this case, another source of bits can be used to choose from the 100! possible orders of the bits. The receiver can then recover the extra bits once the message has been received. Implement this bits-back method. Can you find a way to implement bits-back coding such that the `extra bits' are actually the original coded positions? (A bit like a snake swallowing itself!)
    7. Coding for a redundant language.
       [ABC] ->  2
       [DEF] ->  3
       [GHI] ->  4
       [JKL] ->  5
       [MNO] ->  6
       [PQRS] -> 7
       [TUV] ->  8
       [WXYZ] -> 9
      
      The standard mobile phone encoder
      When sending text on mobile phones, some people use the iTap/T9/predictive text system, in which, for example, HELLO is written as 43556, and, since no other word in the dictionary is encoded as 43556, the phone deduces that you wanted HELLO. However, some words collide under this iTap encoding, for example KISS and LIPS are both 5477.

      Investigate how much better the iTap system would work if a different encoding of the alphabet A-Z onto the numbers 2..9 were used.

    8. Capacity of a simple deletion channel. A channel has 8 input bits and the output is always exactly 7 bits. One of the 8 bits is deleted, at random. The other 7 are transmitted without error. (Note the difference from an erasure channel, which replaces the erased bits by "?".) Estimate the capacity of this channel. Is the optimal input distribution uniform? Find the highest rate code that you can, that communicates reliably over this channel.

    Inference

    1. Message passing
      Imagine you want to help out a dictation school, where students are given dictation, then the teacher has to count how many errors each student made in what they wrote.
      Write an inference algorithm that aligns one piece of text with another, assuming that the second one differs from the first by random insertions, deletions and substitutions. Assume a small probability of insertion per character, p_i, a similar probability of deletion, p_d, and a substitution probability p_s. Find both the most probable single alignment (using the min-sum algorithm, aka Viterbi), and the probability distribution of the alignment at any particular character in the noisy sequence (using the sum-product algorithm). Can you make a program that, as the second user writes his text, computes on the fly where the second user has probably got up to? Remember, the aim is for efficient computation, so try to make a method whose cost is at most proportional to N per iteration, where N is the length of the first piece of text.
    2. Message passing II
      Imagine that N coins, with biases pn, are tossed, once each; they are independent but have different biases. The outcome is a binary vector x. The outcomes (1/0) are summed, and the sum is c. What is the probability, given this sum, that coin n had outcome 1? Use a message-passing method to answer this question, and to generate efficiently a sample from the posterior distribution of x. [Hint: one option is to make a trellis in which the vertical coordinate is the partial sum after the nth outcome is added to the total; the horizontal coordinate runs over the n coins. Run a forward pass to find the probability of each outcome conceivable outcome c; run a backward pass from the particular value of c that occurred; then (just like the path-counting example) use the product of the forward and backward messages to pick a path through the trellis from the posterior distribution.]
    3. Metropolis method Assume the data set {x} = { 1.1, 0.5, 1.8, 2.1, 0.3 } come independently from an exponential distribution P(x|lamba) = exp(-x/lamba)/Z(lambda). Use a Metropolis method to draw samples from the posterior distribution of lambda. (Pick an appropriate prior on lambda.)
    4. Reversibility Show, by constructing a simple example, that Gibbs sampling (in which the K variables are updated in a fixed sequence) is not reversible, even though it is composed of a sequence of K transition operators that are reversible.
    5. Change point inference. Make some data that are a noisy version of a simple function with a discontinuity.
       eg, y = (t < t0) ? y0 : y1
      
      Add Gaussian noise to y to get N data points {d}. Now imagine that you do not know the time t0 of the discontinuity. Infer it.
    6. Hidden Markov Model. Implement an algorithm that, given parameters pi, A, B of a hidden Markov model, and given a data sequence x, will
      1. find the posterior probability of each state at each time [using sum-product algorithm]
      2. find the most probable state-sequence [max-product algorithm, aka min-sum, aka Viterbi]
      3. draw a sample from the posterior distribution of the hidden sequence [using information from the sum-product algorithm].
      Construct a simple example that illustrates why the sequence found by the min-sum (Viterbi) algorithm may be a poor way of representing the posterior distribution.
    7. Life in high dimensions. Consider a hypersphere of radius 1 and a hypercube of radius r (i.e., width of cube is 2r, and diameter of sphere is 2), both in K dimensions. Set K to 256. Use simple Monte Carlo methods to find
      1. The fraction of the hypersphere that is enclosed inside the hypercube as a function of r, for r from 0 to 1. How small can r go, and still give 95% of the hypersphere inside the cube?
      2. The fraction of the hypercube that is enclosed in the hypersphere as a function of r, for r from 0 to 1. When the cube contains 95% of the hypersphere, what fraction of the cube is inside the hypersphere?

      (Exercise by John Platt; for further reading see his 2004 paper on bit-indexing.)

    Decision theory

    1. In financial mathematics, it is often assumed that a stock price will vary in accordance with a random walk with a drift parameter mu and a volatility parameter sigma.
      d log S = mu dt + sigma dz
      
      where dz is a Wiener process with mean zero and variance dt. An option is a financial instrument giving the owner the right (but not the obligation) to buy or sell something. For example, a gambler interested in owning a stock in case its value goes up a lot might buy an option to buy the stock at some future time $T$ for a fixed price $K$ (known as the strike price). If the price indeed is above $K$ at the expiration time, the gambler can exercise the option and turn an instant profit of $S_T - K$ by immediately selling the stock. The gambler has to pay an appropriate price for this option. What should the price of an option be? A `European' option can be exercised only at its expiration time $T$; an `American' option may also be exercised at any time prior to $T$. Chop time up into $N$ steps of duration $T/N$ and use a discrete-time message-passing algorithm to work out the correct price for an option of each type, as a function of the strike price $K$ and the other parameters. Assume the gamblers who price options wish to maximize their expected return (not the log of their return) and for simplicity assume that the interest rate for cash is zero. (a) Model the variation in the log of the stock price by a random walk with discrete steps up or down by $log f$, with probabilities $p_{+}$ and $p_{-}$ respectively. [For thoroughness, if you want, show that $log f$ and $p_{+} 1/2 + epsilon$ are given in terms of the other parameters and $alpha = mu*mu*T/(N*sigma^2)$ by
      	epsilon = 0.5 / sqrt( 1.0 + 1.0/alpha )
      
      	log f = sigma^2/mu * alpha / (2.0 * epsilon) ]
      
      Work out the expected value of the European option by propagating backward from the leaves of the tree of possible prices. (b) For the American option, your message-passing algorithm should keep track, for every node in the tree, of 1) the payoff if the option is exercised at that time; 2) the expected payoff if the option is not exercised at that time, and at subsequent times the optimal decisions are made. 3) the value of the option (which is the max of those two quantities).
    2. Using information content to get insight into algorithm complexity. Consider an algorithm that sorts a bunch of objects by calling a subroutine that compares two of them for you, returning either "+" or "-". For simplicity, assume there are no ties.. How many times must an algorithm call the binary comparison operation to be able to guarantee that it will correctly sort the list? [Hint: Think of the initial state as being a permutation, which we are conveying to someone. How much information is there in a permutation?]

    Neural networks

    1. Another derivation of the Hopfield network
      Think of associative memory as an inference problem. Imagine that you know that a set of N memories {x} (with elements {+/-1} have given rise to `data' in the form of the Hebbian weight matrix W = sum x x^T. Imagine furthermore that a noisy version y of one of the memories x0 is provided. Your task now is to infer x0 given the available information, W, y, and N.
      Some approximations are required in order to get a simple answer out. Write down the posterior of x0. The contribution from y to the likelihood of x0 is a simple bias. The central quantity to think about is the other likelihood factor, P(W|x0). Approximate this quantity by assuming that, conditional on x0, W_ij is Gaussian-distributed with mean
      mu=x0_i x0_j
      and variance (N-1) (treating the contributions from the (N-1) other memories as independent). Approximate the likelihood factor P(W|x0) by
      prod_{ij} P(W_{ij}|x0).
      [An approximation since in fact the other patterns {x} introduce weak dependencies among the W_{ij}.]
      Now write down dynamics for each variable x^0_i that greedily maximize the posterior probability P(x0|W,y). What have you got?
      Thanks to Máté Lengyel for suggesting this exercise.

    Extra topics I would have liked to include in the book.

    1. Some notes on compression using runlength codes (one page)
      | postscript | pdf |
    2. Randomized algorithms
    3. The Swendsen-Wang algorithm | pdf |
    4. Cycle graphs and random permutations | pdf |
    5. Counting trees (coming soon)

    Randomized Algorithms

    Consider the task of sorting N objects (eg numbers). A possible randomized algorithm works as follows.

    1. Pick one of the objects at random, and pretend that it is the median of the set. Compare all (N-1) others with it, thus dividing the others into two sets of size A and B. Certainly A+B = N-1; if we got lucky, A ~= N/2. Maybe, typically (in some sense), A ~= N/3 and B ~= 2N/3, or vice versa.
    2. Repeat step 1 on each of the two sets (until A = 1 and B = 1).

    If we get lucky at every step (i.e., we hit the true median every time) then the number of comparisons of objects with each other will be pretty accurately

    Cmin = N log2 N.

    Failing to hit the median means that our binary tree will be lopsided, and more comparisons will be needed. How many comparisons?

    We still expect something that scales as N ln N, but the base of the logarithm will be different. The "2" in the "log2" in

    Cmin(N) = N log2 N
    is there because the perfect median choice makes a balanced binary tree, and log2 N is the depth of that tree. The intuition is: When we make an unbalanced tree, the "2" changes to a number smaller than 2; as we'll see, it turns out to be roughly 1.6.
    Crandomized algorithm ~= N log1.6 N

    There are two rather different ways to prove this result.

    Recurrence relation proof

    Let the average cost, in comparisons, of sorting N items by the randomized algorithm be T(N). Then, considering the first step, with the pseudo-median being drawn at random with u being uniform distributed between 0 and 1,

    T(N) ~= N + < T(uN) > + < T((1-u)N) >
    where < T((1-u)N) > denotes the average of T((1-u)N), and we've neglected the distinction between N and N-1.

    We need to solve this recurrence relation, with boundary condition T(1)=T(0)=0. Let's guess that

    T(N) = a N ln N
    and solve for a Then
    T(N) ~= N + T(N) - a N < H_2^(e)(u) >
    For consistency this gives
    1 = a < H_2^(e)(u) >
    Carrying out the integral we find the neat result:
    a = 2.
    So the base of the logarithm - the number formally known as `1.6' - is in fact sqrt(e).

    `Means add' proof

    Another neat derivation of the cost of the randomized sort algorithm, which gives little insight into the correct answer, but generates it quite easily, runs as follows.

    Imagine that the numbers are already sorted, and consider two numbers whose ranks in the list of size $N$ are $m$ and $n$, with $m < n$. When we apply the algorithm, what is the chance $c_{mn}$ that $m$ will, at some point along the way, get compared with $n$? Well, if any of the $n-m-1$ numbers between $m$ and $n$ gets selected before both $m$ and $n$, then the subsequent comparisons will separate $m$ and $n$ into separate sets, so that they never get compared with each other. On the other hand, if one of $m$ and $n$ is selected before any of those $n-m-1$ numbers, then $m$ and $n$ will instantly be compared with each other in the next comparison step. Thus

    c_{mn} = FRACTION {2}{n-m+1} ;
    and the expected number of comparisons is the sum over all pairs,
    C(N) = {m,n: m < n} FRACTION {2}{n-m+1}


    Comparison of Information Theory, Inference, and Learning Algorithms with Harry Potter

    OK, you're tempted to buy MacKay's book, but you're not sure whether it's the best deal around?

    Let's compare it with another textbook with a similar sales rank.

    Information Theory, Inference, and Learning Algorithms
    Information Theory, Inference, and Learning Algorithms
    David J.C. MacKay
    images/Potter
    Harry Potter and the Philosopher's Stone
    J.K Rowling
    Sales rank
    amazon.co.uk, Mon 5/1/04
    5,667 2,486
    List Price (UK) £30.00 £11.99
    Number of pages 640 223
    Cost per page 4.6p 5.4p
    Has large margins for making notes? Yes No
    Includes comprehensive index? Yes No
    Number of tables, figures, and algorithms 462 0
    Free software provided for use of teachers and students? Yes No
    Number of exercises More than 400 None
    Entire book available for free on-line viewing? Yes No
    Also available in paperback? No Yes
    Available in Latin translation? No Yes
    Available from Barnes and Noble?
    (on Mon 5/1/04)
    Yes Only in Latin and Welsh translations
    Table 1
    The comparisons in Table 1 show that the popular textbook by Rowling has a lower cover price than MacKay's new work. However, the prospective buyer should take into account the significantly greater length of MacKay's text: both textbooks have a cost per page of about 5 pence.

    In terms of user-friendliness, MacKay's text has the edge when it comes to its thorough index, the rich use of figures, the provision of free software for the use of the reader, and the profusion of examples and exercises, many with worked solutions.

    A strong selling point for MacKay's book is its free availability for online viewing. Unlike Rowling, who is notoriously secretive about her textbooks before they are published, MacKay made his entire book available online while it was a work in progress.

    A possible weakness of MacKay's product is that it is only available in hardback. Native speakers of Latin may also prefer Rowling's text, since a full translation into Latin is now available.

    The issue that must clinch the choice, however, is availability. Our reviewer tried to purchase both texts from Barnes and Noble on Mon 5/1/04, and found that the Rowling text is not available.1

    In conclusion, we can give a cautious recommendation of Harry Potter only to speakers of Welsh and Latin; for everyone else, we recommend Information Theory, Inference, and Learning Algorithms.


    Footnotes
    [1] Rowling's text has however been translated into American, and released under a different title.


    Any Questions?

    If you have a query about the book that you think other people might also ask, please use one of the following links to submit your query through metaFAQ; you may find the answer is already there:
    Query categories
  • Any other query

  • Site last modified Thu Sep 30 20:34:34 BST 2004