% This document is (c) David J.C. MacKay, 2001
%
% It originates from http://www.inference.phy.cam.ac.uk/mackay/itprnn/
%                    http://www.inference.phy.cam.ac.uk/mackay/itprnn/book.html
%
% It contains the text of David MacKay's book,
%   Information theory, inference, and learning algorithms.
% (latex source)
%
% Copying and distribution of this file are NOT PERMITTED.
%
% The file is provided for convenience of anyone wishing to
% make a web-based search of the text of the book.

% was book2e.tex is now book.tex   (and still latex2e)
\documentclass[11pt]{book}%
% last minute additions
\usepackage{DJCMamssymb}%  needed for blacktriangleright Mon 10/11/03  (put in symbols instead) 
\usepackage{ragged2e}%  provides \justifying   
% end last minute additions 
\usepackage{floatflt}
%\usepackage{hangingsecnum}% makes sec numbers sit in the left margin (tried cutting out on Thu 6/11/03)
\usepackage{hangingsecnum2}% makes sec numbers sit in the left margin (modified Thu 6/11/03)
%\usepackage{mparhack} 
\usepackage{mparhackright-209}% makes all margin pars go in right margin
\usepackage{marginfig}%   Defines many macros for making various styles of figure with captions
%\usepackage{symbols}%     Provides a few math symbols (replaced with DJCMamssymb) 
%\usepackage{twoside}
\usepackage{myalgorith}%   defines the Algorithm environment as a float
                       %   Also forces fig,tab, and alg all to use a single counter
\usepackage{aside}%        defines the {aside} environment
\usepackage{chapsummary}%  helps me compile index-like objects (NOT USED)
\usepackage{chapternotes}% lots of assorted stuff
\usepackage{lsalike}%      defines citation commands 
\usepackage{booktabs}%     makes nice quality tables
\usepackage{prechapter}%   defines a chapter-like object     
\usepackage{mycaption}%    defines ``\indented'' and \@makecaption; and the notindented style used in figure captions
% additions post-Sat 5/10/02 
\usepackage{latexsym}%  needed in order to make use of the \Box command 
\usepackage{tocloft}% implements my look of table of contents  
\usepackage{tocloftcomp2}% implements my look of table of contents (was tocloftcomp until Thu 6/11/03) 
\usepackage{mychapter}% defines chapter command, including the look of the new chapter page
                      % also defines the look of the section and subsection commands  
\usepackage{mycenter}%  modifies center to reduce vertical space waste - useful for figures, etc.
\usepackage{mypart}%    modifies part to not cleardoublepage (no longer Sat 5/4/03)
\usepackage{myheadings}% redefines the pagestyle ``headings''
% \usepackage{headingmods}% redefines the pagestyle ``headings'' (similar to myheadings)
% \usepackage{myindents}% defines parindent and leftmargin
\usepackage{graphics}% enables rotating of boxes
% \usepackage{boldmathgk}%  provides bold alpha etc. (doesn't work)
% \usepackage{fixmath}%  provides bold alpha etc.  Also (I think) provides numerous sloping greeks that I don't like
\usepackage{fixmathDJCM}%  provides bold alpha etc.   Has Gamma definition cut out. and Omega
% suggested by DAG:
%\usepackage{amsmath}
%\usepackage{mathptmx}
\usepackage{DAGmathspacing}% provides smallfrac
\usepackage{boxedminipage}
\usepackage{fancybox}%  Provides ability to put verbatim text inside boxes
\usepackage{bbold}% CTAN blackboard.ps was helpful for choosing this; provides ``holey 1'' as \textbb{1}
\usepackage{epsf}% to allow use of metapost figures
%\usepackage{hyperref} % incompatible with something
% 
\usepackage{multicol}% why does CTAN refer to multicols?
%\usepackage{myindex2}%   overrides book definition of index
\usepackage{myindex}%   overrides book definition of index
\usepackage{makeidx}
\usepackage{mybibliog}
\usepackage{mygaps}% defines \eq and \puncgap and \colonspace and \puncspace
\usepackage{mytoc}% suppresses the CONTENTS headings
\makeindex
%
\newcommand{\thedraft}{7.0}% 6.6 was 2nd printing. 6.8 was when I fixed errs Tue 24/2/04 % 6.9 = Mon 28/6/04 % 6.10 =  Mon 2/8/04 % 6.11 Sun 22/8/04 % 7.0 final for 3rd printing
\renewcommand{\textfraction}{0.10}
\pagestyle{headings}
\begin{document}
\bibliographystyle{lsalikedjcmsc}%.bst
%\newcommand{\bf}{\textbf}
%\newcommand{\sf}{\textsf}
%%\newcommand{\em}{\textem}  
%\newcommand{\rm}{\textrm}
%\newcommand{\tt}{\texttt}  
%\newcommand{\sl}{\textsl}
%\newcommand{\sc}{\textsc}
% 
 
  
 
 
% chapter.tex
% 
% this contains a few common definitions for all chapters 
% of the itprnn book
% for _l1.tex:
\hyphenation{left-multi-pli-ca-tion}
\hyphenation{multi-pli-ca-tion}
% 
\newcommand{\partnoun}{Part}
\newcommand{\partone}{\partnoun\ I}
\newcommand{\datapart}{I}
\newcommand{\noisypart}{II}
\newcommand{\finfopart}{III}
\newcommand{\probpart}{IV}
\newcommand{\netpart}{V}
\newcommand{\sgcpart}{VI}

\newcommand{\hybrid}{Hamiltonian}
\newcommand{\Hybrid}{Hamiltonian}
%
% If sending book to readers -
\newcommand{\begincuttable}{}
\newcommand{\ENDcuttable}{}
% If sending to editor -
%\newcommand{\begincuttable}{\marginpar{\raisebox{-0.5in}[0in][0in]{$\downarrow$}CUTTABLE?}}
%\newcommand{\ENDcuttable}{\marginpar{\raisebox{0.5in}[0in][0in]{$\uparrow$}CUTTABLE?}}
%
\newcommand{\adhoc}{ad hoc}
\newcommand{\busstop}{bus-stop}
\newcommand{\mynewpage}{\newpage}% switch this off later Sun 3/2/02
% see also tex/inputs/itchapter.sty
% chapternotes.sty is where there is an index
\newcommand{\fN}{f\!N}
\newcommand{\exercisetitlestyle}{\sf}
%
% used in sumproduct.tex and gallager.tex
\newcommand{\Mn}{{\cal M}(n)}
\newcommand{\Nm}{{\cal N}(m)}
%\newcommand{\N}{{\cal N}}
%
% the delta function that is 1 if true (defined in notation.tex)
\newcommand{\truth}{\mbox{\textbb{1}}}
% requires:
% \usepackage{bbold}% CTAN blackboard.ps was helpful for choosing this
%
% used in gene.tex
\newcommand{\deltaf}{\delta\! f}
\newcommand{\tI}{\tilde{I}}
\newcommand{\Kp}{K_{\rm{p}}}
\newcommand{\Ks}{K_{\rm{s}}}
%
% end
% lang4.tex - distributions.tex
\newcommand{\lI}{I}
%
% clust.tex
\newcommand{\rnk}{r^{(n)}_k}
\newcommand{\hkn}{\hat{k}^{(n)}}
% good sizes: 
% -0.45: 1.25
% -0.25: 0.65
% -0.4 0.8
\newcommand{\softfig}[1]{\hspace{-0.4in}\psfig{figure=octave/kmeansoft/ps1/#1.ps,width=0.8in,angle=-90}}
\newcommand{\softtfa}[3]{\begin{tabular}{c}{$t=#2$}\\
\hspace*{-0.4in}\mbox{\psfig{figure=octave/kmeansoft/#3/#1.ps,width=1.2in,angle=-90}\hspace*{-0.2in}}\\
\end{tabular}}
\newcommand{\softtfabig}[3]{\begin{tabular}{c}{$t=#2$}\\
\hspace*{-0.6in}\mbox{\psfig{figure=octave/kmeansoft/#3/#1.ps,width=1.5in,angle=-90}\hspace*{-0.2in}}\\
\end{tabular}}
\newcommand{\softtfabigb}[3]{\begin{tabular}{c}{$t=#2$}\\
\hspace*{-0.45in}\mbox{\psfig{figure=octave/kmeansoft/#3/#1.ps,width=1.625in,angle=-90}\hspace*{-0.2in}}\\
\end{tabular}}
\newcommand{\softtf}[2]{\softtfa{#1}{#2}{ps1}}
\newcommand{\softtfbig}[2]{\softtfabig{#1}{#2}{ps1}}
\newcommand{\softtfbigb}[2]{\softtfabigb{#1}{#2}{ps1}}
\newcommand{\softtfb}[2]{\softtfa{#1}{#2}{ps3}}
\newcommand{\softtfbbig}[2]{\softtfabigb{#1}{#2}{ps3}}
\newcommand{\softfc}[1]{\begin{tabular}{c}%
\hspace*{-0.2in}\mbox{\psfig{figure=octave/kmeansoft/ps5/#1.ps,width=1.32in,angle=-90}\hspace*{-0.2in}}\\
\end{tabular}}
% end
%
% used in _p1 and _l2
\newcommand{\hpheight}{26mm}
\newcommand{\wow}{\marginpar{{\Huge{$*$}}}}
%\newcommand{\wow}{\marginpar{\raisebox{-12pt}{\psfig{figure=figs/wow.eps,width=1in}}}}
%
% used in _l1.tex:::::::
\renewcommand{\q}{{f}}
\newcommand{\obr}[3]{\overbrace{{#1}\,{#2}\,{#3}}}
\newcommand{\ubr}[3]{\underbrace{{#1}\,{#2}\,{#3}}}
\newcommand{\nbr}[3]{{{#1}\,{#2}\,{#3}}}
%
% for \mid and gaps puncgap etc see mygaps.sty
\newcommand{\EM}{EM}
\newcommand{\ENDsolution}{\hfill \ensuremath{\epfsymbol}\par}
\newcommand{\ENDproof}{\hfill \ensuremath{\epfsymbol}\par}
\newcommand{\Hint}{{\sf{Hint}}}
\newcommand{\viceversa}{{\itshape{vice versa}}}
\newcommand{\analyze}{analyze}
\newcommand{\analyse}{analyze}

\newcommand{\fitpath}{/home/mackay/octave/fit/ps}% used in fit.tex (gaussian fitting, octave)
% CUP style: 
\renewcommand{\cf}{cf.}
\renewcommand{\ie}{i.e.}
\renewcommand{\eg}{e.g.}
\renewcommand{\NB}{N.B.}
%
% symbols i e and d in maths (operators)
\newcommand{\im}{{\rm i}}
\newcommand{\e}{{\rm e}}
% \d is already defined
%
% needs
% \usepackage{boxedminipage}
\newenvironment{conclusionboxplain}%
{\begin{Sbox}\begin{minipage}{\textwidth}}%
{\end{minipage}\end{Sbox}\fbox{\TheSbox}}
\newenvironment{conclusionbox}%
%{\begin{Sbox}\begin{minipage}{\textwidth}}%
%{\end{minipage}\end{Sbox}\fbox{\TheSbox}}
{% see also marginfig.sty for conflicting use of this environment and its params - and for defn of fatfboxsep
\fatfboxsep%
\setlength{\mylength}{\textwidth}%
\addtolength{\mylength}{-2\fboxsep}%
\addtolength{\mylength}{-2\fboxrule}%
\vskip8pt\noindent\begin{Sbox}\begin{minipage}{\mylength}\hspace*{-\fboxsep}\hspace*{-\fboxrule}%
\hspace*{\leftmargini}\begin{minipage}{\textwidthlessindents}}%
{\end{minipage}\end{minipage}\end{Sbox}\shadowbox{\TheSbox}\resetfboxsep\vskip 1pt}
\newenvironment{oldconclusionbox}%
{\vskip 0.1pt \noindent\rule{\textwidth}{0.1pt}\vskip -18pt\begin{quote}\vskip -8pt}%
{\end{quote}\vskip -14pt \noindent\rule{\textwidth}{0.1pt}\vskip 6pt}
% {\vskip 0.1pt \noindent\rule{\textwidth}{0.1pt}\vskip -12pt\begin{quote}}%
% {\end{quote}\vskip -12pt \noindent\rule{\textwidth}{0.1pt}}
\newcommand{\dy}{\d y}
\newcommand{\plus}{+}
\newcommand{\Wenglish}{Wenglish}% winglish
\newcommand{\wenglish}{\Wenglish}% winglish
\newcommand{\percent}{{per cent}}% in USA only: percent
%
%\newcommand{\nonexaminable}{$^{*}$}
\newcommand{\nonexaminable}{}
%
% for exact sampling chapter
\newcommand{\envelope}{summary state}
%
\def\unit#1{\,{\rm #1}}
\def\cm{\unit{cm}}
\def\grams{\unit{g}}
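% Usage sketch (hypothetical sentence, not taken from the book): \unit
% prepends a thin space and sets its argument in roman type, so \cm and
% \grams are meant to follow a number, e.g. in math mode:
%   A rod of length $12\cm$ and mass $3\grams$.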
% this is a 209 versus 2e problem: (huffman.latex edited instead)
%\def\tenrm{\rm}
%\def\tenit{\it}
%
% other problems: \pem
\renewcommand{\textfraction}{0.1}
%
% for use in free text: 
\newcommand{\bits}{{\rm bits}}
\newcommand{\bita}{{\rm bit}}
% for use in equations or in '1 bit'
\newcommand{\ubits}{\,{\bits}}
\newcommand{\ubit}{\,{\bita}}
%
%
%
% ch 2:
\newcommand{\sixtythree}{{\tt sixty-three}}
\newcommand{\aep}{`asymptotic equipartition' principle}
%
% used in alpha: 
\newcommand{\sla}{\sqrt{\lambda_a}}
\newcommand{\kga}{\kappa\gamma}
\newcommand{\kkgg}{\kappa^2\gamma^2}
\newcommand{\skg}{\sqrt{\kappa\gamma}}
\newcommand{\TYP}{{\rm \scriptscriptstyle TYP}}
%
\newcommand{\bb}{{\bf b}} 
%
% used in ising.tex and _s4.tex
% J=+1 are in states1, J=-1 are in states
%\newcommand{\risingsample}[1]{\psfig{figure=isingfigs/states1/#1.ps,width=1.82in}}
\newcommand{\risingsample}[1]{\psfig{figure=isingfigs/states1/#1.ps,width=1in}}% was 1.75
\newcommand{\smallrisingsample}[1]{\psfig{figure=isingfigs/states1/#1.ps,width=0.6in}}% was 1.2 was 0.9
\newcommand{\Hisingsample}[1]{\psfig{figure=isingfigs/states1/#1.ps,width=2.6in}}
\newcommand{\hisingsample}[1]{\psfig{figure=isingfigs/states/#1.ps,width=2.6in}}
\newcommand{\bighisingsample}[1]{\psfig{figure=isingfigs/states/#1.ps,width=3.86in}}
%
% used in _noiseless.tex
\newcommand{\Connectionmatrix}{Connection matrix}
\newcommand{\connectionmatrix}{connection matrix}
\newcommand{\connectionmatrices}{connection matrices}
%\newcommand{\cwM}{M}% codeword number
%\newcommand{\cwm}{m}% codeword number
\newcommand{\cwM}{S}% codeword number
\newcommand{\cwm}{s}% codeword number
\newcommand{\sa}{\alpha}% signal amplitude in gaussian channel
%
\newcommand{\cmA}{A}% connection matrix symbol
\newcommand{\bcmA}{{\bf \cmA}}% connection matrix symbol
\newcommand{\bAcm}{{\bcmA}}
\newtheorem{ctheorem}{Theorem}[chapter]
\newtheorem{definc}{Definition}[chapter]
\newcommand{\appendixref}[1]{Appendix \ref{#1}}
\newcommand{\appref}[1]{Appendix \ref{#1}}
\newcommand{\Appendixref}[1]{Appendix \ref{#1}}
\newcommand{\sectionref}[1]{section \ref{#1}}
\newcommand{\Sectionref}[1]{Section \ref{#1}}
\newcommand{\secref}[1]{section \ref{#1}}
\newcommand{\Secref}[1]{Section \ref{#1}}
\newcommand{\chapterref}[1]{Chapter \ref{#1}}
\newcommand{\Chapterref}[1]{Chapter \ref{#1}}
\newcommand{\chref}[1]{Chapter \ref{#1}}
\newcommand{\Chref}[1]{Chapter \ref{#1}}
\newcommand{\chone}{\ref{ch.one}}
\newcommand{\chtwo}{\ref{ch.two}}
\newcommand{\chthree}{\ref{ch.three}}
\newcommand{\chfour}{\ref{ch.four}}
\newcommand{\chfive}{\ref{ch.five}}
\newcommand{\chsix}{\ref{ch.six}}
\newcommand{\chseven}{\ref{ch.ecc}}
\newcommand{\cheight}{\ref{ch.bayes}}
\newcommand{\chthirteen}{\ref{ch.single.neuron.class}}% single neuron
\newcommand{\chfourteen}{\ref{ch.single.neuron.bayes}}% single neuron bayes? 
\newcommand{\chtwelve}{\ref{ch.nn.intro}}% intro to nn
\newcommand{\chcover}{\ref{ch.cover}}
\newcommand{\chbayes}{\ref{ch.bayes}}
\newcommand{\secpulse}{\ref{sec.pulse}}% 7.2.1?}
\newcommand{\secthirteenthree}{13.3?}
\newcommand{\secmetrop}{\ref{sec.metrop}}% 11.3?}
\newcommand{\figooo}{?1.11?}
\newcommand{\eqgamma}{8.27?}
\newcommand{\TSP}{travelling salesman problem}
\newcommand{\Bayes}{Bayes'}
\newcommand{\vfe}{variational free energy}
\newcommand{\vfem}{variational free energy minimization}
% could make this \ch6 = \ref{ch6}
% author, title etc is in here....
% {headerinfo.tex}% uses special commands
\setcounter{secnumdepth}{2}%
\newcommand{\indep}{\bot}% upside down pi desired
\newcommand{\dbf}{\slshape}% boldface in definitions
\newcommand{\dem}{\slshape}% emphasized definitions in text
\newcommand{\solutionb}[2]{\setcounter{solution_number}{#1}
\solutiona{#2}}
\newcommand{\lsolution}[2]{\section{Solution to exercise {#1}}{#2}}
%
%
\newcommand{\FIGS}{/home/mackay/book/FIGS}
\newcommand{\bookfigs}{/home/mackay/book/figs}
\newcommand{\figsinter}{/home/mackay/handbook/figs/inter}
\newcommand{\exburglar}{\exerciseref{ex.burglar}}
\newcommand{\exnine}{\exerciseref{ex.invP}}%10}
\newcommand{\exseven}{\exerciseonlyref{ex.weigh}}% use deprecated!
% was \exseven ....  \exerciseref{ex.expectn}}%9}
\newcommand{\exaseven}{\exerciseref{ex.R9}}%{7}
\newcommand{\exten}{\exerciseref{ex.expectng}}%{11}
\newcommand{\exfourteen}{\exerciseref{ex.Hadditive}}%{15}
\newcommand{\exfifteen}{\exerciseref{ex.Hcondnal}}%{16}
\newcommand{\exeighteen}{\exerciseref{ex.Hmutualineq}}%{19}
\newcommand{\extwenty}{\exerciseref{ex.rel.ent}}%{21}
\newcommand{\extwentyone}{\exerciseref{ex.joint}}%{22}% the joint ensemble
\newcommand{\extwentytwo}{\exerciseref{ex.dataprocineq}}%{23}
\newcommand{\extwentythree}{\exerciseref{ex.zxymod2}}%{24}
\newcommand{\extwentyfour}{\exerciseref{ex.waithead}}%{25}
\newcommand{\extwentyfive}{\exerciseref{ex.sumdice}}%{26}
\newcommand{\extwentysix}{\exerciseref{ex.RN}}%{27}
\newcommand{\extwentyseven}{\exerciseref{ex.RNGaussian}}%{28}
\newcommand{\exthirtyone}{\exerciseref{ex.logit}}%{32}% logistic
\newcommand{\exthirtysix}{\exerciseref{ex.exponential}}%{37}% 
\newcommand{\exthirtyseven}{\exerciseref{ex.blood}}%{38}% forensic
\newcommand{\exfiftythree}{\exerciseref{ex.}}%{53}% integers
\newcommand{\eqsixteenfive}{16.5}
\newcommand{\Kraft}{Kraft}% Kraft--McMillan
\newcommand{\exrelent}{\exerciseref{ex.rel.ent}}%{20} %% \ref{ex.rel.ent}
\newcommand{\eqKL}{1.24} %% \eqref{eq.KL}
\newcommand{\bSigma}{{\mathbf{\Sigma}}}
\newcommand{\sumproduct}{sum-product}
%
% for cpi material
%
\newcommand{\sigbias}{\sigma_{\rm bias}}
\newcommand{\sigin}{\sigma_{\rm in}}
\newcommand{\sigout}{\sigma_{\rm out}}
\newcommand{\abias}{\alpha_{\rm bias}}
\newcommand{\ain}{\alpha_{\rm in}}
\newcommand{\aout}{\alpha_{\rm out}}
%\newcommand{\bff}{\bf}
\newcommand{\handfigs}{/home/mackay/handbook/figs} 
\newcommand{\mjofigs}{/home/mackay/figs/mjo} 
\newcommand{\FIGSlearning}{/home/mackay/book/FIGS/learning}
\newcommand{\codefigs}{/home/mackay/_doc/code/ps/ps} 
%
% mncEL stuff
%
\newcommand{\ebnowide}[1]{\mbox{\psfig{figure=../../code/#1.ps,width=2.8in,angle=-90}}}
\newcommand{\fem}{m}
\newcommand{\feM}{M}
\newcommand{\fel}{n}
\newcommand{\feL}{N}
\renewcommand{\L}{N}
\newcommand{\feLm}{{\cal N}(m)}
\newcommand{\feMl}{{\cal M}(n)}
\newcommand{\feK}{N}
\newcommand{\fek}{n}
\newcommand{\feKn}{{\cal N}(m)}
\newcommand{\feNk}{{\cal M}(n)}
\newcommand{\feN}{M}
\newcommand{\fen}{m}
\newcommand{\fer}{r}
\newcommand{\GL}{GL}
\newcommand{\SMN}{GL}
\newcommand{\NMN}{MN}
\newcommand{\MN}{MN}
\renewcommand{\check}{check}% was relationship
\newcommand{\checks}{checks}% was relationship
\newcommand{\fs}{f_{\rm s}}
\newcommand{\fn}{f_{\rm n}}
\newcommand{\llncspunc}{.}
\newcommand{\query}{\mbox{{\tt{?}}}}
\newcommand{\lcA}{{H}} 
\newcommand{\rmncNall}{/home/mackay/_doc/code/rmncNall} 
\newcommand{\oneA}{1A}
\newcommand{\twoA}{2A}
\newcommand{\thrA}{2A}
\newcommand{\oneB}{1B}
\newcommand{\twoB}{2B}
\newcommand{\thrB}{2B}
\newcommand{\bndips}{/home/mackay/_doc/code/bndips}
\newcommand{\codeps}{/home/mackay/_doc/code/ps}
\newcommand{\equalnode}{\raisebox{-1pt}[0in][0in]{\psfig{figure=figs/gallager/equal.eps,width=8pt}\hspace{0mm}}}
\newcommand{\plusnode}{\raisebox{-1pt}[0in][0in]{\psfig{figure=figs/gallager/plus.eps,width=8pt}\hspace{0mm}}}
%
% Mon 26/5/03 modified this to try to centre the left heading
\newcommand{\fourfourtable}[9]{\begin{tabular}[b]{lcc@{\hspace{4pt}}c}
          \multicolumn{1}{l}{#1:} &  & \multicolumn{2}{c}{#2} \\[-0.1in]%  \cline{1-1}
                                  &                        & {#3} & {#4} \\ \cline{3-4}
\raisebox{-6.5pt}[0pt][0pt]{{#5}} &\multicolumn{1}{l|}{#3} & {#6} & {#7} \\[-7pt] 
                                  &\multicolumn{1}{l|}{#4} & {#8} & {#9} \\ 
\end{tabular}}
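% Usage sketch for \fourfourtable (hypothetical values, not taken from the book).
% Reading the definition above, the nine arguments appear to be, in order:
% the table label (typeset with a trailing colon), the column-variable heading,
% the two outcome labels (reused for both the column headers and the row labels),
% the row-variable heading, and the four cell entries in row-major order.
%   \fourfourtable{$P(x,y)$}{$y$}{0}{1}{$x$}{0.4}{0.1}{0.2}{0.3}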
% Mon 26/5/03 extra version with heading right aligned and space reduced between col 1 and 2
\newcommand{\fourfourtabler}[9]{\begin{tabular}[b]{r@{}cc@{\hspace{4pt}}c}
          \multicolumn{1}{l}{#1:} &  & \multicolumn{2}{c}{#2} \\[-0.1in]%  \cline{1-1}
                                  &                        & {#3} & {#4} \\ \cline{3-4}
\raisebox{-6.5pt}[0pt][0pt]{{#5}} &\multicolumn{1}{l|}{#3} & {#6} & {#7} \\[-7pt] 
                                  &\multicolumn{1}{l|}{#4} & {#8} & {#9} \\ 
\end{tabular}}
\newcommand{\fourfourtablebeforemaythree}[9]{\begin{tabular}[b]{lcc@{\hspace{4pt}}c}
 \multicolumn{1}{l}{#1:} &  & \multicolumn{2}{c}{#2} \\[-0.1in]%  \cline{1-1}
 & & {#3} & {#4} \\ \cline{3-4}
{#5} &\multicolumn{1}{l|}{#3} & {#6} & {#7} \\[-7pt] 
     &\multicolumn{1}{l|}{#4} & {#8} & {#9} \\ 
\end{tabular}}
\newcommand{\fourfourtableb}[9]{\begin{tabular}[b]{l|c@{\hspace{1pt}}c@{\hspace{3pt}}c}
 {#1} & {#2}  & {#3} & {#4} \\ \cline{1-1}\cline{3-4}
\multicolumn{2}{l}{#5} & & \\ 
\multicolumn{1}{l|}{#3} & & {#6} & {#7} \\[-5pt] 
\multicolumn{1}{l|}{#4} & & {#8} & {#9} \\ 
\end{tabular}}
\newcommand{\fourfourtableold}[9]{\begin{tabular}[b]{l|c|c|c|}
 {#1} & {#2}  & {#3} & {#4} \\ \cline{1-1}
\multicolumn{2}{l|}{#5} & & \\ \hline
\multicolumn{2}{l|}{#3} & {#6} & {#7} \\ \hline
\multicolumn{2}{l|}{#4} & {#8} & {#9} \\ \hline
\end{tabular}}
\newcommand{\mathsstrut}{\rule[-3mm]{0pt}{8mm}}
%
% for ra.tex
%
\newcommand{\halfw}{0.35in}
\newcommand{\onew}{0.9in}%{0.7in}% used in Gallager/MN figures in ra.tex% increased Wed 9/4/03
\newcommand{\onehalfw}{1.05in}
\newcommand{\twow}{1.4in}
\newcommand{\twohalfw}{1.75in}
\newcommand{\GHfig}[1]{\psfig{figure=GHps/#1,width=\onehalfw}}% for rate 1/3
\newcommand{\GHfigone}[1]{\psfig{figure=GHps/#1,width=\onew}}%
\newcommand{\GHfigthird}[1]{\psfig{figure=GHps/#1,width=\halfw}}
\newcommand{\GHfigquarter}[1]{\psfig{figure=GHps/#1,width=\twohalfw}}
\newcommand{\GHfigtwo}[1]{\psfig{figure=GHps/#1,width=\twow}}% for rate 1/2
\newcommand{\GHfigdouble}[1]{\psfig{figure=GHps/#1,width=\twohalfw}}% for five wide
% extra wide fitting::::::::::: (for turbo)
\newcommand{\GHfigdoubleE}[1]{\psfig{figure=GHps/#1,width=2in}}% for five wide
\newcommand{\GHfigE}[1]{\psfig{figure=GHps/#1,width=1.2in}}% for rate 1/3
%
\newcommand{\GHdrawfig}[1]{\psfig{figure=GHps/#1,width=1.5in}}% was 1.8
\newcommand{\standardfig}[1]{\psfig{figure=rirreg/#1,width=1.8in,angle=-90}}
\newcommand{\loopsfig}[1]{\psfig{figure=rirreg/loops.#1,height=1.85in,width=1.8in,angle=-90}}
\newcommand{\titledfig}[2]{\begin{tabular}{c}%
{#1}\\%
\standardfig{#2}\\%
\end{tabular}%
}

%
% for the single neuron chapters
%
\newcounter{funcfignum}
\setcounter{funcfignum}{1}
\newcommand{\funcfig}[2]{
	\put(#1,#2){\makebox(0,0)[b]{
	\begin{tabular}{@{}c@{}}
		\psfig{figure=\FIGSlearning/f.#1.#2.ps,height=1.3in,width=1.3in,angle=-90} \\[-0.15in]
 			$\bw = (#1,#2)$		
\\	\end{tabular}
		}
	}
}
\newcommand{\wflatfig}[1]{
	\begin{tabular}{@{}c@{}}\setlength{\unitlength}{1in}\begin{picture}(1.5,1.3)(0.30,0.40)
		\psfig{figure=\FIGSlearning/#1,height=2.43in,width=2.064in,angle=-90}
% was 1.3,1.3
	\end{picture}\\\end{tabular}
}
\newcommand{\wsurfig}[1]{
	\begin{tabular}{@{}c@{}}\setlength{\unitlength}{1in}\begin{picture}(1.5,1.5)(0,0)
		\psfig{figure=\FIGSlearning/#1,height=1.8in,width=1.8in,angle=-90}
% was 1.5,1.5
	\end{picture}\end{tabular}
}
\newcommand{\datfig}[1]{
	\begin{tabular}{@{}c@{}}\setlength{\unitlength}{1in}\begin{picture}(1,1)(0.30,0.1)
		\psfig{figure=\FIGSlearning/#1,height=1.2in,width=1.412in,angle=-90}
% was 1,1
	\end{picture}\end{tabular}
}
\newcommand{\optens}{optimal input distribution}% used in l5.tex, l6.tex, s5.tex
\newcommand{\dilbertcopy}{{[Dilbert image Copyright\copyright{1997} United Feature Syndicate, Inc.,
 used with permission.]}}
\newcommand{\Rnine}{\mbox{R}_9}
\newcommand{\Rthree}{\mbox{R}_3}
\newcommand{\eof}{{\Box}}
\newcommand{\teof}{\mbox{$\Box$}}% for use in text
\newcommand{\ta}{{\tt{a}}}
\newcommand{\tb}{{\tt{b}}}
%\newcommand{\dits}{dits}
%\newcommand{\dit}{dit}
\newcommand{\disc}{disk}
\newcommand{\dits}{bans}
\newcommand{\dit}{ban}
%
% used in l5
%
\newcommand{\BSC}{binary symmetric channel}
\newcommand{\BEC}{binary erasure channel}
\newcommand{\subsubpunc}{}% change to . if subsubsections are given in-line headings
%
% convolutional code definitions
%
\newcommand{\cta}{t^{(a)}}
\newcommand{\ctb}{t^{(b)}}
\newcommand{\z}{z}
\newcommand{\lfsr}{linear-feedback shift-register} 
%
% definitions for including hinton diagrams from extended directory
%
\newcommand{\ecfig}[1]{\psfig{figure=extended/ps/#1.ps,silent=}}
% extra argument
\newcommand{\ecfigb}[2]{\psfig{figure=extended/ps/#1.ps,#2,silent=}}
%
% used in _s1 and in _linear maybe
%%%%%%%%%%% see /home/mackay/code/bucky
\newcommand{\buckypsfig}[1]{\mbox{\psfig{figure=buckyps/#1,width=1.2in}}}
\newcommand{\buckypsfigw}[1]{\mbox{\psfig{figure=buckyps/#1,width=1.75in}}}
\newcommand{\buckypsgraph}[1]{\mbox{\psfig{figure=buckyps/#1,width=1.2in,angle=-90}}}
\newcommand{\buckypsgraphb}[1]{\mbox{\psfig{figure=buckyps/#1,width=1.75in,angle=-90}}}
\newcommand{\buckypsgraphB}[1]{\mbox{\psfig{figure=buckyps/#1,width=2.2in,angle=-90}}}
%%%%%%%%%%%%%%

%%%%%%%%%%%%%%%%%%
% for l1a
%%%%%%%%%%%%%%%%%%
% example
% \bigrampicture{3.538mm}{hd_conbigram.ps}
% \bigrampicture{3.538mm}{hd_conbigram.ps,width=278pt}%%%%%%% 278 is the original size
% This used to work fine in latex209 then needed rejigging in 2e.
% (alignment of g,j,p,q,y wrong at the bottom) (saved to graveyard.tex)
\newcommand{\bigrampicture}[3]%args are unitlength,picturename-and-picturesize,font-request
{%%%%%%%%%
\setlength{\unitlength}{#1}
\begin{picture}(30,30)(0,-30)% was 28,28   0,-28
\put(0.15,-27.8){\makebox(0,0)[bl]{\psfig{figure=bigrams/#2,angle=-90}}}
\put(1,-29){\makebox(0,0)[b]{{#3\tt a}}}
\put(2,-29){\makebox(0,0)[b]{{#3\tt b}}}
\put(3,-29){\makebox(0,0)[b]{{#3\tt c}}}
\put(4,-29){\makebox(0,0)[b]{{#3\tt d}}}
\put(5,-29){\makebox(0,0)[b]{{#3\tt e}}}
\put(6,-29){\makebox(0,0)[b]{{#3\tt f}}}
\put(7,-29){\makebox(0,0)[b]{\raisebox{0mm}[0mm][0mm]{#3\tt g}}}
\put(8,-29){\makebox(0,0)[b]{{#3\tt h}}}
\put(9,-29){\makebox(0,0)[b]{{#3\tt i}}}
\put(10,-29){\makebox(0,0)[b]{\raisebox{0mm}[0mm][0mm]{#3\tt j}}}
\put(11,-29){\makebox(0,0)[b]{{#3\tt k}}}
\put(12,-29){\makebox(0,0)[b]{{#3\tt l}}}
\put(13,-29){\makebox(0,0)[b]{{#3\tt m}}}
\put(14,-29){\makebox(0,0)[b]{{#3\tt n}}}
\put(15,-29){\makebox(0,0)[b]{{#3\tt o}}}
\put(16,-29){\makebox(0,0)[b]{\raisebox{0mm}[0mm][0mm]{#3\tt p}}}
\put(17,-29){\makebox(0,0)[b]{\raisebox{0mm}[0mm][0mm]{#3\tt q}}}
\put(18,-29){\makebox(0,0)[b]{{#3\tt r}}}
\put(19,-29){\makebox(0,0)[b]{{#3\tt s}}}
\put(20,-29){\makebox(0,0)[b]{{#3\tt t}}}
\put(21,-29){\makebox(0,0)[b]{{#3\tt u}}}
\put(22,-29){\makebox(0,0)[b]{{#3\tt v}}}
\put(23,-29){\makebox(0,0)[b]{{#3\tt w}}}
\put(24,-29){\makebox(0,0)[b]{{#3\tt x}}}
\put(25,-29){\makebox(0,0)[b]{\raisebox{0mm}[0mm][0mm]{#3\tt y}}}
\put(26,-29){\makebox(0,0)[b]{{#3\tt z}}}
\put(27,-29){\makebox(0,0)[b]{{#3--}}}
% they used to be at height -29 and were aligned  bottom
%\put(27,-29){\makebox(0,0)[b]{{#3\verb+-+}}}
%      
\put(29,-29){\makebox(0,0)[r]{#3$y$}}
%
\put(-0.2,-1){\makebox(0,0)[r]{{#3\tt a}}}
\put(-0.2,-2){\makebox(0,0)[r]{{#3\tt b}}}
\put(-0.2,-3){\makebox(0,0)[r]{{#3\tt c}}}
\put(-0.2,-4){\makebox(0,0)[r]{{#3\tt d}}}
\put(-0.2,-5){\makebox(0,0)[r]{{#3\tt e}}}
\put(-0.2,-6){\makebox(0,0)[r]{{#3\tt f}}}
\put(-0.2,-7){\makebox(0,0)[r]{{#3\tt g}}}
\put(-0.2,-8){\makebox(0,0)[r]{{#3\tt h}}}
\put(-0.2,-9){\makebox(0,0)[r]{{#3\tt i}}}
\put(-0.2,-10){\makebox(0,0)[r]{{#3\tt j}}}
\put(-0.2,-11){\makebox(0,0)[r]{{#3\tt k}}}
\put(-0.2,-12){\makebox(0,0)[r]{{#3\tt l}}}
\put(-0.2,-13){\makebox(0,0)[r]{{#3\tt m}}}
\put(-0.2,-14){\makebox(0,0)[r]{{#3\tt n}}}
\put(-0.2,-15){\makebox(0,0)[r]{{#3\tt o}}}
\put(-0.2,-16){\makebox(0,0)[r]{{#3\tt p}}}
\put(-0.2,-17){\makebox(0,0)[r]{{#3\tt q}}}
\put(-0.2,-18){\makebox(0,0)[r]{{#3\tt r}}}
\put(-0.2,-19){\makebox(0,0)[r]{{#3\tt s}}}
\put(-0.2,-20){\makebox(0,0)[r]{{#3\tt t}}}
\put(-0.2,-21){\makebox(0,0)[r]{{#3\tt u}}}
\put(-0.2,-22){\makebox(0,0)[r]{{#3\tt v}}}
\put(-0.2,-23){\makebox(0,0)[r]{{#3\tt w}}}
\put(-0.2,-24){\makebox(0,0)[r]{{#3\tt x}}}
\put(-0.2,-25){\makebox(0,0)[r]{{#3\tt y}}}
\put(-0.2,-26){\makebox(0,0)[r]{{#3\tt z}}}
\put(-0.2,-27){\makebox(0,0)[r]{{#3--}}}
%\put(-0.2,-27){\makebox(0,0)[r]{{#3\verb+-+}}}

\put(-0.2,1){\makebox(0,0)[r]{#3$x$}}
\end{picture}
}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% used in ch 1:
\newcommand{\pB}{p_{\rm B}}
\newcommand{\pb}{p_{\rm b}}
% from theorems.tex for exact.tex
\newcommand{\PGB}{p^{\rm G}_{\rm B}}
\newcommand{\PGb}{p^{\rm G}_{\rm b}}
\newcommand{\PB}{p_{\rm B}}
\newcommand{\Pb}{p_{\rm b}}
%
% used in occam.tex (from nn_occam.tex)
\newlength{\minch}
\setlength{\minch}{0.82in}
\newcommand{\ostruta}{\rule[-0.07\minch]{0cm}{0.18\minch}}
\newcommand{\ostrutb}{\rule[-0.17\minch]{0cm}{0.14\minch}}
%
% sumproduct.tex
\newcommand{\gP}{P^*}
\newcommand{\xmwon}{\ensuremath{\bx_m \wo n}}
\newcommand{\xmwonb}{\ensuremath{\bx_{m \wo n}}}
% southeast.tex
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newcommand{\gridlet}[1]{\thinlines
                         \multiput(#1)(0,-2){4}{\line(1,0){7.22}}%
                         \multiput(#1)(2,0){4}{\line(0,-1){7.22}}}
%
\newcommand{\gridletfive}[1]{\thinlines
                          \multiput(#1)(0,-2){5}{\line(1,0){9.22}}%
                          \multiput(#1)(2,0){5}{\line(0,-1){9.22}}}
%
\newcommand{\piece}[1]{\put(#1){\circle*{0.872}}}
\newcommand{\opiece}[1]{\put(#1){\circle{0.872}}}
\newcommand{\movingpiece}[1]{%
\thinlines
\put(#1){\circle*{0.872}}
\put(#1){\vector(0,-1){2}}
\put(#1){\vector(1,0){2}}
}%end movingpiece
\newcommand{\lhnextposition}[2]{\hnextposition{#1}
\put(#1){\makebox(0,0)[bl]{\raisebox{2mm}{#2}}}}% labelled horizontal arrow
\newcommand{\ldnextposition}[2]{\dnextposition{#1}
\put(#1){\makebox(0,0)[tl]{\raisebox{0mm}{#2}}}}% labelled diagonal arrow
\newcommand{\hnextposition}[1]{\put(#1){\vector(1,  0){2}}}
\newcommand{\vnextposition}[1]{\put(#1){\vector(0, -1){2}}}
\newcommand{\dnextposition}[1]{\put(#1){\vector(-2,-1){4}}}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% dfountain
\newcommand{\Ripple}{\ensuremath{S}}
% deconvoln.tex
\newcommand{\noisenu}{n}

% for _s13.tex and one_neuron
\newcommand{\hammingsymbol}[7]{\setlength{\unitlength}{1.4mm}%
\begin{picture}(1,2)(0,0)%
\ifnum #1=1 \put(0,2){\line(1,0){1}} \fi%
\ifnum #2=1 \put(0,2){\line(0,-1){1}} \fi%
\ifnum #3=1 \put(1,2){\line(0,-1){1}} \fi%
\ifnum #4=1 \put(0,1){\line(1,0){1}} \fi%
\ifnum #5=1 \put(0,1){\line(0,-1){1}} \fi%
\ifnum #6=1 \put(1,1){\line(0,-1){1}} \fi%
\ifnum #7=1 \put(0,0){\line(1,0){1}} \fi%
\end{picture}%
}
\newcommand{\hammingdigit}[1]{%
\ifnum #1=6  \hammingsymbol{0}{0}{0}{1}{0}{1}{1}\fi%
\ifnum #1=14 \hammingsymbol{0}{0}{1}{0}{1}{1}{1}\fi%
\ifnum #1=2  \hammingsymbol{0}{0}{1}{1}{1}{0}{0}\fi%
\ifnum #1=1  \hammingsymbol{0}{1}{0}{0}{1}{1}{0}\fi%
\ifnum #1=10 \hammingsymbol{0}{1}{0}{1}{1}{0}{1}\fi%
\ifnum #1=12 \hammingsymbol{0}{1}{1}{0}{0}{0}{1}\fi%
\ifnum #1=4  \hammingsymbol{0}{1}{1}{1}{0}{1}{0}\fi%
\ifnum #1=11 \hammingsymbol{1}{0}{0}{0}{1}{0}{1}\fi%
\ifnum #1=0  \hammingsymbol{1}{0}{0}{1}{1}{1}{0}\fi%
\ifnum #1=7  \hammingsymbol{1}{0}{1}{0}{0}{1}{0}\fi%
\ifnum #1=13 \hammingsymbol{1}{0}{1}{1}{0}{0}{1}\fi%
\ifnum #1=5  \hammingsymbol{1}{1}{0}{0}{0}{1}{1}\fi%
\ifnum #1=9  \hammingsymbol{1}{1}{0}{1}{0}{0}{0}\fi%
\ifnum #1=3  \hammingsymbol{1}{1}{1}{0}{1}{0}{0}\fi%
\ifnum #1=8  \hammingsymbol{1}{1}{1}{1}{1}{1}{1}\fi%
}
% here in binary order.
%6 &\hammingsymbol{0}{0}{0}{1}{0}{1}{1} \\
%14&\hammingsymbol{0}{0}{1}{0}{1}{1}{1} \\
%2 &\hammingsymbol{0}{0}{1}{1}{1}{0}{0} \\
%1 &\hammingsymbol{0}{1}{0}{0}{1}{1}{0} \\
%10&\hammingsymbol{0}{1}{0}{1}{1}{0}{1} \\
%12&\hammingsymbol{0}{1}{1}{0}{0}{0}{1} \\
%4 &\hammingsymbol{0}{1}{1}{1}{0}{1}{0} \\
%11&\hammingsymbol{1}{0}{0}{0}{1}{0}{1} \\
%0 &\hammingsymbol{1}{0}{0}{1}{1}{1}{0} \\
%7 &\hammingsymbol{1}{0}{1}{0}{0}{1}{0} \\
%13&\hammingsymbol{1}{0}{1}{1}{0}{0}{1} \\
%5 &\hammingsymbol{1}{1}{0}{0}{0}{1}{1} \\
%9 &\hammingsymbol{1}{1}{0}{1}{0}{0}{0} \\
%3 &\hammingsymbol{1}{1}{1}{0}{1}{0}{0} \\
%8 &\hammingsymbol{1}{1}{1}{1}{1}{1}{1} \\
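%
% Usage sketch (an illustration, not taken from the book): the seven arguments
% of \hammingsymbol switch on the strokes of a seven-segment glyph in the order
% top, upper-left, upper-right, middle, lower-left, lower-right, bottom.
%   \hammingsymbol{1}{0}{1}{1}{1}{0}{1}% top, upper-right, middle, lower-left, bottom
%   \hammingdigit{8}%                    all seven strokes, per the table above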

\newcommand{\ldpcc}{low-density parity-check code}
%\newcommand{\Ldpc}{Low-density parity-check}% defined elsewhere

% included by l2.tex
% definitions for weighings.tex and for text
% shows weighing trees, ternary
%
% decisions of what to weigh are shown in square boxes with 126 over 345 (l:r)
% states of valid hypotheses are listed in double boxes
% or maybe dashboxes?
% three arrows, up means left heavy,  straight means right heavy, down is balance
%
\newcommand{\mysbox}[3]{\put(#1){\framebox(#2){\begin{tabular}{c}#3\end{tabular}}}}
\newcommand{\mydbox}[3]{\put(#1){\framebox(#2){\begin{tabular}{c}#3\end{tabular}}}}
\newcommand{\myuvector}[3]{\put(#1){\vector(#2){#3}}}
\newcommand{\mydvector}[3]{\put(#1){\vector(#2){#3}}}
\newcommand{\mysvector}[2]{\put(#1){\vector(1,0){#2}}}
\newcommand{\mythreevector}[4]{\myuvector{#1}{#2,#3}{#4}\mydvector{#1}{#2,-#3}{#4}\mysvector{#1}{#4}}
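% Usage sketch (hypothetical coordinates, not taken from the book): inside a
% picture environment, \mythreevector{x,y}{dx}{dy}{len} draws three arrows
% from (x,y): one with slope (dx,dy), its mirror image with slope (dx,-dy),
% and a horizontal one, each with horizontal extent len.  Since the macro
% writes -#3, dy should be given as a positive integer, and (dx,dy) must be
% a slope that the picture \vector command accepts.
%   \begin{picture}(40,40)(0,0)
%     \mythreevector{10,20}{2}{1}{8}
%   \end{picture}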
%
%\newcommand{\h1}{\mbox{$1^+$}}
%\newcommand{\l1}{\mbox{$1^-$}}
%\newcommand{\h2}{\mbox{$2^+$}}
%\newcommand{\l2}{\mbox{$2^-$}}
%\newcommand{\h3}{\mbox{$3^+$}}
%\newcommand{\l3}{\mbox{$3^-$}}
%\newcommand{\h4}{\mbox{$4^+$}}
%\newcommand{\l4}{\mbox{$4^-$}}
%\newcommand{\h5}{\mbox{$5^+$}}
%\newcommand{\l5}{\mbox{$5^-$}}
%\newcommand{\h6}{\mbox{$6^+$}}
%\newcommand{\l6}{\mbox{$6^-$}}
%\newcommand{\h7}{\mbox{$7^+$}}
%\newcommand{\l7}{\mbox{$7^-$}}
%\newcommand{\h8}{\mbox{$8^+$}}
%\newcommand{\l8}{\mbox{$8^-$}}
%\newcommand{\h9}{\mbox{$9^+$}}
%\newcommand{\l9}{\mbox{$9^-$}}
%\newcommand{\h10}{\mbox{$10^+$}}
%\newcommand{\l10}{\mbox{$10^-$}}
%\newcommand{\h11}{\mbox{$11^+$}}
%\newcommand{\l11}{\mbox{$11^-$}}
%\newcommand{\h12}{\mbox{$12^+$}}
%\newcommand{\l12}{\mbox{$12^-$}}

%\setlength{\parindent}{0mm}
\title{Information Theory,  Inference, \& Learning Algorithms}
\shortlecturetitle{}
\shortauthor{David J.C. MacKay}
% the book - called by book.tex
%
% aiming for 696 pages total 
%
% thebook.tex
% should run
%   make book.ind 
% by hand?
% Mon 7/10/02
\setcounter{exercise_number}{1} % set to imminent value 
%
\setcounter{secnumdepth}{1}    % sets the level at which subsection numbering stops
\setcounter{tocdepth}{0}
\newcommand{\mysetcounter}[2]{}%was {\setcounter{#1}{#2}}
% useful for forcing pagenumbers in drafts
%\setcounter{tocdepth}{1}    
\renewcommand{\bs}{{\bf s}}
\newcommand{\figs}{/home/mackay/handbook/figs} % while in bayes chapter 
% \addtocounter{page}{-1}
\pagenumbering{roman}
\setcounter{page}{2} % set to current value    
\setcounter{frompage}{2}% this is used by newcommands1.tex dvips operator that helps make
\setcounter{page}{1} % set to current value    
\setcounter{frompage}{1}% this is used by newcommands1.tex dvips operator that helps make
% individual chapters.
%
%  PAGE ii
%
% \chapter*{Dedication}
%\input{tex/dedicationa.tex}
%\newpage
%
% TITLE PAGE iii
%
\thispagestyle{empty} 
\begin{narrow}{0in}{-\margindistancefudge}%
\begin{raggedleft}
~\\[1.5in]
{\Large \bf Information Theory, 
 Inference,
 and Learning Algorithms\\[1in]
}
{\Large\sf David J.C. MacKay }\\
\end{raggedleft}
   \vfill 
   \mbox{}\epsfxsize=160pt\epsfbox{cuplogo.eps}% increased x size to compensate for 0.9 shrinkage later and another 10%
%   \mbox{}\epsfxsize=128pt\epsfbox{cuplogo.eps}
   \vspace*{-6pt}
\end{narrow}
\newpage 
\thispagestyle{empty} 
\begin{center}
~\\[1.5in]
{\Huge \bf Information Theory,  \\[0.2in]  
 Inference,\\[0.2in]
 and Learning Algorithms\\[1in]
}
{\Large\sf David J.C. MacKay }\\
{\tt{mackay@mrao.cam.ac.uk}}\\[0.3in]
\copyright  1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004\\[0.1in]
\copyright Cambridge University Press  2003\\[1.3in]
Version \thedraft\ (third printing) \today\\
\medskip
\medskip
\medskip
\medskip
\medskip

Please send feedback on this book via
{\tt{http://www.inference.phy.cam.ac.uk/mackay/itila/}}

\medskip
\medskip
\medskip
 Version 6.0 of this book  was  published by C.U.P.\ in September 2003. 
 It will remain viewable on-screen on the above website, in postscript, djvu, 
 and pdf formats.
\medskip
\medskip

 In the second printing (version 6.6) minor typos were corrected,
 and the book design was slightly altered to modify the placement of section numbers.
\medskip
\medskip

 In the third printing (version 7.0) minor typos were corrected, and chapter 8 was
 renamed `Dependent random variables' (instead of `Correlated').
\medskip
\medskip
\medskip

{\em (C.U.P. replace this page with their own page ii.)}

\end{center}
%\dvipsb{frontpage}
\newpage 
% choose one of these: 
% \input{cambridgefrontstuff.tex}
% \newpage
% {\em Page vi intentionally left blank.}
%
\newpage
% pages v and vi           pages vii and viii
\mytableofcontents
\dvipsb{table of contents}
% alternate
%\fakesection{Roadmap}
%\input{roadmap.tex}
%
\subchaptercontents{Preface}%{How to Use This Book}% use subchapter because this
                                    % marks the chapter name in the header, unlike chapter*{}
% \section*{How to use this book}
%{\em  [This front matter is still being written. The remainder of the book is essentially finished,
% except for typographical corrections, April 18th 2003.]}
%
% a longer version of this is in
% longabout.tex


% \section*{How to use this book}
% \section{How to use this book}
% The first question we must address is:

 This book is aimed at senior undergraduates and graduate students in
 Engineering, Science, Mathematics, and Computing. It expects
 familiarity with calculus, probability theory, and
 linear algebra as  taught in a first- or second-year
 undergraduate course on mathematics for
 scientists and engineers.


 Conventional courses on information theory 
 cover not only the beautiful {\em theoretical\/} ideas of Shannon,
 but also  {\em practical\/} solutions to \ind{communication} problems.
 This book
 goes further, bringing in Bayesian data modelling,
 Monte Carlo methods, variational
 methods, clustering algorithms, and neural networks.

 Why unify information theory and
 machine learning?
% Well,
 Because they 
% Information theory and
% machine learning
 are  two sides of the same coin.
% , so it makes sense to unify them.
% These two fields were once unified:
% It was once so:
 In the 1960s, a single field, cybernetics, was populated
 by information theorists,  computer scientists, and neuroscientists, all
 studying common problems.
 Information theory and machine learning still belong together.
 Brains are the ultimate compression and \ind{communication} systems.
 And the state-of-the-art algorithms
 for both data compression and  error-correcting codes
 use the same tools as  machine learning.
% Our brains are surely the ultimate in robust
% error-correcting information storage and recall systems.

\section*{How to use this book}
 The essential dependencies between chapters are indicated in
 the figure on the next page. An arrow from one chapter to another
 indicates that the second chapter requires some of the first.

%\section*{General points}
% The pinnacles of the book, the key chapters with the really exciting bits,
% are first  \chref{chone} (in which we meet Shannon's noisy-channel coding theorem);
% \chref{ch.six} (in which we prove it); \chref{ch.hopfield} (in which
% we meet a neural network that performs robust error-correcting
% content-addressable memory); and Chapters \ref{ch.ldpcc} and \ref{chdfountain}
% (in which we  meet beautifully simple sparse-graph codes that solve
% Shannon's communication problem).
%% honorable mention -  \chref{ch.ac},  ch.ra  /////\   exact sampling - not central.

% Do not feel daunted by this book.
% You don't need to read all of this book.
 Within  {\partnoun}s \datapart, \noisypart, \probpart, and \netpart\ of this book,  chapters
 on advanced or optional topics  are
 towards the end.
% For example, \chref{ch.codesforintegers} (Codes for Integers), \chref{ch.xword} (Crosswords  and Codebreaking)
% and  \chref{ch.sex} (Why have Sex? Information Acquisition and Evolution)
% are provided for fun.
 All  chapters of {\partnoun} \finfopart\
 are optional on a first reading, except perhaps for
 \chref{ch.message} (Message Passing).

 The same system sometimes applies within a chapter:
 the  final sections often deal with
 advanced topics that can be skipped on a first reading.
 For  example, in two key chapters --
 \chref{chtwo} ({The Source Coding Theorem}) and \chref{ch.six} ({The Noisy-Channel Coding Theorem}) --
 the first-time reader should   detour
 at \secref{sec.chtwoproof} and \secref{sec.ch6stop} respectively.

% \subsection*{Roadmaps}
 Pages \pageref{map1}--\pageref{map4} show a few ways to use this book.
 First, I give the roadmap for a  course that I teach in Cambridge:
% which embraces both information theory and machine learning.
 `Information theory, pattern recognition, and neural networks'.
%
 The book is also intended  as a textbook for 
 traditional courses in information theory.
 The second  roadmap
 shows the chapters for 
 an introductory information theory course
 and the third
 for a course aimed at an understanding of
 state-of-the-art error-correcting codes.
%
 The fourth roadmap shows how to use the text  in a
 conventional course  on machine learning.
% The diagrams on the following pages will indicate
% the dependences between chapters and 
% a few possible routes through the book.

\newpage
\begin{center}\hspace*{-0.2cm}\raisebox{2cm}{\epsfbox{metapost/roadmap.2}}\end{center}
\newpage
% \input{tex/cambroadmap.tex}
% \newpage
\begin{center}\raisebox{2cm}{\epsfbox{metapost/roadmap.3}}\end{center}
\label{map1}
\newpage
\begin{center}\raisebox{2cm}{\epsfbox{metapost/roadmap.4}}\end{center}
\newpage      
\begin{center}\raisebox{2cm}{\epsfbox{metapost/roadmap.5}}\end{center}
\newpage      
\begin{center}\raisebox{2cm}{\epsfbox{metapost/roadmap.6}}\end{center}
\label{map4}
\newpage




\section*{About the exercises}
% I firmly believe that
 You  can  understand a subject only by 
 creating it for yourself.
% To this end, you should
% I think it is essential to
 The exercises
 play an essential role in this book.
%  on each topic.
 For guidance, each
% exercise
 has a rating (similar to that used by \citeasnoun{KnuthAll})
 from 1 to 5 to indicate its difficulty.

\noindent\ratfull\hspace*{\parindent}In addition, exercises that are especially recommended
 are marked by a marginal encouraging rat.
 Some exercises that require the use of a computer are
 marked with a {\sl C}.
% will have 
% a rating such as A1, A5, C1 or C5. 
% The letter  indicates how important I think the exercise is:
% A = very important $\ldots$ C = not essential to the flow of the
% book. The number indicates the difficulty of the problem: 
% 1 = easy, 5 = research project.

% I'll circulate detailed recommendations on exercises
% as the course progresses.

 Answers to many  exercises are provided. Use them
 wisely. Where a solution is provided, this is indicated
 by including its page number
% of the solution with
 alongside the difficulty rating.

 Solutions to many of the other exercises
 will be supplied to instructors using this book in their
 teaching; please email {\tt{solutions@cambridge.org}}.
 

%\begin{table}[htbp]
%\caption[a]
\begin{realcenter}
\fbox{
\begin{tabular}{ll}
%\begin{minipage}{3in}
{\sf Summary of codes for exercises}\\[0.2in]
% \hspace{0.2in}
\begin{tabular}[b]{cl}
\dorat & Especially recommended \\[0.2in]
{\ensuremath{\triangleright}} & Recommended \\
{\sl C}   & Parts require a computer \\
{\rm [p.$\,$42]}& Solution provided on page 42  \\
\end{tabular}
%\end{minipage}
&
\begin{tabular}[b]{cl}
\pdifficulty{1} & Simple (one minute) \\
\pdifficulty{2} &  Medium (quarter hour) \\
\pdifficulty{3} &  Moderately hard \\
\pdifficulty{4} &  Hard \\
\pdifficulty{5} &  Research project \\[0.2in]
\end{tabular}
\\
\end{tabular}
}
\end{realcenter}
%\end{table}


\section*{Internet resources}
 The website
\begin{realcenter}
{\tt{http://www.inference.phy.cam.ac.uk/mackay/itila}}
\end{realcenter}
 contains several resources:

\ben
\item
{\em Software}.
 Teaching software that I use in lectures,\index{software}
 interactive software, and research software,
 written in {\tt{perl}}, {\tt{octave}}, {{\tt{tcl}}}, {\tt{C}}, and {\tt{gnuplot}}.
 Also some animations.
\item
{\em Corrections to the book}. Thank you in advance for emailing these!
\item
{\em This book}.
 The book is provided in {\tt{postscript}}, {\tt{pdf}}, and {\tt{djvu}}
 formats for on-screen viewing. The same copyright restrictions
 apply as to a normal book. 
% \item
% {\em Further worked solutions to some exercises}.
% If you would like to send in your own solutions for inclusion,
% please do.
\een

% {\em (I aim to add a table of software resources here.)}

\section*{About this edition}
 This is the third printing of the first edition.
 In the second printing,
% a small number of typographical errors were corrected,
% and
 the design of the book was altered slightly.
% to allow a slightly larger font size.
 Page-numbering generally remains unchanged,
% consistent between the two printings,
 except in chapters 1, 6, and 28,
 where
% with the exception of pages 7 to 13, where
% among which
 a few paragraphs, figures, and equations have
 moved around.
% on which text, figures, and equations have all been slightly rearranged.
 All equation, section, and exercise numbers are unchanged.
 In the third printing, chapter 8 has been renamed
 `Dependent Random Variables', instead of `Correlated', which was sloppy.

% BEWARE, _RNGaussian.tex had to be changed for the asides.

%\input{tex/secondprint.tex}% about the second printing

\section*{Acknowledgments}
%\chapter*{Acknowledgments}
 I am most grateful to the organizations who have supported
 me while this book gestated: the Royal Society and Darwin College
 who gave  me a fantastic research fellowship
 in the early years; the University of Cambridge;
 the Keck Centre at the University of California in San Francisco,
 where I spent a productive sabbatical;
% (and failed to finish the book);
 and 
 the Gatsby Charitable Foundation, whose support gave me the
 freedom to  break out of the Escher staircase that book-writing had become.

 My work has  depended on the generosity of  free software authors.\index{software!free}\index{Knuth, Donald}
 I wrote the book in \LaTeXe. Three cheers for Donald Knuth and Leslie Lamport!
%\nocite{latex}
 Our computers run the GNU/Linux operating system. I use  {\tt{emacs}}, {\tt{perl}}, and
 {\tt{gnuplot}} every day.  Thank you Richard Stallman, thank you Linus Torvalds,
 thank you everyone.

% I thank  David Tranah of Cambridge University Press for his editorial support.
% ``cut, it's my job''
 
 Many readers, too numerous to name here,
 have given feedback on the book, and to
 them  all I extend my sincere acknowledgments. 
%
 I especially wish to thank all the students and colleagues
 at Cambridge University who have attended my lectures on
 information theory and machine learning over the last nine years.
% Without their enthusiasm and criticism, this book would surely

 The members of the Inference research group have given immense support, 
 and I thank them all for their generosity and patience over the last ten years:
 Mark Gibbs, Michelle Povinelli,  Simon Wilson, Coryn Bailer-Jones, Matthew Davey,
 Katriona Macphee,  James Miskin,  David Ward, Edward Ratzer,  Seb Wills, John Barry,
 John Winn, Phil Cowans, Hanna Wallach, Matthew Garrett, and especially Sanjoy Mahajan.
 Thank you too to Graeme Mitchison, Mike Cates, and Davin Yap.

 Finally I would like to express my debt to my personal heroes,
 the mentors from whom I have learned so much:
 Yaser Abu-Mostafa, 
 Andrew Blake,
 John Bridle,
 Peter Cheeseman,
 Steve Gull,
 Geoff Hinton,
 John Hopfield,
 Steve Luttrell,
 Robert MacKay,
 Bob McEliece,
 Radford Neal,
 Roger Sewell,
 and
 John Skilling.
%%%%%%%%%%%%%%

%\chapter*{Dedication}
%\vspace*{80pt}
\vfill
\begin{center}
 \rule{\textwidth}{1pt} \par \vskip 18pt
{ \huge \sl 
 {Dedication} }
\par
%\end{center}
 \nobreak \vskip 40pt 
%\begin{center}
 This book is dedicated to the campaign against the arms trade.\\[0.3in]
%
% Their web page is
% , as overburdened with animated images as the world is with weapons, is here:
%\verb+http://www.caat.demon.co.uk/+\\[0.6in]
\verb+www.caat.org.uk+\\[0.6in]
\end{center}
\begin{quote}
\begin{raggedleft}
 Peace cannot be kept by force.\\
 It can only be achieved
% by understanding.
% Peace cannot be achieved through violence, it can only be attained
 through understanding.\\
\hfill -- {\em Albert Einstein}\\
\end{raggedleft}
\end{quote}
\vspace*{2pt}
 \rule{\textwidth}{1pt} \par

% Two things are infinite: the universe and human stupidity; and I'm not sure
%      about the universe.

%The important thing is not to stop questioning. Curiosity has its own reason for
%      existing.
%Any intelligent fool can make things bigger, more complex, and more violent. It
%      takes a touch of genius -- and a lot of courage -- to move in the opposite
%      direction.


% \input{extrafrontstuff.tex}% aims dedication, about the author, etc
% see also tex/oldaims.tex
% for some good stuff.
% and tex/typicalreaders.tex 
%
%% \input{tex/overview2001.tex}
%\dvipsb{preface}
\newpage
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\setcounter{page}{0} % set to current value
%Fake page % added to get draft.dps to look right
%\newpage
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\pagenumbering{arabic}
\prechapter{About Chapter} 
\setcounter{page}{1} % set to current value    
\label{pch.one}
%
% pre-chapter 1
%
\fakesection{Before ch 1}
 In the first chapter, you will need to be familiar with the \ind{binomial distribution}.
% , reviewed below.
 And to solve the exercises in the text --
 which I urge you to do -- you will need to know {\dem\ind{Stirling's
 approximation}\/}\index{approximation!Stirling}
 for the factorial function, $%\beq
	 x! \simeq  x^{x} \, e^{-x} 
$,
 and be able to
 apply it to ${{N}\choose{r}} =
 \smallfrac{N!}{(N-r)!\,r!}$.\marginpar{\small\raggedright{Unfamiliar notation?\\ See
  \appref{app.notation}, \pref{app.notation}.}}
% $x!$
 These topics are reviewed below.

\subsection*{The binomial distribution}
\label{sec.first.binomial}
\exampl{ex.binomial}{
 A bent coin has probability $f$ of coming up heads.
 The coin is tossed $N$ times.
 What is the  probability
 distribution of the number of heads, $r$?
 What are the \ind{mean} and \ind{variance} of $r$?
}

\amarginfig{t}{%
\begin{tabular}{r}
% $P(r\given f,N)$\\
\mbox{\psfig{figure=bigrams/urn.f.g.ps,angle=-90,width=1.51in}}%
%\\
%\mbox{\psfig{figure=bigrams/urn.f.l.ps,angle=-90,width=1.64in}}%
\\[-0.1in]
\multicolumn{1}{c}{\small$r$}
\\
\end{tabular}
%}{%
\caption[a]{The binomial distribution $P(r \given f\eq 0.3,\,N \eq 10)$.}
% , on a linear scale (top) and  a logarithmic scale (bottom).}
\label{fig.binomial}
}
% see bigrams/README

\noindent
%\begin{Sexample}{ex.binomial}
{\sf Solution\colonspace}
\label{sec.first.binomial.sol}
 The number of heads
 has a binomial distribution.
\beq P(r \given f,N) = {N \choose r} f^{r} (1-f)^{N-r} . \eeq
 The mean, $\Exp [ r ]$, and variance, $\var[r]$,
 of this distribution are
 defined by
\beq
 \Exp [ r ] \equiv \sum_{r=0}^{N} P(r\given f,N) \, r
\label{eq.mean.def}
\eeq
\beqan
 \var[r] & \equiv &
\Exp \left[ \left( r  -   \Exp [ r ] \right)^2 \right] \\
& = &
\Exp [ r^2 ] -  \left( \Exp [ r ] \right)^2
 =  \sum_{r=0}^{N} P(r\given f,N) r^2 -  \left( \Exp [ r ] \right)^2 .
\label{eq.var.sum}
\eeqan
%
 Rather than evaluating the sums over $r$ in (\ref{eq.mean.def}) and (\ref{eq.var.sum}) directly,
 it is easiest to  obtain the mean and variance by noting that $r$
 is the sum of $N$ {\em independent\/}
% , identically distributed
 random variables, namely, the number of heads in the
 first toss (which is either zero or one),
 the number of heads in the second toss, and so forth.
 In general, 
\beq
\begin{array}{rcll}
 \Exp [ x + y ] &=&  \Exp [ x ] +  \Exp [ y ]  & \mbox{for any random variables $x$ and $y$};
\\
 \var [ x + y ] &=&  \var [ x ] +  \var [ y ]  & \mbox{if $x$ and $y$ are independent}.
\end{array}
\eeq
 So the mean of $r$ is the sum of the means of those random
 variables, and the variance of $r$ is the sum of their variances.\index{variances add}
% its mean and variance are given by adding the means and variances
% of those random variables, respectively.
 The mean number of heads in a single toss
 is $f\times 1 + (1-f)\times 0 = f$, and the variance of the
 number of heads in a single toss  is
\beq
 \left[ f\times 1^2 + (1-f)\times 0^2 \right] - f^2 = f - f^2 = f(1-f),
\eeq
 so the mean and variance of $r$ are:
\beq \Exp [ r ] = N f
%\eeq\beq
\hspace{0.35in} \mbox{and} \hspace{0.35in}
 \var[r] = N f (1-f) . \hspace{0.35in}\epfsymbol\hspace{-0.35in}
\eeq
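
 As a quick numerical check of these formulae (an illustration only,
 using the parameter values of \figref{fig.binomial}, $f = 0.3$ and $N = 10$):
\[
 \Exp [ r ] = 10 \times 0.3 = 3
 \hspace{0.35in} \mbox{and} \hspace{0.35in}
 \var[r] = 10 \times 0.3 \times 0.7 = 2.1 ,
\]
 so the standard deviation of $r$ is $\sqrt{2.1} \simeq 1.45$.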
%\end{Sexample}
% ADD END PROOF SYMBOL HERE !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

\subsection*{Approximating $x!$ and ${{N}\choose{r}}$}

\amarginfig{t}{%
\begin{tabular}{r}
\mbox{\psfig{figure=bigrams/poisson.g.ps,angle=-90,width=1.5in}}%
%\\
%\mbox{\psfig{figure=bigrams/poisson.l.ps,angle=-90,width=1.64in}}%
\\[-0.1in]
\multicolumn{1}{c}{\small$r$}
\\
\end{tabular}
%}{%
\caption[a]{The Poisson distribution $P(r\,|\,\l\eq 15)$.}
% , on a linear scale (top) and  a logarithmic scale (bottom).}
\label{fig.poisson}
}
% see bigrams/README
\label{sec.poisson}


% FAVOURITE BIT
\noindent
 Let's derive Stirling's approximation by an unconventional route.
 We start from the \ind{Poisson distribution} with mean $\l$,
\beq
	P( r \given  \l ) = e^{-\l} \frac{\l^r}{r!} \:\:\:\:
  \:\: r\in \{ 0,1,2,\ldots\} .
\label{eq.poisson}
\eeq
%
% \noindent
For large $\l$, this distribution is well approximated -- at least\index{approximation!by Gaussian}
 in the vicinity of $r \simeq \l$ -- by
 a \ind{Gaussian distribution} with mean $\l$ and variance $\l$:
% So,
\beq
  e^{-\l} \frac{\l^r}{r!} \,\simeq\, \frac{1}{\sqrt{2\pi \l}}
		\, e^{{ -\smallfrac{(r-\l)^2}{2\l}}} .
\eeq
 Let's plug  $r=\l$ into this formula.\label{sec.stirling}
\beqan
  e^{-\l} \frac{\l^{\l}}{\l!} &\simeq& \frac{1}{\sqrt{2\pi \l}}
\\
\Rightarrow \l! &\simeq&  \l^{\l} \, e^{-\l}  \sqrt{2\pi \l}  .
\eeqan
 This is {Stirling's approximation}
 for the \ind{factorial} function.
\beq
	 x! \,\simeq\,  x^{x} \, e^{-x}  \sqrt{2\pi x}  \:\:\:\Leftrightarrow\:\:\:
	\ln x! \,\simeq\, x \ln x - x + {\textstyle\frac{1}{2}} \ln {2\pi x} .
\label{eq.stirling}
\eeq
 We have derived not only the
 leading order behaviour, $x! \simeq  x^{x} \, e^{-x}$,
 but also, at no cost, the next-order correction
 term $\sqrt{2\pi x}$.
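 To see how much the correction term matters, here is a numerical
 check at $x = 10$ (an illustrative value, chosen only for convenience):
\[
\begin{array}{rcl}
 10! &=& 3\,628\,800 , \\
 10^{10} \, e^{-10} &\simeq& 4.54 \times 10^{5} , \\
 10^{10} \, e^{-10} \sqrt{2\pi \times 10} &\simeq& 3.60 \times 10^{6} .
\end{array}
\]
 In this example the leading-order expression alone is too small by a factor of about eight,
 whereas the corrected expression is within one \percent\ of the true value.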
%
 We now apply Stirling's approximation
% the approximation
%$%\beq
%	 x! \simeq  x^{x}  \, e^{-x} $
 to\index{combination}
$%\beq
	\ln {{N}\choose{r}} 
$:%\eeq
\beqan
\ln	{{N}\choose{r}}
\,\equiv\, \ln \frac{N!}{(N-r)!\,r!}
%	&	\simeq &
% N  [ \ln N - 1 ] - (N-r) [ \ln (N-r) - 1 ] - r [ \ln r - 1 ]
%\\
 & \simeq & (N-r) \ln\frac{N}{N-r} + r \ln\frac{N}{r}
 .
\label{eq.choose.approx}
\eeqan
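 The intermediate steps, keeping only the leading-order terms
 $\ln x! \simeq x \ln x - x$, can be sketched as follows:
\[
\begin{array}{rcl}
 \displaystyle \ln \frac{N!}{(N-r)!\,r!}
 &\simeq& N \ln N - N \, - \, (N-r) \ln (N-r) + (N-r) \, - \, r \ln r + r \\[0.05in]
 &=& N \ln N  - (N-r) \ln (N-r)  - r \ln r \\[0.05in]
 &=& \displaystyle (N-r) \ln\frac{N}{N-r} + r \ln\frac{N}{r} ,
\end{array}
\]
 where the last line uses $N \ln N = (N-r) \ln N + r \ln N$ and
 reproduces the right-hand side of (\ref{eq.choose.approx}).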
 Since all the terms in this equation are logarithms,
 this result can be rewritten in any base.\marginpar{\small Recall that
$\displaystyle{ \log_2 x = \frac{ \log_e x }{ \log_e 2} }$.\\[0.03in]
 Note that $\displaystyle\frac{\partial  \log_2 x }{\partial x} =
 \frac{1}{\log_e 2}\,\frac{1}{x}$.
}
%\fakesubsection*{My rule about log and ln}
 We will denote\index{conventions!logarithms}\index{notation!logarithms}
 natural logarithms ($\log_e$) by `ln', and \ind{logarithms}
 to base 2 ($\log_2$)
 by `$\log$'.

 If we introduce the {\dbf\ind{binary entropy function}},
\beq
 H_2(x) \equiv x \log \frac{1}{x} + (1\! -\! x) \log \frac{1}{(1\! -\! x)} ,
\eeq
 then we can rewrite the approximation (\ref{eq.choose.approx})
%\beq
%$ \log	{{N}\choose{r}} 
%  \simeq  (N-r) \log \frac{N}{N-r} + r \log \frac{N}{r} 
%$
%\eeq
 as
\amarginfig{t}{\small%
\begin{center}
\mbox{
\hspace{-6mm}
% \hspace{6.2mm}
\raisebox{\hpheight}{$H_2(x)$}
% to put H at left:
\hspace{-7.5mm}
% \hspace{-20mm}
\mbox{\psfig{figure=figs/H2.ps,%
width=42mm,angle=-90}}$x$
}
% see also H2p.tex

\end{center}
\caption[a]{The  binary entropy function.}
% $H_2(x)$.}
\label{fig.h2x}
}
\beq
\log	{{N}\choose{r}} 
\,  \simeq \, N H_2(r/N) ,
\label{eq.stirling.choose.l}
\eeq
 or, equivalently,
% \:\:\:\Leftrightarrow\:\:\:
\beq
	{{N}\choose{r}} 
\,  \simeq \, 2^{N H_2(r/N)} .
\label{eq.stirling.choose}
\eeq
 If we need a more accurate approximation, we
 can include terms of the next order from
 Stirling's approximation
 (\ref{eq.stirling}):
\beq
\log	{{N}\choose{r}} 
  \,\simeq\, N H_2(r/N) -
 {\textstyle\frac{1}{2}} \log \left[ {2\pi N \, \frac{N\!-\!r}{N} \,
                                       \frac{r}{N}}  \right]
.
\label{eq.H2approxaccurate}
\eeq
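 As an illustration (with values chosen to match \figref{fig.binomial},
 purely for convenience), put $N = 10$ and $r = 3$.
 The exact value is ${{10}\choose{3}} = 120$, so $\log {{10}\choose{3}} \simeq 6.91$.
 With $H_2(0.3) \simeq 0.881$, the two approximations give
\[
\begin{array}{rcl}
 N H_2(r/N) &\simeq& 8.81 , \\[0.05in]
 N H_2(r/N) - {\textstyle\frac{1}{2}} \log \left[ 2\pi \times 10 \times 0.7 \times 0.3 \right]
 &\simeq& 8.81 - 1.86 \:=\: 6.95 .
\end{array}
\]
 In this example, even with $N$ as small as ten, the more accurate approximation
 (\ref{eq.H2approxaccurate}) gives ${{N}\choose{r}}$ to within a few \percent,
 while (\ref{eq.stirling.choose.l}) overestimates it by a factor of nearly four.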
%
% - {\textstyle\frac{1}{2}} \ln {2\pi N}
% + {\textstyle\frac{1}{2}} \ln {2\pi N-r}
% + {\textstyle\frac{1}{2}} \ln {2\pi r}
%
% ln += {\textstyle\frac{1}{2}} \ln {2\pi (N-r)(r)/N}
% log_2 += {\textstyle\frac{1}{2}} \log_2 {2\pi (N-r)(r)/N}
% or
% log_2 += {\textstyle\frac{1}{2}} \log_2 {2\pi N}
%        + {\textstyle\frac{1}{2}} \log_2 {\frac{(N-r)}{N}\frac{r}{N}} 
% log_2 += {\textstyle\frac{1}{2}} \log_2 {2\pi \frac{(N-r)}{N}\frac{r}{N} N} 


\ENDprechapter
\chapter{Introduction to Information Theory}
\label{ch.one}
\label{chone}
% % \part{Information Theory}
% \chapter{Introduction to Information Theory}
\label{ch1}
%\section{Communication over noisy channels}
% One of the principal questions addressed by information theory is 
%  Shannon's ground-breaking paper on `The Mathematical Theory of 
%  Communication' opens thus:
\begin{quotation}
\noindent
 The fundamental problem of \ind{communication} is that of reproducing at one point
 either exactly or approximately a message selected at another point.
\\
\mbox{~} \hfill {\em (Claude Shannon, 1948)}\index{Claude Shannon}  \\
%
\end{quotation}

\noindent
 In the first half of 
 this book we
%are going to
 study how to measure information content; 
 we
% are going to
% learn by how much data from a given source 
% can be compressed; we
% are going to
 learn how
% , practically, to
% achieve  data compression;
 to compress data; and we
% are going to
 learn  how to communicate 
 perfectly over  imperfect communication channels. 

 We start by getting a feeling for this last problem. 

\section[How can we achieve perfect communication?]{How
 can we achieve perfect communication over an imperfect, noisy
 communication channel?}
 Some examples of noisy communication channels are:
\bit
\item
 an analogue telephone 
 line,\marginpar{\footnotesize
\setlength{\unitlength}{1mm}%
 \begin{picture}(45,10)(0,5)
\put(0,10){\makebox(0,0)[l]{\shortstack{modem}}}
\put(21,10){\makebox(0,0)[l]{\shortstack{phone\\line}}}
\put(39,10){\makebox(0,0)[l]{\shortstack{modem}}}
\put(15,10){\vector(1,0){3}}
\put(32,10){\vector(1,0){3}}
\end{picture}
}
 over which two modems communicate digital information;
\item
 the radio communication link from 
  Galileo,\marginpar{\footnotesize
\setlength{\unitlength}{1mm}%
 \begin{picture}(45,10)(0,5)
\put(0,10){\makebox(0,0)[l]{\shortstack{Galileo}}}
\put(21,10){\makebox(0,0)[l]{\shortstack{radio\\waves}}}
\put(39,10){\makebox(0,0)[l]{\shortstack{Earth}}}
\put(15,10){\vector(1,0){3}}
\put(32,10){\vector(1,0){3}}
\end{picture}
}
 the  Jupiter-orbiting spacecraft,
 to Earth;
\item
\marginpar[c]{\footnotesize
\setlength{\unitlength}{1mm}%
 \begin{picture}(30,20)(0,0)
\put(0,10){\makebox(0,0)[l]{\shortstack{parent\\cell}}}
\put(16,2){\makebox(0,0)[l]{\shortstack{daughter\\cell}}}
\put(16,16){\makebox(0,0)[l]{\shortstack{daughter\\cell}}}
\put(10,10){\vector(1,1){5}}
\put(10,10){\vector(1,-1){5}}
\end{picture}
}reproducing cells, in which the daughter cells' \ind{DNA}
 contains information from the parent
% cell or
 cells;
\item 
 \marginpar{\footnotesize
\setlength{\unitlength}{1mm}%
 \begin{picture}(45,10)(0,5)
\put(0,10){\makebox(0,0)[l]{\shortstack{computer\\ memory}}}
\put(20,10){\makebox(0,0)[l]{\shortstack{\disc\\drive}}}
\put(33,10){\makebox(0,0)[l]{\shortstack{computer\\ memory}}}
\put(15,10){\vector(1,0){3}}
\put(29,10){\vector(1,0){3}}
\end{picture}
}a \disc{} drive.
\eit
 The last example shows that \ind{communication} doesn't have to involve 
 information going from one {\em place\/}  to another. When 
 we write a file on a \disc{} drive, we'll
% typically
 read it off
% again 
 in the same location -- but at a later {\em time}.

 These channels are noisy.\index{noise}\index{channel!noisy} A telephone line  suffers
 from cross-talk with other lines; the hardware in the 
 line distorts and adds noise to the transmitted signal.  The deep
 space network that listens to Galileo's puny transmitter
% fairy-bulb power
 receives background radiation  from
 terrestrial and cosmic sources.
 DNA is subject to mutations and damage. 
 A \ind{disk drive}, which  writes
 a binary digit (a one or zero, also known as a {\dbf bit}) by aligning a patch of magnetic
 material in one of two orientations, may later
% , with some probability,
 fail to read out the stored binary digit:
% that was stored
 the patch of material might  spontaneously flip
 magnetization, or
 a glitch of
 background noise might cause the reading circuit
 to report the wrong 
 value for the binary digit, or  the writing head might not induce 
 the magnetization in the first place because of interference
 from neighbouring bits.

 In all these cases, if we transmit data, \eg, a string 
 of bits, over the channel, there is some probability that 
 the received message will not be identical to the transmitted message. 
% And in all cases,
 We would prefer to have a communication channel for
 which this probability was zero -- or so close to zero that 
 for practical purposes it is indistinguishable from zero.  

 Let's consider
% the example of
 a noisy \disc{} drive
% having the property
 that transmits  each bit  correctly
% transmitted
 with probability
 $(1\!-\!f)$ and incorrectly  with probability $f$. 
 This model
% favourite
 communication channel\index{channel!binary symmetric}  is known 
 as the {\dbf{\ind{binary symmetric channel}}} (\figref{fig.bsc1}).

\begin{figure}[htbp]
\figuremargin{%
\[
\begin{array}{c}
\setlength{\unitlength}{0.46mm}
\begin{picture}(30,20)(-5,0)
\put(-4,9){{\makebox(0,0)[r]{$x$}}}
\put(5,2){\vector(1,0){10}}
\put(5,16){\vector(1,0){10}}
\put(5,4){\vector(1,1){10}}
\put(5,14){\vector(1,-1){10}}
\put(4,2){\makebox(0,0)[r]{1}}
\put(4,16){\makebox(0,0)[r]{0}}
\put(16,2){\makebox(0,0)[l]{1}}
\put(16,16){\makebox(0,0)[l]{0}}
\put(24,9){{\makebox(0,0)[l]{$y$}}}
\end{picture}
\end{array}

\:\:\:
\begin{array}{ccl}%%%%% {c@{}c@{}l} %%%%% (for twocolumn style)
        P(y\eq 0 \given x\eq 0) &= & 1 - \q ; \\  P(y\eq 1 \given x\eq 0) &= & \q ;
\end{array} 
\begin{array}{ccl}
        P(y\eq 0 \given x\eq 1) &= &  \q ; \\ P(y\eq 1 \given x\eq 1) &= & 1 - \q .
\end{array} 
\]
}{%
\caption[a]{The binary symmetric channel. The 
 transmitted symbol is $x$ and the 
 received symbol $y$. The noise level, the probability
% of a bit's being
 that a bit is
 flipped, is $f$.}
\label{fig.bsc1}
}%
\end{figure}
\begin{figure}[htbp]
\figuremargin{%
\begin{mycenter}
\begin{tabular}{rcl}
\psfig{figure=bitmaps/dilbert.ps,width=1.2in} 
&\hspace{0.1in}%
\raisebox{0.22in}{%
\setlength{\unitlength}{1.2mm}%
\begin{picture}(20,20)(0,0)%
\put(10,1){\makebox(0,0)[t]{$(1-f)$}}
\put(10,17){\makebox(0,0)[b]{$(1-f)$}}
\put(12,9.5){\makebox(0,0)[l]{$f$}}
% \put(10,16.5){\makebox(0,0)[b]{$(1-f)$}}
\put(5,2){\vector(1,0){10}}
\put(5,16){\vector(1,0){10}}
\put(5,4){\vector(1,1){10}}
\put(5,14){\vector(1,-1){10}}
\put(4,2){\makebox(0,0)[r]{{1}}}
\put(4,16){\makebox(0,0)[r]{{0}}}
\put(16,2){\makebox(0,0)[l]{{1}}}
\put(16,16){\makebox(0,0)[l]{{0}}}
\end{picture}%
}%
\hspace{0.385in}&
\psfig{figure=_is/10000.10.ps,width=1.2in} \\
% & & \makebox[0in][l]{\large 10\% of bits are flipped} \\
\end{tabular}
\end{mycenter}
}{%
\caption[a]{A binary data sequence of length $10\,000$ transmitted over 
 a binary symmetric channel with  noise level $f=0.1$.
\dilbertcopy}
\label{fig.bsc.dil}
}%
\end{figure}

\noindent
 As an example,
% For the sake of argument,
 let's imagine that $f=0.1$, that is, ten \percent\ of the bits are 
 flipped (figure \ref{fig.bsc.dil}).
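 If we assume in addition that successive bits are flipped independently
 (an assumption not spelled out above), then the number of flipped bits
 among $N$ transmitted bits has a binomial distribution with mean $Nf$ and
 variance $Nf(1-f)$. For the $N = 10\,000$ bits of \figref{fig.bsc.dil}, that gives roughly
\[
 N f \pm \sqrt{ N f (1-f) } \:=\: 1000 \pm 30
\]
 flipped bits.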
% For a \disc{} drive to be useful, we would prefer that it should 
% flip no bits at all in its entire lifetime.
 A useful \disc{} drive would  flip no bits at all in its entire lifetime.
%  
 If we expect to read and write a 
 gigabyte per day for ten years,  we require a bit error 
 probability  of the order of $10^{-15}$, or smaller.
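 A rough reconstruction of this figure, assuming that a gigabyte means
 $8 \times 10^{9}$ bits and that ten years is about $3650$ days, runs as follows:
\[
  8 \times 10^{9} \:\: \mbox{bits per day} \:\times\: 3650 \:\: \mbox{days}
  \:\simeq\: 3 \times 10^{13} \:\: \mbox{bits} .
\]
 The expected number of bit errors over the lifetime is this number
 multiplied by $f$; for it to be much less than one we need
 $f \ll 3 \times 10^{-14}$, that is, $f$ of order $10^{-15}$ or smaller.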
 There are two approaches to this goal. 


\subsection{The physical solution}
 The physical solution is to improve the physical characteristics of 
 the communication channel to reduce its error probability. We could 
 improve our \disc{} drive by
% , for example,
\ben
\item
  using more reliable components in its circuitry;
\item
  evacuating the air from the \disc{} enclosure so as
 to eliminate the turbulence that perturbs the 
 reading head from the  track; 
\item
  using a larger magnetic patch to represent each bit;  or 
\item 
   using higher-power signals or cooling the 
 circuitry in order to reduce thermal noise. 
\een
 These physical modifications
 typically
 increase the cost of the communication 
 channel.
%   unit of area  making the \disc{} spin at a slower rate

%
% the system solution
% 
\begin{figure}%[htbp]
\figuremargin{%
\setlength{\unitlength}{1.25mm}
\begin{mycenter}
\begin{picture}(50,40)(-10,5)
\put(0,5){\framebox(25,10){\begin{tabular}{c}Noisy\\ channel\end{tabular}}}
\put(-20,20){\framebox(25,10){\begin{tabular}{c}Encoder\end{tabular}}}
\put(20,20){\framebox(25,10){\begin{tabular}{c}Decoder\end{tabular}}}
%\put(-20,40){\framebox(25,10){\begin{tabular}{c}Compressor\end{tabular}}}
%\put(20,40){\framebox(25,10){\begin{tabular}{c}Decompressor\end{tabular}}}
%\put(-50,20){\makebox(25,10){\begin{tabular}{c}{\sc Source}\\{\sc coding}\end{tabular}}}
% \put(-50,40){\makebox(25,10){\begin{tabular}{c}{\sc Channel}\\{\sc coding}\end{tabular}}}
\put(-20,37){\makebox(25,12){Source}}
%
\put(-10,14){\makebox(0,0){$\bt$}}
\put(-10,34){\makebox(0,0){$\bs$}}
\put(35,14){\makebox(0,0){$\br$}}
\put(35,34){\makebox(0,0){$\hat{\bs}$}}

\put(-7.5,18){\line(0,-1){8}}  
\put(-7.5,10){\vector(1,0){6}} 
\put(32.5,10){\vector(0,1){8}}
\put(32.5,10){\line(-1,0){6}}
%
\put(32.5,31){\vector(0,1){8}}
%\put(32.5,51){\vector(0,1){5}}
\put(-7.5,39){\vector(0,-1){8}}
%\put(-7.5,55){\vector(0,-1){5}}
\end{picture}
\end{mycenter}
}{%
\caption[a]{The `system' solution for
         achieving 
% almost perfect 
        reliable communication
        over a noisy channel. The encoding system introduces
 systematic redundancy
%        in a systematic way 
        into the transmitted vector $\bt$. The decoding system 
   uses this known redundancy to deduce
 from  the 
        received vector $\br$
 {\em both\/}
 the original source vector
        {\em and\/}
 the noise introduced by the channel.
}
\label{system.solution}
}%
\end{figure}
\subsection{The `system' solution}
 Information theory\index{information theory} and
 \ind{coding theory}\index{system} offer
 an alternative (and much more exciting)
 approach: we accept the given noisy channel as it is
 and 
 add communication {\dem systems\/} to it so that we 
 can {detect\/} and {correct\/} the errors introduced by the 
% noise.
 channel.
 As shown in \figref{system.solution}, we   add an 
 {\dem\ind{encoder}\/} before the channel and a {\dem\ind{decoder}\/} after 
 it. The encoder encodes the source message $\bs$ 
 into a {\dem transmitted\/} message $\bt$,
% the idea is that the  encoder adds
 adding {\dem\ind{redundancy}\/} to the original message in some way. The 
 channel adds noise to the transmitted message, yielding a received 
 message $\br$. The decoder uses the  known redundancy 
 introduced by the encoding system to infer both the original signal 
 $\bs$ and the added noise.
% added by the channel was. 

 Whereas  physical solutions give incremental channel improvements
 only  at an ever-increasing cost,
% we hope to find
% there exist
 system solutions  can turn noisy channels into reliable
 communication channels
 with the only cost being a  {\em computational\/} requirement 
 at the encoder and decoder.
% (and the delay associated with those computations.
%
% suggested addition: 
% So, as the cost of computation falls, the cost of reliability will fall as well.

{\dbf Information theory} is concerned with the theoretical limitations and 
% theoretical 
 potentials of such  systems. `What is the best error-correcting 
 performance we could achieve?'

{\dbf Coding theory} is concerned with the creation of practical 
 encoding and decoding systems.
 
% Some
\section{Error-correcting codes for the binary symmetric channel}
 We now consider  examples of encoding and decoding systems. 
 What is the simplest way to  add useful redundancy to a transmission? 
 [To make the rules of the game clear:
 we want to be able to detect {\em and\/} correct errors;
 and retransmission is not an option. We  get only
one chance to encode, transmit,
 and decode.]

\subsection{Repetition codes}
\label{sec.r3}
 A straightforward idea is to repeat every bit of the message a prearranged
 number of times -- for example, three times, as shown in \tabref{fig.r3}. 
 We call this {\dem \ind{repetition code}\/} `$\Rthree$'.

%\begin{figure}[htbp]
%\figuremargin{%
\amargintab{c}{
\begin{mycenter}
\begin{tabular}{c@{\hspace{0.3in}}c} \toprule % \hline
%        Source sequence $\bs$ &  Transmitted sequence $\bt$ \\ \hline
        Source  &  Transmitted  \\[-0.02in] % was -0.1, which was too much
         sequence  &   sequence  \\ 
         $\bs$ &   $\bt$ \\ \midrule % \hline
        \tt 0 &\tt  000 \\
        \tt 1 &\tt  111  \\ \bottomrule % \hline
\end{tabular} 
\end{mycenter}
%}{%
\caption[a]{The repetition code {$\Rthree$}.}
\label{fig.r3}
}%
%\end{figure}

% \noindent
%
 Imagine that
% what might happen if 
 we transmit the source message
\[
 \bs = \mbox{\tt 0 0 1 0 1 1 0}
\]
 over a binary 
 symmetric channel with noise level $f=0.1$ using this repetition code. 
 We can describe the channel as `adding' a sparse noise vector $\bn$ to the 
 transmitted vector -- adding in modulo 2 arithmetic, \ie, the binary algebra in which 
 {\tt 1}+{\tt 1}={\tt 0}.  A possible noise
 vector $\bn$ and received vector $\br = \bt + \bn$
 are shown in 
 \figref{fig.r3.transmission}.
\begin{figure}[htbp]
%
% here i should switch the \[ \] for a display that does not introduce
% white space at the top (about 0.1in)
%
\figuremargin{%
\[
        \begin{array}{rccccccc}
        \bs & {\tt 0}&{\tt 0}&{\tt 1}&{\tt 0}&{\tt 1}&{\tt 1}&{\tt 0} \\
        \bt & \obr{{\tt 0}}{{\tt 0}}{{\tt 0}}&\obr{{\tt 0}}{{\tt 0}}{{\tt 0}}&\obr{{\tt 1}}{{\tt 1}}{{\tt 1}}&\obr{{\tt 0}}{{\tt 0}}{{\tt 0}}&
                \obr{{\tt 1}}{{\tt 1}}{{\tt 1}}&\obr{{\tt 1}}{{\tt 1}}{{\tt 1}}& \obr{{\tt 0}}{{\tt 0}}{{\tt 0}} \\ 
        \bn & \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}}& \nbr{{\tt 0}}{{\tt 0}}{{\tt 1}}&   \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}}&  \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}}&
                \nbr{{\tt 1}}{{\tt 0}}{{\tt 1}}&   \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}}&  \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}} \\ \cline{2-8}
        \br &  \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}}& \nbr{{\tt 0}}{{\tt 0}}{{\tt 1}}&   \nbr{{\tt 1}}{{\tt 1}}{{\tt 1}}&  \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}}&
                \nbr{{\tt 0}}{{\tt 1}}{{\tt 0}}&   \nbr{{\tt 1}}{{\tt 1}}{{\tt 1}}&  \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}} 
        \end{array}
\]
}{%
\caption{An example transmission using {$\Rthree$}.}
\label{fig.r3.transmission}
}
\end{figure}

%\noindent
 How should we decode this received vector?
%
% optimality not clear - should justify? 
%
% Perhaps you can see  that
 The optimal algorithm looks at the received
 bits three at a time and takes 
 a \ind{majority vote} (\algref{alg.r3}).


\begin{algorithm}[htbp]
\algorithmmargin{%
\begin{mycenter}
\begin{tabular}{ccc} % \toprule % \hline
        Received sequence $\br$ &
 Likelihood ratio $\frac{P(\br\,|\, s\eq {\tt 1})}{P(\br\,|\, s\eq {\tt 0})}$
 &
 Decoded sequence $\hat{\bs}$ \\ \midrule
\tt      000 & $\gamma^{-3}$ &\tt 0 \\
\tt      001 & $\gamma^{-1}$ &\tt 0 \\
\tt      010 & $\gamma^{-1}$ &\tt 0 \\
\tt      100 & $\gamma^{-1}$ &\tt 0 \\
\tt      101 & $\gamma^{1}$  &\tt 1 \\
\tt      110 & $\gamma^{1}$  &\tt 1 \\
\tt      011 & $\gamma^{1}$  &\tt 1 \\
\tt      111 & $\gamma^{3}$  &\tt 1 \\
% \bottomrule
\end{tabular} 
\end{mycenter}
}{%
\caption[a]{Majority-vote decoding algorithm for {$\Rthree$}.
 Also shown are the likelihood ratios (\ref{eq.likelihood.bsc}), assuming
%  This is the optimal decoder if
 the channel is a binary symmetric channel; $\gamma \equiv (1-f)/f$.}
%
\label{fig.r3d}
\label{alg.r3}
}%
\end{algorithm}

%
\begin{aside}
% 
 At the risk of explaining the obvious, let's prove this result.
 The optimal decoding decision
 (optimal in the sense
 of having the smallest probability of being wrong)
 is to find which value of $\bs$
 is most probable, given $\br$.\index{maximum {\em a posteriori}}
% to make clear the assumptions.
 Consider the decoding of a single bit $s$, which was encoded
 as
% after encoding as
 $\bt(s)$ 
 and  gave rise to three received bits $\br = r_1r_2r_3$.
 By \ind{Bayes' theorem},\label{sec.bayes.used} the {\dem posterior
 probability\/} of $s$ is
\beq
	P(s \,|\, r_1r_2r_3 ) = \frac{ P( r_1r_2r_3 \,|\, s ) P( s ) }
				{ P( r_1r_2r_3 ) } .
\label{eq.bayestheorem}
\eeq
 We can spell out the posterior probability of the two alternatives thus:
\beq
	P(s\eq {\tt 1} \,|\, r_1r_2r_3 ) = \frac{ P( r_1r_2r_3 \,|\, s\eq {\tt 1} )
						P( s\eq {\tt 1} ) }
				{ P( r_1r_2r_3 ) } ; 
\label{eq.post1}
\eeq
\beq
	P(s\eq {\tt 0} \,|\, r_1r_2r_3 ) = \frac{ P( r_1r_2r_3 \,|\, s\eq {\tt 0} )
						P( s\eq {\tt 0} ) }
				{ P( r_1r_2r_3 ) } .
\label{eq.post0}
\eeq
%
 This \ind{posterior probability} is determined by two factors:
  the
 {\dem{\ind{prior} probability\/}} $P(s)$, and 
 the data-dependent term $P( r_1r_2r_3 \,|\, s )$, which is called
 the {\dem{\ind{likelihood}\/}} of $s$.
 The normalizing constant $P( r_1r_2r_3 )$
% is irrelevant to
 needn't be computed when finding
 the optimal decoding decision,
 which is to guess $\hat{s}\eq {\tt 0}$
 if $P(s\eq {\tt 0} \,|\, \br ) > P(s\eq {\tt 1} \,|\, \br )$,
 and $\hat{s}\eq {\tt 1}$ otherwise.

 To find
 $P(s\eq {\tt 0} \,|\, \br )$ and $P(s\eq {\tt 1} \,|\, \br )$,
% the optimal decoding decision,
 we must  make an assumption about the prior probabilities of the
 two hypotheses ${s}\eq {\tt 0}$ and ${s}\eq {\tt 1}$, and we 
 must make an assumption about the probability of $\br$ given
 $s$.
% $\bt(s)$.
 We  assume that the prior probabilities are equal:
 $P( {s}\eq {\tt 0}) = P( {s}\eq {\tt 1}) = 0.5$; 
 then  maximizing the posterior probability $P(s\,|\,\br)$ is
 equivalent to maximizing the likelihood $P(\br\,|\,s)$.\index{maximum likelihood}
 And  we  assume that the
 channel is a binary symmetric channel with noise level $f<0.5$, so that
 the likelihood is
\beq
	P( \br \,|\, s ) = P(\br \,|\, \bt(s) ) = \prod_{n=1}^N
		P(r_n \,|\, t_n(s) ) ,
\eeq
 where $N=3$ is the number of transmitted bits in the block
 we are considering, and
\beq
 P(r_n\,|\,t_n) = \left\{ \begin{array}{lll}
 (1\!-\!f) & \mbox{if} &  r_n=t_n \\
 f & \mbox{if} & r_n \neq t_n. \end{array} \right.
\eeq
 Thus the likelihood ratio for the
 two hypotheses is
% if we define $
\beq
	\frac{P(\br\,|\, s\eq {\tt 1})}{P(\br\,|\, s\eq {\tt 0})}
%	= \left( \frac{ (1-f) }{f} \right)^{
	= \prod_{n=1}^N
		\frac{P(r_n \,|\, t_n({\tt 1}) )}{P(r_n \,|\, t_n({\tt 0}) )} ;
\label{eq.likelihood.bsc}
\eeq
 each factor
% $P(r_n \,|\, t_n(s) )$
 $\frac{P(r_n | t_n({\tt 1}) )}{P(r_n | t_n({\tt 0}) )}$
 equals $\frac{ (1-f) }{f}$ if $r_n\eq {\tt 1}$ and $\frac{f}{ (1-f) }$ if
 $r_n\eq {\tt 0}$. 
 The ratio $\gamma \equiv \frac{ (1-f) }{f}$ is greater than 1,
 since $f<0.5$, so the winning hypothesis is the one with the most
 `votes', each vote counting for a factor of $\gamma$ in the
% posterior probability.
 likelihood ratio. 
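
 As a worked instance (the received triplet and the noise level are
 chosen purely for illustration), take $f = 0.1$, so that $\gamma = 9$,
 and suppose $\br = {\tt 011}$. The three factors in
 (\ref{eq.likelihood.bsc}) are $\gamma^{-1}$, $\gamma$, and $\gamma$, so
\beq
	\frac{P(\br\,|\, s\eq {\tt 1})}{P(\br\,|\, s\eq {\tt 0})}
	= \gamma^{-1} \times \gamma \times \gamma = \gamma = 9 ,
\eeq
 and, with equal priors,
 $P(s\eq {\tt 1} \,|\, \br ) = \gamma/(1+\gamma) = 0.9$.
 The two {\tt 1}s outvote the single {\tt 0}.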

 Thus the majority-vote decoder shown in \algref{fig.r3d}
 is the optimal decoder if we assume that 
 the channel is  a binary symmetric channel and that the 
 two possible source messages {\tt 0} and {\tt 1} 
 have equal prior probability.
\end{aside}


%\noindent
 We now apply the majority-vote decoder to the received vector of  \figref{fig.r3.transmission}.
 The first  three received bits are all ${\tt 0}$, so
 we decode this triplet 
 as a ${\tt 0}$. 
 In the second triplet of \figref{fig.r3.transmission},
 there are two {\tt 0}s and one {\tt 1}, so  we decode 
 this triplet as a ${\tt 0}$ -- which in this case corrects the error.
 Not all errors are corrected, however. If we are unlucky and 
 two errors fall in a single block, as in the fifth triplet of 
 \figref{fig.r3.transmission}, 
 then the decoding rule gets the wrong answer, as shown in 
 \figref{fig.decoding.R3}. 
% \Figref{fig.decoding.R3}
% shows the result of decoding the received vector 
% from \figref{fig.r3.transmission}.
\begin{figure}[htbp]
\figuremargin{%
\[
        \begin{array}{rccccccc}
        \bs & {\tt 0}&{\tt 0}&{\tt 1}&{\tt 0}&{\tt 1}&{\tt 1}&{\tt 0} \\
        \bt & \obr{{\tt 0}}{{\tt 0}}{{\tt 0}}&\obr{{\tt 0}}{{\tt 0}}{{\tt 0}}&\obr{{\tt 1}}{{\tt 1}}{{\tt 1}}&\obr{{\tt 0}}{{\tt 0}}{{\tt 0}}&
                \obr{{\tt 1}}{{\tt 1}}{{\tt 1}}&\obr{{\tt 1}}{{\tt 1}}{{\tt 1}}& \obr{{\tt 0}}{{\tt 0}}{{\tt 0}} \\ 
        \bn & \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}}& \nbr{{\tt 0}}{{\tt 0}}{{\tt 1}}&   \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}}&  \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}}&
                \nbr{{\tt 1}}{{\tt 0}}{{\tt 1}}&   \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}}&  \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}} \\ \cline{2-8} 
        \br &  \ubr{{\tt 0}}{{\tt 0}}{{\tt 0}}& \ubr{{\tt 0}}{{\tt 0}}{{\tt 1}}&   \ubr{{\tt 1}}{{\tt 1}}{{\tt 1}}&  \ubr{{\tt 0}}{{\tt 0}}{{\tt 0}}&
                \ubr{{\tt 0}}{{\tt 1}}{{\tt 0}}&   \ubr{{\tt 1}}{{\tt 1}}{{\tt 1}}&  \ubr{{\tt 0}}{{\tt 0}}{{\tt 0}} \\ 
        \hat{\bs} &     {\tt 0}&{\tt 0}&{\tt 1}&{\tt 0}&{\tt 0}&{\tt 1}&{\tt 0} \\
        \mbox{corrected errors} &
                         &\star & & & & & \\ 
        \mbox{undetected errors} &
                         & & & &\star & & 
        \end{array}
\]
}{%
\caption{Decoding
% Applying the maximum likelihood decoder for $\mbox{R}_3$ to 
 the received vector 
 from \protect\figref{fig.r3.transmission}.}
\label{fig.decoding.R3}
}%
\end{figure}

\noindent
% Thus the error probability is reduced by the use of this code. 
% It is easy to compute the error probability.

% Exercise 1.1. Could this be made an Example, i.e. worked through in
%        the text? -- for a beginner, there is a lot in it, and it seems to
%        be important.
%
% see exercise.sty
\exercissx{2}{ex.R3ep}{%%%%%%%% keep this as A2, but cut it from the ITPRNN list
 Show\marginpar{\small\raggedright The exercise's rating, \eg
% `{\em{A}}2'
 `[{\em2\/}]',
 indicates its  difficulty:
 `1' exercises are the easiest.
% An exercise rated {\em{A}}2 is important and should not prove too difficult.
 Exercises that are accompanied by a marginal rat are especially recommended.
 If a solution or partial solution is provided, the page is indicated after the difficulty rating;
 for  example, this exercise's solution is on page \pageref{ex.R3ep.sol}.
}
 that  the error probability is reduced by the use of {$\Rthree$}
 by computing the error probability of
 this code for a binary symmetric channel
 with noise level $f$.
%Do so.
}
%
% This fig is 0.1 inch too wide, 9801
%

 The error probability is dominated by the probability that two
 bits in a block of three are flipped, which scales as $f^2$.
%
% JARGON??????
%
 In the
 case of the binary symmetric channel with $f=0.1$, the {$\Rthree$} code has a
 probability of error, after decoding, of $\pb \simeq 0.03$ per bit.
 \Figref{fig.r3.dilbert} shows the 
 result of transmitting a binary
 image over a binary symmetric channel
 using the repetition code.
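
 To see where the value $\pb \simeq 0.03$ comes from, here is a sketch of the
 calculation, assuming only that the three bits of a block are flipped
 independently, each with probability $f$. The majority vote fails if
 two or three of the bits in a block are flipped, so
\beq
	\pb = 3 f^2 (1-f) + f^3 ,
\eeq
 which for $f=0.1$ gives $\pb = 0.027 + 0.001 = 0.028 \simeq 0.03$.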

\begin{figure}[hbtp]
%\fullwidthfigure{%
%\figuredangle{% this hung off the bottom of the page
\figuremarginb{% I think this may make a collision?
\begin{center}
\setlength{\unitlength}{0.8in}% was 0.75 98.12. changed to 0.8 99.01
\begin{picture}(7,4.3)(0,1.4)
\put(0,5){\makebox(0,0)[tl]{\psfig{figure=bitmaps/dilbert.ps,width=1in}}}
\put(0.625,5.4){\makebox(0,0){\Large$\bs$}}
\thicklines
\put(1.35,4.75){\vector(1,0){0.4}}
\put(1.55,5.4){\makebox(0,0){{\sc encoder}}}
\put(2,5){\makebox(0,0)[tl]{\psfig{figure=poster/10000.r3.ps,width=1in}}}
\put(2.625,5.4){\makebox(0,0){\Large$\bt$}}
\put(3.6,5.4){\makebox(0,0){{\sc channel}}}
\put(3.6,5.15){\makebox(0,0){$f={10\%}$}}
\put(3.4,4.75){\vector(1,0){0.4}}
\put(4,5){\makebox(0,0)[tl]{\psfig{figure=poster/10000.r3.0.10.ps,width=1in}}}
\put(4.625,5.4){\makebox(0,0){\Large$\br$}}
\put(5.6,5.4){\makebox(0,0){{\sc decoder}}}
%\put(5.6,3.4){\makebox(0,0)[tl]{\parbox[t]{1.75in}{{\em The decoder takes the majority vote of the three signals.}}}}
\put(5.4,4.75){\vector(1,0){0.4}}
\put(6,5){\makebox(0,0)[tl]{\psfig{figure=poster/10000.r3.0.10.d.ps,width=1in}}}
\put(6.625,5.4){\makebox(0,0){\Large$\hat{\bs}$}}
\end{picture}
\end{center}
}{%
\caption[a]{Transmitting $10\,000$ source bits over a binary symmetric channel
 with $f=10\%$
% 0.1$
  using a repetition code and the majority-vote decoding 
 algorithm. The  probability 
 of decoded bit error has fallen to about 3\%; the rate has fallen 
 to 1/3.}
% \dilbertcopy
\label{fig.r3.dilbert}
}%
\end{figure}
%  Should `rate' be explicitly defined?

\newpage\indent
The repetition code $\Rthree$ has therefore reduced the probability of
 error, as desired.
 Yet we have lost something: our
 {\em rate\/} of information transfer has fallen by a factor of
 three. So if we use a repetition code to communicate data over a telephone
 line, it will reduce the error frequency, but it will also reduce our
 communication rate. We will have to pay three times as much for each
 phone call.
% there will also be a delay
 Similarly,
%As for our \disc{} drive,
 we would need three of the original noisy gigabyte \disc{} drives
 in order to create a one-gigabyte \disc{} drive with $\pb=0.03$.

 Can we 
% What happens as we try to 
 push the error probability lower, to the
 values required for a
% quality
 sellable \disc{} drive -- $10^{-15}$?
 We could achieve lower error probabilities by using repetition 
 codes with more repetitions. 

\exercissx{3}{ex.R60}{
\ben
\item
 Show that the probability of error of $\RN$, the repetition
 code with  $N$ repetitions,  is 
\beq
 p_{\rm b} = \sum_{n=(N+1)/2}^{N} {{N}\choose{n}} f^n (1-f)^{N-n} ,
\eeq
 for  odd $N$.
\item
 Assuming $f = 0.1$, which of the terms in this sum is the biggest?
 How much bigger is it than the second-biggest term? 
\item
 Use \ind{Stirling's approximation} (\pref{sec.stirling}) to approximate
% get rid of
 the ${{N}\choose{n}}$
 in the largest term, and find,
 approximately, the probability of error of the repetition
 code with  $N$ repetitions.
\item
  Assuming $f = 0.1$, find how many repetitions
 are required
% show that it takes a repetition 
% code with rate about $1/60$
 to get the probability of error 
 down to $10^{-15}$. [Answer: about 60.]
\een
}
 So to build a {\em single\/}
 gigabyte \disc{} drive 
 with the required reliability from noisy gigabyte drives with $f=0.1$, 
 we would need {\em sixty\/} of the  noisy \disc{} drives.
 The tradeoff between error probability and rate for repetition 
 codes is shown in \figref{fig.pbR.R}.
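
 As a crude check on the number sixty (a sketch only: it keeps just
 the largest term of the sum for $\RN$ and approximates the binomial
 coefficient by $2^N$, dropping all slowly-varying factors), we can write
\beq
	\pb \simeq 2^N \left[ f(1-f) \right]^{N/2}
	    = \left[ 4 f (1-f) \right]^{N/2} .
\eeq
 For $f=0.1$, $4f(1-f) = 0.36$, and setting $\pb = 10^{-15}$ gives
 $N \simeq 2 \log 10^{-15} / \log 0.36 \simeq 68$. Keeping the
 factors that this estimate drops brings the answer down to
 about sixty.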
%
%  see end of l1.tex for method, also see poster1.gnu
%
\newcommand{\pbobject}{\hspace{-0.15in}\raisebox{1.62in}{$\pb$}%
\hspace{-0.05in}}
\begin{figure}
\figuremargin{%
\begin{center}
\begin{tabular}{cc}
\hspace{-0.2in}\psfig{figure=\codefigs/rep.1.ps,angle=-90,width=2.6in} &
\pbobject\psfig{figure=\codefigs/rep.1.l.ps,angle=-90,width=2.6in}  \\
\end{tabular} 
\end{center}
}{%
\caption[a]{Error probability $\pb$ versus rate for repetition codes
 over a binary symmetric channel with $f=0.1$.
 The right-hand figure shows $\pb$ on a logarithmic scale. We would like 
 the rate to be large and $\pb$ to be small.
}
\label{fig.pbR.R}
}%
\end{figure}
%  see end of this file for method


\subsection{Block codes -- the $(7,4)$ Hamming code}
\label{sec.ham74}
 We would  like to  communicate with\index{Hamming code} 
 tiny probability of error {\em and\/} at a substantial rate.
 Can we improve on repetition codes? What if we add redundancy to 
 {\dem blocks\/} of data instead of 
% redundantly 
 encoding one bit at a time?
% You may already have heard of the idea of `parity check bits'. 
 We now 
 study a simple {\dem{block code}}.

 A {\dem \ind{block code}\/} is a rule\index{error-correcting code!block code}
 for converting a sequence of source
 bits $\bs$, of length $K$, say, into a transmitted sequence $\bt$ of length
 $N$ bits. To add redundancy, we make $N$ 
 greater than $K$. In a {\dem linear\/} block code,
 the extra $N-K$ bits are linear functions of the
 original $K$ bits; these extra bits are called {\dem\ind{parity-check bits}}.
 An example of a \ind{linear block code} is the \mbox{\dem$(7,4)$
 \ind{Hamming code}}, which transmits $N=7$ bits for every $K=4$ source
 bits.
% \index{error-correcting code!linear}

\begin{figure}[htbp]
\figuremargin{\small%
\begin{center}
\begin{tabular}{cc}
(a)\psfig{figure=hamming/encode.eps,angle=-90,width=1.3in}  &
(b)\psfig{figure=hamming/correct.eps,angle=-90,width=1.3in}  \\
\end{tabular} 
\end{center}
}{
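\caption[a]{Pictorial representation of encoding for the  $(7,4)$ Hamming
 code. (a) The seven transmitted bits arranged in three intersecting circles.
 (b) The transmitted codeword for $\bs = {\tt 1000}$.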
% a and  b are not explained in the caption. Does this matter? 
%
% The parity check bits $t_5,t_6,t_7$ are set so that the parity within
%% each circle is even.
}
\label{fig.74h.pictorial}
\label{fig.hamming.pictorial}
}
\end{figure}
%
 The encoding operation for the code is shown pictorially
 in \figref{fig.74h.pictorial}.
%
% \subsubsection{Encoding}
 We arrange the seven transmitted bits in three intersecting circles.
%  as shown in \figref{fig.hamming.encode}.
 The first four
 transmitted bits,
 $t_1 t_2 t_3 t_4$, are set equal to the four source bits,
 $s_1 s_2 s_3 s_4$.
 The parity-check bits\index{parity-check bits}
 $t_5 t_6 t_7$ are set so that the {\dem\ind{parity}\/}
 within each circle is even:
 the first parity-check bit is the parity of the first three source bits
 (that is, it is
%zero
 {\tt 0} if the sum of those bits is even, and
% one
 {\tt 1} if the sum  is odd); 
 the second is the parity of the last three; and the third parity bit 
 is the parity of source bits one, three and four. 

 As an example, \figref{fig.74h.pictorial}b shows the transmitted 
 codeword  for the case $\bs = {\tt 1000}$. 
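 In symbols (this is only a restatement of the three parity rules
 above, with $\oplus$ denoting addition modulo 2),
\begin{eqnarray*}
	t_5 &=& s_1 \oplus s_2 \oplus s_3 , \\
	t_6 &=& s_2 \oplus s_3 \oplus s_4 , \\
	t_7 &=& s_1 \oplus s_3 \oplus s_4 ;
\end{eqnarray*}
 for $\bs = {\tt 1000}$ these give $t_5 t_6 t_7 = {\tt 101}$, and so
 $\bt = {\tt 1000101}$.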
% idea for rewriting this: go straight to pictorial story, leave out the 
% matrix description for another time.
%
%
%\noindent
%
 Table \ref{tab.74h} shows the codewords generated
 by each of the  $2^4 = 16$ settings of the four source bits.
% Notice that the first four transmitted bits are 
% identical to the four source bits, and the remaining three bits 
% are parity bits:
% The special property of these codewords is that
 These codewords
 have the  special property  that
 any pair 
 differ from each other in at least three bits.
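 [One way to check this property without comparing all 120 pairs,
 offered as a sketch: the code is linear, so the modulo-2 sum of two
 codewords is itself a codeword. Writing $d(\bt,\bt')$ for the number
 of bits in which two codewords differ and $w(\cdot)$ for the number
 of {\tt 1}s in a vector (both symbols are introduced here for
 illustration only),
\beq
	d(\bt,\bt') = w( \bt \oplus \bt' ) \geq 3 ,
\eeq
 because every non-zero codeword in \tabref{tab.74h} contains at
 least three {\tt 1}s: seven contain three, seven contain four,
 and one contains seven.]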
\begin{table}[htbp]
\figuremargin{%
\begin{center}
\mbox{\small
\begin{tabular}{cc} \toprule
%       Source sequence
 $\bs$ & 
% Transmitted sequence
               $\bt$ \\ \midrule
\tt     0000 &\tt 0000000 \\
\tt     0001 &\tt 0001011 \\
\tt     0010 &\tt 0010111 \\
\tt     0011 &\tt 0011100 \\ \bottomrule
\end{tabular} \hspace{0.02in}
\begin{tabular}{cc} \toprule
         $\bs$ &   $\bt$ \\ \midrule
\tt     0100 &\tt 0100110 \\
\tt     0101 &\tt 0101101 \\
\tt     0110 &\tt 0110001 \\
\tt     0111 &\tt 0111010 \\ \bottomrule
\end{tabular} \hspace{0.02in}
\begin{tabular}{cc} \toprule 
         $\bs$ &   $\bt$ \\ \midrule
\tt     1000 &\tt 1000101 \\
\tt     1001 &\tt 1001110 \\
\tt     1010 &\tt 1010010 \\
\tt     1011 &\tt 1011001 \\ \bottomrule
\end{tabular} \hspace{0.02in}
\begin{tabular}{cc} \toprule
         $\bs$ &   $\bt$ \\ \midrule
\tt     1100 &\tt 1100011 \\
\tt     1101 &\tt 1101000 \\
\tt     1110 &\tt 1110100 \\
\tt     1111 &\tt 1111111 \\ \bottomrule
\end{tabular}
}%%%%%%%%% end of row of four tables
\end{center} 
}{%
\caption[a]{The sixteen codewords
  $\{ \bt \}$ of the $(7,4)$ Hamming  code. Any pair of
  codewords 
% have the % beautiful % elegant property that they
 differ from each other in at least three bits.}
%\label{fig.hamming.encode}
\label{tab.74h}
\label{tab.h74}
\label{fig.h74}
\label{fig.74h}
}
\end{table}
%

\begin{aside}
 Because the Hamming code is a   {linear\/} code, it can\indexs{error-correcting code!linear}
 be written  compactly in terms of matrices as follows.\index{linear block code} 
% It is a 
% {\em linear\/} code; that is, t
 The transmitted codeword $\bt$ is
% can be
 obtained 
 from the source sequence $\bs$ by a linear operation,
\beq
        \bt = \bG^{\T} \bs,
\label{eq.encode}
\eeq
 where $\bG$ is the {\dem\ind{generator matrix}} of the code,
\beq
 \bG^{\T} = {\left[ \begin{array}{cccc} 
\tt 1 &\tt 0 &\tt 0 &\tt 0 \\
\tt 0 &\tt 1 &\tt 0 &\tt 0 \\
\tt 0 &\tt 0 &\tt 1 &\tt 0 \\
\tt 0 &\tt 0 &\tt 0 &\tt 1 \\
\tt 1 &\tt 1 &\tt 1 &\tt 0 \\
\tt 0 &\tt 1 &\tt 1 &\tt 1 \\
\tt 1 &\tt 0 &\tt 1 &\tt 1  \end{array} \right] } ,
\label{eq.h74.gen}
\eeq 
 and the encoding operation (\ref{eq.encode}) uses 
  modulo-2 arithmetic (${\tt 1}+{\tt 1}={\tt{0}}$, ${\tt 0}+{\tt 1}={\tt 1}$, etc.).
%\footnote{My notational 
% convention  is that  all  vectors -- $\bs$, $\bt$, etc.\ --
% are column vectors, except that in the figures where many 
% vectors are listed, they are displayed as row vectors. The 
% generator matrix $\bG$  is written ..... as to retain 
% consistency with established notation in coding texts.}

% \begin{aside}
 In the encoding operation
 (\ref{eq.encode}) I have assumed that $\bs$ and $\bt$ are 
 column vectors. If instead they are row vectors, then this equation 
 is replaced by
\beq
        \bt =  \bs \bG,	
\label{eq.encodeT}
\eeq
 where 
\beq
       \bG = \left[ \begin{array}{ccccccc} 
 \tt 1& \tt 0& \tt 0& \tt 0& \tt 1& \tt 0& \tt 1 \\
 \tt 0& \tt 1& \tt 0& \tt 0& \tt 1& \tt 1& \tt 0 \\
 \tt 0& \tt 0& \tt 1& \tt 0& \tt 1& \tt 1& \tt 1 \\
 \tt 0& \tt 0& \tt 0& \tt 1& \tt 0& \tt 1& \tt 1 \\
  \end{array} \right] .
\label{eq.Generator}
\eeq
% f you are like me, you may
 I find it easier to relate to
 the right-multiplication (\ref{eq.encode})
% hyphenation specified in itprnnchapter.tex did not work so I do it manually
 than the left-multiplica-{\breakhere}tion (\ref{eq.encodeT}).
% -- I like my matrices to act to the right.
 Many coding theory texts use the left-multiplying conventions 
 (\ref{eq.encodeT}--\ref{eq.Generator}), however.

 The rows of the generator matrix (\ref{eq.Generator}) can be 
 viewed as defining four basis vectors lying in a seven-dimensional
 binary space. The sixteen codewords are obtained by making all 
 possible linear combinations
% binary sums
 of these vectors.
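
 For example (the source vector here is chosen arbitrarily), for
 $\bs = {\tt 1011}$ the row-vector form (\ref{eq.encodeT}) adds rows
 one, three, and four of $\bG$ modulo 2:
\beq
	\bt = {\tt 1000101} \oplus {\tt 0010111} \oplus {\tt 0001011}
	    = {\tt 1011001} ,
\eeq
 in agreement with \tabref{tab.74h}.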
\end{aside}


%
% should I add a cast of characters here?
% s,t,r,s^

\subsubsection{Decoding the $(7,4)$ Hamming code}
 When we invent a more complex encoder $\bs \rightarrow \bt$,
 the task of decoding the
 received vector $\br$ becomes less straightforward. Remember that
 {\em any\/} of the bits may have been flipped, including the parity bits. 
% We can't assume that the  three extra parity bits 
%(The reader who
% is eager to see the denouement of the plot may skip ahead to section
% \ref{sec.code.perf}.)
 

% General defn of optimal decoder 
 If we assume that the channel is a binary symmetric channel and that
 all source vectors are equiprobable, 
% {\em a priori},
 then  the
 optimal decoder
% is one that
 identifies the source vector $\bs$ whose
 encoding $\bt(\bs)$ differs from the received vector $\br$ in the
 fewest bits. [{Refer to the likelihood function 
% equation
% {eq.bayestheorem}--\ref{eq.likelihood.bsc}}
 \bref{eq.likelihood.bsc}} to see why this is so.]
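 In brief, and as a sketch: if $\bt(\bs)$ differs from $\br$ in $d$ of
 the $N=7$ bits (the symbol $d$ is introduced here for illustration), then
\beq
	P(\br \,|\, \bs) = f^{d} (1-f)^{N-d} = (1-f)^{N} \gamma^{-d} ,
\eeq
 with $\gamma \equiv (1-f)/f$ as before; this decreases as $d$
 increases whenever $f<0.5$.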
 We could solve the decoding problem by measuring how far $\br$
 is from each of the 
 sixteen codewords in \tabref{tab.74h}, then picking the closest.
 Is there a more efficient way of finding the most probable source vector?


\subsubsection{Syndrome decoding for the Hamming code}
\label{sec.syndromedecoding}
 For the $(7,4)$ Hamming code there is a pictorial solution to the 
% syndrome
 decoding problem, based on the  encoding  picture,
 \figref{fig.74h.pictorial}. 
%
% \subsubsection{Decoding}
%
% sanjoy says this is CONFUSING - tried to improve it Sat 22/12/01
% also romke did not like it

 As a first example, let's assume the transmission was
 $\bt = {\tt 1000101}$ and the noise flips the second bit,
 so the received vector is
 $\br = {\tt 1000101}\oplus{\tt{0100000}} = {\tt{1100101}}$.
% \ie,  $\bn=({\tt 0},{\tt 1},{\tt 0},{\tt 0},{\tt 0}, {\tt 0},{\tt 0})$,
% and the received vector 
 We write the received vector  into the three circles
 as shown in \figref{fig.hamming.decode}a, and
 look at each of the three circles to see whether its parity is even.
 The circles whose parity is {\em{not}\/} even are shown by
 dashed lines in \figref{fig.hamming.decode}b.
% The fact that all codewords differ from each other in at least 
% three bits means that if the noise has flipped any one or two bits, 
% the received vector will no longer be a valid codeword, and some of 
% the parity checks  will be broken.
%
 The decoding task is
%We want
 to find the smallest
 set of flipped bits that can account for these violations
 of the parity rules.
% violated.
 [The  pattern of violations of the parity checks is called the {\dem\ind{syndrome}}, and can be written as a binary vector -- for example,
 in \figref{fig.hamming.decode}b, the syndrome is $\bz = ({\tt1},{\tt1},{\tt0})$,
 because the first two circles are `unhappy' (parity {\tt1}) and the
 third circle is `happy' (parity {\tt0}).]
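
 Written out (a sketch that simply restates the three circles as
 equations; the component names $z_1 z_2 z_3$ are introduced here
 for illustration), the syndrome of a received vector is
\begin{eqnarray*}
	z_1 &=& r_1 \oplus r_2 \oplus r_3 \oplus r_5 , \\
	z_2 &=& r_2 \oplus r_3 \oplus r_4 \oplus r_6 , \\
	z_3 &=& r_1 \oplus r_3 \oplus r_4 \oplus r_7 .
\end{eqnarray*}
 For $\br = {\tt 1100101}$ these give
 $z_1 = {\tt 1}\oplus{\tt 1}\oplus{\tt 0}\oplus{\tt 1} = {\tt 1}$,
 $z_2 = {\tt 1}\oplus{\tt 0}\oplus{\tt 0}\oplus{\tt 0} = {\tt 1}$, and
 $z_3 = {\tt 1}\oplus{\tt 0}\oplus{\tt 0}\oplus{\tt 1} = {\tt 0}$.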
% RESTORE ME:
%, and the task of  syndrome decoding 
% syndrome (just as a
% \ind{doctor} might seek the most probable underlying \ind{disease} to account for
% the symptoms shown by a \ind{patient}).


\begin{figure}% [htbp]
\figuremargin{\small%
\begin{center}
\begin{tabular}{ccc}
(a)\psfig{figure=hamming/decode.eps,angle=0,width=1.3in}  \\
(b)\psfig{figure=hamming/s2.eps,angle=-90,width=1.3in}  &
(c)\psfig{figure=hamming/t5.eps,angle=-90,width=1.3in}  &
(d)\psfig{figure=hamming/s3.eps,angle=-90,width=1.3in}  \\[0.3in]
\multicolumn{3}{c}{%
(e)\psfig{figure=hamming/s3.t7.eps,angle=0,width=1.3in}  
\setlength{\unitlength}{1in}
\begin{picture}(0.4,0.6)(0,0)
\put(0,0.6){\vector(1,0){0.6}}
\end{picture}
% \raisebox{0.6in}{$\rightarrow$}
(${\rm e}'$)\psfig{figure=hamming/s3.t7.d.eps,angle=0,width=1.3in}  
}\\
\end{tabular} 
\end{center}
}{%
\caption[a]{Pictorial representation of decoding of the Hamming $(7,4)$ 
 code. The received vector is written into the diagram
 as shown in (a).
 In (b,c,d,e), the received vector is
 shown, assuming that the transmitted vector was
 as in
% The bits that are flipped relative to
 \protect
 \figref{fig.hamming.pictorial}b and the bits labelled by $\star$
 were flipped. The violated 
 parity checks are highlighted by dashed circles. One of the seven bits 
 is the most probable suspect to account for each `\ind{syndrome}', \ie, each 
 pattern of violated and satisfied parity checks. 

 In examples  (b), (c), and (d), the most probable suspect is
 the one bit that was flipped.

 In example (e), two bits  have been flipped, $s_3$ and $t_7$.
 The most probable suspect is $r_2$, marked by a circle in (${\rm e}'$),
 which shows the output of the decoding algorithm. 
% each circle is even.
}\label{fig.hamming.decode}
\label{fig.hamming.s2}% these labels were in the wrong place feb 2000
\label{fig.hamming.s3}
\label{fig.hamming.correct}
}
\end{figure}
%
% ACTION: sanjoy still thinks this part is hard to follow - fixed Sat 22/12/01?
 To solve the decoding task,
% problem,
 we ask the question:
 can we find  a unique bit that lies {\em inside\/}
 all the `unhappy' circles and {\em outside\/} all the
 `happy' circles? If so, the flipping of that bit
 would account for the observed
 syndrome. 
 In the case shown in   \figref{fig.hamming.s2}b,
 the bit $r_2$
% that was flipped
 lies  inside the  two unhappy circles and outside the happy
 circle;
 no other single bit has this property, so
 $r_2$  is the only single bit capable of explaining the syndrome.

 Let's work through a couple more examples. 
  \Figref{fig.hamming.s2}c shows what happens if one of the
 parity bits, $t_5$, is flipped by the noise. Just one of the checks
 is violated. Only $r_5$ lies inside this unhappy circle and outside
the other two happy circles,
 so $r_5$ is  identified as the only single bit
 capable of explaining the syndrome.

 If the central bit $r_3$ is received flipped, 
  \figref{fig.hamming.s3}d shows that all three checks are violated;
 only $r_3$ lies inside all three circles, so $r_3$  is
 identified as the  suspect bit.

 If you try flipping any one of the seven bits, you'll find 
 that a different syndrome is obtained in each case -- seven non-zero syndromes,
 one for each bit. There is only 
 one other syndrome, the all-zero syndrome. So if
 the channel is a binary symmetric channel with a
 small noise level $f$, the optimal 
 decoder unflips at most one bit, depending  on the 
 syndrome, as shown in \algref{tab.hamming.decode}.
 Each syndrome could have been caused by other noise patterns
 too, but any other noise pattern that has the same syndrome 
 must be less probable because it involves a larger number of 
 noise events.
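
 To make the comparison concrete (a sketch, taking $f=0.1$ for
 illustration and writing $w$ for the number of flipped bits):
 a noise vector $\bn$ that flips $w$ of the seven bits has probability
\beq
	P(\bn) = f^{w} (1-f)^{7-w} ,
\eeq
 so each additional flip multiplies the probability by $f/(1-f) = 1/9$.
 There are $2^3 = 8$ syndromes and $2^7 = 128$ noise vectors, so
 sixteen noise vectors share each syndrome; the decoder picks the one
 with the fewest flips.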

%\begin{figure}
%\figuremargin{%
\begin{algorithm}
\algorithmmargin{%
\begin{center}
\begin{tabular}{c*{8}{c}}
% Fri 4/1/02 removed toprule and bottomrule because algorithm has its own frame
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% \toprule
Syndrome $\bz$ &            {\tt 000} & {\tt 001} & {\tt 010} & {\tt 011} & {\tt 100} & {\tt 101} & {\tt 110} & {\tt 111} \\ \midrule
Unflip this bit  & {\small{\em none}} &   $r_7$      & $r_6$          &        $r_4$  &  $r_5$       & $r_1$          & $r_2$         & $r_3$ \\
% \bottomrule
% Unflip this bit  & {\small{\em none}} &   7      & 6          &        4  &  5       & 1          & 2         & 3 \\
% \bottomrule
% this is appropriate only if z =z3,z2,z1: 
% Unflip this bit  & {\small{\em none}} & 5      & 6          &        2  &  7       & 1          & 4         & 3 \\  \hline
\end{tabular}
\end{center} 
%\begin{center}
%\begin{tabular}{cc}  \hline
%Syndrome $\bz$ & % 3 2 1 !!!!!!!!!!!!!!!!!!!
%Flip this bit  \\  \hline
% 000 &{\small{\em none}} \\
% 001 &5\\
% 010 &6\\
% 011 &2\\
% 100 &7\\
% 101 &1\\
% 110 &4\\
% 111 &3 \\  \hline
%\end{tabular}
%\end{center}
}{%
\caption[a]{Actions taken by the optimal decoder for the $(7,4)$ Hamming 
 code, assuming a binary symmetric channel with small noise level $f$.
 The syndrome vector $\bz$ lists whether each parity check is 
 violated ({\tt 1}) or satisfied ({\tt 0}),
 going through the checks in the order
 of the bits $r_5$, $r_6$,
 and $r_7$. }
\label{tab.hamming.decode}
}%
\end{algorithm}

 What happens if the noise actually flips more than one bit? 
  \Figref{fig.hamming.s3}e shows the situation when two bits, 
 $r_3$ and $r_7$, are received flipped.  The syndrome, {\tt 110},
 makes us suspect the single bit $r_2$; so our optimal decoding algorithm 
 flips this bit, giving a decoded pattern with three errors 
 as shown in   \figref{fig.hamming.s3}${\rm e}'$.
 If we use the optimal decoding algorithm, 
 any two-bit error pattern will lead to a decoded seven-bit vector 
 that contains three errors. 
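
 Here is a sketch of why this happens in the example just given (the
 noise vector $\bn$ written out below is simply my notation for that example).
 The noise is $\bn = {\tt 0010001}$. Bit $r_3$ lies inside all three circles
 and bit $r_7$ lies inside the third circle only, so the first two checks
 each see one flip and the third check sees two; hence the syndrome
 is {\tt 110}. Unflipping $r_2$ adds a third change, so the decoded vector
 differs from the transmitted codeword by
\beq
 {\tt 0010001} + {\tt 0100000} = {\tt 0110001}  \:\: \mbox{(modulo 2)} ,
\eeq
 which has three {\tt 1}s and is itself a codeword: for source bits
 {\tt 0110} the three parity bits are {\tt 001}. The decoder has landed on
 a neighbouring codeword.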

\subsection{General view of decoding for linear codes: syndrome decoding}
\label{sec.syndromedecoding2}
\begin{aside}
% {\em (Does some of this stuff belong earlier in the pictorial area?)}
 We can also  describe the decoding problem
 for a linear code in terms of matrices.\index{syndrome decoding}\index{linear block code} 
% In the  case of a linear code and a  symmetric channel, 
% the decoding task can be re-expressed as {\bf syndrome decoding}.
% Let's assume that the noise level $f$ is less than $1/2$.
 The first four received bits, $r_1r_2r_3r_4$, purport to be  
 the four source bits; and the received bits $r_5r_6r_7$ purport
 to be the parities of the source bits, as defined by the generator 
 matrix $\bG$. We evaluate the three parity-check bits for the 
 received bits, $r_1 r_2r_3 r_4$, and see whether
 they match the three received 
 bits, $r_5r_6r_7$. The differences (modulo 2) between 
 these two triplets are called the {\dbf\ind{syndrome}}
 of the received vector. 
 If the syndrome is zero -- if all three parity checks are happy
% agree with  the corresponding received bits
 -- then the received vector is a codeword, 
 and the most probable decoding is given by reading out its first four 
 bits.  If the syndrome is non-zero, then
% we are certain that
 the noise 
 sequence for this block was non-zero, and the syndrome is our 
 pointer to the most probable error pattern. 

 The computation of  the  syndrome vector is a
 linear operation. If we define the $3 \times 4$ matrix $\bP$
 such that  the matrix of 
 equation (\ref{eq.h74.gen})
is
\beq
        \bG^{\T} = \left[ \begin{array}{c}{\bI_4}\\
 \bP\end{array} \right], 
\eeq
 where $\bI_4$ is the $4\times 4$ identity matrix,  then 
 the syndrome vector is $\bz  = \bH \br$, where the {\dbf\ind{parity-check matrix}}
 $\bH$ is given by $\bH =  \left[ \begin{array}{cc} -\bP & \bI_3 \end{array}
 \right]$; in  modulo 2 arithmetic, $-1 \equiv 1$, so
\beq
        \bH =   \left[ \begin{array}{cc} \bP & \bI_3 \end{array}
 \right] = \left[ 
 \begin{array}{ccccccc} 
\tt  1&\tt 1&\tt 1&\tt 0&\tt 1&\tt 0&\tt 0 \\
\tt  0&\tt 1&\tt 1&\tt 1&\tt 0&\tt 1&\tt 0 \\
\tt  1&\tt 0&\tt 1&\tt 1&\tt 0&\tt 0&\tt 1
 \end{array} \right] .
\label{eq.pcmatrix}
\eeq
 All the codewords $\bt = \bG^{\T} \bs$ of the code satisfy
\beq
	\bH \bt = \left[ {\tt \begin{array}{c} \tt0\\ \tt0\\ \tt0 \end{array} } \right] .
% (0,0,0)  .
\eeq
\exercisaxB{1}{ex.GHis0}{
 Prove that this is so by evaluating the $3\times4$ matrix $\bH \bG^{\T}$.
}
 Since the received vector $\br$ is given by $\br = \bG^{\T}\bs + \bn$,
% and $\bH \bG^{\T}$=0, 
 the syndrome-decoding problem is  to find the
 most probable noise vector $\bn$ satisfying
 the equation 
\beq
        \bH \bn = \bz .
\eeq
 A decoding algorithm that solves this problem is called 
 a {\dem maximum-likelihood decoder}. We will discuss 
 decoding problems like this  in later  chapters. 
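
 As a worked illustration (the received vector is an arbitrary choice of mine,
 not one taken from the figures), suppose $\br = {\tt 1100101}$.
 Reading off the rows of $\bH$ in \eqref{eq.pcmatrix}, with all sums modulo 2,
\beqan
 z_1 &=& r_1 + r_2 + r_3 + r_5 = {\tt 1}+{\tt 1}+{\tt 0}+{\tt 1} = {\tt 1} \nonumber \\
 z_2 &=& r_2 + r_3 + r_4 + r_6 = {\tt 1}+{\tt 0}+{\tt 0}+{\tt 0} = {\tt 1} \nonumber \\
 z_3 &=& r_1 + r_3 + r_4 + r_7 = {\tt 1}+{\tt 0}+{\tt 0}+{\tt 1} = {\tt 0} . \nonumber
\eeqan
 So $\bz = {\tt 110}$, which is the second column of $\bH$. Assuming a small
 noise level, the most probable noise vector satisfying $\bH \bn = \bz$
 is the single flip $\bn = {\tt 0100000}$; removing it gives the codeword
 {\tt 1000101} and the decoded source bits {\tt 1000}.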
%\footnote{Somewhere in this book
% I need to spell out \Bayes\  theorem for decoding. Here would be 
% a good spot; but on the other hand, people can understand decoding
% intuitively, they don't need Bayes theorem and they might find it 
% a hindrance if they were not only being hit by 
% Shannon's theorem but also by likelihoods and priors.}
%
% ACTION NEEDED ????????????????????????????????????????
%
\end{aside}

\begin{figure}
%\fullwidthfigure{%
\figuredanglenudge{%
\begin{center}
\setlength{\unitlength}{0.8in}% was 1in, with figures 1.25 wide % then was 0.8 with 1in
\begin{picture}(7,2.7)(0,2.8)
\put(0,5){\makebox(0,0)[tl]{\psfig{figure=bitmaps/dilbert.ps,width=1in}}}
\put(0.625,5.4){\makebox(0,0){\Large$\bs$}}
\thicklines
\put(1.35,4.75){\vector(1,0){0.4}}
\put(1.55,5.4){\makebox(0,0){{\sc encoder}}}
\put(2,5){\makebox(0,0)[tl]{\psfig{figure=poster/10000.h74.ps,width=1in}}}
\put(1.982,3.75){\makebox(0,0)[tr]{{parity bits} $\left.\rule[-0.342in]{0pt}{0.342in} \right\{$}}
\put(2.625,5.4){\makebox(0,0){\Large$\bt$}}
\put(3.6,5.4){\makebox(0,0){{\sc channel}}}
\put(3.6,5.15){\makebox(0,0){$f={10\%}$}}
\put(3.4,4.75){\vector(1,0){0.4}}
\put(4,5){\makebox(0,0)[tl]{\psfig{figure=poster/10000.h74.0.10.ps,width=1in}}}
\put(4.625,5.4){\makebox(0,0){\Large$\br$}}
\put(5.6,5.4){\makebox(0,0){{\sc decoder}}}
%\put(5.6,3.5){\makebox(0,0)[tl]{\parbox[t]{1.75in}{{\em The decoder picks the $\hat{\bs}$ with maximum likelihood.}}}}
\put(5.4,4.75){\vector(1,0){0.4}}
\put(6,5){\makebox(0,0)[tl]{\psfig{figure=poster/10000.h74.0.10.d.ps,width=1in}}}
\put(6.625,5.4){\makebox(0,0){\Large$\hat{\bs}$}}
\end{picture}
\end{center}
}{%
\caption[a]{Transmitting $10\,000$ source bits over a binary symmetric channel
 with $f=10\%$
%0.1$
  using a $(7,4)$ Hamming code. The  probability 
 of decoded bit error is about 7\%.}
% \dilbertcopy}
\label{fig.h74.dilbert}
}{0.7in}% third argument is the upward nudge of the caption
\end{figure}
\subsection{Summary of the $(7,4)$ Hamming code's properties}
 Every possible received vector of length 7 bits is either a codeword,
 or it's one flip away from a codeword.\index{Hamming code}
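
 A counting check, sketched under the assumption that no vector is one flip
 away from two different codewords (which is what the seven distinct
 non-zero syndromes tell us): there are $2^4 = 16$ codewords, and each
 accounts for itself plus its seven one-flip neighbours, so
\beq
 16 \times (1 + 7) = 128 = 2^7 ,
\eeq
 which covers every possible received vector exactly once.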

 Since there are three parity constraints, each of which might
 or might not be violated, there are
 $2\times 2\times 2= 8$
% eight
 distinct syndromes. They can be divided 
 into seven non-zero syndromes --  one
 for each of the one-bit error patterns --
 and the all-zero syndrome, corresponding to the zero-noise case. 

 The optimal decoder takes no action if the syndrome is zero; 
 otherwise it uses this mapping of non-zero syndromes onto one-bit error 
 patterns to unflip the suspect bit. 

 There is a {\dbf decoding error}   if the four decoded bits $\hat{s}_1,
 \hat{s}_2, \hat{s}_3, \hat{s}_4$ do not all match the source bits ${s}_1,
 {s}_2, {s}_3, {s}_4$. The {\dbf probability of block error} $\pB$ is 
 the probability that one or more of the decoded bits in one block fail to 
 match the corresponding source bits,
\beq
 \pB = P( \hat{\bs} \neq \bs ) .
\eeq
 The {\dbf probability of bit error} $\pb$ is 
 the average probability
%  per decoded bit
 that a decoded bit fails to 
 match the corresponding source bit,
\beq
        \pb =  \frac{1}{K} \sum_{k=1}^K P( \hat{s}_k \neq s_k ) .
\eeq

 In the case of the Hamming code, 
 a decoding error will occur whenever  the noise has flipped more than 
 one bit in a block of seven. 
%  Any noise pattern that flips more than one bit will give rise to one of 
%  these syndromes, and our decoder will make an erroneous decision. 
%
 The probability of block error is thus the probability that two or more 
 bits are flipped in a block. This probability scales as $O(f^2)$, as did the 
 probability of error for the repetition code 
 $\Rthree$. But notice that the Hamming code 
 communicates at a greater rate, $R=4/7$. 

 \Figref{fig.h74.dilbert} shows a binary image transmitted over 
 a binary symmetric channel using the $(7,4)$ Hamming code. 
 About 7\% of the decoded bits are in error. Notice that 
 the errors are correlated:
% with each other:
 often two or three successive
 decoded bits are flipped.

\exercisaxA{1}{ex.Hdecode}{
 This exercise and the next three  refer to the 
  $(7,4)$ \ind{Hamming code}.  Decode the received strings:
\ben
\item $\br = {\tt 1101011}$ % 10
\item $\br = {\tt 0110110}$ % 4
\item $\br = {\tt 0100111}$ % 4
\item $\br = {\tt 1111111}$. % 15
\een
}
\exercissxA{2}{ex.H74p}{
\ben \item
 Calculate the probability of block error $p_{\rm B}$ of the $(7,4)$ Hamming 
 code 
 as a function of the noise level $f$ and show that to leading order
% \footnote{Do I need to explain what this means? Or use a different
%  terminology? Maybe only physicists are familiar?} 
%
% ACTION!!!
%
 it goes as $21 f^2$.
\item
% }
% \exercis{}{
 \difficulty{3}
% $^{B3}$
 Show that to leading order  the probability of 
 bit error $\pb$ goes as $9 f^2$. 
\een}
\exercissxA{2}{ex.H74zero}{
% Hamming $(7,4)$ code.
 Find some noise vectors that give the all-zero syndrome (that is, 
 noise vectors that leave all the parity checks unviolated). 
 How many such noise vectors are there?
}
% they are the codewords. 
\exercisaxB{2}{ex.H74detail}{
% Hamming $(7,4)$ code.
 I asserted above that a block decoding error will result 
 whenever two or more bits are flipped in a single block. 
 Show that this is indeed so. [In principle, there might be 
 error patterns that, after decoding, led only to the corruption 
 of the parity bits, with no source bits incorrectly 
 decoded.] 
}

\subsection{Summary of codes' performances}
\label{sec.code.perf}
 Figure \ref{fig.pbR.RH} shows the performance of \ind{repetition code}s and
 the \ind{Hamming code}. It also shows the performance of a family of linear
 block codes that are generalizations of Hamming codes, called \ind{BCH codes}.  
% Reed-Muller codes, and
%  see end of this file for method
% 
\begin{figure}[htbp]
\figuremargin{%
\begin{center}
\begin{tabular}{cc}
\hspace{-0.2in}\psfig{figure=\codefigs/rephambch.1.ps,angle=-90,width=2.6in} &
\pbobject\psfig{figure=\codefigs/rephambch.1.l.ps,angle=-90,width=2.6in}  \\
\end{tabular} 
\end{center}
}{%
\caption[a]{Error probability $\pb$ versus rate $R$ for repetition codes,
  the $(7,4)$ Hamming code and BCH codes with blocklengths up to 1023
 over a binary symmetric channel with $f=0.1$.
 The righthand figure shows $\pb$ on a logarithmic scale.}
\label{fig.pbR.RH}
}
\end{figure}
%

%\noindent
% use this noindent if the ``h'' (here) works, otherwise new para.
 This figure shows that we can, using linear block codes, achieve better
 performance than repetition codes; but the asymptotic situation still
 looks  grim. 
\exercissxA{4}{ex.makecode}{
% invent your own code 
 Design an error-correcting code and a decoding algorithm for it,
 estimate its probability of error, 
 and add it to figure \ref{fig.pbR.RH}.
 [Don't worry if you find it difficult to make a code better than the
 Hamming code, or if you find it difficult to find a good
 decoder for your code; that's the point of this exercise.]
}
\exercissxA{3}{ex.makecode2error}{
 A $(7,4)$ Hamming code
 can correct any {\em one\/} error;
 might there be a
% (10,4)
 $(14,8)$ code
 that can correct any two errors?
% What about a (9,4) code?

 {\sf Optional extra:} Does the answer to this question
 depend on whether the code is linear or nonlinear?
}
\exercissxA{4}{ex.makecode2}{
	 Design an error-correcting code, other than
 a repetition code, that can
 correct any {\em two\/} errors  in a block of size $N$.
}

\section{What performance can the best codes achieve?}
 There seems to be a trade-off between the decoded bit-error
 probability $\pb$ (which we would like to reduce) and the rate $R$ (which
 we would like to keep large).  How can this trade-off be
 characterized?
%  Can we do better than repetition codes?
 What points in
 the $(R,\pb)$ plane are achievable?  This question was addressed by
 Claude Shannon\index{Shannon, Claude} in his pioneering paper of 1948, in which he both created the
 field of information theory and solved most of its fundamental
 problems.
%  in the same paper.

 At that time  there was a widespread belief that the 
 boundary between achievable and nonachievable points in the 
 $(R,\pb)$ plane was a curve passing through the origin $(R,\pb) = (0,0)$; 
 if this were so, then,  in order to achieve a vanishingly small 
 error probability $\pb$, one would have to reduce the rate 
 correspondingly close to zero.
%  (figure ref here).
% This would seem a reasonable guess, 
% in accordance with the general rule that the better something works
% the more you have to pay for it. 
%
% ACTION: sanjoy doesn't like This
%
 `No pain, no gain.'

 However, Shannon proved the remarkable result that\wow\
% , for any given  channel,
 the boundary 
 between achievable and nonachievable points meets the $R$ 
 axis at a {\em non-zero\/} value $R=C$, as shown in \figref{fig.pbR.RHS}. 
\begin{figure}[htbp]
\figuremargin{%
\begin{center}
\begin{tabular}{cc}
\hspace{-0.2in}\psfig{figure=\codefigs/repshan.1.ps,angle=-90,width=2.6in} &
\pbobject\psfig{figure=\codefigs/repshan.1.l.ps,angle=-90,width=2.6in}  \\
\end{tabular} 
\end{center}
}{%
\caption[a]{Shannon's noisy-channel coding theorem.\indexs{noisy-channel coding theorem}\index{Shannon, Claude}
 The solid curve shows  the Shannon limit 
 on achievable values of $(R,\pb)$  for
 the binary symmetric channel with $f=0.1$.
 Rates up to $R=C$ are achievable with arbitrarily small $\pb$.
 The  points show the performance of some textbook codes,
 as in \protect\figref{fig.pbR.RH}.


%\indent MANUAL INDENT
\hspace{1.5em}The equation defining the Shannon limit (the solid curve) is 
%\[
        $R = \linefrac{C}{(1-H_2(\pb))},$
%\]
 where $C$ and $H_2$ are defined in \protect \eqref{eq.capacity}.
}
\label{fig.pbR.RHS}
}
\end{figure}
%  see end of this file for method
%
 For any channel, there exist codes that make it possible to 
 communicate  with {\em arbitrarily small\/} probability of 
 error $\pb$ at non-zero rates.  The first half of this book ({\partnoun}s I--III) will be 
 devoted  to understanding  this  remarkable result, which is called 
 the {\dbf{noisy-channel coding theorem}}.

\subsection{Example: $f=0.1$}% A few details}
 The maximum rate at which communication is possible with arbitrarily
 small $\pb$ is called the {\dbf\ind{capacity}} of the channel.\index{channel!capacity} 
 The formula for the capacity of a binary
 symmetric channel  with noise level $f$ is\index{binary entropy function} 
\beq
        C(f) = 1 - H_2(f) = 1 - \left[ f \log_2
                 \frac{1}{f} + (1-f) \log_2 \frac{1}{1-f} \right] ;
\label{eq.capacity}
\eeq
 the channel we were discussing earlier with noise level $f=0.1$
 has capacity $C \simeq 0.53$. Let us consider what this means in terms 
 of noisy \disc{} drives. The \ind{repetition code} $\Rthree$ could communicate over this
 channel with $\pb=0.03$ at a rate $R = 1/3$. Thus we know how
 to  build a single  gigabyte \disc{} drive with $\pb = 0.03$
 from three noisy  gigabyte \disc{} drives. We also know how to make 
 a single  gigabyte \disc{} drive 
 with  $\pb \simeq 10^{-15}$ from sixty
 noisy one-gigabyte drives \exercisebref{ex.R60}.
 And now  Shannon\index{Shannon, Claude}
 passes by, notices us 
 \ind{juggling}
% tinkering 
 with \disc{} drives and codes and says:
\begin{quotation}
\noindent
        `What performance are you trying to achieve? 
        $10^{-15}$? You don't need {\em sixty\/} \disc{} drives   -- 
        you can get that performance with just  
         {\em two\/} \disc{} drives (since 1/2 is  less than $0.53$).  
%       (The capacity is 0.53, so the number of \disc{} drives needed at 
%       capacity is 1/0.53.)
%       `
        And if you want $\pb = 10^{-18}$
% , or $10^{-21}$,
         or $10^{-24}$ or anything,
        you can get there with  two \disc{} drives too!'
\end{quotation}
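
 The numbers quoted in this example can be checked as follows (a rough
 calculation, rounded to three decimal places). For $f = 0.1$,
\beqan
 H_2(0.1) &=& 0.1 \log_2 10 + 0.9 \log_2 \frac{1}{0.9}
            \simeq 0.332 + 0.137 = 0.469 , \nonumber \\
 C(0.1) &=& 1 - H_2(0.1) \simeq 0.531 . \nonumber
\eeqan
 For the repetition code $\Rthree$ with majority-vote decoding, a decoded
 bit is wrong when two or three of its three copies are flipped, so
\beq
 \pb = 3 f^2 (1-f) + f^3 = 0.027 + 0.001 = 0.028 \simeq 0.03 .
\eeq
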
%\begin{aside}
 [Strictly, the above statements might  not be quite right, since,
 as we shall see, Shannon
 proved his 
 noisy-channel coding theorem
%proves the achievability of ever smaller
% error probabilities at a given rate $Ra$)
 is defined to be $\int_{a}^{b} \! \d v \: P(v)$. $P(v)\d v$ is dimensionless.
 The density $P(v)$ is a dimensional
 quantity, having dimensions inverse to the dimensions of $v$ -- in contrast
 to discrete  probabilities, which are dimensionless. Don't be surprised
 to see probability densities greater than 1. This is normal, and nothing
 is wrong, as long as $\int_{a}^{b} \! \d v \: P(v) \leq 1$ for any interval $(a,b)$.
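
 For example (a toy density chosen here purely for illustration), take
 $P(v) = 2$ for $0 < v < \dhalf$ and $P(v) = 0$ elsewhere. The density
 exceeds 1 wherever it is non-zero, yet
\beq
 \int_{0}^{1/2} \! \d v \: P(v) = 2 \times \dhalf = 1 ,
\eeq
 and the integral over any other interval is no larger.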

 Conditional and joint probability densities 
 are defined  in just the same way as conditional and joint probabilities.
% , which is why I choose not to use different notation for them.
\end{aside}

% More equations here.
%
% bring from chapter 4?
%
% at present ch 4 refers to this page as the first occurrence of
% Laplace's rule.
%
% Sort out this mess:::::::::::::::
% p30 Ex 2.8 : There claims to be a solution to this on p121 but this is
%actually a solution to Ex 6.2
%Generally would be helpful if notation in Chapters 2 and 6 was the same
%
% !!!!!!!!!!!!!!!!!!!! Idea: move this exe to the end of this subsection?
% THIS EX seems to have no solution
\exercisaxB{2}{ex.postpa}{% solution added Mon 10/11/03
	Assuming a uniform prior on $f_H$,  $P(f_H) = 1$,
 solve the problem posed in \exampleref{exa.bentcoin}.
 Sketch the posterior distribution of $f_H$
 and compute the probability that the $N\!+\!1$th outcome will be a head,
 for
\ben
\item	$N=3$ and $n_H=0$; 
\item	$N=3$ and $n_H=2$; 
\item	$N=10$ and $n_H=3$; 
\item
	$N=300$ and $n_H=29$.
\een
 You will find the \ind{beta integral} useful: 
\beq
 \int_0^1 \! \d p_a \: p_a^{F_a} (1-p_a)^{F_b}  = 
        \frac{\Gamma(F_a+1)\Gamma(F_b+1)}{ \Gamma(F_a+F_b+2) } 
        = \frac{ F_a! F_b! }{ (F_a + F_b + 1)! } .
\eeq
 You may also find it instructive to look back at
 \exampleref{ex.ip.urns} and \eqref{eq.laplace.succession.first}.
}
 
 People sometimes confuse assigning a prior distribution 
 to an unknown parameter such as $f_H$ with making an initial guess 
 of the {\em{value}\/} of the parameter. 
% But priors  are not values, they are distributions.
 But the prior over $f_H$, $P(f_H)$, is not a simple statement
 like `initially, I would guess $f_H = \dhalf$'. 
 The prior is a probability density over $f_H$ which 
  specifies the prior degree of belief that $f_H$ lies
 in any interval $(f,f+\delta f)$. It may well be the case
 that our prior   for $f_H$ is symmetric about $\dhalf$, so that the
 {\em mean\/} of  $f_H$ under the prior is $\dhalf$.
%under our prior for $f_H$, the {\em mean\/} of  $f_H$ is $\dhalf$
% -- on symmetry grounds for example.
 In this case, the 
 predictive distribution {\em for the first toss\/} $x_1$ would indeed be 
\beq
	P(x_1 \eq  \mbox{head}) = 
	\int \! \d f_H \: P(f_H) P(x_1 \eq  \mbox{head} \given  f_H)
	= \int \! \d f_H \: P(f_H)  f_H = \dhalf .
\eeq
 But the prediction for subsequent tosses will depend on
 the whole prior distribution, not just its mean.
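
 To illustrate, compare two priors that share the mean $\dhalf$ (both are
 illustrative choices, and I assume, as in the bent-coin model, that the
 tosses are independent given $f_H$). The probability that the second
 toss is a head, given that the first was a head, is
\beq
	P(x_2 \eq  \mbox{head} \given x_1 \eq  \mbox{head}) =
	\frac{ \int \! \d f_H \: P(f_H)  f_H^2 }{ \int \! \d f_H \: P(f_H)  f_H } .
\eeq
 Under the uniform prior $P(f_H) = 1$ this ratio is
 $\frac{1/3}{1/2} = \frac{2}{3}$, whereas a prior that puts all its
 probability at $f_H = \dhalf$ gives $\frac{1/4}{1/2} = \dhalf$.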

\subsubsection{Data compression and inverse probability}
 Consider the following task.
\exampl{ex.compressme}{
 Write a computer  program capable of compressing  binary files like this
 one:\par
\begin{center}{\footnotesize%was tiny
{\tt 0000000000000000000010010001000000100000010000000000000000000000000000000000001010000000000000110000}\\
{\tt 1000000000010000100000000010000000000000000000000100000000000000000100000000011000001000000011000100}\\
{\tt 0000000001001000000000010001000000000000000011000000000000000000000000000010000000000000000100000000}\\[0.1in]% added this space Sat 21/12/02
}
\end{center}
%  This file contains N=300 and n_1 = 29
 The string shown contains $n_1=29$ {\tt 1}s 
 and $n_0=271$ {\tt 0}s.
% What is the probability that the next character in this file
% is a {\tt 1}? 
}
 Intuitively, compression works by taking advantage of the predictability
 of a file. In this case, the source of the file
 appears more likely to emit
 {\tt 0}s than {\tt 1}s. A data compression program that compresses
 this file must, implicitly or explicitly, be addressing the
 question `What is the probability that the next character in this file
 is a {\tt 1}?' 


 Do you think this problem is similar in character
 to \exampleref{exa.bentcoin}? I do. One of the themes
 of this book is  that data compression and
 data modelling are one and the same, and that they should
 both be addressed, like the  urn of example \ref{ex.ip.urns},
 using inverse probability. 
 \Exampleonlyref{ex.compressme} is solved in  \chref{ch4}.
%
% SOLVE IT HERE???
%
\subsection{The likelihood principle}
\label{sec.lp}
 Please solve the following two exercises.
\exampl{ex.lp1}{
 Urn\amarginfig{c}{\begin{center}\psfig{figure=figs/urnsA.ps,width=1.6in}\end{center}
\caption[a]{Urns for \protect\exampleonlyref{ex.lp1}.}}
 A contains three balls: one black, and two white;
 \ind{urn} B contains three balls: two black, and one white.
 One of the urns is selected at random and one ball
 is drawn. The ball is black. What is the probability
 that the selected urn is urn A?
}
%
\exampl{ex.lp2}{
 Urn\amarginfig{c}{\begin{center}\psfig{figure=figs/urns.ps,width=1.6in}\end{center}%
\caption[a]{Urns for \protect\exampleonlyref{ex.lp2}.}}
 A contains five balls: one black, two white, one green and one pink;
 urn B contains five hundred balls:
 two hundred black, one hundred white, 50 yellow, 40 cyan, 30 sienna,
 25 green, 25 silver, 20 gold, and 10  purple.
 [One fifth of A's balls are black; two-fifths of B's are black.]
 One of the urns is selected at random and one ball
 is drawn. The ball is black. What is the probability
 that the urn is urn A?
}
%
 What do you notice about your solutions?  Does each answer
 depend on the detailed contents of each urn?

 The details of the other possible outcomes and their probabilities
 are irrelevant. All that matters is the probability of the outcome
 that actually happened (here, that the ball drawn was  black) given the different
 hypotheses. We need only  to know the {\em likelihood}, \ie,
 how the probability  of the  data that happened varies with the
 hypothesis.
 This simple rule about inference
 is known as the {\dbf\ind{likelihood principle}}.\label{sec.likelihoodprinciple}
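
 For reference, here is one way to write out the two calculations, taking
 `selected at random' to mean that each urn has prior probability $\dhalf$
 (an assumption about the wording of the examples).
 In \exampleonlyref{ex.lp1},
\beq
	P( \mbox{A} \given \mbox{black} ) =
	\frac{ \dhalf \times \dthird }{ \dhalf \times \dthird + \dhalf \times \frac{2}{3} }
	= \frac{1}{3} ,
\eeq
 and in \exampleonlyref{ex.lp2},
\beq
	P( \mbox{A} \given \mbox{black} ) =
	\frac{ \dhalf \times \frac{1}{5} }{ \dhalf \times \frac{1}{5} + \dhalf \times \frac{2}{5} }
	= \frac{1}{3} .
\eeq
 Only the probabilities of drawing a black ball enter either expression.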
%
% NOTE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% { \em (connect back to this point when discussing
% early stopping and inference in problems where the stopping rule is not known.)}
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% README NOTE!!!!!!!!!!
\begin{conclusionbox}
 {\sf The likelihood principle:}
 given a generative model for data $d$ given parameters $\btheta$, $P(d \given \btheta)$,
 and having observed a particular outcome $d_1$, all inferences\index{key points!likelihood principle} 
 and predictions should depend only on the function $P(d_1 \given \btheta)$. 
\end{conclusionbox}
\noindent
 In spite of the simplicity of this principle,
 many classical statistical methods violate it.\index{classical statistics!criticisms}\index{sampling theory!criticisms}

% \newpage
\section{Definition of entropy and related functions}
\begin{description} 
\item[The Shannon information content of an outcome $x$] is defined to be
%	We define for each $x \in \A_X$, $
\beq
	h(x) = \log_2 \frac{1}{P(x)} .
\eeq 
% We can interpret $h(a_i)$ as the information content of the event 
% $x \eq  a_i$.
 It is measured in bits. [The word `bit' is also used to
 denote a variable whose value is 0 or 1; I hope context will
 always make clear which of the two meanings is intended.]

\noindent
 In the next few chapters, we will establish  	that 
 the Shannon information content  $h(a_i)$  is indeed a natural measure of 
 the  information content of the event $x \normaleq a_i$.
 At that point, we will shorten the name of this quantity to 
 `the information content'. 

\margintab{%
\begin{center}\small%footnotesize
%
% vertical table of a-z with probabilities, and information contents too;
% four decimal place
\begin{tabular}[t]{cccr} \toprule
$i$ & $a_i$ & $p_i$ & \multicolumn{1}{c}{$h(a_i)$} \\ \midrule
% $i$ & $a_i$ & $p_i$ & \multicolumn{1}{c}{$\log_2 \frac{1}{p_i}$} \\ \midrule
%
1 & {\tt a} &.0575  &  4.1 \\ 
2 & {\tt b} &.0128  &  6.3 \\ 
3 & {\tt c} &.0263  &  5.2 \\ 
4 & {\tt d} &.0285  &  5.1 \\ 
5 & {\tt e} &.0913  &  3.5 \\ 
6 & {\tt f} &.0173  &  5.9 \\ 
7 & {\tt g} &.0133  &  6.2 \\ 
8 & {\tt h} &.0313  &  5.0 \\ 
9 & {\tt i} &.0599  &  4.1 \\ 
10 &{\tt j} &.0006  & 10.7 \\ 
11 &{\tt k} &.0084  &  6.9 \\ 
12 &{\tt l} &.0335  &  4.9 \\ 
13 &{\tt m} &.0235  &  5.4 \\ 
14 &{\tt n} &.0596  &  4.1 \\ 
15 &{\tt o} &.0689  &  3.9 \\ 
16 &{\tt p} &.0192  &  5.7 \\ 
17 &{\tt q} &.0008  & 10.3 \\ 
18 &{\tt r} &.0508  &  4.3 \\ 
19 &{\tt s} &.0567  &  4.1 \\ 
20 &{\tt t} &.0706  &  3.8 \\ 
21 &{\tt u} &.0334  &  4.9 \\ 
22 &{\tt v} &.0069  &  7.2 \\ 
23 &{\tt w} &.0119  &  6.4 \\ 
24 &{\tt x} &.0073  &  7.1 \\ 
25 &{\tt y} &.0164  &  5.9 \\ 
26 &{\tt z} &.0007  & 10.4 \\ 
27 &{\tt{-}}&.1928  &  2.4 \\ \midrule
%27 &\verb+-+&.1928  &  2.4 \\ \midrule
 & & & \\[-0.1in]
\multicolumn{3}{r}{
$\displaystyle \sum_i p_i \log_2 \frac{1}{p_i}$
} & 4.1  \\ \bottomrule % 4.11
\end{tabular}\\ 

\end{center}
%  vertical table of a-z with probabilities, and information contents too;
\caption[a]{Shannon information contents of the outcomes {\tt a}--{\tt z}.}
\label{fig.monogram.log}
}
%
 The fourth column  in  \tabref{fig.monogram.log} shows the Shannon 
 information content of the 27 possible outcomes when
 a 
 random character is picked from an English document. The 
 outcome
% character
 $x={\tt z}$ has a Shannon information content of
 10.4 bits, and $x={\tt e}$ has an information content of 3.5 bits.


\item[The entropy of an ensemble $X$] is defined to be the average Shannon information 
 content of an outcome:
% from that ensemble:
\beq
	H(X) \equiv \sum_{x \in \A_X} P(x) \log \frac{1}{P(x)}, 
\eeq
%\beq
%	H(X) = \sum_i p_i \log \frac{1}{p_i}, 
%\eeq
 with the convention for $P(x) \normaleq 0$ that \mbox{$0 \times \log 1/0 \equiv 0$},
 since \mbox{$\lim_{\theta\rightarrow 0^{+}} \theta \log 1/\theta \normaleq  0 $}.

 Like the information content, entropy is measured in bits. 

 When it is convenient, we may also write $H(X)$ as $H(\bp)$, 
 where $\bp$ is the vector $(p_1,p_2,\ldots,p_I)$.
 Another name for the entropy of $X$ is the uncertainty of $X$. 
\end{description}
\noindent
% The entropy is  a measure of the information content or 
% `uncertainty' of $x$. The question of why entropy is a 
% fundamental measure of information content will  be discussed in the 
% forthcoming chapters. Here w

% was continued example
\exampl{eg.mono}{
 The entropy of a 
 randomly selected letter in an English document
 is about  4.11 bits, assuming its probability 
 is as given in  \tabref{fig.monogram.log}.
%, p.\ \pageref{fig.monogram}.
%   \tabref{tab.mono}. 
 We obtain this number  by  averaging $\log 1/p_i$ (shown in the fourth 
 column) under the probability distribution $p_i$ (shown in the third column).  
}

 We now note some properties of the entropy function.
\bit
\item  
	$H(X) \geq 0$ with equality iff $p_i \normaleq  1$ for one $i$.
 [{`iff' means
 `if and only if'.}]
\item Entropy is maximized if $\bp$ is uniform:
\beq
	H(X) \leq \log(|\A_X|)
 \:\: \mbox{ with equality iff $p_i \normaleq  1/|\A_X|$ for all $i$. }
\eeq
% \footnote{Exercise: Prove this assertion.}
 {\sf Notation:}\index{notation!absolute value}\index{notation!set size}
 the vertical bars `$|\cdot|$'
 have two meanings.
% If $X$ is an ensemble, then
 If $\A_X$ is  a set, 
 $|\A_X|$ denotes the number of elements in  $\A_X$;
 if $x$ is a number,
% for example, the value of a random variable,
 then $|x|$ is the absolute  value of $x$.
\eit
%
% Mon 22/1/01
 The {\dem\ind{redundancy}}
 measures the fractional difference
 between $H(X)$ and its maximum possible value,
 $\log(|\A_X|)$.
\begin{description}% 
\item[The redundancy of $X$] is:
\beq
	1 - \frac{H(X)}{\log |\A_X|} .
\eeq
	We won't make use of `redundancy'
% need this definition
 in this book, so
 I have not assigned a symbol to it.
% -- it would be redundant.
\end{description}
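
 As an illustrative number (using the monogram ensemble of
 \tabref{fig.monogram.log}, for which $|\A_X| = 27$ and
 $H(X) \simeq 4.11\ubits$), the redundancy would be roughly
\beq
	1 - \frac{4.11}{\log_2 27} \simeq 1 - \frac{4.11}{4.75} \simeq 0.14 .
\eeq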
% ha ha
% funny but true.
% example: X is select a codeword from a code - H(X) = K, but |X| = 2^N
%
% Redundancy = 1 - R
%  of code


\begin{description}% duplicated in _l1a and _p5A
\item[The joint entropy of $X,Y$] is:
\beq
	H(X,Y) = \sum_{xy \in \A_X\A_Y} P(x,y) \log \frac{1}{P(x,y)}.
\eeq
	Entropy is additive for independent random variables:
\beq
	H(X,Y) = H(X) +H(Y) \:\mbox{ iff }\: P(x,y)=P(x)P(y).
\label{eq.ent.indep}% also appears in p5a (.again)
\eeq
\end{description}
\label{sec.entropy.end.parta}
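
 The `if' direction of \eqref{eq.ent.indep} follows in two lines (a sketch
 only; the `only if' direction needs a separate argument and is not shown here).
 If $P(x,y) = P(x)P(y)$ then
\beqan
	H(X,Y) &=& \sum_{xy \in \A_X\A_Y} P(x)P(y)
	 \left[ \log \frac{1}{P(x)} + \log \frac{1}{P(y)} \right] \nonumber \\
	&=& \sum_{x} P(x) \log \frac{1}{P(x)} + \sum_{y} P(y) \log \frac{1}{P(y)}
	 = H(X) + H(Y) , \nonumber
\eeqan
 using $\sum_x P(x) = \sum_y P(y) = 1$.
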
 Our definitions for information content
 so far  apply only to discrete probability distributions
 over finite sets $\A_X$.  The definitions can be extended
 to infinite sets, though the entropy may then be infinite.
 The case of a probability {\em density\/} over a continuous set is
 addressed in section \ref{sec.entropy.continuous}.\index{probability!density} 
 Further important definitions and exercises to do with entropy
 will come along in  section \ref{sec.entropy.contd}.

\section{Decomposability of the entropy}
 The entropy function satisfies a recursive property
 that can be very useful when computing entropies.
 For convenience, we'll stretch our notation\index{notation!entropy}
 so that we can write $H(X)$ as $H(\bp)$, where
 $\bp$ is the probability vector  associated with the ensemble $X$.

 Let's illustrate the property by an example first.
 Imagine that a random variable $x \in \{ 0,1,2 \}$
 is created by first flipping a fair coin to determine
 whether $x = 0$; then, if $x$ is not 0,
 flipping a fair coin a second time to determine whether
 $x$ is 1 or 2.
 The probability distribution of $x$ is
\beq
	P( x\! =\! 0 )  = \frac{1}{2} ; \:\:
	P( x\! =\! 1 )  = \frac{1}{4} ; \:\:
	P( x\! =\! 2 )  = \frac{1}{4} .
\eeq
 What is the entropy of $X$? We can either compute it by brute
 force:
\beq
	H(X) = \dfrac{1}{2} \log 2 +  \dfrac{1}{4} \log 4  +  \dfrac{1}{4} \log 4
	     = 1.5\ubits ; 
\eeq
 or we can use the following decomposition, in which the value of $x$
 is revealed gradually.
 Imagine  first learning whether $x\! =\! 0$, and then,
 if $x$ is not $0$, learning which non-zero value is the case. The revelation
 of whether  $x\! =\! 0$ or not entails revealing a
 binary variable whose probability distribution is $\{\dhalf,\dhalf \}$.
 This revelation has an entropy $H(\dhalf,\dhalf) = \frac{1}{2} \log 2 +\frac{1}{2} \log 2 = 1\ubit$.
 If  $x$ is not $0$,  we learn the value of  the second  coin flip.
 This too is  a
 binary variable whose probability distribution is $\{\dhalf,\dhalf\}$, and whose entropy is
 $1\ubit$.
 We only get to experience the second revelation half the time, however,
 so the entropy can be written:
\beq
	H(X) = H( \dhalf , \dhalf ) +  \dhalf  \, H( \dhalf , \dhalf ) .
\eeq

 Generalizing, the observation we are making about the entropy
 of any probability distribution $\bp = \{ p_1, p_2, \ldots , p_I \}$
 is that 
\beq
	H(\bp) =
	H( p_1 , 1\!-\!p_1 )
	+ (1\!-\!p_1)
	H \! \left(
	\frac{p_2}{1\!-\!p_1} , 
	\frac{p_3}{1\!-\!p_1} , \ldots ,
	\frac{p_I}{1\!-\!p_1}
\right) .
\label{eq.entropydecompose}
\eeq

 When it's written as a formula, this property
 looks regrettably ugly; nevertheless it is a simple
 property and one that you should make use of.
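
 As a quick numerical check of \eqref{eq.entropydecompose} (the distribution
 is an arbitrary choice for illustration, and logarithms are to base 2), take
 $\bp = \{ \dhalf, \dthird, \frac{1}{6} \}$. Directly,
\beq
	H(\bp) = \dhalf \log 2 + \dthird \log 3 + \frac{1}{6} \log 6
	 \simeq 0.500 + 0.528 + 0.431 = 1.459\ubits ,
\eeq
 and via the decomposition,
\beq
	H( \dhalf , \dhalf ) + \dhalf \, H \! \left( \frac{2}{3} , \frac{1}{3} \right)
	 \simeq 1 + \dhalf \times 0.918 = 1.459\ubits .
\eeq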

 Generalizing further, the entropy  has the property for any $m$
 that
\beqan
	H(\bp) &=&
	H\left[ ( p_1+p_2+\cdots+p_m ) ,  ( p_{m+1}+p_{m+2}+\cdots+p_I )  \right]
\nonumber
\\
&&+   ( p_1+
% p_2+
 \cdots+p_m )
	H\! \left(
	\frac{p_1}{ ( p_1+\cdots+p_m ) } , 
%	\frac{p_2}{ ( p_1+\cdots+p_m ) } ,
\ldots ,
	\frac{p_m}{ ( p_1+\cdots+p_m ) }
\right) 
\nonumber
\\
&&	+  ( p_{m+1}+
%p_{m+2}+
                    \cdots+p_I )
	H \! \left(
	\frac{p_{m+1}}{ ( p_{m+1}+\cdots+p_I ) } , 
%	\frac{p_{m+2}}{ ( p_{m+1}+\cdots+p_I ) } ,
 \ldots ,
	\frac{p_I}{ ( p_{m+1}+\cdots+p_I ) }
\right) .
\nonumber
\\
\label{eq.entdecompose2}
\eeqan
\exampl{example.entropy}{
 A source produces a character $x$
 from the alphabet $\A = \{ {\tt 0}, {\tt 1}, \ldots, {\tt 9}, {\tt a}, {\tt b}, \ldots, {\tt z} \}$;
 with probability $\dthird$, $x$ is a numeral (${\tt 0}, \ldots, {\tt 9}$);
 with probability $\dthird$, $x$ is a vowel (${\tt a}, {\tt e}, {\tt i}, {\tt o}, {\tt u}$);
 and with probability $\dthird$ it's one of the 21 consonants. All numerals are equiprobable,
 and the same goes for vowels and consonants.
 Estimate  the entropy of $X$.
}
\solution\ \ 
 $\log 3 + \frac{1}{3} ( \log 10  + \log 5 + \log 21 )= \log 3 +  \frac{1}{3}  \log 1050 \simeq \log 30\ubits$.\ENDsolution
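
 One way to read this answer is as the grouping idea of
 (\ref{eq.entdecompose2}) applied to three groups instead of two:
 first reveal which group $x$ belongs to, then reveal which member of the group it is.
 Since a uniform distribution over $n$ outcomes has entropy $\log n$,
\beq
	H(X) = H( \dthird , \dthird , \dthird ) + \dthird \log 10 + \dthird \log 5 + \dthird \log 21 .
\eeq
 Numerically, $\log 3 + \dthird \log 1050 \simeq 4.93\ubits$,
 while $\log 30 \simeq 4.91\ubits$.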
%> pr log(36)/log(2)
%5.16992500144231
%> pr log(30)/log(2)
%4.90689059560852
%> pr (log(3) +log(1050)/3.0 )/log(2)
%4.93035370490565
% This may be compared with the maximum entropy for an alphabet
% of 36 characters, $\log 36\ubits$. 

\section{Gibbs' inequality}
%  We will also find useful the following:
\begin{description}
% SPACE PROBLEM HERE ...
\item[The \ind{relative entropy} {\em or\/} \ind{Kullback--Leibler divergence}]
	\marginpar[t]{\small\raggedright{The `ei' in L{\bf{ei}}bler is pronounced\index{pronunciation}
 the same as in h{\bf{ei}}st.}}between two probability distributions $P(x)$ and $Q(x)$ 
	that are defined over the same alphabet $\A_X$ is\index{entropy!relative}\index{divergence}
\beq
	D_{\rm KL}(P||Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)} .
\label{eq.KL}
\label{eq.DKL}
\eeq
 The relative entropy satisfies {\dem\ind{Gibbs' inequality}}
\beq
	D_{\rm KL}(P||Q) \geq 0
\eeq
 with equality only if $P \normaleq Q$.  Note that
 the relative entropy is not symmetric under interchange of the
 distributions $P$ and $Q$:
 in general
 $D_{\rm KL}(P||Q) \neq D_{\rm KL}(Q||P)$, so $D_{\rm KL}$,
 although it is sometimes called the `\ind{KL distance}',
 is not strictly a
 distance\index{distance!$D_{\rm KL}$}.\index{distance!relative entropy}
% `distance\index{distance!$D_{\rm KL}$}'.
%  It is also known as the `discrimination' or `divergence',
 The \ind{relative entropy} is important in pattern recognition and neural networks, 
 as well as in information theory.
%
% could include that aston guy's stuff here on (pq)^1/2?
%
% see also ../notation.tex
%
\end{description}
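 As a small illustration of this asymmetry, take the two distributions
 $P = \{ \dhalf, \dhalf \}$ and $Q = \{ \frac{1}{4}, \frac{3}{4} \}$ over a binary alphabet
 (an arbitrary choice made here for the sake of example). Then
\beqan
	D_{\rm KL}(P||Q) &=& \dhalf \log \frac{1/2}{1/4} + \dhalf \log \frac{1/2}{3/4}
		= \dhalf \log \frac{4}{3} \simeq 0.21 \ubits , \\
	D_{\rm KL}(Q||P) &=& \frac{1}{4} \log \frac{1/4}{1/2} + \frac{3}{4} \log \frac{3/4}{1/2}
		= \frac{3}{4} \log \frac{3}{2} - \frac{1}{4} \simeq 0.19 \ubits .
\eeqan
 Both values are positive, as Gibbs' inequality requires, and they differ.
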
 Gibbs' inequality is probably the most important inequality in this book. 
 It, and many other inequalities, can be proved
 using the concept of convexity.
\section{Jensen's inequality for convex functions}
\begin{aside}
 The
 words `\ind{\convexsmile}'
 and `\ind{\concavefrown}' may be  pronounced `convex-smile'
 and `concave-frown'.
 This terminology has useful redundancy: while  one
  may forget which way up `convex' and `concave' are,
 it is harder to    confuse a smile with a frown.\index{notation!convex/concave}
\end{aside}
\begin{description}
%
\item[{\Convexsmile\ functions}\puncspace] A function $f(x)$ is {\dem \ind{\convexsmile}\/}
 over $(a,b)$ if
\amarginfig{c}{%
\footnotesize
\setlength{\unitlength}{0.75mm}
\begin{tabular}{c}
\begin{picture}(60,60)(0,0)
\put(0,0){\makebox(60,65){\psfig{figure=figs/convex.eps,angle=-90,width=45mm}}}
\put(10,8){\makebox(0,0){$x_1$}}
\put(48,8){\makebox(0,0){$x_2$}}
\put(17,2){\makebox(0,0)[l]{$x^* = \lambda x_1 + (1-\lambda)x_2$}}
\put(31,23){\makebox(0,0){$f(x^*)$}}
\put(35,39){\makebox(0,0){$\lambda f(x_1) + (1-\lambda)f(x_2)$}}
\end{picture}
\end{tabular}
\caption[a]{Definition of convexity.}
\label{fig.convex}
}\ 
 every chord of the function
 lies above the function,
  as shown in \figref{fig.convex}; that is,
 for all $x_1,x_2
 \in (a,b)$ and $0\leq \lambda \leq 1$, 
\beq
	f( \lambda x_1 + (1-\lambda)x_2 )  \:\:\leq \:\:\
		\lambda f(x_1) + (1-\lambda) f(x_2 ) .
\eeq
  A function $f$ is {\dem strictly
 \convexsmile\/} if, for all $x_1,x_2 \in (a,b)$, the equality holds only
 for $\lambda \normaleq 0$ and $\lambda\normaleq 1$.

 Similar definitions apply to \concavefrown\ and strictly \concavefrown\
 functions.
\end{description}
\newcommand{\tinyfunction}[2]{
\begin{tabular}{@{}c@{}}
{\small{#1}}
\\[-0.25in]
\psfig{figure=figs/#2.ps,width=1.06in,angle=-90}
\\
\end{tabular}
}
 Some strictly \convexsmile\ functions are
\bit
\item $x^2$, $e^x$ and $e^{-x}$ for all $x$; 
\item  $\log (1/x)$ and $x \log x$ for $x>0$.
\eit
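 One way to check these claims, for twice-differentiable functions, is to
 confirm that the second derivative is positive throughout the stated range.
 Taking natural logarithms for convenience (another base changes the last two
 results only by a positive constant factor),
\beqan
	\frac{\d^2}{\d x^2} \, x^2 &=& 2 , \\
	\frac{\d^2}{\d x^2} \, e^{\pm x} &=& e^{\pm x} , \\
	\frac{\d^2}{\d x^2} \ln \frac{1}{x} &=& \frac{1}{x^2} , \\
	\frac{\d^2}{\d x^2} \, x \ln x &=& \frac{1}{x} ,
\eeqan
 all of which are positive on the ranges given.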
\begin{figure}[htbp]
\figuremargin{%
\begin{center}
\raisebox{0.4in}{%
\begin{tabular}[c]{c@{}c@{}c@{}c}
\tinyfunction{$x^2$}{convex_xx} &
\tinyfunction{$e^{-x}$}{convex_exp-x} &
\tinyfunction{$\log \frac{1}{x}$}{convex_logix} &
\tinyfunction{$x \log x$}{convex_xlogx} \\[0.2in]
%\tinyfunction{$x^2$}{convex_xx} &
%\tinyfunction{$e^{-x}$}{convex_exp-x} \\[0.42in]
%\tinyfunction{$\log \frac{1}{x}$}{convex_logix} &
%\tinyfunction{$x \log x$}{convex_xlogx} \\[0.2in]
\end{tabular}
}
\end{center}
}{%
\caption[a]{\Convexsmile\ functions.}
\label{fig.convexf}
}%
\end{figure}
\begin{description}
\item[Jensen's inequality\puncspace]  If $f$ is a \convexsmile\ function
 and $x$ is a random variable then:
\beq
	\Exp\left[ f(x) \right] \geq f\!\left( \Exp[x] \right) ,
\label{eq.jensen}
\eeq
 where $\Exp$ denotes \ind{expectation}. If $f$ is strictly \convexsmile\ and 
 $\Exp\left[ f(x) \right] \normaleq  f\!\left( \Exp[x] \right)$, then the random 
 variable $x$ is a constant.
% (with probability 1).
% |!!!!!!!!!!!!!!!!! removed pedantry

 \ind{Jensen's inequality} can also be rewritten for a
 \concavefrown\ function, with the direction of the inequality
 reversed.
\end{description}

 A physical version of Jensen's \ind{inequality} runs as follows.
\amarginfignocaption{b}{\mbox{\psfig{figure=figs/jensenmass.ps,width=1.75in,angle=-90}}}
\begin{quote}
 If a collection of 
 masses $p_i$ are placed on a
 \convexsmile\ curve $f(x)$
 at locations $(x_i, f(x_i))$, then the
 \ind{centre of gravity} of those masses, which  is at $\left( \Exp[x],
 \Exp\left[ f(x) \right] \right)$, lies above the curve.
\end{quote}
 If this fails to convince you, then feel free to
 do the following exercise.
\exercissxC{2}{ex.jensenpf}{
 Prove \ind{Jensen's  inequality}.
}

\exampl{ex.jensen}{
 Three squares have average area $\bar{A} = 100\,{\rm m}^2$.
 The average of the lengths of their sides is $\bar{l} = 10\,{\rm m}$.
 What can be said about the size of the largest of the 
 three squares? [Use Jensen's inequality.]

}
\solution\ \ 
 Let $x$ be the length of the side of a square, and let the 
 probability of $x$ be $\dthird,\dthird,\dthird$ over the 
 three lengths $l_1,l_2,l_3$. Then the information that we have is
 that $\Exp\left[ x \right]=10$ and  $\Exp\left[ f(x) \right]=100$, 
 where $f(x) = x^2$ is the function mapping lengths to areas. 
 This is a strictly \convexsmile\ function. 
 We notice that the equality 
  $\Exp\left[ f(x) \right] \normaleq  f\!\left( \Exp[x] \right)$ holds, 
 therefore $x$ is a constant, and the three lengths 
 must all be equal. The area of the largest square is 100$\,{\rm m}^2$.\ENDsolution
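
 For contrast, suppose hypothetically that the three side lengths were
 $5$, $10$ and $15\,{\rm m}$, which also have mean $10\,{\rm m}$.
 Then the average area would be
\beq
	\Exp\left[ f(x) \right] = \dthird \left( 25 + 100 + 225 \right) = \frac{350}{3} \simeq 117\,{\rm m}^2 ,
\eeq
 which exceeds $f\!\left( \Exp[x] \right) = 100\,{\rm m}^2$, as Jensen's inequality
 requires when the lengths are not all equal.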


\subsection{Convexity and concavity also relate to maximization}
 If $f(\bx)$ is \concavefrown\ and there exists a point at which
\beq
	\frac{\partial f}{\partial x_k} = 0 \:\: \mbox{for all $k$},
% \forall k
\eeq 
 then $f(\bx)$ has its maximum value at  that point.

 The converse does not hold: if a  \concavefrown\  $f(\bx)$ is maximized at
 some $\bx$ it is not necessarily true that the gradient
 $\grad f(\bx)$ is equal
 to zero there. For example, $f(x) = -|x|$ is maximized at $x=0$
 where its derivative is undefined; and $f(p) = \log(p),$ for
 a probability
 $p \in (0,1)$, is maximized on the boundary of the range,
 at $p=1$, where the gradient $\d f(p)/\d p =1$.
%, since $f$ might for example 
% be an increasing function with no maximum  such as $\log x$, 
% or its maximum might be located at a point $\bx$
% on the boundary of the range of $\bx$. 
%
%{\em  (is this use of range correct?)}







% exercises from that. 
%
% exercises that belong between old chapters 1 and 2.
%
% see also _p5a.tex for moved exercises.
%
\section{Exercises}
\subsection*{Sums of random variables}
% sums of random variables.
% dice questions
\exercissxA{3}{ex.sumdice}{
\ben
\item
 Two ordinary dice  with faces labelled $1,\ldots,6$
 are thrown. What is the probability distribution of
 the sum\index{law of large numbers}
 of the values?  What is the probability distribution of the 
 absolute difference between the values? 


\item
 One\marginpar[c]{\small\raggedright{This exercise
 is intended to help you think about the \ind{central-limit theorem}, which says
 that if independent random variables $x_1, x_2, \ldots, x_N$
 have  means $\mu_n$ and finite variances $\sigma_n^2$, then, in the
 limit of large $N$, the sum $\sum_n x_n$ has a distribution that tends
 to a  normal  (\index{Gaussian distribution}Gaussian) distribution 
 with mean $\sum_n \mu_n$ and variance $\sum_n \sigma_n^2$.}}
 hundred ordinary dice are thrown. What,
 roughly, is the probability distribution of the sum of the values?
 Sketch the   probability distribution and estimate its mean and
 standard deviation.

\item
 How can two cubical dice be labelled using the numbers $\{0,1,2,3,4,5,6\}$
 so that when the two dice are thrown the sum has a uniform
 probability distribution over the integers 1--12?
%  Can you prove your solution is unique?

\item
 Is there any way that one hundred dice
 could be labelled with integers 
 such that the probability distribution of the sum is uniform?
\een
}
% answer, one normal, one 060606
% uniqueness proved by noting that every outcome 1-12 has
% to be made from 3 microoutcomes, and 12 can only be made
% from 6,6, so there must be a six on each die, indeed 3 on 1, and
% 1 on the other. 1 can only be made from 1,0, and don't want 0,0,
%  so there must be three 0s. (M Gardner)
%
\subsection*{Inference problems}
\exercissxA{2}{ex.logit}{
	If $q=1-p$ and $a = \ln \linefrac{p}{q}$, show that
\beq
	p = \frac{1}{1+\exp(-a)} .
\label{eq.sigmoid}
\label{eq.logistic}
\eeq
 Sketch this function and find its relationship to the hyperbolic tangent
 function $\tanh(u)=\frac{e^{u} - e^{-u}}{e^{u} + e^{-u}}$.

 It will be useful to be fluent in base-2 logarithms also.
 If $b = \log_2 \linefrac{p}{q}$, what is $b$ as a function of $p$?
}
%
% is this exercise inappropriate now because we have not defined
% joint ensembles yet?
%
\exercissxB{2}{ex.BTadditive}{
	Let $x$ and $y$ be dependent
% correlated
 random variables with
 $x$ a binary variable taking values in $\A_X = \{ 0,1 \}$.
	Use \Bayes\  theorem to show that the log posterior probability 
	ratio for $x$ given $y$ is
\beq
	\log \frac{P(x\eq 1 \given y)}{P(x\eq 0 \given y)} = \log \frac{P(y \given x\eq 1)}{P(y \given x\eq 0)}
		+ \log \frac{P(x\eq 1)}{P(x\eq 0)}  .
\eeq
}
% define ODDS ?
\exercissxB{2}{ex.d1d2}{
	Let $x$, $d_1$ and $d_2$ be random variables such that $d_1$
 and $d_2$  are conditionally independent given a binary variable $x$.
% (That is, $P(x,d_1,d_2)
%  = P(x)P(d_1 \given x)P(d_2 \given x)$.)
%
% somewhere I need to introduce graphical repns and define
%
% TO DO!!! TODO
%
% (\ind{conditional independence} is discussed further in section XXX.)
%
% and give examples. A and C children of B. and A->B->C
% Jensen defn is
% A is cond indep of B given C if
%  A|B,C = A|C
% which is symmetric, implying by BT
%  B|A,C = B|C
% pf
%  B|A,C = A|B,C B|C / A|C = B|C
% my defn here is 
%  A,B,C = C  A|C  B|C
% proof: 
%  A,B,C =  C  A|C  B|C,A =  .
% NB graphical model and decomposition are not 1-1 related. The two
% graphs  A and C children of B. and A->B->C  both have a joint prob
% that can be factorized  in either way. 
%
% $x$ is a binary variable taking values in $\A_X = \{ 0,1 \}$.
	Use \Bayes\  theorem to show that the  posterior probability 
	ratio for $x$ given $\{d_i \}$ is
\beq
	 \frac{P(x\eq 1 \given \{d_i \} )}{P(x\eq 0 \given  \{d_i \})} = 
	 \frac{P(d_1 \given x\eq 1)}{P(d_1 \given x\eq 0)}
		 \frac{P(d_2 \given x\eq 1)}{P(d_2 \given x\eq 0)}
		 \frac{P(x\eq 1)}{P(x\eq 0)}  .
\eeq
}

\subsection*{Life in high-dimensional spaces}
%{Life in $\R^N$}
\index{life in high dimensions}
\index{high dimensions, life in}
 Probability distributions and volumes have some unexpected 
 properties in high-dimensional spaces.

% The real line is denoted by $\R$. An $N$--dimensional real space 
% is denoted by $\R^N$.
\exercissxA{2}{ex.RN}{
 Consider a sphere of radius $r$ in an $N$-dimensional real space.
% dimensions.
 Show that the 
 fraction of the volume of the sphere that
 is 
 in the surface shell lying
 at values of the radius between $r- \epsilon$ and $r$, where $0 < \epsilon < r$, is:
\beq
 f = 1 - \left( 1 - \frac{\epsilon}{r} \right)^{\!N} .
\eeq
% from Bishop p.29
 Evaluate $f$ for the cases $N\eq 2$, $N\eq 10$ 
 and $N\eq 1000$,  with (a) $\epsilon/r \eq  0.01$; (b)  $\epsilon/r \eq  0.5$. 

 {\sf Implication:} points that are uniformly distributed in a sphere in $N$ 
 dimensions, where $N$ is large, are very likely to be in a \ind{thin shell} 
 near the surface.
% (From Bishop (1995).)
}
%
\label{sec.exercise.block1}
\subsection*{Expectations and entropies}
 You are probably familiar with the idea of computing the \ind{expectation}\index{notation!expectation} 
 of a function of $x$, 
\beq
	\Exp\left[ f(x) \right] =	\left< f(x) \right> = \sum_{x} P(x) f(x) .
\eeq
 Maybe you are not so comfortable with computing this expectation 
 in cases where the function  $f(x)$  depends on
 the probability $P(x)$. The next few 
 examples address this concern.

\exercissxA{1}{ex.expectn}{ 
 Let $p_a \eq  0.1$, $p_b \eq  0.2$, and $p_c \eq  0.7$. 
 Let $f(a) \eq  10$, $f(b) \eq  5$, and $f(c) \eq  10/7$. 
 What is $\Exp\left[ f(x) \right]$?
 What is $\Exp\left[ 1/P(x) \right]$?
}
\exercissxA{2}{ex.invP}{
 For an arbitrary ensemble, what is $\Exp\left[ 1/P(x) \right]$? 
}
\exercissxB{1}{ex.expectng}{
 Let $p_a \eq  0.1$, $p_b \eq  0.2$, and $p_c \eq  0.7$. 
 Let $g(a) \eq  0$, $g(b) \eq  1$, and $g(c) \eq  0$. 
 What is $\Exp\left[ g(x) \right]$?
}
\exercissxB{1}{ex.expectng2}{
 Let $p_a \eq  0.1$, $p_b \eq  0.2$, and $p_c \eq  0.7$. 
 What is the probability that $P(x) \in [0.15,0.5]$?
 What is 
\[
 P\left( \left| \log \frac{P(x)}{ 0.2} \right| > 0.05 \right) ?
\]
}
\exercissxA{3}{ex.Hineq}{
 Prove the assertion that 
	$H(X) \leq \log(|\A_X|)$ with equality iff $p_i \normaleq  1/|\A_X|$ for all $i$. 
 ($|\A_X|$ denotes the number of elements in the set $\A_X$.)
 [Hint: use Jensen's inequality (\ref{eq.jensen}); if  your
 first attempt to use Jensen does not succeed, remember that
 Jensen involves both a random variable and a function,
 and you have quite a lot of freedom in choosing
 these; think about whether
 your chosen function $f$ should be convex or concave.]
%  further hint: try $u\eq 1/p_i$ as the random variable.]
}
\exercissxB{3}{ex.rel.ent}{
 Prove that the relative entropy (\eqref{eq.KL}) 
 satisfies $D_{\rm KL}(P||Q) \geq 0$ (\ind{Gibbs' inequality})
 with equality only if $P \normaleq Q$.

% You may find this result
% helps with the previous two exercises. Note  (moved to _p5a.tex)
%
%  refer to this in mean field theory chapter {ch.mft}
%
}
%
% Decomposability of the entropy
\exercisaxB{2}{ex.entropydecompose}{
 Prove that the entropy is
 indeed decomposable as described in 
 \eqsref{eq.entropydecompose}{eq.entdecompose2}.
}
\exercissxB{2}{ex.decomposeexample}{
 A random variable $x \in \{0,1,2,3\}$ is selected
 by flipping a bent coin with bias $f$ to determine whether
 the outcome is in $\{0,1\}$ or $\{ 2,3\}$;
\amarginfignocaption{t}{%
\begin{center}\small%footnotesize
\setlength{\unitlength}{0.6mm}
\begin{picture}(30,50)(-10,-15)
\put(-6,25){{\makebox(0,0)[r]{$f$}}}
\put(-6,5){{\makebox(0,0)[r]{$1\!-\!f$}}}
\put(-10,15){\vector(1,1){17}}
\put(-10,15){\vector(1,-1){17}}
\put(10,35){\vector(1,1){10}}
\put(10,35){\vector(1,-1){10}}
\put(16,45){{\makebox(0,0)[r]{$g$}}}
\put(16,25){{\makebox(0,0)[r]{$1\!-\!g$}}}
\put(16,5){{\makebox(0,0)[r]{$h$}}}
\put(16,-15){{\makebox(0,0)[r]{$1\!-\!h$}}}
\put(10,-5){\vector(1,1){10}}
\put(10,-5){\vector(1,-1){10}}
\put(24,45){{\makebox(0,0)[l]{\tt 0}}}
\put(24,25){{\makebox(0,0)[l]{\tt 1}}}
\put(24,5){{\makebox(0,0)[l]{\tt 2}}}
\put(24,-15){{\makebox(0,0)[l]{\tt 3}}}
\end{picture}
\end{center}
}
 then either  flipping a second bent coin with bias $g$
 or a third bent coin with bias $h$ respectively.
 Write down the probability distribution of $x$.
 Use   the
 decomposability of the entropy (\ref{eq.entdecompose2})
 to find the entropy of $X$. [Notice how compact
 an expression is obtained if you make use of the binary entropy
 function $H_2(x)$, compared with writing out the four-term
 entropy explicitly.]
 Find the derivative of $H(X)$ with respect to $f$. [Hint: $\d H_2(x)/\d x = \log((1-x)/x)$.]
}
\exercissxB{2}{ex.waithead0}{
 An unbiased coin is flipped until one head is thrown. What is the 
 entropy of the random variable $x \in \{1,2,3,\ldots\}$, the number of
 flips?
 Repeat the calculation for the case of a biased coin with probability $f$
 of coming up heads.
  [Hint: solve the problem both directly  and by using  the
 decomposability of the entropy (\ref{eq.entropydecompose}).]
%
}
%
% removed joint entropy questions.
\section{Further exercises}
%
\subsection*{Forward probability}%  problems}
\exercisaxB{1}{ex.balls}{
 An urn contains $w$ white balls and $b$ black balls.
 Two balls are drawn, one after the other, without replacement.
 Prove that the probability that the first ball
 is white is equal to the probability that the second is white.
}
%
\exercisaxB{2}{ex.buffon}{
 A circular \ind{coin} of diameter $a$ is thrown onto a \ind{square} grid
 whose squares are $b \times b$. ($aB$ given that $F>A$?)
}
\exercisaxB{2}{ex.liars}{
 The inhabitants of an island tell the
 truth one third of the time. They lie with  probability  2/3.

 On an occasion, after one of them made a statement,
 you ask another `was that statement true?'
 and he says `yes'.

 What is the probability that the statement was indeed true?
% [Ans: 1/5].
}

%
\exercissxB{2}{ex.R3error}{
 Compare two ways of computing the probability of error of
 the repetition code $\Rthree$, assuming a binary
 symmetric channel (you
 did this once for \exerciseref{ex.R3ep}) and confirm that they
 give the same answer.
\begin{description}
\item[Binomial distribution method\puncspace]
	Add  the probability that all three bits are
 flipped to the probability that exactly two bits are flipped.
%	Add  the probability of all three bits'
% being flipped to the probability of exactly two bits' being flipped.
\item[Sum rule method\puncspace]
% Using the different possible inferences]
 Using the \ind{sum rule},
 compute  the marginal probability that $\br$ takes on each of
 the eight possible values, $P(\br)$.
 [$P(\br) = \sum_s P(s)P(\br \given s)$.]
  Then compute
 the posterior probability of $s$ for each of the  eight
 values of $\br$. [In fact, by symmetry, only two example
 cases
 $\br = ({\tt0}{\tt0}{\tt0})$ and 
 $\br = ({\tt0}{\tt0}{\tt1})$ need  be considered.]
\marginpar{\small\raggedright{\Eqref{eq.bayestheorem} gives the posterior probability of
 the input $s$, given the received vector $\br$.
}}
% $\br = ({\tt1},{\tt1},{\tt0})$, 
% $\br = ({\tt1},{\tt1},{\tt1})$,
 Notice that some of the
 inferred bits are better determined than others.
 From the posterior probability $P(s \given \br)$ you can read out
 the case-by-case error probability,
 the probability that the more probable hypothesis
 is not correct, $P(\mbox{error} \given \br)$.
 Find the average error probability using the sum rule,
\beq
	P(\mbox{error}) = \sum_{\br} P(\br) P(\mbox{error} \given \br) .
\eeq
\end{description}
}

%


\exercissxB{3C}{ex.Hwords}{
	The frequency
% probability
 $p_n$ of the
 $n$th most frequent word in English is roughly approximated
 by
\beq
 p_n \simeq \left\{
\begin{array}{ll}
\frac{0.1}{n} & \mbox{for $n \in 1, \ldots, 12\,367$}
% 8727$.}
\\
0 & n > 12\,367 .
\end{array}
\right.
\eeq
 [This remarkable $1/n$ law is known as \ind{Zipf's law},
 and applies to the word frequencies of many languages
% cite Shannon collection p.197 - except he has the number 8727, wrong!
% could also cite Gell-Mann
 \cite{zipf}.]
 If we assume that English is generated by picking
 words at random according to this distribution,
 what is the entropy of English (per word)?
 [This calculation  can be found in `Prediction and entropy of printed English', C.E.\ Shannon,
 {\em Bell Syst.\ Tech.\ J.}\ {\bf 30}, p\pdot50--64 (1951), but, inexplicably,
 the great man made numerical errors in it.] 
% , in bits per word?
}


%%% Local Variables: 
%%% TeX-master: ../book.tex
%%% End:

% \input{tex/_e1A.tex}%%%%%%%%%%%%%%%%%%%%% inference probs to do with logit and dice and decay moved into _p8.tex
\dvips
% include urn.tex here for another forward probability exercise.
%
\section{Solutions}% to Chapter \protect\ref{ch.prob.ent}'s exercises} 
\fakesection{_s1aa solutions}
%=================================
\soln{ex.independence.bigram}{
 No, they are not independent. If they were then all the
 conditional distributions $P(y \given x)$ would be identical
 functions of $y$, regardless of $x$ (\cf\ \figref{fig.conbigrams}).
}
\soln{ex.fp.toss}{ 
 We define  the fraction $f_B \equiv B/K$.
\ben 
\item
 The number of black balls
 has a binomial distribution.
\beq P(n_B\,|\,f_B,N) = {N \choose n_B} f_B^{n_B} (1-f_B)^{N-n_B} . \eeq
\item
 The mean and variance of this distribution are: 
\beq \Exp [ n_B ] = N f_B \eeq
\beq \var[n_B] = N f_B (1-f_B) .
\label{eq.variance.binomial}
\eeq
 These results were derived in \exampleref{ex.binomial}.
 The standard deviation of $n_B$ is $\sqrt{\var[n_B]} = \sqrt{N f_B (1-f_B)}$.
% on page \pageref{sec.first.binomial.sol}.

 When $B/K = 1/5$ and $N=5$, 
 the expectation and variance of   $n_B$ are
 1 and 4/5. The standard deviation is 0.89.

 When $B/K = 1/5$ and $N=400$, 
 the expectation and variance of   $n_B$ are
 80 and 64. The standard deviation is 8.
\een
}
\soln{ex.fp.chi}{
 The numerator of the  quantity
\[%beq
 z = \frac{(n_B - f_B N)^2}{ {N f_B (1-f_B)} } 
%\label{eq.chisquared}
\]%eeq
 can be recognized as\index{chi-squared}\index{$\chi^2$}
 $\left( n_B - \Exp [ n_B ] \right)^2$;
 the denominator is equal to
 the variance of $n_B$ (\ref{eq.variance.binomial}),
 which is by definition the expectation of the numerator.
 So the expectation of $z$ is 1. [A random variable like $z$,
 which measures the deviation of data from the
 expected
% average
 value, is sometimes called $\chi^2$ (chi-squared).]

 In the case $N=5$ and $f_B = 1/5$, $N f_B$ is 1, and
 $\var[n_B]$ is 4/5. The numerator has five possible values, only
 one of which is smaller than 1:
 $(n_B - f_B N)^2 = 0$ has probability $P(n_B \eq  1)= 0.4096$;
% $(n_B - f_B N)^2 = 1$ has probability $P(n_B = 0)+P(n_B = 2)= $ ;
% $(n_B - f_B N)^2 = 4$ has probability $P(n_B = 3)= $ ;
% $(n_B - f_B N)^2 = 9$ has probability $P(n_B = 4)= $ ;
% $(n_B - f_B N)^2 = 16$ has probability $P(n_B = 5)= $ ;
 so the probability that $z < 1$ is 0.4096.
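
 For completeness, the binomial probabilities in this case are
\beq
\begin{array}{c|cccccc}
 n_B & 0 & 1 & 2 & 3 & 4 & 5 \\ \hline
 P(n_B) & 0.32768 & 0.4096 & 0.2048 & 0.0512 & 0.0064 & 0.00032 \\
 (n_B - f_B N)^2 & 1 & 0 & 1 & 4 & 9 & 16
\end{array}
\eeq
 from which one can confirm directly that the expectation of the numerator is
 $0.32768 + 0.2048 + 4 \times 0.0512 + 9 \times 0.0064 + 16 \times 0.00032 = 0.8$,
 equal to $\var[n_B]$, so that the expectation of $z$ is indeed 1.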
% 
}
%
% stole solution from here
%
%%%%%%%%%%%%%%%%%%%%%%%%%% added 99 9 14
\soln{ex.jensenpf}{
 We wish to prove, given the property 
\beq
	f( \lambda x_1 + (1-\lambda)x_2 ) \:\: \leq  \:\:
		\lambda f(x_1) + (1-\lambda) f(x_2 ) ,
\label{eq.convexdefn}
\eeq
 that, if $\sum p_i = 1$ and $p_i \geq 0$, 
\beq%
%	\Exp\left[ f(x) \right] \geq f\left( \Exp[x] \right) ,
	\sum_{i=1}^I p_i  f(x_i)  \geq f\left( \sum_{i=1}^I p_i x_i  \right) .
\eeq
 We proceed by recursion, working from the right-hand side. (This proof
 does not
% needs further work to
 handle
% awkward
 cases where some $p_i=0$; such
 details are left to the pedantic reader.) At the first line we
 use the definition of convexity (\ref{eq.convexdefn}) with
 $\lambda = \frac{p_1}{\sum_{i=1}^I p_i } = p_1$; at the second line,
 $\lambda =  \frac{p_2}{\sum_{i=2}^I p_i }$.
% , and so forth.
\fakesection{temporary solution}
\begin{eqnarray}
\lefteqn{  f\left( \sum_{i=1}^I p_i x_i  \right) = 
% &=&
 f\left( p_1 x_1  +  \sum_{i=2}^I p_i x_i
 \right) } \nonumber
\\
&\leq&
 p_1 f(x_1) +  \left[ \sum_{i=2}^I p_i \right]
		\left[  f\left(  \sum_{i=2}^I p_i x_i
	\left/ \sum_{i=2}^I p_i \right. \right) \right]
 \\
&\leq&
 p_1 f(x_1) +  \left[ \sum_{i=2}^I p_i \right]
	     \left[
             \frac{p_2}
	{\sum_{i=2}^I p_i }             f\left( x_2 \right)
		+  \frac{\sum_{i=3}^I p_i}
                       {\sum_{i=2}^I p_i }
		 f\left( \sum_{i=3}^I p_i x_i
 \left/ \sum_{i=3}^I p_i \right. \right)
            \right] ,
\nonumber
% probably cut this last line, just show one itn of recursion
%
\end{eqnarray}
  and so forth. %
% this works if I want to restore it. Indeed I have restored it
 \hfill $\epfsymbol$%  $\Box$%\epfs% end proof symbol








}
%%%%%%%%%%%%%%%%%%%%
% main post-chapter exercise solution area:
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\soln{ex.sumdice}{
\ben \item For the outcomes $\{2,3,4,5,6,7,8,9,10,11,12\}$, 
 the probabilities are $\P = \{ 
\frac{1}{36},
\frac{2}{36},
\frac{3}{36},
\frac{4}{36},
\frac{5}{36},
\frac{6}{36},
\frac{5}{36},
\frac{4}{36},
\frac{3}{36},
\frac{2}{36},
\frac{1}{36}\}%
$.
\item The value of one die has mean $3.5$ and variance $35/12$. 
 So the sum of one hundred has mean $350$ and variance $3500/12 \simeq 292$,
 and by the \ind{central-limit theorem} the probability distribution 
 is roughly Gaussian (but confined to the integers), with 
 this mean and variance.
\item
	In order to obtain a sum that has a uniform distribution 
 we have to start from random variables some of which
 have a spiky distribution 
 with the probability mass concentrated at the extremes. 
 The unique solution is to have one ordinary die and one with faces 6, 6, 6, 0, 0, 0.
% That this solution is unique can be proved with an argument 
% that starts by noting 
% that each of the 12 outcomes has to be realized
% by 3 distinct microstates (a microstate
% being one of the 36 particular orientations
% of the two dice).  To create outcome `12'
% in three ways there must be one six on 
% one dice and three sixes on the other; 
% similarly to create outcome `1' three ways, there 
% must be one die with three zeroes on it
% and one with one one.
\item
 Yes, a uniform  distribution can be created in several ways,\marginpar[t]{\small\raggedright{To think about:
  does this uniform distribution contradict the \ind{central-limit theorem}?}}
 for example by labelling the $r$th die with
 the numbers $\{0,1,2,3,4,5\}\times 6^r$.
\een  
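
 Part (a) of the exercise also asks about the absolute difference between the two values.
 Counting the 36 equiprobable outcomes, the absolute differences $\{0,1,2,3,4,5\}$
 have probabilities
\beq
	\left\{ \frac{6}{36}, \frac{10}{36}, \frac{8}{36}, \frac{6}{36}, \frac{4}{36}, \frac{2}{36} \right\} .
\eeq
 As a check on part (c), let $u \in \{1,\ldots,6\}$ be the value of the ordinary die and
 $v \in \{0,6\}$ the value of the other die, with $P(v\eq 0) = P(v\eq 6) = \dhalf$.
 Every sum $s = u+v$ in $1,\ldots,12$ arises from exactly one pair $(u,v)$, so
\beq
	P(s) = \frac{1}{6} \times \frac{1}{2} = \frac{1}{12} .
\eeq
 The same counting argument applies to part (d): the sum is a number whose
 base-6 digits are the values shown by the individual dice, so each attainable
 sum arises from exactly one outcome.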
}

% \subsection*{Inference problems}
%
\soln{ex.logit}{
\beqan
	a = \ln \frac{p}{q}
\hspace{0.2in} & \Rightarrow & \hspace{0.2in} \frac{p}{q}  =  e^a
\label{logit.step1}
\eeqan
 and $q=1-p$ gives
\beqan
	\frac{p}{1-p} & =&   e^a 
\\ \Rightarrow \hspace{0.52in} p & = & \frac{e^a}{e^a+1} = \frac{1}{1+\exp(-a)} .
\label{logit.step2}
\eeqan
 The hyperbolic tangent is
\beq
	\tanh(a) = \frac{e^a -e^{-a}}{e^a + e^{-a}}
\eeq
 so 
\beqan
	f(a)& \equiv& \frac{1}{1+\exp(-a)} =
\frac{1}{2}	\left( \frac{1-e^{-a}}{1+e^{-a}} + 1 \right) \nonumber \\
	&=&  \frac{1}{2}\left(  \frac{ e^{a/2} - e^{-a/2} }{
			e^{a/2} + e^{-a/2}} +1 \right)
	= \frac{1}{2} ( \tanh(a/2) + 1 ) .
\eeqan

 In the case  $b = \log_2 \linefrac{p}{q}$, we can repeat
 steps (\ref{logit.step1}--\ref{logit.step2}), replacing $e$ by $2$, to
 obtain
\beq
	p = \frac{1}{1+2^{-b}} .
\label{eq.sigmoid2}
\label{eq.logistic2}
\eeq
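 The natural-log and base-2 log ratios are related by
\beq
	\log_2 \frac{p}{q} = \frac{1}{\ln 2} \, \ln \frac{p}{q} ,
\eeq
 since $\log_2 u = \ln u / \ln 2$.
 For example, $p = 0.8$ gives $\linefrac{p}{q} = 4$, so the natural-log ratio is
 $\ln 4 \simeq 1.39$ and the base-2 ratio is $2$; and indeed $1/(1+2^{-2}) = 0.8$.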
}	
\soln{ex.BTadditive}{
\beqan
 P(x \given y) &=& \frac{P(y \given x)P(x) }{P(y)}
\\%\eeq\beq
\Rightarrow\:\:
 \frac{P(x\eq 1 \given y)}{P(x\eq 0 \given y)} &=&  \frac{P(y \given x\eq 1)}{P(y \given x\eq 0)}
		 \frac{P(x\eq 1)}{P(x\eq 0)}  
\\%\eeq\beq
\Rightarrow\:\:
\log \frac{P(x\eq 1 \given y)}{P(x\eq 0 \given y)} &=& \log \frac{P(y \given x\eq 1)}{P(y \given x\eq 0)}
		+ \log \frac{P(x\eq 1)}{P(x\eq 0)}  .
\eeqan
}
\soln{ex.d1d2}{
 The conditional independence of $d_1$ and $d_2$ given $x$
 means
\beq
	P(x,d_1,d_2)  = P(x)P(d_1 \given x)P(d_2 \given x) .
\eeq
 This gives a separation of the posterior probability ratio 
 into a series of factors, one for each data point, times 
 the prior probability ratio.
\beqan
 \frac{P(x\eq 1 \given \{d_i \} )}{P(x\eq 0 \given  \{d_i \})} &=& 
	 \frac{P(\{d_i\} \given x\eq 1)}{P(\{d_i\} \given x\eq 0)}
		 \frac{P(x\eq 1)}{P(x\eq 0)}  
\\ &=&
	 \frac{P(d_1 \given x\eq 1)}{P(d_1 \given x\eq 0)}
		 \frac{P(d_2 \given x\eq 1)}{P(d_2 \given x\eq 0)}
		 \frac{P(x\eq 1)}{P(x\eq 0)}  .
\eeqan
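 The second line uses conditional independence in the form
\beq
	P(d_1,d_2 \given x) = \frac{P(x,d_1,d_2)}{P(x)} = P(d_1 \given x) P(d_2 \given x) .
\eeq
 For example, with illustrative numbers that are not taken from the exercise,
 a prior ratio of 1 and likelihood ratios of 3 and 2 give a posterior
 ratio of $3 \times 2 \times 1 = 6$, that is, $P(x\eq 1 \given \{d_i \}) = 6/7$.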
}

%
%
\subsection*{Life in high-dimensional spaces}
\soln{ex.RN}{
 The \ind{volume} of a \ind{hypersphere} of radius $r$ in $N$ dimensions is
 in fact
\beq
	V(r,N) = \frac{\pi^{N/2}}{(N/2)!} r^{N} ,
\eeq
 but you don't need to know this.
 For this question all that we need is the $r$-dependence, 
 $V(r,N)  \propto r^{N} .$
 So	the fractional  volume in $(r-\epsilon,r)$ is
\beq
	\frac{	r^{N} - (r-\epsilon)^N }{ r^N} = 
		1 -\left( 1 -\frac{\epsilon}{r}\right)^N .
\eeq
 The  fractional volumes in the shells for the required cases are:
\begin{center}
\begin{tabular}[t]{cccc} \toprule
$N$ & 2 & 10 & 1000 \\ \midrule 
$\epsilon/r = 0.01$ & 0.02  & 0.096 & 0.99996 \\
$\epsilon/r = 0.5\phantom{0}$  & 0.75  & 0.999 & $1 - 2^{-1000}$ \\  \bottomrule
\end{tabular}\\
\end{center}
\noindent Notice that no matter how small $\epsilon$ is, for large enough $N$ 
 essentially all the probability mass is in the surface shell of thickness 
 $\epsilon$.
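
 As an example of how the entries are obtained, for $N \eq 1000$ and $\epsilon/r \eq 0.01$,
\beq
	\left( 1 - \frac{\epsilon}{r} \right)^{\!N} = (0.99)^{1000} = \exp( 1000 \ln 0.99 )
	\simeq \exp( -10.05 ) \simeq 4.3 \times 10^{-5} ,
\eeq
 so $f \simeq 0.99996$.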
}


%\soln{ex.weigh}{
% See chapter \chtwo.
%}
%
\soln{ex.expectn}{
 $p_a \eq  0.1$, $p_b \eq  0.2$, $p_c \eq  0.7$. 
 $f(a) \eq  10$, $f(b) \eq  5$, and $f(c) \eq  10/7$. 
\beq
	\Exp\left[ f(x) \right] = 0.1 \times 10 + 0.2 \times 5 + 0.7 \times 10/7 = 3.
\eeq
 For each $x$, $f(x) = 1/P(x)$, so 
\beq
 \Exp\left[ 1/P(x) \right] = \Exp\left[ f(x) \right] = 3.
\eeq
}
%
\soln{ex.invP}{
 For general $X$, 
\beq
	\Exp\left[ 1/P(x) \right] = \sum_{x\in \A_X} P(x)  1/P(x) = 
	\sum_{x\in \A_X} 1 = | \A_X | .
\eeq
}
%
\soln{ex.expectng}{
  $p_a \eq  0.1$, $p_b \eq  0.2$, $p_c \eq  0.7$. 
  $g(a) \eq  0$, $g(b) \eq  1$, and $g(c) \eq  0$. 
\beq
	\Exp\left[ g(x) \right]=p_b = 0.2.
\eeq
}
\soln{ex.expectng2}{
\beq
	P\left( P(x) \! \in \! [0.15,0.5] \right) = p_b = 0.2 .
\eeq
\beq
	 P\left( \left| \log \frac{P(x)}{ 0.2} \right| > 0.05 \right) 
		= p_a + p_c = 0.8 .
\eeq
}
%
\soln{ex.Hineq}{
 This type of question can be approached in two ways.
 One is to differentiate
 the function to be maximized, find the stationary point, and prove
 that it is a global maximum; this strategy is somewhat risky, since it is possible 
 for the maximum of a function to be at the boundary of the space,
 at a place where the derivative is not zero.
 Alternatively, a carefully chosen inequality 
 can establish the answer. The second method is much neater.

\begin{Prooflike}{Proof by differentiation (not the recommended method)}
 Since it is slightly easier to differentiate $\ln 1/p$ than $\log_2 1/p$,
 we temporarily define  $H(X)$ to be measured using natural logarithms, thus
 scaling it down by a factor of $\log_2 e$.
\beqan
	H(X) &=& \sum_i p_i \ln \frac{1}{p_i} \\
	\frac{\partial H(X)}{\partial p_i} &=&  \ln \frac{1}{p_i} - 1 
\eeqan 
 We maximize subject to the constraint $\sum_i p_i = 1$, which can be enforced
 with a Lagrange multiplier:
\beqan
	G(\bp) & \equiv & H(X) + \lambda \left( \sum_i p_i - 1 \right) \\
	\frac{\partial  G(\bp)}{\partial p_i}  &=&  \ln \frac{1}{p_i} - 1 + \lambda .
\eeqan
 At a maximum, 
\beqan
	\ln \frac{1}{p_i} - 1 + \lambda &=& 0 \\
\Rightarrow \ln \frac{1}{p_i} &=& 1 - \lambda ,
\eeqan
 so all the $p_i$ are equal. That this extremum is indeed a maximum
 is established by finding the curvature:
\beq
	\frac{\partial^2  G(\bp)}{\partial p_i \partial p_j}  = -\frac{1}{p_i}
	\delta_{ij} ,
\eeq
 which is negative definite. \hfill
\end{Prooflike}
\begin{Prooflike}{Proof using Jensen's inequality (recommended method)}
 First a reminder of the inequality.
\begin{quotation}
\noindent
 If $f$ is a \convexsmile\ function
 and $x$ is a random variable then:
\[%beq
	\Exp\left[ f(x) \right] \geq f\left( \Exp[x] \right) .
\]%eeq
 If $f$ is strictly  \convexsmile\ and 
 $\Exp\left[ f(x) \right] \eq  f\left( \Exp[x] \right)$, then the random 
 variable $x$ is a constant
 (with probability 1). 
\end{quotation}

 The secret of a proof using Jensen's inequality is to choose the 
 right function and the right random variable. 
 We could define 
% $f(u) = \log \frac{1}{u}$ and 
\beq
	f(u) = \log \frac{1}{u} = - \log u
\eeq
 (which is a convex function) and 
 think of $H(X) = \sum p_i \log \frac{1}{p_i}$ as the 
 mean of  $f(u)$ where  $u=P(x)$, but this 
 would not get us there -- it would give us an inequality in the 
 wrong direction. If instead we define 
\beq
	u = 1/P(x)
\eeq
 then we find:
% this introduces an extra minus sign:
\beq
	H(X) = - \Exp\left[ f( 1/P(x) ) \right]
	 \leq - f\left( \Exp[ 1/P(x) ] \right)  ;
\eeq
 now we know from   \exerciseref{ex.invP}\ that $\Exp[ 1/P(x) ] = |\A_X|$, so
\beq
	H(X)   \leq - f\left( |\A_X| \right) = \log  |\A_X| .
\eeq
 Equality  holds only if the random variable $u = 1/P(x)$ is a constant, 
 which means $P(x)$ is a constant for all $x$.  
\end{Prooflike}
}
%
\soln{ex.rel.ent}{
\beq
	D_{\rm KL}(P||Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)} .
% \label{eq.KL}
\eeq
\label{sec.gibbs.proof}% cross ref problem? Tue 12/12/00
 We prove \ind{Gibbs' inequality} using \ind{Jensen's inequality}. 
 Let $f(u) = \log 1/u$ and $u=\smallfrac{Q(x)}{P(x)}$. 
 Then 
\beqan
	D_{\rm KL}(P||Q) & =& \Exp[ f( Q(x)/P(x) ) ]
\\ &\geq&
 f\left(
	\sum_x P(x) \frac{Q(x)}{P(x)} \right)
	= \log \left( \frac{1}{\sum_x Q(x)} \right) = 0,
\eeqan
 with equality only if $u=\frac{Q(x)}{P(x)}$ is a constant, that is, 
 if $Q(x) = P(x)$.\hfill$\epfsymbol$\\

\begin{Prooflike}{Second solution}
 In the above proof the expectations were with respect to
 the probability distribution $P(x)$.  A second solution method
 uses Jensen's inequality with $Q(x)$ instead.
 We define $f(u) = u \log u$ and let $u = \frac{P(x)}{Q(x)}$.
 Then
\beqan
	D_{\rm KL}(P||Q)& =&
 \sum_x Q(x) \frac{P(x)}{Q(x)} \log
 	\frac{P(x)}{Q(x)} = \sum_x Q(x) f\left( \frac{P(x)}{Q(x)} \right) \\
	&\geq& f\left( \sum_x Q(x) \frac{P(x)}{Q(x)} \right) = f(1) = 0,
\eeqan
 with equality only if $u=\frac{P(x)}{Q(x)}$ is a constant, that is, 
 if $Q(x) = P(x)$.
\end{Prooflike}
}
%
% solns moved to _s5A.tex
%
\soln{ex.decomposeexample}{
\beq
H(X)= H_2(f) + f H_2(g) + (1-f) H_2(h) .
\eeq
}
%
\soln{ex.waithead0}{
 The probability that there are $x-1$ tails and then one head
 (so we get the first head on the $x$th
 toss) is
\beq
	P(x) = (1-f)^{x-1} f .
\eeq
 If the first toss is a tail, the probability distribution for
 the future looks just like it did before we made the first toss.
 Thus we have a recursive expression for the entropy:
\beq
	H(X) = H_2(  f ) + (1-f) H(X)  .
\eeq
 Rearranging,
\beq
	H(X) =  H_2(  f )  / f .
\eeq
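 As a quick check (an illustrative case, not part of the original exercise),
 take a fair coin, $f = 1/2$. Then $H_2(1/2) = 1$ bit and the formula gives
 $H(X) = 2$ bits. Summing directly, with $P(x) = 2^{-x}$,
\beq
	H(X) = \sum_{x=1}^{\infty} 2^{-x} \log_2 2^{x}
	     = \sum_{x=1}^{\infty} x \, 2^{-x} = 2 ,
\eeq
 in agreement with $H_2(f)/f$.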
}
%
%
\fakesection{waithead solution}
\soln{ex.waithead}{
 The probability of the number of tails $t$ is 
\beq
	P(t) = \left(\frac{1}{2}\right)^{\!t} \frac{1}{2} 
		\:\mbox{ for $t\geq 0$}.
\eeq
 The expected number of heads is 1, by definition of the problem.
 The expected number of tails is 
\beq
	\Exp[t] =
	\sum_{t=0}^{\infty} t \left(\frac{1}{2}\right)^{\!t} \frac{1}{2} ,
\eeq
 which may be shown to be 1 in a variety of ways. For example, since 
 the situation after one tail is thrown is equivalent to the opening 
 situation, we can write down the recurrence relation
\beq
	\Exp[t] = \frac{1}{2} ( 1 + \Exp[t] )  + \frac{1}{2} \cdot 0 \:\:
 \Rightarrow \:\: \Exp[t] = 1.
\eeq
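 The sum can also be evaluated directly; as a sketch, using the standard identity
 $\sum_{t=0}^{\infty} t z^t = z/(1-z)^2$ for $|z|<1$ with $z = 1/2$,
\beq
	\Exp[t] = \frac{1}{2} \sum_{t=0}^{\infty} t \left(\frac{1}{2}\right)^{\!t}
	= \frac{1}{2} \cdot \frac{1/2}{(1-1/2)^2} = \frac{1}{2} \cdot 2 = 1 .
\eeq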
% if we define $S=\Exp[t]$ then we can subtract $S/2$ from $S$ to obtain 
% a geometric series:
%\beq
%	(1-1/2)S = \sum_{t=0}^{\infty} \left(\frac{1}{2}\right)^{t+1}
%		= \frac{1/2}{1-1/2} = 1
%\eeq
% which gives $S=2$ --- what?
%%%%%%%%%%%%%%%%
%, for example, introducing 
% $Z(\beta) \equiv \sum_t \left(\frac{1}{2}\right)^{\beta t} \frac{1}{2}
% = \frac{1}{2}/\left(1 - (\linefrac{1}{2})^{\beta}\right)$:
%\beq
%	\sum_{t=0}^{\infty} t \left(\frac{1}{2}\right)^{t} \frac{1}{2}
%	= \frac{\d}{\d\beta} \log Z
%\eeq

 The probability distribution of the `estimator' $\hat{f} = 1/(1+t)$,
 given that $f=1/2$, is plotted 
 in \figref{fig.f.estimator}. The  probability of $\hat{f}$ is
 simply the probability of the corresponding
 value of $t$.
%
% gnuplot
% load 'figs/festimator.gnu'
%\begin{figure}
%\figuremargin{%
\marginfig{%
\begin{center}
\begin{tabular}{c}
$P(\hat{f})$\\[-0.3in]
\mbox{\psfig{figure=figs/festimator.ps,angle=-90,width=2in}}\\
\hspace{1.82in}$\hat{f}$
\end{tabular}
\end{center}
%}{%
\caption[a]{The probability distribution of the estimator $\hat{f} = 1/(1+t)$, 
 given that $f=1/2$.}
% , so that  $P(t) = 1/2^{t+1}$.}
\label{fig.f.estimator}
%}
%\end{figure}
}
}
\soln{ex.waitbus}{
\ben
\item
	The mean number of rolls from one six to the next six is six
 (assuming
	we
% don't count the first of the two sixes).
 start counting rolls after   the first of the two sixes).
	The probability that the next six occurs on the $r$th
 roll is the probability of {\em not\/} getting a six
 for $r-1$ rolls multiplied by the probability of then
getting a six:
\beq
 P(r_1 \eq  r) = \left( \frac{5}{6} \right)^{\! r-1} \frac{1}{6}, \:\: \mbox{for $r\in \{1,2,3,\ldots \}$.}
\eeq
	This  probability distribution of the number of rolls, $r$,
 may be called 
	an \ind{exponential distribution}, since 
\beq
 P(r_1 \eq  r) = e^{-\alpha r} / Z, 
\eeq
 where $\alpha = \ln({6}/5)$, and $Z$ is a normalizing constant.
\item
  The mean number of rolls from the moment the clock strikes until the next six is six.
\item
 	The mean number of rolls, going back in time,
	until the most recent six is six.
\item
	The mean number of rolls from the six before
	the clock struck to the six after the clock struck
	is the sum of the answers to (b) and (c), less one,
% (assuming	we don't count the first of the two sixes),
 that is, eleven.
\item
	Rather than explaining the difference between (a)
% six and
 and  (d), let me give another hint.\index{bus-stop paradox}\index{waiting for a bus}
% see gnu/waitbus.gnu
 Imagine that the buses in Poissonville  arrive independently at random
 (a \ind{Poisson process}), with, on average, one bus every six minutes.
 Imagine that passengers turn up at {\busstop}s at a uniform rate,
% random also,
 and are scooped up by the bus without delay, so the
 interval between two buses remains constant.
 Buses that follow gaps bigger than six minutes
 become overcrowded. The passengers' representative complains that
 two-thirds of  all passengers found themselves on overcrowded buses.
 The bus operator claims, `no, no -- only one third
 of our buses are overcrowded'. Can both these claims be true? 
\een
\amarginfig{b}{%
\begin{center}
\mbox{\hspace{-0.3in}\psfig{figure=figs/waitbus.ps,angle=-90,width=2.05in}}\\[-0.2in]
\end{center}
\caption[a]{The probability distribution of the number
 of rolls $r_1$
 from one 6 to the next
  (falling solid line),
\[%\beq
	P(r_1 \eq  r) = \left( \frac{5}{6} \right)^{\! r-1} \frac{1}{6} ,
\]%\eeq
 and the probability distribution (dashed line)
 of
% the quantity $r_{\rm tot}=r_1+r_2-1$,
 the number of rolls from the 6 before 1pm to the next 6,
% where $r_1$ and $r_2$ are the numbers of rolls before
% and after the clock strikes,
 $r_{\rm tot}$, 
\[%\beq
	P(r_{\rm tot} \eq  r) = r \, \left( \frac{5}{6} \right)^{\! r-1}
		\left( \frac{1}{6} \right)^{\! 2 }
 .
\]%\eeq
 The probability $P(r_1>6)$ is about 1/3; the probability
 $P(r_{\rm tot} > 6 )$ is about 2/3. The mean of $r_1$ is 6, and the
 mean of $r_{\rm tot}$ is 11.
}
% other elegant ways of saying it:
% P( number rolls from one 6 to the next)
% P( number of rolls from the 6 before 1pm to the next)
}% end figure
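 The normalization and the mean of $r_{\rm tot}$ quoted in the caption
 can be checked by hand. The following is an added check, using the
 standard sums $\sum_{r=1}^{\infty} r z^{r-1} = 1/(1-z)^2$ and
 $\sum_{r=1}^{\infty} r^2 z^{r-1} = (1+z)/(1-z)^3$, valid for $|z|<1$.
 With $z = 5/6$,
\beqan
	\sum_{r=1}^{\infty} P(r_{\rm tot} \eq r) &=& \frac{1}{36} \cdot \frac{1}{(1/6)^2} = 1 , \\
	\Exp[ r_{\rm tot} ] &=& \frac{1}{36} \cdot \frac{11/6}{(1/6)^3} = \frac{396}{36} = 11 .
\eeqan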
}% end solbn






%
% \subsection{Move this solution}
%
% \subsection*{Conditional probability}
% \soln{ex.R3error}{
%
\fakesection{r3 error soln}
\soln{ex.R3error}{
\begin{description}
\item[Binomial distribution method\puncspace]
 From the solution to \exerciseonlyref{ex.R3ep}, 
 $p_B = 3 f^2 (1-f) + f^3$.\index{repetition code}
\item[Sum rule method\puncspace]
 The marginal probabilities of the eight values of $\br$ are\index{sum rule}
 illustrated by: 
\beq
 P(\br \eq {\tt0}{\tt0}{\tt0} ) = \dhalf (1-f)^3 + \dhalf f^3 ,
\eeq
\beq
 P(\br \eq {\tt0}{\tt0}{\tt1} ) = \dhalf f(1-f)^2 + \dhalf f^2(1-f)
 =  \dhalf f(1-f) .
\eeq
 The posterior probabilities are represented by 
\beq
 P( s\eq{\tt1}  \given  \br \eq {\tt0}{\tt0}{\tt0} )  = \frac{  f^3  }
		{   (1-f)^3 +  f^3 }
\eeq
 and
\beq
 P( s\eq{\tt1}  \given  \br \eq {\tt0}{\tt0}{\tt1} )
		= \frac{  (1-f)f^2  }
			{   f(1-f)^2 +  f^2(1-f) }
		= f .
\eeq
 The probabilities of error in these representative cases are thus
\beq
 P(\mbox{error} \given \br \eq  {\tt0}{\tt0}{\tt0} )  =  \frac{  f^3  }
		{   (1-f)^3 +  f^3 }
\eeq
 and 
\beq
 P(\mbox{error} \given \br \eq  {\tt0}{\tt0}{\tt1} )  =  f .
\eeq
 Notice that while the average probability of error of $\Rthree$ is
 about $3 f^2$, the probability (given $\br$)
 that any {\em{particular}\/} bit is
 wrong is either about $f^3$ or $f$.

 The average error probability, using the sum rule, is
\beqa
	P(\mbox{error}) &=& \sum_{\br} P(\br) P(\mbox{error} \given \br) \\
 &=& 2 [\dhalf (1-f)^3 + \dhalf f^3]  \frac{  f^3  }
		{   (1-f)^3 +  f^3 }
 + 6  [\dhalf f(1-f)] f .
\eeqa
\marginpar{\vspace{-0.8in}\par\small\raggedright{The first two terms are for the cases $\br = \tt000$ and $\tt111$;
 the remaining 6 are for the other outcomes, which share the
 same
 probability of occurring and identical  error probability, $f$.}}%
 So
\beqa
	P(\mbox{error}) 
 &=&   f^3  + 3   f^2(1-f) .
\eeqa
\end{description}
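 To put numbers on these expressions, consider the illustrative value
 $f = 0.1$ (an assumed value, chosen only for this check):
\beqan
	P(\mbox{error} \given \br \eq {\tt0}{\tt0}{\tt0} ) &=& \frac{0.001}{0.729 + 0.001} \simeq 0.0014 , \\
	P(\mbox{error} \given \br \eq {\tt0}{\tt0}{\tt1} ) &=& 0.1 , \\
	P(\mbox{error}) &=& 0.001 + 3 \times 0.01 \times 0.9 = 0.028 .
\eeqan
 The average, $0.028$, is close to $3f^2 = 0.03$, while the two conditional
 error probabilities are close to $f^3$ and equal to $f$ respectively.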
}



%
%
% see also _s1A.tex
\soln{ex.Hwords}{
The entropy is 9.7
% 11.8
 bits per word.
% , which is 2.6 bits per letter  WRONG - shannon (p197) is in error
}
%\soln{ex.Hwords}{
%
%                           z := 1.000004301
%
%sum( 0.1/n * log(1.0/(0.1/n))/log(2.0) , n=1..12367) ;
%                             9.716258456
% 9.716 bits.
%}


%\input{tex/_s1a.tex} nothing there any more
\fakesection{_s1A solutions}
%=================================
% quake
%
% \subsection*{Solutions to further inference problems}

%\soln{ex.exponential}{
% See chapter \chbayes.
%}
%\soln{ex.blood}{
% See chapter \chbayes.
%}
%

% The other exercises are discussed in the next chapter.


%%%%%%%%%%%%%%%%%%%%%%%%%%
\dvipsb{solutions 1a}
% now another inference chapter !
\prechapter{About   Chapter} 
\fakesection{About the first Bayes chapter}
 If you are eager to get on to
% with data compression, information content and entropy,
 information theory, data compression, and noisy channels,
 you can skip to  \chapterref{ch2}. 
 Data compression and data modelling are
 intimately connected, however, so you'll probably
 want to come back to this chapter
 by the time you get to  \chapterref{ch4}. 
%
% move this later
%
% The exercises in this chapter are not a prerequisite for
% chapters \ref{ch2}--\ref{ch7}.

\fakesection{prerequisites for chapter 8}
 Before reading \chapterref{ch.bayes},
 it might be good to look at the following exercises.
% you
% should have worked on
% finished 
% all the exercises in chapter \chone, in particular, 
% \exerciserefrange{ex.logit}{ex.exponential}.
%
%  \exthirtyone--\exthirtysix.
% uvw to HXY>0
\exercissxB{2}{ex.dieexponential}{
	A die is selected at random from two twenty-faced dice 
 on which the symbols 1--10 are written with nonuniform frequency
 as follows.
\begin{center}
\begin{tabular}{l@{\hspace{0.2in}}*{10}{l}} \toprule
Symbol & 1 & 2 & 3 & 4  & 5 & 6 & 7  & 8 & 9 & 10 \\  \midrule
Number of faces of die A & 
        6 & 4  & 3 & 2 & 1 &1 &1 &1 &1 & 0 \\
Number of faces of die B & 
        3 & 3  & 2 & 2 & 2 &2 &2 &2 &1 & 1 \\ \bottomrule
\end{tabular}
\end{center}
 The randomly chosen die is rolled 7 times, with the following
 outcomes:
\begin{center}
 5, 3, 9, 3, 8, 4, 7. %  Sat 21/12/02   tried cutting this \\
\end{center}
 What is the probability that the die is die A?
}
\exercissxB{2}{ex.dieexponentialb}{
 Assume that there is a third twenty-faced die, die C, on which the symbols 
 1--20 are written once each. 
 As above, one of the three dice is selected at random and rolled
 7 times, giving the outcomes:
% \begin{center}
 3, 5, 4, 8, 3, 9, 7. \\
% \end{center}
 What is the probability that the die is (a) die A, (b) die B, (c) die C?
}

% no normal solution pointer
\exercissxA{3}{ex.exponential}{ {\exercisetitlestyle Inferring a decay constant}\\ 
%\begin{quotation}
	Unstable particles are emitted from a source and decay at a
	distance $x$, a real number
	 that has an exponential probability distribution
	with characteristic length $\lambda$.  Decay events can only
	be observed if they occur in a window extending from $x=1\cm$
	to $x=20\cm$. $N$ decays are observed at locations $\{x_1 ,
	\ldots , x_N\}$. 
% ($x_n$ is a real number.)
	 What is $\lambda$?

%\end{quotation}
\begin{center}
\mbox{\psfig{figure=\FIGS/decay.ps,width=3in,angle=90,%
bbllx=154mm,bblly=147mm,bbury=257mm,bburx=175mm}}\\
\end{center}
}
% no normal solution pointer
% \subsection*{Genetic test evidence}
% \begin{quotation}
\exercissxB{3}{ex.blood}{ {\exercisetitlestyle Forensic evidence} \\
% Two people have left traces of their own blood at the scene of a
% crime.  Their blood groups can be reliably identified from these
% traces and are found 
% to be of type `O' (a common type in the local population, having
% frequency 60\%) and of type `AB' (a rare type, with frequency 1\%).
% A suspect is tested and found to have type `O' blood. 
% A careless lawyer might claim that the fact that the suspect's
% blood type was found at the scene is positive evidence for the theory
% that he was present. But do these data
% $D=$ \{type `O' and `AB' blood were found at scene\} make it more
% probable that this suspect was one of the two people present at the
% crime? 
 Two people have left traces of their own blood at the scene of a
 crime. 
 A suspect, Oliver, is tested and found to have type `O' blood.
 The blood groups of the two traces 
 are found
 to be of type `O' (a common type in the local population, having
 frequency 60\%) and of type `AB' (a rare type, with frequency 1\%).
  Do these data
 (type `O' and `AB' blood were found at scene) give evidence in favour 
 of the proposition  that Oliver was one of the two people present at the
 crime? 

}
% \end{quotation}


%%%%%%%%%% (many are repeated from _s1aa)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% \prechapter{About Chapter}
\mysetcounter{page}{54} 
\ENDprechapter
\chapter{More about  Inference}
\label{ch.bayes}\label{ch1b}
% contains the decay problem, the bent coin, and blood.
%
%
% solutions to exercises are in _s8.tex
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\fakesection{Inference intro}
 It is not a controversial statement that \Bayes\  theorem\index{Bayes' theorem}
 provides the correct language for describing the inference of a 
 message communicated over a
 noisy channel, as we used it in \chref{ch1} (\pref{sec.bayes.used}).
 But strangely, when it comes to other
 inference problems, the use of
% approaches based on
 \Bayes\  theorem
 is not so widespread.
%let's take a little tour of other applications of 
% probabilistic inference. 
 

% Coherent inference can always be mapped onto probabilities (Cox, 1946).
%% \cite{cox}.
%  Many
% textbooks on statistics do not mention this fact, so maybe it is worth
% using an example to emphasize the contrast between Bayesian inference
% and the orthodox methods of statistical inference.
%% involving
%% estimators, confidence intervals, hypothesis testing, etc.
% If this topic interests you, excellent further reading is
% to be found in the works of Jaynes, for example,
% \citeasnoun{Jaynes.intervals}.

\section{A first inference problem}
\label{sec.decay}\label{ex.exponential.sol}% special label by hand
 When I was an undergraduate in Cambridge, I was privileged to receive
 supervisions from Steve Gull. Sitting at his desk in a dishevelled
 office in St.\ John's College, I asked him how one ought to answer an
 old Tripos question (\exerciseonlyref{ex.exponential}):
\begin{quotation}
	Unstable particles are emitted from a source and decay at a
	distance $x$, a real number
	 that has an exponential probability distribution
	with characteristic length $\lambda$.  Decay events can only
	be observed if they occur in a window extending from $x=1\cm$
	to $x=20\cm$. $N$ decays are observed at locations $\{x_1 ,
	\ldots , x_N\}$. 
% ($x_n$ is a real number.)
	 What is $\lambda$?

\end{quotation}
\begin{center}
\mbox{\psfig{figure=\FIGS/decay.ps,width=3in,angle=90,%
bbllx=154mm,bblly=147mm,bbury=257mm,bburx=175mm}}\\
\end{center}
 I had scratched my head over this for some time.
 My education had provided me with a couple of  approaches to solving
 such inference problems: constructing `\ind{estimator}s'
 of the unknown parameters; or  `fitting' the model to
 the data, or to a processed version of the data.

 Since the mean of an unconstrained exponential distribution is $\l$,
 it seemed reasonable to examine the sample mean $\bar{x} = \sum_n x_n / N$
 and see
 if  an estimator $\hat{\l}$  could be obtained from it.
 It was evident that the {estimator}
 $\hat{\l}=\bar{x}-1$ would be appropriate for
 $\lambda \ll 20\,$cm, but not for cases where the
 truncation of the distribution at the right-hand side
 is significant; with a little ingenuity and the introduction of
 ad hoc bins, promising estimators for $\lambda \gg 20\,$cm could be
 constructed.  But there was no obvious estimator that would work
 under all conditions.

 Nor could I find a satisfactory
 approach based on fitting the density $P(x\given \lambda)$ to
 a histogram derived from the data.  I was stuck.
 
 What is the general solution to this problem and others like it?
 Is it always necessary, when confronted by a new inference problem,
 to grope in the dark for appropriate `estimators' and worry
 about finding the `best' estimator (whatever that means)?

%% I hope you have already stopped and thought about this question.
% problem. 
% \\ \mbox{~}\dotfill\ \mbox{~} \\
% \newpage

 Steve 
% Gull
 wrote down the probability of one data point, given $\l$: 
\beq
        P(x\given \lambda) =\left\{ \begin{array}{ll}
        {\textstyle \smallfrac{1}{\l}}  \,
        e^{-x/\lambda } / Z(\lambda) & 1 < x < 20 \\
 0                                      & {\rm otherwise }
        \end{array} \right.
\label{basic.likelihood}
\eeq
where 
\beq
        Z(\l) = \int_1^{20} \d x \: \smallfrac{1}{\l}  \,
 e^{-x/\lambda } = \left(e^{-1/\l} - e^{-20 /\l} \right).
\label{basic.likelihood.Z}
\eeq
 This seemed obvious enough. 
 Then he wrote {\dem{\ind{\Bayes\  theorem}}}: 
\beqan
\label{bayes.theorem}
% \begin{array}{l}
 P(\l\given \{x_1, \ldots, x_N\}) &=& 
        \frac{P(\{x\}\given \lambda) P(\l)}{P(\{x\}) } \\
%&& \hspace{0.5in}
 &\propto&    \frac{1}{\left( \l Z(\l) \right)^N}
                 \exp \left( \textstyle - \sum_1^N x_n / \l \right)  P(\l) 
 .
% \end{array}
\label{basic.posterior}
\eeqan
 Suddenly, the straightforward distribution $P(\{x_1 ,\ldots, x_N \}\given 
 \l)$, defining the probability of the data given the hypothesis $\l$,
 was being turned on its head so as to define the probability of a
 hypothesis given the data.  A simple figure showed the probability of
 a single data point $P(x\given \l)$ as a familiar function of $x$, for
 different values of $\l$ (figure \ref{decay.like.1}).  Each curve was
 an innocent exponential, normalized to have area 1.  Plotting the
 same function as a function of $\l$ for a fixed value of $x$,
 something remarkable happens: a peak emerges (figure
 \ref{decay.like.2}). To help understand these two points
 of view of the one function, \figref{decay.probandlike}
 shows a surface plot of   $P(x\given \l)$ as a function of $x$ and $\l$.

\begin{figure}
\figuremargin{%
\begin{center}
\mbox{\psfig{figure=\FIGS/decay.like.1.ps,%
width=2 in,angle=-90}\ \ \ \raisebox{-3mm}[0in][0in]{$x$}}
\end{center}
}{%
\caption{{The probability density $P(x\given \l)$ as a function of $x$.}}
\label{decay.like.1}
}%
\end{figure}
\begin{figure}
\figuremargin{%
\begin{center}
\mbox{\psfig{figure=\FIGS/decay.like.2.ps,%
width=2 in,angle=-90}\ \ \ \raisebox{-3mm}[0in][0in]{$\lambda$}}
\end{center}
}{%
\caption[a]{{The probability density $P(x\given \l)$ as a function of $\l$,
 for three different values of $x$.}
 \small
 When plotted this way round, the function is known as 
 the {\dem\ind{likelihood}\/} of $\l$.
 The marks indicate the three values of $\l$, $\l=2,5,10$,
 that were used in the preceding figure.
}
\label{decay.like.2}
}
\end{figure}
%\begin{figure}
%\figuremargin{%
\marginfig{
\begin{center}
\begin{tabular}{c}
\makebox[0pt][l]{\hspace*{0.21in}\raisebox{0.435in}{$x$}}%
\mbox{\psfig{figure=\FIGS/probandlike.ps,%
width=2in,angle=-90}%
\makebox[0pt][l]{\hspace*{-0.352in}\raisebox{0.435in}{$\l$}}}\\[-0.3in]% was -0.6 Sat 5/10/02
\end{tabular}\end{center}
%}{%
\caption[a]{{The probability density $P(x\given \l)$ as a function of  $x$
 and $\l$. Figures \ref{decay.like.1} and \ref{decay.like.2} are
 vertical sections through this surface.}
}
\label{decay.probandlike}
}
%\end{figure}
\begin{figure}
\figuremargin{%
\begin{center}
\mbox{\psfig{figure=\FIGS/decay.like.xxx.ps,%
width=2in,angle=-90}}
\end{center}
}{%
\caption[a]{{The likelihood function in the case of a six-point  dataset, 
  $P(\{x\} = \{1.5,2,3,4,5,12\}\given \lambda)$, as a function of   $\l$.}
}
\label{decay.like.xxx}
}
\end{figure}
 For a dataset consisting of several  points, \eg, the
 six points
 $\{x_n\}_{n=1}^{N} = \{1.5,2,3,4,5,12\}$,  the likelihood function
 $P(\{x\}\given \lambda)$ is the product of the $N$ functions of $\l$, 
 $P(x_n\given \l)$ (\figref{decay.like.xxx}).
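 Written out for this dataset (a worked instance of equations
 \ref{basic.likelihood} and \ref{basic.likelihood.Z}, added here as an
 illustration, with $N=6$ and $\sum_n x_n = 27.5$), the log likelihood is
\beq
	\ln P(\{x\}\given \l) = - 6 \ln \left[ \l \left( e^{-1/\l} - e^{-20/\l} \right) \right]
		- \frac{27.5}{\l} .
\eeq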
%

 Steve summarized \Bayes\  theorem
% (equation \ref{bayes.theorem})
 as
 embodying the fact that
\begin{conclusionbox}
 what you know about $\lambda$ 
 after the data arrive is what
 you knew before [$P(\lambda)$], and what the data told you 
 [$P(\{x\}\given \lambda)$].
\end{conclusionbox}
 Probabilities are used here to 
 quantify degrees of belief. 
% The probability 
% of $\lambda$ is a quantification of what you know about $\lambda$. 
 To nip possible confusion in the bud, it must be
 emphasized that the hypothesis $\lambda$ that correctly describes
 the situation is {\em not\/} a {\em stochastic\/} variable, and the fact that
 the Bayesian uses a probability\index{probability!Bayesian}
 distribution $P$ does {\em not\/} mean
 that he thinks of the world as stochastically changing its nature
 between the states described by the different hypotheses. He uses the
 notation of probabilities to represent his {\em beliefs\/} about the mutually
 exclusive micro-hypotheses (here, values of $\l$),
 of which only one is actually true.  That
 probabilities can denote degrees of belief, given assumptions, seemed
 reasonable to me.
% , and is proved  by Cox  (1946). 
% \citeasnoun{cox}.
% . Anyone who does not find it reasonable to use
% probabilities to quantify degrees of belief can read
% paper, where it is proved to be
% valid.


\label{sec.decayb}
 The posterior probability distribution
% of equation
 (\ref{basic.posterior}) represents 
 the unique and complete solution to the problem. 
 There is no need to invent\index{classical statistics!criticisms}
 `estimators'; nor do we need to invent 
 criteria for comparing alternative estimators with each other. 
 Whereas orthodox statisticians offer twenty ways of solving a
 problem, and another twenty different criteria for deciding which of
 these solutions is the best, Bayesian statistics offers only one
 answer to a well-posed problem.
% Added Mon 4/2/02 
\marginpar{\small\raggedright{If you have any difficulty understanding this chapter I recommend
 ensuring you are happy with 
 exercises \ref{ex.dieexponential} and \ref{ex.dieexponentialb} (\pref{ex.dieexponentialb})
 then noting their similarity to 
 \exerciseonlyref{ex.exponential}.}}


\subsection{Assumptions in inference}
 Our inference is conditional on our assumptions [for example, the
 prior $P(\lambda)$]. Critics view such priors as a difficulty because 
 they are  `subjective', but I
 don't see how it could be otherwise.  How can one perform inference
 without making assumptions? 
 I believe that it is of great value that Bayesian
 methods force one to make these tacit assumptions explicit.  

 First,
 once assumptions are made, the inferences are objective and unique,
 reproducible with complete agreement by anyone who has the same
 information and makes the same assumptions.  For example, given the
 assumptions listed above, $\H$, and the data $D$,
% from an experiment
% measuring decay lengths,
 everyone will agree about the posterior
 probability of the decay length $\l$:
\beq
P(\l\given D,\H) = \frac{ P(D\given \l,\H) P(\l\given \H) }{ P(D\given \H) } .
\eeq

 Second, when the assumptions are explicit, they are easier to
 criticize, and easier to modify -- indeed,
 we can quantify the sensitivity of our inferences to
 the details of the assumptions. For example,
 we can note from the likelihood curves 
 in figure \ref{decay.like.2} that in the case of a single data point at 
 $x=5$, the likelihood 
 function is less strongly peaked than in the case $x=3$;  the 
 details of the prior $P(\lambda)$ become  increasingly important as the sample 
 mean $\bar{x}$ gets closer to the middle of the window, 10.5. In the case 
 $x=12$, the likelihood function doesn't have a peak at all -- such data 
 merely rule out small values of $\lambda$, and don't give any information 
 about the relative probabilities of large values of $\lambda$. So 
 in this case, the details of the prior at the small--$\lambda$ end 
 of things are not important, but at  the large--$\lambda$ end, the prior 
 is important. 
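 The flat tail of the likelihood can be made explicit. The following expansion
 is an added sketch that uses only equations (\ref{basic.likelihood}) and
 (\ref{basic.likelihood.Z}). For $\l \gg 20\,$cm,
\beq
	Z(\l) = e^{-1/\l} - e^{-20/\l} \simeq \frac{19}{\l}
	\:\:\: \Rightarrow \:\:\:
	P(x\given \l) = \frac{e^{-x/\l}}{\l Z(\l)} \rightarrow \frac{1}{19} ,
\eeq
 for every $x$ in the window: the density becomes uniform, so the likelihood
 tends to the same constant whatever the data, and cannot distinguish
 between large values of $\l$.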
%  is whatever we knew before 
%   the experiment, \ie, our prior.

 Third, when we are not sure which of various alternative assumptions
 is the most appropriate for a problem, we can treat this question as
 another inference task.  Thus, given data $D$, we can\index{Bayes' theorem}
% learn from the data  
 compare alternative assumptions $\H$ using \Bayes\  theorem: 
\beq
P(\H\given D,\I) = \frac{ P(D\given \H,\I) P(\H\given \I) }{ P(D\given \I) } ,
\label{basic.ev}
\eeq
 where $\I$ denotes the highest assumptions, which we are not
 questioning.  

 Fourth, we can take into account our uncertainty regarding such
 assumptions when we make subsequent predictions. Rather than choosing
 one particular assumption $\H^{*}$, and working out our predictions
 about some quantity $\bt$, $P(\bt\given D,\H^{*},\I)$, we obtain
 predictions that take into account our uncertainty about $\H$ by
 using the sum rule:
\beq
P(\bt \given  D, \I) = \sum_{\H} P(\bt \given  D, \H , \I ) P(\H\given D,\I) .
\label{basic.marg}
\eeq
 This is another contrast with orthodox statistics, in which it is
 conventional to `test' a default model, and then, if the test\index{test!statistical}\index{statistical test}
 `accepts the model' at some `\ind{significance level}', to use exclusively that model  to make
 predictions.

 Steve thus persuaded me that
\begin{conclusionbox}
        probability theory reaches parts that ad hoc methods cannot reach.
\end{conclusionbox}
% However, that is a topic for another lecture. 

 Let's look at a few more examples of simple inference problems. 
\section{The bent coin}
\label{sec.bentcoin}
 A \ind{bent coin}\index{inference problems!bent coin}
 is tossed $F$ times; we observe a sequence $\bs$ of 
 heads and tails (which we'll denote by the symbols $\ta$ and $\tb$).
 We wish to know the bias of the coin, and predict 
 the probability that the next toss will result in a head. 
 We first encountered this task in \exampleref{exa.bentcoin},
 and we will encounter it again
 in \chref{ch.four}, when we discuss adaptive data compression. 
% the adaptive encoder for $a$s and $b$s. 
 It is also the original inference problem studied by
% Rev.\
 {Thomas Bayes}
 in his essay published in 1763.\index{Bayes, Rev.\ Thomas}
% cite{Bayes}

 As in
% \chref{ch.prob.ent}
 \exerciseref{ex.postpa}, we will
 assume 
% In chapter \chfour\ we assumed
 a uniform prior distribution and
 obtain a posterior distribution by multiplying by the likelihood. A
 critic might object, `where did this prior come from?'  I will not
 claim that the uniform prior is in any way fundamental; indeed
 we'll give examples of nonuniform priors later.  The prior is
% It is simply
 a subjective assumption. One of the themes of this book is:
%
% put this back somewhere?
%
% One way to justify the need for a prior is
% to assume, as in  chapter \chfour,
% that our task is simply to make a code to encode the
% outcome $\bs$ as efficiently as possible. We have to compress the
% data from the source somehow, and any choice of a compression scheme
% must correspond to a prior distribution over coin biases.  I see no
% way round this.  The choice of code implies an assumed probability
% distribution over outcomes.
%\begin{quotation}
\begin{conclusionbox}
\noindent
        you can't do  inference -- or  data compression -- without
 making assumptions.
%        You can't do data compression -- or inference -- without
% making assumptions.
\end{conclusionbox}
%\end{quotation}

%
% change notation? f_H?????????????????????????????????
% 
%\subsubsection*{Likelihood function}
 We give the name $\H_1$ to our assumptions. [We'll be introducing
 an alternative set of assumptions in a moment.]
 The probability, given $p_{\ta}$, that  $F$ tosses
 result in a sequence $\bs$
 that contains $\{F_{\ta},F_{\tb}\}$ counts of the two outcomes
%  $\{ a , b \}$
 is
\beq
        P( \bs \given  p_{\ta} , F,\H_1 ) =  p_{\ta}^{F_{\ta}} (1-p_{\ta})^{F_{\tb}} .
\label{eq.pa.likeb}
\eeq
 [{For example, $P(\bs\eq {\tt{aaba}} \given p_{\ta},F \eq 4,\H_1)
 = p_{\ta}p_{\ta}(1-p_{\ta})p_{\ta}.$}]
% This function of $p_{\ta}$ (\ref{eq.pa.likeb}) defines the likelihood function.
% Model 1
 Our first model assumes a uniform prior distribution for $p_{\ta}$,
\beq
        P(p_{\ta}\given\H_1) = 1 , \: \: \: \: \: \: p_{\ta} \in [0,1] 
\label{eq.pa.priorb}
\eeq
 and $p_{\tb} \equiv 1-p_{\ta}$.


\subsubsection{Inferring  unknown parameters}
 Given a string of length $F$ of which $F_{\ta}$ are $\ta$s and 
 $F_{\tb}$ are $\tb$s, we are interested in (a) inferring 
 what $p_{\ta}$ might be;  (b) predicting whether   the next character is an $\ta$ 
 or a $\tb$. [Predictions\index{prediction} are always expressed as probabilities.
 So `predicting whether the next character is an $\ta$'
 is the same as computing the probability that the next character is an $\ta$.]
 

 Assuming $\H_1$ to be true, the posterior probability of $p_{\ta}$, given a
 string $\bs$ of length $F$ that has 
 counts  $\{F_{\ta},F_{\tb}\}$, is, by \Bayes\  theorem,
\beqan
        P( p_{\ta} \given \bs ,F,\H_1) &=& 
        \frac{  P( \bs \given p_{\ta} , F,\H_1 ) P(p_{\ta}\given\H_1) }{ P(  \bs \given F,\H_1 )  } .
\label{eq.pa.post}
\label{eq.pa.post.again}
\eeqan 
 The factor $P( \bs \given p_{\ta} , F,\H_1 )$, which, as a function
 of $p_{\ta}$, is known as the likelihood function,
 was given in \eqref{eq.pa.likeb}; the prior
 $P(p_{\ta}\given\H_1)$  was given in \eqref{eq.pa.priorb}. 
 Our inference of $p_{\ta}$ is thus:
% The posterior 
\beqan
        P( p_{\ta} \given \bs ,F,\H_1) &=& 
        \frac{    p_{\ta}^{F_{\ta}} (1-p_{\ta})^{F_{\tb}}  }{ P(  \bs \given F,\H_1 )  } .
\label{eq.pa.postb.again}
\eeqan 
 The normalizing constant is given by the beta integral
\beq
        P(  \bs \given F,\H_1 )  = \int_0^1 \d p_{\ta} \: p_{\ta}^{F_{\ta}} (1-p_{\ta})^{F_{\tb}} = 
        \frac{\Gamma(F_{\ta}+1)\Gamma(F_{\tb}+1)}{ \Gamma(F_{\ta}+F_{\tb}+2) } 
        = \frac{ F_{\ta}! F_{\tb}! }{ (F_{\ta} + F_{\tb} + 1)! } .
\label{eq.evidenceZ}
\eeq
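 [As an illustrative check, take the string $\bs \eq {\tt{aaba}}$ considered above,
 for which $F_{\ta} \eq 3$ and $F_{\tb} \eq 1$. Then
\beq
        P( \bs \eq {\tt{aaba}} \given F \eq 4,\H_1 ) = \frac{ 3! \, 1! }{ 5! } = \frac{6}{120} = \frac{1}{20} ,
\eeq
 which agrees with direct integration,
 $\int_0^1 \d p_{\ta} \: p_{\ta}^{3} (1-p_{\ta}) = 1/4 - 1/5 = 1/20$.]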
% Our inference of $p_{\ta}$, assuming $\H_1$ to be true,
% is thus given by \eqref{eq.pa.postb.again}. 

%%%%%%%%%%%%%
\exercissxA{2}{ex.postpaII}{
 Sketch the posterior probability $P( p_{\ta} \given \bs\eq {\tt aba} ,F\eq 3)$.
 What is the most probable value of $p_{\ta}$ (\ie, the value that maximizes 
 the posterior probability density)? What is the mean value of $p_{\ta}$ 
 under this distribution?

 Answer the same questions for
 the posterior probability $P( p_{\ta} \given \bs\eq {\tt bbb} ,F\eq 3)$.
}
 
\subsubsection{From inferences to predictions}
 Our prediction about the next toss, the probability that the next toss is an $\ta$,
 is obtained by integrating over $p_{\ta}$. This has the effect of 
 taking into account our uncertainty about $p_{\ta}$ when making predictions.
 By the sum rule,
\beqan
        P(\ta \given  \bs ,F)& =& \int \d p_{\ta} \: P(\ta \given p_{\ta} ) P(p_{\ta} \given \bs,F )  .
\eeqan
 The probability of an $\ta$ given $p_{\ta}$ is simply $p_{\ta}$, 
 so
\beqan
\lefteqn{        P(\ta \given  \bs ,F)  
        = \int \d p_{\ta} \: p_{\ta} \frac{p_{\ta}^{F_{\ta}} (1-p_{\ta})^{F_{\tb}}}
        {P(  \bs \given F ) }  }
\\
&=& \int \d p_{\ta} \: \frac{p_{\ta}^{F_{\ta}+1} (1-p_{\ta})^{F_{\tb}}}
        {P(  \bs \given F ) } 
\\
&=& \left.
% \frac
 { \left[ \frac{ (F_{\ta}+1)! F_{\tb}! }{ (F_{\ta} + F_{\tb} + 2)! } \right] } \right/
 { \left[  \frac{ F_{\ta}! F_{\tb}! }{ (F_{\ta} + F_{\tb} + 1)! } \right] } 
\:\: = \:\: \frac{ F_{\ta}+1 }{ F_{\ta} + F_{\tb} + 2 } ,
\label{eq.laplacederived}
\eeqan
 which is known as {\dem{\ind{Laplace's rule}}}.
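 [For instance, after observing $\bs \eq {\tt{aaba}}$ ($F_{\ta} \eq 3$, $F_{\tb} \eq 1$),
 Laplace's rule gives
\beq
        P(\ta \given \bs \eq {\tt{aaba}}, F \eq 4) = \frac{3+1}{3+1+2} = \frac{2}{3} ,
\eeq
 which is less extreme than the observed frequency $F_{\ta}/F = 3/4$. Before any
 data arrive ($F_{\ta} \eq F_{\tb} \eq 0$) the rule gives $1/2$, the mean of the uniform prior.]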


\section{The bent coin and model comparison}
\label{sec.bentcoin2}
 Imagine that a scientist introduces another theory for our data. 
 He asserts that the source is not really a bent coin but is really a 
 perfectly formed die with one face painted heads (`$\ta$') and the other five
 painted tails (`$\tb$'). Thus the parameter $p_{\ta}$, which in the original model,
 $\H_1$, could take any value between 0 and 1, is according 
 to the new hypothesis, $\H_0$, not a free parameter at all; rather, it
 is equal to 
% p_{\ta} = 
 $1/6$. [This hypothesis is termed $\H_0$ so that the suffix of each model
 indicates its number of free parameters.] 

 How can we compare these two models in the light of data? 
 We wish to
 infer  how probable 
 $\H_1$ is relative to $\H_0$.
% , so we can use \Bayes\  theorem again. 
% Let us write down the first model's probabilities again.

% {\em Here we repeat some material from the arithmetic coding
% chapter, chapter \ref{ch4}.}

\subsubsection*{Model comparison as inference}
 In order to perform model comparison, we write down 
 \Bayes\  theorem again, but this time with a different\index{Bayes' theorem} 
 argument on the left-hand side. We wish to know how probable 
 $\H_1$ is given the data. By \Bayes\  theorem, 
\beq
 P( \H_1 \given \bs ,F ) = \frac{ P(  \bs \given F,\H_1 )  P( \H_1 ) }{  P(  \bs \given F) } .
\eeq
 Similarly, the posterior probability of $\H_0$ is 
\beq
 P( \H_0 \given \bs ,F ) = \frac{ P(  \bs \given F,\H_0 )  P( \H_0 ) }{  P(  \bs \given F) }.
\eeq
 The normalizing constant in both cases is $P(\bs\given F)$, which is the total 
 probability of getting the observed data.
% regardless of which model  is true.
 If $\H_1$ and $\H_0$ are the only models under 
 consideration, this  probability is given by the sum rule: 
\beq
         P(  \bs \given  F) =  P(  \bs \given  F,\H_1 )  P( \H_1 ) 
                 + P(  \bs \given  F,\H_0 )  P( \H_0 ) .
\eeq
 To evaluate the posterior probabilities of the hypotheses we 
 need to assign values to the prior probabilities $P( \H_1 )$ 
 and $P( \H_0 )$; in this case, we might set these to 1/2 each. And
 we need to evaluate the data-dependent terms
 $P(  \bs \given  F,\H_1 )$ and $P(  \bs \given  F,\H_0 )$. 
 We can give names to these quantities. 
 The quantity $P(  \bs \given  F,\H_1 )$ is a measure of how much the data 
 favour $\H_1$, and we call it the {\dbf\ind{evidence}} for model $\H_1$. 
 We already encountered this quantity in equation (\ref{eq.pa.post.again})
 where it appeared 
 as the normalizing constant of the first inference we made -- the 
 inference of $p_{\ta}$ given the data. 
\medskip

\begin{conclusionbox}
%\begin{description} 
%\item[How model comparison works:]
 {\bf How model comparison works:}
 The evidence for a model is usually\index{key points!model comparison} 
 the normalizing constant of an earlier Bayesian inference.
%\end{description}
\end{conclusionbox}
\medskip

 We evaluated the normalizing constant for model $\H_1$ in
 (\ref{eq.evidenceZ}).
 The evidence for model $\H_0$ is very simple because this model 
 has no parameters to infer. Defining $p_0$ to be $1/6$, we have
\beq
        P(  \bs \given  F,\H_0 )  =  p_0^{F_{\ta}} (1-p_0)^{F_{\tb}} .
\eeq

 Thus the posterior probability ratio  of model $\H_1$ to model $\H_0$ is
 (taking the priors $P(\H_1)$ and $P(\H_0)$ to be equal in the second line)
\beqan
\frac{ P( \H_1 \given  \bs ,F )}
{P( \H_0 \given  \bs ,F )}
& =&
 \frac{ P( \bs \given  F,\H_1 ) P( \H_1 ) }
      { P( \bs \given  F,\H_0 ) P( \H_0 ) }
\\ 
 &=& 
\left.
{ \frac{ F_{\ta}! F_{\tb}! }{ (F_{\ta} + F_{\tb} + 1)! } }
\right/
{  p_0^{F_{\ta}} (1-p_0)^{F_{\tb}} } .
% \frac{ \smallfrac{ F_{\ta}! F_{\tb}! }{ (F_{\ta} + F_{\tb} + 1)! } }{  p_0^{F_{\ta}} (1-p_0)^{F_{\tb}} } .
% SECOND EDN - sanjoy says use linefrac
\label{eq.compare.final}
\eeqan
 Some values of this posterior probability ratio are illustrated in 
 table \ref{tab.mod.comp}. The first five lines illustrate that 
 some outcomes  favour one model, and some favour the other.
 No  outcome is completely incompatible with either model.
\begin{table}
\figuremargin{%
\begin{center}
\begin{tabular}{cccl}  \toprule
$F$ & Data $(F_{\ta},F_{\tb})$ & $\displaystyle \frac{ P( \H_1 \given  \bs ,F )}
                                {P( \H_0 \given  \bs ,F )}$ \\  \midrule
6 & $(5,1)$ & 222.2 & \\
6 & $(3,3)$ & 2.67 &\\
6 & $(2,4)$ & 0.71 & =  1/1.4 \\
6 & $(1,5)$ & 0.356 & = 1/2.8 \\
6 & $(0,6)$ & 0.427 & = 1/2.3 \\ \midrule
20 & $(10,10)$ & 96.5 & \\
20 & $(3,17)$ & 0.2 & = 1/5 \\
20 & $(0,20)$ & 1.83 &  \\  \bottomrule
\end{tabular}
\end{center}
}{%
\caption{Outcome of model comparison between models $\H_1$ and $\H_0$
 for the `bent coin'. Model $\H_0$ states that  $p_{\ta}=1/6$, $p_{\tb}=5/6$.}
\label{tab.mod.comp}
}
\end{table}
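 [To illustrate how the entries of table \ref{tab.mod.comp} arise, take the first line,
 $(F_{\ta},F_{\tb}) = (5,1)$, with equal priors $P(\H_1) = P(\H_0)$.
 From \eqref{eq.evidenceZ}, $P(\bs \given F,\H_1) = 5!\,1!/7! = 1/42$, while
 $P(\bs \given F,\H_0) = (1/6)^5 (5/6) = 5/46656$, so
\beq
 \frac{ P( \H_1 \given \bs ,F )}{P( \H_0 \given \bs ,F )}
 = \frac{1/42}{5/46656} = \frac{46656}{210} \simeq 222.2 ,
\eeq
 matching the tabulated value.]
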
 With small amounts of data (six tosses, say) it is typically not the case that 
 one of the two models is overwhelmingly more probable than 
 the other. But with more data, the evidence against $\H_0$ given 
 by any data set with the ratio $F_{\ta} \colon F_{\tb}$ differing from $1 \colon 5$ mounts up.
%
% add figure showing some typical histories
%
 You can't predict in advance how much data are needed to be pretty sure
 which theory is true.\index{key points!how much data needed}  It depends what $p_{\ta}$ is.
%
% THIS IS A VERY GENERAL
% message for machine learning.

% corrected Wed 28/11/01
 The simpler model, $\H_0$, since it has no adjustable parameters, 
 is able to lose out by the biggest margin. The odds may be hundreds to one 
 against it. The more complex model can never lose out 
 by a large margin; there's no data set that is actually {\em unlikely\/}
 given model $\H_1$.
\exercisaxB{2}{ex.evidencebounds}{
 Show that after $F$ tosses have taken place, the
 biggest value that the log evidence ratio
\beq
\log \frac{ P( \bs \given  F,\H_1 ) }
          { P( \bs \given  F,\H_0 ) }
\eeq
 can have scales {\em linearly\/} with $F$ if
 $\H_1$ is more probable, but
 the log evidence in favour of $\H_0$ can grow
 at most as $\log F$.
}
\exercissxB{3}{ex.evidenceest}{
 Putting your sampling theory hat on, assuming $F_{\ta}$ has not yet been measured, 
 compute a plausible range that
% the mean and variance -- or some sort of most probable value, and indication of spread -- of the
 the log evidence ratio might lie in, as a function of $F$ and
 the true value of $p_{\ta}$,
 and sketch it
 as a function of $F$ for $p_{\ta}=p_0=1/6$, $p_{\ta}=0.25$,
 and $p_{\ta}=1/2$.
 [Hint:  sketch the log evidence as a function
 of the random variable $F_{\ta}$ and work out the mean
 and standard deviation of $F_{\ta}$.]
% [Hint: Taylor-expand the log evidence as a function
% of $F_{\ta}$.]
}
%
% This page comes out rotated bizarrely by 90 degrees in pdf
%
\subsection{Typical behaviour of the evidence}
% see figs/sixtoone
% and bin/sixtoone.p
 \Figref{fig.evidencetyp} shows the log evidence ratio
 as a function of the number of
 tosses, $F$, in a number of simulated experiments.
 In the left-hand experiments, $\H_0$ was true.
 In the right-hand ones, $\H_1$ was true, and the value of
 $p_{\ta}$ was either 0.25 or 0.5.
% \newcommand{\sixtoone}[2]{%  in newcommands1.tex
\begin{figure}
\figuremargin{%
\small%
\begin{center}
\begin{tabular}{cccc}
$\H_0$ is true &&
\multicolumn{2}{c}{$\H_1$ is true} \\ \cmidrule{1-1}\cmidrule{3-4}
\sixtoone{$p_{\ta}=1/6$}{h09}&&
\sixtoone{$p_{\ta}=0.25$}{h69}&
\sixtoone{$p_{\ta}=0.5$}{h29}\\
\sixtoone{}{h08}&&
\sixtoone{}{h68}&
\sixtoone{}{h28}\\
\sixtoone{}{h07}&&
\sixtoone{}{h67}&
\sixtoone{}{h27}\\
\end{tabular}
\end{center}
}{%
\caption[a]{Typical behaviour of the evidence in favour of $\H_1$ as
 bent coin tosses accumulate\index{typicality!behaviour of evidence}\index{evidence!typical behaviour of}\index{model comparison!typical evidence}
 under three different conditions. Horizontal axis is the number of
 tosses, $F$. The vertical axis on the left is
$\ln \smallfrac{ P( \bs \given  F,\H_1 ) }
          { P( \bs \given  F,\H_0 ) }$;
  the right-hand vertical axis shows the values of 
$\smallfrac{ P( \bs \given  F,\H_1 ) }
          { P( \bs \given  F,\H_0 ) }$.

 (See also \protect\figref{fig.evidenceMSD}, \pref{fig.evidenceMSD}.)
}
\label{fig.evidencetyp}
}%
\end{figure}
 

 We will discuss model comparison more in a later chapter. 

\section{An example of legal evidence}
\label{ex.blood.sol}% special label by hand
 The following example
% (\exerciseonlyref{ex.blood})
 illustrates that there is more 
 to Bayesian inference than the priors.

\begin{quote}
% Two people have left traces of their own blood at the scene of a
% crime.  Their blood groups can be reliably identified from these
% traces and are found 
% to be of type `O' (a common type in the local population, having
% frequency 60\%) and of type `AB' (a rare type, with frequency 1\%).
% A suspect is tested and found to have type `O' blood. 
% A careless lawyer might claim that the fact that the suspect's
% blood type was found at the scene is positive evidence for the theory
% that he was present. But do these data
% $D=$ \{type `O' and `AB' blood were found at scene\} make it more
% probable that this suspect was one of the two people present at the
% crime? 
 Two people have left traces of their own blood at the scene of a
 crime. 
 A suspect, Oliver, is tested and found to have type `O' blood.
 The blood groups of the two traces 
 are found
 to be of type `O' (a common type in the local population, having
 frequency 60\%) and of type `AB' (a rare type, with frequency 1\%).
  Do these data
 (type `O' and `AB' blood were found at scene) give evidence in favour 
 of the proposition  that Oliver was one of the two people present at the
 crime? 

\end{quote}
 A careless \ind{lawyer} might claim that the fact that the suspect's
 blood type was found at the scene is positive evidence for the theory
 that he was present. But this is not so.

 Denote the proposition `the suspect and one unknown person were
 present' by $S$. The alternative, $\bar{S}$, states `two unknown people
 from the population were present'. 
 The prior  in  this problem is the prior probability ratio between the 
 propositions $S$ and $\bar{S}$. This quantity is important to the final 
 verdict and would be based on all other available information 
 in the case. Our task here is just to evaluate the contribution made by the 
 data $D$, that is, the likelihood ratio, $P(D\given S,\H)/P(D\given \bar{S},\H)$.
 In my view, a jury's task should generally be to multiply together carefully 
 evaluated 
 likelihood ratios from each independent piece of admissible evidence
 with an equally carefully reasoned prior probability.
 [This  view is shared by many statisticians but learned British appeal judges\index{judge}   
 recently disagreed and actually overturned the verdict of a trial
 because the \index{jury}{jurors} {\em had\/} been taught to use \Bayes\  theorem to 
 handle complicated \ind{DNA} evidence.]

%
 The probability of the data given $S$ is the probability that one unknown person 
 drawn from the population has blood type AB:
\beq
P(D\given S,\H) = p_{\rm{AB}} 
\eeq
 (since given $S$, we already know that one trace will be of type O). 
 The probability of the data given  $\bar{S}$ is the 
 probability that two unknown people drawn from the population have 
 types O and AB: 
\beq
	P(D\given \bar{S},\H) = 2 \, p_{\rm{O}} \, p_{\rm{AB}} .
\eeq
 In these equations $\H$ denotes the assumptions that two people were
 present and left blood there, and that the probability distribution
 of the blood groups of unknown people in an explanation is the same
 as the population frequencies. 
% Our posterior probability ratio for
% $S$ relative to $\bar{S}$ is obtained by multiplying the probability
% ratio based on all other independent information by the ratio of
% these likelihoods. The most straightforward way to summarize the
% contribution of any piece of evidence is in terms of a likelihood
% ratio.

 Dividing, we obtain the likelihood ratio: 
\beq
        \frac{P(D\given S,\H)}{P(D\given \bar{S},\H)} = \frac{1}{2 p_{\rm O}} 
        = \frac{1}{2 \times 0.6}
         = 0.83 .
\eeq
 Thus the data in fact provide weak evidence {\em against\/} the
 supposition that Oliver was present.

 This result may be found surprising, so let us examine it from
 various points of view. First consider the case of another suspect,
 Alberto, 
 who has type AB.  Intuitively, the data do provide evidence in favour
 of the theory $S'$ that this suspect was present, relative to the
 null hypothesis $\bar{S}$. And indeed the likelihood ratio in this
 case is:
\beq
        \frac{P(D\given S',\H)}{P(D\given \bar{S},\H)} = \frac{1}{2\, p_{\rm{AB}}} = 50.
\eeq 
 Now let us change the situation slightly; imagine that 99\% of people
 are of blood type O, and the rest are of type AB. Only these two 
 blood types exist in the population. The data at the
 scene are the same as before. Consider again how these data influence
 our beliefs about Oliver,
 a  suspect of type O, and Alberto, a suspect of type
 AB.  Intuitively, we still believe that the presence of the rare AB
 blood provides positive evidence that  \ind{Alberto} was
 there.  But does
% we still have the feeling that
 the fact that type O
 blood was detected at the scene favour the hypothesis that
 Oliver was present? If this were the case, that would mean that
 regardless of who the suspect is, the data make it more probable they
 were present; everyone in the population would be
 under greater suspicion, which would be absurd.  The data may be {\em
 compatible\/} with any suspect of either blood type being present, but
 if they provide  evidence {\em for\/} some theories, they must also
 provide evidence {\em against\/} other theories.

 Here is another way of thinking about this: imagine that instead of
 two people's blood stains there are ten, and that in the entire local
 population of one hundred, there are ninety type O suspects and ten
 type AB suspects.
% Initially all 100 people are suspects. 
 Consider a particular type O suspect, \ind{Oliver}: without any other information,
 and before the blood test results come in,
 there is a one in 10 chance that he was at the scene, since
 we know that 10 out of the 100 suspects were present.  We now get the
 results of blood tests, and find that {\em nine\/} of the ten stains are of
 type AB, and {\em one\/} of the stains is of type O. Does this make it more
 likely that Oliver was there? No,
% although he could have been,
 there is now only a one in ninety chance that he was there, since we
 know that only one person present was of type O.

 Maybe the intuition is aided finally by writing down the formulae for
 the general case where $n_{\rm{O}}$ blood stains of individuals of type O
 are found, and $n_{\rm{AB}}$ of type $\rm{AB}$, a total of $N$ individuals in
 all, and unknown people come from a large population with fractions
 $p_{\rm{O}}, p_{\rm{AB}}$. (There may be other blood types too.) 
 The task is to evaluate the likelihood ratio for the
 two hypotheses:  $S$, `the type O suspect (Oliver)
 and $N\!-\!1$ unknown others
 left $N$ stains'; and $\bar{S}$, `$N$ unknowns left $N$ stains'. The
 probability of the data under hypothesis $\bar{S}$ is just the
 probability of getting $n_{\rm{O}}, n_{\rm{AB}}$ individuals of the two types
 when $N$ individuals are drawn at random from the population:
\beq
        P(n_{\rm{O}},n_{\rm{AB}}\given \bar{S}) = 
        \frac{ N! }{ n_{\rm{O}} ! \, n_{\rm{AB}}! } p_{\rm{O}}^{n_{\rm{O}}} p_{\rm{AB}}^{n_{\rm{AB}}} .
\eeq
 In the case of hypothesis $S$, we need  the distribution of
 the $N\!-\!1$ other individuals:
\beq
        P(n_{\rm{O}},n_{\rm{AB}}\given S) = 
        \frac{ (N-1)! }{ (n_{\rm{O}}-1)! \, n_{\rm{AB}}! } p_{\rm{O}}^{n_{\rm{O}}-1} p_{\rm{AB}}^{n_{\rm{AB}}} .
\eeq
 The likelihood ratio is:
\beq
        \frac{ P(n_{\rm{O}},n_{\rm{AB}}\given S) }{ P(n_{\rm{O}},n_{\rm{AB}}\given \bar{S}) }
        = \frac{n_{\rm{O}}/N}{p_{\rm{O}}} .
\eeq
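 [Written out, the intermediate step is as follows: the factors $n_{\rm{AB}}!$ and
 $p_{\rm{AB}}^{n_{\rm{AB}}}$ cancel, leaving
\beq
 \frac{ (N-1)! }{ N! } \, \frac{ n_{\rm{O}}! }{ (n_{\rm{O}}-1)! } \,
 \frac{ p_{\rm{O}}^{n_{\rm{O}}-1} }{ p_{\rm{O}}^{n_{\rm{O}}} }
 = \frac{1}{N} \times n_{\rm{O}} \times \frac{1}{p_{\rm{O}}} ,
\eeq
 which is the ratio stated.]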
 This is an instructive result. The likelihood ratio, \ie\ the
 contribution of these data to the question of whether Oliver
 was present, depends simply on a comparison of the frequency
 of his blood type
% type O blood
 in the observed data with the background frequency 
% of type O blood
 in the population. There is no dependence on the counts
 of the other types found at the scene, or their frequencies in the
 population.  If there are more type O stains than the average number
 expected  under hypothesis $\bar{S}$, then the data give
 evidence in favour of the presence of Oliver.
 Conversely, if there are fewer type O stains than the expected number
 under $\bar{S}$, then the data reduce the probability of the
 hypothesis that he was there.  In the special case $n_{\rm{O}}/N = p_{\rm{O}}$, the
 data contribute no evidence either way, regardless of the fact that
 the data are compatible with the hypothesis $S$.
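 [As a check, the original two-stain problem has $N \eq 2$ and $n_{\rm{O}} \eq 1$, so the
 likelihood ratio for Oliver is $(1/2)/p_{\rm{O}} = 0.5/0.6 = 0.83$, as found earlier;
 the same formula with O replaced by AB gives $(1/2)/p_{\rm{AB}} = 0.5/0.01 = 50$ for Alberto.]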


\section{Exercises}
% \subsection*{The game show}
%\subsubsection*{The normal rules}
%\subsubsection*{The earthquake scenario}
\exercissxA{2}{ex.3doors}{
  {\sf The \ind{three doors},\index{Monty Hall problem} normal rules.}
% "Let's Make A Deal," hosted by Monty Hall

 On a \ind{game show},\index{doors, on game show}\index{game!three doors}
 a contestant is told the rules as 
 follows:
\begin{quote}
 There are three doors, labelled 1, 2, 3. A single
 prize has been hidden behind one of 
 them. You get to select one door. Initially your chosen door will {\em not\/} 
 be opened. Instead, the gameshow host will open one of the other two doors, 
 and {\em he will do so in such a way as not to reveal the prize.}
 For example, if you first
 choose door 1, he will then open {one\/} of doors 2 and 3, and it 
 is guaranteed that he will choose which one to open so that
 the prize will not be revealed. 

 At this point, you will be given a fresh choice of door:
 you can either stick with your first choice,
 or you can switch to the other 
 closed door.  All the doors will then be opened and 
 you will  receive whatever is behind your final 
 choice of door.
\end{quote}
  Imagine that the contestant chooses door 1 first; then the gameshow host 
 opens door 3, revealing nothing behind the door, as promised. 
 Should the contestant (a) stick with door 1, or (b)
 switch to door 2, or (c) does it make no difference?
}
\exercissxA{2}{ex.3doorsb}{
 {\sf The three doors,  earthquake scenario.}

 Imagine that the game happens again
 and  just as the gameshow host is about to open one of the 
 doors a violent earthquake\index{earthquake, during game show}
 rattles the  building and one of the 
 three doors flies open. It happens to be door 3, and it 
 happens not to have the prize behind it. The contestant had initially 
 chosen door 1.

 Repositioning his toup\'ee,
 the host suggests, `OK, since you chose door 1 initially, 
 door 3 is a valid door for me to open, according to the
 rules of the game; I'll let door 3 stay open. Let's carry on 
 as if nothing happened.'
 
 Should the contestant stick with door 1, or switch to door 2, or
 does it make no difference? Assume that the prize was placed randomly, that
 the gameshow host does not know where it is, and that the door flew open
 because its latch was broken by the earthquake.

 [A similar alternative scenario is a gameshow whose {\em confused host\/}\index{confused gameshow host}
 forgets the rules, and  where the prize is, and opens one of
 the unchosen doors at random. He opens door 3, and the prize is not revealed.
 Should the contestant choose what's behind door 1 or door 2?
 Does the optimal decision for
 the contestant depend on the contestant's \ind{belief}s about
 whether  the gameshow host is confused or not?]\index{game show}\index{three doors}\index{doors, on game show}\index{prize, on game show}\index{Monty Hall problem}
}
\exercisaxB{2}{ex.girlboy}{
%\subsection
{\sf Another example in which the emphasis is not on priors.}
%\begin{quote}
 You visit a family whose three children are all at the local school.
 You don't know anything about the sexes of the children.
 While walking clumsily round the home, you stumble through
 one of the  three unlabelled bedroom doors that you know
 belong, one each, to the three children, and find that the bedroom
 contains \ind{girlie stuff} in sufficient quantities to
 convince you that the child who lives in that bedroom
 is a girl.
  Later, you sneak a look at a letter addressed to the parents,
 which reads `From the Headmaster:
  we are sending this letter to all parents who have male children at
 the  school to inform them about the following \ind{boyish matters}\ldots'.

 These two sources of evidence establish that at least
 one of the three  children is
 a girl, and that at least one of the children is a boy.
 What are the probabilities that there are (a) two girls and one boy;
 (b) two boys and one girl?
%\end{quote}
}
% Another example of legal evidence}
\exercissxB{2}{ex.simpsons}{
 Mrs\ S is found stabbed in her family
 garden.
% \index{Simpson, O.J., similar case to} 
 Mr\ S behaves strangely after her death and is considered as
 a suspect. On investigation of police and social records 
 it is found that Mr\ S had  beaten up his wife on at least 
 nine previous occasions. The prosecution advances these 
 data as evidence in favour of the hypothesis that Mr\ S is 
 guilty of the murder. 
 `Ah no,' says 
% Mr.\ Merd-Kopf,
 Mr\ S's highly paid lawyer,\index{lawyer}\index{wife-beater}\index{murder}  
 `{\em statistically}, only one in a  thousand wife-beaters 
 actually goes on to murder his wife.\footnote{In the U.S.A., it 
 is estimated that 
% http://www.umn.edu/mincava/papers/factoid.htm
 2 million women are abused each year by their partners.
 In 1994, $4739$ women were victims of homicide; of those,
% 28 \percent,
 $1326$ women (28\%)
    were slain by husbands and boyfriends.\\ (Sources: 
 {\tt http://www.umn.edu/mincava/papers/factoid.htm,\\ 
 http://www.gunfree.inter.net/vpc/womenfs.htm})
% http://www.gunfree.inter.net/vpc/womenfs.htm
%  In keeping 
% with the fictitious nature of this story, the $1/100\,000$ 
% figure was made up by me.
 }\label{footnote.murder} So the wife-beating
% , which  is not denied by Mr\ S,
 is not strong evidence at all. In fact, 
 given the wife-beating evidence alone, it's extremely unlikely 
 that he would be the murderer of his wife -- only a 
 $1/1000$ chance. You should therefore find him innocent.'

 Is the lawyer
% Mr\ Merd-Kopf
 right to imply that the history of wife-beating does
 not point to Mr\ S's being the murderer? Or is the lawyer a  slimy trickster? If 
 the latter, what is wrong with his argument?

 [Having received an indignant letter from a lawyer about
 the preceding paragraph, I'd like to
 add an extra inference exercise at this point:
 {\em Does my suggestion that Mr\ S's lawyer
 may have been a  slimy trickster imply that
 I believe {\em all} lawyers are   slimy tricksters?} (Answer: No.)]
}
% Lewis Carroll's Pillow Problem
\exercisaxB{2}{ex.bagcounter}{ A bag contains one counter, known to be
 either white or black. A white counter is put in, the bag is shaken,
 and a counter is drawn out, which proves to be white. What is now the
 chance of drawing a white counter?
 [Notice that
 the state of the bag, after the operations, is exactly identical to its state before.] 
}
\exercissxB{2}{ex.phonetest}{% ????????????????? needs solution adding (was phonecheck!)
 You move into a new house; the phone is connected, and
% you are unsure of your phone number --
 you're pretty sure that
 the \ind{phone number}\index{telephone number} is
% it's
 {\tt 740511}, but not as sure as you would like to be.
% 
 As an experiment, you pick up the phone and dial {\tt 740511};
 you obtain a `busy' signal.
 Are you now more sure of your phone number? If so, how much?
}
%

\exercisaxB{1}{ex.othercoin}{
 In a game, two coins are tossed. If either of the coins comes up
 heads, you have won a prize. To claim the prize, you must point to
 one of your coins that is a head
 and say `look, that coin's a head, I've won'.
 You watch Fred play the game. He tosses the two coins, and he
 points to a coin and says `look, that coin's a head, I've won'.
 What is the probability that the {\em other\/} coin is a head?
}
%\subsection*{Another quasi-legal story}
%     \exercis{ex.}{
% During a radio chat show on the health consequences of 
% secondary smoking, it is reported by an expert that 
% twelve recent studies have  investigated whether 
% there was a link between secondary smoking and cancer. 
% Of these, eleven studies  failed to establish a link
% and one study  found significant evidence of a causal 
% link -- secondary smoking increasing the risk of getting 
% cancer.  The expert said that the net evidence from these 
% twelve results was that there was significant evidence of a causal
% link. 
%
% Shortly thereafter, a Mr.\ N.T.\ Social called in in support 
% of smokers' ``rights'' to pollute public air. `If eleven 
% of the studies didn't find a link, and only one found a link, 
% then it's eleven to one  that there isn't a link, isn't it?'
%
% `Well, you clearly don't understand statistics, do you?' responded
% the condescending  host. 
%
% Can you suggest a more helpful explanation of the expert's statement?
%}
% euro.tex
\exercissxB{2}{ex.eurotoss}{
 A statistical statement appeared in 
% \footnote{Quoted by  Charlotte Denny and Sarah Dennis
 {\em The Guardian} on Friday January 4, 2002:
\begin{quote}
 When spun on edge 250
                times, a Belgian one-euro
 coin came up heads 140 times and tails 110. 
 `It looks very suspicious to me', said Barry Blight, a statistics lecturer
  at the London School of Economics.
 `If the coin were unbiased the
  chance of getting a result as extreme as that would be less than 7\%'.
\end{quote}
 But {\em do\/} these
 data give evidence that the coin is biased rather than fair?

[Hint: see \eqref{eq.compare.final}.]
}

% \input{tex/bayes_occam.tex}
\dvips 
\section{Solutions}% to Chapter \protect\ref{ch.bayes}'s exercises} % 
\soln{ex.dieexponential}{
 Let the data be $D$. Assuming equal prior probabilities, 
\beqan
	\frac{P(A \given D)}{P(B \given D)} = \frac{1}{2}\frac{3}{2}\frac{1}{1}\frac{3}{2}
				\frac{1}{2}\frac{2}{2}\frac{1}{2} = \frac{9}{32}
\eeqan	
 and $P(A \given D) = 9/41.$
% (check me).
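 [The last step uses $P(A \given D) + P(B \given D) = 1$: if the posterior ratio is $r = 9/32$,
 then $P(A \given D) = r/(1+r) = 9/(9+32) = 9/41$.]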
}
\soln{ex.dieexponentialb}{
 The probability of the data given each hypothesis is:
\beq
	P(D \given A) = \frac{3}{20}\frac{1}{20}\frac{2}{20}\frac{1}{20} 
			\frac{3}{20}\frac{1}{20} \frac{1}{20} =
	 \frac{18}{20^7} ;
\eeq	
\beq
	P(D \given B) = \frac{2}{20}\frac{2}{20}\frac{2}{20}\frac{2}{20} 
			\frac{2}{20}\frac{1}{20} \frac{2}{20}
			= \frac{64}{20^7} ;
\eeq	
\beq
	P(D \given C) = \frac{1}{20}\frac{1}{20}\frac{1}{20}\frac{1}{20} 
			\frac{1}{20}\frac{1}{20} \frac{1}{20} 
		= \frac{1}{20^7}.
\eeq	
 So
\beq
% \hspace*{-0.1in}
	P(A \given D) = \frac{18}{18+64+1} = \frac{18}{83} ; \hspace{0.3in}
	P(B \given D) = \frac{64}{83} ;\hspace{0.3in} 
	P(C \given D) = \frac{1}{83} .
\eeq
}

\fakesection{Bent coin exercise solns}
\begin{figure}[htbp]
\figuremargin{%
\footnotesize
\begin{center}
\begin{tabular}{cc}
(a) \psfig{figure=figs/aba.ps,width=2in,angle=-90}&
(b) \psfig{figure=figs/bbb.ps,width=2in,angle=-90}\\
$P( p_{\tt{a}}  \given  \bs\eq {\tt{aba}} ,F\eq 3) \propto p_{\tt{a}}^2 (1-p_{\tt{a}})$
&
 $P( p_{\tt{a}}  \given  \bs\eq {\tt{bbb}} ,F\eq 3) \propto (1-p_{\tt{a}})^3$ \\
\end{tabular}
\end{center}
}{%
\caption[a]{Posterior probability for the bias $p_{\tt{a}}$ of a bent coin given
 two different data sets.}
\label{fig.aba.bbb}
}%
\end{figure}
\soln{ex.postpaII}{% relabelled from postpa Sun 6/4/03  - beware incorrect refs likely
\ben
\item
 $P( p_{\tt{a}}  \given  \bs\eq {\tt{aba}} ,F\eq 3) \propto p_{\tt{a}}^2 (1-p_{\tt{a}})$.
 The most probable value of $p_{\tt{a}}$ (\ie, the value that maximizes 
 the posterior probability density) is $2/3$.
 The mean value of $p_{\tt{a}}$  is $3/5$. 

 See \figref{fig.aba.bbb}a.
\item
 $P( p_{\tt{a}}  \given  \bs\eq {\tt{bbb}} ,F\eq 3) \propto (1-p_{\tt{a}})^3$.
 The most probable value of $p_{\tt{a}}$ (\ie, the value that maximizes 
 the posterior probability density) is $0$.
 The mean value of $p_{\tt{a}}$  is $1/5$. 

 See \figref{fig.aba.bbb}b.
\een
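 [To see where these values come from in case (a): setting the derivative of
 $p_{\tt{a}}^2 (1-p_{\tt{a}})$ to zero gives $2 p_{\tt{a}} - 3 p_{\tt{a}}^2 = 0$, whose
 root $p_{\tt{a}} = 2/3$ is the maximum (the other root, $p_{\tt{a}} = 0$, is a minimum).
 The posterior mean equals the predictive probability of an {\tt{a}}, so by
 Laplace's rule \eqref{eq.laplacederived} it is $(2+1)/(2+1+2) = 3/5$.
 In case (b) the density $(1-p_{\tt{a}})^3$ decreases monotonically on $[0,1]$, so its
 maximum is at $p_{\tt{a}} = 0$, and the mean is $(0+1)/(0+3+2) = 1/5$.]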
}
%/home/mackay/_courses/itprnn/figs
%gnuplot> plot x**2*(1-x)
%gnuplot> set xrange [0:1]
%gnuplot> replot
%gnuplot> set nokey
%gnuplot> set size 0.4,0.4
%gnuplot> replot
%gnuplot> set noytics
%gnuplot> replot
%gnuplot> set yrange [0:0.4]
%gnuplot> replot
%gnuplot> set yrange [0:0.17] 
%gnuplot> replot             
%gnuplot> set term post
%Terminal type set to 'postscript'
%Options are 'landscape monochrome dashed "Helvetica" 14'
%gnuplot> set output "aba.ps"
%gnuplot> replot
%gnuplot> set term X
%Terminal type set to 'X11'
%gnuplot> set yrange [0:1]   
%gnuplot> plot (1-x)**3
%gnuplot> set term post      
%Terminal type set to 'postscript'
%Options are 'landscape monochrome dashed "Helvetica" 14'
%gnuplot> set output "bbb.ps"
%gnuplot> replot


\fakesection{evidence est}
\begin{figure}[htbp]
\figuremargin{%
\small%
\begin{center}
\begin{tabular}{cccc}
$\H_0$ is true &&
\multicolumn{2}{c}{$\H_1$ is true} \\ \cmidrule{1-1}\cmidrule{3-4}
\sixtoone{$p_a=1/6$}{h0MSD}&&
\sixtoone{$p_a=0.25$}{h6MSD}&
\sixtoone{$p_a=0.5$}{h2MSD}\\
\end{tabular}
\end{center}
}{%
\caption[a]{Range of plausible values of the log evidence in favour of $\H_1$ as
 a function of $F$. The vertical axis on the left is
$\log \smallfrac{ P( \bs  \given  F,\H_1 ) }
          { P( \bs  \given  F,\H_0 ) }$;
  the right-hand vertical axis shows the values of 
$\smallfrac{ P( \bs  \given  F,\H_1 ) }
          { P( \bs  \given  F,\H_0 ) }$.

\index{typicality!behaviour of evidence}\index{evidence!typical behaviour of}\index{model comparison!typical behaviour of evidence}%
The solid line shows the log evidence if the random variable $F_a$
 takes on its mean value, $F_a = p_aF$. The dotted lines show (approximately)
 the log evidence if $F_a$ is at its 2.5th or 97.5th percentile.

 (See also \protect\figref{fig.evidencetyp}, \pref{fig.evidencetyp}.)
 }
\label{fig.evidenceMSD}
}%
\end{figure}
\soln{ex.evidenceest}{
 The curves in \figref{fig.evidenceMSD} were obtained by computing the mean and standard deviation
 of $F_a$, then setting $F_a$ to the mean $\pm$ two standard deviations
 to get a 95\% plausible range for $F_a$, and computing the three
 corresponding values of the log evidence ratio.
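
 As a sketch of that recipe: if each of the $F$ trials yields
 outcome $a$ with probability $p_a$, independently, then $F_a$ has a binomial
 distribution, so
\beq
	\mbox{mean}(F_a) = p_a F ,
	\:\:\:\:
	\mbox{s.d.}(F_a) = \sqrt{ F p_a (1-p_a) } ,
\eeq
 and the three curves correspond to $F_a = p_a F$ and
 $F_a = p_a F \pm 2 \sqrt{ F p_a (1-p_a) }$, with $F_b = F - F_a$.
 Assuming the evidences have the form used in the model comparison
 to which this exercise refers (a uniform prior on $p_a$ under $\H_1$, and
 $p_a$ fixed to $p_0 = 1/6$ under $\H_0$; these forms are an assumption
 here, since they are not restated in this solution), the quantity plotted is
\beq
	\log \frac{ P( \bs  \given  F,\H_1 ) }{ P( \bs  \given  F,\H_0 ) }
	= \log \frac{ \Gamma(F_a+1) \, \Gamma(F_b+1) }{ \Gamma(F+2) }
	- F_a \log p_0 - F_b \log (1-p_0) ,
\eeq
 where the gamma functions allow non-integer values of $F_a$.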

}%
\soln{ex.3doors}{
 Let $\H_i$ denote the hypothesis that the prize is behind 
 door $i$.
 We make the following assumptions: the three hypotheses
 $\H_1$, $\H_2$ and $\H_3$ are equiprobable {\em a priori}, \ie, 
\beq
	P(\H_1) = P(\H_2) = P(\H_3) = \frac{1}{3} .
\eeq
 The datum we receive, after choosing door 1,
 is one of $D \eq 3$ and $D \eq 2$ (meaning door 3 or 2 is opened, respectively).
 We assume that these two possible outcomes have the following probabilities.
 If the prize is behind door 1 then the host has a free choice; in 
 this case we assume that the host selects at random between $D\eq 2$ and $D\eq 3$.
 Otherwise the choice of the host is forced and the probabilities
 are 0 and 1.
\beq
\begin{array}{|r@{\,}c@{\,}l|r@{\,}c@{\,}l|r@{\,}c@{\,}l|}
	P( D\eq 2  \given  \H_1) &=& \dfrac{1}{2}  & 
	P( D\eq 2  \given  \H_2) &=& 0  & 
	P( D\eq 2  \given  \H_3) &=& {1} \\
	P( D\eq 3  \given  \H_1) &=& \dfrac{1}{2}  & 
	P( D\eq 3  \given  \H_2) &=& {1}  & 
	P( D\eq 3  \given  \H_3) &=& 0
\end{array} 
\eeq
 Now, using \Bayes\  theorem, we evaluate the posterior probabilities
 of the hypotheses:
\beq
	P( \H_i  \given  D\eq3 ) = \frac{P( D\eq3  \given  \H_i)  P(\H_i) }{P(D\eq3) }
\eeq
\beq
\begin{array}{|r@{\,}c@{\,}l|r@{\,}c@{\,}l|r@{\,}c@{\,}l|}
	P(\H_1  \given  D\eq 3) &=& \frac{ (1/2)  (1/3) }{P(D\normaleq 3) }  & 
	P(\H_2  \given  D\eq 3) &=& \frac{ ({1})  (1/3) }{P(D\normaleq 3) }  & 
	P(\H_3  \given  D\eq 3) &=& \frac{ ({0})  (1/3) }{P(D\normaleq 3) } 
\end{array}
\eeq
 The denominator $P(D\eq 3)$ is  $(1/2)$ because it is the  normalizing 
 constant for this posterior distribution. 
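 Explicitly, summing the three numerators,
\beq
	P(D\eq 3) = \sum_{i=1}^{3} P( D\eq 3  \given  \H_i) P(\H_i)
	= \left( \dfrac{1}{2} + 1 + 0 \right) \dfrac{1}{3} = \dfrac{1}{2} .
\eeq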
So
\beq
\begin{array}{|rcl|rcl|rcl|}
	P( \H_1   \given  D\eq3 ) &=&	 \dfrac{ 1}{3} &
P(\H_2  \given  D\eq3) &=&	 \dfrac{ 2}{3} &
P(\H_3  \given  D\eq3) &=&	 0 .
\end{array} 
\eeq
 So the contestant should switch to door 2 in order to have
 the biggest chance of getting the prize.

 Many people find this outcome surprising. There are two 
 ways to make it more intuitive. One is to play the game\index{game!three doors}
 thirty
 times with a friend and keep track of the frequency with 
 which switching gets the prize. Alternatively, 
 you can perform a thought experiment in which the game is 
 played with a million doors. The rules are now that the contestant
 chooses one door, then the game show host opens 
 999,998 doors in such a way as not to reveal the prize, leaving 
 the {\em contestant's\/}
 selected door  and {\em one other door\/}
 closed. The contestant may 
now stick or switch. 
 Imagine the contestant confronted by a million doors, of which 
  doors 1 and 234,598  have not been opened, door 1 having been 
 the contestant's initial guess. Where do you think the prize is?
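
 For readers who want to check their answer, here is a sketch of the
 calculation for $N$ doors, under the same assumptions as
 in the three-door case (equiprobable hypotheses, and a host who never
 reveals the prize and chooses at random when he has a free choice).
 Let $D$ be the datum that door $k$ is the one other door left closed.
 Then $P(D \given \H_1) = 1/(N-1)$, $P(D \given \H_k) = 1$, and
 $P(D \given \H_j) = 0$ for every other $j$, so
\beq
	P( \H_1 \given D ) = \frac{ \frac{1}{N-1} \cdot \frac{1}{N} }
		{ \frac{1}{N-1} \cdot \frac{1}{N} + 1 \cdot \frac{1}{N} } = \frac{1}{N} ,
	\:\:\:\:
	P( \H_k \given D ) = \frac{N-1}{N} .
\eeq
 Setting $N \eq 3$ recovers the values $1/3$ and $2/3$ found above.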
}
%
\soln{ex.3doorsb}{
% earthquake rules.
 If door 3 is opened by an earthquake, the inference comes out
 differently -- even though visually the scene looks the same.  The
 nature of the data, and the probability of the data, are both now
 different.  The possible data outcomes are, firstly, that any number
 of the doors might have opened. We could label the eight possible
 outcomes $\bd = (0,0,0), (0,0,1), (0,1,0), (1,0,0), (0,1,1), \ldots,
 (1,1,1)$. Secondly, it might be that the prize is visible after the
 earthquake has opened one or more doors.  So the data $D$ consists of
 the value of $\bd$, and a statement of whether the prize was
 revealed.  It is hard to say what the probabilities of these outcomes
 are, since they depend on our beliefs about the reliability
 of the door latches and the properties of earthquakes,
 but it is possible to extract the desired posterior probability
 without naming the values of $P(\bd \given \H_i)$ for each $\bd$.  All that
 matters are the relative values of the quantities $P(D \given \H_1)$,
 $P(D \given \H_2)$, $P(D \given \H_3)$, for the value of $D$ that actually occurred.
 [This is the {\dem\ind{likelihood principle}}, which
 we met in \sectionref{sec.lp}.]
% !!!!!!!!! add page ref?
 The  value of $D$ that actually occurred is
 `$\bd  \eq  (0,0,1)$, and no prize visible'. First, it is clear that
 $P(D \given \H_3)=0$, since the datum that no prize is visible is
 incompatible with $\H_3$.  Now, assuming that the contestant selected
 door 1, how does the probability $P(D \given \H_1)$ compare with
 $P(D \given \H_2)$?  Assuming that earthquakes are not sensitive to
 decisions of game show contestants,
 these two quantities have to be equal,  by symmetry. We don't know how likely it is
 that door 3 falls off its hinges, but however likely it is, it's just
 as likely to do so whether the prize is behind door 1 or door 2.  So,
 if $P(D \given \H_1)$ and $P(D \given \H_2)$ are equal, we obtain:
\beq
 \begin{array}{|r@{\,\,=\,\,}l|r@{\,\,=\,\,}l|r@{\,\,=\,\,}l|}
	P(\H_1  |  D) & \smallfrac{ P(D | \H_1)  (\smalldfrac{1}{3}) }{P(D) }  & 
	P(\H_2  |  D) & \smallfrac{ P(D | \H_2)  (\smalldfrac{1}{3}) }{P(D) }  & 
	P(\H_3  |  D) & \smallfrac{ P(D | \H_3)  (\smalldfrac{1}{3}) }{P(D) } 
\\
 &	 \dfrac{ 1}{2} &
 &	 \dfrac{ 1}{2} &
 &	 0 .
\end{array} 
\eeq
 The two possible hypotheses are now equally likely.
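 To spell out the normalization, write $q$ for the common value
 of $P(D \given \H_1)$ and $P(D \given \H_2)$ (the symbol $q$ is introduced here
 only for illustration). Then
\beq
	P(D) = \left( q + q + 0 \right) \dfrac{1}{3} = \dfrac{2q}{3} ,
	\:\:\:\:
	P(\H_1 \given D) = P(\H_2 \given D) = \frac{ q/3 }{ 2q/3 } = \dfrac{1}{2} ,
\eeq
 whatever the (unknown, nonzero) value of $q$.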

 If we assume that 
 the host knows where the prize is and might be acting 
 deceptively, then the answer might be further modified, because we 
 have to view the host's words as part of the data.

 Confused? It's well worth  making sure you
 understand these two gameshow  problems.
 Don't worry: I slipped up on the second problem
 the first time I met it.

 There is a general rule which  helps immensely
 when you have a confusing probability problem:\index{key points!how to solve probability problems}
\begin{conclusionbox}
 Always write down the probability of everything.\\ {
 \hfill {\em (Steve Gull)} \par
}
\end{conclusionbox}
 From this joint probability, any desired inference can
 be mechanically obtained (\figref{fig.everything}).
\amarginfig{b}{
\begin{center}
\newcommand{\tabwidth}{30}
\newcommand{\tabheight}{80}
\setlength{\unitlength}{1mm}{
\begin{picture}(43,92)(-13,0)
\put(15,90){\makebox(0,0){\small\sf{Where the prize is}}}
\put( 5,85){\makebox(0,0){\small{door}}}
\put(15,85){\makebox(0,0){\small{door}}}
\put(25,85){\makebox(0,0){\small{door}}}
\put( 5,82){\makebox(0,0){\small{1}}}
\put(15,82){\makebox(0,0){\small{2}}}
\put(25,82){\makebox(0,0){\small{3}}}
\put(-1, 5){\makebox(0,0)[r]{\footnotesize{1,2,3}}}
\put(-1,15){\makebox(0,0)[r]{\footnotesize{2,3}}}
\put(-1,25){\makebox(0,0)[r]{\footnotesize{1,3}}}
\put(-1,35){\makebox(0,0)[r]{\footnotesize{1,2}}}
\put(-1,45){\makebox(0,0)[r]{\footnotesize{3}}}
\put( 5,75){\makebox(0,0){\footnotesize{$\displaystyle\frac{p_{\rm none}}{3}$}}}
\put(15,75){\makebox(0,0){\footnotesize{$\displaystyle\frac{p_{\rm none}}{3}$}}}
\put(25,75){\makebox(0,0){\footnotesize{$\displaystyle\frac{p_{\rm none}}{3}$}}}
\put( 5,45){\makebox(0,0){\footnotesize{$\displaystyle\frac{p_{3}}{3}$}}}
\put(15,45){\makebox(0,0){\footnotesize{$\displaystyle\frac{p_{3}}{3}$}}}
\put(25,45){\makebox(0,0){\footnotesize{$\displaystyle\frac{p_{3}}{3}$}}}
\put( 5, 5){\makebox(0,0){\footnotesize{$\displaystyle\frac{p_{1,2,3}}{3}$}}}
\put(15, 5){\makebox(0,0){\footnotesize{$\displaystyle\frac{p_{1,2,3}}{3}$}}}
\put(25, 5){\makebox(0,0){\footnotesize{$\displaystyle\frac{p_{1,2,3}}{3}$}}}
\put(-1,55){\makebox(0,0)[r]{\footnotesize{2}}}
\put(-1,65){\makebox(0,0)[r]{\footnotesize{1}}}
\put(-1,75){\makebox(0,0)[r]{\footnotesize{none}}}
\put(-12,40){\makebox(0,0){\rotatebox{90}{\small\sf{Which doors opened by earthquake}}}}
\multiput(0,0)(0,10){9}{\line(1,0){\tabwidth}}
\multiput(0,0)(10,0){4}{\line(0,1){\tabheight}}
\end{picture}}
\end{center}
\caption[a]{The probability of everything, for the second three-door problem,
 assuming an earthquake has just occurred.
 Here, $p_3$ is the probability that door 3 alone is opened by an earthquake.}
\label{fig.everything}
}
}
\fakesection{simpsons}
\soln{ex.simpsons}{
 The statistic quoted by the lawyer indicates the 
% {prior\/} 
 probability
% \index{Simpson, O.J., similar case to}%
%\index{Simpson, O.J., allusion to}
\index{lawyer}\index{wife-beater}\index{murder} 
 that a randomly selected wife-beater will also murder his wife. 
 The probability that the husband was the murderer, {\em given
 that the wife has been murdered}, is a completely different quantity. 

 To deduce the latter, we need to make  further assumptions about 
 the probability that  the wife is murdered by someone else. 
 If she lives in a neighbourhood with frequent random murders, then 
 this probability is large and the posterior probability that 
 the husband did it (in the absence of other evidence) may not 
 be very large. But in more peaceful regions, it may well be
 that the most likely person to have murdered you, if you are found
 murdered, is 
 one of your closest relatives.

%{\em  Numbers here.}
 Let's work out some illustrative numbers with the help
 of the statistics on page \pageref{footnote.murder}.
 Let $m\eq 1$ denote the proposition that a woman has been murdered;
 $h\eq 1$, the proposition that the husband did it; and $b\eq 1$,
 the proposition that he beat her in the year preceding the
 murder. The statement `someone else did it'
 is denoted by  $h\eq 0$.
 We need to define $P(h \given m\eq 1)$, $P(b \given h\eq 1,m\eq 1)$, and $P(b\eq 1 \given h\eq 0,m\eq 1)$
 in order to compute the posterior probability $P(h\eq 1 \given b\eq 1,m\eq 1)$.
 From the statistics, we can read out  $P(h\eq 1 \given m\eq 1)=0.28$.
 And if two million women out of 100 million are beaten,
 then $P(b\eq 1 \given h\eq 0,m\eq 1)=0.02$. Finally, we need a
 value for  $P(b \given h\eq 1,m\eq 1)$: if a man murders his wife, how likely is
 it that this is the first time he laid a finger on her? I
 expect it's pretty unlikely; so maybe  $P(b\eq 1 \given h\eq 1,m\eq 1)$ is 0.9
 or larger.

 By \Bayes\  theorem, then,
\beq
	P(h\eq 1 \given b\eq 1,m\eq 1)
 = \frac{ .9 \times .28 }{  .9 \times .28 + .02 \times .72 }
	\simeq 95\% .
\eeq
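 Here the numbers come from writing \Bayes\  theorem with $m\eq 1$ as
 background information throughout:
\beq
	P(h\eq 1 \given b\eq 1,m\eq 1) =
	\frac{ P(b\eq 1 \given h\eq 1,m\eq 1) \, P(h\eq 1 \given m\eq 1) }
	{ \sum_{h' \in \{0,1\}} P(b\eq 1 \given h\eq h',m\eq 1) \, P(h\eq h' \given m\eq 1) } ,
\eeq
 with $P(h\eq 0 \given m\eq 1) = 1 - 0.28 = 0.72$; numerically,
 $0.252/(0.252+0.0144) = 0.252/0.2664 \simeq 0.946$.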
 One way to make obvious the  sliminess of the lawyer on \pref{ex.simpsons}
 is to construct arguments, with the same logical structure
 as his, that
 are clearly wrong. For example, the lawyer could say `Not only
 was Mrs.\ S murdered, she was murdered between 4.02pm and
 4.03pm. {\em Statistically}, only one in a {\em million\/} wife-beaters 
 actually goes on to murder his wife between 4.02pm and
 4.03pm. So the wife-beating
% , which  is not denied by Mr.\ S,
 is not strong evidence at all. In fact, 
 given the wife-beating evidence alone, it's extremely unlikely 
 that he would murder his wife in this way -- only a 
 1/1,000,000 chance.'
}

% arrived here Sun 6/4/03
\soln{ex.phonetest}{% was phonecheck
	There are two hypotheses.
 $\H_0$: your number is {\tt 740511}; $\H_1$: it is another number.
 The data, $D$, are `when I dialled {\tt 740511}, I got a busy signal'.
 What is the probability of $D$, given each hypothesis?
 If your number is {\tt 740511}, then we expect a busy signal with certainty:
\[
	P(D \given \H_0) = 1  .
\]
 On the other hand, if $\H_1$ is true, then the probability that the number dialled
 returns a busy signal is smaller than 1, since various other outcomes
 were also possible (a ringing tone, or a number-unobtainable signal,
 for example).  The value of this probability $P(D \given \H_1)$
 will depend on  the probability $\alpha$ that a random phone number
 similar to your own phone number would be a valid phone number,
 and on the probability $\beta$ that you get a busy signal when you dial
 a valid phone number.

% 37 per col, 4 cols per page, 250 pages.
% 20 per col, 3 cols per page, 270 pages.
% 50,000. maybe another 50% ex-directory?
 I estimate from the size of
 my phone book that Cambridge has about $75\,000$ valid phone numbers, all of length six
 digits. The probability that a random six-digit number is valid is
 therefore about $75\,000/10^6 = 0.075$. If we exclude numbers beginning with 0, 1, and 9
 from the random choice, the probability $\a$
 is about $75\,000/700\,000 \simeq 0.1$.
 If we assume that
 telephone numbers are clustered then  a misremembered number
 might be more likely to be valid than a randomly chosen number; so
 the probability, $\alpha$,
 that our guessed number would be valid, assuming $\H_1$ is true,
 might be bigger than 0.1. Anyway, $\alpha$  must be somewhere between 0.1 and 1.
 We can carry forward this uncertainty in the probability
 and see how much it matters at the end.

 The  probability $\beta$ that you get a busy signal when you dial
 a valid phone number is equal to the fraction of phones you think are in use
 or off-the-hook
 when you make your tentative call.
 This fraction varies from town to town and with the time of day.
 In Cambridge, during the day, I would guess that about 1\% of phones
 are in use. At 4am,
% four in the morning,
 maybe 0.1\%, or fewer.

 The probability $P(D \given \H_1)$ is the product of $\alpha$ and $\beta$,
 that is, about $0.1 \times 0.01 = 10^{-3}$. According to
 our estimates, there's about a one-in-a-thousand
 chance of getting a busy signal when you dial a random number;
 or one-in-a-hundred, if valid numbers are strongly clustered;
 or one-in-$10^4$, if you dial in the wee hours.

 How  do the data affect your beliefs about your phone number?
 The posterior probability ratio is the likelihood ratio
 times the prior probability ratio:
\beq
	\frac{ P(\H_0 \given D) }{ P(\H_1 \given D) }
=	\frac{ P(D \given \H_0) }{ P(D \given \H_1) }
	\frac{ P(\H_0) }{ P(\H_1) } .
\eeq
 The likelihood ratio is about 100-to-1 or 1000-to-1, so the posterior
 probability ratio is swung by a factor of 100 or 1000 in favour of $\H_0$.
 If the prior probability of $\H_0$ was 0.5 then the posterior
 probability is
\beq
	 P(\H_0 \given D)  = \frac{1}{1 + \smallfrac{ P(\H_1 \given D) }{ P(\H_0 \given D) } }
		\simeq  0.99 \: \mbox{or} \: 0.999 . 
\eeq
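 To make the last step explicit, using the estimates above
 ($P(D \given \H_0) = 1$ and $P(D \given \H_1) = \alpha\beta$) and
 equal prior probabilities,
\beq
	\frac{ P(\H_1 \given D) }{ P(\H_0 \given D) } = \alpha \beta ,
	\:\:\:\: \mbox{so} \:\:\:\:
	P(\H_0 \given D) = \frac{1}{1 + \alpha\beta} ,
\eeq
 which is $1/1.01 \simeq 0.990$ for $\alpha\beta = 10^{-2}$ and
 $1/1.001 \simeq 0.999$ for $\alpha\beta = 10^{-3}$.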
}

\soln{ex.eurotoss}{
% see also
% http://www.dartmouth.edu/~chance/chance_news/recent_news/chance_news_11.02.html
% for lots of practical info on coin biases.
%%%%%%%%%%%%%%%%%%%%%%%%%%% included by _s8.tex
% First, could confirm his sampling theory
%Sampling theory:  number of heads $\sim 125 \pm 8$
%$ \sqrt{62.5}$
%so two-tail probability is
% pr 2*(1-myerf(14.5/7.9))    ans = 0.066440
% if the data were 141 out of 250 then we get 
%  2*(1-myerf(15.5/7.9))    ans = 0.049760
 \index{euro}We compare the models $\H_0$ -- the coin is fair --
 and $\H_1$ -- the \ind{coin} is biased, with
 the prior on its bias set to the uniform
 distribution $P(p|\H_1)=1$.  
% ent, as defined  in this chapter.
\amarginfig{t}{
\begin{center}
\mbox{\psfig{figure=gnu/euro.ps,width=1.62in,angle=-90}}
\end{center}
\caption[a]{The probability distribution of the
 number of heads given the two hypotheses, that
 the coin is fair, and that it is biased, with
 the prior distribution of the bias being uniform.
 The outcome ($D = 140$ heads) gives weak evidence
 in favour of $\H_0$, the  hypothesis that the coin is fair.}
\label{fig.euro}
}
 [The use of a uniform prior seems reasonable to me, since I know
 that some coins, such as American pennies,
 have severe biases when spun on edge; so the situations $p=0.01$ or $p=0.1$
 or $p=0.95$ would not surprise me.]
\begin{aside}
 When I mention $\H_0$ -- the coin is fair -- a pedant would say, `how
 absurd to even consider that the coin is fair -- any coin is surely
 biased to some extent'. And of course I would agree. So will pedants
 kindly understand $\H_0$ as meaning `the coin is fair to within
 one part in a thousand, \ie, $p \in 0.5\pm 0.001$'.
\end{aside}
 The likelihood ratio is:
% given  in \eqref{eq.compare.final}.
\beq
% Bayesian approach: Model comparison:
\frac{ P( D|\H_1  )}
      {P( D|\H_0  )}
= \frac{ \smallfrac{ 140! 110! }{ 251! } }{  1/2^{250} } = 0.48 .
\eeq
 Thus the data give scarcely any evidence
 either way; in fact they
 give weak evidence (two to one) in favour of $\H_0$!
% load 'gnu/euro.gnu'
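 The numerator of this ratio comes from a beta integral. As a sketch, treating $D$
 as the particular sequence of 140 heads and 110 tails (if $D$ is instead
 taken to be the count, the same binomial coefficient multiplies both
 likelihoods and cancels in the ratio),
\beq
	P( D|\H_1 ) = \int_0^1 \! {\rm d}p \: p^{140} (1-p)^{110}
	= \frac{ \Gamma(141) \, \Gamma(111) }{ \Gamma(252) }
	= \frac{ 140! \, 110! }{ 251! } ,
	\:\:\:\:
	P( D|\H_0 ) = \left( \frac{1}{2} \right)^{250} .
\eeq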

 `No, no', objects the believer in bias, `your silly uniform
 prior doesn't represent {\em my\/} prior beliefs about
 the bias of biased coins -- I was {\em expecting\/}  only  a small bias'. 
 To be as generous as possible to $\H_1$,
 let's see how well it could fare
 if the prior were presciently set.
 Let us allow a prior of the form
\beq
	P(p|\H_1,\a) = \frac{1}{Z(\a)} p^{\a-1}(1-p)^{\a-1},
	\:\:\:\: \mbox{where $Z(\a)=\Gamma(\alpha)^2/\Gamma(2 \alpha)$}
\eeq
 (a Beta
% Dirichlet (or Beta)
 distribution, with the original uniform prior reproduced
 by setting  $\a=1$). By tweaking $\alpha$, 
 the likelihood ratio for $\H_1$ over $\H_0$,
\beq
 \frac{ P( D|\H_1,\a  )}
      {P( D|\H_0 )} =
 \frac{\Gamma(140 \!+\! \alpha) \, \Gamma(110 \!+\! \alpha) \, \Gamma(2 \alpha) 2^{250}}
              {  \Gamma(250 \!+\! 2 \alpha) \, \Gamma(\alpha)^2 },
\eeq
 can
 be increased a little. It
 is  shown for several values of $\a$ in  \figref{fig.eurot}.%
%
% fig.eurot WAS here but has been moved away to avoid a crunch
% This figure belongs earlier.
\amarginfig{t}{
{\footnotesize
\begin{tabular}{r@{}l@{$\:\:\:$}r@{\hspace*{0.3in}}r@{}l}
\toprule
\multicolumn{2}{c}{$\alpha$}&
\multicolumn{3}{c}{$\displaystyle \frac{ P( D|\H_1,\a  )}
                                        {P( D|\H_0     )}$}\\
\midrule
 &.37 & & &.25\\
1&.0  & & &.48\\
2&.7  & & &.82\\
7&.4  & &1&.3\\
20&   & &1&.8\\
55&   & &1&.9\\
148&  & &1&.7\\
403&  & &1&.3\\
1096& & &1&.1\\
% from euro.dat
\bottomrule
\end{tabular}
}
\caption[a]{Likelihood ratio for various choices of
 the prior distribution's hyperparameter $\alpha$.
}
\label{fig.eurot}
}
% 
 Even the most favourable choice of $\alpha$ ($\a \simeq 50$)
 can 
 yield a likelihood ratio of only two to one in favour of
 $\H_1$.
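 The expression for the likelihood ratio as a function of $\a$ follows
 from a beta integral; as a sketch, treating $D$ as a particular sequence
 of 140 heads and 110 tails,
\beq
	P( D|\H_1,\a ) = \frac{1}{Z(\a)} \int_0^1 \! {\rm d}p \: p^{140+\a-1} (1-p)^{110+\a-1}
	= \frac{ \Gamma(2\a) }{ \Gamma(\a)^2 } \,
	  \frac{ \Gamma(140+\a) \, \Gamma(110+\a) }{ \Gamma(250+2\a) } ;
\eeq
 dividing by $P(D|\H_0) = 2^{-250}$ gives the ratio tabulated in
 \figref{fig.eurot}, and setting $\a \eq 1$ recovers $140! \, 110!/251!$.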

 In conclusion, the data are not `very suspicious'. They
 can be construed as giving at most two-to-one evidence
 in favour of one or other of the two hypotheses.

\begin{aside}
 Are these wimpy likelihood ratios the fault
 of over-restrictive
 priors? Is there any way of producing
 a `very suspicious' conclusion?
 The prior that is best-matched to the data,
 in terms of likelihood, 
%  and one that surely has to be viewed as unreasonable,
 is the prior that sets $p$ to $f \equiv 140/250$ with probability
 one. Let's call this model $\H_*$.
% , since it is a parameterless model like $\H_0$.
 The likelihood ratio  is $P(D|\H_*)/P(D|\H_0) = 2^{250} f^{140} (1-f)^{110}
 =6.1$.  So the strongest evidence that these data can possibly
 muster against the hypothesis that there is no bias is six-to-one.
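 As a check on the arithmetic, in natural logarithms,
\beq
	250 \ln 2 + 140 \ln 0.56 + 110 \ln 0.44
	\simeq 173.287 - 81.175 - 90.308 = 1.804 ,
\eeq
 and $e^{1.804} \simeq 6.07$, which rounds to the value 6.1 quoted.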
\end{aside}
% b.blight@lse.ac.uk
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% alternate answers for the case of 141 heads where
% the P value is 0.05 (0.04976)
%
%The outcomes of the computations for this case (141 from 250)
% are
% alpha , likelihood ratio
%
%.3678794412, .3166098681
%1., .6110726692
%2.718281828, 1.049115229
%7.389056099, 1.627382387
%20.08553692, 2.181864309
%54.59815003, 2.303276774
%148.4131591, 1.882663014
%403.4287935, 1.419011740
%1096.633158, 1.168433218
%2980.957987, 1.063851106
%8103.083928, 1.023737702
%22026.46579, 1.008765749
%
% and H_BF achieves 7.796


 While we are noticing the absurdly misleading\index{sermon!sampling theory}\index{p-value}
 answers that `sampling theory' statistics produces,
 such as the \index{p-value}$p$-value of 7\% in the  exercise we just solved,
 let's stick the boot in.\label{sec.sampling5percent}
 If we make a tiny change to the data set, increasing the
 number of heads in 250 tosses from 140 to 141,
 we find that the $p$-value goes below the mystical value of 0.05
 (the $p$-value is 0.0497).
 The sampling theory statistician would happily squeak `the probability
 of getting a result as extreme as 141 heads is smaller than 0.05 --
 we thus reject the null hypothesis at a significance level of 5\%'.
 The correct answer
 is  shown for several values of $\a$ in  \figref{fig.eurot141}.
% Radford: Using R, I get that the true p-value (with genuine binomial
%probabilities) for 141 out of 250 is 0.04970679, close to your value.
 The values worth highlighting from this table are, first,
 the likelihood ratio when $\H_1$ uses the standard uniform prior,
 which is 1:0.61 in favour of the {\em null hypothesis\/} $\H_0$.
 Second, the  most favourable choice of $\a$, from the
 point of view of $\H_1$, can only 
 yield a likelihood ratio of about 2.3:1 in favour of
 $\H_1$.\label{sec.pvalue05}
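 For comparison, the $p$-values quoted above can be approximately
 reproduced as follows (a sketch using the normal approximation with a
 continuity correction; exact binomial tail sums differ slightly).
 Under $\H_0$ the number of heads in 250 tosses has mean $125$ and
 standard deviation $\sqrt{250/4} \simeq 7.9$, so
\beq
	z_{140} = \frac{ 140 - 0.5 - 125 }{ 7.9 } \simeq 1.84 ,
	\:\:\:\:
	z_{141} = \frac{ 141 - 0.5 - 125 }{ 7.9 } \simeq 1.96 ,
\eeq
 and the two-tailed probabilities $2 (1 - \Phi(z))$, where $\Phi$
 is the standard normal cumulative distribution function, are about
 $0.07$ and $0.05$ respectively.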

\amarginfig{c}{
{\footnotesize
\begin{tabular}{r@{}l@{$\:\:\:$}r@{\hspace*{0.3in}}r@{}l}
\toprule
\multicolumn{2}{c}{$\alpha$}&
\multicolumn{3}{c}{$\displaystyle \frac{ P( D'|\H_1,\a  )}
                                        {P( D'|\H_0     )}$  }\\
\midrule
 &.37 & & &.32\\
1&.0  & & &.61\\
2&.7  & &1&.0\\
7&.4  & &1&.6\\
20&   & &2&.2\\
55&   & &2&.3\\
148&  & &1&.9\\
403&  & &1&.4\\
1096& & &1&.2\\
% from euro.dat
\bottomrule
\end{tabular}
}
\caption[a]{Likelihood ratio for various choices of
 the prior distribution's \ind{hyperparameter} $\alpha$, when the data are
 $D'=141$ heads in 250 trials.
}
\label{fig.eurot141}
}
%
 Be warned! A $p$-value of 0.05 is often interpreted
% gives the impression to many
 as implying 
 that the odds are stacked about twenty-to-one
 {\em against\/} the null hypothesis. But the truth in this case
 is that the evidence
 either slightly  {\em favours\/} the  null  hypothesis,
 or disfavours it by at most 2.3 to one, depending on
 the choice of prior.

% $p$-values
 The $p$-values and `\ind{significance level}s' of
 \ind{classical statistics}\index{sermon!classical statistics}
 should be treated with {\em extreme caution}.\index{caution!sampling theory}
% This is the  last we will see of them in this book.
 Shun them!
 Here ends the sermon.\index{sermon!sampling theory}
% Classical statistics  and  Microsoft Windows 95 --
% two of the greatest evils to come out of the twentieth century.

}




\dvipsb{solutions bayes}
% \input{tex/_l1b.tex}
%
% message passing was here 
%
\renewcommand{\partfigure}{\poincare{8.0}}
\part{Data Compression} 
\prechapter{About    Chapter}
\fakesection{prerequisites for chapter 2}
%
 In this chapter we 
 discuss how to measure the information content of the outcome
 of a random experiment. 

 This chapter has some tough bits.
 If you find the mathematical details  hard,
% to follow,
 skim through them and keep going -- you'll be able to enjoy Chapters
 \ref{ch3} and \ref{ch4} without this chapter's tools.

% of typicality.
\amarginfignocaption{t}{%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Cast of characters}
\footnotesize
\begin{tabular}{@{}lp{1.14in}}
\multicolumn{2}{c}{
{\sf Notation}
}\\
\midrule
$x \in \A$  & $x$ is a {\dem{member}\/} of the \ind{set} $\A$ \\
$\S \subset \A$  & $\S$ is a {\dem\ind{subset}\/} of the set $\A$ \\
$\S \subseteq \A$  & $\S$ is a {\ind{subset}} of, or equal to, the set $\A$ \\
% \union
$\V = \B \cup \A$
       & $\V$ is the {\dem\ind{union}\/} of the sets $\B$ and $\A$ \\
$\V = \B \cap \A$
       & $\V$ is the {\dem\ind{intersection}\/} of the sets $\B$ and $\A$ \\
$|\A|$ & number of elements in set $\A$\\

\bottomrule
\end{tabular} \medskip
% end marginstuff
}%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

 Before reading \chref{ch2}, you should have
read
% section \ref{ch1.secprob}
 \chref{ch1.secprob}
 and
 worked on
% \exerciseref{ex.expectn}.
% It  will also help if you have worked on
%
% do I need to ensure that {ex.Hadditive} occurs earlier? 
%
  \exerciseonlyrange{ex.expectn}{ex.Hineq} and \ref{ex.sumdice}
%  \exerciseonlyrangeshort{ex.sumdice}{ex.RNGaussian}
 \pagerange{ex.sumdice}{ex.invP},
% {ex.RNGaussian}.
% exercises \exnine-\exfourteen\ and \extwentyfive-\extwentyseven.
 and \exerciseonlyref{ex.weigh} below.

 The following
 exercise is intended to
 help you think about how to measure information content. 
% Please work on this exercise now.
% weighing
% ITPRNN Problem 1
%
% weighing problem
%
\fakesection{the weighing problem}
\exercissxA{2}{ex.weigh}{
  -- {\em Please work on this problem before reading   \chref{ch.two}.}

 \index{weighing problem}You are given 12 balls, all equal in weight except for
 one that is either heavier or lighter. You are also given a two-pan
 \ind{balance} to use.
% , which you are to use as few times as possible. 
 In each use of the balance you may put {any\/} number of the 12
 balls on the left pan, and the same number on the right pan, and push
 a button to initiate the weighing; there are three possible outcomes:
 either the weights are equal, or the balls on the left are heavier,
 or the balls on the left are lighter.  Your task is to design a
 strategy to determine which is the odd ball {\em and\/} whether it is
 heavier or lighter than the others {\em in as few uses of the balance
 as possible}.

% There will be a prize for the best answer.

 While thinking about this problem, 
 you
% should
 may find it helpful to 
 consider the following questions:
\ben
\item How can one measure {\dem\ind{information}}?
\item When you have identified the odd ball and whether it is heavy or 
	light, how much information have you gained?
\item Once  you have designed a strategy, draw a tree showing,
 for each of the possible   outcomes
 of a weighing, what weighing you perform next.
 At each node in the tree, how much information have the outcomes
 so far given you, and how much information remains to be
 gained?
% What is the probability of each of the possible outcomes of the first
% weighing?
%\item
% What is the most information you can get from a single weighing?
%	How much information do you get from a single weighing
% if the three outcomes are equally probable? 
%\item What is the smallest number of weighings that might conceivably 
%be sufficient  always to identify the odd ball and whether it is heavy
%or light?
\item How much information is gained when you learn (i) the state of a
 flipped coin; (ii) the states of two flipped coins; 
 (iii)  the outcome when a four-sided die is rolled?
\item
 How much information is gained on the first step of the weighing 
 problem if 6 balls are weighed against the other 6?  How much is gained
 if 4 are weighed against 4 on the first step, leaving out 4 balls?
% the other 4 aside?
\een
}
% 
% How many possible outcomes of an e weighing process are there? To put it another way, imagine that you report the outcome by sending a postcard  which says, for example, "ball number 5 is heavy", how many  prepare a postcard 
% 
% how many outcomes are there?
% How many possible states of the world are y
% if you tell someone ball number x is heavy, how much info have you given
% them? how much information can be conveyed by $k$ uses of the balance? 
% 
% 
% make clear that you can put any objects on the scales,
% don't have to weigh 6 vs 6.
% no cheating by gradually adding weights
% 
% katriona's problem: 4 bits, randomly rotated every time you ask them
% to be flipped.
% 
% hhhh llll gggg
% hhll  lhgg    lh
% if left is h then
% hh or l
% so do h vs h
% 
% else  gggg gggg ????
% -> ?? ?g
% -> hh l       or ggg -> wegh last dude (1 bit)
% do h vs h
% 
% if 13 and good avail, -  hhhhh llll* gggg
% hhll lhgg     hhl
% 






\mysetcounter{page}{76}
\ENDprechapter
\chapter{The Source Coding Theorem}
\label{ch.two}\label{ch2}\label{chtwo}
% _l2.tex 
% \part{Data Compression} 
% \chapter{The Source Coding Theorem}
%
% I introduce the idea of a  "name" (or label?) here, and should clarify
% (example 2.1)
%
% E = 13%, Q,Z = 0.1%
% TH = 3.7%
% 
%  New plan for this chapter: 
% \section{Key concept}
%  Rather than $H(\bp)$ being the measure of information content of
%  an ensemble, 
%  I want the central idea of this chapter to be that 
%  $\log 1/P(\bx)$ is the information content of a particular
%  outcome $\bx$. $H$ is then of interest because it is the average 
%  information content. 
% 
%  An example to illustrate this is `hunt the professor'. Or crack 
%  the combination. Guess the PIN. 
%  An absent-minded professor wishes to remember an 
%  integer between 1 and 256, that is, eight bits of information.
%  He takes 256 large numbered cardboard boxes, and climbs
%  in the  box whose number is the integer to be remembered.
%  The only way to find him 
%  is to open the lid of a box. A single experiment involves 
%  opening a particular box. The outcome is either $x={\tt n}$ -- no 
%  professor -- or $x={\tt y}$ -- the professor is in there. 
%  The probabilities are 
% \beq
% 	P(x\eq {\tt n}) = 255/256; P(x\eq {\tt y}) = 1/256.
% \eeq
%  We open box $n$.
%  If the professor is revealed, we have learned the integer, 
%  and thus recovered 8 bits of information. If he is not revealed, 
%  we have learned very little -- simply that the 
%  integer is not $n$. The information contents are:
% \beq
% 	h(x\eq 0) = \log_2( 256/255) = 0.0056 ; h(x\eq 1) = \log_2 256 = 8 .
% \eeq
%  The average information content is 
% \beq
% 	H(X) = 0.037 \bits .
% \eeq
%  This example shows that in the event of an improbable outcome's occuring, 
%  a large amount of information really is conveyed. 
% 
% \section{Weighing problem}
%  The weighing problem remains useful, let's keep it. 
% 
% \section{Source coding theorem}
%  Relate `information content' $\log 1/P$ to message length 
%  in two steps. First, establish the AEP, that 
%  the outcome from an ensemble $X^N$
%  is very likely to lie in a typical set having `information 
%  content' close to NH. 
% 
%  Second, show that we can count the number of elements in the 
%  typical set, give them all names, and the number of 
%  names will be about $2^{NH}$. 
% 
%  At what point should $H_{\delta}$ be introduced? 

\section{How to measure the information content of a random variable?}
 In the next few chapters, we'll be talking about probability
 distributions and random variables. Most of the time
 we can get by with sloppy notation, but occasionally, we will need
 precise notation. Here is the
%definition and
 notation that we established in  \chapterref{ch.prob.ent}.\indexs{ensemble}
%
\sloppy
\begin{description} 
\item[An ensemble] $X$ is a triple $(x,\A_X, \P_X)$,
  where the {\dem outcome\/} $x$ is the value of a random variable,
% whose value $x$ can take on a
 which   takes on one of a 
	set of possible values,
% the alphabet
% {\em outcomes}, 
	$\A_X =  \{a_1,a_2,\ldots,a_i,\ldots, a_I\}$,
%	\ie, possible values for a random variable $x$
%	and a probability distribution over them, 
	having probabilities
	$\P_X = \{p_1,p_2,\ldots, p_I\}$, with $P(x\eq a_i) = p_i$, 
	$p_i  \geq 0$ and $\sum_{a_i \in \A_X} P(x \eq a_i) = 1$.
\end{description}

%\begin{description}
%\item[An ensemble] $X$ is a random variable $x$ taking on a value
% from a 	set of possible {\em outcomes}, 
%	$$\A_X \eq  \{a_1,\ldots,a_I\},$$ 
%	having probabilities
%	$$\P_X = \{p_1,\ldots, p_I\},$$ with $P(x\eq a_i) = p_i$, 
%	$p_i  \geq 0$ and $\sum_{x \in \A_X} P(x) = 1$.
%\end{description}
% An ensemble is a set of possible values for a random variable
%	and a probability distribution over them.
{How can we measure the information content of an outcome
 $x = a_i$ from such an ensemble?}
 In this chapter we examine the assertions 
\ben
\item
 that  the
% It is claimed that the 
 {\dem{\ind{Shannon information content}}},\index{information content!Shannon}\index{information content!how to measure}
\beq
	h(x\eq a_i) \equiv \log_2 \frac{1}{p_i},
\eeq
 is a sensible measure of the information content of the outcome 
 $x = a_i$, and
\item
 that 
 the {\dem{\ind{entropy}}} of the ensemble,
\beq 
	H(X) = \sum_i p_i \log_2 \frac{1}{p_i},
\eeq
 is a sensible measure of the ensemble's average information content.
\een
\begin{figure}[htbp]
\figuremargin{%1
{\small%
\begin{center}
\mbox{
\mbox{
\hspace{-9mm}
\mbox{\psfig{figure=figs/h.ps,%
width=42mm,angle=-90}}$p$
\hspace{-35mm}
\makebox[0in][l]{\raisebox{\hpheight}{$h(p)=  \log_2 \displaystyle \frac{1}{p}$ }}
\hspace{35mm}
}
\hspace{0.9mm}
\begin{tabular}[b]{ccc}\toprule
$p$ & $h(p)$ & $H_2(p)$ \\ \midrule
0.001             & 10.0            & 0.011 \\ %  9.96578 & 0.0114078
0.01\phantom{0}   & \phantom{1}6.6  & 0.081 \\
0.1\phantom{01}   & \phantom{1}3.3  & 0.47\phantom{1} \\
0.2\phantom{01}   & \phantom{1}2.3  & 0.72\phantom{1} \\
0.5\phantom{01}   & \phantom{1}1.0  & 1.0\phantom{01} \\ \bottomrule
\end{tabular}
\mbox{
% to put H at left: \hspace{1.2mm}
\hspace{6.2mm}
\raisebox{\hpheight}{$H_2(p)$}
% to put H at left: \hspace{-7.5mm}
\hspace{-20mm}
\mbox{\psfig{figure=figs/H2.ps,%
width=42mm,angle=-90}}$p$
}
% see also H2x.tex

\end{center}
}% end small
}{%
\caption[a]{The \ind{Shannon information content} $h(p) =  \log_2 \frac{1}{p}$ and 
 the binary entropy function $H_2(p)=H(p,1\!-\!p)=p \log_2 \frac{1}{p}
  + (1-p)\log_2 \frac{1}{(1-p)}$ as a function of $p$.}
\label{fig.h2}
}%
\end{figure}
% gnuplot 
% load 'figs/l2.gnu'

\noindent
 \Figref{fig.h2} shows the Shannon information content 
 of an outcome with probability $p$, as a function of $p$.
 The less probable an outcome is, the greater its
 Shannon information content. 
 \Figref{fig.h2} also shows 
% $h(p) =  \log_2 \frac{1}{p}$,
 the binary entropy function, 
\beq
 H_2(p)=H(p,1\!-\!p)=p \log_2 \frac{1}{p}
  + (1-p)\log_2 \frac{1}{(1-p)} ,
\eeq
 which is the entropy   of the ensemble $X$ whose alphabet and probability
 distribution are 
 $\A_X = \{ a , b \}, \P_X = \{ p , (1-p) \}$.
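
 As a check on one entry of the table in \figref{fig.h2}, take $p = 0.1$:
\beq
 H_2(0.1) = 0.1 \log_2 \frac{1}{0.1} + 0.9 \log_2 \frac{1}{0.9}
	\simeq 0.1 \times 3.32 + 0.9 \times 0.152
	\simeq 0.47 \ubits .
\eeq
 The improbable outcome conveys 3.32 bits when it occurs, but
 it contributes only about 0.33 bits to the average because it
 occurs just one time in ten.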
%

\subsection{Information content of independent random variables}
 Why should $\log 1/p_i$ have anything to do with the
 information content? Why not some other function of $p_i$?
 We'll explore this question in  detail shortly,
 but first, notice a nice property of this particular function
 $h(x)=\log 1/P(x)$.

 Imagine learning the value of two {\em independent\/} random
 variables, $x$ and $y$.
 The definition of independence is that the probability
 distribution is separable into a {\em product}:
\beq
	P(x,y) = P(x) P(y) .
\eeq
 Intuitively, we might want any measure of
 the `amount of information gained' to have the property of
 {\em additivity} --
 that is,
 for independent random variables $x$ and $y$, 
 the information gained when we learn $x$ and $y$ should 
 equal  the sum of  the information gained if $x$ alone were learned
 and  the information gained if $y$ alone were learned.

 The Shannon information content of the outcome $x,y$ is
\beq
	h(x,y) = \log \frac{1}{P(x,y)}
	= \log \frac{1}{P(x)P(y)} 
	= \log \frac{1}{P(x)} 
	+ \log \frac{1}{P(y)} 
\eeq
 so it does indeed satisfy
\beq
	h(x,y) =  h(x) + h(y), \:\:\mbox{if $x$ and $y$ are independent.}
\eeq
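 As a small illustration (the coin and die are an assumed example,
 not one of the ensembles discussed above), let $x$ be the outcome of a
 fair coin toss and $y$ the outcome of an independent roll of a fair
 six-sided die. Then $P(x) = 1/2$, $P(y) = 1/6$ and $P(x,y) = 1/12$, so
\beq
	h(x) + h(y) = \log_2 2 + \log_2 6 \simeq 1 + 2.585 = 3.585 \ubits
	 = \log_2 12 = h(x,y) .
\eeq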
\exercissxA{1}{ex.Hadditive}{
	Show that, if $x$ and $y$ are independent,
	the entropy of the joint ensemble $X,Y$
	satisfies
\beq
	H(X,Y) = H(X) + H(Y) .
\eeq
 In words, entropy is additive for independent variables.
}

 We now  explore these ideas with some examples;
 then, in section \ref{sec.aep} and in Chapters \ref{ch3}
 and \ref{ch4}, we  prove that 
 the Shannon information content and the entropy  are 
 related to the number of bits  needed to describe 
 the  outcome of an experiment.

% \section{Thinking about information content}
% \subsection{Ensembles with maximum average information content}
%  The first property of the entropy that we will 
%  consider is the property that you proved when you solved
%  \exerciseref{ex.Hineq}: the entropy of an ensemble 
%  $X$ is biggest if  all the outcomes 
%  have equal probability $p_i \eq  1/|X|$.
% 
%  If entropy  measures the average information content
%  of an ensemble, then this idea of equiprobable outcomes
%  should have relevance for the design of efficient experiments.

\subsection{The weighing problem: designing informative experiments}
 Have you solved the \ind{weighing problem}\index{puzzle!weighing 12 balls}
 \exercisebref{ex.weigh}\
 yet? Are you sure? Notice that in three uses of the balance --
 which  reads either `left heavier', `right heavier', or `balanced' --
 the number 
 of conceivable outcomes is $3^3=27$, whereas the number of possible 
 states of the world is 24: the odd ball could be any of twelve balls, 
 and it could be heavy or light. So in principle, the problem might be 
 solvable in three weighings -- but not in two, since $3^2 < 24$.  

 If you know how you 
 {can} determine the odd ball {\em and\/} whether  it is heavy or 
 light in {\em three\/} weighings, then you may read on.
 If you haven't found a strategy that always gets there in three weighings,
 I encourage you to think about  \exerciseonlyref{ex.weigh}  some more.
% {ex.weigh}

% \subsection{Information from experiments}
 Why is your strategy optimal? What is it about your series of weighings
 that allows useful information to be gained as quickly as possible?
\begin{figure}%[htbp]
\fullwidthfigureright{%
% included by l2.tex
%
% shows weighing trees, ternary
%
% decisions of what to weigh are shown in square boxes with 126 over 345 (l:r)
% state of valid hypotheses are listed in double boxes
% three arrows, up means left heavy,  straight means right heavy, down is balance
% actually s and d boxes end up having the same defn.
%
\setlength{\unitlength}{0.56mm}% page width is 160mm % was 6mm
\begin{center}
\small
\begin{picture}(260,260)(-50,-130)
%
%   initial state 
%
% all 24 hypotheses
\mydbox{-50,-100}{15,200}{$1^+$\\$2^+$\\$3^+$\\$4^+$\\$5^+$\\$6^+$\\$7^+$\\
$8^+$\\$9^+$\\$10^+$\\$11^+$\\$12^+$\\$1^-$\\$2^-$\\$3^-$\\$4^-$\\
$5^-$\\$6^-$\\$7^-$\\$8^-$\\$9^-$\\$10^-$\\$11^-$\\$12^-$}
\mysbox{-30,-8}{25,16}{$\displaystyle\frac{1\,2\,3\,4}{5\,6\,7\,8}$}
\put(-30,10){\makebox(25,8){weigh}}
%
% 1st arrows
%
\mythreevector{0,0}{1}{3}{30}
%
% first three boxes of hypotheses % boxes of actions 
% #1 is bottom left corner, so has to be offset by height of box
% #2 is dimensions of box
%
% each digit is about 10 high
%
\mydbox{40,55}{15,70}{$1^+$\\$2^+$\\$3^+$\\$4^+$\\$5^-$\\$6^-$\\$7^-$\\$8^-$}
\mysbox{65,82}{25,16}{$\displaystyle\frac{1\,2\,6}{3\,4\,5}$}
\put(65,100){\makebox(25,8){weigh}}
\mydbox{40,-35}{15,70}{$1^-$\\$2^-$\\$3^-$\\$4^-$\\$5^+$\\$6^+$\\$7^+$\\$8^+$}
\mysbox{65,-8}{25,16}{$\displaystyle\frac{1\,2\,6}{3\,4\,5}$}
\put(65,10){\makebox(25,8){weigh}}
\mydbox{40,-125}{15,70}{$9^+$\\$10^+$\\$11^+$\\$12^+$\\$9^-$\\$10^-$\\$11^-$\\$12^-$}
\mysbox{65,-98}{25,16}{$\displaystyle\frac{9\,10\,11}{1\,2\,3}$}
\put(65,-80){\makebox(25,8){weigh}}
%
%    2nd arrows 
%
\mythreevector{95,90}{1}{2}{15}
\mythreevector{95,0}{1}{2}{15}
\mythreevector{95,-90}{1}{2}{15}
% nine intermediate states. top ones
\mydbox{115,113}{35,14}{$1^+2^+5^-$}
\mysbox{155,112}{25,16}{$\displaystyle\frac{1}{2}$}
\mydbox{115,83}{35,14}{$3^+4^+6^-$}
\mysbox{155,82}{25,16}{$\displaystyle\frac{3}{4}$}
\mydbox{115,53}{35,14}{$7^-8^-$}
\mysbox{155,52}{25,16}{$\displaystyle\frac{1}{7}$}
% nine intermediate states. mid ones
\mydbox{115,23}{35,14}{$6^+3^-4^-$}
\mysbox{155,22}{25,16}{$\displaystyle\frac{3}{4}$}
\mydbox{115,-7}{35,14}{$1^-2^-5^+$}
\mysbox{155,-8}{25,16}{$\displaystyle\frac{1}{2}$}
\mydbox{115,-37}{35,14}{$7^+8^+$}
\mysbox{155,-38}{25,16}{$\displaystyle\frac{7}{1}$}
% nine intermediate states. bot ones
\mydbox{115,-67}{35,14}{$9^+10^+11^+$}
\mysbox{155,-68}{25,16}{$\displaystyle\frac{9}{10}$}
\mydbox{115,-97}{35,14}{$9^-10^-11^-$}
\mysbox{155,-98}{25,16}{$\displaystyle\frac{9}{10}$}
\mydbox{115,-127}{35,14}{$12^+12^-$}
\mysbox{155,-128}{25,16}{$\displaystyle\frac{12}{1}$}
% 3rd arrows mainline
\mythreevector{185,60}{1}{1}{10}
\mythreevector{185,0}{1}{1}{10}
\mythreevector{185,-60}{1}{1}{10}
% other branch lines
\mythreevector{185,120}{1}{1}{10}
\mythreevector{185,90}{1}{1}{10}
\mythreevector{185,30}{1}{1}{10}
\mythreevector{185,-30}{1}{1}{10}
\mythreevector{185,-90}{1}{1}{10}
\mythreevector{185,-120}{1}{1}{10}
% final answers aligned at 200,x*10
\mydbox{200,126}{10,8}{$1^+$}
\mydbox{200,116}{10,8}{$2^+$}
\mydbox{200,106}{10,8}{$5^-$}
\mydbox{200,96}{10,8}{$3^+$}
\mydbox{200,86}{10,8}{$4^+$}
\mydbox{200,76}{10,8}{$6^-$}
\mydbox{200,66}{10,8}{$7^-$}
\mydbox{200,56}{10,8}{$8^-$}
\mydbox{200,46}{10,8}{$\star$}% ---------- impossible outcome
\mydbox{200,36}{10,8}{$4^-$}
\mydbox{200,26}{10,8}{$3^-$}
\mydbox{200,16}{10,8}{$6^+$}
\mydbox{200,6}{10,8}{$2^-$}
\mydbox{200,-4}{10,8}{$1^-$}% the middle, 0
\mydbox{200,-14}{10,8}{$5^+$}
\mydbox{200,-24}{10,8}{$7^+$}
\mydbox{200,-34}{10,8}{$8^+$}
\mydbox{200,-44}{10,8}{$\star$}
\mydbox{200,-54}{10,8}{$9^+$}
\mydbox{200,-64}{10,8}{$10^+$}
\mydbox{200,-74}{10,8}{$11^+$}
\mydbox{200,-84}{10,8}{$10^-$}
\mydbox{200,-94}{10,8}{$9^-$}
\mydbox{200,-104}{10,8}{$11^-$}
\mydbox{200,-114}{10,8}{$12^+$}
\mydbox{200,-124}{10,8}{$12^-$}
\mydbox{200,-134}{10,8}{$\star$}
\end{picture}
\end{center}

}{%
\caption[a]{An optimal solution to the weighing problem. 
%
 At each step there are two boxes: the left box  shows which hypotheses are still
 possible; the right box shows the balls involved in  the next weighing.
 The 24 hypotheses are written $1^+,
% 2^+,\ldots,1^-,
 \ldots, 12^-$, 
 with, \eg, $1^+$ denoting that 1 is the odd ball and
 it is heavy.  
 Weighings are written by listing the names of the balls on the 
 two pans, separated by a line; for example, in the first weighing,
% $\displaystyle\frac{1\,2\,3\,4}{5\,6\,7\,8}$ denotes that
 balls 1,
 2, 3, and 4 are put on the left-hand side and 5, 6, 7, and 8 on the
 right.
 In each triplet of arrows the upper arrow leads to the situation when 
 the left side is heavier, the middle arrow to the situation  when the right side is heavier, 
 and the lower arrow  to the situation when the outcome is balanced.
 The three points labelled $\star$
% arrows without subsequent boxes at the right-hand side
 correspond to impossible outcomes.
%The total number of outcomes
% of the weighing process is 24, which equals $3^3 - 3$, so we would expect
% this ternary tree of depth three to have three spare branches.
}
\label{fig.weighing}\label{ex.weigh.sol}
}%
\end{figure}
 The answer is that at each step of an optimal 
 procedure, the three outcomes (`left heavier', `right heavier', and `balanced')
 are {\em as close as possible to equiprobable}. 
 An optimal solution is shown in \figref{fig.weighing}. 
 
 Suboptimal strategies, such as weighing balls 1--6 against 7--12
 on the first step, do not achieve all outcomes with equal probability:
 these two sets of balls can never  balance, so the only possible
 outcomes are `left heavier' and `right heavier'.
% Similarly, strategies
% that after an unbalanced initial result
% do not mix together balls that might be heavy with balls that 
% might be light are incapable of giving one of the three outcomes.
 Such a binary outcome  rules out only half of the possible
 hypotheses, so a  strategy that uses such outcomes must sometimes
 take longer to find the right answer.
% Some suboptimal strategies produce binary trees rather than ternary trees like 
% the one in \figref{fig.weighing}, and binary trees 
% are  necessarily deeper than balanced ternary trees
% with the same number  of leaves. 

 The insight that the outcomes should be as near as possible
 to equiprobable makes 
 it easier to search for an optimal strategy. The first weighing 
 must divide the 24 possible hypotheses into three groups of eight. Then 
 the second weighing must be chosen so that there is a 3:3:2
 split of the hypotheses. 
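
 These statements can be made quantitative, under the assumption
 that the 24 hypotheses are equally probable. Weighing four balls
 against four gives the three outcomes probabilities
 $\{ \linefrac{1}{3}, \linefrac{1}{3}, \linefrac{1}{3} \}$, whereas weighing six against six
 gives $\{ \linefrac{1}{2}, \linefrac{1}{2}, 0 \}$. The entropies of these
 two first weighings are
\beq
	H\!\left( \textstyle\frac{1}{3}, \frac{1}{3}, \frac{1}{3} \right)
	= \log_2 3 \simeq 1.585 \ubits
	\:\:\:\mbox{and}\:\:\:
	H\!\left( \textstyle\frac{1}{2}, \frac{1}{2}, 0 \right)
	= \log_2 2 = 1 \ubit .
\eeq
 Identifying one hypothesis out of 24 requires $\log_2 24 \simeq 4.585$ bits,
 and three ternary outcomes can supply at most $3 \log_2 3 \simeq 4.755$ bits,
 so there is little to spare. After the six-against-six weighing, twelve
 hypotheses remain, but two further weighings have only $3^2 = 9$
 distinguishable outcomes, so that strategy cannot always finish in
 three weighings.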

 Thus we might conclude:
\begin{conclusionbox}
{the outcome of a random experiment is guaranteed to be  most informative
 if the probability distribution over outcomes is uniform.}
\end{conclusionbox}
 This conclusion agrees with 
 the  property of the entropy that you proved when you solved
 \exerciseref{ex.Hineq}: the entropy of an ensemble 
 $X$ is biggest if  all the outcomes 
 have equal probability $p_i \eq  1/|\A_X|$.

% for anyone who wants to play it against a machine:
%  http://y.20q.net:8095/btest
%  http://www.smalltime.com/dictator.html
% http://www.guessmaster.com/
\subsection{Guessing games}
  In the game of \ind{twenty questions},\index{game!twenty questions}
 one player thinks of
  an object, and the other player attempts to guess what the object is
  by asking questions that have yes/no answers, for example,  
  `is it alive?', or `is it human?'
 The aim is to identify the object with as few questions
  as possible.
  What is the best strategy for playing this game?
  For simplicity, imagine that we are playing the rather dull 
  version of twenty questions called `sixty-three'.
% % two hundred and fifty five'.
%  In this game, the permitted objects are the $2^6$ integers 
%  $\A_X = \{ 0 , 1 , 2 , \dots 63 \}$.
%  One player selects an $x \in \A_X$, and we ask 
%  questions  that have yes/no answers in order to identify $x$. 

\exampl{example.sixtythree}{ {\sf The game `sixty-three'}.
 What's the smallest number of   yes/no  questions needed\index{game!sixty-three} 
 to identify an integer $x$ between 0 and 63 inclusive?\index{twenty questions}
}
 Intuitively,
 the best questions successively divide 
 the 64 possibilities into equal sized sets.
Six questions suffice.
 One reasonable strategy asks the following questions: 
%
% want a computer program environment here.
%
\begin{quote}
\begin{tabbing}
 {\sf 1:} is $x \geq 32$? \\
 {\sf 2:} is $x \mod 32 \geq 16$? \\
 {\sf 3:} is $x \mod 16 \geq 8$? \\
 {\sf 4:} is $x \mod 8 \geq 4$? \\
 {\sf 5:} is $x \mod 4 \geq 2$? \\
 {\sf 6:} is $x \mod 2 = 1$? 
\end{tabbing}
\end{quote}
%
% I'd like to put this in a comment column on the right beside the 'code':
%
 [The notation $x \mod 32$, pronounced `$x$ modulo 32', denotes the remainder
 when $x$ is divided by 32; for example, $35 \mod 32 = 3$
 and $32 \mod 32 = 0$.]

 The answers to these questions, if translated 
 from  $\{\mbox{yes},\mbox{no}\}$
 to $\{{\tt{1}},{\tt{0}}\}$, 
 give the binary expansion of $x$, for example 
 $35 \Rightarrow {\tt{100011}}$.\ENDsolution\smallskip 
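
 To see the translation at work for $x = 35$, follow the six
 questions in order: yes ($35 \geq 32$); no ($35 \mod 32 = 3 < 16$);
 no ($35 \mod 16 = 3 < 8$); no ($35 \mod 8 = 3 < 4$);
 yes ($35 \mod 4 = 3 \geq 2$); yes ($35 \mod 2 = 1$).
 The answers read {\tt{100011}}, and indeed
\beq
	35 = 1 \times 32 + 0 \times 16 + 0 \times 8 + 0 \times 4
		+ 1 \times 2 + 1 \times 1 .
\eeq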

 What are the 
 Shannon information contents of the outcomes in this example?  
 If we assume that all values of $x$ are equally likely, then the
 answers to the questions are independent  and each has 
% entropy $H_2(0.5) = 1 \ubit$. The
 Shannon information content
% of each answer is
 $\log_2 (1/0.5)
 =  1 \ubit$;  the total Shannon information gained 
 is always six bits. Furthermore, the number  $x$ that we learn from 
 these questions is a six-bit binary number. Our questioning 
 strategy defines a way of encoding the random variable $x$
 as a binary file.

 So far, the  Shannon information content  makes sense:
 it measures the length of a binary file that encodes
 $x$. 
%
 However, we have not yet studied ensembles where the 
 outcomes have unequal probabilities. Does the 
 Shannon information content make sense there too?

\fakesection{Submarine figure}
%
\newcommand{\subgrid}{\multiput(0,0)(0,10){9}{\line(1,0){80}}\multiput(0,0)(10,0){9}{\line(0,1){80}}}
\newcommand{\sublabels}{
\put(-5,75){\makebox(0,0){\sf\tiny{A}}}
\put(-5,65){\makebox(0,0){\sf\tiny{B}}}
\put(-5,55){\makebox(0,0){\sf\tiny{C}}}
\put(-5,45){\makebox(0,0){\sf\tiny{D}}}
\put(-5,35){\makebox(0,0){\sf\tiny{E}}}
\put(-5,25){\makebox(0,0){\sf\tiny{F}}}
\put(-5,15){\makebox(0,0){\sf\tiny{G}}}
\put(-5, 5){\makebox(0,0){\sf\tiny{H}}}
%
\put(75,-5){\makebox(0,0){\tiny{8}}}
\put(65,-5){\makebox(0,0){\tiny{7}}}
\put(55,-5){\makebox(0,0){\tiny{6}}}
\put(45,-5){\makebox(0,0){\tiny{5}}}
\put(35,-5){\makebox(0,0){\tiny{4}}}
\put(25,-5){\makebox(0,0){\tiny{3}}}
\put(15,-5){\makebox(0,0){\tiny{2}}}
\put( 5,-5){\makebox(0,0){\tiny{1}}}
}
\newcommand{\misssixteen}{
\put(45,65){\makebox(0,0){$\times$}}
\put(45,45){\makebox(0,0){$\times$}}
\put(35,75){\makebox(0,0){$\times$}}
\put(35,65){\makebox(0,0){$\times$}}
\put(35,55){\makebox(0,0){$\times$}}
\put(35,45){\makebox(0,0){$\times$}}
\put(35,35){\makebox(0,0){$\times$}}
\put(35,25){\makebox(0,0){$\times$}}
\put(35,15){\makebox(0,0){$\times$}}
\put(35, 5){\makebox(0,0){$\times$}}
\put(25,75){\makebox(0,0){$\times$}}
\put(25,65){\makebox(0,0){$\times$}}
\put(25,55){\makebox(0,0){$\times$}}
\put(25,45){\makebox(0,0){$\times$}}
\put(25,35){\makebox(0,0){$\times$}}
\put(25,25){\makebox(0,0){$\times$}}
\put(25,15){\makebox(0,0){$\times$}}
}
\newcommand{\missthirtytwo}{
\put(75,75){\makebox(0,0){$\times$}}
\put(75,65){\makebox(0,0){$\times$}}
\put(75,55){\makebox(0,0){$\times$}}
\put(75,45){\makebox(0,0){$\times$}}
\put(75,35){\makebox(0,0){$\times$}}
\put(75,25){\makebox(0,0){$\times$}}
\put(75,15){\makebox(0,0){$\times$}}
\put(75, 5){\makebox(0,0){$\times$}}
\put(65,75){\makebox(0,0){$\times$}}
\put(65,65){\makebox(0,0){$\times$}}
\put(65,55){\makebox(0,0){$\times$}}
\put(65,45){\makebox(0,0){$\times$}}
\put(65,35){\makebox(0,0){$\times$}}
\put(65,25){\makebox(0,0){$\times$}}
\put(65,15){\makebox(0,0){$\times$}}
\put(65, 5){\makebox(0,0){$\times$}}
\put(55,75){\makebox(0,0){$\times$}}
\put(55,65){\makebox(0,0){$\times$}}
\put(55,55){\makebox(0,0){$\times$}}
\put(55,45){\makebox(0,0){$\times$}}
\put(55,35){\makebox(0,0){$\times$}}
\put(55,25){\makebox(0,0){$\times$}}
\put(55,15){\makebox(0,0){$\times$}}
\put(55, 5){\makebox(0,0){$\times$}}
\put(45,75){\makebox(0,0){$\times$}}
%%\put(45,65){\makebox(0,0){$\times$}}
\put(45,55){\makebox(0,0){$\times$}}
%% \put(45,45){\makebox(0,0){$\times$}}
\put(45,35){\makebox(0,0){$\times$}}
\put(45,25){\makebox(0,0){$\times$}}
\put(45,15){\makebox(0,0){$\times$}}
\put(45, 5){\makebox(0,0){$\times$}}
\put(5,65){\makebox(0,0){$\times$}}
}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%% submarine figure %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}
\figuredangle{%
\begin{center}
%\begin{tabular}{l@{\hspace{-1mm}}*{5}{@{\hspace{2pt}}c}} \toprule
\begin{tabular}{l@{\hspace{0mm}}*{5}{@{\hspace{8.5mm}}c}} \toprule
% moves made &  1 & 2 & 32 & 48 & 49 \\
 &
%
% 1 miss
%
% this fig actually needs extra width on left, but there is nothing there.
\setlength{\unitlength}{0.26mm} 
\begin{picture}(80,95)(0,-10)\subgrid\sublabels
\put(25,15){\makebox(0,0){$\times$}}
\put(25,15){\circle{15}}
\end{picture} 
 &
%
% 2 miss
%
\setlength{\unitlength}{0.26mm} 
\begin{picture}(80,95)(0,-10)\subgrid
\put(25,15){\makebox(0,0){$\times$}}
\put(5,65){\makebox(0,0){$\times$}}
\put(5,65){\circle{15}}
\end{picture} 
 &
%
% 32 miss
%
\setlength{\unitlength}{0.26mm} 
\begin{picture}(80,95)(0,-10)\subgrid
\put(25,15){\makebox(0,0){$\times$}}
\put(45,35){\circle{15}}
\missthirtytwo
\end{picture} 
 &
%
% 48 miss
%
\setlength{\unitlength}{0.26mm} 
\begin{picture}(80,95)(0,-10)\subgrid
\put(25,15){\makebox(0,0){$\times$}}
\put(5,65){\makebox(0,0){$\times$}}
\missthirtytwo
\misssixteen
\put(25,25){\circle{15}}
\end{picture} 
&
\setlength{\unitlength}{0.26mm} 
\begin{picture}(80,95)(0,-10)\subgrid
\put(25,15){\makebox(0,0){$\times$}}
\put(5,65){\makebox(0,0){$\times$}}
\missthirtytwo
\misssixteen
%%%%%%%%%%%%%%%%%%%%%%% hit the submarine: 
\put(25,5){\circle{15}}
\put(25,5){\makebox(0,0){\tiny\bf S}}
\end{picture} 
 \\
move \# &  1 & 2 & 32 & 48 & 49 \\
question
& G3
& B1
& E5 
& F3
& H3 \\
 outcome     
&  $x = {\tt n}$ % $(\times)$
&  $x = {\tt n}$ %$(\times)$
&  $x = {\tt n}$ %$(\times)$
&  $x = {\tt n}$ %$(\times)$
&  $x = {\tt y}$ %({\small\bf S})
 \\[0.1in]
 $P(x)$ 
& 	$\displaystyle\frac{63}{64}$  
& 	$\displaystyle\frac{62}{63}$  
& 	$\displaystyle\frac{32}{33}$  
& 	$\displaystyle\frac{16}{17}$  
& 	$\displaystyle\frac{1}{16}$  
 \\[0.15in]
 $h(x)$ 
& 	 0.0227
& 	 0.0230
& 	 0.0443 
% & 	 0.0430 -------- 0.9556 , just before 32 are pasted
& 	 0.0874
& 	 4.0 
 \\[0.05in]
 Total info.
& 	 0.0227
&  0.0458
&  1.0
&  2.0
&  6.0
 \\  \bottomrule
\end{tabular}
\end{center}
}{%
\caption[a]{A game of {\tt submarine}. The submarine is hit on the 49th attempt.}
\label{fig.sub}
}%
\end{figure}

\subsection{The game of {\ind{submarine}}: how many bits can one bit convey?}
 In the game of {\ind{battleships}}, each player hides a fleet of 
 ships in a sea represented by a square grid. On each\index{game!submarine} 
 turn, one player
 attempts to hit the other's ships by firing at one square
 in the opponent's sea. The response to a selected square such 
 as `G3' is either `miss', `hit', or `hit and destroyed'.

 In a
% rather
 boring version of battleships called {\tt submarine}, 
 each player hides just one submarine in one square of 
 an eight-by-eight grid. 
 \Figref{fig.sub} shows a few pictures of  this game in progress:
 the circle represents the square that is being fired at, and the
 $\times$s show squares in which the outcome was a miss, $x={\tt{n}}$; the
 submarine is hit (outcome $x={\tt{y}}$ shown by
 the symbol $\bs$) on the 49th attempt.
 
 Each shot made by a player defines an ensemble. The 
 two possible outcomes are $\{  {\tt{y}} ,{\tt{n}}\}$, 
 corresponding to a hit and a miss, and their probabilities
 depend on the state of the board. 
 At the beginning, $P({\tt{y}}) = \linefrac{1}{64}$ and 
 $P({\tt{n}}) = \linefrac{63}{64}$. 
 At the second shot, if the first shot missed,
% enemy sub has not yet been hit, 
 $P({\tt{y}}) = \linefrac{1}{63}$ and $P({\tt{n}}) = \linefrac{62}{63}$. 
 At the third shot, if the first two shots missed,
% enemy submarine has not yet been hit, 
 $P({\tt{y}}) = \linefrac{1}{62}$ and $P({\tt{n}}) = \linefrac{61}{62}$. 

% According to the Shannon information content, t
 The  Shannon information
 gained from an outcome $x$ is $h(x) = \log_2 (1/P(x))$.
% Let's investigate this assertion.
 If we are lucky, and hit the submarine on the first shot, then 
\beq
	h(x) = h_{(1)}({\tt y}) = \log_2 64 = 6 \ubits .
\eeq
 Now, it might seem a little strange that
 one binary outcome can convey six bits.
% , but it does make sense. W
 But we have learnt the hiding place,
% where the submarine was,
 which 
 could have been any of 64 squares; so we have, by one lucky 
 binary question, indeed learnt six bits. 

 What if the first shot misses?  The Shannon information that we gain from this outcome
 is
\beq
	h(x) = h_{(1)}({\tt n}) = \log_2 \frac{64}{63} = 0.0227 \ubits .
\eeq
 Does this make sense? It is not so obvious. Let's keep going.
 If our second shot also misses, the Shannon information 
 content of the second outcome is
\beq
	 h_{(2)}({\tt n}) = \log_2 \frac{63}{62} = 0.0230 \ubits .
\eeq
 If we miss thirty-two times (firing at a new square each time), 
 the total Shannon information gained is
\beqan
%\hspace*{-0.2in}
\lefteqn{ \log_2 \frac{64}{63} + \log_2 \frac{63}{62} + \cdots +
	\log_2 \frac{33}{32} } \nonumber \\
&	\!\!\!=\!\!\! &  0.0227  +  0.0230 + \cdots + 0.0443  \:\:=\:\:
  1.0 \ubits .
\eeqan
 Why this round number? Well, what have we learnt? We now know
 that the submarine is not in any of the 32 squares we fired at; 
 learning that fact is just like playing a game of \sixtythree\ 
 (\pref{example.sixtythree}),
 asking as our first question `is $x$  one of the 
 thirty-two numbers corresponding to these squares I fired at?',
 and receiving the answer `no'.  This answer rules out half of the 
 hypotheses, so it gives us one bit.
%It doesn't matter what the 
% outcome might have been; all that matters is the probability 
% of what actually happened.

 After 48 unsuccessful shots, the information 
 gained is 2 bits: the unknown location has been narrowed down to
 one quarter of the original hypothesis space.
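
 Both round numbers can be checked directly, because the product of
 the successive ratios cancels term by term. After $k$ misses in a
 row, each at a new square, the total Shannon information gained is
\beq
	\log_2 \left[ \frac{64}{63} \times \frac{63}{62} \times \cdots
		\times \frac{64-k+1}{64-k} \right]
	= \log_2 \frac{64}{64-k} ,
\eeq
 which is $\log_2 2 = 1 \ubit$ for $k=32$ and
 $\log_2 4 = 2 \ubits$ for $k=48$.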

 What if we hit the submarine on the 49th shot, when there 
 were 16 squares left? 
 The Shannon information content of this outcome is
\beq
		 h_{(49)}({\tt y}) = \log_2 16 = 4.0 \ubits .
\eeq
 The total Shannon information content of all the outcomes is
\beqan
\lefteqn{	\log_2 \frac{64}{63} + \log_2 \frac{63}{62} + \cdots +
%	\log_2 \frac{33}{32} + \cdots +
	\log_2 \frac{17}{16} + 
	\log_2 \frac{16}{1} }
  \nonumber \\
	&=&  0.0227  +  0.0230 + \cdots
% + 0.0430 + \cdots 
		+ 0.0874 + 4.0 \:\: =\:\: 6.0 \ubits .
\label{eq.sum.me}
\eeqan
 So once we know where the submarine is, the total Shannon information 
 content gained is 6 bits.

 This result holds regardless of when  
 we hit the submarine. If we hit it when there are  $n$ squares 
 left to choose from --   $n$ was 16 in 
 \eqref{eq.sum.me} -- then the total information gained
 is: 
\beqan
\lefteqn{	\log_2 \frac{64}{63} + \log_2 \frac{63}{62} + \cdots +
	\log_2 \frac{n+1}{n} + 
	\log_2 \frac{n}{1} } \nonumber \\
&=& \log_2 \left[
	\frac{64}{63} \times \frac{63}{62} \times \cdots
                   \times \frac{n+1}{n} \times \frac{n}{1} \right]
%\times 63 \times \cdots \times (n+1) \times n}
%		{63 \times 62 \times \cdots \times n \times 1} 
	\:\:=\:\: \log_2 \frac{64}{1}\:\: =\:\: 6 \,\bits.
\eeqan

%
% add winglish here?
%
% follows in lecture 2, after submarine
%
% aim: introduce the language of Wenglish
% and demonstrate Shannon info content.

 What have we learned from the examples so far?
 I think the {\tt submarine} example makes quite a convincing
 case for the claim that the Shannon information content
 is a sensible measure of information content.
 And the game of {\tt sixty-three} shows that
 the Shannon information content can be  intimately connected
 to the size of a file that encodes the outcomes of
 a random experiment, thus suggesting a possible connection to
 data compression.

 In case you're not convinced, let's look at one more example.

 

\subsection{The \Wenglish\ language}
\label{sec.wenglish}
% [this section under construction]}
 {\dem{\ind{\Wenglish}}} is a  language similar to \ind{English}.
 \Wenglish\ sentences consist of words drawn at random from the
 \Wenglish\ dictionary, which contains $2^{15}=32{,}768$ words, all of length 5
 characters. Each word in the \Wenglish\ dictionary was constructed
% by the \Wenglish\  language committee, who created each of those $32\,768$ words
 at random by picking five letters from the
 probability distribution over {\tt a$\ldots$z} depicted
 in \figref{fig.monogram}.
% Since all words are five characters long

%\begin{figure}
%\figuremargin{
\marginfig{\small
\begin{center}
\begin{tabular}{rc} \toprule
%  & Word \\ \midrule
1 & {\tt{aaail}} \\
2 & {\tt{aaaiu}} \\
3   & {\tt{aaald}} \\
    & $\vdots$ \\
129 & {\tt{abati}} \\
    & $\vdots$ \\
$2047$ & {\tt{azpan}} \\
$2048$ & {\tt{aztdn}} \\
    & $\vdots$ \\
    & $\vdots$ \\
 $16\,384$   & {\tt{odrcr}} \\
    & $\vdots$ \\
    & $\vdots$ \\
 $32\,737$ & {\tt{zatnt}} \\
    & $\vdots$ \\
 $32\,768$ & {\tt{zxast}} \\ \bottomrule
\end{tabular}
\end{center}
%}{
\caption[a]{The \Wenglish\ dictionary.}
\label{fig.wenglish}
}
%\end{figure}
% 5366+1219+2602+2718+8377+1785+1280+3058+5903+70+800+3431+2319+5470+6526+1896+539+4660+5453+6767+3108+652+1388+765+1564+78
% 77794
 Some entries from the dictionary are shown in
  alphabetical order in \figref{fig.wenglish}.
 Notice that the number of words in the \ind{dictionary}
 (32,768)
 is much smaller than the total number of possible  words of length 5 letters,
 $26^5 \simeq 12{,}000{,}000$.

 Because the probability of the letter {{\tt{z}}} is about $1/1000$,
 only 32 of the words in the dictionary begin with the letter {\tt z}.
 In contrast,  the probability of the letter {{\tt{a}}} is about $0.0625$,
 and 2048 of the words begin  with the letter {\tt a}. Of those 2048 words,
 two start {\tt az}, and 128 start {\tt aa}.

 Let's imagine that we are reading a \Wenglish\ document, and let's discuss
 the Shannon \ind{information content} of the characters as we acquire them.
 If we are given the text one word at a time, the Shannon information
 content of each five-character word is $\log_2 \mbox{32,768} = 15$ bits,
 since \Wenglish\ uses all its words with equal probability. The
 average information content per character is therefore 3 bits.

 Now let's look at the information content if we read the document
 one character at a time.
 If, say, the first letter of a word is {\tt a}, the Shannon information
 content is
 $\log_2 1/ 0.0625 \simeq 4$ bits.
 If the first letter is {\tt z}, the Shannon information content
 is $\log_2 1/0.001 \simeq 10$ bits.
 The information content is thus highly variable
 at the first character. The total information
 content  of the 5 characters in a word, however,
 is exactly 15 bits; so the letters that
 follow an initial {\tt{z}} have lower average  information content
 per character than the letters that follow an initial {\tt{a}}.
 A rare initial letter such as {\tt{z}} indeed conveys
 more information about what the word is
 than a common initial letter.
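
 The word counts quoted above make this concrete, if we take them
 as exact and continue to assume that all words are equally probable.
 Once the first letter is known, the information still to come is
\beq
	\log_2 32 = 5 \ubits \:\:\mbox{after {\tt{z}}}
	\:\:\:\:\mbox{and}\:\:\:\:
	\log_2 2048 = 11 \ubits \:\:\mbox{after {\tt{a}}} ,
\eeq
 that is, 1.25 and 2.75 bits per remaining character respectively.
 With these counts the first letters themselves convey
 $\log_2 (32{,}768/32) = 10$ bits and $\log_2 (32{,}768/2048) = 4$ bits,
 so both routes add up to 15 bits: $10 + 5 = 4 + 11 = 15$.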


 Similarly, in English, if  rare characters occur at the start of
 the word (\eg\ {\tt{xyl}\ldots}),
 then often we can identify the whole word immediately; whereas
 words that start with common characters (\eg\ {\tt{pro}\ldots}) require more characters
 before we can identify them.

% Does this make sense? Well, in English,
% the first few characters of a word do very often fully identify the whole word.
%
% {\em MORE HERE........}
 







\section{Data compression}
 \index{data compression}\index{source code}The
 preceding examples justify the idea that the Shannon \ind{information 
 content} of an outcome is a natural  measure of its
 \ind{information content}.  Improbable outcomes
 do convey more information than probable outcomes.
 We now discuss the  information content 
 of a source by considering how many bits are needed to describe 
 the  outcome of an experiment.
% , that is, by studying {data compression}. 

 If we can show that we can  compress data from a particular source 
 into a file of $L$ bits per source symbol and recover the data reliably,
 then we will say that the average information 
 content of that source is at most
% less than or equal to
 $L$ bits per symbol.
%
% cut Sat 13/1/01
%
% We will show that, for any source, the information content of the source 
% is intimately  related to its entropy.

\subsection{Example: compression of text files}
 A file is composed of a sequence of bytes.  A byte is composed of 8
 bits\marginpar{\small\raggedright{Here we use the word `bit' with its meaning, `a
 symbol with two values', not to be confused with the
 unit of information content.}}
 and can have a decimal value between 0 and 255.  A
 typical text file is composed of the
 ASCII character set (decimal values 0 to 127). 
 This character set uses only 
 seven of the eight bits in a byte. 
\exercissxB{1}{ex.ascii}{
 By how much could the size of a file be reduced given that 
 it is an ASCII file? How would you achieve this reduction?
} 
 Intuitively, it seems reasonable to assert that an ASCII file 
 contains $7/8$ as much information as an arbitrary file of the same 
 size, since we already know one out of every eight bits before we even 
 look at the file. 
 This is a 
% very
 simple example of redundancy. 
 Most sources of data have further redundancy: English text files
 use the ASCII characters with non-equal frequency; certain pairs 
 of letters are more probable than others;  and entire words 
 can be predicted given the context and a semantic understanding
 of the text.
% this par is repeated in l4. 

% compressibility.

\subsection{Some simple data compression methods that define
     measures of information content}
%
% IDEA: connect back to opening
%
 One way of measuring the information content of a  random variable 
 is  simply to count the number of  {\em possible\/} outcomes,
 $|\A_X|$. (The number of elements in a set $\A$ is denoted by $|\A|$.)
 If we  gave a binary name  to each outcome, the length 
 of each name would be $\log_2 |\A_X|$ bits, if $|\A_X|$ happened
 to be a power of 2.
 We thus make the following definition.  
\begin{description}%%%% was: [Perfect information content] Raw bit content
%%%%%%%%%%%%%%%%%%%%%%% see newcommands1.tex
\item[The \perfectic] of $X$ is
\beq
	H_0(X) = \log_2 |\A_X| .
\eeq 
 \end{description}
 $H_0(X)$ is a lower bound for 
 the number of binary questions that are always guaranteed to identify
 an outcome from the ensemble $X$.
 It is an additive quantity: the \perfectic\ of an ordered pair $x,y$,
 having $|\A_X||\A_Y|$  
 possible outcomes,
 satisfies    
\beq
	H_0(X,Y)= H_0(X) + H_0(Y).
\eeq
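 The additivity follows from the product rule for logarithms; as a
 one-line check,
\beq
	H_0(X,Y) = \log_2 \bigl( |\A_X| |\A_Y| \bigr)
	 = \log_2 |\A_X| + \log_2 |\A_Y| = H_0(X) + H_0(Y) .
\eeq
 For instance, an ordered pair of ASCII characters, each drawn
 from 128 possibilities, has $128^2 = 16\,384$ possible outcomes and
 a \perfectic\ of $7+7=14$ bits.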

 This measure of information content does not include any
 probabilistic element, and the encoding  rule it corresponds to
 does not `compress' the source data; it simply maps each
 outcome
% source character
 to a constant-length binary string.
 
\exercissxA{2}{ex.compress.possible}{
 Could there be a compressor that maps
 an outcome $x$ to a binary code $c(x)$, and a decompressor
 that maps $c$ back to $x$, such that {\em every
 possible outcome\/} is compressed into a binary code
 of length {\em shorter\/}
 than $H_0(X)$ bits?
}
 Even though  a simple counting argument\index{compression!of {\em any\/} file}
 shows that it is impossible to make a reversible
 compression program that reduces the size of {\em all\/} files,
 amateur compression enthusiasts frequently announce that they have invented
 a program that  can do this -- indeed that they can further compress
 compressed files by putting them through their compressor several\index{compression!of already-compressed files}\index{myth!compression}
 times. Stranger yet, patents have
 been granted to these modern-day \ind{alchemists}. See
 the {\tt{comp.compression}} frequently asked questions
% \verb+http://www.faqs.org/faqs/compression-faq/part1/+
 for further reading.\footnote{\tt{http://sunsite.org.uk/public/usenet/news-faqs/comp.compression/}}
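
 The counting argument can be sketched as follows (the symbol $L$,
 standing for a file length in bits, is introduced here only for
 illustration). There are $2^L$ distinct files of length $L$ bits,
 but the number of binary strings of length strictly less than $L$,
 including the empty string, is
\beq
	\sum_{l=0}^{L-1} 2^l = 2^L - 1 ,
\eeq
 which is one too few. So no reversible compressor can map every
 file of length $L$ to a shorter file: at least two files would
 have to share an encoding.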
%\footnote{\verb+http://www.lib.ox.ac.uk/internet/news/faq/+}
% ............by_category.compression-faq.html+}
% http://www.faqs.org/faqs/compression-faq/part1/preamble.html

 There are only two ways in which a `compressor' can actually
 compress files:
\ben
\item
	A {\dem lossy\/} compressor compresses some\index{compression!lossy}
	files, but maps some distinct files
% {\em distinct\/} files are mapped
 to the
	{\em same\/} encoding. We'll assume that
	the user requires perfect recovery of the source
	file, so  the occurrence of one of these
	 confusable files leads to a failure (though in 
	applications such as \ind{image compression}, lossy compression is viewed as
 satisfactory).  We'll denote by
  $\delta$ 
 the probability that the
	source string is one of the confusable files, so a
 lossy compressor\index{error probability!in compression}
	has a probability $\delta$ of
	failure.	If $\delta$ can be made very small then
	a lossy compressor may be practically useful. 
\item
	A {\dem lossless} compressor maps all files
 to different encodings; if it
% f a lossless compressor
 shortens some files,\index{compression!lossless}
	it necessarily {\em  makes others longer}.  We try to design the
	compressor so that the probability that a
	file is lengthened is very small, and the probability that
 it is shortened is large.
\een
 In this chapter we  discuss a simple lossy compressor.
 In subsequent chapters we  discuss  lossless compression
 methods.

%
\section{Information content defined in terms of lossy
 compression}
%

 Whichever type of compressor we construct, we need somehow to
 take into account the {\em probabilities\/} of the  different outcomes. 
 Imagine comparing the information contents of
 two text files -- one
 in which all 128 ASCII characters are used with equal probability,
 and one in which the characters are used with their frequencies 
 in English text.
%: $P(x={\tt e})=$, 
% $P(x={\tt e})=$, $P(x={\tt e})=$,$P(x={\tt e})=$,$P(x={\tt e})=$, \ldots
% $P(x={\tt e})=$, \ldots. 
% only the characters {\tt 0} and {\tt 1} are used. 
 Can we define a measure of information content that
 distinguishes between these two files? Intuitively,
 the latter file contains less information per character
 because it is more predictable.

%And a file of {\tt 0}s 
% and {\tt 1}s in which nearly all the characters are {\tt 0}s 
% conveys even less information. 
% Maybe introducing 0 and 1 is not a good idea. 
% At this point I start talking in terms of compression. 
% How can we include a probabilistic element?
 One simple way to use
 our knowledge that some symbols have a smaller probability is
 to imagine recoding the observations into a smaller alphabet -- thus losing
 the ability to encode some of the more improbable
 symbols -- and then measuring the \perfectic\ of the new alphabet.
% choice here - could either map multiple symbols onto 
% one, so the compression is lossy, 
% or could define no entry at all for some symbols, so compression
% fails. 
%  The general mapping situation is not ideal since I really want all 
% the losers to be mapped to one symbol. Student might imagine mapping
% Z and z to Z, Y and y to Y.. and claim they are losing little info.
% But this messes up the defn of delta.
 For example, 
 we might take a risk when compressing English text, guessing that the most
 infrequent  characters won't occur, 
 and make a reduced ASCII code that omits the characters
% for example, 
%  `\verb+!+', `\verb+@+', `\verb+#+',
%  `\verb+$+', `\verb+%+', `\verb+^+', `\verb+*+', `\verb+~+', 
%  `\verb+<+', `\verb+>+', `\verb+/+',   `\verb+\+',  `\verb+_+',
%  `\verb+{+',  `\verb+}+',  `\verb+[+',  `\verb+]+',
%  and `\verb+|+',
 $\{$ \verb+!+, \verb+@+, \verb+#+,
% \verb+$+, $
 \verb+%+, \verb+^+, \verb+*+, \verb+~+, 
 \verb+<+, \verb+>+, \verb+/+,   \verb+\+,  \verb+_+,
 \verb+{+,  \verb+}+,  \verb+[+,  \verb+]+, \verb+|+ $\}$,
 thereby reducing the size of the alphabet
% the total number of characters
 by seventeen.
%
% cut this dec 2000
% Thus we can give new
%%%%  a (not necessarily unique)
% names to a {\em subset\/} of the possible outcomes and count how many names we
% use.
 The larger the risk we are willing to take, the smaller
 our final alphabet becomes.
% ] the number of names we need.
% We thus relax the exhaustive requirement of the definition of 
%
% aside
%
% We could imagine doing this to the numbers coming out of the guessing 
% game with which this chapter started, for example. It seems 
% quite unlikely that the subject would have to guess 25, 26 or 27 times 
% to get the next letter; these outcomes 
%%`27' is
% are very improbable, 
% and we might be willing to record the sequence of numbers using 
% 24 symbols only, taking the gamble that in fact more guesses might 
% be needed. 

 We  introduce a parameter $\delta$ that describes the risk we 
 are taking when using this compression method:  $\delta$ is 
 the probability that there will be no name for an outcome $x$.
\exampl{exHdelta}{
 Let 
\beq
\begin{array}{l*{14}{@{\,}c}}
     & \A_X & = & \{  & {\tt a},& {\tt b},&{\tt c},&{\tt d},&{\tt e},&{\tt f},&{\tt g},&{\tt h} & \}, \\
 \mbox{and }\:\:
  & \P_X & = & \bigl\{  &    \frac{1}{4} ,&    \frac{1}{4} ,&   \frac{1}{4} ,&  \frac{3}{16} ,&  \frac{1}{64} ,&  \frac{1}{64} ,&  \frac{1}{64} ,&  \frac{1}{64}  & \bigr\} .
\end{array} 
\eeq
 The \perfectic\ of this ensemble is 3 bits, corresponding to 
 8 binary names.
 But notice that $P( x \in \{ {\tt a}, {\tt b}, {\tt c}, {\tt d} \} ) = 15/16$.
 So if we are willing to run a risk of $\delta=1/16$ of not having a name
 for $x$, then we can get by with four names --
 half as many names as are needed if
 every $x \in \A_X$  has a name.

 Table \ref{fig.delta.examples} shows binary names that could be given 
 to the different outcomes in the cases $\delta = 0$ and $\delta = 1/16$.
 When $\delta=0$ we need 3 bits to encode the outcome;
 when $\delta=1/16$ we  need only 2 bits. 
%
%\begin{figure}[htbp]
%\figuremargin{%
\amargintab{b}{
\begin{center}
\begin{tabular}{cc} 
\toprule
\multicolumn{2}{c}{$\delta = 0$}
\\
\midrule
$x$ & $c(x)$ \\ \midrule
{\tt a} & {\tt{000}} \\
{\tt b} & {\tt{001}} \\
{\tt c} & {\tt{010}} \\
{\tt d} & {\tt{011}} \\
{\tt e} & {\tt{100}} \\
{\tt f} & {\tt{101}} \\
{\tt g} & {\tt{110}} \\
{\tt h} & {\tt{111}} \\
 \bottomrule
\end{tabular}
% \hspace{0.61in}
\hspace{0.1in}
\begin{tabular}{cc} 
\toprule
\multicolumn{2}{c}{$\delta = 1/16$}
\\
\midrule
$x$ & $c(x)$ \\ \midrule
{\tt a} & {\tt{00}} \\
{\tt b} & {\tt{01}} \\
{\tt c} & {\tt{10}} \\
{\tt d} & {\tt{11}} \\
{\tt e} & $-$ \\
{\tt f} & $-$ \\
{\tt g} & $-$ \\
{\tt h} & $-$ \\
 \bottomrule
\end{tabular}
\end{center}
%}{%
\caption[a]{Binary names for the outcomes,
 for two failure probabilities $\delta$.}
\label{fig.delta.examples}
\label{tab.twosillycodes}
}%
%\end{figure}
}

%\noindent
 Let us now formalize this idea.\index{source code}
%
 To make a compression strategy with risk $\delta$,
% we consider all  subsets $T$ of the alphabet $\A_X$ and 
% seek out
 we make the smallest possible subset
 $S_{\delta}$ such that  the
 probability that $x$ is not in $S_{\delta}$ is less than or equal to 
 $\delta$, \ie,
 $P(x \not\in S_{\delta} ) \leq \delta$. For each value of $\delta$ we can then
 define a new measure of information content -- the log of the size
 of this smallest subset $S_{\delta}$. [In ensembles in which
 several elements have the same probability, there may be several
 smallest subsets that contain different elements, but all that matters
 is their sizes (which are equal), so we will not dwell on this ambiguity.]
% worry about this possibility.
\begin{description}
\item[The smallest $\delta$-sufficient subset] $S_{\delta}$ is the smallest
	subset of $\A_X$ satisfying
\beq
	P(x \in S_{\delta} ) \geq 1 - \delta.
\eeq
%\beq
% S_{\delta} = \argmin 
%\eeq
\end{description}
 The subset  $S_{\delta}$ can be constructed by
 ranking the elements of $\A_X$ in order of decreasing probability
 and adding successive elements starting from the
 most probable elements
% front of the list
 until the total
 probability is $\geq (1\!-\!\delta)$.

 We can make a data compression code by assigning a binary name
 to each element of the smallest sufficient subset.
% (\tabref{tab.twosillycodes}).
 This compression  
 scheme motivates the following measure of information content: 
\begin{description}
\item[The \essentialic] of $X$ is: %%%%% was ESSENTIAL information content
% consider risk-delta bit content?
\beq
	H_{\delta}(X) = \log_2 |S_{\delta}| .
% =	\log_2 \min 	\left\{ |S| : S\subseteq \A_X,
%% P(S)\geq 1-\delta \right\}.
% P(x \in S)\geq 1-\delta \right\}.
\eeq
\end{description}
 Note that $H_0(X)$ is the special case of $H_{\delta}(X)$ with $\delta = 0$ 
 (if $P(x) > 0$ for all $x \in \A_X$). 
%
 [{\sf Caution:} do not confuse $H_0(X)$ and $H_{\delta}(X)$
 with the function $H_2(p)$ displayed in \figref{fig.h2}.] 
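
 As a worked check of these definitions, take the ensemble of
 \exampleonlyref{exHdelta}. Ranking the outcomes by decreasing
 probability gives the cumulative probabilities
 $\frac{1}{4}, \frac{1}{2}, \frac{3}{4}, \frac{15}{16},
 \frac{61}{64}, \frac{31}{32}, \frac{63}{64}, 1$,
 so the size of the smallest sufficient subset and the resulting
 $H_{\delta}(X)$ (in bits, rounded to two decimal places) should be
\beq
\begin{array}{ccc}
 \mbox{range of } \delta & |S_{\delta}| & H_{\delta}(X) \\
 0 \leq \delta < 1/64 & 8 & 3 \\
 1/64 \leq \delta < 1/32 & 7 & 2.81 \\
 1/32 \leq \delta < 3/64 & 6 & 2.58 \\
 3/64 \leq \delta < 1/16 & 5 & 2.32 \\
 1/16 \leq \delta < 1/4 & 4 & 2 \\
 1/4 \leq \delta < 1/2 & 3 & 1.58 \\
 1/2 \leq \delta < 3/4 & 2 & 1 \\
 3/4 \leq \delta < 1 & 1 & 0
\end{array}
\eeq
 Each row is obtained by finding the smallest number of top-ranked
 outcomes whose omitted probability does not exceed $\delta$.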

%%%%%%%(Should  I change notation to avoid confusion?)
%
\newcommand{\gapline}{\cline{1-4}\cline{6-9}}
\begin{figure}
\figuremargin{%
\begin{center}
\footnotesize%
\begin{tabular}{rc}
(a)&
\hspace*{-0.2in}\input{Hdelta/Sdelta/X.tex}\\
(b)&
\mbox{\makebox[0in][r]{\raisebox{1.3in}{$H_{\delta}(X)$}}\hspace{-5mm}%
\psfig{figure=Hdelta/byhand/X.ps,%
width=70mm,angle=-90}$\delta$}%
\\
\end{tabular}
\end{center}
}{%
\caption[a]{(a) The outcomes of $X$ (from \protect\exampleref{exHdelta}),
 ranked by their probability.
 (b) The 
 \essentialic\ $H_{\delta}(X)$. The labels on the graph
 show the smallest sufficient set as a function of $\delta$.
  Note  $H_0(X) = 3$ bits and $H_{1/16}(X) = 2$ bits. 
}
\label{fig.hd.1}
}
\end{figure}

%\noindent
{\Figref{fig.hd.1} shows $H_{\delta}(X)$ for the ensemble
 of \exampleonlyref{exHdelta} as a function of  $\delta$.
}  

\subsection{Extended ensembles}
% The compression method we're studying in which a subset of
% outcomes are given binary names is not giving us a
% measure of information content for a single symbol.
%
% sanjoy wants a motivation here.
%
 Is this compression method any more useful if we compress
 {\em blocks\/} of symbols from a source?\index{source code!block code}\index{ensemble!extended}\index{extended ensemble}
%

 We now turn to examples where the outcome $\bx = (x_1,x_2,\ldots, x_N)$ is a string of  $N$
 independent identically distributed random variables
 from a single ensemble $X$. 
 We will denote by
% $\bX$ or
 $X^N$ the ensemble $( X_1, X_2, \ldots, X_N )$.
% for which $\bx$ is the random variable.
 Remember that entropy is additive for independent variables (\exerciseref{ex.Hadditive}),
% \footnote{There should have been an exercise on this by now.}
 so 
% $H(\bX) = N H(X)$. 
 $H(X^N) = N H(X)$. 

\exampl{ex.Nfrom.1}{
% {\sf Example 2:}
 Consider a string of $N$ flips of a bent coin,
 $\bx = (x_1,x_2,\ldots, x_N)$, where $x_n \in
 \{{\tt{0}},{\tt{1}}\}$, with probabilities $p_0 \eq 0.9,$ $p_1 \eq
 0.1$. The most probable strings $\bx$ are those with most {\tt{0}}s.  If
 $r(\bx)$ is the number of {\tt{1}}s in $\bx$ then
\beq
% |p_0,p_1
 P(\bx) = p_0^{N-r(\bx)} p_1^{r(\bx)} .
\eeq
 To evaluate  $H_{\delta}(X^N)$
 we must find the smallest sufficient subset $S_{\delta}$.
 This  subset will contain 
 all $\bx$ with $r(\bx) = 0, 1, 2, \ldots,$ up to some $r_{\max}(\delta)-1$,
 and some of the $\bx$ with $r(\bx) = r_{\max}(\delta)$.
% Working backwards, we can evaluate the cumulative probability 
% $P(r(\bx) \leq r)$ and evaluate the size of the subset $T(r): \{ \bx:
% r(\bx) \leq r \}$. 
%\beq
%	|T(r)| = \sum_{r=0}^{r} \frac{N!}{(N-r)!r!}
%\label{l2.T}
%\eeq
%\beq
%	P(r(\bx) \leq r)  = \sum_{r=0}^{r} \frac{N!}{(N-r)!r!}  p_0^{N-r} p_1^{r}
%\label{l2.Pr}
%\eeq
% We can then plot $\log |T(r)|$ versus $P(r(\bx) \leq r)$. This defines 
% a graph of $H_{\delta}(\bX)$ against $\delta$. 
 Figures \ref{fig.hd.4} and \ref{fig.hd.10} 
% Figure \ref{fig.hd.4}
 show  graphs of $H_{\delta}(X^N)$ against
 $\delta$ for the cases $N=4$ and $N=10$. The steps are the values of
 $\delta$ at which $|S_{\delta}|$ changes by~1, and the cusps where the slope
 of the staircase changes are the points
 where $r_{\max}$ changes by 1.  
}
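
 To make the construction concrete for $N=4$ (a worked check using
 $p_0 = 0.9$ and $p_1 = 0.1$), the strings group by $r$ as follows:
\beq
\begin{array}{cccc}
 r & \mbox{number of strings} & P(\bx) & \mbox{total probability} \\
 0 & 1 & 0.6561 & 0.6561 \\
 1 & 4 & 0.0729 & 0.2916 \\
 2 & 6 & 0.0081 & 0.0486 \\
 3 & 4 & 0.0009 & 0.0036 \\
 4 & 1 & 0.0001 & 0.0001
\end{array}
\eeq
 For $\delta = 0.1$ the five strings with $r \leq 1$ suffice, since
 they leave out probability $0.0523$, whereas dropping any one of
 the $r=1$ strings would leave out $0.1252$; so
 $H_{0.1}(X^4) = \log_2 5 \simeq 2.32$ bits.
 For $\delta = 0.01$ all eleven strings with $r \leq 2$ are needed
 (leaving out $0.0037$), so $H_{0.01}(X^4) = \log_2 11 \simeq 3.46$ bits.
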
\exercissxC{2}{ex.cusps}{
 What are the mathematical shapes of the curves between the cusps? 
}
% , both with $p_1 =
% 0.1$.  The points defined by equations (\ref{l2.T}) and (\ref{l2.Pr})
% are the cusps in the curve.
%
% I think this figure may be sick. CHECK IT.
%
\renewcommand{\gapline}{\cline{1-3}\cline{5-8}}
\begin{figure}
\figuremargin{%
%
% this table done by hand with help of (above hd.p command) /home/mackay/itp/Hdelta> more figs/4.tex
%
\begin{center}
\footnotesize%
\begin{tabular}{r@{\hspace*{-0.3in}}c}
(a)&
%%%%%%%% written by hand    see also X.tex
%
% picture of Sdelta for X^4
%
\newcommand{\axislevel}{24}
\newcommand{\axislevelp}{29.5}
\newcommand{\axislevelm}{21}
\newcommand{\axislevelmm}{18}
\newcommand{\forestgap}{-0.7}
\newcommand{\forest}[3]{\multiput(#1)(\forestgap,0){#2}{\line(0,1){#3}}}
%
%
%
\setlength{\unitlength}{2.2pt}%
\begin{picture}(155,50)(-143,-20)% adjusted vertical height from 50 to 60 Sat 5/10/02. And put back again Sun 22/12/02  was (-143,-22) Sun 22/12/02
% - log P = 2.0 , 2.4 and 6.0
\forest{-6.1,0}{1}{16}% heights fictitious
\forest{-37.3,0}{4}{12.5}% 
\forest{-68.5,0}{6}{9.4}% 69.5
\forest{-100.8,0}{4}{6.3}%
\forest{-132.9,0}{1}{4.2}%
% axis: 
\put(-143,\axislevelm){\vector(1,0){151.0}}
%
% axis labels
\put(5,\axislevelp){\makebox(0,0)[b]{\small$\log_2 P(x)$}}
\put(0,\axislevel){\makebox(0,0)[b]{\small$0$}}
\put(-20,\axislevel){\makebox(0,0)[b]{\small$-2$}}
\put(-40,\axislevel){\makebox(0,0)[b]{\small$-4$}}
\put(-60,\axislevel){\makebox(0,0)[b]{\small$-6$}}
\put(-80,\axislevel){\makebox(0,0)[b]{\small$-8$}}
\put(-100,\axislevel){\makebox(0,0)[b]{\small$-10$}}
\put(-120,\axislevel){\makebox(0,0)[b]{\small$-12$}}
\put(-140,\axislevel){\makebox(0,0)[b]{\small$-14$}}
%
% this box is right size for the whole set
%\put(0,-2.5){\framebox(140,\axislevelm){}}
%\put(142,13){\makebox(0,0)[l]{\small$S_0$}}
% this box is round 3 clumps
\put(-83.5,-2.5){\framebox(83.5,\axislevelm){}}
\put(-84.5,13){\makebox(0,0)[r]{\small$S_{0.01}$}}
% a smaller box round 3 clumps
%\put(2.5,-1){\framebox(81,\axislevelmm){}}
%
\put(-53.5,-1){\framebox(51,\axislevelmm){}}
\put(-54.5,13){\makebox(0,0)[r]{\small$S_{0.1}$}}
%
% object labels
\put(-6.1,-12){\makebox(0,0)[t]{\footnotesize{\tt 0000}}}
\put(-37.7,-12){\makebox(0,0)[t]{\footnotesize${\tt 0010},{\tt 0001},\ldots$}}
\put(-69.5,-12){\makebox(0,0)[t]{\footnotesize${\tt 0110},{\tt 1010},\ldots$}}
\put(-101.2,-12){\makebox(0,0)[t]{\footnotesize${\tt 1101},{\tt 1011},\ldots$}}
\put(-132.9,-12){\makebox(0,0)[t]{\footnotesize{\tt 1111}}}
\multiput(-6.1,-10)(-31.6,0){5}{\vector(0,1){5}}  
\end{picture}
%
%
%
%

(b)&
\makebox[0in][r]{\raisebox{1.3in}{$H_{\delta}(X^4)$}}\hspace{-5mm}%
\psfig{figure=Hdelta/figs/hd/4.ps,%
width=65mm,angle=-90}$\delta$%%
%
% 
% useful for making table: 
% hd.p mmin=4 mmax=4 mstep=6 scale_by_n=0 plot_sub_graphs=1 latex=1 
%
\end{tabular}
\end{center}
}{%
%
% I think this figure may be sick. CHECK IT.
%
\caption[a]{(a) The sixteen outcomes of the ensemble $X^4$ with $p_1=0.1$, ranked by probability. (b) The
 \essentialic\ $H_{\delta}(X^4)$. The upper
 schematic diagram indicates the strings'
 probabilities by the vertical lines' lengths (not to scale).}
\label{fig.hd.4}
}%
\end{figure}
%
%
%
\begin{figure}%[htbp]
\figuremargin{%
\begin{center}
\mbox{%%%%%%%%%%%%% (twocol) %}\\ \mbox{
\makebox[0in][r]{\raisebox{1.3in}{$H_{\delta}(X^{10})$}}\hspace{-5mm}%
\psfig{figure=Hdelta/figs/hd/10.ps,%
width=65mm,angle=-90}$\delta$}
% command, in Hdelta: 
% hd.p mmin=4 mmax=10 mstep=6 scale_by_n=0 plot_sub_graphs=1 | gnuplot 
\end{center}
}{%
\caption[a]{$H_{\delta}(X^N)$ 	for  $N=10$ binary variables with $p_1=0.1$.}
\label{fig.hd.10}
}%
\end{figure}

 For the examples shown in figures \ref{fig.hd.1}--\ref{fig.hd.10},
 $H_{\delta}(X^N)$ depends strongly on the 
 value of $\delta$, so it might not seem  a  fundamental or useful 
 definition of information content.  
 But we will consider what happens as $N$, the number of independent variables
 in $X^N$, increases. We will find the remarkable result that 
 $H_{\delta}(X^N)$ becomes almost independent of $\delta$ -- and for all 
 $\delta$ it is very close to $N H(X)$, where $H(X)$ is the 
 entropy of one of the random variables.
% sketch? 
\begin{figure}
\figuremargin{%
\begin{center}
\mbox{\makebox[0in][r]{\raisebox{1.3in}{$\frac{1}{N}H_{\delta}(X^{N})$}}\hspace{-5mm}%
\psfig{figure=Hdelta/figs/hd/all.10.1010.ps,%
width=65mm,angle=-90}$\delta$}
\end{center}
}{%
\caption[a]{$\frac{1}{N} H_{\delta}(X^{N})$ 
	for  $N=10, 210, \dots,1010$ binary variables with $p_1=0.1$.}
\label{fig.hd.10.1010}
}
\end{figure}


 \Figref{fig.hd.10.1010} illustrates this asymptotic tendency for 
 the binary ensemble of  example \ref{ex.Nfrom.1}.
% discussed earlier with $N$ binary variables with $p_1 = 0.1$. 
 As $N$ increases, $\frac{1}{N} H_{\delta}(X^N)$  becomes an increasingly 
 flat function, except for tails close to $\delta=0$ and $1$.
%  The limiting value of  the plateau is $H(X) = 0.47$.
% We will explain and prove this result in the remainder of
% this chapter. Let's first note the implications of this result.
% The limiting value of the plateau, which for  $N$ binary variables with $p_1 = 0.1$
% appears to be about 0.5, defines how much compression is possible:
% $N$  binary variables with $p_1 = 0.1$ can be compressed into
% about $N/2$ bits, with a probability of error $\delta$ which
% can be any value between 0 and 1.
% We will show that the plateau value to which  $\frac{1}{N} H_{\delta}(X^N)$
% tends, for large $N$, is the entropy, $H(X)$.
%
% IDEA: Box this next sentence?
%
 As long as we are allowed
 a tiny probability of error $\delta$, compression down to
 $NH$ bits is possible. Even if we are allowed a large probability of error,
 we still can  compress only down to $NH$ bits.
%
% IDEA: Box above?
%
 This is the \ind{source coding theorem}.
% \subsection{The theorem}
\begin{ctheorem}
\label{thm.sct}
 {\sf Shannon's source coding theorem.}
% HOW TO NAME THIS?????????????????
% this name is taken later
	Let $X$ be an ensemble with entropy $H(X) = H$ bits. Given $\epsilon>0$
 and $0<\delta<1$, there exists a positive integer $N_0$ such that for 
 $N>N_0$, 
\beq
 \left| \frac{1}{N} H_{\delta}(X^N) - H \right| < \epsilon. 
\eeq
\end{ctheorem}
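
 For the bent coin of \exampleonlyref{ex.Nfrom.1}, the entropy
 that appears in the theorem can be evaluated directly
 (a worked check, with logarithms rounded to three decimal places):
\beq
 H = p_1 \log_2 \frac{1}{p_1} + p_0 \log_2 \frac{1}{p_0}
   \simeq 0.1 \times 3.322 + 0.9 \times 0.152 \simeq 0.469 \mbox{ bits}.
\eeq
 So for this source the theorem asserts that, for large enough $N$,
 $H_{\delta}(X^N)$ lies within $\epsilon N$ bits of $0.469 N$ bits,
 compared with $H_0(X^N) = N$ bits.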
%
% sanjoy wants explan here
%
% The reason that increasing $N$ helps is that, if $N$ is large,
% the outcome $\bx$ 

\section{Typicality}
 Why does increasing $N$ help?\indexs{typicality}
 Let's examine long strings from $X^N$.
 Table \ref{tab.typical.tcl} shows fifteen samples from $X^N$ 
 for  $N=100$ and $p_1=0.1$.
\begin{figure}
\figuremargin{%
\begin{center}
\begin{tabular}{lr} \toprule
$\bx$ &
% \multicolumn{1}{c}{$\log_2(P(\bx))$}
\hspace{-0.3in}{$\log_2(P(\bx))$}
% {\rule[-3mm]{0pt}{8mm}}%strut
 \\ \midrule
% REQUIRE MONOSPACED FONT!!!
{\tinytt{%VERB
...1...................1.....1....1.1.......1........1...........1.....................1.......11...%END
}} & $-$50.1  \\
{\tinytt{%VERB
......................1.....1.....1.......1....1.........1.....................................1....%END
}} & $-$37.3  \\
{\tinytt{%VERB
........1....1..1...1....11..1.1.........11.........................1...1.1..1...1................1.%END
}} & $-$65.9  \\
{\tinytt{%VERB
1.1...1................1.......................11.1..1............................1.....1..1.11.....%END
}} & $-$56.4  \\
{\tinytt{%VERB
...11...........1...1.....1.1......1..........1....1...1.....1............1.........................%END
}} & $-$53.2  \\
{\tinytt{%VERB
..............1......1.........1.1.......1..........1............1...1......................1.......%END
}} & $-$43.7  \\
{\tinytt{%VERB
.....1........1.......1...1............1............1...........1......1..11........................%END
}} & $-$46.8  \\
{\tinytt{%VERB
.....1..1..1...............111...................1...............1.........1.1...1...1.............1%END
}} & $-$56.4  \\
{\tinytt{%VERB
.........1..........1.....1......1..........1....1..............................................1...%END
}} & $-$37.3  \\
{\tinytt{%VERB
......1........................1..............1.....1..1.1.1..1...................................1.%END
}} & $-$43.7  \\
{\tinytt{%VERB
1.......................1..........1...1...................1....1....1........1..11..1.1...1........%END
}} & $-$56.4  \\
{\tinytt{%VERB
...........11.1.........1................1......1.....................1.............................%END
}} & $-$37.3  \\
{\tinytt{%VERB
.1..........1...1.1.............1.......11...........1.1...1..............1.............11..........%END
}} & $-$56.4  \\
{\tinytt{%VERB
......1...1..1.....1..11.1.1.1...1.....................1............1.............1..1..............%END
}} & $-$59.5  \\
{\tinytt{%VERB
............11.1......1....1..1............................1.......1..............1.......1.........%END
}} & $-$46.8  \\ \midrule % [0.2in]
%															 
{\tinytt{%VERB
....................................................................................................%END
}} & $-$15.2 \\
{\tinytt{%VERB
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111%END
}} & $-$332.1\\
%
\bottomrule
\end{tabular}
\end{center}
}{%
\caption[a]{The top 15 strings are samples from $X^{100}$, 
 where $p_1 = 0.1$ and $p_0 = 0.9$. 
 The bottom two are the most and least probable strings in this ensemble.
 The final column shows the 
% Compare the
 log-probabilities of the random strings,
 which may be compared with the entropy
% with 
% the \aep: $H(X) = 0.469$, so
 $H(X^{100}) = 46.9$ bits.}
\label{tab.typical.tcl}
}
\end{figure}
% 1000 Typical set size +/-    28.46 has log_2(P(x)) within +/-    90.22
%  i.e. 1/N (logp) is within 0.090
% 100  Typical set size +/-        9 has log_2(P(x)) within +/-    28.53
%  i.e. 1/N(logp) is within 0.285
% 200  Typical set size +/-    12.73 has log_2(P(x)) within +/-    40.35
%
% N=100 alternative (see hd.p for the commands)
%
\begin{figure}
\fullwidthfigureright{
%\figuremargin{%
\begin{center}
\begin{tabular}{r@{\hspace*{-0in}}c@{\hspace*{-0.1in}}c} \toprule
 & $N=100$ & $N=1000$ \\ \midrule
\raisebox{0.71in}{\small$n(r) = {N \choose r}$}
  & \mbox{\psfig{figure=Hdelta/figs/num/100.ps,%
width=50mm,angle=-90}} 
  & \mbox{\psfig{figure=Hdelta/figs/num/1000.ps,%
width=50mm,angle=-90}} \\
\raisebox{0.71in}{\small$P(\bx) = p_1^r (1-p_1)^{N-r}$}
 & \mbox{\psfig{figure=Hdelta/figs/per/100.ps,%
width=50mm,angle=-90}}%
\makebox[0in][r]{\raisebox{0.4in}{%
\psfig{figure=Hdelta/figs/perdet/100.ps,%
width=30mm,angle=-90}}\hspace{0.2in}} 
&
\\
\raisebox{0.71in}{\small$\log_2 P(\bx)$}
 & \mbox{\psfig{figure=Hdelta/figs/logper/100.ps,%
width=50mm,angle=-90}} 
 & \mbox{\psfig{figure=Hdelta/figs/logper/1000.ps,%
width=50mm,angle=-90}} \\
\raisebox{0.71in}{\small$n(r)P(\bx)= {N \choose r} p_1^r (1-p_1)^{N-r}$} 
& \mbox{\psfig{figure=Hdelta/figs/tot/100.ps,%
width=50mm,angle=-90}}
& \mbox{\psfig{figure=Hdelta/figs/tot/1000.ps,%
width=50mm,angle=-90}}
% \makebox[0in][l]{$r$}
\\
 & 
$r$ & $r$  \\ \bottomrule
\end{tabular}
\end{center}
}{%
\caption[a]{Anatomy of the typical set $T$.
	For  $p_1=0.1$ 
 and $N=100$ and $N=1000$, these graphs show $n(r)$, the number of 
 strings containing $r$ {\tt{1}}s; the probability $P(\bx)$ of a single 
 string that contains $r$ {\tt{1}}s; the same probability on a 
 log scale; and the total probability 
 $n(r)P(\bx)$ of all strings that contain $r$ {\tt{1}}s. 
 The number $r$ is on the  horizontal axis. 
 The plot of $\log_2 P(\bx)$ also shows by a dotted line the mean value of
 $\log_2 P(\bx)$, namely $-N H_2(p_1)$, which equals $-46.9$
 when $N=100$ and $-469$ when $N=1000$. The typical set includes 
 only the strings that have $\log_2 P(\bx)$ close to this value.
 The range marked {\sf T} shows the set $T_{N \beta}$ (as defined
 in \protect\sectionref{sec.ts})
 for $N=100$ and $\beta = 0.29$ (left) and  $N=1000$,  $\beta = 0.09$ (right).
} 
\label{fig.num.per.tot}
}%
\end{figure}
 The probability of a string $\bx$ that contains $r$ {\tt{1}}s and
 $N\!-\!r$ {\tt{0}}s is
\beq
	P(\bx) = p_1^r (1-p_1)^{N-r} .
\eeq
 The number of strings that contain $r$ {\tt{1}}s  is
\beq
	n(r) =  {N \choose r} .
\eeq
 So the number of {\tt{1}}s, $r$, has a binomial distribution:
\beq
	P(r) =  {N \choose r} p_1^r (1-p_1)^{N-r} .
\eeq
 These functions are shown in \figref{fig.num.per.tot}.
The mean of $r$ is $N p_1$, and its standard deviation is
 $\sqrt{N p_1 (1-p_1)}$ (\pref{sec.first.binomial}).
 If $N$ is 100 then
\beq
	r \sim 	N p_1 \pm  \sqrt{N p_1 (1-p_1)} \simeq 10 \pm 3 .
\eeq
 If $N=1000$ then 
\beq
	r \sim 100 \pm 10 .
\eeq
 Notice that as $N$ gets bigger, the probability distribution
 of $r$ becomes more concentrated, in the sense that
 while the
  range of possible values of $r$  grows
 as $N$, the standard deviation of $r$  
 grows only as $\sqrt{N}$. 
 That $r$ is most likely to fall
 in a small range of values implies
 that the outcome $\bx$ is also most likely to
 fall in a corresponding  small  subset of outcomes
 that we will call the {{\dbf\inds{typical set}}}. 
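
 It may help to see this concentration in terms of log-probability
 (a worked check for the bent coin with $p_1 = 0.1$).
 Taking logarithms of the expression for $P(\bx)$ gives
\beq
 \log_2 P(\bx) = N \log_2 (1-p_1) + r \log_2 \frac{p_1}{1-p_1}
  \simeq -0.152 N - 3.17 r ,
\eeq
 so $\log_2 P(\bx)$ is a linear function of $r$. For $N=100$,
 a string with $r = N p_1 = 10$ has $\log_2 P(\bx) \simeq -46.9$,
 and each additional {\tt{1}} lowers the log-probability by about $3.17$.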

\subsection{Definition of the typical set}
\label{sec.ts}
% Let us generalize our discussion to an arbitrary ensemble $X$
% with alphabet $\A_X$
% and define typicality.
 Let us define \ind{typicality}\index{typical set!for compression}
 for an arbitrary ensemble $X$
 with alphabet $\A_X$.
 Our definition of a typical string will
 involve the string's probability.
 A long string
% message
 of $N$ symbols will usually
 contain
% with high	probability
 about $p_1N$ occurrences of the first symbol,
	$p_2N$ occurrences of the second, etc. Hence the probability
	of this string
% long message
 is roughly
\beq
	P(\bx)_{\rm typ}
 = P(x_1)P(x_2)P(x_3) \ldots P(x_N)
 \simeq p_1^{(p_1N)} p_2^{(p_2N)} \ldots p_I^{(p_IN)}
\eeq
%  p_i^{p_iN}
 so that
 the information content of a typical string is
\beq
	\log_2 \frac{1}{P(\bx)}
 \simeq N \sum_i p_i \log_2 \frac{1}{p_i} \simeq N H . 
\eeq
	So the random variable $\log_2 \!\dfrac{1}{P(\bx)}$,
%	So the random variable $\frac{1}{N} \log_2 \frac{1}{P(\bx)}$,
% which is the average information content per symbol, is
 which is the  information content of $\bx$, is
	very likely to be close in  value  to $N H$.
 We build our definition of typicality on this observation.
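
 The step from the product to the sum is just the logarithm of
 a product; written out in full,
\beq
 \log_2 \frac{1}{p_1^{(p_1N)} p_2^{(p_2N)} \ldots p_I^{(p_IN)}}
 = \sum_{i=1}^{I} p_i N \log_2 \frac{1}{p_i}
 = N \sum_{i} p_i \log_2 \frac{1}{p_i} .
\eeq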

 We   define  the typical elements of $\A_X^N$ to be
	those elements that
	have probability  close to $2^{-NH}$. (Note that the typical set,
 unlike the
% best subset for compression
 smallest sufficient subset,  does
	{\em not\/} include the most probable elements of $\A_X^N$, but we
	will show that these most probable elements
 contribute negligible probability.)

 We introduce a parameter $\beta$ that defines how close
 the probability has to be  to   $2^{-NH}$ for
 an element to be `typical'.
% $\beta$-
 We call the set of typical elements the typical set,
% $T$, or, to be more precise,
 $T_{N \beta}$:
% , where the parameter $\beta$
%% controls the breadth of the typical set by defining
% defines what we mean by a probability `close' to $2^{-NH}$:
\beq
	T_{N\b} \equiv \left\{ \bx\in\A_X^N : 
	\left| \frac{1}{N} \log_2 \frac{1}{P(\bx)} - H \right| < \b 
	\right\} .
\label{eq.TNb}
\eeq
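 For a binary ensemble with probabilities $p_0 > p_1$, this definition
 can be rewritten in terms of the number of {\tt{1}}s, $r$
 (a short derivation, offered as a check). Since
 $\log_2 \frac{1}{P(\bx)} = (N-r) \log_2 \frac{1}{p_0} + r \log_2 \frac{1}{p_1}$
 and $H = p_0 \log_2 \frac{1}{p_0} + p_1 \log_2 \frac{1}{p_1}$,
\beq
 \frac{1}{N} \log_2 \frac{1}{P(\bx)} - H
 = \frac{r - N p_1}{N} \log_2 \frac{p_0}{p_1} ,
\eeq
 so $\bx \in T_{N\beta}$ if and only if
 $|r - N p_1| < N \beta / \log_2 (p_0/p_1)$.
 With $p_1 = 0.1$, $\log_2 (p_0/p_1) \simeq 3.17$; for $N=100$ and
 $\beta = 0.29$ this gives $|r - 10| < 9.15$, that is, $1 \leq r \leq 19$,
 and for $N=1000$ and $\beta = 0.09$ it gives $|r-100| < 28.4$,
 that is, $72 \leq r \leq 128$.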
%
% check whether < has propagated to all necessary places
%

 We will show  that whatever value of $\beta$ we choose,
 the typical set  contains almost all the probability
 as $N$ increases. 

 This important result is sometimes called the
 {\dem `asymptotic equipartition' principle}.\index{asymptotic equipartition}
% \newpage
%\section{`Asymptotic Equipartition' and Source Coding}
\label{sec.aep}
%	We will prove the following result: 
\begin{description}
\item[`Asymptotic equipartition' principle\puncspace]
% (AEP).]
 For an ensemble of $N$ independent identically distributed (\ind{i.i.d.})
 random variables 
 $X^N \equiv ( X_1, X_2, \ldots, X_N )$, with $N$ sufficiently large, 
 the outcome $\bx = (x_1,x_2,\ldots, x_N)$ is almost certain to belong 
 to a subset of $\A_X^N$ having only $2^{N H(X)}$ members, each having 
 probability `close to' $2^{-N H(X)}$.
\end{description}
 Notice that if $H(X) < H_0(X)$ then $2^{N H(X)}$ is a {\em tiny\/}
 fraction of the number of possible outcomes $|\A_X^N|=|\A_X|^N=2^{N
 H_0(X)}.$
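
 To put a number on `tiny' (a rough worked figure for the bent coin
 with $p_1 = 0.1$, for which $H(X) \simeq 0.469$ bits and
 $H_0(X) = 1$ bit), the fraction is
\beq
 \frac{2^{N H(X)}}{2^{N H_0(X)}} = 2^{-N ( H_0(X) - H(X) )}
 \simeq 2^{-0.531 N} ,
\eeq
 which for $N=100$ is about $2^{-53} \sim 10^{-16}$.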

\begin{aside}
 The term \ind{equipartition} is chosen to describe the idea
 that the members of the typical set have {\em roughly equal\/}
 probability. [This should not be taken too literally, hence my
 use of quotes around `asymptotic equipartition';
% in the phrase \aep;
 see page \pageref{sec.aep.caveat}.]

 A second meaning for equipartition, in thermal physics,
 is the idea that each degree of freedom of a classical system
 has {equal\/} average energy, $\half kT$. This second meaning
 is not intended here.
\end{aside}

%
	The \aep\ is equivalent to:
\begin{description}
\item[Shannon's source coding theorem (verbal statement)\puncspace]
	 $N$ i.i.d.\ random variables each 
	with entropy $H(X)$ can be compressed into more than $NH(X)$ bits with 
	negligible risk of information  loss, as $N\rightarrow \infty$; 
	conversely if they are compressed into fewer than $NH(X)$ bits 
 	it is virtually certain that information will be lost.
\end{description}
 These two theorems are equivalent
	because we can define a compression algorithm that gives a distinct 
 name of length $N H(X)$ bits to each $\bx$ in the typical set.
% probable subset. 
% as follows: 
% enumerate the $\bx$ belonging to 
% the subset of $2^{N H(X)}$ equiprobable outcomes as 000\ldots000, 
% 000\ldots001, etc. 


\begin{figure}
\figuredangle{%
\begin{center}
%%%%%%%% written by hand    see also X.tex
%
% picture of Sdelta for X^100
%
\newcommand{\axislevel}{27}
\newcommand{\axislevelp}{32.5}
\newcommand{\axislevelm}{24}
\newcommand{\axislevelmm}{21}
\newcommand{\forestgap}{-0.4}
\newcommand{\forestgab}{-0.6}
\newcommand{\forestgac}{-0.56}
\newcommand{\forestgad}{-0.52}
\newcommand{\forestgae}{-0.48}
\newcommand{\forestgaf}{-0.44}
% \newcommand{\forestgag}{0.48}
%\newcommand{\forestgap}{0.35} was .35 when I went up to 14.
\newcommand{\forest}[3]{\multiput(#1)(\forestgap,0){#2}{\line(0,1){#3}}}
\newcommand{\foresb}[4]{\multiput(#1)(#4,0){#2}{\line(0,1){#3}}}
%
% picture
%
%\setlength{\unitlength}{2.45pt}%
\setlength{\unitlength}{2.87pt}%
\begin{picture}(170,81)(-170,-42)
\forest{0,0}{1}{16.5}%
\foresb{-5,0}{2}{16}{\forestgab}
\foresb{-10,0}{3}{15.5}{\forestgab}
\foresb{-15,0}{4}{15}{\forestgac}
\foresb{-20,0}{5}{14.5}{\forestgad}
\foresb{-25,0}{6}{14}{\forestgae}
\foresb{-30,0}{7}{13.5}{\forestgaf}
\foresb{-35,0}{8}{13}{\forestgap}
\foresb{-40,0}{9}{12.5}{\forestgap}
\forest{-45,0}{10}{12}%
\forest{-50,0}{11}{11.5}%
\forest{-55,0}{12}{11}%
\forest{-60,0}{12}{10.5}%
\forest{-65,0}{12}{10}%
\forest{-70,0}{12}{9.5}%
\forest{-75,0}{12}{9}%
\forest{-80,0}{12}{8.5}%
\forest{-85,0}{12}{8}%
\forest{-90,0}{12}{7.5}%
\forest{-95,0}{12}{7}%
\forest{-100,0}{12}{6.5}%
\forest{-105,0}{12}{6}%
\forest{-110,0}{12}{5.5}%
\forest{-115,0}{11}{5}%
\forest{-120,0}{10}{4.5}%
\foresb{-125,0}{9}{4.2}{\forestgap}
\foresb{-130,0}{8}{3.9}{\forestgap}
\foresb{-135,0}{7}{3.6}{\forestgaf}
\foresb{-140,0}{6}{3.3}{\forestgae}
\foresb{-145,0}{5}{3.0}{\forestgad}
\foresb{-150,0}{4}{2.7}{\forestgac}
\foresb{-155,0}{3}{2.4}{\forestgab}
\foresb{-160,0}{2}{2.1}{\forestgab}
\forest{-165,0}{1}{1.8}%
%
% axis: 
\put(-168,\axislevelm){\vector(1,0){171.0}}
%
% axis labels
\put(0,\axislevelp){\makebox(0,0)[br]{\small$\log_2 P(x)$}}
\put(-42.4,\axislevel){\makebox(0,0)[b]{\small$-NH(X)$}}
% tic mark  (was at -40 until Tue 8/1/02)
\put(-42.4,\axislevelm){\line(0,1){2}}
% the S0 box
%\put(-3,-2.5){\framebox(172,\axislevelm){}}
%\put(142,16){\makebox(0,0)[l]{$S_0$}}
%
%
% typical set box
\put(-49.5,-1){\framebox(15,\axislevelmm){}}
\put(-51,16){\makebox(0,0)[r]{$T_{N\b}$}}
%
% object labels
\put(0,-40){\vector(0,1){35}}  
\put(-15,-35){\vector(0,1){30}}  
%\put(26,-30){\vector(0,1){25}}  
\put(-36,-25){\vector(0,1){20}}  
\put(-46,-20){\vector(0,1){15}}  
%\put(56,-15){\vector(0,1){10}}  
\put(-155,-10){\vector(0,1){5}}  
\put( 0,-40){\makebox(0,0)[tr]{\footnotesize{{\tt 0000000000000}\ldots{\tt{00000000000}}}}}
\put(-15,-35){\makebox(0,0)[tr]{\footnotesize{{\tt 0001000000000}\ldots{\tt{00000000000}}}}}
%\put(26,-30){\makebox(0,0)[tl]{\footnotesize{{\tt 0000001000000}\ldots{\tt{00000010000}}}}}
\put(-36,-25){\makebox(0,0)[tr]{\footnotesize{{\tt 0100000001000}\ldots{\tt{00010000000}}}}}
\put(-46,-20){\makebox(0,0)[tr]{\footnotesize{{\tt 0000100000010}\ldots{\tt{00001000010}}}}}
%\put(56,-15){\makebox(0,0)[tl]{\footnotesize{{\tt 0100001000100}\ldots{\tt{00010100100}}}}}
\put(-155,-10){\makebox(0,0)[tl]{\footnotesize{{\tt 1111111111110}\ldots{\tt{11111110111}}}}}
\end{picture}
%
%
%
%

\end{center}
}{%
\caption[a]{Schematic diagram showing all strings
 in the ensemble $X^{N}$
% with $p_0 = 0.9, p_1=0.1$
% of large length $N$
 ranked by their probability, and
 the typical set $T_{N\b}$.}
\label{fig.typical.set.explain}
}%
\end{figure}


\section{Proofs}
\label{sec.chtwoproof}
 This section may be skipped if found tough going.


\subsection{The law of large numbers}
 Our proof of the source coding theorem  uses  the
 \ind{law of large numbers}.
\begin{description}
% \item[A random variable $u$] is any real function of $x$, 
\item[Mean and variance] of a real random variable $u$
%\footnote
 are $\Exp[u] = \bar{u} = \sum_u P(u) u$ and $\var(u) =
	\sigma^2_u = \Exp[(u-\bar{u})^2] = \sum_u P(u) (u - \bar{u})^2.$
\begin{aside}
 Technical note: 
	strictly I am assuming here that $u$ is a function $u(x)$ of a
	sample $x$ from a finite discrete ensemble $X$. Then the
	summations $\sum_u P(u) f(u)$ should be written $\sum_x P(x)
	f(u(x))$.  This means that $P(u)$ is a finite sum of delta
	functions.  This restriction guarantees that the mean and
	variance of $u$ do exist, which is not necessarily the case for general
	$P(u)$.
\end{aside}

\item[Chebyshev's inequality 1\puncspace]
	Let $t$ be a non-negative real random variable, and\index{Chebyshev inequality} 
 let $\a$ be a positive real number.  Then\index{inequality}
\beq
	P(t \geq \a) \:\leq\: \frac{\bar{t}}{\a}.
\label{eq.cheb.1}
\eeq

	{\sf Proof:} $P(t \geq \a) = \sum_{t \geq \a} P(t)$. 
 We multiply each 
 term by $t/\a \geq 1$ and obtain: 
 $P(t \geq \a) \leq \sum_{t \geq \a} P(t) t/\a.$
 We add the (non-negative) missing terms and obtain:
 $P(t \geq \a) \leq \sum_{t} P(t) t/\a = \bar{t}/\a$. \hfill$\epfsymbol$\par

\item[Chebyshev's inequality 2\puncspace]
	Let $x$ be a random variable, and let $\a$ be a positive real number.
 Then
\beq
	P\left( (x-\bar{x})^2 \geq \a \right) \:\leq\: \sigma^2_x / \a.
\eeq

{\sf Proof:} Take $t = (x-\bar{x})^2$ and apply the previous proposition. \hfill$\epfsymbol$\par

\item[Weak \ind{law of large numbers}\puncspace]
	Take $x$ to be the average of $N$ independent random variables 
 $h_1, \ldots , h_N$, having common mean $\bar{h}$ and common variance  
 $\sigma^2_h$: $x = \frac{1}{N} \sum_{n=1}^N h_n$. Then 
\beq
	P( (x-\bar{h})^2 \geq \a ) \leq \sigma^2_h/\a N.
\eeq

{\sf Proof:} obtained by showing that $\bar{x}=\bar{h}$ and that 
 $\sigma^2_x = \sigma^2_h/ N$. \hfill$\epfsymbol$\par

\end{description}
 We are interested in $x$ being very close to the mean ($\a$ very small).
 No matter how large $\sigma^2_h$ is, and no matter how small the
 required $\a$ is, and no matter how small the desired probability that
 $(x-\bar{h})^2 \geq \a$, we can always achieve it by
 taking $N$ large enough.
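
 The weak law can be turned into an explicit recipe for $N$. As a sketch,
 write $p$ for the desired probability (the symbol $p$ and the numbers
 below are introduced here purely for illustration). To guarantee
 $P( (x-\bar{h})^2 \geq \alpha ) \leq p$ it suffices to take
\beq
	N \:\geq\: \frac{\sigma^2_h}{\alpha \, p} .
\eeq
 For example, if $\sigma^2_h = 1$, $\alpha = 0.01$ (so that $x$ is
 within $0.1$ of the mean), and $p = 0.05$, then
 $N \geq 1/(0.01 \times 0.05) = 2000$ is sufficient. Chebyshev's
 inequality is usually loose, so a smaller $N$ may well do; the bound
 only says that this $N$ is enough.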

\subsection{Proof of theorem \protect\ref{thm.sct} (\pref{thm.sct})}
% the source coding theorem}
% or could say theorem 1
 We apply the law of large numbers to the random variable $\frac{1}{N}
 \log_2 \frac{1}{P(\bx)}$ defined for $\bx$ drawn from the ensemble $X^N$. 
 This random variable can be written as the average of $N$ information
 contents 
 $h_n = \log_2 ( 1 / P(x_n))$, each of which is a random variable with 
 mean $H = H(X)$ and variance $\sigma^2 \equiv \var[ \log_2 ( 1 / P(x_n)) ]$.
 (Each  term $h_n$
 is  the Shannon information content of the $n$th
 outcome.)

 We again define the typical set with parameters $N$ and $\beta$ thus: 
\beq
	T_{N\b} = \left\{ \bx\in\A_X^N : 
	\left[ \frac{1}{N} \log_2 \frac{1}{P(\bx)} - H \right]^2 < \b^2 
	\right\} .
\label{eq.TNb.2}
\eeq
 For all $\bx \in T_{N\b}$, the probability of $\bx$ satisfies
\beq
2^{-N(H+\b)} < P(\bx) < 2^{-N(H-\b)}.
\eeq
 And by the law of large numbers, 
\beq
	P(\bx \in T_{N\b}) \geq 1 - \frac{\sigma^2}{\b^2 N} .
\eeq
 We have thus proved the \aep. As $N$ increases, the probability
 that $\bx$ falls in  $T_{N\b}$ approaches 1, for any $\beta$.
 How does this result relate to source coding?

%	We will prove the \aep\ first; then w
 We must relate $T_{N\b}$ to $H_{\delta}(X^N)$.
 We will
 show that for any given $\delta$ there is
	a sufficiently big $N$ such that
	$H_{\delta}(X^N) \simeq N H$.
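
 As a numerical sketch of the bound on $P(\bx \in T_{N\beta})$, consider
 a binary ensemble with $p_0 = 0.9$ and $p_1 = 0.1$ (values assumed here
 for illustration). The two information contents are
 $\log_2 (1/0.9) \simeq 0.152$ bits and $\log_2 (1/0.1) \simeq 3.32$ bits, so
\beq
	H \simeq 0.469 \mbox{ bits}, \:\:\:\:
	\sigma^2 = p_0 p_1 \left[ \log_2 \frac{p_0}{p_1} \right]^2
	\simeq 0.09 \times 3.17^2 \simeq 0.90 .
\eeq
 Choosing $\beta = 0.1$, the law of large numbers gives
\beq
	P(\bx \in T_{N\beta}) \:\geq\: 1 - \frac{0.90}{0.01 \, N}
	\:=\: 1 - \frac{90}{N} ,
\eeq
 so for $N = 1000$ the typical set has probability at least $0.91$,
 and each of its members has probability between $2^{-569}$ and $2^{-369}$.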


\subsubsection{Part 1:  $\frac{1}{N} H_{\delta}(X^N) <  H + 
	\epsilon$.}
% of the source coding theorem.
%
% More words here reminding what H_delta is
%
 The set $T_{N\b}$ is not the best subset for  compression. So the
 size of $T_{N\b}$ gives an upper bound on $H_{\delta}$.
 We show how {\em small} $H_{\delta}(X^N)$ must be by calculating
% the  largest cardinality that $T_{N\b}$ could have.
 how big  $T_{N\b}$  could possibly be.
 We are
 free to set $\beta$ to any convenient value.
 The smallest possible 
 probability that a member of $T_{N\b}$ can have is  $2^{-N(H+\b)}$, and 
 the  total probability that $T_{N\b}$ contains can't be any bigger
 than 1. So 
\beq
	|T_{N\b}|  \,  2^{-N(H+\b)}  < 1 ,
\eeq
 that is, the size of the typical set is bounded by
% so we can bound
\beq
	|T_{N\b}| < 2^{N(H+\b)} . 
\eeq
 If we set $\b = \epsilon$ and $N_0$ such that
 $\frac{\sigma^2}{\epsilon^2 N_0} \leq \delta$, then $P(T_{N\b}) \geq
 1 - \delta$ for all $N \geq N_0$,
 and the set $T_{N\b}$ becomes a witness to the fact that
 $H_{\delta}(X^N) \leq \log_2 | T_{N\b} | < N ( H + \epsilon)$.
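
 The condition on $N_0$ can be solved explicitly. As a sketch, with
 illustrative numbers that are assumptions and not taken from the text
 ($\sigma^2 = 0.9$, roughly the value for a binary ensemble with
 $p_1 = 0.1$; $\epsilon = 0.1$; $\delta = 0.01$):
\beq
	N_0 \:\geq\: \frac{\sigma^2}{\epsilon^2 \delta}
	\:=\: \frac{0.9}{0.01 \times 0.01} \:=\: 9000 .
\eeq
 This $N_0$ is sufficient for the argument; because it comes from
 Chebyshev's inequality it need not be the smallest block length that works.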
%
\amarginfig{b}{
{\footnotesize
\setlength{\unitlength}{1.2mm}
\begin{picture}(40,40)(-5,0)
\put(5,5){\makebox(0,0)[bl]{\psfig{figure=figs/gallager/Hdeltaconcept.eps,width=36mm}}}
\put(5,35){\makebox(0,0){$\smallfrac{1}{N} H_{\delta}(X^N)$}}
\put(5,27){\makebox(0,0)[r]{$H_0(X)$}}
\put(5,4){\makebox(0,0)[t]{$0$}}
\put(30,4){\makebox(0,0)[t]{$1$}}
\put(35,4){\makebox(0,0)[t]{$\delta$}}
\put(33,11){\makebox(0,0)[l]{$H-\epsilon$}}
\put(33,15){\makebox(0,0)[l]{$H$}}
\put(33,19){\makebox(0,0)[l]{$H+\epsilon$}}
\end{picture}
}
\caption[a]{Schematic illustration of the two parts of the theorem.
 Given any $\delta$ and $\epsilon$, we show that
 for large enough $N$, $\frac{1}{N} H_{\delta}(X^N)$
 lies (1) below the line 
 $H+\epsilon$ and (2) above the line $H-\epsilon$.}
\label{fig.Hd.schem}
}
\subsubsection{Part 2: $\frac{1}{N} H_{\delta}(X^N) > 
	H - \epsilon$.}
% of the source coding theorem.} 

%
% needs work ,sanjoy says: 
%
% (jan 99)_
%
 Imagine that someone claims this second part is not so -- that,
 for any $N$, the 
 smallest $\delta$-sufficient subset $S_{\delta}$ is smaller than the above
 inequality would allow.
% They claim that 
% $|S_{}| \leq 2^{N(H-\epsilon)}$   and $P(\bx \in S_{})
% \geq 1 - \delta$.
 We can   make use of our typical set to show that they must be mistaken.
 Remember that we are free to set $\beta$ to any value we choose.
 We will set $\beta = \epsilon/2$, so that our task is to 
 prove that  a 
% that an alternative {\em smaller\/}
 subset $S'_{}$ having 
 $|S'_{}| \leq 2^{N(H-2\beta)}$ and achieving $P(\bx \in S'_{}) \geq 1 - \delta$
 cannot exist (for $N$ greater than an $N_0$ that we will specify).
%(We attach the
% prime to $S$ to denote the fact that this is a conjectured smallest subset.)

 So, let us consider the probability of falling in this rival smaller subset $S'_{}$.
 The probability of the subset $S'_{}$ is\marginpar[t]{%
\begin{center}
\raisebox{-0.5in}[0in][0in]{
%%%%%%%% written by hand   Sun 22/12/02
%
% Venn picture
%
%
\setlength{\unitlength}{0.321pt}%
{\begin{picture}(452,215)(-173,-132)% 
% axis labels
\put(-100,39){\makebox(0,0)[r]{\small$T_{N\b}$}}
\put(100,39){\makebox(0,0)[l]{\small$S'$}}
\thinlines
\put(-33,-1){\circle{126}}
\thicklines
\put(33,-1){\circle{126}}
\thinlines
\put(18,-85){\vector(-1,4){18}}
\put(33,-90){\makebox(0,0)[t]{\small$ S'_{} \cap T_{N\b} $}}
\put(105,-51){\vector(-1,1){40}}
\put(112,-39){\makebox(0,0)[tl]{\small$ S'_{} \cap \overline{T_{N\b}} $}}
\end{picture}}
%
%
%
%

\end{center}}
\beq
	P(\bx \in S'_{}) \,=\,	P(\bx \in S' \! \cap \! T_{N\b}) + 
 P(\bx \in S'_{} \!\cap\! \overline{T_{N\b}}),
\eeq
 where $\overline{T_{N\b}}$ denotes 
 the complement $\{ \bx \not \in T_{N\b}\}$.
 The maximum value of the first term is found if
 $S'_{} \cap T_{N\b} $ contains
 $2^{N(H-2\beta)}$ outcomes all with the maximum probability,   
 $2^{-N(H-\beta)}$. The maximum value  the second term can have is 
 $P( \bx \not \in T_{N\b})$. So: 
\beq
	P(\bx \in S'_{}) \, \leq  \, 2^{N(H-2\beta)}
                \, 2^{-N(H-\beta)}
      + \frac{\sigma^2}{\b^2 N} 
	= 2^{-N \b} + \frac{\sigma^2}{\b^2 N} .
\eeq
 We can now set $\b = \epsilon/2$ and $N_0$ such that $P(\bx \in S'_{}) < 1-
 \delta$, which shows that $S'$ cannot satisfy the definition of
 a sufficient subset $S_{\delta}$.
 Thus {\em any\/} subset $S'$ with size
 $|S'| \leq 2^{N(H-\epsilon)}$ has probability less than $1-\delta$, so
 by the definition of $H_\delta$, $H_{\delta}(X^N) > N ( H - \epsilon)$.
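
 To see concretely that a suitable $N_0$ exists, here is a sketch with
 assumed values $\sigma^2 = 0.9$, $\epsilon = 0.1$ (so $\beta = 0.05$)
 and $\delta = 0.5$. The second term of the bound is
 $\sigma^2/(\beta^2 N) = 360/N$, and at $N = 1000$
\beq
	P(\bx \in S') \:\leq\: 2^{-50} + \frac{360}{1000}
	\:\simeq\: 10^{-15} + 0.36 \:<\: 0.5 \:=\: 1 - \delta .
\eeq
 The exponential term is negligible; it is the law-of-large-numbers
 term that determines how large $N_0$ must be.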

% this sentence used to be below at
% hereherehere
 Thus for large enough $N$, 
 the function
 $\frac{1}{N} H_{\delta}(X^N)$ is essentially a constant function of $\delta$,
 for $0 < \delta < 1$,
 as  illustrated in figures \ref{fig.hd.10.1010}
 and \ref{fig.Hd.schem}. \hfill $\Box$


\section{Comments}
 The source coding theorem  (\pref{thm.sct}) has two parts,
	$\frac{1}{N} H_{\delta}(X^N)  < H + \epsilon$, 
 and
 $\frac{1}{N} H_{\delta}(X^N) > 
	H - \epsilon$.
% $H  -\frac{1}{N} H_{\delta}(X^N)< \epsilon$.
 Both results  are interesting. 

 The first part tells us that even if the probability of
 error $\delta$ is extremely small, 
 the
% average
 number of bits per symbol
 $\frac{1}{N} H_{\delta}(X^N)$ needed to specify a long $N$-symbol 
 string $\bx$ with vanishingly 
 small error probability does not 
 have to exceed $H+ \epsilon$ bits. 
 We  need to have only a tiny tolerance for error, and the number of bits 
 required drops significantly from $H_0(X)$ to $(H + \epsilon)$. 

 What happens if we are yet more tolerant to compression errors? Part
 2 tells us that even if $\delta$ is very close to 1, so that  errors
  are made most of the time, the average number of bits per symbol needed to
 specify $\bx$ must  still  be at least $H - \epsilon$ bits. These two
 extremes tell us that regardless of our specific allowance for error,
 the number of bits per symbol needed to specify $\bx$ is
% boils down to
 $H$ bits; no more and no less. 
\medskip

% hereherehere

%In section 2.4.2 `$\epsilon$ can decrease with increasing $N$'. I'd prefer
%something like $N$ increases with decreasing $\epsilon$', since $N$ 
%depends on $\epsilon$ and not vice versa -- if I got it right.
% caution warning
\subsection{Caveat regarding `asymptotic equipartition'}
\label{sec.aep.caveat}
 \index{caution!equipartition}I
 put the words `asymptotic equipartition' in quotes because 
 it is important not to\index{asymptotic equipartition!why it is a misleading term}
% be misled into
 think that the 
 elements of the typical set $T_{N\beta}$
 really do have roughly the same 
 probability as each other. They are  similar in probability only
 in the sense that their values of $\log_2 \frac{1}{P(\bx)}$ are 
 within $2 N \beta$ of each other. Now, as $\beta$ is decreased,
 how does $N$ have to increase, if we are to keep our bound on the
 mass of the typical set, 
 $P(\bx \in T_{N\beta}) \geq 1 - \frac{\sigma^2}{\beta^2 N}$, constant?
% CHANGED 9802:
% Since $\beta$ can decrease
%scales
% with increasing
 $N$ must grow as $1/ \beta^2$, so, if we write
 $\beta$ in terms of 
 $N$ as $\alpha/\sqrt{N}$, for some constant $\alpha$, then
 the  most probable string in the typical set will be of order 
 $2^{\alpha \sqrt{N}}$ times greater than the least probable string in the 
 typical set. As $\beta$ decreases, $N$ increases,
 and this ratio $2^{\alpha \sqrt{N}}$ grows exponentially.
 Thus we  have `equipartition'  only in a weak sense!
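
 For a sense of scale, take $N = 1000$ and $\beta = 0.1$ (illustrative
 values assumed here). The values of $\log_2 \frac{1}{P(\bx)}$ within
 the typical set may then differ by up to $2 N \beta = 200$ bits, so
 the most probable member may be as much as
\beq
	2^{2 N \beta} \:=\: 2^{200} \:\simeq\: 10^{60}
\eeq
 times more probable than the least probable member.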
% relative

\subsection{Why did we introduce the typical set?}
 The best choice of subset for block compression is (by definition)
 $S_{\delta}$, not a typical set. So why did we bother introducing
 the typical set? The answer is, {\em we can count the typical set}.
 We know that all its elements have `almost identical' probability ($2^{-NH}$),
 and we know the whole set has probability almost 1, so the typical
 set must have roughly $2^{NH}$ elements.
 Without the help of the typical set (which is very similar
 to $S_{\delta}$) it would have been
 hard to count how many elements there are in $S_{\delta}$.

%\section{Summary and overview}
%\section{Where next}
% We have established that the entropy $H(X)$ measures
% the average information content of an ensemble.
%%
% In this chapter we discussed a lossy {block}-compression scheme that 
% used large blocks of fixed size.
% In the next chapter we  discuss variable length compression schemes that are
% practical for small block sizes and that are not lossy.
%%
%
\section{Exercises}
% weighing problems in here
% ITPRNN Problem 1a
%
\subsection*{Weighing problems}
%
\exercisaxB{1}{ex.weighexplain}{
 While some people, when  they first  encounter   
 the
 weighing problem with  12 balls and the three-outcome balance (\exerciseref{ex.weigh}),
 think that weighing six balls against six balls is a good first weighing,
 others say `no, weighing six against six conveys {\em no\/} information
 at all'.  Explain to the second group why they are both right and
 wrong.  Compute the information gained about {\em  which is the
 odd ball\/}, and the information gained about {\em  which is the
 odd ball and whether it is heavy or light}.
}
\exercisaxB{2}{ex.weighthirtynine}{
  Solve the weighing problem for the case where there are 39 balls
 of which one is known to be odd.
}
\exercisaxB{2}{ex.binaryweigh}{
 You are given 16 balls, all of which are equal in weight except for
 one that is either heavier or lighter. You are also given a bizarre
 two-pan balance that can  report only two outcomes: `the two sides balance'
 or `the two sides do not balance'.
 Design a
 strategy to determine which is the odd ball {in as few uses of the balance
 as possible}.
}
\exercisaxB{2}{ex.flourforty}{
	You have a two-pan balance; your job is to weigh
 out bags of flour with integer weights  1 to 40 pounds inclusive.
 How many weights do you need? [You are allowed
 to put  weights on either pan. You're only allowed to
 put one flour bag on the balance at a time.]
}
\exercissxC{4}{ex.twelve.generalize.weigh}{ 
\ben
\item% {ex.weigh}
 Is it possible to solve  \exerciseref{ex.weigh}
 (the
 weighing problem with  12 balls and the three-outcome balance)
 using a sequence of three {\em fixed\/} weighings, such that the
 balls chosen for the second weighing do not depend on the outcome of the first, and
 the third weighing does not depend on the first or second?
\item
 Find a  solution to the general $N$-ball weighing problem in which exactly one of  $N$
 balls is odd.
 Show that in $W$ weighings, an odd ball can be identified from among 
$N = (3^W - 3 )/2$ balls.
%How large can $N$ be if you are allowed $W$ weighings? 
% How are the weighings arranged in the case of the largest $N$? 
\een
}
\exercisaxC{3}{ex.twelve.two.weigh}{ 
 You are given 12 balls and the three-outcome balance 
 of \exerciseonlyref{ex.weigh}; this time, {\em two} of the balls are odd;
 each odd ball may be heavy or light, and we don't know which.
 We want to identify the odd balls and in which direction they are odd.
\ben
\item
 {\em Estimate\/} how many weighings are required by the optimal strategy.
 And what if there are three odd balls?
%\item
% How do your answers change if it is known in advance that 
% the odd balls will all have the same bias (all heavy, or all light)?
\item
 How do your answers change if it is known that all the regular balls
 weigh 100\grams, that light balls weigh 99\grams, and heavy ones
 weigh 110\grams?
\een
}

% end weighing
\subsection*{Source coding with a lossy compressor, with loss $\delta$}
\exercissxB{2}{ex.Hd46}{
% Let ${\cal P}_X = \{ 0.4,0.6 \}$. Sketch $\frac{1}{N} H_{\delta}(X^N)$
% as a function of $\delta$ for $N=1,2$ and 100.
 Let ${\cal P}_X = \{ 0.2,0.8 \}$. Sketch $\frac{1}{N} H_{\delta}(X^N)$
 as a function of $\delta$ for $N=1,2$ and 1000.
}
\exercisaxB{2}{ex.Hd55}{
 Let ${\cal P}_Y = \{ 0.5,0.5 \}$. Sketch $\frac{1}{N} H_{\delta}(Y^N)$
 as a function of $\delta$ for $N=1,2,3$ and 100.
}
\exercissxB{2}{ex.HdSB}{ 
 (For \ind{physics} students.)
 Discuss the 
 relationship
% similarities
 between the proof of the \aep\ and the  equivalence\index{entropy!Gibbs}\index{entropy!Boltzmann} 
 (for large systems) of the \ind{Boltzmann entropy} and the \ind{Gibbs entropy}.}
\subsection*{Distributions that don't obey the law of large numbers}
%
% Cauchy distbn here? 
 The \ind{law of large numbers}, which we used in this chapter, 
 shows that the mean  of a set of $N$ i.i.d.\ random variables 
 has a probability distribution that becomes 
% more concentrated
 narrower, with width $\propto 1/\sqrt{N}$, as $N$ increases. 
 However, we have proved this property only for 
 discrete random variables,  that is, for real numbers 
 taking on a {\em finite\/} set of possible values. 
 While many random variables
 with continuous probability distributions also satisfy the 
 law of large numbers, there are important distributions that 
 do not. Some continuous distributions do not have 
 a mean or variance. 
\exercissxB{3}{ex.cauchy}{
 Sketch the \ind{Cauchy distribution}
\beq
	P(x) = \frac{1}{Z} \frac{1}{x^2 + 1} , \:\:\:\: x \in (-\infty,\infty).
\eeq
 What is its normalizing constant $Z$? Can you evaluate
 its mean or variance?

 Consider the sum $z=x_1 + x_2$, where $x_1$ and $x_2$ are independent 
 random variables from a Cauchy 
 distribution. What is $P(z)$? What is the probability 
 distribution of the mean of $x_1$ and $x_2$, $\bar{x}=(x_1+x_2)/2$?
 What is the 
 probability
 distribution of the mean of $N$ samples from this \ind{Cauchy distribution}? 
}
%
\subsection*{Other asymptotic properties}
% Levy flights too?
\exercisaxC{3}{ex.chernoff}{ {\sf\ind{Chernoff bound}.}
 We derived the weak law of large numbers from Chebyshev's inequality\index{Chebyshev inequality}
 (\ref{eq.cheb.1}) by letting the random variable $t$
 in the inequality
$%\beq
	P(t \geq \a) \:\leq\: \bar{t}/\a
%\label{eq.cheb.1a}
$
 be a function, $t = (x-\bar{x})^2$,
 of the random variable $x$ we were interested in.

 Other useful inequalities can be obtained by using other
 functions. The \ind{Chernoff bound}, which is useful\index{bound}
 for bounding the \ind{tail}s of a distribution, is obtained by
 letting $t = \exp( s x)$.

 Show that
\beq
	P( x \geq a ) \leq e^{-sa} g(s) , \:\:\:\mbox{ for any $s>0$ }
\eeq
 and 
\beq
	P( x \leq a ) \leq e^{-sa} g(s) , \:\:\:\mbox{ for any $s<0$ }
\eeq
 where $g(s)$ is the moment-generating function of $x$,
\beq
	g(s) = \sum_x  P(x) \, e^{sx} .
\eeq
%
% Hence show that if $z$ is a sum of $N$ random variables $x$,
%\beq
%	P( z \geq a ) \leq  
%\eeq
}
% end






%
\subsection*{Curious functions related to $p \log 1/p$}
%  SOLN - BORDERLINE
\exercissxE{4}{ex.fxxxxx}{
 This exercise has {no purpose at all}; it's  included
 for the enjoyment of those who like mathematical curiosities.

 Sketch the function
\beq
	f(x) = x^{x^{x^{x^{x^{\cdot^{\cdot^{\cdot}}}}}}} 
%	f(x) = x^{x^{x^{x^{x^{\ddots}}}}} 
\eeq
 for $x \geq 0$.
% To be explicit about the order in which the powers are evaluated, 
% here's another definition of $f$:
%\beq
%	f(x) = x^{\left(x^{\left(x^{\cdot^{\cdot^{\cdot}}}\right)}\right)}
%\eeq
 {\sf Hint:}
 Work out the inverse function to $f$ -- that is, the function $g(y)$
 such that if $x=g(y)$ then $y=f(x)$ --  it's closely related to
 $p \log 1/p$.
% {\sf Hints:}
%\ben
%\item Consider $f(\sqrt{2})$:
% you might be able to persuade yourself
% that $f(\sqrt{2})=2$. You might also be able
% to persuade yourself that  $f(\sqrt{2})=4$. What's going on?
% [Yes, a two-valued function.]
%\item
% For a given  $x$, if $f(x)=y$, then we have $y = x^{y}$, so
% $y$ is found at the intersection of the curves $u_1(y)=x^y$ and $u_2(y)=y$.
%\item
% Work out the inverse function to $f$ -- that is, the function $g(y)$
% such that if $x=g(y)$ then $y=f(x)$ -- hint: it's closely related to
% $p \log 1/p$.
%\een
}



\dvips
%\chapter{The Source Coding Theorem (old version of this Chapter)}
%\label{ch.two.old}
%\input{tex/_l2old.tex}
%\dvips
\section{Solutions}% to Chapter \protect\ref{ch.two}'s exercises} 
\fakesection{_s2}
% chapter 2
% ex 39...
%
\soln{ex.Hadditive}{
 Let $P(x,y)=P(x)P(y)$.
 Then 
\beqan
	H(X,Y) &=& \sum_{xy} P(x)P(y) \log \frac{1}{P(x)P(y)} \\
		& = & \sum_{xy} P(x)P(y) \log \frac{1}{P(x)} 
			+ \sum_{xy} P(x)P(y) \log \frac{1}{ P(y)} \\
	&=& \sum_{x} P(x) \log \frac{1}{P(x)} +
		  \sum_{y} P(y) \log \frac{1}{ P(y)} \\
	&=& H(X) + H(Y) .
\eeqan
}
%
\soln{ex.ascii}{
 An ASCII file can be reduced in size by a factor of 7/8. This reduction 
 could be achieved by a  block code that maps 8-byte blocks 
 into 7-byte blocks by copying the
% . The mapping would copy
 56 information-carrying bits  into 
 7 bytes, and ignoring the last bit of every character.  
}
\soln{ex.compress.possible}{
% Theorem:
%  No program can compress without loss *all* files of size >= N bits, for
%  any given integer N >= 0.
%
%Proof:
%  Assume that the program can compress without loss all files of size >= N
%  bits.  Compress with this program all the 2^N files which have exactly N
%  bits.  All compressed files have at most N-1 bits, so there are at most
%  (2^N)-1 different compressed files [2^(N-1) files of size N-1, 2^(N-2) of
%  size N-2, and so on, down to 1 file of size 0]. So at least two different
%  input files must compress to the same output file. Hence the compression
%  program cannot be lossless.
%
%The proof is called the "counting argument". It uses the so-called
 The \ind{pigeon-hole principle}
 states: you can't put 16 pigeons into 15 holes without using one of the
 holes twice.

 Similarly, you can't give $|\A_X|$ outcomes unique 
 binary names of some length $l$
 shorter than $\log_2 |\A_X|$ bits, because there are only $2^l$
 such binary names, and $l < \log_2 |\A_X|$ implies $2^l <  |\A_X|$,
 so at least two different inputs to the compressor would compress to
 the same output file.
}
\soln{ex.cusps}{
 Between the cusps, all the changes in 
 probability are equal, and the number of elements 
 in $T$ changes by one at each step. So $H_{\delta}$ 
 varies logarithmically with $(-\delta)$.
% NEEDS WORK!
}
%
% Another solution from Conway:
% Label them
% F AM NOT LICKED
% then use these divisions
% MA DO   LIKE
% ME TO   FIND
% FAKE    COIN
%
%\soln{ex.twelve.generalize.weigh}{
% Thu, 28 Jan 1999 19:19:30 -0500 (EST)
% From:
% 
\begin{Sexercise}{ex.twelve.generalize.weigh}
 This solution was  found by Dyson and Lyness in 1946
 and presented in the following elegant form by
 {John Conway}\index{Conway, John H.} in 1999.
% \footnote{Posting to   {\tt{geometry-puzzles@forum.swarthmore.edu}}
% Thu, 28 Jan 1999.
%}
%
 Be warned: the symbols A, B, and C are used to  name the
 balls, to name the pans of the balance, 
 to name the outcomes, and to name
 the possible states of the odd ball!
\ben%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% enumerate 1
\item
   Label the 12 balls by the sequences
%
% verbatim not allowed in the argument of a command
%
{\small
\begin{verbatim}
   AAB  ABA  ABB  ABC  BBC  BCA  BCB  BCC  CAA  CAB  CAC  CCA
\end{verbatim}
}
and in the
{\small
\begin{verbatim}
1st               AAB ABA ABB ABC           BBC BCA BCB BCC
2nd weighings put AAB CAA CAB CAC in pan A, ABA ABB ABC BBC in pan B.
3rd               ABA BCA CAA CCA           AAB ABB BCB CAB
\end{verbatim}
}
 Now in a given weighing, a pan will either end up in the
\bit
\item
   {\tt C}anonical position ({\tt C}) that it assumes when the pans are balanced, or
\item
   {\tt A}bove that position ({\tt A}), or
\item
   {\tt B}elow it ({\tt B}),
\eit
 so the three weighings determine for each pan a sequence of three of these letters.

   If both sequences are {\tt CCC}, then there's no odd ball.  Otherwise,
for {\em just one\/} of the two pans, the sequence is among the 12 above,
and names the odd ball, whose weight is {\tt A}bove or {\tt B}elow the proper
one according as the pan is  {\tt A}  or  {\tt B}.
\item

 In $W$  weighings the odd ball can be identified from
 among
\beq
 N = (3^W - 3 )/2
\eeq
 balls in the same way, by labelling them with all
 the non-constant sequences of  $W$  letters from  {\tt A}, {\tt B}, {\tt C}  whose
 first change is  A-to-B  or  B-to-C  or  C-to-A, and at the
 $w$th weighing putting those whose  $w$th  letter is  {\tt A}  in pan {\tt A} 
 and those whose  $w$th  letter is  {\tt B}  in pan {\tt B}.
\een
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%}
\end{Sexercise}
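
 The count $N = (3^W - 3)/2$ in the solution above can be checked by a
 short counting argument, sketched here as a reading aid. There are
 $3^W$ sequences of $W$ letters, of which 3 are constant. Exchanging
 the letters B and C throughout maps each non-constant sequence whose
 first change is A-to-B, B-to-C or C-to-A onto one whose first change
 is A-to-C, C-to-B or B-to-A, and vice versa, so exactly half of the
 $3^W - 3$ non-constant sequences are used as labels:
\beq
	W = 3: \: \frac{27 - 3}{2} = 12 , \:\:\:\:
	W = 4: \: \frac{81 - 3}{2} = 39 , \:\:\:\:
	W = 5: \: \frac{243 - 3}{2} = 120 .
\eeq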
% {ex.twelve.two.weigh}{
% removed old solution to graveyard  Tue 4/3/03
\soln{ex.Hd46}{% ex 42
%  hd.p p=0.2 mmin=1 mmax=2 mstep=1 scale_by_n=1 plot_sub_graphs=1 | gnuplot 
%  hd.p p=0.2 mmin=2 mmax=2 mstep=1 scale_by_n=1 plot_sub_graphs=1 | gnuplot 
%  hd.p p=0.2 mmin=100 mmax=100 mstep=1 suppress_early_detail=1 scale_by_n=1 plot_sub_graphs=1 | gnuplot 
%  hd.p p=0.2 mmin=1000 mmax=1000 mstep=1 suppress_early_detail=1 scale_by_n=1 plot_sub_graphs=1 hd=figs/hd0.2 | gnuplot 
%#                                 gnuplot < gnu/Hd0.2.gnu
%#45:coll:/home/mackay/itp/Hdelta> gv figs/hd0.2/all.1.100.ps
 The curves $\frac{1}{N} H_{\delta}(X^N)$
 as a function of $\delta$ for $N=1,2$ and 1000 are shown in \figref{fig.hd.1.100}.
% and table \ref{tab.Hdelta.0.4}.
 Note that $H_2(0.2) =  0.72$ bits.
\begin{figure}[htbp]
%\figuremargin{%
\figuredanglenudge{%
\begin{center}
\begin{tabular}[t]{rl}
\begin{tabular}[t]{l}\vspace{0in}\\% alignment hack
\mbox{\psfig{figure=Hdelta/figs/hd0.2/all.1.100.ps,%
width=60mm,angle=-90}}
\end{tabular}
%
\hspace{0in}
&
%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{tabular}[t]{r@{--}lcc} \toprule
\multicolumn{4}{c}{$N=1$} \\ \midrule
%    delta          1/N Hdelta        2^{Hdelta}
\multicolumn{2}{c}{$\delta$} & $\frac{1}{N} H_{\delta}(\bX)$ & $2^{H_{\delta}(\bX)}$ 
% raise the roof!
% {\rule[-3mm]{0pt}{8mm}}
\\ \midrule
0    &    0.2 &           1   &         2           \\
0.2  &      1 &           0   &         1           \\ \bottomrule
\end{tabular}
\hspace{0.1in} 
\begin{tabular}[t]{r@{--}lcc} \toprule% {r@{--}lcc}
\multicolumn{4}{c}{$N=2$} \\  \midrule
%    delta          1/N Hdelta        2^{Hdelta}
\multicolumn{2}{c}{$\delta$} & $\frac{1}{N} H_{\delta}(\bX)$ & $2^{H_{\delta}(\bX)}$ 
% raise the roof!
% {\rule[-3mm]{0pt}{8mm}}
\\ \midrule 
0 &         0.04  &           1  &           4            \\
0.04 &      0.2   &       0.79   &          3            \\ % was 0.792\,48
0.2 &       0.36  &         0.5  &           2            \\
0.36 &         1  &           0  &           1            \\ \bottomrule
\end{tabular}\\
\end{tabular}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\end{center}
}{%
\caption[a]{$\frac{1}{N} H_{\delta}(\bX)$ (vertical axis) against $\delta$ (horizontal), 
	for  $N=1, 2, 100$ binary variables with $p_1=0.2$.}
\label{fig.hd.1.100}
\label{tab.Hdelta.0.4}
}{0.25in}
\end{figure}
%\begin{table}[htbp]
%\figuremargin{%
%\begin{center}
%\end{center}
%}{%
%\caption[a]{Values of $\frac{1}{N} H_{\delta}(\bX)$  against $\delta$.}
%% add 0.4 to this caption
%\label{tab.Hdelta.0.4}
%}
%\end{table}
%
}
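
 The $N=2$ table above can be reproduced by hand; the following is a
 sketch of that check. With ${\cal P}_X = \{ 0.2,0.8 \}$ the four
 strings of length 2 have probabilities $0.64$, $0.16$, $0.16$ and
 $0.04$, and $S_{\delta}$ is found by discarding the least probable
 strings for as long as the discarded probability does not exceed $\delta$:
\beq
\begin{array}{lcc}
 \mbox{strings discarded} & \mbox{probability discarded} & |S_{\delta}| \\
 \mbox{none} & 0 & 4 \\
 \mbox{the least probable} & 0.04 & 3 \\
 \mbox{that and one of probability 0.16} & 0.04 + 0.16 = 0.2 & 2 \\
 \mbox{all but the most probable} & 0.04 + 0.16 + 0.16 = 0.36 & 1
\end{array}
\eeq
 The breakpoints in $\delta$ are therefore $0.04$, $0.2$ and $0.36$,
 and $\frac{1}{2} \log_2 |S_{\delta}|$ takes the values
 $1$, $0.79$, $0.5$ and $0$.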
\soln{ex.HdSB}{
 The Gibbs entropy is $\kB \sum_i p_i \ln \frac{1}{p_i}$, where $i$
 runs over all states of the system. This entropy is equivalent  (apart from the factor of $\kB$) 
 to  the Shannon entropy of the ensemble. 

 Whereas the Gibbs entropy can be
 defined for any ensemble, the Boltzmann entropy is only
 defined for  {\dem microcanonical\/} ensembles, which
 have a probability distribution that is uniform over a
 set of accessible states.
 The Boltzmann entropy is defined to be $S_{\rm B} = \kB \ln \Omega$
 where $\Omega$ is the number of accessible states 
 of the  microcanonical  ensemble. This is equivalent 
 (apart from the factor of $\kB$) to the perfect information content 
 $H_0$ of that constrained
 ensemble. The Gibbs entropy of a microcanonical
 ensemble is trivially equal to the Boltzmann entropy. 

 We now  consider a   \ind{thermal distribution} (the
 {\dem\ind{canonical}\/} ensemble),
 where the probability of a state  $\bx$ is 
\beq
%	P(\bx) =\frac{1}{Z} \exp( - \beta E(\bx) )? 
	P(\bx) =\frac{1}{Z} \exp\left( - \frac{ E(\bx) }{\kB T} \right) . 
\eeq
 With this canonical ensemble we can associate a
 corresponding microcanonical ensemble,
% typically
% usually
 an ensemble 
 with  total energy  fixed to the mean
 energy of the canonical ensemble
 (fixed to within some precision $\epsilon$).
% Recalling that under the 
% thermal distribution (the canonical ensemble) we see that
 Now, fixing the total energy to a precision $\epsilon$ is equivalent to 
 fixing the value of $\ln \dfrac{1}{P(\bx)}$ to within
% $\epsilon/\beta$.
 $\epsilon / (\kB T)$.
 Our definition of the typical set 
 $T_{N \beta}$ was precisely that it consisted of all elements that 
 have a  value of $\log P(\bx)$ very close to the mean value
 of $\log P(\bx)$ under the canonical ensemble, $- N H(X)$. 
 Thus the microcanonical ensemble is equivalent to 
 a uniform distribution over 
% constraining the state $\bx$ to be in 
 the typical set of the canonical ensemble. 

 Our proof of the \aep\  thus proves -- for the 
 case of a system whose energy is separable into a sum of independent
 terms -- that the 
 Boltzmann entropy of the microcanonical ensemble 
 is very close (for large $N$) to the Gibbs entropy of 
 the canonical ensemble, if the energy of the microcanonical
 ensemble is constrained to equal the mean energy of the 
 canonical ensemble.
}
\soln{ex.cauchy}{
 The normalizing constant of  the \ind{Cauchy distribution}\index{distribution!Cauchy}
\[
	P(x) = \frac{1}{Z} \frac{1}{x^2 + 1} 
\]
 is
\beq
	Z = \int^{\infty}_{-\infty} \d x \: \frac{1}{x^2 + 1}
  = \left[ {\tan}^{-1} x \right]^{\infty}_{-\infty} = \frac{\pi}{2} - \frac{-\pi}{2} = \pi .
\eeq
 The mean and variance of this distribution are both undefined. (The distribution
 is symmetrical about zero, but this does not imply that its mean is zero. The mean 
 is the value of a divergent integral.)
% ; depending what limiting procedure we 
%  define to evaluate this integral we 
 The sum $z=x_1 + x_2$, where $x_1$ and $x_2$ both 
 have Cauchy distributions, has probability density given by the convolution
\beq
 P(z) = \frac{1}{\pi^2} \int^{\infty}_{-\infty} \d x_1 \:
	\frac{1}{x_1^2 + 1}
	\frac{1}{(z-x_1)^2 + 1}
% P(x1,x2) delta [z=x1+x2] .. -> x2 = z-x1 
 , 
\eeq
%  Introducing $\Delta \equiv x_1-x_2$ this can be written more symmetrically 
%  as
% \beq
%  P(z) = \frac{1}{\pi^2} \int^{\infty}_{-\infty} \d \Delta \:
% \eeq
 which, after considerable labour using standard methods,
%\footnote{Can anyone 
% give me an elegant solution?} 
 gives
\beq
	P(z) = \frac{1}{\pi^2} 2 \frac{\pi}{z^2+4} = \frac{2}{\pi}  \frac{1}{z^2+2^2} ,
\label{eq.cauchysum}
\eeq
 which we recognize as a Cauchy distribution with width parameter 2
 (where the original distribution has width parameter 1).
 This implies that the mean of the two points, $\bar{x} = (x_1+x_2)/2 = z/2$, 
 has a Cauchy distribution with width parameter 1. Generalizing, the mean 
 of $N$ samples from a Cauchy distribution is Cauchy-distributed 
 with the {\em same parameters\/} as the individual samples. The probability 
 distribution of the mean does {\em not\/} become narrower 
 as $1/\sqrt{N}$. 

 {\em The \ind{central-limit theorem} does not apply to the \ind{Cauchy distribution}, 
 because it does not have a finite \ind{variance}.}

 An alternative neat method for getting to \eqref{eq.cauchysum} makes 
 use of the \ind{Fourier transform}\index{generating function}
 of the Cauchy distribution, which is 
 a \index{biexponential distribution}{biexponential} $e^{-|\omega|}$. Convolution in real space 
 corresponds to multiplication in Fourier space,
 so the \ind{Fourier transform} of $z$ is simply $e^{-|2 \omega|}$.
 Reversing the transform, we obtain \eqref{eq.cauchysum}.
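
 The same argument gives a sketch of the result for the mean of $N$ samples,
 assuming the samples are independent and assuming the convention
 in which the transform of $P(x)$ is $\int \d x \: P(x) e^{i \omega x}$.
 The sum $z = x_1 + \cdots + x_N$ has transform
\beq
	\left[ e^{-|\omega|} \right]^N = e^{-N |\omega|} ,
\eeq
 and rescaling to the mean $\bar{x} = z/N$ replaces $\omega$ by $\omega/N$, giving
\beq
	e^{-N |\omega/N|} = e^{-|\omega|} ,
\eeq
 which is the transform of the original Cauchy distribution with width parameter 1.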
}
%\begincuttable
\soln{ex.fxxxxx}{
\amarginfig{t}{
\begin{center}
\begin{tabular}{c}
\psfig{figure=gnu/fxxxxx50.ps,width=1.7in,angle=-90}\\
\psfig{figure=gnu/fxxxxx5.ps,width=1.7in,angle=-90}\\
\psfig{figure=gnu/fxxxxx.5.ps,width=1.7in,angle=-90}\\
\end{tabular}
\end{center}
%}{%                  gnu: load 'fxxxxx.gnu'
\caption[a]{
% The function
$\displaystyle
	f(x) = x_{\:,}^{x^{x^{x^{x^{\cdot^{\cdot^{\cdot}}}}}}} 
$ shown at three different scales.}
\label{fig.xxxxx}
}%
 The function $f(x)$
%\beq
%	f(x) = x^{x^{x^{x^{x^{\ddots}}}}} 
%\eeq
 has inverse function 
% to $f$ is
\beq
 g(y) = y^{1/y}. 
\eeq
 Note
\beq
	\log g(y) = \frac{1}{y} \log y .
\eeq
 I obtained a tentative graph of $f(x)$ by plotting $g(y)$ with
 $y$ along the vertical axis and $g(y)$ along the horizontal
 axis. The resulting  graph suggests that $f(x)$
 is single valued for $x \in (0,1)$, and looks surprisingly well-behaved
 and ordinary; for $x \in (1, e^{1/e})$, $f(x)$ is two-valued.
 For example, $f(\sqrt{2})$ takes both of the values 2 and 4.
 For $x > e^{1/e}$ (which is about 1.44), $f(x)$ is infinite.
% undefined.
 However, it might be argued that this approach to sketching $f(x)$
 is  only partly valid, if we define $f$ as the  limit of the
 sequence of functions  $x$, 
 $x^x$, $x^{x^x}, \ldots$;
	 this sequence does not
 have a limit for
% , below
% pr (1.0/exp(1.0))**exp(1.0)
% 0.0659880358453126
 $0 \leq x \leq  (1/e)^e \simeq 0.07$
 on account of a pitchfork \ind{bifurcation} at $x=(1/e)^e$;
 and for $x \in (1,e^{1/e})$, the sequence's limit is single-valued --
 the lower of the two values sketched in the figure.
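
 Two quick checks on this picture, offered as a sketch rather than a proof.
 First, both values at $x = \sqrt{2}$ are consistent with the inverse function:
\beq
	g(2) = 2^{1/2} = \sqrt{2} , \:\:\:\:  g(4) = 4^{1/4} = \sqrt{2} .
\eeq
 Second, the right-hand end of the two-valued region is the maximum of $g$.
 Taking natural logarithms, $\log g(y) = (\log y)/y$ has derivative
\beq
	\frac{1 - \log y}{y^2} ,
\eeq
 which vanishes at $y = e$, where $g(e) = e^{1/e} \simeq 1.44$.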
% load 'fxxxxx.gnu2'
%
}
%\endcuttable




\dvipsb{solutions source coding}
\prechapter{About     Chapter}
\fakesection{intro for chapter 3}
 In the last chapter, we saw a proof of the fundamental status of the entropy 
 as a measure of average information content.
 We defined a data compression scheme using
 {\em fixed length block codes}, and
 proved that as  $N$ increases,
 it is possible to encode $N$ i.i.d.\ variables 
 $\bx = (x_1,\ldots,x_N)$ into a block of $N(H(X)+\epsilon)$ bits
 with vanishing probability of error, whereas if we attempt to 
 encode $X^N$ into $N(H(X)-\epsilon)$ bits, the probability of 
 error is virtually 1.

        We thus verified the {\em possibility\/} of 
 data compression, but the block coding defined in the proof 
 did not  give a  practical algorithm. 
 In this chapter and the next,
 we  study practical data compression algorithms. 
 Whereas the last chapter's compression scheme
 used large blocks of {\em fixed\/} size and was
 {\em lossy}, in the next chapter we discuss
 {\em variable-length\/} compression schemes that are
 practical for small block sizes and that are {\em not lossy}.

 Imagine a rubber glove filled with water. If we compress two
 fingers of the glove, some other part of the glove has
 to expand, because
 the total volume of water is constant. (Water is essentially
 incompressible.) Similarly, when we shorten
 the codewords for some outcomes, there must be other
 codewords that get longer, if the scheme is not lossy.
 In this chapter we will discover the information-theoretic
 equivalent of water volume.
% the constant volume of water in the glove.
%%
\medskip

\fakesection{prerequisites for chapter 3}
 Before reading  \chref{ch.three}, you should have worked on 
 \extwenty.
\medskip

 We will use the\index{notation!intervals} 
 following notation for intervals:\medskip
% the statement
\begin{center}
\begin{tabular}{ll}
 $x \in [1 ,2)$ & means that $x \geq 1$ and $x < 2$; \\
% the statement 
 $x \in (1 ,2]$ & means that $x > 1$ and $x \leq 2$.\\
\end{tabular}
\end{center}
 

% {All these definitions of source
%        codes, Huffman codes, etc., can be generalized to codes over
%        other $q$-ary alphabets, but little is lost by concentrating on 
%        the binary case.} 


%\chapter{Data Compression II: Symbol Codes}
\mysetcounter{page}{102}
\ENDprechapter
\chapter{Symbol Codes}
\label{ch.three}
% %.tex 
% \documentstyle[twoside,11pt,chapternotes,lsalike]{itchapter}
% \begin{document}
% \bibliographystyle{lsalike} 
% \input{psfig.tex} 
% \include{/home/mackay/tex/newcommands1}
% \include{/home/mackay/tex/newcommands2}
% \input{itprnnchapter.tex} 
% \setcounter{chapter}{2}%  set to previous value
% \setcounter{page}{34} % set to current value 
% \setcounter{exercise_number}{45} % set to imminent value
% % 
% \renewcommand{\bs}{{\bf s}}
% \newcommand{\eq}{\mbox{$=$}}
% \chapter{Data Compression II: Symbol Codes}
% % \section*{Source Coding: Lossless data compression with symbol codes}
% % Practical source coding
\label{ch3}
%\section{Symbol codes}
 In this chapter, we  discuss
 {\dem variable-length symbol codes\/}\indexs{symbol code},\index{source code!symbol code}
% , variable-length},
 which encode one source symbol at a time, instead of encoding huge strings of 
 $N$ source symbols. These codes  are 
 {\dem lossless:}
 unlike the last chapter's block codes, they are guaranteed to
 compress and  decompress without
 any errors; but there is a chance that the codes may sometimes produce 
 encoded strings longer  than the original source string.

 The idea is that we can achieve compression, on average,
 by assigning {\em shorter\/} encodings to the more
probable outcomes and {\em longer\/} encodings to the less probable.

 The key issues are:
\begin{description}
\item[What are the implications if a symbol code is {\em lossless\/}?]
 If some codewords are shortened, by how much do other codewords
 have to be lengthened?
\item[Making compression practical\puncspace]
 How can we ensure that a symbol code is easy to decode?
\item[Optimal symbol codes\puncspace]
 How should we assign codelengths to achieve the best
 compression, and what is the best achievable compression?
\end{description}

 We  again verify the 
 fundamental status of the Shannon \ind{information content}
 and the entropy, proving:\index{source coding theorem}
%
%
\begin{description}
\item[Source coding theorem (symbol codes)\puncspace]
        There exists a variable-length encoding $C$ of an ensemble
 $X$ such that the average length of an encoded symbol, 
 $L(C,X)$, satisfies
 $L(C,X) \in \left[ H(X) ,  H(X) + 1 \right)$.

The average length is equal to the entropy $H(X)$ only if the codelength
 for each outcome is equal to its \ind{Shannon information content}.
\end{description}
%
 We will also define a constructive  procedure, the 
 \index{Huffman code}Huffman 
 coding algorithm, that produces optimal symbol codes.\index{symbol code!optimal}\index{source code!symbol code!optimal} 

\begin{description}
\item[Notation for alphabets\puncspace]  $\A^N$ denotes the set of 
        ordered $N$-tuples of elements from the set $\A$, \ie,
        all strings of length $N$. 
        The symbol $\A^+$ will denote the set of all strings of finite
        length composed of elements from the set $\A$. 
\end{description}
\exampla{ $\{{\tt{0}},{\tt{1}}\}^3 = \{{\tt{0}}{\tt{0}}{\tt{0}},{\tt{0}}{\tt{0}}{\tt{1}},{\tt{0}}{\tt{1}}{\tt{0}},{\tt{0}}{\tt{1}}{\tt{1}},{\tt{1}}{\tt{0}}{\tt{0}},{\tt{1}}{\tt{0}}{\tt{1}},{\tt{1}}{\tt{1}}{\tt{0}},{\tt{1}}{\tt{1}}{\tt{1}}\}$. }
\exampla{
        $\{{\tt{0}},{\tt{1}}\}^+ = \{ {\tt{0}} , {\tt{1}} , {\tt{0}}{\tt{0}} , {\tt{0}}{\tt{1}} , {\tt{1}}{\tt{0}} , {\tt{1}}{\tt{1}} , {\tt{0}}{\tt{0}}{\tt{0}} , {\tt{0}}{\tt{0}}{\tt{1}} , \ldots \}$.
}
% This notation is borrowed from the standard notation for expressions 
% in computer science
\section{Symbol codes}
\label{sec.symbol.code.intro}
\begin{description}
\item[A (binary) symbol code]
        $C$ for an ensemble $X$ is a mapping from the range of $x$,
        $\A_X \eq \{a_1,\ldots, $ $a_I\}$, to $\{{\tt{0}},{\tt{1}}\}^+$.
% a set of finite length strings of symbols 
%       from an alphabet (NAME?). 
        $c(x)$ will denote the {\dem{codeword}\/}\indexs{symbol code!codeword}
 corresponding to $x$, 
        and $l(x)$ will denote its length, with $l_i = l(a_i)$.

        The {\dem \inds{extended code}\/} $C^+$ 
        is a mapping from $\A_X^+$ to $\{{\tt{0}},{\tt{1}}\}^+$
        obtained by concatenation, without punctuation, of the 
 corresponding codewords:\index{concatenation!in compression} 
\beq
        c^+(x_1 x_2 \ldots x_N) = c(x_1)c(x_2)\ldots c(x_N) .
\eeq

 [The term `\ind{mapping}' here is a synonym for `function'.] 
\end{description}
\exampla{
 A symbol code for the ensemble 
 $X$ defined by
\beq
\begin{array}{*{4}{c}*{5}{@{\,}c}}
             & \A_X & = & \{ & {\tt a}, & {\tt b}, & {\tt c}, & {\tt d} & \} , \\
             & \P_X & = & \{ & \dhalf, & \dquarter, & \deighth, & \deighth  &  \}, 
\end{array}
\eeq
% : \A_X = \{{\tt{a}},{\tt{b}},{\tt{c}},{\tt{d}}\},$ $\P_X = \{ \dhalf,\dquarter,\deighth,\deighth \}$
 is   $C_0$, shown in the margin.
% = \{ {\tt{1}}{\tt{0}}{\tt{0}}{\tt{0}}, {\tt{0}}{\tt{1}}{\tt{0}}{\tt{0}}, {\tt{0}}{\tt{0}}{\tt{1}}{\tt{0}}, {\tt{0}}{\tt{0}}{\tt{0}}{\tt{1}}\}$.
\marginpar{
\begin{center}
$C_0$: 
\begin{tabular}{clc} \toprule
$a_i$ & $c(a_i)$ & $l_i$ 
% {\rule[-3mm]{0pt}{8mm}}%strut
\\ \midrule 
{\tt a} & {\tt 1000}   &   4      \\
{\tt b} & {\tt 0100}   &     4    \\
{\tt c} & {\tt 0010}   &    4     \\
{\tt d} & {\tt 0001}   &   4      \\
 \bottomrule
\end{tabular}
\end{center}
}

 Using the extended code, we may encode ${\tt{acdbac}}$
 as
\beq
	c^{+}({\tt{acdbac}}) =
 {\tt{1000}} 
 {\tt{0010}} 
 {\tt{0001}}
 {\tt{0100}} 
 {\tt{1000}}
 {\tt{0010}} .
\eeq
}
 There are three basic requirements for a useful symbol code. 
 First, any encoded string must have a unique decoding.
  Second, the symbol code must be easy to decode.
 And third, the code should achieve as much compression as possible.
\subsection{Any encoded string must have a unique decoding}
\begin{description}
\item[A code $C(X)$ is uniquely decodeable] if, under the 
 extended code $C^+$, no two distinct
 strings have the same encoding,
% every element of $\A_X^+$  maps into a different string,
 \ie, 
\beq
        \forall \, \bx,\by \in \A_X^+, \:\: \bx \not = \by  \:\:  \Rightarrow   \:\:
        c^+(\bx) \not = c^+(\by).
\label{eq.UD}
\eeq
%cnp22@maths.cam.ac.uk:
% I'm missing the word `injectivity'. This would explain, why 
% (3.2) is necessary for an inverse function.
%
% {\em I believe mathematicians would put it this way:
% a code is uniquely decodeable if the extended code is an injective
% mapping.}
\end{description}
 The code $C_0$ defined above is  an example of a uniquely decodeable
 code.

\subsection{The symbol code must be easy to decode}
 A symbol code 
 is easiest to decode if it is possible to identify the end of a 
 codeword as soon as it arrives, which means that no codeword can 
 be a {\dem{prefix}\/} of another codeword.
%
% {\em (Need a defn of a prefix here.)}
%\marginpar{\footnotesize
% [A word $c$
%% \in \A^{+}$
% is a {\dem prefix\/} of another word $d$
%% \in \A^{+}$
% if there exists a tail string $t$
%% \in \A^{*}
% such that the concatenation $ct$ is
% identical to $d$. For example, {\tt 1} is a prefix of {\tt 101},
% and so is {\tt 10}.]
%}
 [A word $c$
% \in \A^{+}$
 is a {\dem prefix\/} of another word $d$
% \in \A^{+}$
 if there exists a tail string $t$
% \in \A^{*}
 such that the concatenation $ct$ is
 identical to $d$. For example, {\tt 1} is a prefix of {\tt 101},
 and so is {\tt 10}.]

%
 We will show later that we don't lose 
 any performance if we constrain our symbol code to be 
 a prefix code. 
\begin{description}
\item[A symbol code is called a \inds{prefix code}]
 if no codeword is a prefix of 
 any other codeword.

 A prefix code is also known as an {\dem\ind{instantaneous}\/}
 or {\dem\ind{self-punctuating}\/}
 code, because an encoded string  can be decoded 
 from left to right without looking ahead to subsequent 
 codewords. The end of a codeword is immediately recognizable.
 A prefix code is  uniquely decodeable.


\end{description}
\begin{aside}
 {Prefix codes are also
% is more accurately called
 known as  `prefix-free codes' or  `prefix condition codes'.}
\end{aside}

 Prefix codes correspond to trees.

\exampla{
\amarginfignocaption{t}{\mbox{\small$C_1$ \psfig{figure=figs/C1.ps,angle=-90,width=1in}}}
        The code $C_1 = \{ {\tt{0}} , {\tt{1}}{\tt{0}}{\tt{1}} \}$ is a prefix code because 
        ${\tt{0}}$ is not a prefix of {\tt{1}}{\tt{0}}{\tt{1}}, nor is {\tt{1}}{\tt{0}}{\tt{1}} a prefix of {\tt{0}}.

}
\exampla{
        Let $C_2 = \{ {\tt{1}} , {\tt{1}}{\tt{0}}{\tt{1}} \}$. This code is not a prefix code because 
        ${\tt{1}}$ is  a prefix of {\tt{1}}{\tt{0}}{\tt{1}}.
}
\exampla{
% \marginpar[t]{\mbox{\small\raisebox{0.4in}[0in][0in]{$C_3$} \psfig{figure=figs/C3.ps,angle=-90,width=1in}}}
 The code $C_3 = \{ 
{\tt 0}   ,
{\tt 10}  ,
{\tt 110} ,
{\tt 111}
\}$
 is a prefix code.
%
}
%%%%%%%%%%%%%%%
\exampla{
\amarginfignocaption{t}{\mbox{\small\raisebox{0.4in}[0in][0in]{$C_3$} \psfig{figure=figs/C3.ps,angle=-90,width=1in}}\\[0.21in]
\mbox{\small%
\raisebox{0.2in}[0in][0in]{$C_4$} \psfig{figure=figs/C4.ps,angle=-90,width=0.681in}%
}\\[0.125in]
\small\raggedright
 Prefix codes can be represented on binary trees. {\dem Complete\/} prefix codes
 correspond to binary trees with no unused branches. $C_1$ is an incomplete code.}
 The code $C_4 = \{ 
{\tt 00}   ,
{\tt 01}  ,
{\tt 10} ,
{\tt 11}
\}$
 is a prefix code.
%
}
%%%%%%%%%%%%%%%

\exercissxA{1}{ex.C1101}{
        Is $C_2$ uniquely decodeable?
}
%
% example
%
% morse code with spaces stripped out. Is it a prefix code? Is it UD?
% (no,no)
%
\exampla{
% ref corrected 9802
 Consider  \exerciseref{ex.weigh} and \figref{fig.weighing} (\pref{fig.weighing}).
 Any weighing strategy that identifies the odd ball and whether it 
 is heavy or light can be viewed as assigning a  {\em ternary\/}
 code to each of the 24 possible states. 
 This code is a prefix code.
}
\subsection{The code should achieve as much compression as possible}
\begin{description}
\item[The expected length $L(C,X)$] of a symbol code $C$ for ensemble $X$ is 
\beq
        L(C,X) = \sum_{x \in \A_X} P(x) \, l(x).
\eeq
 We may also write this quantity as
\beq
	L(C,X) = \sum_{i=1}^{I} p_i l_i
\eeq
 where $I = |\A_X|$. 
\end{description}
%
\exampla{
% {\sf Example 1:}
\marginpar[b]{
\begin{center}
$C_3$:\\[0.1in] 
\begin{tabular}{cllcc} \toprule
$a_i$ & $c(a_i)$ & $p_i$  &
% \multicolumn{1}{c}{$\log_2 \frac{1}{p_i}$}
 $h(p_i)$
 & $l_i$ 
% {\rule[-3mm]{0pt}{8mm}}%strut
\\ \midrule 
{\tt a} & {\tt 0}   & \dhalf         &  1.0     &   1      \\
{\tt b} & {\tt 10}  & \dquarter        &  2.0     &   2      \\
{\tt c} & {\tt 110} & \deighth       &  3.0     &   3      \\
{\tt d} & {\tt 111} & \deighth       &  3.0     &   3      \\
 \bottomrule
\end{tabular}
\end{center}
}
 Let 
\beq
\begin{array}{*{4}{c}*{5}{@{\,}c}}
             & \A_X & = & \{ & {\tt a}, & {\tt b}, & {\tt c}, & {\tt d} & \} , \\
\mbox{and} \:\:& \P_X & = & \{ & \dhalf, & \dquarter, & \deighth, & \deighth  &  \}, 
\end{array}
\eeq
 and  consider the code $C_3$.
% $c(a)\eq {\tt{0}}$, $ c(b)\eq {\tt{1}}{\tt{0}}$,
% $c(c)\eq {\tt{1}}{\tt{1}}{\tt{0}}$, $ c(d)\eq {\tt{1}}{\tt{1}}{\tt{1}}$.
%
 The entropy of $X$ is 1.75 bits, and the expected length $L(C_3,X)$ of this 
 code is also 1.75 bits. The sequence of symbols $\bx\eq ({\tt acdbac})$ is 
% 134213
 encoded as $c^+(\bx)={\tt{0110111100110}}$. 
% You can confirm that no other sequence of 
% symbols $\bx$ has the same encoding.
% In fact,
 $C_3$ is a {prefix code\/}
 and is therefore \inds{uniquely decodeable}. 
 Notice that the codeword lengths satisfy $l_i \eq  \log_2 (1/p_i)$,  or
 equivalently,
 $p_i \eq  2^{-l_i}$.
}
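
 As a check on these numbers, the entropy and the expected length can be
 written out term by term:
\beqan
	H(X) &=& \dhalf \times 1 + \dquarter \times 2 + \deighth \times 3 + \deighth \times 3 = 1.75 \mbox{ bits} , \\
	L(C_3,X) &=& \dhalf \times 1 + \dquarter \times 2 + \deighth \times 3 + \deighth \times 3 = 1.75 \mbox{ bits} .
\eeqan
 The two sums agree term by term because each codelength $l_i$ is equal to
 the information content $\log_2 (1/p_i)$.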
%\medskip
%
%\noindent {\sf Example 2:}
\exampla{
 Consider the fixed length code for the same ensemble
 $X$, $C_4$.
% $ c(1)\eq {\tt{00}}$, $ c(2)\eq {\tt{01}}$, $ c(3)\eq {\tt{10}}$, $ c(4)\eq {\tt{11}}$.
%
% C4 by itself in a table, moved to graveyard
\marginpar[b]{
\begin{center}
 \begin{tabular}{cll} \toprule
% $a_i$
 &
$C_4$&
$C_5$
%&$C_6$
% \\
% $c(a_i)$ & $p_i$  &
% \multicolumn{1}{c}{$\log_2 \frac{1}{p_i}$}
% $h(p_i)$ & $l_i$
% {\rule[-3mm]{0pt}{8mm}}%strut
\\ \midrule 
{\tt a} & {\tt 00} & {\tt 0}        \\
{\tt b} & {\tt 01} & {\tt 1}         \\
{\tt c} & {\tt 10} & {\tt 00}     \\
{\tt d} & {\tt 11} & {\tt 11}     \\
 \bottomrule
\end{tabular}
\end{center}
}
 The expected length $L(C_4,X)$ is 2 bits.
}
% edskip
% 
% \noindent {\sf Example 3:}
\exampla{
 Consider $C_5$.
%$ c(1)\eq {\tt{0}}$, $ c(2)\eq {\tt{1}}$, $ c(3)\eq {\tt{00}}$,  $c(4)\eq {\tt{11}}$.
 The expected 
 length $L(C_5,X)$ is 1.25 bits, which is less than $H(X)$. 
 But the code is not uniquely decodeable. 
 The sequence $\bx\eq ({\tt acdbac})$
% 134213)$
 encodes as {\tt{000111000}}, which can also be 
 decoded as $({\tt cabdca})$.
}
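
 The ambiguity of $C_5$ can be made explicit by writing out the two parsings
 of the same bit string (the gaps are inserted only to show the codeword boundaries):
\beqan
	c^+({\tt acdbac}) &=& {\tt 0} \: {\tt 00} \: {\tt 11} \: {\tt 1} \: {\tt 0} \: {\tt 00} , \\
	c^+({\tt cabdca}) &=& {\tt 00} \: {\tt 0} \: {\tt 1} \: {\tt 11} \: {\tt 00} \: {\tt 0} .
\eeqan
 The expected length quoted above is
 $L(C_5,X) = \dhalf \times 1 + \dquarter \times 1 + \deighth \times 2 + \deighth \times 2 = 1.25$ bits.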
% \medskip
% 
% \noindent {\sf Example 4:}
\exampla{
 Consider the code $C_6$.
\amargintabnocaption{c}{
\begin{center}
$C_6$:\\[0.1in]
 \begin{tabular}{cllcc} \toprule
$a_i$  & $c(a_i)$    & $p_i$  &
% {$\log_2 \frac{1}{p_i}$}
 $h(p_i)$
 & $l_i$ 
% {\rule[-3mm]{0pt}{8mm}}%strut
\\ \midrule 
{\tt a} & {\tt 0}    & \dhalf         &  1.0     &   1     \\
{\tt b} & {\tt 01}   & \dquarter        &  2.0     &   2     \\
{\tt c} & {\tt 011}  & \deighth       &  3.0     &   3     \\
{\tt d} & {\tt 111}  & \deighth       &  3.0     &   3     \\
 \bottomrule
\end{tabular}
\end{center}
}
%$ c(1)\eq {\tt{0}}$, $ c(2)\eq {\tt{01}}$, $ c(3)\eq {\tt{011}}$,  $c(4)\eq {\tt{111}}$. 
 The  expected length $L(C_6,X)$ of this 
 code is  1.75 bits. The sequence of symbols $\bx\eq ({\tt acdbac})$ is 
 encoded as $c^+(\bx)={\tt{0011111010011}}$. 

 Is $C_6$  a {prefix code}?
 It is not, because $c({\tt a}) = {\tt 0}$ is a prefix of both
 $c({\tt b})$ and $c({\tt c})$. 

 Is $C_6$ {uniquely decodeable}? This is not so obvious. If you think that
 it might {\em not\/} be {uniquely decodeable},  try to prove it 
 so by finding a pair of strings $\bx$ and $\by$ that have the same
 encoding. [The definition of unique decodeability is given in \eqref{eq.UD}.]

 $C_6$ certainly isn't {\em easy\/} to decode. 
 When we receive `{\tt{00}}', it is possible that $\bx$ could start `{\tt{aa}}', 
 `{\tt{ab}}' or `{\tt{ac}}'. Once we have received `{\tt{001111}}', the second symbol 
 is still ambiguous, as $\bx$ could be `{\tt{abd}}\ldots' or `{\tt{acd}}\ldots'. 
 But eventually a unique decoding crystallizes, once the next {\tt{0}} appears in the 
 encoded stream. 

 $C_6$ {\em is\/} in fact {uniquely decodeable}. Comparing with the prefix code $C_3$, 
 we see that the codewords of $C_6$ are  the reverse of $C_3$'s.
 That $C_3$ is uniquely decodeable proves that $C_6$ is too, since
 any string from $C_6$ is identical to a string from $C_3$ read backwards. 
}
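
 The reversal argument can be traced on the example string, as an illustration
 rather than a proof.
 Reading $c^+({\tt acdbac}) = {\tt 0011111010011}$ backwards gives {\tt 1100101111100},
 which the prefix code $C_3$ parses from left to right without looking ahead:
\beq
	{\tt 110} \: {\tt 0} \: {\tt 10} \: {\tt 111} \: {\tt 110} \: {\tt 0}
	\:\: \rightarrow \:\: {\tt cabdca} .
\eeq
 Reversing the order of the decoded symbols recovers {\tt acdbac}.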
% \medskip
% something I recall reading in cover was a contrary statement that said that
% with a nonprefix code it will take an arb long time to figure things out. 
% maybe that was just a w.c. result.

% What is it that distinguishes a uniquely

\section{What limit is imposed by unique decodeability?}
 We now ask, given a list of positive integers $\{ l_i
 \}$, does there exist a uniquely decodeable\index{uniquely decodeable}\index{source code!uniquely decodeable} code with those
 integers as its codeword lengths?
 At this stage, we  ignore the probabilities of the different
 symbols; once we understand unique decodeability better, we'll
 reintroduce the probabilities and discuss how to make
 an {\dem optimal\/} uniquely decodeable symbol code. 

 In the examples above, we have observed that if we take a code 
 such as $\{{\tt{00}},{\tt{01}},{\tt{10}},{\tt{11}}\}$, and
 shorten one of its codewords, 
 for example ${\tt{00}} \rightarrow {\tt{0}}$, then we can  retain unique 
 decodeability only if we lengthen  other codewords.
 Thus there seems to be a constrained budget\index{symbol code!budget} that we can spend
 on codewords, with shorter codewords being more expensive.

 Let us explore the nature of this \ind{budget}. 
 If we build a code purely from codewords of length $l$ equal 
 to three, how many 
 codewords can we have and retain unique decodeability?
 The answer is $2^l = 8$. Once we have chosen all eight 
 of these codewords, is there any way we could add to the code another 
 codeword of some {\em other\/} length and retain unique decodeability? 
 It would seem not.

 What if we make a code that includes a length-one codeword, `{\tt{0}}', 
 with the other codewords being of length three?  How many length-three
 codewords can we have?
 If we restrict attention to prefix codes, then
% it is clear  that
 we can  have only four codewords of length three, namely 
 $\{ {\tt{100}},{\tt{101}},{\tt{110}},{\tt{111}} \}$. What about other codes? Is there any other 
 way of choosing codewords of length 3 that can give more codewords? 
 Intuitively, we  think this unlikely. 
 A codeword of length $3$ appears to 
 cost only $1/2^{2}$ as much as a codeword of length 1. 
% "... cost ... times smaller ..."; I suspect some
%        readers may have difficulty with this sentence.

 Let's  define a total budget of size 1, 
 which we can spend on codewords.
 If we set the cost of a codeword whose length is $l$ to $2^{-l}$,
 then we have a pricing system that fits the examples
 discussed above. Codewords of length 3 cost $\deighth$ each;
 codewords of length 1 cost $1/2$ each. 
 We can spend our budget on any codewords.
 If we go over our budget then the code will certainly not be 
 uniquely decodeable. If, on the other hand,
\beq
	\sum_i 2^{-l_i} \leq 1,
\label{eq.kraft}
\eeq
 then the code may be uniquely decodeable. This inequality is
 the \inds{Kraft inequality}.\label{sec.kraft}
\begin{description}
\item[\Kraft\ inequality\puncspace] 
 For any uniquely decodeable code $C(X)$ over the binary alphabet $\{0,1\}$, 
 the codeword lengths must satisfy:
\beq
        \sum_{i=1}^I 2^{-l_i} \leq 1 ,
\eeq
 where $I = |\A_X|$.
\end{description}
\begin{description}
\item[Completeness\puncspace]
 If a uniquely 
 decodeable code satisfies the \Kraft\ inequality with equality 
 then  it is called a {\dbf complete} code.
\end{description}
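
 As a worked illustration, here are the sums $\sum_i 2^{-l_i}$ for the codes of
 section \ref{sec.symbol.code.intro}:
\beqan
	C_0: & & 4 \times 2^{-4} = \dquarter , \\
	C_3 \mbox{ and } C_6: & & 2^{-1} + 2^{-2} + 2^{-3} + 2^{-3} = 1 , \\
	C_4: & & 4 \times 2^{-2} = 1 , \\
	C_5: & & 2^{-1} + 2^{-1} + 2^{-2} + 2^{-2} = \frac{3}{2} .
\eeqan
 $C_5$ exceeds the budget, consistent with its not being uniquely decodeable.
 $C_3$, $C_4$ and $C_6$ are uniquely decodeable and spend the budget exactly,
 so they are complete. $C_0$ is uniquely decodeable but leaves three quarters
 of the budget unspent.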
% It is less obvious that t
 We want  codes that are uniquely decodeable; 
 prefix codes are uniquely decodeable, and are easy to decode.
% ;  and it is  easy to assess whether a code is a prefix code. 
% codes that are not prefix codes are less straightforward to decode than 
% prefix codes. 
 So life would be simpler for us if we could restrict attention to prefix
 codes.\index{prefix code}  
 Fortunately,
% we can prove that
 for any source there {\em is\/}
 an optimal symbol code that is also a prefix
 code. 
% We wi, and we will discuss an 
% algorithm    we can restrict attention to prefix
% codes.
% The following
% result is also true:
\begin{description}
\item[\Kraft\ inequality and prefix codes\puncspace]
 Given a set of codeword lengths that satisfy
 the Kraft inequality,
% this inequality,
 there exists a uniquely decodeable prefix
 code\index{source code!prefix code}\index{prefix code} with these
 codeword lengths.
\end{description}
\begin{aside}
%\subsection*{The small print}
 The Kraft inequality
% , which appears on page \pageref{sec.kraft},
 might be more accurately referred to 
 as the Kraft--McMillan inequality:\index{Kraft, L.G.}\index{McMillan, B.}\nocite{mcmillan1956}
 Kraft
% (1949)
 proved that if the inequality is satisfied, 
 then a prefix code exists with the given lengths.
% McMillan
% (1956)
 \citeasnoun{mcmillan1956}
 proved the converse, that unique decodeability 
 implies that the inequality holds.
\end{aside}
\begin{prooflike}{Proof of the \Kraft\ inequality}
%
        Define $S = \sum_i 2^{-l_i}$.
 Consider the quantity 
\beq
        S^N = \left[ \sum_i 2^{-l_i} \right]^N 
        = \sum_{i_1=1}^{I} \sum_{i_2=1}^{I} \cdots \sum_{i_N=1}^{I}
         2^{-\displaystyle \left(l_{i_1} + l_{i_2} + \cdots + l_{i_N} \right) } .
\eeq
 The quantity in the exponent, $\left(l_{i_1} + l_{i_2} + \cdots +
 l_{i_N} \right)$, is the length of the encoding of the string $\bx =
 a_{i_1} a_{i_2} \ldots a_{i_N}$. For every string $\bx$ 
 of length $N$, there is one term in the above sum. Introduce an 
 array $A_l$ that counts how many strings $\bx$ have encoded length $l$. 
 Then, defining $l_{\min} = \min_i l_i$ and $l_{\max} = \max_i l_i$:
\beq
        S^N = \sum_{l = N l_{\min} }^{N l_{\max}} 2^{-l} A_l .
\eeq
 Now assume $C$ is
 uniquely decodeable, so that for all $\bx \not = \by$, 
 $c^+(\bx) \not = c^+(\by)$. Concentrate on the $\bx$ that have encoded 
 length $l$. There are a total of $2^l$ distinct bit strings of length $l$, 
 so it must be the case that $A_l \leq 2^l$. 
%
 So
\beq
        S^N = \sum_{l = N l_{\min} }^{N l_{\max}} 2^{-l} A_l \leq
         \sum_{l = N l_{\min} }^{N l_{\max}} 1 \:\: \leq \:\:  N l_{\max}.
\label{eq.kraft.climax}
\eeq
 Thus $S^N \leq l_{\max} N$ for all $N$.
 Now if $S$ were greater than 1, then as $N$ increases,
 $S^N$ would be an exponentially growing function, and for large enough
 $N$, an exponential always exceeds a polynomial such as $l_{\max} N$.
 But our  result $(S^N \leq l_{\max} N)$
% \ref{eq.kraft.climax}
 is true for {\em any\/} $N$.
 Therefore $S \leq 1$. \hfill
% Q.E.D. 
%
% to have
% enabled me to understand it the first time round, it would have been
% sufficient to have said 'for the inequality to be true for all N,
% regardless of how large, S has to be <= 1.'
%
\end{prooflike}
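
 To see the argument at work on a code that is not uniquely decodeable,
 take the codeword lengths of $C_5$, $\{1,1,2,2\}$, for which $S = \frac{3}{2}$
 and $l_{\max} = 2$.
 The bound $S^N \leq l_{\max} N$ would require
\beq
	\left( \frac{3}{2} \right)^N \leq 2 N \:\: \mbox{for all $N$} ,
\eeq
 but at $N=7$ the left-hand side is $2187/128 \simeq 17.1$ while the
 right-hand side is 14.
 So no uniquely decodeable code can have these lengths.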


\exercissxB{3}{ex.KIconverse}{ 
% (optional)
 Prove
 the  result stated above,
 that for any set of codeword lengths $\{ l_i \}$
 satisfying the \Kraft\ inequality, there is a prefix code having those 
 lengths.
}
%
%  Symbol Coding Budget
%
\begin{figure}
\figuremargin{%
\begin{center}
\mbox{\psfig{figure=figs/budget1.eps,height=3in}\ \psfig{figure=figs/budgetmax.eps,height=3in}}
\end{center}
}{%
\caption[a]{The symbol coding \ind{budget}.\index{source code!supermarket}\indexs{symbol code!budget}
 The `cost' $2^{-l}$ of each codeword
 (with length $l$)
 is indicated by the size of the box it is written in. The total budget 
 available when making a uniquely decodeable code is 1.

 You can think of this diagram as showing
 a {\dem{codeword supermarket}\/}\index{supermarket for codewords},
 with the codewords arranged in aisles by their length, and the cost of each codeword indicated by the
 size of its box on the shelf.
 If the cost of the codewords that  you take exceeds the budget then your code
 will not be uniquely decodeable.
}
\label{fig.budget1}
}%
\end{figure}
\begin{figure}
\figuredangle{%
\begin{center}
\mbox{
%\begin{tabular}{cc}
% $C_0$ & $C_3$ \\
%\psfig{figure=figs/budget0.eps,height=1.48in}&
%\psfig{figure=figs/budget3.eps,height=1.48in} \\[0.2in]
% $C_4$ & $C_6$ \\
%\psfig{figure=figs/budget4.eps,height=1.48in}&
%\psfig{figure=figs/budget6.eps,height=1.48in}\\
%\end{tabular}}
\begin{tabular}{cccc}
 $C_0$ & $C_3$ &  $C_4$ & $C_6$ \\
\psfig{figure=figs/budget0.eps,height=1.66in}&
\psfig{figure=figs/budget3.eps,height=1.66in}&
\psfig{figure=figs/budget4.eps,height=1.66in}&
\psfig{figure=figs/budget6.eps,height=1.66in}\\
\end{tabular}}
\end{center}
}{%
\caption[a]{Selections of codewords
% from the codeword supermarket
 made by codes $C_0,C_3,C_4$ and $C_6$
 from section \protect\ref{sec.symbol.code.intro}.}
\label{fig.budget0}
\label{fig.budget6}
}%
\end{figure}
 A pictorial view of the \Kraft\ inequality may  help you solve this exercise.
 Imagine that we are choosing the codewords to make a symbol code. 
 We can draw the set of all candidate codewords
% that we might  include in a code
 in a supermarket that displays
 the `cost' of the codeword by the area of a box (\figref{fig.budget1}). 
 The total budget available -- the `1' on the right-hand side of 
 the \Kraft\ inequality -- is shown at one side. 
 Some of the codes discussed in section \ref{sec.symbol.code.intro}
 are illustrated in figure \ref{fig.budget0}. Notice that the codes that 
 are prefix codes, $C_0$, $C_3$,
 and $C_4$,  have the property that to the right of any selected 
 codeword, there are no other selected codewords --
 because prefix codes correspond to trees.
% The {\em complete\/} prefix codes  $C_0$, $C_3$,
% and $C_4$ have the property that
% the codewords abut 
% Notice also that the 
% `incomplete' code 
% -\ref{fig.budget6}.
 Notice that a {\em complete\/} prefix code
 corresponds to a {\em complete\/} tree having no unused branches.

\medskip

 We are now ready to put back the symbols' probabilities $\{ p_i \}$.
 Given a set of symbol probabilities (the English language
 probabilities of \figref{fig.monogram}, for example),
 how do we make the best symbol code --  one  with the smallest
 possible expected length $L(C,X)$? And what is that smallest possible
 expected length?
 It's not
 obvious how to assign the codeword lengths.
 If we give short codewords to the more probable
 symbols then the expected length might be reduced; on the other
 hand, shortening some codewords necessarily causes others
 to lengthen, by the Kraft inequality.

\section{What's the most compression that we can hope for?}
% there must be a compromise.
% of s
% Of the four codes  displayed in figure \ref{fig.budget0},
% $C_3$ and $C_6$
 We wish to minimize the expected length of a code,  
\beqan
        L(C,X) &=& \sum_i p_i l_i .
\eeqan

 As you might have guessed, the entropy  appears as the 
% It is easy to show that there is a 
 lower bound on the expected length of a code.
\begin{description}
\item[Lower bound on expected length\puncspace] The expected length $L(C,X)$ 
 of a uniquely decodeable code 
 is bounded below by $H(X)$. 

\item[{\sf Proof.}]
% Introduce the optimum codelengths $l^*_i \equiv \log (1/p_i)$, 
        We define the {\dem\inds{implicit probabilities}\/}
 $q_i \equiv 2^{-l_i}/z$,
        where $z\eq \sum_{i'} 2^{-l_{i'}}$, so that $l_i \eq  \log 1/q_i -
        \log z$.  We then use Gibbs' inequality,
        $\sum_i p_i \log 1/q_i \geq \sum_i p_i \log 1/p_i$, with
        equality if and only if $q_i \eq  p_i$ for all $i$, and the \Kraft\ inequality $z\leq 1$:
\beqan
        L(C,X) &=& \sum_i p_i l_i =
        \sum_i p_i \log 1/q_i - \log z
\label{eq.expected.length}
\\
        & \geq & \sum_i p_i \log 1/p_i - \log z
\\
        & \geq & H(X) . 
\eeqan
        The equality $L(C,X) \eq  H(X)$ is achieved only if the \Kraft\ 
        equality $z
% \sum_i 2^{-l_i} 
        \eq  1$ is satisfied, and if 
        the codelengths satisfy $l_i \eq  \log (1/p_i)$. \hfill $\Box$

\end{description}
 This is an important result so let's say it again: 
\begin{description}
\item[Optimal source codelengths\puncspace]
        The\index{source code!optimal lengths}
         expected length is minimized and is equal to 
 $H(X)$ only if the codelengths 
        are equal to the {\dem Shannon information contents}:\index{Shannon information content}\index{information content}
\beq
        l_i = \log_2 (1/p_i)  .
\eeq
\item[Implicit probabilities defined by codelengths\puncspace]
	Conversely, any choice of codelengths $\{l_i\}$ {\em implicitly\/}
 defines a probability distribution $\{q_i\}$,
\beq
	q_i \equiv 2^{-l_i}/z  ,
\eeq
 for  which  those codelengths would be the  optimal codelengths.
 If the code is complete then $z=1$ and the implicit probabilities 
 are given by $q_i =  2^{-l_i}$.
\end{description}
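
 As a small illustration (the codelengths here are chosen purely for
 the sake of example), consider a complete code with codelengths
 $\{ 1,2,3,3 \}$ and an incomplete code with codelengths $\{ 2,2,2 \}$:
\beqan
        \{ l_i \} = \{ 1,2,3,3 \} &:& z = \frac{1}{2} + \frac{1}{4} + \frac{1}{8} + \frac{1}{8} = 1 ,
        \:\:\: \{ q_i \} = \{ 1/2 , 1/4 , 1/8 , 1/8 \} ;
\\
        \{ l_i \} = \{ 2,2,2 \} &:& z = 3 \times \frac{1}{4} = \frac{3}{4} ,
        \:\:\: \{ q_i \} = \{ 1/3 , 1/3 , 1/3 \} .
\eeqan
 In the second case it is the division by $z$ that makes the $q_i$ sum to one.
 In the first case, an ensemble whose probabilities were exactly
 $\{ 1/2 , 1/4 , 1/8 , 1/8 \}$ would have expected length
 $\frac{1}{2} \times 1 + \frac{1}{4} \times 2 + 2 \times \frac{1}{8} \times 3 = 1.75$ bits,
 which is equal to its entropy.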
%  This is one of the central themes of this course.
%
%
%
\section{How much can we compress?}
 So, we can't compress below the entropy.
% using a symbol code.
 How close can we expect  to get to the entropy?
% if we are using a symbol code?
% \section{Existence of good symbol codes}
\begin{ctheorem}
{\sf Source coding theorem for symbol codes.}
 For an ensemble $X$ there exists a prefix code $C$ with  expected length 
 satisfying\indexs{extra bit} 
\beq
        H(X) \leq L(C,X) < H(X) + 1.
\label{eq.source.coding.symbol}
\eeq
\label{th.source.coding.symbol}
\end{ctheorem}
\begin{prooflike}{Proof}    We set the codelengths to integers slightly 
 larger than the optimum lengths:
\beq
        l_i = \lceil \log_2 (1/p_i) \rceil
\eeq
        where $\lceil l^* \rceil$ denotes the smallest integer greater
        than or equal to $l^*$.
 [We are not asserting that the {\em optimal\/} code necessarily uses
 these lengths; we are simply choosing these lengths 
 because we can use them to prove the theorem.]

  We check that there {\em is\/} a
        prefix code with these lengths by confirming that the
        \Kraft\ inequality is satisfied.
\beq
	\sum_i 2^{-l_i} = \sum_i 2^{-\lceil \log_2 (1/p_i) \rceil} 
	\leq \sum_i 2^{ -\log_2 (1/p_i) } = \sum_i p_i = 1 . 
\eeq

        Then we confirm
\beq
	L(C,X) = \sum_i p_i \lceil \log_2 (1/p_i) \rceil
        < \sum_i p_i ( \log_2 (1/p_i) + 1 ) = H(X) + 1.
\eeq
% corrected < to =  , 9802
%
\end{prooflike}
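
 To see this construction at work, take as an illustrative ensemble
 (chosen here only as an example) the probabilities
 $\{ 0.25, 0.25, 0.2, 0.15, 0.15 \}$, whose entropy is
 $H(X) \simeq 2.2855$ bits. The lengths
 $l_i = \lceil \log_2 (1/p_i) \rceil$ are then $\{ 2,2,3,3,3 \}$, and
\beqan
        \sum_i 2^{-l_i} &=& 2 \times \frac{1}{4} + 3 \times \frac{1}{8} = \frac{7}{8} \leq 1 ,
\\
        L(C,X) &=& (0.25 + 0.25) \times 2 + (0.2 + 0.15 + 0.15) \times 3 = 2.5 .
\eeqan
 So $H(X) \leq L(C,X) < H(X) + 1$ holds for this ensemble. Note that $1/8$ of the
 budget is left unspent, which suggests that these lengths need not be
 the best possible.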

\subsection{The cost of using the wrong codelengths}
 If we use a code whose lengths are not equal to the optimal 
 codelengths,  the average message length will be larger
 than the entropy.

%when        we use the `wrong' code. 
 If the true probabilities are $\{ p_i
        \}$ and we use a complete code with lengths $l_i$,
% that satisfy the
%         \Kraft\ equality (that is, 
%  the  \Kraft\ inequality with equality),
  we can view those lengths as defining 
        \ind{implicit probabilities} $q_i = 2^{-{l_i}}$.
% l_i \eq  \log 1/q_i$ such
%       that $\sum_i q_i \eq  1$, then 
        Continuing from \eqref{eq.expected.length},
 the average length is
\beq
	L(C,X) = H(X)+\sum_i p_i \log p_i/q_i,
\eeq
        \ie, it exceeds the entropy by the \ind{relative entropy}
        $D_{\rm KL}(\bp||\bq)$ (as defined on  \pref{eq.KL}).
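
 The intermediate step can be spelled out, using $z = 1$ for a complete code:
\beqan
        L(C,X) &=& \sum_i p_i \log \frac{1}{q_i}
        = \sum_i p_i \log \frac{1}{p_i} + \sum_i p_i \log \frac{p_i}{q_i}
        = H(X) + D_{\rm KL}(\bp||\bq) .
\eeqan
 For instance (an example chosen for illustration), if
 $\{ p_i \} = \{ 1/2, 1/4, 1/8, 1/8 \}$ and we use the fixed-length
 code with $l_i = 2$, so that $q_i = 1/4$, then $H(X) = 1.75$ bits,
 $L(C,X) = 2$ bits, and
\beq
        D_{\rm KL}(\bp||\bq) = \frac{1}{2} \log_2 2 + \frac{1}{4} \log_2 1
        + 2 \times \frac{1}{8} \log_2 \frac{1}{2} = 0.5 + 0 - 0.25 = 0.25 \:\mbox{bits} ,
\eeq
 which accounts for the difference between the two.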

\section{Optimal source coding with symbol codes:  Huffman coding}
 Given a set of probabilities $\P$, how can we design an optimal
 prefix code? For example,
 what is the best symbol code for the English language ensemble
 shown in \figref{fig.elfig}? 
\marginfig{\begin{center}\input{tex/_paz.tex}\end{center}
\caption[a]{An ensemble in need of a symbol code.}\label{fig.elfig}}
 When we say `optimal', let's assume our aim is to minimize the
 expected length $L(C,X)$.

\subsection{How not to do it}
 One might try
 to roughly split the set $\A_X$ in two, and
 continue bisecting the subsets so as to define a binary tree from the
 root. This construction has the right spirit, as in the weighing problem, 
% is how the {\em Shannon-Fano code\/} is constructed,\index{Shannon, Claude}\index{Fano} 
 but it is not
 necessarily optimal; it achieves $L(C,X) \leq H(X) + 2$. 
%
% find a reference for proof of this?
%
%{\em [Is Shannon-Fano
% the correct name? According to Goldie and Pinch this has a different
% meaning. Check.]}
\subsection{The Huffman coding algorithm}
 We now present a beautifully simple algorithm for finding an optimal
 prefix code.
 \indexs{Huffman code}The trick is to
 construct the code {\em backwards\/} starting from the tails of the
 codewords;  {\em we build the binary tree  from its leaves}.

\begin{algorithm}[h]
\begin{framedalgorithmwithcaption}{\caption[a]{Huffman coding algorithm.}}
\ben
\item%[{\sf 1.}]
 Take the two least probable symbols in the alphabet. These two symbols 
 will be given the longest codewords, which will have equal length, 
 and differ only in the last digit. 
\item%[{\sf 2.}]
 Combine these two symbols into a single symbol, and repeat.
\een
\end{framedalgorithmwithcaption}
\end{algorithm}

 Since each step reduces the size of the alphabet by one, 
 this algorithm will have assigned strings to all the symbols 
 after  $|\A_X|-1$ steps.
\exampla{
% {\sf Example:}
 \begin{tabular}[t]{*{11}{@{\,}l}}
 Let \hspace{0.1in} & $\A_X$  &=&$\{$& {\tt a},&{\tt b},&{\tt c},&{\tt d},&{\tt e} &$\}$ \\
 and \hspace{0.1in} &  $\P_X$  &=&$\{$& 0.25,  &0.25,  & 0.2,  & 0.15, & 0.15   & $\}$.
 \end{tabular}
\begin{center}
% \framebox{\psfig{figure=figs/huffman.ps,%
%angle=-90}}
\setlength{\unitlength}{0.015in}%was0125
\begin{picture}(200,95)(40,40)
\put( 60,105){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{0.25}}}
\put( 60,090){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{0.25}}}
\put( 60,075){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{0.2}}}
\put( 60,060){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{0.15}}}
\put( 60,045){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{0.15}}}
\put(100,105){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{0.25}}}
\put(100,090){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{0.25}}}
\put(100,075){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{0.2}}}
\put(100,060){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{0.3}}}
\put(140,105){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{0.25}}}
\put(140,090){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{0.45}}}
\put(140,060){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{0.3}}}
\put(180,105){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{0.55}}}
\put(180,090){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{0.45}}}
\put(220,105){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{1.0}}}
\put( 40,105){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\tt a}}}
\put( 40,090){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\tt b}}}
\put( 40,075){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\tt c}}}
\put( 40,060){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\tt d}}}
\put( 40,045){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\tt e}}}
\put( 85,067){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\tt 0}}}
\put( 85,045){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\tt 1}}}
\put(125,097){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\tt 0}}}
\put(125,075){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\tt 1}}}
\put(165,112){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\tt 0}}}
\put(165,065){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\tt 1}}}
\put(205,112){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\tt 0}}}
\put(205,090){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\tt 1}}}
\thinlines
\put( 80,110){\line( 1, 0){ 15}}
\put( 80,095){\line( 1, 0){ 15}}
\put( 80,080){\line( 1, 0){ 15}}
\put( 80,065){\line( 1, 0){ 15}}
\put( 95,065){\line(-1,-1){ 15}}

\put(120,110){\line( 1, 0){ 15}}
\put(120,065){\line( 1, 0){ 15}}
\put(120,095){\line( 1, 0){ 15}}
\put(135,095){\line(-1,-1){ 15}}

\put(160,095){\line( 1, 0){ 15}}
\put(160,110){\line( 1, 0){ 15}}
\put(175,110){\line(-1,-3){ 15}}

\put(200,110){\line( 1, 0){ 15}}
\put(215,110){\line(-1,-1){ 15}}
\put( 40,125){\makebox(0,0)[bl]{\raisebox{0pt}[0pt][0pt]{$x$}}}
\put( 85,125){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{step 1}}}
\put(125,125){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{step 2}}}
\put(165,125){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{step 3}}}
\put(205,125){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{step 4}}}
\end{picture}

\end{center}
 The  codewords are then obtained by concatenating the binary digits
 in reverse order:
% Codewords
 $C = \{ {\tt{00}}, {\tt{10}} , {\tt{11}}, {\tt{010}}, {\tt{011}} \}$.
\margintab{
\begin{center}
\begin{tabular}{clrrl} \toprule
$a_i$  & $p_i$  &
 \multicolumn{1}{c}{$h(p_i)$%$\log_2 \frac{1}{p_i}$}
 }
& $l_i$ & $c(a_i)$
%{\rule[-3mm]{0pt}{8mm}}%strut
\\ \midrule 
{\tt a} & 0.25        &  2.0     &   2 & {\tt 00}       \\
{\tt b} & 0.25        &  2.0     &   2 & {\tt 10}       \\
{\tt c} & 0.2         &  2.3     &   2 & {\tt 11}       \\
{\tt d} & 0.15        &  2.7     &   3 & {\tt 010}      \\
{\tt e} & 0.15        &  2.7     &   3 & {\tt 011}      \\ \bottomrule
\end{tabular}
\end{center}
\caption[a]{Code created by the Huffman algorithm.}
\label{tab.huffman}
}
 The codelengths  selected by the Huffman algorithm (column 4
 of \tabref{tab.huffman}) are
 in some cases longer and in some cases shorter than 
 the ideal codelengths, the  Shannon information contents $\log_2 \dfrac{1}{p_i}$ (column 3).
 The expected length of the code is $L=2.30$ bits, whereas the
 entropy is $H=2.2855$ bits.\ENDsolution
}
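
 As a check on the numbers quoted in this example, the expected length
 and the \Kraft\ sum can be read off from \tabref{tab.huffman}:
\beqan
        L(C,X) &=& (0.25 + 0.25 + 0.2) \times 2 + (0.15 + 0.15) \times 3 = 1.4 + 0.9 = 2.30 \:\mbox{bits} ,
\\
        \sum_i 2^{-l_i} &=& 3 \times \frac{1}{4} + 2 \times \frac{1}{8} = 1 .
\eeqan
 The code is therefore complete, and its implicit probabilities are
 $\{ 1/4, 1/4, 1/4, 1/8, 1/8 \}$.
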
 If at any point there is more than one way of selecting the two least 
 probable symbols then the choice may be made in any manner -- the 
 expected length of the code will not depend on the choice.
\exercissxC{3}{ex.Huffmanconverse}{ 
% (Optional)
 Prove\index{Huffman code!`optimality'} 
 that there is no better symbol code for a source than the
 Huffman code.
}
%
\exampla{
 We can make a Huffman code for the probability distribution
 over the alphabet introduced in \figref{fig.monogram}.
 The result is shown in \figref{fig.monogram.huffman}.
 This code has an expected length of    4.15 bits; the entropy of
 the ensemble is     4.11 bits.
% It is interesting to notice how
% some symbols, for example {\tt q}, receive codelengths that
% differ by more than 1 bit from
 Observe the disparities between the assigned
 codelengths and the ideal codelengths
 $\log_2 \dfrac{1}{p_i}$.
}
%%%%%%%%%%%%%%%%%%%%%%%%% alphabet of english!
\begin{figure}
\figuremargin{%
\begin{center}
\mbox{\small
\begin{tabular}{clrrl}  \toprule
$a_i$  & $p_i$  & \multicolumn{1}{c}{$\log_2 \frac{1}{p_i}$}  & $l_i$ & $c(a_i)$
%{\rule[-3mm]{0pt}{8mm}}%strut
\\[0in] \midrule
{\tt a}&    0.0575 & 4.1  &   4 & {\tt 0000       } \\
{\tt b}&    0.0128 & 6.3  &   6 & {\tt 001000       } \\
{\tt c}&    0.0263 & 5.2  &   5 & {\tt 00101       } \\
{\tt d}&    0.0285 & 5.1  &   5 & {\tt 10000       } \\
{\tt e}&    0.0913 & 3.5  &   4 & {\tt 1100       } \\
{\tt f}&    0.0173 & 5.9  &   6 & {\tt 111000       } \\
{\tt g}&    0.0133 & 6.2  &   6 & {\tt 001001       } \\
{\tt h}&    0.0313 & 5.0  &   5 & {\tt 10001       } \\
{\tt i}&    0.0599 & 4.1  &   4 & {\tt 1001       } \\
{\tt j}&    0.0006 & 10.7 &  10 & {\tt 1101000000       } \\
{\tt k}&    0.0084 & 6.9  &   7 & {\tt 1010000       } \\
{\tt l}&    0.0335 & 4.9  &   5 & {\tt 11101       } \\
{\tt m}&    0.0235 & 5.4  &   6 & {\tt 110101       } \\
{\tt n}&    0.0596 & 4.1  &   4 & {\tt 0001       } \\
{\tt o}&    0.0689 & 3.9  &   4 & {\tt 1011       } \\
{\tt p}&    0.0192 & 5.7  &   6 & {\tt 111001       } \\
{\tt q}&    0.0008 & 10.3 &   9 & {\tt 110100001       } \\
{\tt r}&    0.0508 & 4.3  &   5 & {\tt 11011       } \\
{\tt s}&    0.0567 & 4.1  &   4 & {\tt 0011       } \\
{\tt t}&    0.0706 & 3.8  &   4 & {\tt 1111       } \\
{\tt u}&    0.0334 & 4.9  &   5 & {\tt 10101       } \\
{\tt v}&    0.0069 & 7.2  &   8 & {\tt 11010001       } \\
{\tt w}&    0.0119 & 6.4  &   7 & {\tt 1101001       } \\
{\tt x}&    0.0073 & 7.1  &   7 & {\tt 1010001       } \\
{\tt y}&    0.0164 & 5.9  &   6 & {\tt 101001       } \\
{\tt z}&    0.0007 & 10.4 &  10 & {\tt 1101000001       } \\
{--}& 0.1928 & 2.4  &   2 & {\tt 01       } \\  \bottomrule
%{\verb+-+}& 0.1928 & 2.4  &   2 & {\tt 01       } \\  \bottomrule
\end{tabular}
\hspace*{0.5in}\raisebox{-2in}{\psfig{figure=tex/sortedtree.eps,width=1.972in}}
}
\end{center}
}{%
\caption[a]{Huffman code for the English language ensemble (monogram statistics).}
%  introduced in \protect\figref{fig.monogram}.}
\label{fig.monogram.huffman}
}%
\end{figure}
% see \cite[p. 97]{Cover&Thomas}
% \medskip
\subsection{Constructing a binary tree top-down is suboptimal}
 In previous chapters we studied weighing problems
 in which we built ternary or binary trees. 
 We noticed that balanced trees -- ones in which, at every step, the two 
 possible outcomes were as close as possible to equiprobable --
 appeared to describe the most efficient experiments. 
 This gave an intuitive motivation for entropy as a measure of information 
 content.

 It is not the case, however, that optimal codes can {\em always\/}
 be constructed
 by a greedy top-down method in which the alphabet
 is successively divided into subsets that are as near as possible to equiprobable.
% /home/mackay/itp/huffman> huffman.p latex=1 < fiftywrong3
\exampla{
 Find the optimal binary symbol code for the ensemble:
\beq
\begin{array}{*{3}{@{\,}c@{\,}}*{6}{c@{\,}}*{2}{@{\,}c}}
\A_X & = & \{ & 
{\tt a}, & 
{\tt b}, & 
{\tt c}, & 
{\tt d}, & 
{\tt e}, & 
{\tt f}, & 
{\tt g} &
 \} \\ 
\P_X & = & \{ 
& 0.01, 
& 0.24, 
& 0.05, 
& 0.20, 
& 0.47, 
& 0.01, 
& 0.02 
& \} \\ 
\end{array} .
\eeq
 Notice that a greedy top-down method can split this set into two
% equiprobable
 subsets
 $\{ {\tt a},{\tt b},{\tt c},{\tt d} \}$ and $\{{\tt e},{\tt f},{\tt g}\}$
 which both have probability $1/2$,
 and that  $\{ {\tt a},{\tt b},{\tt c},{\tt d} \}$  can be divided
 into
% equiprobable
 subsets $\{ {\tt a},{\tt b} \}$ and $\{{\tt c},{\tt d}\}$,
 which have probability $1/4$; 
 so  a greedy top-down method gives the code shown
 in the third column of  \tabref{tab.greed},\margintab{
\begin{center}\small
\begin{tabular}{clll} \toprule
$a_i$  & $p_i$  & Greedy  & Huffman   \\[0in] \midrule 
{\tt a} & .01  & {\tt 000}  & {\tt 000000}     \\
{\tt b} & .24  & {\tt 001}  & {\tt 01}             \\
{\tt c} & .05  & {\tt 010}  & {\tt 0001}         \\
{\tt d} & .20  & {\tt 011}  & {\tt 001}           \\
{\tt e} & .47  & {\tt 10}   & {\tt 1}              \\
{\tt f} & .01  & {\tt 110}  & {\tt 000001}     \\
{\tt g} & .02  & {\tt 111}  & {\tt 00001}       \\
 \bottomrule
\end{tabular}
\end{center}
\caption[a]{A greedily-constructed code compared with the Huffman code.}
\label{tab.greed}
}
 which has expected length 2.53.
 The Huffman coding algorithm yields the code shown in the fourth
 column,
%\begin{center}
%\begin{tabular}{clrrl} \toprule
%$a_i$  & $p_i$  & \multicolumn{1}{c}{$\log_2 \frac{1}{p_i}$}  & $l_i$ & $c(a_i)$
%%{\rule[-3mm]{0pt}{8mm}}%strut
%\\[0in] \midrule 
%{\tt a} & 0.01        &  6.6     &   6 & {\tt 000000}   \\
%{\tt b} & 0.24        &  2.1     &   2 & {\tt 01}       \\
%{\tt c} & 0.05        &  4.3     &   4 & {\tt 0001}     \\
%{\tt d} & 0.20        &  2.3     &   3 & {\tt 001}      \\
%{\tt e} & 0.47        &  1.1     &   1 & {\tt 1}        \\
%{\tt f} & 0.01        &  6.6     &   6 & {\tt 000001}   \\
%{\tt g} & 0.02        &  5.6     &   5 & {\tt 00001}    \\
% \bottomrule
%\end{tabular}
%\end{center}
 which has 
 expected length       1.97.\ENDsolution
% entropy     1.9323
%
}
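
 Both expected lengths follow from the codeword lengths in
 \tabref{tab.greed}; the labels $L_{\rm greedy}$ and $L_{\rm Huffman}$
 are introduced here only for this check:
\beqan
        L_{\rm greedy} &=& 0.47 \times 2 + (0.01 + 0.24 + 0.05 + 0.20 + 0.01 + 0.02) \times 3
        = 0.94 + 1.59 = 2.53 ,
\\
        L_{\rm Huffman} &=& 0.47 \times 1 + 0.24 \times 2 + 0.20 \times 3 + 0.05 \times 4
        \nonumber \\
        & & + \: 0.02 \times 5 + (0.01 + 0.01) \times 6 = 1.97 .
\eeqan
 For comparison, the entropy of this ensemble is about 1.93 bits.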


%\subsection{Twenty questions}
% The Huffman algorithm defines the optimal way to 
% play `twenty questions'. 
%
% {\em [MORE HERE]}

\section{Disadvantages of the Huffman code}
\label{sec.huffman.probs}
 The Huffman\index{Huffman code!disadvantages}\index{symbol code!disadvantages}
 algorithm produces an
 optimal symbol code for an ensemble, but this is not the end of the
 story. Both the word `ensemble' and the phrase `symbol code' 
 need careful attention. 
%\begin{description}
%\item[Changing ensemble.] 
\subsection{Changing ensemble}
        If we wish to communicate a sequence of outcomes from one
        unchanging ensemble, then a Huffman code may be convenient.
        But often the appropriate ensemble changes. If for
        example we are compressing text, then the symbol frequencies
        will vary with context: in English  the letter  {\tt{u}} is
 much more probable after a {\tt{q}} than after an {\tt{e}} (\figref{fig.conbigrams}).  And
        furthermore, our knowledge of these context-dependent symbol
        frequencies will also change as we learn 
% accumulate statistics on
	the statistical properties     of the
	text source.\index{adaptive models}
% So our probabilities	should change 

        Huffman codes do not handle  changing
        ensemble probabilities with any elegance.
 One brute-force approach would be to
        recompute the Huffman code every time the probability over
        symbols changes. Another attitude is to deny the option of
        adaptation, and instead  run through the entire file in
        advance and compute a good probability distribution, which will
        then remain fixed throughout transmission. The code itself must 
 also be communicated in this scenario. Such a technique is
        not only cumbersome and restrictive, it is also suboptimal,
        since the initial message specifying the code and the document
        itself are partially redundant.
% -- knowing the algorithm that
%       defines the code for a given document, one can deduce what the
%       initial header has to be from the .
        This technique  therefore wastes bits.
% flag this: 
% could discuss bits back here
%
\subsection{The extra bit}
%item[The extra bit.] 
        An equally serious problem with Huffman codes is the
        innocuous-looking `\ind{extra bit}' relative to the ideal average
        length of $H(X)$ -- a Huffman code achieves a length that
        satisfies $H(X) \leq L(C,X) < H(X) + 1,$ as proved in theorem
        \ref{th.source.coding.symbol}.
%\eqref{eq.source.coding.symbol}).
  A
        Huffman code thus incurs an overhead of between 0 and 1 bits per
        symbol. If $H(X)$ were large, then this overhead would be an
        unimportant fractional increase.  But for many applications,
        the entropy may be as low as one bit per symbol, or even smaller,
        so  the overhead
%`$+1$'
 $L(C,X)- H(X)$ may dominate the encoded file length. Consider English
        text: in some contexts, long strings of characters may be
        highly predictable.
% , as we saw in the guessing game of chapter \chtwo.
% given a simple model of the language. 
        For
        example, in the context `{\verb+strings_of_ch+}', one might
        predict the next nine symbols to be `{\verb+aracters_+}' with
        a probability of 0.99 each. A traditional Huffman code would
        be obliged to use at least one bit per character, making a total cost
        of nine bits where virtually no information is being
        conveyed (0.13 bits in total, to be precise).
 The entropy of English, given a good model, is about
        one bit per character \cite{Shannon48}, so a Huffman code is likely to be highly
% nearly 100\%
        inefficient.
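
        The figure of 0.13 bits is the total Shannon information content
        of nine symbols, each predicted with probability 0.99:
\beq
        9 \times \log_2 \frac{1}{0.99} \simeq 9 \times 0.0145 \simeq 0.13 \:\mbox{bits} .
\eeq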

        A traditional patch-up of Huffman codes uses them to compress
        {\dem blocks\/} of symbols, for example the `extended sources'
        $X^N$ we discussed in  \chref{ch.two}.
% \ref{ch2}
% rather than defining a code for       single symbols. 
        The overhead per block is at most 1 bit so the 
 overhead per symbol
% goes down as
 is at most $1/N$ bits. For
        sufficiently large blocks, the problem of the extra bit may be
        removed -- but only at the expense of (a) losing the elegant
        instantaneous decodeability of  simple Huffman coding; and 
         (b) having 
        to compute the probabilities of all relevant strings and build 
        the associated Huffman tree. One will end up explicitly 
 computing the 
 probabilities and codes for a huge number of strings, most
 of which will never actually occur. (See \exerciseref{ex.Huff99}.)
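
        The $1/N$ bound can be made explicit under the assumption that
        the $N$ symbols in a block are independent draws from $X$, so that
        $H(X^N) = N H(X)$. Writing $L_N$ (a symbol introduced here) for the
        expected length per block and applying theorem
        \ref{th.source.coding.symbol} to the extended ensemble,
\beq
        N H(X) \leq L_N < N H(X) + 1
        \:\:\: \Rightarrow \:\:\:
        H(X) \leq \frac{L_N}{N} < H(X) + \frac{1}{N} .
\eeq
        The price is a code with $|\A_X|^N$ codewords: for a binary
        alphabet and $N = 100$, that is $2^{100} \simeq 10^{30}$ strings.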

% A further problem is that it may not be appropriate to model
%        successive symbols as coming independently from a single ensemble
%        $X$.  As we already asserted, any decent model for text will
%        assign a probability over symbols that depends on the context.
%  A changing probability distribution over symbols is
%        not incompatible with the construction of Huffman codes for
%        blocks of symbols. One could consider each possible sequence,
%        computing the relevant probability distributions along the way
%        to evaluate the probability of the entire sequence, then build
%        a Huffman tree for the sequences.  One could account for
%        dependences between blocks as well, if one were willing to
%        use a different Huffman code each time.  But this modified
% encoder would be
%        computationally expensive, since for large block sizes an
%        exponentially large number of possible sequences would have 
%        to be considered along with their adaptive probabilities.
%% is context-dependent.
% \end{description}
% \medskip

\subsection{Beyond symbol codes}
%
        Huffman codes, therefore, although widely trumpeted as 
        `optimal', have many defects for practical
 purposes.\index{Huffman code!`optimality'}
 They {\em are\/}  optimal {\em symbol\/} codes, but  for practical 
 purposes {\em we don't want a symbol code}.

        The defects of Huffman codes are rectified by {\dem arithmetic
        coding},\index{arithmetic coding} which dispenses with the
        restriction that each symbol must translate into an integer
        number of bits. Arithmetic coding is the main topic of the next 
        chapter.
% is not a symbol coding. This
%       we will discuss next.
%       In an arithmetic code, the probabilistic modelling is clearly 
%       separated from the encoding operation.


\section{Summary}
\begin{description}
\item[Kraft inequality\puncspace]
 If a code is {\dbf uniquely decodeable} its lengths must satisfy
\beq
	\sum_i 2^{-l_i } \leq 1 .
\eeq
 For any lengths satisfying the Kraft inequality, there exists
 a prefix code with those lengths.

\item[Optimal source codelengths for an ensemble] are equal to the 
 Shannon information contents\index{source code!optimal lengths}\index{source code!implicit probabilities}
\beq
	l_i = \log_2 \frac{1}{p_i} ,
\eeq
 and conversely, any choice of codelengths defines
 {\dbf\ind{implicit probabilities}}
\beq
	q_i = \frac{2^{-l_i}}{z} .
\eeq

\item[The \ind{relative entropy}] $D_{\rm KL}(\bp||\bq)$ measures
 how many bits per symbol are wasted by using a
% mismatched
 code whose implicit probabilities are $\bq$, when
 the ensemble's true probability distribution is $\bp$.

\item[Source coding theorem for symbol codes\puncspace]
 For an ensemble $X$, there exists a prefix code
 whose expected length satisfies 
\beq
	H(X) \leq L(C,X) < H(X) + 1 .
\eeq
% The expected length is only equal to the entropy if the
 
\item[The Huffman coding algorithm] generates an optimal symbol code
  iteratively. At each iteration,  the two least probable symbols are combined.
\end{description}

\section{Exercises}
\exercisaxB{2}{ex.Cnud}{
 Is the code $\{ {\tt 00}, {\tt 11}, {\tt 0101}, {\tt 111}, {\tt 1010},
 {\tt 100100}, {\tt 0110} \}$
% $\{ 00,11,0101,111,1010,100100,0110 \}$
 uniquely decodeable?
}
\exercisaxB{2}{ex.Ctern}{
 Is the ternary code
 $\{ {\tt 00},{\tt 012},{\tt 0110},{\tt 0112},{\tt 100},{\tt 201},{\tt 212},{\tt 22} \}$ uniquely decodeable?
}
\exercissxA{3}{ex.HuffX2X3}{
 Make  Huffman codes for $X^2$, $X^3$ and $X^4$ where ${\cal A}_X = \{ 0,1 \}$ 
 and ${\cal P}_X = \{ 0.9,0.1 \}$. Compute their expected lengths and compare 
 them with the entropies $H(X^2)$, $H(X^3)$ and $H(X^4)$.

 Repeat this exercise for $X^2$  and $X^4$ where ${\cal P}_X = \{ 0.6,0.4 \}$.
}
\exercissxA{2}{ex.Huffambig}{
 Find a  probability distribution $\{ p_1,p_2,p_3,p_4 \}$ such that 
 there are {\em two\/} optimal codes that assign different lengths $\{ l_i \}$
 to the four symbols.
}
\exercisaxC{3}{ex.Huffambigb}{
 (Continuation of \exerciseonlyref{ex.Huffambig}.)
 Assume that the four probabilities  $\{ p_1,p_2,p_3,p_4 \}$ are ordered
 such that $p_1 \geq p_2 \geq p_3 \geq p_4 \geq 0$.  Let 
 $\cal Q$ be  the  set of  
 all probability vectors $\bp$  such that 
 there are {\em two\/} optimal codes with different lengths.
 Give a complete description of $\cal Q$. 
 Find three probability vectors $\bq^{(1)}$,  $\bq^{(2)}$,  $\bq^{(3)}$, 
 which are the vertices of the \ind{convex hull} of  $\cal Q$, \ie, such that 
 any $\bp \in \cal Q$ can be written as 
\beq
        \bp = \mu_1 \bq^{(1)} + \mu_2 \bq^{(2)}  +\mu_3 \bq^{(3)} ,
\eeq
 where $\{\mu_i\}$ are positive.
}
\exercisaxB{1}{ex.twenty.questions}{
 Write a short essay discussing how to play
 the game of {\sf{\ind{twenty questions}}} optimally.
 [In twenty questions, one player thinks of an object,
 and the other player has to guess the object using as few binary
 questions as possible, preferably fewer than twenty.]
}
\exercisaxB{2}{ex.powertwogood}{
	Show that, if each probability $p_i$ is equal to an integer power of 2
  then there exists a source code whose expected length equals the entropy.
}
\exercissxB{2}{ex.make.huffman.suck}{
 Make ensembles for which the difference between the entropy
 and the expected length of the Huffman code is as big as possible.
}% 14. Gallager, R. G., "Variations on a Theme by Huffman", 
%     IEEE Trans. on Information Theory, Vol. IT-24, No. 6, Nov. 1978, pp. 668-674. 
%
%\exercisxB{2}{ex.huffman.biggerhalf}{
%	If one of the probabilities $p_m$ is greater than $1/2$, how
% big must the difference between the expected length and the entropy be?
% Sketch a graph the 
%}
%  from {tex/huffmanI.tex} 
\exercissxB{2}{ex.huffman.uniform}
{
% from 02q.tex on rum 
 A  source $X$  has an alphabet 
 of eleven characters $$\{  {\tt{a}} , {\tt{b}} , {\tt{c}} , {\tt{d}} , {\tt{e}} ,  {\tt{f}} , {\tt{g}} , {\tt{h}} , {\tt{i}} , {\tt{j}} , {\tt{k}} \},$$
 all of which have equal probability, $1/11$. 
% State the meaning of the ideal codelengths

  Find an {optimal uniquely decodeable  symbol code}
  for this source.
 How much greater is the expected length of this optimal code  
 than the entropy of $X$?

}
\exercisaxB{2}{ex.huffman.uniform2}{
 Consider the optimal symbol code for an ensemble $X$ with alphabet size 
 $I$ in which all symbols have identical probability 
 $p = 1/I$. $I$ is not a power of 2.

 Show that the fraction $f^+$ of the $I$ symbols  that are assigned 
 codelengths equal to 
\beq
 l^+ \equiv \lceil \log_2 I \rceil
\eeq
 satisfies 
\beq
        f^+ =  2 - \frac{2^{l^+}}{I} 
\label{eq.HIf}
\eeq
 and that the expected length of the optimal symbol code
 is 
\beq
        L = l^+ -1 + f^+ .
\label{eq.HIL}
\eeq
 By differentiating 
 the  excess length
%\beq
 $       \Delta L \equiv L - H(X)$
%\eeq
 with respect to $I$, show that the excess
 length  is bounded by
\beq
        \Delta L \leq 1 - \frac{ \ln ( \ln 2 )}{ \ln 2}  -\frac{ 1 }{ \ln 2}
                = 0.086 .
\eeq

}
\exercisaxA{2}{ex.Huff99}{
 Consider a sparse binary source with ${\cal P}_X = \{ 0.99 , 0.01  \}$. 
 Discuss how Huffman codes could be used to compress this source
 {\em efficiently}.\index{Huffman code}
 Estimate  how many codewords your proposed solutions require.
%  The entropy - hint: could think about run length encoding?
%
}
\exercisaxB{2}{ex.poisonglass}{
%   p.111 martin gardner mathematical carnival{Gardner:Carnival}
 {\em Scientific American\/} carried the following puzzle\index{puzzle!poisoned glass} in 1975.
% roughly!
\begin{description}
\item[The poisoned glass\puncspace]% This should be \exercisetitlestyle ?
 `Mathematicians are curious birds', the police commissioner said to
 his wife. `You see, we had all those partly filled glasses lined up
 in rows on a table in the hotel kitchen. Only one contained poison,
 and we wanted to know which one before searching that glass for
 fingerprints. Our lab could test the liquid in each glass, but the
 tests take time and money, so we wanted to make as few of them as
 possible by simultaneously testing mixtures of small samples from
 groups of glasses. The university sent over a mathematics professor
 to help us. He counted the glasses, smiled and said:

`$\,$``Pick any glass you want, Commissioner. We'll test it first.''

`$\,$``But won't that waste a test?'' I asked.

`$\,$``No,'' he said, ``it's part of the best procedure. We can test one glass
 first. It doesn't matter which one.''$\,$'

 `How many glasses were there to start with?' the commissioner's wife asked.

 `I don't remember. Somewhere between 100 and 200.'

 What was the exact number of glasses?

\end{description}% \cite{Gardner:Carnival}
 Solve this puzzle and then explain why the professor was in fact 
 wrong and the commissioner was right. What is in fact the optimal procedure
 for identifying the one poisoned glass? What is the expected waste
 relative to this optimum if one followed the professor's strategy?
 Explain the relationship to symbol coding.
}
% could get worked up over the all zero codeword, which corresponds to 
% a possible non-detection; if this would require an extra test
% then presumably the story is a bit different, with some deliberate 
% skewing of the tree to make it more likely that we get a positive 
%result along the way.
\exercissxA{2}{ex.optimalcodep1}{% problem fixed Tue 12/12/00
 Assume that a sequence of symbols
 from the ensemble $X$ introduced at the beginning of this
 chapter is compressed using the code $C_3$.
\amarginfignocaption{t}{
\begin{center}
$C_3$:\\[0.1in] 
\begin{tabular}{cllcc} \toprule
$a_i$ & $c(a_i)$ & $p_i$  & \multicolumn{1}{c}{$h({p_i})$}  & $l_i$ 
% {\rule[-3mm]{0pt}{8mm}}%strut
\\ \midrule 
{\tt a} & {\tt 0}   & \dhalf         &  1.0     &   1      \\
{\tt b} & {\tt 10}  & \dquarter        &  2.0     &   2      \\
{\tt c} & {\tt 110} & \deighth       &  3.0     &   3      \\
{\tt d} & {\tt 111} & \deighth       &  3.0     &   3      \\
 \bottomrule
\end{tabular}
\end{center}
}
 Imagine picking one bit at random from
 the binary encoded sequence $\bc = c(x_1)c(x_2)c(x_3)\ldots$ .
 What is the probability  that this bit is a 1?
}
\exercissxB{2}{ex.Huffmanqary}{ 
% (Optional)
 How should the\index{Huffman code!general alphabet} binary 
 Huffman encoding scheme be modified to make optimal symbol codes 
 in an encoding alphabet with $q$ symbols? (Also known as `\ind{radix} $q$'.)
}
% answer, Hamming p.73: 
% add enough states with probability zero to make the total 
% number of states equal to $k(q-1)+1$, for some integer $k$.
%  then repeatedly combine $q$ into 1


% \end{document} 
% 
% \item[A code $C(X)$ is {\em non-singular\/}] if every element of $\A_X$ 
%  maps into a different string, \ie, 
% \beq
%       a_i \not = a_j \Rightarrow c(a_i) \not = c(a_j).
% \eeq
% 
% \item[The extension $C^+$ of a code $C$] is a mapping from finite length 
%  strings of $\A_X$ to $\{0,1\}^+$
% % finite length strings of NAME? 
%  defined by the concatentation:
% \beq
%       c(x_1 x_2 \ldots x_N) = c(x_1)c(x_2)\ldots c(x_N)
% \eeq
% 
% \item[A code is uniquely decodeable] if its extension is non-singular.
%
\subsection*{Mixture codes}
 It is a tempting idea to construct a `\ind{metacode}' from several symbol
 codes that assign different-length codewords to the alternative
 symbols, then  switch from one
  code to another, choosing whichever assigns the shortest codeword
   to the current symbol.
   Clearly we cannot do this for free.\index{bits back}
   If one wishes to  choose between two codes, then 
 it is necessary to lengthen the message in a way that 
 indicates which of the two codes is being used. If we indicate this
 choice by 
 a single leading bit, it will be found that the resulting code 
 is suboptimal because it is incomplete (that is,
 it fails the Kraft equality).
\exercissxA{3}{ex.mixsubopt}{
 Prove that this metacode  is incomplete,
 and explain why this combined code is 
 suboptimal.
}
  

%
% need more on prefix property to make clear how strings are decodeable,
% self-punctuating.
\dvips
\section{Solutions}% to Chapter \protect\ref{ch3}'s exercises} 
\fakesection{solns 3}
\soln{ex.C1101}{
 Yes,
 $C_2 = \{ {\tt{1}} , {\tt{1}}{\tt{0}}{\tt{1}} \}$
% $C_2 = \{ 1 , 101 \}$
 is uniquely decodeable, even though 
 it is not a prefix code, because no two different strings 
 can map onto the same string; only the codeword $c(a_2)={\tt 101}$ contains 
 the symbol {\tt0}. 
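 As an illustration, the encoded string {\tt 11011} can only be parsed
 in one way: the single {\tt 0} must be the middle symbol of a codeword
 {\tt 101}, which fixes the codeword boundaries on either side of it,
 and every remaining symbol is a {\tt 1} and so is a complete codeword $c(a_1)$.
 Writing the decoding explicitly,
\[
 {\tt 11011} \:=\: c(a_1)\, c(a_2)\, c(a_1) .
\]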
}
\soln{ex.KIconverse}{
 We wish to prove that for any set of codeword lengths $\{ l_i \}$
 satisfying the \Kraft\ inequality, there is a prefix code having those 
 lengths.
%
%  Symbol Coding Budget -- cut this figure later, it is already in _l3
%
\begin{figure}[htbp]
\figuremargin{%
\begin{center}
\mbox{\psfig{figure=figs/budget1.eps,height=3in}\ \psfig{figure=figs/budgetmax.eps,height=3in}}
\end{center}
}{%
\caption[a]{The codeword supermarket and
 the symbol coding budget. The `cost' $2^{-l}$ of each codeword
 (with length $l$)
 is indicated by the size of the box it is written in. The total budget 
 available when making a uniquely decodeable code is 1.}
\label{fig.budget1a}
}%
\end{figure}
 This is readily proved by thinking of  
 the codewords  illustrated in \figref{fig.budget1a}
 as being in a `codeword supermarket', with size indicating 
 cost. We imagine purchasing\index{source code!supermarket}\index{supermarket for codewords} 
 codewords one at a time, starting from the shortest codewords (\ie, the biggest
 purchases), 
 using the  budget shown at the right of \figref{fig.budget1a}. 
 We start at one side of the codeword supermarket, say the top, 
 and purchase the first codeword of the required length. We advance down 
 the supermarket a distance $2^{-l}$, and purchase the next codeword 
  of the next required length, and so forth. 
 Because the codeword lengths are getting longer, and the corresponding 
 intervals are getting shorter, we can always buy 
 an adjacent codeword to the latest purchase, so there is no wasting 
 of the budget. Thus at the $I$th codeword we have advanced 
 a distance $\sum_{i=1}^{I} 2^{-l_i}$ down the supermarket; 
 if $\sum 2^{-l_i} \leq 1$, we will have purchased 
 all the codewords without running out of budget.
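
 As a concrete illustration (the lengths here are chosen only as an example),
 take $\{ l_i \} = \{ 1,2,3,3 \}$, whose Kraft sum is
 $1/2 + 1/4 + 1/8 + 1/8 = 1$. The successive purchases start at
 distances $0$, $1/2$, $3/4$ and $7/8$ down the supermarket; writing each
 starting point in binary to $l_i$ places gives the codewords
\[
 0.0 \rightarrow {\tt 0}, \:\:\:
 0.10 \rightarrow {\tt 10}, \:\:\:
 0.110 \rightarrow {\tt 110}, \:\:\:
 0.111 \rightarrow {\tt 111} ,
\]
 which form a prefix code with the required lengths.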
}
\soln{ex.Huffmanconverse}{
 The proof that Huffman coding is optimal depends on 
 proving that the key step in the algorithm -- the decision to give
% combination of 
 the two symbols
 with smallest probability equal encoded lengths
 -- cannot lead to a larger expected length 
 than any other code. We can prove this by contradiction. 

 Assume that 
 the two symbols with smallest probability, called $a$ and $b$, 
 to which the Huffman algorithm would assign equal length 
 codewords, 
 do {\em not\/} have equal lengths in {\em any\/}
 optimal symbol code. 
 The optimal symbol code is some 
 other rival code in which these two codewords 
 have  unequal lengths $l_a$ and $l_b$ with $l_a < l_b$.
 Without loss of
 generality we can assume that this other  code is a complete prefix code, 
 because any codelengths of a uniquely decodeable code
 can  be realized by a prefix code.  
%  We now consider transforming the other code into a new code 
%  in which we interchange \ldots

 In this rival code, there must be some other symbol $c$ whose 
 probability $p_c$ is greater than $p_a$ and whose length 
 in the rival code is greater than or equal to $l_b$, because 
 the code for $b$ must have an adjacent codeword of equal or greater
 length -- a complete prefix code never has a solo codeword 
 of the maximum length.
\begin{figure}%[htbp]
\figuremargin{%
\begin{tabular}{llllll} \toprule % \hline
symbol & \multicolumn{2}{c}{probability} & Huffman  & Rival code's & Modified rival \\
  & & & codewords & codewords & code \\ \midrule % [0.1in]\hline
$a$ & $p_a$ & \framebox[0.15in]{} &  \framebox[1.50cm]{$c_{\rm H}(a)$} & \framebox[1.0cm]{$c_{\rm R}(a)$} &  \framebox[1.6cm]{$c_{\rm R}(c)$}   
\\[0.1in]	   
$b$ & $p_b$ & \framebox[0.1in]{}  &  \framebox[1.50cm]{$c_{\rm H}(b)$} & \framebox[1.5cm]{$c_{\rm R}(b)$} &  \framebox[1.5cm]{$c_{\rm R}(b)$} 
\\[0.1in]		                                         
$c$ & $p_c$ & \framebox[0.25in]{} &  \framebox[0.95cm]{$c_{\rm H}(c)$} & \framebox[1.6cm]{$c_{\rm R}(c)$} &  \framebox[1.0cm]{$c_{\rm R}(a)$}   
\\ \bottomrule  % [0.1in] \hline
\end{tabular}
}{%
\caption[a]{Proof that Huffman coding makes an optimal symbol code.
% The proof works by contradiction. 
 We assume that the rival code, which is said to be optimal, assigns {\em unequal\/} length 
 codewords to the two symbols with smallest probability,  $a$ and $b$. 
 By interchanging  codewords $a$ and $c$ of the rival code, where $c$ is a
 symbol with rival codelength as long as $b$'s, we can make 
 a code better than the rival code. This shows that the rival code
 was not optimal.
}
\label{fig.huffman.optimal}
}%
\end{figure}

 Consider exchanging the codewords of $a$ and $c$ (\figref{fig.huffman.optimal}), so that 
 $a$ is encoded with the longer codeword that was $c$'s, and 
 $c$, which is more probable than $a$, gets the shorter codeword. 
 Clearly this reduces the expected length of the code. 
 The change in expected length is $(p_a-p_c)(l_c-l_a)$. 
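 To check this, note that only the terms belonging to $a$ and $c$ are
 affected by the exchange:
\[
 ( p_a l_c + p_c l_a ) - ( p_a l_a + p_c l_c ) = (p_a - p_c)(l_c - l_a) ,
\]
 which is negative because $p_a < p_c$ and $l_a < l_b \leq l_c$.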
 Thus we have contradicted the assumption that the rival code is optimal.
 Therefore it is valid to give the two symbols
 with smallest probability equal encoded lengths. 
 Huffman coding produces optimal symbol codes.\ENDsolution  
}
%\soln{ex.Cnud}{
%\soln{ex.Ctern}{
\soln{ex.HuffX2X3}{
 A Huffman code 
 for $X^2$ where ${\cal A}_X = \{ {\tt 0},{\tt 1} \}$ 
 and ${\cal P}_X = \{ 0.9,0.1 \}$ 
 is $\{{\tt 00},{\tt 01},{\tt 10},{\tt 11}\} \rightarrow
 \{{\tt 1},{\tt 01},{\tt 000},{\tt 001}\}$.
 This code has $L(C,X^2) = 1.29$, whereas the entropy $H(X^2)$ is 0.938.
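 These figures can be checked directly. The four strings have probabilities
 $0.81$, $0.09$, $0.09$ and $0.01$, so
\[
\begin{array}{rcl}
 L(C,X^2) &=& 0.81 \times 1 + 0.09 \times 2 + 0.09 \times 3 + 0.01 \times 3 \:=\: 1.29 , \\
 H(X^2) &=& 2 H_2(0.1) \:\simeq\: 2 \times 0.469 \:=\: 0.938 .
\end{array}
\]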

  
 A Huffman code  for $X^3$ is
\[
\begin{array}{c}
\{{\tt 000},{\tt 100},{\tt 010},{\tt 001},{\tt 101},{\tt 011},{\tt 110},{\tt 111}\}
 \rightarrow\\
\hspace*{1in} \{{\tt 1},{\tt 011},{\tt 010},{\tt 001},
 {\tt 00000},{\tt 00001},{\tt 00010},{\tt 
 00011}\}.
\end{array}
\]
% corrected from 1.472 to 1.598
% 9802
 This has expected length $L(C,X^3) =  1.598$ whereas  the entropy $H(X^3)$
 is 1.4069.

  A Huffman code  for $X^4$ maps the sixteen source strings to the
 following codelengths: 
\[
\begin{array}{c}
 \{ {\tt 0000},{\tt 1000},{\tt 0100},{\tt 0010},{\tt 0001},{\tt 1100},{\tt 0110},{\tt 0011},{\tt 0101}, 
 {\tt 1010},{\tt 1001},{\tt 1110},{\tt 1101}, \\
 {\tt 1011},{\tt 0111},{\tt 1111} \}
 \rightarrow \:\: \{ 1,3,3,3,4,6,7,7,7,7,7,9,9,9,10,10 \}.
% 10,10,9,9,9,7,7,7,7,7,6,4,3, 3,3,1\}.
\end{array}
\]
 This has expected length $L(C,X^4) =  1.9702$ whereas  the entropy $H(X^4)$
 is 1.876.
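 As a check on the expected length, group the strings by their number of {\tt 1}s:
 the codelengths of the strings containing one, two, three and four {\tt 1}s
 sum to 13, 41, 37 and 10 respectively, so
\[
\begin{array}{rcl}
 L(C,X^4) &=& 0.6561 \times 1 + 0.0729 \times 13 + 0.0081 \times 41 \\
  & & \mbox{} + 0.0009 \times 37 + 0.0001 \times 10 \\
  &=& 0.6561 + 0.9477 + 0.3321 + 0.0333 + 0.0010 \:=\: 1.9702 .
\end{array}
\]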
% 

% 0.6,0.4
 When ${\cal P}_X = \{ 0.6,0.4 \}$, the Huffman code for $X^2$ has lengths
 $\{ 2,2,2,2 \}$; the expected length is 2 bits, and the
 entropy is 1.94 bits. A
 Huffman code  for $X^4$ is shown in \tabref{fig.X4huff2}.
% , has lengths
% $\{0000,1000,0100,0010,0001,1100,0110,0011,0101,1010,1001,1110,1101,1011,0111,1111\} \rightarrow$
% $\{3,3,4,4,4,4,4,4,4,4,4,4,5,5,5,5\}$.
 The expected length is 3.92 bits, and the entropy is 3.88 bits.
% see tmp3 for soln using huffman.p
% $\{0000,1000,0100,0010,0001,1100,0110,0011,0101,1010,1001,1110,1101,1011,0111,1111\} \rightarrow \{5,5,5,5,4,4,4,4,4,4,4,4,4,4,3,3\}$.
}
% see tmp3 for use of huffman.p
%\begin{figure}
%\figuremargin{%
\margintab{\footnotesize
\begin{center}
\begin{tabular}{clrl} \toprule % \hline
$a_i$  & $p_i$  &
% \multicolumn{1}{c}{$h({p_i})$}  &
 $l_i$ & $c(a_i)$
% {\rule[-3mm]{0pt}{8mm}}%strut
% \\[0.1in] \hline
\\ \midrule 
{\tt 0000} & 0.1296      &   3 & {\tt 000 }\\ 
{\tt 0001} & 0.0864      &   4 & {\tt 0100 }\\ 
{\tt 0010} & 0.0864      &   4 & {\tt 0110 }\\ 
{\tt 0100} & 0.0864      &   4 & {\tt 0111 }\\ 
{\tt 1000} & 0.0864      &   3 & {\tt 100 }\\ 
{\tt 1100} & 0.0576      &   4 & {\tt 1010 }\\ 
{\tt 1010} & 0.0576      &   4 & {\tt 1100 }\\ 
{\tt 1001} & 0.0576      &   4 & {\tt 1101 }\\ 
{\tt 0110} & 0.0576      &   4 & {\tt 1110 }\\ 
{\tt 0101} & 0.0576      &   4 & {\tt 1111 }\\ 
{\tt 0011} & 0.0576      &   4 & {\tt 0010 }\\ 
{\tt 1110} & 0.0384      &   5 & {\tt 00110 }\\ 
{\tt 1101} & 0.0384      &   5 & {\tt 01010 }\\ 
{\tt 1011} & 0.0384      &   5 & {\tt 01011 }\\ 
{\tt 0111} & 0.0384      &   4 & {\tt 1011 }\\ 
{\tt 1111} & 0.0256      &   5 & {\tt 00111 }\\ \bottomrule %\hline
%expected length     3.9248
%entropy     3.8838
\end{tabular}
\end{center}
%}{%
\caption[a]{Huffman code for $X^4$ when $p_0=0.6$. Column 3 shows the
 assigned codelengths and column 4 the codewords. Some strings
 whose probabilities are identical, \eg, the fourth and fifth,
 receive different codelengths.}
\label{fig.X4huff2}
}%
%\end{figure}
\soln{ex.Huffambig}{
 The set of probabilities  $\{ p_1,p_2,p_3,p_4 \} = 
 \{ \dsixth,\dsixth,\dthird,\dthird\}$ gives rise to two different optimal 
 sets of codelengths, because at the second step of the Huffman 
 coding algorithm we can choose any of the three possible pairings. 
 We may either put them in a constant length code
 $\{ {\tt00},{\tt01},{\tt10},{\tt11} \}$ or
 the code $\{ {\tt000},{\tt001},{\tt01},{\tt1} \}$. 
 Both codes have expected length 2.
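 Explicitly, the two expected lengths are
\[
\begin{array}{rcl}
 \dfrac{1}{6} \times 2 + \dfrac{1}{6} \times 2 + \dfrac{1}{3} \times 2 + \dfrac{1}{3} \times 2 &=& 2 , \\[0.1in]
 \dfrac{1}{6} \times 3 + \dfrac{1}{6} \times 3 + \dfrac{1}{3} \times 2 + \dfrac{1}{3} \times 1 &=& 2 .
\end{array}
\]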

 Another solution is  $\{ p_1,p_2,p_3,p_4 \}$ $=$ 
 $\{ \dfifth,\dfifth,\dfifth,\dtwofifth\}$.
% =$ $\{ 0.2 , 0.2 , 0.2 , 0.4 \} $.

 And a third is  $\{ p_1,p_2,p_3,p_4 \} = 
 \{ \dthird,\dthird,\dthird,0\}$.
}
\soln{ex.make.huffman.suck}{
 Let $p_{\max}$ be the largest probability in $p_1,p_2,\ldots,p_I$.
 The difference between the  expected length
 $L$ and the entropy $H$  can be no bigger than
 $\max ( p_{\max} , 0.086 )$ \cite{Gallager78}.

%
 See exercises \ref{ex.huffman.uniform}--\ref{ex.huffman.uniform2} to understand
 where the curious 0.086 comes from.
}
\soln{ex.huffman.uniform}{
% removed to  cutsolutions.tex
 Length $-$ entropy = 0.086.
%length / entropy     1.0249

}
% \soln{ex.Huff99}{
% BORDERLINE
\soln{ex.optimalcodep1}{% problem fixed Tue 12/12/00
	There are two ways to answer this problem correctly,
 and one popular way  to answer it incorrectly.
 Let's give the incorrect answer first:
\begin{description}
\item[Erroneous answer\puncspace]
 ``We can pick a random bit by first picking a
 random source symbol $x_i$ with probability $p_i$,
 then picking a random bit from $c(x_i)$. If we define $f_i$
 to be the fraction of the bits of $c(x_i)$ that are {\tt 1}s,
 we find
\marginpar[b]{\small
\begin{center}
$C_3$: 
\begin{tabular}{cllc} \toprule
$a_i$ & $c(a_i)$ & $p_i$  & $l_i$ 
\\ \midrule 
{\tt a} & {\tt 0}   & \dhalf     &   1      \\
{\tt b} & {\tt 10}  & \dquarter        &   2      \\
{\tt c} & {\tt 110} & \deighth       &   3      \\
{\tt d} & {\tt 111} & \deighth       &   3      \\
 \bottomrule
\end{tabular}
\end{center}
}
\beqan
\!\!\!\!\!\!\!\!\!\!
 P(\mbox{bit is {\tt 1}}) &=& \sum_i p_i f_i
\label{eq.wrongp1}
	\\ &=&
	\dfrac{1}{2} \times 0 + 
	\dfrac{1}{4} \times \dfrac{1}{2} + 
	\dfrac{1}{8} \times \dfrac{2}{3} + 
	\dfrac{1}{8} \times 1
	= \dthird \mbox{.''}
\eeqan
\end{description}
 This answer is wrong because it falls for the \index{bus-stop paradox}{bus-stop fallacy},\index{paradox}
 which was introduced in \exerciseref{ex.waitbus}: if buses arrive
 at random, and we are interested in `the average time from  one bus until
 the next', we must distinguish two possible averages:
 (a) the average time from a randomly chosen bus until the next;
 (b) the average time between the bus you just missed and the next bus.
 The second `average' is twice as big as the first because,
 by waiting for a bus at a random time, you bias your selection of
 a bus in favour of buses that follow a large gap. You're unlikely
 to catch a bus that comes 10 seconds after a preceding bus!
 Similarly, the symbols {\tt c} and {\tt d} get encoded into
 longer-length binary strings than {\tt a}, so when we pick a bit
 from the compressed string at random, we are more likely
 to land in a bit belonging to a {\tt c} or a {\tt d}
 than would be given by the probabilities $p_i$ in the
 expectation (\ref{eq.wrongp1}). All the probabilities need to
 be scaled up by $l_i$, and renormalized.
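 For the code $C_3$ the products $p_i l_i$ are $1/2$, $1/2$, $3/8$ and $3/8$,
 which sum to $7/4$, so the probabilities that a randomly chosen bit belongs
 to an {\tt a}, {\tt b}, {\tt c} or {\tt d} are
\[
 \frac{p_i l_i}{\sum_j p_j l_j} \: = \: \dfrac{2}{7} , \: \dfrac{2}{7} , \: \dfrac{3}{14} , \: \dfrac{3}{14} ,
\]
 and weighting the fractions $f_i$ by these probabilities instead of by $p_i$ gives
\[
 \dfrac{2}{7} \times 0 + \dfrac{2}{7} \times \dfrac{1}{2} + \dfrac{3}{14} \times \dfrac{2}{3} + \dfrac{3}{14} \times 1 = \dfrac{1}{2} .
\]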
\begin{description}
\item[Correct answer in the same style\puncspace]
 Every time symbol $x_i$ is encoded, $l_i$ bits
 are added to the binary string, of which $f_i l_i$ are {\tt 1}s.
 The expected number of  {\tt 1}s added per symbol is
\beq
	\sum_i p_i f_i l_i ;
\eeq
 and the expected total number of bits added per symbol is
\beq
	\sum_i p_i  l_i .
\eeq
 So the fraction of {\tt 1}s in the transmitted string is
\beqan
	P(\mbox{bit is {\tt 1}}) &=& \frac{ \sum_i p_i f_i l_i }{ \sum_i p_i  l_i }
\label{eq.rightp1}
	\\ &=&
\frac{	\dfrac{1}{2} \times 0 + 
	\dfrac{1}{4} \times 1 + 
	\dfrac{1}{8} \times 2 + 
	\dfrac{1}{8} \times 3
}{ \dfrac{7}{4} }
	= \frac{\dfrac{7}{8}}{\dfrac{7}{4}}  = 1/2  .
\nonumber
\eeqan
\end{description}
 For a general symbol code and a general ensemble,
 the expectation (\ref{eq.rightp1}) is the correct answer.
 But in this case, we can use a more powerful argument.
\begin{description}
\item[Information-theoretic answer\puncspace]
 The encoded string $\bc$ is the output of
 an optimal compressor that compresses samples from
 $X$ down to an expected length of $H(X)$ bits. We can't expect to compress
 this data any further. But if the probability $P(\mbox{bit is {\tt 1}})$
 were not equal to $\dhalf$ then it {\em would\/} be possible to compress
 the binary string further (using a block compression code, say).
 Therefore  $P(\mbox{bit is {\tt 1}})$
 must be equal to $\dhalf$; indeed the probability of any sequence
 of $l$ bits in the compressed stream taking on any particular
 value must be $2^{-l}$.  The output of a perfect compressor is always
 perfectly random bits.

\begincuttable
 To put it another way, if the probability $P(\mbox{bit is {\tt 1}})$
 were not equal to $\dhalf$, then the information content per bit of
 the compressed string would be at most $H_2( P(\mbox{{\tt 1}}) )$,
 which would be less than 1;
 but this contradicts the fact that we can recover the original data
 from $\bc$, so the information content per bit of the
 compressed string must be $H(X)/L(C,X)=1$.
\ENDcuttable
\end{description}
}
%
% this one is a new addition 
%
\soln{ex.Huffmanqary}{ The \index{Huffman code!general alphabet}{general Huffman coding algorithm} for 
 an encoding alphabet with $q$ symbols
 has one difference from the binary case. 
 The process of combining $q$ symbols into 
 1 symbol reduces the number of symbols by $q\!-\!1$. 
 So if we start with $A$ symbols, we'll only end up 
 with a complete $q$-ary tree if $A \mod (q\!-\!1)$ is equal 
 to 1. 
 Otherwise, we know that whatever prefix code we make, it 
 must be an incomplete tree with a number of missing 
 leaves equal, modulo $(q\!-\!1)$, to  $(1 - A) \mod (q\!-\!1)$. 
 For example, if a ternary tree is built for eight symbols, 
 then there will unavoidably be one missing leaf in the tree.

 The optimal $q$-ary code is made by putting these 
 extra leaves in the longest branch of the tree. This can be achieved
 by adding the appropriate number of symbols to the original source 
 symbol set, all of these extra symbols having probability zero. 
 The total number of leaves is then equal to $r(q\!-\!1)+1$, for some
 integer $r$.
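 Equivalently, $r$ is the smallest integer such that $r(q\!-\!1)+1 \geq A$,
 and the number of zero-probability symbols added is $r(q\!-\!1)+1-A$.
 For the ternary example with $A=8$ source symbols,
\[
 r = 4 , \:\:\: r(q\!-\!1)+1 = 4 \times 2 + 1 = 9 ,
\]
 so one extra symbol is added, in agreement with the single missing leaf noted above.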
 The symbols are then repeatedly combined by taking 
 the $q$ symbols with smallest probability and replacing them 
 by a single symbol, as in the binary Huffman coding algorithm.}

\soln{ex.mixsubopt}{ 
%This is important but I haven't written it yet. 
 We wish to show that a greedy \ind{metacode}, which 
 picks the code which gives the shortest encoding, is 
 actually suboptimal, because it is incomplete: its Kraft 
 sum is strictly less than 1. 

% For generality, let's call the 
% that the objects to be encoded, 
% $x$, `symbols'.
 We'll assume that each symbol $x$ is
 assigned lengths $l_k(x)$ by each of the candidate codes $C_k$. 
 Let us assume there are $K$ alternative codes and that we can 
 encode which code is being used with a header of length $\log_2 K$
 bits. 
 Then the metacode assigns lengths $l'(x)$ that are given by 
\beq
        l'(x) = \log_2 K + \min_k l_k(x) .
\eeq
 We compute the Kraft sum:
\beq
 S = \sum_x 2^{- l'(x)}
                = \frac{1}{K}  \sum_x 2^{- \min_k l_k(x)} .
\eeq 
 Let's divide the set $\A_X$ into non-overlapping  subsets $\{\A_k\}_{k=1}^{K}$ 
 such that  subset $\A_k$ contains all the symbols  $x$ 
 that  the metacode sends via code $k$.
 Then 
\beq
        S = \frac{1}{K} \sum_k \sum_{x \in \A_{k}}  2^{- l_k(x)} .
\eeq
 Now if one sub-code $k$ satisfies the Kraft equality
 $\sum_{x\in \A_X} 2^{- l_k(x)}  \eq 1$, then 
 it must be the case that 
\beq
 \sum_{x \in \A_{k}}  2^{- l_k(x)}  \leq 1 , 
\label{eq.from.kraft}
\eeq
 with equality only if all the symbols $x$ are in $\A_k$, which would mean that we
 are only using one of the $K$ codes. 
 So
\beq
        S \leq \frac{1}{K} \sum_{k=1}^K 1 = 1 ,
\eeq
 with equality only if \eqref{eq.from.kraft} is an equality for all codes $k$. 
 But   it's impossible for all the symbols to be in {\em all\/} the 
 non-overlapping  subsets $\{\A_k\}_{k=1}^{K}$, so 
 we can't have equality  (\ref{eq.from.kraft}) holding 
 for {\em all\/} $k$.
 So
%\beq
%        S < 1 .
%\eeq
 $S < 1$.
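
 As a small illustration (these two codes are invented for the example and
 are not taken from elsewhere in the text), let $K=2$ and let two complete codes
 for a four-symbol alphabet have lengths
 $\{ l_1(x) \} = \{1,2,3,3\}$ and $\{ l_2(x) \} = \{3,3,2,1\}$.
 Then $\min_k l_k(x)$ takes the values $\{1,2,2,1\}$, the metacode has
 lengths $\{ l'(x) \} = \{2,3,3,2\}$, and
\[
 S = \frac{1}{2} \left( 2^{-1} + 2^{-2} + 2^{-2} + 2^{-1} \right) = \frac{3}{4} < 1 .
\]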

 Another way of seeing that a mixture code is suboptimal is to consider
 the binary tree that it defines.  Think of the special case of two
 codes.  The first bit we send identifies which code we are using.
 Now, in a complete code, any subsequent binary string is a valid
 string. But once we know that we are using, say, code A, we know that
 what follows can only be a codeword corresponding to a symbol $x$
 whose encoding is shorter under code A than code B.  So some strings
 are invalid continuations, and the mixture code is incomplete
  and suboptimal. 
 
%%% MAYBE!!!!!!!!!!!!!!
 For further discussion of  this issue
 and its relationship to probabilistic modelling 
 read about  `\ind{bits back} coding' in \secref{sec.bitsback}
 and in \citeasnoun{frey-98}.
}

% \dvipsb{solutions 3}
\prechapter{About      Chapter}
\fakesection{prerequisites for chapter known as 4}
 Before reading  \chref{ch.four}, you should have read  the previous chapter
 and worked on 
 most of the exercises in it.

 We'll also make use of some Bayesian modelling ideas
 that arrived in the vicinity of \exerciseref{ex.postpa}.

% Arithmetic coding has been invented several times,
% by Elias, by Rissanen, and 
% but is only slowly becoming well known
%
% {The description of Lempel--Ziv coding  is based on that of Cover and Thomas (1991).}

%\chapter{Data Compression III: Stream Codes}
\mysetcounter{page}{126}
\ENDprechapter
\chapter{Stream Codes}
\label{ch.four}
\label{ch.ac}
% _l4.tex  
\fakesection{Data Compression III: Stream Codes}
%
% still need to change notation for R(|)
%
\label{ch4}
 In this chapter we discuss  two data
 compression schemes.\index{source code!stream codes|(}\index{stream codes|(}
%% that constitute the state of the art. 

 {\dem\indexs{arithmetic coding}Arithmetic coding}
 is a beautiful   method that goes 
 hand in hand with the philosophy that compression of data 
 from a source entails
 probabilistic modelling of that source. As of 1999, 
 the best compression methods for text files use arithmetic coding,
 and several state-of-the-art image compression systems
 use it too.

  {\dem\ind{Lempel--Ziv coding}} is a `\ind{universal}' method, 
%  in my opinion an ugly hack, but 
 designed under the philosophy that we would like a single compression
 algorithm that will do a reasonable job for {\em any\/} source.
 In fact, for many real
 life sources, this
 algorithm's universal properties hold only 
 in the limit of unfeasibly large amounts of data, but,
 all the same, Lempel--Ziv compression is   widely used
 and often  effective.

\section{The guessing game}
\label{sec.startofch4}
% \looseness=-1 this did not achieve what was advertised!
 As a motivation for these\index{game!guessing}
 two compression  methods,
% let us
 consider the redundancy in a  typical
% imagine compressing a 
 \ind{English} text file. Such files have redundancy at several levels: for example,
 they contain the ASCII characters with non-equal frequency; certain consecutive
 pairs  of letters are more probable than others; and entire words 
 can be predicted given the context and a semantic understanding
 of the text.

 To illustrate the redundancy of English, and a curious way in which 
 it could be compressed, we can imagine a \ind{guessing game}
 in which an English speaker repeatedly
 attempts to predict the next character
 in a text file. 

% \subsection{The guessing game}
\label{sec.guessing}
% Could discuss the compression of English text by guessing 
 For simplicity, let us assume that the allowed alphabet consists
 of
 the 26 upper case letters {\tt  A,B,C,\ldots, Z} and a space `{\tt -}'.
 The game involves asking the subject to guess the next character
 repeatedly, the only feedback being whether the guess is correct
 or not, until the character is correctly guessed. 
 After a correct guess, we note the number of guesses that 
 were made when the character was identified, and ask the subject
 to guess the next character in the same way. 

 One sentence 
% given by Shannon
 gave the following result when a human was asked to guess it.
%  in a guessing game.
 The numbers of guesses 
 are listed below each character.\index{reverse}\index{motorcycle}
%  and the idea of having an identical twin. This introduces the idea 
%  of mapping to a different alphabet with nonuniform probability.
%  The guessing game. From Shannon.
\smallskip
\begin{center}\hspace*{0.3in}
%\begin{tabular}{*{36}{c@{\,\,}}}
\begin{tabular}{*{36}{p{0.15in}@{}}}
\small\tt
T&\small\tt H&\small\tt E&\small\tt R&\small\tt E&\small\tt -&\small\tt I&\small\tt S&\small\tt -&\small\tt N&\small\tt O&\small\tt -&\small\tt R&\small\tt E&\small\tt V&\small\tt E&\small\tt R&\small\tt S&\small\tt E&\small\tt -&\small\tt O&\small\tt N&\small\tt -&\small\tt A&\small\tt -&\small\tt M&\small\tt O&\small\tt T&\small\tt O&\small\tt R&\small\tt C&\small\tt Y&\small\tt C&\small\tt L&\small\tt E&\small\tt -\\
\footnotesize
1&\footnotesize 1&\footnotesize 1&\footnotesize 5&\footnotesize 1&\footnotesize 1&\footnotesize 2&\footnotesize 1&\footnotesize 1&\footnotesize 2&\footnotesize 1&\footnotesize 1&\footnotesize \hspace{-0.05in}1\hspace{-0.25mm}5&\footnotesize 1&\footnotesize \hspace{-0.05in}1\hspace{-0.25mm}7&\footnotesize 1&\footnotesize 1&\footnotesize 1&\footnotesize 2&\footnotesize 1&\footnotesize 3&\footnotesize 2&\footnotesize 1&\footnotesize 2&\footnotesize 2&\footnotesize 7&\footnotesize 1&\footnotesize 1&\footnotesize 1&\footnotesize 1&\footnotesize 4&\footnotesize 1&\footnotesize 1&\footnotesize 1&\footnotesize 1&\footnotesize 1\\
\end{tabular}
\smallskip
\end{center}
% attempt to tighten this para:
\looseness=-1
 Notice that in many cases, the next letter is guessed immediately, in one guess.
 In other cases, particularly at the start of syllables,
 more guesses are needed.

 What do this game and these results offer us?
 First, they demonstrate the redundancy of English from the point of
 view of an English speaker. 
 Second, this game might be used in
 a data compression scheme, as follows.

% encoding
 The string of numbers `1, 1, 1, 5, 1, \ldots', listed above,
 was obtained by presenting
 the text to the subject. The maximum number of guesses that the
 subject will make for a given letter is twenty-seven, so what the subject is
 doing for us is performing a time-varying mapping of the twenty-seven letters
 $\{ {\tt A,B,C,\ldots, Z,-}\}$ onto the twenty-seven numbers $\{1,2,3,\ldots,
 27\}$, which we can view as symbols in a new alphabet. The total number of
 symbols has not been reduced, but since he uses some of
 these symbols much more frequently than others -- for example, 1 and
 2 -- it should be  easy to compress this new string of
 symbols.
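 To get a rough feel for the possible saving, we can use the empirical
 frequencies of the thirty-six guess counts listed above (a crude estimate from a
 single short sentence, which ignores any dependence between successive counts).
 The symbol 1 occurs 24 times, the symbol 2 occurs 6 times, and six other
 symbols occur once each, so the entropy of this empirical distribution is
\[
 \frac{24}{36} \log_2 \frac{36}{24} + \frac{6}{36} \log_2 \frac{36}{6}
   + 6 \times \frac{1}{36} \log_2 36 \:\simeq\: 0.39 + 0.43 + 0.86 \:\simeq\: 1.7 
\]
 bits per character, compared with $\log_2 27 \simeq 4.75$ bits for a uniform
 distribution over the twenty-seven symbols.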
% ; we will discuss data compression
%% the details  of how to do this
% properly  shortly. 

% decoding
 How would the {\em uncompression\/} of the sequence of numbers
 `1, 1, 1, 5, 1, \ldots' work? At uncompression time,
 we do not have the original string `{\small\tt{THERE}}\ldots', we  
 have only the encoded sequence. Imagine that our subject has an
 absolutely \ind{identical twin}\index{twin}
%({\em absolutely\/} identical)
 who also
 plays the guessing game\index{guessing game} with us, as if we
%, the experimenters,
 knew the source text.
 If we stop him whenever he has made a
 number of guesses equal to the given number, then he will have just 
 guessed the correct letter, and we can then say `yes, that's right', 
 and move to the next character.
 Alternatively, if the identical twin is not available, we could
 design a compression system with the help of just one human as follows.
 We choose a window length $L$, that is, a number of characters of context
 to show the human. For every one of the $27^L$ possible
 strings of length $L$, we ask them, `What would you predict is the next character?',
 and `If that prediction were wrong, what would your next guesses be?'.
 After tabulating their answers to these $26 \times 27^L$ questions,
 we could use two copies of these enormous tables at the encoder and the
 decoder in place of the two human twins.
 Such a language model is called an $L$th order \ind{Markov model}.
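 The size of these tables grows very quickly with $L$. For example,
\[
\begin{array}{lcl}
 L = 3 : & &  26 \times 27^3 = 26 \times 19\,683 = 511\,758 \mbox{ questions} , \\
 L = 5 : & &  26 \times 27^5 = 26 \times 14\,348\,907 = 373\,071\,582 \mbox{ questions} .
\end{array}
\]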

 These systems are clearly  unrealistic  for practical compression, 
 but they illustrate several principles that we will make use of now.



\section{Arithmetic codes}
\label{sec.ac}
% In lecture 2 we discussed fixed length block codes. 
 When we discussed variable-length symbol codes, and the optimal 
 Huffman algorithm for constructing them, we concluded by pointing 
 out two practical
 and theoretical problems with Huffman codes (section \ref{sec.huffman.probs}). 

%
% index decision:  {arithmetic coding} not  {arithmetic codes}
%
        These defects are rectified by {\dem\index{arithmetic coding}{arithmetic codes}}, which
 were  invented by Elias\nocite{EliasACmentionedpages61to62},\index{Elias, Peter}
 by \index{Rissanen, Jorma}{Rissanen} and by \index{Pasco, Richard}{Pasco}, 
 and subsequently made practical by
% Witten, Neal, and Cleary. 
 \citeasnoun{arith_coding}.\index{Neal, Radford} 
        In an arithmetic code, the
 probabilistic modelling is clearly separated from the encoding
 operation. 
 The system is rather similar to the guessing game.\index{guessing game}
%  that we considered  in Chapter \chtwo.
 The human predictor is replaced by a
 {\dem\ind{probabilistic model}} of the source. 
 As each symbol is produced by the source, the probabilistic model 
 supplies a {\dem\ind{predictive distribution}}
 over all possible values of the next
 symbol, that is, a list of positive numbers $\{ p_i \}$ that sum to 
 one. If we choose to model the source as producing i.i.d.\ symbols with some
 known distribution, 
 then the predictive distribution is the same every time; but arithmetic
 coding can with equal ease handle  complex  adaptive models that produce 
 context-dependent
% time-varying
 predictive distributions.  The predictive  model is usually
 implemented in a computer program.
% a model which hypothesizes arbitrary 
% context-dependences and non-stationarities, and which learns as it 
% goes, so that predictive distributions in any given context gradually 
% sharpen up. 
% I will give  an example later on, 
% of an adaptive model producing appropriate  probabilities 
% but first let us discuss the arithmetic coding algorithm itself. 

 The encoder makes use of the model's predictions  to create a 
 binary string. The decoder makes use of an identical twin of the 
 model (just as in the guessing \index{guessing game}game) to interpret the binary string.

 Let the source alphabet be $\A_X = \{a_1 ,\ldots, a_I\}$, and let the 
 $I$th symbol $a_I$ have the special meaning `end of transmission'.
 The source
 spits out a sequence $x_1,x_2,\ldots,x_n,\ldots.$ The source does {\em not\/}
 necessarily produce i.i.d.\ symbols.
 We will assume that a computer program is provided to the encoder
 that assigns a predictive 
 probability distribution over $a_i$ given the sequence that has occurred 
 thus far, 
 $P(x_n \eq a_i \given x_1,\ldots,x_{n-1})$. 
% Nor will we assume that the source
% is correctly modeled by $P$. But if it is, then arithmetic coding achieves 
% the Shannon rate.
%
% The encoder will send a binary transmission to the receiver.
%
        The receiver has  an identical program that produces the 
 same predictive 
 probability distribution $P(x_n \eq a_i \given x_1,\ldots,x_{n-1})$.
% and uses  it to interpret the received message.
\medskip

\begin{figure}[htbp]
\figuremargin{%
\begin{center}
\setlength{\unitlength}{1mm}
\begin{picture}(50,40)(0,0)
\put(18,40){\makebox(0,0)[r]{0.00}}
\put(18,30){\makebox(0,0)[r]{0.25}}
\put(18,20){\makebox(0,0)[r]{0.50}}
\put(18,10){\makebox(0,0)[r]{0.75}}
\put(18, 0){\makebox(0,0)[r]{1.00}}
%
% major horizontals
%
\put(20,40){\line(1,0){37}}
\put(20,30){\line(1,0){13}}
\put(20,20){\line(1,0){28}}
\put(20,10){\line(1,0){13}}
\put(20, 0){\line(1,0){37}}
%
% biggest intervals
%
\put(45,30){\vector(0,1){9}}
\put(45,30){\vector(0,-1){9}}
\put(47,30){\makebox(0,0)[l]{{\tt{0}}}}
\put(45,10){\vector(0,1){9}}
\put(45,10){\vector(0,-1){9}}
\put(47,10){\makebox(0,0)[l]{{\tt{1}}}}
%
\put(35,25){\vector(0,1){4}}
\put(35,25){\vector(0,-1){4}}
\put(37,25){\makebox(0,0)[l]{{\tt{01}}}}
% some subdivs
\put(20,35){\line(1,0){7}}
\put(20,25){\line(1,0){7}}
\put(20,15){\line(1,0){7}}
\put(20, 5){\line(1,0){7}}
%
% 01101 = 13/32 = 16.25 
% 01110 = 14/32 = 17.5  
\put(20,23.75){\line(1,0){4}}
\put(20,22.50){\line(1,0){4}}
\put(62,23.125){\makebox(0,0)[l]{{\tt{01101}}}}
%
% interrupted pointer: 
\put(60,23.125){\line(-1,0){14}}
\put(44,23.125){\line(-1,0){8}}
\put(34,23.125){\vector(-1,0){9.5}}
%
\end{picture}
\end{center}
}{%
\caption[a]{Binary strings define real intervals within the real line [0,1).
  We first encountered a picture like this when we discussed the 
 \index{supermarket for codewords}\index{symbol code!supermarket}\index{source code!supermarket}{symbol-code supermarket} in  \chref{ch3}.
}
\label{fig.arith.Rbinary}
}%
\end{figure}
\subsection{Concepts for understanding arithmetic coding}
\begin{aside}
%\item[Notation for intervals.]
 {\sf Notation for intervals.} The interval $[0.01, 0.10)$ is all numbers 
 between $0.01$ and $0.10$, including $0.01\dot{0}\equiv0.01000\ldots$ but not $0.10\dot{0}\equiv0.10000\ldots.$
\end{aside}

        A binary transmission defines an interval within
 the real line from 0 to 1. For example, the string {\tt{01}} is
 interpreted as a binary real number 0.01\ldots, which corresponds to
 the interval  $[0.01, 0.10)$ in binary, \ie, the interval
 $[0.25,0.50)$ in base ten. 

%
% why strange line breaks?
%
 The longer string  {\tt{01101}} corresponds to a smaller
 interval $[0.01101,$ $0.01110)$. Because {\tt{01101}} has the first string, 
 {\tt{01}}, as a prefix, the new interval is a sub-interval
 of the interval $[0.01, 0.10)$.
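 In base ten, {\tt{01101}} is the binary fraction $13/32$ and the interval has
 width $2^{-5} = 1/32$, so it is
\[
 [ 13/32 , 14/32 ) = [ 0.40625 , 0.4375 ) ,
\]
 which does indeed lie inside $[0.25,0.50)$. In general, a string of $l$ bits
 $b_1 b_2 \ldots b_l$ corresponds to the interval of width $2^{-l}$ that
 starts at $\sum_{i=1}^{l} b_i 2^{-i}$.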
 A one-megabyte binary file ($2^{23}$ bits) is thus viewed as specifying a number 
 between 0 and 1 to a precision of about two and a half million
% $10^7$
 decimal places, because
 each byte translates into a little more than two decimal digits.
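
 Both statements can be checked by hand (a worked check; the decimal values
 below are computed here rather than quoted from elsewhere in the text).
 Read in binary, the end points of the interval for {\tt{01101}} are
\beq
 0.01101 = \frac{1}{4} + \frac{1}{8} + \frac{1}{32} = \frac{13}{32}
 \:\:\: \mbox{and} \:\:\:
 0.01110 = \frac{14}{32} ,
\eeq
 \ie, $[0.40625, 0.4375)$ in base ten, which does lie inside $[0.25,0.50)$.
 For the file, each bit is worth $\log_{10} 2 \simeq 0.301$ decimal digits, so
\beq
 2^{23} \log_{10} 2 \simeq 2.5 \times 10^{6} \:\: \mbox{decimal digits} ,
\eeq
 or $8 \log_{10} 2 \simeq 2.4$ decimal digits per byte.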
% byte = 8 bits ~= 2 digits.
% 
% one meg-byte = 2^3 * 2^20  = 2^23 binary places -> 2.5*10^7  or (2**23=8388608) .
% shall I tell you a bedtime number between 0 and 1 to 10^7 d.p. darling?
%
\medskip

        Now, we can also
% Similarly, we can
 divide the real line [0,1) into $I$ intervals of 
 lengths equal to the probabilities $P(x_1 \eq a_i)$, as shown
 in \figref{fig.arith.R}.
%                   upsidedown
% p1 = 6         -- 34   mid: 37  w = 3-1
% p2 = 16 cum 22 -- 18   mid: 26  w = 8-1
% last = 6 cum   --  6   mid:  3  w = 3-1
\newcommand{\aonelevel}{34}
\newcommand{\atwolevel}{18}
\newcommand{\apenlevel}{6}% penultimate
\newcommand{\apenmid}{12}% put dots here
\newcommand{\aonemid}{37}
\newcommand{\aonew}{2}
\newcommand{\atwow}{7}
\newcommand{\atwomid}{26}
\newcommand{\aIw}{2}
\newcommand{\aImid}{3}
\begin{figure}[htbp]
\figuremargin{%
\begin{center}
\setlength{\unitlength}{1mm}
\begin{picture}(50,40)(0,0)
\put(18,40){\makebox(0,0)[r]{0.00}}
\put(18,\aonelevel){\makebox(0,0)[r]{$P(x_1\eq a_1)$}}
\put(18,\atwolevel){\makebox(0,0)[r]{$P(x_1\eq a_1)+P(x_1\eq a_2)$}}
\put(18,\apenlevel){\makebox(0,0)[r]{$P(x_1\eq a_1)+\ldots+P(x_1\eq a_{I\!-\!1})$}}
\put(18, 0){\makebox(0,0)[r]{1.0}}
%
% major horizontals
%
\put(20,40){\line(1,0){37}}
\put(20,\aonelevel){\line(1,0){20}}
\put(20,\atwolevel){\line(1,0){20}}
\put(20,\apenlevel){\line(1,0){20}}
\put(20, 0){\line(1,0){37}}
\put(30,\apenmid){\makebox(0,0)[l]{$\vdots$}}
%
% biggest intervals
%
\put(35,\aonemid){\vector(0,1){\aonew}}
\put(35,\aonemid){\vector(0,-1){\aonew}}
\put(37,\aonemid){\makebox(0,0)[l]{$a_1$}}% or $P(x_1\eq a_1)$}}
\put(35,\atwomid){\vector(0,1){\atwow}}
\put(35,\atwomid){\vector(0,-1){\atwow}}
\put(37,\atwomid){\makebox(0,0)[l]{$a_2$}}% or $P(x_1\eq a_2)$}}
\put(35,\aImid){\vector(0,1){\aIw}}
\put(35,\aImid){\vector(0,-1){\aIw}}
\put(37,\aImid){\makebox(0,0)[l]{$a_I$}}% or $P(x_1\eq a_I)$}}
\put(37,\apenmid){\makebox(0,0)[l]{$\vdots$}}
%
\put(20,23){\line(1,0){4}}% beg of a5
\put(20,20){\line(1,0){4}}% end a5
%
\put(62,21.5){\makebox(0,0)[l]{$a_2 a_5$}}
% interrupted pointer: 
\put(60,21.5){\line(-1,0){24}}
\put(34,21.5){\vector(-1,0){9.5}}
%
% a2a1: 34 is the top
%
\put(20,30){\line(1,0){4}}% end of a1
\put(20,28){\line(1,0){4}}% end of a2
\put(20,25){\line(1,0){4}}% end of a3
%
\put(62,32){\makebox(0,0)[l]{$a_2 a_1$}}
% interrupted pointer: 
\put(60,32){\line(-1,0){24}}
\put(34,32){\vector(-1,0){9.5}}
%
\end{picture}
\end{center}
}{%
\caption[a]{A probabilistic model defines real
 intervals within the real line [0,1).}
\label{fig.arith.R}
}%
\end{figure}

 We may then take each interval $a_i$ and subdivide it into intervals
 denoted $a_ia_1,a_ia_2,\ldots, a_ia_I$, such that the length of
 $a_ia_j$ is proportional to $P(x_2 \eq a_j \given x_1 \eq a_i)$. Indeed, the
 length of the interval $a_ia_j$ will be precisely the joint probability
\beq
	P(x_1 \eq
 a_i,x_2\eq a_j)=P(x_1\eq a_i)P(x_2 \eq a_j \given x_1 \eq a_i).
\eeq

        Iterating this procedure, the interval $[0,1)$ can be divided
 into a sequence of intervals corresponding to all possible finite
 length strings $x_1x_2\ldots x_N$, such that the length of an
 interval is equal to the probability of the string given our model.
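
 Written out for a string of general length $N$ (this is just the chain
 rule of probability applied to the model), the length of the interval
 labelled $x_1x_2\ldots x_N$ is
\beq
 P(x_1,x_2,\ldots,x_N) = \prod_{n \eq 1}^{N} P(x_n \given x_1,\ldots,x_{n-1}) ,
\eeq
 where the $n \eq 1$ factor is simply $P(x_1)$. For any fixed $N$ these
 lengths sum to one over all $I^N$ strings, so the intervals
 tile $[0,1)$ with no gaps and no overlaps.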
% This iterative procedure

\subsection{Formulae describing arithmetic coding}
\begin{aside}
 The process depicted in \figref{fig.arith.R} can be written
 explicitly as follows.
 The  intervals are  defined in terms of the lower and upper cumulative probabilities
\beqan
        Q_{n}(a_i \given x_1,\ldots,x_{n-1})
 &       \equiv & \sum_{i'\eq 1 }^{i-1} P(x_n \eq a_{i'} \given x_1,\ldots,x_{n-1}) ,
\label{eq.arith.Q} \\
        R_{n}(a_i \given x_1,\ldots,x_{n-1})
  &      \equiv & \sum_{i'\eq 1 }^{i} P(x_n \eq a_{i'} \given x_1,\ldots,x_{n-1}) .
\label{eq.arith.R}
\eeqan
%
 As the $n$th  symbol arrives, we subdivide the $(n-1)$th
 interval at the points defined by $Q_n$ and $R_n$. 
 For example, starting with the first symbol,
 the intervals `$a_1$',  `$a_2$',
% `$a_3$',
 and `$a_I$'  are
% first interval, 
% which we will call
\beq
 a_1 \leftrightarrow  [Q_{1}(a_1),R_{1}(a_1))= [0,P(x_1 \eq a_1)) ,
\eeq
\beq
 a_2 \leftrightarrow [Q_{1}(a_2),R_{1}(a_2))=
 \left[
 P(x_1\eq a_1),P(x_1\eq  a_1)+P(x_1\eq a_2) \right) ,
\eeq
%\beq
% a_3 \leftrightarrow [Q_{1}(a_3),R_{1}(a_3))=
% \left[
% P(x\eq a_1)+P(x\eq a_2) , P(x\eq  a_1)+P(x\eq a_2) +P(x\eq a_3)\right),
%\eeq
 and
\beq
 a_I  \leftrightarrow
 \left[ Q_{1}(a_{I}) , R_{1}(a_{I}) \right)
 = \left[ P(x_1\eq a_1)+\ldots+P(x_1\eq a_{I\!-\!1}) ,1.0 \right) .
\eeq
 Algorithm \ref{alg.ac} describes the general procedure.
\end{aside}

\begin{algorithm}
\begin{framedalgorithmwithcaption}{
\caption[a]{Arithmetic coding.
 Iterative procedure to find the interval $[u,v)$
 	for the string   $x_1x_2\ldots x_N$.
}
\label{alg.ac}
}
%\algorithmmargin{%
\begin{center}
\begin{tabular}{l}
%\begin{description}% should be ALGORITHM
%\item[Iterative procedure to find the interval $[u,v)$
% corresponding to
% 	for the string   $x_1x_2\ldots x_N$]
%
 {\tt $u$ := 0.0} \\
 {\tt $v$ := 1.0} \\
 {\tt $p$ := $v-u$} \\
 {\tt for $n$ = 1 to $N$  \verb+{+ } \\
 \hspace*{0.5in} Compute the cumulative probabilities $Q_n$ and $R_n$
	\protect(\ref{eq.arith.Q},\,\ref{eq.arith.R})
%       $\{  R_{n}(a_i \given x_1,\ldots,x_{n-1})  \}_{i=1}^{I}$
%% $\{ R_{n,i \given x_1,\ldots,x_{n-1}} \}_{i=0}^{I}$
%        using \eqref{eq.arith.R} \\
 \\
 \hspace*{0.5in} {\tt $v$ := $u + p R_{n}(x_n \given x_1,\ldots,x_{n-1}) $  } \\
 \hspace*{0.5in} {\tt $u$ := $u + p Q_{n}(x_n \given x_1,\ldots,x_{n-1}) $  } \\
 \hspace*{0.5in}  {\tt $p$ := $v-u$} \\
% {\tt  ) } \\
 {\tt   \verb+}+ } \\
\end{tabular} 
\end{center}
%\end{description}
%}
\end{framedalgorithmwithcaption}
\end{algorithm}
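
 The updates in Algorithm \ref{alg.ac} can be unrolled as follows (a sketch:
 the subscripted names $u_n$, $v_n$ and $p_n$, denoting the values of $u$, $v$
 and $p$ after the $n$th pass through the loop, are introduced here and do not
 appear in the algorithm itself). Abbreviating
 $Q_{n}(x_n \given x_1,\ldots,x_{n-1})$ and $R_{n}(x_n \given x_1,\ldots,x_{n-1})$
 to $Q_n$ and $R_n$, and starting from $u_0 \eq 0$, $p_0 \eq 1$, we have
\beqan
 p_n & = & p_{n-1} ( R_n - Q_n ) \: = \: p_{n-1} P(x_n \given x_1,\ldots,x_{n-1}) , \\
 u_N & = & \sum_{n \eq 1}^{N} p_{n-1} Q_n , \\
 v_N & = & u_N + p_N .
\eeqan
 The first line uses the fact that $R_n$ and $Q_n$ differ by exactly the
 probability of the symbol $x_n$; it shows that the final width $p_N$ is the
 probability that the model assigns to the whole string.
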
        To encode a string $x_1x_2\ldots x_N$,
 we  locate the interval corresponding to $x_1x_2\ldots x_N$, and 
 send a binary string whose interval lies within 
 that interval. This encoding can be performed
 on the fly, as we now illustrate.

% \eof defined in itprnnchapter
\subsection{Example: compressing the tosses of a bent coin}
 Imagine that we watch as a bent coin is tossed some number of times (\cf\
 \exampleref{exa.bentcoin} and  \secref{sec.bentcoin}
 (\pref{sec.bentcoin})).
 The two outcomes when the coin is tossed
 are denoted $\tt a$ and $\tt b$. A third possibility is that the
 experiment is halted, an event denoted by the  `end of file' symbol, `$\eof$'.
 Because the coin is bent, we expect that the probabilities of the outcomes $\tt a$
 and $\tt b$ are not equal, though beforehand we don't know which
 is the more probable outcome.

% Let $\A_X=\{a,b,\eof\}$, where 
% $a$ and $\tb$ make up a binary alphabet with 
% $\eof$ is an `end of file' symbol.
\subsubsection{Encoding\subsubpunc}
 Let the source string be `$\tt bbba\eof$'. We pass along the string one symbol
 at a time and use our model to compute the probability
 distribution of the next symbol given the string thus far.
 Let these probabilities be: 
\[\begin{array}{l*{3}{r@{\eq}l}} \toprule
\mbox{Context } \\
\mbox{(sequence thus far) }
       & \multicolumn{6}{c}{\mbox{Probability of next symbol}} \\[0.05in] \midrule
& P( \ta ) &  0.425 & P( \tb ) &  0.425 & P( \eof ) &  0.15 \\[0.05in]
\tb& P( \ta  \given  \tb ) &   0.28 & P( \tb  \given  \tb ) &   0.57 & P( \eof  \given  \tb ) &   0.15 \\[0.05in]
\tb\tb&P( \ta  \given  \tb\tb ) &   0.21 & P( \tb  \given  \tb\tb ) &   0.64 & P( \eof  \given  \tb\tb ) &   0.15 \\[0.05in]
\tb\tb\tb&P( \ta  \given  \tb\tb\tb ) &   0.17 & P( \tb  \given  \tb\tb\tb ) &   0.68 & P( \eof  \given  \tb\tb\tb ) &   0.15 \\[0.05in]
\tb\tb\tb\ta& P( \ta  \given  \tb\tb\tb\ta ) &   0.28 & P( \tb  \given  \tb\tb\tb\ta ) &   0.57 & P( \eof  \given  \tb\tb\tb\ta ) &   0.15 \\ \bottomrule
\end{array}
\]
 \Figref{fig.ac} shows the corresponding intervals.  The
 interval $\tb$ is the middle 0.425 of $[0,1)$. The interval $\tb\tb$ is the
 middle 0.567 of $\tb$, and so forth.
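
 As a check on these intervals, here are the values of $u$, $v$ and $p$
 produced by Algorithm \ref{alg.ac} for the source string (a worked sketch:
 the symbols are ordered $\ta$, $\tb$, $\eof$ as in \figref{fig.ac}, and the
 boundaries are quoted to four decimal places as drawn in that figure; because
 the table entries are rounded to two decimal places, they reproduce these
 boundaries only approximately).
\[\begin{array}{lll} \toprule
\mbox{String} & [u,v) & p \eq v-u \\ \midrule
\tb & [0.4250,0.8500) & 0.4250 \\
\tb\tb & [0.5454,0.7862) & 0.2408 \\
\tb\tb\tb & [0.5966,0.7501) & 0.1535 \\
\tb\tb\tb\ta & [0.5966,0.6227) & 0.0261 \\
\tb\tb\tb\ta\eof & [0.6188,0.6227) & 0.0039 \\ \bottomrule
\end{array}
\]
 For instance, taking $P( \tb \given \tb ) \simeq 0.567$ and hence
 $P( \ta \given \tb ) \simeq 0.283$, the second row follows from
 $u \simeq 0.425 + 0.425 \times 0.283 \simeq 0.545$ and
 $v = 0.425 + 0.425 \times 0.85 \simeq 0.786$.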
% in the following figure. 

\begin{figure}[htbp]
\figuremargin{%
\begin{center}
% created by ac.p only_show_data=1 > ac/ac_data.tex   %%%%%%% and edited by hand
\mbox{
\hspace{-0.1in}\small
\setlength{\unitlength}{4.8in}
%\setlength{\unitlength}{5.75in}
\begin{picture}(0.59130434782608698452,1)(-0.29565217391304349226,0)
\thinlines
% line    0.0000 from   -0.5000 to    0.0000 
\put(  -0.2957,   1.0000){\line(1,0){   0.2957}}
% a at   -0.4500,   0.2125
\put(  -0.2811,   0.7875){\makebox(0,0)[r]{\tt{a}}}
% line    0.4250 from   -0.5000 to    0.0000 
\put(  -0.2957,   0.5750){\line(1,0){   0.2957}}
% b at   -0.4500,   0.6375
\put(  -0.2811,   0.3625){\makebox(0,0)[r]{\tt{b}}}
% line    0.8500 from   -0.5000 to    0.0000 
\put(  -0.2957,   0.1500){\line(1,0){   0.2957}}
% \teof at   -0.4500,   0.9250
\put(  -0.2811,   0.0750){\makebox(0,0)[r]{\teof}}
% line    1.0000 from   -0.5000 to    0.0000 
\put(  -0.2957,   0.0000){\line(1,0){   0.2957}}
% ba at   -0.3500,   0.4852
\put(  -0.2220,   0.5148){\makebox(0,0)[r]{\tt{ba}}}
% line    0.5454 from   -0.4500 to    0.0000 
\put(  -0.2661,   0.4546){\line(1,0){   0.2661}}
% bb at   -0.3500,   0.6658
\put(  -0.2220,   0.3342){\makebox(0,0)[r]{\tt{bb}}}
% line    0.7862 from   -0.4500 to    0.0000 
\put(  -0.2661,   0.2138){\line(1,0){   0.2661}}
% b\teof at   -0.3500,   0.8181
\put(  -0.2220,   0.1819){\makebox(0,0)[r]{\tt{b\teof}}}
% bba at   -0.2300,   0.5710
\put(  -0.1510,   0.4290){\makebox(0,0)[r]{\tt{bba}}}
% line    0.5966 from   -0.3500 to    0.0000 
\put(  -0.2070,   0.4034){\line(1,0){   0.2070}}
% bbb at   -0.2300,   0.6734
\put(  -0.1510,   0.3266){\makebox(0,0)[r]{\tt{bbb}}}
% line    0.7501 from   -0.3500 to    0.0000 
\put(  -0.2070,   0.2499){\line(1,0){   0.2070}}
% bb\teof at   -0.2300,   0.7682
\put(  -0.1510,   0.2318){\makebox(0,0)[r]{\tt{bb\teof}}}
% bbba at   -0.1000,   0.6096
\put(  -0.0741,   0.3904){\makebox(0,0)[r]{\tt{bbba}}}
% line    0.6227 from   -0.2300 to    0.0000 
\put(  -0.1360,   0.3773){\line(1,0){   0.1360}}
% bbbb at   -0.1000,   0.6749
\put(  -0.0741,   0.3251){\makebox(0,0)[r]{\tt{bbbb}}}
% line    0.7271 from   -0.2300 to    0.0000 
\put(  -0.1360,   0.2729){\line(1,0){   0.1360}}
% bbb\teof at   -0.1000,   0.7386
\put(  -0.0741,   0.2614){\makebox(0,0)[r]{\tt{bbb\teof}}}
% line    0.6040 from   -0.1000 to    0.0000 
\put(  -0.0591,   0.3960){\line(1,0){   0.0591}}
% line    0.6188 from   -0.1000 to    0.0000 
\put(  -0.0591,   0.3812){\line(1,0){   0.0591}}
% line    0.0000 from    0.0100 to    0.5000 
\put(   0.0059,   1.0000){\line(1,0){   0.2897}}
% 0 at    0.0100,   0.2500
\put(   0.2811,   0.7500){\makebox(0,0)[l]{\tt0}}
% line    0.5000 from    0.0100 to    0.5000 
\put(   0.0059,   0.5000){\line(1,0){   0.2897}}
% 1 at    0.0100,   0.7500
\put(   0.2811,   0.2500){\makebox(0,0)[l]{\tt1}}
% line    1.0000 from    0.0100 to    0.5000 
\put(   0.0059,   0.0000){\line(1,0){   0.2897}}
% 00 at    0.0100,   0.1250
\put(   0.2397,   0.8750){\makebox(0,0)[l]{\tt00}}
% line    0.2500 from    0.0100 to    0.4500 
\put(   0.0059,   0.7500){\line(1,0){   0.2602}}
% 01 at    0.0100,   0.3750
\put(   0.2397,   0.6250){\makebox(0,0)[l]{\tt01}}
% 000 at    0.0100,   0.0625
\put(   0.1806,   0.9375){\makebox(0,0)[l]{\tt000}}
% line    0.1250 from    0.0100 to    0.3800 
\put(   0.0059,   0.8750){\line(1,0){   0.2188}}
% 001 at    0.0100,   0.1875
\put(   0.1806,   0.8125){\makebox(0,0)[l]{\tt001}}
% 0000 at    0.0100,   0.0312
% was at 0.1037, move 0.02 right -> 1207
\put(   0.1207,   0.9688){\makebox(0,0)[l]{\tt0000}}
% line    0.0625 from    0.0100 to    0.2800 
\put(   0.0059,   0.9375){\line(1,0){   0.1597}}
% 0001 at    0.0100,   0.0938
\put(   0.1207,   0.9062){\makebox(0,0)[l]{\tt0001}}
% 00000 at    0.0100,   0.0156
\put(   0.0387,   0.9844){\makebox(0,0)[l]{\tt00000}}
% line    0.0312 from    0.0100 to    0.1500 
\put(   0.0059,   0.9688){\line(1,0){   0.0828}}
% 00001 at    0.0100,   0.0469
\put(   0.0387,   0.9531){\makebox(0,0)[l]{\tt00001}}
% line    0.0156 from    0.0100 to    0.0400 
\put(   0.0059,   0.9844){\line(1,0){   0.0177}}
% line    0.0078 from    0.0100 to    0.0200 
\put(   0.0059,   0.9922){\line(1,0){   0.0059}}
% line    0.0234 from    0.0100 to    0.0200 
\put(   0.0059,   0.9766){\line(1,0){   0.0059}}
% line    0.0469 from    0.0100 to    0.0400 
\put(   0.0059,   0.9531){\line(1,0){   0.0177}}
% line    0.0391 from    0.0100 to    0.0200 
\put(   0.0059,   0.9609){\line(1,0){   0.0059}}
% line    0.0547 from    0.0100 to    0.0200 
\put(   0.0059,   0.9453){\line(1,0){   0.0059}}
% 00010 at    0.0100,   0.0781
\put(   0.0387,   0.9219){\makebox(0,0)[l]{\tt00010}}
% line    0.0938 from    0.0100 to    0.1500 
\put(   0.0059,   0.9062){\line(1,0){   0.0828}}
% 00011 at    0.0100,   0.1094
\put(   0.0387,   0.8906){\makebox(0,0)[l]{\tt00011}}
% line    0.0781 from    0.0100 to    0.0400 
\put(   0.0059,   0.9219){\line(1,0){   0.0177}}
% line    0.0703 from    0.0100 to    0.0200 
\put(   0.0059,   0.9297){\line(1,0){   0.0059}}
% line    0.0859 from    0.0100 to    0.0200 
\put(   0.0059,   0.9141){\line(1,0){   0.0059}}
% line    0.1094 from    0.0100 to    0.0400 
\put(   0.0059,   0.8906){\line(1,0){   0.0177}}
% line    0.1016 from    0.0100 to    0.0200 
\put(   0.0059,   0.8984){\line(1,0){   0.0059}}
% line    0.1172 from    0.0100 to    0.0200 
\put(   0.0059,   0.8828){\line(1,0){   0.0059}}
% 0010 at    0.0100,   0.1562
\put(   0.1207,   0.8438){\makebox(0,0)[l]{\tt0010}}
% line    0.1875 from    0.0100 to    0.2800 
\put(   0.0059,   0.8125){\line(1,0){   0.1597}}
% 0011 at    0.0100,   0.2188
\put(   0.1207,   0.7812){\makebox(0,0)[l]{\tt0011}}
% 00100 at    0.0100,   0.1406
\put(   0.0387,   0.8594){\makebox(0,0)[l]{\tt00100}}
% line    0.1562 from    0.0100 to    0.1500 
\put(   0.0059,   0.8438){\line(1,0){   0.0828}}
% 00101 at    0.0100,   0.1719
\put(   0.0387,   0.8281){\makebox(0,0)[l]{\tt00101}}
% line    0.1406 from    0.0100 to    0.0400 
\put(   0.0059,   0.8594){\line(1,0){   0.0177}}
% line    0.1328 from    0.0100 to    0.0200 
\put(   0.0059,   0.8672){\line(1,0){   0.0059}}
% line    0.1484 from    0.0100 to    0.0200 
\put(   0.0059,   0.8516){\line(1,0){   0.0059}}
% line    0.1719 from    0.0100 to    0.0400 
\put(   0.0059,   0.8281){\line(1,0){   0.0177}}
% line    0.1641 from    0.0100 to    0.0200 
\put(   0.0059,   0.8359){\line(1,0){   0.0059}}
% line    0.1797 from    0.0100 to    0.0200 
\put(   0.0059,   0.8203){\line(1,0){   0.0059}}
% 00110 at    0.0100,   0.2031
\put(   0.0387,   0.7969){\makebox(0,0)[l]{\tt00110}}
% line    0.2188 from    0.0100 to    0.1500 
\put(   0.0059,   0.7812){\line(1,0){   0.0828}}
% 00111 at    0.0100,   0.2344
\put(   0.0387,   0.7656){\makebox(0,0)[l]{\tt00111}}
% line    0.2031 from    0.0100 to    0.0400 
\put(   0.0059,   0.7969){\line(1,0){   0.0177}}
% line    0.1953 from    0.0100 to    0.0200 
\put(   0.0059,   0.8047){\line(1,0){   0.0059}}
% line    0.2109 from    0.0100 to    0.0200 
\put(   0.0059,   0.7891){\line(1,0){   0.0059}}
% line    0.2344 from    0.0100 to    0.0400 
\put(   0.0059,   0.7656){\line(1,0){   0.0177}}
% line    0.2266 from    0.0100 to    0.0200 
\put(   0.0059,   0.7734){\line(1,0){   0.0059}}
% line    0.2422 from    0.0100 to    0.0200 
\put(   0.0059,   0.7578){\line(1,0){   0.0059}}
% 010 at    0.0100,   0.3125
\put(   0.1806,   0.6875){\makebox(0,0)[l]{\tt010}}
% line    0.3750 from    0.0100 to    0.3800 
\put(   0.0059,   0.6250){\line(1,0){   0.2188}}
% 011 at    0.0100,   0.4375
\put(   0.1806,   0.5625){\makebox(0,0)[l]{\tt011}}
% 0100 at    0.0100,   0.2812
\put(   0.1207,   0.7188){\makebox(0,0)[l]{\tt0100}}
% line    0.3125 from    0.0100 to    0.2800 
\put(   0.0059,   0.6875){\line(1,0){   0.1597}}
% 0101 at    0.0100,   0.3438
\put(   0.1207,   0.6562){\makebox(0,0)[l]{\tt0101}}
% 01000 at    0.0100,   0.2656
\put(   0.0387,   0.7344){\makebox(0,0)[l]{\tt01000}}
% line    0.2812 from    0.0100 to    0.1500 
\put(   0.0059,   0.7188){\line(1,0){   0.0828}}
% 01001 at    0.0100,   0.2969
\put(   0.0387,   0.7031){\makebox(0,0)[l]{\tt01001}}
% line    0.2656 from    0.0100 to    0.0400 
\put(   0.0059,   0.7344){\line(1,0){   0.0177}}
% line    0.2578 from    0.0100 to    0.0200 
\put(   0.0059,   0.7422){\line(1,0){   0.0059}}
% line    0.2734 from    0.0100 to    0.0200 
\put(   0.0059,   0.7266){\line(1,0){   0.0059}}
% line    0.2969 from    0.0100 to    0.0400 
\put(   0.0059,   0.7031){\line(1,0){   0.0177}}
% line    0.2891 from    0.0100 to    0.0200 
\put(   0.0059,   0.7109){\line(1,0){   0.0059}}
% line    0.3047 from    0.0100 to    0.0200 
\put(   0.0059,   0.6953){\line(1,0){   0.0059}}
% 01010 at    0.0100,   0.3281
\put(   0.0387,   0.6719){\makebox(0,0)[l]{\tt01010}}
% line    0.3438 from    0.0100 to    0.1500 
\put(   0.0059,   0.6562){\line(1,0){   0.0828}}
% 01011 at    0.0100,   0.3594
\put(   0.0387,   0.6406){\makebox(0,0)[l]{\tt01011}}
% line    0.3281 from    0.0100 to    0.0400 
\put(   0.0059,   0.6719){\line(1,0){   0.0177}}
% line    0.3203 from    0.0100 to    0.0200 
\put(   0.0059,   0.6797){\line(1,0){   0.0059}}
% line    0.3359 from    0.0100 to    0.0200 
\put(   0.0059,   0.6641){\line(1,0){   0.0059}}
% line    0.3594 from    0.0100 to    0.0400 
\put(   0.0059,   0.6406){\line(1,0){   0.0177}}
% line    0.3516 from    0.0100 to    0.0200 
\put(   0.0059,   0.6484){\line(1,0){   0.0059}}
% line    0.3672 from    0.0100 to    0.0200 
\put(   0.0059,   0.6328){\line(1,0){   0.0059}}
% 0110 at    0.0100,   0.4062
\put(   0.1207,   0.5938){\makebox(0,0)[l]{\tt0110}}
% line    0.4375 from    0.0100 to    0.2800 
\put(   0.0059,   0.5625){\line(1,0){   0.1597}}
% 0111 at    0.0100,   0.4688
\put(   0.1207,   0.5312){\makebox(0,0)[l]{\tt0111}}
% 01100 at    0.0100,   0.3906
\put(   0.0387,   0.6094){\makebox(0,0)[l]{\tt01100}}
% line    0.4062 from    0.0100 to    0.1500 
\put(   0.0059,   0.5938){\line(1,0){   0.0828}}
% 01101 at    0.0100,   0.4219
\put(   0.0387,   0.5781){\makebox(0,0)[l]{\tt01101}}
% line    0.3906 from    0.0100 to    0.0400 
\put(   0.0059,   0.6094){\line(1,0){   0.0177}}
% line    0.3828 from    0.0100 to    0.0200 
\put(   0.0059,   0.6172){\line(1,0){   0.0059}}
% line    0.3984 from    0.0100 to    0.0200 
\put(   0.0059,   0.6016){\line(1,0){   0.0059}}
% line    0.4219 from    0.0100 to    0.0400 
\put(   0.0059,   0.5781){\line(1,0){   0.0177}}
% line    0.4141 from    0.0100 to    0.0200 
\put(   0.0059,   0.5859){\line(1,0){   0.0059}}
% line    0.4297 from    0.0100 to    0.0200 
\put(   0.0059,   0.5703){\line(1,0){   0.0059}}
% 01110 at    0.0100,   0.4531
\put(   0.0387,   0.5469){\makebox(0,0)[l]{\tt01110}}
% line    0.4688 from    0.0100 to    0.1500 
\put(   0.0059,   0.5312){\line(1,0){   0.0828}}
% 01111 at    0.0100,   0.4844
\put(   0.0387,   0.5156){\makebox(0,0)[l]{\tt01111}}
% line    0.4531 from    0.0100 to    0.0400 
\put(   0.0059,   0.5469){\line(1,0){   0.0177}}
% line    0.4453 from    0.0100 to    0.0200 
\put(   0.0059,   0.5547){\line(1,0){   0.0059}}
% line    0.4609 from    0.0100 to    0.0200 
\put(   0.0059,   0.5391){\line(1,0){   0.0059}}
% line    0.4844 from    0.0100 to    0.0400 
\put(   0.0059,   0.5156){\line(1,0){   0.0177}}
% line    0.4766 from    0.0100 to    0.0200 
\put(   0.0059,   0.5234){\line(1,0){   0.0059}}
% line    0.4922 from    0.0100 to    0.0200 
\put(   0.0059,   0.5078){\line(1,0){   0.0059}}
% 10 at    0.0100,   0.6250
\put(   0.2397,   0.3750){\makebox(0,0)[l]{\tt10}}
% line    0.7500 from    0.0100 to    0.4500 
\put(   0.0059,   0.2500){\line(1,0){   0.2602}}
% 11 at    0.0100,   0.8750
\put(   0.2397,   0.1250){\makebox(0,0)[l]{\tt11}}
% 100 at    0.0100,   0.5625
\put(   0.1806,   0.4375){\makebox(0,0)[l]{\tt100}}
% line    0.6250 from    0.0100 to    0.3800 
\put(   0.0059,   0.3750){\line(1,0){   0.2188}}
% 101 at    0.0100,   0.6875
\put(   0.1806,   0.3125){\makebox(0,0)[l]{\tt101}}
% 1000 at    0.0100,   0.5312
\put(   0.1207,   0.4688){\makebox(0,0)[l]{\tt1000}}
% line    0.5625 from    0.0100 to    0.2800 
\put(   0.0059,   0.4375){\line(1,0){   0.1597}}
% 1001 at    0.0100,   0.5938
\put(   0.1207,   0.4062){\makebox(0,0)[l]{\tt1001}}
% 10000 at    0.0100,   0.5156
\put(   0.0387,   0.4844){\makebox(0,0)[l]{\tt10000}}
% line    0.5312 from    0.0100 to    0.1500 
\put(   0.0059,   0.4688){\line(1,0){   0.0828}}
% 10001 at    0.0100,   0.5469
\put(   0.0387,   0.4531){\makebox(0,0)[l]{\tt10001}}
% line    0.5156 from    0.0100 to    0.0400 
\put(   0.0059,   0.4844){\line(1,0){   0.0177}}
% line    0.5078 from    0.0100 to    0.0200 
\put(   0.0059,   0.4922){\line(1,0){   0.0059}}
% line    0.5234 from    0.0100 to    0.0200 
\put(   0.0059,   0.4766){\line(1,0){   0.0059}}
% line    0.5469 from    0.0100 to    0.0400 
\put(   0.0059,   0.4531){\line(1,0){   0.0177}}
% line    0.5391 from    0.0100 to    0.0200 
\put(   0.0059,   0.4609){\line(1,0){   0.0059}}
% line    0.5547 from    0.0100 to    0.0200 
\put(   0.0059,   0.4453){\line(1,0){   0.0059}}
% 10010 at    0.0100,   0.5781
\put(   0.0387,   0.4219){\makebox(0,0)[l]{\tt10010}}
% line    0.5938 from    0.0100 to    0.1500 
\put(   0.0059,   0.4062){\line(1,0){   0.0828}}
% 10011 at    0.0100,   0.6094
\put(   0.0387,   0.3906){\makebox(0,0)[l]{\tt10011}}
% line    0.5781 from    0.0100 to    0.0400 
\put(   0.0059,   0.4219){\line(1,0){   0.0177}}
% line    0.5703 from    0.0100 to    0.0200 
\put(   0.0059,   0.4297){\line(1,0){   0.0059}}
% line    0.5859 from    0.0100 to    0.0200 
\put(   0.0059,   0.4141){\line(1,0){   0.0059}}
% line    0.6094 from    0.0100 to    0.0400 
\put(   0.0059,   0.3906){\line(1,0){   0.0177}}
% line    0.6016 from    0.0100 to    0.0200 
\put(   0.0059,   0.3984){\line(1,0){   0.0059}}
% line    0.6172 from    0.0100 to    0.0200 
\put(   0.0059,   0.3828){\line(1,0){   0.0059}}
% 1010 at    0.0100,   0.6562
\put(   0.1207,   0.3438){\makebox(0,0)[l]{\tt1010}}
% line    0.6875 from    0.0100 to    0.2800 
\put(   0.0059,   0.3125){\line(1,0){   0.1597}}
% 1011 at    0.0100,   0.7188
\put(   0.1207,   0.2812){\makebox(0,0)[l]{\tt1011}}
% 10100 at    0.0100,   0.6406
\put(   0.0387,   0.3594){\makebox(0,0)[l]{\tt10100}}
% line    0.6562 from    0.0100 to    0.1500 
\put(   0.0059,   0.3438){\line(1,0){   0.0828}}
% 10101 at    0.0100,   0.6719
\put(   0.0387,   0.3281){\makebox(0,0)[l]{\tt10101}}
% line    0.6406 from    0.0100 to    0.0400 
\put(   0.0059,   0.3594){\line(1,0){   0.0177}}
% line    0.6328 from    0.0100 to    0.0200 
\put(   0.0059,   0.3672){\line(1,0){   0.0059}}
% line    0.6484 from    0.0100 to    0.0200 
\put(   0.0059,   0.3516){\line(1,0){   0.0059}}
% line    0.6719 from    0.0100 to    0.0400 
\put(   0.0059,   0.3281){\line(1,0){   0.0177}}
% line    0.6641 from    0.0100 to    0.0200 
\put(   0.0059,   0.3359){\line(1,0){   0.0059}}
% line    0.6797 from    0.0100 to    0.0200 
\put(   0.0059,   0.3203){\line(1,0){   0.0059}}
% 10110 at    0.0100,   0.7031
\put(   0.0387,   0.2969){\makebox(0,0)[l]{\tt10110}}
% line    0.7188 from    0.0100 to    0.1500 
\put(   0.0059,   0.2812){\line(1,0){   0.0828}}
% 10111 at    0.0100,   0.7344
\put(   0.0387,   0.2656){\makebox(0,0)[l]{\tt10111}}
% line    0.7031 from    0.0100 to    0.0400 
\put(   0.0059,   0.2969){\line(1,0){   0.0177}}
% line    0.6953 from    0.0100 to    0.0200 
\put(   0.0059,   0.3047){\line(1,0){   0.0059}}
% line    0.7109 from    0.0100 to    0.0200 
\put(   0.0059,   0.2891){\line(1,0){   0.0059}}
% line    0.7344 from    0.0100 to    0.0400 
\put(   0.0059,   0.2656){\line(1,0){   0.0177}}
% line    0.7266 from    0.0100 to    0.0200 
\put(   0.0059,   0.2734){\line(1,0){   0.0059}}
% line    0.7422 from    0.0100 to    0.0200 
\put(   0.0059,   0.2578){\line(1,0){   0.0059}}
% 110 at    0.0100,   0.8125
\put(   0.1806,   0.1875){\makebox(0,0)[l]{\tt110}}
% line    0.8750 from    0.0100 to    0.3800 
\put(   0.0059,   0.1250){\line(1,0){   0.2188}}
% 111 at    0.0100,   0.9375
\put(   0.1806,   0.0625){\makebox(0,0)[l]{\tt111}}
% 1100 at    0.0100,   0.7812
\put(   0.1207,   0.2188){\makebox(0,0)[l]{\tt1100}}
% line    0.8125 from    0.0100 to    0.2800 
\put(   0.0059,   0.1875){\line(1,0){   0.1597}}
% 1101 at    0.0100,   0.8438
\put(   0.1207,   0.1562){\makebox(0,0)[l]{\tt1101}}
% 11000 at    0.0100,   0.7656
\put(   0.0387,   0.2344){\makebox(0,0)[l]{\tt11000}}
% line    0.7812 from    0.0100 to    0.1500 
\put(   0.0059,   0.2188){\line(1,0){   0.0828}}
% 11001 at    0.0100,   0.7969
\put(   0.0387,   0.2031){\makebox(0,0)[l]{\tt11001}}
% line    0.7656 from    0.0100 to    0.0400 
\put(   0.0059,   0.2344){\line(1,0){   0.0177}}
% line    0.7578 from    0.0100 to    0.0200 
\put(   0.0059,   0.2422){\line(1,0){   0.0059}}
% line    0.7734 from    0.0100 to    0.0200 
\put(   0.0059,   0.2266){\line(1,0){   0.0059}}
% line    0.7969 from    0.0100 to    0.0400 
\put(   0.0059,   0.2031){\line(1,0){   0.0177}}
% line    0.7891 from    0.0100 to    0.0200 
\put(   0.0059,   0.2109){\line(1,0){   0.0059}}
% line    0.8047 from    0.0100 to    0.0200 
\put(   0.0059,   0.1953){\line(1,0){   0.0059}}
% 11010 at    0.0100,   0.8281
\put(   0.0387,   0.1719){\makebox(0,0)[l]{\tt11010}}
% line    0.8438 from    0.0100 to    0.1500 
\put(   0.0059,   0.1562){\line(1,0){   0.0828}}
% 11011 at    0.0100,   0.8594
\put(   0.0387,   0.1406){\makebox(0,0)[l]{\tt11011}}
% line    0.8281 from    0.0100 to    0.0400 
\put(   0.0059,   0.1719){\line(1,0){   0.0177}}
% line    0.8203 from    0.0100 to    0.0200 
\put(   0.0059,   0.1797){\line(1,0){   0.0059}}
% line    0.8359 from    0.0100 to    0.0200 
\put(   0.0059,   0.1641){\line(1,0){   0.0059}}
% line    0.8594 from    0.0100 to    0.0400 
\put(   0.0059,   0.1406){\line(1,0){   0.0177}}
% line    0.8516 from    0.0100 to    0.0200 
\put(   0.0059,   0.1484){\line(1,0){   0.0059}}
% line    0.8672 from    0.0100 to    0.0200 
\put(   0.0059,   0.1328){\line(1,0){   0.0059}}
% 1110 at    0.0100,   0.9062
\put(   0.1207,   0.0938){\makebox(0,0)[l]{\tt1110}}
% line    0.9375 from    0.0100 to    0.2800 
\put(   0.0059,   0.0625){\line(1,0){   0.1597}}
% 1111 at    0.0100,   0.9688
\put(   0.1207,   0.0312){\makebox(0,0)[l]{\tt1111}}
% 11100 at    0.0100,   0.8906
\put(   0.0387,   0.1094){\makebox(0,0)[l]{\tt11100}}
% line    0.9062 from    0.0100 to    0.1500 
\put(   0.0059,   0.0938){\line(1,0){   0.0828}}
% 11101 at    0.0100,   0.9219
\put(   0.0387,   0.0781){\makebox(0,0)[l]{\tt11101}}
% line    0.8906 from    0.0100 to    0.0400 
\put(   0.0059,   0.1094){\line(1,0){   0.0177}}
% line    0.8828 from    0.0100 to    0.0200 
\put(   0.0059,   0.1172){\line(1,0){   0.0059}}
% line    0.8984 from    0.0100 to    0.0200 
\put(   0.0059,   0.1016){\line(1,0){   0.0059}}
% line    0.9219 from    0.0100 to    0.0400 
\put(   0.0059,   0.0781){\line(1,0){   0.0177}}
% line    0.9141 from    0.0100 to    0.0200 
\put(   0.0059,   0.0859){\line(1,0){   0.0059}}
% line    0.9297 from    0.0100 to    0.0200 
\put(   0.0059,   0.0703){\line(1,0){   0.0059}}
% 11110 at    0.0100,   0.9531
\put(   0.0387,   0.0469){\makebox(0,0)[l]{\tt11110}}
% line    0.9688 from    0.0100 to    0.1500 
\put(   0.0059,   0.0312){\line(1,0){   0.0828}}
% 11111 at    0.0100,   0.9844
\put(   0.0387,   0.0156){\makebox(0,0)[l]{\tt11111}}
% line    0.9531 from    0.0100 to    0.0400 
\put(   0.0059,   0.0469){\line(1,0){   0.0177}}
% line    0.9453 from    0.0100 to    0.0200 
\put(   0.0059,   0.0547){\line(1,0){   0.0059}}
% line    0.9609 from    0.0100 to    0.0200 
\put(   0.0059,   0.0391){\line(1,0){   0.0059}}
% line    0.9844 from    0.0100 to    0.0400 
\put(   0.0059,   0.0156){\line(1,0){   0.0177}}
% line    0.9766 from    0.0100 to    0.0200 
\put(   0.0059,   0.0234){\line(1,0){   0.0059}}
% line    0.9922 from    0.0100 to    0.0200 
\put(   0.0059,   0.0078){\line(1,0){   0.0059}}
\end{picture}

\hspace{-0.04in}% was -.25
\raisebox{1.1895in}{% was 1.425
\setlength{\unitlength}{33.39in}
%\setlength{\unitlength}{40in}
\begin{picture}(0.085,0.04)(-0.0425,0.37)
\thinlines
% 
% wings added by hand
\put(  -0.0408 ,   0.4082){\line(-1,-3){   0.005}}
\put(  -0.0408 ,   0.3730){\line(-1,3){   0.005}}
%
% arrow identifying the final interval added by hand
% the center of the interval is 0010 below this point
% 10011110  (0.3809)
% 0.0017 is the length of the stubby lines
%
% want vector's tip to end at height 0.37995 and x=0.0010
% 4*34 = 136 -> 36635
% this was perfectly positioned
%\put(   0.0040,   0.36635){\makebox(0,0)[tl]{\tt100111101}}
%\put(   0.0044,   0.36635){\vector(-1,4){0.0034}}
% but I shifted it to this for arty reasons
\put(   0.0048,   0.36635){\makebox(0,0)[tl]{\tt100111101}}
\put(   0.0052,   0.36635){\vector(-1,4){0.0034}}
%
% line    0.5966 from   -0.4800 to    0.0000 
\put(  -0.0408,   0.4034){\line(1,0){   0.0408}}
% bbba at   -0.2800,   0.6096
\put(  -0.0252,   0.3904){\makebox(0,0)[r]{\tt{bbba}}}
% line    0.6227 from   -0.4200 to    0.0000 
\put(  -0.0357,   0.3773){\line(1,0){   0.0357}}
% bbbaa at   -0.1000,   0.6003
\put(  -0.0099,   0.3997){\makebox(0,0)[r]{\tt{bbbaa}}}
% line    0.6040 from   -0.2800 to    0.0000 
\put(  -0.0238,   0.3960){\line(1,0){   0.0238}}
% bbbab at   -0.1000,   0.6114
\put(  -0.0099,   0.3886){\makebox(0,0)[r]{\tt{bbbab}}}
% line    0.6188 from   -0.2800 to    0.0000 
\put(  -0.0238,   0.3812){\line(1,0){   0.0238}}
% bbba\eof at   -0.1000,   0.6207
\put(  -0.0099,   0.3793){\makebox(0,0)[r]{\tt{bbba\teof}}}
% line    0.6250 from    0.0100 to    0.4900 
\put(   0.0008,   0.3750){\line(1,0){   0.0408}}
% line    0.5938 from    0.0100 to    0.4200 
\put(   0.0008,   0.4062){\line(1,0){   0.0348}}
% 10011 at    0.0100,   0.6094
\put(   0.0299,   0.3906){\makebox(0,0)[l]{\tt10011}} % moved left a bit, was.0329
% 10010111 at    0.0100,   0.5918
\put(   0.0040,   0.4082){\makebox(0,0)[l]{\tt10010111}}
% line    0.5918 from    0.0100 to    0.0300 
\put(   0.0008,   0.4082){\line(1,0){   0.0017}}
% line    0.6094 from    0.0100 to    0.3700 % shortened, was .0306
\put(   0.0008,   0.3906){\line(1,0){   0.0276}}
% line    0.6016 from    0.0100 to    0.3000 
\put(   0.0008,   0.3984){\line(1,0){   0.0246}}
% 10011000 at    0.0100,   0.5957
\put(   0.0040,   0.4043){\makebox(0,0)[l]{\tt10011000}}
% line    0.5977 from    0.0100 to    0.2100 
\put(   0.0008,   0.4023){\line(1,0){   0.0170}}
% 10011001 at    0.0100,   0.5996
\put(   0.0040,   0.4004){\makebox(0,0)[l]{\tt10011001}}
% line    0.5957 from    0.0100 to    0.0300 
\put(   0.0008,   0.4043){\line(1,0){   0.0017}}
% line    0.5996 from    0.0100 to    0.0300 
\put(   0.0008,   0.4004){\line(1,0){   0.0017}}
% 10011010 at    0.0100,   0.6035
\put(   0.0040,   0.3965){\makebox(0,0)[l]{\tt10011010}}
% line    0.6055 from    0.0100 to    0.2100 
\put(   0.0008,   0.3945){\line(1,0){   0.0170}}
% 10011011 at    0.0100,   0.6074
\put(   0.0040,   0.3926){\makebox(0,0)[l]{\tt10011011}}
% line    0.6035 from    0.0100 to    0.0300 
\put(   0.0008,   0.3965){\line(1,0){   0.0017}}
% line    0.6074 from    0.0100 to    0.0300 
\put(   0.0008,   0.3926){\line(1,0){   0.0017}}
% line    0.6172 from    0.0100 to    0.3000 
\put(   0.0008,   0.3828){\line(1,0){   0.0246}}
% 10011100 at    0.0100,   0.6113
\put(   0.0040,   0.3887){\makebox(0,0)[l]{\tt10011100}}
% line    0.6133 from    0.0100 to    0.2100 
\put(   0.0008,   0.3867){\line(1,0){   0.0170}}
% 10011101 at    0.0100,   0.6152
\put(   0.0040,   0.3848){\makebox(0,0)[l]{\tt10011101}}
% line    0.6113 from    0.0100 to    0.0300 
\put(   0.0008,   0.3887){\line(1,0){   0.0017}}
% line    0.6152 from    0.0100 to    0.0300 
\put(   0.0008,   0.3848){\line(1,0){   0.0017}}
% 10011110 at    0.0100,   0.6191
\put(   0.0040,   0.3809){\makebox(0,0)[l]{\tt10011110}}
% line    0.6211 from    0.0100 to    0.2100 
\put(   0.0008,   0.3789){\line(1,0){   0.0170}}
% 10011111 at    0.0100,   0.6230
\put(   0.0040,   0.3770){\makebox(0,0)[l]{\tt10011111}}
% line    0.6191 from    0.0100 to    0.0300 
\put(   0.0008,   0.3809){\line(1,0){   0.0017}}
% line    0.6230 from    0.0100 to    0.0300 
\put(   0.0008,   0.3770){\line(1,0){   0.0017}}
% 10100000 at    0.0100,   0.6270
\put(   0.0040,   0.3730){\makebox(0,0)[l]{\tt10100000}}
% line    0.6289 from    0.0100 to    0.2100 
\put(   0.0008,   0.3711){\line(1,0){   0.0170}}
% line    0.6270 from    0.0100 to    0.0300 
\put(   0.0008,   0.3730){\line(1,0){   0.0017}}
\end{picture}

}
}
\end{center}
}{%
\caption[a]{Illustration of the arithmetic coding
 process as the sequence {$\tt bbba\eof$}  is
 transmitted.}
\label{fig.ac}
}%
\end{figure}

 When the first symbol `$\tb$' is observed, the encoder
 knows that the encoded string will start `{\tt{01}}', 
 `{\tt{10}}', or `{\tt{11}}',
 but does not know  which. The encoder writes nothing
 for the time being, and 
 examines the next symbol, which is `$\tb$'. 
 The interval `$\tt bb$' lies wholly within interval `{\tt{1}}', so 
 the encoder can write the first bit: `{\tt{1}}'. 
 The third symbol `$\tt b$' narrows down the interval 
 a little, but not quite enough for it to lie 
 wholly within interval `{\tt{10}}'. Only when the next `$\tt a$' 
 is read from the source can we transmit some more
 bits. Interval `$\tt bbba$' lies wholly within the interval `{\tt{1001}}', 
 so the encoder adds `{\tt{001}}'
 to the `{\tt{1}}' it has written.  Finally, when the `$\eof$'
 arrives, we need a procedure for terminating the encoding.
 Magnifying the interval `$\tt bbba\eof$' (\figref{fig.ac}, right)
 we note that the marked interval `{\tt{100111101}}'
 is wholly contained by $\tt bbba\eof$, so the encoding can be completed by 
 appending `{\tt{11101}}'.
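
 These containment statements can be checked numerically. The values below are
 rounded to four decimal places and assume the model that was used to generate
 \figref{fig.ac}; each binary string is identified with the interval of real
 numbers in $[0,1)$ whose binary expansion starts with that string.
\beq
\begin{array}{lcl}
 \mbox{`$\tb\tb$'} = [0.5454, 0.7862) & \subset & \mbox{`{\tt{1}}'} = [0.5000, 1.0000) , \\
 \mbox{`$\tb\tb\tb\ta$'} = [0.5966, 0.6227) & \subset & \mbox{`{\tt{1001}}'} = [0.5625, 0.6250) , \\
 \mbox{`{\tt{100111101}}'} = [0.6191, 0.6211) & \subset & \mbox{`$\tb\tb\tb\ta\eof$'} = [0.6188, 0.6227) .
\end{array}
\eeq
 The interval `$\tb\tb\tb$' $= [0.5966, 0.7501)$, by contrast, just overshoots the
 upper end of `{\tt{10}}' $= [0.5000, 0.7500)$, which is why no bit can be
 written after the third symbol.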
\exercissxA{2}{ex.ac.terminate}{
 Show that the overhead required to terminate a message 
 is never more than 2 bits, relative to the ideal message length  given the 
 probabilistic model $\H$, 
  $h(\bx \given \H) = \log [ 1/ P(\bx \given \H)]$.
}
% \begin{center}
% % created by ac.p sub=1 unit=40 only_show_data=1 > ac/ac_sub_data.tex 
% \input{figs/ac/ac_sub_data.tex}  
% \end{center}
 This is an important result. Arithmetic coding is 
 very nearly optimal. The  message length is always
 within two bits of the \ind{Shannon information content}\index{information content} of the entire
 source string, 
 so the expected message length is within two bits of the 
 entropy of the entire message. 

\subsubsection{Decoding\subsubpunc}
 The decoder receives the string `{\tt{100111101}}'
 and passes along it one 
 symbol at a time. First, the probabilities $P(\ta), P(\tb), P(\eof)$ are computed 
 using the identical program that the encoder used and the intervals 
 `$\ta$', `$\tb$' and `$\eof$' are deduced. Once the first two  
 bits `{\tt{10}}' have been examined, it is certain that the original string 
 must have started with a `$\tb$', since the interval `{\tt{10}}' lies wholly within 
 interval `$\tb$'. The decoder can then use the model to compute $P(\ta \given \tb), 
 P(\tb \given \tb), P(\eof \given \tb)$ and deduce the boundaries of the intervals 
 `$\tb\ta$', `$\tb\tb$' and `$\tb\eof$'. 
 Continuing, we decode the second $\tb$ once we reach `{\tt{1001}}', 
 the third $\tb$ once we reach `{\tt{100111}}', and so forth, with the 
 unambiguous identification of `$\tb\tb\tb\ta\eof$' once the whole binary
 string has been 
 read. With the convention that `$\eof$' denotes the end of the message, the decoder
 knows to stop decoding.
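
 As a check on these decoding steps, the boundaries can be written out
 explicitly. The numbers below are rounded to four decimal places and assume
 the model that was used to generate \figref{fig.ac}; each binary string is
 identified with the interval of real numbers in $[0,1)$ whose binary expansion
 starts with that string.
\beq
\begin{array}{lcl}
 \mbox{`{\tt{10}}'} = [0.5000, 0.7500) & \subset & \mbox{`$\tb$'} = [0.4250, 0.8500) , \\
 \mbox{`{\tt{1001}}'} = [0.5625, 0.6250) & \subset & \mbox{`$\tb\tb$'} = [0.5454, 0.7862) , \\
 \mbox{`{\tt{100111}}'} = [0.6094, 0.6250) & \subset & \mbox{`$\tb\tb\tb$'} = [0.5966, 0.7501) .
\end{array}
\eeq
 The shorter prefix `{\tt{10011}}' $= [0.5938, 0.6250)$ is not sufficient to
 decode the third $\tb$, because its lower end lies below 0.5966.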

\subsubsection{Transmission of multiple files\subsubpunc}
 How might one use  arithmetic coding to communicate 
 several distinct files over the binary channel? 
 Once the $\eof$ character has been 
 transmitted, we imagine that the decoder is 
 reset into its initial state.  There is no 
 transfer of the learnt statistics of the first file 
 to the second file.
% We start a fresh arithmetic code. 
 If, however, we did believe that there is a relationship 
 among the files that we are going to compress, we could 
 define our alphabet differently, introducing 
 a second end-of-file character that
 marks the end of the file but instructs 
 the encoder and decoder to continue using the 
 same probabilistic model.
% If we went this route, 
% we would  only  be able to uncompress the second file 
% after  first uncompressing the first file.
 

\subsection{The big picture}
 Notice that to communicate a string of $N$ letters
% coming from an alphabet of size $|\A| = I$
 both the encoder and the decoder  needed to compute  only $N|\A|$
 conditional probabilities -- the probabilities of each possible letter 
 in each context  actually encountered -- just 
 as in the guessing game.\index{guessing game} This cost can be contrasted with the alternative 
 of using a Huffman code\index{Huffman code!disadvantages}
 with a large block size (in order to 
 reduce the possible one-bit-per-symbol overhead discussed in
% the previous  chapter
 section \ref{sec.huffman.probs}), where {\em all\/} block sequences that could
 occur
% be encoded 
% in a block
 must be considered and their probabilities evaluated. 
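
 To give a rough sense of scale (an illustrative count, not a figure quoted
 from elsewhere in this chapter): with an alphabet of size $|\A| = 3$ and a
 string of $N = 5$ letters, as in the example of \figref{fig.ac}, the
 arithmetic coder evaluates
\beq
 N |\A| = 5 \times 3 = 15
\eeq
 conditional probabilities, whereas a block code treating the whole string as
 one block would have to consider up to
\beq
 |\A|^N = 3^5 = 243
\eeq
 sequences, ignoring here the constraint that $\eof$ can appear only at the end.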


 Notice how flexible arithmetic coding is: 
 it can be used with any source alphabet 
 and any  encoded  alphabet. The size of the source alphabet and the encoded
 alphabet can 
 change with
 time. Arithmetic coding can be used with any probability distribution, 
 which can change utterly from context to context.

 Furthermore, if we would like the  symbols of the encoding alphabet (say, {\tt 0} and {\tt 1})
 to be used with {\em unequal\/} frequency, that can easily be arranged by subdividing
 the right-hand interval in proportion to the required frequencies.

\subsection{How the probabilistic model might make its predictions}
 The technique of arithmetic coding does not force one to 
 produce the predictive probability in any particular way, but 
 the predictive distributions might naturally be produced by a
 Bayesian model.

 \Figref{fig.ac} was generated using a
 simple model that always assigns a probability of 0.15 to $\eof$,
 and assigns  the remaining 0.85 to $\ta$ and $\tb$, divided
 in proportion to probabilities given by Laplace's
 rule, 
\beq
	P_{\rm L}(\ta \given x_1,\ldots,x_{n-1})=\frac{F_{\ta}+1}{F_{\ta}+F_{\tb}+2} ,
\label{eq.laplaceagain}
\eeq
 where
 $F_{\ta}(x_1,\ldots,x_{n-1})$ is the number of times that $\ta$ has
 occurred so far, and  $F_{\tb}$ is the count of $\tb$s.
 These predictions correspond to a simple 
 Bayesian model that expects and adapts to
% is able to learn
 a non-equal frequency 
 of use of the source symbols $\ta$ and $\tb$ within a file. 
% The end result will be  an encoder that can adapt  to a nonuniform source.
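
 As a worked illustration, assume that $\tb$ is treated symmetrically to $\ta$
 in equation (\ref{eq.laplaceagain}), so that
 $P_{\rm L}(\tb \given x_1,\ldots,x_{n-1}) = (F_{\tb}+1)/(F_{\ta}+F_{\tb}+2)$.
 The string $\tb\tb\tb\ta\eof$ of \figref{fig.ac} then receives probability
\beq
\begin{array}{rcl}
 P(\tb\tb\tb\ta\eof \given \H) & = & \left(0.85 \times \frac{1}{2}\right) \left(0.85 \times \frac{2}{3}\right)
   \left(0.85 \times \frac{3}{4}\right) \left(0.85 \times \frac{1}{5}\right) \times 0.15 \\
 & \simeq & 0.0039 ,
\end{array}
\eeq
 so its ideal message length is $\log_2 [ 1/0.0039 ] \simeq 8.0$ bits.
 The nine-bit encoding `{\tt{100111101}}' found earlier exceeds this by about one bit.
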
 
 \Figref{fig.ac2}  displays the  intervals corresponding to 
 a  number of strings of length up to five. Note that if the string so far 
 has contained a large number of $\tb$s then the probability of 
 $\tb$ relative to $\ta$ is increased, and conversely if many $\ta$s 
 occur then $\ta$s are made more probable. Larger intervals, remember, 
 require fewer bits to encode. 
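
 For instance, under the model of equation (\ref{eq.laplaceagain}), with
 values rounded and computed here rather than read off the figure, the interval
 `$\tb\tb\tb\tb$' has width 0.1044 and the interval `$\tb\tb\tb\ta$' has width
 0.0261, one quarter as large, because after three $\tb$s the rule favours
 $\tb$ over $\ta$ by four to one. The corresponding information contents are
\beq
 \log_2 \frac{1}{0.1044} \simeq 3.3 \:\mbox{bits}
 \hspace{0.3in} \mbox{and} \hspace{0.3in}
 \log_2 \frac{1}{0.0261} \simeq 5.3 \:\mbox{bits} ,
\eeq
 so the more probable continuation costs two bits fewer.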
% 

 

\begin{figure}[tbp]
\figuremargin{%
\begin{center}
% created by ac.p only_show_data=1 > ac/ac_data.tex 
\mbox{
\setlength{\unitlength}{5.75in}
\begin{picture}(0.59130434782608698452,1)(-0.29565217391304349226,0)
\thinlines
% line    0.0000 from   -0.5000 to    0.0000 
\put(  -0.2957,   1.0000){\line(1,0){   0.2957}}
% a at   -0.4500,   0.2125
\put(  -0.2811,   0.7875){\makebox(0,0)[r]{\tt{a}}}
% line    0.4250 from   -0.5000 to    0.0000 
\put(  -0.2957,   0.5750){\line(1,0){   0.2957}}
% b at   -0.4500,   0.6375
\put(  -0.2811,   0.3625){\makebox(0,0)[r]{\tt{b}}}
% line    0.8500 from   -0.5000 to    0.0000 
\put(  -0.2957,   0.1500){\line(1,0){   0.2957}}
% \eof at   -0.4500,   0.9250
\put(  -0.2811,   0.0750){\makebox(0,0)[r]{\tt{\teof}}}
% line    1.0000 from   -0.5000 to    0.0000 
\put(  -0.2957,   0.0000){\line(1,0){   0.2957}}
% aa at   -0.3500,   0.1204
\put(  -0.2220,   0.8796){\makebox(0,0)[r]{\tt{aa}}}
% line    0.2408 from   -0.4500 to    0.0000 
\put(  -0.2661,   0.7592){\line(1,0){   0.2661}}
% ab at   -0.3500,   0.3010
\put(  -0.2220,   0.6990){\makebox(0,0)[r]{\tt{ab}}}
% line    0.3612 from   -0.4500 to    0.0000 
\put(  -0.2661,   0.6388){\line(1,0){   0.2661}}
% a\eof at   -0.3500,   0.3931
\put(  -0.2220,   0.6069){\makebox(0,0)[r]{\tt{a\teof}}}
% aaa at   -0.2300,   0.0768
\put(  -0.1510,   0.9232){\makebox(0,0)[r]{\tt{aaa}}}
% line    0.1535 from   -0.3500 to    0.0000 
\put(  -0.2070,   0.8465){\line(1,0){   0.2070}}
% aab at   -0.2300,   0.1791
\put(  -0.1510,   0.8209){\makebox(0,0)[r]{\tt{aab}}}
% line    0.2047 from   -0.3500 to    0.0000 
\put(  -0.2070,   0.7953){\line(1,0){   0.2070}}
% aa\eof at   -0.2300,   0.2228
\put(  -0.1510,   0.7772){\makebox(0,0)[r]{\tt{aa\teof}}}
% aaaa at   -0.1000,   0.0522
\put(  -0.0741,   0.9478){\makebox(0,0)[r]{\tt{aaaa}}}
% line    0.1044 from   -0.2300 to    0.0000 
\put(  -0.1360,   0.8956){\line(1,0){   0.1360}}
% aaab at   -0.1000,   0.1175
\put(  -0.0741,   0.8825){\makebox(0,0)[r]{\tt{aaab}}}
% line    0.1305 from   -0.2300 to    0.0000 
\put(  -0.1360,   0.8695){\line(1,0){   0.1360}}
% line    0.0740 from   -0.1000 to    0.0000 
\put(  -0.0591,   0.9260){\line(1,0){   0.0591}}
% line    0.0887 from   -0.1000 to    0.0000 
\put(  -0.0591,   0.9113){\line(1,0){   0.0591}}
% line    0.1192 from   -0.1000 to    0.0000 
\put(  -0.0591,   0.8808){\line(1,0){   0.0591}}
% line    0.1266 from   -0.1000 to    0.0000 
\put(  -0.0591,   0.8734){\line(1,0){   0.0591}}
% aaba at   -0.1000,   0.1666
\put(  -0.0741,   0.8334){\makebox(0,0)[r]{\tt{aaba}}}
% line    0.1796 from   -0.2300 to    0.0000 
\put(  -0.1360,   0.8204){\line(1,0){   0.1360}}
% aabb at   -0.1000,   0.1883
\put(  -0.0741,   0.8117){\makebox(0,0)[r]{\tt{aabb}}}
% line    0.1970 from   -0.2300 to    0.0000 
\put(  -0.1360,   0.8030){\line(1,0){   0.1360}}
% line    0.1683 from   -0.1000 to    0.0000 
\put(  -0.0591,   0.8317){\line(1,0){   0.0591}}
% line    0.1757 from   -0.1000 to    0.0000 
\put(  -0.0591,   0.8243){\line(1,0){   0.0591}}
% line    0.1870 from   -0.1000 to    0.0000 
\put(  -0.0591,   0.8130){\line(1,0){   0.0591}}
% line    0.1944 from   -0.1000 to    0.0000 
\put(  -0.0591,   0.8056){\line(1,0){   0.0591}}
% aba at   -0.2300,   0.2664
\put(  -0.1510,   0.7336){\makebox(0,0)[r]{\tt{aba}}}
% line    0.2920 from   -0.3500 to    0.0000 
\put(  -0.2070,   0.7080){\line(1,0){   0.2070}}
% abb at   -0.2300,   0.3176
\put(  -0.1510,   0.6824){\makebox(0,0)[r]{\tt{abb}}}
% line    0.3432 from   -0.3500 to    0.0000 
\put(  -0.2070,   0.6568){\line(1,0){   0.2070}}
% ab\eof at   -0.2300,   0.3522
\put(  -0.1510,   0.6478){\makebox(0,0)[r]{\tt{ab\teof}}}
% abaa at   -0.1000,   0.2539
\put(  -0.0741,   0.7461){\makebox(0,0)[r]{\tt{abaa}}}
% line    0.2669 from   -0.2300 to    0.0000 
\put(  -0.1360,   0.7331){\line(1,0){   0.1360}}
% abab at   -0.1000,   0.2756
\put(  -0.0741,   0.7244){\makebox(0,0)[r]{\tt{abab}}}
% line    0.2843 from   -0.2300 to    0.0000 
\put(  -0.1360,   0.7157){\line(1,0){   0.1360}}
% line    0.2556 from   -0.1000 to    0.0000 
\put(  -0.0591,   0.7444){\line(1,0){   0.0591}}
% line    0.2630 from   -0.1000 to    0.0000 
\put(  -0.0591,   0.7370){\line(1,0){   0.0591}}
% line    0.2743 from   -0.1000 to    0.0000 
\put(  -0.0591,   0.7257){\line(1,0){   0.0591}}
% line    0.2817 from   -0.1000 to    0.0000 
\put(  -0.0591,   0.7183){\line(1,0){   0.0591}}
% abba at   -0.1000,   0.3007
\put(  -0.0741,   0.6993){\makebox(0,0)[r]{\tt{abba}}}
% line    0.3094 from   -0.2300 to    0.0000 
\put(  -0.1360,   0.6906){\line(1,0){   0.1360}}
% abbb at   -0.1000,   0.3225
\put(  -0.0741,   0.6775){\makebox(0,0)[r]{\tt{abbb}}}
% line    0.3355 from   -0.2300 to    0.0000 
\put(  -0.1360,   0.6645){\line(1,0){   0.1360}}
% line    0.2994 from   -0.1000 to    0.0000 
\put(  -0.0591,   0.7006){\line(1,0){   0.0591}}
% line    0.3068 from   -0.1000 to    0.0000 
\put(  -0.0591,   0.6932){\line(1,0){   0.0591}}
% line    0.3168 from   -0.1000 to    0.0000 
\put(  -0.0591,   0.6832){\line(1,0){   0.0591}}
% line    0.3316 from   -0.1000 to    0.0000 
\put(  -0.0591,   0.6684){\line(1,0){   0.0591}}
% ba at   -0.3500,   0.4852
\put(  -0.2220,   0.5148){\makebox(0,0)[r]{\tt{ba}}}
% line    0.5454 from   -0.4500 to    0.0000 
\put(  -0.2661,   0.4546){\line(1,0){   0.2661}}
% bb at   -0.3500,   0.6658
\put(  -0.2220,   0.3342){\makebox(0,0)[r]{\tt{bb}}}
% line    0.7862 from   -0.4500 to    0.0000 
\put(  -0.2661,   0.2138){\line(1,0){   0.2661}}
% b\eof at   -0.3500,   0.8181
\put(  -0.2220,   0.1819){\makebox(0,0)[r]{\tt{b\teof}}}
% baa at   -0.2300,   0.4506
\put(  -0.1510,   0.5494){\makebox(0,0)[r]{\tt{baa}}}
% line    0.4762 from   -0.3500 to    0.0000 
\put(  -0.2070,   0.5238){\line(1,0){   0.2070}}
% bab at   -0.2300,   0.5018
\put(  -0.1510,   0.4982){\makebox(0,0)[r]{\tt{bab}}}
% line    0.5274 from   -0.3500 to    0.0000 
\put(  -0.2070,   0.4726){\line(1,0){   0.2070}}
% ba\eof at   -0.2300,   0.5364
\put(  -0.1510,   0.4636){\makebox(0,0)[r]{\tt{ba\teof}}}
% baaa at   -0.1000,   0.4381
\put(  -0.0741,   0.5619){\makebox(0,0)[r]{\tt{baaa}}}
% line    0.4511 from   -0.2300 to    0.0000 
\put(  -0.1360,   0.5489){\line(1,0){   0.1360}}
% baab at   -0.1000,   0.4598
\put(  -0.0741,   0.5402){\makebox(0,0)[r]{\tt{baab}}}
% line    0.4685 from   -0.2300 to    0.0000 
\put(  -0.1360,   0.5315){\line(1,0){   0.1360}}
% line    0.4398 from   -0.1000 to    0.0000 
\put(  -0.0591,   0.5602){\line(1,0){   0.0591}}
% line    0.4472 from   -0.1000 to    0.0000 
\put(  -0.0591,   0.5528){\line(1,0){   0.0591}}
% line    0.4585 from   -0.1000 to    0.0000 
\put(  -0.0591,   0.5415){\line(1,0){   0.0591}}
% line    0.4659 from   -0.1000 to    0.0000 
\put(  -0.0591,   0.5341){\line(1,0){   0.0591}}
% baba at   -0.1000,   0.4849
\put(  -0.0741,   0.5151){\makebox(0,0)[r]{\tt{baba}}}
% line    0.4936 from   -0.2300 to    0.0000 
\put(  -0.1360,   0.5064){\line(1,0){   0.1360}}
% babb at   -0.1000,   0.5066
\put(  -0.0741,   0.4934){\makebox(0,0)[r]{\tt{babb}}}
% line    0.5197 from   -0.2300 to    0.0000 
\put(  -0.1360,   0.4803){\line(1,0){   0.1360}}
% line    0.4836 from   -0.1000 to    0.0000 
\put(  -0.0591,   0.5164){\line(1,0){   0.0591}}
% line    0.4910 from   -0.1000 to    0.0000 
\put(  -0.0591,   0.5090){\line(1,0){   0.0591}}
% line    0.5010 from   -0.1000 to    0.0000 
\put(  -0.0591,   0.4990){\line(1,0){   0.0591}}
% line    0.5158 from   -0.1000 to    0.0000 
\put(  -0.0591,   0.4842){\line(1,0){   0.0591}}
% bba at   -0.2300,   0.5710
\put(  -0.1510,   0.4290){\makebox(0,0)[r]{\tt{bba}}}
% line    0.5966 from   -0.3500 to    0.0000 
\put(  -0.2070,   0.4034){\line(1,0){   0.2070}}
% bbb at   -0.2300,   0.6734
\put(  -0.1510,   0.3266){\makebox(0,0)[r]{\tt{bbb}}}
% line    0.7501 from   -0.3500 to    0.0000 
\put(  -0.2070,   0.2499){\line(1,0){   0.2070}}
% bb\eof at   -0.2300,   0.7682
\put(  -0.1510,   0.2318){\makebox(0,0)[r]{\tt{bb\teof}}}
% bbaa at   -0.1000,   0.5541
\put(  -0.0741,   0.4459){\makebox(0,0)[r]{\tt{bbaa}}}
% line    0.5628 from   -0.2300 to    0.0000 
\put(  -0.1360,   0.4372){\line(1,0){   0.1360}}
% bbab at   -0.1000,   0.5759
\put(  -0.0741,   0.4241){\makebox(0,0)[r]{\tt{bbab}}}
% line    0.5889 from   -0.2300 to    0.0000 
\put(  -0.1360,   0.4111){\line(1,0){   0.1360}}
% line    0.5528 from   -0.1000 to    0.0000 
\put(  -0.0591,   0.4472){\line(1,0){   0.0591}}
% line    0.5602 from   -0.1000 to    0.0000 
\put(  -0.0591,   0.4398){\line(1,0){   0.0591}}
% line    0.5702 from   -0.1000 to    0.0000 
\put(  -0.0591,   0.4298){\line(1,0){   0.0591}}
% line    0.5850 from   -0.1000 to    0.0000 
\put(  -0.0591,   0.4150){\line(1,0){   0.0591}}
% bbba at   -0.1000,   0.6096
\put(  -0.0741,   0.3904){\makebox(0,0)[r]{\tt{bbba}}}
% line    0.6227 from   -0.2300 to    0.0000 
\put(  -0.1360,   0.3773){\line(1,0){   0.1360}}
% bbbb at   -0.1000,   0.6749
\put(  -0.0741,   0.3251){\makebox(0,0)[r]{\tt{bbbb}}}
% line    0.7271 from   -0.2300 to    0.0000 
\put(  -0.1360,   0.2729){\line(1,0){   0.1360}}
% line    0.6040 from   -0.1000 to    0.0000 
\put(  -0.0591,   0.3960){\line(1,0){   0.0591}}
% line    0.6188 from   -0.1000 to    0.0000 
\put(  -0.0591,   0.3812){\line(1,0){   0.0591}}
% line    0.6375 from   -0.1000 to    0.0000 
\put(  -0.0591,   0.3625){\line(1,0){   0.0591}}
% line    0.7114 from   -0.1000 to    0.0000 
\put(  -0.0591,   0.2886){\line(1,0){   0.0591}}
% line    0.0000 from    0.0100 to    0.5000 
\put(   0.0059,   1.0000){\line(1,0){   0.2897}}
% 0 at    0.0100,   0.2500
\put(   0.2811,   0.7500){\makebox(0,0)[l]{\tt0}}
% line    0.5000 from    0.0100 to    0.5000 
\put(   0.0059,   0.5000){\line(1,0){   0.2897}}
% 1 at    0.0100,   0.7500
\put(   0.2811,   0.2500){\makebox(0,0)[l]{\tt1}}
% line    1.0000 from    0.0100 to    0.5000 
\put(   0.0059,   0.0000){\line(1,0){   0.2897}}
% 00 at    0.0100,   0.1250
\put(   0.2397,   0.8750){\makebox(0,0)[l]{\tt00}}
% line    0.2500 from    0.0100 to    0.4500 
\put(   0.0059,   0.7500){\line(1,0){   0.2602}}
% 01 at    0.0100,   0.3750
\put(   0.2397,   0.6250){\makebox(0,0)[l]{\tt01}}
% 000 at    0.0100,   0.0625
\put(   0.1806,   0.9375){\makebox(0,0)[l]{\tt000}}
% line    0.1250 from    0.0100 to    0.3800 
\put(   0.0059,   0.8750){\line(1,0){   0.2188}}
% 001 at    0.0100,   0.1875
\put(   0.1806,   0.8125){\makebox(0,0)[l]{\tt001}}
% 0000 at    0.0100,   0.0312
\put(   0.1207,   0.9688){\makebox(0,0)[l]{\tt0000}}
% line    0.0625 from    0.0100 to    0.2800 
\put(   0.0059,   0.9375){\line(1,0){   0.1597}}
% 0001 at    0.0100,   0.0938
\put(   0.1207,   0.9062){\makebox(0,0)[l]{\tt0001}}
% 00000 at    0.0100,   0.0156
\put(   0.0387,   0.9844){\makebox(0,0)[l]{\tt00000}}
% line    0.0312 from    0.0100 to    0.1500 
\put(   0.0059,   0.9688){\line(1,0){   0.0828}}
% 00001 at    0.0100,   0.0469
\put(   0.0387,   0.9531){\makebox(0,0)[l]{\tt00001}}
% line    0.0156 from    0.0100 to    0.0400 
\put(   0.0059,   0.9844){\line(1,0){   0.0177}}
% line    0.0078 from    0.0100 to    0.0200 
\put(   0.0059,   0.9922){\line(1,0){   0.0059}}
% line    0.0234 from    0.0100 to    0.0200 
\put(   0.0059,   0.9766){\line(1,0){   0.0059}}
% line    0.0469 from    0.0100 to    0.0400 
\put(   0.0059,   0.9531){\line(1,0){   0.0177}}
% line    0.0391 from    0.0100 to    0.0200 
\put(   0.0059,   0.9609){\line(1,0){   0.0059}}
% line    0.0547 from    0.0100 to    0.0200 
\put(   0.0059,   0.9453){\line(1,0){   0.0059}}
% 00010 at    0.0100,   0.0781
\put(   0.0387,   0.9219){\makebox(0,0)[l]{\tt00010}}
% line    0.0938 from    0.0100 to    0.1500 
\put(   0.0059,   0.9062){\line(1,0){   0.0828}}
% 00011 at    0.0100,   0.1094
\put(   0.0387,   0.8906){\makebox(0,0)[l]{\tt00011}}
% line    0.0781 from    0.0100 to    0.0400 
\put(   0.0059,   0.9219){\line(1,0){   0.0177}}
% line    0.0703 from    0.0100 to    0.0200 
\put(   0.0059,   0.9297){\line(1,0){   0.0059}}
% line    0.0859 from    0.0100 to    0.0200 
\put(   0.0059,   0.9141){\line(1,0){   0.0059}}
% line    0.1094 from    0.0100 to    0.0400 
\put(   0.0059,   0.8906){\line(1,0){   0.0177}}
% line    0.1016 from    0.0100 to    0.0200 
\put(   0.0059,   0.8984){\line(1,0){   0.0059}}
% line    0.1172 from    0.0100 to    0.0200 
\put(   0.0059,   0.8828){\line(1,0){   0.0059}}
% 0010 at    0.0100,   0.1562
\put(   0.1207,   0.8438){\makebox(0,0)[l]{\tt0010}}
% line    0.1875 from    0.0100 to    0.2800 
\put(   0.0059,   0.8125){\line(1,0){   0.1597}}
% 0011 at    0.0100,   0.2188
\put(   0.1207,   0.7812){\makebox(0,0)[l]{\tt0011}}
% 00100 at    0.0100,   0.1406
\put(   0.0387,   0.8594){\makebox(0,0)[l]{\tt00100}}
% line    0.1562 from    0.0100 to    0.1500 
\put(   0.0059,   0.8438){\line(1,0){   0.0828}}
% 00101 at    0.0100,   0.1719
\put(   0.0387,   0.8281){\makebox(0,0)[l]{\tt00101}}
% line    0.1406 from    0.0100 to    0.0400 
\put(   0.0059,   0.8594){\line(1,0){   0.0177}}
% line    0.1328 from    0.0100 to    0.0200 
\put(   0.0059,   0.8672){\line(1,0){   0.0059}}
% line    0.1484 from    0.0100 to    0.0200 
\put(   0.0059,   0.8516){\line(1,0){   0.0059}}
% line    0.1719 from    0.0100 to    0.0400 
\put(   0.0059,   0.8281){\line(1,0){   0.0177}}
% line    0.1641 from    0.0100 to    0.0200 
\put(   0.0059,   0.8359){\line(1,0){   0.0059}}
% line    0.1797 from    0.0100 to    0.0200 
\put(   0.0059,   0.8203){\line(1,0){   0.0059}}
% 00110 at    0.0100,   0.2031
\put(   0.0387,   0.7969){\makebox(0,0)[l]{\tt00110}}
% line    0.2188 from    0.0100 to    0.1500 
\put(   0.0059,   0.7812){\line(1,0){   0.0828}}
% 00111 at    0.0100,   0.2344
\put(   0.0387,   0.7656){\makebox(0,0)[l]{\tt00111}}
% line    0.2031 from    0.0100 to    0.0400 
\put(   0.0059,   0.7969){\line(1,0){   0.0177}}
% line    0.1953 from    0.0100 to    0.0200 
\put(   0.0059,   0.8047){\line(1,0){   0.0059}}
% line    0.2109 from    0.0100 to    0.0200 
\put(   0.0059,   0.7891){\line(1,0){   0.0059}}
% line    0.2344 from    0.0100 to    0.0400 
\put(   0.0059,   0.7656){\line(1,0){   0.0177}}
% line    0.2266 from    0.0100 to    0.0200 
\put(   0.0059,   0.7734){\line(1,0){   0.0059}}
% line    0.2422 from    0.0100 to    0.0200 
\put(   0.0059,   0.7578){\line(1,0){   0.0059}}
% 010 at    0.0100,   0.3125
\put(   0.1806,   0.6875){\makebox(0,0)[l]{\tt010}}
% line    0.3750 from    0.0100 to    0.3800 
\put(   0.0059,   0.6250){\line(1,0){   0.2188}}
% 011 at    0.0100,   0.4375
\put(   0.1806,   0.5625){\makebox(0,0)[l]{\tt011}}
% 0100 at    0.0100,   0.2812
\put(   0.1207,   0.7188){\makebox(0,0)[l]{\tt0100}}
% line    0.3125 from    0.0100 to    0.2800 
\put(   0.0059,   0.6875){\line(1,0){   0.1597}}
% 0101 at    0.0100,   0.3438
\put(   0.1207,   0.6562){\makebox(0,0)[l]{\tt0101}}
% 01000 at    0.0100,   0.2656
\put(   0.0387,   0.7344){\makebox(0,0)[l]{\tt01000}}
% line    0.2812 from    0.0100 to    0.1500 
\put(   0.0059,   0.7188){\line(1,0){   0.0828}}
% 01001 at    0.0100,   0.2969
\put(   0.0387,   0.7031){\makebox(0,0)[l]{\tt01001}}
% line    0.2656 from    0.0100 to    0.0400 
\put(   0.0059,   0.7344){\line(1,0){   0.0177}}
% line    0.2578 from    0.0100 to    0.0200 
\put(   0.0059,   0.7422){\line(1,0){   0.0059}}
% line    0.2734 from    0.0100 to    0.0200 
\put(   0.0059,   0.7266){\line(1,0){   0.0059}}
% line    0.2969 from    0.0100 to    0.0400 
\put(   0.0059,   0.7031){\line(1,0){   0.0177}}
% line    0.2891 from    0.0100 to    0.0200 
\put(   0.0059,   0.7109){\line(1,0){   0.0059}}
% line    0.3047 from    0.0100 to    0.0200 
\put(   0.0059,   0.6953){\line(1,0){   0.0059}}
% 01010 at    0.0100,   0.3281
\put(   0.0387,   0.6719){\makebox(0,0)[l]{\tt01010}}
% line    0.3438 from    0.0100 to    0.1500 
\put(   0.0059,   0.6562){\line(1,0){   0.0828}}
% 01011 at    0.0100,   0.3594
\put(   0.0387,   0.6406){\makebox(0,0)[l]{\tt01011}}
% line    0.3281 from    0.0100 to    0.0400 
\put(   0.0059,   0.6719){\line(1,0){   0.0177}}
% line    0.3203 from    0.0100 to    0.0200 
\put(   0.0059,   0.6797){\line(1,0){   0.0059}}
% line    0.3359 from    0.0100 to    0.0200 
\put(   0.0059,   0.6641){\line(1,0){   0.0059}}
% line    0.3594 from    0.0100 to    0.0400 
\put(   0.0059,   0.6406){\line(1,0){   0.0177}}
% line    0.3516 from    0.0100 to    0.0200 
\put(   0.0059,   0.6484){\line(1,0){   0.0059}}
% line    0.3672 from    0.0100 to    0.0200 
\put(   0.0059,   0.6328){\line(1,0){   0.0059}}
% 0110 at    0.0100,   0.4062
\put(   0.1207,   0.5938){\makebox(0,0)[l]{\tt0110}}
% line    0.4375 from    0.0100 to    0.2800 
\put(   0.0059,   0.5625){\line(1,0){   0.1597}}
% 0111 at    0.0100,   0.4688
\put(   0.1207,   0.5312){\makebox(0,0)[l]{\tt0111}}
% 01100 at    0.0100,   0.3906
\put(   0.0387,   0.6094){\makebox(0,0)[l]{\tt01100}}
% line    0.4062 from    0.0100 to    0.1500 
\put(   0.0059,   0.5938){\line(1,0){   0.0828}}
% 01101 at    0.0100,   0.4219
\put(   0.0387,   0.5781){\makebox(0,0)[l]{\tt01101}}
% line    0.3906 from    0.0100 to    0.0400 
\put(   0.0059,   0.6094){\line(1,0){   0.0177}}
% line    0.3828 from    0.0100 to    0.0200 
\put(   0.0059,   0.6172){\line(1,0){   0.0059}}
% line    0.3984 from    0.0100 to    0.0200 
\put(   0.0059,   0.6016){\line(1,0){   0.0059}}
% line    0.4219 from    0.0100 to    0.0400 
\put(   0.0059,   0.5781){\line(1,0){   0.0177}}
% line    0.4141 from    0.0100 to    0.0200 
\put(   0.0059,   0.5859){\line(1,0){   0.0059}}
% line    0.4297 from    0.0100 to    0.0200 
\put(   0.0059,   0.5703){\line(1,0){   0.0059}}
% 01110 at    0.0100,   0.4531
\put(   0.0387,   0.5469){\makebox(0,0)[l]{\tt01110}}
% line    0.4688 from    0.0100 to    0.1500 
\put(   0.0059,   0.5312){\line(1,0){   0.0828}}
% 01111 at    0.0100,   0.4844
\put(   0.0387,   0.5156){\makebox(0,0)[l]{\tt01111}}
% line    0.4531 from    0.0100 to    0.0400 
\put(   0.0059,   0.5469){\line(1,0){   0.0177}}
% line    0.4453 from    0.0100 to    0.0200 
\put(   0.0059,   0.5547){\line(1,0){   0.0059}}
% line    0.4609 from    0.0100 to    0.0200 
\put(   0.0059,   0.5391){\line(1,0){   0.0059}}
% line    0.4844 from    0.0100 to    0.0400 
\put(   0.0059,   0.5156){\line(1,0){   0.0177}}
% line    0.4766 from    0.0100 to    0.0200 
\put(   0.0059,   0.5234){\line(1,0){   0.0059}}
% line    0.4922 from    0.0100 to    0.0200 
\put(   0.0059,   0.5078){\line(1,0){   0.0059}}
% 10 at    0.0100,   0.6250
\put(   0.2397,   0.3750){\makebox(0,0)[l]{\tt10}}
% line    0.7500 from    0.0100 to    0.4500 
\put(   0.0059,   0.2500){\line(1,0){   0.2602}}
% 11 at    0.0100,   0.8750
\put(   0.2397,   0.1250){\makebox(0,0)[l]{\tt11}}
% 100 at    0.0100,   0.5625
\put(   0.1806,   0.4375){\makebox(0,0)[l]{\tt100}}
% line    0.6250 from    0.0100 to    0.3800 
\put(   0.0059,   0.3750){\line(1,0){   0.2188}}
% 101 at    0.0100,   0.6875
\put(   0.1806,   0.3125){\makebox(0,0)[l]{\tt101}}
% 1000 at    0.0100,   0.5312
\put(   0.1207,   0.4688){\makebox(0,0)[l]{\tt1000}}
% line    0.5625 from    0.0100 to    0.2800 
\put(   0.0059,   0.4375){\line(1,0){   0.1597}}
% 1001 at    0.0100,   0.5938
\put(   0.1207,   0.4062){\makebox(0,0)[l]{\tt1001}}
% 10000 at    0.0100,   0.5156
\put(   0.0387,   0.4844){\makebox(0,0)[l]{\tt10000}}
% line    0.5312 from    0.0100 to    0.1500 
\put(   0.0059,   0.4688){\line(1,0){   0.0828}}
% 10001 at    0.0100,   0.5469
\put(   0.0387,   0.4531){\makebox(0,0)[l]{\tt10001}}
% line    0.5156 from    0.0100 to    0.0400 
\put(   0.0059,   0.4844){\line(1,0){   0.0177}}
% line    0.5078 from    0.0100 to    0.0200 
\put(   0.0059,   0.4922){\line(1,0){   0.0059}}
% line    0.5234 from    0.0100 to    0.0200 
\put(   0.0059,   0.4766){\line(1,0){   0.0059}}
% line    0.5469 from    0.0100 to    0.0400 
\put(   0.0059,   0.4531){\line(1,0){   0.0177}}
% line    0.5391 from    0.0100 to    0.0200 
\put(   0.0059,   0.4609){\line(1,0){   0.0059}}
% line    0.5547 from    0.0100 to    0.0200 
\put(   0.0059,   0.4453){\line(1,0){   0.0059}}
% 10010 at    0.0100,   0.5781
\put(   0.0387,   0.4219){\makebox(0,0)[l]{\tt10010}}
% line    0.5938 from    0.0100 to    0.1500 
\put(   0.0059,   0.4062){\line(1,0){   0.0828}}
% 10011 at    0.0100,   0.6094
\put(   0.0387,   0.3906){\makebox(0,0)[l]{\tt10011}}
% line    0.5781 from    0.0100 to    0.0400 
\put(   0.0059,   0.4219){\line(1,0){   0.0177}}
% line    0.5703 from    0.0100 to    0.0200 
\put(   0.0059,   0.4297){\line(1,0){   0.0059}}
% line    0.5859 from    0.0100 to    0.0200 
\put(   0.0059,   0.4141){\line(1,0){   0.0059}}
% line    0.6094 from    0.0100 to    0.0400 
\put(   0.0059,   0.3906){\line(1,0){   0.0177}}
% line    0.6016 from    0.0100 to    0.0200 
\put(   0.0059,   0.3984){\line(1,0){   0.0059}}
% line    0.6172 from    0.0100 to    0.0200 
\put(   0.0059,   0.3828){\line(1,0){   0.0059}}
% 1010 at    0.0100,   0.6562
\put(   0.1207,   0.3438){\makebox(0,0)[l]{\tt1010}}
% line    0.6875 from    0.0100 to    0.2800 
\put(   0.0059,   0.3125){\line(1,0){   0.1597}}
% 1011 at    0.0100,   0.7188
\put(   0.1207,   0.2812){\makebox(0,0)[l]{\tt1011}}
% 10100 at    0.0100,   0.6406
\put(   0.0387,   0.3594){\makebox(0,0)[l]{\tt10100}}
% line    0.6562 from    0.0100 to    0.1500 
\put(   0.0059,   0.3438){\line(1,0){   0.0828}}
% 10101 at    0.0100,   0.6719
\put(   0.0387,   0.3281){\makebox(0,0)[l]{\tt10101}}
% line    0.6406 from    0.0100 to    0.0400 
\put(   0.0059,   0.3594){\line(1,0){   0.0177}}
% line    0.6328 from    0.0100 to    0.0200 
\put(   0.0059,   0.3672){\line(1,0){   0.0059}}
% line    0.6484 from    0.0100 to    0.0200 
\put(   0.0059,   0.3516){\line(1,0){   0.0059}}
% line    0.6719 from    0.0100 to    0.0400 
\put(   0.0059,   0.3281){\line(1,0){   0.0177}}
% line    0.6641 from    0.0100 to    0.0200 
\put(   0.0059,   0.3359){\line(1,0){   0.0059}}
% line    0.6797 from    0.0100 to    0.0200 
\put(   0.0059,   0.3203){\line(1,0){   0.0059}}
% 10110 at    0.0100,   0.7031
\put(   0.0387,   0.2969){\makebox(0,0)[l]{\tt10110}}
% line    0.7188 from    0.0100 to    0.1500 
\put(   0.0059,   0.2812){\line(1,0){   0.0828}}
% 10111 at    0.0100,   0.7344
\put(   0.0387,   0.2656){\makebox(0,0)[l]{\tt10111}}
% line    0.7031 from    0.0100 to    0.0400 
\put(   0.0059,   0.2969){\line(1,0){   0.0177}}
% line    0.6953 from    0.0100 to    0.0200 
\put(   0.0059,   0.3047){\line(1,0){   0.0059}}
% line    0.7109 from    0.0100 to    0.0200 
\put(   0.0059,   0.2891){\line(1,0){   0.0059}}
% line    0.7344 from    0.0100 to    0.0400 
\put(   0.0059,   0.2656){\line(1,0){   0.0177}}
% line    0.7266 from    0.0100 to    0.0200 
\put(   0.0059,   0.2734){\line(1,0){   0.0059}}
% line    0.7422 from    0.0100 to    0.0200 
\put(   0.0059,   0.2578){\line(1,0){   0.0059}}
% 110 at    0.0100,   0.8125
\put(   0.1806,   0.1875){\makebox(0,0)[l]{\tt110}}
% line    0.8750 from    0.0100 to    0.3800 
\put(   0.0059,   0.1250){\line(1,0){   0.2188}}
% 111 at    0.0100,   0.9375
\put(   0.1806,   0.0625){\makebox(0,0)[l]{\tt111}}
% 1100 at    0.0100,   0.7812
\put(   0.1207,   0.2188){\makebox(0,0)[l]{\tt1100}}
% line    0.8125 from    0.0100 to    0.2800 
\put(   0.0059,   0.1875){\line(1,0){   0.1597}}
% 1101 at    0.0100,   0.8438
\put(   0.1207,   0.1562){\makebox(0,0)[l]{\tt1101}}
% 11000 at    0.0100,   0.7656
\put(   0.0387,   0.2344){\makebox(0,0)[l]{\tt11000}}
% line    0.7812 from    0.0100 to    0.1500 
\put(   0.0059,   0.2188){\line(1,0){   0.0828}}
% 11001 at    0.0100,   0.7969
\put(   0.0387,   0.2031){\makebox(0,0)[l]{\tt11001}}
% line    0.7656 from    0.0100 to    0.0400 
\put(   0.0059,   0.2344){\line(1,0){   0.0177}}
% line    0.7578 from    0.0100 to    0.0200 
\put(   0.0059,   0.2422){\line(1,0){   0.0059}}
% line    0.7734 from    0.0100 to    0.0200 
\put(   0.0059,   0.2266){\line(1,0){   0.0059}}
% line    0.7969 from    0.0100 to    0.0400 
\put(   0.0059,   0.2031){\line(1,0){   0.0177}}
% line    0.7891 from    0.0100 to    0.0200 
\put(   0.0059,   0.2109){\line(1,0){   0.0059}}
% line    0.8047 from    0.0100 to    0.0200 
\put(   0.0059,   0.1953){\line(1,0){   0.0059}}
% 11010 at    0.0100,   0.8281
\put(   0.0387,   0.1719){\makebox(0,0)[l]{\tt11010}}
% line    0.8438 from    0.0100 to    0.1500 
\put(   0.0059,   0.1562){\line(1,0){   0.0828}}
% 11011 at    0.0100,   0.8594
\put(   0.0387,   0.1406){\makebox(0,0)[l]{\tt11011}}
% line    0.8281 from    0.0100 to    0.0400 
\put(   0.0059,   0.1719){\line(1,0){   0.0177}}
% line    0.8203 from    0.0100 to    0.0200 
\put(   0.0059,   0.1797){\line(1,0){   0.0059}}
% line    0.8359 from    0.0100 to    0.0200 
\put(   0.0059,   0.1641){\line(1,0){   0.0059}}
% line    0.8594 from    0.0100 to    0.0400 
\put(   0.0059,   0.1406){\line(1,0){   0.0177}}
% line    0.8516 from    0.0100 to    0.0200 
\put(   0.0059,   0.1484){\line(1,0){   0.0059}}
% line    0.8672 from    0.0100 to    0.0200 
\put(   0.0059,   0.1328){\line(1,0){   0.0059}}
% 1110 at    0.0100,   0.9062
\put(   0.1207,   0.0938){\makebox(0,0)[l]{\tt1110}}
% line    0.9375 from    0.0100 to    0.2800 
\put(   0.0059,   0.0625){\line(1,0){   0.1597}}
% 1111 at    0.0100,   0.9688
\put(   0.1207,   0.0312){\makebox(0,0)[l]{\tt1111}}
% 11100 at    0.0100,   0.8906
\put(   0.0387,   0.1094){\makebox(0,0)[l]{\tt11100}}
% line    0.9062 from    0.0100 to    0.1500 
\put(   0.0059,   0.0938){\line(1,0){   0.0828}}
% 11101 at    0.0100,   0.9219
\put(   0.0387,   0.0781){\makebox(0,0)[l]{\tt11101}}
% line    0.8906 from    0.0100 to    0.0400 
\put(   0.0059,   0.1094){\line(1,0){   0.0177}}
% line    0.8828 from    0.0100 to    0.0200 
\put(   0.0059,   0.1172){\line(1,0){   0.0059}}
% line    0.8984 from    0.0100 to    0.0200 
\put(   0.0059,   0.1016){\line(1,0){   0.0059}}
% line    0.9219 from    0.0100 to    0.0400 
\put(   0.0059,   0.0781){\line(1,0){   0.0177}}
% line    0.9141 from    0.0100 to    0.0200 
\put(   0.0059,   0.0859){\line(1,0){   0.0059}}
% line    0.9297 from    0.0100 to    0.0200 
\put(   0.0059,   0.0703){\line(1,0){   0.0059}}
% 11110 at    0.0100,   0.9531
\put(   0.0387,   0.0469){\makebox(0,0)[l]{\tt11110}}
% line    0.9688 from    0.0100 to    0.1500 
\put(   0.0059,   0.0312){\line(1,0){   0.0828}}
% 11111 at    0.0100,   0.9844
\put(   0.0387,   0.0156){\makebox(0,0)[l]{\tt11111}}
% line    0.9531 from    0.0100 to    0.0400 
\put(   0.0059,   0.0469){\line(1,0){   0.0177}}
% line    0.9453 from    0.0100 to    0.0200 
\put(   0.0059,   0.0547){\line(1,0){   0.0059}}
% line    0.9609 from    0.0100 to    0.0200 
\put(   0.0059,   0.0391){\line(1,0){   0.0059}}
% line    0.9844 from    0.0100 to    0.0400 
\put(   0.0059,   0.0156){\line(1,0){   0.0177}}
% line    0.9766 from    0.0100 to    0.0200 
\put(   0.0059,   0.0234){\line(1,0){   0.0059}}
% line    0.9922 from    0.0100 to    0.0200 
\put(   0.0059,   0.0078){\line(1,0){   0.0059}}
\end{picture}

}
\end{center}
}{%
\caption[a]{Illustration of the intervals defined by a 
 simple Bayesian probabilistic model. The size of an interval is proportional
 to the  probability of the string. 
 This model anticipates that
 the source is likely to be biased towards one of {\tt{a}} and
 {\tt{b}}, so sequences having lots of {\tt{a}}s or lots of 
 {\tt{b}}s have larger intervals than sequences of the same length
 that are 50:50 {\tt{a}}s and
 {\tt{b}}s.}
\label{fig.ac2}
}%
\end{figure}

\begin{aside}
\subsection{Details of the Bayesian model}
 Having emphasized that any model could be used -- arithmetic coding 
 is not wedded to any particular set of probabilities -- let me explain 
 the simple adaptive probabilistic model used in the preceding example;
 we first encountered this model
 in
% chapter \ref{ch1}
% (page \pageref{ex.postpa})
 \exerciseref{ex.postpa}.
%
%
% {\em (This material may be a repetition of material in  \chref{ch1}.)}
%

\subsubsection{Assumptions}
 The model will be described using parameters 
 $p_{\eof}$, $p_{\ta}$ and $p_{\tb}$, defined below,
 which should not be confused with the predictive
 probabilities {\em in a particular context\/},
 for example,
 $P(\ta  \given   \bs\eq {\tb\ta\ta} )$.
% An analogy for this model, as I indicated
% at the start, 
% is the tossing of a  bent coin (\secref{sec.bentcoin}).
 A bent coin labelled 
 $\ta$ and $\tb$ is tossed some number of times $l$, 
 which we don't know beforehand. The coin's probability 
 of coming up $\ta$ when tossed is $p_{\ta}$, and $p_{\tb} = 1-p_{\ta}$; the parameters
 $p_{\ta},p_{\tb}$ are not known beforehand. The source string $\bs = \tt baaba\eof$
 indicates that $l$ was 5 and the sequence of outcomes was $\tt baaba$.
\ben
\item
 It is assumed that the length of the string $l$  has an exponential 
 probability distribution
\beq
 P(l) = (1 - p_{\eof})^l p_{\eof}
.
\eeq
 This distribution corresponds to assuming a constant probability 
 $p_{\eof}$ for the termination symbol `$\eof$' at each character. 
\item
 It is assumed that the non-terminal
 characters in the string are selected independently at random 
 from an ensemble with probabilities
% distribution
 $\P = \{p_{\ta},p_{\tb}\}$; the 
 probability $p_{\ta}$ is fixed throughout the string to some 
 unknown value that could be anywhere between $0$ and $1$. 
 The probability of an $\ta$ occurring as the next symbol, given 
 $p_{\ta}$ (if only we knew it), is $(1-p_{\eof})p_{\ta}$.
% given that it is not
 The probability, given $p_{\ta}$, that an unterminated string of length $F$
 is a given string $\bs$
 that contains $\{F_{\ta},F_{\tb}\}$ counts of the two outcomes
%  $\{ a , b \}$
 is the \ind{Bernoulli distribution}
\beq
        P( \bs  \given  p_{\ta} , F ) =  p_{\ta}^{F_{\ta}} (1-p_{\ta})^{F_{\tb}} .
\label{eq.pa.like}
\eeq
\item
 We assume a uniform prior distribution for $p_{\ta}$,
\beq
        P(p_{\ta}) = 1 , \: \: \: \: \: \: p_{\ta} \in [0,1]  ,
\label{eq.pa.prior}
\eeq
 and define  $p_{\tb} \equiv 1-p_{\ta}$.
 It would be  easy to assume other priors on $p_{\ta}$, with beta distributions
 being the most convenient to handle.
\een
 This model was studied in \secref{sec.bentcoin}.
 The key result we require is the predictive distribution for
 the next symbol, given the string so far, $\bs$.
 The probability that the next character is  $\ta$ 
 or $\tb$  (assuming that it is not `$\eof$')
 was derived in \eqref{eq.laplacederived} and is precisely
 Laplace's rule (\ref{eq.laplaceagain}).
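
 As a worked illustration (a sketch in the notation of this section, not
 a quotation from the earlier derivation): for a binary alphabet and the
 uniform prior, Laplace's rule reads
\beq
	P(\ta \given \bs , \mbox{next character is not `$\eof$'} )
	= \frac{F_{\ta}+1}{F_{\ta}+F_{\tb}+2} .
\eeq
 In the context $\bs \eq {\tb\ta\ta}$ the counts are $F_{\ta} \eq 2$ and
 $F_{\tb} \eq 1$, so, including the constant termination probability
 $p_{\eof}$, the three predictive probabilities are
\beq
	P(\ta \given \bs \eq {\tb\ta\ta}) = (1-p_{\eof}) \, \frac{3}{5} , \:\:\:
	P(\tb \given \bs \eq {\tb\ta\ta}) = (1-p_{\eof}) \, \frac{2}{5} , \:\:\:
	P(\eof \given \bs \eq {\tb\ta\ta}) = p_{\eof} ,
\eeq
 which sum to one, as they must.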

\end{aside}

\exercisaxB{3}{ex.ac.vs.huffman}{
	Compare the expected message length 
 when an ASCII file is compressed by the following 
 three methods.
\begin{description}
\item[Huffman-with-header\puncspace] Read the whole 
 file, find the empirical frequency of each symbol, 
 construct a Huffman code for those frequencies, 
 transmit the code by transmitting 
 the lengths of the Huffman codewords, then transmit
 the file using the Huffman code.
 (The actual codewords don't need to be transmitted, 
 since we can use a deterministic method for 
 building the tree given the codelengths.)
\item[Arithmetic code using the \ind{Laplace model}\puncspace]
\beq
	P_{\rm L}(\ta \given x_1,\ldots,x_{n-1})=\frac{F_{\ta}+1}
	{\sum_{{\ta'}}(F_{{\ta'}}+1)}.
\eeq
\item[Arithmetic code using a \ind{Dirichlet model}\puncspace]
 This model's predictions are: 
\beq
	P_{\rm D}(\ta \given x_1,\ldots,x_{n-1})=\frac{F_{\ta}+\alpha}
	{\sum_{{\ta'}}(F_{{\ta'}}+\alpha)},
\eeq
 where $\alpha$ is fixed to a number such as 0.01. 
 A small value of $\alpha$ corresponds to a  more responsive version of the 
 Laplace model; the probability over characters
 is expected to be more nonuniform;
 $\alpha=1$ reproduces the Laplace model.
\end{description}
 Take care that the header of your Huffman message 
 is self-delimiting. 
 Special cases worth considering are (a) short files
 with just a few hundred characters; (b) large files 
 in which some characters are never used. 
}
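
 A possible starting point for experiments with the two arithmetic-coding
 models is sketched below. It computes only the {\em ideal\/} message length,
 $\sum_n \log_2 1/P(x_n \given x_1,\ldots,x_{n-1})$, under the Dirichlet
 predictions, with $\alpha \eq 1$ giving the Laplace model; it ignores the
 few bits of overhead of a real arithmetic coder and any termination symbol.
 The function name and arguments are hypothetical, chosen for illustration.

```python
from math import log2

def adaptive_code_length(text, alphabet, alpha=1.0):
    """Ideal code length in bits under the adaptive Dirichlet model.

    Illustrative sketch: alpha = 1 reproduces the Laplace model.
    Every character of `text` is assumed to belong to `alphabet`.
    """
    counts = {a: 0 for a in alphabet}   # F_a for each symbol
    seen = 0                            # number of characters read so far
    bits = 0.0
    for ch in text:
        # Predictive probability (F_a + alpha) / sum_a' (F_a' + alpha)
        p = (counts[ch] + alpha) / (seen + alpha * len(alphabet))
        bits += -log2(p)
        counts[ch] += 1
        seen += 1
    return bits
```

 For instance, one might compare
 {\tt adaptive\_code\_length(text, alphabet, 1.0)} with
 {\tt adaptive\_code\_length(text, alphabet, 0.01)} on a short file and on
 a long file in which many characters of the alphabet never occur.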


\section{Further applications of arithmetic coding}
\subsection{Efficient generation of random samples}
\label{sec.ac.efficient}
 Arithmetic coding not only offers a way to compress strings 
 believed to come from a given model; it also offers a way to generate 
 random strings from a model. Imagine sticking a 
 pin into the unit interval at random, that line 
 having been divided into subintervals in proportion 
 to probabilities $p_i$; the probability that your pin will 
 lie in interval $i$ is $p_i$.  

 So to generate a sample from a model, all we need to do is feed ordinary 
 random bits into an arithmetic {\em decoder\/}\index{arithmetic coding!decoder} for that
 model.\index{arithmetic coding!uses beyond compression} An infinite random 
 bit sequence  corresponds to the selection of a point 
 at random from the line $[0,1)$, so the decoder will 
 then  select a string at random  from the assumed distribution. 
 This arithmetic method is guaranteed to use very nearly the smallest 
 number of random bits possible to make the selection -- an important 
 point in communities where random numbers are expensive!
 [{This is
 not a joke. Large amounts of money are spent on generating random bits
 in software and hardware. Random numbers are valuable.}]

 A simple example of the use of this technique is in the 
 generation of random bits with a nonuniform distribution $\{ p_0,p_1 \}$. 
%  This is a useful technique
\exercissxA{2}{ex.usebits}{
 Compare the following two techniques for generating random symbols
 from a nonuniform distribution $\{ p_0,p_1 \} = \{ 0.99,0.01\}$: 
\ben
\item The standard method: use a standard random number generator 
 to generate an integer between 1 and $2^{32}$. Rescale the integer 
 to $(0,1)$. Test whether this uniformly distributed random variable is 
 less than $0.99$, and emit a {\tt{0}} or {\tt{1}} accordingly.
\item
        Arithmetic coding using the correct model, fed with standard
 random bits.
\een
 Roughly how many random bits will each method use to generate a thousand 
 samples from this sparse distribution?
}
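
 The second technique can be sketched in a few lines for a binary
 distribution. This is an illustrative sketch only, using exact rational
 arithmetic rather than the finite-precision arithmetic of a practical
 coder; the function name {\tt sample\_bits} and its arguments are
 hypothetical. The decoder keeps two intervals: the interval of points
 consistent with the random bits read so far, and the interval of the
 symbols emitted so far. It reads a fresh random bit only when it
 cannot yet decide the next symbol.

```python
from fractions import Fraction
import random

def sample_bits(p0, n, rng):
    """Draw n symbols from {0: p0, 1: 1 - p0} by arithmetic decoding.

    Returns (symbols, number_of_random_bits_consumed).
    Illustrative sketch: exact fractions, no attempt at efficiency.
    """
    p0 = Fraction(p0)
    lo, hi = Fraction(0), Fraction(1)   # points consistent with bits read
    a, b = Fraction(0), Fraction(1)     # interval of the symbols emitted
    out, used = [], 0
    while len(out) < n:
        split = a + (b - a) * p0        # boundary between next 0 and next 1
        if hi <= split:                 # every consistent point gives a 0
            out.append(0)
            b = split
        elif lo >= split:               # every consistent point gives a 1
            out.append(1)
            a = split
        else:                           # undecided: halve [lo, hi) with one bit
            mid = (lo + hi) / 2
            if rng.getrandbits(1):
                lo = mid
            else:
                hi = mid
            used += 1
    return out, used
```

 Calling {\tt sample\_bits(Fraction(99, 100), 1000, random.Random())}
 and inspecting the second returned value gives an empirical count to set
 against the standard method's 32 bits per sample.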

\subsection{Efficient data-entry devices}
 When we enter text into a computer, we make gestures of some sort --
 maybe we tap a keyboard, or scribble with a pointer, or click with a mouse;
 an {\em efficient\/}
 \index{user interfaces}\index{data entry}\ind{text entry} system is
 one where the
 number of gestures required  to enter a given text  string is {\em small\/}.

 Writing\index{writing}\index{text entry}%
\marginfignocaption{\small
\begin{center}
\begin{tabular}{rcl}
\multicolumn{3}{l}{Compression:}\\
text& $\rightarrow$ &bits\\[0.2in]
\multicolumn{3}{l}{Writing:} \\
text &$\leftarrow$&  gestures\\[0.2in]
\end{tabular}
\end{center}
}
 can be viewed as an inverse process\index{arithmetic coding!uses beyond compression}
 to data compression. In data compression, the aim is to map
 a given text string into a {\em small\/} number of bits.
 In text entry, we want a small sequence of gestures
 to produce our intended text.

 By inverting an arithmetic coder,
 we can obtain  \index{inverse-arithmetic-coder}an information-efficient
 text entry device that is driven by continuous pointing
 gestures \cite{ward2000}.  In this system, called \ind{Dasher},\index{human--machine interfaces}\index{software!Dasher}
 the user zooms in on the unit interval to locate the\index{text entry}
 interval corresponding to their intended string,
 in the same style as \figref{fig.ac}. A \ind{language
 model} (exactly as used in text compression) controls the
 sizes of  the intervals such that probable strings are
 quick and easy to identify.
 After an hour's practice,
 a  novice
 user can write  with one  \ind{finger} driving {Dasher}
 at about 25 words per minute -- that's about 
 half their normal ten-finger
 \index{QWERTY}typing speed on a regular  \ind{keyboard}.
 It's even possible to write at 25 words per minute, {\em hands-free},
 using gaze direction to drive Dasher \cite{wardmackay2002}.
 Dasher is available as free software for various
 platforms.\footnote{ {\tt http://www.inference.phy.cam.ac.uk/dasher/}}
\label{sec.stopbeforeLZ}

\section{Lempel--Ziv coding\nonexaminable}
 The \index{Lempel--Ziv coding|(}Lempel--Ziv algorithms, which are widely used for data compression
 (\eg, the {\tt\ind{compress}}   and {\tt\ind{gzip}} commands), are  different in philosophy to arithmetic
 coding. There is no separation  between modelling and coding,\index{philosophy} 
 and no opportunity for explicit modelling.\index{source code!algorithms}

\subsection{Basic Lempel--Ziv algorithm}
 The method of compression is to replace a \ind{substring} with a \ind{pointer} to
 an earlier occurrence of the same substring.
 For example, if the string is {\tt{1011010100010}}\ldots, we \ind{parse} it into  
 an ordered  {\dem\ind{dictionary}\/} of  substrings that have not appeared before
 as follows:
 $\lambda$, {\tt{1}}, {\tt{0}}, {\tt{11}}, {\tt{01}}, {\tt{010}}, {\tt{00}}, {\tt{10}}, \dots.
 We include the \index{empty string}empty substring \ind{$\lambda$} as 
 the first substring in the dictionary and order the substrings in the dictionary
 by the order in which they emerged from the source.
 After every comma, we look along the next part of the 
 input sequence until we have read a
 substring that has not been marked off before. A moment's 
 reflection will confirm that
 this substring is longer by one bit than a substring that has occurred
 earlier in the dictionary. This means that we can encode each substring by 
 giving a {\dem pointer\/} to the earlier occurrence of that prefix and then sending
 the extra bit by which the new substring in the dictionary differs from 
 the earlier substring. If, at the $n$th bit, we have enumerated
 $s(n)$ substrings, then we can
 give the value of the pointer in
 $\lceil \log_2 s(n) \rceil$ bits. The code for the above sequence 
 is then as shown in the fourth line of the following table  (with
 punctuation included for clarity), the upper lines indicating the source 
 string and the value of $s(n)$:
%
%
\beginfullpagewidth%% defined in chapternotes.sty, uses {narrow}
\[
\begin{array}{l|*{8}{l}}
 \mbox{source substrings}&\lambda &  {\tt{1}}  & {\tt{0}}   & {\tt{11}}   & {\tt{01}}    & {\tt{010}}  & {\tt{00}}    & {\tt{10}}   \\
 s(n)           & 0    &  1  & 2   & 3    & 4     & 5    & 6     & 7    \\
 s(n)_{\rm binary}
                & {\tt{000}}  & {\tt{001}} & {\tt{010}} & {\tt{011}}  & {\tt{100}}   & {\tt{101}}  & {\tt{110}}   & {\tt{111}}  \\
 (\mbox{pointer},\mbox{bit})&   & (,{\tt{1}})  & ({\tt{0}},{\tt{0}}) & ({\tt{01}},{\tt{1}}) & ({\tt{10}},{\tt{1}}) & ({\tt{100}},{\tt{0}})& ({\tt{010}},{\tt{0}}) & ({\tt{001}},{\tt{0}})  
\end{array}
\]
\end{narrow}
% The pointer 
 Notice that the first  pointer we send is  empty, 
 because, given that there is only one substring in  the 
 dictionary -- the string $\lambda$ --
 no bits are needed to convey the `choice' of that substring
 as the prefix.
 The encoded string is {\tt 100011101100001000010}.
 The encoding,  in this 
 simple case, is actually a longer string than the source string, because
 there was no obvious redundancy in the source string. 
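
 The encoding procedure just described can be sketched as follows. This
 is an illustrative sketch, not the program used to prepare the table;
 the name {\tt lz\_encode} is hypothetical. Since no termination
 protocol has been specified, the sketch simply drops any incomplete
 final substring.

```python
def lz_encode(source):
    """Basic Lempel-Ziv encoding of a string of '0' and '1' characters.

    Illustrative sketch: a trailing substring that is already in the
    dictionary when the input ends is silently dropped.
    """
    dictionary = {"": 0}      # substring -> index; the empty string is entry 0
    out = []
    phrase = ""
    for bit in source:
        if phrase + bit in dictionary:
            phrase += bit     # keep reading until the substring is new
            continue
        s = len(dictionary)               # substrings enumerated so far
        width = (s - 1).bit_length()      # equals ceil(log2 s); 0 when s == 1
        pointer = format(dictionary[phrase], "b").zfill(width) if width else ""
        out.append(pointer + bit)         # (pointer, bit)
        dictionary[phrase + bit] = s
        phrase = ""
    return "".join(out)
```

 Applied to the thirteen source bits {\tt{1011010100010}}, this sketch
 returns the encoded string {\tt 100011101100001000010} given above.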
\exercisaxB{2}{ex.Clengthen}{  
 Prove that {\em any\/} uniquely decodeable code from $\{{\tt{0}},{\tt{1}}\}^+$ to 
 $\{{\tt{0}},{\tt{1}}\}^+$ necessarily makes some strings longer if it makes 
 some strings shorter. 
}
 One reason why the algorithm described above lengthens
 a lot of strings
 is that it is inefficient --
 it transmits unnecessary bits; to put it another
 way, its code is not  complete.\label{sec.LZprune}
% is not necessarily the explanation for the above lengthening, 
% however, because
% see also {ex.LZprune}{ 
% the algorithm described is certainly inefficient: o
 Once a substring in 
 the {dictionary} has been joined there by both of its children, then 
 we can be sure that it will not be needed (except possibly as part
 of our protocol for terminating a message); so at that point we 
 could drop it from our dictionary of substrings and shuffle them all along 
 one, thereby reducing the length of subsequent pointer messages. Equivalently, 
 we could write the second prefix into the dictionary at the point
 previously occupied by the parent. A second unnecessary overhead 
 is the transmission of the new bit in these cases -- the second 
 time a prefix is used, we can be sure of the identity of the next bit.
% This is easy to do in a computer but not so easy for a human 
% to cope with.
\subsubsection{Decoding} 
 The decoder again involves an identical twin at the decoding end
 who constructs the dictionary of substrings as the 
 data are decoded.
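
 A matching decoder might look like the sketch below. It assumes the
 conventions of the basic algorithm described above: the dictionary
 starts with the empty string alone, and when $s$ substrings have been
 enumerated the pointer occupies $\lceil \log_2 s \rceil$ bits. The name
 {\tt lz\_decode} is hypothetical, and the input is assumed to be a
 complete, uncorrupted encoding.

```python
def lz_decode(code):
    """Invert the basic Lempel-Ziv encoding of a binary string.

    Illustrative sketch: `code` must be a well-formed sequence of
    (pointer, bit) pairs with no truncation or errors.
    """
    dictionary = [""]         # entry 0 is the empty string
    out = []
    i = 0
    while i < len(code):
        # The decoder knows the pointer width because it has built the
        # same dictionary as the encoder.
        width = (len(dictionary) - 1).bit_length()   # ceil(log2 s)
        pointer = int(code[i:i + width], 2) if width else 0
        bit = code[i + width]
        phrase = dictionary[pointer] + bit
        dictionary.append(phrase)
        out.append(phrase)
        i += width + 1
    return "".join(out)
```

 For example, {\tt lz\_decode("100011101100001000010")} recovers
 {\tt{1011010100010}}.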
\exercissxB{2}{ex.LZencode}{
 Encode the string  {\tt{000000000000100000000000}}
 using the basic
 Lempel--Ziv algorithm described above.
}
% lambda  0  00   000  0000  001 00000 000000 
%   000  001 010  011  100   101  110   111
%        ,0  1,0  10,0 11,0  010,1 100,0 110,0
% answer
%           010100110010110001100
\exercissxB{2}{ex.LZdecode}{
 Decode the string
\begin{center}
 {\tt{00101011101100100100011010101000011}}
\end{center}
 that was encoded using the basic 
 Lempel--Ziv algorithm.
}
% answer 
% 0100001000100010101000001                                001000001000000
% lamda, 0, 1, 00, 001, 000, 10, 0010,     101, 0000, 01,  00100, 0001, 00000
% 0    , 1, 10,11, 100, 101,110, 111,     1000, 1001, 1010, 1011,1100, 1101
%       ,0 0,1 01,0 11,1 011,0 010,0 100,0 110,1 0101,0 0001,1  bored!
%   10101011101100100100011010101000011  
%
% see tcl/lempelziv.tcl

\subsubsection{Practicalities}
 In this description I have not discussed the method for terminating 
 a string. 

 There are many variations on the Lempel--Ziv algorithm, all exploiting 
 the same idea but using different procedures for dictionary management, 
 etc.
% Two of the best known
% variations are called the Ziv-Lempel algorithm
% and the LZW algorithm.
% 
 The resulting programs are  fast, but their performance on compression
 of English text, although  useful, 
 does not match the standards set in the arithmetic coding literature.

\subsection{Theoretical properties}
 In contrast to the block code, Huffman code, and arithmetic coding 
 methods we discussed in the last three chapters, 
 the Lempel--Ziv algorithm is defined without making any mention
 of a \ind{probabilistic model} for the source. Yet,
% in fact, 
 given any \ind{ergodic}
%\footnote{Need to clarify this. It means
% the source is memoryless on sufficiently long timescales.}
 source (\ie, one that is memoryless on sufficiently long timescales),
 the Lempel--Ziv algorithm can be 
 proven {\em asymptotically\/} to 
 compress down to the entropy of the source. This is why it is called 
 a `\ind{universal}' compression algorithm. For a proof 
 of this property, see \citeasnoun{Cover&Thomas}. 
%   Cover and Thomas (1991).

 It achieves its compression,
 however, only by {\em memorizing\/} substrings that have happened
 so that it has a short name for them the next time they occur.
 The asymptotic timescale on which this universal performance 
 is achieved
%%is likely to be the time that it takes for 
%% if the source has not been observed long enough for 
% {\em all\/} typical sequences of length $n^*$
% to  occur, where $n^*$ is the longest lengthscale associated with the 
% statistical fluctuations in the source. 
%the longest lengthscale on 
% which there are correlations in .
% red then 
% For many sources the time for all typical sequences to
% occur  is
 may, for many sources, be unfeasibly long, because
 the number of typical substrings that need memorizing
 may be enormous.
%
 The useful performance
 of the algorithm in practice is a reflection of the fact that 
 many files contain multiple repetitions of particular
 short sequences of characters,
 a form of redundancy to which the algorithm is  well suited.

\subsection{Common ground}
 I have emphasized the difference in philosophy behind arithmetic coding
 and Lempel--Ziv coding. There is common ground
 between them, though: in principle, one can design 
 adaptive probabilistic models, and thence arithmetic codes, that
 are `\ind{universal}', that is, models that will asymptotically compress 
 {\em any source in some class\/} to within some factor (preferably 1) 
 of its entropy.\index{compression!universal}
 However, for practical purposes, I think such universal models can only be
 constructed  if the class of sources is severely restricted. 
 A general purpose compressor that can discover the probability 
 distribution of {\em any\/} source would be a general purpose
 \ind{artificial intelligence}! A  general purpose artificial 
 intelligence does not yet exist.


% \subsection{Comments}
%  The  Lempel--Ziv algorithm can be generalized to any finite alphabet
%  as long as the input and output alphabets are the same. I believe 
%  it is not convenient to use unequal alphabets.

\section{Demonstration}
 An interactive   aid for exploring arithmetic coding, {\tt dasher.tcl},  is
 available.\footnote{{\tt http://www.inference.phy.cam.ac.uk/mackay/itprnn/softwareI.html}}
% http://www.inference.phy.cam.ac.uk/mackay/itprnn/code/tcl/dasher.tcl

 A demonstration arithmetic-coding\index{source code!algorithms}
 \index{arithmetic coding!software}\index{software!arithmetic coding}software
 package written by \index{Neal, Radford}{Radford  Neal}\footnote{%
% is available from \\ \noindent
 {\tt ftp://ftp.cs.toronto.edu/pub/radford/www/ac.software.html}}
% This package
 consists of encoding and decoding modules to which the 
 user  adds a  module defining the probabilistic model. It 
 should  be emphasized that there is no single
 general-purpose arithmetic-coding compressor; a new model has to be written 
 for each type of source.
% application.
%
 Radford Neal's\index{Neal, Radford}
 package includes a simple adaptive model similar to the 
 Bayesian model  demonstrated in section \ref{sec.ac}. 
 The results using this Laplace model should 
 be viewed as a basic benchmark since it is
 the simplest possible probabilistic model -- it 
% These results are anecdotal and should not be taken too 
% seriously, but it is interesting that the highly developed gzip 
% software only does a little better than the benchmark
% of the simple Laplace model, 
 simply assumes the characters in the file come independently 
 from a fixed ensemble.
 The counts $\{ F_i \}$ of the symbols $\{ a_i \}$ are rescaled
 and rounded  as the file is read such that all the counts  lie
 between 1 and 256.

 \index{DjVu}\index{deja vu}\index{Le Cun, Yann}\index{Bottou, Leon}
% Yann Le Cun, Leon Bottou and  colleagues at AT{\&}T Labs 
% have written a
 A state-of-the-art compressor for documents
 containing text and images, {\tt{DjVu}},
 uses arithmetic coding.\footnote{%
% {\tt{DjVu}} is described at
 \tt http://www.djvuzone.org/}
% (better Reference for deja vu?)
 It uses a carefully designed approximate
 arithmetic coder for binary
 alphabets called the  Z-coder \cite{bottou98coder},
 which  is much faster than the
 arithmetic coding software described above. One of
 the neat tricks the Z-coder uses is this: the adaptive model
 adapts  only occasionally (to save on computer time),
 with the decision about when to adapt being pseudo-randomly
 controlled by
 whether the arithmetic encoder emitted a bit.

 The JBIG image compression standard for binary images 
 uses arithmetic coding with a context-dependent 
 model, which adapts using a rule similar to  Laplace's rule.
 PPM \cite{Teahan95a} is a leading method for text compression,
 and it uses arithmetic coding.

 There are many Lempel--Ziv-based programs.
 {\tt gzip} is based on a version of  Lempel--Ziv
 called `{\tt LZ77}' \cite{Ziv_Lempel77}\nocite{Ziv_Lempel78}. {\tt compress} is based on `{\tt LZW}'
 \cite{Welch84}.
 In my experience the 
 best is {\tt gzip}, with {\tt compress} being  inferior
 on most files.
% To 
% give further credit to {\tt gzip}, it stores additional information in 
% the compressed file such as the name of the file and its 
% last modification date.

 {\tt bzip} is
 a {\dem{\ind{block-sorting} file compressor\/}}, which makes
 use of a neat hack called the {\dem\ind{Burrows--Wheeler transform}}\index{source code!Burrows--Wheeler transform}\index{source code!block-sorting compression}
 \cite{bwt}. This method is not based on an explicit probabilistic
 model, and it only works well for files larger than several
 thousand characters; but in practice it is a very effective
 compressor for files in which the context of a character
 is a good predictor for that character.%
% Maybe I'll describe it in a future edition of this
% book.
\footnote{There is a lot of information about the
 Burrows--Wheeler transform on the net.
 {\tt{http://dogma.net/DataCompression/BWT.shtml}}
}
%bzip2 compresses files using the Burrows--Wheeler block-sorting text compression algorithm, and Huffman
%coding. Compression is generally considerably better than that achieved by more conventional
%LZ77/LZ78-based compressors, and approaches the performance of the PPM family of statistical
%compressors. 


\subsubsection{Compression of a text file}
 Table \ref{tab.zipcompare1}  gives the computer time taken, in seconds, and the 
 compression achieved when these  programs are applied to 
 the \LaTeX\ file containing the text of
 this chapter, of size  20,942 bytes.
\begin{table}[htbp]
\figuremargin{
\begin{center}
\begin{tabular}{lccc} \toprule
Method & Compression  & Compressed  size & Uncompression \\
  &      time$ \,/\, $sec       & (\%age of 20,942) & time$ \,/\, $sec \\ \midrule
%Adaptive encoder,
 Laplace model  &
        0.28 & $12\,974$ (61\%) & 0.32 \\
%{\tt gzip / gunzip} &
{\tt gzip} &
        0.10 & \hspace{0.06in}$ 8\,177$ (39\%)  & {\bf 0.01} \\
{\tt compress}
%/ uncompress}
&
        0.05 & $10\,816$ (51\%) & 0.05 \\  \midrule
{\tt bzip}
% / bunzip}
&
            & \hspace{0.06in}$ 7\,495$ (36\%) & \\
{\tt bzip2}
%/ bunzip2}
&
            & \hspace{0.06in}$ 7\,640$ (36\%) & \\
{\tt ppmz } &
	    & \hspace{0.06in}{\bf 6$\,$800 (32\%)} &  \\
 \bottomrule
\end{tabular}
\end{center}
}{
\caption[a]{Comparison of compression algorithms applied to a text file.
}
\label{tab.zipcompare1}
}
\end{table}
% I will report the value of ``u''
% django:
% 0.410u 0.060s 0:00.60 78.3%     0+0k 0+0io 109pf+0w
%   6800 Nov 25 18:05 ../l4.tex.ppm
% time ppmz ../l4.tex.ppm ../l4.tex.up
% 0.480u 0.040s 0:00.60 86.6%     0+0k 0+0io 109pf+0w
%
% 108:wol:/home/mackay/_tools/ac0> time adaptive_encode < ~/_courses/itprnn/l4.tex > l4.tex.aez
% 0.280u 0.040s 0:00.55 58.1% 0+105k 2+3io 0pf+0w
% 109:wol:/home/mackay/_tools/ac0> time gzip ~/_courses/itprnn/l4.tex
% 0.100u 0.060s 0:00.28 57.1% 0+161k 2+12io 0pf+0w
% 110:wol:/home/mackay/_tools/ac0> ls -lisa ~/_courses/itprnn/l4.tex.gz
% 110131    8        8177 Jan 10 15:40 /home/mackay/_courses/itprnn/l4.tex.gz
% 111:wol:/home/mackay/_tools/ac0> gunzip ~/_courses/itprnn/l4.tex.gz
% 112:wol:/home/mackay/_tools/ac0> ls -lisa ~/_courses/itprnn/l4.tex l4.tex.aez
% 109904   21       20942 Jan 10 15:40 /home/mackay/_courses/itprnn/l4.tex
% 444691   13       12974 Jan 10 15:40 l4.tex.aez
% 113:wol:/home/mackay/_tools/ac0> time gzip ~/_courses/itprnn/l4.tex
% 0.100u 0.050s 0:00.24 62.5% 0+150k 0+13io 0pf+0w
% 114:wol:/home/mackay/_tools/ac0> time gunzip ~/_courses/itprnn/l4.tex.gz
% 0.010u 0.060s 0:00.17 41.1% 0+80k 0+8io 0pf+0w
% 115:wol:/home/mackay/_tools/ac0> time adaptive_decode < l4.tex.aez > l4.tex
% 0.320u 0.030s 0:00.39 89.7% 0+101k 6+4io 5pf+0w
% 
% django: bzip and gunzip: 
% 149:django.ucsf.edu:/home/mackay/_tools/ac0> time bzip l4.tex
% BZIP, a block-sorting file compressor.  Version 0.21, 25-August-96.
% 0.060u 0.020s 0:00.22 36.3%     0+0k 0+0io 107pf+0w
%     7495 Jan 10  1997 l4.tex.bz
% 153:django.ucsf.edu:/home/mackay/_tools/ac0> time bunzip l4.tex.bz
% 0.020u 0.010s 0:00.14 21.4%     0+0k 0+0io 93pf+0w
%    20942 Jan 10  1997 l4.tex
% 155:django.ucsf.edu:/home/mackay/_tools/ac0> time bzip2 l4.tex
% 0.050u 0.000s 0:00.37 13.5%     0+0k 0+0io 90pf+0w
%     7640 Jan 10  1997 l4.tex.bz2
% 157:django.ucsf.edu:/home/mackay/_tools/ac0> time bunzip2 l4.tex.bz2
% 0.020u 0.000s 0:00.15 13.3%     0+0k 0+0io 85pf+0w
% time gzip  l4.tex
% 0.010u 0.010s 0:00.28 7.1%      0+0k 0+0io 84pf+0w
%    8177 Jan 10  1997 l4.tex.gz
% time gunzip  l4.tex
% 0.000u 0.010s 0:00.12 8.3%      0+0k 0+0io 87pf+0w
%
\subsubsection{Compression of a sparse file}
 Interestingly, {\tt gzip} does not always do so well. 
 Table \ref{tab.zipcompare2}  gives the 
% computer time in seconds taken and the 
 compression achieved when these  programs are applied to 
 a text file containing $10^6$ characters, each of which is
 either {\tt0} or {\tt1}, with probabilities 
 0.99 and 0.01. The Laplace model is quite
 well matched to this source, 
 and the benchmark arithmetic coder
 gives good performance, followed closely by {\tt compress}; {\tt gzip}
% , interestingly,
 is worst.
% see /home/mackay/_tools/ac0
%
% , and {\tt gzip --best} does no better.
% has identical performance to {\tt gzip} on this example.}]
 An ideal model for this source would compress the 
 file into about $10^6 H_2(0.01)/8 \simeq 10\,100$ bytes.  The Laplace model 
 compressor falls short of this performance because it is implemented 
 using only eight-bit precision. The {\tt{ppmz}} compressor compresses
 the best of all, but takes much more computer time.\index{Lempel--Ziv coding|)}
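
 The figure of $10\,100$ bytes can be checked by hand (a worked sketch
 of the arithmetic):
\beq
	H_2(0.01) = 0.01 \log_2 \frac{1}{0.01} + 0.99 \log_2 \frac{1}{0.99}
	\simeq 0.0664 + 0.0144 = 0.0808 \mbox{ bits},
\eeq
 so $10^6$ characters carry about $80\,800$ bits, that is, about
 $10\,100$ bytes.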
\begin{table}[htbp]
\figuremargin{
\begin{center}
\begin{tabular}{lccc}  \toprule
Method & Compression  & Compressed size & Uncompression \\
  &      time$ \,/\, $sec       & $ \,/\, $bytes & time$ \,/\, $sec \\ \midrule
% Adaptive encoder,
% Laplace model  &
%        6.4 &  14089 (1.4\%)\hspace{0.06in} & 9.2  \\
%{\tt gzip / gunzip} &
%        2.1 &   20548 (2.1\%)\hspace{0.06in} & 0.43  \\
%{\tt compress / uncompress} &
%        0.73 &  14692 (1.47\%) & 0.76 \\  \bottomrule
%{\tt bzip / bunzip} &
%  &          &  (\%) & \\
%{\tt bzip2 / bunzip2} &
%  &          &  (\%) & \\ \hline
Laplace model  &
 0.45        &  $14\,143$ (1.4\%)\hspace{0.06in} & 0.57 \\
{\tt gzip } &
 0.22       &   $20\,646$ (2.1\%)\hspace{0.06in} &  0.04 \\
{\tt gzip {\tt-}{\tt-}best} &
%{\tt gzip \verb+--best+} &
 1.63       &   $15\,553$ (1.6\%)\hspace{0.06in} &  0.05 \\
{\tt compress} &
 0.13        &  $14\,785$ (1.5\%)\hspace{0.06in} & 0.03  \\  \midrule
{\tt bzip } &
     0.30      & $10\,903$ (1.09\%) & 0.17 \\
{\tt bzip2} &
 0.19          & $11\,260$ (1.12\%) & 0.05 \\
{\tt ppmz} & 
	533 &  {\bf 10$\,$447  (1.04\%)} & 535 \\
\bottomrule
\end{tabular}
\end{center}
% ideal length = 0.0807931 * 10^6 = 80793 bits = 10099 bytes
% /home/mackay/_tools/ac0/README1
}{
\caption[a]{Comparison of compression algorithms applied to a random file
 of $10^6$ characters, 99\% {\tt0}s and 1\% {\tt1}s.
}
\label{tab.zipcompare2}
}
\end{table}

\section{Summary}
 In the last three chapters 
 we have studied three classes of data compression codes.
\begin{description}
\item[Fixed-length block codes]  (Chapter \chtwo). These are mappings 
 from a fixed number of source symbols to a fixed-length binary message.
% Most source strings are given no encoding; 
 Only a tiny fraction of 
 the source strings are given an encoding.
 These codes were fun for identifying the entropy as the measure 
 of compressibility but  they are of little practical use.
\item[Symbol codes] (Chapter \chthree).  Symbol codes employ a variable-length
 code for each symbol in the source alphabet, the codelengths being 
 integer lengths determined by the probabilities of the symbols.
 Huffman's algorithm constructs an optimal symbol code for a given 
 set of symbol probabilities.

 Every source string has a uniquely decodeable encoding, and if 
 the source symbols come from the assumed distribution then the symbol 
 code will compress
 to an expected length $L$ lying in the interval $[H,H\!+\!1)$. 
 Statistical fluctuations in the source may make the actual length 
 longer or shorter than this mean length. 

 If the source is not well matched to the assumed distribution then 
 the mean length is increased by the relative entropy $D_{\rm KL}$
 between the source distribution and the code's implicit distribution.
 For sources with small entropy, the symbol code has to emit
 at least one bit per source symbol; compression 
 below one bit per source symbol can only be achieved 
 by the cumbersome procedure of putting the source data into blocks.
\item[Stream codes\puncspace]
 The distinctive property of stream codes, compared with 
 symbol codes, is that  they are not constrained to emit at least one bit for every 
 symbol read from the source stream. So large numbers of
 source symbols may be
 coded into a smaller number of bits.
% , but unlike block codes,  this is achieved
 This property could only be obtained using a symbol code
 if the source stream were somehow  chopped into blocks.
\bit
\item {Arithmetic codes}
 combine a probabilistic model with an encoding algorithm 
 that identifies each string with a sub-interval of $[0,1)$ 
 of size  equal to the probability of that string under the model.
 This code is almost optimal in the sense that 
 the compressed length of a string $\bx$ closely matches
 the Shannon information content of $\bx$ given
 the probabilistic model.  Arithmetic codes fit with 
 the philosophy that  good compression requires 
%intelligence
 {\dem data modelling}, in the form of an adaptive Bayesian model. 
\item
% [Stream codes:  Lempel--Ziv codes\puncspace]
 Lempel--Ziv codes are adaptive in the sense that they memorize strings 
 that have already occurred. They are built on the philosophy that 
 we don't know anything at all about
 what the probability distribution of the source will be, and we want 
 a compression algorithm that will perform reasonably well
 whatever that distribution  is. 
\eit
\end{description}
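
 The penalty paid by a mismatched symbol code can be made explicit.
 As a sketch, assume ideal codelengths $l_i = \log_2 (1/q_i)$, where
 $q_i$ denotes the code's implicit probabilities and $p_i$ the true
 source probabilities (the symbols $p_i$, $q_i$ and $l_i$ are introduced
 here for illustration, and the integer constraint on real codelengths is ignored).
 Then the expected length per symbol is
\beq
 \sum_i p_i \log_2 \frac{1}{q_i}
 = \sum_i p_i \log_2 \frac{1}{p_i} + \sum_i p_i \log_2 \frac{p_i}{q_i}
 = H + D_{\rm KL} ,
\eeq
 so the excess over the entropy $H$ is the relative entropy.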

%\section{Optimal compression must involve artificial intelligence}
%\subsection{A rant about `universal' compression}
% moved this to rant.tex for the time being

 Both arithmetic codes and Lempel--Ziv codes will fail to decode
 correctly if any of the bits of the compressed file are altered.
 So if  compressed files are to be stored or transmitted over
 noisy media, error-correcting codes will be essential.
 Reliable communication over unreliable channels is
 the topic of \partnoun\ \noisypart.
% the next few chapters. 

%Exercises
\section{Exercises on stream codes}%{Problems}
\exercisaxA{2}{ex.AC52}{
 Describe an arithmetic coding algorithm to encode random 
 bit strings of length $N$ and weight $K$ (\ie, $K$ ones and $N-K$
 zeroes) where $N$ and $K$ are given. 

 For the case $N\eq 5$, $K \eq 2$ show in detail the intervals corresponding to 
 all source substrings of lengths 1--5. 
}
\exercissxB{2}{ex.AC52b}{
 How many bits are needed to specify a selection of 
% an unordered collection of 
 $K$ objects from $N$ objects? ($N$ and $K$ are assumed to be known and
 the selection of $K$ objects is unordered.)
 How might such a selection 
 be made  at random without being wasteful of random bits?
}
\exercisaxB{2}{ex.HuffvAC}{
% from 2001 exam
 A binary source $X$ emits independent identically
 distributed symbols with  probability distribution $\{ f_{0},f_1 \}$,
 where $f_1 = 0.01$.
 Find an optimal uniquely-decodeable  symbol code for a string 
 $\bx=x_1x_2x_3$ of {\bf{three}} successive 
 samples from this source.

 Estimate  (to one decimal place) the factor 
 by which the expected length of this optimal code is greater 
 than the entropy of the three-bit string $\bx$.

 [$H_2(0.01) \simeq 0.08$, where
  $H_2(x) = x \log_2 (1/x) + (1-x) \log_2 (1/(1-x))$.]
%\medskip

 An {{arithmetic code}\/} is used to compress a string of $1000$ samples 
 from the source $X$. Estimate the mean and standard deviation of
 the length of the compressed file.
% This is example 6.3, identical, except we are talking about compressing
% rather than generating.
}
\exercisaxB{2}{ex.ACNf}{
 Describe an arithmetic coding algorithm to generate random 
 bit strings of length $N$ with density $f$ (\ie, each 
 bit has probability $f$ of being a one) where $N$ is given. 
}
\exercisaxC{2}{ex.LZprune}{
 Use a modified Lempel--Ziv algorithm in which, as discussed
 on \pref{sec.LZprune}, the dictionary of prefixes 
 is 
% effectively
 pruned by writing new prefixes into the 
 space occupied by prefixes that will not be needed again.
 Such prefixes can be identified when
 both their children have been added to the dictionary of prefixes. 
 (You may neglect the issue of termination of encoding.)
 Use this algorithm to encode the string
 {\tt{0100001000100010101000001}}.
 Highlight the bits that  follow a prefix on the 
 second occasion that that prefix is used. (As discussed earlier, 
 these bits could  be omitted.)
%  from the encoding if we adopted the convention (discussed
% earlier)
% of not transmitting the bit that follows a prefix on the 
% second occasion that that prefix is used.
% nb this is same as an earlier example. 
% i get
%  ,0 0,1 1,0 10,1 10,0 00,0 011,0 100,1 010,0 001,1
}
\exercissxC{2}{ex.LZcomplete}{
 Show that this modified Lempel--Ziv code is still not `complete', 
 that is, there are binary strings that are not encodings of any string. 
}
% answer: this is because there are illegal prefix names, e.g. at the 
% 5th step, 111 is not legal.
%
\exercissxB{3}{ex.LZfail}{
 Give examples of simple sources that have low entropy 
 but  would not be compressed well by the Lempel--Ziv algorithm.
}
%
% Ideas: add a figure showing the flow diagram -- source, model. 
%
%
% \begin{thebibliography}{}
% \bibitem[\protect\citeauthoryear{Witten {\em et~al.\/}}{1987}]{arith_coding}
% {\sc Witten, I.~H.}, {\sc Neal, R.~M.},  \lsaand {\sc Cleary, J.~G.}
% \newblock (1987)
% \newblock Arithmetic coding for data compression.
% \newblock {\em Communications of the ACM\/} {\bf 30} (6):~520-540.
% 
% \end{thebibliography}
% \part{Noisy Channel Coding}
% \end{document} 

\dvips
%
\section{Further exercises on data compression}
%\chapter{Further Exercises on Data Compression}
\label{ch_f4}
%
% _f4.tex: exercises to follow chapter 4 in a 'review, revision, further topics'
% exercise zone.
%
\fakesection{Post-compression general extra exercises}
 The following exercises may be skipped by the reader who
 is eager to learn about noisy channels.

%
% DOES THIS BELONG HERE?    Maybe move to p92.
%
\fakesection{RNGaussian}
\exercissxA{3}{ex.RNGaussian}{
\index{life in high dimensions}\index{high dimensions, life in}
%
 Consider a Gaussian distribution\index{Gaussian distribution!$N$--dimensional} in $N$ dimensions, 
\beq
 P(\bx) = \frac{1}{(2 \pi \sigma^2)^{N/2}} \exp \left( - \frac{\sum_n x_n^2}{2 \sigma^2} \right) .
\label{first.gaussian}
\eeq
%  Show that
 Define the radius of a point $\bx$  to be $r = \left( {\sum_n
 x_n^2} \right)^{1/2}$. 
 Estimate the mean and variance of the square of the radius,
   $r^2 = \left( {\sum_n x_n^2} \right)$.

\begin{aside}%{\small
 You may find helpful the integral 
\beq
	\int \! \d x\:  \frac{1}{(2 \pi \sigma^2)^{1/2}} \: x^4  
	\exp \left( - \frac{x^2}{2 \sigma^2} \right) = 3 \sigma^4 ,
\label{eq.gaussian4thmoment}
\eeq
 though you should be able to estimate the required quantities 
 without it.
\end{aside}

% If you like gamma integrals
% derive the probability density of the radius $r = \left( {\sum_n
% x_n^2} \right)^{1/2}$, and find the most probable
% radius.

%\amarginfig{b}{% in first printing, before asides changed
\amarginfig{t}{%
\setlength{\unitlength}{0.7mm}
% there is a strip without ink at the left, hence I use -19
% instead of -21 as the left coordinate
\begin{picture}(42,42)(-19,-21)% original is 6in by 6in, so 7unitlength=1in
% use 42 unitlength for width
\put(-21,-21){\makebox(42,42){\psfig{figure=figs/typicalG.ps,angle=-90,width=29.4mm}}}
%\put(14,14){\makebox(0,0)[l]{\small probability density is maximized here}}
\put(10,18){\makebox(0,0)[bl]{\small probability density}}
\put(13,13){\makebox(0,0)[bl]{\small is maximized here}}
%\put(14,-14){\makebox(0,0)[l]{\small almost all probability mass is here}}
\put(9,-16){\makebox(0,0)[l]{\small almost all}}
\put(2,-21){\makebox(0,0)[l]{\small  probability mass is here}}
%\put(15,-26){\makebox(0,0)[l]{\small  is here}}
\put(-2,-2){\makebox(0,0)[tr]{\small $\sqrt{N} \sigma$}}
\end{picture}
\caption[a]{Schematic representation of the typical
 set of an $N$-dimensional Gaussian distribution.}
}
 Assuming that $N$ is large,
 show that nearly all the probability of a Gaussian is contained in 
 a \ind{thin shell} of  radius $\sqrt{N} \sigma$. Find the thickness of the 
 shell.

 Evaluate the probability density
% in $\bx$ space 
 (\ref{first.gaussian}) at a point in 
 that thin shell and at the origin $\bx=0$ and compare.
 Use the case $N=1000$ as an example.

 Notice that nearly all the probability mass
% the bulk of the probability density
 is located in a
 different part of the space from the region of highest probability
 density.
%
}

%
% extra exercises that are appropriate once source compression has been
% discussed.
%
% contents:
%
% simple huffman question
% Phone chat using rings (originally in mockexam.tex, now in M.tex)
% Bridge bidding as communication (where?)
%
\fakesection{Compression exercises: bidding in bridge, etc}
%
\exercisaxA{2}{ex.source_code}{
%
 Explain what is meant by an {\em optimal binary symbol code\/}.

 Find an optimal binary symbol code for the ensemble:
\[
	\A = \{ {\tt{a}},{\tt{b}},{\tt{c}},{\tt{d}},{\tt{e}},{\tt{f}},{\tt{g}},{\tt{h}},{\tt{i}},{\tt{j}} \} ,
\]
\[
	\P = \left\{ \frac{1}{100} ,
\frac{2}{100} ,
\frac{4}{100} ,
\frac{5}{100} ,
\frac{6}{100} ,
\frac{8}{100} ,
\frac{9}{100} ,
\frac{10}{100} ,
\frac{25}{100} ,
\frac{30}{100} \right\} ,
\]
 and  compute the expected length of the code.
}
\exercisaxA{2}{ex.doublet.huffman}{
 A string $\by=x_1 x_2$ consists of {\em two\/} independent samples from an ensemble
\[
        X : {\cal A}_X = \{ {\tt{a}} , {\tt{b}} , {\tt{c}} \} ; {\cal P}_X = \left\{ \frac{1}{10} ,  \frac{3}{10} ,
                \frac{6}{10} \right\} .
\]
 What is the entropy of $\by$?
 Construct an optimal binary symbol code for the string $\by$, and find 
 its expected length.

}
\exercisaxA{2}{ex.ac_expected}{
% (Cambridge University Part III Maths examination, 1998.)
%
 Strings of $N$ independent samples from an ensemble 
 with $\P = \{ 0.1 , 0.9 \}$ are compressed using 
 an {arithmetic code} that is matched to that ensemble.
 Estimate the mean and standard deviation of 
 the  compressed strings' lengths
 for the case $N=1000$.
%
 [$H_2(0.1) \simeq 0.47$]
% ; $\log_2(9) \simeq 3$.]
% .47, 3.17
% my answer: 470 pm 30
}

% from M.tex, in which model solns are found too
\exercisaxA{3}{ex.phone_chat}{%(Cambridge University Part III Maths examination, 1998.)
 {\sf Source coding with variable-length symbols.}
%  -- Source coding / optimal use of channel}
\begin{quote}
 In the chapters on source coding, we assumed that 
 we were encoding into a binary alphabet $\{ {\tt0} , {\tt1} \}$ in which both symbols\index{source code!variable symbol durations} 
% had the same associated cost. Clearly a good compression algorithm 
%  uses both these symbols with equal frequency, and the capacity of 
% this alphabet is one bit per character.
 should be used with equal frequency.
 In this question we explore how the encoding alphabet should be 
 used
% what happens
 if the symbols take different times to transmit.
% have different costs.
% the 
\end{quote}
%
 A poverty-stricken \ind{student} communicates for free with a friend
 using a \ind{telephone} by selecting an integer
 $n \in \{ 1,2,3,\ldots \}$,
  making the friend's
 phone ring $n$ times, then hanging up in the middle of the $n$th ring.
 This process is repeated so that a string of symbols
 $n_1 n_2 n_3 \ldots$ is received. What is the optimal way to communicate?
 If  large integers $n$ are selected 
 then the message takes longer to communicate. If only 
 small integers $n$ are used then the information content per symbol is 
 small.
 We aim to maximize  the rate of information transfer, per unit time.


 Assume that the time taken to transmit
 a number of rings $n$ and to redial
%, including the space that separates them from the next sequence of rings
 is $l_n$ seconds. Consider a probability distribution over $n$, 
 $\{ p_n \}$.
 Defining the average duration {\em per symbol\/} to be 
\beq
        L(\bp) = \sum_n p_n l_n
\eeq
 and the entropy {\em per symbol\/} to be
\beq
	H(\bp) = \sum_n p_n \log_2 \frac{1}{p_n } ,
\eeq
 show that for the average information
 rate {\em per second\/} to be maximized, 
 the symbols must be used with probabilities
 of the form
\beq
      p_n = \frac{1}{Z} 2^{-\beta l_n}
\label{eq.phone.1}
\eeq
 where
% $\beta$ is a Lagrange multiplier
%and
 $Z = \sum_n 2^{-\beta l_n}$
 and $\beta$ satisfies the implicit equation
% \marginpar{[6]}
\beq
	\beta = \frac{H(\bp)}{L(\bp)} ,
\label{eq.phone.2}
\eeq
 that is, $\beta$ is the rate of communication. 
%is set so as to maximize
%\beq
% R(\beta) = - \beta - \frac{\log Z(\beta)}{L(\beta)}
%\eeq
% where $L(\beta)=\sum p_n l_n$.
% By differentiating $R(\beta)$, show that
% $\beta^*$ satisfies
 Show that these two equations 
 (\ref{eq.phone.1}, \ref{eq.phone.2}) imply that $\beta$ must be set
 such that  
\beq
	\log Z =0.
\label{eq.phone.3}
\eeq
% 
 Assuming that the channel has the property
% redialling takes the same time as one ring, so that
\beq
  l_n = n \: \mbox{seconds},
\label{eq.phone.4}
\eeq
 find the optimal distribution $\bp$ and show that 
 the maximal information rate is 1 bit per second.
% $\log xxxx$
% and that the mean number of rings 
% in a group is xxxx and that the information per
% ring is xxxx.
 
 How does this compare with the information rate per second achieved 
 if $\bp$ is set to 
 $(1/2,1/2,0,0,0,0,\ldots)$ --- that is,
 only the symbols $n=1$ and $n=2$ are selected, 
 and they have equal probability?

 Discuss the relationship between the results 
 (\ref{eq.phone.1}, \ref{eq.phone.3}) derived above, 
 and the Kraft inequality  from source coding theory.

 How might a random binary source
 be efficiently encoded into a sequence of symbols 
 $n_1 n_2 n_3 \ldots$ for transmission over the channel defined
 in \eqref{eq.phone.4}? 
}



\exercisaxB{1}{ex.shuffle}{How many bits
 does it take to shuffle a pack of cards?

% [In case this is not clear, here's the long-winded
% version: imagine using a random number generator
% to generate perfect shuffles of a deck of cards.
% What is the smallest number of random bits
% needed per shuffle?]
}


\exercisaxB{2}{ex.bridge}{In the card game\index{game!Bridge}
 Bridge,\index{Bridge}
 the four players receive 13 cards each from the deck of 52 and
 start each game by looking at their own hand
 and bidding. The legal bids are, in ascending order,
 $1 \clubsuit, 1 \diamondsuit, 1 \heartsuit, 1\spadesuit,$ $1NT,$
 $2 \clubsuit,$ $2 \diamondsuit,$ 
% 2 \heartsuit, 2\spadesuit, 2NT,
 $\ldots$
% 7 \clubsuit, 7 \diamondsuit, 
 $7 \heartsuit, 7\spadesuit, 7NT$,
 and successive bids must follow this order;
 a bid of, say, $2 \heartsuit$ may only be
 followed by higher bids such as $2\spadesuit$ or $3 \clubsuit$ or $7 NT$.
 (Let us neglect the `double' bid.)

% The outcome of the bidding process determines the subsequent
% game. 
 The players have several aims when bidding. One of the
 aims is for two partners to communicate to each other
 as much as possible about what cards are in their hands.
% There are many bidding systems whose aim is, among other things,
% to communicate this information.

 Let us concentrate on this task.
\begin{enumerate}
\item
 After the cards have been dealt,
 how many bits are needed for North to convey to South what
 her hand is?
\item
 Assuming that E and W do not bid at all, what
 is the maximum total information that N and S can convey to each
 other while bidding? Assume that N starts the bidding, and that
 once either N or S stops bidding, the bidding stops.
\end{enumerate}
}


\exercisaxB{2}{ex.microwave}{
 My old `\ind{arabic}' \ind{microwave oven}\index{human--machine interfaces}
 had 11 buttons for entering
 cooking times, and my new  `\ind{roman}' microwave has just five.
 The buttons of the roman microwave are labelled `10 minutes',
 `1 minute', `10 seconds',  `1 second', and `Start'; I'll abbreviate
 these five strings to the symbols {\tt M}, {\tt C}, {\tt X}, {\tt I}, $\Box$.
% The two keypads then look as follows.
% included by _e4.tex
\amarginfig{b}{%
\begin{center}
\begin{tabular}[t]{c}%%%%%%%%%% table containing microwave buttons
%\toprule
 Arabic \\ \midrule
% The keypad
\begin{tabular}[t]{*{3}{p{.1in}}} 
\framebox{1} & \framebox{2} & \framebox{3}  \\
\framebox{4} & \framebox{5} & \framebox{6}  \\
\framebox{7} & \framebox{8} & \framebox{9}  \\
 & \framebox{0} & \framebox{$\!\Box\!$}  \\
\end{tabular}
\\
%\bottomrule
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% end all micro table
\end{tabular}
\begin{tabular}[t]{c}%%%%%%%%%% table containing microwave buttons
%\toprule
 Roman \\ \midrule
% The keypad
\begin{tabular}[t]{*{3}{p{.1in}}}
\framebox{{\tt{M}}} & \framebox{{\tt{X}}} & \\
\framebox{{\tt{C}}} & \framebox{{\tt{I}}} & \framebox{$\!\Box\!$}  \\
\end{tabular}
\\
%\bottomrule
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% end all micro table
\end{tabular}\\
\mbox{$\:$}
\end{center}
\caption[a]{Alternative keypads for microwave ovens.}
}

 To enter one minute and twenty-three seconds (1:23),  the arabic sequence
 is
\beq
 {\tt{123}}\Box,
\eeq
 and the roman sequence is
\beq
 {\tt{CXXIII}}\Box .
\eeq
 Each of these keypads defines a code mapping the
 3599 cooking times from 0:01 to 59:59 into a string of symbols.

\ben
\item
 Which times can be produced with two or three symbols? (For example,
 0:20 can be produced by three symbols in either code:
 ${\tt{XX}}\Box$ and 
 ${\tt{20}}\Box$.)

\item
 Are the two codes complete?
 Give a detailed answer.
% Discuss all the ways in which these two codes are not complete.

\item
 For each code, name a cooking time
% couple of times
 that it can produce in
 four symbols that the other code cannot.

\item
 Discuss the implicit probability distributions over times to which
 each of these codes is best matched.

\item
 Concoct a plausible probability distribution over times
 that a real user might use, and evaluate roughly the expected number of
 symbols, and maximum number of symbols, that each code
 requires. Discuss the ways in which
 each code is inefficient or efficient.
\item
	Invent a more efficient cooking-time-encoding system for a microwave oven.
\een
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
}


%
\fakesection{Cinteger}
%\input{tex/_Cinteger}
\exercissxC{2}{ex.Cinteger}{
 Is the standard binary representation for positive
 integers (\eg\ $c_{\rm b}(5) = {\tt 101}$)
 a uniquely decodeable code?

 Design a binary code for the positive integers, 
 \ie, a mapping from $n \in \{ 1,2,3,\ldots \}$ to $c(n) \in
 \{{\tt 0},{\tt 1}\}^+$,
 that is uniquely decodeable. 
 Try to design codes that are prefix codes and that satisfy the 
 \Kraft\ equality $\sum_n 2^{-l_n} \eq 1$. 
%
% Not a typo.
%

\begin{aside}
  Motivations:  any data file terminated by a special
 end-of-file character can be mapped onto an integer, 
 so a prefix code for integers can be used as a self-delimiting 
 encoding of files too. Large files correspond to large integers.
 Also, one of the building blocks of a `universal' coding scheme --
 that is, a coding scheme that will work OK for a large variety 
 of sources --  is the ability to encode integers. Finally,
 in microwave ovens, cooking times are positive integers!
\end{aside}

 Discuss criteria by which one might compare alternative codes 
 for integers (or, equivalently, alternative self-delimiting codes for 
 files).
}

%
%

%
\section{Solutions}% to Chapter \protect\ref{ch4}'s exercises} 
%
% solns to exercises in l4.tex
%
\fakesection{solns to exercises in l4.tex}
\soln{ex.ac.terminate}{
 The worst-case situation is when the 
 interval to be represented lies just inside
 a  binary interval. In this case, we may choose either of 
 two binary intervals as shown in \figref{fig.ac.worst.case}. 
  These binary intervals are no 
 smaller than $P(\bx|\H)/4$, so the binary encoding has a length 
 no greater than $\log_2 1/ P(\bx|\H) + \log_2 4$, which is 
 two bits more than the ideal message length.
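
 In symbols, and assuming that the chosen binary interval has size $2^{-l}$,
 where $l$ (a label introduced here) is the number of bits transmitted,
 the argument reads
\beq
 2^{-l} \geq \frac{P(\bx|\H)}{4}
 \:\: \Rightarrow \:\:
 l \leq \log_2 \frac{1}{P(\bx|\H)} + 2 .
\eeq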
}

%
% HELP HELP HELP RESTORE ME!
% \input{tex/acvshuffman.tex}
%
%
\soln{ex.usebits}{
 The standard method uses 
 32 random bits per generated
 symbol and so requires $32\,000$ bits
 to generate one thousand samples.

% this is displaced down a bit.
\begin{figure}%[htbp]
\figuremargin{%
\begin{center}
% created by ac.p only_show_data=1 > ac/ac_data.tex 
\mbox{
\small
\setlength{\unitlength}{1.62in}
\begin{picture}(2,1.2)(0,0)
\thicklines
% desired interval on left 
\put(  0.0,  1.01){\makebox(0,0)[bl]{Source string's interval}}
\put(  0.5,  0.5){\makebox(0,0){$P(\bx|\H)$}}
\put(  0.0,  0.05){\line(1,0){ 1.0}}
\put(  0.0,  0.95){\line(1,0){ 1.0}}
%
% binary intervals 
\put(  1.0,  1.03){\makebox(0,0)[bl]{Binary intervals}}
\put(  1.0,   0.0){\line(1,0){ 1.0}}
\put(  1.0,   1.0){\line(1,0){ 1.0}}
%
\thinlines
%
\put(  0.5,  0.4){\vector(0,-1){0.35}}
\put(  0.5,  0.6){\vector(0,1){0.35}}
%
\put(  1.0,   0.5){\line(1,0){ 0.5}}
\put(  1.0,  0.25){\line(1,0){ 0.25}}
\put(  1.0,  0.75){\line(1,0){ 0.25}}
%
\put(  1.125,  0.625){\vector(0,1){0.125}}
\put(  1.125,  0.625){\vector(0,-1){0.125}}
\put(  1.125,  0.375){\vector(0,1){0.125}}
\put(  1.125,  0.375){\vector(0,-1){0.125}}
\end{picture}

}
\end{center}
}{%
\caption[a]{Termination of arithmetic coding in the worst case, where 
 there is a two bit overhead. Either of the two binary intervals marked on the 
 right-hand side may be chosen. These binary intervals are no 
 smaller than $P(\bx|\H)/4$.}
\label{fig.ac.worst.case}
}%
\end{figure}

 Arithmetic coding uses on average
 about  $H_2 (0.01)=0.081$ bits  per generated symbol, and so 
 requires about 83 bits to  generate one thousand samples
 (assuming an  overhead of  roughly two bits associated with termination).
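
 As a rough check of these figures, using the definition of the binary
 entropy function,
\beq
 H_2(0.01) = 0.01 \log_2 100 + 0.99 \log_2 \frac{1}{0.99}
 \simeq 0.0664 + 0.0144 = 0.0808 ,
\eeq
 so one thousand samples cost about $1000 \times 0.0808 \simeq 81$ bits,
 and the two-bit termination overhead brings the total to about 83 bits.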

 Fluctuations in the number of {\tt{1}}s would produce variations
 around this mean with standard deviation 21.
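
 This figure can be recovered, roughly, as follows.
 Let $r$ denote the number of {\tt{1}}s among the $N=1000$ samples, each
 having probability $f=0.01$ (the symbols $r$, $N$ and $f$ are labels introduced here).
 Each {\tt{1}} accounts for $\log_2 (1/f)$ bits and each {\tt{0}}
 for $\log_2 (1/(1-f))$ bits, so the total changes by
 $\log_2 ((1-f)/f)$ bits for every extra {\tt{1}}; and $r$ has a binomial
 distribution with standard deviation $\sqrt{N f (1-f)}$. The standard
 deviation of the number of bits is therefore about
\beq
 \sqrt{N f (1-f)} \: \log_2 \frac{1-f}{f}
 \simeq 3.15 \times 6.63 \simeq 21 .
\eeq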
 
}
% 57
%\soln{ex.Clengthen}{
% moved to cutsolutions.tex
\soln{ex.LZencode}{
 The encoding is {\tt010100110010110001100}, which comes from the 
 parsing  
\beq
\tt	0,  00,   000,  0000,  001, 00000, 000000 
\eeq
 which is encoded thus:
\beq
{\tt (,0),(1,0),(10,0),(11,0),(010,1),(100,0),(110,0) } .
\eeq
}
\soln{ex.LZdecode}{
The decoding is
\begin{center}
 {\tt 0100001000100010101000001}.
\end{center}
}
%\soln{ex.AC52}{
\soln{ex.AC52b}{
 This problem is equivalent  to \exerciseref{ex.AC52}. 

 The selection of  $K$ objects from $N$ objects requires 
 $\lceil \log_2 {N \choose K}\rceil$ bits $\simeq N H_2(K/N)$ bits.
%
 This selection could be made using arithmetic coding. The selection 
 corresponds to a binary string of length $N$ in which the {\tt{1}} bits represent
 which objects are selected. Initially the probability of a {\tt{1}} is 
 $K/N$ and the probability of a {\tt{0}} is $(N\!-\!K)/N$. Thereafter, given that 
 the emitted string thus far, of length $n$, contains $k$ {\tt{1}}s,
 the probability of a {\tt{1}} is 
 $(K\!-\!k)/(N\!-\!n)$ and the probability of a {\tt{0}} is $1 - (K\!-\!k)/(N\!-\!n)$.
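
 As an illustration (the particular string is chosen here arbitrarily),
 take $N=5$, $K=2$ and the string {\tt{01001}}. The rule above
 assigns it the probability
\beq
 \frac{3}{5} \times \frac{2}{4} \times \frac{2}{3} \times \frac{1}{2} \times \frac{1}{1}
 = \frac{1}{10} = {5 \choose 2}^{-1} ,
\eeq
 and the same value is obtained for each of the ten strings of weight 2,
 so the arithmetic code spends about $\log_2 10 \simeq 3.3$ bits on the selection.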

}
\soln{ex.LZcomplete}{
 This modified  Lempel--Ziv code is still not `complete', because, 
 for example, after five prefixes have been collected, 
 the pointer could be any of the strings $\tt000$, $\tt001$, $\tt010$,
 $\tt011$, $\tt100$, but 
 it cannot be $\tt101$, $\tt110$ or $\tt111$. Thus there are some binary strings 
 that cannot be produced as  encodings. 
}
\soln{ex.LZfail}{
 Sources with low entropy that are not well compressed  by Lempel--Ziv
 include:\index{Lempel--Ziv coding!criticisms}
\ben
\item
 Sources with some symbols that have
 long-range  correlations and intervening 
 random junk. An ideal model should capture what's correlated
 and compress it.  Lempel--Ziv can only compress the correlated features 
 by memorizing all cases of the intervening junk. 
 As a simple example, consider a
 \index{telephone number}\index{phone number}telephone book in which 
 every line contains an (old number, new number) pair:
\begin{center}
{\tt{285-3820:572-5892}}\teof\\
{\tt{258-8302:593-2010}}\teof\\
\end{center}
 The number of characters per line is 18, drawn from the 13-character 
 alphabet
 $\{ {\tt{0}},{\tt{1}},\ldots,{\tt{9}},{\tt{-}},{\tt{:}},\eof\}$. 
 The characters `{\tt{-}}',
 `{\tt{:}}' and `\teof' occur in a predictable sequence, so
 the true information content per line, assuming 
 all the phone numbers are seven digits long, and assuming 
 that they are random sequences, 
 is about 14 \dits. (A \dit\ is the information content
 of a random integer between 0 and 9.)
 A finite state language model could easily  capture 
 the regularities in these data. 
 A Lempel--Ziv algorithm will  take a long time before 
 it compresses such a file down to 14 \dits\ per line,
% by a factor of $14/18$, 
 however, because in order for it to `learn' that 
 the string {\tt{:}}$ddd$ is always followed by {\tt{-}}, 
 for any three digits $ddd$, it will have to {\em see\/}
 all those  strings. So near-optimal compression 
 will only be achieved after thousands of lines of the 
 file have been read.\medskip

% figs/wallpaper.ps made by pepper.p
\begin{figure}[htbp]
\fullwidthfigureright{%
%\figuremargin{%
\small
\begin{center}
\mbox{%(a)
\psfig{figure=figs/wallpaper.ps}}\\
%\mbox{(b) \psfig{figure=figs/wallpaperc.ps}}\\
%\mbox{(c) \psfig{figure=figs/wallpaperb.ps}}
\end{center}
}{%
\caption[a]{
 A source with low entropy that is not well compressed by Lempel--Ziv. 
 The bit sequence is read from left to right.
 Each line differs from the line above in $f=5$\% of its bits.
 The image width is 400 pixels.
%
% Three 
% sources with low entropy that are not well compressed by Lempel--Ziv. 
% The bit sequence is read from left to right. The image width is 400 pixels
% in each case.
%
% (a) Each line differs from the line above in $p=$5\% of its bits.
%
% (b)
% Each column $c$ has its own transition probability $p_c$ such that
% successive vertical bits are identical with probability $p_c$. The
% probabilities $p_c$ are drawn from a uniform  distribution over $[0,0.5]$.
%
% (c) As in b, but the probabilities $p_c$ are drawn from a uniform
% distribution over $[0,1]$.
}
% ; in columns with $p_c \simeq 1$, successive
% vertical bits are likely to be opposite to each other. }
%
\label{fig.pepper}
}%
\end{figure}
%
% this is beautiful but gratuitous
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%55
%\begin{figure}[htbp]
%\figuremargin{%
%\begin{center}
%\mbox{\psfig{figure=figs/automaton346.big1.ps,height=7in}}\\
%\end{center}
%}{%
%\caption[a]{A longer cellular automaton history.
%}
%\label{fig.automatonII}
%\end{figure}
\vspace*{-10pt}% included to undo the cumulation of item space and figure space.
\item
 Sources with  long-range  correlations, for example two-dimensional
 images that are represented by a sequence of pixels, row by row, 
 so that vertically adjacent pixels are a distance $w$
 apart in the source stream, where $w$ is the image width.
 Consider, for example, a fax transmission in which each line 
 is very similar to the previous line (\figref{fig.pepper}).
 The true entropy is only $H_2(f)$ per pixel, where $f$ 
 is the probability that a pixel differs from its parent. 
% except for a light peppering
% of noise.
% Each line is somewhat similar to the previous line but not identical, 
% so there is no previous occurrence of a long string
% to point to; some algorithms in the Lempel--Ziv class
% will achieve a certain degree of compression 
% by memorizing recent short strings, but the compression achieved 
% will not equal the true entropy. 
% and after a few lines, 
% the pattern has moved on by a random walk, so memorizing ancient patterns
% is of no use. 
 Lempel--Ziv algorithms will only  compress
 down to the entropy once {\em all\/} $2^w = 2^{400}$ strings of length $w = 400$
 have occurred and their successors have been memorized.
 There are only about $2^{300}$ particles in the universe, so we 
 can confidently say that
 Lempel--Ziv codes will {\em never\/} capture the redundancy
 of such an image. 

% figs/wallpaper.ps made by pepper.p
\begin{figure}[htbp]
%\figuremargin{%
\fullwidthfigureright{%
\begin{center}
%\mbox{(a) \psfig{figure=figs/wallpaperx.ps}}\\
\mbox{%(b)
\psfig{figure=figs/wallpaperx2.ps}}\\
%\mbox{(c) \psfig{figure=figs/automaton346.2.ps}}\\
% see also figs/automaton346.big1.pbm
\end{center}
}{%
\caption[a]{%A second source with low entropy that is not optimally compressed by Lempel--Ziv.
 A texture consisting of horizontal and vertical pins
 dropped at random on the plane.
% (c) The 100-step time-history of a cellular automaton with 400 cells.
}
\label{fig.wallpaper}
}%
\end{figure}
 Another highly redundant texture is shown in \figref{fig.wallpaper}.
 The image was made 
 by dropping horizontal and vertical pins randomly on the plane.
 It contains both long-range vertical correlations and long-range horizontal 
 correlations. There is no practical way that Lempel--Ziv, fed with a pixel-by-pixel scan 
 of this image, could capture both these correlations. 
% gzip on the pbm gives:     2374 wallpaperx.pbm.gz
% That is better than 50%.
% Saved as a gif, wallpaperx.pbm is 2926 characters. Original 40000 pixels would be 5000 characters.
% That is worse than 50% compression.
% cf. perl program, stripwallpaper.p
% is 
%         0         8       274 /home/mackay/bin/stripwallpaper.p.gz
%         0        16       631 wallpaperx.asc.gz
%         0        24       905 total <--------
%        18        65       368 /home/mackay/bin/stripwallpaper.p
%       162       484      1390 wallpaperx.asc
%       180       549      1758 total
% lossless jpg is terrible!:
%  38828  wallpaperx.jpg
% would be nice to try JBIG on this.

% It is worth emphasizing that b
 Biological computational systems
 can readily identify the redundancy in these images and in images 
 that are much more complex; thus we might anticipate that 
 the best data compression algorithms will result from the development
 of \ind{artificial intelligence} methods.\index{compression!future methods}
\item
 Sources with intricate redundancy, such as files generated 
 by computers. For example, a \LaTeX\ file 
 followed by its encoding into a PostScript file. The information content
 of this pair of files is roughly equal to the information content of the 
 \LaTeX\ file alone.
\item
 A picture of the Mandelbrot set. The picture has an information content
 equal to the number of bits required to specify the range of the 
 complex plane studied, the pixel sizes, 
 and the colouring rule used.
%  mapping of set membership to pixel colour. 
% \item
%  Encoded transmissions arising from an error-correcting code of rate $K/N$. 
%  These are very easily compressed by a factor 
%  $K/N$ if the generator operation is known. 
% see README2 in /home/mackay/_courses/comput/newising_mc
\item
 A picture of a  ground state of 
 a frustrated antiferromagnetic \ind{Ising model} (\figref{fig.ising.ground}), 
 which we will discuss
 in \chref{ch.ising}.
 Like \figref{fig.wallpaper}, this binary image has interesting 
 correlations in two directions. 
\begin{figure}[htbp]
\figuremargin{%
\begin{center}
\mbox{\bighisingsample{hexagon2}}
\end{center}
}{%
\caption[a]{Frustrated triangular
 Ising model in one of its ground states.}
\label{fig.ising.ground}
}%
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\item
 Cellular automata -- \figref{fig.wallpaperc} shows 
 the state history of 100 steps of a \ind{cellular automaton} 
 with 400 cells. The update rule, in which each cell's new state depends on the state of five
 preceding cells, was selected at random.  The information content is equal to the information 
 in the boundary (400 bits), and the propagation rule, which here can be described in 32 bits.
 An optimal compressor will thus give a compressed file length which 
 is essentially constant, independent of the vertical height of the image. 
 Lempel--Ziv would only give this zero-cost compression once the 
 cellular automaton has entered a periodic limit cycle, which 
 could easily take about  $2^{100}$ iterations.

 In contrast, the JBIG compression method, which models the probability of 
 a pixel given its local context and uses 
 arithmetic coding, would do a good job on these images.
%\item
% And finally, an example relating to error-correcting codes:
% the\index{error-correcting code!and compression}\index{difficulty of compression}\index{compression!difficulty of}
% received transmissions arising when encoded transmissions are 
% sent over a noisy channel. Such received strings have an entropy 
% equal to the source entropy plus the channel noise's
% entropy. If a \index{Lempel--Ziv coding|)}Lempel--Ziv
% algorithm could compress these strings, 
% this would be tantamount to solving the decoding problem for 
% the error-correcting code!
%
% We have not got to this topic yet, but we will see later that 
% the decoding of a general  error-correcting code is 
% a challenging intractable problem. 
% automaton.p
\begin{figure}%[htbp]
%\figuremargin{%
\fullwidthfigureright{%
\begin{center}
\mbox{%(c)
\psfig{figure=figs/automaton346.2.ps}}\\
% see also figs/automaton346.big1.pbm
\end{center}
}{%
\caption[a]{% Another source with low entropy that is not optimally compressed by Lempel--Ziv.
 The 100-step time-history of a cellular automaton with 400 cells.
}
\label{fig.wallpaperc}
}%
\end{figure}
\een
}
\index{source code!stream codes|)}\index{stream codes|)}
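
 As a rough check on the cellular-automaton example above (a back-of-envelope
 sketch, assuming that the update rule is a lookup table with one output bit for each of the
 $2^5$ possible states of five cells, and counting one bit per pixel of the raw image):
\beq
	400 + 2^5 = 432 \: \mbox{bits}
	\hspace{0.3in} \mbox{compared with} \hspace{0.3in}
	100 \times 400 = 40\,000 \: \mbox{bits}
\eeq
 for the uncompressed image. On these assumptions an ideal compressor would shrink
 this particular file by a factor of roughly 90, and the factor would grow in proportion to
 the number of time steps shown.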

\dvipsb{solutions  stream codes}
%
%
%\section{Solutions}% to Chapter \protect\ref{ch_f4}'s exercises}  
% \section{Solutions to section \protect\ref{ch_f4}'s exercises}  
\fakesection{RNGaussian}
\soln{ex.RNGaussian}{
 For a one-dimensional Gaussian, the 
 variance of $x$, $\Exp[x^2]$, is $\sigma^2$.
 So the mean value of $r^2$ in $N$ dimensions,
 since the components of $\bx$ are independent
 random variables, is
\beq
	\Exp[ r^2] = N \sigma^2 .
\eeq
 The variance of $r^2$, similarly, 
 is $N$ times the variance of $x^2$, 
 where $x$ is a one-dimensional Gaussian 
 variable. 
\beq
	\var (x^2 ) = \int \!  \d x \:
 \frac{1}{(2 \pi \sigma^2)^{1/2}} x^4 \exp \left( - \frac{x^2}{2 \sigma^2} \right) 
 - \sigma^4 .
\eeq 
 The integral is found to be  $3 \sigma^4$ (\eqref{eq.gaussian4thmoment}),
 so $\var(x^2) = 2 \sigma^4$.
 Thus the variance of $r^2$ is $2 N \sigma^4$. 

 For large $N$, the \ind{central-limit theorem}
% law of large numbers 
 indicates that 
 $r^2$ has a Gaussian distribution with mean $N \sigma^2$ and standard 
 deviation $\sqrt{2 N} \sigma^2$, so the probability density of $r$
 must similarly be concentrated about $r \simeq \sqrt{N} \sigma$.

 The thickness of this shell is given by  turning the standard deviation of 
 $r^2$ into a standard deviation on $r$: for small
 $\delta r/r$, 
 $\delta \log r = \delta r/r = (\dhalf) \delta \log r^2 = (\dhalf) \delta (r^2)/r^2$,
 so setting $\delta (r^2) = \sqrt{2 N} \sigma^2$, $r$ has standard deviation 
 $\delta r = (\dhalf) r \delta (r^2)/r^2$
% $=$ $(\dhalf) \sqrt{2 N} \sigma^2 / \sqrt{( N \sigma^2)}$ 
 $=\sigma/\sqrt{2}$.
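
 Spelling out the last step (a worked version of the substitution, using
 $r \simeq \sqrt{N} \sigma$ and hence $r^2 \simeq N \sigma^2$):
\beq
	\delta r = \frac{1}{2} \, r \, \frac{\delta (r^2)}{r^2}
	\simeq \frac{1}{2} \sqrt{N} \sigma \, \frac{\sqrt{2 N} \sigma^2}{N \sigma^2}
	= \frac{\sigma}{\sqrt{2}} .
\eeq
 As an illustration, with numbers chosen here purely for concreteness,
 $N=1000$ and $\sigma=1$ give a typical radius $r \simeq 31.6$ and
 a shell thickness $\delta r \simeq 0.71$; the thickness does not grow with $N$.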

 The probability density of the Gaussian at a point $\bx_{\rm shell}$ where 
 $r =  \sqrt{N} \sigma$ is 
\beq
	 P(\bx_{\rm shell}) = \frac{1}{(2 \pi \sigma^2)^{N/2}}
	 \exp \left( - \frac{N \sigma^2}{2 \sigma^2} \right) 
	= \frac{1}{(2 \pi \sigma^2)^{N/2}}
	 \exp \left( - \frac{N}{2}  \right) .
\eeq
 By contrast, the probability density at the origin is 
\beq
	 P(\bx\eq 0) = \frac{1}{(2 \pi \sigma^2)^{N/2}} .
\eeq
 Thus $P(\bx_{\rm shell})/P(\bx\eq 0) = \exp \left( - \linefrac{N}{2}  \right) .$
 The probability density at the typical radius is $e^{-N/2}$ times
 the density at the origin. If $N=1000$, then the probability 
 density at the origin is $e^{500}$ times greater. 
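
 To get a feel for this factor in decimal (a quick conversion, using
 $\log_{10} e \simeq 0.4343$):
\beq
	e^{500} = 10^{\, 500 \log_{10} e} \simeq 10^{217} .
\eeq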
%
}
%
%
% for _e4.tex
%
\fakesection{Source coding problems solutions}
%\soln{ex.forward-backward-language}{
%% (Draft.)
%%
% If we write down a language model for strings in forward-English,
% the same model defines a probability distribution over strings
% of backward English. The probability distributions have
% identical entropy, so the average information contents
% of  the reversed
% language and the forward language are equal.
%}

%\soln{ex.microwave}{
% moved to cutsolutions.tex
% removed to cutsolutions.tex
% \soln{ex.bridge}{(Draft.)


\dvipsb{solutions further data compression f4}
%\subchapter{Codes for integers \nonexaminable} 
\chapter{Codes for Integers \nonexaminable}
\label{ch.codesforintegers}
 This chapter is an aside, which may safely be skipped.
\section*{Solution to \protect\exerciseref{ex.Cinteger}}% was fiftythree
\label{sec.codes.for.integers}\label{ex.Cinteger.sol}% special by hand
%\soln{ex.Cinteger}{
%}
\fakesection{Cinteger  Solutions to problems}
%
% original integer stuff is in old/s_integer.tex
%
% chapter 2 , coding of integers

 To discuss the  coding of integers\index{source code!for integers}
 we need some definitions.\index{binary representations}
\begin{description}
\item[The standard binary representation of a positive
 integer] $n$ will be denoted by $c_{\rm b}(n)$,
 \eg, $c_{\rm b}(5) = {\tt 101}$, $c_{\rm b}(45) = {\tt 101101}$.
\item[The standard binary length of a positive
 integer] $n$,  $l_{\rm b}(n)$, is the length
 of the string $c_{\rm b}(n)$. 
 For example, $l_{\rm b}(5) = 3$, $l_{\rm b}(45) = 6$. 
\end{description}
 The standard binary representation  $c_{\rm b}(n)$
 is {\em not\/} a uniquely decodeable code for integers
 since there is no way of knowing when an integer has ended.
 For example, $c_{\rm b}(5)c_{\rm b}(5)$ is identical to $c_{\rm b}(45)$.
 It  would be uniquely decodeable if we knew the
 standard binary length of each integer
 before it was received. 

 Noticing that all positive integers have a standard binary representation
 that starts with a {\tt{1}}, we might define another representation:
\begin{description}
\item[The headless binary representation of a positive
 integer] $n$ will be denoted by $c_{\rm B}(n)$,
 \eg, $c_{\rm B}(5) = {\tt 01}$, $c_{\rm B}(45) = {\tt 01101}$
 and $c_{\rm B}(1) = \lambda$ (where $\lambda$ denotes the null 
 string).
\end{description}
 This representation would be uniquely decodeable if we knew the 
 length $l_{\rm b}(n)$ of the integer. 

 So, how can we make a uniquely decodeable  code for integers? 
 Two  strategies can be distinguished.
\ben
\item {\bf  Self-delimiting codes}.
 We first communicate somehow
% An alternative strategy is to make the code self-delimiting
 \index{symbol code!self-delimiting}\index{self-delimiting}the length of the integer, $l_{\rm b}(n)$,
 which is also a positive integer; then communicate the original
 integer $n$ itself using $c_{\rm B}(n)$. 
\item {\bf Codes with `end of file' characters}.
 We code the integer into  blocks of length
 $b$ bits, and reserve one of the $2^b$ symbols to 
 have the special meaning `end of file'. The coding 
 of integers into blocks is arranged so that 
 this reserved symbol is not needed for any other purpose.
\een

 The  simplest uniquely decodeable code for integers is the unary code, 
 which can be viewed as a code with an end of file character.
\begin{description}
\item[Unary code\puncspace]
 An integer $n$ is encoded by sending a string of $n\!-\!1$ {\tt 0}s
% zeroes
 followed  by a {\tt 1}. 
\[
\begin{array}{cl} \toprule
n & c_{\rm U}(n) \\ \midrule
1 & {\tt 1} \\
2 & {\tt 01} \\
3 & {\tt 001} \\
4 & {\tt 0001} \\
5 & {\tt 00001} \\
\vdots & \\
45 & {\tt 000000000000000000000000000000000000000000001} \\  \bottomrule
\end{array}
\]
 The unary code has length $l_{\rm U}(n) = n$.

 The unary code is the optimal code for integers if the probability 
 distribution over $n$ is $p_{\rm U}(n) = 2^{-{n}}$. 
\end{description}
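
 One way to check this claim (a short verification, using the standard fact that
 a code with lengths $l(n)$ is optimal for the distribution $p(n) = 2^{-l(n)}$
 when those probabilities sum to one): the unary code has $l_{\rm U}(n) = n$, and
\beq
	\sum_{n=1}^{\infty} 2^{-l_{\rm U}(n)} = \sum_{n=1}^{\infty} 2^{-n} = 1 ,
\eeq
 so the code satisfies the Kraft inequality with equality, and each codeword
 length equals the information content $\log_2 1/p_{\rm U}(n) = n$ bits.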

\subsubsection*{Self-delimiting codes}
 We can use the unary code to encode the {\em length\/} of the binary 
 encoding of $n$ and make a self-delimiting code:
\begin{description}
\item[Code $C_\alpha$\puncspace]
% The length of the standard binary representation is a positive integer
 We send the unary code for  $l_{\rm b}(n)$, followed 
 by the headless binary representation of $n$. 
\beq
        c_{\alpha}(n) = c_{\rm U}[ l_{\rm b}(n) ] c_{\rm B}(n) .
\eeq
 Table \ref{tab.calpha} shows the codes for some integers. The overlining
 indicates the division of each string into the parts $c_{\rm U}[ l_{\rm b}(n) ]$
 and $c_{\rm B}(n)$.
\margintab{\footnotesize
\[
\begin{array}{clll} \toprule
n & c_{\rm b}(n) & \makebox[0in][c]{$l_{\rm b}(n)$} & c_{\alpha}(n)
% = c_{\rm U}[ l_{\rm b}(n) ] c_{\rm B}(n)
 \\ \midrule
1 & {\tt 1  } & 1 & {\tt {\overline{1}}} \\
2 & {\tt 10 } & 2 & {\tt {\overline{01}}0} \\
3 & {\tt 11 } & 2 & {\tt {\overline{01}}1} \\
4 & {\tt 100} & 3 & {\tt {\overline{001}}00} \\
5 & {\tt 101} & 3 & {\tt {\overline{001}}01} \\
6 & {\tt 110} & 3 & {\tt {\overline{001}}10} \\
\vdots & \\
45 & {\tt 101101} & 6 & {\tt  {\overline{000001}}01101} \\ \bottomrule
\end{array}
\]
\caption[a]{$C_\alpha$.}
\label{tab.calpha}
}
 We might equivalently view $c_{\alpha}(n)$ as consisting of a string 
 of $(l_{\rm b}(n)-1)$ zeroes followed by the standard binary representation 
 of $n$, $c_{\rm b}(n)$. 

 The  codeword $c_{\alpha}(n)$  has length $l_{\alpha}(n) = 2 l_{\rm b}(n) - 1$.

 The implicit probability distribution over $n$ for the code
  $C_{\alpha}$ is separable
 into the product of a probability distribution over the  length $l$, 
\beq
        P(l) = 2^{-l} ,
\eeq 
 and a uniform distribution over integers having that length, 
\beq
        P(n\given l) = \left\{ \begin{array}{cl} 2^{-l+1} & l_{\rm b}(n) = l \\
                0 & \mbox{otherwise}.
\end{array} \right. 
\eeq
\end{description}
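
 Multiplying the two factors gives the implicit probability of each integer
 directly. As a check (worked here for illustration), writing $l = l_{\rm b}(n)$,
\beq
	P(n) = P(l) P(n \given l) = 2^{-l} \, 2^{-l+1} = 2^{-(2 l_{\rm b}(n) - 1)}
	= 2^{-l_{\alpha}(n)} ,
\eeq
 in agreement with the codeword length. For example, $n=45$ has
 $l_{\rm b}(45) = 6$, so $l_{\alpha}(45) = 11$ and $P(45) = 2^{-11}$.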

 Now, for the above code, the header that communicates
 the length always occupies the same number 
 of bits as the standard binary representation of the integer (give or take
 one).  If we are expecting to encounter large integers (large files)
 then this representation seems suboptimal, since it leads to 
 all files occupying a size that is double their original 
 uncoded size.  Instead of using the unary 
 code to encode the length $l_{\rm b}(n)$, we could use $C_{\alpha}$.%
% see graveyard for original
\margintab{{\footnotesize
\[
\begin{array}{cll} \toprule
n & c_{\beta}(n) & c_{\gamma}(n)
 \\ \midrule
1 & {\tt{\overline{1}}}    & {\tt{\overline{1}}}      \\
2 & {\tt{\overline{010}}0} & {\tt{\overline{0100}}0}  \\
3 & {\tt{\overline{010}}1} & {\tt{\overline{0100}}1}  \\
4 & {\tt{\overline{011}}00}& {\tt{\overline{0101}}00} \\
5 & {\tt{\overline{011}}01}& {\tt{\overline{0101}}01} \\
6 & {\tt{\overline{011}}10}& {\tt{\overline{0101}}10} \\
\vdots & \\
45 & {\tt{\overline{00110}}01101} &  {\tt{\overline{01110}}01101} \\ \bottomrule
\end{array}
\]
}
\caption[a]{$C_\beta$ and $C_{\gamma}$.}
\label{tab.cbeta}
}


\begin{description}
\item[Code $C_\beta$\puncspace]
% The length of the standard binary representation is a positive integer
  We send  the length $l_{\rm b}(n)$ using $C_{\alpha}$, followed 
 by the headless binary representation of $n$. 
\beq
        c_{\beta}(n) = c_{\alpha}[ l_{\rm b}(n) ] c_{\rm B}(n) .
\eeq
\end{description}
 Iterating this procedure,  we can define a sequence of codes.
\begin{description}
\item[Code $C_{\gamma}$\puncspace]
\beq
        c_{\gamma}(n) = c_{\beta}[ l_{\rm b}(n) ] c_{\rm B}(n) .
\eeq
% see graveyard for gamma table
\item[Code $C_\delta$\puncspace]
\beq
        c_{\delta}(n) = c_{\gamma}[ l_{\rm b}(n) ] c_{\rm B}(n) .
\eeq
\end{description}
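
 The codeword lengths follow from the definitions (a worked sketch; the
 symbols $l_{\beta}$ and $l_{\gamma}$ are introduced here by analogy with $l_{\alpha}$).
 Since $c_{\rm B}(n)$ has length $l_{\rm b}(n) - 1$,
\beqa
	l_{\beta}(n) &=& l_{\alpha}[ l_{\rm b}(n) ] + l_{\rm b}(n) - 1
	\:\: = \:\: l_{\rm b}(n) + 2 \, l_{\rm b}[ l_{\rm b}(n) ] - 2 , \\
	l_{\gamma}(n) &=& l_{\beta}[ l_{\rm b}(n) ] + l_{\rm b}(n) - 1 .
\eeqa
 For $n=45$, $l_{\rm b}(45) = 6$ and $l_{\rm b}(6) = 3$, so
 $l_{\beta}(45) = 6 + 6 - 2 = 10$, compared with $l_{\alpha}(45) = 11$.
 The header $c_{\beta}(6) = {\tt 01110}$ has length 5, so
 $l_{\gamma}(45) = 5 + 5 = 10$ also. The saving over $C_{\alpha}$ becomes
 appreciable only for much larger integers.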

\subsection{Codes with end-of-file symbols}
 We can also make byte-based representations. 
(Let's use the term byte
 flexibly here, to denote any fixed-length string of bits, not
 just a string of length 8 bits.)
 If we encode the number in some base, for example decimal, then 
 we can represent each digit in a byte. In order to represent
a digit from 0 to 9 in a byte we need four bits. 
 Because $2^4 = 16$, this leaves 6 extra four-bit symbols, 
 $\{${\tt 1010}, {\tt 1011}, {\tt 1100}, {\tt 1101}, {\tt 1110},
 {\tt 1111}$\}$, 
 that correspond to no decimal digit. We can use these 
 as end-of-file symbols to indicate the end of our positive 
 integer.
% Such a  code can also  code the integer zero, for which 
% we have not been providing a code  up till now. 

 Clearly it is redundant to have more than one end-of-file
 symbol, so a more efficient code would encode the integer
 into base 15, and use just the sixteenth symbol, {\tt 1111}, 
 as  the punctuation character. 
 Generalizing this idea, we can  make similar byte-based
 codes for integers  in bases 3 and 7, and in any base of 
 the form $2^b-1$, where $b$ is the number of bits per byte.
\margintab{\small
\[
\begin{array}{cll} \toprule
n & c_3(n) &  c_{7}(n)
% = c_{\rm U}[ l_{\rm b}(n) ] c_{\rm B}(n)
 \\ \midrule
1 & {\tt 01\, 11 }	&  {\tt 001\, 111} \\
2 & {\tt 10\, 11 }      &  {\tt 010\, 111} \\
3 & {\tt 01\, 00\, 11 }	&  {\tt 011\, 111} \\
\vdots & \\
45 & {\tt 01\, 10\, 00\, 00\, 11} &  {\tt 110\, 011\, 111} \\ \bottomrule
\end{array}
\]
\caption[a]{Two codes with end-of-file symbols,
 $C_3$ and $C_7$. Spaces have been included to show the
 byte boundaries.
}
}
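
 To see where the entries for $n=45$ come from (a worked conversion, with
 each digit written as a fixed-length block and followed by the all-ones
 end-of-file block):
\beqa
	45 &=& 1 \times 3^3 + 2 \times 3^2 + 0 \times 3 + 0
	\:\: \rightarrow \:\: {\tt 01\, 10\, 00\, 00\, 11} , \\
	45 &=& 6 \times 7 + 3
	\:\: \rightarrow \:\: {\tt 110\, 011\, 111} .
\eeqa
 So $c_3(45)$ occupies 10 bits and $c_7(45)$ occupies 9 bits.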

 These  codes are  almost complete. (Recall that 
 a code is  `complete' if it satisfies the
 Kraft inequality with equality.)  The codes' 
  remaining inefficiency is that  they  provide the 
 ability to encode the integer zero and the empty string,
 neither of which was required. 


\exercissxB{2}{ex.intEOF}{
 Consider the implicit probability distribution over integers
 corresponding to the code with an end-of-file character. 
\ben
\item
 If the code has eight-bit blocks (\ie, the integer is 
 coded  in base 255), what is the   mean length in bits 
 of the integer, under the implicit distribution?
\item
 If one wishes to encode binary files of expected size about one hundred
 \kilobytes\ using a code with an end-of-file character, what is the optimal 
 block size?
\een
}

\subsection*{Encoding a tiny file} 
% see claude.p in itp/tex
 To illustrate the codes we have discussed, we  now use each
 code to encode
 a small file consisting of just 14 characters,
\[
\framebox{\tt{Claude Shannon}}.
\]
\bit
\item
 If we map the ASCII characters onto seven-bit 
 symbols (\eg, in decimal, 
 ${\tt C}=67$, ${\tt l}=108$, etc.), this 14 character file corresponds to the 
 integer 
\[
        n = 167\,987\,786\,364\,950\,891\,085\,602\,469\,870 \:\:\mbox{(decimal)}.
\]
\item
 The unary code for $n$ consists of this many (less one) zeroes, 
 followed by a one. If all the oceans were turned into ink, and if we 
 wrote a hundred bits with every cubic millimeter,
% or microlitre
 there
% would be roughly
 might be enough ink to write $c_{\rm U}(n)$. 

\item
 The standard binary representation of $n$ is this length-98  sequence of bits:
\beqa
 c_{\rm b}(n) &=& \begin{array}[t]{l}
	\tt 1000011110110011000011110101110010011001010100000 \\
 \tt                 1010011110100011000011101110110111011011111101110.
            \end{array}
\eeqa
% To store this self-delimiting  file 
% on a disc, we would need 
\eit
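
 The integer is obtained by reading the fourteen seven-bit symbols as the
 digits of a number in base $2^7 = 128$, most significant first (this is how
 the bit string above is laid out; the notation $a_i$ for the $i$th ASCII
 code is introduced here for illustration):
\beq
	n = \sum_{i=1}^{14} a_i \, 128^{14-i}
	= 67 \times 128^{13} + 108 \times 128^{12} + \cdots ,
\eeq
 with $a_1 = 67 = {\tt 1000011}$ and $a_2 = 108 = {\tt 1101100}$ forming the
 first fourteen bits of $c_{\rm b}(n)$. Because the code for the leading
 symbol {\tt C} begins with a {\tt 1}, the length is
 $l_{\rm b}(n) = 14 \times 7 = 98$.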
\exercisaxB{2}{ex.claudeshannonn}{
 Write down or describe the following
  self-delimiting representations of the above number  $n$:
 $c_{\alpha}(n)$, 
 $c_{\beta}(n)$, 
 $c_{\gamma}(n)$, 
 $c_{\delta}(n)$, 
 $c_{3}(n)$, 
 $c_{7}(n)$, and
 $c_{15}(n)$.
 Which of these encodings is the shortest? [{\sf{Answer:}} $c_{15}$.]
}
%
% solution moved to cutsolutions.tex
% 

\subsection{Comparing the codes}
 One could answer the question `which of two codes is
 superior?'  by a sentence of the form `For $n>k$, code 1 is
 superior, for $n<k$, code 2 is superior'.

 Secondly, the depiction in terms of Venn diagrams 
 encourages one to believe that all the areas correspond to 
 positive quantities.  In the special case of two random variables
 it is indeed true that $H(X \given Y)$, $\I(X;Y)$ and $H(Y \given X)$ are positive 
 quantities. But as soon as we progress to three-variable ensembles,
 we obtain a diagram with positive-looking  areas that
 may actually  correspond to negative quantities. \Figref{fig.venn3}
 correctly shows  relationships such as 
\beq
	H(X) + H(Z \given X) + H(Y \given X,Z) = H(X,Y,Z) .
\eeq
 But it gives the misleading impression that 
 the conditional mutual information  $\I(X;Y \given Z)$ is 
 {\em less than\/} the mutual information 
 $\I(X;Y)$.  
\begin{figure}
\figuremargin{%3/4
\begin{center}
\mbox{\psfig{figure=figs/venn3.ps,angle=-90,width=5.25in}}
\end{center}
}{%
\caption[a]{A misleading representation of entropies, continued.}
\label{fig.venn3}
}%
\end{figure}
 In fact the area labelled $A$ can correspond to a {\em negative\/}
 quantity. Consider the joint ensemble 
 $(X,Y,Z)$ in which $x \in \{0,1\}$ and $y \in \{0,1\}$ 
 are independent binary variables  and  $z \in \{0,1\}$ is defined 
 to be $z=x+y \mod 2$.
 Then clearly $H(X) = H(Y) = 1$ bit. Also $H(Z) = 1$ bit. 
 And $H(Y \given X) = H(Y) = 1$ since the two variables are independent.
 So the mutual information between $X$ and $Y$ is zero:
 $\I(X;Y) = 0$. However, if $z$ is observed, $X$ and $Y$ become dependent ---
% correlated --- 
 knowing $x$, given $z$, tells you what $y$ is:  $y = z - x \mod 2$. 
 So $\I(X;Y \given Z) = 1$ bit. Thus the area labelled $A$ must correspond 
 to $-1$ bits for the figure to give the correct answers.
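
 Written out (a worked check of the numbers just quoted, assuming that the
 area $A$ is the region common to all three sets, so that
 $\I(X;Y) = A + \I(X;Y \given Z)$):
\beqa
	\I(X;Y \given Z) &=& H(X \given Z) - H(X \given Y,Z)
	\:\: = \:\: 1 - 0 \:\: = \:\: 1 \: \mbox{bit} , \\
	A &=& \I(X;Y) - \I(X;Y \given Z)
	\:\: = \:\: 0 - 1 \:\: = \:\: -1 \: \mbox{bit} .
\eeqa
 Here $H(X \given Z) = 1$ bit because $x$ and $z$ are independent, and
 $H(X \given Y,Z) = 0$ because $x = z - y \mod 2$ is determined by $y$ and $z$.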

 The above example is not at all a capricious or exceptional
 illustration.
 The binary symmetric channel with input $X$, noise $Y$, and output $Z$
%  The classic\index{earthquake and burglar alarm}\index{burglar alarm and earthquake}
% earthquake-burglar-alarm ensemble \exercisebref{ex.burglar}\
%% (section ???), 
% with 
% earthquake $= X$, 
% burglar $ = Y$ and alarm $= Z$, 
% is a perfect example of a
 is a 
 situation in which $\I(X;Y)=0$ (input and noise are independent)
%  uncorrelated
 but $\I(X;Y \given Z) > 0$ (once you see the output, the unknown input and the unknown noise
 are intimately related!).

 The Venn diagram representation is therefore valid only if one is aware
 that positive areas may represent negative quantities.
 With this proviso
% As long as this possibility is
 kept in mind, the 
 interpretation of entropies in terms 
 of sets can be helpful  \cite{Yeung1991}.
% The quantity corresponding to $A$ is denoted $I(X;Y;Z)$
% by \citeasnoun{Yeung1991}.
}

\soln{ex.dataprocineq}{% BORDERLINE
%{\bf New answer:}
 For any joint ensemble $XYZ$, the following chain rule
 for mutual information holds. 
\beq
	\I(X;Y,Z) = \I(X;Y) + \I(X;Z \given Y)  .
\eeq
 Now, in the case $w \rightarrow d \rightarrow r$,
 $w$ and $r$ are independent given $d$, so
 $\I(W;R \given D) = 0$. Using the chain rule twice, we have:
\beq
 \I(W;D,R) = \I(W;D) 
\eeq
 and
\beq
 \I(W;D,R) = \I(W;R)  + \I(W;D \given R)  ,
\eeq
 so, since $\I(W;D \given R) \geq 0$,
\beq
 \I(W;R) - \I(W;D) \leq 0 .
\eeq
% for more solutions to this problem see
% Igraveyard.tex
}

\prechapter{About       Chapter}
\fakesection{prerequisites for chapter 5}
 Before reading  \chref{ch.five}, you should have read \chapterref{ch.one}
 and worked 
 on
  \exerciseref{ex.rel.ent}, and 
 \exerciserefrange{ex.Hcondnal}{ex.zxymod2}.
% \exfifteen--\exeighteen,
% \extwenty--\extwentyone, and \extwentythree.
% uvw to HXY>0
%  {ex.Hmutualineq}{ex.joint}, 
% \exerciserefrangeshort{ex.rel.ent}

% load of H() and I() stuff shoved in here now.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\ENDprechapter
\chapter{Communication over a Noisy Channel}
\label{ch.five}
% % l5.tex  
%
% useful program: bin/capacity.p for checking channel
% capacities
%
% % \part{Noisy Channel Coding}
% \chapter{Communication over a noisy channel}
% % The noisy-channel coding theorem, part a}
% % \chapter{The noisy channel coding theorem, part a}
\label{ch5}
\section{The big picture}
%
\setlength{\unitlength}{1mm}
\begin{realcenter}
%\begin{floatingfigure}[l]{3.2in}
\begin{picture}(85,50)(-40,5)
\thinlines
\put(0,5){\framebox(25,10){\begin{tabular}{c}Noisy\\ channel\end{tabular}}}
\put(-20,20){\framebox(25,10){\begin{tabular}{c}Encoder\end{tabular}}}
\put(20,20){\framebox(25,10){\begin{tabular}{c}Decoder\end{tabular}}}
\put(-20,40){\framebox(25,10){\begin{tabular}{c}Compressor\end{tabular}}}
\put(20,40){\framebox(25,10){\begin{tabular}{c}Decompressor\end{tabular}}}
\put(-40,40){\makebox(15,10){\begin{tabular}{c}{\sc Source}\\{\sc coding}\end{tabular}}}
\put(-40,20){\makebox(15,10){\begin{tabular}{c}{\sc Channel}\\{\sc coding}\end{tabular}}}
\put(-20,55){\makebox(25,10){Source}}
%
\put(-7.5,18){\line(0,-1){8}}
\put(-7.5,10){\vector(1,0){6}}
\put(32.5,10){\vector(0,1){8}}
\put(32.5,10){\line(-1,0){6}}
%
\put(32.5,31){\vector(0,1){8}}
\put(32.5,51){\vector(0,1){6}}
\put(-7.5,39){\vector(0,-1){8}}
\put(-7.5,57){\vector(0,-1){6}}
\end{picture}
\end{realcenter}
%
 In\index{channel!noisy} Chapters \ref{ch2}--\ref{ch4},
 we  discussed source coding with block
 codes, symbol codes and stream codes. We implicitly assumed that 
 the channel from the compressor to the decompressor 
 was noise-free. Real channels are noisy. We will now spend two
 chapters on the subject of noisy-channel coding -- the fundamental
 possibilities and limitations of error-free \ind{communication} through a
 noisy channel. The aim of channel coding
 is to make the noisy channel behave like a noiseless channel.
 We will assume that the data to be transmitted
 has been through a good compressor, so the bit stream has no
 obvious redundancy. The channel code, which makes the transmission,
 will put\index{redundancy!in channel code}
 back
% into the transmission
 redundancy of a special sort, designed
 to make the noisy received signal decodeable.\index{decoder}

 Suppose we  transmit 1000 bits  per second\index{channel!binary symmetric}
  with $p_0 = p_1 = \dhalf$
 over a noisy channel that flips bits with probability
 $f = 0.1$. What is the rate of 
 transmission of information?
% shannon p.35
 We might guess that the rate is 900 bits per second by subtracting
 the expected number of errors per second. But this is not correct, because 
 the recipient does not know where the errors occurred. 
 Consider the case where the noise is so great that 
 the received symbols are independent of the 
 transmitted symbols. This corresponds to a noise level of $f=0.5$, 
 since half of the received symbols are correct due to chance alone. 
 But when $f=0.5$, no information is transmitted at all. 
% ? cut this clearly?
\label{sec.ch5.intro}
% refer to exercise {ex.zxymod2}.

 Given what we have learnt about entropy, it seems reasonable that 
 a measure of the information transmitted is given by the \ind{mutual
 information} between the source and the received signal, that is, the 
 entropy of the source minus the \ind{conditional entropy}
 of the source given the received signal.
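
 For the example above this measure can be evaluated directly (a preview,
 writing $X$ for the transmitted bit and $Y$ for the received bit, assuming
 equiprobable inputs, and using the binary entropy function
 $H_2(f) = f \log_2 (1/f) + (1-f) \log_2 (1/(1-f))$).
 The source entropy is $H(X) = 1$ bit per symbol, and by the symmetry of
 the channel the conditional entropy is $H(X \given Y) = H_2(f)$, so
\beq
	\I(X;Y) = H(X) - H(X \given Y) = 1 - H_2(0.1) \simeq 1 - 0.47 = 0.53
	\: \mbox{bits per symbol} ,
\eeq
 that is, about 530 bits per second rather than 900.
 When $f=0.5$, $H_2(0.5) = 1$ and the mutual information is zero,
 in agreement with the argument above.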
%
% shannon calls the conditional entropy the equivocation
% and points out that the equivocation is the amount of extra 
% information needed for a correcting device to figure out 
% what is going on

 We will now review the definition of conditional entropy 
 and mutual information. Then we will examine
% progress to the question of 
 whether it is possible to use such a noisy channel to communicate 
 {\em reliably}. 
 We will 
% Our aim here is to
 show that for any channel $Q$ whose output is not independent of its input there is a non-zero rate, 
 the \inds{capacity}\index{channel!capacity} 
 $C(Q)$, up to which information can be sent with arbitrarily 
 small probability of error. 

\section{Review of probability  and information} 
% 	conditional, joint and mutual information}
%  We now build on
 As an example, we take the joint distribution $XY$ from  
 \extwentyone.
%
%  A useful picture breaks down the total information content $H(X,Y)$ 
%  of a joint ensemble thus: 
% \begin{center}
% \setlength{\unitlength}{1in}
% \begin{picture}(3,1.13)(0,-0.2)
% \put(0,0.7){\framebox(3,0.20){$H(X,Y)$}}
% \put(0,0.4){\framebox(2.2,0.20){$H(X)$}}
% \put(1.5,0.1){\framebox(1.5,0.20){$H(Y)$}}
% \put(1.5125,-0.2){\framebox(0.675,0.20){$\I(X;Y)$}}
% \put(0,-0.2){\framebox(1.475,0.20){$H(X \given Y)$}}
% \put(2.225,-0.2){\framebox(0.775,0.20){$H(Y \specialgiven X)$}}
% \end{picture}
% \end{center}
%
% \subsection{Example of a joint ensemble}
% A joint ensemble $XY$ has the following joint  distribution. 
 The 
 marginal distributions $P(x)$ and $P(y)$ are shown in the 
 margins.
% $P(x,y)$:
\[
\begin{array}{cc|cccc|c}
\multicolumn{2}{c}{P(x,y)} & \multicolumn{4}{|c|}{x} & P(y) \\[0.051in]
 &     & 1   & 2   & 3   & 4  &    \\[0.011in]
\hline
\strutf
   &1    & \dfrac{1}{8} & \dfrac{1}{16} & \dfrac{1}{32} & \dfrac{1}{32} & \dfrac{1}{4} \\[0.01in]
\raisebox{0mm}{\mbox{$y$}}
   &2    & \dfrac{1}{16} & \dfrac{1}{8} & \dfrac{1}{32} & \dfrac{1}{32} & \dfrac{1}{4} \\[0.01in]
   &3    & \dfrac{1}{16} & \dfrac{1}{16} & \dfrac{1}{16} & \dfrac{1}{16} & \dfrac{1}{4} \\[0.01in]
   &4    & \dfrac{1}{4}  &  0   & 0    & 0    & \dfrac{1}{4} \\[0.01in]
\hline
\multicolumn{2}{c|}{P(x)}
  	 & \strutf\dfrac{1}{2} & \dfrac{1}{4} & \dfrac{1}{8} & \dfrac{1}{8} &    \\[0.051in]
\end{array}
\]
 The joint entropy is $H(X,Y)=27/8$ bits.
 The marginal entropies are $H(X) = 7/4$ bits and $H(Y) = 2$ bits.

 We can compute the conditional distribution of $x$ for each value of $y$, 
 and the entropy of each of those conditional distributions: 
\[
\begin{array}{cc|cccc|c}
\multicolumn{2}{c|}{P(x \given y)} & \multicolumn{4}{c|}{x} & H(X \given y) / \mbox{bits} \\[0.051in]
 &     & 1   & 2   & 3   & 4  &  \\[0.011in]
\hline
\strutf
   &1    & \dfrac{1}{2} & \dfrac{1}{4} & \dfrac{1}{8} & \dfrac{1}{8} & \dfrac{7}{4}    \\[0.01in]
\raisebox{0mm}{\mbox{$y$}}
   &2    & \dfrac{1}{4} & \dfrac{1}{2} & \dfrac{1}{8} & \dfrac{1}{8} & \dfrac{7}{4}   \\[0.01in]
   &3    & \dfrac{1}{4} & \dfrac{1}{4} & \dfrac{1}{4} & \dfrac{1}{4} & 2  \\[0.01in]
   &4    & 1  &  0   & 0    & 0  & 0  \\[0.01in]
\hline
\multicolumn{3}{c}{\strutf
 } & \multicolumn{4}{r}{H(X \given Y) = \dfrac{11}{8}}   \\[0.1in]
\end{array}
\]
 Note that whereas $H(X \given y\eq 4) = 0$ is less than $H(X)$, $H(X \given y\eq 3)$ is greater 
 than $H(X)$.
% _s5A.tex has a solution link already \label{ex.Hcondnal.sol}
% \label{ex.joint.sol}
 So in some cases, learning $y$ can 
% make us more uncertain 
 {\em increase\/} our uncertainty
 about $x$. Note also that although $P(x \given y\eq 2)$ 
 is a different distribution from $P(x)$, the conditional entropy $H(X \given y\eq 2)$
 is equal to $H(X)$. So learning that $y$ is 2 changes our knowledge 
 about $x$ but does not reduce the uncertainty
 of  $x$, as measured by the entropy. On average though,
 learning $y$ does convey information 
 about $x$, since $H(X \given Y) < H(X)$. 
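
 The value $H(X \given Y) = 11/8$ bits in the table is the average of the
 row entropies, weighted by $P(y)$; written out as a check:
\beq
	H(X \given Y) = \sum_{y} P(y) H(X \given y)
	= \frac{1}{4} \left( \frac{7}{4} + \frac{7}{4} + 2 + 0 \right)
	= \frac{11}{8} \: \mbox{bits} ,
\eeq
 which is indeed less than $H(X) = 7/4$ bits.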

 One may also evaluate $H(Y \specialgiven X) = 13/8$ bits. 
 The mutual information is 
 $\I(X;Y) = H(X) - H(X \given Y) = 3/8$ bits. 
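
 These values can be cross-checked against the chain rule for entropy and
 the symmetry of the mutual information (a consistency check using only the
 numbers quoted above):
\beqa
	H(Y \given X) &=& H(X,Y) - H(X)
	\:\: = \:\: \frac{27}{8} - \frac{7}{4} \:\: = \:\: \frac{13}{8} \: \mbox{bits} , \\
	\I(X;Y) &=& H(Y) - H(Y \given X)
	\:\: = \:\: 2 - \frac{13}{8} \:\: = \:\: \frac{3}{8} \: \mbox{bits} .
\eeqa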

%  INCLUDE ME LATER %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%  INCLUDE ME LATER %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%  INCLUDE ME LATER %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% \subsection{Solutions to a few other exercises}
% \input{tex/entropy_soln.tex}
%
%  INCLUDE ME LATER %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%  INCLUDE ME LATER %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%  INCLUDE ME LATER %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%

%\mynewpage MNBV
\section{Noisy channels}
\begin{description}
\item[A discrete memoryless channel $Q$] is\index{channel!discrete memoryless}
 characterized by 
 an input alphabet $\A_X$, an output alphabet $\A_Y$, 
 and a set of conditional probability distributions $P(y \given x)$, one 
 for each $x \in \A_X$. 

 These {\dbf{transition probabilities}} may be written in a matrix\index{transition probability matrix}
\beq
	Q_{j|i} = P(y\eq b_j \given x\eq a_i) .
\eeq
\begin{aside}
 I\index{notation!conventions of this book}\index{notation!matrices}\index{notation!vectors}\index{conventions!matrices}\index{conventions!vectors}\index{notation!transition probability}
 usually orient this matrix  with the output variable
 $j$ indexing the rows and the input variable $i$
 indexing the columns, so that each column of $\bQ$ is a probability
 vector. With this convention, we can obtain the probability
 of the output, $\bp_Y$, from a probability distribution over the input,
 $\bp_X$, by right-multiplication:
\beq
 \bp_Y = \bQ \bp_X .
\eeq
% 
\end{aside}
% 
\end{description}
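
 As an illustration of this convention (a sketch that treats the conditional
 distributions $P(y \given x) = P(x,y)/P(x)$ of the joint ensemble in the
 previous section as if they defined a channel), the matrix and the input
 distribution are
\beq
	\bQ = \left[ \begin{array}{cccc}
	1/4 & 1/4 & 1/4 & 1/4 \\
	1/8 & 1/2 & 1/4 & 1/4 \\
	1/8 & 1/4 & 1/2 & 1/2 \\
	1/2 & 0   & 0   & 0
	\end{array} \right] ,
	\hspace{0.3in}
	\bp_X = \left[ \begin{array}{c} 1/2 \\ 1/4 \\ 1/8 \\ 1/8 \end{array} \right] ,
\eeq
 in which each column of $\bQ$ sums to one. Multiplying out $\bQ \bp_X$
 gives a vector $\bp_Y$ with every component equal to $1/4$, which matches
 the marginal distribution $P(y)$ in that table.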

\noindent
 Some useful model channels are:
\begin{description}
% bsc
\item[Binary symmetric channel\puncspace]
 \indexs{channel!binary symmetric}\indexs{binary symmetric channel}
 $\A_X \eq  \{{\tt 0},{\tt 1}\}$. $\A_Y \eq  \{{\tt 0},{\tt 1}\}$.
\[
\begin{array}{c}
\setlength{\unitlength}{0.46mm}
\begin{picture}(30,20)(-5,0)
\put(-4,9){{\makebox(0,0)[r]{$x$}}}
\put(5,2){\vector(1,0){10}}
\put(5,16){\vector(1,0){10}}
\put(5,4){\vector(1,1){10}}
\put(5,14){\vector(1,-1){10}}
\put(4,2){\makebox(0,0)[r]{1}}
\put(4,16){\makebox(0,0)[r]{0}}
\put(16,2){\makebox(0,0)[l]{1}}
\put(16,16){\makebox(0,0)[l]{0}}
\put(24,9){{\makebox(0,0)[l]{$y$}}}
\end{picture}
\end{array}
%
\hspace{1in}
\begin{array}{ccl}
	P(y\eq {\tt 0} \given x\eq {\tt 0}) &=& 1 - \q ; \\ 	P(y\eq {\tt 1} \given x\eq {\tt 0}) &=& \q ;
\end{array}
\begin{array}{ccl}
	P(y\eq {\tt 0} \given x\eq {\tt 1}) &=&  \q ; \\ P(y\eq {\tt 1} \given x\eq {\tt 1}) &=& 1 - \q .
\end{array} 
\hspace{1in}
\begin{array}{c}
\ecfig{bsc15.1}
\end{array}
\]
%
% \BEC bec BEC
%
\item[Binary erasure channel\puncspace] \indexs{channel!binary erasure}\indexs{binary erasure channel}
 $\A_X \eq  \{{\tt 0},{\tt 1}\}$. $\A_Y \eq  \{{\tt 0},\mbox{\tt ?},{\tt 1}\}$.
\[
\begin{array}{c}
\setlength{\unitlength}{0.46mm}
\begin{picture}(30,30)(-5,0)
\put(-4,15){{\makebox(0,0)[r]{$x$}}}
\put(5,5){\vector(1,0){10}}
\put(5,25){\vector(1,0){10}}
\put(5,5){\vector(1,1){10}}
\put(5,25){\vector(1,-1){10}}
\put(4,5){\makebox(0,0)[r]{\tt 1}}
\put(4,25){\makebox(0,0)[r]{\tt 0}}
\put(16,5){\makebox(0,0)[l]{\tt 1}}
\put(16,25){\makebox(0,0)[l]{\tt 0}}
\put(16,15){\makebox(0,0)[l]{\tt ?}}
\put(24,15){{\makebox(0,0)[l]{$y$}}}
\end{picture}
\end{array}
%
\hspace{1in}
\begin{array}{ccl}
 P(y\eq {\tt 0} \given x\eq {\tt 0}) &=& 1 - \q ; \\
 P(y\eq \mbox{\tt ?} \given x\eq {\tt 0}) &=& \q ;   \\
 P(y\eq {\tt 1} \given x\eq {\tt 0}) &=& 0 ; 
\end{array} 
\begin{array}{ccl}
 P(y\eq {\tt 0} \given x\eq {\tt 1}) &=& 0 ; \\
 P(y\eq \mbox{\tt ?} \given x\eq {\tt 1}) &=& \q ; \\
 P(y\eq {\tt 1} \given x\eq {\tt 1}) &=& 1 - \q .
\end{array} 
\hspace{1in}
\begin{array}{c}
\ecfig{bec.1}
\end{array}
\]
\item[Noisy typewriter\puncspace]
 \indexs{channel!noisy typewriter}\indexs{noisy typewriter}
 $\A_X = \A_Y = \mbox{the 27 letters $\{${\tt A}, 
 {\tt B}, \ldots, {\tt Z}, {\tt -}$\}$}$. 
 The letters are arranged in a circle, and 
 when the typist attempts to type {\tt B}, what comes out is 
 either {\tt A}, {\tt B} or {\tt C}, with probability \dfrac{1}{3} each; 
 when the input is {\tt C}, the output is {\tt B}, {\tt C} or {\tt D}; 
 and so forth, with the final letter  `{\tt -}'
% being
 adjacent to the 
 first letter  {\tt A}. 
\[
\begin{array}{c}
\setlength{\unitlength}{1pt}
\begin{picture}(48,130)(0,2)
\thinlines
\put(5,5){\vector(3,0){30}}
\put(5,25){\vector(3,0){30}}
\put(5,15){\vector(3,0){30}}
\put(5,5){\vector(3,1){30}}
\put(5,25){\vector(3,-1){30}}
\put(4,5){\makebox(0,0)[r]{{\tt -}}}
\put(4,15){\makebox(0,0)[r]{{\tt Z}}}
\put(4,25){\makebox(0,0)[r]{{\tt Y}}}
\put(36,5){\makebox(0,0)[l]{{\tt -}}}
\put(36,15){\makebox(0,0)[l]{{\tt Z}}}
\put(36,25){\makebox(0,0)[l]{{\tt Y}}}
%
\put(5,15){\vector(3,1){30}}
\put(5,15){\vector(3,-1){30}}
\put(5,25){\vector(3,0){30}}
\put(5,25){\vector(3,1){30}}
\put(20,43){\makebox(0,0){$\vdots$}}
%
%\put(5,35){\vector(3,0){30}}
%\put(5,35){\vector(3,1){30}}
\put(5,35){\vector(3,-1){30}}
%\put(5,45){\vector(3,0){30}}
\put(5,45){\vector(3,1){30}}
%\put(5,45){\vector(3,-1){30}}
\put(5,55){\vector(3,0){30}}
\put(5,55){\vector(3,1){30}}
\put(5,55){\vector(3,-1){30}}
\thicklines
\put(5,65){\vector(3,0){30}}
\put(5,65){\vector(3,1){30}}
\put(5,65){\vector(3,-1){30}}
\thinlines
\put(5,75){\vector(3,0){30}}
\put(5,75){\vector(3,1){30}}
\put(5,75){\vector(3,-1){30}}
\put(5,85){\vector(3,0){30}}
\put(5,85){\vector(3,1){30}}
\put(5,85){\vector(3,-1){30}}
\put(5,95){\vector(3,0){30}}
\put(5,95){\vector(3,1){30}}
\put(5,95){\vector(3,-1){30}}
\put(5,105){\vector(3,0){30}}
\put(5,105){\vector(3,1){30}}
\put(5,105){\vector(3,-1){30}}
\put(5,115){\vector(3,0){30}}
\put(5,115){\vector(3,1){30}}
\put(5,115){\vector(3,-1){30}}
\put(5,125){\vector(3,0){30}}
\put(5,125){\vector(3,-1){30}}
\put(5,5){\vector(1,4){30}}
\put(5,125){\vector(1,-4){30}}
%\put(4,35){\makebox(0,0)[r]{{\tt J}}}
%\put(36,35){\makebox(0,0)[l]{{\tt J}}}
%\put(4,45){\makebox(0,0)[r]{{\tt I}}}
%\put(36,45){\makebox(0,0)[l]{{\tt I}}}
\put(4,55){\makebox(0,0)[r]{{\tt H}}}
\put(36,55){\makebox(0,0)[l]{{\tt H}}}
\put(4,65){\makebox(0,0)[r]{{\tt G}}}
\put(36,65){\makebox(0,0)[l]{{\tt G}}}
\put(4,75){\makebox(0,0)[r]{{\tt F}}}
\put(36,75){\makebox(0,0)[l]{{\tt F}}}
\put(4,85){\makebox(0,0)[r]{{\tt E}}}
\put(36,85){\makebox(0,0)[l]{{\tt E}}}
\put(4,95){\makebox(0,0)[r]{{\tt D}}}
\put(36,95){\makebox(0,0)[l]{{\tt D}}}
\put(4,105){\makebox(0,0)[r]{{\tt C}}}
\put(36,105){\makebox(0,0)[l]{{\tt C}}}
\put(4,115){\makebox(0,0)[r]{{\tt B}}}
\put(36,115){\makebox(0,0)[l]{{\tt B}}}
\put(4,125){\makebox(0,0)[r]{{\tt A}}}
\put(36,125){\makebox(0,0)[l]{{\tt A}}}
\end{picture}
\end{array}

\hspace{1in}
\begin{array}{ccl} & \vdots &  \\
 P(y\eq {\tt F} \given x\eq {\tt G}) &=& 1/3 ; \\
 P(y\eq {\tt G} \given x\eq {\tt G}) &=& 1/3 ;   \\
 P(y\eq {\tt H} \given x\eq {\tt G}) &=& 1/3 ; \\
 & \vdots & 
\end{array}
\hspace{1.2in}
\begin{array}{c}
\ecfig{type}
\end{array}
\]
\item[Z channel\puncspace]
 \indexs{channel!Z channel}\indexs{Z channel}
 $\A_X \eq  \{{\tt 0},{\tt 1}\}$. $\A_Y \eq  \{{\tt 0},{\tt 1}\}$.
\[
% \begin{array}{c}
% \setlength{\unitlength}{0.46mm}
% \begin{picture}(20,20)(0,0)
% \put(5,5){\vector(1,0){10}}
% \put(5,15){\vector(1,0){10}}
% \put(5,5){\vector(1,1){10}}
% \put(4,5){\makebox(0,0)[r]{1}}
% \put(4,15){\makebox(0,0)[r]{0}}
% \put(16,5){\makebox(0,0)[l]{1}}
% \put(16,15){\makebox(0,0)[l]{0}}
% \end{picture}
% \end{array}
\begin{array}{c}
\setlength{\unitlength}{0.46mm}
\begin{picture}(30,20)(-5,0)
\put(-4,9){{\makebox(0,0)[r]{$x$}}}
\put(5,2){\vector(1,0){10}}
\put(5,16){\vector(1,0){10}}
\put(5,4){\vector(1,1){10}}
% \put(5,14){\vector(1,-1){10}}
\put(4,2){\makebox(0,0)[r]{1}}
\put(4,16){\makebox(0,0)[r]{0}}
\put(16,2){\makebox(0,0)[l]{1}}
\put(16,16){\makebox(0,0)[l]{0}}
\put(24,9){{\makebox(0,0)[l]{$y$}}}
\end{picture}
\end{array}

\hspace{1in}
\begin{array}{ccl}
 P(y\eq {\tt 0} \given x\eq {\tt 0}) &=& 1        ; \\
 P(y\eq {\tt 1} \given x\eq {\tt 0}) &=& 0 ; \\
\end{array} 
\begin{array}{ccl}
 P(y\eq {\tt 0} \given x\eq {\tt 1}) &=& \q ; \\
 P(y\eq {\tt 1} \given x\eq {\tt 1}) &=& 1- \q    .\\
\end{array} 
\hspace{1in}
%\:\:\:\:\:\:
\begin{array}{c}
\ecfig{z15.1}
\end{array}
\]
% {\em Check if this orientation of the channel disagrees
% with any demonstrations.}
\end{description}
\section{Inferring the input given the output}
% was a subsection
%  a single transmission}
 If we assume that the input $x$ to a channel
 comes from an ensemble $X$, then 
 we obtain a joint ensemble $XY$ in which the random variables $x$ and $y$
 have the joint distribution:
\beq
	P(x,y) = P(y \given x) P(x)  .
\eeq
 Now if we receive 
 a particular symbol $y$, what was the input symbol $x$? 
 We typically won't know for certain. We can write down the posterior 
 distribution of the input using \Bayes\  theorem:\index{Bayes' theorem} 
\beq
	P(x \given y) = \frac{ P(y \given x) P(x)  }{P(y) } 
	= \frac{ P(y \given x) P(x)  }{\sum_{x'}  P(y \given x') P(x')  } .
\eeq
\exampla{
%{\sf Example 1:}
 Consider a \index{channel!binary symmetric}\ind{binary symmetric channel}
 with probability of 
 error $\q\eq 0.15$. Let the input ensemble be $\P_X: \{p_0 \eq  0.9, p_1 \eq  0.1\}$.
 Assume we observe $y\eq 1$. 
\beqan
	P(x\eq 1 \given y\eq 1) &=&\frac{ P(y\eq 1 \given x\eq 1) P(x\eq 1) }{\sum_{x'}  P(y \given x') P(x') } \nonumber \\
	&\eq & \frac{ 0.85 \times 0.1 }{  0.85 \times 0.1 +  0.15 \times 0.9 }  \nonumber \\
	&=& \frac{ 0.085 }{ 0.22 } \:\:=\:\: 0.39 .
\eeqan
 Thus `$x\eq 1$' is still less probable than `$x\eq 0$', although it is not
 as improbable as it was before.  
}
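 As a check on the arithmetic, the denominator in this example is the
 marginal probability of the output, obtained by summing
 $P(y \given x) P(x)$ over both inputs:
\beqa
	P(y\eq 1) &=& P(y\eq 1 \given x\eq 1) P(x\eq 1) + P(y\eq 1 \given x\eq 0) P(x\eq 0) \\
	&=& 0.85 \times 0.1 + 0.15 \times 0.9 \:\:=\:\: 0.085 + 0.135 \:\:=\:\: 0.22 ,
\eeqa
 so $P(y\eq 0) = 1 - 0.22 = 0.78$.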
% Could turn this into an exercise.
%  Alternatively, assume we observe $y\eq 0$.
% \beqa
% 	P(x\eq 1 \given y\eq 0) &=&  \frac{ P(y\eq 0 \given x\eq 1) P(x\eq 1)}{\sum_{x'}  P(y \given x') P(x')} \\
% 	&=& \frac{ 0.15 \times 0.1 }{  0.15 \times 0.1 +  0.85 \times 0.9 } \\
% 	&=& \frac{ 0.015 }{0.78} = 0.019 .
% \eeqa
\exercissxA{1}{ex.bscy0}{
 Now assume we observe $y\eq 0$.
 Compute the probability of $x\eq 1$ given $y\eq 0$. 
}
\exampla{
%{\sf Example 2:} 
 Consider a \ind{Z channel}\index{channel!Z channel} with probability of 
 error $\q\eq 0.15$. Let the input ensemble be $\P_X: \{p_0 \eq  0.9, p_1 \eq  0.1\}$.
 Assume we observe $y\eq 1$. 
\beqan
	P(x\eq 1 \given y\eq 1) 
	&=& \frac{ 0.85 \times 0.1 }{  0.85 \times 0.1 +  0 \times 0.9 } 
\nonumber \\
	&=& \frac{ 0.085}{0.085} \:\:=\:\: 1.0 .
\eeqan
 So given the output $y\eq 1$ we become certain of the input. 
}
%  Alternatively, assume we observe $y\eq 0$. 
% \beqa
% 	P(x\eq 1 \given y\eq 0) 
% % &=&  \frac{ P(y\eq 0 \given x\eq 1) P(x\eq 1)}{\sum_{x'}  P(y \given x') P(x')} \\
% 	&=& \frac{ 0.15 \times 0.1 }{  0.15 \times 0.1 +  1.0 \times 0.9 } \\
% 	&=& \frac{ 0.015}{ 0.915} =  0.016 .
% \eeqa
\exercissxA{1}{ex.zcy0}{
 Alternatively, assume we observe $y\eq 0$. Compute $P(x\eq 1 \given y\eq 0)$.
}

\section{Information conveyed by a channel}
 We now consider how much information can be communicated through 
 a channel. In {\em operational\/} terms, we are interested in finding 
 ways of using the channel such that all the bits that are communicated
 are recovered with negligible probability of error. 
 In {\em mathematical\/} terms, 
 assuming a particular input ensemble $X$, we can measure how
 much information the output conveys about the input by the mutual
 information:
\beq
	\I(X;Y) \equiv H(X) - H(X \given Y) = H(Y) - H(Y \specialgiven X) .
\eeq
 Our aim is to establish the connection between these two ideas. 
 Let us evaluate $\I(X;Y)$ for some of the channels above. 

\subsection{Hint for computing mutual information}
 \index{hint for computing mutual information}\index{mutual information!how to compute}We
 will tend to think of $\I(X;Y)$ as $H(X) - H(X \given Y)$, \ie, how much 
 the uncertainty of the input $X$ is reduced when we look at the output
 $Y$. But for computational
 purposes it is often handy to evaluate $H(Y) - H(Y \specialgiven X)$ instead. 
%\medskip
% this reproduced from _p5A.tex, figure 9.1 {fig.entropy.breakdown}
\begin{figure}[htbp]
\figuremargin{%
\begin{center}
%
% included by l1.tex
%
\setlength{\unitlength}{1in}
\begin{picture}(3,1.13)(0,-0.2)
\put(0,0.7){\framebox(3,0.20){$H(X,Y)$}}
\put(0,0.4){\framebox(2.2,0.20){$H(X)$}}
\put(1.5,0.1){\framebox(1.5,0.20){$H(Y)$}}
\put(1.5125,-0.2){\framebox(0.675,0.20){$\I(X;Y)$}}
\put(0,-0.2){\framebox(1.475,0.20){$H(X\,|\,Y)$}}
\put(2.225,-0.2){\framebox(0.775,0.20){$H(Y|X)$}}
\end{picture}

\end{center}
}{%
\caption[a]{The relationship between joint entropy, 
 marginal entropy, conditional entropy and mutual information.

 This figure is important, so I'm showing it twice.}
\label{fig.entropy.breakdown.again}
}%
\end{figure}
%\begin{center}
%\input{tex/entropyfig.tex}
%\end{center}

%\noindent
\exampla{
%{\sf Example 1:}
 Consider the
 \index{channel!binary symmetric}\index{binary symmetric channel}\BSC\
 again, with $\q\eq 0.15$ and
 $\P_X: \{p_0 \eq 0.9, p_1 \eq 0.1\}$.  We already evaluated the
 marginal probabilities $P(y)$ implicitly above: $P(y\eq 0) = 0.78$;
 $P(y\eq 1) = 0.22$.  The mutual information is: 
\beqa 
	\I(X;Y) &=& H(Y) - H(Y \specialgiven X) .
\eeqa
 What is  $H(Y \specialgiven X)$?
 It is defined to be the weighted sum over $x$ of $H(Y \given x)$; but
 $H(Y \given x)$ is the same for each value of $x$:
 $H(Y \given x\eq{\tt{0}})$ is $H_2(0.15)$,
 and  $H(Y \given x\eq{\tt{1}})$ is $H_2(0.15)$. So
\beqan
	\I(X;Y) &=& H(Y) - H(Y \specialgiven X) \nonumber \\
 &=& H_2(0.22) - H_2(0.15) \nonumber \\
 & =& 0.76 - 0.61 \:\: = \:\: 0.15  \mbox{ bits}.
\eeqan
% this used to be in error (0.15)
 This may be contrasted with the entropy of the source $H(X) = H_2(0.1) =
 0.47$ bits.

 Note: here we have  used the binary entropy function  $H_2(p)
 \equiv H(p,1\!-\!p)=p \log \frac{1}{p}
  + (1-p)\log \frac{1}{(1-p)}$.\marginpar{\small\raggedright{Throughout this book, $\log$ means $\log_2$.}}
}
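 The two numerical values used in this example can be reproduced directly
 from the definition of $H_2$ (logarithms to base 2, intermediate values
 rounded):
\beqa
	H_2(0.15) &=& 0.15 \log \frac{1}{0.15} + 0.85 \log \frac{1}{0.85}
	\:\:\simeq\:\: 0.15 \times 2.74 + 0.85 \times 0.234 \:\:\simeq\:\: 0.61 , \\
	H_2(0.22) &=& 0.22 \log \frac{1}{0.22} + 0.78 \log \frac{1}{0.78}
	\:\:\simeq\:\: 0.22 \times 2.18 + 0.78 \times 0.358 \:\:\simeq\:\: 0.76 .
\eeqa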
%\medskip

% \noindent
\exampla{
% {\sf Example~2:}
 And now the \ind{Z channel}\index{channel!Z channel}, with $\P_X$ as above. 
% $P(y\eq 0)\eq  0.915; 
 $P(y\eq 1)\eq 0.085$. 
\beqan
 \I(X;Y) &=& H(Y) - H(Y \specialgiven X) \nonumber \\
 &=& H_2(0.085)  - [ 0.9 H_2(0) + 0.1 H_2(0.15) ] \nonumber  \\
 &=& 0.42 - ( 0.1 \times 0.61 )
 = 0.36 \mbox{ bits}. 
\eeqan
 The entropy of the source, as above, is $H(X) = 0.47$ bits. Notice 
 that the mutual information $\I(X;Y)$ for the Z channel is bigger than 
 the mutual information  for the binary symmetric channel with the 
 same $\q$. The  Z channel is a more reliable
 channel.
% is fits with our intuition that the
}
\exercissxA{1}{ex.bscMI}{Compute the mutual information between $X$ and $Y$
 for the \BSC\  with $\q\eq 0.15$  when the input 
 distribution is $\P_X = \{p_0 \eq 0.5, p_1 \eq 0.5\}$. 
}
\exercissxA{2}{ex.zcMI}{Compute the mutual information between $X$ and $Y$
 for the Z channel with $\q=0.15$ when the input 
 distribution is $\P_X: \{p_0 \eq 0.5, p_1 \eq 0.5\}$. 
}

\subsection{Maximizing the mutual information}
 We have observed in the above examples  that 
 the mutual information between the input and 
 the output depends on the chosen
 {input ensemble}\index{channel!input ensemble}.

 Let us assume that we wish to maximize the mutual information 
 conveyed by the channel by choosing 
 the best possible input ensemble. 
 We define the {\dbf\inds{capacity}\/} of the
 channel\index{channel!capacity}
 to be its maximum  \ind{mutual information}. 
\begin{description}
\item[The capacity] of a channel $Q$ is:
\beq
	C(Q) = \max_{\P_X} \, \I(X;Y) .
\eeq
 The distribution $\P_X$ that achieves the maximum is called the
 {\dem{\optens}},\indexs{optimal input distribution}
 denoted by $\P_X^*$. [There may be multiple
 {\optens}s achieving the same value of $\I(X;Y)$.]
\end{description}
%
 In \chref{ch6} we will
 show that the capacity does indeed measure the maximum amount 
 of error-free information that can be transmitted
% is transmittable % yes, spell checked
 over the channel per  unit time.
% \medskip

% Sun 22/8/04 am having problems trying to get fig 9.2 to go at head
% of p 151 - putting it there causes text to move.
%\noindent
\exampla{
%{\sf Example 1:}
 Consider the \BSC\  with $\q \eq  0.15$. Above, we considered
 $\P_X = \{p_0 \eq  0.9, p_1 \eq  0.1\}$, and found
 $\I(X;Y) = 0.15$ bits.
% the page likes to break here
 How much better can we do? By symmetry, 
 the \optens\ is
 $\{ 0.5, 0.5\}$ and%
\amarginfig{t}{
\mbox{%
%\begin{figure}[htbp]
\small
%\floatingmargin{%
%\figuremargin{%
\raisebox{0.91in}{$\I(X;Y)$}%
\hspace{-0.42in}%
\begin{tabular}{c}
\mbox{\psfig{figure=figs/IXY.15.ps,%
width=45mm,angle=-90}}\\[-0.1in]
$p_1$
\end{tabular}
}
%}{%
\caption[a]{The mutual information $\I(X;Y)$ for a binary symmetric
 channel with $\q=0.15$
 as a function of the input distribution.
% (\eqref{eq.IXYBSC}).
}
\label{fig.IXYBSC}
}
%%% 
 the capacity is
\beq
 C(Q_{\rm BSC}) \:=\: H_2(0.5) - H_2(0.15) \:=\: 1.0 - 0.61 \:=\: 0.39
 \ubits.
\eeq
 We'll justify the \ind{symmetry argument}\index{capacity!symmetry argument}
 later.
 If there's any doubt about the
% such a
 symmetry argument,
 we can always resort to explicit maximization of
 the \ind{mutual information} $\I(X;Y)$,
\beq
	\I(X;Y) = H_2( (1\!-\!\q)p_1 + (1\!-\!p_1)\q ) - H_2(\q) \ \  \mbox{  (\figref{fig.IXYBSC}). }
\label{eq.IXYBSC}
\eeq
}
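 The explicit maximization can be sketched as follows. Write
 $p_y \equiv (1-\q)p_1 + (1-p_1)\q$ for $P(y\eq 1)$ (a shorthand introduced
 only for this sketch), and use the fact that the derivative of $H_2(p)$
 with respect to $p$ is $\log \frac{1-p}{p}$. Then
\beq
	\frac{\partial \I(X;Y)}{\partial p_1}
	= (1 - 2\q) \log \frac{1-p_y}{p_y} ,
\eeq
 which, for $\q \neq 0.5$, vanishes only when $p_y = 0.5$, that is, when
 $p_1 = 0.5$, in agreement with the symmetry argument. Since $H_2$ is concave
 and $p_y$ is a linear function of $p_1$, this stationary point is a maximum.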
% \medskip

% \noindent
% {\sf Example 2:}
\exampl{exa.typewriter}{
 The noisy typewriter. 
 The \optens\ is a uniform distribution over $x$, and gives 
 $C = \log_2 9$ bits. 
}
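 To see where this value comes from: with a uniform distribution over the
 27 inputs, each of the 27 outputs is equally probable, so
 $H(Y) = \log_2 27$; and whatever the input, the output is one of three
 equiprobable letters, so $H(Y \specialgiven X) = \log_2 3$. Hence
\beq
	\I(X;Y) = H(Y) - H(Y \specialgiven X) = \log_2 27 - \log_2 3 = \log_2 9
	\simeq 3.17 \mbox{ bits}.
\eeq
 No other input distribution can do better, because $H(Y) \leq \log_2 27$
 always, and $H(Y \specialgiven X) = \log_2 3$ for every input distribution.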

% \medskip
% \noindent
\exampl{exa.Z.HXY}{
% {\sf Example 3:}
 Consider the \ind{Z channel} with $\q \eq  0.15$.
 Identifying the \optens\ is not so straightforward. We 
 evaluate $\I(X;Y)$ explicitly for 
 $\P_X = \{p_0, p_1\}$. First, we need to compute $P(y)$. The probability
 of $y\eq 1$ is easiest to write down:
\beq
	P(y\eq 1) \:\:=\:\: p_1 (1-\q) .
\eeq
 Then%
\amarginfig{t}{
%\begin{figure}[htbp]
\mbox{%
\small
%\floatingmargin{%
%\figuremargin{%
\raisebox{0.91in}{$\I(X;Y)$}%
\hspace{-0.42in}%
\begin{tabular}{c}
\mbox{\psfig{figure=figs/HXY.ps,%
width=45mm,angle=-90}}\\[-0.1in]
$p_1$
\end{tabular}
}
%}{%
\caption{The mutual information $\I(X;Y)$ for a Z
 channel with $\q=0.15$
 as a function of the input distribution.}
\label{hxyz}
}
%\end{figure}
%%%%%%%%%%%%% old:
%\begin{figure}[htbp]
%\small
%\begin{center}
%\raisebox{1.3in}{$\I(X;Y)$}%
%\hspace{-0.2in}%
%\begin{tabular}{c}
%\mbox{\psfig{figure=figs/HXY.ps,%
%width=60mm,angle=-90}}\\
%$p_1$
%\end{tabular}
%\end{center}
%\caption[a]{The mutual information $\I(X;Y)$ for a Z channel with $\q=0.15$
% as a function of the input distribution.}
%% (Horizontal axis $=p_1$.)}
%\label{hxyz.old}
%\end{figure}
 the mutual information is:
\beqan
	\I(X;Y) &=& H(Y) - H(Y \specialgiven X) \nonumber  \\
 &=& H_2(p_1 (1-\q))  - ( p_0 H_2(0) + p_1 H_2(\q) ) \nonumber  \\
 &=& H_2(p_1 (1-\q))  -  p_1 H_2(\q) .
\eeqan
 This is a non-trivial function of $p_1$, shown in \figref{hxyz}. 
 It is maximized for $\q=0.15$ by 
% the \optens\ 
 $p_1^* = 0.445$.
 We find $C(Q_{\rm Z}) = 0.685$ bits.  Notice
% that
 the \optens\ is not
 $\{ 0.5,0.5 \}$.  We can communicate slightly more information
 by using input symbol {\tt{0}} more frequently than {\tt{1}}.
}
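 As a numerical check of the quoted capacity, at $p_1 = 0.445$ we have
 $P(y\eq 1) = 0.445 \times 0.85 = 0.37825$, and
\beqa
	\I(X;Y) &=& H_2(0.37825) - 0.445 \, H_2(0.15) \\
	&\simeq& 0.9568 - 0.445 \times 0.6098
	\:\:\simeq\:\: 0.9568 - 0.2714 \:\:=\:\: 0.6854 \mbox{ bits},
\eeqa
 which agrees with the value $0.685$ to three decimal places.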

%\noindent {\sf Exercise b:}
\exercissxA{1}{ex.bscC}{
 What is the capacity of the \ind{binary symmetric channel} for general $\q$?\index{channel!binary symmetric}
}
\exercissxA{2}{ex.becC}{
 Show that the capacity of the \ind{binary erasure channel}\index{channel!binary erasure} with $\q=0.15$
 is $C_{\rm BEC} = 0.85$. What is its capacity for general $\q$?
 Comment.
}


% \bibliography{/home/mackay/bibs/bibs}
%\section{The Noisy Channel Coding Theorem}
\section{The noisy-channel coding theorem}
 It seems plausible that the `capacity' we have defined may be
 a measure of information conveyed by a channel; what is not obvious, 
 and what we will prove in the next chapter, is that the \ind{capacity} indeed 
 measures the rate at which blocks of data can be communicated over the channel 
 {\em with arbitrarily small probability of error}.

 We make the following definitions.\label{sec.whereCWMdefined}
\begin{description}
\item[An $(N,K)$ {block code}] for\indexs{error-correcting code!block code}
 a channel $Q$ is a list  of $\cwM=2^K$
 codewords 
 $$\{ \bx^{(1)}, \bx^{(2)}, \ldots, \bx^{(2^K)} \},  \:\:\:\:\:\bx^{(\cwm)} \in \A_X^N ,$$
 each of length $N$. 
 Using this code we can encode a signal $\cwm \in \{ 1,2,3,\ldots, 2^K\}$
% The signal to be encoded is assumed to come from an 
% alphabet of size $2^K$; signal $m$ is encoded
 as $\bx^{(\cwm)}$. [The number of codewords $\cwM$ is an integer,
 but the number of bits specified by choosing a codeword, $K \equiv \log_2 \cwM$,
 is not necessarily an integer.]

 The {\dbf \inds{rate}\/} of\index{error-correcting code!rate}
 the code is $R = K/N$ bits per channel use.
% character. 

 [We will use this definition of the rate  for any channel, not only  channels with binary inputs;
 note however that it is sometimes conventional to define the rate of a code for a channel
 with $q$ input symbols to be $K/(N\log q)$.]

% \item[A linear $(N,K)$ block code] is a block code in which all 
% moved into leftovers.tex
\item[A \ind{decoder}] for an $(N,K)$ block code is a mapping from
 the set of length-$N$ strings of channel outputs, $\A_Y^N$, to 
 a codeword label $\hat{\cwm} \in \{ 0 , 1 , 2 , \ldots, 2^K \}$. 

 The extra symbol $\hat{\cwm} \eq 0$ can be used to indicate a `failure'. 

\item[The \ind{probability of block error}\index{error probability!block}] 
% $p_B$
 of a code and decoder, for a given channel, and for a given probability 
 distribution over the encoded signal $P(\cwm_{\rm in})$, 
 is:
\beq
	p_{\rm B} = \sum_{\cwm_{\rm in}} P( \cwm_{\rm in} )
		P( \cwm_{\rm out} \! \not =  \! \cwm_{\rm in}  \given  \cwm_{\rm in} )
 .
\eeq
% the probability 
% that the decoded signal $\cwm_{\rm out}$ is not equal to $\cwm_{\rm in}$. 
\item[The maximal probability of block error] is 
\beq
	p_{\rm BM} = \max_{\cwm_{\rm in}} P( \cwm_{\rm out}  \! \not =  \!
 \cwm_{\rm in}  \given  \cwm_{\rm in} )
 .
\eeq
\item[The \ind{optimal decoder}] for a channel code is the one that minimizes 
 the probability of block error. It decodes an output $\by$ as 
 the input $\cwm$ that has maximum \ind{posterior probability} $P(\cwm \given \by)$.
\beq
	P(\cwm \given \by) =
 \frac{ P(\by \given \cwm ) P(\cwm) } { \sum_{\cwm' } P(\by \given \cwm') P(\cwm') }
\eeq
\beq
	\hat{\cwm}_{\rm optimal} = \argmax
% _{\cwm} % did not appear underneath
                                          P(\cwm \given \by) . 
\eeq
 A uniform prior distribution on $\cwm$ is usually assumed, in which case the 
 optimal  decoder is also  the  {\dem \ind{maximum likelihood decoder}}, 
 \ie, the decoder that maps an output $\by$ to 
 the input $\cwm$ that has maximum {\dem \ind{likelihood}} $P(\by \given \cwm )$.
\item[The probability of bit error] $p_{\rm b}$ is defined assuming that
 the codeword number 
 $\cwm$ is represented by a binary vector $\bs$ of length $K$ bits;
 it is the average probability 
 that a bit of $\bs_{\rm out}$ is not equal to  the corresponding 
 bit of $\bs_{\rm in}$ (averaging over all $K$ bits).
 

\item[Shannon's\index{Shannon, Claude}
 \ind{noisy-channel coding theorem} (part one)\puncspace]

%\begin{quote}
 Associated with each discrete memoryless channel,
\marginfig{
\begin{center}
\setlength{\unitlength}{2pt}
\begin{picture}(60,45)(-2.5,-7)
\thinlines
\put(0,0){\vector(1,0){60}}
\put(0,0){\vector(0,1){40}}
\put(30,-3){\makebox(0,0)[t]{$C$}}
\put(55,-2){\makebox(0,0)[t]{$R$}}
\put(-1,35){\makebox(0,0)[r]{$p_{\rm BM}$}}
\thicklines
\put(0,0){\dashbox{3}(30,30){achievable}}
% \put(0,0){\line(0,1){50}}
%
\end{picture}
\end{center}
\caption[a]{Portion of the $R,p_{\rm BM}$ plane asserted to
 be achievable by the first part of Shannon's noisy-channel
 coding theorem.}
\label{fig.belowCthm}
}%end marginfig
 there is a
 non-negative number $C$ (called the channel capacity) with the following
 property.  For any $\epsilon > 0$ and $R < C$, for large enough $N$,
 there exists a block code of length $N$ and rate $\geq R$ and a decoding
 algorithm, such that the maximal probability of block error is
 $< \epsilon$.
%\end{quote}
% \item[The negative part of the theorem\puncspace] moved to graveyard.tex Sun 3/2/02
\end{description}
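 To make these definitions concrete, here is a small illustration
 (the choice of code is made only for the sake of example).
 The list of two codewords $\{ {\tt 000}, {\tt 111} \}$ is a $(3,1)$
 block code for the \BSC, with rate $R = 1/3$ bits per channel use.
 For $\q < 0.5$ and a uniform prior on $\cwm$, the optimal decoder is a
 majority vote over the three received bits. A block error occurs when
 two or three of the bits are flipped, whichever codeword was sent, so
 for $\q \eq 0.15$,
\beq
	p_{\rm B} = p_{\rm BM} = 3 \q^2 (1-\q) + \q^3
	= 3 \times 0.0225 \times 0.85 + 0.003375 \simeq 0.061 .
\eeq
 Since $K=1$ here, the probability of bit error $p_{\rm b}$ is the same number.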



\begin{figure}[htbp]
\figuremargin{%
\[
\begin{array}{c}
\setlength{\unitlength}{1pt}
\begin{picture}(48,120)(0,5)
\thinlines
%\put(5,5){\vector(3,0){30}}
%\put(5,25){\vector(3,0){30}}
\put(5,15){\vector(3,0){30}}
%\put(5,5){\vector(3,1){30}}
%\put(5,25){\vector(3,-1){30}}
% \put(4,5){\makebox(0,0)[r]{{\tt -}}}
\put(4,15){\makebox(0,0)[r]{{\tt Z}}}
% \put(4,25){\makebox(0,0)[r]{{\tt Y}}}
\put(36,5){\makebox(0,0)[l]{{\tt -}}}
\put(36,15){\makebox(0,0)[l]{{\tt Z}}}
\put(36,25){\makebox(0,0)[l]{{\tt Y}}}
%
 \put(5,15){\vector(3,1){30}}
 \put(5,15){\vector(3,-1){30}}
%\put(5,25){\vector(3,0){30}}
%\put(5,25){\vector(3,1){30}}
\put(20,40){\makebox(0,0){$\vdots$}}
%
%\put(5,35){\vector(3,0){30}}
%\put(5,35){\vector(3,1){30}}
% \put(5,35){\vector(3,-1){30}}
%\put(5,45){\vector(3,0){30}}
% \put(5,45){\vector(3,1){30}}
%\put(5,45){\vector(3,-1){30}}
\put(5,55){\vector(3,0){30}}
\put(5,55){\vector(3,1){30}}
\put(5,55){\vector(3,-1){30}}
% \thicklines
% \put(5,65){\vector(3,0){30}}
% \put(5,65){\vector(3,1){30}}
% \put(5,65){\vector(3,-1){30}}
% \thinlines
% \put(5,75){\vector(3,0){30}}
% \put(5,75){\vector(3,1){30}}
% \put(5,75){\vector(3,-1){30}}
\put(5,85){\vector(3,0){30}}
\put(5,85){\vector(3,1){30}}
\put(5,85){\vector(3,-1){30}}
% \put(5,95){\vector(3,0){30}}
% \put(5,95){\vector(3,1){30}}
% \put(5,95){\vector(3,-1){30}}
%\put(5,105){\vector(3,0){30}}
%\put(5,105){\vector(3,1){30}}
%\put(5,105){\vector(3,-1){30}}
\put(5,115){\vector(3,0){30}}
\put(5,115){\vector(3,1){30}}
\put(5,115){\vector(3,-1){30}}
%\put(5,125){\vector(3,0){30}}
%\put(5,125){\vector(3,-1){30}}
%
%\put(5,5){\vector(1,4){30}}
%\put(5,125){\vector(1,-4){30}}
\put(36,45){\makebox(0,0)[l]{{\tt I}}}
\put(4,55){\makebox(0,0)[r]{{\tt H}}}
\put(36,55){\makebox(0,0)[l]{{\tt H}}}
% \put(4,65){\makebox(0,0)[r]{{\tt G}}}
\put(36,65){\makebox(0,0)[l]{{\tt G}}}
% \put(4,75){\makebox(0,0)[r]{{\tt F}}}
\put(36,75){\makebox(0,0)[l]{{\tt F}}}
\put(4,85){\makebox(0,0)[r]{{\tt E}}}
\put(36,85){\makebox(0,0)[l]{{\tt E}}}
% \put(4,95){\makebox(0,0)[r]{{\tt D}}}
\put(36,95){\makebox(0,0)[l]{{\tt D}}}
% \put(4,105){\makebox(0,0)[r]{{\tt C}}}
\put(36,105){\makebox(0,0)[l]{{\tt C}}}
\put(4,115){\makebox(0,0)[r]{{\tt B}}}
\put(36,115){\makebox(0,0)[l]{{\tt B}}}
% \put(4,125){\makebox(0,0)[r]{{\tt A}}}
\put(36,125){\makebox(0,0)[l]{{\tt A}}}
\end{picture}
\end{array}

\hspace{1.5in}
\begin{array}{c}
% roughly 8pts from col to col
\setlength{\unitlength}{1.005pt}% this was 1pt in jan 2000, I tweaked it
\begin{picture}(50,110)(-5,-5)
\thinlines
\put(-5,-5){\ecfig{type}}
\multiput(7.95,-3)(12,0){9}{\framebox(4,126){}}
%\put(2.5,97){\makebox(0,0)[bl]{\small$\bx^{(1)}$}}
%\put(26.5,97){\makebox(0,0)[bl]{\small$\bx^{(2)}$}}
%
\end{picture}

\end{array}
\]
}{%
\caption[a]{A non-confusable subset of inputs for the noisy
 typewriter.}
\label{fig.typenine}
}
\end{figure}
\subsection{Confirmation of the theorem for the noisy typewriter channel}
 In the case of the \ind{noisy typewriter}\index{channel!noisy typewriter}, 
 we can easily confirm  the
% positive part of the
 theorem,
% For  this channel,
 because  we can  create a
% n {\em error-free\/}
 completely error-free
 communication strategy using a block code of length $N =1$:
 we  use  only the letters {\tt B}, {\tt E}, {\tt H}, 
 \ldots, {\tt Z}, 
 \ie, every third letter. These letters form a {\dem non-confusable subset\/}\index{non-confusable inputs}
 of the input 
 alphabet  (see \figref{fig.typenine}). Any output can be uniquely decoded. The number of 
 inputs in the non-confusable subset is 9, so the error-free information 
 rate of this system is $\log_2 9$ bits, which is equal to the capacity  $C$,
 which we evaluated in \exampleref{exa.typewriter}.


% 
 How does this translate into the terms of the theorem?
 The following table explains.\medskip

%\begin{center}
\begin{raggedright}
\noindent
% THIS TABLE IS DELIBERATELY FULL WIDTH
% for textwidth, use this
% \begin{tabular}{p{2.2in}p{2.5in}}
\begin{tabular}{@{}p{2.7in}p{4.1in}@{}}
\multicolumn{1}{@{}l}{\sf The theorem} &
\multicolumn{1}{l}{\sf How it applies to the noisy typewriter } \\ \midrule
\raggedright\em Associated with each discrete memoryless channel, there is a
 non-negative number $C$.
% (called the channel capacity).
&
 The capacity $C$ is $\log_2 9$. 
\\[0.047in]
\raggedright\em For any $\epsilon > 0$ and $R < C$, for large enough $N$,
&
% Assume we are given an $R0$.
 No matter what $\epsilon$ and  $R$ are, we  set the blocklength $N$ to
 1.
\\[0.047in]
\raggedright\em there exists a block code of length $N$ and rate $\geq R$
 & The block code is
% can be the following list of nine codewords:
 $\{{\tt B,E,\ldots,Z}\}$.  The value of
 $K$ is given by $2^K = 9$, so $K=\log_2 9$, and this code has rate
 $\log_2 9$, which is greater than the requested value of $R$.
\\[0.047in]
\raggedright\em and a decoding
 algorithm,
&
 The decoding algorithm maps the received
 letter to the nearest letter in the code;
\\[0.047in]
\raggedright\em
 such that the maximal probability of block error is
 $< \epsilon$.
&
 the maximal probability of block error is zero, which
 is
 less than the given $\epsilon$.
\\
\end{tabular}
\end{raggedright}
%\end{center}
% is greater than or equal 
% to 1

% source RUNME
\section{Intuitive preview of proof}
\subsection{Extended channels}
 To prove the theorem for any given channel, we  consider the 
 {\dem \ind{extended channel}\index{channel!extended}}
 corresponding to $N$ uses of the 
% original
 channel.
 The extended channel has
 $|\A_X|^N$ possible inputs $\bx$ and 
 $|\A_Y|^N$ possible outputs.
% {\em add a picture of extended channel here.}
%
\begin{figure}
\figuremargin{%
\small\begin{center}
\begin{tabular}{cccc}
%$\bQ$ 
& \ecfig{bsc15.1} 
& \ecfig{bsc15.2} 
& \ecfig{bsc15.4} 
\\
& $N=1$
& $N=2$ & $N=4$ \\
\end{tabular}
\end{center}
}{%
\caption{Extended channels obtained from a binary symmetric channel
  with transition probability 0.15.}
\label{fig.extended.bsc15}
}
\end{figure}
%
\begin{figure}
\figuremargin{%
\small\begin{center}
\begin{tabular}{cccc}
%$\bQ$ 
& \ecfig{z15.1} 
& \ecfig{z15.2} 
& \ecfig{z15.4} 
\\
& $N=1$
& $N=2$ & $N=4$ \\
\end{tabular}
\end{center}
}{%
\caption{Extended channels obtained from a Z channel
 with transition probability 0.15. Each column corresponds to an input,
 and each row is a different output.}
\label{fig.extended.z15}
}
\end{figure}
%
%
% these figures made using
% cd itp/extended
 Extended channels obtained from a \BSC\ and from
 a Z channel are shown in figures \ref{fig.extended.bsc15}
 and  \ref{fig.extended.z15}, with $N=2$ and $N=4$.
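 Because the channel is memoryless, each entry of an extended channel's
 transition probability matrix is a product of single-use transition
 probabilities, $P(\by \given \bx) = \prod_{n=1}^{N} P(y_n \given x_n)$.
 For example, for the \BSC\ with $\q \eq 0.15$ and $N=2$, the column
 belonging to the input $\bx = {\tt 00}$ is
\beqa
	P(\by\eq{\tt 00} \given \bx\eq{\tt 00}) &=& (1-\q)^2 \:\:=\:\: 0.7225 , \\
	P(\by\eq{\tt 01} \given \bx\eq{\tt 00}) \:\:=\:\: P(\by\eq{\tt 10} \given \bx\eq{\tt 00}) &=& \q(1-\q) \:\:=\:\: 0.1275 , \\
	P(\by\eq{\tt 11} \given \bx\eq{\tt 00}) &=& \q^2 \:\:=\:\: 0.0225 ,
\eeqa
 and these four entries sum to 1, as they must.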
\exercissxA{2}{ex.extended}{
 Find the transition probability matrices $\bQ$ for
 the extended channel, with $N=2$, derived from
       the binary erasure channel having erasure probability 0.15.
%\item the extended channel with $N=2$ derived from
% the ternary confusion channel,

 By selecting two columns of this transition probability matrix,
% that have minimal overlap,
 we can define a rate-\dhalf\ code for this channel with blocklength $N=2$.
 What is the best choice of two columns? What is the decoding
 algorithm? 
}

 To prove the noisy-channel coding theorem, we
 make use of large blocklengths $N$.
 The intuitive idea is that, if $N$ is large,  {\em 
 an extended channel looks a lot like the noisy typewriter.}
 Any particular input $\bx$ is very likely to produce an output 
 in a small subspace of the output alphabet -- the typical output
 set, given that input.
 So we can find a non-confusable subset of the inputs that produce 
 essentially disjoint output sequences.
%
% add something like: 
%  Remember what we learnt
% in chapter \ref{ch2}: 
%
 For a given $N$, let us consider a way of generating such a
 non-confusable subset of the inputs, and count up how many distinct
 inputs it contains.

 Imagine making an input sequence $\bx$ for the extended channel by
 drawing it from an ensemble $X^N$, where $X$ is an arbitrary ensemble
 over the input alphabet. Recall the source coding theorem of
 \chapterref{ch.two}, and consider the number of probable output sequences
 $\by$. The total number of typical output sequences $\by$
% , when $\bx$ comes from the ensemble $X^N$,
 is about $2^{N H(Y)}$, all having similar
 probability. For any particular typical input sequence $\bx$, there
 are about $2^{N H(Y \specialgiven X)}$ probable output sequences. Some of these subsets of 
 $\A_Y^N$ are depicted by circles in figure \ref{fig.ncct.typs}a. 

\begin{figure}%[htbp]
\small
\figuremargin{%
\begin{center}
\hspace*{-1mm}\begin{tabular}{cc}
\framebox{
\setlength{\unitlength}{0.69mm}%was 0.8mm
\begin{picture}(80,80)(0,0)
\put(0,80){\makebox(0,0)[tl]{$\A_Y^N$}}
\thicklines
\put(40,40){\oval(50,50)}
\thinlines
\put(40,67){\makebox(0,0)[b]{Typical $\by$}}
\put(30,50){\circle{12.5}}
\put(50,40){\circle{12.5}}
\put(35,52){\circle{12.5}}
\put(58,33){\circle{12.5}}
\put(33,40){\circle{12.5}}
\put(35,45){\circle{12.5}}
\put(50,30){\circle{12.5}}
\put(40,50){\circle{12.5}}
\put(52,35){\circle{12.5}}
\put(33,58){\circle{12.5}}
\put(40,33){\circle{12.5}}
\put(45,35){\circle{12.5}}
\put(50,50){\circle{12.5}}
\put(23,55){\circle{12.5}}
\put(24,45){\circle{12.5}}
\put(27,57){\circle{12.5}}
\put(25,40){\circle{12.5}}
\put(55,42){\circle{12.5}}
\put(55,52){\circle{12.5}}
\put(58,53){\circle{12.5}}
\put(53,40){\circle{12.5}}
\put(35,22){\circle{12.5}}
\put(27,30){\circle{12.5}}
\put(40,24){\circle{12.5}}
\put(40,39){\circle{12.5}}
\put(46,43){\circle{12.5}}
\put(55,40){\circle{12.5}}
\put(40,55){\circle{12.5}}
\put(52,23){\circle{12.5}}
\put(50,26){\circle{12.5}}
\put(40,54){\circle{12.5}}
\put(52,55){\circle{12.5}}
\put(33,28){\circle{12.5}}
\put(57,33){\circle{12.5}}
\put(25,35){\circle{12.5}}
\put(55,25){\circle{12.5}}
\put(25,26){\circle{12.5}}
\multiput(23,22)(13,0){3}{\circle{12.5}}
\multiput(30,34)(13,0){3}{\circle{12.5}}
\multiput(23,46)(13,0){3}{\circle{12.5}}
\multiput(30,58)(13,0){3}{\circle{12.5}}
\thicklines
\put(23,30){\circle{12.5}}
\put(21,11){\vector(0,1){13}}
\put(8,6){\makebox(0,0)[l]{ Typical $\by$ for a given typical $\bx$}}
\end{picture}

}
&
\framebox{
\setlength{\unitlength}{0.69mm}
\begin{picture}(80,80)(0,0)
\put(0,80){\makebox(0,0)[tl]{$\A_Y^N$}}
\thicklines
\put(40,40){\oval(50,50)}
\thinlines
\put(40,67){\makebox(0,0)[b]{Typical $\by$}}
% \thicklines
\multiput(23,22)(13,0){3}{\circle{12.5}}
\multiput(30,34)(13,0){3}{\circle{12.5}}
\multiput(23,46)(13,0){3}{\circle{12.5}}
\multiput(30,58)(13,0){3}{\circle{12.5}}
%\put(30,34){\circle{12.5}}
%\put(43,34){\circle{12.5}}
%\put(56,34){\circle{12.5}}
%\put(23,45){\circle{12.5}}
%\put(36,45){\circle{12.5}}
%\put(49,45){\circle{12.5}}
%\put(30,56){\circle{12.5}}
%\put(43,56){\circle{12.5}}
%\put(56,56){\circle{12.5}}
\end{picture}

}\\
(a)&(b) \\
\end{tabular}
\end{center}
}{%
\caption[a]{(a) Some typical outputs 
 in $\A_Y^N$ corresponding 
 to  typical inputs $\bx$. 
(b) A subset of the \ind{typical set}s shown in 
 (a) that do not overlap each other. This picture can be
 compared with  the 
 solution to the  \ind{noisy typewriter} in \figref{fig.typenine}.}
\label{fig.ncct.typs}
\label{fig.ncct.typs.no.overlap}
}
\end{figure}

 We now imagine restricting ourselves to a subset of the typical\index{typical set!for noisy channel} 
 inputs $\bx$ such that the corresponding typical output sets do not overlap, 
 as shown in \figref{fig.ncct.typs.no.overlap}b. 
 We can then bound the number of non-confusable inputs by dividing the size 
 of the typical $\by$ set, 
 $2^{N H(Y)}$, by the size of each typical-$\by$-given-typical-$\bx$ 
 set, $2^{N H(Y \specialgiven X)}$. So the number of non-confusable inputs, 
 if they are selected from the set of typical inputs $\bx \sim X^N$, 
 is $\leq 2^{N H(Y) - N H(Y \specialgiven X)} = 2^{N \I(X;Y)}$.
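 To get a rough feel for the sizes involved, take the \BSC\ with
 $\q \eq 0.15$, a uniform input ensemble and, purely for illustration,
 $N = 100$. Then $H(Y) = 1$ bit and
 $H(Y \specialgiven X) = H_2(0.15) \simeq 0.61$ bits, so there are about
 $2^{100}$ typical outputs, about $2^{61}$ of which are probable for any
 given typical input, and the number of non-confusable inputs is at most
 roughly
\beq
	\frac{2^{N H(Y)}}{2^{N H(Y \specialgiven X)}} \simeq \frac{2^{100}}{2^{61}} = 2^{39} ,
\eeq
 corresponding to about $0.39$ bits per channel use.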

% \begin{figure}
% \begin{center}
% \framebox{
% \setlength{\unitlength}{0.8mm}
% }
% \end{center}
% \caption[a]{A subset of the typical sets shown in 
%  \protect\figref{fig.ncct.typs} that do not overlap.}
% \label{fig.ncct.typs.no.overlap}
% \end{figure}

 The maximum value of 
 this bound is  achieved if $X$ is the ensemble that 
 maximizes $\I(X;Y)$, in which case the number of non-confusable inputs 
 is $\leq 2^{NC}$. Thus asymptotically 
 up to $C$ bits per cycle, 
 and no more, can be communicated with vanishing error probability.\ENDproof

 This sketch has not rigorously proved that reliable communication really
 is possible --
 that's our task for the next chapter.

\section{Further exercises}
% \noindent
% 
\exercissxA{3}{ex.zcdiscuss}{
 Refer back to the computation of the capacity of the 
 \ind{Z channel} with $\q=0.15$. 
\ben
\item
 Why is $p_1^*$ less than 0.5?   One could argue that it is good
 to favour the {\tt{0}} input, since it is transmitted without error --
 and also argue that it is  good to favour the {\tt1} input, since it
 often gives rise to the highly prized {\tt1} output, which
 allows certain identification of the input! Try to make a convincing
 argument.
\item
 In the case of general $\q$, show that the \optens\ is 
\beq
	p_1^* = \frac{ 1/(1-\q) }
		{ \displaystyle 
		1 + 2^{ \left( H_2(\q) / ( 1 - \q ) \right)} } .
\eeq
\item
 What happens to $p_1^*$ if the noise level $\q$ is very close to 1?
\een
}

% see also ahmed.tex for a nice bound 0.5(1-q) on the capacity of the Z channel
% and related graphs CZ.ps CZ2.ps CZ.gnu
%
\exercissxA{2}{ex.Csketch}{
 Sketch graphs of the capacity of the \ind{Z channel}, the \BSC\ 
 and the \BEC\  as a function of $\q$. 
% answer in figs/C.ps 
% \medskip
}
\exercisaxB{2}{ex.fiveC}{
 What is the capacity of the five-input, ten-output channel
% \index{channel!others}
 whose
 transition probability matrix is
{\small
\beq
\left[ \begin{array}{*{5}{c}}
      0.25 &          0 &          0 &          0 &       0.25 \\ 
      0.25 &          0 &          0 &          0 &       0.25 \\ 
      0.25 &       0.25 &          0 &          0 &          0 \\ 
      0.25 &       0.25 &          0 &          0 &          0 \\ 
         0 &       0.25 &       0.25 &          0 &          0 \\ 
         0 &       0.25 &       0.25 &          0 &          0 \\ 
         0 &          0 &       0.25 &       0.25 &          0 \\ 
         0 &          0 &       0.25 &       0.25 &          0 \\ 
         0 &          0 &          0 &       0.25 &       0.25 \\ 
         0 &          0 &          0 &       0.25 &       0.25 \\ 
\end{array}
\right] 
\hspace{0.4in}
\begin{array}{c}\ecfig{five}\end{array}
?
\eeq
}
}
\exercissxA{2}{ex.GC}{
 Consider a \ind{Gaussian channel}\index{channel!Gaussian}
 with binary input $x \in \{ -1, +1\}$
 and {\em real\/} output alphabet $\A_Y$, with transition probability density
\beq
	Q(y \given x,\sa,\sigma) =  \frac{1}{\sqrt{2 \pi \sigma^2}}
		\,	e^{-\smallfrac{(y-x \sa)^2}{2 \sigma^2}} ,
\eeq
 where $\sa$ is the signal amplitude. 
\ben
\item
	Compute the posterior probability of $x$ given $y$, assuming that 
 the two inputs are equiprobable.
 Put your answer in the form 
\beq
 P(x\eq 1 \given y,\sa,\sigma) = \frac{1}{1+e^{-a(y)}}  .
\eeq
 Sketch the value of $P(x\eq 1 \given y,\sa,\sigma)$ 
 as a function of $y$. 
\item
	Assume that a single bit is to be 
 transmitted. What is the optimal decoder,
 and what is its probability of error? Express your answer in terms 
 of the signal-to-noise ratio $\sa^2/\sigma^2$ and
 the
 \label{sec.erf}\ind{error function}\index{conventions!error function}\index{erf}
 (the \ind{cumulative 
 probability function} of the Gaussian distribution),
\beq
	\Phi(z) \equiv \int_{-\infty}^{z} \frac{1}{\sqrt{2 \pi}}
		\,	e^{-\textstyle\frac{z'^2}{2}} \: \d z'.
\eeq
%
% P(x \given y,s,sigma) = 1/(1+e^{-a}), a = 2 ( s / \sigma^2 ). y
% 
 [Note that this definition of the error function $\Phi(z)$ may not
 correspond to other people's.]
% definitions of the `error function'.
% Some people
%% and some software libraries
% leave out factors of two in the definition.] 
% I think that the 
% above definition is the only natural one.
\een
}
% \section{
\subsection*{Pattern recognition as a noisy channel}
 We may think of many pattern recognition problems in terms of\index{pattern recognition}
 \ind{communication} channels. Consider the case of recognizing handwritten
 digits (such as postcodes on envelopes).  The author of the digit
 wishes to communicate a message from the set $\A_X = \{
 0,1,2,3,\ldots, 9 \}$; this selected message is  the input to the
 channel. What comes out of the channel is a pattern of ink on paper.
 If the ink pattern is represented using 256 binary pixels, the channel $Q$
 has as its output a random variable $y \in \A_Y = \{0,1\}^{256}$.
% Here is an example of an element from this alphabet.
 An example of an element from this alphabet is shown in the margin.
%
% hintond.p zero=0.0 range=1.25 rows=16  background=1.0 pos=0.0  o=/home/mackay/_applications/characters/ex2.ps 16 < /home/mackay/_applications/characters/example2
%
%\[
\marginpar{
{\psfig{figure=/home/mackay/_applications/characters/ex2.ps,width=1.1in}}
}%\end{marginpar}
%\]
\exercisaxA{2}{ex.twos}{
 Estimate how many patterns in $\A_Y$  are recognizable as 
 the character `2'. [The aim of this problem is to 
% Try not to underestimate this number --- 
 try to demonstrate the existence of {\em as many patterns as possible\/}
 that are recognizable as 2s.]
\amarginfig{t}{
\begin{center}
\mbox{\psfig{figure=figs/random2.ps}}
\\[0.15in]%\hspace{0.42in}
\mbox{\psfig{figure=figs/2random2.ps}}
\\[0.15in]%\hspace{0.42in}
\mbox{\psfig{figure=figs/6random2.ps}}
\\[0.15in]%\hspace{0.42in}
\mbox{\psfig{figure=figs/7random2.ps}}
\end{center} 
\caption[a]{Some more  2s.}
\label{fig.random2s}
%\end{figure}
}%end{marginfig}
% made using figs/random2.ps seed=7

 Discuss how one might model the channel $P(y \given x\eq 2)$.\index{2s}\index{twos}\index{handwritten digits}
% in the case of handwritten  digit recognition.
 Estimate the entropy of the probability distribution $P(y \given x\eq 2)$. 
% Recognition of isolated handwritten digits
% Digit 2  -> Q  -> y $\in \{0,1\}^{256})$
%           3
% Estimate how many 2's there are.

 One strategy for doing \ind{pattern recognition} is to create a model 
 for $P(y \given x)$ for each value of the input $x \in \{ 0,1,2,3,\ldots, 9 \}$,
 then use \Bayes\  theorem to 
 infer  $x$ given $y$.
\beq
	P(x \given y) = \frac{ P(y \given x) P(x) } { \sum_{x'}  P(y \given x') P(x') } .
\eeq
 This strategy is known as {\dbf \ind{full probabilistic model}ling\/}
 or {\dbf \ind{generative model}ling\/}. This is essentially how 
 current speech recognition systems work.  In addition to the 
 channel model, $P(y \given x)$, one uses a prior probability distribution 
 $P(x)$, which in the case of both character recognition and 
 speech recognition is a language model that specifies the probability of 
 the next character/word given the context and the known grammar 
 and statistics of the language.

% 
% Alternative, model $P(x \given y)$ directly.
%  Discriminative modelling; conditional modelling.
% Feature extraction -- compute some $f(y)$ then model $P(f \given x)$ 
% - generative modelling in feature space.
% or else model $P(x \given f)$ 
% which is still discriminative modelling / conditional modelling. 
% Notice number of parameters.
% 
% 
}
\subsection*{Random coding}
\exercissxA{2}{ex.birthday}{
 Given 
%\index{random coding}
% \index{code!random}
 \index{random code}twenty-four people in a room,
% at a party,
 what is the probability that 
 there are at least two people present who
% of them
 have the same \ind{birthday} (\ie, day and month of birth)?
 What is the expected number of
 pairs of people with the same birthday? Which of these
 two questions is easier to solve?  Which answer gives more
 insight?
 You may find it helpful to solve these problems and those that follow
 using notation such as $A=$ number of days in year $=365$
 and $S=$ number of people $=24$.
}
\exercisaxB{2}{ex.birthdaycode}{
 The birthday problem may be related to a coding scheme.
 Assume we wish to convey a message 
 to an outsider identifying one of the twenty-four people. 
 We could simply communicate a number $\cwm$ from $\A_S = \{ 1,2, \ldots,
 24 \}$, having agreed a mapping of people onto numbers;
 alternatively, we could convey a number from 
 $\A_X = \{ 1 ,2 , \ldots, 365\}$, identifying  the 
 day of the year that is the selected person's \ind{birthday}
 (with apologies to leapyearians). [The receiver is assumed to know 
 all the people's birthdays.]
 What, roughly, is the probability of error of this communication scheme, 
 assuming it is used for a single transmission?
 What is the capacity of the communication channel, and what is 
 the rate of communication attempted by this scheme? 
}
%
% CHRIS SAYS ``this is not CLEAR''................. :
%
\exercisaxB{2}{ex.birthdaycodeb}{
 Now imagine that there are $K$ rooms in a building, each containing 
 $q$ people. (You might think of $K=2$ and $q=24$ as an example.)
 The aim is to communicate a selection of one person 
 from each room by transmitting an ordered list of $K$ days (from $\A_X$). 
 Compare the probability of error of the following two schemes.
\ben
\item
	As before, where each 
 room transmits the \ind{birthday} of the selected person.
\item
 To each $K$-tuple of people, one drawn from each room, 
 an ordered $K$-tuple of randomly selected days from $\A_X$ is assigned
 (this $K$-tuple has nothing to do with their birthdays). 
 This enormous list of $S = q^K$ strings is
 known to the receiver. When the building has selected 
 a particular person from each room, the ordered string of days 
 corresponding to that  $K$-tuple of people is transmitted. 
\een
 What is the probability of error when $q=364$ and $K=1$?
 What is the probability of error when $q=364$ and $K$ is large,
 \eg\ $K=6000$?
}

% see synchronicity.tex
% for cut example

\dvips
\section{Solutions}% to Chapter \protect\ref{ch5}'s exercises} 
%
\fakesection{solns to exercises in l5.tex}
%
\soln{ex.bscy0}{
 If we assume we observe $y\eq 0$,
\beqan
 	P(x\eq 1 \given y\eq 0) &=&  \frac{ P(y\eq 0 \given x\eq 1) P(x\eq 1)}{\sum_{x'}  P(y\eq 0 \given x') P(x')} \\
 	&=& \frac{ 0.15 \times 0.1 }{  0.15 \times 0.1 +  0.85 \times 0.9 } \\
 	&=& \frac{ 0.015 }{0.78} \:=\: 0.019 .
\eeqan
}
\soln{ex.zcy0}{
 If we observe $y=0$, 
\beqan
	P(x\eq 1 \given y\eq 0) 
% &=&  \frac{ P(y\eq 0 \given x\eq 1) P(x\eq 1)}{\sum_{x'}  P(y \given x') P(x')} \\
	&=& \frac{ 0.15 \times 0.1 }{  0.15 \times 0.1 +  1.0 \times 0.9 } \\
	&=& \frac{ 0.015}{ 0.915} \:=\:  0.016 .
\eeqan
}
\soln{ex.bscMI}{
 The probability that $y=1$ 
 is $0.5$, so the mutual information is: 
\beqan
	\I(X;Y) &=& H(Y) - H(Y \given X) \\
 &=& H_2(0.5) - H_2(0.15)\\
 & =& 1 - 0.61 \:\: = \:\: 0.39  \mbox{ bits}.
\eeqan
}
\soln{ex.zcMI}{
 We again compute the mutual information using
 $\I(X;Y) = H(Y) - H(Y \given X)$.
% fixed Tue 18/2/03
 The probability that $y=0$ 
 is $0.575$, and $H(Y \given X) = \sum_x P(x) H(Y \given x) =  P(x\eq1) H(Y \given x\eq1) $
 $+$  $P(x\eq0) H(Y \given x\eq0)$ so the mutual information is: 
\beqan
	\I(X;Y) &=& H(Y) - H(Y \given X) \\
 &=& H_2(0.575) - [0.5 \times H_2(0.15)+0.5 \times 0 ] \\
 & =& 0.98 - 0.30 \:\: = \:\: 0.679  \mbox{ bits}.
\eeqan
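 As a check on the arithmetic, here are the intermediate values to three
 decimal places (the rounding is ours):
\beqan
	H_2(0.575) &=& 0.575 \log_2 \frac{1}{0.575} + 0.425 \log_2 \frac{1}{0.425}
		\:\simeq\: 0.459 + 0.525 \:=\: 0.984 , \nonumber \\
	0.5 \times H_2(0.15) &\simeq& 0.5 \times 0.610 \:=\: 0.305 , \nonumber
\eeqan
 so $\I(X;Y) \simeq 0.984 - 0.305 = 0.679$ bits. The two-figure values
 $0.98$ and $0.30$ in the last line of the solution are rounded for display only.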
}
\soln{ex.bscC}{
	By symmetry, the \optens\ is 
 $\{0.5,0.5\}$. 
 Then the capacity is 
\beqan
	C \:=\: \I(X;Y) &=& H(Y) - H(Y \given X) \\
 &=& H_2(0.5) - H_2(\q)\\
 & =& 1 - H_2(\q) .
\eeqan
 Would you like to find the \optens\ without invoking symmetry? 
 We can do this by computing the mutual information in the general 
 case where the input ensemble is $\{p_0,p_1\}$: 
\beqan
	 \I(X;Y) &=& H(Y) - H(Y \given X) \\
 &=& H_2(p_0 \q+ p_1(1-\q) ) - H_2(\q) .
\eeqan
 The only $p$-dependence is in the first term $H_2(p_0\q+ p_1(1-\q) )$, 
 which is maximized by setting the argument to 0.5. 
 This value is given by setting $p_0=1/2$.
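
 To spell out that last step (a short check, writing $p_1 = 1 - p_0$):
\[
	p_0 \q + (1-p_0)(1-\q) = \frac{1}{2}
	\:\:\:\Longleftrightarrow\:\:\:
	(1 - 2\q) \left( \frac{1}{2} - p_0 \right) = 0 ,
\]
 so for any $\q \neq 1/2$ the only solution is $p_0 = 1/2$. When $\q = 1/2$
 the output is independent of the input, and every input ensemble
 gives $\I(X;Y) = 0$.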
}
\soln{ex.becC}{
\noindent {\sf Answer 1}.
 By symmetry, the \optens\ is 
 $\{0.5,0.5\}$.  The capacity is
 most easily evaluated by 
 writing the mutual information as $\I(X;Y) = H(X) - H(X \given Y)$.
  The conditional entropy $H(X \given Y)$ is $\sum_y P(y) H(X \given y)$; 
 when $y$ is known, $x$  is only uncertain if $y=\mbox{\tt{?}}$, which 
 occurs with probability $\q/2+\q/2$, 
 so the conditional entropy $H(X \given Y)$ is $\q H_2(0.5)$.
\beqan
 C \:=\: \I(X;Y) &=& H(X) - H(X \given Y) \\
 &=& H_2(0.5) - \q H_2(0.5)\\
 & =& 1 - \q .
\eeqan
% The conditional entropy $H(X \given Y)$ is $\q H_2(0.5)$.
%
 The binary erasure channel 
 fails a fraction $\q$ of the 
 time.  Its capacity is precisely 
 $1-\q$, which is the fraction of 
 the time that the channel is 
 reliable.
% functional.
% , even though the sender
% does not know when the channel will 
% fail.
 This result seems very reasonable, but it is far from obvious
 how to encode information so as to communicate {\em reliably\/} over this channel.
\smallskip

\noindent {\sf Answer 2}.
 Alternatively, without invoking the symmetry assumed above, we can
 start from the input ensemble $\{p_0,p_1\}$. The probability that
 $y=\mbox{\tt{?}}$ is $p_0 \q+ p_1 \q = \q$, and when we receive $y=\mbox{\tt{?}}$,
 the posterior probability of $x$ is the same as the prior
 probability, so:
\beqan 
 \I(X;Y) &=& H(X) - H(X \given Y) \\
 &=& H_2(p_1) - \q H_2(p_1)\\
 & =& (1 - \q  ) H_2(p_1) .
\eeqan
 This mutual information achieves its maximum value of $(1-\q)$ when 
 $p_1=1/2$.
}
%
%
%
\begin{figure}[htbp]
\figuremargin{%
\begin{center}
\begin{tabular}{ccccc}
$\bQ$ 
& \ecfig{bec.1} 
&{\small{(a)}} \, \ecfig{bec.2} 
&{\small{(b)}} \,
% roughly 8pts from col to col
\setlength{\unitlength}{1pt}
\begin{picture}(50,110)(-5,-5)
\put(-5,-5){\ecfig{bec.2}}
\put(3.95,-3){\framebox(8,96){}}
\put(28.5,-3){\framebox(8,96){}}
\put(2.5,97){\makebox(0,0)[bl]{\small$\bx^{(1)}$}}
\put(26.5,97){\makebox(0,0)[bl]{\small$\bx^{(2)}$}}

\end{picture}

&{\small{(c)}} \,
% roughly 8pts from col to col
\setlength{\unitlength}{1pt}
\begin{picture}(50,110)(-5,-5)
\put(-5,-5){\ecfig{bec.2}}
\put(3.95,-3){\framebox(8,96){}}
\put(28.5,-3){\framebox(8,96){}}
\put(2.5,97){\makebox(0,0)[bl]{\small$\bx^{(1)}$}}
\put(26.5,97){\makebox(0,0)[bl]{\small$\bx^{(2)}$}}

% roughly 8pts from col to col
%\setlength{\unitlength}{1pt}
%\begin{picture}(100,110)(-5,-5)
%\put(-5,-5){\ecfig{bec.2}}
%\put(3.95,-3){\framebox(8,96){}}
%\put(28.5,-3){\framebox(8,96){}}
%\put(2.5,97){\makebox(0,0)[bl]{$\bx^{(1)}$}}
%\put(26.5,97){\makebox(0,0)[bl]{$\bx^{(2)}$}}
%
\multiput(-4,3)(0,8){2}{\line(1,0){8}}
\multiput(-4,27)(0,8){3}{\line(1,0){8}}
\multiput(-4,59)(0,8){2}{\line(1,0){8}}
\multiput(37,3)(0,8){2}{\vector(1,0){14}}
\multiput(37,27)(0,8){3}{\vector(1,0){14}}
\multiput(37,59)(0,8){2}{\vector(1,0){14}}
\multiput(57,4)(0,8){2}{\makebox(0,0)[l]{\tiny$\hat{m}=2$}}
\multiput(57,28)(0,8){1}{\makebox(0,0)[l]{\tiny$\hat{m}=2$}}
\multiput(57,44)(0,8){1}{\makebox(0,0)[l]{\tiny$\hat{m}=1$}}
\multiput(57,60)(0,8){2}{\makebox(0,0)[l]{\tiny$\hat{m}=1$}}
\multiput(57,36)(0,8){1}{\makebox(0,0)[l]{\tiny$\hat{m}=0$}}
% the box starts exactly at x=0.

\end{picture}

\\
 & $N=1$ & $N=2$ &  \\[-0.1in]
\end{tabular}
\end{center}
}{%
\caption[a]{(a) The {\ind{extended channel}} ($N=2$)
 obtained from a binary erasure channel
 with erasure probability 0.15. (b) A block code
 consisting of the two codewords {\tt 00} and {\tt 11}.
 (c) The optimal decoder for this code. }
\label{fig.extended.bec}
}
\end{figure}
%
\soln{ex.extended}{
 The extended channel is shown in \figref{fig.extended.bec}.
 The best code for this channel with $N=2$ is obtained by choosing
 two columns that have minimal overlap, for example, columns {\tt 00}
 and {\tt 11}. The decoding algorithm returns `{\tt 00}'
 if the extended channel output is among the top four
% either output is {\tt 0},
 and `{\tt 11}' if
 it's among the bottom four, 
% if either output is {\tt 1},
 and gives up if the output is `{\tt ??}'.
}
%
% end of chapter
%
\soln{ex.zcdiscuss}{
 In \exampleref{exa.Z.HXY}
%  \exaseven\ of chapter \chfive\ 
 we  showed that the mutual information between input and output 
 of the Z channel is 
\beqan
	\I(X;Y) &=& H(Y) - H(Y \given X) \nonumber  \\
 &=& H_2(p_1 (1-\q))  -  p_1 H_2(\q) .
\eeqan
 We differentiate this expression with respect to $p_1$, taking care not
 to confuse $\log_2$ with $\log_e$:
\beq
	\frac{\d}{\d p_1}	\I(X;Y)
	=  (1-\q) \log_2 \frac{ 1-  p_1 (1-\q) }{ p_1 (1-\q) } - H_2(\q) .
\eeq
 Setting this derivative to zero and rearranging using skills developed 
 in  \exthirtyone, we obtain:
\beq
	{ p_1^* (1-\q) } = \frac{1}{1 + \displaystyle 2^{H_2(\q)/(1-\q)}} ,
\eeq
 so the \optens\ is 
\beq
	p_1^* = \frac{ 1/(1-\q) }
		{ \displaystyle 
		1 + 2^{ \left( H_2(\q) / ( 1 - \q ) \right)} } .
\eeq
 As the noise level $\q$ tends to 1, this expression tends to $1/e$
 (as you can prove using L'H\^opital's rule). 
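
 For readers who want the intermediate steps, here is a sketch. The symbols
 $u \equiv p_1^* (1-\q)$ and $\delta \equiv 1-\q$ are introduced here for
 convenience only. Setting the derivative to zero gives
\[
	\log_2 \frac{1-u}{u} = \frac{H_2(\q)}{1-\q}
	\:\:\:\Rightarrow\:\:\:
	u = \frac{1}{1 + 2^{H_2(\q)/(1-\q)}} .
\]
 For $\q = 0.15$ we have $H_2(\q) \simeq 0.610$, so
 $2^{H_2(\q)/(1-\q)} \simeq 2^{0.717} \simeq 1.644$ and
 $p_1^* \simeq 1.176/2.644 \simeq 0.445$.
 As an alternative to L'H\^opital's rule, the limit can be seen from a series expansion:
 for small $\delta$, $H_2(\q) = \delta \log_2 (1/\delta) + \delta \log_2 e + O(\delta^2)$,
 so $2^{H_2(\q)/(1-\q)} \simeq e/\delta$ and
\[
	p_1^* \simeq \frac{ 1/\delta }{ 1 + e/\delta } = \frac{1}{\delta + e}
	\rightarrow \frac{1}{e} .
\]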

 For all values of $\q\!$, $p_1^*$ is smaller than $1/2$. A rough 
 intuition for why input {\tt1} is used less than input {\tt0} is that 
 when input {\tt1} is used, the noisy channel injects entropy into 
 the received string; whereas when input {\tt0} is used, the noise has 
 zero entropy. Thus starting from $p_1=1/2$, a perturbation 
 towards smaller $p_1$ will reduce the conditional entropy 
 $H(Y \given X)$ linearly while leaving $H(Y)$ unchanged, to first order.
 $H(Y)$  decreases only quadratically in $(p_1-\dhalf)$.
}
\soln{ex.Csketch}{
 The capacities of the three channels are shown in \figref{fig.capacities}.
% below.
\amarginfig{b}{
\begin{center}
\mbox{\psfig{figure=figs/C.ps,angle=-90,width=2in}
}
\end{center}
\caption[a]{Capacities of the Z channel, \BSC, and binary erasure channel.}
\label{fig.capacities}
}%end marginpar
 For any $\q <0.5$,
% the channels can be ordered with the BEC being the
 the BEC is the 
 channel with  highest capacity and the BSC the lowest. 
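
 For reference when sketching, the three curves can be collected as follows
 (the subscripted labels are ours; $p_1^*$ is the optimal input probability
 for the Z channel):
\beqan
	C_{\rm BSC} &=& 1 - H_2(\q) , \nonumber \\
	C_{\rm BEC} &=& 1 - \q , \nonumber \\
	C_{\rm Z} &=& H_2( p_1^* (1-\q) ) - p_1^* H_2(\q) . \nonumber
\eeqan
 At $\q = 0.15$, for example, these evaluate to roughly $0.39$, $0.85$
 and $0.685$ bits respectively (our rounding).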
}
\soln{ex.GC}{
 The logarithm of the posterior probability ratio, given $y$, is 
\beq
	a(y) = \ln \frac{P(x\eq 1 \given y,\sa,\sigma)}{P(x\eq -1 \given y,\sa,\sigma)}
=  \ln \frac{Q(y \given x\eq 1,\sa,\sigma)}{Q(y \given x\eq -1,\sa,\sigma)}
		= 2 \frac{\sa y}{\sigma^2} .
% corrected march 2000
% and corrected log to ln Sun 22/8/04
\eeq
 Using our skills picked up from
% in chapter \ref{ch1},
 \exerciseref{ex.logit}, we  rewrite 
% from exercise \label{eq.sigmoid} \label{eq.logistic}
 this in the form 
\beq
 P(x\eq 1 \given y,\sa,\sigma) = \frac{1}{1+e^{-a(y)}}  .
\eeq
 The optimal decoder selects the most probable hypothesis; this can 
 be done simply by looking at the sign of $a(y)$. If $a(y)>0$
 then decode as $\hat{x}=1$; otherwise decode as $\hat{x}=-1$. 

 The probability of error is
\beq
	p_{\rm b} = \int_{-\infty}^{0} \!\! \d y \:
 	Q(y \given x\eq 1,\sa,\sigma) =
% chris suggests removing the x (=1) from what follows (twice)
\int_{-\infty}^{- x \sa} \! \d y \: \frac{1}{\sqrt{2 \pi \sigma^2}}
			e^{-\smallfrac{y^2}{2 \sigma^2}} 
	= \Phi \left( - \frac{ x\sa }{ \sigma } \right) .
% corrected march 2000
\eeq
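 To fill in the algebra (a sketch using only the Gaussian density defined in
 the exercise, and taking $\sa > 0$), the log likelihood ratio is
\[
	a(y) = \frac{ -(y - \sa)^2 + (y + \sa)^2 }{ 2 \sigma^2 }
	= \frac{ 4 \sa y }{ 2 \sigma^2 } = \frac{ 2 \sa y }{ \sigma^2 } ,
\]
 and, setting $x=1$ in the error integral and substituting $z = (y - \sa)/\sigma$,
\[
	p_{\rm b} = \Phi \left( - \frac{ \sa }{ \sigma } \right)
	= \Phi \left( - \sqrt{ \sa^2 / \sigma^2 } \right) ,
\]
 which depends on $\sa$ and $\sigma$ only through the signal-to-noise ratio.
 For instance, when $\sa^2/\sigma^2 = 1$, $p_{\rm b} = \Phi(-1) \simeq 0.16$.
 By symmetry the same error probability applies when $x = -1$ is sent.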
% where 
%\beq
%	\Phi(z) \equiv \int_{z}^{\infty} \frac{1}{\sqrt{2 \pi}}
%			e^{-\frac{z^2}{2}} .
%\eeq
%\beq
%	\Phi(z) \equiv \int_{-\infty}^{z}{\smallfrac{1}{\sqrt{2 \pi}}}
%			e^{-\textstyle\frac{z^2}{2}} .
%\eeq
}
\subsection*{Random coding}
\soln{ex.birthday}{
 The probability that $S=24$ people whose birthdays are drawn at random 
 from $A=365$ days all have {\em distinct\/} birthdays is 
\beq
	\frac{ A(A-1)(A-2)\ldots(A-S+1) }{  A^S } .
\eeq
 The probability that two (or more) people share a \ind{birthday} is one minus 
 this quantity, which, for $S=24$ and $A=365$,
 is about 0.5. This exact  way of answering the question
 is not very informative
 since it is not clear  for what
 value of $S$  the probability changes from being close to 0 to being
 close to 1.
 

 The number of pairs is $S(S-1)/2$, and the probability that a particular
 pair shares a birthday is $1/A$, so the {\em expected number\/} of collisions
 is
\beq
	\frac{ S(S-1)}{2 } \frac{1}{A} .
\eeq
 This answer is more instructive. The expected number of collisions
 is tiny if $S \ll \sqrt{A}$ and big if $S \gg \sqrt{A}$.

 We can also approximate the probability that all birthdays are distinct,
 for small $S$, thus:
\beqan
\lefteqn{\hspace*{-0.7in}
 \frac{ A(A-1)(A-2)\ldots(A-S+1) }{  A^S }
	\:\:=\:\: (1)(1-\dfrac{1}{A})(1-\dfrac{2}{A})\ldots(1-\dfrac{(S\!-\!1)}{A})
\hspace*{1.7in}}
% this hspace{ no good
\nonumber \\
	&\simeq&
\exp( 0 ) \exp ( -\linefrac{1}{A})  \exp ( -\linefrac{2}{A}) \ldots   \exp ( -\linefrac{(S\!-\!1)}{A})
\\
	&\simeq&
 \exp \left( - \frac{1}{A} \sum_{i=1}^{S-1} i  \right) 
=  \exp \left( - \frac{S(S-1)/2}{A} \right)  .
\eeqan
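 Plugging in $S=24$ and $A=365$ (our arithmetic, as a rough check),
 the expected number of collisions is
\[
	\frac{ 24 \times 23 }{ 2 } \times \frac{1}{365} = \frac{276}{365} \simeq 0.76 ,
\]
 and the approximation gives a probability of about $\exp(-0.76) \simeq 0.47$
 that all birthdays are distinct, that is, a probability of about $0.53$
 that at least two people share a birthday. This is close to the exact
 value of about $0.54$.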
}

\dvipsb{solutions noisy channel s5}
\prechapter{About        Chapter}
\fakesection{prerequisites for chapter 6}
 Before reading  \chref{ch.six}, you should have read Chapters
 \chtwo\ and \chfive. \Exerciseref{ex.extended} is
 especially recommended.
% and worked on \exerciseref{ex.dataprocineq}.
%
%  \extwentytwo\ from chapter \chone.

% Please note that you {\em don't\/} need to understand
% this proof in order to be able to solve most of the
% problems involving noisy channels.

%\footnote
% {This  exposition is based on that of Cover and Thomas (1991).}

\subsection*{Cast of characters}
\noindent%
\begin{tabular}{lp{4in}} \toprule
$Q$  & the noisy channel \\
$C$ & the capacity of the channel \\
$X^N$ & an ensemble used to create a \ind{random code} \\
$\C$ & a random code \\
$N$ & the length of the codewords \\
$\bx^{(\cwm)}$ & a codeword, the $\cwm$th in the code \\
$\cwm$ % $s$
 & the number of a chosen codeword
	 (mnemonic: the {\em source\/} selects $\cwm$)  \\
$\cwM = 2^{K}$ % $S$
 & the total number of codewords in the code\\
$K=\log_2 \cwM$
    & the number of bits conveyed by the choice of one codeword from $\cwM$,
	assuming  it is chosen with uniform probability \\
$\bs$ & a binary representation of the number  $\cwm$ \\
$R = K/N$ & the rate of the code, in bits per channel use
			(sometimes called $R'$ instead) \\
% $R'$ & another  rate, close to $R$ \\
$\hat{\cwm}$ % $s$
 & the decoder's guess of $\cwm$ \\
\bottomrule
\end{tabular} \medskip


%{\sf Typo Warning:}
% the letter $m$ may turn up where it should read $\cwm$.
%%%% !!!!!!!!!!!!!! ok???????????????????????

\ENDprechapter
\chapter{The Noisy-Channel Coding Theorem}
% {The noisy-channel coding theorem}% Proof of
\label{ch.six}
% % \lecturetitle{The noisy-channel coding theorem, part b} 
% \chapter{The noisy channel coding theorem}% Proof of
\label{ch6}
\section{The theorem}\index{noisy-channel coding theorem}\index{communication}
 The theorem has three parts, two positive and one negative.
 The main positive result is the first.
\amarginfig{t}{
\begin{center}\small
\setlength{\unitlength}{2pt}
\begin{picture}(60,45)(-2.5,-7)
\thinlines
\put(0,0){\vector(1,0){60}}
\put(0,0){\vector(0,1){40}}
\put(30,0){\line(0,1){30}}
\put(30,0){\line(1,2){10}}
\put(30,-3){\makebox(0,0)[t]{$C$}}
\put(55,-2){\makebox(0,0)[t]{$R$}}
\put(42,22){\makebox(0,0)[bl]{$R(p_{\rm b})$}}
\put(-1,35){\makebox(0,0)[r]{$p_{\rm b}$}}
\thicklines
\put(0,0){\makebox(30,30){1}}
\put(30,0){\makebox(7.5,35){2}}
\put(35,0){\makebox(30,20){3}}
% \put(0,0){\line(0,1){50}}
%
\end{picture}
\end{center}
\caption[a]{Portion of the $R,p_{\rm b}$ plane
 to be proved 
 achievable (1,$\,$2) and
 not achievable (3).
}
\label{fig.belowCcoming}
}%end marginfig
\ben%gin{itemize}
\item
 For every  discrete memoryless channel, the
 channel capacity 
\beq
 C = \max_{\P_X}\, \I(X;Y)
\eeq
 has the following
 property.  For any $\epsilon > 0$ and $R < C$, for large enough $N$,
 there exists a code of length $N$ and rate $\geq R$ and a decoding
 algorithm, such that the maximal probability of block error is $<
 \epsilon$.

\item
 If a probability of bit error $p_{\rm b}$ is acceptable, rates up to $R(p_{\rm b})$ 
 are achievable, where 
\beq
	R(p_{\rm b}) = \frac{ C }
   		   {1 - H_2(p_{\rm b})} .
\eeq
\item
 For any $p_{\rm b}$, rates greater than $R(p_{\rm b})$ are not achievable.
\een%d{itemize} 
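
 To get a feel for the second part, here is an illustrative evaluation
 (the value $p_{\rm b} = 0.01$ is our choice of example):
\[
	H_2(0.01) \simeq 0.081 , \:\:\:\:\:
	R(0.01) = \frac{C}{1 - 0.081} \simeq 1.09 \, C .
\]
 So tolerating a 1\% bit error rate raises the achievable rate by only about 9\%
 above the capacity.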

\section{Jointly-typical sequences}
 We formalize the intuitive preview of the last chapter.\index{typicality}

 We will define codewords $\bx^{(\cwm )}$ as coming from an ensemble $X^N$,
 and consider the random selection of one codeword and a
 corresponding channel output $\by$,
 thus defining a joint ensemble $(XY)^N$.
%, corresponding to random  generation of a codeword and a corresponding channel output.
 We will use a {\dem typical-set decoder}, which
 decodes 
 a received signal 
 $\by$  as $\cwm$ if $\bx^{(\cwm )}$ and $\by$ are {\dem jointly typical},
 a term to be defined shortly.

 The proof will then centre on determining the probabilities (a) that the true 
 input codeword is {\em not\/}
 jointly \index{typicality}{typical} with the output sequence; 
 and (b) that a {\em false\/} input codeword {\em is\/} jointly typical with the output. 
 We will show that, for large $N$, both  probabilities
% $\rightarrow 0$,
 go to zero
 as long as there are fewer 
 than $2^{NC}$ codewords, and the ensemble $X$ is the \index{optimal input distribution}{\optens}.

 

\newcommand{\JNb}{\mbox{$J_{N \beta}$}}
\begin{description}
\item[Joint typicality\puncspace]
 A pair of sequences $\bx,\by$ of length $N$ are defined to be
	{jointly 
	typical (to tolerance $\beta$)}\index{joint typicality}
 with respect to the distribution 
 $P(x,y)$ if 
\beqan
\mbox{$\bx$ is typical of $P(\bx)$,}
& \mbox{\ie,} &
 \left| \frac{1}{N} \log \frac{1}{P(\bx)} - H(X) \right| < \beta ,
\nonumber
\\
\mbox{$\by$ is typical of $P(\by)$,}
& \mbox{\ie,} &
 \left| \frac{1}{N} \log \frac{1}{P(\by)} - H(Y) \right| < \beta ,
\nonumber
\\
\mbox{and $\bx,\by$ is typical of $P(\bx,\by)$,}
& \mbox{\ie,} &
 \left| \frac{1}{N} \log \frac{1}{P(\bx,\by)} - H(X,Y) \right| < \beta .
\nonumber
\eeqan
\item[The jointly-typical set] $\JNb$ is the set of all jointly
	typical sequence pairs of length $N$.
% It has the following three properties,
\end{description}

%\begin{example}
\noindent {\sf Example.}
 Here is a jointly-typical pair of length $N=100$
 for the  ensemble
 $P(x,y)$ in which $P(x)$ has $(p_0,p_1) = (0.9,0.1)$
 and $P(y \given x)$ corresponds to a binary symmetric channel with
 noise level $0.2$.
\[%beq
\mbox{
\begin{tabular}{cc}
$\bx$ &\mbox{\footnotesize\tt 1111111111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000}\\
$\by$ &\mbox{\footnotesize\tt 0011111111000000000000000000000000000000000000000000000000000000000000000000000000111111111111111111}
\end{tabular}
}
\]%eeq
 Notice that $\bx$ has 10 {\tt 1}s, and so is typical of the probability
 $P(\bx)$ (at any tolerance $\beta$); and $\by$ has
% 18 + 8 = 26
 26 {\tt 1}s, so it is typical of $P(\by)$ (because $P(y\eq 1) = 0.26$);
 and $\bx$ and $\by$ differ in
% 18 + 2
 20 bits, which is  the typical number of flips for
 this channel.
%\end{example}
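
 As a check (our arithmetic, to three decimal places), the entropies
 for this ensemble are
\beqan
	H(X) &=& H_2(0.1) \:\simeq\: 0.469 , \nonumber \\
	H(Y) &=& H_2(0.26) \:\simeq\: 0.827 , \nonumber \\
	H(X,Y) &=& H(X) + H(Y \given X) \:=\: H_2(0.1) + H_2(0.2) \:\simeq\: 1.191 , \nonumber \\
	\I(X;Y) &=& H(X) + H(Y) - H(X,Y) \:\simeq\: 0.105 \mbox{ bits}. \nonumber
\eeqan
 In the pair shown, the four combinations of $(x_n,y_n)$, namely
 {\tt 11}, {\tt 10}, {\tt 01} and {\tt 00}, occur $8$, $2$, $18$ and $72$ times,
 which is exactly $N$ times their probabilities $0.08$, $0.02$, $0.18$ and $0.72$.
 So $\frac{1}{N} \log \frac{1}{P(\bx,\by)}$ equals $H(X,Y)$ exactly, and the
 third condition for joint typicality also holds at any tolerance $\beta$.
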
\begin{description}
\item[Joint typicality theorem\puncspace]
	Let $\bx,\by$ be drawn from the ensemble $(XY)^N$ defined
 by
 $$P(\bx,\by)=\prod_{n=1}^N P(x_n,y_n).$$
 Then\index{joint typicality theorem}\label{theorem.jtt}  
\ben
\item
	the probability that $\bx,\by$
	 are jointly typical (to tolerance $\beta$)
 tends to 1 as $N \rightarrow \infty$;
\item
	the number of jointly-typical sequences $|\JNb|$
	is  close to  $2^{N  H(X,Y) }$. To be precise, 
\beq
 |\JNb| \leq 2^{N ( H(X,Y) + \beta ) };
\eeq
\item
	if $\bx'\sim X^N$ and $\by'\sim Y^N$, \ie, $\bx'$ and $\by'$
	are {\em independent\/} samples 
	with the same marginal distribution as $P(\bx,\by)$, then
	the probability that $(\bx' ,\by')$ lands in the
 jointly-typical set is about $2^{- N  \I(X;Y)}$.  To be precise, 
\beq
	P( (\bx' ,\by') \in \JNb )
		\leq 2^{- N ( \I(X;Y) - 3 \beta ) } .
\eeq 
% also, for the proof of the converse, we want...
% for sufficiently large N
% P( (\bx' ,\by') \in \JNb 
%	 	\geq (1-\beta) 2^{- N ( \I(X;Y) + 3 \beta ) }
\een

\item[{\sf Proof.}] The proof of parts 1 and 2 
 by the law of large numbers follows that of the source coding theorem in
  \chref{ch2}. For part 2, let the pair $x,y$ play the role of
 $x$ in the source coding theorem, replacing $P(x)$ there by
 the probability distribution $P(x,y)$.

% \marginpar{\footnotesize } 
 For the third part, 
\beqan
% \begin{array}{lll}
% was (thin column) --
% \multicolumn{3}{l}{
% 	P( (\bx' ,\by') \in \JNb )
% 		\: = \: \sum_{(\bx ,\by) \in \JNb} P(\bx ) P(\by)}
% \\[0.06in]
% 	&\leq & |\JNb| \, 2^{-N(H(X)-\beta)} 2^{-N(H(Y)-\beta)} 
% \\[0.045in]
% 	&\leq& 2^{N( H(X,Y) + \b) - N(H(X)+H(Y)-2\b)}
% \\
% 	& =& 	2^{-N ( \I(X;Y) - 3 \beta )}
	P( (\bx' ,\by') \in \JNb )
		& = & \sum_{(\bx ,\by) \in \JNb} P(\bx ) P(\by) 
\\% [0.06in]
	&\leq & |\JNb| \, 2^{-N(H(X)-\beta)} \, 2^{-N(H(Y)-\beta)} 
\\% [0.045in]
	&\leq& 2^{N( H(X,Y) + \b) - N(H(X)+H(Y)-2\b)}
\\
	& =& 	2^{-N ( \I(X;Y) - 3 \beta )} . \hspace{1in}\epfsymbol
\eeqan
% This quantity is a bound on the probability of confusing 
\end{description}
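 The steps in the proof of the third part can be spelled out as follows
 (a sketch of what is already implicit in the definitions, with logarithms to base 2).
 For every $(\bx,\by) \in \JNb$, the typicality conditions on $\bx$ and $\by$ give
\[
	P(\bx) < 2^{-N(H(X)-\beta)}
	\:\:\: \mbox{and} \:\:\:
	P(\by) < 2^{-N(H(Y)-\beta)} ,
\]
 which bounds each term of the sum; the sum has $|\JNb| \leq 2^{N(H(X,Y)+\beta)}$
 terms, by part 2 of the theorem; and the final equality is the identity
 $\I(X;Y) = H(X) + H(Y) - H(X,Y)$.
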
 A cartoon of the jointly-typical set is shown in
 \figref{fig.joint.typ}.
% The property just proved, that t
 Two
 independent typical vectors are jointly typical with probability
\beq
 P(
 (\bx' ,\by') \in \JNb ) \simeq 2^{-N ( \I(X;Y))}
\eeq
% because
%, is readily understood by noticing that
 because the {\em total\/} number of independent typical pairs is the
 area of the dashed rectangle, $2^{NH(X)} 2^{NH(Y)}$, and the number of
 jointly-typical pairs is roughly $2^{NH(X,Y)}$, so the probability of hitting
 a jointly-typical pair is roughly 
\beq
	2^{NH(X,Y)}/2^{NH(X)+NH(Y)} = 2^{-N\I(X;Y)}.
\eeq
%
% the above eq was in-line but it looked ugly 
%
\newcommand{\rad}{0.81}
\begin{figure}
\small
\figuremargin{%
\begin{center}\small
\setlength{\unitlength}{1mm}% original picture is 9.75 in by 5.25 in
\begin{picture}(74,105)(-15,-5)
%
\put(-10,-7){\framebox(62,99){}}
% as well as box put Ax and Ay sizes
\put(0,93.5){\vector(-1,0){10}}
\put(0,93.5){\vector(1,0){52}}
\put(-11,8){\vector(0,-1){15}}
\put(-11,8){\vector(0,1){84}}
%\put(0,92){\vector(-1,0){10}}
%\put(0,92){\vector(1,0){52}}
%\put(-10,8){\vector(0,-1){15}}
%\put(-10,8){\vector(0,1){84}}
%
% width indicator
\put(21,90){\vector(1,0){21}}
\put(21,90){\vector(-1,0){21}}
\put(21,88.7){\makebox(0,0)[t]{$2^{NH(X)}$}}
%
% height indicator
\put(-2,45){\vector(0,1){43}}
\put(-2,45){\vector(0,-1){43}}
\put(0,30){\makebox(0,0)[l]{$2^{NH(Y)}$}}% was 45
%
% RECTANGLE
%\put(-1,0){\framebox(45,89){}}
\put(-1,1){\dashbox{1}(44.5,88){}}
%
% strip width indicator
\put(26,35){\vector(1,0){2}}
\put(26,35){\vector(-1,0){2}}
\put(26,15){\vector(1,0){2}}
\put(26,15){\vector(-1,0){2}}
\put(26,14){\makebox(0,0)[t]{$2^{NH(X|Y)}$}}
%
% strip height indicator
\put(21,45){\vector(0,1){5}}
\put(21,45){\vector(0,-1){5}}
\put(28,45){\vector(0,1){5}}% was at 31,32
\put(28,45){\vector(0,-1){5}}
\put(29,45){\makebox(0,0)[l]{$2^{NH(Y|X)}$}}
%
% JT set
\multiput(2,88)(2,-4){21}{\circle*{\rad}}
\multiput(2,86)(2,-4){21}{\circle*{\rad}}
\multiput(0,86)(2,-4){22}{\circle*{\rad}}
\multiput(0,88)(2,-4){22}{\circle*{\rad}}
\multiput(0,82)(2,-4){21}{\circle*{\rad}}
\multiput(0,84)(2,-4){21}{\circle*{\rad}}
%
%\put(38,20){\makebox(0,0)[l]{$2^{NH(X,Y)}$}}
\put(18,64){\makebox(0,0)[l]{$2^{NH(X,Y)}$ dots}}
\put(21,96){\makebox(0,0)[b]{$\A_{X}^N$}}
\put(-12,45){\makebox(0,0)[r]{$\A_{Y}^N$}}
\end{picture}
\end{center}
}{%
\caption[a]{{The jointly-typical set.} The horizontal direction 
 represents $\A_{X}^N$, the set of all input strings of length $N$. 
 The vertical direction 
 represents $\A_{Y}^N$, the set of all output strings of length $N$.
 The outer box contains all conceivable input--output pairs.
 Each dot represents
 a jointly-typical pair of sequences $(\bx,\by)$.
 The total number of jointly-typical sequences is about $2^{NH(X,Y)}$.
% [Compare with \protect\figref{fig.extended.bec}a,
%  \protect\pref{fig.extended.bec}.]
% page \protect\pageref{fig.extended.bec}.]
}
\label{fig.joint.typ}
}
\end{figure}



\section{Proof of the noisy-channel coding theorem}
\subsection{Analogy}
 Imagine that we wish to prove that there is a baby\index{weighing babies} in a class
 of one hundred babies who weighs less than 10\kg. Individual babies
 are difficult to catch and weigh.%
\amarginfig{c}{
\begin{center}
\mbox{\psfig{figure=figs/babiesscale4.ps,width=53mm}}
\end{center}
\caption[a]{Shannon's method for
 proving one baby weighs less than 10\kg.}
}
 Shannon's method of\index{Shannon, Claude}
 solving the task is to scoop up all the babies  and weigh them
 all at once  on  a big weighing machine. If we find that their {\em average\/} weight is
%  smaller than 1000\kg\ then the children's average weight
% must be
 smaller than 10\kg, there must exist {\em at least one\/}
 baby who weighs less than 10\kg\ -- indeed there must be many!
% In the context of weighing children,
 Shannon's method isn't guaranteed to reveal  the existence of an underweight child,
 since it relies on
 there being a tiny number of elephants in the class. But if we use his method
 and get a total weight smaller than 1000\kg\ then our task is solved.
 
\subsection{From skinny children to fantastic codes}
 We wish to show that there exists a code and a decoder having small
 probability of error. Evaluating the probability of error of any
 particular coding and decoding
 system is not easy.  Shannon's innovation was this: instead of
 constructing a good coding and decoding
 system and evaluating its error probability,
 Shannon 
 calculated the average probability of block error of {\em all\/}
 codes, and proved that this average is small. There must then exist
 individual codes that have small probability of block error.
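
 In symbols, the averaging argument runs as follows. This is a sketch: the
 notation $p_{\rm B}(\C)$ for the probability of block error of a particular
 code $\C$, and $\delta$ for the bound on the average, are introduced here
 for illustration.
\[
	\sum_{\C} P(\C) \, p_{\rm B}(\C) < \delta
	\:\:\: \Rightarrow \:\:\:
	\mbox{there exists a code $\C$ with $p_{\rm B}(\C) < \delta$} ,
\]
 since if every code had $p_{\rm B}(\C) \geq \delta$, the average would be
 at least $\delta$ too.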

% Finally
% to prove that the {\em maximal\/} probability of error is small too,
% we modify one of these good codes by throwing away the worst 50\%
% of its codewords. 
 

\begin{figure}
\small
\figuremargin{%
\begin{center}
\begin{tabular}{cc}
\setlength{\unitlength}{0.81mm}% original picture is 9.75 in by 5.25 in
%\begin{picture}(74,100)(-15,-5)
\begin{picture}(62,100)(-5,-5)
%
%\put(-10,-2){\framebox(62,94){}}
% codewords
\put( 5,0){\framebox(2,91){}}
\put(13,0){\framebox(2,91){}}
\put(31,0){\framebox(2,91){}}
\put(35,0){\framebox(2,91){}}
%
\put(5,94){\makebox(0,2.5)[bl]{$\bx^{(3)}$}}
\put(13,94){\makebox(0,2.5)[bl]{$\bx^{(1)}$}}
\put(29,94){\makebox(0,2.5)[bl]{$\bx^{(2)}$}}
\put(37,94){\makebox(0,2.5)[bl]{$\bx^{(4)}$}}
% JT set
\multiput(2,88)(2,-4){21}{\circle*{\rad}}
\multiput(2,86)(2,-4){21}{\circle*{\rad}}
\multiput(0,86)(2,-4){22}{\circle*{\rad}}
\multiput(0,88)(2,-4){22}{\circle*{\rad}}
\multiput(0,82)(2,-4){21}{\circle*{\rad}}
\multiput(0,84)(2,-4){21}{\circle*{\rad}}
%
%\put(21,96){\makebox(0,0)[b]{$\A_{X}^N$}}
%\put(-12,45){\makebox(0,0)[r]{$\A_{Y}^N$}}
\end{picture}
&
\setlength{\unitlength}{0.81mm}
\begin{picture}(78,100)(-15,-5)
%
%\put(-10,-2){\framebox(62,94){}}
% codewords
\put(5,0){\framebox(2,91){}}
\put(13,0){\framebox(2,91){}}
\put(31,0){\framebox(2,91){}}
\put(35,0){\framebox(2,91){}}
%
\put(5,94){\makebox(0,2.5)[bl]{$\bx^{(3)}$}}
\put(13,94){\makebox(0,2.5)[bl]{$\bx^{(1)}$}}
\put(29,94){\makebox(0,2.5)[bl]{$\bx^{(2)}$}}
\put(37,94){\makebox(0,2.5)[bl]{$\bx^{(4)}$}}
%
% decodings
\put(-13,10){\makebox(0,0)[r]{$\by_c$}}
\put(-13,20){\makebox(0,0)[r]{$\by_d$}}
\put(-13,72){\makebox(0,0)[r]{$\by_b$}}
\put(-13,82){\makebox(0,0)[r]{$\by_a$}}
\put(-11.3,10){\vector(1,0){63}}
\put(-11.3,20){\vector(1,0){63}}
\put(-11.3,72){\vector(1,0){63}}
\put(-11.3,82){\vector(1,0){63}}
\put(54,10){\makebox(0,0)[l]{$\hat{\cwm}(\by_c)\eq 4$}}% was 10,
\put(54,20){\makebox(0,0)[l]{$\hat{\cwm}(\by_d)\eq 0$}}% was 25,
\put(54,72){\makebox(0,0)[l]{$\hat{\cwm}(\by_b)\eq 3$}}
\put(54,82){\makebox(0,0)[l]{$\hat{\cwm}(\by_a)\eq 0$}}
% top end
%
% JT set
\multiput(2,88)(2,-4){21}{\circle*{\rad}}
\multiput(2,86)(2,-4){21}{\circle*{\rad}}
\multiput(0,86)(2,-4){22}{\circle*{\rad}}
\multiput(0,88)(2,-4){22}{\circle*{\rad}}
\multiput(0,82)(2,-4){21}{\circle*{\rad}}
\multiput(0,84)(2,-4){21}{\circle*{\rad}}
%
%\put(21,96){\makebox(0,0)[b]{$\A_{X}^N$}}
%\put(-12,45){\makebox(0,0)[r]{$\A_{Y}^N$}}
\end{picture}
\\
(a) & (b) \\
\end{tabular}

\end{center}
}{%
\caption[a]{(a) {A \ind{random code}.}
% A random code is a selection of input 
% sequences $\{ \bx^{(1)}, \ldots, \bx^{(\cwM)}\}$ from the ensemble
% $X^N$. Each codeword 
% $\bx^{(\cwm)}$ is likely to be a typical sequence.
% [Compare  with \protect\figref{fig.extended.bec}b,
% page \protect\pageref{fig.extended.bec}.]

(b) {Example decodings by the typical set decoder.} A sequence that is not
 jointly typical 
 with any of the codewords, such as  $\by_a$, is decoded as $\hat{\cwm}=0$.
 A sequence that is  jointly typical 
 with  codeword $\bx^{(3)}$ alone, $\by_b$, is decoded as $\hat{\cwm}=3$.
 Similarly, $\by_c$ is decoded as $\hat{\cwm}=4$.
 A sequence that is  jointly typical 
 with more than one codeword, such as
 $\by_d$, is decoded as $\hat{\cwm}=0$.
% [Compare  with \protect\figref{fig.extended.bec}c,
% page \protect\pageref{fig.extended.bec}.]
}
\label{fig.rand.code}
\label{fig.typ.set.dec}
}
\end{figure}

\subsection{Random coding and typical-set decoding}
 Consider the following  encoding--decoding
 system, whose rate is $R'$.\index{random code}
\ben
\item
 We fix $P(x)$ and generate the $\cwM = 2^{NR'}$
 codewords of a $(N,NR')=(N,K)$
 code $\C$
 at random according to 
\beq
	P(\bx) = \prod_{n=1}^{N} P(x_n) . 
\eeq
 A random code is shown schematically in \figref{fig.rand.code}a.
\item
 The code is known to both sender and receiver. 
\item
 A message $\cwm$ is chosen  from $\{1,2,\ldots, 2^{NR'}\}$, and $\bx^{(\cwm )}$
 is transmitted. The received signal is $\by$, with
\beq
	P(\by  \given  \bx^{(\cwm )} ) =  \prod_{n=1}^{N} P(y_n \given x^{(\cwm )}_n) .
\eeq
\item
 The signal is decoded by {\dem{typical-set decoding}\index{typical-set decoder}}.
\begin{description}
\item[Typical-set decoding\puncspace] Decode
 $\by$ as $\hat{\cwm }$ {\sf if}
 $(\bx^{(\hat{\cwm })},\by)$ are jointly typical {\em
 and\/} there is no other $\cwm' $ such that $(\bx^{(\cwm')},\by)$ are jointly
 typical;\\
 {\sf otherwise} declare a failure $(\hat{\cwm }\eq 0)$.
\end{description}
 This is not
 the optimal decoding algorithm, but it will be good enough, and easier
 to \analyze. The typical-set decoder is illustrated in 
 \figref{fig.typ.set.dec}b.
\item
 A decoding error occurs if $\hat{\cwm } \not = \cwm $. 
\een

 There are three probabilities of error that we can distinguish.
 First, there is the probability of block error for a particular
 code $\C$, that is,
\beq
 p_{\rm B}(\C) \equiv P(\hat{\cwm } \neq \cwm  \given  \C).
\eeq
 This is
 a difficult quantity to evaluate for any given code.

 Second, there is the average  over all codes of this block error probability,
\beq
	\langle p_{\rm B} \rangle \equiv  \sum_{\C} P(\hat{\cwm } \neq \cwm  \given  \C)
								P(\C) .
\eeq
 Fortunately, this quantity is much easier to evaluate than
 the first quantity $P(\hat{\cwm } \neq \cwm  \given  \C)$.%
\marginpar{\small\raggedright{$\langle p_{\rm B} \rangle$
 is just the probability that there is a decoding error
 at step 5 of the process on the previous page.}}

 Third, the maximal block error probability of a code $\C$,
\beq
 p_{\rm BM}(\C) \equiv \max_{\cwm }  P(\hat{\cwm } \neq \cwm  \given \cwm, \C),
\eeq
 is the quantity we are most interested in: we wish to show
 that there exists a code $\C$ with the required rate
 whose maximal block error probability is small.

 We will get to this result by first finding the
 average block error probability, $\langle p_{\rm B} \rangle$.
 Once we have shown that this can be made smaller than
 a desired small number, we immediately deduce that
 there must exist {\em at least one\/} code $\C$
 whose  block error probability is also less than this
 small number.  Finally, we show that this code, whose
 block error probability is satisfactorily small but whose 
  maximal block error probability is unknown (and could
 conceivably be enormous), can be
 modified to make a code of slightly smaller rate whose
 maximal block error probability
 is also guaranteed to be small.
 We modify  the code by throwing away the worst 50\%
 of its codewords. 
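
 To see why discarding the worst half works, here is a sketch of the counting
 argument. The symbols $\epsilon$ and $p_{\cwm}$ are introduced here for
 illustration, and the messages are assumed to be equiprobable.
 Write $p_{\cwm} \equiv P(\hat{\cwm} \neq \cwm \given \cwm, \C)$, so that
 $p_{\rm B}(\C) = 2^{-NR'} \sum_{\cwm} p_{\cwm}$. If $p_{\rm B}(\C) < \epsilon$, then
\beq
	2 \epsilon \left| \{ \cwm : p_{\cwm} \geq 2 \epsilon \} \right|
	\:\leq\: \sum_{\cwm} p_{\cwm} \:<\: \epsilon \, 2^{NR'} ,
\eeq
 so fewer than half of the codewords have $p_{\cwm} \geq 2 \epsilon$.
 Keeping the best $2^{NR'-1}$ codewords therefore leaves a code in which every
 codeword has $p_{\cwm} < 2 \epsilon$ (with typical-set decoding, deleting
 codewords only removes rivals, so no $p_{\cwm}$ increases), and its rate is
\beq
	\frac{NR' - 1}{N} = R' - \frac{1}{N} .
\eeq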

 We therefore  now embark on finding the average probability of block error.

\subsection{Probability of error of typical-set decoder}
 There are two  sources of error when we use typical-set
 decoding.  Either (a) the output $\by$ is not jointly typical with the
 transmitted codeword $\bx^{(\cwm )}$, or (b) there is some other codeword
 in $\C$ that is
 jointly typical with $\by$.

 By the  symmetry of the code construction, the probability of error 
 averaged over all codes  does not depend on the selected value of $\cwm$; we can 
 assume without loss of generality that $\cwm=1$. 

 (a) The probability that the input  $\bx^{(1)}$ and
 the output $\by$ are not jointly typical
 vanishes,  by the joint typicality theorem's first part
 (\pref{theorem.jtt}).
 We give a name, $\delta$, to the upper bound on this probability,
% .  
 satisfying $\delta
 \rightarrow 0$ as $N \rightarrow \infty$; for any desired $\delta$,
 we can find a  blocklength  $N(\delta)$ such that $P( (\bx^{(1)},\by) \not \in
 \JNb) \leq \delta$.

 (b) The probability that  $\bx^{(\cwm')}$ and $\by$
% $(\bx^{(\cwm' )},\by)$
 are jointly typical, for
 a {\em given\/}  $\cwm' \not = 1$,
 is $\leq 2^{-N(\I(X;Y)-3 \beta)}$, by part 3 of the same theorem.
 And there are $(2^{NR'}-1)$ rival values of $\cwm'$ to worry about.

 Thus the average probability of error $\langle p_{\rm B} \rangle$
 satisfies:
\beqan
	\langle	p_{\rm B} \rangle  &\leq &
	 \delta + \sum_{\cwm' =2}^{2^{NR'}} 2^{-N(\I(X;Y)-3 \beta)}
\label{eq.uniona}
\\
&\leq &
	\delta + 2^{-N(\I(X;Y)- R' -3 \beta)} .
\label{eq.unionaa}
\eeqan
% MARGINPAR should align with the eqn if possible (above)
\begin{aside}
{The inequality (\ref{eq.uniona}) that bounds a
 total probability of error $P_{\rm TOT}$ by the sum of the probabilities $P_{s'}$ of
 all sorts of events $s'$ each of which is sufficient to cause error,
 $$P_{\rm TOT} \leq P_1 + P_2 + \cdots, $$
 is called a {\dem\ind{union bound}}. It is only an equality if the different events
 that cause error never occur at the same time as each other.
}
\end{aside}
 The  average probability of error (\ref{eq.unionaa})
 can be made $< 2 \delta$  by increasing  $N$ if 
% {\em if\/} 
\beq
	R' < \I(X;Y) -3 \beta .
\eeq
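 As a sketch of the arithmetic behind this statement: the sum in
 (\ref{eq.uniona}) has $2^{NR'}-1$ identical terms, so
\beq
	(2^{NR'}-1) \, 2^{-N(\I(X;Y)-3 \beta)}
	\:\leq\: 2^{NR'} \, 2^{-N(\I(X;Y)-3 \beta)}
	\:=\: 2^{-N(\I(X;Y)- R' -3 \beta)} ,
\eeq
 and this second term of (\ref{eq.unionaa}) is smaller than $\delta$ as soon as
\beq
	N > \frac{\log_2 (1/\delta)}{\I(X;Y) - R' - 3 \beta} ,
\eeq
 which can be arranged only if the denominator is positive. The blocklength
 must also be at least the $N(\delta)$ required in (a).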
 We are almost there. We make three modifications:
\newcommand{\expurgfig}[1]{%
\hspace*{-0.3in}\raisebox{-0.975in}[2.05in][0pt]{\psfig{figure=figs/expurgate#1.ps,width=3.2in}}\hspace*{-0.3in}}
\begin{figure}
\figuremargin{
%\marginfig{
\begin{center}\small
\begin{tabular}{c@{}c@{}c}
\expurgfig{1}
&$\Rightarrow$ &
\expurgfig{2}
\\
(a) A random code $\ldots$ & &
(b) expurgated    \\
\end{tabular}
\end{center}
}{
\caption[a]{How expurgation works.
 (a) In a typical random code, a small  fraction of the
 codewords are involved in  collisions -- pairs of codewords are sufficiently
 close to each other that the probability of error when either codeword
 is transmitted is not tiny.
 We obtain a new code from a random code by deleting
 all these confusable codewords.
 (b) The resulting code has slightly fewer  codewords, so
 has a slightly lower rate, and its maximal probability  of error
 is greatly reduced.
}
\label{fig.expurgate}
}
\end{figure}
% \newcommand{\optens}{optimal input distribution}
\ben
\item
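	We choose $P(x)$ in the proof to be the \optens\ of the channel. 
 	Then the condition $R'<\I(X;Y) -3 \beta$ becomes $R' < C - 3 \beta$.
\item
	Since the average probability of error over all codes is $< 2 \delta$,
	there must exist a code with mean probability of block error
	$p_{\rm B}(\C) < 2 \delta$.
\item
	To show that the maximal probability of error $p_{\rm BM}$ can also be
	made small, we modify this code by throwing away the worst half of
	its codewords, the ones most likely to produce errors. Those that remain
	must all have conditional probability of error less than $4 \delta$.
	The new code has $2^{NR'-1}$ codewords, so its rate is $R' - 1/N$,
	a negligible reduction if $N$ is large (\figref{fig.expurgate}).
\een

 Turning to rates above capacity: a system that achieves a rate $R$ and a bit
 error probability $p_{\rm b}$ must have mutual information
 $\I(\cwm;\hat{\cwm }) \geq N R ( 1 - H_2(p_{\rm b}))$. But
 $\I(\cwm;\hat{\cwm }) > N C$ is not achievable, so $R > \smallfrac{C}{1-H_2(p_{\rm b})}$
 is not achievable.\ENDproof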

\exercisxC{3}{ex.m.s.I.aboveC}{
 Fill in the details in the preceding  argument.
 If the bit errors between $\hat{\cwm }$ and $\cwm$ are independent
 then we have $\I(\cwm;\hat{\cwm }) = N R ( 1 - H_2(p_{\rm b}))$. What if
 we have complex correlations among those bit errors? Why
 does the inequality  $\I(\cwm;\hat{\cwm }) \geq
 N R ( 1 - H_2(p_{\rm b}))$ hold?
} 

\section{Computing capacity\nonexaminable}
\label{sec.compcap}
 We\marginpar[c]{\small\raggedright{Sections \ref{sec.compcap}--\ref{sec.codthmpractice}
 contain advanced material. The first-time reader is encouraged to
 skip to  section \ref{sec.codthmex} (\pref{sec.codthmex}).}}
 have proved that the capacity of a channel is
 the maximum rate at which reliable communication can be achieved.
 How can we compute the capacity of a given discrete
 memoryless channel?
 We need to find its \optens. In general we can find
 the   \optens\ by  a 
 computer search, making use of the derivative of the mutual information
 with respect to the input probabilities.
\exercisxB{2}{ex.Iderivative}{
 Find the derivative of $\I(X;Y)$ with respect to the
 input probability $p_i$, $\partial \I(X;Y)/\partial p_i$,  for a channel with
 conditional probabilities $Q_{j|i}$. 
} 
\exercisxC{2}{ex.Iconcave}{
 Show that  $\I(X;Y)$ is a \concavefrown\ function of
 the input probability vector $\bp$.
}
 Since  $\I(X;Y)$ is  \concavefrown\ in the input distribution $\bp$,
 any probability distribution  $\bp$ at which
% that has $\partial \I(X;Y)/\partial p_i$
 $\I(X;Y)$ is stationary 
 must be a global maximum of  $\I(X;Y)$. 
% 
 So it is tempting to put the derivative of $\I(X;Y)$ into a routine that
 finds a local maximum of $\I(X;Y)$, that is,  an input distribution
 $P(x)$ such that
\beq
	\frac{\partial \I(X;Y)}{\partial p_i}
 = \lambda \:\:\: \mbox{for all $i$},
\label{eq.Imaxer}
\eeq
 where $\lambda$ is a Lagrange multiplier associated with the constraint
 $\sum_i p_i = 1$.
 However, this approach may fail to find the right
 answer, because $\I(X;Y)$ might be maximized
 by a distribution that has $p_i \eq 0$ for some inputs.
 A simple example is given by the ternary confusion channel.
\begin{description}
% 
\item[Ternary confusion channel\puncspace] $\A_X \eq  \{0,{\query},1\}$. $\A_Y \eq  \{0,1\}$.
\[
\begin{array}{c}
\setlength{\unitlength}{0.46mm}
\begin{picture}(20,30)(0,0)
\put(5,5){\vector(1,0){10}}
\put(5,25){\vector(1,0){10}}
\put(5,15){\vector(1,1){10}}
\put(5,15){\vector(1,-1){10}}
\put(4,5){\makebox(0,0)[r]{1}}
\put(4,25){\makebox(0,0)[r]{0}}
\put(16,5){\makebox(0,0)[l]{1}}
\put(16,25){\makebox(0,0)[l]{0}}
\put(4,15){\makebox(0,0)[r]{{\query}}}
\end{picture}
\end{array}
\begin{array}{c@{\:\:\,}c@{\:\:\,}l}
 P(y\eq 0 \given x\eq 0) &=& 1 \,; \\
 P(y\eq 1 \given x\eq 0) &=& 0 \,; 
\end{array} 
\begin{array}{c@{\:\:\,}c@{\:\:\,}l}
 P(y\eq 0 \given x\eq {\query}) &=& 1/2 \,; \\
 P(y\eq 1 \given x\eq {\query}) &=& 1/2 \,; 
\end{array} 
\begin{array}{c@{\:\:\,}c@{\:\:\,}l}
 P(y\eq 0 \given x\eq 1) &=& 0 \,; \\
 P(y\eq 1 \given x\eq 1) &=& 1 .
\end{array} 
\]
 Whenever the input $\mbox{\query}$ is used, the output is random;
 the other inputs are reliable inputs. The maximum information
 rate of 1 bit is achieved by making no use of the
 input $\mbox{\query}$.
\end{description}
\exercissxB{2}{ex.Iternaryconfusion}{
 Sketch the mutual information for this channel as a function of
% $$a\in (0,1)$ and $b\in (0,1)$,
  the input distribution $\bp$.  Pick a convenient two-dimensional
 representation of $\bp$.
}
 The \ind{optimization} routine must therefore take account
 of the possibility that, as we go up hill on $\I(X;Y)$,
 we may run into the  inequality constraints $p_i \geq 0$.
\exercissxB{2}{ex.Imaximizer}{
 Describe the condition, similar to \eqref{eq.Imaxer}, that is satisfied at a
 point where  $\I(X;Y)$ is maximized, and describe a computer
 program for finding the capacity of a channel.
}
\subsection{Results that may help in finding the \optens}
% The following results 
\ben
\item
{All outputs must be used}.
\item
{$\I(X;Y)$ is a \convexsmile\ function of the channel parameters.}\marginpar{\small\raggedright  {\sf Reminder:} The term `\convexsmile' means `convex',
 and the term `\concavefrown' means `concave'; the little
 smile and frown symbols are included simply to remind you what
 convex and concave mean.}
\item
{There may be several {\optens}s, but they all look the same at the output.}
\een
%\subsubsection{All outputs must be used\subsubpunc}
\exercisxB{2}{ex.Iallused}{
  Prove that no output $y$ is unused by an \optens, unless it is  unreachable,
 that is, has $Q(y \given x)=0$ for all $x$. 
}
%\subsubsection{Convexity of $\I(X,Y)$ with respect to the channel parameters\subsubpunc}
\exercisxC{2}{ex.Iconvex}{
 Prove that 
 $\I(X;Y)$ is a \convexsmile\ function of $Q(y \given x)$.
}
%\subsubsection{There may be several {\optens}s, but they all look the same at the output\subsubpunc}
\exercisxC{2}{ex.Imultiple}{
 Prove that all {\optens}s of a channel have the same output
 probability distribution $P(y) = \sum_x P(x)Q(y \given x)$. 
}
 These results, along with the fact that
  $\I(X;Y)$ is a \concavefrown\ function of
 the input probability vector $\bp$, prove the validity of
 the symmetry argument that we have used when finding
 the capacity of symmetric channels.
 If a channel is invariant under  a group of symmetry
 operations -- for example, interchanging the
 input symbols and interchanging the output symbols --
 then, given any \optens\ that is not
 symmetric, \ie, is not invariant under these operations,
 we can create another input distribution
 by averaging together this \optens\ and all
%
% WORDY!!!!!!!!!!!
%
 its permuted forms that we can make by applying the
 symmetry operations to the original \optens.
 The permuted distributions must have the same
 $\I(X;Y)$ as the original, by symmetry,  so the 
 new input distribution created by averaging must
 have  $\I(X;Y)$  bigger than or equal to that of the
 original distribution, because of the concavity
 of $\I$.
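
 In symbols, the averaging argument can be sketched as follows. The notation
 is introduced here for illustration: $G$ is the group of symmetry operations,
 $g \bp$ is the input distribution obtained by applying $g \in G$ to $\bp$, and
 $\I(\bp)$ denotes the mutual information when the input distribution is $\bp$.
 The averaged distribution is
 $\bar{\bp} = \frac{1}{|G|} \sum_{g \in G} g \bp$, and
\beq
	\I(\bar{\bp}) \:\geq\: \frac{1}{|G|} \sum_{g \in G} \I( g \bp ) \:=\: \I(\bp) ,
\eeq
 where the inequality is the concavity of $\I$ and the equality holds because
 $\I(g \bp) = \I(\bp)$ for every $g$. If $\bp$ is an \optens\ then
 $\I(\bar{\bp})$ cannot exceed $\I(\bp)$, so $\bar{\bp}$ is an \optens\ too,
 and it is invariant under the symmetry operations.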

% see capacity.p
\subsection{Symmetric channels}
\label{sec.Symmetricchannels}
 In order to use symmetry arguments, it will help 
 to have a definition of a symmetric channel.
 I like \quotecite{Gallager68}
% Gallager's
 definition.\index{Gallager, Robert}

% page 94
%\subsubsection{Gallager's definition of a symmetric channel}
\begin{description}
\item[A discrete memoryless channel is a symmetric channel]
 if the set of outputs can be partitioned  into subsets
 in such a way that for each subset the matrix of
 transition probabilities
% (using inputs as columns and outputs in the subset as rows)
 has the property
 that each row (if more than 1) is a permutation of each other row
 and each column is a permutation of each other column.
\end{description}
\exampl{exSymmetric}{
 This channel 
\beq
\begin{array}{c@{\:\:\,}c@{\:\:\,}l}
 P(y\eq 0 \given x\eq 0) &=& 0.7 \,; \\
 P(y\eq {\query} \given x\eq 0) &=& 0.2 \,;   \\
 P(y\eq 1 \given x\eq 0) &=& 0.1 \,; 
\end{array} 
\begin{array}{c@{\:\:\,}c@{\:\:\,}l}
 P(y\eq 0 \given x\eq 1) &=& 0.1 \,; \\
 P(y\eq {\query} \given x\eq 1) &=& 0.2 \,; \\
 P(y\eq 1 \given x\eq 1) &=& 0.7
\end{array} 
\eeq
 is a symmetric channel because
 its outputs can be partitioned into $(0,1)$ and ${\query}$, so that
 the matrix can be rewritten:
\beq
\begin{array}{cc} \midrule
\begin{array}{ccl}%{c@{}c@{}l}
 P(y\eq 0 \given x\eq 0) &=& 0.7 \,; \\
 P(y\eq 1 \given x\eq 0) &=& 0.1 \,; 
\end{array}
&
\begin{array}{ccl}%{c@{}c@{}l}
 P(y\eq 0 \given x\eq 1) &=& 0.1 \,; \\
 P(y\eq 1 \given x\eq 1) &=& 0.7 \,;
\end{array}
\\ \midrule
\begin{array}{ccl}%{c@{}c@{}l}
 P(y\eq {\query} \given x\eq 0) &=& 0.2 \,;   \\
\end{array}
&
\begin{array}{ccl}%{c@{}c@{}l}
 P(y\eq {\query} \given x\eq 1) &=& 0.2 . \\
\end{array}
\\ \midrule
\end{array} 
%
\eeq
}
 Symmetry is a useful property because, as we will see
 in a later chapter, 
 communication at capacity can be achieved  over symmetric channels
 by {\em{linear}\/} codes.\index{error-correcting code!linear}\index{linear block code}
% that are good codes
%-- a considerable simplification of the task of finding excellent codes.

\exercisxC{2}{ex.Symmetricoptens}{
 Prove that for a \ind{symmetric channel} with any
 number of inputs,\index{channel!symmetric}
 the uniform distribution over the inputs is an {\optens}.
}
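
 For instance, taking the result of this exercise on trust, the capacity of
 the symmetric channel in the example above can be evaluated directly.
 This is a worked sketch with rounded numbers; $H(\cdot,\cdot,\cdot)$ denotes
 the entropy, in bits, of the probability vector given as its argument.
 With uniform inputs the output probabilities are
 $(0.4, 0.2, 0.4)$ for $y = 0, {\query}, 1$ respectively, so
\beqan
	C &=& H(Y) - H(Y \given X) \:=\: H(0.4,0.2,0.4) - H(0.7,0.2,0.1) \nonumber \\
	  &\simeq& 1.522 - 1.157 \:=\: 0.365 \mbox{ bits} .
\eeqan
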
\exercissxB{2}{ex.notSymmetric}{
 Are there channels that are not symmetric whose {\optens}s are uniform? 
 Find one, or prove there are none.
}

\section{Other coding theorems}% this star indicates skippable
\label{sec.othercodthm}
 The noisy-channel coding theorem  that we  proved in this chapter
 is quite general, applying to any discrete memoryless channel;
 but it is not very specific. The theorem  only says that
 reliable communication with error probability $\epsilon$ and rate $R$
%  can be achieved over a  channel
  can be achieved
 by using codes with {\em sufficiently large\/} blocklength $N$.
 The theorem does not say how large $N$ needs to be
% as a function
 to achieve  given values 
 of $R$ and $\epsilon$.

 Presumably, the smaller $\epsilon$ is
 and the closer $R$ is to $C$, the larger $N$ has to be.
% The task of proving explicit results about the blocklength
% is challenging and solutions to this problem are considerably
% more complex than the theorem we proved in this chapter.
 
%\begin{figure}
\marginfig{
\begin{center}
\mbox{\raisebox{0.5in}{$E_{\rm r}(R)$}\psfig{figure=figs/Er.eps,width=0.97in}}
\end{center}
\caption[a]{A typical random-coding exponent.}
\label{fig.Er}
%\end{figure}
}%\end{marginfig}
%
%
\subsection{Noisy-channel coding theorem -- version with 
 explicit $N$-dependence}
% explicit blocklength dependence}
\index{noisy-channel coding theorem}
\begin{quote}
 For a discrete memoryless channel, a blocklength $N$
 and a rate $R$, there exist block codes of length $N$
 whose average probability of error satisfies:
\beq
	p_{\rm B} \leq \exp \left[ -N  E_{\rm r}(R) \right] 
\label{eq.pbEr}
\eeq
 where $E_{\rm r}(R)$ is the {\dem\ind{random-coding exponent}\/}
 of the channel, a  \convexsmile, decreasing, positive function of $R$
%which
% satisfies
%\beq
% E_{\rm r}(R) > 0 \:\: \mbox{for all $R$ satisfying $0 \leq R < C$} .
%\eeq
 for  $0 \leq R < C$. The {random-coding exponent}
 is also known as the \ind{reliability function}.

 [By an \ind{expurgation} argument it can also be shown that
 there exist block codes for which  the {\em{maximal\/}} probability of error
 $p_{\rm BM}$
% , like  $p_{\rm B}$ in \eqref{eq.pbEr},
 is also exponentially small in $N$.] 
\end{quote}
 The definition of $E_{\rm r}(R)$  is
 given in \citeasnoun{Gallager68}, p.$\,$139.
 $E_{\rm r}(R)$  approaches
 zero as $R \rightarrow C$; the typical behaviour of this function
 is illustrated in \figref{fig.Er}.
 The computation of the {random-coding exponent}
 for interesting channels is a challenging task
 on which much effort has been expended. Even for simple
 channels like the \BSC, there is no simple expression for  $E_{\rm r}(R)$.  
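
 To illustrate how the bound (\ref{eq.pbEr}) translates into a blocklength,
 here is a sketch; the value of $E_{\rm r}(R)$ used below is hypothetical,
 chosen only to make the arithmetic concrete. The bound guarantees
 $p_{\rm B} \leq \epsilon$ as soon as
\beq
	\exp \left[ -N  E_{\rm r}(R) \right] \leq \epsilon
	\:\:\: \Leftrightarrow \:\:\:
	N \geq \frac{\ln (1/\epsilon)}{E_{\rm r}(R)} .
\eeq
 If, say, $E_{\rm r}(R) = 0.01$ at the rate of interest, then
 $\epsilon = 10^{-6}$ is guaranteed for $N \geq 1382$, since
 $\ln 10^{6} \simeq 13.8$. This is a sufficient blocklength, not a necessary one.
 Because $E_{\rm r}(R)$ approaches zero as $R \rightarrow C$, the blocklength
 obtained in this way grows without limit as the rate approaches capacity.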

\subsection{Lower bounds on the error probability as a function of blocklength}
 The theorem stated above
% gives an upper bound on the error probability:
 asserts that there are codes with  $p_{\rm B}$  smaller than $\exp \left[ -N  E_{\rm r}(R) \right]$.
 But how  small can  the error probability be?   Could it be much smaller? 
\begin{quote}
 For any code with blocklength $N$ on a discrete memoryless channel,
 the probability of error assuming all source messages are
 used with equal probability satisfies
\beq
 p_{\rm B} \gtrsim \exp[ - N E_{\rm sp}(R) ] ,
\eeq
 where the function  $E_{\rm sp}(R)$,
 the {\dem\ind{sphere-packing exponent}\/} of the channel,
 is   a  \convexsmile, decreasing, positive function of $R$
 for  $0 \leq R < C$.
\end{quote}
 For a precise statement of this result and further  references,
 see \citeasnoun{Gallager68}, \mbox{p.$\,$157}.\index{Gallager, Robert}
 
\section{Noisy-channel coding theorems and coding practice}
\label{sec.codthmpractice}
 Imagine a customer who wants to buy an error-correcting
 code and decoder for a noisy channel.
 The results described above allow us to offer
 the following service: if he tells us the properties of
 his channel, the desired rate $R$ and the desired error probability $p_{\rm B}$,
 we can, after working out the relevant functions
 $C$,  $E_{\rm r}(R)$, and $E_{\rm sp}(R)$, advise him
 that there exists a solution to his problem using a particular
 blocklength $N$; indeed that almost any randomly
 chosen  code with that  blocklength
 should do the job. Unfortunately we have
 not found out how to implement these encoders
 and decoders in practice; the cost of implementing
 the encoder and decoder for a random code with large $N$ would
 be exponentially large in $N$. 

 Furthermore, for practical purposes, the customer is unlikely
 to know exactly what channel he is dealing with.
% and might be  reluctant to specify a desired rate
 So \citeasnoun{Berlekamp80} suggests that\index{Berlekamp, Elwyn}
 the sensible way to approach error-correction
 is to design  encoding-decoding systems
 and plot their performance on a {\em variety\/}
 of idealized channels
 as a function of the channel's noise level. These charts (one of which
 is illustrated on page \pageref{fig:GCResults})
 can then be shown to the customer, who can choose
 among the systems on offer without having to
 specify what he really thinks his channel is like.
 With this attitude to the practical problem, the importance of the
 functions $E_{\rm r}(R)$ and $E_{\rm sp}(R)$ is  diminished. 

%
% put this back somewhere. :
%
%
%\subsection{Noisy-channel coding theorem with errors allowed: 
%	rate-distortion theory}
% See Gallager p.466$\pm 20$. 
%
%\subsection{Special case of linear codes}
% Give Gallager's p.94 definition of a discrete symmetric channel.
% Give coding theorem for linear codes on any symmetric channel
% (including with memory). 
% 
%\subsection{More general case of
% channels with memory}
%
%\subsection{Finite state channels}
% Channels with and without intersymbol interference and
% with and without noise.  (Is it worth discussing these in any
% individual detail, or shall 
% I just have a general channels with memory discussion?)
%  
% end detour
\section{Further exercises}
\label{sec.codthmex}
\exercisaxA{2}{ex.exam01}{
 A binary erasure channel with input $x$ and output $y$
 has transition probability matrix:
\[
 \bQ = \left[
\begin{array}{cc}
1-q & 0 \\
q   & q  \\
 0  & 1-q 
\end{array}
\right]
\hspace{1in}
\begin{array}{c}
\setlength{\unitlength}{0.13mm}
\begin{picture}(100,100)(0,0)
\put(18,0){\makebox(0,0)[r]{\tt 1}}
%
\put(18,80){\makebox(0,0)[r]{\tt 0}}
\put(20,0){\vector(1,0){38}}
\put(20,80){\vector(1,0){38}}
%
\put(20,0){\vector(1,1){38}}
\put(20,80){\vector(1,-1){38}}
%
\put(62,0){\makebox(0,0)[l]{\tt 1}}
\put(62,40){\makebox(0,0)[l]{\tt ?}}
\put(62,80){\makebox(0,0)[l]{\tt 0}}
\end{picture}
\end{array}
\]
 Find the {\em{mutual information}\/} $I(X;Y)$ between the input and output
 for general input distribution $\{ p_0,p_1 \}$, and show that the 
 {\em{capacity}\/} of this channel is $C = 1-q$ bits.
\medskip

\item
%\noindent (c) 
 A Z channel\index{channel!Z channel}
 has transition probability matrix:
\[
 \bQ = \left[
\begin{array}{cc}
1 &  q  \\
 0  & 1-q 
\end{array}
\right]
\hspace{1in}
\begin{array}{c}
\setlength{\unitlength}{0.1mm}
\begin{picture}(100,100)(0,0)
\put(18,0){\makebox(0,0)[r]{\tt 1}}
%
\put(18,80){\makebox(0,0)[r]{\tt 0}}
\put(20,0){\vector(1,0){38}}
\put(20,80){\vector(1,0){38}}
%
\put(20,0){\vector(1,2){38}}
%
\put(62,0){\makebox(0,0)[l]{\tt 1}}
\put(62,80){\makebox(0,0)[l]{\tt 0}}
\end{picture}
\end{array}
\]
 Show that, using
 a $(2,1)$ code,
%  of blocklength 2,
   {\bf{two}} uses of a Z channel
 can be made to emulate  {\bf{one}} use of an erasure channel, 
 and state the erasure probability of that erasure channel.
 Hence show that the 
 capacity of the Z channel, $C_{\rm Z}$,
 satisfies $C_{\rm Z} \geq \frac{1}{2}(1-q)$ bits.

 Explain why the result $C_{\rm Z} \geq \frac{1}{2}(1-q)$
 is an inequality rather than an equality.


}

\exercissxC{3}{ex.wirelabelling}{
 A \ind{transatlantic} cable contains $N=20$ indistinguishable
 electrical wires.\index{puzzle!transatlantic cable}\index{puzzle!cable labelling}
 You have the job of figuring out which
 wire is which, that is,
% Alice and Bob, located at the opposite ends of the
% cable, wish
 to create a consistent labelling of the wires at each end.
 Your only tools are the ability to connect wires to each other
 in groups of two or more, and to test for connectedness with
 a continuity tester.
 What is the smallest number of transatlantic trips you need to
 make, and how do  you do it?

 How would you solve the problem for larger $N$ such as $N=1000$?

 As an illustration, if $N$ were 3 then the task could be solved
 in two steps. Label one wire at one end $a$ and connect the other two together;
 cross the \ind{Atlantic}; measure which two wires are connected, label them
 $b$ and $c$, and label the unconnected one $a$; then connect $b$ to $a$
 and return across the Atlantic, whereupon, on disconnecting
 $b$ from $c$, the identities of  $b$ and $c$ can be deduced.

 This problem can be solved by persistent search,
 but the reason it is posed in this chapter is that it can
 also be solved by a greedy approach based on maximizing the acquired
 {\em information}.
 Let the unknown permutation of wires be $x$.
% , drawn from an ensemble $X$.
 Having chosen a set of connections of wires $\cal C$ at one end,
 you can then make measurements at the other end, and
 these measurements $y$ convey  {\em information\/} about $x$.
 How much? And for what set of connections is the information that $y$ conveys
 about $x$ maximized?
}




\dvips
\section{Solutions}% to Chapter \protect\ref{ch6}'s exercises} % 80,82,84,85,86
% solutions to _l6.tex
%
%\soln{ex.m.s.I.aboveC}{
%%\input{tex/aboveC.tex}
% {\em [More work needed here.]}
%}
%\soln{ex.Iderivative}{
%% Find derivative of $I$ w.r.t $P(x)$.
% Get a specific mutual information 
% like object minus $\log e$.
%}
%\soln{ex.Iconcave}{
% $\I(X,Y) = \sum_{x,y} P(x) Q(y|x) \log \frac{Q(y|x)}{P(x)Q(y|x)}$ 
% is a \concavefrown\ function of $P(x)$. 
% Easy Proof in Gallager p.90, using \verb+z->x->y+, where $z$ chooses 
% between the two things we are mixing.
% This satisfies $I(X;Y|Z) = 0$ 
% (data processing inequality). 
%}
\soln{ex.Iternaryconfusion}%
{
\marginpar{\[
\begin{array}{c}
\setlength{\unitlength}{1mm}
\begin{picture}(20,30)(0,0)
\put(5,5){\vector(1,0){8}}
\put(5,25){\vector(1,0){8}}
\put(5,15){\vector(1,1){8}}
\put(5,15){\vector(1,-1){8}}
\put(10,18){\makebox(0,0)[l]{\dhalf}}
\put(10,12){\makebox(0,0)[l]{\dhalf}}
\put(4,5){\makebox(0,0)[r]{\tt1}}
\put(4,25){\makebox(0,0)[r]{\tt0}}
\put(16,5){\makebox(0,0)[l]{\tt1}}
\put(16,25){\makebox(0,0)[l]{\tt0}}
\put(4,15){\makebox(0,0)[r]{\tt{?}}}
\end{picture}
\end{array}
\]
}
 If the input distribution is $\bp=(p_0,p_{\tt{?}},p_1)$,
 the mutual information is
\beq
	I(X;Y) = H(Y) - H(Y|X)
	= H_2(p_0 + p_{{\tt{?}}}/2)  - p_{{\tt{?}}} .
\eeq
 We can build
 a good sketch of this function in two ways:
 by careful inspection of the function, or
 by looking at special cases.

 For the plots, the two-dimensional
 representation of $\bp$ I will use
 has $p_0$ and $p_1$ as the independent
 variables, so that $\bp=(p_0,p_{\tt{?}},p_1) = (p_0,(1-p_0-p_1),p_1)$.
\medskip

\noindent {\sf By inspection.}
 If we use the quantities $p_* \equiv p_0 + p_{{\tt{?}}}/2$ and $p_{\tt{?}}$
 as our two degrees of freedom, the mutual information becomes
 very simple: $I(X;Y) = H_2(p_*) - p_{{\tt{?}}}$. Converting back to
 $p_0 =  p_* - p_{{\tt{?}}}/2$ and $p_1 = 1 - p_* - p_{{\tt{?}}}/2$,
 we obtain the sketch shown at the  left below.
 This function is like  a tunnel rising up in the direction of
 increasing $p_0$ and $p_1$.
 To obtain the required plot of $I(X;Y)$ we have to strip
 away the parts of this tunnel that live outside
 the feasible \ind{simplex} of
 probabilities; we do this by redrawing the surface,
  showing only the parts where $p_0>0$
 and $p_1>0$.  A full plot of the function is shown at the right.
\medskip

\begin{center}
 \mbox{%
\hspace*{2.3in}%
\makebox[0in][r]{\raisebox{0.3in}{$p_0$}}%
\hspace*{-2.3in}%
\raisebox{0in}[1.9in]{\psfig{figure=figs/confusion.view1.ps,angle=-90,width=3.62in}}%
\hspace{-0.3in}%
\makebox[0in][r]{\raisebox{0.87in}{$p_1$}}%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\hspace*{2.3in}%
\makebox[0in][r]{\raisebox{0.3in}{$p_0$}}%
\hspace*{-2.3in}%
\raisebox{0in}[1.709in]{\psfig{figure=figs/confusion.view2.ps,angle=-90,width=3.62in}}%
\hspace{-0.3in}%
\makebox[0in][r]{\raisebox{0.87in}{$p_1$}}%
}\\[-0.3in]
\end{center}
\medskip

\noindent {\sf Special cases.}
 In the special case $p_{{\tt{?}}}=0$, the channel is a noiseless
 binary channel, and $I(X;Y) =  H_2(p_0)$.

 In the special case $p_0=p_1$, the term $H_2(p_0 + p_{{\tt{?}}}/2)$ is equal to 1,
 so $I(X;Y) =  1-p_{{\tt{?}}}$.

 In the special case $p_0=0$, the channel is a Z channel with error
 probability 0.5. We know how to sketch that, from the previous chapter
 (\figref{hxyz}).

\amarginfig{c}{\small% skeleton  fixed Thu 10/7/03
\begin{center}% was -0.51in until Sat 24/5/03
\hspace*{-0.31in}\mbox{%
\hspace*{1.62in}%
\makebox[0in][r]{\raisebox{0.25in}{$p_0$}}%
\hspace*{-1.62in}%
{\psfig{figure=figs/confusion.skel.ps,angle=-90,width=2.5in}}%was 3in
\hspace{-0.3in}%
\makebox[0in][r]{\raisebox{0.77in}{$p_1$}}}\vspace{-0.2in}%
\end{center}
\caption[a]{Skeleton of the mutual information for the ternary confusion channel.}
\label{fig.skeleton}
}% end marginpar
 These special cases allow us to construct the skeleton shown
 in \figref{fig.skeleton}.
% below.
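
 As a check on the sketch, here is a short bounding argument, added for
 illustration. Since the binary entropy never exceeds 1 bit,
\beq
	I(X;Y) = H_2(p_0 + p_{{\tt{?}}}/2)  - p_{{\tt{?}}}
	\:\leq\: 1 - p_{{\tt{?}}} \:\leq\: 1 ,
\eeq
 with equality throughout only if $p_{{\tt{?}}} = 0$ and $p_0 = p_1 = 1/2$.
 So the surface has its single highest point, 1 bit, on the edge
 $p_{{\tt{?}}} = 0$, where the input {\tt{?}} is not used at all.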
}




\soln{ex.Imaximizer}{
 Necessary and sufficient conditions for $\bp$ to maximize
 $\I(X;Y)$ are
\beq
\left.
\begin{array}{rclcc}
	\frac{\partial \I(X;Y)}{\partial p_i} & =& \lambda & \mbox{and} & p_i>0 \\[0.05in]
	\frac{\partial \I(X;Y)}{\partial p_i} & \leq & \lambda & \mbox{and} & p_i=0 \\
\end{array} \right\}
\:\:\: \mbox{for all $i$},
\label{eq.IequalsC}
\eeq
 where $\lambda$ is a constant related to the capacity by $C = \lambda + \log_2 e$.

 This result can be used in a computer program that evaluates the
 derivatives, and increments and decrements the probabilities $p_i$
 in proportion  to the differences between those derivatives.

 This result is also useful  for lazy human capacity-finders
 who are good guessers. Having guessed the \optens, one can
 simply confirm that \eqref{eq.IequalsC} holds. 
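
 [To see where the constant $\log_2 e$ comes from, here is a sketch of the
 differentiation. The notation $Q_{j|i} \equiv P(y \eq b_j \given x \eq a_i)$
 and $P(y_j) = \sum_i p_i Q_{j|i}$ is assumed here for illustration.
 Differentiating $\I(X;Y) = \sum_i p_i \sum_j Q_{j|i} \log_2 [ Q_{j|i} / P(y_j) ]$
 with respect to $p_i$, remembering that $P(y_j)$ depends on $p_i$, gives
\beq
	\frac{\partial \I(X;Y)}{\partial p_i}
	= \sum_j Q_{j|i} \log_2 \frac{ Q_{j|i} }{ P(y_j) } - \log_2 e .
\eeq
 If the first term takes the same value for every input that has $p_i>0$,
 then that common value is $\sum_i p_i \sum_j Q_{j|i} \log_2 [ Q_{j|i} / P(y_j) ] = \I(X;Y) = C$,
 so $\lambda = C - \log_2 e$.]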
}
%\soln{ex.Iallused}{
% coming
%}
%\soln{ex.Iconvex}{
% Easy Proof, using \verb+(x,z)->y+.
%}
%\soln{ex.Imultiple}{
%% If there are several \optens, they all give the same 
%% output probability (theorem). This is a general proof that
%% the `by symmetry' argument is valid.
% coming
%}
%\soln{ex.Symmetricoptens}{
% This can be proved by the symmetry argument given in the chapter.
%
% Alternatively  see p.94 of Gallager.
%}
\soln{ex.notSymmetric}{
 We certainly expect nonsymmetric channels with uniform {\optens}s  to exist, since 
 when inventing a channel we have $I(J-1)$
 degrees of freedom  whereas
 the \optens\ is just $(I-1)$-dimensional; 
 so   in the $I(J\!-\!1)$-dimensional
 space of perturbations around   a symmetric channel,
 we expect there to be a 
 subspace of perturbations of dimension
 $I(J-1)-(I-1) = I(J-2)+1$
 that leave the \optens\ unchanged. 

 Here is an explicit example, a bit like a Z channel.
\beq \bQ = 
\left[
\begin{array}{cccc}
0.9585 & 0.0415 &  0.35 &  0.0   \\
0.0415 & 0.9585 &  0.0  &  0.35  \\
0      & 0      &  0.65 &  0     \\
0      & 0      &  0    &  0.65  \\
\end{array}
\right]
\eeq
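 [A quick numerical check, offered as a sketch. It assumes that the columns
 of $\bQ$ are the conditional distributions $P(y \given x)$ (each column sums to one)
 and writes $Q_{y|x}$ for the entries. With a uniform input distribution,
 the output probabilities are $P(y) = (0.3375, 0.3375, 0.1625, 0.1625)$, and
\beqan
	\sum_y Q_{y|1} \log_2 \frac{ Q_{y|1} }{ P(y) }
	&=& 0.9585 \log_2 \frac{0.9585}{0.3375} + 0.0415 \log_2 \frac{0.0415}{0.3375}
	\:\simeq\: 1.32\ubits ,
\\
	\sum_y Q_{y|3} \log_2 \frac{ Q_{y|3} }{ P(y) }
	&=& 0.35 \log_2 \frac{0.35}{0.3375} + 0.65 \log_2 \frac{0.65}{0.1625}
	\:\simeq\: 1.32\ubits .
\eeqan
 Inputs 2 and 4 give the same values as inputs 1 and 3 respectively, by the symmetry
 of the matrix. So, to within the rounding of the matrix entries, all four
 inputs satisfy \eqref{eq.IequalsC} with equality under the uniform distribution.]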
}
% removed to cutsolutions.tex
% \soln{ex.exam01}{
\soln{ex.wirelabelling}{
 The labelling problem can be solved for any $N>2$ with
 just two trips, one each way across the Atlantic.

 The key step in the information-theoretic approach to
 this problem is to write down
 the  information content of
 one {\dem\ind{partition}}, the combinatorial object that is the connecting
 together  of  subsets of wires.
 If $N$ wires are grouped together into
 $g_1$ subsets of size $1$, 
 $g_2$ subsets of size $2$, $\ldots,$
% $g_r$ groups of size $r$ $\ldots,$
 then the number of such partitions is
\beq
	\Omega = \frac{ N! }{\displaystyle  \prod_r \left( r! \right)^{g_r} g_r! } ,
\eeq
 and the information content of one such \ind{partition} is the $\log$ of this quantity.
 In a greedy strategy we choose the first partition to maximize this information
 content.

 One game we can play is to maximize this information content
 with respect to the quantities $g_r$, treated as real numbers, subject to the
 constraint $\sum_r g_r r = N$.
 Introducing a \ind{Lagrange multiplier} $\l$ for the constraint,
 the derivative is
\beq
\frac{	\partial }{\partial g_r} \left( \log \Omega + \l \sum_r g_r r \right)
	= - \log r! - \log g_r  + \l r ,
\eeq
 which, when set to zero, leads to the rather nice expression
\beq
	g_r = \frac{ e^{\l r} }{ r! } ;
% \:\:(r \geq 1)
\eeq
 the optimal $g_r$ is 
 proportional to a \ind{Poisson distribution}\index{distribution!Poisson}!
 We can solve for the Lagrange multiplier by plugging $g_r$ into the
 constraint  $\sum_r g_r r = N$,  which gives the implicit
 equation
\beq
	N = \mu \, e^{\mu},
\eeq
 where $\mu \equiv e^{\l}$ is a convenient reparameterization of the
 Lagrange multiplier. 
 \Figref{fig.atlantic}a shows a graph of $\mu(N)$;
 \figref{fig.atlantic}b
 shows the deduced non-integer assignments $g_r$ when $\mu=2.2$,
 and nearby integers $g_r = \{1,2,2,1,1\}$
 that motivate setting the first partition to
 (a)(bc)(de)(fgh)(ijk)(lmno)(pqrst).
\marginfig{\footnotesize
\begin{center}\hspace*{-0.2in}
\begin{tabular}{r@{\hspace{0.2in}}l}
(a)&\mbox{\psfig{figure=figs/atlanticmuN.ps,width=1.5in,angle=-90}}\\[0.2in]
(b)&\mbox{\psfig{figure=figs/atlanticpoi.ps,width=1.5in,angle=-90}}\\
\end{tabular}
\end{center}
\caption[a]{Approximate solution of the \index{cable labelling}{cable-labelling} problem
 using Lagrange multipliers.
 (a) The parameter $\mu$ as a function of $N$; the value $\mu(20) = 2.2$ is highlighted.
 (b) Non-integer values of the function  $g_r = \dfrac{ \mu^{r} }{ r! }$
 are shown by lines and
 integer values of $g_r$ motivated by those non-integer values are
 shown by  crosses.
}
\label{fig.atlantic}
}
 This partition produces a random partition at the other
 end, which has an information content of $\log \Omega =40.4\ubits$, 
%  pr log(20!*1.0/( (2!)**2 * 2 * (3!)**2 * 2 * (4!) * (5!) ) )/log(2.0)
%  pr log(20!*1.0/( (2!)**10 * 10! ))/log(2.0)
 which is a lot more than half the total information content
 we need to acquire to infer the transatlantic permutation, $\log 20! \simeq 61\ubits$.
 [In contrast, if all the wires are joined together in pairs,
 the information content generated
 is only about 29$\ubits$.]
 How to choose the second partition  is left
 to the reader. A Shannonesque approach is appropriate, picking a
 random   partition at the other end, using the same $\{g_r\}$; you
 need to ensure the two partitions are as unlike each other as possible.
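
 [The numbers quoted above for the first partition can be checked as follows;
 this is a sketch of the arithmetic only. The constraint sum is
\beq
	\sum_r r \, g_r = \sum_{r \geq 1} \frac{ \mu^r }{ (r-1)! } = \mu \, e^{\mu} ,
\eeq
 and $\mu = 2.2$ gives $\mu e^{\mu} \simeq 19.9$, with
 $g_1, \ldots, g_5 \simeq 2.2, 2.4, 1.8, 1.0, 0.4$.
 For the integer choice $g_r = \{1,2,2,1,1\}$,
\beq
	\Omega = \frac{ 20! }{ (2!)^2 \, 2! \: (3!)^2 \, 2! \: 4! \: 5! }
	= \frac{ 20! }{ 1\,658\,880 } \simeq 1.47 \times 10^{12} ,
\eeq
 so $\log_2 \Omega \simeq 40.4\ubits$. For the all-pairs partition,
 $\Omega = 20! / ( (2!)^{10} \, 10! )$ and $\log_2 \Omega \simeq 29.3\ubits$.]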
 

 If $N \neq 2$, 5 or 9, then  the labelling problem
 has solutions 
 that are particularly simple to implement,
 called \ind{Knowlton--Graham partitions}:
 partition $\{1,\ldots,N\}$ into disjoint sets in two ways $A$
% $A_1,\ldots,A_p$ and
 and $B$,
% $B_1,\ldots,B_q$,
 subject to the condition that at most one element appears
 both in an $A$~set of cardinality~$j$ and in a $B$~set of cardinality~$k$, for
 each $j$ and~$k$ \cite{Graham66,GrahamKnowlton68}.\index{Graham, Ronald L.} 
% (R. L. Graham, ``On partitions of a finite set,'' {\sl Journal of Combinatorial
% Theory\/ \bf 1} (1966), 215--223;\index{Graham, Ronald L.}
% Ronald L. Graham and Kenneth C. Knowlton, ``Method of identifying conductors in
% a cable by establishing conductor connection groupings at both ends of the
% cable,'' U.S. Patent 3,369,177 (13~Feb 1968).)


}
%%%%%%%%%%%%%%%%%%%%%%%%
%
%  end of chapter
%
%%%%%%%%%%%%%%%%%%%%%%%%

%
\dvipsb{solutions noisy channel s6}
%
%
% CHAPTER 12 (formerly 7)
%
\prechapter{About         Chapter}
\fakesection{prerequisites for chapter 7}
 Before reading  \chref{ch.ecc}, you should have read Chapters
 \ref{ch.five} and \ref{ch.six}.

 You will also need to be familiar with the {\dem\inds{Gaussian distribution}}.
\label{sec.gaussian.props}
\begin{description}
\item[One-dimensional Gaussian distribution\puncspace]
 If a 
        random variable $y$ is Gaussian and has mean $\mu$ and variance $\sigma^2$,
 which we write:
\beq
        y \sim \Normal(\mu,\sigma^2) ,\mbox{ or } P(y) = \Normal(y;\mu,\sigma^2) ,
\eeq
 then the distribution of $y$ is:
% a Gaussian distribution: 
\beq
        P(y\given \mu,\sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} 
                \exp \left[ - ( y - \mu )^2 / 2  \sigma^2 \right] .
\eeq
 [I  use the symbol $P$ for both  probability densities and
 probabilities.]

 The inverse-variance $\tau \equiv \dfrac{1}{\sigma^2}$ is sometimes
 called the {\dem\inds{precision}\/} of the Gaussian distribution. 

\item[Multi-dimensional Gaussian distribution\puncspace]
 If $\by = (y_1,y_2,\ldots,y_N)$ has a \ind{multivariate Gaussian} {distribution}, then
\beq
        P( \by \given  \bx, \bA ) =  \frac{1}{Z(\bA)}  \exp \left( - \frac{1}{2}
                (\by -\bx)^{\T} \bA  (\by -\bx) \right) ,
\eeq
 where $\bx$ is the mean of the distribution, 
 $\bA$ is the inverse of the \ind{variance--covariance matrix}\index{covariance matrix}, and
 the normalizing constant is ${Z(\bA)} = \left(  { {\det}\! \left( \linefrac{\bA}{2 \pi}
        \right) } \right)^{-1/2}$.

 This distribution has the property that 
 the variance $\Sigma_{ii}$  of $y_i$, and the covariance $\Sigma_{ij}$ of
 $y_i$ and $y_j$
 are given by
\beq
	\Sigma_{ij} \equiv \Exp \left[ ( y_i - \bar{y}_i ) ( y_j - \bar{y}_j ) \right]
		= A^{-1}_{ij} ,
\eeq
 where $\bA^{-1}$ is the inverse of the matrix $\bA$.

 The marginal distribution $P(y_i)$ of one component $y_i$ is Gaussian;
 the joint marginal distribution of any subset of the
 components is multivariate-Gaussian;
 and the conditional density of any subset, given the values of
 another subset, for example, $P(y_i\given y_j)$, is also Gaussian.
\end{description}
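
 [Two small consistency checks, offered as a sketch.
 For $N=1$, with $\bA = 1/\sigma^2$ and $\bx = \mu$, the normalizing constant is
 $Z(\bA) = \left( 1 / 2 \pi \sigma^2 \right)^{-1/2} = \sqrt{2 \pi \sigma^2}$,
 and the multi-dimensional expression reduces to the one-dimensional
 Gaussian distribution given above.
 For $N=2$, writing the entries of $\bA$ as $a$, $b$, $c$ (symbols introduced
 here for illustration, with $a>0$ and $ab > c^2$ so that $\bA$ is positive definite),
\beq
	\bA = \left[ \begin{array}{cc} a & c \\ c & b \end{array} \right]
	\hspace{0.3in} \Rightarrow \hspace{0.3in}
	\bA^{-1} = \frac{1}{ab - c^2}
	\left[ \begin{array}{cc} b & -c \\ -c & a \end{array} \right] ,
\eeq
 so a positive off-diagonal entry in $\bA$ corresponds to a negative
 covariance between $y_1$ and $y_2$.]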



%\chapter{Error correcting codes \& real channels}
% ampersand used to keep the title on one line on the chapter's opening page
\ENDprechapter
\chapter[Error-Correcting Codes and Real Channels]{Error-Correcting Codes \& Real Channels}
\label{ch.ecc}\label{ch7}
% %  : l7.tex -- was l78.tex 
% \setcounter{chapter}{6}%  set to previous value
% \setcounter{page}{70} % set to current value 
% \setcounter{exercise_number}{89} % set to imminent value
% % 
% \chapter{Error correcting codes \& real channels}
% \label{ch7}
 The noisy-channel coding theorem that we have proved shows that there
 exist reliable
% `very good'
 error-correcting codes for any noisy channel.
 In this chapter we address two questions. 

 First, many practical channels have real, rather than discrete,
 inputs and outputs. What can Shannon tell us about
 these continuous channels? And how should digital signals be
         mapped into analogue waveforms, and {\em vice versa}?

 Second, how are practical error-correcting codes
 made, and what is achieved in practice, relative to the 
 possibilities proved by Shannon?

\section{The Gaussian channel}
 The most popular  model of a real-input, real-output
 channel is the \inds{Gaussian channel}.\index{channel!Gaussian}
\begin{description}
\item[The Gaussian channel] has a real input $x$ and a real output $y$. 
 The conditional distribution of $y$ given $x$ is a Gaussian distribution: 
\beq
        P(y\given x) = \frac{1}{\sqrt{2 \pi \sigma^2}} 
                \exp \left[ - ( y - x )^2 / 2  \sigma^2 \right] .
\label{eq.gaussian.channel.def}
\eeq
%
 This channel has a continuous input and output but is discrete 
 in time.
 We will show  below that certain continuous-time channels
 are  equivalent to the discrete-time Gaussian channel.

 This channel is sometimes called the additive white Gaussian noise (AWGN)
 channel.\index{channel!AWGN}\index{channel!Gaussian}\index{AWGN} 


\end{description}
% Why is this a useful channel model? And w
As with discrete channels, we will discuss 
 what rate of error-free
 information communication can be achieved over this channel.

\subsection{Motivation
% for the Gaussian channel
 in terms of a continuous-time channel \nonexaminable}
 Consider a physical (electrical, say) channel with inputs and outputs that 
 are continuous in time. We put in $x(t)$,
% which is a
%% some sort of
% band-limited  signal,
 and out comes $y(t) = x(t) + n(t)$. 

 Our transmission has a power cost. The average power of 
 a transmission of length $T$ may be constrained thus:
\beq
        \int_0^T \d t \: [x(t)]^2 / T   \leq P .
\eeq
 The received signal is assumed to differ from $x(t)$ by additive
 noise $n(t)$ (for example \ind{Johnson noise}), which we will model as
 white\index{white noise}\index{noise!white}
 Gaussian noise.  The magnitude of this noise is quantified by the
 {\dem noise spectral
 density}, $N_0$.\index{noise!spectral density}\index{E$_{\rm b}/N_0$}\index{signal-to-noise ratio}
% , which might depend on the effective temperature of the system.

 How could such a channel be used to communicate information?
\amarginfig{t}{
\begin{tabular}{r@{}l}
$\phi_1(t)$&\raisebox{-0.8cm}{\psfig{figure=figs/realchannel/phi1.ps,angle=-90,width=1in}}\\
$\phi_2(t)$&\raisebox{-0.8cm}{\psfig{figure=figs/realchannel/phi2.ps,angle=-90,width=1in}}\\
$\phi_3(t)$&\raisebox{-0.8cm}{\psfig{figure=figs/realchannel/phi3.ps,angle=-90,width=1in}}\\
$x(t)$     &\raisebox{-0.8cm}{\psfig{figure=figs/realchannel/xt.ps,angle=-90,width=1in}}\\
\end{tabular}
%
\caption[a]{Three basis functions, and a
 weighted combination of them,
$
        x(t) = \sum_{n=1}^N x_n \phi_n(t) ,
$
 with $x_1  \eq  0.4$,
 $x_2 \eq -0.2$, and $x_3 \eq 0.1$.
% see figs/realchannel.gnu
}
\label{fig.continuousfunctionexample}
}
 Consider transmitting a set of $N$ real numbers $\{ x_n \}_{n=1}^N$ 
 in a signal of duration $T$ made up of a weighted combination 
 of orthonormal basis functions $\phi_n(t)$, 
\beq
        x(t) = \sum_{n=1}^N x_n \phi_n(t) ,
\eeq
 where $\int_0^T \: \d t \: \phi_n(t) \phi_m(t) = \delta_{nm}$.
 The receiver can then compute the scalars:
\beqan
        y_n \:\: \equiv \:\: \int_0^T \: \d t \: \phi_n(t) y(t) 
&=&
        x_n + \int_0^T \: \d t \: \phi_n(t) n(t) 
\\
&\equiv&        x_n + n_n
\eeqan
 for $n=1 \ldots N$.
 If there were no noise, then $y_n$ would equal $x_n$. The white Gaussian 
 noise $n(t)$ adds scalar noise $n_n$ to the estimate $y_n$. This noise 
 is Gaussian:
\beq
 n_n \sim \Normal(0,N_0/2),
\eeq
 where $N_0$ is the spectral 
 density introduced above.
% [This is the definition of $N_0$.]
 Thus  a continuous channel used in this way
 is equivalent to the Gaussian channel 
 defined at \eqref{eq.gaussian.channel.def}. 
 The power constraint $\int_0^T \d t \, [x(t)]^2  \leq P T$
 defines a constraint on the signal amplitudes $x_n$, 
\beq
        \sum_n x_n^2 \leq PT \hspace{0.5in} \Rightarrow
 \hspace{0.5in}
 \overline{x_n^2}         \leq \frac{PT}{N} .
\eeq
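 [The step from the integral to the sum uses only the orthonormality of the
 basis functions; as a sketch,
\beq
	\int_0^T \! \d t \, [x(t)]^2
	= \sum_{n} \sum_{m} x_n x_m \int_0^T \! \d t \: \phi_n(t) \phi_m(t)
	= \sum_{n} \sum_{m} x_n x_m \delta_{nm}
	= \sum_n x_n^2 .
\eeq
 The noise variance can be obtained in the same way if we assume
 that white noise of spectral density $N_0$ has correlation function
 $\Exp [ n(t) n(t') ] = (N_0/2) \, \delta(t - t')$; this normalization is
 an assumption made for this check. Then
\beq
	\Exp [ n_n n_m ] = \int_0^T \! \d t \int_0^T \! \d t' \: \phi_n(t) \phi_m(t') \,
		\Exp [ n(t) n(t') ]
	= \frac{N_0}{2} \, \delta_{nm} ,
\eeq
 so each $n_n$ has variance $N_0/2$ and the noise on different
 components is uncorrelated.]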

 Before returning to the Gaussian channel, we define the {\dbf\ind{bandwidth}} 
 (measured in \ind{Hertz})
 of the \ind{continuous channel}\index{channel!continuous} to be:
\beq
        W = \frac{N^{\max}}{2 T}, 
\eeq
 where $N^{\max}$ is the maximum number of orthonormal functions that can be
 produced in an interval of length $T$. 
 This definition can be motivated by imagining creating a
 \ind{band-limited signal} of duration $T$ from orthonormal cosine and sine
 curves of maximum frequency $W$.  The number of orthonormal functions
 is $N^{\max} = 2 W T$.  This definition relates to the
 \ind{Nyquist sampling theorem}: if the highest frequency present in a signal 
 is $W$, then the signal can
 be fully determined from its values at a series of discrete 
 sample points separated by the Nyquist interval 
 $\Delta t = \dfrac{1}{2W}$ seconds.

 So the use of a real continuous channel with bandwidth $W$, noise spectral
 density $N_0$ and power $P$ is equivalent to $N/T = 2 W$ uses per second 
 of a Gaussian channel with noise level
 $\sigma^2 = N_0/2$ and subject to the  signal power
 constraint $\overline{x_n^2} \leq\dfrac{P}{2W}$. 

\subsection{Definition of $E_{\rm b}/N_0$\nonexaminable}
 Imagine\index{E$_{\rm b}/N_0$}
 that the Gaussian channel $y_n = x_n + n_n$  is used {with
% an 
% error-correcting code
 an encoding system} to transmit {\em binary\/}
 source bits at a rate of $R$ bits per channel use.
% , where a rate of 1 corresponds to  the uncoded case.
 How can we compare two encoding systems that have different
 rates of \ind{communication} $R$ and that use different powers $\overline{x_n^2}$?
 Transmitting at a large rate $R$ is good; using small power is
 good too.

 It is conventional to measure the rate-compensated
 \ind{signal-to-noise ratio}
% \marginpar{\footnotesize{I'm using  signal to noise ratio in two different ways. Elsewhere it is defined to be $\frac{\overline{x_n^2}}{\sigma^2}$. Should I modify this phrase?}} 
 by the ratio of the power per source bit $E_{\rm b} = \overline{x_n^2}/R$ 
 to the noise spectral density $N_0$:\marginpar[t]{\small\raggedright
 {$E_{\rm b}/N_0$ is dimensionless, but it is usually reported in the units
 of \ind{decibels}; the value given is $10 \log_{10} E_{\rm b}/N_0$.}}
\beq
        E_{\rm b}/N_0  = \frac{\overline{x_n^2}}{2 \sigma^2 R} .
\eeq
% This signal-to-noise measure equates low rate, low power
% cf ebno.p

% The difference in
 $E_{\rm b}/N_0$ is one of the
 measures used to compare coding schemes
  for Gaussian channels.
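
 [A worked example with made-up numbers, as a sketch.
 Suppose a rate-$\dhalf$ code and an uncoded system ($R=1$) both
 use $\overline{x_n^2} = 1$ over a channel with $\sigma^2 = 1$. Then
\beq
	\left. \frac{E_{\rm b}}{N_0} \right|_{R = 1/2} = \frac{1}{2 \times 1 \times 0.5} = 1
	\:\: \mbox{(0\,dB)},
	\hspace{0.4in}
	\left. \frac{E_{\rm b}}{N_0} \right|_{R = 1} = \frac{1}{2 \times 1 \times 1} = 0.5
	\:\: \mbox{($-3.0$\,dB)}.
\eeq
 At the same amplitude and noise level, the rate-$\dhalf$ code spends twice as
 much energy per source bit, so a fair comparison of the two systems asks
 what error probability each achieves at the {\em same\/} value of $E_{\rm b}/N_0$.
 The macro \verb+\dhalf+ is assumed to typeset $1/2$; replace it by
 $1/2$ if it is not defined.]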

\section{Inferring the input to a real channel}
\subsection{`The best detection of pulses'}
\label{sec.pulse}
 In 1944 
 Shannon
 wrote a memorandum \cite{shannon44} on the 
 problem of best differentiating between two types of pulses of known shape,
 represented by vectors $\bx_0$ and $\bx_1$, given that one of them has
 been transmitted over a noisy channel. This is a 
 \ind{pattern recognition} problem.% 
\amarginfig{t}{
\begin{tabular}{r@{}l}
$\bx_0$&\raisebox{-0.8cm}{\psfig{figure=figs/realchannel/x0.ps,angle=-90,width=1in}}\\
$\bx_1$&\raisebox{-0.8cm}{\psfig{figure=figs/realchannel/x1.ps,angle=-90,width=1in}}\\
$\by$&\raisebox{-0.8cm}{\psfig{figure=figs/realchannel/xn1.ps,angle=-90,width=1in}}\\
\end{tabular}
%
\caption[a]{Two pulses  $\bx_0$ and $\bx_1$, represented
 as 31-dimensional vectors, and
 a noisy version of one of them, $\by$.
% see figs/realchannel.gnu
}
\label{fig.detectionofpulses}
}
 It is assumed that the noise is Gaussian with probability density
\beq
        P( \bn ) = \left[ {\det}\left( \frac{\bA}{2 \pi}
        \right) \right]^{1/2} \exp \left( - \frac{1}{2}
                \bn^{\T} \bA \bn \right) , 
\eeq
 where $\bA$ is  the inverse of the variance--covariance matrix of the 
 noise, a symmetric and positive-definite matrix.
 (If $\bA$ is a multiple of the identity matrix, $\bI/\sigma^2$,
 then the noise is `white'.\index{noise!white}\index{white noise}
 For more general $\bA$, the 
 noise is \index{noise!coloured}\index{coloured noise}`{coloured}'.)  The probability of the received vector $\by$ given that the 
 source signal was $s$ (either zero or one) is then
\beq
        P( \by \given s ) =  \left[ { {\det} \left( \frac{\bA}{2 \pi}
        \right) }\right]^{1/2} \exp \left( - \frac{1}{2}
                (\by -\bx_s)^{\T} \bA  (\by -\bx_s) \right) .
\eeq
 The optimal detector is based on the posterior probability ratio:
\beqan
\hspace{-0.6cm}
\lefteqn{\frac{ P( s \eq 1\given \by )}{P(s \eq 0\given \by )} = 
 \frac{ P( \by \given s \eq 1 ) }{ P( \by \given s \eq 0)}
        \frac{  P( s \eq 1 )}{P(s \eq 0 )} }
\\
&=& \exp \left( - \frac{1}{2}
                (\by -\bx_1)^{\T} \bA  (\by -\bx_1) + \frac{1}{2}
                (\by -\bx_0)^{\T} \bA  (\by -\bx_0) + \ln \frac{       P( s \eq 1 )}{P(s \eq 0 )} \right)
\nonumber
\\ &=& \exp \left(  \by^{\T}  \bA  ( \bx_1 -\bx_0) + \theta \right),
\eeqan
 where $\theta$ is a constant independent of the received vector $\by$, 
\beq
        \theta =  - \frac{1}{2}
                \bx_1^{\T} \bA  \bx_1 + \frac{1}{2}
                \bx_0^{\T} \bA \bx_0 +  \ln \frac{     P( s \eq 1 )}{P(s \eq 0 )} . 
\eeq
 If the detector is forced to make a decision (\ie, guess either 
 $s \eq 1$ or $s \eq 0$) then the 
 decision that minimizes the probability of error is 
 to  guess the most probable hypothesis.  We can write the 
 optimal decision in terms of a {\dem\ind{discriminant function}}: 
\beq
        a(\by) \equiv   \by^{\T}  \bA  ( \bx_1 -\bx_0) + \theta 
\eeq
 with the decisions 
\marginfig{
\begin{tabular}{r@{}l}
$\bw$&\raisebox{-0.8cm}{\psfig{figure=figs/realchannel/w.ps,angle=-90,width=1in}}\\
\end{tabular}
%
\caption[a]{The weight vector $\bw \propto \bx_1 -\bx_0$
 that is used to discriminate between  $\bx_0$ and $\bx_1$.
% see figs/realchannel.gnu
}
\label{fig.detectionofpulses.w}
}
\beq
 \begin{array}{ccl} a(\by) > 0&  \rightarrow & \mbox{guess $s \eq 1$} \\
 a(\by) < 0& \rightarrow & \mbox{guess $s \eq 0$} \\
a(\by)=0 &  \rightarrow & \mbox{guess either.} 
\end{array}
\eeq
 Notice
% It should be noted
 that $a(\by)$ is a linear function of the 
 received vector,
\beq
        a(\by) = \bw^{\T} \by + \theta ,
\eeq
 where $\bw \equiv  \bA  ( \bx_1 -\bx_0)$.
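
 [As a sketch of the simplest special case, suppose that the noise is white,
 $\bA = \bI/\sigma^2$, and that the two hypotheses are equally probable
 a priori; both are assumptions made for this illustration. Then
\beqan
	a(\by) &=& \frac{1}{\sigma^2} \left[ \by^{\T} ( \bx_1 - \bx_0 )
		- \frac{1}{2} \left( \bx_1^{\T} \bx_1 - \bx_0^{\T} \bx_0 \right) \right]
\\
	&=& \frac{1}{2 \sigma^2} \left[ (\by - \bx_0)^{\T} (\by - \bx_0)
		- (\by - \bx_1)^{\T} (\by - \bx_1) \right] ,
\eeqan
 so the optimal detector guesses $s \eq 1$ whenever $\by$ is closer
 (in Euclidean distance) to $\bx_1$ than to $\bx_0$. For coloured noise,
 the same statement holds with distances measured using the matrix $\bA$.]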


\section{Capacity of Gaussian channel}
\label{sec.entropy.continuous}
 Until now we have only measured the joint, marginal, and conditional
 entropy of discrete variables. In order to define the information conveyed
 by continuous variables, there are two issues we must
 address -- the infinite length of the real line, and the infinite
 precision of real numbers.
 
\subsection{Infinite inputs}
 How much information can we convey in one use of a Gaussian
 channel? If we are allowed to put {\em any\/} real number $x$ into the
 Gaussian channel, we could communicate an enormous
 string of $N$ digits $d_1d_2d_3\ldots d_N$
 by setting $x = d_1d_2d_3\ldots d_N 000\ldots 000$.
 The amount of
 error-free information conveyed in just a single transmission could
 be made arbitrarily large by increasing  $N$,
 and the communication could be made arbitrarily reliable
 by increasing the number of zeroes at the end of $x$.
 There is usually some \ind{power cost} associated
 with large inputs, however, not to mention practical limits
 in the dynamic range acceptable to a receiver.
 It is therefore conventional to introduce a {\dem\ind{cost
 function}\/} $v(x)$ for every input $x$, and constrain codes to have 
 an average cost  $\bar{v}$ less than or equal to some maximum value.
%  a maximum average cost $\bar{v}$.
 A generalized channel coding theorem, including a cost
 function for the inputs, can be proved
% for the discrete channels  discussed previously
 -- see  McEliece (1977).\nocite{McEliece77}
 The result is a channel
 capacity $C(\bar{v})$ that is a function of the permitted cost.  For
 the Gaussian channel we will assume a cost
\beq
        v(x) = x^2
\eeq
 such that the `average power' $\overline{x^2}$ of the input is
 constrained. We  motivated this cost function 
 above in the case of real electrical channels in
 which the physical power consumption is indeed quadratic in $x$.
 The constraint $\overline{x^2}=\bar{v}$  makes it impossible to 
 communicate infinite information in one use of  the Gaussian channel. 

\subsection{Infinite precision}
\amarginfig{b}{
{\footnotesize\setlength{\unitlength}{1mm}
\begin{tabular}{lc}
(a)&{\psfig{figure=gnu/grainI.ps,angle=-90,width=1.3in}}\\
(b)&\makebox[0in]{\hspace*{4mm}\begin{picture}(20,10)%
\put(17.65,6){\vector(1,0){1.42}}
\put(17.65,6){\vector(-1,0){1.42}}
\put(17.5,8){\makebox(0,0){$g$}}
%
\end{picture}}%
{\psfig{figure=gnu/grain10.ps,angle=-90,width=1.3in}}\\
&{\psfig{figure=gnu/grain18.ps,angle=-90,width=1.3in}}\\
&{\psfig{figure=gnu/grain34.ps,angle=-90,width=1.3in}}\\
& $\vdots$ \\
\end{tabular}
}
%
\caption[a]{(a) A probability density $P(x)$. {\sf Question:}
 can we define the `entropy' of this density?
 (b) We could evaluate the entropies of
 a sequence of probability distributions with
 decreasing grain-size $g$, but these entropies tend to 
 $\displaystyle \int P(x) \log \frac{1}{ P(x) g } \, \d x$,
 which is not independent of $g$:
% increases as $g$ decreases:
 the entropy goes up by
 one bit for every halving of $g$. 

 $\displaystyle \int P(x) \log \frac{1}{ P(x) } \, \d x$
 is an\index{sermon!illegal integral}
% \\ \hspace
 illegal integral.}
% see gnu/grain.gnu
\label{fig.grain}
}
 It is tempting to define joint, marginal, and conditional entropies\index{entropy!of continuous variable}\index{grain size}
 for real variables simply by replacing summations by integrals, but
 this is not a well-defined operation. As we discretize an interval
 into smaller and smaller divisions, the entropy of the discrete
 distribution diverges (as the logarithm of the granularity) (\figref{fig.grain}).
 Also, it is  not permissible
 to take the logarithm of a dimensional quantity such as 
 a probability density $P(x)$ (whose dimensions are $[x]^{-1}$).\index{sermon!dimensions}\index{dimensions} 
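
 [A one-line example, as a sketch: take the uniform density $P(x) = 1/a$ on
 an interval of length $a$, and divide the interval into $a/g$ bins of
 width $g$ (assuming for simplicity that $a/g$ is an integer).
 Each bin has probability $g/a$, so the entropy of the discrete distribution is
\beq
	H = \sum_{\rm bins} \frac{g}{a} \log_2 \frac{a}{g} = \log_2 \frac{a}{g} ,
\eeq
 which increases by one bit each time $g$ is halved, and diverges as
 $g \rightarrow 0$. Notice that $a/g$ is dimensionless, whereas
 $\log 1/P(x) = \log a$ would require the logarithm of a length.]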

 There is one information measure, however, that has a well-behaved
 limit, namely the mutual information -- and this is the one that
 really matters, since it measures how much information one variable
 conveys about another. In the discrete case,
\beq
        \I(X;Y) = \sum_{x,y}  
                P(x,y) \log \frac{P(x,y)}{P(x)P(y)} .
\eeq
 Now because the argument of the log is a ratio of two probabilities 
 over the same space, it is
 OK to have $P(x,y)$, $P(x)$ and $P(y)$ be
 probability densities
% (as long as they are not pathological)
% densities) 
 and replace the sum by an integral: 
\beqan
        \I(X;Y)& =& \int \! \d x \: \d y \:  
                P(x,y) \log \frac{P(x,y)}{P(x)P(y)}  
\\ &=&
         \int \! \d x \: \d y \:  
                P(x)P(y\given x) \log \frac{P(y\given x)}{P(y)} .
\eeqan
 We can now ask these questions for the Gaussian channel: 
 (a) what probability distribution
 $P(x)$ maximizes the mutual information (subject to the constraint
 $\overline{x^2}={v}$)? and (b) does the maximal
 mutual information still  measure the maximum
 error-free communication rate of this real channel,
 as it did for the discrete channel?

\exercissxD{3}{ex.gcoptens}{
 Prove that the probability distribution
 $P(x)$ that maximizes the mutual information (subject to the constraint
 $\overline{x^2}={v}$) is a Gaussian distribution of mean zero 
 and variance $v$.
}
% solution is in tex/sol_gc.tex
\exercissxB{2}{ex.gcC}{
%
 Show that the
 mutual information $\I(X;Y)$,
 in the case of this optimized distribution, is 
\beq
        C = \frac{1}{2} \log \left( 1 + \frac{v}{\sigma^2} \right) .
\eeq
}
 This is an important result. We see that the capacity of the Gaussian 
 channel is a function of the {\dem signal-to-noise ratio} $v/\sigma^2$. 
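
 [Some illustrative values, assuming logarithms to base 2 so that $C$ is in
 bits per channel use:
\beq
\begin{array}{c|cccc}
	v/\sigma^2 & 1 & 3 & 15 & 255 \\ \hline
	C = \frac{1}{2} \log_2 ( 1 + v/\sigma^2 ) & 0.5 & 1 & 2 & 4
\end{array}
\eeq
 At high signal-to-noise ratio, each additional bit per channel use
 costs roughly a factor of four in signal power.]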
 
\subsection{Inferences given a Gaussian input distribution}
 If 
$
        P(x) = \Normal(x;0,v) \mbox{ and } P(y\given x) =  \Normal(y;x,\sigma^2)
$
 then the marginal distribution of $y$ is
$
        P(y) =  \Normal(y;0,v\!+\!\sigma^2)
$
 and the posterior distribution of the input, given that the output is $y$,
 is:
\beqan
        P(x\given y) &\!\!\propto\!\!& 
        P(y\given x)P(x) 
\\
&\!\!\propto\!\!& \exp( -(y-x)^2/2 \sigma^2)  
                                \exp( -x^2/2 v) 
\label{eq.two.gaussians}
\\
&\!\! =\!\! &
 \Normal\left( x ; \frac{ v}{v+\sigma^2} \, y \, , \, 
        \left({\frac{1}{v}+\frac{1}{\sigma^2}}\right)^{\! -1} \right) .
\label{eq.infer.mean.gaussian}
\eeqan
%
%  label this bit for reference when we get to Gaussian land
 [The step from (\ref{eq.two.gaussians}) to (\ref{eq.infer.mean.gaussian})
 is made by completing the square in the exponent.]
 This
\label{sec.infer.mean.gaussian}
 formula deserves careful study. The mean of the posterior 
 distribution, $\frac{ v}{v+\sigma^2} \, y $, can be viewed 
 as a weighted combination of the value that best fits the 
 output, $x=y$, and the value that best fits the prior, $x=0$:
\beq
\frac{ v}{v+\sigma^2} \, y =
        \frac{1/\sigma^2 }{1/v+1/\sigma^2} \, y  + \frac{1/v}{1/v+1/\sigma^2} \, 0 .
\eeq
 The weights $1/\sigma^2$ and $1/v$ are the {\dem\ind{precision}s\/}
% parameters'
 of the two Gaussians that we multiplied together in \eqref{eq.two.gaussians}:
 the likelihood and the prior, respectively.
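
 [The completion of the square in the exponent of
 \eqref{eq.two.gaussians} can be sketched as follows:
\beqan
	- \frac{ (y-x)^2 }{ 2 \sigma^2 } - \frac{ x^2 }{ 2 v }
	&=& - \frac{1}{2} \left( \frac{1}{v} + \frac{1}{\sigma^2} \right) x^2
		+ \frac{y}{\sigma^2} \, x - \frac{y^2}{2 \sigma^2}
\\
	&=& - \frac{1}{2} \left( \frac{1}{v} + \frac{1}{\sigma^2} \right)
		\left( x - \frac{ 1/\sigma^2 }{ 1/v + 1/\sigma^2 } \, y \right)^{\! 2}
		+ \mbox{terms independent of $x$} .
\eeqan
 The coefficient of the quadratic term gives the posterior precision
 $1/v + 1/\sigma^2$, and the centre of the square gives the posterior mean.]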
%-- the probability of the output given the input, 
% and the prior probability of the input. 

 The precision  of the posterior distribution is 
 the sum of these two precisions. This is a general property:
 whenever two independent sources  contribute information, via  
 Gaussian distributions, about an unknown variable,  the\index{precisions add}
 precisions add. [This is the dual to the better-known
 relationship `when independent variables are added,
 their variances add'.]\index{variances add}
% inverse-variances add to define the inverse-variance of the
% posterior distribution.

\subsection{Noisy-channel coding theorem for the Gaussian channel}
 We\index{noisy-channel coding theorem!Gaussian channel}
 have evaluated a maximal mutual information.  Does it correspond
 to a maximum possible rate of error-free information transmission?
 One way of proving that this is so
 is  to define a sequence
 of discrete channels, all derived from the Gaussian channel, with
 increasing numbers of inputs and outputs, and prove that the maximum
 mutual information of these channels tends to the asserted $C$.
 The noisy-channel coding theorem for discrete channels applies
 to each of these derived channels, thus we obtain a coding theorem for
 the continuous channel.
% coding theorem is then proved.
% (with discrete inputs and 
% discrete outputs) by chopping the  output into bins and using a 
% finite set of inputs, and then defining a sequence of such channels  with 
% increasing numbers of inputs and outputs. A proof that the maximum 
% mutual information 
% of these channels tends to $C$ then completes the job, as we have already 
% proved the noisy channel coding theorem for discrete channels. 
%
% A more intuitive argument for the coding theorem may be preferred.
 Alternatively, we can make an intuitive argument for the coding theorem 
 specific to the Gaussian channel.

\subsection{Geometrical view of the noisy-channel coding theorem: sphere packing}
 \index{sphere packing}Consider a sequence $\bx = (x_1,\ldots, x_N)$ of inputs, and the
 corresponding output $\by$, as defining two points in an $N$-dimensional
 space. For large $N$, the noise power is very likely to be close
 (fractionally) to $N \sigma^2$. The output $\by$ is therefore very likely 
 to be close to the surface of a sphere of radius $\sqrt{ N  \sigma^2}$ 
 centred on $\bx$.  Similarly, if the original signal $\bx$ is generated 
 at random subject to an average power constraint $\overline{x^2} = v$, 
 then $\bx$ is likely to lie close to a  sphere, centred on the 
 origin, of radius $\sqrt{N v}$; and because the total average power of $\by$
 is $v+\sigma^2$, the received signal $\by$ is likely to lie on the surface 
 of a sphere of radius $\sqrt{N (v+\sigma^2)}$, centred on the origin. 

 The volume of an $N$-dimensional sphere of radius $r$ is
%
% this also appeared in _s1.tex
%
\beq
\textstyle
        V(r,N) = \smallfrac{ \pi^{N/2} }{ \Gamma( N/2 + 1 ) } r^N .
\eeq

 Now consider making a communication system based on non-confusable 
 inputs $\bx$, that is, inputs whose spheres do not overlap significantly. 
 The maximum number $S$ of non-confusable inputs is given by dividing 
 the volume of the sphere of probable $\by$s by the volume of 
 the sphere for $\by$ given $\bx$:
%
% An upper bound for the number $S$ of non-confusable inputs is:
\beq
 S \leq \left( \frac{ \sqrt{N (v+\sigma^2)} }{ \sqrt{ N  \sigma^2} }
        \right)^{\! N} .
\eeq
 Thus the capacity is bounded by:\index{capacity!Gaussian channel}
\beq
 C = \frac{1}{N} \log S \leq \frac{1}{2} \log \left( 1 + \frac{v}{\sigma^2}
        \right) .
\eeq
 A more detailed argument
% using the law of large numbers
 like the one used in the previous chapter
 can establish equality.
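
 [In detail, as a sketch: the factors $\pi^{N/2} / \Gamma(N/2+1)$ cancel
 in the ratio of the two volumes, leaving
\beq
	\frac{ V \! \left( \sqrt{ N (v+\sigma^2) } , N \right) }
	     { V \! \left( \sqrt{ N \sigma^2 } , N \right) }
	= \left( \frac{ v + \sigma^2 }{ \sigma^2 } \right)^{\! N/2}
	= \left( 1 + \frac{v}{\sigma^2} \right)^{\! N/2} ,
\eeq
 whose logarithm, divided by $N$, is $\frac{1}{2} \log ( 1 + v/\sigma^2 )$.
 For example, if $v/\sigma^2 = 3$, at most $4^{N/2} = 2^N$ non-confusable inputs
 fit, that is, one bit per channel use.]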

\subsection{Back to the continuous channel}
 Recall that 
the use of a real continuous channel with bandwidth $W$, noise spectral
 density $N_0$ and power $P$ is equivalent to $N/T = 2 W$ uses per second 
 of a Gaussian channel with $\sigma^2 = N_0/2$ and subject to the 
 constraint $\overline{x_n^2} \leq P/2W$.
 Substituting the result for the capacity of the Gaussian channel, we find the 
 capacity of the continuous channel  to be: 
\beq
 C = W \log \left( 1 + \frac{P}{N_0 W} \right) \: \mbox{ bits per second.}
\eeq
 This formula gives insight into the tradeoffs of practical
 \ind{communication}. Imagine that we have a fixed power constraint.  What
 is the best \ind{bandwidth} to make use of that power?  Introducing
 $W_0=P/N_0$, \ie, the bandwidth for which the signal-to-noise ratio
 is 1,  figure \ref{fig.wideband} shows $C/W_0 = W/W_0 \log \! \left( 1 + W_0/W
 \right)$ as a function of $W/W_0$.  The capacity increases to an
 asymptote of $W_0 \log e$. It is dramatically better (in terms of capacity
 for fixed power) to transmit at a
 low signal-to-noise ratio over a large bandwidth, than with high
 signal-to-noise in a narrow bandwidth; this is  one motivation for wideband
 communication methods such as the `direct sequence spread-spectrum'\index{spread spectrum}
 approach used
 in {3G} \ind{mobile phone}s. Of course, you are not alone,
 and your electromagnetic neighbours
 may not be pleased if you use a large bandwidth, so for social reasons,
 engineers often have to make do with higher-power, narrow-bandwidth
 transmitters.
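
 [A sketch of the substitution and of the asymptote.
 Each of the $2W$ channel uses per second has capacity
 $\frac{1}{2} \log ( 1 + v/\sigma^2 )$ with $v = P/2W$ and $\sigma^2 = N_0/2$, so
\beq
	C = 2 W \cdot \frac{1}{2} \log \left( 1 + \frac{ P/2W }{ N_0/2 } \right)
	  = W \log \left( 1 + \frac{P}{N_0 W} \right) .
\eeq
 For large bandwidth, using $\ln ( 1 + \epsilon ) \simeq \epsilon$ with
 $\epsilon = W_0/W$, and taking logarithms to base 2,
\beq
	C = W \log_2 \left( 1 + \frac{W_0}{W} \right)
	\rightarrow W_0 \log_2 e \simeq 1.44 \, W_0 \:\: \mbox{bits per second} ,
\eeq
 whereas at $W = W_0$ the capacity is $W_0 \log_2 2 = W_0$ bits per second.]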
%\begin{figure}
%\figuremargin{%
\marginfig{
% figs: load 'wideband.com'
\begin{center}
\mbox{\psfig{figure=figs/wideband.ps,%
width=1.75in,angle=-90}}
\end{center}
%}{%
\caption[a]{Capacity versus bandwidth for a real channel: 
        $C/W_0 = W/W_0 \log \left( 1 + W_0/W
 \right)$ as a function of $W/W_0$.}
\label{fig.wideband}
}%
%\end{figure}
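
 The shape of this curve is easy to check numerically. The following
 Python sketch evaluates $C/W_0$ with logarithms to base 2; the function
 names are our own, chosen for illustration, and are not part of any library.

```python
import math

def capacity_bits_per_second(W, P, N0):
    """C = W log2(1 + P/(N0 W)) for bandwidth W, power P, noise spectral density N0."""
    return W * math.log2(1.0 + P / (N0 * W))

def normalized_capacity(w):
    """C/W0 as a function of w = W/W0, where W0 = P/N0."""
    return w * math.log2(1.0 + 1.0 / w)

if __name__ == "__main__":
    # Tabulate the curve at a few bandwidths and compare with the asymptote.
    for w in (0.1, 1.0, 10.0, 100.0):
        print(f"W/W0 = {w:6.1f}   C/W0 = {normalized_capacity(w):.4f}")
    print(f"asymptote log2(e) = {math.log2(math.e):.4f}")
```

 At $W = W_0$ the normalized capacity is exactly 1 bit, and as $W/W_0$ grows
 it approaches $\log_2 e \simeq 1.44$ from below.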

\section{What are the capabilities of practical error-correcting codes?\nonexaminable}
\label{sec.bad.code.def}% see also {sec.good.codes}!
% cf also \ref{sec.bad.dist.def}
% in _linear.tex

% Description of Established Codes}
%
 Nearly all codes are good, but nearly all codes require exponential look-up
 tables for practical
 implementation of the encoder and decoder -- exponential in the 
 blocklength $N$. And the coding theorem required $N$ to be large. 

 By a {\dem\ind{practical}\/} error-correcting code, we mean one that
 can be encoded and decoded in a reasonable amount of time,
 for example, a time that scales as a polynomial function
 of the blocklength $N$ -- preferably linearly.

\subsection{The Shannon limit is not achieved in practice}
 The non-constructive proof of the noisy-channel coding theorem showed 
 that good block codes exist for any noisy channel, and indeed that nearly 
 all block codes are good. But writing down an explicit and {practical\/}
 encoder 
 and decoder that are as good as promised by Shannon is still an unsolved 
 problem. 

%  Most of the explicit families of codes that have been written down have the 
%  property that they can achieve a vanishing error probability $p_{\rm b}$ 
%  as $N \rightarrow \infty$ only if the rate $R$ also goes to zero. 
% 
%  There is one exception to this statement:
% , given by a family of codes based on 
% {\dbf concatentation}. 

\label{sec.good.codes}
\begin{description}
\item[Very good codes\puncspace]
 Given a channel, a family of block\index{error-correcting code!very good}
 codes  that achieve arbitrarily small
 probability of error 
 at any communication rate 
 up to the capacity 
 of the channel are called  `very good' codes
 for that channel.
\item[Good codes]
 are code families that
 achieve arbitrarily small probability of error 
 at non-zero communication rates 
 up to some maximum rate 
 that may be {\em less than\/} the \ind{capacity} 
 of the given channel.\index{error-correcting code!good}
\item[Bad codes] are code families that  cannot achieve arbitrarily small
 probability of error, or that
 can only achieve arbitrarily small
 probability of error\index{error-correcting code!bad}
% $\epsilon$   `bad'
 by decreasing the information rate
% $R$ 
 to zero.
 Repetition codes\index{error-correcting code!repetition}\index{repetition code}%
\index{error-correcting code!bad}
 are an example of a bad code family.
 (Bad codes  are not necessarily useless for practical
 purposes.)
\item[Practical codes] are code families that can be\index{error-correcting code!practical}
 encoded and decoded in time and space polynomial in the blocklength.
\end{description}



\subsection{Most established codes are linear codes}
 Let us review the definition of a block code, and then add
 the definition of a linear block
 code.\index{error-correcting code!block code}\index{error-correcting code!linear}\index{linear block code}  
\begin{description}
\item[An $(N,K)$ block code] for a channel $Q$ is a list  of $\cwM=2^K$
 codewords 
 $\{ \bx^{(1)}, \bx^{(2)}, \ldots, \bx^{(2^K)} \}$, each of length $N$: 
 $\bx^{(\cwm)} \in \A_X^N$.
 The signal to be encoded, $\cwm$, which comes from an 
 alphabet of size $2^K$, is encoded as $\bx^{(\cwm)}$.

% The {\dbf\ind{rate}} of the code\index{error-correcting code!rate} is $R = K/N$ bits. 
%
% [This definition holds for any channels, not only binary channels.]
\item[A linear $(N,K)$ block code] is a block code in which  
 the codewords $\{ \bx^{(\cwm)} \}$ make up a $K$-dimensional subspace of
 $\A_X^N$. The encoding operation  can be represented by an $N \times K$
 binary matrix\index{generator matrix}
 $\bG^{\T}$ such that if the signal to be encoded,
 in binary notation, is $\bs$ (a vector of length $K$ bits), then the
 encoded signal is $\bt = \bG^{\T} \bs \mbox{ modulo } 2$.

 The codewords $\{ \bt \}$ can be defined as the set of vectors
 satisfying $\bH \bt = {\bf 0} \mod 2$, where $\bH$ is the
 {\dem\ind{parity-check matrix}\/}
 of the code.
\end{description}

\marginpar[c]{\[%beq
 \bG^{\T} = {\small \left[ \begin{array}{@{\,}*{4}{c@{\,}}} 
1 & \cdot & \cdot & \cdot \\[-0.05in]
\cdot & 1 & \cdot & \cdot \\[-0.05in]
\cdot & \cdot & 1 & \cdot \\[-0.05in]
\cdot & \cdot & \cdot & 1 \\[-0.05in]
1 & 1 & 1 & \cdot \\[-0.05in]
\cdot & 1 & 1 & 1 \\[-0.05in]
1 & \cdot & 1 & 1  \end{array} \right] }  % nb different from l1.tex, no longer
\]%eeq
}
 For example, the
 $(7,4)$ \ind{Hamming code} of section \ref{sec.ham74}
 takes $K=4$ signal bits, $\bs$, and transmits
 them followed by three parity-check bits. The $N=7$ transmitted
 symbols are given by $\bG^{\T} \bs \mod 2$.
% , where:
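
 The encoding rule $\bt = \bG^{\T} \bs \mod 2$ can be sketched in a few lines
 of Python. The matrix below is the $\bG^{\T}$ shown in the margin, with dots
 read as zeros. The parity-check matrix is written in the standard form
 $\bH = [ {\bf P} \: {\bf I}_3 ]$, where ${\bf P}$ is the bottom three rows of
 $\bG^{\T}$; that form, and the function names, are assumptions made for this
 illustration.

```python
# Rows of G^T as printed in the margin (dots read as zeros).
G_T = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [1, 0, 1, 1],
]

# Assumed standard form H = [P I_3], with P the bottom three rows of G^T.
H = [row + [int(i == j) for j in range(3)] for i, row in enumerate(G_T[4:])]

def mat_vec_mod2(M, v):
    """Multiply the binary matrix M by the binary vector v, modulo 2."""
    return [sum(m * x for m, x in zip(row, v)) % 2 for row in M]

def encode(s):
    """Map K = 4 source bits to N = 7 transmitted bits: t = G^T s mod 2."""
    return mat_vec_mod2(G_T, s)
```

 For instance, {\tt encode([1, 0, 1, 1])} returns {\tt [1, 0, 1, 1, 0, 0, 1]}:
 the four source bits followed by three parity-check bits, and every
 codeword $\bt$ produced this way satisfies $\bH \bt = {\bf 0} \mod 2$.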

 Coding theory was born with the work of Hamming, who invented a
 family of practical
 error-correcting codes, each able to correct one error in a
 block of length $N$, of which the repetition code $R_3$ and the
 $(7,4)$ code  are the simplest.
 Since then most established codes have been 
 generalizations of Hamming's codes:
% `BCH' (Bose, Chaudhury and Hocquenhem)
 Bose--Chaudhuri--Hocquenghem
% The search for decodeable codes has produced the following families. 
 codes,  Reed--Muller codes,  Reed--Solomon codes, and
 Goppa codes, to name a few.

\subsection{Convolutional codes}
 Another family of linear codes is {\dem\ind{convolutional code}s}, which 
 do not divide the source stream into blocks, but instead read
 and\index{error-correcting code!convolutional}  
 transmit bits continuously. The transmitted bits
 are a  linear function of  the past  source bits. 
%  both bits and parity checks in some fixed proportion. 
 Usually the rule for generating the transmitted bits
% parity checks 
 involves feeding the present source bit
 into a  \lfsr\index{linear feedback shift register} of length $k$, 
 and transmitting one or more
 linear functions of the state of the shift register 
 at each iteration. 
 The resulting  transmitted bit stream 
 is  
%can be thought of as 
 the convolution 
 of the source stream with a linear filter. 
 The impulse-response function of this filter may have finite 
 or infinite duration, depending on the choice of feedback shift-register.
% it is

 We will discuss convolutional codes in  \chapterref{ch.convol}.
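
 As a minimal sketch of the idea, the following Python function implements
 the simplest case: a shift register with no feedback, so that the
 impulse response has finite duration. The taps are illustrative values
 of our own choosing, not a code taken from the text. Two transmitted
 bits are produced for every source bit.

```python
def conv_encode(source_bits, taps=((1, 1, 1), (1, 0, 1))):
    """Feedforward convolutional encoder (illustrative taps, not from the text).

    Each tap vector defines one linear function of the shift-register state;
    one output bit per tap vector is emitted for every source bit.
    """
    k = len(taps[0])
    register = [0] * k  # register[0] holds the newest source bit
    out = []
    for u in source_bits:
        register = [u] + register[:-1]  # shift the present source bit in
        for g in taps:
            out.append(sum(gi * ri for gi, ri in zip(g, register)) % 2)
    return out
```

 Feeding in a single {\tt 1} followed by zeros reads out the tap vectors
 themselves, interleaved; and because every output is a sum modulo 2 of
 source bits, the encoding of the sum of two source streams is the sum
 of their encodings.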
 
\subsection{Are linear codes `good'?}
 One might ask whether the Shannon limit is not achieved
 in practice 
 because linear codes are inherently not\index{error-correcting code!linear}\index{error-correcting code!good}\index{error-correcting code!random} 
 as good as random codes.\index{random code} The answer is no: the noisy-channel coding theorem 
 can still be proved for linear codes, at least for some channels
 (see  \chapterref{ch.linear.good}),
 though the proofs, like Shannon's
 proof for random codes, are non-constructive.
%(We will prove that
% there exist linear codes that are very good codes 
% in chapter \ref{ch.linear.good}.
% and in particular for `cyclic codes', 
% a class to which BCH and Reed--Solomon codes belong. 

 Linear codes are easy to implement at the encoding end. Is decoding a
 linear code also easy? Not necessarily. The general decoding problem\index{error-correcting code!decoding}\index{linear block code!decoding}
 (find the maximum likelihood $\bs$ in the equation $\bG^{\T} \bs + \bn =
 \br$) is in fact \inds{NP-complete} \cite{BMT78}.  [NP-complete problems are
 computational problems that are all equally difficult and which
 are widely believed to  require exponential
 computer time to solve in general.] So attention focuses on families of codes
% (such as those listed above)
 for which there is a fast decoding algorithm.
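
 To see where the exponential cost comes from, consider the obvious
 decoder, which tries every possible $\bs$. The sketch below does this
 for the $(7,4)$ Hamming code, assuming a binary symmetric channel with
 flip probability less than $1/2$, so that the maximum likelihood $\bs$
 is the one whose codeword is closest to $\br$ in Hamming distance.
 The function names are our own. The search visits $2^K$ candidates,
 which is harmless for $K=4$ and hopeless for large $K$.

```python
from itertools import product

# G^T of the (7,4) Hamming code (dots in the margin matrix read as zeros).
G_T = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [1, 0, 1, 1],
]
K = 4

def encode(s):
    """t = G^T s mod 2."""
    return [sum(g * x for g, x in zip(row, s)) % 2 for row in G_T]

def hamming_distance(a, b):
    return sum(x != y for x, y in zip(a, b))

def brute_force_decode(r):
    """Search all 2^K source vectors for the codeword nearest to r."""
    best = min(product((0, 1), repeat=K),
               key=lambda s: hamming_distance(encode(s), r))
    return list(best)
```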


\subsection{Concatenation}
 One trick for building codes with practical decoders 
 is the idea of {concatenation}.\index{error-correcting code!concatenated}\index{concatenation!error-correcting codes} 

 An\amarginfignocaption{t}{
\begin{center}
\setlength{\unitlength}{1mm}
\begin{picture}(25,10)%
\put(17.5,8){\makebox(0,0){$\C' \rightarrow \underbrace{\C \rightarrow Q \rightarrow \D}
 \rightarrow \D'$}}
\put(17.5,3){\makebox(0,0){$Q'$}}
%
\end{picture}%
\end{center}
%\caption[a]{none}
}
 encoder--channel--decoder system $\C \rightarrow Q \rightarrow \D$
 can be viewed as defining a \ind{super-channel} $Q'$
 with a smaller probability of error, and with complex\index{channel!complex}
 correlations among
 its errors. We can  create an encoder $\C'$ and decoder $\D'$ 
 for this super-channel $Q'$.
 The code consisting of the outer code $\C'$ followed by the inner code $\C$ is known 
 as a {\dem{concatenated code}}.\index{concatenation!error-correcting codes}

 Some concatenated codes make use of the idea of {\dbf
 \ind{interleaving}}. We read
% Interleaving involves encoding
 the data in  blocks, the size of each block being larger than the
 blocklengths of the constituent codes $\C$ and $\C'$. 
 After encoding the data of one block using code $\C'$, the bits 
 are reordered within the block in such a way that
 nearby bits are separated from each other once the block is fed to
  the second code $\C$. A simple example of an interleaver
  is a {\dbf\ind{rectangular code}\/}
 or\index{error-correcting code!rectangular}\index{error-correcting code!product code}
 {\dem\ind{product code}\/} in which the data are arranged in a
  $K_2 \times K_1$ block, and encoded horizontally using an
 $(N_1,K_1)$
  linear code, then vertically using an $(N_2,K_2)$ linear code.
\exercisaxB{3}{ex.productorder}{
 Show that either of the two codes can be viewed as the \ind{inner code} or the
 \ind{outer code}.
}
%\subsection{}

% see also _concat2.tex
As an example,  \figref{fig.concath1}
 shows a product code  in which we
% encode horizontally 
% For example, if we
 encode first with the repetition code $\Rthree$ (also known
 as the \ind{Hamming code} $H(3,1)$)
 horizontally then with $H(7,4)$
 vertically.
 The blocklength of the
 concatenated\index{concatenation} code is $3 \times 7 = 21$. The number of source bits per  codeword is
 four, shown by the small rectangle.
% The code would be equivalent if we
% encoded first with $H(7,4)$ and second with $\Rthree$.
\begin{figure}
\figuremargin{%
\setlength{\unitlength}{0.4mm}
\begin{center}
\begin{tabular}{rrrrr}
(a)
\begin{picture}(30,70)(0,0)
\put(0,0){\framebox(30,70)}
\put(0,30){\framebox(10,40)}
\put(5,65){\makebox(0,0){1}}
\put(5,55){\makebox(0,0){0}}
\put(5,45){\makebox(0,0){1}}
\put(5,35){\makebox(0,0){1}}
\put(5,25){\makebox(0,0){0}}
\put(5,15){\makebox(0,0){0}}
\put(5,5){\makebox(0,0){1}}
\put(15,65){\makebox(0,0){1}}
\put(15,55){\makebox(0,0){0}}
\put(15,45){\makebox(0,0){1}}
\put(15,35){\makebox(0,0){1}}
\put(15,25){\makebox(0,0){0}}
\put(15,15){\makebox(0,0){0}}
\put(15,5){\makebox(0,0){1}}
\put(25,65){\makebox(0,0){1}}
\put(25,55){\makebox(0,0){0}}
\put(25,45){\makebox(0,0){1}}
\put(25,35){\makebox(0,0){1}}
\put(25,25){\makebox(0,0){0}}
\put(25,15){\makebox(0,0){0}}
\put(25,5){\makebox(0,0){1}}
\end{picture}&
%
% noise picture
%
(b)
\begin{picture}(30,70)(0,0)
\put(0,0){\framebox(30,70)}
\put(0,30){\framebox(10,40)}
\put(5,55){\makebox(0,0){$\star$}}%
\put(5,15){\makebox(0,0){$\star$}}%
%
\put(15,55){\makebox(0,0){$\star$}}%
\put(15,35){\makebox(0,0){$\star$}}%
%
\put(25,25){\makebox(0,0){$\star$}}%
\end{picture}&
%
% received vector picture
%
(c)
\begin{picture}(30,70)(0,0)
\put(0,0){\framebox(30,70)}
\put(0,30){\framebox(10,40)}
\put(5,65){\makebox(0,0){1}}
\put(5,55){\makebox(0,0){1}}%
\put(5,45){\makebox(0,0){1}}
\put(5,35){\makebox(0,0){1}}
\put(5,25){\makebox(0,0){0}}
\put(5,15){\makebox(0,0){1}}%
\put(5,5){\makebox(0,0){1}}
%
\put(15,65){\makebox(0,0){1}}
\put(15,55){\makebox(0,0){1}}%
\put(15,45){\makebox(0,0){1}}
\put(15,35){\makebox(0,0){0}}%
\put(15,25){\makebox(0,0){0}}
\put(15,15){\makebox(0,0){0}}
\put(15,5){\makebox(0,0){1}}
%
\put(25,65){\makebox(0,0){1}}
\put(25,55){\makebox(0,0){0}}
\put(25,45){\makebox(0,0){1}}
\put(25,35){\makebox(0,0){1}}
\put(25,25){\makebox(0,0){1}}%
\put(25,15){\makebox(0,0){0}}
\put(25,5){\makebox(0,0){1}}
\end{picture} &
% after R3 correction
(d)
\begin{picture}(30,70)(0,0)
\put(0,0){\framebox(30,70)}
\put(0,30){\framebox(10,40)}
\put(5,65){\makebox(0,0){1}}
\put(5,55){\makebox(0,0){1}}%
\put(5,45){\makebox(0,0){1}}
\put(5,35){\makebox(0,0){1}}
\put(5,25){\makebox(0,0){0}}
\put(5,15){\makebox(0,0){{\bf 0}}}%
\put(5,5){\makebox(0,0){1}}
%
\put(15,65){\makebox(0,0){1}}
\put(15,55){\makebox(0,0){1}}%
\put(15,45){\makebox(0,0){1}}
\put(15,35){\makebox(0,0){{\bf 1}}}%
\put(15,25){\makebox(0,0){0}}
\put(15,15){\makebox(0,0){0}}
\put(15,5){\makebox(0,0){1}}
%
\put(25,65){\makebox(0,0){1}}
\put(25,55){\makebox(0,0){{\bf 1}}}
\put(25,45){\makebox(0,0){1}}
\put(25,35){\makebox(0,0){1}}
\put(25,25){\makebox(0,0){{\bf 0}}}%
\put(25,15){\makebox(0,0){0}}
\put(25,5){\makebox(0,0){1}}
\end{picture}&
% after 74 correction
(e)
\begin{picture}(30,70)(0,0)
\put(0,0){\framebox(30,70)}
\put(0,30){\framebox(10,40)}
\put(5,65){\makebox(0,0){1}}
\put(5,55){\makebox(0,0){{\bf 0}}}%
\put(5,45){\makebox(0,0){1}}
\put(5,35){\makebox(0,0){1}}
\put(5,25){\makebox(0,0){0}}
\put(5,15){\makebox(0,0){{0}}}%
\put(5,5){\makebox(0,0){1}}
%
\put(15,65){\makebox(0,0){1}}
\put(15,55){\makebox(0,0){{\bf 0}}}%
\put(15,45){\makebox(0,0){1}}
\put(15,35){\makebox(0,0){{1}}}%
\put(15,25){\makebox(0,0){0}}
\put(15,15){\makebox(0,0){0}}
\put(15,5){\makebox(0,0){1}}
%
\put(25,65){\makebox(0,0){1}}
\put(25,55){\makebox(0,0){{\bf 0}}}
\put(25,45){\makebox(0,0){1}}
\put(25,35){\makebox(0,0){1}}
\put(25,25){\makebox(0,0){{0}}}%
\put(25,15){\makebox(0,0){0}}
\put(25,5){\makebox(0,0){1}}
\end{picture}\\
&
%
% noise picture
%
&
 &
% after 74 correction
(d$^{\prime}$)
\begin{picture}(30,70)(0,0)
\put(0,0){\framebox(30,70)}
\put(0,30){\framebox(10,40)}
\put(5,65){\makebox(0,0){1}}
\put(5,55){\makebox(0,0){1}}%
\put(5,45){\makebox(0,0){1}}
\put(5,35){\makebox(0,0){1}}
\put(5,25){\makebox(0,0){{\bf 1}}}%
\put(5,15){\makebox(0,0){1}}%
\put(5,5){\makebox(0,0){1}}
%
\put(15,65){\makebox(0,0){{\bf 0}}}%
\put(15,55){\makebox(0,0){1}}%
\put(15,45){\makebox(0,0){1}}
\put(15,35){\makebox(0,0){0}}%
\put(15,25){\makebox(0,0){0}}
\put(15,15){\makebox(0,0){0}}
\put(15,5){\makebox(0,0){1}}
%
\put(25,65){\makebox(0,0){1}}
\put(25,55){\makebox(0,0){0}}
\put(25,45){\makebox(0,0){1}}
\put(25,35){\makebox(0,0){1}}
\put(25,25){\makebox(0,0){{\bf 0}}}%
\put(25,15){\makebox(0,0){0}}
\put(25,5){\makebox(0,0){1}}
\end{picture} &
% after R3 correction
(e$^{\prime}$)
\begin{picture}(30,70)(0,0)
\put(0,0){\framebox(30,70)}
\put(0,30){\framebox(10,40)}
\put(5,65){\makebox(0,0){1}}
\put(5,55){\makebox(0,0){(1)}}
\put(5,45){\makebox(0,0){1}}
\put(5,35){\makebox(0,0){1}}
\put(5,25){\makebox(0,0){{\bf 0}}}
\put(5,15){\makebox(0,0){{\bf 0}}}%
\put(5,5){\makebox(0,0){1}}
%
\put(15,65){\makebox(0,0){{\bf 1}}}
\put(15,55){\makebox(0,0){(1)}}%
\put(15,45){\makebox(0,0){1}}
\put(15,35){\makebox(0,0){{\bf 1}}}%
\put(15,25){\makebox(0,0){0}}
\put(15,15){\makebox(0,0){0}}
\put(15,5){\makebox(0,0){1}}
%
\put(25,65){\makebox(0,0){1}}
\put(25,55){\makebox(0,0){(1)}}
\put(25,45){\makebox(0,0){1}}
\put(25,35){\makebox(0,0){1}}
\put(25,25){\makebox(0,0){{0}}}%
\put(25,15){\makebox(0,0){0}}
\put(25,5){\makebox(0,0){1}}
\end{picture}\\
\end{tabular}
\end{center}
}{%
\caption[a]{A product code.
 (a) A string {\tt{1011}}  encoded using a concatenated code
 consisting of two Hamming codes, $H(3,1)$ and $H(7,4)$.
 (b) A noise pattern that flips 5 bits. (c) The received vector.
 (d) After decoding using the  horizontal $(3,1)$ decoder,
 and (e) after subsequently using the  vertical $(7,4)$ decoder. The decoded
 vector matches the original.

 (d$^{\prime}$, e$^{\prime}$) After decoding in the other order, three errors
 still remain.}
\label{fig.concath1}
}%
\end{figure}

\label{sec.concatdecode}We
 can decode conveniently (though not optimally) by using the
 individual decoders for each of the subcodes in some sequence.
 It makes most sense to first decode the code which has the
 lowest rate and hence the greatest error-correcting ability.

 \Figref{fig.concath1}(c--e) shows what happens if we receive the
 codeword of \figref{fig.concath1}a with some errors (five bits  flipped,
 as shown) and
 apply the decoder for $H(3,1)$ first, and then the
 decoder for $H(7,4)$. The first decoder corrects three of the errors,
 but erroneously modifies the third bit in the second row where there
 are two bit errors. The $(7,4)$ decoder can then correct all three of
 these errors. 

 \Figref{fig.concath1}(d$^{\prime}$--$\,$e$^{\prime}$) shows what happens if we decode the two codes
 in the other order.
 In columns one and two there are two errors, so the $(7,4)$ decoder
 introduces two extra errors. It corrects the one error in column 3.
 The $(3,1)$ decoder then cleans up four of the errors, but erroneously
 infers the second bit.
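
 The two decoding orders can be replayed in a short Python sketch.
 The block is stored as 7 rows of 3 bits; the noise pattern is that of
 \figref{fig.concath1}b; the $(7,4)$ decoder is a syndrome decoder that
 flips the single bit whose column of $\bH$ matches the syndrome, and the
 $(3,1)$ decoder is a majority vote. The parity-check matrix in standard
 form and all function names are assumptions made for this illustration.

```python
G_T = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1],
       [1, 1, 1, 0], [0, 1, 1, 1], [1, 0, 1, 1]]
H = [[1, 1, 1, 0, 1, 0, 0], [0, 1, 1, 1, 0, 1, 0], [1, 0, 1, 1, 0, 0, 1]]

def h74_encode(s):
    return [sum(g * x for g, x in zip(row, s)) % 2 for row in G_T]

def h74_decode(col):
    """Syndrome decoding: flip the bit whose column of H equals the syndrome."""
    z = [sum(h * x for h, x in zip(row, col)) % 2 for row in H]
    col = list(col)
    if any(z):
        for n in range(7):
            if [row[n] for row in H] == z:
                col[n] ^= 1
    return col

def r3_decode(row):
    """Majority vote over the three repeated bits."""
    return [int(sum(row) >= 2)] * 3

def encode_product(s):
    """Encode vertically with H(7,4), then repeat each bit three times."""
    return [[b] * 3 for b in h74_encode(s)]

def decode_rows_then_cols(block):
    rows = [r3_decode(r) for r in block]
    cols = [h74_decode([rows[i][j] for i in range(7)]) for j in range(3)]
    return [[cols[j][i] for j in range(3)] for i in range(7)]

def decode_cols_then_rows(block):
    cols = [h74_decode([block[i][j] for i in range(7)]) for j in range(3)]
    rows = [[cols[j][i] for j in range(3)] for i in range(7)]
    return [r3_decode(r) for r in rows]

def count_errors(a, b):
    return sum(x != y for ra, rb in zip(a, b) for x, y in zip(ra, rb))

# Noise of panel (b): (row, column) pairs, zero-indexed from the top left.
FLIPS = [(1, 0), (5, 0), (1, 1), (3, 1), (4, 2)]
sent = encode_product([1, 0, 1, 1])
received = [row[:] for row in sent]
for i, j in FLIPS:
    received[i][j] ^= 1
```

 With these definitions,
 {\tt count\_errors(decode\_rows\_then\_cols(received), sent)} is 0, whereas
 {\tt count\_errors(decode\_cols\_then\_rows(received), sent)} is 3,
 in agreement with panels (e) and (e$^{\prime}$) of the figure.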


% To make simple decoding possible,
% we split up bits that are in a single codeword at the first level,
% grouping them with other bits. Rectangular arrangement makes this easiest
% to see.


 



\subsection{Interleaving}
  The motivation for interleaving is that by spreading out bits that
  are nearby in one code, we make it possible to ignore
% forget about
 the complex correlations among the errors that are produced by the
  inner code.  Maybe the inner code will mess up an entire codeword;
  but that codeword is spread out one bit at a time over several codewords
  of the outer code. So we can treat  the errors introduced by the
  inner code as if they are independent.\index{approximation!of complex distribution}
% by a simpler one}
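
 A rectangular interleaver makes this concrete. In the sketch below,
 which is an illustration under our own naming and not a scheme taken
 from the text, each row of the array is one codeword of the outer code;
 the array is written row by row and transmitted column by column.
 A run of consecutive corrupted bits on the channel, such as one ruined
 inner codeword, then lands on different rows.

```python
def interleave(bits, rows, cols):
    """Write row by row into a rows x cols array, read column by column."""
    assert len(bits) == rows * cols
    return [bits[r * cols + c] for c in range(cols) for r in range(rows)]

def deinterleave(bits, rows, cols):
    """Invert interleave(): restore the original row-by-row order."""
    assert len(bits) == rows * cols
    return [bits[c * rows + r] for r in range(rows) for c in range(cols)]

def errors_per_codeword(burst_start, burst_len, rows, cols):
    """Count how many bits of a channel burst land in each row (outer codeword)."""
    channel = [0] * (rows * cols)
    for i in range(burst_start, burst_start + burst_len):
        channel[i] = 1
    restored = deinterleave(channel, rows, cols)
    return [sum(restored[r * cols:(r + 1) * cols]) for r in range(rows)]
```

 Any burst no longer than the number of rows deposits at most one error
 in each outer codeword; a longer burst starts to hit some codeword twice.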

% 
%  By iterating this process, with each successive 
%  code adding a small amount of redundancy to a geometrically increasing block, 
%  we can define an explicit sequence of codes with the property that 
%  $p_{\rm b} \rightarrow 0$ for some rate $R > 0$ (but not any $R$ up to the 
%  capacity $C$). 
% 
%  There is also a  proof by Forney that better concatenations 
%  exist, which achieve rates up to capacity and have encoding and decoding 
%  complexity of order $O(N^4)$. But the proof is non-constructive.
%
%  gf.tex could be included here
% 
% \subsection{Coding theory sells you short}
%  At this point  could discuss the universalist `this code corrects 
%  all errors up to $t$' with the Shannonist `the prob of error 
%  is tiny'. The latter attitude allows you to communicate at far 
%  greater rates. The former attitude is happy with something that 
%  is only halfway. 
%  
% Distance
%  Show Prob of error of ideal decoder (Schematic) as function of noise level. 
%  Show that you can cope with double the noise. 

\subsection{Other channel models}
% Most of the codes mentioned above are designed in terms of 
 In addition to the binary
 symmetric channel and the Gaussian channel,
% or in terms of the number of errors they can correct, but
 coding theorists  keep more complex
 channels in mind also.

%\index{burst-error channels}
 {\dem Burst-error channels\/}\index{channel!bursty}\index{burst errors}
 are important models in
 practice. \ind{Reed--Solomon code}s use \ind{Galois field}s
 (see  \appendixref{app.GF})
 with large numbers of
 elements (\eg\ $2^{16}$) as their input alphabets,
 and thereby automatically achieve a degree
 of burst-error tolerance in that even if 17 successive bits are
 corrupted, only 2 successive symbols in the Galois field representation are
 corrupted. Concatenation and interleaving can give further 
% fortuitous 
 protection against 
% \index{concatenated code}
 burst errors. The concatenated\index{concatenation!error-correcting codes}\index{error-correcting code!concatenated}
 Reed--Solomon codes used on digital compact discs
% DISKS?
 are able to correct  bursts of errors of length 4000 bits. 
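
 The arithmetic behind the 17-bit example can be checked directly.
 The helper below, a small illustration with a name of our own,
 counts how many 16-bit symbols a burst of consecutive bits overlaps.

```python
def symbols_hit(burst_start, burst_len, bits_per_symbol=16):
    """Number of symbols touched by a burst of burst_len consecutive bits."""
    first = burst_start // bits_per_symbol
    last = (burst_start + burst_len - 1) // bits_per_symbol
    return last - first + 1
```

 Whatever its starting position, a burst of 17 bits touches exactly
 2 symbols, while a burst of 18 bits can touch 3.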

\exercissxB{2}{ex.interleaving.dumb}{
 The technique of \ind{interleaving},\index{implicit assumptions}
 which allows bursts of\index{error-correcting code!interleaving}
 errors to be treated as independent, is widely used, but is theoretically
 a poor way  to protect data against
 \ind{burst errors}, in terms of the amount of redundancy required. 
 Explain why interleaving is a poor method, using the following
 burst-error channel as an example. Time is divided into chunks
 of length 
 $N=100$ clock cycles; during each chunk, there is a burst with
 probability $b=0.2$; during a burst, the channel is a binary symmetric channel
 with $f=0.5$. If there is no burst, the channel is an error-free binary
 channel.  Compute the capacity of this channel and compare it with the
 maximum communication rate that could conceivably be achieved if one
 used interleaving and treated the errors as independent.
  }

%  The BSC is an inadequate channel model for a second reason: many
%  channels have {\em real outputs}. For example, a
%  binary input $x$ may give rise to a
%  probability distribution over a real output $y$. Codes whose decoders
%  can handle real outputs (log likelihood ratios) are therefore
%  important. `Convolutional codes' are such codes, as are some block codes.

 {\dem\index{fading channel}{Fading channels}\/} are real\index{channel!fading}
 channels like Gaussian\index{channel!Gaussian}
 channels except that the received
 power is assumed to vary with time.
 A  moving
 \ind{mobile phone}\index{cellphone|see{mobile phone}}\index{phone!cellular|see{mobile phone}}
 is an important example.
 The incoming \ind{radio} signal  is reflected
 off nearby objects so that there are  interference patterns and the
 intensity of the
 signal received by the phone varies with its location. The received power
 can easily vary by 10 decibels\index{decibel}
 (a factor of ten) as the phone's antenna
 moves through a distance similar to the wavelength of the radio signal
 (a few centimetres).
%Fading channels are used as models
% of the radio channel of mobile phones, in which the received power
% varies rapidly 


\section{The state of the art}
 What are the best known codes for communicating over Gaussian channels? 
 All the practical codes are linear codes, and are based on either
 convolutional codes or block codes.\index{linear block code}
\subsection{Convolutional codes, and codes based on them}
\begin{description}
\item[Textbook convolutional codes\puncspace] The `de facto standard'
% cite golomb?
 error-correcting code for\index{communication} 
 \ind{satellite communications} is a 
 convolutional code with constraint length 7.
 Convolutional codes are discussed in \chref{ch.convol}.
\item[Concatenated convolutional codes\puncspace]
 The above \ind{convolutional code}
 can be used as the inner code of a\index{error-correcting code!concatenated}
 concatenated code whose
 outer code 
 is a \ind{Reed--Solomon code} with eight-bit symbols. This code 
 was  used in  deep space communication systems such as the
 Voyager spacecraft.
 For further reading about Reed--Solomon codes,
 see \citeasnoun{lincostello83}.
\item[The  code for \index{Galileo code}{Galileo}\puncspace]
        A code using  the same format but using a longer 
 constraint length -- 15 -- for its convolutional code and a larger 
 Reed--Solomon code was developed by the \ind{Jet Propulsion Laboratory} \cite{JPLcode}. 
 The details of this code are unpublished outside JPL, 
 and the decoding is only possible using 
 a room full of special-purpose hardware.
 In 1992, this
 was the best  code known of rate \dfrac{1}{4}.
\item[Turbo codes\puncspace]
        In 1993, \index{Berrou, C.}{Berrou}, \index{Glavieux, A.}{Glavieux}
 and  \index{Thitimajshima, P.}{Thitimajshima}
 \nocite{Berrou93:Turbo}reported 
 work on {\dem\ind{turbo code}s}. The encoder of a turbo code is based on
 the encoders of two
% or more constituent codes. In 
% the original paper the two constituent codes were
 convolutional codes. 
 The source bits are fed into each encoder, the order of the 
 source bits being permuted in a  random way, and the resulting 
 parity bits from each constituent code are transmitted. 

        The decoding algorithm
% invented by Berrou {\em et al\/}
 involves iteratively decoding each constituent code%
\amarginfig{b}{
\begin{center}
\setlength{\unitlength}{1mm}
\begin{picture}(25,30)(0,8)%
\put(15,18){\framebox(8,8){$C_1$}}
\put(15, 8){\framebox(8,8){$C_2$}}
\put( 9,12){\circle{6}}
\put( 5, 8){\framebox(8,8){$\pi$}}
\put(9.7,14.875){\vector(1,0){0.1}}% right pointing circle vector % was 975
\put(23,22){\vector(1,0){3}}
\put(23,12){\vector(1,0){3}}
\put(13,12){\line(1,0){2}}
\put( 2,12){\vector(1,0){3}}
\put( 0,22){\vector(1,0){15}}
\put( 2,22){\line(0,-1){10}}
%
\end{picture}%
\end{center}

\caption[a]{The encoder of a turbo code.
 Each box $C_1$, $C_2$, contains a convolutional code.
 The source bits are reordered using a permutation $\pi$ before
 they are fed to $C_2$. The transmitted codeword is obtained
 by concatenating or interleaving the outputs of the two
 convolutional codes.
 The random
 permutation is chosen when the code is designed, and fixed
 thereafter.
}
}
 using its 
 standard decoding algorithm, then using the 
 output of the decoder as the input to the other decoder.
 This decoding algorithm is an instance of a
 {\dbf{message-passing}}\index{message passing}
 algorithm called the {\dbf\ind{sum--product algorithm}}.

 Turbo codes are discussed in \chref{ch.turbo}, and message passing in Chapters \ref{ch.message},
 \ref{ch.noiseless}, \ref{ch.exact}, and \ref{ch.sumproduct}.
\end{description}
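
 The structure of the turbo encoder described above can be sketched as
 follows. This is a toy under stated assumptions, not the code of Berrou
 {\em et al.}: each constituent code is a simple feedforward parity
 generator with taps of our own choosing, and the source bits themselves
 are transmitted once alongside the two parity streams.

```python
import random

def parity_stream(bits, taps=(1, 1, 1)):
    """Parity bits from a toy feedforward convolutional encoder (hypothetical taps)."""
    register = [0] * len(taps)
    out = []
    for u in bits:
        register = [u] + register[:-1]
        out.append(sum(g * r for g, r in zip(taps, register)) % 2)
    return out

def make_permutation(n, seed=0):
    """Choose the permutation pi once, at design time; it is fixed thereafter."""
    pi = list(range(n))
    random.Random(seed).shuffle(pi)
    return pi

def turbo_encode(source, pi):
    """Concatenate the source bits with the parity streams of C1 and C2."""
    source = list(source)
    permuted = [source[i] for i in pi]   # reorder the source bits for C2
    p1 = parity_stream(source)           # constituent code C1
    p2 = parity_stream(permuted)         # constituent code C2
    return source + p1 + p2              # assumption: source bits sent once
```

 With these choices the rate is $1/3$, and the overall code is linear,
 since permuting bits and forming parities are both linear operations.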
\subsection{Block codes}
\begin{description}
\item[Gallager's low-density parity-check codes\puncspace]
 The%
\amarginfig{c}{
\[
\raisebox{0.425in}{ \bH \hspace{0.02in} =}\hspace{-0.1in}
\psfig{figure=MNCfigs/12.4.3.111/A.ps,angle=-90,width=1.5in,height=1in}
\]
\begin{center}
 \mbox{
\psfig{figure=/home/mackay/itp/figs/gallager/16.12.ps,width=2in,angle=-90}
}\end{center}
\caption[a]{A low-density parity-check matrix
 and the corresponding graph of  a rate-\dfrac{1}{4}
 low-density parity-check code with
% $(j,k) = (3,4)$, 
 blocklength $N \eq 16$, and $M \eq 12$ constraints.
 Each white circle represents a transmitted bit. Each bit
 participates in $j=3$ constraints, represented by
 \plusnode\ squares. Each
% \plusnode\
 constraint forces the
 sum of the $k=4$ bits to which it is connected to
 be even.
 This code is a $(16,4)$ 
 code.  Outstanding performance is obtained when 
 the blocklength is increased to 
 $N \simeq 10\,000$.
}
\label{fig.ldpccIntro}
}
 best  block codes known for Gaussian channels  
 were invented by Gallager\index{Gallager, Robert} in 
 1962 but were promptly forgotten by most of the coding theory community. 
% by MacKay and Neal,
 They were rediscovered in 1995\nocite{mncEL,wiberg:phd}\index{Wiberg, Niclas}\index{MacKay, David}\index{error-correcting code!low-density parity-check}\index{Neal, Radford}
 and shown to  have outstanding theoretical and practical properties.\index{error-correcting code!practical}
 Like turbo codes, they are decoded by message-passing algorithms.

 We will discuss these beautifully simple codes  in Chapter
% \ref{ch.belief.propagation} and
 \ref{ch.gallager}.
\end{description}
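
 A parity-check matrix with the same shape as the one in
 \figref{fig.ldpccIntro} can be built in a few lines. The sketch below
 stacks $j=3$ bands of rows, each band being a column permutation of the
 first, so that every row has weight $k=4$ and every column has weight
 $j=3$. It is an illustration under our own naming and does not reproduce
 the particular matrix in the figure; the number of source bits of the
 resulting code is at least $N-M$, and may be larger if some rows are
 linearly dependent.

```python
import random

def gallager_parity_check(N=16, j=3, k=4, seed=0):
    """Build an M x N binary matrix with row weight k and column weight j."""
    assert N % k == 0
    rng = random.Random(seed)
    rows_per_band = N // k
    H = []
    for band in range(j):
        cols = list(range(N))
        if band > 0:
            rng.shuffle(cols)  # each later band is a column permutation
        for r in range(rows_per_band):
            row = [0] * N
            for c in cols[r * k:(r + 1) * k]:
                row[c] = 1
            H.append(row)
    return H

def syndrome(H, t):
    """One bit per constraint: 0 if the sum of its k bits is even, 1 if odd."""
    return [sum(h * x for h, x in zip(row, t)) % 2 for row in H]
```

 Since each bit participates in exactly three constraints, flipping any
 single bit of a codeword violates exactly three of the $M=12$ checks.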

 The performances of the above codes are compared for Gaussian 
 channels in \figref{fig:GCResults}, \pref{fig:GCResults}.%{fig.gl.gc}.
% the Galileo code and
% Only  the Galileo code and turbo codes outperform the original
% regular, binary Gallager codes. 
% The best known Gallager codes, which are irregular,
%% and non-binary,  
% outperform the Galileo code and turbo codes too \cite{DaveyMacKay96,Richardson2001b}.


\section{Summary}
\begin{description}
\item[Random codes] are good, but they require exponential resources to encode
 and decode them. 

\item[Non-random codes] tend for the most part not to be as good as 
 random codes. For a non-random code,  encoding  may be easy, but even for
 simply-defined linear codes, the decoding problem remains very difficult. 

\item[The best practical codes]
%\ben
%\item 
(a)
        employ very large block sizes; (b) 
% \item
         are based on semi-random code constructions; 
 and (c) 
%\item 
        make use of probability-based decoding algorithms.
%  \een
\end{description}

\section{Nonlinear codes}
 Most practically used codes are linear, but not all.\index{error-correcting code!nonlinear}\index{nonlinear code}
 Digital soundtracks are encoded onto cinema film as
  a binary  pattern. The likely errors affecting the
 film  involve
 dirt and scratches, which produce large numbers of {\tt{1}}s
 and {\tt{0}}s respectively.  We  want none
 of the  codewords to look like all-{\tt{1}}s or all-{\tt{0}}s,
 so that it will be
 easy to detect errors caused by dirt and scratches.
 One of the codes used in \ind{digital cinema}\index{cinema} \ind{sound} systems is
 a nonlinear $(8,6)$ code consisting of 64 of  the ${{8}\choose{4}}$
 binary patterns of weight 4.
% That's 70 patterns. Pick  64.
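
 The counting here is quickly confirmed: there are
 ${{8}\choose{4}} = 70$ patterns of weight 4, enough to supply the
 $2^6 = 64$ codewords. The sketch below, with function names of our own,
 enumerates them. Any such selection is nonlinear, because a linear code
 must contain the all-{\tt{0}}s word, which has weight 0.

```python
from itertools import combinations

def weight4_patterns(n=8, w=4):
    """All binary patterns of length n with exactly w ones."""
    return [tuple(1 if i in ones else 0 for i in range(n))
            for ones in combinations(range(n), w)]

def xor(a, b):
    """Bitwise sum modulo 2 of two patterns."""
    return tuple(x ^ y for x, y in zip(a, b))
```

 The set is also not closed under addition modulo 2: the sum of the
 patterns {\tt 11110000} and {\tt 11101000} is {\tt 00011000},
 which has weight 2.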

\section{Errors other than noise}
 Another source of uncertainty for the receiver
 is uncertainty about the {\em{\ind{timing}}\/} of the transmitted signal $x(t)$.
 In ordinary coding theory and information theory,
 the transmitter's time $t$ and the receiver's time
 $u$ are assumed to be perfectly synchronized.
% If a bit sequence is encoded by a simple signal
% $x(t) \in \pm 1$, information is easily conveyed if
% the transmitter and the receiver both know the same
% time $t$;
 But if the receiver receives a signal $y(u)$,
 where  the receiver's time, $u$, is an imperfectly
 known function $u(t)$
 of the transmitter's time
 $t$, then the capacity of this channel for communication
 is reduced. The theory of 
 such channels is incomplete, compared
 with the
% ordinary
% `normal'
 synchronized channels\index{insertions}\index{deletions}
 we have discussed thus far. Not even
 the  {\em capacity\/} of channels with \ind{synchronization errors}\index{capacity!channel with synchronization errors}
 is known \cite{Levenshtein66,Ferreira97};
%
% ear recommends citing zigangirov69  ullman67
%
 codes for reliable communication over channels
 with synchronization errors remain an active research area
 \cite{DaveyMacKay99b}. 
% ear recommends citing  ratzer2003

\subsection*{Further reading}
 For a review of the history of spread-spectrum\index{spread spectrum} methods, see
 \citeasnoun{Scholtz82}.


\section{Exercises}
\subsection{The Gaussian channel}
\exercissxB{2}{ex.gcCb}{
 Consider a Gaussian channel with a real input $x$, and signal-to-noise
 ratio $v/\sigma^2$. 
\ben
\item
 What is its capacity $C$? 
\item
 If the input is constrained to be binary, $x \in \{ \pm \sqrt{v} \}$, 
 what is the capacity $C'$ of this constrained channel?
\item
 If in addition the output of the channel is thresholded using the 
 mapping 
\beq
        y \rightarrow y' = \left\{ \begin{array}{cc} 1 & y > 0 \\
                                                0 & y \leq 0, \end{array}
\right. 
\eeq
 what is the capacity $C''$ of the resulting channel?
\item
 Plot the three capacities above as a function of $v/\sigma^2$
 from 0.1 to 2. [You'll need to do a numerical integral to
 evaluate $C'$.]
\een
}
\exercisaxB{3}{ex.codeslinear}{
 For large integers $K$ and $N$,
 what fraction of all binary error-correcting codes of length $N$
 and rate $R=K/N$ are {\em{linear}\/} codes?
 [The answer will depend on whether you choose to define the
 code to be an {\em{ordered}\/} list of $2^K$ codewords,
 that is, a mapping from $s \in \{1,2,\ldots,2^K\}$ to $\bx^{(s)}$,
 or to define the code to be an unordered list, so that
 two codes consisting of the same codewords are identical.
 Use the latter definition: a code\index{error-correcting code} is a set of
 codewords; how the encoder operates is not part of the
 definition of the code.]
}
% that have not already been covered.
\subsection{Erasure channels}
\exercisxB{4}{ex.beccode}{
 Design a code for the binary erasure channel, and a decoding
 algorithm, and evaluate their probability of error.
 [The design of good codes for erasure channels\index{erasure-correction}\index{channel!erasure}
 is an active research area
 \cite{spielman-96,LubyDF}; see also \chref{chdfountain}.]
% Have fun!]
%
}
\exercisaxB{5}{ex.qeccode}{
 Design a code for the $q$-ary erasure channel,\index{erasure-correction}
 whose input $x$ is drawn from $0,1,2,3,\ldots,(q-1)$,
 and whose output $y$ is equal to $x$ with probability $(1-f)$
 and equal to {\tt{?}} otherwise.
 [This erasure channel is a good model for \ind{packet}s
 transmitted over the \ind{internet}, which are either received reliably
 or are lost.]
}
\exercissxC{3}{ex.raid}{
 How do redundant arrays of independent disks (RAID) work?\marginpar{%
\small\raggedright{%
% aside
 [Some people say RAID stands for `redundant array of inexpensive disks',
 but I think that's silly -- RAID would still be a good idea\index{RAID}\index{redundant array of independent disks}
 even if the disks were expensive!]
% end aside
}}
 These are  information storage systems consisting of about\index{erasure-correction} 
 ten \disc{} drives,\index{disk drive} of which any two or three can be disabled and the others
 are still able to reconstruct any requested file.\index{file storage} 
 What codes are used, and how far are these systems from the Shannon 
 limit for the problem they are solving? How would {\em you\/} design
 a better RAID system?
%
 Some information is provided in the solution section.
 See {\tt http://{\breakhere}www.{\breakhere}acnc.{\breakhere}com/{\breakhere}raid2.html}; see also \chref{chdfountain}.
%  and {\tt http://www.digitalfountain.com/} for more.

}
%%%%\input{tex/_e7.tex}




\dvips
\section{Solutions}% to Chapter \protect\ref{ch.ecc}'s exercises} % 
% ex 89
\soln{ex.gcoptens}{
% \subsection{Maximization}
 Introduce a Lagrange multiplier $\l$ for the power constraint and another,
 $\mu$, for the constraint of normalization of $P(x)$. 
\beqan
        F &\eq & \I(X;Y) - 
        { \l \textstyle \int \d x \, P(x)  x^2 -  \mu  \textstyle \int \d x \, P(x) }
\\ &\eq   &
        \int \! \d x  \,
         P(x) \left[ \int \!  \d y \, P(y\given x) \ln \frac{P(y\given x)}{P(y)} 
        - \l  x^2 - \mu \right] .
\eeqan
 Take the functional derivative with respect to 
 $P(x^*)$.
\beqan
        \frac{\delta F}{\delta P(x^*)} &=& 
         \int \!  \d y \, P(y\given x^*) \ln \frac{P(y\given x^*)}{P(y)} 
        - \l  {x^*}^2 - \mu 
 \nonumber \\ &&
        -       \int \! \d x  \:  P(x) 
         \int  \! \d y \: P(y\given x)  \frac{1}{P(y)} \frac{\delta P(y)}{\delta P(x^*)} . \hspace{0.5cm}
\eeqan
 The final factor $\delta P(y)/\delta P(x^*)$ is found, using $P(y) =
 \int \! \d x \, P(x) P(y\given x)$, to be $P(y\given x^*)$, and the whole of the
 last term collapses in a puff of smoke to 1, which can be absorbed into the 
 $\mu$ term.

% We now substitute 
 Substitute 
 $P(y\given x) = \exp( -(y-x)^2/2 \sigma^2) / \sqrt{2 \pi \sigma^2}$
 and set the derivative to zero:
\beq
        \int \!  \d y \, P(y\given x) \ln \frac{P(y\given x)}{P(y)} - \l  x^2 - \mu'  = 0 
\eeq
\beq
\Rightarrow
          \int \!  \d y \, 
        \frac{\exp( -(y-x)^2/2 \sigma^2)}{\sqrt{2 \pi \sigma^2} }
          \ln \left[ P(y) \sigma \right] = - \l  x^2 - \mu' -  \frac{1}{2}  .
\label{eq.theconstr}
\eeq
 This condition must 
 be satisfied by $\ln \! \left[ P(y) \sigma \right]$ for all $x$. 
 
 Writing a Taylor expansion of $\ln \! \left[ P(y) \sigma \right]
  = a + b y + c y^2 + \cdots$,  only  a quadratic function 
 $\ln  \! \left[ P(y) \sigma \right]
 = a + c y^2$ would satisfy the constraint (\ref{eq.theconstr}).
 (Any higher order terms $y^p$, $p>2$, would produce 
 terms in $x^p$ that are not present on the right-hand side.)
 Therefore $P(y)$ is Gaussian. We can obtain this optimal output distribution 
 by using a Gaussian input distribution $P(x)$.
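
 As a check, here is a sketch of why a quadratic form works.
 Write $\ln \! \left[ P(y) \sigma \right] = a + b y + c y^2$.
 Under $P(y\given x)$, $y$ has mean $x$ and variance $\sigma^2$, so
\beq
        \int \! \d y \, P(y\given x) \left( a + b y + c y^2 \right)
        = a + b x + c \left( x^2 + \sigma^2 \right) .
\eeq
 Matching powers of $x$ with the right-hand side of (\ref{eq.theconstr})
 gives $b = 0$, $c = - \l$, and $a - \l \sigma^2 = - \mu' - \frac{1}{2}$.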
% \footnote{Note in passing that 
% the Gaussian is the probability distribution that has maximum 
% pseudo-entropy 


}
\soln{ex.gcC}{
 Given a Gaussian input distribution of variance $v$, the 
 output distribution is $\Normal(0,v\!+\!\sigma^2)$, since
 $x$ and the noise are independent random variables, 
 and variances add for independent random variables.
 The mutual information is:
\beqan
\!\!\!\!\!\!\!\!\!\!       \I(X;Y)& =&  \!\! \int \!  \d x \, \d y \:  
                P(x)P(y\given x) \log {P(y\given x)}
 - \int  \! \d y \:  
                P(y) \log {P(y)}  \\
&=&
 \frac{1}{2} \log \frac{1}{\sigma^2} - \frac{1}{2} \log \frac{1}{v+\sigma^2} \\
&=&
% \frac{1}{2} \log \frac{v+\sigma^2}{\sigma^2} =
 \frac{1}{2} \log \left( 1 + \frac{v}{\sigma^2}
        \right) .
\eeqan
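 The step from the first line to the second uses the differential
 entropy of a Gaussian. As a sketch of that intermediate step (the symbol $w$
 is introduced here only for illustration): a Gaussian density $P(y)$ of variance $w$
 has $- \int \! \d y \, P(y) \log P(y) = \frac{1}{2} \log ( 2 \pi e w )$, so
\beq
        \I(X;Y) = - \frac{1}{2} \log \left( 2 \pi e \sigma^2 \right)
                + \frac{1}{2} \log \left( 2 \pi e ( v + \sigma^2 ) \right) ,
\eeq
 and the factors of $2 \pi e$ cancel.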
}
\soln{ex.interleaving.dumb}{
 The capacity of the channel is one minus the information content
 of the noise that it adds. That information content is, per chunk, 
 the entropy of the selection of whether the chunk is bursty,
 $H_2(b)$, plus, with probability $b$, the entropy of the flipped bits, $N$, 
 which adds up to $H_2(b) + Nb$ per chunk (roughly; accurate if $N$ is large).
 So, per bit, the capacity is, for $N=100$,
 \beq
 C = 1 - \left( \frac{1}{N} H_2(b) + b \right) = 1 -  0.207  = 0.793 .
 \eeq
 In contrast, interleaving, which treats bursts of\index{sermon!interleaving}
 errors as independent, causes the channel to be treated as
 a binary symmetric channel with $f= 0.2 \times 0.5 = 0.1$, whose capacity is about 0.53.

 Interleaving throws away the useful information about the
 correlatedness of the errors. 
 Theoretically, we should be able to communicate about
 $(0.79/0.53) \simeq 1.5$ times
 faster using a code and decoder that explicitly treat bursts as bursts.
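
 As a check on the arithmetic, taking $b = 0.2$ (the value implied by
 $f = 0.2 \times 0.5$) and $N = 100$, with logarithms to base 2:
\beqan
        H_2(0.2) &=& 0.2 \log_2 \frac{1}{0.2} + 0.8 \log_2 \frac{1}{0.8}
                \simeq 0.464 + 0.258 = 0.722 , \\
        \frac{1}{N} H_2(b) + b &\simeq& 0.007 + 0.2 = 0.207 , \\
        H_2(0.1) &\simeq& 0.332 + 0.137 = 0.469 .
\eeqan
 So the burst-aware capacity is about $0.793$, and the
 interleaved binary symmetric channel has capacity $1 - 0.469 \simeq 0.531$.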

}
% ex 91
\soln{ex.gcCb}{
\ben
\item
   Putting together the results of exercises \ref{ex.gcoptens} and \ref{ex.gcC},
 we deduce that
 a Gaussian channel with real input $x$, and signal-to-noise
 ratio $v/\sigma^2$ has capacity  
\beq
 C = \frac{1}{2} \log \left( 1 + \frac{v}{\sigma^2}
        \right) .
\label{eq.unconstrained.cap}
\eeq
\item
 If the input is constrained to  be binary, $x \in \{ \pm \sqrt{v} \}$, 
 the capacity is achieved by using these two inputs 
 with equal probability.
 The capacity is reduced to  a somewhat messy integral,
\beq
C' = 
\int_{-\infty}^{\infty}
 \d y \, N(y;0) \log N(y;0)
%\nonumber \\
%& &
 -
\int_{-\infty}^{\infty}
 \d y \,
 P(y)  \log P(y)  ,
\eeq
 where $N(y;x) \equiv (1/\sqrt{2 \pi}) \exp [ - ( y-x)^2/2 ]$,
 $x\equiv \sqrt{v}/ \sigma$,
 and  $P(y) \equiv [ N(y;x)+N(y;-x) ]/2$.
 This capacity is smaller than
 the unconstrained capacity (\ref{eq.unconstrained.cap}),
 but for small signal-to-noise ratio, the two capacities are close
 in value.
\item
 If the output is thresholded, then the Gaussian channel is turned
 into a binary symmetric channel whose transition probability
 is given by the error function $\erf$
 defined on page \pageref{sec.erf}. The capacity is
%%%%%%%
\marginfig{%
\begin{center}
\psfig{figure=/home/mackay/_doc/code/brendan/gc.ps,width=1.85in,angle=-90}
\mbox{\psfig{figure=/home/mackay/_doc/code/brendan/gc.l.ps,width=1.85in,angle=-90}}\\[-0.05in]
\end{center}
%
\caption[a]{Capacities (from top to bottom in each graph)
 $C$, $C'$, and $C''$,
 versus the signal-to-noise ratio $(\sqrt{v}/\sigma)$.
 The lower graph is a log--log plot.}
}
%%%%%%%%
\beq
	C'' = 1 - H_2( f ), \mbox{ where $f= \erf(\sqrt{v}/\sigma)$} .
\eeq
%\item
% The capacities are plotted in the margin.
\een
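
 As a rough numerical illustration (a sketch, assuming logarithms to base 2
 and taking $f$ to be the probability that unit-variance Gaussian noise
 exceeds $\sqrt{v}/\sigma$), at $v/\sigma^2 = 1$ we have
\beq
        C = \frac{1}{2} \log_2 2 = 0.5 , \:\:\:\:
        f \simeq 0.159 , \:\:\:\:
        H_2(f) \simeq 0.631 , \:\:\:\:
        C'' \simeq 0.369 ,
\eeq
 so binary inputs with a thresholded output achieve roughly three quarters
 of the unconstrained capacity at this signal-to-noise ratio.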
}
%\soln{ex.beccode}{
% The design of good codes for erasure channels\index{erasure-correction}
% is an active research area
% \cite{spielman-96,LubyDF}. Have fun!
%}
% RAID
\soln{ex.raid}{
 There are several RAID systems. One of the easiest 
 to understand consists of 7 \disc{} drives which store data\index{erasure-correction} 
 at rate $4/7$ using a $(7,4)$ \ind{Hamming code}: each successive\index{RAID}\index{redundant array of independent disks} 
 four bits are encoded with the code and the seven codeword
 bits are written one to each disk. Two or perhaps
 three disk drives
 can go down and the others can recover the data. The 
 effective channel model here is a binary erasure channel, 
 because it is assumed that we can tell when a disk is 
 dead. 

 It is not
 possible to recover the data for {\em some\/} choices 
 of the  three dead disk drives; can you see why?
}
\exercissxB{2}{ex.raid3}{
 Give an example of three \disc{} drives that, if lost, lead 
 to failure of the above RAID system, and three that can 
 be lost without failure.
}
\soln{ex.raid3}{
 The $(7,4)$ Hamming code has codewords of weight 3. If any set of 
 three \disc{} drives\index{erasure-correction} corresponding to one of those codewords
 is lost, then the other four disks can only recover 3 bits
 of information about the four source bits; a fourth bit is lost. 
 [\cf\ \exerciseref{ex.qeccodeperfect} with $q=2$: there are
 no binary MDS codes. This deficit is discussed further in
 \secref{sec.RAIDII}.]

 Any other set of three disk drives can be lost without 
 problems because the corresponding four by four submatrix
 of the generator matrix is invertible.
% The simplest 
% example of a recoverable failure is when the three parity 
% drives (5,6,7) go down.
 A better code would be the  digital fountain
 -- see \chref{chdfountain}.
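
 As a concrete instance of the unrecoverable and recoverable cases
 (a sketch, assuming the $(7,4)$ Hamming code in the form with
 $t_i = s_i$ for $i \leq 4$ and parity bits
 $t_5 = s_1 + s_2 + s_3$, $t_6 = s_2 + s_3 + s_4$, $t_7 = s_1 + s_3 + s_4$,
 all modulo 2): the source block {\tt 1000} is encoded as the weight-3
 codeword {\tt 1000101}, so $s_1$ appears only on disks 1, 5 and 7.
 If those three disks are lost, the surviving disks hold
\beq
        t_2 = s_2 , \:\:\: t_3 = s_3 , \:\:\: t_4 = s_4 , \:\:\:
        t_6 = s_2 + s_3 + s_4 ,
\eeq
 which say nothing about $s_1$. If instead the three parity disks
 5, 6 and 7 are lost, disks 1--4 hold the source bits directly.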
% \cite{LubyDF},\footnote{{\tt http://www.digitalfountain.com/}}
}



\dvipsb{solutions real channels s7}
%%%%%%% was a chapter on further exercises here once!
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%      PART           %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\renewcommand{\partfigure}{\poincare{8.2}}
\part{Further Topics in Information Theory}
\prechapter{About             Chapter}
 In Chapters \ref{ch1}--\ref{ch7}, we
 concentrated on two aspects of information theory
 and coding theory: source coding -- the compression 
 of information so as to make efficient use of data transmission 
 and storage channels; and channel coding -- the redundant 
 encoding of information so as to be able to detect and correct 
 \ind{communication} errors. 

 In both these areas we started by ignoring practical
 considerations, concentrating on the  question 
 of the theoretical limitations and possibilities of coding. 
 We then discussed practical source-coding and channel-coding 
 schemes, shifting the emphasis towards  computational 
 feasibility. But the prime criterion for comparing encoding 
 schemes remained the efficiency of the code in terms of 
 the channel resources it required: the best source codes 
 were those that achieved the greatest compression; the best channel 
 codes were those that communicated at the highest rate with a given
 probability of error. 

 In this chapter we now shift our viewpoint a little, thinking of
 {\em ease of information retrieval\/} as a primary goal. It turns out that 
 the random codes\index{random code} which were theoretically useful in our 
 study of channel coding are also useful for rapid information 
 retrieval.

 Efficient information retrieval is one of the problems that brains seem
 to solve effortlessly, and
 \ind{content-addressable memory}\index{memory!content-addressable} is one of the
 topics we will study when we look at neural networks.



\medskip


%\chapter{Hash codes: codes for efficient information retrieval}
\ENDprechapter
\chapter{Hash Codes: Codes for Efficient Information Retrieval \nonexaminable}
% 9
\label{ch.hash}
% \chapter{Hash codes: codes for efficient information retrieval}
% \input{tex/_lhash.tex}
%
% prerequisites -- the birthday problem questions
% postreqs: hopfield nets
%
% exercises also in _e8.tex AND _e7.tex, solns in _shash and _se8
% _e8 has ones relevant to hashes
%
% \label{ch.hash}
%
% SUGGESTION:
%
% include an illustrative example at start.
% add a diagram showing buckets, memory....

\newcommand{\hashS}{S}
\newcommand{\hashs}{s}
\newcommand{\hashN}{N}
\newcommand{\hashT}{T}
% \newcommand{\hashn}{n}
\section{The information-retrieval problem}
 A simple example of an
 \index{information retrieval}{information-retrieval}\index{hash code}\ 
%\index{code!hash}
 problem is the task of 
 implementing a \ind{phone directory}\index{telephone directory} service, which, in response to a 
 person's {\dem name}, returns (a) a confirmation that that person 
 is listed in the directory; and (b) the person's {phone number} and other 
 details. 
 We could formalize this problem as follows, with $\hashS$ being the
 number of names that must be stored in the \ind{directory}.
\marginfig{\small
\begin{tabular}{@{}p{1.20in}l} \toprule
\parbox[t]{1.2in}{\small string length} & $N \simeq 200$ \\
\parbox[t]{1.2in}{\small\raggedright number of strings} & $S \simeq 2^{23}$ \\
\parbox[t]{1.2in}{\small\raggedright number of possible} & $2^N \simeq 2^{200}$ \\
\parbox[t]{1.2in}{\small\raggedright \hspace{0.2in} strings} & \\
\bottomrule
% WOULD love this paragraph to be indented differently
% HELP
\end{tabular}
\caption[a]{Cast of characters.}
}

% Imagine that y
 You are given a list of $\hashS$ binary strings of length
 $\hashN$ bits, $\{\bx^{(1)}, \ldots, \bx^{(\hashS)}\}$, where
 $\hashS$ is considerably
 smaller than the total number of possible strings, $2^\hashN$.  We will call 
 the superscript `$\hashs$' in $\bx^{(\hashs)}$ the {\dem record number\/} of the string.
 The idea is that $\hashs$ runs over customers in the order in which they are
  added to the directory and $\bx^{(\hashs)}$ is the name of customer $\hashs$. We assume 
 for simplicity that all people have names of the same length.
 The name length might be, say,
 $\hashN = 200$ bits, and
 we might want to store the details of
 ten million customers, so $\hashS \simeq 10^7 \simeq 2^{23}$. We will ignore the possibility that two 
 customers have identical names.

 The task is to construct the inverse of the mapping from $s$ to 
 $\bx^{(\hashs)}$, \ie, to make a system that, given  a string $\bx$,
% with an unknown record number, will
 returns the value of $\hashs$ such that 
 $\bx = \bx^{(\hashs)}$ if one exists, and otherwise  reports that no such 
 $\hashs$ exists. (Once we have the record number, we can go and look in 
 memory location $\hashs$ in a separate  memory full of
 phone numbers to find the required
 number.)
 The aim, when solving this task, is to
% is system should
 use minimal computational resources 
 in terms of the amount of memory used to store the inverse
 mapping from $\bx$ to 
 $\hashs$ and the amount of time to compute the inverse
 mapping. And, preferably, the inverse mapping should be implemented
 in such a way  that
 further new strings can be added to the directory
 in a small amount of computer time too.\index{content-addressable memory}

%
% add picture to show lookup table
%
\subsection{Some standard solutions}
\label{sec.simplehash}
 The simplest and dumbest solutions to the information-retrieval problem 
 are a look-up table and a raw list.
\begin{description}
\item[The look-up table] is a piece of memory of size $2^N \log_2 \hashS$, 
 $\log_2 \hashS$ being the amount of memory required to store an integer 
 between 1 and $\hashS$.  In each of the $2^N$ locations, we put a zero, except
 for the locations $\bx$ that correspond to strings $\bx^{(\hashs)}$, 
 into which we write the value of $\hashs$. 

 The look-up table is a simple and quick solution, but only if  there
 is sufficient memory for the table, and if the cost of
 looking up entries in memory is independent of the memory size.
  But in our definition of the task, we assumed that $N$ is
% sufficiently large 
 about 200 bits or more, so the amount of memory required would be
 of size $2^{200}$; 
 this solution is completely out of the question.  Bear in mind that
 the number of particles in the solar system is only about $2^{190}$. 
% particles in the known universe is 
\item[The raw list] 
 is a simple list of ordered pairs $(\hashs, \bx^{(\hashs)} )$ ordered by the value of 
 $\hashs$. The mapping from $\bx$ to $\hashs$ is achieved by searching through 
 the list of strings, starting from the top, and comparing the incoming 
 string $\bx$
 with each record $\bx^{(\hashs)}$ until a 
 match is found.  This system  is very easy to
 maintain, and uses a small amount of memory, about $\hashS \hashN$ bits, 
 but is rather slow to use, since on average five million pairwise
 comparisons will be made. 
\end{description}
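
 To put numbers on the memory used by these two solutions, using the illustrative
 values $\hashN = 200$ and $\hashS \simeq 2^{23}$ from the margin table:
\beqan
 \mbox{look-up table:} & & 2^{\hashN} \log_2 \hashS \simeq 23 \times 2^{200} \mbox{ bits} , \\
 \mbox{raw list:} & & \hashS \hashN \simeq 2^{23} \times 200 \simeq 1.7 \times 10^{9} \mbox{ bits} ,
\eeqan
 that is, about 200 megabytes for the raw list.
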
\exercissxB{2}{ex.meanhash}{
 Show that the average time taken 
 to find the required string in a raw list, assuming that the original names 
 were chosen at random,  is about $\hashS + N$ binary comparisons. 
 (Note 
 that you don't have to compare the whole string of length $N$,
 since a comparison can be terminated as soon as a mismatch occurs; 
 show that you need on average two binary comparisons per incorrect
 string match.)
 Compare this with the worst-case search time
 -- assuming that the devil chooses 
 the set of strings and the search key.
}
  The standard way in which phone directories are made improves
 on the look-up table and the raw list by using 
 an {\dem{{alphabetically-ordered list}}}\index{alphabetical ordering}. 
\begin{description}
\item[Alphabetical list\puncspace] 
 The strings  $\{ \bx^{(\hashs)} \}$ 
% $...$
 are sorted into alphabetical order. Searching for 
 an entry now usually takes less time than was needed for the raw list because 
 we can  take advantage of the sortedness; for example, we can open 
 the phonebook at its middle page,  and compare the 
 name we find there with the target string; if the target is `greater'
 than the middle string then we know that the required string, if 
 it exists, will be found in the second half of the alphabetical directory.
 Otherwise, we look in the first half.
 By iterating this splitting-in-the-middle procedure,
 we can identify the target string, or establish that the string is not
 listed, in $\lceil \log_2 \hashS \rceil$ string comparisons. The expected 
 number of binary comparisons per string comparison
 will tend to increase as the search 
 progresses,
%, because the leading bits of the two strings involved 
% in the comparison are expected to become similar; but by  being smart 
% and keeping track of which leading bits we have looked at 
% already in previous searches, it seems plausible that 
% we can reduce the number of binary 
% operations to about $\lceil \log_2 \hashS \rceil + N$ binary comparisons.
 but the total number of binary comparisons required will be
 no greater than  $\lceil \log_2 \hashS \rceil  N$.
 
 The amount of memory required is the same as that required for the raw list.
 
 Adding new strings to the database requires that we insert them in the 
 correct location in the list. To find that location takes about 
 $\lceil \log_2 \hashS \rceil$ string comparisons.
%Then shuffling along all 
% of the subsequent entries in the directory to make space for the 
% new entry may take some computer time, depending on how the memory works.
\end{description}

 Can we improve on the well-established  alphabetized list?
 Let us consider our task from some new viewpoints.
% for a moment and think of other ways of viewing  it. 

 The task is to construct a mapping $\bx \rightarrow \hashs$ from $N$ bits
% ($\bx$)
 to $\log_2 \hashS$ bits.
% ($\hashs$).
%
% what does this mean?
%
 This is a pseudo-invertible mapping, since for any $\bx$ 
 that maps to a non-zero $\hashs$, the customer database contains the 
 pair $(\hashs , \bx^{(\hashs)})$ that takes us back.  Where have we come 
 across the idea of mapping from $N$ bits to $M$ bits before?

 We encountered  this idea twice: first,
 in source coding, we studied block codes which were mappings 
 from strings of $N$ symbols to a selection of one label in a list.
% $...$.
 The task of information retrieval is similar
% pretty much identical
 to the task 
 (which we never actually solved) of making  an encoder for a
 typical-set compression code.

 The second time that we mapped bit strings to bit strings of another dimensionality 
 was  when we studied channel codes. There, we considered  codes 
 that mapped from $K$ bits to $N$ bits, with $N$ greater than $K$,
 and we made theoretical progress using {\em random\/} codes. 

 In hash codes, we put together these two notions. 
 We will study {random codes that map from $N$ bits to $M$ bits where 
 $M$ is {\em smaller\/} than $N$}.\index{random code}

% Another strand: the dumb look-up table would be really nice, very quick, 
% the only problem is it requires too much memory. But there are so 
% few vectors, what if we project them down into a lower-dimensional 
% space? A few will collide, but if they are mainly distinct then 
% we can just implement the look-up table in a lower dimensional 
% space.

 The idea is that we will map the original high-dimensional space
 down into a lower-dimensional space, one in which it is feasible
 to implement the dumb look-up table method which we rejected a
 moment ago.

\marginfig{\small
\begin{tabular}{@{}p{1.2in}l} \toprule
\parbox[t]{1.2in}{\small string length} & $N \simeq 200$ \\
\parbox[t]{1.2in}{\small number of strings} & $S \,\simeq 2^{23}$ \\
\parbox[t]{1.2in}{\small size of hash function} & $M \simeq 30\ubits$ \\[0.01in]
\parbox[t]{1.2in}{\small size of hash table} & $T = 2^M $\\
 &      $\:\:\:\:\: \simeq 2^{30}$ \\ \bottomrule
% HELP the spacing between successive rows
% is smaller than the spacing between lines!! :-(
% HELP
\end{tabular}
\caption[a]{Revised cast of characters.}
}


\section{Hash codes}
 First we will describe how a hash code works, then we will study the
 properties of idealized hash codes. 
 A hash code implements a solution to the information-retrieval problem, 
 that is, a mapping from $\bx$ to $s$, with the help of a pseudo-random 
 function called a {\dem\ind{hash function}},
 which maps  the $N$-bit string $\bx$ to an  $M$-bit string  $\bh(\bx)$, 
 where $M$ is smaller than $N$. $M$ is typically chosen 
% to be  sufficiently  small
 such that the `table size' $\hashT \simeq
 2^M$ is a little bigger than $S$ -- say, 
 ten times
% one or two orders of magnitude
 bigger.
 For example,
 if we were expecting
% $S$ a million values for $\bx$
 $S$ to be about a million, 
 we might map
%a 200-bit
 $\bx$ into a 30-bit hash $\bh$ (regardless of the size $N$ of each
 item $\bx$).
 The hash function is some fixed deterministic function which should 
 ideally be indistinguishable from a fixed random code. For practical 
 purposes, the hash function must be  quick to compute.

Two simple examples of  \ind{hash function}s are:
\begin{description}
\item[Division method\puncspace]
 The table size $\hashT$ is a prime number, preferably
 one that is not close to a power of 2. The hash value is the remainder
 when the integer $\bx$ is divided by  $\hashT$. 
\item[Variable string addition method\puncspace]
 This method assumes that $\bx$ is a string of
 bytes and that the table size  $\hashT$ is 256.
 The characters of $\bx$ are added, modulo 256.
%
% 
% http://members.xoom.com/thomasn/s_man.htm
%
%
%
% This hash function does not distinguish anagrams.
 This hash function has the defect that it maps strings
 that are  anagrams of each other onto the same hash.

 It may be improved by putting the running total through
 a fixed pseudorandom permutation
 after each  character is added.
%
%\item[
 In the\index{hash function}
 {\dem variable string exclusive-or method\/} with table size $\leq 65\,536$,
 the string is hashed twice in this way, with the initial running total
 being set to 0 and 1 respectively (\algref{alg.hashxor}).
 The result is a 16-bit hash.
\end{description}
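
 To illustrate the first two methods with small hypothetical values
 (chosen here only for illustration): in the division method with table
 size $\hashT = 101$, the integer $\bx = 1000$ has hash
\beq
        1000 \bmod 101 = 1000 - 9 \times 101 = 91 .
\eeq
 In the variable string addition method, assuming ASCII character codes,
 the anagrams {\tt abc} and {\tt cab} both give
\beq
        (97 + 98 + 99) \bmod 256 = 294 - 256 = 38 ,
\eeq
 which is the anagram collision described above.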

%
% probably a good idea to include this code stolen from Thomas Niemann
% typedef unsigned short int HashIndexType;      (changed to int)
%
\begin{algorithm}% figure}
\begin{framedalgorithmwithcaption}%
{
\caption[a]{{\tt C} code implementing the variable string exclusive-or method
 to create 
 a hash {\tt h} in the range $0\ldots 65\,535$
 from a string {\tt x}.
 Author: Thomas Niemann.}
\label{alg.hashxor}
}
\small
\begin{verbatim}
unsigned char Rand8[256];     // This array contains a random
                              //    permutation from 0..255 to 0..255 
int Hash(unsigned char *x) {  // x is a pointer to the first char; 
     int h;                   //    *x is the first character       
     unsigned char h1, h2;    					    
			      					    
     if (*x == 0) return 0;   // Special handling of empty string  
     h1 = *x; h2 = *x + 1;    // Initialize two hashes             
     x++;                     // Proceed to the next character     
     while (*x) {	      					    
         h1 = Rand8[h1 ^ *x]; // Exclusive-or with the two hashes  
         h2 = Rand8[h2 ^ *x]; //    and put through the randomizer 
         x++;		     					    
     }                        // End of string is reached when *x=0 
     h = ((int)(h1)<<8) |     // Shift h1 left 8 bits and add h2   
          (int) h2 ; 	     					    
     return h ;               // Hash is concatenation of h1 and h2
}
\end{verbatim}
% original code stored in tex/_hash.code
\end{framedalgorithmwithcaption}
\end{algorithm}% figure}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}
\figuremargin{\footnotesize
\setlength{\unitlength}{1mm}
\thinlines
\begin{picture}(100,100)(-20,-40)
\put(65,-40){\line(0,1){90}}
\put(75,-40){\line(0,1){90}}
\multiput(65,-40)(0,3){31}{\line(1,0){10}}
\newcommand{\xvector}[2]{\put(-10,#1){\framebox(40,4){$\bx^{(#2)}$}}}
\newcommand{\hvector}[2]{\put(53,#1){\makebox(5,0){$\bh(\bx^{(#2)})\rightarrow$}}}
\newcommand{\svector}[2]{\put(74.3,#1){\makebox(0,0)[r]{$#2$}}}
\newcommand{\slvector}[2]{\put(35,#1){\vector#2{10}}}
\newcommand{\xhs}[4]{\xvector{#1}{#2}\hvector{#3}{#2}\svector{#3}{#2}\slvector{#1}{#4}}
\xhs{30}{1}{18.7}{(1,-1)}
\xhs{24}{2}{45.536}{(1,2)}
\xhs{18}{3}{6.7}{(1,-1)}
\xhs{0}{s}{-20.5}{(1,-2)}
% labels
\put(39,65){\makebox(0,0){Hash}}
\put(39,62){\makebox(0,0){function}}
\put(34,59){\vector(1,0){11}}
\put(10,60){\makebox(0,0){Strings}}
\put(48,58.60){\makebox(0,0)[l]{hashes}}
\put(70,62){\makebox(0,0){Hash table}}
%
\put(10,12){\makebox(0,0){$\vdots$}}
\put(10,-8){\makebox(0,0){$\vdots$}}
% N range indication
\put(10,40){\vector(-1,0){20}}
\put(10,40){\vector(1,0){20}}
\put(10,43){\makebox(0,0){$N$ bits}}
% M range indication
\put(70,54){\vector(-1,0){5}}
\put(70,54){\vector(1,0){5}}
\put(70,57){\makebox(0,0){$M$ bits}}
% 2^M range
\put(82,5){\vector(0,-1){45}}
\put(82,5){\vector(0,1){45}}
\put(84,5){\makebox(0,0)[l]{$2^M$}}
% S range
\put(-15,10){\vector(0,1){23}}
\put(-15,10){\vector(0,-1){30}}
\put(-17,10){\makebox(0,0)[r]{$S$}}
% 
\end{picture}
}{
\caption[a]{Use of hash functions for information retrieval.
 For each string $\bx^{(s)}$,
 the hash $\bh= \bh(\bx^{(s)})$ is computed,
 and the value of $s$ is written into the
 $\bh$th row of the hash table. Blank rows in the hash table
 contain the value zero.
 The table size is $T = 2^M$.}
\label{fig.hashtable}
}
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 Having picked a hash function $\bh(\bx)$,
 we implement an
% efficient
 information retriever 
 as follows. (See \figref{fig.hashtable}.)
 \begin{description}
   \item[Encoding\puncspace]
     A piece of memory called the {\em hash table\/}
 is created of size $2^Mb$ memory units, where $b$ is the amount of memory needed to represent 
 an integer between $0$ and $\hashS$. This table is initially set to zero 
 throughout. Each memory $\bx^{(\hashs)}$ is put through 
 the hash function, and at the location in the hash table corresponding 
 to the resulting vector $\bh^{(\hashs)} = \bh( \bx^{(\hashs)} )$, the integer $\hashs$ is written -- 
 unless that entry in the hash table is already occupied, in which case 
 we have a {\em collision\/} between $\bx^{(\hashs)}$ and some earlier 
 $\bx^{(\hashs')}$ which both happen to have the same hash code. 
 Collisions can be handled in various ways -- we will discuss some
 in a moment -- but first let us complete the basic picture.

\item[Decoding\puncspace] 
 To retrieve a piece of information corresponding to a
 target vector $\bx$, we compute the hash $\bh$ of  
  $\bx$ and look at the corresponding location in the hash table. 
 If there is a zero, then we know immediately that the string $\bx$ is 
 not in the database. The cost of this answer is the cost of one hash-function
 evaluation and one look-up in the table of size $2^M$. 
 If, on the other hand, there is a non-zero entry $\hashs$ in the table, 
 there are two possibilities: either the vector $\bx$ is 
 indeed equal to $\bx^{(\hashs)}$; or the vector $\bx^{(\hashs)}$ is another 
 vector that happens to have the same hash code as the target $\bx$. (A third 
 possibility is that this
 non-zero entry might have something to do with our 
 yet-to-be-discussed collision-resolution system.)
 

 To check whether $\bx$ is 
 indeed equal to $\bx^{(\hashs)}$, we take the tentative answer $\hashs$, 
 look up   $\bx^{(\hashs)}$ in the original forward database, and compare it 
 bit by bit with $\bx$; if it matches then we report $\hashs$ as the 
 desired answer.  This successful retrieval has an overall cost of 
 one hash-function
 evaluation, one look-up in the table of size $2^M$, 
 another look-up in a table of size $\hashS$, and
% up to
 $N$ binary comparisons -- which may be much cheaper
 than the simple solutions presented in section \ref{sec.simplehash}.
 \end{description}
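
 For a rough comparison of retrieval costs, using the illustrative values
 $\hashN = 200$ and $\hashS \simeq 2^{23}$ and the counts quoted above
 (a sketch that ignores the cost of the hash-function evaluation and of the
 memory look-ups, and assumes that no collision occurs):
\beqan
 \mbox{raw list:} & & \hashS + \hashN \simeq 8 \times 10^{6} \mbox{ binary comparisons} , \\
 \mbox{alphabetical list:} & & \lceil \log_2 \hashS \rceil \hashN = 23 \times 200 = 4600
        \mbox{ at most} , \\
 \mbox{hash code:} & & \hashN = 200 \mbox{ binary comparisons} .
\eeqan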

\exercissxB{2}{ex.hash.retrieval}{
  If we have checked the first few bits of   $\bx^{(\hashs)}$  with
  $\bx$
  and found them to be equal, what is the probability  that 
 the correct entry has been retrieved, if the alternative hypothesis
 is that $\bx$ is actually not in the database? Assume that
 the original source strings are random, and the hash function is a random hash function.
 How many 
% Could have an exercise here on the number of
 binary evaluations  are
 needed to be sure with odds of a billion to one that
  the correct entry has been retrieved?
% [Note we are not assuming that the
% original strings $\{ \bx^{(\hashs)} \}$ are random; they may be
% very similar to each other. We are just assuming that the hash function
% is random.]
}

% 
% view as a kind of source 
% encoding - reduces huge redundancy, where the redundancy 
% has the form P(x) = sum_x pi_c delta(x_c)
% 
% does so using random coding.

 The hashing method of information retrieval
 can be used for strings $\bx$ of arbitrary length,
 if the hash function $\bh(\bx)$  can be applied to strings of
 any length.

\section{Collision resolution}
 We will study two ways of resolving collisions: appending in the
 table, and storing elsewhere.

 
\subsection{Appending in table}
 When encoding, if a collision occurs, we continue
 down the hash table and write the value of $s$ into the next available
 location in memory that currently contains a zero. If we reach the bottom
 of the table before encountering a zero, we continue from the top. 

 When decoding, if we compute the hash code for $\bx$ and find that
 the $s$ contained in the table doesn't point to an $\bx^{(s)}$ that
 matches the cue $\bx$, we continue down the hash table until we
 either find an $s$ whose $\bx^{(s)}$ does match the cue
% key
 $\bx$, in which case we are done, or else encounter a zero, in which
 case we know that the cue $\bx$ is not in the database.

 For this method, it is essential that the table be substantially
 bigger in size than $\hashS$.  If $2^M < \hashS$ then the encoding
 rule will become stuck with nowhere to put the last strings.
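
 As a toy illustration (with hypothetical hash values, chosen here only to
 show the rule at work), take a table with $\hashT = 8$ rows, numbered 0 to 7,
 and five strings whose hashes are $5, 2, 5, 7, 7$ in order of record number.
\begin{center}
\begin{tabular}{ccl} \toprule
 $s$ & $\bh(\bx^{(s)})$ & row written \\ \midrule
 1 & 5 & 5 \\
 2 & 2 & 2 \\
 3 & 5 & 6 (row 5 occupied) \\
 4 & 7 & 7 \\
 5 & 7 & 0 (row 7 occupied; continue from the top) \\ \bottomrule
\end{tabular}
\end{center}
 When decoding a cue $\bx$ with hash 5 that is not in the database,
 we compare $\bx$ with $\bx^{(1)}$, $\bx^{(3)}$, $\bx^{(4)}$ and $\bx^{(5)}$
 (rows 5, 6, 7 and 0) before reaching the zero in row 1.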
 
\subsection{Storing elsewhere}
 A more robust and flexible method is to use {\dem pointers\/}
 to additional pieces of memory in which collided strings are stored.
 There are many ways of doing this. As an example, we could store
 in location $\bh$ in
 the hash table a pointer (which must be distinguishable from
 a valid record number $s$) to a `bucket' where all the
 strings that have  hash code $\bh$  are stored in a 
 {\dem sorted list}.
 The encoder sorts the strings in each bucket alphabetically as the hash table and buckets
 are created.

 The decoder simply has to go and look in the relevant bucket
 and then check the short list of strings that are there by a
  brief alphabetical search.
  
% of strings that have this encoding.
  This method of storing the strings in buckets allows the option of
  making the hash table quite small, which may have practical benefits. We
  may make it so small that almost all strings are involved in collisions,
  so all buckets contain a small number of strings.
 It only takes a small number of binary comparisons to identify which
 of the strings in the bucket matches the  cue $\bx$. 
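
 A minimal Python sketch of the bucket method follows. It is an
 illustration, not a definitive implementation: the names
 {\tt build\_buckets} and {\tt contains} are hypothetical, the hash
 function is a stand-in (CRC-32 modulo the table size), and, for
 brevity, the buckets hold the strings themselves rather than record
 numbers $s$.

```python
import bisect
import zlib

def h(x, table_size):
    # Stand-in hash function (an assumption): CRC-32 reduced modulo the table size.
    return zlib.crc32(x.encode()) % table_size

def build_buckets(strings, table_size):
    # One bucket per hash code; each bucket is kept as a sorted list.
    buckets = [[] for _ in range(table_size)]
    for x in strings:
        bisect.insort(buckets[h(x, table_size)], x)
    return buckets

def contains(x, buckets):
    # Binary search within the one bucket that could hold x.
    bucket = buckets[h(x, len(buckets))]
    i = bisect.bisect_left(bucket, x)
    return i < len(bucket) and bucket[i] == x

words = ["hash", "code", "table", "bucket", "cue", "string", "pointer", "list"]
buckets = build_buckets(words, 4)   # deliberately small: collisions are certain
print(contains("cue", buckets), contains("absent", buckets))   # True False
```

 With eight strings and only four hash codes, collisions are
 unavoidable, yet each lookup still examines just one short sorted
 list.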

\section{Planning for collisions: a birthday problem}
\index{birthday}
\exercissxA{2}{ex.hash.collision}{
 If we wish to store $S$ entries using a hash function
 whose output has $M$ bits, how many collisions should we expect
 to happen, assuming that our hash function is an ideal random function?
 What size $M$ of hash table is needed if we would like
 the expected number of collisions to be smaller than 1?

 What size $M$ of hash table is needed if we would like
 the expected number of collisions to be a small fraction, say 1\%,
 of $S$?
 }
 [Notice the similarity of this problem to
 \exerciseref{ex.birthday}.]

\section{Other roles for hash codes}
\subsection{Checking arithmetic}
 \index{error detection}If you wish to check  an addition that was done by hand,
 you may find useful the method of {\dem{\ind{casting out nines}}}.\index{nines}
 In casting out nines, one finds the sum, modulo nine, of
 all the {\em digits\/} of the numbers
 to be summed and compares it with the
 sum, modulo nine, of the digits of the putative answer.
 [With a little practice, these sums can be computed
 much more rapidly than the full original addition.]
% calculation proper.]
\exampla{%???????????
% want this to have reference:     {ex.nines}{
	In the calculation shown in the margin
\marginpar{\begin{center}
\begin{tabular}[t]{r}
{\tt 189} \\
{\tt +1254} \\
{\tt + 238}  \\
\hline
{\tt 1681} \\
\end{tabular}
\end{center}}
 the sum, modulo nine, of the digits in {\tt 189+1254+238}
 is {\tt 7}, and the sum, modulo nine, of {\tt 1+6+8+1} is {\tt 7}.
 The calculation thus passes the casting-out-nines test.
}

 Casting out nines gives a simple example  of a hash function.
 For any addition expression  of the form $a+b+c+\cdots$,
 where $a, b, c, \ldots$ are decimal numbers,
 we define $h \in \{0,1,2,3,4,5,6,7,8\}$ by 
\beq
	h(a+b+c+\cdots) = \mbox{ sum modulo nine of all digits in $a,b,c,\ldots$ } ;
\eeq
 then it is a nice property of decimal arithmetic that  if
\beq
	a+b+c+\cdots = m+n+o+\cdots
\eeq
 then the hashes $h(a+b+c+\cdots)$ and $h(m+n+o+\cdots)$ are equal.
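
 The hash function $h$ is easy to state in code. The following Python
 sketch (the function names are chosen for this example, and the
 numbers are assumed to be non-negative integers) reproduces the test
 applied to the addition {\tt 189+1254+238}, and shows one error that
 the test catches and one that it misses.

```python
def digit_hash(numbers):
    # Sum, modulo nine, of all the decimal digits of all the numbers.
    return sum(int(d) for n in numbers for d in str(n)) % 9

def passes_nines_test(addends, putative_sum):
    # True if the putative answer has the same hash as the addends.
    return digit_hash(addends) == digit_hash([putative_sum])

addends = [189, 1254, 238]
print(digit_hash(addends), digit_hash([1681]))   # 7 7
print(passes_nines_test(addends, 1681))          # True
print(passes_nines_test(addends, 1691))          # False: this slip changes the hash
print(passes_nines_test(addends, 1861))          # True: a transposition goes undetected
```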

\exercissxB{1}{ex.nines.p}{
 What evidence\index{model comparison} does a correct casting-out-nines
 match give in favour of the
 hypothesis that the addition has been done correctly? 
}

\subsection{Error detection among friends}
 \index{error detection}Are two files the same?  If the files are on the same computer,
 we could just compare them bit by bit. But if the two files
 are on separate machines,
 it would be nice to have a way of confirming that two
 files are identical without having to transfer one of the
 files from A to B. [And even if we did transfer one of the files,
 we would still like a way to confirm whether it has been
 received without modifications!]

 This problem can be solved using hash codes. 
% Alice  sends a file to Bob,  and wants to do error detection.
 Let Alice and Bob be the holders of the two files; Alice sent
 the file to Bob, and they wish to confirm it has been received
 without error. 
 If Alice computes the hash
% function
 of her  file and sends it to Bob,
 and Bob computes the hash
% function
 of his file, using the
 same $M$-bit hash function, and the two hashes match, then
 Bob can deduce that the two files are almost surely the
 same.
% should have some sort of reference to digest?
% The hash of the file is often called the {\dem\ind{digest}}.
\exampl{example.hash.II}{
 What is the probability of a false negative, \ie,
 the probability, given that
 the two files do differ, that the two hashes 
% Bob  concludes
  are nevertheless identical?
}
% Solution::::::::
 If we assume that the hash function is random and that
 the 
% unrelated
 process that causes the files to differ knows nothing about the hash function,
 then the probability of a false negative is $2^{-M}$.\ENDsolution
 A 32-bit hash gives a probability of false negative of about $10^{-10}$.
% 2.3283064365387e-10
 It is common practice to use a linear hash function called
 a 32-bit cyclic redundancy check to detect errors in files.
 (A cyclic redundancy check is a set of 32 parity-check bits
 similar to the 3 parity-check bits of the $(7,4)$ Hamming code.)
%%%%%%%%% end solution

\begin{conclusionbox}
 To have a false-negative rate smaller than one in a billion,
 $M = 32$ bits is plenty, if the errors are produced by noise.
\end{conclusionbox}
 

\exercissxB{2}{ex.whyonlyCRC}{
 Such a simple parity-check code only detects errors; it doesn't help correct
 them. Since error-{\em{correcting\/}} codes exist, why not use
 one of them to get some  error-correcting 
 capability too?
}


%
% more maths requested here
% 

\subsection{Tamper detection}
 \index{security}\index{tamper detection}\index{detection of forgery}\index{forgery}What
 if the differences between the two files are not
 simply `noise', but are  introduced by an adversary,
 a clever {\dem forger\/} called 
 Fiona, who  modifies the original file to make
 a \ind{forgery}\index{cryptography!digital signatures}\index{cryptography!tamper detection}
 that purports to be \ind{Alice}'s file?
 How can Alice make a \ind{digital  signature}
 for the file so that \ind{Bob} can confirm  that
 no-one has tampered with the file?
 And how can we prevent Fiona from listening
 in on Alice's signature and attaching it to other
 files? 

 Let's assume that Alice computes a hash function for the
 file and sends it securely to Bob.
% , in the same way as for error-detection above.
 If Alice computes a simple hash function for the file
 like the linear cyclic redundancy check, and Fiona knows
 that this is the method of verifying the file's
 integrity, Fiona can  make her chosen modifications
 to the file and then easily identify (by linear
 algebra) a further 32-or-so single bits
 that, when flipped, restore the hash function
 of the file to its original value.
 {\em  Linear hash functions give  no security against
 forgers.}
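
 The linearity that Fiona exploits can be checked directly. The Python
 sketch below uses the CRC-32 provided by the standard {\tt zlib}
 module as a representative 32-bit cyclic redundancy check (an
 assumption; the text does not single out a particular check), with
 three made-up files padded to a common length. For files of equal
 length, the hash of the bitwise exclusive-or of three files equals
 the exclusive-or of their three hashes, so the change in the hash
 caused by flipping a given set of bits does not depend on the file's
 contents. The sketch demonstrates only this property; it does not
 carry out the forgery.

```python
import zlib

def xor_bytes(*blocks):
    # Bitwise exclusive-or of equal-length byte strings.
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

# Three made-up files, padded with spaces to a common length of 32 bytes.
a, b, c = (text.ljust(32) for text in
           (b"pay Bob ten pounds", b"pay Fiona 9999 pounds", b"a third file"))

lhs = zlib.crc32(xor_bytes(a, b, c))
rhs = zlib.crc32(a) ^ zlib.crc32(b) ^ zlib.crc32(c)
print(lhs == rhs)   # True
```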

 We must
 therefore require that the hash function\index{inversion of hash function} 
 be {\em hard to invert\/} so that no-one can construct a
 tampering that leaves the hash function unaffected.
 We would still like the hash function to be easy
 to compute, however, so that Bob doesn't have to
 do hours of work to verify every file he receives.
 Such a hash function -- easy to compute, but hard to invert --
 is called
 a {\dem\ind{one-way hash function}}.\index{hash function!one-way}
 Finding such functions is one of the active research areas of
 \ind{cryptography}. 
% Don't want to use an ecc, because with a linear ecc it is easy to construct
% a pair of tamperings which have the same syndrome and
% so leave the hash  unaffected.
%How can we invent a function that has the 
%property that h(x) is easy to compute, but
%it is very hard to find an x
%suxh that h(x) has a chosen value h?
%A lot of research is being done on this question
%still, and the sort of functions people use
%to make a one-way hash function are functions like:
%
%        exponentiation-modulo-M
%
%Definition: 
%        take x, and think of it as a number.
%        compute 1023^(x) modulo M,
%        where "^" means "1023 to the power x",
%        and M is some other integer, eg 97.
%
%Apparently it is hard to invert this sort of
%        function (i.e. to take the "discrete logarithm").
%
%Real one-way hash functions are more complicated than
%this, but I hope this gives the idea.
%

 A hash function that is widely used  in the free software\index{software!hash function}
 community to confirm that two files do not differ
 is {\tt\ind{MD5}}, which produces a 128-bit
 hash. The details of how it works
 are quite complicated, involving convoluted exclusive-or-ing
 and if-ing and and-ing.\footnote{{\tt http://www.freesoft.org/CIE/RFC/1321/3.htm}}
%
%  of bits with each other
%
% Cryptography is the topic of the next chapter.
%
% rsync uses MD4 with a 128-bit checksum (for files with a matching size
% and date) initially.  But (from the man entry):
%              Current  versions of rsync actually use an adaptive
%              algorithm for the checksum length by default, using
%              a  16 byte file checksum to determine if a 2nd pass
%              is required with a longer block checksum. Only  use
%              this  option  if  you have read the source code and
%              know what you are doing.
% The `md5sum' program also uses 128 bits.



 Even with a good one-way hash function, the digital signatures
 described above 
 are still vulnerable to attack, if Fiona has access to the
 hash function. Fiona could take the tampered file
 and hunt for a further tiny modification to it such that its hash
 matches the original hash of Alice's file. This would take
 some time -- on average,
 about $2^{32}$ attempts, if the hash function has
 32 bits -- but eventually Fiona would find a tampered file that
 matches the given hash. To be secure against
 forgery,  \ind{digital signature}s must either  have
 enough bits for such a  random search to take too long,
 or the \ind{hash function} itself must  be kept
 \ind{secret}. 

\begin{conclusionbox}
 Fiona has to hash $2^M$ files to cheat.
 $2^{32}$ file modifications is not very many, so a 32-bit hash function
 is not  large enough for \ind{forgery} prevention.
\end{conclusionbox}
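
 The random search can be tried out on a toy scale. In the Python
 sketch below the hash is deliberately shortened to $M=16$ bits by
 truncating SHA-256, an assumption made so that the search is quick;
 the file contents and the trick of appending a counter as the `tiny
 modification' are likewise made up for the example. For an ideal
 $M$-bit hash the number of attempts needed is expected to be of order
 $2^{M}$.

```python
import hashlib
from itertools import count

M = 16  # hash length in bits; kept tiny (an assumption) so the search is quick

def hash_M(data):
    # First M bits of SHA-256, standing in for an ideal M-bit hash function.
    return int.from_bytes(hashlib.sha256(data).digest(), "big") >> (256 - M)

original = b"I, Alice, owe Bob 10 pounds."
tampered = b"I, Alice, owe Fiona 1000 pounds."
target = hash_M(original)

# Try tiny modifications (here: an appended counter) until the hashes agree.
for attempts in count(1):
    forgery = tampered + b" #" + str(attempts).encode()
    if hash_M(forgery) == target:
        break

print(attempts, hash_M(forgery) == target)
```

 Each extra bit of hash doubles the expected length of this search,
 which is why the size of $M$ matters so much.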
% If Fiona works as


 Another person who might have a  motivation for forgery is
 Alice herself.
 For example, she might be making a bet on the outcome
 of a race, without wishing to broadcast her prediction
 publicly; a method for placing bets would be for her
 to send to Bob the bookie the hash of her bet.
 Later on, she could send Bob the details of her bet.
 Everyone can confirm that her bet is consistent with
 the previously publicized hash. [This method
 of secret publication
% shing ideas
 was used by Isaac Newton and Robert Hooke\index{Newton, Isaac}\index{Hooke, Robert}
% (1635-1703)
 when they
 wished to establish priority for scientific ideas
 without revealing them. Hooke's hash function was alphabetization
% ed latin statements,
 as illustrated by the
 conversion of {\em UT TENSIO, SIC VIS\/} into the \ind{anagram} {\tt{CEIIINOSSSTTUV}}.]
% http://www.microscopy-uk.org.uk/mag/artmar00/hooke2.html
% http://www.rod.beavon.clara.net/leonardo.htm
%  It was in his Helioscopes in 1676 that Hooke followed the popular seventeenth-century conceit of announcing a discovery in an anagram: cediinnoopsssttuu. He published its key two years later, in his most complete treatment of elasticity, in De Potentia Bestitutiva, or Of Spring. Here Hooke enunciated the original formulation of the law that bears his name: Ut Pondus sic Tensia, or 'the weight is equal to the tension'. [33] As the tension was seen as the product of an increasing series of weights in pans suspended on coiled springs, it is easy in this pre-Newtoniangravitation age to understand how Hooke spoke of the pondus, or weight, as acting on the spring. The formulation of 'Hooke's Law' with which we are more familiar today is Ut Tensia, sic Vis, or 'the tension is equal to the force'.
%
% http://www.aero.ufl.edu/~uhk/strength/strength.htm  ??? CEIIOSSOTTUU ??? CEIINOSSITTUV
% ??? ceiiinosssttvv
% all accounts differ!
% http://arc-gen1.life.uiuc.edu/Bioph354/lect19.html
 Such a protocol relies on the assumption that Alice cannot
 change her bet after the event without the hash coming out wrong.
 How big a hash function do  we need to use to ensure that
 Alice cannot cheat?
 The answer is different from the size of the hash we needed
 in order to defeat Fiona above, because Alice is the author of {\em
 both\/} files. Alice could \ind{cheat} by searching for
 two files that  have identical
 hashes to each other. For example, if
 she'd like to cheat by placing two bets for the price of
 one, she could make a large number $N_1$ of
 versions of bet one (differing from each other
 in minor details only), and a large number $N_2$ of versions of bet two,
 and hash them all. If there's a \ind{collision} between
 the hashes of two bets of different types,
 then she can submit the common hash and thus buy herself the
 option of placing either \ind{bet}.
\exampl{example.hashN1N2}{
 If the hash has $M$ bits, how big do $N_1$
 and $N_2$ need to be for Alice to have a good chance of finding
 two different bets with the same hash? 
}
% solution
 This is a \ind{birthday} problem like \exerciseref{ex.birthday}.
 If there are  $N_1$ Montagues and  $N_2$ Capulets at a party,
 and each is assigned a `birthday' of $M$ bits, 
 the  expected number of  \ind{collision}s between a Montague and a Capulet
 is
\beq
	N_1 N_2 2^{-M} ,
\eeq
 so to minimize the number of files hashed, $N_1+N_2$, Alice
 should make $N_1$ and $N_2$ equal, and will need to hash about
 $2^{M/2}$ files until she finds two that match.\ENDsolution
\begin{conclusionbox}
 Alice has to hash $2^{M/2}$ files to cheat.
 [This is the square root of the number of hashes Fiona
 had to make.]
\end{conclusionbox}
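
 Alice's search can also be tried on a toy scale. The Python sketch
 below is an illustration under stated assumptions: the hash is
 SHA-256 truncated to $M=24$ bits so that the search is quick, and the
 wording of the two bets is invented. Versions of the two bets are
 generated alternately, which keeps $N_1 = N_2$, until a version of bet
 one and a version of bet two share a hash; for an ideal hash this is
 expected to happen after roughly $2^{M/2} = 4096$ versions of each.

```python
import hashlib
from itertools import count

M = 24  # hash length in bits; small (an assumption) so the demo runs quickly

def hash_M(data):
    # First M bits of SHA-256, standing in for an ideal M-bit hash function.
    return int.from_bytes(hashlib.sha256(data).digest(), "big") >> (256 - M)

seen_one, seen_two = {}, {}   # hash -> a version of bet one / of bet two
for n in count(1):
    one = b"Bet one: horse A wins. Version " + str(n).encode()
    two = b"Bet two: horse B wins. Version " + str(n).encode()
    h1, h2 = hash_M(one), hash_M(two)
    seen_one[h1] = one
    seen_two[h2] = two
    if h1 in seen_two:        # new version of bet one matches an old bet two
        pair = (one, seen_two[h1])
        break
    if h2 in seen_one:        # new version of bet two matches an old bet one
        pair = (seen_one[h2], two)
        break

print(2 * n, "files hashed;", hash_M(pair[0]) == hash_M(pair[1]))
```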

 If Alice has the use of $C=10^6$ computers for $T=10$\,years, each computer
 taking  $t=1\,$ns to evaluate a hash,
 the bet-communication system is\index{security}
 secure against Alice's dishonesty  only if $M \gg 2 \log_2 CT/t \simeq
 160$ bits.
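
 As a rough check of this figure, taking ten years to be about
 $3 \times 10^{8}$ seconds (an approximation introduced here),
 the number of hashes that Alice can evaluate is
\beq
	\frac{CT}{t} \simeq \frac{10^{6} \times 3 \times 10^{8}\,\mbox{s}}{10^{-9}\,\mbox{s}}
	= 3 \times 10^{23} \simeq 2^{78} ,
\eeq
 and setting $2^{M/2}$ equal to this number gives
 $M \simeq 2 \times 78 = 156$, which rounds to the 160 bits quoted.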
% end solution


\section*{Further reading}
 The Bible for hash codes is volume 3 of \citeasnoun{KnuthAll}.
 I highly recommend  the story of Doug McIlroy's {\tt{\ind{spell}}}
 program, as told in section 13.8 of {\em{Programming Pearls}} \cite{Bentley2}.
 This astonishing piece of software makes use of a 64-\kilobyte\ data structure
 to store the spellings of all the words of a $75\,000$-word dictionary.
% also has some hash functions for strings on p 161, chapter 15.
% and random text generator.


\section{Further exercises} % removed and returned (maybe should transfer some of these?)
% solutions in  _se8.tex
% oct 97
% 
% info theory and the real world
%
\fakesection{Information theory and the real world (questions relating to hash functions)}
\exercisaxA{1}{ex.address}{
 What is the shortest the \ind{address} on a typical international {letter} 
 could be, if it is to get to a unique human recipient? (Assume the permitted
 characters are {\tt{[A-Z,0-9]}}.)
 How long are typical \ind{email} addresses?
}
\exercissxA{2}{ex.uniquestring}{
 How long does a piece of text need to be for you to be pretty
 sure that no human has written that string of characters
 before?
 How many notes are there in a new \ind{melody}\index{music} that
 has not been composed before? 
}
\exercissxB{3}{ex.proteinmatch}{
 {\sf Pattern recognition by \ind{molecules}}.\index{pattern recognition}

 Some proteins produced in a cell have a regulatory role.
 A regulatory \ind{protein} controls
 the transcription of specific \ind{genes} in the \ind{genome}.
% that might code for other proteins or sometimes the protein itself.
 This control often involves the protein's binding to a particular \ind{DNA} 
 sequence in the vicinity of the regulated gene.  The presence of the 
 bound protein either promotes or inhibits transcription of the gene.

\ben
\item
 Use information-theoretic arguments to obtain a lower bound on the size of a 
 typical  protein that acts as a regulator specific to one gene in the 
 whole human genome. Assume that the genome is 
 a sequence of
 $3 \times 10^{9}$ 
 nucleotides drawn from a four-letter alphabet $\{{\tt A},{\tt C},{\tt G},{\tt T}\}$;\index{amino acid}\index{nucleotide}\index{binding DNA} 
 a protein is a sequence of amino acids drawn from a twenty-letter alphabet.
 [Hint:  establish how long the recognized DNA sequence has to be 
 in order for that sequence to be
 unique to  the vicinity of one 
 gene, treating the rest of the genome as a random sequence. Then 
 discuss how big the protein must be to  recognize a  
 sequence of that length uniquely.]

\item
	Some of the sequences recognized by \ind{DNA}-binding regulatory\index{protein!regulatory} 
 proteins  consist of a subsequence that is repeated twice or
 more, for example
 the sequence 
\beq
\mbox{{\tt{\underline{GCCCCC}CACCCCT\underline{GCCCCC}}}}
\eeq
 is a binding site found upstream of the alpha-actin gene in humans.
%; this is a binding site for a transcription factor called Sp1. 
 Does the fact that some binding sites consist of
 a {repeated\/} subsequence influence your answer to part (a)?
\een
 
}
%
% stole information acquisition exercises from here to move to gene chapter
%





\dvips
\section{Solutions}% to Chapter \protect\ref{ch.hash}'s exercises} % 
\soln{ex.meanhash}{
 First imagine comparing the string $\bx$ with
 another random string $\bx^{(s)}$.
 The probability that the first bits of the two strings match
 is $1/2$. The probability that the second bits match
 is $1/2$. Assuming we stop comparing once we hit the
 first mismatch, the expected number of matches  is 1,
 so the expected number of comparisons is 2
 \exercisebref{ex.waithead}.
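
 Explicitly, if the strings are long enough that we can ignore the
 possibility of running off their end, the number of comparisons $r$
 made up to and including the first mismatch has probability $2^{-r}$,
 so its expectation is
\beq
	\sum_{r=1}^{\infty} r \, 2^{-r} = 2 .
\eeq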

% errors corrected in draft 2.0.7 on Sun 31/12/00
 Assuming the correct string is located at random in the
 raw list, we will have to compare with an average
 of $\hashS/2$ strings before we find it, which costs
 $2 \hashS/2$ binary comparisons; and comparing
 the correct strings takes $N$  binary comparisons,
 giving a total expectation of $\hashS + N$ binary comparisons,
 if the strings are chosen at random.

 In the worst case (which  may indeed happen in practice),
 the other strings are very similar
 to the search key, so that a lengthy sequence of comparisons
 is needed to find each mismatch. The worst case is when the correct
 string is last in the list, and all the other strings
 differ in the last bit only, giving a requirement of $\hashS N$
 binary comparisons.
}
\soln{ex.hash.retrieval}{
	The likelihood ratio for the two hypotheses,
 $\H_0$:  $\bx^{(\hashs)} = \bx$, and
 $\H_1$:  $\bx^{(\hashs)} \neq \bx$,
 contributed by the datum `the first bits of  $\bx^{(\hashs)}$ and $\bx$ are equal'
 is
\beq
\frac{	P( \mbox{Datum} \given  \H_0 ) }
	{ 	P( \mbox{Datum} \given  \H_1 ) }
 = \frac{1}{1/2} = 2.
\eeq
 If the first $r$ bits all match, the likelihood ratio is $2^r$ to one.
 On finding that 30 bits match, the odds are a billion to one
 in favour of $\H_0$, assuming we start from even odds.
 [For a complete answer, we should compute the evidence
% prior probability of $\H_0$ and $\H_1$
 given by the prior information that the hash entry $s$
 has been found in the table at $\bh(\bx)$. This fact gives further evidence
 in favour of $\H_0$.]
}
\soln{ex.hash.collision}{
 Let the hash function have an output alphabet of size $T = 2^M$.
 If $M$ were equal to $\log_2 S$ then we would have exactly enough bits
 for each entry to have its own unique hash.
 The probability that one particular pair of entries collide under a random
 hash function  is $1/T$.
 The number of pairs is $S(S-1)/2$. So the expected number
 of collisions between pairs is exactly
\beq
	S(S-1)/(2T).
\eeq
 If we would like this to be smaller than 1, then we need
$
	T > S(S-1)/2
% S(S-1) < 2A \:\: \Rightarrow \:\: S < \sqrt{2A}
$
 so
\beq
	M > 2 \log_2 S.
\label{eq.M2Shash}
\eeq
 We need {\em twice as many\/} bits as the number of bits,
 $\log_2 S$, 
 that would be sufficient to give each entry a unique name.

% fS = S(S-1)/(2A)
% A = (S-1) / (2 f )
 If we are happy to have occasional collisions, involving  a fraction
 $f$ of the names $S$, then
 we need $T > S/f$ (since the probability  that one particular name
 is collided-with is $f \simeq S/T$) so
\beq
	M > \log_2 S  + \log_2  [1/f] ,
\label{eq.MShash}
\eeq
 which means for $f \simeq 0.01$ that we need an extra
 7 bits above $\log_2 S$.

 The important point to note  is the \ind{scaling}  of $T$
 with $S$ in the two cases (\ref{eq.M2Shash},$\,$\ref{eq.MShash}). If we want
 the hash function to be collision-free, then
 we must have $T$ greater than $\sim \! S^2$.
 If we are happy to have a small frequency of collisions, then
 $T$  needs to be of order  $S$ only.
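
 As a worked illustration, with a database size chosen here purely
 for concreteness, take $S = 10^{6} \simeq 2^{20}$ entries. Then
\beq
	M > 2 \log_2 S \simeq 40 \:\: \mbox{bits}
\eeq
 are needed for the expected number of collisions to be smaller than 1,
 whereas
\beq
	M > \log_2 S + \log_2 100 \simeq 20 + 7 = 27 \:\: \mbox{bits}
\eeq
 suffice if collisions involving 1\% of the entries are acceptable.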
% some factor greater  than
}
%
%
%
\soln{ex.nines.p}{
	The posterior probability ratio for
 the two hypotheses, $\H_{+} = $ `calculation correct'
 and  $\H_{-} = $ `calculation incorrect'
 is the product of the prior probability ratio
 $P(\H_{+})/P(\H_{-})$ and the likelihood ratio,
 $P(\mbox{match} \given \H_{+})/P(\mbox{match} \given \H_{-})$.
 This second factor is the answer to the question.
 The numerator  $P(\mbox{match} \given \H_{+})$ is equal to 1.
 The denominator's value depends on our model of errors.
 If we know that the human calculator is prone to errors
 involving multiplication of the answer by 10, or to transposition
 of adjacent digits, neither of which affects the hash value,
 then
 $P(\mbox{match} \given \H_{-})$ could be equal to 1 also,
 so that the correct match gives no evidence
 in favour of $\H_{+}$. But if we assume that errors are
 `random from the point of view of the hash function' then
 the probability  of a false positive is
 $P(\mbox{match} \given \H_{-}) = 1/9$, and the
 correct match gives evidence 9:1 in favour
 of  $\H_{+}$.
}
%
%
%
\soln{ex.whyonlyCRC}{
 If you add a tiny $M=32$ extra bits of hash to a huge $N$-bit
 file you get pretty good \ind{error detection}\index{error-correcting code} --
% $1-2^{-M}$
 the probability that an
% of  detecting an error, less than a one-in-a-billion chance that the
 error is undetected is $2^{-M}$,
  less than  one in a billion. To do error {\em correction\/}
 requires far more check bits, the number depending on the expected types of
 corruption, and on the file size. 
 For example, if just eight random bits in a megabyte file
 are corrupted,  it would take
% $\log_2 {{ 8\times 10^{6}} \choose {8} }  \simeq 180$
 about $\log_2 {{ 2^{23} }\choose{8} }  \simeq 23 \times 8 \simeq 180$
 bits
 to specify which are the corrupted bits, and the number of \ind{parity-check bits} used by a successful error-correcting code would have to
 be at least this number, by the counting argument of \exerciseonlyref{ex.makecode2error}
 (solution, \pref{ex.makecode2error.sol}).
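
 A slightly more careful count, using the approximation
 ${{n}\choose{k}} \simeq n^k/k!$ for $k \ll n$, gives
\beq
	\log_2 {{ 2^{23} }\choose{8} } \simeq 8 \times 23 - \log_2 8!
	\simeq 184 - 15 = 169 \:\: \mbox{bits} ,
\eeq
 which is consistent with the rough figure of 180 bits given above.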
% Shannon's \ind{noisy-channel coding theorem}. 
}
% see also _se8.tex




\fakesection{se8}
%\begincuttable% NO, I LIKE IT
\soln{ex.uniquestring}{
 We want to know the length $L$ of a string
 such that it is very improbable that that
 string matches any part of the entire writings
 of humanity.
 Let's estimate that these writings total
 about one book for each person living, and that each book
 contains two million characters (200 pages with $10\,000$ characters
 per page) -- that's 
% $5\times 10^9 \times 2 \times 10^6 =
 $10^{16}$ characters, drawn from
 an alphabet of, say, 37 characters.

 The probability that a randomly chosen string of length $L$ matches
 at one point in the collected works of humanity is $1/37^{L}$.
 So the expected number of matches is
 $10^{16} /37^{L}$, which is vanishingly small if
 $L \geq 16/\log_{10} 37 \simeq 10$.
% 10.2
 Because of the redundancy and repetition of humanity's writings,
 it is possible that $L \simeq 10$ is an overestimate. 

 So, if you want to write something unique, sit down and compose
 a string of ten characters. But don't write {\tt{gidnebinzz}}, because
 I already thought of that string.

 As for a new \ind{melody},\index{music} if we focus on the sequence  of notes,
ignoring duration and stress, and
 allow leaps of up to an octave at each note,
 then the number of choices per note is 23.
 The pitch of the first note is arbitrary.
 The number of melodies of length $r$ notes in this rather
 ugly ensemble of  \ind{Sch\"onberg}ian tunes  is $23^{r-1}$;
 for example, there are $250\,000$ of length $r=5$.
 Restricting the permitted intervals will reduce this figure;
 including duration and stress will increase it again.
  [If we restrict the permitted intervals to
 repetitions and tones or semitones,
 the reduction is particularly severe; is this why
 the melody of 
 `\ind{Ode to Joy}' sounds so boring?]
 The number of recorded compositions is probably less than a 
 million.
% top of the pops for 50 * 50 weeks with 100 new songs per week
 If you learn 100 new melodies per week for every week of your
 life then you will have learned $250\,000$ melodies at age 50.
 Based on empirical experience of playing the game\index{game!guess that tune}
 `{\tt{guess that tune}}',\marginpar{\small\raggedright{In {\tt{guess that tune}},
 one player chooses a melody, and sings a gradually-increasing number
 of its notes, while the other
 participants try to guess the whole melody.\medskip

% aka  http://www.melodyhound.com/
 The {\dem\ind{Parsons code}\/} is a related hash function for
 melodies:
%  . To make the Parsons code of a melody, 
 each pair of consecutive notes is coded as {\tt{U}} (`up')
 if the second note is higher than the first, {\tt{R}} (`repeat')
 if the pitches are equal, and {\tt{D}} (`down') otherwise.
 You can find out how well this hash function
 works at {\tt{www.{\breakhere}name-{\breakhere}this-{\breakhere}tune.{\breakhere}com}}.
}}
 it seems to me that
 whereas many four-note sequences are shared
 between melodies, the number of collisions between
 five-note sequences is rather smaller -- most famous five-note
 sequences are  unique. 
}
%\ENDcuttable
\soln{ex.proteinmatch}{
%\ben
%\item
 (a) Let the DNA-binding \ind{protein} recognize a sequence of length $L$ nucleotides. 
 That is, it binds preferentially to that \ind{DNA} sequence, and not to 
 any other pieces of DNA in the whole genome. (In reality, the 
 recognized sequence may contain some wildcard characters, \eg, 
 the {\tt{*}} in {\tt{TATAA*A}}, which denotes `any of {\tt{A}}, {\tt{C}},
 {\tt{G}} and {\tt{T}}'; 
 so, to be precise, we are assuming that the recognized sequence
 contains $L$ non-wildcard
 characters.)
% in a sequence whose length can be greater than $L$.)

 Assuming the rest of the genome is `random', \ie, that the sequence
 consists of random nucleotides  {\tt{A}}, {\tt{C}},
 {\tt{G}} and {\tt{T}} with equal probability -- which is obviously 
 untrue, but it shouldn't make too much difference to our calculation --
 the  chance of there being no other occurrence of the target sequence
 in the whole genome, of length $N$ nucleotides, is roughly
\beq
	(1 - (1/4)^L )^N  \simeq \exp ( - N (1/4)^L ) ,
\eeq
 which is close to one only if
\beq
	N 4^{-L} \ll 1 , 
\eeq
 that is, 
\beq
	L >   \log N / \log 4 .
\eeq
 Using $N= 3 \times 10^9$, 
% from cell p.386
 we require the recognized sequence to be longer than $L_{\min} = 16$
 nucleotides.
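
 The arithmetic is
\beq
	\frac{\log N}{\log 4} = \frac{\log_{10} ( 3 \times 10^{9} )}{\log_{10} 4}
	\simeq \frac{9.48}{0.60} \simeq 15.7 ,
\eeq
 so sixteen is the smallest whole number of nucleotides that satisfies
 the inequality.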

 What size of \ind{protein} does this imply? 
%\ben
\bit
\item
%
 A weak lower bound can be obtained by assuming that the information 
 content of the protein sequence itself is greater than 
 the information content  of the \ind{nucleotide}
 sequence the protein prefers to bind to (which we have argued above 
 must be at least 32 bits). 
 This  gives a minimum protein length of $32 / \log_2(20) \simeq 7$
 \ind{amino acid}s.
\item
 Thinking realistically, the \ind{recognition} of the DNA sequence  
 by the protein presumably involves the protein coming into contact
 with all sixteen nucleotides in the target sequence. 
 If the protein is a monomer, it must be big enough that it can 
 simultaneously make contact with sixteen nucleotides of DNA. 
 One helical turn of DNA containing ten nucleotides has a length of 
 3.4$\,$nm, so a contiguous sequence of sixteen nucleotides has a length 
 of 5.4$\,$nm. The diameter of the protein must therefore be about 5.4$\,$nm 
 or greater. Egg-white lysozyme is a small globular protein with 
 a length of 129 amino acids
% cell p.90
 and a diameter of about 4$\,$nm.
% cell p.130.
 Assuming that volume is proportional to sequence length
 and that volume scales as the cube of the diameter, a protein of 
 diameter 5.4$\,$nm must have a sequence of length $2.5 \times 129 
 \simeq 324$ amino acids.
%\een
\eit
% \item
%
(b) 
 If, however, a target sequence consists of a twice-repeated sub-sequence, 
 we can get by with a much smaller protein that recognizes 
  only the sub-sequence, and that binds to the \ind{DNA} strongly only if 
 it can form a {\em\ind{dimer}},
 both halves of which are bound to the recognized sequence.
% , which must appear twice in succession in the DNA.
% with a neighbour. 
 Halving the diameter of the protein, we now only need a protein whose length 
 is greater than $324/8 \simeq 40$ amino acids.
 A protein of length smaller than this cannot by itself serve as a 
 regulatory protein\index{protein!regulatory} specific to one gene,
 because it's simply too small to be able to make a sufficiently
 specific match -- its available surface does not have enough
 information content.
% \een
 
}

% 
\dvips
%
% ch 8 LINEAR
%\chapter{Linear Error correcting codes and perfect codes}
%\chapter{Linear Error Correcting Codes and Perfect Codes \nonexaminable}
\prechapter{About               Chapter}
% prechapter for linear codes / binary codes
 In Chapters \ref{ch.prefive}--\ref{ch.ecc},
 we established Shannon's noisy-channel coding theorem
 for a  general channel with any
 input and output alphabets.
 A great deal  of attention in coding theory focuses on the special
 case of  channels with binary inputs.
 Constraining ourselves to these channels simplifies
 matters, and leads us into an exceptionally rich world,
 which we will only taste in this book.


 One of the aims of this chapter is to point out a
 contrast between Shannon's aim of achieving reliable communication
 over a noisy channel and the apparent aim of many in the
% this wonderful
 world of \ind{coding theory}.\index{sphere packing}
 Many coding theorists take as their fundamental problem
 the task of packing as many spheres as possible, with radius
 as large as possible, into  an $N$-dimensional space, {\em with
 no spheres overlapping}.
 Prizes are awarded to people
 who find packings that squeeze in an extra few spheres.
% of a given radius.
 While this is a fascinating mathematical topic,
 we shall see that the aim  of maximizing
 the \ind{distance} between codewords in a code has only a tenuous
 relationship to Shannon's aim of reliable \ind{communication}.

\ENDprechapter
\chapter{Binary  Codes \nonexaminable}
\label{ch.linearecc}
\label{ch.linear}

% see also linearblock.tex
%
% chapter 8: linear error correcting codes
%
% distance
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% see also NOTES.tex

 We've established Shannon's noisy-channel coding theorem
 for a  general channel with any
 input and output alphabets.
 A great deal  of attention in coding theory focuses on the special
 case of 
 channels with binary inputs, the first implicit choice being the
 binary symmetric channel.\index{channel!binary symmetric}

 The optimal decoder for a code, given a binary symmetric channel,
 finds the codeword that is closest to the received vector, closest\marginpar[b]{\small{{\sf Example:}\\[0.0012in]
%\begin{center}
\begin{tabular}{rl}
\multicolumn{2}{c}{
 The Hamming distance
}\\
{between}& {\tt{00001111}}\\ and & {\tt{11001101}}\\
\multicolumn{2}{c}{
is 3.
}\\
\end{tabular}
%\end{center}
 }}
 in {\dem\ind{Hamming distance}}.\index{distance!Hamming}
 The Hamming distance between two binary vectors is the number
 of coordinates in which the two vectors
 differ.
 Decoding errors will occur
 if the noise takes us from the transmitted codeword  $\bt$ to a
 received vector $\br$ that is closer to some other codeword.
 The {\dem{distances\/}} between codewords are thus relevant to the
 probability of a decoding error.\index{distance!of code}
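
 As a concrete illustration, the Hamming distance and the
 corresponding minimum-distance decoder can be written in a few lines
 of Python. The function names, and the two-codeword toy code used to
 exercise the decoder, are assumptions made for this sketch.

```python
def hamming_distance(u, v):
    # Number of coordinates in which two equal-length binary vectors differ.
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a != b for a, b in zip(u, v))

def decode_nearest(r, codewords):
    # Minimum-distance decoding: the codeword closest to the received vector r.
    return min(codewords, key=lambda t: hamming_distance(r, t))

print(hamming_distance("00001111", "11001101"))            # 3
print(decode_nearest("0010000", ["0000000", "1111111"]))   # 0000000
```

 In the second call a single flipped bit leaves the received vector
 closer to the transmitted codeword than to the other one, so it is
 decoded correctly.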

\section{Distance properties of a code}
%\begin{description}
%\item[The {\dem{distance}\/}  of a\index{distance!of code} code]
 The {\dem{distance}\/}  of a\index{distance!of code}
% \index{error-correcting code!distance}
 code is the smallest separation  between two of its
 codewords.
% \end{description}
% \begin{ indented 
\exampl{ex.hamm74dist}{
%\noindent {\sf Example:}
 The $(7,4)$ Hamming code (\pref{sec.ham74})
 has distance $d= 3$. All pairs of  its
 codewords differ in at least 3 bits.
 The maximum number of errors it can correct is  $t=1$;
 in general
 a code with  distance $d$ is
 $\lfloor (d\!-\!1)/2 \rfloor$-error-correcting.
}
% , and
% the distance is related to this quantity by
% $d=2t+1$.
% \end{indented

 A more precise term for distance is
 the {\dem\ind{minimum distance}\/} of the code.
 The distance of a code is often denoted by $d$ or $d_{\min}$.
%
% \section{Weight enumerator function}
%    see  code/bucky/README
\index{error-correcting code!weight enumerator}%
%\index{error-correcting code!distance distribution}%

 We'll now restrict our attention to linear codes.
 In a linear code, all codewords have identical
 distance properties, so we can summarize
% the dis.
% are equivalent,
% from the point of view of the spectrum of
% distances to other codewords. 
%  summarizes
 all the distances between the code's codewords
 by counting the distances from the all-zero codeword.

%\begin{description}
%\item[The {\dem\ind{weight enumerator} function} of a code,] $A(w)$,
 The {\dem\ind{weight enumerator} function} of a code, $A(w)$,
% $A(w)$
 is defined to be the number of codewords in the code that
 have weight $w$.
\amarginfig{b}{%
\footnotesize
\begin{tabular}{c}
\raisebox{0.2in}{\buckypsfig{H74.eps}}
\\
%# weight enumerator of $(7,4)$ code
%# w           A(w)         C     Random  Random N-choose-w
\begin{tabular}[b]{rr}
\toprule
    $w$ & $A(w)$ \\   \midrule
%%%%%%%%%%%%%%%%%%%%%%%%%%%%
     0 &    1  \\
     3 &    7  \\
     4 &    7  \\ 
     7 &    1  \\ \midrule
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%55
Total   & 16\\  \bottomrule
\end{tabular}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\\
\buckypsgraphb{H74.Aw.ps}
\end{tabular}
% see /home/mackay/code/bucky/H74.gnu
\caption[a]{The graph of the $(7,4)$ Hamming code, and its weight enumerator
 function.}
\label{fig.wef.h74}
}
%
 The weight enumerator function is also
 known as the {\dem{{distance distribution}\index{distance!distance distribution}}\/} of the code.
%\end{description}

% original is in graveyard.tex
\begin{figure}
\figuremargin{%
\footnotesize
\begin{tabular}{ccc}
\buckypsfig{dodec.eps}
&
%# weight enumerator of (30,11) code dodec2.G
%# w           A(w)         C     Random  Random N-choose-w
\begin{tabular}{rr}
\toprule
    $w$ & $A(w)$ \\   \midrule
%%%%%%%%%%%%%%%%%%%%%%%%%%%%
      0 &      1  \\
      5 &     12  \\
      8 &     30  \\
      9 &     20  \\
     10 &     72  \\
     11 &    120  \\
     12 &    100  \\
     13 &    180  \\
     14 &    240  \\ 
     15 &    272  \\
     16 &    345  \\
     17 &    300  \\
     18 &    200  \\
     19 &    120  \\
     20 &     36  \\ \midrule
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%55
Total   & 2048\\  \bottomrule
\end{tabular}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

&
\begin{tabular}{@{}c@{}}
\buckypsgraphB{dodec2.Aw.ps}
\\
\buckypsgraphB{dodec2.Aw.l.ps}
\end{tabular}
\\% see /home/mackay/code/bucky
\end{tabular}
}{
\caption[a]{
 The graph defining the $(30,11)$
 \ind{dodecahedron code}\index{error-correcting code!dodecahedron}
% first introduced in secref{sec.dodecahedron}
 (the circles are the 30 transmitted bits and the triangles are the 20 parity checks,
 one of which is redundant) and the
% (b-ii) The
 weight enumerator function (solid lines). The
 dotted lines show  the
 average weight enumerator function of all random linear codes
 with the
 same size of generator matrix,
% (dotted lines),
 which will be computed shortly.
 The lower
 figure shows the same functions on a log scale.

%%%%%%%%%%%%%%% CHECK %%%%%%%%%%%%%%%
% {\em (Check for cross-reference to earlier occurrence?)}
}
\label{fig.Aw}
}
\end{figure}

% \begin{ indented ?
\exampl{ex.hamm74Aw}{
% \noindent {\sf Example:}
 The weight enumerator functions
 of the $(7,4)$ Hamming code and the \ind{dodecahedron code}\index{error-correcting code!dodecahedron}
 are shown in figures \ref{fig.wef.h74} and \ref{fig.Aw}.
% \end{indented
}
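
 Two quick consistency checks can be read off these tables (a worked
 illustration using only the tabulated values).
 First, every codeword has exactly one weight, so the entries of $A(w)$
 must sum to the number of codewords, $2^K$:
\beqan
 1 + 7 + 7 + 1 &=& 16 \:=\: 2^{4} ,
\\
 1 + 12 + 30 + \cdots + 120 + 36 &=& 2048 \:=\: 2^{11} .
\eeqan
 Second, for a linear code the minimum distance is the smallest
 $w>0$ for which $A(w)>0$, which gives $d=3$ for the $(7,4)$ Hamming
 code and $d=5$ for the dodecahedron code.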


\section{Obsession with distance}
 Since the maximum number of errors that  a code  can {\em guarantee\/} to correct,
 $t$, is related to its distance $d$ by
 $t= \lfloor (d\!-\!1)/2 \rfloor$,\marginpar{\small{%
 $d=2t+1$ if $d$ is odd, and\\
$d=2t+2$
 if $d$ is even.}}
 many coding theorists focus on the\index{distance!of code} distance of a code, searching for
 codes of a given size that have the biggest possible distance.
 Much of practical coding theory has focused on
 decoders  that give the optimal decoding for all error patterns
 of weight up to the half-distance $t$ of their codes.
\begin{description}
\item[A \ind{bounded-distance decoder}]\index{decoder!bounded-distance}
 is a decoder that returns the closest codeword to a received\label{sec.bdd}
 binary vector $\br$ if the distance from $\br$ to that codeword
 is less than or equal to $t$; otherwise it returns a failure
 message.
\end{description}
 The rationale for not trying to decode when more than $t$
 errors have occurred might be `we can't {\em guarantee\/}
 that we can  correct more than $t$ errors, so we
 won't bother trying -- who would
 be interested in a decoder that  corrects some\index{sermon!worst-case-ism}
 error patterns of weight greater than $t$, but not others?'
 This defeatist attitude is an example of {\dem\ind{worst-case-ism}},
 a widespread mental ailment
% yes, spell checked
 which this book is intended to cure.

 The fact is that bounded-distance decoders cannot reach the\wow\
 Shannon limit of the binary symmetric channel; only a decoder
 that often corrects more than $t$ errors can do this.
 State-of-the-art error-correcting codes
 have decoders that work way beyond the minimum distance
 of the code.

\subsection{Definitions of good and bad distance properties}
 \index{distance!of code!good/bad}Given
 a family of codes of increasing blocklength $N$, and with rates
 approaching a limit $R>0$, 
 we may be able to put that family in one of the  following categories,
 which have some similarities to the categories of `good' and `bad' codes
 defined earlier (\pref{sec.bad.code.def}):\index{error-correcting code!good}\index{error-correcting code!bad}\index{error-correcting code!very bad}\index{distance!good}\index{distance!bad}\index{distance!very bad}
\label{sec.bad.dist.def}
\begin{description}
\item[A sequence of codes has `good' distance]
 if $d/N$ tends to a constant greater than zero.
\item[A sequence of codes has `bad' distance]
 if $d/N$ tends to zero.
\item[A sequence of codes has `very bad' distance]
 if $d$ tends to a constant.
\end{description}
% THIS really belongs over the page
\amarginfig{b}{
\begin{center}
 \mbox{
\psfig{figure=/home/mackay/itp/figs/gallager/16.12.G.ps,width=2in,angle=-90}
}\end{center}
\caption[a]{The graph of  a rate-\dfrac{1}{2}
  low-density generator-matrix code. The rightmost $M$
 of the transmitted bits are each connected to a single distinct parity
 constraint.
}
\label{fig.ldgmc}
}


\exampl{example.badcode}{
 A {\dem\ind{low-density generator-matrix code}\/} is a
 linear code whose $K \times N$ generator matrix
 $\bG$ has a small number $d_0$ of {\tt{1}}s per row,
 regardless of how big $N$ is.
 The minimum distance of such a code is at most $d_0$,
 so  {low-density generator-matrix code}s have `very bad' distance.
}
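
 The reasoning behind this bound can be spelled out in one line
 (a short sketch; $\mathbf{e}_k$ is notation introduced here for the source
 vector with a single {\tt{1}} in position $k$). Encoding $\mathbf{e}_k$
 simply selects one row of the generator matrix,
\beq
 \mbox{encoding of $\mathbf{e}_k$} = \mbox{$k$th row of $\bG$} ,
\eeq
 so every row of $\bG$ is a codeword of weight $d_0$. The all-zero vector
 is also a codeword, so the code contains two codewords separated by
 a distance $d_0$, and therefore $d \leq d_0$ however large $N$ is.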

 While having large distance is no bad thing, we'll see, later on, why
 an emphasis on  distance can be unhealthy.




\begin{figure}[htbp]
\figuremargin{
\mbox{\psfig{figure=figs/caveperfect.ps,angle=-90,width=3in}}
}{
\caption[a]{Schematic picture of part of
 Hamming space perfectly filled
 by $t$-spheres centred on the codewords of a perfect code.}
\label{fig.caveperfect}
}
\end{figure}
\section{Perfect codes}
 A $t$-sphere (or a sphere of radius $t$)
 in Hamming space, centred on a point $\bx$,
 is the set of points  whose Hamming distance from $\bx$
 is less than or equal to $t$.

 The $(7,4)$ \ind{Hamming code}\index{perfect code}\index{error-correcting code!perfect} has the beautiful property that
 if we place 1-spheres
% of radius 1
 about each of its 16 codewords,
 those spheres perfectly fill Hamming space without overlapping.
 As we saw in \chref{ch1},
 every binary vector of length 7 is within a distance of $t=1$ of
 exactly one codeword of the Hamming code.
\begin{description}
\item[A code is a perfect $t$-error-correcting code]
 if the set of $t$-spheres centred on the codewords of the code fill the
 Hamming space without overlapping. (See \figref{fig.caveperfect}.)
\end{description}

 Let's recap our cast of characters.
 The number of codewords is $S=2^K$. The number of points in the
 entire Hamming space is $2^N$. The number of points in a
 Hamming sphere of radius $t$ is
\beq
	\sum_{w=0}^{t} {{N}\choose{w}} .
\eeq
 For a code to be perfect with these parameters, we
 require $S$ times the number of points in the $t$-sphere to equal $2^N$:
\beqan
\mbox{for a perfect code, } \:\:
 2^K \sum_{w=0}^{t} {{N}\choose{w}} & =& 2^N
\\
\mbox{or, equivalently, }\:\:
     \sum_{w=0}^{t} {{N}\choose{w}} & =& 2^{N-K} .
\eeqan
 For a perfect code, the number of noise vectors
 in one sphere must equal the number of possible syndromes.
 The $(7,4)$ Hamming code satisfies this numerological condition\index{numerology}
 because
\beq
	1 + {{7}\choose{1}}  = 2^3 .
\label{eq.coincidence}
\eeq
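 The same counting check can be applied to other candidate parameters
 (a worked illustration; the parameter choices below are ours).
 For the repetition code $R_5$, with $N=5$, $K=1$ and $t=2$,
\beq
	1 + {{5}\choose{1}} + {{5}\choose{2}} = 1 + 5 + 10 = 16 = 2^{4} ,
\eeq
 which equals $2^{N-K}$, so the condition holds. By contrast, a binary
 $2$-error-correcting code of blocklength $N=15$ would need
\beq
	1 + {{15}\choose{1}} + {{15}\choose{2}} = 1 + 15 + 105 = 121
\eeq
 to be a power of two, which it is not, so no such code can be perfect.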
% Interestingly, the first appearance of the ternary Golay code predated
%Golay's publication by a good year. A Finnish devotee of football pools thought it up in list form (!) and published it
% in 1947.
% Covering codes. 
%G.Cohen, I.Honkala. S.Litsyn, and A.Lobstein 
%North-Holland Publishing Co., Amsterdam, 1997. xxii+542 pp. ISBN 0-444-82511-8
%  It is this "ternary" Golay code which was first discovered by a Finn who was
%  determining good strategies for betting on blocks of 11 soccer games. Here,
%  one places a bet by predicting a Win, Lose, or Tie for all 11 games, and as
%  long as you do not miss more than two of them, you get a payoff. If a group
%  gets together in a "pool" and makes multiple bets to "cover all the options"
%  (so that no matter what the outcome, somebody's bet comes within 2 of the
%  actual outcome), then the codewords of a 2-error-correcting perfect code
%  provide a very nice option; the balls around its codewords fill all of the
%  space, with none left over.
%
%  It was in this vein that the ternary Golay code was first constructed; its
%  discover, Juhani Virtakallio, exhibited it merely as a good betting system
%  for football-pools, and its 729 codewords appeared in the football-pool
%  magazine Veikkaaja. For more on this, see Barg's article [1].
%
%  [1] Barg, Alexander. "At the Dawn of the Theory of Codes," The Mathematical
%  Intelligencer, Vol. 15 (1993), No. 1, pp. 20--26.
\subsection{How happy we would be to use perfect codes}
 If there were large numbers of perfect codes to choose from,
 with a wide range of blocklengths and rates,
 then these would be the perfect solution to Shannon's problem.
 We could communicate over a binary symmetric channel with noise
 level $f$, for example, by picking a perfect $t$-error-correcting code
 with blocklength $N$ and $t=f^* N$, where $f^* = f + \delta$
 and $N$ and $\delta$ are chosen such that the probability that
 the noise flips more than $t$ bits is satisfactorily small.
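
 To get a feel for the numbers (an illustration with values chosen here,
 not taken from any particular code): the number of flipped bits has mean
 $fN$ and standard deviation $\sqrt{N f (1-f)}$. With $f=0.1$ and
 $N=10\,000$,
\beq
	fN = 1000 \:\:\: \mbox{and} \:\:\: \sqrt{N f(1-f)} = \sqrt{900} = 30 ,
\eeq
 so the choice $\delta = 0.01$, that is, $t = 1100$, places the threshold
 more than three standard deviations above the mean. At fixed $\delta$,
 this margin, measured in standard deviations, grows as $\sqrt{N}$.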

 However, {\em there are almost no perfect codes}.\wow\
 The only nontrivial
 perfect binary
 codes are
\ben
\item
 the Hamming codes, which are perfect codes with $t=1$
% -error-correcting with
 and blocklength  $N=2^M-1$,
 defined below; the rate of a \ind{Hamming code} approaches 1
 as its blocklength $N$ increases;
\item
 the repetition codes of odd blocklength $N$, which are  perfect codes
 with $t=(N-1)/2$; the rate of repetition codes goes to zero as $1/N$; and
\item
 one remarkable $3$-error-correcting code with $2^{12}$
 codewords of
 blocklength $N=23$ known
 as the binary \ind{Golay code}\index{error-correcting code!Golay}.
 [A second 2-error-correcting  Golay code of
 length $N=11$ over a
 ternary alphabet was 
% 729 cw's in football-pool magazine Veikkaaja.
 discovered by a Finnish football-pool
 enthusiast\index{football pools}\index{bet}\index{design theory}
% \index{Finland}
 called  Juhani Virtakallio\index{Virtakallio, Juhani} in 1947.]
% 1+23+23*11 + 23*11*7 = 2048
% If we allow more symbols in our alphabet than just 0 and 1, then we get analogues of the
% Hamming codes, and another Golay code of length 11, this time on three letters (say 0, +, and -) and with parameters (11,
% 3^6, 5). This completes the list of all linear perfect codes. parameters (11,3^6, 5).
% http://lev.yudalevich.tripod.com/ECC/betting.html
%
% [1] Barg, Alexander. "At the Dawn of the Theory of Codes," The Mathematical Intelligencer, Vol. 15 (1993), No. 1, pp.
% 20--26. 
\een
 There are no other binary  perfect codes. 
 Why this shortage of perfect codes?
 Is it because precise numerological coincidences like those satisfied by the parameters
 of the Hamming code (\ref{eq.coincidence}) and the  Golay
 code, 
\beq
 1 + {{23}\choose{1}} + {{23}\choose{2}}  + {{23}\choose{3}}  = 2^{11},
\eeq
 are rare? Are  there   plenty of `almost-perfect' codes for which
 the $t$-spheres  fill {\em almost\/} the whole space?

 No. In fact, the picture
 of Hamming spheres centred on the
 codewords {\em{almost}\/} filling Hamming space (\figref{fig.cavenotquite})
 is a misleading one: for most codes, whether
 they are good codes or bad codes,\index{sermon!sphere-packing}
% 
 almost all the Hamming space is taken up by the space {\em{between}\/}
 $t$-spheres
% \wow\
 (which is shown in grey in \figref{fig.cavenotquite}).
\begin{figure}
\figuremargin{
\mbox{\psfig{figure=figs/cavenotquite.ps,angle=-90,width=3in}}
}{
\caption[a]{Schematic picture of Hamming space not perfectly filled
 by $t$-spheres centred on the codewords of a  code.
 The grey regions show points that  are at a Hamming distance
 of more than $t$ from any codeword. This is a misleading picture,
 as, for any code with large $t$
 in high dimensions, the grey space between the
 spheres takes up almost all of Hamming space.
}
\label{fig.cavenotquite}
}
\end{figure}
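
 Even a small code illustrates the point (a worked example using the
 dodecahedron code of \figref{fig.Aw}, whose weight enumerator gives
 $d=5$ and hence $t=2$). The $2$-spheres around its $2^{11}$ codewords
 do not overlap, so the fraction of Hamming space that they occupy is
\beq
	\frac{ 2^{11} \left[ 1 + {{30}\choose{1}} + {{30}\choose{2}} \right] }{ 2^{30} }
	= \frac{ 466 }{ 2^{19} } \simeq 9 \times 10^{-4} ;
\eeq
 more than 99.9\% of all binary vectors of length 30 lie between the spheres.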

 Having established this gloomy picture, we spend a moment
 filling in the properties of the perfect codes mentioned
 above.
\subsection{The Hamming codes}
 The $(7,4)$ Hamming code can be defined as the linear code
 whose $3\times 7$ parity-check matrix contains, as its columns,
 all the 7 ($=2^3-1$) non-zero vectors of length 3.
 Since these 7 vectors are all different, any single bit-flip
 produces a distinct syndrome, so all single-bit errors
 can be detected and corrected.
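 For concreteness, here is one possible ordering of the columns
 (an illustrative choice; any ordering of the seven non-zero columns
 defines an equivalent code):
\beq
 \bH = \left[ \begin{array}{ccccccc}
 1&1&1&0&1&0&0\\
 0&1&1&1&0&1&0\\
 1&0&1&1&0&0&1
 \end{array} \right] .
\eeq
 If a codeword is transmitted and only the second bit is flipped, the
 syndrome is the second column of $\bH$, namely {\tt{110}} read from top
 to bottom. No other single bit-flip produces this syndrome, so the
 decoder can identify the second bit and unflip it.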
% from \input{tex/_concat2.tex} 

 We can generalize this code, with $M=3$ parity constraints,
 as follows.
 The Hamming codes are  single-error-correcting codes
 defined by picking  a number of  parity-check constraints,  $M$;
 the blocklength $N$ is  $N = 2^M-1$; the parity-check matrix
 contains, as its columns, all the $N$ non-zero vectors
 of length $M$ bits.

 The first few Hamming codes have the following rates:
\medskip% added because of my change to the center environment
\begin{center}
\begin{tabular}{cr@{,$\,$}llp{1.4in}} \toprule
% checks &
%% (block length, source bits)
% & rate & \\
% $M$ & ($N = 2^M-1$ , $K = N - M$) &  $R=K/N$ &  \\ \midrule
\multicolumn{1}{c}{Checks, $M$} & \multicolumn{2}{c}{($N,K$)} &  $R=K/N$ &  \\ \midrule
 2   & (3&1) & 1/3 & repetition code $R_3$ \\
 3 & (7&4) & 4/7 & $(7,4)$ Hamming code \\
 4 & (15&11) & 11/15 & \\
 5 & (31&26) & 26/31 & \\
 6 & (63&57) & 57/63 & \\ \bottomrule
\end{tabular}
\end{center}
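 In general the rate is $R = 1 - M/(2^M-1)$. Continuing the table a little
 further (a worked extension, not part of the table above), $M=10$ gives
\beq
	(N,K) = (1023, 1013) , \:\:\: R = 1013/1023 \simeq 0.990 ,
\eeq
 which illustrates how quickly the rate approaches 1.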

\exercissxA{2}{ex.HammingP}{
 What is  the probability of block error of the $(N,K)$ Hamming
 code to leading order, when the code
 is used for  a binary symmetric channel with noise density $f$?
}


\section{Perfectness is unattainable -- first proof \nonexaminable}
 We will show  in several ways
 that useful \ind{perfect code}s do not exist (here,
 `useful' means `having large blocklength $N$, and rate
 close neither to 0 nor 1').
%  First, let's study a pithy, no-nonsense example.

 Shannon proved that, given a binary symmetric channel
 with any noise level $f$, there exist codes with large blocklength $N$
 and rate as close as you like to $C(f) = 1 - H_2(f)$
 that enable \ind{communication} with
 arbitrarily small error probability.
 For large $N$, the number of errors per block will typically
 be about $\fN$, so  these codes of Shannon are
 `almost-certainly-$\fN$-error-correcting'
 codes.

 Let's pick the special case of a noisy channel with $f \in ( 1/3, 1/2)$.
 Can we find a large
 {\em perfect\/} code that is $\fN$-error-correcting?
% with large blocklength for this channel?
 Well, let's suppose that such a code has been found, and examine
 just three of its codewords. (Remember that the code
 ought to have  rate $R \simeq 1-H_2(f)$, so it should have
 an enormous number ($2^{NR}$) of codewords.)
\begin{figure}
\figuremargin{
\mbox{\psfig{figure=figs/noperfect3.ps,%
width=64mm,angle=-90}}
}{%
\caption[a]{Three
 codewords.
}
\label{fig.noperfect}
}
% load 'gnuR'
\end{figure}
 Without loss of generality, we choose one of the codewords
 to be the all-zero codeword and define the other two to have
 overlaps with it as shown in \figref{fig.noperfect}.
 The second codeword differs from the first in a fraction $u+v$
 of its coordinates.
 The third codeword differs from the first in a fraction $v+w$,
 and from the second in a fraction $u+w$. A fraction $x$
 of the coordinates have value zero in all three codewords.
 Now, if the code is $\fN$-error-correcting, its minimum distance
 must be greater than $2\fN$, so
\beq
 u+v > 2f, \:\:\: v+w > 2f, \:\:\: \mbox{and} \:\:\:  u+w > 2f .
\eeq
 Summing these three inequalities and dividing by two, we have
\beq
	u +v+w > 3f .
\eeq
 So if $f>1/3$,  we can deduce $u+v+w > 1$, so that $x<0$,
 which is impossible.  Such a code cannot exist.
 So the code cannot have {\em three\/} codewords, let alone
 $2^{NR}$.
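
 For readers who want the intermediate steps (a worked restatement of
 the argument above, with $f=0.4$ as an illustrative value):
 the four fractions account for every coordinate, so $x = 1-(u+v+w)$, and
 the sum of the three inequalities is
\beq
	(u+v)+(v+w)+(u+w) = 2(u+v+w) > 6f .
\eeq
 With $f = 0.4$, each pair of codewords must differ in more than 80\% of
 the coordinates, which gives $u+v+w > 1.2$ and $x < -0.2$.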
 
 We conclude that, whereas Shannon proved there
 are plenty of codes for communicating over
 a \ind{binary symmetric channel}\index{channel!binary symmetric}\index{perfect code}
 with $f>1/3$, {\em there are no perfect codes\index{error-correcting code!perfect}
 that can do this.}

 We now study a more general argument that indicates
 that there are no large perfect linear codes for general rates (other than 0 and 1).
 We do this by finding the typical distance of a random linear code.

%\mynewpage
\section{Weight enumerator function of random linear codes \nonexaminable}
\label{sec.wef.random}
 Imagine
% H=rand(12,24)>0.5  
% octave
\marginfig{\tiny{
\[%\mbox{\footnotesize{$\bH=$}}
\hspace{-2mm}\begin{array}{c}
{N}\\
\overbrace{\left.\hspace{-2mm}\left[\begin{array}{@{}*{24}{c@{\hspace{0.45mm}}}}
1&0&1&0&1&0&1&0&0&1&0&0&1&1&0&1&0&0&0&1&0&1&1&0\\
0&0&1&1&1&0&1&1&1&1&0&0&0&1&1&0&0&1&1&0&1&0&0&0\\
1&0&1&1&1&0&1&1&1&0&0&1&0&1&1&0&0&0&1&1&0&1&0&0\\
0&0&0&0&1&0&1&1&1&1&0&0&1&0&1&1&0&1&0&0&1&0&0&0\\
0&0&0&0&0&0&1&1&0&0&1&1&1&1&0&1&0&0&0&0&0&1&0&0\\
1&1&0&0&1&0&0&0&1&1&1&1&1&0&0&0&0&0&1&0&1&1&1&0\\
1&0&1&1&1&1&1&0&0&0&1&0&1&0&0&0&0&1&0&0&1&1&1&0\\
1&1&0&0&1&0&1&1&0&0&0&1&1&0&1&0&1&1&1&0&1&0&1&0\\
1&0&0&0&1&1&1&0&0&1&0&1&0&0&0&0&1&0&1&1&1&1&0&1\\
0&1&0&0&0&1&0&0&0&0&1&0&1&0&1&0&0&1&1&0&1&0&1&0\\
0&1&0&1&1&1&1&1&0&1&1&1&1&1&1&1&1&0&1&1&1&0&1&0\\
1&0&1&1&1&0&1&0&1&0&0&1&0&0&1&1&0&1&0&0&0&0&1&1
\end{array}\right]\right\} M \hspace{-2mm}\hspace{-0.25in} }
\end{array} \]
}
\caption[a]{A random binary parity-check matrix.}
\label{fig.randommatrix}
}%
 making a code by picking the binary entries
 in the $M \times N$ parity-check matrix  $\bH$ at random.\index{error-correcting code!random linear}
 What weight enumerator function should we expect?

 The \ind{weight enumerator} of one particular code with
 parity-check matrix $\bH$, $A(w)_{\bH}$, is
 the number of codewords of weight $w$, which
 can be written
\beq
	A(w)_{\bH} = \sum_{\bx: |\bx| = w} \truth\! \left[ \bH \bx = 0 \right] ,
\eeq
 where the
 sum is over all vectors $\bx$  whose weight is $w$ and
 the \ind{truth function}  $\truth\!  \left[ \bH \bx = 0 \right]$
 equals one if
% it is true that
 $\bH \bx = 0$
 and zero otherwise.

 We can find the expected value of $A(w)$,
\beqan
	\langle A(w) \rangle  &=& \sum_{\bH} P(\bH)  A(w)_{\bH} 
\\
	&=&  \sum_{\bx: |\bx| = w}  \sum_{\bH} P(\bH) \,
		\truth\!  \left[ \bH \bx \eq 0 \right]
,
\label{eq.expAw}
\eeqan
 by  evaluating the probability
 that a particular word of weight $w>0$ is a codeword of the code (averaging
 over all binary linear codes in our ensemble).
 By symmetry, this probability depends only on the weight $w$ of the word,
 not on the details of the word. 
 The probability that the entire syndrome
 $\bH \bx$ is zero can be found by multiplying together
 the probabilities that each  of the $M$ bits in the syndrome
 is zero. Each bit $z_m$ of the syndrome is a sum (mod 2)
 of $w$ random bits, so the probability that $z_m \eq 0$ is $\dhalf$.
 The probability that   $\bH \bx \eq 0$ is thus
\beq
  \sum_{\bH} P(\bH) \, \truth\!  \left[ \bH \bx \eq 0 \right]
 = (\dhalf)^M = 2^{-M},
\eeq
 independent of
 $w$.
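
 The value $\dhalf$ can be checked directly (a short supplementary
 argument; $h_1,\ldots,h_w$ is notation introduced here for the $w$ entries
 of row $m$ of $\bH$ that multiply the {\tt{1}}s of $\bx$). Whatever the
 values of $h_1,\ldots,h_{w-1}$, the final bit $h_w$ is equally likely to be
 {\tt{0}} or {\tt{1}}, so
\beq
	P( z_m = 0 ) = P\!\left( h_w = ( h_1 + \cdots + h_{w-1} ) \bmod 2 \right) = \dhalf
\eeq
 for any $w \geq 1$. For $w=0$ the syndrome is always zero, which is why
 the all-zero word is treated separately.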

 The expected number of words of weight $w$ (\ref{eq.expAw})
 is given by summing, over all words of weight $w$, the probability
 that each word is a codeword.
 The number of words of weight $w$ is ${{N}\choose{w}}$,
 so
\beq
	\langle A(w) \rangle = {{N}\choose{w}} 2^{-M} \:\:\mbox{for any $w>0$}.
\eeq
 For large $N$, we can use $\log_2 {{N}\choose{w}}
  \simeq N H_2(w/N)$ and $R\simeq 1-M/N$ to write
\beqan
	\log_2 \langle A(w) \rangle &\simeq& N H_2(w/N) -M
\\
		&\simeq& N [ H_2(w/N) - (1-R) ]  \:\:\mbox{for any $w>0$}.
\label{eq.wef.random}
\eeqan
 As a concrete example, \figref{fig.Aw.540} shows the
 expected weight enumerator function of a rate-$1/3$
 random linear code\index{error-correcting code!random linear} with $N=540$ and $M=360$.
\marginfig{
\begin{center}
\mbox{%
\small
\hspace{-0.01in}%
\begin{tabular}{c}
\hspace{-0.15in}\mbox{\psfig{figure=/home/mackay/_doc/code/gallager/Am540R.ps,%
width=41.5mm,angle=-90}}\\[-0.01in]
\hspace{0.1in}\mbox{\hspace*{-0.35in}\psfig{figure=/home/mackay/_doc/code/gallager/Am540Rl.ps,%
width=41.5mm,angle=-90}}\\[-0.1in]
\end{tabular}
}
\end{center}
%}{%
\caption[a]{The
  expected weight enumerator function
  $\langle A(w) \rangle$ of a
 \index{error-correcting code!random linear}random linear code with $N=540$ and $M=360$. Lower figure shows
 $\langle A(w) \rangle$ on a logarithmic scale.
}
\label{fig.Aw.540}
% load 'gnuR'
}
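
 For a small-scale feel for the exact formula (a worked example using the
 dimensions $N=24$, $M=12$ of the matrix in \figref{fig.randommatrix};
 decimal values are rounded):
\beqan
	\langle A(1) \rangle &=& 24 / 4096 \:\simeq\: 0.006 ,
\\
	\langle A(2) \rangle &=& 276 / 4096 \:\simeq\: 0.07 ,
\\
	\langle A(3) \rangle &=& 2024 / 4096 \:\simeq\: 0.49 ,
\\
	\langle A(4) \rangle &=& 10626 / 4096 \:\simeq\: 2.6 .
\eeqan
 The expectation first exceeds 1 at $w=4$, which suggests a minimum
 distance of roughly 3 or 4 for a typical member of this ensemble.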

\subsection{Gilbert--Varshamov distance}
 For weights $w$ such that $H_2(w/N) < (1-R)$, the expectation
 of $A(w)$ is smaller than 1; for  weights such that $H_2(w/N) > (1-R)$,
 the expectation is greater than 1. We thus expect, for large $N$,
 that the minimum distance
 of a random linear  code  will be close to the distance $d_{\rm GV}$
 defined by
\beq
	H_2(d_{\rm GV}/N) = (1-R) .
\label{eq.GV.def}
\eeq
% INDENT ME?
\noindent
{\sf Definition.}
 This distance, $d_{\rm GV} \equiv N H_2^{-1}(1-R)$,
 is
% known as
 the
 {\dem{Gilbert--Varshamov\index{distance!Gilbert--Varshamov}\index{Gilbert--Varshamov distance}
 distance}\/}
 for  rate $R$ and blocklength $N$.
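
 As a numerical illustration (values obtained here by solving
 (\ref{eq.GV.def}) numerically, quoted to three decimal places):
\beqan
	R = 1/2 : & & H_2^{-1}(1/2) \simeq 0.110 , \:\:\: d_{\rm GV} \simeq 0.110 \, N ;
\\
	R = 1/3 : & & H_2^{-1}(2/3) \simeq 0.174 , \:\:\: d_{\rm GV} \simeq 0.174 \, N .
\eeqan
 For the rate-$1/3$ code with $N=540$ of \figref{fig.Aw.540}, this gives
 $d_{\rm GV} \simeq 94$.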

 The {\dem{Gilbert--Varshamov conjecture}},
 widely believed, asserts that (for large $N$) it is not possible to\index{Gilbert--Varshamov conjecture}
 create binary codes with minimum distance significantly greater than $d_{\rm GV}$.
\medskip

% INDENT ME?
\noindent
{\sf Definition.}
 The {\dem{\index{Gilbert--Varshamov rate}Gilbert--Varshamov rate}\/} $R_{\rm GV}$
 is the maximum rate at which you can reliably
 communicate with a \ind{bounded-distance decoder} (as defined
 on \pref{sec.bdd}),
 assuming that the
 Gilbert--Varshamov conjecture\index{Gilbert--Varshamov conjecture}
 is true.
 
 
% \section{Perfect codes} A \index{error-correcting code!perfect}\see{perfect code}{code}

\subsection{Why sphere-packing is a bad perspective, and an obsession with
 distance is inappropriate}
 If one uses a \ind{bounded-distance decoder},\index{sermon!sphere-packing}
 the maximum tolerable noise level will flip  a fraction $f_{\rm bd} = \half d_{\min}/N$
 of the bits. So, assuming $d_{\min}$ is equal to the \index{Gilbert--Varshamov distance}Gilbert--Varshamov distance
 $d_{\rm GV}$ (\ref{eq.GV.def}), we have:%
\amarginfig{b}{
\begin{center}
\mbox{\psfig{figure=figs/RGV.ps,angle=-90,width=1.7in}}\\[-0.1in]
$f$
\end{center}
\caption[a]{Contrast between Shannon's channel capacity $C$
 and the  Gilbert--Varshamov rate $R_{\rm GV}$ --
 the maximum communication rate
 achievable using a \ind{bounded-distance decoder}, as a function
 of noise level $f$.
 For any given rate, $R$, the maximum tolerable
 noise level for Shannon is twice as big as the
 maximum tolerable noise level for a `worst-case-ist'
 who uses a bounded-distance decoder.
}
}
\beq
	H_2(2 f_{\rm bd}) = (1-R_{\rm GV}) .
\label{eq.idiotf}
\eeq
\beq
 	R_{\rm GV} = 1 - H_2(2 f_{\rm bd}).
\eeq
 Now, here's the crunch: what did Shannon say is achievable?\index{Shannon, Claude}
 He said the maximum possible rate of communication is the capacity, 
\beq
	C = 1 - H_2(f) .
\eeq
 So for  a given rate $R$,
 the maximum tolerable noise level, according to Shannon,
 is given by
\beq
	H_2(f) = (1-R) .
\label{eq.shannonf}
\eeq
 Our conclusion: imagine a good code of rate $R$ has been chosen;
 equations (\ref{eq.idiotf}) and (\ref{eq.shannonf})
 respectively define
 the maximum noise levels tolerable 
 by a bounded-distance
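 decoder, $f_{\rm bd}$,  and by Shannon's decoder, $f$.
 With $R_{\rm GV}$ set equal to $R$, both equations have the same
 right-hand side, so $H_2(2 f_{\rm bd}) = H_2(f)$;
 since $H_2$ is increasing on $[0,1/2]$, the arguments must be equal, giving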
\beq
	f_{\rm bd} = f/2 .
\eeq
 Bounded-distance decoders can only ever cope with
 {\em half\/} the noise-level that Shannon proved is tolerable!
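
 For example (an illustrative calculation using $H_2^{-1}(1/2) \simeq 0.110$,
 a value computed numerically here), at rate $R=1/2$
\beq
	f = H_2^{-1}(1/2) \simeq 0.110 , \:\:\: \mbox{whereas} \:\:\:
	f_{\rm bd} = \half H_2^{-1}(1/2) \simeq 0.055 .
\eeq
 Shannon's decoder can tolerate about 11\% of the bits being flipped;
 the bounded-distance decoder can tolerate only about 5.5\%.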
% Need to show implication for perfect codes at the same time.

 How does this relate to perfect\index{error-correcting code!perfect}
 codes?  A code is perfect
 if there are $t$-spheres around its codewords that
 fill Hamming space without overlapping.
 But when a typical random linear code is used to
 communicate over a binary symmetric channel near to the
 Shannon limit, the typical number of bits flipped
 is $\fN$, and the minimum distance between codewords is
 also $\fN$, or a little bigger, if we are
 a little below the Shannon limit.
 So the $\fN$-spheres around the codewords overlap
 with each other sufficiently that each sphere almost contains
 the centre of its nearest neighbour!
\marginfig{\begin{center}
\mbox{\psfig{figure=figs/overlap.eps,width=1.7in}}\\[-0.02in]
\end{center}
\caption[a]{Two overlapping spheres whose radius
 is almost as big as the distance between their centres.
}
\label{fig.overlap}
}
 This overlap is not disastrous because,
 in high dimensions, the volume associated with the overlap,
 shown shaded in \figref{fig.overlap}, is a tiny fraction of
 either sphere, so the probability of landing in it is
 extremely small.

 The moral of the story is that \ind{worst-case-ism} can be  bad for you,
 halving your ability to tolerate noise.
 You have to be able to decode {\em way\/} beyond the minimum distance of a code
 to get to the Shannon limit!

 Nevertheless, the minimum
 distance of a code is of interest in practice, because, under some
 conditions, the minimum distance dominates the errors made by
 a code.
% On to the bat cave. (Could also dissect the random code
% in more detail.)


\section{Berlekamp's bats}
\label{sec.bats}
 A blind \ind{bat}\index{Berlekamp, Elwyn} lives in a cave.
 It  flies about the centre of the cave, which corresponds to
 one codeword,
 with its typical distance from the centre controlled by
 a friskiness parameter $f$. (The displacement of the
 bat from the centre corresponds to the noise vector.)
 The boundaries of the cave are made up of stalactites that
 point in towards the centre of the cave (\figref{fig.cavereal}).  Each stalactite
 is analogous to the boundary between the home codeword
 and another codeword.  The stalactite is
 like  the shaded region in \figref{fig.overlap},
 but reshaped to convey the idea that it is a region of very small volume.

 Decoding errors correspond to  the bat's intended trajectory passing
 inside a stalactite. Collisions with  stalactites at various distances
 from the centre are possible.

 If the friskiness
% (noise level)
 is very small, the bat is usually very  close to the
 centre of the cave;
 collisions will be rare,
 and when they do occur, they will usually involve the
 stalactites whose tips are closest to the centre point. Similarly,
 under  low-noise conditions,  decoding errors will be rare,
 and they will typically involve low-weight codewords. Under low-noise
 conditions, the minimum distance of a code is relevant to
 the (very small) probability of error.
\begin{figure}[hbtp]
\figuremargin{
\mbox{\psfig{figure=figs/cavereal.ps,angle=-90,width=3in}}
}{
\caption[a]{Berlekamp's schematic picture of Hamming space in
 the vicinity of a codeword. The jagged solid line encloses all points to which
 this codeword is the closest.
 The $t$-sphere around the
 codeword takes up a small fraction of this space.
}
\label{fig.cavereal}
}
\end{figure}

 If the friskiness is higher, the bat may often make excursions
 beyond the safe distance $t$ where the longest stalactites start,
 but
% it is quite possible that
 it will collide most frequently
 with more distant stalactites, owing to their greater number.
 There's only a tiny number of \ind{stalactite}s at the minimum
 distance, so they are relatively unlikely to cause the errors.
 Similarly, errors in a real
 error-correcting code
 depend on the properties of the \ind{weight enumerator} function.

 At very high friskiness, the \ind{bat} is always a long way from the centre of
 the \ind{cave}, and almost all  its collisions involve contact with distant stalactites.
% bat in a cave.
 Under these conditions,
 the bat's collision frequency has nothing to do with
 the distance from the centre to the closest stalactite.

%\section{Concatenation}
% see also _concat.tex
% this is the bit where we do the ``hamming are good'' story
\section{Concatenation of Hamming codes\nonexaminable}
\label{sec.concatenation}
 It is instructive to play some more with the \ind{concatenation} of
 \ind{Hamming code}s,\index{error-correcting code!Hamming}
 a concept we first visited in  \figref{fig.concath1},
 because we will get insights into the notion of good codes
 and the relevance or otherwise of the \ind{minimum distance} of a code.\index{distance!of code}

 We  can create a concatenated code
for  a binary symmetric channel with noise density $f$
by encoding with
 several Hamming codes in succession.

% /home/mackay/bin/concath.p~
% /home/mackay/_courses/itprnn/hamming/concath
% /home/mackay/_courses/itprnn/hamming/concath.gnu

 The table recaps the key properties of
 the Hamming codes, indexed by number of constraints, $M$.
 All the Hamming codes have minimum distance $d=3$
 and can correct one error in $N$.
\medskip% because of modified center
\begin{center}
\begin{tabular}{ll}\toprule
 $N = 2^M-1$ 
& blocklength
\\
% $K$ &
 $K = N - M$ & number of source bits \\
 $p_{\rm b} = \smallfrac{3}{N} {{N} \choose {2}} f^2$
 & probability of bit error to leading order \\  \bottomrule
% $R$ & $K/N$ \\
\end{tabular}
\medskip
\end{center}

\marginfig{
\begin{center}
%\mbox{%
\footnotesize
\raisebox{0.3591in}{$R$}%
\hspace{0.2in}%
\begin{tabular}{c}
\mbox{\psfig{figure=hamming/concath.rate.ps,%
width=40.5mm,angle=-90}}\\[0.1in]
\hspace{0.3in}$C$
\end{tabular}
%}
\end{center}
%}{%
\caption[a]{The rate $R$ of the concatenated Hamming code
 as a function of the number of concatenations, $C$.
}
\label{fig.concath.rate}
}
%

% \subsection{Proving that good codes can be made by concatenation}
 If we make a \ind{product code} by\index{error-correcting code!good}\index{error-correcting code!product code}
 concatenating a sequence of $C$ Hamming codes with increasing $M$,
 we can choose those parameters $\{ M_c \}_{c=1}^{C}$
 in such a way that the rate of the product
 code
% $R_C$ 
\beq
 R_C =  \prod_{c=1}^C \frac{N_c - M_c}{N_c} 
\eeq
 tends to a non-zero limit as $C$ increases.
 For example, if we  set $M_1 =2$, $M_2=3$, $M_3=4$, etc.,
 then the asymptotic rate is 0.093 (\figref{fig.concath.rate}).
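
 To see how quickly the rate settles down, here are the first few partial
 products for this choice of $\{ M_c \}$, worked out directly from
 $N_c = 2^{M_c}-1$ (values rounded to three decimal places):
\beqan
 R_1 &=& \smallfrac{1}{3} \simeq 0.333 , \\
 R_2 &=& R_1 \times \smallfrac{4}{7} \simeq 0.190 , \\
 R_3 &=& R_2 \times \smallfrac{11}{15} \simeq 0.140 , \\
 R_4 &=& R_3 \times \smallfrac{26}{31} \simeq 0.117 , \\
 R_5 &=& R_4 \times \smallfrac{57}{63} \simeq 0.106 , \\
 R_6 &=& R_5 \times \smallfrac{120}{127} \simeq 0.100 .
\eeqan
 Each further factor $1 - M_c/N_c$ is closer to 1 than the one before,
 and the product approaches 0.093.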



 The blocklength $N$ is a rapidly-growing function of $C$, so these codes
 are somewhat impractical.
 A further weakness of these codes is\index{distance!of concatenated code}
 that\index{error-correcting code!concatenated}
 their\index{error-correcting code!product code}
 minimum distance is not very good (\figref{fig.concath.n.d}).%
\amarginfig{b}{
\begin{center}
\small\footnotesize
%
%\hspace{0.042in}%
%\begin{tabular}{c}
%\mbox{\psfig{figure=hamming/concath.n.k.l.ps,%
%width=40.5mm,angle=-90}}
%% \\[-0.1in] $C$
%\end{tabular}\\[0.13in]
\hspace*{0.2042in}%
\begin{tabular}{c}
\mbox{\psfig{figure=hamming/concath.n.d.ps,%
width=40.5mm,angle=-90}}\\[0.1in]
\hspace{0.3in}$C$\\[-0.05in]
\end{tabular}
\end{center}
%}{%
\caption[a]{The blocklength  $N_C$ (upper curve)
 and 
% $(N,K)$ (upper figure) and
 minimum distance $d_C$ (lower curve)
% (lower figure)
 of the concatenated Hamming code
 as a function of the number of concatenations $C$.
}
\label{fig.concath.n.k.l}
\label{fig.concath.n.d}
}
%
% why is this fig not taking up its correct space?
%
% The blocklength $N$ is a rapidly growing function of $C$, so these codes
% are mainly of theoretical interest.
%
 Every one of the constituent
 Hamming codes has
 \ind{minimum distance}\index{distance!of code} 3, so the minimum
 distance of the $C$th product is $3^C$. The blocklength $N$ grows faster
% with $C$
 than  $3^C$, so the ratio $d/N$ tends to zero as $C$ increases. In contrast,
 for typical  random codes, the ratio $d/N$ tends to a constant\index{random code}
% distance tends to a fraction of $N$,
 such that $H_2(d/N) = 1-R$.\index{Hamming code}
 Concatenated Hamming codes\index{distance!bad}\index{distance!of product code}
 thus have `bad' distance.% \pref{distance.defs}
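
 For instance, taking $M_c = c+1$ as in the example above, so that
 $N_C = \prod_{c=1}^{C} (2^{c+1}-1)$ and $d_C = 3^C$, the first few
 values are
\beq
\begin{array}{cccc}
 C & N_C & d_C & d_C/N_C \\
 1 & 3 & 3 & 1 \\
 2 & 21 & 9 & 0.43 \\
 3 & 315 & 27 & 0.086 \\
 4 & 9765 & 81 & 0.0083
\end{array}
\eeq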


 Nevertheless, it turns out that this simple sequence of codes
 yields good codes\index{error-correcting code!good} for some channels -- but
 not very good codes
 (see \sectionref{sec.good.codes} to recall the definitions of the terms
 `good' and `very good').
 Rather than prove this result, we will simply explore it numerically.


 \Figref{fig.concath.rateeb} shows the bit error probability $p_{\rm b}$
 of the concatenated
 codes assuming that the constituent codes are decoded in sequence,
 as described in section \ref{sec.concatdecode}. [This one-code-at-a-time
 decoding is suboptimal, as we saw there.]
% refers to {tex/_concat.tex}% contains simple example
%
% concath.p
 The horizontal axis shows the rates of  the codes.
 As the number of concatenations increases, the rate drops
 to 0.093 and the error probability drops towards zero.
 The channel assumed in the figure is the  binary
 symmetric channel with $\q=0.0588$. This is the highest noise level that
 can be tolerated using this concatenated code.
\amarginfig{c}{
\begin{center}
\footnotesize
\mbox{%
\raisebox{0.591in}{$p_{\rm b}$}%
\hspace{0.2042in}%
\begin{tabular}{c}
\mbox{\psfig{figure=hamming/concath.rate.058.ps,%
width=40mm,angle=-90}}\\[0.1in]
\hspace{0.54in}$R$\\[-0.03in]
\end{tabular}}
\end{center}
%}{%
\caption[a]{The bit error probabilities versus the rates
 $R$ 
 of the concatenated Hamming codes, for the binary
 symmetric channel with $\q=0.0588$. Labels alongside the points show the
 blocklengths, $N$. The solid line shows the Shannon
 limit for this channel. 

 The bit error probability drops to zero while the rate tends to
 0.093, so the concatenated Hamming codes are a `good' code family.

}
\label{fig.concath.rateeb}
}
%%%%%%%%%%%%%%%%%%%% there is a major margin object problem here,
% don't understand it!



 

 The take-home message from this story is
 {\em{distance isn't everything}}.\index{distance!isn't everything}
% Indeed, t
 The minimum distance of a code, although widely worshipped by coding
 theorists, is not of fundamental importance\index{coding theory}
 to  Shannon's
 mission of achieving reliable \ind{communication} over noisy channels.\index{Shannon, Claude}\index{coding theory}

\exercisxB{3}{ex.distancenotE}{
 Prove that there exist families of codes with `bad' distance
 that are `very good' codes.
}
% soln in _linear.tex


\section{Distance isn't everything}
 Let's
% look at this assertion some more in order to
 get a
 quantitative feeling for the effect of the minimum distance
 of a code, for the special case of a \ind{binary symmetric channel}.\index{channel!binary symmetric}

%\exampl{ex.bhat}{
\subsection{The error probability associated with one low-weight codeword}
\label{sec.err.prob.one}
% begin INTRO
	Let a binary code have blocklength $N$ and
 just two codewords, which differ in $d$ places. For simplicity, let's
 assume $d$ is even.
 What is the error probability if this code is used on a binary
symmetric channel with noise level $f$?

 Bit flips matter only in places where the two codewords differ.
% Only flips of bits in the places that differ matter.
 The error probability is dominated by the probability that $d/2$
 of these bits are flipped.
 What happens
 to the other bits is irrelevant, since the optimal decoder ignores them.
\beqan
	P(\mbox{block error}) & \simeq & {{d}\choose{d/2}} f^{d/2} (1-f)^{d/2} .
% \geq here if you want
\eeqan
 This error probability associated with a single codeword of weight $d$
 is plotted in \figref{fig.dist}.%
\amarginfig{c}{%
\footnotesize
\begin{tabular}{c}
\hspace*{0.2in}\psfig{figure=gnu/errorVdist.ps,width=1.8in,angle=-90}\\[0.1in]
\end{tabular}
% see /home/mackay/itp/gnu/dist.gnu
\caption[a]{ The error probability associated with a single codeword of weight $d$, 
${{d}\choose{d/2}} f^{d/2} (1-f)^{d/2}$, as
 a function of $f$.}
\label{fig.dist}
}
 Using the  approximation for the binomial coefficient (\ref{eq.stirling.choose}), we
 can further approximate
\beqan
	P(\mbox{block error})
% \leq here if you want
& \simeq & \left[ 2 f^{1/2} (1-f )^{1/2} \right]^{d} \\
& \equiv & [\beta(f)]^{d} ,
\label{eq.bhatta}
\eeqan
 where $\beta(f) =  2 f^{1/2} (1-f )^{1/2}$
 is called the \ind{Bhattacharyya parameter} of the channel.\nocite{Bhattacharyya}
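
 The step from the binomial expression to $[\beta(f)]^d$ can be spelled out
 as follows. This sketch assumes that the approximation being used is
 the leading-order one, ${{N}\choose{r}} \simeq 2^{N H_2(r/N)}$, which
 for $N=d$ and $r=d/2$ gives ${{d}\choose{d/2}} \simeq 2^{d}$:
\beq
 {{d}\choose{d/2}} f^{d/2} (1-f)^{d/2}
 \simeq 2^d \left[ f (1-f) \right]^{d/2}
 = \left[ 2 f^{1/2} (1-f)^{1/2} \right]^{d} .
\eeq
 The approximation discards a factor that is polynomial in $d$.
 For example, with $f=0.1$ and $d=10$ we have $\beta = 2\sqrt{0.09} = 0.6$, so
 $[\beta(f)]^{d} \simeq 6.0 \times 10^{-3}$, whereas
 ${{10}\choose{5}} (0.1)^5 (0.9)^5 \simeq 1.5 \times 10^{-3}$.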
%\marginpar{\footnotesize{You don't need
% to memorize this name; indeed, I need to check this is the correct name, as it is not in the
%index of any coding theory books on my shelf! Must check in McEliece.}}
%
% Bhattacharyya, A.On a measure of divergence between two statistical 
% populations defined by their probability distributions. Bull. 
% Calcutta Math. Soc. 35 (1943), pp. 99-110.
% 
% A recent book that calls your $\beta$ the Bhattacharyya parameter is 
% Johanesson and Zigangirov's book on convolutional codes. I think some 
% of Viterbi's books also use the term.
% 

% end INTRO
% \subsection{Recap of `very bad' distance}
 Now, consider  a general linear code with distance $d$.
 Its block error probability 
 must be at least ${{d}\choose{d/2}} f^{d/2} (1-f)^{d/2}$,
 independent of the blocklength $N$ of the code.
 For this reason,  a sequence of codes of increasing blocklength
 $N$ and  constant distance $d$ (\ie,  `very bad' distance)\label{sec.verybadisbad}
 cannot have  a block error probability
 that tends to zero, on any binary symmetric channel.
 If we are interested in making superb error-correcting
 codes with tiny, tiny error probability,
 we might therefore shun codes with bad distance.
 However, being pragmatic, we should look more carefully
 at  \figref{fig.dist}.
 In \chref{ch1} we argued that codes for disk drives
 need an error probability smaller than about $10^{-18}$.
 If the raw error probability in the \ind{disk drive} is
 about $0.001$, the error probability associated
 with one codeword at distance $d=20$ is smaller than
 $10^{-24}$.
 If the raw error probability in the disk drive is
 about $0.01$, the error probability associated
 with one codeword at distance $d=30$ is smaller than
 $10^{-20}$.
 For practical purposes, therefore, it is not essential for
 a code to have good distance. For example, 
 codes of blocklength $10\,000$, known to
 have many codewords of weight 32, can nevertheless 
 correct errors of weight 320 with tiny error probability.
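
 The two disk-drive figures quoted above can be checked directly from the
 single-codeword expression ${{d}\choose{d/2}} f^{d/2} (1-f)^{d/2}$
 (binomial coefficients rounded to two significant figures):
\beqan
 f=0.001,\: d=20: & &
 {{20}\choose{10}} (0.001)^{10} (0.999)^{10}
 \simeq 1.8 \times 10^{5} \times 10^{-30} \simeq 2 \times 10^{-25} , \\
 f=0.01,\: d=30: & &
 {{30}\choose{15}} (0.01)^{15} (0.99)^{15}
 \simeq 1.6 \times 10^{8} \times 10^{-30} \times 0.86 \simeq 1.3 \times 10^{-22} .
\eeqan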

 I wouldn't want you to think I am {\em recommending\/}
 the use of codes with bad distance; in \chref{ch.ldpcc}
 we will discuss low-density parity-check codes,
  my favourite codes, which have both excellent performance
 and {\em good\/} distance.
% These are my favourite codes.
% It's as a matter of honesty that I am pointing out 
% that having good distance scarcely matters.
% So regardless of the blocklength used,

\section{The union bound}
 The error probability of a code on the binary symmetric
 channel can be bounded in terms
 of its \ind{weight enumerator} function by adding up
 appropriate multiples of
 the error probability associated with a single codeword (\ref{eq.bhatta}):
\beq
	P(\mbox{block error}) \leq \sum_{w>0} A(w) [\beta(f)]^w .
\label{eq.unionB}
\eeq
% could include Bob's  poor man's coding theorem here.
 This inequality, which is an example of a {\dem\ind{union bound}},
 is accurate for low noise levels $f$,
 but inaccurate for high noise levels, because it overcounts
 the contribution of errors that cause confusion with more than
 one codeword at a time.
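
 As a small illustration, take the $(7,4)$ Hamming code, whose weight
 enumerator is $A(3)=7$, $A(4)=7$, $A(7)=1$. The bound (\ref{eq.unionB})
 becomes
\beq
 P(\mbox{block error}) \leq 7 \beta^3 + 7 \beta^4 + \beta^7 .
\eeq
 At $f=0.01$, $\beta \simeq 0.199$ and the three terms are roughly
 $0.055$, $0.011$ and $1 \times 10^{-5}$. Because $\beta<1$, the
 lowest-weight term dominates when the noise level is small.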
%MNBV\newpage

\exercisxB{3}{ex.poormancoding}{
 {\sf Poor man's noisy-channel coding theorem}.\index{noisy-channel coding theorem!poor
 man's version}\index{poor man's coding theorem}

	Pretending
 that the  union bound
 (\ref{eq.unionB}) {\em is\/}
 accurate, and using the
 average {\ind{weight enumerator} function of a random linear code} (\ref{eq.wef.random}) (\secref{sec.wef.random})
 as $A(w)$, estimate the maximum rate $R_{\rm UB}(f)$ at which
 one can communicate over a binary symmetric channel.

 Or, to look at it more positively, using the   union bound
 (\ref{eq.unionB}) as an inequality, show that communication
 at rates up to $R_{\rm UB}(f)$ is possible over the binary symmetric channel.
% In proving this result, you are proving a `poor man's version' of
% {Shannon}'s  noisy-channel coding theorem.
}

 In the following chapter, by analysing the probability of error
 of {\em \ind{syndrome decoding}\/} for a binary linear code,
 and using a union bound, we will  prove
 Shannon's noisy-channel coding theorem (for
 symmetric binary channels), and thus show that {\em very good linear codes exist}.


% possible point for exercise from exact.tex to be included.

\section{Dual codes\nonexaminable}
 A concept that has some importance in coding theory,\index{error-correcting code!dual}
 though we will have no immediate use for it in this book,
 is the idea of the {\dem\ind{dual}} of a linear error-correcting code.


 An $(N,K)$
 linear error-correcting code can be thought of as a set of $2^{K}$
 codewords
 generated by adding together all combinations of $K$ independent basis
 codewords. The generator matrix of the code consists of
 those $K$ basis codewords, conventionally written as row vectors.
 For example, the $(7,4)$ Hamming code's generator matrix (from \pref{eq.Generator})
% \eqref{eq.Generator}, 
 is 
\beq
       \bG = \left[ \begin{array}{ccccccc} 
 \tt 1& \tt 0& \tt 0& \tt 0& \tt 1& \tt 0& \tt 1 \\
 \tt 0& \tt 1& \tt 0& \tt 0& \tt 1& \tt 1& \tt 0 \\
 \tt 0& \tt 0& \tt 1& \tt 0& \tt 1& \tt 1& \tt 1 \\
 \tt 0& \tt 0& \tt 0& \tt 1& \tt 0& \tt 1& \tt 1 \\
  \end{array} \right] 
\label{eq.Generator2}
\eeq
 and its sixteen codewords were displayed in
 \tabref{tab.74h} (\pref{tab.74h}).
 The codewords of this code are linear combinations of
 the four vectors $\left[
 \tt 1 \: \tt 0 \: \tt 0 \: \tt 0 \: \tt 1 \: \tt 0 \: \tt 1  \right]$,
 $\left[ 					 
 \tt 0 \: \tt 1 \: \tt 0 \: \tt 0 \: \tt 1 \: \tt 1 \: \tt 0  \right]$,
 $\left[ 					 
 \tt 0 \: \tt 0 \: \tt 1 \: \tt 0 \: \tt 1 \: \tt 1 \: \tt 1  \right]$,
 and						 
 $\left[ 					 
 \tt 0 \: \tt 0 \: \tt 0 \: \tt 1 \: \tt 0 \: \tt 1 \: \tt 1  \right]$.


 An $(N,K)$ code may also be described in terms
 of an $M \times N$ parity-check matrix (where $M=N-K$)
 as the set of vectors $\{ \bt \}$ that satisfy
\beq
	\bH \bt = {\bf 0}  .
\eeq
 One way of thinking of this equation is that each row
 of $\bH$ specifies a vector to which $\bt$ must be orthogonal
 if it is a codeword.
\medskip

\noindent
\begin{conclusionbox}
 The generator matrix specifies $K$ vectors {\em from 
 which\/} all codewords can be built, and
 the parity-check matrix specifies a set of $M$ vectors
 {\em to which\/}
 all codewords are orthogonal. \smallskip

 The dual of a code is obtained by exchanging the generator
 matrix and the parity-check matrix.
\end{conclusionbox}
\medskip

\noindent
 {\sf Definition.}
 The set of {\em all\/} vectors of length $N$ that are orthogonal to all
 codewords in a code, $\C$, is called the dual of the code, $\C^{\perp}$.
\medskip

 If $\bt$ is orthogonal to $\bh_1$
 and $\bh_2$, then it is also orthogonal to $\bh_3 \equiv \bh_1 + \bh_2$;
 so all codewords are orthogonal to
 any linear combination of the $M$ rows of $\bH$. 
 So 
 the set of all linear combinations of the rows of the parity-check matrix
 is the dual code.
% called the dual of the code.
% The dual is itself a linear
% error-correcting code, whose generator matrix is $\bH$.
%% And similarly, t
% The parity-check matrix of the dual is $\bG$,
% the generator matrix of the first code.

 For our Hamming $(7,4)$ code, the parity-check matrix is
 (from \pref{eq.pcmatrix}):
\beq
        \bH =   \left[ \begin{array}{cc} \bP & \bI_3 \end{array}
 \right] = \left[ 
 \begin{array}{ccccccc} 
\tt  1&\tt 1&\tt 1&\tt 0&\tt 1&\tt 0&\tt 0 \\
\tt  0&\tt 1&\tt 1&\tt 1&\tt 0&\tt 1&\tt 0 \\
\tt  1&\tt 0&\tt 1&\tt 1&\tt 0&\tt 0&\tt 1
 \end{array} \right] .
\label{eq.pcmatrix2}
\eeq
% and the three vectors to which the codewords are
% orthogonal are
%$\left[ 
%\tt  1\: \tt 1\: \tt 1\: \tt 0\: \tt 1\: \tt 0\: \tt 0 
% \right]$,
%$\left[
%\tt  0\: \tt 1\: \tt 1\: \tt 1\: \tt 0\: \tt 1\: \tt 0 
% \right]$,
% and
%$\left[
%\tt  1\: \tt 0\: \tt 1\: \tt 1\: \tt 0\: \tt 0\: \tt 1
% \right]$.

% The codewords are not  orthogonal to these $M$
% vectors only, however. I

 The dual of the $(7,4)$ Hamming code  $\H_{(7,4)}$
 is the code shown in
 \tabref{tab.74h.dual}.



\begin{table}[htbp]
\figuremargin{%
\begin{center}
\mbox{\small
\begin{tabular}{c} \toprule
% Transmitted sequence
%               $\bt$ \\ \midrule
\tt 0000000 \\% yes
\tt 0010111 \\% yes
 \bottomrule
\end{tabular} \hspace{0.02in}
\begin{tabular}{c} \toprule
%   $\bt$ \\ \midrule
\tt 0101101 \\% yes
\tt 0111010 \\ \bottomrule % yes
\end{tabular} \hspace{0.02in}
\begin{tabular}{c} \toprule 
%   $\bt$ \\ \midrule
\tt 1001110 \\% yes
\tt 1011001 \\ \bottomrule % yes
\end{tabular} \hspace{0.02in}
\begin{tabular}{c} \toprule
%   $\bt$ \\ \midrule
\tt 1100011 \\% yes
\tt 1110100 \\ % yes
 \bottomrule
\end{tabular}
}%%%%%%%%% end of row of four tables
\end{center} 
}{%
\caption[a]{The eight codewords
%  $\{ \bt \}$
 of the dual of the $(7,4)$ Hamming  code.
 [Compare with \protect\tabref{tab.74h},
 \protect\pref{tab.74h}.]
}
\label{tab.74h.dual}
}
\end{table}
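
 The eight words in \tabref{tab.74h.dual} can be generated by hand from the
 rows of $\bH$ in (\ref{eq.pcmatrix2}). Writing $\bh_1$, $\bh_2$, $\bh_3$
 for the three rows (labels introduced here for convenience) and adding
 modulo 2,
\beq
\begin{array}{lcl}
 \bh_1 + \bh_2 & = & {\tt 1001110} \\
 \bh_1 + \bh_3 & = & {\tt 0101101} \\
 \bh_2 + \bh_3 & = & {\tt 1100011} \\
 \bh_1 + \bh_2 + \bh_3 & = & {\tt 0010111}
\end{array}
\eeq
 which, together with {\tt 0000000} and the three rows themselves,
 make up the $2^3 = 8$ codewords of the dual.
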

% STRANGE MISREF????????? CHECK
 A possibly unexpected property 
 of this pair of  codes is that the dual, $\H_{(7,4)}^{\perp}$,
 is contained within the code $\H_{(7,4)}$ itself:
 every word in the dual code is a codeword of the
 original $(7,4)$ Hamming code.
 This relationship can be written using set notation:
\beq
 \H_{(7,4)}^{\perp} \subset \H_{(7,4)}
 . 
\eeq
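 The containment can be confirmed using the generator matrix
 (\ref{eq.Generator2}). Labelling its rows ${\bf g}_1, \ldots, {\bf g}_4$
 (a notation used only for this check), each row of $\bH$ is a sum of
 rows of $\bG$:
\beq
\begin{array}{lcl}
 {\tt 1110100} & = & {\bf g}_1 + {\bf g}_2 + {\bf g}_3 \\
 {\tt 0111010} & = & {\bf g}_2 + {\bf g}_3 + {\bf g}_4 \\
 {\tt 1011001} & = & {\bf g}_1 + {\bf g}_3 + {\bf g}_4 .
\end{array}
\eeq
 Every word of the dual is a sum of rows of $\bH$, hence a sum of rows
 of $\bG$, hence a codeword of the Hamming code.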

 The possibility that the set of dual vectors
 can overlap the set of codeword vectors is counterintuitive
 if we think of the vectors as real vectors -- how
 can a vector be orthogonal to itself?
 But when we work in modulo-two arithmetic, many non-zero vectors
 are indeed orthogonal
% perpendicular
 to themselves!

\exercissxB{1}{ex.perp}{
	Give a simple rule  that distinguishes
whether a binary vector is   orthogonal to itself, as is each of the
 three vectors 
$\left[ 
\tt  1\: \tt 1\: \tt 1\: \tt 0\: \tt 1\: \tt 0\: \tt 0 
 \right]$,
$\left[
\tt  0\: \tt 1\: \tt 1\: \tt 1\: \tt 0\: \tt 1\: \tt 0 
 \right]$,
 and
$\left[
\tt  1\: \tt 0\: \tt 1\: \tt 1\: \tt 0\: \tt 0\: \tt 1
 \right]$.
}

\subsection{Some more duals}
 In general, if a code has a systematic generator matrix,
\beq
	\bG = \left[ \bI_K | \bP^{\T} \right] ,
\eeq
 where $\bP$ is an $M \times K$ matrix, 
 then its parity-check matrix is
\beq
	\bH = \left[ \bP | \bI_M \right] .
\eeq
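 This pairing can be verified by block multiplication, using the fact
 that in modulo-two arithmetic any matrix added to itself gives zero:
\beq
 \bG \bH^{\T} = \left[ \bI_K | \bP^{\T} \right]
 \left[ \begin{array}{c} \bP^{\T} \\ \bI_M \end{array} \right]
 = \bP^{\T} + \bP^{\T} = {\bf 0} ,
\eeq
 so every row of $\bG$, and hence every codeword, satisfies all $M$
 parity checks.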

\exampl{example.rthreedual}{
 The repetition code $\Rthree$ has generator matrix
\beq
	\bG =\left[
\begin{array}{ccc}
\tt 1 &\tt 1 &\tt 1 
\end{array}
\right];
% [{\tt 1\:1\:1} ] ;
\eeq
 its parity-check matrix is
\beq
	\bH = \left[
\begin{array}{ccc}
\tt 1 &\tt 1 &\tt 0 \\
\tt 1 &\tt 0 &\tt 1 
\end{array}
\right] .
\eeq
 The two codewords are [{\tt 1 1 1}] and [{\tt 0 0 0}].

 The dual code has generator matrix
\beq
	\bG^{\perp} = \bH = \left[
\begin{array}{ccc}
\tt 1 &\tt 1 &\tt 0 \\
\tt 1 &\tt 0 &\tt 1 
\end{array}
\right]
\eeq
 or equivalently, modifying $\bG^{\perp}$ into systematic form
 by row additions,
% manipulations, 
\beq
	\bG^{\perp} = \left[
\begin{array}{ccc}
\tt 1 &\tt 0 &\tt 1 \\
\tt 0 &\tt 1 &\tt 1 
\end{array}
\right] .
\eeq
 We  call this dual code the {\dem{simple parity code}} P$_3$;\index{error-correcting code!P$_3$}\index{error-correcting code!simple parity}\index{error-correcting code!dual}
 it is the code with one parity-check bit, which is equal to
 the sum of the two source bits.
 The dual code's four codewords are
$ \left[
\tt 1 \: \tt 1 \: \tt 0 
\right]
$,
$ \left[
\tt 1 \: \tt 0 \: \tt 1 
\right]
$,
$ \left[
\tt 0 \: \tt 0 \: \tt 0 
\right]
$,
and
$ \left[
\tt 0 \: \tt 1 \: \tt 1 
\right]
$.


 In this case, the only vector common to the code and the dual is
 the all-zero codeword.
}

\subsection{Goodness of duals}
 If a sequence of codes is `good', are their \index{error-correcting code!dual}duals
 {good} too?\index{error-correcting code!good}
 Examples can be constructed of all cases:
 good codes with good duals (random linear codes);
 bad codes with bad duals; and good codes with bad duals.
 The last category is especially important:
 many state-of-the-art codes  have the property that
 their duals are bad.
 The classic example is the low-density parity-check code,
 whose dual is a low-density generator-matrix code.\index{error-correcting code!low-density generator-matrix}
\exercisxB{3}{ex.ldgmbad}{
	Show that low-density generator-matrix codes
 are bad.
 A family of low-density  generator-matrix  codes
 is defined by two parameters $j,k$, which are the column
 weight and row weight of all columns and rows respectively
 of $\bG$. These weights are fixed, independent of $N$;
 for example, $(j,k)=(3,6)$.
 [Hint: show that the code has low-weight codewords, then
 use the argument from \pref{sec.verybadisbad}.]
}
\exercisxD{5}{ex.ldpcgood}{
	Show that low-density parity-check codes
 are good, and have good distance.\index{error-correcting code!low-density parity-check}
 (For solutions, see \citeasnoun{Gallager63} and
 \citeasnoun{mncN}.)
}
 

\subsection{Self-dual codes} 
 The $(7,4)$ Hamming code had the property that the dual
 was contained in the code itself.
% used to say - 
% A code is {\dem{\ind{self-orthogonal}}} if it contains its dual.
 A code is {\dem{\ind{self-orthogonal}}\/} if it is contained in its dual.
 For example,
 the dual of the  $(7,4)$ Hamming code is a self-orthogonal code.
 One way of seeing this is that the overlap between any pair
 of rows of $\bH$ is even.
%\marginpar{Is
% it an accepted abuse of terminology to also say
% a code is self-orthogonal if it contains its dual?}
 Codes that contain their duals are important in quantum error-correction
 \cite{ShorCSS}.

 It is intriguing, though not necessarily useful,  to
 look at codes that are {\dem\ind{self-dual}}.
 A  code $\C$ is self-dual if
 the dual
 of the code is identical to the code.
% Here, we are looking for codes that satisfy
\beq
 \C^{\perp} = \C  . 
\eeq

 Some properties of self-dual codes can be deduced:
% 
\ben
\item
 If a code is self-dual, then its generator matrix is also a parity-check
 matrix for the code.
\item
	Self-dual codes  have rate $1/2$, \ie, $M=K=N/2$.
\item
	All codewords  have even weight.
\een

\exercissxB{2}{ex.selfdual}{
 What property must the matrix $\bP$ satisfy, if the code
 with generator matrix
$\bG = \left[ \bI_K  | \bP^{\T} \right]$
 is self-dual?
}

\subsubsection{Examples of self-dual codes}
\ben
\item
 The repetition code R$_2$ is a simple example of
 a self-dual code.
\beq
	\bG = \bH = \left[
\begin{array}{cc}
\tt 1 &\tt 1 
\end{array}
\right] .
% [{\tt  1 \: 1 } ] 
\eeq
\item
 The smallest non-trivial self-dual code is the following
 $(8,4)$ code.
\beq
        \bG =   \left[ \begin{array}{c|c}  \bI_4 & \bP^{\T}  \end{array}
 \right] = \left[ 
 \begin{array}{cccc|cccc}
\tt 1&\tt 0&\tt 0 &\tt 0 &\tt  0&\tt 1&\tt 1&\tt 1\\
\tt 0&\tt 1&\tt 0 &\tt 0 &\tt  1&\tt 0&\tt 1&\tt 1\\
\tt 0&\tt 0&\tt 1 &\tt 0 &\tt  1&\tt 1&\tt 0&\tt 1\\
\tt 0&\tt 0&\tt 0 &\tt 1 &\tt  1&\tt 1&\tt 1&\tt 0
\end{array} \right]  .
\label{eq.selfdual84G}
\eeq
\een
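
 A direct check on the $(8,4)$ code (\ref{eq.selfdual84G}) may be helpful.
 Every row of $\bG$ has weight 4, and any two distinct rows have no
 {\tt 1}s in common in the identity part and exactly two {\tt 1}s in
 common in the right-hand part. For the first two rows, for example,
\beq
 {\tt 10000111} \cdot {\tt 01001011} = 0+0+0+0+0+0+1+1 = 2 = 0 \:\:\mbox{(modulo 2)} .
\eeq
 So every pair of rows, including a row paired with itself, has even
 overlap, and all codewords are orthogonal to all codewords. Since the
 dual has dimension $N-K = 4 = K$, it cannot contain anything else.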
\exercissxB{2}{ex.dual84.74}{
 Find the relationship of the above $(8,4)$ code to the $(7,4)$ Hamming code.
}

\subsection{Duals and graphs}
 Let a code be represented by a graph in which there are
 nodes of two types, parity-check constraints and equality
 constraints, joined by edges which represent the bits
 of the code (not all of which need be transmitted).

 The dual code's graph is obtained by replacing all
 \ind{parity-check nodes} by equality nodes and {\em vice versa}.
 This type of graph is called a \ind{normal graph} by
 \citeasnoun{Forney2001}.
% Forney

% added Thu 16/1/03
\subsection*{Further reading}
 Duals are important in coding theory because functions
 involving a code (such as the posterior distribution over
 codewords) can be transformed by a \ind{Fourier transform}
 into functions over the dual code.
 For an accessible introduction to Fourier analysis on
 finite groups, see \citeasnoun{Terras99}. 
 See also \citeasnoun{macwilliams&sloane}.








 

\section{Generalizing perfectness to other channels}
 Having given up on the search for \ind{perfect code}s
 for the binary symmetric channel, we could console
 ourselves by changing channel.
 We could call a code
 `a perfect $u$-error-correcting code for the binary \ind{erasure channel}'\index{channel!erasure}
 if it can restore any $u$ erased bits, and never more than $u$.%
\marginpar{\small\raggedright{In a perfect $u$-error-correcting code for the
 binary {erasure channel}, the number of redundant  bits must be $N-K=u$.
 }}
 Rather than using the word perfect, however,
 the conventional term for such a code is a `\ind{maximum distance separable} code', or MDS code.
\label{sec.RAIDII}

% Examples:

 As we already noted in \exerciseref{ex.raid3},
 the $(7,4)$ \ind{Hamming code} is {\em not\/}
 an MDS
% maximum  distance separable
 code.
 It can recover {\em some\/} sets of 3 erased bits,
 but not all. If any 3 bits corresponding to a codeword of weight 3
 are erased, then  one bit of information is unrecoverable.
 This is why the $(7,4)$ code is a poor choice for a \ind{RAID} system.
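
 To make this concrete, the fourth row of the generator matrix
 (\ref{eq.Generator2}), {\tt 0001011}, is a codeword of weight 3 with its
 {\tt 1}s in positions 4, 6 and 7. If exactly those three bits are
 erased, the two codewords
\beq
 {\tt 0000000} \:\:\mbox{and}\:\: {\tt 0001011}
\eeq
 both appear as {\tt 000?0??}, so the decoder cannot tell whether the
 fourth source bit was {\tt 0} or {\tt 1}.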

%A maximum distance separable (MDS) block code is a linear code whose distance is maximal among all linear
%     block codes of rate k/n. It is well known that MDS block codes do exist if the field size is more than n.

 A tiny example of a
 maximum  distance separable code\index{erasure-correction}\index{error-correcting code!maximum distance separable}\index{error-correcting code!parity-check code}\index{MDS}
 is the simple parity-check code $P_{3}$
 whose parity-check matrix is
$\bH = [{\tt 1\, 1\, 1}]$.
 This code has 4 codewords, all of which have even parity. All codewords
 are separated by a distance of 2. Any single erased bit can be restored
 by setting it to the parity of the other two bits.
 The repetition codes are also maximum  distance separable codes.
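
 For example, the four codewords of $P_3$ are {\tt 000}, {\tt 011},
 {\tt 101} and {\tt 110}. If {\tt 1?0} is received, the erased bit
 must satisfy
\beq
 t_2 = t_1 + t_3 = {\tt 1} + {\tt 0} = {\tt 1} \:\:\mbox{(modulo 2)} ,
\eeq
 giving {\tt 110}. Here $N-K = 1$, and one erasure is the most that can
 be restored: if two bits are erased, as in {\tt 1??}, both {\tt 101}
 and {\tt 110} remain possible. The repetition code R$_3$, with
 $N-K=2$, is at the other extreme: any two erasures leave one intact
 bit, which determines the codeword.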

\exercissxB{5}{ex.qeccodeperfect}{
 Can you make an $(N,K)$ code, with $M=N-K$ parity symbols,
 for a $q$-ary erasure channel, such that the decoder can recover
 the codeword when {\em{any}\/} $M$ symbols
 are erased  in a block
 of $N$?
 [Example: for
% There do exist some such codes: for example,  for
 the channel with
 $q=4$ symbols there is
 an $(N,K) = (5,2)$ code which can correct any $M=3$ erasures.]
% ; and for $q=8$ there is a $(9,2)$ code.]
}

 For the $q$-ary erasure channel with $q>2$, there are large numbers
 of MDS codes, of which the Reed--Solomon codes are the most
 famous and most widely used.
 As long as the field size $q$ is bigger than the blocklength $N$,
 MDS block codes of any rate can be found. (For further reading, see \citeasnoun{lincostello83}.)
% according to my notes.

% 4-ary erasure channel.
% Include tournament example. GF4, 16 individuals. can tolerate 3 erasures.
% Reed--Solomon codes.

\section{Summary}
 Shannon's codes for the binary symmetric channel
  can almost always correct $\fN$ errors, but they
 are not $\fN$-error-correcting codes.

%\noindent
\subsection*{Reasons why the distance of a code has little relevance}
\ben
\item
 The Shannon limit shows  that the best codes must be able to
 cope with a noise level twice as big as the maximum
 noise level for a bounded-distance decoder.
\item
 When the binary symmetric channel has
 $f>1/4$, no code with a bounded-distance decoder
 can communicate at all; but Shannon says good codes exist
 for such channels.
\item
 Concatenation shows that we can get good performance even if
 the distance is bad.\index{concatenation}\index{distance!of code}
\een
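
 The first two points can be summarized in one line. This sketch
 assumes that the best available distance is the
 Gilbert--Varshamov distance, $H_2(d_{\rm GV}/N) = 1-R$, and that a
 bounded-distance decoder corrects up to $d/2$ errors, so that the
 largest noise level it tolerates, here called $f_{\rm bd}$, is $d/(2N)$:
\beq
 \mbox{bounded-distance:} \:\: H_2(2 f_{\rm bd}) = 1-R ; \hspace{0.3in}
 \mbox{Shannon:} \:\: H_2(f) = 1-R .
\eeq
 At any given rate the Shannon noise level is thus $2 f_{\rm bd}$, and
 as $R \rightarrow 0$ we have $f_{\rm bd} \rightarrow 1/4$ while the
 Shannon limit tends to $1/2$.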
%
% Furthermore, `distance isn't everything' -- you can actually
% get to the Shannon limit with a code whose distance is `bad'.
%
% Exercise - prove that if a sequence of codes is very bad then it can't
% have arbitrarily small error probability.

 The whole weight enumerator function is relevant to the question
 of whether a code is a good code.

 The relationship between good codes and
 distance  properties is discussed further in \exerciseref{ex.prob.error.match}.
% ex.equal.threshold}.

%\section*{Further reading}
% For a paper with codes having the property
%  distance, but for practical purposes a code with blocklength $N=10\,000$
% can have codewords of weight $d=32$ and the error probability
% can remain negligibly small even when the channel
% is creating errors of weight 320.
% {mackaymitchisonmcfadden2003}

\section{Further exercises}
% also known as {ex.equal.threshold}
\exercissxC{3}{ex.prob.error.match}{
 A codeword $\bt$ is selected from a linear $(N,K)$
 code $\C$, and it is transmitted 
 over a noisy channel; the received signal is 
 $\by$.
 We  assume that the channel is a memoryless 
 channel such as a Gaussian channel.
 Given an assumed channel model $P(\by  \given \bt)$, there are 
 two decoding problems. 
\begin{description}
\item[The codeword decoding problem] is the task of\index{decoder!codeword} 
 inferring which codeword $\bt$ was transmitted given the 
 received signal.
\item[The bitwise decoding problem] is the task of inferring\index{decoder!bitwise} 
 for each transmitted bit $t_n$ how likely it is that that 
 bit was a one rather than a zero.
\end{description}
 Consider optimal decoders for these two decoding problems.
%
% these will be presented again in
%  section \ref{sec.decoding.problems}
% exact.tex
%
 Prove that the probability of error of the optimal
 bitwise-decoder is closely related to the probability of error of
 the optimal  codeword-decoder, by proving the following
 theorem.\index{decoder!probability of error}
\begin{ctheorem}
 If a binary linear code\index{distance!of code, and error probability}
 has minimum distance  $d_{\min}$,
 then,
 for any given channel, the codeword bit error probability of the optimal
 bitwise decoder, $p_{\rm b}$,
 and the block error probability of the maximum likelihood decoder, $p_{\rm B}$,
are related by:
\beq
	 p_{\rm B} \geq p_{\rm b} \geq \frac{1}{2} \frac{d_{\min}}{N} p_{\rm B} .
\label{eq.thmpBpb}
\eeq
% [I am sure this theorem is well-known; I am not claiming it is original.]
\end{ctheorem}
}

\exercisaxA{1}{ex.HammingD}{
	What are the minimum distances of the $(15,11)$ Hamming
 code and the  $(31,26)$ Hamming
 code?
}

\exercisaxB{2}{ex.estimate.wef}{
 Let $A(w)$ be the
 average weight enumerator function of a rate-$1/3$
 random linear code with $N=540$ and $M=360$.
 Estimate, from first principles, the value of $A(w)$ at $w=1$.
}
\exercisaxC{3C}{ex.handshakecode}{
 {\sf A code with minimum distance\index{Gilbert--Varshamov distance}\index{distance!Gilbert--Varshamov}
  greater than $d_{\rm GV}$.}
% Another way to make a code is to define a generator matrix
% or parity-check matrix.
 A rather nice $(15,5)$ code
 is generated by this   generator  matrix, which is based on measuring the parities
 of all the ${{5}\choose{3}} = 10$ triplets of source bits:
\beq
\bG = \left[
\begin{array}{*{15}{c}}
1&\tinyo&\tinyo&\tinyo&\tinyo&\tinyo&1&1&1&\tinyo&\tinyo&1&1&\tinyo&1 \\
\tinyo&1&\tinyo&\tinyo&\tinyo&\tinyo&\tinyo&1&1&1&1&\tinyo&1&1&\tinyo \\
\tinyo&\tinyo&1&\tinyo&\tinyo&1&\tinyo&\tinyo&1&1&\tinyo&1&\tinyo&1&1\\
\tinyo&\tinyo&\tinyo&1&\tinyo&1&1&\tinyo&\tinyo&1&1&\tinyo&1&\tinyo&1\\
\tinyo&\tinyo&\tinyo&\tinyo&1&1&1&1&\tinyo&\tinyo&1&1&\tinyo&1&\tinyo
\end{array} \right] .
\eeq
 Find the   minimum distance and weight enumerator function
 of this code.
}
\exercisaxC{3C}{ex.findAwmonodec}{
% {\sf A code with minimum distance\index{Gilbert--Varshamov distance}\index{distance!Gilbert--Varshamov}
% slightly greater than $d_{\rm GV}$.}
 Find the minimum distance of the `{pentagonful}\index{pentagonful code}'\index{error-correcting code!pentagonful}%
\amarginfig{t}{
\begin{center}
\buckypsfigw{pentagon.eps}
\end{center}
\caption[a]{The graph of the  pentagonful
 low-density parity-check code with
 15 bit nodes (circles) and 10 parity-check nodes (triangles).
}
}
 low-density parity-check code whose
 parity-check matrix is
\beq
\bH = \left[ \begin{array}{*{5}{c}|*{5}{c}|*{5}{c}}
1 & \tinyo & \tinyo & \tinyo & 1 & 1 & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo \\ 
1 & 1 & \tinyo & \tinyo & \tinyo & \tinyo & 1 & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo \\ 
\tinyo & 1 & 1 & \tinyo & \tinyo & \tinyo & \tinyo & 1 & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo \\ 
\tinyo & \tinyo & 1 & 1 & \tinyo & \tinyo & \tinyo & \tinyo & 1 & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo \\ 
\tinyo & \tinyo & \tinyo & 1 & 1 & \tinyo & \tinyo & \tinyo & \tinyo & 1 & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo \\  \hline
\tinyo & \tinyo & \tinyo & \tinyo & \tinyo & 1 & \tinyo & \tinyo & \tinyo & \tinyo & 1 & \tinyo & \tinyo & \tinyo & 1 \\ 
\tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & 1 & \tinyo & 1 & 1 & \tinyo & \tinyo & \tinyo \\ 
\tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & 1 & \tinyo & \tinyo & \tinyo & \tinyo & 1 & 1 & \tinyo & \tinyo \\ 
\tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & 1 & \tinyo & \tinyo & 1 & 1 & \tinyo \\ 
\tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & 1 & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & 1 & 1 
\end{array} \right] .
\label{eq.monodec}
\eeq
 Show that nine of the ten rows are independent, so the
 code has parameters $N=15$, $K=6$.
 Using a computer, find its weight enumerator function.
% Find its weight enumerator function.
}
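
 The computer search requested in the pentagonful exercise can be sketched
 in a few lines of Python. This is only a minimal sketch under stated assumptions:
 the list {\tt ROWS} is transcribed by hand from the parity-check matrix
 (\ref{eq.monodec}), and the names {\tt gf2\_rank} and {\tt weight\_enumerator}
 are illustrative. The rank is found by Gaussian elimination over $GF(2)$
 and the weight enumerator by exhaustive search over all $2^{15}$ vectors.

```python
# Sketch only: ROWS lists the column indices (0-based) of the 1s in each row
# of H, transcribed by hand from the parity-check matrix above.
# The function names are illustrative, not taken from the book.
N = 15
ROWS = [
    (0, 4, 5), (0, 1, 6), (1, 2, 7), (2, 3, 8), (3, 4, 9),
    (5, 10, 14), (8, 10, 11), (6, 11, 12), (9, 12, 13), (7, 13, 14),
]
MASKS = [sum(1 << c for c in cols) for cols in ROWS]


def gf2_rank(masks, n=N):
    """Rank over GF(2), by Gaussian elimination on integer bitmasks."""
    rows = list(masks)
    rank = 0
    for bit in range(n):
        pivot = next(
            (i for i in range(rank, len(rows)) if (rows[i] >> bit) & 1), None
        )
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for i in range(len(rows)):
            if i != rank and (rows[i] >> bit) & 1:
                rows[i] ^= rows[rank]
        rank += 1
    return rank


def weight_enumerator(masks, n=N):
    """counts[w] = number of codewords of weight w (exhaustive search)."""
    counts = [0] * (n + 1)
    for x in range(1 << n):
        # x is a codeword if it has even overlap with every parity check.
        if all(bin(x & m).count("1") % 2 == 0 for m in masks):
            counts[bin(x).count("1")] += 1
    return counts


if __name__ == "__main__":
    print("rank of H:", gf2_rank(MASKS))
    print("A(w) for w = 0..15:", weight_enumerator(MASKS))
```

 The smallest non-zero $w$ with a non-zero count is the minimum distance.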

\exercisxB{3C}{ex.concateex}{
 Replicate the calculations used to produce
  \figref{fig.concath.rate}.
 Check the assertion that  the highest noise level
 that's correctable is 0.0588.
 Explore alternative concatenated
 sequences of codes.  Can you find a better sequence of concatenated
 codes -- better in the sense that it
 has either  higher asymptotic rate  $R$ or  can tolerate
 a higher noise level $\q$?
}

\exercissxA{3}{ex.syndromecount}{
	Investigate the possibility of achieving the Shannon 
 limit with linear block codes, using the following \ind{counting argument}.
 Assume a linear code of large blocklength $N$ and rate $R=K/N$.
 The code's parity-check matrix $\bH$ has $M = N - K$ rows.
 Assume that the code's optimal decoder, which solves the
 syndrome decoding problem $\bH \bn = \bz$, allows reliable communication
 over a binary symmetric channel with flip probability $f$.

	How many `typical' noise vectors $\bn$ are there?

	Roughly how many distinct syndromes $\bz$ are there?

	Since $\bn$ is reliably deduced from $\bz$ by the optimal decoder,
 the number of syndromes must be greater than or equal  to the number of
 typical noise vectors.  What does this tell you about the largest
 possible value of rate $R$ for a given $f$?
}
\exercisxB{2}{ex.zchanneldeficit}{
 Linear binary codes use  the input symbols {\tt{0}} and {\tt{1}} with
 equal probability,  implicitly treating the channel as a symmetric
 channel. Investigate how much loss in communication rate is caused by
 this assumption, if in fact the channel is a highly asymmetric channel.
 Take as an example a Z-channel. How much smaller is the maximum possible rate
 of communication using symmetric inputs than the capacity of the channel?
 [Answer: about 6\%.]
}
\exercisxC{2}{ex.baddistbad}{
 Show that codes with `very bad' distance are `bad' codes, as defined
 in \secref{sec.bad.code.def} (\pref{sec.bad.code.def}).
%
% Show that there exist codes with `bad' distance
% that are `very good' codes.
%
% this bit already done in   {ex.distancenotE}{
}
\exercisxC{3}{ex.puncture}{
 One linear code can be obtained from another
 by {\dem{\ind{puncturing}}}. Puncturing
 means taking each codeword and deleting a defined set of bits.
 Puncturing turns an $(N,K)$ code into
 an $(N',K)$ code, where $N'<N$.

 For $q>2$, some MDS codes can be found.

 As a simple example, here is a $(9,2)$ code for the
 $8$-ary erasure channel.
 The code is defined in terms of the\index{Galois field}
% \index{finite field}
 multiplication and addition rules of  $GF(8)$,
 which are given in \appendixref{sec.gf8}.
 The elements of the input alphabet are $\{0,1,A,B,C,D,E,F\}$
 and 
 the 
 generator matrix of the code is 
\beq
	\bG = \left[ \begin{array}{*{9}{c}}
 1 &0 &1 &A &B &C &D &E &F \\
 0 &1 &1 &1 &1 &1 &1 &1 &1 \\
\end{array} \right] .
\eeq

 The resulting 64 codewords are:\smallskip
{\footnotesize\tt
\begin{narrow}{0in}{-\margindistancefudge}%
\begin{realcenter}
\begin{tabular}{*{8}{c}}
000000000 & 
011111111 & 
0AAAAAAAA & 
0BBBBBBBB & 
0CCCCCCCC & 
0DDDDDDDD & 
0EEEEEEEE & 
0FFFFFFFF 
\\ 
101ABCDEF & 
110BADCFE & 
1AB01EFCD & 
1BA10FEDC & 
1CDEF01AB & 
1DCFE10BA & 
1EFCDAB01 & 
1FEDCBA10  
\\ 
A0ACEB1FD & 
A1BDFA0EC & 
AA0EC1BDF & 
AB1FD0ACE & 
ACE0AFDB1 & 
ADF1BECA0 & 
AECA0DF1B & 
AFDB1CE0A  
\\ 
B0BEDFC1A & 
B1AFCED0B & 
BA1CFDEB0 & 
BB0DECFA1 & 
BCFA1B0DE & 
BDEB0A1CF & 
BED0B1AFC & 
BFC1A0BED 
\\ 
C0CBFEAD1 & 
C1DAEFBC0 & 
CAE1DC0FB & 
CBF0CD1EA & 
CC0FBAE1D & 
CD1EABF0C & 
CEAD10CBF & 
CFBC01DAE  
\\ 
D0D1CAFBE & 
D1C0DBEAF & 
DAFBE0D1C & 
DBEAF1C0D & 
DC1D0EBFA & 
DD0C1FAEB & 
DEBFAC1D0 & 
DFAEBD0C1  
\\ 
E0EF1DBAC & 
E1FE0CABD & 
EACDBF10E & 
EBDCAE01F & 
ECABD1FE0 & 
EDBAC0EF1 & 
EE01FBDCA & 
EF10EACDB  
\\ 
F0FDA1ECB & 
F1ECB0FDA & 
FADF0BCE1 & 
FBCE1ADF0 & 
FCB1EDA0F & 
FDA0FCB1E & 
FE1BCF0AD & 
FF0ADE1BC 
\\ 
\end{tabular}
\end{realcenter}
\end{narrow}
}

}
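
 The table of 64 codewords can be regenerated from the generator matrix
 with a short program. The following is a sketch under an assumption that
 is not stated in this section: the elements $0,1,A,\ldots,F$ are taken to be
 the 3-bit polynomials over $GF(2)$ in binary order, multiplied modulo
 $x^3+x+1$. This choice reproduces, for example, the listed codeword
 {\tt A0ACEB1FD}; the function names are illustrative.

```python
# Assumption for this sketch: GF(8) elements are 3-bit polynomials over GF(2),
# labelled 0,1,A,...,F in binary order and reduced modulo x^3 + x + 1.
LABELS = "01ABCDEF"


def gf8_mul(a, b):
    """Multiply two GF(8) elements given as integers 0..7."""
    r = 0
    for i in range(3):
        if (b >> i) & 1:
            r ^= a << i            # carry-less (polynomial) multiplication
    for shift in (4, 3):            # reduce the degree-4 and degree-3 terms
        if (r >> shift) & 1:
            r ^= 0b1011 << (shift - 3)
    return r


G = [
    [1, 0, 1, 2, 3, 4, 5, 6, 7],   # 1 0 1 A B C D E F
    [0, 1, 1, 1, 1, 1, 1, 1, 1],   # 0 1 1 1 1 1 1 1 1
]


def encode(s1, s2):
    """Codeword s1*G[0] + s2*G[1]; addition in GF(8) is bitwise XOR."""
    return [gf8_mul(s1, g1) ^ gf8_mul(s2, g2) for g1, g2 in zip(G[0], G[1])]


def all_codewords():
    """All 64 codewords as strings over the alphabet 0,1,A,...,F."""
    return [
        "".join(LABELS[v] for v in encode(s1, s2))
        for s1 in range(8)
        for s2 in range(8)
    ]


if __name__ == "__main__":
    for word in all_codewords():
        print(word)
```

 Under this assumption every non-zero codeword has exactly eight non-zero
 symbols, so the code has distance $N-K+1 = 8$.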
% from exercise section in _linear.tex
%
% this was in _sexact
%
% ex.prob.error.match
\soln{ex.prob.error.match}{% ex.equal.threshold}{
{\sf Quick, rough proof of the theorem.} Let $\bx$ denote the difference 
 between the reconstructed codeword and the transmitted codeword.
 For any given channel output $\br$, there is a posterior distribution over
 $\bx$. This posterior distribution is positive only
 on vectors $\bx$ belonging to the code; the sums
 that follow are over codewords $\bx$. The block error probability is:
\beq
	p_{\rm B} = \sum_{\bx \neq 0} P(\bx \given \br) .
\label{eq.pBdef}
\eeq
 The average bit error probability, averaging over all bits in
 the codeword, is:
\beq
	p_{\rm b} = \sum_{\bx \neq 0} P(\bx \given \br) \frac{w(\bx)}{N} ,
\label{eq.pbdef}
\eeq
 where $w(\bx)$ is the weight of codeword $\bx$.
 Now the weights of the non-zero codewords satisfy
\beq
 1 \geq   \frac{w(\bx)}{N} \geq                 \frac{d_{\min}}{N} .
\label{eq.ineq}
\eeq
 Substituting the inequalities (\ref{eq.ineq}) into
 the definitions (\ref{eq.pBdef},$\,$\ref{eq.pbdef}),
 we obtain:
%
\beq
	 p_{\rm B} \geq p_{\rm b} \geq
% \frac{1}{2}
                 \frac{d_{\min}}{N} p_{\rm B} ,
\label{eq.thmpBpbA}
\eeq
 which is a factor of two stronger, on the right, than
 the stated result (\ref{eq.thmpBpb}).
 In making the proof watertight, I have weakened the result a little.\medskip

% So the bit and block {\em thresholds\/} of a code with good distance
% are identical.

%\section
\noindent
{\sf Careful proof.}
 The  theorem relates the performance of the optimal 
 block  decoding algorithm and the optimal bitwise decoding algorithm.

 We introduce another pair of decoding algorithms, called the block-guessing
 decoder and the bit-guessing decoder. The idea is that
 these two algorithms are similar to
 the optimal block decoder and the  optimal bitwise decoder,
 but lend themselves more easily to
 analysis.

We now define these decoders. Let $\bx$ denote
 the inferred codeword. For any given code: 
\begin{description}
\item[The optimal block decoder] returns the codeword $\bx$ that maximizes
	the posterior probability
	$P(\bx  \given  \br)$, which is proportional to the likelihood
	$P( \br  \given  \bx)$.

The probability of error of this decoder is called
	$\PB$.
\item[The optimal bit decoder] returns for each of the $N$ bits, $x_n$, 
 the value of $a$ that maximizes
 the posterior probability
	$P( x_n \eq  a  \given  \br ) = \sum_{\bx} P(\bx  \given  \br) \,\truth\! [ x_n\eq a ]$.

The probability of error of this decoder is called
	$\Pb$.

\item[The block-guessing decoder] returns a random codeword $\bx$
	with probability distribution  given by the posterior probability
	$P(\bx  \given  \br)$.

The probability of error of this decoder is called
	$\PGB$.

\item[The bit-guessing decoder]  returns for each of the $N$ bits, $x_n$, 
 a random bit from the probability distribution $P( x_n \eq  a  \given  \br )$.

The probability of error of this decoder  is called 
	$\PGb$.

\end{description}
 The theorem states that 
 the optimal bit error probability $\Pb$
 is bounded above by 
  $\PB$ and below by a given multiple of $\PB$ (\ref{eq.thmpBpb}).
%
%\beq
%	 P_B \geq P_b \geq \frac{1}{2}                  \frac{d_{\min}}{N} P_B .
%\label{eq.thmpBpb.again}
%\eeq

 The left-hand inequality in (\ref{eq.thmpBpb})
 is trivially true -- if  a block is correct,
 all its constituent bits are correct; so if the optimal
 block decoder outperformed the optimal bit decoder, we could
 make a better bit decoder from the block decoder. 

 We prove the right-hand inequality by establishing that:
% the following two lemmas:
\ben
\item
	the bit-guessing decoder is nearly
 as good as the optimal bit decoder:
\beq
	\PGb \leq 2 \Pb .
\label{eq.guess}
\eeq
\item
	the bit-guessing decoder's  error probability
 is related to the block-guessing decoder's
 by
\beq
	\PGb \geq                   \frac{d_{\min}}{N} \PGB .
\eeq
\een
 Then since $\PGB \geq \PB$, we have
\beq
	\Pb \geq \frac{1}{2} \PGb \geq \frac{1}{2} \frac{d_{\min}}{N} \PGB
   \geq \frac{1}{2}   \frac{d_{\min}}{N}  \PB  .
\eeq
 We now prove the two lemmas.\medskip

\noindent
%\subsection
{\sf Near-optimality of guessing:}
 Consider first the case of
 a single bit, with posterior probability $\{ p_0, p_1 \}$.
% Without loss of generality, let $p_0 \geq p_1$.
 The optimal bit decoder
% picks $\argmax_a p_a$,
% \ie, 0,
% and
 has probability of error
\beq
%	\Pb
               P^{\rm{optimal}} = \min (p_0,p_1).
\eeq
% $p_1$.
 The guessing decoder picks 0 with probability $p_0$ and 1 with
 probability $p_1$. The truth is independently
 distributed with the same probabilities. The probability
 that the guesser and the truth  match is
 $p_0^2 + p_1^2$; the probability that they
 mismatch is the guessing error probability, 
\beq
% \PGb
  P^{\rm guess} =  2 p_0 p_1 \leq 2  \min (p_0,p_1) = 2 P^{\rm{optimal}} .
\eeq
 Since $\PGb$ is the average
 of many such  error probabilities, $P^{\rm guess}$,
 and $\Pb$ is the average of the corresponding optimal
 error probabilities, $P^{\rm{optimal}}$,
 we obtain the desired relationship  (\ref{eq.guess})
  between $\PGb$ and $\Pb$.\ENDproof
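
 The single-bit inequality $2 p_0 p_1 \leq 2 \min(p_0,p_1)$ is easy to check
 numerically. The sketch below is not part of the proof; it simply
 evaluates both sides on a grid of values of $p_0$, using illustrative
 function names.

```python
def optimal_error(p0):
    """Error probability of the optimal single-bit decoder: min(p0, p1)."""
    return min(p0, 1.0 - p0)


def guessing_error(p0):
    """Error probability of the guessing decoder.

    Guess and truth are independent draws from {p0, p1}, so they
    mismatch with probability 2 * p0 * p1.
    """
    return 2.0 * p0 * (1.0 - p0)


def check_grid(steps=1000):
    """True if guessing_error <= 2 * optimal_error at every grid point."""
    return all(
        guessing_error(k / steps) <= 2.0 * optimal_error(k / steps) + 1e-12
        for k in range(steps + 1)
    )


if __name__ == "__main__":
    print(check_grid())
    print(guessing_error(0.9), 2.0 * optimal_error(0.9))
```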
%
\medskip

%\subsection
\noindent
 {\sf Relationship between bit error probability
 and block error probability:}
 The bit-guessing and block-guessing decoders
 can be combined in a single system:
% The posterior probability of a bit $x_n$ and a block $\bx$
% is given by
%\beq
% P( x_n = a , \bx  \given  \br ) =
% P(  \bx  \given  \br ) P( x_n = a  \given  \bx, \br ) =
%\eeq
% So w
 we can draw a sample $x_n$ from the marginal distribution
 $P(x_n \given \br)$ by
 drawing a sample $( x_n , \bx )$
 from the  joint distribution $P( x_n , \bx  \given  \br )$,
 then discarding the value of $\bx$.

 We can distinguish between two cases: the discarded value of $\bx$
 is  the correct codeword, or not.
 The probability of bit error for the bit-guessing decoder
 can then be written as a sum of two terms:
\beqa
	\PGb &\eq &
 P(\mbox{$\bx$ correct}) P(\mbox{bit error} \given \mbox{$\bx$ correct})
\nonumber
\\
 & & + \,
 P(\mbox{$\bx$ incorrect}) P(\mbox{bit error} \given \mbox{$\bx$ incorrect})
\\
 &=&
% P(\mbox{$\bx$ correct}) \times
 0 + \PGB  P(\mbox{bit error} \given \mbox{$\bx$ incorrect}) .
\eeqa
% The first of these terms is zero.
 Now, whenever the guessed $\bx$ is incorrect, the true
 $\bx$ must differ from it in at least $d_{\min}$ bits, so
 the probability of bit error in these cases is at least $d_{\min}/N$.
So
\[%beq
	\PGb \geq  \frac{d_{\min}}{N} \PGB .
% \eepf
\]%eeq
 QED.\hfill $\epfsymbol$
}

\soln{ex.syndromecount}{
 The number of   `typical' noise vectors $\bn$  is
 roughly $2^{NH_2(f)}$.
% , where $H=H_2(f)$.
 The number of distinct syndromes $\bz$ is $2^M$.
 So reliable communication implies
\beq
	M \geq  NH_2(f) ,
\eeq
 or, in terms of the rate $R = 1-M/N$,
\beq
	R \leq 1 - H_2(f) ,
\eeq
 a bound which agrees precisely with the capacity of the channel.

 This argument is turned into a proof in the following chapter.
}
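
 The counting bound $M \geq N H_2(f)$ can be evaluated numerically.
 The following sketch computes the binary entropy, the largest rate
 $1 - H_2(f)$ allowed by the bound, and the smallest integer $M$
 satisfying it; the function names are illustrative.

```python
from math import ceil, log2


def binary_entropy(f):
    """H_2(f) in bits, with H_2(0) = H_2(1) = 0."""
    if f in (0.0, 1.0):
        return 0.0
    return -f * log2(f) - (1.0 - f) * log2(1.0 - f)


def max_rate(f):
    """Largest rate R permitted by the counting argument: 1 - H_2(f)."""
    return 1.0 - binary_entropy(f)


def min_parity_checks(N, f):
    """Smallest integer M with M >= N * H_2(f)."""
    return ceil(N * binary_entropy(f))


if __name__ == "__main__":
    for f in (0.01, 0.1, 0.11, 0.25):
        print(f, max_rate(f), min_parity_checks(1000, f))
```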
% BORDERLINE
\soln{ex.hat.puzzle}{
%  Mathematicians credit the problem to Dr. Todd Ebert, a computer
%   science instructor at the University of California at Irvine, who
%   introduced it in his Ph.D. thesis at the University of California at
%   Santa Barbara in 1998.
 In the three-player case,
   it is possible for the group to win three-quarters of the time.

   Three-quarters of the time, two of the players will have hats of the
   same colour and the third player's hat will be the opposite colour. The
   group can win every time this happens by using the following strategy.
 Each player looks at the other two players'
   hats. If the two hats are {\em different\/}
 colours, he passes. If they are the
 {\em  same\/} colour, the player guesses his own hat is the {\em opposite\/}
 colour.

   This way, every time the hat colours are distributed two and one, one
   player will guess correctly and the others will pass, and the group
   will win the game. When all the hats are the same colour, however, {\em all
   three\/} players will guess incorrectly and the group will lose.

 When any particular player guesses a colour, it is true
 that there is only a 50:50 chance that their guess is right.
 The reason that the group wins 75\% of the time is that their
 strategy ensures that when players are guessing wrong, a great many are
   guessing wrong.

 For larger numbers of players, the aim is
 to ensure  that most of the time no one
   is wrong and occasionally everyone is wrong at once.
 In the game with 7 players, there is a strategy for
   which the group wins 7 out of every 8 times they play.
 In the game with 15 players, the group can win 15 out of 16 times.
 If you have not figured out these winning strategies for teams
 of 7 and 15, I recommend thinking about the
 solution to the three-player game in terms of the locations
 of the winning and losing states on the three-dimensional hypercube,
 then thinking laterally.

\begincuttable
 If the number of players, $N$, is $2^r-1$,
 the optimal strategy can be defined using a Hamming code of length $N$,
 and the probability of winning the prize is $\linefrac{N}{(N+1)}$.
 Each player
 is identified with a number $n \in \{1,\ldots,N\}$.
 The two colours
 are mapped onto {\tt{0}} and {\tt{1}}. Any state of their hats
 can be viewed as  a received vector out of a binary channel.
 A random binary vector of length $N$
 is either a codeword of the Hamming code, with probability
 $1/(N+1)$, or it differs
 in exactly one bit from a codeword.
% There is a probability 
 Each player looks at all the other bits and considers whether his bit
 can be set to a colour
 such that   the state is a codeword (which can be deduced
using the decoder
 of the Hamming code). If it can, then
 the player guesses that his hat is the {\em other\/} colour.
 If the state is actually a codeword, all players will guess and
 will guess wrong. If the state is a non-codeword, only
 one player will guess, and his guess will be correct.
 It's quite easy to train seven players to follow the optimal
 strategy if the cyclic representation of the $(7,4)$ Hamming code
 is used (\pref{sec.h74cyclic}).
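
 The Hamming-code strategy described above can be checked by exhaustive
 enumeration for small $N$. The sketch below makes two assumptions:
 the group wins if at least one player guesses and no guess is wrong,
 and the Hamming code is the one whose parity-check matrix has the binary
 representation of $n$ as its $n$th column, so that the syndrome is the
 exclusive-or of the numbers of the players whose bit is {\tt{1}}.
 The function names are illustrative.

```python
from fractions import Fraction


def guesses(state, n_players):
    """Action of each player (0, 1, or None for a pass) given all hat bits.

    Players are numbered 1..N. Each player computes the syndrome of the
    other players' bits, i.e. the XOR of the numbers of those showing 1.
    """
    actions = []
    for n in range(1, n_players + 1):
        s = 0
        for m in range(1, n_players + 1):
            if m != n and state[m - 1]:
                s ^= m
        if s == 0:
            actions.append(1)      # own bit 0 would give a codeword: guess 1
        elif s == n:
            actions.append(0)      # own bit 1 would give a codeword: guess 0
        else:
            actions.append(None)   # no codeword reachable: pass
    return actions


def win_probability(r):
    """Exact winning probability for N = 2^r - 1 players."""
    n_players = 2 ** r - 1
    wins = 0
    for x in range(2 ** n_players):
        state = tuple((x >> i) & 1 for i in range(n_players))
        made = [
            (a, b) for a, b in zip(guesses(state, n_players), state)
            if a is not None
        ]
        if made and all(a == b for a, b in made):
            wins += 1
    return Fraction(wins, 2 ** n_players)


if __name__ == "__main__":
    for r in (2, 3):
        print(2 ** r - 1, "players:", win_probability(r))
```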

% I am not sure of the optimal solution for the `Scottish version'
% of the rules  in which the prize is only awarded to the group
% if they {\em all\/} guess correctly.
% As a starting point, if one flips the guesses of the winning strategy
% for the original game, the group
% will win whenever it is in a codeword state, which
% happens with probability  $1/(N+1)$.  The question is
% what to do with the `passes'.
%% since passing is never in one's interests.
% Can the group do better than replacing passes with random guessing?
}
% \soln{ex.selforthog}{
% removed to cutsolutions.tex

% end from _linear.tex

\dvips
%\section{Solutions to Chapter \protect\ref{ch.linearecc}'s exercises} % 
\dvipsb{solutions linear}
\dvips
\prechapter{About               Chapter}
 In this chapter we will draw together several ideas 
 that we've encountered so far in one nice short proof.
 We will simultaneously prove both
 Shannon's noisy-channel coding theorem (for
 symmetric binary channels)
 and his source coding theorem  (for binary sources).
 While this proof has connections to many preceding chapters
 in the book, it's not essential to have read them all.

 On the noisy-channel coding side,
 our proof will be more constructive than the
 proof given in \chref{ch.six}; there, we proved that
 almost any random code is `very good'.
 Here we will show that
 almost any {\em linear\/} code is very good.
 We will make use of the idea of typical sets (Chapters \ref{ch.two} and \ref{ch.six}),
 and we'll borrow from the previous chapter's
 calculation of the  weight enumerator function of random linear codes (\secref{sec.wef.random}).

 On the source coding side,
 our proof will show that {\em random linear \ind{hash function}s} can be used
 for compression of compressible binary sources, thus giving
 a link to \chref{ch.hash}.

\ENDprechapter
\chapter{Very Good Linear Codes Exist}
\label{ch.lineartypical}
%
% very good linear codes exist
%
 In this chapter we'll use a single calculation
 to prove simultaneously
   the \ind{source coding theorem} and the\index{noisy-channel coding theorem}
 noisy-channel  coding theorem for the \ind{binary symmetric channel}.\index{channel!binary symmetric}\index{noisy-channel coding theorem!linear codes}\index{linear block code!noisy-channel coding theorem}\index{error-correcting code!linear!noisy-channel coding theorem}

 {Incidentally,
	this proof works for much more general channel models,
	not only the binary symmetric channel.  For example,
	the proof can be reworked for channels with
	non-binary outputs, for time-varying channels
	and for channels with memory, as long as they
	have binary inputs satisfying a symmetry property,
 \cf\ \secref{sec.Symmetricchannels}.}
%
\label{ch.linear.good}
\section{A simultaneous proof of the source coding and
 noisy-channel  coding theorems}
 We consider a linear error-correcting code with binary \ind{parity-check
 matrix} $\bH$. The matrix has $M$ rows and $N$ columns.
 Later in the proof we will increase $N$ and $M$, keeping $M \propto N$.
 The 
 rate of the code satisfies
\beq
	R \geq 1 - \frac{M}{N}.
\eeq
 If all the rows of $\bH$ are independent then this
 is an equality, $R = 1 -M/N$. In what follows,\index{error-correcting code!rate}\index{error-correcting code!linear}
 we'll assume the equality holds. Eager  readers
 may work out the  expected rank of
 a random binary matrix $\bH$ (it's very close to $M$)
 and pursue the effect that the difference ($M - \mbox{rank}$) has
% small number of linear dependences have
 on the rest of this proof (it's negligible).
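
 As a starting point for that calculation, the probability that a random
 binary matrix has full rank can be computed directly. The sketch below
 uses the standard product formula
 $\prod_{i=0}^{M-1} \left( 1 - 2^{i-N} \right)$ for an $M \times N$ matrix
 with independent uniform entries and $M \leq N$. It gives the probability
 of full rank rather than the expected rank itself, and the function name
 is illustrative.

```python
def prob_full_rank(M, N):
    """P(rank = M) for an M x N uniform random binary matrix, M <= N.

    Row i (counting from 0) must avoid the 2^i vectors spanned by the
    i earlier rows, which happens with probability 1 - 2^(i - N).
    """
    p = 1.0
    for i in range(M):
        p *= 1.0 - 2.0 ** (i - N)
    return p


if __name__ == "__main__":
    for M, N in ((5, 10), (50, 100), (90, 100)):
        print(M, N, prob_full_rank(M, N))
```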
 
 A codeword $\bt$ is selected, satisfying
\beq
	\bH \bt = {\bf 0}  \mod 2 ,
\eeq
 and a binary symmetric channel adds noise $\bx$, giving
 the received signal\marginpar{\small\raggedright{In this chapter
 $\bx$  denotes the noise added by the channel,
 not the input to the channel.}}
\beq
	\br = \bt + \bx  \mod 2.
\eeq

 The receiver aims to infer both $\bt$ and $\bx$ from
 $\br$ using a \index{syndrome decoding}{syndrome-decoding} approach.
 Syndrome decoding was first introduced in
 \secref{sec.syndromedecoding} (\pref{sec.syndromedecoding} and \pageref{sec.syndromedecoding2}).
% and \secref{sec.syndromedecoding2}.
 The receiver computes the syndrome
\beq
	\bz = \bH \br \mod 2  = \bH \bt + \bH  \bx  \mod 2
	= \bH  \bx  \mod 2  .
\eeq
%  Since $\bH \bt = {\bf 0}$, t
 The syndrome depends only on the noise $\bx$,
 and the decoding problem is to find the most probable $\bx$ that
 satisfies
\beq
	\bH \bx = \bz  \mod 2.
\eeq
 This best estimate for the noise vector, $\hat{\bx}$, is then
 subtracted from $\br$ to give the best guess for $\bt$.
 Our aim is to show that,
 as long as $R < 1-H(X) =  1-H_2(f)$,
 where $f$ is the flip probability of the binary symmetric channel,
 the optimal decoder for this syndrome-decoding
 problem has vanishing  probability of error, as $N$ increases,
 for random $\bH$.
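
 The syndrome-decoding steps above can be illustrated on a small code.
 The following sketch assumes, purely for illustration, a $(7,4)$ Hamming
 parity-check matrix whose $j$th column is the binary representation of $j$,
 and a flip probability $f < 1/2$, so that the most probable noise vector is
 the one of lowest weight. It searches all $2^N$ noise vectors, which is
 feasible only for tiny $N$; the function names are illustrative.

```python
from itertools import product

# Illustrative parity-check matrix (an assumption of this sketch): column j
# is the binary representation of j + 1, giving a (7,4) Hamming code.
H = [
    [0, 0, 0, 1, 1, 1, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [1, 0, 1, 0, 1, 0, 1],
]


def syndrome(H, v):
    """z = H v mod 2, returned as a tuple."""
    return tuple(sum(h * x for h, x in zip(row, v)) % 2 for row in H)


def most_probable_noise(H, z):
    """Lowest-weight x with H x = z (mod 2), found by exhaustive search.

    For f < 1/2 this is a most probable noise vector; ties, if any,
    are broken by enumeration order.
    """
    n = len(H[0])
    candidates = (
        x for x in product((0, 1), repeat=n) if syndrome(H, x) == tuple(z)
    )
    return min(candidates, key=sum)


def decode(H, r):
    """Return (t_hat, x_hat): estimated codeword and estimated noise."""
    x_hat = most_probable_noise(H, syndrome(H, r))
    t_hat = tuple((ri + xi) % 2 for ri, xi in zip(r, x_hat))
    return t_hat, x_hat


if __name__ == "__main__":
    t = (1, 1, 1, 0, 0, 0, 0)           # a codeword: H t = 0
    x = (0, 0, 0, 0, 1, 0, 0)           # one flipped bit
    r = tuple((a + b) % 2 for a, b in zip(t, x))
    print(syndrome(H, r), syndrome(H, x))
    print(decode(H, r))
```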
% and averaging over all binary matrices $\bH$.

 We prove this result by studying a sub-optimal
 strategy for solving the decoding problem. Neither the optimal decoder
 nor this {\em \ind{typical-set decoder}\/} would be easy to implement,
 but the  typical-set decoder is easier to \analyze.
 The typical-set decoder examines the typical
 set  $T$ of noise vectors,  the set of
 noise vectors $\bx'$ that satisfy $\log \dfrac{1}{P(\bx')} \simeq
 NH(X)$,\marginpar{\small\raggedright{We'll leave out the $\epsilon$s and $\beta$s that make
 a typical-set definition rigorous. Enthusiasts are encouraged
 to revisit \secref{sec.ts} and put  these details into this proof.}}
 checking to see if any of those typical vectors
 $\bx'$ satisfies the observed syndrome,
\beq
	\bH \bx' = \bz .
\eeq
 If exactly one typical vector $\bx'$ does so, the
 typical-set decoder reports that vector as the hypothesized
 noise vector.
 If no typical vector matches the observed syndrome,
 or more than one does, then the typical-set
 decoder reports an error.
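
 The typical-set decoder just described can be written down directly for
 small $N$. The sketch below restores a tolerance $\beta$, which the text
 deliberately leaves out: a vector $\bx'$ is treated as typical if
 $\left| \frac{1}{N} \log_2 \frac{1}{P(\bx')} - H_2(f) \right| < \beta$.
 The value of $\beta$ and the function names are assumptions of the sketch,
 and the exhaustive search over $2^N$ vectors is for illustration only.

```python
from itertools import product
from math import log2


def info_content_per_bit(x, f):
    """(1/N) log2 1/P(x) for a noise vector x with flip probability f."""
    w, n = sum(x), len(x)
    return (w * log2(1.0 / f) + (n - w) * log2(1.0 / (1.0 - f))) / n


def typical_set_decode(H, z, f, beta):
    """Return the unique typical x' with H x' = z (mod 2).

    Returns None (an error report) if no typical vector matches the
    syndrome or if more than one does.
    """
    n = len(H[0])
    h2 = f * log2(1.0 / f) + (1.0 - f) * log2(1.0 / (1.0 - f))
    matches = []
    for x in product((0, 1), repeat=n):
        if abs(info_content_per_bit(x, f) - h2) >= beta:
            continue                     # x is not in the typical set
        s = tuple(sum(h * xi for h, xi in zip(row, x)) % 2 for row in H)
        if s == tuple(z):
            matches.append(x)
    return matches[0] if len(matches) == 1 else None


if __name__ == "__main__":
    n = 10
    identity = [[int(i == j) for j in range(n)] for i in range(n)]
    x = tuple(int(i == 3) for i in range(n))   # weight 1, typical for f = 0.1
    print(typical_set_decode(identity, x, f=0.1, beta=0.05))
```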

 The probability of error of the typical-set decoder, for
 a given matrix $\bH$, can be written as a sum of two terms, 
\beq
	P_{{\rm TS}|\bH} =  P^{(I)} + P^{(II)}_{{\rm TS}|\bH} ,
\eeq
 where $P^{(I)}$ is the probability that the true noise
 vector $\bx$ is itself not typical,
 and $P^{(II)}_{{\rm TS}|\bH}$ is the probability
 that the true $\bx$  is typical and at least one other typical vector
 clashes with it.
 The first probability vanishes as $N$ increases,
 as we proved when we first studied typical sets (\chref{ch.two}).
 We concentrate on the second probability.
% , the probability of a type-II error.
 To recap, we're imagining a true noise vector, $\bx$;
 and if {\em any\/} of the typical noise vectors
 $\bx'$, different from $\bx$, satisfies $\bH (\bx' - \bx) = 0$,
 then we have an error.
 We use the truth function
\beq
	\truth \! \left[   \bH (\bx' - \bx) = 0  \right],
\eeq
 whose value is one if the statement  $\bH (\bx' - \bx) = 0$ is true
 and zero otherwise.
 We can bound the number of type-II errors made when the noise is
 $\bx$ thus:
\newcommand{\xprimecondition}{\raisebox{-4pt}{\footnotesize\ensuremath{\bx'}:}
\raisebox{-3pt}[0.025in][0.0in]{% prevent it from hanging down and pushing other stuff down
\makebox[0.2in][l]{\tiny$\!\begin{array}{l} {\tiny\bx' \!\in T}\\
                   {\tiny\bx' \! \neq \bx} \end{array}$}}}
\beq
	\left[\mbox{Number of errors given $\bx$ and $\bH$}\right] \leq \sum_{\xprimecondition}
	\truth\! \left[   \bH (\bx' - \bx) = 0  \right] .
\label{eq.lt.union}
\eeq
 The number of errors is either zero or one; the sum on the
 right-hand side may exceed one,\marginpar{\small\raggedright{\Eqref{eq.lt.union}
 is a \ind{union bound}.}}
 in cases where several typical noise
 vectors have the same syndrome.
 
 We can now write down the probability of a type-II error
 by  averaging over $\bx$:
\beq
	 P^{(II)}_{{\rm TS}|\bH} \leq \sum_{\bx \in T}  P(\bx)
	\sum_{\xprimecondition}   \truth\! \left[   \bH (\bx' - \bx) = 0  \right] . 
\eeq
 Now, we will find the average of this   probability of  type-II error
 over all linear codes by averaging over $\bH$.
 By showing that  the {\em average\/}  probability of  type-II error
 vanishes, we will thus show that there exist linear
 codes with vanishing error probability, indeed, that
 almost all linear codes are very good.

 We denote averaging over all binary matrices $\bH$ by $\left< \ldots \right>_{\bH}$.
 The average probability of type-II error is
\beqan
	 \bar{P}^{(II)}_{{\rm TS}}
& =&
\sum_{\bH} P(\bH)
	 P^{(II)}_{{\rm TS}|\bH} \: = \:
	\left<  P^{(II)}_{{\rm TS}|\bH} \right>_{\bH}
\\
& \leq &
\left< 
  \sum_{\bx \in T}  P(\bx)
	\sum_{\xprimecondition}   \truth\! \left[   \bH (\bx' - \bx) = 0  \right] 
	  \right>_{\!\bH}
\\
&=&
  \sum_{\bx \in T}  P(\bx)
	\sum_{\xprimecondition}
	\left< 
  \truth\! \left[   \bH (\bx' - \bx) = 0  \right] 
	  \right>_{\bH}
 .
\eeqan
 Now, the quantity
$\left< 
  \truth\! \left[   \bH (\bx' - \bx) = 0  \right] 
	  \right>_{\bH}$ already cropped up
 when we
 were calculating the
 expected weight enumerator function of random linear codes (\secref{sec.wef.random}):
 for any non-zero binary vector $\bv$, the probability  that $\bH \bv =0$,
 averaging over all  matrices $\bH$, is $2^{-M}$.
 So
\beqan
	 \bar{P}^{(II)}_{{\rm TS}}
& \leq &
 \left( \sum_{\bx \in T}  P(\bx) \right)
	\left( |T| - 1 \right)
 2^{-M}\\
& \leq &
	|T| \: 2^{-M}
 ,
\eeqan
 where $|T|$ denotes the size of the typical set.
 As you will recall from \chref{ch.two}, there are roughly
 $2^{NH(X)}$ noise vectors in the typical set.
 So
\beqan
	 \bar{P}^{(II)}_{{\rm TS}}
& \leq &
	2^{NH(X)} 2^{-M}
 .
\eeqan
 This bound on the probability of error either vanishes
 or grows exponentially as $N$ 
 increases (remembering that
% , as we are fixing the code rate
% $R = 1-M/N$, 
 we are keeping $M$  proportional to $N$ as $N$ increases).
 It vanishes if
\beq
	H(X) < M/N     .
\eeq
% this clause is cuttable
% CUT ME?
% and grows  if 
%\beq
%	NH(X) > M  .
%\eeq
% end CUT ME
 Substituting $R=1-M/N$,
 we have thus established the
% positive half of Shannon's
 noisy-channel coding theorem for the binary symmetric channel:
 very good linear codes exist
%as long as 
%$H(X) < M/N$, \ie, as long as
 for any rate $R$ satisfying 
\beq
	R < 1-H(X) ,
\eeq
 where $H(X)$ is the  entropy of the channel
 noise, per bit.\ENDproof
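
 Two ingredients of this proof are easy to check numerically. The sketch
 below first confirms, by enumerating every $M \times N$ binary matrix for
 tiny $M$ and $N$, that a fixed non-zero vector $\bv$ satisfies
 $\bH \bv = 0$ for a fraction $2^{-M}$ of matrices. It then evaluates the
 base-2 logarithm of the bound $2^{NH(X)} 2^{-M}$ for a binary symmetric
 channel with flip probability $f$. The function names and the example
 values of $N$, $M$ and $f$ are illustrative.

```python
from fractions import Fraction
from itertools import product
from math import log2


def fraction_with_zero_syndrome(v, M):
    """Fraction of all M x N binary matrices H with H v = 0 (mod 2)."""
    n = len(v)
    hits = total = 0
    for bits in product((0, 1), repeat=M * n):
        rows = [bits[i * n:(i + 1) * n] for i in range(M)]
        total += 1
        if all(sum(h * x for h, x in zip(row, v)) % 2 == 0 for row in rows):
            hits += 1
    return Fraction(hits, total)


def log2_bound(N, M, f):
    """log2 of the bound 2^{N H_2(f)} 2^{-M} on the average type-II error."""
    h2 = f * log2(1.0 / f) + (1.0 - f) * log2(1.0 / (1.0 - f))
    return N * h2 - M


if __name__ == "__main__":
    print(fraction_with_zero_syndrome((1, 0, 1), M=2))
    print(log2_bound(1000, 500, 0.1))   # negative: the bound vanishes
    print(log2_bound(1000, 400, 0.1))   # positive: the bound is useless
```

 The sign of {\tt log2\_bound} changes where $M/N$ crosses $H_2(f)$,
 in agreement with the condition $H(X) < M/N$.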
\exercisxC{3}{ex.generalchannel}{
	Redo the proof for a more general channel.
}

\section{Data compression by linear hash codes}
 The decoding game we have just played can also\index{random code!for compression}
 be viewed as an {\dem\ind{uncompression}\/} game.\index{hash code}
 The world produces a binary  noise vector $\bx$
 from a source $P(\bx)$. The noise has redundancy (if the flip probability is not 0.5). We
 compress it with a linear compressor
 that maps the $N$-bit input $\bx$ (the noise) to
 the $M$-bit output $\bz$ (the syndrome).\index{hash function!linear}\index{hash code}
 Our uncompression task is to recover the
 input $\bx$ from the output $\bz$.
 The rate of the compressor is
\beq
	R_{\rm compressor} \equiv M/N .
\eeq
 [We don't care about the possibility of linear redundancies
 in our definition of the rate, here.]
 The result that we just found, that
 the decoding problem can be solved, for
 almost any $\bH$, with vanishing error probability,
 as long as $H(X) < M/N$, thus instantly
 proves a \ind{source coding theorem}:

\begin{quote}
	Given a binary  source $X$ of entropy $H(X)$, and
	a required compressed rate $R > H(X)$, there exists a 
	 linear compressor $\bx \rightarrow \bz = \bH  \bx \mod 2$
	having rate $M/N$ equal to that required rate $R$,
	and an associated uncompressor,
	that is virtually lossless.
\end{quote}

% To put it another way, if you have a source of
%  entropy $H(X)$ and you encode a string of
% $N$ bits from it using a \ind{hash code} (\chref{ch.hash})
% where the hash $\bz$ is of length $M$ bits,
% where $M > N H(X)$,
% a random linear hash function  $\bz = \bH  \bx \mod 2$
% is just as good (for collision avoidance) as a
% fully random hash function.
%% there are very unlikely  to be any collisions among
%% the hashes


{This theorem is true
 not only for a source of independent identically distributed
 symbols but also for any source for which a typical set can be defined:
 sources with memory, and time-varying sources, for example; all that's
 required is that the source be ergodic.
}
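
 The linear compressor $\bx \rightarrow \bz = \bH \bx \mod 2$ is simple to
 write down. The following sketch draws a random $\bH$ from a seeded
 generator and applies it to an $N$-bit input; the seed, the sizes and the
 function names are illustrative assumptions. The uncompressor is not
 shown, since it requires a search for a typical $\bx$ with the observed
 $\bz$.

```python
import random


def random_parity_check(M, N, seed=0):
    """An M x N binary matrix with independent uniform entries."""
    rng = random.Random(seed)
    return [[rng.randint(0, 1) for _ in range(N)] for _ in range(M)]


def compress(H, x):
    """Linear hash z = H x mod 2: N input bits -> M output bits."""
    return tuple(sum(h * xi for h, xi in zip(row, x)) % 2 for row in H)


def xor(a, b):
    """Bitwise sum mod 2 of two equal-length bit tuples."""
    return tuple((ai + bi) % 2 for ai, bi in zip(a, b))


if __name__ == "__main__":
    N, M = 12, 6                       # compressor rate M/N = 0.5
    H = random_parity_check(M, N)
    x = (0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0)
    print(compress(H, x))
```

 Because the map is linear, two inputs collide exactly when their
 difference has zero syndrome, which is the event averaged over $\bH$ in the
 proof.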

 
\subsection*{Notes}
 This method for proving that codes are
 good can be applied to
 other linear codes,
 such as low-density parity-check codes 
  \cite{mncN,McElieceMacKay00}.
 For each code we need an approximation of its expected weight
 enumerator function. 










%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%55
%
\dvips
% \chapter{Further exercises on information theory}
\chapter{Further Exercises on Information Theory}
% this was two chapters once
\label{ch_fInfo}
% {noisy channels}
\label{ch_f8}
\fakesection{Further exercises on noisy channels}
% I've been asked to include some exercises {\em without\/} worked
% solutions. Here are a few. Numerical solutions to some of them
% are provided on page \pageref{sec.solf8}.
%
 The most exciting exercises, which will introduce you
 to further ideas in information theory, 
 are towards the end of this chapter.
%\section{Exercises}
\subsection*{Refresher exercises on source coding and noisy channels}
\exercisaxB{2}{ex.X100}{
% from Yaser
 Let $X$ be an ensemble with $\A_X = \{0,1\}$ and $\P_X = \{ 0.995,
 0.005\}$.  Consider source coding
 using the block coding of $X^{100}$ where every $\bx
 \in X^{100}$ containing 3 or fewer 1s is assigned a distinct
 codeword, while the other $\bx$s are ignored. 
\ben
\item
 If the assigned codewords are all of the same length, find the minimum length 
 required to provide the above set with distinct codewords. 
\item
 Calculate the probability of getting an $\bx$ that will be ignored.
\een
}
\exercisaxB{2}{ex.0001}{
 Let $X$ be an ensemble with $\P_X = \{ 0.1,0.2,0.3,0.4 \}$. 
 The ensemble is encoded using the symbol
 code $\C = \{ 0001 , 001 , 01 , 1 \}$. 
 Consider the codeword corresponding to $\bx \in X^N$, where 
 $N$ is large.
\ben
\item
	Compute the entropy of the fourth bit of transmission. 
\item
	Compute the conditional entropy of the fourth bit given 
	the third bit.
\item
	Estimate the entropy of the hundredth bit.
\item 
	Estimate the conditional entropy of the hundredth bit given the
	ninety-ninth bit.
% \item
\een
}
\exercisaxA{2}{ex.dicetree}{
 Two fair dice are rolled by Alice and the sum is recorded. 
 Bob's task is to ask a sequence of questions with yes/no answers to 
 find out this number.  
 Devise in detail a strategy that achieves the minimum possible 
 average number of questions. 
}
% added Wed 22/1/03
\exercisxB{2}{ex.fairstraws}{
 How can you use a coin to \ind{draw straws} among 3 people?\index{straws, drawing}
}% my solution: arithmetic coding.
% perhaps use this in exam?
% - could also use exact sampling method! (see mcexact.tex)
\exercisxB{2}{ex.magicnumber}{
	In a {magic} trick,\index{puzzle!magic trick}
 there are three participants: the \ind{magician}, an assistant, and a volunteer.
 The assistant, who
 claims to have \ind{paranormal}\index{conjuror}\index{puzzle!magic trick}
 abilities, is in a soundproof room.
 The magician gives the volunteer six blank cards, five white and one blue.
 The volunteer writes a different integer from 1 to 100
 on each \ind{card}, as the magician is watching.
 The volunteer keeps the blue card.
 The magician arranges the five white cards in some order and passes them to the assistant.
 The assistant then announces the number on the blue card.

 How does the trick work?
}
% card trick
\exercisxB{3}{ex.magicnumber2}{
 How does {\em this\/} trick work?

\begin{quote}
`Here's an ordinary pack of cards, shuffled into random
order. Please choose five cards from the pack, any that you wish. Don't
let me see their faces. No, don't give them to me: pass them to my
assistant Esmerelda. She can look at them.

`Now, Esmerelda, show me four of the cards. Hmm$\ldots$ nine of spades, six of
clubs, four of hearts, ten of diamonds. The hidden card, then, must be the
queen of spades!'
\end{quote}

 The trick can be performed as described above\index{puzzle!magic trick}
 for a pack of 52 cards. Use information theory
 to give an upper bound
 on the  number of cards for which the trick can be performed.
%  (This exercise is much harder than \exerciseonlyref{ex.magicnumber}.)
% Hint: think of X = the 5 cards, Y = the sequence of 4 cards. How does H(X) compare with H(Y)?
% n choose 5 cf. n....(n-3) -> (n-4)/5! = 1 -> n=124.
}
% see l/iam for soln
\exercisxB{2}{ex.Hinfty}{
 Find a probability sequence $\bp = (p_1,p_2, \ldots)$ such that 
 $H(\bp) = \infty$. 
}
\exercisaxB{2}{ex.typical2488}{
 Consider a discrete memoryless source with $\A_X = \{a,b,c,d\}$
 and $\P_X =$ $\{1/2,1/4,$ $1/8,1/8\}$. There are $4^8 = 65\,536$ eight-letter 
 words that can be formed from the four letters.  Find the total number 
 of such words that are in the typical set $T_{N\beta}$ (equation \ref{eq.TNb})
 where  $N=8$ and $\beta = 0.1$.
%The definition of $T_{N\b}$, from 
% chapter \chtwo, is:% equation \ref{eq.TNb}
%\beq
%	T_{N\b} = \left\{ \bx\in\A_X^N : 
%	\left| \frac{1}{N} \log_2 \frac{1}{P(\bx)} - H \right| < \b
%	\right\} .
%\eeq
}
% source coding and channels...........
\exercisxB{2}{ex.sourcechannel}{
 Consider the source 
 $\A_S = \{ a,b,c,d,e\}$, 
 $\P_S = \{ \dthird, \dthird, \dfrac{1}{9}, \dfrac{1}{9}, \dfrac{1}{9} \}$ and the 
 channel  whose transition probability matrix is 
\beq
	Q =
 \left[
 \begin{array}{cccc}
	1 & 0 & 0 & 0 \\
	0 & 0   & \dfrac{2}{3} & 0 \\
	0 & 1   & 0   & 1 \\
	0 & 0   & \dthird & 0 \\
%	1 & 0 & 0 & 0 \\
%	0 & 0   & 1 & 0 \\
%	0 & \dfrac{2}{3} & 0 & \dthird \\
%	0 & 0   & 1 & 0 \\
\end{array}\right] .
\eeq
 Note that the source alphabet 
% $\A_S = \{a,b,c,d,e\}$
 has five symbols, but the channel
 alphabet $\A_X = \A_Y = \{0,1,2,3\}$
 has only four. Assume that the source produces symbols at
 exactly 3/4 the rate that the channel accepts channel symbols. For a
 given (tiny) $\epsilon>0$, explain how you would design a system for
 communicating the source's output over the channel with an 
% overall
 average error probability per source symbol
 less than $\epsilon$. Be as explicit as possible. 
 In particular, {\em do not\/} invoke Shannon's noisy-channel coding theorem.
}
% \subsection{Noisy Channels}
\exercisxB{2}{ex.C0000}{Consider a binary symmetric channel and a code 
 $C = \{ 0000,0011,1100,1111 \}$; assume that the 
 four codewords are used with probabilities 
 $\{ 1/2, 1/8,1/8,1/4\}$. 

 What is the decoding rule that minimizes the probability of 
 decoding error? [The optimal decoding rule depends on 
 the noise level $f$ of the binary symmetric channel. Give 
 the decoding rule for each range of values of $f$, for $f$ between 0 and
 $1/2$.]
}
\exercisaxA{2}{ex.C3channel}{
 Find the capacity and \optens\
% optimizing input distribution 
 for the three-input, three-output 
 channel whose transition probabilities are:
\beq
	Q = \left[
\begin{array}{ccc}
        1 & 0 & 0 \\
	0 & \dfrac{2}{3} & \dthird \\
	0 & \dthird & \dfrac{2}{3}
\end{array}\right] .
\eeq
}
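 A sketch of one route for the three-input channel above, assuming (by the symmetry
 between the second and third inputs) an input distribution of the form
 $\{ p, (1-p)/2, (1-p)/2 \}$; the symbol $p$ is introduced here for illustration.
 The first input is received without error, and the other two inputs
 see a binary symmetric channel with flip probability $1/3$, so
\[
	I(X;Y) = H_2(p) + (1-p) \left[ 1 - H_2(1/3) \right] .
\]
 Setting the derivative with respect to $p$ to zero gives
\[
	\frac{1-p}{p} = 2^{1 - H_2(1/3)} \simeq 1.058 ,
	\:\:\:\:
	p \simeq 0.486 ,
	\:\:\:\:
	C = \log_2 \left( 1 + 2^{1 - H_2(1/3)} \right) \simeq 1.04 \mbox{ bits} .
\]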
%
% I am not sure I like this ex: 
%
%\exercis{ex.Herrors}{
% Consider the $(7,4)$ Hamming code. 
%\ben\item
% What is the probability of bit error if 3 channel errors occur 
%	in a single block?
%\item
% What is the probability of bit error if 4 channel errors occur 
%	in a single block?
%\een
%}
% \end{document} 




% see also _e6.tex

%
% extra exercises do-able after chapter 6.
%
\fakesection{e6 exam qs}
\exercissxA{3}{ex.85channel}{
% Describe briefly the encoder for a $(7,4)$ Hamming code. 
%
% Assuming that one codeword of this code is sent over a 
% binary symmetric channel, define the {\em syndrome\/} $\bf z$
% of the received vector $\bf r$; state how many different possible syndromes
% there are; and state
% the maximum number of channel errors that the optimal decoder
%% code
% can correct.
%
% Define the {\em capacity\/} of a channel with input $x$ and output $y$
% and transition probability matrix $Q(y|x)$.
%
 The input to a channel $Q$ is a word of 8 bits. The output is also 
 a word of 8 bits. 
% A message block consisting of 8 bits is transmitted over a channel which 
 Each time it is used, the channel 
 flips {\em exactly one\/} of the transmitted bits, but
 the receiver does not know which one. The other 
 seven bits are received without error. All 8 bits are equally likely to 
 be the one that is flipped. Derive the capacity 
 of this channel. 

% Tough version:
%
% {\bf Either} show, by constructing an explicit encoder and decoder using a 
% linear (8,5) code that it 
% is possible to reliably communicate 5 bits per cycle 
% over this channel, {\bf or} prove that no such linear (8,5) code exists.
%
% Wimps version:

% practical
 Show, by describing an {\em explicit\/} encoder
% {\em and\/}
 and
 decoder that it 
 is possible {\em reliably\/} (that is, with 
 {\em zero\/} error probability)  to communicate 5 bits per cycle 
 over this channel.  
% Your description should be 

% {\em should I give a hint here?}
% [Hint: a solution exists that involves  a simple $(8,5)$ code.]
}
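 A sketch of the capacity calculation for the eight-bit channel above.
 Whatever the input, the output is one of eight equiprobable words, so
 $H(Y \given X) = \log_2 8 = 3$ bits; and a uniform distribution over the $2^8$ inputs
 makes the output uniform too, so $H(Y)$ attains its maximum of 8 bits. Hence
\[
	C = \max_{\P_X} \left[ H(Y) - H(Y \given X) \right] = 8 - 3 = 5 \mbox{ bits} .
\]
 One possible explicit scheme (an illustration, not necessarily the intended one)
 uses a $3 \times 8$ parity-check matrix whose columns are the eight distinct binary
 vectors of length 3. The $2^5$ words with zero syndrome serve as codewords, and
 the syndrome of the received word equals the column belonging to the flipped bit,
 which identifies that bit.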
\exercisxB{2}{ex.rstu}{
 A channel with input $x \in \{ {\tt a},{\tt b},{\tt c} \}$
  and output $y \in \{ {\tt r},{\tt s},{\tt t} ,{\tt u} \}$ 
 has conditional probability matrix:
\[
 \bQ = \left[
\begin{array}{ccc}
\dhalf & 0   & 0  \\
\dhalf & \dhalf & 0  \\
0 & \dhalf & \dhalf  \\
0 & 0 & \dhalf  \\
\end{array}
\right] .
\hspace{1in}
\begin{array}{c}
\setlength{\unitlength}{0.13mm}
\begin{picture}(100,140)(0,-20)
\put(18,0){\makebox(0,0)[r]{\tt c}}
\put(18,40){\makebox(0,0)[r]{\tt b}}
\put(18,80){\makebox(0,0)[r]{\tt a}}
%
\multiput(20,0)(0,40){3}{\vector(2,1){36}}
\multiput(20,0)(0,40){3}{\vector(2,-1){36}}
%
\put(62,-20){\makebox(0,0)[l]{\tt u}}
\put(62,20){\makebox(0,0)[l]{\tt t}}
\put(62,60){\makebox(0,0)[l]{\tt s}}
\put(62,100){\makebox(0,0)[l]{\tt r}}
\end{picture}
\end{array}
\]
 What is its capacity?
}
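 A sketch for the channel above. Every input is mapped to one of two outputs with
 probability $\dhalf$ each, so $H(Y \given X) = 1$ bit for any input distribution.
 The output entropy cannot exceed $\log_2 4 = 2$ bits, and the input distribution
 $\{ \dhalf, 0 , \dhalf \}$ over $\{ {\tt a},{\tt b},{\tt c} \}$ makes the four outputs
 equiprobable, so
\[
	C = 2 - 1 = 1 \mbox{ bit} .
\]
 Equivalently, using only {\tt a} and {\tt c} gives a noiseless binary channel, because
 their output sets $\{ {\tt r},{\tt s} \}$ and $\{ {\tt t},{\tt u} \}$ do not overlap.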
\exercisxB{3}{ex.isbn}{
 The ten-digit number  on the cover of a book known as the\index{book ISBN}
 \ind{ISBN}\amargintab{t}{
\begin{center}
\begin{tabular}{l}
 0-521-64298-1  \\
 1-010-00000-4 \\
\end{tabular}
\end{center}
\caption[a]{Some valid ISBNs.
 [The hyphens 
 are included for legibility.]
}
}
 incorporates an error-detecting code. 
 The number consists of nine source digits $x_1,x_2,\ldots,x_{9}$,
 satisfying $x_n \in \{ 0,1,\ldots,9 \}$, and a tenth check 
 digit whose value is given by
\[
	x_{10} = \left( \sum_{n=1}^{9} n x_n \right) \mod 11 .
\]
 Here $x_{10} \in  \{ 0,1,\ldots,9 , 10 \}.$ If $x_{10} = 10$ then 
 the tenth digit is shown using the roman numeral X.
% $\tt X$.
% For example,  1-010-00000-4 is a valid ISBN. 
% bishop
% 0-19-853864-2
% see lewis:con/isbn.p

 Show that a valid ISBN satisfies: 
\[
	 \left( \sum_{n=1}^{10} n x_n \right) \mod 11 = 0 .
\]
 Imagine that an ISBN is communicated over an unreliable human 
 channel which sometimes {\em modifies\/} digits and  sometimes 
 {\em reorders\/}  digits.

 Show that this code can be used to detect (but not correct)
 all errors in which 
 any one of the ten digits is modified (for example,
 1-010-00000-4 $\rightarrow$ 1-010-00080-4).

 Show that this code can be used to detect all errors in which 
 any two adjacent digits are transposed (for example,
 1-010-00000-4 $\rightarrow$  1-100-00000-4).

 What other transpositions of pairs of {\em non-adjacent\/}
 digits can be detected?
 
%  What types of error can be detected {\em and corrected?}

 If the tenth digit were defined 
 to be
\[
	x_{10} = \left( \sum_{n=1}^{9} n x_n \right) \mod 10 ,
\]
 why would the code not work so well? (Discuss the detection of 
% errors
% involving 
 both modifications of single digits and transpositions 
 of  digits.)
}
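 A sketch of the algebra behind the ISBN exercise above; the symbols $m$ and $\delta$
 are introduced here for illustration. Using the definition of $x_{10}$,
\[
	\sum_{n=1}^{10} n x_n = \sum_{n=1}^{9} n x_n + 10 x_{10}
	\equiv x_{10} + 10 x_{10} = 11 x_{10} \equiv 0  \mod 11 .
\]
 Changing digit $n$ by $\delta \neq 0$ changes this sum by $n \delta$, and
 exchanging the digits in positions $m$ and $n$ changes it by
\[
	(m x_n + n x_m) - (m x_m + n x_n) = (m-n)(x_n - x_m) .
\]
 Because 11 is prime and $n$, $|\delta|$, $|m-n|$ and $|x_n - x_m|$ are all at most 10,
 neither change is a multiple of 11 unless it is zero.
 With 10 in place of 11 this argument fails; for example, $5 \times 2 \equiv 0 \mod 10$.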
\exercisaxA{3}{ex.two.bsc.choose}{
 A\marginpar{\[
\setlength{\unitlength}{0.17mm}
\begin{picture}(100,140)(0,-45)
\put(15,-40){\makebox(0,0)[r]{d}}
\put(15,0){\makebox(0,0)[r]{{c}}}
\put(15,40){\makebox(0,0)[r]{b}}
\put(15,80){\makebox(0,0)[r]{a}}
\put(20,0){\vector(1,0){34}}
\put(20,40){\vector(1,0){34}}
\put(20,-40){\vector(1,0){34}}
\put(20,80){\vector(1,0){34}}
\put(20,40){\vector(1,1){34}}
% \put(20,40){\vector(1,-1){34}}
\put(20,-40){\vector(1,1){34}}
\put(20,0){\vector(1,-1){34}}
% \put(20,0){\vector(1,1){34}}
\put(20,80){\vector(1,-1){34}}
%
\put(65,-40){\makebox(0,0)[l]{d}}
\put(65,0){\makebox(0,0)[l]{c}}
\put(65,40){\makebox(0,0)[l]{b}}
\put(65,80){\makebox(0,0)[l]{a}}
\end{picture}
\]
}
 channel with input $x$ and output $y$ has transition probability matrix:
\[
 Q = \left[
\begin{array}{cccc}
1-f & f & 0 & 0 \\
f & 1-f & 0 & 0 \\
0 & 0 & 1-g & g \\
0 & 0 & g & 1-g 
\end{array}
\right] .
\]
 Assuming an input distribution of the form 
\[
 {\cal P}_X
 = \left\{ \frac{p}{2}, \frac{p}{2} , \frac{1-p}{2} , \frac{1-p}{2} \right\},
\]
 write down the entropy of the output, $H(Y)$, and the 
 conditional entropy of the output given the input, $H(Y|X)$.

 Show that the optimal input distribution 
 is given by 
\[
% corrected!
        p = \frac{1}{1 + 2^{-H_2(g) + H_2(f) }} ,
\]
 where $H_2(f) = f \log_2 \frac{1}{f}  + 
 (1-f) \log_2 \frac{1}{(1-f)}$.

% CUTTABLE
% [You may find the identity 
% $\frac{\d}{\d p}  H_2(p) = \log_2 \frac{1-p}{p}$ helpful.]
\marginpar{\small\raggedright{Remember
 $\frac{\d}{\d p}  H_2(p) = \log_2 \frac{1-p}{p}$.}}

 Write down the optimal input distribution and
 the capacity of the channel in the  case $f=1/2$, $g=0$, 
 and comment on your answer. 
}
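 A sketch of the steps for the exercise above, offered as a check on your own working.
 For the given input distribution the four outputs have probabilities
 $\{p/2,p/2,(1-p)/2,(1-p)/2\}$, so
\[
	H(Y) = H_2(p) + 1 , \:\:\:\:
	H(Y|X) = p H_2(f) + (1-p) H_2(g) .
\]
 Differentiating $I(X;Y) = H(Y) - H(Y|X)$ with respect to $p$ and setting the result to zero,
\[
	\log_2 \frac{1-p}{p} = H_2(f) - H_2(g) ,
\]
 which rearranges to the stated expression for $p$. For $f=1/2$, $g=0$ this gives $p = 1/3$ and
\[
	C = H_2(1/3) + 1 - 1/3 = \log_2 3 \simeq 1.585 \mbox{ bits} .
\]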

\exercisxB{2}{ex.detect.vs.correct}{
 What are the differences in the redundancies needed
 in an error-detecting code (which can reliably
 detect that a block of data has been corrupted)
 and an error-correcting code (which can detect and
 correct errors)?
}






% difficult exercises see _e7
% \input{tex/_fInfo.tex}
% included directly by thebook.tex after _f8.tex
\subsection{Further tales from information theory}
 The following exercises give you the chance to 
 discover for yourself the answers to some  more surprising 
 results of information theory.
% \subsection{Further tales from information theory}
% \input{tex/_e7.tex}
% \noindent
\ExercisxC{3}{ex.corrinfo}{ 
% \item[Communication of correlated information.]
{\sf Communication of  information from correlated
% dependent  <--- would be better, but I want to keep same name for exercise as in first edn.
 sources.}\index{channel!with dependent sources}
 Imagine that we want to communicate data from
 two data sources $X^{(A)}$ and $X^{(B)}$ to a central 
 location C via noise-free one-way \index{communication!of dependent information}{communication} channels (\figref{fig.achievableXY}a). 
 The signals   $x^{(A)}$ and $x^{(B)}$ are strongly
 dependent, so their joint information 
 content is only a little greater than the marginal information 
 content of either of them.
 For example,
 C is  a \ind{weather collator} who wishes to  receive a string of
 reports  saying
 whether it is raining in Allerton ($x^{(A)}$)
 and whether it is raining in Bognor ($x^{(B)}$).
 The joint probability of $x^{(A)}$ and $x^{(B)}$ might be 
\beq
\fourfourtabler{{$P(x^{(A)},x^{(B)})$}}{$x^{(A)}$}{{\mathsstrut}0}{{\mathsstrut}1}{{\mathsstrut}$x^{(B)}$}{0.49}{0.01}{0.01}{0.49} 
%\fourfourtable{\makebox[0.2in][r]{$P(x^{(A)},x^{(B)})$}}{$x^{(A)}$}{{\mathsstrut}0}{{\mathsstrut}1}{{\mathsstrut}$x^{(B)}$}{0.49}{0.01}{0.01}{0.49} 
%\:\: 
%\begin{array}{c|cc}
%x^{(A)} :x^{(B)}	&	0  & 1 \\ \hline
%0		&	0.49 & 0.01 \\
%1		&	0.01 & 0.49 \\
%\end{array}
\eeq
 The weather collator would like to know $N$ successive
 values of $x^{(A)}$ and $x^{(B)}$
 exactly, but, since he has  to pay for every bit
 of information he receives,
 he is interested in the possibility of avoiding buying 
 $N$ bits from source $A$
 {\em and\/} $N$ bits from source $B$.
 Assuming that  variables $x^{(A)}$ and $x^{(B)}$ are generated 
 repeatedly from this distribution,  can they be encoded at rates $R_A$
 and $R_B$ 
 in such a way that C can reconstruct all the variables, with the 
 sum of information transmission rates on the two lines  being less than two
 bits per cycle?
% For  simplicity, assume that the 
% one-way communication channels are noise-free binary channels.

% Encoding of correlated sources. Slepian Wolf (409)
\begin{figure}
\figuremargin{%
\begin{center}\small
\begin{tabular}{cc}
\raisebox{0.71in}{(a)\hspace{0.2in}{\input{tex/corrinfo.tex}}} &
\mbox{(b)\footnotesize
\setlength{\unitlength}{0.075in}
\begin{picture}(28,21)(-7.5,-1)
\put(0.3,0){\makebox(0,0)[bl]{\psfig{figure=figs/achievableXY.eps,width=1.5in}}}
\put(0,6.5){\makebox(0,0)[r]{\footnotesize$H(X^{(B)} \given X^{(A)})$}}
\put(0,14){\makebox(0,0)[r]{\footnotesize$H(X^{(B)})$}}
\put(0,17.5){\makebox(0,0)[r]{\footnotesize$H(X^{(A)},X^{(B)})$}}
\put(0,20){\makebox(0,0)[r]{\footnotesize$R_B$}}
% 
\put(20,-0.27){\makebox(0,0)[t]{\footnotesize$R_A$}}
\put(2.5,-0.5){\makebox(0,0)[t]{\footnotesize$H(X^{(A)} \given X^{(B)})$}}
\put(12,-0.5){\makebox(0,0)[t]{\footnotesize$H(X^{(A)})$}}
%\put(15,-0.5){\makebox(0,0)[t]{\footnotesize$H(X^{(A)},X^{(B)})$}}
\end{picture}
}\\
\end{tabular}
\end{center}
}{%
\caption[a]{Communication of 
% correlated
 information from dependent sources.
 (a)
% The communication situation:
 $x^{(A)}$ and $x^{(B)}$ are dependent
 sources (the dependence is represented by the dotted arrow).
 Strings of values of each variable are encoded using 
 codes of rate $R_A$ and $R_B$ into transmissions
 $\bt^{(A)}$ and $\bt^{(B)}$, which are communicated 
 over  noise-free channels to a receiver $C$.
 (b) The achievable rate region.
 Both strings can be conveyed
 without error even though $R_A < H(X^{(A)})$ and
 $R_B < H(X^{(B)})$.
}
%
% this copy is all ready to work on......
%
% cp achievableXY.fig achievableXYAB.fig
\label{fig.achievableXY}
}%
\end{figure}
 The answer, which you should demonstrate,\index{dependent sources}\index{correlated sources} 
%\index{Slepian--Wolf|see{dependent sources}}
 is indicated in \figref{fig.achievableXY}.
 In the general 
 case of two dependent sources $X^{(A)}$ and $X^{(B)}$, there exist codes for 
 the two  transmitters  that can achieve reliable communication 
 of both $X^{(A)}$ and $X^{(B)}$ to C, as long as: the information rate from 
 $X^{(A)}$, $R_A$, exceeds $H(X^{(A)} \given X^{(B)})$; the information rate from 
 $X^{(B)}$, $R_B$, exceeds $H(X^{(B)} \given X^{(A)})$; and the total information rate 
 $R_A+R_B$ exceeds the joint entropy $H(X^{(A)},X^{(B)})$ \cite{SlepianWolf}.
% In the general 
% case of two correlated sources $X$ and $Y$, there exist codes for 
% the two  transmitters  that can achieve reliable communication 
% of both $X$ and $Y$ to C, as long as: the information rate from 
% $X$, $R(X)$, exceeds $H(X \given Y)$; the information rate from 
% $Y$, $R(Y)$, exceeds $H(Y \given X)$; and the total information rate 
% $R(X)+R(Y)$ exceeds the joint information $H(X,Y)$.

 So in the case of $x^{(A)}$ and $x^{(B)}$ above, each transmitter must transmit 
 at a rate greater than $H_2(0.02) = 0.14$ bits, and the total 
 rate $R_A+R_B$ must be greater than 1.14 bits, for example $R_A=0.6$, $R_B=0.6$. 
 There exist codes that can achieve these rates. Your task is to 
 figure out why this is so.

 Try to find an explicit  solution  in which one of the sources
 is sent as plain text,  $\bt^{(B)} = \bx^{(B)}$, and the other is
 encoded. 
}
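 The numbers quoted in the exercise above can be checked as follows (a sketch, using only
 the joint distribution in the table). Each source is marginally uniform, and
 the two values differ with probability $0.01 + 0.01 = 0.02$, so
\[
	H(X^{(A)}) = H(X^{(B)}) = 1 , \:\:\:\:
	H(X^{(A)} \given X^{(B)}) = H(X^{(B)} \given X^{(A)}) = H_2(0.02) \simeq 0.14 ,
\]
\[
	H(X^{(A)},X^{(B)}) = 1 + H_2(0.02) \simeq 1.14 \mbox{ bits} .
\]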
% \end{description}

%\noindent
\ExercisxC{3}{ex.multaccess}{ 
 {\sf \index{multiple access channel}Multiple
  access channels}.\index{channel!multiple access}
 Consider a channel with two sets of 
 inputs and one output --
 for example, a shared telephone line (\figref{fig.achievableAB}a). 
 A simple model system has two binary inputs $x^{(A)}$ and $x^{(B)}$ and a ternary output $y$
 equal to the arithmetic sum of the two inputs, that's 0, 1 or 2.
 There is no noise.  Users $A$ and $B$ cannot communicate with each other, and they
 cannot hear the output of the channel.
 If the output is a 0, the receiver can be certain that both inputs 
 were set to 0;
 and if the output is a 2, the receiver can be certain that both inputs 
 were set to 1. But if the output is 1, then it could be that the input
 state was $(0,1)$ or $(1,0)$.
 How should users $A$ and $B$ use this channel so that their messages
 can be deduced from the received signals? How fast can $A$
 and $B$ communicate?

 Clearly the total information rate from $A$ and $B$ 
 to the receiver cannot be two bits. On the other hand it is easy to achieve 
 a total information rate $R_A + R_B$ of one bit.  Can reliable communication 
 be achieved at rates $(R_A,R_B)$ such that $R_A + R_B> 1$?
\begin{figure}
\figuremargin{%
\begin{center}
\begin{tabular}{l}
(a) \hspace{0.1in}{\input{tex/multacc.tex}} \\[0.1in]
(b)\hspace{0.2in}\fourfourtabler{$y$}{$x^{(A)}$}{{\mathsstrut}$\:0\:$}{{\mathsstrut}$\:1\:$}{{\mathsstrut}$x^{(B)}$}{0}{1}{1}{2}\hspace{0.5492in} 
%(c)\raisebox{-0.425in}{\psfig{figure=figs/achievableAB.eps,angle=-90,width=2in}}
(c)\raisebox{-0.25in}{\mbox{\epsfbox{metapost/channels.1}}}
\end{tabular}
\end{center}
}{%
\caption[a]{Multiple access channels.
 (a) A general multiple access channel with two transmitters and one receiver.
 (b) A binary multiple access channel with output
% given by adding the
 equal to the sum of 
two inputs. 
 (c) The achievable region. }
\label{fig.achievableAB}
}%
\end{figure}

 The answer is indicated in \figref{fig.achievableAB}.
% There exist codes for 
% the two  transmitters  such that  the rates $(R(A),R(B))$ can be 
% any point in the convex hull of 
% $\{(1,0),$ $(1,.5),$  $(.5,1),$  (0,1), $(0,0)\}$.

 Some practical codes for multi-user channels are presented in \citeasnoun{RatzerMacKay2003}.
}
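 A sketch of where the corner points of the achievable region come from, assuming
 that the two users choose their inputs independently and uniformly at random
 (an assumption made here for illustration). The output then takes the values
 0, 1, 2 with probabilities $\{ 1/4, 1/2, 1/4 \}$, so
\[
	H(Y) = \frac{1}{4} \log_2 4 + \frac{1}{2} \log_2 2 + \frac{1}{4} \log_2 4 = 1.5 \mbox{ bits} .
\]
 Since the channel is noiseless, $H(Y)$ is the total information reaching the receiver,
 which suggests $R_A + R_B \leq 1.5$; and if $x^{(B)}$ is known then $y$ reveals $x^{(A)}$
 exactly, which suggests the corner $(R_A,R_B) = (1,0.5)$ and, by symmetry, $(0.5,1)$.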
%
% answer anything in the convex hull of 1,0, 1,.5  .5,1  0,1, 0,0
%

\ExercisxC{3}{ex.broadcast}{ 
{\sf \index{broadcast channel}Broadcast channels}\index{channel!broadcast}.
 A broadcast channel consists of a single transmitter and
 two or more receivers. The properties of the
 channel are defined by a conditional distribution
 $Q(y^{(A)},y^{(B)} \given  x)$. (We'll assume the channel is memoryless.)
%\begin{figure}
%\figuremargin{%
\amarginfig{t}{
\begin{center}\footnotesize\small
\raisebox{0in}{%(a)\hspace{0.2in}
{\input{tex/broadcast.tex}}}
%\hspace{0.4in}
%(b)
% \mbox{\psfig{figure=figs/achievableXY.eps,angle=-90,width=2in}}
\end{center}
%}{%
\caption[a]{The broadcast channel. $x$ is
 the channel input; $y^{(A)}$ and $y^{(B)}$ are the outputs.
% (b) The achievable rate region.
}
%
% this copy is all ready to work on......
%
% cp achievableXY.fig achievableXYAB.fig
\label{fig.achievableBroadcast}
}%
%\end{figure}
 The task is to add an encoder and two decoders to enable
 reliable communication\index{communication!broadcast} of
 a common message at rate $R_0$ to both receivers,
 an individual message at rate $R_A$ to receiver $A$, 
 and an individual message at rate $R_B$ to receiver $B$.
 The {\dem{capacity}} region of the broadcast channel
 is the convex hull  of the set of achievable rate triplets $(R_0,R_A,R_B)$.

 A simple benchmark for such a channel is given by
 time-sharing
%
% had to move the figure down a bit to avoid clash
% it was here
%
 (\ind{time-division} signaling). If the capacities of the
 two channels, considered separately, are $C^{(A)}$ and
 $C^{(B)}$, then by devoting a fraction $\phi_A$ of
 the transmission
 time to channel $A$ and $\phi_B\eq 1\!-\!\phi_A$ to channel $B$, we can achieve
 $(R_0,R_A,R_B) = (0,\phi_A C^{(A)},\phi_B C^{(B)})$.

\amarginfig{t}{
\begin{center}\footnotesize\small
\setlength{\unitlength}{0.03975in}
\begin{picture}(28,21)(-7.5,-2.91)
\put(0.3,0){\vector(1,0){20}}
\put(0.3,0){\vector(0,1){20}}
\put(0.3,15){\line(1,-1){15}}
\put(0,15){\makebox(0,0)[r]{\footnotesize$C^{(B)}$}}
\put(-0.40,20){\makebox(0,0)[r]{\footnotesize$R_B$}}
% 
\put(22,-1.5){\makebox(0,0)[t]{\footnotesize$R_A$}}
\put(15,-1.3){\makebox(0,0)[t]{\footnotesize$C^{(A)}$}}
\end{picture}
\end{center}
%}{%
\caption[a]{Rates achievable by simple timesharing.}
%
% this copy is all ready to work on......
%
% cp achievableXY.fig achievableXYAB.fig
\label{fig.timesharing}
}%
 We can do better than this, however.
% To borrow an analogy from Cover and Thomas.
 As an analogy, imagine speaking simultaneously to an American
 and  a \ind{Belarusian};
%\ind{Golgafrinchan} \ind{telephone sanitizer};
 you are fluent in \ind{American}
 and in \ind{Belarusian}, but
% , needless to say,
 neither
 of your two receivers understands the
 other's language. If each receiver can distinguish
 whether a word is in their own language or not,
 then an extra binary file can be conveyed to both recipients
 by using its bits to decide whether the next transmitted
 word should be from the   American source text or from the
% \ind{Golgafrinchan}
 \ind{Belarusian} source text. Each recipient can concatenate
 the words that they understand in order to receive their personal
 message, and can also recover the binary string.

 An example of a broadcast channel consists  of two
 binary symmetric channels with a common input. The two halves
 of the channel
 have  flip probabilities
 $f_A$ and $f_B$. We'll assume that $A$ has the better
 half-channel, \ie, $f_A < f_B < \dhalf$.
 [A closely related  channel is a
 `degraded' broadcast channel, 
 in which  the conditional probabilities are such that
 the random variables have the structure of a Markov chain,
\beq
	x \rightarrow y^{(A)} \rightarrow y^{(B)},
\eeq
 \ie, $y^{(B)}$ is a further degraded version of $y^{(A)}$.]
 In this special case,  it turns out that whatever information
 is getting through to receiver $B$ can also be recovered by
 receiver $A$.
% stolen from Blahut
% [This is obvious for the degraded channel, 
 So there is no point distinguishing between $R_0$ and $R_B$:
 the task is to find the capacity region for the rate pair $(R_0,R_A)$,
 where $R_0$ is the rate of information reaching both $A$ and $B$,
 and $R_A$ is the rate of the extra information reaching $A$.

 The following exercise is equivalent to this one,
 and a solution to it is illustrated in
 \figref{fig.broadcastIII}.
% Blahut page 338.
% Cover and Thomas page 
}
\ExercisxC{3}{ex.broadcastII}{ 
{\sf Variable-rate error-correcting codes
 for\index{channel!unknown noise level}\index{error-correcting code!variable rate}\index{variable-rate error-correcting codes}
 {channels with unknown noise level}}.
 In real life,%
\marginfig{
\begin{center}
\mbox{\psfig{figure=figs/broadcastII.eps,angle=0,width=1.27in}}
\end{center}
\caption[a]{Rate of reliable communication $R$,
 as a function of noise level $f$, for Shannonesque
 codes designed to operate at noise levels $f_A$ (solid line)
 and $f_B$ (dashed line).}
\label{fig.broadcastII}
}
 channels may sometimes not be well characterized
 before the encoder is installed. As a model
 of this situation,  imagine that a channel
 is known to be a binary symmetric channel with noise level
  either $f_A$ or $f_B$. Let $f_B>f_A$, and let the
 two capacities be $C_A$ and $C_B$.

 Those who like to live dangerously might install a system
 designed for   noise level $f_A$
 with rate $R_A \simeq C_A$; in the event that the noise level
 turns out to be $f_B$, our experience of Shannon's theories
 would lead us to expect that there would be a catastrophic failure
 to communicate
 information reliably (solid line in \figref{fig.broadcastII}).%
\amarginfig{t}{
\begin{center}
\mbox{\psfig{figure=figs/broadcastIIa.eps,angle=0,width=1.27in}}
\end{center}
\caption[a]{Rate of reliable communication $R$,
 as a function of noise level $f$, for a desired
 {\dem{variable-rate}} code.}
\label{fig.broadcastIIa}
}

 A conservative approach would design the encoding system
 for the worst-case scenario, installing a code with rate $R_B
 \simeq C_B$   (dashed line in \figref{fig.broadcastII}).
 In the event that the lower noise level, $f_A$, holds
 true,  the managers would have a feeling of regret
 because of the wasted  capacity difference $C_A - R_B$.


 Is
 it possible to create a system that not only transmits
 reliably at some rate $R_0$ whatever the noise level,
 but also communicates some extra,
 `lower-priority'\index{priority of bits in a message}
 bits if the noise level is low, as shown in\index{error-correcting code!with varying level
 of protection}
 \figref{fig.broadcastIIa}?
 This code communicates
 the high-priority bits reliably at all noise levels
 between  $f_A$
 and $f_B$, and communicates the low-priority bits also
 if the noise level is $f_A$ or below.

 This problem is mathematically equivalent to the
 previous problem, the degraded \ind{broadcast channel}.\index{channel!broadcast}
 The lower rate of communication was there called $R_0$, and
 the rate at which the low-priority bits are communicated if
 the noise level is low was called $R_A$.
\amarginfig{t}{
\begin{center}
\raisebox{0.1in}{\psfig{figure=figs/broadcastans.ps,angle=-90,width=1.27in}}
\end{center}
\caption[a]{An achievable region for
 the channel with unknown noise level.
 Assuming the two possible noise levels
 are $f_A=0.01$ and $f_B=0.1$, the
  dashed lines show the rates $R_A,R_B$ that
 are achievable using a simple time-sharing approach,
 and the solid line shows rates achievable using a more
 cunning approach.  
}
\label{fig.broadcastIII}
}
% load 'broadcast.gnu'

 An illustrative answer  is shown in \figref{fig.broadcastIII},
 for the case  $f_A=0.01$ and $f_B=0.1$.
 (This figure also shows the
 achievable region for a broadcast channel whose
 two half-channels have noise levels $f_A=0.01$ and $f_B=0.1$.)
 I admit I find the gap between the simple time-sharing
 solution and the cunning solution disappointingly small.

 In \chref{chdfountain} we will discuss codes for a
 special class of broadcast channels, namely erasure channels,
 where every symbol is either received without error or erased.
 These codes have the nice property that they are {\dem rateless} --
 the number of symbols  transmitted is determined on the fly  such that
 reliable communication is achieved, whatever the erasure statistics of
 the channel.

}
% \begin{description}
%\item[Multiterminal information networks]
% \noindent
\ExercisxC{3}{ex.multiterminal}{ 
{\sf \index{multiterminal networks}{Multiterminal information networks}}\index{channel!multiterminal}
 are both important practically and 
	intriguing theoretically. Consider  the following example of a two-way
 binary channel (\figref{fig.achievabletwo}a,b): 
 	 two people both wish to talk over the channel,
 and they both want to hear what 
 	the other person is saying; but you can only hear 
	the signal transmitted by the other person if you are transmitting 
	a zero. What simultaneous information rates from $A$ to $B$ and 
 	from $B$ to $A$ can be achieved, and how?  Everyday examples 
 of such networks include
 the VHF channels used by ships, and computer Ethernet networks (in which
 {\em all\/} the devices are unable to hear {\em anything\/}
 if two or more devices are broadcasting simultaneously).
\begin{figure}
\figuremargin{%
\begin{center}
\mbox{{\footnotesize{(a)}}
\setlength{\unitlength}{0.07in}
\begin{picture}(50,10)(0,2.5)
\put(4,10){\makebox(0,0)[r]{$x^{(A)}$}}
\put(4,5){\makebox(0,0)[r]{$y^{(A)}$}}
\put(5,10){\vector(1,0){5}}
\put(10,5){\vector(-1,0){5}}
\put(10,2.5){\framebox(25,10){$P(y^{(A)},y^{(B)}| x^{(A)} , x^{(B)} )$}}
\put(41,10){\makebox(0,0)[l]{$y^{(B)}$}}
\put(41,5){\makebox(0,0)[l]{$x^{(B)}$}}
\put(35,10){\vector(1,0){5}}
\put(40,5){\vector(-1,0){5}}
\end{picture}

}\\[0.2in]
\mbox{
{\footnotesize{(b)}}\hspace{0.2in}
{%\footnotesize
\fourfourtabler{$y^{(A)}$}{$x^{(A)}$}{{\mathsstrut}$\:0\:$}{{\mathsstrut}$\:1\:$}{{\mathsstrut}$x^{(B)}$}{0}{0}{1}{0}\hspace{0.2in} 
\fourfourtabler{$y^{(B)}$}{$x^{(A)}$}{{\mathsstrut}0}{{\mathsstrut}1}{{\mathsstrut}$x^{(B)}$}{0}{1}{0}{0}
}}
\\[0.2in]
\mbox{
\hspace{0.4in}
{\footnotesize{(c)}}
\hspace{-0.2in}
\raisebox{-0.1in}{\psfig{figure=figs/twoway.ps,angle=-90,height=1.8in,width=2.45in}}}
\end{center}
}{%
\caption[a]{(a) A general two-way channel. 
(b) The rules for a binary two-way channel. The two tables show the
 outputs $y^{(A)}$ and  $y^{(B)}$ that result for each state  of the inputs.
(c) Achievable region for the two-way binary channel.
 Rates below the solid line are achievable. 
 The dotted line shows the `obviously achievable' region which 
 can be attained by simple time-sharing.}
\label{fig.achievabletwo}
}%
\end{figure}
%gnuplot> plot "twoway1.4" u 1:2 w l 1, "twoway1.4" u 2:1 w  l 1,1-x w l 2
%gnuplot> set term post
%Terminal type set to 'postscript'
%Options are 'landscape monochrome dashed "Helvetica" 14'
%gnuplot> set output "twoway.ps"
%gnuplot> replot
%
% generated using figs/twoway.p

 Obviously, we can achieve rates of $\dhalf$ in both directions by simple
 time-sharing. But can the two information rates be made larger? 
 Finding  the  capacity of a  general two-way
 channel  is still an open problem. However, 
 we can obtain interesting results concerning achievable points for the simple
 binary channel discussed above, as
 indicated in \figref{fig.achievabletwo}c. There exist codes that can achieve 
 rates up to the boundary shown. 
 There may exist better codes too. 
% cover 457
% using independently generated codes you can prove that the following rate region is achievable: R1 < I(X1;Y2 \given X2), R2 ...
}
%gnuplot> print (1+sqrt(5))/2 - 1
%        0.618034
%gnuplot> print (1+sqrt(5))/2 - 2
%        -0.381966
\exercissxB{2}{ex.count.ones}{
 If a file containing a fraction $f=0.5$ {\tt 1}s is   
 transmitted by $C_2$, what fraction of
 the transmitted stream is {\tt 1}s?

	What fraction of the transmitted bits is {\tt 1}s 
 if we drive code $C_2$ with a sparse source of density  $f = 0.38$?
}
% answer f/(1+f) =
% (gamma-1.0)/(2*gamma - 1.0) =
% 0.2764

 A second, more fundamental approach {\em counts\/}
% Alternatively, count
 how many valid sequences of length $N$ there are, $S_N$. 
  We can communicate $\log S_N$ bits in $N$ channel cycles by giving 
 one name to each of these valid sequences.
% Define capacity here.

% Having got a feel for this toy channel, let us now tackle the 
% general problem.

\section{The capacity of a constrained noiseless channel}
% How can we define the capacity of a constrained channel? 
 We defined the capacity of a noisy channel in terms of\index{channel!capacity}  
 the mutual information between its input and its output, then 
 we proved
% -- with considerable effort --
 that this number, the capacity, was related to the 
 number of distinguishable messages
 $S(N)$
% $M(N)$
% \marginpar{Do I want to use $M$?}% Sun 31/12/00: YES, from here on
 that 
 could be reliably conveyed over the channel in $N$ uses of 
 the channel by
\beq
	C = \lim_{N \rightarrow \infty} \frac{1}{N} \log S(N) .
\eeq  
 In the case of the constrained noiseless channel, 
 we can adopt this identity as our definition of 
 the channel's capacity.
 However, the name $s$, which,
 when we were making codes for noisy channels (\secref{sec.whereCWMdefined}),
 ran over messages $s = 1, \ldots, S$,
 is about to take on a new role: labelling the states
 of our channel; so in this chapter
 we will denote the   number of distinguishable messages of length $N$
 by $M_N$, and define the capacity to be:\index{capacity!constrained channel}  
\beq
	C = \lim_{N \rightarrow \infty} \frac{1}{N} \log M_N .
\eeq  

% Knowing the capacity of a channel doesn't tell us how practically to 
% achieve that rate of communication, so o
 Once we have figured out 
 the capacity of a channel we will  return  to
 the task of making a practical code for that channel.


\section{Counting the number of possible messages}
 First let us introduce some representations of 
 constrained channels. 
% We can often conveniently represent a  constrained channel
% by a state diagram.
 In a {\dem\ind{state diagram}}, states of the transmitter are represented
 by circles labelled with the name of the state. 
 Directed edges\index{edge}\index{graph} from one state to another indicate that the
 transmitter is permitted to move from the first state to the
 second, and a label on that edge indicates the 
 symbol emitted when that \ind{transition} is made. 
 \Figref{fig.state1}a shows the state diagram for 
 channel A.
% the ${\tt 0}^+{\tt 1}^1$ 
 It has two states, $0$
 and $1$. When transitions to state $0$ are made, 
 a {\tt 0} is transmitted; when transitions to state $1$ are made, 
 a {\tt 1} is transmitted; transitions from  state $1$ to state $1$
 are not possible.
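
 As a preview of the counting (a sketch, with $M_N$ here counting the strings of
 length $N$ that channel A can emit starting from state $0$): a valid string ends either
 in {\tt 0}, preceded by any valid string of length $N-1$, or in {\tt 01}, preceded
 by any valid string of length $N-2$, so
\[
	M_N = M_{N-1} + M_{N-2} , \:\:\:\: M_1 = 2, \:\: M_2 = 3 .
\]
 These are Fibonacci numbers, which grow as $\gamma^N$ with
 $\gamma = (1+\sqrt{5})/2 \simeq 1.618$, giving
\[
	C = \lim_{N \rightarrow \infty} \frac{1}{N} \log_2 M_N = \log_2 \gamma \simeq 0.694 \mbox{ bits per channel use} .
\]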

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}
\figuremarginb{% bottom aligned to avoid clash
\small
\begin{center}
\begin{tabular}{cc}
(a)\mbox{\psfig{figure=noiseless/figs/state1.ps,angle=-90,width=0.6in}}&
(c)
%
% trellis section written by trellis.p
%
\setlength{\unitlength}{0.01cm}
\begin{picture}(1004,180)(-25,-25)
%
% starting circles
%
\multiput(0,0)(0,100){1}{\circle{26}}
%
% labels for circles
%
\put(0,0){\makebox(0,0){\tiny{0}}}
%
% lines
%
\put(17,0){\vector(1,0){88}}
\put(61,-2){\makebox(0,0)[t]{\tt{0}}}
\put(17,6){\vector(1,1){88}}
\put(86,77){\makebox(0,0)[br]{\tt{1}}}
%
% starting circles
%
\multiput(122,0)(0,100){2}{\circle{26}}
%
% labels for circles
%
\put(122,0){\makebox(0,0){\tiny{0}}}
\put(122,100){\makebox(0,0){\tiny{1}}}
%
% state label for this column
%
\put(122,130){\makebox(0,0)[b]{{$s_{1}$}}}
%
% lines
%
\put(139,0){\vector(1,0){88}}
\put(183,-2){\makebox(0,0)[t]{\tt{0}}}
\put(139,94){\vector(1,-1){88}}
\put(158,77){\makebox(0,0)[bl]{\tt{0}}}
\put(139,6){\vector(1,1){88}}
\put(208,77){\makebox(0,0)[br]{\tt{1}}}
%
% starting circles
%
\multiput(244,0)(0,100){2}{\circle{26}}
%
% labels for circles
%
\put(244,0){\makebox(0,0){\tiny{0}}}
\put(244,100){\makebox(0,0){\tiny{1}}}
%
% state label for this column
%
\put(244,130){\makebox(0,0)[b]{{$s_{2}$}}}
%
% lines
%
\put(261,0){\vector(1,0){88}}
\put(305,-2){\makebox(0,0)[t]{\tt{0}}}
\put(261,94){\vector(1,-1){88}}
\put(280,77){\makebox(0,0)[bl]{\tt{0}}}
\put(261,6){\vector(1,1){88}}
\put(330,77){\makebox(0,0)[br]{\tt{1}}}
%
% starting circles
%
\multiput(366,0)(0,100){2}{\circle{26}}
%
% labels for circles
%
\put(366,0){\makebox(0,0){\tiny{0}}}
\put(366,100){\makebox(0,0){\tiny{1}}}
%
% state label for this column
%
\put(366,130){\makebox(0,0)[b]{{$s_{3}$}}}
%
% lines
%
\put(383,0){\vector(1,0){88}}
\put(427,-2){\makebox(0,0)[t]{\tt{0}}}
\put(383,94){\vector(1,-1){88}}
\put(402,77){\makebox(0,0)[bl]{\tt{0}}}
\put(383,6){\vector(1,1){88}}
\put(452,77){\makebox(0,0)[br]{\tt{1}}}
%
% starting circles
%
\multiput(488,0)(0,100){2}{\circle{26}}
%
% labels for circles
%
\put(488,0){\makebox(0,0){\tiny{0}}}
\put(488,100){\makebox(0,0){\tiny{1}}}
%
% state label for this column
%
\put(488,130){\makebox(0,0)[b]{{$s_{4}$}}}
%
% lines
%
\put(505,0){\vector(1,0){88}}
\put(549,-2){\makebox(0,0)[t]{\tt{0}}}
\put(505,94){\vector(1,-1){88}}
\put(524,77){\makebox(0,0)[bl]{\tt{0}}}
\put(505,6){\vector(1,1){88}}
\put(574,77){\makebox(0,0)[br]{\tt{1}}}
%
% starting circles
%
\multiput(610,0)(0,100){2}{\circle{26}}
%
% labels for circles
%
\put(610,0){\makebox(0,0){\tiny{0}}}
\put(610,100){\makebox(0,0){\tiny{1}}}
%
% state label for this column
%
\put(610,130){\makebox(0,0)[b]{{$s_{5}$}}}
%
% lines
%
\put(627,0){\vector(1,0){88}}
\put(671,-2){\makebox(0,0)[t]{\tt{0}}}
\put(627,94){\vector(1,-1){88}}
\put(646,77){\makebox(0,0)[bl]{\tt{0}}}
\put(627,6){\vector(1,1){88}}
\put(696,77){\makebox(0,0)[br]{\tt{1}}}
%
% starting circles
%
\multiput(732,0)(0,100){2}{\circle{26}}
%
% labels for circles
%
\put(732,0){\makebox(0,0){\tiny{0}}}
\put(732,100){\makebox(0,0){\tiny{1}}}
%
% state label for this column
%
\put(732,130){\makebox(0,0)[b]{{$s_{6}$}}}
%
% lines
%
\put(749,0){\vector(1,0){88}}
\put(793,-2){\makebox(0,0)[t]{\tt{0}}}
\put(749,94){\vector(1,-1){88}}
\put(768,77){\makebox(0,0)[bl]{\tt{0}}}
\put(749,6){\vector(1,1){88}}
\put(818,77){\makebox(0,0)[br]{\tt{1}}}
%
% starting circles
%
\multiput(854,0)(0,100){2}{\circle{26}}
%
% labels for circles
%
\put(854,0){\makebox(0,0){\tiny{0}}}
\put(854,100){\makebox(0,0){\tiny{1}}}
%
% state label for this column
%
\put(854,130){\makebox(0,0)[b]{{$s_{7}$}}}
%
% lines
%
\put(871,0){\vector(1,0){88}}
\put(915,-2){\makebox(0,0)[t]{\tt{0}}}
\put(871,94){\vector(1,-1){88}}
\put(890,77){\makebox(0,0)[bl]{\tt{0}}}
\put(871,6){\vector(1,1){88}}
\put(940,77){\makebox(0,0)[br]{\tt{1}}}
%
% end circles
%
\multiput(976,0)(0,100){2}{\circle{26}}
%
% labels for circles
%
\put(976,0){\makebox(0,0){\tiny{0}}}
\put(976,100){\makebox(0,0){\tiny{1}}}
%
% state label for this column
%
\put(976,130){\makebox(0,0)[b]{{$s_{8}$}}}
%
\end{picture}
%

\\
(b)
\raisebox{-0.4in}{
%
% trellis section written by trellis.p
% handedited Tue 24/12/02
%
\setlength{\unitlength}{0.015cm}
\begin{picture}(150,180)(-25,-3)
%
% starting circles
%
\multiput(0,0)(0,100){2}{\circle{26}}
%
% labels for circles
%
\put(0,0){\makebox(0,0){{0}}}
\put(0,100){\makebox(0,0){{1}}}
%
% state label for this column
%
\put(0,130){\makebox(0,0)[b]{{$s_{n}$}}}
%
% lines
%
\put(17,0){\vector(1,0){88}}
\put(61,-2){\makebox(0,0)[t]{\tt{0}}}
\put(17,94){\vector(1,-1){88}}
\put(36,77){\makebox(0,0)[bl]{\tt{0}}}
\put(17,6){\vector(1,1){88}}
\put(86,77){\makebox(0,0)[br]{\tt{1}}}
%
% end circles
%
\multiput(122,0)(0,100){2}{\circle{26}}
%
% labels for circles
%
\put(122,0){\makebox(0,0){{0}}}
\put(122,100){\makebox(0,0){{1}}}
%
% state label for this column
%
\put(122,130){\makebox(0,0)[b]{{$s_{n+1}$}}}
%
\end{picture}
%

}
&
(d)\hspace{0.1in}
{ $\bA = \begin{array}[b]{c@{}cc@{}c}
    &      &     \multicolumn{2}{c}{\mbox{\tiny (from)}}  \\
%    &      &     \multicolumn{2}{c}{\mbox{state}}  \\
           &                        & \:    1  &  0 \:   \\
 \mbox{\tiny (to)}
% {state}
               & \begin{array}{c}  
              1\\
              0\end{array} & \left[ \begin{array}{c}  
                                     0\\
                                     1\end{array} \right. & \left. 
                                               \begin{array}{c}  
		                                     1\\
                		                     1\end{array} \right]\\
\end{array}$
}
\\
\end{tabular}
\end{center}
}{%
\caption[a]{(a) State diagram for
% the ${\tt 0}^+{\tt 1}^1$
 channel A.
 (b) Trellis section. (c) Trellis. (d) \Connectionmatrix.}
\label{fig.state1}
}%
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% worked on these figs Sun 3/2/02
%
\begin{figure}
\figuremargin{%
\small
\begin{center}
\begin{tabular}{cc@{\hspace{0.2in}}|cc}
\raisebox{0.1in}[0in][0in]{\psfig{figure=noiseless/figs/state101.ps,angle=-90,height=2in}}
&
%
% trellis section written by trellis.p
%
\setlength{\unitlength}{0.015cm}
\begin{picture}(150,380)(-25,-25)
%
% starting circles
%
\multiput(0,0)(0,100){4}{\circle{32}}
%
% labels for circles
%
\put(0,0){\makebox(0,0){\small{00}}}
\put(0,100){\makebox(0,0){\small{0}}}
\put(0,200){\makebox(0,0){\small{1}}}
\put(0,300){\makebox(0,0){\small{11}}}
%
% state label for this column
%
\put(0,330){\makebox(0,0)[b]{{$s_{n}$}}}
%
% lines
%
\put(19.5384615384615,0){\vector(1,0){88}}
\put(63.5384615384615,-2){\makebox(0,0)[t]{\tt{0}}}
\put(19.5384615384615,94){\vector(1,-1){88}}
\put(38.5384615384615,77){\makebox(0,0)[bl]{\tt{0}}}
\put(19.5384615384615,288){\vector(1,-2){88}}
\put(38.5384615384615,252){\makebox(0,0)[bl]{\tt{0}}}
\put(19.5384615384615,12){\vector(1,2){88}}
\put(63.5384615384615,102){\makebox(0,0)[br]{\tt{1}}}
\put(19.5384615384615,206){\vector(1,1){88}}
\put(88.5384615384615,277){\makebox(0,0)[br]{\tt{1}}}
\put(19.5384615384615,300){\vector(1,0){88}}
\put(63.5384615384615,302){\makebox(0,0)[b]{\tt{1}}}
%
% end circles
%
\multiput(127.076923076923,0)(0,100){4}{\circle{32}}
%
% labels for circles
%
\put(127.076923076923,0){\makebox(0,0){\small{00}}}
\put(127.076923076923,100){\makebox(0,0){\small{0}}}
\put(127.076923076923,200){\makebox(0,0){\small{1}}}
\put(127.076923076923,300){\makebox(0,0){\small{11}}}
%
% state label for this column
%
\put(127.076923076923,330){\makebox(0,0)[b]{{$s_{n+1}$}}}
%
\end{picture}
%

& % divider
\begin{tabular}{@{}c@{}}
\raisebox{0.15in}[0in][0.2in]{%
\mbox{\psfig{figure=noiseless/figs/state111.ps,angle=-90,height=1.7in}}
}
\\
\end{tabular}
&
%
% trellis section written by trellis.p
%
\setlength{\unitlength}{0.015cm}
\begin{picture}(150,380)(-25,-25)
%
% starting circles
%
\multiput(0,0)(0,100){4}{\circle{36}}
%
% labels for circles
%
\put(0,0){\makebox(0,0){\small{00}}}
\put(0,100){\makebox(0,0){\small{0}}}
\put(0,200){\makebox(0,0){\small{1}}}
\put(0,300){\makebox(0,0){\small{11}}}
%
% state label for this column
%
\put(0,330){\makebox(0,0)[b]{{$s_{n}$}}}
%
% lines
%
\put(21.2307692307692,94){\vector(1,-1){88}}
\put(40.2307692307692,77){\makebox(0,0)[bl]{\tt{0}}}
\put(21.2307692307692,194){\vector(1,-1){88}}
\put(40.2307692307692,177){\makebox(0,0)[bl]{\tt{0}}}
\put(21.2307692307692,288){\vector(1,-2){88}}
\put(40.2307692307692,252){\makebox(0,0)[bl]{\tt{0}}}
\put(21.2307692307692,12){\vector(1,2){88}}
\put(65.2307692307692,102){\makebox(0,0)[br]{\tt{1}}}
\put(21.2307692307692,106){\vector(1,1){88}}
\put(90.2307692307692,177){\makebox(0,0)[br]{\tt{1}}}
\put(21.2307692307692,206){\vector(1,1){88}}
\put(90.2307692307692,277){\makebox(0,0)[br]{\tt{1}}}
%
% end circles
%
\multiput(130.461538461538,0)(0,100){4}{\circle{36}}
%
% labels for circles
%
\put(130.461538461538,0){\makebox(0,0){\small{00}}}
\put(130.461538461538,100){\makebox(0,0){\small{0}}}
\put(130.461538461538,200){\makebox(0,0){\small{1}}}
\put(130.461538461538,300){\makebox(0,0){\small{11}}}
%
% state label for this column
%
\put(130.461538461538,330){\makebox(0,0)[b]{{$s_{n+1}$}}}
%
\end{picture}
%

\\ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \normalsize B &
$\bA = \input{noiseless/tex/mfile101.tex}$
& % divider
 \normalsize  C &
$\bA = \input{noiseless/tex/mfile111.tex}$
\\
\end{tabular}
\end{center}
}{%
\caption[a]{State diagrams, trellis sections
 and \connectionmatrices\ for  channels B and C. }
\label{fig.state101}
}% 
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}
\figuremargin{%
\begin{center}\raisebox{0.2in}[0.85in]{
\begin{tabular}{ccc} %  \toprule
%
% trellis section written by trellis.p
%
\setlength{\unitlength}{0.012cm}
\begin{picture}(150,310)(-25,-55)
%
% starting circles
%
\multiput(0,0)(0,100){1}{\circle{26}}
%
% labels for circles
%
\put(0,0){\makebox(0,0){\tiny{0}}}
%
% lines
%
\put(17,0){\vector(1,0){88}}
\put(17,6){\vector(1,1){88}}
% section 1 : cumulative counts
%      0     1
%      1     1
\put(97,-50){\makebox(50,30){\small\bf{1}}}
\put(97,120){\makebox(50,30){\small\bf{1}}}
%
% total count
%
\put(122,200){\makebox(0,0)[t]{\small{$M_{1}\eq 2$}}}
%
% end circles
%
\multiput(122,0)(0,100){2}{\circle{26}}
%
% labels for circles
%
\put(122,0){\makebox(0,0){\tiny{0}}}
\put(122,100){\makebox(0,0){\tiny{1}}}
%
\end{picture}
%

&
%
% trellis section written by trellis.p
%
\setlength{\unitlength}{0.012cm}
\begin{picture}(272,310)(-25,-55)
%
% starting circles
%
\multiput(0,0)(0,100){1}{\circle{26}}
%
% labels for circles
%
\put(0,0){\makebox(0,0){\tiny{0}}}
%
% lines
%
\put(17,0){\vector(1,0){88}}
\put(17,6){\vector(1,1){88}}
% section 1 : cumulative counts
%      0     1
%      1     1
\put(97,-50){\makebox(50,30){\small\bf{1}}}
\put(97,120){\makebox(50,30){\small\bf{1}}}
%
% total count
%
\put(122,200){\makebox(0,0)[t]{\small{$M_{1}\eq 2$}}}
%
% starting circles
%
\multiput(122,0)(0,100){2}{\circle{26}}
%
% labels for circles
%
\put(122,0){\makebox(0,0){\tiny{0}}}
\put(122,100){\makebox(0,0){\tiny{1}}}
%
% lines
%
\put(139,0){\vector(1,0){88}}
\put(139,94){\vector(1,-1){88}}
\put(139,6){\vector(1,1){88}}
% section 2 : cumulative counts
%      0     2
%      1     1
\put(219,-50){\makebox(50,30){\small\bf{2}}}
\put(219,120){\makebox(50,30){\small\bf{1}}}
%
% total count
%
\put(244,200){\makebox(0,0)[t]{\small{$M_{2}\eq 3$}}}
%
% end circles
%
\multiput(244,0)(0,100){2}{\circle{26}}
%
% labels for circles
%
\put(244,0){\makebox(0,0){\tiny{0}}}
\put(244,100){\makebox(0,0){\tiny{1}}}
%
\end{picture}
%

&
%
% trellis section written by trellis.p
%
\setlength{\unitlength}{0.012cm}
\begin{picture}(394,310)(-25,-55)
%
% starting circles
%
\multiput(0,0)(0,100){1}{\circle{26}}
%
% labels for circles
%
\put(0,0){\makebox(0,0){\tiny{0}}}
%
% lines
%
\put(17,0){\vector(1,0){88}}
\put(17,6){\vector(1,1){88}}
% section 1 : cumulative counts
%      0     1
%      1     1
\put(97,-50){\makebox(50,30){\small\bf{1}}}
\put(97,120){\makebox(50,30){\small\bf{1}}}
%
% total count
%
\put(122,200){\makebox(0,0)[t]{\small{$M_{1}\eq 2$}}}
%
% starting circles
%
\multiput(122,0)(0,100){2}{\circle{26}}
%
% labels for circles
%
\put(122,0){\makebox(0,0){\tiny{0}}}
\put(122,100){\makebox(0,0){\tiny{1}}}
%
% lines
%
\put(139,0){\vector(1,0){88}}
\put(139,94){\vector(1,-1){88}}
\put(139,6){\vector(1,1){88}}
% section 2 : cumulative counts
%      0     2
%      1     1
\put(219,-50){\makebox(50,30){\small\bf{2}}}
\put(219,120){\makebox(50,30){\small\bf{1}}}
%
% total count
%
\put(244,200){\makebox(0,0)[t]{\small{$M_{2}\eq 3$}}}
%
% starting circles
%
\multiput(244,0)(0,100){2}{\circle{26}}
%
% labels for circles
%
\put(244,0){\makebox(0,0){\tiny{0}}}
\put(244,100){\makebox(0,0){\tiny{1}}}
%
% lines
%
\put(261,0){\vector(1,0){88}}
\put(261,94){\vector(1,-1){88}}
\put(261,6){\vector(1,1){88}}
% section 3 : cumulative counts
%      0     3
%      1     2
\put(341,-50){\makebox(50,30){\small\bf{3}}}
\put(341,120){\makebox(50,30){\small\bf{2}}}
%
% total count
%
\put(366,200){\makebox(0,0)[t]{\small{$M_{3}\eq 5$}}}
%
% end circles
%
\multiput(366,0)(0,100){2}{\circle{26}}
%
% labels for circles
%
\put(366,0){\makebox(0,0){\tiny{0}}}
\put(366,100){\makebox(0,0){\tiny{1}}}
%
\end{picture}
%

\\ %  \bottomrule
\end{tabular}
}
\end{center}
}{%
\caption[a]{Counting the number of paths in the trellis of channel A.
 The counts
% in the square boxes
 next to the nodes
 are accumulated by passing from left to right 
 across the trellises.}
\label{fig.state1count123}
}%
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}
\fullwidthfigureright{%
\begin{center}\small
\begin{tabular}{@{}*{1}{l@{}l@{}}}  \toprule
\raisebox{1in}{(a) Channel A}&
%
% trellis section written by trellis.p
%
% handedited Tue 24/12/02 to widen from 1004 to 1034
%
\setlength{\unitlength}{0.012cm}
\begin{picture}(1064,310)(-25,-55)
%
% starting circles
%
\multiput(0,0)(0,100){1}{\circle{26}}
%
% labels for circles
%
\put(0,0){\makebox(0,0){\tiny{0}}}
%
% lines
%
\put(17,0){\vector(1,0){88}}
\put(17,6){\vector(1,1){88}}
% section 1 : cumulative counts
%      0     1
%      1     1
\put(97,-50){\makebox(50,30){\small\bf{1}}}
\put(97,120){\makebox(50,30){\small\bf{1}}}
%
% total count
%
\put(122,200){\makebox(0,0)[t]{\small{$M_{1}\eq 2$}}}
%
% starting circles
%
\multiput(122,0)(0,100){2}{\circle{26}}
%
% labels for circles
%
\put(122,0){\makebox(0,0){\tiny{0}}}
\put(122,100){\makebox(0,0){\tiny{1}}}
%
% lines
%
\put(139,0){\vector(1,0){88}}
\put(139,94){\vector(1,-1){88}}
\put(139,6){\vector(1,1){88}}
% section 2 : cumulative counts
%      0     2
%      1     1
\put(219,-50){\makebox(50,30){\small\bf{2}}}
\put(219,120){\makebox(50,30){\small\bf{1}}}
%
% total count
%
\put(244,200){\makebox(0,0)[t]{\small{$M_{2}\eq 3$}}}
%
% starting circles
%
\multiput(244,0)(0,100){2}{\circle{26}}
%
% labels for circles
%
\put(244,0){\makebox(0,0){\tiny{0}}}
\put(244,100){\makebox(0,0){\tiny{1}}}
%
% lines
%
\put(261,0){\vector(1,0){88}}
\put(261,94){\vector(1,-1){88}}
\put(261,6){\vector(1,1){88}}
% section 3 : cumulative counts
%      0     3
%      1     2
\put(341,-50){\makebox(50,30){\small\bf{3}}}
\put(341,120){\makebox(50,30){\small\bf{2}}}
%
% total count
%
\put(366,200){\makebox(0,0)[t]{\small{$M_{3}\eq 5$}}}
%
% starting circles
%
\multiput(366,0)(0,100){2}{\circle{26}}
%
% labels for circles
%
\put(366,0){\makebox(0,0){\tiny{0}}}
\put(366,100){\makebox(0,0){\tiny{1}}}
%
% lines
%
\put(383,0){\vector(1,0){88}}
\put(383,94){\vector(1,-1){88}}
\put(383,6){\vector(1,1){88}}
% section 4 : cumulative counts
%      0     5
%      1     3
\put(463,-50){\makebox(50,30){\small\bf{5}}}
\put(463,120){\makebox(50,30){\small\bf{3}}}
%
% total count
%
\put(488,200){\makebox(0,0)[t]{\small{$M_{4}\eq 8$}}}
%
% starting circles
%
\multiput(488,0)(0,100){2}{\circle{26}}
%
% labels for circles
%
\put(488,0){\makebox(0,0){\tiny{0}}}
\put(488,100){\makebox(0,0){\tiny{1}}}
%
% lines
%
\put(505,0){\vector(1,0){88}}
\put(505,94){\vector(1,-1){88}}
\put(505,6){\vector(1,1){88}}
% section 5 : cumulative counts
%      0     8
%      1     5
\put(585,-50){\makebox(50,30){\small\bf{8}}}
\put(585,120){\makebox(50,30){\small\bf{5}}}
%
% total count
%
\put(610,200){\makebox(0,0)[t]{\small{$M_{5}\eq 13$}}}
%
% starting circles
%
\multiput(610,0)(0,100){2}{\circle{26}}
%
% labels for circles
%
\put(610,0){\makebox(0,0){\tiny{0}}}
\put(610,100){\makebox(0,0){\tiny{1}}}
%
% lines
%
\put(627,0){\vector(1,0){88}}
\put(627,94){\vector(1,-1){88}}
\put(627,6){\vector(1,1){88}}
% section 6 : cumulative counts
%      0    13
%      1     8
\put(707,-50){\makebox(50,30){\small\bf{13}}}
\put(707,120){\makebox(50,30){\small\bf{8}}}
%
% total count
%
\put(732,200){\makebox(0,0)[t]{\small{$M_{6}\eq 21$}}}
%
% starting circles
%
\multiput(732,0)(0,100){2}{\circle{26}}
%
% labels for circles
%
\put(732,0){\makebox(0,0){\tiny{0}}}
\put(732,100){\makebox(0,0){\tiny{1}}}
%
% lines
%
\put(749,0){\vector(1,0){88}}
\put(749,94){\vector(1,-1){88}}
\put(749,6){\vector(1,1){88}}
% section 7 : cumulative counts
%      0    21
%      1    13
\put(829,-50){\makebox(50,30){\small\bf{21}}}
\put(829,120){\makebox(50,30){\small\bf{13}}}
%
% total count
%
\put(854,200){\makebox(0,0)[t]{\small{$M_{7}\eq 34$}}}
%
% starting circles
%
\multiput(854,0)(0,100){2}{\circle{26}}
%
% labels for circles
%
\put(854,0){\makebox(0,0){\tiny{0}}}
\put(854,100){\makebox(0,0){\tiny{1}}}
%
% lines
%
\put(871,0){\vector(1,0){88}}
\put(871,94){\vector(1,-1){88}}
\put(871,6){\vector(1,1){88}}
% section 8 : cumulative counts
%      0    34
%      1    21
\put(951,-50){\makebox(50,30){\small\bf{34}}}
\put(951,120){\makebox(50,30){\small\bf{21}}}
%
% total count
%
\put(976,200){\makebox(0,0)[t]{\small{$M_{8}\eq 55$}}}
%
% end circles
%
\multiput(976,0)(0,100){2}{\circle{26}}
%
% labels for circles
%
\put(976,0){\makebox(0,0){\tiny{0}}}
\put(976,100){\makebox(0,0){\tiny{1}}}
%
\end{picture}
%

\\ \midrule
\raisebox{1.8in}{(b) Channel B}&
%
% trellis section written by trellis.p
%
\setlength{\unitlength}{0.012cm}
\begin{picture}(999,510)(-25,-55)
%
% starting circles
%
\multiput(0,0)(0,100){1}{\circle{26}}
%
% labels for circles
%
\put(0,0){\makebox(0,0){\tiny{00}}}
%
% lines
%
\put(17,0){\vector(1,0){88}}
\put(17,12){\vector(1,2){88}}
% section 1 : cumulative counts
%      0     1
%      1     0
%      2     1
%      3     0
\put(102,-50){\makebox(50,30){\small\bf{1}}}
\put(102,220){\makebox(50,30){\small\bf{1}}}
%
% total count
%
\put(122,400){\makebox(0,0)[t]{\small{$M_{1}\eq 2$}}}
%
% starting circles
%
\multiput(122,0)(0,100){4}{\circle{26}}
%
% labels for circles
%
\put(122,0){\makebox(0,0){\tiny{00}}}
\put(122,100){\makebox(0,0){\tiny{0}}}
\put(122,200){\makebox(0,0){\tiny{1}}}
\put(122,300){\makebox(0,0){\tiny{11}}}
%
% lines
%
\put(139,0){\vector(1,0){88}}
\put(139,94){\vector(1,-1){88}}
\put(139,288){\vector(1,-2){88}}
\put(139,12){\vector(1,2){88}}
\put(139,206){\vector(1,1){88}}
\put(139,300){\vector(1,0){88}}
% section 2 : cumulative counts
%      0     1
%      1     0
%      2     1
%      3     1
\put(224,-50){\makebox(50,30){\small\bf{1}}}
\put(224,220){\makebox(50,30){\small\bf{1}}}
\put(224,320){\makebox(50,30){\small\bf{1}}}
%
% total count
%
\put(244,400){\makebox(0,0)[t]{\small{$M_{2}\eq 3$}}}
%
% starting circles
%
\multiput(244,0)(0,100){4}{\circle{26}}
%
% labels for circles
%
\put(244,0){\makebox(0,0){\tiny{00}}}
\put(244,100){\makebox(0,0){\tiny{0}}}
\put(244,200){\makebox(0,0){\tiny{1}}}
\put(244,300){\makebox(0,0){\tiny{11}}}
%
% lines
%
\put(261,0){\vector(1,0){88}}
\put(261,94){\vector(1,-1){88}}
\put(261,288){\vector(1,-2){88}}
\put(261,12){\vector(1,2){88}}
\put(261,206){\vector(1,1){88}}
\put(261,300){\vector(1,0){88}}
% section 3 : cumulative counts
%      0     1
%      1     1
%      2     1
%      3     2
\put(346,-50){\makebox(50,30){\small\bf{1}}}
\put(346,120){\makebox(50,30){\small\bf{1}}}
\put(346,220){\makebox(50,30){\small\bf{1}}}
\put(346,320){\makebox(50,30){\small\bf{2}}}
%
% total count
%
\put(366,400){\makebox(0,0)[t]{\small{$M_{3}\eq 5$}}}
%
% starting circles
%
\multiput(366,0)(0,100){4}{\circle{26}}
%
% labels for circles
%
\put(366,0){\makebox(0,0){\tiny{00}}}
\put(366,100){\makebox(0,0){\tiny{0}}}
\put(366,200){\makebox(0,0){\tiny{1}}}
\put(366,300){\makebox(0,0){\tiny{11}}}
%
% lines
%
\put(383,0){\vector(1,0){88}}
\put(383,94){\vector(1,-1){88}}
\put(383,288){\vector(1,-2){88}}
\put(383,12){\vector(1,2){88}}
\put(383,206){\vector(1,1){88}}
\put(383,300){\vector(1,0){88}}
% section 4 : cumulative counts
%      0     2
%      1     2
%      2     1
%      3     3
\put(468,-50){\makebox(50,30){\small\bf{2}}}
\put(468,120){\makebox(50,30){\small\bf{2}}}
\put(468,220){\makebox(50,30){\small\bf{1}}}
\put(468,320){\makebox(50,30){\small\bf{3}}}
%
% total count
%
\put(488,400){\makebox(0,0)[t]{\small{$M_{4}\eq 8$}}}
%
% starting circles
%
\multiput(488,0)(0,100){4}{\circle{26}}
%
% labels for circles
%
\put(488,0){\makebox(0,0){\tiny{00}}}
\put(488,100){\makebox(0,0){\tiny{0}}}
\put(488,200){\makebox(0,0){\tiny{1}}}
\put(488,300){\makebox(0,0){\tiny{11}}}
%
% lines
%
\put(505,0){\vector(1,0){88}}
\put(505,94){\vector(1,-1){88}}
\put(505,288){\vector(1,-2){88}}
\put(505,12){\vector(1,2){88}}
\put(505,206){\vector(1,1){88}}
\put(505,300){\vector(1,0){88}}
% section 5 : cumulative counts
%      0     4
%      1     3
%      2     2
%      3     4
\put(590,-50){\makebox(50,30){\small\bf{4}}}
\put(590,120){\makebox(50,30){\small\bf{3}}}
\put(590,220){\makebox(50,30){\small\bf{2}}}
\put(590,320){\makebox(50,30){\small\bf{4}}}
%
% total count
%
\put(610,400){\makebox(0,0)[t]{\small{$M_{5}\eq 13$}}}
%
% starting circles
%
\multiput(610,0)(0,100){4}{\circle{26}}
%
% labels for circles
%
\put(610,0){\makebox(0,0){\tiny{00}}}
\put(610,100){\makebox(0,0){\tiny{0}}}
\put(610,200){\makebox(0,0){\tiny{1}}}
\put(610,300){\makebox(0,0){\tiny{11}}}
%
% lines
%
\put(627,0){\vector(1,0){88}}
\put(627,94){\vector(1,-1){88}}
\put(627,288){\vector(1,-2){88}}
\put(627,12){\vector(1,2){88}}
\put(627,206){\vector(1,1){88}}
\put(627,300){\vector(1,0){88}}
% section 6 : cumulative counts
%      0     7
%      1     4
%      2     4
%      3     6
\put(712,-50){\makebox(50,30){\small\bf{7}}}
\put(712,120){\makebox(50,30){\small\bf{4}}}
\put(712,220){\makebox(50,30){\small\bf{4}}}
\put(712,320){\makebox(50,30){\small\bf{6}}}
%
% total count
%
\put(732,400){\makebox(0,0)[t]{\small{$M_{6}\eq 21$}}}
%
% starting circles
%
\multiput(732,0)(0,100){4}{\circle{26}}
%
% labels for circles
%
\put(732,0){\makebox(0,0){\tiny{00}}}
\put(732,100){\makebox(0,0){\tiny{0}}}
\put(732,200){\makebox(0,0){\tiny{1}}}
\put(732,300){\makebox(0,0){\tiny{11}}}
%
% lines
%
\put(749,0){\vector(1,0){88}}
\put(749,94){\vector(1,-1){88}}
\put(749,288){\vector(1,-2){88}}
\put(749,12){\vector(1,2){88}}
\put(749,206){\vector(1,1){88}}
\put(749,300){\vector(1,0){88}}
% section 7 : cumulative counts
%      0    11
%      1     6
%      2     7
%      3    10
\put(834,-50){\makebox(50,30){\small\bf{11}}}
\put(834,120){\makebox(50,30){\small\bf{6}}}
\put(834,220){\makebox(50,30){\small\bf{7}}}
\put(834,320){\makebox(50,30){\small\bf{10}}}
%
% total count
%
\put(854,400){\makebox(0,0)[t]{\small{$M_{7}\eq 34$}}}
%
% starting circles
%
\multiput(854,0)(0,100){4}{\circle{26}}
%
% labels for circles
%
\put(854,0){\makebox(0,0){\tiny{00}}}
\put(854,100){\makebox(0,0){\tiny{0}}}
\put(854,200){\makebox(0,0){\tiny{1}}}
\put(854,300){\makebox(0,0){\tiny{11}}}
%
% lines
%
\put(871,0){\vector(1,0){88}}
\put(871,94){\vector(1,-1){88}}
\put(871,288){\vector(1,-2){88}}
\put(871,12){\vector(1,2){88}}
\put(871,206){\vector(1,1){88}}
\put(871,300){\vector(1,0){88}}
% section 8 : cumulative counts
%      0    17
%      1    10
%      2    11
%      3    17
\put(956,-50){\makebox(50,30){\small\bf{17}}}
\put(956,120){\makebox(50,30){\small\bf{10}}}
\put(956,220){\makebox(50,30){\small\bf{11}}}
\put(956,320){\makebox(50,30){\small\bf{17}}}
%
% total count
%
\put(976,400){\makebox(0,0)[t]{\small{$M_{8}\eq 55$}}}
%
% end circles
%
\multiput(976,0)(0,100){4}{\circle{26}}
%
% labels for circles
%
\put(976,0){\makebox(0,0){\tiny{00}}}
\put(976,100){\makebox(0,0){\tiny{0}}}
\put(976,200){\makebox(0,0){\tiny{1}}}
\put(976,300){\makebox(0,0){\tiny{11}}}
%
\end{picture}
%

\\ \midrule
\raisebox{1.8in}{(c) Channel C}&
%
% trellis section written by trellis.p
%
\setlength{\unitlength}{0.012cm}
\begin{picture}(999,510)(-25,-55)
%
% starting circles
%
\multiput(0,0)(0,100){1}{\circle{26}}
%
% labels for circles
%
\put(0,0){\makebox(0,0){\tiny{00}}}
%
% lines
%
\put(17,12){\vector(1,2){88}}
% section 1 : cumulative counts
%      0     0
%      1     0
%      2     1
%      3     0
\put(102,220){\makebox(50,30){\small\bf{1}}}
%
% total count
%
\put(122,400){\makebox(0,0)[t]{\small{$M_{1}\eq 1$}}}
%
% starting circles
%
\multiput(122,0)(0,100){4}{\circle{26}}
%
% labels for circles
%
\put(122,0){\makebox(0,0){\tiny{00}}}
\put(122,100){\makebox(0,0){\tiny{0}}}
\put(122,200){\makebox(0,0){\tiny{1}}}
\put(122,300){\makebox(0,0){\tiny{11}}}
%
% lines
%
\put(139,94){\vector(1,-1){88}}
\put(139,194){\vector(1,-1){88}}
\put(139,288){\vector(1,-2){88}}
\put(139,12){\vector(1,2){88}}
\put(139,106){\vector(1,1){88}}
\put(139,206){\vector(1,1){88}}
% section 2 : cumulative counts
%      0     0
%      1     1
%      2     0
%      3     1
\put(224,120){\makebox(50,30){\small\bf{1}}}
\put(224,320){\makebox(50,30){\small\bf{1}}}
%
% total count
%
\put(244,400){\makebox(0,0)[t]{\small{$M_{2}\eq 2$}}}
%
% starting circles
%
\multiput(244,0)(0,100){4}{\circle{26}}
%
% labels for circles
%
\put(244,0){\makebox(0,0){\tiny{00}}}
\put(244,100){\makebox(0,0){\tiny{0}}}
\put(244,200){\makebox(0,0){\tiny{1}}}
\put(244,300){\makebox(0,0){\tiny{11}}}
%
% lines
%
\put(261,94){\vector(1,-1){88}}
\put(261,194){\vector(1,-1){88}}
\put(261,288){\vector(1,-2){88}}
\put(261,12){\vector(1,2){88}}
\put(261,106){\vector(1,1){88}}
\put(261,206){\vector(1,1){88}}
% section 3 : cumulative counts
%      0     1
%      1     1
%      2     1
%      3     0
\put(346,-50){\makebox(50,30){\small\bf{1}}}
\put(346,120){\makebox(50,30){\small\bf{1}}}
\put(346,220){\makebox(50,30){\small\bf{1}}}
%
% total count
%
\put(366,400){\makebox(0,0)[t]{\small{$M_{3}\eq 3$}}}
%
% starting circles
%
\multiput(366,0)(0,100){4}{\circle{26}}
%
% labels for circles
%
\put(366,0){\makebox(0,0){\tiny{00}}}
\put(366,100){\makebox(0,0){\tiny{0}}}
\put(366,200){\makebox(0,0){\tiny{1}}}
\put(366,300){\makebox(0,0){\tiny{11}}}
%
% lines
%
\put(383,94){\vector(1,-1){88}}
\put(383,194){\vector(1,-1){88}}
\put(383,288){\vector(1,-2){88}}
\put(383,12){\vector(1,2){88}}
\put(383,106){\vector(1,1){88}}
\put(383,206){\vector(1,1){88}}
% section 4 : cumulative counts
%      0     1
%      1     1
%      2     2
%      3     1
\put(468,-50){\makebox(50,30){\small\bf{1}}}
\put(468,120){\makebox(50,30){\small\bf{1}}}
\put(468,220){\makebox(50,30){\small\bf{2}}}
\put(468,320){\makebox(50,30){\small\bf{1}}}
%
% total count
%
\put(488,400){\makebox(0,0)[t]{\small{$M_{4}\eq 5$}}}
%
% starting circles
%
\multiput(488,0)(0,100){4}{\circle{26}}
%
% labels for circles
%
\put(488,0){\makebox(0,0){\tiny{00}}}
\put(488,100){\makebox(0,0){\tiny{0}}}
\put(488,200){\makebox(0,0){\tiny{1}}}
\put(488,300){\makebox(0,0){\tiny{11}}}
%
% lines
%
\put(505,94){\vector(1,-1){88}}
\put(505,194){\vector(1,-1){88}}
\put(505,288){\vector(1,-2){88}}
\put(505,12){\vector(1,2){88}}
\put(505,106){\vector(1,1){88}}
\put(505,206){\vector(1,1){88}}
% section 5 : cumulative counts
%      0     1
%      1     3
%      2     2
%      3     2
\put(590,-50){\makebox(50,30){\small\bf{1}}}
\put(590,120){\makebox(50,30){\small\bf{3}}}
\put(590,220){\makebox(50,30){\small\bf{2}}}
\put(590,320){\makebox(50,30){\small\bf{2}}}
%
% total count
%
\put(610,400){\makebox(0,0)[t]{\small{$M_{5}\eq 8$}}}
%
% starting circles
%
\multiput(610,0)(0,100){4}{\circle{26}}
%
% labels for circles
%
\put(610,0){\makebox(0,0){\tiny{00}}}
\put(610,100){\makebox(0,0){\tiny{0}}}
\put(610,200){\makebox(0,0){\tiny{1}}}
\put(610,300){\makebox(0,0){\tiny{11}}}
%
% lines
%
\put(627,94){\vector(1,-1){88}}
\put(627,194){\vector(1,-1){88}}
\put(627,288){\vector(1,-2){88}}
\put(627,12){\vector(1,2){88}}
\put(627,106){\vector(1,1){88}}
\put(627,206){\vector(1,1){88}}
% section 6 : cumulative counts
%      0     3
%      1     4
%      2     4
%      3     2
\put(712,-50){\makebox(50,30){\small\bf{3}}}
\put(712,120){\makebox(50,30){\small\bf{4}}}
\put(712,220){\makebox(50,30){\small\bf{4}}}
\put(712,320){\makebox(50,30){\small\bf{2}}}
%
% total count
%
\put(732,400){\makebox(0,0)[t]{\small{$M_{6}\eq 13$}}}
%
% starting circles
%
\multiput(732,0)(0,100){4}{\circle{26}}
%
% labels for circles
%
\put(732,0){\makebox(0,0){\tiny{00}}}
\put(732,100){\makebox(0,0){\tiny{0}}}
\put(732,200){\makebox(0,0){\tiny{1}}}
\put(732,300){\makebox(0,0){\tiny{11}}}
%
% lines
%
\put(749,94){\vector(1,-1){88}}
\put(749,194){\vector(1,-1){88}}
\put(749,288){\vector(1,-2){88}}
\put(749,12){\vector(1,2){88}}
\put(749,106){\vector(1,1){88}}
\put(749,206){\vector(1,1){88}}
% section 7 : cumulative counts
%      0     4
%      1     6
%      2     7
%      3     4
\put(834,-50){\makebox(50,30){\small\bf{4}}}
\put(834,120){\makebox(50,30){\small\bf{6}}}
\put(834,220){\makebox(50,30){\small\bf{7}}}
\put(834,320){\makebox(50,30){\small\bf{4}}}
%
% total count
%
\put(854,400){\makebox(0,0)[t]{\small{$M_{7}\eq 21$}}}
%
% starting circles
%
\multiput(854,0)(0,100){4}{\circle{26}}
%
% labels for circles
%
\put(854,0){\makebox(0,0){\tiny{00}}}
\put(854,100){\makebox(0,0){\tiny{0}}}
\put(854,200){\makebox(0,0){\tiny{1}}}
\put(854,300){\makebox(0,0){\tiny{11}}}
%
% lines
%
\put(871,94){\vector(1,-1){88}}
\put(871,194){\vector(1,-1){88}}
\put(871,288){\vector(1,-2){88}}
\put(871,12){\vector(1,2){88}}
\put(871,106){\vector(1,1){88}}
\put(871,206){\vector(1,1){88}}
% section 8 : cumulative counts
%      0     6
%      1    11
%      2    10
%      3     7
\put(956,-50){\makebox(50,30){\small\bf{6}}}
\put(956,120){\makebox(50,30){\small\bf{11}}}
\put(956,220){\makebox(50,30){\small\bf{10}}}
\put(956,320){\makebox(50,30){\small\bf{7}}}
%
% total count
%
\put(976,400){\makebox(0,0)[t]{\small{$M_{8}\eq 34$}}}
%
% end circles
%
\multiput(976,0)(0,100){4}{\circle{26}}
%
% labels for circles
%
\put(976,0){\makebox(0,0){\tiny{00}}}
\put(976,100){\makebox(0,0){\tiny{0}}}
\put(976,200){\makebox(0,0){\tiny{1}}}
\put(976,300){\makebox(0,0){\tiny{11}}}
%
\end{picture}
%

\\ \bottomrule
\end{tabular}
\end{center}
}{%
\caption[a]{Counting  the number of paths in the trellises of channels A, B, and C.
 We assume that at the start the first bit is preceded
 by {\tt 00}, so that for channels A and B, 
 any initial character is permitted, but
 for channel C, the first character must be 
 a {\tt 1}.}
\label{fig.state1count}
\label{fig.state101count}
}%
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}
\figuremargin{%
\begin{center}
\begin{tabular}{*{1}{c@{}c}} 
&
% written by fibonacci.p
\begin{tabular}{*{4}{r}c}  \toprule
\multicolumn{1}{r}{$n$} &
 \multicolumn{1}{r}{$M_n$} &
  \multicolumn{1}{c}{$M_n/M_{n-1}$} &
  \multicolumn{1}{l}{$\log_2 M_n$} &
   \multicolumn{1}{c}{$\frac{1}{n} \log_2 M_n$} \\[0.051in] \midrule
1	& 2	&        	&        1.0	&      1.00 \\ 
2	& 3	&  1.500	&        1.6	&      0.79 \\ 
3	& 5	&  1.667	&        2.3	&      0.77 \\ 
4	& 8	&  1.600	&        3.0	&      0.75 \\ 
5	& 13	&  1.625	&        3.7	&      0.74 \\ 
6	& 21	&  1.615	&        4.4	&      0.73 \\ 
7	& 34	&  1.619	&        5.1	&      0.73 \\ 
8	& 55	&  1.618	&        5.8	&      0.72 \\ 
9	& 89	&  1.618	&        6.5	&      0.72 \\ 
10	& 144	&  1.618	&        7.2	&      0.72 \\ 
11	& 233	&  1.618	&        7.9	&      0.71 \\ 
12	& 377	&  1.618	&        8.6	&      0.71 \\ 
100	& $9\!\times\! 10^{20}$	&  1.618	&       69.7	&      0.70 \\ 
200	& $7\!\times\! 10^{41}$	&  1.618	&      139.1	&      0.70 \\ 
300	& $6\!\times\! 10^{62}$	&  1.618	&      208.5	&      0.70 \\ 
400	& $5\!\times\! 10^{83}$	&  1.618	&      277.9	&      0.69 \\ 
\bottomrule
\end{tabular}

% but needs to be changed to use toprule not hline
\\
\end{tabular}
\end{center}
}{%
\caption[a]{Counting  the number of paths in the trellis of channel A.}
\label{fig.fibonacci}
}%
\end{figure}

% .p 
% see noiseless
% source makeall
 We can also represent the state diagram by a
 {\dem\ind{trellis  section}}, which shows two successive states 
 in time at two successive horizontal locations (\figref{fig.state1}b). 
 The state of the transmitter at time $n$ is called $s_n$.
% \footnote{Change $s_n$ to $i_n$?} NO
 The set of possible state sequences can be represented 
 by a {\dem\ind{trellis}} as shown in \figref{fig.state1}c.
 A valid sequence corresponds to a path through the
 trellis, and the number of valid sequences is the
 number of paths.
 For the purpose of counting how many paths there are 
 through the trellis, we can ignore the labels on the 
 edges and summarize the trellis section by 
 the {\dem\ind{\connectionmatrix}\/} $\bcmA$, in which $\cmA_{s's} = 1$
 if there is an edge from state $s$ to $s'$, and  $\cmA_{s's} = 0$
 otherwise (\figref{fig.state1}d).
 \Figref{fig.state101} shows the state diagrams, trellis sections
 and \connectionmatrices\ for  channels B and C. 

% So, let's count!
 Let's count the number of paths for channel A by message-passing
 in its trellis.
 \Figref{fig.state1count123} shows the first few steps of this counting
 process, and
 \figref{fig.state1count}a shows the number of paths ending in each state
 after $n$ steps for $n=1, \ldots, 8$.
 The total number of paths of length $n$, $M_n$, is shown along the top.
 We recognize $M_n$ as the Fibonacci series.
\exercisxB{1}{ex.fibo}{
	Show that the ratio of successive terms in the \ind{Fibonacci} series tends
 to the \ind{golden ratio},
\beq
	\gamma \equiv \frac{1 + \sqrt{5}}{2} = 1.618 .
\eeq
}
 Thus, to within a constant factor, $M_N$ scales as $M_N \sim \gamma^N$
 as $N \rightarrow \infty$, so the capacity of channel A is
\beq
	C = \lim \frac{1}{N} \log_2 \!\left[ \mbox{constant} \cdot  \gamma^N \right]
	 = \log_2 \gamma = \log_2 1.618 = 0.694 .
\eeq

 How can we describe what we just did? 
 The count of the number of paths is a vector $\bc^{(n)}$; we can obtain 
 $\bc^{(n+1)}$ from  $\bc^{(n)}$ using:
\beq
	\bc^{(n+1)} = \bAcm \bc^{(n)} .
\eeq 
 So
\beq
	 \bc^{(N)} =  \bAcm^{\!N}  \bc^{(0)} ,
\eeq
 where $\bc^{(0)}$ is the state count before any symbols are transmitted. 
 In  \figref{fig.state1count} we assumed  $\bc^{(0)} = [ 0 , 1]^{\T}$, \ie, that 
 either of the two symbols is permitted at the outset. 
 The total number of paths is $M_n = \sum_s c^{(n)}_s =  \bc^{(n)} \cdot \bn$.
 In the limit, $\bc^{(N)}$ becomes dominated by the principal right-eigenvector 
 of $\bAcm$.
\beq
 \bc^{(N)} \rightarrow \mbox{constant} \cdot \lambda_1^N \eR^{(0)}  .
\eeq
 Here, $\lambda_1$ is the principal eigenvalue of $\bAcm$. 
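
 As an illustrative check of this recursion, suppose the two states of
 channel A are ordered $(1,0)$ (an assumption about labelling made here;
 the total counts do not depend on it). One step of the count then reads
\beq
	\left[ \begin{array}{c} c^{(n+1)}_1 \\ c^{(n+1)}_0 \end{array} \right]
	=
	\left[ \begin{array}{cc} 0 & 1 \\ 1 & 1 \end{array} \right]
	\left[ \begin{array}{c} c^{(n)}_1 \\ c^{(n)}_0 \end{array} \right] .
\eeq
 Starting from $\bc^{(0)} = [0,1]^{\T}$, the first few iterations give
\beq
	\bc^{(1)} = [1,1]^{\T}, \:\:
	\bc^{(2)} = [1,2]^{\T}, \:\:
	\bc^{(3)} = [2,3]^{\T}, \:\:
	\bc^{(4)} = [3,5]^{\T},
\eeq
 whose sums, $M_1 = 2$, $M_2 = 3$, $M_3 = 5$ and $M_4 = 8$, agree with
 the path counts tabulated in \figref{fig.fibonacci}.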
 
 So to find the capacity of any constrained channel,
% defined by  a \connectionmatrix,
 all we need to do is find the
 principal eigenvalue, $\l_1$, of its \connectionmatrix. 
 Then
\beq
	C = \log_2 \l_1 .
\eeq
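
 For channel A this recipe can be carried out by hand. As a sketch,
 assume the states are ordered $(1,0)$, so that the \connectionmatrix\ is
 $\left[ \begin{array}{cc} 0 & 1 \\ 1 & 1 \end{array} \right]$.
 Its eigenvalues satisfy
\beq
	\det \left[ \begin{array}{cc} -\lambda & 1 \\ 1 & 1-\lambda \end{array} \right]
	= \lambda^2 - \lambda - 1 = 0
	\:\: \Rightarrow \:\: \lambda = \frac{1 \pm \sqrt{5}}{2} .
\eeq
 The principal eigenvalue is $\lambda_1 = \gamma = 1.618$, so
 $C = \log_2 \gamma = 0.694$, in agreement with the value obtained
 above by counting paths.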

\section{Back to our model channels}
 Comparing \figref{fig.state1count}a and figures \ref{fig.state101count}b and c
 it looks as if channels B and C have the same
 capacity as channel A. The principal eigenvalues of 
 the three trellises are  the same (the eigenvectors for
 channels A and B are given at the bottom of \tabref{tab.eigsforyou},
 \pref{tab.eigsforyou}).
% see section \ref{sec.rll.eigenvectors}).
 And indeed the channels are intimately related. 

%\begin{figure}[htbp]
%\figuremargin{%
\marginfig{
\begin{tabular}{c}
  \mbox{\input{convol/tex/k1_1_3sr.tex}}
\\%&
 \mbox{\input{convol/tex/k1_1_3snFLIP.tex}} % has s and t reversed from normal
\\
%
\end{tabular}
%}{%
\caption[a]{An \ind{accumulator} and a \ind{differentiator}.}
% , with $s$ and $t$ possibly mislabelled.}
}%
%\end{figure}

\subsubsection{Equivalence of channels A and B}
 If we take any valid string $\bs$ for channel A and pass it through 
 an {\dem\ind{accumulator}}, obtaining $\bt$ defined by:
\beq
\begin{array}{rclc}
	t_1 &=& s_1 \\
	t_{n} &=& t_{n-1} + s_{n} \mod 2 & \mbox{for $n \geq 2$,} 
\end{array} 
\eeq
 then the resulting string is a valid string for channel B, because 
 there are no {\tt 11}s in $\bs$, so there are no isolated digits
 in $\bt$.
 The accumulator is an invertible operator, so, similarly, 
 any valid string $\bt$ for channel B can be mapped onto a 
 valid string $\bs$ for channel A through the
 {\dem{binary \ind{differentiator}}}, 
\beq
\begin{array}{rclc}
	s_1 &=& t_1 \\
	s_{n} &=& t_{n} - t_{n-1} \mod 2 & \mbox{for $n \geq 2$.} 
\end{array} 
\eeq
 Because $+$ and $-$ are equivalent in modulo 2 arithmetic, 
 the differentiator is also a blurrer, convolving the source stream 
 with the filter $(1,1)$.
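
 For example, take the string $\bs$ below (chosen here purely for
 illustration). It contains no {\tt 11}s, so it is valid for channel A;
 passing it through the accumulator gives $\bt$:
\begin{center}
\begin{tabular}{ll}
 $\bs$ & {\tt 0010010100} \\
 $\bt$ & {\tt 0011100111} \\
\end{tabular}
\end{center}
 Every run in $\bt$ has length at least two, so $\bt$ contains no isolated
 digits. Applying the differentiator to $\bt$
 ($s_1 = t_1$, $s_n = t_n - t_{n-1} \mod 2$) recovers $\bs$ exactly.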
% (A bit surprising that blurring is invertible?)

 Channel C is also intimately related to channels A and B.
\exercissxB{1}{ex.abc.compare}{
% It looks as if channels B and C have the same
% capacity as channel A.  Show this is so by showing that (apart from edge effects) 
% all three channels  are actually equivalent channels, in that
% any valid string for one channel can be mapped onto  valid 
% strings for the others.
 What is the relationship of channel C to channels A and B?
}

\section{Practical communication over constrained channels}
 OK, how to  do it in practice? Since all three channels are equivalent, we can
 concentrate on channel A.
%
\subsection{Fixed-length solutions}
% This code
 We start with explicitly-enumerated codes.
 The code in \tabref{tab.eightwords}%
\margintab{
\begin{center}
\begin{tabular}{cc} \toprule
% this was m, not s.........feb 2000
 $s$ & $c(s)$ \\ \midrule
 1 & {\tt 00000} \\
 2 & {\tt 10000} \\
 3 & {\tt 01000} \\
 4 & {\tt 00100} \\
 5 & {\tt 00010} \\
 6 & {\tt 10100} \\
 7 & {\tt 01010} \\
 8 & {\tt 10010} \\ \bottomrule
\end{tabular}
\end{center}
\caption{A runlength-limited code for channel A.}
\label{tab.eightwords}
}
 achieves a rate of $\dfrac{3}{5} = 0.6$.
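
 As a check on this rate: each of the eight codewords is free of
 the substring {\tt 11} and ends in {\tt 0}, so codewords can be concatenated
 in any order without creating a forbidden {\tt 11} at a boundary.
 Eight codewords convey $\log_2 8 = 3$ source bits in 5 transmitted bits, so
\beq
	R = \frac{3}{5} = 0.6 , \:\:\:\:  \frac{R}{C} = \frac{0.6}{0.694} \simeq 0.86 ,
\eeq
 that is, about 86\% of the capacity of channel A.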

% added Sun 3/2/02
\exercissxB{1}{ex.con8.10}{
 Similarly,  enumerate all strings of length 8 that end in the zero state.
 (There are 34 of them.)
 Hence show that we can map 5 bits (32 source strings) to 8 transmitted
 bits and achieve
 rate $\dfrac{5}{8} = 0.625$.

 What rate can be achieved by mapping an integer
 number of source bits to
% Crank up to
 $N=16$ transmitted bits?
}

\subsection{Optimal variable-length solution}
% {\em It is probably confusing that I have used
% $s$ to run over source message names, and $s$ to
% run over states in the trellis. Let's change to $u$ or $i$
% for trellis states?}

 The optimal way to convey information over the constrained
 channel is to find the 
 optimal transition probabilities
 for all points in the trellis,
 $Q_{s'|s}$, and make transitions with these probabilities.

 When discussing channel A, we showed that  a sparse source with density $f=0.38$,
 driving code $C_2$,
 would achieve capacity.\index{arithmetic coding!uses beyond compression} 
 And we know how to make \ind{sparsifiers} (\chapterref{ch4}):
 we design  an  arithmetic code that is optimal for compressing
 a  sparse source; then its associated decoder gives an optimal
 mapping from dense (\ie, random binary) strings to sparse strings.
%
% improve this reference to ch 4
%

 The task of finding the optimal  probabilities is given
 as an exercise.
\exercisxC{3}{ex.optimal.constrained}{
 Show that the 
 optimal transition probabilities $\bQ$ can be found as follows.

 Find the principal right- and left-eigenvectors of 
 $\bcmA$, that is the
 solutions of $\bA \be^{(R)} = \l \be^{(R)}$
 and   ${\be^{(L)}}^{\T}\bA  = \l {\be^{(L)}}^{\T}$
 with largest eigenvalue $\l$.
 Then construct  a  matrix $\bQ$ whose invariant distribution 
 is proportional to $e^{(R)}_i e^{(L)}_i$, namely
% . This is given by
\beq
	Q_{s'|s} =  \frac{e^{(L)}_{s'} \cmA_{s's} }{\l e^{(L)}_s } .
\label{eq.optimalQ}
\eeq
 [Hint: \exerciseref{ex.path2} might give helpful
 cross-fertilization here.]
% in message.tex
}
\exercissxB{3}{ex.show.trellis.entropy}{
 Show that when sequences are generated using  the
 optimal   transition probability 
 matrix (\ref{eq.optimalQ}),
 the entropy of the resulting sequence is 
 asymptotically $\log_2 \l$ per symbol.
 [Hint:  consider the conditional entropy of just
 one symbol given the previous one, assuming the previous 
 one's distribution is the invariant distribution.]
}

 In practice, we would probably use\index{channel!constrained}
 finite-precision approximations to the optimal
 variable-length solution.\index{variable-length code}\index{code!for constrained channel!variable-length} 
 One might dislike  variable-length solutions
 because of  the resulting unpredictability of the actual
 encoded length in any particular case. Perhaps in
 some applications we would like a guarantee that the
 encoded length of a source file of size $N$ bits will be less than
 a given length such as  $N/(C-\epsilon)$.
 For example, 
 a \ind{disk drive} is easier to control if
 all blocks of 512 bytes are known to take exactly the same amount of 
 disk real-estate.
% \index{disk drive}
%
 For some constrained channels we can make a simple modification to
 our variable-length encoding and offer such a guarantee,
 as follows\nocite{MacKay00RLLT}.
 We find two codes, two mappings of  binary strings  to variable-length
 encodings, having the property that
 for any source string $\bx$, if
  the encoding of $\bx$ under the first code is shorter than average,
 then the encoding of $\bx$ under the second code is
% refer to rllt.tex
 longer than average, and {\em vice versa}.
 Then to transmit a string $\bx$ we  encode the whole string with both codes
 and send whichever encoding is shorter,
 prepended by a suitably encoded single bit to convey which
 of the two codes is being used.

% \section{Exercises}
\amarginfig{c}{\small
\begin{center}
\begin{tabular}{@{}cc}
\raisebox{0.215in}{$\input{noiseless/tex/mfile_rl2.tex}$ }% a simple array
&
\raisebox{-0.0in}{% was -0.15in
\mbox{\psfig{figure=noiseless/figs/staterl2.ps,angle=-90,height=0.8in}}
}
% \hspace{0.2in}
\\
%\hspace{0.2in}
\raisebox{0.3in}{$\input{noiseless/tex/mfile_rl3.tex}$}
&
\raisebox{-0.15in}{%
\mbox{\psfig{figure=noiseless/figs/staterl3.ps,angle=-90,height=1in}}
}
\\
\end{tabular}
\end{center}
%}{%
\caption[a]{State diagrams
 and \connectionmatrices\ for  channels with maximum runlengths for
 {\tt 1}s equal to  2 and 3. }
\label{fig.state_rl23}
}%
%\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Problems here?
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
\exercissxB{3C}{ex.rl.small}{
 How
%\begin{figure}
%\figuremargin{%
 many valid sequences  of length 8  starting
% ending
 with a {\tt 0}
 are there for the  run-length-limited channels
% $0^+1^{1,2}$ and $0^+1^{1,3}$ (
 shown in \figref{fig.state_rl23}?

 What are the capacities of these channels? 
% eigs are 1.839286 -> 0.879
% and  1.9276 -> 0.947

 Using  a computer, find
% Find
 the matrices $\bQ$ for generating a random path 
 through the trellises of
% run-length-limited channels $0^+1^{1}$ (channel A),  $0^+1^{1,2}$ and $0^+1^{1,3}$.
 channel A and of the two run-length-limited channels
 shown in \figref{fig.state_rl23}.
}
% BORDERLINE
\exercissxB{3}{ex.rl.limit}{
 Consider the run-length-limited channel in which 
 any length of run of {\tt 0}s is permitted, 
 and the maximum run length of {\tt 1}s is a large
 number $L$ such as nine or ninety. 
% (Nine is large enough for our purposes.)

 Estimate the capacity of this channel. (Give the first two terms in a series
 expansion involving $L$.)
 
	What, roughly,  is the form of the optimal 
 matrix $\bQ$ for generating a random path through  
 the trellis of this channel?  Focus  on the 
 values  of the elements $Q_{1|0}$, the probability of
  generating a {\tt{1}} given a preceding {\tt0},
 and $Q_{L|L-1}$, the probability
 of generating a {\tt1} given a preceding run of $L\!-\!1$ {\tt1}s.
 Check your answer by explicit computation
 for the
% $0^+1^{1,9}$
 channel in which
% the string {\tt 1111111111} is forbidden, \ie,
 the maximum \ind{runlength} of {\tt 1}s is nine. 
% my code is in the file matlabs, see also  qrl9h.dat and qrl9h.tex  in eigen
}



\section{Variable symbol durations\nonexaminable}
 We can add a further frill\index{channel!variable symbol durations}
 to the  task of communicating over constrained channels
 by assuming that the symbols we send have different
 {\em{durations}\/}, and that our aim is to
 communicate at the maximum possible rate per unit time.
 Such channels can come in two flavours:
 unconstrained, and constrained.

\subsection{Unconstrained channels with variable symbol durations}
 We
% already
 encountered an
 unconstrained  noiseless channel  with variable symbol durations
 in \exerciseref{ex.phone_chat}.
% Each symbol had a different duration. 
 Solve that problem, and you've done this topic.\index{source code!variable symbol durations}
 The task is to determine the optimal frequencies with which
  the symbols should be used, given  their durations.

 There is a nice analogy between this task
 and the task of designing an  optimal symbol code (\chref{ch.two}). 
 When we make a binary symbol code for a source
 with unequal probabilities
 $p_i$, the optimal message lengths are $	l^*_i = \log_2 \dfrac{1}{p_i}$,
 so
\beq
 p_i = 2^{-l^*_i}.
\eeq
 Similarly, when we have a channel whose symbols have durations
 $l_i$ (in some units of time), the optimal probability with which those
 symbols should be used is
\beq
	p^*_i = 2^{ - \beta l_i }, 
\eeq
 where $\beta$ is the capacity of the channel in bits per unit time.
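
 As a small worked illustration (the durations are chosen here for
 convenience and are not those of \exerciseref{ex.phone_chat}), consider
 two symbols with durations $l_1 = 1$ and $l_2 = 2$.
 The probabilities must sum to one, which fixes $\beta$:
\beq
	2^{-\beta} + 2^{-2\beta} = 1 .
\eeq
 Writing $x \equiv 2^{-\beta}$, we have $x^2 + x - 1 = 0$, so
 $x = (\sqrt{5}-1)/2 = 0.618$ and
\beq
	\beta = \log_2 \frac{1}{x} = 0.694 , \:\:\:\:
	p^*_1 = 0.618 , \:\:\:\: p^*_2 = 0.382 .
\eeq
 This value of $\beta$ equals the capacity found earlier for channel A,
 as one would hope, since a valid string for channel A that ends in
 a {\tt 0} is a concatenation of the two `symbols' {\tt 0} and {\tt 10}.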


\subsection{Constrained channels with variable symbol durations}
% Then there's the general problem of a channel with constraints
% and with \ind{variable duration symbols}. [\eg, \ind{Morse code},
% where dots and dashes must be separated by either short or
% long spaces.]
%
% {\em MORE HERE. Add an exercise to solve the general case
% with both constraints and variable duration.}
 Once you have grasped  the\index{channel!constrained} preceding topics in this chapter,
 you should be able to figure out how to define and
 find the capacity of these,  the trickiest
 constrained channels.

\exercisxC{3}{ex.morse}{
 A classic example of a constrained channel   with variable symbol durations
 is the `\ind{Morse}' channel, whose symbols are
\begin{center}
\begin{tabular}{ll}
the dot & {\tt{d}}, \\
the dash & {\tt{D}}, \\
the short space (used between letters in Morse code) & {\tt{s}}, and \\
the long space (used between words)  & {\tt{S}}; \\
\end{tabular}
\end{center}
 the constraints are that
% dots and dashes may only be followed by  spaces, and
 spaces may only be followed by dots and dashes.

 Find the capacity of this channel  in bits per unit time
 assuming (a) that
 all four symbols have equal durations; or (b) that the
 symbol durations are 2, 4, 3 and 6 time units respectively.
}
\exercisxC{4}{ex.morse2}{
 How well-designed is Morse code
 for English (with, say, the probability
 distribution of \figref{fig.monogram})?
}



%Figure showing state, symbol, duration. 
%
% There we used an entropy-maximization method to solve for the 
% optimal probability distribution over symbols. 
%
% This method  works fine for all channels that can be described 
% by a single state with a load of edges. 
% But if there's several states then 
% a new solution is needed.
%
%
% Find by representing the state diagram by a polynomial. 
% Exponent is path length. 
%
% Another approach is to assume a probability distribution 
% 
% Write $f(x) = x + x^2$. What does that mean?

%
% PUT ME BACK:
%
%
% \input{tex/rll_fortuitous.tex}
%

%
%& ${\tt 0}^+{\tt 1}^+$         &1
%& ${\tt 0}^+{\tt 1}^1$         &?
%& ${\tt 0}^{2+}{\tt 1}^{2+}$   &?
%& ${\tt 0}^{1,2}{\tt 1}^{1,2}$ &?
%



\exercisxC{3C}{ex.constrainedphysics}{
 {\sf How difficult is it to get \ind{DNA} into a narrow tube?}

	To an information theorist, the entropy associated with a
 \ind{constrained channel} reveals how much information can be conveyed over it.
 In \ind{statistical physics},  the same calculations are done for a
 different reason: to predict the thermodynamics of  polymers,
 for example.

 As a toy example, consider a \ind{polymer} of length $N$
 that can either sit in a constraining \ind{tube}, of width $L$,
 or in the open where there are no constraints.
 In the open, the polymer adopts a state drawn at random
 from the set of one-dimensional random walks, with, say, 3
 possible directions per step.
\marginfig{
\begin{center}
\mbox{\psfig{figure=metrop/dna/walk.ps,width=0.4in,angle=180}}
\end{center}
\caption[a]{Model of DNA squashed in a narrow tube.
% The tube's diameter is 10 steps.
 The DNA will have a tendency to pop
 out of the tube, because, outside the tube,  its random walk has greater entropy.
}}
% see _research/drunkard
 The entropy of this walk is $\log 3$ per step, \ie, a total of $N \log 3$.
 [The \ind{free energy} of the polymer is defined to be $-kT$ times this,
 where $T$ is the temperature.]
% # /home/mackay/_research/drunkard/DO.m
% and itp/metrop/dna
	In the tube, the polymer's one-dimensional walk
 can go in 3 directions unless the wall is in the way, so
 the \ind{connection matrix}\index{trellis section} is, for example (if $L=10$),
\[
%\begin{realcenter}\small
%\begin{tabular}{*{10}{c}}
\left[\begin{array}{*{10}{c}}
 1 &1 &0 &0 &0 &0 &0 &0 &0 &0\\
 1 &1 &1 &0 &0 &0 &0 &0 &0 &0\\
 0 &1 &1 &1 &0 &0 &0 &0 &0 &0\\
 0 &0 &1 &1 &1 &0 &0 &0 &0 &0\\
 0 &0 &0 &1 &1 &1 &0 &0 &0 &0\\
   &  &  &  & \makebox[0in][c]{$\ddots$} & \makebox[0in][c]{$\ddots$} & \makebox[0in][c]{$\ddots$}  & \\
%
% graveyard.tex
%
% 0 &0 &0 &0 &0 &0 &0 &0 &0 &0 &0 &0 &0 &0 &0 &1 &1 &1 &0 &0\\
% 0 &0 &0 &0 &0 &0 &0 &0 &0 &0 &0 &0 &0 &0 &0 &0 &1 &1 &1 &0\\
 0  &0 &0 &0 &0 &0 &0 &1 &1 &1\\
 0  &0 &0 &0 &0 &0 &0 &0 &1 &1\\
\end{array}\right].
%\end{tabular}.
%\end{realcenter}
\]
 Now, what is the entropy of the polymer?
 What is the {\em change\/} in entropy associated
 with the polymer entering the tube?
% In DO.m I got  0.0075 nats * N
% or 0.01 bits * N
 If possible, obtain an expression as a function of $L$.
 Use a computer to find the  entropy of the
 walk for a particular value of $L$, \eg\ 20,
 and plot the probability density of the
 polymer's  transverse location in the tube.
% in DO.m I found this, which is rather pretty to plot.
% The shape seems to be independent of L.
% 
%  0.011169
%  0.022089
%  0.032515
%  0.042215
%  0.050972
%  0.058590
%  0.064900
%  0.069759
%  0.073061
%  0.074730
%  0.074730
%  0.073061
%  0.069759
%  0.064900
%  0.058590
%  0.050972
%  0.042215
%  0.032515
%  0.022089
%  0.011169
%
%

 Notice the difference in  capacity  between two channels, one constrained
 and one unconstrained, is directly proportional to the force required to pull
 the DNA into the
 tube.\index{connection between!channel capacity and physics}\index{channel!capacity!connection with physics}
}

\dvips
\section{Solutions}% to Chapter  \protect\ref{ch.noiseless}'s exercises}
\soln{ex.count.ones}{
 A file transmitted by $C_2$  contains, on average, one-third {\tt 1}s
 and two-thirds {\tt 0}s.

 If $f = 0.38$,
 the fraction of {\tt 1}s is
 $f/(1+f) =
 (\gamma-1.0)/(2\gamma - 1.0) = 0.2764$.
}
\soln{ex.abc.compare}{
	A valid string for channel C can be obtained from a valid 
 string for channel A by first inverting it [${\tt 1} \rightarrow {\tt 0}$;
 ${\tt 0} \rightarrow {\tt 1}$], then passing it through 
 an accumulator. These operations are invertible, so any 
 valid string for C can also be mapped onto a  valid string
 for A. The only proviso here comes from the 
 edge effects.  If we assume that the first character 
 transmitted over channel C is preceded by a string of zeroes, so that 
 the first character is forced to be a {\tt 1} (\figref{fig.state101count}c)
 then the two channels are  exactly equivalent only if we assume that
 channel A's first character must be a zero.
}
% added Sun 3/2/02
\soln{ex.con8.10}{
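 With $N=16$ transmitted bits, the number of valid strings that end
 in the zero state is $M_{15} = 1597$, and $\log_2 1597 \simeq 10.6$.
% 10.6
 The largest integer number of
 source bits that can be encoded is therefore 10,
 so the maximum rate of a fixed length code with  $N=16$  is $10/16 = 0.625$.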
}
% \exercis{ex.show.trellis.entropy}{

\begincuttable
\soln{ex.show.trellis.entropy}{
 Let the invariant distribution be 
\beq
	P(s) = \alpha e^{(L)}_s e^{(R)}_s ,
\eeq
 where $\a$ is a normalization constant.\marginpar{\small\raggedright{Here,
 as in
  \chapterref{chtwo}, $S_t$ denotes the ensemble whose random variable
 is the state $s_t$.}}
 The entropy of $S_t$ given $S_{t-1}$, assuming $S_{t-1}$ comes
 from the invariant distribution, is
\beqan
	H(S_t|S_{t-1}) 
&\eq &
		 - \sum_{s,s'} P(s)P(s'|s) \log P(s'|s) 
\\
&\eq &
		 - \sum_{s,s'} \alpha e^{(L)}_s e^{(R)}_s  
	\frac{e^{(L)}_{s'} \cmA_{s's} }{\l e^{(L)}_s } 
	\log
	 \frac{e^{(L)}_{s'} \cmA_{s's} }{\l e^{(L)}_s }
\eeqan
\beq
%\\
%\lefteqn{%&\eq &
 =
		 - \sum_{s,s'} \alpha \, e^{(R)}_s  
	\frac{e^{(L)}_{s'} \cmA_{s's} }{\l} 
	\left[
	\log
	 e^{(L)}_{s'}
	+ \log  \cmA_{s's} - \log \l - \log  e^{(L)}_s 
	\right] .
%}
\eeq
 Now, $\cmA_{s's}$ is either 0 or 1, so the contributions from 
 the terms proportional to $\cmA_{s's} \log  \cmA_{s's}$
 are all zero. So
\beqan
 H(S_t|S_{t-1})
&\eq & \log \l
	 - \frac{ \alpha}{\l} 
	  \sum_{s'} 
	\left( \sum_{s} \cmA_{s's}  e^{(R)}_s   \right)
	e^{(L)}_{s' }
	\log
	 e^{(L)}_{s'} +
\nonumber \\ 
& &
	 \frac{ \alpha}{\l} 
	 \sum_{s} 
\left(	  \sum_{s'}
	 e^{(L)}_{s'} \cmA_{s's}  \right)
         e^{(R)}_s  
 	\log   e^{(L)}_s 
\eeqan
\beqan
&\eq &
 \log \l
%
		 - \frac{ \alpha}{\l} 
	  \sum_{s'} 
	\l   e^{(R)}_{s'} 
	e^{(L)}_{s' }
	\log
	 e^{(L)}_{s'}
	 +\frac{ \alpha}{\l} 
	 \sum_{s} 
	\l	 e^{(L)}_{s}
         e^{(R)}_s  
 	\log   e^{(L)}_s
\\
&=&
\log \l .
\eeqan
}
\ENDcuttable
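
 As a concrete illustration of this result, here is a check for
 channel A, assuming the states are ordered $(1,0)$ so that the
 \connectionmatrix\ is
 $\left[ \begin{array}{cc} 0 & 1 \\ 1 & 1 \end{array} \right]$.
 This matrix is symmetric, so the left- and right-eigenvectors coincide,
 $\be^{(L)} = \be^{(R)} \propto [1, \gamma]^{\T}$, with $\lambda = \gamma$.
 Equation (\ref{eq.optimalQ}) then gives
\beq
	Q_{1|0} = \frac{1}{\gamma \cdot \gamma} = 0.382 , \:\:\:
	Q_{0|0} = \frac{\gamma}{\gamma \cdot \gamma} = 0.618 , \:\:\:
	Q_{0|1} = \frac{\gamma}{\gamma \cdot 1} = 1 , \:\:\:
	Q_{1|1} = 0 .
\eeq
 The invariant distribution is
 $P(0) = \gamma^2/(1+\gamma^2) = 0.7236$, $P(1) = 0.2764$, and only
 state {\tt 0} contributes to the conditional entropy:
\beq
	H(S_t|S_{t-1}) = P(0) \, H_2(0.382) = 0.7236 \times 0.9594 = 0.694 = \log_2 \gamma .
\eeq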

\soln{ex.rl.small}{
 The principal  eigenvalues of the 
 \connectionmatrices\ of the two channels  are 1.839
 and  1.928.
 The capacities ($\log \l$) are  0.879 and  0.947 bits.
% See the eigenvector tables (section \ref{sec.eigenvectors.qrl})
% for the matrices $\bQ$.
% I think this is a ref to eigen.tex,
% see \label{sec.eigenvectors.qrl}% Fri 14/12/01
%
%\begin{center}
%\input{noiseless/tex/tcounts_rl2.tex}
%\\
%\input{noiseless/tex/tcounts_rl3.tex}
%\end{center}
}

% BORDERLINE
%
%%%%%%%%%%%%%\soln{ex.rl.limit}{
%
% conjecture this is too big for a { }
% so doing it by hand
\begincuttable
\begin{Sexercise}{ex.rl.limit}
 The channel is  similar to the unconstrained binary channel;  
 runs of length greater than $L$ are rare if $L$ is large, 
 so we only expect weak differences from this channel; these
 differences will show up in contexts where the run length 
 is close to $L$. The capacity of the channel is 
 very close to one bit.

 A lower bound on the capacity is obtained by considering the 
 simple variable-length code for this channel which 
 replaces occurrences of the maximum runlength 
 string {\tt 111$\ldots$1} by {\tt 111$\ldots$10}, 
 and otherwise leaves the source file unchanged. 
 The average rate of this code is $1/(1+2^{-L})$
 because the invariant distribution will hit the `add an extra zero'
 state a fraction $2^{-L}$
 of the time.

%   sum( a * r^n , n=0..N ) ;
% 
%                                   (N + 1)
%                                a r            a
%                                ---------- - -----
%                                   r - 1     r - 1


 We can reuse the solution for the variable-length channel
 in \exerciseref{ex.phone_chat}. The capacity 
 is the value of $\beta$ such that the equation 
\beq
 Z(\beta) =  \sum_{l=1}^{L+1} 2^{-\beta l} = 1 
\eeq
 is satisfied. 
 The $L+1$ terms in the sum correspond to the $L+1$ possible 
 strings that can be emitted, {\tt 0}, {\tt 10}, {\tt 110}, $\ldots$~, {\tt 11$\ldots$10}.
 The sum is\index{geometric progression} 
 exactly given by:
% \marginpar{\footnotesize{$\displaystyle\left[\sum_{n=0}^{N}ar^{n}={\frac {a (r^{N+1}-1)}{r-1}}\right]$}}
%
\beq
  Z(\beta) =   2^{-\beta} \frac{  \left(2^{-\beta}\right)^{L+1} - 1 }{  2^{-\beta} - 1}    .
\eeq
 $\displaystyle\left[\mbox{Here we used\ }\sum_{n=0}^{N}ar^{n}={\frac {a (r^{N+1}-1)}{r-1}}.\right]$

 We anticipate that $\beta$ should be a little less than 1 in order for $Z(\beta)$ to 
 equal 1.
 Rearranging and solving approximately for $\beta$, using $\ln (1+x) \simeq x$, 
\beqan
	Z(\beta) & = & 1 \\
%\Rightarrow 
%	 2 2^{-\beta} - 1& =&   \left(2^{-\beta}\right)^{L+2} \\
%\Rightarrow 
%	 2^{1-\beta} & =&  1 + \left(2^{-\beta}\right)^{L+2} \\
%\Rightarrow 
%	 {1-\beta} & =& \log_2 \left[ 1 + \left(2^{-\beta}\right)^{L+2} \right] \\
%\Rightarrow 
%	 {1-\beta} &\simeq&  \left(2^{-\beta}\right)^{L+2} / \ln 2 \\
\:\Rightarrow \:
	 {\beta}& \simeq & 1  - 2^{-(L+2)} / \ln 2  .
\eeqan
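 The intermediate steps are as follows. Writing $x \equiv 2^{-\beta}$,
 the condition $Z(\beta) = 1$ is $x ( x^{L+1} - 1 ) = x - 1$, that is,
\beqan
	2 x - 1 & = & x^{L+2} \\
\:\Rightarrow \:
	2^{1-\beta} & = & 1 + \left(2^{-\beta}\right)^{L+2} \\
\:\Rightarrow \:
	1-\beta & = & \log_2 \left[ 1 + \left(2^{-\beta}\right)^{L+2} \right]
	\: \simeq \: \left(2^{-\beta}\right)^{L+2} / \ln 2 ,
\eeqan
 and replacing $\beta$ by 1 in the small term on the right-hand side
 gives the approximation above.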
%  c(x) = 1 - (2.0**(-(x+2.0)) )/log(2.0)
% print c(1)
% print c(2)
% print c(3) 
% print c(4) 
% print c(9) 
% L=9: the eigenvalue is     1.9990
% log_2 is     0.99929
%
% eigen matlabs rl4:   1.9659  -> 0.9752
% 5:                   1.9836  -> 0.9881
% 6:                   1.9920  -> 0.9942
% L guess rate     true capacity
% 1 0.81966        0.6942 
% 2 0.90983	   0.879  
% 3 0.95491	   0.947  
% 4 0.97745        0.9752
% 5 0.98873        0.9881
% 6 0.994364       0.99419
% 9 0.9992919      0.99929556
 We evaluated  the true capacities for $L=2$ and $L=3$ in an 
 earlier exercise. The  table%
\amargintab{b}{\footnotesize
\begin{center}
\begin{tabular}{ccc} \toprule
$L$ & $\beta$ & \mbox{True capacity} \\ \midrule
2 & 0.910\phantom{0} & 0.879\phantom{0} \\
3 & 0.955\phantom{0} & 0.947\phantom{0} \\
4 & 0.977\phantom{0} & 0.975\phantom{0} \\
5 & 0.9887 & 0.9881 \\
6 & 0.9944 & 0.9942 \\
9 & 0.9993 & 0.9993 \\ \bottomrule
\end{tabular}
\end{center}
}
 compares the approximate
 capacity $\beta$ with the true capacity for a selection of values of $L$.

 The element  $Q_{1|0}$ will be  close
 to $1/2$ (just a tiny bit smaller), since in the 
  unconstrained binary channel $Q_{1|0}=1/2$.
 When a run of length $L-1$ has occurred, we effectively have a choice of 
 printing  {\tt 10} or {\tt 0}. Let the probability of selecting {\tt 10}
 be $f$.  Let us estimate the entropy of the {\em remaining\/} $N$
 characters in the stream as a function 
 of $f$, assuming the rest of the matrix $\bQ$ to have been 
 set to its optimal value. 
 The entropy of the next  $N$ characters in the  stream is
 the entropy of the first bit,  $H_2(f)$, plus the entropy of 
 the remaining characters, which is roughly 
  $(N\!-\!1)$ bits if we select {\tt 0} as the first bit
 and  $(N\!-\!2)$ bits if {\tt 1} is selected. More precisely, if $C$ 
 is the capacity of the channel (which is roughly 1), 
\beqan
\!\!\!\!\hspace*{-1.25em}
 H(\mbox{the next $N$ chars})& \simeq& H_2(f) + \left[ (N-1) (1-f) + (N-2) f \right] C 
\nonumber \\
	&=& H_2(f) + (N-1) C - f C \: \simeq \:  H_2(f) + N - 1 - f   .
\eeqan
 Differentiating and setting to zero to find the optimal $f$, we obtain:
\beq
	\log_2 \frac{1-f}{f} \simeq 1  \:\: \Rightarrow \frac{1-f}{f} \simeq 2  
	\:\: \Rightarrow f \simeq 1/3 .
\eeq
 The probability of emitting a {\tt 1} thus decreases from about 0.5
 to about $1/3$ as the number of emitted {\tt 1}s increases.

 Here is the optimal matrix:
\beq
%%%% 
%%%% written by matrix2tex.p 
%%%% 
%%%% beginning of matrix 
%%%% 
\left[
\begin{array}{@{\,}*{9}{c@{\,\,}}c@{\,}}
       0  &   .3334  &        0  &        0  &        0  &        0  &        0  &        0  &        0  &        0 \\ 
       0  &        0  &   .4287  &        0  &        0  &        0  &        0  &        0  &        0  &        0 \\ 
       0  &        0  &        0  &   .4669  &        0  &        0  &        0  &        0  &        0  &        0 \\ 
       0  &        0  &        0  &        0  &   .4841  &        0  &        0  &        0  &        0  &        0 \\ 
       0  &        0  &        0  &        0  &        0  &   .4923  &        0  &        0  &        0  &        0 \\ 
       0  &        0  &        0  &        0  &        0  &        0  &   .4963  &        0  &        0  &        0 \\ 
       0  &        0  &        0  &        0  &        0  &        0  &        0  &   .4983  &        0  &        0 \\ 
       0  &        0  &        0  &        0  &        0  &        0  &        0  &        0  &   .4993  &        0 \\ 
       0  &        0  &        0  &        0  &        0  &        0  &        0  &        0  &        0  &   .4998 \\ 
       1  &   .6666  &   .5713  &   .5331  &   .5159  &   .5077  &   .5037  &   .5017  &   .5007  &   .5002 
\end{array}
\right]
%%%% 

\eeq
%
% something wrong here? 
%

 Our rough theory works.

\end{Sexercise}
\ENDcuttable
% What is wrong with latex here?
%%%%%%%%%%}
% c(9) 0.999755859375
% c(8) 0.99951171875

\dvips
% ch 10
%\chapter{Language models and crosswords \nonexaminable}
%\chapter{Language Models and Crosswords \nonexaminable}
\chapter{Crosswords  and Codebreaking \nonexaminable}% An Aside
% This chapter belongs as close as possible to the
% compression and noisy channel chapters. But it would
% also go better after the constrained channel chapter,
% since language can be viewed as a constrained channel.
%
% \input{tex/monogram.tex}
 In this chapter we make a random walk through a few topics related
 to language modelling.
% \section{Crosswords}
\label{ch.xword}
%\section{}
\section{Crosswords}
 The rules of crossword-making may be thought of as defining
 a constrained channel. The fact that {\em many\/}
 valid crosswords can be made demonstrates that
 this \ind{constrained channel} has a capacity greater than zero.

 There are two archetypal \ind{crossword} formats.%
\amarginfig{t}{
\begin{tabular}{c}
\mbox{\epsfbox{metapost/xword.14}}\\
\mbox{\epsfbox{metapost/xword.1}}\\
%\psfig{figure=figs/xwordA3.ps,width=1.5in} \\
%\psfig{figure=figs/xwordB.ps,width=1.5in} \\
%\psfig{figure=figs/grid-us.ps,width=1.5in} \\
%\psfig{figure=figs/grid994.ps,width=1.5in} \\
\end{tabular}
\caption[a]{Crosswords
%$grids
 of types A (\ind{American})  and B (\ind{British}).}
}
%, which differ in their grids' properties.
 In a `type A' (or \ind{American})
 \ind{crossword}, every row and column  consists of a succession of 
 words of length 2 or more separated by one or more spaces. 
 In a `type B' (or \ind{British}) crossword, each row and column 
 consists of a mixture of words and 
 single characters, separated by one or more spaces, and 
 every character lies in at least one word (horizontal or vertical).
 Whereas in a type A crossword every letter lies in a horizontal 
 word {\em and\/}
 a vertical word, in a typical type B crossword only about half of 
 the letters do so; the other half lie in one word only.
%[`A' and `B' are
% mnemonic for America and Britain, 
% where these two types of  crosswords are respectively 
% more widespread.]
 
 Type A crosswords are harder to {\em create\/}
 than type B  because  of the constraint 
% that they are subject to,
 that no single characters are permitted.
 Type B crosswords are generally harder to {\em solve\/} because
 there are fewer constraints per character.
 
\subsection{Why are crosswords possible?}
 If  a language has no redundancy, then any letters written on
 a grid form a valid crossword.
 In a language with high redundancy, on the
 other hand, it is hard to make crosswords (except perhaps
 a small number of trivial ones).
 The possibility of making crosswords in a language
 thus demonstrates a {\em bound on the redundancy\/} of
 that language.
 Crosswords
% , when read horizontally or vertically,
 are not normally written in genuine \ind{English}. They are written in 
% Perhaps we should introduce a name 
% like Wenglish
% \footnote{Need to decide whether to call this Wenglish, as in chapter 2.}
% for
 `\ind{word-English}', the language consisting of 
 strings of words from a dictionary, separated by spaces.
\exercisxB{2}{ex.winglishcap}{
 Estimate the capacity of word-English, in bits per
 character.
 [Hint: think of word-English
 as defining a constrained channel (\chref{ch.noiseless})
 and see  \exerciseref{ex.phone_chat}.]
}
% ? (relate to telephone rings and 
% chapter on constrained channels). Give an estimate.
 The fact that many crosswords can be made leads to 
% that  the redundancy is not very big
% the entropy of english is quite big
 a lower bound on the entropy of word-English.

 
 For simplicity, we now model
 \ind{word-English} by \ind{\wenglish},
 the language introduced in
 \secref{sec.wenglish} which
  consists of $W$ words all of length
 $L$. The entropy of
 such a language, per character, including inter-word spaces,  is:
\beq
 H_W \equiv \frac{\log_2 W }{L+1} .
\label{eq.HW}
\eeq
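 For instance, with the illustrative values $W = 2^{15}$ and $L = 5$
 (assumed here simply to put a number on the formula),
 $H_W = 15/6 = 2.5$ bits per character; with $W = 10^5$ and $L = 6$,
 $H_W \simeq 16.6/7 \simeq 2.4$ bits per character.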
% I reckon $W$ is about 100,000 tops and L=6 seems reasonable.
% that's  17/7 -> 2ish bits per character.
 We'll find that the conclusions we come to depend on the value of $H_W$ and are not terribly 
 sensitive to the value of $L$.
%
  Consider a large crossword of size $S$ squares in area. 
 Let the number of words be $f_w S$ and let the number of 
 letter-occupied squares be $f_1 S$. For typical crosswords of
 types A and B  made of words of length $L$, the two fractions $f_w$
 and $f_1$  have
% the following values:
 roughly  the  values in \tabref{tab.xwordf}.
\margintab{\small
\begin{center}
\begin{tabular}{ccc} \toprule
            & A & B \\ \midrule
$f_w$ & $\displaystyle \frac{2}{L+1}$ &  $\displaystyle \frac{1}{L+1}$ \\[0.1in]
$f_1$ & $\displaystyle \frac{L}{L+1}$ &  $\displaystyle \frac{3}{4}\frac{L}{L+1}$ \\
\bottomrule
\end{tabular}
\end{center}
\caption[a]{Factors $f_w$ and $f_1$ by which the  number of words
 and number of letter-squares respectively are smaller than the total
 number of squares.}
\label{tab.xwordf}
}

 We now estimate how many crosswords there are of size $S$
 using our simple model of \Wenglish.
% , and work out the condition for 
 We
% Let's
 assume that \Wenglish\ is created at random by generating $W$
 strings from a monogram
% single
  (\ie, memoryless)
  source with entropy $H_0$. If, for example, the source used all
 $A=26$ characters with equal probability then $H_0 = \log_2 A =
4.7$ bits. If instead we use \chref{ch.prob.ent}'s distribution then
 the entropy is 4.2 bits.
 The redundancy of Wenglish stems from
% these
 two sources:
 it tends to use some letters more than others;
 and there are only $W$ words in the dictionary.
% 3.xxx.

 Let's now count how many crosswords there are by 
 imagining filling in the squares of a crossword at random using the same
distribution that produced the \Wenglish\ dictionary
 and evaluating the probability that this random scribbling produces 
 valid words in all rows and columns. 
The total number of {\em typical\/} fillings-in of the 
 $f_1 S$ squares in the  crossword
 that can be made is 
\beq
	 |T| =	2^{  f_1 S   H_0} .
\eeq
 The probability that one word of length $L$ is validly filled-in
 is 
\beq
	\beta = \frac{W}{2^{L H_0 }},
\eeq
 and the probability that the whole crossword, made of $f_w S$ words,  is validly filled-in
 by a single typical in-filling is approximately\marginpar{\small\raggedright{This  calculation 
 underestimates
 the number of valid Wenglish crosswords
 by counting only crosswords filled with `typical' strings.
 If the monogram distribution is non-uniform then  the 
 true count is dominated
 by `atypical' fillings-in, in which crossword-friendly words appear more often.
}}
\beq
	\beta^{f_w S} .
\eeq
 So the log of the 
 number of valid crosswords of size $S$ is estimated to be 
\beqan
	\log \beta^{f_w S} |T|  &=& S \left[
	( f_1 - f_w L ) H_0 + f_w  \log W
\right]
%	\log \beta^{f_w S} |T|
\\
&=& S \left[
	( f_1 - f_w L ) H_0 + f_w (L+1) H_W
% by defn, ref{eq.HW} ===  \frac{\log W}{L+1}
\right] ,
\eeqan
 which is an increasing function of $S$
 only if
\beq
	( f_1 - f_w L ) H_0 + f_w (L+1) H_W
	> 0.
\eeq
 So arbitrarily many crosswords can be made only 
 if there are enough words in the \Wenglish\ dictionary  that
\beq
	H_W > \frac{( f_w L-  f_1 )}{f_w(L+1)} H_0 .
\eeq
 Plugging in the values of $f_1$ and $f_w$ from \tabref{tab.xwordf},
 we find the following.
\begin{realcenter}
\begin{tabular}{lcc} \toprule
Crossword type             & A & B \\ \midrule
%$f_w$ &  $\frac{2}{L+1}$ &  $\frac{1}{L+1}$  \\[0.05in]
%$f_w(L+1)$ &  {2} &  {1}  \\[0.05in]
%$f_1$ &  $\frac{L}{L+1}$ &  $\frac{3}{4}\frac{L}{L+1}$ \\[0.05in]
%$-f_1+f_wL$ &  $\frac{L}{L+1}$ &  $\frac{1}{4}\frac{L}{L+1}$ \\[0.05in]
Condition for crosswords
 &
 $H_W > \frac{1}{2}\frac{L}{L+1}  H_0$ 
& $H_W >   \frac{1}{4}\frac{L}{L+1}   H_0$ \\
\bottomrule
\end{tabular}
\end{realcenter}
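
 To see where these thresholds come from, substitute the rough
 fractions of \tabref{tab.xwordf} into the general condition.
 For type A,
\beq
	f_w L - f_1 = \frac{2L}{L+1} - \frac{L}{L+1} = \frac{L}{L+1}
	\:\:\: \mbox{and} \:\:\: f_w (L+1) = 2 ,
\eeq
 so the ratio multiplying $H_0$ is $\frac{1}{2}\frac{L}{L+1}$.
 For type B,
\beq
	f_w L - f_1 = \frac{L}{L+1} - \frac{3}{4}\frac{L}{L+1}
	= \frac{1}{4}\frac{L}{L+1}
	\:\:\: \mbox{and} \:\:\: f_w (L+1) = 1 ,
\eeq
 so the ratio is $\frac{1}{4}\frac{L}{L+1}$.
 The quantity $f_w L - f_1$ is the fraction of squares whose letter
 lies in two words at once. With the table's values, every letter of
 a type A crossword is doubly constrained, whereas only about a third
 of the letters of a type B crossword are; this is why type B
 has the weaker condition.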

 If we set $H_0=4.2\ubits$ and assume there are $W=4000$
 words in a normal English-speaker's
 dictionary, all with length $L=5$, then we find
 that the condition for crosswords of type B is
 satisfied, but
 the condition for crosswords of type A is
 {\em only just\/} satisfied. This fits with
 my experience that crosswords of type A
 usually contain more obscure words. 
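
 As a rough check of these statements, using only the numbers just
 quoted ($W=4000$, $L=5$, $H_0=4.2\ubits$), the entropy of the
 dictionary is
\beq
	H_W = \frac{\log_2 4000}{6} \simeq \frac{12.0}{6} \simeq 2.0 \ubits ,
\eeq
 and the two thresholds are
\beq
	\frac{1}{2} \cdot \frac{5}{6} \cdot 4.2 \simeq 1.75 \ubits
	\:\: \mbox{(type A)}
	\:\:\: \mbox{and} \:\:\:
	\frac{1}{4} \cdot \frac{5}{6} \cdot 4.2 \simeq 0.88 \ubits
	\:\: \mbox{(type B)} .
\eeq
 So type B clears its threshold by more than a bit per character,
 while type A clears its threshold by only about a quarter of a bit.
 Equivalently, within this toy model type A crosswords require
 $\log_2 W > 10.5$, that is, a dictionary of more than about 1450
 five-letter words.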

% Thus crosswords are possible in English because English has 
% high enough entropy.

% In a language with fewer, longer words, the possibility of making
% crosswords vanishes.
% see xwordaside.tex


% units.tex has its own further reading
\section*{Further reading}
 These observations about crosswords were first  made by
 \index{Shannon, Claude}\index{Wolf, Jack}\index{Siegel, Paul}\citeasnoun{Shannon48};
%Shannon;
% http://cm.bell-labs.com/cm/ms/what/shannonday/paper.html
% p15
 I learned about them from \citeasnoun{wolf1998}.
 The topic is closely related to the capacity of two-dimensional
 constrained channels. An example of a  two-dimensional\index{channel!two-dimensional}
 constrained channel is a two-dimensional  \ind{bar-code},
 as seen
% in hexagonal patterns
 on parcels.
%\section{}
% http://www.adams1.com/pub/russadam/stack.html
% exercises at the end fo the crossword chapter.
\fakesection{Xword Exercises}
\exercisxC{3}{ex.constrainedchannel2}{
 A two-dimensional channel is defined by the constraint that,
 of the eight neighbours of every interior pixel
 in an $N \times N$ rectangular grid,
 four must be black and four white. (The counts of black and white pixels
 around boundary pixels are not constrained.)
 A binary pattern satisfying this constraint is shown in
 \figref{fig.granny}.
\marginfig{
\begin{center}
\mbox{\epsfbox{metapost/xword.21}}
\end{center}
%
\caption[a]{A binary pattern in which every  pixel is adjacent
 to four black and four white pixels.
}
\label{fig.granny}
}
 What is the capacity of this channel, in bits per pixel, for large $N$?
% answer: tends to 0.
}





%
\dvips
%
\section{Simple language models}
\label{sec.zipf}
\subsection{The Zipf--Mandelbrot distribution}
 The\index{Zipf, George K.}
 crudest model for a language is the monogram
 model, which asserts that each successive word
 is drawn independently from a distribution over
 words.
 What is the nature of this distribution over words?

 Zipf's law \cite{zipf} asserts that\index{Zipf's law}\index{Zipf plot}
 the probability of the  $r$th most probable word in a language is
 approximately
\beq
	P(r) = \frac{\kappa}{ r^{\alpha} },
\eeq
 where the exponent $\alpha$ has a value close to 1, and $\kappa$ is
 a constant. According to Zipf,
 a log--log plot of frequency versus word-rank should
 show a straight line with slope $-\alpha$.

\quotecite{Frac}
% Mandelbrot's
 modification\index{Mandelbrot, Benoit}
 of Zipf's law introduces a third parameter $v$,
 asserting that the probabilities are given by
\beq
	P(r) = \frac{\kappa}{ (r+v)^{\alpha} } . % 1/D
\label{eq.mandelbrot}
\eeq
 For some documents, such as Jane Austen's {\em Emma},
 the Zipf--Mandelbrot distribution
 fits well -- \figref{fig.emma.zipf}.
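
 To make the straight-line claim explicit, take logarithms of
 (\ref{eq.mandelbrot}):
\beq
	\log P(r) = \log \kappa - \alpha \log ( r + v ) .
\eeq
 With $v=0$ this is Zipf's law, a straight line of slope $-\alpha$
 on log--log axes. With $v>0$ the straight line is recovered only for
 ranks $r \gg v$; for $r \ll v$ the curve flattens towards the
 constant $\kappa / v^{\alpha}$.
 As an illustration, the parameters fitted to {\em Emma\/}
 in \figref{fig.emma.zipf} give the most frequent word ($r=1$)
 a probability of roughly $0.56 / 9^{1.26} \simeq 0.035$.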

 Other documents give distributions that are not so well fitted
 by a  Zipf--Mandelbrot distribution.
 \Figref{fig.book.zipf} shows a plot of  frequency versus rank for
 the \LaTeX\ source of this book. Qualitatively, the graph
 is similar to a straight line, but a curve is noticeable.
 To be fair,
% to Zipf and Mandelbrot,
 this source file is not written
 in pure English -- it is a mix of English, maths symbols such as `$x$',
 and \LaTeX\ commands.
\begin{figure}[hbtp]
\figuremargin{\small%
\begin{center}
\begin{tabular}{cc}
\mbox{\psfig{figure=zipf/pr_ps/161014.emma.ps,angle=-90,width=2.3in}}
\end{tabular} 
\end{center}
}{
\caption[a]{Fit of the Zipf--Mandelbrot distribution (\ref{eq.mandelbrot}) (curve)
 to
 the empirical frequencies of words in Jane Austen's {\em Emma} (dots).
 The fitted parameters are 
 $\kappa = 0.56$; $v = 8.0$; $\alpha =1.26$.
% D               = 0.79$.
}
\label{fig.emma.zipf}
}
\end{figure}
% 
\begin{figure}
\figuremargin{\small%
\begin{center}
\begin{tabular}{cc}
\mbox{\psfig{figure=zipf/pr_ps/346998.book.ps,angle=-90,width=2.3in}}
\end{tabular} 
\end{center}
}{
\caption[a]{Log--log plot of  frequency versus rank for
 the words  in the \LaTeX\ file
 of this book.
}
\label{fig.book.zipf}
}
\end{figure}

\subsection{The Dirichlet process}
\label{sec.dirichletprocess}
 Assuming we are interested in 
 monogram models for languages, what model  should we
 use? One difficulty in modelling a language is the
 unboundedness of vocabulary.  The greater the sample
 of language, the greater the number of words encountered.
 A generative model for a language should emulate this property.
 If asked `what is the next word in a newly-discovered
 work of Shakespeare?' our probability distribution over words
 must surely include some non-zero probability for
 {\em words that  Shakespeare never used before}.
 Our generative monogram model for language should
 also satisfy a consistency rule called {\dem\ind{exchangeability}}.
 If we imagine generating a new language from our generative model,
 producing an ever-growing corpus of text,
 all statistical properties of the text should
 be homogeneous: the probability of finding a particular word
 at a given location in the stream of text should be
 the same everywhere in the stream.

 The Dirichlet process model is a model for a stream
 of symbols (which
 we think of as  `words')
 that satisfies the exchangeability rule
 and that allows the vocabulary of symbols to grow without limit.
 The model has one parameter $\alpha$. As the
 stream of symbols is produced, we identify each new symbol 
 by a unique integer $w$. 
 When we have seen a stream of length $F$ symbols, we define
 the probability of the next symbol in terms of
 the counts $\{ F_w \}$ of the symbols seen so far thus:
 the probability that the next symbol is a new symbol, never
 seen before, is
\beq
	\frac{ \alpha }{ F + \alpha } .
\eeq
 The probability that the next symbol is symbol $w$ is
\beq
	\frac{ F_w }{ F + \alpha } .
\eeq
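 As a check, these probabilities are normalized, because
 $\sum_w F_w = F$:
\beq
	\frac{ \alpha }{ F + \alpha } + \sum_w \frac{ F_w }{ F + \alpha } = 1 .
\eeq
 The first rule also shows how slowly the vocabulary grows.
 (What follows is a standard property of this process, added here as
 an aside.)
 The symbol that arrives when the stream has length $f$ is new
 with probability $\alpha/(f+\alpha)$, so the expected number of
 distinct symbols in a stream of length $F$ is
\beq
	\sum_{f=0}^{F-1} \frac{ \alpha }{ f + \alpha } ,
\eeq
 which exceeds $\alpha \ln ( 1 + F/\alpha )$ by at most 1.
 The vocabulary thus grows only logarithmically with the length of
 the stream: for $F = 10^6$ symbols, the expected vocabulary is about 14
 symbols when $\alpha = 1$ and about 6900 when $\alpha = 1000$.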

 \Figref{fig.zipf.dprocess}
 shows Zipf plots\index{Zipf plot} (\ie, plots of symbol frequency versus rank)
 for million-symbol `documents' generated by
 Dirichlet process priors with values of
 $\alpha$ ranging from 1 to 1000.

% load 'gnu/1000000.all'
\begin{figure}
\figuremargin{\small%
\begin{center}
\begin{tabular}{c}
\mbox{\psfig{figure=zipf/pr_ps/1000000.all.ps,angle=-90,width=2.3in}}
\end{tabular} 
\end{center}
}{
\caption[a]{Zipf plots for four `languages' randomly generated
 from  Dirichlet processes with parameter $\alpha$ ranging
 from 1 to 1000. Also shown is the Zipf plot for this book.
}
\label{fig.zipf.dprocess}
}
\end{figure}

 It is evident that a Dirichlet process is
 not an adequate model for  observed distributions
 that roughly obey Zipf's law.\index{Zipf's law}

 With a small tweak, however, Dirichlet processes
 can produce rather nice Zipf plots.
 Imagine generating a language composed of
 elementary symbols using a Dirichlet process 
 with a rather small value of the  parameter $\alpha$,
 so that the number of reasonably  frequent symbols is about 27.
 If we  then declare one
 of those symbols (now called `characters' rather
 than words) to be a space character,
 then we can identify the strings between the space characters
 as `words'.
 If we generate a language in this way then
 the frequencies of words often come out as
 very nice Zipf plots, as shown in \figref{fig.dprocess2.zipf}.
 Which character is selected as the space character
 determines the slope of the Zipf plot -- a less probable
 space character gives rise to a richer language with a
% larger vocabulary and a
 shallower slope.
\begin{figure}
\figuremargin{\small%
\begin{center}
\begin{tabular}{cc}
\mbox{\psfig{figure=zipf/pr_ps/fakes2003.ps,angle=-90,width=2.3in}}\\
\end{tabular} 
\end{center}
}{
\caption[a]{Zipf plots for the words of two `languages'  generated
 by creating successive characters from a Dirichlet process
 with $\alpha=2$, and declaring  one
% randomly selected
 character to be the space character. The two curves result
 from two different choices of the space character.
}
\label{fig.dprocess2.zipf}
}
\end{figure}






% ch 10
%\chapter{Cryptography and cryptanalysis: codes for information concealment \nonexaminable}
%\chapter{Cryptography and Cryptanalysis: Codes for Information Concealment \nonexaminable}
%\label{ch.crypto}
%\input{tex/crypto.tex}
%
\dvips
%\chapter{Units of measurement of information content \nonexaminable}
\section{Units of information content \nonexaminable}
%\chapter{Units of Information Content \nonexaminable}
% units.tex
\fakesection{Units of measurement of information content}
 The information content of an outcome, $x$,
 whose probability is $P(x)$, is defined to
 be
\beq
	h(x) = \log \frac{1}{P(x)} .
\eeq
 The entropy of an ensemble is
 an average information content,
\beq
	H(X) = \sum_x P(x) \log \frac{1}{P(x)} .
\eeq
 When we compare hypotheses with each other in
 the light of data, it is often convenient
 to compare the log of the probability
 of the data under the alternative hypotheses,
% models,
\beq
	\mbox{`log evidence for $\H_i$'} = \log P( D \given  \H_i ) ,
\eeq
 or, in the case where just two hypotheses
% models
 are being compared, we evaluate the
 `log odds',
\beq
	\log \frac{ P( D \given  \H_1 ) }{ P( D \given  \H_2 ) } ,
\eeq
 which has also been called the `weight of evidence in favour
 of $\H_1$'.
 The log evidence for a hypothesis, $\log P( D \given  \H_i )$, is
 the negative of the information content of the data $D$:
 if the data have large information content, given a hypothesis, then they
 are surprising to that hypothesis;
 if some other hypothesis is not so surprised
 by the data, then that hypothesis becomes  more probable.
 `Information content', `\ind{surprise value}', and
 log likelihood or log evidence are the same thing.

 All these quantities are logarithms of probabilities,
 or weighted sums of logarithms  of probabilities, so they
 can all be measured in the same units. The units  depend
 on the choice of the base of the logarithm.

% This chapter is a brief aside to mention
 The names that have been given to  these units
 are shown in \tabref{tab.units}.\index{bit (unit)}\index{nat (unit)}\index{ban (unit)}\index{deciban (unit)}\index{units}
\begin{table}[htbp]
\figuremargin{
\begin{center}
\begin{tabular}{cc} \toprule
 Unit & Expression that has those units \\ \midrule
 bit  & $\log_2 p$ \\
 nat  & $\log_e p$ \\
 ban  & $\log_{10} p$ \\
 deciban (db) & ${10}\log_{10} p$ \\ \bottomrule
\end{tabular}
\end{center}
}{\caption[a]{Units of measurement of information content.}
\label{tab.units}}
\end{table}
% Jaynes p.91 calls the db the decibel.

 The {\em bit\/} is the unit that we use most in this book. Because
 the word `bit' has other meanings, a backup name for this unit
 is the {\em shannon}.\index{shannon (unit)}
 A {\em byte\/} is 8 bits. A megabyte is $2^{20} \simeq 10^6$ bytes.
 If one works in natural logarithms,
% (which is conventional in Bayesian 
 information contents and weights of evidence
 are measured in {\em nats}.
 The most interesting units are the {\em ban\/} and the {\em deciban}.
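
 Because changing the base of a logarithm only rescales it,
 these units differ from each other by constant factors:
\beqan
	1 \mbox{ nat} &=& \log_2 e \mbox{ bits} \: \simeq \: 1.44 \mbox{ bits} \\
	1 \mbox{ ban} &=& \log_2 10 \mbox{ bits} \: \simeq \: 3.32 \mbox{ bits} \\
	1 \mbox{ deciban} &=& \smallfrac{1}{10} \mbox{ ban} \: \simeq \: 0.33 \mbox{ bits} .
\eeqan
 So a factor of two in probability is one bit, or about 3 decibans
 ($10 \log_{10} 2 \simeq 3.01$).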

\subsection{The history of the ban}
 Let me tell you why
 a factor of ten in probability is called a ban.
% , after the
% English
% town of Banbury.
 When Alan {Turing} and the other\index{Turing, Alan} 
% British
 \ind{codebreakers} at \ind{Bletchley Park} were breaking each new
 day's
% German 
 \ind{Enigma} code, their task was a huge inference problem: to infer,
 given the day's cyphertext, which  three  wheels were in
 the Enigma machines that day; what their starting positions were;
 what further letter substitutions were in use on the steckerboard;
 and, not least, what the original German messages were.
 These inferences were conducted using Bayesian methods (of course!),
 and the chosen units were decibans or half-decibans, the deciban
 being judged the smallest weight of evidence discernible to
 a human. The evidence in favour of particular hypotheses
 was tallied using  sheets of paper that
 were specially printed in {Banbury}, a town
 about 30 miles  from {Bletchley}. The inference task was
 known as \ind{Banburismus}, and the units in which
% the game
  Banburismus
 was played were called  {ban}s, after that town. 

\section{A taste of Banburismus}
 The details of the code-breaking methods of Bletchley
 Park were kept secret for a long time, but some aspects
 of Banburismus can be pieced together.  I hope the following
 description of a small part of Banburismus is not too inaccurate.\footnote{I've
 been most helped by descriptions given by Tony Sale
 ({\tt http://{\breakhere}www.{\breakhere}codesandciphers.{\breakhere}org.uk/{\breakhere}lectures/})
% http://www.codesandciphers.org.uk/lectures/
% was http://www.cranfield.ac.uk/ccc/bpark/lectures/})
 and by Jack Good (1979),\nocite{GoodEnigma} who worked with
 Turing at Bletchley.
}

 How much information was needed? The number of possible
 settings of the Enigma machine was about $8 \times 10^{12}$.
% see cryptonotes
 To deduce the state of the machine, `it was
 therefore necessary to find about 129 decibans from somewhere',
 as Good\index{Good, Jack} puts it. \ind{Banburismus} was aimed not at  deducing the
 entire state of the machine, but only at figuring out which
 wheels were in use; the logic-based \ind{bombes}, fed with guesses
 of the \ind{plaintext} (\ind{crib}s), were then
  used to crack what the settings of the wheels were.
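
 Good's figure can be checked directly from the number of
 settings quoted above:
\beq
	10 \log_{10} ( 8 \times 10^{12} )
	= 10 \, ( \log_{10} 8 + 12 )
	\simeq 10 \, ( 0.90 + 12 )
	\simeq 129 \mbox{ decibans} .
\eeq
 In the units used in the rest of this book, the same quantity is
 $\log_2 ( 8 \times 10^{12} ) \simeq 43$ bits.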
%  the remaining uncertainty.

 The \ind{Enigma} machine, once its wheels and plugs were put in place,
 implemented a continually-changing permutation
 cypher that wandered deterministically through a
 state space
% , starting from
 of $26^3$ permutations.
 Because an enormous number of messages were sent each day,
 there was a good chance that whatever state one machine
 was in when sending one character
 of a message, there would be another machine
 {\em in the same state\/} while sending a particular character
 in another message.
 Because the evolution of the machine's state was deterministic,
 the two machines would remain in the same state as
 each other for the rest of the transmission.
 The resulting correlations between the outputs of
 such pairs of machines
 provided a dribble of information-content
 from which Turing and his co-workers
 extracted their daily 129 decibans.

\subsection{How to  detect
 that two messages came from machines with a common state
 sequence}
 The hypotheses are the null hypothesis, $\H_0$, which
 states that the machines are in {\em different\/} states, and
 that 
 the two plain messages are  unrelated; and the
 `match' hypothesis, $\H_1$, which
 says that the machines are in the {\em same\/} state, and
 that  the two plain messages are unrelated.
 No attempt is being made here to infer what the state of
 either machine is.
 The data provided are the two cyphertexts $\bx$ and $\by$;
 let's assume they
 both have length $T$ and that the alphabet size is $A$ (26 in Enigma).
 What is the probability of the data, given the two hypotheses?

 First, the null hypothesis.
 This hypothesis asserts that the two cyphertexts are
 given by 
\beq
\bx = x_1x_2x_3\ldots = c_1(u_1)c_2(u_2)c_3(u_3)\ldots
\eeq
 and
\beq
 \by = y_1y_2y_3\ldots = c'_1(v_1)c'_2(v_2)c'_3(v_3)\ldots,
\eeq
 where the codes $c_t$ and $c'_t$ are two unrelated time-varying
 permutations of the alphabet, and
 $u_1u_2u_3\ldots$ and
 $v_1v_2v_3\ldots$ are the plaintext messages.
 An exact computation of the probability of the data ($\bx,\by$)
 would depend on a language model of the plain text,
 and a model of the Enigma machine's guts, but if we
 assume that each Enigma machine is an {\em ideal\/} random time-varying
 permutation, then the probability distribution of the
 two cyphertexts is uniform. All cyphertexts are
 equally likely.
\beq
	P(\bx , \by \given  \H_0 )  = \left( \frac{1}{A} \right)^{\! 2 T}
\:\:\mbox{for all $\bx,\by$ of length $T$}.
\eeq
 What about $\H_1$?
 This hypothesis asserts that a {\em single\/}
 time-varying permutation $c_t$ underlies both
\beq
 \bx = x_1x_2x_3\ldots = c_1(u_1)c_2(u_2)c_3(u_3)\ldots
\eeq
 and
\beq
 \by = y_1y_2y_3\ldots = c_1(v_1)c_2(v_2)c_3(v_3)\ldots \: .
\eeq
% are  generated  from two plaintext messages  $u_1u_2u_3\ldots$ and
% $v_1v_2v_3\ldots$
 What is the probability of  the data ($\bx,\by$)?
 We have to make some assumptions about
 the plaintext language.
% [`Horrors! How can we possibly
% make assumptions?' the idiot non-Bayesians ask.]
 If it were the case that the plaintext language was
 completely random, then the probability of
  $u_1u_2u_3\ldots$ and
 $v_1v_2v_3\ldots$ would be uniform, and so would that
 of  $\bx$ and $\by$, so the probability $P(\bx,\by\given \H_1)$
 would be equal to  $P(\bx,\by\given \H_0)$, and the two hypotheses
 $\H_0$ and $\H_1$ would be
 indistinguishable.

 We make progress by assuming that the plaintext is not
 completely random. Both plaintexts are written in a
 language, and that language has redundancies.
 Assume for example that particular plaintext letters
 are used more often than others. So, even though the two
 plaintext messages are unrelated, they are slightly more
 likely to use the same letters as each other;  if $\H_1$ is
 true, two synchronized letters from the two cyphertexts
 are slightly more likely to
 be identical. Similarly, if a language uses particular
 bigrams and trigrams frequently, then the two plaintext messages
 will occasionally contain the same bigrams and trigrams
 at the same time as each other, giving rise, if $\H_1$ is
 true, to a little  burst of 2 or 3 identical letters.
 \Tabref{fig.coincidenceexample} shows such a \ind{coincidence} in
 two plaintext messages that are unrelated, except that
 they are both written in English.
\begin{table}
\figuredangle{
\small
\begin{center}
\hspace*{0.6in}
%\parbox{4in}{
%\begin{tabular}{cl}
%$\bu$ &  \verb+THEXCODEXBREAKERSXWEREXLOOKINGXFORXINSTANCESXWHEREX+ \\
%$\bv$ &  \verb+TRIGRAMSXFORXTWOXORXMOREXMESSAGESXDIFFEREDXONLYXINX+ \\
%      &  \verb+*.......*..........................*..............*+ \\
%\end{tabular}\\
\begin{tabular}{rl}\toprule
$\bu$         &  {\tt{LITTLE-JACK-HORNER-SAT-IN-THE-CORNER-EATING-A-CHRISTMAS-PIE--HE-PUT-IN-H}} \\
$\bv$         &  {\tt{RIDE-A-COCK-HORSE-TO-BANBURY-CROSS-TO-SEE-A-FINE-LADY-UPON-A-WHITE-HORSE}} \\
{\sf matches:}&  {\tt{.*....*..******.*..............*...........*................*...........}} \\
\bottomrule
\end{tabular}
%}
\end{center}
}{\caption[a]{%
 Two aligned pieces of English plaintext, $\bu$ and $\bv$, with
 matches marked by {\tt{*}}.
% Notice that there are four matches,
% whereas the expected number of matches in two completely
% random strings of length $T=51$ would be about 2.
 Notice that there are twelve matches, including a run of six,
 whereas the expected number of matches in two completely
 random strings of length $T=72$ would be about 3.
 The two corresponding cyphertexts from two machines in
 identical states would also have twelve matches.
 }
\label{fig.coincidenceexample}}
\end{table}

 The codebreakers hunted among pairs of messages for
 pairs that were suspiciously similar to each other,
 counting up the numbers of matching monograms, bigrams, trigrams, etc.
 This method was first used by the Polish codebreaker Rejewski.

 Let's look at the simple case of a monogram language model and
 estimate how long a message is needed to be able to decide whether
 two machines are in the same state. 
%many messages would be needed, and of
% what length, to  have a good chance of cracking the Enigma.
 I'll assume the source language is monogram-English,
 the language in which successive letters are drawn
 i.i.d.\ from the probability distribution $\{ p_i \}$ of
 \figref{fig.monogram}.
 The probability of $\bx$ and $\by$ is nonuniform:
 consider two single characters, $x_t=c_t(u_t)$ and $y_t=c_t(v_t)$;
 the probability that they are identical is
\beq
	\sum_{u_t,v_t} P(u_t) P(v_t) \, \truth[ u_t\eq v_t ]
 \: = \:  \sum_i p_i^2
 \: \equiv \:
 m.
\eeq
 We give this quantity the name $m$, for `match probability';
 for both English and German, $m$ is about $2/26$ rather than $1/26$ (the value
 that would hold for a completely random language).
 Assuming that $c_t$ is an ideal random permutation,
 the probability of $x_t$ and $y_t$ is, by symmetry,
\beq
	P(x_t,y_t\given  \H_1) \: =  \: \left\{ \begin{array}{ccl}
	\smallfrac{m}{A} & & \mbox{if $ x_t = y_t $} \\
	\smallfrac{(1-m)}{A(A-1)} & & \mbox{for $ x_t \not = y_t $.} 
	\end{array} \right.
\eeq
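 As a check, this distribution is normalized: there are $A$ pairs
 with $x_t = y_t$ and $A(A-1)$ pairs with $x_t \not= y_t$, so
\beq
	A \cdot \frac{m}{A} + A(A-1) \cdot \frac{(1-m)}{A(A-1)}
	\: = \: m + (1-m) \: = \: 1 ,
\eeq
 and the total probability of a match is $m$, as it should be.
 Setting $m = 1/A$ gives $1/A^2$ for every pair, which is the uniform
 distribution of the null hypothesis.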
 Given a pair of cyphertexts $\bx$ and $\by$ of length $T$
 that match in $M$ places and do not match in $N$ places, 
 the log evidence in favour of $\H_1$ is
 then
\beqan
	\log \frac{P(\bx,\by\given \H_1)}{P(\bx,\by\given \H_0)}
	&=& M \log \frac{ m/A }{ 1/A^2 }
	+  N \log \frac{ \smallfrac{(1-m)}{A(A-1)} }{ 1/A^2 }
\\
	&=& M \log  m A 
	+  N \log \frac{ (1-m) A}{A-1} .
\label{eq.weight.of.evidence}
\eeqan
 Every match contributes  $\log  m A$ in favour
 of $\H_1$;
 every non-match contributes $\log \frac{A-1}{ (1-m) A}$
 in favour of $\H_0$.
%tex/crypto/psquared.p
% double checked.........
%gnuplot> pr 10 * log(0.075884 * 27)/log(10.0)
%3.11513979524554
%gnuplot> pr 10 * log((1.0 - 0.075884) * 27/26.0 )/log(10.0)
%-0.178830941957283
\medskip
\begin{center}
\begin{tabular}{lcr@{}l} \toprule
 Match probability for monogram-English & $m$ & & 0.076  \\
 Coincidental match probability  & $1/A$ &&  0.037 \\
 log-evidence for $\H_1$ per match &
		${10}\log_{10}  m A$ & & 3.1\,db \\ 
 log-evidence for $\H_1$ per non-match &
                ${10}\log_{10} \frac{ (1- m) A}{(A-1)}$ & $-$&$0.18$\,db \\
\bottomrule
\end{tabular}
\medskip
\end{center}
 If there were $M=4$ matches and $N=47$ non-matches
 in a pair of length $T=51$, for example,
 the weight of evidence in favour of
 $\H_1$ would be +4 decibans, or a likelihood ratio of 2.5 to 1
 in favour.
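
 The arithmetic, using the rounded per-match and per-non-match values
 from the table above, is
\beq
	4 \times 3.1 \,\mbox{db} \: - \: 47 \times 0.18 \,\mbox{db}
	\: \simeq \: 12.4 - 8.5
	\: \simeq \: 4 \,\mbox{db} ,
\eeq
 and a weight of evidence of 4 decibans corresponds to a
 likelihood ratio of $10^{4/10} \simeq 2.5$.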
% odds === (1-p)/p 
%
% If there were $M=3$ matches and $N=17$ non-matches
% in a pair of length $T=20$, for example,
% the evidence in favour of
% $\H_1$ would be +12.4 decibans, or odds of 17 to 1
% in favour.
%%%%%%%%%%%%%%%%%
%%%%%%% [Check if this is the right use of odds.]
%%%%%%%%%%%%%%%%%

 The {\em expected\/} weight of evidence
 from a line of text of length $T=20$ characters
 is the expectation of (\ref{eq.weight.of.evidence}),
 which depends on whether $\H_1$ or $\H_0$ is true.
 If $\H_1$ is true then matches are expected to turn up
 at rate $m$, and
 the expected weight of evidence is
 1.4\,decibans per 20 characters.
 If $\H_0$ is true then spurious matches are expected to turn up
 at rate $1/A$, and
 the expected weight of evidence is
 $-1.1$~decibans per 20 characters.
% $-1.1$\,decibans per 20 characters.
 Typically, roughly 400 characters need to be inspected in order
 to have a weight of evidence greater than a hundred to one (20 decibans) in
 favour of one hypothesis or the other.
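
 These expectations can be reproduced from
 (\ref{eq.weight.of.evidence}). The sketch below takes $A=27$
 (which is what the table's value $1/A = 0.037$ corresponds to) and
 $m \simeq 0.0759$, and uses slightly less rounded weights than the
 table: $3.115$\,db per match and $-0.179$\,db per non-match.
 Per character,
\beqan
	\mbox{if $\H_1$ is true:} & &
	m \, ( 3.115 ) - ( 1 - m ) \, ( 0.179 )
	\: \simeq \: 0.236 - 0.165 \: \simeq \: +0.071 \,\mbox{db} \\
	\mbox{if $\H_0$ is true:} & &
	\smallfrac{1}{27} \, ( 3.115 ) - \smallfrac{26}{27} \, ( 0.179 )
	\: \simeq \: 0.115 - 0.172 \: \simeq \: -0.057 \,\mbox{db} .
\eeqan
 Multiplying by 20 gives $+1.4$ and $-1.1$ decibans per 20 characters.
 Dividing 20 decibans by these mean rates gives about 280 characters
 under $\H_1$ and about 350 under $\H_0$, the same order as the rough
 figure of 400 quoted above; this simple estimate ignores fluctuations
 about the mean.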

 So, two English plaintexts have more matches
 than two random strings. Furthermore, because consecutive characters
 in English are not independent, the  bigram and trigram 
 statistics of English are nonuniform and the
 matches tend to occur in bursts of consecutive matches.
 [The same observations also apply to German.]
% , the plaintext language  used in the Enigma messages.]
 Using better language models, the evidence contributed by
 runs of matches was more accurately computed. Such a scoring
 system was worked out by Turing and refined by Good.
 Positive results were passed on to automated and human-powered codebreakers.
 According to
 Good, the longest false-positive that arose in this
 work was a string of 8 consecutive matches between two machines  that were
 actually in unrelated states. 

% The same codebreaking
%% cracking
% system was implemented on the Colossus
% computer in the work known as Fish. The computer
% accumulated weights of evidence and searched
% for the most probable hypothesis.

% xword.tex has its own further reading
\section*{Further reading}
 For further reading about Turing and Bletchley Park,
 see \citeasnoun{hodges83} and \citeasnoun{GoodEnigma}.
 For an in-depth read about cryptography,
 \quotecite{Schneier96} book is highly recommended.
 It is readable, clear, and entertaining.


% see also xword.tex which includes exword.tex
\section{Exercises}
\exercisxB{2}{ex.enigmaleak}{
 Another weakness in the design of the \ind{Enigma} machine, which
 was intended to emulate a perfectly random time-varying
 \ind{permutation}, is that it never  mapped a letter to
 itself. When you press {\tt{Q}}, what comes out is
 always a different letter from {\tt{Q}}.
 How much information per character is leaked by this
 design flaw?
 How long a \ind{crib} would be needed to be confident
 that the crib is correctly aligned with the cyphertext?
 And how long a crib would be needed to be able
 confidently to identify the correct key?

 [A {\dem{crib}\/} is a  guess for what the plaintext was.
 Imagine that the Brits know that a very important German
 is travelling from Berlin to Aachen, and they intercept
 Enigma-encoded messages sent to Aachen. It is a  good bet
 that one or more of the original plaintext messages contains the
 string {\tt OBERSTURMBANNFUEHRERXGRAFXHEINRICHXVONXWEIZSAECKER},
 the name of the important chap.
 A crib could be used in a brute-force approach
 to find the correct Enigma key (feed the received messages
 through all possible Enigma machines and see if any of the
 putative decoded texts match the above plaintext).
 This question centres on the idea that the crib can also be 
 used in a much less expensive manner: slide the plaintext crib
 along all the encoded messages until a perfect {\em mismatch\/}
 of the crib and the encoded message is found; if correct,
 this alignment then tells you a lot about the key.]
}



%{Why have sex? Information acquisition and evolution}
\chapter{Why have Sex? Information Acquisition and Evolution}
\label{ch.sex}
%\title{Rate of Information Acquisition\\ by a Species subjected to Natural Selection}
% \date{\today\ -- Draft 5.5} from _doc/gene/gene.tex
\newcommand{\explanfig}[1]{\raisebox{-0.5cm}{\psfig{figure=psm/e.#1.ps,width=2in,height=0.5in,angle=-90}}}
\newcommand{\fitfig}[1]{\mbox{\psfig{figure=psm2/#1.ps,width=2.5in,angle=-90}}}
\newcommand{\fitfigx}[1]{\mbox{\psfig{figure=psx/#1.ps,width=2.5in,angle=-90}}}

% \exercisxC{5}{ex.evolutionteach}{
% {\bf What is the difference (in bits) between an ape and a human?}
%5?????????????????????????????????/
 Evolution has been\index{evolution}\index{natural selection}
 happening on earth for  about the last $10^{9}$ years. 
% DNA-binding proteins are just one of the families of sophisticated 
% molecules which the Blind Watchmaker of evolution has created.
 Undeniably, {\em information has been acquired\/} during this process.
 Thanks to the tireless work
 of the \ind{Blind Watchmaker},
 some cells now carry within them all the information required
 to be outstanding spiders; other cells carry all the information
 required to make excellent octopuses.  Where did this information
 come from?

 The entire blueprint of all organisms on the planet has emerged 
 in a teaching process in which the teacher is
 natural selection:
% , \ie, the process whereby 
 fitter individuals have more progeny, the \ind{fitness} being defined by the 
 local environment (including the other organisms).
 The teaching signal is only a few bits per
 individual: an individual simply has a smaller 
 or larger number of grandchildren, depending on the
 individual's fitness.
 `Fitness' is a broad term that could cover
\bit
\item
 the ability of an antelope to run faster than other antelopes
 and hence avoid being eaten by a lion;
\item
 the ability of a lion to be well-enough camouflaged and  run
 fast enough to catch one antelope per day;
\item
 the ability of a peacock to attract a peahen to mate with it;
\item
 the ability of a peahen to rear many young simultaneously.
\eit
 The fitness of an organism is largely determined
 by its  DNA -- both the coding regions, or genes,
 and the non-coding regions (which play an important
 role in regulating the transcription of genes).
 We'll think of fitness as a  function of the DNA
 sequence and the environment.

% For simplicity, let's focus on a gene and think a bit more
% about the information acquisition process.
 How does the DNA determine fitness, and how
 does information get from natural selection into the genome? Well,
 if  the gene that codes for one of an antelope's  proteins is 
 defective, that antelope might get eaten by a lion
 early in life and have only two grandchildren rather than forty.
 The information content of natural selection is fully
 contained in a  specification of which offspring survived to
 have children -- an information content of {\em
 at most one bit per offspring}.
 The teaching signal does not communicate to the ecosystem any description 
 of the imperfections in the organism that caused it to have
 fewer children.
% And  these
 The bits of the teaching signal are highly
 redundant, because,  throughout a species,
 unfit individuals who are similar to each other
 will be failing to have offspring for similar reasons.


 So, how many bits per generation are acquired by the \ind{species}\index{human}\index{ape}
 as a whole by \ind{natural selection}? 
% What is the difference
 How many bits has  natural selection succeeded in  conveying to the human 
 branch of the tree  of life, since the divergence between Australopithecines
% and apes 4,000,000 years ago.
% 277, Maynard Smith 
%
%
%  Australopithecines
 and apes   $4\,000\,000$ years ago?
 Assuming a generation time of 10 years for reproduction, 
 there have been about $400\,000$ generations of human precursors
 since the divergence from apes. Assuming a population of 
 $10^{9}$ individuals, each receiving a couple of bits of 
 information from natural selection, the total number of bits 
 of information responsible for modifying the genomes of 4 million 
 B.C.\ into today's human genome is about 
 $8\times 10^{14}$ bits.  However, as we noted, natural selection is not
 smart at collating the information that it dishes out to the
 population, and there is a great deal of redundancy in that
 information. If the population size were twice as great, would it evolve
 twice as fast? No, because natural selection will simply be
 correcting the same defects twice as often.
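
 [To make the arithmetic of this upper bound explicit, using only
 the figures assumed above:
\[
 \frac{4 \times 10^{6} \:\mbox{years}}{10 \:\mbox{years per generation}}
 = 4 \times 10^{5} \:\mbox{generations},
\]
\[
 4 \times 10^{5} \:\mbox{generations} \times 10^{9} \:\mbox{individuals}
 \times 2 \:\mbox{bits} = 8 \times 10^{14} \:\mbox{bits}.
\]
 This count treats every bit of the teaching signal as if it were new
 information, so it is a generous upper bound rather than an estimate.]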

 John Maynard Smith has suggested that the rate of information
 acquisition by a species is independent of the population size,
 and is of order 1 bit per generation.
 This figure would  allow for only $400\,000$ bits of difference
 between apes and humans, a number that is much smaller than
 the total size of  the human genome  -- $6 \times 10^9$  bits.
 [One human genome contains about $3\times 10^{9}$ nucleotides.]
 It is certainly the case that the genomic overlap between
 apes and humans is huge, but is the difference that small?
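 [For scale, using the two figures just quoted:
\[
 \frac{4 \times 10^{5} \:\mbox{bits}}{6 \times 10^{9} \:\mbox{bits}}
 \simeq 7 \times 10^{-5} ,
\]
 that is, about one bit in every $15\,000$ of the genome.]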
% (Don't forget that
% if two bit sequences of length $N$ have 90\% overlap, then it takes
% about $N/2$ bits to describe the differences between them;
% according to {\tt http://users.ox.ac.uk/$\sim$mckee/chimp.html},
% we share 98.4\% of our DNA with chimpanzees, which corresponds to
% a difference of 0.12$N$ bits, or $7 \times 10^{8}$ bits.
% This is considerably larger than the 400,000 bits of difference
% mentioned above. Of course, the difference between
% us and chimpanzees could involve neutral changes to the DNA,
% and if some of the differences are redundant, then we are
% further overcounting; but are we overcounting by a factor of 1000?)
% http://users.ox.ac.uk/~mckee/chimp.html
% %We share 98.4% of our DNA


 In this chapter, we'll develop a crude model
 of the process  of  information acquisition through evolution,
 based on the assumption that a gene with two defects
 is typically more defective than a gene with one defect,
 and an organism with two defective genes is likely to be
 less fit than an organism with one defective gene.
 Undeniably, this is a crude model, since
 real biological systems are baroque constructions with
 complex interactions.  Nevertheless, we persist with a simple
 model  because it readily  yields  striking results. 


% I have developed a simple model of natural selection
% \footnote{{\tt http://www.inference.phy.cam.ac.uk/mackay/abstracts/gene.html}}
 What we find from this simple model is that
\ben
\item
% whereas
 John Maynard Smith's figure of 1 bit per generation
 is correct for an {\em asexually-reproducing\/} population;
\item in contrast,
 {\em if the species reproduces
 sexually}, the rate of information
 acquisition
% , though independent of the population size,
 can be as large as
  $\sqrt{G}$ bits per generation, where $G$
 is the size of the genome.
\een
% Setting $G \simeq 10^4$--$10^{8}$, we would then have had time to acquire 
% about $4 \times 10^7$ or $4 \times 10^9$ bits of information from
% evolution.

 We'll also find interesting results concerning 
 the maximum mutation rate that a species can withstand.

\section{The model}
% At what rate, in bits per generation, can the blind watchmaker
% cram information into a species by natural selection?
% And what is the maximum mutation rate that a species can withstand?
 We study a simple
 model of a reproducing population of $N$ individuals with a genome of size
 $G$ bits:
% fitness is a strictly
% additive trait subjected to directional selection;
 variation is produced by mutation or by recombination (\ie, sex)
 and truncation selection
 selects the  $N$ fittest children at each generation
 to be the parents of the next. 
 We  find striking differences between populations that
 have recombination and populations that do not.
% If variation is produced by mutation alone, then the entire population gains
% up to roughly 
% 1 bit per generation. If variation is created by
% recombination, the population can gain
% $O(\sqrt{G})$ bits per generation.
% Furthermore,  recombination raises
% the maximum mutation rate that can be tolerated
% by a factor of order  $\sqrt{G}$.
%% the square root of the size of the genome.
% This model explains the prevalence of sex in  evolution
% and shows why  sex persists in 
% species with large genomes, even when they
% have reached evolutionary stasis.



%\subsection{Fitness}
 The genotype of each individual is a vector $\bx$ of
 $G$ bits, each having a good state $x_g \eq 1$ and a bad
 state $x_g \eq 0$.
 The fitness $F(\bx)$ of
 an individual is simply the sum of her bits:
\beq
 F(\bx) = \sum_{g=1}^G x_g . 
\eeq
 The bits in the genome could  be considered to
 correspond either to genes that have good alleles ($x_g \eq 1$)
 and bad alleles ($x_g \eq 0$), or to the nucleotides of
 a genome.
% , with two bits per nucleotide.
 We will concentrate on the
 latter interpretation.
 The essential property of fitness that we are assuming is
 that it is  locally a roughly linear function of the genome, that is, 
 that there are many possible changes one could make to the
 genome, each of which has a small effect on fitness, and
 that these effects combine approximately linearly. 

 We define the normalized
 fitness $f(\bx) \equiv F(\bx)/G$. 

 We consider   evolution by natural selection under
 two models of variation.

\begin{description}
\item[Variation by mutation\puncspace]% was colon
% \subsection{Variation by mutation}
 The model assumes discrete
 generations.
 At each generation, $t$, every individual produces two children.
% and then dies.
% progenies'
  The children's
 genotypes differ from the parent's
% genotype
 by random
 mutations. Natural selection selects the fittest $N$ progeny in the
 child population to reproduce, and a new generation starts.


 [The selection of the fittest $N$ individuals at each generation
 is known as truncation selection.]

 The simplest model of mutations is that the child's bits  $\{ x_g \}$
 are independent. Each bit has a small probability of being flipped, which,
 thinking of the bits as corresponding roughly to nucleotides, is 
 taken to be a constant $m$, independent of $x_g$.
 [If alternatively we thought
 of the bits as corresponding to genes, then we would
 model the probability of the discovery of a good gene,
% by mutation,
 $P(x_g \eq 0 \rightarrow x_g \eq 1)$, as being
 a smaller number
% $m_{\uparrow}$
 than the probability of a deleterious mutation
 in a good gene,
 $P(x_g \eq 1 \rightarrow x_g \eq 0)$.]
% ,
% which we denote by
% $m_{\downarrow}$.]

\item[Variation by recombination (or crossover, or sex)\puncspace]
% \subsection{Sex}
 Our organisms are haploid, not diploid. They enjoy sex by recombination.
% crossover.
 The $N$ individuals in the population are married into $M \eq N/2$ couples,
 at random,
 and each couple has $C$ children -- with $C\eq 4$  children being our
 standard assumption, so as to have the population double and halve
 every generation, as before.
 The $C$
% siblings'
 children's
 genotypes are independent given the parents'.
 Each child obtains its genotype $\bz$ by random crossover of its parents'
 genotypes, $\bx$ and $\by$. The simplest model of recombination
% crossover,
% which we use here,
 has no linkage, so that:
\beq
 z_g \:=\: \left\{ \begin{array}{cl}
 x_g & \mbox{with probability $1/2$} \\
 y_g & \mbox{with probability $1/2$.} \end{array} \right. 
\eeq
% It would be easy to introduce linkage if we wanted to.

 Once the $MC$ progeny have been born, the parents pass away, the fittest
 $N$ progeny are selected by natural selection, and a new generation starts.
\end{description}
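
 As a small check on the recombination model (a sketch that uses only
 the crossover rule above; the symbol $d$ and the notation $\mbox{E}[\cdot]$
 for an expectation are introduced here for illustration),
 consider a child $\bz$ of parents $\bx$ and $\by$.
 Each bit $z_g$ has expectation $(x_g + y_g)/2$, so
\[
 \mbox{E}[ F(\bz) ] = \sum_{g=1}^G \frac{x_g + y_g}{2}
 = \frac{F(\bx) + F(\by)}{2} .
\]
 Only the $d$ positions at which the parents differ are random, and each
 contributes variance $1/4$, so the variance of $F(\bz)$ is $d/4$.
 Crossover thus creates variation among the children without
 lowering their expected fitness below the parents' average.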

 We now study these two models of variation in detail.



%\section{Rate of information acquisition}
\section{Rate of increase of fitness}
\subsection{Theory of mutations}
 We assume
 that the genotype of an individual with normalized fitness $f \eq F/G$ is
 subjected to mutations that flip bits with probability $m$.
 We first show that if the average normalized
 fitness $f$ of the population is greater than $1/2$, then
 the optimal mutation rate is small, and the rate of
 acquisition of information is at most of order one bit per
 generation.

 Since it is easy to achieve a  normalized fitness of $f \eq 1/2$ by
 simple mutation, we'll assume $f > 1/2$ and work in terms of
 the excess normalized fitness $\deltaf \equiv f - 1/2$.
 If an individual with excess normalized
 fitness $\deltaf$ has a child and the  mutation rate $m$ is  small,
 the probability distribution
 of the excess normalized fitness of the child has 
 mean
\beq
%\mbox{mean}(t\!+\!1) =
	 \overline{\deltaf}_{\rm child} = (1-2 m) \deltaf 
\eeq
 and variance
% standard deviation \sqrt
\beq
	{ \frac{m(1-m)}{G} } \simeq { \frac{m}{G} } .
\eeq
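 [These two moments follow from a short calculation, sketched here under
 the model's assumption that bits flip independently. A good bit stays
 good with probability $1-m$ and a bad bit becomes good with
 probability $m$, so the expected normalized fitness of the child is
\[
 f (1-m) + (1-f) m = f - 2 m \left( f - \dhalf \right) ,
\]
 that is, $\dhalf + (1-2m)\deltaf$.
 Each of the $G$ bits contributes variance $m(1-m)$ to $F$, and
 dividing by $G^2$ gives the variance $m(1-m)/G$ of the normalized fitness.]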
% where the approximation is based on the assumption that the mutation
% rate $m$ will be small.
% If $G$ is large, this binomial distribution is well approximated
% by a Gaussian, and w
 If the population of parents has mean $\deltaf(t)$
 and variance $\sigma^2(t) \equiv \beta \linefrac{m}{G}$, then
 the child population, before selection, will
 have mean $(1-2 m) \deltaf(t)$ and variance $(1+\beta) \linefrac{m}{G}$.
 Natural selection chooses the upper half of this distribution,
% e Gaussian,
 so the mean  fitness and variance of fitness
 at the next generation are given by
\beq
	\deltaf(t\!+\!1) = (1-2 m) \deltaf(t) 
	+ \alpha \sqrt{(1+\beta)}  \sqrt{\frac{m}{G} }  ,
%	+ \sqrt{\frac{2}{\pi}} \sqrt{ \frac{m}{G} }  .
\label{eq.rate1}
\eeq
\beq
        \sigma^2(t\!+\!1) = \gamma (1+\beta) \frac{m}{G} ,
\eeq
 where $\alpha$ is the  mean  deviation from the mean,
 measured in
 standard deviations,
 and  $\gamma$ is the factor by which the child distribution's
 variance is reduced by selection.
 The
 numbers $\alpha$ and $\gamma$ are of order 1.
% , and  satisfies $\alpha \leq 1$.
 For the
 case of a Gaussian distribution, $\alpha = \sqrt{\linefrac{2}{\pi}} \simeq
 0.8$
 and $\gamma = (1-2/\pi) \simeq 0.36$.
 If we assume that the variance is in
 dynamic equilibrium, \ie, $\sigma^2(t\!+\!1) \simeq \sigma^2(t)$,
 then
\beq
        \gamma (1+\beta) = \beta, \mbox{ so } (1+\beta) = \frac{1}{1-\gamma}, 
\eeq
 and the factor $\alpha \sqrt{(1+\beta)}$ in \eqref{eq.rate1}
 is equal to 1, if we take the results for the Gaussian distribution,
 an approximation that becomes poorest when the discreteness of
 fitness becomes important, \ie, for small $m$.
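 [Spelling out this step with the Gaussian values quoted above, as a check:
\[
 1+\beta = \frac{1}{1-\gamma} = \frac{1}{2/\pi} = \frac{\pi}{2} ,
 \:\:\:\mbox{so}\:\:\:
 \alpha \sqrt{(1+\beta)} = \sqrt{\frac{2}{\pi}} \sqrt{\frac{\pi}{2}} = 1 .
\]
 The equilibrium variance is then $\beta \linefrac{m}{G}$
 with $\beta = \pi/2 - 1 \simeq 0.57$.]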
% \footnote{We get the same result for any symmetric distribution. If
% the distribution is not symmetrical, then we are approximating.}
 The rate of increase of normalized fitness is thus:
\beq
	\frac{\d f}{\d t} \simeq  -2 m \, \deltaf + \sqrt{\frac{m}{G}},
\label{eq.rate2}
\eeq
 which,
 assuming $G (\deltaf)^2 \gg 1$, 
 is maximized
% with respect to the mutation rate by   setting $m$ to
 for
\beq
	m_{\rm opt} = \frac{1}{16 G (\deltaf)^2} ,
\label{eq.mopt}
\eeq
% critical df is 0.2, if use the Gaussian approx.
%
% if keep m(1-m) around, and assume G(\deltaf)^2 >> 1, get
%
% something like 1/( 2 + 16 G df^2 )
%
 at which point,
\beq
	\left(\frac{\d f}{\d t}\right)_{\! \rm opt} = \frac{1}{8 G  (\deltaf)}.
\eeq
 So the rate of increase of fitness $F \eq fG$ is at most
\beq
	\frac{\d F}{\d t} = \frac{1}{8 (\deltaf)} \:\:\mbox{per generation}.
\eeq
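 [The optimum can be checked directly; this is a sketch of the calculus,
 treating $\deltaf$ as fixed while $m$ is varied.
 Differentiating the right-hand side of \eqref{eq.rate2} with respect to $m$
 and setting the result to zero,
\[
 -2 \, \deltaf + \frac{1}{2 \sqrt{m G}} = 0
 \:\:\:\Rightarrow\:\:\:
 \sqrt{m G} = \frac{1}{4 \, \deltaf} ,
\]
 which gives \eqref{eq.mopt}. Substituting back,
\[
 -2 \, m_{\rm opt} \, \deltaf + \sqrt{\frac{m_{\rm opt}}{G}}
 = - \frac{1}{8 G \, \deltaf} + \frac{1}{4 G \, \deltaf}
 = \frac{1}{8 G \, \deltaf} ,
\]
 and multiplying by $G$ gives the rate of increase of $F$.]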
% critical df is 0.08, if use the Gaussian approx.
 For a population with low fitness ($\deltaf < 0.125$),
 the rate of increase of fitness may exceed 1 unit per generation. Indeed,
 if $\deltaf \lesssim 1/\sqrt{G}$, the rate of increase, if $m \eq \dhalf$,
 is of order $\sqrt{G}$;  this initial spurt can  last only one generation.
%  
% if the mutation rate is tuned to the fitness, 
 For $\deltaf > 0.125$, the rate of increase of fitness is
% acquisition of information is
 smaller than one per generation.
 As the fitness approaches $G$, the optimal mutation rate
 tends to $m \eq 1/(4 G)$, so that an average of $1/4$
 bits are flipped per genotype, and the rate of increase of
 fitness is also equal to $1/4$; 
 information is gained at a  rate of about $0.5$ bits per generation. 
 It takes about $2 G$ generations for the
 genotypes of all individuals in the population to
 attain perfection.

 For fixed $m$, the fitness is given by
\beq
	\deltaf(t) = \frac{1}{2 \sqrt{mG}} ( 1 - c \, e^{-2 mt} ) ,
\label{eq.mutation.soln}
\eeq
 subject to the constraint $\deltaf(t) \leq 1/2$, 
 where $c$ is a constant of integration, equal to 1 if $f(0)=1/2$.
 If the mean
 number of bits flipped per genotype, $mG$, exceeds 1, then
 the fitness $F$ approaches an equilibrium value
 $F_{\rm eqm} = (1/2 + 1/(2 \sqrt{mG})) G$.
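 [As a check by substitution: for fixed $m$, \eqref{eq.rate2}
 is a linear differential equation,
\[
 \frac{\d \deltaf}{\d t} = -2 m \left( \deltaf - \frac{1}{2\sqrt{mG}} \right) ,
\]
 so the gap between $\deltaf$ and the fixed point $1/(2\sqrt{mG})$ decays
 as $e^{-2mt}$, which is \eqref{eq.mutation.soln}.
 For example, with $mG \eq 4$ the fixed point is $\deltaf = 0.25$,
 \ie, $F_{\rm eqm} = 0.75\, G$.]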

% If $m$ is tuned to the optimal fitness-dependent value, 
% $m_{\rm opt}$ (\ref{eq.mopt}),
% then the fitness is given,  assuming $\deltaf(0) = 0$, by
%\beq
%	\deltaf(t) = \frac{ t^{1/2} }{ 2 \sqrt{G} },
%\eeq
% which hits $\deltaf = 1/2$ at $t=G$.

 This theory is somewhat inaccurate in that the true probability
 distribution of fitness is non-Gaussian, asymmetrical, and quantized to
 integer values. All the same, the predictions of the theory  are
 not grossly at variance with the results of simulations
 described below.
% in section \ref{sec.simulations}.

\begin{figure}
\figuredanglenudge{\footnotesize
\begin{center}
\begin{tabular}{p{2in}cc}
& No sex & Sex \\
Histogram of parents' fitness
 & \explanfig{iparent}
 & \explanfig{iparent}
\\
Histogram of children's fitness
 & \explanfig{mchild}
 & \explanfig{schild}
\\
Selected children's fitness
 & \explanfig{mnextparent}
 & \explanfig{snextparent}
\\
\end{tabular}
\end{center}
}{
\caption[a]{Why sex is better than sex-free reproduction.\index{parthenogenesis}
 If mutations are used to create variation among children,
 then it is unavoidable that the average fitness of the children
 is lower than the parents' fitness; the
 greater the variation, the greater the average deficit. Selection bumps
 up the mean fitness again. 
 In contrast,
%sex (recombination)
 recombination produces variation without
 a decrease in average fitness. The typical amount of variation
 scales as $\sqrt{G}$, where $G$ is the genome size, so after
 selection, the average fitness rises by $O(\sqrt{G})$.
}
\label{fig.nutshell}
}{-0.14in}
\end{figure}
\subsection{Theory of sex}
% {\em Shorten this bit.}
%
 The analysis of the sexual population becomes tractable
 with two approximations:
 first, we assume that the  {gene-pool} mixes sufficiently rapidly
 that correlations between genes can be neglected; second, we
 assume  {\em homogeneity}, \ie, that
 the fraction $f_g$ of bits $g$ that are in the good state
 is the same, $f(t)$,  for all  $g$.

\begin{boxfloat}
\margincaption{
\caption[a]{Details of the  {theory of sex}.}
\label{sec.sex.app}
}
\begin{framedalgorithm}
\footnotesize
% Theory of sex   appendix
 How does $f(t\!+\!1)$ depend on $f(t)$?  Let's first assume 
 the two parents of a child both have exactly $f(t) G$ good bits, and,
 by our homogeneity assumption, that those bits are independent
 random subsets of the $G$ bits.
% (We will include variation in the parental population in a moment.)
 The number of bits that
 are good in both parents is roughly $f(t)^2 G$, and the number
 that are good in one parent only is  roughly $2 f(t)(1-f(t)) G$,
 so the fitness of the child will be  $f(t)^2 G$ plus
% a number drawn from a binomial distribution
 the sum of $2 f(t)(1-f(t)) G$ fair coin flips, which
 has a binomial distribution of mean $f(t)(1-f(t)) G$ and
 variance $\frac{1}{2} f(t)(1-f(t)) G$. 
 The fitness of a child
 is thus roughly distributed as
\[%\beq
  F_{\rm{child}} \sim  \mbox{Normal}\left(\mbox{mean}\eq f(t) G,
	\mbox{variance}\eq \frac{1}{2} f(t)(1-f(t)) G \right) .
\]%\eeq
 The important property of this distribution, contrasted with
 the distribution under mutation, is that the  mean fitness is equal
 to the parents' fitness; the variation produced by sex does
 not reduce the average fitness.

 If we  include the parental population's variance, which
 we will write as $\sigma^2(t) = \beta (t) \frac{1}{2} f(t)(1-f(t)) G$,
 the children's fitnesses are
% .
% The average  of the  two parents will have variance $\sigma^2(t)/2$,
% so the population of all children will have fitness, before selection,
 distributed as
\[%\beq
  F_{\rm{child}} \sim  \mbox{Normal}\left(\mbox{mean}\eq f(t) G,
	\mbox{variance}\eq \left(1+\frac{\beta}{2}\right)
                 \frac{1}{2} f(t)(1-f(t)) G \right) .
\]%\eeq
 Natural selection selects the children on the upper side
 of this distribution. The mean  increase in
 fitness will be
% of order
\[%\beq
 \bar{F}(t\!+\!1) - \bar{F}(t)
 = [ \alpha (1+\beta/2)^{1/2}/\sqrt{2} ] \sqrt{f(t)(1-f(t)) G},
\label{eq.alpha.sex}
\]%\eeq
% [A factor of $\sqrt{2/\pi}$ appears from the mean absolute
% value of a standard normal variate.]
 and the variance of the surviving children will be
\[%\beq
 \sigma^2(t+1) = \gamma  (1+\beta/2) \frac{1}{2} f(t)(1-f(t)) G,
\]%\eeq
  where $\alpha = \sqrt{2/\pi}$ and
 $\gamma = (1-2/\pi)$.
 If there is dynamic equilibrium [$\sigma^2(t+1) = \sigma^2(t)$]
 then
%\[%\beq
%	 \gamma  (1+\beta/2)  = \beta , \mbox{ so } (1+\beta/2) = \frac{2}{2-\gamma} ,
%\]%\eeq
% and
 the factor in (\ref{eq.alpha.sex}) is 
\[%\beq
	\alpha (1+\beta/2)^{1/2}/\sqrt{2}
%  = {\alpha}\frac{1}{(2-\gamma)^{1/2}}
%	= \sqrt{ \frac{ 2/\pi }{ 1 + 2/\pi } }
 = \sqrt{\frac{2}{(\pi+2)}} \simeq   0.62.
\]%\eeq
% print sqrt((4/pi)/(1+2/pi))
% 0.882025543449103
% print sqrt((2/pi)/(1+2/pi))  
% 0.62368624295261
 Defining this constant to be $\eta \equiv \sqrt{{2/(\pi+2)}}$,
% formerly, eta was 1/sqrt(pi)
 we conclude that, under sex and natural selection,
 the mean fitness of the population
  increases at a rate
 {\em proportional to the square root of the size of the
 genome},
\[%\beq
	\frac{\d\bar{F}}{\d t}
 \simeq \eta \sqrt{f(t)(1-f(t)) G} \:\:\:\mbox{bits per generation}. 
\]%\eeq
% If, recklessly,  we take our homogeneity assumption to hold
% for all time, we can
% write  $\bar{F} = f G$ and obtain the differential equation:
%\[%\beq
%	\frac{\d f}{\d t} \simeq \frac{\eta}{\sqrt{G}} \sqrt{f(t)(1-f(t))} ,
%\]%\eeq
%% an equation
% whose solution is 
%%\[%\beq
%%	\sin^{-1}( 2 f(t) - 1 ) = \frac{1}{\sqrt{G}} ( C + t ) ,
%%\]%\eeq
%% or
%%\[%\beq
%%	( 2 f(t) - 1 ) = \sin( \frac{1}{\sqrt{G}} ( C + t ) ) ,
%%\]%\eeq
%% or
%\[%\beq
%	  f(t) = \frac{1}{2}	\left[ 1 + \sin \left(
%					\frac{\eta}{\sqrt{G}} ( t + c )
%				 \right) \right] ,
%\:\:\:\mbox{ for $t+c \in \left(-\frac{\pi}{2}\sqrt{G}/\eta,\frac{\pi}{2}\sqrt{G}/\eta
% \right)$,}
%\label{eq.sex.solution.app}
%\]%\eeq
% where $c$ is a constant of integration, $c = \sin^{-1} (2 f(0) - 1)$.
%% asin( 2*a0 - 1 )
\end{framedalgorithm}
\end{boxfloat}
 Given these assumptions, if two parents of  fitness $F \eq fG$
 mate, the probability distribution of their children's fitness
 has mean equal
 to the parents' fitness, $F$; the variation produced by sex does
 not reduce the average fitness. The standard deviation
 of the fitness of the children scales as $\sqrt{G f(1-f)}$.
 Since, after selection, the increase in  fitness  is 
 proportional to this standard deviation, {\em the
 fitness increase per generation scales as the square root of the size of the
 genome,} $\sqrt{G}$. 
 As shown in  \boxref{sec.sex.app}, the mean fitness  $\bar{F} \eq f G$
 evolves in accordance with  the differential equation:
\beq
	\frac{\d\bar{F}}{\d t} \simeq {\eta} \sqrt{f(t)(1-f(t)) G} ,
\eeq
 where $\eta \equiv \sqrt{{2/(\pi+2)}}$.
% an equation
 The solution of this equation is 
%\beq
%	\sin^{-1}( 2 f(t) - 1 ) = \frac{1}{\sqrt{G}} ( C + t ) ,
%\eeq
% or
%\beq
%	( 2 f(t) - 1 ) = \sin( \frac{1}{\sqrt{G}} ( C + t ) ) ,
%\eeq
% or
\beq
	  f(t) = \frac{1}{2}	\left[ 1 + \sin \left(
					\frac{\eta}{\sqrt{G}} ( t + c )
				 \right) \right] ,
\:\:\:\mbox{ for $t+c \in
 \left(-\frac{\pi}{2}\sqrt{G}/\eta,\frac{\pi}{2}\sqrt{G}/\eta
 \right)$,}
\label{eq.sex.solution}
\eeq
 where $c$ is a constant of integration, $c = (\sqrt{G}/\eta) \sin^{-1} (2 f(0) - 1)$.
% asin( 2*a0 - 1 )
 So this idealized system reaches a state of
 eugenic\index{eugenics}
 perfection $(f=1)$ within a finite time: $(\pi/\eta)\sqrt{G}$ generations.
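
 [A worked instance, as a rough check using the parameters of the simulations
 reported below: for $G \eq 1000$ and $f(0) \eq 1/2$ we have $c \eq 0$,
 and $f$ reaches 1 when $(\eta/\sqrt{G})\, t = \pi/2$, that is, after
\[
 t = \frac{\pi}{2} \frac{\sqrt{G}}{\eta}
 \simeq \frac{1.57 \times 31.6}{0.62} \simeq 80 \:\:\mbox{generations}.
\]
 The solution can also be verified by substitution: writing
 $\theta \equiv \eta (t+c)/\sqrt{G}$, we have $f = \frac{1}{2}(1+\sin\theta)$,
 so $\sqrt{f(1-f)} = \frac{1}{2}\cos\theta$ and
 $\d f/\d t = \frac{1}{2} (\eta/\sqrt{G}) \cos \theta$, in agreement with the
 differential equation after dividing it by $G$.]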


\begin{figure}
\figuremargin{\footnotesize
\begin{center}\small
\begin{tabular}{c}
\raisebox{13pt}{(a)}\hspace{-0.2in}\mbox{\psfig{figure=perl1/1000.1000.d.ps,width=2.8in,angle=-90}}\\
%(b)\mbox{\psfig{figure=perl1/1000.500.d.ps,width=2in,angle=-90}}\\
%(c)\mbox{\psfig{figure=perl1/1000.200.d.ps,width=2in,angle=-90}}&
%(d)\mbox{\psfig{figure=perl1/1000.100.d.ps,width=2in,angle=-90}}\\
\hspace{-0.3in}\begin{tabular}{cc} 
%(b1)\mbox{\psfig{figure=perl1/1000.1000.25M.ps,width=2in,angle=-90}}&
(b)\hspace{-0.2in}\mbox{\psfig{figure=perl1/1000.1000.25S+M.ps,width=2.53in,angle=-90}}&
%(b2)\mbox{\psfig{figure=perl1/1000.1000.6M.ps,width=2in,angle=-90}}&
(c)\hspace{-0.2in}\mbox{\psfig{figure=perl1/1000.1000.6S+M.ps,width=2.53in,angle=-90}}\\
\end{tabular}
\end{tabular}
\end{center}
}{
\caption[a]{Fitness as a function of time.
% These experiments were identical to those in figure 1,
% except that I forced all the initial genomes to have
% fitness exactly $F=G/2$, instead of picking the
% genotypes completely at random.
 The genome size is $G=1000$.
%
 The dots show
 the fitness of six randomly selected individuals from the
 birth population at each generation.
% The error bars show
% the standard deviation of fitness in the population.
 The initial population of $N=1000$ had randomly
 generated genomes
 with $f(0) = 0.5$ (exactly).
 (a)  Variation produced by {sex} alone. Line shows theoretical curve
 (\ref{eq.sex.solution})
 for infinite homogeneous population.

 (b,c) Variation produced by mutation, with and without sex,
  when the mutation rate is $mG=0.25$ (b) or 6 (c) bits per
  genome. The dashed line shows the curve (\ref{eq.mutation.soln}).
%
% (c) Variation produced by mutation, with and without
% sex, when the mutation rate is $mG=6$ bits per
%  genome.
}
\label{fig.fitness.500}
\label{fig.fitness.1000}
}
\end{figure}

\subsection{Simulations}
\label{sec.simulations}
 Figure \ref{fig.fitness.1000}a shows the  fitness
 of a sexual  population of $N=1000$ individuals with a
 genome size of $G=1000$ starting from
 a random initial state with normalized fitness $0.5$.
 It also shows the theoretical curve $f(t)G$
% using $f(t)$ derived for
% the infinite homogeneous population,
 from \eqref{eq.sex.solution},
 which fits remarkably well.

 In contrast, figures  \ref{fig.fitness.1000}(b) and (c) show the
 evolving fitness  when variation is
 produced by mutation at rates
 $m=0.25/G$ and $m=6/G$ respectively. Note the difference in the
 horizontal scales from panel (a).


% Figure \ref{fig.fitness.1000}(b) shows the  fitness
% of a population of $N=500$ individuals with a
% genome size of $G=1000$ starting from
% a random initial state with normalized fitness $0.1$.
%
% Figures \ref{fig.fitness.1000}(c) and (d) show what happens for smaller
% population sizes, $N=200$ and $N=100$.

\exercissxC{3}{ex.smallpopn}{
%\subsection{Small populations}
 {\sf Dependence on  population size}.
 How do the results for a sexual population depend on the
 population size? We anticipate that there is a minimum population
 size above which the theory of sex is accurate.
% infinite-population approximation works well.
 How
% In what way
 is that minimum  population size
 related to $G$?
}
\exercisxC{3}{ex.crossover}{
 {\sf Dependence on crossover mechanism}.
 In the simple model of sex, each bit is taken at random
 from one of the two parents, that is, we allow crossovers
 to occur with probability 50\% between any two adjacent
 nucleotides.
 How is the model affected
%\ben
%\item
 (a)
 if the crossover probability
 is smaller?
 (b)
% \item
 if crossovers  occur exclusively
 at {\dem\ind{hot-spot}s\/}
 located every $d$ bits along the genome?
% \een
}

\begin{figure}
\figuremargin{\footnotesize
\begin{center}
\begin{tabular}{ccc}
& $G=1000$ &
 $G=100\,000$ \\
\raisebox{1in}{$mG$}\hspace{-0.2in} &
\mbox{\psfig{figure=psm/maxrate.1000.ps,width=2.32in,angle=-90}}
&\mbox{\psfig{figure=psm/maxrate.100000.ps,width=2.32in,angle=-90}}
\\
&
 $f$ & $f$ \\
\end{tabular}
\end{center}
}{
\caption[a]{Maximal tolerable mutation rate, shown as number of
 errors per genome ($mG$), versus normalized fitness $f=F/G$.
 Left panel: genome size  $G=1000$; right: 
 $G=100\,000$.

 Independent of genome size, a parthenogenetic species (no sex) can 
 tolerate only of order 1 error per genome per generation;
 a species that uses recombination (sex) can tolerate far greater
 mutation rates. 
}
\label{fig.maxrate}
}
\end{figure}
\section{The maximal tolerable mutation rate}
%{Sex with mutations}
%{\em This section needs checking over, to confirm the
% details of the factors of $\eta$, etc.}

 What if we combine the two models of variation? What
 is the maximum mutation rate that can be tolerated by a
 species that has sex?

 The rate of increase of fitness is given by
\beq
	\frac{\d f}{\d t} \simeq - 2 m \, \deltaf +
         \eta\sqrt{{2}} \sqrt{ \frac{m + f(1-f)/2}{G} }  ,
\eeq
 which
%  This quantity
 is positive if
%\beq
%	2 m \, \deltaf < \eta\sqrt{{2}} \sqrt{ \frac{m + f(1-f)/2}{G} } .
%\eeq
% Replacing $\deltaf$ by its largest value, $1/2$, and omitting the $m$
% on the right-hand side,
% the rate of increase of fitness is positive, for a given $f$,
% if
 the mutation rate satisfies
\beq
	m <  \eta\sqrt{\frac{f(1-f)}{G}}  .
\eeq
 Let us compare this rate with the result in the absence of sex,
 which, from \eqref{eq.rate2}, is that the maximum tolerable mutation rate
 is
\beq
	m < \frac{1}{G} \frac{1}{(2 \, \deltaf)^2} .
\label{eq.no.sex.crit.m}
\eeq
% These two maximum mutation rates are of completely different
% orders.
% (May I be permitted an exclamation mark?)
 The tolerable mutation rate with sex is
 of order $\sqrt{G}$ times greater than that without sex!
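 [To see where these orders of magnitude come from (a sketch; the
 numerical value of $f$ is chosen for illustration), replace $\deltaf$ by its largest
 value, $1/2$, and drop the $m$ inside the square root in the rate equation
 for the sexual population. The rate is then certainly positive if
\[
 m < \eta \sqrt{2} \sqrt{\frac{f(1-f)/2}{G}} = \eta \sqrt{\frac{f(1-f)}{G}} ,
\]
 which is of order $1/\sqrt{G}$, whereas the bound (\ref{eq.no.sex.crit.m})
 is of order $1/G$.
 The ratio of the two bounds is
\[
 \eta \sqrt{f(1-f)} \, (2\,\deltaf)^2 \sqrt{G} ,
\]
 which for $f \eq 0.75$ is about $0.07 \sqrt{G}$.]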
% this is d/dm[ df/dt ]:
%plot[x=0.5:1] -(2*x-1) + 1.0/sqrt(G*pi * x*(1-x) )             
% optimum mutation rate: 
%  is m=0
% this omits G: 
%f=0.75; plot[m=0:0.5] -(2*m*(f-0.5)) + sqrt(2.0/pi) *  sqrt(m+pi * f*(1-f)/2.0 )



 A parthenogenetic (non-sexual) species could try to wriggle out of
 this bound on its mutation rate by increasing its litter sizes.
% , so as to tolerate higher mutation rates.
 But if mutation flips on average $mG$ bits, the probability
 that no bits are flipped in one genome is roughly $e^{-mG}$, so a mother
 needs to have roughly $e^{mG}$ offspring in order
 to have a good chance of having one child with
 the same fitness as her. The  litter size of a non-sexual
 species thus has
 to be exponential in $mG$ (if $mG$
% the factor by which $m$
 is bigger than 1),
%  exceeds the critical value defined in equation \ref{eq.no.sex.crit.m},
 if the species is to persist.

 So the maximum tolerable mutation rate  is  pinned close to
 $1/G$, for a non-sexual species, whereas it is a larger
 number of order $1/\sqrt{G}$,  for a species with
 recombination.

 Turning these results around, we can predict the  largest
 possible genome size
 for a given fixed mutation rate, $m$.
 For a parthenogenetic species,
 the largest genome size is of order $1/m$, and for a sexual species, $1/m^2$.
 Taking the figure $m=  10^{-8}$ as the mutation rate 
 per nucleotide per generation \cite{EWK99},
% 2 \times this going by EWK actually
 and allowing for a maximum brood size of $20\,000$ (that is,
 $mG \simeq 10$),
 we predict that
 all species with more than $G = 10^{9}$ coding
 nucleotides make at least occasional use of recombination.
 If the brood size is 12, then this number falls to
 $G = 2.5 \times 10^{8}$.
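 [The arithmetic behind these two figures, on the assumption made above that
 a brood of size $e^{mG}$ is needed to tolerate $mG$ errors per genome:
\[
 e^{10} \simeq 22\,000 , \:\:\:\mbox{so}\:\:\:
 G \simeq \frac{10}{m} = \frac{10}{10^{-8}} = 10^{9} ;
\]
\[
 \ln 12 \simeq 2.5 , \:\:\:\mbox{so}\:\:\:
 G \simeq \frac{2.5}{10^{-8}} = 2.5 \times 10^{8} .
\]
 Both figures scale inversely with the assumed mutation rate $m$.]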
% graveyard.tex




\section{Fitness increase and information acquisition}
 For this simple model it is possible to relate increasing fitness
 to information acquisition.

 If the bits are set at random, the fitness is roughly
 $F=G/2$.
 If evolution leads to a population  in which
 all individuals have the maximum fitness $F=G$, then
 $G$ bits of information have been acquired by the species,
 namely for each bit $x_g$, the species has figured
 out which of the two states is the better.

 We define the information acquired at an intermediate
 fitness
% , suggested by \citeasnoun{Kimura61}, is
 to be the amount of selection (measured in bits)
 required to select the perfect state from the gene pool. 
 Let  a fraction $f_g$ of the population
 have $x_g \eq 1$.  Because $\log_2 (1/f)$ is the information required to
 find a black ball in an urn containing black and white balls
 in the ratio $f:1\!-\!f$, 
% 
% Defining $\delta F \equiv F-G/2$, it will be convenient
% to 
% We therefore view  the fitness  as measuring, in bits,
 we define the  information acquired  to be
\beq
	I = \sum_g \log_2 \frac{ f_g }{ 1/2 } \:\:\mbox{bits} .
\eeq
 If all the fractions $f_g$ are equal to $F/G$, then
\beq
	I = G \log_2 \frac{ 2F }{ G } ,
\eeq
 which is well approximated by 
\beq
	\tI \equiv 2( F-G/2 ) .
\eeq
 The rate of information acquisition is thus roughly two times
 the rate of  increase of fitness in the population.
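 [A quick comparison of $I$ and $\tI$, as a sketch; the intermediate
 value of $f$ is chosen for illustration.
 Writing $f \eq F/G$, we have $I = G \log_2 (2f)$ and $\tI = G (2f - 1)$.
 These agree exactly at both ends of the range, $f \eq 1/2$ (where both are zero)
 and $f \eq 1$ (where both equal $G$). In between, at $f \eq 3/4$,
\[
 I = G \log_2 1.5 \simeq 0.58\, G , \:\:\:\:\: \tI = 0.5\, G ,
\]
 so the linear approximation underestimates $I$ by roughly 15\% there.]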
% We will find it useful to define the normalized
% fitness $f(\bx) \equiv F(\bx)/G$. 

\section{Discussion}
 These results quantify the well known
 argument for why species reproduce by sex with recombination, namely
 that recombination allows useful mutations to spread more rapidly through
 the species and allows deleterious mutations to be more rapidly cleared
 from the population  \cite{JMS78,Felsenstein85,JMS88,JMSES95}.
%
%
 A population that reproduces by recombination can 
% parthenogenesis
% and experiences
% variation through 
% random mutations can 
 acquire information from natural selection at a
 rate of order  $\sqrt{G}$ times faster than
 a parthenogenetic population, and it can tolerate 
% only of about one bit per generation.
% A population  that reproduces by sex
% can acquire information at a rate of order $\sqrt{G}$, the
% square root of the size of the genome. For
 a mutation rate that is  of order  $\sqrt{G}$ times  greater.
 For 
 genomes of size $G \simeq 10^8$ coding  nucleotides,
 this factor of $\sqrt{G}$
 is
% e differences between these two rates are
 substantial.

 This enormous advantage conferred by sex has been noted before
 by \citeasnoun{Kondrashov1988}, 
 but this meme, which Kondrashov calls `the deterministic mutation hypothesis',
 does not seem to have diffused throughout the
 evolutionary research community, as there are still numerous
 papers in which the prevalence of sex is viewed as a 
 mystery to be explained by elaborate mechanisms.
% removed to   itp/tex/genecut.tex Tue 22/10/02



\subsection*{`The cost of males' -- stability of a gene for sex or parthenogenesis }
 Why
% has the meme explaining the prevalence of sex been swamped by this plethora of articles that
 do people declare sex to be a mystery?
 The main motivation for being mystified is an idea
 called the `\ind{cost of males}'.\index{male}\index{female}
 Sexual reproduction is disadvantageous compared with asexual reproduction,
 it's argued, because of every two offspring produced by sex, one
 (on average) is a useless male, incapable of child-bearing,
 and only one is a productive female. In the same time,
 a parthenogenetic mother could give birth to {\em two\/}
 female clones.
 To put it another way, the big advantage of parthenogenesis, from the
 point of view of the individual, is that one is able
 to pass on 100\% of one's genome to one's children,
 instead of only 50\%.
%
 Thus if there were two versions of a species, one
 reproducing with and one without sex, the
% population of
 single mothers would be expected to
 outstrip their  sexual cousins.  The simple model presented
 thus far did not include either genders or the ability
 to convert from sexual reproduction to asexual, but we can easily
 modify the model. 
% include the effect which supposedly should give a disadvantage
% to sexual production.
\begin{figure}
\figuremargin{
\begin{center}\small\footnotesize
\small\begin{tabular}{c@{\hspace*{-0.05in}}c@{\hspace*{-0.1in}}c}
&
\mbox{(a) $mG=4$}
&
\mbox{(b) $mG=1$}
\\
\raisebox{0.6in}{\rotatebox{90}{\footnotesize\sf Fitnesses}}  & \fitfig{F1000.1000.4.C4} & \fitfig{F1000.1000.1.C4} \\
\raisebox{0.6in}{\rotatebox{90}{\footnotesize\sf Percentage}} & \fitfig{P1000.1000.4.C4} & \fitfig{P1000.1000.1.C4} \\
\end{tabular}
\end{center}
}{
\caption[a]{Results when there is a gene for parthenogenesis,
 and no interbreeding, {\em and single mothers produce as many children
 as sexual couples}. $G=1000$, $N=1000$.
 (a) $mG = 4$; (b) $mG=1$.
%Vertical axis shows both fitness and
 Vertical axes show the fitnesses of the two
 sub-populations, and the percentage of the population
 that is parthenogenetic.}
\label{fig.mixed2.C4}
}
\end{figure}


 We modify the
 model so that one of the $G$ bits in the
 genome determines whether an
 individual prefers to reproduce
 parthenogenetically ($x \eq 1$) or
 sexually ($x \eq 0$).
%
 The results depend on the number of children
 had by a single parthenogenetic mother, $\Kp$, and the number
 of children born to a sexual couple, $\Ks$.
 Both ($\Kp \eq 2$, $\Ks \eq 4$) and ($\Kp \eq 4$, $\Ks \eq 4$)
 are reasonable models. The former  ($\Kp \eq 2$, $\Ks \eq 4$)
 would seem most appropriate
 in the case of unicellular organisms, where the cytoplasm
 of both parents goes into the children. The latter  ($\Kp \eq 4$, $\Ks \eq 4$)
 is appropriate if the children are solely nurtured by
 one of the parents, so single mothers have just as many offspring
 as a sexual pair. I concentrate on the latter model, since it gives the
greatest advantage to the parthenogens, who are supposed
 to outbreed the sexual community.
 Because parthenogens have four children per generation, the maximum
 tolerable mutation rate for them is twice the expression
 (\ref{eq.no.sex.crit.m})
 derived before for $\Kp \eq 2$. If the fitness
 is large, the maximum tolerable rate is $mG \simeq 2$. 


 Initially the genomes are set randomly with $F=G/2$,
 with half of the population having the gene for parthenogenesis.
%
%\subsection{$\Kp \eq 4$, $\Ks \eq 4$, `consensual sex'}
 \Figref{fig.mixed2.C4} shows the outcome.
% if single parthenogens produce as many offspring
% as a {\em pair\/} of sexuals.
 During the `learning' phase of evolution,\index{learning!in evolution}\index{evolution!as learning}
 in which the fitness is increasing rapidly,
 pockets of parthenogens appear briefly, but
 then disappear within a couple of generations
 as their sexual cousins overtake them in fitness
 and leave them behind.  Once the population reaches its
 top fitness, however, the parthenogens can take over,
 if the mutation rate is sufficiently low ($mG\eq1$).
% In these simulations, sex does not tend to reappear once
% the parthenogens have taken over, because a small sexual
% community, having size $N_{\rm sexual}<\sqrt{G}$,
% will be in-bred and will not have the advantage
% discussed in the rest of this paper.

 In the presence of a higher mutation rate ($mG \eq 4$),
 however, the parthenogens never take over. The breadth of the
 sexual population's fitness is of order  $\sqrt{G}$, so
 a mutant parthenogenetic colony arising
 with slightly above-average fitness will last for about
  $\sqrt{G}/(mG) = 1/(m\sqrt{G})$ generations before its fitness falls
 below that of its sexual cousins. As long as the  population
 size is sufficiently large for some sexual individuals
 to survive for this time, sex will not die out.
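
 As a rough illustration (an order-of-magnitude estimate only,
 taking the parameters of the simulation with $G=1000$ and $mG=4$),
 this lifetime is
\beq
	\frac{\sqrt{G}}{mG} = \frac{\sqrt{1000}}{4} \simeq 8 \mbox{ generations} ,
\eeq
 so, on this estimate, the sexual sub-population need only persist for
 a handful of generations in order to outlast such a colony.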

 In a sufficiently unstable environment, where the
 fitness function  is continually changing,
 the parthenogens will always lag behind the sexual
 community.
 These results are consistent with
 the  argument of \index{Haldane, J.B.S.}{Haldane}
% \citeasnoun{Haldane1949}
 and \index{Hamilton, William D.}\citeasnoun{Hamilton2002}
% \citeasnoun{Hamilton1990}
% {Hamilton}
 that sex is helpful
% maintains variation which is
% useful in the co-evol. arms race with parasites.
 in an \ind{arms race} with parasites. The \ind{parasite}s define
 an effective fitness function which changes with time,
 and a sexual population will always ascend the current fitness
 function more rapidly.

\subsection{Additive fitness function}
 Of course, our results depend on the fitness
 function that we assume, and on our model
 of selection. Is it reasonable to model fitness, to first
 order, as a {\em sum\/} of independent terms?
 \citeasnoun{Smith68} argues that it is: the more good genes you
 have, the higher you come in the pecking order, for example.
 The directional selection model has been   used extensively in theoretical population
genetic studies \cite{Bulmer1985}.
 We might expect real fitness functions to involve interactions,
 in which case crossover might reduce the average fitness.
 However, since recombination gives the biggest advantage
 to species whose fitness functions are additive, we might predict
 that {\em evolution will have favoured species that used a representation
 of the genome that corresponds to a fitness function that
 has only weak interactions}. And even if there are interactions,
 it seems plausible that the fitness would
 still involve a sum of such interacting terms, with the number
 of terms being some fraction of the genome size $G$.
% moved this to genecut.tex
% Fitness functions that are sums of interacting terms are investigated in section \ref{sec.interactions}.
\exercisxC{3C}{ex.interactions}{
	Investigate  how fast  sexual
 and asexual species evolve if they have  a fitness
 function with interactions. 
 For example, let the fitness be a sum of exclusive-ors of pairs
 of bits; compare the evolving fitnesses with
 those of the   sexual
 and asexual species with a simple additive fitness function.
}
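
 One possible starting point for this investigation is sketched below in Python.
 It is a minimal sketch under stated assumptions, not a definitive
 implementation: I assume that each asexual parent has two children,
 that each sexual couple has four children formed by taking each bit
 from either parent with probability $1/2$, that every child bit is then flipped with
 probability $m$, and that the fitter half of the $2N$ children survive.
 The function names, the pairing of adjacent bits in the exclusive-or
 fitness, and the default parameter values are all hypothetical choices.

```python
import random

def additive_fitness(x):
    """Simple additive fitness: the number of 1-bits."""
    return sum(x)

def xor_pair_fitness(x):
    """Fitness with interactions: sum of XORs of adjacent bit pairs (0,1), (2,3), ...
    The maximum is G/2; a random genome scores about G/4."""
    return sum(x[i] ^ x[i + 1] for i in range(0, len(x) - 1, 2))

def mutate(x, m, rng):
    """Flip each bit independently with probability m."""
    return [b ^ 1 if rng.random() < m else b for b in x]

def select(children, fitness, n_keep):
    """Keep the n_keep fittest children (assumed selection rule)."""
    return sorted(children, key=fitness, reverse=True)[:n_keep]

def asexual_step(pop, fitness, m, rng):
    """Each parent has two mutated clones; the fitter half survive."""
    children = [mutate(x, m, rng) for x in pop for _ in range(2)]
    return select(children, fitness, len(pop))

def sexual_step(pop, fitness, m, rng):
    """Random couples each have four children by uniform crossover, then mutation."""
    shuffled = pop[:]
    rng.shuffle(shuffled)
    children = []
    for a, b in zip(shuffled[0::2], shuffled[1::2]):
        for _ in range(4):
            child = [ai if rng.random() < 0.5 else bi for ai, bi in zip(a, b)]
            children.append(mutate(child, m, rng))
    return select(children, fitness, len(pop))

def mean_fitness_history(step, fitness, G=60, N=40, m=0.005, generations=20, seed=0):
    """Mean fitness after each generation, starting from random genomes (N even)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(G)] for _ in range(N)]
    history = []
    for _ in range(generations):
        pop = step(pop, fitness, m, rng)
        history.append(sum(fitness(x) for x in pop) / N)
    return history

if __name__ == "__main__":
    for name, step in (("sexual", sexual_step), ("asexual", asexual_step)):
        for fname, fitness in (("additive", additive_fitness), ("xor pairs", xor_pair_fitness)):
            history = mean_fitness_history(step, fitness)
            print(name, fname, round(history[-1], 2))
```

 Plotting the four histories against generation number, for several
 values of $G$, $N$ and $m$, is one way to compare the evolving fitnesses.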

\begincuttable
 Furthermore, if the \ind{fitness} function were a highly nonlinear
 function of the genotype, it could be made smoother and locally linear
 by the \ind{Baldwin effect}.
 The Baldwin effect \cite{Baldwin1896,HintonNowlan87}
 has been widely studied as a mechanism whereby
 {\em learning\/} guides evolution, and it could also act at the level of
 transcription and translation.
 Consider the \ind{evolution} of a peptide sequence for
 a new purpose. Assume  the effectiveness
 of the peptide is a highly nonlinear function of the
 sequence, perhaps having a small island of good sequences surrounded\index{evolution!Baldwin effect}
 by an ocean of equally bad sequences. In an
 organism whose transcription and translation machinery
 is flawless, the fitness will be an equally  nonlinear function
 of the  DNA sequence, and evolution will wander around the
 ocean making progress towards the island only by
 a random walk.   In contrast, an organism having the same
 DNA sequence, but whose DNA-to-RNA
 transcription or RNA-to-protein translation is `faulty',
 will occasionally, by mistranslation or mistranscription,
 accidentally produce a working enzyme; and it will do so with greater
 probability if its DNA sequence is close to a good
 sequence.  One cell might produce 1000 proteins from the
 one mRNA sequence, of which 999 have no enzymatic effect, and one
 does. The one working catalyst will be enough for that cell
 to have an increased fitness relative to rivals whose DNA sequence
 is further from the island of good sequences.
 For this reason I conjecture that,
 at least early in evolution, and perhaps still now, the
 \ind{genetic code} was not implemented perfectly but was implemented noisily,\index{evolution!of the genetic code}
 with some codons coding for a distribution of possible
 \ind{amino acid}s.  This noisy code could even be switched on and off
 from cell to cell in an organism by
 having multiple aminoacyl-tRNA synthetases, some more reliable than
 others.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

 
 Whilst our model assumed that the bits of the genome do not interact,
 ignored the fact that the information is represented redundantly,
 assumed that there is a direct relationship between phenotypic
 fitness and the genotype, 
 and assumed that the crossover probability in recombination is high,
 I believe these qualitative results would still hold if more complex
 models of fitness and crossover were used: the relative benefit
 of sex will still scale as $\sqrt{G}$.
% , where $G$ is proportional to the genome size.
 Only in small, in-bred populations
 are the benefits of sex expected to be diminished.

 In summary: Why have sex? Because sex is good for your bits!

\section*{Further reading}
% Do all self-replicating systems have a lot of information content?
% If so, how did life start at all, given that information can be acquired by
% natural selection only gradually?
 How did a high-information-content self-replicating system ever
 emerge in the first place?
 In the general area of the origins of life and other tricky questions about evolution, 
% , the genetic code, and sex,
 I highly recommend \citeasnoun{JMSES95}, \citeasnoun{JMSES99}, \citeasnoun{Kondrashov1988},
 \citeasnoun{JMS88}, \citeasnoun{MarkRidley}, \citeasnoun{Dyson1985},
 \citeasnoun{CairnsSmith1985}, and \citeasnoun{Hopfield1978}.

\section{Further exercises}
\ExercisxC{3}{ex.estimateDNAerror}{
	How good must the error-correcting
 machinery in \index{DNA!replication}DNA replication be, given
 that mammals have not all died out long ago?
 Estimate the probability of nucleotide substitution, per cell division.
%\soln{ex.estimateDNAerror}{
 [See 
% chapter
 \appendixref{ch.numbers}.] 
% for some  estimates.] 
%}
}
\ExercisxC{4}{ex.dna-ecc}{
 Given that {DNA replication} is achieved by bumbling
 \ind{Brownian motion} and ordinary thermodynamics
 in a biochemical \ind{porridge} at a temperature of 35$\,^{\circ}$C, it's astonishing
 that the error-rate  of \index{DNA!replication}DNA replication is about $10^{-9}$ per
 replicated nucleotide. How can this  reliability be achieved,\index{error correction!in DNA replication}\index{error-correcting code!in DNA replication}
 given that the energetic difference between a correct
 base-pairing and an incorrect one is only one or two \ind{hydrogen bond}s
% (one hydrogen bond is worth about 1$\,$kJ$\,$mol$^{-1}$ in free energy)
 and the thermal energy $kT$ is only about a factor of
 four smaller than the free energy associated with a hydrogen bond?
% about 8$\,$kJ$\,$mol$^{-1}$.
% thermal energy is 0.6 kcal/mol
% hydrogen bond is 1-5 kcal/mol
 If ordinary  thermodynamics  is what favours correct \ind{base-pairing},\index{Watson--Crick base pairing}
 surely the frequency of incorrect base-pairing should be
 about
\beq
	f = \exp( - \dfrac{\upDelta E}{kT} ),
\eeq
 where $\upDelta E$ is the free energy difference, \ie,
 an error frequency of $f \simeq 10^{-4}$?
% \exp(-8)$?
%
 How has DNA replication cheated thermodynamics?

 The situation is equally perplexing
 in the case of \ind{protein synthesis},\index{puzzle!fidelity of DNA replication}
 which translates an mRNA sequence into a polypeptide in accordance
 with the genetic code.  Two specific chemical reactions are
 protected against errors: the binding of tRNA molecules to amino acids,
 and the production of the polypeptide in the ribosome, which,
 like DNA replication, involves base-pairing.
 Again, the  fidelity is high (an error rate of about $10^{-4}$),
 and this fidelity can't be caused by the 
 energy of the `correct' final state being especially low --
 the correct polypeptide sequence is not expected to be significantly lower in energy
 than any other sequence. How do cells perform error correction?\index{error correction!in protein synthesis}\index{error-correcting code!in protein synthesis}
 (See \citeasnoun{Hopfield1974} and \citeasnoun{Hopfield1980}.)\index{Hopfield, John J.}
}
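
 To make the thermodynamic estimate in exercise \ref{ex.dna-ecc} concrete,
 assume, for illustration only, that the free energy difference
 is two hydrogen bonds, each worth about $4kT$. Then
\beq
	f = \exp\!\left( - \frac{\upDelta E}{kT} \right) \simeq \exp( -8 ) \simeq 3 \times 10^{-4} ,
\eeq
 which exceeds the observed error rate of about $10^{-9}$ per nucleotide
 by more than five orders of magnitude.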
\ExercisxC{2}{ex.estimateBrainmemoryrate}{
 While the \ind{genome} acquires information through
 natural selection at a rate of a few bits per generation,
% (\chref{ch.sex}),
% exerciseonlyref{ex.evolutionteach}), 
 your brain acquires information at a greater rate.

 Estimate at what rate new information can be stored in
 long term memory by your brain.   Think of learning
 the words of a new language, for example. 
}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Solutions}
\soln{ex.smallpopn}{
 For small enough $N$,
 whilst the average fitness of the population increases,
 some unlucky bits  become frozen into the bad state. (These
 bad genes are sometimes known as \ind{hitchhiker}s.)
 The homogeneity assumption breaks down.
 Eventually,  all individuals have identical genotypes that
 are mainly 1-bits, but contain some 0-bits too.
 The smaller the population, the greater the number of
 frozen 0-bits is expected to be.
 How small can the population size $N$  be if the theory of sex is to remain accurate?

 We find experimentally that the theory based on assuming homogeneity
  fits poorly only if the  population size
 $N$ is smaller than $\sim\! \sqrt{G}$.
 If $N$ is significantly smaller than $\sqrt{G}$,   information
 cannot possibly be acquired at a rate as big as $\sqrt{G}$,
 since the information content of the Blind Watchmaker's
 decisions cannot be any greater than $2N$ bits per generation,
 this being the number of bits required to specify which
 of the $2N$ children get to reproduce.
 \citeasnoun{Baum95}, analyzing a similar model, show that
 the population size $N$ should be about  $\sqrt{G}(\log G)^2$
 to make hitchhikers unlikely to arise.
% the finite population's fitness is to rise at the same rate
% as the infinite population.
}

\dvips
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%      PART           %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\renewcommand{\partfigure}{\poincare{8.frag1}} 
\part{Probabilities and Inference}
\subchapter{About Part IV}% Introduction to 
%
\fakesection{introduction to pt IV}
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 The number of inference problems 
 that can (and perhaps should) be tackled
 by Bayesian inference methods is enormous. 
 In this book, for example, we  discuss the decoding problem for 
 error-correcting codes, the task of inferring clusters 
 from data,  the task of interpolation through noisy data, 
 and the task of  classifying patterns given labelled examples. 
 Most techniques for solving these problems 
 can be categorized as follows.
\begin{description}
\item[Exact methods] compute the required quantities
 directly. Only a few interesting problems have  a direct 
 solution, but exact methods are important as tools 
 for solving subtasks within larger problems. 
 Methods for the exact solution of inference problems 
 are the subject of  Chapters \ref{ch.enumerate},
 \ref{ch.exactmarg}, \ref{ch.exact}, and \ref{ch.sumproduct}.

% for example using forward-backward within EM.
\item[Approximate methods] can be subdivided into 
\ben
\item {\bf deterministic approximations}, which include\index{approximation!of complex distribution}
 maximum likelihood (\chref{ch.ml}), 
        Laplace's method (Chapters \ref{ch.laplace} and \ref{ch.occam})
        and variational methods  (\chapterref{ch.variational}); and 
\item {\bf Monte Carlo methods} -- techniques in which
	random numbers play an integral part -- which will be discussed
        in Chapters \ref{ch.mc},
	\ref{ch.mc2}, and \ref{ch.mcexact}.
\een
\end{description}

% removed fit.tex from here
% removed material from here to enumerate.tex

% \section{Overview}
 This part of the book does not form a one-dimensional 
 story. Rather, the ideas make up a web of interrelated threads which
 will
% .  These threads
 recombine in subsequent chapters.  

  \Chapterref{ch.bayes}, which
 is an honorary member of this part,
 discussed a range of simple examples of inference 
 problems  and their Bayesian solutions.

 To give further motivation for the toolbox of
 inference methods discussed in this part, 
 \chapterref{ch.clustering} discusses the problem of clustering; subsequent chapters
  discuss the probabilistic interpretation of
 clustering as \ind{mixture modelling}. 

  \Chapterref{ch.enumerate} discusses the option of
 dealing with probability distributions by completely
 enumerating all hypotheses.  
  \Chapterref{ch.ml} introduces the idea of maximization
 methods as a way of avoiding the large cost associated with complete
 enumeration, and points out reasons why maximum likelihood is
 not good enough.
  \Chapterref{ch.distributions} reviews the probability distributions
 that arise most often in Bayesian inference.
 Chapters \ref{ch.exactmarg}, \ref{ch.exact}, and \ref{ch.sumproduct}
 discuss another way of avoiding the
 cost of complete enumeration: marginalization.  
 Chapter \ref{ch.exact} discusses message-passing methods appropriate
 for graphical models, using the 
 decoding of error-correcting codes as an example.
 Chapter \ref{ch.sumproduct} combines these ideas with
 message-passing concepts  from Chapters \ref{ch.message} and
 \ref{ch.noiseless}.  These chapters are a 
 prerequisite for the understanding of advanced error-correcting codes.

 Chapter \ref{ch.laplace} discusses deterministic approximations including 
 Laplace's method. This chapter is a prerequisite for understanding 
 the topic of complexity control in learning algorithms, an idea that
 is discussed in general terms in \chref{ch.occam}. 

 Chapter \ref{ch.mc} discusses Monte Carlo methods. 
	Chapter \ref{ch.mc2}  gives details of
	state-of-the-art Monte Carlo techniques.
 
 Chapter \ref{ch.ising} introduces the \ind{Ising model} as a test-bed 
 for probabilistic methods. An exact {message-passing} method\index{message passing} and a Monte Carlo method
 are demonstrated.  A  motivation for studying the Ising model
 is that it is intimately related to several neural network models.
 \Chref{ch.mcexact} describes `exact' Monte Carlo methods
 and demonstrates their application to the Ising model.

 Chapter \ref{ch.variational} discusses variational methods and their application 
 to Ising models and to simple statistical inference problems including
 clustering. This 
 chapter will help the reader understand the \ind{Hopfield network}
 (\chapterref{ch.hopfield}) and 
 the \ind{EM algorithm}, which is an important method in {latent-variable modelling}.\index{latent variable model}
% (\chapterref{ch.em}). 
 \Chref{ch.ica} discusses a particularly simple latent variable
 model called independent component analysis.

%  Is there going to be a chapter called hierarchical modelling?
%  Will I define graphical models?
%  Where do I talk about trellises?
% 
%  Latent variable models will come in a later part. Have parts on nn's, 
%  on l.v.'s. Or on supervised and unsupervised.

% This part of the book ends with
	\Chref{ch.ignorance}
 discusses  a ragbag of
 assorted inference topics.
 \Chref{ch.decision} discusses a simple
 example of decision theory.
% What experiments should one do
%discusses interesting examples
% of  prior probability distributions that describe ignorance.
 \Chref{ch.sampling}    discusses differences between
 sampling theory and Bayesian methods.

%\subsection*{Head off the misconceptions early}
\subsection*{A theme: what inference is about}
 A widespread misconception is that
 the aim of inference is to find
 {\em the most probable explanation\/} for some data.\index{sermon!MAP method}
 While this most probable hypothesis may
 be of interest, and some inference methods do
 locate it, this hypothesis is just the peak of
 a probability distribution, and it is the
 whole distribution that is of interest.
%
 As we saw in  \chapterref{ch2}, the {\em most probable\/}
 outcome from a source is often not a {\em typical\/} outcome
 from that source.
 Similarly, the most probable hypothesis given some data
 may be atypical of the whole set of reasonably-plausible
 hypotheses.\index{sermon!most probable is atypical} 
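
 A small numerical illustration of the first of these statements may help.
 The Python sketch below assumes a hypothetical source that emits
 $n=100$ independent bits, each being a 1 with probability $p=0.1$;
 these numbers and the function names are my own choices.
 It compares the probability of the single most probable string (all zeroes)
 with the probability of obtaining a string that has the typical number of 1s.

```python
import math

def prob_of_string_with_k_ones(k, n, p):
    """Probability of one particular n-bit string that contains k ones."""
    return p**k * (1 - p)**(n - k)

def prob_of_k_ones(k, n, p):
    """Probability that an n-bit string contains exactly k ones."""
    return math.comb(n, k) * prob_of_string_with_k_ones(k, n, p)

n, p = 100, 0.1  # illustrative values (assumptions)
print("P(most probable string, all zeroes):", prob_of_string_with_k_ones(0, n, p))
print("P(some string with exactly 10 ones):", prob_of_k_ones(10, n, p))
```

 The all-zeroes string is more probable than any other individual string,
 yet it is thousands of times less probable than the event that the
 outcome contains exactly ten 1s. The peak alone says little about
 where the probability mass lies.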

%%%%%%% Maybe I should say marginalization is the key idea.
% Another important idea is the concept of marginalization,
% \ie, integrating over variables that we are not
% interested in;  typical hypotheses contribute
% most to the marginal probability densities.

% YES/?????????????????/


 


% \prechapter{About             Chapter}
\subsection*{About  \protect\chref{ch.clustering}}
 Before reading the next chapter,
 exercise
 \ref{ex.logit} (\pref{ex.logit})
% \ref{ex.logit} (\pref{ex.logit})
 and section \ref{sec.pulse} (inferring  the input to a Gaussian channel)
 are 
 recommended reading.

 

\dvips
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\typeout{ DON'T FORGET input{tex/fit.tex}  shows the fit of a gaussian in 2d!!!!!  }
%
%\chapter{An example inference task: clustering}
\chapter{An Example Inference Task: Clustering}
\label{ch.clustering}
%
% clust.tex
%
 Human brains are good at finding regularities in data.
 One  way of expressing regularity is to put a set of
 objects into groups whose members are similar to each other.
 For example,  biologists have
 found that most objects in the natural world
 fall into one of two categories: things that
 are brown and run away, and things that are green
 and don't run away. The first group they call  animals,
 and the second, plants.
 We'll call this operation of grouping things together
 {\dem\ind{clustering}}.
 If the  biologist   further sub-divides
 the cluster of plants into sub-clusters, we would
 call this `\ind{hierarchical clustering}'; but
 we won't be talking about hierarchical clustering  yet.
 In this chapter we'll just discuss ways to take a set of $N$
 objects and group  them into $K$ clusters.
 

 There are several motivations for  clustering.\indexs{clustering}
 First,   a good clustering
 has predictive power. When an early biologist encounters a
 new green thing he has not seen before, his internal model
% which says that all living things are either
 of plants and
 animals  fills in predictions for attributes of the
 green thing: it's unlikely to jump on him and eat him;
 if he touches it, he might  get grazed or stung; if he eats
 it, he might feel sick.  All of these predictions, while uncertain,
 are useful, because they  help the biologist  invest his
 resources (for example, the time spent watching for predators) well.
 Thus, we perform clustering because we believe the underlying
 cluster labels are meaningful,  will lead to a more efficient
 description of our data, and will help us choose better actions. This type of clustering
 is sometimes called `mixture \ind{density modelling}',\index{mixture modelling}\index{modelling!density modelling}
 and the objective function that measures how well the predictive
 model is working is the information content of the data, $\log 1/P(\{\bx\})$.

 Second, clusters can be a useful aid to communication because
 they allow
% provide codewords for
 \ind{lossy compression}.\index{compression!lossy}
 The biologist can give directions to a friend such as
% of the form
 `go to the
 third
 {\em tree\/}
% {\underline{tree}\/}
 on the right then take a right turn' (rather than
 `go past the large green thing with red berries, then past
 the large green thing with thorns, then $\ldots$'). 
 The brief category name `tree' is helpful because   it is
 sufficient to identify an object.
 Similarly, in  lossy \ind{image compression}, the aim is to convey
 in as few bits as possible a reasonable reproduction of a picture;
 one way to do this is to divide the image into $N$ small patches, and find
 a close match to each patch in an alphabet of $K$ image-templates;
 then we send a close fit to the image
 by sending the list of labels $k_1,k_2,\ldots,k_N$ of the matching    templates.
 The task of creating a good library of image-templates is equivalent
 to finding  a set of  cluster centres.
 This type of clustering is sometimes called `\ind{vector quantization}'.

%\marginfig{
%\caption[a]{Vector quantization}
%}
 We can formalize a vector quantizer in terms of an {\dem{assignment rule}}
 $\bx \rightarrow k(\bx)$ for  assigning 
 datapoints $\bx$ to one of $K$ codenames, and a {\dem{reconstruction 
 rule}} $k \rightarrow \bm^{(k)}$, the aim being to  choose the
 functions $k(\bx)$ and $\bm^{(k)}$ so as to 
 minimize the {\dem{expected distortion}}, which might be  
 defined to be
\beq
	D = \sum_{\bx} P(\bx) \half \left[ \bm^{(k(\bx))} - \bx \right]^2 .
\eeq
% Vector quantization is used in some lossy  image compression algorithms
% which represent small patches of image using a small alphabet of 
% template images.
 [The ideal objective function  would be 
 to minimize the psychologically perceived distortion of the image. 
 Since it is hard to quantify the distortion perceived by a human, 
 vector quantization and \ind{lossy compression}\index{compression!lossy}
 are not such crisply defined 
 problems as {data modelling} and lossless compression.]\index{modelling}
	In vector quantization, we don't necessarily believe that
 the templates $\{  \bm^{(k)}\}$ have any natural meaning; they 
 are simply tools to do a job.  We note in passing
 the similarity of the assignment rule (\ie, the encoder)
 of vector quantization  to the  {\em decoding\/} problem
 when decoding an error-correcting code.\index{connection between!vector quantization and error-correction}
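
 The two rules and the distortion can be sketched in a few lines of Python.
 This is a minimal sketch under assumptions of my own: the templates are
 fixed by hand, the distribution $P(\bx)$ is taken to be the empirical
 distribution over a short list of points, and all the names are hypothetical.

```python
def half_sq_dist(a, b):
    """Half the squared Euclidean distance between two points."""
    return 0.5 * sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def assign(x, templates):
    """Assignment rule x -> k(x): index of the nearest template."""
    return min(range(len(templates)), key=lambda k: half_sq_dist(templates[k], x))

def reconstruct(k, templates):
    """Reconstruction rule k -> m^(k)."""
    return templates[k]

def expected_distortion(points, templates):
    """D, with P(x) taken to be uniform over the given points."""
    total = sum(half_sq_dist(reconstruct(assign(x, templates), templates), x)
                for x in points)
    return total / len(points)

templates = [(0.0, 0.0), (4.0, 4.0)]                       # K = 2 templates
points = [(0.0, 1.0), (1.0, 0.0), (4.0, 3.0), (5.0, 4.0)]  # N = 4 data points
labels = [assign(x, templates) for x in points]
print(labels, expected_distortion(points, templates))
```

 Only the list of labels needs to be transmitted; the receiver, who holds
 the same templates, applies the reconstruction rule to each label.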

 A third reason for making a cluster model is that failures of the
 cluster model may highlight interesting  objects that deserve
 special attention.
 If we have trained a vector quantizer to do a good job of compressing
 satellite pictures of ocean surfaces, then maybe  patches of image that
 are not well compressed by the vector quantizer  are the patches that
 contain ships!
 If the biologist encounters a green thing and sees it run (or slither) away,
 this misfit with his cluster model (which says green things don't run
 away) cues him to pay special attention.  One can't spend all one's time being
 fascinated by things; the cluster model can help sift out from the
 multitude of objects in one's world the ones that really deserve attention.

\amarginfig{c}{
\begin{center}\small
\hspace{-0.54in}\psfig{figure=octave/kmeans/ps1/data.ps,width=1.65in,angle=-90}
\end{center}
\caption[a]{$N=40$ data points.}
}A fourth reason for liking clustering algorithms is that
 they may serve as models of learning processes in neural systems.
 The  clustering algorithm that we now discuss, the K-means\index{learning algorithms!competitive learning}
 algorithm, is an example of a {\dem\ind{competitive learning}\/} algorithm.
 The algorithm works by having the $K$ clusters compete with
 each other for the right to own the data points.

% At the heart of a clustering method there is always an {\dem{assignment rule}},
% a method for allocating a point $\bx$ to one of the $K$ clusters.
% Often this rule takes the form `
%
%
% Motivations for clustering. 
%\ben
%\item
%	Similarity of clustering assignment step to decoding.
%\item
% Clustering as mixture density modelling.
% If we adopt the attitude of density modelling, then our aim is
% to find a good description of the observed data in terms 
% of a mixture of probability densities. In contrast to vector quantization, 
% we are likely to view the underlying clusters as having a natural meaning. 
%
% For example, if we model handwritten characters with a mixture 
% model, we might intend  each cluster to  correspond to a different 
% character; if we model protein sequences with a mixture model, 
% we might think of the clusters as representing protein families
% all of whose members descended by evolution from a common protein ancestor. 
%\een


\section{K-means clustering}
 The\marginpar{\small\raggedright
%\begin{aside}
{\sf About the name...}
 As far as I know, the `K' in K-means clustering 
 simply refers to the chosen number of clusters. 
 If Newton had followed the same naming policy, maybe
 we would learn at school about `calculus for the variable $x$'. 
 It's a silly name, but we are stuck with it.
%\end{aside}
}
 K-means algorithm is a method\indexs{K-means clustering}
 for putting $N$ data points in an $I$-dimensional space
 into $K$ clusters.
 Each cluster is parameterized by a vector $\bm^{(k)}$ called its mean.

 The data points will be denoted by  $\bx^{(n)}$ where the  superscript $n$
 runs from 1 to the number of data points $N$. Each vector $\bx$ has
 $I$ components $x_i$.
 We will assume that the 
 space that $\bx$ lives in
 is a real space and that we have a metric that defines distances between points,
for example,
\beq
	d(\bx,\by) = \half \sum_i (x_i - y_i )^2 .
\eeq

% \subsection{The K-means algorithm}
 To start the K-means algorithm (\algref{alg.kmeans}), the $K$
% parameter vectors called the
 means $\{\bm^{(k)}\}$
 are initialized in some way, for example to random values.
 K-means is then an iterative two-step algorithm.
 In the {\dem{assignment step}},
 each data point $n$ is assigned to the nearest mean.
 In the {\dem{update step}}, the means are adjusted to match 
 the sample means of the data points that they are responsible for.

%\newcommand{\rnk}{r^{(n)}_k}
%\newcommand{\hkn}{\hat{k}^{(n)}}

\begin{algorithm}[htbp]
\algorithmmargin{%
\begin{description}
\item[Initialization\puncspace] Set $K$ means $\{ \bm^{(k)} \}$ to random  values.
\item[Assignment step\puncspace]
 Each data point $n$ is assigned to the nearest mean. We denote our 
 guess for the cluster $k^{(n)}$ that the point $\bx^{(n)}$ belongs to 
 by $\hkn$.
\beq
	\hkn = \argmin_k \{ d(\bm^{(k)} ,\bx^{(n)} ) \} .
\eeq
 An alternative, equivalent representation of this assignment of  points to clusters
 is given by `responsibilities', which are indicator variables $\rnk$. 
 In the assignment step, we set $\rnk$ to one if mean $k$ is the closest
 mean to datapoint $\bx^{(n)}$; otherwise  $\rnk$ is zero.
\beq
 \rnk = \left\{
 \begin{array}{ccc} 1 &\mbox{ if } & \hkn  = k
 \\ 0 & \mbox{ if } & \hkn \neq k  .
\end{array} \right.  
\eeq
\noindent
{\em What about ties?} --
 We don't expect two means to be exactly the same distance from 
 a data point,
 but if a tie does happen, $\hkn$ is set to the smallest of the 
 winning $\{ k \}$.	

\item[Update step\puncspace]% also called Adaptation or Reestimation
 The model parameters, the means, are adjusted to match 
 the sample means of the data points that they are responsible for.
\beq
	\bm^{(k)} = \frac{ \displaystyle \sum_{n} \rnk \bx^{(n)} }{ R^{(k)} }
\eeq
 where $R^{(k)}$ is the total responsibility of mean $k$, 
\beq
 R^{(k)} = \sum_{n}   \rnk  .
\eeq

 {\em What about means with no responsibilities?} --
 If $R^{(k)} = 0$, then we  leave  the mean $\bm^{(k)}$ where it is.
\item[Repeat  the assignment step and update step]
 until the assignments do not change.
\end{description}
}{
\caption{The K-means clustering algorithm.\indexs{learning algorithms!K-means clustering}}
\label{alg.kmeans}
}
\end{algorithm}
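
 The steps of \algref{alg.kmeans} can be sketched in a few lines of Python.
 This is an illustrative sketch, not the code that produced the figures;
 the function names ({\tt dist}, {\tt assign\_step}, {\tt update\_step},
 {\tt energy}, {\tt kmeans}) and the toy data are our own assumptions.
 Ties go to the smallest $k$, and a mean with no assigned points is left
 where it is, as in the algorithm box.

\begin{verbatim}
```python
import random

def dist(x, y):
    """d(x, y) = 1/2 sum_i (x_i - y_i)^2, the distance used in the text."""
    return 0.5 * sum((xi - yi) ** 2 for xi, yi in zip(x, y))

def assign_step(X, means):
    """Nearest mean for each point; min() returns the smallest k on a tie."""
    return [min(range(len(means)), key=lambda k: dist(means[k], x)) for x in X]

def update_step(X, assign, means):
    """Move each mean to the sample mean of the points assigned to it."""
    new_means = []
    for k, old in enumerate(means):
        members = [x for x, a in zip(X, assign) if a == k]
        if not members:                 # R^(k) = 0: leave the mean where it is
            new_means.append(list(old))
            continue
        new_means.append([sum(col) / len(members) for col in zip(*members)])
    return new_means

def energy(X, assign, means):
    """Total distance between each point and the mean responsible for it."""
    return sum(dist(means[a], x) for x, a in zip(X, assign))

def kmeans(X, means, max_iter=100):
    """Alternate the two steps until the assignments do not change."""
    means = [list(m) for m in means]
    assign, energies = None, []
    for _ in range(max_iter):
        new_assign = assign_step(X, means)
        if new_assign == assign:
            break
        assign = new_assign
        means = update_step(X, assign, means)
        energies.append(energy(X, assign, means))
    return means, assign, energies

# Toy data (an assumption for illustration): two well-separated blobs.
random.seed(0)
X = [[random.gauss(0, 0.5), random.gauss(0, 0.5)] for _ in range(20)]
X += [[random.gauss(20, 0.5), random.gauss(0, 0.5)] for _ in range(20)]
means, assign, energies = kmeans(X, means=[X[0], X[20]])
```
\end{verbatim}

 The list {\tt energies} records the total distance between the points and
 their means after each update step; it should never increase from one
 iteration to the next.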

 {The K-means algorithm} is demonstrated for a toy
 two-dimensional data set in \figref{fig.kmeans.2},
 where 2 means are used. The assignments of the points to
 the two clusters are indicated by two point styles, and the
 two means are shown by the circles.
% 
 The algorithm converges after three iterations, at which point
 the assignments
 are unchanged so the means remain unmoved when updated.
 The K-means algorithm always converges to a fixed point.
\exercissxC{4}{ex.proveconverge}{
 See if you can prove that K-means always converges. [Hint: find a
 physical analogy and an associated \ind{Lyapunov function}.]

 [A Lyapunov function is a function of the state of the
 algorithm that decreases whenever the state changes
 and that is bounded below.
 If a system has a Lyapunov function then its dynamics converge.]
% You might like to try to  prove this fact. We'll prove it in a few
% chapters's time.
}
 {The K-means algorithm} with a larger number of means, 4,
 is demonstrated  in \figref{fig.kmeans.4}.
 The outcome of the algorithm depends on the initial condition.
 In the first case, after five iterations,
 a steady state is found in which the data points
 are fairly evenly split between the four clusters. 
 In the second case, after six iterations,
 half the data points are in one cluster, and the others are
 shared among the other three clusters.
% 

\begin{figure}
\figuremargin{%
\begin{center}\small
\begin{tabular}{llllllll}
% \raisebox{1in}{(a)}\hspace{-0.3in}%
& \raisebox{0.81in}{Data:} &
\hspace{-0.54in}\psfig{figure=octave/kmeans/ps1/data.ps,width=1.65in,angle=-90}&\\
% \hline
 Assignment & Update  & Assignment & Update  & Assignment & Update  & \\
\hspace{-0.54in}\psfig{figure=octave/kmeans/ps1/5.2.ps,width=1.65in,angle=-90}&
\hspace{-0.54in}\psfig{figure=octave/kmeans/ps1/5.3.ps,width=1.65in,angle=-90}&
\hspace{-0.54in}\psfig{figure=octave/kmeans/ps1/5.4.ps,width=1.65in,angle=-90}&
\hspace{-0.54in}\psfig{figure=octave/kmeans/ps1/5.5.ps,width=1.65in,angle=-90}&
\hspace{-0.54in}\psfig{figure=octave/kmeans/ps1/5.6.ps,width=1.65in,angle=-90}&
\hspace{-0.54in}\psfig{figure=octave/kmeans/ps1/5.7.ps,width=1.65in,angle=-90}&
\\[0.12in]
\end{tabular}
\end{center}
}{%
\caption[a]{K-means algorithm applied to a data set of 40 points. $K=2$ means
 evolve to stable locations after three iterations.}
\label{fig.kmeans.2}
}%
\end{figure}


\begin{figure}
\figuremargin{%
\begin{center}\small
\begin{tabular}{*{6}{l}}
Run 1\\
\hspace{-0.45in}\psfig{figure=octave/kmeans/ps1/15.2.ps,width=1.50in,angle=-90}&
\hspace{-0.60in}\psfig{figure=octave/kmeans/ps1/15.4.ps,width=1.50in,angle=-90}&
\hspace{-0.60in}\psfig{figure=octave/kmeans/ps1/15.6.ps,width=1.50in,angle=-90}&
\hspace{-0.60in}\psfig{figure=octave/kmeans/ps1/15.8.ps,width=1.50in,angle=-90}&
\hspace{-0.60in}\psfig{figure=octave/kmeans/ps1/15.10.ps,width=1.50in,angle=-90}
%&
%\hspace{-0.60in}\psfig{figure=octave/kmeans/ps1/15.11.ps,width=1.50in,angle=-90}
\\[0.12in]
Run 2\\
\hspace{-0.45in}\psfig{figure=octave/kmeans/ps1/16.2.ps,width=1.50in,angle=-90}&
\hspace{-0.60in}\psfig{figure=octave/kmeans/ps1/16.4.ps,width=1.50in,angle=-90}&
\hspace{-0.60in}\psfig{figure=octave/kmeans/ps1/16.6.ps,width=1.50in,angle=-90}&
\hspace{-0.60in}\psfig{figure=octave/kmeans/ps1/16.8.ps,width=1.50in,angle=-90}&
\hspace{-0.60in}\psfig{figure=octave/kmeans/ps1/16.10.ps,width=1.50in,angle=-90}&
\hspace{-0.60in}\psfig{figure=octave/kmeans/ps1/16.12.ps,width=1.50in,angle=-90}
\\[0.12in]
\end{tabular}
\end{center}
}{%
\caption[a]{K-means algorithm applied to a data set of 40 points.
 Two separate runs, both  with $K=4$ means, reach different solutions.
 Each frame shows a successive assignment step.}
\label{fig.kmeans.4}
}%
\end{figure}

% Fri 29/6/01 removed k=5 figure to graveyard
%\label{fig.kmeans.5}


\subsection{Questions about this algorithm}
 The K-means algorithm has several {\em ad hoc\/}
 features.
 Why does the update step set the `mean' to the mean of the assigned points?
% What if there were a few outliers?
% Outlying data points can have a big influence on the mean!
 Where did  the distance $d$ come from? What if we used a different
 measure of distance between $\bx$ and $\bm$? How can
 we choose the `best' distance? [In vector quantization,
 the distance function is provided as
 part of the problem definition; but I'm assuming
 we are interested in data-modelling rather than  vector quantization.]
%-- it's  a measure of perceived distortion,
% whose expectation  we wish to minimize --
% but in mixture density modelling, the choice of distance 
% corresponds to a choice of density. The choice of distance certainly  can have 
% an effect on the resulting clusters, as we'll see in a moment.]
 How do we choose $K$? Having found multiple alternative clusterings for
 a given $K$, how can we choose among them?

% How to do spaces other than real spaces? For example 
% categorical spaces. 
%
% Choice of number of clusters
%
% What about clusters with unequal width.
%
% And clusters with unequal weight.

\begin{figure}
\figuremargin{\small%
\begin{center}
\mbox{%
\raisebox{1in}{(a)}\hspace{-0.3in}%
\psfig{figure=octave/kmeans/xbs75.ps,%
width=2.4in,angle=-90}%
\hspace{0.3in}%
\raisebox{1in}{(b)}\hspace{-0.3in}%
\psfig{figure=octave/kmeans/xbs75m.ps,%
width=2.4in,angle=-90}}%
\end{center}
}{%
\caption[a]{K-means algorithm for a case with two dissimilar clusters.
 (a) The\index{little 'n' large data set}\index{data set}
%{!little 'n' large}
 ``little 'n' large'' data. (b) A stable set of assignments and means.
 Note that four points belonging to the broad cluster have been incorrectly
 assigned to the narrower cluster. (Points assigned to the right-hand cluster
 are shown by plus signs.)}
\label{fig.kmeans.xbs}
}%
\end{figure}

\begin{figure}
\figuremargin{\small%
\begin{center}
\mbox{%
\raisebox{1in}{(a)}\hspace{-0.03in}%
\psfig{figure=octave/kmeans/ps3/30.1.ps,%
width=2in,angle=-90}%
\hspace{0.3in}%
\raisebox{1in}{(b)}\hspace{-0.03in}%
\psfig{figure=octave/kmeans/ps3/31.9.ps,%
width=2in,angle=-90}}%
\end{center}
}{%
\caption[a]{Two elongated clusters, and
 the stable solution found by the K-means algorithm.}
\label{fig.kmeans.lozenge}
}%
\end{figure}
\subsection{Cases where K-means might be viewed as failing}
% We can deliberately construct examples where K-means 
% gives inadequate answers, from a density modelling perspective.


%\subsubsection{Outliers}
% Similarly, one  or two outlying data points can have a large
% effect on the stable state of the K-means algorithm. 
% The sample mean is a good estimator of the centre of a {\em Gaussian\/} 
% distribution, but if a cluster is not Gaussian in shape, then 
% the sample mean is not the most robust estimator of the centre 
% of the cluster. 

 Further questions arise when we look for cases where
 the algorithm behaves badly (compared with  what
 the man in the street would call `clustering').
% \Figref{fig.kmeans.xbs}a shows a data set which evidently
% contains two clusters -- a big one and a small one.
 \Figref{fig.kmeans.xbs}a shows a set of 75 data points 
 generated from a mixture of two Gaussians. The
 right-hand Gaussian
% centred  at $(8,5)$ differs from that centred at $(3,5)$ in two ways: 
% it
 has less weight (only one fifth of the data points),  
 and it is a less broad cluster. 
 \Figref{fig.kmeans.xbs}b shows  the outcome of using
 K-means clustering with $K=2$ means. Four of the big cluster's
 data points have been assigned to the small cluster,
 and both means end up displaced
 to the left of the true centres of the   clusters.
 The K-means algorithm  takes account only of the distance between 
 the means and the data points; it has no representation of the 
 weight or breadth of each cluster. Consequently, data points that 
 actually belong to the broad cluster are incorrectly
 assigned to the narrow cluster. 
%Can get silly answers, see the big'n'small example.
% Algorithm implicitly assumes clusters have similar size and 
% similar weight. 
%\subsubsection{Unequal weight and unequal width clusters}
% Once the algorithm converges, as shown in \figref{fig.kmeans.xbs}b, 
% the means have both become displaced to the left from their 
% correct locations. 

 \Figref{fig.kmeans.lozenge} shows another case of K-means
 behaving badly. The data evidently fall into two elongated clusters.
 But the only stable state of the K-means algorithm is that shown in
 \figref{fig.kmeans.lozenge}b: the two clusters have been sliced
 in half!
%  at their midpoints
 These two examples show that there is something wrong with the
 distance $d$ in the K-means algorithm.
  The K-means algorithm has no way of
 representing the size or shape of a cluster.

 A final criticism of K-means is that it is  a `hard' rather than a `soft' algorithm:
 points are assigned to exactly one  cluster and 
 all points assigned to a cluster are equals in that cluster. 
 Points located near the border between two or more clusters
 should, arguably, play a {\em{partial}\/}
 role in determining the locations   of all the clusters
 that they could plausibly be assigned to. But in the K-means algorithm,
 each borderline point is dumped in one cluster, and has an equal vote
 with all the other points in that cluster,  and no vote in any other clusters.

\section{Soft K-means clustering}
 These criticisms of K-means motivate  the `soft K-means algorithm',\indexs{learning algorithms!K-means clustering}\index{K-means clustering!soft}\index{soft K-means clustering}
 \algref{alg.softkmeans1}. The algorithm has one parameter, $\beta$,
 which we could term the {\dem\ind{stiffness}}.
% , stiff being  the opposite of soft. 

% Soft version. Write algorithm, showing how similar it is
% to hard K-means.
% Could demonstrate the repulsion effect of hard K-means
% when two clusters overlap. Hard to make convincing because
% human can't see two clusters in there.

% BOX THIS and assign it a number, algorithm 23.x
% first arg is the algm, 2nnd is the title
\begin{algorithm}[htbp]
\algorithmmargin{%
\begin{description}
\item[Assignment step\puncspace]
 Each data point $\bx^{(n)}$ is given a soft `degree of assignment'
 to each of the means. We call the degree to which  $\bx^{(n)}$
 is assigned to cluster $k$ the {\dem{\ind{responsibility}}} $r_k^{(n)}$
 (the responsibility  of  cluster $k$ for point $n$).
\beq
	\rnk
% r_k^{(n)}
 = \frac{ \exp \left( - \beta \, d(\bm^{(k)} ,\bx^{(n)}) \right) }
		{\sum_{k'}  \exp \left( -\beta \, d(\bm^{(k')} ,\bx^{(n)}) \right) } .
\label{eq.softminr}
\eeq
 The sum of the $K$ responsibilities for the $n$th point is 1.

\item[Update step\puncspace]% also called Adaptation or Reestimation
 The model parameters, the means, are adjusted to match 
 the sample means of the data points that they are responsible for.
\beq
	\bm^{(k)} = \frac{ \displaystyle \sum_{n} \rnk \bx^{(n)} }{ R^{(k)} }
\eeq
 where $R^{(k)}$ is the total responsibility of mean $k$, 
\beq
 R^{(k)} = \sum_{n}   \rnk  .
\eeq
\end{description}
}{
%{\sf Soft K-means algorithm, version 1}
\caption{Soft K-means algorithm, version 1.}
\label{alg.softkmeans1}
}
\end{algorithm}
 Notice the similarity of this soft K-means algorithm
% \ref{alg.softkmeans1}
 to the
 hard K-means algorithm  \ref{alg.kmeans}.
 The update step is identical; the only difference is
 that the responsibilities\index{responsibility} $\rnk$ can take on values
 between 0 and 1.
 Whereas the assignment $\hkn$ in the K-means algorithm
 involved a `min' over the distances,
 the rule for assigning the responsibilities is
 a `soft-min' (\ref{eq.softminr}).\index{softmax, softmin}
\exercisxB{2}{ex.stiffnessKmeans}{
 Show that as the stiffness $\beta$ goes to $\infty$, the  soft
 K-means algorithm becomes identical to the original hard K-means
 algorithm, except for the way in which means with no
 assigned points behave. Describe what those means do instead of
 sitting still.
}

 Dimensionally, the stiffness $\beta$ is an inverse-length-squared,
 so we can associate a lengthscale, $\sigma \equiv 1/\sqrt{\beta}$, with it.
% the value of $\beta$.
 The soft K-means algorithm is demonstrated in \figref{fig.skmeans.2d}.
 The lengthscale is shown by the radius of the circles surrounding the
 four means.
 Each panel shows the final fixed point reached for a different value of
 the lengthscale  $\sigma$.
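
 A minimal Python sketch of \algref{alg.softkmeans1} follows, assuming the
 distance $d(\bx,\by) = \half \sum_i (x_i - y_i)^2$ used for hard K-means.
 The function names and the four-point data set are illustrative
 assumptions. The only change from hard K-means is the assignment step,
 which evaluates the soft-min (\ref{eq.softminr}).

\begin{verbatim}
```python
import math

def dist(x, y):
    """d(x, y) = 1/2 sum_i (x_i - y_i)^2."""
    return 0.5 * sum((xi - yi) ** 2 for xi, yi in zip(x, y))

def responsibilities(x, means, beta):
    """Soft-min: r_k is proportional to exp(-beta * d(m_k, x))."""
    d = [dist(m, x) for m in means]
    d_min = min(d)   # subtracting the minimum leaves the ratios unchanged
    w = [math.exp(-beta * (dk - d_min)) for dk in d]
    total = sum(w)
    return [wk / total for wk in w]   # the K responsibilities sum to 1

def soft_kmeans(X, means, beta, n_iter=50):
    """Run a fixed number of assignment/update iterations."""
    means = [list(m) for m in means]
    dim = len(X[0])
    for _ in range(n_iter):
        R = [responsibilities(x, means, beta) for x in X]    # assignment step
        for k in range(len(means)):                          # update step
            Rk = sum(r[k] for r in R)                        # total responsibility
            if Rk > 0.0:   # guard against floating-point underflow at large beta
                means[k] = [sum(r[k] * x[i] for r, x in zip(R, X)) / Rk
                            for i in range(dim)]
    return means

# Toy one-dimensional data (an assumption for illustration).
X = [[-2.0], [-1.0], [1.0], [2.0]]
start = [[-0.5], [0.5]]
soft = soft_kmeans(X, start, beta=0.01)    # long lengthscale
stiff = soft_kmeans(X, start, beta=50.0)   # short lengthscale
```
\end{verbatim}

 With this toy data, the small value of $\beta$ should pull both means to
 the data mean, and the large value should reproduce the hard K-means
 answer, with one mean in the middle of each pair of points.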
% ,  with large lengthscale at the top left and short lengthscale at the bottom right.

\section{Conclusion}
 At this point, we may have fixed some of the problems with the original
 K-means algorithm by introducing an extra {complexity-control}\index{complexity control} parameter $\beta$.
% whose value controls the algorithm's outcome.
 But how should we set $\beta$?
 And what about the problem of the  elongated clusters, and
 the clusters of unequal weight and width? Adding one  stiffness
 parameter $\beta$ is not going to make all these problems go away.

 We'll come back to these questions in a later chapter,
 as we develop the mixture-density-modelling view of clustering.

\section*{Further reading}
 For a \index{vector quantization}{vector-quantization} approach
 to clustering see \cite{Luttrell89d,Luttrell_IEEE90}.

\section{Exercises}

\exercissxB{3}{ex.softkmeans}{
 Explore the properties of the soft K-means algorithm,
 version 1,
% (\pageref{alg.softkmeans})
 assuming that the
 datapoints $\{\bx\}$ come from a {\em single\/} separable two-dimensional Gaussian
 distribution with mean zero and  variances $(\var(x_1),\var(x_2)) =
 (\sigma^2_1, \sigma^2_2)$, with $\sigma^2_1 > \sigma^2_2$.
 Set $K=2$, assume $N$ is large,  and investigate the fixed points of the
 algorithm as  $\beta$ is varied. [Hint: assume that $\bm^{(1)} = (m,0)$
 and  $\bm^{(2)} = (-m,0)$.]
}
% Discuss dependence of algorithm on $\beta$.
% Show the bifurcations as $\beta$ varies.

% here are the filenames for a fixed variance sequence in ps5
\begin{figure}
\figuremargin{%
\begin{center}\small
\begin{tabular}{*{6}{l}}
\multicolumn{4}{l}{\makebox[0in][l]{Large $\sigma$ $\ldots$}}\\
\softfc{1.39}&
\softfc{2.35}&
\softfc{3.35}&
\softfc{4.49}\\
\multicolumn{4}{c}{\makebox[0in][l]{$\ldots$}}\\
\softfc{5.51}&
\softfc{6.37}&
\softfc{7.69}&
\softfc{8.119}\\
\multicolumn{4}{r}{\makebox[0in][r]{$\ldots$ small $\sigma$}}\\
\softfc{9.37}&
\softfc{10.35}&
%\softfc{11.35}&
%\softfc{12.35}&
%\softfc{13.35}\\&
\softfc{14.35}&
%\softfc{15.35}&
%\softfc{16.35}&
%\softfc{17.35}\\&
\softfc{18.35}\\
%\softfc{19.35}&
%\softfc{20.35}&
%\softfc{21.35}\\&
%\softfc{22.35}&\\
%\softfc{23.35}&
%\softfc{24.35}&
%\softfc{25.35}\\
\end{tabular}
\end{center}
}{%
\caption[a]{Soft K-means algorithm, version 1,
 applied to a data set of 40 points. $K=4$.
 Implicit lengthscale parameter $\sigma=1/\beta^{1/2}$ varied from
 a large to a small value.
 Each picture shows the  state of
 all four means, with the implicit lengthscale
 shown by the radius of the four circles,
  after running the algorithm
 for several tens of iterations.
 At the largest lengthscale,
 all four means converge exactly to the
 data mean. Then the four means separate into
 two groups of two. At  shorter lengthscales,
 each of these pairs itself
 bifurcates into subgroups.
}% 3 down to 0.3.}
\label{fig.skmeans.2d}
}%
\end{figure}
\label{sec.SOFT-KMEANS}
% Give probabilistic interpretation -- no, given later in enumerate.tex?
% Refer forward to that exercise in which the algorithm was derived by the reader.

% \section{Exercises}
\exercisxB{3}{ex.repelkmeans}{
	Consider  the soft K-means%
\amarginfignocaption{t}{
\mbox{\psfig{figure=figs/m2g.ps,width=1.7in,angle=-90}}
}
 algorithm applied to a large amount of one-dimensional data
 that comes from a mixture of two equal-weight Gaussians
 with true  means $\mu=\pm 1$ and standard deviation $\sigma_P$,
 for example $\sigma_P=1$.
 Show that  the hard K-means algorithm with $K=2$
 leads to a solution in which the two means are
 further apart than the two true means.
 Discuss what happens for other values of $\beta$,
% in particular the value $\beta = 1/\sigma_P^2$.
 and find the value of $\beta$ such that the soft algorithm
 puts the two means in the correct places.
}


\section{Solutions}
\soln{ex.proveconverge}{
	We can associate an `\ind{energy}' with the state of the K-means algorithm
 by connecting a spring  between  each point $\bx^{(n)}$
 and the mean that is responsible for it.
 The energy of one spring is proportional to its squared-length, namely
 $\b d(\bx^{(n)}, \bm^{(k)})$ where  $\b$ is the  stiffness of the spring.
 The 
 total energy of all the \ind{spring}s is a {\dem\ind{Lyapunov function}\/} for the algorithm,
 because
%\ben
%\item
 (a)
 the assignment step can only decrease the  energy -- a point
 only changes its allegiance if the length of its spring would be reduced;
%\item
 (b)
 the update step can only decrease the energy -- moving $\bm^{(k)}$ to
 the  mean
% centre of mass
 is the way to minimize the energy of its springs; and
%\item
 (c) the  energy is bounded below -- which is the second condition for a Lyapunov
 function.
%\een
 Since the algorithm has a Lyapunov function, it converges.
}
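
 Claim (b) can be made explicit. As a sketch in the notation of
 \algref{alg.kmeans}, with the distance
 $d(\bx,\by) = \half \sum_i (x_i - y_i)^2$, the total energy is
\beq
	E = \sum_n \sum_k \rnk \, \b \, d(\bx^{(n)},\bm^{(k)}) .
\eeq
 Holding the responsibilities fixed and differentiating with respect to
 $\bm^{(k)}$,
\beq
	\frac{\partial E}{\partial \bm^{(k)}}
	= \b \sum_n \rnk \, ( \bm^{(k)} - \bx^{(n)} ) = 0
	\:\:\: \Rightarrow \:\:\:
	\bm^{(k)} = \frac{ \displaystyle \sum_n \rnk \bx^{(n)} }{ \displaystyle \sum_n \rnk } ,
\eeq
 which is the update step whenever $R^{(k)} > 0$. Since $E$ is quadratic
 in $\bm^{(k)}$ with positive curvature $\b R^{(k)}$, this stationary
 point is the minimum.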
\soln{ex.softkmeans}{
 If the means are initialized to   $\bm^{(1)} = (m,0)$
 and  $\bm^{(2)} = (-m,0)$, the assignment step for a point at location $(x_1,x_2)$
 gives
\amarginfig{c}{
\begin{center}
\mbox{\psfig{figure=figs/gallager/clusterbelow.ps,%
width=1.75in,angle=-90}}\\[0in]
\end{center}
%}{%
\caption[a]{Schematic diagram of the \ind{bifurcation} as the largest data standard deviation
 $\sigma_1$ increases from below $1/\beta^{1/2}$
 to above  $1/\beta^{1/2}$. The spread of the data is indicated by the ellipse.}
\label{fig.kmeansbifurc1}
}%
%%% (\cf\ \exerciseref{ex.easyclassificationexample}) 
\beqan
	r_1(\bx) &=& \frac{ \exp ( - \beta (x_1-m)^2 / 2 ) }
		{ \exp ( - \beta (x_1-m)^2 / 2 ) + \exp ( - \beta (x_1+m)^2 / 2 ) }
\\
&=&
\frac{1}{1  + \exp ( - 2 \beta  m x_1 ) } ,
\eeqan
 and the updated $m$ is
\beqan
	m' & =& \frac{ \int \d x_1 \: P(x_1) \, x_1\,  r_1 (\bx) }
			{ \int \d x_1 \: P(x_1) \, r_1 (\bx) }
\\
	&=& 2   \int \d x_1 \:
%\frac{1}{\sqrt{2\pi} \sigma_1}}		\exp ( - x_1^2/ ( 2 \sigma_1^2) )
	P(x_1) \, 
	x_1 \,
	\frac{1}{1  + \exp ( - 2 \beta  m x_1 ) }.
\eeqan
 Now,%
\amarginfig{t}{
\begin{center}
\mbox{\psfig{figure=gnu/kmeansbi.ps,%
width=1.7in,angle=-90}}\vskip 0.1in
\end{center}
%}{%
\caption[a]{The stable mean locations as a function of 
 $\sigma_1$, for constant $\b$, found numerically (thick lines), and the
 approximation (\ref{eq.approxbifurc}) (thin lines). }
\label{fig.kmeansbifurc}
}
 $m=0$ is a fixed point, but the question is,
 is it  stable or unstable?
 For  tiny $m$ (that is, $\beta \sigma_1 m \ll 1$), we can Taylor
 expand
%
\beq
	\frac{1}{1  + \exp ( - 2 \beta  m x_1 )} \simeq
	\frac{1}{2} ( 1 + \beta  m x_1 ) + \cdots
\eeq
 so
\beqan
	m' & \simeq &   \int \d x_1 \:
% \frac{1}{\sqrt{2\pi \sigma_1^2}}		\exp ( - x_1^2/ ( 2 \sigma_1^2) )
	P(x_1) \:
	x_1 \:
		 ( 1 + \beta  m x_1 )
\\
%&=&  \int \d x_1 \:
%	P(x_1) \: x_1^2 \:
%		  \beta  m  
%\\
&=&  \sigma_1^2   \beta  m   .
\eeqan
 For small $m$, $m$ either grows or decays exponentially under this mapping, 
 depending on whether $\sigma_1^2   \beta$ is greater than or
 less than 1.
 The fixed point $m=0$ is {\em stable\/} if
\beq
	\sigma_1^2  \leq 1/ \beta
\eeq
 and 
 {\em unstable\/} otherwise.
 [Incidentally, this derivation shows that this result
 is general, holding for any true probability
 distribution  $P(x_1)$ having variance  $\sigma_1^2$,
 not just the Gaussian.]

 If $\sigma_1^2  > 1/ \beta$ then there is a \ind{bifurcation}
 and there are two stable fixed points surrounding the unstable
 fixed point at $m=0$.
% There are two ways to visualize this bifurcation. Either we can imagine
% an algorithm with fixed $\beta$ and look at what happens
% as we increase the variance of the data fed to it,
% or we can imagine attacking fixed data with various values of  $\beta$.
% On dimensional grounds we can think of $\beta$ as defining
% an inverse-variance, and $1/\beta^{1/2}$ as defining an implicit
% length scale
%% standard deviation
% in the algorithm.
% see kmeansoft/ms.ms
% see itp/gnu/kmeans.gnu
 To illustrate this bifurcation, \figref{fig.kmeansbifurc}
 shows the outcome of running the soft K-means
 algorithm with $\beta=1$
 on one-dimensional data with standard deviation $\sigma_1$
 for various values of $\sigma_1$.
% for four iterations, starting from initial mean locations $m = \pm 1$.
 \Figref{fig.kmeansbifurcinv}
 shows this \ind{pitchfork bifurcation} from the
 other point of view, where the
 data's standard deviation $\sigma_1$ is fixed and the
 algorithm's lengthscale  $\sigma = 1/\beta^{1/2}$
 is varied on the horizontal axis.%
\amarginfig{b}{
\begin{center}~\par
\mbox{\psfig{figure=gnu/kmeansbi-inv.ps,%
width=1.75in,angle=-90}}\vskip 0.1in
\end{center}
%}{%
\caption[a]{The stable mean locations as a function of 
 $1/\beta^{1/2}$, for constant $\sigma_1$.}
\label{fig.kmeansbifurcinv}
}


% adding another term in the expansion looked hopeless 
%  does it converge???
% We'll be able to show this is a standard pitchfork bifurcation
% once we have discussed the objective function
% that the K-means algorithm minimizes.
%
% Meanwhile, h
\begin{aside}
 Here is a cheap theory to model how the fitted parameters $\pm m$ behave
 beyond the bifurcation, based on continuing
 the series expansion. This continuation of the series is
 rather suspect, since the
 series isn't necessarily expected to converge
 beyond the bifurcation point, but the theory fits
 well anyway.

 We take our analytic approach one term further in the
 expansion
\beq
	\frac{1}{1  + \exp ( - 2 \beta  m x_1 )} \simeq
	\frac{1}{2} ( 1 + \beta  m x_1  - \frac{1}{3} ( \beta  m x_1)^3 ) + \cdots
\eeq
% (but this expansion may be invalid!)
 then we can solve for the shape of the bifurcation to leading order,
 which depends on the fourth moment of the distribution:
%\marginpar{\footnotesize{At (\ref{eq.gauss3m}) we use the fact that $P(x_1)$ is Gaussian to find the fourth moment.}}
\beqan
	m' & \simeq &   \int \d x_1 \:
	P(x_1)
% \frac{1}{\sqrt{2\pi \sigma_1^2}}		\exp ( - x_1^2/ ( 2 \sigma_1^2) )
	x_1
		 ( 1 + \beta  m x_1   - \frac{1}{3} ( \beta  m x_1)^3 )
\\
%&=&  \int \d x_1 \:
%% \frac{1}{\sqrt{2\pi \sigma_1^2}}		\exp ( - x_1^2/ ( 2 \sigma_1^2) )
%	P(x_1)
%	\left[ x_1^2		 \,  \beta  m
%	-  \frac{1}{3} ( \beta  m)^3 x_1^4  \right]
%\\
&=&  \sigma_1^2   \beta  m -  \frac{1}{3} ( \beta  m)^3 3 \sigma_1^4 .
 \label{eq.gauss3m}
%\\
%&=&  \sigma_1^2   \beta  m  ( 1 -   ( \beta  m)^2  \sigma_1^2   ) .
\eeqan
 [{At (\ref{eq.gauss3m}) we use the fact that $P(x_1)$ is Gaussian to find the fourth moment.}]
 This map has a fixed point at $m$ such that
\beq
	 \sigma_1^2   \beta  ( 1 -   ( \beta  m)^2  \sigma_1^2   )  = 1,
\eeq
\ie,
\beq
%	   ( \beta  m)^2  \sigma_1^2     =  ( 1 -  1/ (\sigma_1^2   \beta ) ),
%	    m     = \pm \frac{ ( 1 -  1/ (\sigma_1^2   \beta ) )^{1/2} }{ \beta  \sigma_1  } ,
% m = \pm \frac{ ( \sigma_1^2   \beta -  1 )^{1/2} }{  \sigma_1   \beta^{1/2} \beta  \sigma_1  } ,
 m = \pm  \beta^{-1/2} \frac{ ( \sigma_1^2   \beta -  1 )^{1/2} }{  \sigma_1^2   \beta  } .
\label{eq.approxbifurc}
\eeq

 The thin line in \figref{fig.kmeansbifurc}
 shows this theoretical approximation.
 \Figref{fig.kmeansbifurc} shows the bifurcation as a function of $\sigma_1$
 for fixed $\beta$; \figref{fig.kmeansbifurcinv} shows the bifurcation
 as a function of $1/\b^{1/2}$ for fixed $\sigma_1$.
\end{aside}
}
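
 The bifurcation can be checked numerically. The sketch below iterates the
 map $m \rightarrow m'$ for Gaussian $P(x_1)$, using the identity
 $2/(1+\exp(-2y)) = 1 + \tanh(y)$, so that for zero-mean data
 $m' = \int \d x_1 \: P(x_1) \, x_1 \tanh ( \beta m x_1 )$, and a simple
 midpoint-rule integral. The function names, grid size and iteration count
 are our own choices, not those used to make \figref{fig.kmeansbifurc}.

\begin{verbatim}
```python
import math

def update_m(m, beta, sigma1, n=2000, width=8.0):
    """One soft K-means update m -> m' for data ~ Normal(0, sigma1^2).

    Evaluates m' = E[x tanh(beta m x)] by the midpoint rule on
    [-width * sigma1, +width * sigma1].
    """
    h = 2.0 * width * sigma1 / n
    norm = math.sqrt(2.0 * math.pi) * sigma1
    total = 0.0
    for i in range(n):
        x = -width * sigma1 + (i + 0.5) * h
        p = math.exp(-0.5 * (x / sigma1) ** 2) / norm
        total += p * x * math.tanh(beta * m * x) * h
    return total

def fixed_point(beta, sigma1, m0=1.0, n_iter=200):
    """Iterate the map from m0 and return the final value of m."""
    m = m0
    for _ in range(n_iter):
        m = update_m(m, beta, sigma1)
    return m

def approx_m(beta, sigma1):
    """Series approximation to the stable m; zero below the bifurcation."""
    s = sigma1 ** 2 * beta
    if s <= 1.0:
        return 0.0
    return math.sqrt(s - 1.0) / (math.sqrt(beta) * s)

m_below = fixed_point(beta=1.0, sigma1=0.5)   # sigma1^2 < 1/beta
m_above = fixed_point(beta=1.0, sigma1=2.0)   # sigma1^2 > 1/beta
```
\end{verbatim}

 Below the bifurcation the iterates should collapse onto $m=0$; above it
 they should settle at a nonzero value. The function {\tt approx\_m}
 evaluates (\ref{eq.approxbifurc}) for comparison, and can be expected to
 agree with the numerical fixed point only close to the bifurcation, since
 it rests on a truncated series.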
\exercissxB{2}{ex.kmeansdetails}{
 Why does the pitchfork in \figref{fig.kmeansbifurcinv}
 tend to the values
 \mbox{$\sim \! \pm  0.8$} as $1/\beta^{1/2} \rightarrow 0$?
 Give an analytic expression for this asymptote.

}
\soln{ex.kmeansdetails}{
	The asymptote is the mean of the unit Gaussian restricted to $x>0$,
\beq
	\frac{\int_{0}^{\infty} \Normal(x,1) x \: \d x}{1/2}
	= \sqrt{ 2/\pi } \simeq 0.798 .
\eeq
}
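
 To spell out the steps, assuming that $\Normal(x,1)$ here denotes the
 zero-mean, unit-variance Gaussian density (so that the data have
 $\sigma_1 = 1$): as $1/\beta^{1/2} \rightarrow 0$ the responsibilities
 become hard, so each mean sits at the average of the data on its own side
 of the origin. The integral is elementary,
\beq
	\int_{0}^{\infty} \frac{1}{\sqrt{2\pi}} \, e^{-x^2/2} \, x \: \d x
	= \frac{1}{\sqrt{2\pi}} \left[ - e^{-x^2/2} \right]_{0}^{\infty}
	= \frac{1}{\sqrt{2\pi}} ,
\eeq
 and dividing by the probability mass $1/2$ on each side gives
 $2/\sqrt{2\pi} = \sqrt{2/\pi}$. For data with standard deviation
 $\sigma_1$ the same argument gives $\pm \sigma_1 \sqrt{2/\pi}$.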



%
%
\dvips
\chapter{Exact Inference by Complete Enumeration} 
%\chapter{Exact inference by complete enumeration}
\label{ch.enumerate}

% \section{Complete enumeration}
 We open our toolbox of methods for handling
 probabilities  by
 discussing a brute-force 
 inference 
% of handling
 method:  complete enumeration of all
 hypotheses, and evaluation of their probabilities.
 This approach is an exact method, and the difficulty of
 carrying it out will motivate the smarter exact
 and approximate methods introduced in the
 following chapters.

\section{The {burglar alarm} }
	Bayesian probability theory is sometimes called
 `common sense, amplified'.
 When thinking about the following questions, please ask your
 common sense what it thinks the answers are; we will then
 see how Bayesian methods confirm your everyday intuition.
% EXAMPLE 1 }
% Explaining away example -- earthquake/burglar?
% stolen from \input{tex/_e1b.tex}% contains earthquake - should be earlier
\fakesection{quake}%
%\begin{figure}
%\figuremargin{%
\amarginfig{t}{\small
\begin{center}
\begin{tabular}{c}
\setlength{\unitlength}{0.451mm}
\begin{picture}(70,50)(-20,-36)% 
\put(0,20){\circle{8.5}} % quake
\put(0,28){\makebox(0,0)[b]{Earthquake}} % quake
\put(5,15){\vector(1,-1){10}}
\put(40,20){\circle{8.5}} % buglar
\put(40,28){\makebox(0,0)[b]{Burglar}} 
\put(35,15){\vector(-1,-1){10}}
\put(20,0){\circle{8.5}} % alarm
\put(28,0){\makebox(0,0)[l]{Alarm}} 
\put(-5,15){\vector(-1,-1){10}}
\put(-20,0){\circle{8.5}} % q report radio
\put(-20,-8){\makebox(0,0)[t]{Radio}} 
\put(25,-5){\vector(1,-1){10}}
\put(40,-20){\circle{8.5}} % alarm report
\put(40,-28){\makebox(0,0)[t]{Phonecall}} 
\end{picture}
\\
\end{tabular}
%
\end{center}
%}{%
\caption[a]{Belief network for the burglar alarm problem.}
\label{fig.quake}
}%
%\end{figure}

%\subsection*{The {burglar alarm}}
%\exercisxA{1}{ex.burglar}{
\exampl{ex.burglar}{
 Fred\index{explaining away}\index{Bayesian belief networks}\index{earthquake and burglar alarm}\index{burglar alarm and earthquake}
 lives in Los Angeles and commutes 60 miles to work. Whilst at work, 
 he receives a phone-call from his neighbour saying that Fred's burglar 
 alarm is ringing. What is the probability that there was a burglar in his 
 house today? While driving home to investigate, Fred hears on the radio that 
 there was a small earthquake that day near his home.
 `Oh', he says, feeling relieved, `it was probably the earthquake that
 set off the alarm'.
% Given that 
% earthquakes sometimes set off burglar alarms (\figref{fig.quake}), 
% {\em now\/}
 What is the probability that there was
 a burglar in his  house?
 (After Pearl, 1988).

%{\em Aims of this problem: illustrate meaning of probability; 
% and show the subtlety of inverse probability: E and B are 
% independent, but given A they become dependent.
%
% \input{figs/quake_nums.tex}
}
% \input{tex/_s1b.tex} % earthquake solution
% \fakesection{quake}

 Let's introduce 
% You may make use of the following probability distributions relating 
 variables $b$ (a burglar was present in Fred's house today),
 $a$ (the alarm is ringing), $p$ (Fred receives a phonecall from the
 neighbour reporting the alarm),
 $e$ (a small earthquake
% capable of triggering burglar alarms
 took place today near Fred's house),
 and $r$ (the radio report of earthquake is heard by Fred).
 The probability of all these variables might factorize as follows:
\beq
	P( b, e, a, p , r ) = P(b) P(e) P(a\given b,e) P(p\given a) P(r\given e) ,
\eeq
 and plausible values for the probabilities are:
\begin{enumerate}
\item Burglar probability:
\beq
 P(b\eq 1) = \beta ,  \:\:\: P(b\eq 0) = 1-\beta , 
\eeq
 \eg, $\beta = 0.001$  gives a mean burglary rate of once every three
 years.
\item Earthquake probability:
\beq
 P(e\eq 1) = \epsilon , \:\:\: P(e\eq 0) = 1-\epsilon , 
\eeq
 with, \eg,  $\epsilon = 0.001$;
 our assertion that the earthquakes are independent of burglars, \ie, the
 prior probability of $b$ and $e$ is $P(b,e) = P(b)P(e)$,
 seems reasonable unless we take into account opportunistic burglars
 who strike immediately after earthquakes.
\item Alarm ringing probability: we assume
 the alarm will ring if {\em{any}\/} of
 the following
 three events happens:  (a) a burglar enters the house, and
 triggers the alarm (let's
 assume the alarm has a reliability of  $\alpha_b=0.99$,
 \ie, 99\% of burglars trigger the alarm);
 (b) an earthquake takes place, and triggers the alarm
 (perhaps $\a_e = 1$\% of earthquakes trigger the alarm?);
 or (c) some other event causes a false alarm; let's assume
 the false alarm rate $f$ is 0.001, so Fred has false alarms from
 non-earthquake causes once every
 three years.
 [{This type of dependence of $a$ on $b$ and $e$ is known as
 a `\ind{noisy-or}'.}]
 The probabilities of $a$ given $b$ and $e$ are then:
\[
\begin{array}{rclrcl}
  P(a\eq 0\given b\eq 0,\, e\eq 0) &=& (1-f)                         ,& P(a\eq 1\given b\eq 0,\, e\eq 0) &=& f \\
  P(a\eq 0\given b\eq 1,\, e\eq 0) &=& (1-f)(1-\alpha_b)             ,& P(a\eq 1\given b\eq 1,\, e\eq 0) &=& 1- (1-f)(1-\alpha_b) \\
  P(a\eq 0\given b\eq 0,\, e\eq 1) &=& (1-f)(1-\alpha_e)             ,& P(a\eq 1\given b\eq 0,\, e\eq 1) &=& 1- (1-f)(1-\alpha_e) \\
  P(a\eq 0\given b\eq 1,\, e\eq 1) &=& (1-f)(1-\alpha_b)(1-\alpha_e) ,& P(a\eq 1\given b\eq 1,\, e\eq 1) &=& 1- (1-f)(1-\alpha_b)(1-\alpha_e)
\end{array}
\]
 or, in numbers, 
\[
\begin{array}{rclrcl}
  P(a\eq 0\given b\eq 0,\, e\eq 0) &=& 0.999               ,& P(a\eq 1\given b\eq 0,\, e\eq 0) &=& 0.001 \\
  P(a\eq 0\given b\eq 1,\, e\eq 0) &=& 0.009\,99             ,& P(a\eq 1\given b\eq 1,\, e\eq 0) &=& 0.990\,01 \\
  P(a\eq 0\given b\eq 0,\, e\eq 1) &=& 0.989\,01             ,& P(a\eq 1\given b\eq 0,\, e\eq 1) &=& 0.010\,99  \\
  P(a\eq 0\given b\eq 1,\, e\eq 1) &=& 0.009\,890\,1           ,& P(a\eq 1\given b\eq 1,\, e\eq 1) &=& 0.990\,109\,9 . 		
\end{array}
\]
% with  $\alpha_b=0.99$, $f=0.001$, $\alpha_e=0.01$. 
\end{enumerate}
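
 The eight entries above follow one pattern. As a compact restatement
 (ours, not part of the list above), if $b$ and $e$ are treated as
 exponents taking the values 0 and 1, the noisy-or rule reads
\beq
	P(a\eq 0\given b,\, e) = (1-f)(1-\alpha_b)^{b}(1-\alpha_e)^{e} ,
	\:\:\:
	P(a\eq 1\given b,\, e) = 1 - P(a\eq 0\given b,\, e) .
\eeq
 For example, with $b\eq 1$ and $e\eq 0$ this gives
 $P(a\eq 0\given b\eq 1,\, e\eq 0) = 0.999 \times 0.01 = 0.009\,99$,
 in agreement with the table: the alarm stays silent only if every
 potential cause independently fails to trigger it.
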
 We assume the neighbour would never phone if the
 alarm is not ringing [$P(p\eq 1\given a\eq 0)=0$];
 and that the radio is a trustworthy reporter too [$P(r\eq 1\given e\eq 0)=0$]; 
 we won't need to specify the probabilities  $P(p\eq 1\given a\eq 1)$ or $P(r\eq 1\given e\eq 1)$
 in order to answer the questions above, since the outcomes $p\eq 1$
 and $r\eq 1$ give us certainty respectively that $a\eq 1$ and $e\eq 1$.

 We can answer the two questions about the burglar
 by computing the posterior probabilities of all hypotheses
 given the available information.
 Let's start by reminding
 ourselves that the probability that there is a burglar,
 before either $p$ or $r$ is observed, is $P(b\eq 1)=\b=0.001$,
 and the probability that an earthquake took place is $P(e\eq 1) = \epsilon = 0.001$,
 and these two propositions are {\em independent}.

 First, when $p\eq 1$,
 we know that the alarm is ringing: $a\eq 1$.
 The posterior probability of $b$ and $e$ becomes:
\beq
	P(b,e\given a\eq 1) = \frac{ P(a\eq 1 \given  b,e ) P(b) P(e ) }{ P(a\eq 1) } .
\eeq
 The numerator's four possible values  are 
\[
\begin{array}{rcl@{\times}l@{\times}lcl}
P(a\eq 1\given b\eq 0,\, e\eq 0)\times P(b\eq 0)\times P(e\eq 0) &=& 0.001       & 0.999 & 0.999 &=& 0.000\,998 \\
P(a\eq 1\given b\eq 1,\, e\eq 0)\times P(b\eq 1)\times P(e\eq 0) &=& 0.990\,01   & 0.001 & 0.999  &=&0.000\,989  \\
P(a\eq 1\given b\eq 0,\, e\eq 1)\times P(b\eq 0)\times P(e\eq 1) &=& 0.010\,99   & 0.999 & 0.001 &=&0.000\,010\,979 \\
P(a\eq 1\given b\eq 1,\, e\eq 1)\times P(b\eq 1)\times P(e\eq 1) &=& 0.990\,109\,9 & 0.001 & 0.001 &=& 9.9\times 10^{-7} .
\end{array}
\]
 The normalizing constant is the sum of these four numbers, 
$P(a\eq 1) = 0.002$,
% 0.0019989901099
% pr z
% z = 0.001      * 0.999 * 0.999+ 0.99001    * 0.999 * 0.001 + 0.01099    * 0.001 * 0.999 + 0.9901099  * 0.001 * 0.001 
 and the posterior probabilities are
\beq
\begin{array}{rcl}
  P(b\eq 0,\, e\eq 0\given a\eq 1) &=& 0.4993               \\
  P(b\eq 1,\, e\eq 0\given a\eq 1) &=& 0.4947             \\
  P(b\eq 0,\, e\eq 1\given a\eq 1) &=& 0.0055             \\
  P(b\eq 1,\, e\eq 1\given a\eq 1) &=& 0.0005           .
\end{array}
\label{eq.earthquake.post}
\eeq
 To answer the question, `what's the probability a burglar was there?' 
 we {\dem\index{marginalization}{marginalize}\/} over the earthquake variable $e$:
\beq
\begin{array}{rclcl}
P(b\eq 0\given a\eq 1) &=&   P(b\eq 0,\, e\eq 0\given a\eq 1)  +   P(b\eq 0,\, e\eq 1\given a\eq 1) &=& 0.505 \\
P(b\eq 1\given a\eq 1) &=&   P(b\eq 1,\, e\eq 0\given a\eq 1)  +   P(b\eq 1,\, e\eq 1\given a\eq 1) &=& 0.495 .
\end{array}
\eeq
 So there is nearly a  50\% chance that there was a burglar present.
 It is important to note that the variables $b$ and $e$, which
 were independent {\em a priori},
 are now {\em dependent}. The posterior distribution (\ref{eq.earthquake.post})
 is not a separable function of $b$ and $e$.
%
%pr 0.001      * 0.999 * 0.999/z
%pr 0.99001    * 0.999 * 0.001 /z
%pr 0.01099    * 0.001 * 0.999/z
%pr 0.9901099  * 0.001 * 0.001 /z
%
%pr 0.001      * 0.999 * 0.999/z + 0.01099    * 0.001 * 0.999/z
%pr 0.9901099  * 0.001 * 0.001 /z + 0.99001    * 0.999 * 0.001 /z
 This fact is illustrated most simply by studying the effect
 of learning that $e\eq 1$.

 When we learn $e\eq1$,
 the posterior probability of $b$
 is given by $P(b \given e\eq 1,\,   a\eq 1  )  = P(b ,e\eq 1\given a\eq 1) / P( e\eq 1\given a\eq 1)$,
 \ie, by dividing the bottom two rows of
%quantities from
 (\ref{eq.earthquake.post}),
%\beq
%\begin{array}{rcl}
%  P(b\eq 0,\, e\eq 1\given a\eq 1) &=& 0.0055             \\
%  P(b\eq 1,\, e\eq 1\given a\eq 1) &=& 0.0005           ,
%\end{array}
%\label{eq.earthquake.post2}
%\eeq
%pr 0.9901099  * 0.001 * 0.001 /z + 0.01099    * 0.001 * 0.999/z
% e =  0.9901099  * 0.001 * 0.001 /z + 0.01099    * 0.001 * 0.999/z
% pr   0.9901099  * 0.001 * 0.001 /(z*e)
% pr  0.01099    * 0.001 * 0.999/(z*e)
% 
 by their sum $P( e\eq 1\given a\eq 1) = 0.0060$. The posterior probability of $b$ is:
\beq
\begin{array}{rcl}
  P(b\eq 0\given e\eq 1,\, a\eq 1) &=& 0.92 \\
  P(b\eq 1\given e\eq 1,\, a\eq 1) &=& 0.08           .
\end{array}
\label{eq.earthquake.post3}
\eeq
% 0.0827220303808637
% 0.917277969619136
 There is thus now an 8\% chance that a burglar  was in  Fred's house.
 It is in accordance with everyday intuition that the probability that $b\eq 1$
 (a possible cause of the alarm)
  reduces when Fred learns  that an earthquake, an alternative
 explanation of the alarm, has happened.

\subsection{Explaining away}
 This phenomenon, that one of the possible causes ($b\eq 1$) of some data (the data
 in this case being
 $a\eq 1$) becomes {\em less\/} probable when another of the causes ($e\eq 1$)
 becomes more probable, even though those two causes were independent
 variables {\em a priori}, is known as {\dem\ind{explaining away}}.
 Explaining away is an important feature of correct inferences,
 and one that any artificial intelligence should replicate.

 If we believe that the neighbour and the radio
 service are unreliable or capricious,
 so that we are not certain that the alarm really is
 ringing or that an earthquake really has happened, the calculations
 become more complex, but the
 explaining-away effect persists;
  the arrival of the earthquake report $r$
 simultaneously makes it {\em more\/} probable that the
 alarm truly is ringing, and {\em less\/} probable that
 the burglar  was present.

 In summary, we solved the inference questions about the burglar
 by enumerating all four hypotheses about the variables $(b,e)$,
 finding their posterior probabilities, and marginalizing
 to obtain the required inferences about $b$.
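
 The enumeration just summarized can be sketched in a few lines of Python.
 This is only an illustrative sketch: the names {\tt prior\_b}, {\tt prior\_e},
 {\tt p\_alarm} and {\tt posterior} are not from the text, but the
 numbers are the probabilities quoted above.

```python
from itertools import product

# Prior probabilities from the text: P(b=1) = 0.001 and P(e=1) = 0.001.
prior_b = {0: 0.999, 1: 0.001}
prior_e = {0: 0.999, 1: 0.001}

# P(a=1 | b, e), keyed by (b, e); the four values quoted in the text.
p_alarm = {
    (0, 0): 0.001,
    (1, 0): 0.99001,
    (0, 1): 0.01099,
    (1, 1): 0.9901099,
}

# Numerator of Bayes' theorem for each of the four hypotheses (b, e).
joint = {(b, e): p_alarm[(b, e)] * prior_b[b] * prior_e[e]
         for b, e in product((0, 1), repeat=2)}

# Normalizing constant P(a=1) and the posterior P(b, e | a=1).
z = sum(joint.values())
posterior = {h: v / z for h, v in joint.items()}

# Marginalize over e to obtain P(b=1 | a=1).
p_b1 = posterior[(1, 0)] + posterior[(1, 1)]

# Condition on e=1 as well to obtain P(b=1 | e=1, a=1).
p_e1 = posterior[(0, 1)] + posterior[(1, 1)]
p_b1_given_e1 = posterior[(1, 1)] / p_e1

print(f"P(a=1)            = {z:.4f}")
print(f"P(b=1 | a=1)      = {p_b1:.3f}")
print(f"P(b=1 | e=1, a=1) = {p_b1_given_e1:.2f}")
```

 The drop from {\tt p\_b1} to {\tt p\_b1\_given\_e1} is the explaining-away effect.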

\exercisxB{2}{ex.earthquake}{
	After Fred receives the phone-call about the burglar alarm, but before
 he hears the radio report, what, from his point of view, is the probability that there was
 a small earthquake today?
}


\section{Exact inference for continuous hypothesis spaces }
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
%  probc clustering
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%5

% \section{Mixture modelling}


 Many of the hypothesis spaces we will consider
 are naturally thought of as continuous. For example,
 the unknown decay length $\l$ of
 \sectionref{sec.decay} (\pref{sec.decay})
 lives in a  continuous one-dimensional space;
 and 
 the unknown mean and standard deviation of
 a Gaussian $\mu,\sigma$
% $(\mu_1,\mu_2)$ of the Gaussian 
 live in a continuous two-dimensional space.
 In any practical computer implementation,
 such continuous spaces will necessarily be discretized,
 however, and so  can, in principle, be enumerated -- at a grid of parameter
 values, for example.  In \figref{decay.like.2} we
 plotted the likelihood function for the decay length as a function of $\l$
 by evaluating the likelihood at a finely-spaced series of points.

\subsection{A two-parameter model}
 Let's look at the Gaussian distribution as an example of
 a model with a two-dimensional hypothesis space.
\begin{figure}
\figuremargin{\begin{center}\begin{tabular}{c}
\fbox{\hspace*{-0.05in}\psfig{figure=mixture/hundred0.ps,angle=-90,width=\skinnytextwidth}\hspace{0.05in}}\\
\end{tabular}\end{center}}
{\caption[a]{Enumeration of an
 entire (discretized) hypothesis space for one Gaussian with parameters $\mu$ (horizontal axis) and
 $\sigma$ (vertical). }
\label{fig.enumerate.gaussian}}
\end{figure}
 The  one-dimensional
 Gaussian distribution is parameterized by a mean $\mu$ 
 and a standard deviation $\sigma$:
%
\beq
P(x\given \mu,\sigma)
% ,\H_{\rm Normal})
        = \frac{1}{\sqrt{2 \pi} \sigma}
        \exp \left( - \frac{ ( x-\mu )^2 }{2 \sigma^2 } \right)
        \equiv {\rm Normal}(x;\mu,\sigma^2) .
\eeq
%
 \Figref{fig.enumerate.gaussian}
 shows an enumeration of one hundred
 hypotheses about the mean and standard deviation of
 a one-dimensional Gaussian distribution.
 These hypotheses are evenly spaced in a ten by ten
 square grid covering ten values of $\mu$ and
 ten values of $\sigma$. Each hypothesis is represented
 by a picture showing
% its associated
 the probability  density that it puts on $x$.
%
%\begin{figure}
\marginfig{\begin{center}\begin{tabular}{c}
\mbox{\psfig{figure=mixture/data5.ps,width=1.9in,angle=-90}}\\[0.03in]
\end{tabular}\end{center}
%}{
\caption[a]{Five datapoints  $\{x_n\}_{n=1}^5$.
 The horizontal coordinate
 is the value of the datum, $x_n$; the vertical coordinate
 has no meaning.}
% represents the order in which the data were acquired.}
\label{fivepoints}
}
%\end{figure}
%
 We now examine the inference of $\mu$ and $\sigma$ 
 given data points $x_n$, $n=1,\ldots, N$, assumed to be drawn independently 
 from this density.
% distribution. 
%

 Imagine that we acquire data, for example the
 five points shown in \figref{fivepoints}.
 We can now evaluate the posterior probability of each of the one hundred
 subhypotheses by evaluating the likelihood of each,
 that is, the value of
 $P( \{x_n\}_{n=1}^5  \given  \mu, \sigma )$;
 with a uniform prior over the grid, the posterior probability
 is proportional to this likelihood.
 The likelihood values are shown diagrammatically 
 in \figref{fig.gaussian5} using the line thickness
 to encode the value of the likelihood. Subhypotheses
 with likelihood smaller than $e^{-8}$  times
 the maximum likelihood have been deleted.
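
 A minimal sketch of this grid evaluation follows. The five data values and the
 grid ranges are assumptions made for illustration (the values behind the
 figures are not given in the text), and the names {\tt log\_likelihood},
 {\tt mus}, {\tt sigmas} and {\tt survivors} are hypothetical.

```python
import math

# Hypothetical five data points; the book's actual values are not given.
data = [0.2, 0.8, 1.0, 1.2, 1.8]

def log_likelihood(mu, sigma, xs):
    """ln P({x_n} | mu, sigma) for independent Gaussian samples."""
    n = len(xs)
    return (-n * math.log(math.sqrt(2 * math.pi) * sigma)
            - sum((x - mu) ** 2 for x in xs) / (2 * sigma ** 2))

# A ten-by-ten grid of hypotheses; the ranges are assumed for illustration.
mus = [0.2 * i for i in range(10)]            # 0.0, 0.2, ..., 1.8
sigmas = [0.1 + 0.1 * j for j in range(10)]   # 0.1, 0.2, ..., 1.0

# Log likelihood of every subhypothesis (mu, sigma) on the grid.
log_like = {(mu, s): log_likelihood(mu, s, data) for mu in mus for s in sigmas}
best = max(log_like, key=log_like.get)

# Keep only subhypotheses whose likelihood is at least e^{-8} times the maximum.
survivors = [h for h, ll in log_like.items() if ll >= log_like[best] - 8.0]

print("best grid point (mu, sigma):", best)
print("subhypotheses kept:", len(survivors), "of", len(log_like))
```

 Working with log likelihoods turns the threshold factor $e^{-8}$ into a simple
 subtraction of 8.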

\begin{figure}
\figuremargin{\begin{center}\begin{tabular}{c}
\fbox{\psfig{figure=mixture/hundred.ps,angle=-90,width=\skinnytextwidth}}\\
\end{tabular}\end{center}}
{\caption[a]{ Likelihood function, given  the data of \figref{fivepoints},
 represented by line thickness.  Subhypotheses having
 likelihood smaller than $e^{-8}$ times the maximum
 likelihood are not shown.}
\label{fig.gaussian5}}
\end{figure}

 Using a finer grid, we can represent the same information by
 plotting the likelihood  as a surface plot or contour plot
 as a function of $\mu$ and $\sigma$ (\figref{like.sig.mu1}).

% copy from bayes_intermediate.tex
% /home/mackay/book/figs
\begin{figure}
\figuremargin{\small%
\vspace{-0.56in}
\begin{center}\small
\begin{tabular}{l@{}l}
%(a1)
\hspace{-0.2in}\raisebox{-8mm}{\psfig{figure=\bookfigs/basic/new_surfaceplot.ps,angle=-90,width=3in}}
&
%(a2)
\hspace{-0.6in}\raisebox{-8mm}{\psfig{figure=\bookfigs/basic/new_contourplot.ps,angle=-90,width=3in}}
\\
%(b)\raisebox{-5mm}{\psfig{figure=\bookfigs/basic/new_posts.ps,angle=-90,width=2.3in}}
%&
%\hspace*{-0.3in}(c)\raisebox{-5mm}{\psfig{figure=\bookfigs/basic/new_sigposts.ps,angle=-90,width=2.3in}}
%\\
\end{tabular}
\end{center}
}{%
\caption[abbrev]{{The likelihood function for the parameters of 
        a  Gaussian distribution}.

%        {(a1,a2)}
 Surface plot and 
contour plot of the log likelihood as a function of $\mu$
        and $\sigma$.  The data set of $N=5$ points had mean
        $\bar{x}=1.0$ and $S = \sum(x-\bar{x})^2 = 1.0$.
% Notice that
%        the maximum is skew in $\sigma$.  The two estimators of
%        standard deviation have values $\sigma_{\ssN}=0.45$ and
%        $\sigma_{\ssNM}=0.50$.

%{(b)} The posterior probability of $\mu$ for various values of 
% $\sigma$. 
%
%{(c)} The posterior probability of $\sigma$ for 
% various fixed values of $\mu$.

}
\label{like.sig.mu1}
}%
\end{figure}
%
%
%
\subsection{A five-parameter mixture model}
\label{sec.gaussian.firsttime}
 Eyeballing the data (\figref{fivepoints}),  you might agree that it seems
 more plausible that they come not from a single Gaussian but
 from a mixture of two Gaussians,
 defined by two means, two standard deviations,
 and two {\ind{mixing coefficients}} $\pi_1$ and $\pi_2$,
 satisfying $\pi_1+\pi_2=1$, $\pi_i \geq 0$.
\[%beq
 P(x \given \mu_1,\sigma_1,\pi_1,\mu_2,\sigma_2,\pi_2) =
 \frac{\pi_1}{\sqrt{ 2 \pi} \sigma_1} \exp \left( -\smallfrac{(x-\mu_1)^2}{2 \sigma_1^2} \right)
+
 \frac{\pi_2}{\sqrt{ 2 \pi} \sigma_2} \exp \left( -\smallfrac{(x-\mu_2)^2}{2 \sigma_2^2} \right) .
\]%eeq
 Let's  enumerate the subhypotheses for this alternative
 model. 
 The parameter space is five-dimensional, so it becomes challenging to
 represent it on a single page.
% \Figref{fig.mixture200} shows 
 \Figref{fig.mixture200} enumerates 800 subhypotheses with
 different values of the five parameters
 $\mu_1,\mu_2,\sigma_1,\sigma_2,\pi_1$.
 Each of the two means takes five values, varied in the horizontal directions;
 each of the two standard deviations takes four values, varied vertically;
 and $\pi_1$ takes two values, also vertically,
 giving $5^2 \times 4^2 \times 2 = 800$ subhypotheses in all.
 We can represent the inference about these five parameters
 in the light of the five datapoints  as shown in 
 \figref{fig.mixture200post}.
% And do model comparison too.

\begin{figure}
%\figuredangle{%
\figuremargin{%
\begin{center}\begin{tabular}{c}
\mbox{\psfig{figure=mixture/mix0.0.6.ps,angle=-90,width=\skinnytextwidth}}\\
\mbox{\psfig{figure=mixture/mix0.0.8.ps,angle=-90,width=\skinnytextwidth}}\\
\end{tabular}\end{center}}
{\caption[a]{Enumeration of the
 entire (discretized) hypothesis space for a mixture of two Gaussians.  Weight of the mixture components
 is $\pi_1,\pi_2 = 0.6,0.4$ in the top half and $0.8,0.2$ in the
 bottom half. Means $\mu_1$ and $\mu_2$ vary horizontally,
 and standard deviations $\sigma_1$ and $\sigma_2$ vary
 vertically. }
\label{fig.mixture200}}
\end{figure}

\begin{figure}
   \figuremargin{%
%   \figuredangle{%
\begin{center}\begin{tabular}{c}
\mbox{\psfig{figure=mixture/D1mix.0.6.ps,angle=-90,width=\skinnytextwidth}}\\
\mbox{\psfig{figure=mixture/D1mix.0.8.ps,angle=-90,width=\skinnytextwidth}}\\
\end{tabular}\end{center}}
{\caption[a]{Inferring a mixture of two Gaussians. Likelihood function,
 given  the data of \figref{fivepoints},
 represented by line thickness.
 The hypothesis space is identical to that shown in
 \figref{fig.mixture200}.
Subhypotheses having
 likelihood smaller than $e^{-8}$ times the maximum
 likelihood are not shown, hence the blank regions, which
 correspond to hypotheses that the data have ruled out.\medskip
}
\label{fig.mixture200post}
\begin{realcenter}
\mbox{\psfig{figure=mixture/data5.ps,width=1.9in,angle=-90}}
\end{realcenter}
}
\end{figure}

 If we wish to compare the one-Gaussian model with the
 mixture-of-two model, we can find the models' posterior probabilities
% y of the two models
 by evaluating the \ind{marginal likelihood} or \ind{evidence} for each model $\H$,
$P( \{x\} \given  \H )$. The evidence
 is given by
 integrating over the  parameters, $\btheta$; the integration can be implemented
 numerically by summing over the 
 alternative enumerated values
 of $\btheta$, 
\beq
	P( \{x\} \given  \H ) = \sum_{ \btheta } P(\btheta) P( \{x\} \given  \btheta , \H ) ,
\eeq
 where $P(\btheta)$ is the prior distribution over the grid of parameter
 values, which I take to be uniform.
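
 As a rough sketch of this sum for the one-Gaussian model, the following Python
 computes the log evidence on a grid with a uniform prior. The data values,
 the grid, and the function names {\tt log\_likelihood} and {\tt log\_evidence}
 are illustrative assumptions, not taken from the text.

```python
import math

# Hypothetical data; the book's actual values are not given.
data = [0.2, 0.8, 1.0, 1.2, 1.8]

def log_likelihood(mu, sigma, xs):
    """ln P({x_n} | mu, sigma) for independent Gaussian samples."""
    n = len(xs)
    return (-n * math.log(math.sqrt(2 * math.pi) * sigma)
            - sum((x - mu) ** 2 for x in xs) / (2 * sigma ** 2))

def log_evidence(grid, xs):
    """ln of sum_theta P(theta) P(xs | theta), with P(theta) uniform on the grid."""
    logs = [log_likelihood(mu, sigma, xs) for mu, sigma in grid]
    top = max(logs)
    # Factor out the largest term before exponentiating, to avoid underflow.
    total = sum(math.exp(l - top) for l in logs)
    return top + math.log(total) - math.log(len(grid))

# Assumed ten-by-ten grid over (mu, sigma).
grid = [(0.2 * i, 0.1 + 0.1 * j) for i in range(10) for j in range(10)]
print("ln P({x} | H) =", log_evidence(grid, data))
```

 The uniform prior enters only through the final subtraction of the log of the
 number of grid points, so the evidence is the average likelihood over the grid.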

% The data set
% contains weak evidence for two clusters,
% and the evidence for the two models shown here comes
% out about 10:1 in  favour of the two-Gaussian model.
%

 For the mixture of two Gaussians this integral is a five-dimensional integral;
 if it is to be performed at all accurately, the grid of points will
 need to be much finer than the grids shown in the figures. If the uncertainty
 about each of $K$ parameters has been reduced by, say, a factor of ten by observing
 the data, then 
 brute force integration  requires a grid of at least $10^K$ points.
 This exponential growth of computation with model size is the reason why
 complete enumeration is rarely a feasible computational strategy.
% inference 

% \end{figure}

\exercisxA{1}{ex.tengaussians}{
 Imagine fitting a mixture of ten Gaussians to data in a twenty-dimensional
 space. Estimate the computational cost of implementing inferences
 for this model by enumeration of a grid of parameter values.
}







\dvips

% Show the surface plot of the likelihood also.
% Idea: Add the exam question on biexponential distbn here?


\chapter{Maximum Likelihood and Clustering}
\label{ch.ml}
\label{ch.clust}
% maximum likelihood and clustering - start of chapter
%
 Rather than enumerate all hypotheses -- which may
 be exponential in number -- we can save a lot of time by
 homing in on one good hypothesis that fits the data
 well. This is the philosophy behind the \ind{maximum likelihood}
 method, which  identifies the setting of the  parameter vector 
 $\btheta$ that maximizes the likelihood, $P(\mbox{Data} \given  \btheta, \H)$.

 For some models the maximum likelihood parameters can be identified
 instantly from the data; for more complex models, finding
 the maximum likelihood parameters may require an iterative algorithm.

 For any model, it is usually easiest to work with the {\em logarithm\/} of
 the likelihood rather than the likelihood, since likelihoods, being
 products of the probabilities of many data points, tend to be very small.
 Likelihoods multiply; log likelihoods add.

\section{Maximum likelihood for one Gaussian}
\label{sec.mloneg}
We return to the Gaussian for our first examples.
 Assume we have  data $\{ x_n \}_{n=1}^N$.
 The log likelihood is:
\beqan
\ln P(\{x_n\}_{n=1}^N \given \mu,\sigma)
        &=& -N \ln (\sqrt{2 \pi} \sigma)
        -\sum_n \linefrac{(x_n-\mu)^2}{(2 \sigma^2)}    . 
\eeqan
% Given the Gaussian model, 
 The likelihood can be expressed 
 in terms of  two functions  of the data, the sample mean
\beq
        \barx \equiv   {\sum_{n=1}^{N} x_n} / {N} ,
\eeq
 and the sum of square deviations
\beq
 S \equiv \sum_n  (x_n-\barx)^2:
\eeq
\beq
\ln P(\{x_n\}_{n=1}^N \given \mu,\sigma)
=
        -N \ln (\sqrt{2 \pi} \sigma) - \linefrac{ [ N ( \mu - \barx )^2 + S ]}
                                        { (2 \sigma^2) } .
\eeq
 Because the likelihood depends on the data only through
 $\barx$ and $S$,
  these  two quantities  are known as {\dem\ind{sufficient statistics}}.\index{statistic!sufficient}
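
 The step from the first form of the log likelihood to the second
 can be checked by expanding each squared deviation about the sample mean:
\beqan
 \sum_n (x_n-\mu)^2 &=& \sum_n \left[ (x_n-\barx) + (\barx-\mu) \right]^2 \\
 &=& S + 2 (\barx-\mu) \sum_n (x_n-\barx) + N (\barx-\mu)^2 \\
 &=& S + N (\mu-\barx)^2 ,
\eeqan
 since $\sum_n (x_n-\barx) = 0$ by the definition of $\barx$.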
% copy from bayes_intermediate.tex
% /home/mackay/book/figs
\begin{figure}
\figuremargin{\small%
\vspace{-0.56in}
\begin{center}
\begin{tabular}{l@{}l}
% \newcommand{\bookfigs}{/home/mackay/book/figs}
(a1)\hspace{-0.4in}\raisebox{-10mm}{\psfig{figure=\bookfigs/basic/new_surfaceplot.ps,angle=-90,width=3in}}
&
(a2)\hspace{-0.8in}\raisebox{-10mm}{\psfig{figure=\bookfigs/basic/new_contourplot.ps,angle=-90,width=3in}}
\\
(b)\raisebox{-5mm}{\psfig{figure=\bookfigs/basic/new_posts.ps,angle=-90,width=2.3in}}
&
\hspace*{-0.3in}(c)\raisebox{-5mm}{\psfig{figure=\bookfigs/basic/new_sigposts.ps,angle=-90,width=2.3in}}
\\
\end{tabular}
\end{center}
}{%
\caption[abbrev]{{The likelihood function for the parameters of 
        a  Gaussian distribution}.

        {(a1, a2)} Surface plot and 
contour plot of the log likelihood as a function of $\mu$
        and $\sigma$.  The data set of $N=5$ points had mean
        $\bar{x}=1.0$ and $S = \sum(x-\bar{x})^2 = 1.0$.
% Notice that
%        the maximum is skew in $\sigma$.  The two estimators of
%        standard deviation have values $\sigma_{\ssN}=0.45$ and
%        $\sigma_{\ssNM}=0.50$.

{(b)} The posterior probability of $\mu$ for various values of 
 $\sigma$. 

{(c)} The posterior probability of $\sigma$ for 
 various fixed values of $\mu$ (shown as a density over $\ln \sigma$).

}
\label{like.sig.mu1a}
}%
\end{figure}
%

\exampl{ex.muML}{
 Differentiate the log likelihood with respect to $\mu$
% and $\ln \sigma$
 and show that,
 if  the standard deviation is known to be $\sigma$,
 the maximum likelihood mean $\mu$ of  a Gaussian
% whose
 is
 equal to the sample mean $\barx$,
 for any value  of $\sigma$.
}
\solution
\beqan
	\frac{\partial}{\partial \mu} \ln P &=& - \frac{N(\mu-\bar{x})}{\sigma^2}\\
&=&0 \:\: \ \mbox{when $\mu = \bar{x}$. \hspace{1in} \ensuremath{\epfsymbol}\hspace{-1in}}
\eeqan
% end soln

 If we Taylor-expand the log likelihood about the maximum,
 we can define approximate
 \ind{error bars} on the   maximum  likelihood parameter:
 we  use a quadratic approximation to estimate
 how far from the maximum-likelihood parameter setting we can go
 before  the likelihood falls by some standard factor,
 for example $e^{1/2}$, or $e^{4/2}$.
 In the special case of a likelihood that is a Gaussian\index{approximation!by Gaussian}
 function of the parameters, the quadratic approximation is exact.
\exampl{ex.muML2}{
 Find the second derivative of the log likelihood with
 respect to $\mu$, and find the error bars on $\mu$, given
 the data and $\sigma$.
}
{
\solution
\beq
	\frac{\partial^2}{\partial \mu^2} \ln P = - \frac{N}{\sigma^2}.
\hspace{1.4in} \ensuremath{\epfsymbol}\hspace{-1.5in}
% \hfill \ensuremath{\epfsymbol}
\eeq
 Comparing this curvature with the curvature of the log of a Gaussian
 distribution over $\mu$ of standard deviation $\sigma_{\mu}$,
 $\exp ( - \mu^2/(2 \sigma_{\mu}^2) )$,
 which is $-1/\sigma^2_{\mu}$,  we can deduce that the error bars
 on $\mu$ (derived from the likelihood function) are
\beq
	\sigma_{\mu} = \frac{\sigma}{\sqrt{N}} .
\eeq
 The \ind{error bars} have this property: 
 at the two points  $\mu = \bar{x} \pm \sigma_{\mu}$, the likelihood is smaller than its maximum
 value by a factor of $e^{1/2}$.
}
\exampl{ex.sigML}{
 Find the maximum likelihood standard deviation $\sigma$ of  a Gaussian,
 whose mean is known to be $\mu$,
 in the light of data $\{ x_n \}_{n=1}^N$.
 Find the second derivative of the log likelihood with
 respect to   $\ln \sigma$, and error bars on $\ln \sigma$.
}
{
\solution\
 The likelihood's dependence on $\sigma$ is
\beq
\ln  P(\{x_n\}_{n=1}^N \given \mu,\sigma) 
=
        -N \ln (\sqrt{2 \pi} \sigma) - \frac{  S_{\rm tot}  }
                                        { (2 \sigma^2) }, 
\eeq
where  $S_{\rm tot} = \sum_n \! {(x_n-\mu)^2}$.
%  N ( \mu - \barx )^2 + S$.
 To find the maximum of the likelihood, we can differentiate with
 respect to $\ln \sigma$. [It's often most hygienic to differentiate
 with respect to $\ln u$ rather than $u$, when $u$ is a scale variable;
 we use 
% Recall $d(e^{nx})/dx = n e^{nx}$,
% so
 $\d u^{n}/\d(\ln u) = n u^{n}$.]
%
\beq
\frac{\partial \ln  P(\{x_n\}_{n=1}^N \given \mu,\sigma) }
{\partial \ln \sigma}
=
        -N  + \frac{  S_{\rm tot}  }
                                        {  \sigma^2 } 
\eeq
 This derivative is zero when
\beq
	\sigma^2 =  \frac{  S_{\rm tot}  }{ N } , 
\eeq
\ie,
\beq
	\sigma = \sqrt{
                 \frac{\sum_{n=1}^{N} ( x_n - \mu )^2 }{N}
        } .
\eeq
 The second derivative is 
\beq
\frac{\partial^2 \ln  P(\{x_n\}_{n=1}^N \given \mu,\sigma) }
{\partial (\ln \sigma)^2}
=
       - 2 \frac{  S_{\rm tot}  }  {  \sigma^2 } ,
\eeq
 and at the maximum-likelihood value of $\sigma^2$,
 this equals $-2N$.
% is
%\beq
%\frac{\partial^2 \ln  P(\{x_n\}_{n=1}^N \given \mu,\sigma) }
%{\partial (\ln \sigma)^2}= - 2N.
%\eeq
 So error bars on $\ln \sigma$ are
\beq
	\sigma_{\ln \sigma} = \frac{1}{\sqrt{2N}} .
\hspace{1in} \ensuremath{\epfsymbol}\hspace{-1.1in}
\eeq
}
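
 The results of the three examples above can be checked numerically.
 The sketch below uses made-up data and assumed `known' values of $\sigma$ and
 $\mu$; all variable names are hypothetical. It confirms that the log likelihood
 falls by $1/2$ one error bar away from $\barx$, and estimates the curvature
 with respect to $\ln \sigma$ by finite differences.

```python
import math

# Hypothetical data; not the data set used in the book's figures.
data = [0.2, 0.8, 1.0, 1.2, 1.8]
N = len(data)

def log_likelihood(mu, sigma, xs):
    """ln P({x_n} | mu, sigma) for independent Gaussian samples."""
    n = len(xs)
    return (-n * math.log(math.sqrt(2 * math.pi) * sigma)
            - sum((x - mu) ** 2 for x in xs) / (2 * sigma ** 2))

# Maximum likelihood mean and its error bar sigma / sqrt(N).
sigma = 0.5                        # assumed known standard deviation
mu_ml = sum(data) / N
sigma_mu = sigma / math.sqrt(N)

# Drop in log likelihood one error bar away from the maximum.
drop = (log_likelihood(mu_ml, sigma, data)
        - log_likelihood(mu_ml + sigma_mu, sigma, data))

# Maximum likelihood sigma when the mean is known, and error bar on ln sigma.
mu = 1.0                           # assumed known mean
s_tot = sum((x - mu) ** 2 for x in data)
sigma_ml = math.sqrt(s_tot / N)
sigma_ln_sigma = 1 / math.sqrt(2 * N)

# Finite-difference second derivative with respect to t = ln sigma.
def f(t):
    return log_likelihood(mu, math.exp(t), data)

t0, h = math.log(sigma_ml), 1e-4
curvature = (f(t0 + h) - 2 * f(t0) + f(t0 - h)) / h ** 2

print("drop in ln P at one error bar:", drop)
print("curvature in ln sigma:", curvature, " expected:", -2 * N)
```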
\exercisxB{1}{ex.MLgaussian}{
% Differentiate the log likelihood with respect to $\mu$ and $\ln \sigma$ and s
 Show that   the  values of
 $\mu$ and $\ln \sigma$ that jointly maximize the  likelihood  are:
% \beq
$
 \{\mu,\sigma\}_{\ML} = \left\{ \bar{x},\sigma_{\ssN}
         = \sqrt{ \linefrac{S}{N} } \right\}  ,
$
% \eeq
 where
\beq
        \sigma_{\ssN} \equiv \sqrt{
                 \frac{\sum_{n=1}^{N} ( x_n - \barx )^2 }{N}
        }
.
\label{eq.sigmaML}
\eeq

}


\section{Maximum likelihood for a mixture of Gaussians}
% kmeans
% LABEL MOG mog
\label{sec.mog}
 We now derive an algorithm for fitting a mixture of Gaussians to one-dimensional
 data. In fact, this algorithm is so important to understand that
 {\em you}, gentle reader, get to derive the algorithm. Please work through the following exercise.

\ExercissxA{2}{ex.mixture_em}{
% kmeans
 A random variable $x$ is assumed to have a probability 
 distribution that is a {\em mixture of two Gaussians},
%\beq
%	P(x| \mu_1,\mu_2 ,\sigma_1, \sigma_2, p_1, p_2)
%	=
%%	\frac{1}{2}
%	\left[\sum_{c=1}^{2}
%		 p_c \frac{1}{\sqrt{2 \pi \sigma_c^2}} 
%	\exp \left( - \frac{(x-\mu_c)^2}{2 \sigma_c^2} \right) \right] ,
%\eeq
% where the two Gaussians are labelled by the class labels
% $c=1$ and $c=2$;  $p_1$ and $p_2$ are the prior probabilities
% of the two Gaussians,
% which satisfy $p_1 + p_2 = 1$; and $\{ \mu_c \}$ and 
% $\{ \sigma_c\}$ are their means and standard deviations.
% For brevity, we will denote these  parameters by 
% $\btheta \equiv \left\{ \{ p_c \}, \{ \mu_c \},  \{ \sigma_c\} \right\}$.
%
%  Assuming that   $\{ p_c \}$, $\{ \mu_c \}$ and 
% $\{ \sigma_c\}$ are known  and that the standard deviations 
% are equal, that is, $\sigma$
\beq
	P(x \given \mu_1,\mu_2 ,\sigma)
	=
	\left[\sum_{k=1}^{2}
%	\frac{1}{2}
		p_k
		  \frac{1}{\sqrt{2 \pi \sigma^2}} 
	\exp \left( - \frac{(x-\mu_k)^2}{2 \sigma^2} \right) \right] ,
\eeq
 where the two Gaussians are given the labels
 $k=1$ and $k=2$;   the prior probability
 of the class label $k$ is $\{p_1 \eq 1/2 , \, p_2 \eq 1/2 \}$;  $\{ \mu_k \}$ are
 the means of the two Gaussians; and both have  standard deviation
 $\sigma$.
 For brevity, we  denote these  parameters by 
 $\btheta \equiv  \left\{  \{ \mu_k \},  \sigma \right\}$.

 A data set consists  of $N$  points $\{ x_n \}_{n=1}^N$ which are assumed 
 to be independent samples
 from this distribution. Let $k_n$ denote the unknown class 
 label of the $n$th point.

 Assuming that   $\{ \mu_k \}$ and 
 $\sigma$ are known, show that the 
 posterior probability of the class label $k_n$  of the $n$th point
 can be written as 
\begin{equation}
\begin{array}{rcl}
	P(k_n \eq 1 \given x_n , \btheta )& =&
\displaystyle  \frac{1}{1+\exp[ - ( w_1 x_n + w_0)] }
\\[0.21in]
	P(k_n \eq 2 \given x_n , \btheta )& =&
\displaystyle  \frac{1}{1+\exp[ + ( w_1 x_n + w_0)] } ,
\end{array} 
\label{eq1}
\end{equation}
 and give expressions for $w_1$ and $w_0$.
%\marginpar{[5]}
\medskip

 Assume now that the means $\{ \mu_k \}$ are {\em not\/} known, 
 and that we wish to infer them from the data $\{ x_n \}_{n=1}^N$.
 (The standard deviation $\sigma$ is known.)
 In the remainder of this question we will derive an iterative 
 algorithm for finding values for $\{ \mu_k \}$ that 
 maximize the likelihood,
\beq
	 P( \{ x_n \}_{n=1}^N   \given  \{ \mu_k \} , \sigma ) 
	= \prod_n P( x_n  \given   \{ \mu_k \} , \sigma ) .
\eeq
% Assume that we 
% have set the parameters $\mu_1, \mu_2$ to some initial values.
%    $\{ \mu_k \}$
% but that we  do have a current guess for them both
 Let $L$ denote the natural log of the likelihood.
 Show that the derivative of the log likelihood with respect 
 to $\mu_k$ is given by
\beq
	\frac{\partial}{\partial \mu_k} L 
	= \sum_n p_{k|n} \frac{( x_n - \mu_k )}{\sigma^2} ,
\eeq
%\marginpar{[5]}
 where $p_{k|n} \equiv P( k_n \eq k  \given  x_n , \btheta )$ appeared
% was discussed 
 above at equation (\ref{eq1}).

 Show, neglecting terms in 
$\frac{\partial}{\partial \mu_k}  P( k_n \eq k  \given  x_n , \btheta )$,
 that the second derivative is approximately given by 
%\marginpar{[2]}
\beq
	\frac{\partial^2}{\partial \mu_k^2} L 
	= - \sum_n p_{k|n} \frac{1}{\sigma^2} .
\eeq
 Hence show that from an initial state  $\mu_1, \mu_2$, 
 an approximate \ind{Newton--Raphson} step updates these parameters to
 $\mu_1', \mu_2'$, where 
\beq
	\mu_k' = \frac{ \sum_n p_{k|n} x_n }{ \sum_n p_{k|n}  } .
\eeq
 [The Newton--Raphson method for maximizing $L(\mu)$ 
 updates $\mu$ to $\mu' = \mu - \left[ \left. \frac{\partial L}{\partial \mu}  
 \right/  \frac{\partial^2 L}{\partial \mu^2} \right]$.]
%\medskip


%  -- inference problem, sigmoid function, adaptive mixture model}
%
\[
	\mbox{\hspace{-0.5in}\psfig{figure=figs/points32.ps,angle=-90,width=3in}}
\]	
 Assuming that $\sigma =1$, 
 sketch a contour plot of the likelihood function  as a function of 
 $\mu_1$ and $\mu_2$ for the data set shown above.
% The data set consists
% of 200 points, shown by the horizontal coordinates of the 
% {\tt x}s below the $x$ axis, and by a histogram
% above it. Indicate the widths of the peaks in your sketch.
 The data set consists
 of 32 points. Describe  the peaks in your sketch
 and indicate their widths.
}


 Notice that the algorithm you have derived for maximizing
 the likelihood is identical to the  soft {K-means algorithm}\index{K-means clustering!derivation}
 of \secref{sec.SOFT-KMEANS}.\index{learning algorithms!K-means clustering}
 Now that it is clear that clustering can be viewed as mixture-density-modelling,\index{density modelling}\index{mixture modelling}\index{modelling!density modelling}
 we are  able to derive enhancements to the K-means algorithm, which
 rectify the problems we noted earlier.\index{clustering}\indexs{K-means clustering}
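
 To make the iteration concrete, here is a small Python sketch of the update
 $\mu_k' = \sum_n p_{k|n} x_n / \sum_n p_{k|n}$ for two Gaussians with equal
 priors and a known, shared $\sigma$. The data, the starting point, and the
 function names {\tt responsibilities} and {\tt update\_means} are assumptions
 made for illustration; the posterior probabilities are computed directly
 from Bayes' theorem.

```python
import math

def responsibilities(x, mu1, mu2, sigma):
    """Posterior class probabilities for one point, by Bayes' theorem.

    The equal priors p_1 = p_2 = 1/2 and the shared normalizing constant cancel.
    This direct form is adequate when x lies within a few sigma of a mean.
    """
    w1 = math.exp(-(x - mu1) ** 2 / (2 * sigma ** 2))
    w2 = math.exp(-(x - mu2) ** 2 / (2 * sigma ** 2))
    return w1 / (w1 + w2), w2 / (w1 + w2)

def update_means(xs, mu1, mu2, sigma):
    """One step: mu_k' = sum_n p_{k|n} x_n / sum_n p_{k|n}."""
    num = [0.0, 0.0]
    den = [0.0, 0.0]
    for x in xs:
        p = responsibilities(x, mu1, mu2, sigma)
        for k in range(2):
            num[k] += p[k] * x
            den[k] += p[k]
    return num[0] / den[0], num[1] / den[1]

# Hypothetical one-dimensional data forming two clumps.
xs = [-2.3, -2.0, -1.8, -1.6, 1.7, 1.9, 2.0, 2.4]

# Assumed initial state; it must not have mu1 equal to mu2.
mu1, mu2 = -0.5, 0.5
for _ in range(50):
    mu1, mu2 = update_means(xs, mu1, mu2, sigma=1.0)

print("mu1 =", mu1, " mu2 =", mu2)
```

 Each mean moves to the responsibility-weighted average of the data, which is
 the mean update of soft K-means.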

  
% such as unequal variance algorithm and
% unequal masses of clusters.
 

\begin{algorithm}
\algorithmmargin{%
\begin{description}
\item[Assignment step\puncspace]
 The responsibilities are
\beq
	r_k^{(n)} = \frac{ \pi_k \frac{1}{(\sqrt{2 \pi} \sigma_k)^I}
	\exp \left( - \displaystyle\frac{1}{\sigma^2_k} \, d(\bm^{(k)} ,\bx^{(n)}) \right) }
		{\sum_{k'} \pi_{k'}  \frac{1}{(\sqrt{2 \pi} \sigma_{k'})^I}
	\exp \left( - \displaystyle \frac{1}{\sigma^2_{k'}} \, d(\bm^{(k')} ,\bx^{(n)}) \right) } 
\label{eq.assignII}
\eeq
 where $I$ is the dimensionality  of $\bx$.

\item[Update step\puncspace]% also called Adaptation or Reestimation
 Each  cluster's parameters, $\bm^{(k)}$, $\pi_k$, and $\sigma^2_k$,
 are adjusted to match 
 the  data points that it is responsible for.
\beq
	\bm^{(k)} = \frac{ \displaystyle \sum_{n} \rnk \bx^{(n)} }{ R^{(k)} }
\label{eq.softkmeans.meanupdate}
\eeq
\beq
	\sigma^2_{k} = \frac{ \displaystyle \sum_{n} \rnk (  \bx^{(n)} - \bm^{(k)} )^2  }{ I R^{(k)} }
\label{eq.softkmeans.varianceupdate}
\eeq
\beq
	\pi_{k} = \frac{ R^{(k)} }{  \sum_{k'}  R^{(k')} } 
\eeq
 where $R^{(k)}$ is the total responsibility of mean $k$, 
\beq
 R^{(k)} = \sum_{n}   \rnk  .
\eeq
%  and $I$ is the dimensionality of $\bx$.
\end{description}
}{
\caption{The soft K-means algorithm, version 2.}
\label{alg.kmeansoft2}
}
\end{algorithm}
% .1 just shows the data.
% .last shows the final state and should be included, ideally
% .2, .4 show initial params, updated params and new assignments, so are the best.
\begin{figure}
\figuremargin{%
\begin{center}\small\hspace*{0.2in}
\begin{tabular}{*{10}{l}}
\softtfbig{2.2}{0}&
\softtfbig{2.4}{1}&
\softtfbig{2.6}{2}&
\softtfbig{2.8}{3}&
\softtfbig{2.19}{9}
\\[0.012in]
\softtfbig{4.2}{0}&
\softtfbig{4.4}{1}&
\softtfbig{4.22}{10}&
\softtfbig{4.42}{20}&
\softtfbig{4.62}{30}&
\softtfbig{4.72}{35}
%\softtfbig{4.81}{40}
\\
\end{tabular}
\end{center}
}{%
\caption[a]{Soft K-means algorithm, with $K=2$,
 applied (a) to the 40-point data set of
 \protect\figref{fig.kmeans.2}; (b) to the little 'n' large data set of
 \protect\figref{fig.kmeans.xbs}. }
\label{fig.skmeans.2dK2}
}%
\end{figure}


\begin{algorithm}
\algorithmmargin{%
%%%%%%%%%%%%%%%%%%%%%%55 CHECK box needed
\beq
	r_k^{(n)} = \frac{ \pi_k \displaystyle  \frac{1}{\prod_{i=1}^I \sqrt{2 \pi} \sigma_i^{(k)}}
	\exp \left( - \displaystyle \sum_{i=1}^I \lfrac{(m_i^{(k)}-x_i^{(n)})^2}{2( \sigma_i^{(k)})^2}   \right) }
		{\sum_{k'} \mbox{ (numerator, with $k'$ in place of $k$) } }
\label{eq.assignIII}
\eeq
\beq
	{\sigma^2_{i}}^{(k)} = \frac{
 \displaystyle \sum_{n} \rnk (  x^{(n)}_i - m^{(k)}_i )^2  }{ R^{(k)} }
\label{eq.softkmeans.varianceupdate.axisaligned}
\eeq
}{
\caption{The soft K-means algorithm, version 3, which
 corresponds to a model of axis-aligned Gaussians.}
\label{alg.kmeansoft3}
}
\end{algorithm}
\section{Enhancements to soft K-means}

 \Algref{alg.kmeansoft2} shows 
%%%%%%%%%%%%% stolen from clust.tex
 a version of the soft-K-means algorithm corresponding
 to a modelling assumption that each cluster is a 
 spherical Gaussian having its own width
 (each cluster has
 its own $\beta^{(k)} = \lfrac{1}{\sigma^2_{k}}$).
% First, version 2 of the soft K-means algorithm
% removes the job of adjusting $\b=1/\sigma^2$ by
% giving every cluster its own lengthscale parameter $\sigma_k$
% which is updated so as to maximize the likelihood.
% add reference to ML sig example above
%
 The algorithm updates  the lengthscales $\sigma_k$ for itself.
 It also includes cluster weight parameters $\pi_1,\pi_2,\ldots, \pi_K$,
 which are likewise updated, allowing accurate modelling
 of data from clusters of unequal weights.
 This algorithm is demonstrated in
% erroneous reference to earlier chapter!
% \figref{fig.skmeans.2d} and
 \figref{fig.skmeans.2dK2}
%
% CHECK THESE REFS
%
 for two data sets that
 we've seen before.
 The second example shows that convergence can take a long time, but eventually
 the algorithm identifies the small cluster and the large cluster.
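
 A compact Python sketch of version 2 follows. It assumes that the distance
 is $d(\bx,{\bf y}) = \frac{1}{2} \sum_i (x_i - y_i)^2$, so that the exponent
 is the usual Gaussian one; the data set, the initial parameters, and the
 function name {\tt soft\_kmeans\_v2} are made up for illustration.

```python
import math

def soft_kmeans_v2(xs, means, sigmas, pis, n_iter=100):
    """Soft K-means, version 2: spherical Gaussians with own sigma_k and pi_k.

    xs is a list of points (each a list of I coordinates); means, sigmas
    and pis are the initial parameters, each a list of length K.
    """
    I, K, N = len(xs[0]), len(means), len(xs)
    for _ in range(n_iter):
        # Assignment step: responsibilities r[n][k].
        r = []
        for x in xs:
            w = []
            for k in range(K):
                # Assumed distance: d = (1/2) * squared Euclidean distance.
                d = 0.5 * sum((m - xi) ** 2 for m, xi in zip(means[k], x))
                norm = (math.sqrt(2 * math.pi) * sigmas[k]) ** I
                w.append(pis[k] * math.exp(-d / sigmas[k] ** 2) / norm)
            total = sum(w)
            r.append([wk / total for wk in w])
        # Update step: total responsibilities, then means, widths and weights.
        R = [sum(r[n][k] for n in range(N)) for k in range(K)]
        means = [[sum(r[n][k] * xs[n][i] for n in range(N)) / R[k]
                  for i in range(I)] for k in range(K)]
        sigmas = [math.sqrt(
                      sum(r[n][k] * sum((xs[n][i] - means[k][i]) ** 2
                                        for i in range(I))
                          for n in range(N)) / (I * R[k]))
                  for k in range(K)]
        pis = [R[k] / sum(R) for k in range(K)]
    return means, sigmas, pis

# Hypothetical data: a small tight cluster and a larger, broader one.
xs = [[0.0, 0.1], [0.2, -0.1], [-0.2, 0.0],
      [4.0, 5.0], [5.0, 4.0], [6.0, 5.0], [5.0, 6.0], [5.0, 5.0], [4.5, 5.5]]
means, sigmas, pis = soft_kmeans_v2(
    xs, means=[[1.0, 1.0], [4.0, 4.0]], sigmas=[1.0, 1.0], pis=[0.5, 0.5])
print("means:", means)
print("sigmas:", sigmas)
print("pis:", pis)
```

 The sketch has no safeguard against a width $\sigma_k$ shrinking to zero,
 which would cause a division by zero.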

% Do my demos include adapting weights $\pi$ as well? Yes, effectively.
%
%\begin{aside}
% Where did all this come from?
% Well, if you did the \exerciseref{ex.mixture_em}
%% (exam q on k-means) in bayes_intermediate.tex
% then you have a derivation. It's a maximum likelihood algorithm.
% Later, we will give a more general derivation, once we have
% learnt about variational methods.
% Then show that the update rules (EM) both increase a single variational 
% objective function. 
%\end{aside}


 Soft K-means, version 2, is a maximum-likelihood
 algorithm for fitting a mixture of {\em spherical Gaussians\/} to data --%
\marginpar{\small\raggedright{A proof that the  algorithm does indeed maximize the likelihood
 is deferred to  \secref{sec.EM}.}}
 `spherical' meaning that the variance of the Gaussian is the same in
 all directions. This algorithm is still no good at modelling the
 cigar-shaped clusters  of  \figref{fig.kmeans.lozenge}.
 If we wish to model the clusters by axis-aligned Gaussians
 with possibly-unequal variances, we
 replace the assignment rule (\ref{eq.assignII})
 and the  variance update rule (\ref{eq.softkmeans.varianceupdate})
 by the rules
 (\ref{eq.assignIII}) and
 (\ref{eq.softkmeans.varianceupdate.axisaligned}) displayed in
 \algref{alg.kmeansoft3}.
% was displayed HERE, moved it to be with  alg 2. 

\begin{figure}
\figuremargin{%
\begin{center}\small\hspace*{0.1025in}
\begin{tabular}{*{10}{l}}
\softtfbbig{2.2}{0}&
%\softtfbbig{2.4}{1}&
\softtfbbig{2.22}{10}&
\softtfbbig{2.42}{20}&
\softtfbbig{2.60}{30}
\\[0.12in]
\end{tabular}
\end{center}
}{%
\caption[a]{Soft K-means algorithm, version 3, applied to the data consisting
 of two cigar-shaped clusters. $K=2$ (\cf\ \figref{fig.kmeans.lozenge}).}
\label{fig.skmeans.lozenge}
}%
\end{figure}



\begin{figure}
\figuremargin{%
\begin{center}\small\hspace*{0.05125in}
\begin{tabular}{*{10}{l}}
\softtfbig{18.2}{0}&
\softtfbig{18.22}{10}&
\softtfbig{18.42}{20}&
\softtfbig{18.54}{26}&
\softtfbig{18.65}{32}
\\%[0.12in]
\end{tabular}
\end{center}
}{%
\caption[a]{Soft K-means algorithm, version 3, applied to the little 'n' large data set. $K=2$.}
\label{fig.skmeans.2f}
}%
\end{figure}
\begin{figure}
\figuremargin{%
\begin{center}\small\hspace*{0.1025in}
\begin{tabular}{*{10}{l}}
\softtfbbig{4.2}{0}&
\softtfbbig{4.12}{5}&
\softtfbbig{4.22}{10}&
\softtfbbig{4.42}{20}&
\\%[0.12bigin]
\end{tabular}
\end{center}
}{%
\caption[a]{Soft K-means algorithm applied to a data set of 40 points. $K=4$.
 Notice that at convergence, one very small cluster has formed between
 two data points.}
\label{fig.skmeans.2g}
}%
\end{figure}
% ^^^^^^6 Fri 29/6/01 I stripped out the  $K=4$. case from this.
% it had an interesting singular cluster, but not really representative
% of real life.
% 6.33 in graveyard


 This third version of soft K-means is demonstrated in
 \figref{fig.skmeans.lozenge} on the `two cigars' data
 set of \figref{fig.kmeans.lozenge}.
 After 30 iterations, the algorithm  correctly
 locates the two  clusters.
\Figref{fig.skmeans.2f} shows the same algorithm applied to
 the   little 'n' large data set; again, the
 correct cluster locations are found.
\section{A fatal flaw of maximum likelihood}
\label{sec.kaboom}
 Finally,
 \figref{fig.skmeans.2g} sounds a cautionary note: when we fit $K=4$ means
 to our first toy data set, we sometimes find that very small clusters form,
 covering just one or two data points. This is a pathological property
 of soft K-means clustering, versions 2 and 3.
\exercisxB{2}{ex.kaboom}{
 Investigate what happens if one mean  $\bm^{(k)}$ sits exactly
 on top of one data point;  show that if the variance $\sigma^2_k$
 is sufficiently small, then no return is possible:  $\sigma^2_k$
 becomes ever smaller.}

\subsection{KABOOM!}
 Soft K-means can blow up.\index{kaboom}\index{blow up}
% end \section{Soft clustering}
 Put one  cluster  exactly on one data point
 and let its variance go to zero --
 you can obtain an arbitrarily large likelihood!
 Maximum likelihood methods  can  break down
% absurdly
 by finding highly tuned models that fit
 part of the data perfectly. This  phenomenon is known
 as \ind{overfitting}. The reason we are not interested
 in these solutions with enormous likelihood is this: sure,
 these parameter-settings may have enormous posterior probability
 {\em density\/}, but the density is large over only a very  small
 {\em volume\/} of parameter space. So the  probability
 {\em mass\/}
 associated with  these likelihood spikes is usually tiny.
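
 To see the blow-up explicitly, here is a minimal sketch for
 a mixture of two spherical Gaussians. The symbols $I$ (the dimensionality
 of the data) and $g_2(\bx)$ (the density of the second Gaussian, held fixed with $\sigma_2 > 0$)
 are labels introduced just for this illustration.
 Put $\bm^{(1)}$ exactly on the data point $\bx^{(1)}$. That point receives
 probability density at least $\pi_1 / (\sqrt{2 \pi} \, \sigma_1)^{I}$
 from the first Gaussian, and every other point receives at least
 $\pi_2 \, g_2(\bx^{(n)})$ from the second, so the likelihood $L$ satisfies
\[
	L \:\geq\: \frac{\pi_1}{(\sqrt{2 \pi} \, \sigma_1)^{I}}
	\prod_{n \geq 2} \pi_2 \, g_2(\bx^{(n)}) .
\]
 The product does not depend on $\sigma_1$, so, as long as $\pi_1$ and $\pi_2$
 are both positive, the right-hand side grows without limit as $\sigma_1 \rightarrow 0$.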
   

% This overfitting problem is one reason why we must say bye-bye to maximum likelihood.
% Another example of overfitting: Imagine
% we are interested in  making a model of the  surnames in a telephone directory;
% one theory says that 5\% of people are called Smith, 3\% Jones, 2\% Davis, etc.;
% another theory says that 20\% are called Lo and 15\% are called Li; indeed we
% can imagine a high-dimensional continuum of such hypotheses.
% You tear a random page from the phone directory and pick a random name:
% it's {\tt{Shercliff}}.
% In the light of this datum, what  is the maximum likelihood hypothesis? Answer: the hypothesis
% that says that 100\% of the surnames are {\tt{Shercliff}}!
%

 We conclude that maximum likelihood methods are not a satisfactory
 general solution to data-modelling problems:\index{sermon!maximum likelihood}
 the likelihood may be infinitely large at certain parameter settings.
	Even if the likelihood does not have infinitely-large spikes,
 the maximum of the likelihood is often unrepresentative,
 in high-dimensional problems.

 Even in  low-dimensional problems,
 maximum likelihood solutions can be unrepresentative. 
 As you may know from basic statistics, the
 maximum likelihood estimator (\ref{eq.sigmaML}) for a
 Gaussian's standard deviation, $\sigma_{\ssN}$\index{bias!in statistics},
 is a  {\em{biased}\/} estimator, a topic that we'll take up in
 \chref{ch.exactmarg}.

\subsubsection{The maximum {\itshape a posteriori\/} (MAP) method}
 A popular replacement for maximizing the likelihood
 is maximizing the Bayesian  posterior probability density
 of the parameters instead.
 However, multiplying
 the likelihood by a prior and maximizing the posterior 
 does not make the above problems go away;
 the posterior density often also has  infinitely-large spikes,
 and the maximum of the posterior probability density is 
 often unrepresentative of the whole posterior distribution.
 Think back to the concept of typicality, which we encountered in \chref{ch.two}:
 in high dimensions, most of the probability mass is in a typical set
 whose properties are quite different from the  points that have
 the maximum probability density. Maxima are atypical.

 A further reason\index{sermon!maximum {\em a posteriori\/} method}
 for disliking  the maximum {\em a posteriori\/} method is that it is {\em basis-dependent}.\index{basis dependence}
 If we make a nonlinear change of basis from the
 parameter $\theta$ to the parameter $u=f(\theta)$
 then the probability density  of $\theta$
 is transformed to
\beq
	P(u) = P(\theta) \left| \frac{ \partial \theta}{\partial u} \right| .
\label{eq.transformation.of.density}
\eeq
 The maximum of the density $P(u)$ will
 usually not coincide with the maximum of the density $P(\theta)$.
 (For figures illustrating such nonlinear changes of basis, see
 the next chapter.)
 It seems undesirable to use a method whose answers change
 when we change representation.
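
 Here is a small illustration, using a density chosen purely for
 convenience rather than one taken from the text. Take
 $P(\theta) = \theta \, e^{-\theta}$ for $\theta > 0$, whose maximum is at $\theta = 1$,
 and change basis to $u = \ln \theta$, so that $\partial \theta / \partial u = \theta$.
 Then
\[
	P(u) = \theta^2 e^{-\theta} , \:\:\: \mbox{with $\theta = e^{u}$} ,
\]
 and setting $\partial \ln P(u) / \partial u = 2 - e^{u}$ to zero
 puts the maximum at $\theta = 2$. The same distribution, described
 in two bases, has two different `most probable' values.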


\section*{Further reading}
 The  soft K-means algorithm  is at the heart of the automatic classification
 package, \ind{AutoClass}
 \cite{AutoClass,AutoClassTR}.

\section{Further exercises}
\subsection{Exercises where maximum likelihood may be useful}
\exercisxC{3}{ex.KmeansD}{
	Make a version of the K-means algorithm that
 models the data as a mixture of $K$ arbitrary Gaussians, \ie,
 Gaussians that are not constrained to be axis-aligned.
}
\exercisxB{2}{ex.poissonml}{
\ben
\item A \ind{photon counter} is pointed at a remote
 star for one minute, in order to infer  the brightness,
 \ie, the rate of
 photons arriving at the counter per minute, $\l$.
 Assuming the number of photons collected $r$ has a
 \ind{Poisson
 distribution} with mean $\l$,
\beq
	P(r  \given  \l ) = \exp( - \l)\frac{ \l^{r} }{r!} ,
\eeq
 what is the maximum likelihood estimate for $\l$, given $r = 9$?
 Find error bars on $\ln \l$.
\item
 Same situation, but now  we assume  that the
 counter detects not only photons from the star but
 also `background' photons.
 The \ind{background rate} of {photon}s is known to be $b \eq 13$ photons
 per minute. We assume the number of photons collected, $r$,
 has a Poisson distribution with mean $\l+b$.
 Now, given $r\eq 9$ detected photons,  what is the maximum likelihood estimate
 for $\l$?
 Comment on this answer, discussing also the Bayesian posterior
 distribution, and the `unbiased\index{unbiased estimator}\index{sermon!unbiased estimator}
 \ind{estimator}\index{bias!in statistics}'
 of sampling theory, $\hat{\l} \equiv r-b$.
\een
}
\exercisxC{2}{ex.bentcoin}{
	A bent coin is tossed $N$ times, giving $N_a$ heads and $N_b$
 tails. Assume a beta distribution prior for the probability of heads, $p$,
 for example the uniform distribution.
 Find the maximum likelihood and {maximum {\em a posteriori\/}}\index{maximum {\em a posteriori}}
 values of $p$, then find the   maximum likelihood and {maximum {\em a posteriori\/}}
 values of the logit $a \equiv \ln[p/(1-p)]$. Compare with the
 predictive distribution, \ie, the probability that the next
 toss will come up heads.
}
\exercisxB{2}{ex.stars}{
{\em  Two men looked through prison bars; one
 saw stars, the other tried to infer where the
 window frame was.}
\amarginfignocaption{t}{
\newcommand{\imwidthb}{25}
\begin{center}\footnotesize\small
\setlength{\unitlength}{0.03in}
\begin{picture}(38,31)(-7.5,-5.1)
\put(0,\imwidthb){\line(1,0){\imwidthb}}
\put(\imwidthb,0){\line(0,1){\imwidthb}}
\put(0,0){\line(1,0){\imwidthb}}
\put(0,0){\line(0,1){\imwidthb}}
%
\put(0,0){\makebox(0,0)[tr]{\footnotesize$(x_{\min},y_{\min})$}}
\put(\imwidthb,\imwidthb){\makebox(0,0)[bl]{\footnotesize$(x_{\max},y_{\max})$}}
\put(4.5,15){\makebox(0,0)[r]{$\star$}}
\put(8.7,6){\makebox(0,0)[t]{$\star$}}
\put(14.5,21.8){\makebox(0,0)[r]{$\star$}}
\put(12.7,12.9){\makebox(0,0)[t]{$\star$}}
\put(24.5,9.5){\makebox(0,0)[r]{$\star$}}
\put(10.88,4){\makebox(0,0)[t]{$\star$}}
\end{picture}
\end{center}
%}{%
% \caption[a]{}
% \label{fig.stars}
}%

 From the other side of a room,
 you look through a \ind{window} and see \ind{stars} at locations
 $\{ (x_n,y_n) \}$. You can't see the window edges
 because it is totally dark apart from the stars.
 Assuming the window is rectangular
 and that the visible stars' locations are independently randomly distributed,
 what are the inferred values of $(x_{\min},$ $y_{\min}$, $x_{\max}$, $y_{\max})$,
 according to maximum likelihood?
 Sketch the likelihood as a function of  $x_{\max}$, for fixed $x_{\min}$,
  $y_{\min}$, and $y_{\max}$.
}
\exercisxB{3}{ex.navigator}{
 A%
\amarginfig{t}{
\newcommand{\locone}{\put(5,5)}
\newcommand{\loctwo}{\put(32,-1)}
\newcommand{\locthr}{\put(10,31)}
\newcommand{\imwidthc}{25}
\begin{center}\footnotesize\small
\setlength{\unitlength}{0.03975in}
\begin{picture}(38,40)(-7.5,-6)
\locone{\circle{1}}
\loctwo{\circle{1}}
\locthr{\circle{1}}
\locone{\line(1,1){\imwidthc}}
\locone{\makebox(0,0)[tr]{\footnotesize$(x_{1},y_{1})$}}
\loctwo{\line(-1,2){15}}
\loctwo{\makebox(0,0)[tr]{\footnotesize$(x_{2},y_{2})$}}
\locthr{\line(3,-2){20}}
\locthr{\makebox(0,0)[tr]{\footnotesize$(x_{3},y_{3})$}}
\end{picture}
\end{center}
%}{%
\caption[a]{The standard way of drawing
 three slightly inconsistent bearings on a chart
 produces a  triangle  called a cocked hat. Where is the sailor?}
\label{fig.buoys}
}
  sailor infers his location $(x,y)$ by measuring the
 bearings of three buoys whose
 locations $(x_n,y_n)$ are given on his chart.
 Let the true bearings of the buoys be $\theta_n$.
 Assuming that his measurement $\tilde\theta_n$ of each bearing
 is subject to Gaussian noise of small standard deviation $\sigma$,
 what is his inferred location, by maximum likelihood?
% http://education.qld.gov.au/tal/kla/compass/html/cncha.htm

 The sailor's rule of thumb says that the boat's
 position can be taken to be the centre of the 
 \ind{cocked hat}, the \ind{triangle} produced
 by the intersection of the three measured bearings (\figref{fig.buoys}).
 Can you persuade him that the maximum likelihood answer is better?
%
% 2 answers: 1) consider special case where two buoys
% very close. Then those bearings very accurate, should ignore the third.
% The centre of the triangle
% may be some way away from the intersection of the first two bearings.
% 2) consider special case where the triangle does not exist
% because the three bearings intersect on the wrong side of one of the buoys.
%   /
% --*--Boat known to be out in this direction
%   \
}

\exercissxB{3}{ex.mlmaxenta}{
 {\sf Maximum likelihood fitting of an \ind{exponential-family} model.}

 Assume that a variable $\bx$ comes from a probability
 distribution of the form
\beq
	P(\bx \given  \bw) = \frac{1}{Z(\bw)} \exp \left( \sum_k w_k f_k(\bx) \right),
\eeq
 where the functions $f_k(\bx)$ are given, and the parameters $\bw = \{ w_k \}$
 are not known.
 A data set $\{ \bx^{(n)} \}$ of $N$ points is supplied.

 Show  by differentiating the log likelihood that the maximum-likelihood
 parameters $\wml$ satisfy
\beq
	\sum_{\bx} P(\bx \given  \wml) f_k(\bx) = \frac{1}{N} \sum_{n} f_k(\bx^{(n)}) ,
\eeq
 where the left-hand sum is over {\em all\/} $\bx$, and the right-hand
 sum is over the data points.
 A shorthand for this result is that each function-average under the
 fitted model must equal the function-average found in the data:
\beq
	\left< f_k \right>_{ P(\bx \given  \wml) } = 
	\left< f_k \right>_{ {\rm Data} }  .
\eeq
}
\exercisxB{3}{ex.mlmaxentb}{
 {\sf `Maximum entropy' fitting of models to constraints.}\index{maximum entropy}

 When confronted by a probability distribution $P(\bx)$
 about which only a few facts are known, the {\dem{maximum entropy principle}\/} (maxent)
 offers a rule  for {\em choosing\/} a distribution that
 satisfies those constraints.
 According to \ind{maxent}, you should select the
 $P(\bx)$ that maximizes the entropy
\beq
	H = \sum_{\bx} P(\bx) \log 1/P(\bx) ,
\eeq
 subject to the constraints.
 Assuming the constraints assert that
 the {\em averages\/} of certain functions   $f_k(\bx)$  are known, \ie,
\beq
	\left< f_k \right>_{ P(\bx) } =  F_k ,
\label{eq.consME}
\eeq
 show, by introducing Lagrange multipliers (one for each constraint,
 including normalization),
 that the maximum-entropy
 distribution has the form
\beq
	P(\bx)_{\rm Maxent} = \frac{1}{Z} \exp \left( \sum_k w_k f_k(\bx) \right) ,
\eeq
 where the parameters $Z$ and $\{ w_k \}$ are set such that 
 the constraints (\ref{eq.consME})  are satisfied.

 And hence the maximum entropy method gives identical results
 to maximum likelihood fitting of an \ind{exponential-family} model
 (previous exercise).

\begin{aside}
 The maximum entropy method has sometimes been recommended
 as a method for assigning \index{prior!assigning}prior
 distributions in Bayesian modelling.
 While the outcomes of the maximum entropy method are sometimes
 interesting and  thought-provoking, I do not advocate maxent
 as {\em the\/}  approach to assigning \ind{prior}s.

 Maximum entropy is also sometimes proposed as a method
 for solving inference problems -- for example, `given that
 the mean score of this unfair six-sided die is 2.5, what is its
 probability distribution $(p_1,p_2,p_3,p_4,p_5,p_6)$?'
 I think it is  a bad idea to use maximum entropy in this way;\index{sermon!maximum entropy}
 it can give  silly answers. The correct way to solve
 inference problems is to use \Bayes\  theorem.
\end{aside}
}

\subsection{Exercises where maximum likelihood and MAP have difficulties}
\exercisxB{2}{ex.mog}{
 This exercise explores  the idea that maximizing  a probability density
 is a poor way to find a  point that is representative of the density.
 Consider a Gaussian distribution in a $k$-dimensional space,
 $P(\w) = (1/\sqrt{2 \pi} \, \sigW)^k \exp( -\sum_{i=1}^k w_i^2/2 \sigW^2)$. 
 Show that nearly all of the probability mass of a Gaussian is in a thin shell
 of radius $r=\sqrt{k} \sigW$ and of thickness proportional to
% $\propto
 $r/\sqrt{k}$. For example,  in 1000 dimensions, 90\% of the mass of a
 Gaussian with $\sigW = 1$ is in a shell of radius 31.6 and thickness
 2.8.
% 2.4 sigma gives 0.9986 of a Gaussian. 
 However, the probability {\em density\/} at the origin is $e^{k/2}
 \simeq 10^{217}$ times bigger than the density at this shell where
 most of the probability mass is.
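
 (As a check on that factor: on the shell, $\sum_i w_i^2 = k \sigW^2$,
 so the ratio of the density at the origin to the density on the shell is
\[
	\exp \left( \frac{ k \sigW^2 }{ 2 \sigW^2 } \right) = e^{k/2} ,
\]
 and for $k=1000$ we have $e^{500} = 10^{500/\ln 10} \simeq 10^{217}$.)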
 
%$\bullet$ 
 Now consider two Gaussian densities in 1000 dimensions that differ in
 radius $\sigW$ by just 1\%, and that contain equal total probability mass.
% In
% each case 90\% of the mass is located in a shell which differs in
% radius by only 1\% between the two distributions. 
 Show that the maximum
 probability density
%, however, 
 is greater at the centre of the
 Gaussian with smaller $\sigW$ by a factor of $\sim \! \exp( 0.01 k )
 \simeq 20\,000$.

 In  \ind{ill-posed problem}s,
 a typical  posterior
        distribution  is often a  weighted 
        superposition of Gaussians with varying means and standard deviations,
 so the         true posterior has a skew peak, with the maximum of the 
        probability density located near the mean of the 
        Gaussian distribution that has the smallest standard deviation, 
        not the Gaussian with the greatest weight.
}
\exercisxB{3}{ex.manyparams}{ {\sf The seven scientists}.
 $N$ datapoints $\{x_n\}$ are drawn from
 $N$ distributions, all of which are Gaussian with
 a common mean $\mu$ but with different unknown standard deviations $\sigma_n$.
 What are  the maximum likelihood parameters
 $\mu, \{ \sigma_n \}$ given the data?
\marginfig{
\begin{center}
\mbox{\psfig{figure=figs/manyparams.ps,width=1.75in,angle=-90}}
\\[0.431in]
\begin{tabular}{cr@{.}l} \toprule
Scientist & \multicolumn{2}{c}{ $x_n$ } \\ \midrule
A & $-$27&020	\\
B &   3&570	\\
C &   8&191	\\
D &   9&898	\\
E &   9&603	\\
F &   9&945	\\
G &  10&056	\\  \bottomrule
\end{tabular}
\end{center}
\caption[a]{Seven measurements $\{x_n\}$ of a parameter $\mu$
 by seven scientists each having his own
  noise-level $\sigma_n$.}
\label{fig.manyparams}
}
 For example,  seven scientists (A, B, C, D, E, F, G)
 with wildly-differing
 \ind{experimental skill}s measure $\mu$. You expect some of them to do accurate
 work (\ie, to have small $\sigma_n$), and some of them to turn in
 wildly inaccurate answers (\ie, to have enormous $\sigma_n$).
 \Figref{fig.manyparams} shows their seven results.
 What is $\mu$, and how reliable is each scientist?

 I hope you agree that, intuitively, it looks pretty certain
 that A and B are both inept measurers, that D--G are better, and
 that the true value of $\mu$ is somewhere close to 10.
 But what does maximizing the likelihood tell you?
}
\exercisxC{3}{ex.alpha}{
 {\sf Problems with MAP method.}
 A collection of widgets $i=1,\ldots,  k$ have a property called `wodge',
 $w_i$, which we measure, widget by widget, in noisy experiments with
 a known noise level $\snu\eq  1.0$.  Our model for these quantities
 is that they come from a Gaussian prior $P(w_i  \given  \a) = \Normal(0,\dfrac{1}{\a})$,
 where $\a
 \eq  1 / \sigW^2 $ is not known. Our prior for this variance is flat
 over $\log \sigW$ from $\sigW = 0.1$ to $\sigW = 10$. 


 {\sf Scenario 1.}  Suppose four widgets have been measured and give
 the following data: $\{d_1,d_2,d_3,d_4\}=
 \{$2.2, $-2.2$, 2.8, $-2.8\}$.
 We are interested in inferring the wodges of these four
 widgets.
\ben
\item Find the values of $\bw$ and $\a$ that maximize the
 posterior probability  $P(\bw, \log \a \given  \bd)$.
\item
 Marginalize over $\a$ and  find the posterior probability
 density of $\bw$ given the data.  [Integration skills required. See
 \citeasnoun{MacKay94:alpha_nc} for solution.]
 Find  maxima 
 of $P(\bw \given \bd)$.
 [Answer:   two maxima -- one at 
 $\wmp =
 \{1.8,-1.8,2.2,-2.2 \},$  with error bars on all four parameters (obtained
 from Gaussian approximation to the posterior) $\pm 0.9$;
 and  one at $\wmp' =
 \{ 0.03 , - 0.03 , 0.04 , - 0.04 \}$ with error bars $\pm 0.1$.] 
\een
 
 {\sf Scenario 2.} Suppose in addition to the four measurements above 
 we are now informed that there are 
 four more widgets that have been measured with a 
 much less accurate instrument, having  $\snu'\eq  100.0$.  Thus we now
 have both well-determined and ill-determined parameters, as in a typical
 \ind{ill-posed problem}. The data from these measurements were
 a string of  uninformative 
 values, 
 $\{d_5,d_6,d_7,d_8\}= \{$100, $-100,$ 100,
 $-100\}$.

 We are again asked to infer the wodges of the widgets.
 Intuitively,  our inferences about
 the well-measured widgets should be negligibly affected by this vacuous
 information about the poorly-measured widgets.
 But what happens to the MAP method?

\ben
\item Find the values of $\bw$ and $\a$ that maximize the
 posterior probability  $P(\bw, \log \a \given  \bd)$.
\item
 Find  maxima 
 of $P(\bw \given \bd)$.
 [Answer:
 only one maximum,
 $\wmp = \{
 0.03$, $-0.03$, $0.03$, $-0.03$, $0.0001$, $-0.0001$, $0.0001$, $-0.0001 \}$, 
 with
% marginal
 error bars on
 all eight parameters $\pm 0.11$.]
% \sigma_{w|D} = 0.11$.]
\een

% see bayes/alpha4.ms
% see bayes/alpha974.ms

}


\section{Solutions}
\soln{ex.mixture_em}{
%  Follow the instructions. 
%
\amarginfig{c}{
\begin{raggedright}
	\raisebox{-0.795in}[0in][0in]{\mbox{\hspace*{-0.55in}\psfig{figure=figs/likeanswer.ps,angle=-90,width=2.9in}}}
% made by dologmix
\end{raggedright}
\caption[a]{The likelihood 
 as a function of $\mu_1$ and $\mu_2$.}
\label{fig.32mog}
}%	
 \Figref{fig.32mog} shows
 a contour plot of the likelihood function for the 32 data points.
 The peaks are  pretty-near centred on 
 the points $(1,5)$ and $(5,1)$, and are pretty-near
 circular in their contours. The width of each of the peaks 
 is a standard deviation of  $\sigma/\sqrt{16} = 1/4$.
 The peaks are roughly Gaussian in shape.
} 
\soln{ex.mlmaxenta}{
% {\sf Maximum likelihood fitting of an \ind{exponential-family} model.}
 The log likelihood is:
\beq
	\ln P( \{ \bx^{(n)} \} \given  \bw) = -N\ln {Z(\bw)}
              + \sum_n  \sum_k w_k f_k(\bx^{(n)}) .
\eeq
 Differentiating with respect to $w_k$,
\beq
	\frac{\partial}{\partial w_k}
	\ln P( \{ \bx^{(n)} \}  \given  \bw)
 = - N \frac{\partial}{\partial w_k} \ln {Z(\bw)} +  \sum_n f_k(\bx^{(n)})  .
\eeq
 Now, the fun part  is what happens when we differentiate the
 log of the normalizing constant:
\[
  \frac{\partial}{\partial w_k} \ln {Z(\bw)} \ = \ 
 \frac{1}{Z(\bw)} \sum_{\bx}   \frac{\partial}{\partial w_k} \exp \left( \sum_{k'} w_{k'}  f_{k'}(\bx) \right)
\]
\beq
= \ 
 \frac{1}{Z(\bw)} \sum_{\bx}  \exp \left( \sum_{k'} w_{k'}  f_{k'}(\bx) \right) f_k(\bx)
\ = \  
  \sum_{\bx} P( \bx \given  \bw)   f_k(\bx) ,
\eeq
 so 
\beq
	\frac{\partial}{\partial w_k}
	\ln P( \{ \bx^{(n)} \}  \given  \bw)
 = - N \sum_{\bx} P( \bx \given  \bw)   f_k(\bx) +  \sum_n f_k(\bx^{(n)})  ,
\eeq
 and at the maximum of the  likelihood,
\beq
	\sum_{\bx} P(\bx \given  \wml) f_k(\bx) = \frac{1}{N} \sum_{n} f_k(\bx^{(n)}) .
\eeq
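 For instance, in a toy case chosen here for illustration (it is not
 part of the exercise), let $x \in \{0,1\}$ and let there be a single function $f(x) = x$
 with a single parameter $w$.
 Then $Z(w) = 1 + e^{w}$, and the condition above reads
\[
	\frac{ e^{w_{\rm ML}} }{ 1 + e^{w_{\rm ML}} } = \frac{1}{N} \sum_n x^{(n)} ,
\]
 that is, the fitted probability of $x=1$ equals the fraction of ones in the data,
 and $w_{\rm ML}$ is the logit of that fraction.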
}




\chapter{Useful Probability Distributions}
\label{ch.distributions}
% This chapter is unfortunately found a little intimidating
% because it uses gamma distributions, which are not really
% worth being scared of, and are not central to the chapter.
% Gamma distributions are a lot like Gaussian distributions, 
% except that whereas the Gaussian goes from $-\infty$ to $\infty$, 
%  gamma distributions go from 0 to $\infty$.
% Include a graph of a gamma distribution here.

\newcommand{\dinkyfig}[1]{\mbox{\psfig{figure=#1,angle=-90,width=1.51in}}}
\newcommand{\dinkyfigl}[1]{\mbox{\psfig{figure=#1,angle=-90,width=1.64in}}}
\amarginfig{t}{\small%
\begin{tabular}{r}
% $P(r \given f,N)$\\
\dinkyfig{bigrams/urn.f.g.ps}%
\\
\dinkyfigl{bigrams/urn.f.l.ps}%
\\[-0.1in]
\multicolumn{1}{c}{$r$}
\\
\end{tabular}
%}{%
\caption[a]{The binomial distribution $P(r \given f\eq 0.3,\,N \eq 10)$,
 on a linear scale (top) and  a logarithmic scale (bottom).}
\label{fig.binomial.again}
}
 In Bayesian data modelling, there's a small collection of
 probability distributions that come up again and again.
 The purpose of this chapter is to introduce these distributions
 so that they won't be intimidating when encountered in
 combat situations.

 There is no need to memorize any of them, except
 perhaps the Gaussian; 
 if a distribution is important enough,
 it  will memorize itself, and otherwise, it
 can easily be looked up.

\section{Distributions over integers}
\begin{center}
{\sf Binomial, Poisson, exponential}\par
\end{center}
\noindent
 \index{distribution!useful}\index{probability distributions}We already encountered the binomial distribution and the
 Poisson distribution on page \pageref{sec.poisson}.

 The {\dem\ind{binomial distribution}\/} for an integer\index{distribution!binomial}
 $r$ with parameters $f$ (the bias, $f \in [0,1]$)
 and $N$ (the number of trials) is:
\beq
 P(r \given f,N) = {N \choose r} f^{r} (1-f)^{N-r} \:\:\:\:\:\: r \in \{ 0,1,2,\ldots , N \} .
\label{eq.binomial.again}
\eeq

 The binomial distribution arises, for example, when we flip a bent
 coin, with bias $f$, $N$ times, and observe the number of heads, $r$.
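
 As a worked instance, using the parameter values of \figref{fig.binomial.again},
 the probability of getting exactly three heads is
\[
	P(r \eq 3 \given f \eq 0.3, N \eq 10) = {10 \choose 3} (0.3)^{3} (0.7)^{7}
	= 120 \times 0.027 \times 0.0824 \simeq 0.27 ,
\]
 which is the largest of the eleven probabilities; the mean of the distribution is $Nf = 3$.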
\medskip

% see bigrams/README
 The {\dem\ind{Poisson distribution}\/}  with parameter $\l > 0$ is:\index{distribution!Poisson}
\beq
	P( r  \given  \l ) = e^{-\l} \frac{\l^r}{r!} \:\:\:\:\:\: r\in \{ 0,1,2,\ldots\} .
\label{eq.poisson.again}
\eeq
 The  Poisson distribution  arises, for example, when we count the number
 of photons $r$ that arrive in a pixel during a fixed interval,
 given that the mean intensity on the pixel corresponds to
 an average number of photons $\l$.
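
 A quick way to get a feel for this distribution is to look at the ratio of successive
 probabilities,
\[
	\frac{ P( r+1 \given \l ) }{ P( r \given \l ) } = \frac{ \l }{ r+1 } ,
\]
 so the probabilities rise while $r+1 < \l$ and fall thereafter.
 For example, if $\l = 2.7$, the most probable count is $r=2$,
 with $P(2 \given \l) = e^{-2.7} (2.7)^2 / 2 \simeq 0.245$.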
\amarginfig{b}{\small%
~\\[0.2in]
\begin{tabular}{r}
\mbox{\psfig{figure=bigrams/poisson.a1.g.ps,angle=-90,width=1.5in}}%
\\
\mbox{\psfig{figure=bigrams/poisson.a1.l.ps,angle=-90,width=1.64in}}%
\\[-0.1in]
\multicolumn{1}{c}{$r$}
\\
\end{tabular}
%}{%
\caption[a]{The Poisson distribution $P(r \given \l\eq 2.7)$,
 on a linear scale (top) and  a logarithmic scale (bottom).}
\label{fig.poisson.2}
}
% see bigrams/README
\medskip


 The {\dem{exponential distribution on integers}},\index{exponential distribution!on integers}\index{distribution!exponential}
\beq
	P(r \given f)
=
	f^{r} (1-f)  \:\:\:\:\:\: r \in \{ 0,1,2,\ldots \} ,
\label{eq.exponentiali}
\eeq
 arises in waiting problems. How many rolls of a fair six-sided die
 will you have to sit through before a six is rolled?
 Answer: the probability distribution of the number of rolls that precede the first six, $r$, 
 is exponential over integers with parameter $f=5/6$.
 The distribution may also be written 
\beq
	P(r \given f)
=
	 (1-f)  \, e^{-\lambda r} \:\:\:\:\:\: r \in \{ 0,1,2,\ldots \} ,
\label{eq.exponentialii}
\eeq
 where $\lambda = \ln (1/f)$.
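
 Two sums worth checking, for $0 \leq f < 1$, are the normalization and the mean:
\[
	\sum_{r=0}^{\infty} f^{r} (1-f) = 1 , \:\:\:\:\:\:
	\sum_{r=0}^{\infty} r f^{r} (1-f) = \frac{f}{1-f} .
\]
 With $f = 5/6$ the mean is 5: reading $r=0$ as `a six on the very first roll',
 we expect on average five other rolls to precede the first six.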
 

\section{Distributions over unbounded real numbers}
\begin{center}
{\sf Gaussian, Student, Cauchy, biexponential,  inverse-cosh.}
\par
\end{center}
\noindent
 The {\dem\ind{Gaussian distribution}\/} or \ind{normal} distribution\index{distribution!Gaussian}\index{distribution!normal}
 with mean $\mu$ and standard deviation $\sigma$
 is
\beq
 P(x \given \mu,\sigma) = 	\frac{1}{Z} \exp \left( - \frac{(x-\mu)^2}{2 \sigma^2} \right)
\ \  \:\: x\in(-\infty,\infty) ,
\eeq
 where
\beq
	Z = \sqrt{ 2 \pi \sigma^2 } .
\eeq
 It is sometimes useful to work with the quantity $\tau \equiv 1/\sigma^2$,
 which is called the {\dem\ind{precision}} parameter of the Gaussian.

%\begin{aside}
 {A \ind{sample}\index{sample!from Gaussian} $z$\index{Gaussian distribution!sample from}\index{distribution!Gaussian!sample from}
 from a standard univariate Gaussian can be generated by computing 
\beq
 z = \cos(2 \pi u_1)  \sqrt{2 \ln(1/u_2) },
\eeq
 where $u_1$ and $u_2$ are uniformly distributed in $(0,1)$.}
%
 A second sample $z_2 =  \sin(2 \pi u_1)  \sqrt{2 \ln(1/u_2) }$,
 independent of the first, can then be obtained
 for free.
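
 A sketch of why this recipe works, with $\rho$ and $\phi$ as labels introduced
 just for this argument: write $\rho = \sqrt{2 \ln(1/u_2)}$ and $\phi = 2 \pi u_1$, so that
 $(z, z_2) = (\rho \cos \phi, \rho \sin \phi)$. Then $u_2 = e^{-\rho^2/2}$
 and $u_1 = \phi / 2 \pi$, giving
\[
	| {\rm d} u_1 \, {\rm d} u_2 | = \frac{1}{2 \pi} \, e^{-\rho^2/2} \, \rho \, {\rm d} \rho \, {\rm d} \phi
	= \frac{1}{2 \pi} \, e^{-(z^2 + z_2^2)/2} \, {\rm d} z \, {\rm d} z_2 .
\]
 The uniform density on the unit square is thus mapped to the product
 of two standard Gaussian densities, one for $z$ and one for $z_2$.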
%\end{aside}

 The Gaussian distribution is widely used and often asserted
 to be a very common distribution in the real world, but I am
 sceptical about this assertion. Yes, {\em unimodal\/} distributions
 may be common; but a Gaussian is a special, rather extreme,
 unimodal distribution.  It has very light tails: the log-probability-density
 decreases quadratically.\index{tail}
 The typical deviation of $x$ from $\mu$ is $\sigma$, but the
 respective probabilities that $x$ deviates from $\mu$ by more than $2 \sigma$,
 $3 \sigma$,  $4 \sigma$, and $5 \sigma$,
 are
 $0.046$, 0.003, $6 \times 10^{-5}$, and $6 \times 10^{-7}$.
% 046
% 0027
% 6.3
% 5.7
 In my experience, deviations from a mean four or five times greater
 than the typical deviation may be rare, but
%  they can happen more often than
 not as rare as $6 \times 10^{-5}$!
 I therefore urge caution\index{caution!Gaussian distribution}
 in the use of Gaussian distributions:
 if  a variable  that is modelled with a Gaussian 
 actually has a heavier-tailed\index{tail} distribution, the rest of the  model
 will contort itself to reduce the deviations of the
 outliers, like a sheet of paper being crushed by a 
 rubber band.
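
 The tail probabilities quoted above are values of
\[
	P( | x - \mu | > a \sigma ) = 2 \int_{a}^{\infty} \frac{1}{\sqrt{2 \pi}} \, e^{-t^2/2} \, {\rm d} t
	\: < \: \sqrt{ \frac{2}{\pi} } \, \frac{ e^{-a^2/2} }{ a } ,
\]
 where the symbol $a$ for the number of standard deviations is introduced here.
 The bound on the right is a handy approximation when $a$ is large:
 it gives $6.7 \times 10^{-5}$ for $a=4$ and $5.9 \times 10^{-7}$ for $a=5$,
 and it makes plain that the tails fall off like $e^{-a^2/2}$.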
\exercisxB{1}{ex.findstats}{
	Pick a variable that is supposedly bell-shaped
 in probability distribution, gather data,
 and make a plot of the variable's empirical distribution. Show the distribution
 as a histogram on a log scale  and  investigate whether
 the tails are well-modelled by a Gaussian distribution.
 [One example of a variable to study is the amplitude of
 an audio signal.]
}
 
 One distribution  with heavier tails than a Gaussian
 is a {\dem\ind{mixture of Gaussians}}. A mixture of two Gaussians,
 for example, is defined by two means, two standard deviations,
 and two {\dem\ind{mixing coefficients}} $\pi_1$ and $\pi_2$,
 satisfying $\pi_1+\pi_2=1$, $\pi_i \geq 0$.
\[%beq
 P(x \given \mu_1,\sigma_1,\pi_1,\mu_2,\sigma_2,\pi_2) =
% \pi_1
 \frac{\pi_1}{\sqrt{ 2 \pi} \sigma_1} \exp \left( -\smallfrac{(x-\mu_1)^2}{2 \sigma_1^2} \right)
+
%\pi_2
 \frac{\pi_2}{\sqrt{ 2 \pi} \sigma_2} \exp \left( -\smallfrac{(x-\mu_2)^2}{2 \sigma_2^2} \right).
\]%eeq
%
% ?????????/  Sun 3/2/02
% \begin{figure}


 If we take an appropriately weighted mixture of an infinite number
 of Gaussians, all having mean $\mu$,
 we obtain a {\dem\ind{Student-$t$ distribution}},\index{distribution!Student-$t$}
\beq
	P(x \given \mu,s,n)
	= \frac{1}{Z} \frac{1}{ ( 1+(x-\mu)^2/(n s^2) )^{(n+1)/2} } ,
\label{eq.student}
\eeq
 where
% (CHECK) published by William Gosset in 1908. His employer, Guinness Breweries, required him to publish under a pseudonym, so he chose "Student."
% checked, http://mathworld.wolfram.com/Studentst-Distribution.html
%\begin{figure}
%\figuremargin{\small
\amarginfig{b}{\small
\begin{center}
\begin{tabular}{rr}
\dinkyfig{bigrams/student1.ps}%
\\
\dinkyfigl{bigrams/student1.l.ps}%
\end{tabular}
\end{center}
%}{
\caption[a]{Three unimodal distributions.
 Two Student  distributions, with parameters
 $(n,s)=(1,1)$ (heavy line) (a Cauchy distribution)\index{distribution!Cauchy}
 and $(2,4)$ (light line),
 and a Gaussian distribution with mean $\mu = 3$ and
 standard deviation $\sigma=3$ (dashed line),
 shown on  linear vertical scales (top) and logarithmic
 vertical scales (bottom).
 Notice that the heavy tails of the Cauchy distribution
 are scarcely evident in the upper `bell-shaped curve'.}
\label{fig.student}
}
%\end{figure}
%%%%%%%%%%%%% CHECK !!!!!!!!!!!!!!!!!11
\beq
 Z = \sqrt{ \pi n s^2 } \frac{ \Gamma(n/2) }{  \Gamma((n+1)/2) }
\eeq
 and $n$ is called the number of degrees of
 freedom and $\Gamma$ is the gamma function.
 If $n>1$ then the Student distribution (\ref{eq.student}) has a mean
 and that mean is $\mu$. If $n>2$
 the distribution also has a finite variance,
 $\sigma^2 = ns^2/(n-2)$.
 As $n \rightarrow \infty$, the Student
 distribution approaches the normal distribution
 with mean $\mu$ and standard deviation $s$.
 The Student distribution arises both in classical
 statistics (as the sampling-theoretic distribution
 of certain statistics) and in Bayesian inference
 (as the probability distribution of a variable
 coming from a Gaussian distribution whose
 standard deviation we aren't  sure of).
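
 One way to make the `mixture of an infinite number of Gaussians' concrete
 is sketched here, under the assumption that the weight given to a Gaussian
 of precision $\tau = 1/\sigma^2$ is proportional to $\tau^{n/2-1} \exp( - \tau n s^2 / 2 )$.
 The normalizing constant of each Gaussian contributes a factor $\tau^{1/2}$, so the mixture is proportional to
\[
	\int_0^{\infty} \! \tau^{(n+1)/2 - 1}
	\exp \left( - \tau \, \frac{ n s^2 + (x-\mu)^2 }{ 2 } \right) {\rm d} \tau
	= \frac{ \Gamma( (n+1)/2 ) }{ \left[ ( n s^2 + (x-\mu)^2 ) / 2 \right]^{(n+1)/2} } ,
\]
 which, as a function of $x$, is proportional to $1 / ( 1+(x-\mu)^2/(n s^2) )^{(n+1)/2}$,
 the form in equation (\ref{eq.student}).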

 In the special case $n=1$, the Student
 distribution is called the {\dem{\ind{Cauchy distribution}}}.
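 Putting $n=1$ into the expressions above, and using $\Gamma(1/2) = \sqrt{\pi}$ and $\Gamma(1) = 1$,
 the normalizing constant is $Z = \pi s$, so the Cauchy density is
\[
	P(x \given \mu,s) = \frac{1}{\pi s} \, \frac{1}{ 1+(x-\mu)^2/s^2 } .
\]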
\medskip

 A distribution whose tails are intermediate in heaviness between\index{tail}
 Student and Gaussian is the {\dem\ind{biexponential distribution}},\index{distribution!biexponential}
\beq
	P(x \given \mu,s) =
	\frac{1}{Z} \exp \left( - \frac{|x - \mu|}{s} \right) \:\: x \in (-\infty,\infty)
\eeq
 where
\beq
	Z = 2 s.
\eeq
% figure here from 01.tex
\medskip

 The {\dem\ind{inverse-cosh distribution}\/}\index{distribution!inverse-cosh}
\beq
	P(x \given \beta) \propto \frac{1}{[\cosh(\beta x)]^{1/\beta}}
\eeq
 is a popular model in \ind{independent component analysis}.
 In the limit 
 of large $\beta$, the
% nonlinearity becomes a step function and the 
 probability distribution $P(x \given \b)$ becomes a biexponential distribution.
%,  $Pp_i(s_i) \propto \exp(-|x|)$.
 In the limit $\beta \rightarrow 0$,
 $P(x \given \b)$ approaches a Gaussian with mean zero and variance $1/\beta$.
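
 [A sketch of why these limits hold, added as a check rather than
 taken from the original derivation. For large $\beta$ and fixed $x \neq 0$,
 $\cosh(\beta x) \simeq \frac{1}{2} e^{\beta |x|}$; for small $\beta$,
 $\ln \cosh(\beta x) \simeq \beta^2 x^2 / 2$. So
\beqan
	\frac{1}{[\cosh(\beta x)]^{1/\beta}}
	& \simeq & 2^{1/\beta} \, e^{-|x|}
		\:\:\: \mbox{(large $\beta$)} , \\
	\frac{1}{[\cosh(\beta x)]^{1/\beta}}
	& \simeq & \exp \left( - \frac{\beta x^2}{2} \right)
		\:\:\: \mbox{(small $\beta$)} .
\eeqan
 The factor $2^{1/\beta}$ tends to 1 and is in any case absorbed by the
 normalizing constant; the second line is a Gaussian with mean zero and
 variance $1/\beta$.]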


\section{Distributions over {\slshape\textbf{positive\/}} real numbers}
\begin{center}
{\sf {Exponential}, gamma, inverse-gamma, and {log-normal}.}
\par
\end{center}
\noindent
 The {\dem\ind{exponential distribution}},\index{distribution!exponential}
\beq
	P(x \given s) =
	\frac{1}{Z} \exp \left( - \frac{x}{s} \right)  \:\: \ \ x \in (0,\infty) ,
\label{eq.exponential}
\eeq
 where
\beq
	Z =  s,
\eeq
 arises in waiting problems. How long will you have to
 wait for a bus in \ind{Poissonville}, given that buses arrive independently
 at random with  one every $s$ minutes on average?
 Answer: the probability distribution of your wait, $x$,
 is exponential with mean $s$.
\medskip
 
 The {\dem\ind{gamma distribution}} is   like a Gaussian distribution,\index{distribution!gamma} 
 except that whereas the Gaussian goes from $-\infty$ to $\infty$, 
  gamma distributions go from 0 to $\infty$.
 Just as the Gaussian distribution has two parameters $\mu$ and $\sigma$
 which control the mean and width of the distribution,
 the gamma distribution has two parameters.
 It is the product of  the one-parameter
 exponential distribution
 (\ref{eq.exponential}) with a polynomial, $x^{c-1}$.
 The exponent $c$ in the polynomial is the second parameter.
\beq
        P( x  \given  s,c ) \:\:=\:\: \Gamma( x ; s , c ) \:\:=\:\:
\frac{1}{Z}
\left(         \frac{ x }{s} \right)^{c-1} \!
         \exp \left( - \frac{x}{s} \right)
        ,\:\:\: 0\leq x < \infty
\label{gamma.dist}
\eeq
 where
\beq
	Z = \Gamma(c)  s .
\eeq
 This is a simple peaked distribution with mean $sc$ and
 variance $s^2c$.
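
 [These moments can be checked with the substitution $t = x/s$ and the
 identity $\Gamma(c+1) = c \, \Gamma(c)$; the following is a sketch of the
 calculation, not part of the original derivation:
\beq
	\int_0^{\infty} \! x \, \frac{1}{\Gamma(c) s}
	\left( \frac{x}{s} \right)^{c-1} \! \exp \left( - \frac{x}{s} \right) \d x
	\:\: = \:\: \frac{s}{\Gamma(c)} \int_0^{\infty} \! t^{c} e^{-t} \, \d t
	\:\: = \:\: s \, \frac{\Gamma(c+1)}{\Gamma(c)} \:\: = \:\: sc .
\eeq
 In the same way the second moment is $s^2 \Gamma(c+2)/\Gamma(c) = s^2 c(c+1)$,
 so the variance is $s^2 c (c+1) - (sc)^2 = s^2 c$.]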

 It is often natural to represent
 a positive real variable $x$ in terms of its logarithm
 $l = \ln x$.
 The probability  density of $l$ is 
% x = e^l
% dx/dl = e^l = x
\beqan
	P(l)& =& P(x(l)) \, \left| \frac{\partial x}{\partial l} \right|
\:\:
		= \:\:  P(x(l)) x(l)  \label{eq.transformlog} \\
&=& 
\frac{1}{Z_l}
\left(         \frac{ x(l) }{s} \right)^{\! c}
         \exp \left( - \frac{x(l)}{s} \right)
        ,
\label{gamma.distl}
\eeqan
 where
%
\beq
	Z_l \:\: =\:\: \Gamma(c)  .
\eeq
 [{{The gamma distribution is named after its normalizing constant --
 an odd convention, it seems to me!}}]
\begin{figure}
\figuremargin{\small
\begin{center}
\begin{tabular}{rr}
\dinkyfig{bigrams/gamma1.3.x.ps}%
&
\dinkyfig{bigrams/gamma1.3.l.ps}%
\\
\dinkyfigl{bigrams/gamma1.3.x.l.ps}%
&
\dinkyfigl{bigrams/gamma1.3.l.l.ps}%
\\
\hspace*{0.6in} $x$ & \hspace*{0.6in} $l = \ln x$ \\
\end{tabular}
\end{center}
}{
\caption[a]{Two gamma distributions, with parameters
 $(s,c)=(1,3)$ (heavy lines) and $(10,0.3)$ (light lines),
 shown on  linear vertical scales (top) and logarithmic
 vertical scales (bottom);
 and shown as a function of $x$ on the left (\ref{gamma.dist})
 and $l = \ln x$ on the right (\ref{gamma.distl}).}
\label{fig.gammas}
}
\end{figure}

 \Figref{fig.gammas} shows a couple of gamma distributions as a function
 of $x$ and of $l$. Notice that where the original gamma
 distribution  (\ref{gamma.dist}) may have a `spike' at $x=0$, the
 distribution over $l$ never has such a spike. The spike
 is an artefact of a bad choice of basis.
 
 In the limit $sc = 1, c
 \rightarrow 0$, we obtain the {noninformative prior} for a scale
 parameter, the $1/x$ prior. This \ind{improper} {prior} is called
 noninformative because it has no associated length scale,
 no characteristic value of $x$, so it prefers all values of $x$
 equally. It is 
 {invariant} under the reparameterization $x' = m x$.
 If we transform the $1/x$ probability density into a density over $l= \ln x$
 we find the latter density is uniform.
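
 [Both statements follow from the rule for transforming densities,
 $P(y) = P(x) \left| \partial x / \partial y \right|$. As a sketch, writing
 $x' = mx$ for the rescaled variable and taking $P(x) \propto 1/x$:
\beq
	P(x') \:\propto\: \frac{1}{x} \left| \frac{\partial x}{\partial x'} \right|
	\:=\: \frac{m}{x'} \, \frac{1}{m} \:=\: \frac{1}{x'} ,
	\hspace{0.3in}
	P(l) \:\propto\: \frac{1}{x} \left| \frac{\partial x}{\partial l} \right|
	\:=\: \frac{1}{x} \, x \:=\: 1 .
\eeq
 The density has the same form whatever the units of $x$, and it is flat in $l$.]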

\exercisxB{1}{ex.power}{
 Imagine that we reparameterize a positive variable $x$
 in terms of its cube root, $u = x^{1/3}$.
 If the probability density of $x$ is the improper distribution $1/x$,
 what is the probability density of $u$?
}

 The gamma distribution is always a unimodal density over
 $l = \ln x$, and, as can be
 seen in the figures, it is   asymmetric.
 If $x$ has a gamma  distribution,
 and we decide to work in terms of the inverse of $x$,
 $v=1/x$, we obtain a new distribution, in which
 the density over $l$ is flipped left-for-right:
 the probability density
 of $v$ is called
 an {\dem\ind{inverse-gamma distribution}},
% v = 1/x
% x = 1/v
% mult by |dx/dv|
% = 1/v^2
\beq
        P( v  \given  s,c )  =
\frac{1}{Z_v}
\left(         \frac{ 1 }{s v} \right)^{\! c+1}
         \exp \left( - \frac{1}{s v} \right)
        , \ \ \ \  0\leq v < \infty
\label{inversegamma.dist}
\eeq
\begin{figure}
\figuremargin{\small
\begin{center}
\begin{tabular}{rr}
\dinkyfig{bigrams/igamma1.3.x.ps}%
&
\dinkyfig{bigrams/igamma1.3.l.ps}%
\\
\dinkyfigl{bigrams/igamma1.3.x.l.ps}%
&
\dinkyfigl{bigrams/igamma1.3.l.l.ps}%
\\
$v$ & $\ln v$\\
\end{tabular}
\end{center}
}{
\caption[a]{Two inverse gamma distributions, with parameters
 $(s,c)=(1,3)$ (heavy lines) and $(10,0.3)$ (light lines),
 shown on  linear vertical scales (top) and logarithmic
 vertical scales (bottom);
 and shown as a function of $v$ on the left
 and $\ln v$ on the right.}
\label{fig.igammas}
}
\end{figure}
 where
% (CHECK)
% not checked yet.
\beq
	Z_v = \Gamma(c) / s .
\eeq
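 [A sketch of a check on this constant. Putting $x = 1/v$ in
 (\ref{gamma.dist}), with
 $\left| \partial x / \partial v \right| = 1/v^2 = s^2 \, (1/(sv))^2$,
\beqan
	P( v \given s,c ) & = &
	\frac{1}{\Gamma(c) s}
	\left( \frac{1}{s v} \right)^{\! c-1} \!
	\exp \left( - \frac{1}{s v} \right) \frac{1}{v^2} \\
	& = &
	\frac{s}{\Gamma(c)}
	\left( \frac{1}{s v} \right)^{\! c+1} \!
	\exp \left( - \frac{1}{s v} \right) ,
\eeqan
 which agrees with $Z_v = \Gamma(c)/s$.]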


 Gamma and inverse gamma distributions crop up in many
 inference problems in which a positive quantity
 is inferred from data. Examples include inferring
 the variance of Gaussian noise from some noise samples,
 and inferring the rate parameter of  a Poisson distribution\index{distribution!Poisson}
 from the count.

 Gamma distributions also arise naturally in
  the distributions of waiting times between Poisson-distributed events.
 Given a Poisson process with rate $\l$, the probability
 density of the arrival time $x$ of the $m$th  event
 is
\beq
	\frac{ \l (\l x)^{m-1} }{ ( m\! -  \! 1 )! } \, e^{-\l x} .
\eeq
% check, m=1 -> exp(-lx) . good.
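 [Comparing with (\ref{gamma.dist}), this is a gamma distribution
 with $s = 1/\l$ and $c = m$; a sketch of the identification, using
 $\Gamma(m) = (m-1)!$ for integer $m$:
\beq
	\Gamma( x ; 1/\l , m ) \:\:=\:\:
	\frac{\l}{\Gamma(m)} \, ( \l x )^{m-1} e^{-\l x}
	\:\:=\:\:
	\frac{ \l (\l x)^{m-1} }{ ( m - 1 )! } \, e^{-\l x} .
\eeq
 The mean arrival time is therefore $sc = m/\l$ and its variance is
 $s^2 c = m/\l^2$.]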

\subsubsection{Log-normal distribution}
 Another distribution over a positive
 real number $x$
 is the {\dem{\ind{log-normal}}} distribution,\index{distribution!log-normal}
 which is the distribution that results when
 $l = \ln x$ has a normal distribution.
 We define $m$ to be the median value of $x$,
 and $s$ to  be the standard deviation of $\ln x$.
\beq
 P(l \given m,s) = 	\frac{1}{Z} \exp \left( - \frac{(l-\ln m)^2}{2 s^2} \right)
\ \  \:\: l\in(-\infty,\infty) ,
\eeq
 where
\beq
	Z = \sqrt{ 2 \pi s^2 }, 
\eeq
 implies
% via {eq.transformlog}'s relp
\beq
	P(x \given  m,s ) = \frac{1}{x} \frac{1}{Z}
		 \exp \left( - \frac{(\ln x -\ln m)^2}{2 s^2} \right)
	\ \ 	 \:\: x\in(0,\infty) .
\eeq
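 [Some standard consequences, quoted here as a sketch rather than derived:
 because $\ln x$ is Gaussian with mean $\ln m$ and standard deviation $s$,
\beq
	\mbox{median of $x$} = m , \hspace{0.2in}
	\mbox{mean of $x$} = m \, e^{s^2/2} , \hspace{0.2in}
	\mbox{mode of $P(x \given m,s)$} = m \, e^{-s^2} .
\eeq
 For example, two log-normal distributions with $m=3$ and with $s=1.8$ and
 $s=0.7$ share the same median but have means of roughly 15.2 and 3.8.]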
%\begin{figure}
\marginfig{\small
%\figuremargin{
\begin{center}
\begin{tabular}{rr}
\dinkyfig{bigrams/lognormal.ps}%
\\
\dinkyfigl{bigrams/lognormal.l.ps}%
\end{tabular}
\end{center}
%}{
\caption[a]{Two log-normal  distributions, with parameters
 $(m,s)=(3,1.8)$ (heavy line) 
 and $(3,0.7)$ (light line),
 shown on  linear vertical scales (top) and logarithmic
 vertical scales (bottom). [Yes, they really do have
 the same value of the median, $m=3$.]}
\label{fig.lognormal}
}
%\end{figure}



\section{Distributions over periodic variables\nonexaminable}
 A \ind{periodic variable} $\theta$ is a real number
 $\in [0,2 \pi]$\index{distribution!over periodic variables}
 having the property that $\theta=0$ and $\theta=2 \pi$ are
 equivalent.
% identical


 A distribution that plays for periodic variables
 the role played by the Gaussian distribution for real variables is the
 {\dem\ind{Von Mises distribution}}:\index{distribution!Von Mises} 
\beq
 P(\theta \given  \mu,\beta)  =\frac{1}{Z} \exp \left( \beta \cos( \theta - \mu )
                              \right)  \:\:\:\: \theta \in (0,2\pi). 
\eeq
 The normalizing constant is $Z= 2\pi I_0(\beta)$, where
 $I_0(x)$ is a modified Bessel function.
% (equal to J_0(ix))
\medskip

 A distribution that arises from Brownian \ind{diffusion}\index{Brownian motion}
 around the \ind{circle} is  the 
 wrapped 	Gaussian distribution,
% with wrap-around,
\beq
 P(\theta \given  \mu,\sigma)  = \sum_{n=-\infty}^{\infty}
	\Normal( \theta ; ( \mu+2\pi n ), \sigma^2 ) \:\:\:\: \theta \in (0,2\pi) .
\eeq
% Not the same as (think about them on a log scale, for case of small s)...

% SECOND EDITION
%
% INSERT
% \input{tex/wrappedcauchy.tex}
%
% LOOK DO ME

\section{Distributions over probabilities}
\begin{center}
{\sf Beta distribution, Dirichlet distribution, entropic distribution}
\par
\end{center}
\noindent
 The%
% {normalized vectors}
\marginfig{\small
%\figuremargin{
\begin{center}
\begin{tabular}{rr}
\dinkyfig{figs/beta.ps}%
\\[0.2in]
\dinkyfig{figs/betal.ps}%
\\[0.2in]
\end{tabular}
\end{center}
%}{
\caption[a]{Three beta distributions,
 with $(u_1,u_2) = ( 0.3,1)$, $(1.3,1)$, and $(12,2)$.
 The upper figure shows $P(p \given u_1,u_2)$ as a function of
 $p$; the lower shows the corresponding density over
 the {\dem{\ind{logit}}\/},
$$\ln \frac{p}{1-p}. $$
 Notice how well-behaved the densities are
 as a function of  the  logit.
}
\label{fig.beta}
}
 {\dem\ind{beta distribution}} is a probability density\index{distribution!beta}
 over a
 variable $p$ that is a probability, $p \in (0,1)$:
\beq
P(p \given u_1,u_2) = \frac{1}{Z(u_1,u_2)} p^{u_1-1} (1-p)^{u_2-1}  .
\eeq
 The parameters $u_1,u_2$ may take any positive value.
 The normalizing constant is the \ind{beta function},
% (CHECKED)
\beq
	Z(u_1,u_2) = \frac{ \Gamma(u_1) \Gamma(u_2) }{  \Gamma(u_1 + u_2)  } .
\label{eq.Zbeta}
\eeq
% !!!!!!!!!!!!!!!!!!!!!!!!!!!
 Special cases include the uniform distribution -- $u_1\eq1, u_2\eq 1$;
 the \ind{Jeffreys prior} --  $u_1\eq 0.5, u_2\eq 0.5$;
 and the \ind{improper} \ind{Laplace prior} --   $u_1\eq 0, u_2\eq 0$.
 If we transform the beta distribution to the corresponding density over
 the \ind{logit} $l \equiv \ln \lfrac{p}{(1-p)}$, we find it is always a
 pleasant bell-shaped density over $l$, while the density over $p$
 may have singularities at $p=0$ and $p=1$ (\figref{fig.beta}).
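
 [A sketch of the transformation. Inverting the logit gives
 $p = 1/(1+e^{-l})$, so $\partial p / \partial l = p (1-p)$ and
\beq
	P(l \given u_1,u_2) \:=\: P(p \given u_1,u_2)
	\left| \frac{\partial p}{\partial l} \right|
	\:=\: \frac{1}{Z(u_1,u_2)} \, p^{u_1} (1-p)^{u_2} ,
	\:\:\: \mbox{with $p = p(l)$} .
\eeq
 The minus-ones in the exponents have gone, so for any positive $u_1,u_2$
 the density falls to zero as $l \rightarrow \pm \infty$; its logarithm has
 gradient $u_1 (1-p) - u_2 p$ with respect to $l$, which vanishes only at
 $p = u_1/(u_1+u_2)$, giving a single peak.]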

\subsection{More dimensions}
 The {\dem\ind{Dirichlet distribution}}\index{distribution!Dirichlet}
 is a density over an $I$-dimensional vector $\bp$ whose $I$ components
 are positive and sum to 1. The beta distribution is a special case of
a Dirichlet distribution with $I=2$.
 The Dirichlet distribution
% for a probability vector $\bp$ with $\lI$ components
 is parameterized by a measure $\bu$ (a  vector with all 
 coefficients $u_i > 0$) which I
 will write here as $\bu = \alpha \bm$, where $\bm$ is a normalized
 measure over the $\lI$ components ($\sum m_i = 1$), and $\a$ is 
 positive:
\beq
	P(\bp \given \a\bm) = \frac{1}{Z(\a \bm)} 
		\prod_{i=1}^{\lI} p_i^{\a m_i - 1}  
		\delta \left(\textstyle  \sum_i p_i - 1 \right)
	\equiv \Dir{\bp}{\a\bm}{\lI} .
\label{eq.dirichletdefn}
\eeq
 The function $\delta (x)$ is the Dirac delta function, which 
 restricts the distribution to the \ind{simplex} such that $\bp$ is
 normalized, \ie, $\sum_i p_i = 1$.  The normalizing constant of the Dirichlet 
 distribution is:
\beq
Z(\a\bm) 
%\int  \d^{\lI} \! \bp \:  \prod_{i=1}^{\lI} 
% p_i^{\a m_i - 1}  \delta \left(\textstyle  \sum p_i - 1 \right)
%        = \frac{ \prod_i \Gamma (\a m_i) }{ \Gamma(\sum \a m_i) }.$
        =  \prod_i \Gamma (\a m_i) \left/ \Gamma( \a ) \right. .
\label{lang.z}
\eeq
 The vector $\bm$ is the mean of the probability distribution:
\beq
	\int  \Dir{\bp}{\a\bm}{\lI} \: \bp \: \d^{\lI} \! \bp   = \bm  .
\label{dirichlet_mean}
\eeq
 When working with a probability vector $\bp$, it is often
 helpful to work in the `\index{softmax, softmin}{softmax} basis', in which,
 for example, a three-dimensional probability $\bp=(p_1,p_2,p_3)$
 is represented by  three numbers  $a_1,a_2,a_3$
 satisfying $a_1+a_2+a_3=0$ and 
\beq
	p_i = \frac{1}{Z}  \, e^{a_i}, \:\: \mbox{where $Z = \sum_i   e^{a_i}$.}
\label{eq.softmaxdef}
\eeq
% Dirichlet distributions are most
% naturally dealt with in this basis
% \protect\cite{MacKay96:laplace}.
 This nonlinear transformation is analogous
 to the $\sigma \rightarrow \ln \sigma$
 transformation for a scale variable
 and the logit transformation for a single probability,
 $p \rightarrow \ln \frac{p}{1-p}$.
 In the {softmax} basis, the ugly minus-ones in the exponents
 in the Dirichlet
 distribution (\ref{eq.dirichletdefn}) disappear,
 and the density is given by:
\beq
	P(\ba \given \a\bm) \propto \frac{1}{Z(\a \bm)} 
		\prod_{i=1}^{\lI} p_i^{\a m_i}  
		\delta \left(\textstyle  \sum_i a_i \right) .
\eeq
%
\begin{figure}
\figuremargin{\small
\begin{center}
\begin{tabular}{l}
\makebox[0in][l]{\hspace{0.3in}$\bu=(20,10,7)$}% 
\makebox[0in][l]{\hspace{1.65in}$\bu=(0.2,1,2)$}%
\makebox[0in][l]{\hspace{2.8in}$\bu=(0.2,0.3,0.15)$}%
\\
{\hspace*{0in}\psfig{figure=zipf/dirichletdemo.ps,width=4in,angle=-90}}\\
{\hspace*{0in}\psfig{figure=zipf/dirichletdemol.ps,width=4in,angle=-90}}\\
\end{tabular}
\end{center}
}{
\caption[abb]{Three Dirichlet distributions over a three-dimensional probability
 vector $(p_1,p_2,p_3)$. The upper figures show 1000 random draws from
 each distribution, showing the values of $p_1$ and $p_2$ on the two axes. $p_3 =1-( p_1+p_2)$.
 The triangle in the first figure
 is the simplex of legal probability distributions.

 The lower figures show the same points in the
 `softmax' basis (\eqref{eq.softmaxdef}).
 The two axes show $a_1$ and $a_2$. $a_3 = -a_1-a_2$.
}
\label{fig.dirichletdemo}
}
\end{figure}
\noindent
 The role of  the parameter $\a$ can be characterized in two ways. First, 
 $\a$ measures the sharpness of the distribution (\figref{fig.dirichletdemo}); 
 it measures how different we expect typical samples $\bp$ from the
 distribution to be from the mean $\bm$, just as the
 precision $\tau=\dfrac{1}{\sigma^2}$ of a Gaussian
 measures how far samples stray from its mean. A large value of $\a$
 produces a distribution over $\bp$ that is sharply peaked around
 $\bm$. The effect of $\a$ in higher-dimensional
 situations can be visualized by drawing a typical sample 
 from the distribution $\Dir{\bp}{\a\bm}{\lI}$, with $\bm$ set to the uniform 
 vector $m_i = \dfrac{1}{I}$, 
 and making a \ind{Zipf plot}, that is, a ranked plot of the values of 
 the components $p_i$.
 It is traditional to plot both $p_i$ (vertical axis) and the rank (horizontal
 axis) on logarithmic scales so that power law relationships 
 appear as straight lines. 
% Many natural languages have word frequencies which 
% are well modelled by Zipf's law
 Figure \ref{fig.zipf} shows these plots for a single sample from
 ensembles with $\lI=100$ and $\lI=1000$ and with $\a$ from 0.1 to
 1000. For large $\a$, the plot is shallow with many components having 
 similar values.
% s to the most probable component. 
 For small $\a$, typically one component $p_i$ receives an
 overwhelming share of the probability, and of the small probability that
 remains to be shared among the other components, another component
 $p_{i'}$ receives a similarly large share.  In the limit as $\a$ goes
 to zero, the plot tends to an increasingly steep power law.
%\begin{figure}
\amarginfig{c}{\small
\begin{center}
\begin{tabular}{c}
$\lI=100$ \\
\hspace*{-0.15in}\psfig{figure=zipf/ps/all.100.ps,%
width=57mm}
\\
 $\lI=1000$ \\
\hspace*{-0.15in}\psfig{figure=zipf/ps/all.1000.ps,%
width=57mm}
\\
\end{tabular}
\end{center}
%
\caption[abb]{Zipf plots for random samples from Dirichlet distributions
	with various values of $\a = 0.1 \ldots 1000$. For each value
 of $\lI=100$
 or 1000
	and each  $\a$,
%
% RESTORE DETAILS somewhere
%$\lI$ samples from a standard gamma 
%	distribution were generated
%	with shape parameter $\a/\lI$ and normalized to give a
 one sample 
	$\bp$ from the Dirichlet distribution was generated. 
	The Zipf plot shows the probabilities $p_i$, ranked by magnitude, 
	versus their rank. 
}
\label{fig.zipf}
}
%\end{figure}

 Second, we can characterize the role of $\a$ in terms of the 
 predictive distribution that results when we observe samples from $\bp$ and
 obtain counts $\bF = (F_1, F_2, \ldots, F_I)$ of the possible outcomes.
% The term $\a m_i$ plays the role of an effective initial count in 
% bin $i$. 
 The value of $\a$ defines the number of samples from
 $\bp$ that are required in order that the data dominate over the
 prior in  predictions. 
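
 [To make this concrete, here is the standard predictive distribution
 under a Dirichlet prior, quoted as a sketch without derivation. If
 $N = \sum_i F_i$ outcomes have been observed, the probability that the
 next outcome is $i$ is
\beq
	P( i \given \bF , \a\bm ) \:=\: \frac{ F_i + \a m_i }{ N + \a } .
\eeq
 The term $\a m_i$ acts like an effective initial count in bin $i$, and
 the data dominate the prior once $N \gg \a$.]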

\exercisxC{3}{ex.Dadditive}{
 The Dirichlet distribution satisfies a nice additivity property.
 Imagine that a biased six-sided die has two red faces
 and four blue faces. The die is rolled $N$ times and two Bayesians
 examine the outcomes in order  to infer the bias of the die and make
 predictions.
 One Bayesian has access to the red/blue colour outcomes only,
 and he infers a two-component probability vector ($p_{\rm R}, p_{\rm B}$).
 The other Bayesian has access to each full outcome: he can
 see which of the six faces came up, and he infers
 a six-component probability vector ($p_1, p_2, p_3,p_4,p_5,p_6$),
 where $p_{\rm R} =p_1+ p_2$ and $p_{\rm B} =  p_3 + p_4 +p_5+p_6 $.
 Assuming that the second Bayesian
 assigns a Dirichlet distribution to 
 ($p_1, p_2, p_3,p_4,p_5,p_6$)
 with \ind{hyperparameter}s
 ($u_1, u_2, u_3,u_4,u_5,u_6$),
 show that, in order for the first Bayesian's inferences to be
 consistent with those of the second Bayesian,
 the first Bayesian's prior should be  
 a Dirichlet distribution 
 with hyperparameters
 ($(u_1 + u_2), (u_3+u_4+u_5+u_6)$).

 {\sf Hint}: a brute-force approach is to compute the integral
 $P(p_{\rm R}, p_{\rm B}) = \int \d^6  \bp \, P(\bp \given \bu) \, \delta(
 p_{\rm R} - (p_1+ p_2) ) \, \delta (p_{\rm B} -(  p_3 + p_4 +p_5+p_6 ))$.
 A cheaper approach is to compute the predictive
 distributions, given arbitrary data
 ($F_1, F_2, F_3,F_4,F_5,F_6$),
 and find the condition for the two  predictive distributions to
 match  for all data.
}

 The {\dem\ind{entropic distribution}\/} for a
 probability vector $\bp$ is sometimes used in the
 `maximum entropy' image reconstruction
 community.
\beq
	P(\bp \given \a,\bm) = \frac{1}{Z(\a,\bm)} 
		\exp [ - \a D_{\rm KL}(\bp||\bm) ]
\,		\delta \! \left(\textstyle  \sum_i p_i - 1 \right) ,
\eeq
 where $\bm$, the measure, is a positive vector, and $D_{\rm KL}(\bp||\bm) = \sum_i p_i \log (p_i/m_i)$.

\section*{Further reading}
 See \cite{MacKay_Peto} for fun with
 Dirichlets.


\section{Further exercises}
\exercisxC{2}{ex.gammainf}{
 $N$ datapoints $\{ x_n \}$ are drawn from a
% quantity $x$ has a
 gamma distribution
$        P( x  \given  s,c ) \:=\: \Gamma( x ; s , c )$
 with unknown parameters $s$ and $c$.
 What are the maximum likelihood parameters $s$ and $c$?
}

\dvips
%%\prechapter{About              Chapter}
%%\input{tex/_pexact.tex}% not associated with exact.tex any more!
\chapter{Exact Marginalization}
% in Continuous Spaces}
\label{ch.exactmarg}
% WAS ::: \chapter{Intermediate Bayesian Stuff}
\label{ch.bayes.gaussian}
\label{ch.bayes.int}
 How can we avoid the exponentially large cost of complete enumeration of all
 hypotheses?  Before we stoop to approximate methods, we explore
 two approaches to exact marginalization: first, \ind{marginalization} over
 continuous variables (sometimes known as
 \ind{nuisance parameters}) by doing {\em integrals};
 and second, summation over discrete variables by message-passing.

% In this chapter we run through some very simple
% but intimidating examples.
 Exact marginalization over continuous parameters
 is a \ind{macho} activity enjoyed by  those who are fluent in
 definite integration.
\fakesection{Gamma whinge}
% WAS _pexact.tex ::::::::::::::::::::::::::::::::::::::::::::::
 This chapter
% is unfortunately found a little intimidating
% because it
 uses gamma distributions; as
%, which are not really
% worth being scared of, and are not central to the chapter.
% As
 was explained in the previous chapter, 
 gamma distributions are a lot like Gaussian distributions, 
 except that whereas the Gaussian goes from $-\infty$ to $\infty$, 
 gamma distributions go from 0 to $\infty$.

\section{Inferring the mean and variance of a  Gaussian distribution}
 We discuss again the  one-dimensional
 Gaussian distribution, parameterized by a mean $\mu$ 
 and a standard deviation $\sigma$:
%
\beq
P(x \given \mu,\sigma)
% ,\H_{\rm Normal})
        = \frac{1}{\sqrt{2 \pi} \sigma}
        \exp \left( - \frac{ ( x-\mu )^2 }{2 \sigma^2 } \right)
        \equiv {\rm Normal}(x;\mu,\sigma^2) .
\eeq
%
% Let us examine the inference of $\mu$ and $\sigma$ 
% given data points $x_n$, $n=1,\ldots, N$, assumed to be drawn independently 
% from this distribution. 
%
 When inferring these parameters, we must specify their prior
 distribution. The prior gives us the opportunity to include specific
 knowledge that we have about $\mu$ and $\sigma$ (from independent
 experiments, or on theoretical grounds, for example). If we have no
 such knowledge, then we can construct an appropriate prior that
 embodies our supposed ignorance.
 In \secref{sec.gaussian.firsttime}, we assumed a uniform  prior
 over the range of parameters plotted.
 If we wish to be able to perform exact
 marginalizations, it may be
 useful to consider {\em conjugate priors}; these are priors
 whose functional form  combines naturally with the likelihood
 such that the inferences have a convenient
 form.
\subsection{Conjugate priors for $\mu$ and $\sigma$\nonexaminable}
 The \ind{conjugate prior} for a mean $\mu$ is a Gaussian:\index{Gaussian distribution!parameters}
 we introduce two `\ind{hyperparameter}s', $\mu_0$ and $\sigma_{\mu}$,
 which parameterize the prior on $\mu$, and write
 $P(\mu \given \mu_0,\sigma_{\mu}) = \Normal(\mu;\mu_0,\sigma_{\mu}^2)$.
 In the limit $\mu_0 \eq 0$, $\sigma_{\mu} \rightarrow \infty$, we obtain the {\em
 noninformative prior\/} for a location parameter, the flat prior.  This
 is {\dem\ind{noninformative}} because it is {\em invariant\/} under the
 natural reparameterization $\mu' = \mu+c$.
% \marginpar{\footnotesize I need to give a better explanation of `noninformative'.}
 The prior $P(\mu) = {\rm const.}$
 is also an {\em\ind{improper}\/} prior, that is, it is not normalizable.

 The \ind{conjugate prior} for a standard deviation $\sigma$ is a
 \ind{gamma
 distribution}, which has two parameters $ b_{\b}$ and $c_{\b}$.
 It is most convenient to  define the prior
 density of the
 inverse variance
%\marginpar{\footnotesize{The inverse variance is sometimes
% called the {\dem\ind{precision}} parameter of the Gaussian.}}
 (the {\dem\ind{precision}} parameter)
 $\beta = 1/\sigma^2$:
\beq
        P( \b ) = \Gamma( \b ; b_{\b} , c_{\b} ) =
        \frac{1}{\Gamma(c_{\b})}
        \frac{ \b^{c_{\b}-1} }
                { b_{\b}^{c_{\b}} }  
         \exp \left( - \frac{\b}{b_{\b}} \right)
        , \ \ \ \  0\leq \b < \infty .
\label{gamma.dist.again}
\eeq
 This is a simple peaked distribution with mean $b_{\b}c_{\b}$ and
 variance $b^2_{\b}c_{\b}$. In the limit $b_{\b}c_{\b} = 1, c_{\b}
 \rightarrow 0$, we obtain the {noninformative prior} for a scale
 parameter, the $1/\sigma$ prior.  This is `noninformative' because it
 is {invariant} under the reparameterization $\sigma' = c \sigma$. The
 $1/\sigma$ prior is less strange-looking if we examine the resulting
 density over $\ln \sigma$, or $\ln \beta$, which is flat.%
\marginpar{\small\raggedright{Reminder: when we change variables
 from $\sigma$ to $l(\sigma)$, a one-to-one function of $\sigma$, 
 the probability density transforms from $P_{\sigma}(\sigma)$
 to
$$
 P_l(l) = P_{\sigma}(\sigma) \left| \frac{\partial \sigma}{\partial l} \right|
 .
$$
 Here, the \ind{Jacobian} is
$$
  \left| \frac{\partial \sigma}{\partial \ln \sigma} \right| = \sigma 
.
$$
% \eqref{eq.transformlog}.}}
 }}
 This is
 the prior that expresses ignorance about $\sigma$ by saying `well, it
 could be 10, or it could be
 1, or it could be 0.1, \ldots' Scale variables such as $\sigma$ are
 usually best represented in terms of their logarithm. Again, this
 noninformative $1/\sigma$ prior is \ind{improper}.
%

 In the following examples, I will  use the improper noninformative priors 
 for $\mu$ and $\sigma$.  Using improper priors is viewed as distasteful
 in some circles, so let me excuse myself  by saying it's for the sake of
 readability; if I included proper priors, the calculations
 could still be done but the key points would be obscured by the
 flood of  extra  parameters.


\subsection{Maximum likelihood and marginalization: 
        $\sigma_{\ssN}$ and $\sigma_{\ssNM}$}
\label{sn}
 The task of inferring the mean and \ind{standard deviation} 
 of a Gaussian distribution from $N$ samples is a familiar one, though 
 maybe not everyone understands the difference between the 
 \ind{$\sigma_{\ssN}$ and $\sigma_{\ssNM}$} buttons on their \ind{calculator}. 
 Let us recap the formulae, then derive them. 

        Given data $D = \{ x_n \}_{n=1}^{N}$, an `estimator' of $\mu$ is
% \newcommand{\barx}{\bar{x}}
\beq
        \barx \equiv \textstyle  {\sum_{n=1}^{N} x_n} / {N} ,
\eeq
 and two estimators of $\sigma$ are:
\beq
        \sigma_{\ssN} \equiv \sqrt{
                 \frac{\sum_{n=1}^{N} ( x_n - \barx )^2 }{N}
        }
\: \mbox{ and } \: 
\sigma_{\ssNM} \equiv \sqrt{
                 \frac{\sum_{n=1}^{N} ( x_n - \barx )^2 }{N-1}
        } .
\eeq
%
        There are two principal paradigms for statistics: \ind{sampling theory} 
 and Bayesian inference. In sampling theory (also known as `\ind{frequentist}'
 or \ind{orthodox statistics}),  one  invents {\dem\ind{estimator}s} of quantities of
 interest and then chooses between those estimators using some
 criterion measuring their sampling properties; there is no clear
 principle for deciding which criterion to use to measure the
 performance of an estimator; nor, for most criteria, is there any
 systematic procedure for the construction of optimal estimators.
 In Bayesian inference, in contrast, once we have made 
 explicit all our  assumptions about the model and the data,  our inferences are
 mechanical.
% stic. 
 Whatever question we wish to pose, the rules
 of probability theory give a unique answer which consistently takes
 into account all the given information. Human-designed 
 estimators and confidence intervals
 have no role in Bayesian inference; 
 human input only enters into the important 
 tasks of designing the hypothesis space (that is, the specification
 of the model and all its probability distributions),
 and figuring out 
 how to do the computations that implement inference 
 in that space.
 The answers to our questions are probability distributions over the 
 quantities of interest. We often find that the estimators of 
 sampling theory emerge automatically as modes or means 
 of these posterior distributions  when we choose a simple hypothesis
 space and turn the handle of Bayesian 
 inference.

        In sampling theory, the estimators above can be motivated 
 as follows. $\barx$ is an unbiased estimator of $\mu$ which, out of all 
 the possible unbiased estimators of $\mu$,  has smallest \ind{variance} (where this 
 variance is computed by averaging over an ensemble of imaginary experiments
 in which the data samples are assumed to come from an unknown  
 \ind{Gaussian distribution}).\index{bias!in statistics} 
 The estimator $(\barx,\sigma_{\ssN})$ is the  maximum likelihood estimator
 for $(\mu,\sigma)$. The estimator $\sigma_{\ssN}$ is {\em\ind{biased}}, however:
 the expectation of $\sigma_{\ssN}$, given $\sigma$, averaging over 
 many imagined experiments, is not $\sigma$.
\exercissxA{2}{ex.sigmanbias}{
 Give an intuitive explanation why the estimator $\sigma_{\ssN}$ is {biased}.
}
 This bias  motivates the invention, in sampling theory, of 
 $\sigma_{\ssNM}$, which can be shown to be an unbiased estimator. Or to 
 be precise, it is $\sigma_{\ssNM}^2$ that is an \ind{unbiased estimator} of 
 $\sigma^2$.
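
 [A sketch of the calculation behind this claim. Write ${\rm E}[\cdot]$
 (notation introduced only for this note) for the average over imagined
 data sets of $N$ independent samples from a Gaussian with parameters
 $\mu$ and $\sigma$, so that ${\rm E}[x_n^2] = \sigma^2 + \mu^2$ and
 ${\rm E}[\barx^2] = \sigma^2/N + \mu^2$. Then
\beq
	{\rm E} \! \left[ \sum_{n=1}^{N} ( x_n - \barx )^2 \right]
	\:=\: \sum_{n=1}^{N} {\rm E}[ x_n^2 ] - N \, {\rm E}[ \barx^2 ]
	\:=\: N (\sigma^2 + \mu^2) - N \left( \frac{\sigma^2}{N} + \mu^2 \right)
	\:=\: (N-1) \, \sigma^2 .
\eeq
 Dividing by $N$ gives ${\rm E}[\sigma_{\ssN}^2] = \sigma^2 (N-1)/N$, which
 is too small; dividing by $N-1$ gives ${\rm E}[\sigma_{\ssNM}^2] = \sigma^2$.]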

% copy of this stolen and included in  enumerate.tex 
% \renewcommand{\figs}{/home/mackay/book/figs} % while in bayes chapter 
\begin{figure}
\figuremargin{\small%
\vspace{-0.56in}
\begin{center}
\begin{tabular}{l@{}l}
% \newcommand{\bookfigs}{/home/mackay/book/figs}
(a1)\hspace{-0.4in}\raisebox{-10mm}{\psfig{figure=\bookfigs/basic/new_surfaceplot.ps,angle=-90,width=3in}}
&
(a2)\hspace{-0.8in}\raisebox{-10mm}{\psfig{figure=\bookfigs/basic/new_contourplot.ps,angle=-90,width=3in}}
\\
%(b)\raisebox{-5mm}{\psfig{figure=\bookfigs/basic/new_posts.ps,angle=-90,width=2.3in}}
%&
(c)\raisebox{-5mm}{\psfig{figure=\bookfigs/basic/new_sigposts.ps,angle=-90,width=2.3in}}
\hspace*{-0.3in}
%\\
&
\hspace*{-0.3in}(d)\raisebox{-5mm}{\psfig{figure=\bookfigs/basic/new_sigmargb.ps,angle=-90,width=2.3in}}
\\
\end{tabular}
\end{center}
% \mbox{{\bf (a)} \psfig{figure=\bookfigs/basic/like_sig_mu.ps,%
% width=3 true in,height=2.53 true in,angle=-90,% 
% bbllx=19.5cm,bblly=1.1cm,%3.9cm,%
% bburx=2.4cm,bbury=24.0cm}
% %} 
% %\mbox{
% {\bf (b)} \psfig{figure=\bookfigs/basic/sigma_likes.ps,%
% width=3 true in,height=2.3 true in,angle=-90,%
% bbllx=19.5cm,bblly=1.9cm,%
% bburx=2.4cm,bbury=24cm} }
}{%
\caption[abbrev]{{The likelihood function for the parameters of 
        a  Gaussian distribution},
 repeated from \protect\figref{like.sig.mu1}.

        {(a1, a2)} Surface plot and 
contour plot of the log likelihood as a function of $\mu$
        and $\sigma$.  The data set of $N=5$ points had mean
        $\bar{x}=1.0$ and $S = \sum(x-\bar{x})^2 = 1.0$. Notice that
        the maximum is skew in $\sigma$.  The two estimators of
        standard deviation have values $\sigma_{\ssN}=0.45$ and
        $\sigma_{\ssNM}=0.50$.

%{(b)} The posterior probability of $\mu$ for various values of 
% $\sigma$. 
%
{(c)} The posterior probability of $\sigma$ for 
 various fixed values of $\mu$ (shown as a density over $\ln \sigma$).
%  The two graphs show: the likelihood as a function of
%       $\sigma$, with $\mu$ fixed to $\barx$, \ie, $P(D \given \mu={\bar
%       x},\sigma)$ [this is a vertical section through the peak in
%       (a)]; and 

{(d)} The 
% `evidence' (marginal likelihood) for
        posterior probability of $\sigma$, $P(\sigma \given D)$, 
%       $\sigma$, $P(D \given \sigma)$,
  assuming a flat prior on $\mu$, 
%       (rescaled by an arbitrary constant). The evidence is 
 obtained
        by projecting the probability mass in (a) onto the $\sigma$
        axis.  The maximum of 
% $P(D \given \sigma)$ is
$P(\sigma \given D)$ is
        at $\sigma_{\ssNM}$. By contrast, the maximum of 
% $P(D \given \mu={\bar x},\sigma)$ is
$P(\sigma \given D,\mu\eq {\bar x})$ is
        at $\sigma_{\ssN}$.
 (Both probabilities are shown as densities
 over $\ln \sigma$.) }
\label{like.sig.mu}
}%
\end{figure}

 We now look at some Bayesian inferences for this problem, assuming 
 noninformative priors for $\mu$ and $\sigma$. The emphasis is thus not on 
 the priors, but rather on (a) the likelihood function, 
 and (b) the concept of marginalization. 
 The joint posterior probability of $\mu$ and $\sigma$ is 
 proportional to the likelihood function illustrated by a contour plot
 in figure \ref{like.sig.mu}a.
 The log likelihood is:
\beqan
\!\!\!\!\ln P(\{x_n\}_{n=1}^N \given \mu,\sigma)
        &\!\!=\!\!& -N \ln (\sqrt{2 \pi} \sigma)
        -\sum_n \linefrac{(x_n-\mu)^2}{(2 \sigma^2)}    , 
\\
&\!\!=\!\!&
        -N \ln (\sqrt{2 \pi} \sigma) - \linefrac{ [ N ( \mu - \barx )^2 + S ]}
                                        { (2 \sigma^2) }, 
\eeqan
 where $S \equiv \sum_n  (x_n-\barx)^2$. Given the Gaussian model, 
 the likelihood can be expressed 
 in terms of the two functions  of the data $\barx$ and $S$, so these 
 two quantities are known as `sufficient statistics'. 
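 [The step between the two lines, spelled out as a sketch: writing
 $x_n - \mu = (x_n - \barx) + (\barx - \mu)$,
\beq
	\sum_n (x_n - \mu)^2 \:=\: \sum_n (x_n - \barx)^2
	+ 2 (\barx - \mu) \sum_n (x_n - \barx) + N (\barx - \mu)^2
	\:=\: S + N (\mu - \barx)^2 ,
\eeq
 because $\sum_n (x_n - \barx) = 0$.]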
 The posterior probability of $\mu$ and $\sigma$ is, using the improper
 priors:
\beqan
        P( \mu , \sigma \given  \{x_n\}_{n=1}^N  ) &=&
         \frac{ P(\{x_n\}_{n=1}^N \given \mu,\sigma) P( \mu, \sigma ) }
                { P (  \{x_n\}_{n=1}^N  ) }
\label{joint.post1}
\\
&=&
\frac{
        \smallfrac{1}{(2 \pi \sigma^2)^{N/2}} \exp\left( - 
                                \smallfrac{N ( \mu - \barx )^2 + S }
                                        { 2 \sigma^2 }
        \right)
        \frac{1}{\sigma_{\mu}}
%       \frac{1}{(2 \pi \sigma_{\mu}^2)^{1/2}}
%               \exp\left( - \frac{1}{2}  
%                \mu^2 / ( 2 \sigma_{\mu}^2 ) \right)
%               \frac{1}{\Gamma(c_{\b})}
%       \frac{ \b^{c_{\b}-1} }
%               { b_{\b}^{c_{\b}} }  
%        \exp \left( - \frac{\b}{b_{\b}} \right)
        \frac{1}{\sigma}
}
% \right/
{
         P (  \{x_n\}_{n=1}^N  ) 
} .
\label{joint.post2}
\eeqan
 This function describes the answer to the question, `given the data, 
 and the noninformative priors, what might $\mu$ and $\sigma$ be?'
 It may be of interest to find the parameter values that maximize
 the posterior probability, though it should be emphasized that posterior 
 probability maxima have no fundamental status in Bayesian inference, since
 their location depends on the choice of basis. Here we choose 
 the basis $(\mu , \ln \sigma)$, in which our prior is flat, 
 so that the posterior probability maximum coincides with the 
 maximum of the likelihood.
%\exercisxB{2}{ex.MLgaussian}{
% Differentiate the log likelihood with respect to $\mu$ and $\ln \sigma$
% and show that  the maximum likelihood solution is:
%% \beq
%$
% \{\mu,\sigma\}_{\ML} = \left\{ \bar{x},\sigma_{\ssN}
%         = \sqrt{ \linefrac{S}{N} } \right\}  .
%$
%% \eeq
%}
 As we saw in \exerciseref{ex.MLgaussian},
 the maximum likelihood
 solution for $\mu$ and $\ln \sigma$ is 
$
 \{\mu,\sigma\}_{\ML} = \left\{ \bar{x},\sigma_{\ssN}
         = \sqrt{ \linefrac{S}{N} } \right\}  .
$


 There is more to the posterior distribution than just its mode.  As
 can be seen in figure \ref{like.sig.mu}a, the likelihood has a skew
 peak.  As we increase $\sigma$, the width of the conditional
 distribution of $\mu$ increases (\figref{like.sig.mu1a}b).  And if we fix $\mu$ to a sequence
 of values moving away from the sample mean $\barx$, we obtain a
 sequence of conditional distributions over $\sigma$ whose maxima move
 to increasing values of $\sigma$ (\figref{like.sig.mu}c).
 

%  The next question we might ask is `given the data, 
%  and the noninformative prior on $\mu$, and assuming a particular 
%  value of $\sigma$, what might $\mu$ be?'

 
  The posterior probability of $\mu$ given $\sigma$ is 
\beqan
 P( \mu \given  \{x_n\}_{n=1}^N,\sigma  ) &=&
         \frac{ P(\{x_n\}_{n=1}^N \given \mu,\sigma) P( \mu ) }
                        { P(\{x_n\}_{n=1}^N  \given \sigma  ) }
\label{post.mu}
\\
 &\propto&
\exp ( -N(\mu - \barx )^2/(2 \sigma^2) )
\\
&=&
\Normal( \mu ; \barx , \sigma^2/N )  .
\eeqan
 We note 
% This posterior distribution shows 
 the familiar 
 $\sigma/\sqrt{N}$ scaling of the error bars on $\mu$. 
% posterior uncertainty of the parameter $\mu$.  


 Let us now ask the question `given the data, 
 and the noninformative priors, what might  $\sigma$ be?' This question 
 differs from the first one we asked in that we are now not interested in 
 $\mu$. This parameter must therefore be {\em marginalized\/} over. 
 The posterior probability of $\sigma$ is:
\beq
 P( \sigma  \given  \{x_n\}_{n=1}^N ) =
        \frac{ P(\{x_n\}_{n=1}^N  \given \sigma ) P( \sigma ) }
                { P(\{x_n\}_{n=1}^N ) }  .
\label{eq.truepostsigma}
\eeq
 The data-dependent term $P(\{x_n\}_{n=1}^N  \given \sigma )$ appeared
 earlier as the normalizing constant in equation (\ref{post.mu}); one
 name for this quantity is the `\ind{evidence}', or \ind{marginal likelihood},
 for $\sigma$.  We obtain the evidence for $\sigma$ by integrating out
 $\mu$; a noninformative prior $P(\mu)=\mbox{constant}$ is
 assumed; we call this constant
 $\linefrac{1}{\sigma_{\mu}}$, so that we can think of the prior
 as a top-hat prior of width $\sigma_{\mu}$.
% , with $\sigma_{\mu} \to \infty$: 
% \beqa
 The Gaussian integral, 
$ P(\{x_n\}_{n=1}^N \given \sigma)  = 
        \int P(\{x_n\}_{n=1}^N \given \mu,\sigma)P(\mu) \: d \mu ,
$
% \eeqa
% \\
%  & = &  P(\{x_n\}_{n=1}^N \given \mu_{\MP},\sigma)P(\mu_{\MP}) 
%               \sqrt{2 \pi}\frac{\sigma}{\sqrt{N}}    .
% \eeqa
% This Gaussian integral yields:
% e log evidence is therefore: 
 yields:
\beq
\ln P(\{x_n\}_{n=1}^N \given \sigma)=-N \ln (\sqrt{2 \pi} \sigma)
        - \frac{S}{2 \sigma^2} + \ln \frac{\sqrt{2 \pi}
        \sigma / \sqrt{N}}{ \sigma_{\mu} }   .
\label{eq.sigmaevidence}
\eeq
 The first two terms are the best fit log likelihood (\ie, the log
 likelihood with $\mu = \bar{x}$). The last term is the log of the
 {\dem\ind{Occam factor}\/} which penalizes smaller values of $\sigma$.  (We
 will discuss Occam factors more in \chref{ch.occam}.) When we
 differentiate the log evidence with respect to $\ln \sigma$, to find
 the most probable $\sigma$, the additional volume factor
 ($\linefrac{\sigma}{\sqrt{N}} $) shifts the maximum from $\sigma_{\ssN}$
 to
\beq
%\sigma_{\MP} = 
\sigma_{\ssNM} = \sqrt{ \linefrac{S}{(N-1)} } .
\eeq
 Intuitively, the denominator \mbox{$(N\!-\!1)$} counts the number of
 noise measurements contained in the quantity $S = \sum_n
 (x_n\!-\!\bar{x})^2$. The sum contains $N$ residuals squared, but
 there are only \mbox{$(N\!-\!1)$} effective noise measurements\index{degrees of freedom}
 because the determination of one parameter $\mu$ from the data causes
 one dimension of noise to be gobbled up in unavoidable \ind{overfitting}.
 In the terminology of classical statistics,
 the Bayesian's best guess  for $\sigma$  sets\index{$\chi^2$}\index{chi-squared}
 $\chi^2$ (the measure of deviance  defined by $\chi^2 \equiv
 \sum_n (x_n - \hat{\mu})^2/{\hat\sigma}^2$) equal to the number of degrees
 of freedom, $N-1$.
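
 To spell out the differentiation: dropping terms that do not depend
 on $\sigma$, the log evidence (\ref{eq.sigmaevidence}) is
\beq
 \ln P(\{x_n\}_{n=1}^N \given \sigma) =
        -(N-1) \ln \sigma - \frac{S}{2 \sigma^2} + \mbox{constant} ,
\eeq
 so
\beq
 \frac{\partial}{\partial \ln \sigma} \ln P(\{x_n\}_{n=1}^N \given \sigma)
        = -(N-1) + \frac{S}{\sigma^2} ,
\eeq
 which vanishes when $\sigma^2 = \linefrac{S}{(N-1)}$. Without the
 Occam factor, the coefficient of $\ln \sigma$ would be $-N$ and the
 maximum would sit at $\sigma_{\ssN}^2 = \linefrac{S}{N}$.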
%
% HELP - put more clarification here.
%

 Figure \ref{like.sig.mu}d shows the posterior probability of $\sigma$,
 which is proportional to the marginal likelihood.
%  as a function of $\sigma$.
 This may be contrasted with 
  the posterior probability of
% likelihood as a function of
 $\sigma$ with $\mu$ fixed to its most probable value, $\barx\eq 1$, which 
 is shown in \figref{like.sig.mu}c and d.
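
 The contrast between the two maxima can be checked numerically.
 The following is a minimal sketch, not the code used to make the
 figure: the data set, the function names and the grid search over
 $\ln \sigma$ are all assumptions made for illustration, and the
 top-hat width $\sigma_{\mu}$ is set to 1 because it only shifts the
 log evidence by a constant.
\begin{verbatim}
```python
import math

def sufficient_stats(xs):
    """Return (N, xbar, S) for a list of measurements."""
    n = len(xs)
    xbar = sum(xs) / n
    s = sum((x - xbar) ** 2 for x in xs)
    return n, xbar, s

def log_likelihood_at_xbar(sigma, n, s):
    """ln P(data | mu = xbar, sigma): the best-fit log likelihood."""
    return -n * math.log(math.sqrt(2 * math.pi) * sigma) - s / (2 * sigma**2)

def log_evidence(sigma, n, s, sigma_mu=1.0):
    """ln P(data | sigma), with mu integrated out under a flat prior."""
    occam = math.log(math.sqrt(2 * math.pi) * sigma / math.sqrt(n) / sigma_mu)
    return log_likelihood_at_xbar(sigma, n, s) + occam

def argmax_sigma(f, lo=-5.0, hi=5.0, steps=100001):
    """Grid search over ln(sigma); returns the maximizing sigma."""
    grid = (lo + (hi - lo) * k / (steps - 1) for k in range(steps))
    return math.exp(max(grid, key=lambda u: f(math.exp(u))))

# Hypothetical data, chosen so that N = 5, xbar = 1 and S = 1.
xs = [0.2, 0.9, 1.1, 1.3, 1.5]
N, XBAR, S = sufficient_stats(xs)
SIGMA_N = argmax_sigma(lambda sig: log_likelihood_at_xbar(sig, N, S))
SIGMA_NM1 = argmax_sigma(lambda sig: log_evidence(sig, N, S))
```
\end{verbatim}
 For this data set the two grid maxima should land close to
 $\sqrt{\linefrac{S}{N}}$ and $\sqrt{\linefrac{S}{(N-1)}}$ respectively,
 up to the grid resolution.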

 The final inference we might wish to make is `given the data, what is $\mu$?'
\exercisxB{3}{ex.studentint}{
 Marginalize over $\sigma$ and obtain the posterior 
 marginal distribution of $\mu$, which is a \ind{Student-$t$ distribution}:
\beq
        P( \mu  \given  D ) \propto 1 / \left( N ( \mu - \barx )^2 + S \right)^{N/2} .
\eeq
}
%
% in error, this used to say (N-1)/2  
% 21/3/96
%
% see ~/book/figs/basic/README

% stole exercises from here and put them in bayes_int_exs.tex

\section*{Further reading}
 A bible of exact marginalization is \quotecite{Bretthorst} book
 on {B}ayesian spectrum analysis
	and parameter estimation.

\section{Exercises}
\exercisxB{3}{ex.manyparamsb}{
 [This exercise requires macho integration capabilities.]
 Give a Bayesian solution to \exerciseref{ex.manyparams},
 where seven scientists of varying
 capabilities have measured $\mu$ with
 personal  noise levels $\sigma_n$, 
\marginfig{
\begin{center}
\mbox{\psfig{figure=figs/manyparams.ps,width=1.75in,angle=-90}}
\end{center}
%\caption[a]{Seven measurements $\{x_n\}$ of a parameter $\mu$
% by seven scientists each having his own
%  noise-level $\sigma_n$.}
}
 and we are interested in inferring $\mu$.
% , and perhaps $\{ \sigma_n \}$ too.
 Let the prior on each $\sigma_n$ be a broad prior, for example a
 gamma distribution with parameters $(s,c)=(10,0.1)$.
 Find the posterior distribution of $\mu$.
 Plot it, and explore its properties for a variety of
 data sets such as the one given, and the data set $\{ x_n \} = \{ 13.01 , 7.39 \}$.

 [{\sf Hint}: first find the posterior distribution of $\sigma_n$ given
 $\mu$ and $x_n$, $P(\sigma_n \given x_n,\mu)$. Note that the normalizing constant
 for this inference is $P(x_n \given  \mu)$. Marginalize over
 $\sigma_n$ to find this normalizing constant,
 then use \Bayes\  theorem a second time to
 find $P(\mu \given  \{ x_n \} )$.]
}

% \section{Solutions to Chapter \protect\ref{ch.bayes.int}'s exercises} %
\section{Solutions}
%
\soln{ex.sigmanbias}{
 1.\  The data points are distributed with mean squared deviation $\sigma^2$ 
 about the true mean.
 2.\ 
 The sample mean is unlikely to exactly  equal the true mean.
 3.\ The sample 
 mean is the value of $\mu$ that minimizes the sum squared deviation
 of the data points from $\mu$.
 Any other value of $\mu$ (in particular, the true value of $\mu$)
 will have a larger value of the sum-squared deviation than $\mu = \bar{x}$.

 So the expected   mean squared deviation from the 
 sample mean is necessarily  smaller than the
  mean squared deviation $\sigma^2$ 
 about the true mean. 
}

\dvips
% \dvipsb{solutions bayes intermediate}
%
%\prechapter{About                  Chapter}
%\mysetcounter{page}{69} % set to preceding page
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% WAS \chapter{Exact Inference Methods}
\chapter{Exact Marginalization in Trellises}
\label{ch.exact}\label{ch.minsum2} 
%
% exact inference methods
%
% contains lots on trellises
%
% solutions are in _sexact.tex
%
% need state diagram picture s,t,u
%
 In this chapter we will discuss a few
 exact methods that are used in probabilistic 
 modelling.
% We will do this with the aid of two examples. 
 As an example we will discuss
 the task of decoding a linear error-correcting 
 code.
% The second is the burglar-alarm problem of  \exburglar.
% In both examples w
 We will see that
 inferences can be conducted most efficiently by
 {\dem\index{message passing}{message-passing} algorithms}, which take
 advantage of the graphical structure of the problem
 to avoid unnecessary duplication of computations (see 
 \chapterref{ch.message}).

% This chapter is a possible location for the first introduction
% of Markov chains, and/or hidden Markov models.

\section{Decoding problems}
\label{sec.decoding.problems}
%
% these are defined first in _linear.tex
%
 A codeword $\bt$ is selected from a linear $(N,K)$
 code $\C$, and it is transmitted 
 over a noisy channel; the received signal is 
 $\by$.
 In this chapter we will assume that the channel is a memoryless 
 channel such as a Gaussian channel.
 Given an assumed channel model $P(\by  \given \bt)$, there are 
 two decoding problems. 
\begin{description}
\item[The codeword decoding problem] is the task of\index{decoder!codeword} 
 inferring which codeword $\bt$ was transmitted given the 
 received signal.
\item[The bitwise decoding problem] is the task of inferring\index{decoder!bitwise} 
 for each transmitted bit $t_n$ how likely it is that that 
 bit was a one rather than a zero.
\end{description}


 As a concrete example,  take the $(7,4)$ Hamming  code.
 In  \chref{ch.one}, we  discussed
 the codeword decoding problem  for that code, assuming
 a binary symmetric channel. We didn't discuss the bitwise decoding problem
 and we didn't discuss how to handle more general channel models 
 such as a Gaussian channel.


\subsection{Solving the codeword decoding problem}
 By \Bayes\  theorem, the posterior probability 
 of the codeword $\bt$ is\index{Bayes' theorem}
\beq
        P( \bt  \given  \by ) = \frac{ P(\by \given \bt) P(\bt) }{ P(\by )} .
\label{eq.decode}
\eeq
\begin{description}
\item[Likelihood function\puncspace]
 The first factor in the numerator, 
 $P(\by \given \bt)$, is the {\dbf\ind{likelihood}} of the codeword,
 which, for any memoryless  channel, is a separable function,
\beq
        P(\by \given \bt) = \prod_{n=1}^N P(y_n \given t_n) .
\eeq
 For example, if the channel is a Gaussian channel with transmissions 
 $\pm x$ and additive noise of standard deviation $\sigma$, 
 then the probability density of the received signal $y_n$ in the two 
 cases $t_n=0,1$ is
\beqan
        P(y_n \given t_n \eq 1) &=& \frac{1}{\sqrt{2 \pi \sigma^2}}
                 \exp \left( -\frac{(y_n - x )^2}{2 \sigma^2} \right) \\
        P(y_n \given t_n \eq 0) &=& \frac{1}{\sqrt{2 \pi \sigma^2}}
                 \exp \left( -\frac{(y_n + x )^2}{2 \sigma^2} \right)  .
\eeqan
 From the point of view of decoding, all that matters is the {\dbf likelihood
 ratio}, which for the case of the Gaussian channel is
\beq
	\frac{P(y_n \given t_n \eq 1)}{P(y_n \given t_n \eq 0)} =
                 \exp \left( \frac{2 x y_n }{ \sigma^2} \right)  .
\eeq
\end{description}
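
 The ratio follows by dividing the two densities: the normalizing
 constants cancel, and the exponents combine as
\beq
 -\frac{(y_n - x)^2}{2 \sigma^2} + \frac{(y_n + x)^2}{2 \sigma^2}
        = \frac{4 x y_n}{2 \sigma^2} = \frac{2 x y_n}{\sigma^2} ,
\eeq
 so the log likelihood ratio is linear in the received signal $y_n$.
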
\exercisxA{2}{ex.gc.bsc}{
 Show that from the point of view of decoding, a Gaussian channel
 is equivalent to a time-varying binary symmetric channel with a known
 noise level $f_n$ which depends on $n$.
}
\begin{description}
\item[Prior\puncspace]
 The second factor in the numerator is the {\dbf prior} probability of 
 the codeword, $P(\bt)$, which is usually assumed to be uniform over 
 all valid codewords.

 The denominator in (\ref{eq.decode}) is  the normalizing constant 
\beq
         P(\by ) = \sum_{\bt}  { P(\by \given \bt) P(\bt) } .
\eeq
\end{description}

 The complete solution to the codeword decoding problem is 
 a list  of all codewords and their probabilities as given by equation 
 (\ref{eq.decode}).  Since the number of codewords
 in a linear code, $2^K$, is often very large, and since we are not 
 interested in knowing the detailed probabilities of all the codewords, 
 we  often  restrict attention to a simplified version of the codeword 
 decoding problem. 

\begin{description}
\item[The \index{maximum {\em a posteriori}}{MAP} codeword decoding problem] is the task of 
 identifying {\em the most probable codeword\/} $\bt$ given the 
 received signal.

 If the prior probability over codewords is uniform then this 
 task is identical to the problem of {\dbf maximum likelihood 
 decoding}, that is, identifying the codeword that maximizes
 $P(\by  \given  \bt )$.
\end{description}
{\sf Example:} In \chref{chone}, for
% the case of
 the $(7,4)$ Hamming code and a binary symmetric channel
 we  discussed  a method for 
 deducing the {most probable codeword} from the syndrome of 
 the received signal, thus solving the {MAP} codeword decoding problem
 for that case. We would like a more general solution.
 

 The MAP codeword decoding problem can  be solved in exponential time 
 (of order $2^K$) by searching through all codewords for the one that 
 maximizes $P(\by \given \bt) P(\bt)$. But we are interested in methods that 
 are more efficient than this. In section \ref{sec.viterbi}, we will
 discuss an exact method known 
 as the  {\dbf\ind{min--sum
 algorithm}}
 which may be able to solve the codeword 
 decoding problem more efficiently; how much more efficiently
 depends on the properties of the code.  

% {\em (put this somewhere else?)}
% However,
  It is worth emphasizing that
   MAP codeword decoding  for a {\em general\/} linear 
 code is known to be \ind{NP-complete} (which means in layman's terms
 that MAP codeword decoding has a complexity that
% can only be done in general
% in a time that
 scales exponentially with the blocklength, unless
 there is a revolution in computer science).
 So restricting attention to the \ind{MAP decoding} problem hasn't\index{maximum {\em a posteriori\/} decoder}
 necessarily 
 made the task much less challenging; it simply makes the answer briefer to
 report.

\subsection{Solving the bitwise decoding problem}
 Formally, the exact solution of the bitwise decoding problem 
 is obtained from \eqref{eq.decode} by {\em marginalizing\/} 
 over the other bits.
\beq
        P( t_n  \given  \by ) =  \sum_{ \{ t_{n'} : \, n' \neq n \} }
                                 { P(\bt \given \by)} .
\label{eq.bitwise}
\eeq
 We can also write this marginal with the aid of a truth function 
 $\truth[S]$ that is one if
 the proposition $S$ is true and zero otherwise.
\beqan
        P( t_n\eq 1  \given  \by ) &=&  \sum_{\bt}
                                 { P(\bt \given \by)} \,\truth[ t_n\eq  1 ] \\
        P( t_n\eq 0  \given  \by ) &=&  \sum_{\bt}
                                 { P(\bt \given \by)} \,\truth[ t_n\eq 0 ] .
\label{eq.bitwise1}
\eeqan
% In case this notation is hard to understand, here is an explicit
% example using the bitwise decoding of a $(7,4)$ Hamming code. 
% The probability that $t_2=1$ is
%\beq
%       P( t_n\eq  1  \given  \by ) =  \sum_{ \{ t_1,t_3,t_4,t_5,t_6,t_7 \} }
%                                { P(\bt \given \by)}
%\label{eq.bitwise2}
%\eeq
% 
 Computing these marginal  probabilities by an explicit sum over all
 codewords
 $\bt$  takes 
 exponential time. But, 
 for certain codes, the bitwise decoding problem can be solved 
 much more efficiently using the {\dbf \ind{forward--backward
 algorithm}}. We will describe
 this algorithm, which is an example of the
 {\dbf\ind{sum--product algorithm}}, in a moment. Both the min--sum algorithm and the
 sum--product algorithm have widespread importance, and  have been
 invented many times in many fields.


\section{Codes and trellises\nonexaminable}
 In Chapters \chone\ and \chseven, we represented linear $(N,K)$
 codes in terms of their generator matrices and their parity-check matrices. 
 In the case of a {\dbf systematic} block code, the first 
 $K$ transmitted bits in each block of size $N$ are the source 
 bits, and the remaining $M=N-K$ bits are the parity-check 
 bits. This means that the generator matrix of the code can be written 
\beq
        \bG^{\T} = \left[ \begin{array}{c} \bI_K \\ \bP \end{array} \right] ,
\eeq
 and the parity-check matrix can be written 
\beq
        \bH = \left[ \begin{array}{cc} \bP & \bI_M \end{array} \right] ,
\eeq
 where $\bP$ is an $M \times K$ matrix. 
 
 In this section we will  study another representation 
 of a linear code called a trellis. The codes that these trellises 
 represent will not in general be systematic codes, but
 they can be mapped onto systematic codes 
 if desired by a reordering 
 of the  bits in a block.

%\begin{figure}
\marginfig{%
\footnotesize
\begin{tabular}{*{1}{l@{\hspace{-0.5in}}l}}
\raisebox{0.5in}{(a)}& \hspace*{0.42in}\mbox{\psfig{figure=trellis/R3/ps.ps,angle=-90,width=1.3in}} \\
& \multicolumn{1}{c}{\footnotesize  Repetition code $R_3$} \\[-0.12in]
\raisebox{0.5in}{(b)}& \hspace*{0.42in}\mbox{\psfig{figure=trellis/P3/ps.ps,angle=-90,width=1.3in}}  \\
&  \multicolumn{1}{c}{\footnotesize Simple parity code $P_3$ } \\[-0.12in]
\raisebox{0.85in}{(c)}& \hspace*{-0.24in}\mbox{\psfig{figure=trellis/H74s/ps.ps,angle=-90,width=2.63in}} \\
&  \multicolumn{1}{c}{\footnotesize $(7,4)$ Hamming code} \\
\end{tabular}
%}{%
\caption[a]{Examples of trellises.
% \\ (a) Repetition code R3. \\
% (b) Simple parity code P3. \\ (c) $(7,4)$ Hamming code.

 Each edge in a trellis is labelled by a zero (shown by a square)
 or a one (shown by a cross).}
\label{fig.trellises}
\label{fig.trellis}
}%
%\end{figure}
\subsection{Definition of a trellis}
 Our definition
% of a trellis
 will be quite narrow. For a more
 comprehensive
 view of trellises, the reader should consult \citeasnoun{Kschischang_}.

\begin{description}
  \item[A trellis]
    is a {\dem graph\/} consisting of {\dem nodes\/} (also known as states or vertices)
 and {\dem edges}. The nodes 
 are grouped into vertical slices called {\dem times}, and the times
% \marginpar{\footnotesize{Warning: terminology has recently been altered here. Look for ``state'' needing to be changed to ``time''.}}
% states
% \marginpar{\footnotesize{I need to reconsider this terminology:
% I would like to be able to talk about `a four-state trellis'
% and `the state as a function of time'; this usage conflicts with
% the idea that the encoder passes through an ordered sequence of `states'.}}
 are 
 ordered such that each edge connects a node in one time
% state
 to a node 
 in a neighbouring time.
% state
 Every edge is labelled with a {\dem symbol}. 
 The leftmost and rightmost times contain only one node. 
 Apart from these two extreme nodes, all nodes in the trellis have at least
 one edge connecting leftwards and at least one connecting rightwards.
\end{description}

 A trellis with $N\!+\!1$ times
% states 
 defines a code of blocklength $N$
 as follows: a codeword
 is obtained by taking a path that crosses the trellis from left to 
 right and reading out the symbols on the edges that are traversed. 
 Each valid path through the trellis defines a codeword.
 We will number the leftmost time `time 0' and the rightmost
 `time $N$'. We will number the leftmost state `state 0' and the rightmost
 `state $I$', so that the trellis contains $I\!+\!1$
 states (vertices) in total. The $n$th bit of the codeword
 is emitted as we move from time
% state
 $n\!-\!1$ to time
% state
 $n$. 

 The {\dem width\/} of the trellis at a given time
% state
 is the number of 
 nodes in that time.
% state.
 The {\dem maximal width\/}
 of a trellis is what it sounds like.

 A trellis is called a {\dem linear trellis\/}
 if the code it defines is a 
 linear code. We will solely be concerned with linear trellises
 from now on,
 as nonlinear trellises are much more complex beasts.
% \cite{Kschischang_}.
 For brevity, we will only discuss binary trellises, that is, 
 trellises whose edges are labelled with zeroes and ones. It is 
 not hard to generalize the methods that follow to $q$-ary trellises. 

 Figures \ref{fig.trellises}(a--c) show the trellises corresponding to
 the repetition code $R_3$ which has $(N,K)=(3,1)$; the
 parity code $P_3$ with $(N,K) = (3,2)$;  and
 the $(7,4)$ Hamming code. 
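
 The rule `read the symbols along each left-to-right path' can be
 sketched in a few lines. This is an illustrative sketch only: the
 edge-list representation and the node labels below are assumptions
 (they need not match the layout drawn in the figure), with nodes
 numbered separately within each time.
\begin{verbatim}
```python
def codewords(layers):
    """Enumerate the codewords defined by a trellis.

    layers[n] lists the edges between time n and time n+1 as
    (from_node, to_node, symbol); node labels are local to each time,
    and the single node at time 0 is labelled 0.
    """
    paths = [(0, "")]          # (current node, symbols read so far)
    for layer in layers:
        paths = [(j, word + str(sym))
                 for node, word in paths
                 for i, j, sym in layer if i == node]
    return sorted(word for _, word in paths)

# Repetition code R3: the first edge fixes the bit, the rest repeat it.
R3 = [[(0, 0, 0), (0, 1, 1)],
      [(0, 0, 0), (1, 1, 1)],
      [(0, 0, 0), (1, 0, 1)]]

# Parity code P3: the node label is the parity of the bits so far,
# and the last edge must return the parity to zero.
P3 = [[(0, 0, 0), (0, 1, 1)],
      [(0, 0, 0), (0, 1, 1), (1, 1, 0), (1, 0, 1)],
      [(0, 0, 0), (1, 0, 1)]]
```
\end{verbatim}
 Each trellis has four times, so both codes have blocklength $N=3$;
 $R_3$ has one binary branch point and $P_3$ has two, in line with
 $K=1$ and $K=2$.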

\exercisxB{2}{ex.trellish74}{
 Confirm that the sixteen codewords listed in \tabref{fig.h74}
 are generated by the trellis shown in \figref{fig.trellises}c.}

\subsection{Observations about linear trellises}
 For any linear code the {\dem minimal trellis\/} is the one 
 that has the smallest number of nodes.
%
% CHECK: is reordering of bits permitted? 
%
% vertices. 
 In a minimal trellis, each node has at most two  edges entering it
 and at most two edges leaving it.  All nodes in a time
% state
 have the same 
 left degree as each other and they have the same right 
 degree as each other. The width is always a power of two. 

 A minimal trellis for a linear $(N,K)$ code cannot have a width greater 
 than $2^K$ since every node has at least one valid codeword through it, 
 and there are only $2^K$ codewords. Furthermore, if we define $M=N-K$, 
 the minimal trellis's width is nowhere greater than $2^M$.
 This will be proved in section \ref{sec.two.to.M.trellis}.

 Notice that for the linear trellises in \figref{fig.trellis}, all of
 which are minimal trellises, $K$ is the number of times a binary
 branch point is encountered as the trellis is traversed from left to
 right or from right to left.

 We will discuss the construction of trellises more  in section
 \ref{sec.more.on.trellis}.
% where we discuss how to make trellises from generator matrices.
 But we now know enough to
 discuss the decoding problem.

\section{Solving the decoding problems on a trellis\nonexaminable}
 We can view the trellis of a linear code
 as giving a causal description of the probabilistic
 process that gives rise to a codeword, with time flowing from  
 left to right.
% At each timestep we move one state to the right.
 Each time a divergence 
 is encountered, a random source (the source of information
 bits for communication) determines which way we go. 
% Note this is just the same as saying that the codeword is generated 
% by a hidden Markov model with a time-varying transition probability
% matrix.

 At the receiving end, we receive a noisy version of the
 sequence of edge-labels, and wish
 to infer which path was taken, or to be precise, (a) we want  to 
 identify the most probable path in order to solve the
 codeword decoding problem; and (b) we want  to find the probability that 
 the transmitted symbol at time $n$ was a zero or a one,
 to solve the bitwise decoding problem.

\Exampl{ex.trellis.h74}{
	Consider the case of 
 a single transmission from the Hamming $(7,4)$ trellis shown
 in \figref{fig.trellis}c.

\begin{figure}[htbp]
\figuremargin{%
\begin{center}
\begin{tabular}{clll} \toprule
$\bt$ & \multicolumn{1}{c}{Likelihood } &  \multicolumn{2}{c}{Posterior probability} \\ \midrule
%
\tt 0000000 & 0.0275562  & 0.25      & \raisebox{2mm}{\framebox[0.246in]{}}  \\ 
\tt 0001011 & 0.0001458  & 0.0013      & \raisebox{2mm}{\framebox[0.001in]{}}  \\ 
\tt 0010111 & 0.0013122  & 0.012      & \raisebox{2mm}{\framebox[0.012in]{}}  \\ 
\tt 0011100 & 0.0030618  & 0.027      & \raisebox{2mm}{\framebox[0.027in]{}}  \\ 
\tt 0100110 & 0.0002268  & 0.0020      & \raisebox{2mm}{\framebox[0.002in]{}}  \\ 
\tt 0101101 & 0.0000972  & 0.0009      & \raisebox{2mm}{\framebox[0.001in]{}}  \\ 
\tt 0110001 & 0.0708588  & 0.63      & \raisebox{2mm}{\framebox[0.632in]{}}  \\ 
\tt 0111010 & 0.0020412  & 0.018      & \raisebox{2mm}{\framebox[0.018in]{}}  \\ 
\tt 1000101 & 0.0001458  & 0.0013      & \raisebox{2mm}{\framebox[0.001in]{}}  \\ 
\tt 1001110 & 0.0000042  & 0.0000      & \raisebox{2mm}{\framebox[0.000in]{}}  \\ 
\tt 1010010 & 0.0030618  & 0.027      & \raisebox{2mm}{\framebox[0.027in]{}}  \\ 
\tt 1011001 & 0.0013122  & 0.012      & \raisebox{2mm}{\framebox[0.012in]{}}  \\ 
\tt 1100011 & 0.0000972  & 0.0009      & \raisebox{2mm}{\framebox[0.001in]{}}  \\ 
\tt 1101000 & 0.0002268  & 0.0020      & \raisebox{2mm}{\framebox[0.002in]{}}  \\ 
\tt 1110100 & 0.0020412  & 0.018      & \raisebox{2mm}{\framebox[0.018in]{}}  \\ 
\tt 1111111 & 0.0000108  & 0.0001      & \raisebox{2mm}{\framebox[0.000in]{}}  \\ \bottomrule
\end{tabular}
\end{center} 
}{%
\caption[a]{Posterior probabilities over the sixteen codewords
 when the received vector $\by$ has normalized
 likelihoods  $(0.1, 0.4, 0.9, 0.1, 0.1, 0.1,
 0.3)$.}
\label{fig.posteriorH74}
}%
\end{figure}

	Let the normalized likelihoods be: $(0.1, 0.4, 0.9, 0.1, 0.1, 0.1,
 0.3)$. That is, the ratios of the likelihoods are 
\beq
\frac{	P(y_1 \given t_1 \eq 1)}{	P(y_1 \given t_1 \eq 0)} = \frac{0.1}{0.9} ,
\:\:\:
\frac{	P(y_2 \given t_2 \eq 1)}{	P(y_2 \given t_2 \eq 0)} = \frac{0.4}{0.6} ,
\:\:\:
\mbox{etc.}
\eeq
 How should this received signal be decoded?

\begin{enumerate}
\item If we threshold the likelihoods at 0.5 to turn the
 signal into a binary received vector, we have $\br = (0,0,1,0,0,0,0)$,
 which decodes, using the  decoder for the binary
 symmetric channel (\chapterref{ch1}), into $\hat{\bt} = (0,0,0,0,0,0,0)$.

 This is not the optimal decoding procedure.
 Optimal inferences are always obtained by using \Bayes\  theorem.

\item
 We can find the posterior probability over codewords
 by explicit enumeration of all sixteen codewords. This
 posterior distribution is shown
 in \figref{fig.posteriorH74}. Of course, we aren't really
 interested in such brute-force solutions, and the aim
 of this chapter
% the following sections
 is to understand algorithms for getting
 the same information out in less than $2^K$ computer time.

 Examining the posterior probabilities, we notice that the most probable
 codeword is actually the string $\bt = \tt 0110001$. This is more than
 twice as probable as the answer found by thresholding, {\tt 0000000}.

 Using the posterior probabilities shown in  \figref{fig.posteriorH74},
 we can also compute the posterior marginal distributions of each of
 the bits. The result is shown in \figref{fig.exact.marginals}.
 Notice that bits 1, 4, 5 and 6 are all quite confidently
 inferred to be zero. The strengths of the posterior probabilities
 for bits 2, 3, and 7 are not so great. \hfill \ensuremath{\epfsymbol}\par
\end{enumerate}
}
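
 The brute-force enumeration in the example above can be sketched as
 follows. This is an illustrative sketch, not the program used to
 produce the tables: the function names are hypothetical, and the
 parity rule in the code is the one read off from the codewords
 listed in \figref{fig.posteriorH74}. A uniform prior over codewords
 is assumed.
\begin{verbatim}
```python
from itertools import product

def hamming74_codewords():
    """The sixteen codewords: four source bits, then three parity bits."""
    words = []
    for s1, s2, s3, s4 in product((0, 1), repeat=4):
        parity = ((s1 + s2 + s3) % 2,
                  (s2 + s3 + s4) % 2,
                  (s1 + s3 + s4) % 2)
        words.append((s1, s2, s3, s4) + parity)
    return words

def codeword_posterior(p1):
    """Posterior over codewords, assuming a uniform prior.

    p1[n] is the normalized likelihood P(y_n | t_n = 1).
    Returns (posterior dictionary, sum of the codeword likelihoods).
    """
    like = {}
    for t in hamming74_codewords():
        value = 1.0
        for bit, p in zip(t, p1):
            value *= p if bit else 1.0 - p
        like[t] = value
    z = sum(like.values())
    return {t: value / z for t, value in like.items()}, z

def bit_marginals(post):
    """P(t_n = 1 | y) for each bit, by summing over codewords."""
    return [sum(p for t, p in post.items() if t[n] == 1)
            for n in range(7)]

P1 = [0.1, 0.4, 0.9, 0.1, 0.1, 0.1, 0.3]
POST, Z = codeword_posterior(P1)
MAP = max(POST, key=POST.get)
MARGINALS = bit_marginals(POST)
```
\end{verbatim}
 The cost of this approach grows as $2^K$, which is why the rest of
 the chapter looks for something better.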
\begin{figure}
\figuremargin{%
\[
\begin{array}{ccclllll} \toprule
 n & \multicolumn{2}{c}{\mbox{Likelihood}} & \multicolumn{4}{c}{\mbox{Posterior marginals}} \\
  & \multicolumn{1}{c}{P(y_n \given t_n \eq 1)} & \multicolumn{1}{c}{P(y_n \given t_n \eq 0)} & 
 \multicolumn{2}{c}{P(t_n \eq 1  \given  \by)} & \multicolumn{2}{c}{P(t_n \eq 0  \given  \by)} \\ \midrule
% marginals
 1 & 0.1       & 0.9      & 0.061    & \raisebox{2mm}{\framebox[0.061in]{}}  & 0.939    & \raisebox{2mm}{\framebox[0.939in]{}}  \\
 2 & 0.4       & 0.6      & 0.674    & \raisebox{2mm}{\framebox[0.674in]{}}  & 0.326    & \raisebox{2mm}{\framebox[0.326in]{}}  \\
 3 & 0.9       & 0.1      & 0.746    & \raisebox{2mm}{\framebox[0.746in]{}}  & 0.254    & \raisebox{2mm}{\framebox[0.254in]{}}  \\
 4 & 0.1       & 0.9      & 0.061    & \raisebox{2mm}{\framebox[0.061in]{}}  & 0.939    & \raisebox{2mm}{\framebox[0.939in]{}}  \\
 5 & 0.1       & 0.9      & 0.061    & \raisebox{2mm}{\framebox[0.061in]{}}  & 0.939    & \raisebox{2mm}{\framebox[0.939in]{}}  \\
 6 & 0.1       & 0.9      & 0.061    & \raisebox{2mm}{\framebox[0.061in]{}}  & 0.939    & \raisebox{2mm}{\framebox[0.939in]{}}  \\
 7 & 0.3       & 0.7      & 0.659    & \raisebox{2mm}{\framebox[0.659in]{}}  & 0.341    & \raisebox{2mm}{\framebox[0.341in]{}}  \\ \bottomrule
\end{array}
\]
}{%
\caption[a]{Marginal posterior probabilities for the 7 bits
 under the posterior distribution of \protect\figref{fig.posteriorH74}.}
\label{fig.exact.marginals}
}%
\end{figure}

 In the above example, the MAP
% most probable
 codeword is in agreement
 with the
% bit-by-bit
 bitwise decoding that is obtained by
 selecting the most probable state for each bit using the
 posterior marginal distributions. But this is
 not always the case, as the following exercise shows.
\exercissxA{2}{ex.H74.hinoise}{
	Find the most probable codeword in the case
 where the normalized likelihood is  $( 0.2,0.2,0.9,0.2,0.2,0.2,0.2 )$.
 Also find or estimate 
 the marginal posterior probability for each of the seven bits,
 and give the bit-by-bit decoding.

 [Hint: concentrate on the few codewords that
 have the largest probability.]
}

 We now discuss how to use message passing on a code's trellis to solve
 the decoding problems.

\subsection{The min--sum algorithm\nonexaminable}
% {Viterbi}
\label{sec.viterbi}
 The MAP codeword decoding problem can be solved
 using the \ind{min--sum algorithm} that was introduced
% Connect this section to counting paths in the constrained channel,
% chapter \ref{ch.noiseless}, and to the message-passing  chapter \ref{ch.message}.
 in \secref{sec.minsum1}.
 Each codeword of the code corresponds to a path across
 the trellis.
 Just as the cost of a journey is the sum of the costs of its  constituent
 steps, the log likelihood of a codeword is the sum
 of the bitwise log likelihoods. By convention, we
 flip the sign of the log likelihood (which we would like
 to maximize) and talk in terms of a cost, which
 we would like to minimize.

 We associate with each edge a cost $-\!\log P(y_n \given t_n)$,
 where $t_n$ is the transmitted bit associated with that edge,
 and $y_n$ is the received symbol.
 The min--sum algorithm presented in  \secref{sec.minsum1}
 can then identify the most probable codeword in a number of computer operations equal
 to the number of edges in the trellis.
 This algorithm is also known as the \ind{Viterbi algorithm} \cite{viterbi}.\index{message passing!Viterbi}
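
 As an illustration, here is a minimal sketch of min--sum decoding
 for the $(7,4)$ Hamming example. It rests on assumptions that are
 not part of the text above: nodes are labelled by the partial
 syndrome of the bits emitted so far (one possible way to build a
 trellis from a parity-check matrix), the columns of $\bH$ are packed
 into integers, and every likelihood is assumed to lie strictly
 between 0 and 1 so that the logarithms are finite.
\begin{verbatim}
```python
import math

# Columns of H = [P I_3] for the (7,4) Hamming code, packed as 3-bit
# integers (an assumed encoding for this sketch).
H_COLS = [0b101, 0b110, 0b111, 0b011, 0b100, 0b010, 0b001]

def min_sum_decode(p1):
    """Return (most probable codeword, its cost) by min-sum.

    p1[n] is the normalized likelihood P(y_n | t_n = 1); the cost of
    an edge is -log of the likelihood of the bit on that edge.
    """
    n_bits = len(H_COLS)
    cost = [0.0] + [math.inf] * 7      # cheapest route to each node
    back = []                          # back[n][node] = (parent, bit)
    for n in range(n_bits):
        new_cost = [math.inf] * 8
        choice = [None] * 8
        for s in range(8):
            if cost[s] == math.inf:
                continue               # node not reachable at this time
            for bit in (0, 1):
                w = p1[n] if bit else 1.0 - p1[n]
                s2 = s ^ (H_COLS[n] if bit else 0)
                c = cost[s] - math.log(w)
                if c < new_cost[s2]:
                    new_cost[s2] = c
                    choice[s2] = (s, bit)
        cost = new_cost
        back.append(choice)
    # Valid codewords end at syndrome zero; trace the best path back.
    s, bits = 0, []
    for n in reversed(range(n_bits)):
        s, bit = back[n][s]
        bits.append(bit)
    bits.reverse()
    return bits, cost[0]

BITS, COST = min_sum_decode([0.1, 0.4, 0.9, 0.1, 0.1, 0.1, 0.3])
```
\end{verbatim}
 For simplicity this sketch also visits syndrome nodes that cannot
 reach the end node, so it does slightly more work than a pass over
 the edges of the trellis proper.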

% Consider a node on the most probable path, which has two upstream 
% parents. The most probable way of creating the first $n$ emissions
% and getting to the present node must be the same as 
% the most probable path, because if it weren't....
% To find the most probable path to a node, only need to know the 
% score of the most probable paths to its parents, and the score associated 
% with transitions from those two parents. Then can identify which is 
% the cheaper parent. 

\subsection{The sum--product algorithm\nonexaminable}
\label{sec.trellisfb}
 To solve the bitwise decoding problem,
 we can make a small modification to the min--sum algorithm,
 so that the messages passed through the trellis
 define `the probability of the data up to the current point'
 instead of `the cost of the best route to this point'.
 We replace the costs on the edges,  $-\!\log P(y_n \given t_n)$,
 by the likelihoods themselves,  $P(y_n \given t_n)$.
 We replace the min  and  sum operations of the \ind{min--sum algorithm}
 by a sum and product respectively.

 Let $i$ run over nodes/states,  $i=0$ be the label for the
 start state,  ${\cal P}(i)$ denote the set of
 states that are parents of state $i$,
 and $w_{ij}$ be the likelihood associated with the
 edge from node $j$ to node $i$. 
 We define the forward-pass messages $\alpha_i$
 by
\beqan
	\alpha_0 &=& 1 \nonumber \\
 \alpha_i & = & \sum_{ j \in {\cal P}(i) } w_{ij} \alpha_j  .
\eeqan
 These messages can be computed sequentially from left to right.
\exercisxB{2}{ex.sumprod}{
 Show that for a node $i$ whose time-coordinate is $n$,
 $\alpha_i$ is proportional to the joint probability
 that  the codeword's path passed through node $i$
 and that the first $n$ received symbols
 were  $y_1, \ldots, y_n$.
}
 The message $\alpha_I$ computed at the end node of the trellis is proportional to
 the marginal probability of
 the data. 
\exercisxB{2}{ex.sumprodb}{
 What is the constant of proportionality? [Answer: $2^K$]
}

 We define a second set of backward-pass messages $\beta_i$
 in a similar manner. Let node $I$ be the end node.
\beqan
	\beta_I &=& 1 \nonumber \\
 \beta_j & = & \sum_{i : j \in {\cal P}(i) } w_{ij} \beta_i .
\eeqan
 These messages can be computed sequentially in
 a backward pass from right to left.
\exercisxB{2}{ex.sumprodd}{
 Show that for a node $i$ whose time-coordinate is $n$,
 $\beta_i$ is proportional to the conditional probability,
 {\em given\/}
 that  the codeword's path passed through node $i$,
 that the subsequent  received symbols
 were  $y_{n+1}, \ldots, y_N$.
}

 Finally, to find the probability that the $n$th bit
 was a 1 or 0, we do two summations of products of the
 forward and backward messages. Let $i$ run over nodes
 at time $n$ and $j$ run over nodes at time $n-1$,
 and let $t_{ij}$ be the value of $t_n$ associated with
 the trellis edge from node $j$ to node $i$. For each
 value of $t=0/1$, we compute 
\beq
	r^{(t)}_n = \sum_{i,j: \, j \in {\cal P}(i) ,\,  t_{ij} = t}  \alpha_j w_{ij} \beta_i .
\eeq
 Then the posterior probability that $t_n$ was $t=0/1$ is
\beq
	P( t_n \eq  t \given  \by ) = \frac{1}{Z} r^{(t)}_n ,
\eeq
 where the normalizing constant $Z =  r^{(0)}_n + r^{(1)}_n$
 should be identical to the final forward message $\alpha_I$
 that was computed earlier.
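
 The forward and backward recursions and the final summation can be sketched
 as follows. This is a minimal illustration under stated assumptions: the
 function name \verb|forward_backward|, the edge format, and the example
 likelihoods (for a three-bit parity code) are hypothetical choices, and exact
 fractions are used only to make the numbers easy to read.

```python
from collections import defaultdict
from fractions import Fraction

def forward_backward(edges, start, end):
    """Sum-product on a trellis.

    edges: list of (parent j, child i, time n, bit t, likelihood w_ij),
    sorted by time n = 1, 2, ..., N.
    Returns alpha, beta, Z and posterior[n-1][t] = P(t_n = t | y).
    """
    alpha = defaultdict(Fraction)            # missing entries default to 0
    beta = defaultdict(Fraction)
    alpha[start] = Fraction(1)
    beta[end] = Fraction(1)
    for j, i, n, t, w in edges:              # forward pass, left to right
        alpha[i] += w * alpha[j]
    for j, i, n, t, w in reversed(edges):    # backward pass, right to left
        beta[j] += w * beta[i]
    r = defaultdict(Fraction)
    for j, i, n, t, w in edges:              # r_n^(t): sum of alpha_j w_ij beta_i
        r[n, t] += alpha[j] * w * beta[i]
    Z = alpha[end]
    N = max(n for _, _, n, _, _ in edges)
    posterior = [{t: r[n, t] / Z for t in (0, 1)} for n in range(1, N + 1)]
    return alpha, beta, Z, posterior

# Example: parity code of length 3, nodes labelled (time, parity so far).
like = [(Fraction(1, 4), Fraction(1, 2)),
        (Fraction(1, 2), Fraction(1, 4)),
        (Fraction(1, 8), Fraction(1, 2))]
edges = []
for n in range(1, 4):
    for parity in (0, 1):
        for t in (0, 1):
            j, i = (n - 1, parity), (n, parity ^ t)
            if j == (0, 1) or i == (3, 1):
                continue                     # start and end have parity 0
            edges.append((j, i, n, t, like[n - 1][t]))

alpha, beta, Z, posterior = forward_backward(edges, (0, 0), (3, 0))
print(Z, [str(p[1]) for p in posterior])
```

 For every $n$ the two values $r^{(0)}_n$ and $r^{(1)}_n$ add up to the same
 $Z$, which gives a convenient consistency check on an implementation.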
\exercisxC{2}{ex.sumprode}{
 Confirm that the above sum--product algorithm
 does compute  $P( t_n \eq  t \,|\, \by )$.
}
 Other names for  the  sum--product algorithm presented here 
 are `the \ind{forward--backward algorithm}', `the  \ind{BCJR algorithm}',
 and `\ind{belief propagation}'.\index{message passing!BCJR}\index{message passing!belief propagation}\index{message passing!forward--backward}
\exercissxB{2}{ex.sumprodf}{
 A codeword of the simple parity code  $P_3$
 is transmitted, and the received
 signal $\by$ has associated likelihoods shown in
 \tabref{tab.sumprodf}.%
\margintab{
\begin{center}
\begin{tabular}{ccc} \toprule
$n$ & \multicolumn{2}{c}{ $P(y_n \,|\, t_n )$ } \\
&  $t_n \eq  0$ &  $ t_n \eq  1 $ \\
\midrule
1 & \dquarter & \dhalf \\
2 & \dhalf    & \dquarter \\
3 & \deighth  & \dhalf \\ \bottomrule
\end{tabular}
\end{center}
\caption[a]{Bitwise likelihoods for
 a codeword of $P_3$.}
\label{tab.sumprodf}
}
 Use the min--sum algorithm and the sum--product
 algorithm in the trellis (\figref{fig.trellis})
 to solve the MAP codeword decoding problem
 and the bitwise decoding problem. Confirm your answers
 by enumeration of all codewords ({\tt{000}}, {\tt{011}},  {\tt{110}}, {\tt{101}}).
 [\Hint: use logs to base 2 and do the min--sum computations by hand.
 When working the sum--product algorithm by hand, you may find
 it helpful to use three colours of pen, one for the
 $\alpha$s, one for the $w$s, and one for the $\beta$s.]
% in the sum--product computation, the answers are best
% expressed working in multiples of .]
}


%\section{Exercises}





% Could discuss  The junction tree algorithm.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\subchapter{More on Trellises\nonexaminable}
\section{More on trellises}
\label{sec.more.on.trellis}
% In this appendix we
 We now discuss various ways of making the trellis of
 a code. You may safely jump over this section.

 The {\dbf \ind{span}} of a codeword
% of length $N$
 is the set of bits contained between
 the first bit in the codeword that is non-zero, and the last
 bit that is non-zero, inclusive. We can indicate the span of a codeword by
 a binary vector as shown in \tabref{fig.span}.
\begin{table}[htbp]
\figuremargin{%
\begin{center}
\begin{tabular}{rccccc} \toprule
Codeword &
\tt 0000000 &
\tt 0001011 &
\tt 0100110 &
\tt 1100011 &
\tt 0101101 \\
Span &
\tt 0000000 &
\tt 0001111 &
\tt 0111110 &
\tt 1111111 &
\tt 0111111 \\ \bottomrule
\end{tabular} 
\end{center} 
}{%
\caption[a]{Some codewords and their spans.}
\label{fig.span}
}%
\end{table}

\noindent
 A generator matrix is in {\dbf trellis-oriented form} if 
 the spans of the rows of the  generator matrix all start in different
 columns and the spans all end in different columns.

%
% see bin/G2T.p
%
\subsection{How to make a trellis from a generator matrix}
 First, put the generator matrix into trellis-oriented form by 
 row-manipulations similar to Gaussian elimination. 
 For example, our $(7,4)$ Hamming code can be generated by
\beq
\bG = \left[ \begin{array}{ccccccc}
1&0&0&0&1&0&1\\
0&1&0&0&1&1&0\\
0&0&1&0&1&1&1\\
0&0&0&1&0&1&1
\end{array}
\right]
\eeq
 but this matrix is not in trellis-oriented form -- for example, 
 rows 1, 3 and 4 all have spans that end in the same column. 
 By subtracting lower rows from upper rows, we can obtain 
 an equivalent generator matrix (that is, one that generates the 
 same set of codewords) as follows: 
\beq
\bG = \left[ \begin{array}{ccccccc}
1&1&0&1&0&0&0\\
0&1&0&0&1&1&0\\
0&0&1&1&1&0&0\\
0&0&0&1&0&1&1
\end{array}
\right] .
\eeq
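
 The definitions of span and trellis-oriented form can be checked mechanically
 on the two matrices above. The following is a small sketch; the function
 names and the list-of-lists representation are assumptions made here, and
 every row of a generator matrix is assumed to be non-zero.

```python
from itertools import product

def span(word):
    """Span of a codeword: ones from the first to the last non-zero bit."""
    ones = [k for k, bit in enumerate(word) if bit]
    if not ones:
        return [0] * len(word)
    return [int(ones[0] <= k <= ones[-1]) for k in range(len(word))]

def is_trellis_oriented(G):
    """True if the row spans all start, and all end, in different columns."""
    starts = [row.index(1) for row in G]
    ends = [len(row) - 1 - row[::-1].index(1) for row in G]
    return len(set(starts)) == len(G) and len(set(ends)) == len(G)

def codewords(G):
    """Set of all codewords generated by the rows of G (arithmetic mod 2)."""
    words = set()
    for s in product((0, 1), repeat=len(G)):
        word = [0] * len(G[0])
        for bit, row in zip(s, G):
            if bit:
                word = [a ^ b for a, b in zip(word, row)]
        words.add(tuple(word))
    return words

G_systematic = [[1, 0, 0, 0, 1, 0, 1],
                [0, 1, 0, 0, 1, 1, 0],
                [0, 0, 1, 0, 1, 1, 1],
                [0, 0, 0, 1, 0, 1, 1]]
G_oriented = [[1, 1, 0, 1, 0, 0, 0],
              [0, 1, 0, 0, 1, 1, 0],
              [0, 0, 1, 1, 1, 0, 0],
              [0, 0, 0, 1, 0, 1, 1]]

print(is_trellis_oriented(G_systematic), is_trellis_oriented(G_oriented))
print(codewords(G_systematic) == codewords(G_oriented))
```

 The last line compares the two sets of sixteen codewords, which is one way to
 confirm that the row manipulations have not changed the code.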

 Now, each row of the generator matrix can be thought of 
 as defining an $(N,1)$ subcode of the $(N,K)$ code, that is, 
 in this case, a code with two codewords of length $N=7$.
 For the first row, the code consists of the two codewords
 $\tt 1 1 0 1 0 0 0$ and $\tt 0 0 0 0 0 0 0$. The subcode defined 
 by the second row consists of $\tt 0 1 0 0 1 1 0$ and $\tt 0 0 0 0 0 0 0$.
 It is easy to construct the minimal trellises of these subcodes; 
 they are shown in the left column of figure \ref{fig.tH74s}.

 We build the trellis incrementally as shown in
  figure \ref{fig.tH74s}. We start with the trellis corresponding 
 to the subcode given by the first row of the generator matrix.
 Then we add in one subcode at a time.
 The vertices within  the span of the new subcode are all duplicated.
 The edge symbols in the original trellis are left unchanged and the
 edge symbols in the second part of the trellis are flipped wherever
 the new subcode has a {\tt{1}} and otherwise left alone.
%
%
% MORE HERE!

\begin{figure}
\figuremargin{%
\vspace{-0.86in}
\begin{center}
\begin{tabular}{cl@{\hspace{-0.2in}}l}
\mbox{\psfig{figure=trellis/H74s/ps1.ps,angle=-90,width=2.13in}}& \\
 + \\[-1in]
\mbox{\psfig{figure=trellis/H74s/row2/ps.ps,angle=-90,width=2.13in}}&\raisebox{0.25in}{=}&
\mbox{\psfig{figure=trellis/H74s/ps2.ps,angle=-90,width=2.13in}}\\[0.1in]
 + \\[-0.64in]
\mbox{\psfig{figure=trellis/H74s/row3/ps.ps,angle=-90,width=2.13in}}&\raisebox{0.25in}{=}&
\mbox{\psfig{figure=trellis/H74s/ps3.ps,angle=-90,width=2.13in}}\\[0.6in]
 + \\[-1.1in]
\mbox{\psfig{figure=trellis/H74s/row4/ps.ps,angle=-90,width=2.13in}}&\raisebox{0.25in}{=}&
\mbox{\psfig{figure=trellis/H74s/ps.ps,angle=-90,width=2.13in}}\\[-0.1in]
\end{tabular}
\end{center}
}{%
\caption[a]{Trellises for four subcodes of the $(7,4)$ Hamming code
 (left column), 
 and the sequence of trellises that are made when constructing the 
 trellis for the $(7,4)$ Hamming code (right column).

 Each edge in a trellis is labelled by a zero (shown by a square)
 or a one (shown by a cross).}
\label{fig.tH74s}
}%
\end{figure}

 
 Another $(7,4)$ Hamming code can be generated by
\beq
\bG = \left[ \begin{array}{ccccccc}
1&1&1&0&0&0&0\\
0&1&1&1&1&0&0\\
0&0&1&0&1&1&0\\
0&0&0&1&1&1&1
\end{array}
\right] .
\label{eq.betterG74}
\eeq
 The $(7,4)$ Hamming code generated by this matrix differs by a permutation 
 of its bits from the code generated by the systematic matrix used 
 in  \chref{ch.one} and above.
%. This permutation has been chosen such that the 
%  parity-check matrix can be written thus:
 The  parity-check matrix  corresponding to this permutation is:
\beq
\bH = \left[ 
\begin{array}{ccccccc}
1&0&1&0&1&0&1\\
0&1&1&0&0&1&1\\
0&0&0&1&1&1&1
\end{array}
\right] .
\label{eq.betterH74}
\eeq
 The trellis obtained from the permuted
 matrix $\bG$ given in \eqref{eq.betterG74}
 is shown in \figref{fig.tH74}a. Notice that the number of
% edges and
 nodes in this trellis is smaller than the number of nodes in the
 previous trellis for the Hamming $(7,4)$ code in \figref{fig.trellis}c.
 We thus observe that {\em rearranging the order of the codeword bits can sometimes
 lead to smaller, simpler trellises.}
% kschischang
%\begin{figure}
%\figuremargin{%
\marginfig{\footnotesize%\small
\begin{center}
\begin{tabular}{*{1}{l@{\hspace{-0.5in}}l}}
\raisebox{0.5in}{(a)}&
\mbox{\psfig{figure=trellis/H74/ps.ps,angle=-90,width=2.13in}}
 \\
\raisebox{0.5in}{(b)}&
\mbox{\psfig{figure=trellis/H74H/ps.ps,angle=-90,width=2.13in}}\\
\end{tabular}
\end{center}
%}{%
\caption[a]{Trellises for the permuted $(7,4)$ Hamming code generated from 
(a) the generator matrix by the method
 of \figref{fig.tH74s}; (b) the parity-check matrix
 by the method on page \pageref{sec.pcm.page271}.

 Each edge in a trellis is labelled by a zero (shown by a square)
 or a one (shown by a cross).}
\label{fig.tH74}
}%
%\end{figure}


\subsection{Trellises from parity-check matrices}
\label{sec.pcm.page271}
 Another way of viewing the trellis is in terms of the syndrome. 
 The syndrome of a  vector $\br$ is defined to be $\bH \br$, 
 where $\bH$ is the parity-check matrix. A vector is a codeword 
 if and only if its syndrome is zero. As we generate a codeword 
 we can describe the current state by the {\dbf partial syndrome}, 
 that is, the product of $\bH$ with the codeword bits thus far generated. 
 Each  state in the trellis is a partial syndrome at one time 
 coordinate.
 The starting and ending states are both constrained to be the zero 
 syndrome. 
%
 Each node at a given time coordinate represents a different possible 
 value for the partial syndrome.
 Since $\bH$ is an $M\times N$ matrix, where $M=N-K$, the 
 syndrome is an $M$-bit vector. So we need at most 
 $2^M$ nodes at each time coordinate. 
 We can construct the trellis of a code from its parity-check 
 matrix by walking from each end, generating two trees of possible 
 syndrome sequences. The intersection of these two trees defines
 the trellis of the code. 

 In the pictures we obtain  from this construction, we can let the 
 vertical coordinate represent the syndrome. Then any horizontal edge 
 is necessarily associated with a zero bit (since only a non-zero bit 
 changes the syndrome) and any non-horizontal edge is associated with 
 a one bit.
 (Thus in this representation
 we no longer need to label the edges in the trellis.)
%
% these are done by RMtest into the directory code/lt
%
%
% see also bin/G2T.p and mutate.p
%
  \Figref{fig.tH74}b shows the trellis corresponding to the parity-check
 matrix of \eqref{eq.betterH74}. 
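
 The construction by intersecting a forward walk and a backward walk can be
 sketched in a few lines. This is an illustration only: the function name
 \verb|syndrome_trellis_profile| and the encoding of a partial syndrome as an
 integer are choices made for this sketch.

```python
def syndrome_trellis_profile(H):
    """Number of nodes at each time 0..N in the syndrome trellis of H.

    A partial syndrome is stored as an integer whose bit m is its m-th entry.
    """
    M, N = len(H), len(H[0])
    columns = [sum(H[m][n] << m for m in range(M)) for n in range(N)]

    # Walk forward from the zero syndrome at time 0.
    forward = [{0}]
    for col in columns:
        forward.append({s ^ (t * col) for s in forward[-1] for t in (0, 1)})

    # Walk backward from the zero syndrome at time N.
    backward = [{0}]
    for col in reversed(columns):
        backward.append({s ^ (t * col) for s in backward[-1] for t in (0, 1)})
    backward.reverse()

    # Keep the partial syndromes that lie on some path from start to end.
    return [len(f & b) for f, b in zip(forward, backward)]

H = [[1, 0, 1, 0, 1, 0, 1],
     [0, 1, 1, 0, 0, 1, 1],
     [0, 0, 0, 1, 1, 1, 1]]
profile = syndrome_trellis_profile(H)
print(profile, sum(profile))   # nodes per time coordinate, and their total
```

 No entry of the profile can exceed $2^M = 8$ for this matrix, in line with
 the bound stated above.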
% \Figref{fig.tRM16} shows the trellises of some slightly larger codes.
%
% restore RM material and GF4?
%
\fakesection{Is this label roughly right?}
\label{sec.two.to.M.trellis}


% \section{Solutions} are in _sexact
%MNBV\newpage
%\newpage

\dvips
\section{Solutions}% to Chapter \protect\ref{ch.exact}'s exercises} % 
\begin{table}[hbtp]
\figuremargin{
\[%beq
\begin{tabular}{clll} \toprule
$\bt$ & \multicolumn{1}{c}{Likelihood } &  \multicolumn{2}{c}{Posterior probability} \\ \midrule
%
\tt 0000000 & 0.026   & 0.3006      & \raisebox{2mm}{\framebox[0.301in]{}}  \\ 
\tt 0001011 & 0.00041  & 0.0047      & \raisebox{2mm}{\framebox[0.005in]{}}  \\ 
\tt 0010111 & 0.0037  & 0.0423      & \raisebox{2mm}{\framebox[0.042in]{}}  \\ 
\tt 0011100 & 0.015   & 0.1691      & \raisebox{2mm}{\framebox[0.169in]{}}  \\ 
\tt 0100110 & 0.00041  & 0.0047      & \raisebox{2mm}{\framebox[0.005in]{}}  \\ 
\tt 0101101 & 0.00010  & 0.0012      & \raisebox{2mm}{\framebox[0.001in]{}}  \\ 
\tt 0110001 & 0.015   & 0.1691      & \raisebox{2mm}{\framebox[0.169in]{}}  \\ 
\tt 0111010 & 0.0037  & 0.0423      & \raisebox{2mm}{\framebox[0.042in]{}}  \\ 
\tt 1000101 & 0.00041  & 0.0047      & \raisebox{2mm}{\framebox[0.005in]{}}  \\ 
\tt 1001110 & 0.00010  & 0.0012      & \raisebox{2mm}{\framebox[0.001in]{}}  \\ 
\tt 1010010 & 0.015   & 0.1691      & \raisebox{2mm}{\framebox[0.169in]{}}  \\ 
\tt 1011001 & 0.0037  & 0.0423      & \raisebox{2mm}{\framebox[0.042in]{}}  \\ 
\tt 1100011 & 0.00010  & 0.0012      & \raisebox{2mm}{\framebox[0.001in]{}}  \\ 
\tt 1101000 & 0.00041  & 0.0047      & \raisebox{2mm}{\framebox[0.005in]{}}  \\ 
\tt 1110100 & 0.0037  & 0.0423      & \raisebox{2mm}{\framebox[0.042in]{}}  \\ 
\tt 1111111 & 0.000058  & 0.0007      & \raisebox{2mm}{\framebox[0.001in]{}}  \\
\bottomrule 
\end{tabular}
\]%eeq
}{
\caption[a]{
 The posterior probability over codewords for \protect\exerciseonlyref{ex.H74.hinoise}.
}
\label{tab.74hipost}
}
\end{table}
\soln{ex.H74.hinoise}{
 The posterior probability over
 codewords is shown in \tabref{tab.74hipost}.
 The most probable codeword is {\tt 0000000}.
 The marginal posterior probabilities of
 all seven bits are:
% marginals
\[%beq
\begin{array}{cccllll}\toprule
 n & \multicolumn{2}{c}{\mbox{Likelihood}} & \multicolumn{4}{c}{\mbox{Posterior marginals}} \\
  & P(y_n\given t_n\eq {\tt 1}) & P(y_n\given t_n\eq {\tt 0}) & 
 \multicolumn{2}{c}{P(t_n\eq {\tt 1} \given  \by)} & \multicolumn{2}{c}{P(t_n\eq {\tt 0} \given  \by)} \\ \midrule
% marginals
 1 & 0.2       & 0.8      & 0.266    & \raisebox{2mm}{\framebox[0.266in]{}}  & 0.734    & \raisebox{2mm}{\framebox[0.734in]{}}  \\
 2 & 0.2       & 0.8      & 0.266    & \raisebox{2mm}{\framebox[0.266in]{}}  & 0.734    & \raisebox{2mm}{\framebox[0.734in]{}}  \\
 3 & 0.9       & 0.1      & 0.677    & \raisebox{2mm}{\framebox[0.677in]{}}  & 0.323    & \raisebox{2mm}{\framebox[0.323in]{}}  \\
 4 & 0.2       & 0.8      & 0.266    & \raisebox{2mm}{\framebox[0.266in]{}}  & 0.734    & \raisebox{2mm}{\framebox[0.734in]{}}  \\
 5 & 0.2       & 0.8      & 0.266    & \raisebox{2mm}{\framebox[0.266in]{}}  & 0.734    & \raisebox{2mm}{\framebox[0.734in]{}}  \\
 6 & 0.2       & 0.8      & 0.266    & \raisebox{2mm}{\framebox[0.266in]{}}  & 0.734    & \raisebox{2mm}{\framebox[0.734in]{}}  \\
 7 & 0.2       & 0.8      & 0.266    & \raisebox{2mm}{\framebox[0.266in]{}}  & 0.734    & \raisebox{2mm}{\framebox[0.734in]{}} \\ \bottomrule
\end{array}
\]%eeq
 So the bitwise decoding is {\tt 0010000}, which is not actually a
 codeword.
}

\soln{ex.sumprodf}{
 The MAP codeword is {\tt{101}}, and its likelihood
 is $1/8$. The normalizing constant of the sum--product algorithm
 is $Z = \alpha_I = \dfrac{3}{16}$.
 The intermediate $\alpha_i$ are (from left to right)
 $\dhalf$, $\dquarter$, $\dfrac{5}{16}$, $\dfrac{4}{16}$;
 the intermediate $\beta_i$ are (from right to left)
 $\dhalf$, $\deighth$, $\dfrac{9}{32}$, $\dfrac{3}{16}$.
 The bitwise decoding is:
 $P(t_1 \eq  1 \given   \by) = 3/4$;
 $P(t_2 \eq  1 \given   \by) = 1/4$;
 $P(t_3 \eq  1 \given   \by) = 5/6$.
 The codewords' probabilities are
 \dfrac{1}{12},  \dfrac{2}{12},  \dfrac{1}{12},  \dfrac{8}{12}
 for {\tt{000}}, {\tt{011}},  {\tt{110}}, {\tt{101}}.
}
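
 The enumeration over all four codewords is short enough to script. The
 following sketch uses the likelihoods of \tabref{tab.sumprodf}; the variable
 names are assumptions made here, and exact fractions are used so that the
 output can be compared directly with the values quoted in the solution.

```python
from fractions import Fraction

likelihood = [(Fraction(1, 4), Fraction(1, 2)),   # n = 1: P(y|t=0), P(y|t=1)
              (Fraction(1, 2), Fraction(1, 4)),   # n = 2
              (Fraction(1, 8), Fraction(1, 2))]   # n = 3
codewords = ["000", "011", "110", "101"]

def word_likelihood(word):
    """Product of the bitwise likelihoods of one codeword."""
    p = Fraction(1)
    for n, bit in enumerate(word):
        p *= likelihood[n][int(bit)]
    return p

joint = {w: word_likelihood(w) for w in codewords}
Z = sum(joint.values())                           # sum over codewords
posterior = {w: p / Z for w, p in joint.items()}
bitwise = [sum(posterior[w] for w in codewords if w[n] == "1")
           for n in range(3)]                     # P(t_n = 1 | y)
print(Z, posterior, bitwise)
```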


\dvipsb{solutions exact}  
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% \prechapter{About               Chapter}
\chapter{Exact Marginalization in Graphs}
\label{ch.belief.propagation}
\label{ch.sumproduct}
\label{ch.factorgraphs}
\index{sum--product algorithm}\index{factor graph}\index{graph!factor graph}\index{algorithm!sum--product}
\label{sec.sumproduct}
 We now take a more general view of the tasks of inference
 and marginalization.
 Before reading this chapter, you should read  about message passing in \chref{ch.message}.

% \newcommand{\gP}{P^*} in itprnnchapter.tex
\section{The general problem}
 Assume that a function $\gP$ of a set of $N$ variables $\bx \equiv \{ x_n \}_{n=1}^{N}$
 is defined as a product of $M$ {\dem{factors}\/} as follows:
\beq
	\gP(\bx) =
% \frac{1}{Z}
	\prod_{m=1}^M	f_m( \bx_m ) .
\label{eq.factorfunction}
\eeq
% Each of the factors $\phi_n(x_n)$ is a function of only one of the variables.
 Each of the factors $f_m( \bx_{m} )$ is a function of a subset $\bx_{m}$ of the
 variables that make up $\bx$.
 If $\gP$ is a positive function then we may be interested in
 a second normalized function,
\beq
	P(\bx) \equiv  \smallfrac{1}{Z}	\gP(\bx) =
 \smallfrac{1}{Z}
	\prod_{m=1}^M	f_m( \bx_m ) ,
\label{eq.factorfunctionZ}
\eeq
 where the normalizing constant $Z$ is defined
 by
\beq
	Z = \sum_{\bx}
			\prod_{m=1}^M	f_m( \bx_m ) .
\eeq

 As an example of the notation we've just introduced,
 here's a function of three binary variables $x_1$, $x_2$, $x_3$  defined   by
 the five factors:
% ($N=3$, $M=2$):
\beq
\begin{array}{rcl}
  f_1 (x_1) &=& \left\{ \begin{array}{cl} 0.1 & x_1 \eq  0 \\  0.9 & x_1 \eq  1 \end{array}\right. \\
  f_2 (x_2) &=& \left\{ \begin{array}{cl} 0.1 & x_2 \eq  0 \\  0.9 & x_2 \eq  1 \end{array}\right. \\
  f_3 (x_3) &=& \left\{ \begin{array}{cl} 0.9 & x_3 \eq  0 \\  0.1 & x_3 \eq  1 \end{array}\right. \\
  f_4 (x_1,x_2) &=& \left\{ \begin{array}{cl} 1 & (x_1,x_2) \eq  (0,0) \:\:\mbox{or}\:\: (1,1) \\
  		 0 & (x_1,x_2) \eq  (1,0) \:\:\mbox{or}\:\: (0,1)  \end{array}\right. \\
  f_5 (x_2,x_3) &=&  \left\{ \begin{array}{cl} 1 & (x_2,x_3) \eq  (0,0) \:\:\mbox{or}\:\: (1,1) \\
		 0 & (x_2,x_3) \eq  (1,0) \:\:\mbox{or}\:\: (0,1)  \end{array}\right.
\\[0.15in]
 \gP(\bx)& =& 
 f_1 (x_1)
 f_2 (x_2)
 f_3 (x_3)
 f_4 (x_1,x_2)
 f_5 (x_2,x_3) \\
 P(\bx)& =& \displaystyle \smallfrac{1}{Z}
 f_1 (x_1)
 f_2 (x_2)
 f_3 (x_3)
 f_4 (x_1,x_2)
 f_5 (x_2,x_3) .
\end{array}
\label{eq.r3factors}
\eeq
 The five subsets of $\{ x_1,x_2,x_3 \}$ denoted by $\bx_m$ in the
general function (\ref{eq.factorfunction})
 are here 
 $\bx_1 = \{x_1\}$,
 $\bx_2 = \{x_2\}$,
 $\bx_3 = \{x_3\}$,
 $\bx_4 = \{x_1,x_2\}$,
 and 
 $\bx_5 = \{x_2,x_3\}$.

 The function $P(\bx)$, by the way, may be recognized as the posterior probability
 distribution of the three transmitted bits in a repetition code (\sectionref{sec.r3})
 when the received signal is $\br = ( {\tt 1} , {\tt 1} , {\tt 0} )$
 and the channel is a binary symmetric channel with flip probability 0.1.
 The  factors $f_4$ and $f_5$ respectively enforce the
 constraints that $x_1$ and $x_2$ must be identical and that
 $x_2$ and $x_3$ must be identical.
 The factors $f_1$, $f_2$, $f_3$ are the likelihood functions  contributed
 by each component of $\br$.
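
 As a check on this interpretation (a worked example added here, using only
 the factor values given above), note that $f_4$ and $f_5$ vanish unless
 $x_1 = x_2 = x_3$, so only two configurations contribute to the sum
 that defines $Z$:
\beqan
 \gP(0,0,0) &=& 0.1 \times 0.1 \times 0.9 \:=\: 0.009  \nonumber \\
 \gP(1,1,1) &=& 0.9 \times 0.9 \times 0.1 \:=\: 0.081 .
\eeqan
 Hence $Z = 0.009 + 0.081 = 0.09$, and the normalized function takes the
 values $P(1,1,1) = 0.081/0.09 = 0.9$ and $P(0,0,0) = 0.1$, with all other
 configurations having probability zero.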




 A function of the factored form (\ref{eq.factorfunction})
 can be depicted by a {\dem\ind{factor graph}},
 in which the variables are depicted by circular nodes
 and the
% shared
 factors are depicted by square nodes.
 An edge is put between  variable node $n$ and factor node $m$
% if $n \in \Nm$, that is, if the function $\psi_m(\bx)$ has
 if the function $f_m(\bx_m)$ has
 any dependence on variable $x_n$.
%
 The factor graph for the example function  (\ref{eq.r3factors}) is shown 
 in \figref{fig.r3.graph}.
\amarginfig{b}{
\begin{center}{
\setlength{\unitlength}{0.477mm}
\begin{picture}(101,34)(-54,-7)
\put(0,15){\line(1,-1){10}}
\put(11,5){\line(1,1){10}}
\put(22,15){\line(1,-1){10}}
\put(33,5){\line(1,1){10}}
\put(0,25){\makebox(0,0)[c]{$x_1$}}
\put(22,25){\makebox(0,0)[c]{$x_2$}}
\put(44,25){\makebox(0,0)[c]{$x_3$}}
\multiput(0,18)(21.5,0){3}{\circle{6}}
\multiput(-1,15)(21.5,0){3}{\line(-5,-1){50}}
% five boxes
\multiput(-57,5)(22,0){5}{\line(1,0){5}}
\multiput(-57,5)(22,0){5}{\line(0,-1){5}}
\multiput(-57,0)(22,0){5}{\line(1,0){5}}
\multiput(-52,0)(22,0){5}{\line(0,1){5}}
\put(-54,-5.4){\makebox(0,0)[c]{$f_1$}}%(x_1)
\put(-32,-5.4){\makebox(0,0)[c]{$f_2$}}
\put(-10,-5.4){\makebox(0,0)[c]{$f_3$}}
\put(12,-5.4){\makebox(0,0)[c]{$f_4$}}% (x_1,x_2)
\put(34,-5.4){\makebox(0,0)[c]{$f_5$}}% (x_2,x_3)
\end{picture}}
\end{center}

%}{%
\caption[a]{The factor graph associated with the function
% defined in
 $\gP(\bx)$
 (\ref{eq.r3factors}).}
\label{fig.r3.graph}
}% end marginfig


\subsection{The normalization problem}
 The first  task to be solved is
 to compute the normalizing constant $Z$.

\subsection{The marginalization problems}
 The second  task to be solved is
 to compute the marginal function
%\marginpar{\footnotesize{We use the term
% marginal function rather than marginal distribution because
% in what follows we do not need to constrain $f$ to be a probability distribution.}}
 of any  variable $x_n$, defined by
\beq
	Z_n(x_n) = \sum_{ \{ x_{n'} \} , \, n' \neq n} \gP (\bx) .
\eeq

 For example, if $\gP$ is a function of three variables then
 the marginal for $n=1$ is defined by
\beq
	Z_1(x_1) = \sum_{x_2,x_3} \gP(x_1,x_2,x_3) .
\eeq
 This type of summation, over `all the $x_{n'}$ except for $x_n\!$'
 is so important that
 it can be useful to have a special notation for
 it -- the `\ind{not-sum}' or `\ind{summary}'.
%,
%\beq
%	f_1(x_1) = \sum_{\tilde x_1}  f(x_1,x_2,x_3) \equiv
%		\sum_{x_2,x_3} f(x_1,x_2,x_3) .
%\eeq
% The marginal function  $f_n(x_n)$ can be called
% `the summary for $x_n$ of $f$'.

 The third task to be solved is to compute
 the normalized marginal  of any  variable $x_n$, defined by
\beq
	P_n(x_n) \equiv \sum_{ \{ x_{n'} \} , \, n' \neq n} P (\bx) .
\eeq
 [We  include the suffix `$n$' in $P_n(x_n)$, 
 departing from our normal practice in the rest of the book,
 where  we would omit it.]
\exercisxB{1}{ex.normmarg}{
 Show that the normalized marginal is related to the
 marginal $Z_n(x_n)$ by 
\beq
	P_n(x_n) = \frac{ Z_n(x_n) }{ Z } .
\eeq
}

 We might also be interested in marginals over
 a subset of the variables, such
 as
\beq
	Z_{12}(x_1,x_2)
% \equiv \sum_{\tilde \{ x_1,x_2 \} }  \gP (x_1,x_2,x_3)
                   \equiv  \sum_{x_3}  \gP (x_1,x_2,x_3) .
\eeq
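
 The definitions of $Z$, $Z_n(x_n)$ and $P_n(x_n)$ translate directly into a
 brute-force sum over all configurations. The sketch below is illustrative
 only: the function name \verb|brute_force|, the representation of a factor as
 a pair (indices, function), and the use of zero-based indices are assumptions
 made here. The example factors are those of (\ref{eq.r3factors}).

```python
from itertools import product

def brute_force(factors, N, values=(0, 1)):
    """Return Z and the marginal functions Z_n by summing over every x.

    factors: list of (indices, f), where f is called with the values of the
    variables listed in indices (0-based).
    The loop visits len(values) ** N configurations.
    """
    Z = 0.0
    Zn = [{v: 0.0 for v in values} for _ in range(N)]
    for x in product(values, repeat=N):
        p_star = 1.0
        for indices, f in factors:
            p_star *= f(*(x[i] for i in indices))
        Z += p_star
        for n in range(N):
            Zn[n][x[n]] += p_star      # accumulate the 'not-sum' for x_n
    return Z, Zn

factors = [((0,), lambda x1: (0.1, 0.9)[x1]),
           ((1,), lambda x2: (0.1, 0.9)[x2]),
           ((2,), lambda x3: (0.9, 0.1)[x3]),
           ((0, 1), lambda x1, x2: float(x1 == x2)),
           ((1, 2), lambda x2, x3: float(x2 == x3))]

Z, Zn = brute_force(factors, N=3)
Pn = [{v: z / Z for v, z in marg.items()} for marg in Zn]
print(Z, Pn)
```

 The number of terms in the outer loop doubles with every additional binary
 variable, which is why this direct approach is only usable for small $N$.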


 All these tasks are intractable in general.
 Even if every
% shared
 factor is a function of
 only three variables, the cost of computing
 exact solutions for $Z$ and for the marginals
 is believed in general to grow exponentially
 with the number of variables $N$.

 For certain  functions $\gP$, however,
 the marginals can be computed efficiently
 by exploiting the factorization of $\gP$.
 The idea of how this efficiency arises  is
 well illustrated by the message-passing examples
 of \chref{ch.message}.  
 The sum--product algorithm that we
 now review is a generalization of message-passing
 rule-set B (\pref{sec.messageBtree}).
 As was the case there, the sum--product algorithm
 is only valid if the graph is \ind{tree}-like.

\section{The sum--product algorithm}
\subsection{Notation}
 We identify the set of variables that the $m$th factor depends on, $\bx_m$,
 by
% defining
 the set of their indices $\Nm$.
 For our example function (\ref{eq.r3factors}),
 the sets are $\N(1) = \{ 1 \}$ (since
 $f_1$ is a function of  $x_1$ alone),
 $\N(2) = \{ 2 \}$, 
 $\N(3) = \{ 3 \}$, 
 $\N(4) = \{ 1,2 \}$, and
 $\N(5) = \{ 2,3 \}$.
% the sets are $\N(1) = \{ 1,2 \}$ (since
% $\psi_1(\bx)$ depends on $x_1$ and $x_2$)
% and $\N(2) = \{ 2,3 \}$.
% This lets us use the notation $\psi_m( \{ x_n \}_{n \in \Nm} )$.
% note the set of variables $n$ that participate in shared factor $m$ by $\Nm \equiv \{ n :  \}$.
 Similarly we define the set of
% shared
 factors in which variable $n$
 participates, by $\Mn$. We
 denote a set $\Nm$ with variable $n$ excluded by $\Nm\wo n$.
 We introduce the shorthand  \xmwon\ or \xmwonb\  to denote
 the set of variables in $\bx_m$ with $x_n$ excluded,
 \ie,
\beq
	\xmwon \equiv \{ x_{n'} \! : n' \in \Nm \wo n \}  .
\eeq

 The sum--product algorithm will involve
 messages of two types passing along the edges in the
 factor graph: messages $q_{n \rightarrow m}$ from
 variable nodes to factor nodes,
 and  messages $r_{m \rightarrow n}$ from
 factor nodes to  variable nodes.
 A message (of either type, $q$ or $r$)
 that is sent along an edge connecting factor $f_m$
 to variable $x_n$ is always a function of the variable $x_n$.

 Here are the two rules for the updating of the two sets of messages.\indexs{sum--product algorithm}\index{message passing!sum--product algorithm}\index{factor graph}
\medskip

\noindent
\begin{framedalgorithm}
\begin{description}
\item[From variable to factor:]
\beq
 q_{n \rightarrow m}(x_n) = \prod_{m' \in \Mn\wo m}  r_{m' \rightarrow n}(x_n)  .
\label{eq.spq}
\eeq
\item[From factor to variable:]
\beq
 r_{m \rightarrow n}(x_n) = \sum_{\xmwon}
 \left( f_m( \bx_m) \prod_{ n' \in \Nm \wo n }  q_{n' \rightarrow m}(x_{n'})
 \right) .
\label{eq.spr}
\eeq
\end{description}
\end{framedalgorithm}
\subsection{How these rules apply to leaves in the factor graph}
 A%
\amarginfig{b}{\begin{center}\mbox{\epsfbox{metapost/sumproduct.1}}\end{center}
\caption[a]{A factor node that is a leaf node
 perpetually sends the message
 $r_{m \rightarrow n}(x_n) =  f_m( x_n)$ to its one neighbour $x_n$.}}
 node that has only one edge connecting it to another node is called a \ind{leaf} node.

 Some factor nodes in the graph may be connected to only one variable node,
 in which case the set $\Nm \wo n$ of variables appearing  in the factor
 message update (\ref{eq.spr}) is an empty set, and the product of
 functions $\prod_{ n' \in \Nm \wo n }  q_{n' \rightarrow m}(x_{n'})$
 is the empty product, whose value is 1.
 Such a factor node therefore always broadcasts to its one neighbour $x_n$ the message
 $r_{m \rightarrow n}(x_n) =  f_m( x_n)$.

 Similarly, there may be variable nodes that are connected to
 only one factor node, so the set $\Mn\wo m$ in (\ref{eq.spq}) is empty.
 These nodes perpetually
 broadcast the message $q_{n \rightarrow m}(x_n) = 1$.%
\amarginfig{b}{\begin{center}\mbox{\epsfbox{metapost/sumproduct.2}}\end{center}
\caption[a]{A variable node that is a leaf node perpetually
 sends the message $q_{n \rightarrow m}(x_n) = 1$.}}

% We call nodes that have only one edge connecting them to another node
% `leaf nodes'.

\subsection{Starting and finishing, method 1}
 The algorithm can be initialized in two ways.
 If the graph is tree-like then it must have nodes that are leaves.
 These leaf nodes can broadcast their messages to their
 respective neighbours from the start.
\beqan
 \mbox{For all {leaf\/} variable nodes $n$:}&&	q_{n \rightarrow m}(x_n) = 1  \\
 \mbox{For all {leaf\/} factor nodes $m$:}&&	r_{m \rightarrow n}(x_n) = f_m( x_n) .
\eeqan
 We can then adopt the procedure used in
  \chref{ch.message}'s message-passing
 rule-set B (\pref{sec.messageBtree}):
  a message is  created in accordance with the rules (\ref{eq.spq}, \ref{eq.spr})
 only  if all the messages on which it depends are present.
\amarginfig{t}{
\begin{center}{
\setlength{\unitlength}{0.477mm}
\begin{picture}(101,34)(-54,-7)
\put(0,15){\line(1,-1){10}}
\put(11,5){\line(1,1){10}}
\put(22,15){\line(1,-1){10}}
\put(33,5){\line(1,1){10}}
\put(0,25){\makebox(0,0)[c]{$x_1$}}
\put(22,25){\makebox(0,0)[c]{$x_2$}}
\put(44,25){\makebox(0,0)[c]{$x_3$}}
\multiput(0,18)(21.5,0){3}{\circle{6}}
\multiput(-1,15)(21.5,0){3}{\line(-5,-1){50}}
% five boxes
\multiput(-57,5)(22,0){5}{\line(1,0){5}}
\multiput(-57,5)(22,0){5}{\line(0,-1){5}}
\multiput(-57,0)(22,0){5}{\line(1,0){5}}
\multiput(-52,0)(22,0){5}{\line(0,1){5}}
\put(-54,-5.4){\makebox(0,0)[c]{$f_1$}}%(x_1)
\put(-32,-5.4){\makebox(0,0)[c]{$f_2$}}
\put(-10,-5.4){\makebox(0,0)[c]{$f_3$}}
\put(12,-5.4){\makebox(0,0)[c]{$f_4$}}% (x_1,x_2)
\put(34,-5.4){\makebox(0,0)[c]{$f_5$}}% (x_2,x_3)
\end{picture}}
\end{center}

%}{%
\caption[a]{Our model factor graph for the function
% defined in
 $\gP(\bx)$
 (\ref{eq.r3factors}).}
\label{fig.r3.graphagain}
}% end marginfig
 For example,  in \figref{fig.r3.graphagain}, the message from $x_1$ to $f_1$
 will  be sent only when the message from $f_4$ to $x_1$ has been received;
 and the message from $x_2$ to $f_2$, $q_{2 \rightarrow 2}$,
 can  be sent only when the messages
 $r_{4 \rightarrow 2}$ and 
 $r_{5 \rightarrow 2}$ have both been received.

 Messages will thus flow through the tree, one in each direction along every edge,
 and after a number of
 steps equal to the diameter of the graph,
 every message will have been created.

 The answers we require can then be read out. The marginal
 function of $x_n$ is obtained by multiplying all the incoming messages
 at that node.
\beq
	Z_n(x_n) = \prod_{m \in \Mn} r_{m \rightarrow n}(x_n)  .
\eeq

 The normalizing constant $Z$ can be obtained by summing any marginal function,
 $Z = \sum_{x_n} Z_n(x_n)$, and the normalized marginals obtained from
\beq
 P_n(x_n) = \frac{ Z_n(x_n) }{ Z } .
\eeq
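
 For a tree-like graph, the two update rules and the read-out of the marginals
 can be sketched recursively: asking for a message triggers the computation of
 the messages on which it depends, which stands in for the schedule described
 above. This is a minimal sketch under stated assumptions; the function name
 \verb|sum_product_tree|, the factor representation, and the zero-based
 indices are hypothetical, and the recursion would not terminate on a graph
 with cycles.

```python
from functools import lru_cache
from itertools import product

def sum_product_tree(factors, N, values=(0, 1)):
    """Marginal functions Z_n(x_n) for a tree-like factor graph.

    factors: list of (indices, f); f is called with the values of the
    variables listed in indices (0-based).
    """
    # M_of[n]: the factors in which variable n participates.
    M_of = [[m for m, (idx, _) in enumerate(factors) if n in idx]
            for n in range(N)]

    @lru_cache(maxsize=None)
    def q(n, m, xn):
        """Variable-to-factor message: product of r over the other factors."""
        out = 1.0
        for m2 in M_of[n]:
            if m2 != m:
                out *= r(m2, n, xn)
        return out

    @lru_cache(maxsize=None)
    def r(m, n, xn):
        """Factor-to-variable message: sum over the other variables of f_m."""
        idx, f = factors[m]
        others = [k for k in idx if k != n]
        total = 0.0
        for assignment in product(values, repeat=len(others)):
            x = dict(zip(others, assignment))
            x[n] = xn
            term = f(*(x[k] for k in idx))
            for k in others:
                term *= q(k, m, x[k])
            total += term
        return total

    # Excluding no factor (m = None) multiplies all incoming messages.
    return [{v: q(n, None, v) for v in values} for n in range(N)]

factors = [((0,), lambda x1: (0.1, 0.9)[x1]),
           ((1,), lambda x2: (0.1, 0.9)[x2]),
           ((2,), lambda x3: (0.9, 0.1)[x3]),
           ((0, 1), lambda x1, x2: float(x1 == x2)),
           ((1, 2), lambda x2, x3: float(x2 == x3))]

Zn = sum_product_tree(factors, N=3)
Z = sum(Zn[0].values())
print(Z, [marg[1] / Z for marg in Zn])
```

 Summing any one of the returned marginal functions should give the same
 value of $Z$, which is a useful test of an implementation.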

\exercisxB{2}{ex.spforr3}{
 Apply the sum--product algorithm to the function
 defined in \eqref{eq.r3factors} and  \figref{fig.r3.graph}.
 Check that the normalized marginals are consistent with what you know
 about the repetition code $R_3$.
}
\exercisxC{3}{ex.sppf}{
 Prove that the sum--product algorithm correctly
 computes the marginal functions $Z_n(x_n)$ if the graph is tree-like.
}
\exercisxC{3}{ex.sppf2}{
 Describe how to use the messages computed by the sum--product algorithm
 to obtain more complicated marginal functions in a tree-like graph, for example
 $Z_{1,2}(x_1,x_2)$, for two variables $x_1$ and $x_2$ that are
 connected to one common factor node.
}

\subsection{Starting and finishing, method 2}
 Alternatively, the algorithm can be initialized by setting
 {\em all\/} the initial messages from variables to 1:
\beq
 \mbox{for all $n$, $m$:}\:\:\:	 q_{n \rightarrow m}(x_n)  = 1 , 
\eeq
 then proceeding with the factor message update rule (\ref{eq.spr}),
 alternating with the variable message update rule (\ref{eq.spq}).
 Compared with method 1, this
 lazy initialization method leads to a load of wasted computations,
 whose results are gradually flushed out  by the correct answers
 computed by method 1.
 
 After a number of iterations equal to the diameter of the
 factor graph, the algorithm will converge to a set of messages satisfying
 the sum--product relationships (\ref{eq.spq}, \ref{eq.spr}).
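
 A sketch of this second method is given below. It is an illustration under
 stated assumptions: the function name \verb|flood|, the factor representation,
 and the convention that one iteration consists of all factor updates followed
 by all variable updates are choices made here.

```python
from itertools import product

def flood(factors, N, iterations, values=(0, 1)):
    """Method 2: set every q to 1, then alternate the two update rules."""
    edges = [(m, n) for m, (idx, _) in enumerate(factors) for n in idx]
    q = {(n, m): {v: 1.0 for v in values} for m, n in edges}
    r = {}
    for _ in range(iterations):
        # Factor-to-variable updates, using the current q messages.
        for m, n in edges:
            idx, f = factors[m]
            others = [k for k in idx if k != n]
            r[m, n] = {}
            for xn in values:
                total = 0.0
                for assignment in product(values, repeat=len(others)):
                    x = dict(zip(others, assignment))
                    x[n] = xn
                    term = f(*(x[k] for k in idx))
                    for k in others:
                        term *= q[k, m][x[k]]
                    total += term
                r[m, n][xn] = total
        # Variable-to-factor updates, using the r messages just computed.
        for m, n in edges:
            for xn in values:
                out = 1.0
                for m2, n2 in edges:
                    if n2 == n and m2 != m:
                        out *= r[m2, n][xn]
                q[n, m][xn] = out
    # Read out the marginal functions from the incoming r messages.
    Zn = [{v: 1.0 for v in values} for _ in range(N)]
    for m, n in edges:
        for v in values:
            Zn[n][v] *= r[m, n][v]
    return Zn

factors = [((0,), lambda x1: (0.1, 0.9)[x1]),
           ((1,), lambda x2: (0.1, 0.9)[x2]),
           ((2,), lambda x3: (0.9, 0.1)[x3]),
           ((0, 1), lambda x1, x2: float(x1 == x2)),
           ((1, 2), lambda x2, x3: float(x2 == x3))]

for iterations in (1, 2, 3, 4):
    print(iterations, flood(factors, 3, iterations))
```

 With the example function (\ref{eq.r3factors}) the printed marginal functions
 stop changing once the early, incorrect messages have been overwritten; under
 the iteration convention assumed here, three iterations suffice.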
\exercisxC{2}{ex.spforr3again}{
 Apply this second version of the sum--product algorithm to the function
 defined in \eqref{eq.r3factors} and  \figref{fig.r3.graph}.
}

 The reason for introducing this lazy method is that (unlike method 1) it can  be applied
 to graphs that are not tree-like.\index{loopy message-passing}\index{message passing!loopy}\index{message passing!in graphs with cycles}
 When the sum--product algorithm is run on a graph with cycles,
 the  algorithm
 does not necessarily converge, and certainly does not in general
 compute the correct marginal functions; but it is nevertheless an
 algorithm of great practical importance, especially  in the decoding of
 \ind{sparse-graph code}s.

\subsection{Sum--product algorithm with on-the-fly normalization}
 If  we are  interested in only the {\em normalized\/} marginals,
 then another version of the sum--product algorithm may be useful.
 The factor-to-variable messages $r_{m \rightarrow n}$ are computed
 in just the same way (\ref{eq.spr}), but the
 variable-to-factor messages are normalized thus:
\beq
 q_{n \rightarrow m}(x_n) = \alpha_{nm} \prod_{m' \in \Mn\wo m}  r_{m' \rightarrow n}(x_n)
\label{eq.spqn}
\eeq
 where $\alpha_{nm}$ is a scalar chosen such that
\beq
	\sum_{x_n}   q_{n \rightarrow m}(x_n) =  1 .
\eeq
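 For instance (a worked example added here, using the factors of
 (\ref{eq.r3factors})), the unnormalized message from $x_2$ to $f_4$ is the
 product of $r_{2 \rightarrow 2}(x_2) = f_2(x_2)$ and
 $r_{5 \rightarrow 2}(x_2) = \sum_{x_3} f_5(x_2,x_3) f_3(x_3) = f_3(x_2)$,
 which takes the value $0.1 \times 0.9 = 0.09$ at $x_2 \eq 0$ and
 $0.9 \times 0.1 = 0.09$ at $x_2 \eq 1$. Choosing
\beq
	\alpha_{24} = \frac{1}{0.09 + 0.09}
\eeq
 gives the normalized message
 $q_{2 \rightarrow 4}(0) = q_{2 \rightarrow 4}(1) = 0.5$.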
\exercisxC{2}{ex.spforr3againagain}{
 Apply this normalized version of the sum--product algorithm to the function
 defined in \eqref{eq.r3factors} and  \figref{fig.r3.graph}.
}

\subsection{A factorization view of the sum--product algorithm}
 One way to view the sum--product algorithm is that it reexpresses
 the original factored function, the product of $M$ factors
 $	\gP(\bx) =
	\prod_{m=1}^M	f_m( \bx_m )$,
 as  another factored function which is the product
 of $M+N$ factors,
\beq
	\gP(\bx) =
	\prod_{m=1}^M	\phi_m( \bx_m ) 
	\prod_{n=1}^N	\psi_n( x_n ) .
\label{eq.factorfunctionphipsi}
\eeq
 Each factor $\phi_m$ is associated with a factor node $m$,
 and each factor $\psi_n(x_n)$ is associated with a variable node $n$.
 Initially $\phi_m(\bx_m) = f_m(\bx_m)$ and  $\psi_n(x_n)=1$. 

 Each time
 a factor-to-variable  message $r_{m\rightarrow n}(x_n)$ is
 sent, the factorization is updated thus:
\beq
 \psi_n(x_n) =  \prod_{m \in \Mn} r_{m\rightarrow n}(x_n)
\label{eq.firstpsirule}
\eeq
\beq
 \phi_m(\bx_m) = \frac{ f_m(\bx_m) }{ \prod_{n \in \Nm} r_{m\rightarrow n}(x_n)}.
\eeq
 And each message can be computed in terms of $\phi$ and $\psi$ using
\beq
 r_{m \rightarrow n}(x_n) = \sum_{\xmwon}
 \left( \phi_m( \bx_m) \prod_{ n' \in \Nm }  \psi_{n'}(x_{n'})
 \right)
\label{eq.sprpsi}
\eeq
 which differs from the assignment (\ref{eq.spr}) in that the product is over
 all $n' \in \Nm$.
\exercisxC{2}{ex.psiconfirm}{
 Confirm that the update rules (\ref{eq.firstpsirule}--\ref{eq.sprpsi})
 are equivalent to the sum--product rules (\ref{eq.spq}--\ref{eq.spr}). 
 So $\psi_n(x_n)$ eventually becomes the marginal $Z_n(x_n)$.
}
 This factorization viewpoint applies whether or not the graph is tree-like.
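
 As a check on this viewpoint (a sketch, taking the numerator of $\phi_m$
 to be the original factor $f_m$ and assuming no message is zero),
 substitute the two update rules into (\ref{eq.factorfunctionphipsi}):
\beq
 \prod_{m=1}^M \phi_m(\bx_m) \prod_{n=1}^N \psi_n(x_n)
 = \prod_{m=1}^M f_m(\bx_m) \:
 \frac{ \prod_{n=1}^N \prod_{m \in \Mn} r_{m\rightarrow n}(x_n) }
      { \prod_{m=1}^M \prod_{n \in \Nm} r_{m\rightarrow n}(x_n) } .
\eeq
 Both double products run over every edge $(m,n)$ of the factor graph
 exactly once, so the ratio is 1 and the product remains equal to
 $\gP(\bx)$ whatever values the messages take.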
% and $\phi_m(\bx_m)$ becomes a function having the property
%\beq
%	\sum_{ \xmwon } \phi_m(\bx_m)
%\eeq
% Thus after any number of iterations of 
 
\subsection{Computational tricks}
 On-the-fly normalization is a good idea from a computational
 point of view because if $P^*$ is a product of many factors,
 its values are likely to be very large or very small.

 Another useful computational trick involves passing
 the logarithms of the messages $q$ and $r$ instead of $q$  and $r$ themselves;
 the computations of the products in the algorithm (\ref{eq.spq}, \ref{eq.spr})
 are then replaced by simpler additions.  The summations in
 (\ref{eq.spr}) of course become more difficult: to carry them out
 and return the logarithm, we need to compute \index{softmax, softmin}{softmax} functions like
\beq
	l = \ln (  e^{l_1} + e^{l_2} + e^{l_3} ) .
\label{eq.examplesum}
\eeq
 But this computation can be done efficiently using look-up tables
 along with  the observation that the value of the answer $l$
 is typically just a little larger than $\max_i l_i$.
 If we store in look-up tables values of the
 function
\beq
	\ln ( 1 + e^{\delta} )
\eeq
 (for negative $\delta$)
 then $l$ can be computed exactly in a number of look-ups and
 additions scaling as the number of terms in the sum.
 If look-ups and sorting operations are cheaper than {\tt{exp()}}
 then this approach costs less than the direct evaluation  (\ref{eq.examplesum}).
 The number of operations can be further reduced by
 omitting  negligible contributions from the smallest of the $\{ l_i \}$.
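
 The pairwise identity behind this trick is
 $\ln ( e^{l_1} + e^{l_2} ) = \max(l_1,l_2) + \ln ( 1 + e^{-|l_1 - l_2|} )$.
 A minimal sketch in Python follows. The function names
 {\tt log\_add} and {\tt log\_sum} are illustrative, not part of any library,
 and the sketch evaluates $\ln(1+e^{\delta})$ directly with {\tt math.log1p}
 where a real implementation might use a look-up table.
\begin{verbatim}
```python
import math

def log_add(l1, l2):
    """Return ln(e^l1 + e^l2) without forming e^l1 or e^l2."""
    hi, lo = (l1, l2) if l1 >= l2 else (l2, l1)
    if lo == -math.inf:
        # Adding a zero term (log of zero) leaves the larger term unchanged.
        return hi
    # delta = lo - hi is <= 0, so exp(delta) lies in (0, 1] and cannot overflow.
    return hi + math.log1p(math.exp(lo - hi))

def log_sum(log_terms):
    """Return ln(sum_i e^{l_i}) by repeated pairwise addition."""
    total = -math.inf  # log of an empty sum
    for l in log_terms:
        total = log_add(total, l)
    return total
```
\end{verbatim}
 The cost is one evaluation of $\ln(1+e^{\delta})$ per term, and the
 result stays finite even when every $e^{l_i}$ would underflow.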

 A third computational trick applicable to certain error-correcting codes
 is to pass not the messages but the \ind{Fourier transform}
 of the messages. This again makes the  computations of the factor-to-variable messages 
 quicker. A simple example of this Fourier transform trick is given in
 \chref{ch.gallager} at \eqref{eq.ft.gallager}.


%\section{The Min--Sum  Algorithm}
\section{The min--sum  algorithm}
 The sum--product algorithm solves the problem of
 finding the marginal function of a given product $P^*(\bx)$.
 This is analogous to solving the bitwise decoding problem
 of \secref{sec.decoding.problems}.
 And just as there were other decoding problems
 (for example, the codeword decoding problem),
 we can define other tasks involving  $P^*(\bx)$
 that can be solved by modifications of the sum--product algorithm.
 For example, consider this task, analogous to
 the codeword decoding problem:
\begin{description}
\item[The maximization problem\puncspace]
 Find the setting of $\bx$ that maximizes  the  product $P^*(\bx)$.
\end{description}

 This problem can be solved by replacing the two operations
 {\sf{add}} and {\sf{multiply}}
% (`+' and `$\cdot$')
 everywhere they appear in the sum--product algorithm
 by
 another pair of operations that satisfy the distributive
 law,
% eq 14 in /home/mackay/tmp/fgspa.ps
 namely {\sf{max}} and {\sf{multiply}}.
 If we replace summation ($+$, $\sum$) by maximization,
 we notice that the quantity formerly known
 as the normalizing constant,
\beq
	Z = \sum_{\bx} P^*(\bx) ,
\eeq
 becomes  $\max_{\bx} P^*(\bx)$.

 Thus the sum--product algorithm can
 be turned into a \ind{max--product} algorithm\index{algorithm!max--product}
 that computes  $\max_{\bx} P^*(\bx)$,
 and from which the solution of the
 maximization problem can be deduced.
 Each `marginal' $Z_n(x_n)$ then lists the maximum
 value that $P^*(\bx)$ can attain for each value of $x_n$.
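
 Concretely, assuming the factor-to-variable rule (\ref{eq.spr}) has its
 usual form of a sum over $\xmwon$, the max--product version replaces that
 sum by a maximization,
\beq
 r_{m \rightarrow n}(x_n) = \max_{\xmwon}
 \left( f_m( \bx_m) \prod_{ n' \in \Nm \wo n }  q_{n' \rightarrow m}(x_{n'})
 \right) ,
\eeq
 while the variable-to-factor rule (\ref{eq.spq}), which involves only
 products, is unchanged.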

 In practice, the max--product algorithm
 is most often carried out in
 the negative log likelihood domain,
 where {\sf{max}} and {\sf{product}}
 become {\sf{min}} and {\sf{sum}}.
 The min--sum algorithm  is also known as the
 \ind{Viterbi algorithm}.\index{algorithm!Viterbi}
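
 For the special case of a chain-structured graph, the min--sum recursion
 can be sketched in a few lines of Python. This is an illustration under
 stated assumptions, not a general factor-graph implementation: the name
 {\tt min\_sum\_chain} and the argument layout are hypothetical, and the
 costs are taken to be negative log factors.
\begin{verbatim}
```python
def min_sum_chain(unary, pairwise):
    """Min-sum (Viterbi) on a chain of T variables with S states each.

    unary[t][s]        cost of state s at position t
    pairwise[t][s][s2] cost of state s at t followed by state s2 at t+1
    Returns (minimum total cost, minimizing list of states).
    """
    num_states = len(unary[0])
    cost = list(unary[0])  # best cost of any path ending in each state
    back = []              # back[t-1][s2] = best predecessor of s2 at t
    for t in range(1, len(unary)):
        new_cost, pointers = [], []
        for s2 in range(num_states):
            # 'min' takes the place of the sum in the sum-product rule.
            best = min(range(num_states),
                       key=lambda s: cost[s] + pairwise[t - 1][s][s2])
            pointers.append(best)
            new_cost.append(cost[best] + pairwise[t - 1][best][s2]
                            + unary[t][s2])
        cost = new_cost
        back.append(pointers)
    # Trace the minimizing path backwards from the best final state.
    last = min(range(num_states), key=lambda s: cost[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return cost[last], path
```
\end{verbatim}
 The back-pointers are what allow the maximizing setting of $\bx$, and not
 just the optimal value, to be recovered.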


\section{The junction tree algorithm}
%\section{The Junction Tree Algorithm}
 What should one do when the factor graph one is interested
 in is not a tree?

 There are several options, and they divide into exact methods
 and approximate methods.
 The most widely used exact method for handling marginalization
 on graphs with cycles is called the \ind{junction tree algorithm}.
 This algorithm works by agglomerating variables together
 until the agglomerated graph has no cycles.
 You can probably figure  out the details for yourself; the
 complexity of the marginalization grows exponentially
 with  the number of agglomerated variables. 
 Read more about the {junction tree algorithm}
 in \cite{lauritzen96,jordan98:_learn_graph_model}.

 There are many approximate methods, and we'll visit some of
 them over the next few chapters -- Monte Carlo methods and
 variational methods, to name a couple.
 However, the most amusing way of handling factor graphs to
 which the sum--product algorithm  may not be applied
 is, as we already mentioned,
 to apply the sum--product algorithm! We simply compute the messages
 for each node in the graph, as if the graph were a tree,  iterate,
 and cross our fingers.
 This so-called `\ind{loopy}' message passing has great importance
 in the decoding of
% state-of-the-art
 error-correcting codes,
 and we'll come back to it in \secref{sec.bvfe.fr} and {\partnoun} \sgcpart.
% at the end of this book.

%\exercisxC{3}{ex.minsum}{
% Fill in the 
%}

\section*{Further reading}
 For further reading about factor graphs and the sum--product algorithm,
 see \citeasnoun{Kschischang2001},
 \citeasnoun{YFW2000},
 \citeasnoun{YFW2001long},
% this next one is poset
 \citeasnoun{YFW2002}, \citeasnoun{wainwright2003},\index{Wainwright, Martin}\index{Yedidia, Jonathan}
 and \citeasnoun{Forney2001}.

%\section*{Further reading}
% Redo the burglar-alarm problem of  \exburglar\ using message-passing.


 See also \citeasnoun{pearl}.
 A good reference for the fundamental theory of graphical models
 is \citeasnoun{lauritzen96}. A  readable introduction to Bayesian 
 networks is given by \citeasnoun{jensen96}.  

 Interesting message-passing algorithms that have different
 capabilities from the sum--product algorithm include {\dem\ind{expectation propagation}\/}
 \cite{Minka2001} and {\dem\ind{survey propagation}\/}
 \cite{surveyPropagation}.\index{Minka, Thomas}\index{Braunstein, A.}
% \index{M\'ezard, Marc}\index{Zecchina,  R.}
 See\index{Mezard, Marc}\index{Zecchina,  R.}
 also \secref{sec.bvfe.fr}.

\section{Exercises}
\exercisxB{2}{ex.pearl}{
 Express the joint probability distribution
 from the burglar alarm and earthquake problem (\exampleref{ex.burglar})
 as a factor graph, and find the marginal probabilities of all the variables
 as each piece of information comes to Fred's attention,
 using the sum--product algorithm with on-the-fly normalization.
}

\dvips

% {Laplace's method}
\chapter{Laplace's Method}% \nonexaminable}
\index{Laplace's method}% (integration)}
\label{ch.laplace}
% \label{ch.laplace}
\fakesection{Laplace}
% A chapter about Laplace's method.

% Here we can perhaps include the choice of  basis paper.
% see /home/mackay/_doc/dirichlet/laplace.tex

 The idea behind the {Laplace approximation}\index{approximation!Laplace}\index{Laplace's method}
 is  simple.
 We assume that an unnormalized probability density $P^*(x)$,
 whose normalizing constant
\beq
	Z_P \equiv \int P^*(x) \, \d x
\eeq
 is of interest, has a peak
 at a point $x_0$.
\marginfig{
\mbox{\psfig{figure=figs/peak/laplace.phi.ps,angle=-90,width=1.2in}\raisebox{0.3in}{\small$P^*(x)$}}
}
%
 We Taylor-expand the logarithm of $P^*(x)$ around this peak:
\marginfig{
\mbox{\psfig{figure=figs/peak/laplace.phi.l.ps,angle=-90,width=1.2in}%
\raisebox{0.3in}{\small$\ln P^*(x)$}}
}
\marginfig{
\psfig{figure=figs/peak/laplace.l.ps,angle=-90,width=1.2in}%
\makebox[0in][l]{\raisebox{0.1in}{\small$\ln P^*(x)$}}
\makebox[0in][l]{\raisebox{-0.08in}{\small\& $\ln Q^*(x)$}}
}
\beq
	\ln P^*(x) \simeq  \ln P^*(x_0) - \frac{c}{2} (x-x_0)^2 + \cdots ,
\label{eq.expansionlogP}
\eeq
 where
\beq
	c = - \left.
	\frac{\partial^2}{\partial x^2}  \ln P^*(x)
	\right|_{x=x_0} .
\eeq
%
 We then approximate  $P^*(x)$ by an unnormalized Gaussian,\index{approximation!by Gaussian}
\beq
	Q^*(x) \equiv P^*(x_0) \exp \left[ - \frac{c}{2} (x-x_0)^2
	\right]  ,
\eeq
\marginfig{
\psfig{figure=figs/peak/laplace.ps,angle=-90,width=1.2in}
\makebox[0in][l]{\raisebox{0.1in}{\small$P^*(x)$}}
\makebox[0in][l]{\raisebox{-0.08in}{\small\& $Q^*(x)$}}
}
%
 and we approximate the normalizing constant $Z_P$ by the
 normalizing constant
 of this Gaussian,
\beq
	Z_Q =   P^*(x_0)  \sqrt{ \frac{2 \pi }{ c }} .
\eeq
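
 A minimal numerical sketch of this one-dimensional recipe follows. The
 function name {\tt laplace\_log\_z}, the finite-difference step {\tt h}
 and the two test densities are assumptions made for illustration. For a
 Gaussian $P^*$ the approximation is exact; for the heavy-tailed
 $P^*(x) = 1/(1+x^2)$, whose true normalizing constant is $\pi$, we have
 $c = 2$ and so $Z_Q = \sqrt{\pi}$, an underestimate.
\begin{verbatim}
```python
import math

def laplace_log_z(log_p_star, x0, h=1e-4):
    """Laplace estimate of ln Z_P for a 1-D density peaked at x0.

    log_p_star returns ln P*(x); x0 is the location of the peak.
    """
    # c = -(second derivative of ln P*) at x0, by a central difference.
    second_diff = (log_p_star(x0 + h) - 2.0 * log_p_star(x0)
                   + log_p_star(x0 - h))
    c = -second_diff / h ** 2
    if c <= 0.0:
        raise ValueError("x0 does not look like a maximum of ln P*")
    # ln Z_Q = ln P*(x0) + (1/2) ln(2 pi / c)
    return log_p_star(x0) + 0.5 * math.log(2.0 * math.pi / c)

def log_gaussian(x, sigma=2.0):
    """ln of an unnormalized Gaussian with standard deviation sigma."""
    return -x * x / (2.0 * sigma ** 2)

def log_cauchy(x):
    """ln of the unnormalized density 1 / (1 + x^2)."""
    return -math.log(1.0 + x * x)
```
\end{verbatim}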


 We can generalize  this integral to
 approximate $Z_P$ for a  density $P^*(\bx)$ over a $K$-dimensional
 space $\bx$.
 If the matrix of second derivatives of $-\ln P^*(\bx)$
 at the maximum $\bx_0$ is $\bA$,
 defined by: 
\beq
	A_{ij} =  - \left.
	\frac{\partial^2}{\partial x_i \partial x_j}  \ln P^*(\bx)
	\right|_{\bx=\bx_0} ,
\eeq
 so that the expansion (\ref{eq.expansionlogP}) is generalized to
\beq
	\ln P^*(\bx) \simeq  \ln P^*(\bx_0) - \frac{1}{2} (\bx-\bx_0)^{\T}\!\bA  (\bx-\bx_0)
	+ \cdots ,
\label{eq.generalexpansionP}
\eeq
 then the normalizing constant can be approximated by: 
\beq
	Z_P \simeq Z_Q
	=  P^*(\bx_0) \frac{ 1 }{  \sqrt{ \det{\frac{1}{2 \pi} \bA} } } 
	=  P^*(\bx_0)  \sqrt{\frac{ (2 \pi)^K }{ \det{\bA} } } .
\eeq
 Predictions  can be made using the approximation $Q$.
 Physicists also call this widely-used approximation
 the {\dem\ind{saddle-point approximation}}.\index{approximation!saddle-point}


\begin{aside}
 The fact that the normalizing constant of a Gaussian is given by 
\beq
	\int \d^K \bx  \:  \exp \left[ -\frac{1}{2} \bx^{\T} \bA \bx \right]
	=  \sqrt{\frac{ (2 \pi)^K }{ \det{\bA} } }
\eeq
 can be proved by making an orthogonal transformation
 into the  basis $\bu$ in which
 $\bA$ is transformed into a  diagonal matrix.  The integral
 then separates into a product of one-dimensional integrals,
 each of the form
\beq
	\int \d u_i    \exp \left[ -\frac{1}{2} { \lambda_i u_i^2 } \right]
	= \sqrt{\frac{2 \pi}{\lambda_i}} .
\eeq
 The product of the eigenvalues $\lambda_i$ is the determinant of $\bA$.
\end{aside}

 The Laplace approximation is \index{basis dependence}basis-dependent:
 if $x$ is transformed to a nonlinear function $u(x)$ and
 the density is transformed to $P(u) = P(x) \left| \d x/\d u \right|$
 then in general the approximate normalizing constants $Z_Q$
 will be different.
 This can be viewed as a defect -- since the true value $Z_P$
 is basis-independent -- or an opportunity -- because
 we can hunt for a choice of basis in which the Laplace approximation
 is most accurate.
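
 To see where the dependence comes from, write the logarithm of the
 transformed (unnormalized) density as
\beq
 \ln P^*(u) = \ln P^*(x(u)) + \ln \left| \frac{\d x}{\d u} \right| .
\eeq
 If $u$ is a linear function of $x$, the Jacobian term is a constant and
 $Z_Q$ is unchanged. Otherwise the Jacobian term varies with $u$, so it
 shifts the location of the maximum and alters the curvature there, and
 the Gaussian fitted in the $u$ basis is not simply the transform of the
 Gaussian fitted in the $x$ basis.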

\section{Exercises}
% {\em Under construction}
% \medskip
%
%In the maximum likelihood chapter we found the second derivative
% for a few examples.
%
\exercisxA{2}{ex.poissonmap}{
 (See also \exerciseref{ex.poissonml}.)
 A \ind{photon counter} is pointed at a remote
 star for one minute, in order to infer the rate of
 photons arriving at the counter per minute, $\l$.
 Assuming the number of photons collected $r$ has a
 \ind{Poisson
 distribution} with mean $\l$,
\beq
	P(r  \given  \l ) = \exp( - \l)\frac{ \l^{r} }{r!} ,
\eeq
 and assuming the \ind{improper} prior $P(\l) = 1/\l$,
 make  Laplace approximations to the posterior
 distribution
% $P(\l \given r)$
\ben
\item over $\l$
\item over $\log \l$. [Note the improper prior transforms to $P(\log \l) = \mbox{constant}$.]
\een
}
\exercisxB{2}{ex.laplacebeta}{
	Use Laplace's method to approximate the integral
\beq
	Z(u_1,u_2) = \int_{-\infty}^{\infty} \! \d a \: f(a)^{u_1} (1-f(a))^{u_2} ,
\eeq
 where $f(a) = 1/(1+e^{-a})$ and $u_1,u_2$ are positive.
 Check the accuracy of the approximation
 against the exact answer
 (\ref{eq.Zbeta}, \pref{eq.Zbeta})
 for $(u_1,u_2)=(\dhalf,\dhalf)$
 and $(u_1,u_2)=(1,1)$.
 Measure the error $(\log Z_P - \log {Z}_Q)$ in bits.
}
% Start with a tiny example -- note that
% for  the mean of a Gaussian with known $\sigma$ the
% approximation is exact.
% For several other models (eg interpolation)
% the approximation is exact.
\exercisxB{3}{ex.interpoln}{
 {\sf Linear \ind{regression}.}\index{linear regression}
  $N$ datapoints $\{ (x^{(n)},t^{(n)}) \}$
 are generated by the experimenter choosing each $x^{(n)}$,
 then the world delivering
 a noisy version of the linear function 
\beq
	y(x) = w_0 + w_1 x ,
\eeq
\beq
	t^{(n)} \sim \Normal( y(x^{(n)}) , \sigma_{\nu}^2 ) .
\eeq
 Assuming Gaussian priors on $w_0$ and $w_1$,
 make the Laplace approximation to the posterior distribution
 of $w_0$ and $w_1$ (which is exact, in fact)
 and obtain the predictive distribution for the next datapoint
 $t^{(N\!+\!1)}$, given 
 $x^{(N\!+\!1)}$.

 (See \citeasnoun{MacKay92a} for further reading.)
}
%\section*{Further reading}
% We'






% restore me?
% \input{tex/laplacebasis.tex}
\dvips
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Model Comparison and Occam's Razor}
\label{ch.occam}
%
% see also nn_occam.tex
%
% and graveyard.tex
%
%\newcommand{\FIGS}{figs2}          %
%\newcommand{\figs}{figs}           % figures are kept in three directories
%\newcommand{\figsinter}{figs/inter}% 
%
% \maketitle
%  {\em Under construction.}
% \section{Probability theory and {O}ccam's razor}
\begin{figure}[hbtp]
\figuremargin{
\mbox{\psfig{figure=figs/dogs.eps,width=3in}}
}{
\caption[a]{A picture to be interpreted. It contains a \ind{tree}\index{box}\index{image analysis} and
 some boxes.}
\label{fig.dogs1}
}
\end{figure}
\section{{O}ccam's razor}%\index{Occam's razor}
% mini-sermon on Bayes removed to
% sermons.tex
\label{sec.occam1}
 How many boxes are in the picture (\figref{fig.dogs1})?
 In particular, how many boxes are in the vicinity of the tree?
 If we looked with x-ray spectacles, would
 we see one or two boxes behind the trunk (\figref{fig.dogs3})?
 (Or even more?)%
\newcommand{\twoboxesorone}{
\makebox[0in][r]{1?}\mbox{\psfig{figure=figs/dogs3.eps,width=1.46in}}\\
\makebox[0in][r]{or 2?}\mbox{\psfig{figure=figs/dogs3b.eps,width=1.46in}}
}%
\marginfig{\footnotesize
\begin{center}
\twoboxesorone
\end{center}
\caption[a]{How many boxes are behind the tree?}
\label{fig.dogs3}
}
 Occam's razor is the principle that states a preference for simple
 theories.
 `Accept the simplest explanation that fits the data'.
 Thus according to \inds{Occam's razor}, we should
 deduce that there is only one box behind the tree.
 Is this an ad hoc
% {\em ad hoc\/}
 rule of thumb?
 Or is there a convincing reason for believing
 there is most likely one box?  Perhaps
 your intuition likes the argument
 `well, it would be a remarkable {\em\ind{coincidence}\/}
 for the two boxes to be just the same height and
 colour as each other'.
 If we wish to make artificial intelligences
 that interpret data correctly, we must translate
 this intuitive feeling into a concrete theory.

%\section{Probability theory and Occam's razor}
\subsection{Motivations for Occam's razor}

 If several explanations are compatible with a set of
 observations, Occam's razor advises us to buy the
 simplest.
%least complex explanation.
 This principle is often advocated for one of two
 reasons: the first is aesthetic (`A theory with mathematical beauty
 is more likely to be correct than an ugly one that fits some
 experimental data' (Paul Dirac)); the second reason is the past
 empirical success of Occam's razor.
% (`simple theories have proved successful in the past, so
%I prefer simple theories for new domains too'). 
 However there is a different justification for Occam's razor,
 namely:

\begin{quotation}
\noindent
	Coherent inference (as embodied by Bayesian probability)
 automatically embodies Occam's razor, 
	 quantitatively.
\end{quotation}
% Dirac need not be too upset if we reject his motivation for 
% Occam's razor; the Bayesian Occam's razor is a theory 
% with its own mathematical beauty!
 It is indeed {\em more probable\/} that there's one
 box behind the tree, and we can compute how much more
 probable one is than two.

% Similarly, 
\subsection{Model comparison and Occam's razor}
 We evaluate the plausibility of two alternative theories $\H_1$ and
 $\H_2$ in the light of data $D$ as follows: using\index{Bayes' theorem}
 \Bayes\  theorem, we relate the plausibility of model $\H_1$ given the
 data, $P(\H_1\given D)$, to the predictions made by the model about the
 data, $P(D\given \H_1)$, and the prior plausibility of $\H_1$, $P(\H_1)$.
 This gives the following probability ratio between theory $\H_1$ and
 theory $\H_2$:

\begin{equation}
	\frac{P(\H_1\given D)}{P(\H_2\given D)} = \frac{P(\H_1)}{P(\H_2)}
			\frac{ P(D\given \H_1)}{ P(D\given \H_2)}  .
\label{occam.eq1}
\end{equation}
 The first ratio $( P(\H_1) / P(\H_2) )$ on the right-hand side
 measures how much our initial beliefs favoured $\H_1$ over
 $\H_2$. The second ratio
% factor $( P(D\given \H_1) / P(D\given \H_2) )$ evaluates
 expresses how well the observed data were predicted by $\H_1$,
 compared to $\H_2$.

 How does this relate to Occam's razor, when $\H_1$ is a simpler model
 than $\H_2$? The first ratio $( P(\H_1) / P(\H_2) )$ gives us the
 opportunity, if we wish, to insert a prior bias in favour of $\H_1$
 on aesthetic grounds, or on the basis of experience. This would
 correspond to the aesthetic and empirical motivations for Occam's
 razor mentioned earlier. But such a prior bias
 is not necessary: the second ratio,
 the data-dependent factor, embodies Occam's razor {\em
 automatically}. Simple models tend to make precise
 predictions. Complex models, by their nature, are capable of making a
 greater variety of predictions (figure \ref{fig.pdh}).  So if $\H_2$
 is a more complex model, it must spread its predictive probability
 $P(D\given \H_2)$ more thinly over the data space than $\H_1$. Thus, in the
 case where the data are compatible with both theories,
% then 
% it must be the case that 
% $P(D\given \H_1)$ is greater
% than $P(D\given \H_2)$, so 
 the simpler $\H_1$ will turn out more probable than $\H_2$, without
 our having to express any subjective dislike for complex models. Our
 subjective prior just needs to assign equal prior probabilities to
 the possibilities of simplicity and complexity. Probability theory
 then allows the observed data to express their opinion.
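
 As a toy illustration (the numbers are invented for this sketch), suppose
 $\H_1$ spreads its predictions uniformly over 10 possible data sets, and
 $\H_2$ spreads its predictions uniformly over 100 data sets that include
 those 10. If the observed $D$ is one of the 10, then
\beq
 \frac{ P(D\given \H_1)}{ P(D\given \H_2)} = \frac{1/10}{1/100} = 10 ,
\eeq
 so with equal prior probabilities the simpler model is ten times more
 probable than the complex one.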
% then match the model to the observed data. 
\begin{figure}
\figuremargin{\small%
\begin{center}
\mbox{\psfig{figure=\figs/occam_int.ps,%
%width=90mm,height=46mm,angle=90}
width=65mm,height=32mm,angle=90}}
\end{center}
}{
\caption[abbrev]{{Why Bayesian inference embodies Occam's razor.}
 This figure gives the basic intuition for why complex models can turn
 out to be less probable.  The horizontal axis represents the space of
 possible data sets $D$.  \Bayes\  theorem rewards models in proportion
 to how much they {\em predicted\/} the data that occurred. These
 predictions are quantified by a normalized probability distribution
 on $D$. This probability of the data given model
 $\H_i$, $P(D\given \H_i)$, is called the evidence for $\H_i$.

 A simple model $\H_1$ makes only a limited range of predictions,
 shown by $P(D\given \H_1)$; a more powerful model $\H_2$, that has, for
 example, more free parameters than $\H_1$, is able to predict a
 greater variety of data sets. This means, however, that $\H_2$ does
 not predict the data sets in region $\C_1$ as strongly as
 $\H_1$. Suppose that equal prior probabilities have been assigned to
 the two models. Then, if the data set falls in region $\C_1$, the
 {\em less powerful\/} model $\H_1$ will be the {\em more probable\/}
 model.
}
\label{fig.pdh}
\label{fig3}
}
\end{figure}

% f:=(x,c,d,e)-> c*x**3 + d * x**2 + e ;
% x1:=-1;x2:=3;x3:=7;x4:=11; solve({f(x1,c,d,e)=x2, f(x2,c,d,e)=x3, f(x3,c,d,e)=x4, f(x4,c,d,e)=x5, f(x5,c,d,e)=x6},{c,d,e,x5,x6});
%
 Let us turn to a simple example. Here is a sequence of numbers:
\[
	-1, \: 3, \: 7, \: 11.
\]
 The task is to predict  the next two numbers,\index{sequence}\index{what number comes next?}\index{arithmetic progression}
 and infer  the underlying process that gave rise to
 this sequence. A popular answer to this question is the  prediction  `15,
 19', with the explanation `add 4 to the previous number'.
 
 What about the alternative answer `$-19.9, 1043.8$' with the underlying
 rule being: `get the next number from the previous number, $x$, by
 evaluating $-x^3/11 + 9x^2/11 + 23/11$'?  I assume that this
 prediction
% -{\frac {x^{3}}{11}}+{\frac {9\,x^{2}}{11}}+{\frac {23}{11}}
 seems rather less plausible. But the second rule fits the data ($-1$,
 3, 7, 11) just as well as the rule `add 4'. So why should we find it
 less plausible?  Let us give labels to the two general theories:
%
\begin{description}
\item[$\H_a$ --]   the sequence is an {\dem arithmetic\/} progression, `add $n$', 
                   where $n$ is an integer. 
\item[$\H_c$ --]  the sequence is generated by a {\dem cubic\/} function of the 
                   form    $x \rightarrow c x^3 + d x^2 + e$, 
                   where $c$, $d$ and $e$ are fractions. 
\end{description}
%
One reason for finding the second explanation, $\H_c$, less
plausible, might be that arithmetic progressions are more frequently
encountered than cubic functions. This would put a \ind{bias} in the
prior probability ratio $P(\H_a)/P(\H_c)$ in equation (\ref{occam.eq1}).
But let us give the two theories equal prior probabilities, and
concentrate on what the data have to say. How well did each theory
predict the data?

 To obtain $P(D\given \H_a)$ we must specify the probability distribution
 that each model assigns to its parameters. First, $\H_a$ depends on
 the added integer $n$, and the first number in the sequence. Let us
 say that these numbers could each have been any integer between $-50$ and
 50. Then since only the pair of values \{$n\eq 4$, $\mbox{first
 number}\eq -1$\} gives rise to the observed data $D$ = ($-1$,
 3, 7, 11),
 the probability of the data, given $\H_a$, is:

\beq
	P(D\given \H_a) = \frac{1}{101} \frac{1}{101} = 0.00010 .
\eeq
 To evaluate $P(D\given \H_c)$, we must similarly say what values the
 fractions $c,d$ and $e$ might take on. [I choose to represent these
 numbers as fractions rather than real numbers because if we used real
 numbers, the model would assign, relative to $\H_a$, an
 infinitesimal probability to $D$. Real parameters are the norm
 however, and are assumed in the rest of this chapter.]  A reasonable
 prior might state that for each fraction the numerator could be any integer
 between $-50$ and 50, and the denominator any integer between
 1 and 50. As for the initial value in the sequence, let us leave its
 probability distribution the same as in $\H_a$.  
 There are four ways of expressing the fraction $c=-1/11= -2/22=-3/33=-4/44$ 
 under this prior, and similarly there are four and two possible solutions 
 for $d$ and $e$, respectively. So the
 probability of the observed data, given $\H_c$, is found to be:
%
\begin{eqnarray}
	P(D\given \H_c) &=& \left(\frac{1}{101}\right)  
		\left(\frac{4}{101}\frac{1}{50}\right) 
		\left(\frac{4}{101}\frac{1}{50}\right)
		 \left(\frac{2}{101}\frac{1}{50}\right) \nonumber \\
		&=& 0.0000000000025 = 2.5 \times 10^{-12}  .
\end{eqnarray}
 Thus comparing $P(D\given \H_c)$ with $P(D\given \H_a) = 0.00010$, even if our prior
 probabilities for $\H_a$ and $\H_c$ are equal, the odds,
 $P(D\given \H_a):P(D\given \H_c)$, in favour of $\H_a$ over $\H_c$, given the
 sequence $D$ = ($-1$, 3, 7, 11), are about forty million to one.\ENDsolution
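
 These counts can be checked by brute-force enumeration. The following
 Python sketch uses exact rational arithmetic; the function names are
 illustrative, and the priors coded here are just the uniform ones assumed
 in the text (integers from $-50$ to 50, denominators from 1 to 50).
\begin{verbatim}
```python
from fractions import Fraction

DATA = [-1, 3, 7, 11]
INTS = range(-50, 51)   # 101 equally probable integers
DENOMS = range(1, 51)   # 50 equally probable denominators

def evidence_arithmetic():
    """P(D | H_a): fraction of (first number, n) pairs that generate DATA."""
    hits = sum(1 for first in INTS for n in INTS
               if [first + k * n for k in range(len(DATA))] == DATA)
    return Fraction(hits, len(INTS) ** 2)

def ways(target):
    """Number of (numerator, denominator) pairs that equal target."""
    return sum(1 for den in DENOMS for num in INTS
               if Fraction(num, den) == target)

def evidence_cubic(c, d, e):
    """P(D | H_c) for the rule x -> c x^3 + d x^2 + e fitted to DATA."""
    x = Fraction(DATA[0])
    for expected in DATA[1:]:
        x = c * x ** 3 + d * x ** 2 + e
        if x != expected:
            raise ValueError("this rule does not reproduce the data")
    pair_prior = Fraction(1, len(INTS) * len(DENOMS))
    first_prior = Fraction(1, len(INTS))
    return (first_prior * ways(c) * pair_prior
            * ways(d) * pair_prior * ways(e) * pair_prior)

P_A = evidence_arithmetic()
P_C = evidence_cubic(Fraction(-1, 11), Fraction(9, 11), Fraction(23, 11))
ODDS = P_A / P_C   # evidence ratio in favour of H_a
```
\end{verbatim}
 Because the three data transitions determine $c$, $d$ and $e$ uniquely,
 only the representations of those three fractions need to be counted.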

 This answer depends on several subjective assumptions; in particular,
 the probability assigned to the free parameters $n$, $c$, $d$, $e$ of
 the theories.  Bayesians make no apologies for this: there is no such
 thing as inference or prediction without assumptions.  However, the
 quantitative details of the prior probabilities have no effect on the
 qualitative Occam's razor effect; the complex theory $\H_c$ always
 suffers an `\ind{Occam factor}' because it has more parameters, and so can
 predict a greater variety of data sets (figure \ref{fig.pdh}). This
 was only a small example, and there were only four data points; as we
 move to larger and more sophisticated problems the magnitude of the
 Occam factors typically increases, and the degree to which our
 inferences are influenced by the quantitative details of our
 subjective assumptions becomes smaller.

% Why is this quantification of Occam's razor useful? It probably would
% have been of little use to William of Ockham, the 14th century
% Franciscan monk after whom the razor is named, though this derivation
% does show that Occam's razor is not an ad hoc principle. 


\subsection{Bayesian methods and data analysis}
	Let us now relate the discussion above to real problems in data 
 analysis. 

        There are countless problems in science, statistics and technology 
  which require that, given a limited data set, preferences be assigned 
 to alternative models of differing complexities.\index{complexity control}\index{model comparison} 
 For example, two alternative hypotheses accounting for planetary
 motion are Mr.\ \ind{Inquisition}'s geocentric model based on `\ind{epicycles}',
 and Mr.\ \ind{Copernicus}'s simpler model of the \ind{solar system} with the
 sun at the centre. The
 epicyclic model fits data on planetary motion at least as well as the
 Copernican model, but does so using more parameters. Coincidentally
 for Mr.\ Inquisition, two of the extra epicyclic parameters for every
 planet are found to be identical to the period and radius of the
 sun's `cycle around the earth'.  Intuitively we find Mr.\ Copernicus's
 theory more probable. 
% I will now explain in more detail
% how Mr.\ Inquisition's excess parameters are penalized automatically
% under probability theory.

%thesis/figs/fig1.enhanced.tex
\begin{figure}
\figuremargin{\small
\begin{center}
\setlength{\unitlength}{0.007in}
\begin{picture}(510,485)(-250,-265)% increased height Tue 24/12/02
\thicklines
\newsavebox{\fatarr}
\savebox{\fatarr}(40,40){\begin{picture}(40,40)(-20,-20)
 \put(-9,-10){\line(0,1){20}}
 \put(10,-10){\line(0,1){20}}
 \put(0,-20){\line(1,1){20}}
 \put(0,-20){\line(-1,1){20}}
 \end{picture}}
%\put(-250,-400){\framebox(500,700){}}
%\put(-200,-300){\framebox(400,500){}}
%\put(-150,-200){\framebox(300,300){}}
%\put(-100,-100){\framebox(200,100){}}
\thinlines
\put(-100,165){\fbox{\shortstack{Gather \\ DATA}}}
\put(20,165){\fbox{\shortstack{Create\\ alternative\\ MODELS}}}
\put(-20,110){\usebox{\fatarr}}
\put(-93,50){\fbox{\fbox{\shortstack{Fit each MODEL\\ to the DATA}}}}
\put(-20,-15){\usebox{\fatarr}}
\put(-132,-85){\fbox{\fbox{\shortstack{Assign preferences to the \\ alternative MODELS}}}}
\put(-240,-220){\fbox{\shortstack{Choose what \\
data to\\  gather next}}}
\put(-315,-60){\fbox{\shortstack{Gather\\more data}}}
\put(100,-220){\fbox{\shortstack{Decide whether\\ to create new\\ models}}}
\put(195,-60){\fbox{\shortstack{Create new\\ models}}}
\put(-65,-265){\fbox{\shortstack{Choose future \\ actions}}}
\thicklines
\put(-220,-140){\vector(0,1){50}}
\put(-220,0){\line(0,1){50}}
\put(-200,50){\oval(40,40)[tl]}
\put(-200,70){\vector(1,0){70}}

\put(220,-140){\vector(0,1){50}}
\put(220,0){\line(0,1){50}}
\put(200,50){\oval(40,40)[tr]}
\put(200,70){\vector(-1,0){70}}

\put(30,-120){\vector(1,-1){50}}
\put(-30,-120){\vector(-1,-1){50}}
\put(0,-120){\vector(0,-1){75}}
%\put(){\fbox{\shortstack{}}}
\end{picture}\\
\end{center}
}{
\caption[Where Bayesian inference fits into the data modelling process]{{Where Bayesian inference fits into the data modelling process.
%\small

 This figure illustrates an abstraction of the part of the scientific
process in which data are collected and modelled.  In particular, this
figure applies to pattern classification, learning, interpolation,
etc.  The two double-framed boxes denote the two steps which involve
{\em inference}.  It is only in those two steps that \Bayes\  theorem can
be used. Bayes does not tell you how to invent models, for example.

The first box, `fitting each model to the data', is the task of
inferring what the model parameters might be given the model and the
data. Bayesian methods  may be used to find the most probable parameter values,
and error bars on those parameters.  The result of applying Bayesian methods to
this problem is often little different from the answers given by
orthodox statistics.

The second inference task, model comparison in the light of 
the data, is where Bayesian methods are in a class of their own. This 
second inference problem requires a quantitative Occam's razor to 
penalize over-complex models. Bayesian methods 
can assign objective preferences to the alternative models in a way that 
automatically embodies Occam's razor.
}}
\label{fig1}
}
\end{figure}


\subsection{The mechanism of the {B}ayesian razor: 
	the evidence and the {O}ccam factor}
 Two levels of {inference} can often be distinguished in the
 process of data modelling.  At the first level of inference, we
 assume that a particular model is true, and we fit that model to the
 data, \ie, we infer what values its free parameters should plausibly
 take, given the data. The results of this inference are often
 summarized by the most probable parameter values, and error bars  on
 those parameters.  This analysis is repeated for each model.  The
 second level of inference is the task of model comparison.  Here we
 wish to compare the models in the light of the data, and assign some
 sort of preference or ranking to the alternatives.
\begin{aside}
 Note that
 both levels of {\em inference\/} are distinct from {\em \ind{decision
 theory}}.  The goal of inference is, given a defined hypothesis space
 and a particular data set, to assign probabilities to
 hypotheses. Decision theory typically chooses between alternative
 {\em actions\/} on the basis of these probabilities so as to minimize
 the expectation of a `loss function'. This chapter concerns inference
 alone and no loss functions are involved. When we
 discuss model comparison, this should not be construed as implying
 model {\em choice\/}. Ideal Bayesian predictions do not involve choice
 between models; rather, predictions are made by summing over all the
 alternative models, weighted by their probabilities.
% (section \ref{sec.eb}).}}
\end{aside}


 Bayesian methods are able to solve both inference tasks consistently
 and quantitatively.  There is a popular \ind{myth} that states that
 Bayesian methods only differ from orthodox statistical methods by the
 inclusion of subjective priors, which are difficult to assign, and which
 usually don't make much difference to the conclusions.
%[I hope
% this myth has already been dispelled by examples such
% as \exerciseref{ex.3doors}.]
 It is true
 that, at the first level of inference, a Bayesian's results will
 often differ little from the outcome of an orthodox attack.  What is
 not widely appreciated is how a Bayesian performs the second level of
 inference; this chapter will therefore focus on Bayesian model
 comparison.

 Model comparison is a difficult task because it is not possible
 simply to choose the model that fits the data best: more complex\index{complexity control}\index{model comparison}
 models can always fit the data better, so the \ind{maximum likelihood}
 model choice would lead us inevitably to implausible,
 over-parameterized models, which generalize poorly.  Occam's razor is
 needed.

 Let us write down \Bayes\  theorem for the two levels of inference
 described above, so as to see explicitly how Bayesian model
 comparison works.\index{Bayes' theorem}
 Each model $\H_i$ is assumed to have a vector of
 parameters $\bw$. A model is defined by a collection of probability
 distributions: a `prior' distribution $P(\bw\given \H_i)$, which states what
 values the model's parameters might be expected to take; and a set of
 conditional distributions, one for each value of $\bw$, defining the
 predictions $P(D\given \bw,\H_i)$ that the model makes about the data $D$.
%  when its parameters take a particular value $\bw$. 
% The second of these is actually a collection of
% probability distributions, one for each value of$\bw$.

% Note that
% models with the same parameterization but different priors over the
% parameters are therefore defined to be different models.

\ben
\item {\bf Model fitting.}
At the first level of inference, we assume 
that one model, the $i$th, say,
% $\H_i$
is true, and we 
infer what the model's parameters $\bw$ might be, 
given the data $D$.  
Using \Bayes\  theorem, the {\dem posterior probability\/} of the 
parameters $\bw$ is: 
\begin{equation}
\label{i.POpre}
P(\bw\given  D, \H_i) = \frac{P(D\given \bw,\H_i)P(\bw\given \H_i)}{P(D\given \H_i)},
\end{equation}
% In words:
that is, 
\[
	{\rm Posterior = \frac{Likelihood \times Prior}{Evidence} }.
\]
 The normalizing constant $P(D\given \H_i)$ is commonly ignored since it is
 irrelevant to the first level of inference, \ie, the inference of $\bw$;
 but it becomes important in the second level of inference, and we
 name it the {\dem\ind{evidence}\/} for $\H_i$.  It is common practice to use
 gradient-based methods to find the maximum of the posterior, which
 defines the most probable value for the parameters, $\wmp$; it is
 then usual to summarize the posterior distribution by the value of
 $\wmp$, and error bars or confidence intervals on these best fit
 parameters. Error bars can be obtained from the curvature of the
 posterior; evaluating the Hessian at $\wmp$, $\bA =
 \left. -\grad\grad \ln P(\bw\given  D,\H_i)\right|_{\wmp}$, and
 Taylor-expanding the log posterior probability with $\upDelta \bw = \bw - \wmp$:
\begin{equation}
	P(\bw\given  D, \H_i)	\simeq 
	P(\wmp\given  D, \H_i)\exp \left( - {\textstyle \dhalf}\upDelta \bw^{\T} 
		\bA \upDelta \bw  \right), 
\label{i.EB}
\end{equation}
 we see that the posterior can be locally approximated as a Gaussian
 with covariance matrix (equivalent to error bars)
 $\bA^{-1}$. [Whether this approximation is good or not will depend on
 the problem we are solving. Indeed, the maximum and mean of the
 posterior distribution have no fundamental status in Bayesian
 inference -- they both change under nonlinear
 reparameterizations. Maximization of a posterior probability is only
 useful if an approximation like equation (\ref{i.EB}) gives a good
 summary of the distribution.]
\item {\bf Model comparison.}
 At the second level of inference, we wish to infer which model is
 most plausible given the data. The posterior probability of each
 model is:
\begin{equation}
\label{i.EV.A}
P(\H_i\given D) \propto P(D \given  \H_i)    P(\H_i) .
%P(\H_i\given D) \propto \underbrace{P(D \given  \H_i)}    P(\H_i) 
\end{equation}
 Notice that the data-dependent term $P(D \given  \H_i)$ is the evidence for
 $\H_i$, which appeared as the normalizing constant in
 (\ref{i.POpre}).  The second term, $P(\H_i)$, is the subjective prior
 over our hypothesis space, which expresses how plausible we thought
 the alternative models were before the data arrived.  Assuming that
 we choose to assign equal priors $P(\H_i)$ to the alternative models,
 {\em models $\H_i$ are ranked by evaluating the evidence.} The normalizing 
 constant $P(D) = \sum_i P(D \given  \H_i)    P(\H_i)$ has been omitted from 
 equation (\ref{i.EV.A}) because in the data-modelling
 process we may develop new models after the data have arrived, when
 an inadequacy of the first models is detected, for example. Inference
 is open ended: we continually seek more probable models to account
 for the data we gather.
% \een

 To repeat the key idea:
%\begin{conclusionbox}
 to rank alternative models $\H_i$, a
 Bayesian evaluates the evidence $P(D\given \H_i)$.
%\end{conclusionbox}
%
 This concept is very
 general: the evidence can be evaluated for parametric and
 `non-parametric' models alike; whatever our data-modelling task, a
 regression problem, a classification problem, or a density estimation
 problem, the  evidence is a transportable quantity for
 comparing alternative models. In all these cases the evidence
 naturally embodies Occam's razor.
\een
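
 As a small worked step, and only as a sketch for the case of two
 candidate models $\H_1$ and $\H_2$, dividing equation (\ref{i.EV.A})
 for one model by the same equation for the other shows why the
 omitted constant $P(D)$ does not matter for ranking:
\beq
	\frac{P(\H_1\given D)}{P(\H_2\given D)}
	= \frac{P(D\given \H_1)}{P(D\given \H_2)} \times \frac{P(\H_1)}{P(\H_2)} .
\eeq
 The normalizing constant cancels in the ratio, so if the two priors are
 equal the posterior ratio is simply the ratio of the evidences.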

\subsection{Evaluating the evidence}
 Let us now study the evidence more closely to gain insight into how
 the Bayesian Occam's razor works.  The evidence is the normalizing
 constant for equation (\ref{i.POpre}):
\begin{equation}
\label{evidence}
	P(D\given \H_i) =  \int P(D\given \bw,\H_i)P(\bw\given \H_i)\, \d \bw.
\end{equation}
\sloppy
 For many problems the
 posterior $P(\bw\given D,\H_i)\propto P(D\given \bw,\H_i)P(\bw\given \H_i)$ has a
 strong peak at the most probable parameters $\wmp$ (figure
 \ref{fig4}).  Then, taking for simplicity the one-dimensional case,
 the evidence can be approximated, using Laplace's method,
 by the height of the peak of the
 integrand $P(D\given \bw,\H_i)P(\bw\given \H_i)$ times its width, $\sigma_{w|D}$:
\fussy
%
%thesis/figs/fig4.tex
\begin{figure}
\figuremargin{\small%
\begin{center}
\setlength{\unitlength}{0.000595745in}
\begin{picture}(5625,2400)(250,1600)
\put(0,1700){\makebox(0,0)[bl]{\psfig{figure=\figs/occam_dd.ps,%
width=3.5 true in,height=1.383 true in}
% ,% height was 2.383 with old fig
% bbllx=0.0in,bblly=0.0in,%
% bburx=5.875in,bbury=4.0in}
}}
\put(3750,2000){\makebox(0,0)[b]{$\wmp$}}
\put(3700,2570){\makebox(0,0)[b]{$\sigma_{w|D}$}}
\put(3000,1625){\makebox(0,0)[t]{$\sigma_{w}$}}
\put(5500,1750){\makebox(0,0)[t]{$\bw$}}
\put(1500,2300){\makebox(0,0)[bl]{$P(\bw\given \H_i)$}}
\put(4000,3125){\makebox(0,0)[bl]{$P(\bw\given D,\H_i)$}}

\end{picture}\\
\end{center}
}{
\caption[abbrev]{{The Occam factor.}\indexs{Occam factor}
	This figure shows the quantities that determine the Occam
	factor for a hypothesis $\H_i$ having a single parameter
	$\bw$. The prior distribution (solid line) for the parameter
	has width $\sigma_{w}$.  The posterior distribution (dashed
	line) has a single peak at $\wmp$ with characteristic width
	$\sigma_{w|D}$. The Occam factor is
	$$\displaystyle\sigma_{w|D} P(\wmp\given \H_i) = \frac{\sigma_{w|D}}{\sigma_{w}}.$$
}
\label{fig4}
\label{fig.prior.post}
}
\end{figure}

%
\begin{equation}
\label{approx.evidence}
\begin{array}[t]{c@{\hspace{0.2cm}\simeq\hspace{0.2cm}}r@{\mbox{$\:\times\:$}}l}
P(D\given \H_i)  & \strutc \underbrace{P(D\given \wmp,\H_i)} &
                \underbrace{ P(\wmp\given \H_i) \, \sigma_{w|D} }. \\
{\rm Evidence}   &{\rm Best~fit~likelihood} & {\rm Occam~factor }
\end{array}
\end{equation}
 Thus the evidence is found by taking the best fit likelihood 
 that the model can achieve and multiplying it by  an `{Occam factor}',
% \cite{G1}, 
 which is a term with magnitude less than one that penalizes 
 $\H_i$ for having the parameter $\bw$.  
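
 The following sketch indicates where the width $\sigma_{w|D}$ comes
 from in the one-dimensional case. It assumes that the integrand has a
 single smooth peak at $\wmp$; the scalar curvature $A$ is notation
 introduced here for illustration only. Let $A$ be minus the second
 derivative of $\ln [ P(D\given \bw,\H_i)P(\bw\given \H_i) ]$ with
 respect to $\bw$, evaluated at $\wmp$. Approximating the integrand by
 a Gaussian of that curvature gives
\begin{eqnarray}
	P(D\given \H_i) & \simeq &
	P(D\given \wmp,\H_i) P(\wmp\given \H_i)
	\int \exp \left( - \frac{A}{2} (\bw - \wmp)^2 \right) \d \bw
	\nonumber \\
	& = & P(D\given \wmp,\H_i) P(\wmp\given \H_i) \sqrt{ \frac{2\pi}{A} } ,
\end{eqnarray}
 so under this approximation the width is $\sigma_{w|D} = \sqrt{2\pi/A}$,
 that is, $\sqrt{2\pi}$ times the posterior standard deviation $A^{-1/2}$.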

\subsection{Interpretation of the {O}ccam factor}
 The quantity $\sigma_{w|D}$ is the posterior uncertainty in $\bw$.
 Suppose for simplicity that the prior $P(\bw\given \H_i)$ is uniform on
 some large interval $\sigma_{w}$, representing the range of values of
 $\bw$ that were possible {\em a priori}, according to $\H_i$ (figure
 \ref{fig4}).  Then $P(\wmp\given \H_i) = 1/\sigma_{w}$, and
\beq
	\mbox{Occam factor}= \frac{\sigma_{w|D}}{\sigma_{w}},
\eeq
 \ie, {\em the Occam factor is equal to the ratio of the posterior
 accessible volume of $\H_i$'s parameter space to the prior accessible
 volume,} or the factor by which $\H_i$'s hypothesis space collapses
 when the data arrive.
% \cite{G1,Jeffreys}.
 The model $\H_i$ can be
 viewed as consisting of a certain number of exclusive submodels, of
 which only one survives when the data arrive. The Occam factor is the
 inverse of that number.  The logarithm of the Occam factor is a
 measure of the amount of information\index{information content}
 we gain about the model's
 parameters when the data arrive.
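
 For a concrete feel, consider some made-up numbers (not taken from any
 data set): suppose the prior range is one hundred times the posterior
 width, $\sigma_{w} = 100 \, \sigma_{w|D}$. Then
\beq
	\mbox{Occam factor} = \frac{\sigma_{w|D}}{\sigma_{w}} = \frac{1}{100} ,
	\:\:\:\:\:\:
	\log_2 \frac{\sigma_{w|D}}{\sigma_{w}} \simeq -6.6 ,
\eeq
 so in this illustration the data deliver roughly 6.6 bits of information about the
 parameter, and the model behaves like about one hundred exclusive
 submodels of which one survives.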

 A\index{complexity control}\index{model comparison}
 complex model having many parameters, each of which is free to vary
 over a large range $\sigma_{w}$, will typically be penalized by a
 stronger Occam factor than a simpler model. The Occam factor also
 penalizes models that have to be finely tuned to fit the data,
 favouring models for which the required precision of the parameters
 $\sigma_{w|D}$ is coarse. The magnitude of the Occam factor is thus a
 measure of complexity of the model;
% but, unlike the V-C dimension \cite{Abu1},
 it relates to the complexity of the predictions that the
 model makes in data space.  This depends not only on the number of
 parameters in the model, but also on the prior probability that the
 model assigns to them.  Which model achieves the greatest evidence is
 determined by a trade-off between minimizing this natural complexity
 measure and minimizing the data misfit. In
% further
 contrast to
 alternative measures of model complexity, the Occam factor for a
 model is straightforward to evaluate: it simply depends on the error
 bars on the parameters, which we already evaluated when fitting the
 model to the data.


\begin{figure}[t]
\figuremargin{\small%
\begin{center}
\setlength{\unitlength}{1mm}
%\framebox{
\begin{picture}(116.5,95)(0,0)
\put(1,0){\makebox(0,0)[bl]{\psfig{figure=\figs/hyp8.ps,width=4.5in,angle=90}}}
\put(46.5,2.5){\makebox(0,0){$\sigma_{w}$}}
\put(54,11){\makebox(0,0)[t]{$\sigma_{w|D}$}}

\put(48,18){\makebox(0,0)[br]{$P(\bw\given \H_3)$}}
\put(50.5,26){\makebox(0,0)[br]{$P(\bw\given D,\H_3)$}}
\put(85,22){\makebox(0,0)[br]{$P(\bw\given \H_2)$}}
\put(86,32){\makebox(0,0)[br]{$P(\bw\given D,\H_2)$}}
\put(106,31){\makebox(0,0)[br]{$P(\bw\given \H_1)$}}
\put(107,47){\makebox(0,0)[br]{$P(\bw\given D,\H_1)$}}

\put(26,64){\makebox(0,0)[l]{$P(D\given \H_1)$}}
\put(15.3,71.5){\makebox(0,0)[bl]{$P(D\given \H_2)$}}
\put(14.5,82){\makebox(0,0)[bl]{$P(D\given \H_3)$}}

\put(4,20){\makebox(0,0)[r]{$D$}}
\put(4.5,68){\makebox(0,0)[r]{$D$}}

\put(65,13){\makebox(0,0)[t]{$\bw$}}
\put(93,13){\makebox(0,0)[t]{$\bw$}}
\put(113,13){\makebox(0,0)[t]{$\bw$}}

\end{picture}
% }
\end{center}
% I had to edit this ps file to correct the bb. from 0 0 770 552
% to %  500 0 1000 552 - this put the fig too low
% to   435 0 900 552
}{
\caption[abbrev]{{A hypothesis space} consisting of three exclusive 
 models, each having one parameter $\bw$, and a one-dimensional data
 set $D$.   The `data set' is a single measured value which differs
 from the parameter $\bw$ by a small amount of additive noise.  Typical 
 samples from the joint distribution $P(D,\bw,\H)$ are shown 
 by dots. (N.B.,  these are not data points.) The observed `data set' 
 is a single particular value for $D$ shown by the
 dashed horizontal line. The
 dashed curves below show the posterior probability of $\bw$ for each
 model given this data set (\cf\ figure \protect\ref{fig.pdh}).  The evidence
 for the different models is obtained by marginalizing onto the $D$
 axis at the left-hand side (\cf\ figure \protect\ref{fig.prior.post}).
}
\label{fig.modelspace}
}
\end{figure}

 Figure \ref{fig.modelspace} displays an entire hypothesis space so
 as to illustrate the various probabilities in the analysis.  There
 are three models, $\H_1, \H_2, \H_3$, which have equal prior
 probabilities. Each model has one parameter $\w$ (each shown on a
 horizontal axis), but assigns a different prior range $\sigW$ to that
 parameter. $\H_3$ is the most `flexible' or `complex' model,
 assigning the broadest prior range. A one-dimensional data space is
 shown by the vertical axis.  Each model assigns a joint probability
 distribution $P(D,\w\given \H_i)$ to the data and the parameters,
 illustrated by a cloud of dots.  These dots represent random samples
 from the full probability distribution.  The total number of dots in
 each of the three model subspaces is the same, because we assigned
 equal prior probabilities to the models.

 When a particular data set $D$ is received (horizontal line), we
 infer the posterior distribution of $\w$ for a model ($\H_3$, say) by
 reading out the density along that horizontal line, and
 normalizing. The posterior probability $P(\w\given D,\H_3)$ is shown by the
 dotted curve at the bottom. Also shown is the prior distribution
 $P(\w\given \H_3)$ (\cf\ figure \ref{fig.prior.post}). [In the case of model
 $\H_1$ which is very poorly matched to the data, the shape of the
 posterior distribution will depend on the details of the tails of the
 prior $P(\bw\given \H_1)$ and the likelihood $P(D\given \bw,\H_1)$; the curve
 shown is for the case where the prior falls off more strongly.]

 We obtain figure \ref{fig3} by marginalizing the joint distributions
 $P(D,\w\given \H_i)$ onto the $D$ axis at the left-hand side.
% and normalizing them. 
% This procedure gives the predictions of each model in data space.
 For the data set $D$ shown by the dotted horizontal line, the
 evidence $P(D\given \H_3)$ for the more flexible model $\H_3$ has a smaller
 value than the evidence for $\H_2$.  This is because $\H_3$ placed
 less predictive probability (fewer dots) on that line.  In terms of the 
 distributions over $\w$, model $\H_3$ has smaller evidence because
 the Occam factor $\sigma_{w|D}/\sigma_{w}$ is smaller for $\H_3$ than
 for $\H_2$.  The simplest model $\H_1$ has the smallest evidence of
 all,
% because it assigned very low probability to $D$. 
 because the best fit that it can achieve to the data $D$ is very
 poor.  Given this data set, the most probable model is $\H_2$.


\subsection{Occam factor for several parameters}
% $\bw$ is $k$-dimensional, and if
 If  the posterior is well
 approximated by a Gaussian, then the Occam factor is obtained from
 the determinant of the corresponding covariance matrix (\cf\ equation
 (\ref{approx.evidence}) and \chref{ch.laplace}):
\begin{equation}
\label{general.occam}
\begin{array}[t]{c@{\hspace{0.2cm}\simeq\hspace{0.2cm}}r@{\mbox{$\:\times\:$}}c}
	P(D\given \H_i)  & \underbrace{P(D\given \wmp,\H_i)} &
		\underbrace{P(\wmp\given \H_i) \, 
		{\det}^{-\half} (\bA/2\pi)  }, \\   
	{\rm Evidence}   &{\rm Best~fit~likelihood} & {\rm Occam~factor }
\end{array}
\end{equation}
 where $\bA = -\grad\grad \ln P(\bw\given D,\H_i)$, the Hessian which we
 evaluated when we calculated the error bars on $\wmp$ (equation
 \ref{i.EB} and \chref{ch.laplace}).  As the amount of data collected
% , $N$,
 increases, this
 Gaussian approximation is expected to become increasingly accurate.\index{approximation!by Gaussian}
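
 To illustrate how the penalty compounds with the number of
 parameters, here is a sketch under simplifying assumptions that are
 not part of the general result: the model has $k$ parameters, the
 prior is separable and uniform with range $\sigma_{w,j}$ for the
 $j$th parameter, the Hessian is diagonal with entries $1/s_j^2$
 (where $s_j$ denotes an assumed posterior standard deviation), and
 the peak lies well inside the prior range. Then
 $P(\wmp\given \H_i) = \prod_j 1/\sigma_{w,j}$ and
 ${\det}^{-1/2} (\bA/2\pi) = \prod_j \sqrt{2\pi}\, s_j$, so
\beq
	\mbox{Occam factor} \simeq \prod_{j=1}^{k} \frac{ \sqrt{2\pi}\, s_j }{ \sigma_{w,j} } .
\eeq
 Each well-determined parameter contributes one factor smaller than
 one, in agreement with the one-dimensional ratio
 $\sigma_{w|D}/\sigma_{w}$.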

 In summary, Bayesian model comparison is a simple extension of maximum
 likelihood model selection: {\em the evidence is obtained by
 multiplying the best fit likelihood by the Occam factor.}

 \index{Occam factor}To evaluate the Occam factor we need only the Hessian $\bA$, if the
 Gaussian approximation is good.  Thus the Bayesian method of model
 comparison by evaluating the evidence is no more 
 computationally demanding than the task of finding for each model the best fit
 parameters and their error bars.

% see NOTES.tex  for stolen material

\section{Example}
 Let's return to the example that opened this chapter.
 Are there  one or two boxes behind the \ind{tree}\index{box}\index{image analysis}
 in \figref{fig.dogs1}?
 Why do \ind{coincidence}s make us suspicious?\index{suspicious coincidences}

 Let's assume the image of the area round the trunk and box
 has a size of 50 pixels, that the trunk is 10 pixels wide,
 and that 16 different colours of
 boxes can be distinguished.
 The theory $\H_1$ that says there is one box near the trunk
 has four free parameters: three coordinates defining the top three edges
 of the box, and one parameter giving the box's colour. (If boxes
 could levitate, there would be five free parameters.)

 The theory $\H_2$ that says there are two boxes near the trunk
 has eight free parameters (twice four), plus a ninth, a binary
 variable that
 indicates which of the two boxes is the closer to the viewer.
\marginfig{\footnotesize
\begin{center}
\twoboxesorone
\end{center}
\caption[a]{How many boxes are behind the tree?}
\label{fig.dogs3again}
}

 What is the evidence for each model?
 We'll do $\H_1$ first.
 We need a prior on the parameters to evaluate the evidence.
 For convenience, let's work in pixels.
 Let's assign a separable prior to the horizontal location of the box,
 its width, its height, and its colour.
 The height could have any of, say, 20 distinguishable values,
 so could the width, and so could the location.
 The colour could have any of 16 values.
 We'll put uniform priors over these variables. We'll
 ignore all the parameters associated with  other objects in the image,
 since they don't come into the model comparison between
 $\H_1$ and $\H_2$.
 The evidence is
\beq
	P(D\given \H_1) = \frac{1}{20}  \frac{1}{20}  \frac{1}{20}  \frac{1}{16} 
\eeq
 since only one setting of the parameters fits the data,
 and it predicts the data perfectly.

 As for model $\H_2$,
 six of its nine parameters are well-determined,
 and three of them are partly-constrained by the data.
 If the left-hand box is furthest away, for example,
 then its width  is at least 8 pixels
 and at most 30; if it's the closer of the two boxes,
 then its width is between 8 and 18 pixels.
 (I'm assuming here that the visible portion of the
 left-hand box is about 8 pixels wide.)
 To get the evidence we need to sum up the prior
 probabilities of all viable hypotheses.
 To do an exact calculation, we need to be more specific
 about the data and the priors, but let's just get
 the ballpark answer, assuming that the two unconstrained real
 variables have half their values available,
 and that the binary variable is completely undetermined. (As
 an exercise, you can make an explicit model and work
 out the exact answer.)
\beq
	P(D\given \H_2) \simeq \frac{1}{20}  \frac{1}{20}  \frac{10}{20}  \frac{1}{16}
 \frac{1}{20}  \frac{1}{20}  \frac{10}{20}  \frac{1}{16}
\frac{2}{2} .
\eeq
 Thus the posterior probability ratio
 is (assuming equal prior probability):
\beqan
\frac{	P(D\given \H_1)P(\H_1)}
{P(D\given \H_2)  P(\H_2)}
& =& \frac{1}{  \frac{1}{20}  \frac{10}{20}  \frac{10}{20}  \frac{1}{16}  }
\label{eq.fourfactors}
\\
& =& {20 \times 2 \times 2 \times 16} \:\:\simeq\:\: 1000/1 .
\eeqan
 So the data are roughly 1000 to 1 in favour of the simpler hypothesis.
 The four factors in (\ref{eq.fourfactors}) can be interpreted in terms of
 Occam factors. The more complex model
 has four extra parameters for sizes and colours -- three
 for sizes, and one for colour.
 It has to pay two big Occam factors (\dfrac{1}{20} and \dfrac{1}{16}) for
 the highly suspicious \ind{coincidence}s that the two box heights match
 exactly and the two colours match exactly;
 and it also pays two lesser Occam factors for the two lesser coincidences
 that both boxes happened to have one of their edges conveniently hidden
 behind a tree or behind each other.
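
 As an arithmetic check on the ballpark figures above (using the same
 assumed priors, so the numbers inherit all of their roughness), the
 two evidences evaluate to
\begin{eqnarray}
	P(D\given \H_1) &=& \frac{1}{20^3 \times 16} = \frac{1}{128\,000}
		\simeq 7.8 \times 10^{-6} , \\
	P(D\given \H_2) &\simeq& \frac{1}{20^4 \times 2^2 \times 16^2}
		= \frac{1}{163\,840\,000} \simeq 6.1 \times 10^{-9} ,
\end{eqnarray}
 and their ratio is $163\,840\,000 / 128\,000 = 1280$, which is the
 `roughly 1000 to 1' quoted above.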

% MDL stuff --
% stolen from nn_occam.tex
\section{Minimum description length (MDL)}
\label{MDL}
\index{minimum description length}
 A complementary view of Bayesian model comparison is obtained by
 replacing probabilities of events by the lengths in bits of messages
 that communicate the events without loss to a receiver. Message
 lengths $L(\bx)$ correspond to a probabilistic model over events
 $\bx$ via the relations:
\begin{equation}
	P(\bx) = 2^{-L(\bx)}, \:\: L(\bx) = - \log_2 P(\bx)           .
\label{p_l}
\end{equation}
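
 As a quick illustration of this correspondence, an event with
 probability $1/8$ is assigned a message of length
 $-\log_2 (1/8) = 3$ bits. Applied to the ballpark one-box evidence
 from the example earlier in this chapter, $P(D\given \H_1) = 1/128\,000$,
 the same relation gives
\beq
	L = \log_2 128\,000 \simeq 17.0 \:\: \mbox{bits} ,
\eeq
 and a ratio of 1280 between two probabilities corresponds to a
 difference of $\log_2 1280 \simeq 10.3$ bits between the two message lengths.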
% Non-integer coding lengths can be handled by the arithmetic coding
% procedure \cite{arith_coding}.

 The MDL principle \cite{WB} states that one should prefer models
 that can communicate the data in the smallest number of
 bits. Consider a two-part message that states which model, $\H$, is to be
 used, and then communicates the data $D$ within that model, to some
 pre-arranged precision $\delta D$. This produces a message of length
 $L(D,\H) = L(\H) + L(D\given \H)$.  The lengths $L(\H)$ for different $\H$
 define an implicit prior $P(\H)$ over the
 alternative models. Similarly $L(D\given \H)$ corresponds to a density
 $P(D\given \H)$. Thus, a procedure for assigning message lengths can be
 mapped onto posterior probabilities:
\begin{eqnarray}
	L(D,\H) &=& -\log P(\H) - \log \left( P(D\given \H) \delta D \right) \\
		 &=& -\log P(\H\given D) + {\rm const}                             .
\end{eqnarray}
 In principle, then, MDL can always be interpreted as Bayesian model
 comparison and {\em vice versa}.  However, this simple discussion has
 not addressed how one would actually evaluate the key data-dependent
 term $L(D\given \H)$, which corresponds to the evidence for $\H$. Often,
 this message is imagined as being subdivided into a parameter block
 and a data block (figure \ref{fig.mdl}). Models with a small number
 of parameters have only a short parameter block but do not fit the
 data well, and so the data message (a list of large residuals) is
 long. As the number of parameters increases, the parameter block
 lengthens, and the data message becomes shorter. There is an optimum
 model complexity ($\H_2$ in the figure) for which the sum is
 minimized.

% these lengths are defined in itprnnchapter
\begin{figure}[t]
\figuremargin{
\small
\begin{center}
\begin{tabular}{rl}
%	& \makebox[0.5\minch]{\ostruta$L(\H)$}  \makebox[0.8\minch]{\ostruta$L(\w^*\given \H)$}  \makebox[3.8\minch]{\ostruta$L(D\given \w^*,\H)$} \\[0.1\minch]
$\H_1$:\ostrutb	& \framebox[0.5\minch]{\ostruta$L(\H_1)$} 
	\framebox[0.9\minch]{\ostruta$L(\w^*_{(1)}\given \H_1)$} 
		\framebox[3.2\minch]{\ostruta$L(D\given \w^*_{(1)},\H_1)$} \\[0.1in]
$\H_2$:\ostrutb	& \framebox[0.5\minch]{\ostruta$L(\H_2)$} 
	\framebox[1.4\minch]{\ostruta$L(\w^*_{(2)}\given \H_2)$} 
		\framebox[2.2\minch]{\ostruta$L(D\given \w^*_{(2)},\H_2)$} \\[0.1in]
$\H_3$:\ostrutb	& \framebox[0.5\minch]{\ostruta$L(\H_3)$} 
	\framebox[2.2\minch]{\ostruta$L(\w^*_{(3)}\given \H_3)$} 
		\framebox[1.8\minch]{\ostruta$L(D\given \w^*_{(3)},\H_3)$} \\
\end{tabular}
\end{center}
}{
\caption[abbrev]{{A popular view of model comparison by \inds{minimum description length}.}
	Each model $\H_i$ communicates the data $D$ by sending the
	identity of the model, sending the best fit parameters of the
	model $\w^*$, then sending the data relative to those
	parameters.  As we proceed to more complex models the length
	of the parameter message increases. On the other hand, the
	length of the data message decreases, because a complex model
	is able to fit the data better, making the residuals
	smaller. In this example the intermediate model $\H_2$
	achieves the optimum trade-off between these two trends.
}
\label{fig.mdl}
}
\end{figure}

 This picture glosses over some subtle issues. We have not specified
 the precision to which the parameters $\bw$ should be sent. This
 precision has an important effect (unlike the precision $\delta D$ to
 which real-valued data $D$ are sent, which, assuming $\delta D$ is
 small relative to the noise level, just introduces an additive
 constant). As we decrease the precision to which $\bw$ is sent, the
 parameter message shortens, but the data message typically lengthens
 because the truncated parameters do not match the data so well. There
 is a non-trivial optimal precision. In simple Gaussian cases it is
 possible to solve for this optimal precision \cite{Wallace_Freeman},
 and it is closely related to the posterior error bars on the
 parameters, $\bAI$, where $\bA = -\grad \grad \ln P(\w\given D,\H)$. It
 turns out that the optimal parameter message length is virtually
 identical to the log of the Occam factor\index{Occam factor} in equation
 (\ref{general.occam}). (The random element involved in parameter
 truncation means that the encoding is slightly sub-optimal.)

 With care, therefore, one can replicate Bayesian results in MDL
 terms.  Although some of the earliest work on complex model
 comparison involved the MDL framework \cite{Patrick_Wallace}, MDL has
 no apparent advantages over the direct probabilistic approach.

 MDL does have its uses as a pedagogical tool.  The description length
 concept is useful for motivating prior probability distributions.
 Also, different ways of breaking down the task of communicating data
 using a model can give helpful insights into the modelling process, 
 as will now be illustrated.
\subsubsection{On-line learning and cross-validation.}
\begin{sloppypar}
 In cases where the data consist of a sequence of points  $D = \bt^{(1)}, \bt^{(2)}, \ldots , \bt^{(N)}$,
 the log evidence can be decomposed as a sum of `on-line' predictive 
 performances: 
\begin{eqnarray}
\hspace*{-1em}
\log P(D\given \H) &=& 
	\log P(\bt^{(1)}\given \H) +
	\log P(\bt^{(2)}\given \bt^{(1)},\H)  \nonumber \\
\hspace*{-1em}
&&	\hspace{-1in} +	\log P(\bt^{(3)}\given \bt^{(1)},\bt^{(2)},\H)  + \cdots 
	+ \log P(\bt^{(\ssN)}\given \bt^{(1)}\ldots \bt^{(\ssNM)},\H)        .
\end{eqnarray}
 This decomposition can be used to explain the difference between the
 evidence and `leave-one-out \ind{cross-validation}' as measures of
 predictive ability.  Cross-validation examines the average value of
 just the last term, $\log P(\bt^{(\ssN)}\given \bt^{(1)}\ldots
 \bt^{(\ssNM)},\H)$, under random re-orderings of the data. The
 evidence, on the other hand, sums up how well the model predicted all
 the data, starting from scratch.
\end{sloppypar}
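
 A toy case may make the decomposition concrete. The model here is
 an assumption chosen for illustration, not one used in the text:
 binary outcomes $t^{(n)} \in \{0,1\}$ with an unknown probability $p$
 of a 1, and a uniform prior on $p$. After seeing $F_1$ ones and $F_0$
 zeros, the predictive probability of a 1 is $(F_1+1)/(F_1+F_0+2)$.
 For the sequence $D = 1, 1, 0$ the on-line terms multiply to
\beq
	P(D\given \H) = \frac{1}{2} \times \frac{2}{3} \times \frac{1}{4} = \frac{1}{12} ,
\eeq
 which agrees with the direct integral
 $\int_0^1 p^2 (1-p) \, \d p = 1/12$.
 Re-ordering the data to $0, 1, 1$ changes the individual terms to
 $\frac{1}{2} \times \frac{1}{3} \times \frac{1}{2}$ but leaves the
 product, and hence the log evidence, unchanged.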

\subsubsection{The `bits-back' encoding method.}
\label{sec.bitsback}
 Another MDL thought experiment \cite{Hinton_bb} involves\index{bits back}\index{Hinton, Geoffrey E.}
 incorporating random bits into our message. The data are communicated
 using a parameter block and a data block. The parameter vector sent
 is a random sample from the posterior,
 $P(\w\given D,\H) =
 P(D\given \w,\H) P(\w\given \H) / P(D\given \H)$. This sample $\w$ is sent to an
 arbitrarily small granularity $\delta \w$ using a message length
 $L(\w\given \H) = -\log [P(\w\given \H) \delta \w]$. The data are encoded
 relative to $\w$ with a message of length $L(D\given \w,\H) = - \log
 [P(D\given \w,\H) \delta D]$. Once the data message has been received, the
 random bits used to generate the sample $\w$ from the posterior can
 be deduced by the receiver.  The number of bits so recovered is
 $-\! \log [P(\w\given D,\H) \delta \w]$. These recovered bits need not count
 towards the message length, since we might use some other
 optimally-encoded message as a random bit string, thereby communicating that
 message at the same time. The net description cost is therefore:
\begin{eqnarray}
	L(\w\given \H) + L(D\given \w,\H) - \mbox{`Bits back'} &=&
	 -\log \frac{ P(\w\given \H) \, P(D\given \w,\H) \, \delta D }{ P(\w\given D,\H)  }     \nonumber  \\
	&=& -\log P(D\given \H)        -\log \delta D             .
\end{eqnarray}
 Thus this thought experiment has yielded the optimal description length.
 Bits-back encoding has been turned into a practical
 compression\index{source code!for complex sources}\index{latent variable model!compression}
 method for data modelled with latent variable models by \citeasnoun{frey-98}.\index{Frey, Brendan J.}
\label{bits_back}
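
 The cancellation in the last step of the bits-back calculation can be
 spelled out as follows; this is just \Bayes\  theorem rearranged, with
 the granularity $\delta \w$ dropping out because it appears in both
 the parameter message and the recovered bits:
\beq
	\frac{ P(\w\given \H) \, P(D\given \w,\H) }{ P(\w\given D,\H) } = P(D\given \H)
	\:\:\:\:\:\: \mbox{for every $\w$ with $P(\w\given D,\H) > 0$} .
\eeq
 In particular, the net cost does not depend on which sample $\w$
 happened to be drawn from the posterior.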

\section*{Further reading}
 Bayesian methods are introduced and contrasted with
 sampling-theory statistics in \cite{Jaynes.intervals,G1,Loredo}. The
 Bayesian Occam's razor is demonstrated on model problems in
 \cite{G1,MacKay92a}. Useful textbooks are
 \cite{Box_and_Tiao_text,Berger}.

 One debate worth understanding is the question of whether
 it's permissible to use \ind{improper prior}s\index{prior!improper}
 in Bayesian inference \cite{dawidJaynes}.
 If we want to do model comparison (as discussed in this chapter),
 it is essential to
 use proper priors -- otherwise the evidences and the
 Occam factors are meaningless. Only
 when one has no intention to do model comparison may it be safe
 to use improper priors, and even in such cases there are
 pitfalls, as Dawid \etal\ explain. I would agree with their
 advice to {\em always use proper priors},
 tempered by an encouragement to be smart when making calculations,
 recognizing
 opportunities for approximation.
% to approximate proper  by improper priors.

\section{Exercises}
\exercisxC{3}{ex.uniformslope}{
	Random variables $x$ come independently from a probability distribution
 $P(x)$.
 According to model $\H_0$,
 $P(x)$ is a uniform distribution
\beq
	P(x\given \H_0) = \frac{1}{2}  \:\:\: \:\:\:\:\:\: x \in (-1,1) .
\eeq
\amarginfignocaption{c}{
\begin{center}\mbox{\epsfbox{metapost/occam.1}}\\[0.23in]
\mbox{\epsfbox{metapost/occam.2}}
\end{center}
%\caption[a]{
%}
}According to model $\H_1$, $P(x)$ is a nonuniform distribution with
 an unknown parameter $m \in (-1,1)$:
\beq
	P(x\given m,\H_1) = \frac{1}{2} (1+m x) \:\:\: \:\:\: \:\:\: x \in (-1,1) .
\eeq
 Given the data $D = \{ 0.3, 0.5, 0.7, 0.8, 0.9\}$,
 what is the evidence for  $\H_0$ and $\H_1$?
}
\exercisxC{3}{ex.slopeornot}{
 Datapoints $(x,t)$ are believed to come from a straight
 line. The experimenter chooses $x$, and
 $t$ is Gaussian-distributed
 about
\beq
	y = w_0 + w_1 x 
\eeq
\amarginfignocaption{b}{
\begin{center}\mbox{\epsfbox{metapost/occam.3}}
\end{center}
%\caption[a]{
%}
}with variance $\sigma_{\nu}^2$.
 According to model $\H_1$, the straight line is horizontal, so $w_1 = 0$.
 According to model $\H_2$, $w_1$ is a parameter with
 prior distribution $\Normal(0,1)$. Both models
 assign a  prior distribution $\Normal(0,1)$ to $w_0$.
 Given the data set
 $D = \{  (-8,8),    (-2,10),    (6,11)\}$,
 and assuming the noise level is $\sigma_{\nu} = 1$,
 what is the evidence for each model?
}
\exercisxC{3}{ex.dicebiased}{
 A six-sided die is rolled 30 times and the
 numbers of times each  face came up were $\bF = \{ 3,3,2,2, 9,11 \}$.
 What is the probability that the die is a perfectly fair die (`$\H_0$'),
 assuming the alternative hypothesis $\H_1$ says that the
 die has a biased distribution $\bp$, and the prior density for $\bp$
 is uniform over  the simplex $p_i \geq 0$, $\sum_i p_i =1$?

 Solve this problem two ways:
 exactly, using the helpful Dirichlet formulae (\ref{eq.dirichletdefn}, \ref{lang.z}),
 and approximately, using Laplace's method. Notice that your choice
 of basis for the \index{Laplace's method}Laplace approximation is important. See \citeasnoun{MacKay96:laplace}
 for discussion of this exercise.
}
\exercisxC{3}{ex.florida}{
 The influence of \ind{race} on the imposition of the \ind{death penalty} for murder
 in \ind{America} has been much studied.
 The following three-way table classifies 326 cases
 in which the defendant was convicted of \ind{murder}.
 The three variables are the defendant's \ind{race}, the victim's race,
 and whether the defendant was sentenced to death.
 (Data from M.~Radelet, `Racial characteristics and imposition of the death penalty,'
 {\em American Sociological Review}, {\bf 46} (1981), pp.\,918--927.)
\begin{center}
\begin{tabular}{rcccrcc}\toprule
\multicolumn{3}{c}{
White defendant    } &&  \multicolumn{3}{c}{
                           Black defendant }
\\ \cmidrule{1-3}\cmidrule{5-7}
&\multicolumn{2}{c}{
Death penalty  } &&& \multicolumn{2}{c}{
                         Death penalty } 
\\  
&Yes & No &&&  Yes &  No \\ \cmidrule{2-3}\cmidrule{6-7}
White victim &19 & 132 && White victim& 11& 52 \\
Black victim &0  &   9 && Black victim& 6 & 97\\
\bottomrule\end{tabular}
\end{center}
%From 1979 to 2001, the state of Florida executed
% fifty-one convicted murderers.
% The dataIs there any racial bias in the decision
% of whether the  murderer receives the death
% penalty?
 It seems that
 the death penalty was applied much more often when the victim was white
 than when the victim was black.
 When the victim was \ind{white} 14\% of defendants got the death penalty,
 but when the victim was \ind{black} 5\% of defendants
 got the \ind{death penalty}.
% And white defendants overwhelmingly (94.3% of cases) killed white victims.
 [Incidentally, these data provide an example of
 a phenomenon known as {\dem\ind{Simpson's paradox}}:
 a higher fraction of white defendants
 are sentenced to death overall, but in cases
 involving    black  victims
 a higher fraction of black defendants are sentenced to death
 and in cases
 involving   white  victims
 a higher fraction of black defendants are sentenced to death.]
\marginfig{
\begin{center}
\begin{tabular}{cc}
\mbox{\epsfbox{metapost/occam.4}}&
\mbox{\epsfbox{metapost/occam.5}}\\
\mbox{\epsfbox{metapost/occam.6}}&
\mbox{\epsfbox{metapost/occam.7}}\\
\end{tabular}
\end{center}
\caption[a]{Four hypotheses concerning the dependence
 of the imposition of the death penalty $d$
 on the race of the victim $v$ and the race of the convicted murderer $m$.
 $\H_{01}$, for example, asserts that the probability
 of receiving the death penalty does depend on the murderer's race,
 but not on the victim's.
}\label{fig.murder}
}

 Quantify the evidence for the four  alternative hypotheses
 shown in \figref{fig.murder}.
 I should mention that I don't believe any of these models
 is adequate: several additional variables are important in murder
 cases, such as whether the victim and murderer knew each other,
 whether the murder was premeditated, and whether the defendant
 had a prior criminal record; none of these variables
 is included in the table.
 So this is an academic exercise in model comparison rather than
 a serious study of racial bias in the state of \ind{Florida}.

 The hypotheses are shown as graphical models, with arrows
 showing dependencies between the variables $v$ (victim race),
 $m$ (murderer race), and $d$ (whether death penalty given).
 Model $\H_{00}$ has only one free parameter, the probability of
 receiving the death penalty; model $\H_{11}$ has four such parameters,
 one for each state of the variables $v$ and $m$. Assign uniform
 priors to these parameters. How sensitive are the conclusions
 to the choice of prior?
}









\dvips
\prechapter{About Chapter}
 The last couple of chapters have assumed that 
 a Gaussian approximation to the probability distribution
 we are interested in is adequate.
 What if it is not?
 We have already seen an example -- clustering -- where
 the likelihood function is multimodal, and has nasty
 unboundedly-high spikes in certain locations in the parameter space;
 so maximizing the posterior probability and fitting
% likelihood
 a Gaussian is not always going to work.
 This difficulty with Laplace's method is one motivation
 for being interested in Monte Carlo methods. In fact, Monte Carlo methods
 provide  a general-purpose  set of tools with  applications
 in Bayesian data modelling and many other fields.

%\begin{quotation}
  This chapter describes a sequence of  methods: 
 {\dbf importance sampling}, {\dbf rejection sampling},  the {\dbf
 Metropolis method}, 
 {\dbf Gibbs sampling} and {\dbf slice sampling}.  For each method, we
 discuss whether the method is expected to be useful for 
 high-dimensional problems such as arise in inference 
 with graphical models. [A graphical
 model is a probabilistic
 model in which dependencies and
 independencies of variables
 are represented by
 edges in a graph
 whose nodes are the variables.]
 Along the way, the terminology of Markov chain 
 Monte Carlo methods is presented. 
% [This unconventional ordering
% has been chosen because the Metropolis  and Gibbs sampling methods
% can be readily understood without knowing the terminology, and 
% concepts such as `ergodicity' and `detailed balance' are probably 
% easiest to learn once the reader has become familiar with 
% some  Markov chains.] 
% The chapter concludes with
 The subsequent chapter gives
 a discussion of advanced methods
% , including methods
 for reducing random walk behaviour.
% Chapter \ref{ch.mcexact} discusses 

 For details of Monte Carlo methods, theorems and proofs and a full list 
 of references, the reader is directed to 
 \citeasnoun{Neal_dop},
 \citeasnoun{MCMC96}, and \citeasnoun{Tanner96}.
%\end{quotation}



 In this chapter 
 I will use the word `\ind{sample}' in the following  sense: 
 a sample from a distribution $P(\bx)$ is a single realization 
 $\bx$ whose probability distribution is $P(\bx)$. 
 This contrasts with the alternative usage   in statistics, 
 where `sample' refers to a collection of realizations $\{ \bx\}$.

 When we discuss transition probability matrices,
 I will use a right-multipli\-cation convention:
 I like my matrices to act to the right, preferring
\beq
	\bu = \bM \bv
\eeq
 to
\beq
	\bu^{\T} = \bv^{\T} \bM^{\T} .
\eeq
 A \ind{transition probability matrix} $T_{ij}$ or $T_{i|j}$
 specifies the probability, given the current state is
 $j$, of making the transition from $j$ to $i$. The columns
 of $\bT$ are probability vectors.
 If we write down a transition probability density,
 we use the same convention for the order of its arguments:
 $T(x';x)$ is a transition probability density from $x$
 to $x'$.  This unfortunately means that
 you have to get used to reading from right to left --
 the sequence $xyz$ has probability $T(z;y) T(y;x) \pi(x)$. 
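
 To make the convention concrete, here is a minimal sketch in Python,
 assuming a made-up three-state chain. The matrix entries and the helper
 names \texttt{column\_sums}, \texttt{step} and \texttt{sequence\_probability}
 are illustrative only; they are not taken from the text.

```python
# Hypothetical 3-state chain. T[i][j] is the probability of moving
# from state j to state i, so each COLUMN of T sums to one.
T = [[0.9, 0.2, 0.0],
     [0.1, 0.7, 0.5],
     [0.0, 0.1, 0.5]]

def column_sums(T):
    """Sum each column of T; every sum should be 1 for a valid T."""
    return [sum(row[j] for row in T) for j in range(len(T[0]))]

def step(T, p):
    """One step of the chain, u = T v: the matrix acts to the right."""
    return [sum(T[i][j] * p[j] for j in range(len(p))) for i in range(len(T))]

def sequence_probability(states, T, pi):
    """Probability of visiting `states` in order. For (x, y, z) this is
    T[z][y] * T[y][x] * pi[x], which reads from right to left."""
    prob = pi[states[0]]
    for prev, curr in zip(states[:-1], states[1:]):
        prob *= T[curr][prev]
    return prob

pi = [1.0, 0.0, 0.0]                      # start in state 0 with certainty
p_next = step(T, pi)                      # equals the first column of T
p_xyz = sequence_probability([0, 1, 2], T, pi)
```

 With these numbers \texttt{p\_next} is the first column of $\bT$, and the
 probability of the sequence $0,1,2$ is
 $T_{2|1} T_{1|0} \pi_0 = 0.1 \times 0.1 \times 1 = 0.01$.
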
% I hope the consistency of this notation is helpful.

% {Monte Carlo methods}
\ENDprechapter
\chapter{Monte Carlo Methods}
\label{ch.mc}
%
%
%
% this includes slice.tex
%
% \section{Motivation}
\newcommand{\intdx}{\int \! \d^N \bx \:}% was \sum_{\bx}
\newcommand{\intdxpp}{\int \! \d^N \bx'' \:}% was \sum_{\bx}
% \newcommand{\citeasnoun}[1]{\citeauthor{#1}\ \shortcite{#1}}
% \newcommand{\quotecite}[1]{\citeauthor{#1}'s\ \shortcite{#1}}
\newcommand{\mcn}{n}
%
%
\section{The problems to be solved}
%
\label{sec.mcproblemsdefined}
\noindent
 \ind{Monte Carlo methods} are computational techniques that
 make use of \ind{random} numbers.
 The aims of Monte Carlo methods are to solve one or both of the following 
 problems.
\begin{description}
\item[Problem 1\mycolon] to generate samples
 $\{ \bx^{(r)}\}_{r=1}^{R}$
 from a given probability distribution $P(\bx)$.
\item[Problem 2\mycolon]
        to estimate expectations of functions under this distribution, 
 for example
\beq
        \Phi = \left< \phi(\bx) \right> \equiv  \intdx\:  P(\bx) \phi(\bx)  .
\label{eq.prob2}
\eeq
\end{description}
%
% we may also be interested in estimating distributions of f(x)
%
%
 The probability distribution  $P(\bx)$, which we  call
 the {\dbf target density}, might be a distribution 
 from statistical physics or  a conditional distribution 
 arising in data modelling -- for example, the posterior probability 
 of a model's parameters given some observed data.
 We will generally
 assume 
 that $\bx$ is an $N$-dimensional  vector with real components $x_n$, 
 but we will  sometimes consider discrete spaces also.

 Simple examples of  functions $\phi(\bx)$  whose
 expectations we might be interested in include the
 first and second moments
 of quantities that we wish to predict, from
 which we can compute 
 means and variances; for example, if some quantity
 $t$ depends on $\bx$, we can find the mean and variance of $t$ under
 $P(\bx)$ by finding the expectations of the functions
 $\phi_1(\bx) = t(\bx)$ and 
 $\phi_2(\bx) = (t(\bx))^2$,
\beq
  \Phi_1 \equiv \Exp [ \phi_1(\bx)  ] \: \mbox{ and } \:
  \Phi_2 \equiv \Exp [ \phi_2(\bx)  ]
 ,
\eeq
 then using
% from which we can obtain
\beq
 \bar{t} = \Phi_1  \: \mbox{ and } \: \var( t ) = \Phi_2 - \Phi_1^2 .
\eeq
% ; all of this chapter's 
% discussions apply to discrete spaces too, with the replacement of 
% `$\intdx$' by $\sum_{\bx}$ throughout.]
 It is assumed that $P(\bx)$ is sufficiently complex that we cannot evaluate
 these expectations by exact methods; so we are interested
 in  Monte Carlo methods.
%
% point out, maybe in pre-chapter, that approximate methods like
% laplace's method are not good in general because typical set aint associated
% with maxima. Show alpha pictures. 
%

 We will concentrate on the first problem (sampling), because 
% One way of solving the second problem (estimation) 
% is to solve the first problem (sampling).
 if we have solved it, then we 
 can solve the second problem by using the random 
 samples $\{ \bx^{(r)}\}_{r=1}^{R}$ to give the estimator
\beq
        \hat{\Phi} \equiv \frac{1}{R} \sum_{r} \phi( \bx^{(r)} ) .
\label{eq.mc.est}
\eeq
 If the vectors $\{ \bx^{(r)}\}_{r=1}^{R}$  are  generated 
 from $P(\bx)$ then the expectation of $\hat{\Phi}$ is $\Phi$. 
 Also, as the number of samples $R$ increases, the variance of $\hat{\Phi}$
 will decrease as $\dfrac{\sigma^2}{R}$, where
 $\sigma^2$ is the variance of $\phi$, 
\beq
        \sigma^2 = \intdx\: P(\bx)  (\phi(\bx)-\Phi)^2 .
\eeq
 This is one of the important properties of Monte Carlo\index{key points!Monte Carlo} 
 methods.\index{Monte Carlo methods!dependence on dimensionality} \medskip

\noindent
\begin{conclusionbox}
{The accuracy of the Monte Carlo estimate 
 (\ref{eq.mc.est}) depends only on the
 variance of $\phi$, not
 on
% is independent of
 the dimensionality of the space sampled.}
 To be precise, the variance of $\hat{\Phi}$
 goes as $\linefrac{\sigma^2}{R}$. So regardless of the 
 dimensionality of $\bx$, it may be that  as few as a dozen {independent\/} 
 samples $\{ \bx^{(r)}\}$ suffice to estimate $\Phi$ satisfactorily. 
\end{conclusionbox}
\medskip

%  This result should be taken with a pinch of salt; 
 We will find later, however, 
 that high dimensionality can cause other difficulties for Monte Carlo 
 methods. Obtaining independent samples from a given distribution $P(\bx)$
 is often not easy.
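
 The claim that the variance of $\hat{\Phi}$ goes as $\sigma^2/R$, whatever
 the dimensionality, can be checked numerically in a case where independent
 samples are easy to obtain. The following Python sketch assumes a target
 that is an $N$-dimensional standard Gaussian and a function
 $\phi(\bx) = (\sum_n x_n / \sqrt{N})^2$, for which $\Phi = 1$ and
 $\sigma^2 = 2$ for every $N$. The target, the function, the sample sizes
 and the names \texttt{mc\_estimate} and \texttt{estimator\_variance} are
 all assumptions made for this illustration.

```python
import math
import random

def mc_estimate(phi, sampler, R):
    """Monte Carlo estimator: average of phi over R independent samples."""
    return sum(phi(sampler()) for _ in range(R)) / R

def estimator_variance(N, R, trials, rng):
    """Empirical variance of the estimator over repeated runs, for an
    N-dimensional standard Gaussian target."""
    def sampler():
        return [rng.gauss(0.0, 1.0) for _ in range(N)]

    def phi(x):
        # (sum_n x_n / sqrt(N))^2 has mean 1 and variance 2 for any N.
        return (sum(x) / math.sqrt(len(x))) ** 2

    estimates = [mc_estimate(phi, sampler, R) for _ in range(trials)]
    mean = sum(estimates) / trials
    return sum((e - mean) ** 2 for e in estimates) / (trials - 1)

rng = random.Random(0)
R = 50
var_low_dim = estimator_variance(N=1, R=R, trials=1000, rng=rng)
var_high_dim = estimator_variance(N=10, R=R, trials=1000, rng=rng)
predicted = 2.0 / R                       # sigma^2 / R = 0.04
```

 Both empirical variances should lie close to \texttt{predicted}, even though
 one run samples a space with ten times as many dimensions as the other.
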

\subsection{Why is sampling from $P(\bx)$ hard?}
 We will assume that the density from which we wish to draw 
 samples, $P(\bx)$, can be  evaluated,  at least to within a multiplicative 
 constant; that is, we can evaluate a function $P^*\!(\bx)$ such that 
\beq
        P(\bx) = P^*\!(\bx)  / Z.
\eeq
 If we can evaluate $P^*\!(\bx)$, why can we not easily solve 
 problem 1? Why is it in general difficult to obtain samples from 
 $P(\bx)$? There are two difficulties. The first is that we typically 
 do not know the normalizing constant
\beq
        Z = \intdx\: P^*\!(\bx) .
\eeq
 The second is that, even if we did know $Z$, the problem of drawing samples
 from $P(\bx)$ is still a challenging one, especially in high-dimensional 
 spaces, because there is no obvious way to sample from $P$
 without
% in general we have to
 enumerating most or 
 all of the possible states.
 Correct samples from $P$ will by definition tend to come
 from places in $\bx$-space where $P(\bx)$ is big; how
 can we identify those places where $P(\bx)$ is big, without
 evaluating $P(\bx)$ {\em{everywhere}}?
 There are only a few high-dimensional densities 
 from which it is easy to 
 draw samples, for example the Gaussian distribution.

\begin{figure}
\figuremargin{\small%
\begin{center}
\mbox{\makebox[-0.25in][l]{\raisebox{-0.1in}{(a)}}\psfig{figure=mc/pstar.ps,angle=-90,width=2.7in}%
\makebox[-0.25in][l]{\raisebox{-0.1in}{(b)}}\psfig{figure=mc/pstar.imp.ps,angle=-90,width=2.7in}}
\end{center}
}{%
\caption[a]{(a) The function $P^*\!(x) =  \exp \!
 \left[ 0.4 (x-0.4)^2 - 0.08 x^4 \right]$. How to draw samples 
 from this density? (b) The function $P^*\!(x)$ evaluated 
 at a discrete set of uniformly spaced points $\{x_i\}$.  How to draw samples 
 from this discrete distribution? 
}
\label{fig.pstar}
}%
\end{figure}
 Let us start with a  simple  one-dimensional 
 example. Imagine that we wish to 
 draw samples from the density $P(x) =  P^*\!(x) / Z$ where
%  set key   ;  set size 0.6,0.6                                
%  set term post
% set output "pstar.ps"
% plot [-5:5][0:3.1] exp(0.4*((x-0.4)**2-0.2*x**4)) t "P*(x)"
%  set term post ; set samples 50 
% set output "pstar.imp.ps"
% plot [-5:5][0:3.1] exp(0.4*((x-0.4)**2-0.2*x**4)) t "P*(x)" w imp
\beq
        P^*\!(x) =  \exp \left[ 0.4 (x-0.4)^2 - 0.08 x^4 \right] , 
                \:\: x \in (-\infty, \infty) .
\eeq
 We can plot this function (\figref{fig.pstar}a).
 But that does not mean we can draw samples from it. To start with,
 we don't know the normalizing constant $Z$. To give ourselves
 a simpler problem,
 we could discretize the variable $x$ and ask for samples from the 
 discrete probability distribution over a finite set of uniformly 
 spaced points $\{x_i\}$  (\figref{fig.pstar}b). How could we solve 
 this problem? If we evaluate $p^*_i = P^*\!(x_i)$ at each point $x_i$,
  we can compute 
\beq
        Z = \sum_i p^*_i 
%  P^*\!(x_i),
\label{eq.Zdirect}
\eeq 
 and
\beq
        p_i = p^*_i / Z
\label{eq.pdirect}
\eeq 
 and we can then sample 
% repeatedly
 from the probability
 distribution $\{ p_i \}$ using various methods based on 
 a source of  random bits (see \secref{sec.ac.efficient}).
% chapter \ref{ch.ac}).
% , of which  reversed arithmetic coding 
% \cite{Rissanen_Langdon:79,arith_coding} is the most efficient
% method in terms of the number of random bits needed.
%
% rad:
%% Arithmetic coding is an efficient way of generating from a finite
%% distribution only if by "efficient" you mean "uses as few random bits
%% as possible".  Usually, one is more concerned with computation time,
%% in which case the "alias method" is much more efficient, assuming one
%% wants to generate many points from the same distribution.  This method
%% generates N points from a distribution with K possible values in
%% O(N+K) time.
%% 
%% 
 But  what is  the cost of this procedure, and how does it scale with the 
 dimensionality of the space, $N$? Let us concentrate on the 
 initial cost of evaluating $Z$  (\ref{eq.Zdirect}). To compute $Z$
 we have to visit every 
 point in the space. In \figref{fig.pstar}b there are 50 
 uniformly spaced points in one dimension.
 If our system had $N$ dimensions,  $N=1000$ say, 
 then the corresponding number of points  would
 be $50^{1000}$, an  unimaginable number of evaluations of $P^*$.
 Even if each component $x_{\mcn}$
%  were discretized to only 2 values,
 took only two discrete values, 
%  $\pm 1$,
 the number of evaluations of  $P^*$ would 
 be $2^{1000}$, a number that is still horribly huge.
 If every electron in the universe (there are about $2^{266}$ of them)
 were  a 1000 gigahertz computer that could evaluate 
 $P^*$ for a trillion ($2^{40}$)
% $10^{12}$ 
 states every second, and if we ran
 those $2^{266}$ 
 computers
  for a time equal 
 to the age of the universe ($2^{58}$ seconds), they
 would still only visit $2^{364}$ states. We'd have to wait 
 for more than $2^{636} \simeq 10^{191}$ universe ages to elapse before 
 all  $2^{1000}$ states had been visited.
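
 The discretized procedure can be sketched in a few lines of Python. I
 assume 50 grid points on the interval $[-5,5]$, matching the figure, and I
 draw samples by inverting the cumulative distribution; that is a simple
 stand-in for the random-bit methods referred to above, not the same
 method. The names \texttt{p\_star} and \texttt{draw} are illustrative.

```python
import bisect
import math
import random

def p_star(x):
    """Unnormalized density P*(x) from the text."""
    return math.exp(0.4 * (x - 0.4) ** 2 - 0.08 * x ** 4)

# 50 uniformly spaced points on [-5, 5] (range and count are assumptions).
xs = [-5.0 + 10.0 * i / 49 for i in range(50)]
p_stars = [p_star(x) for x in xs]

Z = sum(p_stars)                  # computing Z means visiting every point
ps = [p / Z for p in p_stars]     # normalized probabilities p_i

# Cumulative distribution, used for inverse-CDF sampling.
cdf = []
total = 0.0
for p in ps:
    total += p
    cdf.append(total)

def draw(rng):
    """Draw one grid point x_i with probability p_i."""
    u = rng.random()
    i = bisect.bisect_right(cdf, u)
    return xs[min(i, len(xs) - 1)]   # guard against rounding in cdf[-1]

rng = random.Random(1)
samples = [draw(rng) for _ in range(1000)]

# Size of the grid if each of N = 1000 dimensions took 50 values,
# expressed as a power of ten.
log10_points = 1000 * math.log10(50)
```

 Every one of the 50 values of $P^*$ has to be computed before a single
 sample can be drawn, which is what makes the approach hopeless when the
 grid has $50^{1000}$ points.
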

\newcommand{\cents}{\mbox{c}}
 Systems with  $2^{1000}$   states are two a penny.$^{\star}$\marginpar[c]{%
\small\raggedright
 $^{\star}\,$Translation for
 American readers: `such systems  are a dime a dozen'; incidentally,
 this equivalence ($10\cents = 6$p) shows that  the correct exchange rate
 between our currencies
 is $\pounds$1.00 = \$1.67.
%What's more,
% in pre-decimal currency, $10\cents = 6{\rm d}$ gives \$4 to the pound,
% which was the (fixed) exchange rate under the Bretton Woods system that
% was introduced after World War Two, until the devaluations of the 1960s.
 }
 One example is a collection of about 1000 spins, such as  a $30 \times 30$ fragment 
 of an Ising model
% (or `Boltzmann machine' or `Markov field') \cite{yeomans92}
 whose probability distribution
 is proportional to 
\beq
        P^*\!(\bx) =  \exp \! \left[ - \beta E(\bx) \right]
\eeq
 where $x_n \in \{ \pm 1 \}$  and 
\beq
        E(\bx) = - \left[ 
 \frac{1}{2}
                                \sum_{m,n} J_{mn} x_m x_n 
                                + \sum_{n} H_n x_n \right]  . 
\label{eq.ising.eb}
\eeq
%  Non-physicists who are more familiar with neural networks
%  than Ising models can think of this as the probability distribution 
%  of a Boltzmann machine with symmetric weights $J_{mn}$ and 
%  biases $H_n$.
 The energy function $E(\bx)$ is readily evaluated for 
 any $\bx$. But if we wish to evaluate this function at {\em all\/} states
 $\bx$, the computer time required would be  $2^{1000}$ function evaluations.
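
 Evaluating the energy of a single state really is cheap, as the following
 Python sketch of equation (\ref{eq.ising.eb}) suggests. The four-spin ring
 with unit couplings and zero field is a hypothetical example chosen so that
 the answers can be checked by hand; the function names are mine.

```python
import math

def ising_energy(x, J, H):
    """E(x) = -[ (1/2) sum_{m,n} J[m][n] x[m] x[n] + sum_n H[n] x[n] ]."""
    N = len(x)
    coupling = 0.5 * sum(J[m][n] * x[m] * x[n]
                         for m in range(N) for n in range(N))
    field = sum(H[n] * x[n] for n in range(N))
    return -(coupling + field)

def p_star(x, J, H, beta):
    """Unnormalized probability P*(x) = exp(-beta E(x))."""
    return math.exp(-beta * ising_energy(x, J, H))

# Hypothetical 4-spin ring: symmetric unit couplings between neighbours,
# no applied field.
N = 4
J = [[0.0] * N for _ in range(N)]
for n in range(N):
    J[n][(n + 1) % N] = 1.0
    J[(n + 1) % N][n] = 1.0
H = [0.0] * N

E_aligned = ising_energy([1, 1, 1, 1], J, H)         # all spins up
E_alternating = ising_energy([1, -1, 1, -1], J, H)   # neighbours opposed
```

 The ring has four bonds, so the aligned state has energy $-4$ and the
 alternating state has energy $+4$. One evaluation costs of order $N^2$
 operations; it is the number of states, $2^N$, that is prohibitive.
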

 The Ising model is a simple model which has been around for 
 a long time, but the task of
 generating 
 samples from the distribution $P(\bx) = P^*\!(\bx) / Z$ 
 is still an active research area;
% has proved so difficult that  researchers are still actively 
% developing practical  methods for solving it
% are still this problem was published 
 the first `exact' samples
 from this distribution were
 created in the pioneering work of \citeasnoun{Propp1996},
 as we'll describe in \chref{ch.mcexact}.


\subsection{A useful analogy}
\marginfig{\footnotesize
\begin{center}
\mbox{\psfig{figure=figs/lake2.eps,width=1.46in}\raisebox{0.59in}{$P^*(\bx)$}}
\end{center}
\caption[a]{A lake whose depth at $\bx=(x,y)$ is $P^*(\bx)$.}
\label{fig.lake2}
}
 Imagine the tasks of drawing random water samples from
 a \ind{lake} and finding the average \ind{plankton} concentration
 (\figref{fig.lake2}). The depth of the lake\index{depth of lake}
 at $\bx=(x,y)$ is $P^*(\bx)$,
 and we assert (in order to make the analogy work) that the  plankton concentration
 is a function  of $\bx$,  $\phi(\bx)$.
 The required average concentration is an integral like (\ref{eq.prob2}),
 namely
\beq
        \Phi = \left< \phi(\bx) \right> \equiv  \frac{1}{Z} \intdx\:  P^*(\bx) \phi(\bx)  ,
\label{eq.prob2again}
\eeq
 where $Z = \int \! \d x \, \d y \: P^*\!(\bx)$ is the volume of the lake.\index{partition function!analogy with lake}
 You are provided with a boat, a satellite navigation system, and a plumbline.
 Using the navigator, you can take your boat to any desired location $\bx$
 on the map; using the plumbline you can  measure $P^*(\bx)$ at that point.
 You can also measure the plankton concentration there.

 Problem 1 is to draw $1\,\mbox{cm}^3$ water samples at random  from the lake,
 in such a way that each sample is equally likely to come from any point within the
 lake.
 Problem 2 is to find the average plankton concentration.

 These are difficult problems to solve because at the outset we know nothing
 about the  depth $P^*(\bx)$. 
%\begin{figure}[hbtp]
%\figuremargin{
\marginfig{
\mbox{\psfig{figure=figs/lake.eps,width=1.83in}}
%}{
\caption[a]{A slice through a lake that includes some canyons.}
\label{fig.lake1}
}
%\end{figure}
 Perhaps much  of the volume of the   lake is contained in  narrow,
 deep underwater canyons (\figref{fig.lake1}), in which
 case, to correctly sample from the lake and correctly estimate
 $\Phi$ our method  must  implicitly discover the canyons
 and find their volume relative to the rest of the lake.
 Difficult problems, yes; 
% Given that we can't expect to visit every location in the lake, $\bx$, our
% problems seem difficult;
 nevertheless, we'll see that clever
 Monte Carlo methods can solve them.
% both problems.

\subsection{Uniform sampling}
 Having accepted that we cannot exhaustively visit every location $\bx$ in the state space, 
 we might consider trying to solve the second problem (estimating the
 expectation of a function $\phi(\bx)$) by drawing 
 random samples $\{ \bx^{(r)} \}_{r=1}^{R}$
 {\em uniformly\/} from the state space 
 and  evaluating $P^*\!(\bx)$ at those points. Then we could 
 introduce a normalizing constant
% {\em estimate\/} 
 $Z_R$, defined  by
\beq
%       \hat{Z}
        Z_R =  \sum_{r=1}^{R} P^*\!(\bx^{(r)}) ,
\eeq
% where $V$ is the volume of the space $V = \intdx\:1$,
 and estimate $\Phi = \intdx\: \phi(\bx) P(\bx)$ by
\beq
        \hat{\Phi} =  \sum_{r=1}^{R} 
                \phi(\bx^{(r)}) \frac{P^*\!(\bx^{(r)})}{Z_R} .
\eeq
 Is anything wrong with this strategy? Well, it depends on the
 functions $\phi(\bx)$ and $P^*\!(\bx)$. Let us assume that $\phi(\bx)$
 is a benign, smoothly varying function and concentrate on the nature
 of $P^*\!(\bx)$.  As we learnt in  \chapterref{ch.two},
 a high-dimensional distribution is often
 concentrated in a small region of the state space known as its
 typical set
% . For example, if the state $\bx$ contains a large number
% of roughly {\em independent\/} variables, then it follows from the law
% of large numbers that almost all of the probability mass of the
% distribution $P(\bx)$ can be found in a typical set 
 $T$, whose volume
 is given by $|T| \simeq 2^{H(\bX)}$, where $H(\bX)$ is the
% Shannon-Gibbs 
 entropy of the probability distribution $P(\bx)$.
%\beq
% H(\bX) = \sum_{\bx} P(\bx) \log_2 \frac{1}{P(\bx)} . 
%\eeq
 If almost all the probability mass is located in the typical set and
 $\phi(\bx)$ is a benign function,  the value of $\Phi
 = \intdx\: \phi(\bx) P(\bx)$ will be principally determined by the
 values that $\phi(\bx)$ takes on in the typical set. So uniform
 sampling will only stand a chance of giving a good estimate of $\Phi$
 if we make the number of samples $R$ sufficiently large that we are
 likely to hit the typical set at least once or twice.
% a number of times.
 So, how many samples
 are required?
\begin{figure}
\figuremargin{\small%
\begin{mycenter}
\begin{tabular}{c@{$\:\:\:\:\:$}c}
\mbox{{\small(a)}\hspace{-0.32cm}\psfig{figure=figs/Sising.ps,angle=-90,width=2.5in}}%
&
{\small(b)}\raisebox{0.25in}{\framebox{\psfig{figure=../comput/newising_mc/32.32/t2.5.ps,width=1.4in}}}%
%\mbox{\psfig{figure=figs/Sising9.ps,angle=-90,width=3.2in}}%
\\
%\footnotesize(a) &
%\footnotesize(b) \\
\end{tabular}\end{mycenter}
}{%
\caption[a]{(a) Entropy of a 64-spin Ising model
% spin systems
    as a function of temperature. 
%        The entropy of 64 spins arranged in a planar rectangular 
%        lattice with periodic boundary conditions
%        was found as a function of temperature $T$
%        with the nearest neighbour coupling set to $J=+1$.
% (ferromagnet)        and $J=-1$ (antiferromagnet).
%
(b) One state of a 1024-spin Ising model.
%  with 1024 spins
}
% cd ~/_courses/comput/newising_mc/
% i -o 32.32/o -ot 32.32/t -nx 32 -ny 32 -bmin 0.2 -bmax 0.5 -bs 11 -its 130000 -mf 30000
%Right figure, 81 spins.
\label{fig.Sising}
}%
\end{figure}
% cd _courses/comput/newising/r 
% gnuplot
% load '../gnu8'
%
 Let us take the case of the Ising model again. (Strictly,
 the Ising model may  not be a good example, since it doesn't necessarily have
 a typical set, as defined in  \chapterref{ch2}; the definition
 of a typical set was that all states had information content
 $\log 1/P(\bx)$ close to the entropy, which for an Ising model would mean that
 the {\em energy\/} is very close to the {\em mean energy};
 but in the vicinity of phase transitions, the variance of energy,
 also known as the heat capacity, may diverge, which means that the
 energy of a random state is not necessarily expected to be very
 close to the mean energy.)
% ; but let's ignore this.)
 The
 total size of the state space is $2^N$ states, and the typical set
 has size $2^H$. So each sample has a chance of $2^H/2^N$ of falling
 in the typical set.  The number of samples required to hit the
 typical set once is thus of order
\beq
        R_{\min} \simeq 2^{N-H} .
\eeq
% would like \pagebreak here
% \pagebreak[1]
 So, what is $H$?
 At high temperatures, the probability distribution of an Ising 
 model tends to a uniform distribution and the entropy tends to 
 $H_{\max} = N$ bits, which means $R_{\min}$ is of order 1.
 Under these conditions, uniform sampling 
 may well be a satisfactory technique for estimating $\Phi$. 
 But high temperatures are not of great interest. Considerably 
 more interesting are intermediate temperatures such as the critical 
 temperature at which the Ising model melts from an ordered phase
 to a disordered phase.\index{phase transition}
 The critical temperature of an infinite Ising model, at which it melts,
 is $\theta_c=2.27$. 
 At this temperature the entropy of 
 an Ising model is roughly $N/2$ bits  (\figref{fig.Sising}).
% For example,
% if the entropy of the 64-spin model is $32\log(2)$, 
%        the probability mass is concentrated  in a typical
% set of size $2^{32}$ states; this set is a fraction roughly $1/2^{32}$
%        of the total size of the state space.
% 
 For this probability 
 distribution the number of samples required simply to hit the typical set 
 once is of order
\beq
        R_{\min} \simeq 2^{N-N/2} =  2^{N/2}  ,
\eeq
 which for $N=1000$ is about $10^{150}$. This is roughly the square of the 
 number of particles in the universe. Thus uniform sampling 
 is utterly useless for the study of Ising models of modest size. 
 And in most high-dimensional problems, if the distribution 
 $P(\bx)$ is not actually uniform, uniform sampling is unlikely to 
 be useful.
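
 In one dimension, by contrast, uniform sampling works perfectly well, as
 this Python sketch suggests. It applies the estimator defined above to the
 one-dimensional $P^*\!(x)$ of \figref{fig.pstar}, assuming that the state
 space can be truncated to $[-5,5]$ and taking $\phi(x) = x$; both choices,
 and the function names, are assumptions made for illustration. The last
 line redoes the arithmetic for $R_{\min}$ when $N=1000$.

```python
import math
import random

def p_star(x):
    """Unnormalized one-dimensional density used as the example in the text."""
    return math.exp(0.4 * (x - 0.4) ** 2 - 0.08 * x ** 4)

def uniform_sampling_estimate(phi, lo, hi, R, rng):
    """Estimate Phi from R points drawn uniformly on [lo, hi]:
    Z_R = sum_r P*(x_r),  Phi_hat = sum_r phi(x_r) P*(x_r) / Z_R."""
    xs = [rng.uniform(lo, hi) for _ in range(R)]
    p_vals = [p_star(x) for x in xs]
    Z_R = sum(p_vals)
    return sum(phi(x) * p for x, p in zip(xs, p_vals)) / Z_R

def grid_reference(phi, lo, hi, n=20001):
    """Brute-force reference value on a fine grid (feasible only in 1-D)."""
    xs = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    p_vals = [p_star(x) for x in xs]
    return sum(phi(x) * p for x, p in zip(xs, p_vals)) / sum(p_vals)

phi = lambda x: x                          # estimate the mean of x
rng = random.Random(2)
estimate = uniform_sampling_estimate(phi, -5.0, 5.0, 20000, rng)
reference = grid_reference(phi, -5.0, 5.0)

# Samples needed to hit the typical set once when H = N/2 and N = 1000,
# expressed as a power of ten.
log10_R_min = (1000 / 2) * math.log10(2)
```

 Here a sizeable fraction of the uniform samples land where $P^*$ is
 appreciable, so \texttt{estimate} agrees well with \texttt{reference}. In
 the 1000-spin case the corresponding fraction is of order $2^{-500}$.
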
 
% \exercis{ex.}{
%  Prove that the estimator $\hat{Z}$ is an unbiased 
%  estimator for 
%  $Z$. 
% } 
%
\subsection{Overview}
 Having established that drawing samples from a high-dimensional
 distribution  $P(\bx) = P^*\!(\bx) / Z$  is 
 difficult even if $P^*\!(\bx)$ is easy to evaluate, we
 will now study a sequence of more sophisticated Monte Carlo methods: 
%\bit
%\item {\dbf importance sampling},
%\item {\dbf rejection sampling},
%\item the {\dbf Metropolis method}, 
%\item {\dbf Gibbs sampling}, and
%\item {\dbf slice sampling}.
%\eit
%
 {\dbf importance sampling},
 {\dbf rejection sampling},
 the {\dbf Metropolis method}, 
 {\dbf Gibbs sampling}, and
 {\dbf slice sampling}.


\section{Importance sampling}
\label{sec.importance}
 \indexs{Monte Carlo methods!importance sampling}\indexs{importance sampling}
 Importance
 sampling is not a method for generating samples 
 from $P(\bx)$ (problem 1); it is just a method for estimating the
 expectation of a function $\phi(\bx)$ (problem 2). It can be viewed
 as a generalization of the uniform sampling method.
%
% Radford him say
%   Not directly, but you could always sample from the weighted distribution
%  once you have a large number of points, thereby getting something close
%  to a bunch of independent points.
%
%

 For illustrative purposes, 
 let us imagine that the target distribution 
 is a one-dimensional density $P(x)$. Let us 
 assume that we are able to 
 evaluate this density at any chosen point $x$,
 at least to within a multiplicative 
 constant; thus we can evaluate a function $P^*\!(x)$ such that 
\beq
        P(x) = P^*\!(x)  / Z.
\eeq
 But $P(x)$ is too complicated a function for us to be able to sample
 from it directly.  We now assume that we have a simpler 
 density $Q(x)$  from which we {\em can\/} generate samples and
 which we can evaluate to within a   multiplicative 
 constant (that is, we can evaluate $Q^*(x)$, where
 $Q(x) = Q^*(x)/Z_Q$).
%
% Importance sampling
% , like rejection sampling, assumes that we 
% makes use of 
% an approximation $Q(x)$ which is similar to $P(x)$ and which we can draw
% samples from.
% We relax the restriction that an inequality 
% relating $Q$ and $P^*$  must be known.
 An example of the  functions $P^*$, $Q^*$ and $\phi$ is shown in 
 \figref{fig.pq.importance}. 
%\begin{figure}
%\figuremargin{%
\amarginfig{t}{
\begin{center}
\mbox{\epsfbox{metapost/rejection.3}}
%\mbox{\psfig{figure=figs/pq.importance.eps,width=2in,angle=-90}}
\end{center}
%}{%
\caption[a]{{Functions involved in importance sampling.}
 We wish to estimate the expectation of $\phi(x)$ under $P(x)\propto P^*\!(x)$. 
 We can generate samples from the simpler distribution $Q(x) \propto Q^*(x)$.
 We can evaluate $Q^*$ and $P^*$
 at any point. 
}
\label{fig.pq.importance}
}%
%\end{figure}
 We call $Q$ the {\dem\ind{sampler density}}.
%%  [The methods that 
%%  follow will work even if the sampler density
%%  is not normalized, that is, we can 
%%  only evaluate $Q^*(x)$, which is proportional to $Q(x)$.] 

\newcommand{\xfromq}{x}% was _q
 In {importance sampling}, we generate $R$ samples 
 $\{\xfromq^{(r)}\}_{r=1}^R$ from $Q(x)$.
 If these points were samples from $P(x)$  then 
 we could estimate $\Phi$
 by \eqref{eq.mc.est}.
 But when we generate samples from $Q$, values of $x$ where 
 $Q(x)$ is greater than $P(x)$ will be {\em over-represented\/} in 
 this estimator, and points where $Q(x)$ is less than $P(x)$ 
 will be {\em under-represented}.
 To take into account the fact that we have sampled from 
 the wrong distribution, we introduce {\dem{weights}}\index{weight!importance sampling} 
\beq
        w_r \equiv \frac{ P^*\!(\xfromq^{(r)}) }{ Q^*(\xfromq^{(r)}) }
\label{eq.mc.is.weight.def}
\eeq
 which we use to adjust the `importance' of each point 
 in our estimator thus:
\beq
        \hat{\Phi} \equiv \frac{ \sum_{r} w_r \phi( \xfromq^{(r)} ) }{ \sum_r w_r } .
\label{eq.is}
\eeq
\exercissxB{2}{ex.Phiconverge}{
 Prove that,
 if $Q(x)$ is non-zero for all $x$ where $P(x)$ is non-zero,
 the estimator $\hat{\Phi}$ converges to $\Phi$, the mean
 value of $\phi(x)$, as $R$ increases.
  What is the variance of this estimator, asymptotically?  
 Hint: consider the statistics of the numerator and the denominator
 separately. 
 Is the estimator   $\hat{\Phi}$  an unbiased estimator for small $R$?
%
% 
% Show that the estimator also works if the normalization constant $Z_Q$
% of $Q(x)$ is unknown -- that is, if we can draw samples from $Q(x)$, 
% but we can only evaluate $Q^*(x)$, where $Q(x) = Q^*(x)/Z_Q$. 
} 
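
 The estimator (\ref{eq.is}) takes only a few lines to implement. The
 following Python sketch assumes the one-dimensional target $P^*\!(x)$ of
 \figref{fig.pstar}, a Cauchy sampler density of scale 2, and
 $\phi(x) = x$; these choices and all the function names are mine, made for
 illustration. Notice that only the unnormalized functions $P^*$ and $Q^*$
 are ever evaluated.

```python
import math
import random

def p_star(x):
    """Unnormalized target density P*(x) (the one-dimensional example)."""
    return math.exp(0.4 * (x - 0.4) ** 2 - 0.08 * x ** 4)

def q_star(x, scale=2.0):
    """Unnormalized Cauchy sampler density Q*(x); the scale is an assumption."""
    return 1.0 / (1.0 + (x / scale) ** 2)

def sample_q(rng, scale=2.0):
    """Draw from the Cauchy density Q(x) by inverting its CDF."""
    return scale * math.tan(math.pi * (rng.random() - 0.5))

def importance_estimate(phi, p_star, q_star, sample_q, R, rng):
    """Phi_hat = sum_r w_r phi(x_r) / sum_r w_r, with w_r = P*(x_r)/Q*(x_r)."""
    numerator = 0.0
    denominator = 0.0
    for _ in range(R):
        x = sample_q(rng)
        w = p_star(x) / q_star(x)
        numerator += w * phi(x)
        denominator += w
    return numerator / denominator

rng = random.Random(3)
estimate = importance_estimate(lambda x: x, p_star, q_star, sample_q,
                               R=20000, rng=rng)
```

 Because the weights appear in both numerator and denominator, multiplying
 $P^*$ or $Q^*$ by any constant leaves the estimate unchanged, which is why
 neither $Z$ nor $Z_Q$ is needed.
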
% \exercisa
% If $Q(x)$ is non-zero for all $x$ where $P(x)$ is non-zero, 
% it can be proved that the estimator $\hat{\Phi}$ converges to $\Phi$, 
% the mean value of $\phi(x)$, as $R$ increases.
% The estimator also works if the normalization constant $Z_Q$
% of $Q(x)$ is unknown -- that is, if we can draw samples from $Q(x)$, 
% but can only evaluate $Q^*(x)$, where $Q(x) = Q^*(x)/Z_Q$. 
%
% Your presentation of importance sampling normalizes the weights. Of course,
% you don't have to normalize the weights and it would seem useful to
%  discuss this issue. 
% Also, your section on importance sampling should discuss Rubin's  Sampling
% Importance Resampling (SIR) since its a nice bridge between rejection
% sampling and importance sampling.
% 
 A practical difficulty with importance sampling is that it is hard to
 estimate how reliable the estimator $\hat{\Phi}$ is.  The variance of the
 estimator is unknown beforehand,
 because it depends on an integral over $x$ of a
 function involving $P^*\!(x)$. And the variance of
 $\hat{\Phi}$ is hard to estimate, because the empirical variances of
 the quantities $w_r$ and $w_r \phi( \xfromq^{(r)} )$ are not necessarily
 a good guide to the true variances of the numerator and denominator
 in \eqref{eq.is}. If the sampler density $Q(x)$ is small in a region
 where $|\phi(x)P^*\!(x)|$ is large then it is quite possible, even after many
 points $\xfromq^{(r)}$ have been generated, that none of them will have
 fallen in that region. In this  case the  estimate of $\Phi$ would
 be  drastically wrong, and there would be  no indication in the {\em{empirical}\/}
 variance that the
 true variance of the estimator $\hat{\Phi}$ is large.\index{caution!importance sampling}

\newcommand{\FIGTOY}{/home/mackay/aa/ps}
\begin{figure}[bthp]
\figuremargin{\small%
\begin{mycenter}
\mbox{%
\raisebox{0.32in}{\makebox[0in][l]{\small(a)}}%
\hspace{-0.012in}%
\psfig{figure=\FIGTOY/demo.is.norm.ps,height=1.8in,angle=-90}\hspace{0.12in}
\raisebox{0.32in}{\makebox[0in][l]{\small(b)}}%
\hspace{-0.092in}%
\psfig{figure=\FIGTOY/demo.is.cauchy.ps,height=1.8in,angle=-90}}\\[-0.2in]
\end{mycenter}
}{%
\caption[a]{Importance sampling in action: (a) using a Gaussian sampler density; 
 (b) using a  \index{Cauchy distribution}Cauchy
 sampler density.  Vertical
 axis shows the estimate $\hat{\Phi}$. The horizontal line indicates 
 the true value of $\Phi$.
 Horizontal axis shows number of samples
 on a log scale.}
\label{fig.is}
}%
\end{figure}
\subsection{Cautionary illustration of importance sampling}
 In a toy problem 
 related to the modelling of \ind{amino acid} probability distributions 
 with a one-dimensional variable $x$,
%
% I am pretty sure nearly all the figs in _doc/proteins/amino_is
% correspond to one latent variable
%
 I evaluated a quantity of interest using importance sampling. 
 The results using a Gaussian sampler 
% $Q(x)$
 and a  Cauchy sampler are shown in 
 \figref{fig.is}. The horizontal axis shows the number of\pagebreak[1]
 samples 
 on a log scale. In the case of the
 Gaussian sampler, after about 500 samples had been evaluated
 one might be tempted to call a halt; but evidently 
 there are infrequent samples 
 that make a huge contribution to $\hat{\Phi}$, and the value of the estimate at 
 500 samples is wrong. Even after a million samples have been  
 taken, the estimate has still not settled down close to the true value.
 In contrast, the Cauchy sampler does not suffer from glitches; it converges 
 (on the scale shown here) after about 5000 samples. 

 This example illustrates the fact that an importance sampler should have 
 {\bf heavy tails}. 
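
 The estimator with a heavy-tailed sampler can be sketched in Python as
 follows. This is a minimal sketch under stated assumptions, not the
 computation behind \figref{fig.is}: the target (an unnormalized standard
 Gaussian), the function $\phi(x) = x^2$, the Cauchy sampler, and all
 names such as \verb|importance_estimate| are chosen purely for illustration.
 The weights are normalized by their sum, so neither $Z$ nor $Z_Q$ is needed.

```python
import math
import random

def importance_estimate(phi, p_star, q_star, q_sample, R, rng):
    """Importance-sampling estimate of the mean of phi under P, using R samples from Q."""
    num = 0.0  # running sum of w_r * phi(x_r)
    den = 0.0  # running sum of w_r
    for _ in range(R):
        x = q_sample(rng)
        w = p_star(x) / q_star(x)  # weight; unknown normalizers cancel in num / den
        num += w * phi(x)
        den += w
    return num / den

# Assumed target for illustration: P*(x) = exp(-x^2 / 2).
p_star = lambda x: math.exp(-0.5 * x * x)
# Assumed heavy-tailed sampler: standard Cauchy, Q*(x) = 1 / (1 + x^2),
# sampled by inverting its cumulative distribution.
cauchy_star = lambda x: 1.0 / (1.0 + x * x)
cauchy_sample = lambda rng: math.tan(math.pi * (rng.random() - 0.5))

rng = random.Random(0)
estimate = importance_estimate(lambda x: x * x, p_star, cauchy_star,
                               cauchy_sample, 20000, rng)
print(estimate)  # the true mean of x^2 under a standard Gaussian is 1
```

 Because the Cauchy density falls off more slowly than the target, the
 weights in this sketch are bounded, and no single sample can dominate the sums.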

\exercissxA{2}{ex.peakysample}{
 Consider the situation where $P^*\!(x)$ is multimodal, consisting
 of several widely-separated peaks.
%  for example,  a mixture of Gaussians, widely separated.
 (Probability distributions
 like this arise frequently in statistical data modelling.)
 Discuss whether it is a wise strategy to do importance sampling 
 using a sampler $Q(x)$ that is a unimodal distribution fitted
 to  one of these peaks. 
\marginfig{
\hspace*{-0.2in}\psfig{figure=figs/gmixture.pqf.ps,angle=-90,width=2.24in}
\caption[a]{A multimodal distribution  $P^*\!(x)$  and a unimodal sampler   $Q(x)$.}
}
 Assume that the function $\phi(x)$ whose mean $\Phi$ is to be estimated
 is a smoothly varying function of $x$ such as $mx+c$.  Describe 
 the typical evolution of the estimator $\hat{\Phi}$ as a function of 
 the number of samples $R$.
}
\subsection{Importance sampling in many dimensions}
 We have already observed that care is needed in one-dimensional
 importance sampling problems. Is importance sampling a useful 
 technique in spaces of higher dimensionality, say $N=1000$?

 Consider a simple case-study where the target density $P(\bx)$
 is a uniform distribution inside a sphere,
\beq
        P^*\!(\bx) = \left\{ \begin{array}{cl} 1 & 0 \leq \rho(\bx) \leq R_P \\
                                        0 & \rho(\bx) > R_P,  \end{array}
\right. 
\eeq 
 where $\rho(\bx) \equiv (\sum_i x_i^2 )^{1/2}$, 
 and the proposal density is a Gaussian centred on the origin,
\beq
        Q(\bx) = \prod_i \Normal(x_i ; 0,\sigma^2 ) .
\eeq
 An importance-sampling method will be in trouble if the estimator $\hat{\Phi}$
 is dominated by a few large weights $w_r$. 
 What will be the typical range of values of the weights $w_r$? 
 We know from our discussions of typical sequences in \partone\ 
 -- see  \extwentyseven, for example --
% by the law of large numbers
  that if $\rho$ is the  distance
%  By the central-limit theorem (see \extwentyseven, for example),
% if $\rho$ is the  distance 
 from the origin of  a sample from $Q$, 
 the quantity $\rho^2$ has a roughly Gaussian distribution with mean 
 and standard deviation:
% is very likely to have a distance 
% $\rho$ from the origin that satisfies
\beq
        \rho^2 \sim
   N \sigma^2 \pm  \sqrt{2 N} \sigma^2 .
\eeq
% where $z$ is a constant equivalent to the $\beta$ that controlled 
%  the size of our typical set. If $z=2$ then there is a 95\% chance 
%  that $\rho^2$ will lie in the above interval.
 Thus almost all samples from $Q$ lie in a \ind{typical set} with distance 
 from the origin very close to $\sqrt{N} \sigma$. 
 Let us assume that $\sigma$ is chosen such that 
 the typical set of  $Q$ lies 
%  almost all typical samples from $Q$
 inside the sphere of radius $R_P$. [If it does not, 
 then the law of large numbers implies that almost all the samples 
 generated from $Q$ will fall outside $R_P$ and will have weight zero.]
% 
 Then we know that most samples from $Q$ will have a value of $Q$
 that lies in   the range
\beq
        \frac{1}{({2 \pi \sigma^2})^{N/2}} \exp \left( -\frac{N}{2}
                \pm \frac{\sqrt{2 N}}{2} \right) .
\eeq
 Thus the weights $w_r=P^*/Q$ will typically have values in the range
\beq
        {({2 \pi \sigma^2})^{N/2}} \exp \left( \frac{N}{2}
                \pm \frac{\sqrt{2 N}}{2} \right) .
\label{weightrange}
\eeq
 So if we draw a hundred samples, what will the typical range of weights be?
 We can roughly estimate the ratio of the largest weight to the median
 weight by doubling the standard deviation in 
 \eqref{weightrange}.
% Taking the two-standard-deviation points, 
%  A value of $z$ equal to 2 gives a reasonable ball-park figure, and 
% we find: 
%\begin{description}
%\item[The largest weight and the median weight] {\bf will typically be in the 
% ratio:}
 The largest weight and the median weight will typically be in the 
 ratio:
\beq
\frac{w_r^{\max}}{w_r^{{\rm med}}} = \exp  \left(  \sqrt{2 N} \right) .
\eeq
%\end{description}
 In $N=1000$ dimensions therefore, the largest weight after one hundred
 samples is likely to be roughly  $10^{19}$ times greater 
 than the median weight.  
%
 Thus an importance sampling estimate for a high-dimensional problem 
 will very likely be utterly dominated by a few samples with huge 
 weights.
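
 As a check on the factor $10^{19}$ quoted above, for $N=1000$ we have
\beq
	\sqrt{2N} = \sqrt{2000} \simeq 44.7 ,
	\:\:\:\: \mbox{so} \:\:\:\:
	\exp \left( \sqrt{2N} \right) \simeq 10^{44.7/\ln 10} \simeq 10^{19.4} .
\eeq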
% Also, there are nice effective-sample-size heuristics that you might like
% to mention. See, for example, Kong, Liu and Wong's JASA paper on sequential
% imputation. 
%
% Radford said:
%
%  What happens if we pick sigma optimally?  Is importance sampling
%  still bad?
% 

% note with uniform sampling, the problem was to hit the typical set
% here the problem is, even if we hit the typical set, the probabilities
% of states 
% within the typical set vary by considerable factors. 

 In conclusion, importance sampling in high dimensions often suffers from 
 two difficulties. First, we  need to obtain samples that
 lie in the typical  set of $P$, and this may take a long time unless\index{approximation!of complex distribution} 
 $Q$ is a good approximation to $P$. Second, even if we obtain samples 
 in the typical set, the weights associated with those samples 
 are likely to vary by large factors, because the probabilities 
 of points in a typical set, although similar to each other, 
 still differ by factors  of order $\exp(\sqrt{N})$,
 so the weights will too, unless $Q$ is a near-perfect
 approximation to $P$.
% could quantify - time to get samples in the typical set relates to DKL?
% difference in weights relates to ... ?
 
 
\begin{figure}
\figuremargin{%
\begin{center}\footnotesize
\begin{tabular}{cc}
\raisebox{1.15in}{\makebox[0in][l]{\footnotesize(a)}}
\mbox{\epsfbox{metapost/rejection.1}}
%\psfig{figure=figs/pq.rejection.eps,width=2.25in,angle=-90}
& 
\raisebox{1.15in}{\makebox[0in][l]{\footnotesize(b)}}
\mbox{\epsfbox{metapost/rejection.2}}
%\psfig{figure=figs/pq.rejection.shade.eps,width=2.25in}\\
\end{tabular}
\end{center}
}{%
\caption[a]{{Rejection sampling.\indexs{rejection sampling}\index{rejection}}
%
 (a) The functions involved in {rejection sampling}.
 We desire samples from $P(x) \propto P^*\!(x)$. We are able to draw 
 samples from $Q(x)  \propto Q^*(x)$, and we know a value $c$ such that 
 $c\,Q^*(x) > P^*\!(x)$ for all $x$.
%
 (b) A point $(\xfromq,u)$ is generated 
 at random in the lightly shaded
 area under the curve $c\,Q^*(x)$. If this point also 
 lies below $P^*\!(x)$ then it is accepted.}
\label{fig.pq.rejection}
\label{fig.pq.rejection.xu}
\label{fig.pq.rejection.shade}
}%
\end{figure}
\section{Rejection sampling}
\label{sec.rejection}
 \indexs{Monte Carlo methods!rejection sampling}%\indexs{rejection sampling}
 We
 assume again a one-dimensional density
% \beq
        $P(x) = P^*\!(x)  / Z$
% \eeq
 that is too complicated a function for us to be able to sample
 from it directly.  We assume that we have a simpler {\em \inds{proposal
 density}\/} $Q(x)$ which  we can evaluate (within a multiplicative factor
 $Z_Q$, as before),
 and from which we can generate samples.  We further assume that we
 know the value of a constant $c$ such that 
\beq
 c\, Q^*(x) > P^*\!(x) , \:\: \mbox{for all $x$}. 
\eeq
% For rejection sampling to work $Q(x)$ should be similar to $P(x)$.
 A schematic picture of the two functions is shown in
 \figref{fig.pq.rejection}a. 

 We generate two random numbers. The
 first, $\xfromq$, is generated from the proposal density $Q(x)$.  We then evaluate
 $c\,Q^*(\xfromq)$ and generate a uniformly distributed random variable $u$
 from the interval $[0,c\,Q^*(\xfromq)]$.  These two random numbers can be
 viewed as selecting a point in the two-dimensional plane as shown in
 \figref{fig.pq.rejection.xu}b.

 We now evaluate $P^*\!(\xfromq)$ and accept or reject the sample $\xfromq$ by
 comparing the value of $u$ with the value of $P^*\!(\xfromq)$. If $u >
 P^*\!(\xfromq)$ then $\xfromq$ is rejected; otherwise it is accepted, which
 means that we add $\xfromq$ to our set of samples $\{ x^{(r)} \}$. The
 value of $u$ is discarded.

 Why does this procedure generate samples from $P(x)$?  The proposed point
 $(\xfromq,u)$ comes with uniform probability from the lightly shaded
 area underneath the curve $c\,Q^*(x)$ as shown in
 \figref{fig.pq.rejection.shade}b.  The rejection rule rejects all the
 points that lie above the curve $P^*\!(x)$. So the points $(x,u)$
 that are
 accepted are uniformly distributed in the heavily shaded area under
 $P^*\!(x)$. This implies that the probability density of the
 $x$-coordinates of the accepted points must be proportional to
 $P^*\!(x)$, so the samples must be independent samples  from $P(x)$.
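
 The procedure just described can be sketched in Python as follows.
 This is a minimal sketch rather than a definitive implementation: the
 target $P^*\!(x) = \exp(-x^2/2)$, the Cauchy proposal
 $Q^*(x) = 1/(1+x^2)$, the constant $c = 1.22$, and the name
 \verb|rejection_sample| are all assumptions made for illustration.

```python
import math
import random

def rejection_sample(p_star, q_star, q_sample, c, n, rng):
    """Return n accepted samples and the number of proposals that were needed.

    Assumes c * q_star(x) >= p_star(x) for all x.
    """
    samples = []
    proposals = 0
    while len(samples) < n:
        x = q_sample(rng)                    # first random number: x drawn from Q
        u = rng.uniform(0.0, c * q_star(x))  # second: u uniform on [0, c Q*(x)]
        proposals += 1
        if u <= p_star(x):                   # the point (x, u) lies under P*: accept
            samples.append(x)                # u itself is discarded
    return samples, proposals

# Assumed target and proposal, chosen for illustration only.
p_star = lambda x: math.exp(-0.5 * x * x)
q_star = lambda x: 1.0 / (1.0 + x * x)
q_sample = lambda rng: math.tan(math.pi * (rng.random() - 0.5))
# (1 + x^2) exp(-x^2 / 2) has maximum 2 / sqrt(e), about 1.213, so c = 1.22 suffices.
c = 1.22

rng = random.Random(0)
samples, proposals = rejection_sample(p_star, q_star, q_sample, c, 5000, rng)
acceptance_rate = len(samples) / proposals
print(acceptance_rate)
```

 In this one-dimensional sketch roughly two proposals in three are
 accepted, because the area under $P^*$ is a sizeable fraction of the area
 under $c\,Q^*$.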

 Rejection sampling will work best if $Q$ is a good approximation to
 $P$. If $Q$ is very different from $P$ then,
 for $c\,Q$ to exceed $P$ everywhere,
 $c$ will necessarily have
 to be large and the frequency of rejection will be large.

%\begin{figure}
%\figuremargin{%
\marginfig{
\[
\hspace*{-0.2in}\psfig{figure=figs/grejection.ps,angle=-90,width=2.24in}
\]
%}{%
\caption[a]{A Gaussian $P(x)$ and a slightly broader Gaussian  $Q(x)$
 scaled up by a factor $c$ such that $c\,Q(x) \geq P(x)$.}
\label{fig.grejection}
}%
%\end{figure}
%
\subsection{Rejection sampling in many dimensions}
 In a high-dimensional problem it is very likely that the 
 requirement that $c\,Q^*$ be an upper bound for $P^*$ will force
 $c$ to be so huge that acceptances\index{acceptance rate} will be very rare indeed.
 Finding such a value of $c$ may be difficult too, since in many 
 problems we know neither where the modes of $P^*$  are located 
 nor how high they are.
%beforehand

 As a case study, consider a pair of 
 $N$-dimensional Gaussian distributions with mean zero 
 (\figref{fig.grejection}). Imagine 
 generating samples from one with standard deviation $\sigma_Q$
 and using rejection sampling to obtain samples from the other 
 whose  standard deviation is $\sigma_P$.  Let us assume that these 
 two standard deviations are close in value -- say, 
 $\sigma_Q$ is 1\%
% one \percent\
 larger than $\sigma_P$. [$\sigma_Q$ must 
 be larger than $\sigma_P$ because if this is not the case, there
 is no $c$ such that $c\,Q$ exceeds $P$ for all $\bx$.]
 So, what  value of $c$ is required  if the dimensionality is $N=1000$?
 The density of $Q(\bx)$ at the origin is $1/({2 \pi \sigma_Q^2})^{N/2}$, 
 so for $c\,Q$ to exceed $P$  we need to set 
\beq
        c = \frac{({2 \pi \sigma_Q^2})^{N/2}}{({2 \pi \sigma_P^2})^{N/2}}
                = \exp \left( {N} \ln \frac{ \sigma_Q }{ \sigma_P} \right) .
\eeq
 With $N=1000$ and $\frac{ \sigma_Q }{ \sigma_P}=1.01$, we find 
 $c=\exp(10)\simeq 20{,}000$.
 What will the acceptance rate\index{acceptance rate} be for this value of $c$? 
%  The typical 
%  sample from $Q$ has $Q \simeq \frac{1}{({2 \pi \sigma_Q^2})^{N/2}} e^{-N/2}$, 
%  so that $c\,Q \simeq \frac{1}{({2 \pi \sigma_P^2})^{N/2}} e^{-N/2}$.
%  At this typical sample, the value of $P$ will be roughly
%  $\frac{1}{({2 \pi \sigma_P^2})^{N/2}} \exp \left( 
%  \frac{-N\sigma_Q^2}{2\sigma_P^2} \right)$, which is smaller than $c\,Q$ 
%  by the factor
% \beq
%  \left.\frac{P}{cQ}\right|_{\rm typ} = \exp \left[ 
%  -\frac{N}{2}\left(
%  \frac{\sigma_Q^2}{\sigma_P^2} - 1 \right) \right]
%  , 
% \eeq 
%  which is roughly $c$
 The answer is immediate: since the acceptance rate is the ratio of the 
 volume under the curve $P(\bx)$ to the volume under $c\,Q(\bx)$, 
 the fact that $P$ and $Q$ are both normalized here implies that the acceptance 
 rate will be $1/c$; for our case study, this is
 $1/20{,}000$.
 In general, $c$ grows exponentially with the dimensionality $N$,
 so the acceptance rate is expected to be exponentially small in $N$. 
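
 The arithmetic of this case study is easy to reproduce. The following
 Python fragment is a small sketch; the function name
 \verb|rejection_constant| is an assumption made for illustration.

```python
import math

def rejection_constant(N, sigma_ratio):
    """c = (sigma_Q / sigma_P)^N for two zero-mean spherical Gaussians in N dimensions."""
    return math.exp(N * math.log(sigma_ratio))

c = rejection_constant(1000, 1.01)
acceptance_rate = 1.0 / c  # both densities are normalized, so the rate is 1/c
print(c, acceptance_rate)

# The same 1% mismatch is harmless in low dimensions:
for N in (1, 10, 100, 1000):
    print(N, rejection_constant(N, 1.01))
```

 The loop makes the exponential growth of $c$ with $N$ visible: the same
 1\% mismatch in standard deviation that costs almost nothing for $N=1$
 costs a factor of about $\exp(10)$ for $N=1000$.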

 Rejection sampling, therefore, whilst a useful method for 
 one-dimensional problems, is not expected to be a practical technique for 
 generating samples from high-dimensional distributions $P(\bx)$. 

\section{The Metropolis--Hastings method}
\label{sec.metropolis}
\label{sec.metrop}
 \indexs{Monte Carlo methods!Metropolis--Hastings}%
\indexs{Monte Carlo methods!Markov chain Monte Carlo}%
%\indexs{Markov chain Monte Carlo}%\indexs{Metropolis method}
 Importance
 sampling and rejection sampling  work well only
 if the proposal density $Q(x)$ is similar to $P(x)$. 
 In large and complex problems it is difficult to create a single 
 density $Q(x)$ that has this property. 

%\begin{figure}
%\figuremargin{%
\amarginfig{c}{
\begin{center}\small
%\mbox{\small
\begin{tabular}{@{}c@{}}
\setlength{\unitlength}{0.7mm}% was 1mm then 0.75 (for a nice fit to textwidth)
\begin{picture}(75,40)(0,0)
\put(0,0){\makebox(0,0)[bl]{\psfig{figure=figs/pq.metrop.eps,%
width=2.1in,angle=-90}}}% was width 3
\put(73,-1){\makebox(0,0)[t]{$x$}}
\put(13,-1){\makebox(0,0)[t]{$x^{(1)}$}}
\put(17,38){\makebox(0,0)[l]{$Q(x;x^{(1)})$}}
\put(42,15){\makebox(0,0)[l]{$P^*\!(x)$}}
\end{picture}\\[0.2in]%\hspace{0.2in} 
\setlength{\unitlength}{0.7mm}% was 1mm then 0.75 (for a nice fit to textwidth)
\begin{picture}(75,40)(0,0)
\put(0,0){\makebox(0,0)[bl]{\psfig{figure=figs/pq.metropb.eps,%
width=2.1in,angle=-90}}}% was width 3 then 2.25 (for a nice fit to textwidth)
\put(73,-1){\makebox(0,0)[t]{$x$}}
\put(51,-1){\makebox(0,0)[t]{$x^{(2)}$}}
\put(51,36){\makebox(0,0)[l]{$Q(x;x^{(2)})$}}
\put(20,15){\makebox(0,0)[r]{$P^*\!(x)$}}
\end{picture}
%}
\end{tabular}
\end{center}
%}{%
\caption[a]{{Metropolis--Hastings method in one dimension.} The proposal distribution 
 $Q(x';x)$ is here shown as having a shape that changes as $x$ changes, 
 though this is not typical of the proposal densities 
 used in practice.}
\label{fig.pq.metrop}
}%
%\end{figure}
 The Metropolis--Hastings algorithm instead makes use of a 
 \ind{proposal density} $Q$ 
 {\em which depends on the current state\/} $x^{(t)}$. The 
 density $Q(x';x^{(t)})$ might
% in the simplest case
 be a simple distribution such as a Gaussian
 centred on the current $x^{(t)}$.
 The proposal density $Q(x';x)$ can be {\em any\/} 
 fixed density from which we can draw samples. In contrast
 to importance sampling and rejection sampling,
 it is not necessary that $Q(x';x^{(t)})$  look at all similar 
 to $P(x)$ in order for the algorithm to be practically useful.
 An example of a proposal density is shown in  \figref{fig.pq.metrop};
 this figure shows the density $Q(x';x^{(t)})$ for two different 
 states $x^{(1)}$ and $x^{(2)}$. 

 As before, we assume that we can evaluate $P^*\!(x)$
 for any $x$. 
 A tentative new state $x'$ is generated from the proposal density
 $Q(x';x^{(t)})$. To decide whether to accept the new state, we compute 
 the quantity
\beq
%  P({\rm accept
 a = 
% \min \left( 1, 
        \frac{ P^*\!(x') }{ P^*\!(x^{(t)}) }
        \frac{ Q(x^{(t)};x') }{ Q(x';x^{(t)}) } .
%       \right)
\label{eq.ratio.metrop}
\eeq
\[
\begin{array}{l}
\mbox{{\sf If} $a\geq 1$ then the new state is accepted.}
\\
\mbox{{\sf Otherwise}, the new state  is accepted with probability $a$.}\\[0.1in]
\mbox{If the step is accepted,  we set $x^{(t+1)} = x'$.}  \\
\mbox{If the step is rejected,\index{rejection} then we set $x^{(t+1)} = x^{(t)}$. }
\end{array}
\]
Note the difference from rejection sampling: in rejection sampling, 
 rejected points are discarded and have no influence on the 
 list of samples $\{x^{(r)}\}$ that we collected. Here, a rejection 
 causes the current state to be written again onto the list.
% of points  another time. 
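
 The acceptance rule can be sketched in Python as follows. This is a
 minimal sketch under stated assumptions: the target
 $P^*\!(x) = \exp(-x^2/2)$, the Gaussian proposal with unit step size, and
 the names \verb|metropolis_hastings|, \verb|q_sample| and \verb|q_density|
 are all chosen for illustration. The function \verb|q_density(x_to, x_from)|
 stands for $Q(x_{\rm to}; x_{\rm from})$ and need only be known up to a constant.

```python
import math
import random

def metropolis_hastings(p_star, q_sample, q_density, x0, T, rng):
    """Run T iterations from x0 and return the list of states visited."""
    x = x0
    chain = []
    for _ in range(T):
        x_new = q_sample(x, rng)  # tentative new state from Q(x'; x)
        a = (p_star(x_new) / p_star(x)) * (q_density(x, x_new) / q_density(x_new, x))
        if a >= 1.0 or rng.random() < a:  # accept outright, or with probability a
            x = x_new
        chain.append(x)  # on rejection the current state is written again
    return chain

# Assumed target and proposal, for illustration only.
p_star = lambda x: math.exp(-0.5 * x * x)
step = 1.0  # assumed step size of the proposal
q_sample = lambda x, rng: rng.gauss(x, step)
q_density = lambda x_to, x_from: math.exp(-0.5 * ((x_to - x_from) / step) ** 2)

rng = random.Random(0)
chain = metropolis_hastings(p_star, q_sample, q_density, 0.0, 50000, rng)
mean = sum(chain) / len(chain)
var = sum((x - mean) ** 2 for x in chain) / len(chain)
print(mean, var)  # the target has mean 0 and variance 1
```

 Since this proposal is symmetric, the ratio of $Q$ values is unity and
 could be omitted; it is kept here so that the sketch also covers asymmetric proposals.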

 {\sf Notation.} $\,$ 
 I have used the superscript $r = 1, \ldots, R$ to label points that are 
 {\em independent\/} samples from a distribution, and the superscript $t = 
 1, \ldots , T$
 to label the sequence of states in a Markov chain. It is important 
 to note that a Metropolis--Hastings simulation of $T$ iterations does not 
 produce $T$  {\em independent\/} samples from the target distribution $P$. The 
 samples are dependent.

 To compute
 the acceptance probability (\ref{eq.ratio.metrop}) we need to be able to compute the
 probability ratios $P(x')/P(x^{(t)})$ and
 $\linefrac{ Q(x^{(t)};x') } 
        { Q(x';x^{(t)}) }$. If the proposal density 
 is a simple symmetrical density such as a Gaussian centred on the
 current point, then the latter factor is unity, 
 and the Metropolis--Hastings method simply involves comparing 
 the value of the target density at the two points. This special
 case is sometimes called the Metropolis method. However,
 with apologies to Hastings, I will call the general 
 Metropolis--Hastings algorithm for asymmetric $Q$
% given above, is often called
 `the  Metropolis method' since I believe important ideas deserve
 short names.

\subsection{Convergence of the Metropolis method to the target density}
 It can be shown that for any positive $Q$ (that is, any $Q$ such 
 that $Q(x';x) > 0$ for all $x,x'$), as $t \rightarrow \infty$, 
 the probability distribution of $x^{(t)}$ tends to $P(x)=P^*\!(x)/Z$.
 [This statement should not be seen as implying that $Q$ {\em has\/}
 to assign positive probability to every point $x'$ -- we will 
 discuss examples later where $Q(x';x) = 0$ for some $x,x'$; 
 notice also that we have said nothing about how rapidly 
 the convergence to $P(x)$ takes place.]


%\subsection{Markov chain Monte Carlo}
 The Metropolis method is an example of a {\dbf{Markov chain Monte Carlo}}
% not indexed (see see.tex)
 method\index{Monte Carlo methods!Markov chain Monte Carlo}
 (abbreviated {MCMC}).  In contrast to rejection sampling, where
 the accepted points $\{ x^{(r)} \}$ are {\em independent\/} samples from the
 desired distribution, Markov chain Monte Carlo methods involve a
 Markov process in which a sequence of states $\{ x^{(t)} \}$ is
 generated, each sample $x^{(t)}$ having a probability distribution
 that depends on the previous value, $x^{(t-1)}$. Since successive
 samples are dependent,
% correlated with each other,
 the Markov chain may have to be
 run for a considerable time in order to generate samples that are
 effectively independent samples from $P$. 

 Just as it was difficult to estimate  the variance of an importance sampling
 estimator, so it is difficult to assess whether a Markov chain Monte Carlo
 method has `converged', and to quantify how long one has to wait to obtain 
 samples that are effectively independent samples from $P$. 

\begin{figure}
\figuremargin{%
\begin{center}
%\framebox{
{
\setlength{\unitlength}{0.8mm}% was 1mm
\begin{picture}(75,75)(0,0)
\put(0,0){\makebox(0,0)[bl]{\psfig{figure=figs/metrop2.eps,%
width=2.4in,angle=-90}}}% was width 3
% \put(73,-1){\makebox(0,0)[t]{$x$}}
\put(20,38){\makebox(0,0)[l]{$\bx^{(1)}$}}
\put(10,35){\makebox(0,0)[r]{$Q(\bx;\bx^{(1)})$}}
\put(55,57){\makebox(0,0)[l]{$P^*\!(\bx)$}}
\put(45,30){\makebox(0,0)[l]{$L$}}
\put(18,47){\makebox(0,0)[b]{$\epsilon$}}
\end{picture}
}
\end{center}
}{%
\caption[a]{{Metropolis method in two dimensions, 
        showing a traditional proposal density that 
        has a sufficiently small step size $\epsilon$
        that the acceptance frequency\index{acceptance rate} will be about
        0.5.}}
\label{fig.metrop2}
}%
\end{figure}
\subsection{Demonstration of the Metropolis method}
\label{sec.metrop.demo} 
 The Metropolis method is widely used for high-dimensional problems.
%
 Many implementations of the Metropolis method employ a proposal distribution 
 with a  length scale $\epsilon$ 
 that is short relative to the longest length scale $L$
 of the probable region (\figref{fig.metrop2}).  
%The use of a small length
% scale is not obligatory, but a
 A reason for choosing a small length 
 scale is that for most high-dimensional problems, a large random step 
 from a typical point (that is, a sample from $P(\bx)$)
 is very likely to end in a state that has very low probability; 
 such steps are unlikely to be accepted.
 If $\epsilon$ is large, 
 movement around the state space will only occur when such a transition 
 to a low-probability state
 is actually  accepted, or when a large random step chances to land in another 
 probable state. So the rate of progress 
 will be slow if large steps are used.
% , unless small steps are used.

 The disadvantage of small steps, on the other hand, is that 
 the Metropolis method will explore the probability distribution 
 by a {\dem\ind{random walk}\/}, and a random walk takes a long 
 time to get anywhere, especially if the walk is made of small steps. 
\exercisxA{1}{ex.randomwalk}{
 Consider a one-dimensional random walk, on each step
 of which the state moves 
 randomly to the left or to the right with equal  probability. 
 Show that after $T$ steps of size $\epsilon$,
 the state is  likely to have moved only a  distance 
 about $\sqrt{T} \epsilon$. (Compute 
 the root mean square distance travelled.) 
}
 Recall that the first aim of Monte Carlo 
 sampling is to generate a number of {\em independent\/} samples 
 from the given distribution (a dozen, say). 
 If the largest length scale of the state space is $L$, 
 then we have to simulate a random-walk Metropolis method 
 for a time $T \simeq \left(\linefrac{L}{\epsilon}\right)^2$ 
 before we can expect to get a sample that is
 roughly independent of the initial condition -- and 
 that's assuming that every step is accepted: if only a fraction $f$
 of the steps are accepted on average, then this time is increased 
 by a factor $1/f$. \medskip

\begin{conclusionbox}
{\bf Rule of thumb: lower bound on number of iterations of a Metropolis 
 method.} If the largest length scale of the  space of probable 
 states is $L$,\index{key points!Monte Carlo}  
 a Metropolis method whose proposal distribution generates a random 
 walk with step size $\epsilon$ must be run for at least
\beq
 T \simeq \left(\linefrac{L}{\epsilon}\right)^2
\label{eq.ruleofthumb}
\eeq
 iterations 
 to obtain an independent sample. 
\end{conclusionbox}
\medskip

 This  rule of thumb
%  for the required number of  iterations to obtain an independent sample
  gives only a lower bound; the situation may be much worse if, for 
 example, the probability distribution consists of several 
%  separate 
 islands of high probability separated by regions of low probability. 

\begin{figure}[htbp]
\figuremargin{\small%
\begin{center}
\begin{tabular}{c@{\hspace{-0.3in}}c@{\hspace{-0.3in}}c}
\hspace*{-0.1in}(a)\begin{tabular}{c}
\psfig{figure=metrop/Aps.ps,height=5.5in,width=0.35in}%7.5 was too big 
\end{tabular}
&
\begin{tabular}{c}(b) Metropolis\\[0.15in]
\psfig{figure=metrop/hist.100.ps,height=1.64in,angle=-90} \\
\psfig{figure=metrop/hist.400.ps,height=1.64in,angle=-90} \\
\psfig{figure=metrop/hist.1200.ps,height=1.64in,angle=-90} \\
\end{tabular}
&
\begin{tabular}{c}(c) Independent sampling\\[0.15in]
\psfig{figure=metrop/h.100.ps,height=1.64in,angle=-90} \\
\psfig{figure=metrop/h.400.ps,height=1.64in,angle=-90} \\
\psfig{figure=metrop/h.1200.ps,height=1.64in,angle=-90} \\
\end{tabular}
\\
\end{tabular}
\end{center}
}{%
\caption[a]{{Metropolis method for a toy problem.}
%
 (a) The state sequence for $t = 1 , \ldots , 600$.  Horizontal direction =
 states from 0 to 20; vertical direction = time from 1 to 600; the
 cross bars mark time intervals of duration 50.
%
 (b) Histogram of occupancy of the states after 100, 400 and 1200 iterations.
%
 (c) For comparison, histograms resulting when
 successive points are drawn {\em independently\/}
 from the target distribution.
} 
\label{fig.metrop}
}%
\end{figure}
\label{sec.simplemc}
 To illustrate
 how slowly
%  the difficulties caused by 
% the exploration of a state space by
 a random walk explores a state space, \figref{fig.metrop} shows 
 a simulation of a Metropolis algorithm
 for generating
%  that is intended to generate 
 samples  from the  distribution:
% following distribution over integers:
\beq
        P(x) = \left\{
 \begin{array}{ll} \dfrac{1}{21} & x \in \{ 0,1,2,\ldots,20 \} \\
 0 & \mbox{otherwise.} \end{array}
\right.
\label{eq.metrop}
\eeq
 The proposal distribution is
% the probability distribution for a simple random walk,
\beq
        Q(x' ; x ) = \left\{
 \begin{array}{ll} \dfrac{1}{2} & x' = x \pm 1 \\
 0 & \mbox{otherwise.} \end{array}
\right.
\label{eq.metropb}
\eeq
 Because the target distribution $P(x)$ is uniform, 
 rejections  occur only when the proposal takes the state 
 to $x' = -1$ or $x'=21$.

 The simulation was started in the state $x_0 = 10$ and its evolution 
 is shown in \figref{fig.metrop}a. How long does it take to reach one of
 the end states $x = 0$ and $x=20$?  Since the distance is 10 steps, 
 the rule of thumb (\ref{eq.ruleofthumb}) predicts that it will typically take 
 a time $T \simeq 100$ iterations to  reach an end state.
 This is confirmed, to within a factor of two, in the present example: the first step into an 
 end state occurs on the 178th iteration.
 How long does it take to visit {\em both\/} end states? 
 The rule of thumb predicts about 400 iterations are required 
 to traverse the whole state space; and indeed the first encounter with 
 the other end state takes place on the 540th iteration. Thus 
 effectively independent samples are generated only at a rate of about
 one per four hundred iterations.
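
 The time to reach an end state can be explored with a short simulation.
 The following Python sketch is not the program that produced
 \figref{fig.metrop}; the function name \verb|first_hit_time|, the random
 seed and the number of repetitions are assumptions made for illustration.
 For this walk the mean time to first reach $x=0$ or $x=20$ from $x=10$ is
 $10 \times 10 = 100$ iterations, in agreement with the rule of thumb,
 though individual runs scatter widely about this mean.

```python
import random

def first_hit_time(rng, start=10, low=0, high=20):
    """Iterations until the Metropolis walk on {low, ..., high} first reaches an end state."""
    x, t = start, 0
    while x != low and x != high:
        x_new = x + rng.choice((-1, 1))  # proposal Q(x'; x): one step left or right
        if low <= x_new <= high:         # uniform target: reject only if x' leaves the support
            x = x_new
        t += 1
    return t

rng = random.Random(1)
times = [first_hit_time(rng) for _ in range(2000)]
mean_time = sum(times) / len(times)
print(mean_time, min(times), max(times))
```

 The spread between the smallest and largest times printed gives a feel
 for how far a single run can fall from the mean.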

% [This discussion should not be misunderstood as saying that the aim 
% of a Markov chain Monte Carlo is to actually reach every probable state; 
% the argument is that if the chain has not had 

% \subsection{Reducing random walk behaviour in Markov chain Monte Carlo}
 This simple example shows that it is important to  try to 
 abolish random walk behaviour in Monte Carlo methods. A
  systematic exploration of  the toy state space $\{0,1,2,\ldots , 20\}$
 could  get around  it, using the same step 
 sizes, in  about twenty steps instead of four hundred.
 Methods for reducing random walk behaviour
 are  discussed in the next chapter.
%
% \subsection{Hybrid Monte Carlo}
%
%

\subsection{Metropolis method in high dimensions}
 The rule of thumb  (\ref{eq.ruleofthumb}),
% that we discussed above,
 which gives a lower bound on 
 the number of iterations of a random walk Metropolis method, 
 also applies to higher dimensional problems.
 Consider the simple case of a target distribution that is 
 an $N$-dimensional Gaussian, and a proposal distribution that is a spherical 
 Gaussian of standard deviation  $\epsilon$ in each direction.
  Without loss of generality, we can 
 assume that the target distribution is a separable distribution 
 aligned with the axes $\{x_n\}$, and that it has standard deviation 
 $\sigma_n$ in  direction $n$.  Let $\sigma^{\max}$
 and $\sigma^{\min}$ be the largest and smallest of these standard deviations.
 Let us assume that $\epsilon$ is adjusted  such that the acceptance 
 frequency\index{acceptance rate} is close to 1. Under this assumption, 
 each variable $x_n$ evolves independently of all the others, 
 executing a random walk with  step size about $\epsilon$. The time taken 
 to generate effectively independent samples from the target distribution 
 will be controlled by the largest lengthscale $\sigma^{\max}$.
 Just as in the previous section, where
 we needed at least $T \simeq (L/\epsilon)^2$
 iterations to obtain an independent sample, here we 
 need $T \simeq ( \sigma^{\max} /\epsilon)^2$. 

 Now, how big can $\epsilon$ be? The bigger it is, the smaller this number 
 $T$ becomes, but if $\epsilon$ is too big -- bigger than 
 $\sigma^{\min}$ -- then the acceptance  rate\index{acceptance rate} will fall sharply. 
 It seems plausible that the optimal $\epsilon$ must be similar to 
 $\sigma^{\min}$. Strictly, this may not  be true; in 
 special cases where the second smallest $\sigma_n$ 
 is significantly greater than  $\sigma^{\min}$,
 the optimal $\epsilon$ may be closer to that second smallest 
 $\sigma_n$. But our rough conclusion is this: where simple 
 spherical proposal distributions are used, 
 we will need at least $T \simeq ( \sigma^{\max} /  \sigma^{\min} )^2$
 iterations to obtain an independent sample, where 
 $\sigma^{\max}$ and $\sigma^{\min}$ are the longest and shortest lengthscales
 of the target distribution.  
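
 To put a number on this, suppose (purely for illustration) that
 $\sigma^{\max}/\sigma^{\min} = 100$. Then at least
\beq
	T \simeq \left( \linefrac{\sigma^{\max}}{\sigma^{\min}} \right)^2 = 100^2 = 10^4
\eeq
 iterations are needed per independent sample, whatever the value of $N$.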

 This is good news and bad news. It is good news because, unlike
 the cases of rejection sampling and importance sampling, there is no
 catastrophic dependence on the dimensionality $N$.
% We can get answers
 Our \ind{computer} {\em will\/} give useful
 answers in a time shorter than the age of the universe.
 But it is bad news
 all the same, because  this quadratic dependence on the 
 lengthscale-ratio 
% that random walks induce 
 may still force us to make  very lengthy simulations. 

 Fortunately, there are methods for suppressing
 \index{Monte Carlo methods!random walk suppression}\index{random walk!suppression}random walks in 
 Monte Carlo simulations, which we will discuss in the next chapter.


\begin{figure}
\figuremargin{%
\begin{center}
\begin{tabular}{ll}
\hspace{-0.05in}(a)
{
\setlength{\unitlength}{1mm}
\begin{picture}(50,50)(0,0)
\put(0,0){\makebox(0,0)[bl]{\psfig{figure=figs/gibbs.eps,%
width=2in,angle=-90}}}
\put(48,-1){\makebox(0,0)[t]{$x_1$}}
\put(-1,40){\makebox(0,0)[r]{$x_2$}}
\put(33,19){\makebox(0,0)[l]{$P(\bx)$}}
\end{picture}
}
&
\hspace{-0.05in}(b)
\setlength{\unitlength}{1mm}
\begin{picture}(50,50)(0,0)
\put(0,0){\makebox(0,0)[bl]{\psfig{figure=figs/gibbss.eps,%
width=2in,angle=-90}}}
\put(48,-1){\makebox(0,0)[t]{$x_1$}}
\put(-1,40){\makebox(0,0)[r]{$x_2$}}
\put(29,13){\makebox(0,0)[l]{$P(x_1\given x_2^{(t)})$}}
\put(24,6){\makebox(0,0)[l]{$\bx^{(t)}$}}
\end{picture}
\\
\hspace{-0.05in}(c)
\setlength{\unitlength}{1mm}
\begin{picture}(50,50)(0,0)
\put(0,0){\makebox(0,0)[bl]{\psfig{figure=figs/gibbst.eps,%
width=2in,angle=-90}}}
\put(48,-1){\makebox(0,0)[t]{$x_1$}}
\put(-1,40){\makebox(0,0)[r]{$x_2$}}
\put(33,16){\makebox(0,0)[l]{$P(x_2\given x_1)$}}
\end{picture}
&
\hspace{-0.05in}(d)
{
\setlength{\unitlength}{1mm}
\begin{picture}(50,50)(0,0)
\put(0,0){\makebox(0,0)[bl]{\psfig{figure=figs/gibbsu.eps,%
width=2in,angle=-90}}}
\put(48,-1){\makebox(0,0)[t]{$x_1$}}
\put(-1,40){\makebox(0,0)[r]{$x_2$}}
\put(24,6){\makebox(0,0)[l]{$\bx^{(t)}$}}
\put(12,22){\makebox(0,0)[br]{$\bx^{(t+1)}$}}
\put(18,29){\makebox(0,0)[br]{$\bx^{(t+2)}$}}
\end{picture}
}
\\
\end{tabular}
\end{center}
}{%
\caption[a]{{Gibbs sampling.}

 (a) The joint density {$P(\bx)$}
 from which samples 
 are required. (b) Starting from a state $\bx^{(t)}$, $x_1$ 
 is sampled from the conditional density $P(x_1\given x_2^{(t)})$.
 (c) A sample is then made from the conditional density $P(x_2\given x_1)$. 
 (d) A couple of iterations of Gibbs sampling. }
\label{fig.gibbs}
}%
\end{figure}
% 
\section{Gibbs sampling}
 We introduced\indexs{Monte Carlo methods!Gibbs sampling}
% have studied 
 importance sampling, rejection sampling and the 
 Metropolis method using one-dimensional examples. \inds{Gibbs sampling}, 
 also known as the {\em{\ind{heat bath}} method\/}
 or `\ind{Glauber dynamics}',
% Not in index because done in see.tex % NOW UNDONE in see.tex
 is a method for sampling from distributions over  at least two dimensions.
 Gibbs sampling can be viewed as a Metropolis method in which a
 sequence of proposal 
 distributions $Q$ is defined in terms of the {\em conditional\/}
 distributions of the joint distribution $P(\bx)$. It is assumed that, whilst
 $P(\bx)$ is too complex to draw samples from directly, its conditional
 distributions $P(x_i\given \{x_j\}_{j\neq i})$ are tractable to work with.
%
% Gibbs sampling is a \MCMC\ method in which each iteration 
% $\bx \rightarrow \bx'$ involves a separate sampling of each
% variable $x_i$ in turn from its distribution {\em conditional\/} on the
% current values of all the other variables in the model. 
 For many graphical 
 models (but not all) these one-dimensional  conditional
 distributions are straightforward to sample from.
 For example, if a Gaussian distribution for some variables $\bd$ has an unknown
 mean $\bm$,
 and the  prior distribution of $\bm$ is Gaussian, 
 then the conditional distribution of $\bm$ given $\bd$ is also Gaussian. 
 Conditional
 distributions that are not of standard form may still be sampled from
 by {\dem\ind{adaptive rejection sampling}\index{Monte Carlo methods!rejection sampling!adaptive}\/}
 if the conditional distribution satisfies
 certain \ind{convexity} properties \cite{Gilks_Wild}.

 Gibbs  sampling is illustrated for a case with two variables $(x_1,x_2)=\bx$
 in \figref{fig.gibbs}. 
 On each iteration, we start from  the current state $\bx^{(t)}$,
 and $x_1$ 
 is sampled from the conditional density $P(x_1\given x_2)$, with $x_2$ fixed
 to $x^{(t)}_2$.
 A sample $x_2$ is then made from the conditional density $P(x_2\given x_1)$, 
 using the new value of $x_1$. This brings us to the new state 
 $\bx^{(t+1)}$, and completes the iteration.


 In the general case of a system with $K$ variables,
 a single iteration involves  sampling one parameter at a time:
\newcommand{\tplusone}{(t+1)}
\beqan
\label{eq.gibbs1}
x_1^{\tplusone} &\sim& P( x_1 \given  x_2^{(t)} , x_3^{(t)} , \ldots , x_K^{(t)} ) \\
x_2^{\tplusone} &\sim& P( x_2 \given  x_1^{\tplusone} , x_3^{(t)} , \ldots , x_K^{(t)} ) \\
x_3^{\tplusone} &\sim& P( x_3 \given   x_1^{\tplusone} ,  x_2^{\tplusone} , \ldots , x_K^{(t)} ) , \:\: \mbox{etc.}
\label{eq.gibbs3}
\eeqan
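
 The update scheme (\ref{eq.gibbs1})--(\ref{eq.gibbs3}) can be sketched in a
 few lines of code. What follows is a minimal illustration only, and it
 assumes a target that is not worked in the text: a two-variable Gaussian
 with zero means, unit variances and correlation coefficient {\tt rho}, for
 which the conditional distribution of $x_1$ given $x_2$ is Gaussian with
 mean $\rho x_2$ and variance $1-\rho^2$, and similarly for $x_2$ given $x_1$.
 The function names and arguments are hypothetical.

```python
import math
import random


def gibbs_bivariate_gaussian(rho, n_iter, x_init=(0.0, 0.0), seed=0):
    """Gibbs sampling for an assumed zero-mean, unit-variance bivariate
    Gaussian with correlation rho. Returns the list of visited states."""
    rng = random.Random(seed)
    cond_std = math.sqrt(1.0 - rho * rho)  # width of each conditional
    x1, x2 = x_init
    samples = []
    for _ in range(n_iter):
        # x1^(t+1) ~ P(x1 | x2^(t))
        x1 = rng.gauss(rho * x2, cond_std)
        # x2^(t+1) ~ P(x2 | x1^(t+1)), using the new value of x1
        x2 = rng.gauss(rho * x1, cond_std)
        samples.append((x1, x2))
    return samples


def sample_correlation(samples):
    """Empirical correlation coefficient of a list of (x1, x2) pairs."""
    n = len(samples)
    m1 = sum(s[0] for s in samples) / n
    m2 = sum(s[1] for s in samples) / n
    c12 = sum((s[0] - m1) * (s[1] - m2) for s in samples) / n
    v1 = sum((s[0] - m1) ** 2 for s in samples) / n
    v2 = sum((s[1] - m2) ** 2 for s in samples) / n
    return c12 / math.sqrt(v1 * v2)
```

 With a moderate correlation such as {\tt rho = 0.5}, the empirical
 correlation of a long run should settle close to the assumed value.
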
\subsection{Convergence of Gibbs sampling to the target density}
\exercisxB{2}{ex.gibbs.eq.met}{
 Show that a single variable-update
 of Gibbs sampling can be viewed as a Metropolis method
 with target density $P(\bx)$, and that this Metropolis method 
 has the property  that every proposal is always accepted.  
} 
 Because Gibbs sampling is a Metropolis method, the probability
 distribution of $\bx^{(t)}$ tends to $P(\bx)$ as $t \rightarrow
 \infty$, as long as $P(\bx)$ does not have pathological properties.
\exercissxB{2}{ex.gibbs.h74}{
 Discuss whether the syndrome decoding problem for a $(7,4)$ Hamming code
 can be solved using Gibbs sampling.
 The syndrome decoding problem, if we are to solve it with a Monte Carlo
 approach, is to draw samples
 from the posterior distribution of the noise vector $\bn = (n_1, \ldots, n_n,
 \ldots, n_N)$, 
\beq
 P( \bn \given  {\bf f}, \bz ) = \frac{1}{Z} \prod_{n=1}^N f_n^{n_n}
		(1-f_n)^{(1-n_n)}  \, \truth [ \bH \bn \eq \bz ] , 
\eeq
 where $f_n$ is the normalized likelihood  for the  $n$th transmitted bit and
 $\bz$ is the observed  syndrome. The factor $\truth [ \bH \bn \eq \bz ]$
 is 1 if
% the hypothesis
 $\bn$ has  the correct  syndrome $\bz$
 and 0 otherwise.

 What about the
 \ind{syndrome decoding}\index{error-correcting code!syndrome decoding}
 problem for any linear error-correcting code?
} 


\subsection{Gibbs sampling in high dimensions}
 Gibbs sampling suffers from the same defect as simple Metropolis algorithms 
 -- the state space is explored by a  slow random walk, unless
 a fortuitous parameterization has been chosen that makes the 
 probability distribution $P(\bx)$ separable. If, say, two variables 
 $x_1$ and $x_2$ are strongly correlated, having marginal densities 
 of width $L$ and conditional densities of width $\epsilon$, 
 then it will take at least about $(L/\epsilon)^2$ iterations
 to generate an independent sample from the target density. \Figref{fig.adler},
 \pref{fig.adler}, illustrates the slow progress made
 by Gibbs sampling when $L \gg \epsilon$.
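
 As a rough illustration (assuming, for concreteness, a two-variable
 Gaussian with unit marginal variances and correlation coefficient $\rho$,
 a case chosen here rather than taken from the text): the marginal width
 is $L \simeq 1$ and the conditional variance is $1-\rho^2$, so
 $\epsilon \simeq \sqrt{1-\rho^2}$ and
\beq
 (L/\epsilon)^2 \simeq \frac{1}{1-\rho^2} .
\eeq
 For $\rho = 0.99$ this is $1/0.0199 \simeq 50$ iterations per independent
 sample; for $\rho = 0.999$ it is about 500.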

 However, Gibbs sampling involves no adjustable parameters, so it is
 an attractive strategy when one wants to get a model running
 quickly.
 An excellent software package, {\tt BUGS},\index{software!BUGS}\index{BUGS} 
 makes it easy to set up almost arbitrary probabilistic models
 and simulate them by Gibbs sampling \cite{bugs}.\footnote{\tt{http://www.mrc-bsu.cam.ac.uk/bugs/}}

%%%%%%%%%%%%%%%%%%%%%%%%%%
% possible boundary
%%%%%%%%%%%%%%%%%%%%%%%%%%
% new material added here from metrop/DEMO.m
% post=1
% DEMO
\newcommand{\metropdensity}[2]{%
\mbox{\makebox[0in][r]{\raisebox{0.24in}{$p^{(#2)}(x)$}}%
\psfig{figure=metrop/ps/pt#1.#2.ps,width=1in,angle=-90}}}

% advanced monte carlo methods
\section{Terminology for \MCMC\ methods}
\label{sec.mc.terminology}
% The preceding description of the Metropolis method and Gibbs sampling 
% is hopefully comprehensible. 
 We now spend a few moments sketching\index{terminology!Monte Carlo methods}
 the theory on which the Metropolis method and Gibbs sampling are based. 
 We  denote by $p^{(t)}(\bx)$ the probability distribution of the 
 state of a Markov chain simulator. (To visualize this distribution, imagine running 
 an infinite
 collection of identical simulators in parallel.)
 Our aim is to find a Markov chain 
 such that as $t \rightarrow \infty$,   $p^{(t)}(\bx)$ tends to 
 the desired distribution $P(\bx)$. 

%\begin{description}
%\item[Markov chain.]
% \subsection{Markov chain}
 A {\dbf Markov chain} can be specified by an {\dbf initial} probability 
 distribution $p^{(0)}(\bx)$ and a {\dbf transition probability} $T(\bx';\bx)$.

 The probability distribution of the state at the $(t\!+\!1)$th iteration
 of the Markov chain, $p^{(t+1)}(\bx')$, is given by
\beq
        p^{(t+1)}(\bx') = \intdx\: T(\bx';\bx) p^{(t)}(\bx) .
\eeq
%\item[Choice of Markov chain.]
% \subsection{Choice of Markov chain}

\noindent
\Exampl{example20}{
 An example of a Markov chain is given by the Metropolis demonstration
 of section \ref{sec.metrop.demo} (\figref{fig.metrop}),
 for which the transition probability is
\begin{realcenter}
{\footnotesize $\displaystyle
\mbox{\normalsize$\bT$}  = \left[
\begin{array}{*{21}{c@{\,}}}
\dhalf&\dhalf&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot\\[-0.05in]
\dhalf&\cdot&\dhalf&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot\\[-0.05in]
\cdot&\dhalf&\cdot&\dhalf&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot\\[-0.05in]
\cdot&\cdot&\dhalf&\cdot&\dhalf&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot\\[-0.05in]
\cdot&\cdot&\cdot&\dhalf&\cdot&\dhalf&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot\\[-0.05in]
\cdot&\cdot&\cdot&\cdot&\dhalf&\cdot&\dhalf&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot\\[-0.05in]
\cdot&\cdot&\cdot&\cdot&\cdot&\dhalf&\cdot&\dhalf&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot\\[-0.05in]
\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\dhalf&\cdot&\dhalf&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot\\[-0.05in]
\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\dhalf&\cdot&\dhalf&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot\\[-0.05in]
\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\dhalf&\cdot&\dhalf&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot\\[-0.05in]
\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\dhalf&\cdot&\dhalf&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot\\[-0.05in]
\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\dhalf&\cdot&\dhalf&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot\\[-0.05in]
\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\dhalf&\cdot&\dhalf&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot\\[-0.05in]
\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\dhalf&\cdot&\dhalf&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot\\[-0.05in]
\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\dhalf&\cdot&\dhalf&\cdot&\cdot&\cdot&\cdot&\cdot\\[-0.05in]
\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\dhalf&\cdot&\dhalf&\cdot&\cdot&\cdot&\cdot\\[-0.05in]
\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\dhalf&\cdot&\dhalf&\cdot&\cdot&\cdot\\[-0.05in]
\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\dhalf&\cdot&\dhalf&\cdot&\cdot\\[-0.05in]
\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\dhalf&\cdot&\dhalf&\cdot\\[-0.05in]
\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\dhalf&\cdot&\dhalf\\[-0.05in]
\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\dhalf&\dhalf
\end{array}
\right]
$} \end{realcenter}
 and the initial distribution was
\beq
 p^{(0)}(x) = \left[
\begin{array}{*{21}{c@{\,}}}
\,\cdot\,&\,\cdot\,&\,\cdot\,&\,\cdot\,&\,\cdot\,&\,\cdot\,&\,\cdot\,&\,\cdot\,&\,\cdot\,&\,\cdot\,&1&\,\cdot\,&\,\cdot\,&\,\cdot\,&\,\cdot\,&\,\cdot\,&\,\cdot\,&\,\cdot\,&\,\cdot\,&\,\cdot\,&\,\cdot\,
\\
\end{array}
\right] .
\eeq
 The probability distribution $p^{(t)}(x)$ of the state at the
 $t$th iteration is shown
 for $t=0,$ 1, 2, 3, 10, 100, 200, 400
% $\ldots$
 in \figref{fig.metropdensity10};
 an equivalent sequence of distributions is shown
 in \figref{fig.metropdensity17}
 for the chain that begins in initial state $x_0=17$.
 Both chains converge to the target density, the uniform density,
 as $t \rightarrow \infty$.
\amarginfig{b}{\footnotesize
\begin{center}
\begin{tabular}{c}
\metropdensity{10}{0}\\[-0.3in]
\metropdensity{10}{1}\\[-0.2in]
\metropdensity{10}{2}\\[-0.2in]
\metropdensity{10}{3}\\[-0.3in]
\metropdensity{10}{10}\\[-0.3in]
\metropdensity{10}{100}\\[-0.3in]
\metropdensity{10}{200}\\[-0.3in]
\metropdensity{10}{400}\\[-0.13in]
\end{tabular}
\end{center}
\caption[a]{The probability distribution of the state of the 
 Markov chain of \exampleonlyref{example20}.
}\label{fig.metropdensity10}
}% \ENDsolution
}
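
 The evolution of $p^{(t)}(x)$ in \exampleonlyref{example20} can be
 reproduced numerically by repeated application of $\bT$. The sketch below
 is illustrative: the function names are hypothetical, and the states are
 numbered $0,\ldots,20$, which is an assumption about the indexing.

```python
def example_transition_matrix(n_states=21):
    """T[i][j] = probability of moving to state i from state j, for the
    tridiagonal matrix of the example (halves off the diagonal, and a
    half on the diagonal at each end of the chain)."""
    T = [[0.0] * n_states for _ in range(n_states)]
    for j in range(n_states):
        for i in (j - 1, j + 1):
            if 0 <= i < n_states:
                T[i][j] = 0.5
            else:
                # A proposal off the end is rejected: the state is unchanged.
                T[j][j] += 0.5
    return T


def step(T, p):
    """One iteration: p_new(x') = sum_x T(x'; x) p(x)."""
    n = len(p)
    return [sum(T[i][j] * p[j] for j in range(n)) for i in range(n)]


def evolve(T, p0, t):
    """Return p^(t), obtained from p^(0) by t applications of T."""
    p = list(p0)
    for _ in range(t):
        p = step(T, p)
    return p


T = example_transition_matrix()
p0 = [0.0] * 21
p0[10] = 1.0  # all probability on the middle state at t = 0
p400 = evolve(T, p0, 400)
max_deviation = max(abs(q - 1.0 / 21) for q in p400)
```

 Here {\tt max\_deviation} measures how far $p^{(400)}$ still is from the
 uniform density $1/21$.
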
%\end{example}

\subsection{Required properties}
 When designing a Markov chain Monte Carlo method,
 we construct a chain with the following properties:
\ben
\item The desired 
 distribution $P(\bx)$ is an {\dbf\ind{invariant distribution}} of the chain. 

 A distribution $\pi(\bx)$ is an invariant distribution of the transition
 probability $T(\bx';\bx)$ if
\beq
        \pi(\bx') = \intdx\: T(\bx';\bx) \pi(\bx) .
\eeq
 An invariant distribution is an eigenvector of the transition
 probability matrix that has eigenvalue 1.
\item
        The chain must also be {\dbf\ind{ergodic}}, that is,
\beq
         p^{(t)}(\bx)  \rightarrow \pi(\bx) \mbox{ as $t \rightarrow \infty$, for any $p^{(0)}(\bx)$.} 
\eeq 
  A couple of reasons why a chain might not be ergodic are:
\ben
\item Its matrix might be {\dem\ind{reducible}}, which
 means that the state space contains two or more subsets  of states  that 
 can never be reached from each other.  Such a chain has many invariant 
 distributions; which one  $p^{(t)}(\bx)$ would tend to 
 as $t \rightarrow \infty$ would depend on the initial condition 
 $p^{(0)}(\bx)$. 

\amarginfig{b}{\footnotesize
\begin{center}
\begin{tabular}{c}
\metropdensity{17}{0}\\[-0.3in]
\metropdensity{17}{1}\\[-0.2in]
\metropdensity{17}{2}\\[-0.2in]
\metropdensity{17}{3}\\[-0.3in]
\metropdensity{17}{10}\\[-0.3in]
\metropdensity{17}{100}\\[-0.3in]
\metropdensity{17}{200}\\[-0.3in]
\metropdensity{17}{400}\\[-0.13in]
\end{tabular}
\end{center}
\caption[a]{The probability distribution of the state of the 
 Markov chain for initial condition $x_0 = 17$ (\exampleref{example20}).
}\label{fig.metropdensity17}
}
 

 The transition probability matrix of such a chain has more than 
 one eigenvalue equal to 1.
\item The chain might  have a {\dem periodic\/}
% irreducible
 set, which 
 means that, for some initial conditions,  $p^{(t)}(\bx)$ doesn't
 tend to an invariant distribution, but instead tends to 
 a periodic limit-cycle. 

 A simple Markov chain with this property
 is the random walk on the $N$-dimensional hypercube. The chain $T$
 takes the state from one corner to a randomly chosen adjacent corner. 
 The unique invariant distribution of this chain is the uniform 
 distribution over all $2^N$ states, but the chain is not ergodic;
 it is periodic with period two: 
 if we divide the states into states with odd parity and states with even
 parity, we notice that every odd state is surrounded by even states
 and {\em vice versa}. So if the initial condition at time $t=0$ 
 is a state with even parity, then at time $t=1$ -- and 
 at all odd times -- the state must have 
 odd parity, and at all even times, the state will be of even parity. 


 The transition probability matrix of such a chain has more than 
 one eigenvalue with magnitude equal to 1. The random walk on the hypercube,
 for example,
 has eigenvalues equal to $+ 1$ and $-1$.
\een
\een
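
 As a minimal worked instance of the periodic case (taking $N=1$, so that
 the hypercube has just two corners; this special case is chosen here for
 illustration), the chain flips the state at every step:
\beq
 \bT = \left[ \begin{array}{cc} 0 & 1 \\ 1 & 0 \end{array} \right] .
\eeq
 Its eigenvalues are $+1$, with eigenvector $(1/2,1/2)$, which is the uniform
 invariant distribution, and $-1$, with eigenvector $(1,-1)$.
 Starting from $p^{(0)} = (1,0)$, the distribution is $(0,1)$ at all odd
 times and $(1,0)$ at all even times, and it never approaches $(1/2,1/2)$.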

\subsection{Methods of construction of Markov chains}
 \index{concatenation!in Markov chains}It
 is often convenient to construct $T$ by \index{mixture distribution}{\dem{mixing}\/}
 or {\dem{concatenating}\/}
 simple {\dbf\ind{base transitions}\/} $B$ all of which satisfy
\beq
        P(\bx') = \intdx\: B(\bx';\bx) P(\bx) ,
\eeq
 for the desired density $P(\bx)$, \ie, they
 all have the desired density as an invariant distribution.
 These base transitions need not  individually be 
 \ind{ergodic}.

 $T$ is a  {\dem{mixture}}\index{mixture!in Markov chains}
 of several base transitions $B_b(\bx',\bx)$ if
 we make the transition by picking one of the base transitions
 at random, and allowing it to determine the transition, \ie, 
\beq
	T(\bx',\bx)  = \sum_b p_b B_b(\bx',\bx) ,
\eeq
 where $\{ p_b \}$ is a probability distribution over the
 base transitions.

 $T$ is a {\dem{concatenation}}\index{concatenation!in Markov chains} of two   base transitions $B_1(\bx',\bx)$
 and $B_2(\bx',\bx)$ if we first make a transition to an
 intermediate state $\bx''$ using $B_1$, and then make a transition
 from state  $\bx''$  to $\bx'$ using $B_2$. 
\beq
	T(\bx',\bx)  = \intdxpp  B_2(\bx',\bx'') B_1(\bx'',\bx)  .
\label{eq.concatT}
\eeq
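 A short check, using only the definitions above, confirms that both
 constructions leave $P(\bx)$ invariant. For the mixture,
\beq
 \intdx\: T(\bx',\bx) P(\bx) = \sum_b p_b \intdx\: B_b(\bx',\bx) P(\bx)
  = \sum_b p_b P(\bx') = P(\bx') ,
\eeq
 since $\sum_b p_b = 1$; and for the concatenation,
\beq
 \intdx\: T(\bx',\bx) P(\bx)
  = \intdxpp B_2(\bx',\bx'') \intdx\: B_1(\bx'',\bx) P(\bx)
  = \intdxpp B_2(\bx',\bx'') P(\bx'') = P(\bx') .
\eeq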

% \item[Detailed balance.]
% \subsection{Detailed balance}
\subsection{Detailed balance}
 Many useful transition probabilities satisfy the 
 {\dbf detailed balance} property:
\beq
        T(\bx_{a};\bx_{b}) P(\bx_{b}) =  
        T(\bx_{b};\bx_{a}) P(\bx_{a}) , \mbox{ for all $\bx_{b}$ and $\bx_{a}$}.
\eeq
 This equation says that if we pick (by magic) a state 
% n $\bx$
 from the target density 
 $P$ and make a transition under $T$ to another
 state, it is just as likely that we will  pick $\bx_{b}$
 and go from $\bx_{b}$ to $\bx_{a}$ as it is that we will pick 
  $\bx_{a}$
 and go from $\bx_{a}$ to $\bx_{b}$.
% \end{description}
 Markov chains that satisfy detailed balance are also called
 {\dbf reversible} Markov chains. 
 The reason why the detailed balance property is of interest 
 is that detailed balance implies invariance of the
 distribution $P(\bx)$ under the Markov chain $T$, which
 is a necessary condition for
 the key property that we want from our MCMC simulation -- that
 the probability distribution of the chain should converge to $P(\bx)$.
\exercisxB{2}{ex.detbal}{Prove that detailed balance implies invariance of the
 distribution $P(\bx)$ under the Markov chain $T$.}
 Proving that 
 detailed balance holds  is often a key step 
 when proving that a \MCMC\ simulation will converge to 
 the desired distribution. The Metropolis method
% and Gibbs sampling  method both
 satisfies detailed balance, for example. Detailed balance 
 is not an essential condition, however, and we will see later that 
 irreversible Markov chains can be useful in practice, because
 they may have different random walk properties.
%(We still require
% such chains to have $P(\bx)$ as their invariant distribution.)
\exercisxB{2}{ex.detbal2}{
 Show that, if we concatenate two base transitions
 $B_1$ and $B_2$ that satisfy detailed balance,
 it is not necessarily the case that the $T$
 thus defined (\ref{eq.concatT}) satisfies detailed balance.
}
\exercisxC{2}{ex.detbal3}{
 Does Gibbs sampling, with several variables all
 updated in a deterministic sequence, satisfy detailed balance?
% Radford says no.
}
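
 Detailed balance is easy to check numerically for a small discrete chain.
 The sketch below is an illustration under assumptions not made in the text:
 a discrete state space $\{0,\ldots,K-1\}$, an arbitrary strictly positive
 unnormalized target, and a Metropolis chain that proposes $x \pm 1$ with
 probability $1/2$ each, rejecting proposals that leave the state space.
 The function names are hypothetical.

```python
def metropolis_matrix(p_star):
    """Transition matrix T[i][j] = T(i; j) of a Metropolis chain on states
    0..K-1 that proposes j-1 or j+1 with probability 1/2 each and accepts
    with probability min(1, P*(i)/P*(j)). Proposals outside the state
    space are rejected. All entries of p_star must be positive."""
    K = len(p_star)
    T = [[0.0] * K for _ in range(K)]
    for j in range(K):
        for i in (j - 1, j + 1):
            if 0 <= i < K:
                T[i][j] = 0.5 * min(1.0, p_star[i] / p_star[j])
        # Rejected proposals leave the chain where it is.
        T[j][j] = 1.0 - sum(T[i][j] for i in range(K) if i != j)
    return T


def detailed_balance_error(T, p):
    """Largest |T(a;b) p(b) - T(b;a) p(a)| over all pairs of states."""
    K = len(p)
    return max(abs(T[a][b] * p[b] - T[b][a] * p[a])
               for a in range(K) for b in range(K))


def invariance_error(T, p):
    """Largest |sum_x T(x';x) p(x) - p(x')| over all x'."""
    K = len(p)
    return max(abs(sum(T[i][j] * p[j] for j in range(K)) - p[i])
               for i in range(K))


p_star = [1.0, 4.0, 2.0, 0.5, 3.0]  # arbitrary unnormalized target
Z = sum(p_star)
p = [w / Z for w in p_star]
T = metropolis_matrix(p_star)
db_err = detailed_balance_error(T, p)
inv_err = invariance_error(T, p)
```

 Both errors should be zero up to floating-point rounding. A numerical check
 of this kind is not a proof, but it is a quick way to catch mistakes when
 constructing a transition matrix by hand.
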

% slice sampling
% 980214
\section{Slice sampling}
 Slice sampling\index{slice sampling}
 \cite{Radford_slice,Radford_slice2001}\index{Neal, Radford}
 is a Markov chain Monte Carlo method that has
 similarities to rejection sampling, Gibbs sampling and the Metropolis method.
 It can be applied wherever the Metropolis method
 can be applied, that is, to any system for which
 the target density $P^*(\bx)$ can be evaluated at any point $\bx$;
 it has the advantage over simple Metropolis methods that it is more robust
 to the choice of parameters like step sizes. 
 The simplest version of slice sampling
 is similar to Gibbs sampling in that
% a slice sampling simulation
 it 
 consists of  one-dimensional transitions
 in the state space; however, there is no requirement that the
 one-dimensional conditional distributions be easy to sample from,
 nor that they have any convexity properties such as
 are required for adaptive rejection sampling.
 And slice sampling is similar to rejection sampling in that it is a method that
 asymptotically  draws  samples from the volume under the
 curve described by $P^*(\bx)$; but there is no requirement for
 an upper-bounding function.

 I will describe slice sampling by giving a sketch of
 a one-dimensional sampling algorithm,  then giving a pictorial
 description that includes the details
 that make the method valid.

\subsection{The skeleton of slice sampling}
 Let us assume that we want to draw samples from $P(x) \propto P^*(x)$
 where $x$ is a real number. 
 A one-dimensional slice sampling algorithm is a method for
 making transitions from a two-dimensional point $(x,u)$ lying 
 under the curve $P^*(x)$
 to another point $(x',u')$ lying 
 under the same curve, such that the probability distribution of $(x,u)$
 tends to a uniform distribution over the area under the curve $P^*(x)$,
 whatever initial point we start from -- like the uniform distribution
 under the curve
 $P^*(x)$ produced by rejection sampling (\sectionref{sec.rejection}).

 A single transition $(x,u) \rightarrow (x',u')$ of a
 one-dimensional slice sampling algorithm  has the following steps,
 of which steps {\tt 3} and {\tt 8} will require further elaboration.
\medskip

\newcommand{\Uniform}{\mbox{Uniform}}
\newcommand{\localtt}{\sf}
\begin{framedalgorithm}
\noindent \sf
  {\tt 1:} evaluate $P^*\!(x)$
\\{\tt 2:} draw a vertical coordinate $u' \sim \Uniform(0,P^*\!(x))$
\\{\tt 3:} create a horizontal interval $(x_l,x_r)$ enclosing $x$
\\{\tt 4:} loop {\localtt\{}
\\{\tt 5:} \hspace{0.3in} draw $x'  \sim \Uniform(x_l,x_r)$
\\{\tt 6:} \hspace{0.3in} evaluate $P^*\!(x')$
\\{\tt 7:} \hspace{0.3in} {\localtt if} $P^*\!(x') > u'$ {\localtt break out of loop {\tt4}-{\tt9}}
\\{\tt 8:} \hspace{0.3in} {\localtt else} modify the interval  $(x_l,x_r)$
\\{\tt 9:} {\localtt\}}
\end{framedalgorithm}
\medskip

 There are several methods for creating the interval  $(x_l,x_r)$ in step
 {\tt 3}, and several methods for modifying it at step {\tt 8}.
 The important point is that the overall method must satisfy detailed
 balance, so that the uniform distribution for $(x,u)$
 under the curve $P^*\!(x)$ is invariant.
% Here I will describe methods appropriate for a real variable $x$.

% see itp/octave/mcmc/slice.m
% second argument is a label
\newcommand{\slicefig}[2]{%
\makebox[0in][l]{\hspace{0.07in}\raisebox{1.5in}{\small\tt{#2}}}%
\mbox{\psfig{figure=octave/mcmc/ps/slice/#1.ps,width=2.49in,angle=-90}\hspace{-0.05in}}}%was 2.42 and -0.2
\begin{figure}
%\figuredangle{%
\figuremargin{%
\begin{raggedright}
\begin{tabular}{@{\hspace*{-0.524in}}*{2}{l}@{\hspace*{-0.2in}}}
\slicefig{22.1}{1}&
\slicefig{22.2}{2}\\
\slicefig{22.3}{3a,3b,3c}&
\slicefig{22.4}{3d,3e}\\
\slicefig{22.5}{5,6}&
\slicefig{22.6}{8}\\
\slicefig{22.7}{5,6,7}&
\\
\end{tabular}
\end{raggedright}
}{%
\caption[a]{Slice sampling.
 Each panel is labelled by the steps of the algorithm that
 are executed in it. At step {\tt1}, $P^*\!(x)$ is evaluated
 at the current point $x$.
 At step  {\tt2}, a vertical coordinate is selected giving the point $(x,u')$
 shown by the box.
 At steps {\tt 3a-c}, an interval of size $w$ containing $(x,u')$ is created
 at random.
 At step  {\tt 3d}, $P^*$ is evaluated at the left end of the interval
 and is found to be larger than $u'$, so a step to the left of size $w$
 is made.
 At step  {\tt 3e}, $P^*$ is evaluated at the right end of the interval
 and is found to be smaller than $u'$, so no stepping out to the
 right is needed.
 When step  {\tt 3d} is repeated, $P^*$ is
 found to be smaller than $u'$, so the stepping out halts.
 At step {\tt 5}  a point is drawn from the interval, shown by a $\circ$.
 Step {\tt6} establishes that this point is above $P^*$
 and step {\tt8} shrinks the interval to the rejected point
 in such a way that the original point $x$ is still in the interval.
 When step {\tt5} is repeated, the new coordinate $x'$ (which is
 to the right-hand side of the interval) gives a value of $P^*$ greater than
 $u'$, so this point $x'$ is the outcome at step {\tt7}.
}
\label{fig.slice0}
\label{fig.slice}
}%
\end{figure}
%%%%%%%%%%%%%
\subsection{The `stepping out' method for step {\tt 3}}
 In the `stepping out' method for
 creating an interval $(x_l,x_r)$ enclosing $x$, we  step out
 in steps of length $w$  until we find
 endpoints $x_l$ and $x_r$ at which $P^*$ is smaller than
 $u'$. The algorithm is
% easiest to understand by seeing it in action as
 shown in \figref{fig.slice}.
\medskip

\begin{framedalgorithm}
\noindent \sf
  {\tt 3a:} draw $r \sim \Uniform(0,1)$
\\{\tt 3b:} $x_l$ {\tt :=} $x - r w$
\\{\tt 3c:} $x_r$ {\tt :=} $x + (1- r) w$
\\{\tt 3d:} {\localtt while} ($P^*\!(x_l)>u'$) {\localtt\{} $x_l \:{\tt:=}\: x_l - w$
{\localtt\}}
\\{\tt 3e:} {\localtt while} ($P^*\!(x_r)>u'$) {\localtt\{} $x_r \:{\tt:=}\: x_r + w$
{\localtt\}}
\end{framedalgorithm}

\subsection{The `shrinking' method for step {\tt 8}}
 Whenever a point $x'$ is drawn such that $(x',u')$ lies
 above the curve $P^*\!(x)$, we shrink the interval so that
 one of the end points is $x'$, and such that the original
 point $x$ is still enclosed in the interval.
\medskip
%\begin{quotation}
\begin{framedalgorithm}
\noindent \sf
  {\tt 8a:} {\localtt if} ($x'>x$) 
% \\  \hspace*{0.853in}
	 \{ $x_r$ {\tt :=} $x'$ \}
\\{\tt 8b:} {\localtt else} 
% \\  \hspace*{0.853in}
	\{ $x_l$ {\tt :=} $x'$ \}
\end{framedalgorithm}
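
 Putting steps {\tt 1}--{\tt 9} together with the stepping-out and shrinking
 rules gives the following sketch of a single transition. It is a minimal
 illustration, not a tuned implementation: the function name, its arguments,
 and the Gaussian test density are assumptions introduced here.

```python
import math
import random


def slice_sample_step(p_star, x, w, rng):
    """One transition x -> x' of one-dimensional slice sampling, using
    stepping out (steps 3a-3e) and shrinking (steps 8a-8b)."""
    u = rng.uniform(0.0, p_star(x))        # steps 1-2: vertical coordinate u'
    r = rng.random()                       # step 3a
    x_l = x - r * w                        # step 3b
    x_r = x + (1.0 - r) * w                # step 3c
    while p_star(x_l) > u:                 # step 3d: step out to the left
        x_l -= w
    while p_star(x_r) > u:                 # step 3e: step out to the right
        x_r += w
    while True:                            # step 4
        x_new = rng.uniform(x_l, x_r)      # step 5
        if p_star(x_new) > u:              # steps 6-7: inside the slice
            return x_new
        if x_new > x:                      # step 8a: shrink from the right
            x_r = x_new
        else:                              # step 8b: shrink from the left
            x_l = x_new


def unnormalized_gaussian(x):
    """Assumed test density: P*(x) = exp(-x^2 / 2)."""
    return math.exp(-0.5 * x * x)


rng = random.Random(1)
x = 0.0
xs = []
for _ in range(20000):
    x = slice_sample_step(unnormalized_gaussian, x, w=1.0, rng=rng)
    xs.append(x)
mean = sum(xs) / len(xs)
var = sum((v - mean) ** 2 for v in xs) / len(xs)
```

 For this test density the sample mean and variance of a long run should be
 close to 0 and 1 respectively.
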
%\end{quotation}

\subsection{Properties of slice sampling}
 Like a standard Metropolis method,
 slice sampling  gets around by a random walk, but
 whereas in the Metropolis method, the choice of the  step size
 is critical to the rate of progress,
 in slice sampling
 the step size is
 self-tuning.
 If the initial interval size $w$ is too small by a factor $f$ compared with the
 width of the probable region then 
% there are
% never any rejections and
 the stepping-out procedure expands the interval  size.
 The cost of this stepping-out is only linear in $f$,
% the factor by which the optimal $w$ is bigger than the chosen $w$,
 whereas in the Metropolis method the computer-time scales
 as the square of $f$ if the step size is too small.


% Discuss sensitivity to $w$. If $w$ too small then waste linear
% amount of time in stepping out.
 If the chosen value of $w$ is too large by a factor $F$ then the algorithm spends
 a time proportional to the logarithm of $F$
% waste
% logarithmic factor
 shrinking the interval down to the right size, since the interval
 typically shrinks by a factor in the ballpark of $0.6$ each time
% The exact value is exp(-1/2) = 0.61.
 a point is rejected. In contrast, the  Metropolis
 algorithm responds to a too-large step size by
 rejecting almost all proposals, so the
 rate of progress is  exponentially bad in $F$.
 There are no rejections in slice sampling. The probability of staying
 in exactly the same place is very small.
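
 To put rough numbers on the logarithmic cost (an illustrative calculation,
 assuming that the interval is much wider than the slice, and that the
 current point, at position $a$, and each rejected point, at position $x'$,
 are uniformly distributed within an interval of unit length): each
 rejection multiplies the interval length by $x'$ if $x'>a$, or by $1-x'$
 otherwise, and by symmetry the mean of the logarithm of this factor is
\beq
 2 \int_0^1 \! \d a \int_a^1 \! \d x' \, \ln x'
  = 2 \int_0^1 \! \d x' \, x' \ln x' = -\frac{1}{2} .
\eeq
 The typical shrinkage per rejection is thus $e^{-1/2} \simeq 0.61$, and
 about $2 \ln F$ rejections are needed; for $F = 1000$ that is roughly 14
 evaluations of $P^*$.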

\marginfig{
\begin{center}
\mbox{\psfig{figure=figs/sliceeg.eps,width=1.6in}} 
\end{center}
\caption[a]{$P^*\!(x)$.}
\label{fig.sliceeg}
}%
\exercisxB{2}{ex.sliceproblem}{
 Investigate the properties of slice sampling applied to the
 density 
 shown in \figref{fig.sliceeg}. $x$ is  a real variable
 between 0.0 and 11.0.
 How long does it take typically
 for slice sampling to get from an $x$ in the peak region $x\in (0,1)$
 to an $x$ in the tail region $x \in (1,11)$, and {\em vice versa}?
 Confirm that the probabilities of these transitions do
 yield an asymptotic probability density that is correct.
% 
% \in (1,2)$ to an $x' \in (2,3)$?
% Note that for some distributions it may take a long time
% to mix, \eg, if there is a peak and a long low tail, then
% you spend a lot of time in the tail then
% a lot in the peak. This can in some cases be viewed as beneficial.
% Skilling has applications where the peak has much more
% probability mass than the tail, but it is the tail that
% is of interest, slice sampling is used to explore the tail;
% transitions between the tail and peak are
% handled by a separate proposal.  Slice sampling is thus one of several
% base transitions.
%
}


\subsection{How  slice sampling is used in real problems \nonexaminable}
 An $N$-dimensional density $P(\bx) \propto P^*(\bx)$ may be sampled
 with the help of the one-dimensional slice sampling method presented
 above by picking a sequence of directions $\by^{(1)}, \by^{(2)},\ldots$
 and defining
 $\bx = \bx^{(t)} + x \by^{(t)}$. The function $P^*(x)$ above
 is replaced by $P^*( \bx ) = P^*( \bx^{(t)} + x \by^{(t)})$.
 The directions may be chosen in various ways; for example,
 as in Gibbs sampling, the directions could be the coordinate axes;
 alternatively, the directions  $\by^{(t)}$ may be selected at random
 in any manner such that the overall procedure satisfies detailed balance.

\subsection{Computer-friendly slice sampling \nonexaminable}
 The real variables of a probabilistic model will always be
 represented in a computer using a finite number of bits.
 In the following implementation of slice sampling
 due to Skilling\nocite{SkillingMacKay2002},
 the stepping-out, randomization, and shrinking 
 operations, described above
% by \citeasnoun{Radford_slice2001}
% Neal
 in terms of floating-point operations,
 are replaced  by binary and integer operations.

 We assume that the
 variable $x$ that is being slice-sampled is represented
 by  a $b$-bit  integer  $X$ taking on one of $B = 2^b$ values,
 $0, 1, 2, \ldots, B\!-\!1$, many or all of which correspond to
 valid values of $x$.
 Using an integer grid eliminates
 any errors in detailed balance that might  ensue from
 variable-precision rounding of floating-point numbers.
%
%  via a mapping $x(X)$.
% We often take these points
% to have equal prior measure, so that the prior becomes flat
% over $X$ and all points are automatically a-priori-equivalent.
% Floating-point numbers, by contrast, are not equivalent, because of their
% variable rounding. Using an integer grid eliminates
% any errors in detailed balance that might thus ensue.
% We denote by $F(X)$ the appropriately transformed version of
% the unnormalized density $f(x(X))$.
 The mapping from $X$ to $x$ need not be
 linear; if it is nonlinear,
 we assume that the function $P^*\!(x)$ is replaced by
 an appropriately transformed function -- for example,
 $P^{**}(X) \propto P^*\!(x) |\d x/\d X|$.
% , if the mapping from $X$ to $x$ is continuous.

 We assume the following operators on $b$-bit integers
 are available:
\def\la{\,{\tt :=}\,}
\def\sp{\hspace*{0.2in}}
\begin{realcenter}
\begin{tabular}{cc}
 $X + N$      & arithmetic sum, modulo $B$, of   $X$ and $N$.\\
 $X - N$      & difference, modulo $B$, of   $X$ and $N$.\\
 $X \oplus N$ & {bitwise\/} exclusive-or of $X$ and $N$.\\
 $N  \la
 {\tt{randbits}}(l)$ & sets $N$ to a random $l$-bit integer.\\
\end{tabular}
\end{realcenter}

 A slice-sampling procedure for integers is then as follows: \medskip

% \footnote{note change in draft 2.2 from $<$ to $\leq$ in the first line.}
\newcommand{\nsp}[1]{\makebox[0in][l]{\tt{#1}:}\hspace*{0.3in}\sf}
\begin{framedalgorithmw}{\fulltextwidth}
%\begin{realcenter}
\begin{tabular}{p{2.7in}p{3.4in}}
% \multicolumn{2}{c}{ {\sf Shrinking procedure} }\\
\multicolumn{2}{c}{ {\sf Given: a current point $X$ and a height $Y = P^*\!(X) \times \mbox{Uniform}(0,1) \leq P^*\!(X)$} }\\[0.15in]
\nsp{1}	$U \la {\tt{randbits}}(b)$ & Define a random translation $U$ of the binary coordinate system. \\
\nsp{2}	set $l$ to a value $l \leq b$ & Set initial $l$-bit sampling range. \\% (step 2)\\
\nsp{3}	do \{ \\
\nsp{4}\sp		$N \la {\tt{randbits}}(l)$ & Define a random move within the current interval of width $2^l$.\\
\nsp{5}\sp		$X' \la ( (X-U) \oplus N ) + U $ &
  Randomize  the lowest $l$ bits of $X$ (in the translated coordinate system).
\\
\nsp{6}\sp		$l \la  l - 1$ &
 If $X'$ is not acceptable, decrease $l$ and try again \\
\nsp{7}\} until \mbox{($X' = X$) or ($P^*\!(X') \geq Y$)} &   with a smaller
 perturbation of $X$; termination at or before $l=0$ is assured.\\
\end{tabular}
\end{framedalgorithmw}
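
 The integer procedure translates almost line for line into code. The
 following sketch is illustrative only: the function name, its arguments,
 and the two-level test density are assumptions made here, and because the
 language used has arbitrary-precision integers, the modulo-$B$ arithmetic
 is written out explicitly.

```python
import random


def integer_slice_step(p_star, X, b, l_init, rng):
    """One transition X -> X' of the integer slice-sampling procedure,
    for a b-bit integer X and an initial l_init-bit range (l_init <= b)."""
    B = 1 << b                                  # number of grid points, 2^b
    Y = p_star(X) * rng.random()                # slice height, Y <= P*(X)
    U = rng.getrandbits(b)                      # step 1: random translation
    l = l_init                                  # step 2: initial range
    while True:                                 # step 3
        N = rng.getrandbits(l) if l > 0 else 0  # step 4: random l-bit integer
        X_new = ((((X - U) % B) ^ N) + U) % B   # step 5: randomize low l bits
        l -= 1                                  # step 6
        if X_new == X or p_star(X_new) >= Y:    # step 7
            return X_new


def two_level(X):
    """Assumed test density on 8-bit integers: weight 1 below 64, else 3."""
    return 1.0 if X < 64 else 3.0


rng = random.Random(2)
X = 0
n_iter = 20000
n_high = 0
for _ in range(n_iter):
    X = integer_slice_step(two_level, X, b=8, l_init=8, rng=rng)
    n_high += (X >= 64)
fraction_high = n_high / n_iter
```

 For this test density the target probability of the region $X \geq 64$ is
 $3 \times 192 / (64 + 3 \times 192) = 0.9$, so {\tt fraction\_high} should
 settle near that value in a long run.
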
\medskip

% \end{realcenter}
 The translation $U$ is introduced to avoid permanent sharp edges, where
 for example the adjacent binary integers {\tt{0111111111}} and {\tt{1000000000}}
 would otherwise be permanently in different sectors, making it difficult for
 $X$ to move from one to the other.

\amarginfig{t}{
\begin{center}
\mbox{\psfig{figure=figs/slicehalve.eps,width=1.965in}}\\[-0.015in]% was -.15
\end{center}
\caption[a]{
 The sequence of intervals from which
 the new candidate points are drawn. 
}
\label{fig.slicehalve}
%Pictorially, the sequence of intervals from which
% the new candidate points are drawn are like the sequence
% of intervals in Neal's doubling procedure (Neal, 2001, figure 2).
}
 The sequence of intervals from which
 the new candidate points are drawn is illustrated in \figref{fig.slicehalve}.
%\begin{center}
%\mbox{\psfig{figure=figs/slicehalve.eps,width=2.5in}}
%\end{center}
 First, a point is drawn from the
 entire interval, shown by the top horizontal line.
 At each subsequent draw, the interval is halved in such a way
 as to contain the previous point $X$.

\begincuttable
% I aimed to CUT some details from here and put them in graveyard.tex.
% Mon 30/12/02
% They are also in the original skilling paper/.
 If preliminary stepping-out from the initial range is required, step {\sf{2}} above
 can be replaced by the following similar procedure:
\medskip% NORMALCENTER

\begin{framedalgorithm}
\begin{center}
\hspace*{0.5in}
\begin{tabular}{@{}p{2.614in}p{2.4in}}
\nsp{2a}	set $l$ to a value $l < b$ & $l$ sets the  initial width \\
\nsp{2b}	do \{ \\
\nsp{2c}\sp		$N \la {\tt{randbits}}(l)$ \\
\nsp{2d}\sp		$X' \la ( (X-U) \oplus N ) + U $ \\
\nsp{2e}\sp		$l \la  l + 1$ \\
\nsp{2f}	\} until \mbox{($l=b$) or ($P^*\!(X') < Y$)} \\
%%  Then shrink as before \\
\end{tabular}
\end{center}
\end{framedalgorithm}
\medskip

% \footnote{ I changed $\geq$ to $<$ above}

 These shrinking and stepping out methods shrink and expand
 by a factor of two per evaluation. A variant
 is to shrink  or expand  by more than one bit each time, setting
 $l \la  l \pm \Delta l$ with $\Delta l > 1$.
%
% addition Thu 7/2/02
%
% Provided the initial sampling range is well chosen
% ({\em i.e.,} of the same order of magnitude as the
% acceptable range), we found experimentally that
% the mean diffusion rate of $X$ per
% evaluation when $\Delta l = 1$ is at most 25\% slower than for Neal's
% method of shrinking to the rejected point. If the initial
% sampling range is not well chosen, the faster shrinking
% allowed here by setting $\Delta l > 1$ enables
% more rapid diffusion because an admittedly poorer
% acceptable jump is found more quickly.
 Taking $\Delta l$ at each step from any pre-assigned distribution (which
 may include $\Delta l=0$) allows extra flexibility.
\exercisxC{4}{ex.slice.ex}{
 In the shrinking phase, after an unacceptable $X'$ has been
 produced, the choice of $\Delta l$ is allowed to depend on
 the  difference between the slice's height $Y$ and the value
 of $P^*\!(X')$, without spoiling the algorithm's validity. (Prove this.)
 It  might be a good idea to 
 choose a larger value of  $\Delta l$  when  $Y-P^*\!(X')$ is large.
 Investigate this idea theoretically or empirically.
}
\ENDcuttable


 A feature  of using the integer representation is that, with a suitably
 extended number of bits, the single integer $X$ can represent
 two or more real parameters -- for example, by mapping $X$ to $(x_1,x_2,x_3)$
 through a space-filling curve such as a Peano curve.
 Thus \index{slice sampling!multi-dimensional}multi-dimensional
 slice sampling can be performed using the
 same software as for one dimension.
% Peano curves are useful here because they relate conveniently to a rectangular grid  and
% they have the best possible locality properties: nearby points on the curve
% are close in space (though not the converse, which is unattainable).
% In this case, each successive
% bit of $X$ represents a factor of 2 in volume. Because
% we are likely to be uncertain about the optimal sampling volume
% in several dimensions, it
% may be helpful to set $\Delta l$ to the dimensionality.




\newpage%%%%%%%%%%%%%%%%%%%%%%%% ADDED Fri 11/7/03
\section{Practicalities}
{\bf Can we predict how long a \MCMC\ simulation will take to equilibrate?}
 By considering the random walks involved in a \MCMC\ simulation
 we can obtain
 simple {\em lower bounds\/}  on the time required for convergence.
 But predicting this time more precisely 
 is a difficult problem, and most of the theoretical results
 giving upper bounds on the convergence time 
 are of little practical use. The exact sampling methods
 of \chref{ch.mcexact} offer a solution to this problem
 for certain Markov chains.
\medskip

\noindent
{\bf Can we diagnose or detect convergence in a running simulation?}
 This is also a difficult problem. There are a few practical tools available, 
 but none of them is perfect  \cite{Cowles1996a}.
\medskip

\noindent
{\bf Can we reduce the convergence time and the time between independent samples of a 
 \MCMC\ method?} 
 Here, there is good news: the next chapter
% following three sections,
 describes the \hybrid\ Monte Carlo method, overrelaxation, and simulated annealing. 

%%%%%%%%%%%%%%%%%%%%%%%%%

% this material is grabbed from later in the chapter advanced_mc.tex

\section{Further practical issues}
\subsection{Can the normalizing constant be evaluated?}
 If the target density $P(\bx)$ is given in the form of an unnormalized 
 density $P^*\!(\bx)$ with $P(\bx) = \frac{1}{Z} P^*\!(\bx)$, the value of 
 $Z$ may well be of interest. Monte Carlo methods do not 
 readily 
 yield an estimate of this quantity, and it is an area of active research 
 to find ways of evaluating it. Techniques for evaluating $Z$ 
 include:
\ben
\item
        Importance sampling   (reviewed by  \citeasnoun{Neal_dop})\index{importance sampling}\index{Monte Carlo methods!importance sampling}
 and \ind{annealed importance sampling} \cite{Radford_ais}.\index{Monte Carlo methods!annealed importance sampling}\index{annealing!importance sampling}
\item
        `Thermodynamic integration'\index{thermodynamic integration} during \ind{simulated annealing},\index{Monte Carlo methods!simulated annealing}\index{Monte Carlo methods!thermodynamic integration}\index{annealing}
 the `\index{acceptance ratio method}acceptance ratio' method, and `\ind{umbrella sampling}'\index{Monte Carlo methods!umbrella sampling}\index{Monte Carlo methods!acceptance ratio method} 
        (reviewed by  \citeasnoun{Neal_dop}).\index{Neal, Radford}
% and AIS
\item
        `Reversible jump \MCMC' \cite{Green1995}.\index{reversible jump}\index{Monte Carlo methods!reversible jump}
\een
%  \citeasnoun{Neal_dop} gives a review of these methods.  

 One way of dealing with $Z$, however, may be to find a 
 solution to one's task that does not require that $Z$ be evaluated.
 In Bayesian data modelling one might be able to avoid the need to evaluate $Z$ --
 which would be 
%  traditionally 
 important for model comparison -- by not having more than one
 model. Instead of using several models (differing in 
% their
 complexity, for example) and evaluating their relative posterior
 probabilities, one can make a single {\dbf hierarchical\/} model\index{hierarchical model}
 having, for example, various continuous \ind{hyperparameter}s which play a
 role similar to that played by the distinct models
 \cite{Radford_book}.
% The major objection to this approach of not evaluating $Z$
% is that
 In noting the possibility of not computing $Z$, I am not endorsing this
 approach. The normalizing constant $Z$ is often the single most
 important number in the problem, and I think every effort should be devoted
 to calculating it.
 
\subsection{The Metropolis method for big models}
 Our original description of the Metropolis method involved a joint 
 updating of all the variables using a proposal density $Q(\bx';\bx)$. 
 For big problems it may be more efficient to use several proposal 
 distributions  $Q^{(b)}(\bx';\bx)$, each of which updates only some
 of the components of $\bx$. Each proposal is individually accepted or 
 rejected, and the proposal distributions are repeatedly 
 run through in sequence. 
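
 One sweep through such a sequence of proposals might be sketched as
 follows. The sketch assumes the simplest case, in which proposal
 $Q^{(b)}$ changes only component $b$ by a symmetric Gaussian step; the
 function name {\tt{metropolis\_sweep}} and the step width {\tt{epsilon}} are
 hypothetical, and {\tt{findE}} is assumed to return the energy
 $E(\bx) = - \ln P^*\!(\bx)$.
\begin{verbatim}
function x = metropolis_sweep (x, findE, epsilon)
  # One sweep of single-component Metropolis updates (a sketch).
  #   x       : current state vector
  #   findE   : handle returning E(x) = -ln P*(x)            (assumed)
  #   epsilon : width of each Gaussian proposal              (assumed)
  E = findE (x) ;
  for b = 1:numel (x)
    xnew    = x ;
    xnew(b) = x(b) + epsilon * randn () ;   # Q^(b) changes component b only
    Enew    = findE (xnew) ;
    if rand () < exp (E - Enew)             # Metropolis rule, symmetric Q
      x = xnew ;  E = Enew ;                # accept this proposal
    endif                                   # otherwise keep the current state
  endfor
endfunction
\end{verbatim}
 Each proposal is accepted or rejected on its own, so a rejection of one
 component's move does not discard moves already accepted in the same sweep.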
\exercissxB{2}{ex.metropB}{
 Explain why the rate of movement through the state space 
 will be greater when $B$ proposals 
 $Q^{(1)} ,\ldots, Q^{(B)}$ are considered {\em individually\/}
 in sequence, compared with the case of a single proposal
 $Q^*$ defined by the concatenation of $Q^{(1)} ,\ldots ,Q^{(B)}$. 
 Assume that each proposal distribution  $Q^{(b)}(\bx';\bx)$  
 has an \ind{acceptance rate} $f<1/2$.
}

 In the Metropolis method, the proposal density $Q(\bx';\bx)$ typically 
 has a number of parameters that control, for example, its `width'. 
 These parameters are usually set by trial and error with the \ind{rule 
 of thumb} being to aim for a rejection frequency of about 0.5. 
 It is {\em not\/} valid to have the width parameters be dynamically 
 updated during the  simulation in a way that depends on the 
 history of  the  simulation. Such a modification of the proposal 
 density would violate the detailed balance condition that
 guarantees that  the Markov chain has the correct invariant distribution.

\subsection{Gibbs sampling in big models}
 Our description of Gibbs sampling involved sampling one parameter at a time, 
 as described in equations (\ref{eq.gibbs1}--\ref{eq.gibbs3}).
%:
%\beqan
%x_1^{\tplusone} &\sim& P( x_1 \given  x_2^{(t)} , x_3^{(t)} , \ldots ,x_K^{(t)} ) \\
%x_2^{\tplusone} &\sim& P( x_2 \given  x_1^{\tplusone} , x_3^{(t)} , \ldots ,x_K^{(t)} ) \\
%x_3^{\tplusone} &\sim& P( x_3 \given   x_1^{\tplusone} ,  x_2^{\tplusone} , \ldots , x_K^{(t)} ) , \: \mbox{ etc.}
%\eeqan
 For big problems it may be more efficient to sample {\em groups\/} of 
 variables jointly, that is, to use several proposal 
 distributions:
\beqan
x_1^{\tplusone}\hspace{-0.1in},\ldots, x_a^{\tplusone} &\!\!\sim\!\!& P( x_1,\ldots, x_a \given  x_{a+1}^{(t)} 
                         ,\ldots, x_K^{(t)} ) \\
x_{a+1}^{\tplusone} ,\ldots, x_b^{\tplusone}
         &\!\!\sim\!\!&
 P( x_{a+1}, \ldots, x_b \given  x_1^{\tplusone}\hspace{-0.1in} ,\ldots , x_a^{\tplusone} 
         , x_{b+1}^{(t)}  ,\ldots, x_K^{(t)} )  , \:\: \mbox{ etc.}
\nonumber
\eeqan
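
 As one concrete instance of a block update, suppose (purely as an
 assumption of this sketch) that $P(\bx)$ is a zero-mean Gaussian with
 precision matrix $\mathbf{A}$, so that
 $P(\bx) \propto \exp ( - \frac{1}{2} \bx^{\mathsf{T}} \mathbf{A} \bx )$.
 The conditional distribution of a block $\bx_I$ given the remaining
 variables $\bx_J$ is then Gaussian with mean
 $- \mathbf{A}_{II}^{-1} \mathbf{A}_{IJ} \bx_J$ and covariance $\mathbf{A}_{II}^{-1}$,
 and the block can be sampled jointly. The function name
 {\tt{gibbs\_block}} is hypothetical.
\begin{verbatim}
function x = gibbs_block (x, A, I)
  # Jointly resample the block x(I) of a zero-mean Gaussian with
  # precision matrix A, given all other components (a sketch).
  #   x : current state, a column vector
  #   I : indices of the block to be updated
  J    = setdiff (1:numel (x), I) ;       # indices held fixed
  AII  = A(I, I) ;
  mu   = -(AII \ (A(I, J) * x(J))) ;      # conditional mean
  L    = chol (AII) ;                     # AII = L' * L, L upper triangular
  x(I) = mu + L \ randn (numel (I), 1) ;  # covariance inv(L)*inv(L)' = inv(AII)
endfunction
\end{verbatim}
 Calling this function for each block in turn gives one sweep of the scheme above.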

\subsection{How many samples are needed?}
 At the start of this chapter, we observed that the variance of 
 an estimator $\hat{\Phi}$ depends only on the number of independent
 samples $R$ and the value of 
\beq
        \sigma^2 = \intdx\: P(\bx)  (\phi(\bx)-\Phi)^2 .
\eeq
 We have now discussed a variety of methods for generating 
 samples from $P(\bx)$. How many independent samples $R$ should we aim for? 

 In many problems, we really only need  about 
% a dozen (twelve)
 twelve 
 independent  samples  from $P(\bx)$. Imagine that $\bx$
 is an unknown vector such as the amount of corrosion present 
 in each of $10\,000$ underground pipelines around Cambridge,
% Cambridge, 
 and $\phi(\bx)$ 
 is the total cost of repairing those pipelines. The  
 distribution $P(\bx)$ describes the probability of 
 a state $\bx$ given the tests that have been carried 
 out on some pipelines and the assumptions about the physics of corrosion.
 The quantity $\Phi$ is the expected cost of the repairs. 
 The quantity $\sigma^2$ is the variance of the cost -- $\sigma$ 
 measures  by how much we should expect the actual cost to differ from the 
 expectation $\Phi$. 

 Now, how accurately would a manager like to know $\Phi$? I would suggest there
 is little point in knowing $\Phi$ to a precision finer than about 
 $\sigma/3$. After all, the true cost is likely to differ by 
 $\pm \sigma$ from $\Phi$. 
 If we obtain $R=12$ independent samples from $P(\bx)$, 
 we can estimate $\Phi$ to a precision of $\sigma/\sqrt{12}$ 
 -- which is smaller than $\sigma/3$. So twelve samples suffice.
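 Numerically, the comparison is
\beq
	\frac{\sigma}{\sqrt{12}} \simeq 0.29 \, \sigma \:<\: \frac{\sigma}{3} \simeq 0.33 \, \sigma ,
\eeq
 and, more generally, the requirement $\sigma/\sqrt{R} \leq \sigma/3$ is met
 by any $R \geq 9$, so twelve samples leave a small margin.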


\begin{figure}%[htbp]
\figuremargin{%
\begin{center}
\mbox{\psfig{figure=figs/mcresource.eps,angle=-90,width=2.5in}}
\end{center}
}{%
\caption[a]{Three possible Markov chain Monte Carlo 
 strategies for obtaining twelve samples 
 in a fixed amount of computer time.  Time is represented
 by horizontal lines; samples by white circles.
 (1) A single run consisting of one long `burn in' period followed
 by a sampling period. (2) Four medium-length runs with 
 different initial conditions and a medium-length burn in period.
 (3) Twelve short runs.}
\label{fig.mcresource}
}%
\end{figure}
%
\subsection{Allocation of resources}
\label{sec.mcresource}
% Choice of strategy}
 Assuming we have decided how many independent samples $R$ 
 are required, 
 an important question is how one should make use of one's limited computer 
 resources to obtain these samples. 

 A typical \MCMC\ experiment involves an initial period in which 
 control parameters of the simulation such as step sizes may be adjusted. 
 This is followed by a `burn in' period during which we hope the simulation 
 `converges' to the desired distribution. Finally, as the simulation 
 continues, we record the state vector occasionally so as to create 
 a list of states $\{ \bx^{(r)}\}_{r=1}^{R}$
 that we hope are roughly independent samples from 
 $P(\bx)$.

 There are several possible strategies (\figref{fig.mcresource}):
\ben
\item Make one  long run, obtaining all $R$ samples from it.
\item Make a few medium-length runs with different
 initial conditions, obtaining some samples     from each.
\item Make $R$ short runs, each starting from a different random 
 initial condition, with the only state that is recorded being the 
 final state of each simulation.
\een
 The first strategy has the
 best chance of attaining `convergence'. 
 The last strategy  may have the advantage that the correlations between 
 the recorded samples are  smaller.
 The middle path is popular with \MCMC\ experts \cite{MCMC96}
 because it avoids the inefficiency of discarding burn-in
 iterations in many runs, while still allowing one to 
 detect problems with lack of
 convergence that would not be apparent from a single
 run.
%The lots-of-short-runs versus one-long-run has been very controversial. You
%should reference the Gelman and Rubin and Geyer papers on the topic.

 Finally, I should emphasize that there is no need to make the
 points in the estimate nearly-independent.  Averaging
 over  dependent points is fine -- it won't lead to any bias
 in the estimates.  For example, when you use  strategy 1 or 2, you may, if you wish,
 include all the points between the first  and last sample in each run.
  Of course,  estimating the accuracy of the estimate is harder when the
  points are dependent.
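
  To see why dependence introduces no bias, suppose, as an idealization,
  that the chain has converged, so that every recorded point $\bx^{(t)}$,
  $t = 1, \ldots, T$, is marginally distributed according to $P(\bx)$.
  Then, by linearity of expectation,
\beq
	\left< \frac{1}{T} \sum_{t=1}^{T} \phi(\bx^{(t)}) \right>
	= \frac{1}{T} \sum_{t=1}^{T} \left< \phi(\bx^{(t)}) \right>
	= \Phi ,
\eeq
  whatever the correlations between the points. The correlations affect
  only the variance of the average, which is typically larger than the value
  $\sigma^2/T$ that $T$ independent points would give when successive points
  are positively correlated.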


% \section{Philosophy} moved to graveyard.tex  Sun 3/2/02

\section{Summary}
\bit
\item
 Monte Carlo methods are powerful tools that allow one to sample from
 any probability distribution that can be expressed in the form
 $P(\bx) = \frac{1}{Z} P^*\!(\bx)$.
\item
 Monte Carlo methods  can answer virtually any query related 
 to $P(\bx)$ by putting the query in the 
 form
\beq
        \intdx \: \phi(\bx) P(\bx) \simeq \frac{1}{R} \sum_r \phi(\bx^{(r)}) .
\eeq
%  and estimating this integral by sampling.
\item
 In  high-dimensional problems the   only satisfactory methods
 are those based on Markov chains, such as the Metropolis method, Gibbs sampling and slice
 sampling.  Gibbs sampling is an attractive method
 because it has no adjustable parameters but its use is restricted to
 cases where samples can be generated from the conditional distributions.
 Slice sampling is
 attractive because, whilst it has step-length parameters, its performance
 is not very sensitive to their values. 

\item
 Simple Metropolis algorithms and Gibbs sampling algorithms,
 although widely used, 
 perform poorly because they explore the
 space by a slow random walk. The next chapter will discuss
 methods for speeding up Markov chain Monte Carlo simulations.
% More sophisticated Metropolis
% algorithms such as \hybrid\ Monte Carlo, which we discuss in the next
% chapter,
% (see \citeasnoun{Neal_dop}) 
% make use
% of proposal densities that give faster movement through  the state space.
%
% The efficiency of Gibbs sampling is also troubled by random walks. 
% The method of ordered overrelaxation is a general purpose technique
% for suppressing them.
\item
% for summary 
 Slice sampling does not avoid random walk behaviour,
 but it automatically chooses the largest appropriate
 step size, thus reducing the bad effects of the random walk
 compared with, say, a Metropolis method with a tiny step size.
\eit

\section{Exercises}
%
% I rate this ex as one of the best bits of this book
%
\exercissxA{2C}{ex.isproblem}{
	{\sf A study of importance sampling.}
%
	We already established in section \ref{sec.importance}
 that importance sampling is likely to be useless
 in high-dimensional problems.
	This exercise explores a further \index{sermon!importance sampling}cautionary tale, showing\index{caution!importance sampling}
 that importance sampling can fail even in one dimension,
 even with
 friendly Gaussian distributions.\index{Monte Carlo methods!importance sampling!weakness of}

	Imagine that we want to know the expectation of a function
 $\phi(x)$ under a distribution $P(x)$, 
\beq
	\Phi = \int \d x \: P(x) \phi(x) ,
\eeq
 and that this expectation is estimated by importance sampling
 with a distribution $Q(x)$.
 Alternatively, perhaps we wish to estimate the normalizing constant
 $Z$ in $P(x) = P^*\!(x)/Z$ using
\beq
	Z = \int \d x \: P^*\!(x) =   \int \d x \: Q(x) \frac{P^*\!(x)}{Q(x)}
	= \left< \frac{P^*\!(x)}{Q(x)} \right>_{x\sim Q} .
\eeq
 Now, let $P(x)$ and $Q(x)$ be Gaussian distributions with
 mean zero and standard deviations $\sigma_p$ and $\sigma_q$.
 Each point $x$ drawn from $Q$ will have an associated weight
 $P^*\!(x)/Q(x)$. 
 What is the variance of the weights? [Assume that $P^* = P$, so
 $P$ is actually normalized, and $Z=1$, though we can pretend that we didn't know
 that.]
 What happens to the variance of the weights as $\sigma^2_q \rightarrow
  \sigma^2_p/2$?

 Check your theory by simulating this importance-sampling problem
 on a computer. 
}
\exercisaxA{2}{ex.metFred}{
 Consider the  Metropolis algorithm for the 
 one-dimensional toy problem of section \ref{sec.metrop.demo},
 sampling from $\{ 0,1,\ldots,20\}$. 
 Whenever the current state is one of the end states, 
 the proposal density given in \eqref{eq.metropb} will propose with 
 probability 50\% a state that will be rejected. 

 To reduce this `waste', Fred modifies the software responsible for 
 generating samples from $Q$ so that when $x=0$, the proposal density 
 is 100\% on $x'=1$, and similarly when $x=20$, $x'=19$ is always 
 proposed.  Fred sets the software that implements the acceptance 
 rule so that the software accepts all proposed moves.
 What probability $P'(x)$ will Fred's modified 
 software generate samples from? 

 What is the correct acceptance rule for Fred's proposal density, in 
 order to obtain samples from $P(x)$?
}
%%%%%%%%%%%% extra exercises added draft 4.1 %%%%%%%%%%%%%
\exercisxB{3C}{ex.doGibbs1}{
	Implement Gibbs sampling for the inference of a
 single one-dimensional Gaussian, which we studied using maximum likelihood in \secref{sec.mloneg}.
 Assign a broad Gaussian prior to $\mu$ and a broad gamma prior (\ref{gamma.dist.again})
 to the  \ind{precision} parameter
 $\beta = 1/\sigma^2$. 
 Each update of $\mu$ will involve a sample from a Gaussian distribution,
 and each update of $\sigma$ is made by sampling $\beta$ from a gamma distribution.
}
\exercisxA{3C}{ex.doGibbs2}{
 {\sf Gibbs sampling for clustering.}
	Implement Gibbs sampling for the inference of a
 mixture of $K$
% two or more
 one-dimensional Gaussians, which we studied using maximum likelihood in \secref{sec.mog}.
 Allow the clusters to have different standard deviations $\sigma_k$.
% Assign a uniform prior to the
 Assign priors to the means and standard deviations in the same way as the previous
 exercise.  Either fix the prior probabilities of the classes $\{ \pi_k \}$ to be equal 
 or put a uniform prior over the parameters $\pi$ and include them in the Gibbs sampling.
%
% [ -0.01 -0.27 0.1 0.31  0.706   1.07 1.37 1.16  1.2 1.25 1.3 1.33 1.65 ]

 Notice the similarity  of Gibbs sampling to the  soft K-means clustering algorithm (\algref{alg.kmeansoft2}).
 We can alternately {\em assign\/} the class labels $\{ k_n \}$ given the parameters $\{ \mu_k , \sigma_k \}$,
 then {\em update\/} the parameters given the class labels.
 The assignment step involves sampling from the probability distributions defined by
 the responsibilities (\ref{eq.assignII}), and the
 update step updates  the means  and variances using probability distributions
 centred on the K-means algorithm's values (\ref{eq.softkmeans.meanupdate}, \ref{eq.softkmeans.varianceupdate}).

 Do your experiments confirm that Monte Carlo methods bypass the overfitting
 difficulties of maximum
 likelihood discussed in \secref{sec.kaboom}?

 A solution to this exercise and the previous one,
 written in {\tt{octave}}, is available.\footnote{{\tt{http://www.inference.phy.cam.ac.uk/mackay/itila/}}}
}
\exercisxB{3C}{ex.doGibbs3}{
	Implement Gibbs sampling for the  {\sf  seven scientists} inference problem,
 which we encountered in \exerciseref{ex.manyparams}, and which you may
 have solved by exact marginalization (\exerciseref{ex.manyparamsb}) [it's not essential to have done the latter]. 
}
%%%%%%%%%%%% end extra exercises added draft 4.1 %%%%%%%%%%%%%
\exercisxB{2}{ex.walkGau}{
 A Metropolis method is used to explore a distribution $P(\bx)$
 that is actually a 1000-dimensional spherical Gaussian distribution of standard deviation
 1 in all dimensions.
 The proposal density $Q$ is a 1000-dimensional spherical Gaussian distribution
 of standard deviation $\epsilon$.
 Roughly what is the step size $\epsilon$ if the \ind{acceptance rate}
 is 0.5?
 Assuming this value of $\epsilon$,
\ben
\item
 roughly how long would the method take to traverse the distribution
 and generate a sample independent of the initial condition?
\item
 By how much does $\ln P(\bx)$ change in a typical step?
 By how much should $\ln P(\bx)$ vary when $\bx$ is drawn from
 $P(\bx)$?
\item
 What happens if, rather than using a Metropolis
  method that tries to change all components at once, one instead uses
  a concatenation of Metropolis updates changing one component at a time?
\een
}
\exercisaxB{2}{ex.walkE}{
 When discussing the time taken by the Metropolis algorithm to 
 generate independent samples we considered a distribution 
 with longest spatial 
 length scale $L$ being explored using a proposal distribution
 with step size $\epsilon$. 
 Another dimension 
% non-spatial exploration
 that an MCMC method must explore  is the range of 
 possible values of the log probability 
% $E(\bx) \equiv 
 $\ln P^*\!(\bx)$. Assuming that the state $\bx$ contains a number of 
 independent random variables proportional to $N$, 
 when samples are drawn from $P(\bx)$, the
% `$\!$Asymptotic Equipartition' Principle
 `\ind{asymptotic equipartition}' principle
 tells us  that the value of $- \ln P(\bx)$
 is likely to be close to the entropy of $\bx$, varying either side with 
 a standard deviation that scales as $\sqrt{N}$. 
 Consider a Metropolis method with a symmetrical proposal density, 
 that is, one that satisfies $Q(\bx;\bx') = Q(\bx';\bx)$. Assuming that 
 accepted jumps  either increase $\ln P^*\!(\bx)$ by some amount
 or decrease it 
 by a {\em small\/} amount, \eg\ $\ln e=1$ (is this a reasonable
 assumption?), discuss how long 
 it must take to generate roughly independent samples from $P(\bx)$. 
 Discuss whether Gibbs sampling has similar properties.
}
%the point of 23.11 (exercise) is the idea that as well 
%as a spatial random walk, there are other ways of 
%thinking about the random walks that MCMC does. Other dimensions.
%For example, during a simulation, the energy of a system
%wanders up and down. And it has to cover the "typical" range
%of values before we can expect the simulation to converge.
%Therefore the convergence time is something like (X/x)^2
%where X is the range of energies and x is the typical change
%in energy.
%
% \exercis{ex.goodapproxsample}{
%  Compare and contrast what makes a
% % n approximating
%  distribution $Q$
%  a good variational approximation to a distribution $P$
%  (as in the previous chapter)
%  and what makes a distribution $Q$ a good sampler for 
%  importance sampling. 
% }
\exercisxC{3}{ex.ZMC}{
	Markov chain Monte Carlo methods do not compute
 partition functions $Z$,
 yet they allow ratios of quantities like $Z$ to
 be estimated. For example, consider a random-walk Metropolis algorithm
 in a state space where the energy is zero in a connected accessible
 region, and infinitely large everywhere else; 
 and imagine that the accessible space can be chopped into two regions
 connected by one or more corridor states. The fraction of
 times spent in each region at equilibrium is  proportional to
 the volume of the region. How does the Monte Carlo method
 manage to do this without measuring the volumes?
}
\exercisxC{5}{ex.BayesianMC}{
{\sf Philosophy}.\index{philosophy}

 One curious defect of these Monte Carlo methods -- which are widely used 
 by Bayesian statisticians -- is that they are all non-Bayesian \cite{ohagan87}. 
 They involve  computer experiments from which 
 {\em estimators\/} of quantities of interest are derived. These estimators
 depend on the proposal distributions that were used to generate 
 the samples and on the random numbers that happened
 to come out of our random number generator.
 In contrast,  an alternative Bayesian approach to 
 the problem would use the results of our computer experiments 
 to infer the properties of the target function $P(\bx)$ and 
 generate predictive distributions for quantities of interest such as $\Phi$. 
 This approach would give answers that would depend only on the 
 computed values of $P^*(\bx^{(r)})$ at the  points 
 $\{ \bx^{(r)} \}$; the answers would not depend on how those 
 points were chosen.

 Can you make a Bayesian Monte Carlo method?
 (See \citeasnoun{zoubincarlBMC} for a practical  attempt.)
}

%  \input{tex/bayes_mc.tex}
\dvips
\section{Solutions}% to Chapter \protect\ref{ch.mc}'s exercises} 
% 
\fakesection{s11.tex}
\soln{ex.Phiconverge}{ 
 We wish to show that 
\beq
	\hat{\Phi} \equiv \frac{ \sum_{r} w_r \phi( \xfromq^{(r)} ) }{ \sum_r w_r } 
% \label{eq.is}
\eeq
 converges to $\Phi$, the expectation of $\phi$ under $P$. We consider the 
 numerator and the denominator separately. First, the denominator.
 Consider a single importance weight 
\beq
	w_r \equiv \frac{ P^*(\xfromq^{(r)}) }{ Q^*(\xfromq^{(r)}) }  .
% copied from \label{eq.mc.is.weight.def}
\eeq
 What is its expectation, averaged under the distribution $Q=Q^*/Z_Q$ of 
 the point $\xfromq^{(r)}$?
\beq
	\langle w_r \rangle
	= \int \d \xfromq \,
	Q( \xfromq )
	 \frac{ P^*(\xfromq) }{ Q^*(\xfromq) }
	= \int \d \xfromq \,
	\frac{1}{Z_Q} 
	 P^*(\xfromq) 
	= \frac{Z_P}{Z_Q} .
\eeq
 So the expectation of the denominator is 
\beq
	\left< \sum_r w_r \right> = R \frac{Z_P}{Z_Q} .
\eeq
  As long as the variance of $w_r$ is finite, the denominator, divided 
 by $R$, will converge to $Z_P/Z_Q$ as $R$ increases.
 [In fact, the estimate  converges to the right answer even if this variance is
  infinite, as long as the expectation is well-defined.]
 Similarly, the expectation of one term in the numerator is
\beq
	\langle  w_r  \phi( \xfromq ) \rangle =   
	\int \d \xfromq \,
	Q( \xfromq )
	 \frac{ P^*(\xfromq) }{ Q^*(\xfromq) } \phi( \xfromq ) 
	= \int \d \xfromq \,
	\frac{1}{Z_Q} 
	 P^*(\xfromq)  \phi( \xfromq ) 
	= \frac{Z_P}{Z_Q} {\Phi} ,
\eeq
 where $\Phi$ is the expectation of $\phi$ under $P$.
 So the numerator, divided by $R$, converges to $\smallfrac{Z_P}{Z_Q} {\Phi}$
 with increasing  $R$. 
 Thus $\hat{\Phi}$ converges to $\Phi$.

 The numerator and the denominator are unbiased estimators of 
 $R (Z_P/Z_Q) \Phi$ and 
 $R Z_P/Z_Q$ respectively, but their ratio $\hat{\Phi}$
 is not necessarily an unbiased estimator for finite $R$.
%%%%%%%%%%%%%%%% HELP !!!!!!!!!!!!!!!!!!!!!!!!!!!
% More here about variance and bias.
%%%%%%%%%%%%%%%% HELP !!!!!!!!!!!!!!!!!!!!!!!!!!!
 }
\soln{ex.peakysample}{When the true density $P$ is multimodal,
 it is unwise to use importance sampling
 with a sampler density fitted to one mode, because on the rare
 occasions that a point is produced that lands in one of the other modes,
 the weight associated with that point will be enormous. The
 estimates will have enormous variance, but this enormous variance
 may not be evident to the user if no points in the other mode have been seen.
}
%\soln{ex.randomwalk}{ ... }
%\soln{ex.gibbs.eq.met}{ ... }
\soln{ex.gibbs.h74}{
 The posterior distribution
 for the syndrome decoding problem
 is  a pathological distribution from
 the point of view of Gibbs sampling.
 The factor $\truth[ \bH \bn = \bz ]$ is only 1 on a small fraction
 of the space of possible vectors $\bn$, namely the $2^K$ points
 that correspond to the valid codewords.  No two codewords are adjacent,
 so any single bit flip from a viable state $\bn$ will
 take us to a state with zero probability, and so the state will never move
 in Gibbs sampling.

 A general code  has exactly the same problem.
 The points corresponding to valid codewords are relatively few in number
 and they are not adjacent (at least for any useful code).
 So Gibbs sampling is no use for syndrome decoding for two reasons.
 First, finding {\em any\/} reasonably good hypothesis is difficult, and
 as long as the state is not near a valid codeword, Gibbs sampling
 cannot help since none of the conditional distributions is defined;
 and second, once we are in a valid hypothesis, Gibbs sampling
 will never take us out of it.

% However, clever modifications of Gibbs sampling, using several
% annealing parameters, have been developed by \citeasnoun{Neal_mcdecoder},
% who has demonstrated that Monte Carlo decoding of certain codes,
% while inefficient, is not impossible.
 One could attempt to perform Gibbs sampling using the
 bits of the original message $\bs$ as the variables. This
 approach would not get locked up in the way just described,
 but, for a good code, any single bit flip would substantially alter
 the reconstructed codeword, so if one had found a state
 with reasonably large likelihood, Gibbs sampling would take
 an impractically large time to escape from it.
 

}
%%%%%%%%%%%%%%%%%%
\soln{ex.metropB}{
 Each Metropolis proposal will take the energy
 of the state up or down by some amount.
 The total change in energy  when $B$ proposals
 are concatenated will be the end-point of a random walk
 with $B$ steps in it.  This walk might have mean zero,
 or it might have a tendency to drift upwards (if most
 moves increase the energy and only a few decrease it). In general
 the latter will hold, if the acceptance rate $f$ is small:
 the mean change in energy from any one move will be some $\Delta E>0$
 and so the acceptance probability for the concatenation of  $B$
 moves will be 
 of order $1/(1+\exp(B \Delta E))$, which scales roughly as $f^B$.
 The mean-square-distance moved will be of order $f^B B \epsilon^2$,
 where $\epsilon$ is the typical step size.
 In contrast, the mean-square-distance moved when the
 moves are considered individually will be of order $f B \epsilon^2$.
}
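 As an illustration with assumed numbers, take $f = 0.3$ and $B = 10$.
 The two mean-square distances are then of order
\beq
	f^B B \epsilon^2 = 0.3^{10} \times 10 \, \epsilon^2 \simeq 6 \times 10^{-5} \, \epsilon^2
	\:\:\:\: \mbox{and} \:\:\:\:
	f B \epsilon^2 = 3 \, \epsilon^2 ,
\eeq
 so concatenating the proposals slows the movement by a factor
 of order $f^{B-1} = 0.3^{9} \simeq 2 \times 10^{-5}$.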

% importance sampling
% see also ~/itp/importance/
\begin{figure}[htpb]
\figuredanglenudge{
\begin{center}
\begin{tabular}{ccc}
\mbox{\psfig{figure=importance/mean.ps,width=2.3in,angle=-90}}&
\mbox{\psfig{figure=importance/std.ps,width=2.3in,angle=-90}}&
\mbox{\psfig{figure=importance/ws.ps,width=2.3in,angle=-90}}\\
\end{tabular}
\end{center}
}{
\caption[a]{Importance sampling in one dimension.
 For $R=1000,$ $10^4$, and $10^5$,
 the normalizing constant of a Gaussian distribution  (known
 in fact to be 1) was
 estimated using importance sampling with a sampler density of standard
 deviation $\sigma_q$ (horizontal axis).
 The same random number seed was used for all runs.
 The three plots show (a) the estimated normalizing constant;
 (b) the {\em empirical\/}
 standard deviation of the $R$ weights; (c) 30 of the weights.
}
\label{fig.iscrazy}
}{-0.15in}
\end{figure}
\soln{ex.isproblem}{
 The weights are $w = P(x)/Q(x)$ and $x$ is drawn from $Q$.
 The mean weight is
\beq
	\int \d x \: Q(x) \left[ P(x)/Q(x) \right]
	= \int \d x \:  P(x) = 1,
\eeq
 assuming the integral converges.
 The variance is
\beqan
	\var ( w ) &=& \int \d x \: Q(x) \left[ \frac{P(x)}{Q(x)} - 1 \right]^2
\\
 &=& \int \d x \: \left[ \frac{P(x)^2}{Q(x)}  - 2 P(x) + Q(x) \right]
\\
% &=& \left[ \int \d x \: \frac{Z_Q}{Z_P^2}
% \frac{ \exp \left( - 2 x^2/(2 \sigma^2_p) \right) }
% { \exp \left( - x^2/(2 \sigma^2_q) \right)}
% \right]
% - 2  + 1
%\\
 &=& \left[ \int \d x \: \frac{Z_Q}{Z_P^2}
  \exp \left(
  -  \frac{x^2}{2}
 \left( \frac{2}{\sigma^2_p} -  \frac{1}{\sigma^2_q}  \right)
 \right)
 \right]
 -  1 ,
\label{eq.nasty}
\eeqan
 where $Z_Q/Z_P^2 = \sigma_q/(\sqrt{2\pi}\sigma_p^2)$.
 The integral in (\ref{eq.nasty})
 is finite only if the factor $\left( 2/\sigma^2_p - 1/\sigma^2_q \right)$
 multiplying $-x^2/2$ in the exponent is positive, \ie,
 if
%\beq
%	  \left( \frac{1}{\sigma^2_p} -  \frac{1}{2 \sigma^2_q}  \right) > 0
%\eeq
% \ie,
\beq
	 \sigma^2_q > \frac{1}{2}  \sigma^2_p .
\eeq
 If this condition is satisfied, the variance is
\beq
%\left[ \int \d x \: \frac{Z_Q}{Z_P^2}
%  \exp \left(
%  -  x^2 \left( \frac{1}{\sigma^2_p} -  \frac{1}{2 \sigma^2_q}  \right)
% \right] -  1 ,
\var(w) = 
 \frac{\sigma_q}{\sqrt{2\pi}\sigma_p^2} \sqrt{2 \pi}
 \left( \frac{2}{\sigma^2_p} -  \frac{1}{\sigma^2_q}  \right)^{\!-\frac{1}{2}} \!\!\!\! - 1
\:=\:
 \frac{\sigma_q^2}{\sigma_p \left( 2 \sigma^2_q - \sigma^2_p \right)^{1/2}}
   - 1.
\eeq
 As $\sigma_q$ approaches the critical value -- about $0.7 \sigma_p$ --
 the variance becomes infinite.
 \Figref{fig.iscrazy} illustrates these phenomena for $\sigma_p=1$ with 
 $\sigma_q$ varying from 0.1 to 1.5.
{\em  The same random number seed was used for all runs,}
 so the weights and estimates follow smooth curves.
 Notice that the {\em empirical\/}
 standard deviation of the $R$ weights can look quite small
 and well-behaved (say, at $\sigma_q \simeq 0.3$) when the true
 standard deviation is nevertheless infinite.
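
 As a rough numerical check of the variance formula
 (the particular values of $\sigma_q$ are chosen here only for illustration),
 set $\sigma_p = 1$, so that
 $\var(w) = \sigma_q^2 / ( 2 \sigma_q^2 - 1 )^{1/2} - 1$. Then
\beq
\begin{array}{lll}
 \sigma_q = 1 : & \var(w) = 1/\sqrt{1} - 1 & = 0 , \\
 \sigma_q = 1.5 : & \var(w) = 2.25/\sqrt{3.5} - 1 & \simeq 0.20 , \\
 \sigma_q = 0.8 : & \var(w) = 0.64/\sqrt{0.28} - 1 & \simeq 0.21 , \\
 \sigma_q = 0.75 : & \var(w) = 0.5625/\sqrt{0.125} - 1 & \simeq 0.59 .
\end{array}
\eeq
 The case $\sigma_q = 1$ is the sampler $Q = P$, for which every weight is exactly 1;
 the variance grows without limit as $\sigma_q$ decreases towards $1/\sqrt{2} \simeq 0.71$.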

%
}
% \soln{ex.goodapproxsample}{
%  Variational free energy approximation: compact $Q$ good, heavy--tailed $Q$
%  bad.
%  Importance sampling: compact $Q$ bad, heavy--tailed $Q$ good.
% }



\dvipsb{solutions mc}
% {Efficient Monte Carlo methods}
\chapter{Efficient Monte Carlo Methods \nonexaminable}
\label{ch.mc2}
%\{Speeding up Monte Carlo methods}
% \subsection{Reducing 
 This chapter discusses
 several methods for
 {reducing random walk behaviour in Metropolis methods}.
 The aim is to reduce the  time
 required to obtain effectively 
 independent samples.
 For brevity, we will
%  deliberately
% use imprecise terminology,
 say `independent samples' when we mean
 `effectively 
 independent samples'.

\section{\Hybrid\ Monte Carlo}
\begin{algorithm}
\begin{framedalgorithmwithcaption}{
%\figuremargin{%{%%%%%%%\margincaption{%
\caption[a]{{\tt Octave} source code for the \hybrid\  Monte Carlo method.} 
\label{fig.hmc}
}%
\footnotesize
\begin{verbatim}
g = gradE ( x ) ;                  # set gradient using initial x
E = findE ( x ) ;                  # set objective function too
                             
for l = 1:L                        # loop L times
   p = randn ( size(x) ) ;         # initial momentum is Normal(0,1)
   H = p' * p / 2 + E ;            # evaluate H(x,p)
                             
   xnew = x ;  gnew = g ;        
   for tau = 1:Tau                 # make Tau `leapfrog' steps
                                    
      p = p - epsilon * gnew / 2 ; # make half-step in p
      xnew = xnew + epsilon * p ;  # make step in x
      gnew = gradE ( xnew ) ;      # find new gradient 
      p = p - epsilon * gnew / 2 ; # make half-step in p 
                                    
   endfor                           
                                    
   Enew = findE ( xnew ) ;         # find new value of H
   Hnew = p' * p / 2 + Enew ;  
   dH = Hnew - H ;                 # Decide whether to accept

   if ( dH < 0 )                accept = 1 ; 
   elseif ( rand() < exp(-dH) ) accept = 1 ; 
   else                         accept = 0 ;
   endif

   if ( accept )
      g = gnew ;   x = xnew ;    E = Enew ; 
   endif
endfor
\end{verbatim}
\end{framedalgorithmwithcaption}
\end{algorithm}
%\newcommand{\Tau}{\mbox{\verb+Tau+}}
%\newcommand{\ttepsilon}{\mbox{\verb+epsilon+}}
\newcommand{\Tau}{\mbox{\tt{Tau}}}
\newcommand{\ttepsilon}{\mbox{\tt{epsilon}}}
\begin{figure}
\figuremargin{\small%
\begin{center}
\begin{tabular}{crcr}
\multicolumn{2}{c}{\Hybrid\ Monte Carlo}
&
\multicolumn{2}{c}{Simple Metropolis}
\\
%%%%%%%%%%%%%%%%%%%%
% HMC easy start
\raisebox{1.5in}{\makebox[0.1in][l]{(a)}}&%
\hspace{-0.42in}\psfig{figure=hmcdemo/hmc.sample2.ps,angle=-90,width=2.53in}%
%
% detail inset:
%
\makebox[0.0in][r]{\hspace{-0.15in}\raisebox{0.1in}{%
%{\small\sf{detail:}}
\psfig{figure=hmcdemo/det0.ps,angle=-90,width=1.5in}}}%
%%%%%%%%%%%%%%%%%%%%
&
%%%%%%%%%%%%%%%%%%%%
% metrop
\raisebox{1.5in}{\makebox[0.1in][l]{(c)}}&%
\hspace{-0.42in}\psfig{figure=hmcdemo/metrop2.ps,angle=-90,width=2.53in}%
\makebox[0in][l]{\hspace{-0.15in}\raisebox{0.1in}{\makebox[0.0in][r]{%
%{\small\sf{detail:}}
\psfig{figure=hmcdemo/det2.ps,angle=-90,width=1.5in}}}}
%%%%%%%%%%%%%%%%%%%%
\\
%%%%%%%%%%%%%%%%%%%%
% HMC hard start
\raisebox{1.5in}{\makebox[0.1in][l]{(b)}}&%
\hspace{-0.42in}\psfig{figure=hmcdemo/hmc.converge4.ps,angle=-90,width=2.53in}
%%%%%%%%%%%%%%%%%%%%
&%
%%%%%%%%%%%%%%%%%%%%
% metrop
\raisebox{1.5in}{\makebox[0.1in][l]{(d)}} & %
\hspace{-0.42in}\psfig{figure=hmcdemo/metrop4.ps,angle=-90,width=2.53in}
%%%%%%%%%%%%%%%%%%%%
\\
\end{tabular}
\end{center}
}{%
\caption[a]{{(a,b) \Hybrid\ Monte Carlo used to generate samples from  a 
 \ind{bivariate Gaussian} with correlation $\rho = 0.998$. (c,d) For comparison, a simple
 \index{Monte Carlo methods!random-walk Metropolis}\ind{random-walk Metropolis method}, given equal computer time.}
%
} 
\label{fig.hmcdemo}
}%
\end{figure}
%
%
 The \indexs{Hamiltonian Monte Carlo}\index{Monte Carlo methods!Hamiltonian Monte Carlo}{\hybrid\ Monte Carlo}
 \index{algorithm!Hamiltonian Monte Carlo}method
%  has been reviewed and developed by \citeasnoun{Neal_dop}.
 is a Metropolis method, applicable to continuous state 
 spaces, that makes use of gradient information to reduce
 random walk behaviour. [The Hamiltonian Monte Carlo method
 was originally called  \ind{hybrid Monte Carlo}, for historical
 reasons.]
 
 For many systems whose probability $P(\bx)$  can be written in the form
\beq
        P(\bx) = \frac{ e^{- E(\bx)} }{Z},
\eeq 
 not only $E(\bx)$ but also its
 gradient with respect to $\bx$  can  be readily evaluated. It seems 
 wasteful to use a simple random-walk Metropolis method when 
 this gradient is available -- the gradient indicates
 which direction one should go in 
 to find states that have higher probability!

\subsection{Overview of  \hybrid\ Monte Carlo}
 In the   {\hybrid\ Monte Carlo} method, 
 the state space $\bx$ is augmented by  {\em  momentum
 variables\/} $\bp$, and there is an alternation of two types 
 of proposal. The first proposal  randomizes the 
 \ind{momentum} variable, leaving the state $\bx$ unchanged.
 The second 
 proposal  changes both $\bx$ and $\bp$
 using
 simulated Hamiltonian dynamics as defined by the Hamiltonian
\beq
        H(\bx,\bp) = E(\bx) + K(\bp) ,
\eeq
 where $K(\bp)$ is a `kinetic energy' such as $K(\bp) = \bp^{\T}\bp/2$.
% are iterated for a number of steps; 
 These two proposals are used to create (asymptotically) samples from 
 the joint density 
\beq
        P_H(\bx,\bp) = \frac{1}{Z_H} \exp [ - H(\bx,\bp) ] =  \frac{1}{Z_H} \exp [ -   E(\bx) ]  \exp [ -   K(\bp) ].
\eeq
 This density is separable, 
 so  the marginal distribution of $\bx$ is
 the desired distribution $\exp [ -   E(\bx) ]/Z$. 
 So, simply discarding the momentum variables, we  obtain a sequence of 
 samples $\{ \bx^{(t)} \}$ that asymptotically come from 
 $P(\bx)$. 

\subsection{Details of  \hybrid\ Monte Carlo}
 The first proposal, which can be viewed as a Gibbs sampling update,
 draws a new momentum from the 
 Gaussian density $\exp [ -   K(\bp) ]/{Z_K}$.  This
% is a  Gibbs sampling update and the
 proposal is always accepted.
 During the second, dynamical proposal,
 the momentum variable determines where the 
 state $\bx$ goes, and the {\em gradient\/} of
% the log of the probability  density $P(\bx)$
 $E(\bx)$ determines how the momentum $\bp$  changes, in accordance
 with the equations
\beqan
 \dot{\bx} &=& \bp \\
  \dot{\bp} &=& - \frac{\partial E(\bx)}{\partial\bx} 
 .
\eeqan 
 Because of the persistent motion of $\bx$ in the direction of the 
 momentum $\bp$ 
 during each  dynamical proposal, 
 the state of the system tends to move a distance that goes {\em linearly\/}
 with the computer time, rather than as the square root.
%
% see itp/hmcdemo
% octave
% DEMO
%

 The second proposal is accepted in accordance with the Metropolis rule.
 If the simulation of the Hamiltonian dynamics is numerically perfect
 then the proposals are accepted every time, because the total energy 
 $H(\bx,\bp)$ is a constant of the motion and so $a$ in \eqref{eq.ratio.metrop}
 is equal to one. If the simulation is
 imperfect, because of finite step sizes for example, then some of the
 dynamical proposals will be rejected. The rejection rule makes use of
 the change in $H(\bx,\bp)$, which is zero if the simulation is
 perfect.   The occasional rejections ensure that, 
 asymptotically, we obtain samples $(\bx^{(t)},\bp^{(t)})$
 from the required joint density $P_H(\bx,\bp)$.

 The source code in \figref{fig.hmc} describes a \hybrid\ Monte Carlo 
 method that uses the `leapfrog' algorithm\index{leapfrog algorithm}\index{algorithm!leapfrog}
 to simulate the dynamics 
 on the function {\tt{findE(x)}}, whose gradient is found by the function
 {\tt{gradE(x)}}. \Figref{fig.hmcdemo} shows this algorithm generating 
 samples from a bivariate Gaussian whose energy function is 
 $E(\bx) = \half \bx^{\T} \bA \bx$ with 
\beq
\bA = \left[ 
\begin{array}{rr}
   250.25 & -249.75 \\
  -249.75 &  250.25 
\end{array}
\right] ,
\eeq
%
 corresponding to a variance--covariance matrix of
\beq
 \left[ 
\begin{array}{ll}
  1 & 0.998 \\
  0.998 &  1
\end{array}
\right] .
\eeq
 In \figref{fig.hmcdemo}a,
 starting from the state marked by the arrow, the solid line 
 represents two successive trajectories generated by the Hamiltonian 
 dynamics. The squares show the endpoints of these two trajectories.
 Each trajectory consists of $\Tau =19$
 `leapfrog' steps with $\ttepsilon = 0.055$.
 These steps are indicated by the crosses on the trajectory in the 
 magnified inset.
 After each trajectory, the momentum is randomized.  
 Here, both trajectories are
 accepted; the errors in the Hamiltonian were only $+0.016$ and $-0.06$
 respectively. 
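
 For reference, the inner loop of the source code in \figref{fig.hmc}
 can be transcribed into equations as follows. (The symbols $t$, for the simulated
 time, and $\mathbf{g}(\bx) \equiv \partial E(\bx) / \partial \bx$, for the gradient
 returned by {\tt{gradE}}, are notation introduced here for this sketch;
 $\epsilon$ stands for {\tt{epsilon}} in the code.)
 Each leapfrog step is
\beqan
 \bp(t + \epsilon/2) &=& \bp(t) - \frac{\epsilon}{2} \, \mathbf{g}( \bx(t) ) \\
 \bx(t + \epsilon) &=& \bx(t) + \epsilon \, \bp(t + \epsilon/2) \\
 \bp(t + \epsilon) &=& \bp(t + \epsilon/2) - \frac{\epsilon}{2} \, \mathbf{g}( \bx(t + \epsilon) ) ,
\eeqan
 and the state $(\bx',\bp')$ reached at the end of the trajectory is accepted with probability
\beq
 \min \left( 1 , \, \exp [ - \Delta H ] \right) ,
 \:\:\: \mbox{where} \:\:\: \Delta H = H(\bx',\bp') - H(\bx,\bp) .
\eeq
 With the two errors quoted above, $\Delta H = +0.016$ gives an acceptance
 probability of $\exp(-0.016) \simeq 0.98$, and $\Delta H = -0.06$ gives
 an acceptance probability of 1.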

  \Figref{fig.hmcdemo}b shows how a sequence of four trajectories converges 
 from an initial condition, indicated by the arrow,
 that is not close to the typical set of 
 the target distribution.  The trajectory parameters $\Tau$ and 
 $\ttepsilon$ were randomized for each trajectory using uniform
 distributions with means 19 and 0.055 respectively. The first trajectory takes
 us to a new state, $(-1.5,-0.5)$, 
 similar in energy to the first state. The second 
 trajectory happens to end in a state nearer the bottom of the energy 
 landscape. Here, since the potential energy $E$ is smaller, the kinetic 
 energy $K = \bp^{\T}\bp/2$ is necessarily larger than it was at the start of the trajectory.
 When 
 the momentum is randomized before the third trajectory, its kinetic energy
 becomes much smaller. After the fourth trajectory has been simulated, 
 the state appears to have become typical of the target density.

 \Figsref{fig.hmcdemo}(c) and (d) show a
 random-walk Metropolis method using a Gaussian proposal density
 to sample from the same Gaussian distribution, starting from the 
 initial conditions of (a) and (b) respectively.
 In (c) the step size
% radius had been
 was adjusted
 such that the acceptance rate was 58\%. 
 The number of proposals was 38 so the total amount of computer time 
 used was similar to that  in (a). The distance moved is small because 
 of random walk behaviour.
 In (d) the random-walk Metropolis method was
 started from the same initial condition as (b) 
 and given a similar amount of 
 computer time.
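
 The length scales in this example can be checked directly from the matrix
 $\bA$. (This is a back-of-envelope sketch; the estimate of the number of
 random-walk steps is a rule of thumb, not a measurement from the simulations shown.)
 The eigenvectors of $\bA$ are $(1,1)$ and $(1,-1)$, with eigenvalues
\beq
 250.25 - 249.75 = 0.5 \:\:\: \mbox{and} \:\:\: 250.25 + 249.75 = 500 ,
\eeq
 so the standard deviations of the Gaussian along these two directions are
\beq
 \sigma_{\max} = 1/\sqrt{0.5} \simeq 1.41 \:\:\: \mbox{and} \:\:\: \sigma_{\min} = 1/\sqrt{500} \simeq 0.045 .
\eeq
 A random walk whose steps are of size roughly $\sigma_{\min}$
 needs of order $(\sigma_{\max}/\sigma_{\min})^2 = 1000$ steps to traverse the
 long direction. By contrast, a single Hamiltonian trajectory of 19 leapfrog steps with
 step size $0.055$ simulates the dynamics for a time $19 \times 0.055 \simeq 1.0$,
 during which a state with momentum of order 1 can move a distance comparable to $\sigma_{\max}$.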


%
%
% see hmc.tex for an attempt to make a toy simulation story.
% 
%  The {\dbf \hybrid\ Monte Carlo} method mentioned in the section 
%  on Metropolis methods makes use of gradient information to reduce
%  random walk behaviour.
%
%
% for a nice adler demo, cd itp/adler; gnuplot ; load 'gnu'
%
\begin{figure}
\figuremargin{\small%
\begin{center}
\begin{tabular}{@{}l@{}}
\begin{tabular}{ccc}&Gibbs sampling & Overrelaxation\\
\raisebox{1.5in}{\makebox[0in][l]{(a)}}&%
\hspace{-0.2in}\psfig{figure=adler/gibbs.xy.ps,angle=-90,width=2.3in} &
\hspace{-0.2in}\psfig{figure=adler/adler.xy.ps,angle=-90,width=2.3in} \\
\raisebox{1.5in}{\makebox[0in][l]{(b)}}&%
&
\hspace{-0.2in}\psfig{figure=adler/adler.xy.det.ps,angle=-90,width=2.3in} \\
\end{tabular}
\\
\raisebox{5mm}{\makebox[0in][l]{(c)}}%
\hspace{-0.2in}%
\begin{tabular}[t]{l}
\hspace{12mm}Gibbs sampling \\
\psfig{figure=adler/gibbs.x1.ps,width=5in,angle=-90} \\[-0.05in]
\hspace{12mm}Overrelaxation\\
\psfig{figure=adler/adler.x1.ps,width=5in,angle=-90} \\
\end{tabular}
\\
\end{tabular}
\end{center}
}{%
\caption[a]{{Overrelaxation contrasted with Gibbs sampling for a 
 bivariate Gaussian with correlation $\rho = 0.998$.}
%
 (a) The state sequence for 40 iterations, each iteration involving 
 one update of both variables. The overrelaxation method
        had $\alpha=-0.98$. (This excessively  large value is chosen 
 to make it easy to see how the overrelaxation method reduces random
 walk behaviour.) The dotted line shows the contour 
 $\bx^{\T} \bSigma^{-1} \bx=1$.
%
 (b) Detail of (a), showing the two steps making up each iteration.
 (c) Time-course of the variable $x_1$ during 2000 iterations of the 
 two methods. The overrelaxation method
        had $\alpha=-0.89$.
 (After \protect\citeasnoun{Radford.over}.)
%
} 
\label{fig.adler}
}%
\end{figure}

%Page 19
%What's the relationship between overrelaxtion and antithetic variables
%(e.g. Besag and Green, 1993, JRSS(B))?

\section{Overrelaxation}
        The {\dbf\ind{overrelaxation}} method reduces
 random walk behaviour in Gibbs sampling.
 Overrelaxation was originally introduced for systems in which all the 
 conditional distributions are Gaussian.
\begin{aside}
An example of a
% There are 
 joint distribution that  is {\em not\/} Gaussian but whose conditional
 distributions  {\em are\/} all Gaussian is $P(x,y)=
 \exp(-x^2 y^2-x^2 -y^2)/Z$. 
\end{aside}
% Adler's Overrelaxation
\subsection{Overrelaxation for Gaussian conditional distributions}
 In ordinary Gibbs sampling, 
 one draws the new value $x_i^{(t+1)}$ 
 of the current variable $x_i$ from its conditional 
 distribution, ignoring the old value $x_i^{(t)}$.
 The state makes lengthy random walks in cases where 
 the variables are strongly correlated, as illustrated in the 
 left-hand panel of \figref{fig.adler}.
%  the joint distribution has strong correlations between the variables.
 This figure uses a correlated Gaussian distribution
 as the target density.
% that we used  when studying the \hybrid\ Monte Carlo method.

 In \nocite{Adler1981}Adler's  (1981)
% In \quotecite{Adler1981}
 overrelaxation method, one instead samples 
 $x_i^{(t+1)}$  from a Gaussian that is biased to the {\em opposite\/} side of 
 the  conditional distribution.
%, and that is narrower than the 
% Gaussian describing the conditional distribution.  
% When the bias and  width are appropriately set, 
 If the conditional distribution of $x_i$ is $\Normal(\mu,\sigma^2)$
 and the current value of $x_i$  is $x_i^{(t)}$, then Adler's method 
 sets $x_i$  to
\beq
        x_i^{(t+1)} = \mu + \alpha ( x_i^{(t)} - \mu ) + (1- \alpha^2 )^{1/2} \sigma \nu ,
\label{eq.adler}
\eeq
 where $\nu \sim \Normal(0,1)$ and $\alpha$ is a parameter between $-1$ and 
 $1$, usually set to a negative value. (If $\alpha$ is positive, then the
 method is called under-relaxation.)
\exercisxA{2}{ex.adler}{
 Show that this individual transition leaves invariant the conditional
 distribution   $x_i \sim \Normal(\mu,\sigma^2)$.
}
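 To get a feel for \eqref{eq.adler}, consider two illustrative settings of $\alpha$
 (a worked sketch; the value $-0.98$ is the one used in \figref{fig.adler}a).
 With $\alpha = 0$ the update is
\beq
 x_i^{(t+1)} = \mu + \sigma \nu ,
\eeq
 which is ordinary Gibbs sampling. With $\alpha = -0.98$, we have
 $(1-\alpha^2)^{1/2} = (0.0396)^{1/2} \simeq 0.2$, so
\beq
 x_i^{(t+1)} \simeq \mu - 0.98 \, ( x_i^{(t)} - \mu ) + 0.2 \, \sigma \nu :
\eeq
 the new value is close to the mirror image of the old value on the other side of
 the conditional mean, perturbed by noise whose standard deviation is only
 about one fifth of that of the conditional distribution.
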
 A single iteration of Adler's overrelaxation, like one of Gibbs sampling,
 updates each variable in turn as indicated in \eqref{eq.adler}. 
 The transition matrix $T(\bx';\bx)$ defined by a complete update of
 all variables in some fixed order 
 does not satisfy  detailed balance.
 Each individual transition for one coordinate
 just described {\em does\/} satisfy detailed balance -- so the
 overall chain gives a valid
 sampling strategy which converges to the target density $P(\bx)$ -- 
 but 
 when we form a chain by applying the individual transitions
 in a fixed sequence, the overall 
 chain is not reversible.  This temporal asymmetry is the key to why
 overrelaxation can be beneficial.  If, say, 
 two variables are positively correlated, then they will (on a short
 timescale) evolve 
 in a directed manner instead of by random walk, 
 as shown in \figref{fig.adler}.  This may
 significantly reduce the time required to obtain
% effectively 
 independent samples.
% This method is still a valid 
% sampling strategy -- it converges to the target density $P(\bx)$ -- 
% because it is made up of transitions that satisfy detailed balance.
%  (XXXXXXXXXXXXXXXXXXX)
%
% PUT THIS FIG BACK in any non-erice version, and this ref.
%

% Figure \ref{fig.adler} illustrates the  difference
% between Gibbs sampling and overrelaxation for the case of 
% a bivariate Gaussian distribution. Notice how much more rapidly
% overrelaxation gets around the distribution.

\exercisxC{3}{ex.detbaladler}{
	The   transition matrix $T(\bx';\bx)$ defined by a complete update of
 all variables in some fixed order 
 does not satisfy  \ind{detailed balance}. If the updates were in a {\em random order},
 then $T$ would be symmetric.  Investigate, for the toy two-dimensional
 Gaussian distribution, the assertion 
 that  the advantages of  overrelaxation are lost if the 
 overrelaxed updates  are made in  a random order.
}

\subsection{Ordered Overrelaxation\nonexaminable}
 The overrelaxation method has been generalized by\index{overrelaxation!ordered}\index{Monte Carlo methods!overrelaxation!ordered} 
 \citeasnoun{Radford.over}
 whose {\dem\ind{ordered overrelaxation}\/} method  is applicable to\index{Neal, Radford} 
 {\em any\/} system where \ind{Gibbs sampling}\index{Monte Carlo methods!Gibbs sampling}
 is used.
%
% MORE HERE
%
 In ordered overrelaxation, instead of taking one sample from
 the conditional distribution $P(x_i \given  \{ x_j \}_{j \neq i} )$,
 we create $K$ such samples $x_i^{(1)}, x_i^{(2)}, \ldots , x_i^{(K)}$,
 where $K$ might be set to twenty or so.
 Often, generating $K-1$ extra samples adds a negligible
 computational cost to the initial computations required for making
 the first sample. The points $\{ x_i^{(k)} \}$ are then
 sorted numerically, and the current value of $x_i$ is inserted into
 the sorted list, giving a list of $K+1$ points. We give them
 ranks $0,1,2,\ldots, K$. Let $\kappa$ be
 the rank of the current value of $x_i$ in the list.
 We set $x'_i$ to the value that is an equal distance from the
 other end of the list, that is,
%\beq
 the value with rank $K-\kappa$.
%\eeq
 The role played by Adler's $\alpha$ parameter is here played by the parameter
 $K$.  When $K=1$, we obtain ordinary Gibbs sampling.
% Radford says this should be cut:
% In practice, it might be a good idea to  use a small value of $K$, \eg, $K=1$, before
% convergence of a Gibbs sampler, then a value such as $K=20$ after
% convergence, because atypicality persists longer for larger values of $K$.
% (Imagine an atypical state at the 95th percentile hopping across
%to an equally atypical state at the 5th)
%
 For practical purposes Neal\index{Neal, Radford} estimates that ordered overrelaxation
 may speed up a simulation by a factor of ten or twenty.
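
 As a small worked illustration of the ranking step (the numbers are invented for
 this sketch), take $K=4$ and suppose that the sorted samples from the conditional
 distribution are $-0.3$, $0.2$, $0.9$, $1.6$ and that the current value is $x_i = 0.0$.
 Inserting the current value gives the list
\beq
\begin{array}{l|ccccc}
 \mbox{rank} & 0 & 1 & 2 & 3 & 4 \\ \hline
 \mbox{value} & -0.3 & 0.0 & 0.2 & 0.9 & 1.6 
\end{array}
\eeq
 so $\kappa = 1$, and the new value is the one with rank $K - \kappa = 3$,
 namely $x'_i = 0.9$.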
%
% but maybe should recommend don't use before convergence. RADFORD SAYS NO
%
%  It is a method which can be used wherever Gibbs sampling is used.


\section{Simulated  annealing}
 A  third technique for speeding convergence is {\dbf\ind{simulated 
 annealing}}. In \index{Monte Carlo methods!simulated annealing}simulated\index{annealing}
 annealing, a `\ind{temperature}' parameter is introduced
 which, when large, allows the system to make transitions that
 would be improbable at temperature 1. The temperature is
 set to  a large value and  gradually reduced to 1. This procedure is supposed
 to reduce the chance that the simulation gets
 stuck in an unrepresentative probability island.  

 We assume that we wish to sample from a 
 distribution of the form 
\beq
        P(\bx) = \frac{ e^{- E(\bx)} }{Z} ,
\eeq
 where  $E(\bx)$ can be evaluated. In the simplest simulated annealing method,
 we instead sample from the distribution
\beq
        P_T(\bx) =
{\smallfrac{1}{Z(T)}}  \, { e^{-\smallfrac{E(\bx)}{T}} }
\eeq
        and decrease $T$ gradually to 1.
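
 To see the effect of the temperature, consider a Metropolis method with a
 symmetric proposal density (an assumption made for this illustration) applied to
 $P_T(\bx)$. A proposed move from $\bx$ to $\bx'$ that increases the energy by
 $\Delta E = E(\bx') - E(\bx) > 0$ is accepted with probability
\beq
 \frac{ P_T(\bx') }{ P_T(\bx) } = \exp ( - \Delta E / T ) .
\eeq
 For example, if $\Delta E = 10$, the acceptance probability is
 $e^{-10} \simeq 5 \times 10^{-5}$ at $T=1$, but $e^{-1} \simeq 0.37$ at $T=10$.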

 Often the energy function can be separated into two terms, 
\beq
        E(\bx) = E_0(\bx) + E_1(\bx), 
\eeq
 of which the first term is `nice' (for example, a separable function 
 of 
% linear  in 
 $\bx$) and the second is `nasty'.
 In these cases, a better simulated annealing method might make use of
 the distribution
\beq
        P'_T(\bx) = 
{\smallfrac{1}{Z'(T)}}
 \, 
  e^{-E_0(\bx)-\textstyle\dfrac{ E_1(\bx)}{T}} 
\eeq
        with $T$ gradually  decreasing to 1.
 In this way, the distribution at high temperatures reverts to a 
 well-behaved distribution defined by $E_0$.

 Simulated annealing is often used as an \ind{optimization} method, where 
 the aim is to find an $\bx$ that minimizes $E(\bx)$, in which case 
 the temperature is decreased to zero rather than to 1.

 As a Monte Carlo method, simulated annealing as
 described above doesn't sample exactly
 from the right distribution, because there is no guarantee
 that the probability of falling into one basin of the energy
 is equal to the total probability of all the states
 in that basin.
%-- indeed we would expect this not to be the case
% in general; simulated annealing is usually a biased sampling method.
 The closely related
 `simulated tempering' method  \cite{Marinari1992}
 corrects the biases introduced by the annealing process
 by making  the temperature itself a random variable that is 
 updated in Metropolis fashion during the simulation.
%
 \quotecite{Radford_ais}
% Neal's (unpublished)
 `annealed importance sampling' method\index{Neal, Radford} 
 removes the biases introduced by annealing by computing importance weights
 for each generated point.
% more?


\section{Skilling's multi-state leapfrog method\nonexaminable}
\label{sec.skillingleapfrog}
 A fourth method for speeding up Monte Carlo simulations,
% reducing  random walk behaviour,
 due to \index{Skilling, John}John Skilling,
% was introduced by Skilling (unpublished, 2002); it
 has a  similar spirit  to  overrelaxation,
 but works in more dimensions.
 This method is applicable to sampling from a distribution
 over a continuous state space, and the sole requirement
 is that the energy $E(\bx)$ should be easy to evaluate.
 The gradient is not used.
 This leapfrog method is not intended to be used on its
 own but rather in sequence with other Monte Carlo
 operators.

 Instead of moving just one state vector $\bx$
 around the state space, as was the case for all the
 Monte Carlo methods discussed thus far,
 Skilling's leapfrog method simultaneously\index{Monte Carlo methods!multi-state}
 maintains a {\dem{set}\/} of $S$ state vectors
 $\{ \bx^{(s)} \}$, where $S$ might be six or twelve.
% or so.
 The aim is that  all $S$ of these vectors will represent independent
 samples from the same distribution $P(\bx)$.

 Skilling's leapfrog makes a proposal for the new
 state ${\bx^{(s)}}'$,
 which is accepted or rejected in accordance with the
 Metropolis method, by%
\amarginfig{t}{
\begin{center}
\setlength{\unitlength}{1mm}
\begin{picture}(40,40)
\put(0,0){\makebox(0,0)[bl]{\psfig{figure=figs/gallager/skilling.eps,width=40mm,height=30mm}}}
\put(0,0){\makebox(0,0)[br]{$\bx^{(s)}$}}
\put(20,15){\makebox(0,0)[br]{$\bx^{(t)}$}}
\put(40,30){\makebox(0,0)[br]{${\bx^{(s)}}'$}}
\end{picture}
\end{center}
}\
 leapfrogging the current state $\bx^{(s)}$ over
 another state vector $\bx^{(t)}$:
\beq
	 {\bx^{(s)}}' =   \bx^{(t)} + ( \bx^{(t)} - \bx^{(s)} )
		=  2 \bx^{(t)} - \bx^{(s)} .
\eeq
 All the other state vectors are left where they are, so
 the acceptance probability depends  only on the change in
 energy of  $\bx^{(s)}$.
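
 Here is a sketch of why the acceptance rule takes this simple form,
 assuming that the partner $t$ is chosen by a rule that does not depend on the position of
 $\bx^{(s)}$ (for example, uniformly at random from the other vectors).
 Leapfrogging the new state back over the same partner returns the old state,
\beq
 2 \bx^{(t)} - {\bx^{(s)}}' = 2 \bx^{(t)} - ( 2 \bx^{(t)} - \bx^{(s)} ) = \bx^{(s)} ,
\eeq
 so the proposal is symmetric. Since the joint target density of the $S$ vectors is
 $\prod_{r=1}^{S} P(\bx^{(r)})$, and only the factor for $\bx^{(s)}$ changes,
 the Metropolis acceptance probability is
\beq
 \min \left( 1 , \, \frac{ P( {\bx^{(s)}}' ) }{ P( \bx^{(s)} ) } \right)
 = \min \left( 1 , \, \exp \left[ - \left( E( {\bx^{(s)}}' ) - E( \bx^{(s)} ) \right) \right] \right) .
\eeq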

 Which vector, $t$, is the partner for
 the leapfrog event can be chosen in
 various ways. The simplest method
 is to select the partner at random from the
 other vectors.
 It might be better to choose $t$ by selecting one
 of the nearest neighbours of $\bx^{(s)}$
 -- nearest by any chosen distance function --
 as long as one then uses an acceptance rule that ensures
 detailed balance by checking whether point $t$ is still among
 the nearest
 neighbours of the new point, ${\bx^{(s)}}'$.

\subsection{Why the leapfrog is a good idea}
 Imagine that  the target density $P(\bx)$ has
 strong correlations -- for example, the density
 might be a needle-like Gaussian with width $\epsilon$
 and length $L \epsilon$, where $L \gg 1$.
 As we have emphasized, motion around such a density by standard
 methods proceeds by a  slow random walk.

 Imagine now that our set of $S$ points is lurking initially
 in a location that is probable under the density,
 but in an inappropriately  small ball of size $\epsilon$.
 Now, under Skilling's leapfrog method,
 a typical first move will take the point a little
 outside the current ball, perhaps
 doubling its distance from the centre of the ball.
 After all the points have had a chance to move,
 the ball will have increased in size;
 if all the moves are accepted, the ball will
 be bigger by a
 factor of two or so in all dimensions. The rejection
 of some moves will mean that the ball containing the
 points will probably  have elongated
 in the needle's long direction by a factor of, say, two.
 After another cycle through the points, the
 ball will have grown in the long direction by another factor of two.
 So the typical  distance travelled in the long dimension
 grows {\em{exponentially\/}} with the number of iterations.

 Now, maybe a factor of two growth per iteration
 is on the optimistic side; but even if
 the ball only grows by a factor of, let's say, 1.1 per iteration,
 the growth is nevertheless exponential. It will only
 take a number of iterations proportional to $\log L/\log(1.1)$
 for the long dimension to be explored.
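
 To put illustrative numbers on this (the value $L = 1000$ is chosen here only as an example,
 and the constant of proportionality is taken to be one):
\beq
 \frac{\log 1000}{\log 2} \simeq 10
 \:\:\: \mbox{and} \:\:\:
 \frac{\log 1000}{\log 1.1} \simeq 72 ,
\eeq
 so a few tens of iterations suffice even with the pessimistic growth factor, whereas,
 by the usual random-walk scaling,
 a random walk with steps of size $\epsilon$ would be expected to need of
 order $L^2 = 10^6$ steps to cover the same distance.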

\exercissxB{2}{ex.skilling}{
 Discuss how the effectiveness of
 Skilling's method scales with dimensionality,
 using a correlated $N$-dimensional Gaussian distribution
 as an example. Find an expression for the rejection
 probability, assuming the Markov chain is at equilibrium.
 Also discuss
 how it scales with the strength of correlation among
 the Gaussian variables.
 [\Hint:  Skilling's method is invariant under
 affine transformations, so the  rejection probability
 at equilibrium can be found by looking at the case
 of a {\em{separable}\/} Gaussian.]
% $x_i \sim \Normal(\mu,\sigma^2)$.
}

 This method has some similarity to
 the   	 `\ind{adaptive direction sampling}' method
 of \citeasnoun{Gilks_RG_ADS}\index{Gilks, W.R.}\index{Roberts, Gareth O.}\index{George, E.I.}
 but the  leapfrog method is simpler and can be applied to a greater
 variety of distributions. 
%  Gilks, WR; Roberts, GO; George, EI (1994). Adaptive direction sampling. 
%    Statistician, 43, 179-189. 
%
%  Roberts, GO; Gilks, WR (1994). Convergence of adaptive direction sampling. 
%    Journal of Multivariate Analysis, 49, 287-298. 

%5Here are some rough notes I just jotted down today
%5on monte carlo methods.  I'm getting keen on the 
%idea that genetic methods are the key step needed in 
%order to speed up monte carlo, since sex allows
%information to be acquired by a population from an oracle
%at a rate sqrt(G) faster than the maximum rate achievable
%by a lone individual who evolves under standard dumb metropolis.
%
%        So then the challenge is to come up with a birth or death
%rule which supplies the required slaughter of unfit individuals,
%post-sex, without ruining the proof of validity we are interested
%in obtaining.
%
%        Hope this sparks some useful ideas. I started thinking about
%this because John Skilling is using a little bit of sex in his
%algorithms, but I am pretty sure his method is vastly suboptimal
%because he is not accompanying his sex by the appropriate amount of
%death.
%
\section{Monte Carlo algorithms as communication channels}
 It may be a helpful perspective, when thinking about
 speeding up Monte Carlo methods, to think about the information
 that is being communicated.\index{Monte Carlo methods!information communication in}\index{communication} 
 Two  communications take place when a sample from $P(\bx)$ is being generated.

 First, the selection of a particular $\bx$ from $P(\bx)$
 necessarily requires that at least $\log 1/P(\bx)$ random bits  be consumed.
 [Recall the use of inverse arithmetic coding as a method
 for generating  samples from given distributions (\secref{sec.ac.efficient}).]
% (\chref{ch.ac})]
%
% For example, could think about a chain qith Q_1 = +/1 mod B
% Q_2 = +/- 2^{1} mod B
% Q_b = +/- B
% (or 0/1)
% with each move consuming one bit, and the outcome generates a sample from
% 1...2^b, which is b bits.
%

 Second, the generation of a sample conveys information about $P(\bx)$ from the
% involves consulting the
 subroutine that is able to evaluate $P^*(\bx)$ (and from any other
 subroutines that have access to properties of   $P^*(\bx)$).

 Consider a dumb Metropolis method, for example. In a
 \ind{dumb Metropolis}\index{Monte Carlo methods!Metropolis method!dumb Metropolis} method,
 the proposals $Q(\bx';\bx)$ have nothing to do with $P(\bx)$.
 Properties of $P(\bx)$ are only  involved in the
 algorithm at the acceptance step, when the ratio
 $P^*(\bx')/P^*(\bx)$ is computed.
 The channel from the true distribution $P(\bx)$
 to the user who is interested in computing properties of
 $P(\bx)$ thus passes through a bottleneck: all
 the information about $P$ is conveyed by
% mediated
 the string of acceptances and rejections.
 If $P(\bx)$ were replaced by a different distribution
 $P_2(\bx)$, the only way in which this change would have an influence
 is that the string of acceptances and rejections would be changed.
 I am not aware of much use being made of
 this information-theoretic  view of Monte Carlo algorithms,
 but I think it is  an instructive viewpoint: if the aim is to
 obtain information about properties of $P(\bx)$ then
 presumably it is helpful to identify the channel through which
 this information flows, and maximize the rate of information
 transfer.
\exampl{ex.whyhalf}{
 The  information-theoretic  viewpoint offers a simple justification
 for the widely-adopted rule of thumb, which states that the
 parameters of a dumb Metropolis  method should be adjusted such that the
 \ind{acceptance rate}\index{Monte Carlo methods!acceptance rate}
 is about one half.
 Let's call the acceptance history, that is,
 the binary string of accept or reject
 decisions, $\ba$. The information learned about $P(\bx)$ after
 the algorithm has run for $T$ steps is less than or equal to the
 information content of $\ba$, since  all information about $P$
 is mediated by $\ba$. And the information content of $\ba$ is upper-bounded
 by $T H_2(f)$, where $f$ is the acceptance rate. This bound
 on  information acquired about $P$ is maximized by setting $f=1/2$.
}
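 For a sense of scale, here are a few values of the binary entropy
 $H_2(f) = f \log_2 \frac{1}{f} + (1-f) \log_2 \frac{1}{1-f}$
 (the acceptance rates are chosen purely for illustration):
\beq
 H_2(0.5) = 1 \mbox{ bit}, \:\:\: H_2(0.25) \simeq 0.81 \mbox{ bits}, \:\:\: H_2(0.1) \simeq 0.47 \mbox{ bits} .
\eeq
 So, according to this bound, a chain with acceptance rate $f = 0.1$ can
 acquire at most about half as much information per iteration as one with $f = 0.5$.
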
% Radford says:
% The information theory perspective looks useful, but it seems hard to
% get precise results out of it.  The accept/reject decisions for a
% sequence of updates will generally not be independent, so the actual
% amount of information conveyed will be less than one bit per proposal.
% So it's not clear (though perhaps one can attempt to \analyse) whether
% or not departing from a 50% accept rate actually reduces the
% information (it could be it reduces the dependence enough to
% compensate for departing from the marginally optimal 50% rate).  As
% you maybe are aware, Gareth Roberts (with, I think, Gelman and Gilks)
% has shown that a 23% acceptance rate is optimal under certain
% circumstances, asymptotically with increasing dimensionality.
% 
% The evolutionary perspective is tantalizing, but also looks hard to
% get to work.  One basic problem is that death is not reversible.
% Looked at another way, the reason we want death is to make room for
% births, specifically for births that are similar to their parents (or
% siblings, if parents are eliminated).  If we succeed in this, we will
% have an overall state consisting of many copies of the base state,
% many of which are very similar.  This is quite incompatible with the
% usual scheme of defining the distribution for the overall state by
% saying the component base states are independent, with all having the
% desired distribution.



 Another\index{evolutionary computing}
 helpful analogy for a dumb Metropolis method is an evolutionary
 one. Each proposal generates a progeny $\bx'$ from the current state $\bx$.
 These two individuals then compete with each other, and the Metropolis
 method uses 
 a noisy survival-of-the-fittest rule. If the progeny $\bx'$ is
 fitter than the parent (\ie, $P^*(\bx')>P^*(\bx)$, assuming the $Q/Q$ factor is
 unity) then the progeny replaces the parent. The survival rule 
 also allows less-fit progeny  to replace the parent, sometimes.
 Insights about the rate of evolution can thus be applied  to Monte Carlo methods.
\exercisxC{3}{ex.learnMC}{
	Let $\bx \in \{ 0,1 \}^G$ and let
	$P(\bx)$ be a separable distribution,
%
\beq
	P(\bx) = \prod_g p(x_g),
\eeq
 with $p(0) = p_0$ and $p(1)=p_1$, for example $p_1=0.1$.
 Let the proposal density of a dumb Metropolis algorithm
 $Q$ involve flipping a fraction $m$ of the $G$ bits in the state $\bx$.
 Analyze how long it takes for the chain to converge to the
 target density as a function of $m$. Find the optimal $m$
 and deduce how long  the Metropolis method must run for.

 Compare the result with the results for an evolving population
 under natural selection found in \chref{ch.sex}.
}

 The insight that the fastest progress
 that a standard Metropolis method can make, in information terms,
 is about one
 bit per iteration, gives a strong motivation for speeding
 up the algorithm.
 This chapter has already reviewed  several methods for
 reducing random-walk behaviour. Do these methods also
 speed up the rate at which information is acquired?
\exercisxC{4}{ex.learnMC2}{
	Does Gibbs sampling, which is a smart Metropolis method
	whose proposal distributions do depend on $P(\bx)$,
	allow information about $P(\bx)$ to leak out at a rate
	faster than one bit per iteration?
	Find  toy examples in which this question can be
	precisely investigated.
}
\exercisxC{4}{ex.learnMC3}{
	\Hybrid\ Monte Carlo is another smart Metropolis method
	in which the proposal distributions  depend on $P(\bx)$.
	Can \Hybrid\ Monte Carlo extract information about $P(\bx)$ at a rate
	faster than one bit per iteration?
}
\exercisxC{5}{ex.learnimport}{
	In importance sampling, the  weight $w_r = P^*(\bx^{(r)})/Q^*(\bx^{(r)})$,
 a floating-point number,
	is computed and retained until the end of the computation.
	In contrast, in the dumb Metropolis method, the ratio $a = P^*(\bx')/P^*(\bx)$
	is reduced to a single bit (`is $a$ bigger than or smaller than
 the random number $u$?').
% \in (0,1)$?').
	Thus in principle importance sampling preserves more information
	about $P^*$ than does dumb Metropolis.
 Can you find a toy example in which this extra information
	does indeed lead to faster convergence of importance sampling
	than Metropolis?
	Can you design a Markov chain Monte Carlo algorithm
	that  moves around adaptively, like a Metropolis method,
	and that retains more useful information about the
	value of $P^*$, like importance sampling?
}
 In  \chref{ch.sex} we noticed that an evolving population of $N$ individuals
 can  make faster evolutionary progress if the individuals
 engage in sexual reproduction. This  observation  motivates
 looking at Monte Carlo algorithms in which   multiple parameter vectors $\bx$
 are evolved and interact.
 
\section{Multi-state methods\nonexaminable}
 In a multi-state method, multiple parameter vectors $\bx$ are maintained;\index{Monte Carlo methods!multi-state}
 they evolve individually under moves such as Metropolis and Gibbs; there
 are also interactions among the vectors.
 The intention is either
 that  eventually all the vectors $\bx$ should be
 samples from $P(\bx)$ (as illustrated by Skilling's leapfrog method), or that information associated with the final
 vectors $\bx$
 should allow us to approximate expectations under $P(\bx)$, as
 in importance sampling.


\subsection{Genetic methods}
% There is a good reason  for wanting to use a population and have sex.
% In inference, the  computational task is to hunt for the place where
% the probability is big. This is difficult. As the chapter on sex
% and evolution (\chref{ch.sex}) shows,
% if the problem can be decomposed, 
% it's more efficient to have many individuals and crossover and selection.
% The following gives a particular way of making crossover and selection
% in a principled way.
  Genetic algorithms\index{genetic algorithm}\index{evolutionary computing}\index{algorithm!genetic}
 are not
 often described  by their proponents as Monte Carlo algorithms, but
 I think this is the correct categorization, and an ideal genetic
 algorithm would be one that can be proved to be a valid  Monte Carlo algorithm
 that converges to a specified density.

 I'll use $R$ to denote the number of vectors in the population.
 We aim to have $P^*(\{\bx^{(r)}\}_1^R) = \prod_{r=1}^{R} P^*(\bx^{(r)})$.
 A genetic algorithm involves moves of two or three types.

 First, individual moves in which one state vector is perturbed,
 $\bx^{(r)} \rightarrow  {\bx^{(r)}}'$, which
 could be performed using any of the Monte Carlo methods
 we have mentioned so far.

 Second, we allow crossover moves of the form
 $\bx,\by \rightarrow \bx',\by'$;
 in a typical crossover move, the progeny $\bx'$ receives half his
 state vector from one parent, $\bx$, and half from the other, $\by$;
 the secret of success in a \ind{genetic algorithm}\index{algorithm!genetic} is that
% the way that
 the parameter $\bx$ 
% relates to $P(\bx)$
 must be encoded  in such a way that the crossover of two independent
 states $\bx$ and $\by$, both of which have good
 fitness $P^*$, should have a  reasonably good chance of producing progeny
 who are equally fit.
 This constraint is a hard one to satisfy in many problems, which
 is why genetic algorithms  are mainly talked about and hyped up,
 and rarely used by  serious experts.
 Having introduced a crossover move
 $\bx,\by \rightarrow \bx',\by'$, we need to
 choose an acceptance rule.  One easy way to obtain
 a valid algorithm is to accept or reject the crossover proposal
 using the Metropolis rule with  $P^*(\{\bx^{(r)}\}_1^R)$
 as the target density -- this involves
 comparing the fitnesses before and after the crossover using the
 ratio
\beq
	\frac{ P^*( \bx' ) P^*( \by' ) }{ P^*( \bx ) P^*( \by ) } .
\eeq
 If the crossover operator is reversible then
 we have an easy proof that this procedure satisfies detailed
 balance and so is a valid component in a chain
 converging to   $P^*(\{\bx^{(r)}\}_1^R)$.
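
 To spell out where this ratio comes from (a sketch, assuming that the crossover
 proposal is symmetric, so that the $Q/Q$ factor is unity): the crossover changes
 only two of the $R$ vectors, so all the other factors in the product
 $\prod_{r} P^*(\bx^{(r)})$ cancel, and the proposal is accepted with probability
\beq
	\min \left( 1 , \,
	\frac{ P^*( \bx' ) P^*( \by' ) }{ P^*( \bx ) P^*( \by ) } \right) .
\eeq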

\exercisxB{3}{ex.geneticenough}{
	Discuss whether the above two operators, individual
 variation and crossover with the Metropolis acceptance rule,
 will give a more efficient  Monte Carlo method
 than a standard method with only one state vector and
 no crossover.
}
 The sexual community  could acquire information
 faster than the asexual community in \chref{ch.sex} because
 the \ind{crossover} operation produced  diversity with standard deviation
 $\sqrt{G}$, and the \ind{Blind Watchmaker} was then able to convey lots of information
 about the fitness function  by {\em killing off\/} the less-fit offspring.
 The above two operators do {\em not\/} offer a speed-up of $\sqrt{G}$
 compared with standard Monte Carlo methods because there is
 no killing.  What's required, in order to obtain a speed-up, is two things:
 multiplication and death; and at least one of these must operate {\em selectively}.
 Either we must kill off the less-fit state vectors, or
 we must allow the more-fit state vectors to give rise to more offspring.
% We need a birth process in which 
% is some sort of birth
% process in which  $\bx,\by \rightarrow \bx',\by'$
 While it's easy to sketch these ideas, it is hard to
 define a valid method that implements them.
\exercisxD{5}{ex.geneticsolve}{
	Design a birth rule and a death rule
	such that the chain converges
	to   $P^*(\{\bx^{(r)}\}_1^R)$.
}
% http://www.robots.ox.ac.uk/~misard/condensation.html
% http://www-sigproc.eng.cam.ac.uk/~ad2/book.html
 I believe this is still an open research problem.
% \index{particle filters}{Particle filters}
% offer a partial solution to this problem for cases where
% the  target density $P^*$ can be chopped into a product of factors.
% One way  to chop a target density into a product of factors is annealing:
% we can, for example, write $P^*(x) = ( P_{\epsilon}(x) )^N$, where
% $P_{\epsilon}(x) \equiv  ( P^*(x) )^{\epsilon}$ and $N=1/\epsilon$ is the
% number of steps in an annealing process with equally-spaced temperatures.
% Thus births and deaths can be integrated into the annealing process. 

% these cross
% and accept or reject
% using Metropolis.
% (This is not how many people do Genetic algorithm, but it is a good idea)
%
% Birth or death rule: Skilling's method couple deaths and births to
% the annealing process.\index{Skilling, John}
% Reduce temperature such that just one of $R$ is likely to die.

\subsection{Particle filters}
%5 See next edition of the book!
 Particle\index{particle filter}
 filters, which are particularly popular in
 inference problems involving temporal
 tracking, are multi-state methods that mix the ideas
 of importance sampling and Markov chain Monte Carlo.
% should be covered in the next edition of this book.
 See
 \citeasnoun{isard96visual}, \citeasnoun{isard98condensation},
 \citeasnoun{berzuini97dynamic},
 \citeasnoun{BerzuiniGilks2001}, \citeasnoun{particlefilters01}.

\section{Methods that do not necessarily help}
 It is common practice to  use {\em many\/} initial conditions
 for a particular Markov chain (\figref{fig.mcresource}).
% (\secref{sec.mcresource}).
 If you are worried about
 sampling well from a complicated density $P(\bx)$,
 {\em can\/} you ensure the states produced by
 the simulations are well distributed about the
 typical set of $P(\bx)$
 by ensuring that the
 initial points  are `well distributed about
 the whole state space'?

 The  answer is, unfortunately, no.  In  hierarchical Bayesian  models,
 for example, a large number of parameters $\{x_n\}$ may be coupled
 together via  another parameter $\b$ (known as a  hyperparameter).
 For example, the quantities   $\{x_n\}$ might be independent
 noise signals, and $\b$ might be the inverse-variance of
 the noise source. The joint distribution of $\b$ and  $\{x_n\}$
 might be
\beqa
	P( \b , \{x_n\} ) & =& P(\b) \prod_{n=1}^{N} P(x_n\given \b) \\ \nonumber
 & =& P(\b) \prod_{n=1}^{N} \smallfrac{1}{Z(\b)} \, e^{-\b x_n^2 / 2}  ,
\eeqa
 where $Z(\b) = \sqrt{ 2 \pi / \b }$ and $P(\b)$ is a broad
 distribution describing our ignorance about the noise level.
 For simplicity, let's leave out all the other variables -- data and such --
 that might be involved in a realistic problem.
 Let's imagine that we want to sample effectively from $P( \b , \{x_n\} )$
 by Gibbs sampling -- alternately sampling  $\b$ from the
 conditional distribution  $P(\b\given \{x_n\})$  then sampling
all the $x_n$ from their
 conditional 
 distributions  $P(x_n\given \b)$.
  [The resulting  marginal distribution of $\b$ should asymptotically be the
 broad distribution $P(\b)$.]

 If $N$ is large then the conditional distribution
 of $\b$ given any particular setting of $\{x_n\}$
 will be tightly concentrated on a particular most-probable
 value of $\b$, with width proportional to $1/\sqrt{N}$. 
 Progress up and down  the $\b$-axis will therefore take place
 by a slow random walk with steps of size $\propto 1/\sqrt{N}$.
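
 To see where the $1/\sqrt{N}$ comes from, here is a rough calculation that
 assumes, for illustration only, that $P(\b)$ is flat over the range of interest.
 Writing $S \equiv \sum_n x_n^2$, the conditional distribution is
\beq
	P( \b \given \{x_n\} ) \propto P(\b) \, \b^{N/2} e^{-\b S / 2} ,
\eeq
 which for flat $P(\b)$ is a gamma distribution with most-probable value $\b = N/S$
 and standard deviation $\sqrt{N/2+1} \, / (S/2) \simeq \sqrt{2/N} \, (N/S)$ for large $N$.
 The width, relative to the most-probable value, is thus about $\sqrt{2/N}$.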

 So, to the initialization strategy. Can we finesse our
 slow convergence problem by using  initial conditions located
 `all over the state space'?
 Sadly, no.
 If we distribute the points  $\{x_n\}$  widely,
 what we are actually doing is favouring
 an initial value of the noise level $1/\b$ that is {\em large\/}.
% , since widely varying values of  $\{x_n\}$.
 The random walk of the parameter $\b$ will thus  tend,
 after the first drawing of $\b$ from $P(\b\given \{x_n\})$,
 always to start off from one end of the $\b$-axis.
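 For instance (assuming, for illustration, a flat $P(\b)$ and large $N$): if the
 initial $x_n$ are scattered with root-mean-square size $\sigma_0$, then
 $\sum_n x_n^2 = N \sigma_0^2$, and the first draw of $\b$ will be concentrated
 near $\b \simeq N / \sum_n x_n^2 = 1/\sigma_0^2$, however broad the
 range of values supported by $P(\b)$ itself.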



\section*{Further reading}
\index{Neal, Radford}The \hybrid\ Monte Carlo method \cite{duane-kennedy-pendleton-roweth-87}
 is  
 reviewed in \citeasnoun{Neal_dop}.
 This excellent tome also reviews a huge range of other Monte Carlo
 methods, including the related
 topics of simulated annealing\index{annealing} and free energy estimation.
% For another advanced  method for adapting Markov chains, 
% see \citeasnoun{MCMCRegeneration}.
% \index{regeneration}\index{Markov chain Monte Carlo!regeneration}

\section{Further exercises}
\exercisxC{4}{ex.hmcreversible}{
 An important detail of the \hybrid\ Monte Carlo method is\index{Hamiltonian Monte Carlo}
 that the simulation  of the Hamiltonian dynamics,
 while it may be inaccurate, must be
 perfectly reversible, in the sense that
 if the initial condition $(\bx,\bp)$ goes to
% \rightarrow
 $(\bx',\bp')$,
 then the same simulator must take $(\bx',-\bp')$
 to
% \rightarrow
 $(\bx,-\bp)$,
 and the inaccurate dynamics must conserve state-space volume.
% cut this on radford's advice
% (In fact, this second rule is redundant since,
% if state-space volume is not conserved, perfect reversibility is impossible,
% if the state $\bx,\bp$ is represented with finite precision using integers.)
%  the rule of perfect reversibility must be violated
 [The leapfrog method in \algref{fig.hmc} satisfies these rules.]

 Explain why these rules must be satisfied and create an example illustrating
 the problems that arise if they are not.
}


\exercisxC{4}{ex.multi-state-slice}{
 {\sf A multi-state idea for slice sampling.}\index{Monte Carlo methods!multi-state}
 Investigate the following multi-state method for slice sampling.
 As in {Skilling's multi-state leapfrog method} (\secref{sec.skillingleapfrog}),
 maintain a set of $S$ state vectors
 $\{ \bx^{(s)} \}$. Update one state  vector  $\bx^{(s)}$ by one-dimensional
 slice sampling in a direction $\by$ determined by
 picking two other state vectors  $\bx^{(v)}$ and $\bx^{(w)}$ at random and
 setting $\by = \bx^{(v)} - \bx^{(w)}$.
\amarginfig{b}{
\begin{center}
\setlength{\unitlength}{1mm}
\begin{picture}(40,40)
\put(0,0){\makebox(0,0)[bl]{\psfig{figure=figs/gallager/multislice.eps,width=40mm,height=30mm}}}
\put(8,11){\makebox(0,0)[tl]{$\bx^{(s)}$}}
\put(34,19.5){\makebox(0,0)[tl]{$\bx^{(v)}$}}
\put(21.50,8){\makebox(0,0)[tl]{${\bx^{(w)}}$}}
\end{picture}
\end{center}
}\
 Investigate this method on toy problems such as a highly-correlated
 multivariate \ind{Gaussian distribution}. Bear in mind that
 if $S-1$ is smaller than the number of dimensions  $N$ then
 this method will not be ergodic by itself, so it may
 need to be mixed with other methods. Are there
 classes of problems that are better solved by this slice-sampling method
 than by the standard methods for picking $\by$ such
 as cycling through the coordinate axes or picking $\by$ at random
 from a Gaussian distribution?
}




% see advanced_mc.tex
\section{Solutions}
\soln{ex.skilling}{
	Consider the spherical Gaussian distribution where
all components have mean zero and variance 1. 
	In one dimension, the $n$th, if $x^{(1)}_n$ leapfrogs over $x^{(2)}_n$,
 we obtain the proposed coordinate
\beq
	(x^{(1)}_n)' = 2 x^{(2)}_n - x^{(1)}_n .
\eeq
 Assuming that  $x^{(1)}_n$ and $x^{(2)}_n$ are Gaussian random variables
 from $\Normal(0,1)$, $(x^{(1)}_n)'$ is Gaussian from
 $\Normal(0,\sigma^2)$,
where $\sigma^2 = 2^2 + (-1)^2 = 5$.
 The change in energy contributed by this one dimension will be
% half x'^2 - ( half x^2 )
\beq
 \frac{1}{2} \left[	( 2 x^{(2)}_n - x^{(1)}_n  )^2 -  ( x^{(1)}_n )^2  \right] =
%	4 (x^{(2)}_n)^2  - 4 x^{(2)}_n x^{(1)}_n +  (x^{(1)}_n  )^2 -  ( x^{(1)}_n )^2 =
	2 ( x^{(2)}_n)^2 - 2 x^{(2)}_n x^{(1)}_n 
\eeq
 so the typical change in energy is $2 \langle ( x^{(2)}_n)^2 \rangle = 2$,
 the cross term having zero mean because $x^{(1)}_n$ and $x^{(2)}_n$ are independent.
 This positive change is bad news.
 In $N$ dimensions, the typical change in energy when
 a leapfrog move is made, at equilibrium,
 is thus $+2N$.
 The probability of acceptance of the move scales
 as
\beq
	e^{-2N} .
% \exp$
\eeq
 This implies that Skilling's method, as described, is not  effective
 in very high-dimensional problems -- at least,  not 
 once convergence has occurred.
% , so it is hard to imagine that it's effective .
 Nevertheless it has the impressive advantage that its
 convergence  properties are independent of the
 strength of correlations between the variables --
 a property that not even the \hybrid\ Monte Carlo
 and overrelaxation methods offer.


% In a bit more detail, roughly what does the above calculation
% mean about the rate of  convergence? If we assume that
% the rare acceptances lead to a
% rough doubling in size of the ball of points (in the
% directions in which growth is permitted) then
% there will be a doubling every $e^{2N}$ iterations,
% and the number of iterations to fill out the long dimension (which
% is $L$ times larger than the initial ball)
% will be roughly $e^{2N} \log_2 L$.
}


\prechapter{About                  Chapter}
%
% \chapter{Ising Models}
%
% for entropy versus temperature see 
%\label{fig.Sising}
% in basic_mc
%
 Some of the neural network models that we will encounter 
 are related to \ind{Ising model}s, which are idealized magnetic systems. 
% familiar to most physics graduates.
%\footnote{Though maybe not the 
% present Cambridge class?}
 It is not essential to understand 
 the statistical physics of Ising models to understand these 
 neural networks, 
 but I hope you'll find them helpful.
% think it is a good idea to include some notes on them, 
% as much to revise the beauties of \ind{statistical physics} as to 
% refresh the memory about Ising models.

 Ising models are also related to several other topics in this book.
 We will use exact tree-based computation methods like those
 introduced in  \chapterref{ch.exact} to evaluate properties of
 interest in Ising models.
 Ising models offer crude models for binary images.\index{image models}\index{binary images}
 And Ising models relate to two-dimensional \ind{constrained channel}s (\cf\ \chapterref{ch.noiseless}):
 a two-dimensional \ind{bar-code} in which a black dot may not
 be completely surrounded by black dots, and
% {\em vice versa\/}
 a white dot may not
 be completely surrounded by white dots,
 is similar to an antiferromagnetic Ising model at  low temperature.
 Evaluating the entropy of this Ising model is equivalent to
 evaluating the capacity  of the constrained channel for conveying bits.

 If you would like to jog your memory  on statistical physics
 and thermodynamics, you might find  \appendixref{app.statphy}
 helpful. I also recommend the book by \citeasnoun{Reif}.



\ENDprechapter
\chapter{Ising Models} 
\label{ch.ising}

\fakesection{Ising Models}
 An \ind{Ising model}\index{spin system}
 is an array of spins (\eg, atoms that can take 
 states $\pm 1$) that are 
 magnetically coupled to each other. If one spin is, say, in the $+1$ state
 then it is energetically favourable for its immediate neighbours to 
 be in the same state, in the case of a ferromagnetic model, 
 and in the opposite state, in the case of an antiferromagnet.
 In this chapter
% e following 
% three two sections
 we  discuss two computational techniques 
 for studying Ising models. 

 Let the state $\bx$ of an Ising model with $N$ spins be a vector in which
 each component $x_n$ takes values $-1$ or $+1$.
 If two spins $m$ and $n$ are neighbours we write $(m,n) \in {\cal N}$.
 The coupling between neighbouring spins is $J$.
 We define $J_{mn} = J$ if $m$ and $n$ are neighbours
 and $J_{mn}=0$ otherwise. The energy of a state 
 $\bx$ is 
\beq
	E(\bx;J,H) = - \left[ 
% \frac{1}{2}
				\half \sum_{m,n}
% \begin{array}{@{}c@{}}m,n:\\ 
% (m,n) \in {\cal N}\end{array} }
				J_{mn} x_m x_n 
				+ \sum_{n} H x_n \right]  , 
\label{eq.ising.e}
\eeq
 where
% $J$ is the coupling 
% between spins $m$ and $n$, 
% and
 $H$ is the applied field. If $J > 0$ then the model is
 \ind{ferromagnetic}, and if $J < 0$ it is \ind{antiferromagnetic}.  
 We've included the factor of $\dhalf$ because 
 each pair  is counted twice in the first sum, once as $(m,n)$ and once as $(n,m)$.
% ; alternatively, we could sum over all $m$ and $n$, shove in a factor
% of $\half$, and define $J_{mn}$ (the coupling between spins $m$ and
% $n$) to be zero if $(m,n) \not \in \cal N$.
%
% In Physics we may be interested in the properties of Ising models with 
% a large number $N$ of spins having regular geometric neighbourhood 
% relationships. 
 At equilibrium at temperature $T$, the probability that the 
 state is $\bx$ is 
\beq
	P( \bx\given  \beta, J,H) = \frac{1}{Z(\beta,J,H)} \exp \! \left[ - \beta E( \bx ; J , H ) \right] ,
\label{eq.ising.p}
\eeq
 where $\beta = 1/k_{\rm B}T$, $k_{\rm B}$ is Boltzmann's constant, and 
\beq
	Z(\beta,J,H) \equiv \sum_{\bx} \exp \!
		\left[ - \beta E( \bx ; J , H ) \right]  .
\label{eq.ising.z}
\eeq
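 As a minimal illustration of these definitions (a toy case chosen purely for
 illustration), take $N=2$ neighbouring spins with $H=0$.
 The energy is $E = -J x_1 x_2$, the factor of $\dhalf$ compensating for
 the pair being counted twice. The two aligned states have energy $-J$
 and the two anti-aligned states have energy $+J$, so
\beq
	Z = 2 e^{\beta J} + 2 e^{-\beta J} = 4 \cosh \beta J ,
\eeq
 and the probability that the two spins are aligned is
 $2 e^{\beta J}/Z = 1/(1+e^{-2 \beta J})$.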

\subsection{Relevance of Ising models}
 Ising models are relevant for three reasons.

 Ising models are important first  as models of magnetic systems
 that have a phase transition. The theory of \ind{universality} in
 statistical physics
 shows that all systems with the same dimension (here, two), 
 and the same  symmetries, have equivalent critical 
 properties, \ie, the scaling laws shown by their phase transitions
 are identical. So by studying Ising models we can find out
 not only about magnetic phase transitions but also about 
 phase transitions in  many  other systems.
%  such as liquid-vapour transitions. 

 Second, if we generalize the energy function to
\beq
	E(\bx;\bJ,{\bf h}) = - \left[ 
 \frac{1}{2}
				\sum_{m,n} J_{mn} x_m x_n 
				+ \sum_{n} h_n x_n \right]  , 
\eeq
 where the couplings $J_{mn}$ and applied fields $h_n$ are not constant, 
 we obtain a family of models known as `spin glasses' 
 to physicists, and  as `Hopfield\pagebreak[1] networks' or 
 `Boltzmann machines' to the neural 
 network community.  In some of these models, all spins are declared
 to be neighbours of each other, in which case physicists call
 the system an `infinite-range' spin glass, and networkers call
 it a `fully connected' network.

 Third,
% , as I will show in section \ref{sec.ising.retina}, 
 the Ising model is also useful as a statistical model in its own
 right.

 In this chapter we will
%  sections \ref{sec.ising.mc} and \ref{sec.ising.matrix} we will 
 study Ising models using two different computational techniques.
% The aim is not so much to learn about Ising  
% models as it is to think about the Physics of thermodynamic systems.


\subsection{Some remarkable relationships in statistical physics}
 \index{statistical physics}We
 would like to get as much information as possible out of 
 our computations. Consider for example the \ind{heat capacity} of a system, 
 which is defined to be 
\beq
 C \equiv  \frac{\partial}{\partial T} \bar{E}
 , 	
\eeq
 where
\beq
	 \bar{E} = \frac{1}{Z} \sum_{\bx} \exp( - \beta E(\bx) ) \, E(\bx) .
\eeq
% Naively, we might guess that to work
% out the heat capacity of a system at a certain 
% temperature, we have to change the temperature to a higher
% temperature and measure the energy change.
%
% given only observations of the system at that temperature, 
% or do we have to change the temperature? 
%
   To work out the heat capacity of a system,
 we
   might naively guess that we have to increase the temperature and
   measure the energy change.
 Heat capacity, however, is intimately related to energy  {\em \ind{fluctuations}\/} 
 at constant temperature.
 Let's start from the \ind{partition function}, 
\beq
	Z = \sum_{\bx} \exp( -\b E(\bx) ) .
\eeq
 The mean energy is obtained by differentiation \wrt\ $\b$:
\beq
\frac{	\partial \ln Z}{ \partial \b } 
	= \frac{1}{Z} \sum_{\bx} - E(\bx) \exp( -\b E(\bx) ) =  - \bar{E} .
\eeq
 A further differentiation spits out the variance of the \ind{energy}: 
\beq
\frac{	\partial^2 \ln Z}{ \partial \b^2 } =
  \frac{1}{Z} \sum_{\bx}  E(\bx)^2 \exp( -\b E(\bx) ) - \bar{E}^2 
 = \langle{E^2}\rangle - \bar{E}^2  = {\rm var}(E) .
\eeq
 But the heat capacity is also the derivative of $\bar{E}$ with respect to 
 temperature:
\beq
	\frac{	\partial  \bar{E} }{ \partial T } 
	= - \frac{	\partial}{ \partial T}
	\frac{	\partial \ln Z}{ \partial \b } 
	= - \frac{	\partial^2 \ln Z}{ \partial \b^2 } \frac{	\partial  \b }{ \partial T }
 = - {\rm var}(E) ( -1/k_{\rm B} T^2 ) .
\eeq
 So for any system at temperature $T$, 
\beq
	C = \frac{ {\rm var}(E) }{k_{\rm B} T^2} =  k_{\rm B} \b^2 \, {\rm var}(E)  .
\eeq
 Thus if we can observe the variance of the energy of a system at equilibrium,
 we can estimate its heat capacity. 
% More tricks can be found in section \ref{sec.ising.matrix}.
% , with derivations

 I find this an almost paradoxical relationship.\index{paradox!heat capacity and fluctuations}
 Consider a system
 with a finite set of states, and imagine heating it up. At high
 temperature, all states will be equiprobable, so the mean energy will
 be essentially constant and the heat capacity will be essentially
 zero. But on the other hand, with all states being equiprobable,
 there will certainly be fluctuations in energy. So how can the heat
 capacity be related to the fluctuations? The answer is in the
 words `essentially zero' above. The heat capacity is not quite zero at high
 temperature, it
 just tends to zero. And it tends to zero as $\smallfrac{ {\rm var}(E)
 }{k_{\rm B} T^2}$, with the quantity ${\rm var}(E)$ tending
 to a constant at high
 temperatures. This $1/T^2$ behaviour of the heat capacity of finite 
 systems at high temperatures is thus very general.
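
 A toy check of these claims (an illustrative example, with a single
 energy scale $J$ as its only parameter): let a system have four states, two with
 energy $-J$ and two with energy $+J$, as is the case for a pair of coupled spins.
 Then $Z = 4 \cosh \b J$, and
\beq
	\bar{E} = - \frac{\partial \ln Z}{\partial \b} = - J \tanh \b J ,
	\qquad
	{\rm var}(E) = \frac{\partial^2 \ln Z}{\partial \b^2} = \frac{J^2}{\cosh^2 \b J} ,
\eeq
 so
\beq
	C = k_{\rm B} \b^2 \, \frac{J^2}{\cosh^2 \b J} ,
\eeq
 which tends to $J^2 / (k_{\rm B} T^2)$ at high temperature, where ${\rm var}(E) \rightarrow J^2$.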

 The $1/T^2$ factor can be viewed as an accident of history. If only
 temperature scales had been defined using $\beta=\smallfrac{1}{\kB T}$,
 then the definition of heat capacity would be
\beq
	C^{(\beta)} \equiv - \frac{ \partial \bar{E} }{ \partial \b }
	= {\rm var}(E) ,
\eeq
 and heat capacity and fluctuations would be  identical
 quantities.
% , were it not for this slip by \ind{Kelvin}, \ind{Carnot} {\em et al}.
% \medskip

\exercisxB{2}{ex.SlZbE}{
 [We will call the entropy of a physical system  $S$ rather
 than $H$, while we are in a statistical physics
 chapter;
% for convenience we will
 we set $k_{\rm B} = 1$.]

 The entropy of a system whose states are $\bx$, at temperature $T=1/\beta$, 
 is 
\beq
	S = \sum_{\bx} p(\bx) \! \left[ \ln 1/p(\bx) \right] 
\eeq
 where
\beq
	p( \bx ) = \frac{1}{Z(\beta)} \exp \! \left[ - \beta E( \bx ) \right]   .
\label{eq.gen.p}
\eeq
\ben
\item
 Show that 
\beq
	S = \ln Z(\beta) + \beta \bar{E}(\beta)
\eeq
 where $\bar{E}(\beta)$ is the mean energy of the system.
\item
 Show that
\beq
 S = - \frac{ \partial F }{ \partial T } ,
\eeq
 where the free energy $F = - kT \ln Z$ and $kT = 1/\b$.
\een
}

% Binder says simulate a 55x55 grid
% critical behaviour of magnetization: m -> B(1-T/Tc)^beta, beta=1/8
% Cv should have a divergence.
\section{Ising models -- Monte Carlo simulation}
\label{sec.ising.mc} 
 In this section we  study two-dimensional planar Ising models
 using a simple Gibbs-sampling
% or `heat bath Monte Carlo'
 method.  
 Starting from some initial state, a spin $n$ is selected at
 random, and the probability that it should be $+1$ given the state of
 the other spins and the temperature  is computed,
\beq
	P(+1\given b_n)= \frac{1}{1+\exp(- 2 \beta b_n)},
\label{eq.gibbs}
\eeq
 where $\beta = 1/k_{\rm B}T$ and $b_n$ is the local field
\beq
	b_n = \sum_{m:(m,n) \in {\cal N}} J x_m + H.
\eeq
 [The factor of 2 appears in equation (\ref{eq.gibbs}) because 
 the two spin states are $\{+1,-1\}$ rather than $\{ +1 , 0 \}$.]
 Spin $n$ is set to $+1$ with that probability, and otherwise to 
 $-1$; then the next  spin  to update is selected at random.
 After sufficiently many iterations, this procedure
 converges to the equilibrium distribution
% of equation
 (\ref{eq.ising.p}).
% $P(\bx)=\frac{1}{Z}\exp(-\beta E(\bx;J,B))$.
 An alternative to the Gibbs sampling formula (\ref{eq.gibbs}) is the 
 Metropolis algorithm, in which we consider the change in energy 
 that results from flipping the chosen spin from its current state $x_n$, 
\beq
	\Delta E = 2 x_n b_n ,
\eeq
 and adopt this change in configuration with probability
\beq
	P( {\rm accept} ; \Delta E , \b ) = \left\{ \begin{array}{cc}
	1 & \Delta E \leq 0 \\
	\exp( - \beta \Delta E ) & \Delta E > 0 .
	\end{array}
	\right.  
\eeq
 This procedure has roughly double the probability of accepting energetically
 unfavourable moves, so may be a more efficient sampler -- but at very low
 temperatures the relative merits
 of
% choice between
 Gibbs sampling and the
 Metropolis algorithm may be subtle.%
%\begin{center}
\amarginfig{b}{
\mbox{\setlength{\unitlength}{1.572pt}
\begin{picture}(70,40)(0,-5)
\newsavebox{\verticalfour} 
\savebox{\verticalfour}(0,0)[bl]{
	\multiput(0,0)(0,10){4}{\circle{2}}     % spins
	\multiput(0,5)(0,10){4}{\line(0,-1){3}} % lines down
	\multiput(0,-5)(0,10){4}{\line(0,1){3}} %   lines up
	\multiput(2,0)(0,10){4}{\line(1,0){6}}  % lines right
}
\multiput(0,0)(10,0){6}{\usebox{\verticalfour}}
\end{picture}
}
\caption[a]{Rectangular Ising model.}
\label{fig.isingR}
}
%\end{center}
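 The phrase `roughly double' can be made more precise with a short calculation
 that follows from equation (\ref{eq.gibbs}). Under Gibbs sampling, the probability
 that the chosen spin ends up flipped from its current state is
\beq
	P( x_n \rightarrow -x_n ) = \frac{1}{1+\exp( \beta \Delta E )} ,
\eeq
 where $\Delta E = 2 x_n b_n$. For an energetically unfavourable move
 ($\Delta E > 0$), the ratio of the Metropolis acceptance probability to this
 flip probability is
\beq
	\exp( - \beta \Delta E ) \left[ 1+\exp( \beta \Delta E ) \right]
	= 1 + \exp( - \beta \Delta E ) ,
\eeq
 which lies between 1 and 2, and is close to 2 only when $\beta \Delta E$ is small.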


\subsection{Rectangular geometry}

 I first simulated
% Let us first write a program that simulates
 an Ising model with the rectangular
 geometry  shown
% below
 in \figref{fig.isingR}, and with periodic boundary conditions. A
 line between two spins indicates that they are neighbours.
%
% To make a bite-sized example, we will set $b$ to 0 throughout, 
 I  set the external field $H=0$
 and considered the two cases $J = \pm 1$,
 which give a ferromagnet and an antiferromagnet respectively. 

 I started  at a large temperature ($T \eq  33, \beta \eq  0.03$) 
 and changed the temperature every $I$ iterations, first decreasing
 it gradually to $T\eq 0.1, \beta \eq  10$, then increasing it gradually back to a large 
 temperature again. This procedure gives a crude check on whether 
 `equilibrium has been reached' at each temperature; if not, we'd expect to 
 see some hysteresis in the graphs we plot. It also gives an 
 idea of the reproducibility of the results, if we assume that 
 the two runs, with decreasing and increasing temperature,
 are effectively independent of each other. 

 At each temperature I recorded the mean energy per spin
 and the standard deviation
 of the energy, and the mean square value of the magnetization $m$, 
\beq
	m = { \smallfrac{1}{N}}
		 \sum_{n} x_n .
\eeq
%\begin{figure}
%\figuremargin{%
\marginfig{\small
\begin{center} 
\makebox[-0.35in]{}
\begin{tabular}[b]{cl} $T$ & \\
5   & \risingsample{r0.2} \\%%%%%%%%%%%%%% restore these two!!!!!!!:::::::
%3   & \risingsample{r0.33} \\
%2.7 & \risingsample{r0.37} \\
2.5 & \risingsample{r0.4} \\
%\end{tabular}
%\begin{tabular}[b]{cl} % $T$ & \\
2.4 & \risingsample{r0.42} \\
2.3 & \risingsample{r0.44} \\
2   & \risingsample{r0.5} \\
\end{tabular}
\end{center}
%}{%
\caption[a]{Sample states of rectangular Ising models with $J=1$
 at a sequence of temperatures $T$.
}
\label{fig.ising.states1}
}%
%\end{figure}
%
 One tricky decision that has to be made is how soon to start taking
 these measurements after a new temperature has been established; it
 is difficult  to detect `equilibrium'  -- or even to give a clear
 definition of a system's being `at equilibrium'! [But in \chref{ch.mcexact}
 we will see  a solution to this problem.] My crude strategy
 was to let the number of iterations at each temperature, $I$, be a few hundred times the
 number of spins $N$, and to discard the first $\dthird$ of those
% assume equilibrium had been reached after
% $I/3$
 iterations. With $N\eq 100$, I found I needed more than $100\,000$
 iterations to reach equilibrium at any given temperature.
% My code is written in {\tt C} and is available at
% \verb+http://wol.ra.phy.cam.ac.uk/+. 
% There are no fancy tricks. 

\subsection{Results for small $N$ with $J=1$.}
 I simulated an $l \times l$ grid for $l =  4, 5, \ldots, 10, 40, 64$. 
 Let's have a quick think about what results we expect.  At low temperatures
 the system is expected to be in a ground state. The rectangular Ising model 
 with $J=1$ has two ground states, 
 the all $+1$ state and the all $-1$ state. The energy per spin of either
 ground state is $-2$. 
 At high temperatures, the spins are independent,
 all states are equally probable, and the 
 energy is expected to fluctuate around a mean of
 $0$ with a standard deviation proportional to  $1/\sqrt{N}$. 
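
 These two limits can be made quantitative with a short calculation,
 sketched here under the assumption (taken from the zero-field definition of the
 model) that each of the $2N$ nearest-neighbour bonds of the periodic
 grid contributes $-J x_m x_n$ to the energy. In either ground state every
 bond has $x_m x_n = 1$, so
\beq
	E_0 / N = - 2 N J / N = - 2 J .
\eeq
 At infinite temperature the spins are independent and uniformly distributed, so
 the bond products $x_m x_n$ have mean zero and, for $l \geq 3$, are pairwise
 uncorrelated; then
\beq
	\left< E \right> = 0 , \:\:\:\:
	\var(E) = 2 N J^2 , \:\:\:\:
	\mbox{std}( E/N ) = |J| \sqrt{ 2/N } ,
\eeq
 which is about $0.35$ for $N=16$ and about $0.14$ for $N=100$ when $J=1$.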
% At intermediate temperatures we expect the energy to rise monotonically.
% By thinking more carefully we could probably predict the leading order
% behaviour at each extreme. 

 Let's look at some results. In all figures temperature $T$ is shown with 
 $k_{\rm B}=1$.
 The basic picture emerges with as few 
 as  16 spins (\figref{fig.ising.16}, top):
 the energy  rises monotonically.
 As we increase the number 
 of spins to 100 (\figref{fig.ising.16}, bottom) some new details emerge. 
 First, as expected, the fluctuations at large temperature decrease 
 as $1/\sqrt{N}$. Second, the fluctuations at intermediate temperature
 become relatively {\em bigger}. This is the signature of a `\ind{collective}
 phenomenon', in this case, a \ind{phase transition}. Only systems with 
 infinite $N$ show true phase transitions, but with $N=100$ we are getting 
 a hint of the \ind{critical fluctuations}. \Figref{fig.ising.100d} shows details
 of the graphs for $N=100$ and $N=4096$.
 \Figref{fig.ising.states1} shows a sequence of typical states from 
 the simulation of $N=4096$ spins at a sequence of decreasing temperatures.

\begin{figure}
\figuremargin{\small%
%\figuredangle{%
\begin{center}
\footnotesize
%\makebox[-0.35in]{}
\begin{tabular}{@{}cll} $N$ &  \hspace{0.2in} Mean energy and fluctuations
%in energy
 & \hspace{0.2in} Mean square magnetization \\
\raisebox{10mm}{16}&
\makebox[-0.15in]{}\psfig{figure=isingfigs/E4.1.ps,angle=-90,width=2.49in}&
\makebox[-0.15in]{}\psfig{figure=isingfigs/M4.1.ps,angle=-90,width=2.49in}\\
\raisebox{10mm}{100}&
\makebox[-0.15in]{}\psfig{figure=isingfigs/E10.1.ps,angle=-90,width=2.49in}&
\makebox[-0.15in]{}\psfig{figure=isingfigs/M10.1.ps,angle=-90,width=2.49in}\\
%4096&
%\makebox[-0.15in]{}\psfig{figure=isingfigs/E64.1.ps,angle=-90,width=2.4in}&
%\makebox[-0.15in]{}\psfig{figure=isingfigs/M64.1.ps,angle=-90,width=2.4in}\\
\end{tabular}
\end{center}
}{%
\caption[a]{Monte Carlo simulations of rectangular Ising models with $J=1$.
 Mean energy and fluctuations in energy as a function of temperature (left).
 Mean square magnetization as a function of temperature (right).

 In the top row, $N=16$, and the bottom, $N=100$. For even larger
 $N$, see later figures.
}
\label{fig.ising.16}
}%
\end{figure}
% these figures were done by _courses/comput/newising_mc/i but see ising_mc/README
%
% FIGURE \label{fig.ising.100d}
% was moved later than this natural point, against my wishes. and against logic.
% in order to get its number to be 31.5

\subsubsection{Contrast with Schottky anomaly}
\amarginfig{c}{%%%%%%%%%%%%%% this fig needs its axes cleaning up
\mbox{\psfig{figure=figs/fakeC.ps,width=1.5in,angle=-90}\footnotesize$\,T$}
\caption[a]{Schematic diagram to explain the meaning of
 a  \ind{Schottky anomaly}.
 The curves show the heat capacities of two gases
 as a function of temperature. The lower curve shows a
 normal gas whose heat capacity is an increasing
 function of temperature. The upper curve
 has a small peak in the heat capacity, which is
 known as a Schottky anomaly (at least in Cambridge).
 The peak is produced by the gas having magnetic
 degrees of freedom with a finite number of accessible states.
%
% can I find real data?  see schott.gnu for this hack DO NOT EDIT
%
}
\label{fig.schottky}
}%%%%%%%%%%%%%%%%%%%%%
A peak in the \ind{heat capacity}, as a function of temperature, 
 occurs in any system that has a finite number of energy levels;
 a peak is not in itself evidence of a phase transition.
 Such peaks were viewed as anomalies in classical \ind{thermodynamics},
 since `normal' systems with infinite numbers
 of energy levels (such as a particle in a box) have heat capacities that are either
 constant  or increasing functions of temperature.
 In contrast, systems with a finite number of levels produced
 small blips in the heat capacity graph (\figref{fig.schottky}).
%
% this belongs earlier logically
%
\begin{figure}
\figuremargin{\small%
\begin{center}
\begin{tabular}{l@{}l} \multicolumn{1}{c}{ $N=100$ } &  \multicolumn{1}{c}{ $N=4096$ } \\
\makebox[-0.15in]{(a)}\psfig{figure=isingfigs/E10.1d.ps,angle=-90,width=2.7in} 
&
\makebox[-0.15in]{}\psfig{figure=isingfigs/E64.1d.ps,angle=-90,width=2.7in} \\[-0.1in]
\makebox[-0.15in]{(b)}\psfig{figure=isingfigs/SE10.1d.ps,angle=-90,width=2.7in}
&
\makebox[-0.15in]{}\psfig{figure=isingfigs/SE64.1d.ps,angle=-90,width=2.7in} \\[-0.1in]
\makebox[-0.15in]{(c)}\psfig{figure=isingfigs/M10.1d.ps,angle=-90,width=2.7in}& 
\makebox[-0.15in]{}\psfig{figure=isingfigs/M64.1d.ps,angle=-90,width=2.7in} \\[-0.1in]
\makebox[-0.15in]{(d)}\psfig{figure=isingfigs/SC10.1.ps,angle=-90,width=2.7in}& 
\makebox[-0.15in]{}\psfig{figure=isingfigs/SC64.1.ps,angle=-90,width=2.7in} \\[-0.1in]
\end{tabular}
\end{center}
}{%
\caption[a]{Detail of Monte Carlo simulations of rectangular Ising models with $J=1$.
 (a) Mean energy and fluctuations in energy as a function of temperature.
 (b) Fluctuations in energy (standard deviation).
 (c) Mean square magnetization.
 (d) Heat capacity. 
}
\label{fig.ising.100d}
}%
\end{figure}
%
%  END this belongs earlier.
%

 Let us refresh 
 our memory of the simplest such system, a two-level system with 
 states $x=0$ (energy 0) and $x=1$ (energy $\epsilon$). 
 The mean energy is 
\beq
	E(\beta) = \epsilon \frac{ \exp( - \beta \epsilon ) }{ 
			1 + \exp( - \beta \epsilon ) } 
		= \epsilon \frac{ 1}{
			1 + \exp( \beta \epsilon ) }
\eeq
 and the derivative with respect to $\beta$ is 
\beq
	\d E/\d \beta = - \epsilon^2  \frac{   \exp( \beta \epsilon ) }{
			[ 1 + \exp( \beta \epsilon )]^2 } .
\label{eq.schot.dEdb}
\eeq
 So the heat capacity is 
\beq
	C = \d E / \d T 
% = dE/d\beta d(1/kT)/dT
  = - \frac{ \d E}{\d\beta} \frac{ 1}{k_{\rm B}T^2}
	= \frac{\epsilon^2}{k_{\rm B}T^2}
		  \frac{   \exp( \beta \epsilon ) }{
			[ 1 + \exp( \beta \epsilon )]^2 } 
\eeq
 and the fluctuations in energy are given by 
 $\var(E) = C k_{\rm B} T^2 = - \d E/\d\beta$, which was
% already
 evaluated  in 
 (\ref{eq.schot.dEdb}). 
 The heat capacity and fluctuations are plotted in figure \ref{fig.schot}.
 The take-home message at this point is that whilst  Schottky anomalies
 do have a peak in the heat capacity, there is {\em no\/} peak
 in their {\em\ind{fluctuations}}; the variance of the
 energy simply increases monotonically
 with temperature to a value proportional to
 the number of independent spins.  Thus it is a peak
 in the {\em{fluctuations}\/} that is interesting, rather than
 a peak in the heat capacity.
% , as it gives evidence
% that something non-standard is going on.
% In contrast,
 The Ising model has such a peak
% an exciting peak
 in its fluctuations,
 as can  be seen in the second row of \figref{fig.ising.100d}. 
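 Both statements about the two-level system can be read off from a rewriting
 of the formulae above (a sketch, using the shorthand $y \equiv \beta \epsilon / 2$,
 which is introduced here for illustration only):
\beq
	\var(E) = \frac{\epsilon^2}{4} \, \mbox{sech}^2 y ,
	\:\:\:\:\:\:
	C = k_{\rm B} \, y^2 \, \mbox{sech}^2 y .
\eeq
 As $T$ increases, $y$ decreases and $\mbox{sech}^2 y$ rises monotonically towards 1,
 so the variance rises monotonically to $\epsilon^2/4$. The heat capacity, in contrast,
 vanishes in both limits $y \to 0$ and $y \to \infty$, and has its maximum where
 $y \tanh y = 1$, that is, at $y \simeq 1.20$, or $k_{\rm B} T \simeq 0.42 \, \epsilon$,
 where $C \simeq 0.44 \, k_{\rm B}$.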
%  visible in the Ising plots is novel in contrast to Schottky anomalies.
\begin{figure}
\figuremargin{\small%
\begin{center}
\begin{tabular}{l} 
\makebox[-0.15in]{}\psfig{figure=isingfigs/schot.ps,angle=-90,width=2.9in} \\
\end{tabular}
\end{center}
}{%
\caption[a]{Schottky anomaly -- 
 Heat capacity  and fluctuations in energy as a function of temperature
 for a two-level system with separation $\epsilon = 1$ and $k_{\rm B} = 1$.
}
\label{fig.schot}
}%
\end{figure}
%gnuplot> dE(b)=-exp(b)/(1+exp(b))**2
%gnuplot> vE(T) = -dE(1/T)
%gnuplot> C(T) = - dE(1/T) / T**2
%gnuplot> plot [0.1:10] vE(x), C(x)
%gnuplot> set logs x ; set size 0.6,0.6 ; replot
%gnuplot> plot [0.1:10] vE(x) t "Var(E)", C(x) t "Heat Capacity" 
%
%gnuplot> 
%gnuplot> plot [0.1:10] vE(x) t "Var(E)", C(x) t "Heat Capacity",E(x) t "Var(E)",gnuplot> set logs x ; set size 0.6,0.6 ; replot                         
%gnuplot> plot [0.1:10] C(x) t "Heat Capacity", vE(x) t "Var(E)"
%gnuplot> set xlabel "Temperature"   
%gnuplot> replot
%gnuplot> set term post
%Terminal type set to 'postscript'
%Options are 'landscape monochrome dashed "Helvetica" 14'
%gnuplot> set output "schot.ps"
%gnuplot> replot

 
 
\subsection{Rectangular Ising model with $J=-1$} 
 What do we expect to happen in the case $J=-1$? The ground states 
 of an infinite system are  the two \ind{checkerboard} patterns (\figref{fig.ising.check}), 
 and they have energy per spin $-2$, like the ground states of 
 the $J \eq 1$ model. Can this analogy be pressed further?%
%\begin{figure}
\amarginfig{t}{\small%
\begin{center} 
\makebox[-0.35in]{}
\begin{tabular}{c@{\hspace{0.25in}}c} 
 \smallrisingsample{sixC} &
 \smallrisingsample{sixC2} \\
\end{tabular}
\end{center}
%}{%
\caption[a]{The two ground states of a rectangular Ising model with $J=- 1$.
}
\label{fig.ising.check}
}%
%\end{figure}
%\begin{figure}\figuremargin{%
\amarginfig{t}{\small
\begin{center} 
\makebox[-0.35in]{}
\begin{tabular}{c@{\hspace{0.25in}}c} $J=-1$ & $J=+1$ \\
 \smallrisingsample{six} &
 \smallrisingsample{sixc} \\
\end{tabular}
\end{center}
%}{%
\caption[a]{Two states of rectangular Ising models with $J=\pm 1$
 that have identical energy.
}
\label{fig.ising.check.equiv}
}%
%\end{figure}
 A moment's reflection will confirm that the two systems are 
 equivalent to each other under a checkerboard symmetry operation. 
 If you take an infinite $J=1$ system in some state and flip all the spins 
 that lie on the black squares of an infinite checkerboard, and 
 set $J=-1$ (\figref{fig.ising.check.equiv}), then 
 the energy is unchanged. (The magnetization changes, of course.)
 So all
 thermodynamic properties of the two systems are expected to be identical
 in the case of zero applied field.
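 The argument can be written out in one line. This is a sketch that assumes the
 zero-field energy has the form $E(\bx ; J) = - J \sum_{(m,n)} x_m x_n$, with the sum
 running over neighbouring pairs, and that introduces the label $\sigma_n = \pm 1$
 (not used elsewhere in the text) for the colour of the checkerboard square on which
 spin $n$ sits. Neighbours sit on squares of opposite colour, so $\sigma_m \sigma_n = -1$
 for every neighbouring pair, and the transformed state $x'_n = \sigma_n x_n$ satisfies
\beq
	E(\bx' ; -J) = J \sum_{(m,n)} \sigma_m \sigma_n \, x_m x_n
		= - J \sum_{(m,n)} x_m x_n = E(\bx ; J) .
\eeq
 The map from $\bx$ to $\bx'$ is one-to-one, so the two partition functions are equal
 at every temperature.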
% This provides a useful check on one's  code.

 But there is a subtlety lurking here. Have you spotted it? 
%\newpage
%
 We are simulating 
 finite grids with periodic boundary conditions. If the size of the grid in 
 any direction is {\em odd}, then the checkerboard operation is no longer 
 a symmetry operation relating $J=+1$ to $J=-1$, because the checkerboard
 doesn't match up at the boundaries. This means that for systems 
 of odd size, the ground state of a  system with $J=-1$
 will have degeneracy greater than 2, 
 and the energy of those ground states will not be as low as $-2$ per spin. 
 So we expect qualitative differences between 
 the cases $J = \pm 1$ in odd-sized systems.
 These differences are expected to be most prominent 
 for small systems.  The \ind{frustration}s are introduced by the boundaries, 
 and the length of the boundary grows as the square root of the system 
 size, so the fractional influence of this boundary-related frustration 
 on the energy and entropy of the system will decrease as $1/\sqrt{N}$. 
%
 \Figref{fig.ising.25} compares the energies of the ferromagnetic and 
 antiferromagnetic models with $N=25$. Here, the difference is  striking.
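 The size of the effect can be estimated by counting frustrated bonds; the following
 is a sketch for an $l \times l$ grid with $l$ odd and $J=-1$. Each row is a ring
 of $l$ spins, and an odd ring cannot alternate in sign all the way round, so it contains
 at least one pair of equal neighbours; the same holds for each column. At least $2l$
 of the $2 l^2$ bonds are therefore frustrated, and the energy per spin satisfies
\beq
	\frac{E}{N} \geq \frac{ - ( 2 l^2 - 2 l ) + 2 l }{ l^2 } = -2 + \frac{4}{l} .
\eeq
 A checkerboard pattern with one seam in each direction attains this bound, and the
 seams can be placed in many ways, which is consistent with a degeneracy greater than 2.
 For $l=5$ the ground-state energy per spin is $-1.2$ rather than $-2$; the shortfall
 $4/l = 4/\sqrt{N}$ shrinks as the grid grows.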

% The graphs for  fragments with even size are identical for $J=\pm 1$ as 
% theoretically predicted.
\begin{figure}[hbtp]
\figuremargin{\small%
\begin{center}
\begin{tabular}{ll} \multicolumn{1}{c}{$J=+1$} & \multicolumn{1}{c}{$J=-1$} \\
\makebox[-0.15in]{}\psfig{figure=isingfigs/E5.1.ps,angle=-90,width=2.7in}&
%\makebox[-0.15in]{(b)}\psfig{figure=isingfigs/M5.1.ps,angle=-90,width=2.7in}\\
\makebox[-0.415in]{}\psfig{figure=isingfigs/E5.-1.ps,angle=-90,width=2.7in}\\
%\makebox[-0.15in]{(d)}\psfig{figure=isingfigs/M5.-1.ps,angle=-90,width=2.7in}\\
\end{tabular}
\end{center}
}{%
\caption[a]{Monte Carlo simulations of rectangular Ising models with $J=\pm 1$ and $N=25$.
 Mean energy and fluctuations in energy as a function of temperature.
(a) $J=1$.
%(b) $J=1$. Mean square magnetization as a function of temperature.
(b) $J=-1$.
%  Mean energy and fluctuations in energy as a function of temperature.
%(d) $J=-1$. Mean square magnetization as a function of temperature.
}
\label{fig.ising.25}
}%
\end{figure}


\subsection{Triangular Ising model}
 We can repeat these computations for a triangular Ising model.
 Do we expect the triangular Ising model with $J = \pm 1$
 to show different physical 
 properties from the rectangular Ising model? 
 Presumably the $J=1$ model will have broadly similar properties 
 to its rectangular counterpart. But the case $J=-1$ is
 radically different from what's gone before. Think about it: 
 {\em there is no unfrustrated ground state}; in any state, there {\em must\/} be
 \ind{frustration}s -- pairs of neighbours 
 who have the same sign as each other. Unlike the case of 
 the rectangular model with odd size, the frustrations are 
 not  introduced by the periodic boundary conditions. 
 {\em Every set of three mutually neighbouring spins must be in a state of 
 frustration,} as shown in \figref{fig.frustration}.
% this was in the caption but the margin got full...
  (Solid lines show `happy' couplings which contribute $-|J|$ to the
 energy; dashed lines show `unhappy' couplings which contribute
 $|J|$.)
 Thus we certainly expect different behaviour at low temperatures.
 In fact we might expect this system to have a 
 non-zero entropy at absolute zero. (`Triangular model violates 
 \ind{third law of thermodynamics}!')\index{thermodynamics!third law}
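
 The claim about triples of spins can be checked directly. As a sketch, write the energy
 of the three bonds of one triangle, with $J=-1$, as $|J| ( x_1 x_2 + x_2 x_3 + x_3 x_1 )$,
 and use $x_n^2 = 1$:
\beq
	x_1 x_2 + x_2 x_3 + x_3 x_1 =
		\smallfrac{1}{2} \left[ ( x_1 + x_2 + x_3 )^2 - 3 \right] .
\eeq
 The sum $x_1 + x_2 + x_3$ is $\pm 1$ for the six states in which one spin differs from the
 other two, giving energy $-|J|$ with exactly one unhappy bond, and $\pm 3$ for the two
 states in which all three spins are equal, giving energy $3|J|$. No state makes all three
 bonds happy. A grid of $N$ spins has $2N$ triangles, and each bond belongs to two of them,
 so the total energy is half the sum of the triangle energies and $E/N \geq -|J|$, compared
 with $-3|J|$ for the ferromagnet. The bound is attained, for example, by states made of
 alternating rows of $+1$ and $-1$ spins (given an even number of rows).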

 Let's look at some results.
%
% this figure belongs higher up.
%\marginpar[b]{
%%%%%%%%%%%%%
%}
%%%%%%%%%%%% end marginpar
%\begin{figure}
%\figuremargin{%
\amarginfig{b}{
\begin{center}\small
\mbox{
\setlength{\unitlength}{1.7pt}% was 2pt
 \begin{picture}(70,45)(0,-5)
\newsavebox{\verticalfourdiag}% hexagonal
\savebox{\verticalfourdiag}(0,0)[bl]{
	\multiput(0,0)(0,10){4}{\circle{2}}     % spins
	\multiput(0,5)(0,10){4}{\line(0,-1){3}} % lines down
	\multiput(0,-5)(0,10){4}{\line(0,1){3}} %   lines up
	\multiput(2,1)(0,10){4}{\line(2,1){6}}  % lines rightup
	\multiput(2,-1)(0,10){4}{\line(2,-1){6}}  % lines rightdown
}
\multiput(0,0)(20,0){3}{\usebox{\verticalfourdiag}}
\multiput(10,5)(20,0){3}{\usebox{\verticalfourdiag}}
\end{picture}
}\\[0.1in]
\begin{tabular}{cc}
\psfig{figure=isingfigs/triangle.ps,angle=-90,width=0.7in} &
\psfig{figure=isingfigs/triangle3.ps,angle=-90,width=0.7in} 
\\
(a) & (b) \\
\end{tabular}
\end{center}
%}{%
\caption[a]{In an antiferromagnetic
 triangular Ising model, any three neighbouring
 spins are frustrated. Of the eight possible configurations of three
 spins, six 
 have energy $-|J|$ (a), and two have energy $3|J|$ (b).
}
\label{fig.frustration}
}%
%\end{figure}
%
%
%
% There are various ways to implement 
% a periodic triangular grid. I did it as shown in the margin.
% \input{tex/isingshearfig.tex}
% includes some cut graphs for 25 also
 Sample states are shown in \figref{fig.ising.stateshex1}, and 
 \figref{fig.ising.H4096} shows the energy, fluctuations,
 and heat capacity for $N=4096$. 
 Note how different the results for $J = \pm 1$ are. There is 
 no peak at all in the standard deviation of the energy in the case 
 $J = - 1$.
 This indicates that the antiferromagnetic system does not have a phase
 transition to a state with long-range order.

\begin{figure}[hbtp]
\figuremarginb{\small%
\begin{raggedright}
\noindent
\begin{tabular}{ll}  \multicolumn{1}{c}{$ J=+1$} & \multicolumn{1}{c}{$J=-1$} \\
\makebox[-0.15in]{(a)}\psfig{figure=isingfigs/HE64.1.ps,angle=-90,width=2.7in} &
\makebox[-0.15in]{(d)}\psfig{figure=isingfigs/HE64.-1.ps,angle=-90,width=2.7in}\\
%
\makebox[-0.15in]{(b)}\psfig{figure=isingfigs/HSE64.1.ps,angle=-90,width=2.7in} &
\makebox[-0.15in]{(e)}\psfig{figure=isingfigs/HSE64.-1.ps,angle=-90,width=2.7in}\\
%
\makebox[-0.15in]{(c)}\psfig{figure=isingfigs/HC64.1.ps,angle=-90,width=2.7in} &
\makebox[-0.15in]{(f)}\psfig{figure=isingfigs/HC64.-1.ps,angle=-90,width=2.7in} \\
\end{tabular}
\end{raggedright}
}{%
\caption[a]{Monte Carlo simulations of triangular Ising models with $J=\pm 1$ and $N=4096$.
(a--c) $J=1$. (d--f) $J=-1$. 
(a, d) Mean energy and fluctuations in energy as a function of temperature.
(b, e) Fluctuations in energy (standard deviation).
%
% change to variance? 
%
(c, f) Heat capacity.
}
\label{fig.ising.H4096}
}%
\end{figure}
\begin{figure}
\figuremarginb{\small%
\begin{center} 
\mbox{
\begin{tabular}[t]{cl} $T$ & $J=+1$ \\
20  & \Hisingsample{hexagon0.05} \\
6   & \Hisingsample{hexagon0.16} \\
4   & \Hisingsample{hexagon0.25} \\
3   & \Hisingsample{hexagon0.3} \\
2   & \Hisingsample{hexagon0.5} \\
\end{tabular}
\begin{tabular}[t]{cl} $T$ & $J=-1$ \\
50  & \hisingsample{hexagon0.02} \\
5   & \hisingsample{hexagon0.2} \\
2   & \hisingsample{hexagon0.5} \\
0.5 & \hisingsample{hexagon2} \\
\end{tabular}
}
\end{center}
}{%
\caption[a]{Sample states of triangular Ising models with $J=1$ and $J=-1$.
 High temperatures at the top; low at the bottom.
}
\label{fig.ising.stateshex1}% not referred to?
}%
\end{figure}

\section{Direct computation of partition function of Ising models}
\label{sec.ising.matrix}
 We now examine a completely different approach to Ising models. 
 The {\dbf\ind{transfer matrix method}}
 is an exact and  abstract  approach that obtains 
 physical properties of the model  from the \ind{partition function}
\beq
	Z(\beta,\bJ,\bb) \equiv \sum_{\bx} \exp \!
		\left[ - \beta E( \bx ; \bJ , \bb ) \right] ,
\eeq
 where the summation is over all states $\bx$, and the inverse
 temperature is $\beta = 1/T$. [{As usual, let $\kB = 1$.}]
 The \ind{free energy} is given by $F = -
 \frac{1}{\beta} \ln Z$. The number of states is $2^N$, so direct
 computation of the partition function is not possible for large
 $N$. To avoid enumerating all global states explicitly, we can use a  
 trick similar to the \ind{sum--product
% probability propagation
 algorithm} discussed in  \chapterref{ch.exact}.\index{message passing}
 We concentrate on models that have the form of a 
 long thin strip of width $W$ with periodic boundary conditions in both
 directions, and we iterate along the
 length of our model, working out a set of
 {\dem\ind{partial partition functions}\/}\index{partition function}
% \index{partition function!partial}
 at one location $l$ in terms of partial partition functions at the
 previous location $l-1$. Each iteration  involves a summation
 over all the states at the boundary. This operation is  exponential 
 in the width of the strip, $W$. The final clever trick is to note that 
 if the system is \ind{translation-invariant} along its length 
 then we only need to do {\em one\/} iteration in order to find the properties
 of a system of {\em any\/}  length.

 The computational task becomes the evaluation of an $S \times S$ matrix, 
 where $S$ is the number of microstates that need to be considered at the 
 boundary, and the computation of its eigenvalues. The  eigenvalue of largest
 magnitude gives the partition function for an infinite-length thin strip.

 Here is a more detailed explanation. Label the states of the $C$ columns of the thin 
 strip $s_1, s_2, \ldots, s_C$, with each $s$ an integer from 0 to $2^{W}\!-\!1$. 
 The $r$th bit of $s_c$ indicates whether the spin in row $r$, column $c$
 is up or down. 
 The \ind{partition function} is 
\newcommand{\lE}{{\cal E}}
\beqan
	Z& =& \sum_{\bx} \exp ( -\b E(\bx) ) \\
	& = & \sum_{s_1}\sum_{s_2}\cdots \sum_{s_C}  \exp \! \left(
	 -\b \sum_{c=1}^{C} \lE(s_{c},s_{c+1}) \right) ,
\label{eq.Z.sums}
\eeqan
 where $\lE(s_{c},s_{c+1})$ is an appropriately defined energy, and, if 
 we want periodic boundary conditions, $s_{C+1}$ is defined to be $s_{1}$.
 One definition for $\lE$ is:
\marginfig{
\begin{center}{\epsfbox{metapost/ising.1}}\end{center}
\caption[a]{Illustration to help explain the definition (\ref{eq.mydefn.ising}).
 $\lE(s_{2},s_{3})$ counts all the contributions to the
 energy in the rectangle. 
 The total energy is given by stepping the rectangle along.
 Each horizontal bond inside the rectangle is counted once;
 each vertical bond is half-inside the rectangle (and will be
 half-inside an adjacent rectangle) so half its energy is
 included in  $\lE(s_{2},s_{3})$; the factor of $1/4$ appears
 in the second  term
% ${\textstyle\frac{1}{4}}\!\!\!\!\!\sum_{\begin{array}{c}\scriptstyle (m,n) \in {\cal N}:\\ \scriptstyle  m \in c, n \in c \end{array}\!\!\!\!}  \!\!\!\!\! J \, x_m x_n$
 because $m$ and $n$ both run over all nodes
 in  column $c$, so  each bond is visited twice.

%\indent MANUAL INDENT
\hspace{1.5em}For the state shown here, $s_2 =  (100)_2$, $s_3 =  (110)_2$,
 the horizontal bonds contribute
 $-J$ to  $\lE(s_{2},s_{3})$, and the vertical bonds
 contribute  $+J/2$ on the left and $+J/2$ on the right,
 assuming periodic boundary conditions between top and bottom.
 So  $\lE(s_{2},s_{3}) = 0$.
}
}
\beq
	\lE(s_{c},s_{c+1}) = -
\!\!\!\sum_{\begin{array}{c}\scriptstyle (m,n) \in {\cal N}:\\ \scriptstyle  m\in c, n \in c+1 \end{array}\!\!\!\!} 
	\!\!\!\!\! J \, x_m x_n
	- {\textstyle\frac{1}{4}}\!\!\!\!\!\sum_{\begin{array}{c}\scriptstyle (m,n) \in {\cal N}:\\ \scriptstyle  m \in c, n \in c \end{array}\!\!\!\!}  \!\!\!\!\! J \, x_m x_n
	- {\textstyle\frac{1}{4}}\!\!\!\!\!\!\!\!\!\sum_{\begin{array}{c}\scriptstyle (m,n) \in {\cal N}:\\ \scriptstyle m\in c+1, n \in c+1  \end{array}\!\!\!\!}\!\!\!\!  \!\!\!\!\!  J \, x_m x_n .
\label{eq.mydefn.ising}
\eeq
 This definition of the energy has the nice property that (for the rectangular Ising model)
 it defines a matrix that is symmetric in 
 its two indices $s_{c},s_{c+1}$. The factors of $1/4$ are needed because 
 vertical links are counted four times. Let us define
\beq
 M_{s s'} = \exp \! \left(	 -\b  \lE(s,s') \right) .
\eeq
 Then continuing from equation (\ref{eq.Z.sums}), 
\beqan
	Z& = & \sum_{s_1}\sum_{s_2}\cdots \sum_{s_C}  
	\left[ \prod_{c=1}^{C} M_{s_{c},s_{c+1}} \right] \\
	& = & \Trace \left[  \bM^C \right] \\
	& = & \sum_a \mu_a^C ,
\label{eq.Z.prods}
\eeqan
 where $\{ \mu_a \}_{a=1}^{2^W}$ are the eigenvalues of $\bM$. 
 As the length of the strip $C$ increases, $Z$ becomes dominated by the 
 largest eigenvalue $\mu_{\max}$:
\beq
	Z \rightarrow  \mu_{\max}^C .
\eeq
 So the \ind{free energy} per spin in the limit of an infinite thin strip is 
 given by:
\beq
	f = - kT \ln Z / (WC) =  - kT C \ln  \mu_{\max} / (WC )
		=  - kT \ln  \mu_{\max} / W .
\eeq
 It's really neat that {\em all\/}
 the thermodynamic properties of a long 
 thin strip can be obtained from just the largest \ind{eigenvalue} of this \ind{matrix} 
 $\bM$!
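
 As a check on the formalism, consider the simplest case, sketched here under the
 assumption that the strip is a single chain of $C$ spins in zero field with no vertical
 bonds, so that $\lE(s,s') = - J x x'$, where $x = \pm 1$ and $x' = \pm 1$ denote the spins
 in the two neighbouring columns. (The label $\mu_2$ for the second eigenvalue is introduced
 here for illustration.) Then
\beq
	\bM = \left[ \begin{array}{cc}
		e^{\beta J} & e^{-\beta J} \\
		e^{-\beta J} & e^{\beta J}
	\end{array} \right] ,
	\:\:\:\:
	\mu_{\max} = 2 \cosh \beta J , \:\:\:\:
	\mu_{2} = 2 \sinh \beta J ,
\eeq
 so that $Z = ( 2 \cosh \beta J )^C + ( 2 \sinh \beta J )^C$ for the periodic chain, and
 the free energy per spin of the infinite chain is $f = - kT \ln ( 2 \cosh \beta J )$.
 This tends to $-kT \ln 2$ at high temperature and to $-|J|$ at low temperature.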


%  From the partition function we can obtain interesting thermodynamic properties
%  using the following relations (which you should confirm):\footnote{Here 
%  I have been careless about $\kB$, since I use the convention 
%  throughout the numerics of this paper that $\kB = 1$.}
% \beqan
% 	E &=& - \partial \ln Z /  \partial \beta
% \\
% 	F &=& - \frac{1}{\b} \ln Z
% \\
% 	F &=& E - TS
% \\
% \Rightarrow
% 	S &=& \ln Z + \b \partial \ln Z /  \partial \beta
% \\
% 	C &=& \partial E /   \partial T  \\
% 	&=& k_{\rm B} \b^2  \partial^2 \ln Z /  \partial \beta^2 \\
% 	&=& \frac{ \partial^2 \ln Z /  \partial \beta^2 }{k_{\rm B} T^2 }\\
% 	{\rm var}(E) & =& \partial^2 \ln Z /  \partial \beta^2
% %
% \eeqan

\subsection{Computations}
% I wrote  a {\tt C} program that computes
 I computed the \ind{partition function}s of  {\dem\index{long thin strip}{long-thin-strip}}
 Ising models with the geometries shown in \figref{fig.thinstrips}.
\begin{figure}[htbp]
\figuredangle{
\begin{center}\small
\begin{tabular}{cc}
 Rectangular:
&
Triangular:\\
%
\setlength{\unitlength}{1.7pt}% was 2pt
\begin{picture}(135,40)(-10,-5)
\newsavebox{\vfour} % again!
\savebox{\vfour}(0,0)[bl]{
	\multiput(0,0)(0,10){4}{\circle{2}}     % spins
	\multiput(0,5)(0,10){4}{\line(0,-1){3}} % lines down
	\multiput(0,-5)(0,10){4}{\line(0,1){3}} %   lines up
	\multiput(2,0)(0,10){4}{\line(1,0){6}}  % lines right
}
\multiput(0,0)(10,0){12}{\usebox{\vfour}}
\put(-14,17.75){\makebox{$W$}}
\put(-10,25){\vector(0,1){10}}
\put(-10,15){\vector(0,-1){10}}
\end{picture}\hspace{0.42in}
&
% smallest length that works seems to be 1.7pt
\setlength{\unitlength}{1.7pt}\input{tex/isingstrip.tex}\\
\end{tabular}
\end{center}
}{
\caption[a]{Two long-thin-strip Ising models. A line between two spins 
 indicates that they are neighbours. The strips have width $W$ and infinite 
 length. }
\label{fig.thinstrips}
}
\end{figure}

 As in the last section, I 
 set the applied field $H$ to zero
 and  considered the two cases $J = \pm 1$ which are a ferromagnet
 and antiferromagnet respectively.
 I computed the free energy per spin, $f(\beta,J,H) = F / N$
 for widths from $W = 2$  to 8 as a function of $\beta$ for
 $H=0$. 

\subsubsection{Computational ideas:}
 Only the largest eigenvalue is needed. There are several ways of getting
 this quantity, for example, iterative multiplication of the matrix by an initial vector.
 Because the matrix is all positive we know that the principal
 eigenvector is all positive too (\ind{Frobenius--Perron theorem}), so a
 reasonable initial vector is $(1,1,\ldots,1)$.  This iterative
 procedure may be faster than explicit computation of all eigenvalues.
 I computed them all anyway, which has the advantage that
 we can find the free energy of finite length strips -- using
 \eqref{eq.Z.prods} --
 as well as infinite ones. 
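
 The iterative procedure mentioned above can be sketched as follows; the symbols
 ${\bf v}^{(t)}$, ${\bf u}$, $\mu^{(t)}$ and $\mu_2$ are introduced here for illustration only.
 Starting from ${\bf v}^{(0)} = (1,1,\ldots,1)$, repeat
\beq
	{\bf u} = \bM {\bf v}^{(t)} , \:\:\:\:
	\mu^{(t)} = \frac{ \| {\bf u} \| }{ \| {\bf v}^{(t)} \| } , \:\:\:\:
	{\bf v}^{(t+1)} = \frac{ {\bf u} }{ \| {\bf u} \| } ,
\eeq
 until $\mu^{(t)}$ stops changing. Because the principal eigenvalue of an all-positive
 matrix is strictly larger in magnitude than every other eigenvalue, the direction of
 ${\bf v}^{(t)}$ converges to the principal eigenvector, with an error that shrinks by a factor
 of about $| \mu_2 / \mu_{\max} |$ per iteration, where $\mu_2$ is the eigenvalue of
 second-largest magnitude; $\mu^{(t)}$ then converges to $\mu_{\max}$. Each iteration costs
 of order $S^2 = 4^W$ operations if $\bM$ is stored as a dense matrix.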

\begin{figure}[tbh]
\figuremargin{\small%
\begin{center}
\mbox{\psfig{figure=ising/ferr8.ps,width=2.7in,angle=-90}\hspace{-0.2in} 
\psfig{figure=ising/anti8.ps,width=2.7in,angle=-90}} 
\end{center}
}{%
\caption[a]{Free energy per spin of long-thin-strip Ising models.

	Note the non-zero gradient at $T=0$ in the case of 
	the triangular antiferromagnet.
}
%\label{fig1} 
\label{fig.lts1} 
}%
\end{figure}
\begin{figure}%[tbh]
\figuremargin{\small%
\begin{center}\mbox{%
%\psfig{figure=ising/S.4.ps,width=3in,angle=-90}
\psfig{figure=ising/S.8.ps,width=2.773in,angle=-90} 
}\end{center}
}{%
\caption[a]{Entropies (in nats) of
% (a) width 4; (b)
 width 8 Ising systems as a function of temperature,
 obtained by differentiating the free energy curves
 in \protect\figref{fig.lts1}. The rectangular ferromagnet and
 antiferromagnet  have identical thermal properties.
 For the triangular systems, the upper curve  $(-)$ denotes the
 antiferromagnet and the lower curve $(+)$ the ferromagnet. 
}
\label{fig.ising.S}  
}%
\end{figure}
\subsection{Comments on graphs:}
	For large temperatures all Ising models should show the same
 behaviour: the \ind{free energy} is entropy-dominated, and the entropy per
 spin is $\ln(2)$. The mean energy per spin goes to zero.
 The free energy per spin should tend to
 $-\!\ln(2)/\beta$. The free energies are shown in \figref{fig.lts1}.

 One of the interesting properties we can obtain from the free energy 
 is the degeneracy of the ground state.  
 As the temperature goes to zero, the Boltzmann
% Gibbs
 distribution becomes 
 concentrated in the ground state. If the ground state is degenerate (\ie, 
 there are multiple ground states with identical energy)
 then the entropy as $T \to 0$ is non-zero. We can
 find the entropy from the free energy using $S = - \partial F/ \partial T$.
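
 Both limits can be written down explicitly. This is a sketch, with $k_{\rm B} = 1$, in which
 $E_0$ denotes the ground-state energy and $g$ the number of ground states (symbols
 introduced here for illustration). At high temperature every one of the $2^N$ states
 has a Boltzmann factor close to 1, and at low temperature only the ground states contribute:
\beqan
	T \to \infty : & & Z \simeq 2^N , \:\:\: F \simeq - N T \ln 2 , \:\:\: S \simeq N \ln 2 ; \\
	T \to 0 : & & Z \simeq g \, e^{-\beta E_0} , \:\:\: F \simeq E_0 - T \ln g , \:\:\: S \simeq \ln g .
\eeqan
 So the gradient at $T=0$ of a graph of free energy versus temperature is $-\ln g$. The
 entropy {\em per spin\/} at $T=0$ remains non-zero for large $N$ only if $g$ grows
 exponentially with $N$.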

%  When $J=1$ and $b=0$, a rectangular ferromagnet has an almost unique 
%  ground state (degeneracy 2)
%  with energy per spin $-2.0$ (four bonds, each shared between 
%  two spins). 
% 
%  If $W$ is even then the antiferromagnet is equivalent to the
%  ferromagnet, under the checkerboard transformation, as we already
%  said.  But if $W$ is odd then the antiferromagnet is frustrated in
%  the width direction; this affects both the energy per spin, which is
%  not so negative; and also, in principle, the entropy per spin,
%  because the ground state of the frustrated system may be
%  significantly degenerate, with a non-zero entropy per spin.  In the
%  case $W = 3$ the free energy per spin is $-4/3$ instead of $-2$.  The
%  ground state only has degeneracy 2. For the rectangular geometry I
%  think that for any $W$ the ground state has finite degeneracy. As $W$
%  increases this effect becomes negligible.
% 
%  The degeneracy of the antiferromagnetic 
%  triangular system at low temperature is
%  different.  The ground state is extensively
%  degenerate, at least for all even values of $W$.
%  (It is instructive to create ground states on the back of an 
%  envelope.)
% % \footnote{By constructing ground states for $W=3$
% % on the back of an
% % 
% % : f:   -1.0807 lz:   +2.3283 T   +0.46416 log(beta)   +0.76753: 
% % : f:   -1.0482 lz:   +3.7671 T   +0.27826 log(beta)    +1.2792: 
% % : f:   -1.0289 lz:   +6.1681 T   +0.16681 log(beta)    +1.7909: 
% % : f:   -1.0173 lz:  +10.1733 T       +0.1 log(beta)    +2.3026: 
% %
% %  using T = 0.464, obtain S = .0807 / 0.464 = 0.17
% %
% % or just using S = - dF/dT = .0173/0.1 = 0.173
% %
% % envelope, I anticipated that the entropy per spin at low temperature
% % might be about $\ln(2)/3 \simeq 0.23$, because roughly every third spin 
% % seems undetermined by its neighbours.}
% %
% % here are some states with energy per spin -1: straight parallel lines
% % chevronny arallel lines. Any pattern starting from a honeycomb 
% % of mainly + and pockets of -
% % any pattern lie chevrons but with side branches. Mazes that have 
% % dead ends and walls and roundabouts. Hexagons inside hexagons. 
% %
%  The zero-temperature degeneracy is
%  nicely revealed by a plot of the free energy versus temperature which
%  has gradient at any $T$ equal to minus the entropy. If the ground state
%  is unique the gradient is zero; for a triangular antiferromagnet the
%  gradient is non-zero. See figure \ref{fig1} for an illustration with
%  width 
% % $W=4$. This graph has gradient corresponding to a zero temperature 
% % entropy of $0.17$ per spin. With
%  $W=8$. I found an entropy of 0.088 per spin from the gradient 
%  at zero temperature.
%  I have not figured out whether  the ground state entropy per spin is non-zero
%  vanishes  as $W$ increases.
% % --maybe it
% %  goes as $\sqrt{N}$ rather than  as $N$.
%
% according to students it says 0.3 in a textbook. also, this seems
% reasonable from a counting argument. Can show that 1/3 of spins are free, at least, giving 0.23.
 The entropy of the triangular antiferromagnet at absolute zero
 appears to be about 0.3, that is, about half its high temperature value (\figref{fig.ising.S}).
%
 The mean energy as a function of temperature is plotted in  figure \ref{fig.lts2}.
 It is evaluated using the identity $\left< E \right> = - \partial \ln Z / 
 \partial \beta$. 
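 This identity, and the corresponding one for the fluctuations, follow from differentiating
 $\ln Z = \ln \sum_{\bx} \exp [ - \beta E(\bx) ]$ with respect to $\beta$:
\beqan
	- \frac{ \partial \ln Z }{ \partial \beta } & = &
		\frac{1}{Z} \sum_{\bx} E(\bx) \exp [ - \beta E(\bx) ] = \left< E \right> , \\
	\frac{ \partial^2 \ln Z }{ \partial \beta^2 } & = &
		\left< E^2 \right> - \left< E \right>^2 = \var(E) .
\eeqan
 For the infinite strip, where $\ln Z \rightarrow C \ln \mu_{\max}$, the mean energy per spin
 is therefore $- ( 1/W ) \, \partial \ln \mu_{\max} / \partial \beta$, and the heat capacity per spin,
 with $k_{\rm B} = 1$, is $( \beta^2 / W ) \, \partial^2 \ln \mu_{\max} / \partial \beta^2$.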
\begin{figure}%[tbh]
\figuremargin{\small%
\begin{center}\mbox{%
\psfig{figure=ising/ebar.8.ps,width=2.6in,angle=-90}
% see ~/_courses/comput/newising
}\end{center}
}{%
\caption[a]{Mean energy versus temperature  of
 long thin strip Ising models with width 8.
 Compare with \figref{fig.ising.16}.
}
%\label{fig2}  
\label{fig.lts2}  
}%
\end{figure}

\begin{figure}
\figuremargin{\small%
\begin{center}\mbox{%
\psfig{figure=ising/ferr.R.4.8.C.ps,width=2.5in,angle=-90}\hspace{-0.25in}
% does this need changing to C2.ps ?
\psfig{figure=ising/anti.H.4.8.C.ps,width=2.5in,angle=-90} 
}\end{center}
}{%
\caption[a]{Heat capacities of (a) rectangular model;
 (b) triangular models with different 
 widths, (+) and $(-)$ denoting ferromagnet and
 antiferromagnet.  Compare with figure \ref{fig.ising.H4096}.
}
\label{fig.ising3}  
}%
\end{figure}
\begin{figure}
\figuremargin{\small%
\begin{center}\mbox{%
\psfig{figure=ising/ferr.R.4.8.vE.ps,width=2.5in,angle=-90}\hspace{-0.25in}
% does this need changing to vE2.ps ?
\psfig{figure=ising/anti.H.4.8.vE.ps,width=2.5in,angle=-90} 
}\end{center}
}{%
\caption[a]{Energy variances, per spin, of (a) rectangular model;
 (b) triangular models with different 
 widths, (+) and $(-)$ denoting ferromagnet and
 antiferromagnet.  Compare with figure \ref{fig.ising.H4096}.
}
\label{fig.ising4}  
}%
\end{figure}

 Figure \ref{fig.ising3} shows the estimated heat capacity (taking raw 
 derivatives of the mean energy) as a function of temperature 
 for the triangular models with widths 4 and 8.
  Figure \ref{fig.ising4} shows the fluctuations in energy
 as a function of temperature.  All of these figures should show
 smooth graphs; the roughness of the curves is  due
 to inaccurate numerics.
% eigenvalue evaluation.
% It is apparent 
% that the peak in the heat capacity is getting sharper as the width 
% increases, especially in the ferromagnetic case.
 The nature of any phase transition is not obvious, but the graphs 
 seem compatible with the assertion that the ferromagnet shows, 
 and the antiferromagnet does not show, a phase transition.
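
 The heat capacity and the energy fluctuations are two views of the same
 quantity. As a reminder, assuming units in which $k_{\rm B}=1$ and writing
 $C$ for the heat capacity and ${\rm var}(E)$ for the variance of the energy
 (notation introduced here for illustration):

```latex
\beq
 C \equiv \frac{\partial \left< E \right>}{\partial T}
   = \frac{{\rm var}(E)}{T^2} ,
 \hspace{0.3in}
 {\rm var}(E) = \frac{\partial^2 \ln Z}{\partial \beta^2} .
\eeq
```

 So the curves in the two figures should differ only by the factor $1/T^2$,
 apart from numerical error.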


 The pictures  of the free energy in \figref{fig.lts1}   give some insight 
 into how we could predict the transition temperature. We can 
 see how the two phases of the ferromagnetic 
 systems each have simple free energies: 
 a straight sloping line through $F=0$, $T=0$ for the high temperature 
 phase, and a horizontal line for the low temperature phase. (The slope 
 of each line shows what the entropy per spin of that 
 phase is.) The phase transition occurs roughly at the intersection 
 of these lines. So we predict the transition temperature 
 to be linearly related to the ground state energy.  
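
 The intersection argument can be written out explicitly. As a rough sketch,
 assume units in which $k_{\rm B}=1$, write $E_0$ for the ground state energy
 per spin, and take the entropies per spin of the high and low temperature phases
 to be $\ln 2$ and $0$ respectively (the symbols $E_0$ and $T^{*}$ are
 introduced here for illustration only):

```latex
\beq
 F_{\rm high}(T) \simeq - T \ln 2 ,
 \hspace{0.2in}
 F_{\rm low}(T) \simeq E_0
 \hspace{0.2in} \Rightarrow \hspace{0.2in}
 T^{*} \simeq \frac{ - E_0 }{ \ln 2 } .
\eeq
```

 This is only an estimate of where the two straight lines cross; it is not
 expected to reproduce the true transition temperature exactly.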
%

\subsection{Comparison with the Monte Carlo results}
 The agreement between the results of the two experiments 
 seems very good. The two systems simulated (the long thin strip and 
 the periodic square) are not quite identical.
 One could make a more accurate comparison by finding all eigenvalues 
 for the strip of width $W$
 and computing $\sum \lambda^W$ to get the partition function 
 of a $W \times W$ patch. 
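
 To spell out this suggestion: if $\bM$ denotes the transfer matrix of the strip of
 width $W$, with eigenvalues $\lambda_i$, and if we assume periodic boundary
 conditions along the length of the strip as well as across its width, then
 (the symbol $\bM$ is an assumption of this sketch)

```latex
\beq
 Z_{W \times W} = {\rm Trace} \, \bM^{W} = \sum_i \lambda_i^{W} .
\eeq
```

 For a very long strip only the largest eigenvalue matters; for a
 $W \times W$ patch all the eigenvalues contribute.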


% \subsubsection*{Further properties that can be extracted}
%  A wonderful result derived in Yeomans 
% % \cite{yeomans92}
%  Yeomans (1992) is that the inverse correlation
%  length can be obtained from the first two eigenvalues:
% \beq
% 	\xi^{-1} = - \ln \left( \l_1/\l_0 \right) .
% \eeq
% %
% % p.103 critical temp is J/kT_c = 0.22165 for 3d ising model
% %

\section{Exercises}% Problems}
\exercisxB{4}{ex.mcS}{% (Open question)
 What would be the best way to extract the entropy from 
 the Monte Carlo simulations? 
 What would be the best way to obtain the entropy and the 
 heat capacity from the partition function computation?
}
\ExercisxA{3}{ex.isingmemories}{
 An Ising model may be generalized to have a coupling $J_{mn}$ 
 between any spins $m$ and $n$, and the value of $J_{mn}$ 
 could be different for each $m$ and $n$. 
%
 In the special case where all the couplings are positive we know that
 the system has two ground states, the all-up and all-down states.
%
 For a more general setting of $J_{mn}$ it is conceivable that there 
 could be {\em many\/} ground states.

 Imagine that it is required to make a spin system whose local minima
% lowest energy states 
 are a given list of states $\bx_{(1)}, \bx_{(2)}, \ldots,  \bx_{(S)}$.
%
 Can you think of a way of setting $\bJ$ such that the chosen 
 states are  low energy states? You are allowed
 to adjust all the $\{ J_{mn} \}$ to whatever values you wish.
}
 

\dvips
% \subchapter
% \section{Solutions}% to Chapter \protect\ref{ch.ising}'s exercises} % 
% \input{tex/_s9.tex}
\dvipsb{solutions ising}  
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\prechapter{About Chapter}
%\input{tex/_pmc.tex}
\chapter{Exact Monte Carlo Sampling \nonexaminable}
\label{ch.mcexact}
\section{The problem with Monte Carlo methods}
 For  high-dimensional problems, the
 most widely used random  sampling methods
 are Markov chain Monte Carlo methods
 like the Metropolis method, Gibbs sampling, and
 slice sampling.

 The problem with all these methods is this:
 yes, a given algorithm can be guaranteed to
 produce samples from the target density $P(\bx)$
 asymptotically, 
 `once the chain has converged to the equilibrium
 distribution'.
 But if one runs the chain for
 too short a time $T$, then the samples will come
 from some other distribution $P^{(T)}(\bx)$.
 For how long must the Markov chain
 be run before it has `converged'?
 As was mentioned in  \chapterref{ch.mc},
 this question is usually very hard to answer.
% 
% 
 However, the pioneering
 work of \citeasnoun{Propp1996}\index{Propp, Jim G.}\index{Wilson, David B.} allows
 one, for certain chains,
 to answer  this very question;
% is of great importance
% for those who want to know for  how long to run their
 furthermore Propp and Wilson show how to 
% Markov chain Monte Carlo simulation to get a
 obtain `exact' samples
 from the target density.

\section{Exact sampling concepts}
  Propp and Wilson's {\dem{exact sampling method}} (also
 known as `\ind{perfect simulation}'\index{algorithm!exact sampling}\index{algorithm!perfect simulation}
 or `\ind{coupling from the past}')\index{exact sampling}\index{Monte Carlo methods!exact sampling}\index{Monte Carlo methods!perfect simulation}
 depends on three ideas.
\subsection{Coalescence of coupled Markov chains}
 First,\index{coalescence}
% the idea that
 if several Markov chains
 starting from different initial conditions
 share a single random-number generator, then their
 trajectories in state space may
 {\dem\index{Monte Carlo methods!coalescence}\index{coalescence}{coalesce}};
 and, having coalesced, will not separate
 again. If {\em all\/} initial conditions lead to trajectories that
 coalesce into a single trajectory, then we can be sure that
 the Markov chain has `forgotten' its initial condition.
 \Figref{fig.mcexact.1}\mbox{a{\small{-i}}} shows twenty-one Markov chains
 identical to the one described in section \ref{sec.metropolis},
 which  samples from $\{ 0,1,\ldots,20\}$ using the
  Metropolis algorithm
 (\figref{fig.metrop}, \pref{fig.metrop}); each of the
 chains has  a different
 initial condition but they are all driven by a single random number generator;
 the chains coalesce after about 80 steps.
 \Figref{fig.mcexact.1}\mbox{a{\small{-ii}}} shows the same Markov chains
 with a different random number seed; in this case, coalescence
 does not occur until 400 steps have elapsed  (not shown).
 \Figref{fig.mcexact.1}b shows similar Markov chains, each
 of which has identical proposal density to those in
  section \ref{sec.metropolis} and \figref{fig.mcexact.1}a;
% the difference between   figures \ref{fig.mcexact.1}a and b
% is
 but in \figref{fig.mcexact.1}b, the proposed move at each step,
 `left' or `right', is obtained in the same way by all the chains
 at any timestep, independent of the current state.
 This  coupling of the chains changes the statistics of coalescence.
 Because two neighbouring paths only merge when a rejection occurs,
 and rejections only occur at the walls (for this particular
 Markov chain), coalescence will  occur only when the chains
 are all in the leftmost state or all in the rightmost state.
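
 The coupling of \figref{fig.mcexact.1}b is easy to simulate. The following
 is a minimal sketch, assuming a uniform target over $\{0,1,\ldots,20\}$, a
 proposal of one step left or right with equal probability, and rejection of
 any move that would leave the range; the function names are hypothetical.

```python
import random

N_STATES = 21  # states 0..20, as in the text


def step(x, move):
    """Metropolis step for a uniform target: reject moves that leave 0..20."""
    y = x + move
    return y if 0 <= y < N_STATES else x


def forward_coalescence(rng, max_steps=100_000):
    """Run all 21 chains forward with a shared proposal until they merge.

    Returns (number of steps taken, common state), or None if max_steps
    is reached without coalescence.
    """
    states = set(range(N_STATES))       # one chain per initial condition
    for t in range(1, max_steps + 1):
        move = rng.choice((-1, 1))      # the same proposal for every chain
        states = {step(x, move) for x in states}
        if len(states) == 1:
            return t, states.pop()
    return None
```

 In this sketch the state at the moment of coalescence is always 0 or 20,
 in agreement with the argument above, and at least 20 steps are needed
 because the set of occupied states can shrink by at most one per step.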

\newcommand{\exactforw}[1]{\hspace*{-7mm}\psfig{figure=metrop/exact/run#1,height=7.5in,width=1.2in}}
\newcommand{\exactback}[1]{\hspace*{-7mm}\psfig{figure=metrop/exact/back#1,height=7.5in,width=1.1in}\hspace*{-3mm}}
\begin{figure}
\figuremargin{
\footnotesize
\begin{tabular}{cccccccc}
%\exactforw{3/x.vn.1.ps}&
\exactforw{4/x.vn.1.ps}&
\exactforw{2/x.vn.1.ps}&
&
\exactforw{2/x.v.1.ps}&
%\exactforw{3/x.v.1.ps}&
\exactforw{4/x.v.1.ps}&
\\
% ``t'' means ternary moves are made. ``v'' means vanilla  simple
% dependence on random number generator.
% ``n'' means ``not locked to other states''
% the non-vanilla runs are a stupid idea it turns out.
%\psfig{figure=metrop/exact/run2/x.vn.1.ps,angle=-90,height=5in}
%&
%\psfig{figure=metrop/exact/run2/x.tvn.1.ps,angle=-90,height=5in}
%&
%\psfig{figure=metrop/exact/run2/x.v.1.ps,angle=-90,height=5in}
%&
%\psfig{figure=metrop/exact/run2/x.tv.1.ps,angle=-90,height=5in}
%&
%\psfig{figure=metrop/exact/run2/x.1.ps,angle=-90,height=5in}
%&
%\psfig{figure=metrop/exact/run2/x.t.1.ps,angle=-90,height=5in}
%\\
%(a) &(b) &(c) &(d)& (e) & (f) \\
{\footnotesize{(i)}} & 
{\footnotesize{(ii)}} & 
 &
{\footnotesize{(i)}} & 
{\footnotesize{(ii)}} \\ 
\multicolumn{2}{c}{\footnotesize(a)} & \hspace{0.3in} &
\multicolumn{2}{c}{\footnotesize(b)} \\
\end{tabular}
}{
\caption[a]{Coalescence, the first idea behind the
 exact sampling method.
 Time runs from bottom to top.
 In the leftmost panel, coalescence occurred
 within 100 steps.
 Different coalescence properties are obtained
 depending on the way each state uses the random numbers
 it is supplied with.
% In the first and third panels shown, coalescence has occurred
% within 250 steps
 (a) Two runs of
% examples of coalescence for
 a Metropolis simulator in which the random bits that determine
 the proposed step
 depend on the current state; a different random number seed
 was used in each case.
 (b) In this simulator the random proposal (`left' or `right') is the same
 for all states.
 In each panel, one of the paths, the one starting at location $x=8$,
 has been highlighted.
}
\label{fig.mcexact.1}
}% end fig
\end{figure}


%\begin{figure}
%\figuremargin{
%\footnotesize
%\begin{tabular}{cccccccc}
%\exactforw{2/x.vn.L.ps}&
%\exactforw{3/x.vn.L.ps}&
%\exactforw{4/x.vn.L.ps}&
%\hspace{0.2in}&
%\exactforw{2/x.v.L.ps}&
%\exactforw{3/x.v.L.ps}&
%\exactforw{4/x.v.L.ps}&
%\\
% & (a)  & & & &(b) \\
%%(a) &(b) &(c) & (d) & (e) & (f)  \\
%\end{tabular}
%}{
%\caption[a]{Longer time-histories of the coalescences.
%}
%\label{fig.mcexact.L}
%}% end fig
%\end{figure}

\subsection{Coupling from the past}
% or {Simulation from the past}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%5 next paragraph
 How can we use the coalescence property to find an exact
 sample from the equilibrium distribution of the chain?
 The state of the system at the moment when complete coalescence
 occurs is not a valid sample from the equilibrium distribution;
 for example in \figref{fig.mcexact.1}b,
 final coalescence always occurs when the state
 is against one of the two walls, because trajectories only
 merge at the walls. So sampling forward in time until coalescence
 occurs is not a valid method.

 The second key idea of exact sampling is that we can obtain exact samples
 by  sampling {\em from a time $T_0$ in the past, up to the present}.
 If coalescence has occurred, the present sample is an unbiased
 sample from the equilibrium distribution; if not, we restart
 the simulation from a time $T_0$ further into the past, {\em reusing the
 same random numbers}. The simulation is repeated at a sequence of ever
 more distant times $T_0$, with a doubling of $T_0$ from
 one run to the next being a convenient
% and near-optimal
 choice. When coalescence occurs at a time before `the present',
 we can record $x(0)$ as an {\dem exact sample\/} from the equilibrium
 distribution of the Markov chain.

 \Figref{fig.mcexact.b} shows two exact samples produced
 in this way. In the leftmost panel of \figref{fig.mcexact.b}a,
 we start twenty-one chains in all possible initial conditions
 at $T_0 = -50$ and run them forward in time.
 Coalescence does not occur. We restart the simulation
 from  all possible initial conditions
 at $T_0 = -100$, and reset the random number generator
 in such a way that the random numbers generated
 at each time $t$ (in particular, from $t=-50$ to $t=0$)
 will be identical to what they were in the first run. Notice that
 the  trajectories produced from  $t=-50$ to $t=0$  by
 these runs that started from  $T_0 = -100$ are identical to
 a {\em subset\/} of  the trajectories in the first
 simulation with $T_0=-50$.
 Coalescence still does not occur, so we double $T_0$ again
 to $T_0= -200$.
 This time, all the trajectories coalesce and we obtain
 an exact sample, shown by the arrow.
 If we pick an earlier time such as $T_0=-500$, all the trajectories
 must still end in the same point at $t=0$, since every trajectory
 must pass through {\em{some}\/} state at $t=-200$, and {\em{all}\/} those
 states lead to the same final point.
 So if we ran the Markov chain for an infinite time in the
 past, 
from  any initial condition, it would end in the same state.
 \Figref{fig.mcexact.b}b shows an exact sample produced in
 the same way with the Markov chains of
  \figref{fig.mcexact.1}b.	
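
 The procedure just described can be sketched in a few lines. This is an
 illustration under assumptions, not the code used for the figures: it uses
 the coupling of \figref{fig.mcexact.1}b, starts at $T_0=-50$, doubles $T_0$
 on each failure, and stores the random proposals in a list (the hypothetical
 name \verb|moves|) so that they can be reused.

```python
import random

N_STATES = 21  # states 0..20


def step(x, move):
    """Metropolis step for a uniform target: reject moves that leave 0..20."""
    y = x + move
    return y if 0 <= y < N_STATES else x


def exact_sample(rng, t0=50):
    """Coupling from the past, tracking every initial state.

    moves[k] is the proposal used at time t = -(k+1). It is drawn once and
    reused whenever a later run passes through that time again.
    Returns (x(0), the magnitude of the starting time that sufficed).
    """
    moves = []
    while True:
        while len(moves) < t0:           # fresh randomness only for the new, earlier times
            moves.append(rng.choice((-1, 1)))
        states = set(range(N_STATES))    # all initial conditions at time -t0
        for t in range(t0, 0, -1):       # times -t0, ..., -1
            states = {step(x, moves[t - 1]) for x in states}
        if len(states) == 1:
            return states.pop(), t0
        t0 *= 2                          # go further into the past and try again
```

 Unlike the state at the moment of forward coalescence, the values returned
 by this sketch are not confined to the two walls.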

 This method, called {\dem{coupling from the past}},
 is important because it allows us to obtain
 exact samples from the equilibrium distribution; but,
 as described here, it is of little practical use,
 since we are obliged to simulate chains starting
 in {\em all\/} initial states. In the examples shown,
 there are only twenty-one states, but in any realistic
 sampling problem there will be an utterly enormous number
 of states -- think of the $2^{1000}$ states of a
 system of 1000 binary spins, for example. The whole
 point of introducing Monte Carlo methods was to try to avoid
 having to visit all the states of such a system!

\begin{figure}
\fullwidthfigureright{
\footnotesize
\begin{tabular}{cccccccccc}
%\exactback{1/x.vn.20.L.ps}
%&
\exactback{1/x.vn.50.L.ps}
&
\exactback{1/x.vn.100.L.ps}
&
\exactback{1/x.vn.200.L.ps} % converges at 200
&
% \exactback{1/x.vn.500.L.ps} something wrong with this
&
\hspace{0.3in}
&
%\exactback{1/x.v.20.L.ps}
%&		
\exactback{1/x.v.50.L.ps}
&		
\exactback{1/x.v.100.L.ps}
&		
\exactback{1/x.v.200.L.ps} % converges at 200
%&		
%\exactback{1/x.v.500.L.ps}
\\
%{\footnotesize{(i)}} & 
%{\footnotesize{(ii)}} & 
%{\footnotesize{(iii)}} & 
%{\footnotesize{(iv)}}
%{\footnotesize{$T_0=-20$}} & 
{\footnotesize{$T_0=-50$}} & 
{\footnotesize{$T_0=-100$}} & 
{\footnotesize{$T_0=-200$}} & 
%{\footnotesize{$T_0=-500$}}
&
&
{\footnotesize{$T_0=-50$}} & 
{\footnotesize{$T_0=-100$}} & 
{\footnotesize{$T_0=-200$}} & 
%{\footnotesize{$T_0=-500$}}  
\\
\multicolumn{4}{c}{\footnotesize(a)} & &
\multicolumn{3}{c}{\footnotesize(b)} \\
\end{tabular}
}{
\caption[a]{\mbox{`Coupling from the past', the second idea behind the
 exact sampling method.} 
}
\label{fig.mcexact.b}
}% end fig
\end{figure}

\begin{figure}
\fullwidthfigureright{
\footnotesize
\begin{center}
\begin{tabular}{ccccccccc}
%\exactback{1/x.ve.20.ps}
%&		
\exactback{1/x.ve.50.ps}&		
\exactback{1/x.ve.100.ps}&		
\exactback{1/x.ve.200.ps} % converges at 200
%&		
%\exactback{1/x.ve.500.ps}
&
\hspace{0.3in}
&
\exactback{2/x.ve.all.ps} & \hspace{0.3in}
 &
\exactback{3/x.ve.all.ps}
\\
{\footnotesize{$T_0=-50$}} & 
{\footnotesize{$T_0=-100$}} & 
{\footnotesize{$T_0=-200$}} & 
%{\footnotesize{$T_0=-500$}} &
&
{\footnotesize{$T_0=-50$}} & &
{\footnotesize{$T_0=-1000$}}  
\\
\multicolumn{3}{c}{\footnotesize(a)}& &
{\footnotesize(b)}& &
{\footnotesize(c)}\\
\end{tabular}
\end{center}
}{
\caption[a]{(a) Ordering of states, the third idea behind the
 exact sampling method. The trajectories shown here are
 the left-most and right-most trajectories of
 \protect\figref{fig.mcexact.b}b.
%
 In order to establish what the state at time zero is,
 we only need to run simulations from  $T_0=-50$,  $T_0=-100$, and $T_0=-200$, after which
 point coalescence  occurs.

 (b,c) Two more exact samples from the target density, generated by this method
 using different random number seeds.
 The initial times required were
 $T_0=-50$ and $T_0=-1000$, respectively.
}
\label{fig.mcexact.c}
}% end fig
\end{figure}

\subsection{Monotonicity}
 We have established that we can obtain valid samples by simulating
 forward   from times in the past, starting in {\em all\/}
 possible states at those times. The third trick of
 Propp and Wilson, which makes the exact sampling method useful in practice,
 is the idea that, for some Markov chains, it may be possible to
 detect coalescence of all trajectories {\em without simulating
 all those trajectories}. This property holds, for
 example, in the chain of \figref{fig.mcexact.1}b,
 which has the property that {\em two trajectories never cross}.
 So if we simply track the two trajectories starting from the leftmost
 and  rightmost states, we will know that coalescence of
 {\em all\/} trajectories has occurred when {\em those two\/}
 trajectories coalesce.
 \Figref{fig.mcexact.c}a illustrates this idea by
 showing only the left-most and right-most trajectories
 of \figref{fig.mcexact.b}b.
 \Figref{fig.mcexact.c}(b,c) shows two more
 exact samples from the same equilibrium distribution
 generated by running the `coupling from the past' method
 starting from the two end-states alone.
 In (b), two runs coalesced starting from $T_0=-50$;
 in (c), it was necessary to try times up to $T_0=-1000$ to achieve
 coalescence.
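
 For the toy chain, the saving can be sketched as follows. The code makes the
 same assumptions as the coupling of \figref{fig.mcexact.1}b (uniform target
 over $\{0,\ldots,20\}$, a shared left/right proposal, rejection at the walls)
 and follows only the two extreme trajectories; the function names are hypothetical.

```python
import random

N_STATES = 21  # states 0..20


def step(x, move):
    """Metropolis step for a uniform target: reject moves that leave 0..20."""
    y = x + move
    return y if 0 <= y < N_STATES else x


def exact_sample_monotone(rng, t0=50):
    """Coupling from the past using only the leftmost and rightmost chains.

    Because trajectories of this chain never cross, every other trajectory
    is sandwiched between these two, so their meeting implies that all
    trajectories have met.
    """
    moves = []                            # moves[k] is used at time t = -(k+1)
    while True:
        while len(moves) < t0:
            moves.append(rng.choice((-1, 1)))
        low, high = 0, N_STATES - 1
        for t in range(t0, 0, -1):
            low = step(low, moves[t - 1])
            high = step(high, moves[t - 1])
        if low == high:
            return low, t0
        t0 *= 2
```

 The work per time step is now that of two chains rather than twenty-one;
 in a system of 1000 spins the corresponding saving would be the difference between
 two chains and $2^{1000}$.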


% could reference the paper by Holmes here
% except I am not convinced it is genuinely  useful.
% I put it in an exercise below.
\section{Exact sampling from interesting distributions}
 In the toy problem we studied, the states could be put in a one-dimensional
 order such that no two trajectories crossed. The states of
 many interesting state spaces can also be put into
 a {\dem\ind{partial order}\/} and coupled Markov chains can be found that
 respect this partial order. [An example of a partial order
 on the four possible states of two spins is this:
 $(+,+) > (+,-) > (-,-)$;
 and 
 $(+,+) > (-,+) > (-,-)$;
 and the states $(+,-)$ and $(-,+)$ are not ordered.]
 For such systems, we can show that coalescence has occurred merely by
 verifying that coalescence has occurred for  all the histories
 whose initial states were  `maximal' and `minimal' states of the
 state space.

\marginalg{
\begin{framedalgorithm}
\begin{tabular}{@{}l}
{\sf Compute} $a_i := \sum_j J_{ij} x_j$\\
{\sf Draw} $u$ {\sf from} Uniform$(0,1)$ \\
{\sf If} $u<1/(1+e^{-2 a_i})$ \\
\ \ \  $x_i := +1$\\
{\sf Else} \\
\ \ \  $x_i := -1$\\
\end{tabular}
\end{framedalgorithm}
\caption[a]{Gibbs sampling coupling method.
 The Markov chains
 are coupled together by having all chains update the same spin $i$
 at each time step and having 
 all chains share a common sequence of random numbers $u$.\medskip

}
\label{alg.coupling}
}
 As an example, consider the\index{Monte Carlo methods!Gibbs sampling}
 Gibbs sampling method
 applied to
% a set of spins
 a ferromagnetic Ising spin system, with the partial ordering of
 states being defined thus: state $\bx$ is `greater than or equal to' state $\by$
 if $x_i \geq y_i$ for all spins $i$. The maximal and minimal states
 are the all-up and all-down states.
 The Markov chains are coupled together as shown in \algref{alg.coupling}.
% NOT by the number of up-spins in the state.
 \citeasnoun{Propp1996} show that exact samples
 can be generated for this system, although the time to find
 exact samples  is large if the Ising model is below its critical
 temperature, since the Gibbs sampling method itself
 is slowly mixing under these conditions.
 Propp and Wilson have improved on this  method\index{Gibbs sampling}
 for the Ising model
 by using a Markov chain called the single-bond heat bath algorithm
 to sample from a related  model called the \ind{random cluster model};
 they show that
 exact samples 
 from the  random cluster model can  be obtained rapidly
 and can  be converted
 into exact samples from the Ising model. Their ground-breaking
 paper includes an exact sample from a 16-million-spin Ising model
 at its critical temperature. A sample for a smaller Ising model
 is shown in \figref{fig.ising.exact}.
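
 The coupled Gibbs sampler of \algref{alg.coupling} can be combined with
 coupling from the past as in the following sketch. It is an illustration only,
 and makes several assumptions that are not in the text: a small $L \times L$
 square lattice with periodic boundaries ($L \geq 3$), a single positive
 coupling \verb|J| into which the inverse temperature has been absorbed,
 and a spin $i$ chosen uniformly at random at each time step.

```python
import math
import random


def neighbours(i, L):
    """Indices of the four neighbours of site i on an L x L periodic lattice."""
    r, c = divmod(i, L)
    return [((r + 1) % L) * L + c, ((r - 1) % L) * L + c,
            r * L + (c + 1) % L, r * L + (c - 1) % L]


def gibbs_update(x, i, u, J, L):
    """Coupled Gibbs update of spin i, driven by the shared random number u."""
    a = J * sum(x[j] for j in neighbours(i, L))
    x[i] = +1 if u < 1.0 / (1.0 + math.exp(-2.0 * a)) else -1


def exact_ising_sample(L, J, rng, t0=16):
    """Exact sample from a small ferromagnet, tracking all-up and all-down chains."""
    n = L * L
    randomness = []                      # randomness[k] = (site, u) used at time -(k+1)
    while True:
        while len(randomness) < t0:
            randomness.append((rng.randrange(n), rng.random()))
        top, bottom = [+1] * n, [-1] * n  # maximal and minimal states at time -t0
        for t in range(t0, 0, -1):
            i, u = randomness[t - 1]
            gibbs_update(top, i, u, J, L)
            gibbs_update(bottom, i, u, J, L)
        if top == bottom:
            return top
        t0 *= 2
```

 For example, \verb|exact_ising_sample(4, 0.3, random.Random(0))| returns a
 list of 16 spins. As the text warns, the running time of this simple scheme
 grows rapidly when \verb|J| is increased beyond the critical coupling.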
\marginfig{
\begin{center}
\psfig{figure=images/q2.ps,width=1.94in}
\end{center}
\caption[a]{An exact sample from the Ising model  at its critical temperature,
 produced by 
% David Bruce Wilson.
 \mbox{D.B.~Wilson}.
 Such samples can be produced within seconds
 on an ordinary computer by
 exact sampling.
}
\label{fig.ising.exact}
}

\subsection{A generalization of the exact sampling method for `non-attractive' distributions}
 The method of Propp and Wilson for the Ising model, sketched above,
 can only be applied to probability distributions that are, as they
 call them,  `attractive'. Rather than define this term, let's say what it
 means, for practical purposes: the method can be applied to spin systems
 in which all the couplings are positive (\eg, the ferromagnet), and
 to a few special  spin systems with negative couplings (\eg, as we already
 observed in \chref{ch.ising}, the rectangular ferromagnet and antiferromagnet
 are equivalent); but it cannot be applied to general spin systems in which
 some couplings are negative, because in such systems the trajectories
 followed by the all-up and all-down states are not guaranteed to be
 upper and lower bounds for the set of all trajectories. 
% To put it another way, the  Markov chain  does not  have the non-crossing property.
 Fortunately, however, we do not need to be so strict.
% Radford Neal\index{Neal, Radford}
% has pointed out that i
 It is possible to re-express the
 \index{Propp, Jim G.}{Propp}
 and \index{Wilson, David B.}{Wilson} algorithm in a way that generalizes to the case of
 spin systems with negative couplings.
% summary state 
 The idea of
 the {\dem\ind{\envelope}} version of  exact sampling 
 is still that we  keep track of bounds\index{{\tt{?}}}
% an `upper bound' and `lower bound'
 on the set of all trajectories, and detect when
 these  bounds are equal, so as to find exact samples.
% Propp and Wilson
 But the  bounds will not themselves be actual trajectories,
 and they  will not necessarily be {\em tight\/} bounds.
% This is called .

% simon said
% Is it as if we are
%using the '?' states to represent multiple possible states with a single
%vector and we only fill in the '?'s with a 0 or 1 when the 'alternative
%state' chains would've coalesced. So when we start off with all '?' we ARE
%effectively considering all possible start configurations - it just gives
%us a very economical way of keeping track and monitoring our progress.
% I think this is already said.
%
 Instead of simulating two trajectories, each of which moves in  a state
 space $\{ -1, +1 \}^N$, we simulate one {\dem trajectory envelope\/} in an
 augmented state space $\{ -1, +1 , {\tt ?} \}^N$, where the symbol
 {\tt ?} denotes `either $-1$ or $+1$'.
 We call the state of this augmented system the `\envelope'.
% envelope'
 An example
 \envelope\ of a six-spin system is {\tt ++-?+?}. This \envelope\ is
 shorthand for the set of states
\begin{center} {\tt ++-+++},  {\tt ++-++-},  {\tt ++--++},  {\tt ++--+-} .
\end{center} 
 The update rule at each step of the Markov chain takes a single spin,
 enumerates all possible states of the neighbouring spins that are compatible with
 the current \envelope, and, for each of these local scenarios,
 computes the new value ({\tt +} or {\tt -}) of the spin
 using Gibbs sampling (coupled to a random number $u$ as in \algref{alg.coupling}).
 If all these new values agree, then the new value of the updated spin in the
 \envelope\ is set to the unanimous value  ({\tt +} or {\tt -}).
 Otherwise, the new value of the spin in the \envelope\ is `{\tt ?}'.
% This update rule can
 The initial condition, at time $T_0$, is given by setting all the spins in
 the \envelope\ to `{\tt ?}', which corresponds to considering
 all possible start configurations.
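
 The update rule can be sketched as follows. Rather than enumerating every
 local scenario, this sketch uses a shortcut that gives the same answer:
 the Gibbs rule of \algref{alg.coupling} is monotonic in the activation
 $a_i$, so it is enough to evaluate it at the smallest and largest
 activations compatible with the current state. The representation (a Python list
 whose entries are \verb|+1|, \verb|-1| or the string \verb|"?"|) and the
 function name are assumptions made for illustration.

```python
import math

UNKNOWN = "?"


def summary_state_update(summary, i, u, J):
    """Update spin i of the summary state in place, using shared random number u.

    J is a matrix of couplings of either sign; J[i][j] == 0 means no coupling.
    """
    a_low = a_high = 0.0
    for j, x_j in enumerate(summary):
        if j == i:
            continue
        if x_j == UNKNOWN:
            # An unknown neighbour can push the activation either way.
            a_low -= abs(J[i][j])
            a_high += abs(J[i][j])
        else:
            a_low += J[i][j] * x_j
            a_high += J[i][j] * x_j

    def gibbs_value(a):
        return +1 if u < 1.0 / (1.0 + math.exp(-2.0 * a)) else -1

    low, high = gibbs_value(a_low), gibbs_value(a_high)
    # Unanimous across all compatible scenarios exactly when the extremes agree.
    summary[i] = low if low == high else UNKNOWN
```

 A full sampler would start from the all-\verb|"?"| state at time $T_0$,
 apply this update with stored random numbers up to time zero, and stop when
 no \verb|"?"| remains.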

 In the case of a spin system with positive couplings, 
 this \envelope\ simulation will be identical to the simulation of
 the uppermost and lowermost states, in the style of
% {\em \`a la\/}
 Propp and Wilson, with coalescence occurring when all the `{\tt ?}' symbols
 have disappeared.
 The \envelope\ method can be applied to general spin systems with any couplings.
 The only shortcoming of this method is that the envelope may describe
 an unnecessarily
 large set of states, so there is no guarantee that the
 \envelope\ algorithm will  converge;
 the time for coalescence to be {\em detected\/} may be considerably larger
 than the actual time taken for the underlying Markov chain to coalesce.

 The \envelope\ scheme has been applied to exact sampling in belief networks
 by \citeasnoun{NealHarvey2000},\index{Neal, Radford} and to
 the triangular antiferromagnetic Ising model
 by \citeasnoun{PattersonChildsMacKay00}.
% Mike Harvey and Radford Neal.
 Summary state methods were first introduced by
 \citeasnoun{Huber1998}; they  also go by the names
 \ind{sandwiching method}s and \ind{bounding chain}s.
% The \envelope\ method was first introduced by
% \citeasnoun{Huber1998}, who called it a \index{bounding chain}.
% Should I also cite H?ggstr?m-Nelander. ?

\begin{figure}
\figuremargin{\mbox{\psfig{figure=figs/hexagonbig.ps,width=3.95in}}}{
\caption[a]{A perfectly random \ind{tiling} of a hexagon by lozenges,
 provided by J.G.~Propp and D.B.~Wilson.}
}
\end{figure}
\section*{Further reading}
 For further reading, impressive pictures of exact samples
 from other distributions, and generalizations of the
 exact sampling method, browse the perfectly-random sampling
 website.\footnote{\tt{http://www.dbwilson.com/exact/}}
% http://dimacs.rutgers.edu/$\sim$dbwilson/exact/}}
% http://dimacs.rutgers.edu/~dbwilson/exact/
% Exact sampling 


 For beautiful exact-sampling demonstrations running
 live in your web-browser, see Jim Propp's website.\footnote{
{\tt{http://www.math.wisc.edu/$\sim$propp/tiling/www/applets/}}}
%http://www.math.wisc.edu/~propp/tiling/www/applets/

% I hope CUP printer will render this nicely.
%\marginfig{\mbox{\psfig{figure=figs/hexagonbig.ps,width=54mm}}
%\caption[a]{A perfectly random tiling of a hexagon with lozenges,
% provided by J.G.~Propp and D.B.~Wilson.}
%}


\subsection{Other uses for coupling}
 The idea of coupling together Markov chains by having
 them share a random number generator has other
 applications beyond exact sampling.
 \citeasnoun{PintoNeal_01} have shown that
 the accuracy of estimates obtained from
 a Markov chain Monte Carlo simulation (the second problem discussed
 in  \sectionref{sec.mcproblemsdefined}, \pref{sec.mcproblemsdefined}), using the estimator
% chapter \ref{ch.mc},
% \cf\  \eqref{eq.mc.est})
\beq
 \hat{\Phi}_P  \equiv \frac{1}{T} \sum_{t} \phi( \bx^{(t)} ) ,
\label{eq.mc.est.again}
\eeq
 can be improved by coupling the chain of interest, which converges
 to  $P$, 
 to a second chain, which generates samples
 from a second, simpler distribution, $Q$.
 The coupling must be set up in such a way that
 the states of the two chains are strongly correlated.
 The idea is that we first estimate the expectations of a function
 of interest, $\phi$,
 under $P$ and under $Q$ in the normal way (\ref{eq.mc.est.again})
 and compare the estimate under $Q$, $\hat{\Phi}_Q$, with the true value of
  the  expectation under $Q$, ${\Phi_Q}$ which we assume
 can be evaluated exactly.
% because of the simplicity of $Q$. If
 If $\hat{\Phi}_Q$ is an overestimate then it is likely
 that  $\hat{\Phi}_P$ will be an overestimate too.
 The difference $(\hat{\Phi}_Q-{\Phi_Q})$  can thus be used to
 correct  $\hat{\Phi}_P$.\index{Neal, Radford}
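
 The simplest correction of this kind subtracts the observed error under $Q$
 directly. The following is a sketch of the idea only, with a unit
 coefficient assumed and the symbol $\hat{\Phi}_P^{\rm corr}$ introduced here for illustration;
 it is not necessarily the estimator that Pinto and Neal recommend.

```latex
\beq
 \hat{\Phi}_P^{\rm corr} \equiv \hat{\Phi}_P - \left( \hat{\Phi}_Q - \Phi_Q \right) .
\eeq
```

 The correction helps only to the extent that the errors of the two
 estimates are positively correlated, which is why the two chains must be
 strongly coupled.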
% For details of the correction method, see  
% Pinto and Neal's paper.

\section{Exercises}
\exercissxB{2}{ex.mcexact}{
 Is there any relationship between the probability
 distribution of the time taken for all trajectories
 to coalesce, and the equilibration time of a Markov chain?
 Prove that there is a relationship, or find a single chain
 that can be realized in two different ways that have different
 coalescence times.
}
\exercisxB{2}{ex.mcexact.fred}{
 Imagine that Fred  ignores
 the requirement that the random bits used at some time $t$, in every run
 from increasingly distant times $T_0$, must be identical,
 and makes a coupled-Markov-chain simulator that uses
 fresh random numbers every time $T_0$ is changed.
 Describe what happens if Fred applies his method to the Markov
 chain that is intended to sample from the uniform distribution over
 the states 0, 1, and 2, using the Metropolis method, driven
 by a random bit source as in \figref{fig.mcexact.1}b.
}
\exercisxC{5}{ex.modelexact}{
 Investigate the application of perfect sampling to
 linear regression in 
 \citeasnoun{holmes98perfect} or \citeasnoun{holmes2002perfect}
 and try to generalize it.
}
\exercisxC{3}{ex.coalescencegeneral}{
 The concept of coalescence has many applications.
 Some surnames are more frequent than
 others, and some die out altogether. Make a model of this
 process; how long will it take until everyone
 has the same surname?

 Similarly, 
 variability  in any particular portion of the human genome
 (which forms the basis of \ind{forensic} \ind{DNA} fingerprinting)
 is inherited like a surname.   A DNA fingerprint is like a 
 string of surnames. 
 Should the fact that these surnames are subject
 to coalescences, so that some surnames are by chance more prevalent
 than others, affect the way in which DNA fingerprint
 evidence is used in court?
}
% http://www.biology.washington.edu/fingerprint/dnaintro.html
%  Variable Number Tandem Repeats or VNTR.
% http://www.college.ucla.edu/webproject/micro7/lecturenotes/finished/Fingerprinting.html
% This method is called Restriction Fragment Length Polymorphism and results in an RFLP Fingerprint.

\exercisxB{2}{ex.fairstrawsb}{
 How can you use a coin to create a random ranking of 3 people?
 Construct a solution that uses exact sampling. For example,
 you could apply exact sampling to a Markov chain in which the coin
 is repeatedly used alternately  to decide whether to switch first and
 second, then whether to switch second and third.
}% my solution: arithmetic coding. was in _e6a.tex
\exercisxC{5}{ex.exactZ}{
 Finding the partition function $Z$ of a
 probability distribution is a difficult problem.
 Many Markov chain Monte Carlo methods
 produce valid samples  from a distribution without
 ever finding out what $Z$ is.


 Is there any probability distribution and Markov chain
 such that 
 either the time taken to produce a perfect sample
 or the number of random  bits used to create a perfect
 sample is related  to the value of $Z$?
 Are there some situations in which the time to coalescence conveys
 information about $Z$?
}
\section{Solutions}
\soln{ex.mcexact}{
 It is perhaps surprising that there is no direct relationship
 between the equilibration time and the time to coalescence.
%
 We can prove this using the example of
% A simple example that proves this is the
% case of
 the uniform distribution over the integers $\A = \{ 0,1,2, \ldots , 20 \}$.
 A Markov chain that converges to this distribution in exactly
 one iteration is the chain for which the probability of
 state $s_{t+1}$ given $s_t$ is the uniform distribution, for all
 $s_t$.
 Such a chain can be coupled to a random number generator
 in two ways: (a) we could draw a random integer $u \in \A$,
 and set $s_{t+1}$ equal to $u$ regardless of
 $s_t$; or
 (b) we could draw a random integer $u \in \A$,
 and set $s_{t+1}$ equal to $(s_{t}+u) \bmod 21$. Method (b)
 would produce a cohort of trajectories locked together, similar to
 the trajectories in \figref{fig.mcexact.1}, except that
 no coalescence ever occurs.
 Thus, while the equilibration times of  methods (a) and (b)
 are both one, the coalescence times are respectively one and
 infinity.

 It seems plausible on the other hand that coalescence time
 provides some sort of upper bound on
 equilibration time.
}
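
 The two couplings in this solution can be compared directly. The following
 sketch counts how many distinct states remain when all twenty-one initial
 conditions are driven by the same random integers; the function names are
 hypothetical.

```python
import random

N_STATES = 21  # the integers 0..20


def method_a(s, u):
    """Coupling (a): jump to u, whatever the current state."""
    return u


def method_b(s, u):
    """Coupling (b): shift the current state by u, modulo 21."""
    return (s + u) % N_STATES


def distinct_after(update, steps, rng):
    """Number of distinct trajectories left after the given number of steps."""
    states = set(range(N_STATES))
    for _ in range(steps):
        u = rng.randrange(N_STATES)      # one random integer shared by all chains
        states = {update(s, u) for s in states}
    return len(states)
```

 Under coupling (a) a single state remains after one step; under coupling (b)
 each update is a permutation of the states, so twenty-one distinct states
 remain however long the chains are run.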




%%%%%%%%%
%
\chapter{Variational Methods}
\label{ch.mft} 
% \chapter{Mean Field Theory}
% \chapter{Variational Methods}
% \label{ch.mft}
% \chapter{Mean Field Theory}
% Another topic which will prove useful to have up our sleeves is 
% mean field theory.
%
%
% included by lb.tex
\label{ch.variational}
 Variational methods\index{variational methods}\index{approximation!variational} 
 are an important technique for the approximation of 
 complicated probability distributions, having\index{approximation!of complex distribution}
 applications in statistical physics, 
% Bayesian inference
 data modelling  and neural networks.
% ,  including the decoding of error correcting codes. 
% Mean field theory is relevant to understanding 
% neural networks and to the development of ways of implementing 
% Bayesian inference and decoding error correcting codes. 
\section{Variational free energy minimization}
 One method for approximating a 
 complex distribution in a physical system is {\dem \ind{mean field theory}}. 
 Mean field theory is  a special case of a general 
 {\dbf \ind{variational free energy}} 
 approach of Feynman\nocite{Feynman:SM}\index{Feynman, Richard}
 and Bogoliubov which we will now study. 
 The key piece of mathematics needed to understand this method 
 is Gibbs' inequality,\marginpar{\small\raggedright
 Gibbs' inequality first appeared in  equation (\eqKL); see also \exrelent.
}
% -- equation (\eqKL), \exrelent --
 which 
 we repeat here.
\begin{description}
\item[The relative entropy]
        between two probability distributions $Q(x)$ and $P(x)$ 
        that are defined over the same alphabet $\A_X$ is\index{relative entropy}
\beq
        D_{\rm KL}(Q||P) = \sum_x Q(x) \log \frac{Q(x)}{P(x)} .
\label{eq.KL.again}
\eeq
 The relative entropy satisfies $D_{\rm KL}(Q||P) \geq 0$ (Gibbs'
 inequality) with equality only if $Q \eq P$.   In general
 $D_{\rm KL}(Q||P) \neq D_{\rm KL}(P||Q)$.

 In this chapter we will replace the $\log$ by $\ln$,
 and measure the divergence in nats.
\end{description} 
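 As a small numerical illustration (the two distributions are
 chosen here purely for illustration), take a binary alphabet
 with $Q = (1/2, 1/2)$ and $P = (1/4, 3/4)$. Then, in nats,
\beqan
	D_{\rm KL}(Q||P) &=& \frac{1}{2} \ln \frac{1/2}{1/4} + \frac{1}{2} \ln \frac{1/2}{3/4}
	= \frac{1}{2} \ln \frac{4}{3} \simeq 0.144 , \\
	D_{\rm KL}(P||Q) &=& \frac{1}{4} \ln \frac{1/4}{1/2} + \frac{3}{4} \ln \frac{3/4}{1/2}
	\simeq 0.131 .
\eeqan
 Both divergences are positive, as Gibbs' inequality requires,
 and they are not equal to each other.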
\subsection{Probability distributions in statistical physics}
% 
%  Refer to example \ref{ex.rel.ent} for the essential inequality.
%  Are the marginals the best approximation? No.
% 
% \subsection{Why mean field theory in statistical physics?}
 In statistical physics one often encounters probability 
 distributions of the form
\beq
        P( \bx \given  \beta, \bJ) = \frac{1}{Z(\beta,\bJ)}
         \exp \! \left[ - \beta E( \bx ; \bJ  ) \right]  , 
\label{eq.ising.p.again}
\eeq
 where for example the state vector is $\bx \in \{-1,+1\}^N$,  and $E(\bx;\bJ)$ is some energy function such as 
\beq
                E(\bx;\bJ) = - 
% \left[ 
 \frac{1}{2}
                                \sum_{m,n} J_{mn} x_m x_n - \sum_n h_n x_n.
% \right]  .
\label{eq.ising.e.again}
\eeq
 The \ind{partition function} (normalizing constant) is
\beq
        Z(\beta,\bJ) \equiv \sum_{\bx} \exp \!
                \left[ - \beta E( \bx ; \bJ  ) \right]   .
\label{eq.ising.z.again}
\eeq
% 
 The probability distribution of 
 \eqref{eq.ising.p.again} is complex. Not unbearably complex -- 
 we can, after all, evaluate $E(\bx;\bJ)$  for any particular $\bx$  in a time 
 polynomial in the number of spins. 
 But evaluating the normalizing constant $Z(\beta,\bJ)$ is difficult,
 as we saw in  \chapterref{ch.mc},
 and describing the properties of the probability distribution 
 is also hard. Knowing the value of $E(\bx;\bJ)$ at a few arbitrary points 
 $\bx$, for example, 
 gives no useful information about what the average properties 
 of the system are.

 An evaluation of $Z(\beta,\bJ)$ would be particularly 
 desirable because from
% the \ind{partition function}
 $Z$ we can derive all the 
 thermodynamic properties of the system. 

% Mean field theory\index{mean field theory}
 {Variational free energy minimization}\index{variational free energy!minimization}\index{free energy!minimization}\index{free energy!variational}
 is a method for {\dbf approximating\/} the complex 
 distribution $P( \bx)$ by a simpler ensemble $Q(\bx ; \btheta)$
 that is parameterized by adjustable parameters $\btheta$. We
 adjust these parameters so as to get $Q$ to best approximate 
 $P$, in some sense. 
 A by-product of this approximation is a lower bound on $Z(\beta,\bJ)$.

% \subsection{Why mean field theory in error correcting codes?}
% removed to leftovers

\subsection{The variational free energy}
 The objective function chosen to measure the
 quality of the approximation is the {\dem\ind{variational free
 energy}}
% \newcommand{\tF}{{\tilde{F}}}
\beq
        \beta \tF(\btheta) =  \sum_{\bx} \: Q(\bx;\btheta) 
                \ln \frac{ Q(\bx;\btheta) }{  \exp  \!
                \left[ - \beta E( \bx ; \bJ  ) \right]  }
                                .
\label{eq.vfe}
\eeq
% The factor of $\beta$ is included on the left-hand side
% to make it 
 This expression can be manipulated into a couple of interesting
forms: first,
\beqan
        \beta \tF(\btheta) &=& \beta \sum_{\bx} \: Q(\bx;\btheta) 
                  E( \bx ; \bJ  ) 
                - \sum_{\bx} \: Q(\bx;\btheta)  \ln\frac{1}
                                { Q(\bx;\btheta) }  \\
        &\equiv& \beta \left<  E( \bx ; \bJ  )  \right>_Q - S_Q ,
\eeqan
 where $\left<  E( \bx ; \bJ  )  \right>_Q$ is the average of the 
 energy function under the distribution $Q(\bx;\btheta)$, and 
 $S_Q$ is the entropy of the  distribution $Q(\bx;\btheta)$
 (we set  $k_{\rm B}$ to one  in the definition of $S$
 so that it is identical to the definition of the entropy $H$ in \partone).

 Second, we can use the definition of $P(\bx  \given  \beta, \bJ)$  
 to write:
\beqan
        \beta \tF(\btheta) &=&  \sum_{\bx} \: Q(\bx;\btheta) 
                \ln \frac{ Q(\bx;\btheta) }{  P(\bx  \given  \beta, \bJ) }
                - \ln {Z(\beta,\bJ)} 
\\
&=& D_{\rm KL}( Q || P ) +  \beta F,
\eeqan
 where $F$ is the  true free energy, defined by
\beq
 \beta F
 \equiv - \ln {Z(\beta,\bJ)},
\eeq
 and $D_{\rm KL}(Q||P)$ is the relative entropy between 
 the approximating distribution $Q(\bx;\btheta)$ and the 
 true distribution  $P(\bx  \given  \beta, \bJ)$.
 Thus by Gibbs' inequality, the variational free energy 
 $\tF(\btheta)$  is bounded below by $F$ and
 only attains this value for $Q(\bx;\btheta) = P(\bx \given   \beta, \bJ)$. 
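
 The step from \eqref{eq.vfe} to this second form can be spelled out
 as follows; it uses only \eqref{eq.ising.p.again} and the
 normalization $\sum_{\bx} Q(\bx;\btheta) = 1$.
 Substituting
 $\exp \! \left[ - \beta E( \bx ; \bJ ) \right] = Z(\beta,\bJ) \, P(\bx \given \beta, \bJ)$,
\beq
	\ln \frac{ Q(\bx;\btheta) }{ \exp \! \left[ - \beta E( \bx ; \bJ ) \right] }
	= \ln \frac{ Q(\bx;\btheta) }{ P(\bx \given \beta, \bJ) } - \ln Z(\beta,\bJ) ,
\eeq
 and averaging both sides under $Q$ leaves the $\ln Z$ term unchanged
 because it does not depend on $\bx$.
 The gap $\beta \tF(\btheta) - \beta F$ is therefore exactly the
 relative entropy, so reducing $\tF$ is the same as
 reducing $D_{\rm KL}(Q||P)$.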

 Our strategy is thus to vary $\btheta$ in such a way that 
 $\beta \tF(\btheta)$ is minimized. The approximating distribution 
 then gives a simplified approximation 
 to the true distribution that may be useful, and the value 
 of  $\b \tF(\btheta)$ will be an upper bound for $\b F$.
 Equivalently, $\tilde{Z} \equiv e^{-\b  \tF(\btheta)}$ is a lower bound for $Z$.

\subsection{Can the objective function $\b \tF$ be evaluated?}
 We have already agreed that the evaluation of various interesting
 sums over $\bx$ is intractable.  For example, the \ind{partition function}
\beq
 Z = \sum_{\bx} \exp \! \left( - \b E( \bx ; \bJ ) \right),
\eeq
 the energy
\beq
 \left< E \right>_P = \frac{1}{Z} \sum_{\bx} E( \bx ; \bJ ) \exp \!
 \left( - \b E( \bx ; \bJ ) \right) ,
\eeq
 and the entropy 
\beq
 S \equiv
 \sum_{\bx} P(\bx \given  \beta, \bJ) \ln\frac{1}{P(\bx \given  \beta, \bJ)}
\eeq
 are
 all presumed to be impossible to evaluate.
 So why should we suppose that this objective function 
 $\beta \tF(\btheta)$, which is also defined in terms of a sum 
 over all $\bx$ (\ref{eq.vfe}), should be a convenient quantity to deal
 with? Well, for a range of interesting energy functions,
 and for sufficiently simple approximating distributions, 
 the variational free energy {\em can\/} be efficiently 
 evaluated.



\section{Variational free energy minimization for spin systems}
% Ising models}
\label{sec.vfeising}
 An example of a tractable variational free energy is given by 
 the  spin system whose energy function was given in \eqref{eq.ising.e.again},
 which we can approximate with a {\em separable\/} approximating distribution,
\beq
        Q(\bx; \ba) = \frac{1}{Z_Q} \exp \left({  \sum_n a_n x_n }\right) .
\eeq
 The variational parameters $\btheta$ of the variational free
 energy (\ref{eq.vfe}) are the components of the vector
% of log probability ratios,
 $\ba$.
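 It may help to make the separability explicit. Because the
 exponent is a sum over spins, the sum over states in the
 normalizing constant factorizes (a short check that uses only
 the definition above):
\beq
	Z_Q = \sum_{\bx} \prod_n e^{a_n x_n}
	= \prod_n \left( e^{a_n} + e^{-a_n} \right) ,
	\hspace{0.3in}
	Q(\bx;\ba) = \prod_n \frac{ e^{a_n x_n} }{ e^{a_n} + e^{-a_n} } .
\eeq
 Each factor is a normalized distribution over a single spin
 $x_n \in \{-1,+1\}$, so the spins are independent under $Q$.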
 To evaluate the variational free energy we need the entropy of 
 this distribution,
\beq
        S_Q = \sum_{\bx} \: Q(\bx;\ba)  \ln\frac{1}
                                { Q(\bx;\ba) } 
\eeq
 and the mean of the energy,
\beq
  \left<  E( \bx ; \bJ  )  \right>_Q  =
 \sum_{\bx} \: Q(\bx;\ba) 
                  E( \bx ; \bJ  ) .
\eeq
 The entropy of the separable approximating
 distribution is simply the sum of the entropies of the individual
 spins \exercisebref{ex.Hadditive},
% bref puts the ref in brackets
\beq
        S_Q = \sum_n H_2^{(e)}(q_n),
\eeq
 where $q_n$ is the probability that spin $n$ is $+1$,
\beq
        q_n = \frac{e^{a_n}}{e^{a_n}+e^{-a_n}} = \frac{1}{1+\exp(-2 a_n)}  ,
\eeq
 and
\beq
 H_2^{(e)}(q) = q \ln \frac{1}{q} + (1-q)\ln\frac{1}{(1-q)} .
\eeq
% all logs being natural logarithms.
%
% REFER to an exercise? 
%
 The mean energy under $Q$ is easy to obtain because 
% $E(\bx;\bJ)$
 $\sum_{m,n} J_{mn} x_m x_n$ is a sum 
 of terms each involving  the product of two 
 {\em independent\/} random variables.
 (There 
 are no self-couplings, so $J_{mn} = 0$ when $m=n$.)
 If we define
 the mean value 
 of $x_n$ to be $\bar{x}_n$, which is given by
\beq
        \bar{x}_n = \frac{ e^{a_n} - e^{-a_n} }{e^{a_n} + e^{-a_n} }
        = \tanh(a_n) = 2 q_n - 1 ,
\eeq
 we obtain
\beqan
         \left<  E( \bx ; \bJ  )  \right>_Q  &=& 
 \sum_{\bx} \: Q(\bx;\ba)  \left[ - \frac{1}{2}
                                \sum_{m,n} J_{mn} x_m x_n - \sum_n h_n x_n
 \right]  \\
 &=&
- \frac{1}{2}   \sum_{m,n} J_{mn} \bar{x}_m \bar{x}_n - \sum_n h_n \bar{x}_n.
%
\eeqan
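 The second line follows because the spins are independent under $Q$,
 so for $m \neq n$ the average of a product is the product of the averages,
\beq
	\left< x_m x_n \right>_Q = \left< x_m \right>_Q \left< x_n \right>_Q
	= \bar{x}_m \bar{x}_n .
\eeq
 A diagonal term would instead give $\left< x_n^2 \right>_Q = 1$,
 which is why the absence of self-couplings matters here.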
 So the variational free energy is given by
\beq
        \b \tF(\ba) = \b  \left<  E( \bx ; \bJ  )  \right>_Q  - S_Q
        = \b \left(- \frac{1}{2}
                                \sum_{m,n} J_{mn} \bar{x}_m \bar{x}_n - \sum_n h_n \bar{x}_n \right)  - \sum_n H_2^{(e)}(q_n) .
\eeq
%%%%%%%%%%%%%%%%%%%%%% added Dec 2000
\amarginfig{c}{
%\begin{figure}  
%\figuremargin{%
\begin{center}
\hspace*{-0.1in}\mbox{\psfig{figure=gnu/ising.vfe.s.ps,angle=-90,width=2.4in}}\\[-0.4in]
\end{center}
%}{%
% see gnu/ising.gnu
% {\textstyle\half} no factor of half, because sum above is over all mn
\caption[a]{The variational free energy
 of
 the two-spin system whose energy is $E(\bx) = - x_1 x_2$,
 as a function of the two variational parameters $q_1$ and $q_2$.
 The inverse temperature is  $\beta=1.44$.
% critical point for this system is 1
 The function plotted is
$$
  \b \tF = -
   \b  \bar{x}_1 \bar{x}_2
                - H_2^{(e)}(q_1) - H_2^{(e)}(q_2),
$$
 where $\bar{x}_n = 2  q_n -1$.
 Notice that for fixed $q_2$
 the function is \convexsmile\ with respect to $q_1$,
 and for fixed $q_1$ it is \convexsmile\ with respect to $q_2$.}
\label{fig.mft2spins}
}%
% see also load 'ising3.gnu' for a movie demo
% for lecture
%\end{figure}



 We now consider minimizing this function with respect 
 to the variational parameters $\ba$.
 If
% Noting that when
 $q=1/(1+e^{-2a})$, the derivative of the entropy is  
\beq
 \frac{ \partial}{\partial q} H_2^{(e)}(q) = \ln \frac{1-q}{q} = -2a .
\eeq
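 Both equalities can be checked directly. Differentiating
 $H_2^{(e)}(q) = -q \ln q - (1-q) \ln (1-q)$ gives
 $- \ln q - 1 + \ln(1-q) + 1 = \ln [ (1-q)/q ]$,
 and with $q = 1/(1+e^{-2a})$ we have $(1-q)/q = e^{-2a}$.
 The same substitution gives the chain-rule factor used in the
 gradient of $\b \tF$,
\beq
	\frac{\partial q}{\partial a} = \frac{ 2 e^{-2a} }{ (1+e^{-2a})^2 } = 2 q (1-q) ,
\eeq
 which is strictly positive for finite $a$.
 The energy term of the gradient also uses
 $\partial \bar{x}_m / \partial a_m = 2 \, \partial q_m / \partial a_m$ and,
 as an assumption made explicit here, symmetric couplings
 $J_{mn} = J_{nm}$, so that the factor of $\frac{1}{2}$ in the energy cancels.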
 So we obtain
\beqan
 \frac{\partial }{\partial a_m} \b \tF(\ba) 
        &=&  \b \left[ - \sum_{n} J_{mn}  \bar{x}_n -  h_m   \right]\left(2
         \frac{\partial q_m }{\partial a_m} \right)  - 
        \ln \left( \frac{1-q_m}{q_m}
\right) \left(\frac{\partial q_m }{\partial a_m} \right)  
\nonumber \\
 &=& 2\left(\frac{\partial q_m }{\partial a_m} \right) 
    \left[- \b \left( \sum_{n} J_{mn}  \bar{x}_n +  h_m\right)  + a_m \right]
 .
\eeqan
 This derivative is equal to zero when 
\beq
        a_m = \b\left( \sum_{n} J_{mn}  \bar{x}_n +  h_m \right)  .
\label{eq.mfta}
\eeq
 So  $\tF(\ba)$ is extremized at any point that  satisfies
 \eqref{eq.mfta} and
% the definition 
\beq
         \bar{x}_n = \tanh( a_n )  .
\label{eq.mftb}
\eeq
% define the solution to our variational 
% free energy minimization.

 The \vfe\ $\tF(\ba)$ may be a multimodal function,
 in which case each stationary point (maximum, minimum or saddle)
 will satisfy equations (\ref{eq.mfta}) and (\ref{eq.mftb}).
 One way of using these equations, 
 in the case of a  system with an arbitrary coupling matrix $\bJ$,
 is to update each parameter $a_m$
 and the corresponding value of $\bar{x}_m$
 using equation (\ref{eq.mfta}),  one at a time. This {\dem asynchronous
 updating of the parameters\/} is guaranteed never to increase $\b\tF(\ba)$.
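
 As a minimal worked illustration (a sketch, using the two-spin
 system of \figref{fig.mft2spins} with $J_{12}=J_{21}=1$ and $h_1=h_2=0$,
 so that $E(\bx) = -x_1 x_2$), the asynchronous updates are
\beq
	a_1 \leftarrow \b \bar{x}_2 , \:\:\: \bar{x}_1 \leftarrow \tanh(a_1) ;
	\hspace{0.3in}
	a_2 \leftarrow \b \bar{x}_1 , \:\:\: \bar{x}_2 \leftarrow \tanh(a_2) .
\eeq
 Any fixed point with $\bar{x}_1 = \bar{x}_2 = \bar{x}$ satisfies
 $\bar{x} = \tanh( \b \bar{x} )$.
 For $\b \leq 1$ the only solution is $\bar{x}=0$; for $\b = 1.44$,
 the value used in the figure, solving this equation numerically gives
 $\bar{x} \simeq \pm 0.83$ in addition to $\bar{x}=0$.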

 Equations (\ref{eq.mfta}) and (\ref{eq.mftb}) may be recognized
 as the \index{mean field theory}{mean field} equations for a spin system. The variational
 parameter $a_n$ may be thought of as the strength of a fictitious
 field  applied to an isolated spin $n$. 
% which when positive encourages spin $n$ to point up. 
 \Eqref{eq.mftb}
 describes the mean response of spin $n$, and \eqref{eq.mfta} describes
 how the field $a_m$ is set in response to the mean state of 
 all the other spins.
  
 The variational free energy  derivation is a helpful 
 viewpoint for mean field theory for two reasons.
\ben
\item
 This approach associates an objective function $\b \tF$ with 
 the mean field equations; such an objective function is useful 
 because it can help identify alternative dynamical systems 
 that minimize the same function.
\item
 The theory is readily generalized to other approximating
 distributions.  We can imagine introducing a more complex
 approximation $Q(\bx;\btheta)$ that might for example capture
 correlations among the spins instead of modelling the spins as
 independent.  One could then evaluate the variational free energy and
 optimize the parameters $\btheta$ of this more complex 
 approximation. The more degrees of freedom the approximating 
 distribution has, the tighter the bound on the free energy becomes. 
 However, if
 the complexity of an approximation is increased, the evaluation of either
 the mean energy or the entropy typically becomes more
 challenging.
\een

\begin{figure}
\figuremargin{%
\begin{center}
\mbox{\psfig{figure=isingmft/mft3.T.ps,angle=-90,width=3.2in}}
\end{center}
}{%
\caption[a]{Solutions of the variational free energy extremization problem
 for the Ising model, for three different
 applied fields $h$. Horizontal axis: temperature $T=1/\b$.
 Vertical axis: magnetization $\bar{x}$. The critical temperature
 found by mean field theory is $T_c^{\rm mft} = 4$.}
\label{fig.mft}
}%
\end{figure}

\section{Example: mean field theory for the ferromagnetic Ising model}
 In the simple Ising model studied in  \chapterref{ch.ising}, every coupling $J_{mn}$ is equal to $J$
 if $m$ and $n$ are neighbours and zero otherwise. There is
 an applied field $h_n = h$ that is the same for all spins.
 A very simple approximating distribution is one with just a single
 variational parameter $a$, which defines a separable distribution
\beq
        Q(\bx; a) = \frac{1}{Z_Q} \exp \left({  \sum_n a x_n }\right) 
\eeq
 in which all spins are independent  and have the same probability
% $\theta$,
\beq
        q_n =  \frac{1}{1+\exp(-2 a)}
\eeq
 of being up. The mean magnetization is 
\beq
         \bar{x} = \tanh( a ) 
\label{eq.mftb.i}
\eeq
 and equation (\ref{eq.mfta}), which locates the stationary points of the
 variational free energy, becomes 
\beq
        a = \b\left( C J  \bar{x} +  h \right)  ,
\label{eq.mfta.i}
\eeq
 where $C$ is the number of couplings that a spin is involved in --
 $C=4$ in the case of a rectangular two-dimensional Ising model.
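 \Eqref{eq.mfta.i} follows from \eqref{eq.mfta} because, with every
 $\bar{x}_n$ equal to $\bar{x}$, the sum over the $C$ neighbours of
 spin $m$ is $\sum_n J_{mn} \bar{x}_n = C J \bar{x}$.
 The same substitution in the expression for $\b \tF(\ba)$ gives,
 as a sketch for a system of $N$ spins in which every spin has
 exactly $C$ neighbours (an assumption that ignores boundary effects),
\beq
	\frac{ \b \tF(a) }{N} = - \b \left( \frac{1}{2} C J \bar{x}^2 + h \bar{x} \right)
	- H_2^{(e)}(q) ,
\eeq
 where $q = 1/(1+\exp(-2a))$ and $\bar{x} = 2q - 1$.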
 We can solve  equations (\ref{eq.mftb.i}) and (\ref{eq.mfta.i}) for $\bar{x}$
 numerically -- in fact,
% if we want a graph
 it is easiest to vary $\bar{x}$ and solve
 for $\b$
%
% note if x = tanh(a) then a = 1/2 log[(1+x)/(1-x)]
%
 -- and obtain graphs of the free energy minima and maxima
 as a function of temperature as shown in \figref{fig.mft}. The
 solid line
 shows $\bar{x}$ versus $T = 1 /\beta$ for  the case $C=4, J=1$. 
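 Solving for $\b$ is straightforward because $a = \tanh^{-1} \bar{x}$.
 Combining (\ref{eq.mftb.i}) and (\ref{eq.mfta.i}) gives, for any chosen
 $\bar{x} \in (-1,1)$ at which $C J \bar{x} + h \neq 0$,
\beq
	\b = \frac{ \tanh^{-1} \bar{x} }{ C J \bar{x} + h }
	= \frac{1}{ 2 ( C J \bar{x} + h ) } \ln \frac{ 1 + \bar{x} }{ 1 - \bar{x} } ,
\eeq
 and only values of $\bar{x}$ that give $\b > 0$ correspond to points on the graph.
 For example, with $C=4$, $J=1$, $h=0$ and $\bar{x} = 0.5$
 (a value picked only for illustration),
 $\b = \frac{1}{4} \ln 3 \simeq 0.275$, that is, $T \simeq 3.64$.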
%
% easy because  b( CJ x + h ) = tanh^{-1} x = 1/2 log[(1+x)/(1-x)]
%
% see ~/bin/mft.p

 When $h=0$, there is a pitchfork bifurcation at a critical
 temperature $T_c^{\rm mft}$. [A pitchfork bifurcation is
 a transition like the one shown by the solid lines in
 \figref{fig.mft},
% figure 26.1
 from a system with one minimum as a function of $a$ (on the right) to
 a system (on the left)
 with two minima and one maximum; the maximum is the middle one of
 the three lines. The solid lines look
 like a pitchfork.]
% (like the true critical temperature $T_c$ of the Ising model).
 Above this temperature, there is only one
 minimum in the variational free energy, at $a=0$
 and $\bar{x}=0$; this minimum corresponds to an approximating
 distribution that is uniform
% distribution
 over all states. Below the critical temperature, there
 are two minima corresponding to approximating distributions that are
 symmetry-broken, with all spins more likely to be up, or all spins
 more likely to be down.  The state $\bar{x}=0$ persists as a stationary
 point of the variational free energy, but now it is a local {\em maximum\/}
 of the variational free energy.
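
 The value of $T_c^{\rm mft}$ can be obtained by a short expansion
 (a sketch for $h=0$). Equations (\ref{eq.mftb.i}) and (\ref{eq.mfta.i})
 combine to give $\bar{x} = \tanh( \b C J \bar{x} )$, and for small $\bar{x}$
 we may use $\tanh(a) \simeq a - a^3/3$:
\beq
	\bar{x} \simeq \b C J \bar{x} - \frac{1}{3} ( \b C J \bar{x} )^3
	\hspace{0.2in} \Rightarrow \hspace{0.2in}
	\bar{x} = 0 \:\: \mbox{or} \:\:
	\bar{x}^2 \simeq \frac{ 3 ( \b C J - 1 ) }{ ( \b C J )^3 } .
\eeq
 Nonzero solutions therefore exist only when $\b C J > 1$, that is, when
 $T < T_c^{\rm mft} = C J$, which equals 4 for $C=4$, $J=1$,
 in agreement with \figref{fig.mft}.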

 When $h>0$, there is a global \vfe\ minimum at
 any temperature for a positive value of $\bar{x}$,
 shown by the upper dotted curves in  \figref{fig.mft}.
 As long as  $h