% This document is (c) David J.C. MacKay, 2001
%
% It originates from http://www.inference.phy.cam.ac.uk/mackay/itprnn/
% http://www.inference.phy.cam.ac.uk/mackay/itprnn/book.html
%
% It contains the text of David MacKay's book,
% Information theory, inference, and learning algorithms.
% (latex source)
%
% Copying and distribution of this file are NOT PERMITTED.
%
% The file is provided for convenience of anyone wishing to
% make a web-based search of the text of the book.
% was book2e.tex is now book.tex (and still latex2e)
\documentclass[11pt]{book}%
% last minute additions
\usepackage{DJCMamssymb}% needed for blacktriangleright Mon 10/11/03 (put in symbols instead)
\usepackage{ragged2e}% provides \justifying
% end last minute additions
\usepackage{floatflt}
%\usepackage{hangingsecnum}% makes sec numbers sit in the left margin (tried cutting out on Thu 6/11/03)
\usepackage{hangingsecnum2}% makes sec numbers sit in the left margin (modified Thu 6/11/03)
%\usepackage{mparhack}
\usepackage{mparhackright-209}% makes all margin pars go in right margin
\usepackage{marginfig}% Defines many macros for making various styles of figure with captions
%\usepackage{symbols}% Provides a few math symbols (replaced with DJCMamssymb)
%\usepackage{twoside}
\usepackage{myalgorith}% defines the Algorithm environment as a float
% Also forces fig,tab, and alg all to use a single counter
\usepackage{aside}% defines the {aside} environment
\usepackage{chapsummary}% helps me compile index-like objects (NOT USED)
\usepackage{chapternotes}% lots of assorted stuff
\usepackage{lsalike}% defines citation commands
\usepackage{booktabs}% makes nice quality tables
\usepackage{prechapter}% defines a chapter-like object
\usepackage{mycaption}% defines ``\indented'' and \@makecaption; and the notindented style used in figure captions
% additions post-Sat 5/10/02
\usepackage{latexsym}% needed in order to make use of the \Box command
\usepackage{tocloft}% implements my look of table of contents
\usepackage{tocloftcomp2}% implements my look of table of contents (was tocloftcomp until Thu 6/11/03)
\usepackage{mychapter}% defines chapter command, including the look of the new chapter page
% also defines the look of the section and subsection commands
\usepackage{mycenter}% modifies center to reduce vertical space waste - useful for figures, etc.
\usepackage{mypart}% modifies part to not cleardoublepage (no longer Sat 5/4/03)
\usepackage{myheadings}% redefines the pagestyle ``headings''
% \usepackage{headingmods}% redefines the pagestyle ``headings'' (similar to myheadings)
% \usepackage{myindents}% defines parindent and leftmargin
\usepackage{graphics}% enables rotating of boxes
% \usepackage{boldmathgk}% provides bold alpha etc. (doesn't work)
% \usepackage{fixmath}% provides bold alpha etc. Also (I think) provides numerous sloping greeks that I don't like
\usepackage{fixmathDJCM}% provides bold alpha etc. Has Gamma definition cut out. and Omega
% suggested by DAG:
%\usepackage{amsmath}
%\usepackage{mathptmx}
\usepackage{DAGmathspacing}% provides smallfrac
\usepackage{boxedminipage}
\usepackage{fancybox}% Provides ability to put verbatim text inside boxes
\usepackage{bbold}% CTAN blackboard.ps was helpful for choosing this PROVIDES ``holey 1'' as \textbb{1}
\usepackage{epsf}% to allow use of metapost figures
%\usepackage{hyperref} % incompatible with something
%
\usepackage{multicol}% why does CTAN refer to multicols?
%\usepackage{myindex2}% overrides book definition of index
\usepackage{myindex}% overrides book definition of index
\usepackage{makeidx}
\usepackage{mybibliog}
\usepackage{mygaps}% defines \eq and \puncgap and \colonspace and \puncspace
\usepackage{mytoc}% suppresses the CONTENTS headings
\makeindex
%
\newcommand{\thedraft}{7.0}% 6.6 was 2nd printing. 6.8 was when I fixed errs Tue 24/2/04 % 6.9 = Mon 28/6/04 % 6.10 = Mon 2/8/04 % 6.11 Sun 22/8/04 % 7.0 final for 3rd printing
\renewcommand{\textfraction}{0.10}
\pagestyle{headings}
\begin{document}
\bibliographystyle{lsalikedjcmsc}%.bst
%\newcommand{\bf}{\textbf}
%\newcommand{\sf}{\textsf}
%%\newcommand{\em}{\textem}
%\newcommand{\rm}{\textrm}
%\newcommand{\tt}{\texttt}
%\newcommand{\sl}{\textsl}
%\newcommand{\sc}{\textsc}
%
% chapter.tex
%
% this contains a few common definitions for all chapters
% of the itprnn book
% for _l1.tex:
\hyphenation{left-multi-pli-ca-tion}
\hyphenation{multi-pli-ca-tion}
%
\newcommand{\partnoun}{Part}
\newcommand{\partone}{\partnoun\ I}
\newcommand{\datapart}{I}
\newcommand{\noisypart}{II}
\newcommand{\finfopart}{III}
\newcommand{\probpart}{IV}
\newcommand{\netpart}{V}
\newcommand{\sgcpart}{VI}
\newcommand{\hybrid}{Hamiltonian}
\newcommand{\Hybrid}{Hamiltonian}
%
% If sending book to readers -
\newcommand{\begincuttable}{}
\newcommand{\ENDcuttable}{}
% If sending to editor -
%\newcommand{\begincuttable}{\marginpar{\raisebox{-0.5in}[0in][0in]{$\downarrow$}CUTTABLE?}}
%\newcommand{\ENDcuttable}{\marginpar{\raisebox{0.5in}[0in][0in]{$\uparrow$}CUTTABLE?}}
%
\newcommand{\adhoc}{ad hoc}
\newcommand{\busstop}{bus-stop}
\newcommand{\mynewpage}{\newpage}% switch this off later Sun 3/2/02
% see also tex/inputs/itchapter.sty
% chapternotes.sty is where there is an index
\newcommand{\fN}{f\!N}
\newcommand{\exercisetitlestyle}{\sf}
%
% used in sumproduct.tex and gallager.tex
\newcommand{\Mn}{{\cal M}(n)}
\newcommand{\Nm}{{\cal N}(m)}
%\newcommand{\N}{{\cal N}}
%
% the delta function that is 1 if true (defined in notation.tex)
\newcommand{\truth}{\mbox{\textbb{1}}}
% requires:
% \usepackage{bbold}% CTAN blackboard.ps was helpful for choosing this
%
% used in gene.tex
\newcommand{\deltaf}{\delta\! f}
\newcommand{\tI}{\tilde{I}}
\newcommand{\Kp}{K_{\rm{p}}}
\newcommand{\Ks}{K_{\rm{s}}}
%
% end
% lang4.tex - distributions.tex
\newcommand{\lI}{I}
%
% clust.tex
\newcommand{\rnk}{r^{(n)}_k}
\newcommand{\hkn}{\hat{k}^{(n)}}
% good sizes:
% -0.45: 1.25
% -0.25: 0.65
% -0.4 0.8
\newcommand{\softfig}[1]{\hspace{-0.4in}\psfig{figure=octave/kmeansoft/ps1/#1.ps,width=0.8in,angle=-90}}
\newcommand{\softtfa}[3]{\begin{tabular}{c}{$t=#2$}\\
\hspace*{-0.4in}\mbox{\psfig{figure=octave/kmeansoft/#3/#1.ps,width=1.2in,angle=-90}\hspace*{-0.2in}}\\
\end{tabular}}
\newcommand{\softtfabig}[3]{\begin{tabular}{c}{$t=#2$}\\
\hspace*{-0.6in}\mbox{\psfig{figure=octave/kmeansoft/#3/#1.ps,width=1.5in,angle=-90}\hspace*{-0.2in}}\\
\end{tabular}}
\newcommand{\softtfabigb}[3]{\begin{tabular}{c}{$t=#2$}\\
\hspace*{-0.45in}\mbox{\psfig{figure=octave/kmeansoft/#3/#1.ps,width=1.625in,angle=-90}\hspace*{-0.2in}}\\
\end{tabular}}
\newcommand{\softtf}[2]{\softtfa{#1}{#2}{ps1}}
\newcommand{\softtfbig}[2]{\softtfabig{#1}{#2}{ps1}}
\newcommand{\softtfbigb}[2]{\softtfabigb{#1}{#2}{ps1}}
\newcommand{\softtfb}[2]{\softtfa{#1}{#2}{ps3}}
\newcommand{\softtfbbig}[2]{\softtfabigb{#1}{#2}{ps3}}
\newcommand{\softfc}[1]{\begin{tabular}{c}%
\hspace*{-0.2in}\mbox{\psfig{figure=octave/kmeansoft/ps5/#1.ps,width=1.32in,angle=-90}\hspace*{-0.2in}}\\
\end{tabular}}
% end
%
% used in _p1 and _l2
\newcommand{\hpheight}{26mm}
\newcommand{\wow}{\marginpar{{\Huge{$*$}}}}
%\newcommand{\wow}{\marginpar{\raisebox{-12pt}{\psfig{figure=figs/wow.eps,width=1in}}}}
%
% used in _l1.tex:::::::
\renewcommand{\q}{{f}}
\newcommand{\obr}[3]{\overbrace{{#1}\,{#2}\,{#3}}}
\newcommand{\ubr}[3]{\underbrace{{#1}\,{#2}\,{#3}}}
\newcommand{\nbr}[3]{{{#1}\,{#2}\,{#3}}}
%
% for \mid and gaps puncgap etc see mygaps.sty
\newcommand{\EM}{EM}
\newcommand{\ENDsolution}{\hfill \ensuremath{\epfsymbol}\par}
\newcommand{\ENDproof}{\hfill \ensuremath{\epfsymbol}\par}
\newcommand{\Hint}{{\sf{Hint}}}
\newcommand{\viceversa}{{\itshape{vice versa}}}
\newcommand{\analyze}{analyze}
\newcommand{\analyse}{analyze}
\newcommand{\fitpath}{/home/mackay/octave/fit/ps}% used in fit.tex (gaussian fitting, octave)
% CUP style:
\renewcommand{\cf}{cf.}
\renewcommand{\ie}{i.e.}
\renewcommand{\eg}{e.g.}
\renewcommand{\NB}{N.B.}
%
% symbols i e and d in maths (operators)
\newcommand{\im}{{\rm i}}
\newcommand{\e}{{\rm e}}
% \d is already defined
%
% needs
% \usepackage{boxedminipage}
\newenvironment{conclusionboxplain}%
{\begin{Sbox}\begin{minipage}{\textwidth}}%
{\end{minipage}\end{Sbox}\fbox{\TheSbox}}
\newenvironment{conclusionbox}%
%{\begin{Sbox}\begin{minipage}{\textwidth}}%
%{\end{minipage}\end{Sbox}\fbox{\TheSbox}}
{% see also marginfig.sty for conflicting use of this environment and its params - and for defn of fatfboxsep
\fatfboxsep%
\setlength{\mylength}{\textwidth}%
\addtolength{\mylength}{-2\fboxsep}%
\addtolength{\mylength}{-2\fboxrule}%
\vskip8pt\noindent\begin{Sbox}\begin{minipage}{\mylength}\hspace*{-\fboxsep}\hspace*{-\fboxrule}%
\hspace*{\leftmargini}\begin{minipage}{\textwidthlessindents}}%
{\end{minipage}\end{minipage}\end{Sbox}\shadowbox{\TheSbox}\resetfboxsep\vskip 1pt}
\newenvironment{oldconclusionbox}%
{\vskip 0.1pt \noindent\rule{\textwidth}{0.1pt}\vskip -18pt\begin{quote}\vskip -8pt}%
{\end{quote}\vskip -14pt \noindent\rule{\textwidth}{0.1pt}\vskip 6pt}
% {\vskip 0.1pt \noindent\rule{\textwidth}{0.1pt}\vskip -12pt\begin{quote}}%
% {\end{quote}\vskip -12pt \noindent\rule{\textwidth}{0.1pt}}
\newcommand{\dy}{\d y}
\newcommand{\plus}{+}
\newcommand{\Wenglish}{Wenglish}% winglish
\newcommand{\wenglish}{\Wenglish}% winglish
\newcommand{\percent}{{per cent}}% in USA only: percent
%
%\newcommand{\nonexaminable}{$^{*}$}
\newcommand{\nonexaminable}{}
%
% for exact sampling chapter
\newcommand{\envelope}{summary state}
%
\def\unit#1{\,{\rm #1}}
\def\cm{\unit{cm}}
\def\grams{\unit{g}}
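% Usage sketch (illustrative; the quantities below are not from the book):
% \unit prepends a thin space and sets its argument in roman type, so
%   $d = 3\cm$       gives "d = 3 cm"
%   $m = 5\grams$    gives "m = 5 g"
%   $t = 2\unit{s}$  gives "t = 2 s" for a unit that has no shorthand macro.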
% this is a 209 versus 2e problem: (huffman.latex edited instead)
%\def\tenrm{\rm}
%\def\tenit{\it}
%
% other problems: \pem
\renewcommand{\textfraction}{0.1}
%
% for use in free text:
\newcommand{\bits}{{\rm bits}}
\newcommand{\bita}{{\rm bit}}
% for use in equations or in '1 bit'
\newcommand{\ubits}{\,{\bits}}
\newcommand{\ubit}{\,{\bita}}
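% Usage sketch (illustrative values): in an equation write
%   $H(X) = 2\ubits$   or   $h(x) = 1\ubit$
% so that the unit is set in roman type after a thin space; in running text
% use \bits or \bita directly, e.g. "an information content of 2 \bits".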
%
%
%
% ch 2:
\newcommand{\sixtythree}{{\tt sixty-three}}
\newcommand{\aep}{`asymptotic equipartition' principle}
%
% used in alpha:
\newcommand{\sla}{\sqrt{\lambda_a}}
\newcommand{\kga}{\kappa\gamma}
\newcommand{\kkgg}{\kappa^2\gamma^2}
\newcommand{\skg}{\sqrt{\kappa\gamma}}
\newcommand{\TYP}{{\rm \scriptscriptstyle TYP}}
%
\newcommand{\bb}{{\bf b}}
%
% used in ising.tex and _s4.tex
% J=+1 are in states1, J=-1 are in states
%\newcommand{\risingsample}[1]{\psfig{figure=isingfigs/states1/#1.ps,width=1.82in}}
\newcommand{\risingsample}[1]{\psfig{figure=isingfigs/states1/#1.ps,width=1in}}% was 1.75
\newcommand{\smallrisingsample}[1]{\psfig{figure=isingfigs/states1/#1.ps,width=0.6in}}% was 1.2 was 0.9
\newcommand{\Hisingsample}[1]{\psfig{figure=isingfigs/states1/#1.ps,width=2.6in}}
\newcommand{\hisingsample}[1]{\psfig{figure=isingfigs/states/#1.ps,width=2.6in}}
\newcommand{\bighisingsample}[1]{\psfig{figure=isingfigs/states/#1.ps,width=3.86in}}
%
% used in _noiseless.tex
\newcommand{\Connectionmatrix}{Connection matrix}
\newcommand{\connectionmatrix}{connection matrix}
\newcommand{\connectionmatrices}{connection matrices}
%\newcommand{\cwM}{M}% codeword number
%\newcommand{\cwm}{m}% codeword number
\newcommand{\cwM}{S}% codeword number
\newcommand{\cwm}{s}% codeword number
\newcommand{\sa}{\alpha}% signal amplitude in gaussian channel
%
\newcommand{\cmA}{A}% connection matrix symbol
\newcommand{\bcmA}{{\bf \cmA}}% connection matrix symbol
\newcommand{\bAcm}{{\bcmA}}
\newtheorem{ctheorem}{Theorem}[chapter]
\newtheorem{definc}{Definition}[chapter]
\newcommand{\appendixref}[1]{Appendix \ref{#1}}
\newcommand{\appref}[1]{Appendix \ref{#1}}
\newcommand{\Appendixref}[1]{Appendix \ref{#1}}
\newcommand{\sectionref}[1]{section \ref{#1}}
\newcommand{\Sectionref}[1]{Section \ref{#1}}
\newcommand{\secref}[1]{section \ref{#1}}
\newcommand{\Secref}[1]{Section \ref{#1}}
\newcommand{\chapterref}[1]{Chapter \ref{#1}}
\newcommand{\Chapterref}[1]{Chapter \ref{#1}}
\newcommand{\chref}[1]{Chapter \ref{#1}}
\newcommand{\Chref}[1]{Chapter \ref{#1}}
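% Usage sketch, assuming the labels sec.pulse and ch.two are defined in the chapters
% (they are the labels referenced by \secpulse and \chtwo below):
%   as discussed in \secref{sec.pulse}      -> "as discussed in section <n>"
%   \Secref{sec.pulse} shows that ...       -> "Section <n> shows that ..."
%   see \chref{ch.two}                      -> "see Chapter <n>"
% Note that only the section macros have distinct lower-case and capitalized forms;
% \chapterref, \Chapterref, \chref and \Chref all produce a capitalized "Chapter".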
\newcommand{\chone}{\ref{ch.one}}
\newcommand{\chtwo}{\ref{ch.two}}
\newcommand{\chthree}{\ref{ch.three}}
\newcommand{\chfour}{\ref{ch.four}}
\newcommand{\chfive}{\ref{ch.five}}
\newcommand{\chsix}{\ref{ch.six}}
\newcommand{\chseven}{\ref{ch.ecc}}
\newcommand{\cheight}{\ref{ch.bayes}}
\newcommand{\chthirteen}{\ref{ch.single.neuron.class}}% single neuron
\newcommand{\chfourteen}{\ref{ch.single.neuron.bayes}}% single neuron bayes?
\newcommand{\chtwelve}{\ref{ch.nn.intro}}% intro to nn
\newcommand{\chcover}{\ref{ch.cover}}
\newcommand{\chbayes}{\ref{ch.bayes}}
\newcommand{\secpulse}{\ref{sec.pulse}}% 7.2.1?}
\newcommand{\secthirteenthree}{13.3?}
\newcommand{\secmetrop}{\ref{sec.metrop}}% 11.3?}
\newcommand{\figooo}{?1.11?}
\newcommand{\eqgamma}{8.27?}
\newcommand{\TSP}{travelling salesman problem}
\newcommand{\Bayes}{Bayes'}
\newcommand{\vfe}{variational free energy}
\newcommand{\vfem}{variational free energy minimization}
% could make this \ch6 = \ref{ch6}
% author, title etc is in here....
% {headerinfo.tex}% uses special commands
\setcounter{secnumdepth}{2}%
\newcommand{\indep}{\bot}% upside down pi desired
\newcommand{\dbf}{\slshape}% boldface in definitions
\newcommand{\dem}{\slshape}% emphasized definitions in text
\newcommand{\solutionb}[2]{\setcounter{solution_number}{#1}
\solutiona{#2}}
\newcommand{\lsolution}[2]{\section{Solution to exercise {#1}}{#2}}
%
%
\newcommand{\FIGS}{/home/mackay/book/FIGS}
\newcommand{\bookfigs}{/home/mackay/book/figs}
\newcommand{\figsinter}{/home/mackay/handbook/figs/inter}
\newcommand{\exburglar}{\exerciseref{ex.burglar}}
\newcommand{\exnine}{\exerciseref{ex.invP}}%10}
\newcommand{\exseven}{\exerciseonlyref{ex.weigh}}% use deprecated!
% was \exseven .... \exerciseref{ex.expectn}}%9}
\newcommand{\exaseven}{\exerciseref{ex.R9}}%{7}
\newcommand{\exten}{\exerciseref{ex.expectng}}%{11}
\newcommand{\exfourteen}{\exerciseref{ex.Hadditive}}%{15}
\newcommand{\exfifteen}{\exerciseref{ex.Hcondnal}}%{16}
\newcommand{\exeighteen}{\exerciseref{ex.Hmutualineq}}%{19}
\newcommand{\extwenty}{\exerciseref{ex.rel.ent}}%{21}
\newcommand{\extwentyone}{\exerciseref{ex.joint}}%{22}% the joint ensemble
\newcommand{\extwentytwo}{\exerciseref{ex.dataprocineq}}%{23}
\newcommand{\extwentythree}{\exerciseref{ex.zxymod2}}%{24}
\newcommand{\extwentyfour}{\exerciseref{ex.waithead}}%{25}
\newcommand{\extwentyfive}{\exerciseref{ex.sumdice}}%{26}
\newcommand{\extwentysix}{\exerciseref{ex.RN}}%{27}
\newcommand{\extwentyseven}{\exerciseref{ex.RNGaussian}}%{28}
\newcommand{\exthirtyone}{\exerciseref{ex.logit}}%{32}% logistic
\newcommand{\exthirtysix}{\exerciseref{ex.exponential}}%{37}%
\newcommand{\exthirtyseven}{\exerciseref{ex.blood}}%{38}% forensic
\newcommand{\exfiftythree}{\exerciseref{ex.}}%{53}% integers
\newcommand{\eqsixteenfive}{16.5}
\newcommand{\Kraft}{Kraft}% Kraft--McMillan
\newcommand{\exrelent}{\exerciseref{ex.rel.ent}}%{20} %% \ref{ex.rel.ent}
\newcommand{\eqKL}{1.24} %% \eqref{eq.KL}
\newcommand{\bSigma}{{\mathbf{\Sigma}}}
\newcommand{\sumproduct}{sum-product}
%
% for cpi material
%
\newcommand{\sigbias}{\sigma_{\rm bias}}
\newcommand{\sigin}{\sigma_{\rm in}}
\newcommand{\sigout}{\sigma_{\rm out}}
\newcommand{\abias}{\alpha_{\rm bias}}
\newcommand{\ain}{\alpha_{\rm in}}
\newcommand{\aout}{\alpha_{\rm out}}
%\newcommand{\bff}{\bf}
\newcommand{\handfigs}{/home/mackay/handbook/figs}
\newcommand{\mjofigs}{/home/mackay/figs/mjo}
\newcommand{\FIGSlearning}{/home/mackay/book/FIGS/learning}
\newcommand{\codefigs}{/home/mackay/_doc/code/ps/ps}
%
% mncEL stuff
%
\newcommand{\ebnowide}[1]{\mbox{\psfig{figure=../../code/#1.ps,width=2.8in,angle=-90}}}
\newcommand{\fem}{m}
\newcommand{\feM}{M}
\newcommand{\fel}{n}
\newcommand{\feL}{N}
\renewcommand{\L}{N}
\newcommand{\feLm}{{\cal N}(m)}
\newcommand{\feMl}{{\cal M}(n)}
\newcommand{\feK}{N}
\newcommand{\fek}{n}
\newcommand{\feKn}{{\cal N}(m)}
\newcommand{\feNk}{{\cal M}(n)}
\newcommand{\feN}{M}
\newcommand{\fen}{m}
\newcommand{\fer}{r}
\newcommand{\GL}{GL}
\newcommand{\SMN}{GL}
\newcommand{\NMN}{MN}
\newcommand{\MN}{MN}
\renewcommand{\check}{check}% was relationship
\newcommand{\checks}{checks}% was relationship
\newcommand{\fs}{f_{\rm s}}
\newcommand{\fn}{f_{\rm n}}
\newcommand{\llncspunc}{.}
\newcommand{\query}{\mbox{{\tt{?}}}}
\newcommand{\lcA}{{H}}
\newcommand{\rmncNall}{/home/mackay/_doc/code/rmncNall}
\newcommand{\oneA}{1A}
\newcommand{\twoA}{2A}
\newcommand{\thrA}{2A}
\newcommand{\oneB}{1B}
\newcommand{\twoB}{2B}
\newcommand{\thrB}{2B}
\newcommand{\bndips}{/home/mackay/_doc/code/bndips}
\newcommand{\codeps}{/home/mackay/_doc/code/ps}
\newcommand{\equalnode}{\raisebox{-1pt}[0in][0in]{\psfig{figure=figs/gallager/equal.eps,width=8pt}\hspace{0mm}}}
\newcommand{\plusnode}{\raisebox{-1pt}[0in][0in]{\psfig{figure=figs/gallager/plus.eps,width=8pt}\hspace{0mm}}}
%
% Mon 26/5/03 modified this to try to centre the left heading
\newcommand{\fourfourtable}[9]{\begin{tabular}[b]{lcc@{\hspace{4pt}}c}
\multicolumn{1}{l}{#1:} & & \multicolumn{2}{c}{#2} \\[-0.1in]% \cline{1-1}
& & {#3} & {#4} \\ \cline{3-4}
\raisebox{-6.5pt}[0pt][0pt]{{#5}} &\multicolumn{1}{l|}{#3} & {#6} & {#7} \\[-7pt]
&\multicolumn{1}{l|}{#4} & {#8} & {#9} \\
\end{tabular}}
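% Usage sketch for \fourfourtable (argument roles read off the definition above;
% the entries a-d are placeholders, not data from the book):
%   \fourfourtable{$P(x,y)$}{$y$}{0}{1}{$x$}{$a$}{$b$}{$c$}{$d$}
% #1 is the table label (a colon is appended), #2 the heading over the two columns,
% #3 and #4 the two values, used both as column labels and as row labels,
% #5 the heading for the rows, and #6-#9 the four entries in row-major order.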
% Mon 26/5/03 extra version with heading right aligned and space reduced between col 1 and 2
\newcommand{\fourfourtabler}[9]{\begin{tabular}[b]{r@{}cc@{\hspace{4pt}}c}
\multicolumn{1}{l}{#1:} & & \multicolumn{2}{c}{#2} \\[-0.1in]% \cline{1-1}
& & {#3} & {#4} \\ \cline{3-4}
\raisebox{-6.5pt}[0pt][0pt]{{#5}} &\multicolumn{1}{l|}{#3} & {#6} & {#7} \\[-7pt]
&\multicolumn{1}{l|}{#4} & {#8} & {#9} \\
\end{tabular}}
\newcommand{\fourfourtablebeforemaythree}[9]{\begin{tabular}[b]{lcc@{\hspace{4pt}}c}
\multicolumn{1}{l}{#1:} & & \multicolumn{2}{c}{#2} \\[-0.1in]% \cline{1-1}
& & {#3} & {#4} \\ \cline{3-4}
{#5} &\multicolumn{1}{l|}{#3} & {#6} & {#7} \\[-7pt]
&\multicolumn{1}{l|}{#4} & {#8} & {#9} \\
\end{tabular}}
\newcommand{\fourfourtableb}[9]{\begin{tabular}[b]{l|c@{\hspace{1pt}}c@{\hspace{3pt}}c}
{#1} & {#2} & {#3} & {#4} \\ \cline{1-1}\cline{3-4}
\multicolumn{2}{l}{#5} & & \\
\multicolumn{1}{l|}{#3} & & {#6} & {#7} \\[-5pt]
\multicolumn{1}{l|}{#4} & & {#8} & {#9} \\
\end{tabular}}
\newcommand{\fourfourtableold}[9]{\begin{tabular}[b]{l|c|c|c|}
{#1} & {#2} & {#3} & {#4} \\ \cline{1-1}
\multicolumn{2}{l|}{#5} & & \\ \hline
\multicolumn{2}{l|}{#3} & {#6} & {#7} \\ \hline
\multicolumn{2}{l|}{#4} & {#8} & {#9} \\ \hline
\end{tabular}}
\newcommand{\mathsstrut}{\rule[-3mm]{0pt}{8mm}}
%
% for ra.tex
%
\newcommand{\halfw}{0.35in}
\newcommand{\onew}{0.9in}%{0.7in}% used in Gallager/MN figures in ra.tex% increased Wed 9/4/03
\newcommand{\onehalfw}{1.05in}
\newcommand{\twow}{1.4in}
\newcommand{\twohalfw}{1.75in}
\newcommand{\GHfig}[1]{\psfig{figure=GHps/#1,width=\onehalfw}}% for rate 1/3
\newcommand{\GHfigone}[1]{\psfig{figure=GHps/#1,width=\onew}}%
\newcommand{\GHfigthird}[1]{\psfig{figure=GHps/#1,width=\halfw}}
\newcommand{\GHfigquarter}[1]{\psfig{figure=GHps/#1,width=\twohalfw}}
\newcommand{\GHfigtwo}[1]{\psfig{figure=GHps/#1,width=\twow}}% for rate 1/2
\newcommand{\GHfigdouble}[1]{\psfig{figure=GHps/#1,width=\twohalfw}}% for five wide
% extra wide fitting::::::::::: (for turbo)
\newcommand{\GHfigdoubleE}[1]{\psfig{figure=GHps/#1,width=2in}}% for five wide
\newcommand{\GHfigE}[1]{\psfig{figure=GHps/#1,width=1.2in}}% for rate 1/3
%
\newcommand{\GHdrawfig}[1]{\psfig{figure=GHps/#1,width=1.5in}}% was 1.8
\newcommand{\standardfig}[1]{\psfig{figure=rirreg/#1,width=1.8in,angle=-90}}
\newcommand{\loopsfig}[1]{\psfig{figure=rirreg/loops.#1,height=1.85in,width=1.8in,angle=-90}}
\newcommand{\titledfig}[2]{\begin{tabular}{c}%
{#1}\\%
\standardfig{#2}\\%
\end{tabular}%
}
%
% for the single neuron chapters
%
\newcounter{funcfignum}
\setcounter{funcfignum}{1}
\newcommand{\funcfig}[2]{
\put(#1,#2){\makebox(0,0)[b]{
\begin{tabular}{@{}c@{}}
\psfig{figure=\FIGSlearning/f.#1.#2.ps,height=1.3in,width=1.3in,angle=-90} \\[-0.15in]
$\bw = (#1,#2)$
\\ \end{tabular}
}
}
}
\newcommand{\wflatfig}[1]{
\begin{tabular}{@{}c@{}}\setlength{\unitlength}{1in}\begin{picture}(1.5,1.3)(0.30,0.40)
\psfig{figure=\FIGSlearning/#1,height=2.43in,width=2.064in,angle=-90}
% was 1.3,1.3
\end{picture}\\\end{tabular}
}
\newcommand{\wsurfig}[1]{
\begin{tabular}{@{}c@{}}\setlength{\unitlength}{1in}\begin{picture}(1.5,1.5)(0,0)
\psfig{figure=\FIGSlearning/#1,height=1.8in,width=1.8in,angle=-90}
% was 1.5,1.5
\end{picture}\end{tabular}
}
\newcommand{\datfig}[1]{
\begin{tabular}{@{}c@{}}\setlength{\unitlength}{1in}\begin{picture}(1,1)(0.30,0.1)
\psfig{figure=\FIGSlearning/#1,height=1.2in,width=1.412in,angle=-90}
% was 1,1
\end{picture}\end{tabular}
}
\newcommand{\optens}{optimal input distribution}% used in l5.tex, l6.tex, s5.tex
\newcommand{\dilbertcopy}{{[Dilbert image Copyright \copyright\ 1997 United Feature Syndicate, Inc.,
used with permission.]}}
\newcommand{\Rnine}{\mbox{R}_9}
\newcommand{\Rthree}{\mbox{R}_3}
\newcommand{\eof}{{\Box}}
\newcommand{\teof}{\mbox{$\Box$}}% for use in text
\newcommand{\ta}{{\tt{a}}}
\newcommand{\tb}{{\tt{b}}}
%\newcommand{\dits}{dits}
%\newcommand{\dit}{dit}
\newcommand{\disc}{disk}
\newcommand{\dits}{bans}
\newcommand{\dit}{ban}
%
% used in l5
%
\newcommand{\BSC}{binary symmetric channel}
\newcommand{\BEC}{binary erasure channel}
\newcommand{\subsubpunc}{}% change to . if subsubsections are given in-line headings
%
% convolutional code definitions
%
\newcommand{\cta}{t^{(a)}}
\newcommand{\ctb}{t^{(b)}}
\newcommand{\z}{z}
\newcommand{\lfsr}{linear-feedback shift-register}
%
% definitions for including hinton diagrams from extended directory
%
\newcommand{\ecfig}[1]{\psfig{figure=extended/ps/#1.ps,silent=}}
% extra argument
\newcommand{\ecfigb}[2]{\psfig{figure=extended/ps/#1.ps,#2,silent=}}
%
% used in _s1 and in _linear maybe
%%%%%%%%%%% see /home/mackay/code/bucky
\newcommand{\buckypsfig}[1]{\mbox{\psfig{figure=buckyps/#1,width=1.2in}}}
\newcommand{\buckypsfigw}[1]{\mbox{\psfig{figure=buckyps/#1,width=1.75in}}}
\newcommand{\buckypsgraph}[1]{\mbox{\psfig{figure=buckyps/#1,width=1.2in,angle=-90}}}
\newcommand{\buckypsgraphb}[1]{\mbox{\psfig{figure=buckyps/#1,width=1.75in,angle=-90}}}
\newcommand{\buckypsgraphB}[1]{\mbox{\psfig{figure=buckyps/#1,width=2.2in,angle=-90}}}
%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%
% for l1a
%%%%%%%%%%%%%%%%%%
% example
% \bigrampicture{3.538mm}{hd_conbigram.ps}
% \bigrampicture{3.538mm}{hd_conbigram.ps,width=278pt}%%%%%%% 278 is the original size
% This used to work fine in latex209 then needed rejigging in 2e.
% (alignment of g,j,p,q,y wrong at the bottom) (saved to graveyard.tex)
\newcommand{\bigrampicture}[3]%args are unitlength,picturename-and-picturesize,font-request
{%%%%%%%%%
\setlength{\unitlength}{#1}
\begin{picture}(30,30)(0,-30)% was 28,28 0,-28
\put(0.15,-27.8){\makebox(0,0)[bl]{\psfig{figure=bigrams/#2,angle=-90}}}
\put(1,-29){\makebox(0,0)[b]{{#3\tt a}}}
\put(2,-29){\makebox(0,0)[b]{{#3\tt b}}}
\put(3,-29){\makebox(0,0)[b]{{#3\tt c}}}
\put(4,-29){\makebox(0,0)[b]{{#3\tt d}}}
\put(5,-29){\makebox(0,0)[b]{{#3\tt e}}}
\put(6,-29){\makebox(0,0)[b]{{#3\tt f}}}
\put(7,-29){\makebox(0,0)[b]{\raisebox{0mm}[0mm][0mm]{#3\tt g}}}
\put(8,-29){\makebox(0,0)[b]{{#3\tt h}}}
\put(9,-29){\makebox(0,0)[b]{{#3\tt i}}}
\put(10,-29){\makebox(0,0)[b]{\raisebox{0mm}[0mm][0mm]{#3\tt j}}}
\put(11,-29){\makebox(0,0)[b]{{#3\tt k}}}
\put(12,-29){\makebox(0,0)[b]{{#3\tt l}}}
\put(13,-29){\makebox(0,0)[b]{{#3\tt m}}}
\put(14,-29){\makebox(0,0)[b]{{#3\tt n}}}
\put(15,-29){\makebox(0,0)[b]{{#3\tt o}}}
\put(16,-29){\makebox(0,0)[b]{\raisebox{0mm}[0mm][0mm]{#3\tt p}}}
\put(17,-29){\makebox(0,0)[b]{\raisebox{0mm}[0mm][0mm]{#3\tt q}}}
\put(18,-29){\makebox(0,0)[b]{{#3\tt r}}}
\put(19,-29){\makebox(0,0)[b]{{#3\tt s}}}
\put(20,-29){\makebox(0,0)[b]{{#3\tt t}}}
\put(21,-29){\makebox(0,0)[b]{{#3\tt u}}}
\put(22,-29){\makebox(0,0)[b]{{#3\tt v}}}
\put(23,-29){\makebox(0,0)[b]{{#3\tt w}}}
\put(24,-29){\makebox(0,0)[b]{{#3\tt x}}}
\put(25,-29){\makebox(0,0)[b]{\raisebox{0mm}[0mm][0mm]{#3\tt y}}}
\put(26,-29){\makebox(0,0)[b]{{#3\tt z}}}
\put(27,-29){\makebox(0,0)[b]{{#3--}}}
% they used to be at height -29 and were aligned bottom
%\put(27,-29){\makebox(0,0)[b]{{#3\verb+-+}}}
%
\put(29,-29){\makebox(0,0)[r]{#3$y$}}
%
\put(-0.2,-1){\makebox(0,0)[r]{{#3\tt a}}}
\put(-0.2,-2){\makebox(0,0)[r]{{#3\tt b}}}
\put(-0.2,-3){\makebox(0,0)[r]{{#3\tt c}}}
\put(-0.2,-4){\makebox(0,0)[r]{{#3\tt d}}}
\put(-0.2,-5){\makebox(0,0)[r]{{#3\tt e}}}
\put(-0.2,-6){\makebox(0,0)[r]{{#3\tt f}}}
\put(-0.2,-7){\makebox(0,0)[r]{{#3\tt g}}}
\put(-0.2,-8){\makebox(0,0)[r]{{#3\tt h}}}
\put(-0.2,-9){\makebox(0,0)[r]{{#3\tt i}}}
\put(-0.2,-10){\makebox(0,0)[r]{{#3\tt j}}}
\put(-0.2,-11){\makebox(0,0)[r]{{#3\tt k}}}
\put(-0.2,-12){\makebox(0,0)[r]{{#3\tt l}}}
\put(-0.2,-13){\makebox(0,0)[r]{{#3\tt m}}}
\put(-0.2,-14){\makebox(0,0)[r]{{#3\tt n}}}
\put(-0.2,-15){\makebox(0,0)[r]{{#3\tt o}}}
\put(-0.2,-16){\makebox(0,0)[r]{{#3\tt p}}}
\put(-0.2,-17){\makebox(0,0)[r]{{#3\tt q}}}
\put(-0.2,-18){\makebox(0,0)[r]{{#3\tt r}}}
\put(-0.2,-19){\makebox(0,0)[r]{{#3\tt s}}}
\put(-0.2,-20){\makebox(0,0)[r]{{#3\tt t}}}
\put(-0.2,-21){\makebox(0,0)[r]{{#3\tt u}}}
\put(-0.2,-22){\makebox(0,0)[r]{{#3\tt v}}}
\put(-0.2,-23){\makebox(0,0)[r]{{#3\tt w}}}
\put(-0.2,-24){\makebox(0,0)[r]{{#3\tt x}}}
\put(-0.2,-25){\makebox(0,0)[r]{{#3\tt y}}}
\put(-0.2,-26){\makebox(0,0)[r]{{#3\tt z}}}
\put(-0.2,-27){\makebox(0,0)[r]{{#3--}}}
%\put(-0.2,-27){\makebox(0,0)[r]{{#3\verb+-+}}}
\put(-0.2,1){\makebox(0,0)[r]{#3$x$}}
\end{picture}
}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% used in ch 1:
\newcommand{\pB}{p_{\rm B}}
\newcommand{\pb}{p_{\rm b}}
% from theorems.tex for exact.tex
\newcommand{\PGB}{p^{\rm G}_{\rm B}}
\newcommand{\PGb}{p^{\rm G}_{\rm b}}
\newcommand{\PB}{p_{\rm B}}
\newcommand{\Pb}{p_{\rm b}}
%
% used in occam.tex (from nn_occam.tex)
\newlength{\minch}
\setlength{\minch}{0.82in}
\newcommand{\ostruta}{\rule[-0.07\minch]{0cm}{0.18\minch}}
\newcommand{\ostrutb}{\rule[-0.17\minch]{0cm}{0.14\minch}}
%
% sumproduct.tex
\newcommand{\gP}{P^*}
\newcommand{\xmwon}{\ensuremath{\bx_m \wo n}}
\newcommand{\xmwonb}{\ensuremath{\bx_{m \wo n}}}
% southeast.tex
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newcommand{\gridlet}[1]{\thinlines
\multiput(#1)(0,-2){4}{\line(1,0){7.22}}%
\multiput(#1)(2,0){4}{\line(0,-1){7.22}}}
%
\newcommand{\gridletfive}[1]{\thinlines
\multiput(#1)(0,-2){5}{\line(1,0){9.22}}%
\multiput(#1)(2,0){5}{\line(0,-1){9.22}}}
%
\newcommand{\piece}[1]{\put(#1){\circle*{0.872}}}
\newcommand{\opiece}[1]{\put(#1){\circle{0.872}}}
\newcommand{\movingpiece}[1]{%
\thinlines
\put(#1){\circle*{0.872}}
\put(#1){\vector(0,-1){2}}
\put(#1){\vector(1,0){2}}
}%end movingpiece
\newcommand{\lhnextposition}[2]{\hnextposition{#1}
\put(#1){\makebox(0,0)[bl]{\raisebox{2mm}{#2}}}}% labelled horizontal arrow
\newcommand{\ldnextposition}[2]{\dnextposition{#1}
\put(#1){\makebox(0,0)[tl]{\raisebox{0mm}{#2}}}}% labelled diagonal arrow
\newcommand{\hnextposition}[1]{\put(#1){\vector(1, 0){2}}}
\newcommand{\vnextposition}[1]{\put(#1){\vector(0, -1){2}}}
\newcommand{\dnextposition}[1]{\put(#1){\vector(-2,-1){4}}}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% dfountain
\newcommand{\Ripple}{\ensuremath{S}}
% deconvoln.tex
\newcommand{\noisenu}{n}
% for _s13.tex and one_neuron
\newcommand{\hammingsymbol}[7]{\setlength{\unitlength}{1.4mm}%
\begin{picture}(1,2)(0,0)%
\ifnum #1=1 \put(0,2){\line(1,0){1}} \fi%
\ifnum #2=1 \put(0,2){\line(0,-1){1}} \fi%
\ifnum #3=1 \put(1,2){\line(0,-1){1}} \fi%
\ifnum #4=1 \put(0,1){\line(1,0){1}} \fi%
\ifnum #5=1 \put(0,1){\line(0,-1){1}} \fi%
\ifnum #6=1 \put(1,1){\line(0,-1){1}} \fi%
\ifnum #7=1 \put(0,0){\line(1,0){1}} \fi%
\end{picture}%
}
\newcommand{\hammingdigit}[1]{%
\ifnum #1=6 \hammingsymbol{0}{0}{0}{1}{0}{1}{1}\fi%
\ifnum #1=14 \hammingsymbol{0}{0}{1}{0}{1}{1}{1}\fi%
\ifnum #1=2 \hammingsymbol{0}{0}{1}{1}{1}{0}{0}\fi%
\ifnum #1=1 \hammingsymbol{0}{1}{0}{0}{1}{1}{0}\fi%
\ifnum #1=10 \hammingsymbol{0}{1}{0}{1}{1}{0}{1}\fi%
\ifnum #1=12 \hammingsymbol{0}{1}{1}{0}{0}{0}{1}\fi%
\ifnum #1=4 \hammingsymbol{0}{1}{1}{1}{0}{1}{0}\fi%
\ifnum #1=11 \hammingsymbol{1}{0}{0}{0}{1}{0}{1}\fi%
\ifnum #1=0 \hammingsymbol{1}{0}{0}{1}{1}{1}{0}\fi%
\ifnum #1=7 \hammingsymbol{1}{0}{1}{0}{0}{1}{0}\fi%
\ifnum #1=13 \hammingsymbol{1}{0}{1}{1}{0}{0}{1}\fi%
\ifnum #1=5 \hammingsymbol{1}{1}{0}{0}{0}{1}{1}\fi%
\ifnum #1=9 \hammingsymbol{1}{1}{0}{1}{0}{0}{0}\fi%
\ifnum #1=3 \hammingsymbol{1}{1}{1}{0}{1}{0}{0}\fi%
\ifnum #1=8 \hammingsymbol{1}{1}{1}{1}{1}{1}{1}\fi%
}
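% Usage sketch (argument roles read off the \put commands in \hammingsymbol):
% the seven arguments switch on the segments of a seven-segment-style glyph,
%   #1 top bar, #2 upper-left, #3 upper-right, #4 middle bar,
%   #5 lower-left, #6 lower-right, #7 bottom bar,
% each drawn only when the argument equals 1. For example
%   \hammingsymbol{1}{1}{1}{0}{1}{1}{1}
% draws every segment except the middle bar, and
%   \hammingdigit{8}
% selects the pattern 1111111, i.e. all seven segments.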
% here in binary order.
%6 &\hammingsymbol{0}{0}{0}{1}{0}{1}{1} \\
%14&\hammingsymbol{0}{0}{1}{0}{1}{1}{1} \\
%2 &\hammingsymbol{0}{0}{1}{1}{1}{0}{0} \\
%1 &\hammingsymbol{0}{1}{0}{0}{1}{1}{0} \\
%10&\hammingsymbol{0}{1}{0}{1}{1}{0}{1} \\
%12&\hammingsymbol{0}{1}{1}{0}{0}{0}{1} \\
%4 &\hammingsymbol{0}{1}{1}{1}{0}{1}{0} \\
%11&\hammingsymbol{1}{0}{0}{0}{1}{0}{1} \\
%0 &\hammingsymbol{1}{0}{0}{1}{1}{1}{0} \\
%7 &\hammingsymbol{1}{0}{1}{0}{0}{1}{0} \\
%13&\hammingsymbol{1}{0}{1}{1}{0}{0}{1} \\
%5 &\hammingsymbol{1}{1}{0}{0}{0}{1}{1} \\
%9 &\hammingsymbol{1}{1}{0}{1}{0}{0}{0} \\
%3 &\hammingsymbol{1}{1}{1}{0}{1}{0}{0} \\
%8 &\hammingsymbol{1}{1}{1}{1}{1}{1}{1} \\
\newcommand{\ldpcc}{low-density parity-check code}
%\newcommand{\Ldpc}{Low-density parity-check}% defined elsewhere
% included by l2.tex
% definitions for weighings.tex and for text
% shows weighing trees, ternary
%
% decisions of what to weigh are shown in square boxes with 126 over 345 (l:r)
% state of valid hypotheses are listed in double boxes
% or maybe dashboxes?
% three arrows, up means left heavy, straight means right heavy, down is balance
%
\newcommand{\mysbox}[3]{\put(#1){\framebox(#2){\begin{tabular}{c}#3\end{tabular}}}}
\newcommand{\mydbox}[3]{\put(#1){\framebox(#2){\begin{tabular}{c}#3\end{tabular}}}}
\newcommand{\myuvector}[3]{\put(#1){\vector(#2){#3}}}
\newcommand{\mydvector}[3]{\put(#1){\vector(#2){#3}}}
\newcommand{\mysvector}[2]{\put(#1){\vector(1,0){#2}}}
\newcommand{\mythreevector}[4]{\myuvector{#1}{#2,#3}{#4}\mydvector{#1}{#2,-#3}{#4}\mysvector{#1}{#4}}
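% Usage sketch (coordinates and box contents are illustrative assumptions):
%   \setlength{\unitlength}{1mm}
%   \begin{picture}(40,30)(0,-15)
%     \mysbox{0,-4}{12,8}{1 2 3\\4 5 6}% weighing box at (0,-4), 12 wide, 8 high
%     \mythreevector{12,0}{2}{1}{10}% three arrows leaving the box at (12,0)
%   \end{picture}
% \mythreevector{x,y}{dx}{dy}{len} expands to \vector(dx,dy), \vector(dx,-dy) and
% \vector(1,0), all starting at (x,y) with horizontal extent len; dy should be
% given as a positive integer, since the macro prepends the minus sign itself.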
%
%\newcommand{\h1}{\mbox{$1^+$}}
%\newcommand{\l1}{\mbox{$1^-$}}
%\newcommand{\h2}{\mbox{$2^+$}}
%\newcommand{\l2}{\mbox{$2^-$}}
%\newcommand{\h3}{\mbox{$3^+$}}
%\newcommand{\l3}{\mbox{$3^-$}}
%\newcommand{\h4}{\mbox{$4^+$}}
%\newcommand{\l4}{\mbox{$4^-$}}
%\newcommand{\h5}{\mbox{$5^+$}}
%\newcommand{\l5}{\mbox{$5^-$}}
%\newcommand{\h6}{\mbox{$6^+$}}
%\newcommand{\l6}{\mbox{$6^-$}}
%\newcommand{\h7}{\mbox{$7^+$}}
%\newcommand{\l7}{\mbox{$7^-$}}
%\newcommand{\h8}{\mbox{$8^+$}}
%\newcommand{\l8}{\mbox{$8^-$}}
%\newcommand{\h9}{\mbox{$9^+$}}
%\newcommand{\l9}{\mbox{$9^-$}}
%\newcommand{\h10}{\mbox{$10^+$}}
%\newcommand{\l10}{\mbox{$10^-$}}
%\newcommand{\h11}{\mbox{$11^+$}}
%\newcommand{\l11}{\mbox{$11^-$}}
%\newcommand{\h12}{\mbox{$12^+$}}
%\newcommand{\l12}{\mbox{$12^-$}}
%\setlength{\parindent}{0mm}
\title{Information Theory, Inference, \& Learning Algorithms}
\shortlecturetitle{}
\shortauthor{David J.C. MacKay}
% the book - called by book.tex
%
% aiming for 696 pages total
%
% thebook.tex
% should run
% make book.ind
% by hand?
% Mon 7/10/02
\setcounter{exercise_number}{1} % set to imminent value
%
\setcounter{secnumdepth}{1} % sets the level at which subsection numbering stops
\setcounter{tocdepth}{0}
\newcommand{\mysetcounter}[2]{}%was {\setcounter{#1}{#2}}
% useful for forcing pagenumbers in drafts
%\setcounter{tocdepth}{1}
\renewcommand{\bs}{{\bf s}}
\newcommand{\figs}{/home/mackay/handbook/figs} % while in bayes chapter
% \addtocounter{page}{-1}
\pagenumbering{roman}
\setcounter{page}{2} % set to current value
\setcounter{frompage}{2}% this is used by newcommands1.tex dvips operator that helps make
\setcounter{page}{1} % set to current value
\setcounter{frompage}{1}% this is used by newcommands1.tex dvips operator that helps make
% individual chapters.
%
% PAGE ii
%
% \chapter*{Dedication}
%\input{tex/dedicationa.tex}
%\newpage
%
% TITLE PAGE iii
%
\thispagestyle{empty}
\begin{narrow}{0in}{-\margindistancefudge}%
\begin{raggedleft}
~\\[1.5in]
{\Large \bf Information Theory,
Inference,
and Learning Algorithms\\[1in]
}
{\Large\sf David J.C. MacKay }\\
\end{raggedleft}
\vfill
\mbox{}\epsfxsize=160pt\epsfbox{cuplogo.eps}% increased x size to compensate for 0.9 shrinkage later and another 10%
% \mbox{}\epsfxsize=128pt\epsfbox{cuplogo.eps}
\vspace*{-6pt}
\end{narrow}
\newpage
\thispagestyle{empty}
\begin{center}
~\\[1.5in]
{\Huge \bf Information Theory, \\[0.2in]
Inference,\\[0.2in]
and Learning Algorithms\\[1in]
}
{\Large\sf David J.C. MacKay }\\
{\tt{mackay@mrao.cam.ac.uk}}\\[0.3in]
\copyright 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004\\[0.1in]
\copyright Cambridge University Press 2003\\[1.3in]
Version \thedraft\ (third printing) \today\\
\medskip
\medskip
\medskip
\medskip
\medskip
Please send feedback on this book via
{\tt{http://www.inference.phy.cam.ac.uk/mackay/itila/}}
\medskip
\medskip
\medskip
Version 6.0 of this book was published by C.U.P.\ in September 2003.
It will remain viewable on-screen on the above website, in postscript, djvu,
and pdf formats.
\medskip
\medskip
In the second printing (version 6.6) minor typos were corrected,
and the book design was slightly altered to modify the placement of section numbers.
\medskip
\medskip
In the third printing (version 7.0) minor typos were corrected, and chapter 8 was
renamed `Dependent random variables' (instead of `Correlated').
\medskip
\medskip
\medskip
{\em (C.U.P. replace this page with their own page ii.)}
\end{center}
%\dvipsb{frontpage}
\newpage
% choose one of these:
% \input{cambridgefrontstuff.tex}
% \newpage
% {\em Page vi intentionally left blank.}
%
\newpage
% pages v and vi pages vii and viii
\mytableofcontents
\dvipsb{table of contents}
% alternate
%\fakesection{Roadmap}
%\input{roadmap.tex}
%
\subchaptercontents{Preface}%{How to Use This Book}% use subchapter because this
% marks the chapter name in the header, unlike chapter*{}
% \section*{How to use this book}
%{\em [This front matter is still being written. The remainder of the book is essentially finished,
% except for typographical corrections, April 18th 2003.]}
%
% a longer version of this is in
% longabout.tex
% \section*{How to use this book}
% \section{How to use this book}
% The first question we must address is:
This book is aimed at senior undergraduates and graduate students in
Engineering, Science, Mathematics, and Computing. It expects
familiarity with calculus, probability theory, and
linear algebra as taught in a first- or second-year
undergraduate course on mathematics for
scientists and engineers.
Conventional courses on information theory
cover not only the beautiful {\em theoretical\/} ideas of Shannon,
but also {\em practical\/} solutions to \ind{communication} problems.
This book
goes further, bringing in Bayesian data modelling,
Monte Carlo methods, variational
methods, clustering algorithms, and neural networks.
Why unify information theory and
machine learning?
% Well,
Because they
% Information theory and
% machine learning
are two sides of the same coin.
% , so it makes sense to unify them.
% These two fields were once unified:
% It was once so:
In the 1960s, a single field, cybernetics, was populated
by information theorists, computer scientists, and neuroscientists, all
studying common problems.
Information theory and machine learning still belong together.
Brains are the ultimate compression and \ind{communication} systems.
And the state-of-the-art algorithms
for both data compression and error-correcting codes
use the same tools as machine learning.
% Our brains are surely the ultimate in robust
% error-correcting information storage and recall systems.
\section*{How to use this book}
The essential dependencies between chapters are indicated in
the figure on the next page. An arrow from one chapter to another
indicates that the second chapter requires some of the first.
%\section*{General points}
% The pinnacles of the book, the key chapters with the really exciting bits,
% are first \chref{chone} (in which we meet Shannon's noisy-channel coding theorem);
% \chref{ch.six} (in which we prove it); \chref{ch.hopfield} (in which
% we meet a neural network that performs robust error-correcting
% content-addressable memory); and Chapters \ref{ch.ldpcc} and \ref{chdfountain}
% (in which we meet beautifully simple sparse-graph codes that solve
% Shannon's communication problem).
%% honorable mention - \chref{ch.ac}, ch.ra /////\ exact sampling - not central.
% Do not feel daunted by this book.
% You don't need to read all of this book.
Within {\partnoun}s \datapart, \noisypart, \probpart, and \netpart\ of this book, chapters
on advanced or optional topics are
towards the end.
% For example, \chref{ch.codesforintegers} (Codes for Integers), \chref{ch.xword} (Crosswords and Codebreaking)
% and \chref{ch.sex} (Why have Sex? Information Acquisition and Evolution)
% are provided for fun.
All chapters of {\partnoun} \finfopart\
are optional on a first reading, except perhaps for
\chref{ch.message} (Message Passing).
The same system sometimes applies within a chapter:
the final sections often deal with
advanced topics that can be skipped on a first reading.
For example, in two key chapters --
\chref{chtwo} ({The Source Coding Theorem}) and \chref{ch.six} ({The Noisy-Channel Coding Theorem}) --
the first-time reader should detour
at \secref{sec.chtwoproof} and \secref{sec.ch6stop} respectively.
% \subsection*{Roadmaps}
Pages \pageref{map1}--\pageref{map4} show a few ways to use this book.
First, I give the roadmap for a course that I teach in Cambridge:
% which embraces both information theory and machine learning.
`Information theory, pattern recognition, and neural networks'.
%
The book is also intended as a textbook for
traditional courses in information theory.
The second roadmap
shows the chapters for
an introductory information theory course
and the third
for a course aimed at an understanding of
state-of-the-art error-correcting codes.
%
The fourth roadmap shows how to use the text in a
conventional course on machine learning.
% The diagrams on the following pages will indicate
% the dependences between chapters and
% a few possible routes through the book.
\newpage
\begin{center}\hspace*{-0.2cm}\raisebox{2cm}{\epsfbox{metapost/roadmap.2}}\end{center}
\newpage
% \input{tex/cambroadmap.tex}
% \newpage
\begin{center}\raisebox{2cm}{\epsfbox{metapost/roadmap.3}}\end{center}
\label{map1}
\newpage
\begin{center}\raisebox{2cm}{\epsfbox{metapost/roadmap.4}}\end{center}
\newpage
\begin{center}\raisebox{2cm}{\epsfbox{metapost/roadmap.5}}\end{center}
\newpage
\begin{center}\raisebox{2cm}{\epsfbox{metapost/roadmap.6}}\end{center}
\label{map4}
\newpage
\section*{About the exercises}
% I firmly believe that
You can understand a subject only by
creating it for yourself.
% To this end, you should
% I think it is essential to
The exercises
play an essential role in this book.
% on each topic.
For guidance, each
% exercise
has a rating (similar to that used by \citeasnoun{KnuthAll})
from 1 to 5 to indicate its difficulty.
\noindent\ratfull\hspace*{\parindent}In addition, exercises that are especially recommended
are marked by a marginal encouraging rat.
Some exercises that require the use of a computer are
marked with a {\sl C}.
% will have
% a rating such as A1, A5, C1 or C5.
% The letter indicates how important I think the exercise is:
% A = very important $\ldots$ C = not essential to the flow of the
% book. The number indicates the difficulty of the problem:
% 1 = easy, 5 = research project.
% I'll circulate detailed recommendations on exercises
% as the course progresses.
Answers to many exercises are provided. Use them
wisely. Where a solution is provided, this is indicated
by including its page number
% of the solution with
alongside the difficulty rating.
Solutions to many of the other exercises
will be supplied to instructors using this book in their
teaching; please email {\tt{solutions@cambridge.org}}.
%\begin{table}[htbp]
%\caption[a]
\begin{realcenter}
\fbox{
\begin{tabular}{ll}
%\begin{minipage}{3in}
{\sf Summary of codes for exercises}\\[0.2in]
% \hspace{0.2in}
\begin{tabular}[b]{cl}
\dorat & Especially recommended \\[0.2in]
{\ensuremath{\triangleright}} & Recommended \\
{\sl C} & Parts require a computer \\
{\rm [p.$\,$42]}& Solution provided on page 42 \\
\end{tabular}
%\end{minipage}
&
\begin{tabular}[b]{cl}
\pdifficulty{1} & Simple (one minute) \\
\pdifficulty{2} & Medium (quarter hour) \\
\pdifficulty{3} & Moderately hard \\
\pdifficulty{4} & Hard \\
\pdifficulty{5} & Research project \\[0.2in]
\end{tabular}
\\
\end{tabular}
}
\end{realcenter}
%\end{table}
\section*{Internet resources}
The website
\begin{realcenter}
{\tt{http://www.inference.phy.cam.ac.uk/mackay/itila}}
\end{realcenter}
contains several resources:
\ben
\item
{\em Software}.
Teaching software that I use in lectures,\index{software}
interactive software, and research software,
written in {\tt{perl}}, {\tt{octave}}, {{\tt{tcl}}}, {\tt{C}}, and {\tt{gnuplot}}.
Also some animations.
\item
{\em Corrections to the book}. Thank you in advance for emailing these!
\item
{\em This book}.
The book is provided in {\tt{postscript}}, {\tt{pdf}}, and {\tt{djvu}}
formats for on-screen viewing. The same copyright restrictions
apply as to a normal book.
% \item
% {\em Further worked solutions to some exercises}.
% If you would like to send in your own solutions for inclusion,
% please do.
\een
% {\em (I aim to add a table of software resources here.)}
\section*{About this edition}
This is the third printing of the first edition.
In the second printing,
% a small number of typographical errors were corrected,
% and
the design of the book was altered slightly.
% to allow a slightly larger font size.
Page-numbering generally remains unchanged,
% consistent between the two printings,
except in chapters 1, 6, and 28,
where
% with the exception of pages 7 to 13, where
% among which
a few paragraphs, figures, and equations have
moved around.
% on which text, figures, and equations have all been slightly rearranged.
All equation, section, and exercise numbers are unchanged.
In the third printing, chapter 8 has been renamed
`Dependent Random Variables', instead of `Correlated', which was sloppy.
% BEWARE, _RNGaussian.tex had to be changed for the asides.
%\input{tex/secondprint.tex}% about the second printing
\section*{Acknowledgments}
%\chapter*{Acknowledgments}
I am most grateful to the organizations who have supported
me while this book gestated: the Royal Society and Darwin College
who gave me a fantastic research fellowship
in the early years; the University of Cambridge;
the Keck Centre at the University of California in San Francisco,
where I spent a productive sabbatical;
% (and failed to finish the book);
and
the Gatsby Charitable Foundation, whose support gave me the
freedom to break out of the Escher staircase that book-writing had become.
My work has depended on the generosity of free software authors.\index{software!free}\index{Knuth, Donald}
I wrote the book in \LaTeXe. Three cheers for Donald Knuth and Leslie Lamport!
%\nocite{latex}
Our computers run the GNU/Linux operating system. I use {\tt{emacs}}, {\tt{perl}}, and
{\tt{gnuplot}} every day. Thank you Richard Stallman, thank you Linus Torvalds,
thank you everyone.
% I thank David Tranah of Cambridge University Press for his editorial support.
% ``cut, it's my job''
Many readers, too numerous to name here,
have given feedback on the book, and to
them all I extend my sincere acknowledgments.
%
I especially wish to thank all the students and colleagues
at Cambridge University who have attended my lectures on
information theory and machine learning over the last nine years.
% Without their enthusiasm and criticism, this book would surely
The members of the Inference research group have given immense support,
and I thank them all for their generosity and patience over the last ten years:
Mark Gibbs, Michelle Povinelli, Simon Wilson, Coryn Bailer-Jones, Matthew Davey,
Katriona Macphee, James Miskin, David Ward, Edward Ratzer, Seb Wills, John Barry,
John Winn, Phil Cowans, Hanna Wallach, Matthew Garrett, and especially Sanjoy Mahajan.
Thank you too to Graeme Mitchison, Mike Cates, and Davin Yap.
Finally I would like to express my debt to my personal heroes,
the mentors from whom I have learned so much:
Yaser Abu-Mostafa,
Andrew Blake,
John Bridle,
Peter Cheeseman,
Steve Gull,
Geoff Hinton,
John Hopfield,
Steve Luttrell,
Robert MacKay,
Bob McEliece,
Radford Neal,
Roger Sewell,
and
John Skilling.
%%%%%%%%%%%%%%
%\chapter*{Dedication}
%\vspace*{80pt}
\vfill
\begin{center}
\rule{\textwidth}{1pt} \par \vskip 18pt
{ \huge \sl
{Dedication} }
\par
%\end{center}
\nobreak \vskip 40pt
%\begin{center}
This book is dedicated to the campaign against the arms trade.\\[0.3in]
%
% Their web page is
% , as overburdened with animated images as the world is with weapons, is here:
%\verb+http://www.caat.demon.co.uk/+\\[0.6in]
\verb+www.caat.org.uk+\\[0.6in]
\end{center}
\begin{quote}
\begin{raggedleft}
Peace cannot be kept by force.\\
It can only be achieved
% by understanding.
% Peace cannot be achieved through violence, it can only be attained
through understanding.\\
\hfill -- {\em Albert Einstein}\\
\end{raggedleft}
\end{quote}
\vspace*{2pt}
\rule{\textwidth}{1pt} \par
% Two things are infinite: the universe and human stupidity; and I'm not sure
% about the the universe.
%The important thing is not to stop questioning. Curiosity has its own reason for
% existing.
%Any intelligent fool can make things bigger, more complex, and more violent. It
% takes a touch of genius -- and a lot of courage -- to move in the opposite
% direction.
% \input{extrafrontstuff.tex}% aims dedication, about the author, etc
% see also tex/oldaims.tex
% for some good stuff.
% and tex/typicalreaders.tex
%
%% \input{tex/overview2001.tex}
%\dvipsb{preface}
\newpage
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\setcounter{page}{0} % set to current value
%Fake page % added to get draft.dps to look right
%\newpage
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\pagenumbering{arabic}
\prechapter{About Chapter}
\setcounter{page}{1} % set to current value
\label{pch.one}
%
% pre-chapter 1
%
\fakesection{Before ch 1}
In the first chapter, you will need to be familiar with the \ind{binomial distribution}.
% , reviewed below.
And to solve the exercises in the text --
which I urge you to do -- you will need to know {\dem\ind{Stirling's
approximation}\/}\index{approximation!Stirling}
for the factorial function, $%\beq
x! \simeq x^{x} \, e^{-x}
$,
and be able to
apply it to ${{N}\choose{r}} =
\smallfrac{N!}{(N-r)!\,r!}$.\marginpar{\small\raggedright{Unfamiliar notation?\\ See
\appref{app.notation}, \pref{app.notation}.}}
% $x!$
These topics are reviewed below.
\subsection*{The binomial distribution}
\label{sec.first.binomial}
\exampl{ex.binomial}{
A bent coin has probability $f$ of coming up heads.
The coin is tossed $N$ times.
What is the probability
distribution of the number of heads, $r$?
What are the \ind{mean} and \ind{variance} of $r$?
}
\amarginfig{t}{%
\begin{tabular}{r}
% $P(r\given f,N)$\\
\mbox{\psfig{figure=bigrams/urn.f.g.ps,angle=-90,width=1.51in}}%
%\\
%\mbox{\psfig{figure=bigrams/urn.f.l.ps,angle=-90,width=1.64in}}%
\\[-0.1in]
\multicolumn{1}{c}{\small$r$}
\\
\end{tabular}
%}{%
\caption[a]{The binomial distribution $P(r \given f\eq 0.3,\,N \eq 10)$.}
% , on a linear scale (top) and a logarithmic scale (bottom).}
\label{fig.binomial}
}
% see bigrams/README
\noindent
%\begin{Sexample}{ex.binomial}
{\sf Solution\colonspace}
\label{sec.first.binomial.sol}
The number of heads
has a binomial distribution.
\beq P(r \given f,N) = {N \choose r} f^{r} (1-f)^{N-r} . \eeq
The mean, $\Exp [ r ]$, and variance, $\var[r]$,
of this distribution are
defined by
\beq
\Exp [ r ] \equiv \sum_{r=0}^{N} P(r\given f,N) \, r
\label{eq.mean.def}
\eeq
\beqan
\var[r] & \equiv &
\Exp \left[ \left( r - \Exp [ r ] \right)^2 \right] \\
& = &
\Exp [ r^2 ] - \left( \Exp [ r ] \right)^2
= \sum_{r=0}^{N} P(r\given f,N) r^2 - \left( \Exp [ r ] \right)^2 .
\label{eq.var.sum}
\eeqan
%
Rather than evaluating the sums over $r$ in (\ref{eq.mean.def}) and (\ref{eq.var.sum}) directly,
it is easiest to obtain the mean and variance by noting that $r$
is the sum of $N$ {\em independent\/}
% , identically distributed
random variables, namely, the number of heads in the
first toss (which is either zero or one),
the number of heads in the second toss, and so forth.
In general,
\beq
\begin{array}{rcll}
\Exp [ x + y ] &=& \Exp [ x ] + \Exp [ y ] & \mbox{for any random variables $x$ and $y$};
\\
\var [ x + y ] &=& \var [ x ] + \var [ y ] & \mbox{if $x$ and $y$ are independent}.
\end{array}
\eeq
So the mean of $r$ is the sum of the means of those random
variables, and the variance of $r$ is the sum of their variances.\index{variances add}
% its mean and variance are given by adding the means and variances
% of those random variables, respectively.
The mean number of heads in a single toss
is $f\times 1 + (1-f)\times 0 = f$, and the variance of the
number of heads in a single toss is
\beq
\left[ f\times 1^2 + (1-f)\times 0^2 \right] - f^2 = f - f^2 = f(1-f),
\eeq
so the mean and variance of $r$ are:
\beq \Exp [ r ] = N f
%\eeq\beq
\hspace{0.35in} \mbox{and} \hspace{0.35in}
\var[r] = N f (1-f) . \hspace{0.35in}\epfsymbol\hspace{-0.35in}
\eeq
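As a quick numerical check, take the parameters used in \figref{fig.binomial},
$f \eq 0.3$ and $N \eq 10$ (any other values would serve equally well):
\[
 \Exp [ r ] = 10 \times 0.3 = 3
 \hspace{0.35in} \mbox{and} \hspace{0.35in}
 \var[r] = 10 \times 0.3 \times 0.7 = 2.1 ,
\]
so the standard deviation of $r$ is $\sqrt{2.1} \simeq 1.45$.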
%\end{Sexample}
% ADD END PROOF SYMBOL HERE !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
\subsection*{Approximating $x!$ and ${{N}\choose{r}}$}
\amarginfig{t}{%
\begin{tabular}{r}
\mbox{\psfig{figure=bigrams/poisson.g.ps,angle=-90,width=1.5in}}%
%\\
%\mbox{\psfig{figure=bigrams/poisson.l.ps,angle=-90,width=1.64in}}%
\\[-0.1in]
\multicolumn{1}{c}{\small$r$}
\\
\end{tabular}
%}{%
\caption[a]{The Poisson distribution $P(r \given \l\eq 15)$.}
% , on a linear scale (top) and a logarithmic scale (bottom).}
\label{fig.poisson}
}
% see bigrams/README
\label{sec.poisson}
% FAVOURITE BIT
\noindent
Let's derive Stirling's approximation by an unconventional route.
We start from the \ind{Poisson distribution} with mean $\l$,
\beq
P( r \given \l ) = e^{-\l} \frac{\l^r}{r!} \:\:\:\:
\:\: r\in \{ 0,1,2,\ldots\} .
\label{eq.poisson}
\eeq
%
% \noindent
For large $\l$, this distribution is well approximated -- at least\index{approximation!by Gaussian}
in the vicinity of $r \simeq \l$ -- by
a \ind{Gaussian distribution} with mean $\l$ and variance $\l$:
% So,
\beq
e^{-\l} \frac{\l^r}{r!} \,\simeq\, \frac{1}{\sqrt{2\pi \l}}
\, e^{{ -\smallfrac{(r-\l)^2}{2\l}}} .
\eeq
Let's plug $r=\l$ into this formula.\label{sec.stirling}
\beqan
e^{-\l} \frac{\l^{\l}}{\l!} &\simeq& \frac{1}{\sqrt{2\pi \l}}
\\
\Rightarrow \l! &\simeq& \l^{\l} \, e^{-\l} \sqrt{2\pi \l} .
\eeqan
This is {Stirling's approximation}
for the \ind{factorial} function.
\beq
x! \,\simeq\, x^{x} \, e^{-x} \sqrt{2\pi x} \:\:\:\Leftrightarrow\:\:\:
\ln x! \,\simeq\, x \ln x - x + {\textstyle\frac{1}{2}} \ln {2\pi x} .
\label{eq.stirling}
\eeq
We have derived not only the
leading order behaviour, $x! \simeq x^{x} \, e^{-x}$,
but also, at no cost, the next-order correction
term $\sqrt{2\pi x}$.
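To get a feel for the size of this correction, here is a rough check
at $x \eq 10$ (an arbitrary choice; all figures are rounded).
The exact value is $10! = 3\,628\,800$, whereas
\[
 10^{10} \, e^{-10} \simeq 4.54 \times 10^{5}
 \hspace{0.25in} \mbox{and} \hspace{0.25in}
 10^{10} \, e^{-10} \sqrt{2\pi \times 10} \simeq 3.60 \times 10^{6} .
\]
The leading-order expression alone is too small by a factor of about 8;
with the factor $\sqrt{2\pi x}$ included, the approximation is within
one \percent\ of the exact value.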
%
We now apply Stirling's approximation
% the approximation
%$%\beq
% x! \simeq x^{x} \, e^{-x} $
to\index{combination}
$%\beq
\ln {{N}\choose{r}}
$:%\eeq
\beqan
\ln {{N}\choose{r}}
\,\equiv\, \ln \frac{N!}{(N-r)!\,r!}
% & \simeq &
% N [ \ln N - 1 ] - (N-r) [ \ln (N-r) - 1 ] - r [ \ln r - 1 ]
%\\
& \simeq & (N-r) \ln\frac{N}{N-r} + r \ln\frac{N}{r}
.
\label{eq.choose.approx}
\eeqan
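The intermediate steps might be sketched as follows.
This sketch uses only the leading-order form $\ln x! \simeq x \ln x - x$,
dropping the $\frac{1}{2} \ln 2 \pi x$ terms.
The linear terms cancel because $-N + (N-r) + r = 0$, and
$N \ln N$ is split into $(N-r) \ln N + r \ln N$:
\begin{eqnarray*}
 \ln \frac{N!}{(N-r)!\,r!}
 & \simeq & N \ln N - N - (N-r) \ln (N-r) + (N-r) - r \ln r + r \\
 & = & (N-r) \ln N + r \ln N - (N-r) \ln (N-r) - r \ln r \\
 & = & (N-r) \ln\frac{N}{N-r} + r \ln\frac{N}{r} .
\end{eqnarray*}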
Since all the terms in this equation are logarithms,
this result can be rewritten in any base.\marginpar{\small Recall that
$\displaystyle{ \log_2 x = \frac{ \log_e x }{ \log_e 2} }$.\\[0.03in]
Note that $\displaystyle\frac{\partial \log_2 x }{\partial x} =
\frac{1}{\log_e 2}\,\frac{1}{x}$.
}
%\fakesubsection*{My rule about log and ln}
We will denote\index{conventions!logarithms}\index{notation!logarithms}
natural logarithms ($\log_e$) by `ln', and \ind{logarithms}
to base 2 ($\log_2$)
by `$\log$'.
If we introduce the {\dbf\ind{binary entropy function}},
\beq
H_2(x) \equiv x \log \frac{1}{x} + (1\! -\! x) \log \frac{1}{(1\! -\! x)} ,
\eeq
then we can rewrite the approximation (\ref{eq.choose.approx})
%\beq
%$ \log {{N}\choose{r}}
% \simeq (N-r) \log \frac{N}{N-r} + r \log \frac{N}{r}
%$
%\eeq
as
\amarginfig{t}{\small%
\begin{center}
\mbox{
\hspace{-6mm}
% \hspace{6.2mm}
\raisebox{\hpheight}{$H_2(x)$}
% to put H at left:
\hspace{-7.5mm}
% \hspace{-20mm}
\mbox{\psfig{figure=figs/H2.ps,%
width=42mm,angle=-90}}$x$
}
% see also H2p.tex
\end{center}
\caption[a]{The binary entropy function.}
% $H_2(x)$.}
\label{fig.h2x}
}
\beq
\log {{N}\choose{r}}
\, \simeq \, N H_2(r/N) ,
\label{eq.stirling.choose.l}
\eeq
or, equivalently,
% \:\:\:\Leftrightarrow\:\:\:
\beq
{{N}\choose{r}}
\, \simeq \, 2^{N H_2(r/N)} .
\label{eq.stirling.choose}
\eeq
If we need a more accurate approximation, we
can include terms of the next order from
Stirling's approximation
(\ref{eq.stirling}):
\beq
\log {{N}\choose{r}}
\,\simeq\, N H_2(r/N) -
{\textstyle\frac{1}{2}} \log \left[ {2\pi N \, \frac{N\!-\!r}{N} \,
\frac{r}{N}} \right]
.
\label{eq.H2approxaccurate}
\eeq
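As a rough illustration of how much the extra term matters,
take $N \eq 10$ and $r \eq 3$ (arbitrary small values chosen for this
sketch; all figures are rounded), for which
${{10}\choose{3}} = 120$ exactly and $\log 120 \simeq 6.91$.
With $H_2(0.3) \simeq 0.881$ we have
\[
 N H_2(r/N) \simeq 8.81
 \hspace{0.25in} \mbox{and} \hspace{0.25in}
 {\textstyle\frac{1}{2}} \log \left[ 2\pi \times 10 \times 0.7 \times 0.3 \right]
 \simeq 1.86 ,
\]
so the simple approximation (\ref{eq.stirling.choose}) gives
about $2^{8.81} \simeq 450$, whereas
the more accurate approximation (\ref{eq.H2approxaccurate}) gives
about $2^{6.95} \simeq 124$.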
%
% - {\textstyle\frac{1}{2}} \ln {2\pi N}
% + {\textstyle\frac{1}{2}} \ln {2\pi N-r}
% + {\textstyle\frac{1}{2}} \ln {2\pi r}
%
% ln += {\textstyle\frac{1}{2}} \ln {2\pi (N-r)(r)/N}
% log_2 += {\textstyle\frac{1}{2}} \log_2 {2\pi (N-r)(r)/N}
% or
% log_2 += {\textstyle\frac{1}{2}} \log_2 {2\pi N}
% + {\textstyle\frac{1}{2}} \log_2 {\frac{(N-r)}{N}\frac{r}{N}}
% log_2 += {\textstyle\frac{1}{2}} \log_2 {2\pi \frac{(N-r)}{N}\frac{r}{N} N}
\ENDprechapter
\chapter{Introduction to Information Theory}
\label{ch.one}
\label{chone}
% % \part{Information Theory}
% \chapter{Introduction to Information Theory}
\label{ch1}
%\section{Communication over noisy channels}
% One of the principal questions addressed by information theory is
% Shannon's ground-breaking paper on `The Mathematical Theory of
% Communication' opens thus:
\begin{quotation}
\noindent
The fundamental problem of \ind{communication} is that of reproducing at one point
either exactly or approximately a message selected at another point.
\\
\mbox{~} \hfill {\em (Claude Shannon, 1948)}\index{Claude Shannon} \\
%
\end{quotation}
\noindent
In the first half of
this book we
%are going to
study how to measure information content;
we
% are going to
% learn by how much data from a given source
% can be compressed; we
% are going to
learn how
% , practically, to
% achieve data compression;
to compress data; and we
% are going to
learn how to communicate
perfectly over imperfect communication channels.
We start by getting a feeling for this last problem.
\section[How can we achieve perfect communication?]{How
can we achieve perfect communication over an imperfect, noisy
communication channel?}
Some examples of noisy communication channels are:
\bit
\item
an analogue telephone
line,\marginpar{\footnotesize
\setlength{\unitlength}{1mm}%
\begin{picture}(45,10)(0,5)
\put(0,10){\makebox(0,0)[l]{\shortstack{modem}}}
\put(21,10){\makebox(0,0)[l]{\shortstack{phone\\line}}}
\put(39,10){\makebox(0,0)[l]{\shortstack{modem}}}
\put(15,10){\vector(1,0){3}}
\put(32,10){\vector(1,0){3}}
\end{picture}
}
over which two modems communicate digital information;
\item
the radio communication link from
Galileo,\marginpar{\footnotesize
\setlength{\unitlength}{1mm}%
\begin{picture}(45,10)(0,5)
\put(0,10){\makebox(0,0)[l]{\shortstack{Galileo}}}
\put(21,10){\makebox(0,0)[l]{\shortstack{radio\\waves}}}
\put(39,10){\makebox(0,0)[l]{\shortstack{Earth}}}
\put(15,10){\vector(1,0){3}}
\put(32,10){\vector(1,0){3}}
\end{picture}
}
the Jupiter-orbiting spacecraft,
to Earth;
\item
\marginpar[c]{\footnotesize
\setlength{\unitlength}{1mm}%
\begin{picture}(30,20)(0,0)
\put(0,10){\makebox(0,0)[l]{\shortstack{parent\\cell}}}
\put(16,2){\makebox(0,0)[l]{\shortstack{daughter\\cell}}}
\put(16,16){\makebox(0,0)[l]{\shortstack{daughter\\cell}}}
\put(10,10){\vector(1,1){5}}
\put(10,10){\vector(1,-1){5}}
\end{picture}
}reproducing cells, in which the daughter cells' \ind{DNA}
contains information from the parent
% cell or
cells;
\item
\marginpar{\footnotesize
\setlength{\unitlength}{1mm}%
\begin{picture}(45,10)(0,5)
\put(0,10){\makebox(0,0)[l]{\shortstack{computer\\ memory}}}
\put(20,10){\makebox(0,0)[l]{\shortstack{\disc\\drive}}}
\put(33,10){\makebox(0,0)[l]{\shortstack{computer\\ memory}}}
\put(15,10){\vector(1,0){3}}
\put(29,10){\vector(1,0){3}}
\end{picture}
}a \disc{} drive.
\eit
The last example shows that \ind{communication} doesn't have to involve
information going from one {\em place\/} to another. When
we write a file on a \disc{} drive, we'll
% typically
read it off
% again
in the same location -- but at a later {\em time}.
These channels are noisy.\index{noise}\index{channel!noisy} A telephone line suffers
from cross-talk with other lines; the hardware in the
line distorts and adds noise to the transmitted signal. The deep
space network that listens to Galileo's puny transmitter
% fairy-bulb power
receives background radiation from
terrestrial and cosmic sources.
DNA is subject to mutations and damage.
A \ind{disk drive}, which writes
a binary digit (a one or zero, also known as a {\dbf bit}) by aligning a patch of magnetic
material in one of two orientations, may later
% , with some probability,
fail to read out the stored binary digit:
% that was stored
the patch of material might spontaneously flip
magnetization, or
a glitch of
background noise might cause the reading circuit
to report the wrong
value for the binary digit, or the writing head might not induce
the magnetization in the first place because of interference
from neighbouring bits.
In all these cases, if we transmit data, \eg, a string
of bits, over the channel, there is some probability that
the received message will not be identical to the transmitted message.
% And in all cases,
We would prefer to have a communication channel for
which this probability was zero -- or so close to zero that
for practical purposes it is indistinguishable from zero.
Let's consider
% the example of
a noisy \disc{} drive
% having the property
that transmits each bit correctly
% transmitted
with probability
$(1\!-\!f)$ and incorrectly with probability $f$.
This model
% favourite
communication channel\index{channel!binary symmetric} is known
as the {\dbf{\ind{binary symmetric channel}}} (\figref{fig.bsc1}).
\begin{figure}[htbp]
\figuremargin{%
\[
\begin{array}{c}
\setlength{\unitlength}{0.46mm}
\begin{picture}(30,20)(-5,0)
\put(-4,9){{\makebox(0,0)[r]{$x$}}}
\put(5,2){\vector(1,0){10}}
\put(5,16){\vector(1,0){10}}
\put(5,4){\vector(1,1){10}}
\put(5,14){\vector(1,-1){10}}
\put(4,2){\makebox(0,0)[r]{1}}
\put(4,16){\makebox(0,0)[r]{0}}
\put(16,2){\makebox(0,0)[l]{1}}
\put(16,16){\makebox(0,0)[l]{0}}
\put(24,9){{\makebox(0,0)[l]{$y$}}}
\end{picture}
\end{array}
\:\:\:
\begin{array}{ccl}%%%%% {c@{}c@{}l} %%%%% (for twocolumn style)
P(y\eq 0 \given x\eq 0) &= & 1 - \q ; \\ P(y\eq 1 \given x\eq 0) &= & \q ;
\end{array}
\begin{array}{ccl}
P(y\eq 0 \given x\eq 1) &= & \q ; \\ P(y\eq 1 \given x\eq 1) &= & 1 - \q .
\end{array}
\]
}{%
\caption[a]{The binary symmetric channel. The
transmitted symbol is $x$ and the
received symbol $y$. The noise level, the probability
% of a bit's being
that a bit is
flipped, is $f$.}
\label{fig.bsc1}
}%
\end{figure}
\begin{figure}[htbp]
\figuremargin{%
\begin{mycenter}
\begin{tabular}{rcl}
\psfig{figure=bitmaps/dilbert.ps,width=1.2in}
&\hspace{0.1in}%
\raisebox{0.22in}{%
\setlength{\unitlength}{1.2mm}%
\begin{picture}(20,20)(0,0)%
\put(10,1){\makebox(0,0)[t]{$(1-f)$}}
\put(10,17){\makebox(0,0)[b]{$(1-f)$}}
\put(12,9.5){\makebox(0,0)[l]{$f$}}
% \put(10,16.5){\makebox(0,0)[b]{$(1-f)$}}
\put(5,2){\vector(1,0){10}}
\put(5,16){\vector(1,0){10}}
\put(5,4){\vector(1,1){10}}
\put(5,14){\vector(1,-1){10}}
\put(4,2){\makebox(0,0)[r]{{1}}}
\put(4,16){\makebox(0,0)[r]{{0}}}
\put(16,2){\makebox(0,0)[l]{{1}}}
\put(16,16){\makebox(0,0)[l]{{0}}}
\end{picture}%
}%
\hspace{0.385in}&
\psfig{figure=_is/10000.10.ps,width=1.2in} \\
% & & \makebox[0in][l]{\large 10\% of bits are flipped} \\
\end{tabular}
\end{mycenter}
}{%
\caption[a]{A binary data sequence of length $10\,000$ transmitted over
a binary symmetric channel with noise level $f=0.1$.
\dilbertcopy}
\label{fig.bsc.dil}
}%
\end{figure}
\noindent
As an example,
% For the sake of argument,
let's imagine that $f=0.1$, that is, ten \percent\ of the bits are
flipped (\figref{fig.bsc.dil}).
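For a sense of scale, the binomial results reviewed before this chapter
can be applied to the $N \eq 10\,000$ bits of \figref{fig.bsc.dil},
assuming that each bit is flipped independently. Writing $r$ for the number of
flipped bits,
\[
 \Exp [ r ] = N f = 1000
 \hspace{0.25in} \mbox{and} \hspace{0.25in}
 \sqrt{ \var[r] } = \sqrt{ N f (1-f) } = 30 ,
\]
so we would expect roughly $1000 \pm 30$ flips in that sequence.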
% For a \disc{} drive to be useful, we would prefer that it should
% flip no bits at all in its entire lifetime.
A useful \disc{} drive would flip no bits at all in its entire lifetime.
%
If we expect to read and write a
gigabyte per day for ten years, we require a bit error
probability of the order of $10^{-15}$, or smaller.
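A back-of-envelope sketch of where this figure comes from,
assuming that a gigabyte is $8 \times 10^{9}$ bits and that a year has
365 days (both are assumptions of this sketch): the number of bits written
is about
\[
 8 \times 10^{9} \times 365 \times 10 \simeq 3 \times 10^{13} ,
\]
and the same number are read. For the expected number of errors
($f$ times the number of bits handled) to be well below one,
$f$ must be well below about $3 \times 10^{-14}$,
hence of the order of $10^{-15}$ or smaller.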
There are two approaches to this goal.
\subsection{The physical solution}
The physical solution is to improve the physical characteristics of
the communication channel to reduce its error probability. We could
improve our \disc{} drive by
% , for example,
\ben
\item
using more reliable components in its circuitry;
\item
evacuating the air from the \disc{} enclosure so as
to eliminate the turbulence that perturbs the
reading head from the track;
\item
using a larger magnetic patch to represent each bit; or
\item
using higher-power signals or cooling the
circuitry in order to reduce thermal noise.
\een
These physical modifications
typically
increase the cost of the communication
channel.
% unit of area making the \disc{} spin at a slower rate
%
% the system solution
%
\begin{figure}%[htbp]
\figuremargin{%
\setlength{\unitlength}{1.25mm}
\begin{mycenter}
\begin{picture}(50,40)(-10,5)
\put(0,5){\framebox(25,10){\begin{tabular}{c}Noisy\\ channel\end{tabular}}}
\put(-20,20){\framebox(25,10){\begin{tabular}{c}Encoder\end{tabular}}}
\put(20,20){\framebox(25,10){\begin{tabular}{c}Decoder\end{tabular}}}
%\put(-20,40){\framebox(25,10){\begin{tabular}{c}Compressor\end{tabular}}}
%\put(20,40){\framebox(25,10){\begin{tabular}{c}Decompressor\end{tabular}}}
%\put(-50,20){\makebox(25,10){\begin{tabular}{c}{\sc Source}\\{\sc coding}\end{tabular}}}
% \put(-50,40){\makebox(25,10){\begin{tabular}{c}{\sc Channel}\\{\sc coding}\end{tabular}}}
\put(-20,37){\makebox(25,12){Source}}
%
\put(-10,14){\makebox(0,0){$\bt$}}
\put(-10,34){\makebox(0,0){$\bs$}}
\put(35,14){\makebox(0,0){$\br$}}
\put(35,34){\makebox(0,0){$\hat{\bs}$}}
\put(-7.5,18){\line(0,-1){8}}
\put(-7.5,10){\vector(1,0){6}}
\put(32.5,10){\vector(0,1){8}}
\put(32.5,10){\line(-1,0){6}}
%
\put(32.5,31){\vector(0,1){8}}
%\put(32.5,51){\vector(0,1){5}}
\put(-7.5,39){\vector(0,-1){8}}
%\put(-7.5,55){\vector(0,-1){5}}
\end{picture}
\end{mycenter}
}{%
\caption[a]{The `system' solution for
achieving
% almost perfect
reliable communication
over a noisy channel. The encoding system introduces
systematic redundancy
% in a systematic way
into the transmitted vector $\bt$. The decoding system
uses this known redundancy to deduce
from the
received vector $\br$
{\em both\/}
the original source vector
{\em and\/}
the noise introduced by the channel.
}
\label{system.solution}
}%
\end{figure}
\subsection{The `system' solution}
Information theory\index{information theory} and
\ind{coding theory}\index{system} offer
an alternative (and much more exciting)
approach: we accept the given noisy channel as it is
and
add communication {\dem systems\/} to it so that we
can {detect\/} and {correct\/} the errors introduced by the
% noise.
channel.
As shown in \figref{system.solution}, we add an
{\dem\ind{encoder}\/} before the channel and a {\dem\ind{decoder}\/} after
it. The encoder encodes the source message $\bs$
into a {\dem transmitted\/} message $\bt$,
% the idea is that the encoder adds
adding {\dem\ind{redundancy}\/} to the original message in some way. The
channel adds noise to the transmitted message, yielding a received
message $\br$. The decoder uses the known redundancy
introduced by the encoding system to infer both the original signal
$\bs$ and the added noise.
% added by the channel was.
Whereas physical solutions give incremental channel improvements
only at an ever-increasing cost,
% we hope to find
% there exist
system solutions can turn noisy channels into reliable
communication channels
with the only cost being a {\em computational\/} requirement
at the encoder and decoder.
% (and the delay associated with those computations.
%
% suggested addition:
% So, as the cost of computation falls, the cost of reliability will fall as well.
{\dbf Information theory} is concerned with the theoretical limitations and
% theoretical
potentials of such systems. `What is the best error-correcting
performance we could achieve?'
{\dbf Coding theory} is concerned with the creation of practical
encoding and decoding systems.
% Some
\section{Error-correcting codes for the binary symmetric channel}
We now consider examples of encoding and decoding systems.
What is the simplest way to add useful redundancy to a transmission?
[To make the rules of the game clear:
we want to be able to detect {\em and\/} correct errors;
and retransmission is not an option. We get only
one chance to encode, transmit,
and decode.]
\subsection{Repetition codes}
\label{sec.r3}
A straightforward idea is to repeat every bit of the message a prearranged
number of times -- for example, three times, as shown in \tabref{fig.r3}.
We call this {\dem \ind{repetition code}\/} `$\Rthree$'.
%\begin{figure}[htbp]
%\figuremargin{%
\amargintab{c}{
\begin{mycenter}
\begin{tabular}{c@{\hspace{0.3in}}c} \toprule % \hline
% Source sequence $\bs$ & Transmitted sequence $\bt$ \\ \hline
Source & Transmitted \\[-0.02in] % was -0.1, which was too much
sequence & sequence \\
$\bs$ & $\bt$ \\ \midrule % \hline
\tt 0 &\tt 000 \\
\tt 1 &\tt 111 \\ \bottomrule % \hline
\end{tabular}
\end{mycenter}
%}{%
\caption[a]{The repetition code {$\Rthree$}.}
\label{fig.r3}
}%
%\end{figure}
% \noindent
%
Imagine that
% what might happen if
we transmit the source message
\[
\bs = \mbox{\tt 0 0 1 0 1 1 0}
\]
over a binary
symmetric channel with noise level $f=0.1$ using this repetition code.
We can describe the channel as `adding' a sparse noise vector $\bn$ to the
transmitted vector -- adding in modulo 2 arithmetic, \ie, the binary algebra in which
{\tt 1}+{\tt 1}={\tt 0}. A possible noise
vector $\bn$ and received vector $\br = \bt + \bn$
are shown in
\figref{fig.r3.transmission}.
\begin{figure}[htbp]
%
% here i should switch the \[ \] for a display that does not introduce
% white space at the top (about 0.1in)
%
\figuremargin{%
\[
\begin{array}{rccccccc}
\bs & {\tt 0}&{\tt 0}&{\tt 1}&{\tt 0}&{\tt 1}&{\tt 1}&{\tt 0} \\
\bt & \obr{{\tt 0}}{{\tt 0}}{{\tt 0}}&\obr{{\tt 0}}{{\tt 0}}{{\tt 0}}&\obr{{\tt 1}}{{\tt 1}}{{\tt 1}}&\obr{{\tt 0}}{{\tt 0}}{{\tt 0}}&
\obr{{\tt 1}}{{\tt 1}}{{\tt 1}}&\obr{{\tt 1}}{{\tt 1}}{{\tt 1}}& \obr{{\tt 0}}{{\tt 0}}{{\tt 0}} \\
\bn & \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}}& \nbr{{\tt 0}}{{\tt 0}}{{\tt 1}}& \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}}& \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}}&
\nbr{{\tt 1}}{{\tt 0}}{{\tt 1}}& \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}}& \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}} \\ \cline{2-8}
\br & \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}}& \nbr{{\tt 0}}{{\tt 0}}{{\tt 1}}& \nbr{{\tt 1}}{{\tt 1}}{{\tt 1}}& \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}}&
\nbr{{\tt 0}}{{\tt 1}}{{\tt 0}}& \nbr{{\tt 1}}{{\tt 1}}{{\tt 1}}& \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}}
\end{array}
\]
}{%
\caption{An example transmission using {$\Rthree$}.}
\label{fig.r3.transmission}
}
\end{figure}
%\noindent
How should we decode this received vector?
%
% optimality not clear - should justify?
%
% Perhaps you can see that
The optimal algorithm looks at the received
bits three at a time and takes
a \ind{majority vote} (\algref{alg.r3}).
\begin{algorithm}[htbp]
\algorithmmargin{%
\begin{mycenter}
\begin{tabular}{ccc} % \toprule % \hline
Received sequence $\br$ &
Likelihood ratio $\frac{P(\br\,|\, s\eq {\tt 1})}{P(\br\,|\, s\eq {\tt 0})}$
&
Decoded sequence $\hat{\bs}$ \\ \midrule
\tt 000 & $\gamma^{-3}$ &\tt 0 \\
\tt 001 & $\gamma^{-1}$ &\tt 0 \\
\tt 010 & $\gamma^{-1}$ &\tt 0 \\
\tt 100 & $\gamma^{-1}$ &\tt 0 \\
\tt 101 & $\gamma^{1}$ &\tt 1 \\
\tt 110 & $\gamma^{1}$ &\tt 1 \\
\tt 011 & $\gamma^{1}$ &\tt 1 \\
\tt 111 & $\gamma^{3}$ &\tt 1 \\
% \bottomrule
\end{tabular}
\end{mycenter}
}{%
\caption[a]{Majority-vote decoding algorithm for {$\Rthree$}.
Also shown are the likelihood ratios (\ref{eq.likelihood.bsc}), assuming
% This is the optimal decoder if
the channel is a binary symmetric channel; $\gamma \equiv (1-f)/f$.}
%
\label{fig.r3d}
\label{alg.r3}
}%
\end{algorithm}
%
\begin{aside}
%
At the risk of explaining the obvious, let's prove this result.
The optimal decoding decision
(optimal in the sense
of having the smallest probability of being wrong)
is to find which value of $\bs$
is most probable, given $\br$.\index{maximum {\em a posteriori}}
% to make clear the assumptions.
Consider the decoding of a single bit $s$, which was encoded
as
% after encoding as
$\bt(s)$
and gave rise to three received bits $\br = r_1r_2r_3$.
By \ind{Bayes' theorem},\label{sec.bayes.used} the {\dem posterior
probability\/} of $s$ is
\beq
P(s \,|\, r_1r_2r_3 ) = \frac{ P( r_1r_2r_3 \,|\, s ) P( s ) }
{ P( r_1r_2r_3 ) } .
\label{eq.bayestheorem}
\eeq
We can spell out the posterior probability of the two alternatives thus:
\beq
P(s\eq {\tt 1} \,|\, r_1r_2r_3 ) = \frac{ P( r_1r_2r_3 \,|\, s\eq {\tt 1} )
P( s\eq {\tt 1} ) }
{ P( r_1r_2r_3 ) } ;
\label{eq.post1}
\eeq
\beq
P(s\eq {\tt 0} \,|\, r_1r_2r_3 ) = \frac{ P( r_1r_2r_3 \,|\, s\eq {\tt 0} )
P( s\eq {\tt 0} ) }
{ P( r_1r_2r_3 ) } .
\label{eq.post0}
\eeq
%
This \ind{posterior probability} is determined by two factors:
the
{\dem{\ind{prior} probability\/}} $P(s)$, and
the data-dependent term $P( r_1r_2r_3 \,|\, s )$, which is called
the {\dem{\ind{likelihood}\/}} of $s$.
The normalizing constant $P( r_1r_2r_3 )$
% is irrelevant to
needn't be computed when finding
the optimal decoding decision,
which is to guess $\hat{s}\eq {\tt 0}$
if $P(s\eq {\tt 0} \,|\, \br ) > P(s\eq {\tt 1} \,|\, \br )$,
and $\hat{s}\eq {\tt 1}$ otherwise.
To find
$P(s\eq {\tt 0} \,|\, \br )$ and $P(s\eq {\tt 1} \,|\, \br )$,
% the optimal decoding decision,
we must make an assumption about the prior probabilities of the
two hypotheses ${s}\eq {\tt 0}$ and ${s}\eq {\tt 1}$, and we
must make an assumption about the probability of $\br$ given
$s$.
% $\bt(s)$.
We assume that the prior probabilities are equal:
$P( {s}\eq {\tt 0}) = P( {s}\eq {\tt 1}) = 0.5$;
then maximizing the posterior probability $P(s\,|\,\br)$ is
equivalent to maximizing the likelihood $P(\br\,|\,s)$.\index{maximum likelihood}
And we assume that the
channel is a binary symmetric channel with noise level $f<0.5$, so that
the likelihood is
\beq
P( \br \,|\, s ) = P(\br \,|\, \bt(s) ) = \prod_{n=1}^N
P(r_n \,|\, t_n(s) ) ,
\eeq
where $N=3$ is the number of transmitted bits in the block
we are considering, and
\beq
P(r_n\,|\,t_n) = \left\{ \begin{array}{lll}
(1\!-\!f) & \mbox{if} & r_n=t_n \\
f & \mbox{if} & r_n \neq t_n. \end{array} \right.
\eeq
Thus the likelihood ratio for the
two hypotheses is
% if we define $
\beq
\frac{P(\br\,|\, s\eq {\tt 1})}{P(\br\,|\, s\eq {\tt 0})}
% = \left( \frac{ (1-f) }{f} \right)^{
= \prod_{n=1}^N
\frac{P(r_n \,|\, t_n({\tt 1}) )}{P(r_n \,|\, t_n({\tt 0}) )} ;
\label{eq.likelihood.bsc}
\eeq
each factor
% $P(r_n \,|\, t_n(s) )$
$\frac{P(r_n | t_n({\tt 1}) )}{P(r_n | t_n({\tt 0}) )}$
equals $\frac{ (1-f) }{f}$ if $r_n=1$ and $\frac{f}{ (1-f) }$ if
$r_n=0$.
The ratio $\gamma \equiv \frac{ (1-f) }{f}$ is greater than 1,
since $f<0.5$, so the winning hypothesis is the one with the most
`votes', each vote counting for a factor of $\gamma$ in the
% posterior probability.
likelihood ratio.
Thus the majority-vote decoder shown in \algref{fig.r3d}
is the optimal decoder if we assume that
the channel is a binary symmetric channel and that the
two possible source messages {\tt 0} and {\tt 1}
have equal prior probability.
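As a numerical check of this argument (a sketch only, taking the noise level
$f=0.1$ of the running example, so that $\gamma = 0.9/0.1 = 9$),
consider the received triplet $\br = {\tt 010}$.
The three factors of the likelihood ratio (\ref{eq.likelihood.bsc}) are
$\gamma^{-1}$, $\gamma$ and $\gamma^{-1}$, so
\[
\frac{P(\br\,|\, s\eq {\tt 1})}{P(\br\,|\, s\eq {\tt 0})}
 = \gamma^{-1} \cdot \gamma \cdot \gamma^{-1} = \gamma^{-1} = \frac{1}{9} .
\]
With equal priors this gives
$P(s\eq {\tt 1} \,|\, \br ) = \frac{1/9}{1 + 1/9} = 0.1$ and
$P(s\eq {\tt 0} \,|\, \br ) = 0.9$,
so the decoder guesses $\hat{s}\eq {\tt 0}$, in agreement with the majority vote.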
\end{aside}
%\noindent
We now apply the majority vote decoder to the received vector of \figref{fig.r3.transmission}.
The first three received bits are all ${\tt 0}$, so
we decode this triplet
as a ${\tt 0}$.
In the second triplet of \figref{fig.r3.transmission},
there are two {\tt 0}s and one {\tt 1}, so we decode
this triplet as a ${\tt 0}$ -- which in this case corrects the error.
Not all errors are corrected, however. If we are unlucky and
two errors fall in a single block, as in the fifth triplet of
\figref{fig.r3.transmission},
then the decoding rule gets the wrong answer, as shown in
\figref{fig.decoding.R3}.
% \Figref{fig.decoding.R3}
% shows the result of decoding the received vector
% from \figref{fig.r3.transmission}.
\begin{figure}[htbp]
\figuremargin{%
\[
\begin{array}{rccccccc}
\bs & {\tt 0}&{\tt 0}&{\tt 1}&{\tt 0}&{\tt 1}&{\tt 1}&{\tt 0} \\
\bt & \obr{{\tt 0}}{{\tt 0}}{{\tt 0}}&\obr{{\tt 0}}{{\tt 0}}{{\tt 0}}&\obr{{\tt 1}}{{\tt 1}}{{\tt 1}}&\obr{{\tt 0}}{{\tt 0}}{{\tt 0}}&
\obr{{\tt 1}}{{\tt 1}}{{\tt 1}}&\obr{{\tt 1}}{{\tt 1}}{{\tt 1}}& \obr{{\tt 0}}{{\tt 0}}{{\tt 0}} \\
\bn & \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}}& \nbr{{\tt 0}}{{\tt 0}}{{\tt 1}}& \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}}& \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}}&
\nbr{{\tt 1}}{{\tt 0}}{{\tt 1}}& \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}}& \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}} \\ \cline{2-8}
\br & \ubr{{\tt 0}}{{\tt 0}}{{\tt 0}}& \ubr{{\tt 0}}{{\tt 0}}{{\tt 1}}& \ubr{{\tt 1}}{{\tt 1}}{{\tt 1}}& \ubr{{\tt 0}}{{\tt 0}}{{\tt 0}}&
\ubr{{\tt 0}}{{\tt 1}}{{\tt 0}}& \ubr{{\tt 1}}{{\tt 1}}{{\tt 1}}& \ubr{{\tt 0}}{{\tt 0}}{{\tt 0}} \\
\hat{\bs} & {\tt 0}&{\tt 0}&{\tt 1}&{\tt 0}&{\tt 0}&{\tt 1}&{\tt 0} \\
\mbox{corrected errors} &
&\star & & & & & \\
\mbox{undetected errors} &
& & & &\star & &
\end{array}
\]
}{%
\caption{Decoding
% Applying the maximum likelihood decoder for $\mbox{R}_3$ to
the received vector
from \protect\figref{fig.r3.transmission}.}
\label{fig.decoding.R3}
}%
\end{figure}
\noindent
% Thus the error probability is reduced by the use of this code.
% It is easy to compute the error probability.
% Exercise 1.1. Could this be made an Example, i.e. worked through in
% the text? -- for a beginner, there is a lot in it, and it seems to
% be important.
%
% see exercise.sty
\exercissx{2}{ex.R3ep}{%%%%%%%% keep this as A2, but cut it from the ITPRNN list
Show\marginpar{\small\raggedright The exercise's rating, \eg
% `{\em{A}}2'
`[{\em2\/}]',
indicates its difficulty:
`1' exercises are the easiest.
% An exercise rated {\em{A}}2 is important and should not prove too difficult.
Exercises that are accompanied by a marginal rat are especially recommended.
If a solution or partial solution is provided, the page is indicated after the difficulty rating;
for example, this exercise's solution is on page \pageref{ex.R3ep.sol}.
}
that the error probability is reduced by the use of {$\Rthree$}
by computing the error probability of
this code for a binary symmetric channel
with noise level $f$.
%Do so.
}
%
% This fig is 0.1 inch too wide, 9801
%
The error probability is dominated by the probability that two
bits in a block of three are flipped, which scales as $f^2$.
%
% JARGON??????
%
In the
case of the binary symmetric channel with $f=0.1$, the {$\Rthree$} code has a
probability of error, after decoding, of $\pb \simeq 0.03$ per bit.
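The arithmetic behind this figure can be sketched as follows, assuming the
majority-vote decoder of \algref{alg.r3}: a block is decoded wrongly
when exactly two of its three bits are flipped (three ways) or when all three are flipped,
so
\[
\pb = 3 f^2 (1-f) + f^3 = 3 (0.1)^2 (0.9) + (0.1)^3 = 0.027 + 0.001 = 0.028 \simeq 0.03 .
\]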
\Figref{fig.r3.dilbert} shows the
result of transmitting a binary
image over a binary symmetric channel
using the repetition code.
\begin{figure}[hbtp]
%\fullwidthfigure{%
%\figuredangle{% this hung off the bottom of the page
\figuremarginb{% I think this may make a collision?
\begin{center}
\setlength{\unitlength}{0.8in}% was 0.75 98.12. changed to 0.8 99.01
\begin{picture}(7,4.3)(0,1.4)
\put(0,5){\makebox(0,0)[tl]{\psfig{figure=bitmaps/dilbert.ps,width=1in}}}
\put(0.625,5.4){\makebox(0,0){\Large$\bs$}}
\thicklines
\put(1.35,4.75){\vector(1,0){0.4}}
\put(1.55,5.4){\makebox(0,0){{\sc encoder}}}
\put(2,5){\makebox(0,0)[tl]{\psfig{figure=poster/10000.r3.ps,width=1in}}}
\put(2.625,5.4){\makebox(0,0){\Large$\bt$}}
\put(3.6,5.4){\makebox(0,0){{\sc channel}}}
\put(3.6,5.15){\makebox(0,0){$f={10\%}$}}
\put(3.4,4.75){\vector(1,0){0.4}}
\put(4,5){\makebox(0,0)[tl]{\psfig{figure=poster/10000.r3.0.10.ps,width=1in}}}
\put(4.625,5.4){\makebox(0,0){\Large$\br$}}
\put(5.6,5.4){\makebox(0,0){{\sc decoder}}}
%\put(5.6,3.4){\makebox(0,0)[tl]{\parbox[t]{1.75in}{{\em The decoder takes the majority vote of the three signals.}}}}
\put(5.4,4.75){\vector(1,0){0.4}}
\put(6,5){\makebox(0,0)[tl]{\psfig{figure=poster/10000.r3.0.10.d.ps,width=1in}}}
\put(6.625,5.4){\makebox(0,0){\Large$\hat{\bs}$}}
\end{picture}
\end{center}
}{%
\caption[a]{Transmitting $10\,000$ source bits over a binary symmetric channel
with $f=10\%$
% 0.1$
using a repetition code and the majority vote decoding
algorithm. The probability
of decoded bit error has fallen to about 3\%; the rate has fallen
to 1/3.}
% \dilbertcopy
\label{fig.r3.dilbert}
}%
\end{figure}
% Should `rate' be explicitly defined?
\newpage\indent
The repetition code $\Rthree$ has therefore reduced the probability of
error, as desired.
Yet we have lost something: our
{\em rate\/} of information transfer has fallen by a factor of
three. So if we use a repetition code to communicate data over a telephone
line, it will reduce the error frequency, but it will also reduce our
communication rate. We will have to pay three times as much for each
phone call.
% there will also be a delay
Similarly,
%As for our \disc{} drive,
we would need three of the original noisy gigabyte \disc{} drives
in order to create a one-gigabyte \disc{} drive with $\pb=0.03$.
Can we
% What happens as we try to
push the error probability lower, to the
values required for a
% quality
sellable \disc{} drive -- $10^{-15}$?
We could achieve lower error probabilities by using repetition
codes with more repetitions.
\exercissx{3}{ex.R60}{
\ben
\item
Show that the probability of error of $\RN$, the repetition
code with $N$ repetitions, is
\beq
p_{\rm b} = \sum_{n=(N+1)/2}^{N} {{N}\choose{n}} f^n (1-f)^{N-n} ,
\eeq
for odd $N$.
\item
Assuming $f = 0.1$, which of the terms in this sum is the biggest?
How much bigger is it than the second-biggest term?
\item
Use \ind{Stirling's approximation} (\pref{sec.stirling}) to approximate
% get rid of
the ${{N}\choose{n}}$
in the largest term, and find,
approximately, the probability of error of the repetition
code with $N$ repetitions.
\item
Assuming $f = 0.1$, find how many repetitions
are required
% show that it takes a repetition
% code with rate about $1/60$
to get the probability of error
down to $10^{-15}$. [Answer: about 60.]
\een
}
So to build a {\em single\/}
gigabyte \disc{} drive
with the required reliability from noisy gigabyte drives with $f=0.1$,
we would need {\em sixty\/} of the noisy \disc{} drives.
The tradeoff between error probability and rate for repetition
codes is shown in \figref{fig.pbR.R}.
%
% see end of l1.tex for method, also see poster1.gnu
%
\newcommand{\pbobject}{\hspace{-0.15in}\raisebox{1.62in}{$\pb$}%
\hspace{-0.05in}}
\begin{figure}
\figuremargin{%
\begin{center}
\begin{tabular}{cc}
\hspace{-0.2in}\psfig{figure=\codefigs/rep.1.ps,angle=-90,width=2.6in} &
\pbobject\psfig{figure=\codefigs/rep.1.l.ps,angle=-90,width=2.6in} \\
\end{tabular}
\end{center}
}{%
\caption[a]{Error probability $\pb$ versus rate for repetition codes
over a binary symmetric channel with $f=0.1$.
The right-hand figure shows $\pb$ on a logarithmic scale. We would like
the rate to be large and $\pb$ to be small.
}
\label{fig.pbR.R}
}%
\end{figure}
% see end of this file for method
\subsection{Block codes -- the $(7,4)$ Hamming code}
\label{sec.ham74}
We would like to communicate with\index{Hamming code}
tiny probability of error {\em and\/} at a substantial rate.
Can we improve on repetition codes? What if we add redundancy to
{\dem blocks\/} of data instead of
% redundantly
encoding one bit at a time?
% You may already have heard of the idea of `parity check bits'.
We now
study a simple {\dem{block code}}.
A {\dem \ind{block code}\/} is a rule\index{error-correcting code!block code}
for converting a sequence of source
bits $\bs$, of length $K$, say, into a transmitted sequence $\bt$ of length
$N$ bits. To add redundancy, we make $N$
greater than $K$. In a {\dem linear\/} block code,
the extra $N-K$ bits are linear functions of the
original $K$ bits; these extra bits are called {\dem\ind{parity-check bits}}.
An example of a \ind{linear block code} is the \mbox{\dem$(7,4)$
\ind{Hamming code}}, which transmits $N=7$ bits for every $K=4$ source
bits.
% \index{error-correcting code!linear}
\begin{figure}[htbp]
\figuremargin{\small%
\begin{center}
\begin{tabular}{cc}
(a)\psfig{figure=hamming/encode.eps,angle=-90,width=1.3in} &
(b)\psfig{figure=hamming/correct.eps,angle=-90,width=1.3in} \\
\end{tabular}
\end{center}
}{
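\caption[a]{Pictorial representation of encoding for the $(7,4)$ Hamming
code. (a) The seven transmitted bits arranged in three intersecting circles;
(b) the transmitted codeword for the case $\bs = {\tt 1000}$.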
% a and b are not explained in the caption. Does this matter?
%
% The parity check bits $t_5,t_6,t_7$ are set so that the parity within
%% each circle is even.
}
\label{fig.74h.pictorial}
\label{fig.hamming.pictorial}
}
\end{figure}
%
The encoding operation for the code is shown pictorially
in \figref{fig.74h.pictorial}.
%
% \subsubsection{Encoding}
We arrange the seven transmitted bits in three intersecting circles.
% as shown in \figref{fig.hamming.encode}.
The first four
transmitted bits,
$t_1 t_2 t_3 t_4$, are set equal to the four source bits,
$s_1 s_2 s_3 s_4$.
The parity-check bits\index{parity-check bits}
$t_5 t_6 t_7$ are set so that the {\dem\ind{parity}\/}
within each circle is even:
the first parity-check bit is the parity of the first three source bits
(that is, it is
%zero
{\tt 0} if the sum of those bits is even, and
% one
{\tt 1} if the sum is odd);
the second is the parity of the last three; and the third parity bit
is the parity of source bits one, three and four.
As an example, \figref{fig.74h.pictorial}b shows the transmitted
codeword for the case $\bs = {\tt 1000}$.
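Spelling this example out in modulo-2 arithmetic (a sketch that simply applies
the three parity rules stated above to $\bs = {\tt 1000}$):
\[
\begin{array}{rcl}
t_5 &=& s_1 + s_2 + s_3 = {\tt 1}+{\tt 0}+{\tt 0} = {\tt 1} , \\
t_6 &=& s_2 + s_3 + s_4 = {\tt 0}+{\tt 0}+{\tt 0} = {\tt 0} , \\
t_7 &=& s_1 + s_3 + s_4 = {\tt 1}+{\tt 0}+{\tt 0} = {\tt 1} ,
\end{array}
\]
so the transmitted codeword is $\bt = {\tt 1000101}$.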
% idea for rewriting this: go straight to pictorial story, leave out the
% matrix description for another time.
%
%
%\noindent
%
Table \ref{tab.74h} shows the codewords generated
by each of the $2^4 = 16$ settings of the four source bits.
% Notice that the first four transmitted bits are
% identical to the four source bits, and the remaining three bits
% are parity bits:
% The special property of these codewords is that
These codewords
have the special property that
any pair
differ from each other in at least three bits.
\begin{table}[htbp]
\figuremargin{%
\begin{center}
\mbox{\small
\begin{tabular}{cc} \toprule
% Source sequence
$\bs$ &
% Transmitted sequence
$\bt$ \\ \midrule
\tt 0000 &\tt 0000000 \\
\tt 0001 &\tt 0001011 \\
\tt 0010 &\tt 0010111 \\
\tt 0011 &\tt 0011100 \\ \bottomrule
\end{tabular} \hspace{0.02in}
\begin{tabular}{cc} \toprule
$\bs$ & $\bt$ \\ \midrule
\tt 0100 &\tt 0100110 \\
\tt 0101 &\tt 0101101 \\
\tt 0110 &\tt 0110001 \\
\tt 0111 &\tt 0111010 \\ \bottomrule
\end{tabular} \hspace{0.02in}
\begin{tabular}{cc} \toprule
$\bs$ & $\bt$ \\ \midrule
\tt 1000 &\tt 1000101 \\
\tt 1001 &\tt 1001110 \\
\tt 1010 &\tt 1010010 \\
\tt 1011 &\tt 1011001 \\ \bottomrule
\end{tabular} \hspace{0.02in}
\begin{tabular}{cc} \toprule
$\bs$ & $\bt$ \\ \midrule
\tt 1100 &\tt 1100011 \\
\tt 1101 &\tt 1101000 \\
\tt 1110 &\tt 1110100 \\
\tt 1111 &\tt 1111111 \\ \bottomrule
\end{tabular}
}%%%%%%%%% end of row of four tables
\end{center}
}{%
\caption[a]{The sixteen codewords
$\{ \bt \}$ of the $(7,4)$ Hamming code. Any pair of
codewords
% have the % beautiful % elegant property that they
differ from each other in at least three bits.}
%\label{fig.hamming.encode}
\label{tab.74h}
\label{tab.h74}
\label{fig.h74}
\label{fig.74h}
}
\end{table}
%
\begin{aside}
Because the Hamming code is a {linear\/} code, it can\indexs{error-correcting code!linear}
be written compactly in terms of matrices as follows.\index{linear block code}
% It is a
% {\em linear\/} code; that is, t
The transmitted codeword $\bt$ is
% can be
obtained
from the source sequence $\bs$ by a linear operation,
\beq
\bt = \bG^{\T} \bs,
\label{eq.encode}
\eeq
where $\bG$ is the {\dem\ind{generator matrix}} of the code,
\beq
\bG^{\T} = {\left[ \begin{array}{cccc}
\tt 1 &\tt 0 &\tt 0 &\tt 0 \\
\tt 0 &\tt 1 &\tt 0 &\tt 0 \\
\tt 0 &\tt 0 &\tt 1 &\tt 0 \\
\tt 0 &\tt 0 &\tt 0 &\tt 1 \\
\tt 1 &\tt 1 &\tt 1 &\tt 0 \\
\tt 0 &\tt 1 &\tt 1 &\tt 1 \\
\tt 1 &\tt 0 &\tt 1 &\tt 1 \end{array} \right] } ,
\label{eq.h74.gen}
\eeq
and the encoding operation (\ref{eq.encode}) uses
modulo-2 arithmetic (${\tt 1}+{\tt 1}={\tt{0}}$, ${\tt 0}+{\tt 1}={\tt 1}$, etc.).
%\footnote{My notational
% convention is that all vectors -- $\bs$, $\bt$, etc.\ --
% are column vectors, except that in the figures where many
% vectors are listed, they are displayed as row vectors. The
% generator matrix $\bG$ is written ..... as to retain
% consistency with established notation in coding texts.}
% \begin{aside}
In the encoding operation
(\ref{eq.encode}) I have assumed that $\bs$ and $\bt$ are
column vectors. If instead they are row vectors, then this equation
is replaced by
\beq
\bt = \bs \bG,
\label{eq.encodeT}
\eeq
where
\beq
\bG = \left[ \begin{array}{ccccccc}
\tt 1& \tt 0& \tt 0& \tt 0& \tt 1& \tt 0& \tt 1 \\
\tt 0& \tt 1& \tt 0& \tt 0& \tt 1& \tt 1& \tt 0 \\
\tt 0& \tt 0& \tt 1& \tt 0& \tt 1& \tt 1& \tt 1 \\
\tt 0& \tt 0& \tt 0& \tt 1& \tt 0& \tt 1& \tt 1 \\
\end{array} \right] .
\label{eq.Generator}
\eeq
% f you are like me, you may
I find it easier to relate to
the right-multiplication (\ref{eq.encode})
% hyphenation specified in itprnnchapter.tex did not work so I do it manually
than the left-multiplica-{\breakhere}tion (\ref{eq.encodeT}).
% -- I like my matrices to act to the right.
Many coding theory texts use the left-multiplying conventions
(\ref{eq.encodeT}--\ref{eq.Generator}), however.
The rows of the generator matrix (\ref{eq.Generator}) can be
viewed as defining four basis vectors lying in a seven-dimensional
binary space. The sixteen codewords are obtained by making all
possible linear combinations
% binary sums
of these vectors.
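For instance (an illustrative check, with the source vector
$\bs = {\tt 1011}$ chosen arbitrarily), the row-vector form (\ref{eq.encodeT})
adds rows one, three and four of $\bG$ modulo 2:
\[
\bt = {\tt 1000101} + {\tt 0010111} + {\tt 0001011} = {\tt 1011001} ,
\]
which agrees with the entry for ${\tt 1011}$ in \tabref{tab.74h}.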
\end{aside}
%
% should I add a cast of characters here?
% s,t,r,s^
\subsubsection{Decoding the $(7,4)$ Hamming code}
When we invent a more complex encoder $\bs \rightarrow \bt$,
the task of decoding the
received vector $\br$ becomes less straightforward. Remember that
{\em any\/} of the bits may have been flipped, including the parity bits.
% We can't assume that the three extra parity bits
%(The reader who
% is eager to see the denouement of the plot may skip ahead to section
% \ref{sec.code.perf}.)
% General defn of optimal decoder
If we assume that the channel is a binary symmetric channel and that
all source vectors are equiprobable,
% {\em a priori},
then the
optimal decoder
% is one that
identifies the source vector $\bs$ whose
encoding $\bt(\bs)$ differs from the received vector $\br$ in the
fewest bits. [{Refer to the likelihood function
% equation
% {eq.bayestheorem}--\ref{eq.likelihood.bsc}}
\bref{eq.likelihood.bsc}} to see why this is so.]
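A sketch of that argument, under the additional assumption that the noise level
satisfies $f<0.5$: if the encoding $\bt(\bs)$ differs from $\br$ in $d$ of the
$N=7$ bits, then for a binary symmetric channel
\[
P(\br \,|\, \bs ) = f^{d} (1-f)^{N-d} = (1-f)^{N} \, \gamma^{-d} ,
\:\:\: \mbox{where} \:\:\: \gamma \equiv (1-f)/f .
\]
Since $\gamma > 1$, the likelihood is largest for the codeword with the smallest $d$;
and with equiprobable source vectors, the source vector of largest likelihood is also
the most probable one given $\br$.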
We could solve the decoding problem by measuring how far $\br$
is from each of the
sixteen codewords in \tabref{tab.74h}, then picking the closest.
Is there a more efficient way of finding the most probable source vector?
\subsubsection{Syndrome decoding for the Hamming code}
\label{sec.syndromedecoding}
For the $(7,4)$ Hamming code there is a pictorial solution to the
% syndrome
decoding problem, based on the encoding picture,
\figref{fig.74h.pictorial}.
%
% \subsubsection{Decoding}
%
% sanjoy says this is CONFUSING - tried to improve it Sat 22/12/01
% also romke did not like it
As a first example, let's assume the transmission was
$\bt = {\tt 1000101}$ and the noise flips the second bit,
so the received vector is
$\br = {\tt 1000101}\oplus{\tt{0100000}} = {\tt{1100101}}$.
% \ie, $\bn=({\tt 0},{\tt 1},{\tt 0},{\tt 0},{\tt 0}, {\tt 0},{\tt 0})$,
% and the received vector
We write the received vector into the three circles
as shown in \figref{fig.hamming.decode}a, and
look at each of the three circles to see whether its parity is even.
The circles whose parity is {\em{not}\/} even are shown by
dashed lines in \figref{fig.hamming.decode}b.
% The fact that all codewords differ from each other in at least
% three bits means that if the noise has flipped any one or two bits,
% the received vector will no longer be a valid codeword, and some of
% the parity checks will be broken.
%
The decoding task is
%We want
to find the smallest
set of flipped bits that can account for these violations
of the parity rules.
% violated.
[The pattern of violations of the parity checks is called the {\dem\ind{syndrome}}, and can be written as a binary vector -- for example,
in \figref{fig.hamming.decode}b, the syndrome is $\bz = ({\tt1},{\tt1},{\tt0})$,
because the first two circles are `unhappy' (parity {\tt1}) and the
third circle is `happy' (parity {\tt0}).]
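Written out in modulo-2 arithmetic, this is a sketch in which the labels
$z_1, z_2, z_3$ for the parities of the three circles are introduced here
purely for illustration, with the circles taken in the order of the parity bits
$t_5$, $t_6$, $t_7$. The received vector $\br = {\tt 1100101}$ gives
\[
\begin{array}{rcl}
z_1 &=& r_1 + r_2 + r_3 + r_5 = {\tt 1}+{\tt 1}+{\tt 0}+{\tt 1} = {\tt 1} , \\
z_2 &=& r_2 + r_3 + r_4 + r_6 = {\tt 1}+{\tt 0}+{\tt 0}+{\tt 0} = {\tt 1} , \\
z_3 &=& r_1 + r_3 + r_4 + r_7 = {\tt 1}+{\tt 0}+{\tt 0}+{\tt 1} = {\tt 0} ,
\end{array}
\]
that is, $\bz = ({\tt1},{\tt1},{\tt0})$.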
% RESTORE ME:
%, and the task of syndrome decoding
% syndrome (just as a
% \ind{doctor} might seek the most probable underlying \ind{disease} to account for
% the symptoms shown by a \ind{patient}).
\begin{figure}% [htbp]
\figuremargin{\small%
\begin{center}
\begin{tabular}{ccc}
(a)\psfig{figure=hamming/decode.eps,angle=0,width=1.3in} \\
(b)\psfig{figure=hamming/s2.eps,angle=-90,width=1.3in} &
(c)\psfig{figure=hamming/t5.eps,angle=-90,width=1.3in} &
(d)\psfig{figure=hamming/s3.eps,angle=-90,width=1.3in} \\[0.3in]
\multicolumn{3}{c}{%
(e)\psfig{figure=hamming/s3.t7.eps,angle=0,width=1.3in}
\setlength{\unitlength}{1in}
\begin{picture}(0.4,0.6)(0,0)
\put(0,0.6){\vector(1,0){0.6}}
\end{picture}
% \raisebox{0.6in}{$\rightarrow$}
(${\rm e}'$)\psfig{figure=hamming/s3.t7.d.eps,angle=0,width=1.3in}
}\\
\end{tabular}
\end{center}
}{%
\caption[a]{Pictorial representation of decoding of the Hamming $(7,4)$
code. The received vector is written into the diagram
as shown in (a).
In (b,c,d,e), the received vector is
shown, assuming that the transmitted vector was
as in
% The bits that are flipped relative to
\protect
\figref{fig.hamming.pictorial}b and the bits labelled by $\star$
were flipped. The violated
parity checks are highlighted by dashed circles. One of the seven bits
is the most probable suspect to account for each `\ind{syndrome}', \ie, each
pattern of violated and satisfied parity checks.
In examples (b), (c), and (d), the most probable suspect is
the one bit that was flipped.
In example (e), two bits have been flipped, $s_3$ and $t_7$.
The most probable suspect is $r_2$, marked by a circle in (${\rm e}'$),
which shows the output of the decoding algorithm.
% each circle is even.
}\label{fig.hamming.decode}
\label{fig.hamming.s2}% these labels were in the wrong place feb 2000
\label{fig.hamming.s3}
\label{fig.hamming.correct}
}
\end{figure}
%
% ACTION: sanjoy still thinks this part is hard to follow - fixed Sat 22/12/01?
To solve the decoding task,
% problem,
we ask the question:
can we find a unique bit that lies {\em inside\/}
all the `unhappy' circles and {\em outside\/} all the
`happy' circles? If so, the flipping of that bit
would account for the observed
syndrome.
In the case shown in \figref{fig.hamming.s2}b,
the bit $r_2$
% that was flipped
lies inside the two unhappy circles and outside the happy
circle;
no other single bit has this property, so
$r_2$ is the only single bit capable of explaining the syndrome.
Let's work through a couple more examples.
\Figref{fig.hamming.s2}c shows what happens if one of the
parity bits, $t_5$, is flipped by the noise. Just one of the checks
is violated. Only $r_5$ lies inside this unhappy circle and outside
the other two happy circles,
so $r_5$ is identified as the only single bit
capable of explaining the syndrome.
If the central bit $r_3$ is received flipped,
\figref{fig.hamming.s3}d shows that all three checks are violated;
only $r_3$ lies inside all three circles, so $r_3$ is
identified as the suspect bit.
If you try flipping any one of the seven bits, you'll find
that a different syndrome is obtained in each case -- seven non-zero syndromes,
one for each bit. There is only
one other syndrome, the all-zero syndrome. So if
the channel is a binary symmetric channel with a
small noise level $f$, the optimal
decoder unflips at most one bit, depending on the
syndrome, as shown in \algref{tab.hamming.decode}.
Each syndrome could have been caused by other noise patterns
too, but any other noise pattern that has the same syndrome
must be less probable because it involves a larger number of
noise events.
%\begin{figure}
%\figuremargin{%
\begin{algorithm}
\algorithmmargin{%
\begin{center}
\begin{tabular}{c*{8}{c}}
% Fri 4/1/02 removed toprule and bottomrule because algorithm has its own frame
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% \toprule
Syndrome $\bz$ & {\tt 000} & {\tt 001} & {\tt 010} & {\tt 011} & {\tt 100} & {\tt 101} & {\tt 110} & {\tt 111} \\ \midrule
Unflip this bit & {\small{\em none}} & $r_7$ & $r_6$ & $r_4$ & $r_5$ & $r_1$ & $r_2$ & $r_3$ \\
% \bottomrule
% Unflip this bit & {\small{\em none}} & 7 & 6 & 4 & 5 & 1 & 2 & 3 \\
% \bottomrule
% this is appropriate only if z =z3,z2,z1:
% Unflip this bit & {\small{\em none}} & 5 & 6 & 2 & 7 & 1 & 4 & 3 \\ \hline
\end{tabular}
\end{center}
%\begin{center}
%\begin{tabular}{cc} \hline
%Syndrome $\bz$ & % 3 2 1 !!!!!!!!!!!!!!!!!!!
%Flip this bit \\ \hline
% 000 &{\small{\em none}} \\
% 001 &5\\
% 010 &6\\
% 011 &2\\
% 100 &7\\
% 101 &1\\
% 110 &4\\
% 111 &3 \\ \hline
%\end{tabular}
%\end{center}
}{%
\caption[a]{Actions taken by the optimal decoder for the $(7,4)$ Hamming
code, assuming a binary symmetric channel with small noise level $f$.
The syndrome vector $\bz$ lists whether each parity check is
violated ({\tt 1}) or satisfied ({\tt 0}),
going through the checks in the order
of the bits $r_5$, $r_6$,
and $r_7$. }
\label{tab.hamming.decode}
}%
\end{algorithm}
What happens if the noise actually flips more than one bit?
\Figref{fig.hamming.s3}e shows the situation when two bits,
$r_3$ and $r_7$, are received flipped. The syndrome, {\tt 110},
makes us suspect the single bit $r_2$; so our optimal decoding algorithm
flips this bit, giving a decoded pattern with three errors
as shown in \figref{fig.hamming.s3}${\rm e}'$.
If we use the optimal decoding algorithm,
any two-bit error pattern will lead to a decoded seven-bit vector
that contains three errors.
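To make this concrete (a sketch, assuming the transmitted codeword is again
$\bt = {\tt 1000101}$): flipping $r_3$ and $r_7$ gives
$\br = {\tt 1010100}$, whose three parity checks evaluate to
\[
\begin{array}{rcl}
r_1 + r_2 + r_3 + r_5 &=& {\tt 1}+{\tt 0}+{\tt 1}+{\tt 1} = {\tt 1} , \\
r_2 + r_3 + r_4 + r_6 &=& {\tt 0}+{\tt 1}+{\tt 0}+{\tt 0} = {\tt 1} , \\
r_1 + r_3 + r_4 + r_7 &=& {\tt 1}+{\tt 1}+{\tt 0}+{\tt 0} = {\tt 0} ,
\end{array}
\]
\ie, the syndrome {\tt 110}. Unflipping $r_2$ yields ${\tt 1110100}$, which is a
valid codeword (that of $\bs = {\tt 1110}$ in \tabref{tab.74h}) but differs from
the transmitted $\bt$ in bits 2, 3 and 7.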
\subsection{General view of decoding for linear codes: syndrome decoding}
\label{sec.syndromedecoding2}
\begin{aside}
% {\em (Does some of this stuff belong earlier in the pictorial area?)}
We can also describe the decoding problem
for a linear code in terms of matrices.\index{syndrome decoding}\index{linear block code}
% In the case of a linear code and a symmetric channel,
% the decoding task can be re-expressed as {\bf syndrome decoding}.
% Let's assume that the noise level $f$ is less than $1/2$.
The first four received bits, $r_1r_2r_3r_4$, purport to be
the four source bits; and the received bits $r_5r_6r_7$ purport
to be the parities of the source bits, as defined by the generator
matrix $\bG$. We evaluate the three parity-check bits for the
received bits, $r_1 r_2r_3 r_4$, and see whether
they match the three received
bits, $r_5r_6r_7$. The differences (modulo 2) between
these two triplets are called the {\dbf\ind{syndrome}}
of the received vector.
If the syndrome is zero -- if all three parity checks are happy
% agree with the corresponding received bits
-- then the received vector is a codeword,
and the most probable decoding is given by reading out its first four
bits. If the syndrome is non-zero, then
% we are certain that
the noise
sequence for this block was non-zero, and the syndrome is our
pointer to the most probable error pattern.
The computation of the syndrome vector is a
linear operation. If we define the $3 \times 4$ matrix $\bP$
such that the matrix of
equation (\ref{eq.h74.gen})
is
\beq
\bG^{\T} = \left[ \begin{array}{c}{\bI_4}\\
\bP\end{array} \right],
\eeq
where $\bI_4$ is the $4\times 4$ identity matrix, then
the syndrome vector is $\bz = \bH \br$, where the {\dbf\ind{parity-check matrix}}
$\bH$ is given by $\bH = \left[ \begin{array}{cc} -\bP & \bI_3 \end{array}
\right]$; in modulo 2 arithmetic, $-1 \equiv 1$, so
\beq
\bH = \left[ \begin{array}{cc} \bP & \bI_3 \end{array}
\right] = \left[
\begin{array}{ccccccc}
\tt 1&\tt 1&\tt 1&\tt 0&\tt 1&\tt 0&\tt 0 \\
\tt 0&\tt 1&\tt 1&\tt 1&\tt 0&\tt 1&\tt 0 \\
\tt 1&\tt 0&\tt 1&\tt 1&\tt 0&\tt 0&\tt 1
\end{array} \right] .
\label{eq.pcmatrix}
\eeq
All the codewords $\bt = \bG^{\T} \bs$ of the code satisfy
\beq
\bH \bt = \left[ {\tt \begin{array}{c} \tt0\\ \tt0\\ \tt0 \end{array} } \right] .
% (0,0,0) .
\eeq
\exercisaxB{1}{ex.GHis0}{
Prove that this is so by evaluating the $3\times4$ matrix $\bH \bG^{\T}$.
}
Since the received vector $\br$ is given by $\br = \bG^{\T}\bs + \bn$,
% and $\bH \bG^{\T}$=0,
the syndrome-decoding problem is to find the
most probable noise vector $\bn$ satisfying
the equation
\beq
\bH \bn = \bz .
\eeq
A decoding algorithm that solves this problem is called
a {\dem maximum-likelihood decoder}. We will discuss
decoding problems like this in later chapters.
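
As a concrete check of this formulation, take the two-flip example above, in which
$r_3$ and $r_7$ were flipped, so that
$\bn = ({\tt 0},{\tt 0},{\tt 1},{\tt 0},{\tt 0},{\tt 0},{\tt 1})^{\T}$.
Multiplying by the matrix $\bH$ of equation (\ref{eq.pcmatrix}) adds, modulo 2,
the third and seventh columns of $\bH$:
\beq
 \bz = \bH \bn =
 \left[ \begin{array}{c} \tt1\\ \tt1\\ \tt1 \end{array} \right] +
 \left[ \begin{array}{c} \tt0\\ \tt0\\ \tt1 \end{array} \right] =
 \left[ \begin{array}{c} \tt1\\ \tt1\\ \tt0 \end{array} \right] .
\eeq
This is the syndrome {\tt 110}, which is also the second column of $\bH$,
so the noise vector that flips only $r_2$ satisfies $\bH \bn = \bz$ too.
Assuming $f < 1/2$, a noise vector with one flip is more probable than any
noise vector with two or more flips, which is why the decoder
suspects $r_2$.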
%\footnote{Somewhere in this book
% I need to spell out \Bayes\ theorem for decoding. Here would be
% a good spot; but on the other hand, people can understand decoding
% intuitively, they don't need Bayes theorem and they might find it
% a hindrance if they were not only being hit by
% Shannon's theorem but also by likelihoods and priors.}
%
% ACTION NEEDED ????????????????????????????????????????
%
\end{aside}
\begin{figure}
%\fullwidthfigure{%
\figuredanglenudge{%
\begin{center}
\setlength{\unitlength}{0.8in}% was 1in, with figures 1.25 wide % then was 0.8 with 1in
\begin{picture}(7,2.7)(0,2.8)
\put(0,5){\makebox(0,0)[tl]{\psfig{figure=bitmaps/dilbert.ps,width=1in}}}
\put(0.625,5.4){\makebox(0,0){\Large$\bs$}}
\thicklines
\put(1.35,4.75){\vector(1,0){0.4}}
\put(1.55,5.4){\makebox(0,0){{\sc encoder}}}
\put(2,5){\makebox(0,0)[tl]{\psfig{figure=poster/10000.h74.ps,width=1in}}}
\put(1.982,3.75){\makebox(0,0)[tr]{{parity bits} $\left.\rule[-0.342in]{0pt}{0.342in} \right\{$}}
\put(2.625,5.4){\makebox(0,0){\Large$\bt$}}
\put(3.6,5.4){\makebox(0,0){{\sc channel}}}
\put(3.6,5.15){\makebox(0,0){$f={10\%}$}}
\put(3.4,4.75){\vector(1,0){0.4}}
\put(4,5){\makebox(0,0)[tl]{\psfig{figure=poster/10000.h74.0.10.ps,width=1in}}}
\put(4.625,5.4){\makebox(0,0){\Large$\br$}}
\put(5.6,5.4){\makebox(0,0){{\sc decoder}}}
%\put(5.6,3.5){\makebox(0,0)[tl]{\parbox[t]{1.75in}{{\em The decoder picks the $\hat{\bs}$ with maximum likelihood.}}}}
\put(5.4,4.75){\vector(1,0){0.4}}
\put(6,5){\makebox(0,0)[tl]{\psfig{figure=poster/10000.h74.0.10.d.ps,width=1in}}}
\put(6.625,5.4){\makebox(0,0){\Large$\hat{\bs}$}}
\end{picture}
\end{center}
}{%
\caption[a]{Transmitting $10\,000$ source bits over a binary symmetric channel
with $f=10\%$
%0.1$
using a $(7,4)$ Hamming code. The probability
of decoded bit error is about 7\%.}
% \dilbertcopy}
\label{fig.h74.dilbert}
}{0.7in}% third argument is the upward nudge of the caption
\end{figure}
\subsection{Summary of the $(7,4)$ Hamming code's properties}
Every possible received vector of length 7 bits is either a codeword,
or it's one flip away from a codeword.\index{Hamming code}
Since there are three parity constraints, each of which might
or might not be violated, there are
$2\times 2\times 2= 8$
% eight
distinct syndromes. They can be divided
into seven non-zero syndromes -- one
for each of the one-bit error patterns --
and the all-zero syndrome, corresponding to the zero-noise case.
The optimal decoder takes no action if the syndrome is zero;
otherwise it uses this mapping of non-zero syndromes onto one-bit error
patterns to unflip the suspect bit.
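
A quick count is consistent with the claim that every received vector
is either a codeword or one flip away from a codeword.
The code has $2^4 = 16$ codewords, and each codeword has seven
neighbours that differ from it in exactly one bit. No vector is counted
twice, because the seven one-bit error patterns have distinct non-zero
syndromes. The codewords and their neighbours therefore number
\beq
 16 \times ( 1 + 7 ) = 128 = 2^7 ,
\eeq
which is exactly the number of binary vectors of length 7.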
There is a {\dbf decoding error} if the four decoded bits $\hat{s}_1,
\hat{s}_2, \hat{s}_3, \hat{s}_4$ do not all match the source bits ${s}_1,
{s}_2, {s}_3, {s}_4$. The {\dbf probability of block error} $\pB$ is
the probability that one or more of the decoded bits in one block fail to
match the corresponding source bits,
\beq
\pB = P( \hat{\bs} \neq \bs ) .
\eeq
The {\dbf probability of bit error} $\pb$ is
the average probability
% per decoded bit
that a decoded bit fails to
match the corresponding source bit,
\beq
\pb = \frac{1}{K} \sum_{k=1}^K P( \hat{s}_k \neq s_k ) .
\eeq
In the case of the Hamming code,
a decoding error will occur whenever the noise has flipped more than
one bit in a block of seven.
% Any noise pattern that flips more than one bit will give rise to one of
% these syndromes, and our decoder will make an erroneous decision.
%
The probability of block error is thus the probability that two or more
bits are flipped in a block. This probability scales as $O(f^2)$, as did the
probability of error for the repetition code
$\Rthree$. But notice that the Hamming code
communicates at a greater rate, $R=4/7$.
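
Taking this statement at face value, the probability of block error can be
written out exactly: it is one, minus the probability of no flips, minus the
probability of exactly one flip,
\beq
 \pB = 1 - (1-f)^7 - 7 f (1-f)^6 .
\eeq
As an illustrative evaluation, at the noise level $f = 0.1$ of
\figref{fig.h74.dilbert} this gives
$\pB \simeq 1 - 0.478 - 0.372 = 0.150$.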
\Figref{fig.h74.dilbert} shows a binary image transmitted over
a binary symmetric channel using the $(7,4)$ Hamming code.
About 7\% of the decoded bits are in error. Notice that
the errors are correlated:
% with each other:
often two or three successive
decoded bits are flipped.
\exercisaxA{1}{ex.Hdecode}{
This exercise and the next three refer to the
$(7,4)$ \ind{Hamming code}. Decode the received strings:
\ben
\item $\br = {\tt 1101011}$ % 10
\item $\br = {\tt 0110110}$ % 4
\item $\br = {\tt 0100111}$ % 4
\item $\br = {\tt 1111111}$. % 15
\een
}
\exercissxA{2}{ex.H74p}{
\ben \item
Calculate the probability of block error $p_{\rm B}$ of the $(7,4)$ Hamming
code
as a function of the noise level $f$ and show that to leading order
% \footnote{Do I need to explain what this means? Or use a different
% terminology? Maybe only physicists are familiar?}
%
% ACTION!!!
%
it goes as $21 f^2$.
\item
% }
% \exercis{}{
\difficulty{3}
% $^{B3}$
Show that to leading order the probability of
bit error $\pb$ goes as $9 f^2$.
\een}
\exercissxA{2}{ex.H74zero}{
% Hamming $(7,4)$ code.
Find some noise vectors that give the all-zero syndrome (that is,
noise vectors that leave all the parity checks unviolated).
How many such noise vectors are there?
}
% they are the codewords.
\exercisaxB{2}{ex.H74detail}{
% Hamming $(7,4)$ code.
I asserted above that a block decoding error will result
whenever two or more bits are flipped in a single block.
Show that this is indeed so. [In principle, there might be
error patterns that, after decoding, led only to the corruption
of the parity bits, with no source bits incorrectly
decoded.]
}
\subsection{Summary of codes' performances}
\label{sec.code.perf}
Figure \ref{fig.pbR.RH} shows the performance of \ind{repetition code}s and
the \ind{Hamming code}. It also shows the performance of a family of linear
block codes that are generalizations of Hamming codes, called \ind{BCH codes}.
% Reed-Muller codes, and
% see end of this file for method
%
\begin{figure}[htbp]
\figuremargin{%
\begin{center}
\begin{tabular}{cc}
\hspace{-0.2in}\psfig{figure=\codefigs/rephambch.1.ps,angle=-90,width=2.6in} &
\pbobject\psfig{figure=\codefigs/rephambch.1.l.ps,angle=-90,width=2.6in} \\
\end{tabular}
\end{center}
}{%
\caption[a]{Error probability $\pb$ versus rate $R$ for repetition codes,
the $(7,4)$ Hamming code and BCH codes with blocklengths up to 1023
over a binary symmetric channel with $f=0.1$.
The righthand figure shows $\pb$ on a logarithmic scale.}
\label{fig.pbR.RH}
}
\end{figure}
%
%\noindent
% use this noindent if the ``h'' (here) works, otherwise new para.
This figure shows that we can, using linear block codes, achieve better
performance than repetition codes; but the asymptotic situation still
looks grim.
\exercissxA{4}{ex.makecode}{
% invent your own code
Design an error-correcting code and a decoding algorithm for it,
estimate its probability of error,
and add it to figure \ref{fig.pbR.RH}.
[Don't worry if you find it difficult to make a code better than the
Hamming code, or if you find it difficult to find a good
decoder for your code; that's the point of this exercise.]
}
\exercissxA{3}{ex.makecode2error}{
A $(7,4)$ Hamming code
can correct any {\em one\/} error;
might there be a
% (10,4)
$(14,8)$ code
that can correct any two errors?
% What about a (9,4) code?
{\sf Optional extra:} Does the answer to this question
depend on whether the code is linear or nonlinear?
}
\exercissxA{4}{ex.makecode2}{
Design an error-correcting code, other than
a repetition code, that can
correct any {\em two\/} errors in a block of size $N$.
}
\section{What performance can the best codes achieve?}
There seems to be a trade-off between the decoded bit-error
probability $\pb$ (which we would like to reduce) and the rate $R$ (which
we would like to keep large). How can this trade-off be
characterized?
% Can we do better than repetition codes?
What points in
the $(R,\pb)$ plane are achievable? This question was addressed by
Claude Shannon\index{Shannon, Claude} in his pioneering paper of 1948, in which he both created the
field of information theory and solved most of its fundamental
problems.
% in the same paper.
At that time there was a widespread belief that the
boundary between achievable and nonachievable points in the
$(R,\pb)$ plane was a curve passing through the origin $(R,\pb) = (0,0)$;
if this were so, then, in order to achieve a vanishingly small
error probability $\pb$, one would have to reduce the rate
correspondingly close to zero.
% (figure ref here).
% This would seem a reasonable guess,
% in accordance with the general rule that the better something works
% the more you have to pay for it.
%
% ACTION: sanjoy doesn't like This
%
`No pain, no gain.'
However, Shannon proved the remarkable result that\wow\
% , for any given channel,
the boundary
between achievable and nonachievable points meets the $R$
axis at a {\em non-zero\/} value $R=C$, as shown in \figref{fig.pbR.RHS}.
\begin{figure}[htbp]
\figuremargin{%
\begin{center}
\begin{tabular}{cc}
\hspace{-0.2in}\psfig{figure=\codefigs/repshan.1.ps,angle=-90,width=2.6in} &
\pbobject\psfig{figure=\codefigs/repshan.1.l.ps,angle=-90,width=2.6in} \\
\end{tabular}
\end{center}
}{%
\caption[a]{Shannon's noisy-channel coding theorem.\indexs{noisy-channel coding theorem}\index{Shannon, Claude}
The solid curve shows the Shannon limit
on achievable values of $(R,\pb)$ for
the binary symmetric channel with $f=0.1$.
Rates up to $R=C$ are achievable with arbitrarily small $\pb$.
The points show the performance of some textbook codes,
as in \protect\figref{fig.pbR.RH}.
%\indent MANUAL INDENT
\hspace{1.5em}The equation defining the Shannon limit (the solid curve) is
%\[
$R = \linefrac{C}{(1-H_2(\pb))},$
%\]
where $C$ and $H_2$ are defined in \protect \eqref{eq.capacity}.
}
\label{fig.pbR.RHS}
}
\end{figure}
% see end of this file for method
%
For any channel, there exist codes that make it possible to
communicate with {\em arbitrarily small\/} probability of
error $\pb$ at non-zero rates. The first half of this book ({\partnoun}s I--III) will be
devoted to understanding this remarkable result, which is called
the {\dbf{noisy-channel coding theorem}}.
\subsection{Example: $f=0.1$}% A few details}
The maximum rate at which communication is possible with arbitrarily
small $\pb$ is called the {\dbf\ind{capacity}} of the channel.\index{channel!capacity}
The formula for the capacity of a binary
symmetric channel with noise level $f$ is\index{binary entropy function}
\beq
C(f) = 1 - H_2(f) = 1 - \left[ f \log_2
\frac{1}{f} + (1-f) \log_2 \frac{1}{1-f} \right] ;
\label{eq.capacity}
\eeq
the channel we were discussing earlier with noise level $f=0.1$
has capacity $C \simeq 0.53$. Let us consider what this means in terms
of noisy \disc{} drives. The \ind{repetition code} $\Rthree$ could communicate over this
channel with $\pb=0.03$ at a rate $R = 1/3$. Thus we know how
to build a single gigabyte \disc{} drive with $\pb = 0.03$
from three noisy gigabyte \disc{} drives. We also know how to make
a single gigabyte \disc{} drive
with $\pb \simeq 10^{-15}$ from sixty
noisy one-gigabyte drives \exercisebref{ex.R60}.
And now Shannon\index{Shannon, Claude}
passes by, notices us
\ind{juggling}
% tinkering
with \disc{} drives and codes and says:
\begin{quotation}
\noindent
`What performance are you trying to achieve?
$10^{-15}$? You don't need {\em sixty\/} \disc{} drives --
you can get that performance with just
{\em two\/} \disc{} drives (since 1/2 is less than $0.53$).
% (The capacity is 0.53, so the number of \disc{} drives needed at
% capacity is 1/0.53.)
% `
And if you want $\pb = 10^{-18}$
% , or $10^{-21}$,
or $10^{-24}$ or anything,
you can get there with two \disc{} drives too!'
\end{quotation}
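
The figure $C \simeq 0.53$ used in this comparison can be checked directly
from equation (\ref{eq.capacity}). For $f = 0.1$,
\beq
 H_2(0.1) = 0.1 \log_2 \frac{1}{0.1} + 0.9 \log_2 \frac{1}{0.9}
 \simeq 0.332 + 0.137 = 0.469 ,
\eeq
so $C(0.1) = 1 - H_2(0.1) \simeq 0.531$.
The rate $R = 1/2$ of a two-drive scheme lies just below this value,
whereas the sixty-drive scheme runs at the much lower rate $R = 1/60$.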
%\begin{aside}
[Strictly, the above statements might not be quite right, since,
as we shall see, Shannon
proved his
noisy-channel coding theorem
%proves the achievability of ever smaller
% error probabilities at a given rate $Ra$)
is defined to be $\int_{a}^{b} \! \d v \: P(v)$. $P(v)\d v$ is dimensionless.
The density $P(v)$ is a dimensional
quantity, having dimensions inverse to the dimensions of $v$ -- in contrast
to discrete probabilities, which are dimensionless. Don't be surprised
to see probability densities greater than 1. This is normal, and nothing
is wrong, as long as $\int_{a}^{b} \! \d v \: P(v) \leq 1$ for any interval $(a,b)$.
Conditional and joint probability densities
are defined in just the same way as conditional and joint probabilities.
% , which is why I choose not to use different notation for them.
\end{aside}
% More equations here.
%
% bring from chapter 4?
%
% at present ch 4 refers to this page as the first occurrence of
% Laplace's rule.
%
% Sort out this mess:::::::::::::::
% p30 Ex 2.8 : There claims to be a solution to this on p121 but this is
%actually a solution to Ex 6.2
%Generally would be helpful if notation in Chapters 2 and 6 was the same
%
% !!!!!!!!!!!!!!!!!!!! Idea: move this exe to the end of this subsection?
% THIS EX seems to have no solution
\exercisaxB{2}{ex.postpa}{% solution added Mon 10/11/03
Assuming a uniform prior on $f_H$, $P(f_H) = 1$,
solve the problem posed in \exampleref{exa.bentcoin}.
Sketch the posterior distribution of $f_H$
and compute the probability that the $N\!+\!1$th outcome will be a head,
for
\ben
\item $N=3$ and $n_H=0$;
\item $N=3$ and $n_H=2$;
\item $N=10$ and $n_H=3$;
\item
$N=300$ and $n_H=29$.
\een
You will find the \ind{beta integral} useful:
\beq
\int_0^1 \! \d p_a \: p_a^{F_a} (1-p_a)^{F_b} =
\frac{\Gamma(F_a+1)\Gamma(F_b+1)}{ \Gamma(F_a+F_b+2) }
= \frac{ F_a! F_b! }{ (F_a + F_b + 1)! } .
\eeq
You may also find it instructive to look back at
\exampleref{ex.ip.urns} and \eqref{eq.laplace.succession.first}.
}
People sometimes confuse assigning a prior distribution
to an unknown parameter such as $f_H$ with making an initial guess
of the {\em{value}\/} of the parameter.
% But priors are not values, they are distributions.
But the prior over $f_H$, $P(f_H)$, is not a simple statement
like `initially, I would guess $f_H = \dhalf$'.
The prior is a probability density over $f_H$ which
specifies the prior degree of belief that $f_H$ lies
in any interval $(f,f+\delta f)$. It may well be the case
that our prior for $f_H$ is symmetric about $\dhalf$, so that the
{\em mean\/} of $f_H$ under the prior is $\dhalf$.
%under our prior for $f_H$, the {\em mean\/} of $f_H$ is $\dhalf$
% -- on symmetry grounds for example.
In this case, the
predictive distribution {\em for the first toss\/} $x_1$ would indeed be
\beq
P(x_1 \eq \mbox{head}) =
\int \! \d f_H \: P(f_H) P(x_1 \eq \mbox{head} \given f_H)
= \int \! \d f_H \: P(f_H) f_H = \dhalf .
\eeq
But the prediction for subsequent tosses will depend on
the whole prior distribution, not just its mean.
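
To see this in a small case (a sketch, assuming that successive tosses
are independent given $f_H$), compare two priors that both have mean
$\dhalf$: the uniform prior $P(f_H) = 1$, and a prior that puts all its
probability at the single value $f_H = \dhalf$. After a head on the
first toss, the probability of a head on the second toss is
\beq
 P(x_2 \eq \mbox{head} \given x_1 \eq \mbox{head}) =
 \frac{ \int \! \d f_H \: P(f_H) \, f_H^2 }
 { \int \! \d f_H \: P(f_H) \, f_H } .
\eeq
For the uniform prior this ratio is $(1/3)/(1/2) = 2/3$;
for the prior concentrated at $f_H = \dhalf$ it is $(1/4)/(1/2) = 1/2$.
The two priors agree about the first toss and disagree about the second.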
\subsubsection{Data compression and inverse probability}
Consider the following task.
\exampl{ex.compressme}{
Write a computer program capable of compressing binary files like this
one:\par
\begin{center}{\footnotesize%was tiny
{\tt 0000000000000000000010010001000000100000010000000000000000000000000000000000001010000000000000110000}\\
{\tt 1000000000010000100000000010000000000000000000000100000000000000000100000000011000001000000011000100}\\
{\tt 0000000001001000000000010001000000000000000011000000000000000000000000000010000000000000000100000000}\\[0.1in]% added this space Sat 21/12/02
}
\end{center}
% This file contains N=300 and n_1 = 29
The string shown contains $n_1=29$ {\tt 1}s
and $n_0=271$ {\tt 0}s.
% What is the probability that the next character in this file
% is a {\tt 1}?
}
Intuitively, compression works by taking advantage of the predictability
of a file. In this case, the source of the file
appears more likely to emit
{\tt 0}s than {\tt 1}s. A data compression program that compresses
this file must, implicitly or explicitly, be addressing the
question `What is the probability that the next character in this file
is a {\tt 1}?'
Do you think this problem is similar in character
to \exampleref{exa.bentcoin}? I do. One of the themes
of this book is that data compression and
data modelling are one and the same, and that they should
both be addressed, like the urn of example \ref{ex.ip.urns},
using inverse probability.
\Exampleonlyref{ex.compressme} is solved in \chref{ch4}.
%
% SOLVE IT HERE???
%
\subsection{The likelihood principle}
\label{sec.lp}
Please solve the following two exercises.
\exampl{ex.lp1}{
Urn\amarginfig{c}{\begin{center}\psfig{figure=figs/urnsA.ps,width=1.6in}\end{center}
\caption[a]{Urns for \protect\exampleonlyref{ex.lp1}.}}
A contains three balls: one black, and two white;
\ind{urn} B contains three balls: two black, and one white.
One of the urns is selected at random and one ball
is drawn. The ball is black. What is the probability
that the selected urn is urn A?
}
%
\exampl{ex.lp2}{
Urn\amarginfig{c}{\begin{center}\psfig{figure=figs/urns.ps,width=1.6in}\end{center}%
\caption[a]{Urns for \protect\exampleonlyref{ex.lp2}.}}
A contains five balls: one black, two white, one green and one pink;
urn B contains five hundred balls:
two hundred black, one hundred white, 50 yellow, 40 cyan, 30 sienna,
25 green, 25 silver, 20 gold, and 10 purple.
[One fifth of A's balls are black; two-fifths of B's are black.]
One of the urns is selected at random and one ball
is drawn. The ball is black. What is the probability
that the urn is urn A?
}
%
What do you notice about your solutions? Does each answer
depend on the detailed contents of each urn?
The details of the other possible outcomes and their probabilities
are irrelevant. All that matters is the probability of the outcome
that actually happened (here, that the ball drawn was black) given the different
hypotheses. We need only to know the {\em likelihood}, \ie,
how the probability of the data that happened varies with the
hypothesis.
This simple rule about inference
is known as the {\dbf\ind{likelihood principle}}.\label{sec.likelihoodprinciple}
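
For the record, here is the calculation, assuming that `selected at random'
means that each urn has prior probability $1/2$.
In \exampleonlyref{ex.lp1},
\beq
 P(\mbox{A} \given \mbox{black}) =
 \frac{ \frac{1}{2} \times \frac{1}{3} }
 { \frac{1}{2} \times \frac{1}{3} + \frac{1}{2} \times \frac{2}{3} }
 = \frac{1}{3} ,
\eeq
and in \exampleonlyref{ex.lp2},
\beq
 P(\mbox{A} \given \mbox{black}) =
 \frac{ \frac{1}{2} \times \frac{1}{5} }
 { \frac{1}{2} \times \frac{1}{5} + \frac{1}{2} \times \frac{2}{5} }
 = \frac{1}{3} .
\eeq
In both cases the likelihoods of the two hypotheses are in the ratio
$1 : 2$, and that ratio is all that enters the answer.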
%
% NOTE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% { \em (connect back to this point when discussing
% early stopping and inference in problems where the stopping rule is not known.)}
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% README NOTE!!!!!!!!!!
\begin{conclusionbox}
{\sf The likelihood principle:}
given a generative model for data $d$ given parameters $\btheta$, $P(d \given \btheta)$,
and having observed a particular outcome $d_1$, all inferences\index{key points!likelihood principle}
and predictions should depend only on the function $P(d_1 \given \btheta)$.
\end{conclusionbox}
\noindent
In spite of the simplicity of this principle,
many classical statistical methods violate it.\index{classical statistics!criticisms}\index{sampling theory!criticisms}
% \newpage
\section{Definition of entropy and related functions}
\begin{description}
\item[The Shannon information content of an outcome $x$] is defined to be
% We define for each $x \in \A_X$, $
\beq
h(x) = \log_2 \frac{1}{P(x)} .
\eeq
% We can interpret $h(a_i)$ as the information content of the event
% $x \eq a_i$.
It is measured in bits. [The word `bit' is also used to
denote a variable whose value is 0 or 1; I hope context will
always make clear which of the two meanings is intended.]
\noindent
In the next few chapters, we will establish that
the Shannon information content $h(a_i)$ is indeed a natural measure of
the information content of the event $x \normaleq a_i$.
At that point, we will shorten the name of this quantity to
`the information content'.
\margintab{%
\begin{center}\small%footnotesize
%
% vertical table of a-z with probabilities, and information contents too;
% four decimal place
\begin{tabular}[t]{cccr} \toprule
$i$ & $a_i$ & $p_i$ & \multicolumn{1}{c}{$h(p_i)$} \\ \midrule
% $i$ & $a_i$ & $p_i$ & \multicolumn{1}{c}{$\log_2 \frac{1}{p_i}$} \\ \midrule
%
1 & {\tt a} &.0575 & 4.1 \\
2 & {\tt b} &.0128 & 6.3 \\
3 & {\tt c} &.0263 & 5.2 \\
4 & {\tt d} &.0285 & 5.1 \\
5 & {\tt e} &.0913 & 3.5 \\
6 & {\tt f} &.0173 & 5.9 \\
7 & {\tt g} &.0133 & 6.2 \\
8 & {\tt h} &.0313 & 5.0 \\
9 & {\tt i} &.0599 & 4.1 \\
10 &{\tt j} &.0006 & 10.7 \\
11 &{\tt k} &.0084 & 6.9 \\
12 &{\tt l} &.0335 & 4.9 \\
13 &{\tt m} &.0235 & 5.4 \\
14 &{\tt n} &.0596 & 4.1 \\
15 &{\tt o} &.0689 & 3.9 \\
16 &{\tt p} &.0192 & 5.7 \\
17 &{\tt q} &.0008 & 10.3 \\
18 &{\tt r} &.0508 & 4.3 \\
19 &{\tt s} &.0567 & 4.1 \\
20 &{\tt t} &.0706 & 3.8 \\
21 &{\tt u} &.0334 & 4.9 \\
22 &{\tt v} &.0069 & 7.2 \\
23 &{\tt w} &.0119 & 6.4 \\
24 &{\tt x} &.0073 & 7.1 \\
25 &{\tt y} &.0164 & 5.9 \\
26 &{\tt z} &.0007 & 10.4 \\
27 &{\tt{-}}&.1928 & 2.4 \\ \midrule
%27 &\verb+-+&.1928 & 2.4 \\ \midrule
& & & \\[-0.1in]
\multicolumn{3}{r}{
$\displaystyle \sum_i p_i \log_2 \frac{1}{p_i}$
} & 4.1 \\ \bottomrule % 4.11
\end{tabular}\\
\end{center}
% vertical table of a-z with probabilities, and information contents too;
\caption[a]{Shannon information contents of the 27 outcomes {\tt a}--{\tt z} and {\tt -}.}
\label{fig.monogram.log}
}
%
The fourth column in \tabref{fig.monogram.log} shows the Shannon
information content of the 27 possible outcomes when
a
random character is picked from an English document. The
outcome
% character
$x={\tt z}$ has a Shannon information content of
10.4 bits, and $x={\tt e}$ has an information content of 3.5 bits.
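
 As a check on one entry, using the rounded probability shown in the third
 column,
 \beq
 h({\tt e}) = \log_2 \frac{1}{0.0913} \simeq 3.45 ,
 \eeq
 which appears in the table as 3.5 bits. The most probable outcome,
 {\tt -}, with $p = 0.1928$, has the smallest information content,
 $\log_2 (1/0.1928) \simeq 2.4$ bits.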
\item[The entropy of an ensemble $X$] is defined to be the average Shannon information
content of an outcome:
% from that ensemble:
\beq
H(X) \equiv \sum_{x \in \A_X} P(x) \log \frac{1}{P(x)},
\eeq
%\beq
% H(X) = \sum_i p_i \log \frac{1}{p_i},
%\eeq
with the convention for $P(x) \normaleq 0$ that \mbox{$0 \times \log 1/0 \equiv 0$},
since \mbox{$\lim_{\theta\rightarrow 0^{+}} \theta \log 1/\theta \normaleq 0 $}.
Like the information content, entropy is measured in bits.
When it is convenient, we may also write $H(X)$ as $H(\bp)$,
where $\bp$ is the vector $(p_1,p_2,\ldots,p_I)$.
Another name for the entropy of $X$ is the uncertainty of $X$.
\end{description}
\noindent
% The entropy is a measure of the information content or
% `uncertainty' of $x$. The question of why entropy is a
% fundamental measure of information content will be discussed in the
% forthcoming chapters. Here w
% was continued example
\exampl{eg.mono}{
The entropy of a
randomly selected letter in an English document
is about 4.11 bits, assuming its probability
is as given in \tabref{fig.monogram.log}.
%, p.\ \pageref{fig.monogram}.
% \tabref{tab.mono}.
We obtain this number by averaging $\log 1/p_i$ (shown in the fourth
column) under the probability distribution $p_i$ (shown in the third column).
}
We now note some properties of the entropy function.
\bit
\item
$H(X) \geq 0$ with equality iff $p_i \normaleq 1$ for one $i$.
[{`iff' means
`if and only if'.}]
\item Entropy is maximized if $\bp$ is uniform:
\beq
H(X) \leq \log(|\A_X|)
\:\: \mbox{ with equality iff $p_i \normaleq 1/|\A_X|$ for all $i$. }
\eeq
% \footnote{Exercise: Prove this assertion.}
{\sf Notation:}\index{notation!absolute value}\index{notation!set size}
the vertical bars `$|\cdot|$'
have two meanings.
% If $X$ is an ensemble, then
If $\A_X$ is a set,
$|\A_X|$ denotes the number of elements in $\A_X$;
if $x$ is a number,
% for example, the value of a random variable,
then $|x|$ is the absolute value of $x$.
\eit
%
% Mon 22/1/01
The {\dem\ind{redundancy}}
measures the fractional difference
between $H(X)$ and its maximum possible value,
$\log(|\A_X|)$.
\begin{description}%
\item[The redundancy of $X$] is:
\beq
1 - \frac{H(X)}{\log |\A_X|} .
\eeq
We won't make use of `redundancy'
% need this definition
in this book, so
I have not assigned a symbol to it.
% -- it would be redundant.
\end{description}
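
As an illustration, the distribution of \tabref{fig.monogram.log} has
$|\A_X| = 27$ and $H(X) \simeq 4.11$ bits. The maximum possible entropy
for 27 outcomes is $\log_2 27 \simeq 4.755$ bits, so the redundancy of this
distribution is
\beq
 1 - \frac{4.11}{4.755} \simeq 0.136 .
\eeq
The same numbers illustrate the bound $H(X) \leq \log(|\A_X|)$:
$4.11 < 4.755$, with strict inequality because the letters are not
equiprobable.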
% ha ha
% funny but true.
% example: X is select a codeword from a code - H(X) = K, but |X| = 2^N
%
% Redundancy = 1 - R
% of code
\begin{description}% duplicated in _l1a and _p5A
\item[The joint entropy of $X,Y$] is:
\beq
H(X,Y) = \sum_{xy \in \A_X\A_Y} P(x,y) \log \frac{1}{P(x,y)}.
\eeq
Entropy is additive for independent random variables:
\beq
H(X,Y) = H(X) +H(Y) \:\mbox{ iff }\: P(x,y)=P(x)P(y).
\label{eq.ent.indep}% also appears in p5a (.again)
\eeq
\end{description}
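
The `if' direction of equation (\ref{eq.ent.indep}) can be verified in a few lines.
Substituting $P(x,y) = P(x)P(y)$ into the definition of the joint entropy,
and using $\sum_x P(x) = \sum_y P(y) = 1$,
\beqan
 H(X,Y) &=& \sum_{x,y} P(x) P(y)
 \left[ \log \frac{1}{P(x)} + \log \frac{1}{P(y)} \right]
 \nonumber \\
 &=& \sum_{x} P(x) \log \frac{1}{P(x)}
 + \sum_{y} P(y) \log \frac{1}{P(y)}
 \nonumber \\
 &=& H(X) + H(Y) . \nonumber
\eeqan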
\label{sec.entropy.end.parta}
Our definitions for information content
so far apply only to discrete probability distributions
over finite sets $\A_X$. The definitions can be extended
to infinite sets, though the entropy may then be infinite.
The case of a probability {\em density\/} over a continuous set is
addressed in section \ref{sec.entropy.continuous}.\index{probability!density}
Further important definitions and exercises to do with entropy
will come along in section \ref{sec.entropy.contd}.
\section{Decomposability of the entropy}
The entropy function satisfies a recursive property
that can be very useful when computing entropies.
For convenience, we'll stretch our notation\index{notation!entropy}
so that we can write $H(X)$ as $H(\bp)$, where
$\bp$ is the probability vector associated with the ensemble $X$.
Let's illustrate the property by an example first.
Imagine that a random variable $x \in \{ 0,1,2 \}$
is created by first flipping a fair coin to determine
whether $x = 0$; then, if $x$ is not 0,
flipping a fair coin a second time to determine whether
$x$ is 1 or 2.
The probability distribution of $x$ is
\beq
P( x\! =\! 0 ) = \frac{1}{2} ; \:\:
P( x\! =\! 1 ) = \frac{1}{4} ; \:\:
P( x\! =\! 2 ) = \frac{1}{4} .
\eeq
What is the entropy of $X$? We can either compute it by brute
force:
\beq
H(X) = \dfrac{1}{2} \log 2 + \dfrac{1}{4} \log 4 + \dfrac{1}{4} \log 4
= 1.5 ;
\eeq
or we can use the following decomposition, in which the value of $x$
is revealed gradually.
Imagine first learning whether $x\! =\! 0$, and then,
if $x$ is not $0$, learning which non-zero value is the case. The revelation
of whether $x\! =\! 0$ or not entails revealing a
binary variable whose probability distribution is $\{\dhalf,\dhalf \}$.
This revelation has an entropy $H(\dhalf,\dhalf) = \frac{1}{2} \log 2 +\frac{1}{2} \log 2 = 1\ubit$.
If $x$ is not $0$, we learn the value of the second coin flip.
This too is a
binary variable whose probability distribution is $\{\dhalf,\dhalf\}$, and whose entropy is
$1\ubit$.
We only get to experience the second revelation half the time, however,
so the entropy can be written:
\beq
H(X) = H( \dhalf , \dhalf ) + \dhalf \, H( \dhalf , \dhalf ) .
\eeq
Generalizing, the observation we are making about the entropy
of any probability distribution $\bp = \{ p_1, p_2, \ldots , p_I \}$
is that
\beq
H(\bp) =
H( p_1 , 1\!-\!p_1 )
+ (1\!-\!p_1)
H \! \left(
\frac{p_2}{1\!-\!p_1} ,
\frac{p_3}{1\!-\!p_1} , \ldots ,
\frac{p_I}{1\!-\!p_1}
\right) .
\label{eq.entropydecompose}
\eeq
When it's written as a formula, this property
looks regrettably ugly; nevertheless it is a simple
property and one that you should make use of.
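
For a numerical check on a second distribution (chosen arbitrarily for
illustration, with logarithms to base 2), take
$\bp = \{ 1/2, 1/3, 1/6 \}$. Equation (\ref{eq.entropydecompose}) gives
\beq
 H(\bp) = H( \dhalf , \dhalf )
 + \dhalf \, H \! \left( \frac{2}{3} , \frac{1}{3} \right)
 \simeq 1 + \dhalf \times 0.918 = 1.459 ,
\eeq
in agreement with direct evaluation,
\beq
 \frac{1}{2} \log_2 2 + \frac{1}{3} \log_2 3 + \frac{1}{6} \log_2 6
 \simeq 0.500 + 0.528 + 0.431 = 1.459 .
\eeq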
Generalizing further, the entropy has the property for any $m$
that
\beqan
H(\bp) &=&
H\left[ ( p_1+p_2+\cdots+p_m ) , ( p_{m+1}+p_{m+2}+\cdots+p_I ) \right]
\nonumber
\\
&&+ ( p_1+
% p_2+
\cdots+p_m )
H\! \left(
\frac{p_1}{ ( p_1+\cdots+p_m ) } ,
% \frac{p_2}{ ( p_1+\cdots+p_m ) } ,
\ldots ,
\frac{p_m}{ ( p_1+\cdots+p_m ) }
\right)
\nonumber
\\
&& + ( p_{m+1}+
%p_{m+2}+
\cdots+p_I )
H \! \left(
\frac{p_{m+1}}{ ( p_{m+1}+\cdots+p_I ) } ,
% \frac{p_{m+2}}{ ( p_{m+1}+\cdots+p_I ) } ,
\ldots ,
\frac{p_I}{ ( p_{m+1}+\cdots+p_I ) }
\right) .
\nonumber
\\
\label{eq.entdecompose2}
\eeqan
\exampl{example.entropy}{
A source produces a character $x$
from the alphabet $\A = \{ {\tt 0}, {\tt 1}, \ldots, {\tt 9}, {\tt a}, {\tt b}, \ldots, {\tt z} \}$;
with probability $\dthird$, $x$ is a numeral (${\tt 0}, \ldots, {\tt 9}$);
with probability $\dthird$, $x$ is a vowel (${\tt a}, {\tt e}, {\tt i}, {\tt o}, {\tt u}$);
and with probability $\dthird$ it's one of the 21 consonants. All numerals are equiprobable,
and the same goes for vowels and consonants.
Estimate the entropy of $X$.
}
\solution\ \
$\log 3 + \frac{1}{3} ( \log 10 + \log 5 + \log 21 )= \log 3 + \frac{1}{3} \log 1050 \simeq \log 30\ubits$.\ENDsolution
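
In more detail (a sketch of the intermediate steps, applying the grouping
idea of equation (\ref{eq.entdecompose2}) to three groups rather than two):
first reveal which of the three classes $x$ belongs to, which has entropy
$H(\dthird,\dthird,\dthird) = \log 3$; then reveal which member of that
class $x$ is, which has entropy $\log 10$, $\log 5$ or $\log 21$, each
case arising with probability $\dthird$. With logarithms to base 2,
\beq
 H(X) \simeq 1.585 + \frac{1}{3} ( 3.322 + 2.322 + 4.392 ) \simeq 4.93 ,
\eeq
slightly more than $\log_2 30 \simeq 4.91$ bits.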
%> pr log(36)/log(2)
%5.16992500144231
%> pr log(30)/log(2)
%4.90689059560852
%> pr (log(3) +log(1050)/3.0 )/log(2)
%4.93035370490565
% This may be compared with the maximum entropy for an alphabet
% of 36 characters, $\log 36\ubits$.
\section{Gibbs' inequality}
% We will also find useful the following:
\begin{description}
% SPACE PROBLEM HERE ...
\item[The \ind{relative entropy} {\em or\/} \ind{Kullback--Leibler divergence}]
\marginpar[t]{\small\raggedright{The `ei' in L{\bf{ei}}bler is pronounced\index{pronunciation}
the same as in h{\bf{ei}}st.}}between two probability distributions $P(x)$ and $Q(x)$
that are defined over the same alphabet $\A_X$ is\index{entropy!relative}\index{divergence}
\beq
D_{\rm KL}(P||Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)} .
\label{eq.KL}
\label{eq.DKL}
\eeq
The relative entropy satisfies {\dem\ind{Gibbs' inequality}}
\beq
D_{\rm KL}(P||Q) \geq 0
\eeq
with equality only if $P \normaleq Q$. Note that in general
the relative entropy is not symmetric under interchange of the
distributions $P$ and $Q$:
in general
$D_{\rm KL}(P||Q) \neq D_{\rm KL}(Q||P)$, so $D_{\rm KL}$,
although it is sometimes called the `\ind{KL distance}',
is not strictly a
distance\index{distance!$D_{\rm KL}$}.\index{distance!relative entropy}
% `distance\index{distance!$D_{\rm KL}$}'.
% It is also known as the `discrimination' or `divergence',
The \ind{relative entropy} is important in pattern recognition and neural networks,
as well as in information theory.
%
% could include that aston guy's stuff here on (pq)^1/2?
%
% see also ../notation.tex
%
\end{description}
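
As a small illustration of both the inequality and the asymmetry
(the two distributions are chosen arbitrarily, and logarithms are to base 2),
let $P = \{ 1/2, 1/2 \}$ and $Q = \{ 1/4, 3/4 \}$ over a binary alphabet.
Then
\beq
 D_{\rm KL}(P||Q) = \frac{1}{2} \log_2 \frac{1/2}{1/4}
 + \frac{1}{2} \log_2 \frac{1/2}{3/4}
 \simeq 0.500 - 0.292 = 0.208 ,
\eeq
whereas
\beq
 D_{\rm KL}(Q||P) = \frac{1}{4} \log_2 \frac{1/4}{1/2}
 + \frac{3}{4} \log_2 \frac{3/4}{1/2}
 \simeq -0.250 + 0.439 = 0.189 .
\eeq
Both values are positive, and they differ.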
Gibbs' inequality is probably the most important inequality in this book.
It, and many other inequalities, can be proved
using the concept of convexity.
\section{Jensen's inequality for convex functions}
\begin{aside}
The
words `\ind{\convexsmile}'
and `\ind{\concavefrown}' may be pronounced `convex-smile'
and `concave-frown'.
This terminology has useful redundancy: while one
may forget which way up `convex' and `concave' are,
it is harder to confuse a smile with a frown.\index{notation!convex/concave}
\end{aside}
\begin{description}
%
\item[{\Convexsmile\ functions}\puncspace] A function $f(x)$ is {\dem \ind{\convexsmile}\/}
over $(a,b)$ if
\amarginfig{c}{%
\footnotesize
\setlength{\unitlength}{0.75mm}
\begin{tabular}{c}
\begin{picture}(60,60)(0,0)
\put(0,0){\makebox(60,65){\psfig{figure=figs/convex.eps,angle=-90,width=45mm}}}
\put(10,8){\makebox(0,0){$x_1$}}
\put(48,8){\makebox(0,0){$x_2$}}
\put(17,2){\makebox(0,0)[l]{$x^* = \lambda x_1 + (1-\lambda)x_2$}}
\put(31,23){\makebox(0,0){$f(x^*)$}}
\put(35,39){\makebox(0,0){$\lambda f(x_1) + (1-\lambda)f(x_2)$}}
\end{picture}
\end{tabular}
\caption[a]{Definition of convexity.}
\label{fig.convex}
}\
every chord of the function
lies above the function,
as shown in \figref{fig.convex}; that is,
for all $x_1,x_2
\in (a,b)$ and $0\leq \lambda \leq 1$,
\beq
f( \lambda x_1 + (1-\lambda)x_2 ) \:\:\leq \:\:\
\lambda f(x_1) + (1-\lambda) f(x_2 ) .
\eeq
A function $f$ is {\dem strictly
\convexsmile\/} if, for all $x_1,x_2 \in (a,b)$ with $x_1 \neq x_2$, the equality holds only
for $\lambda \normaleq 0$ and $\lambda\normaleq 1$.
Similar definitions apply to \concavefrown\ and strictly \concavefrown\
functions.
\end{description}
\newcommand{\tinyfunction}[2]{
\begin{tabular}{@{}c@{}}
{\small{#1}}
\\[-0.25in]
\psfig{figure=figs/#2.ps,width=1.06in,angle=-90}
\\
\end{tabular}
}
Some strictly \convexsmile\ functions are
\bit
\item $x^2$, $e^x$ and $e^{-x}$ for all $x$;
\item $\log (1/x)$ and $x \log x$ for $x>0$.
\eit
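One way to check these examples, assuming the standard result that a twice-differentiable
function whose second derivative is positive throughout an interval is strictly \convexsmile\ there
(a criterion not stated in the text above), is to differentiate twice, using natural logarithms:
\beq
\frac{\d^2}{\d x^2} x^2 = 2 , \:\:\:
\frac{\d^2}{\d x^2} e^{\pm x} = e^{\pm x} , \:\:\:
\frac{\d^2}{\d x^2} \ln \frac{1}{x} = \frac{1}{x^2} , \:\:\:
\frac{\d^2}{\d x^2} x \ln x = \frac{1}{x} .
\eeq
Each of these is positive on the stated range; changing the base of the logarithm
only multiplies the last two by a positive constant.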
\begin{figure}[htbp]
\figuremargin{%
\begin{center}
\raisebox{0.4in}{%
\begin{tabular}[c]{c@{}c@{}c@{}c}
\tinyfunction{$x^2$}{convex_xx} &
\tinyfunction{$e^{-x}$}{convex_exp-x} &
\tinyfunction{$\log \frac{1}{x}$}{convex_logix} &
\tinyfunction{$x \log x$}{convex_xlogx} \\[0.2in]
%\tinyfunction{$x^2$}{convex_xx} &
%\tinyfunction{$e^{-x}$}{convex_exp-x} \\[0.42in]
%\tinyfunction{$\log \frac{1}{x}$}{convex_logix} &
%\tinyfunction{$x \log x$}{convex_xlogx} \\[0.2in]
\end{tabular}
}
\end{center}
}{%
\caption[a]{\Convexsmile\ functions.}
\label{fig.convexf}
}%
\end{figure}
\begin{description}
\item[Jensen's inequality\puncspace] If $f$ is a \convexsmile\ function
and $x$ is a random variable then:
\beq
\Exp\left[ f(x) \right] \geq f\!\left( \Exp[x] \right) ,
\label{eq.jensen}
\eeq
where $\Exp$ denotes \ind{expectation}. If $f$ is strictly \convexsmile\ and
$\Exp\left[ f(x) \right] \normaleq f\!\left( \Exp[x] \right)$, then the random
variable $x$ is a constant.
% (with probability 1).
% |!!!!!!!!!!!!!!!!! removed pedantry
\ind{Jensen's inequality} can also be rewritten for a
\concavefrown\ function, with the direction of the inequality
reversed.
\end{description}
A physical version of Jensen's \ind{inequality} runs as follows.
\amarginfignocaption{b}{\mbox{\psfig{figure=figs/jensenmass.ps,width=1.75in,angle=-90}}}
\begin{quote}
If a collection of
masses $p_i$ are placed on a
\convexsmile\ curve $f(x)$
at locations $(x_i, f(x_i))$, then the
\ind{centre of gravity} of those masses, which is at $\left( \Exp[x],
\Exp\left[ f(x) \right] \right)$, lies above the curve.
\end{quote}
If this fails to convince you, then feel free to
do the following exercise.
\exercissxC{2}{ex.jensenpf}{
Prove \ind{Jensen's inequality}.
}
\exampl{ex.jensen}{
Three squares have average area $\bar{A} = 100\,{\rm m}^2$.
The average of the lengths of their sides is $\bar{l} = 10\,{\rm m}$.
What can be said about the size of the largest of the
three squares? [Use Jensen's inequality.]
}
\solution\ \
Let $x$ be the length of the side of a square, and let the
probability of $x$ be $\dthird,\dthird,\dthird$ over the
three lengths $l_1,l_2,l_3$. Then the information that we have is
that $\Exp\left[ x \right]=10$ and $\Exp\left[ f(x) \right]=100$,
where $f(x) = x^2$ is the function mapping lengths to areas.
This is a strictly \convexsmile\ function.
We notice that the equality
$\Exp\left[ f(x) \right] \normaleq f\!\left( \Exp[x] \right)$ holds,
therefore $x$ is a constant, and the three lengths
must all be equal. The area of the largest square is 100$\,{\rm m}^2$.\ENDsolution
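For this particular choice $f(x) = x^2$, the gap in Jensen's inequality is just the
variance of $x$, which makes the argument easy to check directly:
\beq
\Exp\left[ x^2 \right] - \left( \Exp[x] \right)^2 = \var[x] = 100 - 10^2 = 0 ,
\eeq
and a random variable with zero variance takes a single value, here $10\,{\rm m}$.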
\subsection{Convexity and concavity also relate to maximization}
If $f(\bx)$ is \concavefrown\ and there exists a point at which
\beq
\frac{\partial f}{\partial x_k} = 0 \:\: \mbox{for all $k$},
% \forall k
\eeq
then $f(\bx)$ has its maximum value at that point.
The converse does not hold: if a \concavefrown\ $f(\bx)$ is maximized at
some $\bx$ it is not necessarily true that the gradient
$\grad f(\bx)$ is equal
to zero there. For example, $f(x) = -|x|$ is maximized at $x=0$
where its derivative is undefined; and $f(p) = \log(p),$ for
a probability
$p \in (0,1]$, is maximized on the boundary of the range,
at $p=1$, where the gradient $\d f(p)/\d p =1$.
%, since $f$ might for example
% be an increasing function with no maximum such as $\log x$,
% or its maximum might be located at a point $\bx$
% on the boundary of the range of $\bx$.
%
%{\em (is this use of range correct?)}
% exercises from that.
%
% exercises that belong between old chapters 1 and 2.
%
% see also _p5a.tex for moved exercises.
%
\section{Exercises}
\subsection*{Sums of random variables}
% sums of random variables.
% dice questions
\exercissxA{3}{ex.sumdice}{
\ben
\item
Two ordinary dice with faces labelled $1,\ldots,6$
are thrown. What is the probability distribution of
the sum\index{law of large numbers}
of the values? What is the probability distribution of the
absolute difference between the values?
\item
One\marginpar[c]{\small\raggedright{This exercise
is intended to help you think about the \ind{central-limit theorem}, which says
that if independent random variables $x_1, x_2, \ldots, x_N$
have means $\mu_n$ and finite variances $\sigma_n^2$, then, in the
limit of large $N$, the sum $\sum_n x_n$ has a distribution that tends
to a normal (\index{Gaussian distribution}Gaussian) distribution
with mean $\sum_n \mu_n$ and variance $\sum_n \sigma_n^2$.}}
hundred ordinary dice are thrown. What,
roughly, is the probability distribution of the sum of the values?
Sketch the probability distribution and estimate its mean and
standard deviation.
\item
How can two cubical dice be labelled using the numbers $\{0,1,2,3,4,5,6\}$
so that when the two dice are thrown the sum has a uniform
probability distribution over the integers 1--12?
% Can you prove your solution is unique?
\item
Is there any way that one hundred dice
could be labelled with integers
such that the probability distribution of the sum is uniform?
\een
}
% answer, one normal, one 060606
% uniqueness proved by noting that every outcome 1-12 has
% to be made from 3 microoutcomes, and 12 can only be made
% from 6,6, so there must be a six on each die, indeed 3 on 1, and
% 1 on the other. 1 can only be made from 1,0, and don't want 0,0,
% so there must be three 0s. (M Gardner)
%
\subsection*{Inference problems}
\exercissxA{2}{ex.logit}{
If $q=1-p$ and $a = \ln \linefrac{p}{q}$, show that
\beq
p = \frac{1}{1+\exp(-a)} .
\label{eq.sigmoid}
\label{eq.logistic}
\eeq
Sketch this function and find its relationship to the hyperbolic tangent
function $\tanh(u)=\frac{e^{u} - e^{-u}}{e^{u} + e^{-u}}$.
It will be useful to be fluent in base-2 logarithms also.
If $b = \log_2 \linefrac{p}{q}$, what is $b$ as a function of $p$?
}
%
% is this exercise inappropriate now because we have not defined
% joint ensembles yet?
%
\exercissxB{2}{ex.BTadditive}{
Let $x$ and $y$ be dependent
% correlated
random variables with
$x$ a binary variable taking values in $\A_X = \{ 0,1 \}$.
Use \Bayes\ theorem to show that the log posterior probability
ratio for $x$ given $y$ is
\beq
\log \frac{P(x\eq 1 \given y)}{P(x\eq 0 \given y)} = \log \frac{P(y \given x\eq 1)}{P(y \given x\eq 0)}
+ \log \frac{P(x\eq 1)}{P(x\eq 0)} .
\eeq
}
% define ODDS ?
\exercissxB{2}{ex.d1d2}{
Let $x$, $d_1$ and $d_2$ be random variables such that $d_1$
and $d_2$ are conditionally independent given a binary variable $x$.
% (That is, $P(x,d_1,d_2)
% = P(x)P(d_1 \given x)P(d_2 \given x)$.)
%
% somewhere I need to introduce graphical repns and define
%
% TO DO!!! TODO
%
% (\ind{conditional independence} is discussed further in section XXX.)
%
% and give examples. A and C children of B. and A->B->C
% Jensen defn is
% A is cond indep of B given C if
% A|B,C = A|C
% which is symmetric, implying by BT
% B|A,C = B|C
% pf
% B|A,C = A|B,C B|C / A|C = B|C
% my defn here is
% A,B,C = C A|C B|C
% proof:
% A,B,C = C A|C B|C,A = .
% NB graphical model and decomposition are not 1-1 related. The two
% graphs A and C children of B. and A->B->C both have a joint prob
% that can be factorized in either way.
%
% $x$ is a binary variable taking values in $\A_X = \{ 0,1 \}$.
Use \Bayes\ theorem to show that the posterior probability
ratio for $x$ given $\{d_i \}$ is
\beq
\frac{P(x\eq 1 \given \{d_i \} )}{P(x\eq 0 \given \{d_i \})} =
\frac{P(d_1 \given x\eq 1)}{P(d_1 \given x\eq 0)}
\frac{P(d_2 \given x\eq 1)}{P(d_2 \given x\eq 0)}
\frac{P(x\eq 1)}{P(x\eq 0)} .
\eeq
}
\subsection*{Life in high-dimensional spaces}
%{Life in $\R^N$}
\index{life in high dimensions}
\index{high dimensions, life in}
Probability distributions and volumes have some unexpected
properties in high-dimensional spaces.
% The real line is denoted by $\R$. An $N$--dimensional real space
% is denoted by $\R^N$.
\exercissxA{2}{ex.RN}{
Consider a sphere of radius $r$ in an $N$-dimensional real space.
% dimensions.
Show that the
fraction of the volume of the sphere that
is
in the surface shell lying
at values of the radius between $r- \epsilon$ and $r$, where $0 < \epsilon < r$, is:
\beq
f = 1 - \left( 1 - \frac{\epsilon}{r} \right)^{\!N} .
\eeq
% from Bishop p.29
Evaluate $f$ for the cases $N\eq 2$, $N\eq 10$
and $N\eq 1000$, with (a) $\epsilon/r \eq 0.01$; (b) $\epsilon/r \eq 0.5$.
{\sf Implication:} points that are uniformly distributed in a sphere in $N$
dimensions, where $N$ is large, are very likely to be in a \ind{thin shell}
near the surface.
% (From Bishop (1995).)
}
%
\label{sec.exercise.block1}
\subsection*{Expectations and entropies}
You are probably familiar with the idea of computing the \ind{expectation}\index{notation!expectation}
of a function of $x$,
\beq
\Exp\left[ f(x) \right] = \left< f(x) \right> = \sum_{x} P(x) f(x) .
\eeq
Maybe you are not so comfortable with computing this expectation
in cases where the function $f(x)$ depends on
the probability $P(x)$. The next few
examples address this concern.
\exercissxA{1}{ex.expectn}{
Let $p_a \eq 0.1$, $p_b \eq 0.2$, and $p_c \eq 0.7$.
Let $f(a) \eq 10$, $f(b) \eq 5$, and $f(c) \eq 10/7$.
What is $\Exp\left[ f(x) \right]$?
What is $\Exp\left[ 1/P(x) \right]$?
}
\exercissxA{2}{ex.invP}{
For an arbitrary ensemble, what is $\Exp\left[ 1/P(x) \right]$?
}
\exercissxB{1}{ex.expectng}{
Let $p_a \eq 0.1$, $p_b \eq 0.2$, and $p_c \eq 0.7$.
Let $g(a) \eq 0$, $g(b) \eq 1$, and $g(c) \eq 0$.
What is $\Exp\left[ g(x) \right]$?
}
\exercissxB{1}{ex.expectng2}{
Let $p_a \eq 0.1$, $p_b \eq 0.2$, and $p_c \eq 0.7$.
What is the probability that $P(x) \in [0.15,0.5]$?
What is
\[
P\left( \left| \log \frac{P(x)}{ 0.2} \right| > 0.05 \right) ?
\]
}
\exercissxA{3}{ex.Hineq}{
Prove the assertion that
$H(X) \leq \log(|\A_X|)$ with equality iff $p_i \normaleq 1/|\A_X|$ for all $i$.
($|\A_X|$ denotes the number of elements in the set $\A_X$.)
[Hint: use Jensen's inequality (\ref{eq.jensen}); if your
first attempt to use Jensen does not succeed, remember that
Jensen involves both a random variable and a function,
and you have quite a lot of freedom in choosing
these; think about whether
your chosen function $f$ should be convex or concave.]
% further hint: try $u\eq 1/p_i$ as the random variable.]
}
\exercissxB{3}{ex.rel.ent}{
Prove that the relative entropy (\eqref{eq.KL})
satisfies $D_{\rm KL}(P||Q) \geq 0$ (\ind{Gibbs' inequality})
with equality only if $P \normaleq Q$.
% You may find this result
% helps with the previous two exercises. Note (moved to _p5a.tex)
%
% refer to this in mean field theory chapter {ch.mft}
%
}
%
% Decomposability of the entropy
\exercisaxB{2}{ex.entropydecompose}{
Prove that the entropy is
indeed decomposable as described in
\eqsref{eq.entropydecompose}{eq.entdecompose2}.
}
\exercissxB{2}{ex.decomposeexample}{
A random variable $x \in \{0,1,2,3\}$ is selected
by flipping a bent coin with bias $f$ to determine whether
the outcome is in $\{0,1\}$ or $\{ 2,3\}$;
\amarginfignocaption{t}{%
\begin{center}\small%footnotesize
\setlength{\unitlength}{0.6mm}
\begin{picture}(30,50)(-10,-15)
\put(-6,25){{\makebox(0,0)[r]{$f$}}}
\put(-6,5){{\makebox(0,0)[r]{$1\!-\!f$}}}
\put(-10,15){\vector(1,1){17}}
\put(-10,15){\vector(1,-1){17}}
\put(10,35){\vector(1,1){10}}
\put(10,35){\vector(1,-1){10}}
\put(16,45){{\makebox(0,0)[r]{$g$}}}
\put(16,25){{\makebox(0,0)[r]{$1\!-\!g$}}}
\put(16,5){{\makebox(0,0)[r]{$h$}}}
\put(16,-15){{\makebox(0,0)[r]{$1\!-\!h$}}}
\put(10,-5){\vector(1,1){10}}
\put(10,-5){\vector(1,-1){10}}
\put(24,45){{\makebox(0,0)[l]{\tt 0}}}
\put(24,25){{\makebox(0,0)[l]{\tt 1}}}
\put(24,5){{\makebox(0,0)[l]{\tt 2}}}
\put(24,-15){{\makebox(0,0)[l]{\tt 3}}}
\end{picture}
\end{center}
}
then either flipping a second bent coin with bias $g$
or a third bent coin with bias $h$ respectively.
Write down the probability distribution of $x$.
Use the
decomposability of the entropy (\ref{eq.entdecompose2})
to find the entropy of $X$. [Notice how compact
an expression is obtained if you make use of the binary entropy
function $H_2(x)$, compared with writing out the four-term
entropy explicitly.]
Find the derivative of $H(X)$ with respect to $f$. [Hint: $\d H_2(x)/\d x = \log((1-x)/x)$.]
}
\exercissxB{2}{ex.waithead0}{
An unbiased coin is flipped until one head is thrown. What is the
entropy of the random variable $x \in \{1,2,3,\ldots\}$, the number of
flips?
Repeat the calculation for the case of a biased coin with probability $f$
of coming up heads.
[Hint: solve the problem both directly and by using the
decomposability of the entropy (\ref{eq.entropydecompose}).]
%
}
%
% removed joint entropy questions.
\section{Further exercises}
%
\subsection*{Forward probability}% problems}
\exercisaxB{1}{ex.balls}{
An urn contains $w$ white balls and $b$ black balls.
Two balls are drawn, one after the other, without replacement.
Prove that the probability that the first ball
is white is equal to the probability that the second is white.
}
%
\exercisaxB{2}{ex.buffon}{
A circular \ind{coin} of diameter $a$ is thrown onto a \ind{square} grid
whose squares are $b \times b$. ($aB$ given that $F>A$?)
}
\exercisaxB{2}{ex.liars}{
The inhabitants of an island tell the
truth one third of the time. They lie with probability 2/3.
On an occasion, after one of them made a statement,
you ask another `was that statement true?'
and he says `yes'.
What is the probability that the statement was indeed true?
% [Ans: 1/5].
}
%
\exercissxB{2}{ex.R3error}{
Compare two ways of computing the probability of error of
the repetition code $\Rthree$, assuming a binary
symmetric channel (you
did this once for \exerciseref{ex.R3ep}) and confirm that they
give the same answer.
\begin{description}
\item[Binomial distribution method\puncspace]
Add the probability that all three bits are
flipped to the probability that exactly two bits are flipped.
% Add the probability of all three bits'
% being flipped to the probability of exactly two bits' being flipped.
\item[Sum rule method\puncspace]
% Using the different possible inferences]
Using the \ind{sum rule},
compute the marginal probability that $\br$ takes on each of
the eight possible values, $P(\br)$.
[$P(\br) = \sum_s P(s)P(\br \given s)$.]
Then compute
the posterior probability of $s$ for each of the eight
values of $\br$. [In fact, by symmetry, only two example
cases
$\br = ({\tt0}{\tt0}{\tt0})$ and
$\br = ({\tt0}{\tt0}{\tt1})$ need be considered.]
\marginpar{\small\raggedright{\Eqref{eq.bayestheorem} gives the posterior probability of
the input $s$, given the received vector $\br$.
}}
% $\br = ({\tt1},{\tt1},{\tt0})$,
% $\br = ({\tt1},{\tt1},{\tt1})$,
Notice that some of the
inferred bits are better determined than others.
From the posterior probability $P(s \given \br)$ you can read out
the case-by-case error probability,
the probability that the more probable hypothesis
is not correct, $P(\mbox{error} \given \br)$.
Find the average error probability using the sum rule,
\beq
P(\mbox{error}) = \sum_{\br} P(\br) P(\mbox{error} \given \br) .
\eeq
\end{description}
}
%
\exercissxB{3C}{ex.Hwords}{
The frequency
% probability
$p_n$ of the
$n$th most frequent word in English is roughly approximated
by
\beq
p_n \simeq \left\{
\begin{array}{ll}
\frac{0.1}{n} & \mbox{for $n \in 1, \ldots, 12\,367$}
% 8727$.}
\\
0 & n > 12\,367 .
\end{array}
\right.
\eeq
[This remarkable $1/n$ law is known as \ind{Zipf's law},
and applies to the word frequencies of many languages
% cite Shannon collection p.197 - except he has the number 8727, wrong!
% could also cite Gell-Mann
\cite{zipf}.]
If we assume that English is generated by picking
words at random according to this distribution,
what is the entropy of English (per word)?
[This calculation can be found in `Prediction and entropy of printed English', C.E.\ Shannon,
{\em Bell Syst.\ Tech.\ J.}\ {\bf 30}, p\pdot50--64 (1950), but, inexplicably,
the great man made numerical errors in it.]
% , in bits per word?
}
%%% Local Variables:
%%% TeX-master: ../book.tex
%%% End:
% \input{tex/_e1A.tex}%%%%%%%%%%%%%%%%%%%%% inference probs to do with logit and dice and decay moved into _p8.tex
\dvips
% include urn.tex here for another forward probability exercise.
%
\section{Solutions}% to Chapter \protect\ref{ch.prob.ent}'s exercises}
\fakesection{_s1aa solutions}
%=================================
\soln{ex.independence.bigram}{
No, they are not independent. If they were then all the
conditional distributions $P(y \given x)$ would be identical
functions of $y$, regardless of $x$ (\cf\ \figref{fig.conbigrams}).
}
\soln{ex.fp.toss}{
We define the fraction $f_B \equiv B/K$.
\ben
\item
The number of black balls
has a binomial distribution.
\beq P(n_B\,|\,f_B,N) = {N \choose n_B} f_B^{n_B} (1-f_B)^{N-n_B} . \eeq
\item
The mean and variance of this distribution are:
\beq \Exp [ n_B ] = N f_B \eeq
\beq \var[n_B] = N f_B (1-f_B) .
\label{eq.variance.binomial}
\eeq
These results were derived in \exampleref{ex.binomial}.
The standard deviation of $n_B$ is $\sqrt{\var[n_B]} = \sqrt{N f_B (1-f_B)}$.
% on page \pageref{sec.first.binomial.sol}.
When $B/K = 1/5$ and $N=5$,
the expectation and variance of $n_B$ are
1 and 4/5. The standard deviation is 0.89.
When $B/K = 1/5$ and $N=400$,
the expectation and variance of $n_B$ are
80 and 64. The standard deviation is 8.
\een
}
\soln{ex.fp.chi}{
The numerator of the quantity
\[%beq
z = \frac{(n_B - f_B N)^2}{ {N f_B (1-f_B)} }
%\label{eq.chisquared}
\]%eeq
can be recognized as\index{chi-squared}\index{$\chi^2$}
$\left( n_B - \Exp [ n_B ] \right)^2$;
the denominator is equal to
the variance of $n_B$ (\ref{eq.variance.binomial}),
which is by definition the expectation of the numerator.
So the expectation of $z$ is 1. [A random variable like $z$,
which measures the deviation of data from the
expected
% average
value, is sometimes called $\chi^2$ (chi-squared).]
In the case $N=5$ and $f_B = 1/5$, $N f_B$ is 1, and
$\var[n_B]$ is 4/5. The numerator has five possible values, only
one of which is smaller than 1:
$(n_B - f_B N)^2 = 0$ has probability $P(n_B \eq 1)= 0.4096$;
% $(n_B - f_B N)^2 = 1$ has probability $P(n_B = 0)+P(n_B = 2)= $ ;
% $(n_B - f_B N)^2 = 4$ has probability $P(n_B = 3)= $ ;
% $(n_B - f_B N)^2 = 9$ has probability $P(n_B = 4)= $ ;
% $(n_B - f_B N)^2 = 16$ has probability $P(n_B = 5)= $ ;
so the probability that $z < 1$ is 0.4096.
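As a check on the claim that the expectation of $z$ is 1 in this case, the binomial
probabilities $P(n_B \given f_B \eq 1/5, N \eq 5)$ for $n_B = 0, 1, \ldots, 5$ are
$0.32768$, $0.4096$, $0.2048$, $0.0512$, $0.0064$ and $0.00032$, so
\beq
\Exp\left[ (n_B - f_B N)^2 \right] = 1 \times ( 0.32768 + 0.2048 ) + 4 \times 0.0512
+ 9 \times 0.0064 + 16 \times 0.00032 = 0.8 ,
\eeq
which equals the variance $4/5$, giving $\Exp[z] = 0.8/0.8 = 1$.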
%
}
%
% stole solution from here
%
%%%%%%%%%%%%%%%%%%%%%%%%%% added 99 9 14
\soln{ex.jensenpf}{
We wish to prove, given the property
\beq
f( \lambda x_1 + (1-\lambda)x_2 ) \:\: \leq \:\:
\lambda f(x_1) + (1-\lambda) f(x_2 ) ,
\label{eq.convexdefn}
\eeq
that, if $\sum p_i = 1$ and $p_i \geq 0$,
\beq%
% \Exp\left[ f(x) \right] \geq f\left( \Exp[x] \right) ,
\sum_{i=1}^I p_i f(x_i) \geq f\left( \sum_{i=1}^I p_i x_i \right) .
\eeq
We proceed by recursion, working from the right-hand side. (This proof
does not
% needs further work to
handle
% awkward
cases where some $p_i=0$; such
details are left to the pedantic reader.) At the first line we
use the definition of convexity (\ref{eq.convexdefn}) with
$\lambda = \frac{p_1}{\sum_{i=1}^I p_i } = p_1$; at the second line,
$\lambda = \frac{p_2}{\sum_{i=2}^I p_i }$.
% , and so forth.
\fakesection{temporary solution}
\begin{eqnarray}
\lefteqn{ f\left( \sum_{i=1}^I p_i x_i \right) =
% &=&
f\left( p_1 x_1 + \sum_{i=2}^I p_i x_i
\right) } \nonumber
\\
&\leq&
p_1 f(x_1) + \left[ \sum_{i=2}^I p_i \right]
\left[ f\left( \sum_{i=2}^I p_i x_i
\left/ \sum_{i=2}^I p_i \right. \right) \right]
\\
&\leq&
p_1 f(x_1) + \left[ \sum_{i=2}^I p_i \right]
\left[
\frac{p_2}
{\sum_{i=2}^I p_i } f\left( x_2 \right)
+ \frac{\sum_{i=3}^I p_i}
{\sum_{i=2}^I p_i }
f\left( \sum_{i=3}^I p_i x_i
\left/ \sum_{i=3}^I p_i \right. \right)
\right] ,
\nonumber
% probably cut this last line, just show one itn of recursion
%
\end{eqnarray}
and so forth. %
% this works if I want to restore it. Indeed I have restored it
\hfill $\epfsymbol$% $\Box$%\epfs% end proof symbol
}
%%%%%%%%%%%%%%%%%%%%
% main post-chapter exercise solution area:
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\soln{ex.sumdice}{
\ben \item For the outcomes $\{2,3,4,5,6,7,8,9,10,11,12\}$,
the probabilities are $\P = \{
\frac{1}{36},
\frac{2}{36},
\frac{3}{36},
\frac{4}{36},
\frac{5}{36},
\frac{6}{36},
\frac{5}{36},
\frac{4}{36},
\frac{3}{36},
\frac{2}{36},
\frac{1}{36}\}%
$.
\item The value of one die has mean $3.5$ and variance $35/12$.
So the sum of one hundred has mean $350$ and variance $3500/12 \simeq 292$,
and by the \ind{central-limit theorem} the probability distribution
is roughly Gaussian (but confined to the integers), with
this mean and variance.
\item
In order to obtain a sum that has a uniform distribution
we have to start from random variables some of which
have a spiky distribution
with the probability mass concentrated at the extremes.
The unique solution is to have one ordinary die and one with faces 6, 6, 6, 0, 0, 0.
% That this solution is unique can be proved with an argument
% that starts by noting
% that each of the 12 outcomes has to be realized
% by 3 distinct microstates (a microstate
% being one of the 36 particular orientations
% of the two dice). To create outcome `12'
% in three ways there must be one six on
% one dice and three sixes on the other;
% similarly to create outcome `1' three ways, there
% must be one die with three zeroes on it
% and one with one one.
\item
Yes, a uniform distribution can be created in several ways,\marginpar[t]{\small\raggedright{To think about:
does this uniform distribution contradict the \ind{central-limit theorem}?}}
for example by labelling the $r$th die with
the numbers $\{0,1,2,3,4,5\}\times 6^r$.
\een
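Two quantities requested in the exercise can be filled in by direct counting and arithmetic.
For part (a), counting the 36 equiprobable ordered outcomes gives, for the absolute
difference taking values $\{0,1,2,3,4,5\}$, the probabilities
\beq
\left\{ \frac{6}{36}, \frac{10}{36}, \frac{8}{36}, \frac{6}{36}, \frac{4}{36}, \frac{2}{36} \right\} .
\eeq
For part (b), the standard deviation of the sum of one hundred dice is $\sqrt{3500/12} \simeq 17$.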
}
% \subsection*{Inference problems}
%
\soln{ex.logit}{
\beqan
a = \ln \frac{p}{q}
\hspace{0.2in} & \Rightarrow & \hspace{0.2in} \frac{p}{q} = e^a
\label{logit.step1}
\eeqan
and $q=1-p$ gives
\beqan
\frac{p}{1-p} & =& e^a
\\ \Rightarrow \hspace{0.52in} p & = & \frac{e^a}{e^a+1} = \frac{1}{1+\exp(-a)} .
\label{logit.step2}
\eeqan
The hyperbolic tangent is
\beq
\tanh(a) = \frac{e^a -e^{-a}}{e^a + e^{-a}}
\eeq
so
\beqan
f(a)& \equiv& \frac{1}{1+\exp(-a)} =
\frac{1}{2} \left( \frac{1-e^{-a}}{1+e^{-a}} + 1 \right) \nonumber \\
&=& \frac{1}{2}\left( \frac{ e^{a/2} - e^{-a/2} }{
e^{a/2} + e^{-a/2}} +1 \right)
= \frac{1}{2} ( \tanh(a/2) + 1 ) .
\eeqan
In the case $b = \log_2 \linefrac{p}{q}$, we can repeat
steps (\ref{logit.step1}--\ref{logit.step2}), replacing $e$ by $2$, to
obtain
\beq
 p = \frac{1}{1+2^{-b}} .
\label{eq.sigmoid2}
\label{eq.logistic2}
\eeq
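Written the other way round, $b = \log_2 \frac{p}{1-p} = a / \ln 2$.
A quick numerical check (with $p$ chosen for illustration): $p \eq 0.8$ gives $p/q = 4$,
so $a = \ln 4 \simeq 1.39$ and $b = 2$, and indeed
\beq
\frac{1}{1+2^{-2}} = \frac{1}{1.25} = 0.8 .
\eeq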
}
\soln{ex.BTadditive}{
\beqan
P(x \given y) &=& \frac{P(y \given x)P(x) }{P(y)}
\\%\eeq\beq
\Rightarrow\:\:
\frac{P(x\eq 1 \given y)}{P(x\eq 0 \given y)} &=& \frac{P(y \given x\eq 1)}{P(y \given x\eq 0)}
\frac{P(x\eq 1)}{P(x\eq 0)}
\\%\eeq\beq
\Rightarrow\:\:
\log \frac{P(x\eq 1 \given y)}{P(x\eq 0 \given y)} &=& \log \frac{P(y \given x\eq 1)}{P(y \given x\eq 0)}
+ \log \frac{P(x\eq 1)}{P(x\eq 0)} .
\eeqan
}
\soln{ex.d1d2}{
The conditional independence of $d_1$ and $d_2$ given $x$
means
\beq
P(x,d_1,d_2) = P(x)P(d_1 \given x)P(d_2 \given x) .
\eeq
This gives a separation of the posterior probability ratio
into a series of factors, one for each data point, times
the prior probability ratio.
\beqan
\frac{P(x\eq 1 \given \{d_i \} )}{P(x\eq 0 \given \{d_i \})} &=&
\frac{P(\{d_i\} \given x\eq 1)}{P(\{d_i\} \given x\eq 0)}
\frac{P(x\eq 1)}{P(x\eq 0)}
\\ &=&
\frac{P(d_1 \given x\eq 1)}{P(d_1 \given x\eq 0)}
\frac{P(d_2 \given x\eq 1)}{P(d_2 \given x\eq 0)}
\frac{P(x\eq 1)}{P(x\eq 0)} .
\eeqan
}
%
%
\subsection*{Life in high-dimensional spaces}
\soln{ex.RN}{
The \ind{volume} of a \ind{hypersphere} of radius $r$ in $N$ dimensions is
in fact
\beq
V(r,N) = \frac{\pi^{N/2}}{(N/2)!} r^{N} ,
\eeq
but you don't need to know this.
For this question all that we need is the $r$-dependence,
$V(r,N) \propto r^{N} .$
So the fractional volume in $(r-\epsilon,r)$ is
\beq
\frac{ r^{N} - (r-\epsilon)^N }{ r^N} =
1 -\left( 1 -\frac{\epsilon}{r}\right)^N .
\eeq
The fractional volumes in the shells for the required cases are:
\begin{center}
\begin{tabular}[t]{cccc} \toprule
$N$ & 2 & 10 & 1000 \\ \midrule
$\epsilon/r = 0.01$ & 0.02 & 0.096 & 0.99996 \\
$\epsilon/r = 0.5\phantom{0}$ & 0.75 & 0.999 & $1 - 2^{-1000}$ \\ \bottomrule
\end{tabular}\\
\end{center}
\noindent Notice that no matter how small $\epsilon$ is, for large enough $N$
essentially all the probability mass is in the surface shell of thickness
$\epsilon$.
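The table entries for small $\epsilon/r$ can be understood with an exponential
approximation (a standard approximation, not part of the exercise): since
$\ln (1-\epsilon/r) \simeq -\epsilon/r$ when $\epsilon \ll r$,
\beq
f = 1 - \exp\left( N \ln \left( 1 - \frac{\epsilon}{r} \right) \right) \simeq 1 - e^{-N \epsilon / r} .
\eeq
For $N \eq 1000$ and $\epsilon/r \eq 0.01$ this gives $1 - e^{-10} \simeq 0.99995$, close to the
exact value in the table; for large $N$, the shell contains half of the volume when
$\epsilon/r \simeq (\ln 2)/N$.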
}
%\soln{ex.weigh}{
% See chapter \chtwo.
%}
%
\soln{ex.expectn}{
$p_a \eq 0.1$, $p_b \eq 0.2$, $p_c \eq 0.7$.
$f(a) \eq 10$, $f(b) \eq 5$, and $f(c) \eq 10/7$.
\beq
\Exp\left[ f(x) \right] = 0.1 \times 10 + 0.2 \times 5 + 0.7 \times 10/7 = 3.
\eeq
For each $x$, $f(x) = 1/P(x)$, so
\beq
\Exp\left[ 1/P(x) \right] = \Exp\left[ f(x) \right] = 3.
\eeq
}
%
\soln{ex.invP}{
For general $X$,
\beq
\Exp\left[ 1/P(x) \right] = \sum_{x\in \A_X} P(x) 1/P(x) =
\sum_{x\in \A_X} 1 = | \A_X | .
\eeq
}
%
\soln{ex.expectng}{
$p_a \eq 0.1$, $p_b \eq 0.2$, $p_c \eq 0.7$.
$g(a) \eq 0$, $g(b) \eq 1$, and $g(c) \eq 0$.
\beq
\Exp\left[ g(x) \right]=p_b = 0.2.
\eeq
}
\soln{ex.expectng2}{
\beq
P\left( P(x) \! \in \! [0.15,0.5] \right) = p_b = 0.2 .
\eeq
\beq
P\left( \left| \log \frac{P(x)}{ 0.2} \right| > 0.05 \right)
= p_a + p_c = 0.8 .
\eeq
}
%
\soln{ex.Hineq}{
This type of question can be approached in two ways.
One is to differentiate
the function to be maximized, find the maximum, and prove that
it is a global maximum; this strategy is somewhat risky since it is possible
for the maximum of a function to be at the boundary of the space,
at a place where the derivative is not zero.
Alternatively, a carefully chosen inequality
can establish the answer. The second method is much neater.
\begin{Prooflike}{Proof by differentiation (not the recommended method)}
Since it is slightly easier to differentiate $\ln 1/p$ than $\log_2 1/p$,
we temporarily define $H(X)$ to be measured using natural logarithms, thus
scaling it down by a factor of $\log_2 e$.
\beqan
H(X) &=& \sum_i p_i \ln \frac{1}{p_i} \\
 \frac{\partial H(X)}{\partial p_i} &=& \ln \frac{1}{p_i} - 1 .
\eeqan
We maximize subject to the constraint $\sum_i p_i = 1$, which can be enforced
with a Lagrange multiplier:
\beqan
G(\bp) & \equiv & H(X) + \lambda \left( \sum_i p_i - 1 \right) \\
\frac{\partial G(\bp)}{\partial p_i} &=& \ln \frac{1}{p_i} - 1 + \lambda .
\eeqan
At a maximum,
\beqan
\ln \frac{1}{p_i} - 1 + \lambda &=& 0 \\
\Rightarrow \ln \frac{1}{p_i} &=& 1 - \lambda ,
\eeqan
so all the $p_i$ are equal. That this extremum is indeed a maximum
is established by finding the curvature:
\beq
\frac{\partial^2 G(\bp)}{\partial p_i \partial p_j} = -\frac{1}{p_i}
\delta_{ij} ,
\eeq
which is negative definite. \hfill
\end{Prooflike}
\begin{Prooflike}{Proof using Jensen's inequality (recommended method)}
First a reminder of the inequality.
\begin{quotation}
\noindent
If $f$ is a \convexsmile\ function
and $x$ is a random variable then:
\[%beq
\Exp\left[ f(x) \right] \geq f\left( \Exp[x] \right) .
\]%eeq
If $f$ is strictly \convexsmile\ and
$\Exp\left[ f(x) \right] \eq f\left( \Exp[x] \right)$, then the random
variable $x$ is a constant
(with probability 1).
\end{quotation}
The secret of a proof using Jensen's inequality is to choose the
right function and the right random variable.
We could define
% $f(u) = \log \frac{1}{u}$ and
\beq
f(u) = \log \frac{1}{u} = - \log u
\eeq
(which is a convex function) and
think of $H(X) = \sum p_i \log \frac{1}{p_i}$ as the
mean of $f(u)$ where $u=P(x)$, but this
would not get us there -- it would give us an inequality in the
wrong direction. If instead we define
\beq
u = 1/P(x)
\eeq
then we find:
% this introduces an extra minus sign:
\beq
H(X) = - \Exp\left[ f( 1/P(x) ) \right]
\leq - f\left( \Exp[ 1/P(x) ] \right) ;
\eeq
now we know from \exerciseref{ex.invP}\ that $\Exp[ 1/P(x) ] = |\A_X|$, so
\beq
H(X) \leq - f\left( |\A_X| \right) = \log |\A_X| .
\eeq
Equality holds only if the random variable $u = 1/P(x)$ is a constant,
which means $P(x)$ is a constant for all $x$.
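As a numerical illustration, take the ensemble used in the earlier exercises,
$p_a \eq 0.1$, $p_b \eq 0.2$, $p_c \eq 0.7$, with base-2 logarithms:
\beq
H(X) = 0.1 \log_2 10 + 0.2 \log_2 5 + 0.7 \log_2 \frac{10}{7} \simeq 1.16 \mbox{ bits}
\:\: \leq \:\: \log_2 3 \simeq 1.58 \mbox{ bits} .
\eeq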
\end{Prooflike}
}
%
\soln{ex.rel.ent}{
\beq
D_{\rm KL}(P||Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)} .
% \label{eq.KL}
\eeq
\label{sec.gibbs.proof}% cross ref problem? Tue 12/12/00
We prove \ind{Gibbs' inequality} using \ind{Jensen's inequality}.
Let $f(u) = \log 1/u$ and $u=\smallfrac{Q(x)}{P(x)}$.
Then
\beqan
D_{\rm KL}(P||Q) & =& \Exp[ f( Q(x)/P(x) ) ]
\\ &\geq&
f\left(
\sum_x P(x) \frac{Q(x)}{P(x)} \right)
= \log \left( \frac{1}{\sum_x Q(x)} \right) = 0,
\eeqan
with equality only if $u=\frac{Q(x)}{P(x)}$ is a constant, that is,
if $Q(x) = P(x)$.\hfill$\epfsymbol$\\
\begin{Prooflike}{Second solution}
In the above proof the expectations were with respect to
the probability distribution $P(x)$. A second solution method
uses Jensen's inequality with $Q(x)$ instead.
We define $f(u) = u \log u$ and let $u = \frac{P(x)}{Q(x)}$.
Then
\beqan
D_{\rm KL}(P||Q)& =&
\sum_x Q(x) \frac{P(x)}{Q(x)} \log
\frac{P(x)}{Q(x)} = \sum_x Q(x) f\left( \frac{P(x)}{Q(x)} \right) \\
&\geq& f\left( \sum_x Q(x) \frac{P(x)}{Q(x)} \right) = f(1) = 0,
\eeqan
with equality only if $u=\frac{P(x)}{Q(x)}$ is a constant, that is,
if $Q(x) = P(x)$.
\end{Prooflike}
}
%
% solns moved to _s5A.tex
%
\soln{ex.decomposeexample}{
\beq
H(X)= H_2(f) + f H_2(g) + (1-f) H_2(h) .
\eeq
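The remaining parts of the exercise can be written out as follows, taking `bias $f$'
to mean that the first coin selects $\{0,1\}$ with probability $f$, as in the
margin figure of the exercise. The probability distribution of $x$ is
\beq
P(x) = \left\{ fg , \: f(1-g) , \: (1-f)h , \: (1-f)(1-h) \right\} ,
\eeq
and, using the hint $\d H_2(x)/\d x = \log((1-x)/x)$,
\beq
\frac{\partial H(X)}{\partial f} = \log \frac{1-f}{f} + H_2(g) - H_2(h) .
\eeq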
}
%
\soln{ex.waithead0}{
The probability that there are $x-1$ tails and then one head
(so we get the first head on the $x$th
toss) is
\beq
P(x) = (1-f)^{x-1} f .
\eeq
If the first toss is a tail, the probability distribution for
the future looks just like it did before we made the first toss.
Thus we have a recursive expression for the entropy:
\beq
H(X) = H_2( f ) + (1-f) H(X) .
\eeq
Rearranging,
\beq
H(X) = H_2( f ) / f .
\eeq
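The direct calculation asked for in the hint gives the same answer. A sketch,
using $\sum_x P(x) (x-1) = (1-f)/f$ for the mean number of tails:
\beqan
H(X) &=& \sum_{x=1}^{\infty} P(x) \left[ (x-1) \log \frac{1}{1-f} + \log \frac{1}{f} \right] \\
&=& \frac{1-f}{f} \log \frac{1}{1-f} + \log \frac{1}{f} = \frac{H_2(f)}{f} .
\eeqan
For the unbiased coin, $f \eq 1/2$, this is $H(X) = 2$ bits.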
}
%
%
\fakesection{waithead solution}
\soln{ex.waithead}{
The probability of the number of tails $t$ is
\beq
P(t) = \left(\frac{1}{2}\right)^{\!t} \frac{1}{2}
\:\mbox{ for $t\geq 0$}.
\eeq
The expected number of heads is 1, by definition of the problem.
The expected number of tails is
\beq
\Exp[t] =
\sum_{t=0}^{\infty} t \left(\frac{1}{2}\right)^{\!t} \frac{1}{2} ,
\eeq
which may be shown to be 1 in a variety of ways. For example, since
the situation after one tail is thrown is equivalent to the opening
situation, we can write down the recurrence relation
\beq
\Exp[t] = \frac{1}{2} ( 1 + \Exp[t] ) + \frac{1}{2}0 \:\:
\Rightarrow \:\: \Exp[t] = 1.
\eeq
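Alternatively, the sum can be evaluated directly. Writing $S \equiv \Exp[t]$ and
subtracting $S/2$ from $S$ term by term leaves a geometric series:
\beq
S - \frac{S}{2} = \sum_{t=1}^{\infty} \left[ t - (t-1) \right] \left(\frac{1}{2}\right)^{\!t+1}
= \sum_{t=1}^{\infty} \left(\frac{1}{2}\right)^{\!t+1} = \frac{1}{2}
\:\: \Rightarrow \:\: S = 1 .
\eeq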
% if we define $S=\Exp[t]$ then we can subtract $S/2$ from $S$ to obtain
% a geometric series:
%\beq
% (1-1/2)S = \sum_{t=0}^{\infty} \left(\frac{1}{2}\right)^{t+1}
% = \frac{1/2}{1-1/2} = 1
%\eeq
% which gives $S=2$ --- what?
%%%%%%%%%%%%%%%%
%, for example, introducing
% $Z(\beta) \equiv \sum_t \left(\frac{1}{2}\right)^{\beta t} \frac{1}{2}
% = \frac{1}{2}/\left(1 - (\linefrac{1}{2})^{\beta}\right)$:
%\beq
% \sum_{t=0}^{\infty} t \left(\frac{1}{2}\right)^{t} \frac{1}{2}
% = \frac{\d}{\d\beta} \log Z
%\eeq
The probability distribution of the `estimator' $\hat{f} = 1/(1+t)$,
given that $f=1/2$, is plotted
in \figref{fig.f.estimator}. The probability of $\hat{f}$ is
simply the probability of the corresponding
value of $t$.
%
% gnuplot
% load 'figs/festimator.gnu'
%\begin{figure}
%\figuremargin{%
\marginfig{%
\begin{center}
\begin{tabular}{c}
$P(\hat{f})$\\[-0.3in]
\mbox{\psfig{figure=figs/festimator.ps,angle=-90,width=2in}}\\
\hspace{1.82in}$\hat{f}$
\end{tabular}
\end{center}
%}{%
\caption[a]{The probability distribution of the estimator $\hat{f} = 1/(1+t)$,
given that $f=1/2$.}
% , so that $P(t) = 1/2^{t+1}$.}
\label{fig.f.estimator}
%}
%\end{figure}
}
}
\soln{ex.waitbus}{
\ben
\item
The mean number of rolls from one six to the next six is six
(assuming
we
% don't count the first of the two sixes).
start counting rolls after the first of the two sixes).
The probability that the next six occurs on the $r$th
roll is the probability of {\em not\/} getting a six
for $r-1$ rolls multiplied by the probability of then
getting a six:
\beq
P(r_1 \eq r) = \left( \frac{5}{6} \right)^{\! r-1} \frac{1}{6}, \:\: \mbox{for $r\in \{1,2,3,\ldots \}$.}
\eeq
This probability distribution of the number of rolls, $r$,
may be called
an \ind{exponential distribution}, since
\beq
P(r_1 \eq r) = e^{-\alpha r} / Z,
\eeq
where $\alpha = \ln({6}/5)$, and $Z$ is a normalizing constant.
\item
The mean number of rolls from the clock until the next six is six.
\item
The mean number of rolls, going back in time,
until the most recent six is six.
\item
The mean number of rolls from the six before
the clock struck to the six after the clock struck
is the sum of the answers to (b) and (c), less one,
% (assuming we don't count the first of the two sixes),
that is, eleven.
\item
Rather than explaining the difference between (a)
% six and
and (d), let me give another hint.\index{bus-stop paradox}\index{waiting for a bus}
% see gnu/waitbus.gnu
Imagine that the buses in Poissonville arrive independently at random
(a \ind{Poisson process}), with, on average, one bus every six minutes.
Imagine that passengers turn up at {\busstop}s at a uniform rate,
% random also,
and are scooped up by the bus without delay, so the
interval between two buses remains constant.
Buses that follow gaps bigger than six minutes
become overcrowded. The passengers' representative complains that
two-thirds of all passengers found themselves on overcrowded buses.
The bus operator claims, `no, no -- only one third
of our buses are overcrowded'. Can both these claims be true?
\een
\amarginfig{b}{%
\begin{center}
\mbox{\hspace{-0.3in}\psfig{figure=figs/waitbus.ps,angle=-90,width=2.05in}}\\[-0.2in]
\end{center}
\caption[a]{The probability distribution of the number
of rolls $r_1$
from one 6 to the next
(falling solid line),
\[%\beq
P(r_1 \eq r) = \left( \frac{5}{6} \right)^{\! r-1} \frac{1}{6} ,
\]%\eeq
and the probability distribution (dashed line)
of
% the quantity $r_{\rm tot}=r_1+r_2-1$,
the number of rolls from the 6 before 1pm to the next 6,
% where $r_1$ and $r_2$ are the numbers of rolls before
% and after the clock strikes,
$r_{\rm tot}$,
\[%\beq
P(r_{\rm tot} \eq r) = r \, \left( \frac{5}{6} \right)^{\! r-1}
\left( \frac{1}{6} \right)^{\! 2 }
.
\]%\eeq
The probability $P(r_1>6)$ is about 1/3; the probability
$P(r_{\rm tot} > 6 )$ is about 2/3. The mean of $r_1$ is 6, and the
mean of $r_{\rm tot}$ is 11.
}
% other elegant ways of saying it:
% P( number rolls from one 6 to the next)
% P( number of rolls from the 6 before 1pm to the next)
}% end figure
}% end solbn
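
The numbers quoted in the figure caption can be checked with a few lines of Python. This is a minimal sketch: the two distributions are taken from the solution above, while the function names and the truncation of the infinite sums at 2000 rolls are assumptions made for illustration.

```python
P_SIX = 1.0 / 6.0  # probability of rolling a six

def p_r1(r):
    """P(r_1 = r): r - 1 non-sixes followed by a six."""
    return (1 - P_SIX) ** (r - 1) * P_SIX

def p_rtot(r):
    """P(r_tot = r): rolls from the six before the clock strikes to the next six."""
    return r * (1 - P_SIX) ** (r - 1) * P_SIX ** 2

def tail_and_mean(pmf, threshold=6, max_r=2000):
    """Return (P(r > threshold), mean of r), truncating the sums at max_r."""
    tail = sum(pmf(r) for r in range(threshold + 1, max_r + 1))
    mean = sum(r * pmf(r) for r in range(1, max_r + 1))
    return tail, mean

tail_1, mean_1 = tail_and_mean(p_r1)
tail_tot, mean_tot = tail_and_mean(p_rtot)
print("r_1:   P(r > 6) =", tail_1, " mean =", mean_1)
print("r_tot: P(r > 6) =", tail_tot, " mean =", mean_tot)
```

The extra factor of $r$ in $P(r_{\rm tot} \eq r)$ weights long intervals more heavily, which is why the two tail probabilities differ.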
%
% \subsection{Move this solution}
%
% \subsection*{Conditional probability}
% \soln{ex.R3error}{
%
\fakesection{r3 error soln}
\soln{ex.R3error}{
\begin{description}
\item[Binomial distribution method\puncspace]
From the solution to \exerciseonlyref{ex.R3ep},
$p_B = 3 f^2 (1-f) + f^3$.\index{repetition code}
\item[Sum rule method\puncspace]
The marginal probabilities of the eight values of $\br$ are\index{sum rule}
illustrated by:
\beq
P(\br \eq {\tt0}{\tt0}{\tt0} ) = \dhalf (1-f)^3 + \dhalf f^3 ,
\eeq
\beq
P(\br \eq {\tt0}{\tt0}{\tt1} ) = \dhalf f(1-f)^2 + \dhalf f^2(1-f)
= \dhalf f(1-f) .
\eeq
The posterior probabilities are represented by
\beq
P( s\eq{\tt1} \given \br \eq {\tt0}{\tt0}{\tt0} ) = \frac{ f^3 }
{ (1-f)^3 + f^3 }
\eeq
and
\beq
P( s\eq{\tt1} \given \br \eq {\tt0}{\tt0}{\tt1} )
= \frac{ (1-f)f^2 }
{ f(1-f)^2 + f^2(1-f) }
= f .
\eeq
The probabilities of error in these representative cases are thus
\beq
P(\mbox{error} \given \br \eq {\tt0}{\tt0}{\tt0} ) = \frac{ f^3 }
{ (1-f)^3 + f^3 }
\eeq
and
\beq
P(\mbox{error} \given \br \eq {\tt0}{\tt0}{\tt1} ) = f .
\eeq
Notice that while the average probability of error of $\Rthree$ is
about $3 f^2$, the probability (given $\br$)
that any {\em{particular}\/} bit is
wrong is either about $f^3$ or $f$.
The average error probability, using the sum rule, is
\beqa
P(\mbox{error}) &=& \sum_{\br} P(\br) P(\mbox{error} \given \br) \\
&=& 2 [\dhalf (1-f)^3 + \dhalf f^3] \frac{ f^3 }
{ (1-f)^3 + f^3 }
+ 6 [\dhalf f(1-f)] f .
\eeqa
\marginpar{\vspace{-0.8in}\par\small\raggedright{The first term, with its factor of 2, covers the cases $\br = \tt000$ and $\tt111$;
 the second term covers the remaining six outcomes, which share the
 same
 probability of occurring and the same error probability, $f$.}}%
So
\beqa
P(\mbox{error})
&=& f^3 + 3 f^2(1-f) .
\eeqa
\end{description}
}
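
The agreement between the two methods can also be confirmed numerically. The sketch below enumerates all eight received vectors and applies the sum rule; it assumes, as in the solution above, a prior of $1/2$ on each source bit, independent flips with probability $f$, and a majority-vote decoder. The function names are hypothetical.

```python
from itertools import product

def r3_error_sum_rule(f):
    """Average error probability of R3: sum over r of P(r) P(error | r)."""
    total = 0.0
    for r in product((0, 1), repeat=3):
        ones = sum(r)
        # Joint probabilities P(s, r), with P(s) = 1/2 and flip probability f.
        joint_s0 = 0.5 * f ** ones * (1 - f) ** (3 - ones)
        joint_s1 = 0.5 * f ** (3 - ones) * (1 - f) ** ones
        p_r = joint_s0 + joint_s1
        # Majority vote decodes s = 1 when at least two received bits are 1;
        # the error probability is the posterior of the other value of s.
        if ones >= 2:
            p_error_given_r = joint_s0 / p_r
        else:
            p_error_given_r = joint_s1 / p_r
        total += p_r * p_error_given_r
    return total

def r3_error_binomial(f):
    """Closed form from the binomial distribution method."""
    return 3 * f ** 2 * (1 - f) + f ** 3

for f in (0.01, 0.1, 0.3):
    print(f, r3_error_sum_rule(f), r3_error_binomial(f))
```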
%
%
% see also _s1A.tex
\soln{ex.Hwords}{
The entropy is 9.7
% 11.8
bits per word.
% , which is 2.6 bits per letter WRONG - shannon (p197) is in error
}
%\soln{ex.Hwords}{
%
% z := 1.000004301
%
%sum( 0.1/n * log(1.0/(0.1/n))/log(2.0) , n=1..12367) ;
% 9.716258456
% 9.716 bits.
%}
%\input{tex/_s1a.tex} nothing there any more
\fakesection{_s1A solutions}
%=================================
% quake
%
% \subsection*{Solutions to further inference problems}
%\soln{ex.exponential}{
% See chapter \chbayes.
%}
%\soln{ex.blood}{
% See chapter \chbayes.
%}
%
% The other exercises are discussed in the next chapter.
%%%%%%%%%%%%%%%%%%%%%%%%%%
\dvipsb{solutions 1a}
% now another inference chapter !
\prechapter{About Chapter}
\fakesection{About the first Bayes chapter}
If you are eager to get on to
% with data compression, information content and entropy,
information theory, data compression, and noisy channels,
you can skip to \chapterref{ch2}.
Data compression and data modelling are
intimately connected, however, so you'll probably
want to come back to this chapter
by the time you get to \chapterref{ch4}.
%
% move this later
%
% The exercises in this chapter are not a prerequisite for
% chapters \ref{ch2}--\ref{ch7}.
\fakesection{prerequisites for chapter 8}
Before reading \chapterref{ch.bayes},
it might be good to look at the following exercises.
% you
% should have worked on
% finished
% all the exercises in chapter \chone, in particular,
% \exerciserefrange{ex.logit}{ex.exponential}.
%
% \exthirtyone--\exthirtysix.
% uvw to HXY>0
\exercissxB{2}{ex.dieexponential}{
A die is selected at random from two twenty-faced dice
on which the symbols 1--10 are written with nonuniform frequency
as follows.
\begin{center}
\begin{tabular}{l@{\hspace{0.2in}}*{10}{l}} \toprule
Symbol & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 \\ \midrule
Number of faces of die A &
6 & 4 & 3 & 2 & 1 &1 &1 &1 &1 & 0 \\
Number of faces of die B &
3 & 3 & 2 & 2 & 2 &2 &2 &2 &1 & 1 \\ \bottomrule
\end{tabular}
\end{center}
The randomly chosen die is rolled 7 times, with the following
outcomes:
\begin{center}
5, 3, 9, 3, 8, 4, 7. % Sat 21/12/02 tried cutting this \\
\end{center}
What is the probability that the die is die A?
}
\exercissxB{2}{ex.dieexponentialb}{
Assume that there is a third twenty-faced die, die C, on which the symbols
1--20 are written once each.
As above, one of the three dice is selected at random and rolled
7 times, giving the outcomes:
% \begin{center}
3, 5, 4, 8, 3, 9, 7. \\
% \end{center}
What is the probability that the die is (a) die A, (b) die B, (c) die C?
}
% no normal solution pointer
\exercissxA{3}{ex.exponential}{ {\exercisetitlestyle Inferring a decay constant}\\
%\begin{quotation}
Unstable particles are emitted from a source and decay at a
distance $x$, a real number
that has an exponential probability distribution
with characteristic length $\lambda$. Decay events can only
be observed if they occur in a window extending from $x=1\cm$
to $x=20\cm$. $N$ decays are observed at locations $\{x_1 ,
\ldots , x_N\}$.
% ($x_n$ is a real number.)
What is $\lambda$?
%\end{quotation}
\begin{center}
\mbox{\psfig{figure=\FIGS/decay.ps,width=3in,angle=90,%
bbllx=154mm,bblly=147mm,bbury=257mm,bburx=175mm}}\\
\end{center}
}
% no normal solution pointer
% \subsection*{Genetic test evidence}
% \begin{quotation}
\exercissxB{3}{ex.blood}{ {\exercisetitlestyle Forensic evidence} \\
% Two people have left traces of their own blood at the scene of a
% crime. Their blood groups can be reliably identified from these
% traces and are found
% to be of type `O' (a common type in the local population, having
% frequency 60\%) and of type `AB' (a rare type, with frequency 1\%).
% A suspect is tested and found to have type `O' blood.
% A careless lawyer might claim that the fact that the suspect's
% blood type was found at the scene is positive evidence for the theory
% that he was present. But do these data
% $D=$ \{type `O' and `AB' blood were found at scene\} make it more
% probable that this suspect was one of the two people present at the
% crime?
Two people have left traces of their own blood at the scene of a
crime.
A suspect, Oliver, is tested and found to have type `O' blood.
The blood groups of the two traces
are found
to be of type `O' (a common type in the local population, having
frequency 60\%) and of type `AB' (a rare type, with frequency 1\%).
Do these data
(type `O' and `AB' blood were found at scene) give evidence in favour
of the proposition that Oliver was one of the two people present at the
crime?
}
% \end{quotation}
%%%%%%%%%% (many are repeated from _s1aa)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% \prechapter{About Chapter}
\mysetcounter{page}{54}
\ENDprechapter
\chapter{More about Inference}
\label{ch.bayes}\label{ch1b}
% contains the decay problem, the bent coin, and blood.
%
%
% solutions to exercises are in _s8.tex
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\fakesection{Inference intro}
It is not a controversial statement that \Bayes\ theorem\index{Bayes' theorem}
provides the correct language for describing the inference of a
message communicated over a
noisy channel, as we used it in \chref{ch1} (\pref{sec.bayes.used}).
But strangely, when it comes to other
inference problems, the use of
% approaches based on
\Bayes\ theorem
is not so widespread.
%let's take a little tour of other applications of
% probabilistic inference.
% Coherent inference can always be mapped onto probabilities (Cox, 1946).
%% \cite{cox}.
% Many
% textbooks on statistics do not mention this fact, so maybe it is worth
% using an example to emphasize the contrast between Bayesian inference
% and the orthodox methods of statistical inference.
%% involving
%% estimators, confidence intervals, hypothesis testing, etc.
% If this topic interests you, excellent further reading is
% to be found in the works of Jaynes, for example,
% \citeasnoun{Jaynes.intervals}.
\section{A first inference problem}
\label{sec.decay}\label{ex.exponential.sol}% special label by hand
When I was an undergraduate in Cambridge, I was privileged to receive
supervisions from Steve Gull. Sitting at his desk in a dishevelled
office in St.\ John's College, I asked him how one ought to answer an
old Tripos question (\exerciseonlyref{ex.exponential}):
\begin{quotation}
Unstable particles are emitted from a source and decay at a
distance $x$, a real number
that has an exponential probability distribution
with characteristic length $\lambda$. Decay events can only
be observed if they occur in a window extending from $x=1\cm$
to $x=20\cm$. $N$ decays are observed at locations $\{x_1 ,
\ldots , x_N\}$.
% ($x_n$ is a real number.)
What is $\lambda$?
\end{quotation}
\begin{center}
\mbox{\psfig{figure=\FIGS/decay.ps,width=3in,angle=90,%
bbllx=154mm,bblly=147mm,bbury=257mm,bburx=175mm}}\\
\end{center}
I had scratched my head over this for some time.
My education had provided me with a couple of approaches to solving
such inference problems: constructing `\ind{estimator}s'
of the unknown parameters; or `fitting' the model to
the data, or to a processed version of the data.
Since the mean of an unconstrained exponential distribution is $\l$,
it seemed reasonable to examine the sample mean $\bar{x} = \sum_n x_n / N$
and see
if an estimator $\hat{\l}$ could be obtained from it.
It was evident that the {estimator}
$\hat{\l}=\bar{x}-1$ would be appropriate for
$\lambda \ll 20\,$cm, but not for cases where the
truncation of the distribution at the right-hand side
is significant; with a little ingenuity and the introduction of
ad hoc bins, promising estimators for $\lambda \gg 20$ cm could be
constructed. But there was no obvious estimator that would work
under all conditions.
Nor could I find a satisfactory
approach based on fitting the density $P(x\given \lambda)$ to
a histogram derived from the data. I was stuck.
What is the general solution to this problem and others like it?
Is it always necessary, when confronted by a new inference problem,
to grope in the dark for appropriate `estimators' and worry
about finding the `best' estimator (whatever that means)?
%% I hope you have already stopped and thought about this question.
% problem.
% \\ \mbox{~}\dotfill\ \mbox{~} \\
% \newpage
Steve
% Gull
wrote down the probability of one data point, given $\l$:
\beq
P(x\given \lambda) =\left\{ \begin{array}{ll}
{\textstyle \smallfrac{1}{\l}} \,
e^{-x/\lambda } / Z(\lambda) & 1 < x < 20 \\
0 & {\rm otherwise }
\end{array} \right.
\label{basic.likelihood}
\eeq
where
\beq
Z(\l) = \int_1^{20} \d x \: \smallfrac{1}{\l} \,
e^{-x/\lambda } = \left(e^{-1/\l} - e^{-20 /\l} \right).
\label{basic.likelihood.Z}
\eeq
This seemed obvious enough.
Then he wrote {\dem{\ind{\Bayes\ theorem}}}:
\beqan
\label{bayes.theorem}
% \begin{array}{l}
P(\l\given \{x_1, \ldots, x_N\}) &=&
\frac{P(\{x\}\given \lambda) P(\l)}{P(\{x\}) } \\
%&& \hspace{0.5in}
&\propto& \frac{1}{\left( \l Z(\l) \right)^N}
\exp \left( \textstyle - \sum_1^N x_n / \l \right) P(\l)
.
% \end{array}
\label{basic.posterior}
\eeqan
Suddenly, the straightforward distribution $P(\{x_1 ,\ldots, x_N \}\given
\l)$, defining the probability of the data given the hypothesis $\l$,
was being turned on its head so as to define the probability of a
hypothesis given the data. A simple figure showed the probability of
a single data point $P(x\given \l)$ as a familiar function of $x$, for
different values of $\l$ (figure \ref{decay.like.1}). Each curve was
an innocent exponential, normalized to have area 1. Plotting the
same function as a function of $\l$ for a fixed value of $x$,
something remarkable happens: a peak emerges (figure
\ref{decay.like.2}). To help understand these two points
of view of the one function, \figref{decay.probandlike}
shows a surface plot of $P(x\given \l)$ as a function of $x$ and $\l$.
\begin{figure}
\figuremargin{%
\begin{center}
\mbox{\psfig{figure=\FIGS/decay.like.1.ps,%
width=2 in,angle=-90}\ \ \ \raisebox{-3mm}[0in][0in]{$x$}}
\end{center}
}{%
\caption{{The probability density $P(x\given \l)$ as a function of $x$.}}
\label{decay.like.1}
}%
\end{figure}
\begin{figure}
\figuremargin{%
\begin{center}
\mbox{\psfig{figure=\FIGS/decay.like.2.ps,%
width=2 in,angle=-90}\ \ \ \raisebox{-3mm}[0in][0in]{$\lambda$}}
\end{center}
}{%
\caption[a]{{The probability density $P(x\given \l)$ as a function of $\l$,
for three different values of $x$.}
\small
When plotted this way round, the function is known as
the {\dem\ind{likelihood}\/} of $\l$.
The marks indicate the three values of $\l$, $\l=2,5,10$,
that were used in the preceding figure.
}
\label{decay.like.2}
}
\end{figure}
%\begin{figure}
%\figuremargin{%
\marginfig{
\begin{center}
\begin{tabular}{c}
\makebox[0pt][l]{\hspace*{0.21in}\raisebox{0.435in}{$x$}}%
\mbox{\psfig{figure=\FIGS/probandlike.ps,%
width=2in,angle=-90}%
\makebox[0pt][l]{\hspace*{-0.352in}\raisebox{0.435in}{$\l$}}}\\[-0.3in]% was -0.6 Sat 5/10/02
\end{tabular}\end{center}
%}{%
\caption[a]{{The probability density $P(x\given \l)$ as a function of $x$
and $\l$. Figures \ref{decay.like.1} and \ref{decay.like.2} are
vertical sections through this surface.}
}
\label{decay.probandlike}
}
%\end{figure}
\begin{figure}
\figuremargin{%
\begin{center}
\mbox{\psfig{figure=\FIGS/decay.like.xxx.ps,%
width=2in,angle=-90}}
\end{center}
}{%
\caption[a]{{The likelihood function in the case of a six-point dataset,
$P(\{x\} = \{1.5,2,3,4,5,12\}\given \lambda)$, as a function of $\l$.}
}
\label{decay.like.xxx}
}
\end{figure}
For a dataset consisting of several points, \eg, the
six points
$\{x\}_{n=1}^{N} = \{1.5,2,3,4,5,12\}$, the likelihood function
$P(\{x\}\given \lambda)$ is the product of the $N$ functions of $\l$,
$P(x_n\given \l)$ (\figref{decay.like.xxx}).
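
To make the likelihood function concrete, here is a minimal Python sketch that evaluates $\log P(\{x\}\given \lambda)$ on a grid of $\lambda$ values, using the truncated exponential density and $Z(\lambda)$ defined above. The grid range (0.5 to 100), the grid spacing, and the function names are assumptions chosen for illustration; locating the maximum is only one way of summarizing the curve, not a replacement for the full function of $\lambda$.

```python
import math

def log_likelihood(lam, xs, lo=1.0, hi=20.0):
    """log P({x} | lambda) for an exponential density truncated to (lo, hi)."""
    z = math.exp(-lo / lam) - math.exp(-hi / lam)  # normalizer Z(lambda)
    return sum(-x / lam - math.log(lam) - math.log(z) for x in xs)

def grid_argmax(xs, grid):
    """Grid value of lambda at which the likelihood is largest."""
    return max(grid, key=lambda lam: log_likelihood(lam, xs))

# Hypothetical grid: lambda from 0.5 to 100.0 in steps of 0.01.
grid = [0.5 + 0.01 * i for i in range(9951)]

data = [1.5, 2, 3, 4, 5, 12]
print("six-point dataset, peak near lambda =", grid_argmax(data, grid))
print("single point x=3,  peak near lambda =", grid_argmax([3.0], grid))
print("single point x=12, largest at lambda =", grid_argmax([12.0], grid))
```

For the single point $x=12$ the largest value on any finite grid sits at the grid's upper edge, reflecting the fact that this likelihood has no peak.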
%
Steve summarized \Bayes\ theorem
% (equation \ref{bayes.theorem})
as
embodying the fact that
\begin{conclusionbox}
what you know about $\lambda$
after the data arrive is what
you knew before [$P(\lambda)$], and what the data told you
[$P(\{x\}\given \lambda)$].
\end{conclusionbox}
Probabilities are used here to
quantify degrees of belief.
% The probability
% of $\lambda$ is a quantification of what you know about $\lambda$.
To nip possible confusion in the bud, it must be
emphasized that the hypothesis $\lambda$ that correctly describes
the situation is {\em not\/} a {\em stochastic\/} variable, and the fact that
the Bayesian uses a probability\index{probability!Bayesian}
distribution $P$ does {\em not\/} mean
that he thinks of the world as stochastically changing its nature
between the states described by the different hypotheses. He uses the
notation of probabilities to represent his {\em beliefs\/} about the mutually
exclusive micro-hypotheses (here, values of $\l$),
of which only one is actually true. That
probabilities can denote degrees of belief, given assumptions, seemed
reasonable to me.
% , and is proved by Cox (1946).
% \citeasnoun{cox}.
% . Anyone who does not find it reasonable to use
% probabilities to quantify degrees of belief can read
% paper, where it is proved to be
% valid.
\label{sec.decayb}
The posterior probability distribution
% of equation
(\ref{basic.posterior}) represents
the unique and complete solution to the problem.
There is no need to invent\index{classical statistics!criticisms}
`estimators'; nor do we need to invent
criteria for comparing alternative estimators with each other.
Whereas orthodox statisticians offer twenty ways of solving a
problem, and another twenty different criteria for deciding which of
these solutions is the best, Bayesian statistics only offers one
answer to a well-posed problem.
% Added Mon 4/2/02
\marginpar{\small\raggedright{If you have any difficulty understanding this chapter, I recommend
 ensuring you are happy with
 exercises \ref{ex.dieexponential} and \ref{ex.dieexponentialb} (\pref{ex.dieexponentialb}),
 and then noting their similarity to
 \exerciseonlyref{ex.exponential}.}}
\subsection{Assumptions in inference}
Our inference is conditional on our assumptions [for example, the
prior $P(\lambda)$]. Critics view such priors as a difficulty because
they are `subjective', but I
don't see how it could be otherwise. How can one perform inference
without making assumptions?
I believe that it is of great value that Bayesian
methods force one to make these tacit assumptions explicit.
First,
once assumptions are made, the inferences are objective and unique,
reproducible with complete agreement by anyone who has the same
information and makes the same assumptions. For example, given the
assumptions listed above, $\H$, and the data $D$,
% from an experiment
% measuring decay lengths,
everyone will agree about the posterior
probability of the decay length $\l$:
\beq
P(\l\given D,\H) = \frac{ P(D\given \l,\H) P(\l\given \H) }{ P(D\given \H) } .
\eeq
Second, when the assumptions are explicit, they are easier to
criticize, and easier to modify -- indeed,
we can quantify the sensitivity of our inferences to
the details of the assumptions. For example,
we can note from the likelihood curves
in figure \ref{decay.like.2} that in the case of a single data point at
$x=5$, the likelihood
function is less strongly peaked than in the case $x=3$; the
details of the prior $P(\lambda)$ become increasingly important as the sample
mean $\bar{x}$ gets closer to the middle of the window, 10.5. In the case
$x=12$, the likelihood function doesn't have a peak at all -- such data
merely rule out small values of $\lambda$, and don't give any information
about the relative probabilities of large values of $\lambda$. So
in this case, the details of the prior at the small--$\lambda$ end
of things are not important, but at the large--$\lambda$ end, the prior
is important.
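
 The role of the midpoint 10.5 can be made explicit with a short
 large-$\lambda$ expansion, offered here as a sketch. Expanding the
 exponentials in \eqref{basic.likelihood.Z} to second order in $1/\lambda$
 gives $\lambda Z(\lambda) \simeq 19 - 399/(2\lambda)$, so
\beq
	P(x\given \lambda) = \frac{ e^{-x/\lambda} }{ \lambda Z(\lambda) }
	\simeq \frac{1}{19} \left( 1 + \frac{ 10.5 - x }{ \lambda } \right)
	\:\: \mbox{for $\lambda \gg 20$} .
\eeq
 For $x < 10.5$ the likelihood approaches its limiting value $1/19$
 from above as $\lambda \rightarrow \infty$, so it must have a peak at
 finite $\lambda$; for $x > 10.5$ it approaches $1/19$ from below,
 which is consistent with the absence of a peak in the case $x=12$.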
% is whatever we knew before
% the experiment, \ie, our prior.
Third, when we are not sure which of various alternative assumptions
is the most appropriate for a problem, we can treat this question as
another inference task. Thus, given data $D$, we can\index{Bayes' theorem}
% learn from the data
compare alternative assumptions $\H$ using \Bayes\ theorem:
\beq
P(\H\given D,\I) = \frac{ P(D\given \H,\I) P(\H\given \I) }{ P(D\given \I) } ,
\label{basic.ev}
\eeq
where $\I$ denotes the highest assumptions, which we are not
questioning.
Fourth, we can take into account our uncertainty regarding such
assumptions when we make subsequent predictions. Rather than choosing
one particular assumption $\H^{*}$, and working out our predictions
about some quantity $\bt$, $P(\bt\given D,\H^{*},\I)$, we obtain
predictions that take into account our uncertainty about $\H$ by
using the sum rule:
\beq
P(\bt \given D, \I) = \sum_{\H} P(\bt \given D, \H , \I ) P(\H\given D,\I) .
\label{basic.marg}
\eeq
This is another contrast with orthodox statistics, in which it is
conventional to `test' a default model, and then, if the test\index{test!statistical}\index{statistical test}
`accepts the model' at some `\ind{significance level}', to use exclusively that model to make
predictions.
Steve thus persuaded me that
\begin{conclusionbox}
probability theory reaches parts that ad hoc methods cannot reach.
\end{conclusionbox}
% However, that is a topic for another lecture.
Let's look at a few more examples of simple inference problems.
\section{The bent coin}
\label{sec.bentcoin}
A \ind{bent coin}\index{inference problems!bent coin}
is tossed $F$ times; we observe a sequence $\bs$ of
heads and tails (which we'll denote by the symbols $\ta$ and $\tb$).
We wish to know the bias of the coin, and predict
the probability that the next toss will result in a head.
We first encountered this task in \exampleref{exa.bentcoin},
and we will encounter it again
in \chref{ch.four}, when we discuss adaptive data compression.
% the adaptive encoder for $a$s and $b$s.
It is also the original inference problem studied by
% Rev.\
{Thomas Bayes}
in his essay published in 1763.\index{Bayes, Rev.\ Thomas}
% cite{Bayes}
As in
% \chref{ch.prob.ent}
\exerciseref{ex.postpa}, we will
assume
% In chapter \chfour\ we assumed
a uniform prior distribution and
obtain a posterior distribution by multiplying by the likelihood. A
critic might object, `where did this prior come from?' I will not
claim that the uniform prior is in any way fundamental; indeed
we'll give examples of nonuniform priors later. The prior is
% It is simply
a subjective assumption. One of the themes of this book is:
%
% put this back somewhere?
%
% One way to justify the need for a prior is
% to assume, as in chapter \chfour,
% that our task is simply to make a code to encode the
% outcome $\bs$ as efficiently as possible. We have to compress the
% data from the source somehow, and any choice of a compression scheme
% must correspond to a prior distribution over coin biases. I see no
% way round this. The choice of code implies an assumed probability
% distribution over outcomes.
%\begin{quotation}
\begin{conclusionbox}
\noindent
you can't do inference -- or data compression -- without
making assumptions.
% You can't do data compression -- or inference -- without
% making assumptions.
\end{conclusionbox}
%\end{quotation}
%
% change notation? f_H?????????????????????????????????
%
%\subsubsection*{Likelihood function}
We give the name $\H_1$ to our assumptions. [We'll be introducing
an alternative set of assumptions in a moment.]
The probability, given $p_{\ta}$, that $F$ tosses
result in a sequence $\bs$
that contains $\{F_{\ta},F_{\tb}\}$ counts of the two outcomes
% $\{ a , b \}$
is
\beq
P( \bs \given p_{\ta} , F,\H_1 ) = p_{\ta}^{F_{\ta}} (1-p_{\ta})^{F_{\tb}} .
\label{eq.pa.likeb}
\eeq
[{For example, $P(\bs\eq {\tt{aaba}} \given p_{\ta},F \eq 4,\H_1)
= p_{\ta}p_{\ta}(1-p_{\ta})p_{\ta}.$}]
% This function of $p_{\ta}$ (\ref{eq.pa.likeb}) defines the likelihood function.
% Model 1
Our first model assumes a uniform prior distribution for $p_{\ta}$,
\beq
P(p_{\ta}\given\H_1) = 1 , \: \: \: \: \: \: p_{\ta} \in [0,1]
\label{eq.pa.priorb}
\eeq
and $p_{\tb} \equiv 1-p_{\ta}$.
\subsubsection{Inferring unknown parameters}
Given a string of length $F$ of which $F_{\ta}$ are $\ta$s and
$F_{\tb}$ are $\tb$s, we are interested in (a) inferring
what $p_{\ta}$ might be; (b) predicting whether the next character is an $\ta$
or a $\tb$. [Predictions\index{prediction} are always expressed as probabilities.
So `predicting whether the next character is an $\ta$'
is the same as computing the probability that the next character is an $\ta$.]
Assuming $\H_1$ to be true, the posterior probability of $p_{\ta}$, given a
string $\bs$ of length $F$ that has
counts $\{F_{\ta},F_{\tb}\}$, is, by \Bayes\ theorem,
\beqan
P( p_{\ta} \given \bs ,F,\H_1) &=&
\frac{ P( \bs \given p_{\ta} , F,\H_1 ) P(p_{\ta}\given\H_1) }{ P( \bs \given F,\H_1 ) } .
\label{eq.pa.post}
\label{eq.pa.post.again}
\eeqan
The factor $P( \bs \given p_{\ta} , F,\H_1 )$, which, as a function
of $p_{\ta}$, is known as the likelihood function,
was given in \eqref{eq.pa.likeb}; the prior
$P(p_{\ta}\given\H_1)$ was given in \eqref{eq.pa.priorb}.
Our inference of $p_{\ta}$ is thus:
% The posterior
\beqan
P( p_{\ta} \given \bs ,F,\H_1) &=&
\frac{ p_{\ta}^{F_{\ta}} (1-p_{\ta})^{F_{\tb}} }{ P( \bs \given F,\H_1 ) } .
\label{eq.pa.postb.again}
\eeqan
The normalizing constant is given by the beta integral
\beq
P( \bs \given F,\H_1 ) = \int_0^1 \d p_{\ta} \: p_{\ta}^{F_{\ta}} (1-p_{\ta})^{F_{\tb}} =
\frac{\Gamma(F_{\ta}+1)\Gamma(F_{\tb}+1)}{ \Gamma(F_{\ta}+F_{\tb}+2) }
= \frac{ F_{\ta}! F_{\tb}! }{ (F_{\ta} + F_{\tb} + 1)! } .
\label{eq.evidenceZ}
\eeq
% Our inference of $p_{\ta}$, assuming $\H_1$ to be true,
% is thus given by \eqref{eq.pa.postb.again}.
%%%%%%%%%%%%%
\exercissxA{2}{ex.postpaII}{
Sketch the posterior probability $P( p_{\ta} \given \bs\eq {\tt aba} ,F\eq 3)$.
What is the most probable value of $p_{\ta}$ (\ie, the value that maximizes
the posterior probability density)? What is the mean value of $p_{\ta}$
under this distribution?
Answer the same questions for
the posterior probability $P( p_{\ta} \given \bs\eq {\tt bbb} ,F\eq 3)$.
}
\subsubsection{From inferences to predictions}
Our prediction about the next toss, the probability that the next toss is an $\ta$,
is obtained by integrating over $p_{\ta}$. This has the effect of
taking into account our uncertainty about $p_{\ta}$ when making predictions.
By the sum rule,
\beqan
P(\ta \given \bs ,F)& =& \int \d p_{\ta} \: P(\ta \given p_{\ta} ) P(p_{\ta} \given \bs,F ) .
\eeqan
The probability of an $\ta$ given $p_{\ta}$ is simply $p_{\ta}$,
so
\beqan
\lefteqn{ P(\ta \given \bs ,F)
= \int \d p_{\ta} \: p_{\ta} \frac{p_{\ta}^{F_{\ta}} (1-p_{\ta})^{F_{\tb}}}
{P( \bs \given F ) } }
\\
&=& \int \d p_{\ta} \: \frac{p_{\ta}^{F_{\ta}+1} (1-p_{\ta})^{F_{\tb}}}
{P( \bs \given F ) }
\\
&=& \left.
% \frac
{ \left[ \frac{ (F_{\ta}+1)! F_{\tb}! }{ (F_{\ta} + F_{\tb} + 2)! } \right] } \right/
{ \left[ \frac{ F_{\ta}! F_{\tb}! }{ (F_{\ta} + F_{\tb} + 1)! } \right] }
\:\: = \:\: \frac{ F_{\ta}+1 }{ F_{\ta} + F_{\tb} + 2 } ,
\label{eq.laplacederived}
\eeqan
which is known as {\dem{\ind{Laplace's rule}}}.
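
Laplace's rule can be checked with exact arithmetic. The sketch below computes the evidence \eqref{eq.evidenceZ} as an exact fraction and forms the prediction as a ratio of two evidences, mirroring the derivation above; the function names and the example counts are illustrative assumptions.

```python
from fractions import Fraction
from math import factorial

def evidence_h1(fa, fb):
    """P(s | F, H1) = Fa! Fb! / (Fa + Fb + 1)!, the beta integral."""
    return Fraction(factorial(fa) * factorial(fb), factorial(fa + fb + 1))

def predictive_a(fa, fb):
    """P(a | s, F): evidence with one extra 'a', divided by the evidence."""
    return evidence_h1(fa + 1, fb) / evidence_h1(fa, fb)

def laplace_rule(fa, fb):
    """Closed form (Fa + 1) / (Fa + Fb + 2)."""
    return Fraction(fa + 1, fa + fb + 2)

# Hypothetical counts: 3 a's and 7 b's in ten tosses.
print(predictive_a(3, 7), laplace_rule(3, 7))
```

With no data at all, the same functions return $1/2$, the mean of the uniform prior.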
\section{The bent coin and model comparison}
\label{sec.bentcoin2}
Imagine that a scientist introduces another theory for our data.
He asserts that the source is not really a bent coin but is really a
perfectly formed die with one face painted heads (`$\ta$') and the other five
painted tails (`$\tb$'). Thus the parameter $p_{\ta}$, which in the original model,
$\H_1$, could take any value between 0 and 1, is according
to the new hypothesis, $\H_0$, not a free parameter at all; rather, it
is equal to
% p_{\ta} =
$1/6$. [This hypothesis is termed $\H_0$ so that the suffix of each model
indicates its number of free parameters.]
How can we compare these two models in the light of data?
We wish to
infer how probable
$\H_1$ is relative to $\H_0$.
% , so we can use \Bayes\ theorem again.
% Let us write down the first model's probabilities again.
% {\em Here we repeat some material from the arithmetic coding
% chapter, chapter \ref{ch4}.}
\subsubsection*{Model comparison as inference}
In order to perform model comparison, we write down
\Bayes\ theorem again, but this time with a different\index{Bayes' theorem}
argument on the left-hand side. We wish to know how probable
$\H_1$ is given the data. By \Bayes\ theorem,
\beq
P( \H_1 \given \bs ,F ) = \frac{ P( \bs \given F,\H_1 ) P( \H_1 ) }{ P( \bs \given F) } .
\eeq
Similarly, the posterior probability of $\H_0$ is
\beq
P( \H_0 \given \bs ,F ) = \frac{ P( \bs \given F,\H_0 ) P( \H_0 ) }{ P( \bs \given F) }.
\eeq
The normalizing constant in both cases is $P(\bs\given F)$, which is the total
probability of getting the observed data.
% regardless of which model is true.
If $\H_1$ and $\H_0$ are the only models under
consideration, this probability is given by the sum rule:
\beq
P( \bs \given F) = P( \bs \given F,\H_1 ) P( \H_1 )
+ P( \bs \given F,\H_0 ) P( \H_0 ) .
\eeq
To evaluate the posterior probabilities of the hypotheses we
need to assign values to the prior probabilities $P( \H_1 )$
and $P( \H_0 )$; in this case, we might set these to 1/2 each. And
we need to evaluate the data-dependent terms
$P( \bs \given F,\H_1 )$ and $P( \bs \given F,\H_0 )$.
We can give names to these quantities.
The quantity $P( \bs \given F,\H_1 )$ is a measure of how much the data
favour $\H_1$, and we call it the {\dbf\ind{evidence}} for model $\H_1$.
We already encountered this quantity in equation (\ref{eq.pa.post.again})
where it appeared
as the normalizing constant of the first inference we made -- the
inference of $p_{\ta}$ given the data.
\medskip
\begin{conclusionbox}
%\begin{description}
%\item[How model comparison works:]
{\bf How model comparison works:}
The evidence for a model is usually\index{key points!model comparison}
the normalizing constant of an earlier Bayesian inference.
%\end{description}
\end{conclusionbox}
\medskip
We evaluated the normalizing constant for model $\H_1$ in
(\ref{eq.evidenceZ}).
The evidence for model $\H_0$ is very simple because this model
has no parameters to infer. Defining $p_0$ to be $1/6$, we have
\beq
P( \bs \given F,\H_0 ) = p_0^{F_{\ta}} (1-p_0)^{F_{\tb}} .
\eeq
Thus the posterior probability ratio of model $\H_1$ to model $\H_0$ is
\beqan
\frac{ P( \H_1 \given \bs ,F )}
{P( \H_0 \given \bs ,F )}
& =&
\frac{ P( \bs \given F,\H_1 ) P( \H_1 ) }
{ P( \bs \given F,\H_0 ) P( \H_0 ) }
\\
&=&
\left.
{ \frac{ F_{\ta}! F_{\tb}! }{ (F_{\ta} + F_{\tb} + 1)! } }
\right/
{ p_0^{F_{\ta}} (1-p_0)^{F_{\tb}} } .
% \frac{ \smallfrac{ F_{\ta}! F_{\tb}! }{ (F_{\ta} + F_{\tb} + 1)! } }{ p_0^{F_{\ta}} (1-p_0)^{F_{\tb}} } .
% SECOND EDN - sanjoy says use linefrac
\label{eq.compare.final}
\eeqan
Some values of this posterior probability ratio are illustrated in
table \ref{tab.mod.comp}. The first five lines illustrate that
some outcomes favour one model, and some favour the other.
No outcome is completely incompatible with either model.
\begin{table}
\figuremargin{%
\begin{center}
\begin{tabular}{cccl} \toprule
$F$ & Data $(F_{\ta},F_{\tb})$ & $\displaystyle \frac{ P( \H_1 \given \bs ,F )}
{P( \H_0 \given \bs ,F )}$ \\ \midrule
6 & $(5,1)$ & 222.2 & \\
6 & $(3,3)$ & 2.67 &\\
6 & $(2,4)$ & 0.71 & = 1/1.4 \\
6 & $(1,5)$ & 0.356 & = 1/2.8 \\
6 & $(0,6)$ & 0.427 & = 1/2.3 \\ \midrule
20 & $(10,10)$ & 96.5 & \\
20 & $(3,17)$ & 0.2 & = 1/5 \\
20 & $(0,20)$ & 1.83 & \\ \bottomrule
\end{tabular}
\end{center}
}{%
\caption{Outcome of model comparison between models $\H_1$ and $\H_0$
for the `bent coin'. Model $\H_0$ states that $p_{\ta}=1/6$, $p_{\tb}=5/6$.}
\label{tab.mod.comp}
}
\end{table}
With small amounts of data (six tosses, say) it is typically not the case that
one of the two models is overwhelmingly more probable than
the other. But with more data, the evidence against $\H_0$ given
by any data set with the ratio $F_{\ta} \colon F_{\tb}$ differing from $1 \colon 5$ mounts up.
%
% add figure showing some typical histories
%
You can't predict in advance how much data are needed to be pretty sure
 which theory is true.\index{key points!how much data needed} It depends what $p_{\ta}$ is.
%
% THIS IS A VERY GENERAL
% message for machine learning.
% corrected Wed 28/11/01
The simpler model, $\H_0$, since it has no adjustable parameters,
is able to lose out by the biggest margin. The odds may be hundreds to one
against it. The more complex model can never lose out
by a large margin; there's no data set that is actually {\em unlikely\/}
given model $\H_1$.
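 One way to make this last statement concrete, offered as a sketch:
 the evidence for $\H_1$ is the numerator of equation (\ref{eq.compare.final}),
 the form that follows from a uniform prior on $p_{\ta}$.
 Every sequence with counts $(F_{\ta},F_{\tb})$ has this same probability
 under $\H_1$, and there are $F!/(F_{\ta}! \, F_{\tb}!)$ such sequences,
 so the probability of the counts is
\beq
 P( F_{\ta} \given F,\H_1 ) =
 \frac{ F! }{ F_{\ta}! \, F_{\tb}! } \,
 \frac{ F_{\ta}! \, F_{\tb}! }{ (F+1)! }
 = \frac{1}{F+1} ,
\eeq
 which is the same for all $F+1$ possible values of $F_{\ta}$.
 Under $\H_1$, then, no value of the count is less probable than any other.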
\exercisaxB{2}{ex.evidencebounds}{
Show that after $F$ tosses have taken place, the
biggest value that the log evidence ratio
\beq
\log \frac{ P( \bs \given F,\H_1 ) }
{ P( \bs \given F,\H_0 ) }
\eeq
can have scales {\em linearly\/} with $F$ if
$\H_1$ is more probable, but
the log evidence in favour of $\H_0$ can grow
at most as $\log F$.
}
\exercissxB{3}{ex.evidenceest}{
Putting your sampling theory hat on, assuming $F_{\ta}$ has not yet been measured,
compute a plausible range that
% the mean and variance -- or some sort of most probable value, and indication of spread -- of the
the log evidence ratio might lie in, as a function of $F$ and
the true value of $p_{\ta}$,
and sketch it
as a function of $F$ for $p_{\ta}=p_0=1/6$, $p_{\ta}=0.25$,
and $p_{\ta}=1/2$.
[Hint: sketch the log evidence as a function
of the random variable $F_{\ta}$ and work out the mean
and standard deviation of $F_{\ta}$.]
% [Hint: Taylor-expand the log evidence as a function
% of $F_{\ta}$.]
}
%
% This page comes out rotated bizarrely by 90 degrees in pdf
%
\subsection{Typical behaviour of the evidence}
% see figs/sixtoone
% and bin/sixtoone.p
\Figref{fig.evidencetyp} shows the log evidence ratio
as a function of the number of
tosses, $F$, in a number of simulated experiments.
In the left-hand experiments, $\H_0$ was true.
In the right-hand ones, $\H_1$ was true, and the value of
$p_{\ta}$ was either 0.25 or 0.5.
% \newcommand{\sixtoone}[2]{% in newcommands1.tex
\begin{figure}
\figuremargin{%
\small%
\begin{center}
\begin{tabular}{cccc}
$\H_0$ is true &&
\multicolumn{2}{c}{$\H_1$ is true} \\ \cmidrule{1-1}\cmidrule{3-4}
\sixtoone{$p_{\ta}=1/6$}{h09}&&
\sixtoone{$p_{\ta}=0.25$}{h69}&
\sixtoone{$p_{\ta}=0.5$}{h29}\\
\sixtoone{}{h08}&&
\sixtoone{}{h68}&
\sixtoone{}{h28}\\
\sixtoone{}{h07}&&
\sixtoone{}{h67}&
\sixtoone{}{h27}\\
\end{tabular}
\end{center}
}{%
\caption[a]{Typical behaviour of the evidence in favour of $\H_1$ as
bent coin tosses accumulate\index{typicality!behaviour of evidence}\index{evidence!typical behaviour of}\index{model comparison!typical evidence}
under three different conditions. Horizontal axis is the number of
tosses, $F$. The vertical axis on the left is
$\ln \smallfrac{ P( \bs \given F,\H_1 ) }
{ P( \bs \given F,\H_0 ) }$;
the right-hand vertical axis shows the values of
$\smallfrac{ P( \bs \given F,\H_1 ) }
{ P( \bs \given F,\H_0 ) }$.
(See also \protect\figref{fig.evidenceMSD}, \pref{fig.evidenceMSD}.)
}
\label{fig.evidencetyp}
}%
\end{figure}
We will discuss model comparison more in a later chapter.
\section{An example of legal evidence}
\label{ex.blood.sol}% special label by hand
The following example
% (\exerciseonlyref{ex.blood})
illustrates that there is more
to Bayesian inference than the priors.
\begin{quote}
% Two people have left traces of their own blood at the scene of a
% crime. Their blood groups can be reliably identified from these
% traces and are found
% to be of type `O' (a common type in the local population, having
% frequency 60\%) and of type `AB' (a rare type, with frequency 1\%).
% A suspect is tested and found to have type `O' blood.
% A careless lawyer might claim that the fact that the suspect's
% blood type was found at the scene is positive evidence for the theory
% that he was present. But do these data
% $D=$ \{type `O' and `AB' blood were found at scene\} make it more
% probable that this suspect was one of the two people present at the
% crime?
Two people have left traces of their own blood at the scene of a
crime.
A suspect, Oliver, is tested and found to have type `O' blood.
The blood groups of the two traces
are found
to be of type `O' (a common type in the local population, having
frequency 60\%) and of type `AB' (a rare type, with frequency 1\%).
Do these data
(type `O' and `AB' blood were found at scene) give evidence in favour
of the proposition that Oliver was one of the two people present at the
crime?
\end{quote}
A careless \ind{lawyer} might claim that the fact that the suspect's
blood type was found at the scene is positive evidence for the theory
that he was present. But this is not so.
Denote the proposition `the suspect and one unknown person were
present' by $S$. The alternative, $\bar{S}$, states `two unknown people
from the population were present'.
The prior in this problem is the prior probability ratio between the
propositions $S$ and $\bar{S}$. This quantity is important to the final
verdict and would be based on all other available information
in the case. Our task here is just to evaluate the contribution made by the
data $D$, that is, the likelihood ratio, $P(D\given S,\H)/P(D\given \bar{S},\H)$.
In my view, a jury's task should generally be to multiply together carefully
evaluated
likelihood ratios from each independent piece of admissible evidence
with an equally carefully reasoned prior probability.
[This view is shared by many statisticians but learned British appeal judges\index{judge}
recently disagreed and actually overturned the verdict of a trial
because the \index{jury}{jurors} {\em had\/} been taught to use \Bayes\ theorem to
handle complicated \ind{DNA} evidence.]
%
The probability of the data given $S$ is the probability that one unknown person
drawn from the population has blood type AB:
\beq
P(D\given S,\H) = p_{\rm{AB}}
\eeq
(since given $S$, we already know that one trace will be of type O).
The probability of the data given $\bar{S}$ is the
 probability that two unknown people drawn from the population have
 types O and AB (in either order, hence the factor of 2):
\beq
P(D\given \bar{S},\H) = 2 \, p_{\rm{O}} \, p_{\rm{AB}} .
\eeq
In these equations $\H$ denotes the assumptions that two people were
present and left blood there, and that the probability distribution
of the blood groups of unknown people in an explanation is the same
as the population frequencies.
% Our posterior probability ratio for
% $S$ relative to $\bar{S}$ is obtained by multiplying the probability
% ratio based on all other independent information by the ratio of
% these likelihoods. The most straightforward way to summarize the
% contribution of any piece of evidence is in terms of a likelihood
% ratio.
Dividing, we obtain the likelihood ratio:
\beq
\frac{P(D\given S,\H)}{P(D\given \bar{S},\H)} = \frac{1}{2 p_{\rm O}}
= \frac{1}{2 \times 0.6}
= 0.83 .
\eeq
Thus the data in fact provide weak evidence {\em against\/} the
supposition that Oliver was present.
This result may seem surprising, so let us examine it from
various points of view. First consider the case of another suspect,
Alberto,
who has type AB. Intuitively, the data do provide evidence in favour
of the theory $S'$ that this suspect was present, relative to the
null hypothesis $\bar{S}$. And indeed the likelihood ratio in this
case is:
\beq
\frac{P(D\given S',\H)}{P(D\given \bar{S},\H)} = \frac{1}{2\, p_{\rm{AB}}} = 50.
\eeq
Now let us change the situation slightly; imagine that 99\% of people
are of blood type O, and the rest are of type AB. Only these two
blood types exist in the population. The data at the
scene are the same as before. Consider again how these data influence
our beliefs about Oliver,
a suspect of type O, and Alberto, a suspect of type
AB. Intuitively, we still believe that the presence of the rare AB
blood provides positive evidence that \ind{Alberto} was
there. But does
% we still have the feeling that
the fact that type O
blood was detected at the scene favour the hypothesis that
Oliver was present? If this were the case, that would mean that
regardless of who the suspect is, the data make it more probable they
were present; everyone in the population would be
under greater suspicion, which would be absurd. The data may be {\em
compatible\/} with any suspect of either blood type being present, but
if they provide evidence {\em for\/} some theories, they must also
provide evidence {\em against\/} other theories.
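 Putting numbers to this modified situation, as a sketch that reuses the
 two likelihood ratios derived above with $p_{\rm{O}} = 0.99$ and
 $p_{\rm{AB}} = 0.01$:
\beq
 \frac{P(D\given S,\H)}{P(D\given \bar{S},\H)} = \frac{1}{2 \, p_{\rm{O}}}
 = \frac{1}{1.98} \simeq 0.505 ;
 \hspace{0.3in}
 \frac{P(D\given S',\H)}{P(D\given \bar{S},\H)} = \frac{1}{2 \, p_{\rm{AB}}}
 = 50 .
\eeq
 The data count against Oliver's presence by a factor of about two,
 and in favour of Alberto's by a factor of fifty.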
Here is another way of thinking about this: imagine that instead of
two people's blood stains there are ten, and that in the entire local
population of one hundred, there are ninety type O suspects and ten
type AB suspects.
% Initially all 100 people are suspects.
Consider a particular type O suspect, \ind{Oliver}: without any other information,
and before the blood test results come in,
there is a one in 10 chance that he was at the scene, since
we know that 10 out of the 100 suspects were present. We now get the
results of blood tests, and find that {\em nine\/} of the ten stains are of
type AB, and {\em one\/} of the stains is of type O. Does this make it more
likely that Oliver was there? No,
% although he could have been,
there is now only a one in ninety chance that he was there, since we
know that only one person present was of type O.
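 In terms of odds, and offered as an illustrative check on this example:
 the prior odds that Oliver was present are $1 \colon 9$ and the
 posterior odds are $1 \colon 89$, so the blood data have multiplied
 the odds by
\beq
 \frac{1/89}{1/9} = \frac{9}{89} \simeq 0.1 ,
\eeq
 a likelihood ratio well below one.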
Maybe the intuition is aided finally by writing down the formulae for
the general case where $n_{\rm{O}}$ blood stains of individuals of type O
are found, and $n_{\rm{AB}}$ of type $\rm{AB}$, a total of $N$ individuals in
all, and unknown people come from a large population with fractions
$p_{\rm{O}}, p_{\rm{AB}}$. (There may be other blood types too.)
The task is to evaluate the likelihood ratio for the
two hypotheses: $S$, `the type O suspect (Oliver)
and $N\!-\!1$ unknown others
left $N$ stains'; and $\bar{S}$, `$N$ unknowns left $N$ stains'. The
probability of the data under hypothesis $\bar{S}$ is just the
probability of getting $n_{\rm{O}}, n_{\rm{AB}}$ individuals of the two types
when $N$ individuals are drawn at random from the population:
\beq
P(n_{\rm{O}},n_{\rm{AB}}\given \bar{S}) =
\frac{ N! }{ n_{\rm{O}} ! \, n_{\rm{AB}}! } p_{\rm{O}}^{n_{\rm{O}}} p_{\rm{AB}}^{n_{\rm{AB}}} .
\eeq
In the case of hypothesis $S$, we need the distribution of
the $N\!-\!1$ other individuals:
\beq
P(n_{\rm{O}},n_{\rm{AB}}\given S) =
\frac{ (N-1)! }{ (n_{\rm{O}}-1)! \, n_{\rm{AB}}! } p_{\rm{O}}^{n_{\rm{O}}-1} p_{\rm{AB}}^{n_{\rm{AB}}} .
\eeq
The likelihood ratio is:
\beq
\frac{ P(n_{\rm{O}},n_{\rm{AB}}\given S) }{ P(n_{\rm{O}},n_{\rm{AB}}\given \bar{S}) }
= \frac{n_{\rm{O}}/N}{p_{\rm{O}}} .
\eeq
This is an instructive result. The likelihood ratio, \ie\ the
contribution of these data to the question of whether Oliver
was present, depends simply on a comparison of the frequency
of his blood type
% type O blood
in the observed data with the background frequency
% of type O blood
in the population. There is no dependence on the counts
of the other types found at the scene, or their frequencies in the
population. If there are more type O stains than the average number
expected under hypothesis $\bar{S}$, then the data give
evidence in favour of the presence of Oliver.
Conversely, if there are fewer type O stains than the expected number
under $\bar{S}$, then the data reduce the probability of the
hypothesis that he was there. In the special case $n_{\rm{O}}/N = p_{\rm{O}}$, the
data contribute no evidence either way, regardless of the fact that
the data are compatible with the hypothesis $S$.
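 To spell out the division that gives the likelihood ratio above,
 and to check it against the original two-stain problem
 (in which $N=2$, $n_{\rm{O}}=1$ and $p_{\rm{O}}=0.6$):
\beq
 \frac{ P(n_{\rm{O}},n_{\rm{AB}}\given S) }{ P(n_{\rm{O}},n_{\rm{AB}}\given \bar{S}) }
 = \frac{ (N-1)! }{ N! } \,
 \frac{ n_{\rm{O}}! }{ (n_{\rm{O}}-1)! } \,
 \frac{ p_{\rm{O}}^{n_{\rm{O}}-1} }{ p_{\rm{O}}^{n_{\rm{O}}} }
 = \frac{ n_{\rm{O}} }{ N \, p_{\rm{O}} } ;
 \hspace{0.3in}
 \frac{1/2}{0.6} \simeq 0.83 .
\eeq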
\section{Exercises}
% \subsection*{The game show}
%\subsubsection*{The normal rules}
%\subsubsection*{The earthquake scenario}
\exercissxA{2}{ex.3doors}{
{\sf The \ind{three doors},\index{Monty Hall problem} normal rules.}
% "Let's Make A Deal," hosted by Monty Hall
On a \ind{game show},\index{doors, on game show}\index{game!three doors}
a contestant is told the rules as
follows:
\begin{quote}
There are three doors, labelled 1, 2, 3. A single
prize has been hidden behind one of
them. You get to select one door. Initially your chosen door will {\em not\/}
be opened. Instead, the gameshow host will open one of the other two doors,
and {\em he will do so in such a way as not to reveal the prize.}
For example, if you first
choose door 1, he will then open {one\/} of doors 2 and 3, and it
is guaranteed that he will choose which one to open so that
the prize will not be revealed.
At this point, you will be given a fresh choice of door:
you can either stick with your first choice,
or you can switch to the other
closed door. All the doors will then be opened and
you will receive whatever is behind your final
choice of door.
\end{quote}
Imagine that the contestant chooses door 1 first; then the gameshow host
opens door 3, revealing nothing behind the door, as promised.
Should the contestant (a) stick with door 1, or (b)
switch to door 2, or (c) does it make no difference?
}
\exercissxA{2}{ex.3doorsb}{
{\sf The three doors, earthquake scenario.}
Imagine that the game happens again
and just as the gameshow host is about to open one of the
doors a violent earthquake\index{earthquake, during game show}
rattles the building and one of the
three doors flies open. It happens to be door 3, and it
happens not to have the prize behind it. The contestant had initially
chosen door 1.
Repositioning his toup\'ee,
the host suggests, `OK, since you chose door 1 initially,
door 3 is a valid door for me to open, according to the
rules of the game; I'll let door 3 stay open. Let's carry on
as if nothing happened.'
Should the contestant stick with door 1, or switch to door 2, or
does it make no difference? Assume that the prize was placed randomly, that
the gameshow host does not know where it is, and that the door flew open
because its latch was broken by the earthquake.
[A similar alternative scenario is a gameshow whose {\em confused host\/}\index{confused gameshow host}
forgets the rules, and where the prize is, and opens one of
the unchosen doors at random. He opens door 3, and the prize is not revealed.
Should the contestant choose what's behind door 1 or door 2?
Does the optimal decision for
the contestant depend on the contestant's \ind{belief}s about
whether the gameshow host is confused or not?]\index{game show}\index{three doors}\index{doors, on game show}\index{prize, on game show}\index{Monty Hall problem}
}
\exercisaxB{2}{ex.girlboy}{
%\subsection
{\sf Another example in which the emphasis is not on priors.}
%\begin{quote}
You visit a family whose three children are all at the local school.
You don't know anything about the sexes of the children.
While walking clumsily round the home, you stumble through
one of the three unlabelled bedroom doors that you know
belong, one each, to the three children, and find that the bedroom
contains \ind{girlie stuff} in sufficient quantities to
convince you that the child who lives in that bedroom
is a girl.
Later, you sneak a look at a letter addressed to the parents,
which reads `From the Headmaster:
we are sending this letter to all parents who have male children at
the school to inform them about the following \ind{boyish matters}\ldots'.
These two sources of evidence establish that at least
one of the three children is
a girl, and that at least one of the children is a boy.
What are the probabilities that there are (a) two girls and one boy;
(b) two boys and one girl?
%\end{quote}
}
% Another example of legal evidence}
\exercissxB{2}{ex.simpsons}{
Mrs\ S is found stabbed in her family
garden.
% \index{Simpson, O.J., similar case to}
Mr\ S behaves strangely after her death and is considered as
a suspect. On investigation of police and social records
it is found that Mr\ S had beaten up his wife on at least
 nine previous occasions. The prosecution advances these
 data as evidence in favour of the hypothesis that Mr\ S is
guilty of the murder.
`Ah no,' says
% Mr.\ Merd-Kopf,
Mr\ S's highly paid lawyer,\index{lawyer}\index{wife-beater}\index{murder}
`{\em statistically}, only one in a thousand wife-beaters
actually goes on to murder his wife.\footnote{In the U.S.A., it
is estimated that
% http://www.umn.edu/mincava/papers/factoid.htm
2 million women are abused each year by their partners.
In 1994, $4739$ women were victims of homicide; of those,
% 28 \percent,
$1326$ women (28\%)
were slain by husbands and boyfriends.\\ (Sources:
{\tt http://www.umn.edu/mincava/papers/factoid.htm,\\
http://www.gunfree.inter.net/vpc/womenfs.htm})
% http://www.gunfree.inter.net/vpc/womenfs.htm
% In keeping
% with the fictitious nature of this story, the $1/100\,000$
% figure was made up by me.
}\label{footnote.murder} So the wife-beating
% , which is not denied by Mr\ S,
is not strong evidence at all. In fact,
given the wife-beating evidence alone, it's extremely unlikely
that he would be the murderer of his wife -- only a
$1/1000$ chance. You should therefore find him innocent.'
Is the lawyer
% Mr\ Merd-Kopf
right to imply that the history of wife-beating does
not point to Mr\ S's being the murderer? Or is the lawyer a slimy trickster? If
the latter, what is wrong with his argument?
[Having received an indignant letter from a lawyer about
the preceding paragraph, I'd like to
add an extra inference exercise at this point:
{\em Does my suggestion that Mr.\ S.'s lawyer
may have been a slimy trickster imply that
I believe {\em all} lawyers are slimy tricksters?} (Answer: No.)]
}
% Lewis Carroll's Pillow Problem
\exercisaxB{2}{ex.bagcounter}{ A bag contains one counter, known to be
either white or black. A white counter is put in, the bag is shaken,
and a counter is drawn out, which proves to be white. What is now the
chance of drawing a white counter?
[Notice that
the state of the bag, after the operations, is exactly identical to its state before.]
}
\exercissxB{2}{ex.phonetest}{% ????????????????? needs solution adding (was phonecheck!)
You move into a new house; the phone is connected, and
% you are unsure of your phone number --
you're pretty sure that
the \ind{phone number}\index{telephone number} is
% it's
{\tt 740511}, but not as sure as you would like to be.
%
As an experiment, you pick up the phone and dial {\tt 740511};
you obtain a `busy' signal.
Are you now more sure of your phone number? If so, how much?
}
%
\exercisaxB{1}{ex.othercoin}{
In a game, two coins are tossed. If either of the coins comes up
heads, you have won a prize. To claim the prize, you must point to
one of your coins that is a head
and say `look, that coin's a head, I've won'.
You watch Fred play the game. He tosses the two coins, and he
points to a coin and says `look, that coin's a head, I've won'.
What is the probability that the {\em other\/} coin is a head?
}
%\subsection*{Another quasi-legal story}
% \exercis{ex.}{
% During a radio chat show on the health consequences of
% secondary smoking, it is reported by an expert that
% twelve recent studies have investigated whether
% there was a link between secondary smoking and cancer.
% Of these, eleven studies failed to establish a link
% and one study found significant evidence of a causal
% link -- secondary smoking increasing the risk of getting
% cancer. The expert said that the net evidence from these
% twelve results was that there was significant evidence of a causal
% link.
%
% Shortly thereafter, a Mr.\ N.T.\ Social called in in support
% of smokers' ``rights'' to pollute public air. `If eleven
% of the studies didn't find a link, and only one found a link,
% then it's eleven to one that there isn't a link, isn't it?'
%
% `Well, you clearly don't understand statistics, do you?' responded
% the condescending host.
%
% Can you suggest a more helpful explanation of the expert's statement?
%}
% euro.tex
\exercissxB{2}{ex.eurotoss}{
A statistical statement appeared in
% \footnote{Quoted by Charlotte Denny and Sarah Dennis
{\em The Guardian} on Friday January 4, 2002:
\begin{quote}
When spun on edge 250
times, a Belgian one-euro
coin came up heads 140 times and tails 110.
`It looks very suspicious to me', said Barry Blight, a statistics lecturer
at the London School of Economics.
`If the coin were unbiased the
chance of getting a result as extreme as that would be less than 7\%'.
\end{quote}
But {\em do\/} these
data give evidence that the coin is biased rather than fair?
[Hint: see \eqref{eq.compare.final}.]
}
% \input{tex/bayes_occam.tex}
\dvips
\section{Solutions}% to Chapter \protect\ref{ch.bayes}'s exercises} %
\soln{ex.dieexponential}{
Let the data be $D$. Assuming equal prior probabilities,
\beqan
\frac{P(A \given D)}{P(B \given D)} = \frac{1}{2}\frac{3}{2}\frac{1}{1}\frac{3}{2}
\frac{1}{2}\frac{2}{2}\frac{1}{2} = \frac{9}{32}
\eeqan
and, since the two posterior probabilities sum to one, $P(A \given D) = 9/(9+32) = 9/41$.
% (check me).
}
\soln{ex.dieexponentialb}{
The probability of the data given each hypothesis is:
\beq
P(D \given A) = \frac{3}{20}\frac{1}{20}\frac{2}{20}\frac{1}{20}
\frac{3}{20}\frac{1}{20} \frac{1}{20} =
\frac{18}{20^7} ;
\eeq
\beq
P(D \given B) = \frac{2}{20}\frac{2}{20}\frac{2}{20}\frac{2}{20}
\frac{2}{20}\frac{1}{20} \frac{2}{20}
= \frac{64}{20^7} ;
\eeq
\beq
P(D \given C) = \frac{1}{20}\frac{1}{20}\frac{1}{20}\frac{1}{20}
\frac{1}{20}\frac{1}{20} \frac{1}{20}
= \frac{1}{20^7}.
\eeq
So
\beq
% \hspace*{-0.1in}
P(A \given D) = \frac{18}{18+64+1} = \frac{18}{83} ; \hspace{0.3in}
P(B \given D) = \frac{64}{83} ;\hspace{0.3in}
P(C \given D) = \frac{1}{83} .
\eeq
}
\fakesection{Bent coin exercise solns}
\begin{figure}[htbp]
\figuremargin{%
\footnotesize
\begin{center}
\begin{tabular}{cc}
(a) \psfig{figure=figs/aba.ps,width=2in,angle=-90}&
(b) \psfig{figure=figs/bbb.ps,width=2in,angle=-90}\\
$P( p_{\tt{a}} \given \bs\eq {\tt{aba}} ,F\eq 3) \propto p_{\tt{a}}^2 (1-p_{\tt{a}})$
&
$P( p_{\tt{a}} \given \bs\eq {\tt{bbb}} ,F\eq 3) \propto (1-p_{\tt{a}})^3$ \\
\end{tabular}
\end{center}
}{%
\caption[a]{Posterior probability for the bias $p_a$ of a bent coin given
two different data sets.}
\label{fig.aba.bbb}
}%
\end{figure}
\soln{ex.postpaII}{% relabelled from postpa Sun 6/4/03 - beware incorrect refs likely
\ben
\item
$P( p_{\tt{a}} \given \bs\eq {\tt{aba}} ,F\eq 3) \propto p_{\tt{a}}^2 (1-p_{\tt{a}})$.
The most probable value of $p_{\tt{a}}$ (\ie, the value that maximizes
the posterior probability density) is $2/3$.
The mean value of $p_{\tt{a}}$ is $3/5$.
See \figref{fig.aba.bbb}a.
\item
$P( p_{\tt{a}} \given \bs\eq {\tt{bbb}} ,F\eq 3) \propto (1-p_{\tt{a}})^3$.
The most probable value of $p_{\tt{a}}$ (\ie, the value that maximizes
the posterior probability density) is $0$.
The mean value of $p_{\tt{a}}$ is $1/5$.
See \figref{fig.aba.bbb}b.
\een
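 These values can be checked as follows. This is a sketch that assumes
 the uniform prior on $p_{\tt{a}}$ and uses the beta integral
 $\int_0^1 p^{a} (1-p)^{b} \, {\rm d}p = a! \, b! / (a+b+1)!$.
 For ${\tt{aba}}$, setting the derivative of the posterior to zero gives the mode,
 and a ratio of two beta integrals gives the mean:
\beqan
 \frac{{\rm d}}{{\rm d}p_{\tt{a}}} \, p_{\tt{a}}^2 (1-p_{\tt{a}})
 & = & 2 p_{\tt{a}} - 3 p_{\tt{a}}^2 = 0
 \:\: \Rightarrow \:\: p_{\tt{a}} = 2/3 ; \\
 \frac{ \int_0^1 p_{\tt{a}}^3 (1-p_{\tt{a}}) \, {\rm d}p_{\tt{a}} }
 { \int_0^1 p_{\tt{a}}^2 (1-p_{\tt{a}}) \, {\rm d}p_{\tt{a}} }
 & = & \frac{ 3! \, 1! / 5! }{ 2! \, 1! / 4! }
 = \frac{ 1/20 }{ 1/12 } = \frac{3}{5} .
\eeqan
 For ${\tt{bbb}}$, the posterior $(1-p_{\tt{a}})^3$ decreases monotonically
 on $[0,1]$, so its maximum is at $p_{\tt{a}} = 0$, and the mean is
\beq
 \frac{ \int_0^1 p_{\tt{a}} (1-p_{\tt{a}})^3 \, {\rm d}p_{\tt{a}} }
 { \int_0^1 (1-p_{\tt{a}})^3 \, {\rm d}p_{\tt{a}} }
 = \frac{ 1! \, 3! / 5! }{ 0! \, 3! / 4! }
 = \frac{ 1/20 }{ 1/4 } = \frac{1}{5} .
\eeq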
}
%/home/mackay/_courses/itprnn/figs
%gnuplot> plot x**2*(1-x)
%gnuplot> set xrange [0:1]
%gnuplot> replot
%gnuplot> set nokey
%gnuplot> set size 0.4,0.4
%gnuplot> replot
%gnuplot> set noytics
%gnuplot> replot
%gnuplot> set yrange [0:0.4]
%gnuplot> replot
%gnuplot> set yrange [0:0.17]
%gnuplot> replot
%gnuplot> set term post
%Terminal type set to 'postscript'
%Options are 'landscape monochrome dashed "Helvetica" 14'
%gnuplot> set output "aba.ps"
%gnuplot> replot
%gnuplot> set term X
%Terminal type set to 'X11'
%gnuplot> set yrange [0:1]
%gnuplot> plot (1-x)**3
%gnuplot> set term post
%Terminal type set to 'postscript'
%Options are 'landscape monochrome dashed "Helvetica" 14'
%gnuplot> set output "bbb.ps"
%gnuplot> replot
\fakesection{evidence est}
\begin{figure}[htbp]
\figuremargin{%
\small%
\begin{center}
\begin{tabular}{cccc}
$\H_0$ is true &&
\multicolumn{2}{c}{$\H_1$ is true} \\ \cmidrule{1-1}\cmidrule{3-4}
\sixtoone{$p_a=1/6$}{h0MSD}&&
\sixtoone{$p_a=0.25$}{h6MSD}&
\sixtoone{$p_a=0.5$}{h2MSD}\\
\end{tabular}
\end{center}
}{%
\caption[a]{Range of plausible values of the log evidence in favour of $\H_1$ as
a function of $F$. The vertical axis on the left is
$\log \smallfrac{ P( \bs \given F,\H_1 ) }
{ P( \bs \given F,\H_0 ) }$;
the right-hand vertical axis shows the values of
$\smallfrac{ P( \bs \given F,\H_1 ) }
{ P( \bs \given F,\H_0 ) }$.
\index{typicality!behaviour of evidence}\index{evidence!typical behaviour of}\index{model comparison!typical behaviour of evidence}%
The solid line shows the log evidence if the random variable $F_a$
takes on its mean value, $F_a = p_aF$. The dotted lines show (approximately)
the log evidence if $F_a$ is at its 2.5th or 97.5th percentile.
(See also \protect\figref{fig.evidencetyp}, \pref{fig.evidencetyp}.)
}
\label{fig.evidenceMSD}
}%
\end{figure}
\soln{ex.evidenceest}{
 The curves in \figref{fig.evidenceMSD} were obtained by finding the mean and standard deviation
of $F_a$, then setting $F_a$ to the mean $\pm$ two standard deviations
to get a 95\% plausible range for $F_a$, and computing the three
corresponding values of the log evidence ratio.
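 In symbols, and assuming as in the exercise that the $F$ tosses are
 independent with true probability $p_a$, so that $F_a$ has a binomial
 distribution, the quantities used were
\beq
 \mbox{mean}(F_a) = p_a F , \hspace{0.3in}
 \mbox{s.d.}(F_a) = \sqrt{ F p_a (1-p_a) } ,
\eeq
 and the log evidence ratio, from the evidence ratio in
 equation (\ref{eq.compare.final}) with $F_b = F - F_a$, is
\beq
 \log \frac{ P( \bs \given F,\H_1 ) }{ P( \bs \given F,\H_0 ) }
 = \log \frac{ F_a! \, F_b! }{ (F+1)! }
 - F_a \log p_0 - F_b \log (1-p_0) .
\eeq
 When $F_a$ is not an integer, the factorials may be interpreted through
 the gamma function, $x! = \Gamma(x+1)$; this interpolation is an
 assumption of this sketch rather than something stated in the solution.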
}%
\soln{ex.3doors}{
Let $\H_i$ denote the hypothesis that the prize is behind
door $i$.
We make the following assumptions: the three hypotheses
$\H_1$, $\H_2$ and $\H_3$ are equiprobable {\em a priori}, \ie,
\beq
P(\H_1) = P(\H_2) = P(\H_3) = \frac{1}{3} .
\eeq
The datum we receive, after choosing door 1,
is one of $D \eq 3$ and $D \eq 2$ (meaning door 3 or 2 is opened, respectively).
We assume that these two possible outcomes have the following probabilities.
If the prize is behind door 1 then the host has a free choice; in
this case we assume that the host selects at random between $D\eq 2$ and $D\eq 3$.
Otherwise the choice of the host is forced and the probabilities
are 0 and 1.
\beq
\begin{array}{|r@{\,}c@{\,}l|r@{\,}c@{\,}l|r@{\,}c@{\,}l|}
P( D\eq 2 \given \H_1) &=& \dfrac{1}{2} &
P( D\eq 2 \given \H_2) &=& 0 &
P( D\eq 2 \given \H_3) &=& {1} \\
P( D\eq 3 \given \H_1) &=& \dfrac{1}{2} &
P( D\eq 3 \given \H_2) &=& {1} &
P( D\eq 3 \given \H_3) &=& 0
\end{array}
\eeq
Now, using \Bayes\ theorem, we evaluate the posterior probabilities
of the hypotheses:
\beq
P( \H_i \given D\eq3 ) = \frac{P( D\eq3 \given \H_i) P(\H_i) }{P(D\eq3) }
\eeq
\beq
\begin{array}{|r@{\,}c@{\,}l|r@{\,}c@{\,}l|r@{\,}c@{\,}l|}
P(\H_1 \given D\eq 3) &=& \frac{ (1/2) (1/3) }{P(D\normaleq 3) } &
P(\H_2 \given D\eq 3) &=& \frac{ ({1}) (1/3) }{P(D\normaleq 3) } &
P(\H_3 \given D\eq 3) &=& \frac{ ({0}) (1/3) }{P(D\normaleq 3) }
\end{array}
\eeq
The denominator $P(D\eq 3)$ is $(1/2)$ because it is the normalizing
constant for this posterior distribution.
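 Explicitly, summing the three numerators,
\beq
 P(D\eq 3) = \sum_{i=1}^{3} P( D\eq 3 \given \H_i) P(\H_i)
 = \frac{1}{2} \cdot \frac{1}{3} + 1 \cdot \frac{1}{3} + 0 \cdot \frac{1}{3}
 = \frac{1}{2} .
\eeq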
So
\beq
\begin{array}{|rcl|rcl|rcl|}
P( \H_1 \given D\eq3 ) &=& \dfrac{ 1}{3} &
P(\H_2 \given D\eq3) &=& \dfrac{ 2}{3} &
P(\H_3 \given D\eq3) &=& 0 .
\end{array}
\eeq
So the contestant should switch to door 2 in order to have
the biggest chance of getting the prize.
Many people find this outcome surprising. There are two
ways to make it more intuitive. One is to play the game\index{game!three doors}
thirty
times with a friend and keep track of the frequency with
which switching gets the prize. Alternatively,
you can perform a thought experiment in which the game is
played with a million doors. The rules are now that the contestant
chooses one door, then the game show host opens
999,998 doors in such a way as not to reveal the prize, leaving
the {\em contestant's\/}
selected door and {\em one other door\/}
closed. The contestant may
now stick or switch.
Imagine the contestant confronted by a million doors, of which
doors 1 and 234,598 have not been opened, door 1 having been
the contestant's initial guess. Where do you think the prize is?
}
%
\soln{ex.3doorsb}{
% earthquake rules.
If door 3 is opened by an earthquake, the inference comes out
differently -- even though visually the scene looks the same. The
nature of the data, and the probability of the data, are both now
 different. Consider the possible data outcomes. Firstly, any number
 of the doors might have opened. We could label the eight possible
outcomes $\bd = (0,0,0), (0,0,1), (0,1,0), (1,0,0), (0,1,1), \ldots,
(1,1,1)$. Secondly, it might be that the prize is visible after the
earthquake has opened one or more doors. So the data $D$ consists of
the value of $\bd$, and a statement of whether the prize was
revealed. It is hard to say what the probabilities of these outcomes
are, since they depend on our beliefs about the reliability
of the door latches and the properties of earthquakes,
but it is possible to extract the desired posterior probability
without naming the values of $P(\bd \given \H_i)$ for each $\bd$. All that
matters are the relative values of the quantities $P(D \given \H_1)$,
$P(D \given \H_2)$, $P(D \given \H_3)$, for the value of $D$ that actually occurred.
[This is the {\dem\ind{likelihood principle}}, which
we met in \sectionref{sec.lp}.]
% !!!!!!!!! add page ref?
The value of $D$ that actually occurred is
`$\bd \eq (0,0,1)$, and no prize visible'. First, it is clear that
$P(D \given \H_3)=0$, since the datum that no prize is visible is
incompatible with $\H_3$. Now, assuming that the contestant selected
door 1, how does the probability $P(D \given \H_1)$ compare with
$P(D \given \H_2)$? Assuming that earthquakes are not sensitive to
decisions of game show contestants,
these two quantities have to be equal, by symmetry. We don't know how likely it is
that door 3 falls off its hinges, but however likely it is, it's just
as likely to do so whether the prize is behind door 1 or door 2. So,
if $P(D \given \H_1)$ and $P(D \given \H_2)$ are equal, we obtain:
\beq
\begin{array}{|r@{\,\,=\,\,}l|r@{\,\,=\,\,}l|r@{\,\,=\,\,}l|}
P(\H_1 | D) & \smallfrac{ P(D | \H_1) (\smalldfrac{1}{3}) }{P(D) } &
P(\H_2 | D) & \smallfrac{ P(D | \H_2) (\smalldfrac{1}{3}) }{P(D) } &
P(\H_3 | D) & \smallfrac{ P(D | \H_3) (\smalldfrac{1}{3}) }{P(D) }
\\
& \dfrac{ 1}{2} &
& \dfrac{ 1}{2} &
& 0 .
\end{array}
\eeq
The two possible hypotheses are now equally likely.
If we assume that
the host knows where the prize is and might be acting
deceptively, then the answer might be further modified, because we
have to view the host's words as part of the data.
Confused? It's well worth making sure you
understand these two gameshow problems.
Don't worry, I slipped up on the second problem, the
first time I met it.
There is a general rule which helps immensely
when you have a confusing probability problem:\index{key points!how to solve probability problems}
\begin{conclusionbox}
Always write down the probability of everything.\\ {
\hfill {\em (Steve Gull)} \par
}
\end{conclusionbox}
From this joint probability, any desired inference can
be mechanically obtained (\figref{fig.everything}).
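 For example, take the row of \figref{fig.everything} in which door 3 alone
 is opened, and discard the entry in which the prize is behind door 3
 (ruled out because no prize was visible). Assuming $p_3 > 0$, this gives
\beq
 P(\H_1 \given D) = \frac{ p_3/3 }{ p_3/3 + p_3/3 } = \frac{1}{2} ,
\eeq
 in agreement with the result obtained above by symmetry.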
\amarginfig{b}{
\begin{center}
\newcommand{\tabwidth}{30}
\newcommand{\tabheight}{80}
\setlength{\unitlength}{1mm}{
\begin{picture}(43,92)(-13,0)
\put(15,90){\makebox(0,0){\small\sf{Where the prize is}}}
\put( 5,85){\makebox(0,0){\small{door}}}
\put(15,85){\makebox(0,0){\small{door}}}
\put(25,85){\makebox(0,0){\small{door}}}
\put( 5,82){\makebox(0,0){\small{1}}}
\put(15,82){\makebox(0,0){\small{2}}}
\put(25,82){\makebox(0,0){\small{3}}}
\put(-1, 5){\makebox(0,0)[r]{\footnotesize{1,2,3}}}
\put(-1,15){\makebox(0,0)[r]{\footnotesize{2,3}}}
\put(-1,25){\makebox(0,0)[r]{\footnotesize{1,3}}}
\put(-1,35){\makebox(0,0)[r]{\footnotesize{1,2}}}
\put(-1,45){\makebox(0,0)[r]{\footnotesize{3}}}
\put( 5,75){\makebox(0,0){\footnotesize{$\displaystyle\frac{p_{\rm none}}{3}$}}}
\put(15,75){\makebox(0,0){\footnotesize{$\displaystyle\frac{p_{\rm none}}{3}$}}}
\put(25,75){\makebox(0,0){\footnotesize{$\displaystyle\frac{p_{\rm none}}{3}$}}}
\put( 5,45){\makebox(0,0){\footnotesize{$\displaystyle\frac{p_{3}}{3}$}}}
\put(15,45){\makebox(0,0){\footnotesize{$\displaystyle\frac{p_{3}}{3}$}}}
\put(25,45){\makebox(0,0){\footnotesize{$\displaystyle\frac{p_{3}}{3}$}}}
\put( 5, 5){\makebox(0,0){\footnotesize{$\displaystyle\frac{p_{1,2,3}}{3}$}}}
\put(15, 5){\makebox(0,0){\footnotesize{$\displaystyle\frac{p_{1,2,3}}{3}$}}}
\put(25, 5){\makebox(0,0){\footnotesize{$\displaystyle\frac{p_{1,2,3}}{3}$}}}
\put(-1,55){\makebox(0,0)[r]{\footnotesize{2}}}
\put(-1,65){\makebox(0,0)[r]{\footnotesize{1}}}
\put(-1,75){\makebox(0,0)[r]{\footnotesize{none}}}
\put(-12,40){\makebox(0,0){\rotatebox{90}{\small\sf{Which doors opened by earthquake}}}}
\multiput(0,0)(0,10){9}{\line(1,0){\tabwidth}}
\multiput(0,0)(10,0){4}{\line(0,1){\tabheight}}
\end{picture}}
\end{center}
\caption[a]{The probability of everything, for the second three-door problem,
assuming an earthquake has just occurred.
Here, $p_3$ is the probability that door 3 alone is opened by an earthquake.}
\label{fig.everything}
}
}
\fakesection{simpsons}
\soln{ex.simpsons}{
The statistic quoted by the lawyer indicates the
% {prior\/}
probability
% \index{Simpson, O.J., similar case to}%
%\index{Simpson, O.J., allusion to}
\index{lawyer}\index{wife-beater}\index{murder}
that a randomly selected wife-beater will also murder his wife.
The probability that the husband was the murderer, {\em given
that the wife has been murdered}, is a completely different quantity.
To deduce the latter, we need to make further assumptions about
the probability that the wife is murdered by someone else.
If she lives in a neighbourhood with frequent random murders, then
this probability is large and the posterior probability that
the husband did it (in the absence of other evidence) may not
be very large. But in more peaceful regions, it may well be
that the most likely person to have murdered you, if you are found
murdered, is
one of your closest relatives.
%{\em Numbers here.}
Let's work out some illustrative numbers with the help
of the statistics on page \pageref{footnote.murder}.
Let $m\eq 1$ denote the proposition that a woman has been murdered;
$h\eq 1$, the proposition that the husband did it; and $b\eq 1$,
the proposition that he beat her in the year preceding the
murder. The statement `someone else did it'
is denoted by $h\eq 0$.
We need to define $P(h \given m\eq 1)$, $P(b \given h\eq 1,m\eq 1)$, and $P(b\eq 1 \given h\eq 0,m\eq 1)$
in order to compute the posterior probability $P(h\eq 1 \given b\eq 1,m\eq 1)$.
From the statistics, we can read out $P(h\eq 1 \given m\eq 1)=0.28$.
And if two million women out of 100 million are beaten,
then $P(b\eq 1 \given h\eq 0,m\eq 1)=0.02$. Finally, we need a
value for $P(b \given h\eq 1,m\eq 1)$: if a man murders his wife, how likely is
it that this is the first time he laid a finger on her? I
expect it's pretty unlikely; so maybe $P(b\eq 1 \given h\eq 1,m\eq 1)$ is 0.9
or larger.
By \Bayes\ theorem, then,
\beq
P(h\eq 1 \given b\eq 1,m\eq 1)
= \frac{ .9 \times .28 }{ .9 \times .28 + .02 \times .72 }
\simeq 95\% .
\eeq
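To make the arithmetic explicit (the three input probabilities are the
illustrative guesses made above, not measured values), the calculation is
\beq
\begin{array}{r@{\,\,=\,\,}l}
 P(h\eq 1 \given b\eq 1,m\eq 1) &
 \displaystyle
 \frac{ P(b\eq 1 \given h\eq 1,m\eq 1) \, P(h\eq 1 \given m\eq 1) }
 { \sum_{h' \in \{0,1\}} P(b\eq 1 \given h\eq h',m\eq 1) \, P(h\eq h' \given m\eq 1) }
 \\[0.2in]
 &
 \displaystyle
 \frac{ 0.252 }{ 0.252 + 0.0144 } = \frac{ 0.252 }{ 0.2664 } \simeq 0.946 ,
\end{array}
\eeq
 where $P(h\eq 0 \given m\eq 1) = 1 - 0.28 = 0.72$.
 The conclusion is not very sensitive to the guessed value of
 $P(b\eq 1 \given h\eq 1,m\eq 1)$: if it were only 0.5, the same formula
 would give $0.14/0.1544 \simeq 0.91$.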
One way to make obvious the sliminess of the lawyer on \pref{ex.simpsons}
is to construct arguments, with the same logical structure
as his, that
are clearly wrong. For example, the lawyer could say `Not only
was Mrs.\ S murdered, she was murdered between 4.02pm and
4.03pm. {\em Statistically}, only one in a {\em million\/} wife-beaters
actually goes on to murder his wife between 4.02pm and
4.03pm. So the wife-beating
% , which is not denied by Mr.\ S,
is not strong evidence at all. In fact,
given the wife-beating evidence alone, it's extremely unlikely
that he would murder his wife in this way -- only a
1/1,000,000 chance.'
}
% arrived here Sun 6/4/03
\soln{ex.phonetest}{% was phonecheck
There are two hypotheses.
$\H_0$: your number is {\tt 740511}; $\H_1$: it is another number.
The data, $D$, are `when I dialled {\tt 740511}, I got a busy signal'.
What is the probability of $D$, given each hypothesis?
If your number is {\tt 740511}, then we expect a busy signal with certainty:
\[
P(D \given \H_0) = 1 .
\]
On the other hand, if $\H_1$ is true, then the probability that the number dialled
returns a busy signal is smaller than 1, since various other outcomes
were also possible (a ringing tone, or a number-unobtainable signal,
for example). The value of this probability $P(D \given \H_1)$
will depend on the probability $\alpha$ that a random phone number
similar to your own phone number would be a valid phone number,
and on the probability $\beta$ that you get a busy signal when you dial
a valid phone number.
% 37 per col, 4 cols per page, 250 pages.
% 20 per col, 3 cols per page, 270 pages.
% 50,000. maybe another 50% ex-directory?
I estimate from the size of
my phone book that Cambridge has about $75\,000$ valid phone numbers, all of length six
digits. The probability that a random six-digit number is valid is
therefore about $75\,000/10^6 = 0.075$. If we exclude numbers beginning with 0, 1, and 9
from the random choice, the probability $\a$
is about $75\,000/700\,000 \simeq 0.1$.
If we assume that
telephone numbers are clustered then a misremembered number
might be more likely to be valid than a randomly chosen number; so
the probability, $\alpha$,
that our guessed number would be valid, assuming $\H_1$ is true,
might be bigger than 0.1. Anyway, $\alpha$ must be somewhere between 0.1 and 1.
We can carry forward this uncertainty in the probability
and see how much it matters at the end.
The probability $\beta$ that you get a busy signal when you dial
a valid phone number is equal to the fraction of phones you think are in use
or off-the-hook
when you make your tentative call.
This fraction varies from town to town and with the time of day.
In Cambridge, during the day, I would guess that about 1\% of phones
are in use. At 4am,
% four in the morning,
maybe 0.1\%, or fewer.
The probability $P(D \given \H_1)$ is the product of $\alpha$ and $\beta$,
that is, about $0.1 \times 0.01 = 10^{-3}$. According to
our estimates, there's about a one-in-a-thousand
chance of getting a busy signal when you dial a random number;
or one-in-a-hundred, if valid numbers are strongly clustered;
or one-in-$10^4$, if you dial in the wee hours.
How do the data affect your beliefs about your phone number?
The posterior probability ratio is the likelihood ratio
times the prior probability ratio:
\beq
\frac{ P(\H_0 \given D) }{ P(\H_1 \given D) }
= \frac{ P(D \given \H_0) }{ P(D \given \H_1) }
\frac{ P(\H_0) }{ P(\H_1) } .
\eeq
The likelihood ratio is about 100-to-1 or 1000-to-1, so the posterior
probability ratio is swung by a factor of 100 or 1000 in favour of $\H_0$.
If the prior probability of $\H_0$ was 0.5 then the posterior
probability is
\beq
P(\H_0 \given D) = \frac{1}{1 + \smallfrac{ P(\H_1 \given D) }{ P(\H_0 \given D) } }
\simeq 0.99 \: \mbox{or} \: 0.999 .
\eeq
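 As a check on these two numbers, with the prior probability
 $P(\H_0) = 0.5$ assumed above (so the prior ratio is 1),
 $P(D \given \H_0) = 1$ and $P(D \given \H_1) = \alpha \beta$ give
\beq
 P(\H_0 \given D) = \frac{1}{1 + \alpha \beta } ,
\eeq
 which is $1/1.01 \simeq 0.990$ for $\alpha\beta = 10^{-2}$
 and $1/1.001 \simeq 0.999$ for $\alpha\beta = 10^{-3}$.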
}
\soln{ex.eurotoss}{
% see also
% http://www.dartmouth.edu/~chance/chance_news/recent_news/chance_news_11.02.html
% for lots of practical info on coin biases.
%%%%%%%%%%%%%%%%%%%%%%%%%%% included by _s8.tex
% First, could confirm his sampling theory
%Sampling theory: number of heads $\sim 125 \pm 8$
%$ \sqrt{62.5}$
%so two-tail probability is
% pr 2*(1-myerf(14.5/7.9)) ans = 0.066440
% if the data were 141 out of 250 then we get
% 2*(1-myerf(15.5/7.9)) ans = 0.049760
\index{euro}We compare the models $\H_0$ -- the coin is fair --
and $\H_1$ -- the \ind{coin} is biased, with
the prior on its bias set to the uniform
distribution $P(p|\H_1)=1$.
% ent, as defined in this chapter.
\amarginfig{t}{
\begin{center}
\mbox{\psfig{figure=gnu/euro.ps,width=1.62in,angle=-90}}
\end{center}
\caption[a]{The probability distribution of the
number of heads given the two hypotheses, that
the coin is fair, and that it is biased, with
the prior distribution of the bias being uniform.
The outcome ($D = 140$ heads) gives weak evidence
in favour of $\H_0$, the hypothesis that the coin is fair.}
\label{fig.euro}
}
[The use of a uniform prior seems reasonable to me, since I know
that some coins, such as American pennies,
have severe biases when spun on edge; so the situations $p=0.01$ or $p=0.1$
or $p=0.95$ would not surprise me.]
\begin{aside}
When I mention $\H_0$ -- the coin is fair -- a pedant would say, `how
absurd to even consider that the coin is fair -- any coin is surely
biased to some extent'. And of course I would agree. So will pedants
kindly understand $\H_0$ as meaning `the coin is fair to within
one part in a thousand, \ie, $p \in 0.5\pm 0.001$'.
\end{aside}
The likelihood ratio is:
% given in \eqref{eq.compare.final}.
\beq
% Bayesian approach: Model comparison:
\frac{ P( D|\H_1 )}
{P( D|\H_0 )}
= \frac{ \smallfrac{ 140! 110! }{ 251! } }{ 1/2^{250} } = 0.48 .
\eeq
Thus the data give scarcely any evidence
either way; in fact they
give weak evidence (two to one) in favour of $\H_0$!
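 The factorials in this likelihood ratio come from a standard Beta integral.
 Treating $D$ as the observed sequence of 140 heads and 110 tails
 (if $D$ is instead taken to be the number of heads, a binomial coefficient
 multiplies both likelihoods and cancels in the ratio), we have
\beq
 P( D|\H_1 ) = \int_0^1 \! p^{140} (1-p)^{110} \, {\rm d}p
 = \frac{ 140! \, 110! }{ 251! } ,
 \:\:\:\:
 P( D|\H_0 ) = \frac{1}{2^{250}} .
\eeq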
% load 'gnu/euro.gnu'
`No, no', objects the believer in bias, `your silly uniform
prior doesn't represent {\em my\/} prior beliefs about
the bias of biased coins -- I was {\em expecting\/} only a small bias'.
To be as generous as possible to $\H_1$,
let's see how well it could fare
if the prior were presciently set.
Let us allow a prior of the form
\beq
P(p|\H_1,\a) = \frac{1}{Z(\a)} p^{\a-1}(1-p)^{\a-1},
\:\:\:\: \mbox{where $Z(\a)=\Gamma(\alpha)^2/\Gamma(2 \alpha)$}
\eeq
(a Beta
% Dirichlet (or Beta)
distribution, with the original uniform prior reproduced
by setting $\a=1$). By tweaking $\alpha$,
the likelihood ratio for $\H_1$ over $\H_0$,
\beq
\frac{ P( D|\H_1,\a )}
{P( D|\H_0 )} =
\frac{\Gamma(140 \!+\! \alpha) \, \Gamma(110 \!+\! \alpha) \, \Gamma(2 \alpha) 2^{250}}
{ \Gamma(250 \!+\! 2 \alpha) \, \Gamma(\alpha)^2 },
\eeq
can
be increased a little. It
is shown for several values of $\a$ in \figref{fig.eurot}.%
%
% fig.eurot WAS here but has been moved away to avoid a crunch
% This figure belongs earlier.
\amarginfig{t}{
{\footnotesize
\begin{tabular}{r@{}l@{$\:\:\:$}r@{\hspace*{0.3in}}r@{}l}
\toprule
\multicolumn{2}{c}{$\alpha$}&
\multicolumn{3}{c}{$\displaystyle \frac{ P( D|\H_1,\a )}
{P( D|\H_0 )}$}\\
\midrule
&.37 & & &.25\\
1&.0 & & &.48\\
2&.7 & & &.82\\
7&.4 & &1&.3\\
20& & &1&.8\\
55& & &1&.9\\
148& & &1&.7\\
403& & &1&.3\\
1096& & &1&.1\\
% from euro.dat
\bottomrule
\end{tabular}
}
\caption[a]{Likelihood ratio for various choices of
the prior distribution's hyperparameter $\alpha$.
}
\label{fig.eurot}
}
%
Even the most favourable choice of $\alpha$ ($\a \simeq 50$)
can
yield a likelihood ratio of only two to one in favour of
$\H_1$.
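 The Gamma functions in the ratio above arise from the same kind of integral:
 the Beta prior multiplies the likelihood by $p^{\a-1}(1-p)^{\a-1}/Z(\a)$, so
 (again treating $D$ as the sequence of outcomes)
\beq
 P( D|\H_1,\a ) = \frac{1}{Z(\a)} \int_0^1 \! p^{140+\a-1} (1-p)^{110+\a-1} \, {\rm d}p
 = \frac{ \Gamma(2 \alpha) }{ \Gamma(\alpha)^2 } \,
 \frac{ \Gamma(140 \!+\! \alpha) \, \Gamma(110 \!+\! \alpha) }{ \Gamma(250 \!+\! 2 \alpha) } .
\eeq
 Setting $\a = 1$ recovers $140! \, 110! / 251!$, since $\Gamma(n+1) = n!$.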
In conclusion, the data are not `very suspicious'. They
can be construed as giving at most two-to-one evidence
in favour of one or other of the two hypotheses.
\begin{aside}
Are these wimpy likelihood ratios the fault
of over-restrictive
priors? Is there any way of producing
a `very suspicious' conclusion?
The prior that is best-matched to the data,
in terms of likelihood,
% and one that surely has to be viewed as unreasonable,
is the prior that sets $p$ to $f \equiv 140/250$ with probability
one. Let's call this model $\H_*$.
% , since it is a parameterless model like $\H_0$.
The likelihood ratio is $P(D|\H_*)/P(D|\H_0) = 2^{250} f^{140} (1-f)^{110}
=6.1$. So the strongest evidence that these data can possibly
muster against the hypothesis that there is no bias is six-to-one.
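 This figure can be checked by hand, working in bits:
\beq
 \log_2 \frac{ P(D|\H_*) }{ P(D|\H_0) }
 = 250 + 140 \log_2 0.56 + 110 \log_2 0.44
 \simeq 250 - 117.1 - 130.3 = 2.6 ,
\eeq
 and $2^{2.6} \simeq 6.1$.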
\end{aside}
% b.blight@lse.ac.uk
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% alternate answers for the case of 141 heads where
% the P value is 0.05 (0.04976)
%
%The outcomes of the computations for this case (141 from 250)
% are
% alpha , likelihood ratio
%
%.3678794412, .3166098681
%1., .6110726692
%2.718281828, 1.049115229
%7.389056099, 1.627382387
%20.08553692, 2.181864309
%54.59815003, 2.303276774
%148.4131591, 1.882663014
%403.4287935, 1.419011740
%1096.633158, 1.168433218
%2980.957987, 1.063851106
%8103.083928, 1.023737702
%22026.46579, 1.008765749
%
% and H_BF achieves 7.796
While we are noticing the absurdly misleading\index{sermon!sampling theory}\index{p-value}
answers that `sampling theory' statistics produces,
such as the \index{p-value}$p$-value of 7\% in the exercise we just solved,
let's stick the boot in.\label{sec.sampling5percent}
If we make a tiny change to the data set, increasing the
number of heads in 250 tosses from 140 to 141,
we find that the $p$-value goes below the mystical value of 0.05
(the $p$-value is 0.0497).
The sampling theory statistician would happily squeak `the probability
of getting a result as extreme as 141 heads is smaller than 0.05 --
we thus reject the null hypothesis at a significance level of 5\%'.
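 These $p$-values can be roughly reproduced with a normal approximation
 (an approximation assumed here for illustration; exact binomial tail sums
 differ slightly).
 Under $\H_0$ the number of heads in 250 tosses has mean 125 and
 standard deviation $\sqrt{250 \times \smallfrac{1}{4}} \simeq 7.9$, so, with a
 continuity correction of one half,
\beq
 z_{140} = \frac{ 139.5 - 125 }{ 7.9 } \simeq 1.83 ,
 \:\:\:\:
 z_{141} = \frac{ 140.5 - 125 }{ 7.9 } \simeq 1.96 ,
\eeq
 and the two-tailed probabilities $2 ( 1 - \Phi(z) )$, where $\Phi$ denotes
 the standard normal cumulative distribution, are about 0.07 and 0.05 respectively.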
The correct answer
is shown for several values of $\a$ in \figref{fig.eurot141}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% alternate answers for the case of 141 heads where
% the P value is 0.05 (0.04976)
% Radford: Using R, I get that the true p-value (with genuine binomial
%probabilities) for 141 out of 250 is 0.04970679, close to your value.
%5
The values worth highlighting from this table are, first,
the likelihood ratio when $\H_1$ uses the standard uniform prior,
which is 1:0.61 in favour of the {\em null hypothesis\/} $\H_0$.
Second, the most favourable choice of $\a$, from the
point of view of $\H_1$, can only
yield a likelihood ratio of about 2.3:1 in favour of
$\H_1$.\label{sec.pvalue05}
\amarginfig{c}{
{\footnotesize
\begin{tabular}{r@{}l@{$\:\:\:$}r@{\hspace*{0.3in}}r@{}l}
\toprule
\multicolumn{2}{c}{$\alpha$}&
\multicolumn{3}{c}{$\displaystyle \frac{ P( D'|\H_1,\a )}
{P( D'|\H_0 )}$ }\\
\midrule
&.37 & & &.32\\
1&.0 & & &.61\\
2&.7 & &1&.0\\
7&.4 & &1&.6\\
20& & &2&.2\\
55& & &2&.3\\
148& & &1&.9\\
403& & &1&.4\\
1096& & &1&.2\\
% from euro.dat
\bottomrule
\end{tabular}
}
\caption[a]{Likelihood ratio for various choices of
the prior distribution's \ind{hyperparameter} $\alpha$, when the data are
$D'=141$ heads in 250 trials.
}
\label{fig.eurot141}
}
%
Be warned! A $p$-value of 0.05 is often interpreted
% gives the impression to many
as implying
that the odds are stacked about twenty-to-one
{\em against\/} the null hypothesis. But the truth in this case
is that the evidence
either slightly {\em favours\/} the null hypothesis,
or disfavours it by at most 2.3 to one, depending on
the choice of prior.
% $p$-values
The $p$-values and `\ind{significance level}s' of
\ind{classical statistics}\index{sermon!classical statistics}
should be treated with {\em extreme caution}.\index{caution!sampling theory}
% This is the last we will see of them in this book.
Shun them!
Here ends the sermon.\index{sermon!sampling theory}
% Classical statistics and Microsoft Windows 95 --
% two of the greatest evils to come out of the twentieth century.
}
\dvipsb{solutions bayes}
% \input{tex/_l1b.tex}
%
% message passing was here
%
\renewcommand{\partfigure}{\poincare{8.0}}
\part{Data Compression}
\prechapter{About Chapter}
\fakesection{prerequisites for chapter 2}
%
In this chapter we
discuss how to measure the information content of the outcome
of a random experiment.
This chapter has some tough bits.
If you find the mathematical details hard,
% to follow,
skim through them and keep going -- you'll be able to enjoy Chapters
\ref{ch3} and \ref{ch4} without this chapter's tools.
% of typicality.
\amarginfignocaption{t}{%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Cast of characters}
\footnotesize
\begin{tabular}{@{}lp{1.14in}}
\multicolumn{2}{c}{
{\sf Notation}
}\\
\midrule
$x \in \A$ & $x$ is a {\dem{member}\/} of the \ind{set} $\A$ \\
$\S \subset \A$ & $\S$ is a {\dem\ind{subset}\/} of the set $\A$ \\
$\S \subseteq \A$ & $\S$ is a {\ind{subset}} of, or equal to, the set $\A$ \\
% \union
$\V = \B \cup \A$
& $\V$ is the {\dem\ind{union}\/} of the sets $\B$ and $\A$ \\
$\V = \B \cap \A$
& $\V$ is the {\dem\ind{intersection}\/} of the sets $\B$ and $\A$ \\
$|\A|$ & number of elements in set $\A$\\
\bottomrule
\end{tabular} \medskip
% end marginstuff
}%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Before reading \chref{ch2}, you should have
read
% section \ref{ch1.secprob}
\chref{ch1.secprob}
and
worked on
% \exerciseref{ex.expectn}.
% It will also help if you have worked on
%
% do I need to ensure that {ex.Hadditive} occurs earlier?
%
\exerciseonlyrange{ex.expectn}{ex.Hineq} and \ref{ex.sumdice}
% \exerciseonlyrangeshort{ex.sumdice}{ex.RNGaussian}
\pagerange{ex.sumdice}{ex.invP},
% {ex.RNGaussian}.
% exercises \exnine-\exfourteen\ and \extwentyfive-\extwentyseven.
and \exerciseonlyref{ex.weigh} below.
The following
exercise is intended to
help you think about how to measure information content.
% Please work on this exercise now.
% weighing
% ITPRNN Problem 1
%
% weighing problem
%
\fakesection{the weighing problem}
\exercissxA{2}{ex.weigh}{
-- {\em Please work on this problem before reading \chref{ch.two}.}
\index{weighing problem}You are given 12 balls, all equal in weight except for
one that is either heavier or lighter. You are also given a two-pan
\ind{balance} to use.
% , which you are to use as few times as possible.
In each use of the balance you may put {any\/} number of the 12
balls on the left pan, and the same number on the right pan, and push
a button to initiate the weighing; there are three possible outcomes:
either the weights are equal, or the balls on the left are heavier,
or the balls on the left are lighter. Your task is to design a
strategy to determine which is the odd ball {\em and\/} whether it is
heavier or lighter than the others {\em in as few uses of the balance
as possible}.
% There will be a prize for the best answer.
While thinking about this problem,
you
% should
may find it helpful to
consider the following questions:
\ben
\item How can one measure {\dem\ind{information}}?
\item When you have identified the odd ball and whether it is heavy or
light, how much information have you gained?
\item Once you have designed a strategy, draw a tree showing,
for each of the possible outcomes
of a weighing, what weighing you perform next.
At each node in the tree, how much information have the outcomes
so far given you, and how much information remains to be
gained?
% What is the probability of each of the possible outcomes of the first
% weighing?
%\item
% What is the most information you can get from a single weighing?
% How much information do you get from a single weighing
% if the three outcomes are equally probable?
%\item What is the smallest number of weighings that might conceivably
%be sufficient always to identify the odd ball and whether it is heavy
%or light?
\item How much information is gained when you learn (i) the state of a
flipped coin; (ii) the states of two flipped coins;
(iii) the outcome when a four-sided die is rolled?
\item
How much information is gained on the first step of the weighing
problem if 6 balls are weighed against the other 6? How much is gained
if 4 are weighed against 4 on the first step, leaving out 4 balls?
% the other 4 aside?
\een
}
%
% How many possible outcomes of an e weighing process are there? To put it another way, imagine that you report the outcome by sending a postcard which says, for example, "ball number 5 is heavy", how many prepare a postcard
%
% how many outcomes are there?
% How many possible states of the world are y
% if you tell someone ball number x is heavy, how much info have you given
% them? how much information can be conveyed by $k$ uses of the balance?
%
%
% make clear that you can put any objects on the scales,
% don't have to weigh 6 vs 6.
% no cheating by gradually adding weights
%
% katriona's problem: 4 bits, randomly rotated every time you ask them
% to be flipped.
%
% hhhh llll gggg
% hhll lhgg lh
% if left is h then
% hh or l
% so do h vs h
%
% else gggg gggg ????
% -> ?? ?g
% -> hh l or ggg -> wegh last dude (1 bit)
% do h vs h
%
% if 13 and good avail, - hhhhh llll* gggg
% hhll lhgg hhl
%
\mysetcounter{page}{76}
\ENDprechapter
\chapter{The Source Coding Theorem}
\label{ch.two}\label{ch2}\label{chtwo}
% _l2.tex
% \part{Data Compression}
% \chapter{The Source Coding Theorem}
%
% I introduce the idea of a "name" (or label?) here, and should clarify
% (example 2.1)
%
% E = 13%, Q,Z = 0.1%
% TH = 3.7%
%
% New plan for this chapter:
% \section{Key concept}
% Rather than $H(\bp)$ being the measure of information content of
% an ensemble,
% I want the central idea of this chapter to be that
% $\log 1/P(\bx)$ is the information content of a particular
% outcome $\bx$. $H$ is then of interest because it is the average
% information content.
%
% An example to illustrate this is `hunt the professor'. Or crack
% the combination. Guess the PIN.
% An absent-minded professor wishes to remember an
% integer between 1 and 256, that is, eight bits of information.
% He takes 256 large numbered cardboard boxes, and climbs
% in the box whose number is the integer to be remembered.
% The only way to find him
% is to open the lid of a box. A single experiment involves
% opening a particular box. The outcome is either $x={\tt n}$ -- no
% professor -- or $x={\tt y}$ -- the professor is in there.
% The probabilities are
% \beq
% P(x\eq {\tt n}) = 255/256; P(x\eq {\tt y}) = 1/256.
% \eeq
% We open box $n$.
% If the professor is revealed, we have learned the integer,
% and thus recovered 8 bits of information. If he is not revealed,
% we have learned very little -- simply that the
% integer is not $n$. The information contents are:
% \beq
% h(x\eq 0) = \log_2( 256/255) = 0.0056 ; h(x\eq 1) = \log_2 256 = 8 .
% \eeq
% The average information content is
% \beq
% H(X) = 0.037 \bits .
% \eeq
% This example shows that in the event of an improbable outcome's occuring,
% a large amount of information really is conveyed.
%
% \section{Weighing problem}
% The weighing problem remains useful, let's keep it.
%
% \section{Source coding theorem}
% Relate `information content' $\log 1/P$ to message length
% in two steps. First, establish the AEP, that
% the outcome from an ensemble $X^N$
% is very likely to lie in a typical set having `information
% content' close to NH.
%
% Second, show that we can count the number of elements in the
% typical set, give them all names, and the number of
% names will be about $2^{NH}$.
%
% At what point should $H_{\delta}$ be introduced?
\section{How to measure the information content of a random variable?}
In the next few chapters, we'll be talking about probability
distributions and random variables. Most of the time
we can get by with sloppy notation, but occasionally, we will need
precise notation. Here is the
%definition and
notation that we established in \chapterref{ch.prob.ent}.\indexs{ensemble}
%
\sloppy
\begin{description}
\item[An ensemble] $X$ is a triple $(x,\A_X, \P_X)$,
where the {\dem outcome\/} $x$ is the value of a random variable,
% whose value $x$ can take on a
which takes on one of a
set of possible values,
% the alphabet
% {\em outcomes},
$\A_X = \{a_1,a_2,\ldots,a_i,\ldots, a_I\}$,
% \ie, possible values for a random variable $x$
% and a probability distribution over them,
having probabilities
$\P_X = \{p_1,p_2,\ldots, p_I\}$, with $P(x\eq a_i) = p_i$,
$p_i \geq 0$ and $\sum_{a_i \in \A_X} P(x \eq a_i) = 1$.
\end{description}
%\begin{description}
%\item[An ensemble] $X$ is a random variable $x$ taking on a value
% from a set of possible {\em outcomes},
% $$\A_X \eq \{a_1,\ldots,a_I\},$$
% having probabilities
% $$\P_X = \{p_1,\ldots, p_I\},$$ with $P(x\eq a_i) = p_i$,
% $p_i \geq 0$ and $\sum_{x \in \A_X} P(x) = 1$.
%\end{description}
% An ensemble is a set of possible values for a random variable
% and a probability distribution over them.
{How can we measure the information content of an outcome
$x = a_i$ from such an ensemble?}
In this chapter we examine the assertions
\ben
\item
that the
% It is claimed that the
{\dem{\ind{Shannon information content}}},\index{information content!Shannon}\index{information content!how to measure}
\beq
h(x\eq a_i) \equiv \log_2 \frac{1}{p_i},
\eeq
is a sensible measure of the information content of the outcome
$x = a_i$, and
\item
that
the {\dem{\ind{entropy}}} of the ensemble,
\beq
H(X) = \sum_i p_i \log_2 \frac{1}{p_i},
\eeq
is a sensible measure of the ensemble's average information content.
\een
\begin{figure}[htbp]
\figuremargin{%1
{\small%
\begin{center}
\mbox{
\mbox{
\hspace{-9mm}
\mbox{\psfig{figure=figs/h.ps,%
width=42mm,angle=-90}}$p$
\hspace{-35mm}
\makebox[0in][l]{\raisebox{\hpheight}{$h(p)= \log_2 \displaystyle \frac{1}{p}$ }}
\hspace{35mm}
}
\hspace{0.9mm}
\begin{tabular}[b]{ccc}\toprule
$p$ & $h(p)$ & $H_2(p)$ \\ \midrule
0.001 & 10.0 & 0.011 \\ % 9.96578 & 0.0114078
0.01\phantom{0} & \phantom{1}6.6 & 0.081 \\
0.1\phantom{01} & \phantom{1}3.3 & 0.47\phantom{1} \\
0.2\phantom{01} & \phantom{1}2.3 & 0.72\phantom{1} \\
0.5\phantom{01} & \phantom{1}1.0 & 1.0\phantom{01} \\ \bottomrule
\end{tabular}
\mbox{
% to put H at left: \hspace{1.2mm}
\hspace{6.2mm}
\raisebox{\hpheight}{$H_2(p)$}
% to put H at left: \hspace{-7.5mm}
\hspace{-20mm}
\mbox{\psfig{figure=figs/H2.ps,%
width=42mm,angle=-90}}$p$
}
% see also H2x.tex
\end{center}
}% end small
}{%
\caption[a]{The \ind{Shannon information content} $h(p) = \log_2 \frac{1}{p}$ and
the binary entropy function $H_2(p)=H(p,1\!-\!p)=p \log_2 \frac{1}{p}
 + (1-p)\log_2 \frac{1}{(1-p)}$ as functions of $p$.}
\label{fig.h2}
}%
\end{figure}
% gnuplot
% load 'figs/l2.gnu'
\noindent
\Figref{fig.h2} shows the Shannon information content
of an outcome with probability $p$, as a function of $p$.
The less probable an outcome is, the greater its
Shannon information content.
\Figref{fig.h2} also shows
% $h(p) = \log_2 \frac{1}{p}$,
the binary entropy function,
\beq
H_2(p)=H(p,1\!-\!p)=p \log_2 \frac{1}{p}
+ (1-p)\log_2 \frac{1}{(1-p)} ,
\eeq
which is the entropy of the ensemble $X$ whose alphabet and probability
distribution are
$\A_X = \{ a , b \}, \P_X = \{ p , (1-p) \}$.
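 For example, the entry for $p = 0.1$ in the table of \figref{fig.h2}
 can be reproduced as
\beq
 H_2(0.1) = 0.1 \log_2 \frac{1}{0.1} + 0.9 \log_2 \frac{1}{0.9}
 \simeq 0.1 \times 3.32 + 0.9 \times 0.152 \simeq 0.47 \: \mbox{bits} .
\eeq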
%
\subsection{Information content of independent random variables}
Why should $\log 1/p_i$ have anything to do with the
information content? Why not some other function of $p_i$?
We'll explore this question in detail shortly,
but first, notice a nice property of this particular function
$h(x)=\log 1/p(x)$.
Imagine learning the value of two {\em independent\/} random
variables, $x$ and $y$.
The definition of independence is that the probability
distribution is separable into a {\em product}:
\beq
P(x,y) = P(x) P(y) .
\eeq
Intuitively, we might want any measure of
the `amount of information gained' to have the property of
{\em additivity} --
that is,
for independent random variables $x$ and $y$,
the information gained when we learn $x$ and $y$ should
equal the sum of the information gained if $x$ alone were learned
and the information gained if $y$ alone were learned.
The Shannon information content of the outcome $x,y$ is
\beq
h(x,y) = \log \frac{1}{P(x,y)}
= \log \frac{1}{P(x)P(y)}
= \log \frac{1}{P(x)}
+ \log \frac{1}{P(y)}
\eeq
so it does indeed satisfy
\beq
h(x,y) = h(x) + h(y), \:\:\mbox{if $x$ and $y$ are independent.}
\eeq
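 As a small illustration (the numbers are chosen here purely for concreteness),
 suppose the independent outcomes $x$ and $y$ have probabilities
 $P(x) = 1/4$ and $P(y) = 1/8$, so that $P(x,y) = 1/32$. Then, with logarithms to base 2,
\beq
 h(x,y) = \log_2 32 = 5 \: \mbox{bits} = \log_2 4 + \log_2 8 = h(x) + h(y) .
\eeq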
\exercissxA{1}{ex.Hadditive}{
Show that, if $x$ and $y$ are independent,
the entropy of the outcome $x,y$
satisfies
\beq
H(X,Y) = H(X) + H(Y) .
\eeq
In words, entropy is additive for independent variables.
}
We now explore these ideas with some examples;
then, in section \ref{sec.aep} and in Chapters \ref{ch3}
and \ref{ch4}, we prove that
the Shannon information content and the entropy are
related to the number of bits needed to describe
the outcome of an experiment.
% \section{Thinking about information content}
% \subsection{Ensembles with maximum average information content}
% The first property of the entropy that we will
% consider is the property that you proved when you solved
% \exerciseref{ex.Hineq}: the entropy of an ensemble
% $X$ is biggest if all the outcomes
% have equal probability $p_i \eq 1/|X|$.
%
% If entropy measures the average information content
% of an ensemble, then this idea of equiprobable outcomes
% should have relevance for the design of efficient experiments.
\subsection{The weighing problem: designing informative experiments}
Have you solved the \ind{weighing problem}\index{puzzle!weighing 12 balls}
\exercisebref{ex.weigh}\
yet? Are you sure? Notice that in three uses of the balance --
which reads either `left heavier', `right heavier', or `balanced' --
the number
of conceivable outcomes is $3^3=27$, whereas the number of possible
states of the world is 24: the odd ball could be any of twelve balls,
and it could be heavy or light. So in principle, the problem might be
solvable in three weighings -- but not in two, since $3^2 < 24$.
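 The same count can be phrased in bits, taking the 24 states of the world to be
 equally probable (an assumption made here for illustration).
 Identifying the state then requires
\beq
 \log_2 24 \simeq 4.58 \: \mbox{bits} ,
\eeq
 whereas the outcomes of three weighings can distinguish at most 27 cases,
 that is, $\log_2 27 = 3 \log_2 3 \simeq 4.75$ bits, and the outcomes of
 two weighings at most $\log_2 9 \simeq 3.17$ bits.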
If you know how you
{can} determine the odd weight {\em and\/} whether it is heavy or
light in {\em three\/} weighings, then you may read on.
If you haven't found a strategy that always gets there in three weighings,
I encourage you to think about \exerciseonlyref{ex.weigh} some more.
% {ex.weigh}
% \subsection{Information from experiments}
Why is your strategy optimal? What is it about your series of weighings
that allows useful information to be gained as quickly as possible?
\begin{figure}%[htbp]
\fullwidthfigureright{%
% included by l2.tex
%
% shows weighing trees, ternary
%
% decisions of what to weigh are shown in square boxes with 126 over 345 (l:r)
% state of valid hypotheses are listed in double boxes
% three arrows, up means left heavy, straight means right heavy, down is balance
% actually s and d boxes end up having the same defn.
%
\setlength{\unitlength}{0.56mm}% page width is 160mm % was 6mm
\begin{center}
\small
\begin{picture}(260,260)(-50,-130)
%
% initial state
%
% all 24 hypotheses
\mydbox{-50,-100}{15,200}{$1^+$\\$2^+$\\$3^+$\\$4^+$\\$5^+$\\$6^+$\\$7^+$\\
$8^+$\\$9^+$\\$10^+$\\$11^+$\\$12^+$\\$1^-$\\$2^-$\\$3^-$\\$4^-$\\
$5^-$\\$6^-$\\$7^-$\\$8^-$\\$9^-$\\$10^-$\\$11^-$\\$12^-$}
\mysbox{-30,-8}{25,16}{$\displaystyle\frac{1\,2\,3\,4}{5\,6\,7\,8}$}
\put(-30,10){\makebox(25,8){weigh}}
%
% 1st arrows
%
\mythreevector{0,0}{1}{3}{30}
%
% first three boxes of hypotheses % boxes of actions
% #1 is bottom left corner, so has to be offset by height of box
% #2 is dimensions of box
%
% each digit is about 10 high
%
\mydbox{40,55}{15,70}{$1^+$\\$2^+$\\$3^+$\\$4^+$\\$5^-$\\$6^-$\\$7^-$\\$8^-$}
\mysbox{65,82}{25,16}{$\displaystyle\frac{1\,2\,6}{3\,4\,5}$}
\put(65,100){\makebox(25,8){weigh}}
\mydbox{40,-35}{15,70}{$1^-$\\$2^-$\\$3^-$\\$4^-$\\$5^+$\\$6^+$\\$7^+$\\$8^+$}
\mysbox{65,-8}{25,16}{$\displaystyle\frac{1\,2\,6}{3\,4\,5}$}
\put(65,10){\makebox(25,8){weigh}}
\mydbox{40,-125}{15,70}{$9^+$\\$10^+$\\$11^+$\\$12^+$\\$9^-$\\$10^-$\\$11^-$\\$12^-$}
\mysbox{65,-98}{25,16}{$\displaystyle\frac{9\,10\,11}{1\,2\,3}$}
\put(65,-80){\makebox(25,8){weigh}}
%
% 2nd arrows
%
\mythreevector{95,90}{1}{2}{15}
\mythreevector{95,0}{1}{2}{15}
\mythreevector{95,-90}{1}{2}{15}
% nine intermediate states. top ones
\mydbox{115,113}{35,14}{$1^+2^+5^-$}
\mysbox{155,112}{25,16}{$\displaystyle\frac{1}{2}$}
\mydbox{115,83}{35,14}{$3^+4^+6^-$}
\mysbox{155,82}{25,16}{$\displaystyle\frac{3}{4}$}
\mydbox{115,53}{35,14}{$7^-8^-$}
\mysbox{155,52}{25,16}{$\displaystyle\frac{1}{7}$}
% nine intermediate states. mid ones
\mydbox{115,23}{35,14}{$6^+3^-4^-$}
\mysbox{155,22}{25,16}{$\displaystyle\frac{3}{4}$}
\mydbox{115,-7}{35,14}{$1^-2^-5^+$}
\mysbox{155,-8}{25,16}{$\displaystyle\frac{1}{2}$}
\mydbox{115,-37}{35,14}{$7^+8^+$}
\mysbox{155,-38}{25,16}{$\displaystyle\frac{7}{1}$}
% nine intermediate states. bot ones
\mydbox{115,-67}{35,14}{$9^+10^+11^+$}
\mysbox{155,-68}{25,16}{$\displaystyle\frac{9}{10}$}
\mydbox{115,-97}{35,14}{$9^-10^-11^-$}
\mysbox{155,-98}{25,16}{$\displaystyle\frac{9}{10}$}
\mydbox{115,-127}{35,14}{$12^+12^-$}
\mysbox{155,-128}{25,16}{$\displaystyle\frac{12}{1}$}
% 3rd arrows mainline
\mythreevector{185,60}{1}{1}{10}
\mythreevector{185,0}{1}{1}{10}
\mythreevector{185,-60}{1}{1}{10}
% other branch lines
\mythreevector{185,120}{1}{1}{10}
\mythreevector{185,90}{1}{1}{10}
\mythreevector{185,30}{1}{1}{10}
\mythreevector{185,-30}{1}{1}{10}
\mythreevector{185,-90}{1}{1}{10}
\mythreevector{185,-120}{1}{1}{10}
% final answers aligned at 200,x*10
\mydbox{200,126}{10,8}{$1^+$}
\mydbox{200,116}{10,8}{$2^+$}
\mydbox{200,106}{10,8}{$5^-$}
\mydbox{200,96}{10,8}{$3^+$}
\mydbox{200,86}{10,8}{$4^+$}
\mydbox{200,76}{10,8}{$6^-$}
\mydbox{200,66}{10,8}{$7^-$}
\mydbox{200,56}{10,8}{$8^-$}
\mydbox{200,46}{10,8}{$\star$}% ---------- impossible outcome
\mydbox{200,36}{10,8}{$4^-$}
\mydbox{200,26}{10,8}{$3^-$}
\mydbox{200,16}{10,8}{$6^+$}
\mydbox{200,6}{10,8}{$2^-$}
\mydbox{200,-4}{10,8}{$1^-$}% the middle, 0
\mydbox{200,-14}{10,8}{$5^+$}
\mydbox{200,-24}{10,8}{$7^+$}
\mydbox{200,-34}{10,8}{$8^+$}
\mydbox{200,-44}{10,8}{$\star$}
\mydbox{200,-54}{10,8}{$9^+$}
\mydbox{200,-64}{10,8}{$10^+$}
\mydbox{200,-74}{10,8}{$11^+$}
\mydbox{200,-84}{10,8}{$10^-$}
\mydbox{200,-94}{10,8}{$9^-$}
\mydbox{200,-104}{10,8}{$11^-$}
\mydbox{200,-114}{10,8}{$12^+$}
\mydbox{200,-124}{10,8}{$12^-$}
\mydbox{200,-134}{10,8}{$\star$}
\end{picture}
\end{center}
}{%
\caption[a]{An optimal solution to the weighing problem.
%
At each step there are two boxes: the left box shows which hypotheses are still
possible; the right box shows the balls involved in the next weighing.
The 24 hypotheses are written $1^+,
% 2^+,\ldots,1^-,
\ldots, 12^-$,
with, \eg, $1^+$ denoting that 1 is the odd ball and
it is heavy.
Weighings are written by listing the names of the balls on the
two pans, separated by a line; for example, in the first weighing,
% $\displaystyle\frac{1\,2\,3\,4}{5\,6\,7\,8}$ denotes that
balls 1,
2, 3, and 4 are put on the left-hand side and 5, 6, 7, and 8 on the
right.
In each triplet of arrows the upper arrow leads to the situation when
the left side is heavier, the middle arrow to the situation when the right side is heavier,
and the lower arrow to the situation when the outcome is balanced.
The three points labelled $\star$
% arrows without subsequent boxes at the right-hand side
correspond to impossible outcomes.
%The total number of outcomes
% of the weighing process is 24, which equals $3^3 - 3$, so we would expect
% this ternary tree of depth three to have three spare branches.
}
\label{fig.weighing}\label{ex.weigh.sol}
}%
\end{figure}
The answer is that at each step of an optimal
procedure, the three outcomes (`left heavier', `right heavier', and `balance')
are {\em as close as possible to equiprobable}.
An optimal solution is shown in \figref{fig.weighing}.
Suboptimal strategies, such as weighing balls 1--6 against 7--12
on the first step, do not achieve all outcomes with equal probability:
these two sets of balls can never balance, so the only possible
outcomes are `left heavy' and `right heavy'.
% Similarly, strategies
% that after an unbalanced initial result
% do not mix together balls that might be heavy with balls that
% might be light are incapable of giving one of the three outcomes.
Such a binary outcome rules out only half of the possible
hypotheses, so a strategy that uses such outcomes must sometimes
take longer to find the right answer.
% Some suboptimal strategies produce binary trees rather than ternary trees like
% the one in \figref{fig.weighing}, and binary trees
% are necessarily deeper than balanced ternary trees
% with the same number of leaves.
The insight that the outcomes should be as near as possible
to equiprobable makes
it easier to search for an optimal strategy. The first weighing
must divide the 24 possible hypotheses into three groups of eight. Then
the second weighing must be chosen so that there is a 3:3:2
split of the hypotheses.
Thus we might conclude:
\begin{conclusionbox}
{the outcome of a random experiment is guaranteed to be most informative
if the probability distribution over outcomes is uniform.}
\end{conclusionbox}
This conclusion agrees with
the property of the entropy that you proved when you solved
\exerciseref{ex.Hineq}: the entropy of an ensemble
$X$ is biggest if all the outcomes
have equal probability $p_i \eq 1/|\A_X|$.
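As a rough numerical check of this conclusion on the weighing problem
(a sketch that assumes the 24 hypotheses are equiprobable and uses only the
counts given above), identifying the odd ball and its type requires
\beq
 \log_2 24 \simeq 4.58 \ubits ,
\eeq
whereas a weighing whose three outcomes are equiprobable yields
$\log_2 3 \simeq 1.58$ bits, so three weighings can supply at most
\beq
 3 \log_2 3 \simeq 4.75 \ubits .
\eeq
A first weighing of balls 1--6 against 7--12 yields only $\log_2 2 = 1$ bit;
it leaves 12 hypotheses to be separated by two further weighings, which
have at most $3^2 = 9$ distinct outcomes.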
% for anyone who wants to play it against a machine:
% http://y.20q.net:8095/btest
% http://www.smalltime.com/dictator.html
% http://www.guessmaster.com/
\subsection{Guessing games}
In the game of \ind{twenty questions},\index{game!twenty questions}
one player thinks of
an object, and the other player attempts to guess what the object is
by asking questions that have yes/no answers, for example,
`is it alive?', or `is it human?'
The aim is to identify the object with as few questions
as possible.
What is the best strategy for playing this game?
For simplicity, imagine that we are playing the rather dull
version of twenty questions called `sixty-three'.
% % two hundred and fifty five'.
% In this game, the permitted objects are the $2^6$ integers
% $\A_X = \{ 0 , 1 , 2 , \dots 63 \}$.
% One player selects an $x \in \A_X$, and we ask
% questions that have yes/no answers in order to identify $x$.
\exampl{example.sixtythree}{ {\sf The game `sixty-three'}.
What's the smallest number of yes/no questions needed\index{game!sixty-three}
to identify an integer $x$ between 0 and 63?\index{twenty questions}
}
Intuitively,
the best questions successively divide
the 64 possibilities into equal sized sets.
Six questions suffice.
One reasonable strategy asks the following questions:
%
% want a computer program environment here.
%
\begin{quote}
\begin{tabbing}
{\sf 1:} is $x \geq 32$? \\
{\sf 2:} is $x \mod 32 \geq 16$? \\
{\sf 3:} is $x \mod 16 \geq 8$? \\
{\sf 4:} is $x \mod 8 \geq 4$? \\
{\sf 5:} is $x \mod 4 \geq 2$? \\
{\sf 6:} is $x \mod 2 = 1$?
\end{tabbing}
\end{quote}
%
% I'd like to put this in a comment column on the right beside the 'code':
%
[The notation $x \mod 32$, pronounced `$x$ modulo 32', denotes the remainder
when $x$ is divided by 32; for example, $35 \mod 32 = 3$
and $32 \mod 32 = 0$.]
The answers to these questions, if translated
from $\{\mbox{yes},\mbox{no}\}$
to $\{{\tt{1}},{\tt{0}}\}$,
give the binary expansion of $x$, for example
$35 \Rightarrow {\tt{100011}}$.\ENDsolution\smallskip
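As a worked check of this strategy, the six answers for $x = 35$
can be traced one at a time:
\beq
\begin{array}{llcc}
 \mbox{\sf 1:} & 35 \geq 32 & \mbox{yes} & {\tt 1} \\
 \mbox{\sf 2:} & 35 \mod 32 = 3 < 16 & \mbox{no} & {\tt 0} \\
 \mbox{\sf 3:} & 35 \mod 16 = 3 < 8 & \mbox{no} & {\tt 0} \\
 \mbox{\sf 4:} & 35 \mod 8 = 3 < 4 & \mbox{no} & {\tt 0} \\
 \mbox{\sf 5:} & 35 \mod 4 = 3 \geq 2 & \mbox{yes} & {\tt 1} \\
 \mbox{\sf 6:} & 35 \mod 2 = 1 & \mbox{yes} & {\tt 1}
\end{array}
\eeq
Reading the final column from top to bottom recovers ${\tt 100011}$.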
What are the
Shannon information contents of the outcomes in this example?
If we assume that all values of $x$ are equally likely, then the
answers to the questions are independent and each has
% entropy $H_2(0.5) = 1 \ubit$. The
Shannon information content
% of each answer is
$\log_2 (1/0.5)
= 1 \ubit$; the total Shannon information gained
is always six bits. Furthermore, the number $x$ that we learn from
these questions is a six-bit binary number. Our questioning
strategy defines a way of encoding the random variable $x$
as a binary file.
So far, the Shannon information content makes sense:
it measures the length of a binary file that encodes
$x$.
%
However, we have not yet studied ensembles where the
outcomes have unequal probabilities. Does the
Shannon information content make sense there too?
\fakesection{Submarine figure}
%
\newcommand{\subgrid}{\multiput(0,0)(0,10){9}{\line(1,0){80}}\multiput(0,0)(10,0){9}{\line(0,1){80}}}
\newcommand{\sublabels}{
\put(-5,75){\makebox(0,0){\sf\tiny{A}}}
\put(-5,65){\makebox(0,0){\sf\tiny{B}}}
\put(-5,55){\makebox(0,0){\sf\tiny{C}}}
\put(-5,45){\makebox(0,0){\sf\tiny{D}}}
\put(-5,35){\makebox(0,0){\sf\tiny{E}}}
\put(-5,25){\makebox(0,0){\sf\tiny{F}}}
\put(-5,15){\makebox(0,0){\sf\tiny{G}}}
\put(-5, 5){\makebox(0,0){\sf\tiny{H}}}
%
\put(75,-5){\makebox(0,0){\tiny{8}}}
\put(65,-5){\makebox(0,0){\tiny{7}}}
\put(55,-5){\makebox(0,0){\tiny{6}}}
\put(45,-5){\makebox(0,0){\tiny{5}}}
\put(35,-5){\makebox(0,0){\tiny{4}}}
\put(25,-5){\makebox(0,0){\tiny{3}}}
\put(15,-5){\makebox(0,0){\tiny{2}}}
\put( 5,-5){\makebox(0,0){\tiny{1}}}
}
\newcommand{\misssixteen}{
\put(45,65){\makebox(0,0){$\times$}}
\put(45,45){\makebox(0,0){$\times$}}
\put(35,75){\makebox(0,0){$\times$}}
\put(35,65){\makebox(0,0){$\times$}}
\put(35,55){\makebox(0,0){$\times$}}
\put(35,45){\makebox(0,0){$\times$}}
\put(35,35){\makebox(0,0){$\times$}}
\put(35,25){\makebox(0,0){$\times$}}
\put(35,15){\makebox(0,0){$\times$}}
\put(35, 5){\makebox(0,0){$\times$}}
\put(25,75){\makebox(0,0){$\times$}}
\put(25,65){\makebox(0,0){$\times$}}
\put(25,55){\makebox(0,0){$\times$}}
\put(25,45){\makebox(0,0){$\times$}}
\put(25,35){\makebox(0,0){$\times$}}
\put(25,25){\makebox(0,0){$\times$}}
\put(25,15){\makebox(0,0){$\times$}}
}
\newcommand{\missthirtytwo}{
\put(75,75){\makebox(0,0){$\times$}}
\put(75,65){\makebox(0,0){$\times$}}
\put(75,55){\makebox(0,0){$\times$}}
\put(75,45){\makebox(0,0){$\times$}}
\put(75,35){\makebox(0,0){$\times$}}
\put(75,25){\makebox(0,0){$\times$}}
\put(75,15){\makebox(0,0){$\times$}}
\put(75, 5){\makebox(0,0){$\times$}}
\put(65,75){\makebox(0,0){$\times$}}
\put(65,65){\makebox(0,0){$\times$}}
\put(65,55){\makebox(0,0){$\times$}}
\put(65,45){\makebox(0,0){$\times$}}
\put(65,35){\makebox(0,0){$\times$}}
\put(65,25){\makebox(0,0){$\times$}}
\put(65,15){\makebox(0,0){$\times$}}
\put(65, 5){\makebox(0,0){$\times$}}
\put(55,75){\makebox(0,0){$\times$}}
\put(55,65){\makebox(0,0){$\times$}}
\put(55,55){\makebox(0,0){$\times$}}
\put(55,45){\makebox(0,0){$\times$}}
\put(55,35){\makebox(0,0){$\times$}}
\put(55,25){\makebox(0,0){$\times$}}
\put(55,15){\makebox(0,0){$\times$}}
\put(55, 5){\makebox(0,0){$\times$}}
\put(45,75){\makebox(0,0){$\times$}}
%%\put(45,65){\makebox(0,0){$\times$}}
\put(45,55){\makebox(0,0){$\times$}}
%% \put(45,45){\makebox(0,0){$\times$}}
\put(45,35){\makebox(0,0){$\times$}}
\put(45,25){\makebox(0,0){$\times$}}
\put(45,15){\makebox(0,0){$\times$}}
\put(45, 5){\makebox(0,0){$\times$}}
\put(5,65){\makebox(0,0){$\times$}}
}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%% submarine figure %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}
\figuredangle{%
\begin{center}
%\begin{tabular}{l@{\hspace{-1mm}}*{5}{@{\hspace{2pt}}c}} \toprule
\begin{tabular}{l@{\hspace{0mm}}*{5}{@{\hspace{8.5mm}}c}} \toprule
% moves made & 1 & 2 & 32 & 48 & 49 \\
&
%
% 1 miss
%
% this fig actually needs extra width on left, but there is nothing there.
\setlength{\unitlength}{0.26mm}
\begin{picture}(80,95)(0,-10)\subgrid\sublabels
\put(25,15){\makebox(0,0){$\times$}}
\put(25,15){\circle{15}}
\end{picture}
&
%
% 2 miss
%
\setlength{\unitlength}{0.26mm}
\begin{picture}(80,95)(0,-10)\subgrid
\put(25,15){\makebox(0,0){$\times$}}
\put(5,65){\makebox(0,0){$\times$}}
\put(5,65){\circle{15}}
\end{picture}
&
%
% 32 miss
%
\setlength{\unitlength}{0.26mm}
\begin{picture}(80,95)(0,-10)\subgrid
\put(25,15){\makebox(0,0){$\times$}}
\put(45,35){\circle{15}}
\missthirtytwo
\end{picture}
&
%
% 48 miss
%
\setlength{\unitlength}{0.26mm}
\begin{picture}(80,95)(0,-10)\subgrid
\put(25,15){\makebox(0,0){$\times$}}
\put(5,65){\makebox(0,0){$\times$}}
\missthirtytwo
\misssixteen
\put(25,25){\circle{15}}
\end{picture}
&
\setlength{\unitlength}{0.26mm}
\begin{picture}(80,95)(0,-10)\subgrid
\put(25,15){\makebox(0,0){$\times$}}
\put(5,65){\makebox(0,0){$\times$}}
\missthirtytwo
\misssixteen
%%%%%%%%%%%%%%%%%%%%%%% hit the submarine:
\put(25,5){\circle{15}}
\put(25,5){\makebox(0,0){\tiny\bf S}}
\end{picture}
\\
move \# & 1 & 2 & 32 & 48 & 49 \\
question
& G3
& B1
& E5
& F3
& H3 \\
outcome
& $x = {\tt n}$ % $(\times)$
& $x = {\tt n}$ %$(\times)$
& $x = {\tt n}$ %$(\times)$
& $x = {\tt n}$ %$(\times)$
& $x = {\tt y}$ %({\small\bf S})
\\[0.1in]
$P(x)$
& $\displaystyle\frac{63}{64}$
& $\displaystyle\frac{62}{63}$
& $\displaystyle\frac{32}{33}$
& $\displaystyle\frac{16}{17}$
& $\displaystyle\frac{1}{16}$
\\[0.15in]
$h(x)$
& 0.0227
& 0.0230
& 0.0443
% & 0.0430 -------- 0.9556 , just before 32 are pasted
& 0.0874
& 4.0
\\[0.05in]
Total info.
& 0.0227
& 0.0458
& 1.0
& 2.0
& 6.0
\\ \bottomrule
\end{tabular}
\end{center}
}{%
\caption[a]{A game of {\tt submarine}. The submarine is hit on the 49th attempt.}
\label{fig.sub}
}%
\end{figure}
\subsection{The game of {\ind{submarine}}: how many bits can one bit convey?}
In the game of {\ind{battleships}}, each player hides a fleet of
ships in a sea represented by a square grid. On each\index{game!submarine}
turn, one player
attempts to hit the other's ships by firing at one square
in the opponent's sea. The response to a selected square such
as `G3' is either `miss', `hit', or `hit and destroyed'.
In a
% rather
boring version of battleships called {\tt submarine},
each player hides just one submarine in one square of
an eight-by-eight grid.
\Figref{fig.sub} shows a few pictures of this game in progress:
the circle represents the square that is being fired at, and the
$\times$s show squares in which the outcome was a miss, $x={\tt{n}}$; the
submarine is hit (outcome $x={\tt{y}}$ shown by
the symbol $\bs$) on the 49th attempt.
Each shot made by a player defines an ensemble. The
two possible outcomes are $\{ {\tt{y}} ,{\tt{n}}\}$,
corresponding to a hit and a miss, and their probabilities
depend on the state of the board.
At the beginning, $P({\tt{y}}) = \linefrac{1}{64}$ and
$P({\tt{n}}) = \linefrac{63}{64}$.
At the second shot, if the first shot missed,
% enemy sub has not yet been hit,
$P({\tt{y}}) = \linefrac{1}{63}$ and $P({\tt{n}}) = \linefrac{62}{63}$.
At the third shot, if the first two shots missed,
% enemy submarine has not yet been hit,
$P({\tt{y}}) = \linefrac{1}{62}$ and $P({\tt{n}}) = \linefrac{61}{62}$.
% According to the Shannon information content, t
The Shannon information
gained from an outcome $x$ is $h(x) = \log_2 (1/P(x))$.
% Let's investigate this assertion.
If we are lucky, and hit the submarine on the first shot, then
\beq
h(x) = h_{(1)}({\tt y}) = \log_2 64 = 6 \ubits .
\eeq
Now, it might seem a little strange that
one binary outcome can convey six bits.
% , but it does make sense. W
But we have learnt the hiding place,
% where the submarine was,
which
could have been any of 64 squares; so we have, by one lucky
binary question, indeed learnt six bits.
What if the first shot misses? The Shannon information that we gain from this outcome
is
\beq
h(x) = h_{(1)}({\tt n}) = \log_2 \frac{64}{63} = 0.0227 \ubits .
\eeq
Does this make sense? It is not so obvious. Let's keep going.
If our second shot also misses, the Shannon information
content of the second outcome is
\beq
h_{(2)}({\tt n}) = \log_2 \frac{63}{62} = 0.0230 \ubits .
\eeq
If we miss thirty-two times (firing at a new square each time),
the total Shannon information gained is
\beqan
%\hspace*{-0.2in}
\lefteqn{ \log_2 \frac{64}{63} + \log_2 \frac{63}{62} + \cdots +
\log_2 \frac{33}{32} } \nonumber \\
 & \!\!\!=\!\!\! & 0.0227 + 0.0230 + \cdots + 0.0443 \:\:=\:\:
 1.0 \ubits .
\eeqan
Why this round number? Well, what have we learnt? We now know
that the submarine is not in any of the 32 squares we fired at;
learning that fact is just like playing a game of \sixtythree\
(\pref{example.sixtythree}),
asking as our first question `is $x$ one of the
thirty-two numbers corresponding to these squares I fired at?',
and receiving the answer `no'. This answer rules out half of the
hypotheses, so it gives us one bit.
%It doesn't matter what the
% outcome might have been; all that matters is the probability
% of what actually happened.
After 48 unsuccessful shots, the information
gained is 2 bits: the unknown location has been narrowed down to
one quarter of the original hypothesis space.
What if we hit the submarine on the 49th shot, when there
were 16 squares left?
The Shannon information content of this outcome is
\beq
h_{(49)}({\tt y}) = \log_2 16 = 4.0 \ubits .
\eeq
The total Shannon information content of all the outcomes is
\beqan
\lefteqn{ \log_2 \frac{64}{63} + \log_2 \frac{63}{62} + \cdots +
% \log_2 \frac{33}{32} + \cdots +
\log_2 \frac{17}{16} +
\log_2 \frac{16}{1} }
\nonumber \\
&=& 0.0227 + 0.0230 + \cdots
% + 0.0430 + \cdots
+ 0.0874 + 4.0 \:\: =\:\: 6.0 \ubits .
\label{eq.sum.me}
\eeqan
So once we know where the submarine is, the total Shannon information
content gained is 6 bits.
This result holds regardless of when
we hit the submarine. If we hit it when there are $n$ squares
left to choose from -- $n$ was 16 in
\eqref{eq.sum.me} -- then the total information gained
is:
\beqan
\lefteqn{ \log_2 \frac{64}{63} + \log_2 \frac{63}{62} + \cdots +
\log_2 \frac{n+1}{n} +
\log_2 \frac{n}{1} } \nonumber \\
&=& \log_2 \left[
\frac{64}{63} \times \frac{63}{62} \times \cdots
\times \frac{n+1}{n} \times \frac{n}{1} \right]
%\times 63 \times \cdots \times (n+1) \times n}
% {63 \times 62 \times \cdots \times n \times 1}
\:\:=\:\: \log_2 \frac{64}{1}\:\: =\:\: 6 \,\bits.
\eeqan
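Two extreme cases give a quick check of this identity
(the values of $n$ here are illustrative and are not taken from the figure).
If the very first shot hits, $n = 64$: there are no miss terms, and the
single outcome conveys $\log_2 (64/1) = 6$ bits.
If the submarine is found only when one square remains, $n = 1$:
the final `hit' is certain and conveys nothing, so
all the information has come from the misses:
\beq
 \underbrace{ \log_2 \frac{64}{63} + \log_2 \frac{63}{62} + \cdots +
 \log_2 \frac{2}{1} }_{\mbox{\scriptsize 63 misses}}
 + \log_2 \frac{1}{1}
 \:\:=\:\: \log_2 64 + 0 \:\:=\:\: 6 \ubits .
\eeq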
%
% add winglish here?
%
% follows in lecture 2, after submarine
%
% aim: introduce the language of Wenglish
% and demonstrate Shannon info content.
What have we learned from the examples so far?
I think the {\tt submarine} example makes quite a convincing
case for the claim that the Shannon information content
is a sensible measure of information content.
And the game of {\tt sixty-three} shows that
the Shannon information content can be intimately connected
to the size of a file that encodes the outcomes of
a random experiment, thus suggesting a possible connection to
data compression.
In case you're not convinced, let's look at one more example.
\subsection{The \Wenglish\ language}
\label{sec.wenglish}
% [this section under construction]}
{\dem{\ind{\Wenglish}}} is a language similar to \ind{English}.
\Wenglish\ sentences consist of words drawn at random from the
\Wenglish\ dictionary, which contains $2^{15}=32$,768 words, all of length 5
characters. Each word in the \Wenglish\ dictionary was constructed
% by the \Wenglish\ language committee, who created each of those $32\,768$ words
at random by picking five letters from the
probability distribution over {\tt a$\ldots$z} depicted
in \figref{fig.monogram}.
% Since all words are five characters long
%\begin{figure}
%\figuremargin{
\marginfig{\small
\begin{center}
\begin{tabular}{rc} \toprule
% & Word \\ \midrule
1 & {\tt{aaail}} \\
2 & {\tt{aaaiu}} \\
3 & {\tt{aaald}} \\
& $\vdots$ \\
129 & {\tt{abati}} \\
& $\vdots$ \\
$2047$ & {\tt{azpan}} \\
$2048$ & {\tt{aztdn}} \\
& $\vdots$ \\
& $\vdots$ \\
$16\,384$ & {\tt{odrcr}} \\
& $\vdots$ \\
& $\vdots$ \\
$32\,737$ & {\tt{zatnt}} \\
& $\vdots$ \\
$32\,768$ & {\tt{zxast}} \\ \bottomrule
\end{tabular}
\end{center}
%}{
\caption[a]{The \Wenglish\ dictionary.}
\label{fig.wenglish}
}
%\end{figure}
% 5366+1219+2602+2718+8377+1785+1280+3058+5903+70+800+3431+2319+5470+6526+1896+539+4660+5453+6767+3108+652+1388+765+1564+78
% 77794
Some entries from the dictionary are shown in
alphabetical order in \figref{fig.wenglish}.
Notice that the number of words in the \ind{dictionary}
(32,768)
is much smaller than the total number of possible words of length 5 letters,
$26^5 \simeq 12$,000,000.
Because the probability of the letter {{\tt{z}}} is about $1/1000$,
only 32 of the words in the dictionary begin with the letter {\tt z}.
In contrast, the probability of the letter {{\tt{a}}} is about $0.0625$,
and 2048 of the words begin with the letter {\tt a}. Of those 2048 words,
two start {\tt az}, and 128 start {\tt aa}.
Let's imagine that we are reading a \Wenglish\ document, and let's discuss
the Shannon \ind{information content} of the characters as we acquire them.
If we are given the text one word at a time, the Shannon information
content of each five-character word is $\log_2 \mbox{32,768} = 15$ bits,
since \Wenglish\ uses all its words with equal probability. The
average information content per character is therefore 3 bits.
Now let's look at the information content if we read the document
one character at a time.
If, say, the first letter of a word is {\tt a}, the Shannon information
content is
$\log_2 (1/0.0625) \simeq 4$ bits.
If the first letter is {\tt z}, the Shannon information content
is $\log_2 (1/0.001) \simeq 10$ bits.
The information content is thus highly variable
at the first character. The total information
content of the 5 characters in a word, however,
is exactly 15 bits; so the letters that
follow an initial {\tt{z}} have lower average information content
per character than the letters that follow an initial {\tt{a}}.
A rare initial letter such as {\tt{z}} indeed conveys
more information about what the word is
than a common initial letter.
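These statements can be checked against the dictionary counts quoted
earlier (a worked sketch that takes the rounded letter probabilities as exact).
The 32 words that begin with {\tt z} are equiprobable, as are the
2048 words that begin with {\tt a}, so the information content of the
remaining four characters is
\beq
 \log_2 32 = 5 \ubits \:\: \mbox{after {\tt z}},
 \qquad
 \log_2 2048 = 11 \ubits \:\: \mbox{after {\tt a}},
\eeq
that is, 1.25 and 2.75 bits per character respectively.
In both cases the total is 15 bits: $10 + 5$ and $4 + 11$.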
Similarly, in English, if rare characters occur at the start of
the word (\eg\ {\tt{xyl}\ldots}),
then often we can identify the whole word immediately; whereas
words that start with common characters (\eg\ {\tt{pro}\ldots}) require more characters
before we can identify them.
% Does this make sense? Well, in English,
% the first few characters of a word do very often fully identify the whole word.
%
% {\em MORE HERE........}
\section{Data compression}
\index{data compression}\index{source code}The
preceding examples justify the idea that the Shannon \ind{information
content} of an outcome is a natural measure of its
\ind{information content}. Improbable outcomes
do convey more information than probable outcomes.
We now discuss the information content
of a source by considering how many bits are needed to describe
the outcome of an experiment.
% , that is, by studying {data compression}.
If we can show that we can compress data from a particular source
into a file of $L$ bits per source symbol and recover the data reliably,
then we will say that the average information
content of that source is at most
% less than or equal to
$L$ bits per symbol.
%
% cut Sat 13/1/01
%
% We will show that, for any source, the information content of the source
% is intimately related to its entropy.
\subsection{Example: compression of text files}
A file is composed of a sequence of bytes. A byte is composed of 8
bits\marginpar{\small\raggedright{Here the word `bit' means `a
 symbol with two values'; it is not to be confused with the
 unit of information content.}}
 and can have a decimal value between 0 and 255. A
 typical text file is composed of characters from the
 ASCII character set (decimal values 0 to 127).
This character set uses only
seven of the eight bits in a byte.
\exercissxB{1}{ex.ascii}{
By how much could the size of a file be reduced given that
it is an ASCII file? How would you achieve this reduction?
}
Intuitively, it seems reasonable to assert that an ASCII file
contains $7/8$ as much information as an arbitrary file of the same
size, since we already know one out of every eight bits before we even
look at the file.
This is a
% very
simple example of redundancy.
Most sources of data have further redundancy: English text files
use the ASCII characters with non-equal frequency; certain pairs
of letters are more probable than others; and entire words
can be predicted given the context and a semantic understanding
of the text.
% this par is repeated in l4.
% compressibility.
\subsection{Some simple data compression methods that define
measures of information content}
%
% IDEA: connect back to opening
%
One way of measuring the information content of a random variable
is simply to count the number of {\em possible\/} outcomes,
$|\A_X|$. (The number of elements in a set $\A$ is denoted by $|\A|$.)
If we gave a binary name to each outcome, the length
of each name would be $\log_2 |\A_X|$ bits, if $|\A_X|$ happened
to be a power of 2.
We thus make the following definition.
\begin{description}%%%% was: [Perfect information content] Raw bit content
%%%%%%%%%%%%%%%%%%%%%%% see newcommands1.tex
\item[The \perfectic] of $X$ is
\beq
H_0(X) = \log_2 |\A_X| .
\eeq
\end{description}
$H_0(X)$ is a lower bound on
the number of binary questions needed if we are to be guaranteed to identify
any outcome from the ensemble $X$.
It is an additive quantity: the \perfectic\ of an ordered pair $x,y$,
having $|\A_X||\A_Y|$
possible outcomes,
satisfies
\beq
H_0(X,Y)= H_0(X) + H_0(Y).
\eeq
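The additivity follows in one line from the product form of the
number of outcomes:
\beq
 H_0(X,Y) = \log_2 \left( |\A_X| \, |\A_Y| \right)
 = \log_2 |\A_X| + \log_2 |\A_Y| = H_0(X) + H_0(Y) .
\eeq
For example, a pair of bytes has $256 \times 256 = 65\,536$ possible
values and a \perfectic\ of $8 + 8 = 16$ bits.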
This measure of information content does not include any
probabilistic element, and the encoding rule it corresponds to
does not `compress' the source data; it simply maps each
outcome
% source character
to a constant-length binary string.
\exercissxA{2}{ex.compress.possible}{
Could there be a compressor that maps
an outcome $x$ to a binary code $c(x)$, and a decompressor
that maps $c$ back to $x$, such that {\em every
possible outcome\/} is compressed into a binary code
of length {\em shorter\/}
than $H_0(X)$ bits?
}
Even though a simple counting argument\index{compression!of {\em any\/} file}
shows that it is impossible to make a reversible
compression program that reduces the size of {\em all\/} files,
amateur compression enthusiasts frequently announce that they have invented
a program that can do this -- indeed that they can further compress
compressed files by putting them through their compressor several\index{compression!of already-compressed files}\index{myth!compression}
times. Stranger yet, patents have
been granted to these modern-day \ind{alchemists}. See
the {\tt{comp.compression}} frequently asked questions
% \verb+http://www.faqs.org/faqs/compression-faq/part1/+
for further reading.\footnote{\tt{http://sunsite.org.uk/public/usenet/news-faqs/comp.compression/}}
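The counting argument can be sketched as follows
(assuming, for simplicity, that the files are the binary strings of
exactly $N$ bits). There are $2^N$ such files, but the number
of binary strings strictly shorter than $N$ bits, including the empty string, is
\beq
 \sum_{l=0}^{N-1} 2^l = 2^N - 1 < 2^N ,
\eeq
so no one-to-one mapping can shorten every file: at least two
files would have to share an encoding.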
%\footnote{\verb+http://www.lib.ox.ac.uk/internet/news/faq/+}
% ............by_category.compression-faq.html+}
% http://www.faqs.org/faqs/compression-faq/part1/preamble.html
There are only two ways in which a `compressor' can actually
compress files:
\ben
\item
A {\dem lossy\/} compressor compresses some\index{compression!lossy}
files, but maps some files
% {\em distinct\/} files are mapped
to the
{\em same\/} encoding. We'll assume that
the user requires perfect recovery of the source
file, so the occurrence of one of these
confusable files leads to a failure (though in
applications such as \ind{image compression}, lossy compression is viewed as
satisfactory). We'll denote by
$\delta$
the probability that the
source string is one of the confusable files, so a
lossy compressor\index{error probability!in compression}
has a probability $\delta$ of
failure. If $\delta$ can be made very small then
a lossy compressor may be practically useful.
\item
A {\dem lossless} compressor maps all files
to different encodings; if it
% f a lossless compressor
shortens some files,\index{compression!lossless}
it necessarily {\em makes others longer}. We try to design the
compressor so that the probability that a
file is lengthened is very small, and the probability that
it is shortened is large.
\een
In this chapter we discuss a simple lossy compressor.
In subsequent chapters we discuss lossless compression
methods.
%
\section{Information content defined in terms of lossy
compression}
%
Whichever type of compressor we construct, we need somehow to
take into account the {\em probabilities\/} of the different outcomes.
Imagine comparing the information contents of
two text files -- one
in which all 128 ASCII characters are used with equal probability,
and one in which the characters are used with their frequencies
in English text.
%: $P(x={\tt e})=$,
% $P(x={\tt e})=$, $P(x={\tt e})=$,$P(x={\tt e})=$,$P(x={\tt e})=$, \ldots
% $P(x={\tt e})=$, \ldots.
% only the characters {\tt 0} and {\tt 1} are used.
Can we define a measure of information content that
distinguishes between these two files? Intuitively,
the latter file contains less information per character
because it is more predictable.
%And a file of {\tt 0}s
% and {\tt 1}s in which nearly all the characters are {\tt 0}s
% conveys even less information.
% Maybe introducing 0 and 1 is nto a good idea.
% At this point I start talking in terms of compression.
% How can we include a probabilistic element?
One simple way to use
our knowledge that some symbols have a smaller probability is
to imagine recoding the observations into a smaller alphabet -- thus losing
the ability to encode some of the more improbable
symbols -- and then measuring the \perfectic\ of the new alphabet.
% choice here - could either map multiple symbols onto
% one, so the compression is lossy,
% or could define no entry at all for some symbols, so compression
% fails.
% The general mapping situation is not ideal since I really want all
% the losers to be mapped to one symbol. Student might imagine mapping
% Z and z to Z, Y and y to Y.. and claim they are losing little info.
% But this messes up the defn of delta.
For example,
we might take a risk when compressing English text, guessing that the most
infrequent characters won't occur,
and make a reduced ASCII code that omits the characters
% for example,
% `\verb+!+', `\verb+@+', `\verb+#+',
% `\verb+$+', `\verb+%+', `\verb+^+', `\verb+*+', `\verb+~+',
% `\verb+<+', `\verb+>+', `\verb+/+', `\verb+\+', `\verb+_+',
% `\verb+{+', `\verb+}+', `\verb+[+', `\verb+]+',
% and `\verb+|+',
$\{$ \verb+!+, \verb+@+, \verb+#+,
% \verb+$+, $
\verb+%+, \verb+^+, \verb+*+, \verb+~+,
\verb+<+, \verb+>+, \verb+/+, \verb+\+, \verb+_+,
\verb+{+, \verb+}+, \verb+[+, \verb+]+, \verb+|+ $\}$,
thereby reducing the size of the alphabet
% the total number of characters
by seventeen.
%
% cut this dec 2000
% Thus we can give new
%%%% a (not necessarily unique)
% names to a {\em subset\/} of the possible outcomes and count how many names we
% use.
The larger the risk we are willing to take, the smaller
our final alphabet becomes.
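In this example (a rough illustration, which treats all 128 ASCII codes
as characters of the original alphabet) the alphabet shrinks from 128 to
$128 - 17 = 111$ characters, so the \perfectic\ per character falls from
\beq
 \log_2 128 = 7 \ubits \:\:\: \mbox{to} \:\:\: \log_2 111 \simeq 6.79 \ubits .
\eeq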
% ] the number of names we need.
% We thus relax the exhaustive requirement of the definition of
%
% aside
%
% We could imagine doing this to the numbers coming out of the guessing
% game with which this chapter started, for example. It seems
% quite unlikely that the subject would have to guess 25, 26 or 27 times
% to get the next letter; these outcomes
%%`27' is
% are very improbable,
% and we might be willing to record the sequence of numbers using
% 24 symbols only, taking the gamble that in fact more guesses might
% be needed.
We introduce a parameter $\delta$ that describes the risk we
are taking when using this compression method: $\delta$ is
the probability that there will be no name for an outcome $x$.
\exampl{exHdelta}{
Let
\beq
\begin{array}{l*{14}{@{\,}c}}
& \A_X & = & \{ & {\tt a},& {\tt b},&{\tt c},&{\tt d},&{\tt e},&{\tt f},&{\tt g},&{\tt h} & \}, \\
\mbox{and }\:\:
& \P_X & = & \bigl\{ & \frac{1}{4} ,& \frac{1}{4} ,& \frac{1}{4} ,& \frac{3}{16} ,& \frac{1}{64} ,& \frac{1}{64} ,& \frac{1}{64} ,& \frac{1}{64} & \bigr\} .
\end{array}
\eeq
The \perfectic\ of this ensemble is 3 bits, corresponding to
8 binary names.
But notice that $P( x \in \{ {\tt a}, {\tt b}, {\tt c}, {\tt d} \} ) = 15/16$.
So if we are willing to run a risk of $\delta=1/16$ of not having a name
for $x$, then we can get by with four names --
half as many names as are needed if
every $x \in \A_X$ has a name.
Table \ref{fig.delta.examples} shows binary names that could be given
to the different outcomes in the cases $\delta = 0$ and $\delta = 1/16$.
When $\delta=0$ we need 3 bits to encode the outcome;
when $\delta=1/16$ we need only 2 bits.
%
%\begin{figure}[htbp]
%\figuremargin{%
\amargintab{b}{
\begin{center}
\begin{tabular}{cc}
\toprule
\multicolumn{2}{c}{$\delta = 0$}
\\
\midrule
$x$ & $c(x)$ \\ \midrule
{\tt a} & {\tt{000}} \\
{\tt b} & {\tt{001}} \\
{\tt c} & {\tt{010}} \\
{\tt d} & {\tt{011}} \\
{\tt e} & {\tt{100}} \\
{\tt f} & {\tt{101}} \\
{\tt g} & {\tt{110}} \\
{\tt h} & {\tt{111}} \\
\bottomrule
\end{tabular}
% \hspace{0.61in}
\hspace{0.1in}
\begin{tabular}{cc}
\toprule
\multicolumn{2}{c}{$\delta = 1/16$}
\\
\midrule
$x$ & $c(x)$ \\ \midrule
{\tt a} & {\tt{00}} \\
{\tt b} & {\tt{01}} \\
{\tt c} & {\tt{10}} \\
{\tt d} & {\tt{11}} \\
{\tt e} & $-$ \\
{\tt f} & $-$ \\
{\tt g} & $-$ \\
{\tt h} & $-$ \\
\bottomrule
\end{tabular}
\end{center}
%}{%
\caption[a]{Binary names for the outcomes,
for two failure probabilities $\delta$.}
\label{fig.delta.examples}
\label{tab.twosillycodes}
}%
%\end{figure}
}
%\noindent
Let us now formalize this idea.\index{source code}
%
To make a compression strategy with risk $\delta$,
% we consider all subsets $T$ of the alphabet $\A_X$ and
% seek out
we make the smallest possible subset
$S_{\delta}$ such that the
probability that $x$ is not in $S_{\delta}$ is less than or equal to
$\delta$, \ie,
$P(x \not\in S_{\delta} ) \leq \delta$. For each value of $\delta$ we can then
define a new measure of information content -- the log of the size
of this smallest subset $S_{\delta}$. [In ensembles in which
several elements have the same probability, there may be several
smallest subsets that contain different elements, but all that matters
is their sizes (which are equal), so we will not dwell on this ambiguity.]
% worry about this possibility.
\begin{description}
\item[The smallest $\delta$-sufficient subset] $S_{\delta}$ is the smallest
subset of $\A_X$ satisfying
\beq
P(x \in S_{\delta} ) \geq 1 - \delta.
\eeq
%\beq
% S_{\delta} = \argmin
%\eeq
\end{description}
The subset $S_{\delta}$ can be constructed by
ranking the elements of $\A_X$ in order of decreasing probability
and adding successive elements starting from the
most probable elements
% front of the list
until the total
probability is $\geq (1\!-\!\delta)$.
We can make a data compression code by assigning a binary name
to each element of the smallest sufficient subset.
% (\tabref{tab.twosillycodes}).
This compression
scheme motivates the following measure of information content:
\begin{description}
\item[The \essentialic] of $X$ is: %%%%% was ESSENTIAL information content
% consider risk-delta bit content?
\beq
H_{\delta}(X) = \log_2 |S_{\delta}| .
% = \log_2 \min \left\{ |S| : S\subseteq \A_X,
%% P(S)\geq 1-\delta \right\}.
% P(x \in S)\geq 1-\delta \right\}.
\eeq
\end{description}
Note that $H_0(X)$ is the special case of $H_{\delta}(X)$ with $\delta = 0$
(if $P(x) > 0$ for all $x \in \A_X$).
%
[{\sf Caution:} do not confuse $H_0(X)$ and $H_{\delta}(X)$
with the function $H_2(p)$ displayed in \figref{fig.h2}.]
%%%%%%%(Should I change notation to avoid confusion?)
%
\newcommand{\gapline}{\cline{1-4}\cline{6-9}}
\begin{figure}
\figuremargin{%
\begin{center}
\footnotesize%
\begin{tabular}{rc}
(a)&
\hspace*{-0.2in}\input{Hdelta/Sdelta/X.tex}\\
(b)&
\mbox{\makebox[0in][r]{\raisebox{1.3in}{$H_{\delta}(X)$}}\hspace{-5mm}%
\psfig{figure=Hdelta/byhand/X.ps,%
width=70mm,angle=-90}$\delta$}%
\\
\end{tabular}
\end{center}
}{%
\caption[a]{(a) The outcomes of $X$ (from \protect\exampleref{exHdelta}),
ranked by their probability.
(b) The
\essentialic\ $H_{\delta}(X)$. The labels on the graph
show the smallest sufficient set as a function of $\delta$.
Note $H_0(X) = 3$ bits and $H_{1/16}(X) = 2$ bits.
}
\label{fig.hd.1}
}
\end{figure}
%\noindent
{\Figref{fig.hd.1} shows $H_{\delta}(X)$ for the ensemble
of \exampleonlyref{exHdelta} as a function of $\delta$.
}
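As a worked check of this construction (the arithmetic here is ours, not read off the graph), rank the outcomes of example \ref{exHdelta} by decreasing probability. The cumulative probabilities are then
\beq
\textstyle \frac{1}{4},\: \frac{1}{2},\: \frac{3}{4},\: \frac{15}{16},\: \frac{61}{64},\: \frac{62}{64},\: \frac{63}{64},\: 1 .
\eeq
The smallest sufficient subset has $k$ members for exactly those $\delta$ at which the first $k$ outcomes carry probability at least $1-\delta$ and the first $k-1$ do not, which gives the following staircase.
\begin{center}
\begin{tabular}{ccc} \toprule
range of $\delta$ & $|S_{\delta}|$ & $H_{\delta}(X)$ (bits) \\ \midrule
$0 \leq \delta < 1/64$ & 8 & 3 \\
$1/64 \leq \delta < 2/64$ & 7 & 2.81 \\
$2/64 \leq \delta < 3/64$ & 6 & 2.58 \\
$3/64 \leq \delta < 1/16$ & 5 & 2.32 \\
$1/16 \leq \delta < 1/4$ & 4 & 2 \\
$1/4 \leq \delta < 1/2$ & 3 & 1.58 \\
$1/2 \leq \delta < 3/4$ & 2 & 1 \\
$3/4 \leq \delta < 1$ & 1 & 0 \\ \bottomrule
\end{tabular}
\end{center}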
\subsection{Extended ensembles}
% The compression method we're studying in which a subset of
% outcomes are given binary names is not giving us a
% measure of information content for a single symbol.
%
% sanjoy wants a motivation here.
%
Is this compression method any more useful if we compress
{\em blocks\/} of symbols from a source?\index{source code!block code}\index{ensemble!extended}\index{extended ensemble}
%
We now turn to examples where the outcome $\bx = (x_1,x_2,\ldots, x_N)$ is a string of $N$
independent identically distributed random variables
from a single ensemble $X$.
We will denote by
% $\bX$ or
$X^N$ the ensemble $( X_1, X_2, \ldots, X_N )$.
% for which $\bx$ is the random variable.
Remember that entropy is additive for independent variables (\exerciseref{ex.Hadditive}),
% \footnote{There should have been an exercise on this by now.}
so
% $H(\bX) = N H(X)$.
$H(X^N) = N H(X)$.
\exampl{ex.Nfrom.1}{
% {\sf Example 2:}
Consider a string of $N$ flips of a bent coin,
$\bx = (x_1,x_2,\ldots, x_N)$, where $x_n \in
\{{\tt{0}},{\tt{1}}\}$, with probabilities $p_0 \eq 0.9,$ $p_1 \eq
0.1$. The most probable strings $\bx$ are those with most {\tt{0}}s. If
$r(\bx)$ is the number of {\tt{1}}s in $\bx$ then
\beq
% |p_0,p_1
P(\bx) = p_0^{N-r(\bx)} p_1^{r(\bx)} .
\eeq
To evaluate $H_{\delta}(X^N)$
we must find the smallest sufficient subset $S_{\delta}$.
This subset will contain
all $\bx$ with $r(\bx) = 0, 1, 2, \ldots,$ up to some $r_{\max}(\delta)-1$,
and some of the $\bx$ with $r(\bx) = r_{\max}(\delta)$.
% Working backwards, we can evaluate the cumulative probability
% $P(r(\bx) \leq r)$ and evaluate the size of the subset $T(r): \{ \bx:
% r(\bx) \leq r \}$.
%\beq
% |T(r)| = \sum_{r=0}^{r} \frac{N!}{(N-r)!r!}
%\label{l2.T}
%\eeq
%\beq
% P(r(\bx) \leq r) = \sum_{r=0}^{r} \frac{N!}{(N-r)!r!} p_0^{N-r} p_1^{r}
%\label{l2.Pr}
%\eeq
% We can then plot $\log |T(r)|$ versus $P(r(\bx) \leq r)$. This defines
% a graph of $H_{\delta}(\bX)$ against $\delta$.
Figures \ref{fig.hd.4} and \ref{fig.hd.10}
% Figure \ref{fig.hd.4}
show graphs of $H_{\delta}(X^N)$ against
$\delta$ for the cases $N=4$ and $N=10$. The steps are the values of
$\delta$ at which $|S_{\delta}|$ changes by~1, and the cusps where the slope
of the staircase changes are the points
where $r_{\max}$ changes by 1.
}
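For $N=4$ the construction can be carried through by hand. The following tabulation is our own arithmetic for $p_0 = 0.9$, $p_1 = 0.1$, offered as an illustration.
\begin{center}
\begin{tabular}{cccc} \toprule
$r$ & number of strings & $P(\bx)$ of each & cumulative probability \\ \midrule
0 & 1 & 0.6561 & 0.6561 \\
1 & 4 & 0.0729 & 0.9477 \\
2 & 6 & 0.0081 & 0.9963 \\
3 & 4 & 0.0009 & 0.9999 \\
4 & 1 & 0.0001 & 1 \\ \bottomrule
\end{tabular}
\end{center}
With $\delta = 0.1$ we need total probability at least $0.9$. The string {\tt 0000} and three of the strings with $r=1$ give only $0.6561 + 3 \times 0.0729 = 0.8748$, so all four are needed, and $|S_{0.1}| = 5$, $H_{0.1}(X^4) = \log_2 5 \simeq 2.32$ bits. With $\delta = 0.01$ we need at least $0.99$. Five of the strings with $r=2$ bring the total to $0.9477 + 5 \times 0.0081 = 0.9882$, which is not enough, and a sixth brings it to $0.9963$, so $|S_{0.01}| = 11$ and $H_{0.01}(X^4) = \log_2 11 \simeq 3.46$ bits. Both values lie below $H_0(X^4) = 4$ bits.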
\exercissxC{2}{ex.cusps}{
What are the mathematical shapes of the curves between the cusps?
}
% , both with $p_1 =
% 0.1$. The points defined by equations (\ref{l2.T}) and (\ref{l2.Pr})
% are the cusps in the curve.
%
% I think this figure may be sick. CHECK IT.
%
\renewcommand{\gapline}{\cline{1-3}\cline{5-8}}
\begin{figure}
\figuremargin{%
%
% this table done by hand with help of (above hd.p command) /home/mackay/itp/Hdelta> more figs/4.tex
%
\begin{center}
\footnotesize%
\begin{tabular}{r@{\hspace*{-0.3in}}c}
(a)&
%%%%%%%% written by hand see also X.tex
%
% picture of Sdelta for X^4
%
\newcommand{\axislevel}{24}
\newcommand{\axislevelp}{29.5}
\newcommand{\axislevelm}{21}
\newcommand{\axislevelmm}{18}
\newcommand{\forestgap}{-0.7}
\newcommand{\forest}[3]{\multiput(#1)(\forestgap,0){#2}{\line(0,1){#3}}}
%
%
%
\setlength{\unitlength}{2.2pt}%
\begin{picture}(155,50)(-143,-20)% adjusted vertical height from 50 to 60 Sat 5/10/02. And put back again Sun 22/12/02 was (-143,-22) Sun 22/12/02
% - log P = 2.0 , 2.4 and 6.0
\forest{-6.1,0}{1}{16}% heights fictitious
\forest{-37.3,0}{4}{12.5}%
\forest{-68.5,0}{6}{9.4}% 69.5
\forest{-100.8,0}{4}{6.3}%
\forest{-132.9,0}{1}{4.2}%
% axis:
\put(-143,\axislevelm){\vector(1,0){151.0}}
%
% axis labels
\put(5,\axislevelp){\makebox(0,0)[b]{\small$\log_2 P(x)$}}
\put(0,\axislevel){\makebox(0,0)[b]{\small$0$}}
\put(-20,\axislevel){\makebox(0,0)[b]{\small$-2$}}
\put(-40,\axislevel){\makebox(0,0)[b]{\small$-4$}}
\put(-60,\axislevel){\makebox(0,0)[b]{\small$-6$}}
\put(-80,\axislevel){\makebox(0,0)[b]{\small$-8$}}
\put(-100,\axislevel){\makebox(0,0)[b]{\small$-10$}}
\put(-120,\axislevel){\makebox(0,0)[b]{\small$-12$}}
\put(-140,\axislevel){\makebox(0,0)[b]{\small$-14$}}
%
% this box is right size for the whole set
%\put(0,-2.5){\framebox(140,\axislevelm){}}
%\put(142,13){\makebox(0,0)[l]{\small$S_0$}}
% this box is round 3 clumps
\put(-83.5,-2.5){\framebox(83.5,\axislevelm){}}
\put(-84.5,13){\makebox(0,0)[r]{\small$S_{0.01}$}}
% a smaller box round 3 clumps
%\put(2.5,-1){\framebox(81,\axislevelmm){}}
%
\put(-53.5,-1){\framebox(51,\axislevelmm){}}
\put(-54.5,13){\makebox(0,0)[r]{\small$S_{0.1}$}}
%
% object labels
\put(-6.1,-12){\makebox(0,0)[t]{\footnotesize{\tt 0000}}}
\put(-37.7,-12){\makebox(0,0)[t]{\footnotesize${\tt 0010},{\tt 0001},\ldots$}}
\put(-69.5,-12){\makebox(0,0)[t]{\footnotesize${\tt 0110},{\tt 1010},\ldots$}}
\put(-101.2,-12){\makebox(0,0)[t]{\footnotesize${\tt 1101},{\tt 1011},\ldots$}}
\put(-132.9,-12){\makebox(0,0)[t]{\footnotesize{\tt 1111}}}
\multiput(-6.1,-10)(-31.6,0){5}{\vector(0,1){5}}
\end{picture}
%
%
%
%
(b)&
\makebox[0in][r]{\raisebox{1.3in}{$H_{\delta}(X^4)$}}\hspace{-5mm}%
\psfig{figure=Hdelta/figs/hd/4.ps,%
width=65mm,angle=-90}$\delta$%%
%
%
% useful for making table:
% hd.p mmin=4 mmax=4 mstep=6 scale_by_n=0 plot_sub_graphs=1 latex=1
%
\end{tabular}
\end{center}
}{%
%
% I think this figure may be sick. CHECK IT.
%
\caption[a]{(a) The sixteen outcomes of the ensemble $X^4$ with $p_1=0.1$, ranked by probability. (b) The
\essentialic\ $H_{\delta}(X^4)$. The upper
schematic diagram indicates the strings'
probabilities by the vertical lines' lengths (not to scale).}
\label{fig.hd.4}
}%
\end{figure}
%
%
%
\begin{figure}%[htbp]
\figuremargin{%
\begin{center}
\mbox{%%%%%%%%%%%%% (twocol) %}\\ \mbox{
\makebox[0in][r]{\raisebox{1.3in}{$H_{\delta}(X^{10})$}}\hspace{-5mm}%
\psfig{figure=Hdelta/figs/hd/10.ps,%
width=65mm,angle=-90}$\delta$}
% command, in Hdelta:
% hd.p mmin=4 mmax=10 mstep=6 scale_by_n=0 plot_sub_graphs=1 | gnuplot
\end{center}
}{%
\caption[a]{$H_{\delta}(X^N)$ for $N=10$ binary variables with $p_1=0.1$.}
\label{fig.hd.10}
}%
\end{figure}
For the examples shown in figures \ref{fig.hd.1}--\ref{fig.hd.10},
$H_{\delta}(X^N)$ depends strongly on the
value of $\delta$, so it might not seem a fundamental or useful
definition of information content.
But we will consider what happens as $N$, the number of independent variables
in $X^N$, increases. We will find the remarkable result that
$H_{\delta}(X^N)$ becomes almost independent of $\delta$ -- and for all
$\delta$ it is very close to $N H(X)$, where $H(X)$ is the
entropy of one of the random variables.
% sketch?
\begin{figure}
\figuremargin{%
\begin{center}
\mbox{\makebox[0in][r]{\raisebox{1.3in}{$\frac{1}{N}H_{\delta}(X^{N})$}}\hspace{-5mm}%
\psfig{figure=Hdelta/figs/hd/all.10.1010.ps,%
width=65mm,angle=-90}$\delta$}
\end{center}
}{%
\caption[a]{$\frac{1}{N} H_{\delta}(X^{N})$
for $N=10, 210, \dots,1010$ binary variables with $p_1=0.1$.}
\label{fig.hd.10.1010}
}
\end{figure}
\Figref{fig.hd.10.1010} illustrates this asymptotic tendency for
the binary ensemble of example \ref{ex.Nfrom.1}.
% discussed earlier with $N$ binary variables with $p_1 = 0.1$.
As $N$ increases, $\frac{1}{N} H_{\delta}(X^N)$ becomes an increasingly
flat function, except for tails close to $\delta=0$ and $1$.
% The limiting value of the plateau is $H(X) = 0.47$.
% We will explain and prove this result in the remainder of
% this chapter. Let's first note the implications of this result.
% The limiting value of the plateau, which for $N$ binary variables with $p_1 = 0.1$
% appears to be about 0.5, defines how much compression is possible:
% $N$ binary variables with $p_1 = 0.1$ can be compressed into
% about $N/2$ bits, with a probability of error $\delta$ which
% can be any value between 0 and 1.
% We will show that the plateau value to which $\frac{1}{N} H_{\delta}(X^N)$
% tends, for large $N$, is the entropy, $H(X)$.
%
% IDEA: Box this next sentence?
%
As long as we are allowed
a tiny probability of error $\delta$, compression down to
$NH$ bits is possible. Even if we are allowed a large probability of error,
we still can compress only down to $NH$ bits.
%
% IDEA: Box above?
%
This is the \ind{source coding theorem}.
% \subsection{The theorem}
\begin{ctheorem}
\label{thm.sct}
{\sf Shannon's source coding theorem.}
% HOW TO NAME THIS?????????????????
% this name is taken later
Let $X$ be an ensemble with entropy $H(X) = H$ bits. Given $\epsilon>0$
and $0<\delta<1$, there exists a positive integer $N_0$ such that for
$N>N_0$,
\beq
\left| \frac{1}{N} H_{\delta}(X^N) - H \right| < \epsilon.
\eeq
\end{ctheorem}
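To see what the theorem asserts for the bent coin of example \ref{ex.Nfrom.1} (a numerical illustration of ours), evaluate the entropy:
\beq
H = p_1 \log_2 \frac{1}{p_1} + p_0 \log_2 \frac{1}{p_0}
 = 0.1 \log_2 10 + 0.9 \log_2 \frac{10}{9}
 \simeq 0.332 + 0.137 = 0.469 \mbox{ bits}.
\eeq
The theorem then says that, for any fixed $\delta$ strictly between 0 and 1 and any tolerance $\epsilon$, once $N$ exceeds the corresponding $N_0$ the \essentialic\ satisfies
\beq
N ( 0.469 - \epsilon ) < H_{\delta}(X^N) < N ( 0.469 + \epsilon ) ,
\eeq
compared with the $N$ bits needed to give every string a name.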
%
% sanjoy wants explan here
%
% The reason that increasing $N$ helps is that, if $N$ is large,
% the outcome $\bx$
\section{Typicality}
Why does increasing $N$ help?\indexs{typicality}
Let's examine long strings from $X^N$.
Table \ref{tab.typical.tcl} shows fifteen samples from $X^N$
for $N=100$ and $p_1=0.1$.
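It may help to note, as a small worked step of ours, that the log-probability of a string depends only on the number $r$ of {\tt{1}}s it contains, and is linear in $r$:
\beq
\log_2 P(\bx) = (N-r) \log_2 p_0 + r \log_2 p_1
 = N \log_2 p_0 - r \log_2 \frac{p_0}{p_1} .
\eeq
For $N=100$ and $p_1 = 0.1$ this is approximately $-15.2 - 3.17\, r$. The all-{\tt{0}} string ($r=0$) has $\log_2 P(\bx) \simeq -15.2$, each additional {\tt{1}} lowers the log-probability by about $3.17$ bits, and a string with exactly $r = N p_1 = 10$ has $\log_2 P(\bx) \simeq -46.9$, which is $-H(X^{100})$.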
\begin{figure}
\figuremargin{%
\begin{center}
\begin{tabular}{lr} \toprule
$\bx$ &
% \multicolumn{1}{c}{$\log_2(P(\bx))$}
\hspace{-0.3in}{$\log_2(P(\bx))$}
% {\rule[-3mm]{0pt}{8mm}}%strut
\\ \midrule
% REQUIRE MONOSPACED FONT!!!
{\tinytt{%VERB
...1...................1.....1....1.1.......1........1...........1.....................1.......11...%END
}} & $-$50.1 \\
{\tinytt{%VERB
......................1.....1.....1.......1....1.........1.....................................1....%END
}} & $-$37.3 \\
{\tinytt{%VERB
........1....1..1...1....11..1.1.........11.........................1...1.1..1...1................1.%END
}} & $-$65.9 \\
{\tinytt{%VERB
1.1...1................1.......................11.1..1............................1.....1..1.11.....%END
}} & $-$56.4 \\
{\tinytt{%VERB
...11...........1...1.....1.1......1..........1....1...1.....1............1.........................%END
}} & $-$53.2 \\
{\tinytt{%VERB
..............1......1.........1.1.......1..........1............1...1......................1.......%END
}} & $-$43.7 \\
{\tinytt{%VERB
.....1........1.......1...1............1............1...........1......1..11........................%END
}} & $-$46.8 \\
{\tinytt{%VERB
.....1..1..1...............111...................1...............1.........1.1...1...1.............1%END
}} & $-$56.4 \\
{\tinytt{%VERB
.........1..........1.....1......1..........1....1..............................................1...%END
}} & $-$37.3 \\
{\tinytt{%VERB
......1........................1..............1.....1..1.1.1..1...................................1.%END
}} & $-$43.7 \\
{\tinytt{%VERB
1.......................1..........1...1...................1....1....1........1..11..1.1...1........%END
}} & $-$56.4 \\
{\tinytt{%VERB
...........11.1.........1................1......1.....................1.............................%END
}} & $-$37.3 \\
{\tinytt{%VERB
.1..........1...1.1.............1.......11...........1.1...1..............1.............11..........%END
}} & $-$56.4 \\
{\tinytt{%VERB
......1...1..1.....1..11.1.1.1...1.....................1............1.............1..1..............%END
}} & $-$59.5 \\
{\tinytt{%VERB
............11.1......1....1..1............................1.......1..............1.......1.........%END
}} & $-$46.8 \\ \midrule % [0.2in]
%
{\tinytt{%VERB
....................................................................................................%END
}} & $-$15.2 \\
{\tinytt{%VERB
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111%END
}} & $-$332.1\\
%
\bottomrule
\end{tabular}
\end{center}
}{%
\caption[a]{The top 15 strings are samples from $X^{100}$,
where $p_1 = 0.1$ and $p_0 = 0.9$.
The bottom two are the most and least probable strings in this ensemble.
The final column shows the
% Compare the
log-probabilities of the random strings,
which may be compared with the entropy
% with
% the \aep: $H(X) = 0.469$, so
$H(X^{100}) = 46.9$ bits.}
\label{tab.typical.tcl}
}
\end{figure}
% 1000 Typical set size +/- 28.46 has log_2(P(x)) within +/- 90.22
% i.e. 1/N (logp) is within 0.090
% 100 Typical set size +/- 9 has log_2(P(x)) within +/- 28.53
% i.e. 1/N(logp) is within 0.285
% 200 Typical set size +/- 12.73 has log_2(P(x)) within +/- 40.35
%
% N=100 alternative (see hd.p for the commands)
%
\begin{figure}
\fullwidthfigureright{
%\figuremargin{%
\begin{center}
\begin{tabular}{r@{\hspace*{-0in}}c@{\hspace*{-0.1in}}c} \toprule
& $N=100$ & $N=1000$ \\ \midrule
\raisebox{0.71in}{\small$n(r) = {N \choose r}$}
& \mbox{\psfig{figure=Hdelta/figs/num/100.ps,%
width=50mm,angle=-90}}
& \mbox{\psfig{figure=Hdelta/figs/num/1000.ps,%
width=50mm,angle=-90}} \\
\raisebox{0.71in}{\small$P(\bx) = p_1^r (1-p_1)^{N-r}$}
& \mbox{\psfig{figure=Hdelta/figs/per/100.ps,%
width=50mm,angle=-90}}%
\makebox[0in][r]{\raisebox{0.4in}{%
\psfig{figure=Hdelta/figs/perdet/100.ps,%
width=30mm,angle=-90}}\hspace{0.2in}}
&
\\
\raisebox{0.71in}{\small$\log_2 P(\bx)$}
& \mbox{\psfig{figure=Hdelta/figs/logper/100.ps,%
width=50mm,angle=-90}}
& \mbox{\psfig{figure=Hdelta/figs/logper/1000.ps,%
width=50mm,angle=-90}} \\
\raisebox{0.71in}{\small$n(r)P(\bx)= {N \choose r} p_1^r (1-p_1)^{N-r}$}
& \mbox{\psfig{figure=Hdelta/figs/tot/100.ps,%
width=50mm,angle=-90}}
& \mbox{\psfig{figure=Hdelta/figs/tot/1000.ps,%
width=50mm,angle=-90}}
% \makebox[0in][l]{$r$}
\\
&
$r$ & $r$ \\ \bottomrule
\end{tabular}
\end{center}
}{%
\caption[a]{Anatomy of the typical set $T$.
For $p_1=0.1$
and $N=100$ and $N=1000$, these graphs show $n(r)$, the number of
strings containing $r$ {\tt{1}}s; the probability $P(\bx)$ of a single
string that contains $r$ {\tt{1}}s; the same probability on a
log scale; and the total probability
$n(r)P(\bx)$ of all strings that contain $r$ {\tt{1}}s.
The number $r$ is on the horizontal axis.
The plot of $\log_2 P(\bx)$ also shows by a dotted line the mean value of
$\log_2 P(\bx) = -N H_2(p_1)$ which equals $-46.9$
when $N=100$ and $-469$ when $N=1000$. The typical set includes
only the strings that have $\log_2 P(\bx)$ close to this value.
The range marked {\sf T} shows the set $T_{N \beta}$ (as defined
in \protect\sectionref{sec.ts})
for $N=100$ and $\beta = 0.29$ (left) and $N=1000$, $\beta = 0.09$ (right).
}
\label{fig.num.per.tot}
}%
\end{figure}
The probability of a string $\bx$ that contains $r$ {\tt{1}}s and
$N\!-\!r$ {\tt{0}}s is
\beq
P(\bx) = p_1^r (1-p_1)^{N-r} .
\eeq
The number of strings that contain $r$ {\tt{1}}s is
\beq
n(r) = {N \choose r} .
\eeq
So the number of {\tt{1}}s, $r$, has a binomial distribution:
\beq
P(r) = {N \choose r} p_1^r (1-p_1)^{N-r} .
\eeq
These functions are shown in \figref{fig.num.per.tot}.
The mean of $r$ is $N p_1$, and its standard deviation is
$\sqrt{N p_1 (1-p_1)}$ (\pref{sec.first.binomial}).
If $N$ is 100 then
\beq
r \sim N p_1 \pm \sqrt{N p_1 (1-p_1)} \simeq 10 \pm 3 .
\eeq
If $N=1000$ then
\beq
r \sim 100 \pm 10 .
\eeq
Notice that as $N$ gets bigger, the probability distribution
of $r$ becomes more concentrated, in the sense that
while the
range of possible values of $r$ grows
as $N$, the standard deviation of $r$
grows only as $\sqrt{N}$.
That $r$ is most likely to fall
in a small range of values implies
that the outcome $\bx$ is also most likely to
fall in a corresponding small subset of outcomes
that we will call the {{\dbf\inds{typical set}}}.
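To put numbers on this concentration (our arithmetic, for $p_1 = 0.1$), consider the standard deviation of the fraction of {\tt{1}}s, $r/N$:
\beq
\frac{\sqrt{N p_1 (1-p_1)}}{N} = \sqrt{\frac{p_1 (1-p_1)}{N}} = \frac{0.3}{\sqrt{N}} .
\eeq
This is $0.03$ for $N=100$, about $0.0095$ for $N=1000$, and $0.0003$ for $N=10^6$, so the fraction of {\tt{1}}s is pinned ever more tightly to $p_1 = 0.1$.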
\subsection{Definition of the typical set}
\label{sec.ts}
% Let us generalize our discussion to an arbitrary ensemble $X$
% with alphabet $\A_X$
% and define typicality.
Let us define \ind{typicality}\index{typical set!for compression}
for an arbitrary ensemble $X$
with alphabet $\A_X$.
Our definition of a typical string will
involve the string's probability.
A long string
% message
of $N$ symbols will usually
contain
% with high probability
about $p_1N$ occurrences of the first symbol,
$p_2N$ occurrences of the second, etc. Hence the probability
of this string
% long message
is roughly
\beq
P(\bx)_{\rm typ}
= P(x_1)P(x_2)P(x_3) \ldots P(x_N)
\simeq p_1^{(p_1N)} p_2^{(p_2N)} \ldots p_I^{(p_IN)}
\eeq
% p_i^{p_iN}
so that
the information content of a typical string is
\beq
\log_2 \frac{1}{P(\bx)}
 \simeq N \sum_i p_i \log_2 \frac{1}{p_i} = N H .
\eeq
So the random variable $\log_2 \!\dfrac{1}{P(\bx)}$,
% So the random variable $\frac{1}{N} \log_2 \frac{1}{P(\bx)}$,
% which is the average information content per symbol, is
which is the information content of $\bx$, is
very likely to be close in value to $N H$.
We build our definition of typicality on this observation.
We define the typical elements of $\A_X^N$ to be
those elements that
have probability close to $2^{-NH}$. (Note that the typical set,
unlike the
% best subset for compression
smallest sufficient subset, does
{\em not\/} include the most probable elements of $\A_X^N$, but we
will show that these most probable elements
contribute negligible probability.)
We introduce a parameter $\beta$ that defines how close
the probability has to be to $2^{-NH}$ for
an element to be `typical'.
% $\beta$-
We call the set of typical elements the typical set,
% $T$, or, to be more precise,
$T_{N \beta}$:
% , where the parameter $\beta$
%% controls the breadth of the typical set by defining
% defines what we mean by a probability `close' to $2^{-NH}$:
\beq
T_{N\b} \equiv \left\{ \bx\in\A_X^N :
\left| \frac{1}{N} \log_2 \frac{1}{P(\bx)} - H \right| < \b
\right\} .
\label{eq.TNb}
\eeq
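For a binary ensemble this definition reduces to a condition on the number of {\tt{1}}s, $r$. The following is a worked step of ours, using $P(\bx) = p_1^r p_0^{N-r}$ with $p_0 = 1 - p_1$:
\beq
\frac{1}{N} \log_2 \frac{1}{P(\bx)} - H
 = \left( \frac{r}{N} - p_1 \right) \log_2 \frac{p_0}{p_1} ,
\eeq
so $\bx \in T_{N\b}$ if and only if $| r - N p_1 | < N \b / | \log_2 ( p_0 / p_1 ) |$.
For $p_1 = 0.1$, $\log_2 (p_0/p_1) = \log_2 9 \simeq 3.17$. With $N=100$ and $\b = 0.29$ the condition is $|r - 10| < 9.15$, that is, $1 \leq r \leq 19$; with $N=1000$ and $\b = 0.09$ it is $|r - 100| < 28.4$, that is, $72 \leq r \leq 128$. In the first case the all-{\tt{0}} string, which is the most probable string of all, is not typical.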
%
% check whether < has propagated to all necessary places
%
We will show that whatever value of $\beta$ we choose,
the typical set contains almost all the probability
as $N$ increases.
This important result is sometimes called the
{\dem `asymptotic equipartition' principle}.\index{asymptotic equipartition}
% \newpage
%\section{`Asymptotic Equipartition' and Source Coding}
\label{sec.aep}
% We will prove the following result:
\begin{description}
\item[`Asymptotic equipartition' principle\puncspace]
% (AEP).]
For an ensemble of $N$ independent identically distributed (\ind{i.i.d.})
random variables
$X^N \equiv ( X_1, X_2, \ldots, X_N )$, with $N$ sufficiently large,
the outcome $\bx = (x_1,x_2,\ldots, x_N)$ is almost certain to belong
to a subset of $\A_X^N$ having only $2^{N H(X)}$ members, each having
probability `close to' $2^{-N H(X)}$.
\end{description}
Notice that if $H(X) < H_0(X)$ then $2^{N H(X)}$ is a {\em tiny\/}
fraction of the number of possible outcomes $|\A_X^N|=|\A_X|^N=2^{N
H_0(X)}.$
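As a numerical illustration of ours: for the binary ensemble with $p_1 = 0.1$, $H_0(X) = 1$ bit and $H(X) \simeq 0.469$ bits, so for $N = 100$
\beq
\frac{2^{N H(X)}}{2^{N H_0(X)}} \simeq \frac{2^{46.9}}{2^{100}} = 2^{-53.1} \simeq 10^{-16} .
\eeq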
\begin{aside}
The term \ind{equipartition} is chosen to describe the idea
that the members of the typical set have {\em roughly equal\/}
probability. [This should not be taken too literally, hence my
use of quotes around `asymptotic equipartition';
% in the phrase \aep;
see page \pageref{sec.aep.caveat}.]
A second meaning for equipartition, in thermal physics,
is the idea that each degree of freedom of a classical system
has {equal\/} average energy, $\half kT$. This second meaning
is not intended here.
\end{aside}
%
The \aep\ is equivalent to:
\begin{description}
\item[Shannon's source coding theorem (verbal statement)\puncspace]
$N$ i.i.d.\ random variables each
with entropy $H(X)$ can be compressed into more than $NH(X)$ bits with
negligible risk of information loss, as $N\rightarrow \infty$;
conversely if they are compressed into fewer than $NH(X)$ bits
it is virtually certain that information will be lost.
\end{description}
These two theorems are equivalent
because we can define a compression algorithm that gives a distinct
name of length $N H(X)$ bits to each $\bx$ in the typical set.
% probable subset.
% as follows:
% enumerate the $\bx$ belonging to
% the subset of $2^{N H(X)}$ equiprobable outcomes as 000\ldots000,
% 000\ldots001, etc.
\begin{figure}
\figuredangle{%
\begin{center}
%%%%%%%% written by hand see also X.tex
%
% picture of Sdelta for X^100
%
\newcommand{\axislevel}{27}
\newcommand{\axislevelp}{32.5}
\newcommand{\axislevelm}{24}
\newcommand{\axislevelmm}{21}
\newcommand{\forestgap}{-0.4}
\newcommand{\forestgab}{-0.6}
\newcommand{\forestgac}{-0.56}
\newcommand{\forestgad}{-0.52}
\newcommand{\forestgae}{-0.48}
\newcommand{\forestgaf}{-0.44}
% \newcommand{\forestgag}{0.48}
%\newcommand{\forestgap}{0.35} was .35 when I went up to 14.
\newcommand{\forest}[3]{\multiput(#1)(\forestgap,0){#2}{\line(0,1){#3}}}
\newcommand{\foresb}[4]{\multiput(#1)(#4,0){#2}{\line(0,1){#3}}}
%
% picture
%
%\setlength{\unitlength}{2.45pt}%
\setlength{\unitlength}{2.87pt}%
\begin{picture}(170,81)(-170,-42)
\forest{0,0}{1}{16.5}%
\foresb{-5,0}{2}{16}{\forestgab}
\foresb{-10,0}{3}{15.5}{\forestgab}
\foresb{-15,0}{4}{15}{\forestgac}
\foresb{-20,0}{5}{14.5}{\forestgad}
\foresb{-25,0}{6}{14}{\forestgae}
\foresb{-30,0}{7}{13.5}{\forestgaf}
\foresb{-35,0}{8}{13}{\forestgap}
\foresb{-40,0}{9}{12.5}{\forestgap}
\forest{-45,0}{10}{12}%
\forest{-50,0}{11}{11.5}%
\forest{-55,0}{12}{11}%
\forest{-60,0}{12}{10.5}%
\forest{-65,0}{12}{10}%
\forest{-70,0}{12}{9.5}%
\forest{-75,0}{12}{9}%
\forest{-80,0}{12}{8.5}%
\forest{-85,0}{12}{8}%
\forest{-90,0}{12}{7.5}%
\forest{-95,0}{12}{7}%
\forest{-100,0}{12}{6.5}%
\forest{-105,0}{12}{6}%
\forest{-110,0}{12}{5.5}%
\forest{-115,0}{11}{5}%
\forest{-120,0}{10}{4.5}%
\foresb{-125,0}{9}{4.2}{\forestgap}
\foresb{-130,0}{8}{3.9}{\forestgap}
\foresb{-135,0}{7}{3.6}{\forestgaf}
\foresb{-140,0}{6}{3.3}{\forestgae}
\foresb{-145,0}{5}{3.0}{\forestgad}
\foresb{-150,0}{4}{2.7}{\forestgac}
\foresb{-155,0}{3}{2.4}{\forestgab}
\foresb{-160,0}{2}{2.1}{\forestgab}
\forest{-165,0}{1}{1.8}%
%
% axis:
\put(-168,\axislevelm){\vector(1,0){171.0}}
%
% axis labels
\put(0,\axislevelp){\makebox(0,0)[br]{\small$\log_2 P(x)$}}
\put(-42.4,\axislevel){\makebox(0,0)[b]{\small$-NH(X)$}}
% tic mark (was at -40 until Tue 8/1/02)
\put(-42.4,\axislevelm){\line(0,1){2}}
% the S0 box
%\put(-3,-2.5){\framebox(172,\axislevelm){}}
%\put(142,16){\makebox(0,0)[l]{$S_0$}}
%
%
% typical set box
\put(-49.5,-1){\framebox(15,\axislevelmm){}}
\put(-51,16){\makebox(0,0)[r]{$T_{N\b}$}}
%
% object labels
\put(0,-40){\vector(0,1){35}}
\put(-15,-35){\vector(0,1){30}}
%\put(26,-30){\vector(0,1){25}}
\put(-36,-25){\vector(0,1){20}}
\put(-46,-20){\vector(0,1){15}}
%\put(56,-15){\vector(0,1){10}}
\put(-155,-10){\vector(0,1){5}}
\put( 0,-40){\makebox(0,0)[tr]{\footnotesize{{\tt 0000000000000}\ldots{\tt{00000000000}}}}}
\put(-15,-35){\makebox(0,0)[tr]{\footnotesize{{\tt 0001000000000}\ldots{\tt{00000000000}}}}}
%\put(26,-30){\makebox(0,0)[tl]{\footnotesize{{\tt 0000001000000}\ldots{\tt{00000010000}}}}}
\put(-36,-25){\makebox(0,0)[tr]{\footnotesize{{\tt 0100000001000}\ldots{\tt{00010000000}}}}}
\put(-46,-20){\makebox(0,0)[tr]{\footnotesize{{\tt 0000100000010}\ldots{\tt{00001000010}}}}}
%\put(56,-15){\makebox(0,0)[tl]{\footnotesize{{\tt 0100001000100}\ldots{\tt{00010100100}}}}}
\put(-155,-10){\makebox(0,0)[tl]{\footnotesize{{\tt 1111111111110}\ldots{\tt{11111110111}}}}}
\end{picture}
%
%
%
%
\end{center}
}{%
\caption[a]{Schematic diagram showing all strings
in the ensemble $X^{N}$
% with $p_0 = 0.9, p_1=0.1$
% of large length $N$
ranked by their probability, and
the typical set $T_{N\b}$.}
\label{fig.typical.set.explain}
}%
\end{figure}
\section{Proofs}
\label{sec.chtwoproof}
This section may be skipped if found tough going.
\subsection{The law of large numbers}
Our proof of the source coding theorem uses the
\ind{law of large numbers}.
\begin{description}
% \item[A random variable $u$] is any real function of $x$,
\item[Mean and variance] of a real random variable $u$
%\footnote
are $\Exp[u] = \bar{u} = \sum_u P(u) u$ and $\var(u) =
\sigma^2_u = \Exp[(u-\bar{u})^2] = \sum_u P(u) (u - \bar{u})^2.$
\begin{aside}
Technical note:
strictly I am assuming here that $u$ is a function $u(x)$ of a
sample $x$ from a finite discrete ensemble $X$. Then the
summations $\sum_u P(u) f(u)$ should be written $\sum_x P(x)
f(u(x))$. This means that $P(u)$ is a finite sum of delta
functions. This restriction guarantees that the mean and
variance of $u$ do exist, which is not necessarily the case for general
$P(u)$.
\end{aside}
\item[Chebyshev's inequality 1\puncspace]
Let $t$ be a non-negative real random variable, and\index{Chebyshev inequality}
let $\a$ be a positive real number. Then\index{inequality}
\beq
P(t \geq \a) \:\leq\: \frac{\bar{t}}{\a}.
\label{eq.cheb.1}
\eeq
{\sf Proof:} $P(t \geq \a) = \sum_{t \geq \a} P(t)$.
We multiply each
term by $t/\a \geq 1$ and obtain:
$P(t \geq \a) \leq \sum_{t \geq \a} P(t) t/\a.$
We add the (non-negative) missing terms and obtain:
$P(t \geq \a) \leq \sum_{t} P(t) t/\a = \bar{t}/\a$. \hfill$\epfsymbol$\par
\item[Chebyshev's inequality 2\puncspace]
Let $x$ be a random variable, and let $\a$ be a positive real number.
Then
\beq
P\left( (x-\bar{x})^2 \geq \a \right) \:\leq\: \sigma^2_x / \a.
\eeq
{\sf Proof:} Take $t = (x-\bar{x})^2$ and apply the previous proposition. \hfill$\epfsymbol$\par
\item[Weak \ind{law of large numbers}\puncspace]
Take $x$ to be the average of $N$ independent random variables
$h_1, \ldots , h_N$, having common mean $\bar{h}$ and common variance
$\sigma^2_h$: $x = \frac{1}{N} \sum_{n=1}^N h_n$. Then
\beq
 P( (x-\bar{h})^2 \geq \a ) \leq \frac{\sigma^2_h}{\a N}.
\eeq
{\sf Proof:} obtained by showing that $\bar{x}=\bar{h}$ and that
$\sigma^2_x = \sigma^2_h/ N$. \hfill$\epfsymbol$\par
\end{description}
We are interested in $x$ being very close to the mean ($\a$ very small).
No matter how large $\sigma^2_h$ is, and no matter how small the
required $\a$ is, and no matter how small the desired probability that
$(x-\bar{h})^2 \geq \a$, we can always achieve it by
taking $N$ large enough.
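The two facts used in the proof of the weak law can each be checked in one line. The mean of the average is
\beq
\bar{x} = \Exp\!\left[ \frac{1}{N} \sum_{n=1}^N h_n \right]
 = \frac{1}{N} \sum_{n=1}^N \Exp[ h_n ] = \bar{h} ,
\eeq
and, because the variances of independent random variables add,
\beq
\sigma^2_x = \var\!\left( \frac{1}{N} \sum_{n=1}^N h_n \right)
 = \frac{1}{N^2} \sum_{n=1}^N \var( h_n ) = \frac{\sigma^2_h}{N} .
\eeq
Substituting these into Chebyshev's inequality 2 gives
$P( (x-\bar{h})^2 \geq \a ) \leq \sigma^2_x / \a = \sigma^2_h / ( \a N )$.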
\subsection{Proof of theorem \protect\ref{thm.sct} (\pref{thm.sct})}
% the source coding theorem}
% or could say theorem 1
We apply the law of large numbers to the random variable $\frac{1}{N}
\log_2 \frac{1}{P(\bx)}$ defined for $\bx$ drawn from the ensemble $X^N$.
This random variable can be written as the average of $N$ information
contents
$h_n = \log_2 ( 1 / P(x_n))$, each of which is a random variable with
mean $H = H(X)$ and variance $\sigma^2 \equiv \var[ \log_2 ( 1 / P(x_n)) ]$.
(Each term $h_n$
is the Shannon information content of the $n$th
outcome.)
We again define the typical set with parameters $N$ and $\beta$ thus:
\beq
T_{N\b} = \left\{ \bx\in\A_X^N :
\left[ \frac{1}{N} \log_2 \frac{1}{P(\bx)} - H \right]^2 < \b^2
\right\} .
\label{eq.TNb.2}
\eeq
For all $\bx \in T_{N\b}$, the probability of $\bx$ satisfies
\beq
2^{-N(H+\b)} < P(\bx) < 2^{-N(H-\b)}.
\eeq
And by the law of large numbers (applied with $\a = \b^2$),
\beq
P(\bx \in T_{N\b}) \geq 1 - \frac{\sigma^2}{\b^2 N} .
\eeq
We have thus proved the \aep. As $N$ increases, the probability
that $\bx$ falls in $T_{N\b}$ approaches 1, for any $\beta$.
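To get a feel for this bound, here is a numerical illustration of ours for the binary ensemble with $p_1 = 0.1$. The information content $\log_2 ( 1/P(x_n) )$ takes just two values, so its variance is
\beq
\sigma^2 = p_0 p_1 \left[ \log_2 \frac{p_0}{p_1} \right]^2
 \simeq 0.09 \times ( 3.17 )^2 \simeq 0.904 .
\eeq
Fixing $\b = 0.09$, the bound $1 - \sigma^2 / ( \b^2 N )$ is about $0.888$ for $N = 10^3$, $0.989$ for $N = 10^4$, and $0.9989$ for $N = 10^5$. These are only lower bounds; Chebyshev's inequality is generally not tight, so the true probability of landing in $T_{N\b}$ may be considerably higher.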
How does this result relate to source coding?
% We will prove the \aep\ first; then w
We must relate $T_{N\b}$ to $H_{\delta}(X^N)$.
We will
show that for any given $\delta$ there is
a sufficiently big $N$ such that
$H_{\delta}(X^N) \simeq N H$.
\subsubsection{Part 1: $\frac{1}{N} H_{\delta}(X^N) < H +
\epsilon$.}
% of the source coding theorem.
%
% More words here reminding what H_delta is
%
The set $T_{N\b}$ is not the best subset for compression. So, whenever
 $P(\bx \in T_{N\b}) \geq 1 - \delta$, the log of the size of $T_{N\b}$ gives an upper bound on $H_{\delta}(X^N)$.
We show how {\em small} $H_{\delta}(X^N)$ must be by calculating
% the largest cardinality that $T_{N\b}$ could have.
how big $T_{N\b}$ could possibly be.
We are
free to set $\beta$ to any convenient value.
The smallest possible
probability that a member of $T_{N\b}$ can have is $2^{-N(H+\b)}$, and
the total probability that $T_{N\b}$ contains can't be any bigger
than 1. So
\beq
|T_{N\b}| \, 2^{-N(H+\b)} < 1 ,
\eeq
that is, the size of the typical set is bounded by
% so we can bound
\beq
|T_{N\b}| < 2^{N(H+\b)} .
\eeq
 If we set $\b = \epsilon$ and $N_0$ such that
 $\frac{\sigma^2}{\epsilon^2 N_0} \leq \delta$, then for $N \geq N_0$, $P(T_{N\b}) \geq
 1 - \delta$,
and the set $T_{N\b}$ becomes a witness to the fact that
$H_{\delta}(X^N) \leq \log_2 | T_{N\b} | < N ( H + \epsilon)$.
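As a rough numerical illustration (the values of $\epsilon$ and $\delta$ are our own choices), take the binary ensemble with $p_1 = 0.1$, for which $H \simeq 0.469$ bits and $\sigma^2 = p_0 p_1 [ \log_2 ( p_0 / p_1 ) ]^2 \simeq 0.9044$. With $\epsilon = 0.1$ and $\delta = 0.01$ the condition on $N_0$ reads
\beq
N_0 \geq \frac{\sigma^2}{\epsilon^2 \delta} \simeq \frac{0.9044}{0.01 \times 0.01} \simeq 9043.6 ,
\eeq
so $N_0 = 9044$ suffices, and for such $N$ we have $H_{\delta}(X^N) < N ( 0.469 + 0.1 ) = 0.569\, N$ bits. Because this $N_0$ comes from Chebyshev's inequality, it is a sufficient value rather than the smallest one that would work.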
%
\amarginfig{b}{
{\footnotesize
\setlength{\unitlength}{1.2mm}
\begin{picture}(40,40)(-5,0)
\put(5,5){\makebox(0,0)[bl]{\psfig{figure=figs/gallager/Hdeltaconcept.eps,width=36mm}}}
\put(5,35){\makebox(0,0){$\smallfrac{1}{N} H_{\delta}(X^N)$}}
\put(5,27){\makebox(0,0)[r]{$H_0(X)$}}
\put(5,4){\makebox(0,0)[t]{$0$}}
\put(30,4){\makebox(0,0)[t]{$1$}}
\put(35,4){\makebox(0,0)[t]{$\delta$}}
\put(33,11){\makebox(0,0)[l]{$H-\epsilon$}}
\put(33,15){\makebox(0,0)[l]{$H$}}
\put(33,19){\makebox(0,0)[l]{$H+\epsilon$}}
\end{picture}
}
\caption[a]{Schematic illustration of the two parts of the theorem.
Given any $\delta$ and $\epsilon$, we show that
for large enough $N$, $\frac{1}{N} H_{\delta}(X^N)$
lies (1) below the line
$H+\epsilon$ and (2) above the line $H-\epsilon$.}
\label{fig.Hd.schem}
}
\subsubsection{Part 2: $\frac{1}{N} H_{\delta}(X^N) >
H - \epsilon$.}
% of the source coding theorem.}
%
% needs work ,sanjoy says:
%
% (jan 99)_
%
Imagine that someone claims this second part is not so -- that,
for any $N$, the
smallest $\delta$-sufficient subset $S_{\delta}$ is smaller than the above
inequality would allow.
% They claim that
% $|S_{}| \leq 2^{N(H-\epsilon)}$ and $P(\bx \in S_{})
% \geq 1 - \delta$.
We can make use of our typical set to show that they must be mistaken.
Remember that we are free to set $\beta$ to any value we choose.
We will set $\beta = \epsilon/2$, so that our task is to
prove that a
% that an alternative {\em smaller\/}
subset $S'_{}$ having
$|S'_{}| \leq 2^{N(H-2\beta)}$ and achieving $P(\bx \in S'_{}) \geq 1 - \delta$
cannot exist (for $N$ greater than an $N_0$ that we will specify).
%(We attach the
% prime to $S$ to denote the fact that this is a conjectured smallest subset.)
So, let us consider the probability of falling in this rival smaller subset $S'_{}$.
The probability of the subset $S'_{}$ is\marginpar[t]{%
\begin{center}
\raisebox{-0.5in}[0in][0in]{
%%%%%%%% written by hand Sun 22/12/02
%
% Venn picture
%
%
\setlength{\unitlength}{0.321pt}%
{\begin{picture}(452,215)(-173,-132)%
% axis labels
\put(-100,39){\makebox(0,0)[r]{\small$T_{N\b}$}}
\put(100,39){\makebox(0,0)[l]{\small$S'$}}
\thinlines
\put(-33,-1){\circle{126}}
\thicklines
\put(33,-1){\circle{126}}
\thinlines
\put(18,-85){\vector(-1,4){18}}
\put(33,-90){\makebox(0,0)[t]{\small$ S'_{} \cap T_{N\b} $}}
\put(105,-51){\vector(-1,1){40}}
\put(112,-39){\makebox(0,0)[tl]{\small$ S'_{} \cap \overline{T_{N\b}} $}}
\end{picture}}
%
%
%
%
\end{center}}
\beq
P(\bx \in S'_{}) \,=\, P(\bx \in S' \! \cap \! T_{N\b}) +
P(\bx \in S'_{} \!\cap\! \overline{T_{N\b}}),
\eeq
where $\overline{T_{N\b}}$ denotes
the complement $\{ \bx \not \in T_{N\b}\}$.
The maximum value of the first term is found if
$S'_{} \cap T_{N\b} $ contains
$2^{N(H-2\beta)}$ outcomes all with the maximum probability,
$2^{-N(H-\beta)}$. The maximum value the second term can have is
$P( \bx \not \in T_{N\b})$. So:
\beq
P(\bx \in S'_{}) \, \leq \, 2^{N(H-2\beta)}
\, 2^{-N(H-\beta)}
+ \frac{\sigma^2}{\b^2 N}
= 2^{-N \b} + \frac{\sigma^2}{\b^2 N} .
\eeq
We can now set $\b = \epsilon/2$ and $N_0$ such that $P(\bx \in S'_{}) < 1-
\delta$, which shows that $S'$ cannot satisfy the definition of
a sufficient subset $S_{\delta}$.
Thus {\em any\/} subset $S'$ with size
$|S'| \leq 2^{N(H-\epsilon)}$ has probability less than $1-\delta$, so
by the definition of $H_\delta$, $H_{\delta}(X^N) > N ( H - \epsilon)$.
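One sufficient, though not the tightest, way to choose $N_0$ here is to make each of the two
terms in the bound smaller than $(1-\delta)/2$. This is a sketch under that particular splitting,
which is an assumption and not the only option. With $\b = \epsilon/2$,
\beq
	2^{-N\b} < \frac{1-\delta}{2}
	\:\: \Leftarrow \:\: N > \frac{2}{\epsilon} \log_2 \frac{2}{1-\delta} ,
	\hspace{0.3in}
	\frac{\sigma^2}{\b^2 N} < \frac{1-\delta}{2}
	\:\: \Leftarrow \:\: N > \frac{8 \sigma^2}{\epsilon^2 (1-\delta)} ,
\eeq
so any $N_0$ that exceeds both right-hand thresholds gives $P(\bx \in S') < 1 - \delta$ for all $N \geq N_0$.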
% this sentence used to be below at
% hereherehere
Thus for large enough $N$,
the function
$\frac{1}{N} H_{\delta}(X^N)$ is essentially a constant function of $\delta$,
for $0 < \delta < 1$,
as illustrated in figures \ref{fig.hd.10.1010}
and \ref{fig.Hd.schem}. \hfill $\Box$
\section{Comments}
The source coding theorem (\pref{thm.sct}) has two parts,
$\frac{1}{N} H_{\delta}(X^N) < H + \epsilon$,
and
$\frac{1}{N} H_{\delta}(X^N) >
H - \epsilon$.
% $H -\frac{1}{N} H_{\delta}(X^N)< \epsilon$.
Both results are interesting.
The first part tells us that even if the probability of
error $\delta$ is extremely small,
the
% average
number of bits per symbol
$\frac{1}{N} H_{\delta}(X^N)$ needed to specify a long $N$-symbol
string $\bx$ with vanishingly
small error probability does not
have to exceed $H+ \epsilon$ bits.
We need to have only a tiny tolerance for error, and the number of bits
required drops significantly from $H_0(X)$ to $(H + \epsilon)$.
What happens if we are yet more tolerant to compression errors? Part
2 tells us that even if $\delta$ is very close to 1, so that errors
are made most of the time, the average number of bits per symbol needed to
specify $\bx$ must still be at least $H - \epsilon$ bits. These two
extremes tell us that regardless of our specific allowance for error,
the number of bits per symbol needed to specify $\bx$ is
% boils down to
$H$ bits; no more and no less.
\medskip
% hereherehere
%In section 2.4.2 `$\epsilon$ can decrease with increasing $N$'. I'd prefer
%something like $N$ increases with decreasing $\epsilon$', since $N$
%depends on $\epsilon$ and not vice versa -- if I got it right.
% caution warning
\subsection{Caveat regarding `asymptotic equipartition'}
\label{sec.aep.caveat}
\index{caution!equipartition}I
put the words `asymptotic equipartition' in quotes because
it is important not to\index{asymptotic equipartition!why it is a misleading term}
% be misled into
think that the
elements of the typical set $T_{N\beta}$
really do have roughly the same
probability as each other. They are similar in probability only
in the sense that their values of $\log_2 \frac{1}{P(\bx)}$ are
within $2 N \beta$ of each other. Now, as $\beta$ is decreased,
how does $N$ have to increase, if we are to keep our bound on the
mass of the typical set,
$P(\bx \in T_{N\beta}) \geq 1 - \frac{\sigma^2}{\beta^2 N}$, constant?
% CHANGED 9802:
% Since $\beta$ can decrease
%scales
% with increasing
$N$ must grow as $1/ \beta^2$, so, if we write
$\beta$ in terms of
$N$ as $\alpha/\sqrt{N}$, for some constant $\alpha$, then
the most probable string in the typical set will be of order
$2^{\alpha \sqrt{N}}$ times greater than the least probable string in the
typical set. As $\beta$ decreases, $N$ increases,
and this ratio $2^{\alpha \sqrt{N}}$ grows exponentially.
Thus we have `equipartition' only in a weak sense!
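To make the size of this ratio concrete, here is a small worked bound, with illustrative
numbers that are assumptions. Two strings $\bx$ and $\bx'$ in $T_{N\beta}$ satisfy
\beq
	\left| \log_2 \frac{1}{P(\bx)} - \log_2 \frac{1}{P(\bx')} \right| < 2 N \beta
	\:\:\: \Rightarrow \:\:\:
	\frac{P(\bx)}{P(\bx')} < 2^{2 N \beta} = 2^{2 \alpha \sqrt{N}}
	\:\:\: \mbox{when $\beta = \alpha/\sqrt{N}$} .
\eeq
With $\alpha = 1$ and $N = 10^4$, for example, the definition of the typical set permits
probability ratios as large as $2^{200}$, even though $\beta = 0.01$ is small.
(The factor of 2 in the exponent can be absorbed into the constant $\alpha$.)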
% relative
\subsection{Why did we introduce the typical set?}
The best choice of subset for block compression is (by definition)
$S_{\delta}$, not a typical set. So why did we bother introducing
the typical set? The answer is, {\em we can count the typical set}.
We know that all its elements have `almost identical' probability ($2^{-NH}$),
and we know the whole set has probability almost 1, so the typical
set must have roughly $2^{NH}$ elements.
Without the help of the typical set (which is very similar
to $S_{\delta}$) it would have been
hard to count how many elements there are in $S_{\delta}$.
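 The counting argument can be written out as a pair of bounds; this is a sketch that
 reuses the Chebyshev bound on $P(\bx \in T_{N\b})$ from the proof.
 Every element of $T_{N\b}$ has probability between $2^{-N(H+\b)}$ and $2^{-N(H-\b)}$,
 and the total probability of $T_{N\b}$ lies between $1 - \frac{\sigma^2}{\b^2 N}$ and 1, so
\beq
	\left( 1 - \frac{\sigma^2}{\b^2 N} \right) 2^{N(H-\b)}
	\: \leq \: |T_{N\b}| \: < \: 2^{N(H+\b)} .
\eeq
 For large $N$ and small $\b$, $\frac{1}{N} \log_2 |T_{N\b}|$ is therefore pinned close to $H$.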
%\section{Summary and overview}
%\section{Where next}
% We have established that the entropy $H(X)$ measures
% the average information content of an ensemble.
%%
% In this chapter we discussed a lossy {block}-compression scheme that
% used large blocks of fixed size.
% In the next chapter we discuss variable length compression schemes that are
% practical for small block sizes and that are not lossy.
%%
%
\section{Exercises}
% weighing problems in here
% ITPRNN Problem 1a
%
\subsection*{Weighing problems}
%
\exercisaxB{1}{ex.weighexplain}{
While some people, when they first encounter
the
weighing problem with 12 balls and the three-outcome balance (\exerciseref{ex.weigh}),
think that weighing six balls against six balls is a good first weighing,
others say `no, weighing six against six conveys {\em no\/} information
at all'. Explain to the second group why they are both right and
wrong. Compute the information gained about {\em which is the
odd ball\/}, and the information gained about {\em which is the
odd ball and whether it is heavy or light}.
}
\exercisaxB{2}{ex.weighthirtynine}{
Solve the weighing problem for the case where there are 39 balls
of which one is known to be odd.
}
\exercisaxB{2}{ex.binaryweigh}{
You are given 16 balls, all of which are equal in weight except for
one that is either heavier or lighter. You are also given a bizarre
two-pan balance that can report only two outcomes: `the two sides balance'
or `the two sides do not balance'.
Design a
strategy to determine which is the odd ball {in as few uses of the balance
as possible}.
}
\exercisaxB{2}{ex.flourforty}{
You have a two-pan balance; your job is to weigh
out bags of flour with integer weights 1 to 40 pounds inclusive.
How many weights do you need? [You are allowed
to put weights on either pan. You're only allowed to
put one flour bag on the balance at a time.]
}
\exercissxC{4}{ex.twelve.generalize.weigh}{
\ben
\item% {ex.weigh}
Is it possible to solve \exerciseref{ex.weigh}
(the
weighing problem with 12 balls and the three-outcome balance)
using a sequence of three {\em fixed\/} weighings, such that the
balls chosen for the second weighing do not depend on the outcome of the first, and
the third weighing does not depend on the first or second?
\item
Find a solution to the general $N$-ball weighing problem in which exactly one of $N$
balls is odd.
Show that in $W$ weighings, an odd ball can be identified from among
$N = (3^W - 3 )/2$ balls.
%How large can $N$ be if you are allowed $W$ weighings?
% How are the weighings arranged in the case of the largest $N$?
\een
}
\exercisaxC{3}{ex.twelve.two.weigh}{
You are given 12 balls and the three-outcome balance
of \exerciseonlyref{ex.weigh}; this time, {\em two} of the balls are odd;
each odd ball may be heavy or light, and we don't know which.
We want to identify the odd balls and in which direction they are odd.
\ben
\item
{\em Estimate\/} how many weighings are required by the optimal strategy.
And what if there are three odd balls?
%\item
% How do your answers change if it is known in advance that
% the odd balls will all have the same bias (all heavy, or all light)?
\item
How do your answers change if it is known that all the regular balls
weigh 100\grams, that light balls weigh 99\grams, and heavy ones
weigh 110\grams?
\een
}
% end weighing
\subsection*{Source coding with a lossy compressor, with loss $\delta$}
\exercissxB{2}{ex.Hd46}{
% Let ${\cal P}_X = \{ 0.4,0.6 \}$. Sketch $\frac{1}{N} H_{\delta}(X^N)$
% as a function of $\delta$ for $N=1,2$ and 100.
Let ${\cal P}_X = \{ 0.2,0.8 \}$. Sketch $\frac{1}{N} H_{\delta}(X^N)$
as a function of $\delta$ for $N=1,2$ and 1000.
}
\exercisaxB{2}{ex.Hd55}{
Let ${\cal P}_Y = \{ 0.5,0.5 \}$. Sketch $\frac{1}{N} H_{\delta}(Y^N)$
as a function of $\delta$ for $N=1,2,3$ and 100.
}
\exercissxB{2}{ex.HdSB}{
(For \ind{physics} students.)
Discuss the
relationship
% similarities
between the proof of the \aep\ and the equivalence\index{entropy!Gibbs}\index{entropy!Boltzmann}
(for large systems) of the \ind{Boltzmann entropy} and the \ind{Gibbs entropy}.}
\subsection*{Distributions that don't obey the law of large numbers}
%
% Cauchy distbn here?
The \ind{law of large numbers}, which we used in this chapter,
shows that the mean of a set of $N$ i.i.d.\ random variables
has a probability distribution that becomes
% more concentrated
narrower, with width $\propto 1/\sqrt{N}$, as $N$ increases.
However, we have proved this property only for
discrete random variables, that is, for real numbers
taking on a {\em finite\/} set of possible values.
While many random variables
with continuous probability distributions also satisfy the
law of large numbers, there are important distributions that
do not. Some continuous distributions do not have
a mean or variance.
\exercissxB{3}{ex.cauchy}{
Sketch the \ind{Cauchy distribution}
\beq
P(x) = \frac{1}{Z} \frac{1}{x^2 + 1} , \:\:\:\: x \in (-\infty,\infty).
\eeq
What is its normalizing constant $Z$? Can you evaluate
its mean or variance?
Consider the sum $z=x_1 + x_2$, where $x_1$ and $x_2$ are independent
random variables from a Cauchy
distribution. What is $P(z)$? What is the probability
distribution of the mean of $x_1$ and $x_2$, $\bar{x}=(x_1+x_2)/2$?
What is the
probability
distribution of the mean of $N$ samples from this \ind{Cauchy distribution}?
}
%
\subsection{Other asymptotic properties}
% Levy flights too?
\exercisaxC{3}{ex.chernoff}{ {\sf\ind{Chernoff bound}.}
We derived the weak law of large numbers from Chebyshev's inequality\index{Chebyshev inequality}
(\ref{eq.cheb.1}) by letting the random variable $t$
in the inequality
$%\beq
P(t \geq \a) \:\leq\: \bar{t}/\a
%\label{eq.cheb.1a}
$
be a function, $t = (x-\bar{x})^2$,
of the random variable $x$ we were interested in.
Other useful inequalities can be obtained by using other
functions. The \ind{Chernoff bound}, which is useful\index{bound}
for bounding the \ind{tail}s of a distribution, is obtained by
letting $t = \exp( s x)$.
Show that
\beq
P( x \geq a ) \leq e^{-sa} g(s) , \:\:\:\mbox{ for any $s>0$ }
\eeq
and
\beq
P( x \leq a ) \leq e^{-sa} g(s) , \:\:\:\mbox{ for any $s<0$ }
\eeq
where $g(s)$ is the moment-generating function of $x$,
\beq
g(s) = \sum_x P(x) \, e^{sx} .
\eeq
%
% Hence show that if $z$ is a sum of $N$ random variables $x$,
%\beq
% P( z \geq a ) \leq
%\eeq
}
% end
%
\subsection*{Curious functions related to $p \log 1/p$}
% SOLN - BORDERLINE
\exercissxE{4}{ex.fxxxxx}{
This exercise has {no purpose at all}; it's included
for the enjoyment of those who like mathematical curiosities.
Sketch the function
\beq
f(x) = x^{x^{x^{x^{x^{\cdot^{\cdot^{\cdot}}}}}}}
% f(x) = x^{x^{x^{x^{x^{\ddots}}}}}
\eeq
for $x \geq 0$.
% To be explicit about the order in which the powers are evaluated,
% here's another definition of $f$:
%\beq
% f(x) = x^{\left(x^{\left(x^{\cdot^{\cdot^{\cdot}}}\right)}\right)}
%\eeq
{\sf Hint:}
Work out the inverse function to $f$ -- that is, the function $g(y)$
such that if $x=g(y)$ then $y=f(x)$ -- it's closely related to
$p \log 1/p$.
% {\sf Hints:}
%\ben
%\item Consider $f(\sqrt{2})$:
% you might be able to persuade yourself
% that $f(\sqrt{2})=2$. You might also be able
% to persuade yourself that $f(\sqrt{2})=4$. What's going on?
% [Yes, a two-valued function.]
%\item
% For a given $x$, if $f(x)=y$, then we have $y = x^{y}$, so
% $y$ is found at the intersection of the curves $u_1(y)=x^y$ and $u_2(y)=y$.
%\item
% Work out the inverse function to $f$ -- that is, the function $g(y)$
% such that if $x=g(y)$ then $y=f(x)$ -- hint: it's closely related to
% $p \log 1/p$.
%\een
}
\dvips
%\chapter{The Source Coding Theorem (old version of this Chapter)}
%\label{ch.two.old}
%\input{tex/_l2old.tex}
%\dvips
\section{Solutions}% to Chapter \protect\ref{ch.two}'s exercises}
\fakesection{_s2}
% chapter 2
% ex 39...
%
\soln{ex.Hadditive}{
Let $P(x,y)=P(x)P(y)$.
Then
\beqan
H(X,Y) &=& \sum_{xy} P(x)P(y) \log \frac{1}{P(x)P(y)} \\
& = & \sum_{xy} P(x)P(y) \log \frac{1}{P(x)}
+ \sum_{xy} P(x)P(y) \log \frac{1}{ P(y)} \\
&=& \sum_{x} P(x) \log \frac{1}{P(x)} +
\sum_{y} P(y) \log \frac{1}{ P(y)} \\
&=& H(X) + H(Y) .
\eeqan
}
%
\soln{ex.ascii}{
An ASCII file can be reduced in size by a factor of 7/8. This reduction
could be achieved by a block code that maps 8-byte blocks
into 7-byte blocks by copying the
% . The mapping would copy
56 information-carrying bits into
7 bytes, and ignoring the last bit of every character.
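 As a check on the arithmetic, assuming (as the exercise does) that each 8-bit character
 carries 7 information-carrying bits: a block of 8 characters occupies $8 \times 8 = 64$ bits,
 of which $8 \times 7 = 56$ carry information, and 56 bits fit exactly into $56/8 = 7$ bytes,
 giving the ratio
\beq
	\frac{56}{64} = \frac{7}{8} .
\eeq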
}
\soln{ex.compress.possible}{
% Theorem:
% No program can compress without loss *all* files of size >= N bits, for
% any given integer N >= 0.
%
%Proof:
% Assume that the program can compress without loss all files of size >= N
% bits. Compress with this program all the 2^N files which have exactly N
% bits. All compressed files have at most N-1 bits, so there are at most
% (2^N)-1 different compressed files [2^(N-1) files of size N-1, 2^(N-2) of
% size N-2, and so on, down to 1 file of size 0]. So at least two different
% input files must compress to the same output file. Hence the compression
% program cannot be lossless.
%
%The proof is called the "counting argument". It uses the so-called
The \ind{pigeon-hole principle}
states: you can't put 16 pigeons into 15 holes without using one of the
holes twice.
 Similarly, you can't give $|\A_X|$ outcomes unique
binary names of some length $l$
shorter than $\log_2 |\A_X|$ bits, because there are only $2^l$
such binary names, and $l < \log_2 |\A_X|$ implies $2^l < |\A_X|$,
so at least two different inputs to the compressor would compress to
the same output file.
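 For a concrete instance, with numbers chosen purely for illustration:
 if $|\A_X| = 16$ then $\log_2 |\A_X| = 4$, and names of length $l = 3$ offer only
\beq
	2^{l} = 2^{3} = 8 < 16
\eeq
 distinct strings. Allowing all names of length at most 3, including the empty string,
 does not rescue the scheme, since $\sum_{k=0}^{3} 2^k = 15 < 16$.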
}
\soln{ex.cusps}{
Between the cusps, all the changes in
probability are equal, and the number of elements
in $T$ changes by one at each step. So $H_{\delta}$
varies logarithmically with $(-\delta)$.
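 To spell this out, here is a sketch in which the symbols $q$, $n_0$ and $\delta_0$ are introduced
 only for illustration. Assume every outcome discarded between two cusps has the same probability $q$,
 and that at the left-hand end of the segment, $\delta = \delta_0$, the subset has $n_0$ elements
 and total probability exactly $1 - \delta_0$. Raising $\delta$ by $q$ then allows exactly one more
 element to be dropped, so
\beq
	H_{\delta} = \log_2 \left( n_0 - \left\lfloor \frac{\delta - \delta_0}{q} \right\rfloor \right)
	\simeq \log_2 \left( n_0 - \frac{\delta - \delta_0}{q} \right) ,
\eeq
 the logarithm of a linearly decreasing function of $\delta$.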
% NEEDS WORK!
}
%
% Another solution from Conway:
% Label them
% F AM NOT LICKED
% then use these divisions
% MA DO LIKE
% ME TO FIND
% FAKE COIN
%
%\soln{ex.twelve.generalize.weigh}{
% Thu, 28 Jan 1999 19:19:30 -0500 (EST)
% From:
%
\begin{Sexercise}{ex.twelve.generalize.weigh}
This solution was found by Dyson and Lyness in 1946
and presented in the following elegant form by
{John Conway}\index{Conway, John H.} in 1999.
% \footnote{Posting to {\tt{geometry-puzzles@forum.swarthmore.edu}}
% Thu, 28 Jan 1999.
%}
%
Be warned: the symbols A, B, and C are used to name the
balls, to name the pans of the balance,
to name the outcomes, and to name
the possible states of the odd ball!
\ben%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% enumerate 1
\item
Label the 12 balls by the sequences
%
% verbatim not allowed in the argument of a command
%
{\small
\begin{verbatim}
AAB ABA ABB ABC BBC BCA BCB BCC CAA CAB CAC CCA
\end{verbatim}
}
and in the
{\small
\begin{verbatim}
1st AAB ABA ABB ABC BBC BCA BCB BCC
2nd weighings put AAB CAA CAB CAC in pan A, ABA ABB ABC BBC in pan B.
3rd ABA BCA CAA CCA AAB ABB BCB CAB
\end{verbatim}
}
Now in a given weighing, a pan will either end up in the
\bit
\item
{\tt C}anonical position ({\tt C}) that it assumes when the pans are balanced, or
\item
{\tt A}bove that position ({\tt A}), or
\item
{\tt B}elow it ({\tt B}),
\eit
so the three weighings determine for each pan a sequence of three of these letters.
If both sequences are {\tt CCC}, then there's no odd ball. Otherwise,
for {\em just one\/} of the two pans, the sequence is among the 12 above,
and names the odd ball, whose weight is {\tt A}bove or {\tt B}elow the proper
one according as the pan is {\tt A} or {\tt B}.
\item
In $W$ weighings the odd ball can be identified from
among
\beq
N = (3^W - 3 )/2
\eeq
balls in the same way, by labelling them with all
the non-constant sequences of $W$ letters from {\tt A}, {\tt B}, {\tt C} whose
first change is A-to-B or B-to-C or C-to-A, and at the
$w$th weighing putting those whose $w$th letter is {\tt A} in pan {\tt A}
and those whose $w$th letter is {\tt B} in pan {\tt B}.
\een
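As a check on the count, offered as a sketch: there are $3^W$ sequences of $W$ letters,
of which 3 are constant. Swapping the letters {\tt A} and {\tt B} throughout maps each
non-constant sequence whose first change is A-to-B, B-to-C or C-to-A onto one whose
first change is B-to-A, A-to-C or C-to-B, and vice versa, so exactly half of the
$3^W - 3$ non-constant sequences are used as labels:
\beq
	N = \frac{3^W - 3}{2} : \hspace{0.2in}
	W = 3 \Rightarrow N = 12 , \hspace{0.2in}
	W = 4 \Rightarrow N = 39 , \hspace{0.2in}
	W = 5 \Rightarrow N = 120 .
\eeq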
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%}
\end{Sexercise}
% {ex.twelve.two.weigh}{
% removed old solution to graveyard Tue 4/3/03
\soln{ex.Hd46}{% ex 42
% hd.p p=0.2 mmin=1 mmax=2 mstep=1 scale_by_n=1 plot_sub_graphs=1 | gnuplot
% hd.p p=0.2 mmin=2 mmax=2 mstep=1 scale_by_n=1 plot_sub_graphs=1 | gnuplot
% hd.p p=0.2 mmin=100 mmax=100 mstep=1 suppress_early_detail=1 scale_by_n=1 plot_sub_graphs=1 | gnuplot
% hd.p p=0.2 mmin=1000 mmax=1000 mstep=1 suppress_early_detail=1 scale_by_n=1 plot_sub_graphs=1 hd=figs/hd0.2 | gnuplot
%# gnuplot < gnu/Hd0.2.gnu
%#45:coll:/home/mackay/itp/Hdelta> gv figs/hd0.2/all.1.100.ps
The curves $\frac{1}{N} H_{\delta}(X^N)$
as a function of $\delta$ for $N=1,2$ and 1000 are shown in \figref{fig.hd.1.100}.
% and table \ref{tab.Hdelta.0.4}.
Note that $H_2(0.2) = 0.72$ bits.
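 These numbers can be checked by hand; the following is a short worked calculation, rounded to two decimal places.
\beq
	H_2(0.2) = 0.2 \log_2 \frac{1}{0.2} + 0.8 \log_2 \frac{1}{0.8}
	\simeq 0.2 \times 2.32 + 0.8 \times 0.32 \simeq 0.72 \mbox{ bits} .
\eeq
 For $N=2$ the four outcomes have probabilities $0.64$, $0.16$, $0.16$ and $0.04$,
 so outcomes can be discarded one at a time as $\delta$ passes $0.04$, $0.2$ and $0.36$;
 with three outcomes retained, $\frac{1}{N} H_{\delta} = \frac{1}{2} \log_2 3 \simeq 0.79$.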
\begin{figure}[htbp]
%\figuremargin{%
\figuredanglenudge{%
\begin{center}
\begin{tabular}[t]{rl}
\begin{tabular}[t]{l}\vspace{0in}\\% alignment hack
\mbox{\psfig{figure=Hdelta/figs/hd0.2/all.1.100.ps,%
width=60mm,angle=-90}}
\end{tabular}
%
\hspace{0in}
&
%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{tabular}[t]{r@{--}lcc} \toprule
\multicolumn{4}{c}{$N=1$} \\ \midrule
% delta 1/N Hdelta 2^{Hdelta}
\multicolumn{2}{c}{$\delta$} & $\frac{1}{N} H_{\delta}(\bX)$ & $2^{H_{\delta}(\bX)}$
% raise the roof!
% {\rule[-3mm]{0pt}{8mm}}
\\ \midrule
0 & 0.2 & 1 & 2 \\
0.2 & 1 & 0 & 1 \\ \bottomrule
\end{tabular}
\hspace{0.1in}
\begin{tabular}[t]{r@{--}lcc} \toprule% {r@{--}lcc}
\multicolumn{4}{c}{$N=2$} \\ \midrule
% delta 1/N Hdelta 2^{Hdelta}
\multicolumn{2}{c}{$\delta$} & $\frac{1}{N} H_{\delta}(\bX)$ & $2^{H_{\delta}(\bX)}$
% raise the roof!
% {\rule[-3mm]{0pt}{8mm}}
\\ \midrule
0 & 0.04 & 1 & 4 \\
0.04 & 0.2 & 0.79 & 3 \\ % was 0.792\,48
0.2 & 0.36 & 0.5 & 2 \\
0.36 & 1 & 0 & 1 \\ \bottomrule
\end{tabular}\\
\end{tabular}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\end{center}
}{%
\caption[a]{$\frac{1}{N} H_{\delta}(\bX)$ (vertical axis) against $\delta$ (horizontal),
 for $N=1, 2, 100$ binary variables with $p_1=0.2$.}
\label{fig.hd.1.100}
\label{tab.Hdelta.0.4}
}{0.25in}
\end{figure}
%\begin{table}[htbp]
%\figuremargin{%
%\begin{center}
%\end{center}
%}{%
%\caption[a]{Values of $\frac{1}{N} H_{\delta}(\bX)$ against $\delta$.}
%% add 0.4 to this caption
%\label{tab.Hdelta.0.4}
%}
%\end{table}
%
}
\soln{ex.HdSB}{
The Gibbs entropy is $\kB \sum_i p_i \ln \frac{1}{p_i}$, where $i$
runs over all states of the system. This entropy is equivalent (apart from the factor of $\kB$)
to the Shannon entropy of the ensemble.
Whereas the Gibbs entropy can be
defined for any ensemble, the Boltzmann entropy is only
defined for {\dem microcanonical\/} ensembles, which
have a probability distribution that is uniform over a
set of accessible states.
The Boltzmann entropy is defined to be $S_{\rm B} = \kB \ln \Omega$
where $\Omega$ is the number of accessible states
of the microcanonical ensemble. This is equivalent
(apart from the factor of $\kB$) to the perfect information content
$H_0$ of that constrained
ensemble. The Gibbs entropy of a microcanonical
ensemble is trivially equal to the Boltzmann entropy.
We now consider a \ind{thermal distribution} (the
{\dem\ind{canonical}\/} ensemble),
where the probability of a state $\bx$ is
\beq
% P(\bx) =\frac{1}{Z} \exp( - \beta E(\bx) )?
P(\bx) =\frac{1}{Z} \exp\left( - \frac{ E(\bx) }{\kB T} \right) .
\eeq
With this canonical ensemble we can associate a
corresponding microcanonical ensemble,
% typically
% usually
an ensemble
with total energy fixed to the mean
energy of the canonical ensemble
(fixed to within some precision $\epsilon$).
% Recalling that under the
% thermal distribution (the canonical ensemble) we see that
 Now, since $\ln \dfrac{1}{P(\bx)} = \dfrac{E(\bx)}{\kB T} + \ln Z$,
 fixing the total energy to a precision $\epsilon$ is equivalent to
 fixing the value of $\ln \dfrac{1}{P(\bx)}$ to within
 $\epsilon / (\kB T)$.
Our definition of the typical set
$T_{N \beta}$ was precisely that it consisted of all elements that
have a value of $\log P(\bx)$ very close to the mean value
of $\log P(\bx)$ under the canonical ensemble, $- N H(X)$.
Thus the microcanonical ensemble is equivalent to
a uniform distribution over
% constraining the state $\bx$ to be in
the typical set of the canonical ensemble.
Our proof of the \aep\ thus proves -- for the
case of a system whose energy is separable into a sum of independent
terms -- that the
Boltzmann entropy of the microcanonical ensemble
is very close (for large $N$) to the Gibbs entropy of
the canonical ensemble, if the energy of the microcanonical
ensemble is constrained to equal the mean energy of the
canonical ensemble.
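 In symbols, as a sketch that measures $H(X)$ in bits, takes $\Omega = |T_{N\beta}|$, and assumes
 $N$ independent, identically distributed components (all conventions adopted here for illustration):
 the Gibbs entropy is $S_{\rm G} = \kB \ln 2 \: N H(X)$, while, whenever
 $P(\bx \in T_{N\beta}) \geq 1 - \delta$, counting the typical set gives
\beq
	\kB \ln 2 \left[ N ( H(X) - \beta ) + \log_2 (1 - \delta) \right]
	\: \leq \: S_{\rm B} = \kB \ln |T_{N\beta}|
	\: < \: \kB \ln 2 \: N ( H(X) + \beta ) ,
\eeq
 so the ratio $S_{\rm B}/S_{\rm G}$ approaches 1 for large $N$ and small $\beta$.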
}
\soln{ex.cauchy}{
The normalizing constant of the \ind{Cauchy distribution}\index{distribution!Cauchy}
\[
P(x) = \frac{1}{Z} \frac{1}{x^2 + 1}
\]
is
\beq
Z = \int^{\infty}_{-\infty} \d x \: \frac{1}{x^2 + 1}
= \left[ {\tan}^{-1} x \right]^{\infty}_{-\infty} = \frac{\pi}{2} - \frac{-\pi}{2} = \pi .
\eeq
The mean and variance of this distribution are both undefined. (The distribution
is symmetrical about zero, but this does not imply that its mean is zero. The mean
is the value of a divergent integral.)
% ; depending what limiting procedure we
% define to evaluate this integral we
The sum $z=x_1 + x_2$, where $x_1$ and $x_2$ both
have Cauchy distributions, has probability density given by the convolution
\beq
P(z) = \frac{1}{\pi^2} \int^{\infty}_{-\infty} \d x_1 \:
\frac{1}{x_1^2 + 1}
\frac{1}{(z-x_1)^2 + 1}
% P(x1,x2) delta [z=x1+x2] .. -> x2 = z-x1
,
\eeq
% Introducing $\Delta \equiv x_1-x_2$ this can be written more symmetrically
% as
% \beq
% P(z) = \frac{1}{\pi^2} \int^{\infty}_{-\infty} \d \Delta \:
% \eeq
which after a considerable labour using standard methods
%\footnote{Can anyone
% give me an elegant solution?}
gives
\beq
P(z) = \frac{1}{\pi^2} 2 \frac{\pi}{z^2+4} = \frac{2}{\pi} \frac{1}{z^2+2^2} ,
\label{eq.cauchysum}
\eeq
which we recognize as a Cauchy distribution with width parameter 2
(where the original distribution has width parameter 1).
This implies that the mean of the two points, $\bar{x} = (x_1+x_2)/2 = z/2$,
has a Cauchy distribution with width parameter 1. Generalizing, the mean
of $N$ samples from a Cauchy distribution is Cauchy-distributed
with the {\em same parameters\/} as the individual samples. The probability
distribution of the mean does {\em not\/} become narrower
as $1/\sqrt{N}$.
{\em The \ind{central-limit theorem} does not apply to the \ind{Cauchy distribution},
because it does not have a finite \ind{variance}.}
An alternative neat method for getting to \eqref{eq.cauchysum} makes
use of the \ind{Fourier transform}\index{generating function}
of the Cauchy distribution, which is
a \index{biexponential distribution}{biexponential} $e^{-|\omega|}$. Convolution in real space
corresponds to multiplication in Fourier space,
so the \ind{Fourier transform} of $z$ is simply $e^{-|2 \omega|}$.
Reversing the transform, we obtain \eqref{eq.cauchysum}.
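 The same argument handles the mean of $N$ samples; the following is a sketch using the
 transform quoted above. The transform of the sum $z = x_1 + \cdots + x_N$ is the product of
 $N$ factors $e^{-|\omega|}$, and rescaling $z$ to $\bar{x} = z/N$ rescales $\omega$ by $1/N$:
\beq
	\left( e^{-|\omega|} \right)^N = e^{-N |\omega|}
	\:\:\: \longrightarrow \:\:\:
	e^{-N |\omega/N|} = e^{-|\omega|} ,
\eeq
 which is the transform of the original Cauchy distribution.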
}
%\begincuttable
\soln{ex.fxxxxx}{
\amarginfig{t}{
\begin{center}
\begin{tabular}{c}
\psfig{figure=gnu/fxxxxx50.ps,width=1.7in,angle=-90}\\
\psfig{figure=gnu/fxxxxx5.ps,width=1.7in,angle=-90}\\
\psfig{figure=gnu/fxxxxx.5.ps,width=1.7in,angle=-90}\\
\end{tabular}
\end{center}
%}{% gnu: load 'fxxxxx.gnu'
\caption[a]{
% The function
$\displaystyle
f(x) = x_{\:,}^{x^{x^{x^{x^{\cdot^{\cdot^{\cdot}}}}}}}
$ shown at three different scales.}
\label{fig.xxxxx}
}%
The function $f(x)$
%\beq
% f(x) = x^{x^{x^{x^{x^{\ddots}}}}}
%\eeq
has inverse function
% to $f$ is
\beq
g(y) = y^{1/y}.
\eeq
Note
\beq
 \log g(y) = \frac{1}{y} \log y .
\eeq
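 A brief justification, treating $f$ as a solution of its own defining recursion
 (an assumption that the discussion of limits below qualifies):
 if $y = f(x)$ then $y = x^y$, so $x = y^{1/y} = g(y)$. Two quick checks are
\beq
	g(2) = 2^{1/2} = \sqrt{2} , \hspace{0.3in} g(4) = 4^{1/4} = \sqrt{2} ,
\eeq
 and the largest value that $g$ attains follows from
\beq
	\frac{{\rm d}}{{\rm d} y} \ln g(y) = \frac{1 - \ln y}{y^2} = 0
	\:\: \mbox{ at } \:\: y = e ,
	\hspace{0.3in} g(e) = e^{1/e} \simeq 1.44 .
\eeq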
I obtained a tentative graph of $f(x)$ by plotting $g(y)$ with
$y$ along the vertical axis and $g(y)$ along the horizontal
axis. The resulting graph suggests that $f(x)$
is single valued for $x \in (0,1)$, and looks surprisingly well-behaved
and ordinary; for $x \in (1, e^{1/e})$, $f(x)$ is two-valued.
$f(\sqrt{2})$ is equal both to 2 and 4.
For $x > e^{1/e}$ (which is about 1.44), $f(x)$ is infinite.
% undefined.
However, it might be argued that this approach to sketching $f(x)$
is only partly valid, if we define $f$ as the limit of the
sequence of functions $x$,
$x^x$, $x^{x^x}, \ldots$;
this sequence does not
have a limit for
% , below
% pr (1.0/exp(1.0))**exp(1.0)
% 0.0659880358453126
$0 \leq x \leq (1/e)^e \simeq 0.07$
on account of a pitchfork \ind{bifurcation} at $x=(1/e)^e$;
and for $x \in (1,e^{1/e})$, the sequence's limit is single-valued --
the lower of the two values sketched in the figure.
% load 'fxxxxx.gnu2'
%
}
%\endcuttable
\dvipsb{solutions source coding}
\prechapter{About Chapter}
\fakesection{intro for chapter 3}
In the last chapter, we saw a proof of the fundamental status of the entropy
as a measure of average information content.
We defined a data compression scheme using
{\em fixed length block codes}, and
proved that as $N$ increases,
it is possible to encode $N$ i.i.d.\ variables
$\bx = (x_1,\ldots,x_N)$ into a block of $N(H(X)+\epsilon)$ bits
with vanishing probability of error, whereas if we attempt to
encode $X^N$ into $N(H(X)-\epsilon)$ bits, the probability of
error is virtually 1.
We thus verified the {\em possibility\/} of
data compression, but the block coding defined in the proof
did not give a practical algorithm.
In this chapter and the next,
we study practical data compression algorithms.
Whereas the last chapter's compression scheme
used large blocks of {\em fixed\/} size and was
{\em lossy}, in the next chapter we discuss
{\em variable-length\/} compression schemes that are
practical for small block sizes and that are {\em not lossy}.
Imagine a rubber glove filled with water. If we compress two
fingers of the glove, some other part of the glove has
to expand, because
the total volume of water is constant. (Water is essentially
incompressible.) Similarly, when we shorten
the codewords for some outcomes, there must be other
codewords that get longer, if the scheme is not lossy.
In this chapter we will discover the information-theoretic
equivalent of water volume.
% the constant volume of water in the glove.
%%
\medskip
\fakesection{prerequisites for chapter 3}
Before reading \chref{ch.three}, you should have worked on
\extwenty.
\medskip
We will use the\index{notation!intervals}
following notation for intervals:\medskip
% the statement
\begin{center}
\begin{tabular}{ll}
$x \in [1 ,2)$ & means that $x \geq 1$ and $x < 2$; \\
% the statement
$x \in (1 ,2]$ & means that $x > 1$ and $x \leq 2$.\\
\end{tabular}
\end{center}
% {All these definitions of source
% codes, Huffman codes, etc., can be generalized to codes over
% other $q$-ary alphabets, but little is lost by concentrating on
% the binary case.}
%\chapter{Data Compression II: Symbol Codes}
\mysetcounter{page}{102}
\ENDprechapter
\chapter{Symbol Codes}
\label{ch.three}
% %.tex
% \documentstyle[twoside,11pt,chapternotes,lsalike]{itchapter}
% \begin{document}
% \bibliographystyle{lsalike}
% \input{psfig.tex}
% \include{/home/mackay/tex/newcommands1}
% \include{/home/mackay/tex/newcommands2}
% \input{itprnnchapter.tex}
% \setcounter{chapter}{2}% set to previous value
% \setcounter{page}{34} % set to current value
% \setcounter{exercise_number}{45} % set to imminent value
% %
% \renewcommand{\bs}{{\bf s}}
% \newcommand{\eq}{\mbox{$=$}}
% \chapter{Data Compression II: Symbol Codes}
% % \section*{Source Coding: Lossless data compression with symbol codes}
% % Practical source coding
\label{ch3}
%\section{Symbol codes}
In this chapter, we discuss
{\dem variable-length symbol codes\/}\indexs{symbol code},\index{source code!symbol code}
% , variable-length},
which encode one source symbol at a time, instead of encoding huge strings of
$N$ source symbols. These codes are
{\dem lossless:}
unlike the last chapter's block codes, they are guaranteed to
compress and decompress without
any errors; but there is a chance that the codes may sometimes produce
encoded strings longer than the original source string.
The idea is that we can achieve compression, on average,
by assigning {\em shorter\/} encodings to the more
probable outcomes and {\em longer\/} encodings to the less probable.
The key issues are:
\begin{description}
\item[What are the implications if a symbol code is {\em lossless\/}?]
If some codewords are shortened, by how much do other codewords
have to be lengthened?
\item[Making compression practical\puncspace]
How can we ensure that a symbol code is easy to decode?
\item[Optimal symbol codes\puncspace]
How should we assign codelengths to achieve the best
compression, and what is the best achievable compression?
\end{description}
We again verify the
fundamental status of the Shannon \ind{information content}
and the entropy, proving:\index{source coding theorem}
%
%
\begin{description}
\item[Source coding theorem (symbol codes)\puncspace]
There exists a variable-length encoding $C$ of an ensemble
$X$ such that the average length of an encoded symbol,
$L(C,X)$, satisfies
$L(C,X) \in \left[ H(X) , H(X) + 1 \right)$.
The average length is equal to the entropy $H(X)$ only if the codelength
for each outcome is equal to its \ind{Shannon information content}.
\end{description}
%
We will also define a constructive procedure, the
\index{Huffman code}Huffman
coding algorithm, that produces optimal symbol codes.\index{symbol code!optimal}\index{source code!symbol code!optimal}
\begin{description}
\item[Notation for alphabets\puncspace] $\A^N$ denotes the set of
ordered $N$-tuples of elements from the set $\A$, \ie,
all strings of length $N$.
The symbol $\A^+$ will denote the set of all strings of finite
length composed of elements from the set $\A$.
\end{description}
\exampla{ $\{{\tt{0}},{\tt{1}}\}^3 = \{{\tt{0}}{\tt{0}}{\tt{0}},{\tt{0}}{\tt{0}}{\tt{1}},{\tt{0}}{\tt{1}}{\tt{0}},{\tt{0}}{\tt{1}}{\tt{1}},{\tt{1}}{\tt{0}}{\tt{0}},{\tt{1}}{\tt{0}}{\tt{1}},{\tt{1}}{\tt{1}}{\tt{0}},{\tt{1}}{\tt{1}}{\tt{1}}\}$. }
\exampla{
$\{{\tt{0}},{\tt{1}}\}^+ = \{ {\tt{0}} , {\tt{1}} , {\tt{0}}{\tt{0}} , {\tt{0}}{\tt{1}} , {\tt{1}}{\tt{0}} , {\tt{1}}{\tt{1}} , {\tt{0}}{\tt{0}}{\tt{0}} , {\tt{0}}{\tt{0}}{\tt{1}} , \ldots \}$.
}
% This notation is borrowed from the standard notation for expressions
% in computer science
\section{Symbol codes}
\label{sec.symbol.code.intro}
\begin{description}
\item[A (binary) symbol code]
$C$ for an ensemble $X$ is a mapping from the range of $x$,
$\A_X \eq \{a_1,\ldots, $ $a_I\}$, to $\{{\tt{0}},{\tt{1}}\}^+$.
% a set of finite length strings of symbols
% from an alphabet (NAME?).
$c(x)$ will denote the {\dem{codeword}\/}\indexs{symbol code!codeword}
corresponding to $x$,
and $l(x)$ will denote its length, with $l_i = l(a_i)$.
The {\dem \inds{extended code}\/} $C^+$
is a mapping from $\A_X^+$ to $\{{\tt{0}},{\tt{1}}\}^+$
obtained by concatenation, without punctuation, of the
corresponding codewords:\index{concatenation!in compression}
\beq
c^+(x_1 x_2 \ldots x_N) = c(x_1)c(x_2)\ldots c(x_N) .
\eeq
[The term `\ind{mapping}' here is a synonym for `function'.]
\end{description}
\exampla{
A symbol code for the ensemble
$X$ defined by
\beq
\begin{array}{*{4}{c}*{5}{@{\,}c}}
& \A_X & = & \{ & {\tt a}, & {\tt b}, & {\tt c}, & {\tt d} & \} , \\
& \P_X & = & \{ & \dhalf, & \dquarter, & \deighth, & \deighth & \},
\end{array}
\eeq
% : \A_X = \{{\tt{a}},{\tt{b}},{\tt{c}},{\tt{d}}\},$ $\P_X = \{ \dhalf,\dquarter,\deighth,\deighth \}$
is $C_0$, shown in the margin.
% = \{ {\tt{1}}{\tt{0}}{\tt{0}}{\tt{0}}, {\tt{0}}{\tt{1}}{\tt{0}}{\tt{0}}, {\tt{0}}{\tt{0}}{\tt{1}}{\tt{0}}, {\tt{0}}{\tt{0}}{\tt{0}}{\tt{1}}\}$.
\marginpar{
\begin{center}
$C_0$:
\begin{tabular}{clc} \toprule
$a_i$ & $c(a_i)$ & $l_i$
% {\rule[-3mm]{0pt}{8mm}}%strut
\\ \midrule
{\tt a} & {\tt 1000} & 4 \\
{\tt b} & {\tt 0100} & 4 \\
{\tt c} & {\tt 0010} & 4 \\
{\tt d} & {\tt 0001} & 4 \\
\bottomrule
\end{tabular}
\end{center}
}
Using the extended code, we may encode ${\tt{acdbac}}$
as
\beq
c^{+}({\tt{acdbac}}) =
{\tt{1000}}
{\tt{0010}}
{\tt{0001}}
{\tt{0100}}
{\tt{1000}}
{\tt{0010}} .
\eeq
}
 There are three basic requirements for a useful symbol code.
 First, any encoded string must have a unique decoding.
 Second, the symbol code must be easy to decode.
 Third, the code should achieve as much compression as possible.
\subsection{Any encoded string must have a unique decoding}
\begin{description}
\item[A code $C(X)$ is uniquely decodeable] if, under the
extended code $C^+$, no two distinct
strings have the same encoding,
% every element of $\A_X^+$ maps into a different string,
\ie,
\beq
\forall \, \bx,\by \in \A_X^+, \:\: \bx \not = \by \:\: \Rightarrow \:\:
c^+(\bx) \not = c^+(\by).
\label{eq.UD}
\eeq
%cnp22@maths.cam.ac.uk:
% I'm missing the word `injectivity'. This would explain, why
% (3.2) is necessary for an inverse function.
%
% {\em I believe mathematicians would put it this way:
% a code is uniquely decodeable if the extended code is an injective
% mapping.}
\end{description}
The code $C_0$ defined above is an example of a uniquely decodeable
code.
\subsection{The symbol code must be easy to decode}
A symbol code
is easiest to decode if it is possible to identify the end of a
codeword as soon as it arrives, which means that no codeword can
be a {\dem{prefix}\/} of another codeword.
%
% {\em (Need a defn of a prefix here.)}
%\marginpar{\footnotesize
% [A word $c$
%% \in \A^{+}$
% is a {\dem prefix\/} of another word $d$
%% \in \A^{+}$
% if there exists a tail string $t$
%% \in \A^{*}
% such that the concatenation $ct$ is
% identical to $d$. For example, {\tt 1} is a prefix of {\tt 101},
% and so is {\tt 10}.]
%}
[A word $c$
% \in \A^{+}$
is a {\dem prefix\/} of another word $d$
% \in \A^{+}$
if there exists a tail string $t$
% \in \A^{*}
such that the concatenation $ct$ is
identical to $d$. For example, {\tt 1} is a prefix of {\tt 101},
and so is {\tt 10}.]
%
We will show later that we don't lose
any performance if we constrain our symbol code to be
a prefix code.
\begin{description}
\item[A symbol code is called a \inds{prefix code}]
if no codeword is a prefix of
any other codeword.
A prefix code is also known as an {\dem\ind{instantaneous}\/}
or {\dem\ind{self-punctuating}\/}
code, because an encoded string can be decoded
from left to right without looking ahead to subsequent
codewords. The end of a codeword is immediately recognizable.
A prefix code is uniquely decodeable.
\end{description}
\begin{aside}
{Prefix codes are also
% is more accurately called
known as `prefix-free codes' or `prefix condition codes'.}
\end{aside}
Prefix codes correspond to trees.
\exampla{
\amarginfignocaption{t}{\mbox{\small$C_1$ \psfig{figure=figs/C1.ps,angle=-90,width=1in}}}
The code $C_1 = \{ {\tt{0}} , {\tt{1}}{\tt{0}}{\tt{1}} \}$ is a prefix code because
${\tt{0}}$ is not a prefix of {\tt{1}}{\tt{0}}{\tt{1}}, nor is {\tt{1}}{\tt{0}}{\tt{1}} a prefix of {\tt{0}}.
}
\exampla{
Let $C_2 = \{ {\tt{1}} , {\tt{1}}{\tt{0}}{\tt{1}} \}$. This code is not a prefix code because
${\tt{1}}$ is a prefix of {\tt{1}}{\tt{0}}{\tt{1}}.
}
\exampla{
% \marginpar[t]{\mbox{\small\raisebox{0.4in}[0in][0in]{$C_3$} \psfig{figure=figs/C3.ps,angle=-90,width=1in}}}
The code $C_3 = \{
{\tt 0} ,
{\tt 10} ,
{\tt 110} ,
{\tt 111}
\}$
is a prefix code.
%
}
%%%%%%%%%%%%%%%
\exampla{
\amarginfignocaption{t}{\mbox{\small\raisebox{0.4in}[0in][0in]{$C_3$} \psfig{figure=figs/C3.ps,angle=-90,width=1in}}\\[0.21in]
\mbox{\small%
\raisebox{0.2in}[0in][0in]{$C_4$} \psfig{figure=figs/C4.ps,angle=-90,width=0.681in}%
}\\[0.125in]
\small\raggedright
Prefix codes can be represented on binary trees. {\dem Complete\/} prefix codes
correspond to binary trees with no unused branches. $C_1$ is an incomplete code.}
The code $C_4 = \{
{\tt 00} ,
{\tt 01} ,
{\tt 10} ,
{\tt 11}
\}$
is a prefix code.
%
}
%%%%%%%%%%%%%%%
\exercissxA{1}{ex.C1101}{
Is $C_2$ uniquely decodeable?
}
%
% example
%
% morse code with spaces stripped out. Is it a prefix code? Is it UD?
% (no,no)
%
\exampla{
% ref corrected 9802
Consider \exerciseref{ex.weigh} and \figref{fig.weighing} (\pref{fig.weighing}).
Any weighing strategy that identifies the odd ball and whether it
is heavy or light can be viewed as assigning a {\em ternary\/}
code to each of the 24 possible states.
This code is a prefix code.
}
\subsection{The code should achieve as much compression as possible}
\begin{description}
\item[The expected length $L(C,X)$] of a symbol code $C$ for ensemble $X$ is
\beq
L(C,X) = \sum_{x \in \A_X} P(x) \, l(x).
\eeq
We may also write this quantity as
\beq
L(C,X) = \sum_{i=1}^{I} p_i l_i
\eeq
where $I = |\A_X|$.
\end{description}
%
\exampla{
% {\sf Example 1:}
\marginpar[b]{
\begin{center}
$C_3$:\\[0.1in]
\begin{tabular}{cllcc} \toprule
$a_i$ & $c(a_i)$ & $p_i$ &
% \multicolumn{1}{c}{$\log_2 \frac{1}{p_i}$}
$h(p_i)$
& $l_i$
% {\rule[-3mm]{0pt}{8mm}}%strut
\\ \midrule
{\tt a} & {\tt 0} & \dhalf & 1.0 & 1 \\
{\tt b} & {\tt 10} & \dquarter & 2.0 & 2 \\
{\tt c} & {\tt 110} & \deighth & 3.0 & 3 \\
{\tt d} & {\tt 111} & \deighth & 3.0 & 3 \\
\bottomrule
\end{tabular}
\end{center}
}
Let
\beq
\begin{array}{*{4}{c}*{5}{@{\,}c}}
& \A_X & = & \{ & {\tt a}, & {\tt b}, & {\tt c}, & {\tt d} & \} , \\
\mbox{and} \:\:& \P_X & = & \{ & \dhalf, & \dquarter, & \deighth, & \deighth & \},
\end{array}
\eeq
and consider the code $C_3$.
% $c(a)\eq {\tt{0}}$, $ c(b)\eq {\tt{1}}{\tt{0}}$,
% $c(c)\eq {\tt{1}}{\tt{1}}{\tt{0}}$, $ c(d)\eq {\tt{1}}{\tt{1}}{\tt{1}}$.
%
The entropy of $X$ is 1.75 bits, and the expected length $L(C_3,X)$ of this
code is also 1.75 bits. The sequence of symbols $\bx\eq ({\tt acdbac})$ is
% 134213
encoded as $c^+(\bx)={\tt{0110111100110}}$.
% You can confirm that no other sequence of
% symbols $\bx$ has the same encoding.
% In fact,
$C_3$ is a {prefix code\/}
and is therefore \inds{uniquely decodeable}.
Notice that the codeword lengths satisfy $l_i \eq \log_2 (1/p_i)$, or
equivalently,
$p_i \eq 2^{-l_i}$.
}
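 Both figures quoted in this example can be checked by hand from the
 table for $C_3$. The following is simply that arithmetic written out
 (our own working, taking all logarithms to base 2):
\beqan
	H(X) &=& \dhalf \log_2 2 + \dquarter \log_2 4 + \deighth \log_2 8 + \deighth \log_2 8
	\:=\: 1.75 \mbox{ bits}, \\
	L(C_3,X) &=& \dhalf \times 1 + \dquarter \times 2 + \deighth \times 3 + \deighth \times 3
	\:=\: 1.75 \mbox{ bits}.
\eeqan
 The two sums agree term by term because $l_i = \log_2 (1/p_i)$ for
 every symbol.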
%\medskip
%
%\noindent {\sf Example 2:}
\exampla{
 Consider the fixed-length code $C_4$ for the same ensemble
 $X$.
% $ c(1)\eq {\tt{00}}$, $ c(2)\eq {\tt{01}}$, $ c(3)\eq {\tt{10}}$, $ c(4)\eq {\tt{11}}$.
%
% C4 by itself in a table, moved to graveyard
\marginpar[b]{
\begin{center}
\begin{tabular}{cll} \toprule
% $a_i$
&
$C_4$&
$C_5$
%&$C_6$
% \\
% $c(a_i)$ & $p_i$ &
% \multicolumn{1}{c}{$\log_2 \frac{1}{p_i}$}
% $h(p_i)$ & $l_i$
% {\rule[-3mm]{0pt}{8mm}}%strut
\\ \midrule
{\tt a} & {\tt 00} & {\tt 0} \\
{\tt b} & {\tt 01} & {\tt 1} \\
{\tt c} & {\tt 10} & {\tt 00} \\
{\tt d} & {\tt 11} & {\tt 11} \\
\bottomrule
\end{tabular}
\end{center}
}
The expected length $L(C_4,X)$ is 2 bits.
}
% edskip
%
% \noindent {\sf Example 3:}
\exampla{
Consider $C_5$.
%$ c(1)\eq {\tt{0}}$, $ c(2)\eq {\tt{1}}$, $ c(3)\eq {\tt{00}}$, $c(4)\eq {\tt{11}}$.
The expected
length $L(C_5,X)$ is 1.25 bits, which is less than $H(X)$.
But the code is not uniquely decodeable.
The sequence $\bx\eq ({\tt acdbac})$
% 134213)$
encodes as {\tt{000111000}}, which can also be
decoded as $({\tt cabdca})$.
}
% \medskip
%
% \noindent {\sf Example 4:}
\exampla{
Consider the code $C_6$.
\amargintabnocaption{c}{
\begin{center}
$C_6$:\\[0.1in]
\begin{tabular}{cllcc} \toprule
$a_i$ & $c(a_i)$ & $p_i$ &
% {$\log_2 \frac{1}{p_i}$}
$h(p_i)$
& $l_i$
% {\rule[-3mm]{0pt}{8mm}}%strut
\\ \midrule
{\tt a} & {\tt 0} & \dhalf & 1.0 & 1 \\
{\tt b} & {\tt 01} & \dquarter & 2.0 & 2 \\
{\tt c} & {\tt 011} & \deighth & 3.0 & 3 \\
{\tt d} & {\tt 111} & \deighth & 3.0 & 3 \\
\bottomrule
\end{tabular}
\end{center}
}
%$ c(1)\eq {\tt{0}}$, $ c(2)\eq {\tt{01}}$, $ c(3)\eq {\tt{011}}$, $c(4)\eq {\tt{111}}$.
The expected length $L(C_6,X)$ of this
code is 1.75 bits. The sequence of symbols $\bx\eq ({\tt acdbac})$ is
encoded as $c^+(\bx)={\tt{0011111010011}}$.
Is $C_6$ a {prefix code}?
It is not, because $c({\tt a}) = {\tt 0}$ is a prefix of both
$c({\tt b})$ and $c({\tt c})$.
Is $C_6$ {uniquely decodeable}? This is not so obvious. If you think that
it might {\em not\/} be {uniquely decodeable}, try to prove it
so by finding a pair of strings $\bx$ and $\by$ that have the same
encoding. [The definition of unique decodeability is given in \eqref{eq.UD}.]
$C_6$ certainly isn't {\em easy\/} to decode.
When we receive `{\tt{00}}', it is possible that $\bx$ could start `{\tt{aa}}',
`{\tt{ab}}' or `{\tt{ac}}'. Once we have received `{\tt{001111}}', the second symbol
is still ambiguous, as $\bx$ could be `{\tt{abd}}\ldots' or `{\tt{acd}}\ldots'.
But eventually a unique decoding crystallizes, once the next {\tt{0}} appears in the
encoded stream.
$C_6$ {\em is\/} in fact {uniquely decodeable}. Comparing with the prefix code $C_3$,
we see that the codewords of $C_6$ are the reverse of $C_3$'s.
That $C_3$ is uniquely decodeable proves that $C_6$ is too, since
any string from $C_6$ is identical to a string from $C_3$ read backwards.
}
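 The reversal argument can be traced on the example string. This is an
 illustrative check of ours, using only the codeword tables for $C_3$ and $C_6$.
 Reading the $C_6$ encoding of ${\tt acdbac}$ backwards gives
\beq
	{\tt 1100101111100} =
	{\tt 110}\, {\tt 0}\, {\tt 10}\, {\tt 111}\, {\tt 110}\, {\tt 0} ,
\eeq
 which the prefix code $C_3$ parses from left to right, without lookahead,
 as ${\tt cabdca}$. Reversing this sequence of symbols recovers ${\tt acdbac}$.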
% \medskip
% something I recall reading in cover was a contrary statement that said that
% with a nonprefix code it will take an arb long time to figure things out.
% maybe that was just a w.c. result.
% What is it that distinguishes a uniquely
\section{What limit is imposed by unique decodeability?}
We now ask, given a list of positive integers $\{ l_i
\}$, does there exist a uniquely decodeable\index{uniquely decodeable}\index{source code!uniquely decodeable} code with those
integers as its codeword lengths?
At this stage, we ignore the probabilities of the different
symbols; once we understand unique decodeability better, we'll
reintroduce the probabilities and discuss how to make
an {\dem optimal\/} uniquely decodeable symbol code.
In the examples above, we have observed that if we take a code
such as $\{{\tt{00}},{\tt{01}},{\tt{10}},{\tt{11}}\}$, and
shorten one of its codewords,
for example ${\tt{00}} \rightarrow {\tt{0}}$, then we can retain unique
decodeability only if we lengthen other codewords.
Thus there seems to be a constrained budget\index{symbol code!budget} that we can spend
on codewords, with shorter codewords being more expensive.
Let us explore the nature of this \ind{budget}.
If we build a code purely from codewords of length $l$ equal
to three, how many
codewords can we have and retain unique decodeability?
The answer is $2^l = 8$. Once we have chosen all eight
of these codewords, is there any way we could add to the code another
codeword of some {\em other\/} length and retain unique decodeability?
It would seem not.
What if we make a code that includes a length-one codeword, `{\tt{0}}',
with the other codewords being of length three? How many length-three
codewords can we have?
If we restrict attention to prefix codes, then
% it is clear that
we can have only four codewords of length three, namely
$\{ {\tt{100}},{\tt{101}},{\tt{110}},{\tt{111}} \}$. What about other codes? Is there any other
way of choosing codewords of length 3 that can give more codewords?
Intuitively, we think this unlikely.
 A codeword of length 1 appears to
 cost $2^{2}$ times as much as a codeword of length 3:
 including {\tt{0}} rules out four of the eight length-three codewords.
% "... cost ... times smaller ..."; I suspect some
% readers may have difficulty with this sentence.
Let's define a total budget of size 1,
which we can spend on codewords.
If we set the cost of a codeword whose length is $l$ to $2^{-l}$,
then we have a pricing system that fits the examples
discussed above. Codewords of length 3 cost $\deighth$ each;
codewords of length 1 cost $1/2$ each.
We can spend our budget on any codewords.
If we go over our budget then the code will certainly not be
uniquely decodeable. If, on the other hand,
\beq
\sum_i 2^{-l_i} \leq 1,
\label{eq.kraft}
\eeq
then the code may be uniquely decodeable. This inequality is
the \inds{Kraft inequality}.\label{sec.kraft}
\begin{description}
\item[\Kraft\ inequality\puncspace]
For any uniquely decodeable code $C(X)$ over the binary alphabet $\{0,1\}$,
the codeword lengths must satisfy:
\beq
\sum_{i=1}^I 2^{-l_i} \leq 1 ,
\eeq
where $I = |\A_X|$.
\end{description}
\begin{description}
\item[Completeness\puncspace]
If a uniquely
decodeable code satisfies the \Kraft\ inequality with equality
then it is called a {\dbf complete} code.
\end{description}
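 As an illustration (our own tally, using the codeword lengths tabulated
 in section \ref{sec.symbol.code.intro}), the budgets spent by the example
 codes are
\beq
\begin{array}{ll}
	C_0: & 4 \times 2^{-4} = 1/4 , \\
	C_3, \, C_6: & 2^{-1} + 2^{-2} + 2^{-3} + 2^{-3} = 1 , \\
	C_4: & 4 \times 2^{-2} = 1 , \\
	C_5: & 2^{-1} + 2^{-1} + 2^{-2} + 2^{-2} = 3/2 .
\end{array}
\eeq
 So $C_3$, $C_4$ and $C_6$ are complete; $C_0$ leaves three-quarters of
 the budget unspent; and $C_5$ overspends, which is consistent with its
 failure to be uniquely decodeable.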
% It is less obvious that t
We want codes that are uniquely decodeable;
prefix codes are uniquely decodeable, and are easy to decode.
% ; and it is easy to assess whether a code is a prefix code.
% codes that are not prefix codes are less straightforward to decode than
% prefix codes.
So life would be simpler for us if we could restrict attention to prefix
codes.\index{prefix code}
Fortunately,
% we can prove that
for any source there {\em is\/}
an optimal symbol code that is also a prefix
code.
% We wi, and we will discuss an
% algorithm we can restrict attention to prefix
% codes.
% The following
% result is also true:
\begin{description}
\item[\Kraft\ inequality and prefix codes\puncspace]
Given a set of codeword lengths that satisfy
the Kraft inequality,
% this inequality,
there exists a uniquely decodeable prefix
code\index{source code!prefix code}\index{prefix code} with these
codeword lengths.
\end{description}
\begin{aside}
%\subsection*{The small print}
The Kraft inequality
% , which appears on page \pageref{sec.kraft},
might be more accurately referred to
as the Kraft--McMillan inequality:\index{Kraft, L.G.}\index{McMillan, B.}\nocite{mcmillan1956}
Kraft
% (1949)
proved that if the inequality is satisfied,
then a prefix code exists with the given lengths.
% McMillan
% (1956)
\citeasnoun{mcmillan1956}
proved the converse, that unique decodeability
implies that the inequality holds.
\end{aside}
\begin{prooflike}{Proof of the \Kraft\ inequality}
%
Define $S = \sum_i 2^{-l_i}$.
Consider the quantity
\beq
S^N = \left[ \sum_i 2^{-l_i} \right]^N
= \sum_{i_1=1}^{I} \sum_{i_2=1}^{I} \cdots \sum_{i_N=1}^{I}
	2^{-\displaystyle \left(l_{i_1} + l_{i_2} + \cdots + l_{i_N} \right) } .
\eeq
The quantity in the exponent, $\left(l_{i_1} + l_{i_2} + \cdots +
l_{i_N} \right)$, is the length of the encoding of the string $\bx =
a_{i_1} a_{i_2} \ldots a_{i_N}$. For every string $\bx$
of length $N$, there is one term in the above sum. Introduce an
array $A_l$ that counts how many strings $\bx$ have encoded length $l$.
Then, defining $l_{\min} = \min_i l_i$ and $l_{\max} = \max_i l_i$:
\beq
S^N = \sum_{l = N l_{\min} }^{N l_{\max}} 2^{-l} A_l .
\eeq
Now assume $C$ is
uniquely decodeable, so that for all $\bx \not = \by$,
$c^+(\bx) \not = c^+(\by)$. Concentrate on the $\bx$ that have encoded
length $l$. There are a total of $2^l$ distinct bit strings of length $l$,
so it must be the case that $A_l \leq 2^l$.
%
So
\beq
S^N = \sum_{l = N l_{\min} }^{N l_{\max}} 2^{-l} A_l \leq
\sum_{l = N l_{\min} }^{N l_{\max}} 1 \:\: \leq \:\: N l_{\max}.
\label{eq.kraft.climax}
\eeq
Thus $S^N \leq l_{\max} N$ for all $N$.
Now if $S$ were greater than 1, then as $N$ increases,
$S^N$ would be an exponentially growing function, and for large enough
$N$, an exponential always exceeds a polynomial such as $l_{\max} N$.
But our result $(S^N \leq l_{\max} N)$
% \ref{eq.kraft.climax}
is true for {\em any\/} $N$.
Therefore $S \leq 1$. \hfill
% Q.E.D.
%
% to have
% enabled me to understand it the first time round, it would have been
% sufficient to have said 'for the inequality to be true for all N,
% regardless of how large, S has to be <= 1.'
%
\end{prooflike}
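 To see the argument at work, here is a small numerical illustration (ours)
 using the code $C_5$ of section \ref{sec.symbol.code.intro},
 for which $S = 3/2$ and $l_{\max} = 2$.
 If $C_5$ were uniquely decodeable we would need $S^N \leq 2N$ for every $N$.
 The bound survives for small $N$, for example $S^6 \simeq 11.4 \leq 12$, but
\beq
	S^7 = (3/2)^7 \simeq 17.1 > 14 = 7 \, l_{\max} ,
\eeq
 so among the source strings of length $N=7$ at least two must share an encoding.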
\exercissxB{3}{ex.KIconverse}{
% (optional)
Prove
the result stated above,
that for any set of codeword lengths $\{ l_i \}$
satisfying the \Kraft\ inequality, there is a prefix code having those
lengths.
}
%
% Symbol Coding Budget
%
\begin{figure}
\figuremargin{%
\begin{center}
\mbox{\psfig{figure=figs/budget1.eps,height=3in}\ \psfig{figure=figs/budgetmax.eps,height=3in}}
\end{center}
}{%
\caption[a]{The symbol coding \ind{budget}.\index{source code!supermarket}\indexs{symbol code!budget}
The `cost' $2^{-l}$ of each codeword
(with length $l$)
is indicated by the size of the box it is written in. The total budget
available when making a uniquely decodeable code is 1.
You can think of this diagram as showing
a {\dem{codeword supermarket}\/}\index{supermarket for codewords},
with the codewords arranged in aisles by their length, and the cost of each codeword indicated by the
size of its box on the shelf.
If the cost of the codewords that you take exceeds the budget then your code
will not be uniquely decodeable.
}
\label{fig.budget1}
}%
\end{figure}
\begin{figure}
\figuredangle{%
\begin{center}
\mbox{
%\begin{tabular}{cc}
% $C_0$ & $C_3$ \\
%\psfig{figure=figs/budget0.eps,height=1.48in}&
%\psfig{figure=figs/budget3.eps,height=1.48in} \\[0.2in]
% $C_4$ & $C_6$ \\
%\psfig{figure=figs/budget4.eps,height=1.48in}&
%\psfig{figure=figs/budget6.eps,height=1.48in}\\
%\end{tabular}}
\begin{tabular}{cccc}
$C_0$ & $C_3$ & $C_4$ & $C_6$ \\
\psfig{figure=figs/budget0.eps,height=1.66in}&
\psfig{figure=figs/budget3.eps,height=1.66in}&
\psfig{figure=figs/budget4.eps,height=1.66in}&
\psfig{figure=figs/budget6.eps,height=1.66in}\\
\end{tabular}}
\end{center}
}{%
\caption[a]{Selections of codewords
% from the codeword supermarket
made by codes $C_0,C_3,C_4$ and $C_6$
from section \protect\ref{sec.symbol.code.intro}.}
\label{fig.budget0}
\label{fig.budget6}
}%
\end{figure}
A pictorial view of the \Kraft\ inequality may help you solve this exercise.
Imagine that we are choosing the codewords to make a symbol code.
We can draw the set of all candidate codewords
% that we might include in a code
in a supermarket that displays
the `cost' of the codeword by the area of a box (\figref{fig.budget1}).
The total budget available -- the `1' on the right-hand side of
the \Kraft\ inequality -- is shown at one side.
 Some of the codes discussed in section \ref{sec.symbol.code.intro}
 are illustrated in \figref{fig.budget0}. Notice that the codes that
are prefix codes, $C_0$, $C_3$,
and $C_4$, have the property that to the right of any selected
codeword, there are no other selected codewords --
because prefix codes correspond to trees.
% The {\em complete\/} prefix codes $C_0$, $C_3$,
% and $C_4$ have the property that
% the codewords abut
% Notice also that the
% `incomplete' code
% -\ref{fig.budget6}.
Notice that a {\em complete\/} prefix code
corresponds to a {\em complete\/} tree having no unused branches.
\medskip
We are now ready to put back the symbols' probabilities $\{ p_i \}$.
Given a set of symbol probabilities (the English language
probabilities of \figref{fig.monogram}, for example),
how do we make the best symbol code -- one with the smallest
possible expected length $L(C,X)$? And what is that smallest possible
expected length?
It's not
obvious how to assign the codeword lengths.
If we give short codewords to the more probable
symbols then the expected length might be reduced; on the other
hand, shortening some codewords necessarily causes others
to lengthen, by the Kraft inequality.
\section{What's the most compression that we can hope for?}
% there must be a compromise.
% of s
% Of the four codes displayed in figure \ref{fig.budget0},
% $C_3$ and $C_6$
We wish to minimize the expected length of a code,
\beqan
L(C,X) &=& \sum_i p_i l_i .
\eeqan
As you might have guessed, the entropy appears as the
% It is easy to show that there is a
lower bound on the expected length of a code.
\begin{description}
\item[Lower bound on expected length\puncspace] The expected length $L(C,X)$
of a uniquely decodeable code
is bounded below by $H(X)$.
\item[{\sf Proof.}]
% Introduce the optimum codelengths $l^*_i \equiv \log (1/p_i)$,
We define the {\dem\inds{implicit probabilities}\/}
$q_i \equiv 2^{-l_i}/z$,
where $z\eq \sum_{i'} 2^{-l_{i'}}$, so that $l_i \eq \log 1/q_i -
\log z$. We then use Gibbs' inequality,
$\sum_i p_i \log 1/q_i \geq \sum_i p_i \log 1/p_i$, with
equality if $q_i \eq p_i$, and the \Kraft\ inequality $z\leq 1$:
\beqan
L(C,X) &=& \sum_i p_i l_i =
\sum_i p_i \log 1/q_i - \log z
\label{eq.expected.length}
\\
& \geq & \sum_i p_i \log 1/p_i - \log z
\\
& \geq & H(X) .
\eeqan
The equality $L(C,X) \eq H(X)$ is achieved only if the \Kraft\
equality $z
% \sum_i 2^{-l_i}
\eq 1$ is satisfied, and if
the codelengths satisfy $l_i \eq \log (1/p_i)$. \hfill $\Box$
\end{description}
This is an important result so let's say it again:
\begin{description}
\item[Optimal source codelengths\puncspace]
The\index{source code!optimal lengths}
expected length is minimized and is equal to
$H(X)$ only if the codelengths
are equal to the {\dem Shannon information contents}:\index{Shannon information content}\index{information content}
\beq
l_i = \log_2 (1/p_i) .
\eeq
\item[Implicit probabilities defined by codelengths\puncspace]
Conversely, any choice of codelengths $\{l_i\}$ {\em implicitly\/}
defines a probability distribution $\{q_i\}$,
\beq
q_i \equiv 2^{-l_i}/z ,
\eeq
for which those codelengths would be the optimal codelengths.
If the code is complete then $z=1$ and the implicit probabilities
are given by $q_i = 2^{-l_i}$.
\end{description}
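 For example (a check of ours against \eqref{eq.expected.length}),
 the code $C_0$ of section \ref{sec.symbol.code.intro} has $l_i = 4$ for all
 four symbols, so $z = 4 \times 2^{-4} = 1/4$ and
\beq
	q_i = \frac{2^{-4}}{1/4} = \frac{1}{4} , \:\:\:\:\:\:
	L(C_0,X) = \sum_i p_i \log_2 \frac{1}{q_i} - \log_2 z = 2 + 2 = 4 \mbox{ bits}.
\eeq
 The first term is what a complete code with the same implicit probabilities,
 such as $C_4$, would achieve; the second term is the price of leaving
 part of the budget unspent.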
% This is one of the central themes of this course.
%
%
%
\section{How much can we compress?}
So, we can't compress below the entropy.
% using a symbol code.
How close can we expect to get to the entropy?
% if we are using a symbol code?
% \section{Existence of good symbol codes}
\begin{ctheorem}
{\sf Source coding theorem for symbol codes.}
For an ensemble $X$ there exists a prefix code $C$ with expected length
satisfying\indexs{extra bit}
\beq
H(X) \leq L(C,X) < H(X) + 1.
\label{eq.source.coding.symbol}
\eeq
\label{th.source.coding.symbol}
\end{ctheorem}
\begin{prooflike}{Proof} We set the codelengths to integers slightly
larger than the optimum lengths:
\beq
l_i = \lceil \log_2 (1/p_i) \rceil
\eeq
where $\lceil l^* \rceil$ denotes the smallest integer greater
than or equal to $l^*$.
[We are not asserting that the {\em optimal\/} code necessarily uses
these lengths, we are simply choosing these lengths
because we can use them to prove the theorem.]
We check that there {\em is\/} a
prefix code with these lengths by confirming that the
\Kraft\ inequality is satisfied.
\beq
\sum_i 2^{-l_i} = \sum_i 2^{-\lceil \log_2 (1/p_i) \rceil}
\leq \sum_i 2^{ -\log_2 (1/p_i) } = \sum_i p_i = 1 .
\eeq
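 The lower bound $H(X) \leq L(C,X)$ holds for any uniquely decodeable code,
 as shown above. For the upper bound, we confirm
\beq
 L(C,X) = \sum_i p_i \lceil \log_2 (1/p_i) \rceil
	< \sum_i p_i ( \log_2 (1/p_i) + 1 ) = H(X) + 1.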
\eeq
% corrected < to = , 9802
%
\end{prooflike}
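 As an illustration of the construction (our own example; the same ensemble
 is used for Huffman coding in \tabref{tab.huffman}), take
 $\P_X = \{ 0.25, 0.25, 0.2, 0.15, 0.15 \}$, whose entropy is
 $H(X) = 2.2855$ bits. The lengths $l_i = \lceil \log_2 (1/p_i) \rceil$
 are $\{ 2, 2, 3, 3, 3 \}$, and
\beq
	\sum_i 2^{-l_i} = \frac{7}{8} \leq 1 , \:\:\:\:\:\:
	L(C,X) = 0.5 \times 2 + 0.5 \times 3 = 2.5 \mbox{ bits} ,
\eeq
 which indeed lies in the interval $\left[ H(X) , H(X) + 1 \right)$.
 The unspent eighth of the budget is a hint that these lengths are not optimal.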
\subsection{The cost of using the wrong codelengths}
If we use a code whose lengths are not equal to the optimal
codelengths, the average message length will be larger
than the entropy.
%when we use the `wrong' code.
If the true probabilities are $\{ p_i
\}$ and we use a complete code with lengths $l_i$,
% that satisfy the
% \Kraft\ equality (that is,
% the \Kraft\ inequality with equality),
we can view those lengths as defining
\ind{implicit probabilities} $q_i = 2^{-{l_i}}$.
% l_i \eq \log 1/q_i$ such
% that $\sum_i q_i \eq 1$, then
Continuing from \eqref{eq.expected.length},
the average length is
\beq
 L(C,X) = H(X)+\sum_i p_i \log_2 (p_i/q_i),
\eeq
\ie, it exceeds the entropy by the \ind{relative entropy}
$D_{\rm KL}(\bp||\bq)$ (as defined on \pref{eq.KL}).
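 For instance (a worked check of ours), the complete code $C_4$ of
 section \ref{sec.symbol.code.intro} has $q_i = 1/4$ for every symbol,
 so for the ensemble with
 $\P_X = \{ \dhalf, \dquarter, \deighth, \deighth \}$,
\beq
	D_{\rm KL}(\bp||\bq) =
	\dhalf \log_2 2 + \dquarter \log_2 1 + 2 \times \deighth \log_2 (1/2)
	= 0.5 + 0 - 0.25 = 0.25 \mbox{ bits},
\eeq
 which accounts exactly for the gap between $L(C_4,X) = 2$ bits and
 $H(X) = 1.75$ bits.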
\section{Optimal source coding with symbol codes: Huffman coding}
Given a set of probabilities $\P$, how can we design an optimal
prefix code? For example,
what is the best symbol code for the English language ensemble
shown in \figref{fig.elfig}?
\marginfig{\begin{center}\input{tex/_paz.tex}\end{center}
\caption[a]{An ensemble in need of a symbol code.}\label{fig.elfig}}
When we say `optimal', let's assume our aim is to minimize the
expected length $L(C,X)$.
\subsection{How not to do it}
 One might try
 to split the set $\A_X$ into two subsets of roughly equal probability, and
 continue bisecting the subsets so as to define a binary tree from the
 root. This construction has the right spirit, as in the weighing problem,
% is how the {\em Shannon-Fano code\/} is constructed,\index{Shannon, Claude}\index{Fano}
but it is not
necessarily optimal; it achieves $L(C,X) \leq H(X) + 2$.
%
% find a reference for proof of this?
%
%{\em [Is Shannon-Fano
% the correct name? According to Goldie and Pinch this has a different
% meaning. Check.]}
\subsection{The Huffman coding algorithm}
We now present a beautifully simple algorithm for finding an optimal
prefix code.
\indexs{Huffman code}The trick is to
construct the code {\em backwards\/} starting from the tails of the
codewords; {\em we build the binary tree from its leaves}.
\begin{algorithm}[h]
\begin{framedalgorithmwithcaption}{\caption[a]{Huffman coding algorithm.}}
\ben
\item%[{\sf 1.}]
Take the two least probable symbols in the alphabet. These two symbols
will be given the longest codewords, which will have equal length,
and differ only in the last digit.
\item%[{\sf 2.}]
 Combine these two symbols into a single symbol, whose probability
 is the sum of their probabilities, and repeat.
\een
\end{framedalgorithmwithcaption}
\end{algorithm}
Since each step reduces the size of the alphabet by one,
this algorithm will have assigned strings to all the symbols
after $|\A_X|-1$ steps.
\exampla{
% {\sf Example:}
\begin{tabular}[t]{*{11}{@{\,}l}}
Let \hspace{0.1in} & $\A_X$ &=&$\{$& {\tt a},&{\tt b},&{\tt c},&{\tt d},&{\tt e} &$\}$ \\
and \hspace{0.1in} & $\P_X$ &=&$\{$& 0.25, &0.25, & 0.2, & 0.15, & 0.15 & $\}$.
\end{tabular}
\begin{center}
% \framebox{\psfig{figure=figs/huffman.ps,%
%angle=-90}}
\setlength{\unitlength}{0.015in}%was0125
\begin{picture}(200,95)(40,40)
\put( 60,105){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{0.25}}}
\put( 60,090){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{0.25}}}
\put( 60,075){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{0.2}}}
\put( 60,060){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{0.15}}}
\put( 60,045){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{0.15}}}
\put(100,105){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{0.25}}}
\put(100,090){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{0.25}}}
\put(100,075){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{0.2}}}
\put(100,060){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{0.3}}}
\put(140,105){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{0.25}}}
\put(140,090){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{0.45}}}
\put(140,060){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{0.3}}}
\put(180,105){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{0.55}}}
\put(180,090){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{0.45}}}
\put(220,105){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{1.0}}}
\put( 40,105){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\tt a}}}
\put( 40,090){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\tt b}}}
\put( 40,075){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\tt c}}}
\put( 40,060){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\tt d}}}
\put( 40,045){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\tt e}}}
\put( 85,067){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\tt 0}}}
\put( 85,045){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\tt 1}}}
\put(125,097){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\tt 0}}}
\put(125,075){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\tt 1}}}
\put(165,112){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\tt 0}}}
\put(165,065){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\tt 1}}}
\put(205,112){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\tt 0}}}
\put(205,090){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\tt 1}}}
\thinlines
\put( 80,110){\line( 1, 0){ 15}}
\put( 80,095){\line( 1, 0){ 15}}
\put( 80,080){\line( 1, 0){ 15}}
\put( 80,065){\line( 1, 0){ 15}}
\put( 95,065){\line(-1,-1){ 15}}
\put(120,110){\line( 1, 0){ 15}}
\put(120,065){\line( 1, 0){ 15}}
\put(120,095){\line( 1, 0){ 15}}
\put(135,095){\line(-1,-1){ 15}}
\put(160,095){\line( 1, 0){ 15}}
\put(160,110){\line( 1, 0){ 15}}
\put(175,110){\line(-1,-3){ 15}}
\put(200,110){\line( 1, 0){ 15}}
\put(215,110){\line(-1,-1){ 15}}
\put( 40,125){\makebox(0,0)[bl]{\raisebox{0pt}[0pt][0pt]{$x$}}}
\put( 85,125){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{step 1}}}
\put(125,125){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{step 2}}}
\put(165,125){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{step 3}}}
\put(205,125){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{step 4}}}
\end{picture}
\end{center}
 The codewords are then obtained by concatenating the binary digits
 in reverse order, that is, reading from the final step back towards each symbol:
% Codewords
$C = \{ {\tt{00}}, {\tt{10}} , {\tt{11}}, {\tt{010}}, {\tt{011}} \}$.
\margintab{
\begin{center}
\begin{tabular}{clrrl} \toprule
$a_i$ & $p_i$ &
\multicolumn{1}{c}{$h(p_i)$%$\log_2 \frac{1}{p_i}$}
}
& $l_i$ & $c(a_i)$
%{\rule[-3mm]{0pt}{8mm}}%strut
\\ \midrule
{\tt a} & 0.25 & 2.0 & 2 & {\tt 00} \\
{\tt b} & 0.25 & 2.0 & 2 & {\tt 10} \\
{\tt c} & 0.2 & 2.3 & 2 & {\tt 11} \\
{\tt d} & 0.15 & 2.7 & 3 & {\tt 010} \\
{\tt e} & 0.15 & 2.7 & 3 & {\tt 011} \\ \bottomrule
\end{tabular}
\end{center}
\caption[a]{Code created by the Huffman algorithm.}
\label{tab.huffman}
}
The codelengths selected by the Huffman algorithm (column 4
of \tabref{tab.huffman}) are
in some cases longer and in some cases shorter than
the ideal codelengths, the Shannon information contents $\log_2 \dfrac{1}{p_i}$ (column 3).
The expected length of the code is $L=2.30$ bits, whereas the
entropy is $H=2.2855$ bits.\ENDsolution
}
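 These figures can be verified directly from \tabref{tab.huffman}; the
 arithmetic (ours) is
\beqan
	L(C,X) &=& 2 \times (0.25 + 0.25 + 0.2) + 3 \times (0.15 + 0.15)
	\:=\: 1.4 + 0.9 \:=\: 2.30 \mbox{ bits}, \\
	\sum_i 2^{-l_i} &=& 3 \times 2^{-2} + 2 \times 2^{-3} \:=\: 1 ,
\eeqan
 so this Huffman code is complete, and its expected length exceeds the
 entropy by about $0.015$ bits.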
If at any point there is more than one way of selecting the two least
probable symbols then the choice may be made in any manner -- the
expected length of the code will not depend on the choice.
\exercissxC{3}{ex.Huffmanconverse}{
% (Optional)
Prove\index{Huffman code!`optimality'}
that there is no better symbol code for a source than the
Huffman code.
}
%
\exampla{
We can make a Huffman code for the probability distribution
over the alphabet introduced in \figref{fig.monogram}.
The result is shown in \figref{fig.monogram.huffman}.
This code has an expected length of 4.15 bits; the entropy of
the ensemble is 4.11 bits.
% It is interesting to notice how
% some symbols, for example {\tt q}, receive codelengths that
% differ by more than 1 bit from
Observe the disparities between the assigned
codelengths and the ideal codelengths
$\log_2 \dfrac{1}{p_i}$.
}
%%%%%%%%%%%%%%%%%%%%%%%%% alphabet of english!
\begin{figure}
\figuremargin{%
\begin{center}
\mbox{\small
\begin{tabular}{clrrl} \toprule
$a_i$ & $p_i$ & \multicolumn{1}{c}{$\log_2 \frac{1}{p_i}$} & $l_i$ & $c(a_i)$
%{\rule[-3mm]{0pt}{8mm}}%strut
\\[0in] \midrule
{\tt a}& 0.0575 & 4.1 & 4 & {\tt 0000 } \\
{\tt b}& 0.0128 & 6.3 & 6 & {\tt 001000 } \\
{\tt c}& 0.0263 & 5.2 & 5 & {\tt 00101 } \\
{\tt d}& 0.0285 & 5.1 & 5 & {\tt 10000 } \\
{\tt e}& 0.0913 & 3.5 & 4 & {\tt 1100 } \\
{\tt f}& 0.0173 & 5.9 & 6 & {\tt 111000 } \\
{\tt g}& 0.0133 & 6.2 & 6 & {\tt 001001 } \\
{\tt h}& 0.0313 & 5.0 & 5 & {\tt 10001 } \\
{\tt i}& 0.0599 & 4.1 & 4 & {\tt 1001 } \\
{\tt j}& 0.0006 & 10.7 & 10 & {\tt 1101000000 } \\
{\tt k}& 0.0084 & 6.9 & 7 & {\tt 1010000 } \\
{\tt l}& 0.0335 & 4.9 & 5 & {\tt 11101 } \\
{\tt m}& 0.0235 & 5.4 & 6 & {\tt 110101 } \\
{\tt n}& 0.0596 & 4.1 & 4 & {\tt 0001 } \\
{\tt o}& 0.0689 & 3.9 & 4 & {\tt 1011 } \\
{\tt p}& 0.0192 & 5.7 & 6 & {\tt 111001 } \\
{\tt q}& 0.0008 & 10.3 & 9 & {\tt 110100001 } \\
{\tt r}& 0.0508 & 4.3 & 5 & {\tt 11011 } \\
{\tt s}& 0.0567 & 4.1 & 4 & {\tt 0011 } \\
{\tt t}& 0.0706 & 3.8 & 4 & {\tt 1111 } \\
{\tt u}& 0.0334 & 4.9 & 5 & {\tt 10101 } \\
{\tt v}& 0.0069 & 7.2 & 8 & {\tt 11010001 } \\
{\tt w}& 0.0119 & 6.4 & 7 & {\tt 1101001 } \\
{\tt x}& 0.0073 & 7.1 & 7 & {\tt 1010001 } \\
{\tt y}& 0.0164 & 5.9 & 6 & {\tt 101001 } \\
{\tt z}& 0.0007 & 10.4 & 10 & {\tt 1101000001 } \\
{--}& 0.1928 & 2.4 & 2 & {\tt 01 } \\ \bottomrule
%{\verb+-+}& 0.1928 & 2.4 & 2 & {\tt 01 } \\ \bottomrule
\end{tabular}
\hspace*{0.5in}\raisebox{-2in}{\psfig{figure=tex/sortedtree.eps,width=1.972in}}
}
\end{center}
}{%
\caption[a]{Huffman code for the English language ensemble (monogram statistics).}
% introduced in \protect\figref{fig.monogram}.}
\label{fig.monogram.huffman}
}%
\end{figure}
% see \cite[p. 97]{Cover&Thomas}
% \medskip
\subsection{Constructing a binary tree top-down is suboptimal}
In previous chapters we studied weighing problems
in which we built ternary or binary trees.
We noticed that balanced trees -- ones in which, at every step, the two
possible outcomes were as close as possible to equiprobable --
appeared to describe the most efficient experiments.
This gave an intuitive motivation for entropy as a measure of information
content.
It is not the case, however, that optimal codes can {\em always\/}
be constructed
by a greedy top-down method in which the alphabet
is successively divided into subsets that are as near as possible to equiprobable.
% /home/mackay/itp/huffman> huffman.p latex=1 < fiftywrong3
\exampla{
Find the optimal binary symbol code for the ensemble:
\beq
\begin{array}{*{3}{@{\,}c@{\,}}*{6}{c@{\,}}*{2}{@{\,}c}}
\A_X & = & \{ &
{\tt a}, &
{\tt b}, &
{\tt c}, &
{\tt d}, &
{\tt e}, &
{\tt f}, &
{\tt g} &
\} \\
\P_X & = & \{
& 0.01,
& 0.24,
& 0.05,
& 0.20,
& 0.47,
& 0.01,
& 0.02
& \} \\
\end{array} .
\eeq
Notice that a greedy top-down method can split this set into two
% equiprobable
subsets
$\{ {\tt a},{\tt b},{\tt c},{\tt d} \}$ and $\{{\tt e},{\tt f},{\tt g}\}$
which both have probability $1/2$,
and that $\{ {\tt a},{\tt b},{\tt c},{\tt d} \}$ can be divided
into
% equiprobable
subsets $\{ {\tt a},{\tt b} \}$ and $\{{\tt c},{\tt d}\}$,
which have probability $1/4$;
so a greedy top-down method gives the code shown
in the third column of \tabref{tab.greed},\margintab{
\begin{center}\small
\begin{tabular}{clll} \toprule
$a_i$ & $p_i$ & Greedy & Huffman \\[0in] \midrule
{\tt a} & .01 & {\tt 000} & {\tt 000000} \\
{\tt b} & .24 & {\tt 001} & {\tt 01} \\
{\tt c} & .05 & {\tt 010} & {\tt 0001} \\
{\tt d} & .20 & {\tt 011} & {\tt 001} \\
{\tt e} & .47 & {\tt 10} & {\tt 1} \\
{\tt f} & .01 & {\tt 110} & {\tt 000001} \\
{\tt g} & .02 & {\tt 111} & {\tt 00001} \\
\bottomrule
\end{tabular}
\end{center}
\caption[a]{A greedily-constructed code compared with the Huffman code.}
\label{tab.greed}
}
which has expected length 2.53.
The Huffman coding algorithm yields the code shown in the fourth
column,
%\begin{center}
%\begin{tabular}{clrrl} \toprule
%$a_i$ & $p_i$ & \multicolumn{1}{c}{$\log_2 \frac{1}{p_i}$} & $l_i$ & $c(a_i)$
%%{\rule[-3mm]{0pt}{8mm}}%strut
%\\[0in] \midrule
%{\tt a} & 0.01 & 6.6 & 6 & {\tt 000000} \\
%{\tt b} & 0.24 & 2.1 & 2 & {\tt 01} \\
%{\tt c} & 0.05 & 4.3 & 4 & {\tt 0001} \\
%{\tt d} & 0.20 & 2.3 & 3 & {\tt 001} \\
%{\tt e} & 0.47 & 1.1 & 1 & {\tt 1} \\
%{\tt f} & 0.01 & 6.6 & 6 & {\tt 000001} \\
%{\tt g} & 0.02 & 5.6 & 5 & {\tt 00001} \\
% \bottomrule
%\end{tabular}
%\end{center}
which has
expected length 1.97.\ENDsolution
% entropy 1.9323
%
}
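As a check on these two figures, the expected lengths follow directly from
the codelengths in \tabref{tab.greed} (the subscripts are labels introduced
here for convenience):
\[
\begin{array}{rcl}
L_{\rm greedy} & = & 3 \times ( 0.01 + 0.24 + 0.05 + 0.20 + 0.01 + 0.02 ) + 2 \times 0.47 \\
 & = & 1.59 + 0.94 \:\: = \:\: 2.53 , \\
L_{\rm Huffman} & = & 6 \times 0.01 + 2 \times 0.24 + 4 \times 0.05 + 3 \times 0.20 + 1 \times 0.47 + 6 \times 0.01 + 5 \times 0.02 \\
 & = & 1.97 .
\end{array}
\]
The greedy top-down code therefore costs 0.56 bits per symbol more than the Huffman code
for this ensemble.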
%\subsection{Twenty questions}
% The Huffman algorithm defines the optimal way to
% play `twenty questions'.
%
% {\em [MORE HERE]}
\section{Disadvantages of the Huffman code}
\label{sec.huffman.probs}
The Huffman\index{Huffman code!disadvantages}\index{symbol code!disadvantages}
algorithm produces an
optimal symbol code for an ensemble, but this is not the end of the
story. Both the word `ensemble' and the phrase `symbol code'
need careful attention.
%\begin{description}
%\item[Changing ensemble.]
\subsection{Changing ensemble}
If we wish to communicate a sequence of outcomes from one
unchanging ensemble, then a Huffman code may be convenient.
But often the appropriate ensemble changes. If for
example we are compressing text, then the symbol frequencies
will vary with context: in English the letter {\tt{u}} is
much more probable after a {\tt{q}} than after an {\tt{e}} (\figref{fig.conbigrams}). And
furthermore, our knowledge of these context-dependent symbol
frequencies will also change as we learn
% accumulate statistics on
the statistical properties of the
text source.\index{adaptive models}
% So our probabilities should change
Huffman codes do not handle changing
ensemble probabilities with any elegance.
One brute-force approach would be to
recompute the Huffman code every time the probability over
symbols changes. Another attitude is to deny the option of
adaptation, and instead run through the entire file in
advance and compute a good probability distribution, which will
then remain fixed throughout transmission. The code itself must
also be communicated in this scenario. Such a technique is
not only cumbersome and restrictive, it is also suboptimal,
since the initial message specifying the code and the document
itself are partially redundant.
% -- knowing the algorithm that
% defines the code for a given document, one can deduce what the
% initial header has to be from the .
This technique therefore wastes bits.
% flag this:
% could discuss bits back here
%
\subsection{The extra bit}
%item[The extra bit.]
An equally serious problem with Huffman codes is the
innocuous-looking `\ind{extra bit}' relative to the ideal average
length of $H(X)$ -- a Huffman code achieves a length that
satisfies $H(X) \leq L(C,X) < H(X) + 1,$ as proved in theorem
\ref{th.source.coding.symbol}.
%\eqref{eq.source.coding.symbol}).
A
Huffman code thus incurs an overhead of between 0 and 1 bits per
symbol. If $H(X)$ were large, then this overhead would be an
unimportant fractional increase. But for many applications,
the entropy may be as low as one bit per symbol, or even smaller,
so the overhead
%`$+1$'
$L(C,X)- H(X)$ may dominate the encoded file length. Consider English
text: in some contexts, long strings of characters may be
highly predictable.
% , as we saw in the guessing game of chapter \chtwo.
% given a simple model of the language.
For
example, in the context `{\verb+strings_of_ch+}', one might
predict the next nine symbols to be `{\verb+aracters_+}' with
a probability of 0.99 each. A traditional Huffman code would
be obliged to use at least one bit per character, making a total cost
of nine bits where virtually no information is being
conveyed (0.13 bits in total, to be precise).
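To see where this figure comes from, note that each of the nine predicted
characters has Shannon information content $\log_2 (1/0.99)$, so the total is
\[
 9 \times \log_2 \frac{1}{0.99} \simeq 9 \times 0.0145 \simeq 0.13 \:\: \mbox{bits} ,
\]
roughly seventy times smaller than the nine bits that a symbol code must spend.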
The entropy of English, given a good model, is about
one bit per character \cite{Shannon48}, so a Huffman code is likely to be highly
% nearly 100\%
inefficient.
A traditional patch-up of Huffman codes uses them to compress
{\dem blocks\/} of symbols, for example the `extended sources'
$X^N$ we discussed in \chref{ch.two}.
% \ref{ch2}
% rather than defining a code for single symbols.
The overhead per block is at most 1 bit so the
overhead per symbol
% goes down as
is at most $1/N$ bits. For
sufficiently large blocks, the problem of the extra bit may be
removed -- but only at the expense of (a) losing the elegant
instantaneous decodeability of simple Huffman coding; and
(b) having
to compute the probabilities of all relevant strings and build
the associated Huffman tree. One will end up explicitly
computing the
probabilities and codes for a huge number of strings, most
of which will never actually occur. (See \exerciseref{ex.Huff99}.)
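A rough count shows how quickly this cost grows. A block code for $X^N$ needs
one codeword for each of the $|\A_X|^N$ possible strings, so for the
27-character alphabet of \figref{fig.monogram.huffman}, reducing the guaranteed
overhead to $1/N = 0.1$ bits per symbol would require
\[
 27^{10} \simeq 2 \times 10^{14}
\]
codewords. This is only an illustration of the worst-case bound, not an estimate
of how well any particular block code would perform.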
% A further problem is that it may not be appropriate to model
% successive symbols as coming independently from a single ensemble
% $X$. As we already asserted, any decent model for text will
% assign a probability over symbols that depends on the context.
% A changing probability distribution over symbols is
% not incompatible with the construction of Huffman codes for
% blocks of symbols. One could consider each possible sequence,
% computing the relevant probability distributions along the way
% to evaluate the probability of the entire sequence, then build
% a Huffman tree for the sequences. One could account for
% dependences between blocks as well, if one were willing to
% use a different Huffman code each time. But this modified
% encoder would be
% computationally expensive, since for large block sizes an
% exponentially large number of possible sequences would have
% to be considered along with their adaptive probabilities.
%% is context-dependent.
% \end{description}
% \medskip
\subsection{Beyond symbol codes}
%
Huffman codes, therefore, although widely trumpeted as
`optimal', have many defects for practical
purposes.\index{Huffman code!`optimality'}
They {\em are\/} optimal {\em symbol\/} codes, but for practical
purposes {\em we don't want a symbol code}.
The defects of Huffman codes are rectified by {\dem arithmetic
coding},\index{arithmetic coding} which dispenses with the
restriction that each symbol must translate into an integer
number of bits. Arithmetic coding is the main topic of the next
chapter.
% is not a symbol coding. This
% we will discuss next.
% In an arithmetic code, the probabilistic modelling is clearly
% separated from the encoding operation.
\section{Summary}
\begin{description}
\item[Kraft inequality\puncspace]
If a code is {\dbf uniquely decodeable} its lengths must satisfy
\beq
\sum_i 2^{-l_i } \leq 1 .
\eeq
For any lengths satisfying the Kraft inequality, there exists
a prefix code with those lengths.
\item[Optimal source codelengths for an ensemble] are equal to the
Shannon information contents\index{source code!optimal lengths}\index{source code!implicit probabilities}
\beq
l_i = \log_2 \frac{1}{p_i} ,
\eeq
and conversely, any choice of codelengths defines
{\dbf\ind{implicit probabilities}}
\beq
q_i = \frac{2^{-l_i}}{z} .
\eeq
\item[The \ind{relative entropy}] $D_{\rm KL}(\bp||\bq)$ measures
how many bits per symbol are wasted by using a
% mismatched
code whose implicit probabilities are $\bq$, when
the ensemble's true probability distribution is $\bp$.
\item[Source coding theorem for symbol codes\puncspace]
For an ensemble $X$, there exists a prefix code
whose expected length satisfies
\beq
H(X) \leq L(C,X) < H(X) + 1 .
\eeq
% The expected length is only equal to the entropy if the
\item[The Huffman coding algorithm] generates an optimal symbol code
iteratively. At each iteration, the two least probable symbols are combined.
\end{description}
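These results fit together in a single identity, sketched here from the
definitions above. Writing $l_i = \log_2 (1/q_i) - \log_2 z$ with
$z = \sum_i 2^{-l_i}$, the expected length of any symbol code is
\[
 L(C,X) = \sum_i p_i l_i = H(X) + D_{\rm KL}(\bp||\bq) + \log_2 \frac{1}{z} ,
\]
where the last term is non-negative because the Kraft inequality gives $z \leq 1$.
For example, the greedy code of \tabref{tab.greed} has lengths
$\{ 3,3,3,3,2,3,3 \}$, so $z = 6 \times 2^{-3} + 2^{-2} = 1$; its expected
length is 2.53 bits and the entropy of that ensemble is about 1.93 bits,
so it wastes $D_{\rm KL}(\bp||\bq) \simeq 0.60$ bits per symbol.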
\section{Exercises}
\exercisaxB{2}{ex.Cnud}{
Is the code $\{ {\tt 00}, {\tt 11}, {\tt 0101}, {\tt 111}, {\tt 1010},
{\tt 100100}, {\tt 0110} \}$
% $\{ 00,11,0101,111,1010,100100,0110 \}$
uniquely decodeable?
}
\exercisaxB{2}{ex.Ctern}{
Is the ternary code
$\{ {\tt 00},{\tt 012},{\tt 0110},{\tt 0112},{\tt 100},{\tt 201},{\tt 212},{\tt 22} \}$ uniquely decodeable?
}
\exercissxA{3}{ex.HuffX2X3}{
Make Huffman codes for $X^2$, $X^3$ and $X^4$ where ${\cal A}_X = \{ 0,1 \}$
and ${\cal P}_X = \{ 0.9,0.1 \}$. Compute their expected lengths and compare
them with the entropies $H(X^2)$, $H(X^3)$ and $H(X^4)$.
Repeat this exercise for $X^2$ and $X^4$ where ${\cal P}_X = \{ 0.6,0.4 \}$.
}
\exercissxA{2}{ex.Huffambig}{
Find a probability distribution $\{ p_1,p_2,p_3,p_4 \}$ such that
there are {\em two\/} optimal codes that assign different lengths $\{ l_i \}$
to the four symbols.
}
\exercisaxC{3}{ex.Huffambigb}{
(Continuation of \exerciseonlyref{ex.Huffambig}.)
Assume that the four probabilities $\{ p_1,p_2,p_3,p_4 \}$ are ordered
such that $p_1 \geq p_2 \geq p_3 \geq p_4 \geq 0$. Let
$\cal Q$ be the set of
all probability vectors $\bp$ such that
there are {\em two\/} optimal codes with different lengths.
Give a complete description of $\cal Q$.
 Find three probability vectors $\bq^{(1)}$, $\bq^{(2)}$, $\bq^{(3)}$,
 whose \ind{convex hull} is $\cal Q$, \ie, such that
 any $\bp \in \cal Q$ can be written as
\beq
 \bp = \mu_1 \bq^{(1)} + \mu_2 \bq^{(2)} +\mu_3 \bq^{(3)} ,
\eeq
 where $\{\mu_i\}$ are non-negative and sum to one.
}
\exercisaxB{1}{ex.twenty.questions}{
Write a short essay discussing how to play
the game of {\sf{\ind{twenty questions}}} optimally.
[In twenty questions, one player thinks of an object,
and the other player has to guess the object using as few binary
questions as possible, preferably fewer than twenty.]
}
\exercisaxB{2}{ex.powertwogood}{
Show that, if each probability $p_i$ is equal to an integer power of 2
then there exists a source code whose expected length equals the entropy.
}
\exercissxB{2}{ex.make.huffman.suck}{
Make ensembles for which the difference between the entropy
and the expected length of the Huffman code is as big as possible.
}% 14. Gallager, R. G., "Variations on a Theme by Huffman",
% IEEE Trans. on Information Theory, Vol. IT-24, No. 6, Nov. 1978, pp. 668-674.
%
%\exercisxB{2}{ex.huffman.biggerhalf}{
% If one of the probabilities $p_m$ is greater than $1/2$, how
% big must the difference between the expected length and the entropy be?
% Sketch a graph the
%}
% from {tex/huffmanI.tex}
\exercissxB{2}{ex.huffman.uniform}
{
% from 02q.tex on rum
A source $X$ has an alphabet
of eleven characters $$\{ {\tt{a}} , {\tt{b}} , {\tt{c}} , {\tt{d}} , {\tt{e}} , {\tt{f}} , {\tt{g}} , {\tt{h}} , {\tt{i}} , {\tt{j}} , {\tt{k}} \},$$
all of which have equal probability, $1/11$.
% State the meaning of the ideal codelengths
Find an {optimal uniquely decodeable symbol code}
for this source.
How much greater is the expected length of this optimal code
than the entropy of $X$?
}
\exercisaxB{2}{ex.huffman.uniform2}{
Consider the optimal symbol code for an ensemble $X$ with alphabet size
$I$ from which all symbols have identical probability
$p = 1/I$. $I$ is not a power of 2.
Show that the fraction $f^+$ of the $I$ symbols that are assigned
codelengths equal to
\beq
l^+ \equiv \lceil \log_2 I \rceil
\eeq
satisfies
\beq
f^+ = 2 - \frac{2^{l^+}}{I}
\label{eq.HIf}
\eeq
and that the expected length of the optimal symbol code
is
\beq
L = l^+ -1 + f^+ .
\label{eq.HIL}
\eeq
By differentiating
the excess length
%\beq
$ \Delta L \equiv L - H(X)$
%\eeq
with respect to $I$, show that the excess
length is bounded by
\beq
\Delta L \leq 1 - \frac{ \ln ( \ln 2 )}{ \ln 2} -\frac{ 1 }{ \ln 2}
= 0.086 .
\eeq
}
\exercisaxA{2}{ex.Huff99}{
Consider a sparse binary source with ${\cal P}_X = \{ 0.99 , 0.01 \}$.
Discuss how Huffman codes could be used to compress this source
{\em efficiently}.\index{Huffman code}
Estimate how many codewords your proposed solutions require.
% The entropy - hint: could think about run length encoding?
%
}
\exercisaxB{2}{ex.poisonglass}{
% p.111 martin gardner mathematical carnival{Gardner:Carnival}
{\em Scientific American\/} carried the following puzzle\index{puzzle!poisoned glass} in 1975.
% roughly!
\begin{description}
\item[The poisoned glass\puncspace]% This should be \exercisetitlestyle ?
`Mathematicians are curious birds', the police commissioner said to
his wife. `You see, we had all those partly filled glasses lined up
in rows on a table in the hotel kitchen. Only one contained poison,
and we wanted to know which one before searching that glass for
fingerprints. Our lab could test the liquid in each glass, but the
tests take time and money, so we wanted to make as few of them as
possible by simultaneously testing mixtures of small samples from
groups of glasses. The university sent over a mathematics professor
to help us. He counted the glasses, smiled and said:
`$\,$``Pick any glass you want, Commissioner. We'll test it first.''
`$\,$``But won't that waste a test?'' I asked.
`$\,$``No,'' he said, ``it's part of the best procedure. We can test one glass
first. It doesn't matter which one.''$\,$'
`How many glasses were there to start with?' the commissioner's wife asked.
`I don't remember. Somewhere between 100 and 200.'
What was the exact number of glasses?
\end{description}% \cite{Gardner:Carnival}
 Solve this puzzle and then explain why the professor was in fact
 wrong and the commissioner was right. What is the optimal procedure
 for identifying the one poisoned glass? What is the expected waste
 relative to this optimum if one followed the professor's strategy?
Explain the relationship to symbol coding.
}
% could get worked up over the all zero codeword, which corresponds to
% a possible non-detection; if this would require an extra test
% then presumably the story is a bit different, with some deliberate
% skewing of the tree to make it more likely that we get a positive
%result along the way.
\exercissxA{2}{ex.optimalcodep1}{% problem fixed Tue 12/12/00
Assume that a sequence of symbols
from the ensemble $X$ introduced at the beginning of this
chapter is compressed using the code $C_3$.
\amarginfignocaption{t}{
\begin{center}
$C_3$:\\[0.1in]
\begin{tabular}{cllcc} \toprule
$a_i$ & $c(a_i)$ & $p_i$ & \multicolumn{1}{c}{$h({p_i})$} & $l_i$
% {\rule[-3mm]{0pt}{8mm}}%strut
\\ \midrule
{\tt a} & {\tt 0} & \dhalf & 1.0 & 1 \\
{\tt b} & {\tt 10} & \dquarter & 2.0 & 2 \\
{\tt c} & {\tt 110} & \deighth & 3.0 & 3 \\
{\tt d} & {\tt 111} & \deighth & 3.0 & 3 \\
\bottomrule
\end{tabular}
\end{center}
}
Imagine picking one bit at random from
the binary encoded sequence $\bc = c(x_1)c(x_2)c(x_3)\ldots$ .
What is the probability that this bit is a 1?
}
\exercissxB{2}{ex.Huffmanqary}{
% (Optional)
How should the\index{Huffman code!general alphabet} binary
Huffman encoding scheme be modified to make optimal symbol codes
in an encoding alphabet with $q$ symbols? (Also known as `\ind{radix} $q$'.)
}
% answer, Hamming p.73:
% add enough states with probability zero to make the total
% number of states equal to $k(q-1)+1$, for some integer $k$.
% then repeatedly combine $q$ into 1
% \end{document}
%
% \item[A code $C(X)$ is {\em non-singular\/}] if every element of $\A_X$
% maps into a different string, \ie,
% \beq
% a_i \not = a_j \Rightarrow c(a_i) \not = c(a_j).
% \eeq
%
% \item[The extension $C^+$ of a code $C$] is a mapping from finite length
% strings of $\A_X$ to $\{0,1\}^+$
% % finite length strings of NAME?
% defined by the concatentation:
% \beq
% c(x_1 x_2 \ldots x_N) = c(x_1)c(x_2)\ldots c(x_N)
% \eeq
%
% \item[A code is uniquely decodeable] if its extension is non-singular.
%
\subsection*{Mixture codes}
It is a tempting idea to construct a `\ind{metacode}' from several symbol
codes that assign different-length codewords to the alternative
symbols, then switch from one
code to another, choosing whichever assigns the shortest codeword
to the current symbol.
Clearly we cannot do this for free.\index{bits back}
If one wishes to choose between two codes, then
it is necessary to lengthen the message in a way that
indicates which of the two codes is being used. If we indicate this
choice by
a single leading bit, it will be found that the resulting code
is suboptimal because it is incomplete (that is,
it fails the Kraft equality).
\exercissxA{3}{ex.mixsubopt}{
Prove that this metacode is incomplete,
and explain why this combined code is
suboptimal.
}
%
% need more on prefix property to make clear how strings are decodeable,
% self-punctuating.
\dvips
\section{Solutions}% to Chapter \protect\ref{ch3}'s exercises}
\fakesection{solns 3}
\soln{ex.C1101}{
Yes,
$C_2 = \{ {\tt{1}} , {\tt{1}}{\tt{0}}{\tt{1}} \}$
% $C_2 = \{ 1 , 101 \}$
is uniquely decodeable, even though
it is not a prefix code, because no two different strings
can map onto the same string; only the codeword $c(a_2)={\tt 101}$ contains
the symbol {\tt0}.
}
\soln{ex.KIconverse}{
We wish to prove that for any set of codeword lengths $\{ l_i \}$
satisfying the \Kraft\ inequality, there is a prefix code having those
lengths.
%
% Symbol Coding Budget -- cut this figure later, it is already in _l3
%
\begin{figure}[htbp]
\figuremargin{%
\begin{center}
\mbox{\psfig{figure=figs/budget1.eps,height=3in}\ \psfig{figure=figs/budgetmax.eps,height=3in}}
\end{center}
}{%
\caption[a]{The codeword supermarket and
the symbol coding budget. The `cost' $2^{-l}$ of each codeword
(with length $l$)
is indicated by the size of the box it is written in. The total budget
available when making a uniquely decodeable code is 1.}
\label{fig.budget1a}
}%
\end{figure}
This is readily proved by thinking of
the codewords illustrated in \figref{fig.budget1a}
as being in a `codeword supermarket', with size indicating
cost. We imagine purchasing\index{source code!supermarket}\index{supermarket for codewords}
codewords one at a time, starting from the shortest codewords (\ie, the biggest
purchases),
using the budget shown at the right of \figref{fig.budget1a}.
We start at one side of the codeword supermarket, say the top,
and purchase the first codeword of the required length. We advance down
the supermarket a distance $2^{-l}$, and purchase the next codeword
of the next required length, and so forth.
Because the codeword lengths are getting longer, and the corresponding
intervals are getting shorter, we can always buy
an adjacent codeword to the latest purchase, so there is no wasting
of the budget. Thus at the $I$th codeword we have advanced
a distance $\sum_{i=1}^{I} 2^{-l_i}$ down the supermarket;
if $\sum 2^{-l_i} \leq 1$, we will have purchased
all the codewords without running out of budget.
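 As an illustration, assuming the codewords of each length are laid out in
 binary order as in \figref{fig.budget1a}, take the lengths $\{ 1,2,3,3 \}$,
 for which $\sum_i 2^{-l_i} = 1$. The successive purchases start at distances
 $0$, $\dfrac{1}{2}$, $\dfrac{3}{4}$ and $\dfrac{7}{8}$ down the supermarket,
 and writing each starting point as a binary fraction to $l_i$ places gives
 \[
 0.{\tt 0} \rightarrow {\tt 0} , \hspace{0.15in}
 0.{\tt 10} \rightarrow {\tt 10} , \hspace{0.15in}
 0.{\tt 110} \rightarrow {\tt 110} , \hspace{0.15in}
 0.{\tt 111} \rightarrow {\tt 111} ,
 \]
 which is a prefix code with the required lengths.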
}
\soln{ex.Huffmanconverse}{
The proof that Huffman coding is optimal depends on
proving that the key step in the algorithm -- the decision to give
% combination of
the two symbols
with smallest probability equal encoded lengths
-- cannot lead to a larger expected length
than any other code. We can prove this by contradiction.
Assume that
the two symbols with smallest probability, called $a$ and $b$,
to which the Huffman algorithm would assign equal length
codewords,
do {\em not\/} have equal lengths in {\em any\/}
optimal symbol code.
The optimal symbol code is some
other rival code in which these two codewords
have unequal lengths $l_a$ and $l_b$ with $l_a < l_b$.
Without loss of
generality we can assume that this other code is a complete prefix code,
because any codelengths of a uniquely decodeable code
can be realized by a prefix code.
% We now consider transforming the other code into a new code
% in which we interchange \ldots
In this rival code, there must be some other symbol $c$ whose
probability $p_c$ is greater than $p_a$ and whose length
in the rival code is greater than or equal to $l_b$, because
the code for $b$ must have an adjacent codeword of equal or greater
length -- a complete prefix code never has a solo codeword
of the maximum length.
\begin{figure}%[htbp]
\figuremargin{%
\begin{tabular}{llllll} \toprule % \hline
symbol & \multicolumn{2}{c}{probability} & Huffman & Rival code's & Modified rival \\
& & & codewords & codewords & code \\ \midrule % [0.1in]\hline
$a$ & $p_a$ & \framebox[0.15in]{} & \framebox[1.50cm]{$c_{\rm H}(a)$} & \framebox[1.0cm]{$c_{\rm R}(a)$} & \framebox[1.6cm]{$c_{\rm R}(c)$}
\\[0.1in]
$b$ & $p_b$ & \framebox[0.1in]{} & \framebox[1.50cm]{$c_{\rm H}(b)$} & \framebox[1.5cm]{$c_{\rm R}(b)$} & \framebox[1.5cm]{$c_{\rm R}(b)$}
\\[0.1in]
$c$ & $p_c$ & \framebox[0.25in]{} & \framebox[0.95cm]{$c_{\rm H}(c)$} & \framebox[1.6cm]{$c_{\rm R}(c)$} & \framebox[1.0cm]{$c_{\rm R}(a)$}
\\ \bottomrule % [0.1in] \hline
\end{tabular}
}{%
\caption[a]{Proof that Huffman coding makes an optimal symbol code.
% The proof works by contradiction.
We assume that the rival code, which is said to be optimal, assigns {\em unequal\/} length
codewords to the two symbols with smallest probability, $a$ and $b$.
By interchanging codewords $a$ and $c$ of the rival code, where $c$ is a
symbol with rival codelength as long as $b$'s, we can make
a code better than the rival code. This shows that the rival code
was not optimal.
}
\label{fig.huffman.optimal}
}%
\end{figure}
Consider exchanging the codewords of $a$ and $c$ (\figref{fig.huffman.optimal}), so that
$a$ is encoded with the longer codeword that was $c$'s, and
$c$, which is more probable than $a$, gets the shorter codeword.
Clearly this reduces the expected length of the code.
The change in expected length is $(p_a-p_c)(l_c-l_a)$.
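 Explicitly, only the contributions of $a$ and $c$ to the expected length
 are altered by the exchange, so the change is
 \[
 ( p_a l_c + p_c l_a ) - ( p_a l_a + p_c l_c ) = (p_a - p_c)(l_c - l_a) < 0 ,
 \]
 since $p_a < p_c$ and $l_c \geq l_b > l_a$.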
Thus we have contradicted the assumption that the rival code is optimal.
Therefore it is valid to give the two symbols
with smallest probability equal encoded lengths.
Huffman coding produces optimal symbol codes.\ENDsolution
}
%\soln{ex.Cnud}{
%\soln{ex.Ctern}{
\soln{ex.HuffX2X3}{
A Huffman code
for $X^2$ where ${\cal A}_X = \{ {\tt 0},{\tt 1} \}$
and ${\cal P}_X = \{ 0.9,0.1 \}$
is $\{{\tt 00},{\tt 01},{\tt 10},{\tt 11}\} \rightarrow
\{{\tt 1},{\tt 01},{\tt 000},{\tt 001}\}$.
This code has $L(C,X^2) = 1.29$, whereas the entropy $H(X^2)$ is 0.938.
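 The expected length is obtained by weighting each codeword length by the
 probability of its source string:
 \[
 L(C,X^2) = 0.81 \times 1 + 0.09 \times 2 + 0.09 \times 3 + 0.01 \times 3 = 1.29 .
 \]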
A Huffman code for $X^3$ is
\[
\begin{array}{c}
\{{\tt 000},{\tt 100},{\tt 010},{\tt 001},{\tt 101},{\tt 011},{\tt 110},{\tt 111}\}
\rightarrow\\
\hspace*{1in} \{{\tt 1},{\tt 011},{\tt 010},{\tt 001},
{\tt 00000},{\tt 00001},{\tt 00010},{\tt
00011}\}.
\end{array}
\]
% corrected from 1.472 to 1.598
% 9802
This has expected length $L(C,X^3) = 1.598$ whereas the entropy $H(X^3)$
is 1.4069.
A Huffman code for $X^4$ maps the sixteen source strings to the
following codelengths:
\[
\begin{array}{c}
\{ {\tt 0000},{\tt 1000},{\tt 0100},{\tt 0010},{\tt 0001},{\tt 1100},{\tt 0110},{\tt 0011},{\tt 0101},
{\tt 1010},{\tt 1001},{\tt 1110},{\tt 1101}, \\
{\tt 1011},{\tt 0111},{\tt 1111} \}
\rightarrow \:\: \{ 1,3,3,3,4,6,7,7,7,7,7,9,9,9,10,10 \}.
% 10,10,9,9,9,7,7,7,7,7,6,4,3, 3,3,1\}.
\end{array}
\]
This has expected length $L(C,X^4) = 1.9702$ whereas the entropy $H(X^4)$
is 1.876.
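 Dividing by the block length puts these results on a common per-symbol
 scale. The entropy per source symbol is $H(X) = 0.469$ bits, and the
 Huffman codes above achieve
 \[
 \frac{L(C,X^2)}{2} = 0.645 , \hspace{0.2in}
 \frac{L(C,X^3)}{3} \simeq 0.533 , \hspace{0.2in}
 \frac{L(C,X^4)}{4} \simeq 0.493
 \]
 bits per source symbol, so the overhead shrinks as the block length grows.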
%
% 0.6,0.4
When ${\cal P}_X = \{ 0.6,0.4 \}$, the Huffman code for $X^2$ has lengths
$\{ 2,2,2,2 \}$; the expected length is 2 bits, and the
entropy is 1.94 bits. A
Huffman code for $X^4$ is shown in \tabref{fig.X4huff2}.
% , has lengths
% $\{0000,1000,0100,0010,0001,1100,0110,0011,0101,1010,1001,1110,1101,1011,0111,1111\} \rightarrow$
% $\{3,3,4,4,4,4,4,4,4,4,4,4,5,5,5,5\}$.
The expected length is 3.92 bits, and the entropy is 3.88 bits.
% see tmp3 for soln using huffman.p
% $\{0000,1000,0100,0010,0001,1100,0110,0011,0101,1010,1001,1110,1101,1011,0111,1111\} \rightarrow \{5,5,5,5,4,4,4,4,4,4,4,4,4,4,3,3\}$.
}
% see tmp3 for use of huffman.p
%\begin{figure}
%\figuremargin{%
\margintab{\footnotesize
\begin{center}
\begin{tabular}{clrl} \toprule % \hline
$a_i$ & $p_i$ &
% \multicolumn{1}{c}{$h({p_i})$} &
$l_i$ & $c(a_i)$
% {\rule[-3mm]{0pt}{8mm}}%strut
% \\[0.1in] \hline
\\ \midrule
{\tt 0000} & 0.1296 & 3 & {\tt 000 }\\
{\tt 0001} & 0.0864 & 4 & {\tt 0100 }\\
{\tt 0010} & 0.0864 & 4 & {\tt 0110 }\\
{\tt 0100} & 0.0864 & 4 & {\tt 0111 }\\
{\tt 1000} & 0.0864 & 3 & {\tt 100 }\\
{\tt 1100} & 0.0576 & 4 & {\tt 1010 }\\
{\tt 1010} & 0.0576 & 4 & {\tt 1100 }\\
{\tt 1001} & 0.0576 & 4 & {\tt 1101 }\\
{\tt 0110} & 0.0576 & 4 & {\tt 1110 }\\
{\tt 0101} & 0.0576 & 4 & {\tt 1111 }\\
{\tt 0011} & 0.0576 & 4 & {\tt 0010 }\\
{\tt 1110} & 0.0384 & 5 & {\tt 00110 }\\
{\tt 1101} & 0.0384 & 5 & {\tt 01010 }\\
{\tt 1011} & 0.0384 & 5 & {\tt 01011 }\\
{\tt 0111} & 0.0384 & 4 & {\tt 1011 }\\
{\tt 1111} & 0.0256 & 5 & {\tt 00111 }\\ \bottomrule %\hline
%expected length 3.9248
%entropy 3.8838
\end{tabular}
\end{center}
%}{%
\caption[a]{Huffman code for $X^4$ when $p_0=0.6$. Column 3 shows the
assigned codelengths and column 4 the codewords. Some strings
whose probabilities are identical, \eg, the fourth and fifth,
receive different codelengths.}
\label{fig.X4huff2}
}%
%\end{figure}
\soln{ex.Huffambig}{
The set of probabilities $\{ p_1,p_2,p_3,p_4 \} =
\{ \dsixth,\dsixth,\dthird,\dthird\}$ gives rise to two different optimal
sets of codelengths, because at the second step of the Huffman
coding algorithm we can choose any of the three possible pairings.
We may either put them in a constant length code
$\{ {\tt00},{\tt01},{\tt10},{\tt11} \}$ or
the code $\{ {\tt000},{\tt001},{\tt01},{\tt1} \}$.
Both codes have expected length 2.
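 Indeed, every codeword of the constant length code has length 2,
 and for the second code
 \[
 \dfrac{1}{6} \times 3 + \dfrac{1}{6} \times 3 + \dfrac{1}{3} \times 2 + \dfrac{1}{3} \times 1 = 2 .
 \]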
Another solution is $\{ p_1,p_2,p_3,p_4 \}$ $=$
$\{ \dfifth,\dfifth,\dfifth,\dtwofifth\}$.
% =$ $\{ 0.2 , 0.2 , 0.2 , 0.4 \} $.
And a third is $\{ p_1,p_2,p_3,p_4 \} =
\{ \dthird,\dthird,\dthird,0\}$.
}
\soln{ex.make.huffman.suck}{
Let $p_{\max}$ be the largest probability in $p_1,p_2,\ldots,p_I$.
The difference between the expected length
$L$ and the entropy $H$ can be no bigger than
$\max ( p_{\max} , 0.086 )$ \cite{Gallager78}.
%
See exercises \ref{ex.huffman.uniform}--\ref{ex.huffman.uniform2} to understand
where the curious 0.086 comes from.
}
\soln{ex.huffman.uniform}{
% removed to cutsolutions.tex
Length $-$ entropy = 0.086.
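 A sketch of where this number comes from, assuming the optimal code uses
 only the lengths 3 and 4: a code with five codewords of length 3 and
 six of length 4 is complete, since $5 \times 2^{-3} + 6 \times 2^{-4} = 1$,
 in agreement with the fraction $f^+ = 2 - 16/11 = 6/11$ given by
 equation (\ref{eq.HIf}). Then
 \[
 L = \frac{5 \times 3 + 6 \times 4}{11} = \frac{39}{11} \simeq 3.545 ,
 \hspace{0.2in}
 H(X) = \log_2 11 \simeq 3.459 ,
 \]
 and the difference is 0.086 bits.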
%length / entropy 1.0249
}
% \soln{ex.Huff99}{
% BORDERLINE
\soln{ex.optimalcodep1}{% problem fixed Tue 12/12/00
There are two ways to answer this problem correctly,
and one popular way to answer it incorrectly.
Let's give the incorrect answer first:
\begin{description}
\item[Erroneous answer\puncspace]
``We can pick a random bit by first picking a
random source symbol $x_i$ with probability $p_i$,
then picking a random bit from $c(x_i)$. If we define $f_i$
to be the fraction of the bits of $c(x_i)$ that are {\tt 1}s,
we find
\marginpar[b]{\small
\begin{center}
$C_3$:
\begin{tabular}{cllc} \toprule
$a_i$ & $c(a_i)$ & $p_i$ & $l_i$
\\ \midrule
{\tt a} & {\tt 0} & \dhalf & 1 \\
{\tt b} & {\tt 10} & \dquarter & 2 \\
{\tt c} & {\tt 110} & \deighth & 3 \\
{\tt d} & {\tt 111} & \deighth & 3 \\
\bottomrule
\end{tabular}
\end{center}
}
\beqan
\!\!\!\!\!\!\!\!\!\!
P(\mbox{bit is {\tt 1}}) &=& \sum_i p_i f_i
\label{eq.wrongp1}
\\ &=&
\dfrac{1}{2} \times 0 +
\dfrac{1}{4} \times \dfrac{1}{2} +
\dfrac{1}{8} \times \dfrac{2}{3} +
\dfrac{1}{8} \times 1
= \dthird \mbox{.''}
\eeqan
\end{description}
This answer is wrong because it falls for the \index{bus-stop paradox}{bus-stop fallacy},\index{paradox}
which was introduced in \exerciseref{ex.waitbus}: if buses arrive
at random, and we are interested in `the average time from one bus until
the next', we must distinguish two possible averages:
(a) the average time from a randomly chosen bus until the next;
(b) the average time between the bus you just missed and the next bus.
The second `average' is twice as big as the first because,
by waiting for a bus at a random time, you bias your selection of
a bus in favour of buses that follow a large gap. You're unlikely
to catch a bus that comes 10 seconds after a preceding bus!
Similarly, the symbols {\tt c} and {\tt d} get encoded into
longer-length binary strings than {\tt a}, so when we pick a bit
from the compressed string at random, we are more likely
to land in a bit belonging to a {\tt c} or a {\tt d}
than would be given by the probabilities $p_i$ in the
expectation (\ref{eq.wrongp1}). All the probabilities need to
be scaled up by $l_i$, and renormalized.
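 For $C_3$ the products $p_i l_i$ are
 $\{ \dfrac{1}{2}, \dfrac{1}{2}, \dfrac{3}{8}, \dfrac{3}{8} \}$,
 which sum to $\dfrac{7}{4}$, so the probability that a randomly chosen bit
 belongs to each symbol is
 \[
 \frac{ p_i l_i }{ \sum_j p_j l_j } =
 \left\{ \frac{2}{7} , \frac{2}{7} , \frac{3}{14} , \frac{3}{14} \right\}
 \]
 for {\tt a}, {\tt b}, {\tt c} and {\tt d} respectively, rather than
 $\{ \dfrac{1}{2}, \dfrac{1}{4}, \dfrac{1}{8}, \dfrac{1}{8} \}$.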
\begin{description}
\item[Correct answer in the same style\puncspace]
Every time symbol $x_i$ is encoded, $l_i$ bits
are added to the binary string, of which $f_i l_i$ are {\tt 1}s.
The expected number of {\tt 1}s added per symbol is
\beq
\sum_i p_i f_i l_i ;
\eeq
and the expected total number of bits added per symbol is
\beq
\sum_i p_i l_i .
\eeq
So the fraction of {\tt 1}s in the transmitted string is
\beqan
P(\mbox{bit is {\tt 1}}) &=& \frac{ \sum_i p_i f_i l_i }{ \sum_i p_i l_i }
\label{eq.rightp1}
\\ &=&
\frac{ \dfrac{1}{2} \times 0 +
\dfrac{1}{4} \times 1 +
\dfrac{1}{8} \times 2 +
\dfrac{1}{8} \times 3
}{ \dfrac{7}{4} }
= \frac{\dfrac{7}{8}}{\dfrac{7}{4}} = 1/2 .
\nonumber
\eeqan
\end{description}
For a general symbol code and a general ensemble,
the expectation (\ref{eq.rightp1}) is the correct answer.
But in this case, we can use a more powerful argument.
\begin{description}
\item[Information-theoretic answer\puncspace]
The encoded string $\bc$ is the output of
an optimal compressor that compresses samples from
$X$ down to an expected length of $H(X)$ bits. We can't expect to compress
this data any further. But if the probability $P(\mbox{bit is {\tt 1}})$
were not equal to $\dhalf$ then it {\em would\/} be possible to compress
the binary string further (using a block compression code, say).
Therefore $P(\mbox{bit is {\tt 1}})$
must be equal to $\dhalf$; indeed the probability of any sequence
of $l$ bits in the compressed stream taking on any particular
value must be $2^{-l}$. The output of a perfect compressor is always
perfectly random bits.
\begincuttable
To put it another way, if the probability $P(\mbox{bit is {\tt 1}})$
were not equal to $\dhalf$, then the information content per bit of
the compressed string would be at most $H_2( P(\mbox{{\tt 1}}) )$,
which would be less than 1;
but this contradicts the fact that we can recover the original data
from $\bc$, so the information content per bit of the
compressed string must be $H(X)/L(C,X)=1$.
\ENDcuttable
\end{description}
}
%
% this one is a new addition
%
\soln{ex.Huffmanqary}{ The \index{Huffman code!general alphabet}{general Huffman coding algorithm} for
an encoding alphabet with $q$ symbols
has one difference from the binary case.
The process of combining $q$ symbols into
1 symbol reduces the number of symbols by $q\!-\!1$.
So if we start with $A$ symbols, we'll only end up
with a complete $q$-ary tree if $A - 1$ is a multiple of $q\!-\!1$,
that is, if $A$ is equal to 1 modulo $(q\!-\!1)$.
Otherwise, we know that whatever prefix code we make, it
must be an incomplete tree with a number of missing
leaves equal, modulo $(q\!-\!1)$, to $1 - A$.
For example, if a ternary tree is built for eight symbols,
then there will unavoidably be one missing leaf in the tree.
The optimal $q$-ary code is made by putting these
extra leaves in the longest branch of the tree. This can be achieved
by adding the appropriate number of symbols to the original source
symbol set, all of these extra symbols having probability zero.
The total number of leaves is then equal to $r(q\!-\!1)+1$, for some
integer $r$.
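For instance (an illustrative case, not part of the exercise),
with $q \eq 4$ and $A \eq 5$ source symbols, the smallest number of the form
$r(q\!-\!1)+1 = 3r+1$ that is at least 5 is 7, so two zero-probability
symbols are added; with $q \eq 3$ and $A \eq 8$, the smallest number
of the form $2r+1$ that is at least 8 is 9, so one symbol is added,
in agreement with the ternary example above.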
The symbols are then repeatedly combined by taking
the $q$ symbols with smallest probability and replacing them
by a single symbol, as in the binary Huffman coding algorithm.}
\soln{ex.mixsubopt}{
%This is important but I haven't written it yet.
We wish to show that a greedy \ind{metacode}, which
picks the code which gives the shortest encoding, is
actually suboptimal, because it fails to satisfy the Kraft
equality: its Kraft sum is strictly less than one.
% For generality, let's call the
% that the objects to be encoded,
% $x$, `symbols'.
We'll assume that each symbol $x$ is
assigned lengths $l_k(x)$ by each of the candidate codes $C_k$.
Let us assume there are $K$ alternative codes and that we can
encode which code is being used with a header of length $\log_2 K$
bits.
Then the metacode assigns lengths $l'(x)$ that are given by
\beq
l'(x) = \log_2 K + \min_k l_k(x) .
\eeq
We compute the Kraft sum:
\beq
S = \sum_x 2^{- l'(x)}
= \frac{1}{K} \sum_x 2^{- \min_k l_k(x)} .
\eeq
Let's divide the set $\A_X$ into non-overlapping subsets $\{\A_k\}_{k=1}^{K}$
such that subset $\A_k$ contains all the symbols $x$
that the metacode sends via code $k$.
Then
\beq
S = \frac{1}{K} \sum_k \sum_{x \in \A_{k}} 2^{- l_k(x)} .
\eeq
Now if one sub-code $k$ satisfies the Kraft equality
$\sum_{x\in \A_X} 2^{- l_k(x)} \eq 1$, then
it must be the case that
\beq
\sum_{x \in \A_{k}} 2^{- l_k(x)} \leq 1 ,
\label{eq.from.kraft}
\eeq
with equality only if all the symbols $x$ are in $\A_k$, which would mean that we
are only using one of the $K$ codes.
So
\beq
S \leq \frac{1}{K} \sum_{k=1}^K 1 = 1 ,
\eeq
with equality only if \eqref{eq.from.kraft} is an equality for all codes $k$.
But it's impossible for all the symbols to be in {\em all\/} the
non-overlapping subsets $\{\A_k\}_{k=1}^{K}$, so
we can't have equality (\ref{eq.from.kraft}) holding
for {\em all\/} $k$.
So
%\beq
% S < 1 .
%\eeq
$S < 1$.
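To see the inequality at work, here is a small made-up example
(the alphabet and the codeword lengths are assumptions chosen purely
for illustration). Take $K \eq 2$ codes for four symbols
{\tt a}, {\tt b}, {\tt c}, {\tt d}, with code 1 assigning lengths
$1, 2, 3, 3$ and code 2 assigning lengths $3, 3, 2, 1$ respectively;
each code satisfies the Kraft equality.
The metacode sends {\tt a} and {\tt b} via code 1 and
{\tt c} and {\tt d} via code 2, so the lengths $l'(x)$ are $2, 3, 3, 2$, and
\beq
S = \frac{1}{2} \left[ \left( 2^{-1} + 2^{-2} \right)
+ \left( 2^{-2} + 2^{-1} \right) \right]
= \frac{3}{4} < 1 .
\eeq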
Another way of seeing that a mixture code is suboptimal is to consider
the binary tree that it defines. Think of the special case of two
codes. The first bit we send identifies which code we are using.
Now, in a complete code, any subsequent binary string is a valid
string. But once we know that we are using, say, code A, we know that
what follows can only be a codeword corresponding to a symbol $x$
whose encoding is shorter under code A than code B. So some strings
are invalid continuations, and the mixture code is incomplete
and suboptimal.
%%% MAYBE!!!!!!!!!!!!!!
For further discussion of this issue
and its relationship to probabilistic modelling
read about `\ind{bits back} coding' in \secref{sec.bitsback}
and in \citeasnoun{frey-98}.
}
% \dvipsb{solutions 3}
\prechapter{About Chapter}
\fakesection{prerequisites for chapter known as 4}
Before reading \chref{ch.four}, you should have read the previous chapter
and worked on
most of the exercises in it.
We'll also make use of some Bayesian modelling ideas
that arrived in the vicinity of \exerciseref{ex.postpa}.
% Arithmetic coding has been invented several times,
% by Elias, by Rissanen, and
% but is only slowly becoming well known
%
% {The description of Lempel--Ziv coding is based on that of Cover and Thomas (1991).}
%\chapter{Data Compression III: Stream Codes}
\mysetcounter{page}{126}
\ENDprechapter
\chapter{Stream Codes}
\label{ch.four}
\label{ch.ac}
% _l4.tex
\fakesection{Data Compression III: Stream Codes}
%
% still need to change notation for R(|)
%
\label{ch4}
In this chapter we discuss two data
compression schemes.\index{source code!stream codes|(}\index{stream codes|(}
%% that constitute the state of the art.
{\dem\indexs{arithmetic coding}Arithmetic coding}
is a beautiful method that goes
hand in hand with the philosophy that compression of data
from a source entails
probabilistic modelling of that source. As of 1999,
the best compression methods for text files use arithmetic coding,
and several state-of-the-art image compression systems
use it too.
{\dem\ind{Lempel--Ziv coding}} is a `\ind{universal}' method,
% in my opinion an ugly hack, but
designed under the philosophy that we would like a single compression
algorithm that will do a reasonable job for {\em any\/} source.
In fact, for many real-life
sources, this
algorithm's universal properties hold only
in the limit of unfeasibly large amounts of data, but,
all the same, Lempel--Ziv compression is widely used
and often effective.
\section{The guessing game}
\label{sec.startofch4}
% \looseness=-1 this did not achieve what was advertised!
As a motivation for these\index{game!guessing}
two compression methods,
% let us
consider the redundancy in a typical
% imagine compressing a
\ind{English} text file. Such files have redundancy at several levels: for example,
they contain the ASCII characters with non-equal frequency; certain consecutive
pairs of letters are more probable than others; and entire words
can be predicted given the context and a semantic understanding
of the text.
To illustrate the redundancy of English, and a curious way in which
it could be compressed, we can imagine a \ind{guessing game}
in which an English speaker repeatedly
attempts to predict the next character
in a text file.
% \subsection{The guessing game}
\label{sec.guessing}
% Could discuss the compression of English text by guessing
For simplicity, let us assume that the allowed alphabet consists
of
the 26 upper case letters {\tt A,B,C,\ldots, Z} and a space `{\tt -}'.
The game involves asking the subject to guess the next character
repeatedly, the only feedback being whether the guess is correct
or not, until the character is correctly guessed.
After a correct guess, we note the number of guesses that
were made when the character was identified, and ask the subject
to guess the next character in the same way.
One sentence
% given by Shannon
gave the following result when a human subject played this game.
% in a guessing game.
The numbers of guesses
are listed below each character.\index{reverse}\index{motorcycle}
% and the idea of having an identical twin. This introduces the idea
% of mapping to a different alphabet with nonuniform probability.
% The guessing game. From Shannon.
\smallskip
\begin{center}\hspace*{0.3in}
%\begin{tabular}{*{36}{c@{\,\,}}}
\begin{tabular}{*{36}{p{0.15in}@{}}}
\small\tt
T&\small\tt H&\small\tt E&\small\tt R&\small\tt E&\small\tt -&\small\tt I&\small\tt S&\small\tt -&\small\tt N&\small\tt O&\small\tt -&\small\tt R&\small\tt E&\small\tt V&\small\tt E&\small\tt R&\small\tt S&\small\tt E&\small\tt -&\small\tt O&\small\tt N&\small\tt -&\small\tt A&\small\tt -&\small\tt M&\small\tt O&\small\tt T&\small\tt O&\small\tt R&\small\tt C&\small\tt Y&\small\tt C&\small\tt L&\small\tt E&\small\tt -\\
\footnotesize
1&\footnotesize 1&\footnotesize 1&\footnotesize 5&\footnotesize 1&\footnotesize 1&\footnotesize 2&\footnotesize 1&\footnotesize 1&\footnotesize 2&\footnotesize 1&\footnotesize 1&\footnotesize \hspace{-0.05in}1\hspace{-0.25mm}5&\footnotesize 1&\footnotesize \hspace{-0.05in}1\hspace{-0.25mm}7&\footnotesize 1&\footnotesize 1&\footnotesize 1&\footnotesize 2&\footnotesize 1&\footnotesize 3&\footnotesize 2&\footnotesize 1&\footnotesize 2&\footnotesize 2&\footnotesize 7&\footnotesize 1&\footnotesize 1&\footnotesize 1&\footnotesize 1&\footnotesize 4&\footnotesize 1&\footnotesize 1&\footnotesize 1&\footnotesize 1&\footnotesize 1\\
\end{tabular}
\smallskip
\end{center}
% attempt to tighten this para:
\looseness=-1
Notice that in many cases, the next letter is guessed immediately, in one guess.
In other cases, particularly at the start of syllables,
more guesses are needed.
What do this game and these results offer us?
First, they demonstrate the redundancy of English from the point of
view of an English speaker.
Second, this game might be used in
a data compression scheme, as follows.
% encoding
The string of numbers `1, 1, 1, 5, 1, \ldots', listed above,
was obtained by presenting
the text to the subject. The maximum number of guesses that the
subject will make for a given letter is twenty-seven, so what the subject is
doing for us is performing a time-varying mapping of the twenty-seven letters
$\{ {\tt A,B,C,\ldots, Z,-}\}$ onto the twenty-seven numbers $\{1,2,3,\ldots,
27\}$, which we can view as symbols in a new alphabet. The total number of
symbols has not been reduced, but since he uses some of
these symbols much more frequently than others -- for example, 1 and
2 -- it should be easy to compress this new string of
symbols.
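In the sentence above, for example, 24 of the 36 numbers are 1 and
six are 2; the remaining six values (3, 4, 5, 7, 15 and 17) each occur once.
As a crude illustration, which ignores all context and rests on only
36 samples, the empirical entropy of these frequencies is
\beq
\frac{24}{36} \log_2 \frac{36}{24}
+ \frac{6}{36} \log_2 \frac{36}{6}
+ 6 \times \frac{1}{36} \log_2 36
\simeq 1.7 \mbox{ bits per character} ,
\eeq
compared with $\log_2 27 \simeq 4.8$ bits per character for a uniform
distribution over the twenty-seven symbols.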
% ; we will discuss data compression
%% the details of how to do this
% properly shortly.
% decoding
How would the {\em uncompression\/} of the sequence of numbers
`1, 1, 1, 5, 1, \ldots' work? At uncompression time,
we do not have the original string `{\small\tt{THERE}}\ldots', we
have only the encoded sequence. Imagine that our subject has an
absolutely \ind{identical twin}\index{twin}
%({\em absolutely\/} identical)
who also
plays the guessing game\index{guessing game} with us, as if we
%, the experimenters,
knew the source text.
If we stop him whenever he has made a
number of guesses equal to the given number, then he will have just
guessed the correct letter, and we can then say `yes, that's right',
and move to the next character.
Alternatively, if the identical twin is not available, we could
design a compression system with the help of just one human as follows.
We choose a window length $L$, that is, a number of characters of context
to show the human. For every one of the $27^L$ possible
strings of length $L$, we ask them, `What would you predict is the next character?',
and `If that prediction were wrong, what would your next guesses be?'.
After tabulating their answers to these $26 \times 27^L$ questions,
we could use two copies of these enormous tables at the encoder and the
decoder in place of the two human twins.
Such a language model is called an $L$th order \ind{Markov model}.
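To get a feeling for the size of these tables, take a window of
$L \eq 5$ characters (a value chosen here only for illustration):
the human would have to answer
\beq
26 \times 27^{5} = 26 \times 14\,348\,907 \simeq 3.7 \times 10^{8}
\eeq
questions.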
These systems are clearly unrealistic for practical compression,
but they illustrate several principles that we will make use of now.
\section{Arithmetic codes}
\label{sec.ac}
% In lecture 2 we discussed fixed length block codes.
When we discussed variable-length symbol codes, and the optimal
Huffman algorithm for constructing them, we concluded by pointing
out two practical
and theoretical problems with Huffman codes (section \ref{sec.huffman.probs}).
%
% index decision: {arithmetic coding} not {arithmetic codes}
%
These defects are rectified by {\dem\index{arithmetic coding}{arithmetic codes}}, which
were invented by Elias\nocite{EliasACmentionedpages61to62},\index{Elias, Peter}
by \index{Rissanen, Jorma}{Rissanen} and by \index{Pasco, Richard}{Pasco},
and subsequently made practical by
% Witten, Neal, and Cleary.
\citeasnoun{arith_coding}.\index{Neal, Radford}
In an arithmetic code, the
probabilistic modelling is clearly separated from the encoding
operation.
The system is rather similar to the guessing game.\index{guessing game}
% that we considered in Chapter \chtwo.
The human predictor is replaced by a
{\dem\ind{probabilistic model}} of the source.
As each symbol is produced by the source, the probabilistic model
supplies a {\dem\ind{predictive distribution}}
over all possible values of the next
symbol, that is, a list of positive numbers $\{ p_i \}$ that sum to
one. If we choose to model the source as producing i.i.d.\ symbols with some
known distribution,
then the predictive distribution is the same every time; but arithmetic
coding can with equal ease handle complex adaptive models that produce
context-dependent
% time-varying
predictive distributions. The predictive model is usually
implemented in a computer program.
% a model which hypothesizes arbitrary
% context-dependences and non-stationarities, and which learns as it
% goes, so that predictive distributions in any given context gradually
% sharpen up.
% I will give an example later on,
% of an adaptive model producing appropriate probabilities
% but first let us discuss the arithmetic coding algorithm itself.
The encoder makes use of the model's predictions to create a
binary string. The decoder makes use of an identical twin of the
model (just as in the guessing \index{guessing game}game) to interpret the binary string.
Let the source alphabet be $\A_X = \{a_1 ,\ldots, a_I\}$, and let the
$I$th symbol $a_I$ have the special meaning `end of transmission'.
The source
spits out a sequence $x_1,x_2,\ldots,x_n,\ldots.$ The source does {\em not\/}
necessarily produce i.i.d.\ symbols.
We will assume that a computer program is provided to the encoder
that assigns a predictive
probability distribution over $a_i$ given the sequence that has occurred
thus far,
$P(x_n \eq a_i \given x_1,\ldots,x_{n-1})$.
% Nor will we assume that the source
% is correctly modeled by $P$. But if it is, then arithmetic coding achieves
% the Shannon rate.
%
% The encoder will send a binary transmission to the receiver.
%
The receiver has an identical program that produces the
same predictive
probability distribution $P(x_n \eq a_i \given x_1,\ldots,x_{n-1})$.
% and uses it to interpret the received message.
\medskip
\begin{figure}[htbp]
\figuremargin{%
\begin{center}
\setlength{\unitlength}{1mm}
\begin{picture}(50,40)(0,0)
\put(18,40){\makebox(0,0)[r]{0.00}}
\put(18,30){\makebox(0,0)[r]{0.25}}
\put(18,20){\makebox(0,0)[r]{0.50}}
\put(18,10){\makebox(0,0)[r]{0.75}}
\put(18, 0){\makebox(0,0)[r]{1.00}}
%
% major horizontals
%
\put(20,40){\line(1,0){37}}
\put(20,30){\line(1,0){13}}
\put(20,20){\line(1,0){28}}
\put(20,10){\line(1,0){13}}
\put(20, 0){\line(1,0){37}}
%
% biggest intervals
%
\put(45,30){\vector(0,1){9}}
\put(45,30){\vector(0,-1){9}}
\put(47,30){\makebox(0,0)[l]{{\tt{0}}}}
\put(45,10){\vector(0,1){9}}
\put(45,10){\vector(0,-1){9}}
\put(47,10){\makebox(0,0)[l]{{\tt{1}}}}
%
\put(35,25){\vector(0,1){4}}
\put(35,25){\vector(0,-1){4}}
\put(37,25){\makebox(0,0)[l]{{\tt{01}}}}
% some subdivs
\put(20,35){\line(1,0){7}}
\put(20,25){\line(1,0){7}}
\put(20,15){\line(1,0){7}}
\put(20, 5){\line(1,0){7}}
%
% 01101 = 13/32 = 16.25
% 01110 = 14/32 = 17.5
\put(20,23.75){\line(1,0){4}}
\put(20,22.50){\line(1,0){4}}
\put(62,23.125){\makebox(0,0)[l]{{\tt{01101}}}}
%
% interrupted pointer:
\put(60,23.125){\line(-1,0){14}}
\put(44,23.125){\line(-1,0){8}}
\put(34,23.125){\vector(-1,0){9.5}}
%
\end{picture}
\end{center}
}{%
\caption[a]{Binary strings define real intervals within the real line [0,1).
We first encountered a picture like this when we discussed the
\index{supermarket for codewords}\index{symbol code!supermarket}\index{source code!supermarket}{symbol-code supermarket} in \chref{ch3}.
}
\label{fig.arith.Rbinary}
}%
\end{figure}
\subsection{Concepts for understanding arithmetic coding}
\begin{aside}
%\item[Notation for intervals.]
{\sf Notation for intervals.} The interval $[0.01, 0.10)$ is all numbers
between $0.01$ and $0.10$, including $0.01\dot{0}\equiv0.01000\ldots$ but not $0.10\dot{0}\equiv0.10000\ldots.$
\end{aside}
A binary transmission defines an interval within
the real line from 0 to 1. For example, the string {\tt{01}} is
interpreted as a binary real number 0.01\ldots, which corresponds to
the interval $[0.01, 0.10)$ in binary, \ie, the interval
$[0.25,0.50)$ in base ten.
%
% why strange line breaks?
%
The longer string {\tt{01101}} corresponds to a smaller
interval $[0.01101,$ $0.01110)$. Because {\tt{01101}} has the first string,
{\tt{01}}, as a prefix, the new interval is a sub-interval
of the interval $[0.01, 0.10)$.
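Explicitly, writing a subscript 2 to mark binary fractions
(a notation introduced only for this check),
the endpoints in base ten are
\beq
0.01101_2 = \frac{13}{32} = 0.40625 ,
\quad
0.01110_2 = \frac{14}{32} = 0.4375 ,
\eeq
so the string {\tt{01101}} picks out an interval of width
$2^{-5} = 1/32$ lying inside $[0.25,0.50)$.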
A one-megabyte binary file ($2^{23}$ bits) is thus viewed as specifying a number
between 0 and 1 to a precision of about two million
% $10^7$
decimal places -- {two million decimal digits, because
each byte translates into a little more than two decimal digits.}
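As a quick check of this estimate, each bit is worth
$\log_{10} 2 \simeq 0.301$ decimal digits, so each byte is worth
$8 \log_{10} 2 \simeq 2.4$ digits and the whole file corresponds to
\beq
2^{23} \log_{10} 2 \simeq 8\,388\,608 \times 0.301 \simeq 2.5 \times 10^{6}
\eeq
decimal digits.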
% byte = 8 bits ~= 2 digits.
%
% one meg-byte = 2^3 * 2^20 = 2^23 binary places -> 2.5*10^7 or (2**23=8388608) .
% shall I tell you a bedtime number between 0 and 1 to 10^7 d.p. darling?
%
\medskip
Now, we can also
% Similarly, we can
divide the real line [0,1) into $I$ intervals of
lengths equal to the probabilities $P(x_1 \eq a_i)$, as shown
in \figref{fig.arith.R}.
% upsidedown
% p1 = 6 -- 34 mid: 37 w = 3-1
% p2 = 16 cum 22 -- 18 mid: 26 w = 8-1
% last = 6 cum -- 6 mid: 3 w = 3-1
\newcommand{\aonelevel}{34}
\newcommand{\atwolevel}{18}
\newcommand{\apenlevel}{6}% penultimate
\newcommand{\apenmid}{12}% put dots here
\newcommand{\aonemid}{37}
\newcommand{\aonew}{2}
\newcommand{\atwow}{7}
\newcommand{\atwomid}{26}
\newcommand{\aIw}{2}
\newcommand{\aImid}{3}
\begin{figure}[htbp]
\figuremargin{%
\begin{center}
\setlength{\unitlength}{1mm}
\begin{picture}(50,40)(0,0)
\put(18,40){\makebox(0,0)[r]{0.00}}
\put(18,\aonelevel){\makebox(0,0)[r]{$P(x_1\eq a_1)$}}
\put(18,\atwolevel){\makebox(0,0)[r]{$P(x_1\eq a_1)+P(x_1\eq a_2)$}}
\put(18,\apenlevel){\makebox(0,0)[r]{$P(x_1\eq a_1)+\ldots+P(x_1\eq a_{I\!-\!1})$}}
\put(18, 0){\makebox(0,0)[r]{1.0}}
%
% major horizontals
%
\put(20,40){\line(1,0){37}}
\put(20,\aonelevel){\line(1,0){20}}
\put(20,\atwolevel){\line(1,0){20}}
\put(20,\apenlevel){\line(1,0){20}}
\put(20, 0){\line(1,0){37}}
\put(30,\apenmid){\makebox(0,0)[l]{$\vdots$}}
%
% biggest intervals
%
\put(35,\aonemid){\vector(0,1){\aonew}}
\put(35,\aonemid){\vector(0,-1){\aonew}}
\put(37,\aonemid){\makebox(0,0)[l]{$a_1$}}% or $P(x_1\eq a_1)$}}
\put(35,\atwomid){\vector(0,1){\atwow}}
\put(35,\atwomid){\vector(0,-1){\atwow}}
\put(37,\atwomid){\makebox(0,0)[l]{$a_2$}}% or $P(x_1\eq a_2)$}}
\put(35,\aImid){\vector(0,1){\aIw}}
\put(35,\aImid){\vector(0,-1){\aIw}}
\put(37,\aImid){\makebox(0,0)[l]{$a_I$}}% or $P(x_1\eq a_I)$}}
\put(37,\apenmid){\makebox(0,0)[l]{$\vdots$}}
%
\put(20,23){\line(1,0){4}}% beg of a5
\put(20,20){\line(1,0){4}}% end a5
%
\put(62,21.5){\makebox(0,0)[l]{$a_2 a_5$}}
% interrupted pointer:
\put(60,21.5){\line(-1,0){24}}
\put(34,21.5){\vector(-1,0){9.5}}
%
% a2a1: 34 is the top
%
\put(20,30){\line(1,0){4}}% end of a1
\put(20,28){\line(1,0){4}}% end of a2
\put(20,25){\line(1,0){4}}% end of a3
%
\put(62,32){\makebox(0,0)[l]{$a_2 a_1$}}
% interrupted pointer:
\put(60,32){\line(-1,0){24}}
\put(34,32){\vector(-1,0){9.5}}
%
\end{picture}
\end{center}
}{%
\caption[a]{A probabilistic model defines real
intervals within the real line [0,1).}
\label{fig.arith.R}
}%
\end{figure}
We may then take each interval $a_i$ and subdivide it into intervals
denoted $a_ia_1,a_ia_2,\ldots, a_ia_I$, such that the length of
$a_ia_j$ is proportional to $P(x_2 \eq a_j \given x_1 \eq a_i)$. Indeed the
length of the interval $a_ia_j$ will be precisely the joint probability
\beq
P(x_1 \eq
a_i,x_2\eq a_j)=P(x_1\eq a_i)P(x_2 \eq a_j \given x_1 \eq a_i).
\eeq
Iterating this procedure, the interval $[0,1)$ can be divided
into a sequence of intervals corresponding to all possible finite
length strings $x_1x_2\ldots x_N$, such that the length of an
interval is equal to the probability of the string given our model.
% This iterative procedure
\subsection{Formulae describing arithmetic coding}
\begin{aside}
The process depicted in \figref{fig.arith.R} can be written
explicitly as follows.
The intervals are defined in terms of the lower and upper cumulative probabilities
\beqan
Q_{n}(a_i \given x_1,\ldots,x_{n-1})
& \equiv & \sum_{i'\eq 1 }^{i-1} P(x_n \eq a_{i'} \given x_1,\ldots,x_{n-1}) ,
\label{eq.arith.Q} \\
R_{n}(a_i \given x_1,\ldots,x_{n-1})
& \equiv & \sum_{i'\eq 1 }^{i} P(x_n \eq a_{i'} \given x_1,\ldots,x_{n-1}) .
\label{eq.arith.R}
\eeqan
%
As the $n$th symbol arrives, we subdivide the $(n\!-\!1)$th
interval at the points defined by $Q_n$ and $R_n$.
For example, starting with the first symbol,
the intervals `$a_1$', `$a_2$',
% `$a_3$',
and `$a_I$' are
% first interval,
% which we will call
\beq
a_1 \leftrightarrow [Q_{1}(a_1),R_{1}(a_1))= [0,P(x_1 \eq a_1)) ,
\eeq
\beq
a_2 \leftrightarrow [Q_{1}(a_2),R_{1}(a_2))=
\left[
 P(x_1\eq a_1),P(x_1\eq a_1)+P(x_1\eq a_2) \right) ,
\eeq
%\beq
% a_3 \leftrightarrow [Q_{1}(a_3),R_{1}(a_3))=
% \left[
% P(x\eq a_1)+P(x\eq a_2) , P(x\eq a_1)+P(x\eq a_2) +P(x\eq a_3)\right),
%\eeq
and
\beq
a_I \leftrightarrow
\left[ Q_{1}(a_{I}) , R_{1}(a_{I}) \right)
= \left[ P(x_1\eq a_1)+\ldots+P(x_1\eq a_{I\!-\!1}) ,1.0 \right) .
\eeq
Algorithm \ref{alg.ac} describes the general procedure.
\end{aside}
\begin{algorithm}
\begin{framedalgorithmwithcaption}{
\caption[a]{Arithmetic coding.
Iterative procedure to find the interval $[u,v)$
for the string $x_1x_2\ldots x_N$.
}
\label{alg.ac}
}
%\algorithmmargin{%
\begin{center}
\begin{tabular}{l}
%\begin{description}% should be ALGORITHM
%\item[Iterative procedure to find the interval $[u,v)$
% corresponding to
% for the string $x_1x_2\ldots x_N$]
%
{\tt $u$ := 0.0} \\
{\tt $v$ := 1.0} \\
{\tt $p$ := $v-u$} \\
{\tt for $n$ = 1 to $N$ \verb+{+ } \\
\hspace*{0.5in} Compute the cumulative probabilities $Q_n$ and $R_n$
\protect(\ref{eq.arith.Q},\,\ref{eq.arith.R})
% $\{ R_{n}(a_i \given x_1,\ldots,x_{n-1}) \}_{i=1}^{I}$
%% $\{ R_{n,i \given x_1,\ldots,x_{n-1}} \}_{i=0}^{I}$
% using \eqref{eq.arith.R} \\
\\
\hspace*{0.5in} {\tt $v$ := $u + p R_{n}(x_n \given x_1,\ldots,x_{n-1}) $ } \\
\hspace*{0.5in} {\tt $u$ := $u + p Q_{n}(x_n \given x_1,\ldots,x_{n-1}) $ } \\
\hspace*{0.5in} {\tt $p$ := $v-u$} \\
% {\tt ) } \\
{\tt \verb+}+ } \\
\end{tabular}
\end{center}
%\end{description}
%}
\end{framedalgorithmwithcaption}
\end{algorithm}
To encode a string $x_1x_2\ldots x_N$,
we locate the interval corresponding to $x_1x_2\ldots x_N$, and
send a binary string whose interval lies within
that interval. This encoding can be performed
on the fly, as we now illustrate.
% \eof defined in itprnnchapter
\subsection{Example: compressing the tosses of a bent coin}
Imagine that we watch as a bent coin is tossed some number of times (\cf\
\exampleref{exa.bentcoin} and \secref{sec.bentcoin}
(\pref{sec.bentcoin})).
The two outcomes when the coin is tossed
are denoted $\tt a$ and $\tt b$. A third possibility is that the
experiment is halted, an event denoted by the `end of file' symbol, `$\eof$'.
Because the coin is bent, we expect that the probabilities of the outcomes $\tt a$
and $\tt b$ are not equal, though beforehand we don't know which
is the more probable outcome.
% Let $\A_X=\{a,b,\eof\}$, where
% $a$ and $\tb$ make up a binary alphabet with
% $\eof$ is an `end of file' symbol.
\subsubsection{Encoding\subsubpunc}
Let the source string be `$\tt bbba\eof$'. We pass along the string one symbol
at a time and use our model to compute the probability
distribution of the next symbol given the string thus far.
Let these probabilities be:
\[\begin{array}{l*{3}{r@{\eq}l}} \toprule
\mbox{Context } \\
\mbox{(sequence thus far) }
& \multicolumn{6}{c}{\mbox{Probability of next symbol}} \\[0.05in] \midrule
& P( \ta ) & 0.425 & P( \tb ) & 0.425 & P( \eof ) & 0.15 \\[0.05in]
\tb& P( \ta \given \tb ) & 0.28 & P( \tb \given \tb ) & 0.57 & P( \eof \given \tb ) & 0.15 \\[0.05in]
\tb\tb&P( \ta \given \tb\tb ) & 0.21 & P( \tb \given \tb\tb ) & 0.64 & P( \eof \given \tb\tb ) & 0.15 \\[0.05in]
\tb\tb\tb&P( \ta \given \tb\tb\tb ) & 0.17 & P( \tb \given \tb\tb\tb ) & 0.68 & P( \eof \given \tb\tb\tb ) & 0.15 \\[0.05in]
\tb\tb\tb\ta& P( \ta \given \tb\tb\tb\ta ) & 0.28 & P( \tb \given \tb\tb\tb\ta ) & 0.57 & P( \eof \given \tb\tb\tb\ta ) & 0.15 \\ \bottomrule
\end{array}
\]
\Figref{fig.ac} shows the corresponding intervals. The
interval $\tb$ is the middle 0.425 of $[0,1)$. The interval $\tb\tb$ is the
middle 0.567 of $\tb$, and so forth.
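As a sketch of how the iterative procedure of Algorithm \ref{alg.ac}
plays out for the string $\tb\tb\tb\ta\eof$, take the symbols in the
order $\ta$, $\tb$, $\eof$ and use the two-decimal probabilities
tabulated above. Because those entries are rounded, the values of $u$ and $v$
below may differ in the third decimal place from the boundaries
drawn in the figure.
\[
\begin{array}{cccccc} \toprule
n & x_n & Q_n & R_n & u & v \\ \midrule
1 & \tb & 0.425 & 0.85 & 0.42500 & 0.85000 \\
2 & \tb & 0.28 & 0.85 & 0.54400 & 0.78625 \\
3 & \tb & 0.21 & 0.85 & 0.59487 & 0.74991 \\
4 & \ta & 0 & 0.17 & 0.59487 & 0.62123 \\
5 & \eof & 0.85 & 1 & 0.61728 & 0.62123 \\ \bottomrule
\end{array}
\]
The width of the final interval,
$p = 0.425 \times 0.57 \times 0.64 \times 0.17 \times 0.15 \simeq 0.00395 \simeq 2^{-8}$,
is the probability of the whole string under the model.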
% in the following figure.
\begin{figure}[htbp]
\figuremargin{%
\begin{center}
% created by ac.p only_show_data=1 > ac/ac_data.tex %%%%%%% and edited by hand
\mbox{
\hspace{-0.1in}\small
\setlength{\unitlength}{4.8in}
%\setlength{\unitlength}{5.75in}
\begin{picture}(0.59130434782608698452,1)(-0.29565217391304349226,0)
\thinlines
% line 0.0000 from -0.5000 to 0.0000
\put( -0.2957, 1.0000){\line(1,0){ 0.2957}}
% a at -0.4500, 0.2125
\put( -0.2811, 0.7875){\makebox(0,0)[r]{\tt{a}}}
% line 0.4250 from -0.5000 to 0.0000
\put( -0.2957, 0.5750){\line(1,0){ 0.2957}}
% b at -0.4500, 0.6375
\put( -0.2811, 0.3625){\makebox(0,0)[r]{\tt{b}}}
% line 0.8500 from -0.5000 to 0.0000
\put( -0.2957, 0.1500){\line(1,0){ 0.2957}}
% \teof at -0.4500, 0.9250
\put( -0.2811, 0.0750){\makebox(0,0)[r]{\teof}}
% line 1.0000 from -0.5000 to 0.0000
\put( -0.2957, 0.0000){\line(1,0){ 0.2957}}
% ba at -0.3500, 0.4852
\put( -0.2220, 0.5148){\makebox(0,0)[r]{\tt{ba}}}
% line 0.5454 from -0.4500 to 0.0000
\put( -0.2661, 0.4546){\line(1,0){ 0.2661}}
% bb at -0.3500, 0.6658
\put( -0.2220, 0.3342){\makebox(0,0)[r]{\tt{bb}}}
% line 0.7862 from -0.4500 to 0.0000
\put( -0.2661, 0.2138){\line(1,0){ 0.2661}}
% b\teof at -0.3500, 0.8181
\put( -0.2220, 0.1819){\makebox(0,0)[r]{\tt{b\teof}}}
% bba at -0.2300, 0.5710
\put( -0.1510, 0.4290){\makebox(0,0)[r]{\tt{bba}}}
% line 0.5966 from -0.3500 to 0.0000
\put( -0.2070, 0.4034){\line(1,0){ 0.2070}}
% bbb at -0.2300, 0.6734
\put( -0.1510, 0.3266){\makebox(0,0)[r]{\tt{bbb}}}
% line 0.7501 from -0.3500 to 0.0000
\put( -0.2070, 0.2499){\line(1,0){ 0.2070}}
% bb\teof at -0.2300, 0.7682
\put( -0.1510, 0.2318){\makebox(0,0)[r]{\tt{bb\teof}}}
% bbba at -0.1000, 0.6096
\put( -0.0741, 0.3904){\makebox(0,0)[r]{\tt{bbba}}}
% line 0.6227 from -0.2300 to 0.0000
\put( -0.1360, 0.3773){\line(1,0){ 0.1360}}
% bbbb at -0.1000, 0.6749
\put( -0.0741, 0.3251){\makebox(0,0)[r]{\tt{bbbb}}}
% line 0.7271 from -0.2300 to 0.0000
\put( -0.1360, 0.2729){\line(1,0){ 0.1360}}
% bbb\teof at -0.1000, 0.7386
\put( -0.0741, 0.2614){\makebox(0,0)[r]{\tt{bbb\teof}}}
% line 0.6040 from -0.1000 to 0.0000
\put( -0.0591, 0.3960){\line(1,0){ 0.0591}}
% line 0.6188 from -0.1000 to 0.0000
\put( -0.0591, 0.3812){\line(1,0){ 0.0591}}
% line 0.0000 from 0.0100 to 0.5000
\put( 0.0059, 1.0000){\line(1,0){ 0.2897}}
% 0 at 0.0100, 0.2500
\put( 0.2811, 0.7500){\makebox(0,0)[l]{\tt0}}
% line 0.5000 from 0.0100 to 0.5000
\put( 0.0059, 0.5000){\line(1,0){ 0.2897}}
% 1 at 0.0100, 0.7500
\put( 0.2811, 0.2500){\makebox(0,0)[l]{\tt1}}
% line 1.0000 from 0.0100 to 0.5000
\put( 0.0059, 0.0000){\line(1,0){ 0.2897}}
% 00 at 0.0100, 0.1250
\put( 0.2397, 0.8750){\makebox(0,0)[l]{\tt00}}
% line 0.2500 from 0.0100 to 0.4500
\put( 0.0059, 0.7500){\line(1,0){ 0.2602}}
% 01 at 0.0100, 0.3750
\put( 0.2397, 0.6250){\makebox(0,0)[l]{\tt01}}
% 000 at 0.0100, 0.0625
\put( 0.1806, 0.9375){\makebox(0,0)[l]{\tt000}}
% line 0.1250 from 0.0100 to 0.3800
\put( 0.0059, 0.8750){\line(1,0){ 0.2188}}
% 001 at 0.0100, 0.1875
\put( 0.1806, 0.8125){\makebox(0,0)[l]{\tt001}}
% 0000 at 0.0100, 0.0312
% was at 0.1037, move 0.02 right -> 1207
\put( 0.1207, 0.9688){\makebox(0,0)[l]{\tt0000}}
% line 0.0625 from 0.0100 to 0.2800
\put( 0.0059, 0.9375){\line(1,0){ 0.1597}}
% 0001 at 0.0100, 0.0938
\put( 0.1207, 0.9062){\makebox(0,0)[l]{\tt0001}}
% 00000 at 0.0100, 0.0156
\put( 0.0387, 0.9844){\makebox(0,0)[l]{\tt00000}}
% line 0.0312 from 0.0100 to 0.1500
\put( 0.0059, 0.9688){\line(1,0){ 0.0828}}
% 00001 at 0.0100, 0.0469
\put( 0.0387, 0.9531){\makebox(0,0)[l]{\tt00001}}
% line 0.0156 from 0.0100 to 0.0400
\put( 0.0059, 0.9844){\line(1,0){ 0.0177}}
% line 0.0078 from 0.0100 to 0.0200
\put( 0.0059, 0.9922){\line(1,0){ 0.0059}}
% line 0.0234 from 0.0100 to 0.0200
\put( 0.0059, 0.9766){\line(1,0){ 0.0059}}
% line 0.0469 from 0.0100 to 0.0400
\put( 0.0059, 0.9531){\line(1,0){ 0.0177}}
% line 0.0391 from 0.0100 to 0.0200
\put( 0.0059, 0.9609){\line(1,0){ 0.0059}}
% line 0.0547 from 0.0100 to 0.0200
\put( 0.0059, 0.9453){\line(1,0){ 0.0059}}
% 00010 at 0.0100, 0.0781
\put( 0.0387, 0.9219){\makebox(0,0)[l]{\tt00010}}
% line 0.0938 from 0.0100 to 0.1500
\put( 0.0059, 0.9062){\line(1,0){ 0.0828}}
% 00011 at 0.0100, 0.1094
\put( 0.0387, 0.8906){\makebox(0,0)[l]{\tt00011}}
% line 0.0781 from 0.0100 to 0.0400
\put( 0.0059, 0.9219){\line(1,0){ 0.0177}}
% line 0.0703 from 0.0100 to 0.0200
\put( 0.0059, 0.9297){\line(1,0){ 0.0059}}
% line 0.0859 from 0.0100 to 0.0200
\put( 0.0059, 0.9141){\line(1,0){ 0.0059}}
% line 0.1094 from 0.0100 to 0.0400
\put( 0.0059, 0.8906){\line(1,0){ 0.0177}}
% line 0.1016 from 0.0100 to 0.0200
\put( 0.0059, 0.8984){\line(1,0){ 0.0059}}
% line 0.1172 from 0.0100 to 0.0200
\put( 0.0059, 0.8828){\line(1,0){ 0.0059}}
% 0010 at 0.0100, 0.1562
\put( 0.1207, 0.8438){\makebox(0,0)[l]{\tt0010}}
% line 0.1875 from 0.0100 to 0.2800
\put( 0.0059, 0.8125){\line(1,0){ 0.1597}}
% 0011 at 0.0100, 0.2188
\put( 0.1207, 0.7812){\makebox(0,0)[l]{\tt0011}}
% 00100 at 0.0100, 0.1406
\put( 0.0387, 0.8594){\makebox(0,0)[l]{\tt00100}}
% line 0.1562 from 0.0100 to 0.1500
\put( 0.0059, 0.8438){\line(1,0){ 0.0828}}
% 00101 at 0.0100, 0.1719
\put( 0.0387, 0.8281){\makebox(0,0)[l]{\tt00101}}
% line 0.1406 from 0.0100 to 0.0400
\put( 0.0059, 0.8594){\line(1,0){ 0.0177}}
% line 0.1328 from 0.0100 to 0.0200
\put( 0.0059, 0.8672){\line(1,0){ 0.0059}}
% line 0.1484 from 0.0100 to 0.0200
\put( 0.0059, 0.8516){\line(1,0){ 0.0059}}
% line 0.1719 from 0.0100 to 0.0400
\put( 0.0059, 0.8281){\line(1,0){ 0.0177}}
% line 0.1641 from 0.0100 to 0.0200
\put( 0.0059, 0.8359){\line(1,0){ 0.0059}}
% line 0.1797 from 0.0100 to 0.0200
\put( 0.0059, 0.8203){\line(1,0){ 0.0059}}
% 00110 at 0.0100, 0.2031
\put( 0.0387, 0.7969){\makebox(0,0)[l]{\tt00110}}
% line 0.2188 from 0.0100 to 0.1500
\put( 0.0059, 0.7812){\line(1,0){ 0.0828}}
% 00111 at 0.0100, 0.2344
\put( 0.0387, 0.7656){\makebox(0,0)[l]{\tt00111}}
% line 0.2031 from 0.0100 to 0.0400
\put( 0.0059, 0.7969){\line(1,0){ 0.0177}}
% line 0.1953 from 0.0100 to 0.0200
\put( 0.0059, 0.8047){\line(1,0){ 0.0059}}
% line 0.2109 from 0.0100 to 0.0200
\put( 0.0059, 0.7891){\line(1,0){ 0.0059}}
% line 0.2344 from 0.0100 to 0.0400
\put( 0.0059, 0.7656){\line(1,0){ 0.0177}}
% line 0.2266 from 0.0100 to 0.0200
\put( 0.0059, 0.7734){\line(1,0){ 0.0059}}
% line 0.2422 from 0.0100 to 0.0200
\put( 0.0059, 0.7578){\line(1,0){ 0.0059}}
% 010 at 0.0100, 0.3125
\put( 0.1806, 0.6875){\makebox(0,0)[l]{\tt010}}
% line 0.3750 from 0.0100 to 0.3800
\put( 0.0059, 0.6250){\line(1,0){ 0.2188}}
% 011 at 0.0100, 0.4375
\put( 0.1806, 0.5625){\makebox(0,0)[l]{\tt011}}
% 0100 at 0.0100, 0.2812
\put( 0.1207, 0.7188){\makebox(0,0)[l]{\tt0100}}
% line 0.3125 from 0.0100 to 0.2800
\put( 0.0059, 0.6875){\line(1,0){ 0.1597}}
% 0101 at 0.0100, 0.3438
\put( 0.1207, 0.6562){\makebox(0,0)[l]{\tt0101}}
% 01000 at 0.0100, 0.2656
\put( 0.0387, 0.7344){\makebox(0,0)[l]{\tt01000}}
% line 0.2812 from 0.0100 to 0.1500
\put( 0.0059, 0.7188){\line(1,0){ 0.0828}}
% 01001 at 0.0100, 0.2969
\put( 0.0387, 0.7031){\makebox(0,0)[l]{\tt01001}}
% line 0.2656 from 0.0100 to 0.0400
\put( 0.0059, 0.7344){\line(1,0){ 0.0177}}
% line 0.2578 from 0.0100 to 0.0200
\put( 0.0059, 0.7422){\line(1,0){ 0.0059}}
% line 0.2734 from 0.0100 to 0.0200
\put( 0.0059, 0.7266){\line(1,0){ 0.0059}}
% line 0.2969 from 0.0100 to 0.0400
\put( 0.0059, 0.7031){\line(1,0){ 0.0177}}
% line 0.2891 from 0.0100 to 0.0200
\put( 0.0059, 0.7109){\line(1,0){ 0.0059}}
% line 0.3047 from 0.0100 to 0.0200
\put( 0.0059, 0.6953){\line(1,0){ 0.0059}}
% 01010 at 0.0100, 0.3281
\put( 0.0387, 0.6719){\makebox(0,0)[l]{\tt01010}}
% line 0.3438 from 0.0100 to 0.1500
\put( 0.0059, 0.6562){\line(1,0){ 0.0828}}
% 01011 at 0.0100, 0.3594
\put( 0.0387, 0.6406){\makebox(0,0)[l]{\tt01011}}
% line 0.3281 from 0.0100 to 0.0400
\put( 0.0059, 0.6719){\line(1,0){ 0.0177}}
% line 0.3203 from 0.0100 to 0.0200
\put( 0.0059, 0.6797){\line(1,0){ 0.0059}}
% line 0.3359 from 0.0100 to 0.0200
\put( 0.0059, 0.6641){\line(1,0){ 0.0059}}
% line 0.3594 from 0.0100 to 0.0400
\put( 0.0059, 0.6406){\line(1,0){ 0.0177}}
% line 0.3516 from 0.0100 to 0.0200
\put( 0.0059, 0.6484){\line(1,0){ 0.0059}}
% line 0.3672 from 0.0100 to 0.0200
\put( 0.0059, 0.6328){\line(1,0){ 0.0059}}
% 0110 at 0.0100, 0.4062
\put( 0.1207, 0.5938){\makebox(0,0)[l]{\tt0110}}
% line 0.4375 from 0.0100 to 0.2800
\put( 0.0059, 0.5625){\line(1,0){ 0.1597}}
% 0111 at 0.0100, 0.4688
\put( 0.1207, 0.5312){\makebox(0,0)[l]{\tt0111}}
% 01100 at 0.0100, 0.3906
\put( 0.0387, 0.6094){\makebox(0,0)[l]{\tt01100}}
% line 0.4062 from 0.0100 to 0.1500
\put( 0.0059, 0.5938){\line(1,0){ 0.0828}}
% 01101 at 0.0100, 0.4219
\put( 0.0387, 0.5781){\makebox(0,0)[l]{\tt01101}}
% line 0.3906 from 0.0100 to 0.0400
\put( 0.0059, 0.6094){\line(1,0){ 0.0177}}
% line 0.3828 from 0.0100 to 0.0200
\put( 0.0059, 0.6172){\line(1,0){ 0.0059}}
% line 0.3984 from 0.0100 to 0.0200
\put( 0.0059, 0.6016){\line(1,0){ 0.0059}}
% line 0.4219 from 0.0100 to 0.0400
\put( 0.0059, 0.5781){\line(1,0){ 0.0177}}
% line 0.4141 from 0.0100 to 0.0200
\put( 0.0059, 0.5859){\line(1,0){ 0.0059}}
% line 0.4297 from 0.0100 to 0.0200
\put( 0.0059, 0.5703){\line(1,0){ 0.0059}}
% 01110 at 0.0100, 0.4531
\put( 0.0387, 0.5469){\makebox(0,0)[l]{\tt01110}}
% line 0.4688 from 0.0100 to 0.1500
\put( 0.0059, 0.5312){\line(1,0){ 0.0828}}
% 01111 at 0.0100, 0.4844
\put( 0.0387, 0.5156){\makebox(0,0)[l]{\tt01111}}
% line 0.4531 from 0.0100 to 0.0400
\put( 0.0059, 0.5469){\line(1,0){ 0.0177}}
% line 0.4453 from 0.0100 to 0.0200
\put( 0.0059, 0.5547){\line(1,0){ 0.0059}}
% line 0.4609 from 0.0100 to 0.0200
\put( 0.0059, 0.5391){\line(1,0){ 0.0059}}
% line 0.4844 from 0.0100 to 0.0400
\put( 0.0059, 0.5156){\line(1,0){ 0.0177}}
% line 0.4766 from 0.0100 to 0.0200
\put( 0.0059, 0.5234){\line(1,0){ 0.0059}}
% line 0.4922 from 0.0100 to 0.0200
\put( 0.0059, 0.5078){\line(1,0){ 0.0059}}
% 10 at 0.0100, 0.6250
\put( 0.2397, 0.3750){\makebox(0,0)[l]{\tt10}}
% line 0.7500 from 0.0100 to 0.4500
\put( 0.0059, 0.2500){\line(1,0){ 0.2602}}
% 11 at 0.0100, 0.8750
\put( 0.2397, 0.1250){\makebox(0,0)[l]{\tt11}}
% 100 at 0.0100, 0.5625
\put( 0.1806, 0.4375){\makebox(0,0)[l]{\tt100}}
% line 0.6250 from 0.0100 to 0.3800
\put( 0.0059, 0.3750){\line(1,0){ 0.2188}}
% 101 at 0.0100, 0.6875
\put( 0.1806, 0.3125){\makebox(0,0)[l]{\tt101}}
% 1000 at 0.0100, 0.5312
\put( 0.1207, 0.4688){\makebox(0,0)[l]{\tt1000}}
% line 0.5625 from 0.0100 to 0.2800
\put( 0.0059, 0.4375){\line(1,0){ 0.1597}}
% 1001 at 0.0100, 0.5938
\put( 0.1207, 0.4062){\makebox(0,0)[l]{\tt1001}}
% 10000 at 0.0100, 0.5156
\put( 0.0387, 0.4844){\makebox(0,0)[l]{\tt10000}}
% line 0.5312 from 0.0100 to 0.1500
\put( 0.0059, 0.4688){\line(1,0){ 0.0828}}
% 10001 at 0.0100, 0.5469
\put( 0.0387, 0.4531){\makebox(0,0)[l]{\tt10001}}
% line 0.5156 from 0.0100 to 0.0400
\put( 0.0059, 0.4844){\line(1,0){ 0.0177}}
% line 0.5078 from 0.0100 to 0.0200
\put( 0.0059, 0.4922){\line(1,0){ 0.0059}}
% line 0.5234 from 0.0100 to 0.0200
\put( 0.0059, 0.4766){\line(1,0){ 0.0059}}
% line 0.5469 from 0.0100 to 0.0400
\put( 0.0059, 0.4531){\line(1,0){ 0.0177}}
% line 0.5391 from 0.0100 to 0.0200
\put( 0.0059, 0.4609){\line(1,0){ 0.0059}}
% line 0.5547 from 0.0100 to 0.0200
\put( 0.0059, 0.4453){\line(1,0){ 0.0059}}
% 10010 at 0.0100, 0.5781
\put( 0.0387, 0.4219){\makebox(0,0)[l]{\tt10010}}
% line 0.5938 from 0.0100 to 0.1500
\put( 0.0059, 0.4062){\line(1,0){ 0.0828}}
% 10011 at 0.0100, 0.6094
\put( 0.0387, 0.3906){\makebox(0,0)[l]{\tt10011}}
% line 0.5781 from 0.0100 to 0.0400
\put( 0.0059, 0.4219){\line(1,0){ 0.0177}}
% line 0.5703 from 0.0100 to 0.0200
\put( 0.0059, 0.4297){\line(1,0){ 0.0059}}
% line 0.5859 from 0.0100 to 0.0200
\put( 0.0059, 0.4141){\line(1,0){ 0.0059}}
% line 0.6094 from 0.0100 to 0.0400
\put( 0.0059, 0.3906){\line(1,0){ 0.0177}}
% line 0.6016 from 0.0100 to 0.0200
\put( 0.0059, 0.3984){\line(1,0){ 0.0059}}
% line 0.6172 from 0.0100 to 0.0200
\put( 0.0059, 0.3828){\line(1,0){ 0.0059}}
% 1010 at 0.0100, 0.6562
\put( 0.1207, 0.3438){\makebox(0,0)[l]{\tt1010}}
% line 0.6875 from 0.0100 to 0.2800
\put( 0.0059, 0.3125){\line(1,0){ 0.1597}}
% 1011 at 0.0100, 0.7188
\put( 0.1207, 0.2812){\makebox(0,0)[l]{\tt1011}}
% 10100 at 0.0100, 0.6406
\put( 0.0387, 0.3594){\makebox(0,0)[l]{\tt10100}}
% line 0.6562 from 0.0100 to 0.1500
\put( 0.0059, 0.3438){\line(1,0){ 0.0828}}
% 10101 at 0.0100, 0.6719
\put( 0.0387, 0.3281){\makebox(0,0)[l]{\tt10101}}
% line 0.6406 from 0.0100 to 0.0400
\put( 0.0059, 0.3594){\line(1,0){ 0.0177}}
% line 0.6328 from 0.0100 to 0.0200
\put( 0.0059, 0.3672){\line(1,0){ 0.0059}}
% line 0.6484 from 0.0100 to 0.0200
\put( 0.0059, 0.3516){\line(1,0){ 0.0059}}
% line 0.6719 from 0.0100 to 0.0400
\put( 0.0059, 0.3281){\line(1,0){ 0.0177}}
% line 0.6641 from 0.0100 to 0.0200
\put( 0.0059, 0.3359){\line(1,0){ 0.0059}}
% line 0.6797 from 0.0100 to 0.0200
\put( 0.0059, 0.3203){\line(1,0){ 0.0059}}
% 10110 at 0.0100, 0.7031
\put( 0.0387, 0.2969){\makebox(0,0)[l]{\tt10110}}
% line 0.7188 from 0.0100 to 0.1500
\put( 0.0059, 0.2812){\line(1,0){ 0.0828}}
% 10111 at 0.0100, 0.7344
\put( 0.0387, 0.2656){\makebox(0,0)[l]{\tt10111}}
% line 0.7031 from 0.0100 to 0.0400
\put( 0.0059, 0.2969){\line(1,0){ 0.0177}}
% line 0.6953 from 0.0100 to 0.0200
\put( 0.0059, 0.3047){\line(1,0){ 0.0059}}
% line 0.7109 from 0.0100 to 0.0200
\put( 0.0059, 0.2891){\line(1,0){ 0.0059}}
% line 0.7344 from 0.0100 to 0.0400
\put( 0.0059, 0.2656){\line(1,0){ 0.0177}}
% line 0.7266 from 0.0100 to 0.0200
\put( 0.0059, 0.2734){\line(1,0){ 0.0059}}
% line 0.7422 from 0.0100 to 0.0200
\put( 0.0059, 0.2578){\line(1,0){ 0.0059}}
% 110 at 0.0100, 0.8125
\put( 0.1806, 0.1875){\makebox(0,0)[l]{\tt110}}
% line 0.8750 from 0.0100 to 0.3800
\put( 0.0059, 0.1250){\line(1,0){ 0.2188}}
% 111 at 0.0100, 0.9375
\put( 0.1806, 0.0625){\makebox(0,0)[l]{\tt111}}
% 1100 at 0.0100, 0.7812
\put( 0.1207, 0.2188){\makebox(0,0)[l]{\tt1100}}
% line 0.8125 from 0.0100 to 0.2800
\put( 0.0059, 0.1875){\line(1,0){ 0.1597}}
% 1101 at 0.0100, 0.8438
\put( 0.1207, 0.1562){\makebox(0,0)[l]{\tt1101}}
% 11000 at 0.0100, 0.7656
\put( 0.0387, 0.2344){\makebox(0,0)[l]{\tt11000}}
% line 0.7812 from 0.0100 to 0.1500
\put( 0.0059, 0.2188){\line(1,0){ 0.0828}}
% 11001 at 0.0100, 0.7969
\put( 0.0387, 0.2031){\makebox(0,0)[l]{\tt11001}}
% line 0.7656 from 0.0100 to 0.0400
\put( 0.0059, 0.2344){\line(1,0){ 0.0177}}
% line 0.7578 from 0.0100 to 0.0200
\put( 0.0059, 0.2422){\line(1,0){ 0.0059}}
% line 0.7734 from 0.0100 to 0.0200
\put( 0.0059, 0.2266){\line(1,0){ 0.0059}}
% line 0.7969 from 0.0100 to 0.0400
\put( 0.0059, 0.2031){\line(1,0){ 0.0177}}
% line 0.7891 from 0.0100 to 0.0200
\put( 0.0059, 0.2109){\line(1,0){ 0.0059}}
% line 0.8047 from 0.0100 to 0.0200
\put( 0.0059, 0.1953){\line(1,0){ 0.0059}}
% 11010 at 0.0100, 0.8281
\put( 0.0387, 0.1719){\makebox(0,0)[l]{\tt11010}}
% line 0.8438 from 0.0100 to 0.1500
\put( 0.0059, 0.1562){\line(1,0){ 0.0828}}
% 11011 at 0.0100, 0.8594
\put( 0.0387, 0.1406){\makebox(0,0)[l]{\tt11011}}
% line 0.8281 from 0.0100 to 0.0400
\put( 0.0059, 0.1719){\line(1,0){ 0.0177}}
% line 0.8203 from 0.0100 to 0.0200
\put( 0.0059, 0.1797){\line(1,0){ 0.0059}}
% line 0.8359 from 0.0100 to 0.0200
\put( 0.0059, 0.1641){\line(1,0){ 0.0059}}
% line 0.8594 from 0.0100 to 0.0400
\put( 0.0059, 0.1406){\line(1,0){ 0.0177}}
% line 0.8516 from 0.0100 to 0.0200
\put( 0.0059, 0.1484){\line(1,0){ 0.0059}}
% line 0.8672 from 0.0100 to 0.0200
\put( 0.0059, 0.1328){\line(1,0){ 0.0059}}
% 1110 at 0.0100, 0.9062
\put( 0.1207, 0.0938){\makebox(0,0)[l]{\tt1110}}
% line 0.9375 from 0.0100 to 0.2800
\put( 0.0059, 0.0625){\line(1,0){ 0.1597}}
% 1111 at 0.0100, 0.9688
\put( 0.1207, 0.0312){\makebox(0,0)[l]{\tt1111}}
% 11100 at 0.0100, 0.8906
\put( 0.0387, 0.1094){\makebox(0,0)[l]{\tt11100}}
% line 0.9062 from 0.0100 to 0.1500
\put( 0.0059, 0.0938){\line(1,0){ 0.0828}}
% 11101 at 0.0100, 0.9219
\put( 0.0387, 0.0781){\makebox(0,0)[l]{\tt11101}}
% line 0.8906 from 0.0100 to 0.0400
\put( 0.0059, 0.1094){\line(1,0){ 0.0177}}
% line 0.8828 from 0.0100 to 0.0200
\put( 0.0059, 0.1172){\line(1,0){ 0.0059}}
% line 0.8984 from 0.0100 to 0.0200
\put( 0.0059, 0.1016){\line(1,0){ 0.0059}}
% line 0.9219 from 0.0100 to 0.0400
\put( 0.0059, 0.0781){\line(1,0){ 0.0177}}
% line 0.9141 from 0.0100 to 0.0200
\put( 0.0059, 0.0859){\line(1,0){ 0.0059}}
% line 0.9297 from 0.0100 to 0.0200
\put( 0.0059, 0.0703){\line(1,0){ 0.0059}}
% 11110 at 0.0100, 0.9531
\put( 0.0387, 0.0469){\makebox(0,0)[l]{\tt11110}}
% line 0.9688 from 0.0100 to 0.1500
\put( 0.0059, 0.0312){\line(1,0){ 0.0828}}
% 11111 at 0.0100, 0.9844
\put( 0.0387, 0.0156){\makebox(0,0)[l]{\tt11111}}
% line 0.9531 from 0.0100 to 0.0400
\put( 0.0059, 0.0469){\line(1,0){ 0.0177}}
% line 0.9453 from 0.0100 to 0.0200
\put( 0.0059, 0.0547){\line(1,0){ 0.0059}}
% line 0.9609 from 0.0100 to 0.0200
\put( 0.0059, 0.0391){\line(1,0){ 0.0059}}
% line 0.9844 from 0.0100 to 0.0400
\put( 0.0059, 0.0156){\line(1,0){ 0.0177}}
% line 0.9766 from 0.0100 to 0.0200
\put( 0.0059, 0.0234){\line(1,0){ 0.0059}}
% line 0.9922 from 0.0100 to 0.0200
\put( 0.0059, 0.0078){\line(1,0){ 0.0059}}
\end{picture}
\hspace{-0.04in}% was -.25
\raisebox{1.1895in}{% was 1.425
\setlength{\unitlength}{33.39in}
%\setlength{\unitlength}{40in}
\begin{picture}(0.085,0.04)(-0.0425,0.37)
\thinlines
%
% wings added by hand
\put( -0.0408 , 0.4082){\line(-1,-3){ 0.005}}
\put( -0.0408 , 0.3730){\line(-1,3){ 0.005}}
%
% arrow identifying the final interval added by hand
% the center of the interval is 0010 below this point
% 10011110 (0.3809)
% 0.0017 is the length of the stubby lines
%
% want vector's tip to end at height 0.37995 and x=0.0010
% 4*34 = 136 -> 36635
% this was perfectly positioned
%\put( 0.0040, 0.36635){\makebox(0,0)[tl]{\tt100111101}}
%\put( 0.0044, 0.36635){\vector(-1,4){0.0034}}
% but I shifted it to this for arty reasons
\put( 0.0048, 0.36635){\makebox(0,0)[tl]{\tt100111101}}
\put( 0.0052, 0.36635){\vector(-1,4){0.0034}}
%
% line 0.5966 from -0.4800 to 0.0000
\put( -0.0408, 0.4034){\line(1,0){ 0.0408}}
% bbba at -0.2800, 0.6096
\put( -0.0252, 0.3904){\makebox(0,0)[r]{\tt{bbba}}}
% line 0.6227 from -0.4200 to 0.0000
\put( -0.0357, 0.3773){\line(1,0){ 0.0357}}
% bbbaa at -0.1000, 0.6003
\put( -0.0099, 0.3997){\makebox(0,0)[r]{\tt{bbbaa}}}
% line 0.6040 from -0.2800 to 0.0000
\put( -0.0238, 0.3960){\line(1,0){ 0.0238}}
% bbbab at -0.1000, 0.6114
\put( -0.0099, 0.3886){\makebox(0,0)[r]{\tt{bbbab}}}
% line 0.6188 from -0.2800 to 0.0000
\put( -0.0238, 0.3812){\line(1,0){ 0.0238}}
% bbba\eof at -0.1000, 0.6207
\put( -0.0099, 0.3793){\makebox(0,0)[r]{\tt{bbba\teof}}}
% line 0.6250 from 0.0100 to 0.4900
\put( 0.0008, 0.3750){\line(1,0){ 0.0408}}
% line 0.5938 from 0.0100 to 0.4200
\put( 0.0008, 0.4062){\line(1,0){ 0.0348}}
% 10011 at 0.0100, 0.6094
\put( 0.0299, 0.3906){\makebox(0,0)[l]{\tt10011}} % moved left a bit, was.0329
% 10010111 at 0.0100, 0.5918
\put( 0.0040, 0.4082){\makebox(0,0)[l]{\tt10010111}}
% line 0.5918 from 0.0100 to 0.0300
\put( 0.0008, 0.4082){\line(1,0){ 0.0017}}
% line 0.6094 from 0.0100 to 0.3700 % shortened, was .0306
\put( 0.0008, 0.3906){\line(1,0){ 0.0276}}
% line 0.6016 from 0.0100 to 0.3000
\put( 0.0008, 0.3984){\line(1,0){ 0.0246}}
% 10011000 at 0.0100, 0.5957
\put( 0.0040, 0.4043){\makebox(0,0)[l]{\tt10011000}}
% line 0.5977 from 0.0100 to 0.2100
\put( 0.0008, 0.4023){\line(1,0){ 0.0170}}
% 10011001 at 0.0100, 0.5996
\put( 0.0040, 0.4004){\makebox(0,0)[l]{\tt10011001}}
% line 0.5957 from 0.0100 to 0.0300
\put( 0.0008, 0.4043){\line(1,0){ 0.0017}}
% line 0.5996 from 0.0100 to 0.0300
\put( 0.0008, 0.4004){\line(1,0){ 0.0017}}
% 10011010 at 0.0100, 0.6035
\put( 0.0040, 0.3965){\makebox(0,0)[l]{\tt10011010}}
% line 0.6055 from 0.0100 to 0.2100
\put( 0.0008, 0.3945){\line(1,0){ 0.0170}}
% 10011011 at 0.0100, 0.6074
\put( 0.0040, 0.3926){\makebox(0,0)[l]{\tt10011011}}
% line 0.6035 from 0.0100 to 0.0300
\put( 0.0008, 0.3965){\line(1,0){ 0.0017}}
% line 0.6074 from 0.0100 to 0.0300
\put( 0.0008, 0.3926){\line(1,0){ 0.0017}}
% line 0.6172 from 0.0100 to 0.3000
\put( 0.0008, 0.3828){\line(1,0){ 0.0246}}
% 10011100 at 0.0100, 0.6113
\put( 0.0040, 0.3887){\makebox(0,0)[l]{\tt10011100}}
% line 0.6133 from 0.0100 to 0.2100
\put( 0.0008, 0.3867){\line(1,0){ 0.0170}}
% 10011101 at 0.0100, 0.6152
\put( 0.0040, 0.3848){\makebox(0,0)[l]{\tt10011101}}
% line 0.6113 from 0.0100 to 0.0300
\put( 0.0008, 0.3887){\line(1,0){ 0.0017}}
% line 0.6152 from 0.0100 to 0.0300
\put( 0.0008, 0.3848){\line(1,0){ 0.0017}}
% 10011110 at 0.0100, 0.6191
\put( 0.0040, 0.3809){\makebox(0,0)[l]{\tt10011110}}
% line 0.6211 from 0.0100 to 0.2100
\put( 0.0008, 0.3789){\line(1,0){ 0.0170}}
% 10011111 at 0.0100, 0.6230
\put( 0.0040, 0.3770){\makebox(0,0)[l]{\tt10011111}}
% line 0.6191 from 0.0100 to 0.0300
\put( 0.0008, 0.3809){\line(1,0){ 0.0017}}
% line 0.6230 from 0.0100 to 0.0300
\put( 0.0008, 0.3770){\line(1,0){ 0.0017}}
% 10100000 at 0.0100, 0.6270
\put( 0.0040, 0.3730){\makebox(0,0)[l]{\tt10100000}}
% line 0.6289 from 0.0100 to 0.2100
\put( 0.0008, 0.3711){\line(1,0){ 0.0170}}
% line 0.6270 from 0.0100 to 0.0300
\put( 0.0008, 0.3730){\line(1,0){ 0.0017}}
\end{picture}
}
}
\end{center}
}{%
\caption[a]{Illustration of the arithmetic coding
process as the sequence {$\tt bbba\eof$} is
transmitted.}
\label{fig.ac}
}%
\end{figure}
When the first symbol `$\tb$' is observed, the encoder
knows that the encoded string will start `{\tt{01}}',
`{\tt{10}}', or `{\tt{11}}',
but does not know which. The encoder writes nothing
for the time being, and
examines the next symbol, which is `$\tb$'.
The interval `$\tt bb$' lies wholly within interval `{\tt{1}}', so
the encoder can write the first bit: `{\tt{1}}'.
The third symbol `$\tt b$' narrows down the interval
a little, but not quite enough for it to lie
wholly within interval `{\tt{10}}'. Only when the next `$\tt a$'
is read from the source can we transmit some more
bits. Interval `$\tt bbba$' lies wholly within the interval `{\tt{1001}}',
so the encoder adds `{\tt{001}}'
to the `{\tt{1}}' it has written. Finally when the `$\eof$'
arrives, we need a procedure for terminating the encoding.
Magnifying the interval `$\tt bbba\eof$' (\figref{fig.ac}, right)
we note that the marked interval `{\tt{100111101}}'
is wholly contained by $\tt bbba\eof$, so the encoding can be completed by
appending `{\tt{11101}}'.
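The steps above can be checked numerically. The following Python sketch is
only an illustration, not the program that generated \figref{fig.ac}. It assumes
the model described for that figure later in this section (probability 0.15 for
$\eof$, with the remaining 0.85 shared between $\ta$ and $\tb$ by Laplace's rule),
writes {\tt\$} for $\eof$, and uses exact fractions. The function names
({\tt predict}, {\tt source\_interval}, {\tt binary\_interval}, {\tt contains},
{\tt shortest\_codeword}) are hypothetical.

```python
from fractions import Fraction
from math import ceil

P_EOF = Fraction(15, 100)      # assumed fixed end-of-file probability
SYMBOLS = ("a", "b", "$")      # "$" stands in for the end-of-file symbol


def predict(history):
    """Predictive distribution for the next symbol, given the symbols so far."""
    fa, fb = history.count("a"), history.count("b")
    pa = (1 - P_EOF) * Fraction(fa + 1, fa + fb + 2)   # Laplace's rule, scaled
    return {"a": pa, "b": (1 - P_EOF) - pa, "$": P_EOF}


def source_interval(string):
    """Return (low, high) of the interval [low, high) assigned to string."""
    low, width = Fraction(0), Fraction(1)
    for n, sym in enumerate(string):
        probs = predict(string[:n])
        for s in SYMBOLS:
            if s == sym:
                width *= probs[s]
                break
            low += width * probs[s]    # skip the sub-intervals that precede sym
    return low, low + width


def binary_interval(bits):
    """Interval of all numbers in [0, 1) whose binary expansion starts with bits."""
    low = Fraction(int(bits, 2), 2 ** len(bits))
    return low, low + Fraction(1, 2 ** len(bits))


def contains(outer, inner):
    """True if the interval inner lies wholly within the interval outer."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def shortest_codeword(string):
    """Shortest binary string whose interval lies wholly within that of string."""
    low, high = source_interval(string)
    length = 1
    while True:
        k = ceil(low * 2 ** length)    # first grid point at or above low
        if Fraction(k + 1, 2 ** length) <= high:
            return format(k, "0{}b".format(length))
        length += 1


print(contains(binary_interval("1"), source_interval("bb")))        # True
print(contains(binary_interval("10"), source_interval("bbb")))      # False
print(contains(binary_interval("1001"), source_interval("bbba")))   # True
print(shortest_codeword("bbba$"))                                   # 100111101
```

This sketch computes the whole interval before choosing a codeword, whereas the
encoder described above writes bits as soon as they are determined; the final
string is the same in this example.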
\exercissxA{2}{ex.ac.terminate}{
Show that the overhead required to terminate a message
is never more than 2 bits, relative to the ideal message length given the
probabilistic model $\H$,
$h(\bx \given \H) = \log [ 1/ P(\bx \given \H)]$.
}
% \begin{center}
% % created by ac.p sub=1 unit=40 only_show_data=1 > ac/ac_sub_data.tex
% \input{figs/ac/ac_sub_data.tex}
% \end{center}
This is an important result. Arithmetic coding is
very nearly optimal. The message length is always
within two bits of the \ind{Shannon information content}\index{information content} of the entire
source string,
so the expected message length is within two bits of the
entropy of the entire message.
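As a check on this bound, consider the example of \figref{fig.ac}, assuming
the model used to generate that figure (a fixed probability of 0.15 for $\eof$,
with the remaining 0.85 divided between $\ta$ and $\tb$ by Laplace's rule).
The probability of the source string is
\beq
P(\tb\tb\tb\ta\eof) = 0.425 \times 0.567 \times 0.6375 \times 0.17 \times 0.15
\simeq 0.0039 ,
\eeq
so its Shannon information content is
$\log_2 ( 1 / 0.0039 ) \simeq 8.0$ bits.
The transmitted string `{\tt{100111101}}' has 9 bits, an overhead of about
one bit, which is within the two-bit bound.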
\subsubsection{Decoding\subsubpunc}
The decoder receives the string `{\tt{100111101}}'
and passes along it one
symbol at a time. First, the probabilities $P(\ta), P(\tb), P(\eof)$ are computed
using the identical program that the encoder used and the intervals
`$\ta$', `$\tb$' and `$\eof$' are deduced. Once the first two
bits `{\tt{10}}' have been examined, it is certain that the original string
must have started with a `$\tb$', since the interval `{\tt{10}}' lies wholly within
interval `$\tb$'. The decoder can then use the model to compute $P(\ta \given \tb),
P(\tb \given \tb), P(\eof \given \tb)$ and deduce the boundaries of the intervals
`$\tb\ta$', `$\tb\tb$' and `$\tb\eof$'.
Continuing, we decode the second $\tb$ once we reach `{\tt{1001}}',
the third $\tb$ once we reach `{\tt{100111}}', and so forth, with the
unambiguous identification of `$\tb\tb\tb\ta\eof$' once the whole binary
string has been
read. With the convention that `$\eof$' denotes the end of the message, the decoder
knows to stop decoding.
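The decoding procedure just described can be sketched in Python as follows.
This is a minimal illustration under the same assumptions as the model used for
\figref{fig.ac} (0.15 for $\eof$, Laplace's rule for $\ta$ and $\tb$), with
{\tt\$} standing for $\eof$; the function names {\tt predict} and {\tt decode}
are hypothetical. The decoder reads one bit at a time and emits a source symbol
as soon as the binary interval read so far lies wholly within that symbol's
interval.

```python
from fractions import Fraction

P_EOF = Fraction(15, 100)      # assumed fixed end-of-file probability
SYMBOLS = ("a", "b", "$")      # "$" stands in for the end-of-file symbol


def predict(history):
    """Same predictive model as the encoder must use."""
    fa, fb = history.count("a"), history.count("b")
    pa = (1 - P_EOF) * Fraction(fa + 1, fa + fb + 2)   # Laplace's rule, scaled
    return {"a": pa, "b": (1 - P_EOF) - pa, "$": P_EOF}


def decode(bits):
    """Return the decoded string and the number of bits read at each emission."""
    low, width = Fraction(0), Fraction(1)    # interval of the decoded prefix
    u, size = Fraction(0), Fraction(1)       # interval of the bits read so far
    decoded, bits_read_at = "", []
    for n, bit in enumerate(bits, start=1):
        size /= 2
        if bit == "1":
            u += size
        # Emit every symbol that has now become certain.
        while not decoded.endswith("$"):
            probs = predict(decoded)
            s_low = low
            for s in SYMBOLS:
                s_width = width * probs[s]
                if s_low <= u and u + size <= s_low + s_width:
                    decoded += s
                    low, width = s_low, s_width
                    bits_read_at.append(n)
                    break
                s_low += s_width
            else:
                break    # no symbol is certain yet: read another bit
    return decoded, bits_read_at


print(decode("100111101"))    # ('bbba$', [2, 4, 6, 8, 9])
```

The second element of the result records that the three $\tb$s are identified
after 2, 4 and 6 bits respectively, in agreement with the description above.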
\subsubsection{Transmission of multiple files\subsubpunc}
How might one use arithmetic coding to communicate
several distinct files over the binary channel?
Once the $\eof$ character has been
transmitted, we imagine that the decoder is
reset into its initial state. There is no
transfer of the learnt statistics of the first file
to the second file.
% We start a fresh arithmetic code.
If, however, we did believe that there is a relationship
among the files that we are going to compress, we could
define our alphabet differently, introducing
a second end-of-file character that
marks the end of the file but instructs
the encoder and decoder to continue using the
same probabilistic model.
% If we went this route,
% we would only be able to uncompress the second file
% after first uncompressing the first file.
\subsection{The big picture}
Notice that to communicate a string of $N$ letters
% coming from an alphabet of size $|\A| = I$
both the encoder and the decoder needed to compute only $N|\A|$
conditional probabilities -- the probabilities of each possible letter
in each context actually encountered -- just
as in the guessing game.\index{guessing game} This cost can be contrasted with the alternative
of using a Huffman code\index{Huffman code!disadvantages}
with a large block size (in order to
reduce the possible one-bit-per-symbol overhead discussed in
% the previous chapter
section \ref{sec.huffman.probs}), where {\em all\/} block sequences that could
occur
% be encoded
% in a block
must be considered and their probabilities evaluated.
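To give a sense of scale (an illustrative calculation, with $N$ and $|\A|$
chosen arbitrarily): for a binary alphabet, $|\A| = 2$, and a string of $N = 100$
letters, arithmetic coding requires
\beq
N |\A| = 200
\eeq
conditional probabilities, whereas the number of distinct blocks of length $N$ is
\beq
|\A|^{N} = 2^{100} \simeq 1.3 \times 10^{30} .
\eeq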
Notice how flexible arithmetic coding is:
it can be used with any source alphabet
and any encoded alphabet. The size of the source alphabet and the encoded
alphabet can
change with
time. Arithmetic coding can be used with any probability distribution,
which can change utterly from context to context.
Furthermore, if we would like the symbols of the encoding alphabet (say, {\tt 0} and {\tt 1})
to be used with {\em unequal\/} frequency, that can easily be arranged by subdividing
the right-hand interval in proportion to the required frequencies.
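A minimal Python sketch of this last idea, assuming a binary encoding alphabet:
the parameter {\tt p0}, a hypothetical name, is the fraction of each interval
given to {\tt 0}, so that ${\tt p0} = 1/2$ recovers the usual equal subdivision.

```python
from fractions import Fraction


def codeword_interval(bits, p0=Fraction(1, 2)):
    """Interval [low, high) of bits when '0' takes a fraction p0 of each split."""
    low, width = Fraction(0), Fraction(1)
    for bit in bits:
        if bit == "0":
            width *= p0              # keep the first part of the interval
        else:
            low += width * p0        # skip the part belonging to '0'
            width *= 1 - p0
    return low, low + width


print(codeword_interval("10"))                       # equal subdivision
print(codeword_interval("10", p0=Fraction(3, 4)))    # '0' favoured over '1'
```

Only the subdivision of the right-hand side changes; the source intervals on the
left, and the rule that a codeword's interval must lie within the source
string's interval, are unaffected.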
\subsection{How the probabilistic model might make its predictions}
The technique of arithmetic coding does not force one to
produce the predictive probability in any particular way, but
the predictive distributions might naturally be produced by a
Bayesian model.
\Figref{fig.ac} was generated using a
simple model that always assigns a probability of 0.15 to $\eof$,
and assigns the remaining 0.85 to $\ta$ and $\tb$, divided
in proportion to probabilities given by Laplace's
rule,
\beq
P_{\rm L}(\ta \given x_1,\ldots,x_{n-1})=\frac{F_{\ta}+1}{F_{\ta}+F_{\tb}+2} ,
\label{eq.laplaceagain}
\eeq
where
$F_{\ta}(x_1,\ldots,x_{n-1})$ is the number of times that $\ta$ has
occurred so far, and $F_{\tb}$ is the count of $\tb$s.
These predictions correspond to a simple
Bayesian model that expects and adapts to
% is able to learn
a non-equal frequency
of use of the source symbols $\ta$ and $\tb$ within a file.
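For example, after the string $\tt bbb$ has been seen we have
$F_{\ta} = 0$ and $F_{\tb} = 3$, so
\beq
P_{\rm L}(\ta \given \tt bbb) = \frac{0+1}{0+3+2} = \frac{1}{5} ,
\eeq
and the model assigns probability $0.85 \times 1/5 = 0.17$ to $\ta$,
$0.85 \times 4/5 = 0.68$ to $\tb$, and $0.15$ to $\eof$.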
% The end result will be an encoder that can adapt to a nonuniform source.
\Figref{fig.ac2} displays the intervals corresponding to
a number of strings of length up to five. Note that if the string so far
has contained a large number of $\tb$s then the probability of
$\tb$ relative to $\ta$ is increased, and conversely if many $\ta$s
occur then $\ta$s are made more probable. Larger intervals, remember,
require fewer bits to encode.
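The effect can be seen by comparing two strings of length four. The following
Python sketch assumes the model described above (0.15 for $\eof$, Laplace's
rule for the rest) and handles strings of $\ta$s and $\tb$s only; the function
name {\tt string\_probability} is hypothetical. The probability of a string is
the width of its interval.

```python
from fractions import Fraction
from math import log2

P_EOF = Fraction(15, 100)      # assumed fixed end-of-file probability


def string_probability(string):
    """Probability (interval width) of a string of 'a's and 'b's, no end-of-file."""
    prob = Fraction(1)
    fa = fb = 0                # counts of 'a' and 'b' seen so far
    for sym in string:
        pa = (1 - P_EOF) * Fraction(fa + 1, fa + fb + 2)   # Laplace's rule, scaled
        if sym == "a":
            prob *= pa
            fa += 1
        else:
            prob *= (1 - P_EOF) - pa
            fb += 1
    return prob


for s in ("bbbb", "abab"):
    p = string_probability(s)
    # width of the interval, and its information content in bits
    print(s, float(p), log2(float(1 / p)))
```

Under these assumptions the repetitive string $\tt bbbb$ receives an interval
about six times wider than that of $\tt abab$.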
%
\begin{figure}[tbp]
\figuremargin{%
\begin{center}
% created by ac.p only_show_data=1 > ac/ac_data.tex
\mbox{
\setlength{\unitlength}{5.75in}
\begin{picture}(0.59130434782608698452,1)(-0.29565217391304349226,0)
\thinlines
% line 0.0000 from -0.5000 to 0.0000
\put( -0.2957, 1.0000){\line(1,0){ 0.2957}}
% a at -0.4500, 0.2125
\put( -0.2811, 0.7875){\makebox(0,0)[r]{\tt{a}}}
% line 0.4250 from -0.5000 to 0.0000
\put( -0.2957, 0.5750){\line(1,0){ 0.2957}}
% b at -0.4500, 0.6375
\put( -0.2811, 0.3625){\makebox(0,0)[r]{\tt{b}}}
% line 0.8500 from -0.5000 to 0.0000
\put( -0.2957, 0.1500){\line(1,0){ 0.2957}}
% \eof at -0.4500, 0.9250
\put( -0.2811, 0.0750){\makebox(0,0)[r]{\tt{\teof}}}
% line 1.0000 from -0.5000 to 0.0000
\put( -0.2957, 0.0000){\line(1,0){ 0.2957}}
% aa at -0.3500, 0.1204
\put( -0.2220, 0.8796){\makebox(0,0)[r]{\tt{aa}}}
% line 0.2408 from -0.4500 to 0.0000
\put( -0.2661, 0.7592){\line(1,0){ 0.2661}}
% ab at -0.3500, 0.3010
\put( -0.2220, 0.6990){\makebox(0,0)[r]{\tt{ab}}}
% line 0.3612 from -0.4500 to 0.0000
\put( -0.2661, 0.6388){\line(1,0){ 0.2661}}
% a\eof at -0.3500, 0.3931
\put( -0.2220, 0.6069){\makebox(0,0)[r]{\tt{a\teof}}}
% aaa at -0.2300, 0.0768
\put( -0.1510, 0.9232){\makebox(0,0)[r]{\tt{aaa}}}
% line 0.1535 from -0.3500 to 0.0000
\put( -0.2070, 0.8465){\line(1,0){ 0.2070}}
% aab at -0.2300, 0.1791
\put( -0.1510, 0.8209){\makebox(0,0)[r]{\tt{aab}}}
% line 0.2047 from -0.3500 to 0.0000
\put( -0.2070, 0.7953){\line(1,0){ 0.2070}}
% aa\eof at -0.2300, 0.2228
\put( -0.1510, 0.7772){\makebox(0,0)[r]{\tt{aa\teof}}}
% aaaa at -0.1000, 0.0522
\put( -0.0741, 0.9478){\makebox(0,0)[r]{\tt{aaaa}}}
% line 0.1044 from -0.2300 to 0.0000
\put( -0.1360, 0.8956){\line(1,0){ 0.1360}}
% aaab at -0.1000, 0.1175
\put( -0.0741, 0.8825){\makebox(0,0)[r]{\tt{aaab}}}
% line 0.1305 from -0.2300 to 0.0000
\put( -0.1360, 0.8695){\line(1,0){ 0.1360}}
% line 0.0740 from -0.1000 to 0.0000
\put( -0.0591, 0.9260){\line(1,0){ 0.0591}}
% line 0.0887 from -0.1000 to 0.0000
\put( -0.0591, 0.9113){\line(1,0){ 0.0591}}
% line 0.1192 from -0.1000 to 0.0000
\put( -0.0591, 0.8808){\line(1,0){ 0.0591}}
% line 0.1266 from -0.1000 to 0.0000
\put( -0.0591, 0.8734){\line(1,0){ 0.0591}}
% aaba at -0.1000, 0.1666
\put( -0.0741, 0.8334){\makebox(0,0)[r]{\tt{aaba}}}
% line 0.1796 from -0.2300 to 0.0000
\put( -0.1360, 0.8204){\line(1,0){ 0.1360}}
% aabb at -0.1000, 0.1883
\put( -0.0741, 0.8117){\makebox(0,0)[r]{\tt{aabb}}}
% line 0.1970 from -0.2300 to 0.0000
\put( -0.1360, 0.8030){\line(1,0){ 0.1360}}
% line 0.1683 from -0.1000 to 0.0000
\put( -0.0591, 0.8317){\line(1,0){ 0.0591}}
% line 0.1757 from -0.1000 to 0.0000
\put( -0.0591, 0.8243){\line(1,0){ 0.0591}}
% line 0.1870 from -0.1000 to 0.0000
\put( -0.0591, 0.8130){\line(1,0){ 0.0591}}
% line 0.1944 from -0.1000 to 0.0000
\put( -0.0591, 0.8056){\line(1,0){ 0.0591}}
% aba at -0.2300, 0.2664
\put( -0.1510, 0.7336){\makebox(0,0)[r]{\tt{aba}}}
% line 0.2920 from -0.3500 to 0.0000
\put( -0.2070, 0.7080){\line(1,0){ 0.2070}}
% abb at -0.2300, 0.3176
\put( -0.1510, 0.6824){\makebox(0,0)[r]{\tt{abb}}}
% line 0.3432 from -0.3500 to 0.0000
\put( -0.2070, 0.6568){\line(1,0){ 0.2070}}
% ab\eof at -0.2300, 0.3522
\put( -0.1510, 0.6478){\makebox(0,0)[r]{\tt{ab\teof}}}
% abaa at -0.1000, 0.2539
\put( -0.0741, 0.7461){\makebox(0,0)[r]{\tt{abaa}}}
% line 0.2669 from -0.2300 to 0.0000
\put( -0.1360, 0.7331){\line(1,0){ 0.1360}}
% abab at -0.1000, 0.2756
\put( -0.0741, 0.7244){\makebox(0,0)[r]{\tt{abab}}}
% line 0.2843 from -0.2300 to 0.0000
\put( -0.1360, 0.7157){\line(1,0){ 0.1360}}
% line 0.2556 from -0.1000 to 0.0000
\put( -0.0591, 0.7444){\line(1,0){ 0.0591}}
% line 0.2630 from -0.1000 to 0.0000
\put( -0.0591, 0.7370){\line(1,0){ 0.0591}}
% line 0.2743 from -0.1000 to 0.0000
\put( -0.0591, 0.7257){\line(1,0){ 0.0591}}
% line 0.2817 from -0.1000 to 0.0000
\put( -0.0591, 0.7183){\line(1,0){ 0.0591}}
% abba at -0.1000, 0.3007
\put( -0.0741, 0.6993){\makebox(0,0)[r]{\tt{abba}}}
% line 0.3094 from -0.2300 to 0.0000
\put( -0.1360, 0.6906){\line(1,0){ 0.1360}}
% abbb at -0.1000, 0.3225
\put( -0.0741, 0.6775){\makebox(0,0)[r]{\tt{abbb}}}
% line 0.3355 from -0.2300 to 0.0000
\put( -0.1360, 0.6645){\line(1,0){ 0.1360}}
% line 0.2994 from -0.1000 to 0.0000
\put( -0.0591, 0.7006){\line(1,0){ 0.0591}}
% line 0.3068 from -0.1000 to 0.0000
\put( -0.0591, 0.6932){\line(1,0){ 0.0591}}
% line 0.3168 from -0.1000 to 0.0000
\put( -0.0591, 0.6832){\line(1,0){ 0.0591}}
% line 0.3316 from -0.1000 to 0.0000
\put( -0.0591, 0.6684){\line(1,0){ 0.0591}}
% ba at -0.3500, 0.4852
\put( -0.2220, 0.5148){\makebox(0,0)[r]{\tt{ba}}}
% line 0.5454 from -0.4500 to 0.0000
\put( -0.2661, 0.4546){\line(1,0){ 0.2661}}
% bb at -0.3500, 0.6658
\put( -0.2220, 0.3342){\makebox(0,0)[r]{\tt{bb}}}
% line 0.7862 from -0.4500 to 0.0000
\put( -0.2661, 0.2138){\line(1,0){ 0.2661}}
% b\eof at -0.3500, 0.8181
\put( -0.2220, 0.1819){\makebox(0,0)[r]{\tt{b\teof}}}
% baa at -0.2300, 0.4506
\put( -0.1510, 0.5494){\makebox(0,0)[r]{\tt{baa}}}
% line 0.4762 from -0.3500 to 0.0000
\put( -0.2070, 0.5238){\line(1,0){ 0.2070}}
% bab at -0.2300, 0.5018
\put( -0.1510, 0.4982){\makebox(0,0)[r]{\tt{bab}}}
% line 0.5274 from -0.3500 to 0.0000
\put( -0.2070, 0.4726){\line(1,0){ 0.2070}}
% ba\eof at -0.2300, 0.5364
\put( -0.1510, 0.4636){\makebox(0,0)[r]{\tt{ba\teof}}}
% baaa at -0.1000, 0.4381
\put( -0.0741, 0.5619){\makebox(0,0)[r]{\tt{baaa}}}
% line 0.4511 from -0.2300 to 0.0000
\put( -0.1360, 0.5489){\line(1,0){ 0.1360}}
% baab at -0.1000, 0.4598
\put( -0.0741, 0.5402){\makebox(0,0)[r]{\tt{baab}}}
% line 0.4685 from -0.2300 to 0.0000
\put( -0.1360, 0.5315){\line(1,0){ 0.1360}}
% line 0.4398 from -0.1000 to 0.0000
\put( -0.0591, 0.5602){\line(1,0){ 0.0591}}
% line 0.4472 from -0.1000 to 0.0000
\put( -0.0591, 0.5528){\line(1,0){ 0.0591}}
% line 0.4585 from -0.1000 to 0.0000
\put( -0.0591, 0.5415){\line(1,0){ 0.0591}}
% line 0.4659 from -0.1000 to 0.0000
\put( -0.0591, 0.5341){\line(1,0){ 0.0591}}
% baba at -0.1000, 0.4849
\put( -0.0741, 0.5151){\makebox(0,0)[r]{\tt{baba}}}
% line 0.4936 from -0.2300 to 0.0000
\put( -0.1360, 0.5064){\line(1,0){ 0.1360}}
% babb at -0.1000, 0.5066
\put( -0.0741, 0.4934){\makebox(0,0)[r]{\tt{babb}}}
% line 0.5197 from -0.2300 to 0.0000
\put( -0.1360, 0.4803){\line(1,0){ 0.1360}}
% line 0.4836 from -0.1000 to 0.0000
\put( -0.0591, 0.5164){\line(1,0){ 0.0591}}
% line 0.4910 from -0.1000 to 0.0000
\put( -0.0591, 0.5090){\line(1,0){ 0.0591}}
% line 0.5010 from -0.1000 to 0.0000
\put( -0.0591, 0.4990){\line(1,0){ 0.0591}}
% line 0.5158 from -0.1000 to 0.0000
\put( -0.0591, 0.4842){\line(1,0){ 0.0591}}
% bba at -0.2300, 0.5710
\put( -0.1510, 0.4290){\makebox(0,0)[r]{\tt{bba}}}
% line 0.5966 from -0.3500 to 0.0000
\put( -0.2070, 0.4034){\line(1,0){ 0.2070}}
% bbb at -0.2300, 0.6734
\put( -0.1510, 0.3266){\makebox(0,0)[r]{\tt{bbb}}}
% line 0.7501 from -0.3500 to 0.0000
\put( -0.2070, 0.2499){\line(1,0){ 0.2070}}
% bb\eof at -0.2300, 0.7682
\put( -0.1510, 0.2318){\makebox(0,0)[r]{\tt{bb\teof}}}
% bbaa at -0.1000, 0.5541
\put( -0.0741, 0.4459){\makebox(0,0)[r]{\tt{bbaa}}}
% line 0.5628 from -0.2300 to 0.0000
\put( -0.1360, 0.4372){\line(1,0){ 0.1360}}
% bbab at -0.1000, 0.5759
\put( -0.0741, 0.4241){\makebox(0,0)[r]{\tt{bbab}}}
% line 0.5889 from -0.2300 to 0.0000
\put( -0.1360, 0.4111){\line(1,0){ 0.1360}}
% line 0.5528 from -0.1000 to 0.0000
\put( -0.0591, 0.4472){\line(1,0){ 0.0591}}
% line 0.5602 from -0.1000 to 0.0000
\put( -0.0591, 0.4398){\line(1,0){ 0.0591}}
% line 0.5702 from -0.1000 to 0.0000
\put( -0.0591, 0.4298){\line(1,0){ 0.0591}}
% line 0.5850 from -0.1000 to 0.0000
\put( -0.0591, 0.4150){\line(1,0){ 0.0591}}
% bbba at -0.1000, 0.6096
\put( -0.0741, 0.3904){\makebox(0,0)[r]{\tt{bbba}}}
% line 0.6227 from -0.2300 to 0.0000
\put( -0.1360, 0.3773){\line(1,0){ 0.1360}}
% bbbb at -0.1000, 0.6749
\put( -0.0741, 0.3251){\makebox(0,0)[r]{\tt{bbbb}}}
% line 0.7271 from -0.2300 to 0.0000
\put( -0.1360, 0.2729){\line(1,0){ 0.1360}}
% line 0.6040 from -0.1000 to 0.0000
\put( -0.0591, 0.3960){\line(1,0){ 0.0591}}
% line 0.6188 from -0.1000 to 0.0000
\put( -0.0591, 0.3812){\line(1,0){ 0.0591}}
% line 0.6375 from -0.1000 to 0.0000
\put( -0.0591, 0.3625){\line(1,0){ 0.0591}}
% line 0.7114 from -0.1000 to 0.0000
\put( -0.0591, 0.2886){\line(1,0){ 0.0591}}
% line 0.0000 from 0.0100 to 0.5000
\put( 0.0059, 1.0000){\line(1,0){ 0.2897}}
% 0 at 0.0100, 0.2500
\put( 0.2811, 0.7500){\makebox(0,0)[l]{\tt0}}
% line 0.5000 from 0.0100 to 0.5000
\put( 0.0059, 0.5000){\line(1,0){ 0.2897}}
% 1 at 0.0100, 0.7500
\put( 0.2811, 0.2500){\makebox(0,0)[l]{\tt1}}
% line 1.0000 from 0.0100 to 0.5000
\put( 0.0059, 0.0000){\line(1,0){ 0.2897}}
% 00 at 0.0100, 0.1250
\put( 0.2397, 0.8750){\makebox(0,0)[l]{\tt00}}
% line 0.2500 from 0.0100 to 0.4500
\put( 0.0059, 0.7500){\line(1,0){ 0.2602}}
% 01 at 0.0100, 0.3750
\put( 0.2397, 0.6250){\makebox(0,0)[l]{\tt01}}
% 000 at 0.0100, 0.0625
\put( 0.1806, 0.9375){\makebox(0,0)[l]{\tt000}}
% line 0.1250 from 0.0100 to 0.3800
\put( 0.0059, 0.8750){\line(1,0){ 0.2188}}
% 001 at 0.0100, 0.1875
\put( 0.1806, 0.8125){\makebox(0,0)[l]{\tt001}}
% 0000 at 0.0100, 0.0312
\put( 0.1207, 0.9688){\makebox(0,0)[l]{\tt0000}}
% line 0.0625 from 0.0100 to 0.2800
\put( 0.0059, 0.9375){\line(1,0){ 0.1597}}
% 0001 at 0.0100, 0.0938
\put( 0.1207, 0.9062){\makebox(0,0)[l]{\tt0001}}
% 00000 at 0.0100, 0.0156
\put( 0.0387, 0.9844){\makebox(0,0)[l]{\tt00000}}
% line 0.0312 from 0.0100 to 0.1500
\put( 0.0059, 0.9688){\line(1,0){ 0.0828}}
% 00001 at 0.0100, 0.0469
\put( 0.0387, 0.9531){\makebox(0,0)[l]{\tt00001}}
% line 0.0156 from 0.0100 to 0.0400
\put( 0.0059, 0.9844){\line(1,0){ 0.0177}}
% line 0.0078 from 0.0100 to 0.0200
\put( 0.0059, 0.9922){\line(1,0){ 0.0059}}
% line 0.0234 from 0.0100 to 0.0200
\put( 0.0059, 0.9766){\line(1,0){ 0.0059}}
% line 0.0469 from 0.0100 to 0.0400
\put( 0.0059, 0.9531){\line(1,0){ 0.0177}}
% line 0.0391 from 0.0100 to 0.0200
\put( 0.0059, 0.9609){\line(1,0){ 0.0059}}
% line 0.0547 from 0.0100 to 0.0200
\put( 0.0059, 0.9453){\line(1,0){ 0.0059}}
% 00010 at 0.0100, 0.0781
\put( 0.0387, 0.9219){\makebox(0,0)[l]{\tt00010}}
% line 0.0938 from 0.0100 to 0.1500
\put( 0.0059, 0.9062){\line(1,0){ 0.0828}}
% 00011 at 0.0100, 0.1094
\put( 0.0387, 0.8906){\makebox(0,0)[l]{\tt00011}}
% line 0.0781 from 0.0100 to 0.0400
\put( 0.0059, 0.9219){\line(1,0){ 0.0177}}
% line 0.0703 from 0.0100 to 0.0200
\put( 0.0059, 0.9297){\line(1,0){ 0.0059}}
% line 0.0859 from 0.0100 to 0.0200
\put( 0.0059, 0.9141){\line(1,0){ 0.0059}}
% line 0.1094 from 0.0100 to 0.0400
\put( 0.0059, 0.8906){\line(1,0){ 0.0177}}
% line 0.1016 from 0.0100 to 0.0200
\put( 0.0059, 0.8984){\line(1,0){ 0.0059}}
% line 0.1172 from 0.0100 to 0.0200
\put( 0.0059, 0.8828){\line(1,0){ 0.0059}}
% 0010 at 0.0100, 0.1562
\put( 0.1207, 0.8438){\makebox(0,0)[l]{\tt0010}}
% line 0.1875 from 0.0100 to 0.2800
\put( 0.0059, 0.8125){\line(1,0){ 0.1597}}
% 0011 at 0.0100, 0.2188
\put( 0.1207, 0.7812){\makebox(0,0)[l]{\tt0011}}
% 00100 at 0.0100, 0.1406
\put( 0.0387, 0.8594){\makebox(0,0)[l]{\tt00100}}
% line 0.1562 from 0.0100 to 0.1500
\put( 0.0059, 0.8438){\line(1,0){ 0.0828}}
% 00101 at 0.0100, 0.1719
\put( 0.0387, 0.8281){\makebox(0,0)[l]{\tt00101}}
% line 0.1406 from 0.0100 to 0.0400
\put( 0.0059, 0.8594){\line(1,0){ 0.0177}}
% line 0.1328 from 0.0100 to 0.0200
\put( 0.0059, 0.8672){\line(1,0){ 0.0059}}
% line 0.1484 from 0.0100 to 0.0200
\put( 0.0059, 0.8516){\line(1,0){ 0.0059}}
% line 0.1719 from 0.0100 to 0.0400
\put( 0.0059, 0.8281){\line(1,0){ 0.0177}}
% line 0.1641 from 0.0100 to 0.0200
\put( 0.0059, 0.8359){\line(1,0){ 0.0059}}
% line 0.1797 from 0.0100 to 0.0200
\put( 0.0059, 0.8203){\line(1,0){ 0.0059}}
% 00110 at 0.0100, 0.2031
\put( 0.0387, 0.7969){\makebox(0,0)[l]{\tt00110}}
% line 0.2188 from 0.0100 to 0.1500
\put( 0.0059, 0.7812){\line(1,0){ 0.0828}}
% 00111 at 0.0100, 0.2344
\put( 0.0387, 0.7656){\makebox(0,0)[l]{\tt00111}}
% line 0.2031 from 0.0100 to 0.0400
\put( 0.0059, 0.7969){\line(1,0){ 0.0177}}
% line 0.1953 from 0.0100 to 0.0200
\put( 0.0059, 0.8047){\line(1,0){ 0.0059}}
% line 0.2109 from 0.0100 to 0.0200
\put( 0.0059, 0.7891){\line(1,0){ 0.0059}}
% line 0.2344 from 0.0100 to 0.0400
\put( 0.0059, 0.7656){\line(1,0){ 0.0177}}
% line 0.2266 from 0.0100 to 0.0200
\put( 0.0059, 0.7734){\line(1,0){ 0.0059}}
% line 0.2422 from 0.0100 to 0.0200
\put( 0.0059, 0.7578){\line(1,0){ 0.0059}}
% 010 at 0.0100, 0.3125
\put( 0.1806, 0.6875){\makebox(0,0)[l]{\tt010}}
% line 0.3750 from 0.0100 to 0.3800
\put( 0.0059, 0.6250){\line(1,0){ 0.2188}}
% 011 at 0.0100, 0.4375
\put( 0.1806, 0.5625){\makebox(0,0)[l]{\tt011}}
% 0100 at 0.0100, 0.2812
\put( 0.1207, 0.7188){\makebox(0,0)[l]{\tt0100}}
% line 0.3125 from 0.0100 to 0.2800
\put( 0.0059, 0.6875){\line(1,0){ 0.1597}}
% 0101 at 0.0100, 0.3438
\put( 0.1207, 0.6562){\makebox(0,0)[l]{\tt0101}}
% 01000 at 0.0100, 0.2656
\put( 0.0387, 0.7344){\makebox(0,0)[l]{\tt01000}}
% line 0.2812 from 0.0100 to 0.1500
\put( 0.0059, 0.7188){\line(1,0){ 0.0828}}
% 01001 at 0.0100, 0.2969
\put( 0.0387, 0.7031){\makebox(0,0)[l]{\tt01001}}
% line 0.2656 from 0.0100 to 0.0400
\put( 0.0059, 0.7344){\line(1,0){ 0.0177}}
% line 0.2578 from 0.0100 to 0.0200
\put( 0.0059, 0.7422){\line(1,0){ 0.0059}}
% line 0.2734 from 0.0100 to 0.0200
\put( 0.0059, 0.7266){\line(1,0){ 0.0059}}
% line 0.2969 from 0.0100 to 0.0400
\put( 0.0059, 0.7031){\line(1,0){ 0.0177}}
% line 0.2891 from 0.0100 to 0.0200
\put( 0.0059, 0.7109){\line(1,0){ 0.0059}}
% line 0.3047 from 0.0100 to 0.0200
\put( 0.0059, 0.6953){\line(1,0){ 0.0059}}
% 01010 at 0.0100, 0.3281
\put( 0.0387, 0.6719){\makebox(0,0)[l]{\tt01010}}
% line 0.3438 from 0.0100 to 0.1500
\put( 0.0059, 0.6562){\line(1,0){ 0.0828}}
% 01011 at 0.0100, 0.3594
\put( 0.0387, 0.6406){\makebox(0,0)[l]{\tt01011}}
% line 0.3281 from 0.0100 to 0.0400
\put( 0.0059, 0.6719){\line(1,0){ 0.0177}}
% line 0.3203 from 0.0100 to 0.0200
\put( 0.0059, 0.6797){\line(1,0){ 0.0059}}
% line 0.3359 from 0.0100 to 0.0200
\put( 0.0059, 0.6641){\line(1,0){ 0.0059}}
% line 0.3594 from 0.0100 to 0.0400
\put( 0.0059, 0.6406){\line(1,0){ 0.0177}}
% line 0.3516 from 0.0100 to 0.0200
\put( 0.0059, 0.6484){\line(1,0){ 0.0059}}
% line 0.3672 from 0.0100 to 0.0200
\put( 0.0059, 0.6328){\line(1,0){ 0.0059}}
% 0110 at 0.0100, 0.4062
\put( 0.1207, 0.5938){\makebox(0,0)[l]{\tt0110}}
% line 0.4375 from 0.0100 to 0.2800
\put( 0.0059, 0.5625){\line(1,0){ 0.1597}}
% 0111 at 0.0100, 0.4688
\put( 0.1207, 0.5312){\makebox(0,0)[l]{\tt0111}}
% 01100 at 0.0100, 0.3906
\put( 0.0387, 0.6094){\makebox(0,0)[l]{\tt01100}}
% line 0.4062 from 0.0100 to 0.1500
\put( 0.0059, 0.5938){\line(1,0){ 0.0828}}
% 01101 at 0.0100, 0.4219
\put( 0.0387, 0.5781){\makebox(0,0)[l]{\tt01101}}
% line 0.3906 from 0.0100 to 0.0400
\put( 0.0059, 0.6094){\line(1,0){ 0.0177}}
% line 0.3828 from 0.0100 to 0.0200
\put( 0.0059, 0.6172){\line(1,0){ 0.0059}}
% line 0.3984 from 0.0100 to 0.0200
\put( 0.0059, 0.6016){\line(1,0){ 0.0059}}
% line 0.4219 from 0.0100 to 0.0400
\put( 0.0059, 0.5781){\line(1,0){ 0.0177}}
% line 0.4141 from 0.0100 to 0.0200
\put( 0.0059, 0.5859){\line(1,0){ 0.0059}}
% line 0.4297 from 0.0100 to 0.0200
\put( 0.0059, 0.5703){\line(1,0){ 0.0059}}
% 01110 at 0.0100, 0.4531
\put( 0.0387, 0.5469){\makebox(0,0)[l]{\tt01110}}
% line 0.4688 from 0.0100 to 0.1500
\put( 0.0059, 0.5312){\line(1,0){ 0.0828}}
% 01111 at 0.0100, 0.4844
\put( 0.0387, 0.5156){\makebox(0,0)[l]{\tt01111}}
% line 0.4531 from 0.0100 to 0.0400
\put( 0.0059, 0.5469){\line(1,0){ 0.0177}}
% line 0.4453 from 0.0100 to 0.0200
\put( 0.0059, 0.5547){\line(1,0){ 0.0059}}
% line 0.4609 from 0.0100 to 0.0200
\put( 0.0059, 0.5391){\line(1,0){ 0.0059}}
% line 0.4844 from 0.0100 to 0.0400
\put( 0.0059, 0.5156){\line(1,0){ 0.0177}}
% line 0.4766 from 0.0100 to 0.0200
\put( 0.0059, 0.5234){\line(1,0){ 0.0059}}
% line 0.4922 from 0.0100 to 0.0200
\put( 0.0059, 0.5078){\line(1,0){ 0.0059}}
% 10 at 0.0100, 0.6250
\put( 0.2397, 0.3750){\makebox(0,0)[l]{\tt10}}
% line 0.7500 from 0.0100 to 0.4500
\put( 0.0059, 0.2500){\line(1,0){ 0.2602}}
% 11 at 0.0100, 0.8750
\put( 0.2397, 0.1250){\makebox(0,0)[l]{\tt11}}
% 100 at 0.0100, 0.5625
\put( 0.1806, 0.4375){\makebox(0,0)[l]{\tt100}}
% line 0.6250 from 0.0100 to 0.3800
\put( 0.0059, 0.3750){\line(1,0){ 0.2188}}
% 101 at 0.0100, 0.6875
\put( 0.1806, 0.3125){\makebox(0,0)[l]{\tt101}}
% 1000 at 0.0100, 0.5312
\put( 0.1207, 0.4688){\makebox(0,0)[l]{\tt1000}}
% line 0.5625 from 0.0100 to 0.2800
\put( 0.0059, 0.4375){\line(1,0){ 0.1597}}
% 1001 at 0.0100, 0.5938
\put( 0.1207, 0.4062){\makebox(0,0)[l]{\tt1001}}
% 10000 at 0.0100, 0.5156
\put( 0.0387, 0.4844){\makebox(0,0)[l]{\tt10000}}
% line 0.5312 from 0.0100 to 0.1500
\put( 0.0059, 0.4688){\line(1,0){ 0.0828}}
% 10001 at 0.0100, 0.5469
\put( 0.0387, 0.4531){\makebox(0,0)[l]{\tt10001}}
% line 0.5156 from 0.0100 to 0.0400
\put( 0.0059, 0.4844){\line(1,0){ 0.0177}}
% line 0.5078 from 0.0100 to 0.0200
\put( 0.0059, 0.4922){\line(1,0){ 0.0059}}
% line 0.5234 from 0.0100 to 0.0200
\put( 0.0059, 0.4766){\line(1,0){ 0.0059}}
% line 0.5469 from 0.0100 to 0.0400
\put( 0.0059, 0.4531){\line(1,0){ 0.0177}}
% line 0.5391 from 0.0100 to 0.0200
\put( 0.0059, 0.4609){\line(1,0){ 0.0059}}
% line 0.5547 from 0.0100 to 0.0200
\put( 0.0059, 0.4453){\line(1,0){ 0.0059}}
% 10010 at 0.0100, 0.5781
\put( 0.0387, 0.4219){\makebox(0,0)[l]{\tt10010}}
% line 0.5938 from 0.0100 to 0.1500
\put( 0.0059, 0.4062){\line(1,0){ 0.0828}}
% 10011 at 0.0100, 0.6094
\put( 0.0387, 0.3906){\makebox(0,0)[l]{\tt10011}}
% line 0.5781 from 0.0100 to 0.0400
\put( 0.0059, 0.4219){\line(1,0){ 0.0177}}
% line 0.5703 from 0.0100 to 0.0200
\put( 0.0059, 0.4297){\line(1,0){ 0.0059}}
% line 0.5859 from 0.0100 to 0.0200
\put( 0.0059, 0.4141){\line(1,0){ 0.0059}}
% line 0.6094 from 0.0100 to 0.0400
\put( 0.0059, 0.3906){\line(1,0){ 0.0177}}
% line 0.6016 from 0.0100 to 0.0200
\put( 0.0059, 0.3984){\line(1,0){ 0.0059}}
% line 0.6172 from 0.0100 to 0.0200
\put( 0.0059, 0.3828){\line(1,0){ 0.0059}}
% 1010 at 0.0100, 0.6562
\put( 0.1207, 0.3438){\makebox(0,0)[l]{\tt1010}}
% line 0.6875 from 0.0100 to 0.2800
\put( 0.0059, 0.3125){\line(1,0){ 0.1597}}
% 1011 at 0.0100, 0.7188
\put( 0.1207, 0.2812){\makebox(0,0)[l]{\tt1011}}
% 10100 at 0.0100, 0.6406
\put( 0.0387, 0.3594){\makebox(0,0)[l]{\tt10100}}
% line 0.6562 from 0.0100 to 0.1500
\put( 0.0059, 0.3438){\line(1,0){ 0.0828}}
% 10101 at 0.0100, 0.6719
\put( 0.0387, 0.3281){\makebox(0,0)[l]{\tt10101}}
% line 0.6406 from 0.0100 to 0.0400
\put( 0.0059, 0.3594){\line(1,0){ 0.0177}}
% line 0.6328 from 0.0100 to 0.0200
\put( 0.0059, 0.3672){\line(1,0){ 0.0059}}
% line 0.6484 from 0.0100 to 0.0200
\put( 0.0059, 0.3516){\line(1,0){ 0.0059}}
% line 0.6719 from 0.0100 to 0.0400
\put( 0.0059, 0.3281){\line(1,0){ 0.0177}}
% line 0.6641 from 0.0100 to 0.0200
\put( 0.0059, 0.3359){\line(1,0){ 0.0059}}
% line 0.6797 from 0.0100 to 0.0200
\put( 0.0059, 0.3203){\line(1,0){ 0.0059}}
% 10110 at 0.0100, 0.7031
\put( 0.0387, 0.2969){\makebox(0,0)[l]{\tt10110}}
% line 0.7188 from 0.0100 to 0.1500
\put( 0.0059, 0.2812){\line(1,0){ 0.0828}}
% 10111 at 0.0100, 0.7344
\put( 0.0387, 0.2656){\makebox(0,0)[l]{\tt10111}}
% line 0.7031 from 0.0100 to 0.0400
\put( 0.0059, 0.2969){\line(1,0){ 0.0177}}
% line 0.6953 from 0.0100 to 0.0200
\put( 0.0059, 0.3047){\line(1,0){ 0.0059}}
% line 0.7109 from 0.0100 to 0.0200
\put( 0.0059, 0.2891){\line(1,0){ 0.0059}}
% line 0.7344 from 0.0100 to 0.0400
\put( 0.0059, 0.2656){\line(1,0){ 0.0177}}
% line 0.7266 from 0.0100 to 0.0200
\put( 0.0059, 0.2734){\line(1,0){ 0.0059}}
% line 0.7422 from 0.0100 to 0.0200
\put( 0.0059, 0.2578){\line(1,0){ 0.0059}}
% 110 at 0.0100, 0.8125
\put( 0.1806, 0.1875){\makebox(0,0)[l]{\tt110}}
% line 0.8750 from 0.0100 to 0.3800
\put( 0.0059, 0.1250){\line(1,0){ 0.2188}}
% 111 at 0.0100, 0.9375
\put( 0.1806, 0.0625){\makebox(0,0)[l]{\tt111}}
% 1100 at 0.0100, 0.7812
\put( 0.1207, 0.2188){\makebox(0,0)[l]{\tt1100}}
% line 0.8125 from 0.0100 to 0.2800
\put( 0.0059, 0.1875){\line(1,0){ 0.1597}}
% 1101 at 0.0100, 0.8438
\put( 0.1207, 0.1562){\makebox(0,0)[l]{\tt1101}}
% 11000 at 0.0100, 0.7656
\put( 0.0387, 0.2344){\makebox(0,0)[l]{\tt11000}}
% line 0.7812 from 0.0100 to 0.1500
\put( 0.0059, 0.2188){\line(1,0){ 0.0828}}
% 11001 at 0.0100, 0.7969
\put( 0.0387, 0.2031){\makebox(0,0)[l]{\tt11001}}
% line 0.7656 from 0.0100 to 0.0400
\put( 0.0059, 0.2344){\line(1,0){ 0.0177}}
% line 0.7578 from 0.0100 to 0.0200
\put( 0.0059, 0.2422){\line(1,0){ 0.0059}}
% line 0.7734 from 0.0100 to 0.0200
\put( 0.0059, 0.2266){\line(1,0){ 0.0059}}
% line 0.7969 from 0.0100 to 0.0400
\put( 0.0059, 0.2031){\line(1,0){ 0.0177}}
% line 0.7891 from 0.0100 to 0.0200
\put( 0.0059, 0.2109){\line(1,0){ 0.0059}}
% line 0.8047 from 0.0100 to 0.0200
\put( 0.0059, 0.1953){\line(1,0){ 0.0059}}
% 11010 at 0.0100, 0.8281
\put( 0.0387, 0.1719){\makebox(0,0)[l]{\tt11010}}
% line 0.8438 from 0.0100 to 0.1500
\put( 0.0059, 0.1562){\line(1,0){ 0.0828}}
% 11011 at 0.0100, 0.8594
\put( 0.0387, 0.1406){\makebox(0,0)[l]{\tt11011}}
% line 0.8281 from 0.0100 to 0.0400
\put( 0.0059, 0.1719){\line(1,0){ 0.0177}}
% line 0.8203 from 0.0100 to 0.0200
\put( 0.0059, 0.1797){\line(1,0){ 0.0059}}
% line 0.8359 from 0.0100 to 0.0200
\put( 0.0059, 0.1641){\line(1,0){ 0.0059}}
% line 0.8594 from 0.0100 to 0.0400
\put( 0.0059, 0.1406){\line(1,0){ 0.0177}}
% line 0.8516 from 0.0100 to 0.0200
\put( 0.0059, 0.1484){\line(1,0){ 0.0059}}
% line 0.8672 from 0.0100 to 0.0200
\put( 0.0059, 0.1328){\line(1,0){ 0.0059}}
% 1110 at 0.0100, 0.9062
\put( 0.1207, 0.0938){\makebox(0,0)[l]{\tt1110}}
% line 0.9375 from 0.0100 to 0.2800
\put( 0.0059, 0.0625){\line(1,0){ 0.1597}}
% 1111 at 0.0100, 0.9688
\put( 0.1207, 0.0312){\makebox(0,0)[l]{\tt1111}}
% 11100 at 0.0100, 0.8906
\put( 0.0387, 0.1094){\makebox(0,0)[l]{\tt11100}}
% line 0.9062 from 0.0100 to 0.1500
\put( 0.0059, 0.0938){\line(1,0){ 0.0828}}
% 11101 at 0.0100, 0.9219
\put( 0.0387, 0.0781){\makebox(0,0)[l]{\tt11101}}
% line 0.8906 from 0.0100 to 0.0400
\put( 0.0059, 0.1094){\line(1,0){ 0.0177}}
% line 0.8828 from 0.0100 to 0.0200
\put( 0.0059, 0.1172){\line(1,0){ 0.0059}}
% line 0.8984 from 0.0100 to 0.0200
\put( 0.0059, 0.1016){\line(1,0){ 0.0059}}
% line 0.9219 from 0.0100 to 0.0400
\put( 0.0059, 0.0781){\line(1,0){ 0.0177}}
% line 0.9141 from 0.0100 to 0.0200
\put( 0.0059, 0.0859){\line(1,0){ 0.0059}}
% line 0.9297 from 0.0100 to 0.0200
\put( 0.0059, 0.0703){\line(1,0){ 0.0059}}
% 11110 at 0.0100, 0.9531
\put( 0.0387, 0.0469){\makebox(0,0)[l]{\tt11110}}
% line 0.9688 from 0.0100 to 0.1500
\put( 0.0059, 0.0312){\line(1,0){ 0.0828}}
% 11111 at 0.0100, 0.9844
\put( 0.0387, 0.0156){\makebox(0,0)[l]{\tt11111}}
% line 0.9531 from 0.0100 to 0.0400
\put( 0.0059, 0.0469){\line(1,0){ 0.0177}}
% line 0.9453 from 0.0100 to 0.0200
\put( 0.0059, 0.0547){\line(1,0){ 0.0059}}
% line 0.9609 from 0.0100 to 0.0200
\put( 0.0059, 0.0391){\line(1,0){ 0.0059}}
% line 0.9844 from 0.0100 to 0.0400
\put( 0.0059, 0.0156){\line(1,0){ 0.0177}}
% line 0.9766 from 0.0100 to 0.0200
\put( 0.0059, 0.0234){\line(1,0){ 0.0059}}
% line 0.9922 from 0.0100 to 0.0200
\put( 0.0059, 0.0078){\line(1,0){ 0.0059}}
\end{picture}
}
\end{center}
}{%
\caption[a]{Illustration of the intervals defined by a
 simple Bayesian probabilistic model. The size of an interval is proportional
to the probability of the string.
This model anticipates that
the source is likely to be biased towards one of {\tt{a}} and
{\tt{b}}, so sequences having lots of {\tt{a}}s or lots of
{\tt{b}}s have larger intervals than sequences of the same length
that are 50:50 {\tt{a}}s and
{\tt{b}}s.}
\label{fig.ac2}
}%
\end{figure}
\begin{aside}
\subsection{Details of the Bayesian model}
Having emphasized that any model could be used -- arithmetic coding
is not wedded to any particular set of probabilities -- let me explain
the simple adaptive probabilistic model used in the preceding example;
we first encountered this model
in
% chapter \ref{ch1}
% (page \pageref{ex.postpa})
\exerciseref{ex.postpa}.
%
%
% {\em (This material may be a repetition of material in \chref{ch1}.)}
%
\subsubsection{Assumptions}
The model will be described using parameters
$p_{\eof}$, $p_{\ta}$ and $p_{\tb}$, defined below,
which should not be confused with the predictive
probabilities {\em in a particular context\/},
for example,
$P(\ta \given \bs\eq {\tb\ta\ta} )$.
% An analogy for this model, as I indicated
% at the start,
% is the tossing of a bent coin (\secref{sec.bentcoin}).
A bent coin labelled
$\ta$ and $\tb$ is tossed some number of times $l$,
which we don't know beforehand. The coin's probability
of coming up $\ta$ when tossed is $p_{\ta}$, and $p_{\tb} = 1-p_{\ta}$; the parameters
$p_{\ta},p_{\tb}$ are not known beforehand. The source string $\bs = \tt baaba\eof$
indicates that $l$ was 5 and the sequence of outcomes was $\tt baaba$.
\ben
\item
It is assumed that the length of the string $l$ has an exponential
probability distribution
\beq
P(l) = (1 - p_{\eof})^l p_{\eof}
.
\eeq
This distribution corresponds to assuming a constant probability
$p_{\eof}$ for the termination symbol `$\eof$' at each character.
\item
It is assumed that the non-terminal
characters in the string are selected independently at random
from an ensemble with probabilities
% distribution
$\P = \{p_{\ta},p_{\tb}\}$; the
probability $p_{\ta}$ is fixed throughout the string to some
unknown value that could be anywhere between $0$ and $1$.
The probability of an $\ta$ occurring as the next symbol, given
$p_{\ta}$ (if only we knew it), is $(1-p_{\eof})p_{\ta}$.
% given that it is not
The probability, given $p_{\ta}$, that an unterminated string of length $F$
is a given string $\bs$
that contains $\{F_{\ta},F_{\tb}\}$ counts of the two outcomes
% $\{ a , b \}$
is the \ind{Bernoulli distribution}
\beq
P( \bs \given p_{\ta} , F ) = p_{\ta}^{F_{\ta}} (1-p_{\ta})^{F_{\tb}} .
\label{eq.pa.like}
\eeq
\item
We assume a uniform prior distribution for $p_{\ta}$,
\beq
P(p_{\ta}) = 1 , \: \: \: \: \: \: p_{\ta} \in [0,1] ,
\label{eq.pa.prior}
\eeq
and define $p_{\tb} \equiv 1-p_{\ta}$.
It would be easy to assume other priors on $p_{\ta}$, with beta distributions
being the most convenient to handle.
\een
This model was studied in \secref{sec.bentcoin}.
The key result we require is the predictive distribution for
the next symbol, given the string so far, $\bs$.
 The probability that the next character is $\ta$
or $\tb$ (assuming that it is not `$\eof$')
was derived in \eqref{eq.laplacederived} and is precisely
Laplace's rule (\ref{eq.laplaceagain}).
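 As a worked illustration (the value $p_{\eof} = 0.15$ is an assumption made here
 purely for the arithmetic), consider the context $\bs\eq{\tb\ta\ta}$, for which
 $F_{\ta}=2$ and $F_{\tb}=1$. Laplace's rule then gives
\beq
 P(\ta \given \bs\eq{\tb\ta\ta}) = (1-p_{\eof}) \, \frac{F_{\ta}+1}{F_{\ta}+F_{\tb}+2}
 = 0.85 \times \frac{3}{5} = 0.51 ,
\eeq
 and similarly $P(\tb \given \bs\eq{\tb\ta\ta}) = 0.85 \times 2/5 = 0.34$ and
 $P(\eof \given \bs\eq{\tb\ta\ta}) = 0.15$; the three predictive probabilities sum to one.
 Chaining such predictions along the whole source string gives its probability,
\beq
 P({\tt baaba}\eof) = (1-p_{\eof})^5 \, p_{\eof} \times
 \frac{1}{2} \cdot \frac{1}{3} \cdot \frac{2}{4} \cdot \frac{2}{5} \cdot \frac{3}{6}
 = \frac{(1-p_{\eof})^5 \, p_{\eof}}{60} ,
\eeq
 where the factor $1/60 = 3! \, 2! / 6!$ is what one obtains by integrating the
 likelihood (\ref{eq.pa.like}) over the uniform prior (\ref{eq.pa.prior}) with
 $F_{\ta}=3$ and $F_{\tb}=2$. With the assumed $p_{\eof}=0.15$ this probability is roughly $0.0011$,
 so the interval for this string has an information content of about $9.8$ bits.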
\end{aside}
\exercisaxB{3}{ex.ac.vs.huffman}{
Compare the expected message length
when an ASCII file is compressed by the following
three methods.
\begin{description}
\item[Huffman-with-header\puncspace] Read the whole
file, find the empirical frequency of each symbol,
construct a Huffman code for those frequencies,
transmit the code by transmitting
the lengths of the Huffman codewords, then transmit
the file using the Huffman code.
(The actual codewords don't need to be transmitted,
since we can use a deterministic method for
building the tree given the codelengths.)
\item[Arithmetic code using the \ind{Laplace model}\puncspace]
\beq
P_{\rm L}(\ta \given x_1,\ldots,x_{n-1})=\frac{F_{\ta}+1}
{\sum_{{\ta'}}(F_{{\ta'}}+1)}.
\eeq
\item[Arithmetic code using a \ind{Dirichlet model}\puncspace]
This model's predictions are:
\beq
P_{\rm D}(\ta \given x_1,\ldots,x_{n-1})=\frac{F_{\ta}+\alpha}
{\sum_{{\ta'}}(F_{{\ta'}}+\alpha)},
\eeq
where $\alpha$ is fixed to a number such as 0.01.
A small value of $\alpha$ corresponds to a more responsive version of the
Laplace model; the probability over characters
is expected to be more nonuniform;
$\alpha=1$ reproduces the Laplace model.
\end{description}
Take care that the header of your Huffman message
is self-delimiting.
Special cases worth considering are (a) short files
with just a few hundred characters; (b) large files
in which some characters are never used.
}
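 To get a feel for the difference between the two adaptive models
 (this small case is an illustrative assumption and not part of the exercise),
 take a binary alphabet $\{\ta,\tb\}$ in which a single $\ta$ has been observed,
 so that $F_{\ta}=1$ and $F_{\tb}=0$. Then
\beq
 P_{\rm L}(\ta \given x_1 \eq \ta) = \frac{1+1}{(1+1)+(0+1)} = \frac{2}{3} ,
 \: \: \: \: \: \:
 P_{\rm D}(\ta \given x_1 \eq \ta) = \frac{1+0.01}{(1+0.01)+(0+0.01)} = \frac{1.01}{1.02} \simeq 0.99 .
\eeq
 With $\alpha = 0.01$ a single observation almost settles the prediction,
 whereas the Laplace model moves only from $1/2$ to $2/3$.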
\section{Further applications of arithmetic coding}
\subsection{Efficient generation of random samples}
\label{sec.ac.efficient}
Arithmetic coding not only offers a way to compress strings
believed to come from a given model; it also offers a way to generate
random strings from a model. Imagine sticking a
pin into the unit interval at random, that line
having been divided into subintervals in proportion
to probabilities $p_i$; the probability that your pin will
lie in interval $i$ is $p_i$.
So to generate a sample from a model, all we need to do is feed ordinary
random bits into an arithmetic {\em decoder\/}\index{arithmetic coding!decoder} for that
model.\index{arithmetic coding!uses beyond compression} An infinite random
bit sequence corresponds to the selection of a point
at random from the line $[0,1)$, so the decoder will
then select a string at random from the assumed distribution.
This arithmetic method is guaranteed to use very nearly the smallest
number of random bits possible to make the selection -- an important
point in communities where random numbers are expensive!
[{This is
not a joke. Large amounts of money are spent on generating random bits
in software and hardware. Random numbers are valuable.}]
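 As a small illustration (the three-symbol ensemble is an assumption chosen so that
 the arithmetic is exact), let the symbols {\tt{a}}, {\tt{b}}, {\tt{c}} have probabilities
 $\{ 1/2, 1/4, 1/4 \}$, so that the line $[0,1)$ is divided into
 $[0,1/2)$, $[1/2,3/4)$ and $[3/4,1)$. The decoder emits a symbol as soon as the
 interval picked out by the random bits read so far lies inside one of these subintervals:
\[
\begin{array}{lll}
\mbox{random bits read} & \mbox{interval selected} & \mbox{symbol emitted} \\
{\tt{0}} & [0,1/2) & {\tt{a}} \\
{\tt{10}} & [1/2,3/4) & {\tt{b}} \\
{\tt{11}} & [3/4,1) & {\tt{c}}
\end{array}
\]
 The expected number of random bits consumed per symbol is
 $\frac{1}{2} \times 1 + \frac{1}{4} \times 2 + \frac{1}{4} \times 2 = 1.5$,
 which equals the entropy of this ensemble, so no random bits are wasted.
 When the probabilities are not powers of $1/2$, the decoder carries a partly
 used interval over from one symbol to the next, and the cost per symbol
 comes close to the entropy only when averaged over a long string.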
A simple example of the use of this technique is in the
generation of random bits with a nonuniform distribution $\{ p_0,p_1 \}$.
% This is a useful technique
\exercissxA{2}{ex.usebits}{
Compare the following two techniques for generating random symbols
from a nonuniform distribution $\{ p_0,p_1 \} = \{ 0.99,0.01\}$:
\ben
\item The standard method: use a standard random number generator
to generate an integer between 1 and $2^{32}$. Rescale the integer
to $(0,1)$. Test whether this uniformly distributed random variable is
less than $0.99$, and emit a {\tt{0}} or {\tt{1}} accordingly.
\item
Arithmetic coding using the correct model, fed with standard
random bits.
\een
Roughly how many random bits will each method use to generate a thousand
samples from this sparse distribution?
}
\subsection{Efficient data-entry devices}
When we enter text into a computer, we make gestures of some sort --
maybe we tap a keyboard, or scribble with a pointer, or click with a mouse;
an {\em efficient\/}
\index{user interfaces}\index{data entry}\ind{text entry} system is
one where the
number of gestures required to enter a given text string is {\em small\/}.
Writing\index{writing}\index{text entry}%
\marginfignocaption{\small
\begin{center}
\begin{tabular}{rcl}
\multicolumn{3}{l}{Compression:}\\
text& $\rightarrow$ &bits\\[0.2in]
\multicolumn{3}{l}{Writing:} \\
text &$\leftarrow$& gestures\\[0.2in]
\end{tabular}
\end{center}
}
can be viewed as an inverse process\index{arithmetic coding!uses beyond compression}
to data compression. In data compression, the aim is to map
a given text string into a {\em small\/} number of bits.
In text entry, we want a small sequence of gestures
to produce our intended text.
By inverting an arithmetic coder,
we can obtain \index{inverse-arithmetic-coder}an information-efficient
text entry device that is driven by continuous pointing
gestures \cite{ward2000}. In this system, called \ind{Dasher},\index{human--machine interfaces}\index{software!Dasher}
the user zooms in on the unit interval to locate the\index{text entry}
interval corresponding to their intended string,
in the same style as \figref{fig.ac}. A \ind{language
model} (exactly as used in text compression) controls the
sizes of the intervals such that probable strings are
quick and easy to identify.
After an hour's practice,
a novice
user can write with one \ind{finger} driving {Dasher}
at about 25 words per minute -- that's about
half their normal ten-finger
\index{QWERTY}typing speed on a regular \ind{keyboard}.
It's even possible to write at 25 words per minute, {\em hands-free},
using gaze direction to drive Dasher \cite{wardmackay2002}.
Dasher is available as free software for various
platforms.\footnote{ {\tt http://www.inference.phy.cam.ac.uk/dasher/}}
\label{sec.stopbeforeLZ}
\section{Lempel--Ziv coding\nonexaminable}
The \index{Lempel--Ziv coding|(}Lempel--Ziv algorithms, which are widely used for data compression
(\eg, the {\tt\ind{compress}} and {\tt\ind{gzip}} commands), are different in philosophy to arithmetic
coding. There is no separation between modelling and coding,\index{philosophy}
and no opportunity for explicit modelling.\index{source code!algorithms}
\subsection{Basic Lempel--Ziv algorithm}
The method of compression is to replace a \ind{substring} with a \ind{pointer} to
an earlier occurrence of the same substring.
 For example, if the string is {\tt{1011010100010}}\ldots, we \ind{parse} it into
an ordered {\dem\ind{dictionary}\/} of substrings that have not appeared before
as follows:
$\l$, {\tt{1}}, {\tt{0}}, {\tt{11}}, {\tt{01}}, {\tt{010}}, {\tt{00}}, {\tt{10}}, \dots.
We include the \index{empty string}empty substring \ind{$\lambda$} as
the first substring in the dictionary and order the substrings in the dictionary
by the order in which they emerged from the source.
After every comma, we look along the next part of the
input sequence until we have read a
substring that has not been marked off before. A moment's
reflection will confirm that
this substring is longer by one bit than a substring that has occurred
earlier in the dictionary. This means that we can encode each substring by
giving a {\dem pointer\/} to the earlier occurrence of that prefix and then sending
the extra bit by which the new substring in the dictionary differs from
the earlier substring. If, at the $n$th bit, we have enumerated
$s(n)$ substrings, then we can
give the value of the pointer in
$\lceil \log_2 s(n) \rceil$ bits. The code for the above sequence
is then as shown in the fourth line of the following table (with
punctuation included for clarity), the upper lines indicating the source
string and the value of $s(n)$:
%
%
\beginfullpagewidth%% defined in chapternotes.sty, uses {narrow}
\[
\begin{array}{l|*{8}{l}}
\mbox{source substrings}&\lambda & {\tt{1}} & {\tt{0}} & {\tt{11}} & {\tt{01}} & {\tt{010}} & {\tt{00}} & {\tt{10}} \\
s(n) & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 \\
s(n)_{\rm binary}
& {\tt{000}} & {\tt{001}} & {\tt{010}} & {\tt{011}} & {\tt{100}} & {\tt{101}} & {\tt{110}} & {\tt{111}} \\
(\mbox{pointer},\mbox{bit})& & (,{\tt{1}}) & ({\tt{0}},{\tt{0}}) & ({\tt{01}},{\tt{1}}) & ({\tt{10}},{\tt{1}}) & ({\tt{100}},{\tt{0}})& ({\tt{010}},{\tt{0}}) & ({\tt{001}},{\tt{0}})
\end{array}
\]
\end{narrow}
% The pointer
Notice that the first pointer we send is empty,
because, given that there is only one substring in the
dictionary -- the string $\lambda$ --
no bits are needed to convey the `choice' of that substring
as the prefix.
The encoded string is {\tt 100011101100001000010}.
The encoding, in this
simple case, is actually a longer string than the source string, because
there was no obvious redundancy in the source string.
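 The bit count can be checked directly. The substring that arrives when
 $s(n)$ substrings (including $\lambda$) are already in the dictionary costs
 $\lceil \log_2 s(n) \rceil$ pointer bits plus one extra bit:
\[
\begin{array}{l|*{7}{l}}
\mbox{new substring} & {\tt{1}} & {\tt{0}} & {\tt{11}} & {\tt{01}} & {\tt{010}} & {\tt{00}} & {\tt{10}} \\
s(n) \mbox{ when sent} & 1 & 2 & 3 & 4 & 5 & 6 & 7 \\
\lceil \log_2 s(n) \rceil & 0 & 1 & 2 & 2 & 3 & 3 & 3 \\
\mbox{bits sent} & 1 & 2 & 3 & 3 & 4 & 4 & 4
\end{array}
\]
 The total is $1+2+3+3+4+4+4 = 21$ transmitted bits, compared with
 the 13 bits of the source string.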
\exercisaxB{2}{ex.Clengthen}{
Prove that {\em any\/} uniquely decodeable code from $\{{\tt{0}},{\tt{1}}\}^+$ to
$\{{\tt{0}},{\tt{1}}\}^+$ necessarily makes some strings longer if it makes
some strings shorter.
}
One reason why the algorithm described above lengthens
a lot of strings
 is that it is inefficient --
it transmits unnecessary bits; to put it another
way, its code is not complete.\label{sec.LZprune}
% is not necessarily the explanation for the above lengthening,
% however, because
% see also {ex.LZprune}{
% the algorithm described is certainly inefficient: o
Once a substring in
the {dictionary} has been joined there by both of its children, then
we can be sure that it will not be needed (except possibly as part
of our protocol for terminating a message); so at that point we
could drop it from our dictionary of substrings and shuffle them all along
one, thereby reducing the length of subsequent pointer messages. Equivalently,
we could write the second prefix into the dictionary at the point
previously occupied by the parent. A second unnecessary overhead
is the transmission of the new bit in these cases -- the second
time a prefix is used, we can be sure of the identity of the next bit.
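 Both overheads can be seen in the dictionary of the example above
 (this is an illustration of the two savings only, not a specification of a modified protocol):
\[
\begin{array}{llll}
\mbox{prefix} & \mbox{first child} & \mbox{second child} & \mbox{forced bit} \\
\lambda & {\tt{1}} & {\tt{0}} & {\tt{0}} \\
{\tt{0}} & {\tt{01}} & {\tt{00}} & {\tt{0}} \\
{\tt{1}} & {\tt{11}} & {\tt{10}} & {\tt{0}}
\end{array}
\]
 Once its second child has arrived, each of the prefixes $\lambda$, {\tt{0}} and {\tt{1}}
 can never again be the prefix of a new substring, so it need no longer be a pointer target.
 And the final bit of each second child is determined by the first child,
 so three of the seven extra bits transmitted in the example carry no information.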
% This is easy to do in a computer but not so easy for a human
% to cope with.
\subsubsection{Decoding}
The decoder again involves an identical twin at the decoding end
who constructs the dictionary of substrings as the
data are decoded.
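 For instance, the twin can recover the source string from the
 encoded string {\tt 100011101100001000010} of the example above.
 Because the decoder knows how many substrings $s(n)$ it currently holds,
 it knows that the next pointer occupies $\lceil \log_2 s(n) \rceil$ bits:
\[
\begin{array}{l|llll}
s(n) & \mbox{pointer read} & \mbox{prefix} & \mbox{bit read} & \mbox{new substring} \\ \hline
1 & & \lambda & {\tt{1}} & {\tt{1}} \\
2 & {\tt{0}} & \lambda & {\tt{0}} & {\tt{0}} \\
3 & {\tt{01}} & {\tt{1}} & {\tt{1}} & {\tt{11}} \\
4 & {\tt{10}} & {\tt{0}} & {\tt{1}} & {\tt{01}} \\
5 & {\tt{100}} & {\tt{01}} & {\tt{0}} & {\tt{010}} \\
6 & {\tt{010}} & {\tt{0}} & {\tt{0}} & {\tt{00}} \\
7 & {\tt{001}} & {\tt{1}} & {\tt{0}} & {\tt{10}}
\end{array}
\]
 Concatenating the new substrings in order gives {\tt{1011010100010}},
 the original source string.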
\exercissxB{2}{ex.LZencode}{
Encode the string {\tt{000000000000100000000000}}
using the basic
Lempel--Ziv algorithm described above.
}
% lambda 0 00 000 0000 001 00000 000000
% 000 001 010 011 100 101 110 111
% ,0 1,0 10,0 11,0 010,1 100,0 110,0
% answer
% 010100110010110001100
\exercissxB{2}{ex.LZdecode}{
Decode the string
\begin{center}
{\tt{00101011101100100100011010101000011}}
\end{center}
that was encoded using the basic
Lempel--Ziv algorithm.
}
% answer
% 0100001000100010101000001 001000001000000
% lamda, 0, 1, 00, 001, 000, 10, 0010, 101, 0000, 01, 00100, 0001, 00000
% 0 , 1, 10,11, 100, 101,110, 111, 1000, 1001, 1010, 1011,1100, 1101
% ,0 0,1 01,0 11,1 011,0 010,0 100,0 110,1 0101,0 0001,1 bored!
% 10101011101100100100011010101000011
%
% see tcl/lempelziv.tcl
\subsubsection{Practicalities}
In this description I have not discussed the method for terminating
a string.
There are many variations on the Lempel--Ziv algorithm, all exploiting
the same idea but using different procedures for dictionary management,
etc.
% Two of the best known
% variations are called the Ziv-Lempel algorithm
% and the LZW algorithm.
%
The resulting programs are fast, but their performance on compression
of English text, although useful,
does not match the standards set in the arithmetic coding literature.
\subsection{Theoretical properties}
In contrast to the block code, Huffman code, and arithmetic coding
methods we discussed in the last three chapters,
the Lempel--Ziv algorithm is defined without making any mention
of a \ind{probabilistic model} for the source. Yet,
% in fact,
given any \ind{ergodic}
%\footnote{Need to clarify this. It means
% the source is memoryless on sufficiently long timescales.}
source (\ie, one that is memoryless on sufficiently long timescales),
the Lempel--Ziv algorithm can be
proven {\em asymptotically\/} to
compress down to the entropy of the source. This is why it is called
a `\ind{universal}' compression algorithm. For a proof
of this property, see \citeasnoun{Cover&Thomas}.
% Cover and Thomas (1991).
It achieves its compression,
however, only by {\em memorizing\/} substrings that have happened
so that it has a short name for them the next time they occur.
The asymptotic timescale on which this universal performance
is achieved
%%is likely to be the time that it takes for
%% if the source has not been observed long enough for
% {\em all\/} typical sequences of length $n^*$
% to occur, where $n^*$ is the longest lengthscale associated with the
% statistical fluctuations in the source.
%the longest lengthscale on
% which there are correlations in .
% red then
% For many sources the time for all typical sequences to
% occur is
may, for many sources, be unfeasibly long, because
the number of typical substrings that need memorizing
may be enormous.
%
The useful performance
of the algorithm in practice is a reflection of the fact that
many files contain multiple repetitions of particular
short sequences of characters,
a form of redundancy to which the algorithm is well suited.
\subsection{Common ground}
I have emphasized the difference in philosophy behind arithmetic coding
and Lempel--Ziv coding. There is common ground
between them, though: in principle, one can design
adaptive probabilistic models, and thence arithmetic codes, that
are `\ind{universal}', that is, models that will asymptotically compress
{\em any source in some class\/} to within some factor (preferably 1)
of its entropy.\index{compression!universal}
However, {for practical purposes\/}, I think such universal models can only be
constructed if the class of sources is severely restricted.
A general purpose compressor that can discover the probability
distribution of {\em any\/} source would be a general purpose
\ind{artificial intelligence}! A general purpose artificial
intelligence does not yet exist.
% \subsection{Comments}
% The Lempel--Ziv algorithm can be generalized to any finite alphabet
% as long as the input and output alphabets are the same. I believe
% it is not convenient to use unequal alphabets.
\section{Demonstration}
An interactive aid for exploring arithmetic coding, {\tt dasher.tcl}, is
available.\footnote{{\tt http://www.inference.phy.cam.ac.uk/mackay/itprnn/softwareI.html}}
% http://www.inference.phy.cam.ac.uk/mackay/itprnn/code/tcl/dasher.tcl
A demonstration arithmetic-coding\index{source code!algorithms}
\index{arithmetic coding!software}\index{software!arithmetic coding}software
package written by \index{Neal, Radford}{Radford Neal}\footnote{%
% is available from \\ \noindent
{\tt ftp://ftp.cs.toronto.edu/pub/radford/www/ac.software.html}}
% This package
consists of encoding and decoding modules to which the
user adds a module defining the probabilistic model. It
should be emphasized that there is no single
general-purpose arithmetic-coding compressor; a new model has to be written
for each type of source.
% application.
%
Radford Neal's\index{Neal, Radford}
package includes a simple adaptive model similar to the
Bayesian model demonstrated in section \ref{sec.ac}.
The results using this Laplace model should
be viewed as a basic benchmark since it is
the simplest possible probabilistic model -- it
% These results are anecdotal and should not be taken too
% seriously, but it is interesting that the highly developed gzip
% software only does a little better than the benchmark
% of the simple Laplace model,
simply assumes the characters in the file come independently
from a fixed ensemble.
The counts $\{ F_i \}$ of the symbols $\{ a_i \}$ are rescaled
and rounded as the file is read such that all the counts lie
between 1 and 256.
\index{DjVu}\index{deja vu}\index{Le Cun, Yann}\index{Bottou, Leon}
% Yann Le Cun, Leon Bottou and colleagues at AT{\&}T Labs
% have written a
A state-of-the-art compressor for documents
containing text and images, {\tt{DjVu}},
uses arithmetic coding.\footnote{%
% {\tt{DjVu}} is described at
\tt http://www.djvuzone.org/}
% (better Reference for deja vu?)
It uses a carefully designed approximate
arithmetic coder for binary
alphabets called the Z-coder \cite{bottou98coder},
which is much faster than the
arithmetic coding software described above. One of
the neat tricks the Z-coder uses is this: the adaptive model
adapts only occasionally (to save on computer time),
with the decision about when to adapt being pseudo-randomly
controlled by
whether the arithmetic encoder emitted a bit.
The JBIG image compression standard for binary images
uses arithmetic coding with a context-dependent
model, which adapts using a rule similar to Laplace's rule.
PPM \cite{Teahan95a} is a leading method for text compression,
and it uses arithmetic coding.
There are many Lempel--Ziv-based programs.
{\tt gzip} is based on a version of Lempel--Ziv
called `{\tt LZ77}' \cite{Ziv_Lempel77}\nocite{Ziv_Lempel78}. {\tt compress} is based on `{\tt LZW}'
\cite{Welch84}.
In my experience the
best is {\tt gzip}, with {\tt compress} being inferior
on most files.
% To
% give further credit to {\tt gzip}, it stores additional information in
% the compressed file such as the name of the file and its
% last modification date.
{\tt bzip} is
a {\dem{\ind{block-sorting} file compressor\/}}, which makes
use of a neat hack called the {\dem\ind{Burrows--Wheeler transform}}\index{source code!Burrows--Wheeler transform}\index{source code!block-sorting compression}
\cite{bwt}. This method is not based on an explicit probabilistic
model, and it only works well for files larger than several
thousand characters; but in practice it is a very effective
compressor for files in which the context of a character
is a good predictor for that character.%
% Maybe I'll describe it in a future edition of this
% book.
\footnote{There is a lot of information about the
Burrows--Wheeler transform on the net.
{\tt{http://dogma.net/DataCompression/BWT.shtml}}
}
%bzip2 compresses files using the Burrows--Wheeler block-sorting text compression algorithm, and Huffman
%coding. Compression is generally considerably better than that achieved by more conventional
%LZ77/LZ78-based compressors, and approaches the performance of the PPM family of statistical
%compressors.
\subsubsection{Compression of a text file}
Table \ref{tab.zipcompare1} gives the computer time taken (in seconds) and the
compression achieved when these programs are applied to
the \LaTeX\ file containing the text of
this chapter, of size 20,942 bytes.
\begin{table}[htbp]
\figuremargin{
\begin{center}
\begin{tabular}{lccc} \toprule
Method & Compression & Compressed size & Uncompression \\
& time$ \,/\, $sec & (\%age of 20,942) & time$ \,/\, $sec \\ \midrule
%Adaptive encoder,
Laplace model &
0.28 & $12\,974$ (61\%) & 0.32 \\
%{\tt gzip / gunzip} &
{\tt gzip} &
0.10 & \hspace{0.06in}$ 8\,177$ (39\%) & {\bf 0.01} \\
{\tt compress}
%/ uncompress}
&
0.05 & $10\,816$ (51\%) & 0.05 \\ \midrule
{\tt bzip}
% / bunzip}
&
& \hspace{0.06in}$ 7\,495$ (36\%) & \\
{\tt bzip2}
%/ bunzip2}
&
& \hspace{0.06in}$ 7\,640$ (36\%) & \\
{\tt ppmz } &
& \hspace{0.06in}{\bf 6$\,$800 (32\%)} & \\
\bottomrule
\end{tabular}
\end{center}
}{
\caption[a]{Comparison of compression algorithms applied to a text file.
}
\label{tab.zipcompare1}
}
\end{table}
% I will report the value of ``u''
% django:
% 0.410u 0.060s 0:00.60 78.3% 0+0k 0+0io 109pf+0w
% 6800 Nov 25 18:05 ../l4.tex.ppm
% time ppmz ../l4.tex.ppm ../l4.tex.up
% 0.480u 0.040s 0:00.60 86.6% 0+0k 0+0io 109pf+0w
%
% 108:wol:/home/mackay/_tools/ac0> time adaptive_encode < ~/_courses/itprnn/l4.tex > l4.tex.aez
% 0.280u 0.040s 0:00.55 58.1% 0+105k 2+3io 0pf+0w
% 109:wol:/home/mackay/_tools/ac0> time gzip ~/_courses/itprnn/l4.tex
% 0.100u 0.060s 0:00.28 57.1% 0+161k 2+12io 0pf+0w
% 110:wol:/home/mackay/_tools/ac0> ls -lisa ~/_courses/itprnn/l4.tex.gz
% 110131 8 8177 Jan 10 15:40 /home/mackay/_courses/itprnn/l4.tex.gz
% 111:wol:/home/mackay/_tools/ac0> gunzip ~/_courses/itprnn/l4.tex.gz
% 112:wol:/home/mackay/_tools/ac0> ls -lisa ~/_courses/itprnn/l4.tex l4.tex.aez
% 109904 21 20942 Jan 10 15:40 /home/mackay/_courses/itprnn/l4.tex
% 444691 13 12974 Jan 10 15:40 l4.tex.aez
% 113:wol:/home/mackay/_tools/ac0> time gzip ~/_courses/itprnn/l4.tex
% 0.100u 0.050s 0:00.24 62.5% 0+150k 0+13io 0pf+0w
% 114:wol:/home/mackay/_tools/ac0> time gunzip ~/_courses/itprnn/l4.tex.gz
% 0.010u 0.060s 0:00.17 41.1% 0+80k 0+8io 0pf+0w
% 115:wol:/home/mackay/_tools/ac0> time adaptive_decode < l4.tex.aez > l4.tex
% 0.320u 0.030s 0:00.39 89.7% 0+101k 6+4io 5pf+0w
%
% django: bzip and gunzip:
% 149:django.ucsf.edu:/home/mackay/_tools/ac0> time bzip l4.tex
% BZIP, a block-sorting file compressor. Version 0.21, 25-August-96.
% 0.060u 0.020s 0:00.22 36.3% 0+0k 0+0io 107pf+0w
% 7495 Jan 10 1997 l4.tex.bz
% 153:django.ucsf.edu:/home/mackay/_tools/ac0> time bunzip l4.tex.bz
% 0.020u 0.010s 0:00.14 21.4% 0+0k 0+0io 93pf+0w
% 20942 Jan 10 1997 l4.tex
% 155:django.ucsf.edu:/home/mackay/_tools/ac0> time bzip2 l4.tex
% 0.050u 0.000s 0:00.37 13.5% 0+0k 0+0io 90pf+0w
% 7640 Jan 10 1997 l4.tex.bz2
% 157:django.ucsf.edu:/home/mackay/_tools/ac0> time bunzip2 l4.tex.bz2
% 0.020u 0.000s 0:00.15 13.3% 0+0k 0+0io 85pf+0w
% time gzip l4.tex
% 0.010u 0.010s 0:00.28 7.1% 0+0k 0+0io 84pf+0w
% 8177 Jan 10 1997 l4.tex.gz
% time gunzip l4.tex
% 0.000u 0.010s 0:00.12 8.3% 0+0k 0+0io 87pf+0w
%
\subsubsection{Compression of a sparse file}
Interestingly, {\tt gzip} does not always do so well.
Table \ref{tab.zipcompare2} gives the
% computer time in seconds taken and the
compression achieved when these programs are applied to
a text file containing $10^6$ characters, each of which is
 either {\tt0} or {\tt1}, with probabilities
0.99 and 0.01. The Laplace model is quite
well matched to this source,
and the benchmark arithmetic coder
gives good performance, followed closely by {\tt compress}; {\tt gzip}
% , interestingly,
is worst.
% see /home/mackay/_tools/ac0
%
% , and {\tt gzip --best} does no better.
% has identical performance to {\tt gzip} on this example.}]
An ideal model for this source would compress the
file into about $10^6 H_2(0.01)/8 \simeq 10\,100$ bytes. The Laplace model
compressor falls short of this performance because it is implemented
using only eight-bit precision. The {\tt{ppmz}} compressor compresses
the best of all, but takes much more computer time.\index{Lempel--Ziv coding|)}
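As a rough check of the ideal size quoted above (a sketch using only the definition of the binary entropy function, with values rounded to three significant figures),
\beq
 H_2(0.01) = 0.01 \log_2 \frac{1}{0.01} + 0.99 \log_2 \frac{1}{0.99}
 \simeq 0.0664 + 0.0144 = 0.0808 \: \mbox{bits},
\eeq
 so $10^6$ characters carry about $80\,800$ bits, that is, roughly $10\,100$ bytes.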
\begin{table}[htbp]
\figuremargin{
\begin{center}
\begin{tabular}{lccc} \toprule
Method & Compression & Compressed size & Uncompression \\
& time$ \,/\, $sec & $ \,/\, $bytes & time$ \,/\, $sec \\ \midrule
% Adaptive encoder,
% Laplace model &
% 6.4 & 14089 (1.4\%)\hspace{0.06in} & 9.2 \\
%{\tt gzip / gunzip} &
% 2.1 & 20548 (2.1\%)\hspace{0.06in} & 0.43 \\
%{\tt compress / uncompress} &
% 0.73 & 14692 (1.47\%) & 0.76 \\ \bottomrule
%{\tt bzip / bunzip} &
% & & (\%) & \\
%{\tt bzip2 / bunzip2} &
% & & (\%) & \\ \hline
Laplace model &
0.45 & $14\,143$ (1.4\%)\hspace{0.06in} & 0.57 \\
{\tt gzip } &
0.22 & $20\,646$ (2.1\%)\hspace{0.06in} & 0.04 \\
{\tt gzip {\tt-}{\tt-}best} &
%{\tt gzip \verb+--best+} &
1.63 & $15\,553$ (1.6\%)\hspace{0.06in} & 0.05 \\
{\tt compress} &
0.13 & $14\,785$ (1.5\%)\hspace{0.06in} & 0.03 \\ \midrule
{\tt bzip } &
0.30 & $10\,903$ (1.09\%) & 0.17 \\
{\tt bzip2} &
0.19 & $11\,260$ (1.12\%) & 0.05 \\
{\tt ppmz} &
533 & {\bf 10$\,$447 (1.04\%)} & 535 \\
\bottomrule
\end{tabular}
\end{center}
% ideal length = 0.0807931 * 10^6 = 80793 bits = 10099 bytes
% /home/mackay/_tools/ac0/README1
}{
\caption[a]{Comparison of compression algorithms applied to a random file
of $10^6$ characters, 99\% {\tt0}s and 1\% {\tt1}s.
}
\label{tab.zipcompare2}
}
\end{table}
\section{Summary}
In the last three chapters
we have studied three classes of data compression codes.
\begin{description}
\item[Fixed-length block codes] (Chapter \chtwo). These are mappings
from a fixed number of source symbols to a fixed-length binary message.
% Most source strings are given no encoding;
Only a tiny fraction of
the source strings are given an encoding.
These codes were fun for identifying the entropy as the measure
of compressibility but they are of little practical use.
\item[Symbol codes] (Chapter \chthree). Symbol codes employ a variable-length
code for each symbol in the source alphabet, the codelengths being
integer lengths determined by the probabilities of the symbols.
Huffman's algorithm constructs an optimal symbol code for a given
set of symbol probabilities.
Every source string has a uniquely decodeable encoding, and if
the source symbols come from the assumed distribution then the symbol
code will compress
to an expected length $L$ lying in the interval $[H,H\!+\!1)$.
Statistical fluctuations in the source may make the actual length
longer or shorter than this mean length.
If the source is not well matched to the assumed distribution then
the mean length is increased by the relative entropy $D_{\rm KL}$
between the source distribution and the code's implicit distribution.
 For sources with small entropy, the symbol code has to emit
at least one bit per source symbol; compression
below one bit per source symbol can only be achieved
by the cumbersome procedure of putting the source data into blocks.
\item[Stream codes\puncspace]
The distinctive property of stream codes, compared with
symbol codes, is that they are not constrained to emit at least one bit for every
symbol read from the source stream. So large numbers of
source symbols may be
coded into a smaller number of bits.
% , but unlike block codes, this is achieved
This property could only be obtained using a symbol code
if the source stream were somehow chopped into blocks.
\bit
\item {Arithmetic codes}
combine a probabilistic model with an encoding algorithm
that identifies each string with a sub-interval of $[0,1)$
of size equal to the probability of that string under the model.
This code is almost optimal in the sense that
the compressed length of a string $\bx$ closely matches
the Shannon information content of $\bx$ given
the probabilistic model. Arithmetic codes fit with
the philosophy that good compression requires
%intelligence
{\dem data modelling}, in the form of an adaptive Bayesian model.
\item
% [Stream codes: Lempel--Ziv codes\puncspace]
Lempel--Ziv codes are adaptive in the sense that they memorize strings
that have already occurred. They are built on the philosophy that
we don't know anything at all about
what the probability distribution of the source will be, and we want
a compression algorithm that will perform reasonably well
whatever that distribution is.
\eit
\end{description}
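The mismatch penalty mentioned under symbol codes can be made explicit.
 As a sketch, ignore the integer constraint on codelengths and suppose
 that the code uses the ideal lengths $l_i = \log_2 (1/q_i)$ implied by
 an assumed distribution $\{ q_i \}$, while the source actually has
 probabilities $\{ p_i \}$ (the symbols $p_i$, $q_i$ and $l_i$ are introduced
 here for illustration only). The mean length is then
\beq
 \sum_i p_i \log_2 \frac{1}{q_i}
 = \sum_i p_i \log_2 \frac{1}{p_i} + \sum_i p_i \log_2 \frac{p_i}{q_i}
 = H + D_{\rm KL} ,
\eeq
 where $H$ is the entropy of the true distribution and $D_{\rm KL}$ is the
 relative entropy between $\{ p_i \}$ and $\{ q_i \}$.
\par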
%\section{Optimal compression must involve artificial intelligence}
%\subsection{A rant about `universal' compression}
% moved this to rant.tex for the time being
Both arithmetic codes and Lempel--Ziv codes will fail to decode
correctly if any of the bits of the compressed file are altered.
So if compressed files are to be stored or transmitted over
noisy media, error-correcting codes will be essential.
Reliable communication over unreliable channels is
the topic of \partnoun\ \noisypart.
% the next few chapters.
%Exercises
\section{Exercises on stream codes}%{Problems}
\exercisaxA{2}{ex.AC52}{
Describe an arithmetic coding algorithm to encode random
bit strings of length $N$ and weight $K$ (\ie, $K$ ones and $N-K$
zeroes) where $N$ and $K$ are given.
For the case $N\eq 5$, $K \eq 2$ show in detail the intervals corresponding to
all source substrings of lengths 1--5.
}
\exercissxB{2}{ex.AC52b}{
How many bits are needed to specify a selection of
% an unordered collection of
$K$ objects from $N$ objects? ($N$ and $K$ are assumed to be known and
the selection of $K$ objects is unordered.)
How might such a selection
be made at random without being wasteful of random bits?
}
\exercisaxB{2}{ex.HuffvAC}{
% from 2001 exam
A binary source $X$ emits independent identically
distributed symbols with probability distribution $\{ f_{0},f_1 \}$,
where $f_1 = 0.01$.
Find an optimal uniquely-decodeable symbol code for a string
$\bx=x_1x_2x_3$ of {\bf{three}} successive
samples from this source.
Estimate (to one decimal place) the factor
by which the expected length of this optimal code is greater
than the entropy of the three-bit string $\bx$.
[$H_2(0.01) \simeq 0.08$, where
$H_2(x) = x \log_2 (1/x) + (1-x) \log_2 (1/(1-x))$.]
%\medskip
An {{arithmetic code}\/} is used to compress a string of $1000$ samples
from the source $X$. Estimate the mean and standard deviation of
the length of the compressed file.
% This is example 6.3, identical, except we are talking about compressing
% rather than generating.
}
\exercisaxB{2}{ex.ACNf}{
Describe an arithmetic coding algorithm to generate random
bit strings of length $N$ with density $f$ (\ie, each
bit has probability $f$ of being a one) where $N$ is given.
}
\exercisaxC{2}{ex.LZprune}{
Use a modified Lempel--Ziv algorithm in which, as discussed
on \pref{sec.LZprune}, the dictionary of prefixes
is
% effectively
pruned by writing new prefixes into the
space occupied by prefixes that will not be needed again.
Such prefixes can be identified when
both their children have been added to the dictionary of prefixes.
(You may neglect the issue of termination of encoding.)
Use this algorithm to encode the string
{\tt{0100001000100010101000001}}.
Highlight the bits that follow a prefix on the
second occasion that that prefix is used. (As discussed earlier,
these bits could be omitted.)
% from the encoding if we adopted the convention (discussed
% earlier)
% of not transmitting the bit that follows a prefix on the
% second occasion that that prefix is used.
% nb this is same as an earlier example.
% i get
% ,0 0,1 1,0 10,1 10,0 00,0 011,0 100,1 010,0 001,1
}
\exercissxC{2}{ex.LZcomplete}{
Show that this modified Lempel--Ziv code is still not `complete',
that is, there are binary strings that are not encodings of any string.
}
% answer: this is because there are illegal prefix names, e.g. at the
% 5th step, 111 is not legal.
%
\exercissxB{3}{ex.LZfail}{
Give examples of simple sources that have low entropy
but would not be compressed well by the Lempel--Ziv algorithm.
}
%
% Ideas: add a figure showing the flow diagram -- source, model.
%
%
% \begin{thebibliography}{}
% \bibitem[\protect\citeauthoryear{Witten {\em et~al.\/}}{1987}]{arith_coding}
% {\sc Witten, I.~H.}, {\sc Neal, R.~M.}, \lsaand {\sc Cleary, J.~G.}
% \newblock (1987)
% \newblock Arithmetic coding for data compression.
% \newblock {\em Communications of the ACM\/} {\bf 30} (6):~520-540.
%
% \end{thebibliography}
% \part{Noisy Channel Coding}
% \end{document}
\dvips
%
\section{Further exercises on data compression}
%\chapter{Further Exercises on Data Compression}
\label{ch_f4}
%
% _f4.tex: exercises to follow chapter 4 in a 'review, revision, further topics'
% exercise zone.
%
\fakesection{Post-compression general extra exercises}
The following exercises may be skipped by the reader who
is eager to learn about noisy channels.
%
% DOES THIS BELONG HERE? Maybe move to p92.
%
\fakesection{RNGaussian}
\exercissxA{3}{ex.RNGaussian}{
\index{life in high dimensions}\index{high dimensions, life in}
%
Consider a Gaussian distribution\index{Gaussian distribution!$N$--dimensional} in $N$ dimensions,
\beq
P(\bx) = \frac{1}{(2 \pi \sigma^2)^{N/2}} \exp \left( - \frac{\sum_n x_n^2}{2 \sigma^2} \right) .
\label{first.gaussian}
\eeq
% Show that
Define the radius of a point $\bx$ to be $r = \left( {\sum_n
x_n^2} \right)^{1/2}$.
Estimate the mean and variance of the square of the radius,
$r^2 = \left( {\sum_n x_n^2} \right)$.
\begin{aside}%{\small
You may find helpful the integral
\beq
\int \! \d x\: \frac{1}{(2 \pi \sigma^2)^{1/2}} \: x^4
\exp \left( - \frac{x^2}{2 \sigma^2} \right) = 3 \sigma^4 ,
\label{eq.gaussian4thmoment}
\eeq
though you should be able to estimate the required quantities
without it.
\end{aside}
% If you like gamma integrals
% derive the probability density of the radius $r = \left( {\sum_n
% x_n^2} \right)^{1/2}$, and find the most probable
% radius.
%\amarginfig{b}{% in first printing, before asides changed
\amarginfig{t}{%
\setlength{\unitlength}{0.7mm}
% there is a strip without ink at the left, hence I use -19
% instead of -21 as the left coordinate
\begin{picture}(42,42)(-19,-21)% original is 6in by 6in, so 7unitlength=1in
% use 42 unitlength for width
\put(-21,-21){\makebox(42,42){\psfig{figure=figs/typicalG.ps,angle=-90,width=29.4mm}}}
%\put(14,14){\makebox(0,0)[l]{\small probability density is maximized here}}
\put(10,18){\makebox(0,0)[bl]{\small probability density}}
\put(13,13){\makebox(0,0)[bl]{\small is maximized here}}
%\put(14,-14){\makebox(0,0)[l]{\small almost all probability mass is here}}
\put(9,-16){\makebox(0,0)[l]{\small almost all}}
\put(2,-21){\makebox(0,0)[l]{\small probability mass is here}}
%\put(15,-26){\makebox(0,0)[l]{\small is here}}
\put(-2,-2){\makebox(0,0)[tr]{\small $\sqrt{N} \sigma$}}
\end{picture}
\caption[a]{Schematic representation of the typical
set of an $N$-dimensional Gaussian distribution.}
}
Assuming that $N$ is large,
show that nearly all the probability of a Gaussian is contained in
a \ind{thin shell} of radius $\sqrt{N} \sigma$. Find the thickness of the
shell.
Evaluate the probability density
% in $\bx$ space
(\ref{first.gaussian}) at a point in
that thin shell and at the origin $\bx=0$ and compare.
Use the case $N=1000$ as an example.
Notice that nearly all the probability mass
% the bulk of the probability density
is located in a
different part of the space from the region of highest probability
density.
%
}
%
% extra exercises that are appropriate once source compression has been
% discussed.
%
% contents:
%
% simple huffman question
% Phone chat using rings (originally in mockexam.tex, now in M.tex)
% Bridge bidding as communication (where?)
%
\fakesection{Compression exercises: bidding in bridge, etc}
%
\exercisaxA{2}{ex.source_code}{
%
Explain what is meant by an {\em optimal binary symbol code\/}.
Find an optimal binary symbol code for the ensemble:
\[
\A = \{ {\tt{a}},{\tt{b}},{\tt{c}},{\tt{d}},{\tt{e}},{\tt{f}},{\tt{g}},{\tt{h}},{\tt{i}},{\tt{j}} \} ,
\]
\[
\P = \left\{ \frac{1}{100} ,
\frac{2}{100} ,
\frac{4}{100} ,
\frac{5}{100} ,
\frac{6}{100} ,
\frac{8}{100} ,
\frac{9}{100} ,
\frac{10}{100} ,
\frac{25}{100} ,
\frac{30}{100} \right\} ,
\]
and compute the expected length of the code.
}
\exercisaxA{2}{ex.doublet.huffman}{
A string $\by=x_1 x_2$ consists of {\em two\/} independent samples from an ensemble
\[
X : {\cal A}_X = \{ {\tt{a}} , {\tt{b}} , {\tt{c}} \} ; {\cal P}_X = \left\{ \frac{1}{10} , \frac{3}{10} ,
\frac{6}{10} \right\} .
\]
What is the entropy of $\by$?
Construct an optimal binary symbol code for the string $\by$, and find
its expected length.
}
\exercisaxA{2}{ex.ac_expected}{
% (Cambridge University Part III Maths examination, 1998.)
%
Strings of $N$ independent samples from an ensemble
with $\P = \{ 0.1 , 0.9 \}$ are compressed using
an {arithmetic code} that is matched to that ensemble.
Estimate the mean and standard deviation of
the compressed strings' lengths
for the case $N=1000$.
%
[$H_2(0.1) \simeq 0.47$]
% ; $\log_2(9) \simeq 3$.]
% .47, 3.17
% my answer: 470 pm 30
}
% from M.tex, in which model solns are found too
\exercisaxA{3}{ex.phone_chat}{%(Cambridge University Part III Maths examination, 1998.)
{\sf Source coding with variable-length symbols.}
% -- Source coding / optimal use of channel}
\begin{quote}
In the chapters on source coding, we assumed that
we were encoding into a binary alphabet $\{ {\tt0} , {\tt1} \}$ in which both symbols\index{source code!variable symbol durations}
% had the same associated cost. Clearly a good compression algorithm
% uses both these symbols with equal frequency, and the capacity of
% this alphabet is one bit per character.
should be used with equal frequency.
In this question we explore how the encoding alphabet should be
used
% what happens
if the symbols take different times to transmit.
% have different costs.
% the
\end{quote}
%
A poverty-stricken \ind{student} communicates for free with a friend
using a \ind{telephone} by selecting an integer
 $n \in \{ 1,2,3,\ldots \}$,
making the friend's
phone ring $n$ times, then hanging up in the middle of the $n$th ring.
This process is repeated so that a string of symbols
$n_1 n_2 n_3 \ldots$ is received. What is the optimal way to communicate?
If large integers $n$ are selected
then the message takes longer to communicate. If only
small integers $n$ are used then the information content per symbol is
small.
We aim to maximize the rate of information transfer, per unit time.
Assume that the time taken to transmit
a number of rings $n$ and to redial
%, including the space that separates them from the next sequence of rings
is $l_n$ seconds. Consider a probability distribution over $n$,
$\{ p_n \}$.
Defining the average duration {\em per symbol\/} to be
\beq
L(\bp) = \sum_n p_n l_n
\eeq
and the entropy {\em per symbol\/} to be
\beq
H(\bp) = \sum_n p_n \log_2 \frac{1}{p_n } ,
\eeq
show that for the average information
rate {\em per second\/} to be maximized,
the symbols must be used with probabilities
of the form
\beq
p_n = \frac{1}{Z} 2^{-\beta l_n}
\label{eq.phone.1}
\eeq
where
% $\beta$ is a Lagrange multiplier
%and
$Z = \sum_n 2^{-\beta l_n}$
and $\beta$ satisfies the implicit equation
% \marginpar{[6]}
\beq
\beta = \frac{H(\bp)}{L(\bp)} ,
\label{eq.phone.2}
\eeq
that is, $\beta$ is the rate of communication.
%is set so as to maximize
%\beq
% R(\beta) = - \beta - \frac{\log Z(\beta)}{L(\beta)}
%\eeq
% where $L(\beta)=\sum p_n l_n$.
% By differentiating $R(\beta)$, show that
% $\beta^*$ satisfies
Show that these two equations
(\ref{eq.phone.1}, \ref{eq.phone.2}) imply that $\beta$ must be set
such that
\beq
\log Z =0.
\label{eq.phone.3}
\eeq
%
Assuming that the channel has the property
% redialling takes the same time as one ring, so that
\beq
l_n = n \: \mbox{seconds},
\label{eq.phone.4}
\eeq
find the optimal distribution $\bp$ and show that
the maximal information rate is 1 bit per second.
% $\log xxxx$
% and that the mean number of rings
% in a group is xxxx and that the information per
% ring is xxxx.
How does this compare with the information rate per second achieved
if $\bp$ is set to
$(1/2,1/2,0,0,0,0,\ldots)$ --- that is,
only the symbols $n=1$ and $n=2$ are selected,
and they have equal probability?
Discuss the relationship between the results
(\ref{eq.phone.1}, \ref{eq.phone.3}) derived above,
and the Kraft inequality from source coding theory.
How might a random binary source
be efficiently encoded into a sequence of symbols
$n_1 n_2 n_3 \ldots$ for transmission over the channel defined
in \eqref{eq.phone.4}?
}
\exercisaxB{1}{ex.shuffle}{How many bits
does it take to shuffle a pack of cards?
% [In case this is not clear, here's the long-winded
% version: imagine using a random number generator
% to generate perfect shuffles of a deck of cards.
% What is the smallest number of random bits
% needed per shuffle?]
}
\exercisaxB{2}{ex.bridge}{In the card game\index{game!Bridge}
Bridge,\index{Bridge}
the four players receive 13 cards each from the deck of 52 and
start each game by looking at their own hand
and bidding. The legal bids are, in ascending order
$1 \clubsuit, 1 \diamondsuit, 1 \heartsuit, 1\spadesuit,$ $1NT,$
$2 \clubsuit,$ $2 \diamondsuit,$
% 2 \heartsuit, 2\spadesuit, 2NT,
$\ldots$
% 7 \clubsuit, 7 \diamondsuit,
$7 \heartsuit, 7\spadesuit, 7NT$,
and successive bids must follow this order;
a bid of, say, $2 \heartsuit$ may only be
followed by higher bids such as $2\spadesuit$ or $3 \clubsuit$ or $7 NT$.
(Let us neglect the `double' bid.)
% The outcome of the bidding process determines the subsequent
% game.
The players have several aims when bidding. One of the
aims is for two partners to communicate to each other
as much as possible about what cards are in their hands.
% There are many bidding systems whose aim is, among other things,
% to communicate this information.
Let us concentrate on this task.
\begin{enumerate}
\item
After the cards have been dealt,
how many bits are needed for North to convey to South what
her hand is?
\item
Assuming that E and W do not bid at all, what
is the maximum total information that N and S can convey to each
other while bidding? Assume that N starts the bidding, and that
once either N or S stops bidding, the bidding stops.
\end{enumerate}
}
\exercisaxB{2}{ex.microwave}{
My old `\ind{arabic}' \ind{microwave oven}\index{human--machine interfaces}
had 11 buttons for entering
cooking times, and my new `\ind{roman}' microwave has just five.
The buttons of the roman microwave are labelled `10 minutes',
`1 minute', `10 seconds', `1 second', and `Start'; I'll abbreviate
these five strings to the symbols {\tt M}, {\tt C}, {\tt X}, {\tt I}, $\Box$.
% The two keypads then look as follows.
% included by _e4.tex
\amarginfig{b}{%
\begin{center}
\begin{tabular}[t]{c}%%%%%%%%%% table containing microwave buttons
%\toprule
Arabic \\ \midrule
% The keypad
\begin{tabular}[t]{*{3}{p{.1in}}}
\framebox{1} & \framebox{2} & \framebox{3} \\
\framebox{4} & \framebox{5} & \framebox{6} \\
\framebox{7} & \framebox{8} & \framebox{9} \\
& \framebox{0} & \framebox{$\!\Box\!$} \\
\end{tabular}
\\
%\bottomrule
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% end all micro table
\end{tabular}
\begin{tabular}[t]{c}%%%%%%%%%% table containing microwave buttons
%\toprule
Roman \\ \midrule
% The keypad
\begin{tabular}[t]{*{3}{p{.1in}}}
\framebox{{\tt{M}}} & \framebox{{\tt{X}}} & \\
\framebox{{\tt{C}}} & \framebox{{\tt{I}}} & \framebox{$\!\Box\!$} \\
\end{tabular}
\\
%\bottomrule
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% end all micro table
\end{tabular}\\
\mbox{$\:$}
\end{center}
\caption[a]{Alternative keypads for microwave ovens.}
}
To enter one minute and twenty-three seconds (1:23), the arabic sequence
is
\beq
{\tt{123}}\Box,
\eeq
and the roman sequence is
\beq
{\tt{CXXIII}}\Box .
\eeq
Each of these keypads defines a code mapping the
3599 cooking times from 0:01 to 59:59 into a string of symbols.
\ben
\item
Which times can be produced with two or three symbols? (For example,
0:20 can be produced by three symbols in either code:
${\tt{XX}}\Box$ and
${\tt{20}}\Box$.)
\item
Are the two codes complete?
Give a detailed answer.
% Discuss all the ways in which these two codes are not complete.
\item
For each code, name a cooking time
% couple of times
that it can produce in
four symbols that the other code cannot.
\item
Discuss the implicit probability distributions over times to which
each of these codes is best matched.
\item
Concoct a plausible probability distribution over times
that a real user might use, and evaluate roughly the expected number of
symbols, and maximum number of symbols, that each code
requires. Discuss the ways in which
each code is inefficient or efficient.
\item
Invent a more efficient cooking-time-encoding system for a microwave oven.
\een
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
}
%
\fakesection{Cinteger}
%\input{tex/_Cinteger}
\exercissxC{2}{ex.Cinteger}{
Is the standard binary representation for positive
integers (\eg\ $c_{\rm b}(5) = {\tt 101}$)
a uniquely decodeable code?
Design a binary code for the positive integers,
\ie, a mapping from $n \in \{ 1,2,3,\ldots \}$ to $c(n) \in
\{{\tt 0},{\tt 1}\}^+$,
that is uniquely decodeable.
Try to design codes that are prefix codes and that satisfy the
\Kraft\ equality $\sum_n 2^{-l_n} \eq 1$.
%
% Not a typo.
%
\begin{aside}
Motivations: any data file terminated by a special
end of file character can be mapped onto an integer,
so a prefix code for integers can be used as a self-delimiting
encoding of files too. Large files correspond to large integers.
Also, one of the building blocks of a `universal' coding scheme --
that is, a coding scheme that will work OK for a large variety
of sources -- is the ability to encode integers. Finally,
in microwave ovens, cooking times are positive integers!
\end{aside}
Discuss criteria by which one might compare alternative codes
for integers (or, equivalently, alternative self-delimiting codes for
files).
}
%
%
%
\section{Solutions}% to Chapter \protect\ref{ch4}'s exercises}
%
% solns to exercises in l4.tex
%
\fakesection{solns to exercises in l4.tex}
\soln{ex.ac.terminate}{
The worst-case situation is when the
interval to be represented lies just inside
a binary interval. In this case, we may choose either of
two binary intervals as shown in \figref{fig.ac.worst.case}.
These binary intervals are no
smaller than $P(\bx|\H)/4$, so the binary encoding has a length
no greater than $\log_2 1/ P(\bx|\H) + \log_2 4$, which is
two bits more than the ideal message length.
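 To see why a binary interval of this size always exists (a short
 supporting argument, with $l$ introduced here as the length of the
 binary string), choose the integer $l$ such that
\beq
 2^{-l} \leq \frac{P(\bx|\H)}{2} < 2^{-l+1} .
\eeq
 Binary intervals of length $2^{-l}$ are aligned on a grid of spacing $2^{-l}$,
 and any interval of length at least $2 \times 2^{-l}$ contains at least one
 complete cell of that grid. That cell has size $2^{-l} > P(\bx|\H)/4$,
 so $l < \log_2 1/ P(\bx|\H) + 2$.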
}
%
% HELP HELP HELP RESTORE ME!
% \input{tex/acvshuffman.tex}
%
%
\soln{ex.usebits}{
The standard method uses
32 random bits per generated
symbol and so requires $32\,000$ bits
to generate one thousand samples.
% this is displaced down a bit.
\begin{figure}%[htbp]
\figuremargin{%
\begin{center}
% created by ac.p only_show_data=1 > ac/ac_data.tex
\mbox{
\small
\setlength{\unitlength}{1.62in}
\begin{picture}(2,1.2)(0,0)
\thicklines
% desired interval on left
\put( 0.0, 1.01){\makebox(0,0)[bl]{Source string's interval}}
\put( 0.5, 0.5){\makebox(0,0){$P(\bx|\H)$}}
\put( 0.0, 0.05){\line(1,0){ 1.0}}
\put( 0.0, 0.95){\line(1,0){ 1.0}}
%
% binary intervals
\put( 1.0, 1.03){\makebox(0,0)[bl]{Binary intervals}}
\put( 1.0, 0.0){\line(1,0){ 1.0}}
\put( 1.0, 1.0){\line(1,0){ 1.0}}
%
\thinlines
%
\put( 0.5, 0.4){\vector(0,-1){0.35}}
\put( 0.5, 0.6){\vector(0,1){0.35}}
%
\put( 1.0, 0.5){\line(1,0){ 0.5}}
\put( 1.0, 0.25){\line(1,0){ 0.25}}
\put( 1.0, 0.75){\line(1,0){ 0.25}}
%
\put( 1.125, 0.625){\vector(0,1){0.125}}
\put( 1.125, 0.625){\vector(0,-1){0.125}}
\put( 1.125, 0.375){\vector(0,1){0.125}}
\put( 1.125, 0.375){\vector(0,-1){0.125}}
\end{picture}
}
\end{center}
}{%
\caption[a]{Termination of arithmetic coding in the worst case, where
 there is a two-bit overhead. Either of the two binary intervals marked on the
right-hand side may be chosen. These binary intervals are no
smaller than $P(\bx|\H)/4$.}
\label{fig.ac.worst.case}
}%
\end{figure}
Arithmetic coding uses on average
about $H_2 (0.01)=0.081$ bits per generated symbol, and so
requires about 83 bits to generate one thousand samples
(assuming an overhead of roughly two bits associated with termination).
Fluctuations in the number of {\tt{1}}s would produce variations
around this mean with standard deviation 21.
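 The figure of 21 can be checked as follows (a sketch, writing
 $N \eq 1000$, $f \eq 0.01$, and $K$ for the number of {\tt{1}}s; these
 symbols are introduced here for illustration). The information content of
 a string containing $K$ {\tt{1}}s is
\beq
 K \log_2 \frac{1}{f} + (N-K) \log_2 \frac{1}{1-f}
 = N \log_2 \frac{1}{1-f} + K \log_2 \frac{1-f}{f} ,
\eeq
 and $K$ has a binomial distribution with standard deviation
 $\sqrt{N f (1-f)} \simeq 3.15$. Each additional {\tt{1}} changes the
 length by $\log_2 99 \simeq 6.63$ bits, so the standard deviation of the
 length is about $3.15 \times 6.63 \simeq 21$ bits.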
}
% 57
%\soln{ex.Clengthen}{
% moved to cutsolutions.tex
\soln{ex.LZencode}{
The encoding is {\tt010100110010110001100}, which comes from the
parsing
\beq
\tt 0, 00, 000, 0000, 001, 00000, 000000
\eeq
which is encoded thus:
\beq
{\tt (,0),(1,0),(10,0),(11,0),(010,1),(100,0),(110,0) } .
\eeq
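 The length of this encoding can be checked step by step (a sketch, assuming
 as in the basic algorithm that when the dictionary holds $s$ entries,
 counting the empty string, each pointer is written with
 $\lceil \log_2 s \rceil$ bits). The seven pointers then occupy
 $0,1,2,2,3,3,3$ bits, and each is followed by one further bit, giving
\beq
 (0+1)+(1+1)+(2+1)+(2+1)+(3+1)+(3+1)+(3+1) = 21
\eeq
 bits in total, which agrees with the length of the encoding above; the
 parsed source string has $1+2+3+4+3+5+6 = 24$ bits.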
}
\soln{ex.LZdecode}{
The decoding is
\begin{center}
{\tt 0100001000100010101000001}.
\end{center}
}
%\soln{ex.AC52}{
\soln{ex.AC52b}{
This problem is equivalent to \exerciseref{ex.AC52}.
The selection of $K$ objects from $N$ objects requires
$\lceil \log_2 {N \choose K}\rceil$ bits $\simeq N H_2(K/N)$ bits.
%
This selection could be made using arithmetic coding. The selection
corresponds to a binary string of length $N$ in which the {\tt{1}} bits represent
which objects are selected. Initially the probability of a {\tt{1}} is
$K/N$ and the probability of a {\tt{0}} is $(N\!-\!K)/N$. Thereafter, given that
the emitted string thus far, of length $n$, contains $k$ {\tt{1}}s,
the probability of a {\tt{1}} is
$(K\!-\!k)/(N\!-\!n)$ and the probability of a {\tt{0}} is $1 - (K\!-\!k)/(N\!-\!n)$.
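 As a concrete check (a sketch for the case $N \eq 5$, $K \eq 2$;
 the particular string is chosen arbitrarily), these rules assign
 the string {\tt{01001}} the probability
\beq
 \frac{3}{5} \times \frac{2}{4} \times \frac{2}{3} \times \frac{1}{2} \times \frac{1}{1}
 = \frac{1}{10} = {5 \choose 2}^{-1} ,
\eeq
 as they must for every string of weight 2, so each selection costs
 $\log_2 10 \simeq 3.3$ bits. For such small $N$ the approximation
 $N H_2(K/N) \simeq 4.9$ bits is loose; its relative error shrinks as $N$ grows.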
}
\soln{ex.LZcomplete}{
This modified Lempel--Ziv code is still not `complete', because,
for example, after five prefixes have been collected,
the pointer could be any of the strings $\tt000$, $\tt001$, $\tt010$,
$\tt011$, $\tt100$, but
it cannot be $\tt101$, $\tt110$ or $\tt111$. Thus there are some binary strings
that cannot be produced as encodings.
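 More generally (a sketch, with $s$ denoting the number of prefixes
 currently available as pointer targets), a pointer written with
 $\lceil \log_2 s \rceil$ bits has $2^{\lceil \log_2 s \rceil} - s$
 unused values, here $8 - 5 = 3$; this waste vanishes only when $s$ is a
 power of two.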
}
\soln{ex.LZfail}{
Sources with low entropy that are not well compressed by Lempel--Ziv
include:\index{Lempel--Ziv coding!criticisms}
\ben
\item
Sources with some symbols that have
 long-range correlations and intervening
random junk. An ideal model should capture what's correlated
and compress it. Lempel--Ziv can only compress the correlated features
by memorizing all cases of the intervening junk.
As a simple example, consider a
\index{telephone number}\index{phone number}telephone book in which
every line contains an (old number, new number) pair:
\begin{center}
{\tt{285-3820:572-5892}}\teof\\
{\tt{258-8302:593-2010}}\teof\\
\end{center}
The number of characters per line is 18, drawn from the 13-character
alphabet
$\{ {\tt{0}},{\tt{1}},\ldots,{\tt{9}},{\tt{-}},{\tt{:}},\eof\}$.
The characters `{\tt{-}}',
`{\tt{:}}' and `\teof' occur in a predictable sequence, so
the true information content per line, assuming
all the phone numbers are seven digits long, and assuming
that they are random sequences,
is about 14 \dits. (A \dit\ is the information content
of a random integer between 0 and 9.)
A finite state language model could easily capture
the regularities in these data.
A Lempel--Ziv algorithm will take a long time before
 it compresses such a file down to 14 \dits\ per line,
% by a factor of $14/18$,
however, because in order for it to `learn' that
the string {\tt{:}}$ddd$ is always followed by {\tt{-}},
for any three digits $ddd$, it will have to {\em see\/}
all those strings. So near-optimal compression
will only be achieved after thousands of lines of the
file have been read.\medskip
% figs/wallpaper.ps made by pepper.p
\begin{figure}[htbp]
\fullwidthfigureright{%
%\figuremargin{%
\small
\begin{center}
\mbox{%(a)
\psfig{figure=figs/wallpaper.ps}}\\
%\mbox{(b) \psfig{figure=figs/wallpaperc.ps}}\\
%\mbox{(c) \psfig{figure=figs/wallpaperb.ps}}
\end{center}
}{%
\caption[a]{
A source with low entropy that is not well compressed by Lempel--Ziv.
The bit sequence is read from left to right.
Each line differs from the line above in $f=5$\% of its bits.
The image width is 400 pixels.
%
% Three
% sources with low entropy that are not well compressed by Lempel--Ziv.
% The bit sequence is read from left to right. The image width is 400 pixels
% in each case.
%
% (a) Each line differs from the line above in $p=$5\% of its bits.
%
% (b)
% Each column $c$ has its own transition probability $p_c$ such that
% successive vertical bits are identical with probability $p_c$. The
% probabilities $p_c$ are drawn from a uniform distribution over $[0,0.5]$.
%
% (c) As in b, but the probabilities $p_c$ are drawn from a uniform
% distribution over $[0,1]$.
}
% ; in columns with $p_c \simeq 1$, successive
% vertical bits are likely to be opposite to each other. }
%
\label{fig.pepper}
}%
\end{figure}
%
% this is beautiful but gratuitous
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%55
%\begin{figure}[htbp]
%\figuremargin{%
%\begin{center}
%\mbox{\psfig{figure=figs/automaton346.big1.ps,height=7in}}\\
%\end{center}
%}{%
%\caption[a]{A longer cellular automaton history.
%}
%\label{fig.automatonII}
%\end{figure}
\vspace*{-10pt}% included to undo the cumulation of item space and figure space.
\item
Sources with long range correlations, for example two-dimensional
images that are represented by a sequence of pixels, row by row,
so that vertically adjacent pixels are a distance $w$
apart in the source stream, where $w$ is the image width.
Consider, for example, a fax transmission in which each line
is very similar to the previous line (\figref{fig.pepper}).
The true entropy is only $H_2(f)$ per pixel, where $f$
is the probability that a pixel differs from its parent.
% except for a light peppering
% of noise.
% Each line is somewhat similar to the previous line but not identical,
% so there is no previous occurrence of a long string
% to point to; some algorithms in the Lempel--Ziv class
% will achieve a certain degree of compression
% by memorizing recent short strings, but the compression achieved
% will not equal the true entropy.
% and after a few lines,
% the pattern has moved on by a random walk, so memorizing ancient patterns
% is of no use.
Lempel--Ziv algorithms will only compress
down to the entropy once {\em all\/} $2^w = 2^{400}$ possible strings of length $w = 400$
have occurred and their successors have been memorized.
There are only about $2^{300}$ particles in the universe, so we
can confidently say that
Lempel--Ziv codes will {\em never\/} capture the redundancy
of such an image.
% figs/wallpaper.ps made by pepper.p
\begin{figure}[htbp]
%\figuremargin{%
\fullwidthfigureright{%
\begin{center}
%\mbox{(a) \psfig{figure=figs/wallpaperx.ps}}\\
\mbox{%(b)
\psfig{figure=figs/wallpaperx2.ps}}\\
%\mbox{(c) \psfig{figure=figs/automaton346.2.ps}}\\
% see also figs/automaton346.big1.pbm
\end{center}
}{%
\caption[a]{%A second source with low entropy that is not optimally compressed by Lempel--Ziv.
A texture consisting of horizontal and vertical pins
dropped at random on the plane.
% (c) The 100-step time-history of a cellular automaton with 400 cells.
}
\label{fig.wallpaper}
}%
\end{figure}
Another highly redundant texture is shown in \figref{fig.wallpaper}.
The image was made
by dropping horizontal and vertical pins randomly on the plane.
It contains both long-range vertical correlations and long-range horizontal
correlations. There is no practical way that Lempel--Ziv, fed with a pixel-by-pixel scan
of this image, could capture both these correlations.
% gzip on the pbm gives: 2374 wallpaperx.pbm.gz
% That is better than 50%.
% Saved as a gif, wallpaperx.pbm is 2926 characters. Original 40000 pixels would be 5000 characters.
% That is worse than 50% compression.
% cf. perl program, stripwallpaper.p
% is
% 0 8 274 /home/mackay/bin/stripwallpaper.p.gz
% 0 16 631 wallpaperx.asc.gz
% 0 24 905 total <--------
% 18 65 368 /home/mackay/bin/stripwallpaper.p
% 162 484 1390 wallpaperx.asc
% 180 549 1758 total
% lossless jpg is terrible!:
% 38828 wallpaperx.jpg
% would be nice to try JBIG on this.
% It is worth emphasizing that b
Biological computational systems
can readily identify the redundancy in these images and in images
that are much more complex; thus we might anticipate that
the best data compression algorithms will result from the development
of \ind{artificial intelligence} methods.\index{compression!future methods}
\item
Sources with intricate redundancy, such as files generated
by computers. For example, a \LaTeX\ file
followed by its encoding into a PostScript file. The information content
of this pair of files is roughly equal to the information content of the
\LaTeX\ file alone.
\item
A picture of the Mandelbrot set. The picture has an information content
equal to the number of bits required to specify the range of the
complex plane studied, the pixel sizes,
and the colouring rule used.
% mapping of set membership to pixel colour.
% \item
% Encoded transmissions arising from an error-correcting code of rate $K/N$.
% These are very easily compressed by a factor
% $K/N$ if the generator operation is known.
% see README2 in /home/mackay/_courses/comput/newising_mc
\item
A picture of a ground state of
a frustrated antiferromagnetic \ind{Ising model} (\figref{fig.ising.ground}),
which we will discuss
in \chref{ch.ising}.
Like \figref{fig.wallpaper}, this binary image has interesting
correlations in two directions.
\begin{figure}[htbp]
\figuremargin{%
\begin{center}
\mbox{\bighisingsample{hexagon2}}
\end{center}
}{%
\caption[a]{Frustrated triangular
Ising model in one of its ground states.}
\label{fig.ising.ground}
}%
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\item
Cellular automata -- \figref{fig.wallpaperc} shows
the state history of 100 steps of a \ind{cellular automaton}
with 400 cells. The update rule, in which each cell's new state depends on the state of five
preceding cells, was selected at random. The information content is equal to the information
in the boundary (400 bits) plus the information in the propagation rule, which here can be described in $2^5 = 32$ bits.
An optimal compressor will thus give a compressed file length which
is essentially constant, independent of the vertical height of the image.
Lempel--Ziv would only give this zero-cost compression once the
cellular automaton has entered a periodic limit cycle, which
could easily take about $2^{100}$ iterations.
In contrast, the JBIG compression method, which models the probability of
a pixel given its local context and uses
arithmetic coding, would do a good job on these images.
%\item
% And finally, an example relating to error-correcting codes:
% the\index{error-correcting code!and compression}\index{difficulty of compression}\index{compression!difficulty of}
% received transmissions arising when encoded transmissions are
% sent over a noisy channel. Such received strings have an entropy
% equal to the source entropy plus the channel noise's
% entropy. If a \index{Lempel--Ziv coding|)}Lempel--Ziv
% algorithm could compress these strings,
% this would be tantamount to solving the decoding problem for
% the error-correcting code!
%
% We have not got to this topic yet, but we will see later that
% the decoding of a general error-correcting code is
% a challenging intractable problem.
% automaton.p
\begin{figure}%[htbp]
%\figuremargin{%
\fullwidthfigureright{%
\begin{center}
\mbox{%(c)
\psfig{figure=figs/automaton346.2.ps}}\\
% see also figs/automaton346.big1.pbm
\end{center}
}{%
\caption[a]{% Another source with low entropy that is not optimally compressed by Lempel--Ziv.
The 100-step time-history of a cellular automaton with 400 cells.
}
\label{fig.wallpaperc}
}%
\end{figure}
\een
}
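The fax example in the solution above can be tried out with a short experiment. The Python sketch below generates an image of the kind shown in \figref{fig.pepper} (width 400, each line differing from the line above in a fraction $f=0.05$ of its bits), scans it row by row, and compresses it with the standard {\tt zlib} module. Two assumptions should be kept in mind: {\tt zlib} implements DEFLATE, a Lempel--Ziv-family method with Huffman coding, not the exact algorithm discussed in this chapter; and the image height of 100 lines, the random seed, and all function names are illustrative choices. The compressed size is compared with the source entropy, $w + (h-1)\,w\,H_2(f)$ bits.

```python
import math
import random
import zlib


def binary_entropy(f):
    """H_2(f) in bits."""
    if f in (0.0, 1.0):
        return 0.0
    return -f * math.log2(f) - (1 - f) * math.log2(1 - f)


def make_image(width, height, f, rng):
    """First row uniform random; each later row copies the row above,
    flipping each bit independently with probability f."""
    rows = [[rng.randrange(2) for _ in range(width)]]
    for _ in range(height - 1):
        rows.append([b ^ (rng.random() < f) for b in rows[-1]])
    return rows


def pack_bits(rows):
    """Row-by-row scan of the image, packed eight pixels per byte."""
    bits = [b for row in rows for b in row]
    out = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for b in bits[i:i + 8]:
            byte = (byte << 1) | b
        out.append(byte)
    return bytes(out)


if __name__ == "__main__":
    width, height, f = 400, 100, 0.05     # height is an assumed value
    image = make_image(width, height, f, random.Random(0))
    entropy_bits = width + (height - 1) * width * binary_entropy(f)
    compressed_bits = 8 * len(zlib.compress(pack_bits(image), 9))
    print("raw size       :", width * height, "bits")
    print("source entropy : %.0f bits" % entropy_bits)
    print("zlib (DEFLATE) :", compressed_bits, "bits")
```

A compressor that modelled each pixel given the pixel directly above it could approach the entropy figure; a dictionary-based method scanning the pixels in order has no comparable way to exploit the vertical correlation.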
\index{source code!stream codes|)}\index{stream codes|)}
\dvipsb{solutions stream codes}
%
%
%\section{Solutions}% to Chapter \protect\ref{ch_f4}'s exercises}
% \section{Solutions to section \protect\ref{ch_f4}'s exercises}
\fakesection{RNGaussian}
\soln{ex.RNGaussian}{
For a one-dimensional Gaussian, the
variance of $x$, $\Exp[x^2]$, is $\sigma^2$.
So the mean value of $r^2$ in $N$ dimensions,
since the components of $\bx$ are independent
random variables, is
\beq
\Exp[ r^2] = N \sigma^2 .
\eeq
The variance of $r^2$, similarly,
is $N$ times the variance of $x^2$,
where $x$ is a one-dimensional Gaussian
variable.
\beq
\var (x^2 ) = \int \! \d x \:
\frac{1}{(2 \pi \sigma^2)^{1/2}} x^4 \exp \left( - \frac{x^2}{2 \sigma^2} \right)
- \sigma^4 .
\eeq
The integral is found to be $3 \sigma^4$ (\eqref{eq.gaussian4thmoment}),
so $\var(x^2) = 2 \sigma^4$.
Thus the variance of $r^2$ is $2 N \sigma^4$.
For large $N$, the \ind{central-limit theorem}
% law of large numbers
indicates that
$r^2$ has a Gaussian distribution with mean $N \sigma^2$ and standard
deviation $\sqrt{2 N} \sigma^2$, so the probability density of $r$
must similarly be concentrated about $r \simeq \sqrt{N} \sigma$.
The thickness of this shell is given by turning the standard deviation of
$r^2$ into a standard deviation on $r$: for small
$\delta r/r$,
$\delta \log r = \delta r/r = (\dhalf) \delta \log r^2 = (\dhalf) \delta (r^2)/r^2$,
so setting $\delta (r^2) = \sqrt{2 N} \sigma^2$, $r$ has standard deviation
$\delta r = (\dhalf) r \delta (r^2)/r^2$
% $=$ $(\dhalf) \sqrt{2 N} \sigma^2 / \sqrt{( N \sigma^2)}$
$=\sigma/\sqrt{2}$.
The probability density of the Gaussian at a point $\bx_{\rm shell}$ where
$r = \sqrt{N} \sigma$ is
\beq
P(\bx_{\rm shell}) = \frac{1}{(2 \pi \sigma^2)^{N/2}}
\exp \left( - \frac{N \sigma^2}{2 \sigma^2} \right)
= \frac{1}{(2 \pi \sigma^2)^{N/2}}
\exp \left( - \frac{N}{2} \right) .
\eeq
In contrast, the probability density at the origin is
\beq
P(\bx\eq 0) = \frac{1}{(2 \pi \sigma^2)^{N/2}} .
\eeq
Thus $P(\bx_{\rm shell})/P(\bx\eq 0) = \exp \left( - \linefrac{N}{2} \right)$.
The probability density at the typical radius is smaller than
the density at the origin by a factor of $e^{N/2}$. If $N=1000$, then the probability
density at the origin is $e^{500}$ times greater.
%
}
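The shell result can be checked by simulation. The Python sketch below draws points from an $N$-dimensional spherical Gaussian and compares the sample mean and standard deviation of $r$ with the predicted values $\sqrt{N}\sigma$ and $\sigma/\sqrt{2}$. The function name, the sample size, and the random seed are assumptions made for illustration; the agreement is only approximate, since the derivation above holds for large $N$ and the sample is finite.

```python
import math
import random
import statistics


def sample_radii(N, sigma, num_samples, rng):
    """Draw points from an N-dimensional spherical Gaussian and
    return their distances r from the origin."""
    radii = []
    for _ in range(num_samples):
        r2 = sum(rng.gauss(0.0, sigma) ** 2 for _ in range(N))
        radii.append(math.sqrt(r2))
    return radii


if __name__ == "__main__":
    N, sigma = 1000, 1.0
    radii = sample_radii(N, sigma, 500, random.Random(0))
    print("mean r   : %.3f  (sqrt(N) sigma   = %.3f)"
          % (statistics.mean(radii), math.sqrt(N) * sigma))
    print("std of r : %.3f  (sigma / sqrt(2) = %.3f)"
          % (statistics.stdev(radii), sigma / math.sqrt(2)))
```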
%
%
% for _e4.tex
%
\fakesection{Source coding problems solutions}
%\soln{ex.forward-backward-language}{
%% (Draft.)
%%
% If we write down a language model for strings in forward-English,
% the same model defines a probability distribution over strings
% of backward English. The probability distributions have
% identical entropy, so the average information contents
% of the reversed
% language and the forward language are equal.
%}
%\soln{ex.microwave}{
% moved to cutsolutions.tex
% removed to cutsolutions.tex
% \soln{ex.bridge}{(Draft.)
\dvipsb{solutions further data compression f4}
%\subchapter{Codes for integers \nonexaminable}
\chapter{Codes for Integers \nonexaminable}
\label{ch.codesforintegers}
This chapter is an aside, which may safely be skipped.
\section*{Solution to \protect\exerciseref{ex.Cinteger}}% was fiftythree
\label{sec.codes.for.integers}\label{ex.Cinteger.sol}% special by hand
%\soln{ex.Cinteger}{
%}
\fakesection{Cinteger Solutions to problems}
%
% original integer stuff is in old/s_integer.tex
%
% chapter 2 , coding of integers
To discuss the coding of integers\index{source code!for integers}
we need some definitions.\index{binary representations}
\begin{description}
\item[The standard binary representation of a positive
integer] $n$ will be denoted by $c_{\rm b}(n)$,
\eg, $c_{\rm b}(5) = {\tt 101}$, $c_{\rm b}(45) = {\tt 101101}$.
\item[The standard binary length of a positive
integer] $n$, $l_{\rm b}(n)$, is the length
of the string $c_{\rm b}(n)$.
For example, $l_{\rm b}(5) = 3$, $l_{\rm b}(45) = 6$.
\end{description}
The standard binary representation $c_{\rm b}(n)$
is {\em not\/} a uniquely decodeable code for integers
since there is no way of knowing when an integer has ended.
For example, $c_{\rm b}(5)c_{\rm b}(5)$ is identical to $c_{\rm b}(45)$.
It would be uniquely decodeable if we knew the
standard binary length of each integer
before it was received.
Noticing that all positive integers have a standard binary representation
that starts with a {\tt{1}}, we might define another representation:
\begin{description}
\item[The headless binary representation of a positive
integer] $n$ will be denoted by $c_{\rm B}(n)$,
\eg, $c_{\rm B}(5) = {\tt 01}$, $c_{\rm B}(45) = {\tt 01101}$
and $c_{\rm B}(1) = \lambda$ (where $\lambda$ denotes the null
string).
\end{description}
This representation would be uniquely decodeable if we knew the
length $l_{\rm b}(n)$ of the integer.
So, how can we make a uniquely decodeable code for integers?
Two strategies can be distinguished.
\ben
\item {\bf Self-delimiting codes}.
We first communicate somehow
% An alternative strategy is to make the code self-delimiting
\index{symbol code!self-delimiting}\index{self-delimiting}the length of the integer, $l_{\rm b}(n)$,
which is also a positive integer; then communicate the original
integer $n$ itself using $c_{\rm B}(n)$.
\item {\bf Codes with `end of file' characters}.
We code the integer into blocks of length
$b$ bits, and reserve one of the $2^b$ symbols to
have the special meaning `end of file'. The coding
of integers into blocks is arranged so that
this reserved symbol is not needed for any other purpose.
\een
The simplest uniquely decodeable code for integers is the unary code,
which can be viewed as a code with an end of file character.
\begin{description}
\item[Unary code\puncspace]
An integer $n$ is encoded by sending a string of $n\!-\!1$ {\tt 0}s
% zeroes
followed by a {\tt 1}.
\[
\begin{array}{cl} \toprule
n & c_{\rm U}(n) \\ \midrule
1 & {\tt 1} \\
2 & {\tt 01} \\
3 & {\tt 001} \\
4 & {\tt 0001} \\
5 & {\tt 00001} \\
\vdots & \\
45 & {\tt 000000000000000000000000000000000000000000001} \\ \bottomrule
\end{array}
\]
The unary code has length $l_{\rm U}(n) = n$.
The unary code is the optimal code for integers if the probability
distribution over $n$ is $p_{\rm U}(n) = 2^{-{n}}$.
\end{description}
\subsubsection*{Self-delimiting codes}
We can use the unary code to encode the {\em length\/} of the binary
encoding of $n$ and make a self-delimiting code:
\begin{description}
\item[Code $C_\alpha$\puncspace]
% The length of the standard binary representation is a positive integer
We send the unary code for $l_{\rm b}(n)$, followed
by the headless binary representation of $n$.
\beq
c_{\alpha}(n) = c_{\rm U}[ l_{\rm b}(n) ] c_{\rm B}(n) .
\eeq
Table \ref{tab.calpha} shows the codes for some integers. The overlining
indicates the division of each string into the parts $c_{\rm U}[ l_{\rm b}(n) ]$
and $c_{\rm B}(n)$.
\margintab{\footnotesize
\[
\begin{array}{clll} \toprule
n & c_{\rm b}(n) & \makebox[0in][c]{$l_{\rm b}(n)$} & c_{\alpha}(n)
% = c_{\rm U}[ l_{\rm b}(n) ] c_{\rm B}(n)
\\ \midrule
1 & {\tt 1 } & 1 & {\tt {\overline{1}}} \\
2 & {\tt 10 } & 2 & {\tt {\overline{01}}0} \\
3 & {\tt 11 } & 2 & {\tt {\overline{01}}1} \\
4 & {\tt 100} & 3 & {\tt {\overline{001}}00} \\
5 & {\tt 101} & 3 & {\tt {\overline{001}}01} \\
6 & {\tt 110} & 3 & {\tt {\overline{001}}10} \\
\vdots & \\
45 & {\tt 101101} & 6 & {\tt {\overline{000001}}01101} \\ \bottomrule
\end{array}
\]
\caption[a]{$C_\alpha$.}
\label{tab.calpha}
}
We might equivalently view $c_{\alpha}(n)$ as consisting of a string
of $(l_{\rm b}(n)-1)$ zeroes followed by the standard binary representation
of $n$, $c_{\rm b}(n)$.
The codeword $c_{\alpha}(n)$ has length $l_{\alpha}(n) = 2 l_{\rm b}(n) - 1$.
The implicit probability distribution over $n$ for the code
$C_{\alpha}$ is separable
into the product of a probability distribution over the length $l$,
\beq
P(l) = 2^{-l} ,
\eeq
and a uniform distribution over integers having that length,
\beq
P(n\given l) = \left\{ \begin{array}{cl} 2^{-l+1} & l_{\rm b}(n) = l \\
0 & \mbox{otherwise}.
\end{array} \right.
\eeq
\end{description}
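As a check, multiplying the two factors recovers the codeword lengths of $C_{\alpha}$: for an integer $n$ with $l_{\rm b}(n) = l$,
\beq
	P(n) = P(l) P(n\given l) = 2^{-l} \, 2^{-l+1} = 2^{-(2l-1)} = 2^{-l_{\alpha}(n)} .
\eeq
Since there are $2^{l-1}$ integers of each length $l$,
\beq
	\sum_{n \geq 1} 2^{-l_{\alpha}(n)} = \sum_{l \geq 1} 2^{l-1} \, 2^{-(2l-1)} = \sum_{l \geq 1} 2^{-l} = 1 ,
\eeq
so $C_{\alpha}$ satisfies the Kraft inequality with equality.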
Now, for the above code, the header that communicates
the length always occupies the same number
of bits as the standard binary representation of the integer (give or take
one). If we are expecting to encounter large integers (large files)
then this representation seems suboptimal, since it leads to
all files occupying a size that is double their original
uncoded size. Instead of using the unary
code to encode the length $l_{\rm b}(n)$, we could use $C_{\alpha}$.%
% see graveyard for original
\margintab{{\footnotesize
\[
\begin{array}{cll} \toprule
n & c_{\beta}(n) & c_{\gamma}(n)
\\ \midrule
1 & {\tt{\overline{1}}} & {\tt{\overline{1}}} \\
2 & {\tt{\overline{010}}0} & {\tt{\overline{0100}}0} \\
3 & {\tt{\overline{010}}1} & {\tt{\overline{0100}}1} \\
4 & {\tt{\overline{011}}00}& {\tt{\overline{0101}}00} \\
5 & {\tt{\overline{011}}01}& {\tt{\overline{0101}}01} \\
6 & {\tt{\overline{011}}10}& {\tt{\overline{0101}}10} \\
\vdots & \\
45 & {\tt{\overline{00110}}01101} & {\tt{\overline{01110}}01101} \\ \bottomrule
\end{array}
\]
}
\caption[a]{$C_\beta$ and $C_{\gamma}$.}
\label{tab.cbeta}
}
\begin{description}
\item[Code $C_\beta$\puncspace]
% The length of the standard binary representation is a positive integer
We send the length $l_{\rm b}(n)$ using $C_{\alpha}$, followed
by the headless binary representation of $n$.
\beq
c_{\beta}(n) = c_{\alpha}[ l_{\rm b}(n) ] c_{\rm B}(n) .
\eeq
\end{description}
Iterating this procedure, we can define a sequence of codes.
\begin{description}
\item[Code $C_{\gamma}$\puncspace]
\beq
c_{\gamma}(n) = c_{\beta}[ l_{\rm b}(n) ] c_{\rm B}(n) .
\eeq
% see graveyard for gamma table
\item[Code $C_\delta$\puncspace]
\beq
c_{\delta}(n) = c_{\gamma}[ l_{\rm b}(n) ] c_{\rm B}(n) .
\eeq
\end{description}
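The definitions of the self-delimiting codes translate directly into code. The following Python sketch is a minimal illustration; the function names mirror the notation $c_{\rm b}$, $c_{\rm B}$, $c_{\rm U}$, $c_{\alpha}, \ldots, c_{\delta}$ but are otherwise assumptions, and the decoder for $C_{\alpha}$ is included only to show that the code is self-delimiting.

```python
def c_b(n):
    """Standard binary representation of a positive integer."""
    return bin(n)[2:]


def l_b(n):
    """Standard binary length."""
    return n.bit_length()


def c_B(n):
    """Headless binary representation (leading 1 removed)."""
    return c_b(n)[1:]


def c_U(n):
    """Unary code: n - 1 zeros followed by a one."""
    return "0" * (n - 1) + "1"


def c_alpha(n):
    return c_U(l_b(n)) + c_B(n)


def c_beta(n):
    return c_alpha(l_b(n)) + c_B(n)


def c_gamma(n):
    return c_beta(l_b(n)) + c_B(n)


def c_delta(n):
    return c_gamma(l_b(n)) + c_B(n)


def decode_alpha(stream):
    """Decode one C_alpha codeword from the front of a bit string;
    return (n, remaining bits)."""
    length = stream.index("1") + 1          # unary header gives l_b(n)
    body = stream[length:2 * length - 1]    # headless part, l_b(n) - 1 bits
    return int("1" + body, 2), stream[2 * length - 1:]


if __name__ == "__main__":
    for n in (1, 2, 3, 4, 5, 6, 45):
        print(n, c_alpha(n), c_beta(n), c_gamma(n), c_delta(n))
```

Each code in the sequence reuses the previous one to send the length, which is the iteration described above.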
\subsection{Codes with end-of-file symbols}
We can also make byte-based representations.
(Let's use the term byte
flexibly here, to denote any fixed-length string of bits, not
just a string of length 8 bits.)
If we encode the number in some base, for example decimal, then
we can represent each digit in a byte. In order to represent
a digit from 0 to 9 in a byte we need four bits.
Because $2^4 = 16$, this leaves 6 extra four-bit symbols,
$\{${\tt 1010}, {\tt 1011}, {\tt 1100}, {\tt 1101}, {\tt 1110},
{\tt 1111}$\}$,
that correspond to no decimal digit. We can use these
as end-of-file symbols to indicate the end of our positive
integer.
% Such a code can also code the integer zero, for which
% we have not been providing a code up till now.
Clearly it is redundant to have more than one end-of-file
symbol, so a more efficient code would encode the integer
into base 15, and use just the sixteenth symbol, {\tt 1111},
as the punctuation character.
Generalizing this idea, we can make similar byte-based
codes for integers in bases 3 and 7, and in any base of
the form $2^n-1$.
\margintab{\small
\[
\begin{array}{cll} \toprule
n & c_3(n) & c_{7}(n)
% = c_{\rm U}[ l_{\rm b}(n) ] c_{\rm B}(n)
\\ \midrule
1 & {\tt 01\, 11 } & {\tt 001\, 111} \\
2 & {\tt 10\, 11 } & {\tt 010\, 111} \\
3 & {\tt 01\, 00\, 11 } & {\tt 011\, 111} \\
\vdots & \\
45 & {\tt 01\, 10\, 00\, 00\, 11} & {\tt 110\, 011\, 111} \\ \bottomrule
\end{array}
\]
\caption[a]{Two codes with end-of-file symbols,
$C_3$ and $C_7$. Spaces have been included to show the
byte boundaries.
}
}
These codes are almost complete. (Recall that
a code is `complete' if it satisfies the
Kraft inequality with equality.) The codes'
remaining inefficiency is that they provide the
ability to encode the integer zero and the empty string,
neither of which was required.
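
The byte-based codes can be sketched in a few lines of Python. In this illustration (the function names and the parameter {\tt b}, the block size in bits, are assumptions), the integer is written in base $2^b - 1$, each digit is sent as a $b$-bit block, and the all-ones block is reserved as the end-of-file symbol.

```python
def c_eof(n, b):
    """Encode a positive integer in base 2**b - 1, one b-bit block per
    digit, terminated by the all-ones block (the end-of-file symbol)."""
    if b < 2:
        raise ValueError("block size must be at least 2 bits")
    base = 2 ** b - 1
    digits = []
    while n > 0:
        n, d = divmod(n, base)
        digits.append(d)
    blocks = [format(d, "0%db" % b) for d in reversed(digits)]
    return "".join(blocks) + "1" * b


def decode_eof(stream, b):
    """Read b-bit blocks until the end-of-file block; return (n, rest)."""
    base, eof, n, pos = 2 ** b - 1, "1" * b, 0, 0
    while stream[pos:pos + b] != eof:
        n = n * base + int(stream[pos:pos + b], 2)
        pos += b
    return n, stream[pos + b:]


if __name__ == "__main__":
    for n in (1, 2, 3, 45):
        print(n, c_eof(n, 2), c_eof(n, 3))
```

With {\tt b = 2} and {\tt b = 3} this gives the codes $C_3$ and $C_7$; {\tt b = 4} gives the base-15 code with punctuation character {\tt 1111}.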
\exercissxB{2}{ex.intEOF}{
Consider the implicit probability distribution over integers
corresponding to the code with an end-of-file character.
\ben
\item
If the code has eight-bit blocks (\ie, the integer is
coded in base 255), what is the mean length in bits
of the integer, under the implicit distribution?
\item
If one wishes to encode binary files of expected size about one hundred
\kilobytes\ using a code with an end-of-file character, what is the optimal
block size?
\een
}
\subsection*{Encoding a tiny file}
% see claude.p in itp/tex
To illustrate the codes we have discussed, we now use each
code to encode
a small file consisting of just 14 characters,
\[
\framebox{\tt{Claude Shannon}}.
\]
\bit
\item
If we map the ASCII characters onto seven-bit
symbols (\eg, in decimal,
${\tt C}=67$, ${\tt l}=108$, etc.), this 14 character file corresponds to the
integer
\[
n = 167\,987\,786\,364\,950\,891\,085\,602\,469\,870 \:\:\mbox{(decimal)}.
\]
\item
The unary code for $n$ consists of this many (less one) zeroes,
followed by a one. If all the oceans were turned into ink, and if we
wrote a hundred bits with every cubic millimeter,
% or microlitre
there
% would be roughly
might be enough ink to write $c_{\rm U}(n)$.
\item
The standard binary representation of $n$ is this length-98 sequence of bits:
\beqa
c_{\rm b}(n) &=& \begin{array}[t]{l}
\tt 1000011110110011000011110101110010011001010100000 \\
\tt 1010011110100011000011101110110111011011111101110.
\end{array}
\eeqa
% To store this self-delimiting file
% on a disc, we would need
\eit
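The mapping from the file to the integer $n$ can be reproduced as follows. This Python sketch is an illustration under the stated convention (seven-bit ASCII symbols, concatenated with the first character most significant); the function name is an assumption.

```python
def file_to_integer(text):
    """Concatenate the seven-bit ASCII codes of the characters and read
    the result as one binary number."""
    n = 0
    for ch in text:
        code = ord(ch)
        if code >= 128:
            raise ValueError("seven-bit ASCII only")
        n = (n << 7) | code
    return n


if __name__ == "__main__":
    n = file_to_integer("Claude Shannon")
    print("n      =", n)
    print("c_b(n) =", bin(n)[2:])
    print("l_b(n) =", n.bit_length())
```

Because {\tt C} has the seven-bit code {\tt 1000011}, which begins with a {\tt 1}, the standard binary length is the full $14 \times 7 = 98$ bits.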
\exercisaxB{2}{ex.claudeshannonn}{
Write down or describe the following
self-delimiting representations of the above number $n$:
$c_{\alpha}(n)$,
$c_{\beta}(n)$,
$c_{\gamma}(n)$,
$c_{\delta}(n)$,
$c_{3}(n)$,
$c_{7}(n)$, and
$c_{15}(n)$.
Which of these encodings is the shortest? [{\sf{Answer:}} $c_{15}$.]
}
%
% solution moved to cutsolutions.tex
%
\subsection{Comparing the codes}
One could answer the question `which of two codes is
superior?' by a sentence of the form `For $n>k$, code 1 is
superior, for $n<k$, code 2 is superior'.
Secondly, the depiction in terms of Venn diagrams
encourages one to believe that all the areas correspond to
positive quantities. In the special case of two random variables
it is indeed true that $H(X \given Y)$, $\I(X;Y)$ and $H(Y \given X)$ are non-negative
quantities. But as soon as we progress to three-variable ensembles,
we obtain a diagram with positive-looking areas that
may actually correspond to negative quantities. \Figref{fig.venn3}
correctly shows relationships such as
\beq
H(X) + H(Z \given X) + H(Y \given X,Z) = H(X,Y,Z) .
\eeq
But it gives the misleading impression that
the conditional mutual information $\I(X;Y \given Z)$ is
{\em less than\/} the mutual information
$\I(X;Y)$.
\begin{figure}
\figuremargin{%3/4
\begin{center}
\mbox{\psfig{figure=figs/venn3.ps,angle=-90,width=5.25in}}
\end{center}
}{%
\caption[a]{A misleading representation of entropies, continued.}
\label{fig.venn3}
}%
\end{figure}
In fact the area labelled $A$ can correspond to a {\em negative\/}
quantity. Consider the joint ensemble
$(X,Y,Z)$ in which $x \in \{0,1\}$ and $y \in \{0,1\}$
are independent binary variables and $z \in \{0,1\}$ is defined
to be $z=x+y \mod 2$.
Then clearly $H(X) = H(Y) = 1$ bit. Also $H(Z) = 1$ bit.
And $H(Y \given X) = H(Y) = 1$ bit, since the two variables are independent.
So the mutual information between $X$ and $Y$ is zero:
$\I(X;Y) = 0$. However, if $z$ is observed, $X$ and $Y$ become dependent:
% correlated ---
knowing $x$, given $z$, tells you what $y$ is: $y = z - x \mod 2$.
So $\I(X;Y \given Z) = 1$ bit. Thus the area labelled $A$ must correspond
to $-1$ bits for the figure to give the correct answers.
The above example is not at all a capricious or exceptional
illustration.
The binary symmetric channel with input $X$, noise $Y$, and output $Z$
% The classic\index{earthquake and burglar alarm}\index{burglar alarm and earthquake}
% earthquake-burglar-alarm ensemble \exercisebref{ex.burglar}\
%% (section ???),
% with
% earthquake $= X$,
% burglar $ = Y$ and alarm $= Z$,
% is a perfect example of a
is a
situation in which $\I(X;Y)=0$ (input and noise are independent)
% uncorrelated
but $\I(X;Y \given Z) > 0$ (once you see the output, the unknown input and the unknown noise
are intimately related!).
The Venn diagram representation is therefore valid only if one is aware
that positive areas may represent negative quantities.
With this proviso
% As long as this possibility is
kept in mind, the
interpretation of entropies in terms
of sets can be helpful \cite{Yeung1991}.
% The quantity corresponding to $A$ is denoted $I(X;Y;Z)$
% by \citeasnoun{Yeung1991}.
}
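The three-variable example in the solution above ($x$ and $y$ independent binary variables, $z = x + y \mod 2$) can be verified by direct computation. The Python sketch below is an illustration only; the helper {\tt entropy} and the variable names are assumptions. It evaluates $\I(X;Y)$ and $\I(X;Y \given Z)$ from joint entropies, and the quantity that the area labelled $A$ would have to represent, $\I(X;Y) - \I(X;Y \given Z)$.

```python
from itertools import product
from math import log2


def entropy(joint, keep):
    """Entropy in bits of the marginal over the coordinates in `keep`."""
    marginal = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in keep)
        marginal[key] = marginal.get(key, 0.0) + p
    return -sum(p * log2(p) for p in marginal.values() if p > 0)


# Joint ensemble (x, y, z): x, y independent fair bits, z = x + y mod 2.
joint = {(x, y, (x + y) % 2): 0.25 for x, y in product((0, 1), repeat=2)}
X, Y, Z = 0, 1, 2

# I(X;Y) = H(X) + H(Y) - H(X,Y)
mi_xy = entropy(joint, [X]) + entropy(joint, [Y]) - entropy(joint, [X, Y])
# I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)
mi_xy_given_z = (entropy(joint, [X, Z]) + entropy(joint, [Y, Z])
                 - entropy(joint, [X, Y, Z]) - entropy(joint, [Z]))

print("I(X;Y)   =", mi_xy, "bits")
print("I(X;Y|Z) =", mi_xy_given_z, "bits")
print("area A   =", mi_xy - mi_xy_given_z, "bits")
```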
\soln{ex.dataprocineq}{% BORDERLINE
%{\bf New answer:}
For any joint ensemble $XYZ$, the following chain rule
for mutual information holds.
\beq
\I(X;Y,Z) = \I(X;Y) + \I(X;Z \given Y) .
\eeq
Now, in the case $w \rightarrow d \rightarrow r$,
$w$ and $r$ are independent given $d$, so
$\I(W;R \given D) = 0$. Using the chain rule twice, we have:
\beq
\I(W;D,R) = \I(W;D)
\eeq
and
\beq
\I(W;D,R) = \I(W;R) + \I(W;D \given R) ,
\eeq
so, since $\I(W;D \given R) \geq 0$,
\beq
\I(W;R) - \I(W;D) \leq 0 .
\eeq
% for more solutions to this problem see
% Igraveyard.tex
}
\prechapter{About Chapter}
\fakesection{prerequisites for chapter 5}
Before reading \chref{ch.five}, you should have read \chapterref{ch.one}
and worked
on
\exerciseref{ex.rel.ent}, and
\exerciserefrange{ex.Hcondnal}{ex.zxymod2}.
% \exfifteen--\exeighteen,
% \extwenty--\extwentyone, and \extwentythree.
% uvw to HXY>0
% {ex.Hmutualineq}{ex.joint},
% \exerciserefrangeshort{ex.rel.ent}
% load of H() and I() stuff shoved in here now.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\ENDprechapter
\chapter{Communication over a Noisy Channel}
\label{ch.five}
% % l5.tex
%
% useful program: bin/capacity.p for checking channel
% capacities
%
% % \part{Noisy Channel Coding}
% \chapter{Communication over a noisy channel}
% % The noisy-channel coding theorem, part a}
% % \chapter{The noisy channel coding theorem, part a}
\label{ch5}
\section{The big picture}
%
\setlength{\unitlength}{1mm}
\begin{realcenter}
%\begin{floatingfigure}[l]{3.2in}
\begin{picture}(85,50)(-40,5)
\thinlines
\put(0,5){\framebox(25,10){\begin{tabular}{c}Noisy\\ channel\end{tabular}}}
\put(-20,20){\framebox(25,10){\begin{tabular}{c}Encoder\end{tabular}}}
\put(20,20){\framebox(25,10){\begin{tabular}{c}Decoder\end{tabular}}}
\put(-20,40){\framebox(25,10){\begin{tabular}{c}Compressor\end{tabular}}}
\put(20,40){\framebox(25,10){\begin{tabular}{c}Decompressor\end{tabular}}}
\put(-40,40){\makebox(15,10){\begin{tabular}{c}{\sc Source}\\{\sc coding}\end{tabular}}}
\put(-40,20){\makebox(15,10){\begin{tabular}{c}{\sc Channel}\\{\sc coding}\end{tabular}}}
\put(-20,55){\makebox(25,10){Source}}
%
\put(-7.5,18){\line(0,-1){8}}
\put(-7.5,10){\vector(1,0){6}}
\put(32.5,10){\vector(0,1){8}}
\put(32.5,10){\line(-1,0){6}}
%
\put(32.5,31){\vector(0,1){8}}
\put(32.5,51){\vector(0,1){6}}
\put(-7.5,39){\vector(0,-1){8}}
\put(-7.5,57){\vector(0,-1){6}}
\end{picture}
\end{realcenter}
%
In\index{channel!noisy} Chapters \ref{ch2}--\ref{ch4},
we discussed source coding with block
codes, symbol codes and stream codes. We implicitly assumed that
the channel from the compressor to the decompressor
was noise-free. Real channels are noisy. We will now spend two
chapters on the subject of noisy-channel coding -- the fundamental
possibilities and limitations of error-free \ind{communication} through a
noisy channel. The aim of channel coding
is to make the noisy channel behave like a noiseless channel.
We will assume that the data to be transmitted
has been through a good compressor, so the bit stream has no
obvious redundancy. The channel code, which makes the transmission,
will put\index{redundancy!in channel code}
back
% into the transmission
redundancy of a special sort, designed
to make the noisy received signal decodeable.\index{decoder}
Suppose we transmit 1000 bits per second\index{channel!binary symmetric}
with $p_0 = p_1 = \dhalf$
over a noisy channel that flips bits with probability
$f = 0.1$. What is the rate of
transmission of information?
% shannon p.35
We might guess that the rate is 900 bits per second by subtracting
the expected number of errors per second. But this is not correct, because
the recipient does not know where the errors occurred.
Consider the case where the noise is so great that
the received symbols are independent of the
transmitted symbols. This corresponds to a noise level of $f=0.5$,
since half of the received symbols are correct due to chance alone.
But when $f=0.5$, no information is transmitted at all.
% ? cut this clearly?
\label{sec.ch5.intro}
% refer to exercise {ex.zxymod2}.
Given what we have learnt about entropy, it seems reasonable that
a measure of the information transmitted is given by the \ind{mutual
information} between the source and the received signal, that is, the
entropy of the source minus the \ind{conditional entropy}
of the source given the received signal.
%
% shannon calls the conditional entropy the equivocation
% and points out that the equivocation is the amount of extra
% information needed for a correcting device to figure out
% what is going on
We will now review the definition of conditional entropy
and mutual information. Then we will examine
% progress to the question of
whether it is possible to use such a noisy channel to communicate
{\em reliably}.
We will
% Our aim here is to
show that for any channel $Q$ there is a non-zero rate,
the \inds{capacity}\index{channel!capacity}
$C(Q)$, up to which information can be sent with arbitrarily
small probability of error.
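
To make the proposed measure concrete, the mutual information for the example above (a channel that flips bits with probability $f$, with $p_0 = p_1 = \dhalf$) can be evaluated numerically. The Python sketch below is an illustration only; it uses the identity $\I(X;Y) = H(Y) - H(Y \given X)$, and the function names are assumptions.

```python
from math import log2


def binary_entropy(p):
    """H_2(p) in bits, with H_2(0) = H_2(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)


def bsc_mutual_information(p1, f):
    """I(X;Y) = H(Y) - H(Y|X) for a binary symmetric channel with
    input probability P(x=1) = p1 and flip probability f."""
    q1 = p1 * (1 - f) + (1 - p1) * f      # P(y=1)
    return binary_entropy(q1) - binary_entropy(f)


if __name__ == "__main__":
    for f in (0.0, 0.1, 0.5):
        rate = 1000 * bsc_mutual_information(0.5, f)
        print("f = %.1f: %.0f bits of information per second" % (f, rate))
```

For $f = 0.1$ the mutual information is $1 - H_2(0.1) \simeq 0.53$ bits per transmitted bit, well below the naive guess of 0.9, and for $f = 0.5$ it is zero.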
\section{Review of probability and information}
% conditional, joint and mutual information}
% We now build on
As an example, we take the joint distribution $XY$ from
\extwentyone.
%
% A useful picture breaks down the total information content $H(X,Y)$
% of a joint ensemble thus:
% \begin{center}
% \setlength{\unitlength}{1in}
% \begin{picture}(3,1.13)(0,-0.2)
% \put(0,0.7){\framebox(3,0.20){$H(X,Y)$}}
% \put(0,0.4){\framebox(2.2,0.20){$H(X)$}}
% \put(1.5,0.1){\framebox(1.5,0.20){$H(Y)$}}
% \put(1.5125,-0.2){\framebox(0.675,0.20){$\I(X;Y)$}}
% \put(0,-0.2){\framebox(1.475,0.20){$H(X \given Y)$}}
% \put(2.225,-0.2){\framebox(0.775,0.20){$H(Y \specialgiven X)$}}
% \end{picture}
% \end{center}
%
% \subsection{Example of a joint ensemble}
% A joint ensemble $XY$ has the following joint distribution.
The
marginal distributions $P(x)$ and $P(y)$ are shown in the
margins.
% $P(x,y)$:
\[
\begin{array}{cc|cccc|c}
\multicolumn{2}{c}{P(x,y)} & \multicolumn{4}{|c|}{x} & P(y) \\[0.051in]
& & 1 & 2 & 3 & 4 & \\[0.011in]
\hline
\strutf
&1 & \dfrac{1}{8} & \dfrac{1}{16} & \dfrac{1}{32} & \dfrac{1}{32} & \dfrac{1}{4} \\[0.01in]
\raisebox{0mm}{\mbox{$y$}}
&2 & \dfrac{1}{16} & \dfrac{1}{8} & \dfrac{1}{32} & \dfrac{1}{32} & \dfrac{1}{4} \\[0.01in]
&3 & \dfrac{1}{16} & \dfrac{1}{16} & \dfrac{1}{16} & \dfrac{1}{16} & \dfrac{1}{4} \\[0.01in]
&4 & \dfrac{1}{4} & 0 & 0 & 0 & \dfrac{1}{4} \\[0.01in]
\hline
\multicolumn{2}{c|}{P(x)}
& \strutf\dfrac{1}{2} & \dfrac{1}{4} & \dfrac{1}{8} & \dfrac{1}{8} & \\[0.051in]
\end{array}
\]
The joint entropy is $H(X,Y)=27/8$ bits.
The marginal entropies are $H(X) = 7/4$ bits and $H(Y) = 2$ bits.
We can compute the conditional distribution of $x$ for each value of $y$,
and the entropy of each of those conditional distributions:
\[
\begin{array}{cc|cccc|c}
\multicolumn{2}{c|}{P(x \given y)} & \multicolumn{4}{c|}{x} & H(X \given y) / \mbox{bits} \\[0.051in]
& & 1 & 2 & 3 & 4 & \\[0.011in]
\hline
\strutf
&1 & \dfrac{1}{2} & \dfrac{1}{4} & \dfrac{1}{8} & \dfrac{1}{8} & \dfrac{7}{4} \\[0.01in]
\raisebox{0mm}{\mbox{$y$}}
&2 & \dfrac{1}{4} & \dfrac{1}{2} & \dfrac{1}{8} & \dfrac{1}{8} & \dfrac{7}{4} \\[0.01in]
&3 & \dfrac{1}{4} & \dfrac{1}{4} & \dfrac{1}{4} & \dfrac{1}{4} & 2 \\[0.01in]
&4 & 1 & 0 & 0 & 0 & 0 \\[0.01in]
\hline
\multicolumn{3}{c}{\strutf
} & \multicolumn{4}{r}{H(X \given Y) = \dfrac{11}{8}} \\[0.1in]
\end{array}
\]
Note that whereas $H(X \given y\eq 4) = 0$ is less than $H(X)$, $H(X \given y\eq 3)$ is greater
than $H(X)$.
% _s5A.tex has a solution link already \label{ex.Hcondnal.sol}
% \label{ex.joint.sol}
So in some cases, learning $y$ can
% make us more uncertain
{\em increase\/} our uncertainty
about $x$. Note also that although $P(x \given y\eq 2)$
is a different distribution from $P(x)$, the conditional entropy $H(X \given y\eq 2)$
is equal to $H(X)$. So learning that $y$ is 2 changes our knowledge
about $x$ but does not reduce the uncertainty
of $x$, as measured by the entropy. On average though,
learning $y$ does convey information
about $x$, since $H(X \given Y) < H(X)$.
One may also evaluate $H(Y \specialgiven X) = 13/8$ bits.
The mutual information is
$\I(X;Y) = H(X) - H(X \given Y) = 3/8$ bits.
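
As a check, here is one way to obtain the last two numbers from quantities
already computed, using the chain rule $H(X,Y) = H(X) + H(Y \specialgiven X)$:
\[
\begin{array}{rcl}
H(Y \specialgiven X) &=& H(X,Y) - H(X) \:\:=\:\: \dfrac{27}{8} - \dfrac{14}{8} \:\:=\:\: \dfrac{13}{8} \mbox{ bits} , \\[0.08in]
\I(X;Y) &=& H(Y) - H(Y \specialgiven X) \:\:=\:\: \dfrac{16}{8} - \dfrac{13}{8} \:\:=\:\: \dfrac{3}{8} \mbox{ bits} .
\end{array}
\]
The second line agrees with the value obtained from
$H(X) - H(X \given Y) = 14/8 - 11/8$, as it must, since the mutual information
is symmetric in $X$ and $Y$.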
% INCLUDE ME LATER %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% INCLUDE ME LATER %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% INCLUDE ME LATER %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% \subsection{Solutions to a few other exercises}
% \input{tex/entropy_soln.tex}
%
% INCLUDE ME LATER %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% INCLUDE ME LATER %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% INCLUDE ME LATER %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
%\mynewpage MNBV
\section{Noisy channels}
\begin{description}
\item[A discrete memoryless channel $Q$] is\index{channel!discrete memoryless}
characterized by
an input alphabet $\A_X$, an output alphabet $\A_Y$,
and a set of conditional probability distributions $P(y \given x)$, one
for each $x \in \A_X$.
These {\dbf{transition probabilities}} may be written in a matrix\index{transition probability matrix}
\beq
Q_{j|i} = P(y\eq b_j \given x\eq a_i) .
\eeq
\begin{aside}
I\index{notation!conventions of this book}\index{notation!matrices}\index{notation!vectors}\index{conventions!matrices}\index{conventions!vectors}\index{notation!transition probability}
usually orient this matrix with the output variable
$j$ indexing the rows and the input variable $i$
indexing the columns, so that each column of $\bQ$ is a probability
vector. With this convention, we can obtain the probability
of the output, $\bp_Y$, from a probability distribution over the input,
$\bp_X$, by right-multiplication:
\beq
\bp_Y = \bQ \bp_X .
\eeq
%
\end{aside}
%
\end{description}
\noindent
Some useful model channels are:
\begin{description}
% bsc
\item[Binary symmetric channel\puncspace]
\indexs{channel!binary symmetric}\indexs{binary symmetric channel}
$\A_X \eq \{{\tt 0},{\tt 1}\}$. $\A_Y \eq \{{\tt 0},{\tt 1}\}$.
\[
\begin{array}{c}
\setlength{\unitlength}{0.46mm}
\begin{picture}(30,20)(-5,0)
\put(-4,9){{\makebox(0,0)[r]{$x$}}}
\put(5,2){\vector(1,0){10}}
\put(5,16){\vector(1,0){10}}
\put(5,4){\vector(1,1){10}}
\put(5,14){\vector(1,-1){10}}
\put(4,2){\makebox(0,0)[r]{1}}
\put(4,16){\makebox(0,0)[r]{0}}
\put(16,2){\makebox(0,0)[l]{1}}
\put(16,16){\makebox(0,0)[l]{0}}
\put(24,9){{\makebox(0,0)[l]{$y$}}}
\end{picture}
\end{array}
\hspace{1in}
\begin{array}{ccl}
P(y\eq {\tt 0} \given x\eq {\tt 0}) &=& 1 - \q ; \\ P(y\eq {\tt 1} \given x\eq {\tt 0}) &=& \q ;
\end{array}
\begin{array}{ccl}
P(y\eq {\tt 0} \given x\eq {\tt 1}) &=& \q ; \\ P(y\eq {\tt 1} \given x\eq {\tt 1}) &=& 1 - \q .
\end{array}
\hspace{1in}
\begin{array}{c}
\ecfig{bsc15.1}
\end{array}
\]
%
% \BEC bec BEC
%
\item[Binary erasure channel\puncspace] \indexs{channel!binary erasure}\indexs{binary erasure channel}
$\A_X \eq \{{\tt 0},{\tt 1}\}$. $\A_Y \eq \{{\tt 0},\mbox{\tt ?},{\tt 1}\}$.
\[
\begin{array}{c}
\setlength{\unitlength}{0.46mm}
\begin{picture}(30,30)(-5,0)
\put(-4,15){{\makebox(0,0)[r]{$x$}}}
\put(5,5){\vector(1,0){10}}
\put(5,25){\vector(1,0){10}}
\put(5,5){\vector(1,1){10}}
\put(5,25){\vector(1,-1){10}}
\put(4,5){\makebox(0,0)[r]{\tt 1}}
\put(4,25){\makebox(0,0)[r]{\tt 0}}
\put(16,5){\makebox(0,0)[l]{\tt 1}}
\put(16,25){\makebox(0,0)[l]{\tt 0}}
\put(16,15){\makebox(0,0)[l]{\tt ?}}
\put(24,15){{\makebox(0,0)[l]{$y$}}}
\end{picture}
\end{array}
\hspace{1in}
\begin{array}{ccl}
P(y\eq {\tt 0} \given x\eq {\tt 0}) &=& 1 - \q ; \\
P(y\eq \mbox{\tt ?} \given x\eq {\tt 0}) &=& \q ; \\
P(y\eq {\tt 1} \given x\eq {\tt 0}) &=& 0 ;
\end{array}
\begin{array}{ccl}
P(y\eq {\tt 0} \given x\eq {\tt 1}) &=& 0 ; \\
P(y\eq \mbox{\tt ?} \given x\eq {\tt 1}) &=& \q ; \\
P(y\eq {\tt 1} \given x\eq {\tt 1}) &=& 1 - \q .
\end{array}
\hspace{1in}
\begin{array}{c}
\ecfig{bec.1}
\end{array}
\]
\item[Noisy typewriter\puncspace]
\indexs{channel!noisy typewriter}\indexs{noisy typewriter}
$\A_X = \A_Y = \mbox{the 27 letters $\{${\tt A},
{\tt B}, \ldots, {\tt Z}, {\tt -}$\}$}$.
The letters are arranged in a circle, and
when the typist attempts to type {\tt B}, what comes out is
 either {\tt A}, {\tt B} or {\tt C}, with probability $\dfrac{1}{3}$ each;
when the input is {\tt C}, the output is {\tt B}, {\tt C} or {\tt D};
and so forth, with the final letter `{\tt -}'
% being
adjacent to the
first letter {\tt A}.
\[
\begin{array}{c}
\setlength{\unitlength}{1pt}
\begin{picture}(48,130)(0,2)
\thinlines
\put(5,5){\vector(3,0){30}}
\put(5,25){\vector(3,0){30}}
\put(5,15){\vector(3,0){30}}
\put(5,5){\vector(3,1){30}}
\put(5,25){\vector(3,-1){30}}
\put(4,5){\makebox(0,0)[r]{{\tt -}}}
\put(4,15){\makebox(0,0)[r]{{\tt Z}}}
\put(4,25){\makebox(0,0)[r]{{\tt Y}}}
\put(36,5){\makebox(0,0)[l]{{\tt -}}}
\put(36,15){\makebox(0,0)[l]{{\tt Z}}}
\put(36,25){\makebox(0,0)[l]{{\tt Y}}}
%
\put(5,15){\vector(3,1){30}}
\put(5,15){\vector(3,-1){30}}
\put(5,25){\vector(3,0){30}}
\put(5,25){\vector(3,1){30}}
\put(20,43){\makebox(0,0){$\vdots$}}
%
%\put(5,35){\vector(3,0){30}}
%\put(5,35){\vector(3,1){30}}
\put(5,35){\vector(3,-1){30}}
%\put(5,45){\vector(3,0){30}}
\put(5,45){\vector(3,1){30}}
%\put(5,45){\vector(3,-1){30}}
\put(5,55){\vector(3,0){30}}
\put(5,55){\vector(3,1){30}}
\put(5,55){\vector(3,-1){30}}
\thicklines
\put(5,65){\vector(3,0){30}}
\put(5,65){\vector(3,1){30}}
\put(5,65){\vector(3,-1){30}}
\thinlines
\put(5,75){\vector(3,0){30}}
\put(5,75){\vector(3,1){30}}
\put(5,75){\vector(3,-1){30}}
\put(5,85){\vector(3,0){30}}
\put(5,85){\vector(3,1){30}}
\put(5,85){\vector(3,-1){30}}
\put(5,95){\vector(3,0){30}}
\put(5,95){\vector(3,1){30}}
\put(5,95){\vector(3,-1){30}}
\put(5,105){\vector(3,0){30}}
\put(5,105){\vector(3,1){30}}
\put(5,105){\vector(3,-1){30}}
\put(5,115){\vector(3,0){30}}
\put(5,115){\vector(3,1){30}}
\put(5,115){\vector(3,-1){30}}
\put(5,125){\vector(3,0){30}}
\put(5,125){\vector(3,-1){30}}
\put(5,5){\vector(1,4){30}}
\put(5,125){\vector(1,-4){30}}
%\put(4,35){\makebox(0,0)[r]{{\tt J}}}
%\put(36,35){\makebox(0,0)[l]{{\tt J}}}
%\put(4,45){\makebox(0,0)[r]{{\tt I}}}
%\put(36,45){\makebox(0,0)[l]{{\tt I}}}
\put(4,55){\makebox(0,0)[r]{{\tt H}}}
\put(36,55){\makebox(0,0)[l]{{\tt H}}}
\put(4,65){\makebox(0,0)[r]{{\tt G}}}
\put(36,65){\makebox(0,0)[l]{{\tt G}}}
\put(4,75){\makebox(0,0)[r]{{\tt F}}}
\put(36,75){\makebox(0,0)[l]{{\tt F}}}
\put(4,85){\makebox(0,0)[r]{{\tt E}}}
\put(36,85){\makebox(0,0)[l]{{\tt E}}}
\put(4,95){\makebox(0,0)[r]{{\tt D}}}
\put(36,95){\makebox(0,0)[l]{{\tt D}}}
\put(4,105){\makebox(0,0)[r]{{\tt C}}}
\put(36,105){\makebox(0,0)[l]{{\tt C}}}
\put(4,115){\makebox(0,0)[r]{{\tt B}}}
\put(36,115){\makebox(0,0)[l]{{\tt B}}}
\put(4,125){\makebox(0,0)[r]{{\tt A}}}
\put(36,125){\makebox(0,0)[l]{{\tt A}}}
\end{picture}
\end{array}
\hspace{1in}
\begin{array}{ccl} & \vdots & \\
P(y\eq {\tt F} \given x\eq {\tt G}) &=& 1/3 ; \\
P(y\eq {\tt G} \given x\eq {\tt G}) &=& 1/3 ; \\
P(y\eq {\tt H} \given x\eq {\tt G}) &=& 1/3 ; \\
& \vdots &
\end{array}
\hspace{1.2in}
\begin{array}{c}
\ecfig{type}
\end{array}
\]
\item[Z channel\puncspace]
\indexs{channel!Z channel}\indexs{Z channel}
$\A_X \eq \{{\tt 0},{\tt 1}\}$. $\A_Y \eq \{{\tt 0},{\tt 1}\}$.
\[
% \begin{array}{c}
% \setlength{\unitlength}{0.46mm}
% \begin{picture}(20,20)(0,0)
% \put(5,5){\vector(1,0){10}}
% \put(5,15){\vector(1,0){10}}
% \put(5,5){\vector(1,1){10}}
% \put(4,5){\makebox(0,0)[r]{1}}
% \put(4,15){\makebox(0,0)[r]{0}}
% \put(16,5){\makebox(0,0)[l]{1}}
% \put(16,15){\makebox(0,0)[l]{0}}
% \end{picture}
% \end{array}
\begin{array}{c}
\setlength{\unitlength}{0.46mm}
\begin{picture}(30,20)(-5,0)
\put(-4,9){{\makebox(0,0)[r]{$x$}}}
\put(5,2){\vector(1,0){10}}
\put(5,16){\vector(1,0){10}}
\put(5,4){\vector(1,1){10}}
% \put(5,14){\vector(1,-1){10}}
\put(4,2){\makebox(0,0)[r]{1}}
\put(4,16){\makebox(0,0)[r]{0}}
\put(16,2){\makebox(0,0)[l]{1}}
\put(16,16){\makebox(0,0)[l]{0}}
\put(24,9){{\makebox(0,0)[l]{$y$}}}
\end{picture}
\end{array}
\hspace{1in}
\begin{array}{ccl}
P(y\eq {\tt 0} \given x\eq {\tt 0}) &=& 1 ; \\
P(y\eq {\tt 1} \given x\eq {\tt 0}) &=& 0 ; \\
\end{array}
\begin{array}{ccl}
P(y\eq {\tt 0} \given x\eq {\tt 1}) &=& \q ; \\
P(y\eq {\tt 1} \given x\eq {\tt 1}) &=& 1- \q .\\
\end{array}
\hspace{1in}
%\:\:\:\:\:\:
\begin{array}{c}
\ecfig{z15.1}
\end{array}
\]
% {\em Check if this orientation of the channel disagrees
% with any demonstrations.}
\end{description}
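
To illustrate the matrix convention, here is a sketch of the transition
probability matrices of two of these channels. Rows are indexed by the
output and columns by the input, both in the order ${\tt 0},{\tt 1}$; this
ordering is a choice made for the illustration.
\[
 \bQ_{\rm BSC} = \left[ \begin{array}{cc} 1-\q & \q \\ \q & 1-\q \end{array} \right] ,
 \hspace{0.4in}
 \bQ_{\rm Z} = \left[ \begin{array}{cc} 1 & \q \\ 0 & 1-\q \end{array} \right] .
\]
Each column sums to one. Taking $\q \eq 0.15$ and input probabilities
$p_0 \eq 0.9$, $p_1 \eq 0.1$ purely as illustrative values, the output
distribution of the binary symmetric channel is
\[
 \bp_Y \:=\: \bQ_{\rm BSC} \, \bp_X \:=\:
 \left[ \begin{array}{cc} 0.85 & 0.15 \\ 0.15 & 0.85 \end{array} \right]
 \left[ \begin{array}{c} 0.9 \\ 0.1 \end{array} \right]
 \:=\:
 \left[ \begin{array}{c} 0.78 \\ 0.22 \end{array} \right] .
\]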
\section{Inferring the input given the output}
% was a subsection
% a single transmission}
If we assume that the input $x$ to a channel
comes from an ensemble $X$, then
we obtain a joint ensemble $XY$ in which the random variables $x$ and $y$
have the joint distribution:
\beq
P(x,y) = P(y \given x) P(x) .
\eeq
Now if we receive
a particular symbol $y$, what was the input symbol $x$?
We typically won't know for certain. We can write down the posterior
distribution of the input using \Bayes\ theorem:\index{Bayes' theorem}
\beq
P(x \given y) = \frac{ P(y \given x) P(x) }{P(y) }
= \frac{ P(y \given x) P(x) }{\sum_{x'} P(y \given x') P(x') } .
\eeq
\exampla{
%{\sf Example 1:}
Consider a \index{channel!binary symmetric}\ind{binary symmetric channel}
with probability of
error $\q\eq 0.15$. Let the input ensemble be $\P_X: \{p_0 \eq 0.9, p_1 \eq 0.1\}$.
Assume we observe $y\eq 1$.
\beqan
P(x\eq 1 \given y\eq 1) &=&\frac{ P(y\eq 1 \given x\eq 1) P(x\eq 1) }{\sum_{x'} P(y \given x') P(x') } \nonumber \\
&\eq & \frac{ 0.85 \times 0.1 }{ 0.85 \times 0.1 + 0.15 \times 0.9 } \nonumber \\
&=& \frac{ 0.085 }{ 0.22 } \:\:=\:\: 0.39 .
\eeqan
Thus `$x\eq 1$' is still less probable than `$x\eq 0$', although it is not
as improbable as it was before.
}
% Could turn this into an exercise.
% Alternatively, assume we observe $y\eq 0$.
% \beqa
% P(x\eq 1 \given y\eq 0) &=& \frac{ P(y\eq 0 \given x\eq 1) P(x\eq 1)}{\sum_{x'} P(y \given x') P(x')} \\
% &=& \frac{ 0.15 \times 0.1 }{ 0.15 \times 0.1 + 0.85 \times 0.9 } \\
% &=& \frac{ 0.015 }{0.78} = 0.019 .
% \eeqa
\exercissxA{1}{ex.bscy0}{
Now assume we observe $y\eq 0$.
Compute the probability of $x\eq 1$ given $y\eq 0$.
}
\exampla{
%{\sf Example 2:}
Consider a \ind{Z channel}\index{channel!Z channel} with probability of
error $\q\eq 0.15$. Let the input ensemble be $\P_X: \{p_0 \eq 0.9, p_1 \eq 0.1\}$.
Assume we observe $y\eq 1$.
\beqan
P(x\eq 1 \given y\eq 1)
&=& \frac{ 0.85 \times 0.1 }{ 0.85 \times 0.1 + 0 \times 0.9 }
\nonumber \\
&=& \frac{ 0.085}{0.085} \:\:=\:\: 1.0 .
\eeqan
So given the output $y\eq 1$ we become certain of the input.
}
% Alternatively, assume we observe $y\eq 0$.
% \beqa
% P(x\eq 1 \given y\eq 0)
% % &=& \frac{ P(y\eq 0 \given x\eq 1) P(x\eq 1)}{\sum_{x'} P(y \given x') P(x')} \\
% &=& \frac{ 0.15 \times 0.1 }{ 0.15 \times 0.1 + 1.0 \times 0.9 } \\
% &=& \frac{ 0.015}{ 0.915} = 0.016 .
% \eeqa
\exercissxA{1}{ex.zcy0}{
Alternatively, assume we observe $y\eq 0$. Compute $P(x\eq 1 \given y\eq 0)$.
}
\section{Information conveyed by a channel}
We now consider how much information can be communicated through
 a channel. In {\em operational\/} terms, we are interested in finding
 ways of using the channel such that all the bits that are communicated
 are recovered with negligible probability of error.
 In {\em mathematical\/} terms,
assuming a particular input ensemble $X$, we can measure how
much information the output conveys about the input by the mutual
information:
\beq
\I(X;Y) \equiv H(X) - H(X \given Y) = H(Y) - H(Y \specialgiven X) .
\eeq
Our aim is to establish the connection between these two ideas.
Let us evaluate $\I(X;Y)$ for some of the channels above.
\subsection{Hint for computing mutual information}
\index{hint for computing mutual information}\index{mutual information!how to compute}We
will tend to think of $\I(X;Y)$ as $H(X) - H(X \given Y)$, \ie, how much
the uncertainty of the input $X$ is reduced when we look at the output
$Y$. But for computational
purposes it is often handy to evaluate $H(Y) - H(Y \specialgiven X)$ instead.
%\medskip
% this reproduced from _p5A.tex, figure 9.1 {fig.entropy.breakdown}
\begin{figure}[htbp]
\figuremargin{%
\begin{center}
%
% included by l1.tex
%
\setlength{\unitlength}{1in}
\begin{picture}(3,1.13)(0,-0.2)
\put(0,0.7){\framebox(3,0.20){$H(X,Y)$}}
\put(0,0.4){\framebox(2.2,0.20){$H(X)$}}
\put(1.5,0.1){\framebox(1.5,0.20){$H(Y)$}}
\put(1.5125,-0.2){\framebox(0.675,0.20){$\I(X;Y)$}}
\put(0,-0.2){\framebox(1.475,0.20){$H(X \given Y)$}}
\put(2.225,-0.2){\framebox(0.775,0.20){$H(Y \specialgiven X)$}}
\end{picture}
\end{center}
}{%
\caption[a]{The relationship between joint entropy,
 marginal entropy, conditional entropy and mutual information.
 This figure is important, so I'm showing it twice.}
\label{fig.entropy.breakdown.again}
}%
\end{figure}
%\begin{center}
%\input{tex/entropyfig.tex}
%\end{center}
%\noindent
\exampla{
%{\sf Example 1:}
Consider the
\index{channel!binary symmetric}\index{binary symmetric channel}\BSC\
again, with $\q\eq 0.15$ and
$\P_X: \{p_0 \eq 0.9, p_1 \eq 0.1\}$. We already evaluated the
marginal probabilities $P(y)$ implicitly above: $P(y\eq 0) = 0.78$;
$P(y\eq 1) = 0.22$. The mutual information is:
\beqa
\I(X;Y) &=& H(Y) - H(Y \specialgiven X) .
\eeqa
What is $H(Y \specialgiven X)$?
It is defined to be the weighted sum over $x$ of $H(Y \given x)$; but
$H(Y \given x)$ is the same for each value of $x$:
$H(Y \given x\eq{\tt{0}})$ is $H_2(0.15)$,
and $H(Y \given x\eq{\tt{1}})$ is $H_2(0.15)$. So
\beqan
\I(X;Y) &=& H(Y) - H(Y \specialgiven X) \nonumber \\
&=& H_2(0.22) - H_2(0.15) \nonumber \\
& =& 0.76 - 0.61 \:\: = \:\: 0.15 \mbox{ bits}.
\eeqan
% this used to be in error (0.15)
This may be contrasted with the entropy of the source $H(X) = H_2(0.1) =
0.47$ bits.
Note: here we have used the binary entropy function $H_2(p)
\equiv H(p,1\!-\!p)=p \log \frac{1}{p}
+ (1-p)\log \frac{1}{(1-p)}$.\marginpar{\small\raggedright{Throughout this book, $\log$ means $\log_2$.}}
}
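
For readers who want to reproduce the two numerical values used in this
example, here is the arithmetic written out, with logarithms to base 2 and
intermediate values rounded:
\[
\begin{array}{rcl}
 H_2(0.15) &=& 0.15 \log_2 \dfrac{1}{0.15} + 0.85 \log_2 \dfrac{1}{0.85}
 \:\:\simeq\:\: 0.15 \times 2.74 + 0.85 \times 0.234 \:\:\simeq\:\: 0.61 , \\[0.1in]
 H_2(0.22) &=& 0.22 \log_2 \dfrac{1}{0.22} + 0.78 \log_2 \dfrac{1}{0.78}
 \:\:\simeq\:\: 0.22 \times 2.18 + 0.78 \times 0.358 \:\:\simeq\:\: 0.76 .
\end{array}
\]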
%\medskip
% \noindent
\exampla{
% {\sf Example~2:}
 And now the \ind{Z channel}\index{channel!Z channel}, with $\q\eq 0.15$ and $\P_X$ as above.
% $P(y\eq 0)\eq 0.915;
$P(y\eq 1)\eq 0.085$.
\beqan
\I(X;Y) &=& H(Y) - H(Y \specialgiven X) \nonumber \\
&=& H_2(0.085) - [ 0.9 H_2(0) + 0.1 H_2(0.15) ] \nonumber \\
&=& 0.42 - ( 0.1 \times 0.61 )
= 0.36 \mbox{ bits}.
\eeqan
The entropy of the source, as above, is $H(X) = 0.47$ bits. Notice
that the mutual information $\I(X;Y)$ for the Z channel is bigger than
the mutual information for the binary symmetric channel with the
same $\q$. The Z channel is a more reliable
channel.
% is fits with our intuition that the
}
\exercissxA{1}{ex.bscMI}{Compute the mutual information between $X$ and $Y$
for the \BSC\ with $\q\eq 0.15$ when the input
distribution is $\P_X = \{p_0 \eq 0.5, p_1 \eq 0.5\}$.
}
\exercissxA{2}{ex.zcMI}{Compute the mutual information between $X$ and $Y$
for the Z channel with $\q=0.15$ when the input
distribution is $\P_X: \{p_0 \eq 0.5, p_1 \eq 0.5\}$.
}
\subsection{Maximizing the mutual information}
We have observed in the above examples that
the mutual information between the input and
the output depends on the chosen
{input ensemble}\index{channel!input ensemble}.
Let us assume that we wish to maximize the mutual information
conveyed by the channel by choosing
the best possible input ensemble.
We define the {\dbf\inds{capacity}\/} of the
channel\index{channel!capacity}
to be its maximum \ind{mutual information}.
\begin{description}
\item[The capacity] of a channel $Q$ is:
\beq
C(Q) = \max_{\P_X} \, \I(X;Y) .
\eeq
The distribution $\P_X$ that achieves the maximum is called the
{\dem{\optens}},\indexs{optimal input distribution}
denoted by $\P_X^*$. [There may be multiple
{\optens}s achieving the same value of $\I(X;Y)$.]
\end{description}
%
In \chref{ch6} we will
show that the capacity does indeed measure the maximum amount
of error-free information that can be transmitted
% is transmittable % yes, spell checked
over the channel per unit time.
% \medskip
% Sun 22/8/04 am having problems trying to get fig 9.2 to go at head
% of p 151 - putting it there causes text to move.
%\noindent
\exampla{
%{\sf Example 1:}
Consider the \BSC\ with $\q \eq 0.15$. Above, we considered
$\P_X = \{p_0 \eq 0.9, p_1 \eq 0.1\}$, and found
$\I(X;Y) = 0.15$ bits.
% the page likes to break here
How much better can we do? By symmetry,
the \optens\ is
$\{ 0.5, 0.5\}$ and%
\amarginfig{t}{
\mbox{%
%\begin{figure}[htbp]
\small
%\floatingmargin{%
%\figuremargin{%
\raisebox{0.91in}{$\I(X;Y)$}%
\hspace{-0.42in}%
\begin{tabular}{c}
\mbox{\psfig{figure=figs/IXY.15.ps,%
width=45mm,angle=-90}}\\[-0.1in]
$p_1$
\end{tabular}
}
%}{%
\caption[a]{The mutual information $\I(X;Y)$ for a binary symmetric
channel with $\q=0.15$
as a function of the input distribution.
% (\eqref{eq.IXYBSC}).
}
\label{fig.IXYBSC}
}
%%%
the capacity is
\beq
C(Q_{\rm BSC}) \:=\: H_2(0.5) - H_2(0.15) \:=\: 1.0 - 0.61 \:=\: 0.39
\ubits.
\eeq
We'll justify the \ind{symmetry argument}\index{capacity!symmetry argument}
later.
If there's any doubt about the
% such a
symmetry argument,
we can always resort to explicit maximization of
 the \ind{mutual information} $\I(X;Y)$,
\beq
 \I(X;Y) = H_2( (1\!-\!\q)p_1 + (1\!-\!p_1)\q ) - H_2(\q) \ \ \mbox{ (\figref{fig.IXYBSC}). }
\label{eq.IXYBSC}
\eeq
}
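
The explicit maximization is short in this case; the following is a sketch.
The term $H_2(\q)$ does not depend on $p_1$, and $H_2$ attains its maximum
value of 1 bit when its argument is $1/2$, so we require
\[
 (1-\q) p_1 + (1-p_1) \q = \frac{1}{2}
 \hspace{0.3in} \Rightarrow \hspace{0.3in}
 p_1 ( 1 - 2 \q ) = \frac{1}{2} - \q ,
\]
which gives $p_1 = 1/2$ whenever $\q \neq 1/2$. [If $\q = 1/2$, the output
is equally likely to be {\tt 0} or {\tt 1} whatever the input distribution,
and the mutual information is zero for every $p_1$.]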
% \medskip
% \noindent
% {\sf Example 2:}
\exampl{exa.typewriter}{
The noisy typewriter.
The \optens\ is a uniform distribution over $x$, and gives
$C = \log_2 9$ bits.
}
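
One way to see this is to use the form $H(Y) - H(Y \specialgiven X)$; the
following is a sketch. Every input produces one of three outputs
with probability $1/3$ each, so $H(Y \given x) = \log_2 3$ for every $x$, and
hence $H(Y \specialgiven X) = \log_2 3$ whatever the input distribution.
A uniform input distribution makes the output uniform over the 27 letters,
which is the largest value that $H(Y)$ can take, so
\[
 C \:=\: \log_2 27 - \log_2 3 \:=\: \log_2 9 \:\simeq\: 3.17 \mbox{ bits} .
\]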
% \medskip
% \noindent
\exampl{exa.Z.HXY}{
% {\sf Example 3:}
Consider the \ind{Z channel} with $\q \eq 0.15$.
Identifying the \optens\ is not so straightforward. We
evaluate $\I(X;Y)$ explicitly for
$\P_X = \{p_0, p_1\}$. First, we need to compute $P(y)$. The probability
of $y\eq 1$ is easiest to write down:
\beq
P(y\eq 1) \:\:=\:\: p_1 (1-\q) .
\eeq
Then%
\amarginfig{t}{
%\begin{figure}[htbp]
\mbox{%
\small
%\floatingmargin{%
%\figuremargin{%
\raisebox{0.91in}{$\I(X;Y)$}%
\hspace{-0.42in}%
\begin{tabular}{c}
\mbox{\psfig{figure=figs/HXY.ps,%
width=45mm,angle=-90}}\\[-0.1in]
$p_1$
\end{tabular}
}
%}{%
\caption{The mutual information $\I(X;Y)$ for a Z
channel with $\q=0.15$
as a function of the input distribution.}
\label{hxyz}
}
%\end{figure}
%%%%%%%%%%%%% old:
%\begin{figure}[htbp]
%\small
%\begin{center}
%\raisebox{1.3in}{$\I(X;Y)$}%
%\hspace{-0.2in}%
%\begin{tabular}{c}
%\mbox{\psfig{figure=figs/HXY.ps,%
%width=60mm,angle=-90}}\\
%$p_1$
%\end{tabular}
%\end{center}
%\caption[a]{The mutual information $\I(X;Y)$ for a Z channel with $\q=0.15$
% as a function of the input distribution.}
%% (Horizontal axis $=p_1$.)}
%\label{hxyz.old}
%\end{figure}
the mutual information is:
\beqan
\I(X;Y) &=& H(Y) - H(Y \specialgiven X) \nonumber \\
&=& H_2(p_1 (1-\q)) - ( p_0 H_2(0) + p_1 H_2(\q) ) \nonumber \\
&=& H_2(p_1 (1-\q)) - p_1 H_2(\q) .
\eeqan
This is a non-trivial function of $p_1$, shown in \figref{hxyz}.
It is maximized for $\q=0.15$ by
% the \optens\
$p_1^* = 0.445$.
 We find $C(Q_{\rm Z}) = 0.685$ bits. Notice
% that
the \optens\ is not
$\{ 0.5,0.5 \}$. We can communicate slightly more information
by using input symbol {\tt{0}} more frequently than {\tt{1}}.
}
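
For readers who want to check these numbers, here is a sketch of the
maximization. Write $u \equiv p_1 (1-\q)$ for $P(y\eq 1)$; the symbol $u$
is introduced only for this aside. Measuring entropy in bits, the derivative
of $H_2(u)$ with respect to $u$ is $\log_2 [ (1-u)/u ]$, so setting the
derivative of the mutual information with respect to $p_1$ to zero gives
\[
 (1-\q) \log_2 \frac{1-u}{u} - H_2(\q) = 0
 \hspace{0.3in} \Rightarrow \hspace{0.3in}
 u^* = \frac{1}{1 + 2^{H_2(\q)/(1-\q)}} ,
 \hspace{0.2in}
 p_1^* = \frac{u^*}{1-\q} .
\]
With $\q \eq 0.15$ we have $H_2(\q)/(1-\q) \simeq 0.717$ and
$2^{0.717} \simeq 1.64$, so $u^* \simeq 0.378$ and
$p_1^* \simeq 0.378/0.85 \simeq 0.445$. Substituting back into
$H_2(u^*) - p_1^* H_2(\q)$ gives approximately $0.685$ bits, in agreement
with the values quoted in the example.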
%\noindent {\sf Exercise b:}
\exercissxA{1}{ex.bscC}{
What is the capacity of the \ind{binary symmetric channel} for general $\q$?\index{channel!binary symmetric}
}
\exercissxA{2}{ex.becC}{
Show that the capacity of the \ind{binary erasure channel}\index{channel!binary erasure} with $\q=0.15$
is $C_{\rm BEC} = 0.85$. What is its capacity for general $\q$?
Comment.
}
% \bibliography{/home/mackay/bibs/bibs}
%\section{The Noisy Channel Coding Theorem}
\section{The noisy-channel coding theorem}
It seems plausible that the `capacity' we have defined may be
a measure of information conveyed by a channel; what is not obvious,
and what we will prove in the next chapter, is that the \ind{capacity} indeed
measures the rate at which blocks of data can be communicated over the channel
{\em with arbitrarily small probability of error}.
We make the following definitions.\label{sec.whereCWMdefined}
\begin{description}
\item[An $(N,K)$ {block code}] for\indexs{error-correcting code!block code}
a channel $Q$ is a list of $\cwM=2^K$
codewords
 $$\{ \bx^{(1)}, \bx^{(2)}, \ldots, \bx^{(2^K)} \}, \:\:\:\:\:\bx^{(\cwm)} \in \A_X^N ,$$
each of length $N$.
Using this code we can encode a signal $\cwm \in \{ 1,2,3,\ldots, 2^K\}$
% The signal to be encoded is assumed to come from an
% alphabet of size $2^K$; signal $m$ is encoded
as $\bx^{(\cwm)}$. [The number of codewords $\cwM$ is an integer,
but the number of bits specified by choosing a codeword, $K \equiv \log_2 \cwM$,
is not necessarily an integer.]
The {\dbf \inds{rate}\/} of\index{error-correcting code!rate}
the code is $R = K/N$ bits per channel use.
% character.
[We will use this definition of the rate for any channel, not only channels with binary inputs;
note however that it is sometimes conventional to define the rate of a code for a channel
with $q$ input symbols to be $K/(N\log q)$.]
% \item[A linear $(N,K)$ block code] is a block code in which all
% moved into leftovers.tex
\item[A \ind{decoder}] for an $(N,K)$ block code is a mapping from
the set of length-$N$ strings of channel outputs, $\A_Y^N$, to
a codeword label $\hat{\cwm} \in \{ 0 , 1 , 2 , \ldots, 2^K \}$.
The extra symbol $\hat{\cwm} \eq 0$ can be used to indicate a `failure'.
\item[The \ind{probability of block error}\index{error probability!block}]
% $p_B$
of a code and decoder, for a given channel, and for a given probability
distribution over the encoded signal $P(\cwm_{\rm in})$,
is:
\beq
p_{\rm B} = \sum_{\cwm_{\rm in}} P( \cwm_{\rm in} )
P( \cwm_{\rm out} \! \not = \! \cwm_{\rm in} \given \cwm_{\rm in} )
.
\eeq
% the probability
% that the decoded signal $\cwm_{\rm out}$ is not equal to $\cwm_{\rm in}$.
\item[The maximal probability of block error] is
\beq
p_{\rm BM} = \max_{\cwm_{\rm in}} P( \cwm_{\rm out} \! \not = \!
\cwm_{\rm in} \given \cwm_{\rm in} )
.
\eeq
\item[The \ind{optimal decoder}] for a channel code is the one that minimizes
the probability of block error. It decodes an output $\by$ as
the input $\cwm$ that has maximum \ind{posterior probability} $P(\cwm \given \by)$.
\beq
P(\cwm \given \by) =
\frac{ P(\by \given \cwm ) P(\cwm) } { \sum_{\cwm' } P(\by \given \cwm') P(\cwm') }
\eeq
\beq
\hat{\cwm}_{\rm optimal} = \argmax
% _{\cwm} % did not appear underneath
P(\cwm \given \by) .
\eeq
A uniform prior distribution on $\cwm$ is usually assumed, in which case the
optimal decoder is also the {\dem \ind{maximum likelihood decoder}},
\ie, the decoder that maps an output $\by$ to
the input $\cwm$ that has maximum {\dem \ind{likelihood}} $P(\by \given \cwm )$.
\item[The probability of bit error] $p_{\rm b}$ is defined assuming that
the codeword number
$\cwm$ is represented by a binary vector $\bs$ of length $K$ bits;
it is the average probability
that a bit of $\bs_{\rm out}$ is not equal to the corresponding
bit of $\bs_{\rm in}$ (averaging over all $K$ bits).
\item[Shannon's\index{Shannon, Claude}
\ind{noisy-channel coding theorem} (part one)\puncspace]
%\begin{quote}
Associated with each discrete memoryless channel,
\marginfig{
\begin{center}
\setlength{\unitlength}{2pt}
\begin{picture}(60,45)(-2.5,-7)
\thinlines
\put(0,0){\vector(1,0){60}}
\put(0,0){\vector(0,1){40}}
\put(30,-3){\makebox(0,0)[t]{$C$}}
\put(55,-2){\makebox(0,0)[t]{$R$}}
\put(-1,35){\makebox(0,0)[r]{$p_{\rm BM}$}}
\thicklines
\put(0,0){\dashbox{3}(30,30){achievable}}
% \put(0,0){\line(0,1){50}}
%
\end{picture}
\end{center}
\caption[a]{Portion of the $R,p_{\rm BM}$ plane asserted to
be
 achievable by the first part of Shannon's
 noisy-channel coding theorem.}
\label{fig.belowCthm}
}%end marginfig
there is a
non-negative number $C$ (called the channel capacity) with the following
property. For any $\epsilon > 0$ and $R < C$, for large enough $N$,
there exists a block code of length $N$ and rate $\geq R$ and a decoding
algorithm, such that the maximal probability of block error is
$< \epsilon$.
%\end{quote}
% \item[The negative part of the theorem\puncspace] moved to graveyard.tex Sun 3/2/02
\end{description}
\begin{figure}[htbp]
\figuremargin{%
\[
\begin{array}{c}
\setlength{\unitlength}{1pt}
\begin{picture}(48,120)(0,5)
\thinlines
%\put(5,5){\vector(3,0){30}}
%\put(5,25){\vector(3,0){30}}
\put(5,15){\vector(3,0){30}}
%\put(5,5){\vector(3,1){30}}
%\put(5,25){\vector(3,-1){30}}
% \put(4,5){\makebox(0,0)[r]{{\tt -}}}
\put(4,15){\makebox(0,0)[r]{{\tt Z}}}
% \put(4,25){\makebox(0,0)[r]{{\tt Y}}}
\put(36,5){\makebox(0,0)[l]{{\tt -}}}
\put(36,15){\makebox(0,0)[l]{{\tt Z}}}
\put(36,25){\makebox(0,0)[l]{{\tt Y}}}
%
\put(5,15){\vector(3,1){30}}
\put(5,15){\vector(3,-1){30}}
%\put(5,25){\vector(3,0){30}}
%\put(5,25){\vector(3,1){30}}
\put(20,40){\makebox(0,0){$\vdots$}}
%
%\put(5,35){\vector(3,0){30}}
%\put(5,35){\vector(3,1){30}}
% \put(5,35){\vector(3,-1){30}}
%\put(5,45){\vector(3,0){30}}
% \put(5,45){\vector(3,1){30}}
%\put(5,45){\vector(3,-1){30}}
\put(5,55){\vector(3,0){30}}
\put(5,55){\vector(3,1){30}}
\put(5,55){\vector(3,-1){30}}
% \thicklines
% \put(5,65){\vector(3,0){30}}
% \put(5,65){\vector(3,1){30}}
% \put(5,65){\vector(3,-1){30}}
% \thinlines
% \put(5,75){\vector(3,0){30}}
% \put(5,75){\vector(3,1){30}}
% \put(5,75){\vector(3,-1){30}}
\put(5,85){\vector(3,0){30}}
\put(5,85){\vector(3,1){30}}
\put(5,85){\vector(3,-1){30}}
% \put(5,95){\vector(3,0){30}}
% \put(5,95){\vector(3,1){30}}
% \put(5,95){\vector(3,-1){30}}
%\put(5,105){\vector(3,0){30}}
%\put(5,105){\vector(3,1){30}}
%\put(5,105){\vector(3,-1){30}}
\put(5,115){\vector(3,0){30}}
\put(5,115){\vector(3,1){30}}
\put(5,115){\vector(3,-1){30}}
%\put(5,125){\vector(3,0){30}}
%\put(5,125){\vector(3,-1){30}}
%
%\put(5,5){\vector(1,4){30}}
%\put(5,125){\vector(1,-4){30}}
\put(36,45){\makebox(0,0)[l]{{\tt I}}}
\put(4,55){\makebox(0,0)[r]{{\tt H}}}
\put(36,55){\makebox(0,0)[l]{{\tt H}}}
% \put(4,65){\makebox(0,0)[r]{{\tt G}}}
\put(36,65){\makebox(0,0)[l]{{\tt G}}}
% \put(4,75){\makebox(0,0)[r]{{\tt F}}}
\put(36,75){\makebox(0,0)[l]{{\tt F}}}
\put(4,85){\makebox(0,0)[r]{{\tt E}}}
\put(36,85){\makebox(0,0)[l]{{\tt E}}}
% \put(4,95){\makebox(0,0)[r]{{\tt D}}}
\put(36,95){\makebox(0,0)[l]{{\tt D}}}
% \put(4,105){\makebox(0,0)[r]{{\tt C}}}
\put(36,105){\makebox(0,0)[l]{{\tt C}}}
\put(4,115){\makebox(0,0)[r]{{\tt B}}}
\put(36,115){\makebox(0,0)[l]{{\tt B}}}
% \put(4,125){\makebox(0,0)[r]{{\tt A}}}
\put(36,125){\makebox(0,0)[l]{{\tt A}}}
\end{picture}
\end{array}
\hspace{1.5in}
\begin{array}{c}
% roughly 8pts from col to col
\setlength{\unitlength}{1.005pt}% this was 1pt in jan 2000, I tweaked it
\begin{picture}(50,110)(-5,-5)
\thinlines
\put(-5,-5){\ecfig{type}}
\multiput(7.95,-3)(12,0){9}{\framebox(4,126){}}
%\put(2.5,97){\makebox(0,0)[bl]{\small$\bx^{(1)}$}}
%\put(26.5,97){\makebox(0,0)[bl]{\small$\bx^{(2)}$}}
%
\end{picture}
\end{array}
\]
}{%
\caption[a]{A non-confusable subset of inputs for the noisy
typewriter.}
\label{fig.typenine}
}
\end{figure}
\subsection{Confirmation of the theorem for the noisy typewriter channel}
In the case of the \ind{noisy typewriter}\index{channel!noisy typewriter},
we can easily confirm the
% positive part of the
theorem,
% For this channel,
because we can create a
% n {\em error-free\/}
completely error-free
communication strategy using a block code of length $N =1$:
we use only the letters {\tt B}, {\tt E}, {\tt H},
\ldots, {\tt Z},
\ie, every third letter. These letters form a {\dem non-confusable subset\/}\index{non-confusable inputs}
of the input
alphabet (see \figref{fig.typenine}). Any output can be uniquely decoded. The number of
inputs in the non-confusable subset is 9, so the error-free information
rate of this system is $\log_2 9$ bits, which is equal to the capacity $C$,
which we evaluated in \exampleref{exa.typewriter}.
%
How does this translate into the terms of the theorem?
The following table explains.\medskip
%\begin{center}
\begin{raggedright}
\noindent
% THIS TABLE IS DELIBERATELY FULL WIDTH
% for textwidth, use this
% \begin{tabular}{p{2.2in}p{2.5in}}
\begin{tabular}{@{}p{2.7in}p{4.1in}@{}}
\multicolumn{1}{@{}l}{\sf The theorem} &
\multicolumn{1}{l}{\sf How it applies to the noisy typewriter } \\ \midrule
\raggedright\em Associated with each discrete memoryless channel, there is a
non-negative number $C$.
% (called the channel capacity).
&
The capacity $C$ is $\log_2 9$.
\\[0.047in]
\raggedright\em For any $\epsilon > 0$ and $R < C$, for large enough $N$,
&
% Assume we are given an $R0$.
No matter what $\epsilon$ and $R$ are, we set the blocklength $N$ to
1.
\\[0.047in]
\raggedright\em there exists a block code of length $N$ and rate $\geq R$
& The block code is
% can be the following list of nine codewords:
$\{{\tt B,E,\ldots,Z}\}$. The value of
$K$ is given by $2^K = 9$, so $K=\log_2 9$, and this code has rate
$\log_2 9$, which is greater than the requested value of $R$.
\\[0.047in]
\raggedright\em and a decoding
algorithm,
&
The decoding algorithm maps the received
letter to the nearest letter in the code;
\\[0.047in]
\raggedright\em
such that the maximal probability of block error is
$< \epsilon$.
&
the maximal probability of block error is zero, which
is
less than the given $\epsilon$.
\\
\end{tabular}
\end{raggedright}
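
\medskip
\noindent
To make the decoding rule concrete, here is a sketch of the partition of the
output alphabet that it induces. Each codeword can produce only itself and
its two neighbours, so the sets of possible outputs are
\[
\begin{array}{ccc}
 \mbox{codeword} & & \mbox{possible outputs} \\[0.03in]
 {\tt B} & \rightarrow & \{ {\tt A}, {\tt B}, {\tt C} \} \\
 {\tt E} & \rightarrow & \{ {\tt D}, {\tt E}, {\tt F} \} \\
 & \vdots & \\
 {\tt Z} & \rightarrow & \{ {\tt Y}, {\tt Z}, {\tt -} \}
\end{array}
\]
These nine triples are disjoint and together cover all 27 output letters,
so every received letter identifies exactly one codeword.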
%\end{center}
% is greater than or equal
% to 1
% source RUNME
\section{Intuitive preview of proof}
\subsection{Extended channels}
To prove the theorem for any given channel, we consider the
{\dem \ind{extended channel}\index{channel!extended}}
corresponding to $N$ uses of the
% original
channel.
The extended channel has
$|\A_X|^N$ possible inputs $\bx$ and
$|\A_Y|^N$ possible outputs.
% {\em add a picture of extended channel here.}
%
\begin{figure}
\figuremargin{%
\small\begin{center}
\begin{tabular}{cccc}
%$\bQ$
& \ecfig{bsc15.1}
& \ecfig{bsc15.2}
& \ecfig{bsc15.4}
\\
& $N=1$
& $N=2$ & $N=4$ \\
\end{tabular}
\end{center}
}{%
\caption{Extended channels obtained from a binary symmetric channel
with transition probability 0.15.}
\label{fig.extended.bsc15}
}
\end{figure}
%
\begin{figure}
\figuremargin{%
\small\begin{center}
\begin{tabular}{cccc}
%$\bQ$
& \ecfig{z15.1}
& \ecfig{z15.2}
& \ecfig{z15.4}
\\
& $N=1$
& $N=2$ & $N=4$ \\
\end{tabular}
\end{center}
}{%
\caption{Extended channels obtained from a Z channel
with transition probability 0.15. Each column corresponds to an input,
and each row is a different output.}
\label{fig.extended.z15}
}
\end{figure}
%
%
% these figures made using
% cd itp/extended
Extended channels obtained from a \BSC\ and from
a Z channel are shown in figures \ref{fig.extended.bsc15}
and \ref{fig.extended.z15}, with $N=2$ and $N=4$.
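
For the \BSC, the $N=2$ matrix can be written out explicitly. The following
is a sketch in which both inputs and outputs are ordered
${\tt 00},{\tt 01},{\tt 10},{\tt 11}$; this ordering is an assumption made
for the illustration and need not match the ordering used in the figures.
Because the channel is memoryless, each entry is the product of two
single-use transition probabilities:
\[
 \bQ^{(N=2)}_{\rm BSC} = \left[ \begin{array}{cccc}
 (1-\q)^2 & \q(1-\q) & \q(1-\q) & \q^2 \\
 \q(1-\q) & (1-\q)^2 & \q^2 & \q(1-\q) \\
 \q(1-\q) & \q^2 & (1-\q)^2 & \q(1-\q) \\
 \q^2 & \q(1-\q) & \q(1-\q) & (1-\q)^2
 \end{array} \right] .
\]
With $\q \eq 0.15$ the three distinct entries are $(1-\q)^2 = 0.7225$,
$\q(1-\q) = 0.1275$ and $\q^2 = 0.0225$, and each column sums to
$0.7225 + 2 \times 0.1275 + 0.0225 = 1$.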
\exercissxA{2}{ex.extended}{
 Find the transition probability matrix $\bQ$ for
the extended channel, with $N=2$, derived from
the binary erasure channel having erasure probability 0.15.
%\item the extended channel with $N=2$ derived from
% the ternary confusion channel,
By selecting two columns of this transition probability matrix,
% that have minimal overlap,
we can define a rate-\dhalf\ code for this channel with blocklength $N=2$.
What is the best choice of two columns? What is the decoding
algorithm?
}
To prove the noisy-channel coding theorem, we
make use of large blocklengths $N$.
The intuitive idea is that, if $N$ is large, {\em
an extended channel looks a lot like the noisy typewriter.}
Any particular input $\bx$ is very likely to produce an output
in a small subspace of the output alphabet -- the typical output
set, given that input.
So we can find a non-confusable subset of the inputs that produce
essentially disjoint output sequences.
%
% add something like:
% Remember what we learnt
% in chapter \ref{ch2}:
%
For a given $N$, let us consider a way of generating such a
non-confusable subset of the inputs, and count up how many distinct
inputs it contains.
Imagine making an input sequence $\bx$ for the extended channel by
drawing it from an ensemble $X^N$, where $X$ is an arbitrary ensemble
over the input alphabet. Recall the source coding theorem of
\chapterref{ch.two}, and consider the number of probable output sequences
$\by$. The total number of typical output sequences $\by$
% , when $\bx$ comes from the ensemble $X^N$,
is $2^{N H(Y)}$, all having similar
probability. For any particular typical input sequence $\bx$, there
 are about $2^{N H(Y \specialgiven X)}$ probable output sequences $\by$. Some of these subsets of
$\A_Y^N$ are depicted by circles in figure \ref{fig.ncct.typs}a.
\begin{figure}%[htbp]
\small
\figuremargin{%
\begin{center}
\hspace*{-1mm}\begin{tabular}{cc}
\framebox{
\setlength{\unitlength}{0.69mm}%was 0.8mm
\begin{picture}(80,80)(0,0)
\put(0,80){\makebox(0,0)[tl]{$\A_Y^N$}}
\thicklines
\put(40,40){\oval(50,50)}
\thinlines
\put(40,67){\makebox(0,0)[b]{Typical $\by$}}
\put(30,50){\circle{12.5}}
\put(50,40){\circle{12.5}}
\put(35,52){\circle{12.5}}
\put(58,33){\circle{12.5}}
\put(33,40){\circle{12.5}}
\put(35,45){\circle{12.5}}
\put(50,30){\circle{12.5}}
\put(40,50){\circle{12.5}}
\put(52,35){\circle{12.5}}
\put(33,58){\circle{12.5}}
\put(40,33){\circle{12.5}}
\put(45,35){\circle{12.5}}
\put(50,50){\circle{12.5}}
\put(23,55){\circle{12.5}}
\put(24,45){\circle{12.5}}
\put(27,57){\circle{12.5}}
\put(25,40){\circle{12.5}}
\put(55,42){\circle{12.5}}
\put(55,52){\circle{12.5}}
\put(58,53){\circle{12.5}}
\put(53,40){\circle{12.5}}
\put(35,22){\circle{12.5}}
\put(27,30){\circle{12.5}}
\put(40,24){\circle{12.5}}
\put(40,39){\circle{12.5}}
\put(46,43){\circle{12.5}}
\put(55,40){\circle{12.5}}
\put(40,55){\circle{12.5}}
\put(52,23){\circle{12.5}}
\put(50,26){\circle{12.5}}
\put(40,54){\circle{12.5}}
\put(52,55){\circle{12.5}}
\put(33,28){\circle{12.5}}
\put(57,33){\circle{12.5}}
\put(25,35){\circle{12.5}}
\put(55,25){\circle{12.5}}
\put(25,26){\circle{12.5}}
\multiput(23,22)(13,0){3}{\circle{12.5}}
\multiput(30,34)(13,0){3}{\circle{12.5}}
\multiput(23,46)(13,0){3}{\circle{12.5}}
\multiput(30,58)(13,0){3}{\circle{12.5}}
\thicklines
\put(23,30){\circle{12.5}}
\put(21,11){\vector(0,1){13}}
\put(8,6){\makebox(0,0)[l]{ Typical $\by$ for a given typical $\bx$}}
\end{picture}
}
&
\framebox{
\setlength{\unitlength}{0.69mm}
\begin{picture}(80,80)(0,0)
\put(0,80){\makebox(0,0)[tl]{$\A_Y^N$}}
\thicklines
\put(40,40){\oval(50,50)}
\thinlines
\put(40,67){\makebox(0,0)[b]{Typical $\by$}}
% \thicklines
\multiput(23,22)(13,0){3}{\circle{12.5}}
\multiput(30,34)(13,0){3}{\circle{12.5}}
\multiput(23,46)(13,0){3}{\circle{12.5}}
\multiput(30,58)(13,0){3}{\circle{12.5}}
%\put(30,34){\circle{12.5}}
%\put(43,34){\circle{12.5}}
%\put(56,34){\circle{12.5}}
%\put(23,45){\circle{12.5}}
%\put(36,45){\circle{12.5}}
%\put(49,45){\circle{12.5}}
%\put(30,56){\circle{12.5}}
%\put(43,56){\circle{12.5}}
%\put(56,56){\circle{12.5}}
\end{picture}
}\\
(a)&(b) \\
\end{tabular}
\end{center}
}{%
\caption[a]{(a) Some typical outputs
in $\A_Y^N$ corresponding
to typical inputs $\bx$.
(b) A subset of the \ind{typical set}s shown in
(a) that do not overlap each other. This picture can be
compared with the
solution to the \ind{noisy typewriter} in \figref{fig.typenine}.}
\label{fig.ncct.typs}
\label{fig.ncct.typs.no.overlap}
}
\end{figure}
We now imagine restricting ourselves to a subset of the typical\index{typical set!for noisy channel}
inputs $\bx$ such that the corresponding typical output sets do not overlap,
as shown in \figref{fig.ncct.typs.no.overlap}b.
We can then bound the number of non-confusable inputs by dividing the size
of the typical $\by$ set,
$2^{N H(Y)}$, by the size of each typical-$\by$-given-typical-$\bx$
set, $2^{N H(Y \specialgiven X)}$. So the number of non-confusable inputs,
if they are selected from the set of typical inputs $\bx \sim X^N$,
is $\leq 2^{N H(Y) - N H(Y \specialgiven X)} = 2^{N \I(X;Y)}$.
% \begin{figure}
% \begin{center}
% \framebox{
% \setlength{\unitlength}{0.8mm}
% }
% \end{center}
% \caption[a]{A subset of the typical sets shown in
% \protect\figref{fig.ncct.typs} that do not overlap.}
% \label{fig.ncct.typs.no.overlap}
% \end{figure}
The maximum value of
this bound is achieved if $X$ is the ensemble that
maximizes $\I(X;Y)$, in which case the number of non-confusable inputs
is $\leq 2^{NC}$. Thus asymptotically
up to $C$ bits per cycle,
and no more, can be communicated with vanishing error probability.\ENDproof
This sketch has not rigorously proved that reliable communication really
is possible --
that's our task for the next chapter.
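 As a rough numerical illustration of this counting argument
 (the blocklength $N=100$ is an arbitrary choice made here for concreteness),
 take the \BSC\ with $\q=0.15$ and a uniform input ensemble.
 Then $H(Y)=1$ bit and $H(Y \given X)=H_2(0.15)\simeq 0.61$ bits, so
\beq
	\frac{2^{N H(Y)}}{2^{N H(Y \given X)}} \simeq \frac{2^{100}}{2^{61}} = 2^{39} ,
\eeq
 that is, roughly $2^{39}$ non-confusable inputs of length 100,
 or about $0.39$ bits per channel use.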
\section{Further exercises}
% \noindent
%
\exercissxA{3}{ex.zcdiscuss}{
Refer back to the computation of the capacity of the
\ind{Z channel} with $\q=0.15$.
\ben
\item
Why is $p_1^*$ less than 0.5? One could argue that it is good
to favour the {\tt{0}} input, since it is transmitted without error --
and also argue that it is good to favour the {\tt1} input, since it
often gives rise to the highly prized {\tt1} output, which
allows certain identification of the input! Try to make a convincing
argument.
\item
In the case of general $\q$, show that the \optens\ is
\beq
p_1^* = \frac{ 1/(1-\q) }
{ \displaystyle
1 + 2^{ \left( H_2(\q) / ( 1 - \q ) \right)} } .
\eeq
\item
What happens to $p_1^*$ if the noise level $\q$ is very close to 1?
\een
}
% see also ahmed.tex for a nice bound 0.5(1-q) on the capacity of the Z channel
% and related graphs CZ.ps CZ2.ps CZ.gnu
%
\exercissxA{2}{ex.Csketch}{
Sketch graphs of the capacity of the \ind{Z channel}, the \BSC\
and the \BEC\ as a function of $\q$.
% answer in figs/C.ps
% \medskip
}
\exercisaxB{2}{ex.fiveC}{
What is the capacity of the five-input, ten-output channel
% \index{channel!others}
whose
transition probability matrix is
{\small
\beq
\left[ \begin{array}{*{5}{c}}
0.25 & 0 & 0 & 0 & 0.25 \\
0.25 & 0 & 0 & 0 & 0.25 \\
0.25 & 0.25 & 0 & 0 & 0 \\
0.25 & 0.25 & 0 & 0 & 0 \\
0 & 0.25 & 0.25 & 0 & 0 \\
0 & 0.25 & 0.25 & 0 & 0 \\
0 & 0 & 0.25 & 0.25 & 0 \\
0 & 0 & 0.25 & 0.25 & 0 \\
0 & 0 & 0 & 0.25 & 0.25 \\
0 & 0 & 0 & 0.25 & 0.25 \\
\end{array}
\right]
\hspace{0.4in}
\begin{array}{c}\ecfig{five}\end{array}
?
\eeq
}
}
\exercissxA{2}{ex.GC}{
Consider a \ind{Gaussian channel}\index{channel!Gaussian}
with binary input $x \in \{ -1, +1\}$
and {\em real\/} output alphabet $\A_Y$, with transition probability density
\beq
Q(y \given x,\sa,\sigma) = \frac{1}{\sqrt{2 \pi \sigma^2}}
\, e^{-\smallfrac{(y-x \sa)^2}{2 \sigma^2}} ,
\eeq
where $\sa$ is the signal amplitude.
\ben
\item
Compute the posterior probability of $x$ given $y$, assuming that
the two inputs are equiprobable.
Put your answer in the form
\beq
P(x\eq 1 \given y,\sa,\sigma) = \frac{1}{1+e^{-a(y)}} .
\eeq
Sketch the value of $P(x\eq 1 \given y,\sa,\sigma)$
as a function of $y$.
\item
Assume that a single bit is to be
transmitted. What is the optimal decoder,
and what is its probability of error? Express your answer in terms
of the signal-to-noise ratio $\sa^2/\sigma^2$ and
the
\label{sec.erf}\ind{error function}\index{conventions!error function}\index{erf}
(the \ind{cumulative
probability function} of the Gaussian distribution),
\beq
 \Phi(z) \equiv \int_{-\infty}^{z} \frac{1}{\sqrt{2 \pi}}
	\, e^{-\textstyle\frac{t^2}{2}} \: \d t.
\eeq
%
% P(x \given y,s,sigma) = 1/(1+e^{-a}), a = 2 ( s / \sigma^2 ). y
%
[Note that this definition of the error function $\Phi(z)$ may not
correspond to other people's.]
% definitions of the `error function'.
% Some people
%% and some software libraries
% leave out factors of two in the definition.]
% I think that the
% above definition is the only natural one.
\een
}
% \section{
\subsection*{Pattern recognition as a noisy channel}
We may think of many pattern recognition problems in terms of\index{pattern recognition}
\ind{communication} channels. Consider the case of recognizing handwritten
digits (such as postcodes on envelopes). The author of the digit
wishes to communicate a message from the set $\A_X = \{
0,1,2,3,\ldots, 9 \}$; this selected message is the input to the
channel. What comes out of the channel is a pattern of ink on paper.
If the ink pattern is represented using 256 binary pixels, the channel $Q$
has as its output a random variable $y \in \A_Y = \{0,1\}^{256}$.
% Here is an example of an element from this alphabet.
An example of an element from this alphabet is shown in the margin.
%
% hintond.p zero=0.0 range=1.25 rows=16 background=1.0 pos=0.0 o=/home/mackay/_applications/characters/ex2.ps 16 < /home/mackay/_applications/characters/example2
%
%\[
\marginpar{
{\psfig{figure=/home/mackay/_applications/characters/ex2.ps,width=1.1in}}
}%\end{marginpar}
%\]
\exercisaxA{2}{ex.twos}{
Estimate how many patterns in $\A_Y$ are recognizable as
the character `2'. [The aim of this problem is to
% Try not to underestimate this number ---
try to demonstrate the existence of {\em as many patterns as possible\/}
that are recognizable as 2s.]
\amarginfig{t}{
\begin{center}
\mbox{\psfig{figure=figs/random2.ps}}
\\[0.15in]%\hspace{0.42in}
\mbox{\psfig{figure=figs/2random2.ps}}
\\[0.15in]%\hspace{0.42in}
\mbox{\psfig{figure=figs/6random2.ps}}
\\[0.15in]%\hspace{0.42in}
\mbox{\psfig{figure=figs/7random2.ps}}
\end{center}
\caption[a]{Some more 2s.}
\label{fig.random2s}
%\end{figure}
}%end{marginfig}
% made using figs/random2.ps seed=7
Discuss how one might model the channel $P(y \given x\eq 2)$.\index{2s}\index{twos}\index{handwritten digits}
% in the case of handwritten digit recognition.
Estimate the entropy of the probability distribution $P(y \given x\eq 2)$.
% Recognition of isolated handwritten digits
% Digit 2 -> Q -> y $\in \{0,1\}^{256})$
% 3
% Estimate how many 2's there are.
One strategy for doing \ind{pattern recognition} is to create a model
 for $P(y \given x)$ for each value of the input $x \in \{ 0,1,2,3,\ldots, 9 \}$,
 then use \Bayes\ theorem to
 infer $x$ given $y$:
\beq
P(x \given y) = \frac{ P(y \given x) P(x) } { \sum_{x'} P(y \given x') P(x') } .
\eeq
This strategy is known as {\dbf \ind{full probabilistic model}ling\/}
or {\dbf \ind{generative model}ling\/}. This is essentially how
current speech recognition systems work. In addition to the
channel model, $P(y \given x)$, one uses a prior probability distribution
$P(x)$, which in the case of both character recognition and
speech recognition is a language model that specifies the probability of
the next character/word given the context and the known grammar
and statistics of the language.
%
% Alternative, model $P(x \given y)$ directly.
% Discriminative modelling; conditional modelling.
% Feature extraction -- compute some $f(y)$ then model $P(f \given x)$
% - generative modelling in feature space.
% or else model $P(x \given f)$
% which is still discriminative modelling / conditional modelling.
% Notice number of parameters.
%
%
}
\subsection*{Random coding}
\exercissxA{2}{ex.birthday}{
Given
%\index{random coding}
% \index{code!random}
\index{random code}twenty-four people in a room,
% at a party,
what is the probability that
there are at least two people present who
% of them
have the same \ind{birthday} (\ie, day and month of birth)?
What is the expected number of
pairs of people with the same birthday? Which of these
 two questions is easier to solve? Which answer gives more
insight?
You may find it helpful to solve these problems and those that follow
using notation such as $A=$ number of days in year $=365$
and $S=$ number of people $=24$.
}
\exercisaxB{2}{ex.birthdaycode}{
The birthday problem may be related to a coding scheme.
Assume we wish to convey a message
to an outsider identifying one of the twenty-four people.
We could simply communicate a number $\cwm$ from $\A_S = \{ 1,2, \ldots,
24 \}$, having agreed a mapping of people onto numbers;
alternatively, we could convey a number from
$\A_X = \{ 1 ,2 , \ldots, 365\}$, identifying the
day of the year that is the selected person's \ind{birthday}
(with apologies to leapyearians). [The receiver is assumed to know
all the people's birthdays.]
What, roughly, is the probability of error of this communication scheme,
assuming it is used for a single transmission?
What is the capacity of the communication channel, and what is
the rate of communication attempted by this scheme?
}
%
% CHRIS SAYS ``this is not CLEAR''................. :
%
\exercisaxB{2}{ex.birthdaycodeb}{
Now imagine that there are $K$ rooms in a building, each containing
$q$ people. (You might think of $K=2$ and $q=24$ as an example.)
The aim is to communicate a selection of one person
from each room by transmitting an ordered list of $K$ days (from $\A_X$).
Compare the probability of error of the following two schemes.
\ben
\item
As before, where each
room transmits the \ind{birthday} of the selected person.
\item
To each $K$-tuple of people, one drawn from each room,
an ordered $K$-tuple of randomly selected days from $\A_X$ is assigned
(this $K$-tuple has nothing to do with their birthdays).
This enormous list of $S = q^K$ strings is
known to the receiver. When the building has selected
a particular person from each room, the ordered string of days
corresponding to that $K$-tuple of people is transmitted.
\een
What is the probability of error when $q=364$ and $K=1$?
What is the probability of error when $q=364$ and $K$ is large,
\eg\ $K=6000$?
}
% see synchronicity.tex
% for cut example
\dvips
\section{Solutions}% to Chapter \protect\ref{ch5}'s exercises}
%
\fakesection{solns to exercises in l5.tex}
%
\soln{ex.bscy0}{
If we assume we observe $y\eq 0$,
\beqan
P(x\eq 1 \given y\eq 0) &=& \frac{ P(y\eq 0 \given x\eq 1) P(x\eq 1)}{\sum_{x'} P(y \given x') P(x')} \\
&=& \frac{ 0.15 \times 0.1 }{ 0.15 \times 0.1 + 0.85 \times 0.9 } \\
&=& \frac{ 0.015 }{0.78} \:=\: 0.019 .
\eeqan
}
\soln{ex.zcy0}{
If we observe $y=0$,
\beqan
P(x\eq 1 \given y\eq 0)
% &=& \frac{ P(y\eq 0 \given x\eq 1) P(x\eq 1)}{\sum_{x'} P(y \given x') P(x')} \\
&=& \frac{ 0.15 \times 0.1 }{ 0.15 \times 0.1 + 1.0 \times 0.9 } \\
&=& \frac{ 0.015}{ 0.915} \:=\: 0.016 .
\eeqan
}
\soln{ex.bscMI}{
The probability that $y=1$
is $0.5$, so the mutual information is:
\beqan
\I(X;Y) &=& H(Y) - H(Y \given X) \\
&=& H_2(0.5) - H_2(0.15)\\
& =& 1 - 0.61 \:\: = \:\: 0.39 \mbox{ bits}.
\eeqan
}
\soln{ex.zcMI}{
We again compute the mutual information using
$\I(X;Y) = H(Y) - H(Y \given X)$.
% fixed Tue 18/2/03
The probability that $y=0$
is $0.575$, and $H(Y \given X) = \sum_x P(x) H(Y \given x) = P(x\eq1) H(Y \given x\eq1) $
$+$ $P(x\eq0) H(Y \given x\eq0)$ so the mutual information is:
\beqan
\I(X;Y) &=& H(Y) - H(Y \given X) \\
&=& H_2(0.575) - [0.5 \times H_2(0.15)+0.5 \times 0 ] \\
& =& 0.98 - 0.30 \:\: = \:\: 0.679 \mbox{ bits}.
\eeqan
}
\soln{ex.bscC}{
By symmetry, the \optens\ is
$\{0.5,0.5\}$.
Then the capacity is
\beqan
C \:=\: \I(X;Y) &=& H(Y) - H(Y \given X) \\
&=& H_2(0.5) - H_2(\q)\\
& =& 1 - H_2(\q) .
\eeqan
Would you like to find the \optens\ without invoking symmetry?
We can do this by computing the mutual information in the general
case where the input ensemble is $\{p_0,p_1\}$:
\beqan
\I(X;Y) &=& H(Y) - H(Y \given X) \\
&=& H_2(p_0 \q+ p_1(1-\q) ) - H_2(\q) .
\eeqan
The only $p$-dependence is in the first term $H_2(p_0\q+ p_1(1-\q) )$,
which is maximized by setting the argument to 0.5.
This value is given by setting $p_0=1/2$.
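 To spell out this step (a short check, using only $p_1 = 1-p_0$):
 the argument equals $1 - \q - p_0 (1 - 2\q)$, and setting it to $1/2$ gives
\beq
	p_0 (1 - 2\q) = \frac{1}{2} - \q = \frac{1}{2} (1 - 2 \q) ,
\eeq
 so $p_0=1/2$ whenever $\q \neq 1/2$.
 (If $\q = 1/2$, the argument is $1/2$ for every input ensemble
 and the mutual information is zero.)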
}
\soln{ex.becC}{
\noindent {\sf Answer 1}.
By symmetry, the \optens\ is
$\{0.5,0.5\}$. The capacity is
most easily evaluated by
writing the mutual information as $\I(X;Y) = H(X) - H(X \given Y)$.
The conditional entropy $H(X \given Y)$ is $\sum_y P(y) H(X \given y)$;
when $y$ is known, $x$ is only uncertain if $y=\mbox{\tt{?}}$, which
occurs with probability $\q/2+\q/2$,
so the conditional entropy $H(X \given Y)$ is $\q H_2(0.5)$.
\beqan
C \:=\: \I(X;Y) &=& H(X) - H(X \given Y) \\
&=& H_2(0.5) - \q H_2(0.5)\\
& =& 1 - \q .
\eeqan
% The conditional entropy $H(X \given Y)$ is $\q H_2(0.5)$.
%
The binary erasure channel
fails a fraction $\q$ of the
time. Its capacity is precisely
$1-\q$, which is the fraction of
the time that the channel is
reliable.
% functional.
% , even though the sender
% does not know when the channel will
% fail.
This result seems very reasonable, but it is far from obvious
how to encode information so as to communicate {\em reliably\/} over this channel.
\smallskip
\noindent {\sf Answer 2}.
Alternatively, without invoking the symmetry assumed above, we can
start from the input ensemble $\{p_0,p_1\}$. The probability that
$y=\mbox{\tt{?}}$ is $p_0 \q+ p_1 \q = \q$, and when we receive $y=\mbox{\tt{?}}$,
the posterior probability of $x$ is the same as the prior
probability, so:
\beqan
\I(X;Y) &=& H(X) - H(X \given Y) \\
&=& H_2(p_1) - \q H_2(p_1)\\
& =& (1 - \q ) H_2(p_1) .
\eeqan
This mutual information achieves its maximum value of $(1-\q)$ when
$p_1=1/2$.
}
%
%
%
\begin{figure}[htbp]
\figuremargin{%
\begin{center}
\begin{tabular}{ccccc}
$\bQ$
& \ecfig{bec.1}
&{\small{(a)}} \, \ecfig{bec.2}
&{\small{(b)}} \,
% roughly 8pts from col to col
\setlength{\unitlength}{1pt}
\begin{picture}(50,110)(-5,-5)
\put(-5,-5){\ecfig{bec.2}}
\put(3.95,-3){\framebox(8,96){}}
\put(28.5,-3){\framebox(8,96){}}
\put(2.5,97){\makebox(0,0)[bl]{\small$\bx^{(1)}$}}
\put(26.5,97){\makebox(0,0)[bl]{\small$\bx^{(2)}$}}
\end{picture}
&{\small{(c)}} \,
% roughly 8pts from col to col
\setlength{\unitlength}{1pt}
\begin{picture}(50,110)(-5,-5)
\put(-5,-5){\ecfig{bec.2}}
\put(3.95,-3){\framebox(8,96){}}
\put(28.5,-3){\framebox(8,96){}}
\put(2.5,97){\makebox(0,0)[bl]{\small$\bx^{(1)}$}}
\put(26.5,97){\makebox(0,0)[bl]{\small$\bx^{(2)}$}}
% roughly 8pts from col to col
%\setlength{\unitlength}{1pt}
%\begin{picture}(100,110)(-5,-5)
%\put(-5,-5){\ecfig{bec.2}}
%\put(3.95,-3){\framebox(8,96){}}
%\put(28.5,-3){\framebox(8,96){}}
%\put(2.5,97){\makebox(0,0)[bl]{$\bx^{(1)}$}}
%\put(26.5,97){\makebox(0,0)[bl]{$\bx^{(2)}$}}
%
\multiput(-4,3)(0,8){2}{\line(1,0){8}}
\multiput(-4,27)(0,8){3}{\line(1,0){8}}
\multiput(-4,59)(0,8){2}{\line(1,0){8}}
\multiput(37,3)(0,8){2}{\vector(1,0){14}}
\multiput(37,27)(0,8){3}{\vector(1,0){14}}
\multiput(37,59)(0,8){2}{\vector(1,0){14}}
\multiput(57,4)(0,8){2}{\makebox(0,0)[l]{\tiny$\hat{m}=2$}}
\multiput(57,28)(0,8){1}{\makebox(0,0)[l]{\tiny$\hat{m}=2$}}
\multiput(57,44)(0,8){1}{\makebox(0,0)[l]{\tiny$\hat{m}=1$}}
\multiput(57,60)(0,8){2}{\makebox(0,0)[l]{\tiny$\hat{m}=1$}}
\multiput(57,36)(0,8){1}{\makebox(0,0)[l]{\tiny$\hat{m}=0$}}
% the box starts exactly at x=0.
\end{picture}
\\
& $N=1$ & $N=2$ & \\[-0.1in]
\end{tabular}
\end{center}
}{%
\caption[a]{(a) The {\ind{extended channel}} ($N=2$)
obtained from a binary erasure channel
with erasure probability 0.15. (b) A block code
consisting of the two codewords {\tt 00} and {\tt 11}.
(c) The optimal decoder for this code. }
\label{fig.extended.bec}
}
\end{figure}
%
\soln{ex.extended}{
The extended channel is shown in \figref{fig.extended.bec}.
The best code for this channel with $N=2$ is obtained by choosing
two columns that have minimal overlap, for example, columns {\tt 00}
and {\tt 11}. The decoding algorithm returns `{\tt 00}'
if the extended channel output is among the top four
% either output is {\tt 0},
and `{\tt 11}' if
it's among the bottom four,
% if either output is {\tt 1},
and gives up if the output is `{\tt ??}'.
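 As a quick check of how well this code performs (an added illustration):
 the decoder gives up only when both symbols are erased, and
\beq
	P( \by \eq \mbox{\tt ??} \given \bx ) = 0.15 \times 0.15 = 0.0225
\eeq
 whichever codeword $\bx$ was sent.
 For comparison, a single uncoded use of the channel
 leaves the transmitted bit unknown with probability 0.15.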
}
%
% end of chapter
%
\soln{ex.zcdiscuss}{
In \exampleref{exa.Z.HXY}
% \exaseven\ of chapter \chfive\
we showed that the mutual information between input and output
of the Z channel is
\beqan
\I(X;Y) &=& H(Y) - H(Y \given X) \nonumber \\
&=& H_2(p_1 (1-\q)) - p_1 H_2(\q) .
\eeqan
We differentiate this expression with respect to $p_1$, taking care not
to confuse $\log_2$ with $\log_e$:
\beq
\frac{\d}{\d p_1} \I(X;Y)
= (1-\q) \log_2 \frac{ 1- p_1 (1-\q) }{ p_1 (1-\q) } - H_2(\q) .
\eeq
Setting this derivative to zero and rearranging using skills developed
in \exthirtyone, we obtain:
\beq
{ p_1^* (1-\q) } = \frac{1}{1 + \displaystyle 2^{H_2(\q)/(1-\q)}} ,
\eeq
so the \optens\ is
\beq
p_1^* = \frac{ 1/(1-\q) }
{ \displaystyle
1 + 2^{ \left( H_2(\q) / ( 1 - \q ) \right)} } .
\eeq
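 As a numerical check (an added illustration for the case $\q=0.15$
 considered in the exercise), $H_2(0.15) \simeq 0.61$, so
\beq
	p_1^* \simeq \frac{ 1/0.85 }{ 1 + 2^{ 0.61/0.85 } }
	\simeq \frac{ 1.176 }{ 1 + 1.644 } \simeq 0.445 ,
\eeq
 which is indeed a little less than $0.5$.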
As the noise level $\q$ tends to 1, this expression tends to $1/e$
(as you can prove using L'H\^opital's rule).
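 One way to see this limit (a sketch, writing $\epsilon \equiv 1-\q$
 and keeping only the leading terms for small $\epsilon$):
 since $H_2(\q) = H_2(\epsilon) \simeq \epsilon \log_2 (1/\epsilon) + \epsilon \log_2 e$,
\beq
	2^{H_2(\q)/(1-\q)} \simeq \frac{e}{\epsilon}
	\:\:\:\: \mbox{and so} \:\:\:\:
	p_1^* \simeq \frac{ 1/\epsilon }{ 1 + e/\epsilon } = \frac{1}{\epsilon + e}
	\rightarrow \frac{1}{e} .
\eeq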
For all values of $\q\!$, $p_1^*$ is smaller than $1/2$. A rough
intuition for why input {\tt1} is used less than input {\tt0} is that
when input {\tt1} is used, the noisy channel injects entropy into
the received string; whereas when input {\tt0} is used, the noise has
zero entropy. Thus starting from $p_1=1/2$, a perturbation
towards smaller $p_1$ will reduce the conditional entropy
$H(Y \given X)$ linearly while leaving $H(Y)$ unchanged, to first order.
$H(Y)$ decreases only quadratically in $(p_1-\dhalf)$.
}
\soln{ex.Csketch}{
The capacities of the three channels are shown in \figref{fig.capacities}.
% below.
\amarginfig{b}{
\begin{center}
\mbox{\psfig{figure=figs/C.ps,angle=-90,width=2in}
}
\end{center}
\caption[a]{Capacities of the Z channel, \BSC, and binary erasure channel.}
\label{fig.capacities}
}%end marginpar
For any $\q <0.5$,
% the channels can be ordered with the BEC being the
the BEC is the
channel with highest capacity and the BSC the lowest.
}
\soln{ex.GC}{
The logarithm of the posterior probability ratio, given $y$, is
\beq
a(y) = \ln \frac{P(x\eq 1 \given y,\sa,\sigma)}{P(x\eq -1 \given y,\sa,\sigma)}
= \ln \frac{Q(y \given x\eq 1,\sa,\sigma)}{Q(y \given x\eq -1,\sa,\sigma)}
= 2 \frac{\sa y}{\sigma^2} .
% corrected march 2000
% and corrected log to ln Sun 22/8/04
\eeq
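 The last equality can be checked by writing out the two Gaussian densities
 (an added intermediate step); the normalizing constants cancel, leaving
\beq
	a(y) = \frac{ -(y-\sa)^2 + (y+\sa)^2 }{ 2 \sigma^2 }
	= \frac{ 4 \sa y }{ 2 \sigma^2 } = 2 \frac{\sa y}{\sigma^2} .
\eeq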
Using our skills picked up from
% in chapter \ref{ch1},
\exerciseref{ex.logit}, we rewrite
% from exercise \label{eq.sigmoid} \label{eq.logistic}
this in the form
\beq
P(x\eq 1 \given y,\sa,\sigma) = \frac{1}{1+e^{-a(y)}} .
\eeq
The optimal decoder selects the most probable hypothesis; this can
 be done simply by looking at the sign of $a(y)$. If $a(y)>0$, \ie, if $y>0$,
 then decode as $\hat{x}=1$; otherwise decode as $\hat{x}=-1$.
The probability of error is
\beq
p_{\rm b} = \int_{-\infty}^{0} \!\! \d y \:
Q(y \given x\eq 1,\sa,\sigma) =
% chris suggests removing the x (=1) from what follows (twice)
\int_{-\infty}^{- x \sa} \! \d y \: \frac{1}{\sqrt{2 \pi \sigma^2}}
e^{-\smallfrac{y^2}{2 \sigma^2}}
= \Phi \left( - \frac{ x\sa }{ \sigma } \right) .
% corrected march 2000
\eeq
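 The second equality follows from the change of variable $y \rightarrow y - x\sa$.
 Since $x=1$ here, the result can be restated in terms of the signal-to-noise ratio
 (an added restatement, assuming $\sa > 0$) as
\beq
	p_{\rm b} = \Phi \left( - \sqrt{ \sa^2/\sigma^2 } \, \right) .
\eeq
 By symmetry, the same error probability is obtained when $x=-1$ is transmitted.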
% where
%\beq
% \Phi(z) \equiv \int_{z}^{\infty} \frac{1}{\sqrt{2 \pi}}
% e^{-\frac{z^2}{2}} .
%\eeq
%\beq
% \Phi(z) \equiv \int_{-\infty}^{z}{\smallfrac{1}{\sqrt{2 \pi}}}
% e^{-\textstyle\frac{z^2}{2}} .
%\eeq
}
\subsection*{Random coding}
\soln{ex.birthday}{
The probability that $S=24$ people whose birthdays are drawn at random
from $A=365$ days all have {\em distinct\/} birthdays is
\beq
	\frac{ A(A-1)(A-2)\ldots(A-S+1) }{ A^S } .
\eeq
The probability that two (or more) people share a \ind{birthday} is one minus
this quantity, which, for $S=24$ and $A=365$,
is about 0.5. This exact way of answering the question
is not very informative
since it is not clear for what
value of $S$ the probability changes from being close to 0 to being
close to 1.
The number of pairs is $S(S-1)/2$, and the probability that a particular
pair shares a birthday is $1/A$, so the {\em expected number\/} of collisions
is
\beq
\frac{ S(S-1)}{2 } \frac{1}{A} .
\eeq
This answer is more instructive. The expected number of collisions
is tiny if $S \ll \sqrt{A}$ and big if $S \gg \sqrt{A}$.
We can also approximate the probability that all birthdays are distinct,
for small $S$, thus:
\beqan
\lefteqn{\hspace*{-0.7in}
\frac{ A(A-1)(A-2)\ldots(A-S+1) }{ A^S }
\:\:=\:\: (1)(1-\dfrac{1}{A})(1-\dfrac{2}{A})\ldots(1-\dfrac{(S\!-\!1)}{A})
\hspace*{1.7in}}
% this hspace{ no good
\nonumber \\
&\simeq&
\exp( 0 ) \exp ( -\linefrac{1}{A}) \exp ( -\linefrac{2}{A}) \ldots \exp ( -\linefrac{(S\!-\!1)}{A})
\\
&\simeq&
\exp \left( - \frac{1}{A} \sum_{i=1}^{S-1} i \right)
= \exp \left( - \frac{S(S-1)/2}{A} \right) .
\eeqan
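 Putting in the numbers $S=24$ and $A=365$ (an added check of the approximation),
 the expected number of pairs sharing a birthday is
\beq
	\frac{ S(S-1)}{2 } \frac{1}{A} = \frac{276}{365} \simeq 0.76 ,
\eeq
 and the approximate probability that all birthdays are distinct is
 $\exp(-0.76) \simeq 0.47$, so the probability of at least one shared
 birthday is roughly $0.53$, consistent with the value of about 0.5 quoted above.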
}
\dvipsb{solutions noisy channel s5}
\prechapter{About Chapter}
\fakesection{prerequisites for chapter 6}
Before reading \chref{ch.six}, you should have read Chapters
\chtwo\ and \chfive. \Exerciseref{ex.extended} is
especially recommended.
% and worked on \exerciseref{ex.dataprocineq}.
%
% \extwentytwo\ from chapter \chone.
% Please note that you {\em don't\/} need to understand
% this proof in order to be able to solve most of the
% problems involving noisy channels.
%\footnote
% {This exposition is based on that of Cover and Thomas (1991).}
\subsection*{Cast of characters}
\noindent%
\begin{tabular}{lp{4in}} \toprule
$Q$ & the noisy channel \\
$C$ & the capacity of the channel \\
$X^N$ & an ensemble used to create a \ind{random code} \\
$\C$ & a random code \\
$N$ & the length of the codewords \\
$\bx^{(\cwm)}$ & a codeword, the $\cwm$th in the code \\
$\cwm$ % $s$
& the number of a chosen codeword
(mnemonic: the {\em source\/} selects $\cwm$) \\
$\cwM = 2^{K}$ % $S$
& the total number of codewords in the code\\
$K=\log_2 \cwM$
& the number of bits conveyed by the choice of one codeword from $\cwM$,
assuming it is chosen with uniform probability \\
$\bs$ & a binary representation of the number $\cwm$ \\
$R = K/N$ & the rate of the code, in bits per channel use
(sometimes called $R'$ instead) \\
% $R'$ & another rate, close to $R$ \\
$\hat{\cwm}$ % $s$
& the decoder's guess of $\cwm$ \\
\bottomrule
\end{tabular} \medskip
%{\sf Typo Warning:}
% the letter $m$ may turn up where it should read $\cwm$.
%%%% !!!!!!!!!!!!!! ok???????????????????????
\ENDprechapter
\chapter{The Noisy-Channel Coding Theorem}
% {The noisy-channel coding theorem}% Proof of
\label{ch.six}
% % \lecturetitle{The noisy-channel coding theorem, part b}
% \chapter{The noisy channel coding theorem}% Proof of
\label{ch6}
\section{The theorem}\index{noisy-channel coding theorem}\index{communication}
The theorem has three parts, two positive and one negative.
The main positive result is the first.
\amarginfig{t}{
\begin{center}\small
\setlength{\unitlength}{2pt}
\begin{picture}(60,45)(-2.5,-7)
\thinlines
\put(0,0){\vector(1,0){60}}
\put(0,0){\vector(0,1){40}}
\put(30,0){\line(0,1){30}}
\put(30,0){\line(1,2){10}}
\put(30,-3){\makebox(0,0)[t]{$C$}}
\put(55,-2){\makebox(0,0)[t]{$R$}}
\put(42,22){\makebox(0,0)[bl]{$R(p_{\rm b})$}}
\put(-1,35){\makebox(0,0)[r]{$p_{\rm b}$}}
\thicklines
\put(0,0){\makebox(30,30){1}}
\put(30,0){\makebox(7.5,35){2}}
\put(35,0){\makebox(30,20){3}}
% \put(0,0){\line(0,1){50}}
%
\end{picture}
\end{center}
\caption[a]{Portion of the $R,p_{\rm b}$ plane
to be proved
achievable (1,$\,$2) and
not achievable (3).
}
\label{fig.belowCcoming}
}%end marginfig
\ben%gin{itemize}
\item
For every discrete memoryless channel, the
channel capacity
\beq
C = \max_{\P_X}\, \I(X;Y)
\eeq
has the following
property. For any $\epsilon > 0$ and $R < C$, for large enough $N$,
there exists a code of length $N$ and rate $\geq R$ and a decoding
algorithm, such that the maximal probability of block error is $<
\epsilon$.
\item
If a probability of bit error $p_{\rm b}$ is acceptable, rates up to $R(p_{\rm b})$
are achievable, where
\beq
R(p_{\rm b}) = \frac{ C }
{1 - H_2(p_{\rm b})} .
\eeq
\item
For any $p_{\rm b}$, rates greater than $R(p_{\rm b})$ are not achievable.
\een%d{itemize}
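 For illustration (an added example; the channel and the value of $p_{\rm b}$
 are chosen arbitrarily), consider a channel with capacity $C \simeq 0.39$ bits,
 which is the capacity of the \BSC\ with noise level 0.15.
 If a bit error probability $p_{\rm b} = 0.1$ is acceptable then,
 since $H_2(0.1) \simeq 0.47$,
\beq
	R(p_{\rm b}) \simeq \frac{ 0.39 }{ 1 - 0.47 } \simeq 0.74 ,
\eeq
 whereas as $p_{\rm b} \rightarrow 0$ the achievable rate falls to $R(0) = C$.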
\section{Jointly-typical sequences}
We formalize the intuitive preview of the last chapter.\index{typicality}
We will define codewords $\bx^{(\cwm )}$ as coming from an ensemble $X^N$,
and consider the random selection of one codeword and a
corresponding channel output $\by$,
thus defining a joint ensemble $(XY)^N$.
%, corresponding to random generation of a codeword and a corresponding channel output.
We will use a {\dem typical-set decoder}, which
decodes
a received signal
$\by$ as $\cwm$ if $\bx^{(\cwm )}$ and $\by$ are {\dem jointly typical},
a term to be defined shortly.
The proof will then centre on determining the probabilities (a) that the true
input codeword is {\em not\/}
jointly \index{typicality}{typical} with the output sequence;
 and (b) that a {\em false\/} input codeword {\em is\/} jointly typical with the output.
We will show that, for large $N$, both probabilities
% $\rightarrow 0$,
go to zero
as long as there are fewer
than $2^{NC}$ codewords, and the ensemble $X$ is the \index{optimal input distribution}{\optens}.
\newcommand{\JNb}{\mbox{$J_{N \beta}$}}
\begin{description}
\item[Joint typicality\puncspace]
A pair of sequences $\bx,\by$ of length $N$ are defined to be
{jointly
typical (to tolerance $\beta$)}\index{joint typicality}
with respect to the distribution
$P(x,y)$ if
\beqan
\mbox{$\bx$ is typical of $P(\bx)$,}
& \mbox{\ie,} &
\left| \frac{1}{N} \log \frac{1}{P(\bx)} - H(X) \right| < \beta ,
\nonumber
\\
\mbox{$\by$ is typical of $P(\by)$,}
& \mbox{\ie,} &
\left| \frac{1}{N} \log \frac{1}{P(\by)} - H(Y) \right| < \beta ,
\nonumber
\\
\mbox{and $\bx,\by$ is typical of $P(\bx,\by)$,}
& \mbox{\ie,} &
\left| \frac{1}{N} \log \frac{1}{P(\bx,\by)} - H(X,Y) \right| < \beta .
\nonumber
\eeqan
\item[The jointly-typical set] $\JNb$ is the set of all jointly-typical
 sequence pairs of length $N$.
% It has the following three properties,
\end{description}
%\begin{example}
\noindent {\sf Example.}
Here is a jointly-typical pair of length $N=100$
for the ensemble
$P(x,y)$ in which $P(x)$ has $(p_0,p_1) = (0.9,0.1)$
and $P(y \given x)$ corresponds to a binary symmetric channel with
noise level $0.2$.
\[%beq
\mbox{
\begin{tabular}{cc}
$\bx$ &\mbox{\footnotesize\tt 1111111111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000}\\
$\by$ &\mbox{\footnotesize\tt 0011111111000000000000000000000000000000000000000000000000000000000000000000000000111111111111111111}
\end{tabular}
}
\]%eeq
Notice that $\bx$ has 10 {\tt 1}s, and so is typical of the probability
$P(\bx)$ (at any tolerance $\beta$); and $\by$ has
% 18 + 8 = 26
26 {\tt 1}s, so it is typical of $P(\by)$ (because $P(y\eq 1) = 0.26$);
and $\bx$ and $\by$ differ in
% 18 + 2
20 bits, which is the typical number of flips for
this channel.
%\end{example}
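 These statements can be checked directly from the definition
 (an added verification, with logarithms to base 2).
 Because the counts match the probabilities exactly,
\beqan
	\frac{1}{N} \log \frac{1}{P(\bx)}
	&=& 0.1 \log \frac{1}{0.1} + 0.9 \log \frac{1}{0.9} \:=\: H_2(0.1) \:=\: H(X) ,
\\
	\frac{1}{N} \log \frac{1}{P(\by)}
	&=& 0.26 \log \frac{1}{0.26} + 0.74 \log \frac{1}{0.74} \:=\: H_2(0.26) \:=\: H(Y) ,
\\
	\frac{1}{N} \log \frac{1}{P(\bx,\by)}
	&=& H_2(0.1) + H_2(0.2) \:=\: H(X) + H(Y \given X) \:=\: H(X,Y) ,
\eeqan
 so all three conditions are satisfied for any $\beta > 0$.
 The term $H_2(0.2)$ arises because 20 of the 100 bits are flipped,
 so that $P(\by \given \bx) = 0.2^{20} \, 0.8^{80}$.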
\begin{description}
\item[Joint typicality theorem\puncspace]
Let $\bx,\by$ be drawn from the ensemble $(XY)^N$ defined
by
$$P(\bx,\by)=\prod_{n=1}^N P(x_n,y_n).$$
Then\index{joint typicality theorem}\label{theorem.jtt}
\ben
\item
the probability that $\bx,\by$
are jointly typical (to tolerance $\beta$)
tends to 1 as $N \rightarrow \infty$;
\item
the number of jointly-typical sequences $|\JNb|$
is close to $2^{N H(X,Y) }$. To be precise,
\beq
|\JNb| \leq 2^{N ( H(X,Y) + \beta ) };
\eeq
\item
if $\bx'\sim X^N$ and $\by'\sim Y^N$, \ie, $\bx'$ and $\by'$
are {\em independent\/} samples
with the same marginal distribution as $P(\bx,\by)$, then
the probability that $(\bx' ,\by')$ lands in the
jointly-typical set is about $2^{- N \I(X;Y)}$. To be precise,
\beq
P( (\bx' ,\by') \in \JNb )
\leq 2^{- N ( \I(X;Y) - 3 \beta ) } .
\eeq
% also, for the proof of the converse, we want...
% for sufficiently large N
% P( (\bx' ,\by') \in \JNb
% \geq (1-\beta) 2^{- N ( \I(X;Y) + 3 \beta ) }
\een
\item[{\sf Proof.}] The proof of parts 1 and 2
by the law of large numbers follows that of the source coding theorem in
\chref{ch2}. For part 2, let the pair $x,y$ play the role of
$x$ in the source coding theorem, replacing $P(x)$ there by
the probability distribution $P(x,y)$.
% \marginpar{\footnotesize }
For the third part,
\beqan
% \begin{array}{lll}
% was (thin column) --
% \multicolumn{3}{l}{
% P( (\bx' ,\by') \in \JNb )
% \: = \: \sum_{(\bx ,\by) \in \JNb} P(\bx ) P(\by)}
% \\[0.06in]
% &\leq & |\JNb| \, 2^{-N(H(X)-\beta)} 2^{-N(H(Y)-\beta)}
% \\[0.045in]
% &\leq& 2^{N( H(X,Y) + \b) - N(H(X)+H(Y)-2\b)}
% \\
% & =& 2^{-N ( \I(X;Y) - 3 \beta )}
P( (\bx' ,\by') \in \JNb )
& = & \sum_{(\bx ,\by) \in \JNb} P(\bx ) P(\by)
\\% [0.06in]
&\leq & |\JNb| \, 2^{-N(H(X)-\beta)} \, 2^{-N(H(Y)-\beta)}
\\% [0.045in]
	&\leq& 2^{N( H(X,Y) + \beta) - N(H(X)+H(Y)-2\beta)}
\\
& =& 2^{-N ( \I(X;Y) - 3 \beta )} . \hspace{1in}\epfsymbol
\eeqan
% This quantity is a bound on the probability of confusing
\end{description}
A cartoon of the jointly-typical set is shown in
\figref{fig.joint.typ}.
% The property just proved, that t
Two
independent typical vectors are jointly typical with probability
\beq
P(
(\bx' ,\by') \in \JNb ) \simeq 2^{-N ( \I(X;Y))}
\eeq
% because
%, is readily understood by noticing that
because the {\em total\/} number of independent typical pairs is the
area of the dashed rectangle, $2^{NH(X)} 2^{NH(Y)}$, and the number of
jointly-typical pairs is roughly $2^{NH(X,Y)}$, so the probability of hitting
a jointly-typical pair is roughly
\beq
2^{NH(X,Y)}/2^{NH(X)+NH(Y)} = 2^{-N\I(X;Y)}.
\eeq
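As a rough numerical illustration (the channel and blocklength are chosen
only for concreteness), consider a binary symmetric channel with flip
probability $f = 0.1$ and a uniform input distribution, so that
$H(X) = H(Y) = 1$ bit and $H(X,Y) = 1 + H_2(0.1) \simeq 1.47$ bits.
For $N = 100$ the ratio above is roughly
\beq
	\frac{2^{147}}{2^{200}} = 2^{-53} ,
\eeq
in agreement with $\I(X;Y) = 1 - H_2(0.1) \simeq 0.53$ bits.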
%
% the above eq was in-line but it looked ugly
%
\newcommand{\rad}{0.81}
\begin{figure}
\small
\figuremargin{%
\begin{center}\small
\setlength{\unitlength}{1mm}% original picture is 9.75 in by 5.25 in
\begin{picture}(74,105)(-15,-5)
%
\put(-10,-7){\framebox(62,99){}}
% as well as box put Ax and Ay sizes
\put(0,93.5){\vector(-1,0){10}}
\put(0,93.5){\vector(1,0){52}}
\put(-11,8){\vector(0,-1){15}}
\put(-11,8){\vector(0,1){84}}
%\put(0,92){\vector(-1,0){10}}
%\put(0,92){\vector(1,0){52}}
%\put(-10,8){\vector(0,-1){15}}
%\put(-10,8){\vector(0,1){84}}
%
% width indicator
\put(21,90){\vector(1,0){21}}
\put(21,90){\vector(-1,0){21}}
\put(21,88.7){\makebox(0,0)[t]{$2^{NH(X)}$}}
%
% height indicator
\put(-2,45){\vector(0,1){43}}
\put(-2,45){\vector(0,-1){43}}
\put(0,30){\makebox(0,0)[l]{$2^{NH(Y)}$}}% was 45
%
% RECTANGLE
%\put(-1,0){\framebox(45,89){}}
\put(-1,1){\dashbox{1}(44.5,88){}}
%
% strip width indicator
\put(26,35){\vector(1,0){2}}
\put(26,35){\vector(-1,0){2}}
\put(26,15){\vector(1,0){2}}
\put(26,15){\vector(-1,0){2}}
\put(26,14){\makebox(0,0)[t]{$2^{NH(X|Y)}$}}
%
% strip height indicator
\put(21,45){\vector(0,1){5}}
\put(21,45){\vector(0,-1){5}}
\put(28,45){\vector(0,1){5}}% was at 31,32
\put(28,45){\vector(0,-1){5}}
\put(29,45){\makebox(0,0)[l]{$2^{NH(Y|X)}$}}
%
% JT set
\multiput(2,88)(2,-4){21}{\circle*{\rad}}
\multiput(2,86)(2,-4){21}{\circle*{\rad}}
\multiput(0,86)(2,-4){22}{\circle*{\rad}}
\multiput(0,88)(2,-4){22}{\circle*{\rad}}
\multiput(0,82)(2,-4){21}{\circle*{\rad}}
\multiput(0,84)(2,-4){21}{\circle*{\rad}}
%
%\put(38,20){\makebox(0,0)[l]{$2^{NH(X,Y)}$}}
\put(18,64){\makebox(0,0)[l]{$2^{NH(X,Y)}$ dots}}
\put(21,96){\makebox(0,0)[b]{$\A_{X}^N$}}
\put(-12,45){\makebox(0,0)[r]{$\A_{Y}^N$}}
\end{picture}
\end{center}
}{%
\caption[a]{{The jointly-typical set.} The horizontal direction
represents $\A_{X}^N$, the set of all input strings of length $N$.
The vertical direction
represents $\A_{Y}^N$, the set of all output strings of length $N$.
The outer box contains all conceivable input--output pairs.
Each dot represents
a jointly-typical pair of sequences $(\bx,\by)$.
The total number of jointly-typical sequences is about $2^{NH(X,Y)}$.
% [Compare with \protect\figref{fig.extended.bec}a,
% \protect\pref{fig.extended.bec}.]
% page \protect\pageref{fig.extended.bec}.]
}
\label{fig.joint.typ}
}
\end{figure}
\section{Proof of the noisy-channel coding theorem}
\subsection{Analogy}
Imagine that we wish to prove that there is a baby\index{weighing babies} in a class
of one hundred babies who weighs less than 10\kg. Individual babies
are difficult to catch and weigh.%
\amarginfig{c}{
\begin{center}
\mbox{\psfig{figure=figs/babiesscale4.ps,width=53mm}}
\end{center}
\caption[a]{Shannon's method for
proving one baby weighs less than 10\kg.}
}
Shannon's method of\index{Shannon, Claude}
solving the task is to scoop up all the babies and weigh them
all at once on a big weighing machine. If we find that their {\em average\/} weight is
% smaller than 1000\kg\ then the children's average weight
% must be
smaller than 10\kg, there must exist {\em at least one\/}
baby who weighs less than 10\kg\ -- indeed there must be many!
% In the context of weighing children,
Shannon's method isn't guaranteed to reveal the existence of an underweight child,
since it relies on
there being a tiny number of elephants in the class. But if we use his method
and get a total weight smaller than 1000\kg\ then our task is solved.
\subsection{From skinny children to fantastic codes}
We wish to show that there exists a code and a decoder having small
probability of error. Evaluating the probability of error of any
particular coding and decoding
system is not easy. Shannon's innovation was this: instead of
constructing a good coding and decoding
system and evaluating its error probability,
Shannon
calculated the average probability of block error of {\em all\/}
codes, and proved that this average is small. There must then exist
individual codes that have small probability of block error.
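 The step from a small average to the existence of a good code
 can be written in one line. Writing $p_{\rm B}(\C)$ for the probability
 of block error of a code $\C$ and $P(\C)$ for the probability with which
 that code is selected,
\beq
	\min_{\C} p_{\rm B}(\C) \leq \sum_{\C} P(\C) \, p_{\rm B}(\C) ,
\eeq
 where the minimum runs over the codes that have $P(\C) > 0$:
 a minimum can never exceed a weighted average.
 So if the right-hand side is smaller than some $\epsilon$,
 at least one code has $p_{\rm B}(\C) < \epsilon$.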
% Finally
% to prove that the {\em maximal\/} probability of error is small too,
% we modify one of these good codes by throwing away the worst 50\%
% of its codewords.
\begin{figure}
\small
\figuremargin{%
\begin{center}
\begin{tabular}{cc}
\setlength{\unitlength}{0.81mm}% original picture is 9.75 in by 5.25 in
%\begin{picture}(74,100)(-15,-5)
\begin{picture}(62,100)(-5,-5)
%
%\put(-10,-2){\framebox(62,94){}}
% codewords
\put( 5,0){\framebox(2,91){}}
\put(13,0){\framebox(2,91){}}
\put(31,0){\framebox(2,91){}}
\put(35,0){\framebox(2,91){}}
%
\put(5,94){\makebox(0,2.5)[bl]{$\bx^{(3)}$}}
\put(13,94){\makebox(0,2.5)[bl]{$\bx^{(1)}$}}
\put(29,94){\makebox(0,2.5)[bl]{$\bx^{(2)}$}}
\put(37,94){\makebox(0,2.5)[bl]{$\bx^{(4)}$}}
% JT set
\multiput(2,88)(2,-4){21}{\circle*{\rad}}
\multiput(2,86)(2,-4){21}{\circle*{\rad}}
\multiput(0,86)(2,-4){22}{\circle*{\rad}}
\multiput(0,88)(2,-4){22}{\circle*{\rad}}
\multiput(0,82)(2,-4){21}{\circle*{\rad}}
\multiput(0,84)(2,-4){21}{\circle*{\rad}}
%
%\put(21,96){\makebox(0,0)[b]{$\A_{X}^N$}}
%\put(-12,45){\makebox(0,0)[r]{$\A_{Y}^N$}}
\end{picture}
&
\setlength{\unitlength}{0.81mm}
\begin{picture}(78,100)(-15,-5)
%
%\put(-10,-2){\framebox(62,94){}}
% codewords
\put(5,0){\framebox(2,91){}}
\put(13,0){\framebox(2,91){}}
\put(31,0){\framebox(2,91){}}
\put(35,0){\framebox(2,91){}}
%
\put(5,94){\makebox(0,2.5)[bl]{$\bx^{(3)}$}}
\put(13,94){\makebox(0,2.5)[bl]{$\bx^{(1)}$}}
\put(29,94){\makebox(0,2.5)[bl]{$\bx^{(2)}$}}
\put(37,94){\makebox(0,2.5)[bl]{$\bx^{(4)}$}}
%
% decodings
\put(-13,10){\makebox(0,0)[r]{$\by_c$}}
\put(-13,20){\makebox(0,0)[r]{$\by_d$}}
\put(-13,72){\makebox(0,0)[r]{$\by_b$}}
\put(-13,82){\makebox(0,0)[r]{$\by_a$}}
\put(-11.3,10){\vector(1,0){63}}
\put(-11.3,20){\vector(1,0){63}}
\put(-11.3,72){\vector(1,0){63}}
\put(-11.3,82){\vector(1,0){63}}
\put(54,10){\makebox(0,0)[l]{$\hat{\cwm}(\by_c)\eq 4$}}% was 10,
\put(54,20){\makebox(0,0)[l]{$\hat{\cwm}(\by_d)\eq 0$}}% was 25,
\put(54,72){\makebox(0,0)[l]{$\hat{\cwm}(\by_b)\eq 3$}}
\put(54,82){\makebox(0,0)[l]{$\hat{\cwm}(\by_a)\eq 0$}}
% top end
%
% JT set
\multiput(2,88)(2,-4){21}{\circle*{\rad}}
\multiput(2,86)(2,-4){21}{\circle*{\rad}}
\multiput(0,86)(2,-4){22}{\circle*{\rad}}
\multiput(0,88)(2,-4){22}{\circle*{\rad}}
\multiput(0,82)(2,-4){21}{\circle*{\rad}}
\multiput(0,84)(2,-4){21}{\circle*{\rad}}
%
%\put(21,96){\makebox(0,0)[b]{$\A_{X}^N$}}
%\put(-12,45){\makebox(0,0)[r]{$\A_{Y}^N$}}
\end{picture}
\\
(a) & (b) \\
\end{tabular}
\end{center}
}{%
\caption[a]{(a) {A \ind{random code}.}
% A random code is a selection of input
% sequences $\{ \bx^{(1)}, \ldots, \bx^{(\cwM)}\}$ from the ensemble
% $X^N$. Each codeword
% $\bx^{(\cwm)}$ is likely to be a typical sequence.
% [Compare with \protect\figref{fig.extended.bec}b,
% page \protect\pageref{fig.extended.bec}.]
(b) {Example decodings by the typical set decoder.} A sequence that is not
jointly typical
with any of the codewords, such as $\by_a$, is decoded as $\hat{\cwm}=0$.
A sequence that is jointly typical
with codeword $\bx^{(3)}$ alone, $\by_b$, is decoded as $\hat{\cwm}=3$.
Similarly, $\by_c$ is decoded as $\hat{\cwm}=4$.
A sequence that is jointly typical
with more than one codeword, such as
$\by_d$, is decoded as $\hat{\cwm}=0$.
% [Compare with \protect\figref{fig.extended.bec}c,
% page \protect\pageref{fig.extended.bec}.]
}
\label{fig.rand.code}
\label{fig.typ.set.dec}
}
\end{figure}
\subsection{Random coding and typical-set decoding}
Consider the following encoding--decoding
system, whose rate is $R'$.\index{random code}
\ben
\item
We fix $P(x)$ and generate the $\cwM = 2^{NR'}$
 codewords of an $(N,NR')=(N,K)$
code $\C$
at random according to
\beq
P(\bx) = \prod_{n=1}^{N} P(x_n) .
\eeq
A random code is shown schematically in \figref{fig.rand.code}a.
\item
The code is known to both sender and receiver.
\item
A message $\cwm$ is chosen from $\{1,2,\ldots, 2^{NR'}\}$, and $\bx^{(\cwm )}$
is transmitted. The received signal is $\by$, with
\beq
P(\by \given \bx^{(\cwm )} ) = \prod_{n=1}^{N} P(y_n \given x^{(\cwm )}_n) .
\eeq
\item
The signal is decoded by {\dem{typical-set decoding}\index{typical-set decoder}}.
\begin{description}
\item[Typical-set decoding\puncspace] Decode
$\by$ as $\hat{\cwm }$ {\sf if}
$(\bx^{(\hat{\cwm })},\by)$ are jointly typical {\em
and\/} there is no other $\cwm' $ such that $(\bx^{(\cwm')},\by)$ are jointly
typical;\\
{\sf otherwise} declare a failure $(\hat{\cwm }\eq 0)$.
\end{description}
This is not
the optimal decoding algorithm, but it will be good enough, and easier
to \analyze. The typical-set decoder is illustrated in
\figref{fig.typ.set.dec}b.
\item
A decoding error occurs if $\hat{\cwm } \not = \cwm $.
\een
There are three probabilities of error that we can distinguish.
First, there is the probability of block error for a particular
code $\C$, that is,
\beq
p_{\rm B}(\C) \equiv P(\hat{\cwm } \neq \cwm \given \C).
\eeq
This is
a difficult quantity to evaluate for any given code.
Second, there is the average over all codes of this block error probability,
\beq
\langle p_{\rm B} \rangle \equiv \sum_{\C} P(\hat{\cwm } \neq \cwm \given \C)
P(\C) .
\eeq
Fortunately, this quantity is much easier to evaluate than
the first quantity $P(\hat{\cwm } \neq \cwm \given \C)$.%
\marginpar{\small\raggedright{$\langle p_{\rm B} \rangle$
is just the probability that there is a decoding error
at step 5 of the process on the previous page.}}
Third, the maximal block error probability of a code $\C$,
\beq
p_{\rm BM}(\C) \equiv \max_{\cwm } P(\hat{\cwm } \neq \cwm \given \cwm, \C),
\eeq
is the quantity we are most interested in: we wish to show
that there exists a code $\C$ with the required rate
whose maximal block error probability is small.
We will get to this result by first finding the
average block error probability, $\langle p_{\rm B} \rangle$.
Once we have shown that this can be made smaller than
a desired small number, we immediately deduce that
there must exist {\em at least one\/} code $\C$
whose block error probability is also less than this
small number. Finally, we show that this code, whose
block error probability is satisfactorily small but whose
maximal block error probability is unknown (and could
conceivably be enormous), can be
modified to make a code of slightly smaller rate whose
maximal block error probability
is also guaranteed to be small.
We modify the code by throwing away the worst 50\%
of its codewords.
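 The effect of this step can be checked with a short counting argument.
 The symbols $p_{\cwm}$ and $\epsilon$ are introduced only for this sketch,
 and the messages are assumed to be equiprobable.
 Let $p_{\cwm} \equiv P(\hat{\cwm } \neq \cwm \given \cwm, \C)$, so that
\beq
	p_{\rm B}(\C) = \frac{1}{\cwM} \sum_{\cwm = 1}^{\cwM} p_{\cwm} < \epsilon .
\eeq
 If any codeword in the retained half had $p_{\cwm} \geq 2 \epsilon$,
 then every one of the $\cwM/2$ discarded codewords would also have
 $p_{\cwm} \geq 2 \epsilon$, and so
\beq
	p_{\rm B}(\C) \geq \frac{1}{\cwM} \cdot \frac{\cwM}{2} \cdot 2 \epsilon = \epsilon ,
\eeq
 a contradiction. Hence every retained codeword has $p_{\cwm} < 2 \epsilon$.
 Removing codewords cannot make the typical-set decoder worse for the
 codewords that remain, since they have fewer rivals, and the rate falls only
 from $R'$ to $\frac{1}{N} \log_2 (\cwM/2) = R' - 1/N$.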
We therefore now embark on finding the average probability of block error.
\subsection{Probability of error of typical-set decoder}
There are two sources of error when we use typical-set
decoding. Either (a) the output $\by$ is not jointly typical with the
transmitted codeword $\bx^{(\cwm )}$, or (b) there is some other codeword
in $\cal{C}$ that is
jointly typical with $\by$.
By the symmetry of the code construction, the average probability of error
averaged over all codes does not depend on the selected value of $\cwm$; we can
assume without loss of generality that $\cwm=1$.
(a) The probability that the input $\bx^{(1)}$ and
the output $\by$ are not jointly typical
vanishes, by the joint typicality theorem's first part
(\pref{theorem.jtt}).
We give a name, $\delta$, to the upper bound on this probability,
% .
satisfying $\delta
\rightarrow 0$ as $N \rightarrow \infty$; for any desired $\delta$,
 we can find a blocklength $N(\delta)$ such that $P( (\bx^{(1)},\by) \not \in
 \JNb) \leq \delta$.
(b) The probability that $\bx^{(\cwm')}$ and $\by$
% $(\bx^{(\cwm' )},\by)$
are jointly typical, for
 a {\em given\/} $\cwm' \not = 1$,
 is $\leq 2^{-N(\I(X;Y)-3 \beta)}$, by part 3 of the joint typicality theorem.
And there are $(2^{NR'}-1)$ rival values of $\cwm'$ to worry about.
Thus the average probability of error $\langle p_{\rm B} \rangle$
satisfies:
\beqan
\langle p_{\rm B} \rangle &\leq &
\delta + \sum_{\cwm' =2}^{2^{NR'}} 2^{-N(\I(X;Y)-3 \beta)}
\label{eq.uniona}
\\
&\leq &
\delta + 2^{-N(\I(X;Y)- R' -3 \beta)} .
\label{eq.unionaa}
\eeqan
% MARGINPAR should align with the eqn if possible (above)
\begin{aside}
{The inequality (\ref{eq.uniona}) that bounds a
total probability of error $P_{\rm TOT}$ by the sum of the probabilities $P_{s'}$ of
all sorts of events $s'$ each of which is sufficient to cause error,
$$P_{\rm TOT} \leq P_1 + P_2 + \cdots, $$
is called a {\dem\ind{union bound}}. It is only an equality if the different events
that cause error never occur at the same time as each other.
}
\end{aside}
The average probability of error (\ref{eq.unionaa})
can be made $< 2 \delta$ by increasing $N$ if
% {\em if\/}
\beq
R' < \I(X;Y) -3 \beta .
\eeq
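 To see where the second inequality, (\ref{eq.unionaa}), comes from:
 the sum in (\ref{eq.uniona}) has $2^{NR'}-1$ identical terms, so
\beq
	\sum_{\cwm' =2}^{2^{NR'}} 2^{-N(\I(X;Y)-3 \beta)}
	= (2^{NR'}-1) \, 2^{-N(\I(X;Y)-3 \beta)}
	\leq 2^{-N(\I(X;Y)- R' -3 \beta)} .
\eeq
 As an illustration with made-up numbers, take $\I(X;Y) = 0.53$ bits,
 $R' = 0.4$ and $\beta = 0.01$; the exponent is then $-0.1 N$, so
 for $N = 1000$ this term is about $2^{-100}$.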
We are almost there. We make three modifications:
\newcommand{\expurgfig}[1]{%
\hspace*{-0.3in}\raisebox{-0.975in}[2.05in][0pt]{\psfig{figure=figs/expurgate#1.ps,width=3.2in}}\hspace*{-0.3in}}
\begin{figure}
\figuremargin{
%\marginfig{
\begin{center}\small
\begin{tabular}{c@{}c@{}c}
\expurgfig{1}
&$\Rightarrow$ &
\expurgfig{2}
\\
(a) A random code $\ldots$ & &
(b) expurgated \\
\end{tabular}
\end{center}
}{
\caption[a]{How expurgation works.
(a) In a typical random code, a small fraction of the
codewords are involved in collisions -- pairs of codewords are sufficiently
close to each other that the probability of error when either codeword
is transmitted is not tiny.
We obtain a new code from a random code by deleting
all these confusable codewords.
(b) The resulting code has slightly fewer codewords, so
has a slightly lower rate, and its maximal probability of error
is greatly reduced.
}
\label{fig.expurgate}
}
\end{figure}
% \newcommand{\optens}{optimal input distribution}
\ben
\item
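 We choose $P(x)$ in the proof to be the \optens\ of the channel.
 Then the condition $R'<\I(X;Y) -3 \beta$ becomes $R' < C - 3 \beta$.
\item
 Since the average probability of error over all codes is $< 2 \delta$,
 there must exist a code with mean probability of block error
 $p_{\rm B}(\C) < 2 \delta$.
\item
 To show that the maximal probability of error $p_{\rm BM}$ can also be
 made small, we modify this code by throwing away the worst half of
 its codewords, the ones most likely to produce errors
 (\figref{fig.expurgate}). Those that remain must all have
 conditional probability of error less than $4 \delta$.
 The new code has $2^{NR'-1}$ codewords, so its rate is reduced
 from $R'$ to $R' - 1/N$, a negligible reduction if $N$ is large,
 and it has $p_{\rm BM} < 4 \delta$.
\een
 In conclusion, there exists a code of rate $R' - 1/N$, where $R' < C - 3 \beta$,
 with maximal probability of error $< 4 \delta$.
 For the converse, assume that a system achieves a rate $R$ and a bit error
 probability $p_{\rm b}$; then the mutual information $\I(\cwm;\hat{\cwm })$
 is $\geq N R ( 1 - H_2(p_{\rm b}))$.
 By the data-processing inequality and the definition of capacity,
 $\I(\cwm;\hat{\cwm }) \leq \I(\bx;\by) \leq N C$.
 So $\I(\cwm;\hat{\cwm }) > N C$ is not achievable, so $R > \smallfrac{C}{1-H_2(p_{\rm b})}$
 is not achievable.\ENDproof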
\exercisxC{3}{ex.m.s.I.aboveC}{
Fill in the details in the preceding argument.
If the bit errors between $\hat{\cwm }$ and $\cwm$ are independent
then we have $\I(\cwm;\hat{\cwm }) = N R ( 1 - H_2(p_{\rm b}))$. What if
we have complex correlations among those bit errors? Why
does the inequality $\I(\cwm;\hat{\cwm }) \geq
N R ( 1 - H_2(p_{\rm b}))$ hold?
}
\section{Computing capacity\nonexaminable}
\label{sec.compcap}
We\marginpar[c]{\small\raggedright{Sections \ref{sec.compcap}--\ref{sec.codthmpractice}
contain advanced material. The first-time reader is encouraged to
skip to section \ref{sec.codthmex} (\pref{sec.codthmex}).}}
have proved that the capacity of a channel is
the maximum rate at which reliable communication can be achieved.
How can we compute the capacity of a given discrete
memoryless channel?
We need to find its \optens. In general we can find
the \optens\ by a
computer search, making use of the derivative of the mutual information
with respect to the input probabilities.
\exercisxB{2}{ex.Iderivative}{
Find the derivative of $\I(X;Y)$ with respect to the
input probability $p_i$, $\partial \I(X;Y)/\partial p_i$, for a channel with
conditional probabilities $Q_{j|i}$.
}
\exercisxC{2}{ex.Iconcave}{
Show that $\I(X;Y)$ is a \concavefrown\ function of
the input probability vector $\bp$.
}
Since $\I(X;Y)$ is \concavefrown\ in the input distribution $\bp$,
any probability distribution $\bp$ at which
% that has $\partial \I(X;Y)/\partial p_i$
$\I(X;Y)$ is stationary
must be a global maximum of $\I(X;Y)$.
%
So it is tempting to put the derivative of $\I(X;Y)$ into a routine that
finds a local maximum of $\I(X;Y)$, that is, an input distribution
$P(x)$ such that
\beq
\frac{\partial \I(X;Y)}{\partial p_i}
= \lambda \:\:\: \mbox{for all $i$},
\label{eq.Imaxer}
\eeq
where $\lambda$ is a Lagrange multiplier associated with the constraint
$\sum_i p_i = 1$.
However, this approach may fail to find the right
answer, because $\I(X;Y)$ might be maximized
by a distribution that has $p_i \eq 0$ for some inputs.
A simple example is given by the ternary confusion channel.
\begin{description}
%
\item[Ternary confusion channel\puncspace] $\A_X \eq \{0,{\query},1\}$. $\A_Y \eq \{0,1\}$.
\[
\begin{array}{c}
\setlength{\unitlength}{0.46mm}
\begin{picture}(20,30)(0,0)
\put(5,5){\vector(1,0){10}}
\put(5,25){\vector(1,0){10}}
\put(5,15){\vector(1,1){10}}
\put(5,15){\vector(1,-1){10}}
\put(4,5){\makebox(0,0)[r]{1}}
\put(4,25){\makebox(0,0)[r]{0}}
\put(16,5){\makebox(0,0)[l]{1}}
\put(16,25){\makebox(0,0)[l]{0}}
\put(4,15){\makebox(0,0)[r]{{\query}}}
\end{picture}
\end{array}
\begin{array}{c@{\:\:\,}c@{\:\:\,}l}
P(y\eq 0 \given x\eq 0) &=& 1 \,; \\
P(y\eq 1 \given x\eq 0) &=& 0 \,;
\end{array}
\begin{array}{c@{\:\:\,}c@{\:\:\,}l}
P(y\eq 0 \given x\eq {\query}) &=& 1/2 \,; \\
P(y\eq 1 \given x\eq {\query}) &=& 1/2 \,;
\end{array}
\begin{array}{c@{\:\:\,}c@{\:\:\,}l}
P(y\eq 0 \given x\eq 1) &=& 0 \,; \\
P(y\eq 1 \given x\eq 1) &=& 1 .
\end{array}
\]
Whenever the input $\mbox{\query}$ is used, the output is random;
the other inputs are reliable inputs. The maximum information
rate of 1 bit is achieved by making no use of the
input $\mbox{\query}$.
\end{description}
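 A quick check of this claim uses $\I(X;Y) = H(Y) - H(Y|X)$.
 Only the input $\mbox{\query}$ contributes to $H(Y|X)$, and it contributes
 one bit whenever it is used; so, writing $p_{\tt{?}}$ for its probability
 and $p_0$ for the probability of input 0,
\beq
	\I(X;Y) = H_2( p_0 + p_{\tt{?}}/2 ) - p_{\tt{?}} .
\eeq
 The uniform input distribution $(1/3,1/3,1/3)$ gives
 $H_2(1/2) - 1/3 = 2/3$ bits, whereas the distribution $(1/2,0,1/2)$
 gives $H_2(1/2) - 0 = 1$ bit.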
\exercissxB{2}{ex.Iternaryconfusion}{
Sketch the mutual information for this channel as a function of
% $$a\in (0,1)$ and $b\in (0,1)$,
the input distribution $\bp$. Pick a convenient two-dimensional
representation of $\bp$.
}
The \ind{optimization} routine must therefore take account
of the possibility that, as we go up hill on $\I(X;Y)$,
we may run into the inequality constraints $p_i \geq 0$.
\exercissxB{2}{ex.Imaximizer}{
Describe the condition, similar to \eqref{eq.Imaxer}, that is satisfied at a
point where $\I(X;Y)$ is maximized, and describe a computer
program for finding the capacity of a channel.
}
\subsection{Results that may help in finding the \optens}
% The following results
\ben
\item
{All outputs must be used}.
\item
{$\I(X;Y)$ is a \convexsmile\ function of the channel parameters.}\marginpar{\small\raggedright {\sf Reminder:} The term `\convexsmile' means `convex',
and the term `\concavefrown' means `concave'; the little
smile and frown symbols are included simply to remind you what
convex and concave mean.}
\item
{There may be several {\optens}s, but they all look the same at the output.}
\een
%\subsubsection{All outputs must be used\subsubpunc}
\exercisxB{2}{ex.Iallused}{
Prove that no output $y$ is unused by an \optens, unless it is unreachable,
that is, has $Q(y \given x)=0$ for all $x$.
}
%\subsubsection{Convexity of $\I(X,Y)$ with respect to the channel parameters\subsubpunc}
\exercisxC{2}{ex.Iconvex}{
Prove that
 $\I(X;Y)$ is a \convexsmile\ function of $Q(y \given x)$.
}
%\subsubsection{There may be several {\optens}s, but they all look the same at the output\subsubpunc}
\exercisxC{2}{ex.Imultiple}{
Prove that all {\optens}s of a channel have the same output
probability distribution $P(y) = \sum_x P(x)Q(y \given x)$.
}
These results, along with the fact that
$\I(X;Y)$ is a \concavefrown\ function of
the input probability vector $\bp$, prove the validity of
the symmetry argument that we have used when finding
the capacity of symmetric channels.
 If a channel is invariant under a group of symmetry
 operations -- for example, interchanging the
 input symbols and interchanging the output symbols --
 then, given any \optens\ that is not
 symmetric, \ie, is not invariant under these operations,
 we can create another input distribution
 by averaging this \optens\ with all the permuted forms
 obtained by applying the symmetry operations to it.
 The permuted distributions must have the same
 $\I(X;Y)$ as the original, by symmetry, so the
 new input distribution created by averaging must
 have $\I(X;Y)$ bigger than or equal to that of the
 original distribution, because of the concavity
 of $\I$. The averaged distribution is symmetric, so
 there is always a symmetric \optens.
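 In symbols (the notation here is introduced only for this sketch, and
 the group is assumed to be finite): let $\bp^*$ be an \optens,
 let $g_1, \ldots, g_G$ be the symmetry operations, let
 $g_k \bp^*$ denote the correspondingly permuted input distribution, and let
 $\I(\bp)$ denote the mutual information obtained with input distribution $\bp$.
 Defining $\bar{\bp} \equiv \frac{1}{G} \sum_{k=1}^{G} g_k \bp^*$,
 concavity gives
\beq
	\I( \bar{\bp} ) \geq \frac{1}{G} \sum_{k=1}^{G} \I( g_k \bp^* )
	= \I( \bp^* ) = C ,
\eeq
 so the averaged distribution $\bar{\bp}$ also achieves the capacity.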
% see capacity.p
\subsection{Symmetric channels}
\label{sec.Symmetricchannels}
In order to use symmetry arguments, it will help
to have a definition of a symmetric channel.
I like \quotecite{Gallager68}
% Gallager's
definition.\index{Gallager, Robert}
% page 94
%\subsubsection{Gallager's definition of a symmetric channel}
\begin{description}
\item[A discrete memoryless channel is a symmetric channel]
if the set of outputs can be partitioned into subsets
in such a way that for each subset the matrix of
transition probabilities
% (using inputs as columns and outputs in the subset as rows)
has the property
that each row (if more than 1) is a permutation of each other row
and each column is a permutation of each other column.
\end{description}
\exampl{exSymmetric}{
This channel
\beq
\begin{array}{c@{\:\:\,}c@{\:\:\,}l}
P(y\eq 0 \given x\eq 0) &=& 0.7 \,; \\
P(y\eq {\query} \given x\eq 0) &=& 0.2 \,; \\
P(y\eq 1 \given x\eq 0) &=& 0.1 \,;
\end{array}
\begin{array}{c@{\:\:\,}c@{\:\:\,}l}
P(y\eq 0 \given x\eq 1) &=& 0.1 \,; \\
P(y\eq {\query} \given x\eq 1) &=& 0.2 \,; \\
P(y\eq 1 \given x\eq 1) &=& 0.7.
\end{array}
\eeq
is a symmetric channel because
its outputs can be partitioned into $(0,1)$ and ${\query}$, so that
the matrix can be rewritten:
\beq
\begin{array}{cc} \midrule
\begin{array}{ccl}%{c@{}c@{}l}
P(y\eq 0 \given x\eq 0) &=& 0.7 \,; \\
P(y\eq 1 \given x\eq 0) &=& 0.1 \,;
\end{array}
&
\begin{array}{ccl}%{c@{}c@{}l}
P(y\eq 0 \given x\eq 1) &=& 0.1 \,; \\
P(y\eq 1 \given x\eq 1) &=& 0.7 \,;
\end{array}
\\ \midrule
\begin{array}{ccl}%{c@{}c@{}l}
P(y\eq {\query} \given x\eq 0) &=& 0.2 \,; \\
\end{array}
&
\begin{array}{ccl}%{c@{}c@{}l}
P(y\eq {\query} \given x\eq 1) &=& 0.2 . \\
\end{array}
\\ \midrule
\end{array}
%
\eeq
}
Symmetry is a useful property because, as we will see
in a later chapter,
communication at capacity can be achieved over symmetric channels
by {\em{linear}\/} codes.\index{error-correcting code!linear}\index{linear block code}
% that are good codes
%-- a considerable simplification of the task of finding excellent codes.
\exercisxC{2}{ex.Symmetricoptens}{
Prove that for a \ind{symmetric channel} with any
number of inputs,\index{channel!symmetric}
the uniform distribution over the inputs is an {\optens}.
}
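 Taking this result for granted, the capacity of the symmetric channel
 in the example above follows from a uniform input distribution.
 The output distribution is then $P(y) = (0.4, 0.2, 0.4)$ over
 $y \in \{0,{\query},1\}$, and, writing $H(p_1,p_2,p_3)$ for the
 entropy of a three-valued distribution,
\beqan
	C & = & H(Y) - H(Y|X)
\\
	& = & H(0.4,0.2,0.4) - H(0.7,0.2,0.1)
\\
	& \simeq & 1.522 - 1.157 \:\: \simeq \:\: 0.365 \mbox{ bits}.
\eeqan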
\exercissxB{2}{ex.notSymmetric}{
Are there channels that are not symmetric whose {\optens}s are uniform?
Find one, or prove there are none.
}
\section{Other coding theorems}% this star indicates skippable
\label{sec.othercodthm}
The noisy-channel coding theorem that we proved in this chapter
is quite general, applying to any discrete memoryless channel;
but it is not very specific. The theorem only says that
reliable communication with error probability $\epsilon$ and rate $R$
% can be achieved over a channel
can be achieved
by using codes with {\em sufficiently large\/} blocklength $N$.
The theorem does not say how large $N$ needs to be
% as a function
to achieve given values
of $R$ and $\epsilon$.
Presumably, the smaller $\epsilon$ is
and the closer $R$ is to $C$, the larger $N$ has to be.
% The task of proving explicit results about the blocklength
% is challenging and solutions to this problem are considerably
% more complex than the theorem we proved in this chapter.
%\begin{figure}
\marginfig{
\begin{center}
\mbox{\raisebox{0.5in}{$E_{\rm r}(R)$}\psfig{figure=figs/Er.eps,width=0.97in}}
\end{center}
\caption[a]{A typical random-coding exponent.}
\label{fig.Er}
%\end{figure}
}%\end{marginfig}
%
%
\subsection{Noisy-channel coding theorem -- version with
explicit $N$-dependence}
% explicit blocklength dependence}
\index{noisy-channel coding theorem}
\begin{quote}
For a discrete memoryless channel, a blocklength $N$
and a rate $R$, there exist block codes of length $N$
whose average probability of error satisfies:
\beq
p_{\rm B} \leq \exp \left[ -N E_{\rm r}(R) \right]
\label{eq.pbEr}
\eeq
where $E_{\rm r}(R)$ is the {\dem\ind{random-coding exponent}\/}
of the channel, a \convexsmile, decreasing, positive function of $R$
%which
% satisfies
%\beq
% E_{\rm r}(R) > 0 \:\: \mbox{for all $R$ satisfying $0 \leq R < C$} .
%\eeq
for $0 \leq R < C$. The {random-coding exponent}
is also known as the \ind{reliability function}.
[By an \ind{expurgation} argument it can also be shown that
there exist block codes for which the {\em{maximal\/}} probability of error
$p_{\rm BM}$
% , like $p_{\rm B}$ in \eqref{eq.pbEr},
is also exponentially small in $N$.]
\end{quote}
The definition of $E_{\rm r}(R)$ is
given in \citeasnoun{Gallager68}, p.$\,$139.
$E_{\rm r}(R)$ approaches
zero as $R \rightarrow C$; the typical behaviour of this function
is illustrated in \figref{fig.Er}.
The computation of the {random-coding exponent}
for interesting channels is a challenging task
on which much effort has been expended. Even for simple
channels like the \BSC, there is no simple expression for $E_{\rm r}(R)$.
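 When a value of the exponent is available, the bound (\ref{eq.pbEr})
 can be turned round to give a sufficient blocklength:
 $p_{\rm B} \leq \epsilon$ is guaranteed once
\beq
	N \geq \frac{ \ln ( 1/\epsilon ) }{ E_{\rm r}(R) } .
\eeq
 For example, if the exponent at the desired rate were
 $E_{\rm r}(R) = 0.01$ (a made-up value, not computed for any
 particular channel) and $\epsilon = 10^{-6}$, then
 $N \geq 13.82/0.01 = 1382$ would suffice.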
\subsection{Lower bounds on the error probability as a function of blocklength}
The theorem stated above
% gives an upper bound on the error probability:
asserts that there are codes with $p_{\rm B}$ smaller than $\exp \left[ -N E_{\rm r}(R) \right]$.
But how small can the error probability be? Could it be much smaller?
\begin{quote}
For any code with blocklength $N$ on a discrete memoryless channel,
the probability of error assuming all source messages are
used with equal probability satisfies
\beq
p_{\rm B} \gtrsim \exp[ - N E_{\rm sp}(R) ] ,
\eeq
where the function $E_{\rm sp}(R)$,
the {\dem\ind{sphere-packing exponent}\/} of the channel,
is a \convexsmile, decreasing, positive function of $R$
for $0 \leq R < C$.
\end{quote}
For a precise statement of this result and further references,
see \citeasnoun{Gallager68}, \mbox{p.$\,$157}.\index{Gallager, Robert}
\section{Noisy-channel coding theorems and coding practice}
\label{sec.codthmpractice}
Imagine a customer who wants to buy an error-correcting
code and decoder for a noisy channel.
The results described above allow us to offer
the following service: if he tells us the properties of
his channel, the desired rate $R$ and the desired error probability $p_{\rm B}$,
we can, after working out the relevant functions
$C$, $E_{\rm r}(R)$, and $E_{\rm sp}(R)$, advise him
that there exists a solution to his problem using a particular
blocklength $N$; indeed that almost any randomly
chosen code with that blocklength
should do the job. Unfortunately we have
not found out how to implement these encoders
and decoders in practice; the cost of implementing
the encoder and decoder for a random code with large $N$ would
be exponentially large in $N$.
Furthermore, for practical purposes, the customer is unlikely
to know exactly what channel he is dealing with.
% and might be reluctant to specify a desired rate
So \citeasnoun{Berlekamp80} suggests that\index{Berlekamp, Elwyn}
the sensible way to approach error-correction
is to design encoding-decoding systems
and plot their performance on a {\em variety\/}
of idealized channels
as a function of the channel's noise level. These charts (one of which
is illustrated on page \pageref{fig:GCResults})
can then be shown to the customer, who can choose
among the systems on offer without having to
specify what he really thinks his channel is like.
With this attitude to the practical problem, the importance of the
functions $E_{\rm r}(R)$ and $E_{\rm sp}(R)$ is diminished.
%
% put this back somewhere. :
%
%
%\subsection{Noisy-channel coding theorem with errors allowed:
% rate-distortion theory}
% See Gallager p.466$\pm 20$.
%
%\subsection{Special case of linear codes}
% Give Gallager's p.94 definition of a discrete symmetric channel.
% Give coding theorem for linear codes on any symmetric channel
% (including with memory).
%
%\subsection{More general case of
% channels with memory}
%
%\subsection{Finite state channels}
% Channels with and without intersymbol interference and
% with and without noise. (Is it worth discussing these in any
% individual detail, or shall
% I just have a general channels with memory discussion?)
%
% end detour
\section{Further exercises}
\label{sec.codthmex}
\exercisaxA{2}{ex.exam01}{
A binary erasure channel with input $x$ and output $y$
has transition probability matrix:
\[
\bQ = \left[
\begin{array}{cc}
1-q & 0 \\
q & q \\
0 & 1-q
\end{array}
\right]
\hspace{1in}
\begin{array}{c}
\setlength{\unitlength}{0.13mm}
\begin{picture}(100,100)(0,0)
\put(18,0){\makebox(0,0)[r]{\tt 1}}
%
\put(18,80){\makebox(0,0)[r]{\tt 0}}
\put(20,0){\vector(1,0){38}}
\put(20,80){\vector(1,0){38}}
%
\put(20,0){\vector(1,1){38}}
\put(20,80){\vector(1,-1){38}}
%
\put(62,0){\makebox(0,0)[l]{\tt 1}}
\put(62,40){\makebox(0,0)[l]{\tt ?}}
\put(62,80){\makebox(0,0)[l]{\tt 0}}
\end{picture}
\end{array}
\]
Find the {\em{mutual information}\/} $I(X;Y)$ between the input and output
for general input distribution $\{ p_0,p_1 \}$, and show that the
{\em{capacity}\/} of this channel is $C = 1-q$ bits.
\medskip
\item
%\noindent (c)
A Z channel\index{channel!Z channel}
has transition probability matrix:
\[
\bQ = \left[
\begin{array}{cc}
1 & q \\
0 & 1-q
\end{array}
\right]
\hspace{1in}
\begin{array}{c}
\setlength{\unitlength}{0.1mm}
\begin{picture}(100,100)(0,0)
\put(18,0){\makebox(0,0)[r]{\tt 1}}
%
\put(18,80){\makebox(0,0)[r]{\tt 0}}
\put(20,0){\vector(1,0){38}}
\put(20,80){\vector(1,0){38}}
%
\put(20,0){\vector(1,2){38}}
%
\put(62,0){\makebox(0,0)[l]{\tt 1}}
\put(62,80){\makebox(0,0)[l]{\tt 0}}
\end{picture}
\end{array}
\]
Show that, using
a $(2,1)$ code,
% of blocklength 2,
{\bf{two}} uses of a Z channel
can be made to emulate {\bf{one}} use of an erasure channel,
and state the erasure probability of that erasure channel.
Hence show that the
capacity of the Z channel, $C_{\rm Z}$,
satisfies $C_{\rm Z} \geq \frac{1}{2}(1-q)$ bits.
Explain why the result $C_{\rm Z} \geq \frac{1}{2}(1-q)$
is an inequality rather than an equality.
}
\exercissxC{3}{ex.wirelabelling}{
A \ind{transatlantic} cable contains $N=20$ indistinguishable
electrical wires.\index{puzzle!transatlantic cable}\index{puzzle!cable labelling}
You have the job of figuring out which
wire is which, that is,
% Alice and Bob, located at the opposite ends of the
% cable, wish
to create a consistent labelling of the wires at each end.
Your only tools are the ability to connect wires to each other
in groups of two or more, and to test for connectedness with
a continuity tester.
What is the smallest number of transatlantic trips you need to
make, and how do you do it?
How would you solve the problem for larger $N$ such as $N=1000$?
As an illustration, if $N$ were 3 then the task can be solved
in two steps by labelling one wire at one end $a$, connecting the other two together,
crossing the \ind{Atlantic}, measuring which two wires are connected, labelling them
$b$ and $c$ and the unconnected one $a$, then connecting $b$ to $a$
and returning across the Atlantic, whereupon on disconnecting
$b$ from $c$, the identities of $b$ and $c$ can be deduced.
This problem can be solved by persistent search,
but the reason it is posed in this chapter is that it can
also be solved by a greedy approach based on maximizing the acquired
{\em information}.
Let the unknown permutation of wires be $x$.
% , drawn from an ensemble $X$.
Having chosen a set of connections of wires $\cal C$ at one end,
you can then make measurements at the other end, and
these measurements $y$ convey {\em information\/} about $x$.
How much? And for what set of connections is the information that $y$ conveys
about $x$ maximized?
}
\dvips
\section{Solutions}% to Chapter \protect\ref{ch6}'s exercises} % 80,82,84,85,86
% solutions to _l6.tex
%
%\soln{ex.m.s.I.aboveC}{
%%\input{tex/aboveC.tex}
% {\em [More work needed here.]}
%}
%\soln{ex.Iderivative}{
%% Find derivative of $I$ w.r.t $P(x)$.
% Get a specific mutual information
% like object minus $\log e$.
%}
%\soln{ex.Iconcave}{
% $\I(X,Y) = \sum_{x,y} P(x) Q(y|x) \log \frac{Q(y|x)}{P(x)Q(y|x)}$
% is a \concavefrown\ function of $P(x)$.
% Easy Proof in Gallager p.90, using \verb+z->x->y+, where $z$ chooses
% between the two things we are mixing.
% This satisfies $I(X;Y|Z) = 0$
% (data processing inequality).
%}
\soln{ex.Iternaryconfusion}%
{
\marginpar{\[
\begin{array}{c}
\setlength{\unitlength}{1mm}
\begin{picture}(20,30)(0,0)
\put(5,5){\vector(1,0){8}}
\put(5,25){\vector(1,0){8}}
\put(5,15){\vector(1,1){8}}
\put(5,15){\vector(1,-1){8}}
\put(10,18){\makebox(0,0)[l]{\dhalf}}
\put(10,12){\makebox(0,0)[l]{\dhalf}}
\put(4,5){\makebox(0,0)[r]{\tt1}}
\put(4,25){\makebox(0,0)[r]{\tt0}}
\put(16,5){\makebox(0,0)[l]{\tt1}}
\put(16,25){\makebox(0,0)[l]{\tt0}}
\put(4,15){\makebox(0,0)[r]{\tt{?}}}
\end{picture}
\end{array}
\]
}
If the input distribution is $\bp=(p_0,p_{\tt{?}},p_1)$,
the mutual information is
\beq
I(X;Y) = H(Y) - H(Y|X)
= H_2(p_0 + p_{{\tt{?}}}/2) - p_{{\tt{?}}} .
\eeq
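As a quick check of the two terms (a sketch, reading the channel off the diagram in the margin:
inputs {\tt0} and {\tt1} are received without error, and input {\tt{?}} is received
as {\tt0} or {\tt1} with probability $1/2$ each), the output is binary with
\beq
 P(y \eq {\tt0}) = p_0 + p_{{\tt{?}}}/2 ,
\eeq
 so $H(Y) = H_2(p_0 + p_{{\tt{?}}}/2)$; and only the input {\tt{?}} leaves any
 uncertainty about the output, so
\beq
 H(Y \given X) = p_0 \times 0 + p_{{\tt{?}}} \, H_2(1/2) + p_1 \times 0 = p_{{\tt{?}}} .
\eeq
 These two terms give the expression for $I(X;Y)$ above.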
We can build
a good sketch of this function in two ways:
by careful inspection of the function, or
by looking at special cases.
For the plots, the two-dimensional
representation of $\bp$ I will use
has $p_0$ and $p_1$ as the independent
variables, so that $\bp=(p_0,p_{\tt{?}},p_1) = (p_0,(1-p_0-p_1),p_1)$.
\medskip
\noindent {\sf By inspection.}
If we use the quantities $p_* \equiv p_0 + p_{{\tt{?}}}/2$ and $p_{\tt{?}}$
as our two degrees of freedom, the mutual information becomes
very simple: $I(X;Y) = H_2(p_*) - p_{{\tt{?}}}$. Converting back to
$p_0 = p_* - p_{{\tt{?}}}/2$ and $p_1 = 1 - p_* - p_{{\tt{?}}}/2$,
we obtain the sketch shown at the left below.
 This function is like a tunnel rising in the direction of
increasing $p_0$ and $p_1$.
To obtain the required plot of $I(X;Y)$ we have to strip
away the parts of this tunnel that live outside
the feasible \ind{simplex} of
probabilities; we do this by redrawing the surface,
showing only the parts where $p_0>0$
and $p_1>0$. A full plot of the function is shown at the right.
\medskip
\begin{center}
\mbox{%
\hspace*{2.3in}%
\makebox[0in][r]{\raisebox{0.3in}{$p_0$}}%
\hspace*{-2.3in}%
\raisebox{0in}[1.9in]{\psfig{figure=figs/confusion.view1.ps,angle=-90,width=3.62in}}%
\hspace{-0.3in}%
\makebox[0in][r]{\raisebox{0.87in}{$p_1$}}%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\hspace*{2.3in}%
\makebox[0in][r]{\raisebox{0.3in}{$p_0$}}%
\hspace*{-2.3in}%
\raisebox{0in}[1.709in]{\psfig{figure=figs/confusion.view2.ps,angle=-90,width=3.62in}}%
\hspace{-0.3in}%
\makebox[0in][r]{\raisebox{0.87in}{$p_1$}}%
}\\[-0.3in]
\end{center}
\medskip
\noindent {\sf Special cases.}
In the special case $p_{{\tt{?}}}=0$, the channel is a noiseless
binary channel, and $I(X;Y) = H_2(p_0)$.
In the special case $p_0=p_1$, the term $H_2(p_0 + p_{{\tt{?}}}/2)$ is equal to 1,
so $I(X;Y) = 1-p_{{\tt{?}}}$.
In the special case $p_0=0$, the channel is a Z channel with error
probability 0.5. We know how to sketch that, from the previous chapter
(\figref{hxyz}).
\amarginfig{c}{\small% skeleton fixed Thu 10/7/03
\begin{center}% was -0.51in until Sat 24/5/03
\hspace*{-0.31in}\mbox{%
\hspace*{1.62in}%
\makebox[0in][r]{\raisebox{0.25in}{$p_0$}}%
\hspace*{-1.62in}%
{\psfig{figure=figs/confusion.skel.ps,angle=-90,width=2.5in}}%was 3in
\hspace{-0.3in}%
\makebox[0in][r]{\raisebox{0.77in}{$p_1$}}}\vspace{-0.2in}%
\end{center}
\caption[a]{Skeleton of the mutual information for the ternary confusion channel.}
\label{fig.skeleton}
}% end marginpar
These special cases allow us to construct the skeleton shown
in \figref{fig.skeleton}.
% below.
}
\soln{ex.Imaximizer}{
Necessary and sufficient conditions for $\bp$ to maximize
$\I(X;Y)$ are
\beq
\left.
\begin{array}{rclcc}
\frac{\partial \I(X;Y)}{\partial p_i} & =& \lambda & \mbox{and} & p_i>0 \\[0.05in]
\frac{\partial \I(X;Y)}{\partial p_i} & \leq & \lambda & \mbox{and} & p_i=0 \\
\end{array} \right\}
\:\:\: \mbox{for all $i$},
\label{eq.IequalsC}
\eeq
where $\lambda$ is a constant related to the capacity by $C = \lambda + \log_2 e$.
This result can be used in a computer program that evaluates the
derivatives, and increments and decrements the probabilities $p_i$
in proportion to the differences between those derivatives.
This result is also useful for lazy human capacity-finders
who are good guessers. Having guessed the \optens, one can
simply confirm that \eqref{eq.IequalsC} holds.
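
 As an illustration of this guess-and-confirm procedure (a sketch; the binary symmetric
 channel with flip probability $f$ and the guess $\bp = (1/2,1/2)$ are chosen here
 only as an example), differentiating $\I(X;Y) = H(Y) - H(Y \given X)$ gives
\beq
 \frac{\partial \I(X;Y)}{\partial p_i} =
 \sum_y P(y \given x_i) \log_2 \frac{P(y \given x_i)}{P(y)} - \log_2 e ,
\eeq
 where $x_i$ denotes the $i$th input.
 With the uniform guess, $P(y) = 1/2$ for both outputs, so for either input the sum is
 $(1-f) \log_2 2(1-f) + f \log_2 2f = 1 - H_2(f)$.
 The two derivatives are equal, so \eqref{eq.IequalsC} holds with
 $\lambda = 1 - H_2(f) - \log_2 e$, and we recover $C = \lambda + \log_2 e = 1 - H_2(f)$.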
}
%\soln{ex.Iallused}{
% coming
%}
%\soln{ex.Iconvex}{
% Easy Proof, using \verb+(x,z)->y+.
%}
%\soln{ex.Imultiple}{
%% If there are several \optens, they all give the same
%% output probability (theorem). This is a general proof that
%% the `by symmetry' argument is valid.
% coming
%}
%\soln{ex.Symmetricoptens}{
% This can be proved by the symmetry argument given in the chapter.
%
% Alternatively see p.94 of Gallager.
%}
\soln{ex.notSymmetric}{
We certainly expect nonsymmetric channels with uniform {\optens}s to exist, since
when inventing a channel we have $I(J-1)$
degrees of freedom whereas
the \optens\ is just $(I-1)$-dimensional;
so in the $I(J\!-\!1)$-dimensional
space of perturbations around a symmetric channel,
we expect there to be a
subspace of perturbations of dimension
$I(J-1)-(I-1) = I(J-2)+1$
that leave the \optens\ unchanged.
Here is an explicit example, a bit like a Z channel.
\beq \bQ =
\left[
\begin{array}{cccc}
0.9585 & 0.0415 & 0.35 & 0.0 \\
0.0415 & 0.9585 & 0.0 & 0.35 \\
0 & 0 & 0.65 & 0 \\
0 & 0 & 0 & 0.65 \\
\end{array}
\right]
\eeq
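 A rough numerical check of this example (a sketch, assuming that each column of $\bQ$
 is the distribution of the output given one input, and using the fact that
 \eqref{eq.IequalsC} requires the quantity
 $D_i \equiv \sum_y P(y \given x_i) \log_2 [ P(y \given x_i) / P(y) ]$
 to be the same for every input that is used):
 the uniform input distribution gives the output distribution
 $(0.3375, 0.3375, 0.1625, 0.1625)$, and
\beqan
 D_1 = D_2 &=& 0.9585 \log_2 \frac{0.9585}{0.3375} + 0.0415 \log_2 \frac{0.0415}{0.3375}
 \: \simeq \: 1.318 ,
 \nonumber \\
 D_3 = D_4 &=& 0.35 \log_2 \frac{0.35}{0.3375} + 0.65 \log_2 \frac{0.65}{0.1625}
 \: \simeq \: 1.318 .
 \nonumber
\eeqan
 The two values agree to within about $0.001$ bits, which is as close as the
 four-figure entries of $\bQ$ allow.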
}
% removed to cutsolutions.tex
% \soln{ex.exam01}{
\soln{ex.wirelabelling}{
The labelling problem can be solved for any $N>2$ with
just two trips, one each way across the Atlantic.
The key step in the information-theoretic approach to
this problem is to write down
the information content of
one {\dem\ind{partition}}, the combinatorial object that is the connecting
together of subsets of wires.
If $N$ wires are grouped together into
$g_1$ subsets of size $1$,
$g_2$ subsets of size $2$, $\ldots,$
% $g_r$ groups of size $r$ $\ldots,$
then the number of such partitions is
\beq
\Omega = \frac{ N! }{\displaystyle \prod_r \left( r! \right)^{g_r} g_r! } ,
\eeq
and the information content of one such \ind{partition} is the $\log$ of this quantity.
In a greedy strategy we choose the first partition to maximize this information
content.
One game we can play is to maximize this information content
with respect to the quantities $g_r$, treated as real numbers, subject to the
constraint $\sum_r g_r r = N$.
Introducing a \ind{Lagrange multiplier} $\l$ for the constraint,
the derivative is
\beq
\frac{ \partial }{\partial g_r} \left( \log \Omega + \l \sum_r g_r r \right)
= - \log r! - \log g_r + \l r ,
\eeq
which, when set to zero, leads to the rather nice expression
\beq
g_r = \frac{ e^{\l r} }{ r! } ;
% \:\:(r \geq 1)
\eeq
the optimal $g_r$ is
proportional to a \ind{Poisson distribution}\index{distribution!Poisson}!
We can solve for the Lagrange multiplier by plugging $g_r$ into the
constraint $\sum_r g_r r = N$, which gives the implicit
equation
\beq
N = \mu \, e^{\mu},
\eeq
where $\mu \equiv e^{\l}$ is a convenient reparameterization of the
Lagrange multiplier.
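
 To spell out this step (a short check that uses only the power series for $e^{\mu}$):
 with $g_r = \mu^r / r!$,
\beq
 \sum_{r \geq 1} r \, g_r = \sum_{r \geq 1} \frac{ \mu^{r} }{ (r-1)! }
 = \mu \sum_{r \geq 1} \frac{ \mu^{r-1} }{ (r-1)! } = \mu \, e^{\mu} .
\eeq
 For $N=20$, trying $\mu = 2.2$ gives $\mu \, e^{\mu} \simeq 2.2 \times 9.03 \simeq 19.9$,
 close to the required 20.
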
\Figref{fig.atlantic}a shows a graph of $\mu(N)$;
\figref{fig.atlantic}b
shows the deduced non-integer assignments $g_r$ when $\mu=2.2$,
and nearby integers $g_r = \{1,2,2,1,1\}$
that motivate setting the first partition to
(a)(bc)(de)(fgh)(ijk)(lmno)(pqrst).
\marginfig{\footnotesize
\begin{center}\hspace*{-0.2in}
\begin{tabular}{r@{\hspace{0.2in}}l}
(a)&\mbox{\psfig{figure=figs/atlanticmuN.ps,width=1.5in,angle=-90}}\\[0.2in]
(b)&\mbox{\psfig{figure=figs/atlanticpoi.ps,width=1.5in,angle=-90}}\\
\end{tabular}
\end{center}
\caption[a]{Approximate solution of the \index{cable labelling}{cable-labelling} problem
using Lagrange multipliers.
(a) The parameter $\mu$ as a function of $N$; the value $\mu(20) = 2.2$ is highlighted.
(b) Non-integer values of the function $g_r = \dfrac{ \mu^{r} }{ r! }$
are shown by lines and
integer values of $g_r$ motivated by those non-integer values are
shown by crosses.
}
\label{fig.atlantic}
}
This partition produces a random partition at the other
end, which has an information content of $\log \Omega =40.4\ubits$,
% pr log(20!*1.0/( (2!)**2 * 2 * (3!)**2 * 2 * (4!) * (5!) ) )/log(2.0)
% pr log(20!*1.0/( (2!)**10 * 10! ))/log(2.0)
which is a lot more than half the total information content
we need to acquire to infer the transatlantic permutation, $\log 20! \simeq 61\ubits$.
[In contrast, if all the wires are joined together in pairs,
the information content generated
is only about 29$\ubits$.]
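
 These figures can be reproduced by hand (a sketch of the arithmetic, with logarithms to base 2).
 For $g_r = \{1,2,2,1,1\}$ the denominator of $\Omega$ is
\beq
 \prod_r \left( r! \right)^{g_r} g_r!
 = 1 \times \left[ (2!)^2 \, 2! \right] \times \left[ (3!)^2 \, 2! \right] \times 4! \times 5!
 = 8 \times 72 \times 24 \times 120 = 1\,658\,880 ,
\eeq
 so $\log \Omega \simeq 61.1 - 20.7 = 40.4\ubits$.
 For ten pairs the denominator is $(2!)^{10} \, 10! \simeq 3.7 \times 10^{9}$,
 so $\log \Omega \simeq 61.1 - 31.8 = 29.3\ubits$.
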
How to choose the second partition is left
to the reader. A Shannonesque approach is appropriate, picking a
random partition at the other end, using the same $\{g_r\}$; you
need to ensure the two partitions are as unlike each other as possible.
If $N \neq 2$, 5 or 9, then the labelling problem
has solutions
that are particularly simple to implement,
called \ind{Knowlton--Graham partitions}:
partition $\{1,\ldots,N\}$ into disjoint sets in two ways $A$
% $A_1,\ldots,A_p$ and
and $B$,
% $B_1,\ldots,B_q$,
subject to the condition that at most one element appears
both in an $A$~set of cardinality~$j$ and in a $B$~set of cardinality~$k$, for
each $j$ and~$k$ \cite{Graham66,GrahamKnowlton68}.\index{Graham, Ronald L.}
% (R. L. Graham, ``On partitions of a finite set,'' {\sl Journal of Combinatorial
% Theory\/ \bf 1} (1966), 215--223;\index{Graham, Ronald L.}
% Ronald L. Graham and Kenneth C. Knowlton, ``Method of identifying conductors in
% a cable by establishing conductor connection groupings at both ends of the
% cable,'' U.S. Patent 3,369,177 (13~Feb 1968).)
}
%%%%%%%%%%%%%%%%%%%%%%%%
%
% end of chapter
%
%%%%%%%%%%%%%%%%%%%%%%%%
%
\dvipsb{solutions noisy channel s6}
%
%
% CHAPTER 12 (formerly 7)
%
\prechapter{About Chapter}
\fakesection{prerequisites for chapter 7}
Before reading \chref{ch.ecc}, you should have read Chapters
\ref{ch.five} and \ref{ch.six}.
You will also need to be familiar with the {\dem\inds{Gaussian distribution}}.
\label{sec.gaussian.props}
\begin{description}
\item[One-dimensional Gaussian distribution\puncspace]
If a
random variable $y$ is Gaussian and has mean $\mu$ and variance $\sigma^2$,
which we write:
\beq
y \sim \Normal(\mu,\sigma^2) ,\mbox{ or } P(y) = \Normal(y;\mu,\sigma^2) ,
\eeq
then the distribution of $y$ is:
% a Gaussian distribution:
\beq
P(y\given \mu,\sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}}
\exp \left[ - ( y - \mu )^2 / 2 \sigma^2 \right] .
\eeq
[I use the symbol $P$ for both probability densities and
probabilities.]
The inverse-variance $\tau \equiv \dfrac{1}{\sigma^2}$ is sometimes
called the {\dem\inds{precision}\/} of the Gaussian distribution.
\item[Multi-dimensional Gaussian distribution\puncspace]
If $\by = (y_1,y_2,\ldots,y_N)$ has a \ind{multivariate Gaussian} {distribution}, then
\beq
P( \by \given \bx, \bA ) = \frac{1}{Z(\bA)} \exp \left( - \frac{1}{2}
(\by -\bx)^{\T} \bA (\by -\bx) \right) ,
\eeq
where $\bx$ is the mean of the distribution,
$\bA$ is the inverse of the \ind{variance--covariance matrix}\index{covariance matrix}, and
the normalizing constant is ${Z(\bA)} = \left( { {\det}\! \left( \linefrac{\bA}{2 \pi}
\right) } \right)^{-1/2}$.
This distribution has the property that
the variance $\Sigma_{ii}$ of $y_i$, and the covariance $\Sigma_{ij}$ of
$y_i$ and $y_j$
are given by
\beq
\Sigma_{ij} \equiv \Exp \left[ ( y_i - \bar{y}_i ) ( y_j - \bar{y}_j ) \right]
= A^{-1}_{ij} ,
\eeq
where $\bA^{-1}$ is the inverse of the matrix $\bA$.
The marginal distribution $P(y_i)$ of one component $y_i$ is Gaussian;
the joint marginal distribution of any subset of the
components is multivariate-Gaussian;
and the conditional density of any subset, given the values of
another subset, for example, $P(y_i\given y_j)$, is also Gaussian.
\end{description}
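
 As a small illustration of these properties (a sketch; the symbols $a$, $b$ and $c$
 are introduced here only for this example), take $N=2$ and
\beq
 \bA = \left[ \begin{array}{cc} a & b \\ b & c \end{array} \right] ,
 \hspace{0.3in}
 \bA^{-1} = \frac{1}{ac-b^2} \left[ \begin{array}{cc} c & -b \\ -b & a \end{array} \right] ,
\eeq
 with $a>0$ and $ac > b^2$ so that $\bA$ is positive definite.
 The marginal variance of $y_1$ is $\Sigma_{11} = c/(ac-b^2)$, whereas the
 conditional distribution $P(y_1 \given y_2)$ has variance $1/a$;
 the two agree only when $b=0$.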
%\chapter{Error correcting codes \& real channels}
% ampersand used to keep the title on one line on the chapter's opening page
\ENDprechapter
\chapter[Error-Correcting Codes and Real Channels]{Error-Correcting Codes \& Real Channels}
\label{ch.ecc}\label{ch7}
% % : l7.tex -- was l78.tex
% \setcounter{chapter}{6}% set to previous value
% \setcounter{page}{70} % set to current value
% \setcounter{exercise_number}{89} % set to imminent value
% %
% \chapter{Error correcting codes \& real channels}
% \label{ch7}
The noisy-channel coding theorem that we have proved shows that there
exist reliable
% `very good'
error-correcting codes for any noisy channel.
In this chapter we address two questions.
First, many practical channels have real, rather than discrete,
inputs and outputs. What can Shannon tell us about
these continuous channels? And how should digital signals be
mapped into analogue waveforms, and {\em vice versa}?
Second, how are practical error-correcting codes
made, and what is achieved in practice, relative to the
possibilities proved by Shannon?
\section{The Gaussian channel}
The most popular model of a real-input, real-output
channel is the \inds{Gaussian channel}.\index{channel!Gaussian}
\begin{description}
\item[The Gaussian channel] has a real input $x$ and a real output $y$.
The conditional distribution of $y$ given $x$ is a Gaussian distribution:
\beq
P(y\given x) = \frac{1}{\sqrt{2 \pi \sigma^2}}
\exp \left[ - ( y - x )^2 / 2 \sigma^2 \right] .
\label{eq.gaussian.channel.def}
\eeq
%
This channel has a continuous input and output but is discrete
in time.
We will show below that certain continuous-time channels
are equivalent to the discrete-time Gaussian channel.
This channel is sometimes called the additive white Gaussian noise (AWGN)
channel.\index{channel!AWGN}\index{channel!Gaussian}\index{AWGN}
\end{description}
% Why is this a useful channel model? And w
As with discrete channels, we will discuss
what rate of error-free
information communication can be achieved over this channel.
\subsection{Motivation
% for the Gaussian channel
in terms of a continuous-time channel \nonexaminable}
Consider a physical (electrical, say) channel with inputs and outputs that
are continuous in time. We put in $x(t)$,
% which is a
%% some sort of
% band-limited signal,
and out comes $y(t) = x(t) + n(t)$.
Our transmission has a power cost. The average power of
a transmission of length $T$ may be constrained thus:
\beq
\int_0^T \d t \: [x(t)]^2 / T \leq P .
\eeq
The received signal is assumed to differ from $x(t)$ by additive
noise $n(t)$ (for example \ind{Johnson noise}), which we will model as
white\index{white noise}\index{noise!white}
Gaussian noise. The magnitude of this noise is quantified by the
{\dem noise spectral
density}, $N_0$.\index{noise!spectral density}\index{E$_{\rm b}/N_0$}\index{signal-to-noise ratio}
% , which might depend on the effective temperature of the system.
How could such a channel be used to communicate information?
\amarginfig{t}{
\begin{tabular}{r@{}l}
$\phi_1(t)$&\raisebox{-0.8cm}{\psfig{figure=figs/realchannel/phi1.ps,angle=-90,width=1in}}\\
$\phi_2(t)$&\raisebox{-0.8cm}{\psfig{figure=figs/realchannel/phi2.ps,angle=-90,width=1in}}\\
$\phi_3(t)$&\raisebox{-0.8cm}{\psfig{figure=figs/realchannel/phi3.ps,angle=-90,width=1in}}\\
$x(t)$ &\raisebox{-0.8cm}{\psfig{figure=figs/realchannel/xt.ps,angle=-90,width=1in}}\\
\end{tabular}
%
\caption[a]{Three basis functions, and a
weighted combination of them,
$
x(t) = \sum_{n=1}^N x_n \phi_n(t) ,
$
with $x_1 \eq 0.4$,
$x_2 \eq -0.2$, and $x_3 \eq 0.1$.
% see figs/realchannel.gnu
}
\label{fig.continuousfunctionexample}
}
Consider transmitting a set of $N$ real numbers $\{ x_n \}_{n=1}^N$
in a signal of duration $T$ made up of a weighted combination
of orthonormal basis functions $\phi_n(t)$,
\beq
x(t) = \sum_{n=1}^N x_n \phi_n(t) ,
\eeq
where $\int_0^T \: \d t \: \phi_n(t) \phi_m(t) = \delta_{nm}$.
The receiver can then compute the scalars:
\beqan
y_n \:\: \equiv \:\: \int_0^T \: \d t \: \phi_n(t) y(t)
&=&
x_n + \int_0^T \: \d t \: \phi_n(t) n(t)
\\
&\equiv& x_n + n_n
\eeqan
for $n=1 \ldots N$.
If there were no noise, then $y_n$ would equal $x_n$. The white Gaussian
noise $n(t)$ adds scalar noise $n_n$ to the estimate $y_n$. This noise
is Gaussian:
\beq
n_n \sim \Normal(0,N_0/2),
\eeq
where $N_0$ is the spectral
density introduced above.
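
 To see where this variance comes from, here is a sketch, assuming that `white' noise
 has the autocorrelation $\Exp \left[ n(t) n(t') \right] = (N_0/2) \, \delta(t-t')$
 (this normalization is an assumption, chosen to agree with the variance just quoted):
\beq
 \Exp \left[ n_n n_m \right]
 = \int_0^T \! \d t \int_0^T \! \d t' \: \phi_n(t) \phi_m(t') \, \Exp \left[ n(t) n(t') \right]
 = \frac{N_0}{2} \int_0^T \! \d t \: \phi_n(t) \phi_m(t)
 = \frac{N_0}{2} \, \delta_{nm} .
\eeq
 Under this assumption the $n_n$ are uncorrelated and, being jointly Gaussian,
 independent, each with variance $N_0/2$.
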
% [This is the definition of $N_0$.]
Thus a continuous channel used in this way
is equivalent to the Gaussian channel
defined at \eqref{eq.gaussian.channel.def}.
The power constraint $\int_0^T \d t \, [x(t)]^2 \leq P T$
defines a constraint on the signal amplitudes $x_n$,
\beq
\sum_n x_n^2 \leq PT \hspace{0.5in} \Rightarrow
\hspace{0.5in}
\overline{x_n^2} \leq \frac{PT}{N} .
\eeq
Before returning to the Gaussian channel, we define the {\dbf\ind{bandwidth}}
(measured in \ind{Hertz})
of the \ind{continuous channel}\index{channel!continuous} to be:
\beq
W = \frac{N^{\max}}{2 T},
\eeq
where $N^{\max}$ is the maximum number of orthonormal functions that can be
produced in an interval of length $T$.
This definition can be motivated by imagining creating a
\ind{band-limited signal} of duration $T$ from orthonormal cosine and sine
curves of maximum frequency $W$. The number of orthonormal functions
is $N^{\max} = 2 W T$. This definition relates to the
\ind{Nyquist sampling theorem}: if the highest frequency present in a signal
is $W$, then the signal can
be fully determined from its values at a series of discrete
sample points separated by the Nyquist interval
$\Delta t = \dfrac{1}{2W}$ seconds.
So the use of a real continuous channel with bandwidth $W$, noise spectral
density $N_0$ and power $P$ is equivalent to $N/T = 2 W$ uses per second
of a Gaussian channel with noise level
$\sigma^2 = N_0/2$ and subject to the signal power
constraint $\overline{x_n^2} \leq\dfrac{P}{2W}$.
\subsection{Definition of $E_{\rm b}/N_0$\nonexaminable}
Imagine\index{E$_{\rm b}/N_0$}
that the Gaussian channel $y_n = x_n + n_n$ is used {with
% an
% error-correcting code
an encoding system} to transmit {\em binary\/}
source bits at a rate of $R$ bits per channel use.
% , where a rate of 1 corresponds to the uncoded case.
How can we compare two encoding systems that have different
rates of \ind{communication} $R$ and that use different powers $\overline{x_n^2}$?
Transmitting at a large rate $R$ is good; using small power is
good too.
It is conventional to measure the rate-compensated
\ind{signal-to-noise ratio}
% \marginpar{\footnotesize{I'm using signal to noise ratio in two different ways. Elsewhere it is defined to be $\frac{\overline{x_n^2}}{\sigma^2}$. Should I modify this phrase?}}
by the ratio of the power per source bit $E_{\rm b} = \overline{x_n^2}/R$
to the noise spectral density $N_0$:\marginpar[t]{\small\raggedright
{$E_{\rm b}/N_0$ is dimensionless, but it is usually reported in the units
of \ind{decibels}; the value given is $10 \log_{10} E_{\rm b}/N_0$.}}
\beq
E_{\rm b}/N_0 = \frac{\overline{x_n^2}}{2 \sigma^2 R} .
\eeq
% This signal-to-noise measure equates low rate, low power
% cf ebno.p
% The difference in
$E_{\rm b}/N_0$ is one of the
measures used to compare coding schemes
for Gaussian channels.
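
 For example (illustrative numbers, not taken from any particular system),
 a rate-$1/2$ encoding system using power $\overline{x_n^2} = \sigma^2$ has
\beq
 E_{\rm b}/N_0 = \frac{ \sigma^2 }{ 2 \sigma^2 \times 1/2 } = 1 ,
\eeq
 that is, $10 \log_{10} 1 = 0$ decibels; a system with $R=1$ using the same power
 has $E_{\rm b}/N_0 = 1/2$, which is about $-3$ decibels.
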
\section{Inferring the input to a real channel}
\subsection{`The best detection of pulses'}
\label{sec.pulse}
In 1944
Shannon
wrote a memorandum \cite{shannon44} on the
problem of best differentiating between two types of pulses of known shape,
represented by vectors $\bx_0$ and $\bx_1$, given that one of them has
been transmitted over a noisy channel. This is a
\ind{pattern recognition} problem.%
\amarginfig{t}{
\begin{tabular}{r@{}l}
$\bx_0$&\raisebox{-0.8cm}{\psfig{figure=figs/realchannel/x0.ps,angle=-90,width=1in}}\\
$\bx_1$&\raisebox{-0.8cm}{\psfig{figure=figs/realchannel/x1.ps,angle=-90,width=1in}}\\
$\by$&\raisebox{-0.8cm}{\psfig{figure=figs/realchannel/xn1.ps,angle=-90,width=1in}}\\
\end{tabular}
%
\caption[a]{Two pulses $\bx_0$ and $\bx_1$, represented
as 31-dimensional vectors, and
a noisy version of one of them, $\by$.
% see figs/realchannel.gnu
}
\label{fig.detectionofpulses}
}
It is assumed that the noise is Gaussian with probability density
\beq
P( \bn ) = \left[ {\det}\left( \frac{\bA}{2 \pi}
\right) \right]^{1/2} \exp \left( - \frac{1}{2}
\bn^{\T} \bA \bn \right) ,
\eeq
where $\bA$ is the inverse of the variance--covariance matrix of the
noise, a symmetric and positive-definite matrix.
(If $\bA$ is a multiple of the identity matrix, $\bI/\sigma^2$,
then the noise is `white'.\index{noise!white}\index{white noise}
For more general $\bA$, the
noise is \index{noise!coloured}\index{coloured noise}`{coloured}'.) The probability of the received vector $\by$ given that the
source signal was $s$ (either zero or one) is then
\beq
P( \by \given s ) = \left[ { {\det} \left( \frac{\bA}{2 \pi}
\right) }\right]^{1/2} \exp \left( - \frac{1}{2}
(\by -\bx_s)^{\T} \bA (\by -\bx_s) \right) .
\eeq
The optimal detector is based on the posterior probability ratio:
\beqan
\hspace{-0.6cm}
\lefteqn{\frac{ P( s \eq 1\given \by )}{P(s \eq 0\given \by )} =
\frac{ P( \by \given s \eq 1 ) }{ P( \by \given s \eq 0)}
\frac{ P( s \eq 1 )}{P(s \eq 0 )} }
\\
&=& \exp \left( - \frac{1}{2}
(\by -\bx_1)^{\T} \bA (\by -\bx_1) + \frac{1}{2}
(\by -\bx_0)^{\T} \bA (\by -\bx_0) + \ln \frac{ P( s \eq 1 )}{P(s \eq 0 )} \right)
\nonumber
\\ &=& \exp \left( \by^{\T} \bA ( \bx_1 -\bx_0) + \theta \right),
\eeqan
where $\theta$ is a constant independent of the received vector $\by$,
\beq
\theta = - \frac{1}{2}
\bx_1^{\T} \bA \bx_1 + \frac{1}{2}
\bx_0^{\T} \bA \bx_0 + \ln \frac{ P( s \eq 1 )}{P(s \eq 0 )} .
\eeq
If the detector is forced to make a decision (\ie, guess either
$s \eq 1$ or $s \eq 0$) then the
decision that minimizes the probability of error is
to guess the most probable hypothesis. We can write the
optimal decision in terms of a {\dem\ind{discriminant function}}:
\beq
a(\by) \equiv \by^{\T} \bA ( \bx_1 -\bx_0) + \theta
\eeq
with the decisions
\marginfig{
\begin{tabular}{r@{}l}
$\bw$&\raisebox{-0.8cm}{\psfig{figure=figs/realchannel/w.ps,angle=-90,width=1in}}\\
\end{tabular}
%
\caption[a]{The weight vector $\bw \propto \bx_1 -\bx_0$
that is used to discriminate between $\bx_0$ and $\bx_1$.
% see figs/realchannel.gnu
}
\label{fig.detectionofpulses.w}
}
\beq
\begin{array}{ccl} a(\by) > 0& \rightarrow & \mbox{guess $s \eq 1$} \\
a(\by) < 0& \rightarrow & \mbox{guess $s \eq 0$} \\
a(\by)=0 & \rightarrow & \mbox{guess either.}
\end{array}
\eeq
Notice
% It should be noted
that $a(\by)$ is a linear function of the
received vector,
\beq
a(\by) = \bw^{\T} \by + \theta ,
\eeq
where $\bw \equiv \bA ( \bx_1 -\bx_0)$.
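
 A special case may help (a sketch, assuming white noise, $\bA = \bI/\sigma^2$,
 and equal prior probabilities $P(s \eq 1) = P(s \eq 0)$).
 The discriminant function is then
\beq
 a(\by) = \frac{1}{2 \sigma^2} \left[ (\by -\bx_0)^{\T} (\by -\bx_0)
 - (\by -\bx_1)^{\T} (\by -\bx_1) \right] ,
\eeq
 so the optimal decision is simply to guess whichever pulse is closer to $\by$
 in Euclidean distance, and the weight vector is $\bw = (\bx_1 -\bx_0)/\sigma^2$.
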
\section{Capacity of Gaussian channel}
\label{sec.entropy.continuous}
Until now we have only measured the joint, marginal, and conditional
entropy of discrete variables. In order to define the information conveyed
by continuous variables, there are two issues we must
address -- the infinite length of the real line, and the infinite
precision of real numbers.
\subsection{Infinite inputs}
How much information can we convey in one use of a Gaussian
channel? If we are allowed to put {\em any\/} real number $x$ into the
Gaussian channel, we could communicate an enormous
string of $N$ digits $d_1d_2d_3\ldots d_N$
by setting $x = d_1d_2d_3\ldots d_N 000\ldots 000$.
The amount of
error-free information conveyed in just a single transmission could
be made arbitrarily large by increasing $N$,
and the communication could be made arbitrarily reliable
by increasing the number of zeroes at the end of $x$.
There is usually some \ind{power cost} associated
with large inputs, however, not to mention practical limits
in the dynamic range acceptable to a receiver.
It is therefore conventional to introduce a {\dem\ind{cost
function}\/} $v(x)$ for every input $x$, and constrain codes to have
an average cost $\bar{v}$ less than or equal to some maximum value.
% a maximum average cost $\bar{v}$.
A generalized channel coding theorem, including a cost
function for the inputs, can be proved
% for the discrete channels discussed previously
-- see McEliece (1977).\nocite{McEliece77}
The result is a channel
capacity $C(\bar{v})$ that is a function of the permitted cost. For
the Gaussian channel we will assume a cost
\beq
v(x) = x^2
\eeq
such that the `average power' $\overline{x^2}$ of the input is
constrained. We motivated this cost function
above in the case of real electrical channels in
which the physical power consumption is indeed quadratic in $x$.
The constraint $\overline{x^2}=\bar{v}$ makes it impossible to
communicate infinite information in one use of the Gaussian channel.
\subsection{Infinite precision}
\amarginfig{b}{
{\footnotesize\setlength{\unitlength}{1mm}
\begin{tabular}{lc}
(a)&{\psfig{figure=gnu/grainI.ps,angle=-90,width=1.3in}}\\
(b)&\makebox[0in]{\hspace*{4mm}\begin{picture}(20,10)%
\put(17.65,6){\vector(1,0){1.42}}
\put(17.65,6){\vector(-1,0){1.42}}
\put(17.5,8){\makebox(0,0){$g$}}
%
\end{picture}}%
{\psfig{figure=gnu/grain10.ps,angle=-90,width=1.3in}}\\
&{\psfig{figure=gnu/grain18.ps,angle=-90,width=1.3in}}\\
&{\psfig{figure=gnu/grain34.ps,angle=-90,width=1.3in}}\\
& $\vdots$ \\
\end{tabular}
}
%
\caption[a]{(a) A probability density $P(x)$. {\sf Question:}
can we define the `entropy' of this density?
(b) We could evaluate the entropies of
a sequence of probability distributions with
decreasing grain-size $g$, but these entropies tend to
$\displaystyle \int P(x) \log \frac{1}{ P(x) g } \, \d x$,
which is not independent of $g$:
% increases as $g$ decreases:
the entropy goes up by
one bit for every halving of $g$.
$\displaystyle \int P(x) \log \frac{1}{ P(x) } \, \d x$
is an\index{sermon!illegal integral}
% \\ \hspace
illegal integral.}
% see gnu/grain.gnu
\label{fig.grain}
}
It is tempting to define joint, marginal, and conditional entropies\index{entropy!of continuous variable}\index{grain size}
for real variables simply by replacing summations by integrals, but
this is not a well defined operation. As we discretize an interval
into smaller and smaller divisions, the entropy of the discrete
distribution diverges (as the logarithm of the granularity) (\figref{fig.grain}).
Also, it is not permissible
to take the logarithm of a dimensional quantity such as
a probability density $P(x)$ (whose dimensions are $[x]^{-1}$).\index{sermon!dimensions}\index{dimensions}
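
 To make the divergence explicit (a sketch, assuming that $P(x)$ varies slowly on the
 scale of the grain size $g$): a bin of width $g$ centred on $x_i$ has probability
 $p_i \simeq P(x_i) g$, so the entropy of the discretized distribution is
\beq
 \sum_i p_i \log \frac{1}{p_i} \simeq \sum_i P(x_i) g \log \frac{1}{P(x_i) g}
 \simeq \int P(x) \log \frac{1}{ P(x) g } \, \d x .
\eeq
 For instance, a density that is uniform on an interval of length $a$ gives $a/g$
 equiprobable bins and an entropy of $\log (a/g)$, which grows without limit as $g \rightarrow 0$.
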
There is one information measure, however, that has a well-behaved
limit, namely the mutual information -- and this is the one that
really matters, since it measures how much information one variable
conveys about another. In the discrete case,
\beq
\I(X;Y) = \sum_{x,y}
P(x,y) \log \frac{P(x,y)}{P(x)P(y)} .
\eeq
Now because the argument of the log is a ratio of two probabilities
over the same space, it is
OK to have $P(x,y)$, $P(x)$ and $P(y)$ be
probability densities
% (as long as they are not pathological)
% densities)
and replace the sum by an integral:
\beqan
\I(X;Y)& =& \int \! \d x \: \d y \:
P(x,y) \log \frac{P(x,y)}{P(x)P(y)}
\\ &=&
\int \! \d x \: \d y \:
P(x)P(y\given x) \log \frac{P(y\given x)}{P(y)} .
\eeqan
We can now ask these questions for the Gaussian channel:
(a) what probability distribution
$P(x)$ maximizes the mutual information (subject to the constraint
$\overline{x^2}={v}$)? and (b) does the maximal
mutual information still measure the maximum
error-free communication rate of this real channel,
as it did for the discrete channel?
\exercissxD{3}{ex.gcoptens}{
Prove that the probability distribution
$P(x)$ that maximizes the mutual information (subject to the constraint
$\overline{x^2}={v}$) is a Gaussian distribution of mean zero
and variance $v$.
}
% solution is in tex/sol_gc.tex
\exercissxB{2}{ex.gcC}{
%
Show that the
mutual information $\I(X;Y)$,
in the case of this optimized distribution, is
\beq
C = \frac{1}{2} \log \left( 1 + \frac{v}{\sigma^2} \right) .
\eeq
}
This is an important result. We see that the capacity of the Gaussian
channel is a function of the {\dem signal-to-noise ratio} $v/\sigma^2$.
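 For a feel for the numbers (illustrative values, with logarithms to base 2):
\beq
 \frac{v}{\sigma^2} = 1 \: \Rightarrow \: C = 0.5 ; \hspace{0.3in}
 \frac{v}{\sigma^2} = 3 \: \Rightarrow \: C = 1 ; \hspace{0.3in}
 \frac{v}{\sigma^2} = 15 \: \Rightarrow \: C = 2 \: \mbox{ bits per channel use.}
\eeq
 Once $v/\sigma^2$ is large, each additional bit costs roughly a fourfold increase
 in the signal-to-noise ratio.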
\subsection{Inferences given a Gaussian input distribution}
If
$
P(x) = \Normal(x;0,v) \mbox{ and } P(y\given x) = \Normal(y;x,\sigma^2)
$
then the marginal distribution of $y$ is
$
P(y) = \Normal(y;0,v\!+\!\sigma^2)
$
and the posterior distribution of the input, given that the output is $y$,
is:
\beqan
P(x\given y) &\!\!\propto\!\!&
P(y\given x)P(x)
\\
&\!\!\propto\!\!& \exp( -(y-x)^2/2 \sigma^2)
\exp( -x^2/2 v)
\label{eq.two.gaussians}
\\
&\!\! =\!\! &
\Normal\left( x ; \frac{ v}{v+\sigma^2} \, y \, , \,
\left({\frac{1}{v}+\frac{1}{\sigma^2}}\right)^{\! -1} \right) .
\label{eq.infer.mean.gaussian}
\eeqan
%
% label this bit for reference when we get to Gaussian land
[The step from (\ref{eq.two.gaussians}) to (\ref{eq.infer.mean.gaussian})
is made by completing the square in the exponent.]
This
\label{sec.infer.mean.gaussian}
formula deserves careful study. The mean of the posterior
distribution, $\frac{ v}{v+\sigma^2} \, y $, can be viewed
as a weighted combination of the value that best fits the
output, $x=y$, and the value that best fits the prior, $x=0$:
\beq
\frac{ v}{v+\sigma^2} \, y =
\frac{1/\sigma^2 }{1/v+1/\sigma^2} \, y + \frac{1/v}{1/v+1/\sigma^2} \, 0 .
\eeq
The weights $1/\sigma^2$ and $1/v$ are the {\dem\ind{precision}s\/}
% parameters'
of the two Gaussians that we multiplied together in \eqref{eq.two.gaussians}:
 the likelihood and the prior, respectively.
%-- the probability of the output given the input,
% and the prior probability of the input.
The precision of the posterior distribution is
the sum of these two precisions. This is a general property:
whenever two independent sources contribute information, via
Gaussian distributions, about an unknown variable, the\index{precisions add}
precisions add. [This is the dual to the better-known
relationship `when independent variables are added,
their variances add'.]\index{variances add}
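
 The completing-the-square step from (\ref{eq.two.gaussians}) to
 (\ref{eq.infer.mean.gaussian}) can be sketched as follows, writing
 $\tau \equiv 1/v + 1/\sigma^2$ for the posterior precision
 (a symbol introduced here for convenience):
\beqan
 - \frac{(y-x)^2}{2 \sigma^2} - \frac{x^2}{2 v}
 &=& - \frac{\tau}{2} x^2 + \frac{y}{\sigma^2} x - \frac{y^2}{2 \sigma^2}
 \nonumber \\
 &=& - \frac{\tau}{2} \left( x - \frac{y}{\sigma^2 \tau} \right)^{\! 2}
 + \mbox{terms independent of $x$} ,
 \nonumber
\eeqan
 and $\dfrac{1}{\sigma^2 \tau} = \dfrac{v}{v+\sigma^2}$, which gives the posterior
 mean and variance stated in (\ref{eq.infer.mean.gaussian}).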
% inverse-variances add to define the inverse-variance of the
% posterior distribution.
\subsection{Noisy-channel coding theorem for the Gaussian channel}
We\index{noisy-channel coding theorem!Gaussian channel}
have evaluated a maximal mutual information. Does it correspond
to a maximum possible rate of error-free information transmission?
One way of proving that this is so
is to define a sequence
of discrete channels, all derived from the Gaussian channel, with
increasing numbers of inputs and outputs, and prove that the maximum
mutual information of these channels tends to the asserted $C$.
The noisy-channel coding theorem for discrete channels applies
to each of these derived channels, thus we obtain a coding theorem for
the continuous channel.
% coding theorem is then proved.
% (with discrete inputs and
% discrete outputs) by chopping the output into bins and using a
% finite set of inputs, and then defining a sequence of such channels with
% increasing numbers of inputs and outputs. A proof that the maximum
% mutual information
% of these channels tends to $C$ then completes the job, as we have already
% proved the noisy channel coding theorem for discrete channels.
%
% A more intuitive argument for the coding theorem may be preferred.
Alternatively, we can make an intuitive argument for the coding theorem
 specific to the Gaussian channel.
\subsection{Geometrical view of the noisy-channel coding theorem: sphere packing}
\index{sphere packing}Consider a sequence $\bx = (x_1,\ldots, x_N)$ of inputs, and the
 corresponding output $\by$, as defining two points in an $N$-dimensional
space. For large $N$, the noise power is very likely to be close
(fractionally) to $N \sigma^2$. The output $\by$ is therefore very likely
to be close to the surface of a sphere of radius $\sqrt{ N \sigma^2}$
centred on $\bx$. Similarly, if the original signal $\bx$ is generated
at random subject to an average power constraint $\overline{x^2} = v$,
then $\bx$ is likely to lie close to a sphere, centred on the
origin, of radius $\sqrt{N v}$; and because the total average power of $\by$
is $v+\sigma^2$, the received signal $\by$ is likely to lie on the surface
of a sphere of radius $\sqrt{N (v+\sigma^2)}$, centred on the origin.
The volume of an $N$-dimensional sphere of radius $r$ is
%
% this also appeared in _s1.tex
%
\beq
\textstyle
V(r,N) = \smallfrac{ \pi^{N/2} }{ \Gamma( N/2 + 1 ) } r^N .
\eeq
Now consider making a communication system based on non-confusable
inputs $\bx$, that is, inputs whose spheres do not overlap significantly.
The maximum number $S$ of non-confusable inputs is given by dividing
the volume of the sphere of probable $\by$s by the volume of
the sphere for $\by$ given $\bx$:
%
% An upper bound for the number $S$ of non-confusable inputs is:
\beq
S \leq \left( \frac{ \sqrt{N (v+\sigma^2)} }{ \sqrt{ N \sigma^2} }
\right)^{\! N}
\eeq
Thus the capacity is bounded by:\index{capacity!Gaussian channel}
\beq
C = \frac{1}{N} \log S \leq \frac{1}{2} \log \left( 1 + \frac{v}{\sigma^2}
\right) .
\eeq
A more detailed argument
% using the law of large numbers
like the one used in the previous chapter
can establish equality.
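To see how the dimension $N$ drops out of this bound, note that the
prefactor $\pi^{N/2} / \Gamma(N/2+1)$ is common to both volumes and the
factors of $\sqrt{N}$ cancel inside the bracket, so that
\[
 S \leq \left( \frac{v+\sigma^2}{\sigma^2} \right)^{\! N/2}
 \:\:\: \Rightarrow \:\:\:
 \frac{1}{N} \log S \leq \frac{1}{2} \log \left( 1 + \frac{v}{\sigma^2} \right) .
\]
As an illustration (the numbers are chosen here purely for convenience),
a signal-to-noise ratio of $v/\sigma^2 = 3$ gives a bound of
$\frac{1}{2} \log_2 4 = 1$ bit per channel use, and $v/\sigma^2 = 15$
gives $\frac{1}{2} \log_2 16 = 2$ bits per channel use.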
\subsection{Back to the continuous channel}
Recall that
the use of a real continuous channel with bandwidth $W$, noise spectral
density $N_0$ and power $P$ is equivalent to $N/T = 2 W$ uses per second
of a Gaussian channel with $\sigma^2 = N_0/2$ and subject to the
constraint $\overline{x_n^2} \leq P/2W$.
Substituting the result for the capacity of the Gaussian channel, we find the
capacity of the continuous channel to be:
\beq
C = W \log \left( 1 + \frac{P}{N_0 W} \right) \: \mbox{ bits per second.}
\eeq
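The substitution can be spelled out as follows. Each use of the Gaussian
channel conveys at most $\frac{1}{2} \log ( 1 + v/\sigma^2 )$ bits,
with $v = P/2W$ and $\sigma^2 = N_0/2$, and there are $2W$ uses per second,
so
\[
 C = 2 W \times \frac{1}{2} \log \left( 1 + \frac{P/2W}{N_0/2} \right)
 = W \log \left( 1 + \frac{P}{N_0 W} \right) ,
\]
where the logarithm is to base 2 if $C$ is measured in bits per second.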
This formula gives insight into the tradeoffs of practical
\ind{communication}. Imagine that we have a fixed power constraint. What
is the best \ind{bandwidth} to make use of that power? Introducing
$W_0=P/N_0$, \ie, the bandwidth for which the signal-to-noise ratio
is 1, figure \ref{fig.wideband} shows $C/W_0 = W/W_0 \log \! \left( 1 + W_0/W
\right)$ as a function of $W/W_0$. The capacity increases to an
asymptote of $W_0 \log e$. It is dramatically better (in terms of capacity
for fixed power) to transmit at a
low signal-to-noise ratio over a large bandwidth, than with high
signal-to-noise in a narrow bandwidth; this is one motivation for wideband
communication methods such as the `direct sequence spread-spectrum'\index{spread spectrum}
approach used
in {3G} \ind{mobile phone}s. Of course, you are not alone,
and your electromagnetic neighbours
may not be pleased if you use a large bandwidth, so for social reasons,
engineers often have to make do with higher-power, narrow-bandwidth
transmitters.
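The asymptote can be checked using the expansion
$\ln (1+\epsilon) = \epsilon - \epsilon^2/2 + \cdots$ with
$\epsilon = W_0/W$:
\[
 \frac{C}{W_0} = \frac{W}{W_0} \log \left( 1 + \frac{W_0}{W} \right)
 = \log e \, \left( 1 - \frac{W_0}{2W} + \cdots \right)
 \rightarrow \log e \:\:\: \mbox{as $W \rightarrow \infty$.}
\]
With logarithms to base 2, a few sample values (computed here for
illustration) are $C/W_0 \simeq 0.35$ at $W/W_0 = 0.1$,
$C/W_0 = 1$ at $W/W_0 = 1$, and $C/W_0 \simeq 1.38$ at $W/W_0 = 10$,
compared with the asymptote $\log_2 e \simeq 1.44$.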
%\begin{figure}
%\figuremargin{%
\marginfig{
% figs: load 'wideband.com'
\begin{center}
\mbox{\psfig{figure=figs/wideband.ps,%
width=1.75in,angle=-90}}
\end{center}
%}{%
\caption[a]{Capacity versus bandwidth for a real channel:
$C/W_0 = W/W_0 \log \left( 1 + W_0/W
\right)$ as a function of $W/W_0$.}
\label{fig.wideband}
}%
%\end{figure}
\section{What are the capabilities of practical error-correcting codes?\nonexaminable}
\label{sec.bad.code.def}% see also {sec.good.codes}!
% cf also \ref{sec.bad.dist.def}
% in _linear.tex
% Description of Established Codes}
%
Nearly all codes are good, but nearly all codes require exponential look-up
tables for practical
implementation of the encoder and decoder -- exponential in the
blocklength $N$. And the coding theorem required $N$ to be large.
By a {\dem\ind{practical}\/} error-correcting code, we mean one that
can be encoded and decoded in a reasonable amount of time,
for example, a time that scales as a polynomial function
of the blocklength $N$ -- preferably linearly.
\subsection{The Shannon limit is not achieved in practice}
The non-constructive proof of the noisy-channel coding theorem showed
that good block codes exist for any noisy channel, and indeed that nearly
all block codes are good. But writing down an explicit and {practical\/}
encoder
and decoder that are as good as promised by Shannon is still an unsolved
problem.
% Most of the explicit families of codes that have been written down have the
% property that they can achieve a vanishing error probability $p_{\rm b}$
% as $N \rightarrow \infty$ only if the rate $R$ also goes to zero.
%
% There is one exception to this statement:
% , given by a family of codes based on
% {\dbf concatentation}.
\label{sec.good.codes}
\begin{description}
\item[Very good codes\puncspace]
Given a channel, a family of block\index{error-correcting code!very good}
codes that achieve arbitrarily small
probability of error
at any communication rate
up to the capacity
of the channel are called `very good' codes
for that channel.
\item[Good codes]
are code families that
achieve arbitrarily small probability of error
at non-zero communication rates
up to some maximum rate
that may be {\em less than\/} the \ind{capacity}
of the given channel.\index{error-correcting code!good}
\item[Bad codes] are code families that cannot achieve arbitrarily small
probability of error, or that
can only achieve arbitrarily small
probability of error\index{error-correcting code!bad}
% $\epsilon$ `bad'
by decreasing the information rate
% $R$
to zero.
Repetition codes\index{error-correcting code!repetition}\index{repetition code}%
\index{error-correcting code!bad}
are an example of a bad code family.
(Bad codes are not necessarily useless for practical
purposes.)
\item[Practical codes] are code families that can be\index{error-correcting code!practical}
encoded and decoded in time and space polynomial in the blocklength.
\end{description}
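To illustrate why repetition codes form a bad family, consider the
repetition code $R_N$ with odd $N$, used over a binary symmetric channel
with noise level $f < 1/2$ and decoded by majority vote. Its rate is
$1/N$ and its probability of bit error is
\[
 p_{\rm b} = \sum_{n=(N+1)/2}^{N} {N \choose n} f^n (1-f)^{N-n} ,
\]
which tends to zero as $N \rightarrow \infty$, but only as the rate
$1/N$ also tends to zero. For example, with $f = 0.1$ (a value chosen
here for illustration), $R_3$ has rate $1/3$ and
$p_{\rm b} = 3 f^2 (1-f) + f^3 = 0.028$.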
\subsection{Most established codes are linear codes}
Let us review the definition of a block code, and then add
the definition of a linear block
code.\index{error-correcting code!block code}\index{error-correcting code!linear}\index{linear block code}
\begin{description}
\item[An $(N,K)$ block code] for a channel $Q$ is a list of $\cwM=2^K$
codewords
$\{ \bx^{(1)}, \bx^{(2)}, \ldots, \bx^{(2^K)} \}$, each of length $N$:
$\bx^{(\cwm)} \in \A_X^N$.
The signal to be encoded, $\cwm$, which comes from an
alphabet of size $2^K$, is encoded as $\bx^{(\cwm)}$.
% The {\dbf\ind{rate}} of the code\index{error-correcting code!rate} is $R = K/N$ bits.
%
% [This definition holds for any channels, not only binary channels.]
\item[A linear $(N,K)$ block code] is a block code in which
the codewords $\{ \bx^{(\cwm)} \}$ make up a $K$-dimensional subspace of
$\A_X^N$. The encoding operation can be represented by an $N \times K$
binary matrix\index{generator matrix}
$\bG^{\T}$ such that if the signal to be encoded,
in binary notation, is $\bs$ (a vector of length $K$ bits), then the
encoded signal is $\bt = \bG^{\T} \bs \mbox{ modulo } 2$.
The codewords $\{ \bt \}$ can be defined as the set of vectors
satisfying $\bH \bt = {\bf 0} \mod 2$, where $\bH$ is the
{\dem\ind{parity-check matrix}\/}
of the code.
\end{description}
\marginpar[c]{\[%beq
\bG^{\T} = {\small \left[ \begin{array}{@{\,}*{4}{c@{\,}}}
1 & \cdot & \cdot & \cdot \\[-0.05in]
\cdot & 1 & \cdot & \cdot \\[-0.05in]
\cdot & \cdot & 1 & \cdot \\[-0.05in]
\cdot & \cdot & \cdot & 1 \\[-0.05in]
1 & 1 & 1 & \cdot \\[-0.05in]
\cdot & 1 & 1 & 1 \\[-0.05in]
1 & \cdot & 1 & 1 \end{array} \right] } % nb different from l1.tex, no longer
\]%eeq
}
For example, the
$(7,4)$ \ind{Hamming code} of section \ref{sec.ham74}
takes $K=4$ signal bits, $\bs$, and transmits
them followed by three parity-check bits. The $N=7$ transmitted
symbols are given by $\bG^{\T} \bs \mod 2$.
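As a worked example, take the source string $\bs = {\tt{1011}}$. The first
four transmitted bits are copies of the source bits, and the last three
rows of $\bG^{\T}$ give the parity bits, with all sums modulo 2:
\[
 \begin{array}{lclcl}
 t_5 & = & s_1 + s_2 + s_3 & = & 1+0+1 = 0 \\
 t_6 & = & s_2 + s_3 + s_4 & = & 0+1+1 = 0 \\
 t_7 & = & s_1 + s_3 + s_4 & = & 1+1+1 = 1 ,
 \end{array}
\]
so $\bt = {\tt{1011001}}$. Writing the generator matrix in the block form
$\bG^{\T} = \left[ \begin{array}{c} {\bf I}_4 \\ {\bf P} \end{array} \right]$,
one valid parity-check matrix is $\bH = \left[ {\bf P} \:\: {\bf I}_3 \right]$,
since then $\bH \bt = {\bf P} \bs + {\bf P} \bs = {\bf 0}$ modulo 2 for
every codeword.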
% , where:
Coding theory was born with the work of Hamming, who invented a
family of practical
error-correcting codes, each able to correct one error in a
block of length $N$, of which the repetition code $R_3$ and the
$(7,4)$ code are the simplest.
Since then most established codes have been
generalizations of Hamming's codes:
% `BCH' (Bose, Chaudhury and Hocquenhem)
Bose--Chaudhuri--Hocquenghem
% The search for decodeable codes has produced the following families.
codes, Reed--Muller codes, Reed--Solomon codes, and
Goppa codes, to name a few.
\subsection{Convolutional codes}
Another family of linear codes is that of {\dem\ind{convolutional code}s}, which
do not divide the source stream into blocks, but instead read
and\index{error-correcting code!convolutional}
transmit bits continuously. The transmitted bits
are a linear function of the past source bits.
% both bits and parity checks in some fixed proportion.
Usually the rule for generating the transmitted bits
% parity checks
involves feeding the present source bit
into a \lfsr\index{linear feedback shift register} of length $k$,
and transmitting one or more
linear functions of the state of the shift register
at each iteration.
The resulting transmitted bit stream
is
%can be thought of as
the convolution
of the source stream with a linear filter.
The impulse-response function of this filter may have finite
or infinite duration, depending on the choice of feedback shift-register.
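In symbols (the notation in this sketch is introduced only for
illustration and is not used elsewhere), a shift register without feedback
and with hypothetical tap coefficients $g_0, g_1, \ldots, g_k \in \{0,1\}$
turns the source stream $s_1, s_2, \ldots$ into the transmitted stream
\[
 t_n = \sum_{j=0}^{k} g_j \, s_{n-j} \:\: \mbox{ modulo } 2 ,
\]
taking $s_n = 0$ for $n \leq 0$. This is the convolution of the source
stream with the finite impulse response $(g_0, \ldots, g_k)$. For instance,
the taps $(g_0,g_1,g_2) = (1,1,1)$ give
$t_n = s_n + s_{n-1} + s_{n-2}$ modulo 2.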
% it is
We will discuss convolutional codes in \chapterref{ch.convol}.
\subsection{Are linear codes `good'?}
One might ask whether the Shannon limit is not achieved
in practice
because linear codes are inherently not\index{error-correcting code!linear}\index{error-correcting code!good}\index{error-correcting code!random}
as good as random codes.\index{random code} The answer is no: the noisy-channel coding theorem
can still be proved for linear codes, at least for some channels
(see \chapterref{ch.linear.good}),
though the proofs, like Shannon's
proof for random codes, are non-constructive.
%(We will prove that
% there exist linear codes that are very good codes
% in chapter \ref{ch.linear.good}.
% and in particular for `cyclic codes',
% a class to which BCH and Reed--Solomon codes belong.
Linear codes are easy to implement at the encoding end. Is decoding a
linear code also easy? Not necessarily. The general decoding problem\index{error-correcting code!decoding}\index{linear block code!decoding}
(find the maximum likelihood $\bs$ in the equation $\bG^{\T} \bs + \bn =
\br$) is in fact \inds{NP-complete} \cite{BMT78}. [NP-complete problems are
computational problems that are all equally difficult and which
are widely believed to require exponential
computer time to solve in general.] So attention focuses on families of codes
% (such as those listed above)
for which there is a fast decoding algorithm.
\subsection{Concatenation}
One trick for building codes with practical decoders
is the idea of {concatenation}.\index{error-correcting code!concatenated}\index{concatenation!error-correcting codes}
An\amarginfignocaption{t}{
\begin{center}
\setlength{\unitlength}{1mm}
\begin{picture}(25,10)%
\put(17.5,8){\makebox(0,0){$\C' \rightarrow \underbrace{\C \rightarrow Q \rightarrow \D}
\rightarrow \D'$}}
\put(17.5,3){\makebox(0,0){$Q'$}}
%
\end{picture}%
\end{center}
%\caption[a]{none}
}
encoder--channel--decoder system $\C \rightarrow Q \rightarrow \D$
can be viewed as defining a \ind{super-channel} $Q'$
with a smaller probability of error, and with complex\index{channel!complex}
correlations among
its errors. We can create an encoder $\C'$ and decoder $\D'$
for this super-channel $Q'$.
The code consisting of the outer code $\C'$ followed by the inner code $\C$ is known
as a {\dem{concatenated code}}.\index{concatenation!error-correcting codes}
Some concatenated codes make use of the idea of {\dbf
\ind{interleaving}}. We read
% Interleaving involves encoding
the data in blocks, the size of each block being larger than the
blocklengths of the constituent codes $\C$ and $\C'$.
After encoding the data of one block using code $\C'$, the bits
are reordered within the block in such a way that
nearby bits are separated from each other once the block is fed to
the second code $\C$. A simple example of an interleaver
is a {\dbf\ind{rectangular code}\/}
or\index{error-correcting code!rectangular}\index{error-correcting code!product code}
{\dem\ind{product code}\/} in which the data are arranged in a
$K_2 \times K_1$ block, and encoded horizontally using an
$(N_1,K_1)$
linear code, then vertically using an $(N_2,K_2)$ linear code.
\exercisaxB{3}{ex.productorder}{
Show that either of the two codes can be viewed as the \ind{inner code} or the
\ind{outer code}.
}
%\subsection{}
% see also _concat2.tex
As an example, \figref{fig.concath1}
shows a product code in which we
% encode horizontally
% For example, if we
encode first with the repetition code $\Rthree$ (also known
as the \ind{Hamming code} $H(3,1)$)
horizontally then with $H(7,4)$
vertically.
The blocklength of the
concatenated\index{concatenation} code is $3 \times 7 = 21$. The number of source bits per codeword is
four, shown by the small rectangle.
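In general, a product code built from an $(N_1,K_1)$ code and an
$(N_2,K_2)$ code has blocklength $N_1 N_2$ and $K_1 K_2$ source bits, so
its rate is the product of the rates of the two constituent codes:
\[
 R = \frac{K_1 K_2}{N_1 N_2} = \frac{K_1}{N_1} \times \frac{K_2}{N_2} .
\]
For the code of \figref{fig.concath1}, $(N_1,K_1) = (3,1)$ and
$(N_2,K_2) = (7,4)$, giving $3 \times 7 = 21$ transmitted bits,
$1 \times 4 = 4$ source bits, and rate $4/21 \simeq 0.19$.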
% The code would be equivalent if we
% encoded first with $H(7,4)$ and second with $\Rthree$.
\begin{figure}
\figuremargin{%
\setlength{\unitlength}{0.4mm}
\begin{center}
\begin{tabular}{rrrrr}
(a)
\begin{picture}(30,70)(0,0)
\put(0,0){\framebox(30,70)}
\put(0,30){\framebox(10,40)}
\put(5,65){\makebox(0,0){1}}
\put(5,55){\makebox(0,0){0}}
\put(5,45){\makebox(0,0){1}}
\put(5,35){\makebox(0,0){1}}
\put(5,25){\makebox(0,0){0}}
\put(5,15){\makebox(0,0){0}}
\put(5,5){\makebox(0,0){1}}
\put(15,65){\makebox(0,0){1}}
\put(15,55){\makebox(0,0){0}}
\put(15,45){\makebox(0,0){1}}
\put(15,35){\makebox(0,0){1}}
\put(15,25){\makebox(0,0){0}}
\put(15,15){\makebox(0,0){0}}
\put(15,5){\makebox(0,0){1}}
\put(25,65){\makebox(0,0){1}}
\put(25,55){\makebox(0,0){0}}
\put(25,45){\makebox(0,0){1}}
\put(25,35){\makebox(0,0){1}}
\put(25,25){\makebox(0,0){0}}
\put(25,15){\makebox(0,0){0}}
\put(25,5){\makebox(0,0){1}}
\end{picture}&
%
% noise picture
%
(b)
\begin{picture}(30,70)(0,0)
\put(0,0){\framebox(30,70)}
\put(0,30){\framebox(10,40)}
\put(5,55){\makebox(0,0){$\star$}}%
\put(5,15){\makebox(0,0){$\star$}}%
%
\put(15,55){\makebox(0,0){$\star$}}%
\put(15,35){\makebox(0,0){$\star$}}%
%
\put(25,25){\makebox(0,0){$\star$}}%
\end{picture}&
%
% received vector picture
%
(c)
\begin{picture}(30,70)(0,0)
\put(0,0){\framebox(30,70)}
\put(0,30){\framebox(10,40)}
\put(5,65){\makebox(0,0){1}}
\put(5,55){\makebox(0,0){1}}%
\put(5,45){\makebox(0,0){1}}
\put(5,35){\makebox(0,0){1}}
\put(5,25){\makebox(0,0){0}}
\put(5,15){\makebox(0,0){1}}%
\put(5,5){\makebox(0,0){1}}
%
\put(15,65){\makebox(0,0){1}}
\put(15,55){\makebox(0,0){1}}%
\put(15,45){\makebox(0,0){1}}
\put(15,35){\makebox(0,0){0}}%
\put(15,25){\makebox(0,0){0}}
\put(15,15){\makebox(0,0){0}}
\put(15,5){\makebox(0,0){1}}
%
\put(25,65){\makebox(0,0){1}}
\put(25,55){\makebox(0,0){0}}
\put(25,45){\makebox(0,0){1}}
\put(25,35){\makebox(0,0){1}}
\put(25,25){\makebox(0,0){1}}%
\put(25,15){\makebox(0,0){0}}
\put(25,5){\makebox(0,0){1}}
\end{picture} &
% after R3 correction
(d)
\begin{picture}(30,70)(0,0)
\put(0,0){\framebox(30,70)}
\put(0,30){\framebox(10,40)}
\put(5,65){\makebox(0,0){1}}
\put(5,55){\makebox(0,0){1}}%
\put(5,45){\makebox(0,0){1}}
\put(5,35){\makebox(0,0){1}}
\put(5,25){\makebox(0,0){0}}
\put(5,15){\makebox(0,0){{\bf 0}}}%
\put(5,5){\makebox(0,0){1}}
%
\put(15,65){\makebox(0,0){1}}
\put(15,55){\makebox(0,0){1}}%
\put(15,45){\makebox(0,0){1}}
\put(15,35){\makebox(0,0){{\bf 1}}}%
\put(15,25){\makebox(0,0){0}}
\put(15,15){\makebox(0,0){0}}
\put(15,5){\makebox(0,0){1}}
%
\put(25,65){\makebox(0,0){1}}
\put(25,55){\makebox(0,0){{\bf 1}}}
\put(25,45){\makebox(0,0){1}}
\put(25,35){\makebox(0,0){1}}
\put(25,25){\makebox(0,0){{\bf 0}}}%
\put(25,15){\makebox(0,0){0}}
\put(25,5){\makebox(0,0){1}}
\end{picture}&
% after 74 correction
(e)
\begin{picture}(30,70)(0,0)
\put(0,0){\framebox(30,70)}
\put(0,30){\framebox(10,40)}
\put(5,65){\makebox(0,0){1}}
\put(5,55){\makebox(0,0){{\bf 0}}}%
\put(5,45){\makebox(0,0){1}}
\put(5,35){\makebox(0,0){1}}
\put(5,25){\makebox(0,0){0}}
\put(5,15){\makebox(0,0){{0}}}%
\put(5,5){\makebox(0,0){1}}
%
\put(15,65){\makebox(0,0){1}}
\put(15,55){\makebox(0,0){{\bf 0}}}%
\put(15,45){\makebox(0,0){1}}
\put(15,35){\makebox(0,0){{1}}}%
\put(15,25){\makebox(0,0){0}}
\put(15,15){\makebox(0,0){0}}
\put(15,5){\makebox(0,0){1}}
%
\put(25,65){\makebox(0,0){1}}
\put(25,55){\makebox(0,0){{\bf 0}}}
\put(25,45){\makebox(0,0){1}}
\put(25,35){\makebox(0,0){1}}
\put(25,25){\makebox(0,0){{0}}}%
\put(25,15){\makebox(0,0){0}}
\put(25,5){\makebox(0,0){1}}
\end{picture}\\
&
%
% noise picture
%
&
&
% after 74 correction
(d$^{\prime}$)
\begin{picture}(30,70)(0,0)
\put(0,0){\framebox(30,70)}
\put(0,30){\framebox(10,40)}
\put(5,65){\makebox(0,0){1}}
\put(5,55){\makebox(0,0){1}}%
\put(5,45){\makebox(0,0){1}}
\put(5,35){\makebox(0,0){1}}
\put(5,25){\makebox(0,0){{\bf 1}}}%
\put(5,15){\makebox(0,0){1}}%
\put(5,5){\makebox(0,0){1}}
%
\put(15,65){\makebox(0,0){{\bf 0}}}%
\put(15,55){\makebox(0,0){1}}%
\put(15,45){\makebox(0,0){1}}
\put(15,35){\makebox(0,0){0}}%
\put(15,25){\makebox(0,0){0}}
\put(15,15){\makebox(0,0){0}}
\put(15,5){\makebox(0,0){1}}
%
\put(25,65){\makebox(0,0){1}}
\put(25,55){\makebox(0,0){0}}
\put(25,45){\makebox(0,0){1}}
\put(25,35){\makebox(0,0){1}}
\put(25,25){\makebox(0,0){{\bf 0}}}%
\put(25,15){\makebox(0,0){0}}
\put(25,5){\makebox(0,0){1}}
\end{picture} &
% after R3 correction
(e$^{\prime}$)
\begin{picture}(30,70)(0,0)
\put(0,0){\framebox(30,70)}
\put(0,30){\framebox(10,40)}
\put(5,65){\makebox(0,0){1}}
\put(5,55){\makebox(0,0){(1)}}
\put(5,45){\makebox(0,0){1}}
\put(5,35){\makebox(0,0){1}}
\put(5,25){\makebox(0,0){{\bf 0}}}
\put(5,15){\makebox(0,0){{\bf 0}}}%
\put(5,5){\makebox(0,0){1}}
%
\put(15,65){\makebox(0,0){{\bf 1}}}
\put(15,55){\makebox(0,0){(1)}}%
\put(15,45){\makebox(0,0){1}}
\put(15,35){\makebox(0,0){{\bf 1}}}%
\put(15,25){\makebox(0,0){0}}
\put(15,15){\makebox(0,0){0}}
\put(15,5){\makebox(0,0){1}}
%
\put(25,65){\makebox(0,0){1}}
\put(25,55){\makebox(0,0){(1)}}
\put(25,45){\makebox(0,0){1}}
\put(25,35){\makebox(0,0){1}}
\put(25,25){\makebox(0,0){{0}}}%
\put(25,15){\makebox(0,0){0}}
\put(25,5){\makebox(0,0){1}}
\end{picture}\\
\end{tabular}
\end{center}
}{%
\caption[a]{A product code.
(a) A string {\tt{1011}} encoded using a concatenated code
consisting of two Hamming codes, $H(3,1)$ and $H(7,4)$.
(b) A noise pattern that flips 5 bits. (c) The received vector.
(d) After decoding using the horizontal $(3,1)$ decoder,
and (e) after subsequently using the vertical $(7,4)$ decoder. The decoded
vector matches the original.
(d$^{\prime}$, e$^{\prime}$) After decoding in the other order, three errors
still remain.}
\label{fig.concath1}
}%
\end{figure}
\label{sec.concatdecode}We
can decode conveniently (though not optimally) by using the
individual decoders for each of the subcodes in some sequence.
It makes most sense to first decode the code which has the
lowest rate and hence the greatest error-correcting ability.
\Figref{fig.concath1}(c--e) shows what happens if we receive the
codeword of \figref{fig.concath1}a with some errors (five bits flipped,
as shown) and
apply the decoder for $H(3,1)$ first, and then the
decoder for $H(7,4)$. The first decoder corrects three of the errors,
but erroneously modifies the third bit in the second row where there
are two bit errors. The $(7,4)$ decoder can then correct all three of
these errors.
\Figref{fig.concath1}(d$^{\prime}$--$\,$e$^{\prime}$) shows what happens if we decode the two codes
in the other order.
In columns one and two there are two errors, so the $(7,4)$ decoder
introduces two extra errors. It corrects the one error in column 3.
The $(3,1)$ decoder then cleans up four of the errors, but erroneously
infers the second bit.
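The two decoding orders can be summarized by counting, at each stage of
\figref{fig.concath1}, the number of bits that differ from the
transmitted codeword:
\begin{center}
\begin{tabular}{lcc}
 & $(3,1)$ decoder first & $(7,4)$ decoder first \\
Received vector & 5 & 5 \\
After the first decoder & 3 & 6 \\
After the second decoder & 0 & 3 \\
\end{tabular}
\end{center}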
% To make simple decoding possible,
% we split up bits that are in a single codeword at the first level,
% grouping them with other bits. Rectangular arrangement makes this easiest
% to see.
\subsection{Interleaving}
The motivation for interleaving is that by spreading out bits that
are nearby in one code, we make it possible to ignore
% forget about
the complex correlations among the errors that are produced by the
inner code. Maybe the inner code will mess up an entire codeword;
but that codeword is spread out one bit at a time over several codewords
of the outer code. So we can treat the errors introduced by the
inner code as if they are independent.\index{approximation!of complex distribution}
% by a simpler one}
%
% By iterating this process, with each successive
% code adding a small amount of redundancy to a geometrically increasing block,
% we can define an explicit sequence of codes with the property that
% $p_{\rm b} \rightarrow 0$ for some rate $R > 0$ (but not any $R$ up to the
% capacity $C$).
%
% There is also a proof by Forney that better concatenations
% exist, which achieve rates up to capacity and have encoding and decoding
% complexity of order $O(N^4)$. But the proof is non-constructive.
%
% gf.tex could be included here
%
% \subsection{Coding theory sells you short}
% At this point could discuss the universalist `this code corrects
% all errors up to $t$' with the Shannonist `the prob of error
% is tiny'. The latter attitude allows you to communicate at far
% greater rates. The former attitude is happy with something that
% is only halfway.
%
% Distance
% Show Prob of error of ideal decoder (Schematic) as function of noise level.
% Show that you can cope with double the noise.
\subsection{Other channel models}
% Most of the codes mentioned above are designed in terms of
In addition to the binary
symmetric channel and the Gaussian channel,
% or in terms of the number of errors they can correct, but
coding theorists keep more complex
channels in mind also.
%\index{burst-error channels}
{\dem Burst-error channels\/}\index{channel!bursty}\index{burst errors}
are important models in
practice. \ind{Reed--Solomon code}s use \ind{Galois field}s
(see \appendixref{app.GF})
with large numbers of
elements (\eg\ $2^{16}$) as their input alphabets,
and thereby automatically achieve a degree
of burst-error tolerance in that even if 17 successive bits are
corrupted, only 2 successive symbols in the Galois field representation are
corrupted. Concatenation and interleaving can give further
% fortuitous
protection against
% \index{concatenated code}
burst errors. The concatenated\index{concatenation!error-correcting codes}\index{error-correcting code!concatenated}
Reed--Solomon codes used on digital compact discs
% DISKS?
are able to correct bursts of errors of length 4000 bits.
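The arithmetic behind this burst tolerance is simple. Assuming that each
symbol is transmitted as a block of $m$ consecutive bits, a burst of $L$
successive corrupted bits overlaps at most
\[
 \left\lceil \frac{L-1}{m} \right\rceil + 1
\]
symbols. With $m = 16$ and $L = 17$ this gives
$\lceil 16/16 \rceil + 1 = 2$ symbols, whereas for a code with a binary
alphabet the same burst could corrupt 17 code symbols.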
\exercissxB{2}{ex.interleaving.dumb}{
The technique of \ind{interleaving},\index{implicit assumptions}
which allows bursts of\index{error-correcting code!interleaving}
errors to be treated as independent, is widely used, but is theoretically
a poor way to protect data against
\ind{burst errors}, in terms of the amount of redundancy required.
Explain why interleaving is a poor method, using the following
burst-error channel as an example. Time is divided into chunks
of length
$N=100$ clock cycles; during each chunk, there is a burst with
probability $b=0.2$; during a burst, the channel is a binary symmetric channel
with $f=0.5$. If there is no burst, the channel is an error-free binary
channel. Compute the capacity of this channel and compare it with the
maximum communication rate that could conceivably be achieved if one
used interleaving and treated the errors as independent.
}
% The BSC is an inadequate channel model for a second reason: many
% channels have {\em real outputs}. For example, a
% binary input $x$ may give rise to a
% probability distribution over a real output $y$. Codes whose decoders
% can handle real outputs (log likelihood ratios) are therefore
% important. `Convolutional codes' are such codes, as are some block codes.
{\dem\index{fading channel}{Fading channels}\/} are real\index{channel!fading}
channels like Gaussian\index{channel!Gaussian}
channels except that the received
power is assumed to vary with time.
A moving
\ind{mobile phone}\index{cellphone|see{mobile phone}}\index{phone!cellular|see{mobile phone}}
is an important example.
The incoming \ind{radio} signal is reflected
off nearby objects so that there are interference patterns and the
intensity of the
signal received by the phone varies with its location. The received power
can easily vary by 10 decibels\index{decibel}
(a factor of ten) as the phone's antenna
moves through a distance similar to the wavelength of the radio signal
(a few centimetres).
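Recall that a power ratio $P_1/P_2$ expressed in decibels is
\[
 10 \log_{10} \frac{P_1}{P_2} ,
\]
so 10 decibels corresponds to $P_1/P_2 = 10$, 20 decibels to a factor of
100, and 3 decibels to a factor of $10^{0.3} \simeq 2$. As a rough
illustration (the numbers are chosen here for convenience, and the noise
level is assumed to stay fixed), a 10-decibel drop in received power from
$v/\sigma^2 = 10$ to $v/\sigma^2 = 1$ reduces the Gaussian-channel
expression $\frac{1}{2} \log_2 ( 1 + v/\sigma^2 )$ from about 1.73 bits to
0.5 bits per channel use.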
%Fading channels are used as models
% of the radio channel of mobile phones, in which the received power
% varies rapidly
\section{The state of the art}
What are the best known codes for communicating over Gaussian channels?
All the practical codes are linear codes, and are based on either
convolutional codes or block codes.\index{linear block code}
\subsection{Convolutional codes, and codes based on them}
\begin{description}
\item[Textbook convolutional codes\puncspace] The `de facto standard'
% cite golomb?
error-correcting code for\index{communication}
\ind{satellite communications} is a
convolutional code with constraint length 7.
Convolutional codes are discussed in \chref{ch.convol}.
\item[Concatenated convolutional codes\puncspace]
The above \ind{convolutional code}
can be used as the inner code of a\index{error-correcting code!concatenated}
concatenated code whose
outer code
is a \ind{Reed--Solomon code} with eight-bit symbols. This code
was used in deep space communication systems such as the
Voyager spacecraft.
For further reading about Reed--Solomon codes,
see \citeasnoun{lincostello83}.
\item[The code for \index{Galileo code}{Galileo}\puncspace]
A code using the same format but using a longer
constraint length -- 15 -- for its convolutional code and a larger
Reed--Solomon code was developed by the \ind{Jet Propulsion Laboratory} \cite{JPLcode}.
The details of this code are unpublished outside JPL,
and the decoding is only possible using
a room full of special-purpose hardware.
In 1992, this
was the best code known of rate \dfrac{1}{4}.
\item[Turbo codes\puncspace]
In 1993, \index{Berrou, C.}{Berrou}, \index{Glavieux, A.}{Glavieux}
and \index{Thitimajshima, P.}{Thitimajshima}
\nocite{Berrou93:Turbo}reported
work on {\dem\ind{turbo code}s}. The encoder of a turbo code is based on
the encoders of two
% or more constituent codes. In
% the original paper the two constituent codes were
convolutional codes.
The source bits are fed into each encoder, the order of the
source bits being permuted in a random way, and the resulting
parity bits from each constituent code are transmitted.
The decoding algorithm
% invented by Berrou {\em et al\/}
involves iteratively decoding each constituent code%
\amarginfig{b}{
\begin{center}
\setlength{\unitlength}{1mm}
\begin{picture}(25,30)(0,8)%
\put(15,18){\framebox(8,8){$C_1$}}
\put(15, 8){\framebox(8,8){$C_2$}}
\put( 9,12){\circle{6}}
\put( 5, 8){\framebox(8,8){$\pi$}}
\put(9.7,14.875){\vector(1,0){0.1}}% right pointing circle vector % was 975
\put(23,22){\vector(1,0){3}}
\put(23,12){\vector(1,0){3}}
\put(13,12){\line(1,0){2}}
\put( 2,12){\vector(1,0){3}}
\put( 0,22){\vector(1,0){15}}
\put( 2,22){\line(0,-1){10}}
%
\end{picture}%
\end{center}
\caption[a]{The encoder of a turbo code.
Each box $C_1$, $C_2$, contains a convolutional code.
The source bits are reordered using a permutation $\pi$ before
they are fed to $C_2$. The transmitted codeword is obtained
by concatenating or interleaving the outputs of the two
convolutional codes.
The random
permutation is chosen when the code is designed, and fixed
thereafter.
}
}
using its
standard decoding algorithm, then using the
output of the decoder as the input to the other decoder.
This decoding algorithm is an instance of a
{\dbf{message-passing}}\index{message passing}
algorithm called the {\dbf\ind{sum--product algorithm}}.
Turbo codes are discussed in \chref{ch.turbo}, and message passing in Chapters \ref{ch.message},
\ref{ch.noiseless}, \ref{ch.exact}, and \ref{ch.sumproduct}.
\end{description}
\subsection{Block codes}
\begin{description}
\item[Gallager's low-density parity-check codes\puncspace]
The%
\amarginfig{c}{
\[
\raisebox{0.425in}{ \bH \hspace{0.02in} =}\hspace{-0.1in}
\psfig{figure=MNCfigs/12.4.3.111/A.ps,angle=-90,width=1.5in,height=1in}
\]
\begin{center}
\mbox{
\psfig{figure=/home/mackay/itp/figs/gallager/16.12.ps,width=2in,angle=-90}
}\end{center}
\caption[a]{A low-density parity-check matrix
and the corresponding graph of a rate-\dfrac{1}{4}
low-density parity-check code with
% $(j,k) = (3,4)$,
blocklength $N \eq 16$, and $M \eq 12$ constraints.
Each white circle represents a transmitted bit. Each bit
participates in $j=3$ constraints, represented by
\plusnode\ squares. Each
% \plusnode\
constraint forces the
sum of the $k=4$ bits to which it is connected to
be even.
This code is a $(16,4)$
code. Outstanding performance is obtained when
the blocklength is increased to
$N \simeq 10\,000$.
}
\label{fig.ldpccIntro}
}
best block codes known for Gaussian channels
were invented by Gallager\index{Gallager, Robert} in
1962 but were promptly forgotten by most of the coding theory community.
% by MacKay and Neal,
They were rediscovered in 1995\nocite{mncEL,wiberg:phd}\index{Wiberg, Niclas}\index{MacKay, David}\index{error-correcting code!low-density parity-check}\index{Neal, Radford}
and shown to have outstanding theoretical and practical properties.\index{error-correcting code!practical}
Like turbo codes, they are decoded by message-passing algorithms.
We will discuss these beautifully simple codes in Chapter
% \ref{ch.belief.propagation} and
\ref{ch.gallager}.
\end{description}
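The parameters quoted for the code in \figref{fig.ldpccIntro} can be
checked by counting the {\tt{1}}s in $\bH$ in two ways, by columns and by
rows:
\[
 N j = 16 \times 3 = 48 = 12 \times 4 = M k .
\]
If the $M$ constraints are linearly independent (as the quoted $(16,4)$
parameters imply), the number of source bits is $K = N - M = 4$ and the
rate is
\[
 R = \frac{N-M}{N} = 1 - \frac{j}{k} = \frac{1}{4} .
\]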
The performances of the above codes are compared for Gaussian
channels in \figref{fig:GCResults}, \pref{fig:GCResults}.%{fig.gl.gc}.
% the Galileo code and
% Only the Galileo code and turbo codes outperform the original
% regular, binary Gallager codes.
% The best known Gallager codes, which are irregular,
%% and non-binary,
% outperform the Galileo code and turbo codes too \cite{DaveyMacKay96,Richardson2001b}.
\section{Summary}
\begin{description}
\item[Random codes] are good, but they require exponential resources to encode
and decode them.
\item[Non-random codes] tend for the most part not to be as good as
random codes. For a non-random code, encoding may be easy, but even for
simply-defined linear codes, the decoding problem remains very difficult.
\item[The best practical codes]
%\ben
%\item
(a)
employ very large block sizes; (b)
% \item
are based on semi-random code constructions;
and (c)
%\item
make use of probability-based decoding algorithms.
% \een
\end{description}
\section{Nonlinear codes}
Most practically used codes are linear, but not all.\index{error-correcting code!nonlinear}\index{nonlinear code}
Digital soundtracks are encoded onto cinema film as
a binary pattern. The likely errors affecting the
film involve
dirt and scratches, which produce large numbers of {\tt{1}}s
and {\tt{0}}s respectively. We want none
of the codewords to look like all-{\tt{1}}s or all-{\tt{0}}s,
so that it will be
easy to detect errors caused by dirt and scratches.
One of the codes used in \ind{digital cinema}\index{cinema} \ind{sound} systems is
a nonlinear $(8,6)$ code consisting of 64 of the ${{8}\choose{4}}$
binary patterns of weight 4.
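To check the numbers: there are
\[
 {{8}\choose{4}} = \frac{8 \times 7 \times 6 \times 5}{4 \times 3 \times 2 \times 1} = 70
\]
binary patterns of length 8 and weight 4, of which $64 = 2^6$ are used, so
six source bits are conveyed by eight transmitted bits. Such a code cannot
be linear, because a linear code must contain the all-{\tt{0}}s word (the
modulo-2 sum of any codeword with itself), and that word has weight 0,
not 4.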
% That's 70 patterns. Pick 64.
\section{Errors other than noise}
Another source of uncertainty for the receiver
is uncertainty about the {\em{\ind{timing}}\/} of the transmitted signal $x(t)$.
In ordinary coding theory and information theory,
the transmitter's time $t$ and the receiver's time
$u$ are assumed to be perfectly synchronized.
% If a bit sequence is encoded by a simple signal
% $x(t) \in \pm 1$, information is easily conveyed if
% the transmitter and the receiver both know the same
% time $t$;
But if the receiver receives a signal $y(u)$,
where the receiver's time, $u$, is an imperfectly
known function $u(t)$
of the transmitter's time
$t$, then the capacity of this channel for communication
is reduced. The theory of
such channels is incomplete, compared
with the
% ordinary
% `normal'
synchronized channels\index{insertions}\index{deletions}
we have discussed thus far. Not even
the {\em capacity\/} of channels with \ind{synchronization errors}\index{capacity!channel with synchronization errors}
is known \cite{Levenshtein66,Ferreira97};
%
% ear recommends citing zigangirov69 ullman67
%
codes for reliable communication over channels
with synchronization errors remain an active research area
\cite{DaveyMacKay99b}.
% ear recommends citing ratzer2003
\subsection*{Further reading}
For a review of the history of spread-spectrum\index{spread spectrum} methods, see
\citeasnoun{Scholtz82}.
\section{Exercises}
\subsection{The Gaussian channel}
\exercissxB{2}{ex.gcCb}{
Consider a Gaussian channel with a real input $x$, and signal to
noise ratio $v/\sigma^2$.
\ben
\item
What is its capacity $C$?
\item
If the input is constrained to be binary, $x \in \{ \pm \sqrt{v} \}$,
what is the capacity $C'$ of this constrained channel?
\item
If in addition the output of the channel is thresholded using the
mapping
\beq
y \rightarrow y' = \left\{ \begin{array}{cc} 1 & y > 0 \\
0 & y \leq 0, \end{array}
\right.
\eeq
what is the capacity $C''$ of the resulting channel?
\item
Plot the three capacities above as a function of $v/\sigma^2$
from 0.1 to 2. [You'll need to do a numerical integral to
evaluate $C'$.]
\een
}
\exercisaxB{3}{ex.codeslinear}{
For large integers $K$ and $N$,
what fraction of all binary error-correcting codes of length $N$
and rate $R=K/N$ are {\em{linear}\/} codes?
[The answer will depend on whether you choose to define the
code to be an {\em{ordered}\/} list of $2^K$ codewords,
that is, a mapping from $s \in \{1,2,\ldots,2^K\}$ to $\bx^{(s)}$,
or to define the code to be an unordered list, so that
two codes consisting of the same codewords are identical.
Use the latter definition: a code\index{error-correcting code} is a set of
codewords; how the encoder operates is not part of the
definition of the code.]
}
% that have not already been covered.
\subsection{Erasure channels}
\exercisxB{4}{ex.beccode}{
Design a code for the binary erasure channel, and a decoding
algorithm, and evaluate their probability of error.
[The design of good codes for erasure channels\index{erasure-correction}\index{channel!erasure}
is an active research area
\cite{spielman-96,LubyDF}; see also \chref{chdfountain}.]
% Have fun!]
%
}
\exercisaxB{5}{ex.qeccode}{
Design a code for the $q$-ary erasure channel,\index{erasure-correction}
whose input $x$ is drawn from $0,1,2,3,\ldots,(q-1)$,
and whose output $y$ is equal to $x$ with probability $(1-f)$
and equal to {\tt{?}} otherwise.
[This erasure channel is a good model for \ind{packet}s
transmitted over the \ind{internet}, which are either received reliably
or are lost.]
}
\exercissxC{3}{ex.raid}{
How do redundant arrays of independent disks (RAID) work?\marginpar{%
\small\raggedright{%
% aside
[Some people say RAID stands for `redundant array of inexpensive disks',
but I think that's silly -- RAID would still be a good idea\index{RAID}\index{redundant array of independent disks}
even if the disks were expensive!]
% end aside
}}
These are information storage systems consisting of about\index{erasure-correction}
ten \disc{} drives,\index{disk drive} of which any two or three can be disabled and the others
are still able to reconstruct any requested file.\index{file storage}
What codes are used, and how far are these systems from the Shannon
limit for the problem they are solving? How would {\em you\/} design
a better RAID system?
%
Some information is provided in the solution section.
See {\tt http://{\breakhere}www.{\breakhere}acnc.{\breakhere}com/{\breakhere}raid2.html}; see also \chref{chdfountain}.
% and {\tt http://www.digitalfountain.com/} for more.
}
%%%%\input{tex/_e7.tex}
\dvips
\section{Solutions}% to Chapter \protect\ref{ch.ecc}'s exercises} %
% ex 89
\soln{ex.gcoptens}{
% \subsection{Maximization}
Introduce a Lagrange multiplier $\l$ for the power constraint and another,
$\mu$, for the constraint of normalization of $P(x)$.
\beqan
F &\eq & \I(X;Y) -
{ \l \textstyle \int \d x \, P(x) x^2 - \mu \textstyle \int \d x \, P(x) }
\\ &\eq &
\int \! \d x \,
P(x) \left[ \int \! \d y \, P(y\given x) \ln \frac{P(y\given x)}{P(y)}
- \l x^2 - \mu \right] .
\eeqan
Take the functional derivative with respect to
$P(x^*)$.
\beqan
\frac{\delta F}{\delta P(x^*)} &=&
\int \! \d y \, P(y\given x^*) \ln \frac{P(y\given x^*)}{P(y)}
- \l {x^*}^2 - \mu
\nonumber \\ &&
- \int \! \d x \: P(x)
\int \! \d y \: P(y\given x) \frac{1}{P(y)} \frac{\delta P(y)}{\delta P(x^*)} . \hspace{0.5cm}
\eeqan
The final factor $\delta P(y)/\delta P(x^*)$ is found, using $P(y) =
\int \! \d x \, P(x) P(y\given x)$, to be $P(y\given x^*)$, and the whole of the
last term collapses in a puff of smoke to 1, which can be absorbed into the
$\mu$ term.
% We now substitute
Substitute
$P(y\given x) = \exp( -(y-x)^2/2 \sigma^2) / \sqrt{2 \pi \sigma^2}$
and set the derivative to zero:
\beq
\int \! \d y \, P(y\given x) \ln \frac{P(y\given x)}{P(y)} - \l x^2 - \mu' = 0
\eeq
\beq
\Rightarrow
\int \! \d y \,
\frac{\exp( -(y-x)^2/2 \sigma^2)}{\sqrt{2 \pi \sigma^2} }
\ln \left[ P(y) \sigma \right] = - \l x^2 - \mu' - \frac{1}{2} .
\label{eq.theconstr}
\eeq
This condition must
be satisfied by $\ln \! \left[ P(y) \sigma \right]$ for all $x$.
Writing a Taylor expansion of $\ln \! \left[ P(y) \sigma \right]
= a + b y + c y^2 + \cdots$, only a quadratic function
$\ln \! \left[ P(y) \sigma \right]
= a + c y^2$ would satisfy the constraint (\ref{eq.theconstr}).
(Any higher order terms $y^p$, $p>2$, would produce
terms in $x^p$ that are not present on the right-hand side.)
Therefore $P(y)$ is Gaussian. We can obtain this optimal output distribution
by using a Gaussian input distribution $P(x)$.
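 To see why the quadratic form works, here is a sketch of the check. It uses only the Gaussian moments
 $\int \! \d y \, P(y\given x) \, y = x$ and
 $\int \! \d y \, P(y\given x) \, y^2 = x^2 + \sigma^2$.
 Substituting $\ln \! \left[ P(y) \sigma \right] = a + c y^2$ into the left-hand side of
 (\ref{eq.theconstr}) gives
\beq
 \int \! \d y \, P(y\given x) \left[ a + c y^2 \right] = a + c \sigma^2 + c x^2 ,
\eeq
 which agrees with the right-hand side for all $x$ if $c = - \l$ and
 $a = - \mu' - \frac{1}{2} + \l \sigma^2$.
 A linear term $b y$ would contribute $b x$, which has no counterpart on the right-hand side, so $b=0$.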
% \footnote{Note in passing that
% the Gaussian is the probability distribution that has maximum
% pseudo-entropy
}
\soln{ex.gcC}{
Given a Gaussian input distribution of variance $v$, the
output distribution is $\Normal(0,v\!+\!\sigma^2)$, since
$x$ and the noise are independent random variables,
and variances add for independent random variables.
The mutual information is:
\beqan
\!\!\!\!\!\!\!\!\!\! \I(X;Y)& =& \!\! \int \! \d x \, \d y \:
P(x)P(y\given x) \log {P(y\given x)}
- \int \! \d y \:
P(y) \log {P(y)} \\
&=&
\frac{1}{2} \log \frac{1}{\sigma^2} - \frac{1}{2} \log \frac{1}{v+\sigma^2} \\
&=&
% \frac{1}{2} \log \frac{v+\sigma^2}{\sigma^2} =
\frac{1}{2} \log \left( 1 + \frac{v}{\sigma^2}
\right) .
\eeqan
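 The middle step uses the differential entropy of a Gaussian. Spelled out (a sketch in which the
 common constant $\frac{1}{2} \log 2 \pi e$ is shown explicitly before it cancels):
\beqan
 \int \! \d x \, \d y \: P(x)P(y\given x) \log {P(y\given x)}
 &=& - \frac{1}{2} \log \left( 2 \pi e \sigma^2 \right) , \\
 - \int \! \d y \: P(y) \log {P(y)}
 &=& \frac{1}{2} \log \left( 2 \pi e ( v + \sigma^2 ) \right) ,
\eeqan
 and adding the two lines leaves $\frac{1}{2} \log \left[ ( v + \sigma^2 ) / \sigma^2 \right]$.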
}
\soln{ex.interleaving.dumb}{
The capacity of the channel is one minus the information content
of the noise that it adds. That information content is, per chunk,
the entropy of the selection of whether the chunk is bursty,
$H_2(b)$, plus, with probability $b$, the entropy of the flipped bits, $N$,
which adds up to $H_2(b) + Nb$ per chunk (roughly; accurate if $N$ is large).
So, per bit, the capacity is, for $N=100$,
\beq
C = 1 - \left( \frac{1}{N} H_2(b) + b \right) = 1 - 0.207 = 0.793 .
\eeq
In contrast, interleaving, which treats bursts of\index{sermon!interleaving}
errors as independent, causes the channel to be treated as
a binary symmetric channel with $f= 0.2 \times 0.5 = 0.1$, whose capacity is about 0.53.
Interleaving throws away the useful information about the
correlatedness of the errors.
Theoretically, we should be able to communicate about
$(0.79/0.53) \simeq 1.5$ times
faster using a code and decoder that explicitly treat bursts as bursts.
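 The numbers can be checked as follows, taking $b = 0.2$ (the burst probability implied by
 $f = 0.2 \times 0.5$) and $N = 100$; the label $C_{\rm BSC}$ is introduced here only for this check.
\beqan
 H_2(0.2) &=& 0.2 \log_2 \frac{1}{0.2} + 0.8 \log_2 \frac{1}{0.8} \simeq 0.722 , \\
 C &\simeq& 1 - \left( 0.0072 + 0.2 \right) \simeq 0.793 , \\
 C_{\rm BSC} &=& 1 - H_2(0.1) \simeq 1 - 0.469 = 0.531 ,
\eeqan
 so the ratio of the two capacities is $0.793/0.531 \simeq 1.5$.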
}
% ex 91
\soln{ex.gcCb}{
\ben
\item
Putting together the results of exercises \ref{ex.gcoptens} and \ref{ex.gcC},
we deduce that
a Gaussian channel with real input $x$, and signal to
noise ratio $v/\sigma^2$ has capacity
\beq
C = \frac{1}{2} \log \left( 1 + \frac{v}{\sigma^2}
\right) .
\label{eq.unconstrained.cap}
\eeq
\item
If the input is constrained to be binary, $x \in \{ \pm \sqrt{v} \}$,
the capacity is achieved by using these two inputs
with equal probability.
The capacity is reduced to a somewhat messy integral,
\beq
 C' =
\int_{-\infty}^{\infty}
\d y \, N(y;0) \log N(y;0)
%\nonumber \\
%& &
-
\int_{-\infty}^{\infty}
\d y \,
P(y) \log P(y) ,
\eeq
 where $N(y;x) \equiv (1/\sqrt{2 \pi}) \exp [ - ( y-x)^2/2 ]$,
$x\equiv \sqrt{v}/ \sigma$,
and $P(y) \equiv [ N(y;x)+N(y;-x) ]/2$.
This capacity is smaller than
the unconstrained capacity (\ref{eq.unconstrained.cap}),
but for small signal-to-noise ratio, the two capacities are close
in value.
\item
If the output is thresholded, then the Gaussian channel is turned
into a binary symmetric channel whose transition probability
is given by the error function $\erf$
defined on page \pageref{sec.erf}. The capacity is
%%%%%%%
\marginfig{%
\begin{center}
\psfig{figure=/home/mackay/_doc/code/brendan/gc.ps,width=1.85in,angle=-90}
\mbox{\psfig{figure=/home/mackay/_doc/code/brendan/gc.l.ps,width=1.85in,angle=-90}}\\[-0.05in]
\end{center}
%
\caption[a]{Capacities (from top to bottom in each graph)
$C$, $C'$, and $C''$,
versus the signal-to-noise ratio $(\sqrt{v}/\sigma)$.
The lower graph is a log--log plot.}
}
%%%%%%%%
\beq
C'' = 1 - H_2( f ), \mbox{ where $f= \erf(\sqrt{v}/\sigma)$} .
\eeq
%\item
% The capacities are plotted in the margin.
\een
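 As a rough numerical illustration (a sketch, with capacities in bits and the arbitrary choice
 $v/\sigma^2 = 1$): the unconstrained capacity is $C = \frac{1}{2} \log_2 2 = 0.5$.
 With thresholding, the probability that the noise carries $y$ across zero is the area of a
 unit Gaussian lying more than one standard deviation to one side of its mean, about $0.159$.
 Since $H_2(f) = H_2(1-f)$, it makes no difference whether $f$ denotes this tail area or its complement, and
\beq
 C'' \simeq 1 - H_2( 0.159 ) \simeq 1 - 0.631 = 0.369 .
\eeq
 The binary-input capacity $C'$ lies between these two values and has to be found by numerical integration.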
}
%\soln{ex.beccode}{
% The design of good codes for erasure channels\index{erasure-correction}
% is an active research area
% \cite{spielman-96,LubyDF}. Have fun!
%}
% RAID
\soln{ex.raid}{
There are several RAID systems. One of the easiest
to understand consists of 7 \disc{} drives which store data\index{erasure-correction}
at rate $4/7$ using a $(7,4)$ \ind{Hamming code}: each successive\index{RAID}\index{redundant array of independent disks}
four bits are encoded with the code and the seven codeword
bits are written one to each disk. Two or perhaps
three disk drives
can go down and the others can recover the data. The
effective channel model here is a binary erasure channel,
because it is assumed that we can tell when a disk is
dead.
It is not
possible to recover the data for {\em some\/} choices
of the three dead disk drives; can you see why?
}
\exercissxB{2}{ex.raid3}{
Give an example of three \disc{} drives that, if lost, lead
to failure of the above RAID system, and three that can
be lost without failure.
}
\soln{ex.raid3}{
The $(7,4)$ Hamming code has codewords of weight 3. If any set of
three \disc{} drives\index{erasure-correction} corresponding to one of those codewords
is lost, then the other four disks can only recover 3 bits
of information about the four source bits; a fourth bit is lost.
[\cf\ \exerciseref{ex.qeccodeperfect} with $q=2$: there are
no binary MDS codes. This deficit is discussed further in
\secref{sec.RAIDII}.]
Any other set of three disk drives can be lost without
problems because the corresponding four by four submatrix
of the generator matrix is invertible.
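 For a concrete instance, assume the $(7,4)$ Hamming code whose parity bits are
 $t_5 = s_1 \oplus s_2 \oplus s_3$, $t_6 = s_2 \oplus s_3 \oplus s_4$ and
 $t_7 = s_1 \oplus s_3 \oplus s_4$, with drive $n$ holding bit $t_n$
 (this labelling is an assumption; other labellings give different but equivalent triples).
 The source word {\tt{1000}} encodes to {\tt{1000101}}, a codeword of weight 3 with {\tt{1}}s on drives 1, 5 and 7.
 If those three drives are lost, the surviving bits are
\beq
 t_2 = s_2 , \:\: t_3 = s_3 , \:\: t_4 = s_4 , \:\: t_6 = s_2 \oplus s_3 \oplus s_4 ,
\eeq
 none of which depends on $s_1$, so $s_1$ cannot be recovered.
 In contrast, losing the three parity drives 5, 6 and 7 leaves $s_1, \ldots, s_4$ directly readable.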
% The simplest
% example of a recoverable failure is when the three parity
% drives (5,6,7) go down.
A better code would be the digital fountain
-- see \chref{chdfountain}.
% \cite{LubyDF},\footnote{{\tt http://www.digitalfountain.com/}}
}
\dvipsb{solutions real channels s7}
%%%%%%% was a chapter on further exercises here once!
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%% PART %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\renewcommand{\partfigure}{\poincare{8.2}}
\part{Further Topics in Information Theory}
\prechapter{About Chapter}
In Chapters \ref{ch1}--\ref{ch7}, we
concentrated on two aspects of information theory
and coding theory: source coding -- the compression
of information so as to make efficient use of data transmission
and storage channels; and channel coding -- the redundant
encoding of information so as to be able to detect and correct
\ind{communication} errors.
In both these areas we started by ignoring practical
considerations, concentrating on the question
of the theoretical limitations and possibilities of coding.
We then discussed practical source-coding and channel-coding
schemes, shifting the emphasis towards computational
feasibility. But the prime criterion for comparing encoding
schemes remained the efficiency of the code in terms of
the channel resources it required: the best source codes
were those that achieved the greatest compression; the best channel
codes were those that communicated at the highest rate with a given
probability of error.
In this chapter we now shift our viewpoint a little, thinking of
{\em ease of information retrieval\/} as a primary goal. It turns out that
the random codes\index{random code} which were theoretically useful in our
study of channel coding are also useful for rapid information
retrieval.
Efficient information retrieval is one of the problems that brains seem
to solve effortlessly, and
\ind{content-addressable memory}\index{memory!content-addressable} is one of the
topics we will study when we look at neural networks.
\medskip
%\chapter{Hash codes: codes for efficient information retrieval}
\ENDprechapter
\chapter{Hash Codes: Codes for Efficient Information Retrieval \nonexaminable}
% 9
\label{ch.hash}
% \chapter{Hash codes: codes for efficient information retrieval}
% \input{tex/_lhash.tex}
%
% prerequisites -- the birthday problem questions
% postreqs: hopfield nets
%
% exercises also in _e8.tex AND _e7.tex, solns in _shash and _se8
% _e8 has ones relevant to hashes
%
% \label{ch.hash}
%
% SUGGESTION:
%
% include an illustrative example at start.
% add a diagram showing buckets, memory....
\newcommand{\hashS}{S}
\newcommand{\hashs}{s}
\newcommand{\hashN}{N}
\newcommand{\hashT}{T}
% \newcommand{\hashn}{n}
\section{The information-retrieval problem}
A simple example of an
\index{information retrieval}{information-retrieval}\index{hash code}\
%\index{code!hash}
problem is the task of
implementing a \ind{phone directory}\index{telephone directory} service, which, in response to a
person's {\dem name}, returns (a) a confirmation that that person
is listed in the directory; and (b) the person's {phone number} and other
details.
We could formalize this problem as follows, with $\hashS$ being the
number of names that must be stored in the \ind{directory}.
\marginfig{\small
\begin{tabular}{@{}p{1.20in}l} \toprule
\parbox[t]{1.2in}{\small string length} & $N \simeq 200$ \\
\parbox[t]{1.2in}{\small\raggedright number of strings} & $S \simeq 2^{23}$ \\
\parbox[t]{1.2in}{\small\raggedright number of possible} & $2^N \simeq 2^{200}$ \\
\parbox[t]{1.2in}{\small\raggedright \hspace{0.2in} strings} & \\
\bottomrule
% WOULD love this paragraph to be indented differently
% HELP
\end{tabular}
\caption[a]{Cast of characters.}
}
% Imagine that y
You are given a list of $\hashS$ binary strings of length
$\hashN$ bits, $\{\bx^{(1)}, \ldots, \bx^{(\hashS)}\}$, where
$\hashS$ is considerably
smaller than the total number of possible strings, $2^\hashN$. We will call
the superscript `$\hashs$' in $\bx^{(\hashs)}$ the {\dem record number\/} of the string.
The idea is that $\hashs$ runs over customers in the order in which they are
added to the directory and $\bx^{(\hashs)}$ is the name of customer $\hashs$. We assume
for simplicity that all people have names of the same length.
The name length might be, say,
$\hashN = 200$ bits, and
we might want to store the details of
ten million customers, so $\hashS \simeq 10^7 \simeq 2^{23}$. We will ignore the possibility that two
customers have identical names.
The task is to construct the inverse of the mapping from $s$ to
$\bx^{(\hashs)}$, \ie, to make a system that, given a string $\bx$,
% with an unknown record number, will
returns the value of $\hashs$ such that
$\bx = \bx^{(\hashs)}$ if one exists, and otherwise reports that no such
$\hashs$ exists. (Once we have the record number, we can go and look in
memory location $\hashs$ in a separate memory full of
phone numbers to find the required
number.)
The aim, when solving this task, is to
% is system should
use minimal computational resources
in terms of the amount of memory used to store the inverse
mapping from $\bx$ to
$\hashs$ and the amount of time to compute the inverse
mapping. And, preferably, the inverse mapping should be implemented
in such a way that
further new strings can be added to the directory
in a small amount of computer time too.\index{content-addressable memory}
%
% add picture to show lookup table
%
\subsection{Some standard solutions}
\label{sec.simplehash}
The simplest and dumbest solutions to the information-retrieval problem
are a look-up table and a raw list.
\begin{description}
\item[The look-up table] is a piece of memory of size $2^N \log_2 \hashS$,
$\log_2 \hashS$ being the amount of memory required to store an integer
between 1 and $\hashS$. In each of the $2^N$ locations, we put a zero, except
for the locations $\bx$ that correspond to strings $\bx^{(\hashs)}$,
into which we write the value of $\hashs$.
The look-up table is a simple and quick solution, but only if there
is sufficient memory for the table, and if the cost of
looking up entries in memory is independent of the memory size.
But in our definition of the task, we assumed that $N$ is
% sufficiently large
about 200 bits or more, so the amount of memory required would be
of size $2^{200}$;
this solution is completely out of the question. Bear in mind that
the number of particles in the solar system is only about $2^{190}$.
% particles in the known universe is
\item[The raw list]
is a simple list of ordered pairs $(\hashs, \bx^{(\hashs)} )$ ordered by the value of
$\hashs$. The mapping from $\bx$ to $\hashs$ is achieved by searching through
the list of strings, starting from the top, and comparing the incoming
string $\bx$
with each record $\bx^{(\hashs)}$ until a
match is found. This system is very easy to
maintain, and uses a small amount of memory, about $\hashS \hashN$ bits,
but is rather slow to use, since on average five million pairwise
comparisons will be made.
\end{description}
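To put numbers on these two extremes (a back-of-envelope sketch using the cast of characters,
$N = 200$ and $\hashS = 2^{23}$):
\beqan
 \mbox{look-up table:} & & 2^N \log_2 \hashS = 2^{200} \times 23 \mbox{ bits} , \\
 \mbox{raw list:} & & \hashS \hashN = 2^{23} \times 200 \simeq 1.7 \times 10^{9} \mbox{ bits} ,
\eeqan
the latter being about 210 megabytes.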
\exercissxB{2}{ex.meanhash}{
Show that the average time taken
to find the required string in a raw list, assuming that the original names
were chosen at random, is about $\hashS + N$ binary comparisons.
(Note
that you don't have to compare the whole string of length $N$,
since a comparison can be terminated as soon as a mismatch occurs;
show that you need on average two binary comparisons per incorrect
string match.)
Compare this with the worst-case search time
-- assuming that the devil chooses
the set of strings and the search key.
}
The standard way in which phone directories are made improves
on the look-up table and the raw list by using
an {\dem{{alphabetically-ordered list}}}\index{alphabetical ordering}.
\begin{description}
\item[Alphabetical list\puncspace]
The strings $\{ \bx^{(\hashs)} \}$
% $...$
are sorted into alphabetical order. Searching for
an entry now usually takes less time than was needed for the raw list because
we can take advantage of the sortedness; for example, we can open
the phonebook at its middle page, and compare the
name we find there with the target string; if the target is `greater'
than the middle string then we know that the required string, if
it exists, will be found in the second half of the alphabetical directory.
Otherwise, we look in the first half.
By iterating this splitting-in-the-middle procedure,
we can identify the target string, or establish that the string is not
listed, in $\lceil \log_2 \hashS \rceil$ string comparisons. The expected
number of binary comparisons per string comparison
will tend to increase as the search
progresses,
%, because the leading bits of the two strings involved
% in the comparison are expected to become similar; but by being smart
% and keeping track of which leading bits we have looked at
% already in previous searches, it seems plausible that
% we can reduce the number of binary
% operations to about $\lceil \log_2 \hashS \rceil + N$ binary comparisons.
but the total number of binary comparisons required will be
no greater than $\lceil \log_2 \hashS \rceil N$.
The amount of memory required is the same as that required for the raw list.
Adding new strings to the database requires that we insert them in the
correct location in the list. To find that location takes about
 $\lceil \log_2 \hashS \rceil$ string comparisons.
%Then shuffling along all
% of the subsequent entries in the directory to make space for the
% new entry may take some computer time, depending on how the memory works.
\end{description}
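For the cast of characters ($N = 200$, $\hashS \simeq 2^{23}$), these bounds work out as follows (a sketch):
\beqan
 \lceil \log_2 \hashS \rceil &=& 23 \mbox{ string comparisons} , \\
 \lceil \log_2 \hashS \rceil N &=& 23 \times 200 = 4600 \mbox{ binary comparisons at most} ,
\eeqan
compared with roughly five million string comparisons, on average, for the raw list.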
Can we improve on the well-established alphabetized list?
Let us consider our task from some new viewpoints.
% for a moment and think of other ways of viewing it.
The task is to construct a mapping $\bx \rightarrow \hashs$ from $N$ bits
% ($\bx$)
to $\log_2 \hashS$ bits.
% ($\hashs$).
%
% what does this mean?
%
This is a pseudo-invertible mapping, since for any $\bx$
that maps to a non-zero $\hashs$, the customer database contains the
pair $(\hashs , \bx^{(\hashs)})$ that takes us back. Where have we come
across the idea of mapping from $N$ bits to $M$ bits before?
We encountered this idea twice: first,
in source coding, we studied block codes which were mappings
from strings of $N$ symbols to a selection of one label in a list.
% $...$.
The task of information retrieval is similar
% pretty much identical
to the task
(which we never actually solved) of making an encoder for a
typical-set compression code.
The second time that we mapped bit strings to bit strings of another dimensionality
was when we studied channel codes. There, we considered codes
that mapped from $K$ bits to $N$ bits, with $N$ greater than $K$,
and we made theoretical progress using {\em random\/} codes.
In hash codes, we put together these two notions.
We will study {random codes that map from $N$ bits to $M$ bits where
$M$ is {\em smaller\/} than $N$}.\index{random code}
% Another strand: the dumb look-up table would be really nice, very quick,
% the only problem is it requires too much memory. But there are so
% few vectors, what if we project them down into a lower-dimensional
% space? A few will collide, but if they are mainly distinct then
% we can just implement the look-up table in a lower dimensional
% space.
The idea is that we will map the original high-dimensional space
down into a lower-dimensional space, one in which it is feasible
to implement the dumb look-up table method which we rejected a
moment ago.
\marginfig{\small
\begin{tabular}{@{}p{1.2in}l} \toprule
\parbox[t]{1.2in}{\small string length} & $N \simeq 200$ \\
\parbox[t]{1.2in}{\small number of strings} & $S \,\simeq 2^{23}$ \\
\parbox[t]{1.2in}{\small size of hash function} & $M \simeq 30\ubits$ \\[0.01in]
\parbox[t]{1.2in}{\small size of hash table} & $T = 2^M $\\
& $\:\:\:\:\: \simeq 2^{30}$ \\ \bottomrule
% HELP the spacing between successive rows
% is smaller than the spacing between lines!! :-(
% HELP
\end{tabular}
\caption[a]{Revised cast of characters.}
}
\section{Hash codes}
First we will describe how a hash code works, then we will study the
properties of idealized hash codes.
A hash code implements a solution to the information-retrieval problem,
that is, a mapping from $\bx$ to $s$, with the help of a pseudo-random
function called a {\dem\ind{hash function}},
which maps the $N$-bit string $\bx$ to an $M$-bit string $\bh(\bx)$,
where $M$ is smaller than $N$. $M$ is typically chosen
% to be sufficiently small
such that the `table size' $\hashT \simeq
2^M$ is a little bigger than $S$ -- say,
ten times
% one or two orders of magnitude
bigger.
For example,
if we were expecting
% $S$ a million values for $\bx$
$S$ to be about a million,
we might map
%a 200-bit
$\bx$ into a 30-bit hash $\bh$ (regardless of the size $N$ of each
item $\bx$).
The hash function is some fixed deterministic function which should
ideally be indistinguishable from a fixed random code. For practical
purposes, the hash function must be quick to compute.
Two simple examples of \ind{hash function}s are:
\begin{description}
\item[Division method\puncspace]
The table size $\hashT$ is a prime number, preferably
one that is not close to a power of 2. The hash value is the remainder
when the integer $\bx$ is divided by $\hashT$.
\item[Variable string addition method\puncspace]
This method assumes that $\bx$ is a string of
bytes and that the table size $\hashT$ is 256.
The characters of $\bx$ are added, modulo 256.
%
%
% http://members.xoom.com/thomasn/s_man.htm
%
%
%
% This hash function does not distinguish anagrams.
This hash function has the defect that it maps strings
that are anagrams of each other onto the same hash.
It may be improved by putting the running total through
a fixed pseudorandom permutation
after each character is added.
%
%\item[
In the\index{hash function}
{\dem variable string exclusive-or method\/} with table size $\leq 65\,536$,
the string is hashed twice in this way, with the initial running total
being set to 0 and 1 respectively (\algref{alg.hashxor}).
The result is a 16-bit hash.
\end{description}
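A small worked example of the anagram defect (a sketch, assuming ASCII character codes, so that
{\tt{a}}, {\tt{b}} and {\tt{c}} are 97, 98 and 99): under the variable string addition method,
the strings {\tt{abc}} and {\tt{cab}} both hash to
\beq
 ( 97 + 98 + 99 ) \bmod 256 = 294 \bmod 256 = 38 ,
\eeq
because addition does not care about the order of the characters.
For the division method, with the illustrative prime table size $\hashT = 1009$ (a hypothetical choice),
the integer $\bx = 123456$ hashes to
\beq
 123456 \bmod 1009 = 123456 - 122 \times 1009 = 358 .
\eeq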
%
% probably a good idea to include this code stolen from Thomas Niemann
% typedef unsigned short int HashIndexType; (changed to int)
%
\begin{algorithm}% figure}
\begin{framedalgorithmwithcaption}%
{
\caption[a]{{\tt C} code implementing the variable string exclusive-or method
to create
a hash {\tt h} in the range $0\ldots 65\,535$
from a string {\tt x}.
Author: Thomas Niemann.}
\label{alg.hashxor}
}
\small
\begin{verbatim}
unsigned char Rand8[256];     // This array contains a random
                              // permutation from 0..255 to 0..255
int Hash(char *x) {           // x is a pointer to the first char;
    int h;                    //   *x is the first character
    unsigned char h1, h2;
    if (*x == 0) return 0;    // Special handling of empty string
    h1 = *x; h2 = *x + 1;     // Initialize two hashes
    x++;                      // Proceed to the next character
    while (*x) {              // Cast keeps the index within 0..255
        h1 = Rand8[h1 ^ (unsigned char) *x]; // Exclusive-or with the two
        h2 = Rand8[h2 ^ (unsigned char) *x]; // hashes and randomize
        x++;
    }                         // End of string is reached when *x=0
    h = ((int)(h1)<<8) |      // Shift h1 left 8 bits and add h2
         (int) h2 ;
    return h ;                // Hash is concatenation of h1 and h2
}
\end{verbatim}
% original code stored in tex/_hash.code
\end{framedalgorithmwithcaption}
\end{algorithm}% figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}
\figuremargin{\footnotesize
\setlength{\unitlength}{1mm}
\thinlines
\begin{picture}(100,100)(-20,-40)
\put(65,-40){\line(0,1){90}}
\put(75,-40){\line(0,1){90}}
\multiput(65,-40)(0,3){31}{\line(1,0){10}}
\newcommand{\xvector}[2]{\put(-10,#1){\framebox(40,4){$\bx^{(#2)}$}}}
\newcommand{\hvector}[2]{\put(53,#1){\makebox(5,0){$\bh(\bx^{(#2)})\rightarrow$}}}
\newcommand{\svector}[2]{\put(74.3,#1){\makebox(0,0)[r]{$#2$}}}
\newcommand{\slvector}[2]{\put(35,#1){\vector#2{10}}}
\newcommand{\xhs}[4]{\xvector{#1}{#2}\hvector{#3}{#2}\svector{#3}{#2}\slvector{#1}{#4}}
\xhs{30}{1}{18.7}{(1,-1)}
\xhs{24}{2}{45.536}{(1,2)}
\xhs{18}{3}{6.7}{(1,-1)}
\xhs{0}{s}{-20.5}{(1,-2)}
% labels
\put(39,65){\makebox(0,0){Hash}}
\put(39,62){\makebox(0,0){function}}
\put(34,59){\vector(1,0){11}}
\put(10,60){\makebox(0,0){Strings}}
\put(48,58.60){\makebox(0,0)[l]{hashes}}
\put(70,62){\makebox(0,0){Hash table}}
%
\put(10,12){\makebox(0,0){$\vdots$}}
\put(10,-8){\makebox(0,0){$\vdots$}}
% N range indication
\put(10,40){\vector(-1,0){20}}
\put(10,40){\vector(1,0){20}}
\put(10,43){\makebox(0,0){$N$ bits}}
% M range indication
\put(70,54){\vector(-1,0){5}}
\put(70,54){\vector(1,0){5}}
\put(70,57){\makebox(0,0){$M$ bits}}
% 2^M range
\put(82,5){\vector(0,-1){45}}
\put(82,5){\vector(0,1){45}}
\put(84,5){\makebox(0,0)[l]{$2^M$}}
% S range
\put(-15,10){\vector(0,1){23}}
\put(-15,10){\vector(0,-1){30}}
\put(-17,10){\makebox(0,0)[r]{$S$}}
%
\end{picture}
}{
\caption[a]{Use of hash functions for information retrieval.
For each string $\bx^{(s)}$,
the hash $\bh= \bh(\bx^{(s)})$ is computed,
and the value of $s$ is written into the
$\bh$th row of the hash table. Blank rows in the hash table
contain the value zero.
The table size is $T = 2^M$.}
\label{fig.hashtable}
}
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Having picked a hash function $\bh(\bx)$,
we implement an
% efficient
information retriever
as follows. (See \figref{fig.hashtable}.)
\begin{description}
\item[Encoding\puncspace]
A piece of memory called the {\em hash table\/}
is created of size $2^Mb$ memory units, where $b$ is the amount of memory needed to represent
an integer between $0$ and $\hashS$. This table is initially set to zero
throughout. Each memory $\bx^{(\hashs)}$ is put through
the hash function, and at the location in the hash table corresponding
to the resulting vector $\bh^{(\hashs)} = \bh( \bx^{(\hashs)} )$, the integer $\hashs$ is written --
unless that entry in the hash table is already occupied, in which case
we have a {\em collision\/} between $\bx^{(\hashs)}$ and some earlier
$\bx^{(\hashs')}$ which both happen to have the same hash code.
Collisions can be handled in various ways -- we will discuss some
in a moment -- but first let us complete the basic picture.
\item[Decoding\puncspace]
To retrieve a piece of information corresponding to a
target vector $\bx$, we compute the hash $\bh$ of
$\bx$ and look at the corresponding location in the hash table.
If there is a zero, then we know immediately that the string $\bx$ is
not in the database. The cost of this answer is the cost of one hash-function
evaluation and one look-up in the table of size $2^M$.
If, on the other hand, there is a non-zero entry $\hashs$ in the table,
there are two possibilities: either the vector $\bx$ is
indeed equal to $\bx^{(\hashs)}$; or the vector $\bx^{(\hashs)}$ is another
vector that happens to have the same hash code as the target $\bx$. (A third
possibility is that this
non-zero entry might have something to do with our
yet-to-be-discussed collision-resolution system.)
To check whether $\bx$ is
indeed equal to $\bx^{(\hashs)}$, we take the tentative answer $\hashs$,
look up $\bx^{(\hashs)}$ in the original forward database, and compare it
bit by bit with $\bx$; if it matches then we report $\hashs$ as the
desired answer. This successful retrieval has an overall cost of
one hash-function
evaluation, one look-up in the table of size $2^M$,
another look-up in a table of size $\hashS$, and
% up to
$N$ binary comparisons -- which may be much cheaper
than the simple solutions presented in section \ref{sec.simplehash}.
\end{description}
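With the revised cast of characters ($M = 30$, $\hashS \simeq 2^{23}$, $N = 200$), these costs
work out roughly as follows (a sketch, taking $b = 24$ bits, which is enough to represent an integer between 0 and $2^{23}$):
\beqan
 \mbox{table memory} &=& 2^M b = 2^{30} \times 24 \mbox{ bits} \simeq 2.6 \times 10^{10} \mbox{ bits} , \\
 \mbox{fraction of rows occupied} &\leq& \hashS / 2^M = 2^{-7} \simeq 0.008 .
\eeqan
A successful retrieval then needs at most $N = 200$ binary comparisons, whereas the bound
for the alphabetical list is $\lceil \log_2 \hashS \rceil N = 4600$.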
\exercissxB{2}{ex.hash.retrieval}{
If we have checked the first few bits of $\bx^{(\hashs)}$ with
$\bx$
and found them to be equal, what is the probability that
the correct entry has been retrieved, if the alternative hypothesis
is that $\bx$ is actually not in the database? Assume that
the original source strings are random, and the hash function is a random hash function.
How many
% Could have an exercise here on the number of
binary evaluations are
needed to be sure with odds of a billion to one that
the correct entry has been retrieved?
% [Note we are not assuming that the
% original strings $\{ \bx^{(\hashs)} \}$ are random; they may be
% very similar to each other. We are just assuming that the hash function
% is random.]
}
%
% view as a kind of source
% encoding - reduces huge redundancy, where the redundancy
% has the form P(x) = sum_x pi_c delta(x_c)
%
% does so using random coding.
The hashing method of information retrieval
can be used for strings $\bx$ of arbitrary length,
if the hash function $\bh(\bx)$ can be applied to strings of
any length.
\section{Collision resolution}
We will study two ways of resolving collisions: appending in the
table, and storing elsewhere.
\subsection{Appending in table}
When encoding, if a collision occurs, we continue
down the hash table and write the value of $s$ into the next available
location in memory that currently contains a zero. If we reach the bottom
of the table before encountering a zero, we continue from the top.
When decoding, if we compute the hash code for $\bx$ and find that
the $s$ contained in the table doesn't point to an $\bx^{(s)}$ that
matches the cue $\bx$, we continue down the hash table until we
either find an $s$ whose $\bx^{(s)}$ does match the cue
% key
$\bx$, in which case we are done, or else encounter a zero, in which
case we know that the cue $\bx$ is not in the database.
For this method, it is essential that the table be substantially
bigger in size than $\hashS$. If $2^M < \hashS$ then the encoding
rule will become stuck with nowhere to put the last strings.
\subsection{Storing elsewhere}
A more robust and flexible method is to use {\dem pointers\/}
to additional pieces of memory in which collided strings are stored.
There are many ways of doing this. As an example, we could store
in location $\bh$ in
the hash table a pointer (which must be distinguishable from
a valid record number $s$) to a `bucket' where all the
strings that have hash code $\bh$ are stored in a
{\dem sorted list}.
The encoder sorts the strings in each bucket alphabetically as the hash table and buckets
are created.
The decoder simply has to go and look in the relevant bucket
and then check the short list of strings that are there by a
brief alphabetical search.
% of strings that have this encoding.
This method of storing the strings in buckets allows the option of
making the hash table quite small, which may have practical benefits. We
may make it so small that almost all strings are involved in collisions,
so all buckets contain a small number of strings.
It only takes a small number of binary comparisons to identify which
of the strings in the bucket matches the cue $\bx$.
\section{Planning for collisions: a birthday problem}
\index{birthday}
\exercissxA{2}{ex.hash.collision}{
If we wish to store $S$ entries using a hash function
whose output has $M$ bits, how many collisions should we expect
to happen, assuming that our hash function is an ideal random function?
What size $M$ of hash table is needed if we would like
the expected number of collisions to be smaller than 1?
What size $M$ of hash table is needed if we would like
the expected number of collisions to be a small fraction, say 1\%,
of $S$?
}
[Notice the similarity of this problem to
\exerciseref{ex.birthday}.]
\section{Other roles for hash codes}
\subsection{Checking arithmetic}
\index{error detection}If you wish to check an addition that was done by hand,
you may find useful the method of {\dem{\ind{casting out nines}}}.\index{nines}
In casting out nines, one finds the sum, modulo nine, of
all the {\em digits\/} of the numbers
to be summed and compares it with the
sum, modulo nine, of the digits of the putative answer.
[With a little practice, these sums can be computed
much more rapidly than the full original addition.]
% calculation proper.]
\exampla{%???????????
% want this to have reference: {ex.nines}{
In the calculation shown in the margin
\marginpar{\begin{center}
\begin{tabular}[t]{r}
{\tt 189} \\
{\tt +1254} \\
{\tt + 238} \\
\hline
{\tt 1681} \\
\end{tabular}
\end{center}}
the sum, modulo nine, of the digits in {\tt 189+1254+238}
is {\tt 7}, and the sum, modulo nine, of {\tt 1+6+8+1} is {\tt 7}.
The calculation thus passes the casting-out-nines test.
}
Casting out nines gives a simple example of a hash function.
For any addition expression of the form $a+b+c+\cdots$,
where $a, b, c, \ldots$ are decimal numbers
we define $h \in \{0,1,2,3,4,5,6,7,8\}$ by
\beq
 h(a+b+c+\cdots) = \mbox{ sum modulo nine of all digits in $a,b,c,\ldots$ } ;
\eeq
then it is a nice property of decimal arithmetic that if
\beq
a+b+c+\cdots = m+n+o+\cdots
\eeq
then the hashes $h(a+b+c+\cdots)$ and $h(m+n+o+\cdots)$ are equal.
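To see why this holds, assuming nothing beyond ordinary modular arithmetic, note that $10 \equiv 1 \pmod{9}$, so every power of ten is congruent to 1, and a decimal number with digits $d_k \ldots d_1 d_0$ (notation introduced here only for illustration) satisfies
\beq
 \sum_{i=0}^{k} d_i 10^{i} \equiv \sum_{i=0}^{k} d_i \pmod{9} .
\eeq
Each summand is therefore congruent to its own digit sum, and congruences add, so two expressions with equal totals have equal hashes.
In the margin example, $189 \equiv 0$, $1254 \equiv 3$ and $238 \equiv 4$, giving $0+3+4 = 7$, in agreement with $1681 \equiv 1+6+8+1 = 16 \equiv 7$.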
\exercissxB{1}{ex.nines.p}{
What evidence\index{model comparison} does a correct casting-out-nines
match give in favour of the
hypothesis that the addition has been done correctly?
}
\subsection{Error detection among friends}
\index{error detection}Are two files the same? If the files are on the same computer,
we could just compare them bit by bit. But if the two files
are on separate machines,
it would be nice to have a way of confirming that two
files are identical without having to transfer one of the
files from A to B. [And even if we did transfer one of the files,
we would still like a way to confirm whether it has been
received without modifications!]
This problem can be solved using hash codes.
% Alice sends a file to Bob, and wants to do error detection.
Let Alice and Bob be the holders of the two files; Alice sent
the file to Bob, and they wish to confirm it has been received
without error.
If Alice computes the hash
% function
of her file and sends it to Bob,
and Bob computes the hash
% function
of his file, using the
same $M$-bit hash function, and the two hashes match, then
Bob can deduce that the two files are almost surely the
same.
% should have some sort of reference to digest?
% The hash of the file is often called the {\dem\ind{digest}}.
\exampl{example.hash.II}{
What is the probability of a false negative, \ie,
the probability, given that
the two files do differ, that the two hashes
% Bob concludes
are nevertheless identical?
}
% Solution::::::::
If we assume that the hash function is random and that
the
% unrelated
process that causes the files to differ knows nothing about the hash function,
then the probability of a false negative is $2^{-M}$.\ENDsolution
A 32-bit hash gives a probability of false negative of about $10^{-10}$.
% 2.3283064365387e-10
It is common practice to use a linear hash function called
a 32-bit cyclic redundancy check to detect errors in files.
(A cyclic redundancy check is a set of 32 parity-check bits
similar to the 3 parity-check bits of the $(7,4)$ Hamming code.)
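As a quick check of the figure quoted above,
\beq
 2^{-32} = \frac{1}{4\,294\,967\,296} \simeq 2.3 \times 10^{-10} ,
\eeq
so, under the random-hash assumption, only about one differing pair of files in every four billion would go undetected.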
%%%%%%%%% end solution
\begin{conclusionbox}
To have a false-negative rate smaller than one in a billion,
$M = 32$ bits is plenty, if the errors are produced by noise.
\end{conclusionbox}
\exercissxB{2}{ex.whyonlyCRC}{
Such a simple parity-check code only detects errors; it doesn't help correct
them. Since error-{\em{correcting\/}} codes exist, why not use
one of them to get some error-correcting
capability too?
}
%
% more maths requested here
%
\subsection{Tamper detection}
\index{security}\index{tamper detection}\index{detection of forgery}\index{forgery}What
if the differences between the two files are not
simply `noise', but are introduced by an adversary,
a clever {\dem forger\/} called
Fiona, who modifies the original file to make
a \ind{forgery}\index{cryptography!digital signatures}\index{cryptography!tamper detection}
that purports to be \ind{Alice}'s file?
How can Alice make a \ind{digital signature}
for the file so that \ind{Bob} can confirm that
no-one has tampered with the file?
And how can we prevent Fiona from listening
in on Alice's signature and attaching it to other
files?
Let's assume that Alice computes a hash function for the
file and sends it securely to Bob.
% , in the same way as for error-detection above.
If Alice computes a simple hash function for the file
like the linear cyclic redundancy check, and Fiona knows
that this is the method of verifying the file's
integrity, Fiona can make her chosen modifications
to the file and then easily identify (by linear
algebra) a further 32-or-so single bits
that, when flipped, restore the hash function
of the file to its original value.
{\em Linear hash functions give no security against
forgers.}
We must
therefore require that the hash function\index{inversion of hash function}
be {\em hard to invert\/} so that no-one can construct a
tampering that leaves the hash function unaffected.
We would still like the hash function to be easy
to compute, however, so that Bob doesn't have to
do hours of work to verify every file he receives.
Such a hash function -- easy to compute, but hard to invert --
is called
a {\dem\ind{one-way hash function}}.\index{hash function!one-way}
Finding such functions is one of the active research areas of
\ind{cryptography}.
% Don't want to use an ecc, because with a linear ecc it is easy to construct
% a pair of tamperings which have the same syndrome and
% so leave the hash unaffected.
%How can we invent a function that has the
%property that h(x) is easy to compute, but
%it is very hard to find an x
%suxh that h(x) has a chosen value h?
%A lot of research is being done on this question
%still, and the sort of functions people use
%to make a one-way hash function are functions like:
%
% exponentiation-modulo-M
%
%Definition:
% take x, and think of it as a number.
% compute 1023^(x) modulo M,
% where "^" means "1023 to the power x",
% and M is some other integer, eg 97.
%
%Apparently it is hard to invert this sort of
% function (i.e. to take the "discrete logarithm").
%
%Real one-way hash functions are more complicated than
%this, but I hope this gives the idea.
%
A hash function that is widely used in the free software\index{software!hash function}
community to confirm that two files do not differ
is {\tt\ind{MD5}}, which produces a 128-bit
hash. The details of how it works
are quite complicated, involving convoluted exclusive-or-ing
and if-ing and and-ing.\footnote{{\tt http://www.freesoft.org/CIE/RFC/1321/3.htm}}
%
% of bits with each other
%
% Cryptography is the topic of the next chapter.
%
% rsync uses MD4 with a 128-bit checksum (for files with a matching size
% and date) initially. But (from the man entry):
% Current versions of rsync actually use an adaptive
% algorithm for the checksum length by default, using
% a 16 byte file checksum to determine if a 2nd pass
% is required with a longer block checksum. Only use
% this option if you have read the source code and
% know what you are doing.
% The `md5sum' program also uses 128 bits.
Even with a good one-way hash function, the digital signatures
described above
are still vulnerable to attack, if Fiona has access to the
hash function. Fiona could take the tampered file
and hunt for a further tiny modification to it such that its hash
matches the original hash of Alice's file. This would take
some time -- on average,
about $2^{32}$ attempts, if the hash function has
32 bits -- but eventually Fiona would find a tampered file that
matches the given hash. To be secure against
forgery, \ind{digital signature}s must either have
enough bits for such a random search to take too long,
or the \ind{hash function} itself must be kept
\ind{secret}.
\begin{conclusionbox}
Fiona has to hash $2^M$ files to cheat.
$2^{32}$ file modifications is not very many, so a 32-bit hash function
is not large enough for \ind{forgery} prevention.
\end{conclusionbox}
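To put a rough number on this, assume (purely for illustration) that each trial modification independently matches the target hash with probability $2^{-M}$, and that one hash evaluation takes $1\,$ns.
The number of trials is then geometrically distributed with mean $2^{M}$, and for $M=32$ the expected search time is
\beq
 2^{32} \times 1\,\mbox{ns} \simeq 4.3\,\mbox{s} .
\eeq
For $M=128$, by contrast, the same random search would take about $2^{128} \times 1\,\mbox{ns} \simeq 3.4 \times 10^{29}\,$s, which is of order $10^{22}$ years.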
% If Fiona works as
Another person who might have a motivation for forgery is
Alice herself.
For example, she might be making a bet on the outcome
of a race, without wishing to broadcast her prediction
publicly; a method for placing bets would be for her
to send to Bob the bookie the hash of her bet.
Later on, she could send Bob the details of her bet.
Everyone can confirm that her bet is consistent with
the previously publicized hash. [This method
of secret publication
% shing ideas
was used by Isaac Newton and Robert Hooke\index{Newton, Isaac}\index{Hooke, Robert}
% (1635-1703)
when they
wished to establish priority for scientific ideas
without revealing them. Hooke's hash function was alphabetization
% ed latin statements,
as illustrated by the
conversion of {\em UT TENSIO, SIC VIS\/} into the \ind{anagram} {\tt{CEIIINOSSSTTUV}}.]
% http://www.microscopy-uk.org.uk/mag/artmar00/hooke2.html
% http://www.rod.beavon.clara.net/leonardo.htm
% It was in his Helioscopes in 1676 that Hooke followed the popular seventeenth-century conceit of announcing a discovery in an anagram: cediinnoopsssttuu. He published its key two years later, in his most complete treatment of elasticity, in De Potentia Bestitutiva, or Of Spring. Here Hooke enunciated the original formulation of the law that bears his name: Ut Pondus sic Tensia, or 'the weight is equal to the tension'. [33] As the tension was seen as the product of an increasing series of weights in pans suspended on coiled springs, it is easy in this pre-Newtoniangravitation age to understand how Hooke spoke of the pondus, or weight, as acting on the spring. The formulation of 'Hooke's Law' with which we are more familiar today is Ut Tensia, sic Vis, or 'the tension is equal to the force'.
%
% http://www.aero.ufl.edu/~uhk/strength/strength.htm ??? CEIIOSSOTTUU ??? CEIINOSSITTUV
% ??? ceiiinosssttvv
% all accounts differ!
% http://arc-gen1.life.uiuc.edu/Bioph354/lect19.html
Such a protocol relies on the assumption that Alice cannot
change her bet after the event without the hash coming out wrong.
How big a hash function do we need to use to ensure that
Alice cannot cheat?
The answer is different from the size of the hash we needed
in order to defeat Fiona above, because Alice is the author of {\em
both\/} files. Alice could \ind{cheat} by searching for
two files that have identical
hashes to each other. For example, if
she'd like to cheat by placing two bets for the price of
one, she could make a large number $N_1$ of
versions of bet one (differing from each other
in minor details only), and a large number $N_2$ of versions of bet two,
and hash them all. If there's a \ind{collision} between
the hashes of two bets of different types,
then she can submit the common hash and thus buy herself the
option of placing either \ind{bet}.
\exampl{example.hashN1N2}{
If the hash has $M$ bits, how big do $N_1$
and $N_2$ need to be for Alice to have a good chance of finding
two different bets with the same hash?
}
% solution
This is a \ind{birthday} problem like \exerciseref{ex.birthday}.
If there are $N_1$ Montagues and $N_2$ Capulets at a party,
and each is assigned a `birthday' of $M$ bits,
the expected number of \ind{collision}s between a Montague and a Capulet
is
\beq
N_1 N_2 2^{-M} ,
\eeq
so to minimize the number of files hashed, $N_1+N_2$, Alice
should make $N_1$ and $N_2$ equal, and will need to hash about
$2^{M/2}$ files until she finds two that match.\ENDsolution
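The claim about equal $N_1$ and $N_2$ can be checked by a one-line calculation.
If Alice wants the expected number of collisions to be about one, she needs $N_1 N_2 = 2^{M}$, and by the inequality between arithmetic and geometric means
\beq
 N_1 + N_2 \geq 2 \sqrt{N_1 N_2} = 2 \times 2^{M/2} ,
\eeq
with equality only when $N_1 = N_2 = 2^{M/2}$.
The total is thus $2^{M/2+1}$ hashes, which is `about $2^{M/2}$' at the level of precision used here.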
\begin{conclusionbox}
Alice has to hash $2^{M/2}$ files to cheat.
[This is the square root of the number of hashes Fiona
had to make.]
\end{conclusionbox}
If Alice has the use of $C=10^6$ computers for $T=10$\,years, each computer
taking $t=1\,$ns to evaluate a hash,
the bet-communication system is\index{security}
secure against Alice's dishonesty only if $M \gg 2 \log_2 (CT/t) \simeq
 160$ bits.
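The arithmetic behind this figure, taking a year to be about $3 \times 10^{7}\,$s, runs as follows:
\beq
 \frac{CT}{t} \simeq \frac{10^{6} \times 10 \times 3 \times 10^{7}\,\mbox{s}}{10^{-9}\,\mbox{s}} = 3 \times 10^{23} \simeq 2^{78} .
\eeq
Alice can thus evaluate roughly $2^{78}$ hashes, and a birthday attack needing $2^{M/2}$ hashes is out of her reach only if $M/2$ is well above 78, that is, if $M$ is well above 156 bits.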
% end solution
\section*{Further reading}
The Bible for hash codes is volume 3 of \citeasnoun{KnuthAll}.
I highly recommend the story of Doug McIlroy's {\tt{\ind{spell}}}
program, as told in section 13.8 of {\em{Programming Pearls}} \cite{Bentley2}.
This astonishing piece of software makes use of a 64-\kilobyte\ data structure
 to store the spellings of all the words of a $75\,000$-word dictionary.
% also has some hash functions for strings on p 161, chapter 15.
% and random text generator.
\section{Further exercises} % removed and returned (maybe should transfer some of these?)
% solutions in _se8.tex
% oct 97
%
% info theory and the real world
%
\fakesection{Information theory and the real world (questions relating to hash functions)}
\exercisaxA{1}{ex.address}{
What is the shortest the \ind{address} on a typical international {letter}
could be, if it is to get to a unique human recipient? (Assume the permitted
characters are {\tt{[A-Z,0-9]}}.)
How long are typical \ind{email} addresses?
}
\exercissxA{2}{ex.uniquestring}{
How long does a piece of text need to be for you to be pretty
sure that no human has written that string of characters
before?
How many notes are there in a new \ind{melody}\index{music} that
has not been composed before?
}
\exercissxB{3}{ex.proteinmatch}{
{\sf Pattern recognition by \ind{molecules}}.\index{pattern recognition}
Some proteins produced in a cell have a regulatory role.
A regulatory \ind{protein} controls
the transcription of specific \ind{genes} in the \ind{genome}.
% that might code for other proteins or sometimes the protein itself.
This control often involves the protein's binding to a particular \ind{DNA}
sequence in the vicinity of the regulated gene. The presence of the
bound protein either promotes or inhibits transcription of the gene.
\ben
\item
Use information-theoretic arguments to obtain a lower bound on the size of a
typical protein that acts as a regulator specific to one gene in the
whole human genome. Assume that the genome is
a sequence of
$3 \times 10^{9}$
nucleotides drawn from a four letter alphabet $\{{\tt A},{\tt C},{\tt G},{\tt T}\}$;\index{amino acid}\index{nucleotide}\index{binding DNA}
a protein is a sequence of amino acids drawn from a twenty letter alphabet.
[Hint: establish how long the recognized DNA sequence has to be
in order for that sequence to be
unique to the vicinity of one
gene, treating the rest of the genome as a random sequence. Then
discuss how big the protein must be to recognize a
sequence of that length uniquely.]
\item
Some of the sequences recognized by \ind{DNA}-binding regulatory\index{protein!regulatory}
proteins consist of a subsequence that is repeated twice or
more, for example
the sequence
\beq
\mbox{{\tt{\underline{GCCCCC}CACCCCT\underline{GCCCCC}}}}
\eeq
is a binding site found upstream of the alpha-actin gene in humans.
%; this is a binding site for a transcription factor called Sp1.
Does the fact that some binding sites consist of
a {repeated\/} subsequence influence your answer to part (a)?
\een
}
%
% stole information acquisition exercises from here to move to gene chapter
%
\dvips
\section{Solutions}% to Chapter \protect\ref{ch.hash}'s exercises} %
\soln{ex.meanhash}{
First imagine comparing the string $\bx$ with
another random string $\bx^{(s)}$.
The probability that the first bits of the two strings match
is $1/2$. The probability that the second bits match
is $1/2$. Assuming we stop comparing once we hit the
first mismatch, the expected number of matches is 1,
so the expected number of comparisons is 2
\exercisebref{ex.waithead}.
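Explicitly, the comparison stops at bit $k$ with probability $2^{-k}$ (the first $k-1$ bits match and the $k$th does not), so, ignoring the finite length $N$ of the strings, the expected number of comparisons is
\beq
 \sum_{k=1}^{\infty} k \, 2^{-k} = 2 .
\eeq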
% errors corrected in draft 2.0.7 on Sun 31/12/00
Assuming the correct string is located at random in the
raw list, we will have to compare with an average
of $\hashS/2$ strings before we find it, which costs
$2 \hashS/2$ binary comparisons; and comparing
the correct strings takes $N$ binary comparisons,
giving a total expectation of $\hashS + N$ binary comparisons,
if the strings are chosen at random.
In the worst case (which may indeed happen in practice),
the other strings are very similar
to the search key, so that a lengthy sequence of comparisons
is needed to find each mismatch. The worst case is when the correct
string is last in the list, and all the other strings
differ in the last bit only, giving a requirement of $\hashS N$
binary comparisons.
}
\soln{ex.hash.retrieval}{
The likelihood ratio for the two hypotheses,
$\H_0$: $\bx^{(\hashs)} = \bx$, and
$\H_1$: $\bx^{(\hashs)} \neq \bx$,
contributed by the datum `the first bits of $\bx^{(\hashs)}$ and $\bx$ are equal'
is
\beq
\frac{ P( \mbox{Datum} \given \H_0 ) }
{ P( \mbox{Datum} \given \H_1 ) }
= \frac{1}{1/2} = 2.
\eeq
If the first $r$ bits all match, the likelihood ratio is $2^r$ to one.
On finding that 30 bits match, the odds are a billion to one
in favour of $\H_0$, assuming we start from even odds.
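Numerically, $2^{30} = 1\,073\,741\,824 \simeq 1.07 \times 10^{9}$; equivalently, the number of matching bits needed for a likelihood ratio of $10^{9}$ is
\beq
 r \geq \log_2 10^{9} = 9 \log_2 10 \simeq 29.9 ,
\eeq
so $r = 30$ is the smallest whole number of bits that suffices.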
[For a complete answer, we should compute the evidence
% prior probability of $\H_0$ and $\H_1$
given by the prior information that the hash entry $s$
has been found in the table at $\bh(\bx)$. This fact gives further evidence
in favour of $\H_0$.]
}
\soln{ex.hash.collision}{
Let the hash function have an output alphabet of size $T = 2^M$.
If $M$ were equal to $\log_2 S$ then we would have exactly enough bits
for each entry to have its own unique hash.
The probability that one particular pair of entries collide under a random
hash function is $1/T$.
The number of pairs is $S(S-1)/2$. So the expected number
of collisions between pairs is exactly
\beq
S(S-1)/(2T).
\eeq
If we would like this to be smaller than 1, then we need
$
T > S(S-1)/2
% S(S-1) < 2A \:\: \Rightarrow \:\: S < \sqrt{2A}
$
so
\beq
M > 2 \log_2 S.
\label{eq.M2Shash}
\eeq
We need {\em twice as many\/} bits as the number of bits,
$\log_2 S$,
that would be sufficient to give each entry a unique name.
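As an illustrative check, with a value of $S$ chosen purely for the sake of example, take $S = 2^{20} \simeq 10^{6}$ entries.
Then $M = 2 \log_2 S = 40$ bits gives $T = 2^{40}$ and
\beq
 \frac{S(S-1)}{2T} = \frac{2^{20}(2^{20}-1)}{2^{41}} \simeq \frac{1}{2} ,
\eeq
an expected number of collisions below one, whereas $M = 38$ bits would give about two.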
% fS = S(S-1)/(2A)
% A = (S-1) / (2 f )
If we are happy to have occasional collisions, involving a fraction
 $f$ of the $S$ names, then
we need $T > S/f$ (since the probability that one particular name
is collided-with is $f \simeq S/T$) so
\beq
M > \log_2 S + \log_2 [1/f] ,
\label{eq.MShash}
\eeq
which means for $f \simeq 0.01$ that we need an extra
7 bits above $\log_2 S$.
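For instance,
\beq
 \log_2 [1/f] = \log_2 100 \simeq 6.6 \;\;\; \mbox{for $f = 0.01$} ,
\eeq
which rounds up to 7 bits; with $S = 2^{20}$ entries (a value chosen only for illustration) this means $M = 27$ bits, compared with the $2 \log_2 S = 40$ bits required by (\ref{eq.M2Shash}).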
The important point to note is the \ind{scaling} of $T$
with $S$ in the two cases (\ref{eq.M2Shash},$\,$\ref{eq.MShash}). If we want
the hash function to be collision-free, then
we must have $T$ greater than $\sim \! S^2$.
If we are happy to have a small frequency of collisions, then
$T$ needs to be of order $S$ only.
% some factor greater than
}
%
%
%
\soln{ex.nines.p}{
The posterior probability ratio for
the two hypotheses, $\H_{+} = $ `calculation correct'
and $\H_{-} = $ `calculation incorrect'
is the product of the prior probability ratio
$P(\H_{+})/P(\H_{-})$ and the likelihood ratio,
$P(\mbox{match} \given \H_{+})/P(\mbox{match} \given \H_{-})$.
This second factor is the answer to the question.
The numerator $P(\mbox{match} \given \H_{+})$ is equal to 1.
The denominator's value depends on our model of errors.
If we know that the human calculator is prone to errors
involving multiplication of the answer by 10, or to transposition
of adjacent digits, neither of which affects the hash value,
then
$P(\mbox{match} \given \H_{-})$ could be equal to 1 also,
so that the correct match gives no evidence
in favour of $\H_{+}$. But if we assume that errors are
`random from the point of view of the hash function' then
the probability of a false positive is
$P(\mbox{match} \given \H_{-}) = 1/9$, and the
correct match gives evidence 9:1 in favour
of $\H_{+}$.
}
%
%
%
\soln{ex.whyonlyCRC}{
If you add a tiny $M=32$ extra bits of hash to a huge $N$-bit
file you get pretty good \ind{error detection}\index{error-correcting code} --
% $1-2^{-M}$
the probability that an
% of detecting an error, less than a one-in-a-billion chance that the
error is undetected is $2^{-M}$,
less than one in a billion. To do error {\em correction\/}
requires far more check bits, the number depending on the expected types of
corruption, and on the file size.
For example, if just eight random bits in a megabyte file
are corrupted, it would take
% $\log_2 {{ 8\times 10^{6}} \choose {8} } \simeq 180$
about $\log_2 {{ 2^{23} }\choose{8} } \simeq 23 \times 8 \simeq 180$
bits
to specify which are the corrupted bits, and the number of \ind{parity-check bits} used by a successful error-correcting code would have to
be at least this number, by the counting argument of \exerciseonlyref{ex.makecode2error}
(solution, \pref{ex.makecode2error.sol}).
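The approximation can be made slightly more careful, again as a rough sketch.
Since $2^{23}$ is enormous compared with 8,
\beq
 \log_2 {{ 2^{23} }\choose{8} } \simeq \log_2 \frac{ (2^{23})^{8} }{ 8! } = 184 - \log_2 40\,320 \simeq 184 - 15.3 \simeq 169 ,
\eeq
so the estimate of roughly 180 bits is correct to within about 10\%, and in either case the requirement is several times the 32 bits that suffice for detection.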
% Shannon's \ind{noisy-channel coding theorem}.
}
% see also _se8.tex
\fakesection{se8}
%\begincuttable% NO, I LIKE IT
\soln{ex.uniquestring}{
We want to know the length $L$ of a string
such that it is very improbable that that
string matches any part of the entire writings
of humanity.
Let's estimate that these writings total
about one book for each person living, and that each book
contains two million characters (200 pages with $10\,000$ characters
per page) -- that's
% $5\times 10^9 \times 2 \times 10^6 =
$10^{16}$ characters, drawn from
an alphabet of, say, 37 characters.
The probability that a randomly chosen string of length $L$ matches
at one point in the collected works of humanity is $1/37^{L}$.
So the expected number of matches is
$10^{16} /37^{L}$, which falls below one once
$L > 16/\log_{10} 37 \simeq 10$, and shrinks by a further factor of 37 with each additional character.
% 10.2
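To see how quickly the expected number of matches falls with $L$, note that $37^{10} \simeq 4.8 \times 10^{15}$, so
\beq
 \frac{10^{16}}{37^{10}} \simeq 2 , \;\;\;\; \frac{10^{16}}{37^{12}} \simeq 1.5 \times 10^{-3} ;
\eeq
each extra character reduces the expected number of matches by a factor of 37.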
Because of the redundancy and repetition of humanity's writings,
it is possible that $L \simeq 10$ is an overestimate.
So, if you want to write something unique, sit down and compose
a string of ten characters. But don't write {\tt{gidnebinzz}}, because
I already thought of that string.
As for a new \ind{melody},\index{music} if we focus on the sequence of notes,
ignoring duration and stress, and
allow leaps of up to an octave at each note,
then the number of choices per note is 23.
The pitch of the first note is arbitrary.
The number of melodies of length $r$ notes in this rather
ugly ensemble of \ind{Sch\"onberg}ian tunes is $23^{r-1}$;
 for example, there are more than $250\,000$ of length $r=5$.
Restricting the permitted intervals will reduce this figure;
including duration and stress will increase it again.
[If we restrict the permitted intervals to
repetitions and tones or semitones,
the reduction is particularly severe; is this why
the melody of
`\ind{Ode to Joy}' sounds so boring?]
The number of recorded compositions is probably less than a
million.
% top of the pops for 50 * 50 weeks with 100 new songs per week
If you learn 100 new melodies per week for every week of your
life then you will have learned $250\,000$ melodies at age 50.
Based on empirical experience of playing the game\index{game!guess that tune}
`{\tt{guess that tune}}',\marginpar{\small\raggedright{In {\tt{guess that tune}},
one player chooses a melody, and sings a gradually-increasing number
of its notes, while the other
participants try to guess the whole melody.\medskip
% aka http://www.melodyhound.com/
The {\dem\ind{Parsons code}\/} is a related hash function for
melodies:
% . To make the Parsons code of a melody,
each pair of consecutive notes is coded as {\tt{U}} (`up')
if the second note is higher than the first, {\tt{R}} (`repeat')
if the pitches are equal, and {\tt{D}} (`down') otherwise.
You can find out how well this hash function
works at {\tt{www.{\breakhere}name-{\breakhere}this-{\breakhere}tune.{\breakhere}com}}.
}}
it seems to me that
 whereas many four-note sequences are shared
between melodies, the number of collisions between
five-note sequences is rather smaller -- most famous five-note
sequences are unique.
}
%\ENDcuttable
\soln{ex.proteinmatch}{
%\ben
%\item
(a) Let the DNA-binding \ind{protein} recognize a sequence of length $L$ nucleotides.
That is, it binds preferentially to that \ind{DNA} sequence, and not to
any other pieces of DNA in the whole genome. (In reality, the
recognized sequence may contain some wildcard characters, \eg,
the {\tt{*}} in {\tt{TATAA*A}}, which denotes `any of {\tt{A}}, {\tt{C}},
{\tt{G}} and {\tt{T}}';
so, to be precise, we are assuming that the recognized sequence
contains $L$ non-wildcard
characters.)
% in a sequence whose length can be greater than $L$.)
Assuming the rest of the genome is `random', \ie, that the sequence
consists of random nucleotides {\tt{A}}, {\tt{C}},
{\tt{G}} and {\tt{T}} with equal probability -- which is obviously
untrue, but it shouldn't make too much difference to our calculation --
the chance of there being no other occurrence of the target sequence
in the whole genome, of length $N$ nucleotides, is roughly
\beq
(1 - (1/4)^L )^N \simeq \exp ( - N (1/4)^L ) ,
\eeq
which is close to one only if
\beq
N 4^{-L} \ll 1 ,
\eeq
that is,
\beq
L > \log N / \log 4 .
\eeq
Using $N= 3 \times 10^9$,
% from cell p.386
we require the recognized sequence to be longer than $L_{\min} = 16$
nucleotides.
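Spelling out the arithmetic, with logarithms taken to base 2,
\beq
 \frac{\log_2 N}{\log_2 4} = \frac{\log_2 ( 3 \times 10^{9} )}{2} \simeq \frac{31.5}{2} \simeq 15.7 ,
\eeq
which rounds up to 16 nucleotides, or $16 \times 2 = 32$ bits of sequence information.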
What size of \ind{protein} does this imply?
%\ben
\bit
\item
%
A weak lower bound can be obtained by assuming that the information
content of the protein sequence itself is greater than
the information content of the \ind{nucleotide}
sequence the protein prefers to bind to (which we have argued above
must be at least 32 bits).
This gives a minimum protein length of $32 / \log_2(20) \simeq 7$
\ind{amino acid}s.
\item
Thinking realistically, the \ind{recognition} of the DNA sequence
by the protein presumably involves the protein coming into contact
with all sixteen nucleotides in the target sequence.
If the protein is a monomer, it must be big enough that it can
simultaneously make contact with sixteen nucleotides of DNA.
One helical turn of DNA containing ten nucleotides has a length of
3.4$\,$nm, so a contiguous sequence of sixteen nucleotides has a length
of 5.4$\,$nm. The diameter of the protein must therefore be about 5.4$\,$nm
or greater. Egg-white lysozyme is a small globular protein with
a length of 129 amino acids
% cell p.90
and a diameter of about 4$\,$nm.
% cell p.130.
Assuming that volume is proportional to sequence length
and that volume scales as the cube of the diameter, a protein of
diameter 5.4$\,$nm must have a sequence of length $2.5 \times 129
\simeq 324$ amino acids.
%\een
\eit
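The scaling step in the second argument can be written out as
\beq
 \left( \frac{5.4\,\mbox{nm}}{4\,\mbox{nm}} \right)^{3} = 1.35^{3} \simeq 2.5 ,
\eeq
so the estimated length is about 2.5 times that of lysozyme.
This relies, as above, on treating the protein as a roughly spherical object of uniform density, so the figure should be read as an order-of-magnitude estimate.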
% \item
%
(b)
If, however, a target sequence consists of a twice-repeated sub-sequence,
we can get by with a much smaller protein that recognizes
only the sub-sequence, and that binds to the \ind{DNA} strongly only if
it can form a {\em\ind{dimer}},
both halves of which are bound to the recognized sequence.
% , which must appear twice in succession in the DNA.
% with a neighbour.
Halving the diameter of the protein, we now only need a protein whose length
 is greater than $324/8 \simeq 40$ amino acids.
A protein of length smaller than this cannot by itself serve as a
regulatory protein\index{protein!regulatory} specific to one gene,
because it's simply too small to be able to make a sufficiently
specific match -- its available surface does not have enough
information content.
% \een
}
%
\dvips
%
% ch 8 LINEAR
%\chapter{Linear Error correcting codes and perfect codes}
%\chapter{Linear Error Correcting Codes and Perfect Codes \nonexaminable}
\prechapter{About Chapter}
% prechapter for linear codes / binary codes
In Chapters \ref{ch.prefive}--\ref{ch.ecc},
we established Shannon's noisy-channel coding theorem
for a general channel with any
input and output alphabets.
A great deal of attention in coding theory focuses on the special
case of channels with binary inputs.
Constraining ourselves to these channels simplifies
matters, and leads us into an exceptionally rich world,
which we will only taste in this book.
One of the aims of this chapter is to point out a
contrast between Shannon's aim of achieving reliable communication
over a noisy channel and the apparent aim of many in the
% this wonderful
world of \ind{coding theory}.\index{sphere packing}
Many coding theorists take as their fundamental problem
the task of packing as many spheres as possible, with radius
as large as possible, into an $N$-dimensional space, {\em with
no spheres overlapping}.
Prizes are awarded to people
who find packings that squeeze in an extra few spheres.
% of a given radius.
While this is a fascinating mathematical topic,
we shall see that the aim of maximizing
the \ind{distance} between codewords in a code has only a tenuous
relationship to Shannon's aim of reliable \ind{communication}.
\ENDprechapter
\chapter{Binary Codes \nonexaminable}
\label{ch.linearecc}
\label{ch.linear}
% see also linearblock.tex
%
% chapter 8: linear error correcting codes
%
% distance
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% see also NOTES.tex
We've established Shannon's noisy-channel coding theorem
for a general channel with any
input and output alphabets.
A great deal of attention in coding theory focuses on the special
case of
channels with binary inputs, the first implicit choice being the
binary symmetric channel.\index{channel!binary symmetric}
The optimal decoder for a code, given a binary symmetric channel,
finds the codeword that is closest to the received vector, closest\marginpar[b]{\small{{\sf Example:}\\[0.0012in]
%\begin{center}
\begin{tabular}{rl}
\multicolumn{2}{c}{
The Hamming distance
}\\
{between}& {\tt{00001111}}\\ and & {\tt{11001101}}\\
\multicolumn{2}{c}{
is 3.
}\\
\end{tabular}
%\end{center}
}}
in {\dem\ind{Hamming distance}}.\index{distance!Hamming}
The Hamming distance between two binary vectors is the number
of coordinates in which the two vectors
differ.
Decoding errors will occur
if the noise takes us from the transmitted codeword $\bt$ to a
received vector $\br$ that is closer to some other codeword.
The {\dem{distances\/}} between codewords are thus relevant to the
probability of a decoding error.\index{distance!of code}
\section{Distance properties of a code}
%\begin{description}
%\item[The {\dem{distance}\/} of a\index{distance!of code} code]
The {\dem{distance}\/} of a\index{distance!of code}
% \index{error-correcting code!distance}
code is the smallest separation between two of its
codewords.
% \end{description}
% \begin{ indented
\exampl{ex.hamm74dist}{
%\noindent {\sf Example:}
The $(7,4)$ Hamming code (\pref{sec.ham74})
has distance $d= 3$. All pairs of its
codewords differ in at least 3 bits.
The maximum number of errors it can correct is $t=1$;
in general
a code with distance $d$ is
$\lfloor (d\!-\!1)/2 \rfloor$-error-correcting.
}
% , and
% the distance is related to this quantity by
% $d=2t+1$.
% \end{indented
A more precise term for distance is
the {\dem\ind{minimum distance}\/} of the code.
The distance of a code is often denoted by $d$ or $d_{\min}$.
%
% \section{Weight enumerator function}
% see code/bucky/README
\index{error-correcting code!weight enumerator}%
%\index{error-correcting code!distance distribution}%
We'll now constrain our attention to linear codes.
In a linear code, all codewords have identical
distance properties, so we can summarize
% the dis.
% are equivalent,
% from the point of view of the spectrum of
% distances to other codewords.
% summarizes
all the distances between the code's codewords
by counting the distances from the all-zero codeword.
%\begin{description}
%\item[The {\dem\ind{weight enumerator} function} of a code,] $A(w)$,
The {\dem\ind{weight enumerator} function} of a code, $A(w)$,
% $A(w)$
is defined to be the number of codewords in the code that
have weight $w$.
\amarginfig{b}{%
\footnotesize
\begin{tabular}{c}
\raisebox{0.2in}{\buckypsfig{H74.eps}}
\\
%# weight enumerator of $(7,4)$ code
%# w A(w) C Random Random N-choose-w
\begin{tabular}[b]{rr}
\toprule
$w$ & $A(w)$ \\ \midrule
%%%%%%%%%%%%%%%%%%%%%%%%%%%%
0 & 1 \\
3 & 7 \\
4 & 7 \\
7 & 1 \\ \midrule
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%55
Total & 16\\ \bottomrule
\end{tabular}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\\
\buckypsgraphb{H74.Aw.ps}
\end{tabular}
% see /home/mackay/code/bucky/H74.gnu
\caption[a]{The graph of the $(7,4)$ Hamming code, and its weight enumerator
function.}
\label{fig.wef.h74}
}
%
The weight enumerator function is also
known as the {\dem{{distance distribution}\index{distance!distance distribution}}\/} of the code.
%\end{description}
% original is in graveyard.tex
\begin{figure}
\figuremargin{%
\footnotesize
\begin{tabular}{ccc}
\buckypsfig{dodec.eps}
&
%# weight enumerator of (30,11) code dodec2.G
%# w A(w) C Random Random N-choose-w
\begin{tabular}{rr}
\toprule
$w$ & $A(w)$ \\ \midrule
%%%%%%%%%%%%%%%%%%%%%%%%%%%%
0 & 1 \\
5 & 12 \\
8 & 30 \\
9 & 20 \\
10 & 72 \\
11 & 120 \\
12 & 100 \\
13 & 180 \\
14 & 240 \\
15 & 272 \\
16 & 345 \\
17 & 300 \\
18 & 200 \\
19 & 120 \\
20 & 36 \\ \midrule
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%55
Total & 2048\\ \bottomrule
\end{tabular}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
&
\begin{tabular}{@{}c@{}}
\buckypsgraphB{dodec2.Aw.ps}
\\
\buckypsgraphB{dodec2.Aw.l.ps}
\end{tabular}
\\% see /home/mackay/code/bucky
\end{tabular}
}{
\caption[a]{
The graph defining the $(30,11)$
\ind{dodecahedron code}\index{error-correcting code!dodecahedron}
% first introduced in secref{sec.dodecahedron}
(the circles are the 30 transmitted bits and the triangles are the 20 parity checks,
one of which is redundant) and the
% (b-ii) The
weight enumerator function (solid lines). The
dotted lines show the
average weight enumerator function of all random linear codes
with the
same size of generator matrix,
% (dotted lines),
which will be computed shortly.
The lower
figure shows the same functions on a log scale.
%%%%%%%%%%%%%%% CHECK %%%%%%%%%%%%%%%
% {\em (Check for cross-reference to earlier occurrence?)}
}
\label{fig.Aw}
}
\end{figure}
% \begin{ indented ?
\exampl{ex.hamm74Aw}{
% \noindent {\sf Example:}
The weight enumerator functions
of the $(7,4)$ Hamming code and the \ind{dodecahedron code}\index{error-correcting code!dodecahedron}
are shown in figures \ref{fig.wef.h74} and \ref{fig.Aw}.
% \end{indented
}
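As a quick consistency check (a worked sum added here for illustration),
the entries of $A(w)$ must total the number of codewords, $2^K$.
For the $(7,4)$ Hamming code,
\beq
 1 + 7 + 7 + 1 = 16 = 2^4 ,
\eeq
and the fifteen entries tabulated for the $(30,11)$ dodecahedron code
sum to $2048 = 2^{11}$.
Because the codes are linear, the smallest $w>0$ with $A(w)>0$ is the minimum distance:
$d=3$ for the Hamming code and $d=5$ for the dodecahedron code.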
\section{Obsession with distance}
Since the maximum number of errors that a code can {\em guarantee\/} to correct,
$t$, is related to its distance $d$ by
$t= \lfloor (d\!-\!1)/2 \rfloor$,\marginpar{\small{%
$d=2t+1$ if $d$ is odd, and\\
$d=2t+2$
if $d$ is even.}}
many coding theorists focus on the\index{distance!of code} distance of a code, searching for
codes of a given size that have the biggest possible distance.
Much of practical coding theory has focused on
decoders that give the optimal decoding for all error patterns
of weight up to the half-distance $t$ of their codes.
\begin{description}
\item[A \ind{bounded-distance decoder}]\index{decoder!bounded-distance}
is a decoder that returns the closest codeword to a received\label{sec.bdd}
binary vector $\br$ if the distance from $\br$ to that codeword
is less than or equal to $t$; otherwise it returns a failure
message.
\end{description}
The rationale for not trying to decode when more than $t$
errors have occurred might be `we can't {\em guarantee\/}
that we can correct more than $t$ errors, so we
won't bother trying -- who would
be interested in a decoder that corrects some\index{sermon!worst-case-ism}
error patterns of weight greater than $t$, but not others?'
This defeatist attitude is an example of {\dem\ind{worst-case-ism}},
a widespread mental ailment
% yes, spell checked
which this book is intended to cure.
The fact is that bounded-distance decoders cannot reach the\wow\
Shannon limit of the binary symmetric channel; only a decoder
that often corrects more than $t$ errors can do this.
State-of-the-art error-correcting codes
have decoders that work way beyond the minimum distance
of the code.
\subsection{Definitions of good and bad distance properties}
\index{distance!of code!good/bad}Given
a family of codes of increasing blocklength $N$, and with rates
approaching a limit $R>0$,
we may be able to put that family in one of the following categories,
which have some similarities to the categories of `good' and `bad' codes
defined earlier (\pref{sec.bad.code.def}):\index{error-correcting code!good}\index{error-correcting code!bad}\index{error-correcting code!very bad}\index{distance!good}\index{distance!bad}\index{distance!very bad}
\label{sec.bad.dist.def}
\begin{description}
\item[A sequence of codes has `good' distance]
if $d/N$ tends to a constant greater than zero.
\item[A sequence of codes has `bad' distance]
if $d/N$ tends to zero.
\item[A sequence of codes has `very bad' distance]
if $d$ tends to a constant.
\end{description}
% THIS really belongs over the page
\amarginfig{b}{
\begin{center}
\mbox{
\psfig{figure=/home/mackay/itp/figs/gallager/16.12.G.ps,width=2in,angle=-90}
}\end{center}
\caption[a]{The graph of a rate-\dfrac{1}{2}
low-density generator-matrix code. The rightmost $M$
of the transmitted bits are each connected to a single distinct parity
constraint.
}
\label{fig.ldgmc}
}
\exampl{example.badcode}{
A {\dem\ind{low-density generator-matrix code}\/} is a
linear code whose $K \times N$ generator matrix
$\bG$ has a small number $d_0$ of {\tt{1}}s per row,
regardless of how big $N$ is.
The minimum distance of such a code is at most $d_0$,
so {low-density generator-matrix code}s have `very bad' distance.
}
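The bound on the minimum distance can be spelled out in one line
(the row notation $\mathbf{g}_k$ is introduced here only for this remark).
Each row $\mathbf{g}_k$ of $\bG$ is itself a codeword, namely the encoding of the
source vector that has a single {\tt{1}} in position $k$, and it is non-zero, so
\beq
 d \: \leq \: |\mathbf{g}_k| \: = \: d_0 \:\:\: \mbox{for every blocklength $N$}.
\eeq
Since $d_0$ does not grow with $N$, $d$ stays bounded and $d/N \rightarrow 0$.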
While having large distance is no bad thing, we'll see, later on, why
an emphasis on distance can be unhealthy.
\begin{figure}[htbp]
\figuremargin{
\mbox{\psfig{figure=figs/caveperfect.ps,angle=-90,width=3in}}
}{
\caption[a]{Schematic picture of part of
Hamming space perfectly filled
by $t$-spheres centred on the codewords of a perfect code.}
\label{fig.caveperfect}
}
\end{figure}
\section{Perfect codes}
A $t$-sphere (or a sphere of radius $t$)
in Hamming space, centred on a point $\bx$,
is the set of points whose Hamming distance from $\bx$
is less than or equal to $t$.
The $(7,4)$ \ind{Hamming code}\index{perfect code}\index{error-correcting code!perfect} has the beautiful property that
if we place 1-spheres
% of radius 1
about each of its 16 codewords,
those spheres perfectly fill Hamming space without overlapping.
As we saw in \chref{ch1},
every binary vector of length 7 is within a distance of $t=1$ of
exactly one codeword of the Hamming code.
\begin{description}
\item[A code is a perfect $t$-error-correcting code]
if the set of $t$-spheres centred on the codewords of the code fill the
Hamming space without overlapping. (See \figref{fig.caveperfect}.)
\end{description}
Let's recap our cast of characters.
The number of codewords is $S=2^K$. The number of points in the
entire Hamming space is $2^N$. The number of points in a
Hamming sphere of radius $t$ is
\beq
\sum_{w=0}^{t} {{N}\choose{w}} .
\eeq
For a code to be perfect with these parameters, we
require $S$ times the number of points in the $t$-sphere to equal $2^N$:
\beqan
\mbox{for a perfect code, } \:\:
2^K \sum_{w=0}^{t} {{N}\choose{w}} & =& 2^N
\\
\mbox{or, equivalently, }\:\:
\sum_{w=0}^{t} {{N}\choose{w}} & =& 2^{N-K} .
\eeqan
For a perfect code, the number of noise vectors
in one sphere must equal the number of possible syndromes.
The $(7,4)$ Hamming code satisfies this numerological condition\index{numerology}
because
\beq
1 + {{7}\choose{1}} = 2^3 .
\label{eq.coincidence}
\eeq
% Interestingly, the first appearance of the ternary Golay code predated
%Golay's publication by a good year. A Finnish devotee of football pools thought it up in list form (!) and published it
% in 1947.
% Covering codes.
%G.Cohen, I.Honkala. S.Litsyn, and A.Lobstein
%North-Holland Publishing Co., Amsterdam, 1997. xxii+542 pp. ISBN 0-444-82511-8
% It is this "ternary" Golay code which was first discovered by a Finn who was
% determining good strategies for betting on blocks of 11 soccer games. Here,
% one places a bet by predicting a Win, Lose, or Tie for all 11 games, and as
% long as you do not miss more than two of them, you get a payoff. If a group
% gets together in a "pool" and makes multiple bets to "cover all the options"
% (so that no matter what the outcome, somebody's bet comes within 2 of the
% actual outcome), then the codewords of a 2-error-correcting perfect code
% provide a very nice option; the balls around its codewords fill all of the
% space, with none left over.
%
% It was in this vein that the ternary Golay code was first constructed; its
% discover, Juhani Virtakallio, exhibited it merely as a good betting system
% for football-pools, and its 729 codewords appeared in the football-pool
% magazine Veikkaaja. For more on this, see Barg's article [1].
%
% [1] Barg, Alexander. "At the Dawn of the Theory of Codes," The Mathematical
% Intelligencer, Vol. 15 (1993), No. 1, pp. 20--26.
\subsection{How happy we would be to use perfect codes}
If there were large numbers of perfect codes to choose from,
with a wide range of blocklengths and rates,
then these would be the perfect solution to Shannon's problem.
We could communicate over a binary symmetric channel with noise
level $f$, for example, by picking a perfect $t$-error-correcting code
with blocklength $N$ and $t=f^* N$, where $f^* = f + \delta$
and $N$ and $\delta$ are chosen such that the probability that
the noise flips more than $t$ bits is satisfactorily small.
However, {\em there are almost no perfect codes}.\wow\
The only nontrivial
perfect binary
codes are
\ben
\item
the Hamming codes, which are perfect codes with $t=1$
% -error-correcting with
and blocklength $N=2^M-1$,
defined below; the rate of a \ind{Hamming code} approaches 1
as its blocklength $N$ increases;
\item
the repetition codes of odd blocklength $N$, which are perfect codes
with $t=(N-1)/2$; the rate of repetition codes goes to zero as $1/N$; and
\item
one remarkable $3$-error-correcting code with $2^{12}$
codewords of
blocklength $N=23$ known
as the binary \ind{Golay code}\index{error-correcting code!Golay}.
[A second 2-error-correcting Golay code of
length $N=11$ over a
ternary alphabet was
% 729 cw's in football-pool magazine Veikkaaja.
discovered by a Finnish football-pool
enthusiast\index{football pools}\index{bet}\index{design theory}
% \index{Finland}
called Juhani Virtakallio\index{Virtakallio, Juhani} in 1947.]
% 1+23+23*11 + 23*11*7 = 2048
% If we allow more symbols in our alphabet than just 0 and 1, then we get analogues of the
% Hamming codes, and another Golay code of length 11, this time on three letters (say 0, +, and -) and with parameters (11,
% 3^6, 5). This completes the list of all linear perfect codes. parameters (11,3^6, 5).
% http://lev.yudalevich.tripod.com/ECC/betting.html
%
% [1] Barg, Alexander. "At the Dawn of the Theory of Codes," The Mathematical Intelligencer, Vol. 15 (1993), No. 1, pp.
% 20--26.
\een
There are no other binary perfect codes.
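The counting condition for perfectness is easy to verify for the first two families
(a worked check, added for illustration).
For a Hamming code with $M$ parity checks, $N = 2^M-1$, $N-K = M$, and $t=1$, so
\beq
 \sum_{w=0}^{1} {{N}\choose{w}} = 1 + N = 2^M = 2^{N-K} .
\eeq
For a repetition code of odd blocklength $N$, $K=1$ and $t=(N-1)/2$.
The symmetry ${{N}\choose{w}} = {{N}\choose{N-w}}$ means that the weights
$0,\ldots,t$ account for exactly half of the $2^N$ vectors, so
\beq
 \sum_{w=0}^{(N-1)/2} {{N}\choose{w}} = \frac{1}{2} \, 2^{N} = 2^{N-1} = 2^{N-K} ;
\eeq
for example, with $N=5$ we have $1 + 5 + 10 = 16 = 2^4$.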
Why this shortage of perfect codes?
Is it because precise numerological coincidences like those satisfied by the parameters
of the Hamming code (\ref{eq.coincidence}) and the Golay
code,
\beq
1 + {{23}\choose{1}} + {{23}\choose{2}} + {{23}\choose{3}} = 2^{11},
\eeq
are rare? Are there plenty of `almost-perfect' codes for which
the $t$-spheres fill {\em almost\/} the whole space?
No. In fact, the picture
of Hamming spheres centred on the
codewords {\em{almost}\/} filling Hamming space (\figref{fig.cavenotquite})
is a misleading one: for most codes, whether
they are good codes or bad codes,\index{sermon!sphere-packing}
%
almost all the Hamming space is taken up by the space {\em{between}\/}
$t$-spheres
% \wow\
(which is shown in grey in \figref{fig.cavenotquite}).
\begin{figure}
\figuremargin{
\mbox{\psfig{figure=figs/cavenotquite.ps,angle=-90,width=3in}}
}{
\caption[a]{Schematic picture of Hamming space not perfectly filled
by $t$-spheres centred on the codewords of a code.
The grey regions show points that are at a Hamming distance
of more than $t$ from any codeword. This is a misleading picture,
as, for any code with large $t$
in high dimensions, the grey space between the
spheres takes up almost all of Hamming space.
}
\label{fig.cavenotquite}
}
\end{figure}
Having established this gloomy picture, we spend a moment
filling in the properties of the perfect codes mentioned
above.
\subsection{The Hamming codes}
The $(7,4)$ Hamming code can be defined as the linear code
whose $3\times 7$ parity-check matrix contains, as its columns,
all the 7 ($=2^3-1$) non-zero vectors of length 3.
Since these 7 vectors are all different, any single bit-flip
produces a distinct syndrome, so all single-bit errors
can be detected and corrected.
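For concreteness, here is one such matrix. The column ordering is an assumption
made for this illustration (column $n$ is the binary representation of $n$);
it need not match the ordering used elsewhere in the book, and any
reordering of the columns gives an equivalent code.
\beq
 \bH = \left[ \begin{array}{ccccccc}
 0&0&0&1&1&1&1\\
 0&1&1&0&0&1&1\\
 1&0&1&0&1&0&1
 \end{array} \right] .
\eeq
With this ordering, flipping bit $n$ alone produces a syndrome equal to
column $n$, which reads as the number $n$ in binary;
for example, flipping bit 5 gives the syndrome $({\tt{1}},{\tt{0}},{\tt{1}})$.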
% from \input{tex/_concat2.tex}
We can generalize this code, with $M=3$ parity constraints,
as follows.
The Hamming codes are single-error-correcting codes
defined by picking a number of parity-check constraints, $M$;
the blocklength $N$ is $N = 2^M-1$; the parity-check matrix
contains, as its columns, all the $N$ non-zero vectors
of length $M$ bits.
The first few Hamming codes have the following rates:
\medskip% added because of my change to the center environment
\begin{center}
\begin{tabular}{cr@{,$\,$}llp{1.4in}} \toprule
% checks &
%% (block length, source bits)
% & rate & \\
% $M$ & ($N = 2^M-1$ , $K = N - M$) & $R=K/N$ & \\ \midrule
\multicolumn{1}{c}{Checks, $M$} & \multicolumn{2}{c}{($N,K$)} & $R=K/N$ & \\ \midrule
2 & (3&1) & 1/3 & repetition code $R_3$ \\
3 & (7&4) & 4/7 & $(7,4)$ Hamming code \\
4 & (15&11) & 11/15 & \\
5 & (31&26) & 26/31 & \\
6 & (63&57) & 57/63 & \\ \bottomrule
\end{tabular}
\end{center}
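The trend in the table follows from the general expression for the rate
(a short worked step, added for illustration):
\beq
 R = \frac{K}{N} = \frac{N-M}{N} = 1 - \frac{M}{2^M-1} ,
\eeq
and $M/(2^M-1) \rightarrow 0$ as $M$ increases, so $R \rightarrow 1$.
For example, $M=10$ gives $(N,K) = (1023,1013)$ and $R \simeq 0.990$.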
\exercissxA{2}{ex.HammingP}{
What is the probability of block error of the $(N,K)$ Hamming
code to leading order, when the code
is used for a binary symmetric channel with noise density $f$?
}
\section{Perfectness is unattainable -- first proof \nonexaminable}
We will show in several ways
that useful \ind{perfect code}s do not exist (here,
`useful' means `having large blocklength $N$, and rate
close neither to 0 nor 1').
% First, let's study a pithy, no-nonsense example.
Shannon proved that, given a binary symmetric channel
with any noise level $f$, there exist codes with large blocklength $N$
and rate as close as you like to $C(f) = 1 - H_2(f)$
that enable \ind{communication} with
arbitrarily small error probability.
For large $N$, the number of errors per block will typically
be about $\fN$, so these codes of Shannon are
`almost-certainly-$\fN$-error-correcting'
codes.
Let's pick the special case of a noisy channel with $f \in ( 1/3, 1/2)$.
Can we find a large
{\em perfect\/} code that is $\fN$-error-correcting?
% with large blocklength for this channel?
Well, let's suppose that such a code has been found, and examine
just three of its codewords. (Remember that the code
ought to have rate $R \simeq 1-H_2(f)$, so it should have
an enormous number ($2^{NR}$) of codewords.)
\begin{figure}
\figuremargin{
\mbox{\psfig{figure=figs/noperfect3.ps,%
width=64mm,angle=-90}}
}{%
\caption[a]{Three
codewords.
}
\label{fig.noperfect}
}
% load 'gnuR'
\end{figure}
Without loss of generality, we choose one of the codewords
to be the all-zero codeword and define the other two to have
overlaps with it as shown in \figref{fig.noperfect}.
The second codeword differs from the first in a fraction $u+v$
of its coordinates.
The third codeword differs from the first in a fraction $v+w$,
and from the second in a fraction $u+w$. A fraction $x$
of the coordinates have value zero in all three codewords.
Now, if the code is $\fN$-error-correcting, its minimum distance
must be greater than $2\fN$, so
\beq
u+v > 2f, \:\:\: v+w > 2f, \:\:\: \mbox{and} \:\:\: u+w > 2f .
\eeq
Summing these three inequalities and dividing by two, we have
\beq
u +v+w > 3f .
\eeq
So if $f>1/3$, we can deduce $u+v+w > 1$, so that $x<0$,
which is impossible. Such a code cannot exist.
So the code cannot have {\em three\/} codewords, let alone
$2^{NR}$.
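A numerical instance may make the contradiction vivid
(the value $f=0.4$ is chosen here purely for illustration).
Each pairwise distance must then exceed a fraction $0.8$ of the coordinates, so
\beq
 2(u+v+w) > 3 \times 0.8 = 2.4
 \:\:\: \Rightarrow \:\:\:
 u+v+w > 1.2 ,
\eeq
and since the four fractions must satisfy $u+v+w+x=1$, we would need $x < -0.2$.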
We conclude that, whereas Shannon proved there
are plenty of codes for communicating over
a \ind{binary symmetric channel}\index{channel!binary symmetric}\index{perfect code}
with $f>1/3$, {\em there are no perfect codes\index{error-correcting code!perfect}
that can do this.}
We now study a more general argument that indicates
that there are no large perfect linear codes for general rates (other than 0 and 1).
We do this by finding the typical distance of a random linear code.
%\mynewpage
\section{Weight enumerator function of random linear codes \nonexaminable}
\label{sec.wef.random}
Imagine
% H=rand(12,24)>0.5
% octave
\marginfig{\tiny{
\[%\mbox{\footnotesize{$\bH=$}}
\hspace{-2mm}\begin{array}{c}
{N}\\
\overbrace{\left.\hspace{-2mm}\left[\begin{array}{@{}*{24}{c@{\hspace{0.45mm}}}}
1&0&1&0&1&0&1&0&0&1&0&0&1&1&0&1&0&0&0&1&0&1&1&0\\
0&0&1&1&1&0&1&1&1&1&0&0&0&1&1&0&0&1&1&0&1&0&0&0\\
1&0&1&1&1&0&1&1&1&0&0&1&0&1&1&0&0&0&1&1&0&1&0&0\\
0&0&0&0&1&0&1&1&1&1&0&0&1&0&1&1&0&1&0&0&1&0&0&0\\
0&0&0&0&0&0&1&1&0&0&1&1&1&1&0&1&0&0&0&0&0&1&0&0\\
1&1&0&0&1&0&0&0&1&1&1&1&1&0&0&0&0&0&1&0&1&1&1&0\\
1&0&1&1&1&1&1&0&0&0&1&0&1&0&0&0&0&1&0&0&1&1&1&0\\
1&1&0&0&1&0&1&1&0&0&0&1&1&0&1&0&1&1&1&0&1&0&1&0\\
1&0&0&0&1&1&1&0&0&1&0&1&0&0&0&0&1&0&1&1&1&1&0&1\\
0&1&0&0&0&1&0&0&0&0&1&0&1&0&1&0&0&1&1&0&1&0&1&0\\
0&1&0&1&1&1&1&1&0&1&1&1&1&1&1&1&1&0&1&1&1&0&1&0\\
1&0&1&1&1&0&1&0&1&0&0&1&0&0&1&1&0&1&0&0&0&0&1&1
\end{array}\right]\right\} M \hspace{-2mm}\hspace{-0.25in} }
\end{array} \]
}
\caption[a]{A random binary parity-check matrix.}
\label{fig.randommatrix}
}%
making a code by picking the binary entries
in the $M \times N$ parity-check matrix $\bH$ at random.\index{error-correcting code!random linear}
What weight enumerator function should we expect?
The \ind{weight enumerator} of one particular code with
parity-check matrix $\bH$, $A(w)_{\bH}$, is
the number of codewords of weight $w$, which
can be written
\beq
A(w)_{\bH} = \sum_{\bx: |\bx| = w} \truth\! \left[ \bH \bx = 0 \right] ,
\eeq
where the
sum is over all vectors $\bx$ whose weight is $w$ and
the \ind{truth function} $\truth\! \left[ \bH \bx = 0 \right]$
equals one if
% it is true that
$\bH \bx = 0$
and zero otherwise.
We can find the expected value of $A(w)$,
\beqan
\langle A(w) \rangle &=& \sum_{\bH} P(\bH) A(w)_{\bH}
\\
&=& \sum_{\bx: |\bx| = w} \sum_{\bH} P(\bH) \,
\truth\! \left[ \bH \bx \eq 0 \right]
,
\label{eq.expAw}
\eeqan
by evaluating the probability
that a particular word of weight $w>0$ is a codeword of the code (averaging
over all binary linear codes in our ensemble).
By symmetry, this probability depends only on the weight $w$ of the word,
not on the details of the word.
The probability that the entire syndrome
$\bH \bx$ is zero can be found by multiplying together
the probabilities that each of the $M$ bits in the syndrome
is zero. Each bit $z_m$ of the syndrome is a sum (mod 2)
of $w$ random bits, so the probability that $z_m \eq 0$ is $\dhalf$.
The probability that $\bH \bx \eq 0$ is thus
\beq
\sum_{\bH} P(\bH) \, \truth\! \left[ \bH \bx \eq 0 \right]
= (\dhalf)^M = 2^{-M},
\eeq
independent of
$w$.
The expected number of words of weight $w$ (\ref{eq.expAw})
is given by summing, over all words of weight $w$, the probability
that each word is a codeword.
The number of words of weight $w$ is ${{N}\choose{w}}$,
so
\beq
\langle A(w) \rangle = {{N}\choose{w}} 2^{-M} \:\:\mbox{for any $w>0$}.
\eeq
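To get a feel for the numbers, here is a small worked evaluation.
The choice $N=30$, $M=19$ is an assumption made for this illustration:
it matches $N-K$ for the $(30,11)$ dodecahedron code of \figref{fig.Aw}.
\beq
 \langle A(5) \rangle = {{30}\choose{5}} 2^{-19} = \frac{142\,506}{524\,288} \simeq 0.27 ,
 \:\:\:\:\:
 \langle A(15) \rangle = {{30}\choose{15}} 2^{-19} = \frac{155\,117\,520}{524\,288} \simeq 296 .
\eeq
For comparison, the dodecahedron code itself has $A(5)=12$ and $A(15)=272$:
near the middle of the weight range it resembles the random-code average,
but it has many more low-weight codewords than a typical random code of the same size.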
For large $N$, we can use $\log_2 {{N}\choose{w}}
\simeq N H_2(w/N)$ and $R\simeq 1-M/N$ to write
\beqan
\log_2 \langle A(w) \rangle &\simeq& N H_2(w/N) -M
\\
&\simeq& N [ H_2(w/N) - (1-R) ] \:\:\mbox{for any $w>0$}.
\label{eq.wef.random}
\eeqan
As a concrete example, \figref{fig.Aw.540} shows the
expected weight enumerator function of a rate-$1/3$
random linear code\index{error-correcting code!random linear} with $N=540$ and $M=360$.
\marginfig{
\begin{center}
\mbox{%
\small
\hspace{-0.01in}%
\begin{tabular}{c}
\hspace{-0.15in}\mbox{\psfig{figure=/home/mackay/_doc/code/gallager/Am540R.ps,%
width=41.5mm,angle=-90}}\\[-0.01in]
\hspace{0.1in}\mbox{\hspace*{-0.35in}\psfig{figure=/home/mackay/_doc/code/gallager/Am540Rl.ps,%
width=41.5mm,angle=-90}}\\[-0.1in]
\end{tabular}
}
\end{center}
%}{%
\caption[a]{The
expected weight enumerator function
$\langle A(w) \rangle$ of a
\index{error-correcting code!random linear}random linear code with $N=540$ and $M=360$. Lower figure shows
$\langle A(w) \rangle$ on a logarithmic scale.
}
\label{fig.Aw.540}
% load 'gnuR'
}
\subsection{Gilbert--Varshamov distance}
For weights $w$ such that $H_2(w/N) < (1-R)$, the expectation
of $A(w)$ is smaller than 1; for weights such that $H_2(w/N) > (1-R)$,
the expectation is greater than 1. We thus expect, for large $N$,
that the minimum distance
of a random linear code will be close to the distance $d_{\rm GV}$
defined by
\beq
H_2(d_{\rm GV}/N) = (1-R) .
\label{eq.GV.def}
\eeq
% INDENT ME?
\noindent
{\sf Definition.}
This distance, $d_{\rm GV} \equiv N H_2^{-1}(1-R)$,
is
% known as
the
{\dem{Gilbert--Varshamov\index{distance!Gilbert--Varshamov}\index{Gilbert--Varshamov distance}
distance}\/}
for rate $R$ and blocklength $N$.
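As a worked example (values obtained by solving (\ref{eq.GV.def}) numerically and rounded),
take the rate-$1/3$ code with $N=540$ considered above. We need
$H_2(d_{\rm GV}/N) = 2/3$, and since $H_2(0.174) \simeq 0.667$,
\beq
 d_{\rm GV} \simeq 0.174 \, N \simeq 94 .
\eeq
For rate $R=1/2$ the corresponding fraction is $d_{\rm GV}/N \simeq 0.110$.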
The {\dem{Gilbert--Varshamov conjecture}},
widely believed, asserts that (for large $N$) it is not possible to\index{Gilbert--Varshamov conjecture}
create binary codes with minimum distance significantly greater than $d_{\rm GV}$.
\medskip
% INDENT ME?
\noindent
{\sf Definition.}
The {\dem{\index{Gilbert--Varshamov rate}Gilbert--Varshamov rate}\/} $R_{\rm GV}$
is the maximum rate at which you can reliably
communicate with a \ind{bounded-distance decoder} (as defined
on \pref{sec.bdd}),
assuming that the
Gilbert--Varshamov conjecture\index{Gilbert--Varshamov conjecture}
is true.
% \section{Perfect codes} A \index{error-correcting code!perfect}\see{perfect code}{code}
\subsection{Why sphere-packing is a bad perspective, and an obsession with
distance is inappropriate}
If one uses a \ind{bounded-distance decoder},\index{sermon!sphere-packing}
the maximum tolerable noise level will flip a fraction $f_{\rm bd} = \half d_{\min}/N$
of the bits. So, assuming $d_{\min}$ is equal to the \index{Gilbert--Varshamov distance}Gilbert--Varshamov distance
$d_{\rm GV}$ (\ref{eq.GV.def}), we have:%
\amarginfig{b}{
\begin{center}
\mbox{\psfig{figure=figs/RGV.ps,angle=-90,width=1.7in}}\\[-0.1in]
$f$
\end{center}
\caption[a]{Contrast between Shannon's channel capacity $C$
and the Gilbert--Varshamov rate $R_{\rm GV}$ --
the maximum communication rate
achievable using a \ind{bounded-distance decoder}, as a function
of noise level $f$.
For any given rate, $R$, the maximum tolerable
noise level for Shannon is twice as big as the
maximum tolerable noise level for a `worst-case-ist'
who uses a bounded-distance decoder.
}
}
\beq
H_2(2 f_{\rm bd}) = (1-R_{\rm GV}) .
\label{eq.idiotf}
\eeq
Rearranging,
\beq
R_{\rm GV} = 1 - H_2(2 f_{\rm bd}).
\eeq
Now, here's the crunch: what did Shannon say is achievable?\index{Shannon, Claude}
He said the maximum possible rate of communication is the capacity,
\beq
C = 1 - H_2(f) .
\eeq
So for a given rate $R$,
the maximum tolerable noise level, according to Shannon,
is given by
\beq
H_2(f) = (1-R) .
\label{eq.shannonf}
\eeq
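Our conclusion: imagine a good code of rate $R$ has been chosen;
equations (\ref{eq.idiotf}) and (\ref{eq.shannonf})
respectively define
the maximum noise levels tolerable
by a bounded-distance
decoder, $f_{\rm bd}$, and by Shannon's decoder, $f$.
At the same rate $R$ the two equations have the same right-hand side,
and $H_2$ is one-to-one on $[0,1/2]$, so $2 f_{\rm bd} = f$, that is,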
\beq
f_{\rm bd} = f/2 .
\eeq
Bounded-distance decoders can only ever cope with
{\em half\/} the noise level that Shannon proved is tolerable!
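To put numbers on this (a worked example with rounded values, added for illustration),
consider rate $R=1/2$. Shannon's limit is the noise level satisfying $H_2(f) = 1/2$,
which is $f \simeq 0.110$. A bounded-distance decoder for a code whose
minimum distance sits at the Gilbert--Varshamov distance needs $H_2(2 f_{\rm bd}) = 1/2$, so
\beq
 2 f_{\rm bd} \simeq 0.110
 \:\:\: \Rightarrow \:\:\:
 f_{\rm bd} \simeq 0.055 .
\eeq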
% Need to show implication for perfect codes at the same time.
How does this relate to perfect\index{error-correcting code!perfect}
codes? A code is perfect
if there are $t$-spheres around its codewords that
fill Hamming space without overlapping.
But when a typical random linear code is used to
communicate over a binary symmetric channel near to the
Shannon limit, the typical number of bits flipped
is $\fN$, and the minimum distance between codewords is
also $\fN$, or a little bigger, if we are
a little below the Shannon limit.
So the $\fN$-spheres around the codewords overlap
with each other sufficiently that each sphere almost contains
the centre of its nearest neighbour!
\marginfig{\begin{center}
\mbox{\psfig{figure=figs/overlap.eps,width=1.7in}}\\[-0.02in]
\end{center}
\caption[a]{Two overlapping spheres whose radius
is almost as big as the distance between their centres.
}
\label{fig.overlap}
}
The reason this overlap is not disastrous is that,
in high dimensions, the volume associated with the overlap,
shown shaded in \figref{fig.overlap}, is a tiny fraction of
either sphere, so the probability of landing in it is
extremely small.
The moral of the story is that \ind{worst-case-ism} can be bad for you,
halving your ability to tolerate noise.
You have to be able to decode {\em way\/} beyond the minimum distance of a code
to get to the Shannon limit!
Nevertheless, the minimum
distance of a code is of interest in practice, because, under some
conditions, the minimum distance dominates the errors made by
a code.
% On to the bat cave. (Could also dissect the random code
% in more detail.)
\section{Berlekamp's bats}
\label{sec.bats}
A blind \ind{bat}\index{Berlekamp, Elwyn} lives in a cave.
It flies about the centre of the cave, which corresponds to
one codeword,
with its typical distance from the centre controlled by
a friskiness parameter $f$. (The displacement of the
bat from the centre corresponds to the noise vector.)
The boundaries of the cave are made up of stalactites that
point in towards the centre of the cave (\figref{fig.cavereal}). Each stalactite
is analogous to the boundary between the home codeword
and another codeword. The stalactite is
like the shaded region in \figref{fig.overlap},
but reshaped to convey the idea that it is a region of very small volume.
Decoding errors correspond to the bat's intended trajectory passing
inside a stalactite. Collisions with stalactites at various distances
from the centre are possible.
If the friskiness
% (noise level)
is very small, the bat is usually very close to the
centre of the cave;
collisions will be rare,
and when they do occur, they will usually involve the
stalactites whose tips are closest to the centre point. Similarly,
under low-noise conditions, decoding errors will be rare,
and they will typically involve low-weight codewords. Under low-noise
conditions, the minimum distance of a code is relevant to
the (very small) probability of error.
\begin{figure}[hbtp]
\figuremargin{
\mbox{\psfig{figure=figs/cavereal.ps,angle=-90,width=3in}}
}{
\caption[a]{Berlekamp's schematic picture of Hamming space in
the vicinity of a codeword. The jagged solid line encloses all points to which
this codeword is the closest.
The $t$-sphere around the
codeword takes up a small fraction of this space.
}
\label{fig.cavereal}
}
\end{figure}
If the friskiness is higher, the bat may often make excursions
beyond the safe distance $t$ where the longest stalactites start,
but
% it is quite possible that
it will collide most frequently
with more distant stalactites, owing to their greater number.
There's only a tiny number of \ind{stalactite}s at the minimum
distance, so they are relatively unlikely to cause the errors.
Similarly, errors in a real
error-correcting code
depend on the properties of the \ind{weight enumerator} function.
At very high friskiness, the \ind{bat} is always a long way from the centre of
the \ind{cave}, and almost all its collisions involve contact with distant stalactites.
% bat in a cave.
Under these conditions,
the bat's collision frequency has nothing to do with
the distance from the centre to the closest stalactite.
%\section{Concatenation}
% see also _concat.tex
% this is the bit where we do the ``hamming are good'' story
\section{Concatenation of Hamming codes\nonexaminable}
\label{sec.concatenation}
It is instructive to play some more with the \ind{concatenation} of
\ind{Hamming code}s,\index{error-correcting code!Hamming}
a concept we first visited in \figref{fig.concath1},
because we will get insights into the notion of good codes
and the relevance or otherwise of the \ind{minimum distance} of a code.\index{distance!of code}
We can create a concatenated code
for a binary symmetric channel with noise density $f$
by encoding with
several Hamming codes in succession.
% /home/mackay/bin/concath.p~
% /home/mackay/_courses/itprnn/hamming/concath
% /home/mackay/_courses/itprnn/hamming/concath.gnu
The table recaps the key properties of
the Hamming codes, indexed by number of constraints, $M$.
All the Hamming codes have minimum distance $d=3$
and can correct one error in $N$.
\medskip% because of modified center
\begin{center}
\begin{tabular}{ll}\toprule
$N = 2^M-1$
& blocklength
\\
% $K$ &
$K = N - M$ & number of source bits \\
$p_{\rm b} = \smallfrac{3}{N} {{N} \choose {2}} f^2$
 & probability of bit error to leading order \\ \bottomrule
% $R$ & $K/N$ \\
\end{tabular}
\medskip
\end{center}
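As a worked instance of the last row of the table (added for illustration), take $N=7$:
\beq
 \frac{3}{7} {{7}\choose{2}} f^2 = \frac{3}{7} \times 21 \, f^2 = 9 f^2 .
\eeq
The factor ${{N}\choose{2}} f^2$ is the leading-order probability that two bits
of a block are flipped; the decoder then flips a third bit, leaving 3 of the $N$
bits wrong, which is where the factor $3/N$ comes from.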
\marginfig{
\begin{center}
%\mbox{%
\footnotesize
\raisebox{0.3591in}{$R$}%
\hspace{0.2in}%
\begin{tabular}{c}
\mbox{\psfig{figure=hamming/concath.rate.ps,%
width=40.5mm,angle=-90}}\\[0.1in]
\hspace{0.3in}$C$
\end{tabular}
%}
\end{center}
%}{%
\caption[a]{The rate $R$ of the concatenated Hamming code
as a function of the number of concatenations, $C$.
}
\label{fig.concath.rate}
}
%
% \subsection{Proving that good codes can be made by concatenation}
If we make a \ind{product code} by\index{error-correcting code!good}\index{error-correcting code!product code}
concatenating a sequence of $C$ Hamming codes with increasing $M$,
we can choose those parameters $\{ M_c \}_{c=1}^{C}$
in such a way that the rate of the product
code
% $R_C$
\beq
R_C = \prod_{c=1}^C \frac{N_c - M_c}{N_c}
\eeq
tends to a non-zero limit as $C$ increases.
For example, if we set $M_1 =2$, $M_2=3$, $M_3=4$, etc.,
then the asymptotic rate is 0.093 (\figref{fig.concath.rate}).
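The first few partial products for this choice (a worked evaluation, values rounded) are
\beq
 R_1 = \frac{1}{3} \simeq 0.333 , \:\:\:
 R_2 = \frac{1}{3} \cdot \frac{4}{7} \simeq 0.190 , \:\:\:
 R_3 = R_2 \cdot \frac{11}{15} \simeq 0.140 , \:\:\:
 R_4 = R_3 \cdot \frac{26}{31} \simeq 0.117 .
\eeq
Each further factor is $1 - M_c/(2^{M_c}-1)$, which approaches 1 so quickly
that the product settles at a non-zero value.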
The blocklength $N$ is a rapidly growing function of $C$, so these codes
are somewhat impractical.
A further weakness of these codes is\index{distance!of concatenated code}
that\index{error-correcting code!concatenated}
their\index{error-correcting code!product code}
minimum distance is not very good (\figref{fig.concath.n.d}).%
\amarginfig{b}{
\begin{center}
\small\footnotesize
%
%\hspace{0.042in}%
%\begin{tabular}{c}
%\mbox{\psfig{figure=hamming/concath.n.k.l.ps,%
%width=40.5mm,angle=-90}}
%% \\[-0.1in] $C$
%\end{tabular}\\[0.13in]
\hspace*{0.2042in}%
\begin{tabular}{c}
\mbox{\psfig{figure=hamming/concath.n.d.ps,%
width=40.5mm,angle=-90}}\\[0.1in]
\hspace{0.3in}$C$\\[-0.05in]
\end{tabular}
\end{center}
%}{%
\caption[a]{The blocklength $N_C$ (upper curve)
and
% $(N,K)$ (upper figure) and
minimum distance $d_C$ (lower curve)
% (lower figure)
of the concatenated Hamming code
as a function of the number of concatenations $C$.
}
\label{fig.concath.n.k.l}
\label{fig.concath.n.d}
}
%
% why is this fig not taking up its correct space?
%
% The blocklength $N$ is a rapidly growing function of $C$, so these codes
% are mainly of theoretical interest.
%
Every one of the constituent
Hamming codes has
\ind{minimum distance}\index{distance!of code} 3, so the minimum
distance of the $C$th product is $3^C$. The blocklength $N$ grows faster
% with $C$
than $3^C$, so the ratio $d/N$ tends to zero as $C$ increases. In contrast,
for typical random codes, the ratio $d/N$ tends to a constant\index{random code}
% distance tends to a fraction of $N$,
such that $H_2(d/N) = 1-R$.\index{Hamming code}
Concatenated Hamming codes\index{distance!bad}\index{distance!of product code}
thus have `bad' distance.% \pref{distance.defs}
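To see how quickly $d/N$ falls, here is a minimal sketch under the same assumptions as the text ($M_c = c+1$, distance $3^C$, blocklength equal to the product of the constituent blocklengths). The helper names are illustrative, and the bisection solver for $H_2(d/N) = 1-R$ is simply one convenient choice.

```python
from math import log2

def distance_ratio(C):
    """d_C / N_C for the product of C Hamming codes with M_c = c + 1."""
    d, N = 1, 1
    for c in range(1, C + 1):
        d *= 3                   # each constituent code has distance 3
        N *= 2**(c + 1) - 1      # assumption: blocklengths multiply
    return d / N

def binary_entropy(x):
    """H_2(x) in bits, for 0 < x < 1."""
    return -x * log2(x) - (1 - x) * log2(1 - x)

def gv_ratio(R, tol=1e-12):
    """Solve H_2(delta) = 1 - R for delta in (0, 1/2) by bisection."""
    lo, hi = 1e-12, 0.5
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binary_entropy(mid) < 1 - R:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

for C in (1, 2, 3, 5, 10):
    print(C, distance_ratio(C))
print("random-code d/N at R = 0.093:", gv_ratio(0.093))
```

The ratio for the concatenated codes shrinks towards zero as $C$ grows, whereas the random-code value at the same rate stays a constant fraction of $N$.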
Nevertheless, it turns out that this simple sequence of codes
yields good codes\index{error-correcting code!good} for some channels -- but
not very good codes
(see \sectionref{sec.good.codes} to recall the definitions of the terms
`good' and `very good').
Rather than prove this result, we will simply explore it numerically.
\Figref{fig.concath.rateeb} shows the bit error probability $p_{\rm b}$
of the concatenated
codes assuming that the constituent codes are decoded in sequence,
as described in section \ref{sec.concatdecode}. [This one-code-at-a-time
decoding is suboptimal, as we saw there.]
% refers to {tex/_concat.tex}% contains simple example
%
% concath.p
The horizontal axis shows the rates of the codes.
As the number of concatenations increases, the rate drops
to 0.093 and the error probability drops towards zero.
The channel assumed in the figure is the binary
symmetric channel with $\q=0.0588$. This is the highest noise level that
can be tolerated using this concatenated code.
\amarginfig{c}{
\begin{center}
\footnotesize
\mbox{%
\raisebox{0.591in}{$p_{\rm b}$}%
\hspace{0.2042in}%
\begin{tabular}{c}
\mbox{\psfig{figure=hamming/concath.rate.058.ps,%
width=40mm,angle=-90}}\\[0.1in]
\hspace{0.54in}$R$\\[-0.03in]
\end{tabular}}
\end{center}
%}{%
\caption[a]{The bit error probabilities versus the rates
$R$
of the concatenated Hamming codes, for the binary
symmetric channel with $\q=0.0588$. Labels alongside the points show the
blocklengths, $N$. The solid line shows the Shannon
limit for this channel.
The bit error probability drops to zero while the rate tends to
0.093, so the concatenated Hamming codes are a `good' code family.
}
\label{fig.concath.rateeb}
}
%%%%%%%%%%%%%%%%%%%% there is a major margin object problem here,
% don't understand it!
The take-home message from this story is
{\em{distance isn't everything}}.\index{distance!isn't everything}
% Indeed, t
The minimum distance of a code, although widely worshipped by coding
theorists, is not of fundamental importance\index{coding theory}
to Shannon's
mission of achieving reliable \ind{communication} over noisy channels.\index{Shannon, Claude}\index{coding theory}
\exercisxB{3}{ex.distancenotE}{
Prove that there exist families of codes with `bad' distance
that are `very good' codes.
}
% soln in _linear.tex
\section{Distance isn't everything}
Let's
% look at this assertion some more in order to
get a
quantitative feeling for the effect of the minimum distance
of a code, for the special case of a \ind{binary symmetric channel}.\index{channel!binary symmetric}
%\exampl{ex.bhat}{
\subsection{The error probability associated with one low-weight codeword}
\label{sec.err.prob.one}
% begin INTRO
Let a binary code have blocklength $N$ and
just two codewords, which differ in $d$ places. For simplicity, let's
assume $d$ is even.
What is the error probability if this code is used on a binary
symmetric channel with noise level $f$?
Bit flips matter only in places where the two codewords differ.
% Only flips of bits in the places that differ matter.
The error probability is dominated by the probability that $d/2$
of these bits are flipped.
What happens
to the other bits is irrelevant, since the optimal decoder ignores them.
\beqan
P(\mbox{block error}) & \simeq & {{d}\choose{d/2}} f^{d/2} (1-f)^{d/2} .
% \geq here if you want
\eeqan
This error probability associated with a single codeword of weight $d$
is plotted in \figref{fig.dist}.%
\amarginfig{c}{%
\footnotesize
\begin{tabular}{c}
\hspace*{0.2in}\psfig{figure=gnu/errorVdist.ps,width=1.8in,angle=-90}\\[0.1in]
\end{tabular}
% see /home/mackay/itp/gnu/dist.gnu
\caption[a]{ The error probability associated with a single codeword of weight $d$,
${{d}\choose{d/2}} f^{d/2} (1-f)^{d/2}$, as
a function of $f$.}
\label{fig.dist}
}
Using the approximation for the binomial coefficient (\ref{eq.stirling.choose}), we
can further approximate
\beqan
P(\mbox{block error})
% \leq here if you want
& \simeq & \left[ 2 f^{1/2} (1-f )^{1/2} \right]^{d} \\
& \equiv & [\beta(f)]^{d} ,
\label{eq.bhatta}
\eeqan
where $\beta(f) = 2 f^{1/2} (1-f )^{1/2}$
is called the \ind{Bhattacharyya parameter} of the channel.\nocite{Bhattacharyya}
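The intermediate step can be spelled out, assuming the binomial approximation referred to is $\log_2 {{N}\choose{r}} \simeq N H_2(r/N)$. Setting $N=d$ and $r=d/2$, and using $H_2(1/2)=1$,
\beqan
{{d}\choose{d/2}} & \simeq & 2^{d H_2(1/2)} \:\: = \:\: 2^{d} , \\
{{d}\choose{d/2}} f^{d/2} (1-f)^{d/2} & \simeq & 2^{d} \left[ f^{1/2} (1-f)^{1/2} \right]^{d}
 \:\: = \:\: \left[ 2 f^{1/2} (1-f )^{1/2} \right]^{d} .
\eeqan
Since ${{d}\choose{d/2}} \leq 2^d$, the quantity $[\beta(f)]^d$ is never smaller than the binomial expression it approximates.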
%\marginpar{\footnotesize{You don't need
% to memorize this name; indeed, I need to check this is the correct name, as it is not in the
%index of any coding theory books on my shelf! Must check in McEliece.}}
%
% Bhattacharyya, A.On a measure of divergence between two statistical
% populations defined by their probability distributions. Bull.
% Calcutta Math. Soc. 35 (1943), pp. 99-110.
%
% A recent book that calls your $\beta$ the Bhattacharyya parameter is
% Johanesson and Zigangirov's book on convolutional codes. I think some
% of Viterbi's books also use the term.
%
% end INTRO
% \subsection{Recap of `very bad' distance}
Now, consider a general linear code with distance $d$.
Its block error probability
must be at least ${{d}\choose{d/2}} f^{d/2} (1-f)^{d/2}$,
independent of the blocklength $N$ of the code.
For this reason, a sequence of codes of increasing blocklength
$N$ and constant distance $d$ (\ie, `very bad' distance)\label{sec.verybadisbad}
cannot have a block error probability
that tends to zero, on any binary symmetric channel.
If we are interested in making superb error-correcting
codes with tiny, tiny error probability,
we might therefore shun codes with bad distance.
However, being pragmatic, we should look more carefully
at \figref{fig.dist}.
In \chref{ch1} we argued that codes for disk drives
need an error probability smaller than about $10^{-18}$.
If the raw error probability in the \ind{disk drive} is
about $0.001$, the error probability associated
with one codeword at distance $d=20$ is smaller than
$10^{-24}$.
If the raw error probability in the disk drive is
about $0.01$, the error probability associated
with one codeword at distance $d=30$ is smaller than
$10^{-20}$.
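These two figures can be reproduced directly from the expression for a single codeword. The following is a minimal sketch with illustrative function names; it evaluates both the binomial expression and the cruder estimate $[\beta(f)]^d$ for comparison.

```python
from math import comb, sqrt

def single_codeword_error(d, f):
    """C(d, d/2) f^(d/2) (1-f)^(d/2), for even d."""
    half = d // 2
    return comb(d, half) * f**half * (1 - f)**half

def bhattacharyya_estimate(d, f):
    """[beta(f)]^d with beta(f) = 2 sqrt(f (1 - f))."""
    return (2 * sqrt(f * (1 - f)))**d

# The two disk-drive cases quoted in the text.
for d, f in [(20, 0.001), (30, 0.01)]:
    print(d, f, single_codeword_error(d, f), bhattacharyya_estimate(d, f))
```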
For practical purposes, therefore, it is not essential for
a code to have good distance. For example,
codes of blocklength $10\,000$, known to
have many codewords of weight 32, can nevertheless
correct errors of weight 320 with tiny error probability.
I wouldn't want you to think I am {\em recommending\/}
the use of codes with bad distance; in \chref{ch.ldpcc}
we will discuss low-density parity-check codes,
my favourite codes, which have both excellent performance
and {\em good\/} distance.
% These are my favourite codes.
% It's as a matter of honesty that I am pointing out
% that having good distance scarcely matters.
% So regardless of the blocklength used,
\section{The union bound}
The error probability of a code on the binary symmetric
channel can be bounded in terms
of its \ind{weight enumerator} function by adding up
appropriate multiples of
the error probability associated with a single codeword (\ref{eq.bhatta}):
\beq
P(\mbox{block error}) \leq \sum_{w>0} A(w) [\beta(f)]^w .
\label{eq.unionB}
\eeq
% could include Bob's poor man's coding theorem here.
This inequality, which is an example of a {\dem\ind{union bound}},
is accurate for low noise levels $f$,
but inaccurate for high noise levels, because it overcounts
the contribution of errors that cause confusion with more than
one codeword at a time.
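As an illustration, the bound can be evaluated for a small code. This minimal sketch uses the $(7,4)$ Hamming code's generator matrix, finds $A(w)$ by enumerating all sixteen codewords, and then sums $A(w) [\beta(f)]^w$; the function names are illustrative, and brute-force enumeration is only sensible for small $K$.

```python
from itertools import product
from collections import Counter
from math import sqrt

# Generator matrix of the (7,4) Hamming code (rows are basis codewords).
G = [
    [1, 0, 0, 0, 1, 0, 1],
    [0, 1, 0, 0, 1, 1, 0],
    [0, 0, 1, 0, 1, 1, 1],
    [0, 0, 0, 1, 0, 1, 1],
]

def codewords(G):
    """All modulo-two linear combinations of the rows of G."""
    K, N = len(G), len(G[0])
    for s in product((0, 1), repeat=K):
        yield tuple(sum(s[k] * G[k][n] for k in range(K)) % 2 for n in range(N))

def weight_enumerator(G):
    """A(w): the number of codewords of each weight w."""
    return Counter(sum(t) for t in codewords(G))

def union_bound(A, f):
    """Sum over w > 0 of A(w) * beta(f)^w."""
    beta = 2 * sqrt(f * (1 - f))
    return sum(count * beta**w for w, count in A.items() if w > 0)

A = weight_enumerator(G)
print(dict(A))
for f in (0.001, 0.01, 0.1):
    print(f, union_bound(A, f))
```

At $f=0.1$ the sum exceeds 1, so there the bound says nothing useful about a probability.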
%MNBV\newpage
\exercisxB{3}{ex.poormancoding}{
{\sf Poor man's noisy-channel coding theorem}.\index{noisy-channel coding theorem!poor
man's version}\index{poor man's coding theorem}
Pretending
that the union bound
(\ref{eq.unionB}) {\em is\/}
accurate, and using the
average {\ind{weight enumerator} function of a random linear code} (\ref{eq.wef.random}) (\secref{sec.wef.random})
as $A(w)$, estimate the maximum rate $R_{\rm UB}(f)$ at which
one can communicate over a binary symmetric channel.
Or, to look at it more positively, using the union bound
(\ref{eq.unionB}) as an inequality, show that communication
at rates up to $R_{\rm UB}(f)$ is possible over the binary symmetric channel.
% In proving this result, you are proving a `poor man's version' of
% {Shannon}'s noisy-channel coding theorem.
}
In the following chapter, by analysing the probability of error
of {\em \ind{syndrome decoding}\/} for a binary linear code,
and using a union bound, we will prove
Shannon's noisy-channel coding theorem (for
symmetric binary channels), and thus show that {\em very good linear codes exist}.
% possible point for exercise from exact.tex to be included.
\section{Dual codes\nonexaminable}
A concept that has some importance in coding theory,\index{error-correcting code!dual}
though we will have no immediate use for it in this book,
is the idea of the {\dem\ind{dual}} of a linear error-correcting code.
An $(N,K)$
linear error-correcting code can be thought of as a set of $2^{K}$
codewords
generated by adding together all combinations of $K$ independent basis
codewords. The generator matrix of the code consists of
those $K$ basis codewords, conventionally written as row vectors.
For example, the $(7,4)$ Hamming code's generator matrix (from \pref{eq.Generator})
% \eqref{eq.Generator},
is
\beq
\bG = \left[ \begin{array}{ccccccc}
\tt 1& \tt 0& \tt 0& \tt 0& \tt 1& \tt 0& \tt 1 \\
\tt 0& \tt 1& \tt 0& \tt 0& \tt 1& \tt 1& \tt 0 \\
\tt 0& \tt 0& \tt 1& \tt 0& \tt 1& \tt 1& \tt 1 \\
\tt 0& \tt 0& \tt 0& \tt 1& \tt 0& \tt 1& \tt 1 \\
\end{array} \right]
\label{eq.Generator2}
\eeq
and its sixteen codewords were displayed in
\tabref{tab.74h} (\pref{tab.74h}).
The codewords of this code are linear combinations of
the four vectors $\left[
\tt 1 \: \tt 0 \: \tt 0 \: \tt 0 \: \tt 1 \: \tt 0 \: \tt 1 \right]$,
$\left[
\tt 0 \: \tt 1 \: \tt 0 \: \tt 0 \: \tt 1 \: \tt 1 \: \tt 0 \right]$,
$\left[
\tt 0 \: \tt 0 \: \tt 1 \: \tt 0 \: \tt 1 \: \tt 1 \: \tt 1 \right]$,
and
$\left[
\tt 0 \: \tt 0 \: \tt 0 \: \tt 1 \: \tt 0 \: \tt 1 \: \tt 1 \right]$.
An $(N,K)$ code may also be described in terms
of an $M \times N$ parity-check matrix (where $M=N-K$)
as the set of vectors $\{ \bt \}$ that satisfy
\beq
\bH \bt = {\bf 0} .
\eeq
One way of thinking of this equation is that each row
of $\bH$ specifies a vector to which $\bt$ must be orthogonal
if it is a codeword.
\medskip
\noindent
\begin{conclusionbox}
The generator matrix specifies $K$ vectors {\em from
which\/} all codewords can be built, and
the parity-check matrix specifies a set of $M$ vectors
{\em to which\/}
all codewords are orthogonal. \smallskip
The dual of a code is obtained by exchanging the generator
matrix and the parity-check matrix.
\end{conclusionbox}
\medskip
\noindent
{\sf Definition.}
The set of {\em all\/} vectors of length $N$ that are orthogonal to all
codewords in a code, $\C$, is called the dual of the code, $\C^{\perp}$.
\medskip
If $\bt$ is orthogonal to $\bh_1$
and $\bh_2$, then it is also orthogonal to $\bh_3 \equiv \bh_1 + \bh_2$;
so all codewords are orthogonal to
any linear combination of the $M$ rows of $\bH$.
So
the set of all linear combinations of the rows of the parity-check matrix
is the dual code.
% called the dual of the code.
% The dual is itself a linear
% error-correcting code, whose generator matrix is $\bH$.
%% And similarly, t
% The parity-check matrix of the dual is $\bG$,
% the generator matrix of the first code.
For our Hamming $(7,4)$ code, the parity-check matrix is
(from \pref{eq.pcmatrix}):
\beq
\bH = \left[ \begin{array}{cc} \bP & \bI_3 \end{array}
\right] = \left[
\begin{array}{ccccccc}
\tt 1&\tt 1&\tt 1&\tt 0&\tt 1&\tt 0&\tt 0 \\
\tt 0&\tt 1&\tt 1&\tt 1&\tt 0&\tt 1&\tt 0 \\
\tt 1&\tt 0&\tt 1&\tt 1&\tt 0&\tt 0&\tt 1
\end{array} \right] .
\label{eq.pcmatrix2}
\eeq
% and the three vectors to which the codewords are
% orthogonal are
%$\left[
%\tt 1\: \tt 1\: \tt 1\: \tt 0\: \tt 1\: \tt 0\: \tt 0
% \right]$,
%$\left[
%\tt 0\: \tt 1\: \tt 1\: \tt 1\: \tt 0\: \tt 1\: \tt 0
% \right]$,
% and
%$\left[
%\tt 1\: \tt 0\: \tt 1\: \tt 1\: \tt 0\: \tt 0\: \tt 1
% \right]$.
% The codewords are not orthogonal to these $M$
% vectors only, however. I
The dual of the $(7,4)$ Hamming code $\H_{(7,4)}$
is the code shown in
\tabref{tab.74h.dual}.
\begin{table}[htbp]
\figuremargin{%
\begin{center}
\mbox{\small
\begin{tabular}{c} \toprule
% Transmitted sequence
% $\bt$ \\ \midrule
\tt 0000000 \\% yes
\tt 0010111 \\% yes
\bottomrule
\end{tabular} \hspace{0.02in}
\begin{tabular}{c} \toprule
% $\bt$ \\ \midrule
\tt 0101101 \\% yes
\tt 0111010 \\ \bottomrule % yes
\end{tabular} \hspace{0.02in}
\begin{tabular}{c} \toprule
% $\bt$ \\ \midrule
\tt 1001110 \\% yes
\tt 1011001 \\ \bottomrule % yes
\end{tabular} \hspace{0.02in}
\begin{tabular}{c} \toprule
% $\bt$ \\ \midrule
\tt 1100011 \\% yes
\tt 1110100 \\ % yes
\bottomrule
\end{tabular}
}%%%%%%%%% end of row of four tables
\end{center}
}{%
\caption[a]{The eight codewords
% $\{ \bt \}$
of the dual of the $(7,4)$ Hamming code.
[Compare with \protect\tabref{tab.74h},
\protect\pref{tab.74h}.]
}
\label{tab.74h.dual}
}
\end{table}
% STRANGE MISREF????????? CHECK
A possibly unexpected property
of this pair of codes is that the dual, $\H_{(7,4)}^{\perp}$,
is contained within the code $\H_{(7,4)}$ itself:
every word in the dual code is a codeword of the
original $(7,4)$ Hamming code.
This relationship can be written using set notation:
\beq
\H_{(7,4)}^{\perp} \subset \H_{(7,4)}
.
\eeq
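This containment is easy to confirm by brute force. The following minimal sketch (helper names are illustrative) forms all modulo-two combinations of the rows of $\bH$, which gives the dual code, and then checks that each of those words satisfies all three parity checks, which makes it a codeword of the Hamming code.

```python
from itertools import product

# Parity-check matrix of the (7,4) Hamming code.
H = [
    [1, 1, 1, 0, 1, 0, 0],
    [0, 1, 1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0, 1],
]

def span(rows):
    """All modulo-two linear combinations of the given rows."""
    n = len(rows[0])
    words = set()
    for coeffs in product((0, 1), repeat=len(rows)):
        words.add(tuple(sum(c * r[i] for c, r in zip(coeffs, rows)) % 2
                        for i in range(n)))
    return words

def satisfies_checks(H, t):
    """True if t is orthogonal (mod 2) to every row of H."""
    return all(sum(h * x for h, x in zip(row, t)) % 2 == 0 for row in H)

dual = span(H)
for word in sorted(dual):
    print("".join(map(str, word)), satisfies_checks(H, word))
```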
The possibility that the set of dual vectors
can overlap the set of codeword vectors is counterintuitive
if we think of the vectors as real vectors -- how
can a vector be orthogonal to itself?
But when we work in modulo-two arithmetic, many non-zero vectors
are indeed orthogonal
% perpendicular
to themselves!
\exercissxB{1}{ex.perp}{
Give a simple rule that distinguishes
whether a binary vector is orthogonal to itself, as is each of the
three vectors
$\left[
\tt 1\: \tt 1\: \tt 1\: \tt 0\: \tt 1\: \tt 0\: \tt 0
\right]$,
$\left[
\tt 0\: \tt 1\: \tt 1\: \tt 1\: \tt 0\: \tt 1\: \tt 0
\right]$,
and
$\left[
\tt 1\: \tt 0\: \tt 1\: \tt 1\: \tt 0\: \tt 0\: \tt 1
\right]$.
}
\subsection{Some more duals}
In general, if a code has a systematic generator matrix,
\beq
\bG = \left[ \bI_K | \bP^{\T} \right] ,
\eeq
where $\bP$ is an $M \times K$ matrix,
then its parity-check matrix is
\beq
\bH = \left[ \bP | \bI_M \right] .
\eeq
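One way to check that this $\bH$ goes with this $\bG$ is to confirm that every row of $\bG$ is orthogonal to every row of $\bH$. Multiplying the blocks out,
\beq
\bG \bH^{\T} = \left[ \bI_K | \bP^{\T} \right]
\left[ \begin{array}{c} \bP^{\T} \\ \bI_M \end{array} \right]
= \bP^{\T} + \bP^{\T} = {\bf 0} ,
\eeq
where the last step holds because we are working in modulo-two arithmetic, in which every entry added to itself gives zero.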
\exampl{example.rthreedual}{
The repetition code $\Rthree$ has generator matrix
\beq
\bG =\left[
\begin{array}{ccc}
\tt 1 &\tt 1 &\tt 1
\end{array}
\right];
% [{\tt 1\:1\:1} ] ;
\eeq
its parity-check matrix is
\beq
\bH = \left[
\begin{array}{ccc}
\tt 1 &\tt 1 &\tt 0 \\
\tt 1 &\tt 0 &\tt 1
\end{array}
\right] .
\eeq
The two codewords are [{\tt 1 1 1}] and [{\tt 0 0 0}].
The dual code has generator matrix
\beq
\bG^{\perp} = \bH = \left[
\begin{array}{ccc}
\tt 1 &\tt 1 &\tt 0 \\
\tt 1 &\tt 0 &\tt 1
\end{array}
\right]
\eeq
or equivalently, modifying $\bG^{\perp}$ into systematic form
by row additions,
% manipulations,
\beq
\bG^{\perp} = \left[
\begin{array}{ccc}
\tt 1 &\tt 0 &\tt 1 \\
\tt 0 &\tt 1 &\tt 1
\end{array}
\right] .
\eeq
We call this dual code the {\dem{simple parity code}} P$_3$;\index{error-correcting code!P$_3$}\index{error-correcting code!simple parity}\index{error-correcting code!dual}
it is the code with one parity-check bit, which is equal to
the sum of the two source bits.
The dual code's four codewords are
$ \left[
\tt 1 \: \tt 1 \: \tt 0
\right]
$,
$ \left[
\tt 1 \: \tt 0 \: \tt 1
\right]
$,
$ \left[
\tt 0 \: \tt 0 \: \tt 0
\right]
$,
and
$ \left[
\tt 0 \: \tt 1 \: \tt 1
\right]
$.
In this case, the only vector common to the code and the dual is
the all-zero codeword.
}
\subsection{Goodness of duals}
If a sequence of codes is `good', are their \index{error-correcting code!dual}duals
{good} too?\index{error-correcting code!good}
Examples can be constructed of all cases:
good codes with good duals (random linear codes);
bad codes with bad duals; and good codes with bad duals.
The last category is especially important:
many state-of-the-art codes have the property that
their duals are bad.
The classic example is the low-density parity-check code,
whose dual is a low-density generator-matrix code.\index{error-correcting code!low-density generator-matrix}
\exercisxB{3}{ex.ldgmbad}{
Show that low-density generator-matrix codes
are bad.
A family of low-density generator-matrix codes
is defined by two parameters $j,k$, which are the column
weight and row weight of all columns and rows respectively
of $\bG$. These weights are fixed, independent of $N$;
for example, $(j,k)=(3,6)$.
[Hint: show that the code has low-weight codewords, then
use the argument from \pref{sec.verybadisbad}.]
}
\exercisxD{5}{ex.ldpcgood}{
Show that low-density parity-check codes
are good, and have good distance.\index{error-correcting code!low-density parity-check}
(For solutions, see \citeasnoun{Gallager63} and
\citeasnoun{mncN}.)
}
\subsection{Self-dual codes}
The $(7,4)$ Hamming code had the property that the dual
was contained in the code itself.
% used to say -
% A code is {\dem{\ind{self-orthogonal}}} if it contains its dual.
A code is {\dem{\ind{self-orthogonal}}\/} if it is contained in its dual.
For example,
the dual of the $(7,4)$ Hamming code is a self-orthogonal code.
One way of seeing this is that the overlap between any pair
of rows of $\bH$ is even.
%\marginpar{Is
% it an accepted abuse of terminology to also say
% a code is self-orthogonal if it contains its dual?}
Codes that contain their duals are important in quantum error-correction
\cite{ShorCSS}.
It is intriguing, though not necessarily useful, to
look at codes that are {\dem\ind{self-dual}}.
A code $\C$ is self-dual if
the dual
of the code is identical to the code.
% Here, we are looking for codes that satisfy
\beq
\C^{\perp} = \C .
\eeq
Some properties of self-dual codes can be deduced:
%
\ben
\item
If a code is self-dual, then its generator matrix is also a parity-check
matrix for the code.
\item
Self-dual codes have rate $1/2$, \ie, $M=K=N/2$.
\item
All codewords have even weight.
\een
\exercissxB{2}{ex.selfdual}{
What property must the matrix $\bP$ satisfy, if the code
with generator matrix
$\bG = \left[ \bI_K | \bP^{\T} \right]$
is self-dual?
}
\subsubsection{Examples of self-dual codes}
\ben
\item
The repetition code R$_2$ is a simple example of
a self-dual code.
\beq
\bG = \bH = \left[
\begin{array}{cc}
\tt 1 &\tt 1
\end{array}
\right] .
% [{\tt 1 \: 1 } ]
\eeq
\item
The smallest non-trivial self-dual code is the following
$(8,4)$ code.
\beq
\bG = \left[ \begin{array}{c|c} \bI_4 & \bP^{\T} \end{array}
\right] = \left[
\begin{array}{cccc|cccc}
\tt 1&\tt 0&\tt 0 &\tt 0 &\tt 0&\tt 1&\tt 1&\tt 1\\
\tt 0&\tt 1&\tt 0 &\tt 0 &\tt 1&\tt 0&\tt 1&\tt 1\\
\tt 0&\tt 0&\tt 1 &\tt 0 &\tt 1&\tt 1&\tt 0&\tt 1\\
\tt 0&\tt 0&\tt 0 &\tt 1 &\tt 1&\tt 1&\tt 1&\tt 0
\end{array} \right] .
\label{eq.selfdual84G}
\eeq
\een
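The listed properties can be checked numerically for the $(8,4)$ code. This is a minimal sketch with illustrative names: if $\bG \bG^{\T} = {\bf 0}$ in modulo-two arithmetic then the code is contained in its dual, and since here $K = N/2$ the dual has the same number of codewords, so the two sets coincide.

```python
from itertools import product

# The P^T block of the (8,4) generator matrix G = [I_4 | P^T].
P_T = [
    [0, 1, 1, 1],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [1, 1, 1, 0],
]
G = [[int(i == k) for i in range(4)] + P_T[k] for k in range(4)]

def gram_mod2(G):
    """G G^T in modulo-two arithmetic."""
    return [[sum(a * b for a, b in zip(r, s)) % 2 for s in G] for r in G]

def codeword_weights(G):
    """Sorted list of the weights of all codewords generated by G."""
    K, N = len(G), len(G[0])
    return sorted(
        sum(sum(s[k] * G[k][n] for k in range(K)) % 2 for n in range(N))
        for s in product((0, 1), repeat=K)
    )

print(gram_mod2(G))          # every entry is zero: rows are mutually orthogonal
print(codeword_weights(G))   # every weight is even
```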
\exercissxB{2}{ex.dual84.74}{
Find the relationship of the above $(8,4)$ code to the $(7,4)$ Hamming code.
}
\subsection{Duals and graphs}
Let a code be represented by a graph in which there are
nodes of two types, parity-check constraints and equality
constraints, joined by edges which represent the bits
of the code (not all of which need be transmitted).
The dual code's graph is obtained by replacing all
\ind{parity-check nodes} by equality nodes and {\em vice versa}.
This type of graph is called a \ind{normal graph} by
\citeasnoun{Forney2001}.
% Forney
% added Thu 16/1/03
\subsection*{Further reading}
Duals are important in coding theory because functions
involving a code (such as the posterior distribution over
codewords) can be transformed by a \ind{Fourier transform}
into functions over the dual code.
For an accessible introduction to Fourier analysis on
finite groups, see \citeasnoun{Terras99}.
See also \citeasnoun{macwilliams&sloane}.
\section{Generalizing perfectness to other channels}
Having given up on the search for \ind{perfect code}s
for the binary symmetric channel, we could console
ourselves by changing channel.
We could call a code
`a perfect $u$-error-correcting code for the binary \ind{erasure channel}'\index{channel!erasure}
if it can restore any $u$ erased bits, and never more than $u$.%
\marginpar{\small\raggedright{In a perfect $u$-error-correcting code for the
binary {erasure channel}, the number of redundant bits must be $N-K=u$.
}}
Rather than using the word perfect, however,
the conventional term for such a code is a `\ind{maximum distance separable} code', or MDS code.
\label{sec.RAIDII}
% Examples:
As we already noted in \exerciseref{ex.raid3},
the $(7,4)$ \ind{Hamming code} is {\em not\/}
an MDS
% maximum distance separable
code.
It can recover {\em some\/} sets of 3 erased bits,
but not all. If any 3 bits corresponding to a codeword of weight 3
are erased, then one bit of information is unrecoverable.
This is why the $(7,4)$ code is a poor choice for a \ind{RAID} system.
%A maximum distance separable (MDS) block code is a linear code whose distance is maximal among all linear
% block codes of rate k/n. It is well known that MDS block codes do exist if the field size is more than n.
A tiny example of a
maximum distance separable code\index{erasure-correction}\index{error-correcting code!maximum distance separable}\index{error-correcting code!parity-check code}\index{MDS}
is the simple parity-check code $P_{3}$
whose parity-check matrix is
$\bH = [{\tt 1\, 1\, 1}]$.
This code has 4 codewords, all of which have even parity. All codewords
are separated by a distance of 2. Any single erased bit can be restored
by setting it to the parity of the other two bits.
The repetition codes are also maximum distance separable codes.
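These statements can be tested by brute force for small codes. In this minimal sketch (function names are illustrative), a set of erased positions counts as unrecoverable if two different codewords agree on every unerased position. The generator matrices for $P_3$, $\Rthree$ and the $(7,4)$ Hamming code are the ones used elsewhere in this chapter.

```python
from itertools import combinations, product

def codewords(G):
    """All modulo-two linear combinations of the rows of G."""
    K, N = len(G), len(G[0])
    return [tuple(sum(s[k] * G[k][n] for k in range(K)) % 2 for n in range(N))
            for s in product((0, 1), repeat=K)]

def unrecoverable_erasure_sets(G, u):
    """Sets of u erased positions that leave two codewords indistinguishable."""
    words = codewords(G)
    N = len(G[0])
    bad = []
    for erased in combinations(range(N), u):
        kept = [n for n in range(N) if n not in erased]
        seen = {tuple(t[n] for n in kept) for t in words}
        if len(seen) < len(words):
            bad.append(erased)
    return bad

G_P3 = [[1, 0, 1], [0, 1, 1]]      # simple parity code, M = 1
G_R3 = [[1, 1, 1]]                 # repetition code, M = 2
G_H74 = [
    [1, 0, 0, 0, 1, 0, 1],
    [0, 1, 0, 0, 1, 1, 0],
    [0, 0, 1, 0, 1, 1, 1],
    [0, 0, 0, 1, 0, 1, 1],
]                                  # (7,4) Hamming code, M = 3

print(unrecoverable_erasure_sets(G_P3, 1))
print(unrecoverable_erasure_sets(G_R3, 2))
print(unrecoverable_erasure_sets(G_H74, 3))
```

For $P_3$ and $\Rthree$ the lists are empty, while for the $(7,4)$ Hamming code the unrecoverable triples are exactly the supports of its weight-3 codewords.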
\exercissxB{5}{ex.qeccodeperfect}{
Can you make an $(N,K)$ code, with $M=N-K$ parity symbols,
for a $q$-ary erasure channel, such that the decoder can recover
the codeword when {\em{any}\/} $M$ symbols
are erased in a block
of $N$?
[Example: for
% There do exist some such codes: for example, for
the channel with
$q=4$ symbols there is
an $(N,K) = (5,2)$ code which can correct any $M=3$ erasures.]
% ; and for $q=8$ there is a $(9,2)$ code.]
}
For the $q$-ary erasure channel with $q>2$, there are large numbers
of MDS codes, of which the Reed--Solomon codes are the most
famous and most widely used.
As long as the field size $q$ is bigger than the blocklength $N$,
MDS block codes of any rate can be found. (For further reading, see \citeasnoun{lincostello83}.)
% according to my notes.
% 4-ary erasure channel.
% Include tournament example. GF4, 16 individuals. can tolerate 3 erasures.
% Reed--Solomon codes.
\section{Summary}
Shannon's codes for the binary symmetric channel
can almost always correct $\fN$ errors, but they
are not $\fN$-error-correcting codes.
%\noindent
\subsection*{Reasons why the distance of a code has little relevance}
\ben
\item
The Shannon limit shows that the best codes must be able to
cope with a noise level twice as big as the maximum
noise level for a bounded-distance decoder.
\item
When the binary symmetric channel has
$f>1/4$, no code with a bounded-distance decoder
can communicate at all; but Shannon says good codes exist
for such channels.
\item
Concatenation shows that we can get good performance even if
the distance is bad.\index{concatenation}\index{distance!of code}
\een
%
% Furthermore, `distance isn't everything' -- you can actually
% get to the Shannon limit with a code whose distance is `bad'.
%
% Exercise - prove that if a sequence of codes is very bad then it can't
% have arbitrarily small error probability.
The whole weight enumerator function is relevant to the question
of whether a code is a good code.
The relationship between good codes and
distance properties is discussed further in \exerciseref{ex.prob.error.match}.
% ex.equal.threshold}.
%\section*{Further reading}
% For a paper with codes having the property
% distance, but for practical purposes a code with blocklength $N=10\,000$
% can have codewords of weight $d=32$ and the error probability
% can remain negligibly small even when the channel
% is creating errors of weight 320.
% {mackaymitchisonmcfadden2003}
\section{Further exercises}
% also known as {ex.equal.threshold}
\exercissxC{3}{ex.prob.error.match}{
A codeword $\bt$ is selected from a linear $(N,K)$
code $\C$, and it is transmitted
over a noisy channel; the received signal is
$\by$.
We assume that the channel is a memoryless
channel such as a Gaussian channel.
Given an assumed channel model $P(\by \given \bt)$, there are
two decoding problems.
\begin{description}
\item[The codeword decoding problem] is the task of\index{decoder!codeword}
inferring which codeword $\bt$ was transmitted given the
received signal.
\item[The bitwise decoding problem] is the task of inferring\index{decoder!bitwise}
for each transmitted bit $t_n$ how likely it is that that
bit was a one rather than a zero.
\end{description}
Consider optimal decoders for these two decoding problems.
%
% these will be presented again in
% section \ref{sec.decoding.problems}
% exact.tex
%
Prove that the probability of error of the optimal
bitwise-decoder is closely related to the probability of error of
the optimal codeword-decoder, by proving the following
theorem.\index{decoder!probability of error}
\begin{ctheorem}
If a binary linear code\index{distance!of code, and error probability}
has minimum distance $d_{\min}$,
then,
for any given channel, the codeword bit error probability of the optimal
bitwise decoder, $p_{\rm b}$,
and the block error probability of the maximum likelihood decoder, $p_{\rm B}$,
are related by:
\beq
p_{\rm B} \geq p_{\rm b} \geq \frac{1}{2} \frac{d_{\min}}{N} p_{\rm B} .
\label{eq.thmpBpb}
\eeq
% [I am sure this theorem is well-known; I am not claiming it is original.]
\end{ctheorem}
}
\exercisaxA{1}{ex.HammingD}{
What are the minimum distances of the $(15,11)$ Hamming
code and the $(31,26)$ Hamming
code?
}
\exercisaxB{2}{ex.estimate.wef}{
Let $A(w)$ be the
average weight enumerator function of a rate-$1/3$
random linear code with $N=540$ and $M=360$.
Estimate, from first principles, the value of $A(w)$ at $w=1$.
}
\exercisaxC{3C}{ex.handshakecode}{
{\sf A code with minimum distance\index{Gilbert--Varshamov distance}\index{distance!Gilbert--Varshamov}
greater than $d_{\rm GV}$.}
% Another way to make a code is to define a generator matrix
% or parity-check matrix.
A rather nice $(15,5)$ code
is generated by this generator matrix, which is based on measuring the parities
of all the ${{5}\choose{3}} = 10$ triplets of source bits:
\beq
\bG = \left[
\begin{array}{*{15}{c}}
1&\tinyo&\tinyo&\tinyo&\tinyo&\tinyo&1&1&1&\tinyo&\tinyo&1&1&\tinyo&1 \\
\tinyo&1&\tinyo&\tinyo&\tinyo&\tinyo&\tinyo&1&1&1&1&\tinyo&1&1&\tinyo \\
\tinyo&\tinyo&1&\tinyo&\tinyo&1&\tinyo&\tinyo&1&1&\tinyo&1&\tinyo&1&1\\
\tinyo&\tinyo&\tinyo&1&\tinyo&1&1&\tinyo&\tinyo&1&1&\tinyo&1&\tinyo&1\\
\tinyo&\tinyo&\tinyo&\tinyo&1&1&1&1&\tinyo&\tinyo&1&1&\tinyo&1&\tinyo
\end{array} \right] .
\eeq
Find the minimum distance and weight enumerator function
of this code.
}
\exercisaxC{3C}{ex.findAwmonodec}{
% {\sf A code with minimum distance\index{Gilbert--Varshamov distance}\index{distance!Gilbert--Varshamov}
% slightly greater than $d_{\rm GV}$.}
Find the minimum distance of the `{pentagonful}\index{pentagonful code}'\index{error-correcting code!pentagonful}%
\amarginfig{t}{
\begin{center}
\buckypsfigw{pentagon.eps}
\end{center}
\caption[a]{The graph of the pentagonful
low-density parity-check code with
15 bit nodes (circles) and 10 parity-check nodes (triangles).
}
}
low-density parity-check code whose
parity-check matrix is
\beq
\bH = \left[ \begin{array}{*{5}{c}|*{5}{c}|*{5}{c}}
1 & \tinyo & \tinyo & \tinyo & 1 & 1 & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo \\
1 & 1 & \tinyo & \tinyo & \tinyo & \tinyo & 1 & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo \\
\tinyo & 1 & 1 & \tinyo & \tinyo & \tinyo & \tinyo & 1 & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo \\
\tinyo & \tinyo & 1 & 1 & \tinyo & \tinyo & \tinyo & \tinyo & 1 & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo \\
\tinyo & \tinyo & \tinyo & 1 & 1 & \tinyo & \tinyo & \tinyo & \tinyo & 1 & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo \\ \hline
\tinyo & \tinyo & \tinyo & \tinyo & \tinyo & 1 & \tinyo & \tinyo & \tinyo & \tinyo & 1 & \tinyo & \tinyo & \tinyo & 1 \\
\tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & 1 & \tinyo & 1 & 1 & \tinyo & \tinyo & \tinyo \\
\tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & 1 & \tinyo & \tinyo & \tinyo & \tinyo & 1 & 1 & \tinyo & \tinyo \\
\tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & 1 & \tinyo & \tinyo & 1 & 1 & \tinyo \\
\tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & 1 & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & 1 & 1
\end{array} \right] .
\label{eq.monodec}
\eeq
Show that nine of the ten rows are independent, so the
code has parameters $N=15$, $K=6$.
Using a computer, find its weight enumerator function.
% Find its weight enumerator function.
}
\exercisxB{3C}{ex.concateex}{
Replicate the calculations used to produce
\figref{fig.concath.rate}.
Check the assertion that the highest noise level
that's correctable is 0.0588.
Explore alternative concatenated
sequences of codes. Can you find a better sequence of concatenated
codes -- better in the sense that it
has either higher asymptotic rate $R$ or can tolerate
a higher noise level $\q$?
}
\exercissxA{3}{ex.syndromecount}{
Investigate the possibility of achieving the Shannon
limit with linear block codes, using the following \ind{counting argument}.
Assume a linear code of large blocklength $N$ and rate $R=K/N$.
The code's parity-check matrix $\bH$ has $M = N - K$ rows.
Assume that the code's optimal decoder, which solves the
syndrome decoding problem $\bH \bn = \bz$, allows reliable communication
over a binary symmetric channel with flip probability $f$.
How many `typical' noise vectors $\bn$ are there?
Roughly how many distinct syndromes $\bz$ are there?
Since $\bn$ is reliably deduced from $\bz$ by the optimal decoder,
the number of syndromes must be greater than or equal to the number of
typical noise vectors. What does this tell you about the largest
possible value of rate $R$ for a given $f$?
}
\exercisxB{2}{ex.zchanneldeficit}{
Linear binary codes use the input symbols {\tt{0}} and {\tt{1}} with
equal probability, implicitly treating the channel as a symmetric
channel. Investigate how much loss in communication rate is caused by
this assumption, if in fact the channel is a highly asymmetric channel.
Take as an example a Z-channel. How much smaller is the maximum possible rate
of communication using symmetric inputs than the capacity of the channel?
[Answer: about 6\%.]
}
\exercisxC{2}{ex.baddistbad}{
Show that codes with `very bad' distance are `bad' codes, as defined
in \secref{sec.bad.code.def} (\pref{sec.bad.code.def}).
%
% Show that there exist codes with `bad' distance
% that are `very good' codes.
%
% this bit already done in {ex.distancenotE}{
}
\exercisxC{3}{ex.puncture}{
One linear code can be obtained from another
by {\dem{\ind{puncturing}}}. Puncturing
means taking each codeword and deleting a defined set of bits.
Puncturing turns an $(N,K)$ code into
an $(N',K)$ code, where $N'<N$.
For $q$-ary erasure channels with $q>2$, some MDS codes can be found.
As a simple example, here is a $(9,2)$ code for the
$8$-ary erasure channel.
The code is defined in terms of the\index{Galois field}
% \index{finite field}
multiplication and addition rules of $GF(8)$,
which are given in \appendixref{sec.gf8}.
The elements of the input alphabet are $\{0,1,A,B,C,D,E,F\}$
and
the
generator matrix of the code is
\beq
\bG = \left[ \begin{array}{*{9}{c}}
1 &0 &1 &A &B &C &D &E &F \\
0 &1 &1 &1 &1 &1 &1 &1 &1 \\
\end{array} \right] .
\eeq
The resulting 64 codewords are:\smallskip
{\footnotesize\tt
\begin{narrow}{0in}{-\margindistancefudge}%
\begin{realcenter}
\begin{tabular}{*{8}{c}}
000000000 &
011111111 &
0AAAAAAAA &
0BBBBBBBB &
0CCCCCCCC &
0DDDDDDDD &
0EEEEEEEE &
0FFFFFFFF
\\
101ABCDEF &
110BADCFE &
1AB01EFCD &
1BA10FEDC &
1CDEF01AB &
1DCFE10BA &
1EFCDAB01 &
1FEDCBA10
\\
A0ACEB1FD &
A1BDFA0EC &
AA0EC1BDF &
AB1FD0ACE &
ACE0AFDB1 &
ADF1BECA0 &
AECA0DF1B &
AFDB1CE0A
\\
B0BEDFC1A &
B1AFCED0B &
BA1CFDEB0 &
BB0DECFA1 &
BCFA1B0DE &
BDEB0A1CF &
BED0B1AFC &
BFC1A0BED
\\
C0CBFEAD1 &
C1DAEFBC0 &
CAE1DC0FB &
CBF0CD1EA &
CC0FBAE1D &
CD1EABF0C &
CEAD10CBF &
CFBC01DAE
\\
D0D1CAFBE &
D1C0DBEAF &
DAFBE0D1C &
DBEAF1C0D &
DC1D0EBFA &
DD0C1FAEB &
DEBFAC1D0 &
DFAEBD0C1
\\
E0EF1DBAC &
E1FE0CABD &
EACDBF10E &
EBDCAE01F &
ECABD1FE0 &
EDBAC0EF1 &
EE01FBDCA &
EF10EACDB
\\
F0FDA1ECB &
F1ECB0FDA &
FADF0BCE1 &
FBCE1ADF0 &
FCB1EDA0F &
FDA0FCB1E &
FE1BCF0AD &
FF0ADE1BC
\\
\end{tabular}
\end{realcenter}
\end{narrow}
}
}
% from exercise section in _linear.tex
%
% this was in _sexact
%
% ex.prob.error.match
\soln{ex.prob.error.match}{% ex.equal.threshold}{
{\sf Quick, rough proof of the theorem.} Let $\bx$ denote the difference
between the reconstructed codeword and the transmitted codeword.
For any given channel output $\br$, there is a posterior distribution over
$\bx$. This posterior distribution is positive only
on vectors $\bx$ belonging to the code; the sums
that follow are over codewords $\bx$. The block error probability is:
\beq
p_{\rm B} = \sum_{\bx \neq 0} P(\bx \given \br) .
\label{eq.pBdef}
\eeq
The average bit error probability, averaging over all bits in
the codeword, is:
\beq
p_{\rm b} = \sum_{\bx \neq 0} P(\bx \given \br) \frac{w(\bx)}{N} ,
\label{eq.pbdef}
\eeq
where $w(\bx)$ is the weight of codeword $\bx$.
Now the weights of the non-zero codewords satisfy
\beq
1 \geq \frac{w(\bx)}{N} \geq \frac{d_{\min}}{N} .
\label{eq.ineq}
\eeq
Substituting the inequalities (\ref{eq.ineq}) into
the definitions (\ref{eq.pBdef},$\,$\ref{eq.pbdef}),
we obtain:
%
\beq
p_{\rm B} \geq p_{\rm b} \geq
% \frac{1}{2}
\frac{d_{\min}}{N} p_{\rm B} ,
\label{eq.thmpBpbA}
\eeq
which is a factor of two stronger, on the right, than
the stated result (\ref{eq.thmpBpb}).
In making the proof watertight, I have weakened the result a little.\medskip
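For a concrete feel for (\ref{eq.thmpBpbA}), take as an illustration
the $(7,4)$ Hamming code (a choice made here only for concreteness),
which has $N=7$ and $d_{\min}=3$. Then
\[
 p_{\rm B} \geq p_{\rm b} \geq \frac{3}{7} \, p_{\rm B} ,
\]
so for this code the bit error probability is within a factor of $7/3$
of the block error probability.\medskip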
% So the bit and block {\em thresholds\/} of a code with good distance
% are identical.
%\section
\noindent
{\sf Careful proof.}
The theorem relates the performance of the optimal
block decoding algorithm and the optimal bitwise decoding algorithm.
We introduce another pair of decoding algorithms, called the block-guessing
decoder and the bit-guessing decoder. The idea is that
these two algorithms are similar to
the optimal block decoder and the optimal bitwise decoder,
but lend themselves more easily to
analysis.
We now define these decoders. Let $\bx$ denote
the inferred codeword. For any given code:
\begin{description}
\item[The optimal block decoder] returns the codeword $\bx$ that maximizes
the posterior probability
$P(\bx \given \br)$, which is proportional to the likelihood
$P( \br \given \bx)$.
The probability of error of this decoder is called
$\PB$.
\item[The optimal bit decoder] returns for each of the $N$ bits, $x_n$,
the value of $a$ that maximizes
the posterior probability
$P( x_n \eq a \given \br ) = \sum_{\bx} P(\bx \given \br) \,\truth\! [ x_n\eq a ]$.
The probability of error of this decoder is called
$\Pb$.
\item[The block-guessing decoder] returns a random codeword $\bx$
with probability distribution given by the posterior probability
$P(\bx \given \br)$.
The probability of error of this decoder is called
$\PGB$.
\item[The bit-guessing decoder] returns for each of the $N$ bits, $x_n$,
a random bit from the probability distribution $P( x_n \eq a \given \br )$.
The probability of error of this decoder is called
$\PGb$.
\end{description}
The theorem states that
the optimal bit error probability $\Pb$
is bounded above by
$\PB$ and below by a given multiple of $\PB$ (\ref{eq.thmpBpb}).
%
%\beq
% P_B \geq P_b \geq \frac{1}{2} \frac{d_{\min}}{N} P_B .
%\label{eq.thmpBpb.again}
%\eeq
The left-hand inequality in (\ref{eq.thmpBpb})
is trivially true -- if a block is correct,
all its constituent bits are correct; so if the optimal
block decoder outperformed the optimal bit decoder, we could
make a better bit decoder from the block decoder.
We prove the right-hand inequality by establishing that:
% the following two lemmas:
\ben
\item
the bit-guessing decoder is nearly
as good as the optimal bit decoder:
\beq
\PGb \leq 2 \Pb .
\label{eq.guess}
\eeq
\item
the bit-guessing decoder's error probability
is related to the block-guessing decoder's
by
\beq
\PGb \geq \frac{d_{\min}}{N} \PGB .
\eeq
\een
Then since $\PGB \geq \PB$, we have
\beq
\Pb \geq \frac{1}{2} \PGb \geq \frac{1}{2} \frac{d_{\min}}{N} \PGB
\geq \frac{1}{2} \frac{d_{\min}}{N} \PB .
\eeq
We now prove the two lemmas.\medskip
\noindent
%\subsection
{\sf Near-optimality of guessing:}
Consider first the case of
a single bit, with posterior probability $\{ p_0, p_1 \}$.
% Without loss of generality, let $p_0 \geq p_1$.
The optimal bit decoder
% picks $\argmax_a p_a$,
% \ie, 0,
% and
has probability of error
\beq
% \Pb
P^{\rm{optimal}} = \min (p_0,p_1).
\eeq
% $p_1$.
The guessing decoder picks from 0 and 1. The truth is also
distributed with the same probability. The probability
that the guesser and the truth match is
$p_0^2 + p_1^2$; the probability that they
mismatch is the guessing error probability,
\beq
% \PGb
P^{\rm guess} = 2 p_0 p_1 \leq 2 \min (p_0,p_1) = 2 P^{\rm{optimal}} .
\eeq
Since $\PGb$ is the average
of many such error probabilities, $P^{\rm guess}$,
and $\Pb$ is the average of the corresponding optimal
error probabilities, $P^{\rm{optimal}}$,
we obtain the desired relationship (\ref{eq.guess})
between $\PGb$ and $\Pb$.\ENDproof
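To illustrate the single-bit inequality with made-up numbers: if a bit has
posterior probabilities $\{ p_0, p_1 \} = \{ 0.9, 0.1 \}$, then
\[
 P^{\rm{optimal}} = \min (0.9,0.1) = 0.1 , \hspace{0.3in}
 P^{\rm guess} = 2 \times 0.9 \times 0.1 = 0.18 \leq 0.2 = 2 P^{\rm{optimal}} .
\]
The factor of two is approached only when one of the two probabilities is small;
at $p_0 = p_1 = 0.5$ the two decoders have the same error probability, $0.5$.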
%
\medskip
%\subsection
\noindent
{\sf Relationship between bit error probability
and block error probability:}
The bit-guessing and block-guessing decoders
can be combined in a single system:
% The posterior probability of a bit $x_n$ and a block $\bx$
% is given by
%\beq
% P( x_n = a , \bx \given \br ) =
% P( \bx \given \br ) P( x_n = a \given \bx, \br ) =
%\eeq
% So w
we can draw a sample $x_n$ from the marginal distribution
$P(x_n \given \br)$ by
drawing a sample $( x_n , \bx )$
from the joint distribution $P( x_n , \bx \given \br )$,
then discarding the value of $\bx$.
We can distinguish between two cases: the discarded value of $\bx$
is the correct codeword, or not.
The probability of bit error for the bit-guessing decoder
can then be written as a sum of two terms:
\beqa
\PGb &\eq &
P(\mbox{$\bx$ correct}) P(\mbox{bit error} \given \mbox{$\bx$ correct})
\nonumber
\\
& & + \,
P(\mbox{$\bx$ incorrect}) P(\mbox{bit error} \given \mbox{$\bx$ incorrect})
\\
&=&
% P(\mbox{$\bx$ correct}) \times
0 + \PGB P(\mbox{bit error} \given \mbox{$\bx$ incorrect}) .
\eeqa
% The first of these terms is zero.
Now, whenever the guessed $\bx$ is incorrect, the true
$\bx$ must differ from it in at least $d_{\min}$ bits, so
the probability of bit error in these cases is at least $d_{\min}/N$.
So
\[%beq
\PGb \geq \frac{d_{\min}}{N} \PGB .
% \eepf
\]%eeq
QED.\hfill $\epfsymbol$
}
\soln{ex.syndromecount}{
The number of `typical' noise vectors $\bn$ is
roughly $2^{NH_2(f)}$.
% , where $H=H_2(f)$.
The number of distinct syndromes $\bz$ is $2^M$.
So reliable communication implies
\beq
M \geq NH_2(f) ,
\eeq
or, in terms of the rate $R = 1-M/N$,
\beq
R \leq 1 - H_2(f) ,
\eeq
a bound which agrees precisely with the capacity of the channel.
This argument is turned into a proof in the following chapter.
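As a numerical illustration (the noise level $f=0.1$ is chosen here
arbitrarily),
\[
 H_2(0.1) = 0.1 \log_2 \frac{1}{0.1} + 0.9 \log_2 \frac{1}{0.9}
 \simeq 0.33 + 0.14 = 0.47 ,
\]
so the counting argument requires $M \geq 0.47 N$ parity checks,
that is, $R \leq 0.53$.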
}
% BORDERLINE
\soln{ex.hat.puzzle}{
% Mathematicians credit the problem to Dr. Todd Ebert, a computer
% science instructor at the University of California at Irvine, who
% introduced it in his Ph.D. thesis at the University of California at
% Santa Barbara in 1998.
In the three-player case,
it is possible for the group to win three-quarters of the time.
Three-quarters of the time, two of the players will have hats of the
same colour and the third player's hat will be the opposite colour. The
group can win every time this happens by using the following strategy.
Each player looks at the other two players'
hats. If the two hats are {\em different\/}
colours, he passes. If they are the
{\em same\/} colour, the player guesses his own hat is the {\em opposite\/}
colour.
This way, every time the hat colours are distributed two and one, one
player will guess correctly and the others will pass, and the group
will win the game. When all the hats are the same colour, however, {\em all
three\/} players will guess incorrectly and the group will lose.
When any particular player guesses a colour, it is true
that there is only a 50:50 chance that their guess is right.
The reason that the group wins 75\% of the time is that their
strategy ensures that when players are guessing wrong, a great many are
guessing wrong.
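The eight equiprobable states of the three-player game can be tabulated
as follows, writing the two colours as {\tt{0}} and {\tt{1}} and numbering
the players 1, 2, 3 from left to right (a labelling introduced here just for this table).
\begin{center}
\begin{tabular}{ccc} \hline
Hats & Players who guess & Outcome \\ \hline
{\tt{000}} & 1, 2, 3 (all wrong) & lose \\
{\tt{001}} & 3 (right) & win \\
{\tt{010}} & 2 (right) & win \\
{\tt{011}} & 1 (right) & win \\
{\tt{100}} & 1 (right) & win \\
{\tt{101}} & 2 (right) & win \\
{\tt{110}} & 3 (right) & win \\
{\tt{111}} & 1, 2, 3 (all wrong) & lose \\ \hline
\end{tabular}
\end{center}
Counting over the table, there are six correct guesses and six incorrect
guesses, so each individual guess is indeed right half the time; but the
six incorrect guesses are packed into just two of the eight states.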
For larger numbers of players, the aim is
to ensure that most of the time no one
is wrong and occasionally everyone is wrong at once.
In the game with 7 players, there is a strategy for
which the group wins 7 out of every 8 times they play.
In the game with 15 players, the group can win 15 out of 16 times.
If you have not figured out these winning strategies for teams
of 7 and 15, I recommend thinking about the
solution to the three-player game in terms of the locations
of the winning and losing states on the three-dimensional hypercube,
then thinking laterally.
\begincuttable
If the number of players, $N$, is $2^r-1$,
the optimal strategy can be defined using a Hamming code of length $N$,
and the probability of winning the prize is $\linefrac{N}{(N+1)}$.
Each player
is identified with a number $n \in \{ 1,\ldots,N \}$.
The two colours
are mapped onto {\tt{0}} and {\tt{1}}. Any state of their hats
can be viewed as a received vector out of a binary channel.
A random binary vector of length $N$
is either a codeword of the Hamming code, with probability
$1/(N+1)$, or it differs
in exactly one bit from a codeword.
% There is a probability
Each player looks at all the other bits and considers whether his bit
can be set to a colour
such that the state is a codeword (which can be deduced
using the decoder
of the Hamming code). If it can, then
the player guesses that his hat is the {\em other\/} colour.
If the state is actually a codeword, all players will guess and
will guess wrong. If the state is a non-codeword, only
one player will guess, and his guess will be correct.
It's quite easy to train seven players to follow the optimal
strategy if the cyclic representation of the $(7,4)$ Hamming code
is used (\pref{sec.h74cyclic}).
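As a quick check of the counting for $N=7$: the $(7,4)$ Hamming code has
$2^4 = 16$ codewords, and each codeword has 7 neighbours that differ from it
in exactly one bit. Since the code has minimum distance 3, no
non-codeword is a neighbour of two different codewords, and
\[
 16 \times ( 1 + 7 ) = 128 = 2^7 ,
\]
so every one of the 128 hat states is either a codeword or adjacent to
exactly one codeword. The group loses only in the 16 codeword states,
giving a probability of winning of $1 - 16/128 = 7/8$.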
% I am not sure of the optimal solution for the `Scottish version'
% of the rules in which the prize is only awarded to the group
% if they {\em all\/} guess correctly.
% As a starting point, if one flips the guesses of the winning strategy
% for the original game, the group
% will win whenever it is in a codeword state, which
% happens with probability $1/(N+1)$. The question is
% what to do with the `passes'.
%% since passing is never in one's interests.
% Can the group do better than replacing passes with random guessing?
}
% \soln{ex.selforthog}{
% removed to cutsolutions.tex
% end from _linear.tex
\dvips
%\section{Solutions to Chapter \protect\ref{ch.linearecc}'s exercises} %
%\section{Solutions to Chapter \protect\ref{ch.linearecc}'s exercises} %
\dvipsb{solutions linear}
\dvips
\prechapter{About Chapter}
In this chapter we will draw together several ideas
that we've encountered so far in one nice short proof.
We will simultaneously prove both
Shannon's noisy-channel coding theorem (for
symmetric binary channels)
and his source coding theorem (for binary sources).
While this proof has connections to many preceding chapters
in the book, it's not essential to have read them all.
On the noisy-channel coding side,
our proof will be more constructive than the
proof given in \chref{ch.six}; there, we proved that
almost any random code is `very good'.
Here we will show that
almost any {\em linear\/} code is very good.
We will make use of the idea of typical sets (Chapters \ref{ch.two} and \ref{ch.six}),
and we'll borrow from the previous chapter's
calculation of the weight enumerator function of random linear codes (\secref{sec.wef.random}).
On the source coding side,
our proof will show that {\em random linear \ind{hash function}s} can be used
for compression of compressible binary sources, thus giving
a link to \chref{ch.hash}.
\ENDprechapter
\chapter{Very Good Linear Codes Exist}
\label{ch.lineartypical}
%
% very good linear codes exist
%
In this chapter we'll use a single calculation
to prove simultaneously
the \ind{source coding theorem} and the\index{noisy-channel coding theorem}
noisy-channel coding theorem for the \ind{binary symmetric channel}.\index{channel!binary symmetric}\index{noisy-channel coding theorem!linear codes}\index{linear block code!noisy-channel coding theorem}\index{error-correcting code!linear!noisy-channel coding theorem}
{Incidentally,
this proof works for much more general channel models,
not only the binary symmetric channel. For example,
the proof can be reworked for channels with
non-binary outputs, for time-varying channels
and for channels with memory, as long as they
have binary inputs satisfying a symmetry property,
\cf\ \secref{sec.Symmetricchannels}.}
%
\label{ch.linear.good}
\section{A simultaneous proof of the source coding and
noisy-channel coding theorems}
We consider a linear error-correcting code with binary \ind{parity-check
matrix} $\bH$. The matrix has $M$ rows and $N$ columns.
Later in the proof we will increase $N$ and $M$, keeping $M \propto N$.
The
rate of the code satisfies
\beq
R \geq 1 - \frac{M}{N}.
\eeq
If all the rows of $\bH$ are independent then this
is an equality, $R = 1 -M/N$. In what follows,\index{error-correcting code!rate}\index{error-correcting code!linear}
we'll assume the equality holds. Eager readers
may work out the expected rank of
a random binary matrix $\bH$ (it's very close to $M$)
and pursue the effect that the difference ($M - \mbox{rank}$) has
% small number of linear dependences have
on the rest of this proof (it's negligible).
A codeword $\bt$ is selected, satisfying
\beq
\bH \bt = {\bf 0} \mod 2 ,
\eeq
and a binary symmetric channel adds noise $\bx$, giving
the received signal\marginpar{\small\raggedright{In this chapter
$\bx$ denotes the noise added by the channel,
not the input to the channel.}}
\beq
\br = \bt + \bx \mod 2.
\eeq
The receiver aims to infer both $\bt$ and $\bx$ from
$\br$ using a \index{syndrome decoding}{syndrome-decoding} approach.
Syndrome decoding was first introduced in
\secref{sec.syndromedecoding} (\pref{sec.syndromedecoding} and \pageref{sec.syndromedecoding2}).
% and \secref{sec.syndromedecoding2}.
The receiver computes the syndrome
\beq
\bz = \bH \br \mod 2 = \bH \bt + \bH \bx \mod 2
= \bH \bx \mod 2 .
\eeq
% Since $\bH \bt = {\bf 0}$, t
The syndrome only depends on the noise $\bx$,
and the decoding problem is to find the most probable $\bx$ that
satisfies
\beq
\bH \bx = \bz \mod 2.
\eeq
This best estimate for the noise vector, $\hat{\bx}$, is then
subtracted from $\br$ to give the best guess for $\bt$.
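As a small worked illustration of these steps (the code and the noise
vector below are chosen here purely for concreteness, and play no part in
the proof), consider the repetition code with $N=3$ and $M=2$ defined by
\[
 \bH = \left[ \begin{array}{ccc}
 1 & 1 & 0 \\
 0 & 1 & 1
 \end{array} \right] ,
\]
whose codewords are $\bt = (0,0,0)$ and $\bt = (1,1,1)$.
If $\bt = (1,1,1)$ is sent and the noise is $\bx = (0,1,0)$, then
$\br = (1,0,1)$ and
\[
 \bz = \bH \br \mod 2 = (1,1) = \bH \bx \mod 2 .
\]
Exactly two noise vectors satisfy $\bH \bx' = \bz$, namely $(0,1,0)$ and
$(1,0,1)$; for $f < 1/2$ the first, having lower weight, is the more probable,
so $\hat{\bx} = (0,1,0)$ and the decoder recovers $\br - \hat{\bx} = (1,1,1)$.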
Our aim is to show that,
as long as $R < 1-H(X) = 1-H_2(f)$,
where $f$ is the flip probability of the binary symmetric channel,
the optimal decoder for this syndrome-decoding
problem has vanishing probability of error, as $N$ increases,
for random $\bH$.
% and averaging over all binary matrices $\bH$.
We prove this result by studying a sub-optimal
strategy for solving the decoding problem. Neither the optimal decoder
nor this {\em \ind{typical-set decoder}\/} would be easy to implement,
but the typical-set decoder is easier to \analyze.
The typical-set decoder examines the typical
set $T$ of noise vectors, the set of
noise vectors $\bx'$ that satisfy $\log \dfrac{1}{P(\bx')} \simeq
NH(X)$,\marginpar{\small\raggedright{We'll leave out the $\epsilon$s and $\beta$s that make
a typical-set definition rigorous. Enthusiasts are encouraged
to revisit \secref{sec.ts} and put these details into this proof.}}
checking to see if any of those typical vectors
$\bx'$ satisfies the observed syndrome,
\beq
\bH \bx' = \bz .
\eeq
If exactly one typical vector $\bx'$ does so, the typical-set
decoder reports that vector as the hypothesized
noise vector.
If no typical vector matches the observed syndrome,
or more than one does, then the typical-set
decoder reports an error.
The probability of error of the typical-set decoder, for
a given matrix $\bH$, can be written as a sum of two terms,
\beq
P_{{\rm TS}|\bH} = P^{(I)} + P^{(II)}_{{\rm TS}|\bH} ,
\eeq
where $P^{(I)}$ is the probability that the true noise
vector $\bx$ is itself not typical,
and $P^{(II)}_{{\rm TS}|\bH}$ is the probability
that the true $\bx$ is typical and at least one other typical vector
clashes with it.
The first probability vanishes as $N$ increases,
as we proved when we first studied typical sets (\chref{ch.two}).
We concentrate on the second probability.
% , the probability of a type-II error.
To recap, we're imagining a true noise vector, $\bx$;
and if {\em any\/} of the typical noise vectors
$\bx'$, different from $\bx$, satisfies $\bH (\bx' - \bx) = 0$,
then we have an error.
We use the truth function
\beq
\truth \! \left[ \bH (\bx' - \bx) = 0 \right],
\eeq
whose value is one if the statement $\bH (\bx' - \bx) = 0$ is true
and zero otherwise.
We can bound the number of type-II errors made when the noise is
$\bx$ thus:
\newcommand{\xprimecondition}{\raisebox{-4pt}{\footnotesize\ensuremath{\bx'}:}
\raisebox{-3pt}[0.025in][0.0in]{% prevent it from hanging down and pushing other stuff down
\makebox[0.2in][l]{\tiny$\!\begin{array}{l} {\tiny\bx' \!\in T}\\
{\tiny\bx' \! \neq \bx} \end{array}$}}}
\beq
\left[\mbox{Number of errors given $\bx$ and $\bH$}\right] \leq \sum_{\xprimecondition}
\truth\! \left[ \bH (\bx' - \bx) = 0 \right] .
\label{eq.lt.union}
\eeq
The number of errors is either zero or one; the sum on the
right-hand side may exceed one,\marginpar{\small\raggedright{\Eqref{eq.lt.union}
is a \ind{union bound}.}}
in cases where several typical noise
vectors have the same syndrome.
We can now write down the probability of a type-II error
by averaging over $\bx$:
\beq
P^{(II)}_{{\rm TS}|\bH} \leq \sum_{\bx \in T} P(\bx)
\sum_{\xprimecondition} \truth\! \left[ \bH (\bx' - \bx) = 0 \right] .
\eeq
Now, we will find the average of this probability of type-II error
over all linear codes by averaging over $\bH$.
By showing that the {\em average\/} probability of type-II error
vanishes, we will thus show that there exist linear
codes with vanishing error probability, indeed, that
almost all linear codes are very good.
We denote averaging over all binary matrices $\bH$ by $\left< \ldots \right>_{\bH}$.
The average probability of type-II error is
\beqan
\bar{P}^{(II)}_{{\rm TS}}
& =&
\sum_{\bH} P(\bH)
P^{(II)}_{{\rm TS}|\bH} \: = \:
\left< P^{(II)}_{{\rm TS}|\bH} \right>_{\bH}
\\
&\leq&
\left<
\sum_{\bx \in T} P(\bx)
\sum_{\xprimecondition} \truth\! \left[ \bH (\bx' - \bx) = 0 \right]
\right>_{\!\bH}
\\
&=&
\sum_{\bx \in T} P(\bx)
\sum_{\xprimecondition}
\left<
\truth\! \left[ \bH (\bx' - \bx) = 0 \right]
\right>_{\bH}
.
\eeqan
Now, the quantity
$\left<
\truth\! \left[ \bH (\bx' - \bx) = 0 \right]
\right>_{\bH}$ already cropped up
when we
were calculating the
expected weight enumerator function of random linear codes (\secref{sec.wef.random}):
for any non-zero binary vector $\bv$, the probability that $\bH \bv =0$,
averaging over all matrices $\bH$, is $2^{-M}$.
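To spell out where the $2^{-M}$ comes from (assuming, as in
\secref{sec.wef.random}, that all binary matrices $\bH$ are equally probable,
so that the entries of $\bH$ are independent and uniform):
a non-zero $\bv$ has at least one bit equal to 1, and flipping the
corresponding entry of any row ${\bf h}_m$ of $\bH$ flips the value of
${\bf h}_m \cdot \bv \mod 2$. Each row therefore satisfies
${\bf h}_m \cdot \bv = 0$ with probability $1/2$, independently of the
other rows, and
\[
 \left< \truth\! \left[ \bH \bv = 0 \right] \right>_{\bH}
 = \prod_{m=1}^{M} P( {\bf h}_m \cdot \bv = 0 )
 = \left( \frac{1}{2} \right)^{\! M} = 2^{-M} .
\]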
So
\beqan
\bar{P}^{(II)}_{{\rm TS}}
& \leq &
\left( \sum_{\bx \in T} P(\bx) \right)
\left( |T| - 1 \right)
2^{-M}\\
& \leq &
|T| \: 2^{-M}
,
\eeqan
where $|T|$ denotes the size of the typical set.
As you will recall from \chref{ch.two}, there are roughly
$2^{NH(X)}$ noise vectors in the typical set.
So
\beqan
\bar{P}^{(II)}_{{\rm TS}}
& \leq &
2^{NH(X)} 2^{-M}
.
\eeqan
This bound on the probability of error either vanishes
or grows exponentially as $N$
increases (remembering that
% , as we are fixing the code rate
% $R = 1-M/N$,
we are keeping $M$ proportional to $N$ as $N$ increases).
It vanishes if
\beq
H(X) < M/N .
\eeq
% this clause is cuttable
% CUT ME?
% and grows if
%\beq
% NH(X) > M .
%\eeq
% end CUT ME
Substituting $R=1-M/N$,
we have thus established the
% positive half of Shannon's
noisy-channel coding theorem for the binary symmetric channel:
very good linear codes exist
%as long as
%$H(X) < M/N$, \ie, as long as
for any rate $R$ satisfying
\beq
R < 1-H(X) ,
\eeq
where $H(X)$ is the entropy of the channel
noise, per bit.\ENDproof
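To get a feel for the numbers, here is a rough illustration
(the values of $f$, $M/N$ and $N$ are picked arbitrarily, and the
$\epsilon$s and $\beta$s are again ignored).
For $f=0.1$ we have $H(X) = H_2(0.1) \simeq 0.47$ bits.
If we choose $M/N = 0.5$, so that $R = 0.5 < 1 - H(X) \simeq 0.53$,
the bound becomes
\[
 \bar{P}^{(II)}_{{\rm TS}} \leq 2^{-N ( M/N - H(X) )} \simeq 2^{-0.03 N} ,
\]
which is about $2^{-30} \simeq 10^{-9}$ at $N=1000$.
With $M/N = 0.4$, in contrast, the exponent is positive and the bound
tells us nothing.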
\exercisxC{3}{ex.generalchannel}{
Redo the proof for a more general channel.
}
\section{Data compression by linear hash codes}
The decoding game we have just played can also\index{random code!for compression}
be viewed as an {\dem\ind{uncompression}\/} game.\index{hash code}
The world produces a binary noise vector $\bx$
from a source $P(\bx)$. The noise has redundancy (if the flip probability is not 0.5). We
compress it with a linear compressor
that maps the $N$-bit input $\bx$ (the noise) to
the $M$-bit output $\bz$ (the syndrome).\index{hash function!linear}\index{hash code}
Our uncompression task is to recover the
input $\bx$ from the output $\bz$.
The rate of the compressor is
\beq
R_{\rm compressor} \equiv M/N .
\eeq
[We don't care about the possibility of linear redundancies
in our definition of the rate, here.]
The result that we just found, that
the decoding problem can be solved, for
almost any $\bH$, with vanishing error probability,
as long as $H(X) < M/N$, thus instantly
proves a \ind{source coding theorem}:
\begin{quote}
Given a binary source $X$ of entropy $H(X)$, and
a required compressed rate $R > H(X)$, there exists a
linear compressor $\bx \rightarrow \bz = \bH \bx \mod 2$
having rate $M/N$ equal to that required rate $R$,
and an associated uncompressor,
that is virtually lossless.
\end{quote}
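As a rough illustration (with numbers chosen arbitrarily, and again ignoring
the $\epsilon$s and $\beta$s): a sparse binary source with $P(x_n = 1) = 0.01$ has
\[
 H(X) = H_2(0.01) = 0.01 \log_2 \frac{1}{0.01} + 0.99 \log_2 \frac{1}{0.99}
 \simeq 0.081 \mbox{ bits per symbol} .
\]
A block of $N = 10\,000$ source bits can then be hashed by a random
$M \times N$ binary matrix down to $M = 1000$ bits, a rate of
$M/N = 0.1 > 0.081$, and the bound on the average probability of type-II error
found above is roughly $2^{NH(X) - M} \simeq 2^{808 - 1000} = 2^{-192}$.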
% To put it another way, if you have a source of
% entropy $H(X)$ and you encode a string of
% $N$ bits from it using a \ind{hash code} (\chref{ch.hash})
% where the hash $\bz$ is of length $M$ bits,
% where $M > N H(X)$,
% a random linear hash function $\bz = \bH \bx \mod 2$
% is just as good (for collision avoidance) as a
% fully random hash function.
%% there are very unlikely to be any collisions among
%% the hashes
{This theorem is true
not only for a source of independent identically distributed
symbols but also for any source for which a typical set can be defined:
sources with memory, and time-varying sources, for example; all that's
required is that the source be ergodic.
}
\subsection*{Notes}
This method for proving that codes are
good can be applied to
other linear codes,
such as low-density parity-check codes
\cite{mncN,McElieceMacKay00}.
For each code we need an approximation of its expected weight
enumerator function.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%55
%
\dvips
% \chapter{Further exercises on information theory}
\chapter{Further Exercises on Information Theory}
% this was two chapters once
\label{ch_fInfo}
% {noisy channels}
\label{ch_f8}
\fakesection{Further exercises on noisy channels}
% I've been asked to include some exercises {\em without\/} worked
% solutions. Here are a few. Numerical solutions to some of them
% are provided on page \pageref{sec.solf8}.
%
The most exciting exercises, which will introduce you
to further ideas in information theory,
are towards the end of this chapter.
%\section{Exercises}
\subsection*{Refresher exercises on source coding and noisy channels}
\exercisaxB{2}{ex.X100}{
% from Yaser
Let $X$ be an ensemble with $\A_X = \{0,1\}$ and $\P_X = \{ 0.995,
0.005\}$. Consider source coding
using the block coding of $X^{100}$ where every $\bx
\in X^{100}$ containing 3 or fewer 1s is assigned a distinct
codeword, while the other $\bx$s are ignored.
\ben
\item
If the assigned codewords are all of the same length, find the minimum length
required to provide the above set with distinct codewords.
\item
Calculate the probability of getting an $\bx$ that will be ignored.
\een
}
\exercisaxB{2}{ex.0001}{
Let $X$ be an ensemble with $\P_X = \{ 0.1,0.2,0.3,0.4 \}$.
The ensemble is encoded using the symbol
code $\C = \{ 0001 , 001 , 01 , 1 \}$.
Consider the codeword corresponding to $\bx \in X^N$, where
$N$ is large.
\ben
\item
Compute the entropy of the fourth bit of transmission.
\item
Compute the conditional entropy of the fourth bit given
the third bit.
\item
Estimate the entropy of the hundredth bit.
\item
Estimate the conditional entropy of the hundredth bit given the
ninety-ninth bit.
% \item
\een
}
\exercisaxA{2}{ex.dicetree}{
Two fair dice are rolled by Alice and the sum is recorded.
Bob's task is to ask a sequence of questions with yes/no answers to
find out this number.
Devise in detail a strategy that achieves the minimum possible
average number of questions.
}
% added Wed 22/1/03
\exercisxB{2}{ex.fairstraws}{
How can you use a coin to \ind{draw straws} among 3 people?\index{straws, drawing}
}% my solution: arithmetic coding.
% perhaps use this in exam?
% - could also use exact sampling method! (see mcexact.tex)
\exercisxB{2}{ex.magicnumber}{
In a {magic} trick,\index{puzzle!magic trick}
there are three participants: the \ind{magician}, an assistant, and a volunteer.
The assistant, who
claims to have \ind{paranormal}\index{conjuror}\index{puzzle!magic trick}
abilities, is in a soundproof room.
The magician gives the volunteer six blank cards, five white and one blue.
The volunteer writes a different integer from 1 to 100
on each \ind{card}, as the magician is watching.
The volunteer keeps the blue card.
The magician arranges the five white cards in some order and passes them to the assistant.
The assistant then announces the number on the blue card.
How does the trick work?
}
% card trick
\exercisxB{3}{ex.magicnumber2}{
How does {\em this\/} trick work?
\begin{quote}
`Here's an ordinary pack of cards, shuffled into random
order. Please choose five cards from the pack, any that you wish. Don't
let me see their faces. No, don't give them to me: pass them to my
assistant Esmerelda. She can look at them.
`Now, Esmerelda, show me four of the cards. Hmm$\ldots$ nine of spades, six of
clubs, four of hearts, ten of diamonds. The hidden card, then, must be the
queen of spades!'
\end{quote}
The trick can be performed as described above\index{puzzle!magic trick}
for a pack of 52 cards. Use information theory
to give an upper bound
on the number of cards for which the trick can be performed.
% (This exercise is much harder than \exerciseonlyref{ex.magicnumber}.)
% Hint: think of X = the 5 cardds, Y = the seque of 4 cards. how does H(X) compare with H(Y)?
% n choose 5 cf. n....(n-3) -> (n-4)/5! = 1 -> n=124.
}
% see l/iam for soln
\exercisxB{2}{ex.Hinfty}{
Find a probability sequence $\bp = (p_1,p_2, \ldots)$ such that
$H(\bp) = \infty$.
}
\exercisaxB{2}{ex.typical2488}{
Consider a discrete memoryless source with $\A_X = \{a,b,c,d\}$
and $\P_X =$ $\{1/2,1/4,$ $1/8,1/8\}$. There are $4^8 = 65\,536$ eight-letter
words that can be formed from the four letters. Find the total number
of such words that are in the typical set $T_{N\beta}$ (equation \ref{eq.TNb})
where $N=8$ and $\beta = 0.1$.
%The definition of $T_{N\b}$, from
% chapter \chtwo, is:% equation \ref{eq.TNb}
%\beq
% T_{N\b} = \left\{ \bx\in\A_X^N :
% \left| \frac{1}{N} \log_2 \frac{1}{P(\bx)} - H \right| < \b
% \right\} .
%\eeq
}
% source coding and channels...........
\exercisxB{2}{ex.sourcechannel}{
Consider the source
$\A_S = \{ a,b,c,d,e\}$,
$\P_S = \{ \dthird, \dthird, \dfrac{1}{9}, \dfrac{1}{9}, \dfrac{1}{9} \}$ and the
channel whose transition probability matrix is
\beq
Q =
\left[
\begin{array}{cccc}
1 & 0 & 0 & 0 \\
0 & 0 & \dfrac{2}{3} & 0 \\
0 & 1 & 0 & 1 \\
0 & 0 & \dthird & 0 \\
% 1 & 0 & 0 & 0 \\
% 0 & 0 & 1 & 0 \\
% 0 & \dfrac{2}{3} & 0 & \dthird \\
% 0 & 0 & 1 & 0 \\
\end{array}\right] .
\eeq
Note that the source alphabet
% $\A_S = \{a,b,c,d,e\}$
has five symbols, but the channel
alphabet $\A_X = \A_Y = \{0,1,2,3\}$
has only four. Assume that the source produces symbols at
exactly 3/4 the rate that the channel accepts channel symbols. For a
given (tiny) $\epsilon>0$, explain how you would design a system for
communicating the source's output over the channel with an
% overall
average error probability per source symbol
less than $\epsilon$. Be as explicit as possible.
In particular, {\em do not\/} invoke Shannon's noisy-channel coding theorem.
}
% \subsection{Noisy Channels}
\exercisxB{2}{ex.C0000}{Consider a binary symmetric channel and a code
$C = \{ 0000,0011,1100,1111 \}$; assume that the
four codewords are used with probabilities
$\{ 1/2, 1/8,1/8,1/4\}$.
What is the decoding rule that minimizes the probability of
decoding error? [The optimal decoding rule depends on
the noise level $f$ of the binary symmetric channel. Give
the decoding rule for each range of values of $f$, for $f$ between 0 and
$1/2$.]
}
\exercisaxA{2}{ex.C3channel}{
Find the capacity and \optens\
% optimizing input distribution
for the three-input, three-output
channel whose transition probabilities are:
\beq
Q = \left[
\begin{array}{ccc}
1 & 0 & 0 \\
0 & \dfrac{2}{3} & \dthird \\
0 & \dthird & \dfrac{2}{3}
\end{array}\right] .
\eeq
}
%
% I am not sure I like this ex:
%
%\exercis{ex.Herrors}{
% Consider the $(7,4)$ Hamming code.
%\ben\item
% What is the probability of bit error if 3 channel errors occur
% in a single block?
%\item
% What is the probability of bit error if 4 channel errors occur
% in a single block?
%\een
%}
% \end{document}
% see also _e6.tex
%
% extra exercises do-able after chapter 6.
%
\fakesection{e6 exam qs}
\exercissxA{3}{ex.85channel}{
% Describe briefly the encoder for a $(7,4)$ Hamming code.
%
% Assuming that one codeword of this code is sent over a
% binary symmetric channel, define the {\em syndrome\/} $\bf z$
% of the received vector $\bf r$; state how many different possible syndromes
% there are; and state
% the maximum number of channel errors that the optimal decoder
%% code
% can correct.
%
% Define the {\em capacity\/} of a channel with input $x$ and output $y$
% and transition probability matrix $Q(y|x)$.
%
The input to a channel $Q$ is a word of 8 bits. The output is also
a word of 8 bits.
% A message block consisting of 8 bits is transmitted over a channel which
Each time it is used, the channel
flips {\em exactly one\/} of the transmitted bits, but
the receiver does not know which one. The other
seven bits are received without error. All 8 bits are equally likely to
be the one that is flipped. Derive the capacity
of this channel.
% Tough version:
%
% {\bf Either} show, by constructing an explicit encoder and decoder using a
% linear (8,5) code that it
% is possible to reliably communicate 5 bits per cycle
% over this channel, {\bf or} prove that no such linear (8,5) code exists.
%
% Wimps version:
% practical
Show, by describing an {\em explicit\/} encoder
% {\em and\/}
and
decoder that it
is possible {\em reliably\/} (that is, with
{\em zero\/} error probability) to communicate 5 bits per cycle
over this channel.
% Your description should be
% {\em should I give a hint here?}
% [Hint: a solution exists that involves a simple $(8,5)$ code.]
}
\exercisxB{2}{ex.rstu}{
A channel with input $x \in \{ {\tt a},{\tt b},{\tt c} \}$
and output $y \in \{ {\tt r},{\tt s},{\tt t} ,{\tt u} \}$
has conditional probability matrix:
\[
\bQ = \left[
\begin{array}{ccc}
\dhalf & 0 & 0 \\
\dhalf & \dhalf & 0 \\
0 & \dhalf & \dhalf \\
0 & 0 & \dhalf \\
\end{array}
\right] .
\hspace{1in}
\begin{array}{c}
\setlength{\unitlength}{0.13mm}
\begin{picture}(100,140)(0,-20)
\put(18,0){\makebox(0,0)[r]{\tt c}}
\put(18,40){\makebox(0,0)[r]{\tt b}}
\put(18,80){\makebox(0,0)[r]{\tt a}}
%
\multiput(20,0)(0,40){3}{\vector(2,1){36}}
\multiput(20,0)(0,40){3}{\vector(2,-1){36}}
%
\put(62,-20){\makebox(0,0)[l]{\tt u}}
\put(62,20){\makebox(0,0)[l]{\tt t}}
\put(62,60){\makebox(0,0)[l]{\tt s}}
\put(62,100){\makebox(0,0)[l]{\tt r}}
\end{picture}
\end{array}
\]
What is its capacity?
}
\exercisxB{3}{ex.isbn}{
The ten-digit number on the cover of a book known as the\index{book ISBN}
\ind{ISBN}\amargintab{t}{
\begin{center}
\begin{tabular}{l}
0-521-64298-1 \\
1-010-00000-4 \\
\end{tabular}
\end{center}
\caption[a]{Some valid ISBNs.
[The hyphens
are included for legibility.]
}
}
incorporates an error-detecting code.
The number consists of nine source digits $x_1,x_2,\ldots,x_{9}$,
satisfying $x_n \in \{ 0,1,\ldots,9 \}$, and a tenth check
digit whose value is given by
\[
x_{10} = \left( \sum_{n=1}^{9} n x_n \right) \mod 11 .
\]
Here $x_{10} \in \{ 0,1,\ldots,9 , 10 \}.$ If $x_{10} = 10$ then
the tenth digit is shown using the roman numeral X.
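 As a quick check of this rule (a worked illustration using
 the first ISBN in the margin table, 0-521-64298-1), the nine source digits
 are $0,5,2,1,6,4,2,9,8$, so
\[
 \sum_{n=1}^{9} n x_n = 0 + 10 + 6 + 4 + 30 + 24 + 14 + 72 + 72 = 232 ,
 \hspace{0.3in}
 232 \mod 11 = 1 ,
\]
 since $232 = 21 \times 11 + 1$. Thus $x_{10} = 1$, in agreement
 with the printed check digit.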
% $\tt X$.
% For example, 1-010-00000-4 is a valid ISBN.
% bishop
% 0-19-853864-2
% see lewis:con/isbn.p
Show that a valid ISBN satisfies:
\[
\left( \sum_{n=1}^{10} n x_n \right) \mod 11 = 0 .
\]
Imagine that an ISBN is communicated over an unreliable human
channel which sometimes {\em modifies\/} digits and sometimes
{\em reorders\/} digits.
Show that this code can be used to detect (but not correct)
all errors in which
any one of the ten digits is modified (for example,
1-010-00000-4 $\rightarrow$ 1-010-00080-4).
Show that this code can be used to detect all errors in which
any two adjacent digits are transposed (for example,
1-010-00000-4 $\rightarrow$ 1-100-00000-4).
What other transpositions of pairs of {\em non-adjacent\/}
digits can be detected?
% What types of error can be detected {\em and corrected?}
If the tenth digit were defined
to be
\[
x_{10} = \left( \sum_{n=1}^{9} n x_n \right) \mod 10 ,
\]
why would the code not work so well? (Discuss the detection of
% errors
% involving
both modifications of single digits and transpositions
of digits.)
}
\exercisaxA{3}{ex.two.bsc.choose}{
A\marginpar{\[
\setlength{\unitlength}{0.17mm}
\begin{picture}(100,140)(0,-45)
\put(15,-40){\makebox(0,0)[r]{d}}
\put(15,0){\makebox(0,0)[r]{{c}}}
\put(15,40){\makebox(0,0)[r]{b}}
\put(15,80){\makebox(0,0)[r]{a}}
\put(20,0){\vector(1,0){34}}
\put(20,40){\vector(1,0){34}}
\put(20,-40){\vector(1,0){34}}
\put(20,80){\vector(1,0){34}}
\put(20,40){\vector(1,1){34}}
% \put(20,40){\vector(1,-1){34}}
\put(20,-40){\vector(1,1){34}}
\put(20,0){\vector(1,-1){34}}
% \put(20,0){\vector(1,1){34}}
\put(20,80){\vector(1,-1){34}}
%
\put(65,-40){\makebox(0,0)[l]{d}}
\put(65,0){\makebox(0,0)[l]{c}}
\put(65,40){\makebox(0,0)[l]{b}}
\put(65,80){\makebox(0,0)[l]{a}}
\end{picture}
\]
}
channel with input $x$ and output $y$ has transition probability matrix:
\[
Q = \left[
\begin{array}{cccc}
1-f & f & 0 & 0 \\
f & 1-f & 0 & 0 \\
0 & 0 & 1-g & g \\
0 & 0 & g & 1-g
\end{array}
\right] .
\]
Assuming an input distribution of the form
\[
{\cal P}_X
= \left\{ \frac{p}{2}, \frac{p}{2} , \frac{1-p}{2} , \frac{1-p}{2} \right\},
\]
write down the entropy of the output, $H(Y)$, and the
conditional entropy of the output given the input, $H(Y|X)$.
Show that the optimal input distribution
is given by
\[
% corrected!
p = \frac{1}{1 + 2^{-H_2(g) + H_2(f) }} ,
\]
where $H_2(f) = f \log_2 \frac{1}{f} +
(1-f) \log_2 \frac{1}{(1-f)}$.
% CUTTABLE
% [You may find the identity
% $\frac{\d}{\d p} H_2(p) = \log_2 \frac{1-p}{p}$ helpful.]
\marginpar{\small\raggedright{Remember
$\frac{\d}{\d p} H_2(p) = \log_2 \frac{1-p}{p}$.}}
Write down the optimal input distribution and
the capacity of the channel in the case $f=1/2$, $g=0$,
and comment on your answer.
}
\exercisxB{2}{ex.detect.vs.correct}{
What are the differences in the redundancies needed
in an error-detecting code (which can reliably
detect that a block of data has been corrupted)
and an error-correcting code (which can detect and
correct errors)?
}
% difficult exercises see _e7
% \input{tex/_fInfo.tex}
% included directly by thebook.tex after _f8.tex
\subsection{Further tales from information theory}
 The following exercises give you the chance to
 discover for yourself some more of the surprising
 results of information theory.
% \subsection{Further tales from information theory}
% \input{tex/_e7.tex}
% \noindent
\ExercisxC{3}{ex.corrinfo}{
% \item[Communication of correlated information.]
{\sf Communication of information from correlated
% dependent <--- would be better, but I want to keep same name for exercise as in first edn.
sources.}\index{channel!with dependent sources}
Imagine that we want to communicate data from
two data sources $X^{(A)}$ and $X^{(B)}$ to a central
location C via noise-free one-way \index{communication!of dependent information}{communication} channels (\figref{fig.achievableXY}a).
The signals $x^{(A)}$ and $x^{(B)}$ are strongly
dependent, so their joint information
content is only a little greater than the marginal information
content of either of them.
For example,
C is a \ind{weather collator} who wishes to receive a string of
reports saying
whether it is raining in Allerton ($x^{(A)}$)
and whether it is raining in Bognor ($x^{(B)}$).
The joint probability of $x^{(A)}$ and $x^{(B)}$ might be
\beq
\fourfourtabler{{$P(x^{(A)},x^{(B)})$}}{$x^{(A)}$}{{\mathsstrut}0}{{\mathsstrut}1}{{\mathsstrut}$x^{(B)}$}{0.49}{0.01}{0.01}{0.49}
%\fourfourtable{\makebox[0.2in][r]{$P(x^{(A)},x^{(B)})$}}{$x^{(A)}$}{{\mathsstrut}0}{{\mathsstrut}1}{{\mathsstrut}$x^{(B)}$}{0.49}{0.01}{0.01}{0.49}
%\:\:
%\begin{array}{c|cc}
%x^{(A)} :x^{(B)} & 0 & 1 \\ \hline
%0 & 0.49 & 0.01 \\
%1 & 0.01 & 0.49 \\
%\end{array}
\eeq
The weather collator would like to know $N$ successive
values of $x^{(A)}$ and $x^{(B)}$
exactly, but, since he has to pay for every bit
of information he receives,
he is interested in the possibility of avoiding buying
$N$ bits from source $A$
{\em and\/} $N$ bits from source $B$.
Assuming that variables $x^{(A)}$ and $x^{(B)}$ are generated
repeatedly from this distribution, can they be encoded at rates $R_A$
and $R_B$
in such a way that C can reconstruct all the variables, with the
sum of information transmission rates on the two lines being less than two
bits per cycle?
% For simplicity, assume that the
% one-way communication channels are noise-free binary channels.
% Encoding of correlated sources. Slepian Wolf (409)
\begin{figure}
\figuremargin{%
\begin{center}\small
\begin{tabular}{cc}
\raisebox{0.71in}{(a)\hspace{0.2in}{\input{tex/corrinfo.tex}}} &
\mbox{(b)\footnotesize
\setlength{\unitlength}{0.075in}
\begin{picture}(28,21)(-7.5,-1)
\put(0.3,0){\makebox(0,0)[bl]{\psfig{figure=figs/achievableXY.eps,width=1.5in}}}
\put(0,6.5){\makebox(0,0)[r]{\footnotesize$H(X^{(B)} \given X^{(A)})$}}
\put(0,14){\makebox(0,0)[r]{\footnotesize$H(X^{(B)})$}}
\put(0,17.5){\makebox(0,0)[r]{\footnotesize$H(X^{(A)},X^{(B)})$}}
\put(0,20){\makebox(0,0)[r]{\footnotesize$R_B$}}
%
\put(20,-0.27){\makebox(0,0)[t]{\footnotesize$R_A$}}
\put(2.5,-0.5){\makebox(0,0)[t]{\footnotesize$H(X^{(A)} \given X^{(B)})$}}
\put(12,-0.5){\makebox(0,0)[t]{\footnotesize$H(X^{(A)})$}}
%\put(15,-0.5){\makebox(0,0)[t]{\footnotesize$H(X^{(A)},X^{(B)})$}}
\end{picture}
}\\
\end{tabular}
\end{center}
}{%
\caption[a]{Communication of
% correlated
information from dependent sources.
(a)
% The communication situation:
$x^{(A)}$ and $x^{(B)}$ are dependent
sources (the dependence is represented by the dotted arrow).
Strings of values of each variable are encoded using
codes of rate $R_A$ and $R_B$ into transmissions
$\bt^{(A)}$ and $\bt^{(B)}$, which are communicated
over noise-free channels to a receiver $C$.
(b) The achievable rate region.
Both strings can be conveyed
without error even though $R_A < H(X^{(A)})$ and
$R_B < H(X^{(B)})$.
}
%
% this copy is all ready to work on......
%
% cp achievableXY.fig achievableXYAB.fig
\label{fig.achievableXY}
}%
\end{figure}
The answer, which you should demonstrate,\index{dependent sources}\index{correlated sources}
%\index{Slepian--Wolf|see{dependent sources}}
is indicated in \figref{fig.achievableXY}.
In the general
case of two dependent sources $X^{(A)}$ and $X^{(B)}$, there exist codes for
the two transmitters that can achieve reliable communication
of both $X^{(A)}$ and $X^{(B)}$ to C, as long as: the information rate from
$X^{(A)}$, $R_A$, exceeds $H(X^{(A)} \given X^{(B)})$; the information rate from
$X^{(B)}$, $R_B$, exceeds $H(X^{(B)} \given X^{(A)})$; and the total information rate
$R_A+R_B$ exceeds the joint entropy $H(X^{(A)},X^{(B)})$ \cite{SlepianWolf}.
% In the general
% case of two correlated sources $X$ and $Y$, there exist codes for
% the two transmitters that can achieve reliable communication
% of both $X$ and $Y$ to C, as long as: the information rate from
% $X$, $R(X)$, exceeds $H(X \given Y)$; the information rate from
% $Y$, $R(Y)$, exceeds $H(Y \given X)$; and the total information rate
% $R(X)+R(Y)$ exceeds the joint information $H(X,Y)$.
So in the case of $x^{(A)}$ and $x^{(B)}$ above, each transmitter must transmit
at a rate greater than $H_2(0.02) = 0.14$ bits, and the total
rate $R_A+R_B$ must be greater than 1.14 bits, for example $R_A=0.6$, $R_B=0.6$.
There exist codes that can achieve these rates. Your task is to
figure out why this is so.
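 The numbers quoted above can be checked directly from the joint
 distribution. Each source on its own is uniform, so
 $H(X^{(A)}) = H(X^{(B)}) = 1$ bit; and, given $x^{(A)}$, the
 value of $x^{(B)}$ differs from it with probability $0.01/0.50 = 0.02$, so
\[
 H(X^{(B)} \given X^{(A)}) = H_2(0.02)
 = 0.02 \log_2 \frac{1}{0.02} + 0.98 \log_2 \frac{1}{0.98}
 \simeq 0.113 + 0.029 \simeq 0.14 \mbox{ bits} ,
\]
\[
 H(X^{(A)},X^{(B)}) = H(X^{(A)}) + H(X^{(B)} \given X^{(A)})
 \simeq 1 + 0.14 = 1.14 \mbox{ bits} .
\]
 By symmetry, $H(X^{(A)} \given X^{(B)})$ takes the same value, $H_2(0.02)$.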
Try to find an explicit solution in which one of the sources
is sent as plain text, $\bt^{(B)} = \bx^{(B)}$, and the other is
encoded.
}
% \end{description}
%\noindent
\ExercisxC{3}{ex.multaccess}{
{\sf \index{multiple access channel}Multiple
access channels}.\index{channel!multiple access}
Consider a channel with two sets of
inputs and one output --
for example, a shared telephone line (\figref{fig.achievableAB}a).
A simple model system has two binary inputs $x^{(A)}$ and $x^{(B)}$ and a ternary output $y$
 equal to the arithmetic sum of the two inputs, that is, 0, 1 or 2.
There is no noise. Users $A$ and $B$ cannot communicate with each other, and they
cannot hear the output of the channel.
If the output is a 0, the receiver can be certain that both inputs
were set to 0;
and if the output is a 2, the receiver can be certain that both inputs
were set to 1. But if the output is 1, then it could be that the input
state was $(0,1)$ or $(1,0)$.
How should users $A$ and $B$ use this channel so that their messages
can be deduced from the received signals? How fast can $A$
and $B$ communicate?
Clearly the total information rate from $A$ and $B$
to the receiver cannot be two bits. On the other hand it is easy to achieve
a total information rate $R_A + R_B$ of one bit. Can reliable communication
be achieved at rates $(R_A,R_B)$ such that $R_A + R_B> 1$?
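 To see why two bits is out of reach, note that $y$ takes only three
 values, so each use of the channel conveys at most $\log_2 3 \simeq 1.58$ bits.
 As an illustrative calculation (the input distribution here is an
 assumption chosen for the example, not part of the problem statement),
 suppose the two users independently send {\tt 0}s and {\tt 1}s with
 equal probability. Then
\[
 P(y = 0) = \dfrac{1}{4}, \:\:\: P(y = 1) = \dhalf, \:\:\: P(y = 2) = \dfrac{1}{4},
 \hspace{0.3in}
 H(Y) = \dfrac{1}{4} \log_2 4 + \dhalf \log_2 2 + \dfrac{1}{4} \log_2 4
 = 1.5 \mbox{ bits} .
\]
 Because the channel is noiseless, $H(Y)$ is the information that the output
 carries about the pair of inputs. This calculation does not by itself
 show that a total rate above one bit is achievable by two users who cannot
 coordinate.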
\begin{figure}
\figuremargin{%
\begin{center}
\begin{tabular}{l}
(a) \hspace{0.1in}{\input{tex/multacc.tex}} \\[0.1in]
(b)\hspace{0.2in}\fourfourtabler{$y$}{$x^{(A)}$}{{\mathsstrut}$\:0\:$}{{\mathsstrut}$\:1\:$}{{\mathsstrut}$x^{(B)}$}{0}{1}{1}{2}\hspace{0.5492in}
%(c)\raisebox{-0.425in}{\psfig{figure=figs/achievableAB.eps,angle=-90,width=2in}}
(c)\raisebox{-0.25in}{\mbox{\epsfbox{metapost/channels.1}}}
\end{tabular}
\end{center}
}{%
\caption[a]{Multiple access channels.
(a) A general multiple access channel with two transmitters and one receiver.
(b) A binary multiple access channel with output
% given by adding the
equal to the sum of
two inputs.
(c) The achievable region. }
\label{fig.achievableAB}
}%
\end{figure}
The answer is indicated in \figref{fig.achievableAB}.
% There exist codes for
% the two transmitters such that the rates $(R(A),R(B))$ can be
% any point in the convex hull of
% $\{(1,0),$ $(1,.5),$ $(.5,1),$ (0,1), $(0,0)\}$.
Some practical codes for multi-user channels are presented in \citeasnoun{RatzerMacKay2003}.
}
%
% answer anything in the convex hull of 1,0, 1,.5 .5,1 0,1, 0,0
%
\ExercisxC{3}{ex.broadcast}{
{\sf \index{broadcast channel}Broadcast channels}\index{channel!broadcast}.
A broadcast channel consists of a single transmitter and
two or more receivers. The properties of the
channel are defined by a conditional distribution
$Q(y^{(A)},y^{(B)} \given x)$. (We'll assume the channel is memoryless.)
%\begin{figure}
%\figuremargin{%
\amarginfig{t}{
\begin{center}\footnotesize\small
\raisebox{0in}{%(a)\hspace{0.2in}
{\input{tex/broadcast.tex}}}
%\hspace{0.4in}
%(b)
% \mbox{\psfig{figure=figs/achievableXY.eps,angle=-90,width=2in}}
\end{center}
%}{%
\caption[a]{The broadcast channel. $x$ is
the channel input; $y^{(A)}$ and $y^{(B)}$ are the outputs.
% (b) The achievable rate region.
}
%
% this copy is all ready to work on......
%
% cp achievableXY.fig achievableXYAB.fig
\label{fig.achievableBroadcast}
}%
%\end{figure}
The task is to add an encoder and two decoders to enable
reliable communication\index{communication!broadcast} of
a common message at rate $R_0$ to both receivers,
an individual message at rate $R_A$ to receiver $A$,
and an individual message at rate $R_B$ to receiver $B$.
The {\dem{capacity}} region of the broadcast channel
is the convex hull of the set of achievable rate triplets $(R_0,R_A,R_B)$.
A simple benchmark for such a channel is given by
time-sharing
%
% had to move the figure down a bit to avoid clash
% it was here
%
(\ind{time-division} signaling). If the capacities of the
two channels, considered separately, are $C^{(A)}$ and
$C^{(B)}$, then by devoting a fraction $\phi_A$ of
the transmission
 time to channel $A$ and $\phi_B\eq 1\!-\!\phi_A$ to channel $B$, we can achieve
$(R_0,R_A,R_B) = (0,\phi_A C^{(A)},\phi_B C^{(B)})$.
\amarginfig{t}{
\begin{center}\footnotesize\small
\setlength{\unitlength}{0.03975in}
\begin{picture}(28,21)(-7.5,-2.91)
\put(0.3,0){\vector(1,0){20}}
\put(0.3,0){\vector(0,1){20}}
\put(0.3,15){\line(1,-1){15}}
\put(0,15){\makebox(0,0)[r]{\footnotesize$C^{(B)}$}}
\put(-0.40,20){\makebox(0,0)[r]{\footnotesize$R_B$}}
%
\put(22,-1.5){\makebox(0,0)[t]{\footnotesize$R_A$}}
\put(15,-1.3){\makebox(0,0)[t]{\footnotesize$C^{(A)}$}}
\end{picture}
\end{center}
%}{%
\caption[a]{Rates achievable by simple time-sharing.}
%
% this copy is all ready to work on......
%
% cp achievableXY.fig achievableXYAB.fig
\label{fig.timesharing}
}%
We can do better than this, however.
% To borrow an analogy from Cover and Thomas.
As an analogy, imagine speaking simultaneously to an American
and a \ind{Belarusian};
%\ind{Golgafrinchan} \ind{telephone sanitizer};
you are fluent in \ind{American}
and in \ind{Belarusian}, but
% , needless to say,
neither
of your two receivers understands the
other's language. If each receiver can distinguish
whether a word is in their own language or not,
then an extra binary file can be conveyed to both recipients
by using its bits to decide whether the next transmitted
word should be from the American source text or from the
% \ind{Golgafrinchan}
\ind{Belarusian} source text. Each recipient can concatenate
the words that they understand in order to receive their personal
message, and can also recover the binary string.
An example of a broadcast channel consists of two
binary symmetric channels with a common input. The two halves
of the channel
have flip probabilities
$f_A$ and $f_B$. We'll assume that $A$ has the better
half-channel, \ie, $f_A < f_B < \dhalf$.
[A closely related channel is a
`degraded' broadcast channel,
in which the conditional probabilities are such that
the random variables have the structure of a Markov chain,
\beq
x \rightarrow y^{(A)} \rightarrow y^{(B)},
\eeq
\ie, $y^{(B)}$ is a further degraded version of $y^{(A)}$.]
In this special case, it turns out that whatever information
is getting through to receiver $B$ can also be recovered by
receiver $A$.
% stolen from Blahut
% [This is obvious for the degraded channel,
So there is no point distinguishing between $R_0$ and $R_B$:
the task is to find the capacity region for the rate pair $(R_0,R_A)$,
where $R_0$ is the rate of information reaching both $A$ and $B$,
and $R_A$ is the rate of the extra information reaching $A$.
The following exercise is equivalent to this one,
and a solution to it is illustrated in
\figref{fig.broadcastIII}.
% Blahut page 338.
% Cover and Thomas page
}
\ExercisxC{3}{ex.broadcastII}{
{\sf Variable-rate error-correcting codes
for\index{channel!unknown noise level}\index{error-correcting code!variable rate}\index{variable-rate error-correcting codes}
{channels with unknown noise level}}.
In real life,%
\marginfig{
\begin{center}
\mbox{\psfig{figure=figs/broadcastII.eps,angle=0,width=1.27in}}
\end{center}
\caption[a]{Rate of reliable communication $R$,
as a function of noise level $f$, for Shannonesque
codes designed to operate at noise levels $f_A$ (solid line)
and $f_B$ (dashed line).}
\label{fig.broadcastII}
}
channels may sometimes not be well characterized
before the encoder is installed. As a model
of this situation, imagine that a channel
is known to be a binary symmetric channel with noise level
either $f_A$ or $f_B$. Let $f_B>f_A$, and let the
two capacities be $C_A$ and $C_B$.
Those who like to live dangerously might install a system
designed for noise level $f_A$
with rate $R_A \simeq C_A$; in the event that the noise level
turns out to be $f_B$, our experience of Shannon's theories
would lead us to expect that there would be a catastrophic failure
to communicate
information reliably (solid line in \figref{fig.broadcastII}).%
\amarginfig{t}{
\begin{center}
\mbox{\psfig{figure=figs/broadcastIIa.eps,angle=0,width=1.27in}}
\end{center}
\caption[a]{Rate of reliable communication $R$,
as a function of noise level $f$, for a desired
{\dem{variable-rate}} code.}
\label{fig.broadcastIIa}
}
A conservative approach would design the encoding system
for the worst-case scenario, installing a code with rate $R_B
\simeq C_B$ (dashed line in \figref{fig.broadcastII}).
In the event that the lower noise level, $f_A$, holds
true, the managers would have a feeling of regret
because of the wasted capacity difference $C_A - R_B$.
Is
it possible to create a system that not only transmits
reliably at some rate $R_0$ whatever the noise level,
but also communicates some extra,
`lower-priority'\index{priority of bits in a message}
bits if the noise level is low, as shown in\index{error-correcting code!with varying level
of protection}
\figref{fig.broadcastIIa}?
This code communicates
the high-priority bits reliably at all noise levels
between $f_A$
and $f_B$, and communicates the low-priority bits also
if the noise level is $f_A$ or below.
This problem is mathematically equivalent to the
previous problem, the degraded \ind{broadcast channel}.\index{channel!broadcast}
The lower rate of communication was there called $R_0$, and
the rate at which the low-priority bits are communicated if
the noise level is low was called $R_A$.
\amarginfig{t}{
\begin{center}
\raisebox{0.1in}{\psfig{figure=figs/broadcastans.ps,angle=-90,width=1.27in}}
\end{center}
\caption[a]{An achievable region for
the channel with unknown noise level.
Assuming the two possible noise levels
are $f_A=0.01$ and $f_B=0.1$, the
dashed lines show the rates $R_A,R_B$ that
are achievable using a simple time-sharing approach,
and the solid line shows rates achievable using a more
cunning approach.
}
\label{fig.broadcastIII}
}
% load 'broadcast.gnu'
An illustrative answer is shown in \figref{fig.broadcastIII},
for the case $f_A=0.01$ and $f_B=0.1$.
(This figure also shows the
achievable region for a broadcast channel whose
two half-channels have noise levels $f_A=0.01$ and $f_B=0.1$.)
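 For orientation, the two capacities in this example follow from the
 capacity of the binary symmetric channel, $C = 1 - H_2(f)$
 (the figures below are a worked check, rounded to three decimal places):
\[
 C_A = 1 - H_2(0.01) \simeq 1 - 0.081 = 0.919 ,
 \hspace{0.3in}
 C_B = 1 - H_2(0.1) \simeq 1 - 0.469 = 0.531 .
\]
 So the capacity wasted by the conservative design, $C_A - R_B$
 with $R_B \simeq C_B$, is about $0.39$ bits per channel use.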
I admit I find the gap between the simple time-sharing
solution and the cunning solution disappointingly small.
In \chref{chdfountain} we will discuss codes for a
special class of broadcast channels, namely erasure channels,
where every symbol is either received without error or erased.
These codes have the nice property that they are {\dem rateless} --
the number of symbols transmitted is determined on the fly such that
 reliable communication is achieved, whatever the erasure statistics of
the channel.
}
% \begin{description}
%\item[Multiterminal information networks]
% \noindent
\ExercisxC{3}{ex.multiterminal}{
{\sf \index{multiterminal networks}{Multiterminal information networks}}\index{channel!multiterminal}
are both important practically and
intriguing theoretically. Consider the following example of a two-way
binary channel (\figref{fig.achievabletwo}a,b):
two people both wish to talk over the channel,
and they both want to hear what
the other person is saying; but you can only hear
the signal transmitted by the other person if you are transmitting
a zero. What simultaneous information rates from $A$ to $B$ and
from $B$ to $A$ can be achieved, and how? Everyday examples
of such networks include
 the VHF channels used by ships, and computer Ethernet networks (in which
{\em all\/} the devices are unable to hear {\em anything\/}
if two or more devices are broadcasting simultaneously).
\begin{figure}
\figuremargin{%
\begin{center}
\mbox{{\footnotesize{(a)}}
\setlength{\unitlength}{0.07in}
\begin{picture}(50,10)(0,2.5)
\put(4,10){\makebox(0,0)[r]{$x^{(A)}$}}
\put(4,5){\makebox(0,0)[r]{$y^{(A)}$}}
\put(5,10){\vector(1,0){5}}
\put(10,5){\vector(-1,0){5}}
\put(10,2.5){\framebox(25,10){$P(y^{(A)},y^{(B)}| x^{(A)} , x^{(B)} )$}}
\put(41,10){\makebox(0,0)[l]{$y^{(B)}$}}
\put(41,5){\makebox(0,0)[l]{$x^{(B)}$}}
\put(35,10){\vector(1,0){5}}
\put(40,5){\vector(-1,0){5}}
\end{picture}
}\\[0.2in]
\mbox{
{\footnotesize{(b)}}\hspace{0.2in}
{%\footnotesize
\fourfourtabler{$y^{(A)}$}{$x^{(A)}$}{{\mathsstrut}$\:0\:$}{{\mathsstrut}$\:1\:$}{{\mathsstrut}$x^{(B)}$}{0}{0}{1}{0}\hspace{0.2in}
\fourfourtabler{$y^{(B)}$}{$x^{(A)}$}{{\mathsstrut}0}{{\mathsstrut}1}{{\mathsstrut}$x^{(B)}$}{0}{1}{0}{0}
}}
\\[0.2in]
\mbox{
\hspace{0.4in}
{\footnotesize{(c)}}
\hspace{-0.2in}
\raisebox{-0.1in}{\psfig{figure=figs/twoway.ps,angle=-90,height=1.8in,width=2.45in}}}
\end{center}
}{%
\caption[a]{(a) A general two-way channel.
(b) The rules for a binary two-way channel. The two tables show the
outputs $y^{(A)}$ and $y^{(B)}$ that result for each state of the inputs.
(c) Achievable region for the two-way binary channel.
Rates below the solid line are achievable.
The dotted line shows the `obviously achievable' region which
can be attained by simple time-sharing.}
\label{fig.achievabletwo}
}%
\end{figure}
%gnuplot> plot "twoway1.4" u 1:2 w l 1, "twoway1.4" u 2:1 w l 1,1-x w l 2
%gnuplot> set term post
%Terminal type set to 'postscript'
%Options are 'landscape monochrome dashed "Helvetica" 14'
%gnuplot> set output "twoway.ps"
%gnuplot> replot
%
% generated using figs/twoway.p
Obviously, we can achieve rates of $\dhalf$ in both directions by simple
time-sharing. But can the two information rates be made larger?
Finding the capacity of a general two-way
channel is still an open problem. However,
we can obtain interesting results concerning achievable points for the simple
binary channel discussed above, as
indicated in \figref{fig.achievabletwo}c. There exist codes that can achieve
rates up to the boundary shown.
 There may exist better codes too.
}
% cover 457
% using independently generated codes you can prove that the following rate region is achievable: R1 < I(X1;Y2 \given X2), R2
%gnuplot> print (1+sqrt(5))/2 - 1
% 0.618034
%gnuplot> print (1+sqrt(5))/2 - 2
% -0.381966
\exercissxB{2}{ex.count.ones}{
If a file containing a fraction $f=0.5$ {\tt 1}s is
transmitted by $C_2$, what fraction of
the transmitted stream is {\tt 1}s?
What fraction of the transmitted bits is {\tt 1}s
if we drive code $C_2$ with a sparse source of density $f = 0.38$?
}
% answer f/(1+f) =
% (gamma-1.0)/(2*gamma - 1.0) =
% 0.2764
A second, more fundamental approach {\em counts\/}
% Alternatively, count
how many valid sequences of length $N$ there are, $S_N$.
We can communicate $\log S_N$ bits in $N$ channel cycles by giving
one name to each of these valid sequences.
% Define capacity here.
% Having got a feel for this toy channel, let us now tackle the
% general problem.
\section{The capacity of a constrained noiseless channel}
% How can we define the capacity of a constrained channel?
We defined the capacity of a noisy channel in terms of\index{channel!capacity}
the mutual information between its input and its output, then
we proved
% -- with considerable effort --
that this number, the capacity, was related to the
number of distinguishable messages
$S(N)$
% $M(N)$
% \marginpar{Do I want to use $M$?}% Sun 31/12/00: YES, from here on
that
could be reliably conveyed over the channel in $N$ uses of
the channel by
\beq
C = \lim_{N \rightarrow \infty} \frac{1}{N} \log S(N) .
\eeq
In the case of the constrained noiseless channel,
we can adopt this identity as our definition of
the channel's capacity.
However, the name $s$, which,
when we were making codes for noisy channels (\secref{sec.whereCWMdefined}),
ran over messages $s = 1, \ldots, S$,
is about to take on a new role: labelling the states
of our channel; so in this chapter
we will denote the number of distinguishable messages of length $N$
by $M_N$, and define the capacity to be:\index{capacity!constrained channel}
\beq
C = \lim_{N \rightarrow \infty} \frac{1}{N} \log M_N .
\eeq
% Knowing the capacity of a channel doesn't tell us how practically to
% achieve that rate of communication, so o
Once we have figured out
the capacity of a channel we will return to
the task of making a practical code for that channel.
\section{Counting the number of possible messages}
First let us introduce some representations of
constrained channels.
% We can often conveniently represent a constrained channel
% by a state diagram.
In a {\dem\ind{state diagram}}, states of the transmitter are represented
by circles labelled with the name of the state.
Directed edges\index{edge}\index{graph} from one state to another indicate that the
transmitter is permitted to move from the first state to the
second, and a label on that edge indicates the
symbol emitted when that \ind{transition} is made.
\Figref{fig.state1}a shows the state diagram for
channel A.
% the ${\tt 0}^+{\tt 1}^1$
It has two states, $0$
and $1$. When transitions to state $0$ are made,
a {\tt 0} is transmitted; when transitions to state $1$ are made,
a {\tt 1} is transmitted; transitions from state $1$ to state $1$
are not possible.
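 As a small illustration of the counting to come (assuming, as in the
 trellis of \figref{fig.state1}c, that the transmitter starts in state $0$),
 the valid strings of lengths 1, 2 and 3 for channel A are
\[
 \{ {\tt 0},{\tt 1} \}, \:\:\:
 \{ {\tt 00},{\tt 01},{\tt 10} \}, \:\:\:
 \{ {\tt 000},{\tt 001},{\tt 010},{\tt 100},{\tt 101} \} ,
\]
 so $M_1 = 2$, $M_2 = 3$ and $M_3 = 5$, giving
 $\frac{1}{N} \log_2 M_N \simeq 1$, $0.79$ and $0.77$ bits per channel use
 respectively.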
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}
\figuremarginb{% bottom aligned to avoid clash
\small
\begin{center}
\begin{tabular}{cc}
(a)\mbox{\psfig{figure=noiseless/figs/state1.ps,angle=-90,width=0.6in}}&
(c)
%
% trellis section written by trellis.p
%
\setlength{\unitlength}{0.01cm}
\begin{picture}(1004,180)(-25,-25)
%
% starting circles
%
\multiput(0,0)(0,100){1}{\circle{26}}
%
% labels for circles
%
\put(0,0){\makebox(0,0){\tiny{0}}}
%
% lines
%
\put(17,0){\vector(1,0){88}}
\put(61,-2){\makebox(0,0)[t]{\tt{0}}}
\put(17,6){\vector(1,1){88}}
\put(86,77){\makebox(0,0)[br]{\tt{1}}}
%
% starting circles
%
\multiput(122,0)(0,100){2}{\circle{26}}
%
% labels for circles
%
\put(122,0){\makebox(0,0){\tiny{0}}}
\put(122,100){\makebox(0,0){\tiny{1}}}
%
% state label for this column
%
\put(122,130){\makebox(0,0)[b]{{$s_{1}$}}}
%
% lines
%
\put(139,0){\vector(1,0){88}}
\put(183,-2){\makebox(0,0)[t]{\tt{0}}}
\put(139,94){\vector(1,-1){88}}
\put(158,77){\makebox(0,0)[bl]{\tt{0}}}
\put(139,6){\vector(1,1){88}}
\put(208,77){\makebox(0,0)[br]{\tt{1}}}
%
% starting circles
%
\multiput(244,0)(0,100){2}{\circle{26}}
%
% labels for circles
%
\put(244,0){\makebox(0,0){\tiny{0}}}
\put(244,100){\makebox(0,0){\tiny{1}}}
%
% state label for this column
%
\put(244,130){\makebox(0,0)[b]{{$s_{2}$}}}
%
% lines
%
\put(261,0){\vector(1,0){88}}
\put(305,-2){\makebox(0,0)[t]{\tt{0}}}
\put(261,94){\vector(1,-1){88}}
\put(280,77){\makebox(0,0)[bl]{\tt{0}}}
\put(261,6){\vector(1,1){88}}
\put(330,77){\makebox(0,0)[br]{\tt{1}}}
%
% starting circles
%
\multiput(366,0)(0,100){2}{\circle{26}}
%
% labels for circles
%
\put(366,0){\makebox(0,0){\tiny{0}}}
\put(366,100){\makebox(0,0){\tiny{1}}}
%
% state label for this column
%
\put(366,130){\makebox(0,0)[b]{{$s_{3}$}}}
%
% lines
%
\put(383,0){\vector(1,0){88}}
\put(427,-2){\makebox(0,0)[t]{\tt{0}}}
\put(383,94){\vector(1,-1){88}}
\put(402,77){\makebox(0,0)[bl]{\tt{0}}}
\put(383,6){\vector(1,1){88}}
\put(452,77){\makebox(0,0)[br]{\tt{1}}}
%
% starting circles
%
\multiput(488,0)(0,100){2}{\circle{26}}
%
% labels for circles
%
\put(488,0){\makebox(0,0){\tiny{0}}}
\put(488,100){\makebox(0,0){\tiny{1}}}
%
% state label for this column
%
\put(488,130){\makebox(0,0)[b]{{$s_{4}$}}}
%
% lines
%
\put(505,0){\vector(1,0){88}}
\put(549,-2){\makebox(0,0)[t]{\tt{0}}}
\put(505,94){\vector(1,-1){88}}
\put(524,77){\makebox(0,0)[bl]{\tt{0}}}
\put(505,6){\vector(1,1){88}}
\put(574,77){\makebox(0,0)[br]{\tt{1}}}
%
% starting circles
%
\multiput(610,0)(0,100){2}{\circle{26}}
%
% labels for circles
%
\put(610,0){\makebox(0,0){\tiny{0}}}
\put(610,100){\makebox(0,0){\tiny{1}}}
%
% state label for this column
%
\put(610,130){\makebox(0,0)[b]{{$s_{5}$}}}
%
% lines
%
\put(627,0){\vector(1,0){88}}
\put(671,-2){\makebox(0,0)[t]{\tt{0}}}
\put(627,94){\vector(1,-1){88}}
\put(646,77){\makebox(0,0)[bl]{\tt{0}}}
\put(627,6){\vector(1,1){88}}
\put(696,77){\makebox(0,0)[br]{\tt{1}}}
%
% starting circles
%
\multiput(732,0)(0,100){2}{\circle{26}}
%
% labels for circles
%
\put(732,0){\makebox(0,0){\tiny{0}}}
\put(732,100){\makebox(0,0){\tiny{1}}}
%
% state label for this column
%
\put(732,130){\makebox(0,0)[b]{{$s_{6}$}}}
%
% lines
%
\put(749,0){\vector(1,0){88}}
\put(793,-2){\makebox(0,0)[t]{\tt{0}}}
\put(749,94){\vector(1,-1){88}}
\put(768,77){\makebox(0,0)[bl]{\tt{0}}}
\put(749,6){\vector(1,1){88}}
\put(818,77){\makebox(0,0)[br]{\tt{1}}}
%
% starting circles
%
\multiput(854,0)(0,100){2}{\circle{26}}
%
% labels for circles
%
\put(854,0){\makebox(0,0){\tiny{0}}}
\put(854,100){\makebox(0,0){\tiny{1}}}
%
% state label for this column
%
\put(854,130){\makebox(0,0)[b]{{$s_{7}$}}}
%
% lines
%
\put(871,0){\vector(1,0){88}}
\put(915,-2){\makebox(0,0)[t]{\tt{0}}}
\put(871,94){\vector(1,-1){88}}
\put(890,77){\makebox(0,0)[bl]{\tt{0}}}
\put(871,6){\vector(1,1){88}}
\put(940,77){\makebox(0,0)[br]{\tt{1}}}
%
% end circles
%
\multiput(976,0)(0,100){2}{\circle{26}}
%
% labels for circles
%
\put(976,0){\makebox(0,0){\tiny{0}}}
\put(976,100){\makebox(0,0){\tiny{1}}}
%
% state label for this column
%
\put(976,130){\makebox(0,0)[b]{{$s_{8}$}}}
%
\end{picture}
%
\\
(b)
\raisebox{-0.4in}{
%
% trellis section written by trellis.p
% handedited Tue 24/12/02
%
\setlength{\unitlength}{0.015cm}
\begin{picture}(150,180)(-25,-3)
%
% starting circles
%
\multiput(0,0)(0,100){2}{\circle{26}}
%
% labels for circles
%
\put(0,0){\makebox(0,0){{0}}}
\put(0,100){\makebox(0,0){{1}}}
%
% state label for this column
%
\put(0,130){\makebox(0,0)[b]{{$s_{n}$}}}
%
% lines
%
\put(17,0){\vector(1,0){88}}
\put(61,-2){\makebox(0,0)[t]{\tt{0}}}
\put(17,94){\vector(1,-1){88}}
\put(36,77){\makebox(0,0)[bl]{\tt{0}}}
\put(17,6){\vector(1,1){88}}
\put(86,77){\makebox(0,0)[br]{\tt{1}}}
%
% end circles
%
\multiput(122,0)(0,100){2}{\circle{26}}
%
% labels for circles
%
\put(122,0){\makebox(0,0){{0}}}
\put(122,100){\makebox(0,0){{1}}}
%
% state label for this column
%
\put(122,130){\makebox(0,0)[b]{{$s_{n+1}$}}}
%
\end{picture}
%
}
&
(d)\hspace{0.1in}
{ $\bA = \begin{array}[b]{c@{}cc@{}c}
& & \multicolumn{2}{c}{\mbox{\tiny (from)}} \\
% & & \multicolumn{2}{c}{\mbox{state}} \\
& & \: 1 & 0 \: \\
\mbox{\tiny (to)}
% {state}
& \begin{array}{c}
1\\
0\end{array} & \left[ \begin{array}{c}
0\\
1\end{array} \right. & \left.
\begin{array}{c}
1\\
1\end{array} \right]\\
\end{array}$
}
\\
\end{tabular}
\end{center}
}{%
\caption[a]{(a) State diagram for
% the ${\tt 0}^+{\tt 1}^1$
channel A.
(b) Trellis section. (c) Trellis. (d) \Connectionmatrix.}
\label{fig.state1}
}%
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% worked on these figs Sun 3/2/02
%
\begin{figure}
\figuremargin{%
\small
\begin{center}
\begin{tabular}{cc@{\hspace{0.2in}}|cc}
\raisebox{0.1in}[0in][0in]{\psfig{figure=noiseless/figs/state101.ps,angle=-90,height=2in}}
&
%
% trellis section written by trellis.p
%
\setlength{\unitlength}{0.015cm}
\begin{picture}(150,380)(-25,-25)
%
% starting circles
%
\multiput(0,0)(0,100){4}{\circle{32}}
%
% labels for circles
%
\put(0,0){\makebox(0,0){\small{00}}}
\put(0,100){\makebox(0,0){\small{0}}}
\put(0,200){\makebox(0,0){\small{1}}}
\put(0,300){\makebox(0,0){\small{11}}}
%
% state label for this column
%
\put(0,330){\makebox(0,0)[b]{{$s_{n}$}}}
%
% lines
%
\put(19.5384615384615,0){\vector(1,0){88}}
\put(63.5384615384615,-2){\makebox(0,0)[t]{\tt{0}}}
\put(19.5384615384615,94){\vector(1,-1){88}}
\put(38.5384615384615,77){\makebox(0,0)[bl]{\tt{0}}}
\put(19.5384615384615,288){\vector(1,-2){88}}
\put(38.5384615384615,252){\makebox(0,0)[bl]{\tt{0}}}
\put(19.5384615384615,12){\vector(1,2){88}}
\put(63.5384615384615,102){\makebox(0,0)[br]{\tt{1}}}
\put(19.5384615384615,206){\vector(1,1){88}}
\put(88.5384615384615,277){\makebox(0,0)[br]{\tt{1}}}
\put(19.5384615384615,300){\vector(1,0){88}}
\put(63.5384615384615,302){\makebox(0,0)[b]{\tt{1}}}
%
% end circles
%
\multiput(127.076923076923,0)(0,100){4}{\circle{32}}
%
% labels for circles
%
\put(127.076923076923,0){\makebox(0,0){\small{00}}}
\put(127.076923076923,100){\makebox(0,0){\small{0}}}
\put(127.076923076923,200){\makebox(0,0){\small{1}}}
\put(127.076923076923,300){\makebox(0,0){\small{11}}}
%
% state label for this column
%
\put(127.076923076923,330){\makebox(0,0)[b]{{$s_{n+1}$}}}
%
\end{picture}
%
& % divider
\begin{tabular}{@{}c@{}}
\raisebox{0.15in}[0in][0.2in]{%
\mbox{\psfig{figure=noiseless/figs/state111.ps,angle=-90,height=1.7in}}
}
\\
\end{tabular}
&
%
% trellis section written by trellis.p
%
\setlength{\unitlength}{0.015cm}
\begin{picture}(150,380)(-25,-25)
%
% starting circles
%
\multiput(0,0)(0,100){4}{\circle{36}}
%
% labels for circles
%
\put(0,0){\makebox(0,0){\small{00}}}
\put(0,100){\makebox(0,0){\small{0}}}
\put(0,200){\makebox(0,0){\small{1}}}
\put(0,300){\makebox(0,0){\small{11}}}
%
% state label for this column
%
\put(0,330){\makebox(0,0)[b]{{$s_{n}$}}}
%
% lines
%
\put(21.2307692307692,94){\vector(1,-1){88}}
\put(40.2307692307692,77){\makebox(0,0)[bl]{\tt{0}}}
\put(21.2307692307692,194){\vector(1,-1){88}}
\put(40.2307692307692,177){\makebox(0,0)[bl]{\tt{0}}}
\put(21.2307692307692,288){\vector(1,-2){88}}
\put(40.2307692307692,252){\makebox(0,0)[bl]{\tt{0}}}
\put(21.2307692307692,12){\vector(1,2){88}}
\put(65.2307692307692,102){\makebox(0,0)[br]{\tt{1}}}
\put(21.2307692307692,106){\vector(1,1){88}}
\put(90.2307692307692,177){\makebox(0,0)[br]{\tt{1}}}
\put(21.2307692307692,206){\vector(1,1){88}}
\put(90.2307692307692,277){\makebox(0,0)[br]{\tt{1}}}
%
% end circles
%
\multiput(130.461538461538,0)(0,100){4}{\circle{36}}
%
% labels for circles
%
\put(130.461538461538,0){\makebox(0,0){\small{00}}}
\put(130.461538461538,100){\makebox(0,0){\small{0}}}
\put(130.461538461538,200){\makebox(0,0){\small{1}}}
\put(130.461538461538,300){\makebox(0,0){\small{11}}}
%
% state label for this column
%
\put(130.461538461538,330){\makebox(0,0)[b]{{$s_{n+1}$}}}
%
\end{picture}
%
\\ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\normalsize B &
$\bA = \input{noiseless/tex/mfile101.tex}$
& % divider
\normalsize C &
$\bA = \input{noiseless/tex/mfile111.tex}$
\\
\end{tabular}
\end{center}
}{%
\caption[a]{State diagrams, trellis sections
and \connectionmatrices\ for channels B and C. }
\label{fig.state101}
}%
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}
\figuremargin{%
\begin{center}\raisebox{0.2in}[0.85in]{
\begin{tabular}{ccc} % \toprule
%
% trellis section written by trellis.p
%
\setlength{\unitlength}{0.012cm}
\begin{picture}(150,310)(-25,-55)
%
% starting circles
%
\multiput(0,0)(0,100){1}{\circle{26}}
%
% labels for circles
%
\put(0,0){\makebox(0,0){\tiny{0}}}
%
% lines
%
\put(17,0){\vector(1,0){88}}
\put(17,6){\vector(1,1){88}}
% section 1 : cumulative counts
% 0 1
% 1 1
\put(97,-50){\makebox(50,30){\small\bf{1}}}
\put(97,120){\makebox(50,30){\small\bf{1}}}
%
% total count
%
\put(122,200){\makebox(0,0)[t]{\small{$M_{1}\eq 2$}}}
%
% end circles
%
\multiput(122,0)(0,100){2}{\circle{26}}
%
% labels for circles
%
\put(122,0){\makebox(0,0){\tiny{0}}}
\put(122,100){\makebox(0,0){\tiny{1}}}
%
\end{picture}
%
&
%
% trellis section written by trellis.p
%
\setlength{\unitlength}{0.012cm}
\begin{picture}(272,310)(-25,-55)
%
% starting circles
%
\multiput(0,0)(0,100){1}{\circle{26}}
%
% labels for circles
%
\put(0,0){\makebox(0,0){\tiny{0}}}
%
% lines
%
\put(17,0){\vector(1,0){88}}
\put(17,6){\vector(1,1){88}}
% section 1 : cumulative counts
% 0 1
% 1 1
\put(97,-50){\makebox(50,30){\small\bf{1}}}
\put(97,120){\makebox(50,30){\small\bf{1}}}
%
% total count
%
\put(122,200){\makebox(0,0)[t]{\small{$M_{1}\eq 2$}}}
%
% starting circles
%
\multiput(122,0)(0,100){2}{\circle{26}}
%
% labels for circles
%
\put(122,0){\makebox(0,0){\tiny{0}}}
\put(122,100){\makebox(0,0){\tiny{1}}}
%
% lines
%
\put(139,0){\vector(1,0){88}}
\put(139,94){\vector(1,-1){88}}
\put(139,6){\vector(1,1){88}}
% section 2 : cumulative counts
% 0 2
% 1 1
\put(219,-50){\makebox(50,30){\small\bf{2}}}
\put(219,120){\makebox(50,30){\small\bf{1}}}
%
% total count
%
\put(244,200){\makebox(0,0)[t]{\small{$M_{2}\eq 3$}}}
%
% end circles
%
\multiput(244,0)(0,100){2}{\circle{26}}
%
% labels for circles
%
\put(244,0){\makebox(0,0){\tiny{0}}}
\put(244,100){\makebox(0,0){\tiny{1}}}
%
\end{picture}
%
&
%
% trellis section written by trellis.p
%
\setlength{\unitlength}{0.012cm}
\begin{picture}(394,310)(-25,-55)
%
% starting circles
%
\multiput(0,0)(0,100){1}{\circle{26}}
%
% labels for circles
%
\put(0,0){\makebox(0,0){\tiny{0}}}
%
% lines
%
\put(17,0){\vector(1,0){88}}
\put(17,6){\vector(1,1){88}}
% section 1 : cumulative counts
% 0 1
% 1 1
\put(97,-50){\makebox(50,30){\small\bf{1}}}
\put(97,120){\makebox(50,30){\small\bf{1}}}
%
% total count
%
\put(122,200){\makebox(0,0)[t]{\small{$M_{1}\eq 2$}}}
%
% starting circles
%
\multiput(122,0)(0,100){2}{\circle{26}}
%
% labels for circles
%
\put(122,0){\makebox(0,0){\tiny{0}}}
\put(122,100){\makebox(0,0){\tiny{1}}}
%
% lines
%
\put(139,0){\vector(1,0){88}}
\put(139,94){\vector(1,-1){88}}
\put(139,6){\vector(1,1){88}}
% section 2 : cumulative counts
% 0 2
% 1 1
\put(219,-50){\makebox(50,30){\small\bf{2}}}
\put(219,120){\makebox(50,30){\small\bf{1}}}
%
% total count
%
\put(244,200){\makebox(0,0)[t]{\small{$M_{2}\eq 3$}}}
%
% starting circles
%
\multiput(244,0)(0,100){2}{\circle{26}}
%
% labels for circles
%
\put(244,0){\makebox(0,0){\tiny{0}}}
\put(244,100){\makebox(0,0){\tiny{1}}}
%
% lines
%
\put(261,0){\vector(1,0){88}}
\put(261,94){\vector(1,-1){88}}
\put(261,6){\vector(1,1){88}}
% section 3 : cumulative counts
% 0 3
% 1 2
\put(341,-50){\makebox(50,30){\small\bf{3}}}
\put(341,120){\makebox(50,30){\small\bf{2}}}
%
% total count
%
\put(366,200){\makebox(0,0)[t]{\small{$M_{3}\eq 5$}}}
%
% end circles
%
\multiput(366,0)(0,100){2}{\circle{26}}
%
% labels for circles
%
\put(366,0){\makebox(0,0){\tiny{0}}}
\put(366,100){\makebox(0,0){\tiny{1}}}
%
\end{picture}
%
\\ % \bottomrule
\end{tabular}
}
\end{center}
}{%
\caption[a]{Counting the number of paths in the trellis of channel A.
The counts
% in the square boxes
next to the nodes
are accumulated by passing from left to right
across the trellises.}
\label{fig.state1count123}
}%
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}
\fullwidthfigureright{%
\begin{center}\small
\begin{tabular}{@{}*{1}{l@{}l@{}}} \toprule
\raisebox{1in}{(a) Channel A}&
%
% trellis section written by trellis.p
%
% handedited Tue 24/12/02 to widen from 1004 to 1034
%
\setlength{\unitlength}{0.012cm}
\begin{picture}(1064,310)(-25,-55)
%
% starting circles
%
\multiput(0,0)(0,100){1}{\circle{26}}
%
% labels for circles
%
\put(0,0){\makebox(0,0){\tiny{0}}}
%
% lines
%
\put(17,0){\vector(1,0){88}}
\put(17,6){\vector(1,1){88}}
% section 1 : cumulative counts
% 0 1
% 1 1
\put(97,-50){\makebox(50,30){\small\bf{1}}}
\put(97,120){\makebox(50,30){\small\bf{1}}}
%
% total count
%
\put(122,200){\makebox(0,0)[t]{\small{$M_{1}\eq 2$}}}
%
% starting circles
%
\multiput(122,0)(0,100){2}{\circle{26}}
%
% labels for circles
%
\put(122,0){\makebox(0,0){\tiny{0}}}
\put(122,100){\makebox(0,0){\tiny{1}}}
%
% lines
%
\put(139,0){\vector(1,0){88}}
\put(139,94){\vector(1,-1){88}}
\put(139,6){\vector(1,1){88}}
% section 2 : cumulative counts
% 0 2
% 1 1
\put(219,-50){\makebox(50,30){\small\bf{2}}}
\put(219,120){\makebox(50,30){\small\bf{1}}}
%
% total count
%
\put(244,200){\makebox(0,0)[t]{\small{$M_{2}\eq 3$}}}
%
% starting circles
%
\multiput(244,0)(0,100){2}{\circle{26}}
%
% labels for circles
%
\put(244,0){\makebox(0,0){\tiny{0}}}
\put(244,100){\makebox(0,0){\tiny{1}}}
%
% lines
%
\put(261,0){\vector(1,0){88}}
\put(261,94){\vector(1,-1){88}}
\put(261,6){\vector(1,1){88}}
% section 3 : cumulative counts
% 0 3
% 1 2
\put(341,-50){\makebox(50,30){\small\bf{3}}}
\put(341,120){\makebox(50,30){\small\bf{2}}}
%
% total count
%
\put(366,200){\makebox(0,0)[t]{\small{$M_{3}\eq 5$}}}
%
% starting circles
%
\multiput(366,0)(0,100){2}{\circle{26}}
%
% labels for circles
%
\put(366,0){\makebox(0,0){\tiny{0}}}
\put(366,100){\makebox(0,0){\tiny{1}}}
%
% lines
%
\put(383,0){\vector(1,0){88}}
\put(383,94){\vector(1,-1){88}}
\put(383,6){\vector(1,1){88}}
% section 4 : cumulative counts
% 0 5
% 1 3
\put(463,-50){\makebox(50,30){\small\bf{5}}}
\put(463,120){\makebox(50,30){\small\bf{3}}}
%
% total count
%
\put(488,200){\makebox(0,0)[t]{\small{$M_{4}\eq 8$}}}
%
% starting circles
%
\multiput(488,0)(0,100){2}{\circle{26}}
%
% labels for circles
%
\put(488,0){\makebox(0,0){\tiny{0}}}
\put(488,100){\makebox(0,0){\tiny{1}}}
%
% lines
%
\put(505,0){\vector(1,0){88}}
\put(505,94){\vector(1,-1){88}}
\put(505,6){\vector(1,1){88}}
% section 5 : cumulative counts
% 0 8
% 1 5
\put(585,-50){\makebox(50,30){\small\bf{8}}}
\put(585,120){\makebox(50,30){\small\bf{5}}}
%
% total count
%
\put(610,200){\makebox(0,0)[t]{\small{$M_{5}\eq 13$}}}
%
% starting circles
%
\multiput(610,0)(0,100){2}{\circle{26}}
%
% labels for circles
%
\put(610,0){\makebox(0,0){\tiny{0}}}
\put(610,100){\makebox(0,0){\tiny{1}}}
%
% lines
%
\put(627,0){\vector(1,0){88}}
\put(627,94){\vector(1,-1){88}}
\put(627,6){\vector(1,1){88}}
% section 6 : cumulative counts
% 0 13
% 1 8
\put(707,-50){\makebox(50,30){\small\bf{13}}}
\put(707,120){\makebox(50,30){\small\bf{8}}}
%
% total count
%
\put(732,200){\makebox(0,0)[t]{\small{$M_{6}\eq 21$}}}
%
% starting circles
%
\multiput(732,0)(0,100){2}{\circle{26}}
%
% labels for circles
%
\put(732,0){\makebox(0,0){\tiny{0}}}
\put(732,100){\makebox(0,0){\tiny{1}}}
%
% lines
%
\put(749,0){\vector(1,0){88}}
\put(749,94){\vector(1,-1){88}}
\put(749,6){\vector(1,1){88}}
% section 7 : cumulative counts
% 0 21
% 1 13
\put(829,-50){\makebox(50,30){\small\bf{21}}}
\put(829,120){\makebox(50,30){\small\bf{13}}}
%
% total count
%
\put(854,200){\makebox(0,0)[t]{\small{$M_{7}\eq 34$}}}
%
% starting circles
%
\multiput(854,0)(0,100){2}{\circle{26}}
%
% labels for circles
%
\put(854,0){\makebox(0,0){\tiny{0}}}
\put(854,100){\makebox(0,0){\tiny{1}}}
%
% lines
%
\put(871,0){\vector(1,0){88}}
\put(871,94){\vector(1,-1){88}}
\put(871,6){\vector(1,1){88}}
% section 8 : cumulative counts
% 0 34
% 1 21
\put(951,-50){\makebox(50,30){\small\bf{34}}}
\put(951,120){\makebox(50,30){\small\bf{21}}}
%
% total count
%
\put(976,200){\makebox(0,0)[t]{\small{$M_{8}\eq 55$}}}
%
% end circles
%
\multiput(976,0)(0,100){2}{\circle{26}}
%
% labels for circles
%
\put(976,0){\makebox(0,0){\tiny{0}}}
\put(976,100){\makebox(0,0){\tiny{1}}}
%
\end{picture}
%
\\ \midrule
\raisebox{1.8in}{(b) Channel B}&
%
% trellis section written by trellis.p
%
\setlength{\unitlength}{0.012cm}
\begin{picture}(999,510)(-25,-55)
%
% starting circles
%
\multiput(0,0)(0,100){1}{\circle{26}}
%
% labels for circles
%
\put(0,0){\makebox(0,0){\tiny{00}}}
%
% lines
%
\put(17,0){\vector(1,0){88}}
\put(17,12){\vector(1,2){88}}
% section 1 : cumulative counts
% 0 1
% 1 0
% 2 1
% 3 0
\put(102,-50){\makebox(50,30){\small\bf{1}}}
\put(102,220){\makebox(50,30){\small\bf{1}}}
%
% total count
%
\put(122,400){\makebox(0,0)[t]{\small{$M_{1}\eq 2$}}}
%
% starting circles
%
\multiput(122,0)(0,100){4}{\circle{26}}
%
% labels for circles
%
\put(122,0){\makebox(0,0){\tiny{00}}}
\put(122,100){\makebox(0,0){\tiny{0}}}
\put(122,200){\makebox(0,0){\tiny{1}}}
\put(122,300){\makebox(0,0){\tiny{11}}}
%
% lines
%
\put(139,0){\vector(1,0){88}}
\put(139,94){\vector(1,-1){88}}
\put(139,288){\vector(1,-2){88}}
\put(139,12){\vector(1,2){88}}
\put(139,206){\vector(1,1){88}}
\put(139,300){\vector(1,0){88}}
% section 2 : cumulative counts
% 0 1
% 1 0
% 2 1
% 3 1
\put(224,-50){\makebox(50,30){\small\bf{1}}}
\put(224,220){\makebox(50,30){\small\bf{1}}}
\put(224,320){\makebox(50,30){\small\bf{1}}}
%
% total count
%
\put(244,400){\makebox(0,0)[t]{\small{$M_{2}\eq 3$}}}
%
% starting circles
%
\multiput(244,0)(0,100){4}{\circle{26}}
%
% labels for circles
%
\put(244,0){\makebox(0,0){\tiny{00}}}
\put(244,100){\makebox(0,0){\tiny{0}}}
\put(244,200){\makebox(0,0){\tiny{1}}}
\put(244,300){\makebox(0,0){\tiny{11}}}
%
% lines
%
\put(261,0){\vector(1,0){88}}
\put(261,94){\vector(1,-1){88}}
\put(261,288){\vector(1,-2){88}}
\put(261,12){\vector(1,2){88}}
\put(261,206){\vector(1,1){88}}
\put(261,300){\vector(1,0){88}}
% section 3 : cumulative counts
% 0 1
% 1 1
% 2 1
% 3 2
\put(346,-50){\makebox(50,30){\small\bf{1}}}
\put(346,120){\makebox(50,30){\small\bf{1}}}
\put(346,220){\makebox(50,30){\small\bf{1}}}
\put(346,320){\makebox(50,30){\small\bf{2}}}
%
% total count
%
\put(366,400){\makebox(0,0)[t]{\small{$M_{3}\eq 5$}}}
%
% starting circles
%
\multiput(366,0)(0,100){4}{\circle{26}}
%
% labels for circles
%
\put(366,0){\makebox(0,0){\tiny{00}}}
\put(366,100){\makebox(0,0){\tiny{0}}}
\put(366,200){\makebox(0,0){\tiny{1}}}
\put(366,300){\makebox(0,0){\tiny{11}}}
%
% lines
%
\put(383,0){\vector(1,0){88}}
\put(383,94){\vector(1,-1){88}}
\put(383,288){\vector(1,-2){88}}
\put(383,12){\vector(1,2){88}}
\put(383,206){\vector(1,1){88}}
\put(383,300){\vector(1,0){88}}
% section 4 : cumulative counts
% 0 2
% 1 2
% 2 1
% 3 3
\put(468,-50){\makebox(50,30){\small\bf{2}}}
\put(468,120){\makebox(50,30){\small\bf{2}}}
\put(468,220){\makebox(50,30){\small\bf{1}}}
\put(468,320){\makebox(50,30){\small\bf{3}}}
%
% total count
%
\put(488,400){\makebox(0,0)[t]{\small{$M_{4}\eq 8$}}}
%
% starting circles
%
\multiput(488,0)(0,100){4}{\circle{26}}
%
% labels for circles
%
\put(488,0){\makebox(0,0){\tiny{00}}}
\put(488,100){\makebox(0,0){\tiny{0}}}
\put(488,200){\makebox(0,0){\tiny{1}}}
\put(488,300){\makebox(0,0){\tiny{11}}}
%
% lines
%
\put(505,0){\vector(1,0){88}}
\put(505,94){\vector(1,-1){88}}
\put(505,288){\vector(1,-2){88}}
\put(505,12){\vector(1,2){88}}
\put(505,206){\vector(1,1){88}}
\put(505,300){\vector(1,0){88}}
% section 5 : cumulative counts
% 0 4
% 1 3
% 2 2
% 3 4
\put(590,-50){\makebox(50,30){\small\bf{4}}}
\put(590,120){\makebox(50,30){\small\bf{3}}}
\put(590,220){\makebox(50,30){\small\bf{2}}}
\put(590,320){\makebox(50,30){\small\bf{4}}}
%
% total count
%
\put(610,400){\makebox(0,0)[t]{\small{$M_{5}\eq 13$}}}
%
% starting circles
%
\multiput(610,0)(0,100){4}{\circle{26}}
%
% labels for circles
%
\put(610,0){\makebox(0,0){\tiny{00}}}
\put(610,100){\makebox(0,0){\tiny{0}}}
\put(610,200){\makebox(0,0){\tiny{1}}}
\put(610,300){\makebox(0,0){\tiny{11}}}
%
% lines
%
\put(627,0){\vector(1,0){88}}
\put(627,94){\vector(1,-1){88}}
\put(627,288){\vector(1,-2){88}}
\put(627,12){\vector(1,2){88}}
\put(627,206){\vector(1,1){88}}
\put(627,300){\vector(1,0){88}}
% section 6 : cumulative counts
% 0 7
% 1 4
% 2 4
% 3 6
\put(712,-50){\makebox(50,30){\small\bf{7}}}
\put(712,120){\makebox(50,30){\small\bf{4}}}
\put(712,220){\makebox(50,30){\small\bf{4}}}
\put(712,320){\makebox(50,30){\small\bf{6}}}
%
% total count
%
\put(732,400){\makebox(0,0)[t]{\small{$M_{6}\eq 21$}}}
%
% starting circles
%
\multiput(732,0)(0,100){4}{\circle{26}}
%
% labels for circles
%
\put(732,0){\makebox(0,0){\tiny{00}}}
\put(732,100){\makebox(0,0){\tiny{0}}}
\put(732,200){\makebox(0,0){\tiny{1}}}
\put(732,300){\makebox(0,0){\tiny{11}}}
%
% lines
%
\put(749,0){\vector(1,0){88}}
\put(749,94){\vector(1,-1){88}}
\put(749,288){\vector(1,-2){88}}
\put(749,12){\vector(1,2){88}}
\put(749,206){\vector(1,1){88}}
\put(749,300){\vector(1,0){88}}
% section 7 : cumulative counts
% 0 11
% 1 6
% 2 7
% 3 10
\put(834,-50){\makebox(50,30){\small\bf{11}}}
\put(834,120){\makebox(50,30){\small\bf{6}}}
\put(834,220){\makebox(50,30){\small\bf{7}}}
\put(834,320){\makebox(50,30){\small\bf{10}}}
%
% total count
%
\put(854,400){\makebox(0,0)[t]{\small{$M_{7}\eq 34$}}}
%
% starting circles
%
\multiput(854,0)(0,100){4}{\circle{26}}
%
% labels for circles
%
\put(854,0){\makebox(0,0){\tiny{00}}}
\put(854,100){\makebox(0,0){\tiny{0}}}
\put(854,200){\makebox(0,0){\tiny{1}}}
\put(854,300){\makebox(0,0){\tiny{11}}}
%
% lines
%
\put(871,0){\vector(1,0){88}}
\put(871,94){\vector(1,-1){88}}
\put(871,288){\vector(1,-2){88}}
\put(871,12){\vector(1,2){88}}
\put(871,206){\vector(1,1){88}}
\put(871,300){\vector(1,0){88}}
% section 8 : cumulative counts
% 0 17
% 1 10
% 2 11
% 3 17
\put(956,-50){\makebox(50,30){\small\bf{17}}}
\put(956,120){\makebox(50,30){\small\bf{10}}}
\put(956,220){\makebox(50,30){\small\bf{11}}}
\put(956,320){\makebox(50,30){\small\bf{17}}}
%
% total count
%
\put(976,400){\makebox(0,0)[t]{\small{$M_{8}\eq 55$}}}
%
% end circles
%
\multiput(976,0)(0,100){4}{\circle{26}}
%
% labels for circles
%
\put(976,0){\makebox(0,0){\tiny{00}}}
\put(976,100){\makebox(0,0){\tiny{0}}}
\put(976,200){\makebox(0,0){\tiny{1}}}
\put(976,300){\makebox(0,0){\tiny{11}}}
%
\end{picture}
%
\\ \midrule
\raisebox{1.8in}{(c) Channel C}&
%
% trellis section written by trellis.p
%
\setlength{\unitlength}{0.012cm}
\begin{picture}(999,510)(-25,-55)
%
% starting circles
%
\multiput(0,0)(0,100){1}{\circle{26}}
%
% labels for circles
%
\put(0,0){\makebox(0,0){\tiny{00}}}
%
% lines
%
\put(17,12){\vector(1,2){88}}
% section 1 : cumulative counts
% 0 0
% 1 0
% 2 1
% 3 0
\put(102,220){\makebox(50,30){\small\bf{1}}}
%
% total count
%
\put(122,400){\makebox(0,0)[t]{\small{$M_{1}\eq 1$}}}
%
% starting circles
%
\multiput(122,0)(0,100){4}{\circle{26}}
%
% labels for circles
%
\put(122,0){\makebox(0,0){\tiny{00}}}
\put(122,100){\makebox(0,0){\tiny{0}}}
\put(122,200){\makebox(0,0){\tiny{1}}}
\put(122,300){\makebox(0,0){\tiny{11}}}
%
% lines
%
\put(139,94){\vector(1,-1){88}}
\put(139,194){\vector(1,-1){88}}
\put(139,288){\vector(1,-2){88}}
\put(139,12){\vector(1,2){88}}
\put(139,106){\vector(1,1){88}}
\put(139,206){\vector(1,1){88}}
% section 2 : cumulative counts
% 0 0
% 1 1
% 2 0
% 3 1
\put(224,120){\makebox(50,30){\small\bf{1}}}
\put(224,320){\makebox(50,30){\small\bf{1}}}
%
% total count
%
\put(244,400){\makebox(0,0)[t]{\small{$M_{2}\eq 2$}}}
%
% starting circles
%
\multiput(244,0)(0,100){4}{\circle{26}}
%
% labels for circles
%
\put(244,0){\makebox(0,0){\tiny{00}}}
\put(244,100){\makebox(0,0){\tiny{0}}}
\put(244,200){\makebox(0,0){\tiny{1}}}
\put(244,300){\makebox(0,0){\tiny{11}}}
%
% lines
%
\put(261,94){\vector(1,-1){88}}
\put(261,194){\vector(1,-1){88}}
\put(261,288){\vector(1,-2){88}}
\put(261,12){\vector(1,2){88}}
\put(261,106){\vector(1,1){88}}
\put(261,206){\vector(1,1){88}}
% section 3 : cumulative counts
% 0 1
% 1 1
% 2 1
% 3 0
\put(346,-50){\makebox(50,30){\small\bf{1}}}
\put(346,120){\makebox(50,30){\small\bf{1}}}
\put(346,220){\makebox(50,30){\small\bf{1}}}
%
% total count
%
\put(366,400){\makebox(0,0)[t]{\small{$M_{3}\eq 3$}}}
%
% starting circles
%
\multiput(366,0)(0,100){4}{\circle{26}}
%
% labels for circles
%
\put(366,0){\makebox(0,0){\tiny{00}}}
\put(366,100){\makebox(0,0){\tiny{0}}}
\put(366,200){\makebox(0,0){\tiny{1}}}
\put(366,300){\makebox(0,0){\tiny{11}}}
%
% lines
%
\put(383,94){\vector(1,-1){88}}
\put(383,194){\vector(1,-1){88}}
\put(383,288){\vector(1,-2){88}}
\put(383,12){\vector(1,2){88}}
\put(383,106){\vector(1,1){88}}
\put(383,206){\vector(1,1){88}}
% section 4 : cumulative counts
% 0 1
% 1 1
% 2 2
% 3 1
\put(468,-50){\makebox(50,30){\small\bf{1}}}
\put(468,120){\makebox(50,30){\small\bf{1}}}
\put(468,220){\makebox(50,30){\small\bf{2}}}
\put(468,320){\makebox(50,30){\small\bf{1}}}
%
% total count
%
\put(488,400){\makebox(0,0)[t]{\small{$M_{4}\eq 5$}}}
%
% starting circles
%
\multiput(488,0)(0,100){4}{\circle{26}}
%
% labels for circles
%
\put(488,0){\makebox(0,0){\tiny{00}}}
\put(488,100){\makebox(0,0){\tiny{0}}}
\put(488,200){\makebox(0,0){\tiny{1}}}
\put(488,300){\makebox(0,0){\tiny{11}}}
%
% lines
%
\put(505,94){\vector(1,-1){88}}
\put(505,194){\vector(1,-1){88}}
\put(505,288){\vector(1,-2){88}}
\put(505,12){\vector(1,2){88}}
\put(505,106){\vector(1,1){88}}
\put(505,206){\vector(1,1){88}}
% section 5 : cumulative counts
% 0 1
% 1 3
% 2 2
% 3 2
\put(590,-50){\makebox(50,30){\small\bf{1}}}
\put(590,120){\makebox(50,30){\small\bf{3}}}
\put(590,220){\makebox(50,30){\small\bf{2}}}
\put(590,320){\makebox(50,30){\small\bf{2}}}
%
% total count
%
\put(610,400){\makebox(0,0)[t]{\small{$M_{5}\eq 8$}}}
%
% starting circles
%
\multiput(610,0)(0,100){4}{\circle{26}}
%
% labels for circles
%
\put(610,0){\makebox(0,0){\tiny{00}}}
\put(610,100){\makebox(0,0){\tiny{0}}}
\put(610,200){\makebox(0,0){\tiny{1}}}
\put(610,300){\makebox(0,0){\tiny{11}}}
%
% lines
%
\put(627,94){\vector(1,-1){88}}
\put(627,194){\vector(1,-1){88}}
\put(627,288){\vector(1,-2){88}}
\put(627,12){\vector(1,2){88}}
\put(627,106){\vector(1,1){88}}
\put(627,206){\vector(1,1){88}}
% section 6 : cumulative counts
% 0 3
% 1 4
% 2 4
% 3 2
\put(712,-50){\makebox(50,30){\small\bf{3}}}
\put(712,120){\makebox(50,30){\small\bf{4}}}
\put(712,220){\makebox(50,30){\small\bf{4}}}
\put(712,320){\makebox(50,30){\small\bf{2}}}
%
% total count
%
\put(732,400){\makebox(0,0)[t]{\small{$M_{6}\eq 13$}}}
%
% starting circles
%
\multiput(732,0)(0,100){4}{\circle{26}}
%
% labels for circles
%
\put(732,0){\makebox(0,0){\tiny{00}}}
\put(732,100){\makebox(0,0){\tiny{0}}}
\put(732,200){\makebox(0,0){\tiny{1}}}
\put(732,300){\makebox(0,0){\tiny{11}}}
%
% lines
%
\put(749,94){\vector(1,-1){88}}
\put(749,194){\vector(1,-1){88}}
\put(749,288){\vector(1,-2){88}}
\put(749,12){\vector(1,2){88}}
\put(749,106){\vector(1,1){88}}
\put(749,206){\vector(1,1){88}}
% section 7 : cumulative counts
% 0 4
% 1 6
% 2 7
% 3 4
\put(834,-50){\makebox(50,30){\small\bf{4}}}
\put(834,120){\makebox(50,30){\small\bf{6}}}
\put(834,220){\makebox(50,30){\small\bf{7}}}
\put(834,320){\makebox(50,30){\small\bf{4}}}
%
% total count
%
\put(854,400){\makebox(0,0)[t]{\small{$M_{7}\eq 21$}}}
%
% starting circles
%
\multiput(854,0)(0,100){4}{\circle{26}}
%
% labels for circles
%
\put(854,0){\makebox(0,0){\tiny{00}}}
\put(854,100){\makebox(0,0){\tiny{0}}}
\put(854,200){\makebox(0,0){\tiny{1}}}
\put(854,300){\makebox(0,0){\tiny{11}}}
%
% lines
%
\put(871,94){\vector(1,-1){88}}
\put(871,194){\vector(1,-1){88}}
\put(871,288){\vector(1,-2){88}}
\put(871,12){\vector(1,2){88}}
\put(871,106){\vector(1,1){88}}
\put(871,206){\vector(1,1){88}}
% section 8 : cumulative counts
% 0 6
% 1 11
% 2 10
% 3 7
\put(956,-50){\makebox(50,30){\small\bf{6}}}
\put(956,120){\makebox(50,30){\small\bf{11}}}
\put(956,220){\makebox(50,30){\small\bf{10}}}
\put(956,320){\makebox(50,30){\small\bf{7}}}
%
% total count
%
\put(976,400){\makebox(0,0)[t]{\small{$M_{8}\eq 34$}}}
%
% end circles
%
\multiput(976,0)(0,100){4}{\circle{26}}
%
% labels for circles
%
\put(976,0){\makebox(0,0){\tiny{00}}}
\put(976,100){\makebox(0,0){\tiny{0}}}
\put(976,200){\makebox(0,0){\tiny{1}}}
\put(976,300){\makebox(0,0){\tiny{11}}}
%
\end{picture}
%
\\ \bottomrule
\end{tabular}
\end{center}
}{%
\caption[a]{Counting the number of paths in the trellises of channels A, B, and C.
We assume that at the start the first bit is preceded
by {\tt 00}, so that for channels A and B,
any initial character is permitted, but
for channel C, the first character must be
a {\tt 1}.}
\label{fig.state1count}
\label{fig.state101count}
}%
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}
\figuremargin{%
\begin{center}
\begin{tabular}{*{1}{c@{}c}}
&
% written by fibonacci.p
\begin{tabular}{*{4}{r}c} \toprule
\multicolumn{1}{r}{$n$} &
\multicolumn{1}{r}{$M_n$} &
\multicolumn{1}{c}{$M_n/M_{n-1}$} &
\multicolumn{1}{l}{$\log_2 M_n$} &
\multicolumn{1}{c}{$\frac{1}{n} \log_2 M_n$} \\[0.051in] \midrule
1 & 2 & & 1.0 & 1.00 \\
2 & 3 & 1.500 & 1.6 & 0.79 \\
3 & 5 & 1.667 & 2.3 & 0.77 \\
4 & 8 & 1.600 & 3.0 & 0.75 \\
5 & 13 & 1.625 & 3.7 & 0.74 \\
6 & 21 & 1.615 & 4.4 & 0.73 \\
7 & 34 & 1.619 & 5.1 & 0.73 \\
8 & 55 & 1.618 & 5.8 & 0.72 \\
9 & 89 & 1.618 & 6.5 & 0.72 \\
10 & 144 & 1.618 & 7.2 & 0.72 \\
11 & 233 & 1.618 & 7.9 & 0.71 \\
12 & 377 & 1.618 & 8.6 & 0.71 \\
100 & $9\!\times\! 10^{20}$ & 1.618 & 69.7 & 0.70 \\
200 & $7\!\times\! 10^{41}$ & 1.618 & 139.1 & 0.70 \\
300 & $6\!\times\! 10^{62}$ & 1.618 & 208.5 & 0.70 \\
400 & $5\!\times\! 10^{83}$ & 1.618 & 277.9 & 0.69 \\
\bottomrule
\end{tabular}
% but needs to be changed to use toprule not hline
\\
\end{tabular}
\end{center}
}{%
\caption[a]{Counting the number of paths in the trellis of channel A.}
\label{fig.fibonacci}
}%
\end{figure}
% .p
% see noiseless
% source makeall
We can also represent the state diagram by a
{\dem\ind{trellis section}}, which shows two successive states
in time at two successive horizontal locations (\figref{fig.state1}b).
The state of the transmitter at time $n$ is called $s_n$.
% \footnote{Change $s_n$ to $i_n$?} NO
The set of possible state sequences can be represented
by a {\dem\ind{trellis}} as shown in \figref{fig.state1}c.
A valid sequence corresponds to a path through the
trellis, and the number of valid sequences is the
number of paths.
For the purpose of counting how many paths there are
through the trellis, we can ignore the labels on the
edges and summarize the trellis section by
the {\dem\ind{\connectionmatrix}\/} $\bcmA$, in which $\cmA_{s's} = 1$
if there is an edge from state $s$ to $s'$, and $\cmA_{s's} = 0$
otherwise (\figref{fig.state1}d).
\Figref{fig.state101} shows the state diagrams, trellis sections
and \connectionmatrices\ for channels B and C.
% So, let's count!
Let's count the number of paths for channel A by message-passing
in its trellis.
\Figref{fig.state1count123} shows the first few steps of this counting
process, and
\figref{fig.state1count}a shows the number of paths ending in each state
after $n$ steps for $n=1, \ldots, 8$.
The total number of paths of length $n$, $M_n$, is shown along the top.
We recognize $M_n$ as the Fibonacci series.
\exercisxB{1}{ex.fibo}{
Show that the ratio of successive terms in the \ind{Fibonacci} series tends
to the \ind{golden ratio},
\beq
\gamma \equiv \frac{1 + \sqrt{5}}{2} = 1.618 .
\eeq
}
Thus, to within a constant factor, $M_N$ scales as $M_N \sim \gamma^N$
as $N \rightarrow \infty$, so the capacity of channel A is
\beq
C = \lim \frac{1}{N} \log_2 \!\left[ \mbox{constant} \cdot \gamma^N \right]
= \log_2 \gamma = \log_2 1.618 = 0.694 .
\eeq
How can we describe what we just did?
The count of the number of paths is a vector $\bc^{(n)}$; we can obtain
$\bc^{(n+1)}$ from $\bc^{(n)}$ using:
\beq
\bc^{(n+1)} = \bAcm \bc^{(n)} .
\eeq
So
\beq
\bc^{(N)} = \bAcm^{\!N} \bc^{(0)} ,
\eeq
where $\bc^{(0)}$ is the state count before any symbols are transmitted.
In \figref{fig.state1count} we assumed $\bc^{(0)} = [ 0 , 1]^{\T}$, \ie, that
either of the two symbols is permitted at the outset.
The total number of paths is $M_n = \sum_s c^{(n)}_s = \bc^{(n)} \cdot \bn$,
where $\bn$ denotes the vector whose components are all 1.
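As a check on this recursion, here is a worked instance for channel A.
It is only an illustrative sketch: it assumes the state ordering $(1,0)$
used for the \connectionmatrix\ in \figref{fig.state1}d, so that the first
component of $\bc^{(n)}$ counts the paths ending in state 1 and the second
counts the paths ending in state 0.
\beq
\bAcm = \left[ \begin{array}{cc} 0 & 1 \\ 1 & 1 \end{array} \right] , \:\:\:
\bc^{(0)} = \left[ \begin{array}{c} 0 \\ 1 \end{array} \right] , \:\:\:
\bc^{(1)} = \left[ \begin{array}{c} 1 \\ 1 \end{array} \right] , \:\:\:
\bc^{(2)} = \left[ \begin{array}{c} 1 \\ 2 \end{array} \right] , \:\:\:
\bc^{(3)} = \left[ \begin{array}{c} 2 \\ 3 \end{array} \right] .
\eeq
Summing the components of each vector gives $M_1 = 2$, $M_2 = 3$ and $M_3 = 5$,
in agreement with the counts shown in \figref{fig.state1count123}.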
In the limit, $\bc^{(N)}$ becomes dominated by the principal right-eigenvector
of $\bAcm$:
\beq
\bc^{(N)} \rightarrow \mbox{constant} \cdot \lambda_1^N \eR^{(0)} .
\eeq
Here, $\lambda_1$ is the principal eigenvalue of $\bAcm$.
So to find the capacity of any constrained channel,
% defined by a \connectionmatrix,
all we need to do is find the
principal eigenvalue, $\l_1$, of its \connectionmatrix.
Then
\beq
C = \log_2 \l_1 .
\eeq
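For channel A this recipe can be carried out by hand.
The following is a sketch, assuming the \connectionmatrix\ shown in
\figref{fig.state1}d and writing $\mathbf{I}$ for the $2 \times 2$ identity matrix.
The characteristic equation is
\beq
\det \left( \bAcm - \lambda \mathbf{I} \right)
= \det \left[ \begin{array}{cc} -\lambda & 1 \\ 1 & 1 - \lambda \end{array} \right]
= \lambda^2 - \lambda - 1 = 0 ,
\eeq
whose roots are $\lambda = (1 \pm \sqrt{5})/2$.
The larger root is $\lambda_1 = \gamma \simeq 1.618$, so $C = \log_2 \gamma \simeq 0.694$,
the same value that was obtained above by counting paths.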
\section{Back to our model channels}
Comparing \figref{fig.state1count}a and figures \ref{fig.state101count}b and c,
it looks as if channels B and C have the same
capacity as channel A. The principal eigenvalues of
the three trellises are the same (the eigenvectors for
channels A and B are given at the bottom of \tabref{tab.eigsforyou},
\pref{tab.eigsforyou}).
% see section \ref{sec.rll.eigenvectors}).
And indeed the channels are intimately related.
%\begin{figure}[htbp]
%\figuremargin{%
\marginfig{
\begin{tabular}{c}
\mbox{\input{convol/tex/k1_1_3sr.tex}}
\\%&
\mbox{\input{convol/tex/k1_1_3snFLIP.tex}} % has s and t reversed from normal
\\
%
\end{tabular}
%}{%
\caption[a]{An \ind{accumulator} and a \ind{differentiator}.}
% , with $s$ and $t$ possibly mislabelled.}
}%
%\end{figure}
\subsubsection{Equivalence of channels A and B}
If we take any valid string $\bs$ for channel A and pass it through
an {\dem\ind{accumulator}}, obtaining $\bt$ defined by:
\beq
\begin{array}{rclc}
t_1 &=& s_1 \\
t_{n} &=& t_{n-1} + s_{n} \mod 2 & \mbox{for $n \geq 2$,}
\end{array}
\eeq
then the resulting string is a valid string for channel B, because
there are no {\tt 11}s in $\bs$, so there are no isolated digits
in $\bt$.
The accumulator is an invertible operator, so, similarly,
any valid string $\bt$ for channel B can be mapped onto a
valid string $\bs$ for channel A through the
{\dem{binary \ind{differentiator}}},
\beq
\begin{array}{rclc}
s_1 &=& t_1 \\
s_{n} &=& t_{n} - t_{n-1} \mod 2 & \mbox{for $n \geq 2$.}
\end{array}
\eeq
Because $+$ and $-$ are equivalent in modulo 2 arithmetic,
the differentiator is also a blurrer, convolving the source stream
with the filter $(1,1)$.
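A short worked example may help; the particular string is an arbitrary
choice made here for illustration.
Take the channel-A string $\bs = {\tt 0010010100}$, which contains no {\tt 11}.
Passing it through the accumulator gives
\beq
\bt = {\tt 0011100111} ,
\eeq
which contains no isolated digits, as channel B requires.
Applying the differentiator to $\bt$, the difference $t_{n} - t_{n-1} \mod 2$
is non-zero only at $n = 3$, $6$ and $8$, where $\bt$ changes value,
so we recover $\bs = {\tt 0010010100}$.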
% (A bit surprising that blurring is invertible?)
Channel C is also intimately related to channels A and B.
\exercissxB{1}{ex.abc.compare}{
% It looks as if channels B and C have the same
% capacity as channel A. Show this is so by showing that (apart from edge effects)
% all three channels are actually equivalent channels, in that
% any valid string for one channel can be mapped onto valid
% strings for the others.
What is the relationship of channel C to channels A and B?
}
\section{Practical communication over constrained channels}
OK, how to do it in practice? Since all three channels are equivalent, we can
concentrate on channel A.
%
\subsection{Fixed-length solutions}
% This code
We start with explicitly-enumerated codes.
The code in \tabref{tab.eightwords}%
\margintab{
\begin{center}
\begin{tabular}{cc} \toprule
% this was m, not s.........feb 2000
$s$ & $c(s)$ \\ \midrule
1 & {\tt 00000} \\
2 & {\tt 10000} \\
3 & {\tt 01000} \\
4 & {\tt 00100} \\
5 & {\tt 00010} \\
6 & {\tt 10100} \\
7 & {\tt 01010} \\
8 & {\tt 10010} \\ \bottomrule
\end{tabular}
\end{center}
\caption{A runlength-limited code for channel A.}
\label{tab.eightwords}
}
achieves a rate of $\dfrac{3}{5} = 0.6$.
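To see where this rate comes from, here is a brief sketch based on the
counts in \figref{fig.state1count}a.
The eight codewords are the strings of length 5 that contain no {\tt 11}
and end in state 0; because each codeword ends in {\tt 0},
any concatenation of codewords is still a valid string for channel A.
Eight codewords convey $\log_2 8 = 3$ source bits in 5 transmitted bits, so
\beq
R = \frac{\log_2 8}{5} = \frac{3}{5} = 0.6 ,
\eeq
which is roughly 86\% of the capacity $C = 0.694$ found above.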
% added Sun 3/2/02
\exercissxB{1}{ex.con8.10}{
Similarly, enumerate all strings of length 8 that end in the zero state.
(There are 34 of them.)
Hence show that we can map 5 bits (32 source strings) to 8 transmitted
bits and achieve
rate $\dfrac{5}{8} = 0.625$.
What rate can be achieved by mapping an integer
number of source bits to
% Crank up to
$N=16$ transmitted bits?
}
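The enumeration is easily done by computer. Here is a minimal sketch, with a hypothetical function name, that lists the strings of length $N$ that contain no {\tt 11} and end in the zero state, then reports how many whole source bits each block can carry. Ending every codeword in {\tt 0} guarantees that concatenated codewords never create a {\tt 11} at a boundary.

```python
from itertools import product
from math import floor, log2

def codewords(n):
    """Length-n strings with no 11 that end in 0 (the zero state)."""
    words = []
    for bits in product("01", repeat=n):
        w = "".join(bits)
        if "11" not in w and w.endswith("0"):
            words.append(w)
    return words

for n in (5, 8, 16):
    count = len(codewords(n))
    k = floor(log2(count))          # whole source bits per block
    print(n, count, k, k / n)
```

For $N=5$ the list reproduces the eight codewords of \tabref{tab.eightwords}.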
\subsection{Optimal variable-length solution}
% {\em It is probably confusing that I have used
% $s$ to run over source message names, and $s$ to
% run over states in the trellis. Let's change to $u$ or $i$
% for trellis states?}
The optimal way to convey information over the constrained
channel is to find the
optimal transition probabilities
for all points in the trellis,
$Q_{s'|s}$, and make transitions with these probabilities.
When discussing channel A, we showed that a sparse source with density $f=0.38$,
driving code $C_2$,
would achieve capacity.\index{arithmetic coding!uses beyond compression}
And we know how to make \ind{sparsifiers} (\chapterref{ch4}):
we design an arithmetic code that is optimal for compressing
a sparse source; then its associated decoder gives an optimal
mapping from dense (\ie, random binary) strings to sparse strings.
%
% improve this reference to ch 4
%
The task of finding the optimal probabilities is given
as an exercise.
\exercisxC{3}{ex.optimal.constrained}{
Show that the
optimal transition probabilities $\bQ$ can be found as follows.
Find the principal right- and left-eigenvectors of
$\bcmA$, that is, the
solutions of $\bcmA \be^{(R)} = \l \be^{(R)}$
and ${\be^{(L)}}^{\T}\bcmA = \l {\be^{(L)}}^{\T}$
with largest eigenvalue $\l$.
Then construct a matrix $\bQ$ whose invariant distribution
is proportional to $e^{(R)}_i e^{(L)}_i$, namely
% . This is given by
\beq
Q_{s'|s} = \frac{e^{(L)}_{s'} \cmA_{s's} }{\l e^{(L)}_s } .
\label{eq.optimalQ}
\eeq
[Hint: \exerciseref{ex.path2} might give helpful
cross-fertilization here.]
% in message.tex
}
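To make equation (\ref{eq.optimalQ}) concrete, here is a minimal sketch that evaluates it for channel A. It finds the left eigenvector by power iteration, which is a choice made here for simplicity and assumes the principal eigenvalue is strictly dominant; any eigenvector routine would do. The function names are hypothetical, and the matrix is stored so that {\tt A[sp][s]} is 1 when the transition $s \rightarrow s'$ is allowed.

```python
from math import log2

def left_principal(A, iters=500):
    """Power iteration for lam and e^(L), where e^(L)^T A = lam e^(L)^T.

    A[sp][s] is 1 if the transition s -> sp is allowed, else 0.
    """
    n = len(A)
    e = [1.0 / n] * n
    lam = 1.0
    for _ in range(iters):
        w = [sum(e[sp] * A[sp][s] for sp in range(n)) for s in range(n)]
        lam = sum(w)                 # e sums to 1, so this estimates lam
        e = [x / lam for x in w]
    return lam, e

def optimal_Q(A):
    """Q[sp][s] = e_sp A[sp][s] / (lam e_s)."""
    lam, e = left_principal(A)
    n = len(A)
    Q = [[e[sp] * A[sp][s] / (lam * e[s]) for s in range(n)] for sp in range(n)]
    return lam, Q

# Channel A: the state is the last bit sent; the transition 1 -> 1 is forbidden.
A = [[1, 1],
     [1, 0]]
lam, Q = optimal_Q(A)
print(lam, log2(lam))
print(Q[1][0])   # probability of a 1 after a 0
```

Each column of {\tt Q} sums to one, and the probability of sending a {\tt 1} after a {\tt 0} comes out close to the density $f = 0.38$ quoted above.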
\exercissxB{3}{ex.show.trellis.entropy}{
Show that when sequences are generated using the
optimal transition probability
matrix (\ref{eq.optimalQ}),
the entropy of the resulting sequence is
asymptotically $\log_2 \l$ per symbol.
[Hint: consider the conditional entropy of just
one symbol given the previous one, assuming the previous
one's distribution is the invariant distribution.]
}
In practice, we would probably use\index{channel!constrained}
finite-precision approximations to the optimal
variable-length solution.\index{variable-length code}\index{code!for constrained channel!variable-length}
One might dislike variable-length solutions
because of the resulting unpredictability of the actual
encoded length in any particular case. Perhaps in
some applications we would like a guarantee that the
encoded length of a source file of size $N$ bits will be less than
a given length such as $N/(C-\epsilon)$.
For example,
a \ind{disk drive} is easier to control if
all blocks of 512 bytes are known to take exactly the same amount of
disk real-estate.
% \index{disk drive}
%
For some constrained channels we can make a simple modification to
our variable-length encoding and offer such a guarantee,
as follows\nocite{MacKay00RLLT}.
We find two codes, two mappings of binary strings to variable-length
encodings, having the property that
for any source string $\bx$, if
the encoding of $\bx$ under the first code is shorter than average,
then the encoding of $\bx$ under the second code is
% refer to rllt.tex
longer than average, and {\em vice versa}.
Then to transmit a string $\bx$ we encode the whole string with both codes
and send whichever encoding is shorter,
prepended by a suitably encoded single bit to convey which
of the two codes is being used.
% \section{Exercises}
\amarginfig{c}{\small
\begin{center}
\begin{tabular}{@{}cc}
\raisebox{0.215in}{$\input{noiseless/tex/mfile_rl2.tex}$ }% a simple array
&
\raisebox{-0.0in}{% was -0.15in
\mbox{\psfig{figure=noiseless/figs/staterl2.ps,angle=-90,height=0.8in}}
}
% \hspace{0.2in}
\\
%\hspace{0.2in}
\raisebox{0.3in}{$\input{noiseless/tex/mfile_rl3.tex}$}
&
\raisebox{-0.15in}{%
\mbox{\psfig{figure=noiseless/figs/staterl3.ps,angle=-90,height=1in}}
}
\\
\end{tabular}
\end{center}
%}{%
\caption[a]{State diagrams
and \connectionmatrices\ for channels with maximum runlengths for
{\tt 1}s equal to 2 and 3. }
\label{fig.state_rl23}
}%
%\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Problems here?
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
\exercissxB{3C}{ex.rl.small}{
How
%\begin{figure}
%\figuremargin{%
many valid sequences of length 8 starting
% ending
with a {\tt 0}
are there for the run-length-limited channels
% $0^+1^{1,2}$ and $0^+1^{1,3}$ (
shown in \figref{fig.state_rl23}?
What are the capacities of these channels?
% eigs are 1.839286 -> 0.879
% and 1.9276 -> 0.947
Using a computer, find
% Find
the matrices $\bQ$ for generating a random path
through the trellises of the
% run-length-limited channels $0^+1^{1}$ (channel A), $0^+1^{1,2}$ and $0^+1^{1,3}$.
channel A, and the two run-length-limited channels
shown in \figref{fig.state_rl23}.
}
% BORDERLINE
\exercissxB{3}{ex.rl.limit}{
Consider the run-length-limited channel in which
any length of run of {\tt 0}s is permitted,
and the maximum run length of {\tt 1}s is a large
number $L$ such as nine or ninety.
% (Nine is large enough for our purposes.)
Estimate the capacity of this channel. (Give the first two terms in a series
expansion involving $L$.)
What, roughly, is the form of the optimal
matrix $\bQ$ for generating a random path through
the trellis of this channel? Focus on the
values of the elements $Q_{1|0}$, the probability of
generating a {\tt{1}} given a preceding {\tt0},
and $Q_{L|L-1}$, the probability
of generating a {\tt1} given a preceding run of $L\!-\!1$ {\tt1}s.
Check your answer by explicit computation
for the
% $0^+1^{1,9}$
channel in which
% the string {\tt 1111111111} is forbidden, \ie,
the maximum \ind{runlength} of {\tt 1}s is nine.
% my code is in the file matlabs, see also qrl9h.dat and qrl9h.tex in eigen
}
\section{Variable symbol durations\nonexaminable}
We can add a further frill\index{channel!variable symbol durations}
to the task of communicating over constrained channels
by assuming that the symbols we send have different
{\em{durations}\/}, and that our aim is to
communicate at the maximum possible rate per unit time.
Such channels can come in two flavours:
unconstrained, and constrained.
\subsection{Unconstrained channels with variable symbol durations}
We
% already
encountered an
unconstrained noiseless channel with variable symbol durations
in \exerciseref{ex.phone_chat}.
% Each symbol had a different duration.
Solve that problem, and you've done this topic.\index{source code!variable symbol durations}
The task is to determine the optimal frequencies with which
the symbols should be used, given their durations.
There is a nice analogy between this task
and the task of designing an optimal symbol code (\chref{ch.two}).
When we make a binary symbol code for a source
with unequal probabilities
$p_i$, the optimal message lengths are $ l^*_i = \log_2 \dfrac{1}{p_i}$,
so
\beq
p_i = 2^{-l^*_i}.
\eeq
Similarly, when we have a channel whose symbols have durations
$l_i$ (in some units of time), the optimal probability with which those
symbols should be used is
\beq
p^*_i = 2^{ - \beta l_i },
\eeq
where $\beta$ is the capacity of the channel in bits per unit time.
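As a small worked example (the two durations are chosen here purely for illustration), take a channel with two symbols of durations $l_1 = 1$ and $l_2 = 2$. Requiring the probabilities $p^*_i$ to sum to one fixes $\beta$:
\beq
2^{-\beta} + 2^{-2\beta} = 1
\:\:\Rightarrow\:\: 2^{-\beta} = \frac{\sqrt{5}-1}{2} \simeq 0.618
\:\:\Rightarrow\:\: \beta = \log_2 \frac{1+\sqrt{5}}{2} \simeq 0.694 ,
\eeq
so $p^*_1 \simeq 0.618$ and $p^*_2 \simeq 0.382$. These two symbols can be identified with the strings {\tt 0} and {\tt 10} from which valid strings for channel A are built, which is why $\beta$ here agrees with the capacity of channel A.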
\subsection{Constrained channels with variable symbol durations}
% Then there's the general problem of a channel with constraints
% and with \ind{variable duration symbols}. [\eg, \ind{Morse code},
% where dots and dashes must be separated by either short or
% long spaces.]
%
% {\em MORE HERE. Add an exercise to solve the general case
% with both constraints and variable duration.}
Once you have grasped the\index{channel!constrained} preceding topics in this chapter,
you should be able to figure out how to define and
find the capacity of these, the trickiest
constrained channels.
\exercisxC{3}{ex.morse}{
A classic example of a constrained channel with variable symbol durations
is the `\ind{Morse}' channel, whose symbols are
\begin{center}
\begin{tabular}{ll}
the dot & {\tt{d}}, \\
the dash & {\tt{D}}, \\
the short space (used between letters in Morse code) & {\tt{s}}, and \\
the long space (used between words) & {\tt{S}}; \\
\end{tabular}
\end{center}
the constraints are that
% dots and dashes may only be followed by spaces, and
spaces may only be followed by dots and dashes.
Find the capacity of this channel in bits per unit time
assuming (a) that
all four symbols have equal durations; or (b) that the
symbol durations are 2, 4, 3 and 6 time units respectively.
}
\exercisxC{4}{ex.morse2}{
How well-designed is Morse code
for English (with, say, the probability
distribution of \figref{fig.monogram})?
}
%Figure showing state, symbol, duration.
%
% There we used an entropy-maximization method to solve for the
% optimal probability distribution over symbols.
%
% This method works fine for all channels that can be described
% by a single state with a load of edges.
% But if there's several states then
% a new solution is needed.
%
%
% Find by representing the state diagram by a polynomial.
% Exponent is path length.
%
% Another approach is to assume a probability distribution
%
% Write $f(x) = x + x^2$. What does that mean?
%
% PUT ME BACK:
%
%
% \input{tex/rll_fortuitous.tex}
%
%
%& ${\tt 0}^+{\tt 1}^+$ &1
%& ${\tt 0}^+{\tt 1}^1$ &?
%& ${\tt 0}^{2+}{\tt 1}^{2+}$ &?
%& ${\tt 0}^{1,2}{\tt 1}^{1,2}$ &?
%
\exercisxC{3C}{ex.constrainedphysics}{
{\sf How difficult is it to get \ind{DNA} into a narrow tube?}
To an information theorist, the entropy associated with a
\ind{constrained channel} reveals how much information can be conveyed over it.
In \ind{statistical physics}, the same calculations are done for a
different reason: to predict the thermodynamics of polymers,
for example.
As a toy example, consider a \ind{polymer} of length $N$
that can either sit in a constraining \ind{tube}, of width $L$,
or in the open where there are no constraints.
In the open, the polymer adopts a state drawn at random
from the set of one-dimensional random walks, with, say, 3
possible directions per step.
\marginfig{
\begin{center}
\mbox{\psfig{figure=metrop/dna/walk.ps,width=0.4in,angle=180}}
\end{center}
\caption[a]{Model of DNA squashed in a narrow tube.
% The tube's diameter is 10 steps.
The DNA will have a tendency to pop
out of the tube, because, outside the tube, its random walk has greater entropy.
}}
% see _research/drunkard
The entropy of this walk is $\log 3$ per step, \ie, a total of $N \log 3$.
[The \ind{free energy} of the polymer is defined to be $-kT$ times this,
where $T$ is the temperature.]
% # /home/mackay/_research/drunkard/DO.m
% and itp/metrop/dna
In the tube, the polymer's one-dimensional walk
can go in 3 directions unless the wall is in the way, so
the \ind{connection matrix}\index{trellis section} is, for example (if $L=10$),
\[
%\begin{realcenter}\small
%\begin{tabular}{*{10}{c}}
\left[\begin{array}{*{10}{c}}
1 &1 &0 &0 &0 &0 &0 &0 &0 &0\\
1 &1 &1 &0 &0 &0 &0 &0 &0 &0\\
0 &1 &1 &1 &0 &0 &0 &0 &0 &0\\
0 &0 &1 &1 &1 &0 &0 &0 &0 &0\\
0 &0 &0 &1 &1 &1 &0 &0 &0 &0\\
& & & & \makebox[0in][c]{$\ddots$} & \makebox[0in][c]{$\ddots$} & \makebox[0in][c]{$\ddots$} & \\
%
% graveyard.tex
%
% 0 &0 &0 &0 &0 &0 &0 &0 &0 &0 &0 &0 &0 &0 &0 &1 &1 &1 &0 &0\\
% 0 &0 &0 &0 &0 &0 &0 &0 &0 &0 &0 &0 &0 &0 &0 &0 &1 &1 &1 &0\\
0 &0 &0 &0 &0 &0 &0 &1 &1 &1\\
0 &0 &0 &0 &0 &0 &0 &0 &1 &1\\
\end{array}\right].
%\end{tabular}.
%\end{realcenter}
\]
Now, what is the entropy of the polymer?
What is the {\em change\/} in entropy associated
with the polymer entering the tube?
% In DO.m I got 0.0075 nats * N
% or 0.01 bits * N
If possible, obtain an expression as a function of $L$.
Use a computer to find the entropy of the
walk for a particular value of $L$, \eg\ 20,
and plot the probability density of the
polymer's transverse location in the tube.
% in DO.m I found this, which is rather pretty to plot.
% The shape seems to be independent of L.
%
% 0.011169
% 0.022089
% 0.032515
% 0.042215
% 0.050972
% 0.058590
% 0.064900
% 0.069759
% 0.073061
% 0.074730
% 0.074730
% 0.073061
% 0.069759
% 0.064900
% 0.058590
% 0.050972
% 0.042215
% 0.032515
% 0.022089
% 0.011169
%
%
Notice that the difference in capacity between two channels, one constrained
and one unconstrained, is directly proportional to the force required to pull
the DNA into the
tube.\index{connection between!channel capacity and physics}\index{channel!capacity!connection with physics}
}
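The computational part of this exercise can be approached as follows. This is a minimal sketch under stated assumptions, not the author's own code: it uses power iteration on the tridiagonal connection matrix, takes the invariant distribution to be proportional to $e^{(R)}_i e^{(L)}_i$ as in the exercise on optimal transition probabilities, and measures entropy in nats. The function name is hypothetical.

```python
from math import cos, log, pi

def tube_walk(L, iters=5000):
    """Principal eigenvalue and eigenvector of the tube's connection matrix.

    The matrix is tridiagonal with ones on the three central diagonals,
    so one power-iteration step just adds each site's neighbours.
    """
    e = [1.0 / L] * L
    lam = 1.0
    for _ in range(iters):
        w = [(e[i - 1] if i > 0 else 0.0) + e[i] + (e[i + 1] if i < L - 1 else 0.0)
             for i in range(L)]
        lam = sum(w)                 # e sums to 1, so this estimates lam
        e = [x / lam for x in w]
    return lam, e

L = 20
lam, e = tube_walk(L)
entropy_change = log(lam) - log(3)   # nats per step; negative on entering the tube
# The matrix is symmetric, so left and right eigenvectors coincide and
# the invariant distribution is proportional to e_i squared.
z = sum(x * x for x in e)
density = [x * x / z for x in e]
print(lam, entropy_change)
```

The list {\tt density} can be plotted against position to see where in the tube the polymer prefers to sit.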
\dvips
\section{Solutions}% to Chapter \protect\ref{ch.noiseless}'s exercises}
\soln{ex.count.ones}{
A file transmitted by $C_2$ contains, on average, one-third {\tt 1}s
and two-thirds {\tt 0}s.
If $f = 0.38$,
the fraction of {\tt 1}s is
$f/(1+f) =
(\gamma-1.0)/(2\gamma - 1.0) = 0.2764$.
}
\soln{ex.abc.compare}{
A valid string for channel C can be obtained from a valid
string for channel A by first inverting it [${\tt 1} \rightarrow {\tt 0}$;
${\tt 0} \rightarrow {\tt 1}$], then passing it through
an accumulator. These operations are invertible, so any
valid string for C can also be mapped onto a valid string
for A. The only proviso here comes from the
edge effects. If we assume that the first character
transmitted over channel C is preceded by a string of zeroes, so that
the first character is forced to be a {\tt 1} (\figref{fig.state101count}c)
then the two channels are exactly equivalent only if we assume that
channel A's first character must be a zero.
}
% added Sun 3/2/02
\soln{ex.con8.10}{
With $N=16$ transmitted bits,
% 10.6
the largest integer number of
source bits that can be encoded is 10,
so the maximum rate of a fixed length code with $N=16$ is 0.625.
}
% \exercis{ex.show.trellis.entropy}{
\begincuttable
\soln{ex.show.trellis.entropy}{
Let the invariant distribution be
\beq
P(s) = \alpha e^{(L)}_s e^{(R)}_s ,
\eeq
where $\alpha$ is a normalization constant.\marginpar{\small\raggedright{Here,
as in
\chapterref{chtwo}, $S_t$ denotes the ensemble whose random variable
is the state $s_t$.}}
The entropy of $S_t$ given $S_{t-1}$, assuming $S_{t-1}$ comes
from the invariant distribution, is
\beqan
H(S_t|S_{t-1})
&\eq &
- \sum_{s,s'} P(s)P(s'|s) \log P(s'|s)
\\
&\eq &
- \sum_{s,s'} \alpha e^{(L)}_s e^{(R)}_s
\frac{e^{(L)}_{s'} \cmA_{s's} }{\l e^{(L)}_s }
\log
\frac{e^{(L)}_{s'} \cmA_{s's} }{\l e^{(L)}_s }
\eeqan
\beq
%\\
%\lefteqn{%&\eq &
=
- \sum_{s,s'} \alpha \, e^{(R)}_s
\frac{e^{(L)}_{s'} \cmA_{s's} }{\l}
\left[
\log
e^{(L)}_{s'}
+ \log \cmA_{s's} - \log \l - \log e^{(L)}_s
\right] .
%}
\eeq
Now, $\cmA_{s's}$ is either 0 or 1, so the contributions from
the terms proportional to $\cmA_{s's} \log \cmA_{s's}$
are all zero. So
\beqan
H(S_t|S_{t-1})
&\eq & \log \l
- \frac{ \alpha}{\l}
\sum_{s'}
\left( \sum_{s} \cmA_{s's} e^{(R)}_s \right)
e^{(L)}_{s' }
\log
e^{(L)}_{s'} +
\nonumber \\
& &
\frac{ \alpha}{\l}
\sum_{s}
\left( \sum_{s'}
e^{(L)}_{s'} \cmA_{s's} \right)
e^{(R)}_s
\log e^{(L)}_s
\eeqan
\beqan
&\eq &
\log \l
%
- \frac{ \alpha}{\l}
\sum_{s'}
\l e^{(R)}_{s'}
e^{(L)}_{s' }
\log
e^{(L)}_{s'}
+\frac{ \alpha}{\l}
\sum_{s}
\l e^{(L)}_{s}
e^{(R)}_s
\log e^{(L)}_s
\\
&=&
\log \l .
\eeqan
}
\ENDcuttable
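As a numerical sanity check of this result, consider channel A (used here purely as an illustration). Its connection matrix is symmetric, with $\l = \gamma \equiv (1+\sqrt{5})/2$ and $\be^{(L)} = \be^{(R)} \propto (\gamma, 1)$. Equation (\ref{eq.optimalQ}) then gives $Q_{1|0} = 1/\gamma^2 \simeq 0.382$ and $Q_{0|1} = 1$, the invariant distribution is $P(0) = \gamma^2/(1+\gamma^2) \simeq 0.724$, $P(1) \simeq 0.276$, and
\beq
H(S_t|S_{t-1}) = P(0) \, H_2( Q_{1|0} ) + P(1) \times 0
\simeq 0.724 \times 0.959 \simeq 0.694 \:\mbox{bits} ,
\eeq
which agrees with $\log_2 \gamma \simeq 0.694$.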
\soln{ex.rl.small}{
The principal eigenvalues of the
\connectionmatrices\ of the two channels are 1.839
and 1.928.
The capacities ($\log \l$) are 0.879 and 0.947 bits.
% See the eigenvector tables (section \ref{sec.eigenvectors.qrl})
% for the matrices $\bQ$.
% I think this is a ref to eigen.tex,
% see \label{sec.eigenvectors.qrl}% Fri 14/12/01
%
%\begin{center}
%\input{noiseless/tex/tcounts_rl2.tex}
%\\
%\input{noiseless/tex/tcounts_rl3.tex}
%\end{center}
}
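The counting part of the exercise, and a check on the eigenvalues above, can be done by counting paths through the trellis. The following is a minimal sketch with a hypothetical function name. It assumes the state is the length of the current run of {\tt 1}s, and it estimates $\l$ from the ratio of successive path counts rather than from an eigenvalue routine.

```python
from math import log2

def path_counts(max_run, n_steps):
    """Count valid continuations from the zero state, step by step.

    State k means the sequence so far ends in a run of k 1s.
    counts[k] is the number of valid sequences ending in state k.
    """
    counts = [1] + [0] * max_run          # start in state 0
    totals = [1]
    for _ in range(n_steps):
        new = [sum(counts)]               # a 0 is always allowed
        new += counts[:-1]                # a 1 extends a run of k < max_run
        counts = new
        totals.append(sum(counts))
    return totals

for max_run in (2, 3):
    totals = path_counts(max_run, 60)
    n8 = totals[7]                        # a leading 0, then 7 free symbols
    lam = totals[60] / totals[59]         # growth factor per symbol
    print(max_run, n8, round(lam, 3), round(log2(lam), 3))
```

The growth factors reproduce the eigenvalues quoted in this solution.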
% BORDERLINE
%
%%%%%%%%%%%%%\soln{ex.rl.limit}{
%
% conjecture this is too big for a { }
% so doing it by hand
\begincuttable
\begin{Sexercise}{ex.rl.limit}
The channel is similar to the unconstrained binary channel;
runs of length greater than $L$ are rare if $L$ is large,
so we only expect weak differences from this channel; these
differences will show up in contexts where the run length
is close to $L$. The capacity of the channel is
very close to one bit.
A lower bound on the capacity is obtained by considering the
simple variable-length code for this channel which
replaces occurrences of the maximum runlength
string {\tt 111$\ldots$1} by {\tt 111$\ldots$10},
and otherwise leaves the source file unchanged.
The average rate of this code is $1/(1+2^{-L})$
because the invariant distribution will hit the `add an extra zero'
state a fraction $2^{-L}$
of the time.
% sum( a * r^n , n=0..N ) ;
%
% (N + 1)
% a r a
% ---------- - -----
% r - 1 r - 1
We can reuse the solution for the variable-length channel
in \exerciseref{ex.phone_chat}. The capacity
is the value of $\beta$ such that the equation
\beq
Z(\beta) = \sum_{l=1}^{L+1} 2^{-\beta l} = 1
\eeq
is satisfied.
The $L+1$ terms in the sum correspond to the $L+1$ possible
strings that can be emitted, {\tt 0}, {\tt 10}, {\tt 110}, $\ldots$~, {\tt 11$\ldots$10}.
The sum is\index{geometric progression}
exactly given by:
% \marginpar{\footnotesize{$\displaystyle\left[\sum_{n=0}^{N}ar^{n}={\frac {a (r^{N+1}-1)}{r-1}}\right]$}}
%
\beq
Z(\beta) = 2^{-\beta} \frac{ \left(2^{-\beta}\right)^{L+1} - 1 }{ 2^{-\beta} - 1} .
\eeq
$\displaystyle\left[\mbox{Here we used\ }\sum_{n=0}^{N}ar^{n}={\frac {a (r^{N+1}-1)}{r-1}}.\right]$
We anticipate that $\beta$ should be a little less than 1 in order for $Z(\beta)$ to
equal 1.
Rearranging and solving approximately for $\beta$, using $\ln (1+x) \simeq x$,
\beqan
Z(\beta) & = & 1 \\
%\Rightarrow
% 2 2^{-\beta} - 1& =& \left(2^{-\beta}\right)^{L+2} \\
%\Rightarrow
% 2^{1-\beta} & =& 1 + \left(2^{-\beta}\right)^{L+2} \\
%\Rightarrow
% {1-\beta} & =& \log_2 \left[ 1 + \left(2^{-\beta}\right)^{L+2} \right] \\
%\Rightarrow
% {1-\beta} &\simeq& \left(2^{-\beta}\right)^{L+2} / \ln 2 \\
\:\Rightarrow \:
{\beta}& \simeq & 1 - 2^{-(L+2)} / \ln 2 .
\eeqan
% c(x) = 1 - (2.0**(-(x+2.0)) )/log(2.0)
% print c(1)
% print c(2)
% print c(3)
% print c(4)
% print c(9)
% L=9: the eigenvalue is 1.9990
% log_2 is 0.99929
%
% eigen matlabs rl4: 1.9659 -> 0.9752
% 5: 1.9836 -> 0.9881
% 6: 1.9920 -> 0.9942
% L guess rate true capacity
% 1 0.81966 0.6942
% 2 0.90983 0.879
% 3 0.95491 0.947
% 4 0.97745 0.9752
% 5 0.98873 0.9881
% 6 0.994364 0.99419
% 9 0.9992919 0.99929556
We evaluated the true capacities for $L=2$ and $L=3$ in an
earlier exercise. The table%
\amargintab{b}{\footnotesize
\begin{center}
\begin{tabular}{ccc} \toprule
$L$ & $\beta$ & \mbox{True capacity} \\ \midrule
2 & 0.910\phantom{0} & 0.879\phantom{0} \\
3 & 0.955\phantom{0} & 0.947\phantom{0} \\
4 & 0.977\phantom{0} & 0.975\phantom{0} \\
5 & 0.9887 & 0.9881 \\
6 & 0.9944 & 0.9942 \\
9 & 0.9993 & 0.9993 \\ \bottomrule
\end{tabular}
\end{center}
}
compares the approximate
capacity $\beta$ with the true capacity for a selection of values of $L$.
The element $Q_{1|0}$ will be close
to $1/2$ (just a tiny bit smaller), since in the
unconstrained binary channel $Q_{1|0}=1/2$.
When a run of length $L-1$ has occurred, we effectively have a choice of
printing {\tt 10} or {\tt 0}. Let the probability of selecting {\tt 10}
be $f$. Let us estimate the entropy of the {\em remaining\/} $N$
characters in the stream as a function
of $f$, assuming the rest of the matrix $\bQ$ to have been
set to its optimal value.
The entropy of the next $N$ characters in the stream is
the entropy of the first bit, $H_2(f)$, plus the entropy of
the remaining characters, which is roughly
$(N\!-\!1)$ bits if we select {\tt 0} as the first bit
and $(N\!-\!2)$ bits if {\tt 1} is selected. More precisely, if $C$
is the capacity of the channel (which is roughly 1),
\beqan
\!\!\!\!\hspace*{-1.25em}
H(\mbox{the next $N$ chars})& \simeq& H_2(f) + \left[ (N-1) (1-f) + (N-2) f \right] C
\nonumber \\
&=& H_2(f) + N C - f C \: \simeq \: H_2(f) + N - f .
\eeqan
Differentiating and setting to zero to find the optimal $f$, we obtain:
\beq
\log_2 \frac{1-f}{f} \simeq 1 \:\: \Rightarrow \frac{1-f}{f} \simeq 2
\:\: \Rightarrow f \simeq 1/3 .
\eeq
The probability of emitting a {\tt 1} thus decreases from about 0.5
to about $1/3$ as the number of emitted {\tt 1}s increases.
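Here is the optimal matrix for $L=9$, with rows and columns ordered from the state `run of nine {\tt 1}s' down to the zero state: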
\beq
%%%%
%%%% written by matrix2tex.p
%%%%
%%%% beginning of matrix
%%%%
\left[
\begin{array}{@{\,}*{9}{c@{\,\,}}c@{\,}}
0 & .3334 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & .4287 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & .4669 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & .4841 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & .4923 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & .4963 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & .4983 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & .4993 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & .4998 \\
1 & .6666 & .5713 & .5331 & .5159 & .5077 & .5037 & .5017 & .5007 & .5002
\end{array}
\right]
%%%%
\eeq
%
% something wrong here?
%
Our rough theory works.
\end{Sexercise}
\ENDcuttable
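The comparison between the series approximation and the true capacity can be reproduced numerically. Here is a minimal sketch, with hypothetical function names, that solves $Z(\beta) = 1$ by bisection (a method chosen here for simplicity) and sets the root beside the approximation $1 - 2^{-(L+2)}/\ln 2$.

```python
from math import log

def Z(beta, L):
    """Z(beta) = sum over l = 1..L+1 of 2^(-beta l)."""
    return sum(2.0 ** (-beta * l) for l in range(1, L + 2))

def solve_beta(L, tol=1e-12):
    """Bisection for Z(beta) = 1; Z decreases as beta increases."""
    lo, hi = 0.5, 1.0                     # Z(0.5) > 1 > Z(1) for L >= 1
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if Z(mid, L) > 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def approx_beta(L):
    """Series approximation 1 - 2^-(L+2) / ln 2."""
    return 1.0 - 2.0 ** (-(L + 2)) / log(2.0)

for L in (2, 3, 4, 5, 6, 9):
    print(L, round(approx_beta(L), 4), round(solve_beta(L), 4))
```

The exact root of $Z(\beta)=1$ is the true capacity, because every valid string is a concatenation of the $L+1$ emitted strings listed above.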
% What is wrong with latex here?
%%%%%%%%%%}
% c(9) 0.999755859375
% c(8) 0.99951171875
\dvips
% ch 10
%\chapter{Language models and crosswords \nonexaminable}
%\chapter{Language Models and Crosswords \nonexaminable}
\chapter{Crosswords and Codebreaking \nonexaminable}% An Aside
% This chapter belongs as close as possible to the
% compression and noisy channel chapters. But it would
% also go better after the constrained channel chapter,
% since language can be viewed as a constrained channel.
%
% \input{tex/monogram.tex}
In this chapter we make a random walk through a few topics related
to language modelling.
% \section{Crosswords}
\label{ch.xword}
%\section{}
\section{Crosswords}
The rules of crossword-making may be thought of as defining
a constrained channel. The fact that {\em many\/}
valid crosswords can be made demonstrates that
this \ind{constrained channel} has a capacity greater than zero.
There are two archetypal \ind{crossword} formats.%
\amarginfig{t}{
\begin{tabular}{c}
\mbox{\epsfbox{metapost/xword.14}}\\
\mbox{\epsfbox{metapost/xword.1}}\\
%\psfig{figure=figs/xwordA3.ps,width=1.5in} \\
%\psfig{figure=figs/xwordB.ps,width=1.5in} \\
%\psfig{figure=figs/grid-us.ps,width=1.5in} \\
%\psfig{figure=figs/grid994.ps,width=1.5in} \\
\end{tabular}
\caption[a]{Crosswords
%$grids
of types A (\ind{American}) and B (\ind{British}).}
}
%, which differ in their grids' properties.
In a `type A' (or \ind{American})
\ind{crossword}, every row and column consists of a succession of
words of length 2 or more separated by one or more spaces.
In a `type B' (or \ind{British}) crossword, each row and column
consists of a mixture of words and
single characters, separated by one or more spaces, and
every character lies in at least one word (horizontal or vertical).
Whereas in a type A crossword every letter lies in a horizontal
word {\em and\/}
a vertical word, in a typical type B crossword only about half of
the letters do so; the other half lie in one word only.
%[`A' and `B' are
% mnemonic for America and Britain,
% where these two types of crosswords are respectively
% more widespread.]
Type A crosswords are harder to {\em create\/}
than type B because of the constraint
% that they are subject to,
that no single characters are permitted.
Type B crosswords are generally harder to {\em solve\/} because
there are fewer constraints per character.
\subsection{Why are crosswords possible?}
If a language has no redundancy, then any letters written on
a grid form a valid crossword.
In a language with high redundancy, on the
other hand, it is hard to make crosswords (except perhaps
a small number of trivial ones).
The possibility of making crosswords in a language
thus demonstrates a {\em bound on the redundancy\/} of
that language.
Crosswords
% , when read horizontally or vertically,
are not normally written in genuine \ind{English}. They are written in
% Perhaps we should introduce a name
% like Wenglish
% \footnote{Need to decide whether to call this Wenglish, as in chapter 2.}
% for
`\ind{word-English}', the language consisting of
strings of words from a dictionary, separated by spaces.
\exercisxB{2}{ex.winglishcap}{
Estimate the capacity of word-English, in bits per
character.
[Hint: think of word-English
as defining a constrained channel (\chref{ch.noiseless})
and see \exerciseref{ex.phone_chat}.]
}
% ? (relate to telephone rings and
% chapter on constrained channels). Give an estimate.
The fact that many crosswords can be made leads to
% that the redundancy is not very big
% the entropy of english is quite big
a lower bound on the entropy of word-English.
For simplicity, we now model
\ind{word-English} by \ind{\wenglish},
the language introduced in
\secref{sec.wenglish} which
consists of $W$ words all of length
$L$. The entropy of
such a language, per character, including inter-word spaces, is:
\beq
H_W \equiv \frac{\log_2 W }{L+1} .
\label{eq.HW}
\eeq
% I reckon $W$ is about 100,000 tops and L=6 seems reasonable.
% that's 17/7 -> 2ish bits per character.
We'll find that the conclusions we come to depend on the value of $H_W$ and are not terribly
sensitive to the value of $L$.
%
Consider a large crossword of size $S$ squares in area.
Let the number of words be $f_w S$ and let the number of
letter-occupied squares be $f_1 S$. For typical crosswords of
types A and B made of words of length $L$, the two fractions $f_w$
and $f_1$ have
% the following values:
roughly the values in \tabref{tab.xwordf}.
\margintab{\small
\begin{center}
\begin{tabular}{ccc} \toprule
& A & B \\ \midrule
$f_w$ & $\displaystyle \frac{2}{L+1}$ & $\displaystyle \frac{1}{L+1}$ \\[0.1in]
$f_1$ & $\displaystyle \frac{L}{L+1}$ & $\displaystyle \frac{3}{4}\frac{L}{L+1}$ \\
\bottomrule
\end{tabular}
\end{center}
\caption[a]{Factors $f_w$ and $f_1$ by which the number of words
and number of letter-squares respectively are smaller than the total
number of squares.}
\label{tab.xwordf}
}
We now estimate how many crosswords there are of size $S$
using our simple model of \Wenglish.
% , and work out the condition for
We
% Let's
assume that \Wenglish\ is created at random by generating $W$
strings from a monogram
% single
(\ie, memoryless)
source with entropy $H_0$. If, for example, the source used all
$A=26$ characters with equal probability then $H_0 = \log_2 A =
4.7$ bits. If instead we use \chref{ch.prob.ent}'s distribution then
the entropy is 4.2 bits.
The redundancy of \Wenglish\ stems from
% these
two sources:
it tends to use some letters more than others;
and there are only $W$ words in the dictionary.
% 3.xxx.
Let's now count how many crosswords there are by
imagining filling in the squares of a crossword at random using the same
distribution that produced the \Wenglish\ dictionary
and evaluating the probability that this random scribbling produces
valid words in all rows and columns.
The total number of {\em typical\/} fillings-in of the
$f_1 S$ squares in the crossword
that can be made is
\beq
|T| = 2^{ f_1 S H_0} .
\eeq
The probability that one word of length $L$ is validly filled-in
is
\beq
\beta = \frac{W}{2^{L H_0 }},
\eeq
and the probability that the whole crossword, made of $f_w S$ words, is validly filled-in
by a single typical in-filling is approximately\marginpar{\small\raggedright{This calculation
underestimates
the number of valid Wenglish crosswords
by counting only crosswords filled with `typical' strings.
If the monogram distribution is non-uniform then the
true count is dominated
by `atypical' fillings-in, in which crossword-friendly words appear more often.
}}
\beq
\beta^{f_w S} .
\eeq
So the log of the
number of valid crosswords of size $S$ is estimated to be
\beqan
\log \beta^{f_w S} |T| &=& S \left[
( f_1 - f_w L ) H_0 + f_w \log W
\right]
% \log \beta^{f_w S} |T|
\\
&=& S \left[
( f_1 - f_w L ) H_0 + f_w (L+1) H_W
% by defn, ref{eq.HW} === \frac{\log W}{L+1}
\right] ,
\eeqan
which is an increasing function of $S$
only if
\beq
( f_1 - f_w L ) H_0 + f_w (L+1) H_W
> 0.
\eeq
So arbitrarily many crosswords can be made only
if there are enough words in the \Wenglish\ dictionary that
\beq
H_W > \frac{( f_w L- f_1 )}{f_w(L+1)} H_0 .
\eeq
Plugging in the values of $f_1$ and $f_w$ from \tabref{tab.xwordf},
we find the following.
\begin{realcenter}
\begin{tabular}{lcc} \toprule
Crossword type & A & B \\ \midrule
%$f_w$ & $\frac{2}{L+1}$ & $\frac{1}{L+1}$ \\[0.05in]
%$f_w(L+1)$ & {2} & {1} \\[0.05in]
%$f_1$ & $\frac{L}{L+1}$ & $\frac{3}{4}\frac{L}{L+1}$ \\[0.05in]
%$-f_1+f_wL$ & $\frac{L}{L+1}$ & $\frac{1}{4}\frac{L}{L+1}$ \\[0.05in]
Condition for crosswords
&
$H_W > \frac{1}{2}\frac{L}{L+1} H_0$
& $H_W > \frac{1}{4}\frac{L}{L+1} H_0$ \\
\bottomrule
\end{tabular}
\end{realcenter}
If we set $H_0=4.2\ubits$ and assume there are $W=4000$
words in a normal English-speaker's
dictionary, all with length $L=5$, then we find
that the condition for crosswords of type B is
satisfied, but
the condition for crosswords of type A is
{\em only just\/} satisfied. This fits with
my experience that crosswords of type A
usually contain more obscure words.
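The arithmetic behind this statement is short enough to check by machine. Here is a minimal sketch, with a hypothetical function name, that evaluates $H_W$ from equation (\ref{eq.HW}) and the two thresholds from the general condition, using the fractions $f_w$ and $f_1$ of \tabref{tab.xwordf}.

```python
from math import log2

def crossword_conditions(W, L, H0):
    """Return H_W and the thresholds it must exceed for types A and B."""
    H_W = log2(W) / (L + 1)
    fractions = {
        "A": (2 / (L + 1), L / (L + 1)),          # (f_w, f_1)
        "B": (1 / (L + 1), 0.75 * L / (L + 1)),
    }
    thresholds = {
        kind: (f_w * L - f_1) / (f_w * (L + 1)) * H0
        for kind, (f_w, f_1) in fractions.items()
    }
    return H_W, thresholds

H_W, thresholds = crossword_conditions(W=4000, L=5, H0=4.2)
print(round(H_W, 2), {k: round(v, 3) for k, v in thresholds.items()})
```

With these numbers $H_W$ is about 1.99 bits, against thresholds of 1.75 bits for type A and 0.875 bits for type B.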
% Thus crosswords are possible in English because English has
% high enough entropy.
% In a language with fewer, longer words, the possibility of making
% crosswords vanishes.
% see xwordaside.tex
% units.tex has its own further reading
\section*{Further reading}
These observations about crosswords were first made by
\index{Shannon, Claude}\index{Wolf, Jack}\index{Siegel, Paul}\citeasnoun{Shannon48};
%Shannon;
% http://cm.bell-labs.com/cm/ms/what/shannonday/paper.html
% p15
I learned about them from \citeasnoun{wolf1998}.
The topic is closely related to the capacity of two-dimensional
constrained channels. An example of a two-dimensional\index{channel!two-dimensional}
constrained channel is a two-dimensional \ind{bar-code},
as seen
% in hexagonal patterns
on parcels.
%\section{}
% http://www.adams1.com/pub/russadam/stack.html
% exercises at the end fo the crossword chapter.
\fakesection{Xword Exercises}
\exercisxC{3}{ex.constrainedchannel2}{
A two-dimensional channel is defined by the constraint that,
of the eight neighbours of every interior pixel
in an $N \times N$ rectangular grid,
four must be black and four white. (The counts of black and white pixels
around boundary pixels are not constrained.)
A binary pattern satisfying this constraint is shown in
\figref{fig.granny}.
\marginfig{
\begin{center}
\mbox{\epsfbox{metapost/xword.21}}
\end{center}
%
\caption[a]{A binary pattern in which every pixel is adjacent
to four black and four white pixels.
}
\label{fig.granny}
}
What is the capacity of this channel, in bits per pixel, for large $N$?
% answer: tends to 0.
}
%
\dvips
%
\section{Simple language models}
\label{sec.zipf}
\subsection{The Zipf--Mandelbrot distribution}
The\index{Zipf, George K.}
crudest model for a language is the monogram
model, which asserts that each successive word
is drawn independently from a distribution over
words.
What is the nature of this distribution over words?
Zipf's law \cite{zipf} asserts that\index{Zipf's law}\index{Zipf plot}
the probability of the $r$th most probable word in a language is
approximately
\beq
P(r) = \frac{\kappa}{ r^{\alpha} },
\eeq
where the exponent $\alpha$ has a value close to 1, and $\kappa$ is
a constant. According to Zipf,
a log--log plot of frequency versus word-rank should
show a straight line with slope $-\alpha$.
\quotecite{Frac}
% Mandelbrot's
modification\index{Mandelbrot, Benoit}
of Zipf's law introduces a third parameter $v$,
asserting that the probabilities are given by
\beq
P(r) = \frac{\kappa}{ (r+v)^{\alpha} } . % 1/D
\label{eq.mandelbrot}
\eeq
For some documents, such as Jane Austen's {\em Emma},
the Zipf--Mandelbrot distribution
fits well -- \figref{fig.emma.zipf}.
Other documents give distributions that are not so well fitted
by a Zipf--Mandelbrot distribution.
\Figref{fig.book.zipf} shows a plot of frequency versus rank for
the \LaTeX\ source of this book. Qualitatively, the graph
is similar to a straight line, but a curve is noticeable.
To be fair,
% to Zipf and Mandelbrot,
this source file is not written
in pure English -- it is a mix of English, maths symbols such as `$x$',
and \LaTeX\ commands.
\begin{figure}[hbtp]
\figuremargin{\small%
\begin{center}
\begin{tabular}{cc}
\mbox{\psfig{figure=zipf/pr_ps/161014.emma.ps,angle=-90,width=2.3in}}
\end{tabular}
\end{center}
}{
\caption[a]{Fit of the Zipf--Mandelbrot distribution (\ref{eq.mandelbrot}) (curve)
to
the empirical frequencies of words in Jane Austen's {\em Emma} (dots).
The fitted parameters are
$\kappa = 0.56$; $v = 8.0$; $\alpha =1.26$.
% D = 0.79$.
}
\label{fig.emma.zipf}
}
\end{figure}
%
\begin{figure}
\figuremargin{\small%
\begin{center}
\begin{tabular}{cc}
\mbox{\psfig{figure=zipf/pr_ps/346998.book.ps,angle=-90,width=2.3in}}
\end{tabular}
\end{center}
}{
\caption[a]{Log--log plot of frequency versus rank for
the words in the \LaTeX\ file
of this book.
}
\label{fig.book.zipf}
}
\end{figure}
\subsection{The Dirichlet process}
\label{sec.dirichletprocess}
Assuming we are interested in
monogram models for languages, what model should we
use? One difficulty in modelling a language is the
unboundedness of vocabulary. The greater the sample
of language, the greater the number of words encountered.
A generative model for a language should emulate this property.
If asked `what is the next word in a newly-discovered
work of Shakespeare?' our probability distribution over words
must surely include some non-zero probability for
{\em words that Shakespeare never used before}.
Our generative monogram model for language should
also satisfy a consistency rule called {\dem\ind{exchangeability}}.
If we imagine generating a new language from our generative model,
producing an ever-growing corpus of text,
all statistical properties of the text should
be homogeneous: the probability of finding a particular word
at a given location in the stream of text should be
the same everywhere in the stream.
The Dirichlet process model is a model for a stream
of symbols (which
we think of as `words')
that satisfies the exchangeability rule
and that allows the vocabulary of symbols to grow without limit.
The model has one parameter $\alpha$. As the
stream of symbols is produced, we identify each new symbol
by a unique integer $w$.
When we have seen a stream of length $F$ symbols, we define
the probability of the next symbol in terms of
the counts $\{ F_w \}$ of the symbols seen so far thus:
the probability that the next symbol is a new symbol, never
seen before, is
\beq
\frac{ \alpha }{ F + \alpha } .
\eeq
The probability that the next symbol is symbol $w$ is
\beq
\frac{ F_w }{ F + \alpha } .
\eeq
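Two quick checks on these rules may be helpful. First, they define a normalized
distribution, since the counts satisfy $\sum_w F_w = F$:
\beq
\frac{\alpha}{F+\alpha} + \sum_w \frac{F_w}{F+\alpha} = \frac{\alpha + F}{F + \alpha} = 1 .
\eeq
Second, as a small illustration of exchangeability, compare the three-symbol
streams $1,1,2$ and $1,2,1$ (the first symbol is necessarily new, with
probability $\alpha/(0+\alpha) = 1$):
\beqan
P(1,1,2) &=& 1 \times \frac{1}{1+\alpha} \times \frac{\alpha}{2+\alpha} , \\
P(1,2,1) &=& 1 \times \frac{\alpha}{1+\alpha} \times \frac{1}{2+\alpha} .
\eeqan
The two probabilities are equal, as exchangeability requires; this is a check
on one small case, not a proof.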
\Figref{fig.zipf.dprocess}
shows Zipf plots\index{Zipf plot} (\ie, plots of symbol frequency versus rank)
for million-symbol `documents' generated by
Dirichlet process priors with values of
$\alpha$ ranging from 1 to 1000.
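To get a feel for the vocabulary sizes involved, note that the expected number
of distinct symbols in a stream of length $T$ is the sum of the new-symbol
probabilities,
\beq
\sum_{F=0}^{T-1} \frac{\alpha}{F+\alpha} \simeq \alpha \ln \left( 1 + \frac{T}{\alpha} \right) ,
\eeq
where the integral approximation on the right underestimates the sum by less
than one symbol. For $T = 10^6$ this gives roughly 14 distinct symbols
when $\alpha = 1$ and roughly 6900 when $\alpha = 1000$, so for fixed $\alpha$
the vocabulary grows only logarithmically with the length of the document.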
% load 'gnu/1000000.all'
\begin{figure}
\figuremargin{\small%
\begin{center}
\begin{tabular}{c}
\mbox{\psfig{figure=zipf/pr_ps/1000000.all.ps,angle=-90,width=2.3in}}
\end{tabular}
\end{center}
}{
\caption[a]{Zipf plots for four `languages' randomly generated
from Dirichlet processes with parameter $\alpha$ ranging
from 1 to 1000. Also shown is the Zipf plot for this book.
}
\label{fig.zipf.dprocess}
}
\end{figure}
It is evident that a Dirichlet process is
not an adequate model for observed distributions
that roughly obey Zipf's law.\index{Zipf's law}
With a small tweak, however, Dirichlet processes
can produce rather nice Zipf plots.
Imagine generating a language composed of
elementary symbols using a Dirichlet process
with a rather small value of the parameter $\alpha$,
so that the number of reasonably frequent symbols is about 27.
If we then declare one
of those symbols (now called `characters' rather
than words) to be a space character,
then we can identify the strings between the space characters
as `words'.
If we generate a language in this way then
the frequencies of words often come out as
very nice Zipf plots, as shown in \figref{fig.dprocess2.zipf}.
Which character is selected as the space character
determines the slope of the Zipf plot -- a less probable
space character gives rise to a richer language with a
% larger vocabulary and a
shallower slope.
\begin{figure}
\figuremargin{\small%
\begin{center}
\begin{tabular}{cc}
\mbox{\psfig{figure=zipf/pr_ps/fakes2003.ps,angle=-90,width=2.3in}}\\
\end{tabular}
\end{center}
}{
\caption[a]{Zipf plots for the words of two `languages' generated
by creating successive characters from a Dirichlet process
with $\alpha=2$, and declaring one
% randomly selected
character to be the space character. The two curves result
from two different choices of the space character.
}
\label{fig.dprocess2.zipf}
}
\end{figure}
% ch 10
%\chapter{Cryptography and cryptanalysis: codes for information concealment \nonexaminable}
%\chapter{Cryptography and Cryptanalysis: Codes for Information Concealment \nonexaminable}
%\label{ch.crypto}
%\input{tex/crypto.tex}
%
\dvips
%\chapter{Units of measurement of information content \nonexaminable}
\section{Units of information content \nonexaminable}
%\chapter{Units of Information Content \nonexaminable}
% units.tex
\fakesection{Units of measurement of information content}
The information content of an outcome, $x$,
whose probability is $P(x)$, is defined to
be
\beq
h(x) = \log \frac{1}{P(x)} .
\eeq
The entropy of an ensemble is
an average information content,
\beq
H(X) = \sum_x P(x) \log \frac{1}{P(x)} .
\eeq
When we compare hypotheses with each other in
the light of data, it is often convenient
to compare the log of the probability
of the data under the alternative hypotheses,
% models,
\beq
\mbox{`log evidence for $\H_i$'} = \log P( D \given \H_i ) ,
\eeq
or, in the case where just two hypotheses
% models
are being compared, we evaluate the
`log odds',
\beq
\log \frac{ P( D \given \H_1 ) }{ P( D \given \H_2 ) } ,
\eeq
which has also been called the `weight of evidence in favour
of $\H_1$'.
The log evidence for a hypothesis, $\log P( D \given \H_i )$, is
the negative of the information content of the data $D$:
if the data have large information content, given a hypothesis, then they
are surprising to that hypothesis;
if some other hypothesis is not so surprised
by the data, then that hypothesis becomes more probable.
`Information content', `\ind{surprise value}', and
log likelihood or log evidence are the same thing, up to a change of sign.
All these quantities are logarithms of probabilities,
or weighted sums of logarithms of probabilities, so they
can all be measured in the same units. The units depend
on the choice of the base of the logarithm.
% This chapter is a brief aside to mention
The names that have been given to these units
are shown in \tabref{tab.units}.\index{bit (unit)}\index{nat (unit)}\index{ban (unit)}\index{deciban (unit)}\index{units}
\begin{table}[htbp]
\figuremargin{
\begin{center}
\begin{tabular}{cc} \toprule
Unit & Expression that has those units \\ \midrule
bit & $\log_2 p$ \\
nat & $\log_e p$ \\
ban & $\log_{10} p$ \\
deciban (db) & ${10}\log_{10} p$ \\ \bottomrule
\end{tabular}
\end{center}
}{\caption[a]{Units of measurement of information content.}
\label{tab.units}}
\end{table}
% Jaynes p.91 calls the db the decibel.
The {\em bit\/} is the unit that we use most in this book. Because
the word `bit' has other meanings, a backup name for this unit
is the {\em shannon}.\index{shannon (unit)}
A {\em byte\/} is 8 bits. A megabyte is $2^{20} \simeq 10^6$ bytes.
If one works in natural logarithms,
% (which is conventional in Bayesian
information contents and weights of evidence
are measured in {\em nats}.
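The units are related by constant factors, since changing the base of a
logarithm only rescales it:
\beq
1 \:\mbox{ban} = 10 \:\mbox{db} = \log_2 10 \:\mbox{bits} \simeq 3.32 \:\mbox{bits}
 = \ln 10 \:\mbox{nats} \simeq 2.30 \:\mbox{nats} .
\eeq
As a worked example, an outcome of probability $P(x) = 1/8$ has information content
\beq
h(x) = 3 \:\mbox{bits} \simeq 2.08 \:\mbox{nats} \simeq 0.90 \:\mbox{ban} \simeq 9.0 \:\mbox{db} ,
\eeq
so one bit is about 3 decibans.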
The most interesting units are the {\em ban\/} and the {\em deciban}.
\subsection{The history of the ban}
Let me tell you why
a factor of ten in probability is called a ban.
% , after the
% English
% town of Banbury.
When Alan {Turing} and the other\index{Turing, Alan}
% British
\ind{codebreakers} at \ind{Bletchley Park} were breaking each new
day's
% German
\ind{Enigma} code, their task was a huge inference problem: to infer,
given the day's cyphertext, which three wheels were in
the Enigma machines that day; what their starting positions were;
what further letter substitutions were in use on the steckerboard;
and, not least, what the original German messages were.
These inferences were conducted using Bayesian methods (of course!),
and the chosen units were decibans or half-decibans, the deciban
being judged the smallest weight of evidence discernible to
a human. The evidence in favour of particular hypotheses
was tallied using sheets of paper that
were specially printed in {Banbury}, a town
about 30 miles from {Bletchley}. The inference task was
known as \ind{Banburismus}, and the units in which
% the game
Banburismus
was played were called {ban}s, after that town.
\section{A taste of Banburismus}
The details of the code-breaking methods of Bletchley
Park were kept secret for a long time, but some aspects
of Banburismus can be pieced together. I hope the following
description of a small part of Banburismus is not too inaccurate.\footnote{I've
been most helped by descriptions given by Tony Sale
({\tt http://{\breakhere}www.{\breakhere}codesandciphers.{\breakhere}org.uk/{\breakhere}lectures/})
% http://www.codesandciphers.org.uk/lectures/
% was http://www.cranfield.ac.uk/ccc/bpark/lectures/})
and by Jack Good (1979),\nocite{GoodEnigma} who worked with
Turing at Bletchley.
}
How much information was needed? The number of possible
settings of the Enigma machine was about $8 \times 10^{12}$.
% see cryptonotes
To deduce the state of the machine, `it was
therefore necessary to find about 129 decibans from somewhere',
as Good\index{Good, Jack} puts it. \ind{Banburismus} was aimed not at deducing the
entire state of the machine, but only at figuring out which
wheels were in use; the logic-based \ind{bombes}, fed with guesses
of the \ind{plaintext} (\ind{crib}s), were then
used to crack what the settings of the wheels were.
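Good's figure can be checked directly, taking the count of $8 \times 10^{12}$
settings at face value and assuming that all settings are equally probable a priori:
\beq
10 \log_{10} ( 8 \times 10^{12} ) = 10 \, ( 12 + \log_{10} 8 ) \simeq 10 \times 12.9 = 129 \:\mbox{db} ,
\eeq
which is about 43 bits.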
% the remaining uncertainty.
The \ind{Enigma} machine, once its wheels and plugs were put in place,
implemented a continually-changing permutation
cypher that wandered deterministically through a
state space
% , starting from
of $26^3$ permutations.
Because an enormous number of messages were sent each day,
there was a good chance that whatever state one machine
was in when sending one character
of a message, there would be another machine
{\em in the same state\/} while sending a particular character
in another message.
Because the evolution of the machine's state was deterministic,
the two machines would remain in the same state as
each other for the rest of the transmission.
The resulting correlations between the outputs of
such pairs of machines
provided a dribble of information-content
from which Turing and his co-workers
extracted their daily 129 decibans.
\subsection{How to detect
that two messages came from machines with a common state
sequence}
The hypotheses are the null hypothesis, $\H_0$, which
states that the machines are in {\em different\/} states, and
that
the two plain messages are unrelated; and the
`match' hypothesis, $\H_1$, which
says that the machines are in the {\em same\/} state, and
that the two plain messages are unrelated.
No attempt is being made here to infer what the state of
either machine is.
The data provided are the two cyphertexts $\bx$ and $\by$;
let's assume they
both have length $T$ and that the alphabet size is $A$ (26 in Enigma).
What is the probability of the data, given the two hypotheses?
First, the null hypothesis.
This hypothesis asserts that the two cyphertexts are
given by
\beq
\bx = x_1x_2x_3\ldots = c_1(u_1)c_2(u_2)c_3(u_3)\ldots
\eeq
and
\beq
\by = y_1y_2y_3\ldots = c'_1(v_1)c'_2(v_2)c'_3(v_3)\ldots,
\eeq
where the codes $c_t$ and $c'_t$ are two unrelated time-varying
permutations of the alphabet, and
$u_1u_2u_3\ldots$ and
$v_1v_2v_3\ldots$ are the plaintext messages.
An exact computation of the probability of the data ($\bx,\by$)
would depend on a language model of the plain text,
and a model of the Enigma machine's guts, but if we
assume that each Enigma machine is an {\em ideal\/} random time-varying
permutation, then the probability distribution of the
two cyphertexts is uniform. All cyphertexts are
equally likely.
\beq
P(\bx , \by \given \H_0 ) = \left( \frac{1}{A} \right)^{\! 2 T}
\:\:\mbox{for all $\bx,\by$ of length $T$}.
\eeq
What about $\H_1$?
This hypothesis asserts that a {\em single\/}
time-varying permutation $c_t$ underlies both
\beq
\bx = x_1x_2x_3\ldots = c_1(u_1)c_2(u_2)c_3(u_3)\ldots
\eeq
and
\beq
\by = y_1y_2y_3\ldots = c_1(v_1)c_2(v_2)c_3(v_3)\ldots \: .
\eeq
% are generated from two plaintext messages $u_1u_2u_3\ldots$ and
% $v_1v_2v_3\ldots$
What is the probability of the data ($\bx,\by$)?
We have to make some assumptions about
the plaintext language.
% [`Horrors! How can we possibly
% make assumptions?' the idiot non-Bayesians ask.]
If it were the case that the plaintext language was
completely random, then the probability of
$u_1u_2u_3\ldots$ and
$v_1v_2v_3\ldots$ would be uniform, and so would that
of $\bx$ and $\by$, so the probability $P(\bx,\by\given \H_1)$
would be equal to $P(\bx,\by\given \H_0)$, and the two hypotheses
$\H_0$ and $\H_1$ would be
indistinguishable.
We make progress by assuming that the plaintext is not
completely random. Both plaintexts are written in a
language, and that language has redundancies.
Assume for example that particular plaintext letters
are used more often than others. So, even though the two
plaintext messages are unrelated, they are slightly more
likely to use the same letters as each other; if $\H_1$ is
true, two synchronized letters from the two cyphertexts
are slightly more likely to
be identical. Similarly, if a language uses particular
bigrams and trigrams frequently, then the two plaintext messages
will occasionally contain the same bigrams and trigrams
at the same time as each other, giving rise, if $\H_1$ is
true, to a little burst of 2 or 3 identical letters.
\Tabref{fig.coincidenceexample} shows such a \ind{coincidence} in
two plaintext messages that are unrelated, except that
they are both written in English.
\begin{table}
\figuredangle{
\small
\begin{center}
\hspace*{0.6in}
%\parbox{4in}{
%\begin{tabular}{cl}
%$\bu$ & \verb+THEXCODEXBREAKERSXWEREXLOOKINGXFORXINSTANCESXWHEREX+ \\
%$\bv$ & \verb+TRIGRAMSXFORXTWOXORXMOREXMESSAGESXDIFFEREDXONLYXINX+ \\
% & \verb+*.......*..........................*..............*+ \\
%\end{tabular}\\
\begin{tabular}{rl}\toprule
$\bu$ & {\tt{LITTLE-JACK-HORNER-SAT-IN-THE-CORNER-EATING-A-CHRISTMAS-PIE--HE-PUT-IN-H}} \\
$\bv$ & {\tt{RIDE-A-COCK-HORSE-TO-BANBURY-CROSS-TO-SEE-A-FINE-LADY-UPON-A-WHITE-HORSE}} \\
{\sf matches:}& {\tt{.*....*..******.*..............*...........*................*...........}} \\
\bottomrule
\end{tabular}
%}
\end{center}
}{\caption[a]{%
Two aligned pieces of English plaintext, $\bu$ and $\bv$, with
matches marked by {\tt{*}}.
% Notice that there are four matches,
% whereas the expected number of matches in two completely
% random strings of length $T=51$ would be about 2.
Notice that there are twelve matches, including a run of six,
whereas the expected number of matches in two completely
random strings of length $T=72$ would be about 3.
The two corresponding cyphertexts from two machines in
identical states would also have twelve matches.
}
\label{fig.coincidenceexample}}
\end{table}
The codebreakers hunted among pairs of messages for
pairs that were suspiciously similar to each other,
counting up the numbers of matching monograms, bigrams, trigrams, etc.
This method was first used by the Polish codebreaker Rejewski.
Let's look at the simple case of a monogram language model and
estimate how long a message is needed to be able to decide whether
two machines are in the same state.
%many messages would be needed, and of
% what length, to have a good chance of cracking the Enigma.
I'll assume the source language is monogram-English,
the language in which successive letters are drawn
i.i.d.\ from the probability distribution $\{ p_i \}$ of
\figref{fig.monogram}.
The probability of $\bx$ and $\by$ is nonuniform:
consider two single characters, $x_t=c_t(u_t)$ and $y_t=c_t(v_t)$;
the probability that they are identical is
\beq
\sum_{u_t,v_t} P(u_t) P(v_t) \, \truth[ u_t\eq v_t ]
\: = \: \sum_i p_i^2
\: \equiv \:
m.
\eeq
We give this quantity the name $m$, for `match probability';
for both English and German, $m$ is about $2/26$ rather than $1/26$ (the value
that would hold for a completely random language).
Assuming that $c_t$ is an ideal random permutation,
the probability of $x_t$ and $y_t$ is, by symmetry,
\beq
P(x_t,y_t\given \H_1) \: = \: \left\{ \begin{array}{ccl}
\smallfrac{m}{A} & & \mbox{if $ x_t = y_t $} \\
\smallfrac{(1-m)}{A(A-1)} & & \mbox{for $ x_t \not = y_t $.}
\end{array} \right.
\eeq
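As a check, this distribution is normalized: there are $A$ pairs with
$x_t = y_t$ and $A(A-1)$ pairs with $x_t \not = y_t$, so
\beq
\sum_{x_t,y_t} P(x_t,y_t \given \H_1)
 = A \times \frac{m}{A} + A(A-1) \times \frac{(1-m)}{A(A-1)} = m + (1-m) = 1 .
\eeq
In the special case $m = 1/A$ both entries reduce to $1/A^2$, the value
under $\H_0$, and the two hypotheses become indistinguishable, as noted
earlier for a completely random language.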
Given a pair of cyphertexts $\bx$ and $\by$ of length $T$
that match in $M$ places and do not match in $N$ places,
the log evidence in favour of $\H_1$ is
then
\beqan
\log \frac{P(\bx,\by\given \H_1)}{P(\bx,\by\given \H_0)}
&=& M \log \frac{ m/A }{ 1/A^2 }
+ N \log \frac{ \smallfrac{(1-m)}{A(A-1)} }{ 1/A^2 }
\\
&=& M \log m A
+ N \log \frac{ (1-m) A}{A-1} .
\label{eq.weight.of.evidence}
\eeqan
Every match contributes $\log m A$ in favour
of $\H_1$;
every non-match contributes $\log \frac{A-1}{ (1-m) A}$
in favour of $\H_0$.
%tex/crypto/psquared.p
% double checked.........
%gnuplot> pr 10 * log(0.075884 * 27)/log(10.0)
%3.11513979524554
%gnuplot> pr 10 * log((1.0 - 0.075884) * 27/26.0 )/log(10.0)
%-0.178830941957283
\medskip
\begin{center}
\begin{tabular}{lcr@{}l} \toprule
Match probability for monogram-English & $m$ & & 0.076 \\
Coincidental match probability & $1/A$ && 0.037 \\
log-evidence for $\H_1$ per match &
${10}\log_{10} m A$ & & 3.1\,db \\
log-evidence for $\H_1$ per non-match &
${10}\log_{10} \frac{ (1- m) A}{(A-1)}$ & $-$&$0.18$\,db \\
\bottomrule
\end{tabular}
\medskip
\end{center}
If there were $M=4$ matches and $N=47$ non-matches
in a pair of length $T=51$, for example,
the weight of evidence in favour of
$\H_1$ would be +4 decibans, or a likelihood ratio of 2.5 to 1
in favour.
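The arithmetic, using the per-match and per-non-match values of the table
before rounding (about $3.115$\,db and $-0.1788$\,db), is
\beq
4 \times 3.115 - 47 \times 0.1788 \simeq 12.46 - 8.40 = 4.06 \:\mbox{db} ,
\eeq
and a weight of evidence of 4\,db corresponds to a likelihood ratio
of $10^{0.4} \simeq 2.5$.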
% odds === (1-p)/p
%
% If there were $M=3$ matches and $N=17$ non-matches
% in a pair of length $T=20$, for example,
% the evidence in favour of
% $\H_1$ would be +12.4 decibans, or odds of 17 to 1
% in favour.
%%%%%%%%%%%%%%%%%
%%%%%%% [Check if this is the right use of odds.]
%%%%%%%%%%%%%%%%%
The {\em expected\/} weight of evidence
from a line of text of length $T=20$ characters
is the expectation of (\ref{eq.weight.of.evidence}),
which depends on whether $\H_1$ or $\H_0$ is true.
If $\H_1$ is true then matches are expected to turn up
at rate $m$, and
the expected weight of evidence is
1.4\,decibans per 20 characters.
If $\H_0$ is true then spurious matches are expected to turn up
at rate $1/A$, and
the expected weight of evidence is
$-1.1$~decibans per 20 characters.
% $-1.1$\,decibans per 20 characters.
Typically, roughly 400 characters need to be inspected in order
to have a weight of evidence greater than a hundred to one (20 decibans) in
favour of one hypothesis or the other.
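These expectations follow from (\ref{eq.weight.of.evidence}) on replacing $M$
and $N$ by their expected values. Per character, taking $m \simeq 0.0759$
and $A = 27$, and the unrounded weights $3.115$\,db per match and
$-0.1788$\,db per non-match,
\beqan
\mbox{under $\H_1$:} & & m \times 3.115 - (1-m) \times 0.1788 \simeq 0.236 - 0.165 = +0.071 \:\mbox{db} , \\
\mbox{under $\H_0$:} & & \frac{1}{A} \times 3.115 - \frac{A-1}{A} \times 0.1788 \simeq 0.115 - 0.172 = -0.057 \:\mbox{db} .
\eeqan
Multiplying by 20 gives the $1.4$ and $-1.1$ decibans quoted above; multiplying
by 400 gives about $+28$ and $-23$ decibans, which is why roughly 400 characters
suffice to pass the 20-deciban mark under either hypothesis.
These are averages only; the weight of evidence from any particular pair of
messages fluctuates around them.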
So, two English plaintexts have more matches
than two random strings. Furthermore, because consecutive characters
in English are not independent, the bigram and trigram
statistics of English are nonuniform and the
matches tend to occur in bursts of consecutive matches.
[The same observations also apply to German.]
% , the plaintext language used in the Enigma messages.]
Using better language models, the evidence contributed by
runs of matches was more accurately computed. Such a scoring
system was worked out by Turing and refined by Good.
Positive results were passed on to automated and human-powered codebreakers.
According to
Good, the longest false-positive that arose in this
work was a string of 8 consecutive matches between two machines that were
actually in unrelated states.
% The same codebreaking
%% cracking
% system was implemented on the Colossus
% computer in the work known as Fish. The computer
% accumulated weights of evidence and searched
% for the most probable hypothesis.
% xword.tex has its own further reading
\section*{Further reading}
For further reading about Turing and Bletchley Park,
see \citeasnoun{hodges83} and \citeasnoun{GoodEnigma}.
For an in-depth read about cryptography,
\quotecite{Schneier96} book is highly recommended.
It is readable, clear, and entertaining.
% see also xword.tex which includes exword.tex
\section{Exercises}
\exercisxB{2}{ex.enigmaleak}{
Another weakness in the design of the \ind{Enigma} machine, which
was intended to emulate a perfectly random time-varying
\ind{permutation}, is that it never mapped a letter to
itself. When you press {\tt{Q}}, what comes out is
always a different letter from {\tt{Q}}.
How much information per character is leaked by this
design flaw?
How long a \ind{crib} would be needed to be confident
that the crib is correctly aligned with the cyphertext?
And how long a crib would be needed to be able
confidently to identify the correct key?
[A {\dem{crib}\/} is a guess for what the plaintext was.
Imagine that the Brits know that a very important German
is travelling from Berlin to Aachen, and they intercept
Enigma-encoded messages sent to Aachen. It is a good bet
that one or more of the original plaintext messages contains the
string {\tt OBERSTURMBANNFUEHRERXGRAFXHEINRICHXVONXWEIZSAECKER},
the name of the important chap.
A crib could be used in a brute-force approach
to find the correct Enigma key (feed the received messages
through all possible Enigma machines and see if any of the
putative decoded texts match the above plaintext).
This question centres on the idea that the crib can also be
used in a much less expensive manner: slide the plaintext crib
along all the encoded messages until a perfect {\em mismatch\/}
of the crib and the encoded message is found; if correct,
this alignment then tells you a lot about the key.]
}
%{Why have sex? Information acquisition and evolution}
\chapter{Why have Sex? Information Acquisition and Evolution}
\label{ch.sex}
%\title{Rate of Information Acquisition\\ by a Species subjected to Natural Selection}
% \date{\today\ -- Draft 5.5} from _doc/gene/gene.tex
\newcommand{\explanfig}[1]{\raisebox{-0.5cm}{\psfig{figure=psm/e.#1.ps,width=2in,height=0.5in,angle=-90}}}
\newcommand{\fitfig}[1]{\mbox{\psfig{figure=psm2/#1.ps,width=2.5in,angle=-90}}}
\newcommand{\fitfigx}[1]{\mbox{\psfig{figure=psx/#1.ps,width=2.5in,angle=-90}}}
% \exercisxC{5}{ex.evolutionteach}{
% {\bf What is the difference (in bits) between an ape and a human?}
%5?????????????????????????????????/
Evolution has been\index{evolution}\index{natural selection}
happening on earth for about the last $10^{9}$ years.
% DNA-binding proteins are just one of the families of sophisticated
% molecules which the Blind Watchmaker of evolution has created.
Undeniably, {\em information has been acquired\/} during this process.
Thanks to the tireless work
of the \ind{Blind Watchmaker},
some cells now carry within them all the information required
to be outstanding spiders; other cells carry all the information
required to make excellent octopuses. Where did this information
come from?
The entire blueprint of all organisms on the planet has emerged
in a teaching process in which the teacher is
natural selection:
% , \ie, the process whereby
fitter individuals have more progeny, the \ind{fitness} being defined by the
local environment (including the other organisms).
The teaching signal is only a few bits per
individual: an individual simply has a smaller
or larger number of grandchildren, depending on the
individual's fitness.
`Fitness' is a broad term that could cover
\bit
\item
the ability of an antelope to run faster than other antelopes
and hence avoid being eaten by a lion;
\item
the ability of a lion to be well-enough camouflaged and run
fast enough to catch one antelope per day;
\item
the ability of a peacock to attract a peahen to mate with it;
\item
the ability of a peahen to rear many young simultaneously.
\eit
The fitness of an organism is largely determined
by its DNA -- both the coding regions, or genes,
and the non-coding regions (which play an important
role in regulating the transcription of genes).
We'll think of fitness as a function of the DNA
sequence and the environment.
% For simplicity, let's focus on a gene and think a bit more
% about the information acquisition process.
How does the DNA determine fitness, and how
does information get from natural selection into the genome? Well,
if the gene that codes for one of an antelope's proteins is
defective, that antelope might get eaten by a lion
early in life and have only two grandchildren rather than forty.
The information content of natural selection is fully
contained in a specification of which offspring survived to
have children -- an information content of {\em
at most one bit per offspring}.
The teaching signal does not communicate to the ecosystem any description
of the imperfections in the organism that caused it to have
fewer children.
% And these
The bits of the teaching signal are highly
redundant, because, throughout a species,
unfit individuals who are similar to each other
will be failing to have offspring for similar reasons.
So, how many bits per generation are acquired by the \ind{species}\index{human}\index{ape}
as a whole by \ind{natural selection}?
% What is the difference
How many bits has natural selection succeeded in conveying to the human
branch of the tree of life, since the divergence between Australopithecines
% and apes 4,000,000 years ago.
% 277, Maynard Smith
%
%
% Australopithecines
and apes $4\,000\,000$ years ago?
Assuming a generation time of 10 years for reproduction,
there have been about $400\,000$ generations of human precursors
since the divergence from apes. Assuming a population of
$10^{9}$ individuals, each receiving a couple of bits of
information from natural selection, the total number of bits
of information responsible for modifying the genomes of 4 million
B.C.\ into today's human genome is about
$8\times 10^{14}$ bits. However, as we noted, natural selection is not
smart at collating the information that it dishes out to the
population, and there is a great deal of redundancy in that
information. If the population size were twice as great, would it evolve
twice as fast? No, because natural selection will simply be
correcting the same defects twice as often.
John Maynard Smith has suggested that the rate of information
acquisition by a species is independent of the population size,
and is of order 1 bit per generation.
This figure would allow for only $400\,000$ bits of difference
between apes and humans, a number that is much smaller than
the total size of the human genome -- $6 \times 10^9$ bits.
[One human genome contains about $3\times 10^{9}$ nucleotides.]
It is certainly the case that the genomic overlap between
apes and humans is huge, but is the difference that small?
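Spelling out the arithmetic behind these two estimates, under the same rough
assumptions (10-year generations, $10^9$ individuals, two bits each):
\beqan
\mbox{generations} &\simeq& \frac{4 \times 10^{6} \:\mbox{years}}{10 \:\mbox{years}} = 4 \times 10^{5} , \\
\mbox{naive total} &\simeq& 4 \times 10^{5} \times 10^{9} \times 2 \:\mbox{bits} = 8 \times 10^{14} \:\mbox{bits} , \\
\mbox{at 1 bit per generation} &\simeq& 4 \times 10^{5} \:\mbox{bits}
 \simeq 7 \times 10^{-5} \times ( 6 \times 10^{9} \:\mbox{bits} ) .
\eeqan
So the two estimates differ by a factor of about $2 \times 10^{9}$, and the
smaller one is less than a hundredth of a percent of the size of the genome.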
% (Don't forget that
% if two bit sequences of length $N$ have 90\% overlap, then it takes
% about $N/2$ bits to describe the differences between them;
% according to {\tt http://users.ox.ac.uk/$\sim$mckee/chimp.html},
% we share 98.4\% of our DNA with chimpanzees, which corresponds to
% a difference of 0.12$N$ bits, or $7 \times 10^{8}$ bits.
% This is considerably larger than the 400,000 bits of difference
% mentioned above. Of course, the difference between
% us and chimpanzees could involve neutral changes to the DNA,
% and if some of the differences are redundant, then we are
% further overcounting; but are we overcounting by a factor of 1000?)
% http://users.ox.ac.uk/~mckee/chimp.html
% %We share 98.4% of our DNA
In this chapter, we'll develop a crude model
of the process of information acquisition through evolution,
based on the assumption that a gene with two defects
is typically more defective than a gene with one defect,
and an organism with two defective genes is likely to be
less fit than an organism with one defective gene.
Undeniably, this is a crude model, since
real biological systems are baroque constructions with
complex interactions. Nevertheless, we persist with a simple
model because it readily yields striking results.
% I have developed a simple model of natural selection
% \footnote{{\tt http://www.inference.phy.cam.ac.uk/mackay/abstracts/gene.html}}
What we find from this simple model is that
\ben
\item
% whereas
John Maynard Smith's figure of 1 bit per generation
is correct for an {\em asexually-reproducing\/} population;
\item in contrast,
{\em if the species reproduces
sexually}, the rate of information
acquisition
% , though independent of the population size,
can be as large as
$\sqrt{G}$ bits per generation, where $G$
is the size of the genome.
\een
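To give a sense of scale, suppose purely for illustration that $G = 10^{8}$ bits
(an assumed round number, not a measured genome size). Then, over the
$4 \times 10^{5}$ generations estimated above,
\beq
\sqrt{G} = 10^{4} \:\mbox{bits per generation} , \:\:\:\:
4 \times 10^{5} \times 10^{4} \:\mbox{bits} = 4 \times 10^{9} \:\mbox{bits} ,
\eeq
compared with $4 \times 10^{5}$ bits at one bit per generation.
Since $\sqrt{G}$ is the largest rate the model allows, this should be read
as a ceiling rather than an estimate.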
% Setting $G \simeq 10^4$--$10^{8}$, we would then have had time to acquire
% about $4 \times 10^7$ or $4 \times 10^9$ bits of information from
% evolution.
We'll also find interesting results concerning
the maximum mutation rate that a species can withstand.
\section{The model}
% At what rate, in bits per generation, can the blind watchmaker
% cram information into a species by natural selection?
% And what is the maximum mutation rate that a species can withstand?
We study a simple
model of a reproducing population of $N$ individuals with a genome of size
$G$ bits:
% fitness is a strictly
% additive trait subjected to directional selection;
variation is produced by mutation or by recombination (\ie, sex)
and truncation selection
selects the $N$ fittest children at each generation
to be the parents of the next.
We find striking differences between populations that
have recombination and populations that do not.
% If variation is produced by mutation alone, then the entire population gains
% up to roughly
% 1 bit per generation. If variation is created by
% recombination, the population can gain
% $O(\sqrt{G})$ bits per generation.
% Furthermore, recombination raises
% the maximum mutation rate that can be tolerated
% by a factor of order $\sqrt{G}$.
%% the square root of the size of the genome.
% This model explains the prevalence of sex in evolution
% and shows why sex persists in
% species with large genomes, even when they
% have reached evolutionary stasis.
%\subsection{Fitness}
The genotype of each individual is a vector $\bx$ of
$G$ bits, each having a good state $x_g \eq 1$ and a bad
state $x_g \eq 0$.
The fitness $F(\bx)$ of
an individual is simply the sum of her bits:
\beq
F(\bx) = \sum_{g=1}^G x_g .
\eeq
The bits in the genome could be considered to
correspond either to genes that have good alleles ($x_g \eq 1$)
and bad alleles ($x_g \eq 0$), or to the nucleotides of
a genome.
% , with two bits per nucleotide.
We will concentrate on the
latter interpretation.
The essential property of fitness that we are assuming is
that it is locally a roughly linear function of the genome, that is,
that there are many possible changes one could make to the
genome, each of which has a small effect on fitness, and
that these effects combine approximately linearly.
We define the normalized
fitness $f(\bx) \equiv F(\bx)/G$.
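For example, in a toy genome of $G \eq 8$ bits (far smaller than any case
of interest), the genotype $\bx = 10110111$ has
\beq
F(\bx) = 1+0+1+1+0+1+1+1 = 6 , \:\:\:\: f(\bx) = 6/8 = 0.75 ;
\eeq
flipping any one of its six good bits lowers $F$ by one, and flipping either
bad bit raises $F$ by one, whichever bit is chosen. That indifference to
position is what the linearity assumption amounts to in this model.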
We consider evolution by natural selection under
two models of variation.
\begin{description}
\item[Variation by mutation\puncspace]% was colon
% \subsection{Variation by mutation}
The model assumes discrete
generations.
At each generation, $t$, every individual produces two children.
% and then dies.
% progenies'
The children's
genotypes differ from the parent's
% genotype
by random
mutations. Natural selection selects the fittest $N$ progeny in the
child population to reproduce, and a new generation starts.
[The selection of the fittest $N$ individuals at each generation
is known as truncation selection.]
The simplest model of mutations is that the child's bits $\{ x_g \}$
are independent. Each bit has a small probability of being flipped, which,
thinking of the bits as corresponding roughly to nucleotides, is
taken to be a constant $m$, independent of $x_g$.
[If alternatively we thought
of the bits as corresponding to genes, then we would
model the probability of the discovery of a good gene,
% by mutation,
$P(x_g \eq 0 \rightarrow x_g \eq 1)$, as being
a smaller number
% $m_{\uparrow}$
than the probability of a deleterious mutation
in a good gene,
$P(x_g \eq 1 \rightarrow x_g \eq 0)$.]
% ,
% which we denote by
% $m_{\downarrow}$.]
\item[Variation by recombination (or crossover, or sex)\puncspace]
% \subsection{Sex}
Our organisms are haploid, not diploid. They enjoy sex by recombination.
% crossover.
The $N$ individuals in the population are married into $M \eq N/2$ couples,
at random,
and each couple has $C$ children -- with $C\eq 4$ children being our
standard assumption, so as to have the population double and halve
every generation, as before.
The $C$
% siblings'
children's
genotypes are independent given the parents'.
Each child obtains its genotype $\bz$ by random crossover of its parents'
genotypes, $\bx$ and $\by$. The simplest model of recombination
% crossover,
% which we use here,
has no linkage, so that:
\beq
z_g \:=\: \left\{ \begin{array}{cl}
x_g & \mbox{with probability $1/2$} \\
y_g & \mbox{with probability $1/2$.} \end{array} \right.
\eeq
% It would be easy to introduce linkage if we wanted to.
Once the $MC$ progeny have been born, the parents pass away, the fittest
$N$ progeny are selected by natural selection, and a new generation starts.
\end{description}
We now study these two models of variation in detail.
%\section{Rate of information acquisition}
\section{Rate of increase of fitness}
\subsection{Theory of mutations}
We assume
that the genotype of an individual with normalized fitness $f \eq F/G$ is
subjected to mutations that flip bits with probability $m$.
We first show that if the average normalized
fitness $f$ of the population is greater than $1/2$, then
the optimal mutation rate is small, and the rate of
acquisition of information is at most of order one bit per
generation.
Since it is easy to achieve a normalized fitness of $f \eq 1/2$ by
simple mutation, we'll assume $f > 1/2$ and work in terms of
the excess normalized fitness $\deltaf \equiv f - 1/2$.
If an individual with excess normalized
fitness $\deltaf$ has a child and the mutation rate $m$ is small,
the probability distribution
of the excess normalized fitness of the child has
mean
\beq
%\mbox{mean}(t\!+\!1) =
\overline{\deltaf}_{\rm child} = (1-2 m) \deltaf
\eeq
and variance
% standard deviation \sqrt
\beq
{ \frac{m(1-m)}{G} } \simeq { \frac{m}{G} } .
\eeq
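These two moments can be checked directly; what follows is a short sketch,
writing $f = 1/2 + \deltaf$ for the parent's normalized fitness.
Each of the parent's $fG$ good bits stays good with probability $1-m$,
and each of her $(1-f)G$ bad bits becomes good with probability $m$,
so the child's expected normalized fitness is
\[
 f(1-m) + (1-f) m = f + m (1 - 2f) = \frac{1}{2} + (1-2m) \deltaf .
\]
Each bit of the child is an independent binary variable with variance $m(1-m)$,
whichever state it started in, so the child's fitness $F$ has variance $G m(1-m)$
and her normalized fitness $F/G$ has variance $m(1-m)/G$.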
% where the approximation is based on the assumption that the mutation
% rate $m$ will be small.
% If $G$ is large, this binomial distribution is well approximated
% by a Gaussian, and w
If the population of parents has mean $\deltaf(t)$
and variance $\sigma^2(t) \equiv \beta \linefrac{m}{G}$, then
the child population, before selection, will
have mean $(1-2 m) \deltaf(t)$ and variance $(1+\beta) \linefrac{m}{G}$.
Natural selection chooses the upper half of this distribution,
% e Gaussian,
so the mean fitness and variance of fitness
at the next generation are given by
\beq
\deltaf(t\!+\!1) = (1-2 m) \deltaf(t)
+ \alpha \sqrt{(1+\beta)} \sqrt{\frac{m}{G} } ,
% + \sqrt{\frac{2}{\pi}} \sqrt{ \frac{m}{G} } .
\label{eq.rate1}
\eeq
\beq
\sigma^2(t\!+\!1) = \gamma (1+\beta) \frac{m}{G} ,
\eeq
where $\alpha$ is the mean deviation of the selected children from the
mean of the child distribution, measured in
standard deviations,
and $\gamma$ is the factor by which the child distribution's
variance is reduced by selection.
The
numbers $\alpha$ and $\gamma$ are of order 1.
% , and satisfies $\alpha \leq 1$.
For the
case of a Gaussian distribution, $\alpha = \sqrt{\linefrac{2}{\pi}} \simeq
0.8$
and $\gamma = (1-2/\pi) \simeq 0.36$.
If we assume that the variance is in
dynamic equilibrium, \ie, $\sigma^2(t\!+\!1) \simeq \sigma^2(t)$,
then
\beq
\gamma (1+\beta) = \beta, \mbox{ so } (1+\beta) = \frac{1}{1-\gamma},
\eeq
and the factor $\alpha \sqrt{(1+\beta)}$ in \eqref{eq.rate1}
is equal to 1, if we take the results for the Gaussian distribution,
an approximation that becomes poorest when the discreteness of
fitness becomes important, \ie, for small $m$.
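As a numerical check of this cancellation, using only the Gaussian values
quoted above: $\gamma = 1 - 2/\pi \simeq 0.36$ gives
$\beta = \gamma/(1-\gamma) \simeq 0.57$ and
$1+\beta = 1/(1-\gamma) = \pi/2 \simeq 1.57$, so
\[
 \alpha \sqrt{(1+\beta)} = \sqrt{\frac{2}{\pi}} \, \sqrt{\frac{\pi}{2}} = 1 .
\]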
% \footnote{We get the same result for any symmetric distribution. If
% the distribution is not symmetrical, then we are approximating.}
The rate of increase of normalized fitness is thus:
\beq
\frac{\d f}{\d t} \simeq -2 m \, \deltaf + \sqrt{\frac{m}{G}},
\label{eq.rate2}
\eeq
which,
assuming $G (\deltaf)^2 \gg 1$,
is maximized
% with respect to the mutation rate by setting $m$ to
for
\beq
m_{\rm opt} = \frac{1}{16 G (\deltaf)^2} ,
\label{eq.mopt}
\eeq
% critical df is 0.2, if use the Gaussian approx.
%
% if keep m(1-m) around, and assume G(\deltaf)^2 >> 1, get
%
% something like 1/( 2 + 16 G df^2 )
%
at which point,
\beq
\left(\frac{\d f}{\d t}\right)_{\! \rm opt} = \frac{1}{8 G (\deltaf)}.
\eeq
So the rate of increase of fitness $F \eq fG$ is at most
\beq
\frac{\d F}{\d t} = \frac{1}{8 (\deltaf)} \:\:\mbox{per generation}.
\eeq
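The optimization can be spelled out as follows (a sketch, treating $\deltaf$
as fixed while $m$ is varied). Setting the derivative of the right-hand side of
\eqref{eq.rate2} with respect to $m$ to zero gives
\[
 -2 \, \deltaf + \frac{1}{2 \sqrt{m G}} = 0
 \:\:\: \Rightarrow \:\:\:
 \sqrt{m G} = \frac{1}{4 \, \deltaf} ,
\]
which is the optimal mutation rate quoted above. Substituting back, the two terms
of \eqref{eq.rate2} are $-1/(8 G \deltaf)$ and $+1/(4 G \deltaf)$, whose sum is
$1/(8 G \deltaf)$.
For illustration, with the hypothetical values $G = 1000$ and $\deltaf = 0.25$,
the optimal mutation rate is $m = 10^{-3}$, that is, one flipped bit per genotype on average,
and the fitness $F$ increases by $0.5$ per generation.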
% critical df is 0.08, if use the Gaussian approx.
For a population with low fitness ($\deltaf < 0.125$),
the rate of increase of fitness may exceed 1 unit per generation. Indeed,
if $\deltaf \lesssim 1/\sqrt{G}$, the rate of increase, if $m \eq \dhalf$,
is of order $\sqrt{G}$; this initial spurt can last only of order
one generation, because one such step is enough to raise $\deltaf$ to a value of order $1/\sqrt{G}$.
%
% if the mutation rate is tuned to the fitness,
For $\deltaf > 0.125$, the rate of increase of fitness is
% acquisition of information is
smaller than one per generation.
As the fitness approaches $G$, the optimal mutation rate
tends to $m \eq 1/(4 G)$, so that an average of $1/4$
bits are flipped per genotype, and the rate of increase of
fitness is also equal to $1/4$;
information is gained at a rate of about $0.5$ bits per generation.
It takes about $2 G$ generations for the
genotypes of all individuals in the population to
attain perfection.
For fixed $m$, the fitness is given by
\beq
\deltaf(t) = \frac{1}{2 \sqrt{mG}} ( 1 - c \, e^{-2 mt} ) ,
\label{eq.mutation.soln}
\eeq
subject to the constraint $\deltaf(t) \leq 1/2$,
where $c$ is a constant of integration, equal to 1 if $f(0)=1/2$.
If the mean
number of bits flipped per genotype, $mG$, exceeds 1, then
the fitness $F$ approaches an equilibrium value
$F_{\rm eqm} = (1/2 + 1/(2 \sqrt{mG})) G$.
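This solution can be verified by substitution, and the plateau can be read off
for particular mutation rates.
Writing $a \equiv 1/(2 \sqrt{mG})$ (a symbol introduced here only for brevity),
the solution (\ref{eq.mutation.soln}) is $\deltaf(t) = a ( 1 - c \, e^{-2mt} )$, and
\[
 \frac{\d}{\d t} \deltaf = 2 m \, a \, c \, e^{-2mt}
 = 2 m \left( a - \deltaf \right)
 = - 2 m \, \deltaf + \sqrt{\frac{m}{G}} ,
\]
in agreement with \eqref{eq.rate2}.
As an illustration with $G = 1000$: for $mG = 6$ the predicted plateau is
$F_{\rm eqm} = (1/2 + 1/(2\sqrt{6})) \, G \simeq 704$, whereas for $mG = 0.25$
the formula gives $a = 1$, which exceeds $1/2$, so in that case it is the constraint
$\deltaf(t) \leq 1/2$ that applies.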
% If $m$ is tuned to the optimal fitness-dependent value,
% $m_{\rm opt}$ (\ref{eq.mopt}),
% then the fitness is given, assuming $\deltaf(0) = 0$, by
%\beq
% \deltaf(t) = \frac{ t^{1/2} }{ 2 \sqrt{G} },
%\eeq
% which hits $\deltaf = 1/2$ at $t=G$.
This theory is somewhat inaccurate in that the true probability
distribution of fitness is non-Gaussian, asymmetrical, and quantized to
integer values. All the same, the predictions of the theory are
not grossly at variance with the results of simulations
described below.
% in section \ref{sec.simulations}.
\begin{figure}
\figuredanglenudge{\footnotesize
\begin{center}
\begin{tabular}{p{2in}cc}
& No sex & Sex \\
Histogram of parents' fitness
& \explanfig{iparent}
& \explanfig{iparent}
\\
Histogram of children's fitness
& \explanfig{mchild}
& \explanfig{schild}
\\
Selected children's fitness
& \explanfig{mnextparent}
& \explanfig{snextparent}
\\
\end{tabular}
\end{center}
}{
\caption[a]{Why sex is better than sex-free reproduction.\index{parthenogenesis}
If mutations are used to create variation among children,
then it is unavoidable that the average fitness of the children
is lower than the parents' fitness; the
greater the variation, the greater the average deficit. Selection bumps
up the mean fitness again.
In contrast,
%sex (recombination)
recombination produces variation without
a decrease in average fitness. The typical amount of variation
scales as $\sqrt{G}$, where $G$ is the genome size, so after
selection, the average fitness rises by $O(\sqrt{G})$.
}
\label{fig.nutshell}
}{-0.14in}
\end{figure}
\subsection{Theory of sex}
% {\em Shorten this bit.}
%
The analysis of the sexual population becomes tractable
with two approximations:
first, we assume that the {gene-pool} mixes sufficiently rapidly
that correlations between genes can be neglected; second, we
assume {\em homogeneity}, \ie, that
the fraction $f_g$ of bits $g$ that are in the good state
is the same, $f(t)$, for all $g$.
\begin{boxfloat}
\margincaption{
\caption[a]{Details of the {theory of sex}.}
\label{sec.sex.app}
}
\begin{framedalgorithm}
\footnotesize
% Theory of sex appendix
How does $f(t\!+\!1)$ depend on $f(t)$? Let's first assume
the two parents of a child both have exactly $f(t) G$ good bits, and,
by our homogeneity assumption, that those bits are independent
random subsets of the $G$ bits.
% (We will include variation in the parental population in a moment.)
The number of bits that
are good in both parents is roughly $f(t)^2 G$, and the number
that are good in one parent only is roughly $2 f(t)(1-f(t)) G$,
so the fitness of the child will be $f(t)^2 G$ plus
% a number drawn from a binomial distribution
the sum of $2 f(t)(1-f(t)) G$ fair coin flips, which
has a binomial distribution of mean $f(t)(1-f(t)) G$ and
variance $\frac{1}{2} f(t)(1-f(t)) G$.
The fitness of a child
is thus roughly distributed as
\[%\beq
F_{\rm{child}} \sim \mbox{Normal}\left(\mbox{mean}\eq f(t) G,
\mbox{variance}\eq \frac{1}{2} f(t)(1-f(t)) G \right) .
\]%\eeq
The important property of this distribution, contrasted with
the distribution under mutation, is that the mean fitness is equal
to the parents' fitness; the variation produced by sex does
not reduce the average fitness.
If we include the parental population's variance, which
we will write as $\sigma^2(t) = \beta (t) \frac{1}{2} f(t)(1-f(t)) G$,
the children's fitnesses are
% .
% The average of the two parents will have variance $\sigma^2(t)/2$,
% so the population of all children will have fitness, before selection,
distributed as
\[%\beq
F_{\rm{child}} \sim \mbox{Normal}\left(\mbox{mean}\eq f(t) G,
\mbox{variance}\eq \left(1+\frac{\beta}{2}\right)
\frac{1}{2} f(t)(1-f(t)) G \right) .
\]%\eeq
Natural selection selects the children on the upper side
of this distribution. The mean increase in
fitness will be
% of order
\[%\beq
\bar{F}(t\!+\!1) - \bar{F}(t)
= [ \alpha (1+\beta/2)^{1/2}/\sqrt{2} ] \sqrt{f(t)(1-f(t)) G},
\label{eq.alpha.sex}
\]%\eeq
% [A factor of $\sqrt{2/\pi}$ appears from the mean absolute
% value of a standard normal variate.]
and the variance of the surviving children will be
\[%\beq
\sigma^2(t+1) = \gamma (1+\beta/2) \frac{1}{2} f(t)(1-f(t)) G,
\]%\eeq
where $\alpha = \sqrt{2/\pi}$ and
$\gamma = (1-2/\pi)$.
If there is dynamic equilibrium [$\sigma^2(t+1) = \sigma^2(t)$]
then
%\[%\beq
% \gamma (1+\beta/2) = \beta , \mbox{ so } (1+\beta/2) = \frac{2}{2-\gamma} ,
%\]%\eeq
% and
the factor in square brackets in the expression for the mean increase in fitness is
\[%\beq
\alpha (1+\beta/2)^{1/2}/\sqrt{2}
% = {\alpha}\frac{1}{(2-\gamma)^{1/2}}
% = \sqrt{ \frac{ 2/\pi }{ 1 + 2/\pi } }
= \sqrt{\frac{2}{(\pi+2)}} \simeq 0.62.
\]%\eeq
% print sqrt((4/pi)/(1+2/pi))
% 0.882025543449103
% print sqrt((2/pi)/(1+2/pi))
% 0.62368624295261
Defining this constant to be $\eta \equiv \sqrt{{2/(\pi+2)}}$,
% formerly, eta was 1/sqrt(pi)
we conclude that, under sex and natural selection,
the mean fitness of the population
increases at a rate
{\em proportional to the square root of the size of the
genome},
\[%\beq
\frac{\d\bar{F}}{\d t}
\simeq \eta \sqrt{f(t)(1-f(t)) G} \:\:\:\mbox{bits per generation}.
\]%\eeq
% If, recklessly, we take our homogeneity assumption to hold
% for all time, we can
% write $\bar{F} = f G$ and obtain the differential equation:
%\[%\beq
% \frac{\d f}{\d t} \simeq \frac{\eta}{\sqrt{G}} \sqrt{f(t)(1-f(t))} ,
%\]%\eeq
%% an equation
% whose solution is
%%\[%\beq
%% \sin^{-1}( 2 f(t) - 1 ) = \frac{1}{\sqrt{G}} ( C + t ) ,
%%\]%\eeq
%% or
%%\[%\beq
%% ( 2 f(t) - 1 ) = \sin( \frac{1}{\sqrt{G}} ( C + t ) ) ,
%%\]%\eeq
%% or
%\[%\beq
% f(t) = \frac{1}{2} \left[ 1 + \sin \left(
% \frac{\eta}{\sqrt{G}} ( t + c )
% \right) \right] ,
%\:\:\:\mbox{ for $t+c \in \left(-\frac{\pi}{2}\sqrt{G}/\eta,\frac{\pi}{2}\sqrt{G}/\eta
% \right)$,}
%\label{eq.sex.solution.app}
%\]%\eeq
% where $c$ is a constant of integration, $c = \sin^{-1} (2 f(0) - 1)$.
%% asin( 2*a0 - 1 )
\end{framedalgorithm}
\end{boxfloat}
Given these assumptions, if two parents of fitness $F \eq fG$
mate, the probability distribution of their children's fitness
has mean equal
to the parents' fitness, $F$; the variation produced by sex does
not reduce the average fitness. The standard deviation
of the fitness of the children scales as $\sqrt{G f(1-f)}$.
Since, after selection, the increase in fitness is
proportional to this standard deviation, {\em the
fitness increase per generation scales as the square root of the size of the
genome,} $\sqrt{G}$.
As shown in \boxref{sec.sex.app}, the mean fitness $\bar{F} \eq f G$
evolves in accordance with the differential equation:
\beq
\frac{\d\bar{F}}{\d t} \simeq {\eta} \sqrt{f(t)(1-f(t)) G} ,
\eeq
where $\eta \equiv \sqrt{{2/(\pi+2)}}$.
% an equation
The solution of this equation is
%\beq
% \sin^{-1}( 2 f(t) - 1 ) = \frac{1}{\sqrt{G}} ( C + t ) ,
%\eeq
% or
%\beq
% ( 2 f(t) - 1 ) = \sin( \frac{1}{\sqrt{G}} ( C + t ) ) ,
%\eeq
% or
\beq
f(t) = \frac{1}{2} \left[ 1 + \sin \left(
\frac{\eta}{\sqrt{G}} ( t + c )
\right) \right] ,
\:\:\:\mbox{ for $t+c \in
\left(-\frac{\pi}{2}\sqrt{G}/\eta,\frac{\pi}{2}\sqrt{G}/\eta
\right)$,}
\label{eq.sex.solution}
\eeq
where $c$ is a constant of integration, $c = (\sqrt{G}/\eta) \sin^{-1} (2 f(0) - 1)$.
% asin( 2*a0 - 1 )
So this idealized system reaches a state of
eugenic\index{eugenics}
perfection $(f=1)$ within a finite time: $(\pi/\eta)\sqrt{G}$ generations.
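That (\ref{eq.sex.solution}) solves the differential equation can be checked by
substitution. Writing $\theta \equiv \eta (t+c)/\sqrt{G}$ (a symbol introduced here
for brevity), so that $f = \frac{1}{2} ( 1 + \sin \theta )$, we have
$f(1-f) = \frac{1}{4} \cos^2 \theta$, and
\[
 \frac{\d f}{\d t} = \frac{\eta}{2 \sqrt{G}} \cos \theta
 = \frac{\eta}{\sqrt{G}} \sqrt{f(1-f)} ,
\]
which is the differential equation for $\bar{F} = fG$ divided by $G$
(the cosine is positive on the stated interval).
As an illustration, for $G = 1000$ and $f(0) = 1/2$, the time predicted for the
climb from $f = 1/2$ to $f = 1$ is $(\pi/2) \sqrt{G}/\eta \simeq 80$ generations.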
\begin{figure}
\figuremargin{\footnotesize
\begin{center}\small
\begin{tabular}{c}
\raisebox{13pt}{(a)}\hspace{-0.2in}\mbox{\psfig{figure=perl1/1000.1000.d.ps,width=2.8in,angle=-90}}\\
%(b)\mbox{\psfig{figure=perl1/1000.500.d.ps,width=2in,angle=-90}}\\
%(c)\mbox{\psfig{figure=perl1/1000.200.d.ps,width=2in,angle=-90}}&
%(d)\mbox{\psfig{figure=perl1/1000.100.d.ps,width=2in,angle=-90}}\\
\hspace{-0.3in}\begin{tabular}{cc}
%(b1)\mbox{\psfig{figure=perl1/1000.1000.25M.ps,width=2in,angle=-90}}&
(b)\hspace{-0.2in}\mbox{\psfig{figure=perl1/1000.1000.25S+M.ps,width=2.53in,angle=-90}}&
%(b2)\mbox{\psfig{figure=perl1/1000.1000.6M.ps,width=2in,angle=-90}}&
(c)\hspace{-0.2in}\mbox{\psfig{figure=perl1/1000.1000.6S+M.ps,width=2.53in,angle=-90}}\\
\end{tabular}
\end{tabular}
\end{center}
}{
\caption[a]{Fitness as a function of time.
% These experiments were identical to those in figure 1,
% except that I forced all the initial genomes to have
% fitness exactly $F=G/2$, instead of picking the
% genotypes completely at random.
The genome size is $G=1000$.
%
The dots show
the fitness of six randomly selected individuals from the
birth population at each generation.
% The error bars show
% the standard deviation of fitness in the population.
The initial population of $N=1000$ had randomly
generated genomes
with $f(0) = 0.5$ (exactly).
(a) Variation produced by {sex} alone. Line shows theoretical curve
(\ref{eq.sex.solution})
for infinite homogeneous population.
(b,c) Variation produced by mutation, with and without sex,
when the mutation rate is $mG=0.25$ (b) or 6 (c) bits per
genome. The dashed line shows the curve (\ref{eq.mutation.soln}).
%
% (c) Variation produced by mutation, with and without
% sex, when the mutation rate is $mG=6$ bits per
% genome.
}
\label{fig.fitness.500}
\label{fig.fitness.1000}
}
\end{figure}
\subsection{Simulations}
\label{sec.simulations}
Figure \ref{fig.fitness.1000}(a) shows the fitness
of a sexual population of $N=1000$ individuals with a
genome size of $G=1000$ starting from
a random initial state with normalized fitness $0.5$.
It also shows the theoretical curve $f(t)G$
% using $f(t)$ derived for
% the infinite homogeneous population,
from \eqref{eq.sex.solution},
which fits remarkably well.
In contrast, figures \ref{fig.fitness.1000}(b) and (c) show the
evolving fitness when variation is
produced by mutation at rates
$m=0.25/G$ and $m=6/G$ respectively. Note the difference in the
horizontal scales from panel (a).
% Figure \ref{fig.fitness.1000}(b) shows the fitness
% of a population of $N=500$ individuals with a
% genome size of $G=1000$ starting from
% a random initial state with normalized fitness $0.1$.
%
% Figures \ref{fig.fitness.1000}(c) and (d) show what happens for smaller
% population sizes, $N=200$ and $N=100$.
\exercissxC{3}{ex.smallpopn}{
%\subsection{Small populations}
{\sf Dependence on population size}.
How do the results for a sexual population depend on the
population size? We anticipate that there is a minimum population
size above which the theory of sex is accurate.
% infinite-population approximation works well.
How
% In what way
is that minimum population size
related to $G$?
}
\exercisxC{3}{ex.crossover}{
{\sf Dependence on crossover mechanism}.
In the simple model of sex, each bit is taken at random
from one of the two parents, that is, we allow crossovers
to occur with probability 50\% between any two adjacent
nucleotides.
How is the model affected
%\ben
%\item
(a)
if the crossover probability
is smaller?
(b)
% \item
if crossovers occur exclusively
at {\dem\ind{hot-spot}s\/}
located every $d$ bits along the genome?
% \een
}
\begin{figure}
\figuremargin{\footnotesize
\begin{center}
\begin{tabular}{ccc}
& $G=1000$ &
$G=100\,000$ \\
\raisebox{1in}{$mG$}\hspace{-0.2in} &
\mbox{\psfig{figure=psm/maxrate.1000.ps,width=2.32in,angle=-90}}
&\mbox{\psfig{figure=psm/maxrate.100000.ps,width=2.32in,angle=-90}}
\\
&
$f$ & $f$ \\
\end{tabular}
\end{center}
}{
\caption[a]{Maximal tolerable mutation rate, shown as number of
errors per genome ($mG$), versus normalized fitness $f=F/G$.
Left panel: genome size $G=1000$; right:
$G=100\,000$.
Independent of genome size, a parthenogenetic species (no sex) can
tolerate only of order 1 error per genome per generation;
a species that uses recombination (sex) can tolerate far greater
mutation rates.
}
\label{fig.maxrate}
}
\end{figure}
\section{The maximal tolerable mutation rate}
%{Sex with mutations}
%{\em This section needs checking over, to confirm the
% details of the factors of $\eta$, etc.}
What if we combine the two models of variation? What
is the maximum mutation rate that can be tolerated by a
species that has sex?
The rate of increase of fitness is given by
\beq
\frac{\d f}{\d t} \simeq - 2 m \, \deltaf +
\eta\sqrt{{2}} \sqrt{ \frac{m + f(1-f)/2}{G} } ,
\eeq
which
% This quantity
is positive if
%\beq
% 2 m \, \deltaf < \eta\sqrt{{2}} \sqrt{ \frac{m + f(1-f)/2}{G} } .
%\eeq
% Replacing $\deltaf$ by its largest value, $1/2$, and omitting the $m$
% on the right-hand side,
% the rate of increase of fitness is positive, for a given $f$,
% if
the mutation rate satisfies
\beq
m < \eta\sqrt{\frac{f(1-f)}{G}} .
\eeq
Let us compare this rate with the result in the absence of sex,
which, from \eqref{eq.rate2}, is that the maximum tolerable mutation rate
is
\beq
m < \frac{1}{G} \frac{1}{(2 \, \deltaf)^2} .
\label{eq.no.sex.crit.m}
\eeq
% These two maximum mutation rates are of completely different
% orders.
% (May I be permitted an exclamation mark?)
The tolerable mutation rate with sex is
of order $\sqrt{G}$ times greater than that without sex!
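The condition for the sexual species can be obtained as follows; this is a sketch
of a sufficient condition, not a tight bound.
Since $\deltaf \leq 1/2$, the loss term satisfies $2 m \, \deltaf \leq m$;
and dropping the $m$ inside the square root can only reduce the gain term,
which then becomes
\[
 \eta \sqrt{2} \sqrt{ \frac{f(1-f)/2}{G} } = \eta \sqrt{ \frac{f(1-f)}{G} } .
\]
So the rate of increase of fitness is positive whenever $m$ is smaller than this quantity.
As an illustration, take $f = 0.75$ (so $\deltaf = 0.25$) and $G = 100\,000$:
the bound with sex corresponds to $mG \simeq 85$ bits per genome, whereas
(\ref{eq.no.sex.crit.m}) gives $mG = 4$.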
% this is d/dm[ df/dt ]:
%plot[x=0.5:1] -(2*x-1) + 1.0/sqrt(G*pi * x*(1-x) )
% optimum mutation rate:
% is m=0
% this omits G:
%f=0.75; plot[m=0:0.5] -(2*m*(f-0.5)) + sqrt(2.0/pi) * sqrt(m+pi * f*(1-f)/2.0 )
A parthenogenetic (non-sexual) species could try to wriggle out of
this bound on its mutation rate by increasing its litter sizes.
% , so as to tolerate higher mutation rates.
But if mutation flips on average $mG$ bits, the probability
that no bits are flipped in one genome is roughly $e^{-mG}$, so a mother
needs to have roughly $e^{mG}$ offspring in order
to have a good chance of having one child with
the same fitness as her. The litter size of a non-sexual
species thus has
to be exponential in $mG$ (if $mG$
% the factor by which $m$
is bigger than 1),
% exceeds the critical value defined in equation \ref{eq.no.sex.crit.m},
if the species is to persist.
So the maximum tolerable mutation rate is pinned close to
$1/G$, for a non-sexual species, whereas it is a larger
number of order $1/\sqrt{G}$, for a species with
recombination.
Turning these results around, we can predict the largest
possible genome size
for a given fixed mutation rate, $m$.
For a parthenogenetic species,
the largest genome size is of order $1/m$, and for a sexual species, $1/m^2$.
Taking the figure $m= 10^{-8}$ as the mutation rate
per nucleotide per generation \cite{EWK99},
% 2 \times this going by EWK actually
and allowing for a maximum brood size of $20\,000$ (that is,
$mG \simeq 10$),
we predict that
all species with more than $G = 10^{9}$ coding
nucleotides make at least occasional use of recombination.
If the brood size is 12, then this number falls to
$G = 2.5 \times 10^{8}$.
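The arithmetic behind these two figures is as follows, under the rough criterion
used above that a brood of size $B$ (a symbol introduced here for illustration)
can sustain a mutation load of $mG \simeq \ln B$:
\[
 \ln 20\,000 \simeq 9.9 \:\: \Rightarrow \:\: G \simeq \frac{10}{10^{-8}} = 10^{9} ;
 \qquad
 \ln 12 \simeq 2.5 \:\: \Rightarrow \:\: G \simeq \frac{2.5}{10^{-8}} = 2.5 \times 10^{8} .
\]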
% graveyard.tex
\section{Fitness increase and information acquisition}
For this simple model it is possible to relate increasing fitness
to information acquisition.
If the bits are set at random, the fitness is roughly
$F=G/2$.
If evolution leads to a population in which
all individuals have the maximum fitness $F=G$, then
$G$ bits of information have been acquired by the species,
namely for each bit $x_g$, the species has figured
out which of the two states is the better.
We define the information acquired at an intermediate
fitness
% , suggested by \citeasnoun{Kimura61}, is
to be the amount of selection (measured in bits)
required to select the perfect state from the gene pool.
Let a fraction $f_g$ of the population
have $x_g \eq 1$. Because $\log_2 (1/f)$ is the information required to
find a black ball in an urn containing black and white balls
in the ratio $f:1\!-\!f$,
%
% Defining $\delta F \equiv F-G/2$, it will be convenient
% to
% We therefore view the fitness as measuring, in bits,
we define the information acquired to be
\beq
 I = \sum_g \log_2 \frac{ f_g }{ 1/2 } \:\:\mbox{bits} .
\eeq
If all the fractions $f_g$ are equal to $F/G$, then
\beq
I = G \log_2 \frac{ 2F }{ G } ,
\eeq
which is well approximated by
\beq
\tI \equiv 2( F-G/2 ) .
\eeq
The rate of information acquisition is thus roughly two times
the rate of increase of fitness in the population.
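To see how good this approximation is, one can compare the two expressions per bit,
$I/G = \log_2 (2f)$ and $\tI/G = 2f - 1$, at a few illustrative values of $f = F/G$:
\[
 \begin{array}{ccc}
 f & \log_2 (2f) & 2f - 1 \\ \hline
 0.5 & 0 & 0 \\
 0.75 & 0.58 & 0.5 \\
 0.9 & 0.85 & 0.8 \\
 1 & 1 & 1
 \end{array}
\]
The two expressions agree exactly at the end points, and the linear form
underestimates $I$ slightly in between.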
% We will find it useful to define the normalized
% fitness $f(\bx) \equiv F(\bx)/G$.
\section{Discussion}
These results quantify the well-known
argument for why species reproduce by sex with recombination, namely
that recombination allows useful mutations to spread more rapidly through
the species and allows deleterious mutations to be more rapidly cleared
from the population \cite{JMS78,Felsenstein85,JMS88,JMSES95}.
%
%
A population that reproduces by recombination can
% parthenogenesis
% and experiences
% variation through
% random mutations can
acquire information from natural selection at a
rate of order $\sqrt{G}$ times faster than
a parthenogenetic population, and it can tolerate
% only of about one bit per generation.
% A population that reproduces by sex
% can acquire information at a rate of order $\sqrt{G}$, the
% square root of the size of the genome. For
a mutation rate that is of order $\sqrt{G}$ times greater.
For
genomes of size $G \simeq 10^8$ coding nucleotides,
this factor of $\sqrt{G}$
is
% e differences between these two rates are
substantial.
This enormous advantage conferred by sex has been noted before
by \citeasnoun{Kondrashov1988},
but this meme, which Kondrashov calls `the deterministic mutation hypothesis',
does not seem to have diffused throughout the
evolutionary research community, as there are still numerous
papers in which the prevalence of sex is viewed as a
mystery to be explained by elaborate mechanisms.
% removed to itp/tex/genecut.tex Tue 22/10/02
\subsection*{`The cost of males' -- stability of a gene for sex or parthenogenesis}
Why
% has the meme explaining the prevalence of sex been swamped by this plethora of articles that
do people declare sex to be a mystery?
The main motivation for being mystified is an idea
called the `\ind{cost of males}'.\index{male}\index{female}
Sexual reproduction is disadvantageous compared with asexual reproduction,
it's argued, because of every two offspring produced by sex, one
(on average) is a useless male, incapable of child-bearing,
and only one is a productive female. In the same time,
a parthenogenetic mother could give birth to {\em two\/}
female clones.
To put it another way, the big advantage of parthenogenesis, from the
point of view of the individual, is that one is able
to pass on 100\% of one's genome to one's children,
instead of only 50\%.
%
Thus if there were two versions of a species, one
reproducing with and one without sex, the
% population of
single mothers would be expected to
outstrip their sexual cousins. The simple model presented
thus far did not include either genders or the ability
to convert from sexual reproduction to asexual, but we can easily
modify the model.
% include the effect which supposedly should give a disadvantage
% to sexual production.
\begin{figure}
\figuremargin{
\begin{center}\small\footnotesize
\small\begin{tabular}{c@{\hspace*{-0.05in}}c@{\hspace*{-0.1in}}c}
&
\mbox{(a) $mG=4$}
&
\mbox{(b) $mG=1$}
\\
\raisebox{0.6in}{\rotatebox{90}{\footnotesize\sf Fitnesses}} & \fitfig{F1000.1000.4.C4} & \fitfig{F1000.1000.1.C4} \\
\raisebox{0.6in}{\rotatebox{90}{\footnotesize\sf Percentage}} & \fitfig{P1000.1000.4.C4} & \fitfig{P1000.1000.1.C4} \\
\end{tabular}
\end{center}
}{
\caption[a]{Results when there is a gene for parthenogenesis,
and no interbreeding, {\em and single mothers produce as many children
as sexual couples}. $G=1000$, $N=1000$.
(a) $mG = 4$; (b) $mG=1$.
%Vertical axis shows both fitness and
Vertical axes show the fitnesses of the two
sub-populations, and the percentage of the population
that is parthenogenetic.}
\label{fig.mixed2.C4}
}
\end{figure}
We modify the
model so that one of the $G$ bits in the
genome determines whether an
individual prefers to reproduce
parthenogenetically ($x \eq 1$) or
sexually ($x \eq 0$).
%
The results depend on the number of children
had by a single parthenogenetic mother, $\Kp$, and the number
of children born to a sexual couple, $\Ks$.
Both ($\Kp \eq 2$, $\Ks \eq 4$) and ($\Kp \eq 4$, $\Ks \eq 4$)
are reasonable models. The former ($\Kp \eq 2$, $\Ks \eq 4$)
would seem most appropriate
in the case of unicellular organisms, where the cytoplasm
of both parents goes into the children. The latter ($\Kp \eq 4$, $\Ks \eq 4$)
is appropriate if the children are solely nurtured by
one of the parents, so single mothers have just as many offspring
as a sexual pair. I concentrate on the latter model, since it gives the
greatest advantage to the parthenogens, who are supposedly
expected to outbreed the sexual community.
Because parthenogens have four children per generation, the maximum
tolerable mutation rate for them is twice the expression
(\ref{eq.no.sex.crit.m})
derived before for $\Kp \eq 2$. If the fitness
is large, the maximum tolerable rate is $mG \simeq 2$.
Initially the genomes are set randomly with $F=G/2$,
with half of the population having the gene for parthenogenesis.
%
%\subsection{$\Kp \eq 4$, $\Ks \eq 4$, `consensual sex'}
\Figref{fig.mixed2.C4} shows the outcome.
% if single parthenogens produce as many offspring
% as a {\em pair\/} of sexuals.
During the `learning' phase of evolution,\index{learning!in evolution}\index{evolution!as learning}
in which the fitness is increasing rapidly,
pockets of parthenogens appear briefly, but
then disappear within a couple of generations
as their sexual cousins overtake them in fitness
and leave them behind. Once the population reaches its
top fitness, however, the parthenogens can take over,
if the mutation rate is sufficiently low ($mG\eq1$).
% In these simulations, sex does not tend to reappear once
% the parthenogens have taken over, because a small sexual
% community, having size $N_{\rm sexual}<\sqrt{G}$,
% will be in-bred and will not have the advantage
% discussed in the rest of this paper.
In the presence of a higher mutation rate ($mG \eq 4$),
however, the parthenogens never take over. The breadth of the
sexual population's fitness is of order $\sqrt{G}$, so
a mutant parthenogenetic colony arising
with slightly above-average fitness will last for about
$\sqrt{G}/(mG) = 1/(m\sqrt{G})$ generations before its fitness falls
below that of its sexual cousins. As long as the population
size is sufficiently large for some sexual individuals
to survive for this time, sex will not die out.
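For a sense of scale, with the values used in the simulations, $G = 1000$ and $mG = 4$,
this rough estimate gives
\[
 \frac{\sqrt{G}}{mG} \simeq \frac{32}{4} = 8 \:\:\mbox{generations}
\]
as the lifetime of such a parthenogenetic colony.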
In a sufficiently unstable environment, where the
fitness function is continually changing,
the parthenogens will always lag behind the sexual
community.
These results are consistent with
the argument of \index{Haldane, J.B.S.}{Haldane}
% \citeasnoun{Haldane1949}
and \index{Hamilton, William D.}\citeasnoun{Hamilton2002}
% \citeasnoun{Hamilton1990}
% {Hamilton}
that sex is helpful
% maintains variation which is
% useful in the co-evol. arms race with parasites.
in an \ind{arms race} with parasites. The \ind{parasite}s define
an effective fitness function which changes with time,
and a sexual population will always ascend the current fitness
function more rapidly.
\subsection{Additive fitness function}
Of course, our results depend on the fitness
function that we assume, and on our model
of selection. Is it reasonable to model fitness, to first
order, as a {\em sum\/} of independent terms?
\citeasnoun{Smith68} argues that it is: the more good genes you
have, the higher you come in the pecking order, for example.
The directional selection model has been used extensively in theoretical population
genetic studies \cite{Bulmer1985}.
We might expect real fitness functions to involve interactions,
in which case crossover might reduce the average fitness.
However, since recombination gives the biggest advantage
to species whose fitness functions are additive, we might predict
that {\em evolution will have favoured species that used a representation
of the genome that corresponds to a fitness function that
has only weak interactions}. And even if there are interactions,
it seems plausible that the fitness would
still involve a sum of such interacting terms, with the number
of terms being some fraction of the genome size $G$.
% moved this to genecut.tex
% Fitness functions that are sums of interacting terms are investigated in section \ref{sec.interactions}.
\exercisxC{3C}{ex.interactions}{
Investigate how fast sexual
and asexual species evolve if they have a fitness
function with interactions.
For example, let the fitness be a sum of exclusive-ors of pairs
of bits; compare the evolving fitnesses with
those of the sexual
and asexual species with a simple additive fitness function.
}
\begincuttable
Furthermore, if the \ind{fitness} function were a highly nonlinear
function of the genotype, it could be made more smooth and locally linear
by the \ind{Baldwin effect}.
The Baldwin effect \cite{Baldwin1896,HintonNowlan87}
has been widely studied as a mechanism whereby
{\em learning\/} guides evolution, and it could also act at the level of
transcription and translation.
Consider the \ind{evolution} of a peptide sequence for
a new purpose. Assume the effectiveness
of the peptide is a highly nonlinear function of the
sequence, perhaps having a small island of good sequences surrounded\index{evolution!Baldwin effect}
by an ocean of equally bad sequences. In an
organism whose transcription and translation machinery
is flawless, the fitness will be an equally nonlinear function
of the DNA sequence, and evolution will wander around the
ocean making progress towards the island only by
a random walk. In contrast, an organism having the same
DNA sequence, but whose DNA-to-RNA
transcription or RNA-to-protein translation is `faulty',
will occasionally, by mistranslation or mistranscription,
accidentally produce a working enzyme; and it will do so with greater
probability if its DNA sequence is close to a good
sequence. One cell might produce 1000 proteins from the
one mRNA sequence, of which 999 have no enzymatic effect, and one
does. The one working catalyst will be enough for that cell
to have an increased fitness relative to rivals whose DNA sequence
is further from the island of good sequences.
For this reason I conjecture that,
at least early in evolution, and perhaps still now, the
\ind{genetic code} was not implemented perfectly but was implemented noisily,\index{evolution!of the genetic code}
with some codons coding for a distribution of possible
\ind{amino acid}s. This noisy code could even be switched on and off
from cell to cell in an organism by
having multiple aminoacyl-tRNA synthetases, some more reliable than
others.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Whilst our model assumed that the bits of the genome do not interact,
ignored the fact that the information is represented redundantly,
assumed that there is a direct relationship between phenotypic
fitness and the genotype,
and assumed that the crossover probability in recombination is high,
I believe these qualitative results would still hold if more complex
models of fitness and crossover were used: the relative benefit
of sex will still scale as $\sqrt{G}$.
% , where $G$ is proportional to the genome size.
Only in small, in-bred populations
are the benefits of sex expected to be diminished.
In summary: Why have sex? Because sex is good for your bits!
\section*{Further reading}
% Do all self-replicating systems have a lot of information content?
% If so, how did life start at all, given that information can be acquired by
% natural selection only gradually?
How did a high-information-content self-replicating system ever
emerge in the first place?
In the general area of the origins of life and other tricky questions about evolution,
% , the genetic code, and sex,
I highly recommend \citeasnoun{JMSES95}, \citeasnoun{JMSES99}, \citeasnoun{Kondrashov1988},
\citeasnoun{JMS88}, \citeasnoun{MarkRidley}, \citeasnoun{Dyson1985},
\citeasnoun{CairnsSmith1985}, and \citeasnoun{Hopfield1978}.
\section{Further exercises}
\ExercisxC{3}{ex.estimateDNAerror}{
How good must the error-correcting
machinery in \index{DNA!replication}DNA replication be, given
that mammals have not all died out long ago?
Estimate the probability of nucleotide substitution, per cell division.
%\soln{ex.estimateDNAerror}{
[See
% chapter
\appendixref{ch.numbers}.]
% for some estimates.]
%}
}
\ExercisxC{4}{ex.dna-ecc}{
Given that {DNA replication} is achieved by bumbling
\ind{Brownian motion} and ordinary thermodynamics
 in a biochemical \ind{porridge} at a temperature of 35$\,^{\circ}$C, it's astonishing
that the error-rate of \index{DNA!replication}DNA replication is about $10^{-9}$ per
replicated nucleotide. How can this reliability be achieved,\index{error correction!in DNA replication}\index{error-correcting code!in DNA replication}
given that the energetic difference between a correct
base-pairing and an incorrect one is only one or two \ind{hydrogen bond}s
% (one hydrogen bond is worth about 1$\,$kJ$\,$mol$^{-1}$ in free energy)
and the thermal energy $kT$ is only about a factor of
four smaller than the free energy associated with a hydrogen bond?
% about 8$\,$kJ$\,$mol$^{-1}$.
% thermal energy is 0.6 kcal/mol
% hydrogen bond is 1-5 kcal/mol
If ordinary thermodynamics is what favours correct \ind{base-pairing},\index{Watson--Crick base pairing}
surely the frequency of incorrect base-pairing should be
about
\beq
f = \exp( - \dfrac{\upDelta E}{kT} ),
\eeq
where $\upDelta E$ is the free energy difference, \ie,
an error frequency of $f \simeq 10^{-4}$?
% \exp(-8)$?
%
How has DNA replication cheated thermodynamics?
The situation is equally perplexing
in the case of \ind{protein synthesis},\index{puzzle!fidelity of DNA replication}
which translates an mRNA sequence into a polypeptide in accordance
with the genetic code. Two specific chemical reactions are
protected against errors: the binding of tRNA molecules to amino acids,
and the production of the polypeptide in the ribosome, which,
like DNA replication, involves base-pairing.
Again, the fidelity is high (an error rate of about $10^{-4}$),
and this fidelity can't be caused by the
energy of the `correct' final state being especially low --
the correct polypeptide sequence is not expected to be significantly lower in energy
than any other sequence. How do cells perform error correction?\index{error correction!in protein synthesis}\index{error-correcting code!in protein synthesis}
 (See \citeasnoun{Hopfield1974} and \citeasnoun{Hopfield1980}.)\index{Hopfield, John J.}
}
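 As a rough order-of-magnitude check on the figure of $10^{-4}$ quoted in the exercise above,
 take its numbers at face value: if each hydrogen bond is worth about $4kT$ and
 a correct base-pairing is favoured by two such bonds, then
\beq
 \frac{\upDelta E}{kT} \simeq 2 \times 4 = 8 ,
 \qquad
 f = \exp( - 8 ) \simeq 3 \times 10^{-4} .
\eeq
 With only one hydrogen bond the same estimate gives
 $f = \exp(-4) \simeq 2 \times 10^{-2}$. Either way, this simple
 thermodynamic estimate falls many orders of magnitude short of the
 quoted error rate of about $10^{-9}$ per replicated nucleotide.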
\ExercisxC{2}{ex.estimateBrainmemoryrate}{
While the \ind{genome} acquires information through
natural selection at a rate of a few bits per generation,
% (\chref{ch.sex}),
% exerciseonlyref{ex.evolutionteach}),
your brain acquires information at a greater rate.
 Estimate at what rate new information can be stored in
 long-term memory by your brain. Think of learning
the words of a new language, for example.
}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Solutions}
\soln{ex.smallpopn}{
For small enough $N$,
whilst the average fitness of the population increases,
some unlucky bits become frozen into the bad state. (These
bad genes are sometimes known as \ind{hitchhiker}s.)
The homogeneity assumption breaks down.
Eventually, all individuals have identical genotypes that
are mainly 1-bits, but contain some 0-bits too.
The smaller the population, the greater the number of
frozen 0-bits is expected to be.
 How small can the population size $N$ be if the theory of sex is to remain accurate?
We find experimentally that the theory based on assuming homogeneity
fits poorly only if the population size
$N$ is smaller than $\sim\! \sqrt{G}$.
If $N$ is significantly smaller than $\sqrt{G}$, information
cannot possibly be acquired at a rate as big as $\sqrt{G}$,
since the information content of the Blind Watchmaker's
decisions cannot be any greater than $2N$ bits per generation,
this being the number of bits required to specify which
of the $2N$ children get to reproduce.
\citeasnoun{Baum95}, analyzing a similar model, show that
the population size $N$ should be about $\sqrt{G}(\log G)^2$
to make hitchhikers unlikely to arise.
% the finite population's fitness is to rise at the same rate
% as the infinite population.
}
\dvips
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%% PART %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\renewcommand{\partfigure}{\poincare{8.frag1}}
\part{Probabilities and Inference}
\subchapter{About Part IV}% Introduction to
%
\fakesection{introduction to pt IV}
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
The number of inference problems
that can (and perhaps should) be tackled
by Bayesian inference methods is enormous.
In this book, for example, we discuss the decoding problem for
error-correcting codes, the task of inferring clusters
from data, the task of interpolation through noisy data,
and the task of classifying patterns given labelled examples.
Most techniques for solving these problems
can be categorized as follows.
\begin{description}
\item[Exact methods] compute the required quantities
directly. Only a few interesting problems have a direct
solution, but exact methods are important as tools
for solving subtasks within larger problems.
Methods for the exact solution of inference problems
are the subject of Chapters \ref{ch.enumerate},
\ref{ch.exactmarg}, \ref{ch.exact}, and \ref{ch.sumproduct}.
% for example using forward-backward within EM.
\item[Approximate methods] can be subdivided into
\ben
\item {\bf deterministic approximations}, which include\index{approximation!of complex distribution}
maximum likelihood (\chref{ch.ml}),
Laplace's method (Chapters \ref{ch.laplace} and \ref{ch.occam})
and variational methods (\chapterref{ch.variational}); and
\item {\bf Monte Carlo methods} -- techniques in which
random numbers play an integral part -- which will be discussed
in Chapters \ref{ch.mc},
\ref{ch.mc2}, and \ref{ch.mcexact}.
\een
\end{description}
% removed fit.tex from here
% removed material from here to enumerate.tex
% \section{Overview}
This part of the book does not form a one-dimensional
story. Rather, the ideas make up a web of interrelated threads which
will
% . These threads
recombine in subsequent chapters.
\Chapterref{ch.bayes}, which
is an honorary member of this part,
discussed a range of simple examples of inference
problems and their Bayesian solutions.
To give further motivation for the toolbox of
inference methods discussed in this part,
\chapterref{ch.clustering} discusses the problem of clustering; subsequent chapters
discuss the probabilistic interpretation of
clustering as \ind{mixture modelling}.
\Chapterref{ch.enumerate} discusses the option of
dealing with probability distributions by completely
enumerating all hypotheses.
\Chapterref{ch.ml} introduces the idea of maximization
methods as a way of avoiding the large cost associated with complete
enumeration, and points out reasons why maximum likelihood is
not good enough.
\Chapterref{ch.distributions} reviews the probability distributions
that arise most often in Bayesian inference.
Chapters \ref{ch.exactmarg}, \ref{ch.exact}, and \ref{ch.sumproduct}
discuss another way of avoiding the
cost of complete enumeration: marginalization.
Chapter \ref{ch.exact} discusses message-passing methods appropriate
for graphical models, using the
decoding of error-correcting codes as an example.
Chapter \ref{ch.sumproduct} combines these ideas with
message-passing concepts from Chapters \ref{ch.message} and
\ref{ch.noiseless}. These chapters are a
prerequisite for the understanding of advanced error-correcting codes.
Chapter \ref{ch.laplace} discusses deterministic approximations including
Laplace's method. This chapter is a prerequisite for understanding
the topic of complexity control in learning algorithms, an idea that
is discussed in general terms in \chref{ch.occam}.
Chapter \ref{ch.mc} discusses Monte Carlo methods.
Chapter \ref{ch.mc2} gives details of
state-of-the-art Monte Carlo techniques.
Chapter \ref{ch.ising} introduces the \ind{Ising model} as a test-bed
for probabilistic methods. An exact {message-passing} method\index{message passing} and a Monte Carlo method
are demonstrated. A motivation for studying the Ising model
is that it is intimately related to several neural network models.
\Chref{ch.mcexact} describes `exact' Monte Carlo methods
and demonstrates their application to the Ising model.
Chapter \ref{ch.variational} discusses variational methods and their application
to Ising models and to simple statistical inference problems including
clustering. This
chapter will help the reader understand the \ind{Hopfield network}
(\chapterref{ch.hopfield}) and
the \ind{EM algorithm}, which is an important method in {latent-variable modelling}.\index{latent variable model}
% (\chapterref{ch.em}).
\Chref{ch.ica} discusses a particularly simple latent variable
model called independent component analysis.
% Is there going to be a chapter called hierarchical modelling?
% Will I define graphical models?
% Where do I talk about trellises?
%
% Latent variable models will come in a later part. Have parts on nn's,
% on l.v.'s. Or on supervised and unsupervised.
% This part of the book ends with
\Chref{ch.ignorance}
discusses a ragbag of
assorted inference topics.
\Chref{ch.decision} discusses a simple
example of decision theory.
% What experiments should one do
%discusses interesting examples
% of prior probability distributions that describe ignorance.
\Chref{ch.sampling} discusses differences between
sampling theory and Bayesian methods.
%\subsection*{Head off the misconceptions early}
\subsection*{A theme: what inference is about}
A widespread misconception is that
the aim of inference is to find
{\em the most probable explanation\/} for some data.\index{sermon!MAP method}
While this most probable hypothesis may
be of interest, and some inference methods do
locate it, this hypothesis is just the peak of
a probability distribution, and it is the
whole distribution that is of interest.
%
As we saw in \chapterref{ch2}, the {\em most probable\/}
outcome from a source is often not a {\em typical\/} outcome
from that source.
Similarly, the most probable hypothesis given some data
 may be atypical of the whole set of reasonably plausible
hypotheses.\index{sermon!most probable is atypical}
%%%%%%% Maybe I should say marginalization is the key idea.
% Another important idea is the concept of marginalization,
% \ie, integrating over variables that we are not
% interested in; typical hypotheses contribute
% most to the marginal probability densities.
% YES/?????????????????/
% \prechapter{About Chapter}
\subsection*{About \protect\chref{ch.clustering}}
Before reading the next chapter,
exercise
\ref{ex.logit} (\pref{ex.logit})
% \ref{ex.logit} (\pref{ex.logit})
and section \ref{sec.pulse} (inferring the input to a Gaussian channel)
are
recommended reading.
\dvips
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\typeout{ DON'T FORGET input{tex/fit.tex} shows the fit of a gaussian in 2d!!!!! }
%
%\chapter{An example inference task: clustering}
\chapter{An Example Inference Task: Clustering}
\label{ch.clustering}
%
% clust.tex
%
Human brains are good at finding regularities in data.
 One way of expressing regularity is to put a set of
 objects into groups whose members are similar to each other.
For example, biologists have
found that most objects in the natural world
fall into one of two categories: things that
are brown and run away, and things that are green
and don't run away. The first group they call animals,
and the second, plants.
We'll call this operation of grouping things together
{\dem\ind{clustering}}.
If the biologist further sub-divides
the cluster of plants into sub-clusters, we would
call this `\ind{hierarchical clustering}'; but
we won't be talking about hierarchical clustering yet.
In this chapter we'll just discuss ways to take a set of $N$
objects and group them into $K$ clusters.
There are several motivations for clustering.\indexs{clustering}
First, a good clustering
has predictive power. When an early biologist encounters a
new green thing he has not seen before, his internal model
% which says that all living things are either
of plants and
animals fills in predictions for attributes of the
green thing: it's unlikely to jump on him and eat him;
if he touches it, he might get grazed or stung; if he eats
it, he might feel sick. All of these predictions, while uncertain,
are useful, because they help the biologist invest his
resources (for example, the time spent watching for predators) well.
Thus, we perform clustering because we believe the underlying
cluster labels are meaningful, will lead to a more efficient
description of our data, and will help us choose better actions. This type of clustering
is sometimes called `mixture \ind{density modelling}',\index{mixture modelling}\index{modelling!density modelling}
and the objective function that measures how well the predictive
model is working is the information content of the data, $\log 1/P(\{\bx\})$.
Second, clusters can be a useful aid to communication because
they allow
% provide codewords for
\ind{lossy compression}.\index{compression!lossy}
The biologist can give directions to a friend such as
% of the form
`go to the
third
{\em tree\/}
% {\underline{tree}\/}
on the right then take a right turn' (rather than
`go past the large green thing with red berries, then past
the large green thing with thorns, then $\ldots$').
The brief category name `tree' is helpful because it is
sufficient to identify an object.
Similarly, in lossy \ind{image compression}, the aim is to convey
in as few bits as possible a reasonable reproduction of a picture;
one way to do this is to divide the image into $N$ small patches, and find
a close match to each patch in an alphabet of $K$ image-templates;
then we send a close fit to the image
by sending the list of labels $k_1,k_2,\ldots,k_N$ of the matching templates.
The task of creating a good library of image-templates is equivalent
to finding a set of cluster centres.
This type of clustering is sometimes called `\ind{vector quantization}'.
%\marginfig{
%\caption[a]{Vector quantization}
%}
We can formalize a vector quantizer in terms of an {\dem{assignment rule}}
$\bx \rightarrow k(\bx)$ for assigning
datapoints $\bx$ to one of $K$ codenames, and a {\dem{reconstruction
rule}} $k \rightarrow \bm^{(k)}$, the aim being to choose the
functions $k(\bx)$ and $\bm^{(k)}$ so as to
minimize the {\dem{expected distortion}}, which might be
defined to be
\beq
D = \sum_{\bx} P(\bx) \half \left[ \bm^{(k(\bx))} - \bx \right]^2 .
\eeq
% Vector quantization is used in some lossy image compression algorithms
% which represent small patches of image using a small alphabet of
% template images.
[The ideal objective function would be
to minimize the psychologically perceived distortion of the image.
Since it is hard to quantify the distortion perceived by a human,
 vector quantization and \ind{lossy compression}\index{compression!lossy}
 are not such crisply defined
 problems as {data modelling} and lossless compression.]\index{modelling}
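
 As a concrete illustration of the assignment rule, the reconstruction rule,
 and the expected distortion $D$ defined above, here is a minimal Python sketch
 for a finite set of datapoints. The function names, the three datapoints,
 their probabilities, and the two templates are all made up for illustration.
 The sketch uses the nearest-template assignment rule, which is the rule that
 minimizes $D$ when the templates are held fixed.
\begin{verbatim}
```python
def half_sq_dist(a, b):
    """Half the squared Euclidean distance, the distortion used in D."""
    return 0.5 * sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def assign(x, templates):
    """Assignment rule x -> k(x): index of the nearest template.

    min() returns the first minimizer, so ties go to the smallest k.
    """
    return min(range(len(templates)),
               key=lambda k: half_sq_dist(templates[k], x))


def expected_distortion(points, probs, templates):
    """D = sum_x P(x) * 1/2 * [m^(k(x)) - x]^2 over a finite set of points."""
    total = 0.0
    for x, p in zip(points, probs):
        k = assign(x, templates)          # encoder: x -> k(x)
        m = templates[k]                  # decoder: k -> m^(k)
        total += p * half_sq_dist(m, x)
    return total


# Hypothetical toy problem: three datapoints, two templates.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0)]
probs = [0.25, 0.25, 0.5]
templates = [(0.0, 1.0), (10.0, 0.0)]

labels = [assign(x, templates) for x in points]
D = expected_distortion(points, probs, templates)
```
\end{verbatim}
 With these made-up numbers the first two datapoints share template 0 and each
 contributes $0.25 \times 0.5$ to $D$, while the third datapoint coincides
 with template 1 and contributes nothing, so $D = 0.25$.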
In vector quantization, we don't necessarily believe that
the templates $\{ \bm^{(k)}\}$ have any natural meaning; they
are simply tools to do a job. We note in passing
the similarity of the assignment rule (\ie, the encoder)
of vector quantization to the {\em decoding\/} problem
when decoding an error-correcting code.\index{connection between!vector quantization and error-correction}
A third reason for making a cluster model is that failures of the
cluster model may highlight interesting objects that deserve
special attention.
If we have trained a vector quantizer to do a good job of compressing
satellite pictures of ocean surfaces, then maybe patches of image that
are not well compressed by the vector quantizer are the patches that
contain ships!
If the biologist encounters a green thing and sees it run (or slither) away,
this misfit with his cluster model (which says green things don't run
away) cues him to pay special attention. One can't spend all one's time being
fascinated by things; the cluster model can help sift out from the
multitude of objects in one's world the ones that really deserve attention.
\amarginfig{c}{
\begin{center}\small
\hspace{-0.54in}\psfig{figure=octave/kmeans/ps1/data.ps,width=1.65in,angle=-90}
\end{center}
\caption[a]{$N=40$ data points.}
}A fourth reason for liking clustering algorithms is that
they may serve as models of learning processes in neural systems.
The clustering algorithm that we now discuss, the K-means\index{learning algorithms!competitive learning}
algorithm, is an example of a {\dem\ind{competitive learning}\/} algorithm.
The algorithm works by having the $K$ clusters compete with
each other for the right to own the data points.
% At the heart of a clustering method there is always an {\dem{assignment rule}},
% a method for allocating a point $\bx$ to one of the $K$ clusters.
% Often this rule takes the form `
%
%
% Motivations for clustering.
%\ben
%\item
% Similarity of clustering assignment step to decoding.
%\item
% Clustering as mixture density modelling.
% If we adopt the attitude of density modelling, then our aim is
% to find a good description of the observed data in terms
% of a mixture of probability densities. In contrast to vector quantization,
% we are likely to view the underlying clusters as having a natural meaning.
%
% For example, if we model handwritten characters with a mixture
% model, we might intend each cluster to correspond to a different
% character; if we model protein sequences with a mixture model,
% we might think of the clusters as representing protein families
% all of whose members descended by evolution from a common protein ancestor.
%\een
\section{K-means clustering}
The\marginpar{\small\raggedright
%\begin{aside}
{\sf About the name...}
As far as I know, the `K' in K-means clustering
simply refers to the chosen number of clusters.
If Newton had followed the same naming policy, maybe
we would learn at school about `calculus for the variable $x$'.
It's a silly name, but we are stuck with it.
%\end{aside}
}
K-means algorithm is an algorithm\indexs{K-means clustering}
for putting $N$ data points in an $I$-dimensional space
into $K$ clusters.
Each cluster is parameterized by a vector $\bm^{(k)}$ called its mean.
 The data points will be denoted by $\bx^{(n)}$, where the superscript $n$
 runs from 1 to the number of data points $N$. Each vector $\bx$ has
 $I$ components $x_i$.
We will assume that the
space that $\bx$ lives in
is a real space and that we have a metric that defines distances between points,
for example,
\beq
d(\bx,\by) = \half \sum_i (x_i - y_i )^2 .
\eeq
% \subsection{The K-means algorithm}
To start the K-means algorithm (\algref{alg.kmeans}), the $K$
% parameter vectors called the
means $\{\bm^{(k)}\}$
are initialized in some way, for example to random values.
K-means is then an iterative two-step algorithm.
In the {\dem{assignment step}},
each data point $n$ is assigned to the nearest mean.
In the {\dem{update step}}, the means are adjusted to match
the sample means of the data points that they are responsible for.
%\newcommand{\rnk}{r^{(n)}_k}
%\newcommand{\hkn}{\hat{k}^{(n)}}
\begin{algorithm}[htbp]
\algorithmmargin{%
\begin{description}
\item[Initialization\puncspace] Set $K$ means $\{ \bm^{(k)} \}$ to random values.
\item[Assignment step\puncspace]
Each data point $n$ is assigned to the nearest mean. We denote our
guess for the cluster $k^{(n)}$ that the point $\bx^{(n)}$ belongs to
by $\hkn$.
\beq
\hkn = \argmin_k \{ d(\bm^{(k)} ,\bx^{(n)} ) \} .
\eeq
An alternative, equivalent representation of this assignment of points to clusters
is given by `responsibilities', which are indicator variables $\rnk$.
In the assignment step, we set $\rnk$ to one if mean $k$ is the closest
mean to datapoint $\bx^{(n)}$; otherwise $\rnk$ is zero.
\beq
\rnk = \left\{
\begin{array}{ccc} 1 &\mbox{ if } & \hkn = k
\\ 0 & \mbox{ if } & \hkn \neq k .
\end{array} \right.
\eeq
\noindent
{\em What about ties?} --
We don't expect two means to be exactly the same distance from
a data point,
but if a tie does happen, $\hkn$ is set to the smallest of the
winning $\{ k \}$.
\item[Update step\puncspace]% also called Adaptation or Reestimation
The model parameters, the means, are adjusted to match
the sample means of the data points that they are responsible for.
\beq
\bm^{(k)} = \frac{ \displaystyle \sum_{n} \rnk \bx^{(n)} }{ R^{(k)} }
\eeq
where $R^{(k)}$ is the total responsibility of mean $k$,
\beq
R^{(k)} = \sum_{n} \rnk .
\eeq
{\em What about means with no responsibilities?} --
If $R^{(k)} = 0$, then we leave the mean $\bm^{(k)}$ where it is.
\item[Repeat the assignment step and update step]
until the assignments do not change.
\end{description}
}{
\caption{The K-means clustering algorithm.\indexs{learning algorithms!K-means clustering}}
\label{alg.kmeans}
}
\end{algorithm}
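
 The K-means algorithm (\algref{alg.kmeans}) translates into a short program
 almost line by line. The sketch below is one possible plain-Python rendering,
 not a reference implementation. The function names, the optional
 \verb|init_means| argument, the \verb|max_iters| safeguard, and the choice of
 initializing the means to $K$ randomly chosen data points are assumptions
 made here for illustration.
\begin{verbatim}
```python
import random


def distance(x, y):
    """d(x, y) = 1/2 * sum_i (x_i - y_i)^2, as defined in the text."""
    return 0.5 * sum((xi - yi) ** 2 for xi, yi in zip(x, y))


def kmeans(data, K, init_means=None, seed=0, max_iters=100):
    """Hard K-means on a list of equal-length tuples.

    Returns (means, assignments), where assignments[n] is the index of
    the mean responsible for data[n].
    """
    if init_means is None:
        # Assumed initialization: K distinct data points picked at random.
        rng = random.Random(seed)
        init_means = rng.sample(data, K)
    means = [list(m) for m in init_means]

    assignments = None
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest mean.
        # min() returns the first minimizer, so ties go to the smallest k.
        new_assignments = [
            min(range(K), key=lambda k: distance(means[k], x))
            for x in data
        ]
        if new_assignments == assignments:
            break  # assignments unchanged: a fixed point has been reached
        assignments = new_assignments

        # Update step: move each mean to the sample mean of its points.
        for k in range(K):
            members = [x for x, a in zip(data, assignments) if a == k]
            if members:  # if R^(k) = 0, leave the mean where it is
                means[k] = [sum(c) / len(members) for c in zip(*members)]
    return means, assignments


# Hypothetical toy data: two well-separated groups of three points.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
means, assignments = kmeans(data, 2,
                            init_means=[(0.0, 0.0), (10.0, 10.0)])
```
\end{verbatim}
 The hard responsibilities are represented implicitly by the list of
 assignments, so the total responsibility $R^{(k)}$ is simply the number of
 points currently assigned to mean $k$.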
{The K-means algorithm} is demonstrated for a toy
two-dimensional data set in \figref{fig.kmeans.2},
 where $K=2$ means are used. The assignments of the points to
the two clusters are indicated by two point styles, and the
two means are shown by the circles.
%
The algorithm converges after three iterations, at which point
the assignments
are unchanged so the means remain unmoved when updated.
The K-means algorithm always converges to a fixed point.
\exercissxC{4}{ex.proveconverge}{
See if you can prove that K-means always converges. [Hint: find a
physical analogy and an associated \ind{Lyapunov function}.]
[A Lyapunov function is a function of the state of the
algorithm that decreases whenever the state changes
and that is bounded below.
If a system has a Lyapunov function then its dynamics converge.]
% You might like to try to prove this fact. We'll prove it in a few
% chapters's time.
}
{The K-means algorithm} with a larger number of means, 4,
is demonstrated in \figref{fig.kmeans.4}.
The outcome of the algorithm depends on the initial condition.
In the first case, after five iterations,
a steady state is found in which the data points
are fairly evenly split between the four clusters.
In the second case, after six iterations,
half the data points are in one cluster, and the others are
shared among the other three clusters.
%
\begin{figure}
\figuremargin{%
\begin{center}\small
\begin{tabular}{llllllll}
% \raisebox{1in}{(a)}\hspace{-0.3in}%
& \raisebox{0.81in}{Data:} &
\hspace{-0.54in}\psfig{figure=octave/kmeans/ps1/data.ps,width=1.65in,angle=-90}&\\
% \hline
Assignment & Update & Assignment & Update & Assignment & Update & \\
\hspace{-0.54in}\psfig{figure=octave/kmeans/ps1/5.2.ps,width=1.65in,angle=-90}&
\hspace{-0.54in}\psfig{figure=octave/kmeans/ps1/5.3.ps,width=1.65in,angle=-90}&
\hspace{-0.54in}\psfig{figure=octave/kmeans/ps1/5.4.ps,width=1.65in,angle=-90}&
\hspace{-0.54in}\psfig{figure=octave/kmeans/ps1/5.5.ps,width=1.65in,angle=-90}&
\hspace{-0.54in}\psfig{figure=octave/kmeans/ps1/5.6.ps,width=1.65in,angle=-90}&
\hspace{-0.54in}\psfig{figure=octave/kmeans/ps1/5.7.ps,width=1.65in,angle=-90}&
\\[0.12in]
\end{tabular}
\end{center}
}{%
\caption[a]{K-means algorithm applied to a data set of 40 points. $K=2$ means
evolve to stable locations after three iterations.}
\label{fig.kmeans.2}
}%
\end{figure}
\begin{figure}
\figuremargin{%
\begin{center}\small
\begin{tabular}{*{6}{l}}
Run 1\\
\hspace{-0.45in}\psfig{figure=octave/kmeans/ps1/15.2.ps,width=1.50in,angle=-90}&
\hspace{-0.60in}\psfig{figure=octave/kmeans/ps1/15.4.ps,width=1.50in,angle=-90}&
\hspace{-0.60in}\psfig{figure=octave/kmeans/ps1/15.6.ps,width=1.50in,angle=-90}&
\hspace{-0.60in}\psfig{figure=octave/kmeans/ps1/15.8.ps,width=1.50in,angle=-90}&
\hspace{-0.60in}\psfig{figure=octave/kmeans/ps1/15.10.ps,width=1.50in,angle=-90}
%&
%\hspace{-0.60in}\psfig{figure=octave/kmeans/ps1/15.11.ps,width=1.50in,angle=-90}
\\[0.12in]
Run 2\\
\hspace{-0.45in}\psfig{figure=octave/kmeans/ps1/16.2.ps,width=1.50in,angle=-90}&
\hspace{-0.60in}\psfig{figure=octave/kmeans/ps1/16.4.ps,width=1.50in,angle=-90}&
\hspace{-0.60in}\psfig{figure=octave/kmeans/ps1/16.6.ps,width=1.50in,angle=-90}&
\hspace{-0.60in}\psfig{figure=octave/kmeans/ps1/16.8.ps,width=1.50in,angle=-90}&
\hspace{-0.60in}\psfig{figure=octave/kmeans/ps1/16.10.ps,width=1.50in,angle=-90}&
\hspace{-0.60in}\psfig{figure=octave/kmeans/ps1/16.12.ps,width=1.50in,angle=-90}
\\[0.12in]
\end{tabular}
\end{center}
}{%
\caption[a]{K-means algorithm applied to a data set of 40 points.
Two separate runs, both with $K=4$ means, reach different solutions.
Each frame shows a successive assignment step.}
\label{fig.kmeans.4}
}%
\end{figure}
% Fri 29/6/01 removed k=5 figure to graveyard
%\label{fig.kmeans.5}
\subsection{Questions about this algorithm}
The K-means algorithm has several {\em ad hoc\/}
features.
Why does the update step set the `mean' to the mean of the assigned points?
% What if there were a few outliers?
% Outlying data points can have a big influence on the mean!
Where did the distance $d$ come from? What if we used a different
measure of distance between $\bx$ and $\bm$? How can
we choose the `best' distance? [In vector quantization,
the distance function is provided as
part of the problem definition; but I'm assuming
we are interested in data-modelling rather than vector quantization.]
%-- it's a measure of perceived distortion,
% whose expectation we wish to minimize --
% but in mixture density modelling, the choice of distance
% corresponds to a choice of density. The choice of distance certainly can have
% an effect on the resulting clusters, as we'll see in a moment.]
How do we choose $K$? Having found multiple alternative clusterings for
a given $K$, how can we choose among them?
% How to do spaces other than real spaces? For example
% categorical spaces.
%
% Choice of number of clusters
%
% What about clusters with unequal width.
%
% And clusters with unequal weight.
\begin{figure}
\figuremargin{\small%
\begin{center}
\mbox{%
\raisebox{1in}{(a)}\hspace{-0.3in}%
\psfig{figure=octave/kmeans/xbs75.ps,%
width=2.4in,angle=-90}%
\hspace{0.3in}%
\raisebox{1in}{(b)}\hspace{-0.3in}%
\psfig{figure=octave/kmeans/xbs75m.ps,%
width=2.4in,angle=-90}}%
\end{center}
}{%
\caption[a]{K-means algorithm for a case with two dissimilar clusters.
(a) The\index{little 'n' large data set}\index{data set}
%{!little 'n' large}
``little 'n' large'' data. (b) A stable set of assignments and means.
Note that four points belonging to the broad cluster have been incorrectly
assigned to the narrower cluster. (Points assigned to the right-hand cluster
are shown by plus signs.)}
\label{fig.kmeans.xbs}
}%
\end{figure}
\begin{figure}
\figuremargin{\small%
\begin{center}
\mbox{%
\raisebox{1in}{(a)}\hspace{-0.03in}%
\psfig{figure=octave/kmeans/ps3/30.1.ps,%
width=2in,angle=-90}%
\hspace{0.3in}%
\raisebox{1in}{(b)}\hspace{-0.03in}%
\psfig{figure=octave/kmeans/ps3/31.9.ps,%
width=2in,angle=-90}}%
\end{center}
}{%
\caption[a]{Two elongated clusters, and
the stable solution found by the K-means algorithm.}
\label{fig.kmeans.lozenge}
}%
\end{figure}
\subsection{Cases where K-means might be viewed as failing}
% We can deliberately construct examples where K-means
% gives inadequate answers, from a density modelling perspective.
%\subsubsection{Outliers}
% Similarly, one or two outlying data points can have a large
% effect on the stable state of the K-means algorithm.
% The sample mean is a good estimator of the centre of a {\em Gaussian\/}
% distribution, but if a cluster is not Gaussian in shape, then
% the sample mean is not the most robust estimator of the centre
% of the cluster.
Further questions arise when we look for cases where
the algorithm behaves badly (compared with what
the man in the street would call `clustering').
% \Figref{fig.kmeans.xbs}a shows a data set which evidently
% contains two clusters -- a big one and a small one.
\Figref{fig.kmeans.xbs}a shows a set of 75 data points
generated from a mixture of two Gaussians. The
right-hand Gaussian
% centred at $(8,5)$ differs from that centred at $(3,5)$ in two ways:
% it
has less weight (only one fifth of the data points),
and it is a less broad cluster.
\Figref{fig.kmeans.xbs}b shows the outcome of using
K-means clustering with $K=2$ means. Four of the big cluster's
data points have been assigned to the small cluster,
and both means end up displaced
to the left of the true centres of the clusters.
The K-means algorithm takes account only of the distance between
the means and the data points; it has no representation of the
weight or breadth of each cluster. Consequently, data points that
actually belong to the broad cluster are incorrectly
assigned to the narrow cluster.
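
 A one-dimensional caricature makes the point. The Python sketch below
 compares the nearest-mean rule with the rule `pick the mixture component
 under which the point is most probable' for a two-Gaussian mixture. All the
 numbers (means 3 and 8, standard deviations 2 and 0.3, weights 0.8 and 0.2,
 and the test point 5.8) are hypothetical values chosen to resemble a broad,
 heavy cluster next to a narrow, light one; they are not the parameters used
 to generate the figure.
\begin{verbatim}
```python
import math


def nearest_mean(x, means):
    """The K-means assignment rule: index of the closest mean."""
    return min(range(len(means)), key=lambda k: 0.5 * (x - means[k]) ** 2)


def most_probable_component(x, means, sigmas, weights):
    """Index k maximizing weight_k * Normal(x; mean_k, sigma_k^2)."""
    def log_joint(k):
        z = (x - means[k]) / sigmas[k]
        # The constant -0.5*log(2*pi) is the same for every k, so omit it.
        return math.log(weights[k]) - math.log(sigmas[k]) - 0.5 * z * z
    return max(range(len(means)), key=log_joint)


# Hypothetical mixture: component 0 is broad and heavy,
# component 1 is narrow and light.
means = [3.0, 8.0]
sigmas = [2.0, 0.3]
weights = [0.8, 0.2]

x = 5.8  # 2.8 from the broad mean, 2.2 from the narrow mean
by_distance = nearest_mean(x, means)
by_density = most_probable_component(x, means, sigmas, weights)
```
\end{verbatim}
 The point is closer to the narrow cluster's mean, so the nearest-mean rule
 picks component 1, yet it lies more than seven standard deviations from that
 mean and only 1.4 standard deviations from the broad cluster's mean, so the
 density-based rule picks component 0.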
%Can get silly answers, see the big'n'small example.
% Algorithm implicitly assumes clusters have similar size and
% similar weight.
%\subsubsection{Unequal weight and unequal width clusters}
% Once the algorithm converges, as shown in \figref{fig.kmeans.xbs}b,
% the means have both become displaced to the left from their
% correct locations.
\Figref{fig.kmeans.lozenge} shows another case of K-means
behaving badly. The data evidently fall into two elongated clusters.
But the only stable state of the K-means algorithm is that shown in
\figref{fig.kmeans.lozenge}b: the two clusters have been sliced
in half!
% at their midpoints
These two examples show that there is something wrong with the
distance $d$ in the K-means algorithm.
The K-means algorithm has no way of
representing the size or shape of a cluster.
A final criticism of K-means is that it is a `hard' rather than a `soft' algorithm:
points are assigned to exactly one cluster and
all points assigned to a cluster are equals in that cluster.
Points located near the border between two or more clusters
should, arguably, play a {\em{partial}\/}
role in determining the locations of all the clusters
that they could plausibly be assigned to. But in the K-means algorithm,
each borderline point is dumped in one cluster, and has an equal vote
with all the other points in that cluster, and no vote in any other clusters.
\section{Soft K-means clustering}
These criticisms of K-means motivate the `soft K-means algorithm',\indexs{learning algorithms!K-means clustering}\index{K-means clustering!soft}\index{soft K-means clustering}
\algref{alg.softkmeans1}. The algorithm has one parameter, $\beta$,
which we could term the {\dem\ind{stiffness}}.
% , stiff being the opposite of soft.
% Soft version. Write algorithm, showing how similar it is
% to hard K-means.
% Could demonstrate the repulsion effect of hard K-means
% when two clusters overlap. Hard to make convincing because
% human can't see two clusters in there.
% BOX THIS and assign it a number, algorithm 23.x
% first arg is the algm, 2nnd is the title
\begin{algorithm}[htbp]
\algorithmmargin{%
\begin{description}
\item[Assignment step\puncspace]
Each data point $\bx^{(n)}$ is given a soft `degree of assignment'
to each of the means. We call the degree to which $\bx^{(n)}$
is assigned to cluster $k$ the {\dem{\ind{responsibility}}} $r_k^{(n)}$
(the responsibility of cluster $k$ for point $n$).
\beq
\rnk
% r_k^{(n)}
= \frac{ \exp \left( - \beta \, d(\bm^{(k)} ,\bx^{(n)}) \right) }
{\sum_{k'} \exp \left( -\beta \, d(\bm^{(k')} ,\bx^{(n)}) \right) } .
\label{eq.softminr}
\eeq
The sum of the $K$ responsibilities for the $n$th point is 1.
\item[Update step\puncspace]% also called Adaptation or Reestimation
The model parameters, the means, are adjusted to match
the sample means of the data points that they are responsible for.
\beq
\bm^{(k)} = \frac{ \displaystyle \sum_{n} \rnk \bx^{(n)} }{ R^{(k)} }
\eeq
where $R^{(k)}$ is the total responsibility of mean $k$,
\beq
R^{(k)} = \sum_{n} \rnk .
\eeq
\end{description}
}{
%{\sf Soft K-means algorithm, version 1}
\caption{Soft K-means algorithm, version 1.}
\label{alg.softkmeans1}
}
\end{algorithm}
Notice the similarity of this soft K-means algorithm
% \ref{alg.softkmeans1}
to the
hard K-means algorithm \ref{alg.kmeans}.
The update step is identical; the only difference is
that the responsibilities\index{responsibility} $\rnk$ can take on values
between 0 and 1.
Whereas the assignment $\hkn$ in the K-means algorithm
involved a `min' over the distances,
the rule for assigning the responsibilities is
a `soft-min' (\ref{eq.softminr}).\index{softmax, softmin}
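To see the soft-min at work, here is a small numerical illustration; the numbers are chosen purely for this example and do not correspond to any of the figures. We assume the squared-distance measure $d(\bm,\bx) = \frac{1}{2} \sum_i (m_i - x_i)^2$, one-dimensional data, and $K=2$ means at $m^{(1)} = 0$ and $m^{(2)} = 2$. A point at $x = 0.5$ has $d(m^{(1)},x) = 0.125$ and $d(m^{(2)},x) = 1.125$, so
\beq
 r_1 = \frac{1}{1 + \exp \left( - \beta \left[ d(m^{(2)},x) - d(m^{(1)},x) \right] \right) }
 = \frac{1}{1 + \exp ( - \beta ) } .
\eeq
With $\beta = 1$ this gives $r_1 \simeq 0.73$ and $r_2 \simeq 0.27$; with $\beta = 10$ it gives $r_1 \simeq 0.999\,95$, which is very nearly a hard assignment.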
\exercisxB{2}{ex.stiffnessKmeans}{
Show that as the stiffness $\beta$ goes to $\infty$, the soft
K-means algorithm becomes identical to the original hard K-means
algorithm, except for the way in which means with no
assigned points behave. Describe what those means do instead of
sitting still.
}
Dimensionally, the stiffness $\beta$ is an inverse-length-squared,
so we can associate a lengthscale, $\sigma \equiv 1/\sqrt{\beta}$, with it.
% the value of $\beta$.
The soft K-means algorithm is demonstrated in \figref{fig.skmeans.2d}.
The lengthscale is shown by the radius of the circles surrounding the
four means.
Each panel shows the final fixed point reached for a different value of
the lengthscale $\sigma$.
% , with large lengthscale at the top left and short lengthscale at the bottom right.
\section{Conclusion}
At this point, we may have fixed some of the problems with the original
K-means algorithm by introducing an extra {complexity-control}\index{complexity control} parameter $\beta$.
% whose value controls the algorithm's outcome.
But how should we set $\beta$?
And what about the problem of the elongated clusters, and
the clusters of unequal weight and width? Adding one stiffness
parameter $\beta$ is not going to make all these problems go away.
We'll come back to these questions in a later chapter,
as we develop the mixture-density-modelling view of clustering.
\section*{Further reading}
For a \index{vector quantization}{vector-quantization} approach
to clustering see \cite{Luttrell89d,Luttrell_IEEE90}.
\section{Exercises}
\exercissxB{3}{ex.softkmeans}{
Explore the properties of the soft K-means algorithm,
version 1,
% (\pageref{alg.softkmeans})
assuming that the
datapoints $\{\bx\}$ come from a {\em single\/} separable two-dimensional Gaussian
distribution with mean zero and variances $(\var(x_1),\var(x_2)) =
(\sigma^2_1, \sigma^2_2)$, with $\sigma^2_1 > \sigma^2_2$.
Set $K=2$, assume $N$ is large, and investigate the fixed points of the
algorithm as $\beta$ is varied. [Hint: assume that $\bm^{(1)} = (m,0)$
and $\bm^{(2)} = (-m,0)$.]
}
% Discuss dependence of algorithm on $\beta$.
% Show the bifurcations as $\beta$ varies.
% here are the filenames for a fixed variance sequence in ps5
\begin{figure}
\figuremargin{%
\begin{center}\small
\begin{tabular}{*{6}{l}}
\multicolumn{4}{l}{\makebox[0in][l]{Large $\sigma$ $\ldots$}}\\
\softfc{1.39}&
\softfc{2.35}&
\softfc{3.35}&
\softfc{4.49}\\
\multicolumn{4}{c}{\makebox[0in][l]{$\ldots$}}\\
\softfc{5.51}&
\softfc{6.37}&
\softfc{7.69}&
\softfc{8.119}\\
\multicolumn{4}{r}{\makebox[0in][r]{$\ldots$ small $\sigma$}}\\
\softfc{9.37}&
\softfc{10.35}&
%\softfc{11.35}&
%\softfc{12.35}&
%\softfc{13.35}\\&
\softfc{14.35}&
%\softfc{15.35}&
%\softfc{16.35}&
%\softfc{17.35}\\&
\softfc{18.35}\\
%\softfc{19.35}&
%\softfc{20.35}&
%\softfc{21.35}\\&
%\softfc{22.35}&\\
%\softfc{23.35}&
%\softfc{24.35}&
%\softfc{25.35}\\
\end{tabular}
\end{center}
}{%
\caption[a]{Soft K-means algorithm, version 1,
applied to a data set of 40 points. $K=4$.
Implicit lengthscale parameter $\sigma=1/\beta^{1/2}$ varied from
a large to a small value.
Each picture shows the state of
all four means, with the implicit lengthscale
shown by the radius of the four circles,
after running the algorithm
for several tens of iterations.
At the largest lengthscale,
all four means converge exactly to the
data mean. Then the four means separate into
two groups of two. At shorter lengthscales,
each of these pairs itself
bifurcates into subgroups.
}% 3 down to 0.3.}
\label{fig.skmeans.2d}
}%
\end{figure}
\label{sec.SOFT-KMEANS}
% Give probabilistic interpretation -- no, given later in enumerate.tex?
% Refer forward to that exercise in which the algorithm was derived by the reader.
% \section{Exercises}
\exercisxB{3}{ex.repelkmeans}{
Consider the soft K-means%
\amarginfignocaption{t}{
\mbox{\psfig{figure=figs/m2g.ps,width=1.7in,angle=-90}}
}
algorithm applied to a large amount of one-dimensional data
that comes from a mixture of two equal-weight Gaussians
with true means $\mu=\pm 1$ and standard deviation $\sigma_P$,
for example $\sigma_P=1$.
Show that the hard K-means algorithm with $K=2$
leads to a solution in which the two means are
further apart than the two true means.
Discuss what happens for other values of $\beta$,
% in particular the value $\beta = 1/\sigma_P^2$.
and find the value of $\beta$ such that the soft algorithm
puts the two means in the correct places.
}
\section{Solutions}
\soln{ex.proveconverge}{
We can associate an `\ind{energy}' with the state of the K-means algorithm
by connecting a spring between each point $\bx^{(n)}$
and the mean that is responsible for it.
The energy of one spring is proportional to its squared-length, namely
$\b d(\bx^{(n)}, \bm^{(k)})$ where $\b$ is the stiffness of the spring.
The
total energy of all the \ind{spring}s is a {\dem\ind{Lyapunov function}\/} for the algorithm,
because
%\ben
%\item
(a)
the assignment step can only decrease the energy -- a point
only changes its allegiance if the length of its spring would be reduced;
%\item
(b)
the update step can only decrease the energy -- moving $\bm^{(k)}$ to
the mean
% centre of mass
is the way to minimize the energy of its springs; and
%\item
(c) the energy is bounded below -- which is the second condition for a Lyapunov
function.
%\een
Since the algorithm has a Lyapunov function, it converges.
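 To make the argument explicit, we can write the energy down; the notation $k_n$, for the index of the mean to which point $n$ is currently assigned, is introduced here for convenience, and we assume the squared-distance measure $d(\bx,\bm) = \frac{1}{2}\sum_i (x_i - m_i)^2$. The total energy is
\beq
 E = \beta \sum_n d( \bx^{(n)} , \bm^{(k_n)} )
 = \frac{\beta}{2} \sum_n \sum_i \left( x^{(n)}_i - m^{(k_n)}_i \right)^2 .
\eeq
 Setting the derivative with respect to $\bm^{(k)}$ to zero,
\beq
 \frac{\partial E}{\partial \bm^{(k)}} = - \beta \sum_{n : \, k_n = k} \left( \bx^{(n)} - \bm^{(k)} \right) = 0 ,
\eeq
 gives $\bm^{(k)}$ equal to the sample mean of the points assigned to it, which is the update step; and since $E$ is a sum of non-negative terms, $E \geq 0$.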
}
\soln{ex.softkmeans}{
 If the means are initialized to $\bm^{(1)} = (m,0)$
 and $\bm^{(2)} = (-m,0)$, the assignment step for a point at location $(x_1,x_2)$
gives
\amarginfig{c}{
\begin{center}
\mbox{\psfig{figure=figs/gallager/clusterbelow.ps,%
width=1.75in,angle=-90}}\\[0in]
\end{center}
%}{%
\caption[a]{Schematic diagram of the \ind{bifurcation} as the largest data standard deviation
 $\sigma_1$ increases from below $1/\beta^{1/2}$
 to above $1/\beta^{1/2}$. The spread of the data is indicated by the ellipse.}
\label{fig.kmeansbifurc1}
}%
%%% (\cf\ \exerciseref{ex.easyclassificationexample})
\beqan
r_1(\bx) &=& \frac{ \exp ( - \beta (x_1-m)^2 / 2 ) }
{ \exp ( - \beta (x_1-m)^2 / 2 ) + \exp ( - \beta (x_1+m)^2 / 2 ) }
\\
&=&
\frac{1}{1 + \exp ( - 2 \beta m x_1 ) } ,
\eeqan
and the updated $m$ is
\beqan
m' & =& \frac{ \int \d x_1 \: P(x_1) \, x_1\, r_1 (\bx) }
{ \int \d x_1 \: P(x_1) \, r_1 (\bx) }
\\
&=& 2 \int \d x_1 \:
%\frac{1}{\sqrt{2\pi} \sigma_1}} \exp ( - x_1^2/ ( 2 \sigma_1^2) )
P(x_1) \,
x_1 \,
\frac{1}{1 + \exp ( - 2 \beta m x_1 ) }.
\eeqan
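 The factor of 2 in the second line can be checked as follows, assuming only that $P(x_1)$ is symmetric about zero. Writing $r_1(x_1)$ for $r_1(\bx)$, which depends on $\bx$ only through $x_1$, we have $r_1(x_1) + r_1(-x_1) = 1$, so the denominator is
\beq
 \int \d x_1 \: P(x_1) \, r_1 (x_1)
 = \frac{1}{2} \int \d x_1 \: P(x_1) \left[ r_1(x_1) + r_1(-x_1) \right]
 = \frac{1}{2} .
\eeq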
Now,%
\amarginfig{t}{
\begin{center}
\mbox{\psfig{figure=gnu/kmeansbi.ps,%
width=1.7in,angle=-90}}\vskip 0.1in
\end{center}
%}{%
\caption[a]{The stable mean locations as a function of
$\sigma_1$, for constant $\b$, found numerically (thick lines), and the
approximation (\ref{eq.approxbifurc}) (thin lines). }
\label{fig.kmeansbifurc}
}
$m=0$ is a fixed point, but the question is,
is it stable or unstable?
For tiny $m$ (that is, $\beta \sigma_1 m \ll 1$), we can Taylor
expand
%
\beq
\frac{1}{1 + \exp ( - 2 \beta m x_1 )} \simeq
\frac{1}{2} ( 1 + \beta m x_1 ) + \cdots
\eeq
so
\beqan
m' & \simeq & \int \d x_1 \:
% \frac{1}{\sqrt{2\pi \sigma_1^2}} \exp ( - x_1^2/ ( 2 \sigma_1^2) )
P(x_1) \:
x_1 \:
( 1 + \beta m x_1 )
\\
%&=& \int \d x_1 \:
% P(x_1) \: x_1^2 \:
% \beta m
%\\
&=& \sigma_1^2 \beta m .
\eeqan
For small $m$, $m$ either grows or decays exponentially under this mapping,
depending on whether $\sigma_1^2 \beta$ is greater than or
less than 1.
The fixed point $m=0$ is {\em stable\/} if
\beq
\sigma_1^2 \leq 1/ \beta
\eeq
and
{\em unstable\/} otherwise.
[Incidentally, this derivation shows that this result
is general, holding for any true probability
distribution $P(x_1)$ having variance $\sigma_1^2$,
not just the Gaussian.]
If $\sigma_1^2 > 1/ \beta$ then there is a \ind{bifurcation}
and there are two stable fixed points surrounding the unstable
fixed point at $m=0$.
% There are two ways to visualize this bifurcation. Either we can imagine
% an algorithm with fixed $\beta$ and look at what happens
% as we increase the variance of the data fed to it,
% or we can imagine attacking fixed data with various values of $\beta$.
% On dimensional grounds we can think of $\beta$ as defining
% an inverse-variance, and $1/\beta^{1/2}$ as defining an implicit
% length scale
%% standard deviation
% in the algorithm.
% see kmeansoft/ms.ms
% see itp/gnu/kmeans.gnu
To illustrate this bifurcation, \figref{fig.kmeansbifurc}
shows the outcome of running the soft K-means
algorithm with $\beta=1$
on one-dimensional data with standard deviation $\sigma_1$
for various values of $\sigma_1$.
% for four iterations, starting from initial mean locations $m = \pm 1$.
\Figref{fig.kmeansbifurcinv}
shows this \ind{pitchfork bifurcation} from the
other point of view, where the
data's standard deviation $\sigma_1$ is fixed and the
algorithm's lengthscale $\sigma = 1/\beta^{1/2}$
is varied on the horizontal axis.%
\amarginfig{b}{
\begin{center}~\par
\mbox{\psfig{figure=gnu/kmeansbi-inv.ps,%
width=1.75in,angle=-90}}\vskip 0.1in
\end{center}
%}{%
\caption[a]{The stable mean locations as a function of
$1/\beta^{1/2}$, for constant $\sigma_1$.}
\label{fig.kmeansbifurcinv}
}
% adding another term in the expansion looked hopeless
% does it converge???
% We'll be able to show this is a standard pitchfork bifurcation
% once we have discussed the objective function
% that the K-means algorithm minimizes.
%
% Meanwhile, h
\begin{aside}
Here is a cheap theory to model how the fitted parameters $\pm m$ behave
beyond the bifurcation, based on continuing
the series expansion. This continuation of the series is
rather suspect, since the
series isn't necessarily expected to converge
beyond the bifurcation point, but the theory fits
well anyway.
We take our analytic approach one term further in the
expansion
\beq
\frac{1}{1 + \exp ( - 2 \beta m x_1 )} \simeq
\frac{1}{2} ( 1 + \beta m x_1 - \frac{1}{3} ( \beta m x_1)^3 ) + \cdots
\eeq
% (but this expansion may be invalid!)
then we can solve for the shape of the bifurcation to leading order,
which depends on the fourth moment of the distribution:
%\marginpar{\footnotesize{At (\ref{eq.gauss3m}) we use the fact that $P(x_1)$ is Gaussian to find the fourth moment.}}
\beqan
m' & \simeq & \int \d x_1 \:
P(x_1)
% \frac{1}{\sqrt{2\pi \sigma_1^2}} \exp ( - x_1^2/ ( 2 \sigma_1^2) )
x_1
( 1 + \beta m x_1 - \frac{1}{3} ( \beta m x_1)^3 )
\\
%&=& \int \d x_1 \:
%% \frac{1}{\sqrt{2\pi \sigma_1^2}} \exp ( - x_1^2/ ( 2 \sigma_1^2) )
% P(x_1)
% \left[ x_1^2 \, \beta m
% - \frac{1}{3} ( \beta m)^3 x_1^4 \right]
%\\
&=& \sigma_1^2 \beta m - \frac{1}{3} ( \beta m)^3 3 \sigma_1^4 .
\label{eq.gauss3m}
%\\
%&=& \sigma_1^2 \beta m ( 1 - ( \beta m)^2 \sigma_1^2 ) .
\eeqan
[{At (\ref{eq.gauss3m}) we use the fact that $P(x_1)$ is Gaussian to find the fourth moment.}]
This map has a fixed point at $m$ such that
\beq
\sigma_1^2 \beta ( 1 - ( \beta m)^2 \sigma_1^2 ) = 1,
\eeq
\ie,
\beq
% ( \beta m)^2 \sigma_1^2 = ( 1 - 1/ (\sigma_1^2 \beta ) ),
% m = \pm \frac{ ( 1 - 1/ (\sigma_1^2 \beta ) )^{1/2} }{ \beta \sigma_1 } ,
% m = \pm \frac{ ( \sigma_1^2 \beta - 1 )^{1/2} }{ \sigma_1 \beta^{1/2} \beta \sigma_1 } ,
m = \pm \beta^{-1/2} \frac{ ( \sigma_1^2 \beta - 1 )^{1/2} }{ \sigma_1^2 \beta } .
\label{eq.approxbifurc}
\eeq
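 The step from the fixed-point condition to (\ref{eq.approxbifurc}) can be spelled out as a check:
\beq
 ( \beta m )^2 \sigma_1^2 = 1 - \frac{1}{\sigma_1^2 \beta}
 \:\:\: \Rightarrow \:\:\:
 m^2 = \frac{ \sigma_1^2 \beta - 1 }{ \sigma_1^4 \beta^3 } .
\eeq
 For example, taking the illustrative values $\beta = 1$ and $\sigma_1 = 1.1$ (chosen here, not read off the figure), the approximation gives $m \simeq \pm \sqrt{0.21}/1.21 \simeq \pm 0.38$.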
The thin line in \figref{fig.kmeansbifurc}
shows this theoretical approximation.
\Figref{fig.kmeansbifurc} shows the bifurcation as a function of $\sigma_1$
for fixed $\beta$; \figref{fig.kmeansbifurcinv} shows the bifurcation
as a function of $1/\b^{1/2}$ for fixed $\sigma_1$.
\end{aside}
}
\exercissxB{2}{ex.kmeansdetails}{
Why does the pitchfork in \figref{fig.kmeansbifurcinv}
tend to the values
\mbox{$\sim \! \pm 0.8$} as $1/\beta^{1/2} \rightarrow 0$?
Give an analytic expression for this asymptote.
}
\soln{ex.kmeansdetails}{
The asymptote is the mean of the rectified Gaussian,
\beq
 \frac{\int_{0}^{\infty} \Normal(x;0,1) \, x \: \d x}{1/2}
= \sqrt{ 2/\pi } \simeq 0.798 .
\eeq
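 To spell out the steps, assuming that the data in the figure have unit standard deviation, $\sigma_1 = 1$, as the unit-variance Gaussian in the expression above suggests: as $1/\beta^{1/2} \rightarrow 0$ the assignments become hard, each mean is responsible for the half of the data on its own side of $x_1 = 0$, and the update step places it at the mean of that half,
\beq
 \frac{1}{1/2} \int_0^{\infty} \frac{1}{\sqrt{2\pi}} \, e^{-x^2/2} \, x \: \d x
 = \frac{2}{\sqrt{2\pi}} \left[ - e^{-x^2/2} \right]_0^{\infty}
 = \frac{2}{\sqrt{2\pi}} = \sqrt{2/\pi} .
\eeq
 For general $\sigma_1$ the same calculation gives $\pm \sigma_1 \sqrt{2/\pi}$.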
}
%
%
\dvips
\chapter{Exact Inference by Complete Enumeration}
%\chapter{Exact inference by complete enumeration}
\label{ch.enumerate}
% \section{Complete enumeration}
We open our toolbox of methods for handling
probabilities by
discussing a brute-force
inference
% of handling
method: complete enumeration of all
hypotheses, and evaluation of their probabilities.
This approach is an exact method, and the difficulty of
carrying it out will motivate the smarter exact
and approximate methods introduced in the
following chapters.
\section{The {burglar alarm} }
Bayesian probability theory is sometimes called
`common sense, amplified'.
When thinking about the following questions, please ask your
common sense what it thinks the answers are; we will then
see how Bayesian methods confirm your everyday intuition.
% EXAMPLE 1 }
% Explaining away example -- earthquake/burglar?
% stolen from \input{tex/_e1b.tex}% contains earthquake - should be earlier
\fakesection{quake}%
%\begin{figure}
%\figuremargin{%
\amarginfig{t}{\small
\begin{center}
\begin{tabular}{c}
\setlength{\unitlength}{0.451mm}
\begin{picture}(70,50)(-20,-36)%
\put(0,20){\circle{8.5}} % quake
\put(0,28){\makebox(0,0)[b]{Earthquake}} % quake
\put(5,15){\vector(1,-1){10}}
\put(40,20){\circle{8.5}} % burglar
\put(40,28){\makebox(0,0)[b]{Burglar}}
\put(35,15){\vector(-1,-1){10}}
\put(20,0){\circle{8.5}} % alarm
\put(28,0){\makebox(0,0)[l]{Alarm}}
\put(-5,15){\vector(-1,-1){10}}
\put(-20,0){\circle{8.5}} % q report radio
\put(-20,-8){\makebox(0,0)[t]{Radio}}
\put(25,-5){\vector(1,-1){10}}
\put(40,-20){\circle{8.5}} % alarm report
\put(40,-28){\makebox(0,0)[t]{Phonecall}}
\end{picture}
\\
\end{tabular}
%
\end{center}
%}{%
\caption[a]{Belief network for the burglar alarm problem.}
\label{fig.quake}
}%
%\end{figure}
%\subsection*{The {burglar alarm}}
%\exercisxA{1}{ex.burglar}{
\exampl{ex.burglar}{
Fred\index{explaining away}\index{Bayesian belief networks}\index{earthquake and burglar alarm}\index{burglar alarm and earthquake}
lives in Los Angeles and commutes 60 miles to work. Whilst at work,
he receives a phone-call from his neighbour saying that Fred's burglar
alarm is ringing. What is the probability that there was a burglar in his
house today? While driving home to investigate, Fred hears on the radio that
there was a small earthquake that day near his home.
`Oh', he says, feeling relieved, `it was probably the earthquake that
set off the alarm'.
% Given that
% earthquakes sometimes set off burglar alarms (\figref{fig.quake}),
% {\em now\/}
What is the probability that there was
a burglar in his house?
(After Pearl, 1988).
%{\em Aims of this problem: illustrate meaning of probability;
% and show the subtlety of inverse probability: E and B are
% independent, but given A they become dependent.
%
% \input{figs/quake_nums.tex}
}
% \input{tex/_s1b.tex} % earthquake solution
% \fakesection{quake}
Let's introduce
% You may make use of the following probability distributions relating
variables $b$ (a burglar was present in Fred's house today),
$a$ (the alarm is ringing), $p$ (Fred receives a phonecall from the
neighbour reporting the alarm),
$e$ (a small earthquake
% capable of triggering burglar alarms
took place today near Fred's house),
and $r$ (the radio report of earthquake is heard by Fred).
The probability of all these variables might factorize as follows:
\beq
P( b, e, a, p , r ) = P(b) P(e) P(a\given b,e) P(p\given a) P(r\given e) ,
\eeq
and plausible values for the probabilities are:
\begin{enumerate}
\item Burglar probability:
\beq
P(b\eq 1) = \beta , \:\:\: P(b\eq 0) = 1-\beta ,
\eeq
\eg, $\beta = 0.001$ gives a mean burglary rate of once every three
years.
\item Earthquake probability:
\beq
P(e\eq 1) = \epsilon , \:\:\: P(e\eq 0) = 1-\epsilon ,
\eeq
with, \eg, $\epsilon = 0.001$;
our assertion that the earthquakes are independent of burglars, \ie, the
prior probability of $b$ and $e$ is $P(b,e) = P(b)P(e)$,
seems reasonable unless we take into account opportunistic burglars
who strike immediately after earthquakes.
\item Alarm ringing probability: we assume
the alarm will ring if {\em{any}\/} of
the following
three events happens: (a) a burglar enters the house, and
triggers the alarm (let's
assume the alarm has a reliability of $\alpha_b=0.99$,
\ie, 99\% of burglars trigger the alarm);
(b) an earthquake takes place, and triggers the alarm
 (perhaps $\alpha_e =1$\% of earthquakes trigger the alarm?);
or (c) some other event causes a false alarm; let's assume
the false alarm rate $f$ is 0.001, so Fred has false alarms from
non-earthquake causes once every
three years.
[{This type of dependence of $a$ on $b$ and $e$ is known as
a `\ind{noisy-or}'.}]
The probabilities of $a$ given $b$ and $e$ are then:
\[
\begin{array}{rclrcl}
P(a\eq 0\given b\eq 0,\, e\eq 0) &=& (1-f) ,& P(a\eq 1\given b\eq 0,\, e\eq 0) &=& f \\
P(a\eq 0\given b\eq 1,\, e\eq 0) &=& (1-f)(1-\alpha_b) ,& P(a\eq 1\given b\eq 1,\, e\eq 0) &=& 1- (1-f)(1-\alpha_b) \\
P(a\eq 0\given b\eq 0,\, e\eq 1) &=& (1-f)(1-\alpha_e) ,& P(a\eq 1\given b\eq 0,\, e\eq 1) &=& 1- (1-f)(1-\alpha_e) \\
P(a\eq 0\given b\eq 1,\, e\eq 1) &=& (1-f)(1-\alpha_b)(1-\alpha_e) ,& P(a\eq 1\given b\eq 1,\, e\eq 1) &=& 1- (1-f)(1-\alpha_b)(1-\alpha_e)
\end{array}
\]
or, in numbers,
\[
\begin{array}{rclrcl}
P(a\eq 0\given b\eq 0,\, e\eq 0) &=& 0.999 ,& P(a\eq 1\given b\eq 0,\, e\eq 0) &=& 0.001 \\
P(a\eq 0\given b\eq 1,\, e\eq 0) &=& 0.009\,99 ,& P(a\eq 1\given b\eq 1,\, e\eq 0) &=& 0.990\,01 \\
P(a\eq 0\given b\eq 0,\, e\eq 1) &=& 0.989\,01 ,& P(a\eq 1\given b\eq 0,\, e\eq 1) &=& 0.010\,99 \\
P(a\eq 0\given b\eq 1,\, e\eq 1) &=& 0.009\,890\,1 ,& P(a\eq 1\given b\eq 1,\, e\eq 1) &=& 0.990\,109\,9 .
\end{array}
\]
% with $\alpha_b=0.99$, $f=0.001$, $\alpha_e=0.01$.
\end{enumerate}
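The four rows of the noisy-or table above can be summarized in a single expression, given here as a compact restatement for $b,e \in \{0,1\}$:
\beq
 P(a\eq 0\given b,e) = (1-f) \, (1-\alpha_b)^b \, (1-\alpha_e)^e ,
 \:\:\:\:
 P(a\eq 1\given b,e) = 1 - P(a\eq 0\given b,e) .
\eeq
For instance, $b\eq 1$, $e\eq 0$ gives $P(a\eq 0\given b,e) = 0.999 \times 0.01 = 0.009\,99$, in agreement with the numerical table.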
We assume the neighbour would never phone if the
alarm is not ringing [$P(p\eq 1\given a\eq 0)=0$];
and that the radio is a trustworthy reporter too [$P(r\eq 1\given e\eq 0)=0$];
we won't need to specify the probabilities $P(p\eq 1\given a\eq 1)$ or $P(r\eq 1\given e\eq 1)$
 in order to answer the questions above, since the outcomes $p\eq 1$
and $r\eq 1$ give us certainty respectively that $a\eq 1$ and $e\eq 1$.
We can answer the two questions about the burglar
by computing the posterior probabilities of all hypotheses
given the available information.
Let's start by reminding
ourselves that the probability that there is a burglar,
before either $p$ or $r$ is observed, is $P(b\eq 1)=\b=0.001$,
and the probability that an earthquake took place is $P(e\eq 1) = \epsilon = 0.001$,
and these two propositions are {\em independent}.
First, when $p\eq 1$,
we know that the alarm is ringing: $a\eq 1$.
The posterior probability of $b$ and $e$ becomes:
\beq
P(b,e\given a\eq 1) = \frac{ P(a\eq 1 \given b,e ) P(b) P(e ) }{ P(a\eq 1) } .
\eeq
The numerator's four possible values are
\[
\begin{array}{rcl@{\times}l@{\times}lcl}
P(a\eq 1\given b\eq 0,\, e\eq 0)\times P(b\eq 0)\times P(e\eq 0) &=& 0.001 & 0.999 & 0.999 &=& 0.000\,998 \\
P(a\eq 1\given b\eq 1,\, e\eq 0)\times P(b\eq 1)\times P(e\eq 0) &=& 0.990\,01 & 0.001 & 0.999 &=&0.000\,989 \\
P(a\eq 1\given b\eq 0,\, e\eq 1)\times P(b\eq 0)\times P(e\eq 1) &=& 0.010\,99 & 0.999 & 0.001 &=&0.000\,010\,979 \\
P(a\eq 1\given b\eq 1,\, e\eq 1)\times P(b\eq 1)\times P(e\eq 1) &=& 0.990\,109\,9 & 0.001 & 0.001 &=& 9.9\times 10^{-7} .
\end{array}
\]
The normalizing constant is the sum of these four numbers,
$P(a\eq 1) = 0.002$,
% 0.0019989901099
% pr z
% z = 0.001 * 0.999 * 0.999+ 0.99001 * 0.999 * 0.001 + 0.01099 * 0.001 * 0.999 + 0.9901099 * 0.001 * 0.001
and the posterior probabilities are
\beq
\begin{array}{rcl}
P(b\eq 0,\, e\eq 0\given a\eq 1) &=& 0.4993 \\
P(b\eq 1,\, e\eq 0\given a\eq 1) &=& 0.4947 \\
P(b\eq 0,\, e\eq 1\given a\eq 1) &=& 0.0055 \\
P(b\eq 1,\, e\eq 1\given a\eq 1) &=& 0.0005 .
\end{array}
\label{eq.earthquake.post}
\eeq
To answer the question, `what's the probability a burglar was there?'
we {\dem\index{marginalization}{marginalize}\/} over the earthquake variable $e$:
\beq
\begin{array}{rclcl}
P(b\eq 0\given a\eq 1) &=& P(b\eq 0,\, e\eq 0\given a\eq 1) + P(b\eq 0,\, e\eq 1\given a\eq 1) &=& 0.505 \\
P(b\eq 1\given a\eq 1) &=& P(b\eq 1,\, e\eq 0\given a\eq 1) + P(b\eq 1,\, e\eq 1\given a\eq 1) &=& 0.495 .
\end{array}
\eeq
So there is nearly a 50\% chance that there was a burglar present.
It is important to note that the variables $b$ and $e$, which
were independent {\em a priori},
are now {\em dependent}. The posterior distribution (\ref{eq.earthquake.post})
is not a separable function of $b$ and $e$.
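 As a numerical check of this statement, using the rounded values in (\ref{eq.earthquake.post}): if the posterior were separable, the joint probability $P(b\eq 1,\, e\eq 1\given a\eq 1)$ would equal the product of the two marginals,
\beq
 P(b\eq 1\given a\eq 1) \, P(e\eq 1\given a\eq 1) \simeq 0.495 \times 0.0060 \simeq 0.0030 ,
\eeq
 where $P(e\eq 1\given a\eq 1) = 0.0055 + 0.0005 = 0.0060$. The actual joint probability is $0.0005$, about six times smaller.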
%
%pr 0.001 * 0.999 * 0.999/z
%pr 0.99001 * 0.999 * 0.001 /z
%pr 0.01099 * 0.001 * 0.999/z
%pr 0.9901099 * 0.001 * 0.001 /z
%
%pr 0.001 * 0.999 * 0.999/z + 0.01099 * 0.001 * 0.999/z
%pr 0.9901099 * 0.001 * 0.001 /z + 0.99001 * 0.999 * 0.001 /z
This fact is illustrated most simply by studying the effect
 of learning that $e\eq 1$.
 When we learn $e\eq 1$,
the posterior probability of $b$
is given by $P(b \given e\eq 1,\, a\eq 1 ) = P(b ,e\eq 1\given a\eq 1) / P( e\eq 1\given a\eq 1)$,
\ie, by dividing the bottom two rows of
%quantities from
(\ref{eq.earthquake.post}),
%\beq
%\begin{array}{rcl}
% P(b\eq 0,\, e\eq 1\given a\eq 1) &=& 0.0055 \\
% P(b\eq 1,\, e\eq 1\given a\eq 1) &=& 0.0005 ,
%\end{array}
%\label{eq.earthquake.post2}
%\eeq
%pr 0.9901099 * 0.001 * 0.001 /z + 0.01099 * 0.001 * 0.999/z
% e = 0.9901099 * 0.001 * 0.001 /z + 0.01099 * 0.001 * 0.999/z
% pr 0.9901099 * 0.001 * 0.001 /(z*e)
% pr 0.01099 * 0.001 * 0.999/(z*e)
%
by their sum $P( e\eq 1\given a\eq 1) = 0.0060$. The posterior probability of $b$ is:
\beq
\begin{array}{rcl}
P(b\eq 0\given e\eq 1,\, a\eq 1) &=& 0.92 \\
P(b\eq 1\given e\eq 1,\, a\eq 1) &=& 0.08 .
\end{array}
\label{eq.earthquake.post3}
\eeq
% 0.0827220303808637
% 0.917277969619136
There is thus now an 8\% chance that a burglar was in Fred's house.
It is in accordance with everyday intuition that the probability that $b\eq 1$
(a possible cause of the alarm)
reduces when Fred learns that an earthquake, an alternative
explanation of the alarm, has happened.
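 The same number can be obtained directly, as a check, by applying Bayes' theorem to $b$ alone with $e\eq 1$ and $a\eq 1$ both known, using the prior independence $P(b\given e) = P(b)$:
\beqan
 P(b\eq 1\given e\eq 1,\, a\eq 1)
 &=& \frac{ P(a\eq 1\given b\eq 1,\, e\eq 1) \, P(b\eq 1) }
 { \sum_{b'} P(a\eq 1\given b\eq b',\, e\eq 1) \, P(b\eq b') }
\\
 &=& \frac{ 0.990\,109\,9 \times 0.001 }
 { 0.990\,109\,9 \times 0.001 + 0.010\,99 \times 0.999 }
 \:\simeq\: 0.083 ,
\eeqan
 in agreement with (\ref{eq.earthquake.post3}).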
\subsection{Explaining away}
This phenomenon, that one of the possible causes ($b\eq 1$) of some data (the data
in this case being
$a\eq 1$) becomes {\em less\/} probable when another of the causes ($e\eq 1$)
becomes more probable, even though those two causes were independent
variables {\em a priori}, is known as {\dem\ind{explaining away}}.
Explaining away is an important feature of correct inferences,
and one that any artificial intelligence should replicate.
If we believe that the neighbour and the radio
service are unreliable or capricious,
so that we are not certain that the alarm really is
ringing or that an earthquake really has happened, the calculations
become more complex, but the
explaining-away effect persists;
the arrival of the earthquake report $r$
simultaneously makes it {\em more\/} probable that the
alarm truly is ringing, and {\em less\/} probable that
the burglar was present.
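 In that more general case the enumeration runs over the unobserved variables $e$ and $a$ as well as $b$. As a sketch, using the factorization of the joint probability given earlier,
\beq
 P(b\given p\eq 1,\, r\eq 1) =
 \frac{ \sum_{e,a} P(b) P(e) P(a\given b,e) P(p\eq 1\given a) P(r\eq 1\given e) }
 { \sum_{b',e,a} P(b') P(e) P(a\given b',e) P(p\eq 1\given a) P(r\eq 1\given e) } ,
\eeq
 which requires the full tables for $P(p\given a)$ and $P(r\given e)$, including the values that describe how unreliable the two reporters are.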
In summary, we solved the inference questions about the burglar
by enumerating all four hypotheses about the variables $(b,e)$,
finding their posterior probabilities, and marginalizing
to obtain the required inferences about $b$.
\exercisxB{2}{ex.earthquake}{
After Fred receives the phone-call about the burglar alarm, but before
he hears the radio report, what, from his point of view, is the probability that there was
a small earthquake today?
}
\section{Exact inference for continuous hypothesis spaces }
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% probc clustering
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%5
% \section{Mixture modelling}
Many of the hypothesis spaces we will consider
are naturally thought of as continuous. For example,
the unknown decay length $\l$ of
\sectionref{sec.decay} (\pref{sec.decay})
lives in a continuous one-dimensional space;
and
the unknown mean and standard deviation of
a Gaussian $\mu,\sigma$
% $(\mu_1,\mu_2)$ of the Gaussian
live in a continuous two-dimensional space.
 In any practical computer implementation, however,
 such continuous spaces will necessarily be discretized,
 and so can, in principle, be enumerated -- at a grid of parameter
 values, for example. In \figref{decay.like.2} we
plotted the likelihood function for the decay length as a function of $\l$
by evaluating the likelihood at a finely-spaced series of points.
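 On such a grid, the posterior probability is obtained by normalizing over the enumerated points. Writing $\theta_i$ for the $i$th grid point and $D$ for the data (symbols introduced here for illustration only),
\beq
 P(\theta_i \given D) = \frac{ P(D\given \theta_i) \, P(\theta_i) }{ \sum_{j} P(D\given \theta_j) \, P(\theta_j) } ,
\eeq
 so the cost of the method is set by the number of grid points, which grows exponentially with the number of parameters if each parameter is given a fixed number of grid values.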
\subsection{A two-parameter model}
Let's look at the Gaussian distribution as an example of
a model with a two-dimensional hypothesis space.
\begin{figure}
\figuremargin{\begin{center}\begin{tabular}{c}
\fbox{\hspace*{-0.05in}\psfig{figure=mixture/hundred0.ps,angle=-90,width=\skinnytextwidth}\hspace{0.05in}}\\
\end{tabular}\end{center}}
{\caption[a]{Enumeration of an
entire (discretized) hypothesis space for one Gaussian with parameters $\mu$ (horizontal axis) and
$\sigma$ (vertical). }
\label{fig.enumerate.gaussian}}
\end{figure}
The one-dimensional
Gaussian distribution is parameterized by a mean $\mu$
and a standard deviation $\sigma$:
%
\beq
P(x\given \mu,\sigma)
% ,\H_{\rm Normal})
= \frac{1}{\sqrt{2 \pi} \sigma}
\exp \left( - \frac{ ( x-\mu )^2 }{2 \sigma^2 } \right)
\equiv {\rm Normal}(x;\mu,\sigma^2) .
\eeq
%
\Figref{fig.enumerate.gaussian}
shows an enumeration of one hundred
hypotheses about the mean and standard deviation of
a one-dimensional Gaussian distribution.
These hypotheses are evenly spaced in a ten by ten
square grid covering ten values of $\mu$ and
ten values of $\sigma$. Each hypothesis is represented
by a picture showing
% its associated
the probability density that it puts on $x$.
%
%\begin{figure}
\marginfig{\begin{center}\begin{tabular}{c}
\mbox{\psfig{figure=mixture/data5.ps,width=1.9in,angle=-90}}\\[0.03in]
\end{tabular}\end{center}
%}{
\caption[a]{Five datapoints $\{x_n\}_{n=1}^5$.
The horizontal coordinate
is the value of the datum, $x_n$; the vertical coordinate
has no meaning.}
% represents the order in which the data were acquired.}
\label{fivepoints}
}
%\end{figure}
%
We now examine the inference of $\mu$ and $\sigma$
given data points $x_n$, $n=1,\ldots, N$, assumed to be drawn independently
from this density.
% distribution.
%
Imagine that we acquire data, for example the
five points shown in \figref{fivepoints}.
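 We can now evaluate the posterior probability of each of the one hundred
 subhypotheses by evaluating the likelihood of each,
 that is, the value of
 $P( \{x_n\}_{n=1}^5 \given \mu, \sigma )$; if the prior over the grid is uniform,
 the posterior probability is proportional to this likelihood.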
The likelihood values are shown diagrammatically
in \figref{fig.gaussian5} using the line thickness
to encode the value of the likelihood. Subhypotheses
with likelihood smaller than $e^{-8}$ times
the maximum likelihood have been deleted.
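 For reference, the log likelihood evaluated at each grid point can be written in terms of two statistics of the data. With $\bar{x}$ denoting the sample mean (notation assumed here), and using $\sum_n (x_n - \mu)^2 = N (\mu - \bar{x})^2 + \sum_n (x_n - \bar{x})^2$,
\beq
 \ln P( \{x_n\}_{n=1}^N \given \mu, \sigma )
 = - N \ln ( \sqrt{2\pi} \sigma )
 - \frac{ N (\mu - \bar{x})^2 + \sum_n (x_n - \bar{x})^2 }{ 2 \sigma^2 } ,
\eeq
 which is maximized at $\mu = \bar{x}$ and $\sigma^2 = \sum_n (x_n - \bar{x})^2 / N$. The deletion threshold of $e^{-8}$ corresponds to discarding subhypotheses whose log likelihood falls more than 8 below the maximum.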
\begin{figure}
\figuremargin{\begin{center}\begin{tabular}{c}
\fbox{\psfig{figure=mixture/hundred.ps,angle=-90,width=\skinnytextwidth}}\\
\end{tabular}\end{center}}
{\caption[a]{ Likelihood function, given the data of \figref{fivepoints},
represented by line thickness. Subhypotheses having
likelihood smaller than $e^{-8}$ times the maximum
likelihood are not shown.}
\label{fig.gaussian5}}
\end{figure}
Using a finer grid, we can represent the same information by
plotting the likelihood as a surface plot or contour plot
as a function of $\mu$ and $\sigma$ (\figref{like.sig.mu1}).
% copy from bayes_intermediate.tex
% /home/mackay/book/figs
\begin{figure}
\figuremargin{\small%
\vspace{-0.56in}
\begin{center}\small
\begin{tabular}{l@{}l}
%(a1)
\hspace{-0.2in}\raisebox{-8mm}{\psfig{figure=\bookfigs/basic/new_surfaceplot.ps,angle=-90,width=3in}}
&
%(a2)
\hspace{-0.6in}\raisebox{-8mm}{\psfig{figure=\bookfigs/basic/new_contourplot.ps,angle=-90,width=3in}}
\\
%(b)\raisebox{-5mm}{\psfig{figure=\bookfigs/basic/new_posts.ps,angle=-90,width=2.3in}}
%&
%\hspace*{-0.3in}(c)\raisebox{-5mm}{\psfig{figure=\bookfigs/basic/new_sigposts.ps,angle=-90,width=2.3in}}
%\\
\end{tabular}
\end{center}
}{%
\caption[abbrev]{{The likelihood function for the parameters of
a Gaussian distribution}.
% {(a1,a2)}
Surface plot and
contour plot of the log likelihood as a function of $\mu$
and $\sigma$. The data set of $N=5$ points had mean
$\bar{x}=1.0$ and $S^2 = \sum(x-\bar{x})^2 = 1.0$.
% Notice that
% the maximum is skew in $\sigma$. The two estimators of
% standard deviation have values $\sigma_{\ssN}=0.45$ and
% $\sigma_{\ssNM}=0.50$.
%{(b)} The posterior probability of $\mu$ for various values of
% $\sigma$.
%
%{(c)} The posterior probability of $\sigma$ for
% various fixed values of $\mu$.
}
\label{like.sig.mu1}
}%
\end{figure}
%
%
%
\subsection{A five-parameter mixture model}
\label{sec.gaussian.firsttime}
Eyeballing the data (\figref{fivepoints}), you might agree that it seems
more plausible that they come not from a single Gaussian but
from a mixture of two Gaussians,
defined by two means, two standard deviations,
and two {\ind{mixing coefficients}} $\pi_1$ and $\pi_2$,
satisfying $\pi_1+\pi_2=1$, $\pi_i \geq 0$.
\[%beq
 P(x\given \mu_1,\sigma_1,\pi_1,\mu_2,\sigma_2,\pi_2) =
 \frac{\pi_1}{\sqrt{ 2 \pi} \sigma_1} \exp \left( -\smallfrac{(x-\mu_1)^2}{2 \sigma_1^2} \right)
 +
 \frac{\pi_2}{\sqrt{ 2 \pi} \sigma_2} \exp \left( -\smallfrac{(x-\mu_2)^2}{2 \sigma_2^2} \right) .
\]%eeq
Let's enumerate the subhypotheses for this alternative
model.
The parameter space is five-dimensional, so it becomes challenging to
represent it on a single page.
% \Figref{fig.mixture200} shows
\Figref{fig.mixture200} enumerates 800 subhypotheses with
different values of the five parameters
$\mu_1,\mu_2,\sigma_1,\sigma_2,\pi_1$.
The means $\mu_1$ and $\mu_2$ each take five values, varied horizontally.
The standard deviations $\sigma_1$ and $\sigma_2$ each take four values, varied vertically.
And $\pi_1$ takes two values, one in the top half of the figure and one in the bottom half.
We can represent the inference about these five parameters
in the light of the five datapoints as shown in
\figref{fig.mixture200post}.
% And do model comparison too.
\begin{figure}
%\figuredangle{%
\figuremargin{%
\begin{center}\begin{tabular}{c}
\mbox{\psfig{figure=mixture/mix0.0.6.ps,angle=-90,width=\skinnytextwidth}}\\
\mbox{\psfig{figure=mixture/mix0.0.8.ps,angle=-90,width=\skinnytextwidth}}\\
\end{tabular}\end{center}}
{\caption[a]{Enumeration of the
entire (discretized) hypothesis space for a mixture of two Gaussians. Weight of the mixture components
is $\pi_1,\pi_2 = 0.6,0.4$ in the top half and $0.8,0.2$ in the
bottom half. Means $\mu_1$ and $\mu_2$ vary horizontally,
and standard deviations $\sigma_1$ and $\sigma_2$ vary
vertically. }
\label{fig.mixture200}}
\end{figure}
\begin{figure}
\figuremargin{%
% \figuredangle{%
\begin{center}\begin{tabular}{c}
\mbox{\psfig{figure=mixture/D1mix.0.6.ps,angle=-90,width=\skinnytextwidth}}\\
\mbox{\psfig{figure=mixture/D1mix.0.8.ps,angle=-90,width=\skinnytextwidth}}\\
\end{tabular}\end{center}}
{\caption[a]{Inferring a mixture of two Gaussians. Likelihood function,
given the data of \figref{fivepoints},
represented by line thickness.
The hypothesis space is identical to that shown in
\figref{fig.mixture200}.
Subhypotheses having
likelihood smaller than $e^{-8}$ times the maximum
likelihood are not shown, hence the blank regions, which
correspond to hypotheses that the data have ruled out.\medskip
}
\label{fig.mixture200post}
\begin{realcenter}
\mbox{\psfig{figure=mixture/data5.ps,width=1.9in,angle=-90}}
\end{realcenter}
}
\end{figure}
If we wish to compare the one-Gaussian model with the
mixture-of-two model, we can find the models' posterior probabilities
% y of the two models
by evaluating the \ind{marginal likelihood} or \ind{evidence} for each model $\H$,
$P( \{x\} \given \H )$. The evidence
is given by
integrating over the parameters, $\btheta$; the integration can be implemented
numerically by summing over the
alternative enumerated values
of $\btheta$,
\beq
P( \{x\} \given \H ) = \sum_{ \btheta } P(\btheta) P( \{x\} \given \btheta , \H ) ,
\eeq
where $P(\btheta)$ is the prior distribution over the grid of parameter
values, which I take to be uniform.
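As a concrete instance (a sketch, with $G$ introduced here as notation for the number of grid points): if the grid contains $G$ parameter settings $\btheta_g$ and the prior is uniform, then $P(\btheta_g) = 1/G$, and the evidence is simply the average of the likelihood over the grid,
\beq
 P( \{x\} \given \H ) = \frac{1}{G} \sum_{g=1}^{G} P( \{x\} \given \btheta_g , \H ) ;
\eeq
for the mixture model of \figref{fig.mixture200}, $G = 5 \times 5 \times 4 \times 4 \times 2 = 800$.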
% The data set
% contains weak evidence for two clusters,
% and the evidence for the two models shown here comes
% out about 10:1 in favour of the two-Gaussian model.
%
For the mixture of two Gaussians this integral is a five-dimensional integral;
if it is to be performed at all accurately, the grid of points will
need to be much finer than the grids shown in the figures. If the uncertainty
about each of $K$ parameters has been reduced by, say, a factor of ten by observing
the data, then
brute force integration requires a grid of at least $10^K$ points.
This exponential growth of computation with model size is the reason why
complete enumeration is rarely a feasible computational strategy.
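To put rough numbers on this (an illustrative calculation under the factor-of-ten assumption just stated): for the $K=5$ parameters of the mixture of two Gaussians, brute-force integration needs at least
\beq
 10^K = 10^5 = 100\,000
\eeq
grid points, more than a hundred times the 800 subhypotheses enumerated in \figref{fig.mixture200}; each additional parameter multiplies the count by a further factor of ten.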
% inference
% \end{figure}
\exercisxA{1}{ex.tengaussians}{
Imagine fitting a mixture of ten Gaussians to data in a twenty-dimensional
space. Estimate the computational cost of implementing inferences
for this model by enumeration of a grid of parameter values.
}
\dvips
% Show the surface plot of the likelihood also.
% Idea: Add the exam question on biexponential distbn here?
\chapter{Maximum Likelihood and Clustering}
\label{ch.ml}
\label{ch.clust}
% maximum likelihood and clustering - start of chapter
%
Rather than enumerate all hypotheses -- which may
be exponential in number -- we can save a lot of time by
homing in on one good hypothesis that fits the data
well. This is the philosophy behind the \ind{maximum likelihood}
method, which identifies the setting of the parameter vector
$\btheta$ that maximizes the likelihood, $P(\mbox{Data} \given \btheta, \H)$.
For some models the maximum likelihood parameters can be identified
instantly from the data; for more complex models, finding
the maximum likelihood parameters may require an iterative algorithm.
For any model, it is usually easiest to work with the {\em logarithm\/} of
the likelihood rather than the likelihood, since likelihoods, being
products of the probabilities of many data points, tend to be very small.
Likelihoods multiply; log likelihoods add.
\section{Maximum likelihood for one Gaussian}
\label{sec.mloneg}
We return to the Gaussian for our first examples.
Assume we have data $\{ x_n \}_{n=1}^N$.
The log likelihood is:
\beqan
\ln P(\{x_n\}_{n=1}^N \given \mu,\sigma)
&=& -N \ln (\sqrt{2 \pi} \sigma)
-\sum_n \linefrac{(x_n-\mu)^2}{(2 \sigma^2)} .
\eeqan
% Given the Gaussian model,
The likelihood can be expressed
in terms of two functions of the data, the sample mean
\beq
\barx \equiv {\sum_{n=1}^{N} x_n} / {N} ,
\eeq
and the sum of square deviations
\beq
S \equiv \sum_n (x_n-\barx)^2:
\eeq
\beq
\ln P(\{x_n\}_{n=1}^N \given \mu,\sigma)
=
-N \ln (\sqrt{2 \pi} \sigma) - \linefrac{ [ N ( \mu - \barx )^2 + S ]}
{ (2 \sigma^2) } .
\eeq
Because the likelihood depends on the data only through
$\barx$ and $S$,
these two quantities are known as {\dem\ind{sufficient statistics}}.\index{statistic!sufficient}
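The step from the first expression for the log likelihood to the second can be sketched as follows. Writing $x_n - \mu = (x_n - \barx) + (\barx - \mu)$ and expanding the square,
\beqan
 \sum_n (x_n - \mu)^2
 &=& \sum_n (x_n - \barx)^2 + 2 (\barx - \mu) \sum_n (x_n - \barx) + N (\barx - \mu)^2 \\
 &=& S + N (\mu - \barx)^2 ,
\eeqan
where the cross term vanishes because $\sum_n (x_n - \barx) = 0$ by the definition of $\barx$.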
% copy from bayes_intermediate.tex
% /home/mackay/book/figs
\begin{figure}
\figuremargin{\small%
\vspace{-0.56in}
\begin{center}
\begin{tabular}{l@{}l}
% \newcommand{\bookfigs}{/home/mackay/book/figs}
(a1)\hspace{-0.4in}\raisebox{-10mm}{\psfig{figure=\bookfigs/basic/new_surfaceplot.ps,angle=-90,width=3in}}
&
(a2)\hspace{-0.8in}\raisebox{-10mm}{\psfig{figure=\bookfigs/basic/new_contourplot.ps,angle=-90,width=3in}}
\\
(b)\raisebox{-5mm}{\psfig{figure=\bookfigs/basic/new_posts.ps,angle=-90,width=2.3in}}
&
\hspace*{-0.3in}(c)\raisebox{-5mm}{\psfig{figure=\bookfigs/basic/new_sigposts.ps,angle=-90,width=2.3in}}
\\
\end{tabular}
\end{center}
}{%
\caption[abbrev]{{The likelihood function for the parameters of
a Gaussian distribution}.
{(a1, a2)} Surface plot and
contour plot of the log likelihood as a function of $\mu$
and $\sigma$. The data set of $N=5$ points had mean
$\bar{x}=1.0$ and $S = \sum(x-\bar{x})^2 = 1.0$.
% Notice that
% the maximum is skew in $\sigma$. The two estimators of
% standard deviation have values $\sigma_{\ssN}=0.45$ and
% $\sigma_{\ssNM}=0.50$.
{(b)} The posterior probability of $\mu$ for various values of
$\sigma$.
{(c)} The posterior probability of $\sigma$ for
various fixed values of $\mu$ (shown as a density over $\ln \sigma$).
}
\label{like.sig.mu1a}
}%
\end{figure}
%
\exampl{ex.muML}{
Differentiate the log likelihood with respect to $\mu$
% and $\ln \sigma$
and show that,
if the standard deviation is known to be $\sigma$,
the maximum likelihood mean $\mu$ of a Gaussian
% whose
is
equal to the sample mean $\barx$,
for any value of $\sigma$.
}
\solution
\beqan
\frac{\partial}{\partial \mu} \ln P &=& - \frac{N(\mu-\bar{x})}{\sigma^2}\\
&=&0 \:\: \ \mbox{when $\mu = \bar{x}$. \hspace{1in} \ensuremath{\epfsymbol}\hspace{-1in}}
\eeqan
% end soln
If we Taylor-expand the log likelihood about the maximum,
we can define approximate
\ind{error bars} on the maximum likelihood parameter:
we use a quadratic approximation to estimate
how far from the maximum-likelihood parameter setting we can go
before the likelihood falls by some standard factor,
for example $e^{1/2}$, or $e^{4/2}$.
In the special case of a likelihood that is a Gaussian\index{approximation!by Gaussian}
function of the parameters, the quadratic approximation is exact.
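In symbols, for a single parameter $\theta$ with maximum-likelihood value $\theta_{\ML}$ (a sketch; $\sigma_{\theta}$ is notation introduced here for the error bar), the first derivative vanishes at the maximum, so the expansion to second order reads
\beq
 \ln P(\mbox{Data} \given \theta) \simeq \ln P(\mbox{Data} \given \theta_{\ML})
 - \frac{(\theta - \theta_{\ML})^2}{2 \sigma_{\theta}^2} ,
 \:\:\: \mbox{where} \:\:\:
 \frac{1}{\sigma_{\theta}^2} \equiv
 - \left. \frac{\partial^2 \ln P}{\partial \theta^2} \right|_{\theta_{\ML}} .
\eeq
Under this approximation the likelihood at $\theta = \theta_{\ML} \pm \sigma_{\theta}$ is smaller than its maximum by a factor of $e^{1/2}$, and at $\theta = \theta_{\ML} \pm 2 \sigma_{\theta}$ by a factor of $e^{4/2}$.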
\exampl{ex.muML2}{
Find the second derivative of the log likelihood with
respect to $\mu$, and find the error bars on $\mu$, given
the data and $\sigma$.
}
{
\solution
\beq
\frac{\partial^2}{\partial \mu^2} \ln P = - \frac{N}{\sigma^2}.
\hspace{1.4in} \ensuremath{\epfsymbol}\hspace{-1.5in}
% \hfill \ensuremath{\epfsymbol}
\eeq
Comparing this curvature with the curvature of the log of a Gaussian
distribution over $\mu$ of standard deviation $\sigma_{\mu}$,
$\exp ( - \mu^2/(2 \sigma_{\mu}^2) )$,
 which is $-1/\sigma^2_{\mu}$, we can deduce that the error bars
on $\mu$ (derived from the likelihood function) are
\beq
\sigma_{\mu} = \frac{\sigma}{\sqrt{N}} .
\eeq
The \ind{error bars} have this property:
at the two points $\mu = \bar{x} \pm \sigma_{\mu}$, the likelihood is smaller than its maximum
value by a factor of $e^{1/2}$.
}
\exampl{ex.sigML}{
Find the maximum likelihood standard deviation $\sigma$ of a Gaussian,
whose mean is known to be $\mu$,
in the light of data $\{ x_n \}_{n=1}^N$.
Find the second derivative of the log likelihood with
respect to $\ln \sigma$, and error bars on $\ln \sigma$.
}
{
\solution\
The likelihood's dependence on $\sigma$ is
\beq
\ln P(\{x_n\}_{n=1}^N \given \mu,\sigma)
=
-N \ln (\sqrt{2 \pi} \sigma) - \frac{ S_{\rm tot} }
{ (2 \sigma^2) },
\eeq
where $S_{\rm tot} = \sum_n \! {(x_n-\mu)^2}$.
% N ( \mu - \barx )^2 + S$.
To find the maximum of the likelihood, we can differentiate with
respect to $\ln \sigma$. [It's often most hygienic to differentiate
with respect to $\ln u$ rather than $u$, when $u$ is a scale variable;
we use
% Recall $d(e^{nx})/dx = n e^{nx}$,
% so
$\d u^{n}/\d(\ln u) = n u^{n}$.]
%
\beq
\frac{\partial \ln P(\{x_n\}_{n=1}^N \given \mu,\sigma) }
{\partial \ln \sigma}
=
-N + \frac{ S_{\rm tot} }
 { \sigma^2 } .
\eeq
This derivative is zero when
\beq
\sigma^2 = \frac{ S_{\rm tot} }{ N } ,
\eeq
\ie,
\beq
\sigma = \sqrt{
\frac{\sum_{n=1}^{N} ( x_n - \mu )^2 }{N}
} .
\eeq
The second derivative is
\beq
\frac{\partial^2 \ln P(\{x_n\}_{n=1}^N \given \mu,\sigma) }
{\partial (\ln \sigma)^2}
=
- 2 \frac{ S_{\rm tot} } { \sigma^2 } ,
\eeq
and at the maximum-likelihood value of $\sigma^2$,
this equals $-2N$.
% is
%\beq
%\frac{\partial^2 \ln P(\{x_n\}_{n=1}^N \given \mu,\sigma) }
%{\partial (\ln \sigma)^2}= - 2N.
%\eeq
So error bars on $\ln \sigma$ are
\beq
\sigma_{\ln \sigma} = \frac{1}{\sqrt{2N}} .
\hspace{1in} \ensuremath{\epfsymbol}\hspace{-1.1in}
\eeq
}
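As a numerical illustration (a sketch that assumes the known mean happens to equal the sample mean, so that $S_{\rm tot} = S$, and that borrows the values $N=5$ and sum of squared deviations $1.0$ from the data set plotted in \figref{like.sig.mu1a}):
\beq
 \sigma = \sqrt{ \frac{1.0}{5} } \simeq 0.45 ,
 \hspace{0.3in}
 \sigma_{\ln \sigma} = \frac{1}{\sqrt{10}} \simeq 0.32 ,
\eeq
so $\sigma$ is determined to within a factor of about $e^{0.32} \simeq 1.4$. With $\sigma$ set to this value, the error bars on $\mu$ are $\sigma_{\mu} = \sigma/\sqrt{N} \simeq 0.45/2.24 \simeq 0.20$.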
\exercisxB{1}{ex.MLgaussian}{
% Differentiate the log likelihood with respect to $\mu$ and $\ln \sigma$ and s
Show that the values of
$\mu$ and $\ln \sigma$ that jointly maximize the likelihood are:
% \beq
$
\{\mu,\sigma\}_{\ML} = \left\{ \bar{x},\sigma_{\ssN}
= \sqrt{ \linefrac{S}{N} } \right\} ,
$
% \eeq
where
\beq
\sigma_{\ssN} \equiv \sqrt{
\frac{\sum_{n=1}^{N} ( x_n - \barx )^2 }{N}
}
.
\label{eq.sigmaML}
\eeq
}
\section{Maximum likelihood for a mixture of Gaussians}
% kmeans
% LABEL MOG mog
\label{sec.mog}
We now derive an algorithm for fitting a mixture of Gaussians to one-dimensional
 data. In fact, this algorithm is so important to understand that
 {\em you}, gentle reader, get to derive the algorithm. Please work through the following exercise.
\ExercissxA{2}{ex.mixture_em}{
% kmeans
A random variable $x$ is assumed to have a probability
distribution that is a {\em mixture of two Gaussians},
%\beq
% P(x| \mu_1,\mu_2 ,\sigma_1, \sigma_2, p_1, p_2)
% =
%% \frac{1}{2}
% \left[\sum_{c=1}^{2}
% p_c \frac{1}{\sqrt{2 \pi \sigma_c^2}}
% \exp \left( - \frac{(x-\mu_c)^2}{2 \sigma_c^2} \right) \right] ,
%\eeq
% where the two Gaussians are labelled by the class labels
% $c=1$ and $c=2$; $p_1$ and $p_2$ are the prior probabilities
% of the two Gaussians,
% which satisfy $p_1 + p_2 = 1$; and $\{ \mu_c \}$ and
% $\{ \sigma_c\}$ are their means and standard deviations.
% For brevity, we will denote these parameters by
% $\btheta \equiv \left\{ \{ p_c \}, \{ \mu_c \}, \{ \sigma_c\} \right\}$.
%
% Assuming that $\{ p_c \}$, $\{ \mu_c \}$ and
% $\{ \sigma_c\}$ are known and that the standard deviations
% are equal, that is, $\sigma$
\beq
P(x \given \mu_1,\mu_2 ,\sigma)
=
\left[\sum_{k=1}^{2}
% \frac{1}{2}
p_k
\frac{1}{\sqrt{2 \pi \sigma^2}}
\exp \left( - \frac{(x-\mu_k)^2}{2 \sigma^2} \right) \right] ,
\eeq
where the two Gaussians are given the labels
$k=1$ and $k=2$; the prior probability
of the class label $k$ is $\{p_1 \eq 1/2 , \, p_2 \eq 1/2 \}$; $\{ \mu_k \}$ are
the means of the two Gaussians; and both have standard deviation
$\sigma$.
For brevity, we denote these parameters by
$\btheta \equiv \left\{ \{ \mu_k \}, \sigma \right\}$.
A data set consists of $N$ points $\{ x_n \}_{n=1}^N$ which are assumed
to be independent samples
from this distribution. Let $k_n$ denote the unknown class
label of the $n$th point.
Assuming that $\{ \mu_k \}$ and
$\sigma$ are known, show that the
posterior probability of the class label $k_n$ of the $n$th point
can be written as
\begin{equation}
\begin{array}{rcl}
P(k_n \eq 1 \given x_n , \btheta )& =&
\displaystyle \frac{1}{1+\exp[ - ( w_1 x_n + w_0)] }
\\[0.21in]
P(k_n \eq 2 \given x_n , \btheta )& =&
\displaystyle \frac{1}{1+\exp[ + ( w_1 x_n + w_0)] } ,
\end{array}
\label{eq1}
\end{equation}
and give expressions for $w_1$ and $w_0$.
%\marginpar{[5]}
\medskip
Assume now that the means $\{ \mu_k \}$ are {\em not\/} known,
and that we wish to infer them from the data $\{ x_n \}_{n=1}^N$.
(The standard deviation $\sigma$ is known.)
In the remainder of this question we will derive an iterative
algorithm for finding values for $\{ \mu_k \}$ that
maximize the likelihood,
\beq
P( \{ x_n \}_{n=1}^N \given \{ \mu_k \} , \sigma )
= \prod_n P( x_n \given \{ \mu_k \} , \sigma ) .
\eeq
% Assume that we
% have set the parameters $\mu_1, \mu_2$ to some initial values.
% $\{ \mu_k \}$
% but that we do have a current guess for them both
Let $L$ denote the natural log of the likelihood.
Show that the derivative of the log likelihood with respect
to $\mu_k$ is given by
\beq
\frac{\partial}{\partial \mu_k} L
= \sum_n p_{k|n} \frac{( x_n - \mu_k )}{\sigma^2} ,
\eeq
%\marginpar{[5]}
where $p_{k|n} \equiv P( k_n \eq k \given x_n , \btheta )$ appeared
% was discussed
above at equation (\ref{eq1}).
Show, neglecting terms in
$\frac{\partial}{\partial \mu_k} P( k_n \eq k \given x_n , \btheta )$,
that the second derivative is approximately given by
%\marginpar{[2]}
\beq
\frac{\partial^2}{\partial \mu_k^2} L
= - \sum_n p_{k|n} \frac{1}{\sigma^2} .
\eeq
Hence show that from an initial state $\mu_1, \mu_2$,
an approximate \ind{Newton--Raphson} step updates these parameters to
$\mu_1', \mu_2'$, where
\beq
\mu_k' = \frac{ \sum_n p_{k|n} x_n }{ \sum_n p_{k|n} } .
\eeq
[The Newton--Raphson method for maximizing $L(\mu)$
updates $\mu$ to $\mu' = \mu - \left[ \left. \frac{\partial L}{\partial \mu}
\right/ \frac{\partial^2 L}{\partial \mu^2} \right]$.]
%\medskip
% -- inference problem, sigmoid function, adaptive mixture model}
%
\[
\mbox{\hspace{-0.5in}\psfig{figure=figs/points32.ps,angle=-90,width=3in}}
\]
Assuming that $\sigma =1$,
sketch a contour plot of the likelihood function as a function of
$\mu_1$ and $\mu_2$ for the data set shown above.
% The data set consists
% of 200 points, shown by the horizontal coordinates of the
% {\tt x}s below the $x$ axis, and by a histogram
% above it. Indicate the widths of the peaks in your sketch.
The data set consists
of 32 points. Describe the peaks in your sketch
and indicate their widths.
}
Notice that the algorithm you have derived for maximizing
the likelihood is identical to the soft {K-means algorithm}\index{K-means clustering!derivation}
of \secref{sec.SOFT-KMEANS}.\index{learning algorithms!K-means clustering}
Now that it is clear that clustering can be viewed as mixture-density-modelling,\index{density modelling}\index{mixture modelling}\index{modelling!density modelling}
we are able to derive enhancements to the K-means algorithm, which
rectify the problems we noted earlier.\index{clustering}\indexs{K-means clustering}
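 The identification can be made explicit. In the following sketch, the form of the soft K-means responsibilities, with stiffness $\beta$ and distance $d(m,x) = \frac{1}{2}(m-x)^2$ in one dimension, is recalled from \secref{sec.SOFT-KMEANS} and should be checked against that section. With equal priors and a common standard deviation, the posterior probabilities of the class labels are
\beq
 p_{k|n} = \frac{ \exp \left( - (x_n - \mu_k)^2 / (2 \sigma^2) \right) }
 { \sum_{k'} \exp \left( - (x_n - \mu_{k'})^2 / (2 \sigma^2) \right) } ,
\eeq
 which is a soft K-means responsibility $r_k^{(n)}$ with $\beta = 1/\sigma^2$; and the approximate Newton--Raphson step is
\beq
 \mu_k' = \mu_k + \frac{ \sum_n p_{k|n} (x_n - \mu_k) }{ \sum_n p_{k|n} }
 = \frac{ \sum_n p_{k|n} x_n }{ \sum_n p_{k|n} } ,
\eeq
 which is the soft K-means update of each mean to the responsibility-weighted average of the data.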
% such as unequal variance algorithm and
% unequal masses of clusters.
\begin{algorithm}
\algorithmmargin{%
\begin{description}
\item[Assignment step\puncspace]
The responsibilities are
\beq
r_k^{(n)} = \frac{ \pi_k \frac{1}{(\sqrt{2 \pi} \sigma_k)^I}
\exp \left( - \displaystyle\frac{1}{\sigma^2_k} \, d(\bm^{(k)} ,\bx^{(n)}) \right) }
 {\sum_{k'} \pi_{k'} \frac{1}{(\sqrt{2 \pi} \sigma_{k'})^I}
\exp \left( - \displaystyle \frac{1}{\sigma^2_{k'}} \, d(\bm^{(k')} ,\bx^{(n)}) \right) }
\label{eq.assignII}
\eeq
where $I$ is the dimensionality of $\bx$.
\item[Update step\puncspace]% also called Adaptation or Reestimation
Each cluster's parameters, $\bm^{(k)}$, $\pi_k$, and $\sigma^2_k$,
are adjusted to match
the data points that it is responsible for.
\beq
\bm^{(k)} = \frac{ \displaystyle \sum_{n} \rnk \bx^{(n)} }{ R^{(k)} }
\label{eq.softkmeans.meanupdate}
\eeq
\beq
\sigma^2_{k} = \frac{ \displaystyle \sum_{n} \rnk ( \bx^{(n)} - \bm^{(k)} )^2 }{ I R^{(k)} }
\label{eq.softkmeans.varianceupdate}
\eeq
\beq
 \pi_{k} = \frac{ R^{(k)} }{ \sum_{k'} R^{(k')} }
\eeq
where $R^{(k)}$ is the total responsibility of mean $k$,
\beq
R^{(k)} = \sum_{n} \rnk .
\eeq
% and $I$ is the dimensionality of $\bx$.
\end{description}
}{
\caption{The soft K-means algorithm, version 2.}
\label{alg.kmeansoft2}
}
\end{algorithm}
% .1 just shows the data.
% .last shows the final state and should be included, ideally
% .2, .4 show initial params, updated params and new assignments, so are the best.
\begin{figure}
\figuremargin{%
\begin{center}\small\hspace*{0.2in}
\begin{tabular}{*{10}{l}}
\softtfbig{2.2}{0}&
\softtfbig{2.4}{1}&
\softtfbig{2.6}{2}&
\softtfbig{2.8}{3}&
\softtfbig{2.19}{9}
\\[0.012in]
\softtfbig{4.2}{0}&
\softtfbig{4.4}{1}&
\softtfbig{4.22}{10}&
\softtfbig{4.42}{20}&
\softtfbig{4.62}{30}&
\softtfbig{4.72}{35}
%\softtfbig{4.81}{40}
\\
\end{tabular}
\end{center}
}{%
\caption[a]{Soft K-means algorithm, with $K=2$,
applied (a) to the 40-point data set of
\protect\figref{fig.kmeans.2}; (b) to the little 'n' large data set of
\protect\figref{fig.kmeans.xbs}. }
\label{fig.skmeans.2dK2}
}%
\end{figure}
\begin{algorithm}
\algorithmmargin{%
%%%%%%%%%%%%%%%%%%%%%%55 CHECK box needed
\beq
r_k^{(n)} = \frac{ \pi_k \displaystyle \frac{1}{\prod_{i=1}^I \sqrt{2 \pi} \sigma_i^{(k)}}
\exp \left( - \displaystyle \sum_{i=1}^I \lfrac{(m_i^{(k)}-x_i^{(n)})^2}{2( \sigma_i^{(k)})^2} \right) }
{\sum_{k'} \mbox{ (numerator, with $k'$ in place of $k$) } }
\label{eq.assignIII}
\eeq
\beq
{\sigma^2_{i}}^{(k)} = \frac{
\displaystyle \sum_{n} \rnk ( x^{(n)}_i - m^{(k)}_i )^2 }{ R^{(k)} }
\label{eq.softkmeans.varianceupdate.axisaligned}
\eeq
}{
\caption{The soft K-means algorithm, version 3, which
corresponds to a model of axis-aligned Gaussians.}
\label{alg.kmeansoft3}
}
\end{algorithm}
\section{Enhancements to soft K-means}
\Algref{alg.kmeansoft2} shows
%%%%%%%%%%%%% stolen from clust.tex
a version of the soft-K-means algorithm corresponding
to a modelling assumption that each cluster is a
spherical Gaussian having its own width
(each cluster has
its own $\beta^{(k)} = \lfrac{1}{\sigma^2_{k}}$).
% First, version 2 of the soft K-means algorithm
% removes the job of adjusting $\b=1/\sigma^2$ by
% giving every cluster its own lengthscale parameter $\sigma_k$
% which is updated so as to maximize the likelihood.
% add reference to ML sig example above
%
The algorithm updates the lengthscales $\sigma_k$ for itself.
The algorithm also includes cluster weight parameters $\pi_1,\pi_2,\ldots, \pi_K$
which also update themselves, allowing accurate modelling
of data from clusters of unequal weights.
This algorithm is demonstrated in
% erroneous reference to earlier chapter!
% \figref{fig.skmeans.2d} and
\figref{fig.skmeans.2dK2}
%
% CHECK THESE REFS
%
for two data sets that
we've seen before.
The second example shows that convergence can take a long time, but eventually
the algorithm identifies the small cluster and the large cluster.
% Do my demos include adapting weights $\pi$ as well? Yes, effectively.
%
%\begin{aside}
% Where did all this come from?
% Well, if you did the \exerciseref{ex.mixture_em}
%% (exam q on k-means) in bayes_intermediate.tex
% then you have a derivation. It's a maximum likelihood algorithm.
% Later, we will give a more general derivation, once we have
% learnt about variational methods.
% Then show that the update rules (EM) both increase a single variational
% objective function.
%\end{aside}
Soft K-means, version 2, is a maximum-likelihood
algorithm for fitting a mixture of {\em spherical Gaussians\/} to data --%
\marginpar{\small\raggedright{A proof that the algorithm does indeed maximize the likelihood
is deferred to \secref{sec.EM}.}}
`spherical' meaning that the variance of the Gaussian is the same in
all directions. This algorithm is still no good at modelling the
cigar-shaped clusters of \figref{fig.kmeans.lozenge}.
If we wish to model the clusters by axis-aligned Gaussians
with possibly-unequal variances, we
replace the assignment rule (\ref{eq.assignII})
and the variance update rule (\ref{eq.softkmeans.varianceupdate})
by the rules
(\ref{eq.assignIII}) and
(\ref{eq.softkmeans.varianceupdate.axisaligned}) displayed in
\algref{alg.kmeansoft3}.
% was displayed HERE, moved it to be with alg 2.
\begin{figure}
\figuremargin{%
\begin{center}\small\hspace*{0.1025in}
\begin{tabular}{*{10}{l}}
\softtfbbig{2.2}{0}&
%\softtfbbig{2.4}{1}&
\softtfbbig{2.22}{10}&
\softtfbbig{2.42}{20}&
\softtfbbig{2.60}{30}
\\[0.12in]
\end{tabular}
\end{center}
}{%
\caption[a]{Soft K-means algorithm, version 3, applied to the data consisting
of two cigar-shaped clusters. $K=2$ (\cf\ \figref{fig.kmeans.lozenge}).}
\label{fig.skmeans.lozenge}
}%
\end{figure}
\begin{figure}
\figuremargin{%
\begin{center}\small\hspace*{0.05125in}
\begin{tabular}{*{10}{l}}
\softtfbig{18.2}{0}&
\softtfbig{18.22}{10}&
\softtfbig{18.42}{20}&
\softtfbig{18.54}{26}&
\softtfbig{18.65}{32}
\\%[0.12in]
\end{tabular}
\end{center}
}{%
\caption[a]{Soft K-means algorithm, version 3, applied to the little 'n' large data set. $K=2$.}
\label{fig.skmeans.2f}
}%
\end{figure}
\begin{figure}
\figuremargin{%
\begin{center}\small\hspace*{0.1025in}
\begin{tabular}{*{10}{l}}
\softtfbbig{4.2}{0}&
\softtfbbig{4.12}{5}&
\softtfbbig{4.22}{10}&
\softtfbbig{4.42}{20}&
\\%[0.12bigin]
\end{tabular}
\end{center}
}{%
\caption[a]{Soft K-means algorithm applied to a data set of 40 points. $K=4$.
Notice that at convergence, one very small cluster has formed between
two data points.}
\label{fig.skmeans.2g}
}%
\end{figure}
% ^^^^^^6 Fri 29/6/01 I stripped out the $K=4$. case from this.
% it had an interesting singular cluster, but not really representative
% of real life.
% 6.33 in graveyard
This third version of soft K-means is demonstrated in
\figref{fig.skmeans.lozenge} on the `two cigars' data
set of \figref{fig.kmeans.lozenge}.
After 30 iterations, the algorithm correctly
locates the two clusters.
\Figref{fig.skmeans.2f} shows the same algorithm applied to
the little 'n' large data set; again, the
correct cluster locations are found.
\section{A fatal flaw of maximum likelihood}
\label{sec.kaboom}
Finally,
\figref{fig.skmeans.2g} sounds a cautionary note: when we fit $K=4$ means
to our first toy data set, we sometimes find that very small clusters form,
covering just one or two data points. This is a pathological property
of soft K-means clustering, versions 2 and 3.
\exercisxB{2}{ex.kaboom}{
Investigate what happens if one mean $\bm^{(k)}$ sits exactly
on top of one data point; show that if the variance $\sigma^2_k$
is sufficiently small, then no return is possible: $\sigma^2_k$
becomes ever smaller.}
\subsection{KABOOM!}
Soft K-means can blow up.\index{kaboom}\index{blow up}
% end \section{Soft clustering}
Put one cluster exactly on one data point
and let its variance go to zero --
you can obtain an arbitrarily large likelihood!
Maximum likelihood methods can break down
% absurdly
by finding highly tuned models that fit
part of the data perfectly. This phenomenon is known
as \ind{overfitting}. The reason we are not interested
in these solutions with enormous likelihood is this: sure,
these parameter-settings may have enormous posterior probability
{\em density\/}, but the density is large over only a very small
{\em volume\/} of parameter space. So the probability
{\em mass\/}
associated with these likelihood spikes is usually tiny.
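 To see the blow-up in symbols (a sketch using the spherical-Gaussian mixture that underlies \algref{alg.kmeansoft2}; the label $\bx^{(1)}$ for the data point in question is arbitrary): if $\bm^{(k)} = \bx^{(1)}$, then the exponential factor of component $k$ at that point equals one, so, for any fixed $\pi_k > 0$,
\beq
 P(\bx^{(1)} \given \btheta) \geq \frac{\pi_k}{(\sqrt{2 \pi} \sigma_k)^I}
 \rightarrow \infty \:\:\: \mbox{as} \:\:\: \sigma_k \rightarrow 0 ,
\eeq
 while every other data point retains a finite, nonzero probability through the remaining components, provided that those components have nonzero weight and their parameters are left unchanged. The product of these factors, the likelihood, therefore grows without bound.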
% This overfitting problem is one reason why we must say bye-bye to maximum likelihood.
% Another example of overfitting: Imagine
% we are interested in making a model of the surnames in a telephone directory;
% one theory says that 5\% of people are called Smith, 3\% Jones, 2\% Davis, etc.;
% another theory says that 20\% are called Lo and 15\% are called Li; indeed we
% can imagine a high-dimensional continuum of such hypotheses.
% You tear a random page from the phone directory and pick a random name:
% it's {\tt{Shercliff}}.
% In the light of this datum, what is the maximum likelihood hypothesis? Answer: the hypothesis
% that says that 100\% of the surnames are {\tt{Shercliff}}!
%
We conclude that maximum likelihood methods are not a satisfactory
general solution to data-modelling problems:\index{sermon!maximum likelihood}
the likelihood may be infinitely large at certain parameter settings.
Even if the likelihood does not have infinitely-large spikes,
the maximum of the likelihood is often unrepresentative,
in high-dimensional problems.
Even in low-dimensional problems,
maximum likelihood solutions can be unrepresentative.
As you may know from basic statistics, the
maximum likelihood estimator (\ref{eq.sigmaML}) for a
Gaussian's standard deviation, $\sigma_{\ssN}$\index{bias!in statistics},
is a {\em{biased}\/} estimator, a topic that we'll take up in
\chref{ch.exactmarg}.
\subsubsection{The maximum {\itshape a posteriori\/} (MAP) method}
A popular replacement for maximizing the likelihood
is maximizing the Bayesian posterior probability density
of the parameters instead.
However, multiplying
the likelihood by a prior and maximizing the posterior
does not make the above problems go away;
the posterior density often also has infinitely-large spikes,
and the maximum of the posterior probability density is
often unrepresentative of the whole posterior distribution.
Think back to the concept of typicality, which we encountered in \chref{ch.two}:
in high dimensions, most of the probability mass is in a typical set
whose properties are quite different from the points that have
the maximum probability density. Maxima are atypical.
A further reason\index{sermon!maximum {\em a posteriori\/} method}
 for disliking the maximum {\em a posteriori\/} method is that it is {\em basis-dependent}.\index{basis dependence}
If we make a nonlinear change of basis from the
parameter $\theta$ to the parameter $u=f(\theta)$
then the probability density of $\theta$
is transformed to
\beq
P(u) = P(\theta) \left| \frac{ \partial \theta}{\partial u} \right| .
\label{eq.transformation.of.density}
\eeq
The maximum of the density $P(u)$ will
usually not coincide with the maximum of the density $P(\theta)$.
(For figures illustrating such nonlinear changes of basis, see
the next chapter.)
It seems undesirable to use a method whose answers change
when we change representation.
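 For example (an illustrative density chosen here for convenience, not one taken from the text), let $\theta > 0$ have density $P(\theta) = \theta e^{-\theta}$, whose maximum is at $\theta = 1$. Under the change of basis $u = \ln \theta$, we have $\left| \partial \theta / \partial u \right| = \theta$, so
\beq
 P(u) = \theta^2 e^{-\theta} , \:\:\: \mbox{with} \:\:\: \theta = e^{u} ,
\eeq
 whose maximum is at $\theta = 2$, \ie, $u = \ln 2$. The two representations disagree about the most probable value of $\theta$ by a factor of two.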
\section*{Further reading}
The soft K-means algorithm is at the heart of the automatic classification
package, \ind{AutoClass}
\cite{AutoClass,AutoClassTR}.
\section{Further exercises}
\subsection{Exercises where maximum likelihood may be useful}
\exercisxC{3}{ex.KmeansD}{
Make a version of the K-means algorithm that
models the data as a mixture of $K$ arbitrary Gaussians, \ie,
Gaussians that are not constrained to be axis-aligned.
}
\exercisxB{2}{ex.poissonml}{
\ben
\item A \ind{photon counter} is pointed at a remote
star for one minute, in order to infer the brightness,
\ie, the rate of
photons arriving at the counter per minute, $\l$.
Assuming the number of photons collected $r$ has a
\ind{Poisson
distribution} with mean $\l$,
\beq
P(r \given \l ) = \exp( - \l)\frac{ \l^{r} }{r!} ,
\eeq
what is the maximum likelihood estimate for $\l$, given $r = 9$?
Find error bars on $\ln \l$.
\item
Same situation, but now we assume that the
counter detects not only photons from the star but
also `background' photons.
The \ind{background rate} of {photon}s is known to be $b \eq 13$ photons
per minute. We assume the number of photons collected, $r$,
has a Poisson distribution with mean $\l+b$.
Now, given $r\eq 9$ detected photons, what is the maximum likelihood estimate
for $\l$?
Comment on this answer, discussing also the Bayesian posterior
distribution, and the `unbiased\index{unbiased estimator}\index{sermon!unbiased estimator}
\ind{estimator}\index{bias!in statistics}'
of sampling theory, $\hat{\l} \equiv r-b$.
\een
}
\exercisxC{2}{ex.bentcoin}{
A bent coin is tossed $N$ times, giving $N_a$ heads and $N_b$
tails. Assume a beta distribution prior for the probability of heads, $p$,
for example the uniform distribution.
Find the maximum likelihood and {maximum {\em a posteriori\/}}\index{maximum {\em a posteriori}}
values of $p$, then find the maximum likelihood and {maximum {\em a posteriori\/}}
values of the logit $a \equiv \ln[p/(1-p)]$. Compare with the
predictive distribution, \ie, the probability that the next
toss will come up heads.
}
\exercisxB{2}{ex.stars}{
{\em Two men looked through prison bars; one
saw stars, the other tried to infer where the
window frame was.}
\amarginfignocaption{t}{
\newcommand{\imwidthb}{25}
\begin{center}\footnotesize\small
\setlength{\unitlength}{0.03in}
\begin{picture}(38,31)(-7.5,-5.1)
\put(0,\imwidthb){\line(1,0){\imwidthb}}
\put(\imwidthb,0){\line(0,1){\imwidthb}}
\put(0,0){\line(1,0){\imwidthb}}
\put(0,0){\line(0,1){\imwidthb}}
%
\put(0,0){\makebox(0,0)[tr]{\footnotesize$(x_{\min},y_{\min})$}}
\put(\imwidthb,\imwidthb){\makebox(0,0)[bl]{\footnotesize$(x_{\max},y_{\max})$}}
\put(4.5,15){\makebox(0,0)[r]{$\star$}}
\put(8.7,6){\makebox(0,0)[t]{$\star$}}
\put(14.5,21.8){\makebox(0,0)[r]{$\star$}}
\put(12.7,12.9){\makebox(0,0)[t]{$\star$}}
\put(24.5,9.5){\makebox(0,0)[r]{$\star$}}
\put(10.88,4){\makebox(0,0)[t]{$\star$}}
\end{picture}
\end{center}
%}{%
% \caption[a]{}
% \label{fig.stars}
}%
From the other side of a room,
you look through a \ind{window} and see \ind{stars} at locations
$\{ (x_n,y_n) \}$. You can't see the window edges
because it is totally dark apart from the stars.
Assuming the window is rectangular
and that the visible stars' locations are independently randomly distributed,
what are the inferred values of $(x_{\min},$ $y_{\min}$, $x_{\max}$, $y_{\max})$,
according to maximum likelihood?
Sketch the likelihood as a function of $x_{\max}$, for fixed $x_{\min}$,
$y_{\min}$, and $y_{\max}$.
}
\exercisxB{3}{ex.navigator}{
A%
\amarginfig{t}{
\newcommand{\locone}{\put(5,5)}
\newcommand{\loctwo}{\put(32,-1)}
\newcommand{\locthr}{\put(10,31)}
\newcommand{\imwidthc}{25}
\begin{center}\footnotesize\small
\setlength{\unitlength}{0.03975in}
\begin{picture}(38,40)(-7.5,-6)
\locone{\circle{1}}
\loctwo{\circle{1}}
\locthr{\circle{1}}
\locone{\line(1,1){\imwidthc}}
\locone{\makebox(0,0)[tr]{\footnotesize$(x_{1},y_{1})$}}
\loctwo{\line(-1,2){15}}
\loctwo{\makebox(0,0)[tr]{\footnotesize$(x_{2},y_{2})$}}
\locthr{\line(3,-2){20}}
\locthr{\makebox(0,0)[tr]{\footnotesize$(x_{3},y_{3})$}}
\end{picture}
\end{center}
%}{%
\caption[a]{The standard way of drawing
three slightly inconsistent bearings on a chart
produces a triangle called a cocked hat. Where is the sailor?}
\label{fig.buoys}
}
sailor infers his location $(x,y)$ by measuring the
bearings of three buoys whose
locations $(x_n,y_n)$ are given on his chart.
Let the true bearings of the buoys be $\theta_n$.
Assuming that his measurement $\tilde\theta_n$ of each bearing
is subject to Gaussian noise of small standard deviation $\sigma$,
what is his inferred location, by maximum likelihood?
% http://education.qld.gov.au/tal/kla/compass/html/cncha.htm
The sailor's rule of thumb says that the boat's
position can be taken to be the centre of the
\ind{cocked hat}, the \ind{triangle} produced
by the intersection of the three measured bearings (\figref{fig.buoys}).
Can you persuade him that the maximum likelihood answer is better?
%
% 2 answers: 1) consider special case where two buoys
% very close. Then those bearings very accurate, should ignore the third.
% The centre of the triangle
% may be some way away from the intersection of the first two bearings.
% 2) consider special case where the triangle does not exist
% because the three bearings intersect on the wrong side of one of the buoys.
% /
% --*--Boat known to be out in this direction
% \
}
\exercissxB{3}{ex.mlmaxenta}{
{\sf Maximum likelihood fitting of an \ind{exponential-family} model.}
Assume that a variable $\bx$ comes from a probability
distribution of the form
\beq
P(\bx \given \bw) = \frac{1}{Z(\bw)} \exp \left( \sum_k w_k f_k(\bx) \right),
\eeq
where the functions $f_k(\bx)$ are given, and the parameters $\bw = \{ w_k \}$
are not known.
A data set $\{ \bx^{(n)} \}$ of $N$ points is supplied.
Show by differentiating the log likelihood that the maximum-likelihood
parameters $\wml$ satisfy
\beq
\sum_{\bx} P(\bx \given \wml) f_k(\bx) = \frac{1}{N} \sum_{n} f_k(\bx^{(n)}) ,
\eeq
where the left-hand sum is over {\em all\/} $\bx$, and the right-hand
sum is over the data points.
A shorthand for this result is that each function-average under the
fitted model must equal the function-average found in the data:
\beq
\left< f_k \right>_{ P(\bx \given \wml) } =
\left< f_k \right>_{ {\rm Data} } .
\eeq
}
\exercisxB{3}{ex.mlmaxentb}{
{\sf `Maximum entropy' fitting of models to constraints.}\index{maximum entropy}
When confronted by a probability distribution $P(\bx)$
about which only a few facts are known, the {\dem{maximum entropy principle}\/} (maxent)
offers a rule for {\em choosing\/} a distribution that
satisfies those constraints.
According to \ind{maxent}, you should select the
$P(\bx)$ that maximizes the entropy
\beq
H = \sum_{\bx} P(\bx) \log 1/P(\bx) ,
\eeq
subject to the constraints.
Assuming the constraints assert that
the {\em averages\/} of certain functions $f_k(\bx)$ are known, \ie,
\beq
\left< f_k \right>_{ P(\bx) } = F_k ,
\label{eq.consME}
\eeq
show, by introducing Lagrange multipliers (one for each constraint,
including normalization),
that the maximum-entropy
distribution has the form
\beq
P(\bx)_{\rm Maxent} = \frac{1}{Z} \exp \left( \sum_k w_k f_k(\bx) \right) ,
\eeq
where the parameters $Z$ and $\{ w_k \}$ are set such that
the constraints (\ref{eq.consME}) are satisfied.
Hence, when the constrained values $F_k$ are the function-averages found
in a data set, the maximum entropy method gives results identical
to maximum likelihood fitting of an \ind{exponential-family} model
(previous exercise).
\begin{aside}
The maximum entropy method has sometimes been recommended
as a method for assigning \index{prior!assigning}prior
distributions in Bayesian modelling.
While the outcomes of the maximum entropy method are sometimes
interesting and thought-provoking, I do not advocate maxent
as {\em the\/} approach to assigning \ind{prior}s.
Maximum entropy is also sometimes proposed as a method
for solving inference problems -- for example, `given that
the mean score of this unfair six-sided die is 2.5, what is its
probability distribution $(p_1,p_2,p_3,p_4,p_5,p_6)$?'
I think it is a bad idea to use maximum entropy in this way;\index{sermon!maximum entropy}
it can give silly answers. The correct way to solve
inference problems is to use \Bayes\ theorem.
\end{aside}
}
\subsection{Exercises where maximum likelihood and MAP have difficulties}
\exercisxB{2}{ex.mog}{
This exercise explores the idea that maximizing a probability density
is a poor way to find a point that is representative of the density.
Consider a Gaussian distribution in a $k$-dimensional space,
$P(\w) = \left( 1/(\sqrt{2 \pi} \, \sigW) \right)^k \exp\left( -\sum_{i=1}^k w_i^2/(2 \sigW^2) \right)$.
Show that nearly all of the probability mass of a Gaussian is in a thin shell
of radius $r=\sqrt{k} \sigW$ and of thickness proportional to
% $\propto
$r/\sqrt{k}$. For example, in 1000 dimensions, 90\% of the mass of a
Gaussian with $\sigW = 1$ is in a shell of radius 31.6 and thickness
2.8.
% 2.4 sigma gives 0.9986 of a Gaussian.
However, the probability {\em density\/} at the origin is $e^{k/2}
\simeq 10^{217}$ times bigger than the density at this shell where
most of the probability mass is.
%$\bullet$
Now consider two Gaussian densities in 1000 dimensions that differ in
radius $\sigW$ by just 1\%, and that contain equal total probability mass.
% In
% each case 90\% of the mass is located in a shell which differs in
% radius by only 1\% between the two distributions.
Show that the maximum
probability density
%, however,
is greater at the centre of the
Gaussian with smaller $\sigW$ by a factor of $\sim \! \exp( 0.01 k )
\simeq 20\,000$.
In \ind{ill-posed problem}s,
a typical posterior
distribution is often a weighted
superposition of Gaussians with varying means and standard deviations,
so the true posterior has a skew peak, with the maximum of the
probability density located near the mean of the
Gaussian distribution that has the smallest standard deviation,
not the Gaussian with the greatest weight.
}
\exercisxB{3}{ex.manyparams}{ {\sf The seven scientists}.
$N$ datapoints $\{x_n\}$ are drawn from
$N$ distributions, all of which are Gaussian with
a common mean $\mu$ but with different unknown standard deviations $\sigma_n$.
What are the maximum likelihood parameters
$\mu, \{ \sigma_n \}$ given the data?
\marginfig{
\begin{center}
\mbox{\psfig{figure=figs/manyparams.ps,width=1.75in,angle=-90}}
\\[0.431in]
\begin{tabular}{cr@{.}l} \toprule
Scientist & \multicolumn{2}{c}{ $x_n$ } \\ \midrule
A & $-$27&020 \\
B & 3&570 \\
C & 8&191 \\
D & 9&898 \\
E & 9&603 \\
F & 9&945 \\
G & 10&056 \\ \bottomrule
\end{tabular}
\end{center}
\caption[a]{Seven measurements $\{x_n\}$ of a parameter $\mu$
by seven scientists each having his own
noise-level $\sigma_n$.}
\label{fig.manyparams}
}
For example, seven scientists (A, B, C, D, E, F, G)
with wildly-differing
\ind{experimental skill}s measure $\mu$. You expect some of them to do accurate
work (\ie, to have small $\sigma_n$), and some of them to turn in
wildly inaccurate answers (\ie, to have enormous $\sigma_n$).
\Figref{fig.manyparams} shows their seven results.
What is $\mu$, and how reliable is each scientist?
I hope you agree that, intuitively, it looks pretty certain
that A and B are both inept measurers, that D--G are better, and
that the true value of $\mu$ is somewhere close to 10.
But what does maximizing the likelihood tell you?
}
\exercisxC{3}{ex.alpha}{
{\sf Problems with MAP method.}
A collection of widgets $i=1,\ldots, k$ have a property called `wodge',
$w_i$, which we measure, widget by widget, in noisy experiments with
a known noise level $\snu\eq 1.0$. Our model for these quantities
is that they come from a Gaussian prior $P(w_i \given \a) = \Normal(0,\dfrac{1}{\a})$,
where $\a
\eq 1 / \sigW^2 $ is not known. Our prior for this variance is flat
over $\log \sigW$ from $\sigW = 0.1$ to $\sigW = 10$.
{\sf Scenario 1.} Suppose four widgets have been measured and give
the following data: $\{d_1,d_2,d_3,d_4\}=
\{$2.2, $-2.2$, 2.8, $-2.8\}$.
We are interested in inferring the wodges of these four
widgets.
\ben
\item Find the values of $\bw$ and $\a$ that maximize the
posterior probability $P(\bw, \log \a \given \bd)$.
\item
Marginalize over $\a$ and find the posterior probability
density of $\bw$ given the data. [Integration skills required. See
\citeasnoun{MacKay94:alpha_nc} for solution.]
Find maxima
of $P(\bw \given \bd)$.
[Answer: two maxima -- one at
$\wmp =
\{1.8,-1.8,2.2,-2.2 \},$ with error bars on all four parameters (obtained
from Gaussian approximation to the posterior) $\pm 0.9$;
and one at $\wmp' =
\{ 0.03 , - 0.03 , 0.04 , - 0.04 \}$ with error bars $\pm 0.1$.]
\een
{\sf Scenario 2.} Suppose in addition to the four measurements above
we are now informed that there are
four more widgets that have been measured with a
much less accurate instrument, having $\snu'\eq 100.0$. Thus we now
have both well-determined and ill-determined parameters, as in a typical
\ind{ill-posed problem}. The data from these measurements were
a string of uninformative
values,
$\{d_5,d_6,d_7,d_8\}= \{$100, $-100,$ 100,
$-100\}$.
We are again asked to infer the wodges of the widgets.
Intuitively, our inferences about
the well-measured widgets should be negligibly affected by this vacuous
information about the poorly-measured widgets.
But what happens to the MAP method?
\ben
\item Find the values of $\bw$ and $\a$ that maximize the
posterior probability $P(\bw, \log \a \given \bd)$.
\item
Find maxima
of $P(\bw \given \bd)$.
[Answer:
only one maximum,
$\wmp = \{
0.03$, $-0.03$, $0.03$, $-0.03$, $0.0001$, $-0.0001$, $0.0001$, $-0.0001 \}$,
with
% marginal
error bars on
all eight parameters $\pm 0.11$.]
% \sigma_{w|D} = 0.11$.]
\een
% see bayes/alpha4.ms
% see bayes/alpha974.ms
}
\section{Solutions}
\soln{ex.mixture_em}{
% Follow the instructions.
%
\amarginfig{c}{
\begin{raggedright}
\raisebox{-0.795in}[0in][0in]{\mbox{\hspace*{-0.55in}\psfig{figure=figs/likeanswer.ps,angle=-90,width=2.9in}}}
% made by dologmix
\end{raggedright}
\caption[a]{The likelihood
as a function of $\mu_1$ and $\mu_2$.}
\label{fig.32mog}
}%
\Figref{fig.32mog} shows
a contour plot of the likelihood function for the 32 data points.
The peaks are pretty-near centred on
the points $(1,5)$ and $(5,1)$, and are pretty-near
circular in their contours. The width of each of the peaks
is a standard deviation of $\sigma/\sqrt{16} = 1/4$.
The peaks are roughly Gaussian in shape.
}
\soln{ex.mlmaxenta}{
% {\sf Maximum likelihood fitting of an \ind{exponential-family} model.}
The log likelihood is:
\beq
\ln P( \{ \bx^{(n)} \} \given \bw) = -N\ln {Z(\bw)}
+ \sum_n \sum_k w_k f_k(\bx^{(n)}) .
\eeq
\beq
\frac{\partial}{\partial w_k}
\ln P( \{ \bx^{(n)} \} \given \bw)
= - N \frac{\partial}{\partial w_k} \ln {Z(\bw)} + \sum_n f_k(\bx^{(n)}) .
\eeq
Now, the fun part is what happens when we differentiate the
log of the normalizing constant:
\[
\frac{\partial}{\partial w_k} \ln {Z(\bw)} \ = \
\frac{1}{Z(\bw)} \sum_{\bx} \frac{\partial}{\partial w_k} \exp \left( \sum_{k'} w_{k'} f_{k'}(\bx) \right)
\]
\beq
= \
\frac{1}{Z(\bw)} \sum_{\bx} \exp \left( \sum_{k'} w_{k'} f_{k'}(\bx) \right) f_k(\bx)
\ = \
\sum_{\bx} P( \bx \given \bw) f_k(\bx) ,
\eeq
so
\beq
\frac{\partial}{\partial w_k}
\ln P( \{ \bx^{(n)} \} \given \bw)
= - N \sum_{\bx} P( \bx \given \bw) f_k(\bx) + \sum_n f_k(\bx^{(n)}) ,
\eeq
and at the maximum of the likelihood,
\beq
\sum_{\bx} P(\bx \given \wml) f_k(\bx) = \frac{1}{N} \sum_{n} f_k(\bx^{(n)}) .
\eeq
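To make this result concrete, consider an illustrative special case
(chosen here for convenience; it is not part of the exercise):
a real scalar $x$, with $f_1(x) = x$ and $f_2(x) = x^2$, and with the sum
over $\bx$ replaced by an integral. With $w_2 < 0$ the model is a Gaussian,
and the two conditions read
\beq
 \left< x \right>_{ P(x \given \wml) } = \frac{1}{N} \sum_{n} x^{(n)} , \:\:\:\:
 \left< x^2 \right>_{ P(x \given \wml) } = \frac{1}{N} \sum_{n} \left( x^{(n)} \right)^2 ,
\eeq
so the fitted Gaussian has the same mean and the same mean square,
and hence the same variance, as the data.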
}
\chapter{Useful Probability Distributions}
\label{ch.distributions}
% This chapter is unfortunately found a little intimidating
% because it uses gamma distributions, which are not really
% worth being scared of, and are not central to the chapter.
% Gamma distributions are a lot like Gaussian distributions,
% except that whereas the Gaussian goes from $-\infty$ to $\infty$,
% gamma distributions go from 0 to $\infty$.
% Include a graph of a gamma distribution here.
\newcommand{\dinkyfig}[1]{\mbox{\psfig{figure=#1,angle=-90,width=1.51in}}}
\newcommand{\dinkyfigl}[1]{\mbox{\psfig{figure=#1,angle=-90,width=1.64in}}}
\amarginfig{t}{\small%
\begin{tabular}{r}
% $P(r \given f,N)$\\
\dinkyfig{bigrams/urn.f.g.ps}%
\\
\dinkyfigl{bigrams/urn.f.l.ps}%
\\[-0.1in]
\multicolumn{1}{c}{$r$}
\\
\end{tabular}
%}{%
\caption[a]{The binomial distribution $P(r \given f\eq 0.3,\,N \eq 10)$,
on a linear scale (top) and a logarithmic scale (bottom).}
\label{fig.binomial.again}
}
In Bayesian data modelling, there's a small collection of
probability distributions that come up again and again.
The purpose of this chapter is to introduce these distributions
so that they won't be intimidating when encountered in
combat situations.
There is no need to memorize any of them, except
perhaps the Gaussian;
if a distribution is important enough,
it will memorize itself, and otherwise, it
can easily be looked up.
\section{Distributions over integers}
\begin{center}
{\sf Binomial, Poisson, exponential}\par
\end{center}
\noindent
\index{distribution!useful}\index{probability distributions}We already encountered the binomial distribution and the
Poisson distribution on page \pageref{sec.poisson}.
The {\dem\ind{binomial distribution}\/} for an integer\index{distribution!binomial}
$r$ with parameters $f$ (the bias, $f \in [0,1]$)
and $N$ (the number of trials) is:
\beq
P(r \given f,N) = {N \choose r} f^{r} (1-f)^{N-r} \:\:\:\:\:\: r \in \{ 0,1,2,\ldots , N \} .
\label{eq.binomial.again}
\eeq
The binomial distribution arises, for example, when we flip a bent
coin, with bias $f$, $N$ times, and observe the number of heads, $r$.
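As a reminder (standard results, stated here without derivation),
the mean and variance of the binomial distribution are
\beq
 \left< r \right> = N f , \:\:\:\: \mbox{var}(r) = N f (1-f) .
\eeq
For the distribution shown in \figref{fig.binomial.again}, with
$f \eq 0.3$ and $N \eq 10$, these are $3$ and $2.1$, so the typical
spread in $r$ is $\sqrt{2.1} \simeq 1.4$.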
\medskip
% see bigrams/README
The {\dem\ind{Poisson distribution}\/} with parameter $\l > 0$ is:\index{distribution!Poisson}
\beq
P( r \given \l ) = e^{-\l} \frac{\l^r}{r!} \:\:\:\:\:\: r\in \{ 0,1,2,\ldots\} .
\label{eq.poisson.again}
\eeq
The Poisson distribution arises, for example, when we count the number
of photons $r$ that arrive in a pixel during a fixed interval,
given that the mean intensity on the pixel corresponds to
an average number of photons $\l$.
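A quick check of (\ref{eq.poisson.again}): the normalization follows from the
series $\sum_{r} \l^r / r! = e^{\l}$, and the mean is
\beq
 \left< r \right> = \sum_{r=1}^{\infty} r \, e^{-\l} \frac{\l^r}{r!}
 = \l \sum_{r=1}^{\infty} e^{-\l} \frac{\l^{r-1}}{(r-1)!} = \l ,
\eeq
in agreement with the description of $\l$ as the average number of photons.
The variance is also $\l$ (a standard result), so the typical
fluctuation in a count is $\sqrt{\l}$.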
\amarginfig{b}{\small%
~\\[0.2in]
\begin{tabular}{r}
\mbox{\psfig{figure=bigrams/poisson.a1.g.ps,angle=-90,width=1.5in}}%
\\
\mbox{\psfig{figure=bigrams/poisson.a1.l.ps,angle=-90,width=1.64in}}%
\\[-0.1in]
\multicolumn{1}{c}{$r$}
\\
\end{tabular}
%}{%
\caption[a]{The Poisson distribution $P(r \given \l\eq 2.7)$,
on a linear scale (top) and a logarithmic scale (bottom).}
\label{fig.poisson.2}
}
% see bigrams/README
\medskip
The {\dem{exponential distribution on integers}},\index{exponential distribution!on integers}\index{distribution!exponential}
\beq
P(r \given f)
=
 f^{r} (1-f) \:\:\:\:\:\: r \in \{ 0,1,2,\ldots \} ,
\label{eq.exponentiali}
\eeq
arises in waiting problems. How long will you have to
wait until a six is rolled, if a fair six-sided die is rolled?
Answer: the probability distribution of the number of rolls, $r$,
that fail to show a six before the first six appears
is exponential over integers with parameter $f=5/6$.
The distribution may also be written
\beq
P(r \given f)
=
 (1-f) \, e^{-\lambda r} \:\:\:\:\:\: r \in \{ 0,1,2,\ldots \} ,
\label{eq.exponentialii}
\eeq
where $\lambda = \ln (1/f)$.
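As a check on (\ref{eq.exponentiali}), the geometric series gives
the normalization and the mean:
\beq
 \sum_{r=0}^{\infty} f^r (1-f) = 1 , \:\:\:\:
 \left< r \right> = \sum_{r=0}^{\infty} r f^r (1-f) = \frac{f}{1-f} .
\eeq
For the fair die, $f = 5/6$ gives $\left< r \right> = 5$ and
$P(r \eq 0) = 1/6$: with this parameterization, $r$ counts the rolls
that do not show a six before the six itself appears.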
\section{Distributions over unbounded real numbers}
\begin{center}
{\sf Gaussian, Student, Cauchy, biexponential, inverse-cosh.}
\par
\end{center}
\noindent
The {\dem\ind{Gaussian distribution}\/} or \ind{normal} distribution\index{distribution!Gaussian}\index{distribution!normal}
with mean $\mu$ and standard deviation $\sigma$
is
\beq
P(x \given \mu,\sigma) = \frac{1}{Z} \exp \left( - \frac{(x-\mu)^2}{2 \sigma^2} \right)
\ \ \:\: x\in(-\infty,\infty) ,
\eeq
where
\beq
Z = \sqrt{ 2 \pi \sigma^2 } .
\eeq
It is sometimes useful to work with the quantity $\tau \equiv 1/\sigma^2$,
which is called the {\dem\ind{precision}} parameter of the Gaussian.
%\begin{aside}
{A \ind{sample}\index{sample!from Gaussian} $z$\index{Gaussian distribution!sample from}\index{distribution!Gaussian!sample from}
from a standard univariate Gaussian can be generated by computing
\beq
z = \cos(2 \pi u_1) \sqrt{2 \ln(1/u_2) },
\eeq
where $u_1$ and $u_2$ are uniformly distributed in $(0,1)$.}
%
A second sample $z_2 = \sin(2 \pi u_1) \sqrt{2 \ln(1/u_2) }$,
independent of the first, can then be obtained
for free.
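Here is a sketch of why this recipe works; the symbols $\rho$ and $\phi$
are introduced only for this argument, and $z_1$ denotes the first sample $z$.
Write $\rho = \sqrt{2 \ln(1/u_2)}$ and $\phi = 2 \pi u_1$, so that
$z_1 = \rho \cos \phi$ and $z_2 = \rho \sin \phi$.
Since $u_2 = e^{-\rho^2/2}$ is uniform, the density of $\rho$ is
$\left| \partial u_2 / \partial \rho \right| = \rho \, e^{-\rho^2/2}$,
and $\phi$ is uniform on $(0,2\pi)$. Changing variables from polar to
Cartesian coordinates, for which the area element is
$\rho \, {\rm d}\rho \, {\rm d}\phi$, gives
\beq
 P(z_1,z_2) = \frac{1}{2\pi} \, \rho \, e^{-\rho^2/2} \, \frac{1}{\rho}
 = \frac{1}{\sqrt{2\pi}} e^{-z_1^2/2} \: \frac{1}{\sqrt{2\pi}} e^{-z_2^2/2} ,
\eeq
which is a product of two standard Gaussians.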
%\end{aside}
The Gaussian distribution is widely used and often asserted
to be a very common distribution in the real world, but I am
sceptical about this assertion. Yes, {\em unimodal\/} distributions
may be common; but a Gaussian is a special, rather extreme,
unimodal distribution. It has very light tails: the log-probability-density
decreases quadratically.\index{tail}
The typical deviation of $x$ from $\mu$ is $\sigma$, but the
respective probabilities that $x$ deviates from $\mu$ by more than $2 \sigma$,
$3 \sigma$, $4 \sigma$, and $5 \sigma$,
are
$0.046$, 0.003, $6 \times 10^{-5}$, and $6 \times 10^{-7}$.
% 046
% 0027
% 6.3
% 5.7
In my experience, deviations from a mean four or five times greater
than the typical deviation may be rare, but
% they can happen more often than
not as rare as $6 \times 10^{-5}$!
I therefore urge caution\index{caution!Gaussian distribution}
in the use of Gaussian distributions:
if a variable that is modelled with a Gaussian
actually has a heavier-tailed\index{tail} distribution, the rest of the model
will contort itself to reduce the deviations of the
outliers, like a sheet of paper being crushed by a
rubber band.
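The tail probabilities quoted above can be reproduced from the Gaussian tail integral.
Writing the deviation in units of $\sigma$ as $t$ (a symbol introduced
for this check),
\beq
 P( |x - \mu| > t \sigma ) = 2 \int_{t}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-z^2/2} \, {\rm d}z
 = \mbox{erfc} \! \left( t/\sqrt{2} \right) ,
\eeq
where erfc is the complementary error function.
Evaluating this at $t = 2, 3, 4, 5$ gives approximately
$0.0455$, $0.0027$, $6.3 \times 10^{-5}$, and $5.7 \times 10^{-7}$.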
\exercisxB{1}{ex.findstats}{
Pick a variable that is supposedly bell-shaped
in probability distribution, gather data,
and make a plot of the variable's empirical distribution. Show the distribution
as a histogram on a log scale and investigate whether
the tails are well-modelled by a Gaussian distribution.
[One example of a variable to study is the amplitude of
an audio signal.]
}
One distribution with heavier tails than a Gaussian
is a {\dem\ind{mixture of Gaussians}}. A mixture of two Gaussians,
for example, is defined by two means, two standard deviations,
and two {\dem\ind{mixing coefficients}} $\pi_1$ and $\pi_2$,
satisfying $\pi_1+\pi_2=1$, $\pi_i \geq 0$.
\[%beq
P(x \given \mu_1,\sigma_1,\pi_1,\mu_2,\sigma_2,\pi_2) =
% \pi_1
\frac{\pi_1}{\sqrt{ 2 \pi} \sigma_1} \exp \left( -\smallfrac{(x-\mu_1)^2}{2 \sigma_1^2} \right)
+
%\pi_2
\frac{\pi_2}{\sqrt{ 2 \pi} \sigma_2} \exp \left( -\smallfrac{(x-\mu_2)^2}{2 \sigma_2^2} \right).
\]%eeq
%
% ?????????/ Sun 3/2/02
% \begin{figure}
If we take an appropriately weighted mixture of an infinite number
of Gaussians, all having mean $\mu$,
we obtain a {\dem\ind{Student-$t$ distribution}},\index{distribution!Student-$t$}
\beq
P(x \given \mu,s,n)
= \frac{1}{Z} \frac{1}{ ( 1+(x-\mu)^2/(n s^2) )^{(n+1)/2} } ,
\label{eq.student}
\eeq
where
% (CHECK) published by William Gosset in 1908. His employer, Guinness Breweries, required him to publish under a pseudonym, so he chose "Student."
% checked, http://mathworld.wolfram.com/Studentst-Distribution.html
%\begin{figure}
%\figuremargin{\small
\amarginfig{b}{\small
\begin{center}
\begin{tabular}{rr}
\dinkyfig{bigrams/student1.ps}%
\\
\dinkyfigl{bigrams/student1.l.ps}%
\end{tabular}
\end{center}
%}{
\caption[a]{Three unimodal distributions.
Two Student distributions, with parameters
 $(n,s)=(1,1)$ (heavy line) (a Cauchy distribution)\index{distribution!Cauchy}
and $(2,4)$ (light line),
and a Gaussian distribution with mean $\mu = 3$ and
standard deviation $\sigma=3$ (dashed line),
shown on linear vertical scales (top) and logarithmic
vertical scales (bottom).
Notice that the heavy tails of the Cauchy distribution
are scarcely evident in the upper `bell-shaped curve'.}
\label{fig.student}
}
%\end{figure}
%%%%%%%%%%%%% CHECK !!!!!!!!!!!!!!!!!11
\beq
Z = \sqrt{ \pi n s^2 } \frac{ \Gamma(n/2) }{ \Gamma((n+1)/2) }
\eeq
and $n$ is called the number of degrees of
freedom and $\Gamma$ is the gamma function.
If $n>1$ then the Student distribution (\ref{eq.student}) has a mean
and that mean is $\mu$. If $n>2$
the distribution also has a finite variance,
$\sigma^2 = ns^2/(n-2)$.
As $n \rightarrow \infty$, the Student
distribution approaches the normal distribution
with mean $\mu$ and standard deviation $s$.
The Student distribution arises both in classical
statistics (as the sampling-theoretic distribution
of certain statistics) and in Bayesian inference
(as the probability distribution of a variable
coming from a Gaussian distribution whose
standard deviation we aren't sure of).
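The Bayesian route can be sketched as follows. The symbols
$\tau$ (the precision), $s_{\tau}$, and $c$ are introduced here for illustration,
with $s_{\tau}$ and $c$ the scale and shape of a gamma distribution (\ref{gamma.dist});
the matching of parameters is worked out here and is worth checking
rather than taking on trust.
If $x$ is Gaussian with mean $\mu$ and precision $\tau$, and $\tau$ has a gamma
distribution with parameters $(s_{\tau},c)$, then
\beq
 P(x) \propto \int_0^{\infty} \! \tau^{1/2} e^{-\tau (x-\mu)^2/2} \: \tau^{c-1} e^{-\tau/s_{\tau}} \, {\rm d}\tau
 \:\propto\: \frac{1}{ \left( 1 + s_{\tau} (x-\mu)^2 / 2 \right)^{c+1/2} } ,
\eeq
which has the form (\ref{eq.student}) with $n = 2c$ and $s^2 = 2/(n s_{\tau})$;
equivalently, the mean precision $s_{\tau} c$ equals $1/s^2$.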
In the special case $n=1$, the Student
distribution is called the {\dem{\ind{Cauchy distribution}}}.
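Explicitly, putting $n=1$ in (\ref{eq.student}) and using
$\Gamma(1/2) = \sqrt{\pi}$ and $\Gamma(1) = 1$ gives $Z = \pi s$ and
\beq
 P(x \given \mu,s) = \frac{1}{\pi s} \, \frac{1}{ 1 + (x-\mu)^2/s^2 } .
\eeq
The tails of this density fall off only as $1/x^2$, so the integral
defining the mean does not converge, consistent with the condition
$n>1$ above.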
\medskip
A distribution whose tails are intermediate in heaviness between\index{tail}
Student and Gaussian is the {\dem\ind{biexponential distribution}},\index{distribution!biexponential}
\beq
P(x \given \mu,s) =
\frac{1}{Z} \exp \left( - \frac{|x - \mu|}{s} \right) \:\: x \in (-\infty,\infty)
\eeq
where
\beq
Z = 2 s.
\eeq
% figure here from 01.tex
\medskip
The {\dem\ind{inverse-cosh distribution}\/}\index{distribution!inverse-cosh}
\beq
P(x \given \beta) \propto \frac{1}{[\cosh(\beta x)]^{1/\beta}}
\eeq
is a popular model in \ind{independent component analysis}.
In the limit
of large $\beta$, the
% nonlinearity becomes a step function and the
probability distribution $P(x \given \b)$ becomes a biexponential distribution.
%, $Pp_i(s_i) \propto \exp(-|x|)$.
In the limit $\beta \rightarrow 0$,
$P(x \given \b)$ approaches a Gaussian with mean zero and variance $1/\beta$.
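Both limits can be seen from a short expansion (a sketch, with the
normalizing constant ignored). For large $\beta$ and $x \neq 0$,
$\cosh(\beta x) \simeq e^{\beta |x|}/2$, so
$[\cosh(\beta x)]^{-1/\beta} \simeq 2^{1/\beta} e^{-|x|} \rightarrow e^{-|x|}$.
For small $\beta$, $\ln \cosh (\beta x) \simeq \beta^2 x^2 /2$, so
\beq
 \frac{1}{[\cosh(\beta x)]^{1/\beta}}
 = \exp \left( - \frac{1}{\beta} \ln \cosh (\beta x) \right)
 \simeq \exp \left( - \frac{\beta x^2}{2} \right) .
\eeq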
\section{Distributions over {\slshape\textbf{positive\/}} real numbers}
\begin{center}
{\sf {Exponential}, gamma, inverse-gamma, and {log-normal}.}
\par
\end{center}
\noindent
The {\dem\ind{exponential distribution}},\index{distribution!exponential}
\beq
P(x \given s) =
\frac{1}{Z} \exp \left( - \frac{x}{s} \right) \:\: \ \ x \in (0,\infty) ,
\label{eq.exponential}
\eeq
where
\beq
Z = s,
\eeq
arises in waiting problems. How long will you have to
wait for a bus in \ind{Poissonville}, given that buses arrive independently
at random with one every $s$ minutes on average?
Answer: the probability distribution of your wait, $x$,
is exponential with mean $s$.
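For instance, the probability that you are still waiting after a time $t$
(a symbol introduced here for illustration) is
\beq
 P(x > t \given s) = \int_t^{\infty} \frac{1}{s} \, e^{-x/s} \, {\rm d}x = e^{-t/s} ,
\eeq
so there is a chance $e^{-1} \simeq 0.37$ of waiting longer than the mean.
Because $P(x > t + t' \given x > t) = e^{-t'/s}$, the time you have already
waited tells you nothing about how much longer you must wait.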
\medskip
The {\dem\ind{gamma distribution}} is like a Gaussian distribution,\index{distribution!gamma}
except that whereas the Gaussian goes from $-\infty$ to $\infty$,
gamma distributions go from 0 to $\infty$.
Just as the Gaussian distribution has two parameters $\mu$ and $\sigma$
which control the mean and width of the distribution,
the gamma distribution has two parameters.
It is the product of the one-parameter
exponential distribution
(\ref{eq.exponential}) with a polynomial, $x^{c-1}$.
The exponent $c$ in the polynomial is the second parameter.
\beq
P( x \given s,c ) \:\:=\:\: \Gamma( x ; s , c ) \:\:=\:\:
\frac{1}{Z}
\left( \frac{ x }{s} \right)^{c-1} \!
\exp \left( - \frac{x}{s} \right)
,\:\:\: 0\leq x < \infty
\label{gamma.dist}
\eeq
where
\beq
Z = \Gamma(c) s .
\eeq
This is a simple peaked distribution with mean $sc$ and
variance $s^2c$.
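Both results follow from the normalizing constant and the identity
$\Gamma(c+1) = c \, \Gamma(c)$; as a check,
\beq
 \left< x \right> = \frac{1}{\Gamma(c) s} \int_0^{\infty} \! x
 \left( \frac{x}{s} \right)^{c-1} \! e^{-x/s} \, {\rm d}x
 = s \, \frac{\Gamma(c+1)}{\Gamma(c)} = sc ,
\eeq
and similarly $\left< x^2 \right> = s^2 \, \Gamma(c+2)/\Gamma(c) = s^2 c (c+1)$,
so the variance is $\left< x^2 \right> - \left< x \right>^2 = s^2 c$.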
It is often natural to represent
a positive real variable $x$ in terms of its logarithm
$l = \ln x$.
The probability density of $l$ is
% x = e^l
% dx/dl = e^l = x
\beqan
P(l)& =& P(x(l)) \, \left| \frac{\partial x}{\partial l} \right|
\:\:
= \:\: P(x(l)) x(l) \label{eq.transformlog} \\
&=&
\frac{1}{Z_l}
\left( \frac{ x(l) }{s} \right)^{\! c}
\exp \left( - \frac{x(l)}{s} \right)
,
\label{gamma.distl}
\eeqan
where
%
\beq
Z_l \:\: =\:\: \Gamma(c) .
\eeq
[{{The gamma distribution is named after its normalizing constant --
an odd convention, it seems to me!}}]
\begin{figure}
\figuremargin{\small
\begin{center}
\begin{tabular}{rr}
\dinkyfig{bigrams/gamma1.3.x.ps}%
&
\dinkyfig{bigrams/gamma1.3.l.ps}%
\\
\dinkyfigl{bigrams/gamma1.3.x.l.ps}%
&
\dinkyfigl{bigrams/gamma1.3.l.l.ps}%
\\
\hspace*{0.6in} $x$ & \hspace*{0.6in} $l = \ln x$ \\
\end{tabular}
\end{center}
}{
\caption[a]{Two gamma distributions, with parameters
 $(s,c)=(1,3)$ (heavy lines) and $(10,0.3)$ (light lines),
shown on linear vertical scales (top) and logarithmic
vertical scales (bottom);
and shown as a function of $x$ on the left (\ref{gamma.dist})
and $l = \ln x$ on the right (\ref{gamma.distl}).}
\label{fig.gammas}
}
\end{figure}
\Figref{fig.gammas} shows a couple of gamma distributions as a function
of $x$ and of $l$. Notice that where the original gamma
distribution (\ref{gamma.dist}) may have a `spike' at $x=0$, the
distribution over $l$ never has such a spike. The spike
is an artefact of a bad choice of basis.
In the limit $sc = 1, c
\rightarrow 0$, we obtain the {noninformative prior} for a scale
parameter, the $1/x$ prior. This \ind{improper} {prior} is called
noninformative because it has no associated length scale,
no characteristic value of $x$, so it prefers all values of $x$
equally. It is
{invariant} under the reparameterization $x \rightarrow m x$.
If we transform the $1/x$ probability density into a density over $l= \ln x$
we find the latter density is uniform.
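Both properties follow from the rule for transforming densities
(\ref{eq.transformlog}). For the logarithm,
$P(l) = P(x) \left| \partial x / \partial l \right| \propto (1/x) \, x = 1$.
For a rescaled variable $x' = m x$, with $m$ a positive constant,
\beq
 P(x') = P(x) \left| \frac{\partial x}{\partial x'} \right|
 \propto \frac{1}{x} \, \frac{1}{m} = \frac{1}{x'} ,
\eeq
so the density has the same form in the new variable.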
\exercisxB{1}{ex.power}{
Imagine that we reparameterize a positive variable $x$
in terms of its cube root, $u = x^{1/3}$.
If the probability density of $x$ is the improper distribution $1/x$,
what is the probability density of $u$?
}
The gamma distribution is always a unimodal density over
$l = \ln x$, and, as can be
seen in the figures, it is asymmetric.
If $x$ has a gamma distribution,
and we decide to work in terms of the inverse of $x$,
$v=1/x$, we obtain a new distribution, in which
the density over $l$ is flipped left-for-right:
the probability density
of $v$ is called
an {\dem\ind{inverse-gamma distribution}},
% v = 1/x
% x = 1/v
% mult by |dx/dv|
% = 1/v^2
\beq
P( v \given s,c ) =
\frac{1}{Z_v}
\left( \frac{ 1 }{s v} \right)^{\! c+1}
\exp \left( - \frac{1}{s v} \right)
, \ \ \ \ 0\leq v < \infty
\label{inversegamma.dist}
\eeq
\begin{figure}
\figuremargin{\small
\begin{center}
\begin{tabular}{rr}
\dinkyfig{bigrams/igamma1.3.x.ps}%
&
\dinkyfig{bigrams/igamma1.3.l.ps}%
\\
\dinkyfigl{bigrams/igamma1.3.x.l.ps}%
&
\dinkyfigl{bigrams/igamma1.3.l.l.ps}%
\\
$v$ & $\ln v$\\
\end{tabular}
\end{center}
}{
\caption[a]{Two inverse gamma distributions, with parameters
 $(s,c)=(1,3)$ (heavy lines) and $(10,0.3)$ (light lines),
 shown on linear vertical scales (top) and logarithmic
 vertical scales (bottom);
 and shown as a function of $v$ on the left
 and $\ln v$ on the right.}
\label{fig.igammas}
}
\end{figure}
where
% (CHECK)
% not checked yet.
\beq
Z_v = \Gamma(c) / s .
\eeq
Gamma and inverse gamma distributions crop up in many
inference problems in which a positive quantity
is inferred from data. Examples include inferring
the variance of Gaussian noise from some noise samples,
and inferring the rate parameter of a Poisson distribution\index{distribution!Poisson}
from the count.
Gamma distributions also arise naturally in
the distributions of waiting times between Poisson-distributed events.
Given a Poisson process with rate $\l$, the probability
density of the arrival time $x$ of the $m$th event
is
\beq
\frac{ \l (\l x)^{m-1} }{ ( m\! - \! 1 )! } \, e^{-\l x} .
\eeq
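This is a gamma distribution: setting $s = 1/\l$ and $c = m$ in
(\ref{gamma.dist}), and using $\Gamma(m) = (m-1)!$ for integer $m$, gives
\beq
 \frac{1}{\Gamma(m) \, (1/\l)} \left( \l x \right)^{m-1} e^{-\l x}
 = \frac{ \l (\l x)^{m-1} }{ (m-1)! } \, e^{-\l x} .
\eeq
As a check, its mean $sc = m/\l$ is $m$ times the mean interval $1/\l$
between events, and for $m=1$ it reduces to the exponential distribution
(\ref{eq.exponential}) with $s = 1/\l$.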
% check, m=1 -> exp(-lx) . good.
\subsubsection{Log-normal distribution}
Another distribution over a positive
real number $x$
is the {\dem{\ind{log-normal}}} distribution,\index{distribution!log-normal}
which is the distribution that results when
$l = \ln x$ has a normal distribution.
We define $m$ to be the median value of $x$,
and $s$ to be the standard deviation of $\ln x$.
\beq
P(l \given m,s) = \frac{1}{Z} \exp \left( - \frac{(l-\ln m)^2}{2 s^2} \right)
\ \ \:\: l\in(-\infty,\infty) ,
\eeq
where
\beq
Z = \sqrt{ 2 \pi s^2 },
\eeq
implies
% via {eq.transformlog}'s relp
\beq
 P(x \given m,s ) = \frac{1}{x} \frac{1}{Z}
 \exp \left( - \frac{(\ln x -\ln m)^2}{2 s^2} \right)
\ \ \:\: x\in(0,\infty) .
\eeq
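Some consequences, stated as a sketch that the reader may wish to verify:
because the density of $l$ is symmetric about $\ln m$, half the probability
lies below $x = m$, which is why $m$ is the median.
The mean and the mode of $x$ are
\beq
 \left< x \right> = m \, e^{s^2/2} , \:\:\:\: x_{\rm mode} = m \, e^{-s^2} ,
\eeq
so the larger $s$ is, the further the mean lies above the median and the
mode below it. For $(m,s)=(3,1.8)$ the mean is about $15$ and the mode
about $0.12$; for $(3,0.7)$ they are about $3.8$ and $1.8$.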
%\begin{figure}
\marginfig{\small
%\figuremargin{
\begin{center}
\begin{tabular}{rr}
\dinkyfig{bigrams/lognormal.ps}%
\\
\dinkyfigl{bigrams/lognormal.l.ps}%
\end{tabular}
\end{center}
%}{
\caption[a]{Two log-normal distributions, with parameters
$(m,s)=(3,1.8)$ (heavy line)
and $(3,0.7)$ (light line),
shown on linear vertical scales (top) and logarithmic
vertical scales (bottom). [Yes, they really do have
the same value of the median, $m=3$.]}
\label{fig.lognormal}
}
%\end{figure}
\section{Distributions over periodic variables\nonexaminable}
A \ind{periodic variable} $\theta$ is a real number
$\in [0,2 \pi]$\index{distribution!over periodic variables}
having the property that $\theta=0$ and $\theta=2 \pi$ are
equivalent.
% identical
A distribution that plays for periodic variables
the role played by the Gaussian distribution for real variables is the
{\dem\ind{Von Mises distribution}}:\index{distribution!Von Mises}
\beq
P(\theta \given \mu,\beta) =\frac{1}{Z} \exp \left( \beta \cos( \theta - \mu )
\right) \:\:\:\: \theta \in (0,2\pi).
\eeq
The normalizing constant is $Z= 2\pi I_0(\beta)$, where
$I_0(x)$ is a modified Bessel function.
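The parameter $\beta$ plays the role of a precision.
As a rough sketch (a small-angle approximation, not an exact statement):
when $\beta$ is large, the density is concentrated near $\theta = \mu$,
where $\cos(\theta - \mu) \simeq 1 - (\theta-\mu)^2/2$, so
\beq
 P(\theta \given \mu,\beta) \:\propto\: \exp \left( - \frac{\beta (\theta - \mu)^2}{2} \right) ,
\eeq
approximately, \ie, a Gaussian with standard deviation $1/\sqrt{\beta}$.
At the other extreme, $\beta = 0$ gives $I_0(0) = 1$ and the uniform
density $1/(2\pi)$.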
% (equal to J_0(ix))
\medskip
A distribution that arises from Brownian \ind{diffusion}\index{Brownian motion}
around the \ind{circle} is the
wrapped Gaussian distribution,
% with wrap-around,
\beq
P(\theta \given \mu,\sigma) = \sum_{n=-\infty}^{\infty}
 \Normal( \theta ; ( \mu+2\pi n ), \sigma^2 ) \:\:\:\: \theta \in (0,2\pi) .
\eeq
% Not the same as (think about them on a log scale, for case of small s)...
% SECOND EDITION
%
% INSERT
% \input{tex/wrappedcauchy.tex}
%
% LOOK DO ME
\section{Distributions over probabilities}
\begin{center}
{\sf Beta distribution, Dirichlet distribution, entropic distribution}
\par
\end{center}
\noindent
The%
% {normalized vectors}
\marginfig{\small
%\figuremargin{
\begin{center}
\begin{tabular}{rr}
\dinkyfig{figs/beta.ps}%
\\[0.2in]
\dinkyfig{figs/betal.ps}%
\\[0.2in]
\end{tabular}
\end{center}
%}{
\caption[a]{Three beta distributions,
with $(u_1,u_2) = ( 0.3,1)$, $(1.3,1)$, and $(12,2)$.
The upper figure shows $P(p \given u_1,u_2)$ as a function of
$p$; the lower shows the corresponding density over
the {\dem{\ind{logit}}\/},
$$\ln \frac{p}{1-p}. $$
Notice how well-behaved the densities are
as a function of the logit.
}
\label{fig.beta}
}
{\dem\ind{beta distribution}} is a probability density\index{distribution!beta}
over a
variable $p$ that is a probability, $p \in (0,1)$:
\beq
P(p \given u_1,u_2) = \frac{1}{Z(u_1,u_2)} p^{u_1-1} (1-p)^{u_2-1} .
\eeq
The parameters $u_1,u_2$ may take any positive value.
The normalizing constant is the \ind{beta function},
% (CHECKED)
\beq
Z(u_1,u_2) = \frac{ \Gamma(u_1) \Gamma(u_2) }{ \Gamma(u_1 + u_2) } .
\label{eq.Zbeta}
\eeq
% !!!!!!!!!!!!!!!!!!!!!!!!!!!
Special cases include the uniform distribution -- $u_1\eq1, u_2\eq 1$;
the \ind{Jeffreys prior} -- $u_1\eq 0.5, u_2\eq 0.5$;
and the \ind{improper} \ind{Laplace prior} -- $u_1\eq 0, u_2\eq 0$.
If we transform the beta distribution to the corresponding density over
the \ind{logit} $l \equiv \ln \lfrac{p}{(1-p)}$, we find it is always a
pleasant bell-shaped density over $l$, while the density over $p$
may have singularities at $p=0$ and $p=1$ (\figref{fig.beta}).
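The transformation can be carried out explicitly.
Since $p = 1/(1+e^{-l})$, we have $\partial p / \partial l = p(1-p)$, and so
\beq
 P(l \given u_1,u_2) = P(p \given u_1,u_2) \left| \frac{\partial p}{\partial l} \right|
 = \frac{1}{Z(u_1,u_2)} \, p^{u_1} (1-p)^{u_2} ,
\eeq
with $p$ regarded as a function of $l$.
The minus-ones in the exponents have gone, and the density falls off
as $e^{u_1 l}$ for $l \rightarrow -\infty$ and as $e^{-u_2 l}$ for
$l \rightarrow \infty$, so it has no singularities for any positive $u_1,u_2$.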
\subsection{More dimensions}
The {\dem\ind{Dirichlet distribution}}\index{distribution!Dirichlet}
is a density over an $I$-dimensional vector $\bp$ whose $I$ components
are positive and sum to 1. The beta distribution is a special case of
a Dirichlet distribution with $I=2$.
The Dirichlet distribution
% for a probability vector $\bp$ with $\lI$ components
is parameterized by a measure $\bu$ (a vector with all
coefficients $u_i > 0$) which I
will write here as $\bu = \alpha \bm$, where $\bm$ is a normalized
measure over the $\lI$ components ($\sum m_i = 1$), and $\a$ is
positive:
\beq
P(\bp \given \a\bm) = \frac{1}{Z(\a \bm)}
\prod_{i=1}^{\lI} p_i^{\a m_i - 1}
\delta \left(\textstyle \sum_i p_i - 1 \right)
\equiv \Dir{\bp}{\a\bm}{\lI} .
\label{eq.dirichletdefn}
\eeq
The function $\delta (x)$ is the Dirac delta function, which
restricts the distribution to the \ind{simplex} such that $\bp$ is
normalized, \ie, $\sum_i p_i = 1$. The normalizing constant of the Dirichlet
distribution is:
\beq
Z(\a\bm)
%\int \d^{\lI} \! \bp \: \prod_{i=1}^{\lI}
% p_i^{\a m_i - 1} \delta \left(\textstyle \sum p_i - 1 \right)
% = \frac{ \prod_i \Gamma (\a m_i) }{ \Gamma(\sum \a m_i) }.$
= \prod_i \Gamma (\a m_i) \left/ \Gamma( \a ) \right. .
\label{lang.z}
\eeq
The vector $\bm$ is the mean of the probability distribution:
\beq
\int \Dir{\bp}{\a\bm}{\lI} \: \bp \: \d^{\lI} \! \bp = \bm .
\label{dirichlet_mean}
\eeq
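This can be verified from the normalizing constant (\ref{lang.z}), as a short
sketch: multiplying the density by $p_j$ raises the exponent of $p_j$ by one,
which is equivalent to replacing $\a m_j$ by $\a m_j + 1$, and hence $\a$ by
$\a + 1$, in $Z$. The mean of component $j$ is therefore
\beq
 \left< p_j \right> = \frac{ \Gamma(\a m_j + 1) }{ \Gamma(\a m_j) } \,
 \frac{ \Gamma(\a) }{ \Gamma(\a + 1) } = \frac{\a m_j}{\a} = m_j .
\eeq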
When working with a probability vector $\bp$, it is often
helpful to work in the `\index{softmax, softmin}{softmax} basis', in which,
for example, a three-dimensional probability $\bp=(p_1,p_2,p_3)$
is represented by three numbers $a_1,a_2,a_3$
satisfying $a_1+a_2+a_3=0$ and
\beq
p_i = \frac{1}{Z} \, e^{a_i}, \:\: \mbox{where $Z = \sum_i e^{a_i}$.}
\label{eq.softmaxdef}
\eeq
% Dirichlet distributions are most
% naturally dealt with in this basis
% \protect\cite{MacKay96:laplace}.
This nonlinear transformation is analogous
to the $\sigma \rightarrow \ln \sigma$
transformation for a scale variable
and the logit transformation for a single probability,
$p \rightarrow \ln \frac{p}{1-p}$.
In the {softmax} basis, the ugly minus-ones in the exponents
in the Dirichlet
distribution (\ref{eq.dirichletdefn}) disappear,
and the density is given by:
\beq
P(\ba \given \a\bm) \propto \frac{1}{Z(\a \bm)}
\prod_{i=1}^{\lI} p_i^{\a m_i}
\delta \left(\textstyle \sum_i a_i \right) .
\eeq
%
\begin{figure}
\figuremargin{\small
\begin{center}
\begin{tabular}{l}
\makebox[0in][l]{\hspace{0.3in}$\bu=(20,10,7)$}%
\makebox[0in][l]{\hspace{1.65in}$\bu=(0.2,1,2)$}%
\makebox[0in][l]{\hspace{2.8in}$\bu=(0.2,0.3,0.15)$}%
\\
{\hspace*{0in}\psfig{figure=zipf/dirichletdemo.ps,width=4in,angle=-90}}\\
{\hspace*{0in}\psfig{figure=zipf/dirichletdemol.ps,width=4in,angle=-90}}\\
\end{tabular}
\end{center}
}{
\caption[abb]{Three Dirichlet distributions over a three-dimensional probability
vector $(p_1,p_2,p_3)$. The upper figures show 1000 random draws from
each distribution, showing the values of $p_1$ and $p_2$ on the two axes. $p_3 =1-( p_1+p_2)$.
The triangle in the first figure
is the simplex of legal probability distributions.
The lower figures show the same points in the
`softmax' basis (\eqref{eq.softmaxdef}).
The two axes show $a_1$ and $a_2$. $a_3 = -a_1-a_2$.
}
\label{fig.dirichletdemo}
}
\end{figure}
\noindent
The role of the parameter $\a$ can be characterized in two ways. First,
$\a$ measures the sharpness of the distribution (\figref{fig.dirichletdemo});
it measures how different we expect typical samples $\bp$ from the
distribution to be from the mean $\bm$, just as the
precision $\tau=\dfrac{1}{\sigma^2}$ of a Gaussian
measures how far samples stray from its mean. A large value of $\a$
produces a distribution over $\bp$ that is sharply peaked around
$\bm$. The effect of $\a$ in higher-dimensional
situations can be visualized by drawing a typical sample
from the distribution $\Dir{\bp}{\a\bm}{\lI}$, with $\bm$ set to the uniform
vector $m_i = \dfrac{1}{I}$,
and making a \ind{Zipf plot}, that is, a ranked plot of the values of
the components $p_i$.
It is traditional to plot both $p_i$ (vertical axis) and the rank (horizontal
axis) on logarithmic scales so that power law relationships
appear as straight lines.
% Many natural languages have word frequencies which
% are well modelled by Zipf's law
Figure \ref{fig.zipf} shows these plots for a single sample from
ensembles with $\lI=100$ and $\lI=1000$ and with $\a$ from 0.1 to
1000. For large $\a$, the plot is shallow with many components having
similar values.
% s to the most probable component.
For small $\a$, typically one component $p_i$ receives an
overwhelming share of the probability, and of the small probability that
remains to be shared among the other components, another component
$p_{i'}$ receives a similarly large share. In the limit as $\a$ goes
to zero, the plot tends to an increasingly steep power law.
%\begin{figure}
\amarginfig{c}{\small
\begin{center}
\begin{tabular}{c}
$\lI=100$ \\
\hspace*{-0.15in}\psfig{figure=zipf/ps/all.100.ps,%
width=57mm}
\\
$\lI=1000$ \\
\hspace*{-0.15in}\psfig{figure=zipf/ps/all.1000.ps,%
width=57mm}
\\
\end{tabular}
\end{center}
%
\caption[abb]{Zipf plots for random samples from Dirichlet distributions
with various values of $\a = 0.1 \ldots 1000$. For each value
of $\lI=100$
or 1000
and each $\a$,
%
% RESTORE DETAILS somewhere
%$\lI$ samples from a standard gamma
% distribution were generated
% with shape parameter $\a/\lI$ and normalized to give a
one sample
$\bp$ from the Dirichlet distribution was generated.
The Zipf plot shows the probabilities $p_i$, ranked by magnitude,
versus their rank.
}
\label{fig.zipf}
}
%\end{figure}
Second, we can characterize the role of $\a$ in terms of the
predictive distribution that results when we observe samples from $\bp$ and
obtain counts $\bF = (F_1, F_2, \ldots, F_I)$ of the possible outcomes.
% The term $\a m_i$ plays the role of an effective initial count in
% bin $i$.
The value of $\a$ defines the number of samples from
$\bp$ that are required in order that the data dominate over the
prior in predictions.
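 To see why, here is a sketch that relies on the standard conjugacy
 of the Dirichlet distribution with counts (the symbol $N \equiv \sum_i F_i$
 is introduced here for convenience). Given the counts $\bF$, the posterior
 distribution of $\bp$ is again a Dirichlet distribution, with parameters
 $\a m_i + F_i$; the probability that the next outcome is $i$ is the mean of
 that posterior, which by (\ref{dirichlet_mean}) is
\beq
 P(i \given \bF, \a\bm) = \frac{F_i + \a m_i}{N + \a} .
\eeq
 So each $\a m_i$ acts like an initial count in bin $i$, and the observed counts
 dominate the prediction once $N \gg \a$.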
\exercisxC{3}{ex.Dadditive}{
The Dirichlet distribution satisfies a nice additivity property.
Imagine that a biased six-sided die has two red faces
and four blue faces. The die is rolled $N$ times and two Bayesians
examine the outcomes in order to infer the bias of the die and make
predictions.
One Bayesian has access to the red/blue colour outcomes only,
and he infers a two-component probability vector ($p_{\rm R}, p_{\rm B}$).
The other Bayesian has access to each full outcome: he can
see which of the six faces came up, and he infers
a six-component probability vector ($p_1, p_2, p_3,p_4,p_5,p_6$),
where $p_{\rm R} =p_1+ p_2$ and $p_{\rm B} = p_3 + p_4 +p_5+p_6 $.
Assuming that the second Bayesian
assigns a Dirichlet distribution to
($p_1, p_2, p_3,p_4,p_5,p_6$)
with \ind{hyperparameter}s
($u_1, u_2, u_3,u_4,u_5,u_6$),
show that, in order for the first Bayesian's inferences to be
consistent with those of the second Bayesian,
the first Bayesian's prior should be
a Dirichlet distribution
with hyperparameters
($(u_1 + u_2), (u_3+u_4+u_5+u_6)$).
{\sf Hint}: a brute-force approach is to compute the integral
$P(p_{\rm R}, p_{\rm B}) = \int \d^6 \bp \, P(\bp \given \bu) \, \delta(
p_{\rm R} - (p_1+ p_2) ) \, \delta (p_{\rm B} -( p_3 + p_4 +p_5+p_6 ))$.
A cheaper approach is to compute the predictive
distributions, given arbitrary data
($F_1, F_2, F_3,F_4,F_5,F_6$),
and find the condition for the two predictive distributions to
match for all data.
}
The {\dem\ind{entropic distribution}\/} for a
probability vector $\bp$ is sometimes used in the
`maximum entropy' image reconstruction
community:
\beq
P(\bp \given \a,\bm) = \frac{1}{Z(\a,\bm)}
\exp [ - \a D_{\rm KL}(\bp||\bm) ]
\, \delta \! \left(\textstyle \sum_i p_i - 1 \right) ,
\eeq
where $\bm$, the measure, is a positive vector, and $D_{\rm KL}(\bp||\bm) = \sum_i p_i \log (p_i/m_i)$.
\section*{Further reading}
See \cite{MacKay_Peto} for fun with
Dirichlets.
\section{Further exercises}
\exercisxC{2}{ex.gammainf}{
$N$ datapoints $\{ x_n \}$ are drawn from a
% quantity $x$ has a
gamma distribution
$ P( x \given s,c ) \:=\: \Gamma( x ; s , c )$
with unknown parameters $s$ and $c$.
What are the maximum likelihood parameters $s$ and $c$?
}
\dvips
%%\prechapter{About Chapter}
%%\input{tex/_pexact.tex}% not associated with exact.tex any more!
\chapter{Exact Marginalization}
% in Continuous Spaces}
\label{ch.exactmarg}
% WAS ::: \chapter{Intermediate Bayesian Stuff}
\label{ch.bayes.gaussian}
\label{ch.bayes.int}
How can we avoid the exponentially large cost of complete enumeration of all
hypotheses? Before we stoop to approximate methods, we explore
two approaches to exact marginalization: first, \ind{marginalization} over
continuous variables (sometimes known as
\ind{nuisance parameters}) by doing {\em integrals};
and second, summation over discrete variables by message-passing.
% In this chapter we run through some very simple
% but intimidating examples.
Exact marginalization over continuous parameters
is a \ind{macho} activity enjoyed by those who are fluent in
definite integration.
\fakesection{Gamma whinge}
% WAS _pexact.tex ::::::::::::::::::::::::::::::::::::::::::::::
This chapter
% is unfortunately found a little intimidating
% because it
uses gamma distributions; as
%, which are not really
% worth being scared of, and are not central to the chapter.
% As
was explained in the previous chapter,
gamma distributions are a lot like Gaussian distributions,
except that whereas the Gaussian goes from $-\infty$ to $\infty$,
gamma distributions go from 0 to $\infty$.
\section{Inferring the mean and variance of a Gaussian distribution}
We discuss again the one-dimensional
Gaussian distribution, parameterized by a mean $\mu$
and a standard deviation $\sigma$:
%
\beq
P(x \given \mu,\sigma)
% ,\H_{\rm Normal})
= \frac{1}{\sqrt{2 \pi} \sigma}
\exp \left( - \frac{ ( x-\mu )^2 }{2 \sigma^2 } \right)
\equiv {\rm Normal}(x;\mu,\sigma^2) .
\eeq
%
% Let us examine the inference of $\mu$ and $\sigma$
% given data points $x_n$, $n=1,\ldots, N$, assumed to be drawn independently
% from this distribution.
%
When inferring these parameters, we must specify their prior
distribution. The prior gives us the opportunity to include specific
knowledge that we have about $\mu$ and $\sigma$ (from independent
experiments, or on theoretical grounds, for example). If we have no
such knowledge, then we can construct an appropriate prior that
embodies our supposed ignorance.
In \secref{sec.gaussian.firsttime}, we assumed a uniform prior
over the range of parameters plotted.
If we wish to be able to perform exact
marginalizations, it may be
useful to consider {\em conjugate priors}; these are priors
whose functional form combines naturally with the likelihood
such that the inferences have a convenient
form.
\subsection{Conjugate priors for $\mu$ and $\sigma$\nonexaminable}
The \ind{conjugate prior} for a mean $\mu$ is a Gaussian:\index{Gaussian distribution!parameters}
we introduce two `\ind{hyperparameter}s', $\mu_0$ and $\sigma_{\mu}$,
which parameterize the prior on $\mu$, and write
$P(\mu \given \mu_0,\sigma_{\mu}) = \Normal(\mu;\mu_0,\sigma_{\mu}^2)$.
In the limit $\mu_0 \eq 0$, $\sigma_{\mu} \rightarrow \infty$, we obtain the {\em
noninformative prior\/} for a location parameter, the flat prior. This
is {\dem\ind{noninformative}} because it is {\em invariant\/} under the
natural reparameterization $\mu' = \mu+c$.
% \marginpar{\footnotesize I need to give a better explanation of `noninformative'.}
The prior $P(\mu) = {\rm const.}$
is also an {\em\ind{improper}\/} prior, that is, it is not normalizable.
The \ind{conjugate prior} for a standard deviation $\sigma$ is a
\ind{gamma
distribution}, which has two parameters $ b_{\b}$ and $c_{\b}$.
It is most convenient to define the prior
density of the
inverse variance
%\marginpar{\footnotesize{The inverse variance is sometimes
% called the {\dem\ind{precision}} parameter of the Gaussian.}}
(the {\dem\ind{precision}} parameter)
$\beta = 1/\sigma^2$:
\beq
P( \b ) = \Gamma( \b ; b_{\b} , c_{\b} ) =
\frac{1}{\Gamma(c_{\b})}
\frac{ \b^{c_{\b}-1} }
{ b_{\b}^{c_{\b}} }
\exp \left( - \frac{\b}{b_{\b}} \right)
, \ \ \ \ 0\leq \b < \infty .
\label{gamma.dist.again}
\eeq
This is a simple peaked distribution with mean $b_{\b}c_{\b}$ and
variance $b^2_{\b}c_{\b}$. In the limit $b_{\b}c_{\b} = 1, c_{\b}
\rightarrow 0$, we obtain the {noninformative prior} for a scale
parameter, the $1/\sigma$ prior. This is `noninformative' because it
is {invariant} under the reparameterization $\sigma' = c \sigma$. The
$1/\sigma$ prior is less strange-looking if we examine the resulting
density over $\ln \sigma$, or $\ln \beta$, which is flat.%
\marginpar{\small\raggedright{Reminder: when we change variables
from $\sigma$ to $l(\sigma)$, a one-to-one function of $\sigma$,
the probability density transforms from $P_{\sigma}(\sigma)$
to
$$
P_l(l) = P_{\sigma}(\sigma) \left| \frac{\partial \sigma}{\partial l} \right|
.
$$
Here, the \ind{Jacobian} is
$$
\left| \frac{\partial \sigma}{\partial \ln \sigma} \right| = \sigma
.
$$
% \eqref{eq.transformlog}.}}
}}
This is
the prior that expresses ignorance about $\sigma$ by saying `well, it
could be 10, or it could be
1, or it could be 0.1, \ldots' Scale variables such as $\sigma$ are
usually best represented in terms of their logarithm. Again, this
noninformative $1/\sigma$ prior is \ind{improper}.
%
In the following examples, I will use the improper noninformative priors
for $\mu$ and $\sigma$. Using improper priors is viewed as distasteful
in some circles, so let me excuse myself by saying it's for the sake of
readability; if I included proper priors, the calculations
could still be done but the key points would be obscured by the
flood of extra parameters.
\subsection{Maximum likelihood and marginalization:
$\sigma_{\ssN}$ and $\sigma_{\ssNM}$}
\label{sn}
The task of inferring the mean and \ind{standard deviation}
of a Gaussian distribution from $N$ samples is a familiar one, though
maybe not everyone understands the difference between the
\ind{$\sigma_{\ssN}$ and $\sigma_{\ssNM}$} buttons on their \ind{calculator}.
Let us recap the formulae, then derive them.
Given data $D = \{ x_n \}_{n=1}^{N}$, an `estimator' of $\mu$ is
% \newcommand{\barx}{\bar{x}}
\beq
\barx \equiv \textstyle {\sum_{n=1}^{N} x_n} / {N} ,
\eeq
and two estimators of $\sigma$ are:
\beq
\sigma_{\ssN} \equiv \sqrt{
\frac{\sum_{n=1}^{N} ( x_n - \barx )^2 }{N}
}
\: \mbox{ and } \:
\sigma_{\ssNM} \equiv \sqrt{
\frac{\sum_{n=1}^{N} ( x_n - \barx )^2 }{N-1}
} .
\eeq
%
There are two principal paradigms for statistics: \ind{sampling theory}
and Bayesian inference. In sampling theory (also known as `\ind{frequentist}'
or \ind{orthodox statistics}), one invents {\dem\ind{estimator}s} of quantities of
interest and then chooses between those estimators using some
criterion measuring their sampling properties; there is no clear
principle for deciding which criterion to use to measure the
performance of an estimator; nor, for most criteria, is there any
systematic procedure for the construction of optimal estimators.
In Bayesian inference, in contrast, once we have made
explicit all our assumptions about the model and the data, our inferences are
mechanical.
% stic.
Whatever question we wish to pose, the rules
of probability theory give a unique answer which consistently takes
into account all the given information. Human-designed
estimators and confidence intervals
have no role in Bayesian inference;
human input only enters into the important
tasks of designing the hypothesis space (that is, the specification
of the model and all its probability distributions),
and figuring out
how to do the computations that implement inference
in that space.
The answers to our questions are probability distributions over the
quantities of interest. We often find that the estimators of
sampling theory emerge automatically as modes or means
of these posterior distributions when we choose a simple hypothesis
space and turn the handle of Bayesian
inference.
In sampling theory, the estimators above can be motivated
as follows. $\barx$ is an unbiased estimator of $\mu$ which, out of all
the possible unbiased estimators of $\mu$, has smallest \ind{variance} (where this
variance is computed by averaging over an ensemble of imaginary experiments
in which the data samples are assumed to come from an unknown
\ind{Gaussian distribution}).\index{bias!in statistics}
The estimator $(\barx,\sigma_{\ssN})$ is the maximum likelihood estimator
for $(\mu,\sigma)$. The estimator $\sigma_{\ssN}$ is {\em\ind{biased}}, however:
the expectation of $\sigma_{\ssN}$, given $\sigma$, averaging over
many imagined experiments, is not $\sigma$.
\exercissxA{2}{ex.sigmanbias}{
Give an intuitive explanation why the estimator $\sigma_{\ssN}$ is {biased}.
}
This bias motivates the invention, in sampling theory, of
$\sigma_{\ssNM}$, which can be shown to be an unbiased estimator. Or to
be precise, it is $\sigma_{\ssNM}^2$ that is an \ind{unbiased estimator} of
$\sigma^2$.
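 The size of the bias can be stated explicitly. Writing $\langle \cdot \rangle$
 for the average over imagined experiments (a notation used only in this
 remark), the standard sampling-theory result is
\beq
 \left\langle \textstyle \sum_{n=1}^{N} ( x_n - \barx )^2 \right\rangle = (N-1) \, \sigma^2 ,
\eeq
 so that $\langle \sigma_{\ssN}^2 \rangle = \frac{N-1}{N} \sigma^2$ whereas
 $\langle \sigma_{\ssNM}^2 \rangle = \sigma^2$. For $N=5$, for example,
 $\sigma_{\ssN}^2$ underestimates $\sigma^2$ by 20\% on average.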
% copy of this stolen and included in enumerate.tex
% \renewcommand{\figs}{/home/mackay/book/figs} % while in bayes chapter
\begin{figure}
\figuremargin{\small%
\vspace{-0.56in}
\begin{center}
\begin{tabular}{l@{}l}
% \newcommand{\bookfigs}{/home/mackay/book/figs}
(a1)\hspace{-0.4in}\raisebox{-10mm}{\psfig{figure=\bookfigs/basic/new_surfaceplot.ps,angle=-90,width=3in}}
&
(a2)\hspace{-0.8in}\raisebox{-10mm}{\psfig{figure=\bookfigs/basic/new_contourplot.ps,angle=-90,width=3in}}
\\
%(b)\raisebox{-5mm}{\psfig{figure=\bookfigs/basic/new_posts.ps,angle=-90,width=2.3in}}
%&
(c)\raisebox{-5mm}{\psfig{figure=\bookfigs/basic/new_sigposts.ps,angle=-90,width=2.3in}}
\hspace*{-0.3in}
%\\
&
\hspace*{-0.3in}(d)\raisebox{-5mm}{\psfig{figure=\bookfigs/basic/new_sigmargb.ps,angle=-90,width=2.3in}}
\\
\end{tabular}
\end{center}
% \mbox{{\bf (a)} \psfig{figure=\bookfigs/basic/like_sig_mu.ps,%
% width=3 true in,height=2.53 true in,angle=-90,%
% bbllx=19.5cm,bblly=1.1cm,%3.9cm,%
% bburx=2.4cm,bbury=24.0cm}
% %}
% %\mbox{
% {\bf (b)} \psfig{figure=\bookfigs/basic/sigma_likes.ps,%
% width=3 true in,height=2.3 true in,angle=-90,%
% bbllx=19.5cm,bblly=1.9cm,%
% bburx=2.4cm,bbury=24cm} }
}{%
\caption[abbrev]{{The likelihood function for the parameters of
a Gaussian distribution},
repeated from \protect\figref{like.sig.mu1}.
{(a1, a2)} Surface plot and
contour plot of the log likelihood as a function of $\mu$
and $\sigma$. The data set of $N=5$ points had mean
$\bar{x}=1.0$ and $S = \sum(x-\bar{x})^2 = 1.0$. Notice that
the maximum is skew in $\sigma$. The two estimators of
standard deviation have values $\sigma_{\ssN}=0.45$ and
$\sigma_{\ssNM}=0.50$.
%{(b)} The posterior probability of $\mu$ for various values of
% $\sigma$.
%
{(c)} The posterior probability of $\sigma$ for
various fixed values of $\mu$ (shown as a density over $\ln \sigma$).
% The two graphs show: the likelihood as a function of
% $\sigma$, with $\mu$ fixed to $\barx$, \ie, $P(D \given \mu={\bar
% x},\sigma)$ [this is a vertical section through the peak in
% (a)]; and
{(d)} The
% `evidence' (marginal likelihood) for
posterior probability of $\sigma$, $P(\sigma \given D)$,
% $\sigma$, $P(D \given \sigma)$,
assuming a flat prior on $\mu$,
% (rescaled by an arbitrary constant). The evidence is
obtained
by projecting the probability mass in (a) onto the $\sigma$
axis. The maximum of
% $P(D \given \sigma)$ is
$P(\sigma \given D)$ is
at $\sigma_{\ssNM}$. By contrast, the maximum of
% $P(D \given \mu={\bar x},\sigma)$ is
$P(\sigma \given D,\mu\eq {\bar x})$ is
at $\sigma_{\ssN}$.
(Both probabilities are shown as densities
over $\ln \sigma$.) }
\label{like.sig.mu}
}%
\end{figure}
We now look at some Bayesian inferences for this problem, assuming
noninformative priors for $\mu$ and $\sigma$. The emphasis is thus not on
the priors, but rather on (a) the likelihood function,
and (b) the concept of marginalization.
The joint posterior probability of $\mu$ and $\sigma$ is
proportional to the likelihood function illustrated by a contour plot
in figure \ref{like.sig.mu}a.
The log likelihood is:
\beqan
\!\!\!\!\ln P(\{x_n\}_{n=1}^N \given \mu,\sigma)
&\!\!=\!\!& -N \ln (\sqrt{2 \pi} \sigma)
-\sum_n \linefrac{(x_n-\mu)^2}{(2 \sigma^2)} ,
\\
&\!\!=\!\!&
-N \ln (\sqrt{2 \pi} \sigma) - \linefrac{ [ N ( \mu - \barx )^2 + S ]}
{ (2 \sigma^2) },
\eeqan
where $S \equiv \sum_n (x_n-\barx)^2$. Given the Gaussian model,
the likelihood can be expressed
in terms of the two functions of the data $\barx$ and $S$, so these
two quantities are known as `sufficient statistics'.
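 The second line of the log likelihood follows by expanding each residual
 about the sample mean, $x_n - \mu = (x_n - \barx) + (\barx - \mu)$:
\beqan
 \sum_n (x_n-\mu)^2
 &=& \sum_n (x_n-\barx)^2 + 2 (\barx-\mu) \sum_n (x_n-\barx) + N (\barx-\mu)^2
\\
 &=& S + N ( \mu - \barx )^2 ,
\eeqan
 where the cross term vanishes because $\sum_n (x_n-\barx) = 0$.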
The posterior probability of $\mu$ and $\sigma$ is, using the improper
priors:
\beqan
P( \mu , \sigma \given \{x_n\}_{n=1}^N ) &=&
\frac{ P(\{x_n\}_{n=1}^N \given \mu,\sigma) P( \mu, \sigma ) }
{ P ( \{x_n\}_{n=1}^N ) }
\label{joint.post1}
\\
&=&
\frac{
\smallfrac{1}{(2 \pi \sigma^2)^{N/2}} \exp\left( -
\smallfrac{N ( \mu - \barx )^2 + S }
{ 2 \sigma^2 }
\right)
\frac{1}{\sigma_{\mu}}
% \frac{1}{(2 \pi \sigma_{\mu}^2)^{1/2}}
% \exp\left( - \frac{1}{2}
% \mu^2 / ( 2 \sigma_{\mu}^2 ) \right)
% \frac{1}{\Gamma(c_{\b})}
% \frac{ \b^{c_{\b}-1} }
% { b_{\b}^{c_{\b}} }
% \exp \left( - \frac{\b}{b_{\b}} \right)
\frac{1}{\sigma}
}
% \right/
{
P ( \{x_n\}_{n=1}^N )
} .
\label{joint.post2}
\eeqan
This function describes the answer to the question, `given the data,
and the noninformative priors, what might $\mu$ and $\sigma$ be?'
It may be of interest to find the parameter values that maximize
the posterior probability, though it should be emphasized that posterior
probability maxima have no fundamental status in Bayesian inference, since
their location depends on the choice of basis. Here we choose
the basis $(\mu , \ln \sigma)$, in which our prior is flat,
so that the posterior probability maximum coincides with the
maximum of the likelihood.
%\exercisxB{2}{ex.MLgaussian}{
% Differentiate the log likelihood with respect to $\mu$ and $\ln \sigma$
% and show that the maximum likelihood solution is:
%% \beq
%$
% \{\mu,\sigma\}_{\ML} = \left\{ \bar{x},\sigma_{\ssN}
% = \sqrt{ \linefrac{S}{N} } \right\} .
%$
%% \eeq
%}
As we saw in \exerciseref{ex.MLgaussian},
the maximum likelihood
solution for $\mu$ and $\ln \sigma$ is
$
\{\mu,\sigma\}_{\ML} = \left\{ \bar{x},\sigma_{\ssN}
= \sqrt{ \linefrac{S}{N} } \right\} .
$
There is more to the posterior distribution than just its mode. As
can be seen in figure \ref{like.sig.mu}a, the likelihood has a skew
peak. As we increase $\sigma$, the width of the conditional
distribution of $\mu$ increases (\figref{like.sig.mu1}b). And if we fix $\mu$ to a sequence
of values moving away from the sample mean $\barx$, we obtain a
sequence of conditional distributions over $\sigma$ whose maxima move
to increasing values of $\sigma$ (\figref{like.sig.mu}c).
% The next question we might ask is `given the data,
% and the noninformative prior on $\mu$, and assuming a particular
% value of $\sigma$, what might $\mu$ be?'
The posterior probability of $\mu$ given $\sigma$ is
\beqan
P( \mu \given \{x_n\}_{n=1}^N,\sigma ) &=&
\frac{ P(\{x_n\}_{n=1}^N \given \mu,\sigma) P( \mu ) }
{ P(\{x_n\}_{n=1}^N \given \sigma ) }
\label{post.mu}
\\
&\propto&
\exp ( -N(\mu - \barx )^2/(2 \sigma^2) )
\\
&=&
\Normal( \mu ; \barx , \sigma^2/N ) .
\eeqan
We note
% This posterior distribution shows
the familiar
$\sigma/\sqrt{N}$ scaling of the error bars on $\mu$.
% posterior uncertainty of the parameter $\mu$.
Let us now ask the question `given the data,
and the noninformative priors, what might $\sigma$ be?' This question
differs from the first one we asked in that we are now not interested in
$\mu$. This parameter must therefore be {\em marginalized\/} over.
The posterior probability of $\sigma$ is:
\beq
P( \sigma \given \{x_n\}_{n=1}^N ) =
\frac{ P(\{x_n\}_{n=1}^N \given \sigma ) P( \sigma ) }
{ P(\{x_n\}_{n=1}^N ) } .
\label{eq.truepostsigma}
\eeq
The data-dependent term $P(\{x_n\}_{n=1}^N \given \sigma )$ appeared
earlier as the normalizing constant in equation (\ref{post.mu}); one
name for this quantity is the `\ind{evidence}', or \ind{marginal likelihood},
for $\sigma$. We obtain the evidence for $\sigma$ by integrating out
$\mu$; a noninformative prior $P(\mu)=\mbox{constant}$ is
assumed; we call this constant
$\linefrac{1}{\sigma_{\mu}}$, so that we can think of the prior
as a top-hat prior of width $\sigma_{\mu}$.
% , with $\sigma_{\mu} \to \infty$:
% \beqa
The Gaussian integral,
$ P(\{x_n\}_{n=1}^N \given \sigma) =
\int P(\{x_n\}_{n=1}^N \given \mu,\sigma)P(\mu) \: d \mu ,
$
% \eeqa
% \\
% & = & P(\{x_n\}_{n=1}^N \given \mu_{\MP},\sigma)P(\mu_{\MP})
% \sqrt{2 \pi}\frac{\sigma}{\sqrt{N}} .
% \eeqa
% This Gaussian integral yields:
% e log evidence is therefore:
yields:
\beq
\ln P(\{x_n\}_{n=1}^N \given \sigma)=-N \ln (\sqrt{2 \pi} \sigma)
- \frac{S}{2 \sigma^2} + \ln \frac{\sqrt{2 \pi}
\sigma / \sqrt{N}}{ \sigma_{\mu} } .
\label{eq.sigmaevidence}
\eeq
The first two terms are the best fit log likelihood (\ie, the log
likelihood with $\mu = \bar{x}$). The last term is the log of the
{\dem\ind{Occam factor}\/} which penalizes smaller values of $\sigma$. (We
will discuss Occam factors more in \chref{ch.occam}.) When we
differentiate the log evidence with respect to $\ln \sigma$, to find
the most probable $\sigma$, the additional volume factor
($\linefrac{\sigma}{\sqrt{N}} $) shifts the maximum from $\sigma_{\ssN}$
to
\beq
%\sigma_{\MP} =
\sigma_{\ssNM} = \sqrt{ \linefrac{S}{(N-1)} } .
\eeq
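 Explicitly, writing $l \equiv \ln \sigma$ (a symbol used only in this remark),
 the $\sigma$-dependent part of the log evidence (\ref{eq.sigmaevidence})
 is $-(N-1) \, l - \frac{S}{2} e^{-2l}$, so
\beq
 \frac{\partial}{\partial l} \ln P(\{x_n\}_{n=1}^N \given \sigma)
 = -(N-1) + \frac{S}{\sigma^2} ,
\eeq
 which vanishes at $\sigma^2 = S/(N-1)$. Without the Occam factor the
 derivative would be $-N + S/\sigma^2$, which vanishes at
 $\sigma^2 = S/N$. For the data set of figure \ref{like.sig.mu}, with $N=5$ and
 $\sum_n (x_n - \barx)^2 = 1.0$, these maxima are at $\sqrt{0.25} = 0.50$ and
 $\sqrt{0.2} \simeq 0.45$ respectively.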
Intuitively, the denominator \mbox{$(N\!-\!1)$} counts the number of
noise measurements contained in the quantity $S = \sum_n
(x_n\!-\!\bar{x})^2$. The sum contains $N$ residuals squared, but
there are only \mbox{$(N\!-\!1)$} effective noise measurements\index{degrees of freedom}
because the determination of one parameter $\mu$ from the data causes
one dimension of noise to be gobbled up in unavoidable \ind{overfitting}.
In the terminology of classical statistics,
the Bayesian's best guess for $\sigma$ sets\index{$\chi^2$}\index{chi-squared}
$\chi^2$ (the measure of deviance defined by $\chi^2 \equiv
\sum_n (x_n - \hat{\mu})^2/{\hat\sigma}^2$) equal to the number of degrees
of freedom, $N-1$.
%
% HELP - put more clarification here.
%
Figure \ref{like.sig.mu}d shows the posterior probability of $\sigma$,
which is proportional to the marginal likelihood.
% as a function of $\sigma$.
This may be contrasted with
the posterior probability of
% likelihood as a function of
$\sigma$ with $\mu$ fixed to its most probable value, $\barx\eq 1$, which
is shown in \figref{like.sig.mu}c and d.
The final inference we might wish to make is `given the data, what is $\mu$?'
\exercisxB{3}{ex.studentint}{
Marginalize over $\sigma$ and obtain the posterior
marginal distribution of $\mu$, which is a \ind{Student-$t$ distribution}:
\beq
P( \mu \given D ) \propto 1 / \left( N ( \mu - \barx )^2 + S \right)^{N/2} .
\eeq
}
%
% in error, this used to say (N-1)/2
% 21/3/96
%
% see ~/book/figs/basic/README
% stole exercises from here and put them in bayes_int_exs.tex
\section*{Further reading}
A bible of exact marginalization is \quotecite{Bretthorst} book
on {B}ayesian spectrum analysis
and parameter estimation.
\section{Exercises}
\exercisxB{3}{ex.manyparamsb}{
[This exercise requires macho integration capabilities.]
Give a Bayesian solution to \exerciseref{ex.manyparams},
where seven scientists of varying
capabilities have measured $\mu$ with
personal noise levels $\sigma_n$,
\marginfig{
\begin{center}
\mbox{\psfig{figure=figs/manyparams.ps,width=1.75in,angle=-90}}
\end{center}
%\caption[a]{Seven measurements $\{x_n\}$ of a parameter $\mu$
% by seven scientists each having his own
% noise-level $\sigma_n$.}
}
and we are interested in inferring $\mu$.
% , and perhaps $\{ \sigma_n \}$ too.
Let the prior on each $\sigma_n$ be a broad prior, for example a
gamma distribution with parameters $(s,c)=(10,0.1)$.
Find the posterior distribution of $\mu$.
Plot it, and explore its properties for a variety of
data sets such as the one given, and the data set $\{ x_n \} = \{ 13.01 , 7.39 \}$.
[{\sf Hint}: first find the posterior distribution of $\sigma_n$ given
$\mu$ and $x_n$, $P(\sigma_n \given x_n,\mu)$. Note that the normalizing constant
for this inference is $P(x_n \given \mu)$. Marginalize over
$\sigma_n$ to find this normalizing constant,
then use \Bayes\ theorem a second time to
find $P(\mu \given \{ x_n \} )$.]
}
% \section{Solutions to Chapter \protect\ref{ch.bayes.int}'s exercises} %
\section{Solutions}
%
\soln{ex.sigmanbias}{
1.\ The data points are distributed with mean squared deviation $\sigma^2$
about the true mean.
2.\
The sample mean is unlikely to exactly equal the true mean.
3.\ The sample
mean is the value of $\mu$ that minimizes the sum squared deviation
of the data points from $\mu$.
Any other value of $\mu$ (in particular, the true value of $\mu$)
will have a larger value of the sum-squared deviation than $\mu = \bar{x}$.
So the expected mean squared deviation from the
sample mean is necessarily smaller than the
mean squared deviation $\sigma^2$
about the true mean.
}
\dvips
% \dvipsb{solutions bayes intermediate}
%
%\prechapter{About Chapter}
%\mysetcounter{page}{69} % set to preceding page
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% WAS \chapter{Exact Inference Methods}
\chapter{Exact Marginalization in Trellises}
\label{ch.exact}\label{ch.minsum2}
%
% exact inference methods
%
% contains lots on trellises
%
% solutions are in _sexact.tex
%
% need state diagram picture s,t,u
%
In this chapter we will discuss a few
exact methods that are used in probabilistic
modelling.
% We will do this with the aid of two examples.
As an example we will discuss
the task of decoding a linear error-correcting
code.
% The second is the burglar-alarm problem of \exburglar.
% In both examples w
We will see that
inferences can be conducted most efficiently by
{\dem\index{message passing}{message-passing} algorithms}, which take
advantage of the graphical structure of the problem
to avoid unnecessary duplication of computations (see
\chapterref{ch.message}).
% This chapter is a possible location for the first introduction
% of Markov chains, and/or hidden Markov models.
\section{Decoding problems}
\label{sec.decoding.problems}
%
% these are defined first in _linear.tex
%
A codeword $\bt$ is selected from a linear $(N,K)$
code $\C$, and it is transmitted
over a noisy channel; the received signal is
$\by$.
In this chapter we will assume that the channel is a memoryless
channel such as a Gaussian channel.
Given an assumed channel model $P(\by \given \bt)$, there are
two decoding problems.
\begin{description}
\item[The codeword decoding problem] is the task of\index{decoder!codeword}
inferring which codeword $\bt$ was transmitted given the
received signal.
\item[The bitwise decoding problem] is the task of inferring\index{decoder!bitwise}
for each transmitted bit $t_n$ how likely it is that that
bit was a one rather than a zero.
\end{description}
As a concrete example, take the $(7,4)$ Hamming code.
In \chref{ch.one}, we discussed
the codeword decoding problem for that code, assuming
a binary symmetric channel. We didn't discuss the bitwise decoding problem
and we didn't discuss how to handle more general channel models
such as a Gaussian channel.
\subsection{Solving the codeword decoding problem}
By \Bayes\ theorem, the posterior probability
of the codeword $\bt$ is\index{Bayes' theorem}
\beq
P( \bt \given \by ) = \frac{ P(\by \given \bt) P(\bt) }{ P(\by )} .
\label{eq.decode}
\eeq
\begin{description}
\item[Likelihood function\puncspace]
The first factor in the numerator,
$P(\by \given \bt)$, is the {\dbf\ind{likelihood}} of the codeword,
which, for any memoryless channel, is a separable function,
\beq
P(\by \given \bt) = \prod_{n=1}^N P(y_n \given t_n) .
\eeq
For example, if the channel is a Gaussian channel with transmissions
$\pm x$ and additive noise of standard deviation $\sigma$,
then the probability density of the received signal $y_n$ in the two
cases $t_n=0,1$ is
\beqan
P(y_n \given t_n \eq 1) &=& \frac{1}{\sqrt{2 \pi \sigma^2}}
\exp \left( -\frac{(y_n - x )^2}{2 \sigma^2} \right) \\
P(y_n \given t_n \eq 0) &=& \frac{1}{\sqrt{2 \pi \sigma^2}}
\exp \left( -\frac{(y_n + x )^2}{2 \sigma^2} \right) .
\eeqan
From the point of view of decoding, all that matters is the {\dbf likelihood
ratio}, which for the case of the Gaussian channel is
\beq
\frac{P(y_n \given t_n \eq 1)}{P(y_n \given t_n \eq 0)} =
\exp \left( \frac{2 x y_n }{ \sigma^2} \right) .
\eeq
\end{description}
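\noindent
 To check the likelihood ratio for the Gaussian channel, note that the
 prefactors $1/\sqrt{2 \pi \sigma^2}$ cancel, and the two exponents differ by
\beq
 \frac{(y_n + x)^2 - (y_n - x)^2}{2 \sigma^2}
 = \frac{4 x y_n}{2 \sigma^2} = \frac{2 x y_n}{\sigma^2} .
\eeq
 The received value thus enters only through the log likelihood ratio
 $2 x y_n / \sigma^2$, which is linear in $y_n$. For instance (illustrative
 numbers only), with $x \eq 1$, $\sigma \eq 1$ and $y_n \eq 0.5$, the
 likelihood ratio is $e^{1} \simeq 2.7$ in favour of $t_n \eq 1$.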
\exercisxA{2}{ex.gc.bsc}{
Show that from the point of view of decoding, a Gaussian channel
is equivalent to a time-varying binary symmetric channel with a known
noise level $f_n$ which depends on $n$.
}
\begin{description}
\item[Prior\puncspace]
The second factor in the numerator is the {\dbf prior} probability of
the codeword, $P(\bt)$, which is usually assumed to be uniform over
all valid codewords.
The denominator in (\ref{eq.decode}) is the normalizing constant
\beq
P(\by ) = \sum_{\bt} { P(\by \given \bt) P(\bt) } .
\eeq
\end{description}
The complete solution to the codeword decoding problem is
a list of all codewords and their probabilities as given by equation
(\ref{eq.decode}). Since the number of codewords
in a linear code, $2^K$, is often very large, and since we are not
interested in knowing the detailed probabilities of all the codewords,
we often restrict attention to a simplified version of the codeword
decoding problem.
\begin{description}
\item[The \index{maximum {\em a posteriori}}{MAP} codeword decoding problem] is the task of
identifying {\em the most probable codeword\/} $\bt$ given the
received signal.
If the prior probability over codewords is uniform then this
task is identical to the problem of {\dbf maximum likelihood
decoding}, that is, identifying the codeword that maximizes
$P(\by \given \bt )$.
\end{description}
{\sf Example:} In \chref{ch.one}, for
% the case of
the $(7,4)$ Hamming code and a binary symmetric channel
we discussed a method for
deducing the {most probable codeword} from the syndrome of
the received signal, thus solving the {MAP} codeword decoding problem
for that case. We would like a more general solution.
The MAP codeword decoding problem can be solved in exponential time
(of order $2^K$) by searching through all codewords for the one that
maximizes $P(\by \given \bt) P(\bt)$. But we are interested in methods that
are more efficient than this. In section \ref{sec.viterbi}, we will
discuss an exact method known
as the {\dbf\ind{min--sum
algorithm}}
which may be able to solve the codeword
decoding problem more efficiently; how much more efficiently
depends on the properties of the code.
% {\em (put this somewhere else?)}
% However,
It is worth emphasizing that
MAP codeword decoding for a {\em general\/} linear
code is known to be \ind{NP-complete} (which means in layman's terms
that MAP codeword decoding has a complexity that
% can only be done in general
% in a time that
scales exponentially with the blocklength, unless
there is a revolution in computer science).
So restricting attention to the \ind{MAP decoding} problem hasn't\index{maximum {\em a posteriori\/} decoder}
necessarily
made the task much less challenging; it simply makes the answer briefer to
report.
\subsection{Solving the bitwise decoding problem}
Formally, the exact solution of the bitwise decoding problem
is obtained from \eqref{eq.decode} by {\em marginalizing\/}
over the other bits.
\beq
P( t_n \given \by ) = \sum_{ \{ t_{n'} : \, n' \neq n \} }
{ P(\bt \given \by)} .
\label{eq.bitwise}
\eeq
We can also write this marginal with the aid of a truth function
$\truth[S]$ that is one if
the proposition $S$ is true and zero otherwise.
\beqan
P( t_n\eq 1 \given \by ) &=& \sum_{\bt}
{ P(\bt \given \by)} \,\truth[ t_n\eq 1 ] \\
P( t_n\eq 0 \given \by ) &=& \sum_{\bt}
{ P(\bt \given \by)} \,\truth[ t_n\eq 0 ] .
\label{eq.bitwise1}
\eeqan
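For a concrete, if trivial, instance of this notation, take the repetition code $R_3$,
whose only codewords are {\tt 000} and {\tt 111}. Every term with $t_n \eq 1$
in the sum belongs to the single codeword {\tt 111}, so all three bits share the
same marginal:
\[
P( t_n\eq 1 \given \by ) = P( \bt \eq {\tt 111} \given \by ) , \:\:\:
P( t_n\eq 0 \given \by ) = P( \bt \eq {\tt 000} \given \by ) , \:\:\:
n = 1,2,3 .
\]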
% In case this notation is hard to understand, here is an explicit
% example using the bitwise decoding of a $(7,4)$ Hamming code.
% The probability that $t_2=1$ is
%\beq
% P( t_n\eq 1 \given \by ) = \sum_{ \{ t_1,t_3,t_4,t_5,t_6,t_7 \} }
% { P(\bt \given \by)}
%\label{eq.bitwise2}
%\eeq
%
Computing these marginal probabilities by an explicit sum over all
codewords
$\bt$ takes
exponential time. But,
for certain codes, the bitwise decoding problem can be solved
much more efficiently using the {\dbf \ind{forward--backward
algorithm}}. We will describe
this algorithm, which is an example of the
{\dbf\ind{sum--product algorithm}}, in a moment. Both the min--sum algorithm and the
sum--product algorithm have widespread importance, and have been
invented many times in many fields.
\section{Codes and trellises\nonexaminable}
In Chapters \chone\ and \chseven, we represented linear $(N,K)$
codes in terms of their generator matrices and their parity-check matrices.
In the case of a {\dbf systematic} block code, the first
$K$ transmitted bits in each block of size $N$ are the source
bits, and the remaining $M=N-K$ bits are the parity-check
bits. This means that the generator matrix of the code can be written
\beq
\bG^{\T} = \left[ \begin{array}{c} \bI_K \\ \bP \end{array} \right] ,
\eeq
and the parity-check matrix can be written
\beq
\bH = \left[ \begin{array}{cc} \bP & \bI_M \end{array} \right] ,
\eeq
where $\bP$ is an $M \times K$ matrix.
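As a check on this notation, consider the systematic $(7,4)$ Hamming code,
for which $K=4$ and $M=3$. Reading $\bP$ off the last three entries of each row
of the systematic generator matrix written out in section \ref{sec.more.on.trellis}
(each row of $\bG$ is a column of $\bG^{\T}$), we obtain
\[
\bP = \left[ \begin{array}{cccc}
1&1&1&0\\
0&1&1&1\\
1&0&1&1
\end{array} \right] ,
\:\:\:
\bH = \left[ \begin{array}{ccccccc}
1&1&1&0&1&0&0\\
0&1&1&1&0&1&0\\
1&0&1&1&0&0&1
\end{array} \right] .
\]
For any source vector ${\bf s}$ (our label for the $K$ source bits), the codeword
$\bt = \bG^{\T} {\bf s}$ then satisfies $\bH \bt = \bP {\bf s} + \bP {\bf s} = 0$
in modulo-2 arithmetic.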
In this section we will study another representation
of a linear code called a trellis. The codes that these trellises
represent will not in general be systematic codes, but
they can be mapped onto systematic codes
if desired by a reordering
of the bits in a block.
%\begin{figure}
\marginfig{%
\footnotesize
\begin{tabular}{*{1}{l@{\hspace{-0.5in}}l}}
\raisebox{0.5in}{(a)}& \hspace*{0.42in}\mbox{\psfig{figure=trellis/R3/ps.ps,angle=-90,width=1.3in}} \\
& \multicolumn{1}{c}{\footnotesize Repetition code $R_3$} \\[-0.12in]
\raisebox{0.5in}{(b)}& \hspace*{0.42in}\mbox{\psfig{figure=trellis/P3/ps.ps,angle=-90,width=1.3in}} \\
& \multicolumn{1}{c}{\footnotesize Simple parity code $P_3$ } \\[-0.12in]
\raisebox{0.85in}{(c)}& \hspace*{-0.24in}\mbox{\psfig{figure=trellis/H74s/ps.ps,angle=-90,width=2.63in}} \\
& \multicolumn{1}{c}{\footnotesize $(7,4)$ Hamming code} \\
\end{tabular}
%}{%
\caption[a]{Examples of trellises.
% \\ (a) Repetition code R3. \\
% (b) Simple parity code P3. \\ (c) $(7,4)$ Hamming code.
Each edge in a trellis is labelled by a zero (shown by a square)
or a one (shown by a cross).}
\label{fig.trellises}
\label{fig.trellis}
}%
%\end{figure}
\subsection{Definition of a trellis}
Our definition
% of a trellis
will be quite narrow. For a more
comprehensive
view of trellises, the reader should consult \citeasnoun{Kschischang_}.
\begin{description}
\item[A trellis]
is a {\dem graph\/} consisting of {\dem nodes\/} (also known as states or vertices)
and {\dem edges}. The nodes
are grouped into vertical slices called {\dem times}, and the times
% \marginpar{\footnotesize{Warning: terminology has recently been altered here. Look for ``state'' needing to be changed to ``time''.}}
% states
% \marginpar{\footnotesize{I need to reconsider this terminology:
% I would like to be able to talk about `a four-state trellis'
% and `the state as a function of time'; this usage conflicts with
% the idea that the encoder passes through an ordered sequence of `states'.}}
are
ordered such that each edge connects a node in one time
% state
to a node
in a neighbouring time.
% state
Every edge is labelled with a {\dem symbol}.
The leftmost and rightmost times each contain only one node.
Apart from these two extreme nodes, all nodes in the trellis have at least
one edge connecting leftwards and at least one connecting rightwards.
\end{description}
A trellis with $N\!+\!1$ times
% states
defines a code of blocklength $N$
as follows: a codeword
is obtained by taking a path that crosses the trellis from left to
right and reading out the symbols on the edges that are traversed.
Each valid path through the trellis defines a codeword.
We will number the leftmost time `time 0' and the rightmost
`time $N$'. We will number the leftmost state `state 0' and the rightmost
`state $I$', so that $I\!+\!1$ is the total number of
states (vertices) in the trellis. The $n$th bit of the codeword
is emitted as we move from time
% state
$n\!-\!1$ to time
% state
$n$.
The {\dem width\/} of the trellis at a given time
% state
is the number of
nodes in that time.
% state.
The {\dem maximal width\/}
of a trellis is the largest width attained at any time.
A trellis is called a {\dem linear trellis\/}
if the code it defines is a
linear code. We will solely be concerned with linear trellises
from now on,
as nonlinear trellises are much more complex beasts.
% \cite{Kschischang_}.
For brevity, we will only discuss binary trellises, that is,
trellises whose edges are labelled with zeroes and ones. It is
not hard to generalize the methods that follow to $q$-ary trellises.
Figures \ref{fig.trellises}(a--c) show the trellises corresponding to
the repetition code $R_3$ which has $(N,K)=(3,1)$; the
parity code $P_3$ with $(N,K) = (3,2)$; and
the $(7,4)$ Hamming code.
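To make the path-reading rule concrete, here is a sketch for $P_3$, under the
assumption (ours, for illustration) that each node records the parity of the
bits emitted so far. The widths at times 0, 1, 2, 3 are then 1, 2, 2, 1; the
first two bits may be chosen freely, and the third is forced by the requirement
that the path end at the single even-parity node. The four left-to-right paths
read out
\[
{\tt 000}, \:\: {\tt 011}, \:\: {\tt 101}, \:\: {\tt 110} ,
\]
which are the $2^K = 4$ codewords of $P_3$. For $R_3$ there are only two
paths, reading {\tt 000} and {\tt 111}.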
\exercisxB{2}{ex.trellish74}{
Confirm that the sixteen codewords listed in \tabref{fig.h74}
are generated by the trellis shown in \figref{fig.trellises}c.}
\subsection{Observations about linear trellises}
For any linear code the {\dem minimal trellis\/} is the one
that has the smallest number of nodes.
%
% CHECK: is reordering of bits permitted?
%
% vertices.
In a minimal trellis, each node has at most two edges entering it
and at most two edges leaving it. All nodes in a time
% state
have the same
left degree as each other and they have the same right
degree as each other. The width is always a power of two.
A minimal trellis for a linear $(N,K)$ code cannot have a width greater
than $2^K$ since every node has at least one valid codeword through it,
and there are only $2^K$ codewords. Furthermore, if we define $M=N-K$,
the minimal trellis's width is nowhere greater than $2^M$.
This will be proved in section \ref{sec.two.to.M.trellis}.
Notice that for the linear trellises in \figref{fig.trellis}, all of
which are minimal trellises, $K$ is the number of times a binary
branch point is encountered as the trellis is traversed from left to
right or from right to left.
We will discuss the construction of trellises more in section
\ref{sec.more.on.trellis}.
% where we discuss how to make trellises from generator matrices.
But we now know enough to
discuss the decoding problem.
\section{Solving the decoding problems on a trellis\nonexaminable}
We can view the trellis of a linear code
as giving a causal description of the probabilistic
process that gives rise to a codeword, with time flowing from
left to right.
% At each timestep we move one state to the right.
Each time a divergence
is encountered, a random source (the source of information
bits for communication) determines which way we go.
% Note this is just the same as saying that the codeword is generated
% by a hidden Markov model with a time-varying transition probability
% matrix.
At the receiving end, we receive a noisy version of the
sequence of edge-labels, and wish
to infer which path was taken, or to be precise, (a) we want to
identify the most probable path in order to solve the
codeword decoding problem; and (b) we want to find the probability that
the transmitted symbol at time $n$ was a zero or a one,
to solve the bitwise decoding problem.
\Exampl{ex.trellis.h74}{
Consider the case of
a single transmission from the Hamming $(7,4)$ trellis shown
in \figref{fig.trellis}c.
\begin{figure}[htbp]
\figuremargin{%
\begin{center}
\begin{tabular}{clll} \toprule
$\bt$ & \multicolumn{1}{c}{Likelihood } & \multicolumn{2}{c}{Posterior probability} \\ \midrule
%
\tt 0000000 & 0.0275562 & 0.25 & \raisebox{2mm}{\framebox[0.246in]{}} \\
\tt 0001011 & 0.0001458 & 0.0013 & \raisebox{2mm}{\framebox[0.001in]{}} \\
\tt 0010111 & 0.0013122 & 0.012 & \raisebox{2mm}{\framebox[0.012in]{}} \\
\tt 0011100 & 0.0030618 & 0.027 & \raisebox{2mm}{\framebox[0.027in]{}} \\
\tt 0100110 & 0.0002268 & 0.0020 & \raisebox{2mm}{\framebox[0.002in]{}} \\
\tt 0101101 & 0.0000972 & 0.0009 & \raisebox{2mm}{\framebox[0.001in]{}} \\
\tt 0110001 & 0.0708588 & 0.63 & \raisebox{2mm}{\framebox[0.632in]{}} \\
\tt 0111010 & 0.0020412 & 0.018 & \raisebox{2mm}{\framebox[0.018in]{}} \\
\tt 1000101 & 0.0001458 & 0.0013 & \raisebox{2mm}{\framebox[0.001in]{}} \\
\tt 1001110 & 0.0000042 & 0.0000 & \raisebox{2mm}{\framebox[0.000in]{}} \\
\tt 1010010 & 0.0030618 & 0.027 & \raisebox{2mm}{\framebox[0.027in]{}} \\
\tt 1011001 & 0.0013122 & 0.012 & \raisebox{2mm}{\framebox[0.012in]{}} \\
\tt 1100011 & 0.0000972 & 0.0009 & \raisebox{2mm}{\framebox[0.001in]{}} \\
\tt 1101000 & 0.0002268 & 0.0020 & \raisebox{2mm}{\framebox[0.002in]{}} \\
\tt 1110100 & 0.0020412 & 0.018 & \raisebox{2mm}{\framebox[0.018in]{}} \\
\tt 1111111 & 0.0000108 & 0.0001 & \raisebox{2mm}{\framebox[0.000in]{}} \\ \bottomrule
\end{tabular}
\end{center}
}{%
\caption[a]{Posterior probabilities over the sixteen codewords
when the received vector $\by$ has normalized
likelihoods $(0.1, 0.4, 0.9, 0.1, 0.1, 0.1,
0.3)$.}
\label{fig.posteriorH74}
}%
\end{figure}
Let the normalized likelihoods be: $(0.1, 0.4, 0.9, 0.1, 0.1, 0.1,
0.3)$. That is, the ratios of the likelihoods are
\beq
\frac{ P(y_1 \given t_1 \eq 1)}{ P(y_1 \given t_1 \eq 0)} = \frac{0.1}{0.9} ,
\:\:\:
\frac{ P(y_2 \given t_2 \eq 1)}{ P(y_2 \given t_2 \eq 0)} = \frac{0.4}{0.6} ,
\:\:\:
\mbox{etc.}
\eeq
How should this received signal be decoded?
\begin{enumerate}
\item If we threshold the likelihoods at 0.5 to turn the
signal into a binary received vector, we have $\br = (0,0,1,0,0,0,0)$,
which decodes, using the decoder for the binary
symmetric channel (\chapterref{ch1}), into $\hat{\bt} = (0,0,0,0,0,0,0)$.
This is not the optimal decoding procedure.
Optimal inferences are always obtained by using \Bayes\ theorem.
\item
We can find the posterior probability over codewords
by explicit enumeration of all sixteen codewords. This
posterior distribution is shown
in \figref{fig.posteriorH74}. Of course, we aren't really
interested in such brute-force solutions, and the aim
of this chapter
% the following sections
is to understand algorithms for getting
the same information out in less than $2^K$ computer time.
Examining the posterior probabilities, we notice that the most probable
codeword is actually the string $\bt = \tt 0110001$. This is more than
twice as probable as the answer found by thresholding, {\tt 0000000}.
Using the posterior probabilities shown in \figref{fig.posteriorH74},
we can also compute the posterior marginal distributions of each of
the bits. The result is shown in \figref{fig.exact.marginals}.
Notice that bits 1, 4, 5 and 6 are all quite confidently
inferred to be zero. The strengths of the posterior probabilities
for bits 2, 3, and 7 are not so great. \hfill \ensuremath{\epfsymbol}\par
\end{enumerate}
}
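The entries of \figref{fig.posteriorH74} can be reproduced by hand. As a worked
instance, assuming a uniform prior over the sixteen codewords, the likelihood
of {\tt 0110001} is the product of the bitwise normalized likelihoods,
\[
P(\by \given \bt \eq {\tt 0110001}) =
0.9 \times 0.4 \times 0.9 \times 0.9 \times 0.9 \times 0.9 \times 0.3 = 0.0708588 ,
\]
and the sixteen likelihoods in the table sum to $0.1122$, so
\[
P(\bt \eq {\tt 0110001} \given \by) = \frac{0.0708588}{0.1122} \simeq 0.63 , \:\:\:
P(\bt \eq {\tt 0000000} \given \by) = \frac{0.0275562}{0.1122} \simeq 0.25 .
\]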
\begin{figure}
\figuremargin{%
\[
\begin{array}{ccclllll} \toprule
n & \multicolumn{2}{c}{\mbox{Likelihood}} & \multicolumn{4}{c}{\mbox{Posterior marginals}} \\
& \multicolumn{1}{c}{P(y_n \given t_n \eq 1)} & \multicolumn{1}{c}{P(y_n \given t_n \eq 0)} &
\multicolumn{2}{c}{P(t_n \eq 1 \given \by)} & \multicolumn{2}{c}{P(t_n \eq 0 \given \by)} \\ \midrule
% marginals
1 & 0.1 & 0.9 & 0.061 & \raisebox{2mm}{\framebox[0.061in]{}} & 0.939 & \raisebox{2mm}{\framebox[0.939in]{}} \\
2 & 0.4 & 0.6 & 0.674 & \raisebox{2mm}{\framebox[0.674in]{}} & 0.326 & \raisebox{2mm}{\framebox[0.326in]{}} \\
3 & 0.9 & 0.1 & 0.746 & \raisebox{2mm}{\framebox[0.746in]{}} & 0.254 & \raisebox{2mm}{\framebox[0.254in]{}} \\
4 & 0.1 & 0.9 & 0.061 & \raisebox{2mm}{\framebox[0.061in]{}} & 0.939 & \raisebox{2mm}{\framebox[0.939in]{}} \\
5 & 0.1 & 0.9 & 0.061 & \raisebox{2mm}{\framebox[0.061in]{}} & 0.939 & \raisebox{2mm}{\framebox[0.939in]{}} \\
6 & 0.1 & 0.9 & 0.061 & \raisebox{2mm}{\framebox[0.061in]{}} & 0.939 & \raisebox{2mm}{\framebox[0.939in]{}} \\
7 & 0.3 & 0.7 & 0.659 & \raisebox{2mm}{\framebox[0.659in]{}} & 0.341 & \raisebox{2mm}{\framebox[0.341in]{}} \\ \bottomrule
\end{array}
\]
}{%
\caption[a]{Marginal posterior probabilities for the 7 bits
under the posterior distribution of \protect\figref{fig.posteriorH74}.}
\label{fig.exact.marginals}
}%
\end{figure}
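For instance, the entry for bit 2 can be checked by adding up the likelihoods
in \figref{fig.posteriorH74} of the eight codewords whose second bit is a {\tt 1}
and dividing by the total over all sixteen codewords (a sketch that assumes a
uniform prior over codewords):
\[
P(t_2 \eq 1 \given \by) =
\frac{\sum_{\bt} P(\by \given \bt) \, \truth[ t_2 \eq 1 ]}{\sum_{\bt} P(\by \given \bt)}
= \frac{0.0756}{0.1122} \simeq 0.674 .
\]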
In the above example, the MAP
% most probable
codeword is in agreement
with the
% bit-by-bit
bitwise decoding that is obtained by
selecting the most probable state for each bit using the
posterior marginal distributions. But this is
not always the case, as the following exercise shows.
\exercissxA{2}{ex.H74.hinoise}{
Find the most probable codeword in the case
where the normalized likelihood is $( 0.2,0.2,0.9,0.2,0.2,0.2,0.2 )$.
Also find or estimate
the marginal posterior probability for each of the seven bits,
and give the bit-by-bit decoding.
[Hint: concentrate on the few codewords that
have the largest probability.]
}
We now discuss how to use message passing on a code's trellis to solve
the decoding problems.
\subsection{The min--sum algorithm\nonexaminable}
% {Viterbi}
\label{sec.viterbi}
The MAP codeword decoding problem can be solved
using the \ind{min--sum algorithm} that was introduced
% Connect this section to counting paths in the constrained channel,
% chapter \ref{ch.noiseless}, and to the message-passing chapter \ref{ch.message}.
in \secref{sec.minsum1}.
Each codeword of the code corresponds to a path across
the trellis.
Just as the cost of a journey is the sum of the costs of its constituent
steps, the log likelihood of a codeword is the sum
of the bitwise log likelihoods. By convention, we
flip the sign of the log likelihood (which we would like
to maximize) and talk in terms of a cost, which
we would like to minimize.
We associate with each edge a cost $-\!\log P(y_n \given t_n)$,
where $t_n$ is the transmitted bit associated with that edge,
and $y_n$ is the received symbol.
The min--sum algorithm presented in \secref{sec.minsum1}
can then identify the most probable codeword in a number of computer operations equal
to the number of edges in the trellis.
This algorithm is also known as the \ind{Viterbi algorithm} \cite{viterbi}.\index{message passing!Viterbi}
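The recursion can be sketched as follows. The symbols $c_i$ (the cost of the
cheapest route from the start node to node $i$) and $\kappa_{ij}$ (the cost
$-\!\log P(y_n \given t_n)$ of the edge from node $j$ to node $i$) are our own
labels, introduced only for this illustration, and ${\cal P}(i)$ denotes the set
of nodes that are parents of node $i$:
\beqan
c_0 &=& 0 \nonumber \\
c_i &=& \min_{ j \in {\cal P}(i) } \left( \kappa_{ij} + c_j \right) . \nonumber
\eeqan
Recording, at each node, which parent achieved the minimum allows the cheapest
path to be read back from the end node. For the normalized likelihoods
$(0.1, 0.4, 0.9, 0.1, 0.1, 0.1, 0.3)$ used in the $(7,4)$ Hamming example above,
and logarithms to base 2, the total path costs of the codewords {\tt 0110001}
and {\tt 0000000} are
\[
-\log_2 0.0708588 \simeq 3.82 \:\:\mbox{bits} , \:\:\:
-\log_2 0.0275562 \simeq 5.18 \:\:\mbox{bits} ,
\]
so the cheaper path is the more probable codeword.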
% Consider a node on the most probable path, which has two upstream
% parents. The most probable way of creating the first $n$ emissions
% and getting to the present node must be the same as
% the most probable path, because if it weren't....
% To find the most probable path to a node, only need to know the
% score of the most probable paths to its parents, and the score associated
% with transitions from those two parents. Then can identify which is
% the cheaper parent.
\subsection{The sum--product algorithm\nonexaminable}
\label{sec.trellisfb}
To solve the bitwise decoding problem,
we can make a small modification to the min--sum algorithm,
so that the messages passed through the trellis
define `the probability of the data up to the current point'
instead of `the cost of the best route to this point'.
We replace the costs on the edges, $-\!\log P(y_n \given t_n)$,
by the likelihoods themselves, $P(y_n \given t_n)$.
We replace the min and sum operations of the \ind{min--sum algorithm}
by a sum and product respectively.
Let $i$ run over nodes/states, $i=0$ be the label for the
start state, ${\cal P}(i)$ denote the set of
states that are parents of state $i$,
and $w_{ij}$ be the likelihood associated with the
edge from node $j$ to node $i$.
We define the forward-pass messages $\alpha_i$
by
\beqan
\alpha_0 &=& 1 \nonumber \\
\alpha_i & = & \sum_{ j \in {\cal P}(i) } w_{ij} \alpha_j .
\eeqan
These messages can be computed sequentially from left to right.
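As a small worked case, consider the trellis of $P_3$ with normalized
likelihoods $P(y_n \given t_n \eq 1) = (0.1, 0.4, 0.9)$ and
$P(y_n \given t_n \eq 0) = (0.9, 0.6, 0.1)$. These numbers are illustrative,
and we assume that the two nodes at times 1 and 2 record the parity (even or odd)
of the bits emitted so far. Writing $\alpha^{\rm even}_n$ and $\alpha^{\rm odd}_n$
for the messages at time $n$ (our notation for this sketch), we find
\[
\begin{array}{rcl}
\alpha^{\rm even}_1 &=& 0.9 \\
\alpha^{\rm odd}_1 &=& 0.1 \\
\alpha^{\rm even}_2 &=& 0.9 \times 0.6 + 0.1 \times 0.4 = 0.58 \\
\alpha^{\rm odd}_2 &=& 0.9 \times 0.4 + 0.1 \times 0.6 = 0.42 \\
\alpha_I &=& 0.58 \times 0.1 + 0.42 \times 0.9 = 0.436 .
\end{array}
\]
The final value agrees with the sum of the likelihoods of the four codewords
{\tt 000}, {\tt 011}, {\tt 101}, {\tt 110}, namely
$0.054 + 0.324 + 0.054 + 0.004 = 0.436$.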
\exercisxB{2}{ex.sumprod}{
Show that for a node $i$ whose time-coordinate is $n$,
$\alpha_i$ is proportional to the joint probability
that the codeword's path passed through node $i$
and that the first $n$ received symbols
were $y_1, \ldots, y_n$.
}
The message $\alpha_I$ computed at the end node of the trellis is proportional to
the marginal probability of
the data.
\exercisxB{2}{ex.sumprodb}{
What is the constant of proportionality? [Answer: $2^K$]
}
We define a second set of backward-pass messages $\beta_i$
in a similar manner. Let node $I$ be the end node.
\beqan
\beta_I &=& 1 \nonumber \\
\beta_j & = & \sum_{i : j \in {\cal P}(i) } w_{ij} \beta_i .
\eeqan
These messages can be computed sequentially in
a backward pass from right to left.
\exercisxB{2}{ex.sumprodd}{
Show that for a node $i$ whose time-coordinate is $n$,
$\beta_i$ is proportional to the conditional probability,
{\em given\/}
that the codeword's path passed through node $i$,
that the subsequent received symbols
were $y_{n+1}, \ldots, y_N$.
}
Finally, to find the probability that the $n$th bit
was a 1 or 0, we do two summations of products of the
forward and backward messages. Let $i$ run over nodes
at time $n$ and $j$ run over nodes at time $n-1$,
and let $t_{ij}$ be the value of $t_n$ associated with
the trellis edge from node $j$ to node $i$. For each
value of $t=0/1$, we compute
\beq
r^{(t)}_n = \sum_{i,j: \, j \in {\cal P}(i) ,\, t_{ij} = t} \alpha_j w_{ij} \beta_i .
\eeq
Then the posterior probability that $t_n$ was $t=0/1$ is
\beq
P( t_n \eq t \given \by ) = \frac{1}{Z} r^{(t)}_n ,
\eeq
where the normalizing constant $Z = r^{(0)}_n + r^{(1)}_n$
should be identical to the final forward message $\alpha_I$
that was computed earlier.
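The whole procedure can be traced by hand on the trellis of $R_3$, which we
take (as an assumption about its layout) to consist of two disjoint paths from
the start node to the end node, one reading {\tt 000} and one reading {\tt 111}.
With the illustrative normalized likelihoods
$P(y_n \given t_n \eq 1) = (0.4, 0.9, 0.3)$ and
$P(y_n \given t_n \eq 0) = (0.6, 0.1, 0.7)$,
the forward messages along the {\tt 0} path are $0.6$ and $0.06$,
those along the {\tt 1} path are $0.4$ and $0.36$, and
$\alpha_I = 0.06 \times 0.7 + 0.36 \times 0.3 = 0.15$.
The backward messages at time 2 are $0.7$ and $0.3$.
For the second bit we therefore find
\[
r^{(0)}_2 = 0.6 \times 0.1 \times 0.7 = 0.042 , \:\:\:
r^{(1)}_2 = 0.4 \times 0.9 \times 0.3 = 0.108 ,
\]
so that $Z = 0.042 + 0.108 = 0.15 = \alpha_I$ and
$P( t_2 \eq 1 \given \by ) = 0.108/0.15 = 0.72$.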
\exercisxC{2}{ex.sumprode}{
Confirm that the above sum--product algorithm
does compute $P( t_n \eq t \,|\, \by )$.
}
Other names for the sum--product algorithm presented here
are `the \ind{forward--backward algorithm}', `the \ind{BCJR algorithm}',
and `\ind{belief propagation}'.\index{message passing!BCJR}\index{message passing!belief propagation}\index{message passing!forward--backward}
\exercissxB{2}{ex.sumprodf}{
A codeword of the simple parity code $P_3$
is transmitted, and the received
signal $\by$ has associated likelihoods shown in
\tabref{tab.sumprodf}.%
\margintab{
\begin{center}
\begin{tabular}{ccc} \toprule
$n$ & \multicolumn{2}{c}{ $P(y_n \,|\, t_n )$ } \\
& $t_n \eq 0$ & $ t_n \eq 1 $ \\
\midrule
1 & \dquarter & \dhalf \\
2 & \dhalf & \dquarter \\
3 & \deighth & \dhalf \\ \bottomrule
\end{tabular}
\end{center}
\caption[a]{Bitwise likelihoods for
a codeword of $P_3$.}
\label{tab.sumprodf}
}
Use the min--sum algorithm and the sum--product
algorithm in the trellis (\figref{fig.trellis})
to solve the MAP codeword decoding problem
and the bitwise decoding problem. Confirm your answers
by enumeration of all codewords ({\tt{000}}, {\tt{011}}, {\tt{110}}, {\tt{101}}).
[\Hint: use logs to base 2 and do the min--sum computations by hand.
When working the sum--product algorithm by hand, you may find
it helpful to use three colours of pen, one for the
$\alpha$s, one for the $w$s, and one for the $\beta$s.]
% in the sum--product computation, the answers are best
% expressed working in multiples of .]
}
%\section{Exercises}
% Could discuss The junction tree algorithm.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\subchapter{More on Trellises\nonexaminable}
\section{More on trellises}
\label{sec.more.on.trellis}
% In this appendix we
We now discuss various ways of making the trellis of
a code. You may safely jump over this section.
The {\dbf \ind{span}} of a codeword
% of length $N$
is the set of bits contained between
the first bit in the codeword that is non-zero, and the last
bit that is non-zero, inclusive. We can indicate the span of a codeword by
a binary vector as shown in \tabref{fig.span}.
\begin{table}[htbp]
\figuremargin{%
\begin{center}
\begin{tabular}{rccccc} \toprule
Codeword &
\tt 0000000 &
\tt 0001011 &
\tt 0100110 &
\tt 1100011 &
\tt 0101101 \\
Span &
\tt 0000000 &
\tt 0001111 &
\tt 0111110 &
\tt 1111111 &
\tt 0111111 \\ \bottomrule
\end{tabular}
\end{center}
}{%
\caption[a]{Some codewords and their spans.}
\label{fig.span}
}%
\end{table}
\noindent
A generator matrix is in {\dbf trellis-oriented form} if
the spans of the rows of the generator matrix all start in different
columns and the spans all end in different columns.
%
% see bin/G2T.p
%
\subsection{How to make a trellis from a generator matrix}
First, put the generator matrix into trellis-oriented form by
row-manipulations similar to Gaussian elimination.
For example, our $(7,4)$ Hamming code can be generated by
\beq
\bG = \left[ \begin{array}{ccccccc}
1&0&0&0&1&0&1\\
0&1&0&0&1&1&0\\
0&0&1&0&1&1&1\\
0&0&0&1&0&1&1
\end{array}
\right]
\eeq
but this matrix is not in trellis-oriented form -- for example,
rows 1, 3 and 4 all have spans that end in the same column.
By subtracting lower rows from upper rows, we can obtain
an equivalent generator matrix (that is, one that generates the
same set of codewords) as follows:
\beq
\bG = \left[ \begin{array}{ccccccc}
1&1&0&1&0&0&0\\
0&1&0&0&1&1&0\\
0&0&1&1&1&0&0\\
0&0&0&1&0&1&1
\end{array}
\right] .
\eeq
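The trellis-oriented property of this second matrix can be verified directly
from the spans of its rows (computed here from the definition of the span
given above):
\[
\begin{array}{ccc}
\mbox{Row} & \mbox{Span} & \mbox{First and last column} \\
{\tt 1101000} & {\tt 1111000} & 1,\: 4 \\
{\tt 0100110} & {\tt 0111110} & 2,\: 6 \\
{\tt 0011100} & {\tt 0011100} & 3,\: 5 \\
{\tt 0001011} & {\tt 0001111} & 4,\: 7
\end{array}
\]
The four spans start in columns 1, 2, 3, 4 and end in columns 4, 6, 5, 7,
so no two rows share a starting column or an ending column.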
Now, each row of the generator matrix can be thought of
as defining an $(N,1)$ subcode of the $(N,K)$ code, that is,
in this case, a code with two codewords of length $N=7$.
For the first row, the code consists of the two codewords
$\tt 1 1 0 1 0 0 0$ and $\tt 0 0 0 0 0 0 0$. The subcode defined
by the second row consists of $\tt 0 1 0 0 1 1 0$ and $\tt 0 0 0 0 0 0 0$.
It is easy to construct the minimal trellises of these subcodes;
they are shown in the left column of figure \ref{fig.tH74s}.
We build the trellis incrementally as shown in
figure \ref{fig.tH74s}. We start with the trellis corresponding
to the subcode given by the first row of the generator matrix.
Then we add in one subcode at a time.
The vertices within the span of the new subcode are all duplicated.
The edge symbols in the original trellis are left unchanged and the
edge symbols in the second part of the trellis are flipped wherever
the new subcode has a {\tt{1}} and otherwise left alone.
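To illustrate one step of this construction, consider adding the second row
to the first. The trellis for the first row alone represents the two codewords
{\tt 0000000} and {\tt 1101000}; after the second subcode has been added,
the trellis should represent all four modulo-2 combinations of the first two rows,
\[
{\tt 0000000}, \:\: {\tt 1101000}, \:\: {\tt 0100110}, \:\: {\tt 1001110} ,
\]
the last of these being the sum of the two rows. In this reading of the
procedure, the duplicated copy of the vertices carries the paths in which the
second row is switched on.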
%
%
% MORE HERE!
\begin{figure}
\figuremargin{%
\vspace{-0.86in}
\begin{center}
\begin{tabular}{cl@{\hspace{-0.2in}}l}
\mbox{\psfig{figure=trellis/H74s/ps1.ps,angle=-90,width=2.13in}}& \\
+ \\[-1in]
\mbox{\psfig{figure=trellis/H74s/row2/ps.ps,angle=-90,width=2.13in}}&\raisebox{0.25in}{=}&
\mbox{\psfig{figure=trellis/H74s/ps2.ps,angle=-90,width=2.13in}}\\[0.1in]
+ \\[-0.64in]
\mbox{\psfig{figure=trellis/H74s/row3/ps.ps,angle=-90,width=2.13in}}&\raisebox{0.25in}{=}&
\mbox{\psfig{figure=trellis/H74s/ps3.ps,angle=-90,width=2.13in}}\\[0.6in]
+ \\[-1.1in]
\mbox{\psfig{figure=trellis/H74s/row4/ps.ps,angle=-90,width=2.13in}}&\raisebox{0.25in}{=}&
\mbox{\psfig{figure=trellis/H74s/ps.ps,angle=-90,width=2.13in}}\\[-0.1in]
\end{tabular}
\end{center}
}{%
\caption[a]{Trellises for four subcodes of the $(7,4)$ Hamming code
(left column),
and the sequence of trellises that are made when constructing the
trellis for the $(7,4)$ Hamming code (right column).
Each edge in a trellis is labelled by a zero (shown by a square)
or a one (shown by a cross).}
\label{fig.tH74s}
}%
\end{figure}
Another $(7,4)$ Hamming code can be generated by
\beq
\bG = \left[ \begin{array}{ccccccc}
1&1&1&0&0&0&0\\
0&1&1&1&1&0&0\\
0&0&1&0&1&1&0\\
0&0&0&1&1&1&1
\end{array}
\right] .
\label{eq.betterG74}
\eeq
The $(7,4)$ Hamming code generated by this matrix differs by a permutation
of its bits from the code generated by the systematic matrix used
in \chref{ch.one} and above.
%. This permutation has been chosen such that the
% parity-check matrix can be written thus:
The parity-check matrix corresponding to this permutation is:
\beq
\bH = \left[
\begin{array}{ccccccc}
1&0&1&0&1&0&1\\
0&1&1&0&0&1&1\\
0&0&0&1&1&1&1
\end{array}
\right] .
\label{eq.betterH74}
\eeq
The trellis obtained from the permuted
matrix $\bG$ given in \eqref{eq.betterG74}
is shown in \figref{fig.tH74}a. Notice that the number of
% edges and
nodes in this trellis is smaller than the number of nodes in the
previous trellis for the Hamming $(7,4)$ code in \figref{fig.trellis}c.
We thus observe that {\em rearranging the order of the codeword bits can sometimes
lead to smaller, simpler trellises.}
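The saving can be estimated by a simple count. This is a sketch that assumes
each row of a trellis-oriented generator matrix whose span runs from bit $a$ to
bit $b$ doubles the number of nodes at times $a, a\!+\!1, \ldots, b\!-\!1$ and leaves
the other times alone. The rows of \eqref{eq.betterG74} have spans covering
bits 1--3, 2--5, 3--6 and 4--7, which gives widths
\[
1, \: 2, \: 4, \: 4, \: 8, \: 4, \: 2, \: 1
\]
at times $0, 1, \ldots, 7$, that is, 26 nodes in all. The earlier trellis-oriented matrix,
whose spans cover bits 1--4, 2--6, 3--5 and 4--7, gives widths
$1, 2, 4, 8, 8, 4, 2, 1$, that is, 30 nodes.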
% kschischang
%\begin{figure}
%\figuremargin{%
\marginfig{\footnotesize%\small
\begin{center}
\begin{tabular}{*{1}{l@{\hspace{-0.5in}}l}}
\raisebox{0.5in}{(a)}&
\mbox{\psfig{figure=trellis/H74/ps.ps,angle=-90,width=2.13in}}
\\
\raisebox{0.5in}{(b)}&
\mbox{\psfig{figure=trellis/H74H/ps.ps,angle=-90,width=2.13in}}\\
\end{tabular}
\end{center}
%}{%
\caption[a]{Trellises for the permuted $(7,4)$ Hamming code generated from
(a) the generator matrix by the method
of \figref{fig.tH74s}; (b) the parity-check matrix
by the method on page \pageref{sec.pcm.page271}.
Each edge in a trellis is labelled by a zero (shown by a square)
or a one (shown by a cross).}
\label{fig.tH74}
}%
%\end{figure}
\subsection{Trellises from parity-check matrices}
\label{sec.pcm.page271}
Another way of viewing the trellis is in terms of the syndrome.
The syndrome of a vector $\br$ is defined to be $\bH \br$,
where $\bH$ is the parity-check matrix. A vector is a codeword
if and only if its syndrome is zero. As we generate a codeword
we can describe the current state by the {\dbf partial syndrome},
that is, the product of $\bH$ with the codeword bits thus far generated.
Each state in the trellis is a partial syndrome at one time
coordinate.
The starting and ending states are both constrained to be the zero
syndrome.
%
Each node at a given time represents a different possible
value for the partial syndrome.
Since $\bH$ is an $M\times N$ matrix, where $M=N-K$, the
syndrome is an $M$-bit vector. So we need at most
$2^M$ nodes at each time.
We can construct the trellis of a code from its parity-check
matrix by walking from each end, generating two trees of possible
syndrome sequences. The intersection of these two trees defines
the trellis of the code.
In the pictures we obtain from this construction, we can let the
vertical coordinate represent the syndrome. Then any horizontal edge
is necessarily associated with a zero bit (since only a non-zero bit
changes the syndrome) and any non-horizontal edge is associated with
a one bit.
(Thus in this representation
we no longer need to label the edges in the trellis.)
%
% these are done by RMtest into the directory code/lt
%
%
% see also bin/G2T.p and mutate.p
%
\Figref{fig.tH74}b shows the trellis corresponding to the parity-check
matrix of \eqref{eq.betterH74}.
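As an illustration of the partial syndrome, take the codeword {\tt 1110000}
(the first row of the generator matrix \eqref{eq.betterG74}) and the
parity-check matrix \eqref{eq.betterH74}. Adding, modulo 2, the columns of $\bH$
selected by the bits emitted so far gives the sequence of states visited at
times $0, 1, \ldots, 7$:
\[
(0,0,0) \rightarrow (1,0,0) \rightarrow (1,1,0) \rightarrow (0,0,0)
\rightarrow (0,0,0) \rightarrow (0,0,0) \rightarrow (0,0,0) \rightarrow (0,0,0) .
\]
The path leaves the zero syndrome at the first {\tt 1}, returns to it after the
third bit, and then follows horizontal edges to the end node.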
% \Figref{fig.tRM16} shows the trellises of some slightly larger codes.
%
% restore RM material and GF4?
%
\fakesection{Is this label roughly right?}
\label{sec.two.to.M.trellis}
% \section{Solutions} are in _sexact
%MNBV\newpage
%\newpage
\dvips
\section{Solutions}% to Chapter \protect\ref{ch.exact}'s exercises} %
\begin{table}[hbtp]
\figuremargin{
\[%beq
\begin{tabular}{clll} \toprule
$\bt$ & \multicolumn{1}{c}{Likelihood } & \multicolumn{2}{c}{Posterior probability} \\ \midrule
%
\tt 0000000 & 0.026 & 0.3006 & \raisebox{2mm}{\framebox[0.301in]{}} \\
\tt 0001011 & 0.00041 & 0.0047 & \raisebox{2mm}{\framebox[0.005in]{}} \\
\tt 0010111 & 0.0037 & 0.0423 & \raisebox{2mm}{\framebox[0.042in]{}} \\
\tt 0011100 & 0.015 & 0.1691 & \raisebox{2mm}{\framebox[0.169in]{}} \\
\tt 0100110 & 0.00041 & 0.0047 & \raisebox{2mm}{\framebox[0.005in]{}} \\
\tt 0101101 & 0.00010 & 0.0012 & \raisebox{2mm}{\framebox[0.001in]{}} \\
\tt 0110001 & 0.015 & 0.1691 & \raisebox{2mm}{\framebox[0.169in]{}} \\
\tt 0111010 & 0.0037 & 0.0423 & \raisebox{2mm}{\framebox[0.042in]{}} \\
\tt 1000101 & 0.00041 & 0.0047 & \raisebox{2mm}{\framebox[0.005in]{}} \\
\tt 1001110 & 0.00010 & 0.0012 & \raisebox{2mm}{\framebox[0.001in]{}} \\
\tt 1010010 & 0.015 & 0.1691 & \raisebox{2mm}{\framebox[0.169in]{}} \\
\tt 1011001 & 0.0037 & 0.0423 & \raisebox{2mm}{\framebox[0.042in]{}} \\
\tt 1100011 & 0.00010 & 0.0012 & \raisebox{2mm}{\framebox[0.001in]{}} \\
\tt 1101000 & 0.00041 & 0.0047 & \raisebox{2mm}{\framebox[0.005in]{}} \\
\tt 1110100 & 0.0037 & 0.0423 & \raisebox{2mm}{\framebox[0.042in]{}} \\
\tt 1111111 & 0.000058 & 0.0007 & \raisebox{2mm}{\framebox[0.001in]{}} \\
\bottomrule
\end{tabular}
\]%eeq
}{
\caption[a]{
The posterior probability over codewords for \protect\exerciseonlyref{ex.H74.hinoise}.
}
\label{tab.74hipost}
}
\end{table}
\soln{ex.H74.hinoise}{
The posterior probability over
codewords is shown in \tabref{tab.74hipost}.
The most probable codeword is {\tt 0000000}.
The marginal posterior probabilities of
all seven bits are:
% marginals
\[%beq
\begin{array}{cccllll}\toprule
n & \multicolumn{2}{c}{\mbox{Likelihood}} & \multicolumn{4}{c}{\mbox{Posterior marginals}} \\
& P(y_n\given t_n\eq {\tt 1}) & P(y_n\given t_n\eq {\tt 0}) &
\multicolumn{2}{c}{P(t_n\eq {\tt 1} \given \by)} & \multicolumn{2}{c}{P(t_n\eq {\tt 0} \given \by)} \\ \midrule
% marginals
1 & 0.2 & 0.8 & 0.266 & \raisebox{2mm}{\framebox[0.266in]{}} & 0.734 & \raisebox{2mm}{\framebox[0.734in]{}} \\
2 & 0.2 & 0.8 & 0.266 & \raisebox{2mm}{\framebox[0.266in]{}} & 0.734 & \raisebox{2mm}{\framebox[0.734in]{}} \\
3 & 0.9 & 0.1 & 0.677 & \raisebox{2mm}{\framebox[0.677in]{}} & 0.323 & \raisebox{2mm}{\framebox[0.323in]{}} \\
4 & 0.2 & 0.8 & 0.266 & \raisebox{2mm}{\framebox[0.266in]{}} & 0.734 & \raisebox{2mm}{\framebox[0.734in]{}} \\
5 & 0.2 & 0.8 & 0.266 & \raisebox{2mm}{\framebox[0.266in]{}} & 0.734 & \raisebox{2mm}{\framebox[0.734in]{}} \\
6 & 0.2 & 0.8 & 0.266 & \raisebox{2mm}{\framebox[0.266in]{}} & 0.734 & \raisebox{2mm}{\framebox[0.734in]{}} \\
7 & 0.2 & 0.8 & 0.266 & \raisebox{2mm}{\framebox[0.266in]{}} & 0.734 & \raisebox{2mm}{\framebox[0.734in]{}} \\ \bottomrule
\end{array}
\]%eeq
So the bitwise decoding is {\tt 0010000}, which is not actually a
codeword.
}
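As a check on the table of marginals, the entry for bit 3 is the sum of the
posterior probabilities in \tabref{tab.74hipost} of the eight codewords whose
third bit is a {\tt 1}:
\[
3 \times 0.1691 + 4 \times 0.0423 + 0.0007 = 0.6772 \simeq 0.677 .
\]
No single codeword with $t_3 \eq 1$ is as probable as {\tt 0000000}, yet together
they outweigh the codewords with $t_3 \eq 0$, which is why the bitwise decoding
differs from the MAP codeword.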
\soln{ex.sumprodf}{
The MAP codeword is {\tt{101}}, and its likelihood
is $1/8$. The normalizing constant of the sum--product algorithm
is $Z = \alpha_I = \dfrac{3}{16}$.
The intermediate $\alpha_i$ are (from left to right)
$\dhalf$, $\dquarter$, $\dfrac{5}{16}$, $\dfrac{4}{16}$;
the intermediate $\beta_i$ are (from right to left),
$\dhalf$, $\deighth$, $\dfrac{9}{32}$, $\dfrac{3}{16}$.
The bitwise decoding is:
$P(t_1 \eq 1 \given \by) = 3/4$;
$P(t_2 \eq 1 \given \by) = 1/4$;
$P(t_3 \eq 1 \given \by) = 5/6$.
The codewords' probabilities are
$\dfrac{1}{12}$, $\dfrac{2}{12}$, $\dfrac{1}{12}$, $\dfrac{8}{12}$
for {\tt{000}}, {\tt{011}}, {\tt{110}}, {\tt{101}}.
}
\dvipsb{solutions exact}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% \prechapter{About Chapter}
\chapter{Exact Marginalization in Graphs}
\label{ch.belief.propagation}
\label{ch.sumproduct}
\label{ch.factorgraphs}
\index{sum--product algorithm}\index{factor graph}\index{graph!factor graph}\index{algorithm!sum--product}
\label{sec.sumproduct}
We now take a more general view of the tasks of inference
and marginalization.
Before reading this chapter, you should read about message passing in \chref{ch.message}.
% \newcommand{\gP}{P^*} in itprnnchapter.tex
\section{The general problem}
Assume that a function $\gP$ of a set of $N$ variables $\bx \equiv \{ x_n \}_{n=1}^{N}$
is defined as a product of $M$ {\dem{factors}\/} as follows:
\beq
\gP(\bx) =
% \frac{1}{Z}
\prod_{m=1}^M f_m( \bx_m ) .
\label{eq.factorfunction}
\eeq
% Each of the factors $\phi_n(x_n)$ is a function of only one of the variables.
Each of the factors $f_m( \bx_{m} )$ is a function of a subset $\bx_{m}$ of the
variables that make up $\bx$.
If $\gP$ is a positive function then we may be interested in
a second normalized function,
\beq
P(\bx) \equiv \smallfrac{1}{Z} \gP(\bx) =
\smallfrac{1}{Z}
\prod_{m=1}^M f_m( \bx_m ) ,
\label{eq.factorfunctionZ}
\eeq
where the normalizing constant $Z$ is defined
by
\beq
Z = \sum_{\bx}
\prod_{m=1}^M f_m( \bx_m ) .
\eeq
As an example of the notation we've just introduced,
here's a function of three binary variables $x_1$, $x_2$, $x_3$ defined by
the five factors:
% ($N=3$, $M=2$):
\beq
\begin{array}{rcl}
f_1 (x_1) &=& \left\{ \begin{array}{cl} 0.1 & x_1 \eq 0 \\ 0.9 & x_1 \eq 1 \end{array}\right. \\
f_2 (x_2) &=& \left\{ \begin{array}{cl} 0.1 & x_2 \eq 0 \\ 0.9 & x_2 \eq 1 \end{array}\right. \\
f_3 (x_3) &=& \left\{ \begin{array}{cl} 0.9 & x_3 \eq 0 \\ 0.1 & x_3 \eq 1 \end{array}\right. \\
f_4 (x_1,x_2) &=& \left\{ \begin{array}{cl} 1 & (x_1,x_2) \eq (0,0) \:\:\mbox{or}\:\: (1,1) \\
0 & (x_1,x_2) \eq (1,0) \:\:\mbox{or}\:\: (0,1) \end{array}\right. \\
f_5 (x_2,x_3) &=& \left\{ \begin{array}{cl} 1 & (x_2,x_3) \eq (0,0) \:\:\mbox{or}\:\: (1,1) \\
0 & (x_2,x_3) \eq (1,0) \:\:\mbox{or}\:\: (0,1) \end{array}\right.
\\[0.15in]
\gP(\bx)& =&
f_1 (x_1)
f_2 (x_2)
f_3 (x_3)
f_4 (x_1,x_2)
f_5 (x_2,x_3) \\
P(\bx)& =& \displaystyle \smallfrac{1}{Z}
f_1 (x_1)
f_2 (x_2)
f_3 (x_3)
f_4 (x_1,x_2)
f_5 (x_2,x_3) .
\end{array}
\label{eq.r3factors}
\eeq
The five subsets of $\{ x_1,x_2,x_3 \}$ denoted by $\bx_m$ in the
general function (\ref{eq.factorfunction})
are here
$\bx_1 = \{x_1\}$,
$\bx_2 = \{x_2\}$,
$\bx_3 = \{x_3\}$,
$\bx_4 = \{x_1,x_2\}$,
and
$\bx_5 = \{x_2,x_3\}$.
The function $P(\bx)$, by the way, may be recognized as the posterior probability
distribution of the three transmitted bits in a repetition code (\sectionref{sec.r3})
when the received signal is $\br = ( {\tt 1} , {\tt 1} , {\tt 0} )$
and the channel is a binary symmetric channel with flip probability 0.1.
The factors $f_4$ and $f_5$ respectively enforce the
constraints that $x_1$ and $x_2$ must be identical and that
$x_2$ and $x_3$ must be identical.
The factors $f_1$, $f_2$, $f_3$ are the likelihood functions contributed
by each component of $\br$.
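For this example the sums are short enough to do by hand.
The factors $f_4$ and $f_5$ vanish unless all three bits agree,
so only two settings of $\bx$ contribute:
\beq
\begin{array}{rcl}
\gP(0,0,0) &=& 0.1 \times 0.1 \times 0.9 \:=\: 0.009 \\
\gP(1,1,1) &=& 0.9 \times 0.9 \times 0.1 \:=\: 0.081 \\
Z &=& 0.009 + 0.081 \:=\: 0.09 ,
\end{array}
\eeq
giving $P(1,1,1) = 0.081/0.09 = 0.9$ and $P(0,0,0) = 0.1$.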
A function of the factored form (\ref{eq.factorfunction})
can be depicted by a {\dem\ind{factor graph}},
in which the variables are depicted by circular nodes
and the
% shared
factors are depicted by square nodes.
An edge is put between variable node $n$ and factor node $m$
% if $n \in \Nm$, that is, if the function $\psi_m(\bx)$ has
if the function $f_m(\bx_m)$ has
any dependence on variable $x_n$.
%
The factor graph for the example function (\ref{eq.r3factors}) is shown
in \figref{fig.r3.graph}.
\amarginfig{b}{
\begin{center}{
\setlength{\unitlength}{0.477mm}
\begin{picture}(101,34)(-54,-7)
\put(0,15){\line(1,-1){10}}
\put(11,5){\line(1,1){10}}
\put(22,15){\line(1,-1){10}}
\put(33,5){\line(1,1){10}}
\put(0,25){\makebox(0,0)[c]{$x_1$}}
\put(22,25){\makebox(0,0)[c]{$x_2$}}
\put(44,25){\makebox(0,0)[c]{$x_3$}}
\multiput(0,18)(21.5,0){3}{\circle{6}}
\multiput(-1,15)(21.5,0){3}{\line(-5,-1){50}}
% five boxes
\multiput(-57,5)(22,0){5}{\line(1,0){5}}
\multiput(-57,5)(22,0){5}{\line(0,-1){5}}
\multiput(-57,0)(22,0){5}{\line(1,0){5}}
\multiput(-52,0)(22,0){5}{\line(0,1){5}}
\put(-54,-5.4){\makebox(0,0)[c]{$f_1$}}%(x_1)
\put(-32,-5.4){\makebox(0,0)[c]{$f_2$}}
\put(-10,-5.4){\makebox(0,0)[c]{$f_3$}}
\put(12,-5.4){\makebox(0,0)[c]{$f_4$}}% (x_1,x_2)
\put(34,-5.4){\makebox(0,0)[c]{$f_5$}}% (x_2,x_3)
\end{picture}}
\end{center}
%}{%
\caption[a]{The factor graph associated with the function
% defined in
$\gP(\bx)$
(\ref{eq.r3factors}).}
\label{fig.r3.graph}
}% end marginfig
\subsection{The normalization problem}
The first task to be solved is
to compute the normalizing constant $Z$.
\subsection{The marginalization problems}
The second task to be solved is
to compute the marginal function
%\marginpar{\footnotesize{We use the term
% marginal function rather than marginal distribution because
% in what follows we do not need to constrain $f$ to be a probability distribution.}}
of any variable $x_n$, defined by
\beq
Z_n(x_n) = \sum_{ \{ x_{n'} \} , \, n' \neq n} \gP (\bx) .
\eeq
For example, if $\gP$ is a function of three variables then
the marginal for $n=1$ is defined by
\beq
Z_1(x_1) = \sum_{x_2,x_3} \gP(x_1,x_2,x_3) .
\eeq
This type of summation, over `all the $x_{n'}$ except for $x_n\!$'
is so important that
it can be useful to have a special notation for
it -- the `\ind{not-sum}' or `\ind{summary}'.
%,
%\beq
% f_1(x_1) = \sum_{\tilde x_1} f(x_1,x_2,x_3) \equiv
% \sum_{x_2,x_3} f(x_1,x_2,x_3) .
%\eeq
% The marginal function $f_n(x_n)$ can be called
% `the summary for $x_n$ of $f$'.
The third task to be solved is to compute
the normalized marginal of any variable $x_n$, defined by
\beq
P_n(x_n) \equiv \sum_{ \{ x_{n'} \} , \, n' \neq n} P (\bx) .
\eeq
[We include the suffix `$n$' in $P_n(x_n)$,
departing from our normal practice in the rest of the book,
where we would omit it.]
\exercisxB{1}{ex.normmarg}{
Show that the normalized marginal is related to the
marginal $Z_n(x_n)$ by
\beq
P_n(x_n) = \frac{ Z_n(x_n) }{ Z } .
\eeq
}
We might also be interested in marginals over
a subset of the variables, such
as
\beq
Z_{12}(x_1,x_2)
% \equiv \sum_{\tilde \{ x_1,x_2 \} } \gP (x_1,x_2,x_3)
\equiv \sum_{x_3} \gP (x_1,x_2,x_3) .
\eeq
All these tasks are intractable in general.
Even if every
% shared
factor is a function of
only three variables, the cost of computing
exact solutions for $Z$ and for the marginals
is believed in general to grow exponentially
with the number of variables $N$.
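To make the cost concrete, here is a minimal brute-force sketch in Python
for the three-bit example function (\ref{eq.r3factors}).
It is an illustration only: the names in it are our own choices and
the variables are indexed from zero.
The loop visits all $2^N$ settings of $\bx$, which is where the exponential cost comes from.

```python
from itertools import product

def same(a, b):
    """Equality constraint used by f_4 and f_5."""
    return 1.0 if a == b else 0.0

# Factors of the example function, stored as (variable indices, function).
factors = [
    ((0,), lambda x1: 0.9 if x1 == 1 else 0.1),   # f_1
    ((1,), lambda x2: 0.9 if x2 == 1 else 0.1),   # f_2
    ((2,), lambda x3: 0.1 if x3 == 1 else 0.9),   # f_3
    ((0, 1), same),                                # f_4
    ((1, 2), same),                                # f_5
]

def p_star(x):
    """Unnormalized function P*(x): the product of all the factors."""
    value = 1.0
    for scope, f in factors:
        value *= f(*(x[i] for i in scope))
    return value

def brute_force(n_vars):
    """Return Z and the marginals Z_n(x_n) by summing over all 2^N settings."""
    Z = 0.0
    Zn = [[0.0, 0.0] for _ in range(n_vars)]
    for x in product((0, 1), repeat=n_vars):
        p = p_star(x)
        Z += p
        for n, xn in enumerate(x):
            Zn[n][xn] += p
    return Z, Zn

Z, Zn = brute_force(3)
P = [[z / Z for z in row] for row in Zn]   # normalized marginals P_n(x_n)
```

For three bits this is instant; the same loop over a few hundred binary variables would never finish.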
For certain functions $\gP$, however,
the marginals can be computed efficiently
by exploiting the factorization of $\gP$.
The idea of how this efficiency arises is
well illustrated by the message-passing examples
of \chref{ch.message}.
The sum--product algorithm that we
now review is a generalization of message-passing
rule-set B (\pref{sec.messageBtree}).
As was the case there, the sum--product algorithm
is only valid if the graph is \ind{tree}-like.
\section{The sum--product algorithm}
\subsection{Notation}
We identify the set of variables that the $m$th factor depends on, $\bx_m$,
by
% defining
the set of their indices $\Nm$.
For our example function (\ref{eq.r3factors}),
the sets are $\N(1) = \{ 1 \}$ (since
$f_1$ is a function of $x_1$ alone),
$\N(2) = \{ 2 \}$,
$\N(3) = \{ 3 \}$,
$\N(4) = \{ 1,2 \}$, and
$\N(5) = \{ 2,3 \}$.
% the sets are $\N(1) = \{ 1,2 \}$ (since
% $\psi_1(\bx)$ depends on $x_1$ and $x_2$)
% and $\N(2) = \{ 2,3 \}$.
% This lets us use the notation $\psi_m( \{ x_n \}_{n \in \Nm} )$.
% note the set of variables $n$ that participate in shared factor $m$ by $\Nm \equiv \{ n : \}$.
Similarly we define the set of
% shared
factors in which variable $n$
participates, by $\Mn$. We
denote a set $\Nm$ with variable $n$ excluded by $\Nm\wo n$.
We introduce the shorthand \xmwon\ or \xmwonb\ to denote
the set of variables in $\bx_m$ with $x_n$ excluded,
\ie,
\beq
\xmwon \equiv \{ x_{n'} \! : n' \in \Nm \wo n \} .
\eeq
The sum--product algorithm will involve
messages of two types passing along the edges in the
factor graph: messages $q_{n \rightarrow m}$ from
variable nodes to factor nodes,
and messages $r_{m \rightarrow n}$ from
factor nodes to variable nodes.
A message (of either type, $q$ or $r$)
that is sent along an edge connecting factor $f_m$
to variable $x_n$ is always a function of the variable $x_n$.
Here are the two rules for the updating of the two sets of messages.\indexs{sum--product algorithm}\index{message passing!sum--product algorithm}\index{factor graph}
\medskip
\noindent
\begin{framedalgorithm}
\begin{description}
\item[From variable to factor:]
\beq
q_{n \rightarrow m}(x_n) = \prod_{m' \in \Mn\wo m} r_{m' \rightarrow n}(x_n) .
\label{eq.spq}
\eeq
\item[From factor to variable:]
\beq
r_{m \rightarrow n}(x_n) = \sum_{\xmwon}
\left( f_m( \bx_m) \prod_{ n' \in \Nm \wo n } q_{n' \rightarrow m}(x_{n'})
\right) .
\label{eq.spr}
\eeq
\end{description}
\end{framedalgorithm}
\subsection{How these rules apply to leaves in the factor graph}
A%
\amarginfig{b}{\begin{center}\mbox{\epsfbox{metapost/sumproduct.1}}\end{center}
\caption[a]{A factor node that is a leaf node
perpetually sends the message
$r_{m \rightarrow n}(x_n) = f_m( x_n)$ to its one neighbour $x_n$.}}
node that has only one edge connecting it to another node is called a \ind{leaf} node.
Some factor nodes in the graph may be connected to only one variable node,
in which case the set $\Nm \wo n$ of variables appearing in the factor
message update (\ref{eq.spr}) is an empty set, and the product of
functions $\prod_{ n' \in \Nm \wo n } q_{n' \rightarrow m}(x_{n'})$
is the empty product, whose value is 1.
Such a factor node therefore always broadcasts to its one neighbour $x_n$ the message
$r_{m \rightarrow n}(x_n) = f_m( x_n)$.
Similarly, there may be variable nodes that are connected to
only one factor node, so the set $\Mn\wo m$ in (\ref{eq.spq}) is empty.
These nodes perpetually
broadcast the message $q_{n \rightarrow m}(x_n) = 1$.%
\amarginfig{b}{\begin{center}\mbox{\epsfbox{metapost/sumproduct.2}}\end{center}
\caption[a]{A variable node that is a leaf node perpetually
sends the message $q_{n \rightarrow m}(x_n) = 1$.}}
% We call nodes that have only one edge connecting them to another node
% `leaf nodes'.
\subsection{Starting and finishing, method 1}
The algorithm can be initialized in two ways.
If the graph is tree-like then it must have nodes that are leaves.
These leaf nodes can broadcast their messages to their
respective neighbours from the start.
\beqan
\mbox{For all {leaf\/} variable nodes $n$:}&& q_{n \rightarrow m}(x_n) = 1 \\
\mbox{For all {leaf\/} factor nodes $m$:}&& r_{m \rightarrow n}(x_n) = f_m( x_n) .
\eeqan
We can then adopt the procedure used in
\chref{ch.message}'s message-passing
rule-set B (\pref{sec.messageBtree}):
a message is created in accordance with the rules (\ref{eq.spq}, \ref{eq.spr})
only if all the messages on which it depends are present.
\amarginfig{t}{
\begin{center}{
\setlength{\unitlength}{0.477mm}
\begin{picture}(101,34)(-54,-7)
\put(0,15){\line(1,-1){10}}
\put(11,5){\line(1,1){10}}
\put(22,15){\line(1,-1){10}}
\put(33,5){\line(1,1){10}}
\put(0,25){\makebox(0,0)[c]{$x_1$}}
\put(22,25){\makebox(0,0)[c]{$x_2$}}
\put(44,25){\makebox(0,0)[c]{$x_3$}}
\multiput(0,18)(21.5,0){3}{\circle{6}}
\multiput(-1,15)(21.5,0){3}{\line(-5,-1){50}}
% five boxes
\multiput(-57,5)(22,0){5}{\line(1,0){5}}
\multiput(-57,5)(22,0){5}{\line(0,-1){5}}
\multiput(-57,0)(22,0){5}{\line(1,0){5}}
\multiput(-52,0)(22,0){5}{\line(0,1){5}}
\put(-54,-5.4){\makebox(0,0)[c]{$f_1$}}%(x_1)
\put(-32,-5.4){\makebox(0,0)[c]{$f_2$}}
\put(-10,-5.4){\makebox(0,0)[c]{$f_3$}}
\put(12,-5.4){\makebox(0,0)[c]{$f_4$}}% (x_1,x_2)
\put(34,-5.4){\makebox(0,0)[c]{$f_5$}}% (x_2,x_3)
\end{picture}}
\end{center}
%}{%
\caption[a]{Our model factor graph for the function
% defined in
$\gP(\bx)$
(\ref{eq.r3factors}).}
\label{fig.r3.graphagain}
}% end marginfig
For example, in \figref{fig.r3.graphagain}, the message from $x_1$ to $f_1$
will be sent only when the message from $f_4$ to $x_1$ has been received;
and the message from $x_2$ to $f_2$, $q_{2 \rightarrow 2}$,
can be sent only when the messages
$r_{4 \rightarrow 2}$ and
$r_{5 \rightarrow 2}$ have both been received.
Messages will thus flow through the tree, one in each direction along every edge,
and after a number of
steps equal to the diameter of the graph,
every message will have been created.
The answers we require can then be read out. The marginal
function of $x_n$ is obtained by multiplying all the incoming messages
at that node.
\beq
Z_n(x_n) = \prod_{m \in \Mn} r_{m \rightarrow n}(x_n) .
\eeq
The normalizing constant $Z$ can be obtained by summing any marginal function,
$Z = \sum_{x_n} Z_n(x_n)$, and the normalized marginals obtained from
\beq
P_n(x_n) = \frac{ Z_n(x_n) }{ Z } .
\eeq
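The following Python sketch shows one way method 1 might be coded
for the example function (\ref{eq.r3factors}).
It is a minimal illustration under our own choice of data structures
(dictionaries keyed by edge), not a definitive implementation.
A message is created only when every message it depends on is present,
and the loop stops when a full sweep creates nothing new.

```python
from itertools import product
from math import prod

def same(a, b):
    return 1.0 if a == b else 0.0

# Example function: factor m -> (variables N(m), function f_m).
factors = {
    1: ((1,), lambda a: (0.1, 0.9)[a]),
    2: ((2,), lambda a: (0.1, 0.9)[a]),
    3: ((3,), lambda a: (0.9, 0.1)[a]),
    4: ((1, 2), same),
    5: ((2, 3), same),
}
variables = (1, 2, 3)
values = (0, 1)                                  # all variables are binary
# M[n]: the factors in which variable n participates.
M = {n: [m for m, (scope, _) in factors.items() if n in scope] for n in variables}

q = {}   # q[(n, m)][x_n]: message from variable n to factor m
r = {}   # r[(m, n)][x_n]: message from factor m to variable n

def try_q(n, m):
    """Variable-to-factor rule; returns False if an input is still missing."""
    needed = [(mp, n) for mp in M[n] if mp != m]
    if any(k not in r for k in needed):
        return False
    q[(n, m)] = [prod(r[k][x] for k in needed) for x in values]
    return True

def try_r(m, n):
    """Factor-to-variable rule; returns False if an input is still missing."""
    scope, f = factors[m]
    others = [k for k in scope if k != n]
    if any((k, m) not in q for k in others):
        return False
    msg = []
    for xn in values:
        total = 0.0
        for rest in product(values, repeat=len(others)):
            setting = dict(zip(others, rest))
            setting[n] = xn
            term = f(*(setting[k] for k in scope))
            term *= prod(q[(k, m)][setting[k]] for k in others)
            total += term
        msg.append(total)
    r[(m, n)] = msg
    return True

edges = [(n, m) for n in variables for m in M[n]]
progress = True
while progress:                                  # ends once no new message appears
    progress = False
    for n, m in edges:
        if (n, m) not in q and try_q(n, m):
            progress = True
        if (m, n) not in r and try_r(m, n):
            progress = True

# Read out the answers: multiply the incoming messages at each variable node.
Zn = {n: [prod(r[(m, n)][x] for m in M[n]) for x in values] for n in variables}
Z = sum(Zn[1])
P = {n: [z / Z for z in Zn[n]] for n in variables}
```

Leaf factors need no special case here: their set of other variables is empty,
so the product over incoming messages is the empty product, 1.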
\exercisxB{2}{ex.spforr3}{
Apply the sum--product algorithm to the function
defined in \eqref{eq.r3factors} and \figref{fig.r3.graph}.
Check that the normalized marginals are consistent with what you know
about the repetition code $R_3$.
}
\exercisxC{3}{ex.sppf}{
Prove that the sum--product algorithm correctly
computes the marginal functions $Z_n(x_n)$ if the graph is tree-like.
}
\exercisxC{3}{ex.sppf2}{
Describe how to use the messages computed by the sum--product algorithm
to obtain more complicated marginal functions in a tree-like graph, for example
$Z_{12}(x_1,x_2)$, for two variables $x_1$ and $x_2$ that are
connected to one common factor node.
}
\subsection{Starting and finishing, method 2}
Alternatively, the algorithm can be initialized by setting
{\em all\/} the initial messages from variables to 1:
\beq
\mbox{for all $n$, $m$:}\:\:\: q_{n \rightarrow m}(x_n) = 1 ,
\eeq
then proceeding with the factor message update rule (\ref{eq.spr}),
alternating with the variable message update rule (\ref{eq.spq}).
Compared with method 1, this
lazy initialization method leads to a load of wasted computations,
whose results are gradually flushed out by the correct answers
computed by method 1.
After a number of iterations equal to the diameter of the
factor graph, the algorithm will converge to a set of messages satisfying
the sum--product relationships (\ref{eq.spq}, \ref{eq.spr}).
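As a rough sketch of method 2, the Python below starts every
variable-to-factor message at 1 and then alternates the two update rules
over all edges at once, again for the example function (\ref{eq.r3factors}).
The data structures and the choice of eight iterations are our own assumptions,
made only for illustration.

```python
from itertools import product
from math import prod

def same(a, b):
    return 1.0 if a == b else 0.0

# Example function: factor m -> (variables N(m), function f_m).
factors = {
    1: ((1,), lambda a: (0.1, 0.9)[a]),
    2: ((2,), lambda a: (0.1, 0.9)[a]),
    3: ((3,), lambda a: (0.9, 0.1)[a]),
    4: ((1, 2), same),
    5: ((2, 3), same),
}
variables = (1, 2, 3)
values = (0, 1)
M = {n: [m for m, (scope, _) in factors.items() if n in scope] for n in variables}

def update_r(q):
    """Factor-to-variable rule applied to every edge, using the current q."""
    r = {}
    for m, (scope, f) in factors.items():
        for n in scope:
            msg = [0.0 for _ in values]
            for x in product(values, repeat=len(scope)):
                s = dict(zip(scope, x))
                msg[s[n]] += f(*x) * prod(q[(k, m)][s[k]] for k in scope if k != n)
            r[(m, n)] = msg
    return r

def update_q(r):
    """Variable-to-factor rule applied to every edge, using the current r."""
    return {(n, m): [prod(r[(mp, n)][x] for mp in M[n] if mp != m) for x in values]
            for n in variables for m in M[n]}

# Lazy initialization: all messages from variables are set to 1.
q = {(n, m): [1.0 for _ in values] for n in variables for m in M[n]}
history = []                         # marginal estimates after each iteration
for iteration in range(8):
    r = update_r(q)
    q = update_q(r)
    Zn = {n: [prod(r[(m, n)][x] for m in M[n]) for x in values] for n in variables}
    history.append(Zn)
```

Inspecting {\tt history} shows the early estimates being overwritten:
after the first iteration each variable has seen only its own likelihood factor,
and the later entries settle on the marginals that method 1 produces.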
\exercisxC{2}{ex.spforr3again}{
Apply this second version of the sum--product algorithm to the function
defined in \eqref{eq.r3factors} and \figref{fig.r3.graph}.
}
The reason for introducing this lazy method is that (unlike method 1) it can be applied
to graphs that are not tree-like.\index{loopy message-passing}\index{message passing!loopy}\index{message passing!in graphs with cycles}
When the sum--product algorithm is run on a graph with cycles,
the algorithm
does not necessarily converge, and certainly does not in general
compute the correct marginal functions; but it is nevertheless an
algorithm of great practical importance, especially in the decoding of
\ind{sparse-graph code}s.
\subsection{Sum--product algorithm with on-the-fly normalization}
If we are interested in only the {\em normalized\/} marginals,
then another version of the sum--product algorithm may be useful.
The factor-to-variable messages $r_{m \rightarrow n}$ are computed
in just the same way (\ref{eq.spr}), but the
variable-to-factor messages are normalized thus:
\beq
q_{n \rightarrow m}(x_n) = \alpha_{nm} \prod_{m' \in \Mn\wo m} r_{m' \rightarrow n}(x_n)
\label{eq.spqn}
\eeq
where $\alpha_{nm}$ is a scalar chosen such that
\beq
\sum_{x_n} q_{n \rightarrow m}(x_n) = 1 .
\eeq
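As a sketch of one normalized update for the example function (\ref{eq.r3factors}),
consider the message from $x_2$ to $f_4$.
Writing each message as a pair of values for $x_2 \eq 0$ and $x_2 \eq 1$ (our own shorthand),
the incoming messages are $r_{2 \rightarrow 2} = (0.1, 0.9)$, which is just $f_2$,
and $r_{5 \rightarrow 2} = (0.9, 0.1)$, which is the normalized message from $x_3$
passed through the equality constraint $f_5$. Then
\beq
q_{2 \rightarrow 4}(x_2) = \alpha_{24} \, ( 0.1 \times 0.9 , \: 0.9 \times 0.1 )
= \alpha_{24} \, ( 0.09 , \, 0.09 ) = ( 0.5 , \, 0.5 ) ,
\eeq
with $\alpha_{24} = 1/0.18$.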
\exercisxC{2}{ex.spforr3againagain}{
Apply this normalized version of the sum--product algorithm to the function
defined in \eqref{eq.r3factors} and \figref{fig.r3.graph}.
}
\subsection{A factorization view of the sum--product algorithm}
One way to view the sum--product algorithm is that it reexpresses
the original factored function, the product of $M$ factors
$ \gP(\bx) =
\prod_{m=1}^M f_m( \bx_m )$,
as another factored function which is the product
of $M+N$ factors,
\beq
\gP(\bx) =
\prod_{m=1}^M \phi_m( \bx_m )
\prod_{n=1}^N \psi_n( x_n ) .
\label{eq.factorfunctionphipsi}
\eeq
Each factor $\phi_m$ is associated with a factor node $m$,
and each factor $\psi_n(x_n)$ is associated with a variable node.
Initially $\phi_m(\bx_m) = f_m(\bx_m)$ and $\psi_n(x_n)=1$.
Each time
a factor-to-variable message $r_{m\rightarrow n}(x_n)$ is
sent, the factorization is updated thus:
\beq
\psi_n(x_n) = \prod_{m \in \Mn} r_{m\rightarrow n}(x_n)
\label{eq.firstpsirule}
\eeq
\beq
\phi_m(\bx_m) = \frac{ f_m(\bx_m) }{ \prod_{n \in \Nm} r_{m\rightarrow n}(x_n)}.
\eeq
And each message can be computed in terms of $\phi$ and $\psi$ using
\beq
r_{m \rightarrow n}(x_n) = \sum_{\xmwon}
\left( \phi_m( \bx_m) \prod_{ n' \in \Nm } \psi_{n'}(x_{n'})
\right)
\label{eq.sprpsi}
\eeq
which differs from the assignment (\ref{eq.spr}) in that the product is over
all $n' \in \Nm$.
\exercisxC{2}{ex.psiconfirm}{
Confirm that the update rules (\ref{eq.firstpsirule}--\ref{eq.sprpsi})
are equivalent to the sum--product rules (\ref{eq.spq}--\ref{eq.spr}).
So $\psi_n(x_n)$ eventually becomes the marginal $Z_n(x_n)$.
}
This factorization viewpoint applies whether or not the graph is tree-like.
% and $\phi_m(\bx_m)$ becomes a function having the property
%\beq
% \sum_{ \xmwon } \phi_m(\bx_m)
%\eeq
% Thus after any number of iterations of
\subsection{Computational tricks}
On-the-fly normalization is a good idea from a computational
point of view because if $P^*$ is a product of many factors,
its values are likely to be very large or very small.
Another useful computational trick involves passing
the logarithms of the messages $q$ and $r$ instead of $q$ and $r$ themselves;
the computations of the products in the algorithm (\ref{eq.spq}, \ref{eq.spr})
are then replaced by simpler additions. The summations in
(\ref{eq.spr}) of course become more difficult: to carry them out
and return the logarithm, we need to compute \index{softmax, softmin}{softmax} functions like
\beq
l = \ln ( e^{l_1} + e^{l_2} + e^{l_3} ) .
\label{eq.examplesum}
\eeq
But this computation can be done efficiently using look-up tables
along with the observation that the value of the answer $l$
is typically just a little larger than $\max_i l_i$.
If we store in look-up tables values of the
function
\beq
\ln ( 1 + e^{\delta} )
\eeq
(for negative $\delta$)
then $l$ can be computed exactly in a number of look-ups and
additions scaling as the number of terms in the sum.
If look-ups and sorting operations are cheaper than {\tt{exp()}}
then this approach costs less than the direct evaluation (\ref{eq.examplesum}).
The number of operations can be further reduced by
omitting negligible contributions from the smallest of the $\{ l_i \}$.
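A minimal Python sketch of this trick follows.
It is an illustration under our own naming, and it calls the library functions
{\tt log1p} and {\tt exp} where a real decoder might consult a look-up table of
$\ln ( 1 + e^{\delta} )$.

```python
import math

def log_add(a, b):
    """Return ln(e^a + e^b) as max(a, b) + ln(1 + e^delta), with delta <= 0."""
    hi, lo = (a, b) if a >= b else (b, a)
    if lo == -math.inf:              # e^lo is zero, so the answer is just hi
        return hi
    return hi + math.log1p(math.exp(lo - hi))

def log_sum_exp(ls):
    """Fold log_add over the terms: one correction per term in the sum."""
    total = -math.inf                # the logarithm of an empty sum
    for l in ls:
        total = log_add(total, l)
    return total

# Log-messages this negative would underflow if exponentiated directly.
l = log_sum_exp([-1000.0, -1001.0, -1002.0])
```

Because the exponent passed to {\tt exp} is never positive, the intermediate
values cannot overflow, and the result stays close to the largest $l_i$.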
A third computational trick applicable to certain error-correcting codes
is to pass not the messages but the \ind{Fourier transform}
of the messages. This again makes the computations of the factor-to-variable messages
quicker. A simple example of this Fourier transform trick is given in
\chref{ch.gallager} at \eqref{eq.ft.gallager}.
%\section{The Min--Sum Algorithm}
\section{The min--sum algorithm}
The sum--product algorithm solves the problem of
finding the marginal function of a given product $P^*(\bx)$.
This is analogous to solving the bitwise decoding problem
of \secref{sec.decoding.problems}.
And just as there were other decoding problems
(for example, the codeword decoding problem),
we can define other tasks involving $P^*(\bx)$
that can be solved by modifications of the sum--product algorithm.
For example, consider this task, analogous to
the codeword decoding problem:
\begin{description}
\item[The maximization problem\puncspace]
Find the setting of $\bx$ that maximizes the product $P^*(\bx)$.
\end{description}
This problem can be solved by replacing the two operations
{\sf{add}} and {\sf{multiply}}
% (`+' and `$\cdot$')
everywhere they appear in the sum--product algorithm
by
another pair of operations that satisfy the distributive
law,
% eq 14 in /home/mackay/tmp/fgspa.ps
namely {\sf{max}} and {\sf{multiply}}.
If we replace summation ($+$, $\sum$) by maximization,
we notice that the quantity formerly known
as the normalizing constant,
\beq
Z = \sum_{\bx} P^*(\bx) ,
\eeq
becomes $\max_{\bx} P^*(\bx)$.
Thus the sum--product algorithm can
be turned into a \ind{max--product} algorithm\index{algorithm!max--product}
that computes $\max_{\bx} P^*(\bx)$,
and from which the solution of the
maximization problem can be deduced.
Each `marginal' $Z_n(x_n)$ then lists the maximum
value that $P^*(\bx)$ can attain for each value of $x_n$.
In practice, the max--product algorithm
is most often carried out in
the negative log likelihood domain,
where {\sf{max}} and {\sf{product}}
become {\sf{min}} and {\sf{sum}}.
The min--sum algorithm is also known as the
\ind{Viterbi algorithm}.\index{algorithm!Viterbi}
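Here is a minimal Python sketch of the min--sum idea for the special case of a
chain, applied to the example function (\ref{eq.r3factors}).
The layout of the tables, the function names and the use of back-pointers to
recover the maximizing setting are our own illustrative choices.

```python
import math

def cost(p):
    """Negative log of a factor value; a forbidden setting costs infinity."""
    return -math.log(p) if p > 0 else math.inf

# Example function written as a chain: unary[n][x_n] holds f_1, f_2, f_3;
# pair[n][a][b] holds the factor linking variable n (value a) to n+1 (value b).
unary = [[0.1, 0.9], [0.1, 0.9], [0.9, 0.1]]
equal = [[1.0, 0.0], [0.0, 1.0]]
pair = [equal, equal]                # f_4(x_1,x_2) and f_5(x_2,x_3)

def min_sum_chain(unary, pair):
    """Return (maximizing setting, maximum of P*) using min and sum on costs."""
    N = len(unary)
    best = [cost(p) for p in unary[0]]   # best[a]: least cost of a prefix ending in a
    back = []                            # back-pointers, one list per step
    for n in range(1, N):
        new_best, pointers = [], []
        for b in (0, 1):
            options = [best[a] + cost(pair[n - 1][a][b]) for a in (0, 1)]
            a_star = min((0, 1), key=lambda a: options[a])
            new_best.append(options[a_star] + cost(unary[n][b]))
            pointers.append(a_star)
        best, back = new_best, back + [pointers]
    x_last = min((0, 1), key=lambda a: best[a])
    x = [x_last]
    for pointers in reversed(back):      # trace back from right to left
        x.append(pointers[x[-1]])
    x.reverse()
    return x, math.exp(-best[x_last])

x_map, p_max = min_sum_chain(unary, pair)
```

For this function the minimizing path is the all-ones setting,
in agreement with the most probable codeword of the repetition code.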
\section{The junction tree algorithm}
%\section{The Junction Tree Algorithm}
What should one do when the factor graph one is interested
in is not a tree?
There are several options, and they divide into exact methods
and approximate methods.
The most widely used exact method for handling marginalization
on graphs with cycles is called the \ind{junction tree algorithm}.
This algorithm works by agglomerating variables together
until the agglomerated graph has no cycles.
You can probably figure out the details for yourself; the
complexity of the marginalization grows exponentially
with the number of agglomerated variables.
Read more about the {junction tree algorithm}
in \cite{lauritzen96,jordan98:_learn_graph_model}.
There are many approximate methods, and we'll visit some of
them over the next few chapters -- Monte Carlo methods and
variational methods, to name a couple.
However, the most amusing way of handling factor graphs to
which the sum--product algorithm may not be applied
is, as we already mentioned,
to apply the sum--product algorithm! We simply compute the messages
for each node in the graph, as if the graph were a tree, iterate,
and cross our fingers.
This so-called `\ind{loopy}' message passing has great importance
in the decoding of
% state-of-the-art
error-correcting codes,
and we'll come back to it in \secref{sec.bvfe.fr} and {\partnoun} \sgcpart.
% at the end of this book.
%\exercisxC{3}{ex.minsum}{
% Fill in the
%}
\section*{Further reading}
For further reading about factor graphs and the sum--product algorithm,
see \citeasnoun{Kschischang2001},
\citeasnoun{YFW2000},
\citeasnoun{YFW2001long},
% this next one is poset
\citeasnoun{YFW2002}, \citeasnoun{wainwright2003},\index{Wainwright, Martin}\index{Yedidia, Jonathan}
and \citeasnoun{Forney2001}.
%\section*{Further reading}
% Redo the burglar-alarm problem of \exburglar\ using message-passing.
See also \citeasnoun{pearl}.
A good reference for the fundamental theory of graphical models
is \citeasnoun{lauritzen96}. A readable introduction to Bayesian
networks is given by \citeasnoun{jensen96}.
Interesting message-passing algorithms that have different
capabilities from the sum--product algorithm include {\dem\ind{expectation propagation}\/}
\cite{Minka2001} and {\dem\ind{survey propagation}\/}
\cite{surveyPropagation}.\index{Minka, Thomas}\index{Braunstein, A.}
% \index{M\'ezard, Marc}\index{Zecchina, R.}
See\index{Mezard, Marc}\index{Zecchina, R.}
also \secref{sec.bvfe.fr}.
\section{Exercises}
\exercisxB{2}{ex.pearl}{
Express the joint probability distribution
from the burglar alarm and earthquake problem (\exampleref{ex.burglar})
as a factor graph, and find the marginal probabilities of all the variables
as each piece of information comes to Fred's attention,
using the sum--product algorithm with on-the-fly normalization.
}
\dvips
% {Laplace's method}
\chapter{Laplace's Method}% \nonexaminable}
\index{Laplace's method}% (integration)}
\label{ch.laplace}
% \label{ch.laplace}
\fakesection{Laplace}
% A chapter about Laplace's method.
% Here we can perhaps include the choice of basis paper.
% see /home/mackay/_doc/dirichlet/laplace.tex
The idea behind the {Laplace approximation}\index{approximation!Laplace}\index{Laplace's method}
is simple.
We assume that an unnormalized probability density $P^*(x)$,
whose normalizing constant
\beq
Z_P \equiv \int P^*(x) \, \d x
\eeq
is of interest, has a peak
at a point $x_0$.
\marginfig{
\mbox{\psfig{figure=figs/peak/laplace.phi.ps,angle=-90,width=1.2in}\raisebox{0.3in}{\small$P^*(x)$}}
}
%
We Taylor-expand the logarithm of $P^*(x)$ around this peak:
\marginfig{
\mbox{\psfig{figure=figs/peak/laplace.phi.l.ps,angle=-90,width=1.2in}%
\raisebox{0.3in}{\small$\ln P^*(x)$}}
}
\marginfig{
\psfig{figure=figs/peak/laplace.l.ps,angle=-90,width=1.2in}%
\makebox[0in][l]{\raisebox{0.1in}{\small$\ln P^*(x)$}}
\makebox[0in][l]{\raisebox{-0.08in}{\small\& $\ln Q^*(x)$}}
}
\beq
\ln P^*(x) \simeq \ln P^*(x_0) - \frac{c}{2} (x-x_0)^2 + \cdots ,
\label{eq.expansionlogP}
\eeq
where
\beq
c = - \left.
\frac{\partial^2}{\partial x^2} \ln P^*(x)
\right|_{x=x_0} .
\eeq
%
We then approximate $P^*(x)$ by an unnormalized Gaussian,\index{approximation!by Gaussian}
\beq
Q^*(x) \equiv P^*(x_0) \exp \left[ - \frac{c}{2} (x-x_0)^2
\right] ,
\eeq
\marginfig{
\psfig{figure=figs/peak/laplace.ps,angle=-90,width=1.2in}
\makebox[0in][l]{\raisebox{0.1in}{\small$P^*(x)$}}
\makebox[0in][l]{\raisebox{-0.08in}{\small\& $Q^*(x)$}}
}
%
and we approximate the normalizing constant $Z_P$ by the
normalizing constant
of this Gaussian,
\beq
Z_Q = P^*(x_0) \sqrt{ \frac{2 \pi }{ c }} .
\eeq
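As a quick numerical check of these formulae, the Python sketch below applies
them to a function of our own choosing, $P^*(x) = x^k e^{-x}$ on $x > 0$,
whose exact normalizing constant is $k!$.
The curvature $c$ is estimated by a finite difference; the step size and the
function names are assumptions made for illustration.

```python
import math

def laplace_Z(log_p_star, x0, h=1e-4):
    """Laplace estimate Z_Q = P*(x0) sqrt(2 pi / c), with c from a central difference."""
    c = -(log_p_star(x0 + h) - 2.0 * log_p_star(x0) + log_p_star(x0 - h)) / h**2
    return math.exp(log_p_star(x0)) * math.sqrt(2.0 * math.pi / c)

k = 10

def log_p_star(x):
    """ln P*(x) for P*(x) = x^k e^{-x}, defined for x > 0."""
    return k * math.log(x) - x

Z_Q = laplace_Z(log_p_star, x0=float(k))   # the peak of P* is at x0 = k
Z_P = math.factorial(k)                    # exact value of the integral
ratio = Z_Q / Z_P
```

The estimate falls a little short of the exact value, by under one per cent for $k = 10$;
in this case the Laplace approximation reproduces Stirling's approximation for $k!$.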
We can generalize this integral to
approximate $Z_P$ for a density $P^*(\bx)$ over a $K$-dimensional
space $\bx$.
If the matrix of second derivatives of $-\ln P^*(\bx)$
at the maximum $\bx_0$ is $\bA$,
defined by:
\beq
A_{ij} = - \left.
\frac{\partial^2}{\partial x_i \partial x_j} \ln P^*(\bx)
\right|_{\bx=\bx_0} ,
\eeq
so that the expansion (\ref{eq.expansionlogP}) is generalized to
\beq
\ln P^*(\bx) \simeq \ln P^*(\bx_0) - \frac{1}{2} (\bx-\bx_0)^{\T}\!\bA (\bx-\bx_0)
+ \cdots ,
\label{eq.generalexpansionP}
\eeq
then the normalizing constant can be approximated by:
\beq
Z_P \simeq Z_Q
= P^*(\bx_0) \frac{ 1 }{ \sqrt{ \det{\frac{1}{2 \pi} \bA} } }
= P^*(\bx_0) \sqrt{\frac{ (2 \pi)^K }{ \det{\bA} } } .
\eeq
Predictions can be made using the approximation $Q$.
Physicists also call this widely-used approximation
the {\dem\ind{saddle-point approximation}}.\index{approximation!saddle-point}
\begin{aside}
The fact that the normalizing constant of a Gaussian is given by
\beq
\int \d^K \bx \: \exp \left[ -\frac{1}{2} \bx^{\T} \bA \bx \right]
= \sqrt{\frac{ (2 \pi)^K }{ \det{\bA} } }
\eeq
can be proved by making an orthogonal transformation
into the basis $\bu$ in which
$\bA$ is transformed into a diagonal matrix. The integral
then separates into a product of one-dimensional integrals,
each of the form
\beq
\int \d u_i \exp \left[ -\frac{1}{2} { \lambda_i u_i^2 } \right]
= \sqrt{\frac{2 \pi}{\lambda_i}} .
\eeq
The product of the eigenvalues $\lambda_i$ is the determinant of $\bA$.
\end{aside}
The Laplace approximation is \index{basis dependence}basis-dependent:
if $x$ is transformed to a nonlinear function $u(x)$ and
the density is transformed to $P(u) = P(x) \left| \d x/\d u \right|$
then in general the approximate normalizing constants $Z_Q$
will be different.
This can be viewed as a defect -- since the true value $Z_P$
is basis-independent -- or an opportunity -- because
we can hunt for a choice of basis in which the Laplace approximation
is most accurate.
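The basis dependence can be seen in a small example of our own choosing,
$P^*(x) = x^k e^{-x}$ on $x > 0$, for which $Z_P = k!$.
The Python sketch below evaluates the Laplace approximation in closed form
in the basis $x$ and in the basis $u = \ln x$, and reports both errors in bits.
The variable names are illustrative assumptions.

```python
import math

k = 10                     # P*(x) = x^k e^{-x} on x > 0, so Z_P = k!

# Basis x: ln P*(x) = k ln x - x peaks at x0 = k with curvature c = 1/k.
Z_Q_x = math.exp(k * math.log(k) - k) * math.sqrt(2.0 * math.pi * k)

# Basis u = ln x: the density picks up the factor |dx/du| = e^u, so
# ln P*(u) = (k+1) u - e^u, which peaks at u0 = ln(k+1) with curvature c = k+1.
Z_Q_u = (math.exp((k + 1) * math.log(k + 1) - (k + 1))
         * math.sqrt(2.0 * math.pi / (k + 1)))

Z_P = math.factorial(k)
error_x_bits = math.log2(Z_P / Z_Q_x)
error_u_bits = math.log2(Z_P / Z_Q_u)
```

The two estimates of the same integral differ, and for this function the
logarithmic basis gives the slightly smaller error.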
\section{Exercises}
% {\em Under construction}
% \medskip
%
%In the maximum likelihood chapter we found the second derivative
% for a few examples.
%
\exercisxA{2}{ex.poissonmap}{
(See also \exerciseref{ex.poissonml}.)
A \ind{photon counter} is pointed at a remote
star for one minute, in order to infer the rate of
photons arriving at the counter per minute, $\l$.
Assuming the number of photons collected $r$ has a
\ind{Poisson
distribution} with mean $\l$,
\beq
P(r \given \l ) = \exp( - \l)\frac{ \l^{r} }{r!} ,
\eeq
and assuming the \ind{improper} prior $P(\l) = 1/\l$,
make Laplace approximations to the posterior
distribution
% $P(\l \given r)$
\ben
\item over $\l$
\item over $\log \l$. [Note the improper prior transforms to $P(\log \l) = \mbox{constant}$.]
\een
}
\exercisxB{2}{ex.laplacebeta}{
Use Laplace's method to approximate the integral
\beq
Z(u_1,u_2) = \int_{-\infty}^{\infty} \! \d a \: f(a)^{u_1} (1-f(a))^{u_2} ,
\eeq
where $f(a) = 1/(1+e^{-a})$ and $u_1,u_2$ are positive.
Check the accuracy of the approximation
against the exact answer
(\ref{eq.Zbeta}, \pref{eq.Zbeta})
for $(u_1,u_2)=(\dhalf,\dhalf)$
and $(u_1,u_2)=(1,1)$.
Measure the error $(\log Z_P - \log {Z}_Q)$ in bits.
}
% Start with a tiny example -- note that
% for the mean of a Gaussian with known $\sigma$ the
% approximation is exact.
% For several other models (eg interpolation)
% the approximation is exact.
\exercisxB{3}{ex.interpoln}{
{\sf Linear \ind{regression}.}\index{linear regression}
$N$ datapoints $\{ (x^{(n)},t^{(n)}) \}$
are generated by the experimenter choosing each $x^{(n)}$,
then the world delivering
a noisy version of the linear function
\beq
y(x) = w_0 + w_1 x ,
\eeq
\beq
t^{(n)} \sim \Normal( y(x^{(n)}) , \sigma_{\nu}^2 ) .
\eeq
Assuming Gaussian priors on $w_0$ and $w_1$,
make the Laplace approximation to the posterior distribution
of $w_0$ and $w_1$ (which is exact, in fact)
and obtain the predictive distribution for the next datapoint
$t^{(N\!+\!1)}$, given
$x^{(N\!+\!1)}$.
(See \citeasnoun{MacKay92a} for further reading.)
}
%\section*{Further reading}
% We'
% restore me?
% \input{tex/laplacebasis.tex}
\dvips
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Model Comparison and Occam's Razor}
\label{ch.occam}
%
% see also nn_occam.tex
%
% and graveyard.tex
%
%\newcommand{\FIGS}{figs2} %
%\newcommand{\figs}{figs} % figures are kept in three directories
%\newcommand{\figsinter}{figs/inter}%
%
% \maketitle
% {\em Under construction.}
% \section{Probability theory and {O}ccam's razor}
\begin{figure}[hbtp]
\figuremargin{
\mbox{\psfig{figure=figs/dogs.eps,width=3in}}
}{
\caption[a]{A picture to be interpreted. It contains a \ind{tree}\index{box}\index{image analysis} and
some boxes.}
\label{fig.dogs1}
}
\end{figure}
\section{{O}ccam's razor}%\index{Occam's razor}
% mini-sermon on Bayes removed to
% sermons.tex
\label{sec.occam1}
How many boxes are in the picture (\figref{fig.dogs1})?
In particular, how many boxes are in the vicinity of the tree?
If we looked with x-ray spectacles, would
we see one or two boxes behind the trunk (\figref{fig.dogs3})?
(Or even more?)%
\newcommand{\twoboxesorone}{
\makebox[0in][r]{1?}\mbox{\psfig{figure=figs/dogs3.eps,width=1.46in}}\\
\makebox[0in][r]{or 2?}\mbox{\psfig{figure=figs/dogs3b.eps,width=1.46in}}
}%
\marginfig{\footnotesize
\begin{center}
\twoboxesorone
\end{center}
\caption[a]{How many boxes are behind the tree?}
\label{fig.dogs3}
}
Occam's razor is the principle that states a preference for simple
theories.
`Accept the simplest explanation that fits the data'.
Thus according to \inds{Occam's razor}, we should
deduce that there is only one box behind the tree.
Is this an ad hoc
% {\em ad hoc\/}
rule of thumb?
Or is there a convincing reason for believing
there is most likely one box? Perhaps
your intuition likes the argument
`well, it would be a remarkable {\em\ind{coincidence}\/}
for the two boxes to be just the same height and
colour as each other'.
If we wish to make artificial intelligences
that interpret data correctly, we must translate
this intuitive feeling into a concrete theory.
%\section{Probability theory and Occam's razor}
\subsection{Motivations for Occam's razor}
If several explanations are compatible with a set of
observations, Occam's razor advises us to buy the
simplest.
%least complex explanation.
This principle is often advocated for one of two
reasons: the first is aesthetic (`A theory with mathematical beauty
is more likely to be correct than an ugly one that fits some
experimental data' (Paul Dirac)); the second reason is the past
empirical success of Occam's razor.
% (`simple theories have proved successful in the past, so
%I prefer simple theories for new domains too').
However, there is a different justification for Occam's razor,
namely:
\begin{quotation}
\noindent
Coherent inference (as embodied by Bayesian probability)
automatically embodies Occam's razor,
quantitatively.
\end{quotation}
% Dirac need not be too upset if we reject his motivation for
% Occam's razor; the Bayesian Occam's razor is a theory
% with its own mathematical beauty!
It is indeed {\em more probable\/} that there's one
box behind the tree, and we can compute how much more
probable one is than two.
% Similarly,
\subsection{Model comparison and Occam's razor}
We evaluate the plausibility of two alternative theories $\H_1$ and
$\H_2$ in the light of data $D$ as follows: using\index{Bayes' theorem}
\Bayes\ theorem, we relate the plausibility of model $\H_1$ given the
data, $P(\H_1\given D)$, to the predictions made by the model about the
data, $P(D\given \H_1)$, and the prior plausibility of $\H_1$, $P(\H_1)$.
This gives the following probability ratio between theory $\H_1$ and
theory $\H_2$:
\begin{equation}
\frac{P(\H_1\given D)}{P(\H_2\given D)} = \frac{P(\H_1)}{P(\H_2)}
\frac{ P(D\given \H_1)}{ P(D\given \H_2)} .
\label{occam.eq1}
\end{equation}
The first ratio $( P(\H_1) / P(\H_2) )$ on the right-hand side
measures how much our initial beliefs favoured $\H_1$ over
$\H_2$. The second ratio
% factor $( P(D\given \H_1) / P(D\given \H_2) )$ evaluates
expresses how well the observed data were predicted by $\H_1$,
compared to $\H_2$.
How does this relate to Occam's razor, when $\H_1$ is a simpler model
than $\H_2$? The first ratio $( P(\H_1) / P(\H_2) )$ gives us the
opportunity, if we wish, to insert a prior bias in favour of $\H_1$
on aesthetic grounds, or on the basis of experience. This would
correspond to the aesthetic and empirical motivations for Occam's
razor mentioned earlier. But such a prior bias
is not necessary: the second ratio,
the data-dependent factor, embodies Occam's razor {\em
automatically}. Simple models tend to make precise
predictions. Complex models, by their nature, are capable of making a
greater variety of predictions (figure \ref{fig.pdh}). So if $\H_2$
is a more complex model, it must spread its predictive probability
$P(D\given \H_2)$ more thinly over the data space than $\H_1$. Thus, in the
case where the data are compatible with both theories,
% then
% it must be the case that
% $P(D\given \H_1)$ is greater
% than $P(D\given \H_2)$, so
the simpler $\H_1$ will turn out more probable than $\H_2$, without
our having to express any subjective dislike for complex models. Our
subjective prior just needs to assign equal prior probabilities to
the possibilities of simplicity and complexity. Probability theory
then allows the observed data to express their opinion.
% then match the model to the observed data.
\begin{figure}
\figuremargin{\small%
\begin{center}
\mbox{\psfig{figure=\figs/occam_int.ps,%
%width=90mm,height=46mm,angle=90}
width=65mm,height=32mm,angle=90}}
\end{center}
}{
\caption[abbrev]{{Why Bayesian inference embodies Occam's razor.}
This figure gives the basic intuition for why complex models can turn
out to be less probable. The horizontal axis represents the space of
possible data sets $D$. \Bayes\ theorem rewards models in proportion
to how much they {\em predicted\/} the data that occurred. These
predictions are quantified by a normalized probability distribution
on $D$. This probability of the data given model
$\H_i$, $P(D\given \H_i)$, is called the evidence for $\H_i$.
A simple model $\H_1$ makes only a limited range of predictions,
shown by $P(D\given \H_1)$; a more powerful model $\H_2$, that has, for
example, more free parameters than $\H_1$, is able to predict a
greater variety of data sets. This means, however, that $\H_2$ does
not predict the data sets in region $\C_1$ as strongly as
$\H_1$. Suppose that equal prior probabilities have been assigned to
the two models. Then, if the data set falls in region $\C_1$, the
{\em less powerful\/} model $\H_1$ will be the {\em more probable\/}
model.
}
\label{fig.pdh}
\label{fig3}
}
\end{figure}
% f:=(x,c,d,e)-> c*x**3 + d * x**2 + e ;
% x1:=-1;x2:=3;x3:=7;x4:=11; solve({f(x1,c,d,e)=x2, f(x2,c,d,e)=x3, f(x3,c,d,e)=x4, f(x4,c,d,e)=x5, f(x5,c,d,e)=x6},{c,d,e,x5,x6});
%
Let us turn to a simple example. Here is a sequence of numbers:
\[
-1, \: 3, \: 7, \: 11.
\]
The task is to predict the next two numbers,\index{sequence}\index{what number comes next?}\index{arithmetic progression}
and infer the underlying process that gave rise to
this sequence. A popular answer to this question is the prediction `15,
19', with the explanation `add 4 to the previous number'.
What about the alternative answer `$-19.9, 1043.8$' with the underlying
rule being: `get the next number from the previous number, $x$, by
evaluating $-x^3/11 + 9x^2/11 + 23/11$'? I assume that this
prediction
% -{\frac {x^{3}}{11}}+{\frac {9\,x^{2}}{11}}+{\frac {23}{11}}
seems rather less plausible. But the second rule fits the data ($-1$,
3, 7, 11) just as well as the rule `add 4'. So why should we find it
less plausible? Let us give labels to the two general theories:
%
\begin{description}
\item[$\H_a$ --] the sequence is an {\dem arithmetic\/} progression, `add $n$',
where $n$ is an integer.
\item[$\H_c$ --] the sequence is generated by a {\dem cubic\/} function of the
form $x \rightarrow c x^3 + d x^2 + e$,
where $c$, $d$ and $e$ are fractions.
\end{description}
%
One reason for finding the second explanation, $\H_c$, less
plausible, might be that arithmetic progressions are more frequently
encountered than cubic functions. This would put a \ind{bias} in the
prior probability ratio $P(\H_a)/P(\H_c)$ in equation (\ref{occam.eq1}).
But let us give the two theories equal prior probabilities, and
concentrate on what the data have to say. How well did each theory
predict the data?
To obtain $P(D\given \H_a)$ we must specify the probability distribution
that each model assigns to its parameters. First, $\H_a$ depends on
the added integer $n$, and the first number in the sequence. Let us
say that these numbers could each have been any integer between $-50$ and
50, giving 101 equally probable values for each. Then since only the pair of values \{$n\eq 4$, $\mbox{first
number}\eq -1$\} gives rise to the observed data $D$ = ($-1$,
3, 7, 11),
the probability of the data, given $\H_a$, is:
\beq
P(D\given \H_a) = \frac{1}{101} \frac{1}{101} = 0.00010 .
\eeq
To evaluate $P(D\given \H_c)$, we must similarly say what values the
fractions $c,d$ and $e$ might take on. [I choose to represent these
numbers as fractions rather than real numbers because if we used real
numbers, the model would assign, relative to $\H_a$, an
infinitesimal probability to $D$. Real parameters are the norm
however, and are assumed in the rest of this chapter.] A reasonable
prior might state that for each fraction the numerator could be any integer
between $-50$ and 50, and the denominator is any integer between
1 and 50. As for the initial value in the sequence, let us leave its
probability distribution the same as in $\H_a$.
There are four ways of expressing the fraction $c=-1/11= -2/22=-3/33=-4/44$
under this prior, and similarly there are four and two possible solutions
for $d$ and $e$, respectively. So the
probability of the observed data, given $\H_c$, is found to be:
%
\begin{eqnarray}
P(D\given \H_c) &=& \left(\frac{1}{101}\right)
\left(\frac{4}{101}\frac{1}{50}\right)
\left(\frac{4}{101}\frac{1}{50}\right)
\left(\frac{2}{101}\frac{1}{50}\right) \nonumber \\
&=& 0.0000000000025 = 2.5 \times 10^{-12} .
\end{eqnarray}
Thus comparing $P(D\given \H_c)$ with $P(D\given \H_a) = 0.00010$, even if our prior
probabilities for $\H_a$ and $\H_c$ are equal, the odds,
$P(D\given \H_a):P(D\given \H_c)$, in favour of $\H_a$ over $\H_c$, given the
sequence $D$ = ($-1$, 3, 7, 11), are about forty million to one.\ENDsolution
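As a check on the arithmetic, and under exactly the priors assumed above, the two evidences and their ratio can be written out explicitly:
\begin{eqnarray*}
P(D\given \H_a) &=& \frac{1}{101^2} \simeq 9.8 \times 10^{-5} , \\
P(D\given \H_c) &=& \frac{4 \times 4 \times 2}{101^4 \times 50^3} \simeq 2.5 \times 10^{-12} , \\
\frac{P(D\given \H_a)}{P(D\given \H_c)} &=& \frac{101^2 \times 50^3}{32} \simeq 4.0 \times 10^{7} .
\end{eqnarray*}
Each fraction costs $\H_c$ a factor of $4/5050$ or $2/5050$, whereas the integer $n$ costs $\H_a$ only $1/101$.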
This answer depends on several subjective assumptions; in particular,
the probability assigned to the free parameters $n$, $c$, $d$, $e$ of
the theories. Bayesians make no apologies for this: there is no such
thing as inference or prediction without assumptions. However, the
quantitative details of the prior probabilities have no effect on the
qualitative Occam's razor effect; the complex theory $\H_c$ always
suffers an `\ind{Occam factor}' because it has more parameters, and so can
predict a greater variety of data sets (figure \ref{fig.pdh}). This
was only a small example, and there were only four data points; as we
move to larger and more sophisticated problems the magnitude of the
Occam factors typically increases, and the degree to which our
inferences are influenced by the quantitative details of our
subjective assumptions becomes smaller.
% Why is this quantification of Occam's razor useful? It probably would
% have been of little use to William of Ockham, the 14th century
% Franciscan monk after whom the razor is named, though this derivation
% does show that Occam's razor is not an ad hoc principle.
\subsection{Bayesian methods and data analysis}
Let us now relate the discussion above to real problems in data
analysis.
There are countless problems in science, statistics and technology
which require that, given a limited data set, preferences be assigned
to alternative models of differing complexities.\index{complexity control}\index{model comparison}
For example, two alternative hypotheses accounting for planetary
motion are Mr.\ \ind{Inquisition}'s geocentric model based on `\ind{epicycles}',
and Mr.\ \ind{Copernicus}'s simpler model of the \ind{solar system} with the
sun at the centre. The
epicyclic model fits data on planetary motion at least as well as the
Copernican model, but does so using more parameters. Coincidentally
for Mr.\ Inquisition, two of the extra epicyclic parameters for every
planet are found to be identical to the period and radius of the
sun's `cycle around the earth'. Intuitively we find Mr.\ Copernicus's
theory more probable.
% I will now explain in more detail
% how Mr.\ Inquisition's excess parameters are penalized automatically
% under probability theory.
%thesis/figs/fig1.enhanced.tex
\begin{figure}
\figuremargin{\small
\begin{center}
\setlength{\unitlength}{0.007in}
\begin{picture}(510,485)(-250,-265)% increased height Tue 24/12/02
\thicklines
\newsavebox{\fatarr}
\savebox{\fatarr}(40,40){\begin{picture}(40,40)(-20,-20)
\put(-9,-10){\line(0,1){20}}
\put(10,-10){\line(0,1){20}}
\put(0,-20){\line(1,1){20}}
\put(0,-20){\line(-1,1){20}}
\end{picture}}
%\put(-250,-400){\framebox(500,700){}}
%\put(-200,-300){\framebox(400,500){}}
%\put(-150,-200){\framebox(300,300){}}
%\put(-100,-100){\framebox(200,100){}}
\thinlines
\put(-100,165){\fbox{\shortstack{Gather \\ DATA}}}
\put(20,165){\fbox{\shortstack{Create\\ alternative\\ MODELS}}}
\put(-20,110){\usebox{\fatarr}}
\put(-93,50){\fbox{\fbox{\shortstack{Fit each MODEL\\ to the DATA}}}}
\put(-20,-15){\usebox{\fatarr}}
\put(-132,-85){\fbox{\fbox{\shortstack{Assign preferences to the \\ alternative MODELS}}}}
\put(-240,-220){\fbox{\shortstack{Choose what \\
data to\\ gather next}}}
\put(-315,-60){\fbox{\shortstack{Gather\\more data}}}
\put(100,-220){\fbox{\shortstack{Decide whether\\ to create new\\ models}}}
\put(195,-60){\fbox{\shortstack{Create new\\ models}}}
\put(-65,-265){\fbox{\shortstack{Choose future \\ actions}}}
\thicklines
\put(-220,-140){\vector(0,1){50}}
\put(-220,0){\line(0,1){50}}
\put(-200,50){\oval(40,40)[tl]}
\put(-200,70){\vector(1,0){70}}
\put(220,-140){\vector(0,1){50}}
\put(220,0){\line(0,1){50}}
\put(200,50){\oval(40,40)[tr]}
\put(200,70){\vector(-1,0){70}}
\put(30,-120){\vector(1,-1){50}}
\put(-30,-120){\vector(-1,-1){50}}
\put(0,-120){\vector(0,-1){75}}
%\put(){\fbox{\shortstack{}}}
\end{picture}\\
\end{center}
}{
\caption[Where Bayesian inference fits into the data modelling process]{{Where Bayesian inference fits into the data modelling process.
%\small
This figure illustrates an abstraction of the part of the scientific
process in which data are collected and modelled. In particular, this
figure applies to pattern classification, learning, interpolation,
etc. The two double-framed boxes denote the two steps which involve
{\em inference}. It is only in those two steps that \Bayes\ theorem can
be used. Bayes does not tell you how to invent models, for example.
The first box, `fitting each model to the data', is the task of
inferring what the model parameters might be given the model and the
data. Bayesian methods may be used to find the most probable parameter values,
and error bars on those parameters. The result of applying Bayesian methods to
this problem is often little different from the answers given by
orthodox statistics.
The second inference task, model comparison in the light of
the data, is where Bayesian methods are in a class of their own. This
second inference problem requires a quantitative Occam's razor to
penalize over-complex models. Bayesian methods
can assign objective preferences to the alternative models in a way that
automatically embodies Occam's razor.
}}
\label{fig1}
}
\end{figure}
\subsection{The mechanism of the {B}ayesian razor:
the evidence and the {O}ccam factor}
Two levels of {inference} can often be distinguished in the
process of data modelling. At the first level of inference, we
assume that a particular model is true, and we fit that model to the
data, \ie, we infer what values its free parameters should plausibly
take, given the data. The results of this inference are often
summarized by the most probable parameter values, and error bars on
those parameters. This analysis is repeated for each model. The
second level of inference is the task of model comparison. Here we
wish to compare the models in the light of the data, and assign some
sort of preference or ranking to the alternatives.
\begin{aside}
Note that
both levels of {\em inference\/} are distinct from {\em \ind{decision
theory}}. The goal of inference is, given a defined hypothesis space
and a particular data set, to assign probabilities to
hypotheses. Decision theory typically chooses between alternative
{\em actions\/} on the basis of these probabilities so as to minimize
the expectation of a `loss function'. This chapter concerns inference
alone and no loss functions are involved. When we
discuss model comparison, this should not be construed as implying
model {\em choice\/}. Ideal Bayesian predictions do not involve choice
between models; rather, predictions are made by summing over all the
alternative models, weighted by their probabilities.
% (section \ref{sec.eb}).}}
\end{aside}
Bayesian methods are able consistently and quantitatively to solve
both the inference tasks. There is a popular \ind{myth} that states that
Bayesian methods only differ from orthodox statistical methods by the
inclusion of subjective priors, which are difficult to assign, and which
usually don't make much difference to the conclusions.
%[I hope
% this myth has already been dispelled by examples such
% as \exerciseref{ex.3doors}.]
It is true
that, at the first level of inference, a Bayesian's results will
often differ little from the outcome of an orthodox attack. What is
not widely appreciated is how a Bayesian performs the second level of
inference; this chapter will therefore focus on Bayesian model
comparison.
Model comparison is a difficult task because it is not possible
simply to choose the model that fits the data best: more complex\index{complexity control}\index{model comparison}
models can always fit the data better, so the \ind{maximum likelihood}
model choice would lead us inevitably to implausible,
over-parameterized models, which generalize poorly. Occam's razor is
needed.
Let us write down \Bayes\ theorem for the two levels of inference
described above, so as to see explicitly how Bayesian model
comparison works.\index{Bayes' theorem}
Each model $\H_i$ is assumed to have a vector of
parameters $\bw$. A model is defined by a collection of probability
distributions: a `prior' distribution $P(\bw\given \H_i)$, which states what
values the model's parameters might be expected to take; and a set of
conditional distributions, one for each value of $\bw$, defining the
predictions $P(D\given \bw,\H_i)$ that the model makes about the data $D$.
% when its parameters take a particular value $\bw$.
% The second of these is actually a collection of
% probability distributions, one for each value of$\bw$.
% Note that
% models with the same parameterization but different priors over the
% parameters are therefore defined to be different models.
\ben
\item {\bf Model fitting.}
At the first level of inference, we assume
that one model, the $i$th, say,
% $\H_i$
is true, and we
infer what the model's parameters $\bw$ might be,
given the data $D$.
Using \Bayes\ theorem, the {\dem posterior probability\/} of the
parameters $\bw$ is:
\begin{equation}
\label{i.POpre}
P(\bw\given D, \H_i) = \frac{P(D\given \bw,\H_i)P(\bw\given \H_i)}{P(D\given \H_i)},
\end{equation}
% In words:
that is,
\[
{\rm Posterior = \frac{Likelihood \times Prior}{Evidence} }.
\]
The normalizing constant $P(D\given \H_i)$ is commonly ignored since it is
irrelevant to the first level of inference, \ie, the inference of $\bw$;
but it becomes important in the second level of inference, and we
name it the {\dem\ind{evidence}\/} for $\H_i$. It is common practice to use
gradient-based methods to find the maximum of the posterior, which
defines the most probable value for the parameters, $\wmp$; it is
then usual to summarize the posterior distribution by the value of
$\wmp$, and error bars or confidence intervals on these best fit
parameters. Error bars can be obtained from the curvature of the
posterior; evaluating the Hessian at $\wmp$, $\bA =
\left. -\grad\grad \ln P(\bw\given D,\H_i)\right|_{\wmp}$, and
Taylor-expanding the log posterior probability with $\upDelta \bw = \bw - \wmp$:
\begin{equation}
P(\bw\given D, \H_i) \simeq
P(\wmp\given D, \H_i)\exp \left( - {\textstyle \dhalf}\upDelta \bw^{\T}
\bA \upDelta \bw \right),
\label{i.EB}
\end{equation}
we see that the posterior can be locally approximated as a Gaussian
with covariance matrix (equivalent to error bars)
$\bA^{-1}$. [Whether this approximation is good or not will depend on
the problem we are solving. Indeed, the maximum and mean of the
posterior distribution have no fundamental status in Bayesian
inference -- they both change under nonlinear
reparameterizations. Maximization of a posterior probability is only
useful if an approximation like equation (\ref{i.EB}) gives a good
summary of the distribution.]
\item {\bf Model comparison.}
At the second level of inference, we wish to infer which model is
most plausible given the data. The posterior probability of each
model is:
\begin{equation}
\label{i.EV.A}
P(\H_i\given D) \propto P(D \given \H_i) P(\H_i) .
%P(\H_i\given D) \propto \underbrace{P(D \given \H_i)} P(\H_i)
\end{equation}
Notice that the data-dependent term $P(D \given \H_i)$ is the evidence for
$\H_i$, which appeared as the normalizing constant in
(\ref{i.POpre}). The second term, $P(\H_i)$, is the subjective prior
over our hypothesis space, which expresses how plausible we thought
the alternative models were before the data arrived. Assuming that
we choose to assign equal priors $P(\H_i)$ to the alternative models,
{\em models $\H_i$ are ranked by evaluating the evidence.} The normalizing
constant $P(D) = \sum_i P(D \given \H_i) P(\H_i)$ has been omitted from
equation (\ref{i.EV.A}) because in the data-modelling
process we may develop new models after the data have arrived, when
an inadequacy of the first models is detected, for example. Inference
is open ended: we continually seek more probable models to account
for the data we gather.
% \een
To repeat the key idea:
%\begin{conclusionbox}
to rank alternative models $\H_i$, a
Bayesian evaluates the evidence $P(D\given \H_i)$.
%\end{conclusionbox}
%
This concept is very
general: the evidence can be evaluated for parametric and
`non-parametric' models alike; whatever our data-modelling task, a
regression problem, a classification problem, or a density estimation
problem, the evidence is a transportable quantity for
comparing alternative models. In all these cases the evidence
naturally embodies Occam's razor.
\een
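For illustration only, suppose that the hypothesis space is closed and contains just the two models $\H_a$ and $\H_c$ of the sequence example, with equal priors. (This closure is an assumption made here for the sake of a worked number; as noted above, inference is in general open ended.) Then the omitted normalizing constant can be restored, and the evidence ratio of about $4 \times 10^{7}$ becomes a posterior probability:
\begin{eqnarray*}
P(\H_a\given D) &=& \frac{P(D\given \H_a)}{P(D\given \H_a) + P(D\given \H_c)} \\
 &\simeq& \frac{4\times 10^{7}}{4\times 10^{7} + 1} \simeq 1 - 2.5 \times 10^{-8} .
\end{eqnarray*}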
\subsection{Evaluating the evidence}
Let us now study the evidence more closely to gain insight into how
the Bayesian Occam's razor works. The evidence is the normalizing
constant for equation (\ref{i.POpre}):
\begin{equation}
\label{evidence}
P(D\given \H_i) = \int P(D\given \bw,\H_i)P(\bw\given \H_i)\, \d \bw.
\end{equation}
\sloppy
For many problems the
posterior $P(\bw\given D,\H_i)\propto P(D\given \bw,\H_i)P(\bw\given \H_i)$ has a
strong peak at the most probable parameters $\wmp$ (figure
\ref{fig4}). Then, taking for simplicity the one-dimensional case,
the evidence can be approximated, using Laplace's method,
by the height of the peak of the
integrand $P(D\given \bw,\H_i)P(\bw\given \H_i)$ times its width, $\sigma_{w|D}$:
\fussy
%
%thesis/figs/fig4.tex
\begin{figure}
\figuremargin{\small%
\begin{center}
\setlength{\unitlength}{0.000595745in}
\begin{picture}(5625,2400)(250,1600)
\put(0,1700){\makebox(0,0)[bl]{\psfig{figure=\figs/occam_dd.ps,%
width=3.5 true in,height=1.383 true in}
% ,% height was 2.383 with old fig
% bbllx=0.0in,bblly=0.0in,%
% bburx=5.875in,bbury=4.0in}
}}
\put(3750,2000){\makebox(0,0)[b]{$\wmp$}}
\put(3700,2570){\makebox(0,0)[b]{$\sigma_{w|D}$}}
\put(3000,1625){\makebox(0,0)[t]{$\sigma_{w}$}}
\put(5500,1750){\makebox(0,0)[t]{$\bw$}}
\put(1500,2300){\makebox(0,0)[bl]{$P(\bw\given \H_i)$}}
\put(4000,3125){\makebox(0,0)[bl]{$P(\bw\given D,\H_i)$}}
\end{picture}\\
\end{center}
}{
\caption[abbrev]{{The Occam factor.}\indexs{Occam factor}
This figure shows the quantities that determine the Occam
factor for a hypothesis $\H_i$ having a single parameter
$\bw$. The prior distribution (solid line) for the parameter
has width $\sigma_{w}$. The posterior distribution (dashed
line) has a single peak at $\wmp$ with characteristic width
$\sigma_{w|D}$. The Occam factor is
$$\displaystyle\sigma_{w|D} P(\wmp\given \H_i) = \frac{\sigma_{w|D}}{\sigma_{w}}.$$
}
\label{fig4}
\label{fig.prior.post}
}
\end{figure}
%
\begin{equation}
\label{approx.evidence}
\begin{array}[t]{c@{\hspace{0.2cm}\simeq\hspace{0.2cm}}r@{\mbox{$\:\times\:$}}l}
P(D\given \H_i) & \strutc \underbrace{P(D\given \wmp,\H_i)} &
\underbrace{ P(\wmp\given \H_i) \, \sigma_{w|D} }. \\
{\rm Evidence} &{\rm Best~fit~likelihood} & {\rm Occam~factor }
\end{array}
\end{equation}
Thus the evidence is found by taking the best fit likelihood
that the model can achieve and multiplying it by an `{Occam factor}',
% \cite{G1},
which is a term with magnitude less than one that penalizes
$\H_i$ for having the parameter $\bw$.
\subsection{Interpretation of the {O}ccam factor}
The quantity $\sigma_{w|D}$ is the posterior uncertainty in $\bw$.
Suppose for simplicity that the prior $P(\bw\given \H_i)$ is uniform on
some large interval of width $\sigma_{w}$, representing the range of values of
$\bw$ that were possible {\em a priori}, according to $\H_i$ (figure
\ref{fig4}). Then $P(\wmp\given \H_i) = 1/\sigma_{w}$, and
\beq
\mbox{Occam factor}= \frac{\sigma_{w|D}}{\sigma_{w}},
\eeq
\ie, {\em the Occam factor is equal to the ratio of the posterior
accessible volume of $\H_i$'s parameter space to the prior accessible
volume,} or the factor by which $\H_i$'s hypothesis space collapses
when the data arrive.
% \cite{G1,Jeffreys}.
The model $\H_i$ can be
viewed as consisting of a certain number of exclusive submodels, of
which only one survives when the data arrive. The Occam factor is the
inverse of that number. The logarithm of the Occam factor is a
measure of the amount of information\index{information content}
we gain about the model's
parameters when the data arrive.
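As a small numerical illustration (the numbers are hypothetical and chosen only for convenience), suppose that the prior allows a range $\sigma_{w} = 100$ and that the data pin the parameter down to $\sigma_{w|D} = 1$. Then
\[
\mbox{Occam factor} = \frac{\sigma_{w|D}}{\sigma_{w}} = \frac{1}{100} , \qquad
\log_2 \frac{\sigma_{w}}{\sigma_{w|D}} = \log_2 100 \simeq 6.6 \mbox{ bits} ,
\]
so the model can be viewed as 100 exclusive submodels of which one survives, and the data supply about 6.6 bits of information about $\bw$.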
A\index{complexity control}\index{model comparison}
complex model having many parameters, each of which is free to vary
over a large range $\sigma_{w}$, will typically be penalized by a
stronger Occam factor than a simpler model. The Occam factor also
penalizes models that have to be finely tuned to fit the data,
favouring models for which the required precision of the parameters
$\sigma_{w|D}$ is coarse. The magnitude of the Occam factor is thus a
measure of complexity of the model;
% but, unlike the V-C dimension \cite{Abu1},
it relates to the complexity of the predictions that the
model makes in data space. This depends not only on the number of
parameters in the model, but also on the prior probability that the
model assigns to them. Which model achieves the greatest evidence is
determined by a trade-off between minimizing this natural complexity
measure and minimizing the data misfit. In
% further
contrast to
alternative measures of model complexity, the Occam factor for a
model is straightforward to evaluate: it simply depends on the error
bars on the parameters, which we already evaluated when fitting the
model to the data.
\begin{figure}[t]
\figuremargin{\small%
\begin{center}
\setlength{\unitlength}{1mm}
%\framebox{
\begin{picture}(116.5,95)(0,0)
\put(1,0){\makebox(0,0)[bl]{\psfig{figure=\figs/hyp8.ps,width=4.5in,angle=90}}}
\put(46.5,2.5){\makebox(0,0){$\sigma_{w}$}}
\put(54,11){\makebox(0,0)[t]{$\sigma_{w|D}$}}
\put(48,18){\makebox(0,0)[br]{$P(\bw\given \H_3)$}}
\put(50.5,26){\makebox(0,0)[br]{$P(\bw\given D,\H_3)$}}
\put(85,22){\makebox(0,0)[br]{$P(\bw\given \H_2)$}}
\put(86,32){\makebox(0,0)[br]{$P(\bw\given D,\H_2)$}}
\put(106,31){\makebox(0,0)[br]{$P(\bw\given \H_1)$}}
\put(107,47){\makebox(0,0)[br]{$P(\bw\given D,\H_1)$}}
\put(26,64){\makebox(0,0)[l]{$P(D\given \H_1)$}}
\put(15.3,71.5){\makebox(0,0)[bl]{$P(D\given \H_2)$}}
\put(14.5,82){\makebox(0,0)[bl]{$P(D\given \H_3)$}}
\put(4,20){\makebox(0,0)[r]{$D$}}
\put(4.5,68){\makebox(0,0)[r]{$D$}}
\put(65,13){\makebox(0,0)[t]{$\bw$}}
\put(93,13){\makebox(0,0)[t]{$\bw$}}
\put(113,13){\makebox(0,0)[t]{$\bw$}}
\end{picture}
% }
\end{center}
% I had to edit this ps file to correct the bb. from 0 0 770 552
% to % 500 0 1000 552 - this put the fig too low
% to 435 0 900 552
}{
\caption[abbrev]{{A hypothesis space} consisting of three exclusive
models, each having one parameter $\bw$, and a one-dimensional data
set $D$. The `data set' is a single measured value which differs
from the parameter $\bw$ by a small amount of additive noise. Typical
samples from the joint distribution $P(D,\bw,\H)$ are shown
by dots. (N.B., these are not data points.) The observed `data set'
is a single particular value for $D$ shown by the
dashed horizontal line. The
dashed curves below show the posterior probability of $\bw$ for each
model given this data set (\cf\ figure \protect\ref{fig.prior.post}). The evidence
for the different models is obtained by marginalizing onto the $D$
axis at the left-hand side (\cf\ figure \protect\ref{fig.pdh}).
}
\label{fig.modelspace}
}
\end{figure}
Figure \ref{fig.modelspace} displays an entire hypothesis space so
as to illustrate the various probabilities in the analysis. There
are three models, $\H_1, \H_2, \H_3$, which have equal prior
probabilities. Each model has one parameter $\w$ (each shown on a
horizontal axis), but assigns a different prior range $\sigW$ to that
parameter. $\H_3$ is the most `flexible' or `complex' model,
assigning the broadest prior range. A one-dimensional data space is
shown by the vertical axis. Each model assigns a joint probability
distribution $P(D,\w\given \H_i)$ to the data and the parameters,
illustrated by a cloud of dots. These dots represent random samples
from the full probability distribution. The total number of dots in
each of the three model subspaces is the same, because we assigned
equal prior probabilities to the models.
When a particular data set $D$ is received (horizontal line), we
infer the posterior distribution of $\w$ for a model ($\H_3$, say) by
reading out the density along that horizontal line, and
normalizing. The posterior probability $P(\w\given D,\H_3)$ is shown by the
dashed curve at the bottom. Also shown is the prior distribution
$P(\w\given \H_3)$ (\cf\ figure \ref{fig.prior.post}). [In the case of model
$\H_1$ which is very poorly matched to the data, the shape of the
posterior distribution will depend on the details of the tails of the
prior $P(\bw\given \H_1)$ and the likelihood $P(D\given \bw,\H_1)$; the curve
shown is for the case where the prior falls off more strongly.]
We obtain figure \ref{fig3} by marginalizing the joint distributions
$P(D,\w\given \H_i)$ onto the $D$ axis at the left-hand side.
% and normalizing them.
% This procedure gives the predictions of each model in data space.
For the data set $D$ shown by the dashed horizontal line, the
evidence $P(D\given \H_3)$ for the more flexible model $\H_3$ has a smaller
value than the evidence for $\H_2$. This is because $\H_3$ placed
less predictive probability (fewer dots) on that line. In terms of the
distributions over $\w$, model $\H_3$ has smaller evidence because
the Occam factor $\sigma_{w|D}/\sigma_{w}$ is smaller for $\H_3$ than
for $\H_2$. The simplest model $\H_1$ has the smallest evidence of
all,
% because it assigned very low probability to $D$.
because the best fit that it can achieve to the data $D$ is very
poor. Given this data set, the most probable model is $\H_2$.
\subsection{Occam factor for several parameters}
% $\bw$ is $k$-dimensional, and if
If the posterior is well
approximated by a Gaussian, then the Occam factor is obtained from
the determinant of the corresponding covariance matrix (\cf\ equation
(\ref{approx.evidence}) and \chref{ch.laplace}):
\begin{equation}
\label{general.occam}
\begin{array}[t]{c@{\hspace{0.2cm}\simeq\hspace{0.2cm}}r@{\mbox{$\:\times\:$}}c}
P(D\given \H_i) & \underbrace{P(D\given \wmp,\H_i)} &
\underbrace{P(\wmp\given \H_i) \,
{\det}^{-\half} (\bA/2\pi) }, \\
{\rm Evidence} &{\rm Best~fit~likelihood} & {\rm Occam~factor }
\end{array}
\end{equation}
where $\bA = \left. -\grad\grad \ln P(\bw\given D,\H_i)\right|_{\wmp}$ is the Hessian which we
evaluated when we calculated the error bars on $\wmp$ (equation
(\ref{i.EB}) and \chref{ch.laplace}). As the amount of data collected
% , $N$,
increases, this
Gaussian approximation is expected to become increasingly accurate.\index{approximation!by Gaussian}
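To see how equation (\ref{general.occam}) generalizes the one-parameter result, consider a sketch under two simplifying assumptions that are not made in the text: the prior is separable and uniform, with range $\sigma_{w_j}$ for the $j$th of $k$ parameters, and the Hessian is diagonal, $\bA = \mbox{diag}(1/s_j^2)$, where $s_j$ denotes the Gaussian posterior standard deviation of the $j$th parameter. Then $P(\wmp\given \H_i) = \prod_j 1/\sigma_{w_j}$ and ${\det}^{-\half}(\bA/2\pi) = \prod_j \sqrt{2\pi}\, s_j$, so
\[
\mbox{Occam factor} = \prod_{j=1}^{k} \frac{\sqrt{2\pi}\, s_j}{\sigma_{w_j}} .
\]
Each parameter contributes its own ratio of posterior to prior accessible range, and for $k=1$ we recover equation (\ref{approx.evidence}) with $\sigma_{w|D} = \sqrt{2\pi}\, s_1$.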
In summary, Bayesian model comparison is a simple extension of maximum
likelihood model selection: {\em the evidence is obtained by
multiplying the best fit likelihood by the Occam factor.}
\index{Occam factor}To evaluate the Occam factor we need only the Hessian $\bA$, if the
Gaussian approximation is good. Thus the Bayesian method of model
comparison by evaluating the evidence is no more
computationally demanding than the task of finding for each model the best fit
parameters and their error bars.
% see NOTES.tex for stolen material
\section{Example}
Let's return to the example that opened this chapter.
Are there one or two boxes behind the \ind{tree}\index{box}\index{image analysis}
in \figref{fig.dogs1}?
Why do \ind{coincidence}s make us suspicious?\index{suspicious coincidences}
Let's assume the image of the area round the trunk and box
has a size of 50 pixels, that the trunk is 10 pixels wide,
and that 16 different colours of
boxes can be distinguished.
The theory $\H_1$ that says there is one box near the trunk
has four free parameters: three coordinates defining the top three edges
of the box, and one parameter giving the box's colour. (If boxes
could levitate, there would be five free parameters.)
The theory $\H_2$ that says there are two boxes near the trunk
has eight free parameters (twice four), plus a ninth, a binary
variable that
indicates which of the two boxes is closer to the viewer.
\marginfig{\footnotesize
\begin{center}
\twoboxesorone
\end{center}
\caption[a]{How many boxes are behind the tree?}
\label{fig.dogs3again}
}
What is the evidence for each model?
We'll do $\H_1$ first.
We need a prior on the parameters to evaluate the evidence.
For convenience, let's work in pixels.
Let's assign a separable prior to the horizontal location of the box,
its width, its height, and its colour.
The height could have any of, say, 20 distinguishable values,
so could the width, and so could the location.
The colour could have any of 16 values.
We'll put uniform priors over these variables. We'll
ignore all the parameters associated with other objects in the image,
since they don't come into the model comparison between
$\H_1$ and $\H_2$.
The evidence is
\beq
P(D\given \H_1) = \frac{1}{20} \frac{1}{20} \frac{1}{20} \frac{1}{16}
\eeq
since only one setting of the parameters fits the data,
and it predicts the data perfectly.
As for model $\H_2$,
six of its nine parameters are well-determined,
and three of them are partly-constrained by the data.
 If the left-hand box is further away, for example,
then its width is at least 8 pixels
and at most 30; if it's the closer of the two boxes,
then its width is between 8 and 18 pixels.
(I'm assuming here that the visible portion of the
left-hand box is about 8 pixels wide.)
To get the evidence we need to sum up the prior
probabilities of all viable hypotheses.
To do an exact calculation, we need to be more specific
about the data and the priors, but let's just get
the ballpark answer, assuming that the two unconstrained real
variables have half their values available,
and that the binary variable is completely undetermined. (As
an exercise, you can make an explicit model and work
out the exact answer.)
\beq
P(D\given \H_2) \simeq \frac{1}{20} \frac{1}{20} \frac{10}{20} \frac{1}{16}
\frac{1}{20} \frac{1}{20} \frac{10}{20} \frac{1}{16}
\frac{2}{2} .
\eeq
Thus the posterior probability ratio
is (assuming equal prior probability):
\beqan
\frac{ P(D\given \H_1)P(\H_1)}
{P(D\given \H_2) P(\H_2)}
& =& \frac{1}{ \frac{1}{20} \frac{10}{20} \frac{10}{20} \frac{1}{16} }
\label{eq.fourfactors}
\\
& =& {20 \times 2 \times 2 \times 16} \:\:\simeq\:\: 1000/1 .
\eeqan
So the data are roughly 1000 to 1 in favour of the simpler hypothesis.
The four factors in (\ref{eq.fourfactors}) can be interpreted in terms of
Occam factors. The more complex model
has four extra parameters for sizes and colours -- three
for sizes, and one for colour.
It has to pay two big Occam factors (\dfrac{1}{20} and \dfrac{1}{16}) for
the highly suspicious \ind{coincidence}s that the two box heights match
exactly and the two colours match exactly;
and it also pays two lesser Occam factors for the two lesser coincidences
that both boxes happened to have one of their edges conveniently hidden
behind a tree or behind each other.
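
 As a rough check on the arithmetic, and assuming the ballpark
 factors used above, the ratio can be written out in full:
\beq
	\frac{P(D\given \H_1)}{P(D\given \H_2)}
	\simeq 20 \times 2 \times 2 \times 16 = 1280 ,
	\:\:\:\:\:\:
	\log_2 1280 \simeq 10.3 ,
\eeq
 so, under these illustrative priors, the data favour $\H_1$ by
 about ten bits.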
% MDL stuff --
% stolen from nn_occam.tex
\section{Minimum description length (MDL)}
\label{MDL}
\index{minimum description length}
A complementary view of Bayesian model comparison is obtained by
replacing probabilities of events by the lengths in bits of messages
that communicate the events without loss to a receiver. Message
lengths $L(\bx)$ correspond to a probabilistic model over events
$\bx$ via the relations:
\begin{equation}
P(\bx) = 2^{-L(\bx)}, \:\: L(\bx) = - \log_2 P(\bx) .
\label{p_l}
\end{equation}
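 As an illustrative sketch of this correspondence, take the ballpark
 evidences from the box example earlier in this chapter (the numbers
 inherit every assumption made there). A message that
 communicates the image $D$ under each model would have length
\beqan
	-\log_2 P(D\given \H_1) & =& 3 \log_2 20 + \log_2 16 \:\:\simeq\:\: 17.0 \mbox{ bits} , \\
	-\log_2 P(D\given \H_2) & \simeq& 4 \log_2 20 + 2 \log_2 2 + 2 \log_2 16 \:\:\simeq\:\: 27.3 \mbox{ bits} .
\eeqan
 The one-box model describes the image in about ten fewer bits,
 which is the same preference as the evidence ratio of roughly 1000 to 1.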
% Non-integer coding lengths can be handled by the arithmetic coding
% procedure \cite{arith_coding}.
The MDL principle \cite{WB} states that one should prefer models
that can communicate the data in the smallest number of
bits. Consider a two-part message that states which model, $\H$, is to be
used, and then communicates the data $D$ within that model, to some
pre-arranged precision $\delta D$. This produces a message of length
$L(D,\H) = L(\H) + L(D\given \H)$. The lengths $L(\H)$ for different $\H$
define an implicit prior $P(\H)$ over the
alternative models. Similarly $L(D\given \H)$ corresponds to a density
$P(D\given \H)$. Thus, a procedure for assigning message lengths can be
mapped onto posterior probabilities:
\begin{eqnarray}
L(D,\H) &=& -\log P(\H) - \log \left( P(D\given \H) \delta D \right) \\
&=& -\log P(\H\given D) + {\rm const} .
\end{eqnarray}
In principle, then, MDL can always be interpreted as Bayesian model
comparison and {\em vice versa}. However, this simple discussion has
not addressed how one would actually evaluate the key data-dependent
term $L(D\given \H)$, which corresponds to the evidence for $\H$. Often,
this message is imagined as being subdivided into a parameter block
and a data block (figure \ref{fig.mdl}). Models with a small number
of parameters have only a short parameter block but do not fit the
data well, and so the data message (a list of large residuals) is
long. As the number of parameters increases, the parameter block
lengthens, and the data message becomes shorter. There is an optimum
model complexity ($\H_2$ in the figure) for which the sum is
minimized.
% these lengths are defined in itprnnchapter
\begin{figure}[t]
\figuremargin{
\small
\begin{center}
\begin{tabular}{rl}
% & \makebox[0.5\minch]{\ostruta$L(\H)$} \makebox[0.8\minch]{\ostruta$L(\w^*\given \H)$} \makebox[3.8\minch]{\ostruta$L(D\given \w^*,\H)$} \\[0.1\minch]
$\H_1$:\ostrutb & \framebox[0.5\minch]{\ostruta$L(\H_1)$}
\framebox[0.9\minch]{\ostruta$L(\w^*_{(1)}\given \H_1)$}
\framebox[3.2\minch]{\ostruta$L(D\given \w^*_{(1)},\H_1)$} \\[0.1in]
$\H_2$:\ostrutb & \framebox[0.5\minch]{\ostruta$L(\H_2)$}
\framebox[1.4\minch]{\ostruta$L(\w^*_{(2)}\given \H_2)$}
\framebox[2.2\minch]{\ostruta$L(D\given \w^*_{(2)},\H_2)$} \\[0.1in]
$\H_3$:\ostrutb & \framebox[0.5\minch]{\ostruta$L(\H_3)$}
\framebox[2.2\minch]{\ostruta$L(\w^*_{(3)}\given \H_3)$}
\framebox[1.8\minch]{\ostruta$L(D\given \w^*_{(3)},\H_3)$} \\
\end{tabular}
\end{center}
}{
\caption[abbrev]{{A popular view of model comparison by \inds{minimum description length}.}
Each model $\H_i$ communicates the data $D$ by sending the
identity of the model, sending the best fit parameters of the
model $\w^*$, then sending the data relative to those
parameters. As we proceed to more complex models the length
of the parameter message increases. On the other hand, the
length of the data message decreases, because a complex model
is able to fit the data better, making the residuals
smaller. In this example the intermediate model $\H_2$
achieves the optimum trade-off between these two trends.
}
\label{fig.mdl}
}
\end{figure}
This picture glosses over some subtle issues. We have not specified
the precision to which the parameters $\bw$ should be sent. This
precision has an important effect (unlike the precision $\delta D$ to
which real-valued data $D$ are sent, which, assuming $\delta D$ is
small relative to the noise level, just introduces an additive
constant). As we decrease the precision to which $\bw$ is sent, the
parameter message shortens, but the data message typically lengthens
because the truncated parameters do not match the data so well. There
is a non-trivial optimal precision. In simple Gaussian cases it is
possible to solve for this optimal precision \cite{Wallace_Freeman},
and it is closely related to the posterior error bars on the
parameters, $\bAI$, where $\bA = -\grad \grad \ln P(\w\given D,\H)$. It
turns out that the optimal parameter message length is virtually
identical to the log of the Occam factor\index{Occam factor} in equation
(\ref{general.occam}). (The random element involved in parameter
truncation means that the encoding is slightly sub-optimal.)
With care, therefore, one can replicate Bayesian results in MDL
terms. Although some of the earliest work on complex model
comparison involved the MDL framework \cite{Patrick_Wallace}, MDL has
no apparent advantages over the direct probabilistic approach.
MDL does have its uses as a pedagogical tool. The description length
concept is useful for motivating prior probability distributions.
Also, different ways of breaking down the task of communicating data
using a model can give helpful insights into the modelling process,
as will now be illustrated.
\subsubsection{On-line learning and cross-validation.}
\begin{sloppypar}
In cases where the data consist of a sequence of points $D = \bt^{(1)}, \bt^{(2)}, \ldots , \bt^{(N)}$,
the log evidence can be decomposed as a sum of `on-line' predictive
performances:
\begin{eqnarray}
\hspace*{-1em}
\log P(D\given \H) &=&
\log P(\bt^{(1)}\given \H) +
\log P(\bt^{(2)}\given \bt^{(1)},\H) \nonumber \\
\hspace*{-1em}
&& \hspace{-1in} + \log P(\bt^{(3)}\given \bt^{(1)},\bt^{(2)},\H) + \cdots
+ \log P(\bt^{(\ssN)}\given \bt^{(1)}\ldots \bt^{(\ssNM)},\H) .
\end{eqnarray}
This decomposition can be used to explain the difference between the
evidence and `leave-one-out \ind{cross-validation}' as measures of
predictive ability. Cross-validation examines the average value of
 just the last term, $\log P(\bt^{(\ssN)}\given \bt^{(1)}\ldots
\bt^{(\ssNM)},\H)$, under random re-orderings of the data. The
evidence, on the other hand, sums up how well the model predicted all
the data, starting from scratch.
\end{sloppypar}
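
 As a small illustration of the decomposition (a toy model that is
 an assumption of this sketch, not part of the discussion above), suppose
 the data are binary outcomes $t^{(n)} \in \{0,1\}$, each equal to 1
 with an unknown probability $p$ that is assigned a uniform prior under $\H$.
 After $n_1$ ones have been seen in $n$ outcomes, the on-line predictive probability
 of a 1 is $(n_1+1)/(n+2)$. For the sequence $D = 1,1,0$ the
 three on-line terms give
\beq
	P(D\given \H) = \frac{1}{2} \times \frac{2}{3} \times \frac{1}{4} = \frac{1}{12} ,
\eeq
 which agrees with the direct evaluation of the evidence,
\beq
	P(D\given \H) = \int_0^1 \! \d p \: p^2 (1-p) = \frac{1}{3} - \frac{1}{4} = \frac{1}{12} .
\eeq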
\subsection{The `bits-back' encoding method.}
\label{sec.bitsback}
Another MDL thought experiment \cite{Hinton_bb} involves\index{bits back}\index{Hinton, Geoffrey E.}
incorporating random bits into our message. The data are communicated
using a parameter block and a data block. The parameter vector sent
is a random sample from the posterior,
$P(\w\given D,\H) =
P(D\given \w,\H) P(\w\given \H) / P(D\given \H)$. This sample $\w$ is sent to an
arbitrary small granularity $\delta \w$ using a message length
$L(\w\given \H) = -\log [P(\w\given \H) \delta \w]$. The data are encoded
relative to $\w$ with a message of length $L(D\given \w,\H) = - \log
[P(D\given \w,\H) \delta D]$. Once the data message has been received, the
random bits used to generate the sample $\w$ from the posterior can
be deduced by the receiver. The number of bits so recovered is
$-\! \log [P(\w\given D,\H) \delta \w]$. These recovered bits need not count
towards the message length, since we might use some other
optimally-encoded message as a random bit string, thereby communicating that
message at the same time. The net description cost is therefore:
\begin{eqnarray}
L(\w\given \H) + L(D\given \w,\H) - \mbox{`Bits back'} &=&
-\log \frac{ P(\w\given \H) \, P(D\given \w,\H) \, \delta D }{ P(\w\given D,\H) } \nonumber \\
&=& -\log P(D\given \H) -\log \delta D .
\end{eqnarray}
Thus this thought experiment has yielded the optimal description length.
Bits-back encoding has been turned into a practical
compression\index{source code!for complex sources}\index{latent variable model!compression}
method for data modelled with latent variable models by \citeasnoun{frey-98}.\index{Frey, Brendan J.}
\label{bits_back}
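
 The cancellation in the net description cost can be spelled out
 as follows (a sketch that uses only the message lengths defined above).
 The granularity $\delta \w$ appears once in $L(\w\given \H)$ and once in the
 bits recovered, with opposite signs, so it drops out:
\beqan
	\lefteqn{ -\log [ P(\w\given \H) \delta \w ] - \log [ P(D\given \w,\H) \delta D ]
	+ \log [ P(\w\given D,\H) \delta \w ] } \nonumber \\
	& =& -\log \frac{ P(\w\given \H) \, P(D\given \w,\H) \, \delta D }{ P(\w\given D,\H) } .
\eeqan
 Substituting Bayes' theorem,
 $P(\w\given D,\H) = P(D\given \w,\H) P(\w\given \H) / P(D\given \H)$,
 leaves $-\log [ P(D\given \H) \delta D ]$, whichever sample $\w$ was sent.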
\section*{Further reading}
Bayesian methods are introduced and contrasted with
sampling-theory statistics in \cite{Jaynes.intervals,G1,Loredo}. The
Bayesian Occam's razor is demonstrated on model problems in
\cite{G1,MacKay92a}. Useful textbooks are
\cite{Box_and_Tiao_text,Berger}.
One debate worth understanding is the question of whether
it's permissible to use \ind{improper prior}s\index{prior!improper}
in Bayesian inference \cite{dawidJaynes}.
If we want to do model comparison (as discussed in this chapter),
it is essential to
use proper priors -- otherwise the evidences and the
Occam factors are meaningless. Only
when one has no intention to do model comparison may it be safe
to use improper priors, and even in such cases there are
pitfalls, as Dawid \etal\ explain. I would agree with their
advice to {\em always use proper priors},
tempered by an encouragement to be smart when making calculations,
recognizing
opportunities for approximation.
% to approximate proper by improper priors.
\section{Exercises}
\exercisxC{3}{ex.uniformslope}{
Random variables $x$ come independently from a probability distribution
$P(x)$.
According to model $\H_0$,
$P(x)$ is a uniform distribution
\beq
P(x\given \H_0) = \frac{1}{2} \:\:\: \:\:\:\:\:\: x \in (-1,1) .
\eeq
\amarginfignocaption{c}{
\begin{center}\mbox{\epsfbox{metapost/occam.1}}\\[0.23in]
\mbox{\epsfbox{metapost/occam.2}}
\end{center}
%\caption[a]{
%}
}According to model $\H_1$, $P(x)$ is a nonuniform distribution with
an unknown parameter $m \in (-1,1)$:
\beq
P(x\given m,\H_1) = \frac{1}{2} (1+m x) \:\:\: \:\:\: \:\:\: x \in (-1,1) .
\eeq
Given the data $D = \{ 0.3, 0.5, 0.7, 0.8, 0.9\}$,
what is the evidence for $\H_0$ and $\H_1$?
}
\exercisxC{3}{ex.slopeornot}{
Datapoints $(x,t)$ are believed to come from a straight
line. The experimenter chooses $x$, and
$t$ is Gaussian-distributed
about
\beq
y = w_0 + w_1 x
\eeq
\amarginfignocaption{b}{
\begin{center}\mbox{\epsfbox{metapost/occam.3}}
\end{center}
%\caption[a]{
%}
}with variance $\sigma_{\nu}^2$.
According to model $\H_1$, the straight line is horizontal, so $w_1 = 0$.
According to model $\H_2$, $w_1$ is a parameter with
prior distribution $\Normal(0,1)$. Both models
assign a prior distribution $\Normal(0,1)$ to $w_0$.
Given the data set
$D = \{ (-8,8), (-2,10), (6,11)\}$,
and assuming the noise level is $\sigma_{\nu} = 1$,
what is the evidence for each model?
}
\exercisxC{3}{ex.dicebiased}{
A six-sided die is rolled 30 times and the
numbers of times each face came up were $\bF = \{ 3,3,2,2, 9,11 \}$.
What is the probability that the die is a perfectly fair die (`$\H_0$'),
assuming the alternative hypothesis $\H_1$ says that the
die has a biased distribution $\bp$, and the prior density for $\bp$
is uniform over the simplex $p_i \geq 0$, $\sum_i p_i =1$?
Solve this problem two ways:
exactly, using the helpful Dirichlet formulae (\ref{eq.dirichletdefn}, \ref{lang.z}),
and approximately, using Laplace's method. Notice that your choice
of basis for the \index{Laplace's method}Laplace approximation is important. See \citeasnoun{MacKay96:laplace}
for discussion of this exercise.
}
\exercisxC{3}{ex.florida}{
The influence of \ind{race} on the imposition of the \ind{death penalty} for murder
in \ind{America} has been much studied.
The following three-way table classifies 326 cases
in which the defendant was convicted of \ind{murder}.
The three variables are the defendant's \ind{race}, the victim's race,
and whether the defendant was sentenced to death.
(Data from M.~Radelet, `Racial characteristics and imposition of the death penalty,'
 {\em American Sociological Review}, {\bf 46} (1981), pp.\,918--927.)
\begin{center}
\begin{tabular}{rcccrcc}\toprule
\multicolumn{3}{c}{
White defendant } && \multicolumn{3}{c}{
Black defendant }
\\ \cmidrule{1-3}\cmidrule{5-7}
&\multicolumn{2}{c}{
Death penalty } &&& \multicolumn{2}{c}{
Death penalty }
\\
&Yes & No &&& Yes & No \\ \cmidrule{2-3}\cmidrule{6-7}
White victim &19 & 132 && White victim& 11& 52 \\
Black victim &0 & 9 && Black victim& 6 & 97\\
\bottomrule\end{tabular}
\end{center}
%From 1979 to 2001, the state of Florida executed
% fifty-one convicted murderers.
% The dataIs there any racial bias in the decision
% of whether the murderer receives the death
% penalty?
It seems that
the death penalty was applied much more often when the victim was white
 than when the victim was black.
 When the victim was \ind{white} 14\% of defendants got the death penalty,
 but when the victim was \ind{black} only about 5\% of defendants
 got the \ind{death penalty}.
% And white defendants overwhelmingly (94.3% of cases) killed white victims.
[Incidentally, these data provide an example of
a phenomenon known as {\dem\ind{Simpson's paradox}}:
a higher fraction of white defendants
are sentenced to death overall, but in cases
involving black victims
a higher fraction of black defendants are sentenced to death
and in cases
involving white victims
a higher fraction of black defendants are sentenced to death.]
\marginfig{
\begin{center}
\begin{tabular}{cc}
\mbox{\epsfbox{metapost/occam.4}}&
\mbox{\epsfbox{metapost/occam.5}}\\
\mbox{\epsfbox{metapost/occam.6}}&
\mbox{\epsfbox{metapost/occam.7}}\\
\end{tabular}
\end{center}
\caption[a]{Four hypotheses concerning the dependence
of the imposition of the death penalty $d$
on the race of the victim $v$ and the race of the convicted murderer $m$.
$\H_{01}$, for example, asserts that the probability
of receiving the death penalty does depend on the murderer's race,
but not on the victim's.
}\label{fig.murder}
}
Quantify the evidence for the four alternative hypotheses
shown in \figref{fig.murder}.
I should mention that I don't believe any of these models
is adequate: several additional variables are important in murder
cases, such as whether the victim and murderer knew each other,
whether the murder was premeditated, and whether the defendant
had a prior criminal record; none of these variables
is included in the table.
So this is an academic exercise in model comparison rather than
a serious study of racial bias in the state of \ind{Florida}.
The hypotheses are shown as graphical models, with arrows
showing dependencies between the variables $v$ (victim race),
$m$ (murderer race), and $d$ (whether death penalty given).
Model $\H_{00}$ has only one free parameter, the probability of
receiving the death penalty; model $\H_{11}$ has four such parameters,
 one for each state of the variables $v$ and $m$. Assign uniform
 priors to these parameters. How sensitive are the conclusions
to the choice of prior?
}
\dvips
\prechapter{About Chapter}
The last couple of chapters have assumed that
a Gaussian approximation to the probability distribution
we are interested in is adequate.
What if it is not?
We have already seen an example -- clustering -- where
the likelihood function is multimodal, and has nasty
unboundedly-high spikes in certain locations in the parameter space;
so maximizing the posterior probability and fitting
% likelihood
a Gaussian is not always going to work.
This difficulty with Laplace's method is one motivation
for being interested in Monte Carlo methods. In fact, Monte Carlo methods
provide a general-purpose set of tools with applications
in Bayesian data modelling and many other fields.
%\begin{quotation}
This chapter describes a sequence of methods:
{\dbf importance sampling}, {\dbf rejection sampling}, the {\dbf
Metropolis method},
{\dbf Gibbs sampling} and {\dbf slice sampling}. For each method, we
discuss whether the method is expected to be useful for
high-dimensional problems such as arise in inference
with graphical models. [A graphical
model is a probabilistic
model in which dependencies and
independencies of variables
are represented by
edges in a graph
whose nodes are the variables.]
Along the way, the terminology of Markov chain
Monte Carlo methods is presented.
% [This unconventional ordering
% has been chosen because the Metropolis and Gibbs sampling methods
% can be readily understood without knowing the terminology, and
% concepts such as `ergodicity' and `detailed balance' are probably
% easiest to learn once the reader has become familiar with
% some Markov chains.]
% The chapter concludes with
The subsequent chapter gives
a discussion of advanced methods
% , including methods
for reducing random walk behaviour.
% Chapter \ref{ch.mcexact} discusses
For details of Monte Carlo methods, theorems and proofs and a full list
of references, the reader is directed to
\citeasnoun{Neal_dop},
\citeasnoun{MCMC96}, and \citeasnoun{Tanner96}.
%\end{quotation}
In this chapter
I will use the word `\ind{sample}' in the following sense:
a sample from a distribution $P(\bx)$ is a single realization
$\bx$ whose probability distribution is $P(\bx)$.
This contrasts with the alternative usage in statistics,
where `sample' refers to a collection of realizations $\{ \bx\}$.
 When we discuss transition probability matrices,
 I will use a right-multipli\-cation convention:
I like my matrices to act to the right, preferring
\beq
\bu = \bM \bv
\eeq
to
\beq
\bu^{\T} = \bv^{\T} \bM^{\T} .
\eeq
A \ind{transition probability matrix} $T_{ij}$ or $T_{i|j}$
specifies the probability, given the current state is
$j$, of making the transition from $j$ to $i$. The columns
of $\bT$ are probability vectors.
If we write down a transition probability density,
we use the same convention for the order of its arguments:
$T(x';x)$ is a transition probability density from $x$
to $x'$. This unfortunately means that
you have to get used to reading from right to left --
the sequence $xyz$ has probability $T(z;y) T(y;x) \pi(x)$.
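
 For instance, a two-state chain (an illustrative example whose
 numbers are arbitrary) might have
\beq
	\bT = \left[ \begin{array}{cc} 0.9 & 0.5 \\ 0.1 & 0.5 \end{array} \right] ,
\eeq
 in which each column sums to one. If the current distribution
 over states is $\bp = (1,0)^{\T}$, then one step of the chain
 gives $\bT \bp = (0.9,0.1)^{\T}$, the first column of $\bT$.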
% I hope the consistency of this notation is helpful.
% {Monte Carlo methods}
\ENDprechapter
\chapter{Monte Carlo Methods}
\label{ch.mc}
%
%
%
% this includes slice.tex
%
% \section{Motivation}
\newcommand{\intdx}{\int \! \d^N \bx \:}% was \sum_{\bx}
\newcommand{\intdxpp}{\int \! \d^N \bx'' \:}% was \sum_{\bx}
% \newcommand{\citeasnoun}[1]{\citeauthor{#1}\ \shortcite{#1}}
% \newcommand{\quotecite}[1]{\citeauthor{#1}'s\ \shortcite{#1}}
\newcommand{\mcn}{n}
%
%
\section{The problems to be solved}
%
\label{sec.mcproblemsdefined}
\noindent
\ind{Monte Carlo methods} are computational techniques that
make use of \ind{random} numbers.
The aims of Monte Carlo methods are to solve one or both of the following
problems.
\begin{description}
\item[Problem 1\mycolon] to generate samples
$\{ \bx^{(r)}\}_{r=1}^{R}$
from a given probability distribution $P(\bx)$.
\item[Problem 2\mycolon]
to estimate expectations of functions under this distribution,
for example
\beq
\Phi = \left< \phi(\bx) \right> \equiv \intdx\: P(\bx) \phi(\bx) .
\label{eq.prob2}
\eeq
\end{description}
%
% we may also be interested in estimating distributions of f(x)
%
%
The probability distribution $P(\bx)$, which we call
the {\dbf target density}, might be a distribution
from statistical physics or a conditional distribution
arising in data modelling -- for example, the posterior probability
of a model's parameters given some observed data.
We will generally
assume
that $\bx$ is an $N$-dimensional vector with real components $x_n$,
but we will sometimes consider discrete spaces also.
Simple examples of functions $\phi(\bx)$ whose
expectations we might be interested in include the
first and second moments
of quantities that we wish to predict, from
which we can compute
means and variances; for example if some quantity
$t$ depends on $\bx$, we can find the mean and variance of $t$ under
$P(\bx)$ by finding the expectations of the functions
$\phi_1(\bx) = t(\bx)$ and
$\phi_2(\bx) = (t(\bx))^2$,
\beq
\Phi_1 \equiv \Exp [ \phi_1(\bx) ] \: \mbox{ and } \:
\Phi_2 \equiv \Exp [ \phi_2(\bx) ]
,
\eeq
then using
% from which we can obtain
\beq
\bar{t} = \Phi_1 \: \mbox{ and } \: \var( t ) = \Phi_2 - \Phi_1^2 .
\eeq
% ; all of this chapter's
% discussions apply to discrete spaces too, with the replacement of
% `$\intdx$' by $\sum_{\bx}$ throughout.]
It is assumed that $P(\bx)$ is sufficiently complex that we cannot evaluate
these expectations by exact methods; so we are interested
in Monte Carlo methods.
%
% point out, maybe in pre-chapter, that approximate methods like
% laplace's method are not good in general because typical set aint associated
% with maxima. Show alpha pictures.
%
We will concentrate on the first problem (sampling), because
% One way of solving the second problem (estimation)
% is to solve the first problem (sampling).
if we have solved it, then we
can solve the second problem by using the random
samples $\{ \bx^{(r)}\}_{r=1}^{R}$ to give the estimator
\beq
\hat{\Phi} \equiv \frac{1}{R} \sum_{r} \phi( \bx^{(r)} ) .
\label{eq.mc.est}
\eeq
If the vectors $\{ \bx^{(r)}\}_{r=1}^{R}$ are generated
from $P(\bx)$ then the expectation of $\hat{\Phi}$ is $\Phi$.
Also, as the number of samples $R$ increases, the variance of $\hat{\Phi}$
will decrease as $\dfrac{\sigma^2}{R}$, where
$\sigma^2$ is the variance of $\phi$,
\beq
\sigma^2 = \intdx\: P(\bx) (\phi(\bx)-\Phi)^2 .
\eeq
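 The $1/R$ scaling follows in one line if the samples are independent
 (an assumption of this sketch): the variance of a sum
 of independent terms is the sum of their variances, so
\beq
	\var ( \hat{\Phi} ) = \frac{1}{R^2} \sum_{r=1}^{R} \var \! \left( \phi( \bx^{(r)} ) \right)
	= \frac{1}{R^2} \, R \, \sigma^2 = \frac{\sigma^2}{R} .
\eeq
 For example, with $R=12$ independent samples the standard deviation
 of $\hat{\Phi}$ is $\sigma/\sqrt{12} \simeq 0.29 \, \sigma$.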
This is one of the important properties of Monte Carlo\index{key points!Monte Carlo}
methods.\index{Monte Carlo methods!dependence on dimensionality} \medskip
\noindent
\begin{conclusionbox}
{The accuracy of the Monte Carlo estimate
(\ref{eq.mc.est}) depends only on the
variance of $\phi$, not
on
% is independent of
the dimensionality of the space sampled.}
To be precise, the variance of $\hat{\Phi}$
goes as $\linefrac{\sigma^2}{R}$. So regardless of the
dimensionality of $\bx$, it may be that as few as a dozen {independent\/}
samples $\{ \bx^{(r)}\}$ suffice to estimate $\Phi$ satisfactorily.
\end{conclusionbox}
\medskip
% This result should be taken with a pinch of salt;
We will find later, however,
that high dimensionality can cause other difficulties for Monte Carlo
methods. Obtaining independent samples from a given distribution $P(\bx)$
is often not easy.
\subsection{Why is sampling from $P(\bx)$ hard?}
We will assume that the density from which we wish to draw
samples, $P(\bx)$, can be evaluated, at least to within a multiplicative
constant; that is, we can evaluate a function $P^*\!(\bx)$ such that
\beq
P(\bx) = P^*\!(\bx) / Z.
\eeq
If we can evaluate $P^*\!(\bx)$, why can we not easily solve
problem 1? Why is it in general difficult to obtain samples from
$P(\bx)$? There are two difficulties. The first is that we typically
do not know the normalizing constant
\beq
Z = \intdx\: P^*\!(\bx) .
\eeq
The second is that, even if we did know $Z$, the problem of drawing samples
from $P(\bx)$ is still a challenging one, especially in high-dimensional
spaces, because there is no obvious way to sample from $P$
without
% in general we have to
enumerating most or
all of the possible states.
Correct samples from $P$ will by definition tend to come
from places in $\bx$-space where $P(\bx)$ is big; how
can we identify those places where $P(\bx)$ is big, without
evaluating $P(\bx)$ {\em{everywhere}}?
There are only a few high-dimensional densities
from which it is easy to
draw samples, for example the Gaussian distribution.
\begin{figure}
\figuremargin{\small%
\begin{center}
\mbox{\makebox[-0.25in][l]{\raisebox{-0.1in}{(a)}}\psfig{figure=mc/pstar.ps,angle=-90,width=2.7in}%
\makebox[-0.25in][l]{\raisebox{-0.1in}{(b)}}\psfig{figure=mc/pstar.imp.ps,angle=-90,width=2.7in}}
\end{center}
}{%
\caption[a]{(a) The function $P^*\!(x) = \exp \!
\left[ 0.4 (x-0.4)^2 - 0.08 x^4 \right]$. How to draw samples
from this density? (b) The function $P^*\!(x)$ evaluated
at a discrete set of uniformly spaced points $\{x_i\}$. How to draw samples
from this discrete distribution?
}
\label{fig.pstar}
}%
\end{figure}
Let us start with a simple one-dimensional
example. Imagine that we wish to
draw samples from the density $P(x) = P^*\!(x) / Z$ where
% set key ; set size 0.6,0.6
% set term post
% set output "pstar.ps"
% plot [-5:5][0:3.1] exp(0.4*((x-0.4)**2-0.2*x**4)) t "P*(x)"
% set term post ; set samples 50
% set output "pstar.imp.ps"
% plot [-5:5][0:3.1] exp(0.4*((x-0.4)**2-0.2*x**4)) t "P*(x)" w imp
\beq
P^*\!(x) = \exp \left[ 0.4 (x-0.4)^2 - 0.08 x^4 \right] ,
\:\: x \in (-\infty, \infty) .
\eeq
We can plot this function (\figref{fig.pstar}a).
But that does not mean we can draw samples from it. To start with,
we don't know the normalizing constant $Z$. To give ourselves
a simpler problem,
we could discretize the variable $x$ and ask for samples from the
discrete probability distribution over a finite set of uniformly
spaced points $\{x_i\}$ (\figref{fig.pstar}b). How could we solve
this problem? If we evaluate $p^*_i = P^*\!(x_i)$ at each point $x_i$,
we can compute
\beq
Z = \sum_i p^*_i
% P^*\!(x_i),
\label{eq.Zdirect}
\eeq
and
\beq
p_i = p^*_i / Z
\label{eq.pdirect}
\eeq
and we can then sample
% repeatedly
from the probability
distribution $\{ p_i \}$ using various methods based on
a source of random bits (see \secref{sec.ac.efficient}).
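 One simple possibility, given here as an illustrative sketch rather than
 as the method of the section just cited, is to invert the cumulative sums:
 draw a number $u$ uniformly from $(0,1)$ and return the point $x_i$ for which
\beq
	\sum_{j < i} p_j \: \leq \: u \: < \: \sum_{j \leq i} p_j .
\eeq
 Each $x_i$ is then selected with probability $p_i$, the length of its interval.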
% chapter \ref{ch.ac}).
% , of which reversed arithmetic coding
% \cite{Rissanen_Langdon:79,arith_coding} is the most efficient
% method in terms of the number of random bits needed.
%
% rad:
%% Arithmetic coding is an efficient way of generating from a finite
%% distribution only if by "efficient" you mean "uses as few random bits
%% as possible". Usually, one is more concerned with computation time,
%% in which case the "alias method" is much more efficient, assuming one
%% wants to generate many points from the same distribution. This method
%% generates N points from a distribution with K possible values in
%% O(N+K) time.
%%
%%
But what is the cost of this procedure, and how does it scale with the
dimensionality of the space, $N$? Let us concentrate on the
initial cost of evaluating $Z$ (\ref{eq.Zdirect}). To compute $Z$
we have to visit every
point in the space. In \figref{fig.pstar}b there are 50
uniformly spaced points in one dimension.
If our system had $N$ dimensions, $N=1000$ say,
then the corresponding number of points would
be $50^{1000}$, an unimaginable number of evaluations of $P^*$.
Even if each component $x_{\mcn}$
% were discretized to only 2 values,
took only two discrete values,
% $\pm 1$,
the number of evaluations of $P^*$ would
be $2^{1000}$, a number that is still horribly huge.
If every electron in the universe (there are about $2^{266}$ of them)
were a 1000 gigahertz computer that could evaluate
$P^*$ for a trillion ($2^{40}$)
% $10^{12}$
states every second, and if we ran
those $2^{266}$
computers
for a time equal
to the age of the universe ($2^{58}$ seconds), they
would still only visit $2^{364}$ states. We'd have to wait
for more than $2^{636} \simeq 10^{190}$ universe ages to elapse before
all $2^{1000}$ states had been visited.
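 The exponents in this estimate simply add:
\beq
	2^{266} \times 2^{40} \times 2^{58} = 2^{364} ,
	\:\:\:\:\:\:
	\frac{2^{1000}}{2^{364}} = 2^{636} ,
\eeq
 and, since $\log_{10} 2 \simeq 0.301$, $2^{636}$ is of order $10^{191}$,
 comfortably more than the round figure $10^{190}$.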
\newcommand{\cents}{\mbox{c}}
Systems with $2^{1000}$ states are two a penny.$^{\star}$\marginpar[c]{%
\small\raggedright
$^{\star}\,$Translation for
American readers: `such systems are a dime a dozen'; incidentally,
this equivalence ($10\cents = 6$p) shows that the correct exchange rate
between our currencies
is $\pounds$1.00 = \$1.67.
%What's more,
% in pre-decimal currency, $10\cents = 6{\rm d}$ gives \$4 to the pound,
% which was the (fixed) exchange rate under the Bretton Woods system that
% was introduced after World War Two, until the devaluations of the 1960s.
}
 One example is a collection of roughly 1000 spins, such as a $30 \times 30$ fragment
of an Ising model
% (or `Boltzmann machine' or `Markov field') \cite{yeomans92}
whose probability distribution
is proportional to
\beq
P^*\!(\bx) = \exp \! \left[ - \beta E(\bx) \right]
\eeq
where $x_n \in \{ \pm 1 \}$ and
\beq
E(\bx) = - \left[
\frac{1}{2}
\sum_{m,n} J_{mn} x_m x_n
+ \sum_{n} H_n x_n \right] .
\label{eq.ising.eb}
\eeq
% Non-physicists who are more familiar with neural networks
% than Ising models can think of this as the probability distribution
% of a Boltzmann machine with symmetric weights $J_{mn}$ and
% biases $H_n$.
The energy function $E(\bx)$ is readily evaluated for
any $\bx$. But if we wish to evaluate this function at {\em all\/} states
$\bx$, the computer time required would be $2^{1000}$ function evaluations.
The Ising model is a simple model which has been around for
a long time, but the task of
generating
samples from the distribution $P(\bx) = P^*\!(\bx) / Z$
is still an active research area;
% has proved so difficult that researchers are still actively
% developing practical methods for solving it
% are still this problem was published
the first `exact' samples
from this distribution were
created in the pioneering work of \citeasnoun{Propp1996},
as we'll describe in \chref{ch.mcexact}.
\subsection{A useful analogy}
\marginfig{\footnotesize
\begin{center}
\mbox{\psfig{figure=figs/lake2.eps,width=1.46in}\raisebox{0.59in}{$P^*(\bx)$}}
\end{center}
\caption[a]{A lake whose depth at $\bx=(x,y)$ is $P^*(\bx)$.}
\label{fig.lake2}
}
Imagine the tasks of drawing random water samples from
a \ind{lake} and finding the average \ind{plankton} concentration
(\figref{fig.lake2}). The depth of the lake\index{depth of lake}
at $\bx=(x,y)$ is $P^*(\bx)$,
and we assert (in order to make the analogy work) that the plankton concentration
is a function of $\bx$, $\phi(\bx)$.
The required average concentration is an integral like (\ref{eq.prob2}),
namely
\beq
\Phi = \left< \phi(\bx) \right> \equiv \frac{1}{Z} \intdx\: P^*(\bx) \phi(\bx) ,
\label{eq.prob2again}
\eeq
where $Z = \int \! \d x \, \d y \: P^*\!(\bx)$ is the volume of the lake.\index{partition function!analogy with lake}
You are provided with a boat, a satellite navigation system, and a plumbline.
Using the navigator, you can take your boat to any desired location $\bx$
on the map; using the plumbline you can measure $P^*(\bx)$ at that point.
You can also measure the plankton concentration there.
Problem 1 is to draw $1\,\mbox{cm}^3$ water samples at random from the lake,
in such a way that each sample is equally likely to come from any point within the
lake.
Problem 2 is to find the average plankton concentration.
These are difficult problems to solve because at the outset we know nothing
about the depth $P^*(\bx)$.
%\begin{figure}[hbtp]
%\figuremargin{
\marginfig{
\mbox{\psfig{figure=figs/lake.eps,width=1.83in}}
%}{
\caption[a]{A slice through a lake that includes some canyons.}
\label{fig.lake1}
}
%\end{figure}
Perhaps much of the volume of the lake is contained in narrow,
deep underwater canyons (\figref{fig.lake1}), in which
case, to correctly sample from the lake and correctly estimate
$\Phi$ our method must implicitly discover the canyons
and find their volume relative to the rest of the lake.
Difficult problems, yes;
% Given that we can't expect to visit every location in the lake, $\bx$, our
% problems seem difficult;
nevertheless, we'll see that clever
Monte Carlo methods can solve them.
% both problems.
\subsection{Uniform sampling}
Having accepted that we cannot exhaustively visit every location $\bx$ in the state space,
we might consider trying to solve the second problem (estimating the
expectation of a function $\phi(\bx)$) by drawing
random samples $\{ \bx^{(r)} \}_{r=1}^{R}$
{\em uniformly\/} from the state space
and evaluating $P^*\!(\bx)$ at those points. Then we could
introduce a normalizing constant
% {\em estimate\/}
$Z_R$, defined by
\beq
% \hat{Z}
Z_R = \sum_{r=1}^{R} P^*\!(\bx^{(r)}) ,
\eeq
% where $V$ is the volume of the space $V = \intdx\:1$,
and estimate $\Phi = \intdx\: \phi(\bx) P(\bx)$ by
\beq
\hat{\Phi} = \sum_{r=1}^{R}
\phi(\bx^{(r)}) \frac{P^*\!(\bx^{(r)})}{Z_R} .
\eeq
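One way to see why this estimator is reasonable is sketched here, under the
assumption that the state space has a finite volume $V = \intdx\: 1$ (a symbol
not used elsewhere in this section). Each $\bx^{(r)}$ has density $1/V$, so
the expected values of the two sums are
\beq
 \left< Z_R \right> = \frac{R}{V} \intdx\: P^*\!(\bx) = \frac{R}{V} Z ,
 \:\:\:\:
 \left< \sum_{r=1}^{R} \phi(\bx^{(r)}) P^*\!(\bx^{(r)}) \right>
 = \frac{R}{V} Z \Phi .
\eeq
The factor $R/V$ cancels in the ratio, so $\hat{\Phi}$ approaches $\Phi$
as $R$ becomes large, provided the samples actually land where $P^*$
is significant.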
Is anything wrong with this strategy? Well, it depends on the
functions $\phi(\bx)$ and $P^*\!(\bx)$. Let us assume that $\phi(\bx)$
is a benign, smoothly varying function and concentrate on the nature
of $P^*\!(\bx)$. As we learnt in \chapterref{ch.two},
a high-dimensional distribution is often
concentrated in a small region of the state space known as its
typical set
% . For example, if the state $\bx$ contains a large number
% of roughly {\em independent\/} variables, then it follows from the law
% of large numbers that almost all of the probability mass of the
% distribution $P(\bx)$ can be found in a typical set
$T$, whose volume
is given by $|T| \simeq 2^{H(\bX)}$, where $H(\bX)$ is the
% Shannon-Gibbs
entropy of the probability distribution $P(\bx)$.
%\beq
% H(\bX) = \sum_{\bx} P(\bx) \log_2 \frac{1}{P(\bx)} .
%\eeq
If almost all the probability mass is located in the typical set and
$\phi(\bx)$ is a benign function, the value of $\Phi
= \intdx\: \phi(\bx) P(\bx)$ will be principally determined by the
values that $\phi(\bx)$ takes on in the typical set. So uniform
sampling will only stand a chance of giving a good estimate of $\Phi$
if we make the number of samples $R$ sufficiently large that we are
likely to hit the typical set at least once or twice.
% a number of times.
So, how many samples
are required?
\begin{figure}
\figuremargin{\small%
\begin{mycenter}
\begin{tabular}{c@{$\:\:\:\:\:$}c}
\mbox{{\small(a)}\hspace{-0.32cm}\psfig{figure=figs/Sising.ps,angle=-90,width=2.5in}}%
&
{\small(b)}\raisebox{0.25in}{\framebox{\psfig{figure=../comput/newising_mc/32.32/t2.5.ps,width=1.4in}}}%
%\mbox{\psfig{figure=figs/Sising9.ps,angle=-90,width=3.2in}}%
\\
%\footnotesize(a) &
%\footnotesize(b) \\
\end{tabular}\end{mycenter}
}{%
\caption[a]{(a) Entropy of a 64-spin Ising model
% spin systems
as a function of temperature.
% The entropy of 64 spins arranged in a planar rectangular
% lattice with periodic boundary conditions
% was found as a function of temperature $T$
% with the nearest neighbour coupling set to $J=+1$.
% (ferromagnet) and $J=-1$ (antiferromagnet).
%
(b) One state of a 1024-spin Ising model.
% with 1024 spins
}
% cd ~/_courses/comput/newising_mc/
% i -o 32.32/o -ot 32.32/t -nx 32 -ny 32 -bmin 0.2 -bmax 0.5 -bs 11 -its 130000 -mf 30000
%Right figure, 81 spins.
\label{fig.Sising}
}%
\end{figure}
% cd _courses/comput/newising/r
% gnuplot
% load '../gnu8'
%
Let us take the case of the Ising model again. (Strictly,
the Ising model may not be a good example, since it doesn't necessarily have
a typical set, as defined in \chapterref{ch2}; the definition
of a typical set was that all states had $\log$ probability close
to the entropy, which for an Ising model would mean that
the {\em energy\/} is very close to the {\em mean energy};
but in the vicinity of phase transitions, the variance of energy,
also known as the heat capacity, may diverge, which means that the
energy of a random state is not necessarily expected to be very
close to the mean energy.)
% ; but let's ignore this.)
The
total size of the state space is $2^N$ states, and the typical set
has size $2^H$. So each sample has a chance of $2^H/2^N$ of falling
in the typical set. The number of samples required to hit the
typical set once is thus of order
\beq
R_{\min} \simeq 2^{N-H} .
\eeq
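As a quick check on this order-of-magnitude claim, treat the $R$ uniform samples
as independent, each landing in $T$ with a small probability $p = 2^{H-N}$. Then
\beq
 P(\mbox{at least one hit}) = 1 - (1-p)^R \simeq 1 - e^{-pR} ,
\eeq
which is about $0.63$ when $R = 1/p = 2^{N-H}$ and is close to $pR$
when $R \ll 1/p$.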
% would like \pagebreak here
% \pagebreak[1]
So, what is $H$?
At high temperatures, the probability distribution of an Ising
model tends to a uniform distribution and the entropy tends to
$H_{\max} = N$ bits, which means $R_{\min}$ is of order 1.
Under these conditions, uniform sampling
may well be a satisfactory technique for estimating $\Phi$.
But high temperatures are not of great interest. Considerably
more interesting are intermediate temperatures such as the critical
temperature at which the Ising model melts from an ordered phase
to a disordered phase.\index{phase transition}
For an infinite two-dimensional Ising model with nearest-neighbour coupling $J=1$,
this critical temperature is $\theta_c=2.27$.
At this temperature the entropy of
an Ising model is roughly $N/2$ bits (\figref{fig.Sising}).
% For example,
% if the entropy of the 64-spin model is $32\log(2)$,
% the probability mass is concentrated in a typical
% set of size $2^{32}$ states; this set is a fraction roughly $1/2^{32}$
% of the total size of the state space.
%
For this probability
distribution the number of samples required simply to hit the typical set
once is of order
\beq
R_{\min} \simeq 2^{N-N/2} = 2^{N/2} ,
\eeq
which for $N=1000$ is about $10^{150}$. This is roughly the square of the
number of particles in the universe. Thus uniform sampling
is utterly useless for the study of Ising models of modest size.
And in most high-dimensional problems, if the distribution
$P(\bx)$ is not actually uniform, uniform sampling is unlikely to
be useful.
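The arithmetic behind these figures is short. For $N=1000$,
\beq
 2^{N/2} = 2^{500} = 10^{500 \log_{10} 2} \simeq 10^{150.5}
 \simeq 3 \times 10^{150} ,
\eeq
whereas for the 64-spin model of \figref{fig.Sising}a, taking its entropy
to be roughly $N/2 = 32$ bits as well, the same estimate gives
$2^{32} \simeq 4 \times 10^{9}$ samples, which is large but not astronomical.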
% \exercis{ex.}{
% Prove that the estimator $\hat{Z}$ is an unbiased
% estimator for
% $Z$.
% }
%
\subsection{Overview}
Having established that drawing samples from a high-dimensional
distribution $P(\bx) = P^*\!(\bx) / Z$ is
difficult even if $P^*\!(\bx)$ is easy to evaluate, we
will now study a sequence of more sophisticated Monte Carlo methods:
%\bit
%\item {\dbf importance sampling},
%\item {\dbf rejection sampling},
%\item the {\dbf Metropolis method},
%\item {\dbf Gibbs sampling}, and
%\item {\dbf slice sampling}.
%\eit
%
{\dbf importance sampling},
{\dbf rejection sampling},
the {\dbf Metropolis method},
{\dbf Gibbs sampling}, and
{\dbf slice sampling}.
\section{Importance sampling}
\label{sec.importance}
\indexs{Monte Carlo methods!importance sampling}\indexs{importance sampling}
Importance
sampling is not a method for generating samples
from $P(\bx)$ (problem 1); it is just a method for estimating the
expectation of a function $\phi(\bx)$ (problem 2). It can be viewed
as a generalization of the uniform sampling method.
%
% Radford him say
% Not directly, but you could always sample from the weighted distribution
% once you have a large number of points, thereby getting something close
% to a bunch of independent points.
%
%
For illustrative purposes,
let us imagine that the target distribution
is a one-dimensional density $P(x)$. Let us
assume that we are able to
 evaluate this density at any chosen point $x$,
at least to within a multiplicative
constant; thus we can evaluate a function $P^*\!(x)$ such that
\beq
P(x) = P^*\!(x) / Z.
\eeq
But $P(x)$ is too complicated a function for us to be able to sample
from it directly. We now assume that we have a simpler
density $Q(x)$ from which we {\em can\/} generate samples and
which we can evaluate to within a multiplicative
constant (that is, we can evaluate $Q^*(x)$, where
$Q(x) = Q^*(x)/Z_Q$).
%
% Importance sampling
% , like rejection sampling, assumes that we
% makes use of
% an approximation $Q(x)$ which is similar to $P(x)$ and which we can draw
% samples from.
% We relax the restriction that an inequality
% relating $Q$ and $P^*$ must be known.
An example of the functions $P^*$, $Q^*$ and $\phi$ is shown in
\figref{fig.pq.importance}.
%\begin{figure}
%\figuremargin{%
\amarginfig{t}{
\begin{center}
\mbox{\epsfbox{metapost/rejection.3}}
%\mbox{\psfig{figure=figs/pq.importance.eps,width=2in,angle=-90}}
\end{center}
%}{%
\caption[a]{{Functions involved in importance sampling.}
We wish to estimate the expectation of $\phi(x)$ under $P(x)\propto P^*\!(x)$.
We can generate samples from the simpler distribution $Q(x) \propto Q^*(x)$.
We can evaluate $Q^*$ and $P^*$
at any point.
}
\label{fig.pq.importance}
}%
%\end{figure}
We call $Q$ the {\dem\ind{sampler density}}.
%% [The methods that
%% follow will work even if the sampler density
%% is not normalized, that is, we can
%% only evaluate $Q^*(x)$, which is proportional to $Q(x)$.]
\newcommand{\xfromq}{x}% was _q
In {importance sampling}, we generate $R$ samples
$\{\xfromq^{(r)}\}_{r=1}^R$ from $Q(x)$.
If these points were samples from $P(x)$ then
we could estimate $\Phi$
by \eqref{eq.mc.est}.
But when we generate samples from $Q$, values of $x$ where
$Q(x)$ is greater than $P(x)$ will be {\em over-represented\/} in
this estimator, and points where $Q(x)$ is less than $P(x)$
will be {\em under-represented}.
To take into account the fact that we have sampled from
the wrong distribution, we introduce {\dem{weights}}\index{weight!importance sampling}
\beq
w_r \equiv \frac{ P^*\!(\xfromq^{(r)}) }{ Q^*(\xfromq^{(r)}) }
\label{eq.mc.is.weight.def}
\eeq
which we use to adjust the `importance' of each point
in our estimator thus:
\beq
\hat{\Phi} \equiv \frac{ \sum_{r} w_r \phi( \xfromq^{(r)} ) }{ \sum_r w_r } .
\label{eq.is}
\eeq
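A short check, using scale factors $a$ and $b$ introduced only for this
illustration, shows why unnormalized functions suffice. If $P^*$ is replaced
by $a P^*$ and $Q^*$ by $b Q^*$, every weight becomes $w_r' = (a/b) w_r$, and
\beq
 \frac{ \sum_{r} w_r' \phi( x^{(r)} ) }{ \sum_r w_r' }
 = \frac{ (a/b) \sum_{r} w_r \phi( x^{(r)} ) }{ (a/b) \sum_r w_r }
 = \hat{\Phi} ,
\eeq
so neither $Z$ nor $Z_Q$ needs to be known.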
\exercissxB{2}{ex.Phiconverge}{
Prove that,
if $Q(x)$ is non-zero for all $x$ where $P(x)$ is non-zero,
the estimator $\hat{\Phi}$ converges to $\Phi$, the mean
value of $\phi(x)$, as $R$ increases.
What is the variance of this estimator, asymptotically?
Hint: consider the statistics of the numerator and the denominator
separately.
Is the estimator $\hat{\Phi}$ an unbiased estimator for small $R$?
%
%
% Show that the estimator also works if the normalization constant $Z_Q$
% of $Q(x)$ is unknown -- that is, if we can draw samples from $Q(x)$,
% but we can only evaluate $Q^*(x)$, where $Q(x) = Q^*(x)/Z_Q$.
}
% \exercisa
% If $Q(x)$ is non-zero for all $x$ where $P(x)$ is non-zero,
% it can be proved that the estimator $\hat{\Phi}$ converges to $\Phi$,
% the mean value of $\phi(x)$, as $R$ increases.
% The estimator also works if the normalization constant $Z_Q$
% of $Q(x)$ is unknown -- that is, if we can draw samples from $Q(x)$,
% but can only evaluate $Q^*(x)$, where $Q(x) = Q^*(x)/Z_Q$.
%
% Your presentation of importance sampling normalizes the weights. Of course,
% you don't have to normalize the weights and it would seem useful to
% discuss this issue.
% Also, your section on importance sampling should discuss Rubin's Sampling
% Importance Resampling (SIR) since its a nice bridge between rejection
% sampling and importance sampling.
%
A practical difficulty with importance sampling is that it is hard to
estimate how reliable the estimator $\hat{\Phi}$ is. The variance of the
estimator is unknown beforehand,
because it depends on an integral over $x$ of a
function involving $P^*\!(x)$. And the variance of
$\hat{\Phi}$ is hard to estimate, because the empirical variances of
the quantities $w_r$ and $w_r \phi( \xfromq^{(r)} )$ are not necessarily
a good guide to the true variances of the numerator and denominator
in \eqref{eq.is}. If the proposal density $Q(x)$ is small in a region
where $|\phi(x)P^*\!(x)|$ is large then it is quite possible, even after many
points $\xfromq^{(r)}$ have been generated, that none of them will have
fallen in that region. In this case the estimate of $\Phi$ would
be drastically wrong, and there would be no indication in the {\em{empirical}\/}
variance that the
true variance of the estimator $\hat{\Phi}$ is large.\index{caution!importance sampling}
\newcommand{\FIGTOY}{/home/mackay/aa/ps}
\begin{figure}[bthp]
\figuremargin{\small%
\begin{mycenter}
\mbox{%
\raisebox{0.32in}{\makebox[0in][l]{\small(a)}}%
\hspace{-0.012in}%
\psfig{figure=\FIGTOY/demo.is.norm.ps,height=1.8in,angle=-90}\hspace{0.12in}
\raisebox{0.32in}{\makebox[0in][l]{\small(b)}}%
\hspace{-0.092in}%
\psfig{figure=\FIGTOY/demo.is.cauchy.ps,height=1.8in,angle=-90}}\\[-0.2in]
\end{mycenter}
}{%
\caption[a]{Importance sampling in action: (a) using a Gaussian sampler density;
(b) using a \index{Cauchy distribution}Cauchy
sampler density. Vertical
axis shows the estimate $\hat{\Phi}$. The horizontal line indicates
the true value of $\Phi$.
Horizontal axis shows number of samples
on a log scale.}
\label{fig.is}
}%
\end{figure}
\subsection{Cautionary illustration of importance sampling}
In a toy problem
related to the modelling of \ind{amino acid} probability distributions
with a one-dimensional variable $x$,
%
% I am pretty sure nearly all the figs in _doc/proteins/amino_is
% correspond to one latent variable
%
I evaluated a quantity of interest using importance sampling.
The results using a Gaussian sampler
% $Q(x)$
and a Cauchy sampler are shown in
\figref{fig.is}. The horizontal axis shows the number of\pagebreak[1]
samples
on a log scale. In the case of the
Gaussian sampler, after about 500 samples had been evaluated
one might be tempted to call a halt; but evidently
there are infrequent samples
that make a huge contribution to $\hat{\Phi}$, and the value of the estimate at
500 samples is wrong. Even after a million samples have been
taken, the estimate has still not settled down close to the true value.
In contrast, the Cauchy sampler does not suffer from glitches; it converges
(on the scale shown here) after about 5000 samples.
This example illustrates the fact that an importance sampler should have
{\bf heavy tails}.
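The following sketch suggests why the tails matter. It uses an illustrative
target that is not the amino-acid problem above: take
$P(x) = \Normal(x ; 0,\sigma_P^2 )$ and a Gaussian sampler
$Q(x) = \Normal(x ; 0,\sigma_Q^2 )$. The weights are
\beq
 w(x) = \frac{P(x)}{Q(x)} = \frac{\sigma_Q}{\sigma_P}
 \exp \left[ \frac{x^2}{2} \left( \frac{1}{\sigma_Q^2}
 - \frac{1}{\sigma_P^2} \right) \right] ,
\eeq
which grow without bound as $|x|$ increases if $\sigma_Q < \sigma_P$;
the second moment
$\int \! \d x \: Q(x) w(x)^2 = \int \! \d x \: P(x)^2 / Q(x)$
is infinite when $\sigma_Q^2 \leq \sigma_P^2/2$.
For a Cauchy sampler, $Q(x) \propto 1/(1+x^2)$, the weights are proportional to
$(1+x^2) \exp ( - x^2 / (2 \sigma_P^2) )$, which is bounded, so no single
sample can carry an arbitrarily large weight.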
\exercissxA{2}{ex.peakysample}{
Consider the situation where $P^*\!(x)$ is multimodal, consisting
of several widely-separated peaks.
% for example, a mixture of Gaussians, widely separated.
(Probability distributions
like this arise frequently in statistical data modelling.)
Discuss whether it is a wise strategy to do importance sampling
using a sampler $Q(x)$ that is a unimodal distribution fitted
to one of these peaks.
\marginfig{
\hspace*{-0.2in}\psfig{figure=figs/gmixture.pqf.ps,angle=-90,width=2.24in}
\caption[a]{A multimodal distribution $P^*\!(x)$ and a unimodal sampler $Q(x)$.}
}
Assume that the function $\phi(x)$ whose mean $\Phi$ is to be estimated
is a smoothly varying function of $x$ such as $mx+c$. Describe
the typical evolution of the estimator $\hat{\Phi}$ as a function of
the number of samples $R$.
}
\subsection{Importance sampling in many dimensions}
We have already observed that care is needed in one-dimensional
importance sampling problems. Is importance sampling a useful
technique in spaces of higher dimensionality, say $N=1000$?
Consider a simple case-study where the target density $P(\bx)$
is a uniform distribution inside a sphere,
\beq
P^*\!(\bx) = \left\{ \begin{array}{cl} 1 & 0 \leq \rho(\bx) \leq R_P \\
0 & \rho(\bx) > R_P, \end{array}
\right.
\eeq
where $\rho(\bx) \equiv (\sum_i x_i^2 )^{1/2}$,
and the proposal density is a Gaussian centred on the origin,
\beq
Q(\bx) = \prod_i \Normal(x_i ; 0,\sigma^2 ) .
\eeq
An importance-sampling method will be in trouble if the estimator $\hat{\Phi}$
is dominated by a few large weights $w_r$.
What will be the typical range of values of the weights $w_r$?
We know from our discussions of typical sequences in \partone\
-- see \extwentyseven, for example --
% by the law of large numbers
that if $\rho$ is the distance
% By the central-limit theorem (see \extwentyseven, for example),
% if $\rho$ is the distance
from the origin of a sample from $Q$,
the quantity $\rho^2$ has a roughly Gaussian distribution with mean
and standard deviation:
% is very likely to have a distance
% $\rho$ from the origin that satisfies
\beq
\rho^2 \sim
N \sigma^2 \pm \sqrt{2 N} \sigma^2 .
\eeq
% where $z$ is a constant equivalent to the $\beta$ that controlled
% the size of our typical set. If $z=2$ then there is a 95\% chance
% that $\rho^2$ will lie in the above interval.
Thus almost all samples from $Q$ lie in a \ind{typical set} with distance
from the origin very close to $\sqrt{N} \sigma$.
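These two numbers follow from the moments of a single Gaussian component.
Since $\left< x_i^2 \right> = \sigma^2$ and $\left< x_i^4 \right> = 3 \sigma^4$,
each term $x_i^2$ has variance $2 \sigma^4$, and for $N$ independent components
\beq
 \left< \rho^2 \right> = N \sigma^2 , \:\:\:\:
 \mbox{var}( \rho^2 ) = 2 N \sigma^4 .
\eeq
The fractional spread of $\rho^2$ is therefore $\sqrt{2/N}$, which is about
4.5\% for $N=1000$.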
Let us assume that $\sigma$ is chosen such that
the typical set of $Q$ lies
% almost all typical samples from $Q$
inside the sphere of radius $R_P$. [If it does not,
then the law of large numbers implies that almost all the samples
generated from $Q$ will fall outside $R_P$ and will have weight zero.]
%
Then we know that most samples from $Q$ will have a value of $Q$
that lies in the range
\beq
\frac{1}{({2 \pi \sigma^2})^{N/2}} \exp \left( -\frac{N}{2}
\pm \frac{\sqrt{2 N}}{2} \right) .
\eeq
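This range is obtained by writing the proposal density in terms of $\rho$,
\beq
 Q(\bx) = \frac{1}{({2 \pi \sigma^2})^{N/2}}
 \exp \left( - \frac{\rho^2}{2 \sigma^2} \right) ,
\eeq
and substituting $\rho^2 = N \sigma^2 \pm \sqrt{2 N} \sigma^2$ into the exponent.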
Thus the weights $w_r=P^*/Q$ will typically have values in the range
\beq
{({2 \pi \sigma^2})^{N/2}} \exp \left( \frac{N}{2}
\pm \frac{\sqrt{2 N}}{2} \right) .
\label{weightrange}
\eeq
So if we draw a hundred samples, what will the typical range of weights be?
We can roughly estimate the ratio of the largest weight to the median
weight by doubling the standard deviation in
\eqref{weightrange}.
% Taking the two-standard-deviation points,
% A value of $z$ equal to 2 gives a reasonable ball-park figure, and
% we find:
%\begin{description}
%\item[The largest weight and the median weight] {\bf will typically be in the
% ratio:}
The largest weight and the median weight will typically be in the
ratio:
\beq
\frac{w_r^{\max}}{w_r^{{\rm med}}} = \exp \left( \sqrt{2 N} \right) .
\eeq
%\end{description}
In $N=1000$ dimensions therefore, the largest weight after one hundred
samples is likely to be roughly $10^{19}$ times greater
than the median weight.
%
Thus an importance sampling estimate for a high-dimensional problem
will very likely be utterly dominated by a few samples with huge
weights.
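The figure of $10^{19}$ comes from
\beq
 \exp \left( \sqrt{2 N} \right) = \exp ( \sqrt{2000} ) \simeq \exp(44.7)
 = 10^{44.7 \log_{10} e} \simeq 10^{19.4} .
\eeq
For comparison, the same rough estimate in $N=10$ dimensions (a value chosen
here only for contrast) gives $\exp(\sqrt{20}) \simeq 88$.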
% Also, there are nice effective-sample-size heuristics that you might like
% to mention. See, for example, Kong, Liu and Wong's JASA paper on sequential
% imputation.
%
% Radford said:
%
% What happens if we pick sigma optimally? Is importance sampling
% still bad?
%
% note with uniform sampling, the problem was to hit the typical set
% here the problem is, even if we hit the typical set, the probabilities
% of states
% within the typical set vary by considerable factors.
In conclusion, importance sampling in high dimensions often suffers from
two difficulties. First, we need to obtain samples that
lie in the typical set of $P$, and this may take a long time unless\index{approximation!of complex distribution}
$Q$ is a good approximation to $P$. Second, even if we obtain samples
in the typical set, the weights associated with those samples
are likely to vary by large factors, because the probabilities
of points in a typical set, although similar to each other,
still differ by factors of order $\exp(\sqrt{N})$,
so the weights will too, unless $Q$ is a near-perfect
approximation to $P$.
% could quantify - time to get samples in the typical set relates to DKL?
% difference in weights relates to ... ?
\begin{figure}
\figuremargin{%
\begin{center}\footnotesize
\begin{tabular}{cc}
\raisebox{1.15in}{\makebox[0in][l]{\footnotesize(a)}}
\mbox{\epsfbox{metapost/rejection.1}}
%\psfig{figure=figs/pq.rejection.eps,width=2.25in,angle=-90}
&
\raisebox{1.15in}{\makebox[0in][l]{\footnotesize(b)}}
\mbox{\epsfbox{metapost/rejection.2}}
%\psfig{figure=figs/pq.rejection.shade.eps,width=2.25in}\\
\end{tabular}
\end{center}
}{%
\caption[a]{{Rejection sampling.\indexs{rejection sampling}\index{rejection}}
%
(a) The functions involved in {rejection sampling}.
We desire samples from $P(x) \propto P^*\!(x)$. We are able to draw
samples from $Q(x) \propto Q^*(x)$, and we know a value $c$ such that
$c\,Q^*(x) > P^*\!(x)$ for all $x$.
%
(b) A point $(\xfromq,u)$ is generated
at random in the lightly shaded
area under the curve $c\,Q^*(x)$. If this point also
lies below $P^*\!(x)$ then it is accepted.}
\label{fig.pq.rejection}
\label{fig.pq.rejection.xu}
\label{fig.pq.rejection.shade}
}%
\end{figure}
\section{Rejection sampling}
\label{sec.rejection}
\indexs{Monte Carlo methods!rejection sampling}%\indexs{rejection sampling}
We
assume again a one-dimensional density
% \beq
$P(x) = P^*\!(x) / Z$
% \eeq
that is too complicated a function for us to be able to sample
from it directly. We assume that we have a simpler {\em \inds{proposal
density}\/} $Q(x)$ which we can evaluate (to within a multiplicative factor
$Z_Q$, as before),
and from which we can generate samples. We further assume that we
know the value of a constant $c$ such that
\beq
c\, Q^*(x) > P^*\!(x) , \:\: \mbox{for all $x$}.
\eeq
% For rejection sampling to work $Q(x)$ should be similar to $P(x)$.
A schematic picture of the two functions is shown in
\figref{fig.pq.rejection}a.
We generate two random numbers. The
first, $\xfromq$, is generated from the proposal density $Q(x)$. We then evaluate
$c\,Q^*(\xfromq)$ and generate a uniformly distributed random variable $u$
from the interval $[0,c\,Q^*(\xfromq)]$. These two random numbers can be
viewed as selecting a point in the two-dimensional plane as shown in
\figref{fig.pq.rejection.xu}b.
We now evaluate $P^*\!(\xfromq)$ and accept or reject the sample $\xfromq$ by
comparing the value of $u$ with the value of $P^*\!(\xfromq)$. If $u >
P^*\!(\xfromq)$ then $\xfromq$ is rejected; otherwise it is accepted, which
means that we add $\xfromq$ to our set of samples $\{ x^{(r)} \}$. The
value of $u$ is discarded.
Why does this procedure generate samples from $P(x)$? The proposed point
$(\xfromq,u)$ comes with uniform probability from the lightly shaded
area underneath the curve $c\,Q^*(x)$ as shown in
\figref{fig.pq.rejection.shade}b. The rejection rule rejects all the
points that lie above the curve $P^*\!(x)$. So the points $(x,u)$
that are
accepted are uniformly distributed in the heavily shaded area under
$P^*\!(x)$. This implies that the probability density of the
$x$-coordinates of the accepted points must be proportional to
$P^*\!(x)$, so the samples must be independent samples from $P(x)$.
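The same conclusion can be reached algebraically. A proposal at $x$ is
accepted with probability $P^*\!(x) / ( c\,Q^*(x) )$, so the density of
accepted points is
\beq
 P( x \,|\, \mbox{accept} ) \propto Q(x) \frac{ P^*\!(x) }{ c\,Q^*(x) }
 = \frac{ P^*\!(x) }{ c\,Z_Q } \propto P^*\!(x) ,
\eeq
and the overall probability that a proposal is accepted is
$\int \! \d x \: Q(x) P^*\!(x) / ( c\,Q^*(x) ) = Z / ( c\,Z_Q )$.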
Rejection sampling will work best if $Q$ is a good approximation to
$P$. If $Q$ is very different from $P$ then,
for $c\,Q$ to exceed $P$ everywhere,
$c$ will necessarily have
to be large and the frequency of rejection will be large.
%\begin{figure}
%\figuremargin{%
\marginfig{
\[
\hspace*{-0.2in}\psfig{figure=figs/grejection.ps,angle=-90,width=2.24in}
\]
%}{%
\caption[a]{A Gaussian $P(x)$ and a slightly broader Gaussian $Q(x)$
scaled up by a factor $c$ such that $c\,Q(x) \geq P(x)$.}
\label{fig.grejection}
}%
%\end{figure}
%
\subsection{Rejection sampling in many dimensions}
In a high-dimensional problem it is very likely that the
requirement that $c\,Q^*$ be an upper bound for $P^*$ will force
$c$ to be so huge that acceptances\index{acceptance rate} will be very rare indeed.
Finding such a value of $c$ may be difficult too, since in many
problems we know neither where the modes of $P^*$ are located
nor how high they are.
%beforehand
As a case study, consider a pair of
$N$-dimensional Gaussian distributions with mean zero
(\figref{fig.grejection}). Imagine
generating samples from one with standard deviation $\sigma_Q$
and using rejection sampling to obtain samples from the other
whose standard deviation is $\sigma_P$. Let us assume that these
two standard deviations are close in value -- say,
$\sigma_Q$ is 1\%
% one \percent\
larger than $\sigma_P$. [$\sigma_Q$ must
be larger than $\sigma_P$ because if this is not the case, there
is no $c$ such that $c\,Q$ exceeds $P$ for all $\bx$.]
So, what value of $c$ is required if the dimensionality is $N=1000$?
The density of $Q(\bx)$ at the origin is $1/({2 \pi \sigma_Q^2})^{N/2}$,
so for $c\,Q$ to exceed $P$ we need to set
\beq
c = \frac{({2 \pi \sigma_Q^2})^{N/2}}{({2 \pi \sigma_P^2})^{N/2}}
= \exp \left( {N} \ln \frac{ \sigma_Q }{ \sigma_P} \right) .
\eeq
With $N=1000$ and $\frac{ \sigma_Q }{ \sigma_P}=1.01$, we find
$c=\exp(10)\simeq 20{,}000$.
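More precisely, $1000 \ln 1.01 \simeq 9.95$, so
\beq
 c = (1.01)^{1000} = \exp(9.95) \simeq 2.1 \times 10^{4} ,
\eeq
consistent with the round figure of 20,000 used here.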
What will the acceptance rate\index{acceptance rate} be for this value of $c$?
% The typical
% sample from $Q$ has $Q \simeq \frac{1}{({2 \pi \sigma_Q^2})^{N/2}} e^{-N/2}$,
% so that $c\,Q \simeq \frac{1}{({2 \pi \sigma_P^2})^{N/2}} e^{-N/2}$.
% At this typical sample, the value of $P$ will be roughly
% $\frac{1}{({2 \pi \sigma_P^2})^{N/2}} \exp \left(
% \frac{-N\sigma_Q^2}{2\sigma_P^2} \right)$, which is smaller than $c\,Q$
% by the factor
% \beq
% \left.\frac{P}{cQ}\right|_{\rm typ} = \exp \left[
% -\frac{N}{2}\left(
% \frac{\sigma_Q^2}{\sigma_P^2} - 1 \right) \right]
% ,
% \eeq
% which is roughly $c$
The answer is immediate: since the acceptance rate is the ratio of the
volume under the curve $P(\bx)$ to the volume under $c\,Q(\bx)$,
the fact that $P$ and $Q$ are both normalized here implies that the acceptance
rate will be $1/c$,
 which for our case study is
% . For our case study, this is
% $\dfrac{1}{20,000}$.
1/20,000.
In general, $c$ grows exponentially with the dimensionality $N$,
so the acceptance rate is expected to be exponentially small in $N$.
Rejection sampling, therefore, whilst a useful method for
one-dimensional problems, is not expected to be a practical technique for
generating samples from high-dimensional distributions $P(\bx)$.
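To see how quickly the situation deteriorates, note that for this pair of
Gaussians the acceptance rate is
\beq
 \frac{1}{c} = \left( \frac{ \sigma_P }{ \sigma_Q } \right)^{N}
 = \exp \left( - N \ln \frac{ \sigma_Q }{ \sigma_P} \right) .
\eeq
With the same 1\% mismatch, increasing the dimensionality to $N = 10\,000$
(a value chosen here for illustration) gives
$1/c \simeq \exp(-99.5) \simeq 10^{-43}$.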
\section{The Metropolis--Hastings method}
\label{sec.metropolis}
\label{sec.metrop}
\indexs{Monte Carlo methods!Metropolis--Hastings}%
\indexs{Monte Carlo methods!Markov chain Monte Carlo}%
%\indexs{Markov chain Monte Carlo}%\indexs{Metropolis method}
Importance
sampling and rejection sampling work well only
if the proposal density $Q(x)$ is similar to $P(x)$.
In large and complex problems it is difficult to create a single
density $Q(x)$ that has this property.
%\begin{figure}
%\figuremargin{%
\amarginfig{c}{
\begin{center}\small
%\mbox{\small
\begin{tabular}{@{}c@{}}
\setlength{\unitlength}{0.7mm}% was 1mm then 0.75 (for a nice fit to textwidth)
\begin{picture}(75,40)(0,0)
\put(0,0){\makebox(0,0)[bl]{\psfig{figure=figs/pq.metrop.eps,%
width=2.1in,angle=-90}}}% was width 3
\put(73,-1){\makebox(0,0)[t]{$x$}}
\put(13,-1){\makebox(0,0)[t]{$x^{(1)}$}}
\put(17,38){\makebox(0,0)[l]{$Q(x;x^{(1)})$}}
\put(42,15){\makebox(0,0)[l]{$P^*\!(x)$}}
\end{picture}\\[0.2in]%\hspace{0.2in}
\setlength{\unitlength}{0.7mm}% was 1mm then 0.75 (for a nice fit to textwidth)
\begin{picture}(75,40)(0,0)
\put(0,0){\makebox(0,0)[bl]{\psfig{figure=figs/pq.metropb.eps,%
width=2.1in,angle=-90}}}% was width 3 then 2.25 (for a nice fit to textwidth)
\put(73,-1){\makebox(0,0)[t]{$x$}}
\put(51,-1){\makebox(0,0)[t]{$x^{(2)}$}}
\put(51,36){\makebox(0,0)[l]{$Q(x;x^{(2)})$}}
\put(20,15){\makebox(0,0)[r]{$P^*\!(x)$}}
\end{picture}
%}
\end{tabular}
\end{center}
%}{%
\caption[a]{{Metropolis--Hastings method in one dimension.} The proposal distribution
$Q(x';x)$ is here shown as having a shape that changes as $x$ changes,
though this is not typical of the proposal densities
used in practice.}
\label{fig.pq.metrop}
}%
%\end{figure}
The Metropolis--Hastings algorithm instead makes use of a
\ind{proposal density} $Q$
{\em which depends on the current state\/} $x^{(t)}$. The
density $Q(x';x^{(t)})$ might
% in the simplest case
be a simple distribution such as a Gaussian
centred on the current $x^{(t)}$.
The proposal density $Q(x';x)$ can be {\em any\/}
fixed density from which we can draw samples. In contrast
to importance sampling and rejection sampling,
it is not necessary that $Q(x';x^{(t)})$ look at all similar
to $P(x)$ in order for the algorithm to be practically useful.
An example of a proposal density is shown in \figref{fig.pq.metrop};
this figure shows the density $Q(x';x^{(t)})$ for two different
states $x^{(1)}$ and $x^{(2)}$.
As before, we assume that we can evaluate $P^*\!(x)$
for any $x$.
A tentative new state $x'$ is generated from the proposal density
$Q(x';x^{(t)})$. To decide whether to accept the new state, we compute
the quantity
\beq
% P({\rm accept
a =
% \min \left( 1,
\frac{ P^*\!(x') }{ P^*\!(x^{(t)}) }
\frac{ Q(x^{(t)};x') }{ Q(x';x^{(t)}) } .
% \right)
\label{eq.ratio.metrop}
\eeq
\[
\begin{array}{l}
\mbox{{\sf If} $a\geq 1$ then the new state is accepted.}
\\
\mbox{{\sf Otherwise}, the new state is accepted with probability $a$.}\\[0.1in]
\mbox{If the step is accepted, we set $x^{(t+1)} = x'$.} \\
\mbox{If the step is rejected,\index{rejection} then we set $x^{(t+1)} = x^{(t)}$. }
\end{array}
\]
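The rule can be written compactly. Introducing the symbol $A$ for the
acceptance probability (notation used only in this remark),
\beq
 A(x';x^{(t)}) = \min ( 1 , a ) ,
\eeq
which in practice is often implemented by drawing $u$ uniformly from $[0,1]$
and accepting if $u < a$. For instance, with a symmetric proposal, for which
the second factor in \eqref{eq.ratio.metrop} equals one, and
$P^*\!(x) = \exp ( - x^2/2 )$, a move from $x^{(t)} = 1$ to $x' = 2$ has
$a = \exp(-3/2) \simeq 0.22$, whereas the reverse move has
$a = \exp(3/2) > 1$ and is always accepted.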
Note the difference from rejection sampling: in rejection sampling,
rejected points are discarded and have no influence on the
list of samples $\{x^{(r)}\}$ that we collected. Here, a rejection
causes the current state to be written again onto the list.
% of points another time.
{\sf Notation.} $\,$
I have used the superscript $r = 1, \ldots, R$ to label points that are
{\em independent\/} samples from a distribution, and the superscript $t =
1, \ldots , T$
to label the sequence of states in a Markov chain. It is important
to note that a Metropolis--Hastings simulation of $T$ iterations does not
produce $T$ {\em independent\/} samples from the target distribution $P$. The
samples are dependent.
To compute
the acceptance probability (\ref{eq.ratio.metrop}) we need to be able to compute the
probability ratios $P(x')/P(x^{(t)})$ (which equals
$P^*\!(x')/P^*\!(x^{(t)})$, since $Z$ cancels) and
$\linefrac{ Q(x^{(t)};x') }
{ Q(x';x^{(t)}) }$. If the proposal density
is a simple symmetrical density such as a Gaussian centred on the
current point, then the latter factor is unity,
and the Metropolis--Hastings method simply involves comparing
the value of the target density at the two points. This special
case is sometimes called the Metropolis method. However,
with apologies to Hastings, I will call the general
Metropolis--Hastings algorithm for asymmetric $Q$
% given above, is often called
`the Metropolis method' since I believe important ideas deserve
short names.
\subsection{Convergence of the Metropolis method to the target density}
It can be shown that for any positive $Q$ (that is, any $Q$ such
that $Q(x';x) > 0$ for all $x,x'$), as $t \rightarrow \infty$,
the probability distribution of $x^{(t)}$ tends to $P(x)=P^*\!(x)/Z$.
[This statement should not be seen as implying that $Q$ {\em has\/}
to assign positive probability to every point $x'$ -- we will
discuss examples later where $Q(x';x) = 0$ for some $x,x'$;
notice also that we have said nothing about how rapidly
the convergence to $P(x)$ takes place.]
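One standard route to this result, sketched here rather than proved, is to
check that $P$ is left unchanged by the update. Writing
$T(x';x) = Q(x';x) \min ( 1 , a )$ for the probability density of moving from
$x$ to a different state $x'$ (the symbol $T$ is introduced here for this
sketch), we have
\beq
 P(x) \, T(x';x) = \frac{1}{Z}
 \min \left[ P^*\!(x) Q(x';x) , \: P^*\!(x') Q(x;x') \right] ,
\eeq
which is symmetric under exchange of $x$ and $x'$, so
$P(x) T(x';x) = P(x') T(x;x')$. Together with the fact that a rejected move
leaves the state unchanged, this implies that if $x^{(t)}$ is distributed
according to $P$ then so is $x^{(t+1)}$. Invariance alone does not establish
convergence; that also requires the chain to be able to reach every region
where $P$ is non-zero.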
%\subsection{Markov chain Monte Carlo}
The Metropolis method is an example of a {\dbf{Markov chain Monte Carlo}}
% not indexed (see see.tex)
method\index{Monte Carlo methods!Markov chain Monte Carlo}
(abbreviated {MCMC}). In contrast to rejection sampling, where
the accepted points $\{ x^{(r)} \}$ are {\em independent\/} samples from the
desired distribution, Markov chain Monte Carlo methods involve a
Markov process in which a sequence of states $\{ x^{(t)} \}$ is
generated, each sample $x^{(t)}$ having a probability distribution
that depends on the previous value, $x^{(t-1)}$. Since successive
samples are dependent,
% correlated with each other,
the Markov chain may have to be
run for a considerable time in order to generate samples that are
effectively independent samples from $P$.
Just as it was difficult to estimate the variance of an importance sampling
estimator, so it is difficult to assess whether a Markov chain Monte Carlo
method has `converged', and to quantify how long one has to wait to obtain
samples that are effectively independent samples from $P$.
\begin{figure}
\figuremargin{%
\begin{center}
%\framebox{
{
\setlength{\unitlength}{0.8mm}% was 1mm
\begin{picture}(75,75)(0,0)
\put(0,0){\makebox(0,0)[bl]{\psfig{figure=figs/metrop2.eps,%
width=2.4in,angle=-90}}}% was width 3
% \put(73,-1){\makebox(0,0)[t]{$x$}}
\put(20,38){\makebox(0,0)[l]{$\bx^{(1)}$}}
\put(10,35){\makebox(0,0)[r]{$Q(\bx;\bx^{(1)})$}}
\put(55,57){\makebox(0,0)[l]{$P^*\!(\bx)$}}
\put(45,30){\makebox(0,0)[l]{$L$}}
\put(18,47){\makebox(0,0)[b]{$\epsilon$}}
\end{picture}
}
\end{center}
}{%
\caption[a]{{Metropolis method in two dimensions,
showing a traditional proposal density that
has a sufficiently small step size $\epsilon$
that the acceptance frequency\index{acceptance rate} will be about
0.5.}}
\label{fig.metrop2}
}%
\end{figure}
\subsection{Demonstration of the Metropolis method}
\label{sec.metrop.demo}
The Metropolis method is widely used for high-dimensional problems.
%
Many implementations of the Metropolis method employ a proposal distribution
with a length scale $\epsilon$
that is short relative to the longest length scale $L$
of the probable region (\figref{fig.metrop2}).
%The use of a small length
% scale is not obligatory, but a
A reason for choosing a small length
scale is that for most high-dimensional problems, a large random step
from a typical point (that is, a sample from $P(\bx)$)
is very likely to end in a state that has very low probability;
such steps are unlikely to be accepted.
If $\epsilon$ is large,
movement around the state space will only occur when such a transition
to a low-probability state
is actually accepted, or when a large random step chances to land in another
probable state. So the rate of progress
will be slow if large steps are used.
% , unless small steps are used.
The disadvantage of small steps, on the other hand, is that
the Metropolis method will explore the probability distribution
by a {\dem\ind{random walk}\/}, and a random walk takes a long
time to get anywhere, especially if the walk is made of small steps.
\exercisxA{1}{ex.randomwalk}{
Consider a one-dimensional random walk, on each step
of which the state moves
randomly to the left or to the right with equal probability.
Show that after $T$ steps of size $\epsilon$,
the state is likely to have moved only a distance
about $\sqrt{T} \epsilon$. (Compute
the root mean square distance travelled.)
}
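The scaling in this exercise can be checked numerically before it is derived. The sketch below is an empirical check only, not a solution; the function name \verb|rms_distance| and its arguments are hypothetical.

```python
import math
import random


def rms_distance(T, eps, n_walks, rng):
    """Estimate the root mean square displacement after T steps of size eps."""
    total_sq = 0.0
    for _ in range(n_walks):
        x = 0.0
        for _ in range(T):
            # Move left or right with equal probability.
            x += eps if rng.random() < 0.5 else -eps
        total_sq += x * x
    return math.sqrt(total_sq / n_walks)
```

Calling \verb|rms_distance(400, 0.5, 2000, random.Random(0))| should give a value close to $\sqrt{400} \times 0.5 = 10$, much smaller than the distance $400 \times 0.5 = 200$ that a walk taking every step in the same direction would cover.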
Recall that the first aim of Monte Carlo
sampling is to generate a number of {\em independent\/} samples
from the given distribution (a dozen, say).
If the largest length scale of the state space is $L$,
then we have to simulate a random-walk Metropolis method
for a time $T \simeq \left(\linefrac{L}{\epsilon}\right)^2$
before we can expect to get a sample that is
roughly independent of the initial condition -- and
that's assuming that every step is accepted: if only a fraction $f$
of the steps are accepted on average, then this time is increased
by a factor $1/f$. \medskip
\begin{conclusionbox}
{\bf Rule of thumb: lower bound on number of iterations of a Metropolis
method.} If the largest length scale of the space of probable
states is $L$,\index{key points!Monte Carlo}
a Metropolis method whose proposal distribution generates a random
walk with step size $\epsilon$ must be run for at least
\beq
T \simeq \left(\linefrac{L}{\epsilon}\right)^2
\label{eq.ruleofthumb}
\eeq
iterations
to obtain an independent sample.
\end{conclusionbox}
\medskip
This rule of thumb
% for the required number of iterations to obtain an independent sample
gives only a lower bound; the situation may be much worse if, for
example, the probability distribution consists of several
% separate
islands of high probability separated by regions of low probability.
\begin{figure}[htbp]
\figuremargin{\small%
\begin{center}
\begin{tabular}{c@{\hspace{-0.3in}}c@{\hspace{-0.3in}}c}
\hspace*{-0.1in}(a)\begin{tabular}{c}
\psfig{figure=metrop/Aps.ps,height=5.5in,width=0.35in}%7.5 was too big
\end{tabular}
&
\begin{tabular}{c}(b) Metropolis\\[0.15in]
\psfig{figure=metrop/hist.100.ps,height=1.64in,angle=-90} \\
\psfig{figure=metrop/hist.400.ps,height=1.64in,angle=-90} \\
\psfig{figure=metrop/hist.1200.ps,height=1.64in,angle=-90} \\
\end{tabular}
&
\begin{tabular}{c}(c) Independent sampling\\[0.15in]
\psfig{figure=metrop/h.100.ps,height=1.64in,angle=-90} \\
\psfig{figure=metrop/h.400.ps,height=1.64in,angle=-90} \\
\psfig{figure=metrop/h.1200.ps,height=1.64in,angle=-90} \\
\end{tabular}
\\
\end{tabular}
\end{center}
}{%
\caption[a]{{Metropolis method for a toy problem.}
%
(a) The state sequence for $t = 1 , \ldots , 600$. Horizontal direction =
states from 0 to 20; vertical direction = time from 1 to 600; the
cross bars mark time intervals of duration 50.
%
(b) Histogram of occupancy of the states after 100, 400 and 1200 iterations.
%
(c) For comparison, histograms resulting when
successive points are drawn {\em independently\/}
from the target distribution.
}
\label{fig.metrop}
}%
\end{figure}
\label{sec.simplemc}
To illustrate
how slowly
% the difficulties caused by
% the exploration of a state space by
a random walk explores a state space, \figref{fig.metrop} shows
a simulation of a Metropolis algorithm
for generating
% that is intended to generate
samples from the distribution:
% following distribution over integers:
\beq
P(x) = \left\{
\begin{array}{ll} \dfrac{1}{21} & x \in \{ 0,1,2,\ldots,20 \} \\
0 & \mbox{otherwise.} \end{array}
\right.
\label{eq.metrop}
\eeq
The proposal distribution is
% the probability distribution for a simple random walk,
\beq
Q(x' ; x ) = \left\{
\begin{array}{ll} \dfrac{1}{2} & x' = x \pm 1 \\
0 & \mbox{otherwise.} \end{array}
\right.
\label{eq.metropb}
\eeq
Because the target distribution $P(x)$ is uniform,
rejections occur only when the proposal takes the state
to $x' = -1$ or $x'=21$.
The simulation was started in the state $x_0 = 10$ and its evolution
is shown in \figref{fig.metrop}a. How long does it take to reach one of
the end states $x = 0$ and $x=20$? Since the distance is 10 steps,
the rule of thumb (\ref{eq.ruleofthumb}) predicts that it will typically take
a time $T \simeq 100$ iterations to reach an end state.
This is confirmed in the present example: the first step into an
end state occurs on the 178th iteration.
How long does it take to visit {\em both\/} end states?
The rule of thumb predicts about 400 iterations are required
to traverse the whole state space; and indeed the first encounter with
the other end state takes place on the 540th iteration. Thus
effectively independent samples are only generated by simulating
for about four hundred iterations per independent sample.
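The figures quoted above come from a single run. The following sketch repeats the toy simulation so that the typical times can be compared with the rule of thumb; the function name \verb|first_visit_times| is hypothetical, and the code assumes the target (\ref{eq.metrop}) and proposal (\ref{eq.metropb}) exactly as defined above.

```python
import random


def first_visit_times(x0, rng, lo=0, hi=20):
    """Simulate the toy Metropolis chain started at x0.

    Returns (t_first, t_both): the iteration on which an end state is first
    reached, and the iteration by which both end states have been visited.
    """
    x = x0
    t = 0
    t_first = None
    visited = set()
    while len(visited) < 2:
        t += 1
        x_new = x + (1 if rng.random() < 0.5 else -1)
        # The target is uniform on lo..hi and zero elsewhere, so a proposal
        # is rejected only if it leaves that range.
        if lo <= x_new <= hi:
            x = x_new
        if x in (lo, hi) and x not in visited:
            visited.add(x)
            if t_first is None:
                t_first = t
    return t_first, t


def mean_visit_times(n_runs, rng, x0=10):
    """Average the two times over n_runs independent simulations."""
    sum_first = 0
    sum_both = 0
    for _ in range(n_runs):
        t_first, t_both = first_visit_times(x0, rng)
        sum_first += t_first
        sum_both += t_both
    return sum_first / n_runs, sum_both / n_runs
```

Averaged over a few thousand runs, the first time should come out near one hundred iterations and the second at several hundred, in line with the rule of thumb; individual runs, like the one in \figref{fig.metrop}, scatter widely about these averages.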
% [This discussion should not be misunderstood as saying that the aim
% of a Markov chain Monte Carlo is to actually reach every probable state;
% the argument is that if the chain has not had
% \subsection{Reducing random walk behaviour in Markov chain Monte Carlo}
This simple example shows that it is important to try to
abolish random walk behaviour in Monte Carlo methods. A
systematic exploration of the toy state space $\{0,1,2,\ldots , 20\}$
could get around it, using the same step
sizes, in about twenty steps instead of four hundred.
Methods for reducing random walk behaviour
are discussed in the next chapter.
%
% \subsection{Hybrid Monte Carlo}
%
%
\subsection{Metropolis method in high dimensions}
The rule of thumb (\ref{eq.ruleofthumb}),
% that we discussed above,
which gives a lower bound on
the number of iterations of a random walk Metropolis method,
also applies to higher dimensional problems.
Consider the simple case of a target distribution that is
an $N$-dimensional Gaussian, and a proposal distribution that is a spherical
Gaussian of standard deviation $\epsilon$ in each direction.
Without loss of generality, we can
assume that the target distribution is a separable distribution
aligned with the axes $\{x_n\}$, and that it has standard deviation
$\sigma_n$ in direction $n$. Let $\sigma^{\max}$
and $\sigma^{\min}$ be the largest and smallest of these standard deviations.
Let us assume that $\epsilon$ is adjusted such that the acceptance
frequency\index{acceptance rate} is close to 1. Under this assumption,
each variable $x_n$ evolves independently of all the others,
executing a random walk with step size about $\epsilon$. The time taken
to generate effectively independent samples from the target distribution
will be controlled by the largest lengthscale $\sigma^{\max}$.
Just as in the previous section, where
we needed at least $T \simeq (L/\epsilon)^2$
iterations to obtain an independent sample, here we
need $T \simeq ( \sigma^{\max} /\epsilon)^2$.
Now, how big can $\epsilon$ be? The bigger it is, the smaller this number
$T$ becomes, but if $\epsilon$ is too big -- bigger than
$\sigma^{\min}$ -- then the acceptance rate\index{acceptance rate} will fall sharply.
It seems plausible that the optimal $\epsilon$ must be similar to
$\sigma^{\min}$. Strictly, this may not be true; in
special cases where the second smallest $\sigma_n$
is significantly greater than $\sigma^{\min}$,
the optimal $\epsilon$ may be closer to that second smallest
$\sigma_n$. But our rough conclusion is this: where simple
spherical proposal distributions are used,
we will need at least $T \simeq ( \sigma^{\max} / \sigma^{\min} )^2$
iterations to obtain an independent sample, where
$\sigma^{\max}$ and $\sigma^{\min}$ are the longest and shortest lengthscales
of the target distribution.
This is good news and bad news. It is good news because, unlike
the cases of rejection sampling and importance sampling, there is no
catastrophic dependence on the dimensionality $N$.
% We can get answers
Our \ind{computer} {\em will\/} give useful
answers in a time shorter than the age of the universe.
But it is bad news
all the same, because this quadratic dependence on the
lengthscale-ratio
% that random walks induce
may still force us to make very lengthy simulations.
Fortunately, there are methods for suppressing
\index{Monte Carlo methods!random walk suppression}\index{random walk!suppression}random walks in
Monte Carlo simulations, which we will discuss in the next chapter.
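As a purely illustrative calculation (the numbers are assumed, not taken from any particular problem), suppose the target distribution has $\sigma^{\max} = 10$ and $\sigma^{\min} = 0.1$ in some units. The rough conclusion above then gives
\beq
	T \simeq \left( \linefrac{\sigma^{\max}}{\sigma^{\min}} \right)^2
	= \left( \linefrac{10}{0.1} \right)^2 = 10^4
\eeq
 iterations per independent sample, whatever the value of $N$. A simulation of $10^6$ iterations would thus yield at most about a hundred effectively independent samples.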
\begin{figure}
\figuremargin{%
\begin{center}
\begin{tabular}{ll}
\hspace{-0.05in}(a)
{
\setlength{\unitlength}{1mm}
\begin{picture}(50,50)(0,0)
\put(0,0){\makebox(0,0)[bl]{\psfig{figure=figs/gibbs.eps,%
width=2in,angle=-90}}}
\put(48,-1){\makebox(0,0)[t]{$x_1$}}
\put(-1,40){\makebox(0,0)[r]{$x_2$}}
\put(33,19){\makebox(0,0)[l]{$P(\bx)$}}
\end{picture}
}
&
\hspace{-0.05in}(b)
\setlength{\unitlength}{1mm}
\begin{picture}(50,50)(0,0)
\put(0,0){\makebox(0,0)[bl]{\psfig{figure=figs/gibbss.eps,%
width=2in,angle=-90}}}
\put(48,-1){\makebox(0,0)[t]{$x_1$}}
\put(-1,40){\makebox(0,0)[r]{$x_2$}}
\put(29,13){\makebox(0,0)[l]{$P(x_1\given x_2^{(t)})$}}
\put(24,6){\makebox(0,0)[l]{$\bx^{(t)}$}}
\end{picture}
\\
\hspace{-0.05in}(c)
\setlength{\unitlength}{1mm}
\begin{picture}(50,50)(0,0)
\put(0,0){\makebox(0,0)[bl]{\psfig{figure=figs/gibbst.eps,%
width=2in,angle=-90}}}
\put(48,-1){\makebox(0,0)[t]{$x_1$}}
\put(-1,40){\makebox(0,0)[r]{$x_2$}}
\put(33,16){\makebox(0,0)[l]{$P(x_2\given x_1)$}}
\end{picture}
&
\hspace{-0.05in}(d)
{
\setlength{\unitlength}{1mm}
\begin{picture}(50,50)(0,0)
\put(0,0){\makebox(0,0)[bl]{\psfig{figure=figs/gibbsu.eps,%
width=2in,angle=-90}}}
\put(48,-1){\makebox(0,0)[t]{$x_1$}}
\put(-1,40){\makebox(0,0)[r]{$x_2$}}
\put(24,6){\makebox(0,0)[l]{$\bx^{(t)}$}}
\put(12,22){\makebox(0,0)[br]{$\bx^{(t+1)}$}}
\put(18,29){\makebox(0,0)[br]{$\bx^{(t+2)}$}}
\end{picture}
}
\\
\end{tabular}
\end{center}
}{%
\caption[a]{{Gibbs sampling.}
(a) The joint density {$P(\bx)$}
from which samples
are required. (b) Starting from a state $\bx^{(t)}$, $x_1$
is sampled from the conditional density $P(x_1\given x_2^{(t)})$.
(c) A sample is then made from the conditional density $P(x_2\given x_1)$.
(d) A couple of iterations of Gibbs sampling. }
\label{fig.gibbs}
}%
\end{figure}
%
\section{Gibbs sampling}
We introduced\indexs{Monte Carlo methods!Gibbs sampling}
% have studied
importance sampling, rejection sampling and the
Metropolis method using one-dimensional examples. \inds{Gibbs sampling},
also known as the {\em{\ind{heat bath}} method\/}
or `\ind{Glauber dynamics}',
% Not in index because done in see.tex % NOW UNDONE in see.tex
is a method for sampling from distributions over at least two dimensions.
Gibbs sampling can be viewed as a Metropolis method in which a
sequence of proposal
distributions $Q$ are defined in terms of the {\em conditional\/}
distributions of the joint distribution $P(\bx)$. It is assumed that, whilst
$P(\bx)$ is too complex to draw samples from directly, its conditional
distributions $P(x_i\given \{x_j\}_{j\neq i})$ are tractable to work with.
%
% Gibbs sampling is a \MCMC\ method in which each iteration
% $\bx \rightarrow \bx'$ involves a separate sampling of each
% variable $x_i$ in turn from its distribution {\em conditional\/} on the
% current values of all the other variables in the model.
For many graphical
models (but not all) these one-dimensional conditional
distributions are straightforward to sample from.
For example, if a Gaussian distribution for some variables $\bd$ has an unknown
mean $\bm$,
and the prior distribution of $\bm$ is Gaussian,
then the conditional distribution of $\bm$ given $\bd$ is also Gaussian.
Conditional
distributions that are not of standard form may still be sampled from
by {\dem\ind{adaptive rejection sampling}\index{Monte Carlo methods!rejection sampling!adaptive}\/}
if the conditional distribution satisfies
certain \ind{convexity} properties \cite{Gilks_Wild}.
Gibbs sampling is illustrated for a case with two variables $(x_1,x_2)=\bx$
in \figref{fig.gibbs}.
On each iteration, we start from the current state $\bx^{(t)}$,
and $x_1$
is sampled from the conditional density $P(x_1\given x_2)$, with $x_2$ fixed
to $x^{(t)}_2$.
A sample $x_2$ is then made from the conditional density $P(x_2\given x_1)$,
using the new value of $x_1$. This brings us to the new state
$\bx^{(t+1)}$, and completes the iteration.
In the general case of a system with $K$ variables,
a single iteration involves sampling one parameter at a time:
\newcommand{\tplusone}{(t+1)}
\beqan
\label{eq.gibbs1}
x_1^{\tplusone} &\sim& P( x_1 \given x_2^{(t)} , x_3^{(t)} , \ldots , x_K^{(t)} ) \\
x_2^{\tplusone} &\sim& P( x_2 \given x_1^{\tplusone} , x_3^{(t)} , \ldots , x_K^{(t)} ) \\
x_3^{\tplusone} &\sim& P( x_3 \given x_1^{\tplusone} , x_2^{\tplusone} , \ldots , x_K^{(t)} ) , \:\: \mbox{etc.}
\label{eq.gibbs3}
\eeqan
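The updates (\ref{eq.gibbs1}--\ref{eq.gibbs3}) can be sketched for a case in which the conditional distributions are known in closed form. The example below assumes, purely for illustration, a two-dimensional Gaussian target with zero means, unit variances and correlation coefficient $\rho$, for which $P(x_1 \given x_2)$ is Gaussian with mean $\rho x_2$ and variance $1-\rho^2$; the function name \verb|gibbs_bivariate_gaussian| is hypothetical.

```python
import math
import random


def gibbs_bivariate_gaussian(rho, n_iters, rng, x1=0.0, x2=0.0):
    """Gibbs sampling for a zero-mean, unit-variance bivariate Gaussian.

    Returns the list of states (x1, x2), one per iteration.
    """
    cond_sd = math.sqrt(1.0 - rho * rho)
    samples = []
    for _ in range(n_iters):
        # Sample x1 from P(x1 | x2), with x2 fixed at its current value.
        x1 = rng.gauss(rho * x2, cond_sd)
        # Sample x2 from P(x2 | x1), using the new value of x1.
        x2 = rng.gauss(rho * x1, cond_sd)
        samples.append((x1, x2))
    return samples
```

Note that there is no accept/reject step and no step size to choose; each variable is simply redrawn from its conditional distribution.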
\subsection{Convergence of Gibbs sampling to the target density}
\exercisxB{2}{ex.gibbs.eq.met}{
Show that a single variable-update
of Gibbs sampling can be viewed as a Metropolis method
with target density $P(\bx)$, and that this Metropolis method
has the property that every proposal is always accepted.
}
Because Gibbs sampling is a Metropolis method, the probability
distribution of $\bx^{(t)}$ tends to $P(\bx)$ as $t \rightarrow
\infty$, as long as $P(\bx)$ does not have pathological properties.
\exercissxB{2}{ex.gibbs.h74}{
Discuss whether the syndrome decoding problem for a $(7,4)$ Hamming code
can be solved using Gibbs sampling.
The syndrome decoding problem, if we are to solve it with a Monte Carlo
approach, is to draw samples
from the posterior distribution of the noise vector $\bn = (n_1, \ldots, n_n,
\ldots, n_N)$,
\beq
P( \bn \given {\bf f}, \bz ) = \frac{1}{Z} \prod_{n=1}^N f_n^{n_n}
(1-f_n)^{(1-n_n)} \, \truth [ \bH \bn \eq \bz ] ,
\eeq
where $f_n$ is the normalized likelihood for the $n$th transmitted bit and
$\bz$ is the observed syndrome. The factor $\truth [ \bH \bn \eq \bz ]$
is 1 if
% the hypothesis
$\bn$ has the correct syndrome $\bz$
and 0 otherwise.
What about the
\ind{syndrome decoding}\index{error-correcting code!syndrome decoding}
problem for any linear error-correcting code?
}
\subsection{Gibbs sampling in high dimensions}
Gibbs sampling suffers from the same defect as simple Metropolis algorithms
-- the state space is explored by a slow random walk, unless
a fortuitous parameterization has been chosen that makes the
probability distribution $P(\bx)$ separable. If, say, two variables
$x_1$ and $x_2$ are strongly correlated, having marginal densities
of width $L$ and conditional densities of width $\epsilon$,
then it will take at least about $(L/\epsilon)^2$ iterations
to generate an independent sample from the target density. \Figref{fig.adler},
\pref{fig.adler}, illustrates the slow progress made
by Gibbs sampling when $L \gg \epsilon$.
However, Gibbs sampling involves no adjustable parameters, so it is
an attractive strategy when one wants to get a model running
quickly.
An excellent software package, {\tt BUGS},\index{software!BUGS}\index{BUGS}
makes it easy to set up almost arbitrary probabilistic models
and simulate them by Gibbs sampling \cite{bugs}.\footnote{\tt{http://www.mrc-bsu.cam.ac.uk/bugs/}}
%%%%%%%%%%%%%%%%%%%%%%%%%%
% possible boundary
%%%%%%%%%%%%%%%%%%%%%%%%%%
% new material added here from metrop/DEMO.m
% post=1
% DEMO
\newcommand{\metropdensity}[2]{%
\mbox{\makebox[0in][r]{\raisebox{0.24in}{$p^{(#2)}(x)$}}%
\psfig{figure=metrop/ps/pt#1.#2.ps,width=1in,angle=-90}}}
% advanced monte carlo methods
\section{Terminology for \MCMC\ methods}
\label{sec.mc.terminology}
% The preceding description of the Metropolis method and Gibbs sampling
% is hopefully comprehensible.
We now spend a few moments sketching\index{terminology!Monte Carlo methods}
the theory on which the Metropolis method and Gibbs sampling are based.
We denote by $p^{(t)}(\bx)$ the probability distribution of the
state of a Markov chain simulator. (To visualize this distribution, imagine running
an infinite
collection of identical simulators in parallel.)
Our aim is to find a Markov chain
such that as $t \rightarrow \infty$, $p^{(t)}(\bx)$ tends to
the desired distribution $P(\bx)$.
%\begin{description}
%\item[Markov chain.]
% \subsection{Markov chain}
A {\dbf Markov chain} can be specified by an {\dbf initial} probability
distribution $p^{(0)}(\bx)$ and a {\dbf transition probability} $T(\bx';\bx)$.
The probability distribution of the state at the $(t\!+\!1)$th iteration
 of the Markov chain, $p^{(t+1)}(\bx')$, is given by
\beq
p^{(t+1)}(\bx') = \intdx\: T(\bx';\bx) p^{(t)}(\bx) .
\eeq
%\item[Choice of Markov chain.]
% \subsection{Choice of Markov chain}
\noindent
\Exampl{example20}{
An example of a Markov chain is given by the Metropolis demonstration
of section \ref{sec.metrop.demo} (\figref{fig.metrop}),
for which the transition probability is
\begin{realcenter}
{\footnotesize $\displaystyle
\mbox{\normalsize$\bT$} = \left[
\begin{array}{*{21}{c@{\,}}}
\dhalf&\dhalf&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot\\[-0.05in]
\dhalf&\cdot&\dhalf&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot\\[-0.05in]
\cdot&\dhalf&\cdot&\dhalf&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot\\[-0.05in]
\cdot&\cdot&\dhalf&\cdot&\dhalf&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot\\[-0.05in]
\cdot&\cdot&\cdot&\dhalf&\cdot&\dhalf&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot\\[-0.05in]
\cdot&\cdot&\cdot&\cdot&\dhalf&\cdot&\dhalf&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot\\[-0.05in]
\cdot&\cdot&\cdot&\cdot&\cdot&\dhalf&\cdot&\dhalf&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot\\[-0.05in]
\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\dhalf&\cdot&\dhalf&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot\\[-0.05in]
\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\dhalf&\cdot&\dhalf&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot\\[-0.05in]
\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\dhalf&\cdot&\dhalf&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot\\[-0.05in]
\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\dhalf&\cdot&\dhalf&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot\\[-0.05in]
\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\dhalf&\cdot&\dhalf&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot\\[-0.05in]
\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\dhalf&\cdot&\dhalf&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot\\[-0.05in]
\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\dhalf&\cdot&\dhalf&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot\\[-0.05in]
\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\dhalf&\cdot&\dhalf&\cdot&\cdot&\cdot&\cdot&\cdot\\[-0.05in]
\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\dhalf&\cdot&\dhalf&\cdot&\cdot&\cdot&\cdot\\[-0.05in]
\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\dhalf&\cdot&\dhalf&\cdot&\cdot&\cdot\\[-0.05in]
\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\dhalf&\cdot&\dhalf&\cdot&\cdot\\[-0.05in]
\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\dhalf&\cdot&\dhalf&\cdot\\[-0.05in]
\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\dhalf&\cdot&\dhalf\\[-0.05in]
\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\cdot&\dhalf&\dhalf
\end{array}
\right]
$} \end{realcenter}
and the initial distribution was
\beq
p^{(0)}(x) = \left[
\begin{array}{*{21}{c@{\,}}}
\,\cdot\,&\,\cdot\,&\,\cdot\,&\,\cdot\,&\,\cdot\,&\,\cdot\,&\,\cdot\,&\,\cdot\,&\,\cdot\,&\,\cdot\,&1&\,\cdot\,&\,\cdot\,&\,\cdot\,&\,\cdot\,&\,\cdot\,&\,\cdot\,&\,\cdot\,&\,\cdot\,&\,\cdot\,&\,\cdot\,
\\
\end{array}
\right] .
\eeq
The probability distribution $p^{(t)}(x)$ of the state at the
$t$th iteration is shown
 for $t=0,$ 1, 2, 3, 10, 100, 200, 400
% $\ldots$
in \figref{fig.metropdensity10};
an equivalent sequence of distributions is shown
in \figref{fig.metropdensity17}
for the chain that begins in initial state $x_0=17$.
Both chains converge to the target density, the uniform density,
as $t \rightarrow \infty$.
\amarginfig{b}{\footnotesize
\begin{center}
\begin{tabular}{c}
\metropdensity{10}{0}\\[-0.3in]
\metropdensity{10}{1}\\[-0.2in]
\metropdensity{10}{2}\\[-0.2in]
\metropdensity{10}{3}\\[-0.3in]
\metropdensity{10}{10}\\[-0.3in]
\metropdensity{10}{100}\\[-0.3in]
\metropdensity{10}{200}\\[-0.3in]
\metropdensity{10}{400}\\[-0.13in]
\end{tabular}
\end{center}
\caption[a]{The probability distribution of the state of the
Markov chain of \exampleonlyref{example20}.
}\label{fig.metropdensity10}
}% \ENDsolution
}
%\end{example}
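The evolution of $p^{(t)}(x)$ in the example above can be reproduced by repeated multiplication by the transition matrix. The sketch below is one way to do this; the function names \verb|toy_transition_matrix| and \verb|evolve| are hypothetical, and the matrix is stored so that \verb|T[x_new][x]| holds $T(x';x)$.

```python
def toy_transition_matrix(n_states=21):
    """Transition matrix of the toy Metropolis chain; T[x_new][x] = T(x'; x)."""
    T = [[0.0] * n_states for _ in range(n_states)]
    for x in range(n_states):
        for x_new in (x - 1, x + 1):
            if 0 <= x_new < n_states:
                # Proposal inside the range is always accepted (uniform target).
                T[x_new][x] += 0.5
            else:
                # Proposal outside the range is rejected: the state stays put.
                T[x][x] += 0.5
    return T


def evolve(T, p, n_steps):
    """Apply p <- T p n_steps times and return the resulting distribution."""
    n = len(p)
    for _ in range(n_steps):
        p = [sum(T[i][j] * p[j] for j in range(n)) for i in range(n)]
    return p


def delta_distribution(x0, n_states=21):
    """Initial distribution with all probability on state x0."""
    return [1.0 if x == x0 else 0.0 for x in range(n_states)]
```

For example, \verb|evolve(toy_transition_matrix(), delta_distribution(10), 400)| returns a distribution that is close to uniform, and replacing 10 by 17 gives the chain with the other initial condition.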
\subsection{Required properties}
When designing a Markov chain Monte Carlo method,
we construct a chain with the following properties:
\ben
\item The desired
distribution $P(\bx)$ is an {\dbf\ind{invariant distribution}} of the chain.
A distribution $\pi(\bx)$ is an invariant distribution of the transition
probability $T(\bx';\bx)$ if
\beq
\pi(\bx') = \intdx\: T(\bx';\bx) \pi(\bx) .
\eeq
An invariant distribution is an eigenvector of the transition
probability matrix that has eigenvalue 1.
\item
The chain must also be {\dbf\ind{ergodic}}, that is,
\beq
p^{(t)}(\bx) \rightarrow \pi(\bx) \mbox{ as $t \rightarrow \infty$, for any $p^{(0)}(\bx)$.}
\eeq
A couple of reasons why a chain might not be ergodic are:
\ben
\item Its matrix might be {\dem\ind{reducible}}, which
means that the state space contains two or more subsets of states that
can never be reached from each other. Such a chain has many invariant
distributions; which one $p^{(t)}(\bx)$ would tend to
as $t \rightarrow \infty$ would depend on the initial condition
$p^{(0)}(\bx)$.
\amarginfig{b}{\footnotesize
\begin{center}
\begin{tabular}{c}
\metropdensity{17}{0}\\[-0.3in]
\metropdensity{17}{1}\\[-0.2in]
\metropdensity{17}{2}\\[-0.2in]
\metropdensity{17}{3}\\[-0.3in]
\metropdensity{17}{10}\\[-0.3in]
\metropdensity{17}{100}\\[-0.3in]
\metropdensity{17}{200}\\[-0.3in]
\metropdensity{17}{400}\\[-0.13in]
\end{tabular}
\end{center}
\caption[a]{The probability distribution of the state of the
Markov chain for initial condition $x_0 = 17$ (\exampleref{example20}).
}\label{fig.metropdensity17}
}
The transition probability matrix of such a chain has more than
one eigenvalue equal to 1.
\item The chain might have a {\dem periodic\/}
% irreducible
set, which
means that, for some initial conditions, $p^{(t)}(\bx)$ doesn't
tend to an invariant distribution, but instead tends to
a periodic limit-cycle.
A simple Markov chain with this property
is the random walk on the $N$-dimensional hypercube. The chain $T$
takes the state from one corner to a randomly chosen adjacent corner.
The unique invariant distribution of this chain is the uniform
distribution over all $2^N$ states, but the chain is not ergodic;
it is periodic with period two:
if we divide the states into states with odd parity and states with even
parity, we notice that every odd state is surrounded by even states
and {\em vice versa}. So if the initial condition at time $t=0$
is a state with even parity, then at time $t=1$ -- and
at all odd times -- the state must have
odd parity, and at all even times, the state will be of even parity.
The transition probability matrix of such a chain has more than
one eigenvalue with magnitude equal to 1. The random walk on the hypercube,
for example,
has eigenvalues equal to $+ 1$ and $-1$.
\een
\een
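To make the eigenvalue statement for the periodic random walk on the hypercube concrete, consider the smallest instance, $N=1$. The hypercube then has two corners, and the walk always moves to the other corner, so, ordering the states as $(0,1)$,
\beq
	\bT = \left[ \begin{array}{cc} 0 & 1 \\ 1 & 0 \end{array} \right] ,
\eeq
 whose eigenvectors $(1,1)$ and $(1,-1)$ have eigenvalues $+1$ and $-1$ respectively. The uniform distribution $(\linefrac{1}{2},\linefrac{1}{2})$ is invariant, but the initial condition $p^{(0)} = (1,0)$ gives $p^{(t)} = (0,1)$ at all odd times and $p^{(t)} = (1,0)$ at all even times, so $p^{(t)}$ never converges.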
\subsection{Methods of construction of Markov chains}
\index{concatenation!in Markov chains}It
is often convenient to construct $T$ by \index{mixture distribution}{\dem{mixing}\/}
or {\dem{concatenating}\/}
simple {\dbf\ind{base transitions}\/} $B$ all of which satisfy
\beq
P(\bx') = \intdx\: B(\bx';\bx) P(\bx) ,
\eeq
for the desired density $P(\bx)$, \ie, they
all have the desired density as an invariant distribution.
These base transitions need not individually be
\ind{ergodic}.
$T$ is a {\dem{mixture}}\index{mixture!in Markov chains}
 of several base transitions $B_b(\bx';\bx)$ if
 we make the transition by picking one of the base transitions
 at random, and allowing it to determine the transition, \ie,
\beq
	T(\bx';\bx) = \sum_b p_b B_b(\bx';\bx) ,
\eeq
 where $\{ p_b \}$ is a probability distribution over the
 base transitions.
$T$ is a {\dem{concatenation}}\index{concatenation!in Markov chains} of two base transitions $B_1(\bx';\bx)$
 and $B_2(\bx';\bx)$ if we first make a transition to an
 intermediate state $\bx''$ using $B_1$, and then make a transition
 from state $\bx''$ to $\bx'$ using $B_2$:
\beq
	T(\bx';\bx) = \intdxpp B_2(\bx';\bx'') B_1(\bx'';\bx) .
\label{eq.concatT}
\eeq
% \item[Detailed balance.]
% \subsection{Detailed balance}
\subsection{Detailed balance}
Many useful transition probabilities satisfy the
{\dbf detailed balance} property:
\beq
T(\bx_{a};\bx_{b}) P(\bx_{b}) =
T(\bx_{b};\bx_{a}) P(\bx_{a}) , \mbox{ for all $\bx_{b}$ and $\bx_{a}$}.
\eeq
This equation says that if we pick (by magic) a state
% n $\bx$
from the target density
$P$ and make a transition under $T$ to another
state, it is just as likely that we will pick $\bx_{b}$
and go from $\bx_{b}$ to $\bx_{a}$ as it is that we will pick
$\bx_{a}$
and go from $\bx_{a}$ to $\bx_{b}$.
% \end{description}
Markov chains that satisfy detailed balance are also called
{\dbf reversible} Markov chains.
The reason why the detailed balance property is of interest
is that detailed balance implies invariance of the
distribution $P(\bx)$ under the Markov chain $T$, which
is a necessary condition for
the key property that we want from our MCMC simulation -- that
the probability distribution of the chain should converge to $P(\bx)$.
\exercisxB{2}{ex.detbal}{Prove that detailed balance implies invariance of the
distribution $P(\bx)$ under the Markov chain $T$.}
Proving that
detailed balance holds is often a key step
when proving that a \MCMC\ simulation will converge to
the desired distribution. The Metropolis method
% and Gibbs sampling method both
satisfies detailed balance, for example. Detailed balance
is not an essential condition, however, and we will see later that
irreversible Markov chains can be useful in practice, because
they may have different random walk properties.
%(We still require
% such chains to have $P(\bx)$ as their invariant distribution.)
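Detailed balance and invariance can be checked numerically for a small discrete chain. The sketch below is a check on one example, not a proof; it assumes a nearest-neighbour Metropolis chain on a handful of states with an arbitrary illustrative target, and the function names are hypothetical.

```python
def metropolis_matrix(p_star):
    """Transition matrix T[x_new][x] of a nearest-neighbour Metropolis chain.

    p_star is a list of positive unnormalized target values, one per state.
    """
    n = len(p_star)
    T = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for x_new in (x - 1, x + 1):
            if 0 <= x_new < n:
                accept = min(1.0, p_star[x_new] / p_star[x])
                T[x_new][x] += 0.5 * accept
                T[x][x] += 0.5 * (1.0 - accept)
            else:
                # Proposals that leave the state space are always rejected.
                T[x][x] += 0.5
    return T


def detailed_balance_violation(T, P):
    """Largest |T(a;b) P(b) - T(b;a) P(a)| over all pairs of states."""
    n = len(P)
    return max(abs(T[a][b] * P[b] - T[b][a] * P[a])
               for a in range(n) for b in range(n))


def invariance_violation(T, P):
    """Largest |sum_x T(x';x) P(x) - P(x')| over all states x'."""
    n = len(P)
    return max(abs(sum(T[i][j] * P[j] for j in range(n)) - P[i])
               for i in range(n))
```

For instance, with \verb|p_star = [1, 2, 4, 2, 1]| and \verb|P| obtained by dividing each entry by 10, both violations should be zero to within rounding error.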
\exercisxB{2}{ex.detbal2}{
Show that, if we concatenate two base transitions
$B_1$ and $B_2$ that satisfy detailed balance,
it is not necessarily the case that the $T$
thus defined (\ref{eq.concatT}) satisfies detailed balance.
}
\exercisxC{2}{ex.detbal3}{
Does Gibbs sampling, with several variables all
updated in a deterministic sequence, satisfy detailed balance?
% Radford says no.
}
% slice sampling
% 980214
\section{Slice sampling}
Slice sampling\index{slice sampling}
\cite{Radford_slice,Radford_slice2001}\index{Neal, Radford}
is a Markov chain Monte Carlo method that has
similarities to rejection sampling, Gibbs sampling and the Metropolis method.
It can be applied wherever the Metropolis method
can be applied, that is, to any system for which
the target density $P^*(\bx)$ can be evaluated at any point $\bx$;
it has the advantage over simple Metropolis methods that it is more robust
to the choice of parameters like step sizes.
The simplest version of slice sampling
is similar to Gibbs sampling in that
% a slice sampling simulation
it
consists of one-dimensional transitions
 in the state space; however, there is no requirement that the
one-dimensional conditional distributions be easy to sample from,
nor that they have any convexity properties such as
are required for adaptive rejection sampling.
And slice sampling is similar to rejection sampling in that it is a method that
asymptotically draws samples from the volume under the
curve described by $P^*(\bx)$; but there is no requirement for
an upper-bounding function.
I will describe slice sampling by giving a sketch of
a one-dimensional sampling algorithm, then giving a pictorial
description that includes the details
that make the method valid.
\subsection{The skeleton of slice sampling}
Let us assume that we want to draw samples from $P(x) \propto P^*(x)$
where $x$ is a real number.
A one-dimensional slice sampling algorithm is a method for
making transitions from a two-dimensional point $(x,u)$ lying
under the curve $P^*(x)$
to another point $(x',u')$ lying
under the same curve, such that the probability distribution of $(x,u)$
tends to a uniform distribution over the area under the curve $P^*(x)$,
whatever initial point we start from -- like the uniform distribution
under the curve
$P^*(x)$ produced by rejection sampling (\sectionref{sec.rejection}).
A single transition $(x,u) \rightarrow (x',u')$ of a
one-dimensional slice sampling algorithm has the following steps,
of which steps {\tt 3} and {\tt 8} will require further elaboration.
\medskip
\newcommand{\Uniform}{\mbox{Uniform}}
\newcommand{\localtt}{\sf}
\begin{framedalgorithm}
\noindent \sf
{\tt 1:} evaluate $P^*\!(x)$
\\{\tt 2:} draw a vertical coordinate $u' \sim \Uniform(0,P^*\!(x))$
\\{\tt 3:} create a horizontal interval $(x_l,x_r)$ enclosing $x$
\\{\tt 4:} loop {\localtt\{}
\\{\tt 5:} \hspace{0.3in} draw $x' \sim \Uniform(x_l,x_r)$
\\{\tt 6:} \hspace{0.3in} evaluate $P^*\!(x')$
\\{\tt 7:} \hspace{0.3in} {\localtt if} $P^*\!(x') > u'$ {\localtt break out of loop {\tt4}-{\tt9}}
\\{\tt 8:} \hspace{0.3in} {\localtt else} modify the interval $(x_l,x_r)$
\\{\tt 9:} {\localtt\}}
\end{framedalgorithm}
\medskip
There are several methods for creating the interval $(x_l,x_r)$ in step
{\tt 3}, and several methods for modifying it at step {\tt 8}.
The important point is that the overall method must satisfy detailed
balance, so that the uniform distribution for $(x,u)$
under the curve $P^*\!(x)$ is invariant.
% Here I will describe methods appropriate for a real variable $x$.
% see itp/octave/mcmc/slice.m
% second argument is a label
\newcommand{\slicefig}[2]{%
\makebox[0in][l]{\hspace{0.07in}\raisebox{1.5in}{\small\tt{#2}}}%
\mbox{\psfig{figure=octave/mcmc/ps/slice/#1.ps,width=2.49in,angle=-90}\hspace{-0.05in}}}%was 2.42 and -0.2
\begin{figure}
%\figuredangle{%
\figuremargin{%
\begin{raggedright}
\begin{tabular}{@{\hspace*{-0.524in}}*{2}{l}@{\hspace*{-0.2in}}}
\slicefig{22.1}{1}&
\slicefig{22.2}{2}\\
\slicefig{22.3}{3a,3b,3c}&
\slicefig{22.4}{3d,3e}\\
\slicefig{22.5}{5,6}&
\slicefig{22.6}{8}\\
\slicefig{22.7}{5,6,7}&
\\
\end{tabular}
\end{raggedright}
}{%
\caption[a]{Slice sampling.
Each panel is labelled by the steps of the algorithm that
are executed in it. At step {\tt1}, $P^*\!(x)$ is evaluated
at the current point $x$.
At step {\tt2}, a vertical coordinate is selected giving the point $(x,u')$
 shown by the box.
At steps {\tt 3a-c}, an interval of size $w$ containing $(x,u')$ is created
at random.
At step {\tt 3d}, $P^*$ is evaluated at the left end of the interval
and is found to be larger than $u'$, so a step to the left of size $w$
is made.
At step {\tt 3e}, $P^*$ is evaluated at the right end of the interval
and is found to be smaller than $u'$, so no stepping out to the
right is needed.
When step {\tt 3d} is repeated, $P^*$ is
found to be smaller than $u'$, so the stepping out halts.
At step {\tt 5} a point is drawn from the interval, shown by a $\circ$.
Step {\tt6} establishes that this point is above $P^*$
and step {\tt8} shrinks the interval to the rejected point
in such a way that the original point $x$ is still in the interval.
When step {\tt5} is repeated, the new coordinate $x'$ (which is
to the right-hand side of the interval) gives a value of $P^*$ greater than
$u'$, so this point $x'$ is the outcome at step {\tt7}.
}
\label{fig.slice0}
\label{fig.slice}
}%
\end{figure}
%%%%%%%%%%%%%
\subsection{The `stepping out' method for step {\tt 3}}
In the `stepping out' method for
creating an interval $(x_l,x_r)$ enclosing $x$, we step out
in steps of length $w$ until we find
endpoints $x_l$ and $x_r$ at which $P^*$ is smaller than
 $u'$. The algorithm is
% easiest to understand by seeing it in action as
shown in \figref{fig.slice}.
\medskip
\begin{framedalgorithm}
\noindent \sf
{\tt 3a:} draw $r \sim \Uniform(0,1)$
\\{\tt 3b:} $x_l$ {\tt :=} $x - r w$
\\{\tt 3c:} $x_r$ {\tt :=} $x + (1- r) w$
\\{\tt 3d:} {\localtt while} ($P^*\!(x_l)>u'$) {\localtt\{} $x_l \:{\tt:=}\: x_l - w$
{\localtt\}}
\\{\tt 3e:} {\localtt while} ($P^*\!(x_r)>u'$) {\localtt\{} $x_r \:{\tt:=}\: x_r + w$
{\localtt\}}
\end{framedalgorithm}
\subsection{The `shrinking' method for step {\tt 8}}
Whenever a point $x'$ is drawn such that $(x',u')$ lies
above the curve $P^*\!(x)$, we shrink the interval so that
one of the end points is $x'$, and such that the original
point $x$ is still enclosed in the interval.
\medskip
%\begin{quotation}
\begin{framedalgorithm}
\noindent \sf
{\tt 8a:} {\localtt if} ($x'>x$)
% \\ \hspace*{0.853in}
\{ $x_r$ {\tt :=} $x'$ \}
\\{\tt 8b:} {\localtt else}
% \\ \hspace*{0.853in}
\{ $x_l$ {\tt :=} $x'$ \}
\end{framedalgorithm}
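
The skeleton, with the stepping-out and shrinking methods filled in, can be
sketched in Python as follows. This is a minimal illustration rather than a
definitive implementation: the function name {\tt slice\_sample\_step}, its
argument names, and the use of Python's {\tt random} module are assumptions
made for this sketch, and the current point is assumed to satisfy $P^*\!(x)>0$.

```python
import math
import random


def slice_sample_step(p_star, x, w, rng):
    """One transition x -> x' of one-dimensional slice sampling.

    p_star : callable returning the unnormalized density P*(x) >= 0
    x      : current point, assumed to satisfy p_star(x) > 0
    w      : initial interval width used for stepping out
    rng    : a random.Random instance (an assumption of this sketch)
    """
    # Steps 1-2: evaluate P*(x) and draw the height u' ~ Uniform(0, P*(x)).
    u = rng.uniform(0.0, p_star(x))

    # Steps 3a-3c: place an interval of width w at random around x.
    r = rng.random()
    x_l = x - r * w
    x_r = x + (1.0 - r) * w

    # Steps 3d-3e: step out until P* at both ends is no larger than u'.
    while p_star(x_l) > u:
        x_l -= w
    while p_star(x_r) > u:
        x_r += w

    # Steps 4-9: draw from the interval, shrinking it after each failure.
    while True:
        x_new = rng.uniform(x_l, x_r)
        if p_star(x_new) > u:
            return x_new
        # Steps 8a-8b: shrink so that x stays inside the interval.
        if x_new > x:
            x_r = x_new
        else:
            x_l = x_new


if __name__ == "__main__":
    rng = random.Random(0)
    gaussian = lambda x: math.exp(-0.5 * x * x)  # illustrative target
    x, samples = 0.0, []
    for _ in range(10000):
        x = slice_sample_step(gaussian, x, w=1.0, rng=rng)
        samples.append(x)
    print("sample mean:", sum(samples) / len(samples))
```

Repeated calls, each starting from the previous outcome, give a chain of
points $x$ whose distribution should tend to $P(x)$. The stepping-out loops
terminate only if $P^*$ eventually falls to $u'$ or below in both directions,
as it does for the Gaussian used in the demonstration.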
%\end{quotation}
\subsection{Properties of slice sampling}
Like a standard Metropolis method,
slice sampling gets around by a random walk, but
whereas in the Metropolis method, the choice of the step size
is critical to the rate of progress,
in slice sampling
the step size is
self-tuning.
If the initial interval size $w$ is too small by a factor $f$ compared with the
width of the probable region then
% there are
% never any rejections and
the stepping-out procedure expands the interval size.
The cost of this stepping-out is only linear in $f$,
% the factor by which the optimal $w$ is bigger than the chosen $w$,
whereas in the Metropolis method the computer-time scales
as the square of $f$ if the step size is too small.
% Discuss sensitivity to $w$. If $w$ too small then waste linear
% amount of time in stepping out.
If the chosen value of $w$ is too large by a factor $F$ then the algorithm spends
a time proportional to the logarithm of $F$
% waste
% logarithmic factor
shrinking the interval down to the right size, since the interval
typically shrinks by a factor in the ballpark of $0.6$ each time
% The exact value is exp(-1/2) = 0.61.
a point is rejected. In contrast, the Metropolis
algorithm responds to a too-large step size by
rejecting almost all proposals, so the
rate of progress is exponentially bad in $F$.
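
The figure of $0.6$ can be motivated by a rough calculation under two
simplifying assumptions, made here only for illustration: the interval is
much wider than the slice, so that nearly every draw falls outside the slice,
and the current point is equally likely to lie anywhere in the interval.
Measure positions as fractions of the interval length, with the current point
at $a$ and the drawn point at $x'$, both uniform on $(0,1)$. The shrunken
interval has fractional length $s = x'$ if $x'>a$ and $s = 1-x'$ otherwise, so
\beq
 \left< \ln s \right>
 = \int_0^1 \! \d x' \: \left[ x' \ln x' + (1-x') \ln (1-x') \right]
 = 2 \int_0^1 \! \d x' \: x' \ln x'
 = - \frac{1}{2} ,
\eeq
and a typical shrinkage factor is $e^{-1/2} \simeq 0.61$.
On these assumptions, shrinking the interval by a factor $F$ takes
roughly $2 \ln F$ shrinking steps.
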
There are no rejections in slice sampling. The probability of staying
in exactly the same place is very small.
\marginfig{
\begin{center}
\mbox{\psfig{figure=figs/sliceeg.eps,width=1.6in}}
\end{center}
\caption[a]{$P^*\!(x)$.}
\label{fig.sliceeg}
}%
\exercisxB{2}{ex.sliceproblem}{
Investigate the properties of slice sampling applied to the
density
shown in \figref{fig.sliceeg}. $x$ is a real variable
between 0.0 and 11.0.
How long does it take typically
for slice sampling to get from an $x$ in the peak region $x\in (0,1)$
to an $x$ in the tail region $x \in (1,11)$, and {\em vice versa}?
Confirm that the probabilities of these transitions do
yield an asymptotic probability density that is correct.
%
% \in (1,2)$ to an $x' \in (2,3)$?
% Note that for some distributions it may take a long time
% to mix, \eg, if there is a peak and a long low tail, then
% you spend a lot of time in the tail then
% a lot in the peak. This can in some cases be viewed as beneficial.
% Skilling has applications where the peak has much more
% probability mass than the tail, but it is the tail that
% is of interest, slice sampling is used to explore the tail;
% transitions between the tail and peak are
% handled by a separate proposal. Slice sampling is thus one of several
% base transitions.
%
}
\subsection{How slice sampling is used in real problems \nonexaminable}
An $N$-dimensional density $P(\bx) \propto P^*(\bx)$ may be sampled
with the help of the one-dimensional slice sampling method presented
above by picking a sequence of directions $\by^{(1)}, \by^{(2)},\ldots$
and defining
$\bx = \bx^{(t)} + x \by^{(t)}$. The function $P^*(x)$ above
is replaced by $P^*( \bx ) = P^*( \bx^{(t)} + x \by^{(t)})$.
The directions may be chosen in various ways; for example,
as in Gibbs sampling, the directions could be the coordinate axes;
alternatively, the directions $\by^{(t)}$ may be selected at random
in any manner such that the overall procedure satisfies detailed balance.
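
A minimal Python sketch of this construction, taking the directions to be the
coordinate axes, is given below. The function names {\tt slice\_step\_1d} and
{\tt slice\_sample\_nd}, the use of a single width {\tt w} for every axis, and
the representation of $\bx$ as a Python list are assumptions of the sketch,
not part of the method's definition.

```python
import math
import random


def slice_step_1d(f, w, rng):
    """One slice-sampling update of the scalar x, starting from x = 0,
    for the one-dimensional function f(x) (stepping out, then shrinking)."""
    u = rng.uniform(0.0, f(0.0))
    r = rng.random()
    x_l, x_r = -r * w, (1.0 - r) * w
    while f(x_l) > u:
        x_l -= w
    while f(x_r) > u:
        x_r += w
    while True:
        x_new = rng.uniform(x_l, x_r)
        if f(x_new) > u:
            return x_new
        if x_new > 0.0:
            x_r = x_new
        else:
            x_l = x_new


def slice_sample_nd(p_star, x0, w, n_sweeps, rng):
    """Slice sampling of an N-dimensional density along the coordinate axes.

    p_star takes a list of N floats; x0 must satisfy p_star(x0) > 0.
    Returns the list of states, one recorded after each sweep of the axes.
    """
    xt = list(x0)
    states = []
    for _ in range(n_sweeps):
        for i in range(len(xt)):
            # The direction y^(t) is the i-th coordinate axis, so
            # x^(t) + x * y^(t) changes component i only.
            def along_axis(x, i=i):
                trial = list(xt)
                trial[i] += x
                return p_star(trial)

            xt[i] += slice_step_1d(along_axis, w, rng)
        states.append(list(xt))
    return states


if __name__ == "__main__":
    rng = random.Random(0)
    # Illustrative target: two independent Gaussians of widths 1 and 3.
    target = lambda v: math.exp(-0.5 * (v[0] ** 2 + (v[1] / 3.0) ** 2))
    states = slice_sample_nd(target, [0.0, 0.0], w=1.0, n_sweeps=2000, rng=rng)
    print("last state:", states[-1])
```

Random directions could be substituted for the coordinate axes by replacing
the inner function with one that moves along a freshly drawn vector
$\by^{(t)}$, provided the way the direction is chosen does not depend on the
current state.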
\subsection{Computer-friendly slice sampling \nonexaminable}
The real variables of a probabilistic model will always be
represented in a computer using a finite number of bits.
In the following implementation of slice sampling
due to Skilling\nocite{SkillingMacKay2002},
the stepping-out, randomization, and shrinking
operations, described above
% by \citeasnoun{Radford_slice2001}
% Neal
in terms of floating-point operations,
are replaced by binary and integer operations.
We assume that the
variable $x$ that is being slice-sampled is represented
by a $b$-bit integer $X$ taking on one of $B = 2^b$ values,
$0, 1, 2, \ldots, B\!-\!1$, many or all of which correspond to
valid values of $x$.
Using an integer grid eliminates
any errors in detailed balance that might ensue from
variable-precision rounding of floating-point numbers.
%
% via a mapping $x(X)$.
% We often take these points
% to have equal prior measure, so that the prior becomes flat
% over $X$ and all points are automatically a-priori-equivalent.
% Floating-point numbers, by contrast, are not equivalent, because of their
% variable rounding. Using an integer grid eliminates
% any errors in detailed balance that might thus ensue.
% We denote by $F(X)$ the appropriately transformed version of
% the unnormalized density $f(x(X))$.
The mapping from $X$ to $x$ need not be
linear; if it is nonlinear,
we assume that the function $P^*\!(x)$ is replaced by
an appropriately transformed function -- for example,
$P^{**}(X) \propto P^*\!(x) |\d x/\d X|$.
% , if the mapping from $X$ to $x$ is continuous.
We assume the following operators on $b$-bit integers
are available:
\def\la{\,{\tt :=}\,}
\def\sp{\hspace*{0.2in}}
\begin{realcenter}
\begin{tabular}{cc}
$X + N$ & arithmetic sum, modulo $B$, of $X$ and $N$.\\
$X - N$ & difference, modulo $B$, of $X$ and $N$.\\
$X \oplus N$ & {bitwise\/} exclusive-or of $X$ and $N$.\\
$N \la
{\tt{randbits}}(l)$ & sets $N$ to a random $l$-bit integer.\\
\end{tabular}
\end{realcenter}
A slice-sampling procedure for integers is then as follows: \medskip
% \footnote{note change in draft 2.2 from $<$ to $\leq$ in the first line.}
\newcommand{\nsp}[1]{\makebox[0in][l]{\tt{#1}:}\hspace*{0.3in}\sf}
\begin{framedalgorithmw}{\fulltextwidth}
%\begin{realcenter}
\begin{tabular}{p{2.7in}p{3.4in}}
% \multicolumn{2}{c}{ {\sf Shrinking procedure} }\\
\multicolumn{2}{c}{ {\sf Given: a current point $X$ and a height $Y = P^*\!(X) \times \mbox{Uniform}(0,1) \leq P^*\!(X)$} }\\[0.15in]
\nsp{1} $U \la {\tt{randbits}}(b)$ & Define a random translation $U$ of the binary coordinate system. \\
\nsp{2} set $l$ to a value $l \leq b$ & Set initial $l$-bit sampling range. \\% (step 2)\\
\nsp{3} do \{ \\
\nsp{4}\sp $N \la {\tt{randbits}}(l)$ & Define a random move within the current interval of width $2^l$.\\
\nsp{5}\sp $X' \la ( (X-U) \oplus N ) + U $ &
Randomize the lowest $l$ bits of $X$ (in the translated coordinate system).
\\
\nsp{6}\sp $l \la l - 1$ &
If $X'$ is not acceptable, decrease $l$ and try again \\
\nsp{7}\} until \mbox{($X' = X$) or ($P^*\!(X') \geq Y$)} & with a smaller
perturbation of $X$; termination at or before $l=0$ is assured.\\
\end{tabular}
\end{framedalgorithmw}
\medskip
% \end{realcenter}
The translation $U$ is introduced to avoid permanent sharp edges, where
for example the adjacent binary integers {\tt{0111111111}} and {\tt{1000000000}}
would otherwise be permanently in different sectors, making it difficult for
$X$ to move from one to the other.
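
The integer procedure (steps {\tt 1} to {\tt 7}, without preliminary
stepping-out) can be sketched in Python as follows. The function name
{\tt integer\_slice\_step} and its arguments are assumptions of this sketch;
Python's {\tt getrandbits} plays the role of {\tt randbits}, and because
Python integers are unbounded the reductions modulo $B$ are written out
explicitly.

```python
import math
import random


def integer_slice_step(p_star, X, b, rng, l=None):
    """One transition X -> X' on the integer grid {0, ..., 2**b - 1}.

    p_star : callable giving the unnormalized density at an integer X
    X      : current point, assumed to satisfy p_star(X) > 0
    b      : number of bits, b >= 1
    l      : initial number of bits to randomize (defaults to b)
    """
    B = 1 << b
    if l is None:
        l = b                          # step 2: start with the full range
    if not 0 <= l <= b:
        raise ValueError("l must satisfy 0 <= l <= b")
    Y = p_star(X) * rng.random()       # slice height, Y <= P*(X)
    U = rng.getrandbits(b)             # step 1: random translation
    while True:
        # Step 4: random l-bit integer (l = 0 gives N = 0, hence X' = X).
        N = rng.getrandbits(l) if l > 0 else 0
        # Step 5: randomize the lowest l bits in the translated coordinates.
        X_new = ((((X - U) % B) ^ N) + U) % B
        l -= 1                         # step 6
        # Step 7: stop when back at X or when the point is acceptable.
        if X_new == X or p_star(X_new) >= Y:
            return X_new


if __name__ == "__main__":
    rng = random.Random(0)
    # Illustrative target on an 8-bit grid: a Gaussian bump centred on 128.
    bump = lambda X: math.exp(-0.5 * ((X - 128) / 20.0) ** 2)
    X, samples = 128, []
    for _ in range(10000):
        X = integer_slice_step(bump, X, 8, rng)
        samples.append(X)
    print("sample mean:", sum(samples) / len(samples))
```

The explicit treatment of $l=0$ is there only because some Python versions
reject a request for zero random bits; it does not change the algorithm.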
\amarginfig{t}{
\begin{center}
\mbox{\psfig{figure=figs/slicehalve.eps,width=1.965in}}\\[-0.015in]% was -.15
\end{center}
\caption[a]{
The sequence of intervals from which
the new candidate points are drawn.
}
\label{fig.slicehalve}
%Pictorially, the sequence of intervals from which
% the new candidate points are drawn are like the sequence
% of intervals in Neal's doubling procedure (Neal, 2001, figure 2).
}
The sequence of intervals from which
the new candidate points are drawn is illustrated in \figref{fig.slicehalve}.
%\begin{center}
%\mbox{\psfig{figure=figs/slicehalve.eps,width=2.5in}}
%\end{center}
First, a point is drawn from the
entire interval, shown by the top horizontal line.
At each subsequent draw, the interval is halved in such a way
as to contain the previous point $X$.
\begincuttable
% I aimed to CUT some details from here and put them in graveyard.tex.
% Mon 30/12/02
% They are also in the original skilling paper/.
If preliminary stepping-out from the initial range is required, step {\sf{2}} above
can be replaced by the following similar procedure:
\medskip% NORMALCENTER
\begin{framedalgorithm}
\begin{center}
\hspace*{0.5in}
\begin{tabular}{@{}p{2.614in}p{2.4in}}
\nsp{2a} set $l$ to a value $l < b$ & $l$ sets the initial width \\
\nsp{2b} do \{ \\
\nsp{2c}\sp $N \la {\tt{randbits}}(l)$ \\
\nsp{2d}\sp $X' \la ( (X-U) \oplus N ) + U $ \\
\nsp{2e}\sp $l \la l + 1$ \\
\nsp{2f} \} until \mbox{($l=b$) or ($P^*\!(X') < Y$)} \\
%% Then shrink as before \\
\end{tabular}
\end{center}
\end{framedalgorithm}
\medskip
% \footnote{ I changed $\geq$ to $<$ above}
 These shrinking and stepping-out methods shrink and expand
by a factor of two per evaluation. A variant
is to shrink or expand by more than one bit each time, setting
$l \la l \pm \Delta l$ with $\Delta l > 1$.
%
% addition Thu 7/2/02
%
% Provided the initial sampling range is well chosen
% ({\em i.e.,} of the same order of magnitude as the
% acceptable range), we found experimentally that
% the mean diffusion rate of $X$ per
% evaluation when $\Delta l = 1$ is at most 25\% slower than for Neal's
% method of shrinking to the rejected point. If the initial
% sampling range is not well chosen, the faster shrinking
% allowed here by setting $\Delta l > 1$ enables
% more rapid diffusion because an admittedly poorer
% acceptable jump is found more quickly.
Taking $\Delta l$ at each step from any pre-assigned distribution (which
may include $\Delta l=0$) allows extra flexibility.
\exercisxC{4}{ex.slice.ex}{
In the shrinking phase, after an unacceptable $X'$ has been
produced, the choice of $\Delta l$ is allowed to depend on
the difference between the slice's height $Y$ and the value
of $P^*\!(X')$, without spoiling the algorithm's validity. (Prove this.)
It might be a good idea to
choose a larger value of $\Delta l$ when $Y-P^*\!(X')$ is large.
Investigate this idea theoretically or empirically.
}
\ENDcuttable
A feature of using the integer representation is that, with a suitably
extended number of bits, the single integer $X$ can represent
two or more real parameters -- for example, by mapping $X$ to $(x_1,x_2,x_3)$
through a space-filling curve such as a Peano curve.
Thus \index{slice sampling!multi-dimensional}multi-dimensional
slice sampling can be performed using the
same software as for one dimension.
% Peano curves are useful here because they relate conveniently to a rectangular grid and
% they have the best possible locality properties: nearby points on the curve
% are close in space (though not the converse, which is unattainable).
% In this case, each successive
% bit of $X$ represents a factor of 2 in volume. Because
% we are likely to be uncertain about the optimal sampling volume
% in several dimensions, it
% may be helpful to set $\Delta l$ to the dimensionality.
\newpage%%%%%%%%%%%%%%%%%%%%%%%% ADDED Fri 11/7/03
\section{Practicalities}
{\bf Can we predict how long a \MCMC\ simulation will take to equilibrate?}
By considering the random walks involved in a \MCMC\ simulation
we can obtain
simple {\em lower bounds\/} on the time required for convergence.
But predicting this time more precisely
is a difficult problem, and most of the theoretical results
giving upper bounds on the convergence time
are of little practical use. The exact sampling methods
of \chref{ch.mcexact} offer a solution to this problem
for certain Markov chains.
\medskip
\noindent
{\bf Can we diagnose or detect convergence in a running simulation?}
This is also a difficult problem. There are a few practical tools available,
but none of them is perfect \cite{Cowles1996a}.
\medskip
\noindent
{\bf Can we reduce the convergence time and the time between independent samples of a
 \MCMC\ method?}
 Here, there is good news: the next chapter
% following three sections,
 describes the \hybrid\ Monte Carlo method, overrelaxation, and simulated annealing.
%%%%%%%%%%%%%%%%%%%%%%%%%
% this material is grabbed from later in the chapter advanced_mc.tex
\section{Further practical issues}
\subsection{Can the normalizing constant be evaluated?}
If the target density $P(\bx)$ is given in the form of an unnormalized
density $P^*\!(\bx)$ with $P(\bx) = \frac{1}{Z} P^*\!(\bx)$, the value of
$Z$ may well be of interest. Monte Carlo methods do not
readily
yield an estimate of this quantity, and it is an area of active research
to find ways of evaluating it. Techniques for evaluating $Z$
include:
\ben
\item
Importance sampling (reviewed by \citeasnoun{Neal_dop})\index{importance sampling}\index{Monte Carlo methods!importance sampling}
and \ind{annealed importance sampling} \cite{Radford_ais}.\index{Monte Carlo methods!annealed importance sampling}\index{annealing!importance sampling}
\item
`Thermodynamic integration'\index{thermodynamic integration} during \ind{simulated annealing},\index{Monte Carlo methods!simulated annealing}\index{Monte Carlo methods!thermodynamic integration}\index{annealing}
the `\index{acceptance ratio method}acceptance ratio' method, and `\ind{umbrella sampling}'\index{Monte Carlo methods!umbrella sampling}\index{Monte Carlo methods!acceptance ratio method}
(reviewed by \citeasnoun{Neal_dop}).\index{Neal, Radford}
% and AIS
\item
`Reversible jump \MCMC' \cite{Green1995}.\index{reversible jump}\index{Monte Carlo methods!reversible jump}
\een
% \citeasnoun{Neal_dop} gives a review of these methods.
One way of dealing with $Z$, however, may be to find a
solution to one's task that does not require that $Z$ be evaluated.
In Bayesian data modelling one might be able to avoid the need to evaluate $Z$ --
which would be
% traditionally
important for model comparison -- by not having more than one
model. Instead of using several models (differing in
% their
complexity, for example) and evaluating their relative posterior
probabilities, one can make a single {\dbf hierarchical\/} model\index{hierarchical model}
having, for example, various continuous \ind{hyperparameter}s which play a
role similar to that played by the distinct models
\cite{Radford_book}.
% The major objection to this approach of not evaluating $Z$
% is that
In noting the possibility of not computing $Z$, I am not endorsing this
approach. The normalizing constant $Z$ is often the single most
important number in the problem, and I think every effort should be devoted
to calculating it.
\subsection{The Metropolis method for big models}
Our original description of the Metropolis method involved a joint
updating of all the variables using a proposal density $Q(\bx';\bx)$.
For big problems it may be more efficient to use several proposal
distributions $Q^{(b)}(\bx';\bx)$, each of which updates only some
of the components of $\bx$. Each proposal is individually accepted or
rejected, and the proposal distributions are repeatedly
run through in sequence.
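
One sweep through such a sequence of proposals might be sketched in Python as
follows. The sketch assumes, purely for illustration, that proposal
$Q^{(b)}$ perturbs only the $b$th component of $\bx$ with a zero-mean
Gaussian; the function name {\tt metropolis\_sweep} and the choice to work
with $\ln P^*\!(\bx)$ are likewise assumptions.

```python
import math
import random


def metropolis_sweep(log_p_star, x, step_sizes, rng):
    """One sweep through the proposals Q^(1), ..., Q^(B).

    Proposal b changes only component b of x, adding Gaussian noise of
    standard deviation step_sizes[b]; each proposal is accepted or
    rejected individually. Returns (new state, number of acceptances).
    """
    x = list(x)
    log_p = log_p_star(x)
    accepted = 0
    for b, eps in enumerate(step_sizes):
        proposal = list(x)
        proposal[b] += rng.gauss(0.0, eps)
        log_p_new = log_p_star(proposal)
        # Symmetric proposal: accept with probability min(1, P*(x')/P*(x)).
        if math.log(1.0 - rng.random()) < log_p_new - log_p:
            x, log_p = proposal, log_p_new
            accepted += 1
    return x, accepted


if __name__ == "__main__":
    rng = random.Random(0)
    # Illustrative target: a five-dimensional unit spherical Gaussian.
    log_gauss = lambda v: -0.5 * sum(t * t for t in v)
    x, n_accepted = [0.0] * 5, 0
    for _ in range(2000):
        x, acc = metropolis_sweep(log_gauss, x, [2.0] * 5, rng)
        n_accepted += acc
    print("acceptance rate:", n_accepted / (5 * 2000))
```

The step sizes are fixed before the run begins and are not adjusted while
samples are being collected.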
\exercissxB{2}{ex.metropB}{
Explain why the rate of movement through the state space
will be greater when $B$ proposals
$Q^{(1)} ,\ldots, Q^{(B)}$ are considered {\em individually\/}
in sequence, compared with the case of a single proposal
$Q^*$ defined by the concatenation of $Q^{(1)} ,\ldots ,Q^{(B)}$.
Assume that each proposal distribution $Q^{(b)}(\bx';\bx)$
has an \ind{acceptance rate} $f<1/2$.
}
In the Metropolis method, the proposal density $Q(\bx';\bx)$ typically
has a number of parameters that control, for example, its `width'.
These parameters are usually set by trial and error with the \ind{rule
of thumb} being to aim for a rejection frequency of about 0.5.
It is {\em not\/} valid to have the width parameters be dynamically
updated during the simulation in a way that depends on the
history of the simulation. Such a modification of the proposal
density would violate the detailed balance condition that
guarantees that the Markov chain has the correct invariant distribution.
\subsection{Gibbs sampling in big models}
Our description of Gibbs sampling involved sampling one parameter at a time,
as described in equations (\ref{eq.gibbs1}--\ref{eq.gibbs3}).
%:
%\beqan
%x_1^{\tplusone} &\sim& P( x_1 \given x_2^{(t)} , x_3^{(t)} , \ldots ,x_K^{(t)} ) \\
%x_2^{\tplusone} &\sim& P( x_2 \given x_1^{\tplusone} , x_3^{(t)} , \ldots ,x_K^{(t)} ) \\
%x_3^{\tplusone} &\sim& P( x_3 \given x_1^{\tplusone} , x_2^{\tplusone} , \ldots , x_K^{(t)} ) , \: \mbox{ etc.}
%\eeqan
For big problems it may be more efficient to sample {\em groups\/} of
 variables jointly, that is, to use several proposal
distributions:
\beqan
x_1^{\tplusone}\hspace{-0.1in},\ldots, x_a^{\tplusone} &\!\!\sim\!\!& P( x_1,\ldots, x_a \given x_{a+1}^{(t)}
,\ldots, x_K^{(t)} ) \\
x_{a+1}^{\tplusone} ,\ldots, x_b^{\tplusone}
&\!\!\sim\!\!&
P( x_{a+1}, \ldots, x_b \given x_1^{\tplusone}\hspace{-0.1in} ,\ldots , x_a^{\tplusone}
, x_{b+1}^{(t)} ,\ldots, x_K^{(t)} ) , \:\: \mbox{ etc.}
\nonumber
\eeqan
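
For a small discrete model, the joint conditional distribution of a group can
be obtained by brute-force enumeration, which gives a simple way to try out
grouped updates. The Python sketch below does this; the function names, the
restriction to variables sharing one finite set of values, and the toy chain
model {\tt chain\_p\_star} are all assumptions introduced for illustration.

```python
import itertools
import math
import random


def sample_block(p_star, x, block, values, rng):
    """Resample the variables x[i], i in block, jointly from their
    conditional distribution given all the other variables, by
    enumerating every joint setting of the block."""
    settings = list(itertools.product(values, repeat=len(block)))
    weights = []
    for setting in settings:
        trial = list(x)
        for i, v in zip(block, setting):
            trial[i] = v
        weights.append(p_star(trial))
    chosen = rng.choices(settings, weights=weights, k=1)[0]
    new_x = list(x)
    for i, v in zip(block, chosen):
        new_x[i] = v
    return new_x


def blocked_gibbs(p_star, x0, blocks, values, n_sweeps, rng):
    """Gibbs sampling in which each group of indices in `blocks` is
    updated jointly; returns one recorded state per sweep."""
    x = list(x0)
    states = []
    for _ in range(n_sweeps):
        for block in blocks:
            x = sample_block(p_star, x, block, values, rng)
        states.append(tuple(x))
    return states


def chain_p_star(x, J=1.0):
    """Toy unnormalized density: neighbouring variables prefer to agree."""
    agreements = sum(1 for a, b in zip(x, x[1:]) if a == b)
    return math.exp(J * agreements)


if __name__ == "__main__":
    rng = random.Random(0)
    states = blocked_gibbs(chain_p_star, [0, 0, 0, 0],
                           blocks=[[0, 1], [2, 3]], values=[0, 1],
                           n_sweeps=1000, rng=rng)
    print("last state:", states[-1])
```

Enumeration costs a number of evaluations of $P^*$ that grows exponentially
with the size of the group, so this particular device is only practical for
small groups.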
\subsection{How many samples are needed?}
At the start of this chapter, we observed that the variance of
an estimator $\hat{\Phi}$ depends only on the number of independent
samples $R$ and the value of
\beq
\sigma^2 = \intdx\: P(\bx) (\phi(\bx)-\Phi)^2 .
\eeq
We have now discussed a variety of methods for generating
samples from $P(\bx)$. How many independent samples $R$ should we aim for?
In many problems, we really only need about
% a dozen (twelve)
twelve
independent samples from $P(\bx)$. Imagine that $\bx$
is an unknown vector such as the amount of corrosion present
in each of $10\,000$ underground pipelines around Cambridge,
% Cambridge,
and $\phi(\bx)$
is the total cost of repairing those pipelines. The
distribution $P(\bx)$ describes the probability of
a state $\bx$ given the tests that have been carried
out on some pipelines and the assumptions about the physics of corrosion.
The quantity $\Phi$ is the expected cost of the repairs.
The quantity $\sigma^2$ is the variance of the cost -- $\sigma$
measures by how much we should expect the actual cost to differ from the
expectation $\Phi$.
Now, how accurately would a manager like to know $\Phi$? I would suggest there
is little point in knowing $\Phi$ to a precision finer than about
$\sigma/3$. After all, the true cost is likely to differ by
$\pm \sigma$ from $\Phi$.
If we obtain $R=12$ independent samples from $P(\bx)$,
we can estimate $\Phi$ to a precision of $\sigma/\sqrt{12}$
-- which is smaller than $\sigma/3$. So twelve samples suffice.
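
 To spell out the arithmetic: with $R$ independent samples the estimator
 $\hat{\Phi}$ has standard deviation $\sigma/\sqrt{R}$, so the
 target precision $\sigma/3$ is met as soon as
\beq
 \frac{\sigma}{\sqrt{R}} \leq \frac{\sigma}{3}
 \:\: \Leftrightarrow \:\: R \geq 9 ,
\eeq
 and $R=12$ gives $\sigma/\sqrt{12} \simeq 0.29\,\sigma$, which is below
 $\sigma/3 \simeq 0.33\,\sigma$. This calculation relies on the samples
 being independent; positively correlated samples would each count for less.
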
\begin{figure}%[htbp]
\figuremargin{%
\begin{center}
\mbox{\psfig{figure=figs/mcresource.eps,angle=-90,width=2.5in}}
\end{center}
}{%
\caption[a]{Three possible Markov chain Monte Carlo
strategies for obtaining twelve samples
in a fixed amount of computer time. Time is represented
by horizontal lines; samples by white circles.
(1) A single run consisting of one long `burn in' period followed
by a sampling period. (2) Four medium-length runs with
different initial conditions and a medium-length burn in period.
(3) Twelve short runs.}
\label{fig.mcresource}
}%
\end{figure}
%
\subsection{Allocation of resources}
\label{sec.mcresource}
% Choice of strategy}
Assuming we have decided how many independent samples $R$
are required,
an important question is how one should make use of one's limited computer
resources to obtain these samples.
A typical \MCMC\ experiment involves an initial period in which
control parameters of the simulation such as step sizes may be adjusted.
This is followed by a `burn in' period during which we hope the simulation
`converges' to the desired distribution. Finally, as the simulation
continues, we record the state vector occasionally so as to create
a list of states $\{ \bx^{(r)}\}_{r=1}^{R}$
that we hope are roughly independent samples from
$P(\bx)$.
There are several possible strategies (\figref{fig.mcresource}):
\ben
\item Make one long run, obtaining all $R$ samples from it.
\item Make a few medium-length runs with different
initial conditions, obtaining some samples from each.
\item Make $R$ short runs, each starting from a different random
initial condition, with the only state that is recorded being the
final state of each simulation.
\een
The first strategy has the
best chance of attaining `convergence'.
The last strategy may have the advantage that the correlations between
the recorded samples are smaller.
The middle path is popular with \MCMC\ experts \cite{MCMC96}
because it avoids the inefficiency of discarding burn-in
iterations in many runs, while still allowing one to
detect problems with lack of
convergence that would not be apparent from a single
run.
%The lots-of-short-runs versus one-long-run has been very controversial. You
%should reference the Gelman and Rubin and Geyer papers on the topic.
Finally, I should emphasize that there is no need to make the
points in the estimate nearly-independent. Averaging
over dependent points is fine -- it won't lead to any bias
in the estimates. For example, when you use strategy 1 or 2, you may, if you wish,
include all the points between the first and last sample in each run.
Of course, estimating the accuracy of the estimate is harder when the
points are dependent.
% \section{Philosophy} moved to graveyard.tex Sun 3/2/02
\section{Summary}
\bit
\item
 Monte Carlo methods are powerful tools that allow one to sample from
any probability distribution that can be expressed in the form
$P(\bx) = \frac{1}{Z} P^*\!(\bx)$.
\item
Monte Carlo methods can answer virtually any query related
to $P(\bx)$ by putting the query in the
form
\beq
 \intdx\: \phi(\bx) P(\bx) \simeq \frac{1}{R} \sum_r \phi(\bx^{(r)}) .
\eeq
% and estimating this integral by sampling.
\item
In high-dimensional problems the only satisfactory methods
are those based on Markov chains, such as the Metropolis method, Gibbs sampling and slice
sampling. Gibbs sampling is an attractive method
because it has no adjustable parameters but its use is restricted to
cases where samples can be generated from the conditional distributions.
Slice sampling is
attractive because, whilst it has step-length parameters, its performance
is not very sensitive to their values.
\item
Simple Metropolis algorithms and Gibbs sampling algorithms,
although widely used,
perform poorly because they explore the
space by a slow random walk. The next chapter will discuss
methods for speeding up Markov chain Monte Carlo simulations.
% More sophisticated Metropolis
% algorithms such as \hybrid\ Monte Carlo, which we discuss in the next
% chapter,
% (see \citeasnoun{Neal_dop})
% make use
% of proposal densities that give faster movement through the state space.
%
% The efficiency of Gibbs sampling is also troubled by random walks.
% The method of ordered overrelaxation is a general purpose technique
% for suppressing them.
\item
% for summary
Slice sampling does not avoid random walk behaviour,
but it automatically chooses the largest appropriate
step size, thus reducing the bad effects of the random walk
compared with, say, a Metropolis method with a tiny step size.
\eit
\section{Exercises}
%
% I rate this ex as one of the best bits of this book
%
\exercissxA{2C}{ex.isproblem}{
{\sf A study of importance sampling.}
%
We already established in section \ref{sec.importance}
that importance sampling is likely to be useless
in high-dimensional problems.
This exercise explores a further \index{sermon!importance sampling}cautionary tale, showing\index{caution!importance sampling}
that importance sampling can fail even in one dimension,
even with
friendly Gaussian distributions.\index{Monte Carlo methods!importance sampling!weakness of}
Imagine that we want to know the expectation of a function
$\phi(x)$ under a distribution $P(x)$,
\beq
\Phi = \int \d x \: P(x) \phi(x) ,
\eeq
and that this expectation is estimated by importance sampling
with a distribution $Q(x)$.
Alternatively, perhaps we wish to estimate the normalizing constant
$Z$ in $P(x) = P^*\!(x)/Z$ using
\beq
Z = \int \d x \: P^*\!(x) = \int \d x \: Q(x) \frac{P^*\!(x)}{Q(x)}
= \left< \frac{P^*\!(x)}{Q(x)} \right>_{x\sim Q} .
\eeq
Now, let $P(x)$ and $Q(x)$ be Gaussian distributions with
mean zero and standard deviations $\sigma_p$ and $\sigma_q$.
Each point $x$ drawn from $Q$ will have an associated weight
$P^*\!(x)/Q(x)$.
What is the variance of the weights? [Assume that $P^* = P$, so
$P$ is actually normalized, and $Z=1$, though we can pretend that we didn't know
that.]
What happens to the variance of the weights as $\sigma^2_q \rightarrow
\sigma^2_p/2$?
Check your theory by simulating this importance-sampling problem
on a computer.
}
\exercisaxA{2}{ex.metFred}{
Consider the Metropolis algorithm for the
one-dimensional toy problem of section \ref{sec.metrop.demo},
sampling from $\{ 0,1,\ldots,20\}$.
Whenever the current state is one of the end states,
the proposal density given in \eqref{eq.metropb} will propose with
probability 50\% a state that will be rejected.
To reduce this `waste', Fred modifies the software responsible for
generating samples from $Q$ so that when $x=0$, the proposal density
is 100\% on $x'=1$, and similarly when $x=20$, $x'=19$ is always
proposed. Fred sets the software that implements the acceptance
rule so that the software accepts all proposed moves.
What probability $P'(x)$ will Fred's modified
software generate samples from?
What is the correct acceptance rule for Fred's proposal density, in
order to obtain samples from $P(x)$?
}
%%%%%%%%%%%% extra exercises added draft 4.1 %%%%%%%%%%%%%
\exercisxB{3C}{ex.doGibbs1}{
Implement Gibbs sampling for the inference of a
single one-dimensional Gaussian, which we studied using maximum likelihood in \secref{sec.mloneg}.
Assign a broad Gaussian prior to $\mu$ and a broad gamma prior (\ref{gamma.dist.again})
to the \ind{precision} parameter
$\beta = 1/\sigma^2$.
Each update of $\mu$ will involve a sample from a Gaussian distribution,
 and each update of $\sigma$ will involve a sample of $\beta$ from a gamma distribution.
}
\exercisxA{3C}{ex.doGibbs2}{
{\sf Gibbs sampling for clustering.}
Implement Gibbs sampling for the inference of a
mixture of $K$
% two or more
one-dimensional Gaussians, which we studied using maximum likelihood in \secref{sec.mog}.
Allow the clusters to have different standard deviations $\sigma_k$.
% Assign a uniform prior to the
 Assign priors to the means and standard deviations in the same way as in the previous
exercise. Either fix the prior probabilities of the classes $\{ \pi_k \}$ to be equal
or put a uniform prior over the parameters $\pi$ and include them in the Gibbs sampling.
%
% [ -0.01 -0.27 0.1 0.31 0.706 1.07 1.37 1.16 1.2 1.25 1.3 1.33 1.65 ]
Notice the similarity of Gibbs sampling to the soft K-means clustering algorithm (\algref{alg.kmeansoft2}).
We can alternately {\em assign\/} the class labels $\{ k_n \}$ given the parameters $\{ \mu_k , \sigma_k \}$,
then {\em update\/} the parameters given the class labels.
The assignment step involves sampling from the probability distributions defined by
the responsibilities (\ref{eq.assignII}), and the
update step updates the means and variances using probability distributions
centred on the K-means algorithm's values (\ref{eq.softkmeans.meanupdate}, \ref{eq.softkmeans.varianceupdate}).
Do your experiments confirm that Monte Carlo methods bypass the overfitting
difficulties of maximum
likelihood discussed in \secref{sec.kaboom}?
A solution to this exercise and the previous one,
written in {\tt{octave}}, is available.\footnote{{\tt{http://www.inference.phy.cam.ac.uk/mackay/itila/}}}
}
\exercisxB{3C}{ex.doGibbs3}{
Implement Gibbs sampling for the {\sf seven scientists} inference problem,
which we encountered in \exerciseref{ex.manyparams}, and which you may
have solved by exact marginalization (\exerciseref{ex.manyparamsb}) [it's not essential to have done the latter].
}
%%%%%%%%%%%% end extra exercises added draft 4.1 %%%%%%%%%%%%%
\exercisxB{2}{ex.walkGau}{
A Metropolis method is used to explore a distribution $P(\bx)$
that is actually a 1000-dimensional spherical Gaussian distribution of standard deviation
1 in all dimensions.
The proposal density $Q$ is a 1000-dimensional spherical Gaussian distribution
of standard deviation $\epsilon$.
Roughly what is the step size $\epsilon$ if the \ind{acceptance rate}
is 0.5?
Assuming this value of $\epsilon$,
\ben
\item
roughly how long would the method take to traverse the distribution
and generate a sample independent of the initial condition?
\item
By how much does $\ln P(\bx)$ change in a typical step?
By how much should $\ln P(\bx)$ vary when $\bx$ is drawn from
$P(\bx)$?
\item
What happens if, rather than using a Metropolis
method that tries to change all components at once, one instead uses
a concatenation of Metropolis updates changing one component at a time?
\een
}
\exercisaxB{2}{ex.walkE}{
When discussing the time taken by the Metropolis algorithm to
generate independent samples we considered a distribution
with longest spatial
length scale $L$ being explored using a proposal distribution
with step size $\epsilon$.
Another dimension
% non-spatial exploration
that an MCMC method must explore is the range of
possible values of the log probability
% $E(\bx) \equiv
$\ln P^*\!(\bx)$. Assuming that the state $\bx$ contains a number of
independent random variables proportional to $N$,
when samples are drawn from $P(\bx)$, the
% `$\!$Asymptotic Equipartition' Principle
`\ind{asymptotic equipartition}' principle
tells us that the value of $- \ln P(\bx)$
is likely to be close to the entropy of $\bx$, varying either side with
a standard deviation that scales as $\sqrt{N}$.
Consider a Metropolis method with a symmetrical proposal density,
that is, one that satisfies $Q(\bx;\bx') = Q(\bx';\bx)$. Assuming that
accepted jumps either increase $\ln P^*\!(\bx)$ by some amount
or decrease it
by a {\em small\/} amount, \eg\ $\ln e=1$ (is this a reasonable
assumption?), discuss how long
it must take to generate roughly independent samples from $P(\bx)$.
Discuss whether Gibbs sampling has similar properties.
}
%the point of 23.11 (exercise) is the idea that as well
%as a spatial random walk, there are other ways of
%thinking about the random walks that MCMC does. Other dimensions.
%For example, during a simulation, the energy of a system
%wanders up and down. And it has to cover the "typical" range
%of values before we can expect the simulation to converge.
%Therefore the convergence time is something like (X/x)^2
%where X is the range of energies and x is the typical change
%in energy.
%
% \exercis{ex.goodapproxsample}{
% Compare and contrast what makes a
% % n approximating
% distribution $Q$
% a good variational approximation to a distribution $P$
% (as in the previous chapter)
% and what makes a distribution $Q$ a good sampler for
% importance sampling.
% }
\exercisxC{3}{ex.ZMC}{
Markov chain Monte Carlo methods do not compute
partition functions $Z$,
yet they allow ratios of quantities like $Z$ to
be estimated. For example, consider a random-walk Metropolis algorithm
in a state space where the energy is zero in a connected accessible
region, and infinitely large everywhere else;
and imagine that the accessible space can be chopped into two regions
connected by one or more corridor states. The fraction of
times spent in each region at equilibrium is proportional to
the volume of the region. How does the Monte Carlo method
manage to do this without measuring the volumes?
}
\exercisxC{5}{ex.BayesianMC}{
{\sf Philosophy}.\index{philosophy}
One curious defect of these Monte Carlo methods -- which are widely used
by Bayesian statisticians -- is that they are all non-Bayesian \cite{ohagan87}.
They involve computer experiments from which
{\em estimators\/} of quantities of interest are derived. These estimators
depend on the proposal distributions that were used to generate
the samples and on the random numbers that happened
to come out of our random number generator.
In contrast, an alternative Bayesian approach to
the problem would use the results of our computer experiments
to infer the properties of the target function $P(\bx)$ and
generate predictive distributions for quantities of interest such as $\Phi$.
This approach would give answers that would depend only on the
computed values of $P^*(\bx^{(r)})$ at the points
$\{ \bx^{(r)} \}$; the answers would not depend on how those
points were chosen.
Can you make a Bayesian Monte Carlo method?
(See \citeasnoun{zoubincarlBMC} for a practical attempt.)
}
% \input{tex/bayes_mc.tex}
\dvips
\section{Solutions}% to Chapter \protect\ref{ch.mc}'s exercises}
%
\fakesection{s11.tex}
\soln{ex.Phiconverge}{
We wish to show that
\beq
\hat{\Phi} \equiv \frac{ \sum_{r} w_r \phi( \xfromq^{(r)} ) }{ \sum_r w_r }
% \label{eq.is}
\eeq
converges to $\Phi$, the expectation of $\phi$ under $P$. We consider the
numerator and the denominator separately. First, the denominator.
Consider a single importance weight
\beq
w_r \equiv \frac{ P^*(\xfromq^{(r)}) }{ Q^*(\xfromq^{(r)}) } .
% copied from \label{eq.mc.is.weight.def}
\eeq
What is its expectation, averaged under the distribution $Q=Q^*/Z_Q$ of
the point $\xfromq^{(r)}$?
\beq
\langle w_r \rangle
= \int \d \xfromq \,
Q( \xfromq )
\frac{ P^*(\xfromq) }{ Q^*(\xfromq) }
= \int \d \xfromq \,
\frac{1}{Z_Q}
P^*(\xfromq)
= \frac{Z_P}{Z_Q} .
\eeq
So the expectation of the denominator is
\beq
\left< \sum_r w_r \right> = R \frac{Z_P}{Z_Q} .
\eeq
As long as the variance of $w_r$ is finite, the denominator, divided
by $R$, will converge to $Z_P/Z_Q$ as $R$ increases.
[In fact, the estimate converges to the right answer even if this variance is
infinite, as long as the expectation is well-defined.]
Similarly, the expectation of one term in the numerator is
\beq
\langle w_r \phi( \xfromq ) \rangle =
\int \d \xfromq \,
Q( \xfromq )
\frac{ P^*(\xfromq) }{ Q^*(\xfromq) } \phi( \xfromq )
= \int \d \xfromq \,
\frac{1}{Z_Q}
P^*(\xfromq) \phi( \xfromq )
= \frac{Z_P}{Z_Q} {\Phi} ,
\eeq
where $\Phi$ is the expectation of $\phi$ under $P$.
So the numerator, divided by $R$, converges to $\smallfrac{Z_P}{Z_Q} {\Phi}$
with increasing $R$.
Thus $\hat{\Phi}$ converges to $\Phi$.
The numerator and the denominator are unbiased estimators of
$R \Phi Z_P/Z_Q$ and
$R Z_P/Z_Q$ respectively, but their ratio $\hat{\Phi}$
is not necessarily an unbiased estimator for finite $R$.
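The bias is easiest to see in the extreme case $R=1$ (assuming $w_1 \neq 0$), where the single weight cancels:
\beqan
 \hat{\Phi} &=& \frac{ w_1 \phi( \xfromq^{(1)} ) }{ w_1 } = \phi( \xfromq^{(1)} )
\\
 \langle \hat{\Phi} \rangle &=& \int \d \xfromq \, Q( \xfromq ) \phi( \xfromq ) .
\eeqan
This is the expectation of $\phi$ under $Q$ rather than under $P$, and the two differ in general unless $Q=P$.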
%%%%%%%%%%%%%%%% HELP !!!!!!!!!!!!!!!!!!!!!!!!!!!
% More here about variance and bias.
%%%%%%%%%%%%%%%% HELP !!!!!!!!!!!!!!!!!!!!!!!!!!!
}
\soln{ex.peakysample}{When the true density $P$ is multimodal,
it is unwise to use importance sampling
with a sampler density fitted to one mode, because on the rare
occasions that a point is produced that lands in one of the other modes,
the weight associated with that point will be enormous. The
estimates will have enormous variance, but this enormous variance
may not be evident to the user if no points in the other modes have been seen.
}
%\soln{ex.randomwalk}{ ... }
%\soln{ex.gibbs.eq.met}{ ... }
\soln{ex.gibbs.h74}{
The posterior distribution
for the syndrome decoding problem
is a pathological distribution from
the point of view of Gibbs sampling.
The factor $\truth[ \bH \bn = \bz ]$ is equal to 1 only on a small fraction
of the space of possible vectors $\bn$, namely the $2^K$ points
that correspond to the valid codewords. No two codewords are adjacent,
so any single bit flip from a viable state $\bn$ will
take us to a state with zero probability, and so the state will never move
under
Gibbs sampling.
A general code has exactly the same problem.
The points corresponding to valid codewords are relatively few in number
and they are not adjacent (at least for any useful code).
So Gibbs sampling is no use for syndrome decoding for two reasons.
First, finding {\em any\/} reasonably good hypothesis is difficult, and
as long as the state is not near a valid codeword, Gibbs sampling
cannot help since none of the conditional distributions is defined;
and second, once we are in a valid hypothesis, Gibbs sampling
will never take us out of it.
% However, clever modifications of Gibbs sampling, using several
% annealing parameters, have been developed by \citeasnoun{Neal_mcdecoder},
% who has demonstrated that Monte Carlo decoding of certain codes,
% while inefficient, is not impossible.
One could attempt to perform Gibbs sampling using the
bits of the original message $\bs$ as the variables. This
approach would not get locked up in the way just described,
but, for a good code, any single bit flip would substantially alter
the reconstructed codeword, so if one had found a state
with reasonably large likelihood, Gibbs sampling would take
an impractically large time to escape from it.
}
%%%%%%%%%%%%%%%%%%
\soln{ex.metropB}{
Each Metropolis proposal will take the energy
of the state up or down by some amount.
The total change in energy when $B$ proposals
are concatenated will be the end-point of a random walk
with $B$ steps in it. This walk might have mean zero,
or it might have a tendency to drift upwards (if most
moves increase the energy and only a few decrease it). In general
the latter will hold, if the acceptance rate $f$ is small:
the mean change in energy from any one move will be some $\Delta E>0$
and so the acceptance probability for the concatenation of $B$
moves will be
of order $1/(1+\exp(B \Delta E))$, which scales roughly as $f^B$.
The mean-square-distance moved will be of order $f^B B \epsilon^2$,
where $\epsilon$ is the typical step size.
In contrast, the mean-square-distance moved when the
moves are considered individually will be of order $f B \epsilon^2$.
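As a rough numerical illustration (the values $f=0.5$ and $B=10$ are chosen here purely for concreteness), the two estimates compare as
\beq
 f^B B \epsilon^2 = 2^{-10} \times 10 \, \epsilon^2 \simeq 0.01 \, \epsilon^2
 \:\:\: \mbox{versus} \:\:\:
 f B \epsilon^2 = 5 \, \epsilon^2 ,
\eeq
so under these assumptions concatenating the proposals reduces the mean-square-distance moved by a factor of $f^{1-B} = 2^{9} = 512$.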
}
% importance sampling
% see also ~/itp/importance/
\begin{figure}[htpb]
\figuredanglenudge{
\begin{center}
\begin{tabular}{ccc}
\mbox{\psfig{figure=importance/mean.ps,width=2.3in,angle=-90}}&
\mbox{\psfig{figure=importance/std.ps,width=2.3in,angle=-90}}&
\mbox{\psfig{figure=importance/ws.ps,width=2.3in,angle=-90}}\\
\end{tabular}
\end{center}
}{
\caption[a]{Importance sampling in one dimension.
For $R=1000$, $10^4$, and $10^5$,
the normalizing constant of a Gaussian distribution (known
in fact to be 1) was
estimated using importance sampling with a sampler density of standard
deviation $\sigma_q$ (horizontal axis).
The same random number seed was used for all runs.
The three plots show (a) the estimated normalizing constant;
(b) the {\em empirical\/}
standard deviation of the $R$ weights; (c) 30 of the weights.
}
\label{fig.iscrazy}
}{-0.15in}
\end{figure}
\soln{ex.isproblem}{
The weights are $w = P(x)/Q(x)$ and $x$ is drawn from $Q$.
The mean weight is
\beq
\int \d x \: Q(x) \left[ P(x)/Q(x) \right]
= \int \d x \: P(x) = 1,
\eeq
assuming the integral converges.
The variance is
\beqan
\var ( w ) &=& \int \d x \: Q(x) \left[ \frac{P(x)}{Q(x)} - 1 \right]^2
\\
&=& \int \d x \: \left[ \frac{P(x)^2}{Q(x)} - 2 P(x) + Q(x) \right]
\\
% &=& \left[ \int \d x \: \frac{Z_Q}{Z_P^2}
% \frac{ \exp \left( - 2 x^2/(2 \sigma^2_p) \right) }
% { \exp \left( - x^2/(2 \sigma^2_q) \right)}
% \right]
% - 2 + 1
%\\
&=& \left[ \int \d x \: \frac{Z_Q}{Z_P^2}
\exp \left(
- \frac{x^2}{2}
\left( \frac{2}{\sigma^2_p} - \frac{1}{\sigma^2_q} \right)
\right)
\right]
- 1 ,
\label{eq.nasty}
\eeqan
where $Z_Q/Z_P^2 = \sigma_q/(\sqrt{2\pi}\sigma_p^2)$.
The integral in (\ref{eq.nasty})
is finite only if the coefficient of $x^2$
in the exponent is positive, \ie,
if
%\beq
% \left( \frac{1}{\sigma^2_p} - \frac{1}{2 \sigma^2_q} \right) > 0
%\eeq
% \ie,
\beq
\sigma^2_q > \frac{1}{2} \sigma^2_p .
\eeq
If this condition is satisfied, the variance is
\beq
%\left[ \int \d x \: \frac{Z_Q}{Z_P^2}
% \exp \left(
% - x^2 \left( \frac{1}{\sigma^2_p} - \frac{1}{2 \sigma^2_q} \right)
% \right] - 1 ,
\var(w) =
\frac{\sigma_q}{\sqrt{2\pi}\sigma_p^2} \sqrt{2 \pi}
\left( \frac{2}{\sigma^2_p} - \frac{1}{\sigma^2_q} \right)^{\!-\frac{1}{2}} \!\!\!\! - 1
\:=\:
\frac{\sigma_q^2}{\sigma_p \left( 2 \sigma^2_q - \sigma^2_p \right)^{1/2}}
- 1.
\eeq
As $\sigma_q$ approaches the critical value -- about $0.7 \sigma_p$ --
the variance becomes infinite.
\Figref{fig.iscrazy} illustrates these phenomena for $\sigma_p=1$ with
$\sigma_q$ varying from 0.1 to 1.5.
{\em The same random number seed was used for all runs,}
so the weights and estimates follow smooth curves.
Notice that the {\em empirical\/}
standard deviation of the $R$ weights can look quite small
and well-behaved (say, at $\sigma_q \simeq 0.3$) when the true
standard deviation is nevertheless infinite.
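As a numerical check on the variance formula, setting $\sigma_p=1$ gives
\beq
 \var(w) = \frac{\sigma_q^2}{\left( 2 \sigma^2_q - 1 \right)^{1/2}} - 1 ,
\eeq
which evaluates to $0$ at $\sigma_q=1$ (where $Q=P$ and every weight equals 1), to about $0.20$ at $\sigma_q=1.5$, to about $0.21$ at $\sigma_q=0.8$, and to about $0.59$ at $\sigma_q=0.75$; it grows without bound as $\sigma_q$ decreases towards $1/\sqrt{2} \simeq 0.707$.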
%
}
% \soln{ex.goodapproxsample}{
% Variational free energy approximation: compact $Q$ good, heavy--tailed $Q$
% bad.
% Importance sampling: compact $Q$ bad, heavy--tailed $Q$ good.
% }
\dvipsb{solutions mc}
% {Efficient Monte Carlo methods}
\chapter{Efficient Monte Carlo Methods \nonexaminable}
\label{ch.mc2}
%\{Speeding up Monte Carlo methods}
% \subsection{Reducing
This chapter discusses
several methods for
{reducing random walk behaviour in Metropolis methods}.
The aim is to reduce the time
required to obtain effectively
independent samples.
For brevity, we will
% deliberately
% use imprecise terminology,
say `independent samples' when we mean
`effectively
independent samples'.
\section{\Hybrid\ Monte Carlo}
\begin{algorithm}
\begin{framedalgorithmwithcaption}{
%\figuremargin{%{%%%%%%%\margincaption{%
\caption[a]{{\tt Octave} source code for the \hybrid\ Monte Carlo method.}
\label{fig.hmc}
}%
\footnotesize
\begin{verbatim}
g = gradE ( x ) ; # set gradient using initial x
E = findE ( x ) ; # set objective function too
for l = 1:L # loop L times
p = randn ( size(x) ) ; # initial momentum is Normal(0,1)
H = p' * p / 2 + E ; # evaluate H(x,p)
xnew = x ; gnew = g ;
for tau = 1:Tau # make Tau `leapfrog' steps
p = p - epsilon * gnew / 2 ; # make half-step in p
xnew = xnew + epsilon * p ; # make step in x
gnew = gradE ( xnew ) ; # find new gradient
p = p - epsilon * gnew / 2 ; # make half-step in p
endfor
Enew = findE ( xnew ) ; # find new value of H
Hnew = p' * p / 2 + Enew ;
dH = Hnew - H ; # Decide whether to accept
if ( dH < 0 ) accept = 1 ;
elseif ( rand() < exp(-dH) ) accept = 1 ;
else accept = 0 ;
endif
if ( accept )
g = gnew ; x = xnew ; E = Enew ;
endif
endfor
\end{verbatim}
\end{framedalgorithmwithcaption}
\end{algorithm}
%\newcommand{\Tau}{\mbox{\verb+Tau+}}
%\newcommand{\ttepsilon}{\mbox{\verb+epsilon+}}
\newcommand{\Tau}{\mbox{\tt{Tau}}}
\newcommand{\ttepsilon}{\mbox{\tt{epsilon}}}
\begin{figure}
\figuremargin{\small%
\begin{center}
\begin{tabular}{crcr}
\multicolumn{2}{c}{\Hybrid\ Monte Carlo}
&
\multicolumn{2}{c}{Simple Metropolis}
\\
%%%%%%%%%%%%%%%%%%%%
% HMC easy start
\raisebox{1.5in}{\makebox[0.1in][l]{(a)}}&%
\hspace{-0.42in}\psfig{figure=hmcdemo/hmc.sample2.ps,angle=-90,width=2.53in}%
%
% detail inset:
%
\makebox[0.0in][r]{\hspace{-0.15in}\raisebox{0.1in}{%
%{\small\sf{detail:}}
\psfig{figure=hmcdemo/det0.ps,angle=-90,width=1.5in}}}%
%%%%%%%%%%%%%%%%%%%%
&
%%%%%%%%%%%%%%%%%%%%
% metrop
\raisebox{1.5in}{\makebox[0.1in][l]{(c)}}&%
\hspace{-0.42in}\psfig{figure=hmcdemo/metrop2.ps,angle=-90,width=2.53in}%
\makebox[0in][l]{\hspace{-0.15in}\raisebox{0.1in}{\makebox[0.0in][r]{%
%{\small\sf{detail:}}
\psfig{figure=hmcdemo/det2.ps,angle=-90,width=1.5in}}}}
%%%%%%%%%%%%%%%%%%%%
\\
%%%%%%%%%%%%%%%%%%%%
% HMC hard start
\raisebox{1.5in}{\makebox[0.1in][l]{(b)}}&%
\hspace{-0.42in}\psfig{figure=hmcdemo/hmc.converge4.ps,angle=-90,width=2.53in}
%%%%%%%%%%%%%%%%%%%%
&%
%%%%%%%%%%%%%%%%%%%%
% metrop
\raisebox{1.5in}{\makebox[0.1in][l]{(d)}} & %
\hspace{-0.42in}\psfig{figure=hmcdemo/metrop4.ps,angle=-90,width=2.53in}
%%%%%%%%%%%%%%%%%%%%
\\
\end{tabular}
\end{center}
}{%
\caption[a]{{(a,b) \Hybrid\ Monte Carlo used to generate samples from a
\ind{bivariate Gaussian} with correlation $\rho = 0.998$. (c,d) For comparison, a simple
\index{Monte Carlo methods!random-walk Metropolis}\ind{random-walk Metropolis method}, given equal computer time.}
%
}
\label{fig.hmcdemo}
}%
\end{figure}
%
%
The \indexs{Hamiltonian Monte Carlo}\index{Monte Carlo methods!Hamiltonian Monte Carlo}{\hybrid\ Monte Carlo}
\index{algorithm!Hamiltonian Monte Carlo}method
% has been reviewed and developed by \citeasnoun{Neal_dop}.
is a Metropolis method, applicable to continuous state
spaces, that makes use of gradient information to reduce
random walk behaviour. [The Hamiltonian Monte Carlo method
was originally called \ind{hybrid Monte Carlo}, for historical
reasons.]
For many systems whose probability $P(\bx)$ can be written in the form
\beq
P(\bx) = \frac{ e^{- E(\bx)} }{Z},
\eeq
not only $E(\bx)$ but also its
gradient with respect to $\bx$ can be readily evaluated. It seems
wasteful to use a simple random-walk Metropolis method when
this gradient is available -- the gradient indicates
which direction one should go in
to find states that have higher probability!
\subsection{Overview of \hybrid\ Monte Carlo}
In the {\hybrid\ Monte Carlo} method,
the state space $\bx$ is augmented by {\em momentum
variables\/} $\bp$, and there is an alternation of two types
of proposal. The first proposal randomizes the
\ind{momentum} variable, leaving the state $\bx$ unchanged.
The second
proposal changes both $\bx$ and $\bp$
using
simulated Hamiltonian dynamics as defined by the Hamiltonian
\beq
H(\bx,\bp) = E(\bx) + K(\bp) ,
\eeq
where $K(\bp)$ is a `kinetic energy' such as $K(\bp) = \bp^{\T}\bp/2$.
% are iterated for a number of steps;
These two proposals are used to create (asymptotically) samples from
the joint density
\beq
P_H(\bx,\bp) = \frac{1}{Z_H} \exp [ - H(\bx,\bp) ] = \frac{1}{Z_H} \exp [ - E(\bx) ] \exp [ - K(\bp) ].
\eeq
This density is separable,
so the marginal distribution of $\bx$ is
the desired distribution $\exp [ - E(\bx) ]/Z$.
So, simply discarding the momentum variables, we obtain a sequence of
samples $\{ \bx^{(t)} \}$ that asymptotically come from
$P(\bx)$.
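The marginalization can be written out explicitly. Writing $Z_K = \int \d \bp \, \exp [ - K(\bp) ]$ for the normalizing constant of the momentum density, we have
\beq
 \int \d \bp \: P_H(\bx,\bp)
 = \frac{1}{Z_H} \exp [ - E(\bx) ] \int \d \bp \, \exp [ - K(\bp) ]
 = \frac{Z_K}{Z_H} \exp [ - E(\bx) ] .
\eeq
Since the left-hand side must integrate to one over $\bx$, it follows that $Z_H = Z Z_K$, and the marginal is $\exp [ - E(\bx) ]/Z = P(\bx)$.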
\subsection{Details of \hybrid\ Monte Carlo}
The first proposal, which can be viewed as a Gibbs sampling update,
draws a new momentum from the
Gaussian density $\exp [ - K(\bp) ]/{Z_K}$. This
% is a Gibbs sampling update and the
proposal is always accepted.
During the second, dynamical proposal,
the momentum variable determines where the
state $\bx$ goes, and the {\em gradient\/} of
% the log of the probability density $P(\bx)$
$E(\bx)$ determines how the momentum $\bp$ changes, in accordance
with the equations
\beqan
\dot{\bx} &=& \bp \\
\dot{\bp} &=& - \frac{\partial E(\bx)}{\partial\bx}
.
\eeqan
Because of the persistent motion of $\bx$ in the direction of the
momentum $\bp$
during each dynamical proposal,
the state of the system tends to move a distance that goes {\em linearly\/}
with the computer time, rather than as the square root.
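For the kinetic energy $K(\bp) = \bp^{\T}\bp/2$, these equations conserve $H(\bx,\bp)$ when they are integrated exactly, as the chain rule shows:
\beq
 \frac{\d H}{\d t}
 = \frac{\partial E(\bx)}{\partial\bx} \cdot \dot{\bx} + \bp \cdot \dot{\bp}
 = \frac{\partial E(\bx)}{\partial\bx} \cdot \bp - \bp \cdot \frac{\partial E(\bx)}{\partial\bx}
 = 0 .
\eeq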
%
% see itp/hmcdemo
% octave
% DEMO
%
The second proposal is accepted in accordance with the Metropolis rule.
If the simulation of the Hamiltonian dynamics is numerically perfect
then the proposals are accepted every time, because the total energy
$H(\bx,\bp)$ is a constant of the motion and so $a$ in \eqref{eq.ratio.metrop}
is equal to one. If the simulation is
imperfect, because of finite step sizes for example, then some of the
dynamical proposals will be rejected. The rejection rule makes use of
the change in $H(\bx,\bp)$, which is zero if the simulation is
perfect. The occasional rejections ensure that,
asymptotically, we obtain samples $(\bx^{(t)},\bp^{(t)})$
from the required joint density $P_H(\bx,\bp)$.
The source code in \figref{fig.hmc} describes a \hybrid\ Monte Carlo
method that uses the `leapfrog' algorithm\index{leapfrog algorithm}\index{algorithm!leapfrog}
to simulate the dynamics
on the function {\tt{findE(x)}}, whose gradient is found by the function
{\tt{gradE(x)}}. \Figref{fig.hmcdemo} shows this algorithm generating
samples from a bivariate Gaussian whose energy function is
$E(\bx) = \half \bx^{\T} \bA \bx$ with
\beq
\bA = \left[
\begin{array}{rr}
250.25 & -249.75 \\
-249.75 & 250.25
\end{array}
\right] ,
\eeq
%
corresponding to a variance--covariance matrix of
\beq
\left[
\begin{array}{ll}
1 & 0.998 \\
0.998 & 1
\end{array}
\right] .
\eeq
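As a check on these two matrices, a bivariate Gaussian with unit variances and correlation $\rho$ has inverse covariance matrix
\beq
 \frac{1}{1-\rho^2}
 \left[
 \begin{array}{rr}
 1 & -\rho \\
 -\rho & 1
 \end{array}
 \right] ,
\eeq
and for $\rho = 0.998$ we find $1/(1-\rho^2) \simeq 250.25$ and $\rho/(1-\rho^2) \simeq 249.75$, in agreement with $\bA$ to the precision quoted. The eigenvalues of $\bA$ are $250.25 + 249.75 = 500$ and $250.25 - 249.75 = 0.5$, so the standard deviations of the distribution along its two principal axes are $1/\sqrt{500} \simeq 0.045$ and $1/\sqrt{0.5} \simeq 1.4$.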
In \figref{fig.hmcdemo}a,
starting from the state marked by the arrow, the solid line
represents two successive trajectories generated by the Hamiltonian
dynamics. The squares show the endpoints of these two trajectories.
Each trajectory consists of $\Tau =19$
`leapfrog' steps with $\ttepsilon = 0.055$.
These steps are indicated by the crosses on the trajectory in the
magnified inset.
After each trajectory, the momentum is randomized.
Here, both trajectories are
accepted; the errors in the Hamiltonian were only $+0.016$ and $-0.06$
respectively.
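Written as equations, each `leapfrog' step in the inner loop of \figref{fig.hmc} is
\beqan
 \bp(t + \epsilon/2) &=& \bp(t) - \frac{\epsilon}{2} \frac{\partial E}{\partial\bx} ( \bx(t) )
\\
 \bx(t + \epsilon) &=& \bx(t) + \epsilon \, \bp(t + \epsilon/2)
\\
 \bp(t + \epsilon) &=& \bp(t + \epsilon/2) - \frac{\epsilon}{2} \frac{\partial E}{\partial\bx} ( \bx(t + \epsilon) ) ,
\eeqan
and the whole trajectory is then accepted with probability $\min ( 1 , \exp [ - \Delta H ] )$, where $\Delta H$ is the change in $H(\bx,\bp)$ between the start and the end of the trajectory (the quantity {\tt{dH}} in the code).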
\Figref{fig.hmcdemo}b shows how a sequence of four trajectories converges
from an initial condition, indicated by the arrow,
that is not close to the typical set of
the target distribution. The trajectory parameters $\Tau$ and
$\ttepsilon$ were randomized for each trajectory using uniform
distributions with means 19 and 0.055 respectively. The first trajectory takes
us to a new state, $(-1.5,-0.5)$,
similar in energy to the first state. The second
trajectory happens to end in a state nearer the bottom of the energy
landscape. Here, since the potential energy $E$ is smaller, the kinetic
energy $K = \bp^{\T}\bp/2$ is necessarily larger than it was at the start of the trajectory.
When
the momentum is randomized before the third trajectory, its kinetic energy
becomes much smaller. After the fourth trajectory has been simulated,
the state appears to have become typical of the target density.
\Figsref{fig.hmcdemo}(c) and (d) show a
random-walk Metropolis method using a Gaussian proposal density
to sample from the same Gaussian distribution, starting from the
initial conditions of (a) and (b) respectively.
In (c) the step size
% radius had been
was adjusted
such that the acceptance rate was 58\%.
The number of proposals was 38 so the total amount of computer time
used was similar to that in (a). The distance moved is small because
of random walk behaviour.
In (d) the random-walk Metropolis method was
started from the same initial condition as (b)
and given a similar amount of
computer time.
%
%
% see hmc.tex for an attempt to make a toy simulation story.
%
% The {\dbf \hybrid\ Monte Carlo} method mentioned in the section
% on Metropolis methods makes use of gradient information to reduce
% random walk behaviour.
%
%
% for a nice adler demo, cd itp/adler; gnuplot ; load 'gnu'
%
\begin{figure}
\figuremargin{\small%
\begin{center}
\begin{tabular}{@{}l@{}}
\begin{tabular}{ccc}&Gibbs sampling & Overrelaxation\\
\raisebox{1.5in}{\makebox[0in][l]{(a)}}&%
\hspace{-0.2in}\psfig{figure=adler/gibbs.xy.ps,angle=-90,width=2.3in} &
\hspace{-0.2in}\psfig{figure=adler/adler.xy.ps,angle=-90,width=2.3in} \\
\raisebox{1.5in}{\makebox[0in][l]{(b)}}&%
&
\hspace{-0.2in}\psfig{figure=adler/adler.xy.det.ps,angle=-90,width=2.3in} \\
\end{tabular}
\\
\raisebox{5mm}{\makebox[0in][l]{(c)}}%
\hspace{-0.2in}%
\begin{tabular}[t]{l}
\hspace{12mm}Gibbs sampling \\
\psfig{figure=adler/gibbs.x1.ps,width=5in,angle=-90} \\[-0.05in]
\hspace{12mm}Overrelaxation\\
\psfig{figure=adler/adler.x1.ps,width=5in,angle=-90} \\
\end{tabular}
\\
\end{tabular}
\end{center}
}{%
\caption[a]{{Overrelaxation contrasted with Gibbs sampling for a
bivariate Gaussian with correlation $\rho = 0.998$.}
%
(a) The state sequence for 40 iterations, each iteration involving
one update of both variables. The overrelaxation method
had $\alpha=-0.98$. (This excessively large value is chosen
to make it easy to see how the overrelaxation method reduces random
walk behaviour.) The dotted line shows the contour
$\bx^{\T} \bSigma^{-1} \bx=1$.
%
(b) Detail of (a), showing the two steps making up each iteration.
(c) Time-course of the variable $x_1$ during 2000 iterations of the
two methods. The overrelaxation method
had $\alpha=-0.89$.
(After \protect\citeasnoun{Radford.over}.)
%
}
\label{fig.adler}
}%
\end{figure}
%Page 19
%What's the relationship between overrelaxtion and antithetic variables
%(e.g. Besag and Green, 1993, JRSS(B))?
\section{Overrelaxation}
The method of {\dbf\ind{overrelaxation}}
reduces
random walk behaviour in Gibbs sampling.
Overrelaxation was originally introduced for systems in which all the
conditional distributions are Gaussian.
\begin{aside}
An example of a
% There are
joint distribution that is {\em not\/} Gaussian but whose conditional
distributions {\em are\/} all Gaussian is $P(x,y)=
\exp(-x^2 y^2-x^2 -y^2)/Z$.
\end{aside}
% Adler's Overrelaxation
\subsection{Overrelaxation for Gaussian conditional distributions}
In ordinary Gibbs sampling,
one draws the new value $x_i^{(t+1)}$
of the current variable $x_i$ from its conditional
distribution, ignoring the old value $x_i^{(t)}$.
The state makes lengthy random walks in cases where
the variables are strongly correlated, as illustrated in the
left-hand panel of \figref{fig.adler}.
% the joint distribution has strong correlations between the variables.
This figure uses a correlated Gaussian distribution
as the target density.
% that we used when studying the \hybrid\ Monte Carlo method.
In \nocite{Adler1981}Adler's (1981)
% In \quotecite{Adler1981}
overrelaxation method, one instead samples
$x_i^{(t+1)}$ from a Gaussian that is biased to the {\em opposite\/} side of
the conditional distribution.
%, and that is narrower than the
% Gaussian describing the conditional distribution.
% When the bias and width are appropriately set,
If the conditional distribution of $x_i$ is $\Normal(\mu,\sigma^2)$
and the current value of $x_i$ is $x_i^{(t)}$, then Adler's method
sets $x_i$ to
\beq
x_i^{(t+1)} = \mu + \alpha ( x_i^{(t)} - \mu ) + (1- \alpha^2 )^{1/2} \sigma \nu ,
\label{eq.adler}
\eeq
where $\nu \sim \Normal(0,1)$ and $\alpha$ is a parameter between $-1$ and
$1$, usually set to a negative value. (If $\alpha$ is positive, then the
method is called under-relaxation.)
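Two limiting cases of \eqref{eq.adler} may help to fix ideas:
\beqan
 \alpha = 0 &:& x_i^{(t+1)} = \mu + \sigma \nu
\\
 \alpha = -1 &:& x_i^{(t+1)} = \mu - ( x_i^{(t)} - \mu ) = 2 \mu - x_i^{(t)} .
\eeqan
The first is an ordinary Gibbs sampling update, which ignores $x_i^{(t)}$; the second is a deterministic reflection of $x_i^{(t)}$ through the conditional mean. Values of $\alpha$ between $-1$ and $0$ interpolate between these two behaviours.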
\exercisxA{2}{ex.adler}{
Show that this individual transition leaves invariant the conditional
distribution $x_i \sim \Normal(\mu,\sigma^2)$.
}
A single iteration of Adler's overrelaxation, like one of Gibbs sampling,
updates each variable in turn as indicated in \eqref{eq.adler}.
The transition matrix $T(\bx';\bx)$ defined by a complete update of
all variables in some fixed order
does not satisfy detailed balance.
Each individual transition for one coordinate
just described {\em does\/} satisfy detailed balance -- so the
overall chain gives a valid
sampling strategy which converges to the target density $P(\bx)$ --
but
when we form a chain by applying the individual transitions
in a fixed sequence, the overall
chain is not reversible. This temporal asymmetry is the key to why
overrelaxation can be beneficial. If, say,
two variables are positively correlated, then they will (on a short
timescale) evolve
in a directed manner instead of by random walk,
as shown in \figref{fig.adler}. This may
significantly reduce the time required to obtain
% effectively
independent samples.
% This method is still a valid
% sampling strategy -- it converges to the target density $P(\bx)$ --
% because it is made up of transitions that satisfy detailed balance.
% (XXXXXXXXXXXXXXXXXXX)
%
% PUT THIS FIG BACK in any non-erice version, and this ref.
%
% Figure \ref{fig.adler} illustrates the difference
% between Gibbs sampling and overrelaxation for the case of
% a bivariate Gaussian distribution. Notice how much more rapidly
% overrelaxation gets around the distribution.
\exercisxC{3}{ex.detbaladler}{
The transition matrix $T(\bx';\bx)$ defined by a complete update of
all variables in some fixed order
does not satisfy \ind{detailed balance}. If the updates were in a {\em random order},
then $T$ would be symmetric. Investigate, for the toy two-dimensional
Gaussian distribution, the assertion
that the advantages of overrelaxation are lost if the
overrelaxed updates are made in a random order.
}
\subsection{Ordered Overrelaxation\nonexaminable}
The overrelaxation method has been generalized by\index{overrelaxation!ordered}\index{Monte Carlo methods!overrelaxation!ordered}
\citeasnoun{Radford.over}
whose {\dem\ind{ordered overrelaxation}\/} method is applicable to\index{Neal, Radford}
{\em any\/} system where \ind{Gibbs sampling}\index{Monte Carlo methods!Gibbs sampling}
is used.
%
% MORE HERE
%
In ordered overrelaxation, instead of taking one sample from
the conditional distribution $P(x_i \given \{ x_j \}_{j \neq i} )$,
we create $K$ such samples $x_i^{(1)}, x_i^{(2)}, \ldots , x_i^{(K)}$,
where $K$ might be set to twenty or so.
Often, generating $K-1$ extra samples adds a negligible
computational cost to the initial computations required for making
the first sample. The points $\{ x_i^{(k)} \}$ are then
sorted numerically, and the current value of $x_i$ is inserted into
the sorted list, giving a list of $K+1$ points. We give them
ranks $0,1,2,\ldots, K$. Let $\kappa$ be
the rank of the current value of $x_i$ in the list.
We set $x'_i$ to the value that is an equal distance from the
other end of the list, that is,
%\beq
the value with rank $K-\kappa$.
%\eeq
The role played by Adler's $\alpha$ parameter is here played by the parameter
$K$. When $K=1$, we obtain ordinary Gibbs sampling.
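For example, suppose that $K=3$ (a small value chosen here only for illustration), that the three samples from the conditional distribution are $0.2$, $1.1$, and $1.9$, and that the current value is $x_i = 0.7$. The sorted list, with its ranks, is then
\beq
 \begin{array}{c|cccc}
 \mbox{rank} & 0 & 1 & 2 & 3 \\ \hline
 \mbox{value} & 0.2 & 0.7 & 1.1 & 1.9
 \end{array}
\eeq
so $\kappa = 1$ and the new value is the one with rank $K - \kappa = 2$, namely $x'_i = 1.1$. In the case $K=1$ the list contains only the current value and one fresh sample; the current value has rank $\kappa \in \{ 0,1 \}$, and the value with rank $1-\kappa$ is always the fresh sample, which is why ordinary Gibbs sampling is recovered.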
% Radford says this should be cut:
% In practice, it might be a good idea to use a small value of $K$, \eg, $K=1$, before
% convergence of a Gibbs sampler, then a value such as $K=20$ after
% convergence, because atypicality persists longer for larger values of $K$.
% (Imagine an atypical state at the 95th percentile hopping across
%to an equally atypical state at the 5th)
%
For practical purposes Neal\index{Neal, Radford} estimates that ordered overrelaxation
may speed up a simulation by a factor of ten or twenty.
%
% but maybe should recommend don't use before convergence. RADFORD SAYS NO
%
% It is a method which can be used wherever Gibbs sampling is used.
\section{Simulated annealing}
A third technique for speeding convergence is {\dbf\ind{simulated
annealing}}. In \index{Monte Carlo methods!simulated annealing}simulated\index{annealing}
annealing, a `\ind{temperature}' parameter is introduced
which, when large, allows the system to make transitions that
would be improbable at temperature 1. The temperature is
set to a large value and gradually reduced to 1. This procedure is supposed
to reduce the chance that the simulation gets
stuck in an unrepresentative probability island.
We assume that we wish to sample from a
distribution of the form
\beq
P(\bx) = \frac{ e^{- E(\bx)} }{Z}
\eeq
where $E(\bx)$ can be evaluated. In the simplest simulated annealing method,
we instead sample from the distribution
\beq
P_T(\bx) =
{\smallfrac{1}{Z(T)}} \, { e^{-\smallfrac{E(\bx)}{T}} }
\eeq
and decrease $T$ gradually to 1.
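For example, if the target is a one-dimensional Gaussian of standard deviation $\sigma$ (an illustrative choice), so that $E(x) = x^2/(2\sigma^2)$, then
\beq
 P_T(x) = \frac{1}{Z(T)} \exp \left( - \frac{x^2}{2 T \sigma^2} \right) ,
 \:\:\:\:
 Z(T) = \sqrt{2 \pi T} \, \sigma ,
\eeq
which is a Gaussian of standard deviation $\sqrt{T} \sigma$: raising the temperature broadens the distribution, and $T=1$ recovers the target.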
Often the energy function can be separated into two terms,
\beq
E(\bx) = E_0(\bx) + E_1(\bx),
\eeq
of which the first term is `nice' (for example, a separable function
of
% linear in
$\bx$) and the second is `nasty'.
In these cases, a better simulated annealing method might make use of
the distribution
\beq
P'_T(\bx) =
{\smallfrac{1}{Z'(T)}}
\,
e^{-E_0(\bx)-\textstyle\dfrac{ E_1(\bx)}{T}}
\eeq
with $T$ gradually decreasing to 1.
In this way, the distribution at high temperatures reverts to a
well-behaved distribution defined by $E_0$.
Simulated annealing is often used as an \ind{optimization} method, where
the aim is to find an $\bx$ that minimizes $E(\bx)$, in which case
the temperature is decreased to zero rather than to 1.
As a Monte Carlo method, simulated annealing as
described above doesn't sample exactly
from the right distribution, because there is no guarantee
that the probability of falling into one basin of the energy
is equal to the total probability of all the states
in that basin.
%-- indeed we would expect this not to be the case
% in general; simulated annealing is usually a biased sampling method.
The closely related
`simulated tempering' method \cite{Marinari1992}
corrects the biases introduced by the annealing process
by making the temperature itself a random variable that is
updated in Metropolis fashion during the simulation.
%
\quotecite{Radford_ais}
% Neal's (unpublished)
`annealed importance sampling' method\index{Neal, Radford}
removes the biases introduced by annealing by computing importance weights
for each generated point.
% more?
\section{Skilling's multi-state leapfrog method\nonexaminable}
\label{sec.skillingleapfrog}
A fourth method for speeding up Monte Carlo simulations,
% reducing random walk behaviour,
due to \index{Skilling, John}John Skilling,
% was introduced by Skilling (unpublished, 2002); it
has a similar spirit to overrelaxation,
but works in more dimensions.
This method is applicable to sampling from a distribution
over a continuous state space, and the sole requirement
is that the energy $E(\bx)$ should be easy to evaluate.
The gradient is not used.
This leapfrog method is not intended to be used on its
own but rather in sequence with other Monte Carlo
operators.
Instead of moving just one state vector $\bx$
around the state space, as was the case for all the
Monte Carlo methods discussed thus far,
Skilling's leapfrog method simultaneously\index{Monte Carlo methods!multi-state}
maintains a {\dem{set}\/} of $S$ state vectors
$\{ \bx^{(s)} \}$, where $S$ might be six or twelve.
% or so.
The aim is that all $S$ of these vectors will represent independent
samples from the same distribution $P(\bx)$.
Skilling's leapfrog makes a proposal for the new
state ${\bx^{(s)}}'$,
which is accepted or rejected in accordance with the
Metropolis method, by%
\amarginfig{t}{
\begin{center}
\setlength{\unitlength}{1mm}
\begin{picture}(40,40)
\put(0,0){\makebox(0,0)[bl]{\psfig{figure=figs/gallager/skilling.eps,width=40mm,height=30mm}}}
\put(0,0){\makebox(0,0)[br]{$\bx^{(s)}$}}
\put(20,15){\makebox(0,0)[br]{$\bx^{(t)}$}}
\put(40,30){\makebox(0,0)[br]{${\bx^{(s)}}'$}}
\end{picture}
\end{center}
}\
leapfrogging the current state $\bx^{(s)}$ over
another state vector $\bx^{(t)}$:
\beq
{\bx^{(s)}}' = \bx^{(t)} + ( \bx^{(t)} - \bx^{(s)} )
= 2 \bx^{(t)} - \bx^{(s)} .
\eeq
All the other state vectors are left where they are, so
the acceptance probability depends only on the change in
energy of $\bx^{(s)}$.
Which vector, $t$, is the partner for
the leapfrog event can be chosen in
various ways. The simplest method
is to select the partner at random from the
other vectors.
It might be better to choose $t$ by selecting one
of the nearest neighbours of $\bx^{(s)}$
-- nearest by any chosen distance function --
as long as one then uses an acceptance rule that ensures
detailed balance by checking whether point $t$ is still among
the nearest
neighbours of the new point, ${\bx^{(s)}}'$.
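The leapfrog update with a randomly chosen partner can be sketched as follows.
This is a rough illustration, not Skilling's own code: the Gaussian energy and
the names \verb|energy| and \verb|leapfrog_update| are assumptions introduced
here.

```python
import math
import random

def energy(x):
    # Hypothetical target: zero-mean Gaussian, unit variance in each dimension.
    return 0.5 * sum(xi * xi for xi in x)

def leapfrog_update(states, s, rng):
    # Choose a partner t uniformly at random from the other state vectors.
    t = rng.choice([i for i in range(len(states)) if i != s])
    # Proposal: reflect x^(s) through x^(t), i.e. x' = 2 x^(t) - x^(s).
    proposal = [2.0 * xt - xs for xt, xs in zip(states[t], states[s])]
    # Metropolis rule; only the energy of x^(s) changes.
    delta = energy(proposal) - energy(states[s])
    if delta <= 0.0 or rng.random() < math.exp(-delta):
        states[s] = proposal
        return True
    return False
```

The proposal is symmetric: leapfrogging ${\bx^{(s)}}'$ over the same partner
returns $\bx^{(s)}$, and the partner is chosen with the same probability in
both directions, so the plain Metropolis rule suffices in this
random-partner case.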
\subsection{Why the leapfrog is a good idea}
Imagine that the target density $P(\bx)$ has
strong correlations -- for example, the density
might be a needle-like Gaussian with width $\epsilon$
and length $L \epsilon$, where $L \gg 1$.
As we have emphasized, motion around such a density by standard
methods proceeds by a slow random walk.
Imagine now that our set of $S$ points is lurking initially
in a location that is probable under the density,
but in an inappropriately small ball of size $\epsilon$.
Now, under Skilling's leapfrog method,
a typical first move will take the point a little
outside the current ball, perhaps
doubling its distance from the centre of the ball.
After all the points have had a chance to move,
the ball will have increased in size;
if all the moves are accepted, the ball will
be bigger by a
factor of two or so in all dimensions. The rejection
of some moves will mean that the ball containing the
points will probably have elongated
in the needle's long direction by a factor of, say, two.
After another cycle through the points, the
ball will have grown in the long direction by another factor of two.
So the typical distance travelled in the long dimension
grows {\em{exponentially\/}} with the number of iterations.
Now, maybe a factor of two growth per iteration
is on the optimistic side; but even if
the ball only grows by a factor of, let's say, 1.1 per iteration,
the growth is nevertheless exponential. It will only
take a number of iterations proportional to $\log L/\log(1.1)$
for the long dimension to be explored.
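To put numbers on this, take $L = 1000$ as an illustrative value and, purely
for the sake of the arithmetic, set the constant of proportionality to one:

```latex
\beq
	\frac{\log L}{\log 1.1} = \frac{\ln 1000}{\ln 1.1}
	\simeq \frac{6.91}{0.0953} \simeq 72 ,
	\qquad
	\frac{\log L}{\log 2} = \log_2 1000 \simeq 10 .
\eeq
```

Either figure is tiny compared with the number of steps, of order $L^2 = 10^6$,
that a random walk with steps of size $\epsilon$ would need to cover a
length $L \epsilon$.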
\exercissxB{2}{ex.skilling}{
Discuss how the effectiveness of
Skilling's method scales with dimensionality,
using a correlated $N$-dimensional Gaussian distribution
as an example. Find an expression for the rejection
probability, assuming the Markov chain is at equilibrium.
Also discuss
how it scales with the strength of correlation among
the Gaussian variables.
[\Hint: Skilling's method is invariant under
affine transformations, so the rejection probability
at equilibrium can be found by looking at the case
of a {\em{separable}\/} Gaussian.]
% $x_i \sim \Normal(\mu,\sigma^2)$.
}
This method has some similarity to
the `\ind{adaptive direction sampling}' method
of \citeasnoun{Gilks_RG_ADS}\index{Gilks, W.R.}\index{Roberts, Gareth O.}\index{George, E.I.}
but the leapfrog method is simpler and can be applied to a greater
variety of distributions.
% Gilks, WR; Roberts, GO; George, EI (1994). Adaptive direction sampling.
% Statistician, 43, 179-189.
%
% Roberts, GO; Gilks, WR (1994). Convergence of adaptive direction sampling.
% Journal of Multivariate Analysis, 49, 287-298.
%5Here are some rough notes I just jotted down today
%5on monte carlo methods. I'm getting keen on the
%idea that genetic methods are the key step needed in
%order to speed up monte carlo, since sex allows
%information to be acquired by a population from an oracle
%at a rate sqrt(G) faster than the maximum rate achievable
%by a lone individual who evolves under standard dumb metropolis.
%
% So then the challenge is to come up with a birth or death
%rule which supplies the required slaughter of unfit individuals,
%post-sex, without ruining the proof of validity we are interested
%in obtaining.
%
% Hope this sparks some useful ideas. I started thinking about
%this because John Skilling is using a little bit of sex in his
%algorithms, but I am pretty sure his method is vastly suboptimal
%because he is not accompanying his sex by the appropriate amount of
%death.
%
\section{Monte Carlo algorithms as communication channels}
It may be a helpful perspective, when thinking about
speeding up Monte Carlo methods, to think about the information
that is being communicated.\index{Monte Carlo methods!information communication in}\index{communication}
Two communications take place when a sample from $P(\bx)$ is being generated.
First, the selection of a particular $\bx$ from $P(\bx)$
necessarily requires that at least $\log 1/P(\bx)$ random bits be consumed.
[Recall the use of inverse arithmetic coding as a method
for generating samples from given distributions (\secref{sec.ac.efficient}).]
% (\chref{ch.ac})]
%
% For example, could think about a chain qith Q_1 = +/1 mod B
% Q_2 = +/- 2^{1} mod B
% Q_b = +/- B
% (or 0/1)
% with each move consuming one bit, and the outcome generates a sample from
% 1...2^b, which is b bits.
%
Second, the generation of a sample conveys information about $P(\bx)$ from the
% involves consulting the
subroutine that is able to evaluate $P^*(\bx)$ (and from any other
subroutines that have access to properties of $P^*(\bx)$).
Consider a dumb Metropolis method, for example. In a
\ind{dumb Metropolis}\index{Monte Carlo methods!Metropolis method!dumb Metropolis} method,
the proposals $Q(\bx';\bx)$ have nothing to do with $P(\bx)$.
Properties of $P(\bx)$ are only involved in the
algorithm at the acceptance step, when the ratio
$P^*(\bx')/P^*(\bx)$ is computed.
The channel from the true distribution $P(\bx)$
to the user who is interested in computing properties of
$P(\bx)$ thus passes through a bottleneck: all
the information about $P$ is conveyed by
% mediated
the string of acceptances and rejections.
If $P(\bx)$ were replaced by a different distribution
$P_2(\bx)$, the only way in which this change would have an influence
is that the string of acceptances and rejections would be changed.
I am not aware of much use being made of
this information-theoretic view of Monte Carlo algorithms,
but I think it is an instructive viewpoint: if the aim is to
obtain information about properties of $P(\bx)$ then
presumably it is helpful to identify the channel through which
this information flows, and maximize the rate of information
transfer.
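The bottleneck can be made concrete with a small sketch of a dumb Metropolis
sampler that records its acceptance history. The Gaussian target, the uniform
proposal and the function names are assumptions chosen for illustration.

```python
import math
import random

def log_p_star(x):
    # Hypothetical unnormalized target: standard Gaussian, log P*(x) = -x^2/2.
    return -0.5 * x * x

def dumb_metropolis(n_steps, step, seed=0):
    rng = random.Random(seed)
    x = 0.0
    samples, history = [], []
    for _ in range(n_steps):
        # The proposal ignores the target entirely.
        x_new = x + rng.uniform(-step, step)
        # The target enters only through the ratio P*(x')/P*(x).
        log_ratio = log_p_star(x_new) - log_p_star(x)
        accept = log_ratio >= 0.0 or rng.random() < math.exp(log_ratio)
        if accept:
            x = x_new
        history.append(1 if accept else 0)
        samples.append(x)
    return samples, history
```

Given the random numbers used, the list \verb|samples| is a deterministic
function of \verb|history|: replacing \verb|log_p_star| by another function
can alter the output only by altering that binary string.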
\exampl{ex.whyhalf}{
The information-theoretic viewpoint offers a simple justification
for the widely adopted rule of thumb, which states that the
parameters of a dumb Metropolis method should be adjusted such that the
\ind{acceptance rate}\index{Monte Carlo methods!acceptance rate}
is about one half.
Let's call the acceptance history, that is,
the binary string of accept or reject
decisions, $\ba$. The information learned about $P(\bx)$ after
the algorithm has run for $T$ steps is less than or equal to the
information content of $\ba$, since all information about $P$
is mediated by $\ba$. And the information content of $\ba$ is upper-bounded
by $T H_2(f)$, where $f$ is the acceptance rate. This bound
on information acquired about $P$ is maximized by setting $f=1/2$.
}
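The bound $T H_2(f)$ is easy to tabulate. The following sketch evaluates it;
the function names are hypothetical.

```python
import math

def binary_entropy(f):
    # H_2(f) in bits, with the convention 0 log 0 = 0.
    if f <= 0.0 or f >= 1.0:
        return 0.0
    return -f * math.log2(f) - (1.0 - f) * math.log2(1.0 - f)

def information_bound(n_steps, f):
    # Upper bound, in bits, on the information content of the acceptance history.
    return n_steps * binary_entropy(f)
```

At $f=1/2$ the bound is one bit per step; at $f=0.1$ it falls to about
0.47 bits per step.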
% Radford says:
% The information theory perspective looks useful, but it seems hard to
% get precise results out of it. The accept/reject decisions for a
% sequence of updates will generally not be independent, so the actual
% amount of information conveyed will be less than one bit per proposal.
% So it's not clear (though perhaps one can attempt to \analyse) whether
% or not departing from a 50% accept rate actually reduces the
% information (it could be it reduces the dependence enough to
% compensate for departing from the marginally optimal 50% rate). As
% you maybe are aware, Gareth Roberts (with, I think, Gelman and Gilks)
% has shown that a 23% acceptance rate is optimal under certain
% circumstances, asymptotically with increasing dimensionality.
%
% The evolutionary perspective is tantalizing, but also looks hard to
% get to work. One basic problem is that death is not reversible.
% Looked at another way, the reason we want death is to make room for
% births, specifically for births that are similar to their parents (or
% siblings, if parents are eliminated). If we succeed in this, we will
% have an overall state consisting of many copies of the base state,
% many of which are very similar. This is quite incompatible with the
% usually scheme of defining the distribution for the overall state by
% saying the component base states are independent, with all having the
% desired distribution.
Another\index{evolutionary computing}
helpful analogy for a dumb Metropolis method is an evolutionary
one. Each proposal generates a progeny $\bx'$ from the current state $\bx$.
These two individuals then compete with each other, and the Metropolis
method uses
a noisy survival-of-the-fittest rule. If the progeny $\bx'$ is
fitter than the parent (\ie, $P^*(\bx')>P^*(\bx)$, assuming the $Q/Q$ factor is
unity) then the progeny replaces the parent. The survival rule
also allows less-fit progeny to replace the parent, sometimes.
Insights about the rate of evolution can thus be applied to Monte Carlo methods.
\exercisxC{3}{ex.learnMC}{
Let $\bx \in \{ 0,1 \}^G$ and let
$P(\bx)$ be a separable distribution,
%
\beq
P(\bx) = \prod_g p(x_g),
\eeq
with $p(0) = p_0$ and $p(1)=p_1$, for example $p_1=0.1$.
Let the proposal density of a dumb Metropolis algorithm
$Q$ involve flipping a fraction $m$ of the $G$ bits in the state $\bx$.
Analyze how long it takes for the chain to converge to the
target density as a function of $m$. Find the optimal $m$
and deduce how long the Metropolis method must run for.
Compare the result with the results for an evolving population
under natural selection found in \chref{ch.sex}.
}
The insight that the fastest progress
that a standard Metropolis method can make, in information terms,
is about one
bit per iteration, gives a strong motivation for speeding
up the algorithm.
This chapter has already reviewed several methods for
reducing random-walk behaviour. Do these methods also
speed up the rate at which information is acquired?
\exercisxC{4}{ex.learnMC2}{
Does Gibbs sampling, which is a smart Metropolis method
whose proposal distributions do depend on $P(\bx)$,
allow information about $P(\bx)$ to leak out at a rate
faster than one bit per iteration?
Find toy examples in which this question can be
precisely investigated.
}
\exercisxC{4}{ex.learnMC3}{
\Hybrid\ Monte Carlo is another smart Metropolis method
in which the proposal distributions depend on $P(\bx)$.
Can \Hybrid\ Monte Carlo extract information about $P(\bx)$ at a rate
faster than one bit per iteration?
}
\exercisxC{5}{ex.learnimport}{
In importance sampling, the weight $w_r = P^*(\bx^{(r)})/Q^*(\bx^{(r)})$,
a floating-point number,
is computed and retained until the end of the computation.
In contrast, in the dumb Metropolis method, the ratio $a = P^*(\bx')/P^*(\bx)$
is reduced to a single bit (`is $a$ bigger than or smaller than
the random number $u$?').
% \in (0,1)$?').
Thus in principle importance sampling preserves more information
about $P^*$ than does dumb Metropolis.
Can you find a toy example in which this extra information
does indeed lead to faster convergence of importance sampling
than Metropolis?
Can you design a Markov chain Monte Carlo algorithm
that moves around adaptively, like a Metropolis method,
and that retains more useful information about the
value of $P^*$, like importance sampling?
}
In \chref{ch.sex} we noticed that an evolving population of $N$ individuals
can make faster evolutionary progress if the individuals
engage in sexual reproduction. This observation motivates
looking at Monte Carlo algorithms in which multiple parameter vectors $\bx$
are evolved and interact.
\section{Multi-state methods\nonexaminable}
In a multi-state method, multiple parameter vectors $\bx$ are maintained;\index{Monte Carlo methods!multi-state}
they evolve individually under moves such as Metropolis and Gibbs; there
are also interactions among the vectors.
The intention is either
that eventually all the vectors $\bx$ should be
samples from $P(\bx)$ (as illustrated by Skilling's leapfrog method), or that information associated with the final
vectors $\bx$
should allow us to approximate expectations under $P(\bx)$, as
in importance sampling.
\subsection{Genetic methods}
% There is a good reason for wanting to use a population and have sex.
% In inference, the computational task is to hunt for the the place where
% the probability is big. This is difficult. As the chapter on sex
% and evolution (\chref{ch.sex}) shows,
% if the problem can be decomposed,
% it's more efficient to have many individuals and crossover and selection.
% The following gives a particular way of making crossover and selection
% in a principled way.
Genetic algorithms\index{genetic algorithm}\index{evolutionary computing}\index{algorithm!genetic}
are not
often described by their proponents as Monte Carlo algorithms, but
I think this is the correct categorization, and an ideal genetic
algorithm would be one that can be proved to be a valid Monte Carlo algorithm
that converges to a specified density.
I'll use $R$ to denote the number of vectors in the population.
We aim to have $P^*(\{\bx^{(r)}\}_1^R) = \prod_{r=1}^{R} P^*(\bx^{(r)})$.
A genetic algorithm involves moves of two or three types.
First, individual moves in which one state vector is perturbed,
$\bx^{(r)} \rightarrow {\bx^{(r)}}'$, which
could be performed using any of the Monte Carlo methods
we have mentioned so far.
Second, we allow crossover moves of the form
$\bx,\by \rightarrow \bx',\by'$;
in a typical crossover move, the progeny $\bx'$ receives half his
state vector from one parent, $\bx$, and half from the other, $\by$;
the secret of success in a \ind{genetic algorithm}\index{algorithm!genetic} is that
% the way that
the parameter $\bx$
% relates to $P(\bx)$
must be encoded in such a way that the crossover of two independent
states $\bx$ and $\by$, both of which have good
fitness $P^*$, should have a reasonably good chance of producing progeny
who are equally fit.
This constraint is a hard one to satisfy in many problems, which
is why genetic algorithms are mainly talked about and hyped up,
and rarely used by serious experts.
Having introduced a crossover move
$\bx,\by \rightarrow \bx',\by'$, we need to
choose an acceptance rule. One easy way to obtain
a valid algorithm is to accept or reject the crossover proposal
using the Metropolis rule with $P^*(\{\bx^{(r)}\}_1^R)$
as the target density -- this involves
comparing the fitnesses before and after the crossover using the
ratio
\beq
\frac{ P^*( \bx' ) P^*( \by' ) }{ P^*( \bx ) P^*( \by ) } .
\eeq
If the crossover operator is reversible then
we have an easy proof that this procedure satisfies detailed
balance and so is a valid component in a chain
converging to $P^*(\{\bx^{(r)}\}_1^R)$.
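A reversible crossover with this acceptance rule might be sketched as follows
for binary state vectors. The uniform-mask crossover and the toy fitness
function are assumptions made for illustration; other reversible crossover
operators would serve equally well.

```python
import math
import random

def log_p_star(x):
    # Hypothetical separable fitness over bits: each 1 is favoured by a factor of e.
    return float(sum(x))

def crossover_update(x, y, rng):
    # Uniform crossover: each position is swapped between x and y with probability 1/2.
    # The same mask maps (x', y') back to (x, y), so the proposal is symmetric.
    mask = [rng.random() < 0.5 for _ in x]
    x_new = [b if m else a for a, b, m in zip(x, y, mask)]
    y_new = [a if m else b for a, b, m in zip(x, y, mask)]
    # Metropolis ratio P*(x')P*(y') / (P*(x)P*(y)), computed in the log domain.
    log_ratio = (log_p_star(x_new) + log_p_star(y_new)
                 - log_p_star(x) - log_p_star(y))
    if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
        return x_new, y_new, True
    return x, y, False
```

For a separable $P^*$ such as this toy one the ratio is always one, so every
crossover is accepted; rejections appear only when the fitness couples
different components of the state vector.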
\exercisxB{3}{ex.geneticenough}{
Discuss whether the above two operators, individual
variation and crossover with the Metropolis acceptance rule,
will give a more efficient Monte Carlo method
than a standard method with only one state vector and
no crossover.
}
The sexual community could acquire information
faster than the asexual community in \chref{ch.sex} because
the \ind{crossover} operation produced diversity with standard deviation
$\sqrt{G}$, and the \ind{Blind Watchmaker} was then able to convey lots of information
about the fitness function by {\em killing off\/} the less fit offspring.
The above two operators do {\em not\/} offer a speed-up of $\sqrt{G}$
compared with standard Monte Carlo methods because there is
no killing. What's required, in order to obtain a speed-up, is two things:
multiplication and death; and at least one of these must operate {\em selectively}.
Either we must kill off the less-fit state vectors, or
we must allow the more-fit state vectors to give rise to more offspring.
% We need a birth process in which
% is some sort of birth
% process in which $\bx,\by \rightarrow \bx',\by'$
While it's easy to sketch these ideas, it is hard to
define a valid method that implements them.
\exercisxD{5}{ex.geneticsolve}{
Design a birth rule and a death rule
such that the chain converges
to $P^*(\{\bx^{(r)}\}_1^R)$.
}
% http://www.robots.ox.ac.uk/~misard/condensation.html
% http://www-sigproc.eng.cam.ac.uk/~ad2/book.html
I believe this is still an open research problem.
% \index{particle filters}{Particle filters}
% offer a partial solution to this problem for cases where
% the target density $P^*$ can be chopped into a product of factors.
% One way to chop a target density into a product of factors is annealing:
% we can, for example, write $P^*(x) = ( P_{\epsilon}(x) )^N$, where
% $P_{\epsilon}(x) \equiv ( P^*(x) )^{\epsilon}$ and $N=1/\epsilon$ is the
% number of steps in an annealing process with equally-spaced temperatures.
% Thus births and deaths can be integrated into the annealing process.
% these cross
% and accept or reject
% using Metropolis.
% (This is not how many people do Genetic algorithm, but it is a good idea)
%
% Birth or death rule: Skilling's method couple deaths and births to
% the annealing process.\index{Skilling, John}
% Reduce temperature such that just one of $R$ is likely to die.
\subsection{Particle filters}
%5 See next edition of the book!
Particle\index{particle filter}
filters, which are particularly popular in
inference problems involving temporal
tracking, are multi-state methods that mix the ideas
of importance sampling and Markov chain Monte Carlo.
% should be covered in the next edition of this book.
See
\citeasnoun{isard96visual}, \citeasnoun{isard98condensation},
\citeasnoun{berzuini97dynamic},
\citeasnoun{BerzuiniGilks2001}, \citeasnoun{particlefilters01}.
\section{Methods that do not necessarily help}
It is common practice to use {\em many\/} initial conditions
for a particular Markov chain (\figref{fig.mcresource}).
% (\secref{sec.mcresource}).
If you are worried about
sampling well from a complicated density $P(\bx)$,
{\em can\/} you ensure the states produced by
the simulations are well distributed about the
typical set of $P(\bx)$
by ensuring that the
initial points are `well distributed about
the whole state space'?
The answer is, unfortunately, no. In hierarchical Bayesian models,
for example, a large number of parameters $\{x_n\}$ may be coupled
together via another parameter $\b$ (known as a hyperparameter).
For example, the quantities $\{x_n\}$ might be independent
noise signals, and $\b$ might be the inverse-variance of
the noise source. The joint distribution of $\b$ and $\{x_n\}$
might be
\beqa
P( \b , \{x_n\} ) & =& P(\b) \prod_{n=1}^{N} P(x_n\given \b) \\ \nonumber
& =& P(\b) \prod_{n=1}^{N} \smallfrac{1}{Z(\b)} \, e^{-\b x_n^2 / 2} ,
\eeqa
where $Z(\b) = \sqrt{ 2 \pi / \b }$ and $P(\b)$ is a broad
distribution describing our ignorance about the noise level.
For simplicity, let's leave out all the other variables -- data and such --
that might be involved in a realistic problem.
Let's imagine that we want to sample effectively from $P( \b , \{x_n\} )$
by Gibbs sampling -- alternately sampling $\b$ from the
conditional distribution $P(\b\given \{x_n\})$ then sampling
all the $x_n$ from their
conditional
distributions $P(x_n\given \b)$.
[The resulting marginal distribution of $\b$ should asymptotically be the
broad distribution $P(\b)$.]
If $N$ is large then the conditional distribution
of $\b$ given any particular setting of $\{x_n\}$
will be tightly concentrated on a particular most-probable
value of $\b$, with width proportional to $1/\sqrt{N}$.
Progress up and down the $\b$-axis will therefore take place
by a slow random walk with steps of size $\propto 1/\sqrt{N}$.
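The slow random walk is easy to observe numerically. The sketch below runs the
Gibbs sampler just described. The text does not specify $P(\b)$, so a broad
Gamma prior with shape $a$ and rate $b$ is assumed here purely for
convenience, since it makes the conditional distribution of $\b$ a Gamma
distribution as well; the function names are likewise hypothetical.

```python
import math
import random

def gibbs_beta_chain(n_vars, n_sweeps, beta0=1.0, a=0.01, b=0.01, seed=0):
    # Assumed broad prior: P(beta) = Gamma(shape a, rate b); not specified in the text.
    rng = random.Random(seed)
    beta = beta0
    trace = [beta]
    for _ in range(n_sweeps):
        # Sample all x_n from P(x_n | beta) = Normal(0, 1/beta).
        sigma = 1.0 / math.sqrt(beta)
        sum_sq = sum(rng.gauss(0.0, sigma) ** 2 for _ in range(n_vars))
        # Sample beta from P(beta | {x_n}) = Gamma(a + N/2, rate b + sum_sq/2).
        shape = a + 0.5 * n_vars
        rate = b + 0.5 * sum_sq
        beta = rng.gammavariate(shape, 1.0 / rate)  # second argument is a scale
        trace.append(beta)
    return trace

def mean_abs_log_step(trace):
    # Typical size of one step of the random walk in log(beta).
    steps = [abs(math.log(v / u)) for u, v in zip(trace, trace[1:])]
    return sum(steps) / len(steps)
```

Comparing \verb|mean_abs_log_step| for $N=10$ and $N=1000$ shows the
fractional step in $\b$ shrinking as $N$ grows.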
So, to the initialization strategy. Can we finesse our
slow convergence problem by using initial conditions located
`all over the state space'?
Sadly, no.
If we distribute the points $\{x_n\}$ widely,
what we are actually doing is favouring
an initial value of the noise level $1/\b$ that is {\em large\/}.
% , since widely varying values of $\{x_n\}$.
The random walk of the parameter $\b$ will thus tend,
after the first drawing of $\b$ from $P(\b\given \{x_n\})$,
always to start off from one end of the $\b$-axis.
\section*{Further reading}
\index{Neal, Radford}The \hybrid\ Monte Carlo method \cite{duane-kennedy-pendleton-roweth-87}
is
reviewed in \citeasnoun{Neal_dop}.
This excellent tome also reviews a huge range of other Monte Carlo
methods, including the related
topics of simulated annealing\index{annealing} and free energy estimation.
% For another advanced method for adapting Markov chains,
% see \citeasnoun{MCMCRegeneration}.
% \index{regeneration}\index{Markov chain Monte Carlo!regeneration}
\section{Further exercises}
\exercisxC{4}{ex.hmcreversible}{
An important detail of the \hybrid\ Monte Carlo method is\index{Hamiltonian Monte Carlo}
that the simulation of the Hamiltonian dynamics,
while it may be inaccurate, must be
perfectly reversible, in the sense that
if the initial condition $(\bx,\bp)$ goes to
% \rightarrow
$(\bx',\bp')$,
then the same simulator must take $(\bx',-\bp')$
to
% \rightarrow
$(\bx,-\bp)$,
and the inaccurate dynamics must conserve state-space volume.
% cut this on radford's advice
% (In fact, this second rule is redundant since,
% if state-space volume is not conserved, perfect reversibility is impossible,
% if the state $\bx,\bp$ is represented with finite precision using integers.)
% the rule of perfect reversibility must be violated
[The leapfrog method in \algref{fig.hmc} satisfies these rules.]
Explain why these rules must be satisfied and create an example illustrating
the problems that arise if they are not.
}
\exercisxC{4}{ex.multi-state-slice}{
{\sf A multi-state idea for slice sampling.}\index{Monte Carlo methods!multi-state}
Investigate the following multi-state method for slice sampling.
As in {Skilling's multi-state leapfrog method} (\secref{sec.skillingleapfrog}),
maintain a set of $S$ state vectors
$\{ \bx^{(s)} \}$. Update one state vector $\bx^{(s)}$ by one-dimensional
slice sampling in a direction $\by$ determined by
picking two other state vectors $\bx^{(v)}$ and $\bx^{(w)}$ at random and
setting $\by = \bx^{(v)} - \bx^{(w)}$.
\amarginfig{b}{
\begin{center}
\setlength{\unitlength}{1mm}
\begin{picture}(40,40)
\put(0,0){\makebox(0,0)[bl]{\psfig{figure=figs/gallager/multislice.eps,width=40mm,height=30mm}}}
\put(8,11){\makebox(0,0)[tl]{$\bx^{(s)}$}}
\put(34,19.5){\makebox(0,0)[tl]{$\bx^{(v)}$}}
\put(21.50,8){\makebox(0,0)[tl]{${\bx^{(w)}}$}}
\end{picture}
\end{center}
}\
Investigate this method on toy problems such as a highly correlated
multivariate \ind{Gaussian distribution}. Bear in mind that
if $S-1$ is smaller than the number of dimensions $N$ then
this method will not be ergodic by itself, so it may
need to be mixed with other methods. Are there
classes of problems that are better solved by this slice-sampling method
than by the standard methods for picking $\by$ such
as cycling through the coordinate axes or picking $\by$ at random
from a Gaussian distribution?
}
% see advanced_mc.tex
\section{Solutions}
\soln{ex.skilling}{
Consider the spherical Gaussian distribution where
all components have mean zero and variance 1.
In one dimension, the $n$th, if $x^{(1)}_n$ leapfrogs over $x^{(2)}_n$,
we obtain the proposed coordinate
\beq
(x^{(1)}_n)' = 2 x^{(2)}_n - x^{(1)}_n .
\eeq
Assuming that $x^{(1)}_n$ and $x^{(2)}_n$ are Gaussian random variables
from $\Normal(0,1)$, $(x^{(1)}_n)'$ is Gaussian from
$\Normal(0,\sigma^2)$,
where $\sigma^2 = 2^2 + (-1)^2 = 5$.
The change in energy contributed by this one dimension will be
% half x'^2 - ( half x^2 )
\beq
\frac{1}{2} \left[ ( 2 x^{(2)}_n - x^{(1)}_n )^2 - ( x^{(1)}_n )^2 \right] =
% 4 (x^{(2)}_n)^2 - 4 x^{(2)}_n x^{(1)}_n + (x^{(1)}_n )^2 - ( x^{(1)}_n )^2 =
2 ( x^{(2)}_n)^2 - 2 x^{(2)}_n x^{(1)}_n
\eeq
so the typical change in energy is $2 \langle ( x^{(2)}_n)^2 \rangle = 2$.
This positive change is bad news.
In $N$ dimensions, the typical change in energy when
a leapfrog move is made, at equilibrium,
is thus $+2N$.
The probability of acceptance of the move scales
as
\beq
e^{-2N} .
% \exp$
\eeq
This implies that Skilling's method, as described, is not effective
in very high-dimensional problems -- at least, not
once convergence has occurred.
% , so it is hard to imagine that it's effective .
Nevertheless it has the impressive advantage that its
convergence properties are independent of the
strength of correlations between the variables --
a property that not even the \hybrid\ Monte Carlo
and overrelaxation methods offer.
% In a bit more detail, roughly what does the above calculation
% mean about the rate of convergence? If we assume that
% the rare acceptances lead to a
% rough doubling in size of the ball of points (in the
% directions in which growth is permitted) then
% there will be a doubling every $e^{2N}$ iterations,
% and the number of iterations to fill out the long dimension (which
% is $L$ times larger than the initial ball)
% will be roughly $e^{2N} \log_2 L$.
}
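The estimate of $+2N$ for the typical energy change in the solution to
exercise \ref{ex.skilling} can be checked by simulation. The following sketch
draws both points from the unit spherical Gaussian and averages the energy
change of the leapfrog proposal; the function name is hypothetical.

```python
import random

def mean_energy_change(n_dims, n_trials, seed=0):
    # Both points are drawn from the equilibrium distribution Normal(0, I).
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        x1 = [rng.gauss(0.0, 1.0) for _ in range(n_dims)]
        x2 = [rng.gauss(0.0, 1.0) for _ in range(n_dims)]
        proposal = [2.0 * b - a for a, b in zip(x1, x2)]
        # Energy is E(x) = |x|^2 / 2 for the unit spherical Gaussian.
        total += 0.5 * (sum(p * p for p in proposal) - sum(a * a for a in x1))
    return total / n_trials
```

With $N=10$ the average should come out close to 20.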
\prechapter{About Chapter}
%
% \chapter{Ising Models}
%
% for entropy versus temperature see
%\label{fig.Sising}
% in basic_mc
%
Some of the neural network models that we will encounter
are related to \ind{Ising model}s, which are idealized magnetic systems.
% familiar to most physics graduates.
%\footnote{Though maybe not the
% present Cambridge class?}
It is not essential to understand
the statistical physics of Ising models to understand these
neural networks,
but I hope you'll find them helpful.
% think it is a good idea to include some notes on them,
% as much to revise the beauties of \ind{statistical physics} as to
% refresh the memory about Ising models.
Ising models are also related to several other topics in this book.
We will use exact tree-based computation methods like those
introduced in \chapterref{ch.exact} to evaluate properties of
interest in Ising models.
Ising models offer crude models for binary images.\index{image models}\index{binary images}
And Ising models relate to two-dimensional \ind{constrained channel}s (\cf\ \chapterref{ch.noiseless}):
a two-dimensional \ind{bar-code} in which a black dot may not
be completely surrounded by black dots, and
% {\em vice versa\/}
a white dot may not
be completely surrounded by white dots,
is similar to an antiferromagnetic Ising model at low temperature.
Evaluating the entropy of this Ising model is equivalent to
evaluating the capacity of the constrained channel for conveying bits.
If you would like to jog your memory on statistical physics
and thermodynamics, you might find \appendixref{app.statphy}
helpful. I also recommend the book by \citeasnoun{Reif}.
\ENDprechapter
\chapter{Ising Models}
\label{ch.ising}
\fakesection{Ising Models}
An \ind{Ising model}\index{spin system}
is an array of spins (\eg, atoms that can take
states $\pm 1$) that are
magnetically coupled to each other. If one spin is, say, in the $+1$ state
then it is energetically favourable for its immediate neighbours to
be in the same state, in the case of a ferromagnetic model,
and in the opposite state, in the case of an antiferromagnet.
In this chapter
% e following
% three two sections
we discuss two computational techniques
for studying Ising models.
Let the state $\bx$ of an Ising model with $N$ spins be a vector in which
each component $x_n$ takes values $-1$ or $+1$.
If two spins $m$ and $n$ are neighbours we write $(m,n) \in {\cal N}$.
The coupling between neighbouring spins is $J$.
We define $J_{mn} = J$ if $m$ and $n$ are neighbours
and $J_{mn}=0$ otherwise. The energy of a state
$\bx$ is
\beq
E(\bx;J,H) = - \left[
% \frac{1}{2}
\half \sum_{m,n}
% \begin{array}{@{}c@{}}m,n:\\
% (m,n) \in {\cal N}\end{array} }
J_{mn} x_m x_n
+ \sum_{n} H x_n \right] ,
\label{eq.ising.e}
\eeq
where
% $J$ is the coupling
% between spins $m$ and $n$,
% and
$H$ is the applied field. If $J > 0$ then the model is
\ind{ferromagnetic}, and if $J < 0$ it is \ind{antiferromagnetic}.
We've included the factor of $\dhalf$ because
each pair is counted twice in the first sum, once as $(m,n)$ and once as $(n,m)$.
% ; alternatively, we could sum over all $m$ and $n$, shove in a factor
% of $\half$, and define $J_{mn}$ (the coupling between spins $m$ and
% $n$) to be zero if $(m,n) \not \in \cal N$.
%
% In Physics we may be interested in the properties of Ising models with
% a large number $N$ of spins having regular geometric neighbourhood
% relationships.
At equilibrium at temperature $T$, the probability that the
state is $\bx$ is
\beq
P( \bx\given \beta, J,H) = \frac{1}{Z(\beta,J,H)} \exp \! \left[ - \beta E( \bx ; J , H ) \right] ,
\label{eq.ising.p}
\eeq
where $\beta = 1/k_{\rm B}T$, $k_{\rm B}$ is Boltzmann's constant, and
\beq
Z(\beta,J,H) \equiv \sum_{\bx} \exp \!
\left[ - \beta E( \bx ; J , H ) \right] .
\label{eq.ising.z}
\eeq
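For a handful of spins, equations (\ref{eq.ising.e}--\ref{eq.ising.z}) can be
evaluated by brute force. The sketch below does so for a one-dimensional ring,
a geometry chosen here only because it is simple; the function names are
hypothetical, and each neighbouring pair is listed once, which absorbs the
factor of $\dhalf$.

```python
import itertools
import math

def ring_neighbours(n_spins):
    # Hypothetical geometry: a one-dimensional ring (n_spins >= 3), each pair listed once.
    return [(n, (n + 1) % n_spins) for n in range(n_spins)]

def energy(x, pairs, J, H):
    # Summing over each neighbouring pair once absorbs the factor of 1/2.
    coupling = sum(J * x[m] * x[n] for m, n in pairs)
    field = sum(H * xn for xn in x)
    return -(coupling + field)

def partition_function(n_spins, pairs, beta, J, H):
    # Brute-force sum over all 2^N states; feasible only for small N.
    return sum(math.exp(-beta * energy(x, pairs, J, H))
               for x in itertools.product((-1, +1), repeat=n_spins))

def probability(x, n_spins, pairs, beta, J, H):
    z = partition_function(n_spins, pairs, beta, J, H)
    return math.exp(-beta * energy(x, pairs, J, H)) / z
```

The cost grows as $2^N$, which is why the computational techniques of this
chapter are needed for systems of interesting size.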
\subsection{Relevance of Ising models}
Ising models are relevant for three reasons.
First, Ising models are important as models of magnetic systems
that have a phase transition. The theory of \ind{universality} in
statistical physics
shows that all systems with the same dimension (here, two),
and the same symmetries, have equivalent critical
properties, \ie, the scaling laws shown by their phase transitions
are identical. So by studying Ising models we can find out
not only about magnetic phase transitions but also about
phase transitions in many other systems.
% such as liquid-vapour transitions.
Second, if we generalize the energy function to
\beq
E(\bx;\bJ,{\bf h}) = - \left[
\frac{1}{2}
\sum_{m,n} J_{mn} x_m x_n
+ \sum_{n} h_n x_n \right] ,
\eeq
where the couplings $J_{mn}$ and applied fields $h_n$ are not constant,
we obtain a family of models known as `spin glasses'
to physicists, and as `Hopfield\pagebreak[1] networks' or
`Boltzmann machines' to the neural
network community. In some of these models, all spins are declared
to be neighbours of each other, in which case physicists call
the system an `infinite-range' spin glass, and networkers call
it a `fully connected' network.
Third,
% , as I will show in section \ref{sec.ising.retina},
the Ising model is also useful as a statistical model in its own
right.
In this chapter we will
% sections \ref{sec.ising.mc} and \ref{sec.ising.matrix} we will
study Ising models using two different computational techniques.
% The aim is not so much to learn about Ising
% models as it is to think about the Physics of thermodynamic systems.
\subsection{Some remarkable relationships in statistical physics}
\index{statistical physics}We
would like to get as much information as possible out of
our computations. Consider for example the \ind{heat capacity} of a system,
which is defined to be
\beq
C \equiv \frac{\partial}{\partial T} \bar{E}
,
\eeq
where
\beq
\bar{E} = \frac{1}{Z} \sum_{\bx} \exp( - \beta E(\bx) ) \, E(\bx) .
\eeq
% Naively, we might guess that to work
% out the heat capacity of a system at a certain
% temperature, we have to change the temperature to a higher
% temperature and measure the energy change.
%
% given only observations of the system at that temperature,
% or do we have to change the temperature?
%
To work out the heat capacity of a system,
we
might naively guess that we have to increase the temperature and
measure the energy change.
Heat capacity, however, is intimately related to energy {\em \ind{fluctuations}\/}
at constant temperature.
Let's start from the \ind{partition function},
\beq
Z = \sum_{\bx} \exp( -\b E(\bx) ) .
\eeq
The mean energy is obtained by differentiation \wrt\ $\b$:
\beq
\frac{ \partial \ln Z}{ \partial \b }
= \frac{1}{Z} \sum_{\bx} - E(\bx) \exp( -\b E(\bx) ) = - \bar{E} .
\eeq
A further differentiation spits out the variance of the \ind{energy}:
\beq
\frac{ \partial^2 \ln Z}{ \partial \b^2 } =
\frac{1}{Z} \sum_{\bx} E(\bx)^2 \exp( -\b E(\bx) ) - \bar{E}^2
= \langle{E^2}\rangle - \bar{E}^2 = {\rm var}(E) .
\eeq
But the heat capacity is also the derivative of $\bar{E}$ with respect to
temperature:
\beq
\frac{ \partial \bar{E} }{ \partial T }
= - \frac{ \partial}{ \partial T}
\frac{ \partial \ln Z}{ \partial \b }
= - \frac{ \partial^2 \ln Z}{ \partial \b^2 } \frac{ \partial \b }{ \partial T }
= - {\rm var}(E) ( -1/k_{\rm B} T^2 ) .
\eeq
So for any system at temperature $T$,
\beq
C = \frac{ {\rm var}(E) }{k_{\rm B} T^2} = k_{\rm B} \b^2 \, {\rm var}(E) .
\eeq
Thus if we can observe the variance of the energy of a system at equilibrium,
we can estimate its heat capacity.
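This relationship is easy to check numerically. The following Python sketch
(an illustration only, not the code used for the computations in this chapter)
takes a small, arbitrarily chosen list of state energies, sets $k_{\rm B}=1$,
and compares $C$ obtained from ${\rm var}(E)/T^2$ with a finite-difference
estimate of $\partial \bar{E} / \partial T$. The names
\verb+boltzmann_stats+ and \verb+toy_energies+ are hypothetical.

\begin{verbatim}
```python
import math

def boltzmann_stats(energies, T):
    """Return (mean energy, variance of energy) at temperature T, with k_B = 1."""
    beta = 1.0 / T
    e0 = min(energies)                      # shift energies for numerical stability
    weights = [math.exp(-beta * (e - e0)) for e in energies]
    Z = sum(weights)
    mean = sum(w * e for w, e in zip(weights, energies)) / Z
    mean_sq = sum(w * e * e for w, e in zip(weights, energies)) / Z
    return mean, mean_sq - mean * mean

def heat_capacity_from_fluctuations(energies, T):
    """C = var(E) / T^2, using only quantities defined at temperature T."""
    _, var = boltzmann_stats(energies, T)
    return var / T ** 2

def heat_capacity_by_differencing(energies, T, dT=1e-4):
    """Central finite-difference estimate of d(mean E)/dT."""
    e_hi, _ = boltzmann_stats(energies, T + dT)
    e_lo, _ = boltzmann_stats(energies, T - dT)
    return (e_hi - e_lo) / (2 * dT)

# Hypothetical toy spectrum: the energies of a handful of states.
toy_energies = [0.0, 1.0, 1.0, 2.5, 4.0]
for T in (0.5, 1.0, 3.0):
    c_fluct = heat_capacity_from_fluctuations(toy_energies, T)
    c_diff = heat_capacity_by_differencing(toy_energies, T)
    print(f"T={T}: C from var(E) = {c_fluct:.6f}, C from dE/dT = {c_diff:.6f}")
```
\end{verbatim}

The two printed columns should agree to within the accuracy of the finite difference.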
% More tricks can be found in section \ref{sec.ising.matrix}.
% , with derivations
I find this an almost paradoxical relationship.\index{paradox!heat capacity and fluctuations}
Consider a system
with a finite set of states, and imagine heating it up. At high
temperature, all states will be equiprobable, so the mean energy will
be essentially constant and the heat capacity will be essentially
zero. But on the other hand, with all states being equiprobable,
there will certainly be fluctuations in energy. So how can the heat
capacity be related to the fluctuations? The answer is in the
words `essentially zero' above. The heat capacity is not quite zero at high
temperature; it
just tends to zero. And it tends to zero as $\smallfrac{ {\rm var}(E)
}{k_{\rm B} T^2}$, with the quantity ${\rm var}(E)$ tending
to a constant at high
temperatures. This $1/T^2$ behaviour of the heat capacity of finite
systems at high temperatures is thus very general.
The $1/T^2$ factor can be viewed as an accident of history. If only
temperature scales had been defined using $\beta=\smallfrac{1}{\kB T}$,
then the definition of heat capacity would be
\beq
C^{(\beta)} \equiv - \frac{ \partial \bar{E} }{ \partial \b }
= {\rm var}(E) ,
\eeq
and heat capacity and fluctuations would be identical
quantities.
% , were it not for this slip by \ind{Kelvin}, \ind{Carnot} {\em et al}.
% \medskip
\exercisxB{2}{ex.SlZbE}{
[We will call the entropy of a physical system $S$ rather
than $H$, while we are in a statistical physics
chapter;
% for convenience we will
we set $k_{\rm B} = 1$.]
The entropy of a system whose states are $\bx$, at temperature $T=1/\beta$,
is
\beq
S = \sum p(\bx) \! \left[ \ln 1/p(\bx) \right]
\eeq
where
\beq
p( \bx ) = \frac{1}{Z(\beta)} \exp \! \left[ - \beta E( \bx ) \right] .
\label{eq.gen.p}
\eeq
\ben
\item
Show that
\beq
S = \ln Z(\beta) + \beta \bar{E}(\beta)
\eeq
where $\bar{E}(\beta)$ is the mean energy of the system.
\item
Show that
\beq
S = - \frac{ \partial F }{ \partial T } ,
\eeq
where the free energy $F = - kT \ln Z$ and $kT = 1/\b$.
\een
}
% Binder says simulate a 55x55 grid
% critical behaviour of magnetization: m -> B(1-T/Tc)^beta, beta=1/8
% Cv should have a divergence.
\section{Ising models -- Monte Carlo simulation}
\label{sec.ising.mc}
In this section we study two-dimensional planar Ising models
using a simple Gibbs-sampling
% or `heat bath Monte Carlo'
method.
Starting from some initial state, a spin $n$ is selected at
random, and the probability that it should be $+1$ given the state of
the other spins and the temperature is computed,
\beq
P(+1\given b_n)= \frac{1}{1+\exp(- 2 \beta b_n)},
\label{eq.gibbs}
\eeq
where $\beta = 1/k_{\rm B}T$ and $b_n$ is the local field
\beq
b_n = \sum_{m:(m,n) \in {\cal N}} J x_m + H.
\eeq
[The factor of 2 appears in equation (\ref{eq.gibbs}) because
the two spin states are $\{+1,-1\}$ rather than $\{ +1 , 0 \}$.]
Spin $n$ is set to $+1$ with that probability, and otherwise to
$-1$; then the next spin to update is selected at random.
After sufficiently many iterations, this procedure
converges to the equilibrium distribution
% of equation
(\ref{eq.ising.p}).
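The update just described can be sketched in a few lines of Python. This is
an illustration under stated assumptions, not the program used for the
simulations in this chapter: the grid is a square list of lists with periodic
boundaries, $k_{\rm B}=1$, and the names \verb+local_field+ and
\verb+gibbs_update+ are hypothetical. The probability in equation
(\ref{eq.gibbs}) is evaluated in the equivalent form
$\frac{1}{2} [ 1 + \tanh( \beta b_n ) ]$ to avoid overflow at low temperature.

\begin{verbatim}
```python
import math
import random

def local_field(x, i, j, J, H):
    """b_n = J * (sum of the four neighbouring spins) + H, periodic boundaries."""
    l = len(x)
    neighbours = (x[(i - 1) % l][j] + x[(i + 1) % l][j]
                  + x[i][(j - 1) % l] + x[i][(j + 1) % l])
    return J * neighbours + H

def gibbs_update(x, beta, J, H, rng):
    """Pick one spin at random and resample it from its conditional distribution."""
    l = len(x)
    i, j = rng.randrange(l), rng.randrange(l)
    b = local_field(x, i, j, J, H)
    p_up = 0.5 * (1.0 + math.tanh(beta * b))   # equals 1 / (1 + exp(-2 beta b))
    x[i][j] = 1 if rng.random() < p_up else -1

rng = random.Random(0)
l = 4
x = [[rng.choice((-1, 1)) for _ in range(l)] for _ in range(l)]
for _ in range(200 * l * l):
    gibbs_update(x, beta=2.0, J=1.0, H=0.0, rng=rng)
```
\end{verbatim}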
% $P(\bx)=\frac{1}{Z}\exp(-\beta E(\bx;J,B))$.
An alternative to the Gibbs sampling formula (\ref{eq.gibbs}) is the
Metropolis algorithm, in which we consider the change in energy
that results from flipping the chosen spin from its current state $x_n$,
\beq
\Delta E = 2 x_n b_n ,
\eeq
and adopt this change in configuration with probability
\beq
P( {\rm accept} ; \Delta E , \b ) = \left\{ \begin{array}{cc}
1 & \Delta E \leq 0 \\
\exp( - \beta \Delta E ) & \Delta E > 0 .
\end{array}
\right.
\eeq
Compared with Gibbs sampling, this procedure has up to double the probability of accepting energetically
unfavourable moves, so it may be a more efficient sampler -- but at very low
temperatures the relative merits
of
% choice between
Gibbs sampling and the
Metropolis algorithm may be subtle.%
%\begin{center}
\amarginfig{b}{
\mbox{\setlength{\unitlength}{1.572pt}
\begin{picture}(70,40)(0,-5)
\newsavebox{\verticalfour}
\savebox{\verticalfour}(0,0)[bl]{
\multiput(0,0)(0,10){4}{\circle{2}} % spins
\multiput(0,5)(0,10){4}{\line(0,-1){3}} % lines down
\multiput(0,-5)(0,10){4}{\line(0,1){3}} % lines up
\multiput(2,0)(0,10){4}{\line(1,0){6}} % lines right
}
\multiput(0,0)(10,0){6}{\usebox{\verticalfour}}
\end{picture}
}
\caption[a]{Rectangular Ising model.}
\label{fig.isingR}
}
%\end{center}
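The factor of two in this comparison can be made explicit with a short
calculation, sketched here using only the formulae above. Under Gibbs
sampling the chosen spin ends up flipped with probability
$1/[ 1 + \exp( \beta \Delta E ) ]$, which follows from
equation (\ref{eq.gibbs}) with $\Delta E = 2 x_n b_n$. Dividing the
Metropolis acceptance probability by this quantity gives, for either sign
of $\Delta E$,
\beq
	\frac{ P_{\rm Metropolis}( {\rm flip} ) }{ P_{\rm Gibbs}( {\rm flip} ) }
	= 1 + \exp( - \beta | \Delta E | ) .
\eeq
The ratio therefore lies between 1 and 2. It approaches 2 when
$\beta | \Delta E |$ is small, and it approaches 1 when
$\beta | \Delta E |$ is large, which is the low-temperature regime in which
the two methods behave almost identically on any given proposed flip.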
\subsection{Rectangular geometry}
I first simulated
% Let us first write a program that simulates
an Ising model with the rectangular
geometry shown
% below
in \figref{fig.isingR}, and with periodic boundary conditions. A
line between two spins indicates that they are neighbours.
%
% To make a bite-sized example, we will set $b$ to 0 throughout,
I set the external field $H=0$
and considered the two cases $J = \pm 1$
which are a ferromagnet and antiferromagnet respectively.
I started at a large temperature ($T \eq 33, \beta \eq 0.03$)
and changed the temperature every $I$ iterations, first decreasing
it gradually to $T\eq 0.1, \beta \eq 10$, then increasing it gradually back to a large
temperature again. This procedure gives a crude check on whether
`equilibrium has been reached' at each temperature; if not, we'd expect to
see some hysteresis in the graphs we plot. It also gives an
idea of the reproducibility of the results, if we assume that
the two runs, with decreasing and increasing temperature,
are effectively independent of each other.
At each temperature I recorded the mean energy per spin
and the standard deviation
of the energy, and the mean square value of the magnetization $m$,
\beq
m = { \smallfrac{1}{N}}
\sum_{n} x_n .
\eeq
%\begin{figure}
%\figuremargin{%
\marginfig{\small
\begin{center}
\makebox[-0.35in]{}
\begin{tabular}[b]{cl} $T$ & \\
5 & \risingsample{r0.2} \\%%%%%%%%%%%%%% restore these two!!!!!!!:::::::
%3 & \risingsample{r0.33} \\
%2.7 & \risingsample{r0.37} \\
2.5 & \risingsample{r0.4} \\
%\end{tabular}
%\begin{tabular}[b]{cl} % $T$ & \\
2.4 & \risingsample{r0.42} \\
2.3 & \risingsample{r0.44} \\
2 & \risingsample{r0.5} \\
\end{tabular}
\end{center}
%}{%
\caption[a]{Sample states of rectangular Ising models with $J=1$
at a sequence of temperatures $T$.
}
\label{fig.ising.states1}
}%
%\end{figure}
%
One tricky decision that has to be made is how soon to start taking
these measurements after a new temperature has been established; it
is difficult to detect `equilibrium' -- or even to give a clear
definition of a system's being `at equilibrium'! [But in \chref{ch.mcexact}
we will see a solution to this problem.] My crude strategy
was to let the number of iterations at each temperature, $I$, be a few hundred times the
number of spins $N$, and to discard the first $\dthird$ of those
% assume equilibrium had been reached after
% $I/3$
iterations. With $N\eq 100$, I found I needed more than $100\,000$
iterations to reach equilibrium at any given temperature.
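The measurement strategy just described might be sketched as follows. This
Python fragment is an illustration only, with $H=0$ and $k_{\rm B}=1$; the
function names, the choice to take one sample every $N$ updates, and the
short temperature schedule at the end are all assumptions made for the
example.

\begin{verbatim}
```python
import math
import random

def energy(x, J):
    """Total energy with H = 0: -J times the sum of x_m x_n over all bonds."""
    l = len(x)
    return -J * sum(x[i][j] * (x[(i + 1) % l][j] + x[i][(j + 1) % l])
                    for i in range(l) for j in range(l))

def gibbs_update(x, beta, J, rng):
    """Resample one randomly chosen spin given its four neighbours (H = 0)."""
    l = len(x)
    i, j = rng.randrange(l), rng.randrange(l)
    b = J * (x[(i - 1) % l][j] + x[(i + 1) % l][j]
             + x[i][(j - 1) % l] + x[i][(j + 1) % l])
    p_up = 0.5 * (1.0 + math.tanh(beta * b))   # equals 1 / (1 + exp(-2 beta b))
    x[i][j] = 1 if rng.random() < p_up else -1

def measure(x, T, J, iters_per_spin, rng):
    """Run I = iters_per_spin * N updates at temperature T, discard the
    first third, then record one sample every N updates."""
    l = len(x)
    N = l * l
    I = iters_per_spin * N
    e_samples, m2_samples = [], []
    for t in range(I):
        gibbs_update(x, 1.0 / T, J, rng)
        if t >= I // 3 and t % N == 0:
            e_samples.append(energy(x, J) / N)               # energy per spin
            m2_samples.append((sum(map(sum, x)) / N) ** 2)   # squared magnetization
    mean_e = sum(e_samples) / len(e_samples)
    var_e = sum((e - mean_e) ** 2 for e in e_samples) / len(e_samples)
    mean_m2 = sum(m2_samples) / len(m2_samples)
    return mean_e, math.sqrt(var_e), mean_m2

rng = random.Random(0)
x = [[1] * 6 for _ in range(6)]
for T in (5.0, 2.5, 2.0):          # a short, decreasing temperature schedule
    mean_e, std_e, mean_m2 = measure(x, T, J=1.0, iters_per_spin=200, rng=rng)
    print(f"T={T}: <E>/N={mean_e:.3f}  std={std_e:.3f}  <m^2>={mean_m2:.3f}")
```
\end{verbatim}

The state \verb+x+ is carried over from one temperature to the next, as in
the procedure described above.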
% My code is written in {\tt C} and is available at
% \verb+http://wol.ra.phy.cam.ac.uk/+.
% There are no fancy tricks.
\subsection{Results for small $N$ with $J=1$}
I simulated an $l \times l$ grid for $l = 4, 5, \ldots, 10, 40, 64$.
Let's have a quick think about what results we expect. At low temperatures
the system is expected to be in a ground state. The rectangular Ising model
with $J=1$ has two ground states,
the all $+1$ state and the all $-1$ state. The energy per spin of either
ground state is $-2$.
At high temperatures, the spins are independent,
all states are equally probable, and the
energy per spin is expected to fluctuate around a mean of
$0$ with a standard deviation proportional to $1/\sqrt{N}$.
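The $1/\sqrt{N}$ scaling can be made explicit with a short calculation,
sketched here under the assumptions $H=0$ and $l \geq 3$, so that the
periodic $l \times l$ grid has $2N$ distinct bonds. At infinite temperature
the spins are independent and uniformly distributed, so each bond product
$x_m x_n$ has mean 0 and variance 1, and the products on distinct bonds are
uncorrelated. Hence
\beq
	{\rm var}(E) = 2 N J^2 ,
\eeq
and the energy per spin, $E/N$, has standard deviation $|J| \sqrt{2/N}$.
For $|J|=1$ this is about $0.35$ when $N=16$ and about $0.14$ when $N=100$.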
% At intermediate temperatures we expect the energy to rise monotonically.
% By thinking more carefully we could probably predict the leading order
% behaviour at each extreme.
Let's look at some results. In all figures temperature $T$ is shown with
$k_{\rm B}=1$.
The basic picture emerges with as few
as 16 spins (\figref{fig.ising.16}, top):
the energy rises monotonically.
As we increase the number
of spins to 100 (\figref{fig.ising.16}, bottom) some new details emerge.
First, as expected, the fluctuations at large temperature decrease
as $1/\sqrt{N}$. Second, the fluctuations at intermediate temperature
become relatively {\em bigger}. This is the signature of a `\ind{collective}
phenomenon', in this case, a \ind{phase transition}. Only systems with
infinite $N$ show true phase transitions, but with $N=100$ we are getting
a hint of the \ind{critical fluctuations}. \Figref{fig.ising.100d} shows details
of the graphs for $N=100$ and $N=4096$.
\Figref{fig.ising.states1} shows a sequence of typical states from
the simulation of $N=4096$ spins at a sequence of decreasing temperatures.
\begin{figure}
\figuremargin{\small%
%\figuredangle{%
\begin{center}
\footnotesize
%\makebox[-0.35in]{}
\begin{tabular}{@{}cll} $N$ & \hspace{0.2in} Mean energy and fluctuations
%in energy
& \hspace{0.2in} Mean square magnetization \\
\raisebox{10mm}{16}&
\makebox[-0.15in]{}\psfig{figure=isingfigs/E4.1.ps,angle=-90,width=2.49in}&
\makebox[-0.15in]{}\psfig{figure=isingfigs/M4.1.ps,angle=-90,width=2.49in}\\
\raisebox{10mm}{100}&
\makebox[-0.15in]{}\psfig{figure=isingfigs/E10.1.ps,angle=-90,width=2.49in}&
\makebox[-0.15in]{}\psfig{figure=isingfigs/M10.1.ps,angle=-90,width=2.49in}\\
%4096&
%\makebox[-0.15in]{}\psfig{figure=isingfigs/E64.1.ps,angle=-90,width=2.4in}&
%\makebox[-0.15in]{}\psfig{figure=isingfigs/M64.1.ps,angle=-90,width=2.4in}\\
\end{tabular}
\end{center}
}{%
\caption[a]{Monte Carlo simulations of rectangular Ising models with $J=1$.
Mean energy and fluctuations in energy as a function of temperature (left).
Mean square magnetization as a function of temperature (right).
In the top row, $N=16$, and the bottom, $N=100$. For even larger
$N$, see later figures.
}
\label{fig.ising.16}
}%
\end{figure}
% these figures were done by _courses/comput/newising_mc/i but see ising_mc/README
%
% FIGURE \label{fig.ising.100d}
% was moved later than this natural point, against my wishes. and against logic.
% in order to get its number to be 31.5
\subsubsection{Contrast with Schottky anomaly}
\amarginfig{c}{%%%%%%%%%%%%%% this fig needs its axes cleaning up
\mbox{\psfig{figure=figs/fakeC.ps,width=1.5in,angle=-90}\footnotesize$\,T$}
\caption[a]{Schematic diagram to explain the meaning of
a \ind{Schottky anomaly}.
 The two curves show the heat capacities of two gases
as a function of temperature. The lower curve shows a
normal gas whose heat capacity is an increasing
function of temperature. The upper curve
has a small peak in the heat capacity, which is
known as a Schottky anomaly (at least in Cambridge).
The peak is produced by the gas having magnetic
degrees of freedom with a finite number of accessible states.
%
% can I find real data? see schott.gnu for this hack DO NOT EDIT
%
}
\label{fig.schottky}
}%%%%%%%%%%%%%%%%%%%%%
A peak in the \ind{heat capacity}, as a function of temperature,
occurs in any system that has a finite number of energy levels;
a peak is not in itself evidence of a phase transition.
Such peaks were viewed as anomalies in classical \ind{thermodynamics},
since `normal' systems with infinite numbers
of energy levels (such as a particle in a box) have heat capacities that are either
constant or increasing functions of temperature.
In contrast, systems with a finite number of levels produce
small blips in the heat capacity graph (\figref{fig.schottky}).
%
% this belongs earlier logically
%
\begin{figure}
\figuremargin{\small%
\begin{center}
\begin{tabular}{l@{}l} \multicolumn{1}{c}{ $N=100$ } & \multicolumn{1}{c}{ $N=4096$ } \\
\makebox[-0.15in]{(a)}\psfig{figure=isingfigs/E10.1d.ps,angle=-90,width=2.7in}
&
\makebox[-0.15in]{}\psfig{figure=isingfigs/E64.1d.ps,angle=-90,width=2.7in} \\[-0.1in]
\makebox[-0.15in]{(b)}\psfig{figure=isingfigs/SE10.1d.ps,angle=-90,width=2.7in}
&
\makebox[-0.15in]{}\psfig{figure=isingfigs/SE64.1d.ps,angle=-90,width=2.7in} \\[-0.1in]
\makebox[-0.15in]{(c)}\psfig{figure=isingfigs/M10.1d.ps,angle=-90,width=2.7in}&
\makebox[-0.15in]{}\psfig{figure=isingfigs/M64.1d.ps,angle=-90,width=2.7in} \\[-0.1in]
\makebox[-0.15in]{(d)}\psfig{figure=isingfigs/SC10.1.ps,angle=-90,width=2.7in}&
\makebox[-0.15in]{}\psfig{figure=isingfigs/SC64.1.ps,angle=-90,width=2.7in} \\[-0.1in]
\end{tabular}
\end{center}
}{%
\caption[a]{Detail of Monte Carlo simulations of rectangular Ising models with $J=1$.
(a) Mean energy and fluctuations in energy as a function of temperature.
(b) Fluctuations in energy (standard deviation).
(c) Mean square magnetization.
(d) Heat capacity.
}
\label{fig.ising.100d}
}%
\end{figure}
%
% END this belongs earlier.
%
Let us refresh
our memory of the simplest such system, a two-level system with
states $x=0$ (energy 0) and $x=1$ (energy $\epsilon$).
The mean energy is
\beq
E(\beta) = \epsilon \frac{ \exp( - \beta \epsilon ) }{
1 + \exp( - \beta \epsilon ) }
= \epsilon \frac{ 1}{
1 + \exp( \beta \epsilon ) }
\eeq
and the derivative with respect to $\beta$ is
\beq
\d E/\d \beta = - \epsilon^2 \frac{ \exp( \beta \epsilon ) }{
[ 1 + \exp( \beta \epsilon )]^2 } .
\label{eq.schot.dEdb}
\eeq
So the heat capacity is
\beq
C = \d E / \d T
% = dE/d\beta d(1/kT)/dT
= - \frac{ \d E}{\d\beta} \frac{ 1}{k_{\rm B}T^2}
= \frac{\epsilon^2}{k_{\rm B}T^2}
\frac{ \exp( \beta \epsilon ) }{
[ 1 + \exp( \beta \epsilon )]^2 }
\eeq
and the fluctuations in energy are given by
$\var(E) = C k_{\rm B} T^2 = - \d E/\d\beta$, which was
% already
evaluated in
(\ref{eq.schot.dEdb}).
The heat capacity and fluctuations are plotted in \figref{fig.schot}.
The take-home message at this point is that whilst Schottky anomalies
do have a peak in the heat capacity, there is {\em no\/} peak
in their {\em\ind{fluctuations}}; the variance of the
energy simply increases monotonically
with temperature to a value proportional to
the number of independent spins. Thus it is a peak
in the {\em{fluctuations}\/} that is interesting, rather than
a peak in the heat capacity.
% , as it gives evidence
% that something non-standard is going on.
% In contrast,
The Ising model has such a peak
% an exciting peak
in its fluctuations,
as can be seen in the second row of \figref{fig.ising.100d}.
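The two-level formulae above can be evaluated directly. The following Python
sketch (an illustration with $\epsilon = 1$ and $k_{\rm B}=1$; the name
\verb+two_level+ and the temperature grid are assumptions) confirms the
contrast: the heat capacity has a peak at an intermediate temperature,
whereas the variance of the energy increases monotonically towards
$\epsilon^2/4$.

\begin{verbatim}
```python
import math

def two_level(T, eps=1.0):
    """Mean energy, variance of energy, and heat capacity of a two-level
    system with levels 0 and eps, at temperature T (k_B = 1)."""
    beta = 1.0 / T
    p1 = 1.0 / (1.0 + math.exp(beta * eps))   # probability of the upper level
    mean = eps * p1
    var = eps ** 2 * p1 * (1.0 - p1)          # equals -dE/dbeta
    return mean, var, var / T ** 2

temps = [0.1 * 1.05 ** k for k in range(95)]  # log-spaced grid from 0.1 to about 10
variances = [two_level(T)[1] for T in temps]
capacities = [two_level(T)[2] for T in temps]

T_peak = temps[capacities.index(max(capacities))]
print(f"heat capacity peaks near T = {T_peak:.2f}")
print(f"variance at T = {temps[-1]:.1f} is {variances[-1]:.4f}")
```
\end{verbatim}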
% visible in the Ising plots is novel in contrast to Schottky anomalies.
\begin{figure}
\figuremargin{\small%
\begin{center}
\begin{tabular}{l}
\makebox[-0.15in]{}\psfig{figure=isingfigs/schot.ps,angle=-90,width=2.9in} \\
\end{tabular}
\end{center}
}{%
\caption[a]{Schottky anomaly --
Heat capacity and fluctuations in energy as a function of temperature
for a two-level system with separation $\epsilon = 1$ and $k_{\rm B} = 1$.
}
\label{fig.schot}
}%
\end{figure}
%gnuplot> dE(b)=-exp(b)/(1+exp(b))**2
%gnuplot> vE(T) = -dE(1/T)
%gnuplot> C(T) = - dE(1/T) / T**2
%gnuplot> plot [0.1:10] vE(x), C(x)
%gnuplot> set logs x ; set size 0.6,0.6 ; replot
%gnuplot> plot [0.1:10] vE(x) t "Var(E)", C(x) t "Heat Capacity"
%
%gnuplot>
%gnuplot> plot [0.1:10] vE(x) t "Var(E)", C(x) t "Heat Capacity",E(x) t "Var(E)",gnuplot> set logs x ; set size 0.6,0.6 ; replot
%gnuplot> plot [0.1:10] C(x) t "Heat Capacity", vE(x) t "Var(E)"
%gnuplot> set xlabel "Temperature"
%gnuplot> replot
%gnuplot> set term post
%Terminal type set to 'postscript'
%Options are 'landscape monochrome dashed "Helvetica" 14'
%gnuplot> set output "schot.ps"
%gnuplot> replot
\subsection{Rectangular Ising model with $J=-1$}
What do we expect to happen in the case $J=-1$? The ground states
of an infinite system are the two \ind{checkerboard} patterns (\figref{fig.ising.check}),
and they have energy per spin $-2$, like the ground states of
the $J \eq 1$ model. Can this analogy be pressed further?%
%\begin{figure}
\amarginfig{t}{\small%
\begin{center}
\makebox[-0.35in]{}
\begin{tabular}{c@{\hspace{0.25in}}c}
\smallrisingsample{sixC} &
\smallrisingsample{sixC2} \\
\end{tabular}
\end{center}
%}{%
\caption[a]{The two ground states of a rectangular Ising model with $J=- 1$.
}
\label{fig.ising.check}
}%
%\end{figure}
%\begin{figure}\figuremargin{%
\amarginfig{t}{\small
\begin{center}
\makebox[-0.35in]{}
\begin{tabular}{c@{\hspace{0.25in}}c} $J=-1$ & $J=+1$ \\
\smallrisingsample{six} &
\smallrisingsample{sixc} \\
\end{tabular}
\end{center}
%}{%
\caption[a]{Two states of rectangular Ising models with $J=\pm 1$
that have identical energy.
}
\label{fig.ising.check.equiv}
}%
%\end{figure}
A moment's reflection will confirm that the two systems are
equivalent to each other under a checkerboard symmetry operation.
If you take an infinite $J=1$ system in some state and flip all the spins
that lie on the black squares of an infinite checkerboard, and
set $J=-1$ (\figref{fig.ising.check.equiv}), then
the energy is unchanged. (The magnetization changes, of course.)
So all
thermodynamic properties of the two systems are expected to be identical
in the case of zero applied field.
% This provides a useful check on one's code.
But there is a subtlety lurking here. Have you spotted it?
%\newpage
%
We are simulating
finite grids with periodic boundary conditions. If the size of the grid in
any direction is {\em odd}, then the checkerboard operation is no longer
a symmetry operation relating $J=+1$ to $J=-1$, because the checkerboard
doesn't match up at the boundaries. This means that for systems
of odd size, the ground state of a system with $J=-1$
will have degeneracy greater than 2,
and the energy of those ground states will not be as low as $-2$ per spin.
So we expect qualitative differences between
the cases $J = \pm 1$ in odd-sized systems.
These differences are expected to be most prominent
for small systems. The \ind{frustration}s are introduced by the boundaries,
and the length of the boundary grows as the square root of the system
size, so the fractional influence of this boundary-related frustration
on the energy and entropy of the system will decrease as $1/\sqrt{N}$.
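For very small grids these claims can be checked by brute-force enumeration.
The Python sketch below is an illustration only (the function name
\verb+ground_states+ is hypothetical, and $H=0$ is assumed); it visits all
$2^N$ states of an $l \times l$ periodic grid and reports the lowest energy
per spin and the number of states that attain it.

\begin{verbatim}
```python
from itertools import product

def ground_states(l, J):
    """Lowest energy per spin, and its degeneracy, for an l-by-l rectangular
    Ising model with periodic boundaries and H = 0 (exhaustive search)."""
    N = l * l
    best_energy, degeneracy = None, 0
    for spins in product((-1, 1), repeat=N):
        E = 0
        for i in range(l):
            for j in range(l):
                s = spins[i * l + j]
                right = spins[i * l + (j + 1) % l]
                down = spins[((i + 1) % l) * l + j]
                E -= J * s * (right + down)    # each bond is counted once
        if best_energy is None or E < best_energy:
            best_energy, degeneracy = E, 1
        elif E == best_energy:
            degeneracy += 1
    return best_energy / N, degeneracy

for J in (1, -1):
    e, g = ground_states(3, J)
    print(f"l=3, J={J:+d}: energy per spin {e:.4f}, degeneracy {g}")
```
\end{verbatim}

For $l=3$ and $J=-1$ every row and every column is a frustrated ring of
three spins, so the enumeration finds an energy per spin of $-2/3$ and a
degeneracy much greater than 2. Calling \verb+ground_states(4, -1)+ takes
longer ($2^{16}$ states) and recovers the two checkerboard states with energy
per spin $-2$.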
%
\Figref{fig.ising.25} compares the energies of the ferromagnetic and
antiferromagnetic models with $N=25$. Here, the difference is striking.
% The graphs for fragments with even size are identical for $J=\pm 1$ as
% theoretically predicted.
\begin{figure}[hbtp]
\figuremargin{\small%
\begin{center}
\begin{tabular}{ll} \multicolumn{1}{c}{$J=+1$} & \multicolumn{1}{c}{$J=-1$} \\
\makebox[-0.15in]{}\psfig{figure=isingfigs/E5.1.ps,angle=-90,width=2.7in}&
%\makebox[-0.15in]{(b)}\psfig{figure=isingfigs/M5.1.ps,angle=-90,width=2.7in}\\
\makebox[-0.415in]{}\psfig{figure=isingfigs/E5.-1.ps,angle=-90,width=2.7in}\\
%\makebox[-0.15in]{(d)}\psfig{figure=isingfigs/M5.-1.ps,angle=-90,width=2.7in}\\
\end{tabular}
\end{center}
}{%
\caption[a]{Monte Carlo simulations of rectangular Ising models with $J=\pm 1$ and $N=25$.
Mean energy and fluctuations in energy as a function of temperature.
(a) $J=1$.
%(b) $J=1$. Mean square magnetization as a function of temperature.
(b) $J=-1$.
% Mean energy and fluctuations in energy as a function of temperature.
%(d) $J=-1$. Mean square magnetization as a function of temperature.
}
\label{fig.ising.25}
}%
\end{figure}
\subsection{Triangular Ising model}
We can repeat these computations for a triangular Ising model.
Do we expect the triangular Ising model with $J = \pm 1$
to show different physical
properties from the rectangular Ising model?
Presumably the $J=1$ model will have broadly similar properties
to its rectangular counterpart. But the case $J=-1$ is
radically different from what's gone before. Think about it:
{\em there is no unfrustrated ground state}; in any state, there {\em must\/} be
\ind{frustration}s -- pairs of neighbours
who have the same sign as each other. Unlike the case of
the rectangular model with odd size, the frustrations are
not introduced by the periodic boundary conditions.
{\em Every set of three mutually neighbouring spins must be in a state of
frustration,} as shown in \figref{fig.frustration}.
% this was in the caption but the margin got full...
(Solid lines show `happy' couplings which contribute $-|J|$ to the
energy; dashed lines show `unhappy' couplings which contribute
$|J|$.)
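The energies of the eight configurations follow from a one-line identity,
sketched here for three mutually neighbouring spins $x_1, x_2, x_3$ with
only the three bonds joining them counted. For $J = -|J|$,
\beq
	E_{\triangle} = |J| \left( x_1 x_2 + x_2 x_3 + x_3 x_1 \right)
	= \frac{|J|}{2} \left[ ( x_1 + x_2 + x_3 )^2 - 3 \right] .
\eeq
The sum $x_1 + x_2 + x_3$ is $\pm 1$ for the six states in which one spin
differs from the other two, giving $E_{\triangle} = -|J|$ with exactly one
unhappy coupling, and $\pm 3$ for the two aligned states, giving
$E_{\triangle} = 3|J|$. No configuration reaches $-3|J|$. In a periodic
triangular grid of $N$ spins there are $3N$ bonds and $2N$ elementary
triangles, with each bond shared by two triangles, so the total energy is
half the sum of the triangle energies and is at least $-N|J|$: the energy
per spin cannot fall below $-|J|$, compared with $-3|J|$ for the
ferromagnet.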
Thus we certainly expect different behaviour at low temperatures.
In fact we might expect this system to have a
non-zero entropy at absolute zero. (`Triangular model violates
\ind{third law of thermodynamics}!')\index{thermodynamics!third law}
Let's look at some results.
%
% this figure belongs higher up.
%\marginpar[b]{
%%%%%%%%%%%%%
%}
%%%%%%%%%%%% end marginpar
%\begin{figure}
%\figuremargin{%
\amarginfig{b}{
\begin{center}\small
\mbox{
\setlength{\unitlength}{1.7pt}% was 2pt
\begin{picture}(70,45)(0,-5)
\newsavebox{\verticalfourdiag}% hexagonal
\savebox{\verticalfourdiag}(0,0)[bl]{
\multiput(0,0)(0,10){4}{\circle{2}} % spins
\multiput(0,5)(0,10){4}{\line(0,-1){3}} % lines down
\multiput(0,-5)(0,10){4}{\line(0,1){3}} % lines up
\multiput(2,1)(0,10){4}{\line(2,1){6}} % lines rightup
\multiput(2,-1)(0,10){4}{\line(2,-1){6}} % lines rightdown
}
\multiput(0,0)(20,0){3}{\usebox{\verticalfourdiag}}
\multiput(10,5)(20,0){3}{\usebox{\verticalfourdiag}}
\end{picture}
}\\[0.1in]
\begin{tabular}{cc}
\psfig{figure=isingfigs/triangle.ps,angle=-90,width=0.7in} &
\psfig{figure=isingfigs/triangle3.ps,angle=-90,width=0.7in}
\\
(a) & (b) \\
\end{tabular}
\end{center}
%}{%
\caption[a]{In an antiferromagnetic
triangular Ising model, any three neighbouring
spins are frustrated. Of the eight possible configurations of three
spins, six
have energy $-|J|$ (a), and two have energy $3|J|$ (b).
}
\label{fig.frustration}
}%
%\end{figure}
%
%
%
% There are various ways to implement
% a periodic triangular grid. I did it as shown in the margin.
% \input{tex/isingshearfig.tex}
% includes some cut graphs for 25 also
Sample states are shown in \figref{fig.ising.stateshex1}, and
\figref{fig.ising.H4096} shows the energy, fluctuations,
and heat capacity for $N=4096$.
Note how different the results for $J = \pm 1$ are. There is
no peak at all in the standard deviation of the energy in the case
$J = - 1$.
This indicates that the antiferromagnetic system does not have a phase
transition to a state with long-range order.
\begin{figure}[hbtp]
\figuremarginb{\small%
\begin{raggedright}
\noindent
\begin{tabular}{ll} \multicolumn{1}{c}{$ J=+1$} & \multicolumn{1}{c}{$J=-1$} \\
\makebox[-0.15in]{(a)}\psfig{figure=isingfigs/HE64.1.ps,angle=-90,width=2.7in} &
\makebox[-0.15in]{(d)}\psfig{figure=isingfigs/HE64.-1.ps,angle=-90,width=2.7in}\\
%
\makebox[-0.15in]{(b)}\psfig{figure=isingfigs/HSE64.1.ps,angle=-90,width=2.7in} &
\makebox[-0.15in]{(e)}\psfig{figure=isingfigs/HSE64.-1.ps,angle=-90,width=2.7in}\\
%
\makebox[-0.15in]{(c)}\psfig{figure=isingfigs/HC64.1.ps,angle=-90,width=2.7in} &
\makebox[-0.15in]{(f)}\psfig{figure=isingfigs/HC64.-1.ps,angle=-90,width=2.7in} \\
\end{tabular}
\end{raggedright}
}{%
\caption[a]{Monte Carlo simulations of triangular Ising models with $J=\pm 1$ and $N=4096$.
(a--c) $J=1$. (d--f) $J=-1$.
(a, d) Mean energy and fluctuations in energy as a function of temperature.
(b, e) Fluctuations in energy (standard deviation).
%
% change to variance?
%
(c, f) Heat capacity.
}
\label{fig.ising.H4096}
}%
\end{figure}
\begin{figure}
\figuremarginb{\small%
\begin{center}
\mbox{
\begin{tabular}[t]{cl} $T$ & $J=+1$ \\
20 & \Hisingsample{hexagon0.05} \\
6 & \Hisingsample{hexagon0.16} \\
4 & \Hisingsample{hexagon0.25} \\
3 & \Hisingsample{hexagon0.3} \\
2 & \Hisingsample{hexagon0.5} \\
\end{tabular}
\begin{tabular}[t]{cl} $T$ & $J=-1$ \\
50 & \hisingsample{hexagon0.02} \\
5 & \hisingsample{hexagon0.2} \\
2 & \hisingsample{hexagon0.5} \\
0.5 & \hisingsample{hexagon2} \\
\end{tabular}
}
\end{center}
}{%
\caption[a]{Sample states of triangular Ising models with $J=1$ and $J=-1$.
High temperatures at the top; low at the bottom.
}
\label{fig.ising.stateshex1}% not referred to?
}%
\end{figure}
\section{Direct computation of partition function of Ising models}
\label{sec.ising.matrix}
We now examine a completely different approach to Ising models.
The {\dbf\ind{transfer matrix method}}
is an exact and abstract approach that obtains
physical properties of the model from the \ind{partition function}
\beq
Z(\beta,\bJ,\bb) \equiv \sum_{\bx} \exp \!
\left[ - \beta E( \bx ; \bJ , \bb ) \right] ,
\eeq
where the summation is over all states $\bx$, and the inverse
temperature is $\beta = 1/T$. [{As usual, we set $\kB = 1$.}]
The \ind{free energy} is given by $F = -
\frac{1}{\beta} \ln Z$. The number of states is $2^N$, so direct
computation of the partition function is not possible for large
$N$. To avoid enumerating all global states explicitly, we can use a
trick similar to the \ind{sum--product
% probability propagation
algorithm} discussed in \chapterref{ch.exact}.\index{message passing}
We concentrate on models that have the form of a
long thin strip of width $W$ with periodic boundary conditions in both
directions, and we iterate along the
length of our model, working out a set of
{\dem\ind{partial partition functions}\/}\index{partition function}
% \index{partition function!partial}
at one location $l$ in terms of partial partition functions at the
previous location $l-1$. Each iteration involves a summation
over all the states at the boundary. This operation is exponential
in the width of the strip, $W$. The final clever trick is to note that
if the system is \ind{translation-invariant} along its length
then we only need to do {\em one\/} iteration in order to find the properties
of a system of {\em any\/} length.
The computational task becomes the evaluation of an $S \times S$ matrix,
where $S$ is the number of microstates that need to be considered at the
boundary, and the computation of its eigenvalues. The eigenvalue of largest
magnitude determines the partition function of a very long strip, and hence the free energy per spin of an infinite-length thin strip.
Here is a more detailed explanation. Label the states of the $C$ columns of the thin
strip $s_1, s_2, \ldots, s_C$, with each $s$ an integer from 0 to $2^{W}\!-\!1$.
The $r$th bit of $s_c$ indicates whether the spin in row $r$, column $c$
is up or down.
The \ind{partition function} is
\newcommand{\lE}{{\cal E}}
\beqan
Z& =& \sum_{\bx} \exp ( -\b E(\bx) ) \\
& = & \sum_{s_1}\sum_{s_2}\cdots \sum_{s_C} \exp \! \left(
-\b \sum_{c=1}^{C} \lE(s_{c},s_{c+1}) \right) ,
\label{eq.Z.sums}
\eeqan
where $\lE(s_{c},s_{c+1})$ is an appropriately defined energy, and, if
we want periodic boundary conditions, $s_{C+1}$ is defined to be $s_{1}$.
One definition for $\lE$ is:
\marginfig{
\begin{center}{\epsfbox{metapost/ising.1}}\end{center}
\caption[a]{Illustration to help explain the definition (\ref{eq.mydefn.ising}).
$\lE(s_{2},s_{3})$ counts all the contributions to the
energy in the rectangle.
The total energy is given by stepping the rectangle along.
Each horizontal bond inside the rectangle is counted once;
each vertical bond is half-inside the rectangle (and will be
half-inside an adjacent rectangle) so half its energy is
included in $\lE(s_{2},s_{3})$; the factor of $1/4$ appears
in the second term
% ${\textstyle\frac{1}{4}}\!\!\!\!\!\sum_{\begin{array}{c}\scriptstyle (m,n) \in {\cal N}:\\ \scriptstyle m \in c, n \in c \end{array}\!\!\!\!} \!\!\!\!\! J \, x_m x_n$
because $m$ and $n$ both run over all nodes
in column $c$, so each bond is visited twice.
%\indent MANUAL INDENT
\hspace{1.5em}For the state shown here, $s_2 = (100)_2$, $s_3 = (110)_2$,
 the horizontal bonds contribute
 $-J$ to $\lE(s_{2},s_{3})$, and the vertical bonds
 contribute $+J/2$ on the left and $+J/2$ on the right,
 assuming periodic boundary conditions between top and bottom.
 So $\lE(s_{2},s_{3}) = 0$.
}
}
\beq
	\lE(s_{c},s_{c+1}) = - \left[ \:
	\!\!\!\sum_{\begin{array}{c}\scriptstyle (m,n) \in {\cal N}:\\ \scriptstyle m\in c, n \in c+1 \end{array}\!\!\!\!}
	\!\!\!\!\! J \, x_m x_n
+ {\textstyle\frac{1}{4}}\!\!\!\!\!\sum_{\begin{array}{c}\scriptstyle (m,n) \in {\cal N}:\\ \scriptstyle m \in c, n \in c \end{array}\!\!\!\!} \!\!\!\!\! J \, x_m x_n
+ {\textstyle\frac{1}{4}}\!\!\!\!\!\!\!\!\!\sum_{\begin{array}{c}\scriptstyle (m,n) \in {\cal N}:\\ \scriptstyle m\in c+1, n \in c+1 \end{array}\!\!\!\!}\!\!\!\! \!\!\!\!\! J \, x_m x_n \right] .
\label{eq.mydefn.ising}
\eeq
This definition of the energy has the nice property that (for the rectangular Ising model)
it defines a matrix that is symmetric in
its two indices $s_{c},s_{c+1}$. The factors of $1/4$ are needed because
vertical links are counted four times. Let us define
\beq
M_{s s'} = \exp \! \left( -\b \lE(s,s') \right) .
\eeq
Then continuing from equation (\ref{eq.Z.sums}),
\beqan
Z& = & \sum_{s_1}\sum_{s_2}\cdots \sum_{s_C}
\left[ \prod_{c=1}^{C} M_{s_{c},s_{c+1}} \right] \\
& = & \Trace \left[ \bM^C \right] \\
& = & \sum_a \mu_a^C ,
\label{eq.Z.prods}
\eeqan
where $\{ \mu_a \}_{a=1}^{2^W}$ are the eigenvalues of $\bM$.
As the length of the strip $C$ increases, $Z$ becomes dominated by the
largest eigenvalue $\mu_{\max}$:
\beq
Z \rightarrow \mu_{\max}^C .
\eeq
So the \ind{free energy} per spin in the limit of an infinite thin strip is
given by:
\beq
f = - kT \ln Z / (WC) = - kT C \ln \mu_{\max} / (WC )
= - kT \ln \mu_{\max} / W .
\eeq
It's really neat that {\em all\/}
the thermodynamic properties of a long
thin strip can be obtained from just the largest \ind{eigenvalue} of this \ind{matrix}
$\bM$!
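The identity $Z = \Trace [ \bM^C ]$ can be checked against exhaustive
enumeration for a tiny strip. The Python sketch below is an illustration
only, with hypothetical function names. It assumes the sign convention of
the energy function used throughout this chapter, in which an aligned pair
of neighbours contributes $-J$, and it assigns to each pair of adjacent
columns the horizontal bonds in full and half of each column's vertical
bonds.

\begin{verbatim}
```python
import math
from itertools import product

def column_spins(s, W):
    """Decode the integer s into W spins; bit r set means the spin in row r is up."""
    return [1 if (s >> r) & 1 else -1 for r in range(W)]

def strip_energy(s, t, W, J):
    """Energy assigned to adjacent columns s and t (periodic in the vertical direction)."""
    a, b = column_spins(s, W), column_spins(t, W)
    horizontal = sum(a[r] * b[r] for r in range(W))
    vertical = sum(a[r] * a[(r + 1) % W] + b[r] * b[(r + 1) % W] for r in range(W))
    return -J * (horizontal + 0.5 * vertical)

def transfer_matrix(W, J, beta):
    S = 2 ** W
    return [[math.exp(-beta * strip_energy(s, t, W, J)) for t in range(S)]
            for s in range(S)]

def trace_of_power(M, C):
    """Tr[M^C] by repeated matrix multiplication."""
    S = len(M)
    P = [[float(i == j) for j in range(S)] for i in range(S)]
    for _ in range(C):
        P = [[sum(P[i][k] * M[k][j] for k in range(S)) for j in range(S)]
             for i in range(S)]
    return sum(P[i][i] for i in range(S))

def brute_force_Z(W, C, J, beta):
    """Sum exp(-beta E) over all 2^(W*C) states of the periodic W-by-C grid."""
    Z = 0.0
    for spins in product((-1, 1), repeat=W * C):
        E = 0.0
        for r in range(W):
            for c in range(C):
                s = spins[r * C + c]
                E -= J * s * (spins[r * C + (c + 1) % C] + spins[((r + 1) % W) * C + c])
        Z += math.exp(-beta * E)
    return Z

W, C, J, beta = 3, 3, 1.0, 0.4
print(trace_of_power(transfer_matrix(W, J, beta), C), brute_force_Z(W, C, J, beta))
```
\end{verbatim}

The two printed numbers should agree up to rounding error; the transfer
matrix needs only a $2^W \times 2^W$ matrix, whereas the enumeration visits
$2^{WC}$ states.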
% From the partition function we can obtain interesting thermodynamic properties
% using the following relations (which you should confirm):\footnote{Here
% I have been careless about $\kB$, since I use the convention
% throughout the numerics of this paper that $\kB = 1$.}
% \beqan
% E &=& - \partial \ln Z / \partial \beta
% \\
% F &=& - \frac{1}{\b} \ln Z
% \\
% F &=& E - TS
% \\
% \Rightarrow
% S &=& \ln Z + \b \partial \ln Z / \partial \beta
% \\
% C &=& \partial E / \partial T \\
% &=& k_{\rm B} \b^2 \partial^2 \ln Z / \partial \beta^2 \\
% &=& \frac{ \partial^2 \ln Z / \partial \beta^2 }{k_{\rm B} T^2 }\\
% {\rm var}(E) & =& \partial^2 \ln Z / \partial \beta^2
% %
% \eeqan
\subsection{Computations}
% I wrote a {\tt C} program that computes
I computed the \ind{partition function}s of {\dem\index{long thin strip}{long-thin-strip}}
Ising models with the geometries shown in \figref{fig.thinstrips}.
\begin{figure}[htbp]
\figuredangle{
\begin{center}\small
\begin{tabular}{cc}
Rectangular:
&
Triangular:\\
%
\setlength{\unitlength}{1.7pt}% was 2pt
\begin{picture}(135,40)(-10,-5)
\newsavebox{\vfour} % again!
\savebox{\vfour}(0,0)[bl]{
\multiput(0,0)(0,10){4}{\circle{2}} % spins
\multiput(0,5)(0,10){4}{\line(0,-1){3}} % lines down
\multiput(0,-5)(0,10){4}{\line(0,1){3}} % lines up
\multiput(2,0)(0,10){4}{\line(1,0){6}} % lines right
}
\multiput(0,0)(10,0){12}{\usebox{\vfour}}
\put(-14,17.75){\makebox{$W$}}
\put(-10,25){\vector(0,1){10}}
\put(-10,15){\vector(0,-1){10}}
\end{picture}\hspace{0.42in}
&
% smallest length that works seems to be 1.7pt
\setlength{\unitlength}{1.7pt}\input{tex/isingstrip.tex}\\
\end{tabular}
\end{center}
}{
\caption[a]{Two long-thin-strip Ising models. A line between two spins
indicates that they are neighbours. The strips have width $W$ and infinite
length. }
\label{fig.thinstrips}
}
\end{figure}
As in the last section, I
set the applied field $H$ to zero
 and considered the two cases $J = \pm 1$, which correspond to a ferromagnet
 and an antiferromagnet respectively.
I computed the free energy per spin, $f(\beta,J,H) = F / N$
for widths from $W = 2$ to 8 as a function of $\beta$ for
$H=0$.
\subsubsection{Computational ideas:}
Only the largest eigenvalue is needed. There are several ways of getting
this quantity, for example, iterative multiplication of the matrix by an initial vector.
Because the matrix is all positive we know that the principal
eigenvector is all positive too (\ind{Frobenius--Perron theorem}), so a
reasonable initial vector is $(1,1,\ldots,1)$. This iterative
procedure may be faster than explicit computation of all eigenvalues.
I computed them all anyway, which has the advantage that
we can find the free energy of finite length strips -- using
\eqref{eq.Z.prods} --
as well as infinite ones.
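 The iterative multiplication mentioned above is the power method.
 As a sketch (the symbols $\mathbf{v}^{(k)}$ and $\hat{\mu}^{(k)}$ are
 introduced here for illustration only), start from
 $\mathbf{v}^{(0)} = (1,1,\ldots,1)/2^W$, whose components sum to one, and repeat
\beq
 \hat{\mu}^{(k+1)} = \sum_{s} \left[ \bM \mathbf{v}^{(k)} \right]_s ,
 \hspace{0.3in}
 \mathbf{v}^{(k+1)} = \bM \mathbf{v}^{(k)} / \hat{\mu}^{(k+1)} .
\eeq
 The normalization keeps the components of $\mathbf{v}^{(k)}$ summing to one,
 and $\hat{\mu}^{(k)} \rightarrow \mu_{\max}$ as $k$ increases. The error
 shrinks geometrically, at worst by a factor of roughly $| \mu_2 / \mu_{\max} |$
 per iteration, where $\mu_2$ denotes the eigenvalue of second-largest magnitude.
 Each iteration costs one matrix--vector product, of order $2^{2W}$ operations.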
\begin{figure}[tbh]
\figuremargin{\small%
\begin{center}
\mbox{\psfig{figure=ising/ferr8.ps,width=2.7in,angle=-90}\hspace{-0.2in}
\psfig{figure=ising/anti8.ps,width=2.7in,angle=-90}}
\end{center}
}{%
\caption[a]{Free energy per spin of long-thin-strip Ising models.
Note the non-zero gradient at $T=0$ in the case of
the triangular antiferromagnet.
}
%\label{fig1}
\label{fig.lts1}
}%
\end{figure}
\begin{figure}%[tbh]
\figuremargin{\small%
\begin{center}\mbox{%
%\psfig{figure=ising/S.4.ps,width=3in,angle=-90}
\psfig{figure=ising/S.8.ps,width=2.773in,angle=-90}
}\end{center}
}{%
\caption[a]{Entropies (in nats) of
% (a) width 4; (b)
width 8 Ising systems as a function of temperature,
obtained by differentiating the free energy curves
in \protect\figref{fig.lts1}. The rectangular ferromagnet and
antiferromagnet have identical thermal properties.
For the triangular systems, the upper curve $(-)$ denotes the
antiferromagnet and the lower curve $(+)$ the ferromagnet.
}
\label{fig.ising.S}
}%
\end{figure}
\subsection{Comments on graphs:}
For large temperatures all Ising models should show the same
behaviour: the \ind{free energy} is entropy-dominated, and the entropy per
spin is $\ln(2)$. The mean energy per spin goes to zero.
The free energy per spin should tend to
$-\!\ln(2)/\beta$. The free energies are shown in \figref{fig.lts1}.
One of the interesting properties we can obtain from the free energy
is the degeneracy of the ground state.
As the temperature goes to zero, the Boltzmann
% Gibbs
distribution becomes
concentrated in the ground state. If the ground state is degenerate (\ie,
there are multiple ground states with identical energy)
then the entropy as $T \to 0$ is non-zero. We can
find the entropy from the free energy using $S = - \partial F/ \partial T$.
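 As a worked step (using the convention $k=1$, so that $\beta = 1/T$, and writing
 $g$ for the number of ground states and $E_0$ for their energy, symbols
 introduced here for illustration), the relation $F = - T \ln Z$ gives
\beq
 S = - \frac{\partial F}{\partial T}
 = \ln Z + T \frac{\partial \ln Z}{\partial T}
 = \ln Z + \beta \left< E \right> ,
\eeq
 where the last step uses $T \, \partial / \partial T = - \beta \, \partial / \partial \beta$
 and $\left< E \right> = - \partial \ln Z / \partial \beta$.
 At low temperature $Z \simeq g \, e^{-\beta E_0}$ and $\left< E \right> \simeq E_0$,
 so $S \rightarrow \ln g$. The entropy per spin at $T \to 0$ is therefore non-zero
 only if $\ln g$ grows in proportion to the number of spins.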
% When $J=1$ and $b=0$, a rectangular ferromagnet has an almost unique
% ground state (degeneracy 2)
% with energy per spin $-2.0$ (four bonds, each shared between
% two spins).
%
% If $W$ is even then the antiferromagnet is equivalent to the
% ferromagnet, under the checkerboard transformation, as we already
% said. But if $W$ is odd then the antiferromagnet is frustrated in
% the width direction; this affects both the energy per spin, which is
% not so negative; and also, in principle, the entropy per spin,
% because the ground state of the frustrated system may be
% significantly degenerate, with a non-zero entropy per spin. In the
% case $W = 3$ the free energy per spin is $-4/3$ instead of $-2$. The
% ground state only has degeneracy 2. For the rectangular geometry I
% think that for any $W$ the ground state has finite degeneracy. As $W$
% increases this effect becomes negligible.
%
% The degeneracy of the antiferromagnetic
% triangular system at low temperature is
% different. The ground state is extensively
% degenerate, at least for all even values of $W$.
% (It is instructive to create ground states on the back of an
% envelope.)
% % \footnote{By constructing ground states for $W=3$
% % on the back of an
% %
% % : f: -1.0807 lz: +2.3283 T +0.46416 log(beta) +0.76753:
% % : f: -1.0482 lz: +3.7671 T +0.27826 log(beta) +1.2792:
% % : f: -1.0289 lz: +6.1681 T +0.16681 log(beta) +1.7909:
% % : f: -1.0173 lz: +10.1733 T +0.1 log(beta) +2.3026:
% %
% % using T = 0.464, obtain S = .0807 / 0.464 = 0.17
% %
% % or just using S = - dF/dT = .0173/0.1 = 0.173
% %
% % envelope, I anticipated that the entropy per spin at low temperature
% % might be about $\ln(2)/3 \simeq 0.23$, because roughly every third spin
% % seems undetermined by its neighbours.}
% %
% % here are some states with energy per spin -1: straight parallel lines
% % chevronny arallel lines. Any pattern starting from a honeycomb
% % of mainly + and pockets of -
% % any pattern lie chevrons but with side branches. Mazes that have
% % dead ends and walls and roundabouts. Hexagons inside hexagons.
% %
% The zero-temperature degeneracy is
% nicely revealed by a plot of the free energy versus temperature which
% has gradient at any $T$ equal to minus the entropy. If the ground state
% is unique the gradient is zero; for a triangular antiferromagnet the
% gradient is non-zero. See figure \ref{fig1} for an illustration with
% width
% % $W=4$. This graph has gradient corresponding to a zero temperature
% % entropy of $0.17$ per spin. With
% $W=8$. I found an entropy of 0.088 per spin from the gradient
% at zero temperature.
% I have not figured out whether the ground state entropy per spin is non-zero
% vanishes as $W$ increases.
% % --maybe it
% % goes as $\sqrt{N}$ rather than as $N$.
%
% according to students it says 0.3 in a textbook. also, this seems
% reasonable from a counting argument. Can show that 1/3 of spins are free, at least, giving 0.23.
 The entropy per spin of the triangular antiferromagnet at absolute zero
 appears to be about 0.3 nats, that is, about half its high temperature value (\figref{fig.ising.S}).
%
The mean energy as a function of temperature is plotted in figure \ref{fig.lts2}.
It is evaluated using the identity $\left< E \right> = - \partial \ln Z /
\partial \beta$.
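 For the infinite strip this identity can be applied directly to the
 largest eigenvalue. As a sketch (the symbol $\bar{\epsilon}$ for the mean
 energy per spin is introduced here for illustration), substituting
 $\ln Z \simeq C \ln \mu_{\max}$ and dividing by the number of spins $WC$ gives
\beq
 \bar{\epsilon} = \frac{ \left< E \right> }{ WC }
 = - \frac{1}{W} \frac{ \partial \ln \mu_{\max} }{ \partial \beta }
 = \frac{ \partial ( \beta f ) }{ \partial \beta } ,
\eeq
 where $f = - \ln \mu_{\max} / ( \beta W )$ is the free energy per spin with $k=1$.
 So the mean energy per spin follows from a single derivative of $\beta f$ with respect to $\beta$.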
\begin{figure}%[tbh]
\figuremargin{\small%
\begin{center}\mbox{%
\psfig{figure=ising/ebar.8.ps,width=2.6in,angle=-90}
% see ~/_courses/comput/newising
}\end{center}
}{%
\caption[a]{Mean energy versus temperature of
 long-thin-strip Ising models with width 8.
Compare with \figref{fig.ising.16}.
}
%\label{fig2}
\label{fig.lts2}
}%
\end{figure}
\begin{figure}
\figuremargin{\small%
\begin{center}\mbox{%
\psfig{figure=ising/ferr.R.4.8.C.ps,width=2.5in,angle=-90}\hspace{-0.25in}
% does this need changing to C2.ps ?
\psfig{figure=ising/anti.H.4.8.C.ps,width=2.5in,angle=-90}
}\end{center}
}{%
\caption[a]{Heat capacities of (a) rectangular model;
(b) triangular models with different
widths, (+) and $(-)$ denoting ferromagnet and
antiferromagnet. Compare with figure \ref{fig.ising.H4096}.
}
\label{fig.ising3}
}%
\end{figure}
\begin{figure}
\figuremargin{\small%
\begin{center}\mbox{%
\psfig{figure=ising/ferr.R.4.8.vE.ps,width=2.5in,angle=-90}\hspace{-0.25in}
% does this need changing to vE2.ps ?
\psfig{figure=ising/anti.H.4.8.vE.ps,width=2.5in,angle=-90}
}\end{center}
}{%
\caption[a]{Energy variances, per spin, of (a) rectangular model;
(b) triangular models with different
widths, (+) and $(-)$ denoting ferromagnet and
antiferromagnet. Compare with figure \ref{fig.ising.H4096}.
}
\label{fig.ising4}
}%
\end{figure}
Figure \ref{fig.ising3} shows the estimated heat capacity (taking raw
derivatives of the mean energy) as a function of temperature
for the triangular models with widths 4 and 8.
Figure \ref{fig.ising4} shows the fluctuations in energy
as a function of temperature. All of these figures should show
smooth graphs; the roughness of the curves is due
to inaccurate numerics.
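 The heat capacity and the energy fluctuations are related, which
 offers a consistency check on such numerics. As a worked step with $k=1$
 (the symbol $c_V$ for the heat capacity is an assumption of this sketch, chosen
 to avoid a clash with the strip length $C$):
\beq
 c_V = \frac{ \partial \left< E \right> }{ \partial T }
 = - \beta^2 \frac{ \partial \left< E \right> }{ \partial \beta }
 = \beta^2 \frac{ \partial^2 \ln Z }{ \partial \beta^2 }
 = \beta^2 \, {\rm var}(E) .
\eeq
 So a heat capacity curve should equal the corresponding energy-variance
 curve divided by $T^2$. Second derivatives taken by finite differences
 amplify small errors in $\ln \mu_{\max}$, which may contribute to the roughness.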
% eigenvalue evaluation.
% It is apparent
% that the peak in the heat capacity is getting sharper as the width
% increases, especially in the ferromagnetic case.
 The nature of any phase transition is not obvious, but the graphs
 seem compatible with the assertion that the ferromagnet shows
 a phase transition and the antiferromagnet does not.
The pictures of the free energy in \figref{fig.lts1} give some insight
into how we could predict the transition temperature. We can
see how the two phases of the ferromagnetic
systems each have simple free energies:
a straight sloping line through $F=0$, $T=0$ for the high temperature
phase, and a horizontal line for the low temperature phase. (The slope
of each line shows what the entropy per spin of that
phase is.) The phase transition occurs roughly at the intersection
of these lines. So we predict the transition temperature
to be linearly related to the ground state energy.
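 As a rough worked example of this prediction (with $k=1$, and writing
 $\epsilon_0$ for the ground state energy per spin and $T^{*}$ for the predicted
 temperature, both symbols introduced here for illustration), the high temperature
 line is $f \simeq - T \ln 2$ and the low temperature line is $f \simeq \epsilon_0$,
 so they intersect at
\beq
 T^{*} \simeq \frac{ - \epsilon_0 }{ \ln 2 } .
\eeq
 For a rectangular ferromagnet with $J=1$ in which every spin has four
 neighbours, $\epsilon_0 = -2$, giving $T^{*} \simeq 2.9$.
 This is only a crude estimate, since it assumes that the low temperature phase
 has zero entropy and the high temperature phase has zero mean energy
 all the way to the transition; for comparison, the exact critical temperature of the
 infinite square-lattice Ising model is $2 / \ln ( 1 + \sqrt{2} ) \simeq 2.27$.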
%
\subsection{Comparison with the Monte Carlo results}
The agreement between the results of the two experiments
seems very good. The two systems simulated (the long thin strip and
the periodic square) are not quite identical.
 One could make a more accurate comparison by finding all eigenvalues
 for the strip of width $W$
 and computing $\sum_a \mu_a^W$ to get the partition function
 of a $W \times W$ patch.
% \subsubsection*{Further properties that can be extracted}
% A wonderful result derived in Yeomans
% % \cite{yeomans92}
% Yeomans (1992) is that the inverse correlation
% length can be obtained from the first two eigenvalues:
% \beq
% \xi^{-1} = - \ln \left( \l_1/\l_0 \right) .
% \eeq
% %
% % p.103 critical temp is J/kT_c = 0.22165 for 3d ising model
% %
\section{Exercises}% Problems}
\exercisxB{4}{ex.mcS}{% (Open question)
What would be the best way to extract the entropy from
the Monte Carlo simulations?
What would be the best way to obtain the entropy and the
heat capacity from the partition function computation?
}
\ExercisxA{3}{ex.isingmemories}{
An Ising model may be generalized to have a coupling $J_{mn}$
between any spins $m$ and $n$, and the value of $J_{mn}$
could be different for each $m$ and $n$.
%
In the special case where all the couplings are positive we know that
the system has two ground states, the all-up and all-down states.
%
For a more general setting of $J_{mn}$ it is conceivable that there
could be {\em many\/} ground states.
Imagine that it is required to make a spin system whose local minima
% lowest energy states
are a given list of states $\bx_{(1)}, \bx_{(2)}, \ldots, \bx_{(S)}$.
%
Can you think of a way of setting $\bJ$ such that the chosen
states are low energy states? You are allowed
to adjust all the $\{ J_{mn} \}$ to whatever values you wish.
}
\dvips
% \subchapter
% \section{Solutions}% to Chapter \protect\ref{ch.ising}'s exercises} %
% \input{tex/_s9.tex}
\dvipsb{solutions ising}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\prechapter{About Chapter}
%\input{tex/_pmc.tex}
\chapter{Exact Monte Carlo Sampling \nonexaminable}
\label{ch.mcexact}
\section{The problem with Monte Carlo methods}
For high-dimensional problems, the
most widely used random sampling methods
are Markov chain Monte Carlo methods
like the Metropolis method, Gibbs sampling, and
slice sampling.
The problem with all these methods is this:
yes, a given algorithm can be guaranteed to
produce samples from the target density $P(\bx)$
asymptotically,
`once the chain has converged to the equilibrium
distribution'.
But if one runs the chain for
too short a time $T$, then the samples will come
from some other distribution $P^{(T)}(\bx)$.
For how long must the Markov chain
be run before it has `converged'?
As was mentioned in \chapterref{ch.mc},
this question is usually very hard to answer.
%
%
However, the pioneering
work of \citeasnoun{Propp1996}\index{Propp, Jim G.}\index{Wilson, David B.} allows
one, for certain chains,
to answer this very question;
% is of great importance
% for those who want to know for how long to run their
furthermore Propp and Wilson show how to
% Markov chain Monte Carlo simulation to get a
obtain `exact' samples
from the target density.
\section{Exact sampling concepts}
Propp and Wilson's {\dem{exact sampling method}} (also
known as `\ind{perfect simulation}'\index{algorithm!exact sampling}\index{algorithm!perfect simulation}
or `\ind{coupling from the past}')\index{exact sampling}\index{Monte Carlo methods!exact sampling}\index{Monte Carlo methods!perfect simulation}
depends on three ideas.
\subsection{Coalescence of coupled Markov chains}
First,\index{coalescence}
% the idea that
if several Markov chains
starting from different initial conditions
share a single random-number generator, then their
trajectories in state space may
{\dem\index{Monte Carlo methods!coalescence}\index{coalescence}{coalesce}};
 and, having coalesced, will not separate
again. If {\em all\/} initial conditions lead to trajectories that
coalesce into a single trajectory, then we can be sure that
the Markov chain has `forgotten' its initial condition.
\Figref{fig.mcexact.1}\mbox{a{\small{-i}}} shows twenty-one Markov chains
identical to the one described in section \ref{sec.metropolis},
which samples from $\{ 0,1,\ldots,20\}$ using the
Metropolis algorithm
(\figref{fig.metrop}, \pref{fig.metrop}); each of the
chains has a different
initial condition but they are all driven by a single random number generator;
the chains coalesce after about 80 steps.
\Figref{fig.mcexact.1}\mbox{a{\small{-ii}}} shows the same Markov chains
with a different random number seed; in this case, coalescence
does not occur until 400 steps have elapsed (not shown).
\Figref{fig.mcexact.1}b shows similar Markov chains, each
of which has identical proposal density to those in
section \ref{sec.metropolis} and \figref{fig.mcexact.1}a;
% the difference between figures \ref{fig.mcexact.1}a and b
% is
but in \figref{fig.mcexact.1}b, the proposed move at each step,
`left' or `right', is obtained in the same way by all the chains
at any timestep, independent of the current state.
This coupling of the chains changes the statistics of coalescence.
Because two neighbouring paths only merge when a rejection occurs,
and rejections only occur at the walls (for this particular
Markov chain), coalescence will occur only when the chains
are all in the leftmost state or all in the rightmost state.
\newcommand{\exactforw}[1]{\hspace*{-7mm}\psfig{figure=metrop/exact/run#1,height=7.5in,width=1.2in}}
\newcommand{\exactback}[1]{\hspace*{-7mm}\psfig{figure=metrop/exact/back#1,height=7.5in,width=1.1in}\hspace*{-3mm}}
\begin{figure}
\figuremargin{
\footnotesize
\begin{tabular}{cccccccc}
%\exactforw{3/x.vn.1.ps}&
\exactforw{4/x.vn.1.ps}&
\exactforw{2/x.vn.1.ps}&
&
\exactforw{2/x.v.1.ps}&
%\exactforw{3/x.v.1.ps}&
\exactforw{4/x.v.1.ps}&
\\
% ``t'' means ternary moves are made. ``v'' means vanilla simple
% dependence on random number generator.
% ``n'' means ``not locked to other states''
% the non-vanilla runs are a stupid idea it turns out.
%\psfig{figure=metrop/exact/run2/x.vn.1.ps,angle=-90,height=5in}
%&
%\psfig{figure=metrop/exact/run2/x.tvn.1.ps,angle=-90,height=5in}
%&
%\psfig{figure=metrop/exact/run2/x.v.1.ps,angle=-90,height=5in}
%&
%\psfig{figure=metrop/exact/run2/x.tv.1.ps,angle=-90,height=5in}
%&
%\psfig{figure=metrop/exact/run2/x.1.ps,angle=-90,height=5in}
%&
%\psfig{figure=metrop/exact/run2/x.t.1.ps,angle=-90,height=5in}
%\\
%(a) &(b) &(c) &(d)& (e) & (f) \\
{\footnotesize{(i)}} &
{\footnotesize{(ii)}} &
&
{\footnotesize{(i)}} &
{\footnotesize{(ii)}} \\
\multicolumn{2}{c}{\footnotesize(a)} & \hspace{0.3in} &
\multicolumn{2}{c}{\footnotesize(b)} \\
\end{tabular}
}{
\caption[a]{Coalescence, the first idea behind the
exact sampling method.
Time runs from bottom to top.
In the leftmost panel, coalescence occurred
within 100 steps.
Different coalescence properties are obtained
depending on the way each state uses the random numbers
it is supplied with.
% In the first and third panels shown, coalescence has occurred
% within 250 steps
(a) Two runs of
% examples of coalescence for
a Metropolis simulator in which the random bits that determine
the proposed step
depend on the current state; a different random number seed
was used in each case.
(b) In this simulator the random proposal (`left' or `right') is the same
for all states.
In each panel, one of the paths, the one starting at location $x=8$,
has been highlighted.
}
\label{fig.mcexact.1}
}% end fig
\end{figure}
%\begin{figure}
%\figuremargin{
%\footnotesize
%\begin{tabular}{cccccccc}
%\exactforw{2/x.vn.L.ps}&
%\exactforw{3/x.vn.L.ps}&
%\exactforw{4/x.vn.L.ps}&
%\hspace{0.2in}&
%\exactforw{2/x.v.L.ps}&
%\exactforw{3/x.v.L.ps}&
%\exactforw{4/x.v.L.ps}&
%\\
% & (a) & & & &(b) \\
%%(a) &(b) &(c) & (d) & (e) & (f) \\
%\end{tabular}
%}{
%\caption[a]{Longer time-histories of the coalescences.
%}
%\label{fig.mcexact.L}
%}% end fig
%\end{figure}
\subsection{Coupling from the past}
% or {Simulation from the past}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%5 next paragraph
How can we use the coalescence property to find an exact
sample from the equilibrium distribution of the chain?
The state of the system at the moment when complete coalescence
occurs is not a valid sample from the equilibrium distribution;
for example in \figref{fig.mcexact.1}b,
final coalescence always occurs when the state
is against one of the two walls, because trajectories only
merge at the walls. So sampling forward in time until coalescence
occurs is not a valid method.
The second key idea of exact sampling is that we can obtain exact samples
by sampling {\em from a time $T_0$ in the past, up to the present}.
If coalescence has occurred, the present sample is an unbiased
sample from the equilibrium distribution; if not, we restart
the simulation from a time $T_0$ further into the past, {\em reusing the
same random numbers}. The simulation is repeated at a sequence of ever
more distant times $T_0$, with a doubling of $T_0$ from
one run to the next being a convenient
% and near-optimal
choice. When coalescence occurs at a time before `the present',
we can record $x(0)$ as an {\dem exact sample\/} from the equilibrium
distribution of the Markov chain.
\Figref{fig.mcexact.b} shows two exact samples produced
in this way. In the leftmost panel of \figref{fig.mcexact.b}a,
we start twenty-one chains in all possible initial conditions
at $T_0 = -50$ and run them forward in time.
Coalescence does not occur. We restart the simulation
from all possible initial conditions
at $T_0 = -100$, and reset the random number generator
in such a way that the random numbers generated
at each time $t$ (in particular, from $t=-50$ to $t=0$)
will be identical to what they were in the first run. Notice that
the trajectories produced from $t=-50$ to $t=0$ by
these runs that started from $T_0 = -100$ are identical to
a {\em subset\/} of the trajectories in the first
simulation with $T_0=-50$.
Coalescence still does not occur, so we double $T_0$ again
to $T_0= -200$.
This time, all the trajectories coalesce and we obtain
an exact sample, shown by the arrow.
If we pick an earlier time such as $T_0=-500$, all the trajectories
must still end in the same point at $t=0$, since every trajectory
must pass through {\em{some}\/} state at $t=-200$, and {\em{all}\/} those
states lead to the same final point.
So if we ran the Markov chain for an infinite time in the
past,
from any initial condition, it would end in the same state.
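 This argument can be written compactly. As a sketch (the update function
 $\phi$ and the composite map $F$ are notation introduced here, not taken from
 elsewhere in the text), let the shared random numbers define a deterministic update
 $x_{t+1} = \phi( x_t , u_t )$, and write $F_{t_1}^{t_2}$ for the map that carries
 the state at time $t_1$ to the state at a later time $t_2$, so that
\beq
 F_{T_0}^{0}(x) = \phi ( \cdots \phi ( \phi ( x , u_{T_0} ) , u_{T_0+1} ) \cdots , u_{-1} ) .
\eeq
 Coalescence from $T_0$ means that $F_{T_0}^{0}(x) = x^{*}$ for every $x$.
 For any earlier starting time $T < T_0$, because the random numbers
 $u_{T_0}, \ldots, u_{-1}$ are reused,
\beq
 F_{T}^{0}(x) = F_{T_0}^{0} \! \left( F_{T}^{T_0}(x) \right) = x^{*}
 \:\:\: \mbox{for all $x$} ,
\eeq
 whatever the random numbers used before time $T_0$ turn out to be.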
\Figref{fig.mcexact.b}b shows an exact sample produced in
the same way with the Markov chains of
\figref{fig.mcexact.1}b.
This method, called {\dem{coupling from the past}},
is important because it allows us to obtain
exact samples from the equilibrium distribution; but,
as described here, it is of little practical use,
since we are obliged to simulate chains starting
in {\em all\/} initial states. In the examples shown,
there are only twenty-one states, but in any realistic
sampling problem there will be an utterly enormous number
of states -- think of the $2^{1000}$ states of a
system of 1000 binary spins, for example. The whole
point of introducing Monte Carlo methods was to try to avoid
having to visit all the states of such a system!
\begin{figure}
\fullwidthfigureright{
\footnotesize
\begin{tabular}{cccccccccc}
%\exactback{1/x.vn.20.L.ps}
%&
\exactback{1/x.vn.50.L.ps}
&
\exactback{1/x.vn.100.L.ps}
&
\exactback{1/x.vn.200.L.ps} % converges at 200
&
% \exactback{1/x.vn.500.L.ps} something wrong with this
&
\hspace{0.3in}
&
%\exactback{1/x.v.20.L.ps}
%&
\exactback{1/x.v.50.L.ps}
&
\exactback{1/x.v.100.L.ps}
&
\exactback{1/x.v.200.L.ps} % converges at 200
%&
%\exactback{1/x.v.500.L.ps}
\\
%{\footnotesize{(i)}} &
%{\footnotesize{(ii)}} &
%{\footnotesize{(iii)}} &
%{\footnotesize{(iv)}}
%{\footnotesize{$T_0=-20$}} &
{\footnotesize{$T_0=-50$}} &
{\footnotesize{$T_0=-100$}} &
{\footnotesize{$T_0=-200$}} &
%{\footnotesize{$T_0=-500$}}
&
&
{\footnotesize{$T_0=-50$}} &
{\footnotesize{$T_0=-100$}} &
{\footnotesize{$T_0=-200$}} &
%{\footnotesize{$T_0=-500$}}
\\
\multicolumn{4}{c}{\footnotesize(a)} & &
\multicolumn{3}{c}{\footnotesize(b)} \\
\end{tabular}
}{
\caption[a]{\mbox{`Coupling from the past', the second idea behind the
exact sampling method.}
}
\label{fig.mcexact.b}
}% end fig
\end{figure}
\begin{figure}
\fullwidthfigureright{
\footnotesize
\begin{center}
\begin{tabular}{ccccccccc}
%\exactback{1/x.ve.20.ps}
%&
\exactback{1/x.ve.50.ps}&
\exactback{1/x.ve.100.ps}&
\exactback{1/x.ve.200.ps} % converges at 200
%&
%\exactback{1/x.ve.500.ps}
&
\hspace{0.3in}
&
\exactback{2/x.ve.all.ps} & \hspace{0.3in}
&
\exactback{3/x.ve.all.ps}
\\
{\footnotesize{$T_0=-50$}} &
{\footnotesize{$T_0=-100$}} &
{\footnotesize{$T_0=-200$}} &
%{\footnotesize{$T_0=-500$}} &
&
{\footnotesize{$T_0=-50$}} & &
{\footnotesize{$T_0=-1000$}}
\\
\multicolumn{3}{c}{\footnotesize(a)}& &
{\footnotesize(b)}& &
{\footnotesize(c)}\\
\end{tabular}
\end{center}
}{
\caption[a]{(a) Ordering of states, the third idea behind the
exact sampling method. The trajectories shown here are
the left-most and right-most trajectories of
\protect\figref{fig.mcexact.b}b.
%
In order to establish what the state at time zero is,
we only need to run simulations from $T_0=-50$, $T_0=-100$, and $T_0=-200$, after which
point coalescence occurs.
 (b,c) Two more exact samples from the target density, generated by this method
 with different random number seeds.
The initial times required were
$T_0=-50$ and $T_0=-1000$, respectively.
}
\label{fig.mcexact.c}
}% end fig
\end{figure}
\subsection{Monotonicity}
Having established that we can obtain valid samples by simulating
forward from times in the past, starting in {\em all\/}
possible states at those times, the third trick of
Propp and Wilson, which makes the exact sampling method useful in practice,
is the idea that, for some Markov chains, it may be possible to
detect coalescence of all trajectories {\em without simulating
all those trajectories}. This property holds, for
example, in the chain of \figref{fig.mcexact.1}b,
which has the property that {\em two trajectories never cross}.
So if we simply track the two trajectories starting from the leftmost
and rightmost states, we will know that coalescence of
{\em all\/} trajectories has occurred when {\em those two\/}
trajectories coalesce.
\Figref{fig.mcexact.c}a illustrates this idea by
showing only the left-most and right-most trajectories
of \figref{fig.mcexact.b}b.
\Figref{fig.mcexact.c}(b,c) shows two more
exact samples from the same equilibrium distribution
generated by running the `coupling from the past' method
starting from the two end-states alone.
In (b), two runs coalesced starting from $T_0=-50$;
in (c), it was necessary to try times up to $T_0=-1000$ to achieve
coalescence.
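 In symbols (a sketch; $\phi(x,u)$ denotes the coupled one-step update driven
 by the shared random number $u$, and $x_{\min}$, $x_{\max}$ denote the leftmost
 and rightmost states, notation introduced here for illustration), the
 non-crossing property says that the update preserves order:
\beq
 x \leq y \:\: \Rightarrow \:\: \phi(x,u) \leq \phi(y,u)
 \:\:\: \mbox{for every $u$} .
\eeq
 Applying this at every time step, the trajectory $x(t)$ from any starting state
 remains sandwiched between the two extreme trajectories,
\beq
 x_{\min}(t) \leq x(t) \leq x_{\max}(t) ,
\eeq
 so $x_{\min}(0) = x_{\max}(0)$ forces every trajectory into the same state at time zero.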
% could reference the paper by Holmes here
% except I am not convinced it is genuinely useful.
% I put it in an exercise below.
\section{Exact sampling from interesting distributions}
In the toy problem we studied, the states could be put in a one-dimensional
order such that no two trajectories crossed. The states of
many interesting state spaces can also be put into
a {\dem\ind{partial order}\/} and coupled Markov chains can be found that
respect this partial order. [An example of a partial order
on the four possible states of two spins is this:
$(+,+) > (+,-) > (-,-)$;
and
$(+,+) > (-,+) > (-,-)$;
and the states $(+,-)$ and $(-,+)$ are not ordered.]
For such systems, we can show that coalescence has occurred merely by
verifying that coalescence has occurred for all the histories
whose initial states were `maximal' and `minimal' states of the
state space.
\marginalg{
\begin{framedalgorithm}
\begin{tabular}{@{}l}
{\sf Compute} $a_i := \sum_j J_{ij} x_j$\\
{\sf Draw} $u$ {\sf from} Uniform$(0,1)$ \\
{\sf If} $u<1/(1+e^{-2 a_i})$ \\
\ \ \ $x_i := +1$\\
{\sf Else} \\
\ \ \ $x_i := -1$\\
\end{tabular}
\end{framedalgorithm}
\caption[a]{Gibbs sampling coupling method.
The Markov chains
are coupled together by having all chains update the same spin $i$
at each time step and having
all chains share a common sequence of random numbers $u$.\medskip
}
\label{alg.coupling}
}
As an example, consider the\index{Monte Carlo methods!Gibbs sampling}
Gibbs sampling method
applied to
% a set of spins
a ferromagnetic Ising spin system, with the partial ordering of
states being defined thus: state $\bx$ is `greater than or equal to' state $\by$
if $x_i \geq y_i$ for all spins $i$. The maximal and minimal states
 are the all-up and all-down states.
The Markov chains are coupled together as shown in \algref{alg.coupling}.
% NOT by the number of up-spins in the state.
\citeasnoun{Propp1996} show that exact samples
can be generated for this system, although the time to find
exact samples is large if the Ising model is below its critical
temperature, since the Gibbs sampling method itself
 mixes slowly under these conditions.
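 It may help to check that the coupling of \algref{alg.coupling} respects
 the partial order when all couplings are non-negative. As a short worked step
 (writing $a_i(\bx)$ for the quantity $a_i$ computed from state $\bx$, a
 notation added here): if $x_j \geq y_j$ for all $j$ and $J_{ij} \geq 0$, then
\beq
 a_i(\bx) - a_i(\by) = \sum_j J_{ij} ( x_j - y_j ) \geq 0
 \:\: \Rightarrow \:\:
 \frac{1}{1+e^{-2 a_i(\bx)}} \geq \frac{1}{1+e^{-2 a_i(\by)}} .
\eeq
 So whenever the shared random number $u$ sets $y_i := +1$ it also sets
 $x_i := +1$, and the ordering of the two states survives the update.
 If some $J_{ij}$ is negative, the first inequality can fail and the ordering
 is no longer guaranteed.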
Propp and Wilson have improved on this method\index{Gibbs sampling}
for the Ising model
by using a Markov chain called the single-bond heat bath algorithm
to sample from a related model called the \ind{random cluster model};
they show that
exact samples
from the random cluster model can be obtained rapidly
and can be converted
into exact samples from the Ising model. Their ground-breaking
paper includes an exact sample from a 16-million-spin Ising model
at its critical temperature. A sample for a smaller Ising model
is shown in \figref{fig.ising.exact}.
\marginfig{
\begin{center}
\psfig{figure=images/q2.ps,width=1.94in}
\end{center}
\caption[a]{An exact sample from the Ising model at its critical temperature,
produced by
% David Bruce Wilson.
\mbox{D.B.~Wilson}.
Such samples can be produced within seconds
on an ordinary computer by
exact sampling.
}
\label{fig.ising.exact}
}
\subsection{A generalization of the exact sampling method for `non-attractive' distributions}
The method of Propp and Wilson for the Ising model, sketched above,
can only be applied to probability distributions that are, as they
call them, `attractive'. Rather than define this term, let's say what it
means, for practical purposes: the method can be applied to spin systems
in which all the couplings are positive (\eg, the ferromagnet), and
to a few special spin systems with negative couplings (\eg, as we already
observed in \chref{ch.ising}, the rectangular ferromagnet and antiferromagnet
are equivalent); but it cannot be applied to general spin systems in which
some couplings are negative, because in such systems the trajectories
followed by the all-up and all-down states are not guaranteed to be
upper and lower bounds for the set of all trajectories.
% To put it another way, the Markov chain does not have the non-crossing property.
Fortunately, however, we do not need to be so strict.
% Radford Neal\index{Neal, Radford}
% has pointed out that i
It is possible to re-express the
\index{Propp, J. G.}{Propp}
and \index{Wilson, David B.}{Wilson} algorithm in a way that generalizes to the case of
spin systems with negative couplings.
% summary state
The idea of
the {\dem\ind{\envelope}} version of exact sampling
is still that we keep track of bounds\index{{\tt{?}}}
% an `upper bound' and `lower bound'
on the set of all trajectories, and detect when
these bounds are equal, so as to find exact samples.
% Propp and Wilson
But the bounds will not themselves be actual trajectories,
and they will not necessarily be {\em tight\/} bounds.
% This is called .
% simon said
% Is it as if we are
%using the '?' states to represent multiple possible states with a single
%vector and we only fill in the '?'s with a 0 or 1 when the 'alternative
%state' chains would've coalesced. So when we start off with all '?' we ARE
%effectively considering all possible start configurations - it just gives
%us a very economical way of keeping track and monitoring our progress.
% I think this is already said.
%
Instead of simulating two trajectories, each of which moves in a state
space $\{ -1, +1 \}^N$, we simulate one {\dem trajectory envelope\/} in an
augmented state space $\{ -1, +1 , {\tt ?} \}^N$, where the symbol
{\tt ?} denotes `either $-1$ or $+1$'.
We call the state of this augmented system the `\envelope'.
% envelope'
An example
\envelope\ of a six-spin system is {\tt ++-?+?}. This \envelope\ is
shorthand for the set of states
\begin{center} {\tt ++-+++}, {\tt ++-++-}, {\tt ++--++}, {\tt ++--+-} .
\end{center}
The update rule at each step of the Markov chain takes a single spin,
enumerates all possible states of the neighbouring spins that are compatible with
the current \envelope, and, for each of these local scenarios,
computes the new value ({\tt +} or {\tt -}) of the spin
using Gibbs sampling (coupled to a random number $u$ as in \algref{alg.coupling}).
If all these new values agree, then the new value of the updated spin in the
\envelope\ is set to the unanimous value ({\tt +} or {\tt -}).
Otherwise, the new value of the spin in the \envelope\ is `{\tt ?}'.
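 For pairwise couplings the enumeration of local scenarios need not be carried
 out explicitly. As a sketch (the bounds $a_i^{-}$ and $a_i^{+}$ are notation
 introduced here, and $a_i = \sum_j J_{ij} x_j$ is the quantity computed in
 \algref{alg.coupling}), the smallest and largest values of $a_i$ compatible
 with the current state of the neighbours are
\beq
 a_i^{\pm} = \sum_{j : \, x_j \neq {\tt ?}} J_{ij} x_j
 \: \pm \sum_{j : \, x_j = {\tt ?}} | J_{ij} | ,
\eeq
 and, because the Gibbs probability increases monotonically with $a_i$,
 the update rule is equivalent to
\beq
 x_i := \left\{ \begin{array}{cl}
 \mbox{{\tt +}} & \mbox{if $u < 1/(1+e^{-2 a_i^{-}})$} \\
 \mbox{{\tt -}} & \mbox{if $u \geq 1/(1+e^{-2 a_i^{+}})$} \\
 \mbox{{\tt ?}} & \mbox{otherwise.}
 \end{array} \right.
\eeq
 The first case is the one in which every local scenario yields {\tt +}, and the
 second is the one in which every local scenario yields {\tt -}.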
% This update rule can
The initial condition, at time $T_0$, is given by setting all the spins in
the \envelope\ to `{\tt ?}', which corresponds to considering
all possible start configurations.
In the case of a spin system with positive couplings,
this \envelope\ simulation will be identical to the simulation of
 the uppermost and lowermost states, in the style of
% {\em \`a la\/}
 Propp and Wilson, with coalescence occurring when all the `{\tt ?}' symbols
have disappeared.
The \envelope\ method can be applied to general spin systems with any couplings.
The only shortcoming of this method is that the envelope may describe
an unnecessarily
large set of states, so there is no guarantee that the
\envelope\ algorithm will converge;
the time for coalescence to be {\em detected\/} may be considerably larger
than the actual time taken for the underlying Markov chain to coalesce.
The \envelope\ scheme has been applied to exact sampling in belief networks
by \citeasnoun{NealHarvey2000},\index{Neal, Radford} and to
the triangular antiferromagnetic Ising model
by \citeasnoun{PattersonChildsMacKay00}.
% Mike Harvey and Radford Neal.
Summary state methods were first introduced by
\citeasnoun{Huber1998}; they also go by the names
\ind{sandwiching method}s and \ind{bounding chain}s.
% The \envelope\ method was first introduced by
% \citeasnoun{Huber1998}, who called it a \index{bounding chain}.
% Should I also cite H?ggstr?m-Nelander. ?
\begin{figure}
\figuremargin{\mbox{\psfig{figure=figs/hexagonbig.ps,width=3.95in}}}{
\caption[a]{A perfectly random \ind{tiling} of a hexagon by lozenges,
provided by J.G.~Propp and D.B.~Wilson.}
}
\end{figure}
\section*{Further reading}
For further reading, impressive pictures of exact samples
from other distributions, and generalizations of the
exact sampling method, browse the perfectly-random sampling
 website.\footnote{{\tt{http://www.dbwilson.com/exact/}}}
% http://dimacs.rutgers.edu/$\sim$dbwilson/exact/}}
% http://dimacs.rutgers.edu/~dbwilson/exact/
% Exact sampling
For beautiful exact-sampling demonstrations running
live in your web-browser, see Jim Propp's website.\footnote{
{\tt{http://www.math.wisc.edu/$\sim$propp/tiling/www/applets/}}}
%http://www.math.wisc.edu/~propp/tiling/www/applets/
% I hope CUP printer will render this nicely.
%\marginfig{\mbox{\psfig{figure=figs/hexagonbig.ps,width=54mm}}
%\caption[a]{A perfectly random tiling of a hexagon with lozenges,
% provided by J.G.~Propp and D.B.~Wilson.}
%}
\subsection{Other uses for coupling}
The idea of coupling together Markov chains by having
them share a random number generator has other
applications beyond exact sampling.
\citeasnoun{PintoNeal_01} have shown that
the accuracy of estimates obtained from
a Markov chain Monte Carlo simulation (the second problem discussed
in \sectionref{sec.mcproblemsdefined}, \pref{sec.mcproblemsdefined}), using the estimator
% chapter \ref{ch.mc},
% \cf\ \eqref{eq.mc.est})
\beq
\hat{\Phi}_P \equiv \frac{1}{T} \sum_{t} \phi( \bx^{(t)} ) ,
\label{eq.mc.est.again}
\eeq
can be improved by coupling the chain of interest, which converges
to $P$,
to a second chain, which generates samples
from a second, simpler distribution, $Q$.
The coupling must be set up in such a way that
the states of the two chains are strongly correlated.
The idea is that we first estimate the expectations of a function
of interest, $\phi$,
under $P$ and under $Q$ in the normal way (\ref{eq.mc.est.again})
and compare the estimate under $Q$, $\hat{\Phi}_Q$, with the true value of
 the expectation under $Q$, ${\Phi_Q}$, which we assume
can be evaluated exactly.
% because of the simplicity of $Q$. If
If $\hat{\Phi}_Q$ is an overestimate then it is likely
that $\hat{\Phi}_P$ will be an overestimate too.
The difference $(\hat{\Phi}_Q-{\Phi_Q})$ can thus be used to
correct $\hat{\Phi}_P$.\index{Neal, Radford}
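 One simple form that such a correction could take is sketched here;
 the coefficient $c$ is introduced for illustration only and is not taken
 from the paper:
\beq
	\hat{\Phi}'_P = \hat{\Phi}_P - c \, ( \hat{\Phi}_Q - {\Phi_Q} ) .
\eeq
 If the second chain is at equilibrium then $\hat{\Phi}_Q - {\Phi_Q}$ has mean zero,
 so for any fixed $c$ the correction leaves the expectation of the estimator unchanged.
 The variance of the corrected estimator is
\beq
	{\rm var}( \hat{\Phi}'_P ) = {\rm var}( \hat{\Phi}_P )
	- 2 c \, {\rm cov}( \hat{\Phi}_P , \hat{\Phi}_Q )
	+ c^2 \, {\rm var}( \hat{\Phi}_Q ) ,
\eeq
 which is smallest when
 $c = {\rm cov}( \hat{\Phi}_P , \hat{\Phi}_Q ) / {\rm var}( \hat{\Phi}_Q )$.
 The reduction in variance is large only if the two estimates are strongly
 correlated, which is why the coupling must keep the two chains close together.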
% For details of the correction method, see
% Pinto and Neal's paper.
\section{Exercises}
\exercissxB{2}{ex.mcexact}{
Is there any relationship between the probability
distribution of the time taken for all trajectories
to coalesce, and the equilibration time of a Markov chain?
Prove that there is a relationship, or find a single chain
that can be realized in two different ways that have different
coalescence times.
}
\exercisxB{2}{ex.mcexact.fred}{
Imagine that Fred ignores
the requirement that the random bits used at some time $t$, in every run
from increasingly distant times $T_0$, must be identical,
and makes a coupled-Markov-chain simulator that uses
fresh random numbers every time $T_0$ is changed.
Describe what happens if Fred applies his method to the Markov
chain that is intended to sample from the uniform distribution over
the states 0, 1, and 2, using the Metropolis method, driven
by a random bit source as in \figref{fig.mcexact.1}b.
}
\exercisxC{5}{ex.modelexact}{
Investigate the application of perfect sampling to
linear regression in
\citeasnoun{holmes98perfect} or \citeasnoun{holmes2002perfect}
and try to generalize it.
}
\exercisxC{3}{ex.coalescencegeneral}{
The concept of coalescence has many applications.
Some surnames are more frequent than
others, and some die out altogether. Make a model of this
process; how long will it take until everyone
has the same surname?
Similarly,
variability in any particular portion of the human genome
(which forms the basis of \ind{forensic} \ind{DNA} fingerprinting)
is inherited like a surname. A DNA fingerprint is like a
string of surnames.
Should the fact that these surnames are subject
to coalescences, so that some surnames are by chance more prevalent
than others, affect the way in which DNA fingerprint
evidence is used in court?
}
% http://www.biology.washington.edu/fingerprint/dnaintro.html
% Variable Number Tandem Repeats or VNTR.
% http://www.college.ucla.edu/webproject/micro7/lecturenotes/finished/Fingerprinting.html
% This method is called Restriction Fragment Length Polymorphism and results in an RFLP Fingerprint.
\exercisxB{2}{ex.fairstrawsb}{
How can you use a coin to create a random ranking of 3 people?
Construct a solution that uses exact sampling. For example,
you could apply exact sampling to a Markov chain in which the coin
is repeatedly used alternately to decide whether to switch first and
second, then whether to switch second and third.
}% my solution: arithmetic coding. was in _e6a.tex
\exercisxC{5}{ex.exactZ}{
Finding the partition function $Z$ of a
probability distribution is a difficult problem.
Many Markov chain Monte Carlo methods
produce valid samples from a distribution without
ever finding out what $Z$ is.
Is there any probability distribution and Markov chain
such that
either the time taken to produce a perfect sample
or the number of random bits used to create a perfect
 sample is related to the value of $Z$?
Are there some situations in which the time to coalescence conveys
information about $Z$?
}
\section{Solutions}
\soln{ex.mcexact}{
It is perhaps surprising that there is no direct relationship
between the equilibration time and the time to coalescence.
%
We can prove this using the example of
% A simple example that proves this is the
% case of
the uniform distribution over the integers $\A = \{ 0,1,2, \ldots , 20 \}$.
A Markov chain that converges to this distribution in exactly
one iteration is the chain for which the probability of
state $s_{t+1}$ given $s_t$ is the uniform distribution, for all
$s_t$.
Such a chain can be coupled to a random number generator
in two ways: (a) we could draw a random integer $u \in \A$,
and set $s_{t+1}$ equal to $u$ regardless of
$s_t$; or
(b) we could draw a random integer $u \in \A$,
and set $s_{t+1}$ equal to $(s_{t}+u) \mod 21$. Method (b)
would produce a cohort of trajectories locked together, similar to
the trajectories in \figref{fig.mcexact.1}, except that
no coalescence ever occurs.
Thus, while the equilibration times of methods (a) and (b)
are both one, the coalescence times are respectively one and
infinity.
It seems plausible on the other hand that coalescence time
provides some sort of upper bound on
equilibration time.
}
%%%%%%%%%
%
\chapter{Variational Methods}
\label{ch.mft}
% \chapter{Mean Field Theory}
% \chapter{Variational Methods}
% \label{ch.mft}
% \chapter{Mean Field Theory}
% Another topic which will prove useful to have up our sleeves is
% mean field theory.
%
%
% included by lb.tex
\label{ch.variational}
Variational methods\index{variational methods}\index{approximation!variational}
are an important technique for the approximation of
complicated probability distributions, having\index{approximation!of complex distribution}
applications in statistical physics,
% Bayesian inference
data modelling and neural networks.
% , including the decoding of error correcting codes.
% Mean field theory is relevant to understanding
% neural networks and to the development of ways of implementing
% Bayesian inference and decoding error correcting codes.
\section{Variational free energy minimization}
One method for approximating a
complex distribution in a physical system is {\dem \ind{mean field theory}}.
Mean field theory is a special case of a general
{\dbf \ind{variational free energy}}
approach of Feynman\nocite{Feynman:SM}\index{Feynman, Richard}
and Bogoliubov which we will now study.
The key piece of mathematics needed to understand this method
is Gibbs' inequality,\marginpar{\small\raggedright
Gibbs' inequality first appeared in equation (\eqKL); see also \exrelent.
}
% -- equation (\eqKL), \exrelent --
which
we repeat here.
\begin{description}
\item[The relative entropy]
between two probability distributions $Q(x)$ and $P(x)$
that are defined over the same alphabet $\A_X$ is\index{relative entropy}
\beq
D_{\rm KL}(Q||P) = \sum_x Q(x) \log \frac{Q(x)}{P(x)} .
\label{eq.KL.again}
\eeq
The relative entropy satisfies $D_{\rm KL}(Q||P) \geq 0$ (Gibbs'
inequality) with equality only if $Q \eq P$. In general
$D_{\rm KL}(Q||P) \neq D_{\rm KL}(P||Q)$.
In this chapter we will replace the $\log$ by $\ln$,
and measure the divergence in nats.
\end{description}
\subsection{Probability distributions in statistical physics}
%
% Refer to example \ref{ex.rel.ent} for the essential inequality.
% Are the marginals the best approximation? No.
%
% \subsection{Why mean field theory in statistical physics?}
In statistical physics one often encounters probability
distributions of the form
\beq
P( \bx \given \beta, \bJ) = \frac{1}{Z(\beta,\bJ)}
\exp \! \left[ - \beta E( \bx ; \bJ ) \right] ,
\label{eq.ising.p.again}
\eeq
where for example the state vector is $\bx \in \{-1,+1\}^N$, and $E(\bx;\bJ)$ is some energy function such as
\beq
E(\bx;\bJ) = -
% \left[
\frac{1}{2}
\sum_{m,n} J_{mn} x_m x_n - \sum_n h_n x_n.
% \right] .
\label{eq.ising.e.again}
\eeq
The \ind{partition function} (normalizing constant) is
\beq
Z(\beta,\bJ) \equiv \sum_{\bx} \exp \!
\left[ - \beta E( \bx ; \bJ ) \right] .
\label{eq.ising.z.again}
\eeq
%
The probability distribution of
\eqref{eq.ising.p.again} is complex. Not unbearably complex --
we can, after all, evaluate $E(\bx;\bJ)$ for any particular $\bx$ in a time
polynomial in the number of spins.
But evaluating the normalizing constant $Z(\beta,\bJ)$ is difficult,
as we saw in \chapterref{ch.mc},
and describing the properties of the probability distribution
is also hard. Knowing the value of $E(\bx;\bJ)$ at a few arbitrary points
$\bx$, for example,
gives no useful information about what the average properties
of the system are.
An evaluation of $Z(\beta,\bJ)$ would be particularly
desirable because from
% the \ind{partition function}
$Z$ we can derive all the
thermodynamic properties of the system.
% Mean field theory\index{mean field theory}
{Variational free energy minimization}\index{variational free energy!minimization}\index{free energy!minimization}\index{free energy!variational}
is a method for {\dbf approximating\/} the complex
distribution $P( \bx)$ by a simpler ensemble $Q(\bx ; \btheta)$
that is parameterized by adjustable parameters $\btheta$. We
adjust these parameters so as to get $Q$ to best approximate
$P$, in some sense.
A by-product of this approximation is a lower bound on $Z(\beta,\bJ)$.
% \subsection{Why mean field theory in error correcting codes?}
% removed to leftovers
\subsection{The variational free energy}
The objective function chosen to measure the
quality of the approximation is the {\dem\ind{variational free
energy}}
% \newcommand{\tF}{{\tilde{F}}}
\beq
\beta \tF(\btheta) = \sum_{\bx} \: Q(\bx;\btheta)
\ln \frac{ Q(\bx;\btheta) }{ \exp \!
\left[ - \beta E( \bx ; \bJ ) \right] }
.
\label{eq.vfe}
\eeq
% The factor of $\beta$ is included on the left-hand side
% to make it
This expression can be manipulated into a couple of interesting
forms: first,
\beqan
\beta \tF(\btheta) &=& \beta \sum_{\bx} \: Q(\bx;\btheta)
E( \bx ; \bJ )
- \sum_{\bx} \: Q(\bx;\btheta) \ln\frac{1}
{ Q(\bx;\btheta) } \\
&\equiv& \beta \left< E( \bx ; \bJ ) \right>_Q - S_Q ,
\eeqan
where $\left< E( \bx ; \bJ ) \right>_Q$ is the average of the
energy function under the distribution $Q(\bx;\btheta)$, and
$S_Q$ is the entropy of the distribution $Q(\bx;\btheta)$
(we set $k_{\rm B}$ to one in the definition of $S$
so that it is identical to the definition of the entropy $H$ in \partone).
Second, we can use the definition of $P(\bx \given \beta, \bJ)$
to write:
\beqan
\beta \tF(\btheta) &=& \sum_{\bx} \: Q(\bx;\btheta)
\ln \frac{ Q(\bx;\btheta) }{ P(\bx \given \beta, \bJ) }
- \ln {Z(\beta,\bJ)}
\\
&=& D_{\rm KL}( Q || P ) + \beta F,
\eeqan
where $F$ is the true free energy, defined by
\beq
\beta F
\equiv - \ln {Z(\beta,\bJ)},
\eeq
and $D_{\rm KL}(Q||P)$ is the relative entropy between
the approximating distribution $Q(\bx;\btheta)$ and the
true distribution $P(\bx \given \beta, \bJ)$.
Thus by Gibbs' inequality, the variational free energy
$\tF(\btheta)$ is bounded below by $F$ and
only attains this value for $Q(\bx;\btheta) = P(\bx \given \beta, \bJ)$.
Our strategy is thus to vary $\btheta$ in such a way that
$\beta \tF(\btheta)$ is minimized. The approximating distribution
then gives a simplified approximation
to the true distribution that may be useful, and the value
of $\b \tF(\btheta)$ will be an upper bound for $\b F$.
Equivalently, $\tilde{Z} \equiv e^{-\b \tF(\btheta)}$ is a lower bound for $Z$.
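 The step from equation (\ref{eq.vfe}) to the form involving the relative entropy
 can be spelled out as follows.
 By equation (\ref{eq.ising.p.again}),
 $\exp \! \left[ - \beta E( \bx ; \bJ ) \right] = Z(\beta,\bJ) \, P(\bx \given \beta, \bJ)$,
 so
\beq
	\ln \frac{ Q(\bx;\btheta) }{ \exp \! \left[ - \beta E( \bx ; \bJ ) \right] }
	= \ln \frac{ Q(\bx;\btheta) }{ P(\bx \given \beta, \bJ) } - \ln Z(\beta,\bJ) .
\eeq
 Averaging both sides under $Q$, and using $\sum_{\bx} Q(\bx;\btheta) = 1$ so that
 the constant $\ln Z(\beta,\bJ)$ is unchanged by the average, gives
 $\beta \tF(\btheta) = D_{\rm KL}( Q || P ) - \ln Z(\beta,\bJ)$.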
\subsection{Can the objective function $\b \tF$ be evaluated?}
We have already agreed that the evaluation of various interesting
sums over $\bx$ is intractable. For example, the \ind{partition function}
\beq
Z = \sum_{\bx} \exp \! \left( - \b E( \bx ; \bJ ) \right),
\eeq
the energy
\beq
\left< E \right>_P = \frac{1}{Z} \sum_{\bx} E( \bx ; \bJ ) \exp \!
\left( - \b E( \bx ; \bJ ) \right) ,
\eeq
and the entropy
\beq
S \equiv
\sum_{\bx} P(\bx \given \beta, \bJ) \ln\frac{1}{P(\bx \given \beta, \bJ)}
\eeq
are
all presumed to be impossible to evaluate.
So why should we suppose that this objective function
$\beta \tF(\btheta)$, which is also defined in terms of a sum
over all $\bx$ (\ref{eq.vfe}), should be a convenient quantity to deal
with? Well, for a range of interesting energy functions,
and for sufficiently simple approximating distributions,
the variational free energy {\em can\/} be efficiently
evaluated.
\section{Variational free energy minimization for spin systems}
% Ising models}
\label{sec.vfeising}
An example of a tractable variational free energy is given by
the spin system whose energy function was given in \eqref{eq.ising.e.again},
which we can approximate with a {\em separable\/} approximating distribution,
\beq
Q(\bx; \ba) = \frac{1}{Z_Q} \exp \left({ \sum_n a_n x_n }\right) .
\eeq
The variational parameters $\btheta$ of the variational free
energy (\ref{eq.vfe}) are the components of the vector
% of log probability ratios,
$\ba$.
To evaluate the variational free energy we need the entropy of
this distribution,
\beq
S_Q = \sum_{\bx} \: Q(\bx;\ba) \ln\frac{1}
{ Q(\bx;\ba) }
\eeq
and the mean of the energy,
\beq
\left< E( \bx ; \bJ ) \right>_Q =
\sum_{\bx} \: Q(\bx;\ba)
E( \bx ; \bJ ) .
\eeq
The entropy of the separable approximating
distribution is simply the sum of the entropies of the individual
spins \exercisebref{ex.Hadditive},
% bref puts the ref in brackets
\beq
S_Q = \sum_n H_2^{(e)}(q_n),
\eeq
where $q_n$ is the probability that spin $n$ is $+1$,
\beq
q_n = \frac{e^{a_n}}{e^{a_n}+e^{-a_n}} = \frac{1}{1+\exp(-2 a_n)} ,
\eeq
and
\beq
H_2^{(e)}(q) = q \ln \frac{1}{q} + (1-q)\ln\frac{1}{(1-q)} .
\eeq
% all logs being natural logarithms.
%
% REFER to an exercise?
%
The mean energy under $Q$ is easy to obtain because
% $E(\bx;\bJ)$
$\sum_{m,n} J_{mn} x_m x_n$ is a sum
of terms each involving the product of two
{\em independent\/} random variables.
(There
are no self-couplings, so $J_{mn} = 0$ when $m=n$.)
If we define
the mean value
of $x_n$ to be $\bar{x}_n$, which is given by
\beq
\bar{x}_n = \frac{ e^{a_n} - e^{-a_n} }{e^{a_n} + e^{-a_n} }
= \tanh(a_n) = 2 q_n - 1 ,
\eeq
we obtain
\beqan
\left< E( \bx ; \bJ ) \right>_Q &=&
\sum_{\bx} \: Q(\bx;\ba) \left[ - \frac{1}{2}
\sum_{m,n} J_{mn} x_m x_n - \sum_n h_n x_n
\right] \\
&=&
- \frac{1}{2} \sum_{m,n} J_{mn} \bar{x}_m \bar{x}_n - \sum_n h_n \bar{x}_n.
%
\eeqan
So the variational free energy is given by
\beq
\b \tF(\ba) = \b \left< E( \bx ; \bJ ) \right>_Q - S_Q
= \b \left(- \frac{1}{2}
\sum_{m,n} J_{mn} \bar{x}_m \bar{x}_n - \sum_n h_n \bar{x}_n \right) - \sum_n H_2^{(e)}(q_n) .
\eeq
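 The replacement of $x_m x_n$ by $\bar{x}_m \bar{x}_n$ in the mean energy
 uses only the separability of $Q$.
 Writing $Q_n(x_n) \equiv e^{a_n x_n} / ( e^{a_n} + e^{-a_n} )$ for the marginal
 distribution of spin $n$ (a symbol introduced here for illustration), we have, for $m \neq n$,
\beq
	\sum_{\bx} \: Q(\bx;\ba) \, x_m x_n =
	\left( \sum_{x_m} Q_m(x_m) \, x_m \right) \!
	\left( \sum_{x_n} Q_n(x_n) \, x_n \right)
	= \bar{x}_m \bar{x}_n .
\eeq
 The terms with $m=n$, for which $\left< x_n^2 \right>_Q = 1 \neq \bar{x}_n^2$,
 do not contribute because $J_{nn} = 0$.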
%%%%%%%%%%%%%%%%%%%%%% added Dec 2000
\amarginfig{c}{
%\begin{figure}
%\figuremargin{%
\begin{center}
\hspace*{-0.1in}\mbox{\psfig{figure=gnu/ising.vfe.s.ps,angle=-90,width=2.4in}}\\[-0.4in]
\end{center}
%}{%
% see gnu/ising.gnu
% {\textstyle\half} no factor of half, because sum above is over all mn
\caption[a]{The variational free energy
of
the two-spin system whose energy is $E(\bx) = - x_1 x_2$,
as a function of the two variational parameters $q_1$ and $q_2$.
The inverse-temperature is $\beta=1.44$.
% critical point for this system is 1
The function plotted is
$$
\b \tF = -
\b \bar{x}_1 \bar{x}_2
- H_2^{(e)}(q_1) - H_2^{(e)}(q_2),
$$
where $\bar{x}_n = 2 q_n -1$.
Notice that for fixed $q_2$
the function is \convexsmile\ with respect to $q_1$,
and for fixed $q_1$ it is \convexsmile\ with respect to $q_2$.}
\label{fig.mft2spins}
}%
% see also load 'ising3.gnu' for a movie demo
% for lecture
%\end{figure}
We now consider minimizing this function with respect
to the variational parameters $\ba$.
If
% Noting that when
$q=1/(1+e^{-2a})$, the derivative of the entropy is
\beq
 \frac{ \partial}{\partial q} H_2^{(e)}(q) = \ln \frac{1-q}{q} = -2a .
\eeq
So we obtain
\beqan
\frac{\partial }{\partial a_m} \b \tF(\ba)
&=& \b \left[ - \sum_{n} J_{mn} \bar{x}_n - h_m \right]\left(2
\frac{\partial q_m }{\partial a_m} \right) -
\ln \left( \frac{1-q_m}{q_m}
\right) \left(\frac{\partial q_m }{\partial a_m} \right)
\nonumber \\
&=& 2\left(\frac{\partial q_m }{\partial a_m} \right)
\left[- \b \left( \sum_{n} J_{mn} \bar{x}_n + h_m\right) + a_m \right]
.
\eeqan
This derivative is equal to zero when
\beq
a_m = \b\left( \sum_{n} J_{mn} \bar{x}_n + h_m \right) .
\label{eq.mfta}
\eeq
So $\tF(\ba)$ is extremized at any point that satisfies
\eqref{eq.mfta} and
% the definition
\beq
\bar{x}_n = \tanh( a_n ) .
\label{eq.mftb}
\eeq
% define the solution to our variational
% free energy minimization.
The \vfe\ $\tF(\ba)$ may be a multimodal function,
in which case each stationary point (maximum, minimum or saddle)
will satisfy equations (\ref{eq.mfta}) and (\ref{eq.mftb}).
One way of using these equations,
in the case of a system with an arbitrary coupling matrix $\bJ$,
is to update each parameter $a_m$
and the corresponding value of $\bar{x}_m$
using equation (\ref{eq.mfta}), one at a time. This {\dem asynchronous
updating of the parameters\/} is guaranteed to decrease $\b\tF(\ba)$.
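 As a worked illustration, consider the two-spin system of \figref{fig.mft2spins},
 whose energy $E(\bx) = - x_1 x_2$ corresponds to $J_{12} = J_{21} = 1$ and $h_n = 0$.
 The asynchronous updates are
\beq
	a_1 = \b \bar{x}_2 , \:\:\: \bar{x}_1 = \tanh( a_1 ) ; \:\:\:\:\:
	a_2 = \b \bar{x}_1 , \:\:\: \bar{x}_2 = \tanh( a_2 ) .
\eeq
 A fixed point with $\bar{x}_1 = \bar{x}_2 = \bar{x}$ satisfies
 $\bar{x} = \tanh( \b \bar{x} )$.
 The solution $\bar{x}=0$ exists for every $\b$; because the slope of
 $\tanh$ at the origin is one, a pair of nonzero solutions $\pm \bar{x}$
 exists only if $\b > 1$.
 At $\b = 1.44$, solving the fixed-point equation numerically gives roughly
 $\bar{x} \simeq \pm 0.83$ (this value is a numerical estimate, not one read from the figure).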
Equations (\ref{eq.mfta}) and (\ref{eq.mftb}) may be recognized
as the \index{mean field theory}{mean field} equations for a spin system. The variational
parameter $a_n$ may be thought of as the strength of a fictitious
field applied to an isolated spin $n$.
% which when positive encourages spin $n$ to point up.
\Eqref{eq.mftb}
describes the mean response of spin $n$, and \eqref{eq.mfta} describes
how the field $a_m$ is set in response to the mean state of
all the other spins.
The variational free energy derivation is a helpful
viewpoint for mean field theory for two reasons.
\ben
\item
This approach associates an objective function $\b \tF$ with
the mean field equations; such an objective function is useful
because it can help identify alternative dynamical systems
that minimize the same function.
\item
The theory is readily generalized to other approximating
distributions. We can imagine introducing a more complex
approximation $Q(\bx;\btheta)$ that might for example capture
correlations among the spins instead of modelling the spins as
independent. One could then evaluate the variational free energy and
optimize the parameters $\btheta$ of this more complex
approximation. The more degrees of freedom the approximating
distribution has, the tighter the bound on the free energy becomes.
However, if
the complexity of an approximation is increased, the evaluation of either
the mean energy or the entropy typically becomes more
challenging.
\een
\begin{figure}
\figuremargin{%
\begin{center}
\mbox{\psfig{figure=isingmft/mft3.T.ps,angle=-90,width=3.2in}}
\end{center}
}{%
\caption[a]{Solutions of the variational free energy extremization problem
for the Ising model, for three different
applied fields $h$. Horizontal axis: temperature $T=1/\b$.
Vertical axis: magnetization $\bar{x}$. The critical temperature
found by mean field theory is $T_c^{\rm mft} = 4$.}
\label{fig.mft}
}%
\end{figure}
\section{Example: mean field theory for the ferromagnetic Ising model}
In the simple Ising model studied in \chapterref{ch.ising}, every coupling $J_{mn}$ is equal to $J$
if $m$ and $n$ are neighbours and zero otherwise. There is
an applied field $h_n = h$ that is the same for all spins.
A very simple approximating distribution is one with just a single
variational parameter $a$, which defines a separable distribution
\beq
Q(\bx; a) = \frac{1}{Z_Q} \exp \left({ \sum_n a x_n }\right)
\eeq
in which all spins are independent and have the same probability
% $\theta$,
\beq
q_n = \frac{1}{1+\exp(-2 a)}
\eeq
of being up. The mean magnetization is
\beq
\bar{x} = \tanh( a )
\label{eq.mftb.i}
\eeq
 and equation (\ref{eq.mfta}), which must be satisfied at any minimum of the
 variational free energy, becomes
\beq
a = \b\left( C J \bar{x} + h \right) ,
\label{eq.mfta.i}
\eeq
where $C$ is the number of couplings that a spin is involved in --
$C=4$ in the case of a rectangular two-dimensional Ising model.
We can solve equations (\ref{eq.mftb.i}) and (\ref{eq.mfta.i}) for $\bar{x}$
numerically -- in fact,
% if we want a graph
it is easiest to vary $\bar{x}$ and solve
for $\b$
%
% note if x = tanh(a) then a = 1/2 log[(1+x)/(1-x)]
%
-- and obtain graphs of the free energy minima and maxima
as a function of temperature as shown in \figref{fig.mft}. The
solid line
shows $\bar{x}$ versus $T = 1 /\beta$ for the case $C=4, J=1$.
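 The reason it is easier to solve for $\b$ is that $\b$ can be isolated in closed form.
 Eliminating $a$ between equations (\ref{eq.mftb.i}) and (\ref{eq.mfta.i}) gives
 $\tanh^{-1} ( \bar{x} ) = \b \left( C J \bar{x} + h \right)$, so
\beq
	\b = \frac{ \tanh^{-1} ( \bar{x} ) }{ C J \bar{x} + h }
	= \frac{1}{ 2 \left( C J \bar{x} + h \right) } \ln \frac{ 1 + \bar{x} }{ 1 - \bar{x} } .
\eeq
 Each value of $\bar{x}$ in $(-1,1)$ for which the right-hand side is positive
 thus yields one point $( T , \bar{x} )$ on the graph, with $T = 1/\b$.
 (When $h=0$, the solution $\bar{x}=0$ holds at every temperature and is handled separately.)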
%
% easy because b( CJ x + h ) = tanh^{-1} x = 1/2 log[(1+x)/(1-x)]
%
% see ~/bin/mft.p
When $h=0$, there is a pitchfork bifurcation at a critical
temperature $T_c^{\rm mft}$. [A pitchfork bifurcation is
a transition like the one shown by the solid lines in
\figref{fig.mft},
% figure 26.1
from a system with one minimum as a function of $a$ (on the right) to
a system (on the left)
with two minima and one maximum; the maximum is the middle one of
the three lines. The solid lines look
like a pitchfork.]
% (like the true critical temperature $T_c$ of the Ising model).
Above this temperature, there is only one
minimum in the variational free energy, at $a=0$
and $\bar{x}=0$; this minimum corresponds to an approximating
distribution that is uniform
% distribution
over all states. Below the critical temperature, there
are two minima corresponding to approximating distributions that are
symmetry-broken, with all spins more likely to be up, or all spins
more likely to be down. The state $\bar{x}=0$ persists as a stationary
point of the variational free energy, but now it is a local {\em maximum\/}
of the variational free energy.
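 The value of the critical temperature can be found by expanding
 equations (\ref{eq.mftb.i}) and (\ref{eq.mfta.i}) about $\bar{x}=0$ with $h=0$.
 Using $\tanh(a) \simeq a - a^3/3$ for small $a$, we obtain
\beq
	\bar{x} \simeq \b C J \bar{x} - \frac{1}{3} \left( \b C J \bar{x} \right)^3 ,
\eeq
 which has nonzero solutions,
 $\bar{x}^2 \simeq 3 ( \b C J - 1 ) / ( \b C J )^3$,
 only when $\b C J > 1$. Thus
\beq
	T_c^{\rm mft} = C J ,
\eeq
 which is equal to 4 for $C=4$, $J=1$, in agreement with \figref{fig.mft}.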
When $h>0$, there is a global \vfe\ minimum at
any temperature for a positive value of $\bar{x}$,
shown by the upper dotted curves in \figref{fig.mft}.
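 As long as $h$ \ldots
% (the source text breaks off here; the equation below is the next surviving fragment)
\beq
	\cdots =
	\left< G^*_{jm} s_m s_n G^*_{ln} \delta_{ik} z'_i \right> .
\eeq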
We now make several severe approximations: we replace $\bG^*$ by the
present value of $\bG$, and replace the correlated average
$\left< s_m s_n z'_i \right>$ by $\left< s_m s_n \right> \!
\left< z'_i \right> \equiv \Sigma_{mn} D_i$. Here $\bSigma$ is
the variance--covariance matrix of the latent variables (which is
assumed to exist), and $D_i$ is the typical value of the
curvature $\d^2 \ln p_i(a)/ \d a^2$. Given that the sources are
assumed to be independent, $\bSigma$ and ${\bf D}$ are both diagonal matrices.
These approximations motivate the
matrix $\bM$ given by:
\beq
[M^{-1}]_{(ij)(kl)} = G_{jm} \Sigma_{mn} G_{ln} \delta_{ik} D_i ,
\eeq
that is,
\beq
M_{(ij)(kl)} = W_{mj} \Sigma^{-1}_{mn} W_{nl} \delta_{ik} D^{-1}_i .
\eeq
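 That these two expressions are inverses of each other can be checked directly,
 assuming, as in the noise-free model, that $\bW = \bG^{-1}$, so that
 $W_{rl} G_{ln} = \delta_{rn}$ and $G_{jm} W_{mq} = \delta_{jq}$
 (the index labels $p,q,r,s$ are introduced here for the check):
\beqan
	\sum_{k,l} [M^{-1}]_{(ij)(kl)} M_{(kl)(pq)}
	&=& \sum_{k,l} G_{jm} \Sigma_{mn} G_{ln} \delta_{ik} D_i \:
	W_{rl} \Sigma^{-1}_{rs} W_{sq} \delta_{kp} D^{-1}_k
	\nonumber \\
	&=& G_{jm} \Sigma_{mn} \Sigma^{-1}_{ns} W_{sq} \: \delta_{ip} D_i D^{-1}_i
	\nonumber \\
	&=& G_{jm} W_{mq} \, \delta_{ip} \:\: = \:\: \delta_{jq} \delta_{ip} .
\eeqan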
For simplicity, we further assume that the sources are similar
to each other so that $\bSigma$ and ${\bf D}$ are both homogeneous,
and that $\bSigma {\bf D} = 1$. This will lead us to an algorithm
that is covariant with respect to linear rescaling of the data
$\bx$, but not with respect to linear rescaling of the latent
variables.
%For problems where these assumptions
% do not hold, it will be straightforward to retain inhomogeneous
% $\Sigma$ and $D$.
We thus use:
\beq
M_{(ij)(kl)} = W_{mj} W_{ml} \delta_{ik} .
\eeq
%
Multiplying this matrix by the gradient in equation (\ref{eq.Wderiv})
we obtain the following covariant learning algorithm:
\beq
\Delta W_{ij} = \eta \left( W_{ij} + W_{i' j} \ica_{i'} z_{i} \right) .
\label{eq.DJCM}
\eeq
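 The multiplication can be written out explicitly. The following sketch assumes
 that the gradient in equation (\ref{eq.Wderiv}) has the form
 $G_{lk} + x_l z_k$ for the derivative with respect to $W_{kl}$
 (that form is not shown in the present passage), and that $\bW = \bG^{-1}$
 and $\ica_m = W_{ml} x_l$:
\beqan
	M_{(ij)(kl)} \left( G_{lk} + x_l z_k \right)
	&=& W_{mj} W_{ml} \delta_{ik} \left( G_{lk} + x_l z_k \right)
	\nonumber \\
	&=& W_{mj} \left( W_{ml} G_{li} \right) + W_{mj} \left( W_{ml} x_l \right) z_i
	\nonumber \\
	&=& W_{ij} + W_{mj} \ica_{m} z_i ,
\eeqan
 which, with the summed index $m$ relabelled $i'$, is the bracketed quantity
 on the right-hand side of equation (\ref{eq.DJCM}).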
Notice that this expression does not require any inversion of the matrix
$\bW$. The only additional computation once $\bz$ has been
computed is a single backward pass through the weights to compute the
quantity
\beq
x'_{j} = W_{i' j} \ica_{i'}
\eeq
in terms of which the covariant algorithm reads:
\beq
\Delta W_{ij} = \eta \left( W_{ij} + x'_{j} z_{i} \right) .
\label{eq.DJCM2}
\eeq
The quantity $\left( W_{ij} + x'_{j} z_{i} \right)$ on the right-hand side is
sometimes called the {\dem\ind{natural gradient}}.\index{gradient descent!natural}
The covariant independent component analysis algorithm is summarized in
\algref{cov.alg}.
\begin{algorithm}
\algorithmmargin{%
%\figuremargin{
Repeat for each datapoint $\bx$:
\ben
\item
Put $\bx$ through a linear mapping:
\[
\ba = \bW \bx.
\]
\item
Put $\ba$ through a nonlinear map:
\[
z_i = \phi_i(\ica_i),
\]
 where a popular choice for $\phi$ is $\phi_i(\ica_i) = - \tanh(\ica_i)$.
\item
Put $\ba$ back through $\bW$:
\[
\bx' = \bW^{\T} \ba .
\]
\item Adjust the weights in accordance with
%, with a learning rule:
\[
\Delta \bW \propto \bW + \bz {\bx'}^{\T} .
\]
\een
}{
\caption[a]{Independent component analysis -- covariant\index{independent component analysis}
version.\index{algorithm!independent component analysis}}
\label{cov.alg}
}
\end{algorithm}
\section*{Further reading}
ICA was originally derived using
an \ind{information maximization} approach \cite{Bell_Sejnowski}.
Another view of ICA, in terms of energy functions, which
motivates more general models, is given by \citeasnoun{hintonICA}.
Another generalization of ICA can be found
in Pearlmutter and Parra (1996, 1997).\nocite{PearlmutterICA,pearlmutter97maximum}
%\citeasnoun{PearlmutterICA}.
% A New View of ICA.
% G. E. Hinton, M. Welling, Y. W. Teh and S. Osindero. ICA 2001.
There is now an enormous literature on applications of ICA.
A variational free energy
minimization approach to ICA-like models is given
in \cite{MiskinPHD,miskin1,miskin3}.
Further reading on blind separation, including
non-ICA algorithms,
can be found in \cite{Jutten1991,Comon1991,Hendin1994,AmariICA,hojen02meanfield}.
\subsection{Infinite models}
While latent variable models with a finite number of latent
variables are widely used, it is often the case that
our beliefs about the situation would be most accurately
captured by a very large number of latent variables.
% Imagine modelling the accent of a speaker, in order to
% make an accurate speech recognition system. What is the
% dimensionality of accent space? While a reasonable model
% might be built using ten or twenty dimensions,
Consider clustering, for example. If we attack speech recognition
by modelling words using a cluster model,
how many clusters should we use? The number of possible words
is unbounded (\secref{sec.zipf}), so we would really like to use a
model in which it's always possible for new clusters to arise.
Furthermore, if we do a careful job of modelling the cluster
corresponding to just one English word, we will probably find that
the cluster for one word should itself be modelled as composed of clusters --
% the speaker might be male or female, and might speak with one accent or
% another. Indeed, the set of pronunciations of one word
indeed, a hierarchy of clusters within clusters.
The first levels of the hierarchy would divide male speakers
from female, and would separate speakers from different regions -- India, Britain, Europe,
and so forth. Within each of those clusters would be subclusters
for the different accents within each region. The subclusters could have
subsubclusters right down to the level of villages, streets, or families.
Thus we would often like to have infinite numbers of clusters;
in some cases the clusters would have
% The first example motivated an infinite number of clusters that form
a hierarchical structure,
% -- each word is distinct. The second example motivated an infinite number of clusters in
and in other cases the hierarchy would be flat.
So, how should such infinite models be implemented in finite computers?
And how should we set up our Bayesian models so as to avoid
getting silly answers?
Infinite mixture models for categorical data are presented in \citeasnoun{Radford.mixtureTR},
along with a Monte Carlo method for simulating inferences and predictions.
Infinite Gaussian mixture models with a flat hierarchical structure
are presented in \citeasnoun{rasmussenIGMM}.
\citeasnoun{Neal_infinite_trees} shows how to use Dirichlet diffusion
trees to define models of hierarchical clusters.
Most of these ideas build on the Dirichlet process (\secref{sec.dirichletprocess}).
This remains an active research area \cite{rasmussen-ghahramani-02,beagharas01}.
\section{Exercises}
\exercisxC{3}{ex.icanoise}{
Repeat the derivation of the algorithm,
but assume a small amount of noise in $\bx$:
$\bx = \bG \bs + \bn$;
so
the term $\delta \left( x^{(n)}_j -
{\textstyle \sum_i G_{ji} s^{(n)}_i }
\right)$ in the joint probability (\ref{eq.nonoise})
is replaced by a probability distribution over $x^{(n)}_j$
with mean $\sum_i G_{ji} s^{(n)}_i$. Show that, if this noise
distribution has sufficiently
small standard deviation, the identical algorithm results.
}
\exercisxC{3}{ex.icado}{
Implement the covariant ICA algorithm
and apply it to toy data.
}
\exercisxC{4-5}{ex.icagen}{
Create algorithms appropriate for the situations:
%\ben
%\item
(a) $\bx$ includes substantial Gaussian noise;
%\item
(b) more measurements than latent variables ($J>I$);
%\item
(c) fewer measurements than
 latent variables ($J<I$).
}
% [...] gamma-energies
% ~/bin/ignorance.p gamma-energies > digits
% this puts answer in counts
% gnuplot
% set samples 90
% plot 'counts', 16589*log((x+1)/(x))/log(10)
% this does NOT fit, it is Zipfian, presumably because of selection
% bias to larger events.
% oops, I made a mistake and lost everything from 1.0 downwards!
% but it makes no difference........
% histo.p bins=100 min=-3 max=13 l=1 gamma-energies
% plot '_his' u 1:2 w boxes
%% , '_his' u 1:3 w l
% check what happens if change units by factor of 3
% ~/bin/ignorance.p rescale=3 out=counts3 gamma-energies > digits3
% ha, there really is a selection bias of some sort, or something special about keV!
% now there is under-representation of 1.
% OK, what is going on is nearly all of them are in a peak about 1000 keV, and it is not
% as broad as a factor of 10.
% ~/bin/ignorance2.p gauss*.dat > tdigits ; histo.p c=4 bins=150 min=-33 max=63 l=1 out=t_his tdigits
One way to attack this question is to notice that the units
of $x$ have not been specified. If the half-life of the
neutron were measured in fortnights instead of
seconds, the number $x$ would be divided by $1\,209\,600$;
if it were measured in years, it would be divided by $3\times 10^{7}$.
Now, is our knowledge about $x$, and, in particular,
our knowledge of its first digit, affected by the change in units?
For the expert, the answer is yes; but let us take someone truly ignorant,
for whom the answer is no; their predictions about the first digit
of $x$ are independent of the units.
The arbitrariness of the units corresponds to {\dem\ind{invariance}\/}
of the probability distribution when $x$ is {\em multiplied\/} by any number.
\amarginfig{b}{
% written by logscale.p (c) DJCM Feb 2000
% see ignorance.tex
\setlength{\unitlength}{1.4mm}
\begin{picture}(27,100)(0,0)
\put(0,2){\makebox(0,0)[tl]{\footnotesize{metres}}}
\put(0,3){\vector(0,1){ 101.2500}}
\put(0, 5.0696){\line(1,0){1}}
\put(2, 5.0696){\makebox(0,0)[l]{\footnotesize{1}}}
\put(0, 20.1211){\line(1,0){1}}
\put(2, 20.1211){\makebox(0,0)[l]{\footnotesize{2}}}
\put(0, 28.9257){\line(1,0){1}}
\put(2, 28.9257){\makebox(0,0)[l]{\footnotesize{3}}}
\put(0, 35.1726){\line(1,0){1}}
\put(2, 35.1726){\makebox(0,0)[l]{\footnotesize{4}}}
\put(0, 40.0181){\line(1,0){1}}
\put(2, 40.0181){\makebox(0,0)[l]{\footnotesize{5}}}
\put(0, 43.9772){\line(1,0){1}}
\put(2, 43.9772){\makebox(0,0)[l]{\footnotesize{6}}}
\put(0, 47.3245){\line(1,0){1}}
\put(2, 47.3245){\makebox(0,0)[l]{\footnotesize{7}}}
\put(0, 50.2241){\line(1,0){1}}
\put(2, 50.2241){\makebox(0,0)[l]{\footnotesize{8}}}
\put(0, 52.7818){\line(1,0){1}}
\put(2, 52.7818){\makebox(0,0)[l]{\footnotesize{9}}}
\put(0, 55.0696){\line(1,0){1}}
\put(2, 55.0696){\makebox(0,0)[l]{\footnotesize{10}}}
\put(0, 70.1211){\line(1,0){1}}
\put(2, 70.1211){\makebox(0,0)[l]{\footnotesize{20}}}
\put(0, 78.9257){\line(1,0){1}}
\put(2, 78.9257){\makebox(0,0)[l]{\footnotesize{30}}}
\put(0, 85.1726){\line(1,0){1}}
\put(2, 85.1726){\makebox(0,0)[l]{\footnotesize{40}}}
\put(0, 90.0181){\line(1,0){1}}
\put(2, 90.0181){\makebox(0,0)[l]{\footnotesize{50}}}
\put(0, 93.9772){\line(1,0){1}}
\put(2, 93.9772){\makebox(0,0)[l]{\footnotesize{60}}}
\put(0, 97.3245){\line(1,0){1}}
\put(2, 97.3245){\makebox(0,0)[l]{\footnotesize{70}}}
\put(0, 100.2241){\line(1,0){1}}
\put(2, 100.2241){\makebox(0,0)[l]{\footnotesize{80}}}
\put(18,2){\makebox(0,0)[tl]{\footnotesize{inches}}}
\put(18,3){\vector(0,1){ 101.2500}}
\put(18, 5.4143){\line(1,0){1}}
\put(20, 5.4143){\makebox(0,0)[l]{\footnotesize{40}}}
\put(18, 10.2598){\line(1,0){1}}
\put(20, 10.2598){\makebox(0,0)[l]{\footnotesize{50}}}
\put(18, 14.2189){\line(1,0){1}}
\put(20, 14.2189){\makebox(0,0)[l]{\footnotesize{60}}}
\put(18, 17.5662){\line(1,0){1}}
\put(20, 17.5662){\makebox(0,0)[l]{\footnotesize{70}}}
\put(18, 20.4658){\line(1,0){1}}
\put(20, 20.4658){\makebox(0,0)[l]{\footnotesize{80}}}
\put(18, 23.0234){\line(1,0){1}}
\put(20, 23.0234){\makebox(0,0)[l]{\footnotesize{90}}}
\put(18, 25.3113){\line(1,0){1}}
\put(20, 25.3113){\makebox(0,0)[l]{\footnotesize{100}}}
\put(18, 40.3628){\line(1,0){1}}
\put(20, 40.3628){\makebox(0,0)[l]{\footnotesize{200}}}
\put(18, 49.1674){\line(1,0){1}}
\put(20, 49.1674){\makebox(0,0)[l]{\footnotesize{300}}}
\put(18, 55.4143){\line(1,0){1}}
\put(20, 55.4143){\makebox(0,0)[l]{\footnotesize{400}}}
\put(18, 60.2598){\line(1,0){1}}
\put(20, 60.2598){\makebox(0,0)[l]{\footnotesize{500}}}
\put(18, 64.2189){\line(1,0){1}}
\put(20, 64.2189){\makebox(0,0)[l]{\footnotesize{600}}}
\put(18, 67.5662){\line(1,0){1}}
\put(20, 67.5662){\makebox(0,0)[l]{\footnotesize{700}}}
\put(18, 70.4658){\line(1,0){1}}
\put(20, 70.4658){\makebox(0,0)[l]{\footnotesize{800}}}
\put(18, 73.0234){\line(1,0){1}}
\put(20, 73.0234){\makebox(0,0)[l]{\footnotesize{900}}}
\put(18, 75.3113){\line(1,0){1}}
\put(20, 75.3113){\makebox(0,0)[l]{\footnotesize{1000}}}
\put(18, 90.3628){\line(1,0){1}}
\put(20, 90.3628){\makebox(0,0)[l]{\footnotesize{2000}}}
\put(18, 99.1674){\line(1,0){1}}
\put(20, 99.1674){\makebox(0,0)[l]{\footnotesize{3000}}}
\put(9,2){\makebox(0,0)[tl]{\footnotesize{feet}}}
\put(9,3){\vector(0,1){ 101.2500}}
\put(9, 3.1264){\line(1,0){1}}
\put(11, 3.1264){\makebox(0,0)[l]{\footnotesize{3}}}
\put(9, 9.3734){\line(1,0){1}}
\put(11, 9.3734){\makebox(0,0)[l]{\footnotesize{4}}}
\put(9, 14.2189){\line(1,0){1}}
\put(11, 14.2189){\makebox(0,0)[l]{\footnotesize{5}}}
\put(9, 18.1779){\line(1,0){1}}
\put(11, 18.1779){\makebox(0,0)[l]{\footnotesize{6}}}
\put(9, 21.5253){\line(1,0){1}}
\put(11, 21.5253){\makebox(0,0)[l]{\footnotesize{7}}}
\put(9, 24.4249){\line(1,0){1}}
\put(11, 24.4249){\makebox(0,0)[l]{\footnotesize{8}}}
\put(9, 26.9825){\line(1,0){1}}
\put(11, 26.9825){\makebox(0,0)[l]{\footnotesize{9}}}
\put(9, 29.2704){\line(1,0){1}}
\put(11, 29.2704){\makebox(0,0)[l]{\footnotesize{10}}}
\put(9, 44.3219){\line(1,0){1}}
\put(11, 44.3219){\makebox(0,0)[l]{\footnotesize{20}}}
\put(9, 53.1264){\line(1,0){1}}
\put(11, 53.1264){\makebox(0,0)[l]{\footnotesize{30}}}
\put(9, 59.3734){\line(1,0){1}}
\put(11, 59.3734){\makebox(0,0)[l]{\footnotesize{40}}}
\put(9, 64.2189){\line(1,0){1}}
\put(11, 64.2189){\makebox(0,0)[l]{\footnotesize{50}}}
\put(9, 68.1779){\line(1,0){1}}
\put(11, 68.1779){\makebox(0,0)[l]{\footnotesize{60}}}
\put(9, 71.5253){\line(1,0){1}}
\put(11, 71.5253){\makebox(0,0)[l]{\footnotesize{70}}}
\put(9, 74.4249){\line(1,0){1}}
\put(11, 74.4249){\makebox(0,0)[l]{\footnotesize{80}}}
\put(9, 76.9825){\line(1,0){1}}
\put(11, 76.9825){\makebox(0,0)[l]{\footnotesize{90}}}
\put(9, 79.2704){\line(1,0){1}}
\put(11, 79.2704){\makebox(0,0)[l]{\footnotesize{100}}}
\put(9, 94.3219){\line(1,0){1}}
\put(11, 94.3219){\makebox(0,0)[l]{\footnotesize{200}}}
\end{picture}
\caption[a]{When viewed on a logarithmic scale,
scales using different units are translated
relative to each other.}
}
% ~/bin/logscale.p > ignorance/picture.tex
% ~/bin/logscale.p pic=2 N=1 Y=13 extend=1 > ignorance/line.tex
%From the derivation it must be clear that the law has nothing to do
%with physics, - it it as valid for stock prices, numbers appearing in
%a newspaper, etc. It was first noted by S. Newcomb (in 1881 in Amer
%J. Math. 4, 39(1881)) and investigated in detail by Frank Benford a
%physicist at the General Electric Company (in
%Proc. Amer. Phil. Soc. 78, 551(1938)), and, hence, is known is
%Benford's Law. Only recently, the law received rigorous formulation
%and proof. For detailed description and references see description of
%the law by Eric W. Weisstein. (The above histogram was taken from
%there.) See also the article by Malcolm W. Browne on the applications
%to economics and fraud detection.
If you don't know the units that a quantity
is measured in, the probability of the first digit
must be proportional to the length of the corresponding
piece of logarithmic scale.
The probability that the first digit of a number is {\tt 1} is
thus
\beq
p_{\tt 1} = \frac{ \log 2 - \log 1 }{ \log 10 - \log 1 }
= \frac{ \log 2 }{ \log 10 } .
\eeq
% more generally, digit D has probability log (D+1) - log(D) / log (10)-log(1)
% or log_{10} (1+1/D)
%
Now,%
\amarginfignocaption{t}{
% written by logscale.p (c) DJCM Feb 2000
% see ignorance.tex
\setlength{\unitlength}{1.4mm}
\begin{picture}(13,50)(0,0)
\put(10,-1){\makebox(0,0)[tl]{\footnotesize{}}}
\put(10,0){\line(0,1){ 50.0000}}
\put(10, 0.0000){\line(1,0){1}}
\put(12, 0.0000){\makebox(0,0)[l]{\footnotesize{1}}}
\put(10, 15.0515){\line(1,0){1}}
\put(12, 15.0515){\makebox(0,0)[l]{\footnotesize{2}}}
\put(10, 23.8561){\line(1,0){1}}
\put(12, 23.8561){\makebox(0,0)[l]{\footnotesize{3}}}
\put(10, 30.1030){\line(1,0){1}}
\put(12, 30.1030){\makebox(0,0)[l]{\footnotesize{4}}}
\put(10, 34.9485){\line(1,0){1}}
\put(12, 34.9485){\makebox(0,0)[l]{\footnotesize{5}}}
\put(10, 38.9076){\line(1,0){1}}
\put(12, 38.9076){\makebox(0,0)[l]{\footnotesize{6}}}
\put(10, 42.2549){\line(1,0){1}}
\put(12, 42.2549){\makebox(0,0)[l]{\footnotesize{7}}}
\put(10, 45.1545){\line(1,0){1}}
\put(12, 45.1545){\makebox(0,0)[l]{\footnotesize{8}}}
\put(10, 47.7121){\line(1,0){1}}
\put(12, 47.7121){\makebox(0,0)[l]{\footnotesize{9}}}
\put(10, 50.0000){\line(1,0){1}}
\put(12, 50.0000){\makebox(0,0)[l]{\footnotesize{10}}}
\put(9, 7.5257){\vector(0,1){ 7.5257}}
\put(9, 7.5257){\vector(0,-1){ 7.5257}}
\put(8, 7.5257){\makebox(0,0)[r]{\footnotesize{$P(1)$}}}
\put(9, 26.9795){\vector(0,1){ 3.1234}}
\put(9, 26.9795){\vector(0,-1){ 3.1234}}
\put(8, 26.9795){\makebox(0,0)[r]{\footnotesize{$P(3)$}}}
\put(9, 48.8560){\vector(0,1){ 1.1440}}
\put(9, 48.8560){\vector(0,-1){ 1.1440}}
\put(8, 48.8560){\makebox(0,0)[r]{\footnotesize{$P(9)$}}}
\end{picture}
%\caption[a]
}
$2^{10} = 1024 \simeq 10^3 = 1000$, so
without needing a calculator, we have $10 \log 2 \simeq 3 \log 10$ and
\beq
p_{\tt 1} \simeq \frac{3}{10}.
\eeq
%\index{Frank Benford}
More generally, the probability that
the first digit is $d$ is
\beq
	p_d = \frac{ \log (d+1) - \log d }{ \log 10 - \log 1 }
	= \log_{10} (1+1/d).
\eeq
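Evaluating this expression for each digit gives the following values
(a quick numerical check, with $p_d$ denoting the probability that the first digit is $d$;
values rounded to three decimal places):
\beq
\begin{array}{c|ccccccccc}
d & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 \\ \hline
p_d & 0.301 & 0.176 & 0.125 & 0.097 & 0.079 & 0.067 & 0.058 & 0.051 & 0.046
\end{array}
\eeq
The nine probabilities sum to $\log_{10} 10 = 1$, because the sum of
$\log (d+1) - \log d$ telescopes.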
This observation about initial digits is
\label{sec.whatyouknow}known as \ind{Benford's law}.
% http://www.cut-the-knot.com/do_you_know/zipfLaw.shtml
Ignorance does not correspond to a uniform probability distribution.\index{ignorance}\ENDsolution
% A helpful way to confirm Benford's law is to consider an extreme case --
% imagine that we work in base 2 rather than base 10. What is the
% first digit
%\section{}
\exercisxB{2}{ex.pin}{
A pin is thrown tumbling in the air.
What is the probability distribution of the angle $\theta_1$
between the pin and the vertical at a moment while it is in the air?
The tumbling pin is photographed.
What is the probability distribution of the angle $\theta_3$ between
the pin and the vertical as imaged in the photograph?
% A coin is thrown tumbling in the air.
% What is the probability distribution of the angle $\theta_3$
% between the plane of the coin and the vertical?
}
\exercisxB{2}{ex.recordbreaking}{
{\sf Record breaking}.\index{record breaking}\index{extreme value}
Consider keeping track of the \ind{world record} for
some quantity $x$, say earthquake magnitude,
or distances jumped in the long jump at world championships.
If we assume that attempts to break the record
take place at a steady rate, and if we assume
that the underlying probability distribution
of the outcome $x$, $P(x)$, is not changing -- an assumption
that I think is unlikely to be true in the case of sports
endeavours, but an interesting assumption to consider
nonetheless -- and assuming no knowledge at all about
$P(x)$, what can be predicted about successive
intervals between the dates when records are broken?
}
%
\fakesection{inference-exs}
% I've been asked to include some exercises without worked solutions.
% Here's a start.
\section{The Luria--Delbr\"uck distribution}
\ExercissxC{3C}{ex.luriadelbruck}{
% http://www.asm.org/mbrsrc/archive/pdfs/tline/402luria.pdf
% http://www.mun.ca/biology/scarr/Luria-Delbruck_experiment.htm
In their landmark paper\index{Luria, Salvador}\index{Delbr\"uck, Max}
demonstrating that bacteria could mutate\index{fluctuation analysis}
from virus sensitivity to\index{test!fluctuation}\index{distribution!Luria--Delbr\"uck}
virus resistance, \citeasnoun{luriadelbruck43}
wanted to estimate the \ind{mutation rate}
in an exponentially-growing population
from the total number of mutants found at the end of the experiment.
This problem is difficult because the quantity
measured (the number of mutated bacteria)
has a heavy-tailed\index{tail} probability distribution:
a mutation occurring early in the experiment can give rise
to a huge number of mutants.\index{Bayes' theorem}
% ; as they explain, it involved
Unfortunately, Luria and Delbr\"uck didn't
know \Bayes\ theorem, and their
way of coping with the heavy-tailed distribution involves arbitrary
hacks leading to two different \ind{estimator}s
of the mutation rate. One of these estimators (based on
the mean number of mutated bacteria, averaging over several
experiments) has appallingly large variance, yet sampling
theorists continue to use it and base confidence intervals
around it \cite{kepleroprea01a}.
In this exercise you'll do the inference right.
% made an unfortunate mess of the statistics
% For this exercise, please
In each culture, a single bacterium that is {\em not
resistant\/} gives rise, after $g$ generations,
to $N = 2^g$ descendants, all clones except for
differences arising from mutations.
The final culture is then exposed to a virus,
and the number of resistant bacteria $n$ is measured.
According to the now accepted mutation hypothesis,
these resistant bacteria got their resistance from
random mutations that took place during the growth
of the colony. The mutation rate (per cell per generation),
$a$, is about one in a hundred million.
% , very small number.
The total number of opportunities to
mutate is $N$, since $\sum_{i=0}^{g-1} 2^i \simeq 2^g = N$.
If a bacterium
mutates at the $i$th generation,
its descendants all inherit the mutation, and the
final number of resistant bacteria contributed by
that one ancestor is $2^{g-i}$.
%http://www.accessexcellence.org/AB/BC/Bacterial_Mutations.html
%
% Salvador Luria and Max Delbr\"uck, working together at Cold Spring Harbor during World War II
Given $M$ separate experiments, in each of
which a colony of size $N$ is created, and where
the measured numbers of resistant bacteria are
$\{ n_m \}_{m=1}^M$, what can we infer about
the mutation rate, $a$?
% http://www.mun.ca/biology/scarr/Luria-Delbruck_results.htm
Make the inference given the following
dataset from Luria and Delbr\"uck, for $N=2.4 \times 10^8$:
% $g=17$ generations,
% starting from about 1000 bacteria.
% 2.4e8 = 2**27.8
$\{ n_m \} = \{ 1,0,3,0,0,5,0,5,0,6,107,0,0,0,1,0,0,64,0,35 \}$.
[A small amount of computation is required to solve this problem.]
% a website said....
%Mutation rate (a) can be calculated
% mean # mutations / culture = aN
% Poisson distribution predicts p0 = exp (- a / N)
% where p0 = fraction of cultures with no Tonr mutants
% Rewrite as a = (- ln p0) / N
% and p0 = 11 / 20 = 0.55 from data
% Then a = -ln 0.55 / (0.2 x 10^8) = 3 x 10-8 mutations / cell / generation
% - this value of N (2e7) is smaller than mine by a factor of 10.
% I have gone with 2e8 because it seems to agree with LD paper
% the LD conclusion is 0.32e-8
}
\section{Inferring causation}
\ExercissxA{2}{ex.causation}{
In the Bayesian graphical model community,
the task of inferring which way the arrows point -- that is,
which nodes are parents, and which children --
is one on which much has been written.
Inferring causation is tricky because
of `likelihood equivalence'.
Two graphical models are likelihood-equivalent
if for any setting of the parameters of either, there
exists a setting of the parameters of the other such that
the two joint probability distributions of all observables are
identical.
An example of a pair of likelihood-equivalent models is
$A \rightarrow B$
and
$B \rightarrow A$.
The model $A \rightarrow B$ asserts that
$A$ is the parent of $B$, or, in very sloppy
terminology, `$A$ causes $B$'.
An example of a situation where
`$B \rightarrow A$' is true is
the case where $B$ is the variable `burglar in house'
and $A$ is the variable `alarm is ringing'.
Here it is literally true that $B$ causes $A$.
But this choice of words is confusing if applied to
another example, $R \rightarrow D$, where $R$ denotes
`it rained this morning' and $D$ denotes `the pavement is dry'.
`$R$ causes $D$' is confusing.
% a similar relationship exists between
%
I'll therefore use the words
`$B$ is a parent of $A$' to denote causation.
% so we have an agreement of colloquial use.
% But
Some statistical methods that use the likelihood alone
are unable to use data to distinguish between
likelihood-equivalent models.\index{likelihood equivalence}
In a Bayesian approach, on the other hand, two likelihood-equivalent
models may nevertheless be somewhat distinguished, in the light
of data, since likelihood-equivalence does not force a Bayesian
to use priors that assign equivalent densities over the two
parameter spaces of the models.
However, many Bayesian graphical modelling folks,
perhaps out of sympathy for their non-Bayesian colleagues,
or from a latent urge not to appear different from them,
deliberately discard this potential advantage of Bayesian methods
-- the ability to infer causation from data -- by skewing their
models so that the ability goes away;
a widespread orthodoxy holds that one should identify
the choices of prior for which `\ind{prior equivalence}' holds,
\ie, the priors such that models that are likelihood-equivalent
also have identical posterior probabilities,
% the Bayesian is incapable
% of inferring causation,
and then one should use one of those priors
in inference and prediction.
This argument motivates the use, as the prior over all probability
vectors, of specially-constructed
% {\dem{additive}\/}
Dirichlet distributions.
%(and {\em not\/} uniform Dirichlet distributions).
%
%(Definition: let the `Dirichlet mass' associated with a prior
%be the product of its associated hyperparameters.
%The prior is additive if the mass of the prior on a marginal
%is eual to the sum of the masses on the joints)
In my view it is a philosophical error
to
% believe that one should
use only those priors such that causation
cannot be inferred. Priors should be set to describe one's assumptions;
when this is done, it's likely that interesting inferences about causation
{\em can\/} be made from data.
In this exercise, you'll make an example of such an inference.
Consider the toy problem where $A$ and $B$ are
binary variables.
The two models are
$\H_{A \rightarrow B}$
and $\H_{B \rightarrow A}$.
$\H_{A \rightarrow B}$
asserts that the marginal probability of $A$
comes from a beta distribution with parameters $(1,1)$,
\ie, the uniform distribution;
and that the two conditional distributions
$P(b \given a \eq 0)$ and
$P(b \given a \eq 1)$
also come independently from beta distributions with parameters $(1,1)$.
The other model assigns similar priors to
the marginal probability of $B$
and the conditional distributions of $A$ given $B$.
Data are gathered,
and the counts, given $F=1000$ outcomes, are
\beq
\begin{array}{cccc}
& a \eq 0 & a \eq 1 \\
b \eq 0 & 760 & 5 & \multicolumn{1}{|c}{765 }\\
b \eq 1 & 190 & 45 & \multicolumn{1}{|c}{235 }\\ \cline{2-3}
& 950 & 50
\end{array}
\eeq
What are the posterior probabilities of the
two hypotheses?
\begin{aside}
Hint: it's a good idea to work this exercise out symbolically
in order to spot all the simplifications that emerge.
\beq
\Psi(x) =
\frac{\d}{\d x} \ln \Gamma(x) \simeq \ln(x) - \frac{1}{2x} + O(1/x^2) .
\label{digam_1}
\eeq
\end{aside}
The topic of inferring causation is a complex one.
The fact that Bayesian inference can sensibly be used
to infer the directions of arrows in graphs seems to be a neglected
view, but it is certainly not the whole story.
See \citeasnoun{pearl2000} for discussion of many other aspects
of causality.
% including interventions
% and counterfactuals.
}
%%%%%%%%%%%%% further stuff removed to cutsolutions.tex
\section{Further exercises}
\exercisxC{3}{ex.poissonoscillate}{
Photons arriving at a \index{photon counter}photon detector are believed
to be emitted as a \ind{Poisson process} with a time-varying rate,
\beq
\l(t) = \exp ( a + b \sin (\omega t + \phi ) ) ,
\eeq
where the parameters $a$, $b$, $\omega$, and $\phi$
are unknown. Data are collected during the time $t=0\ldots T$.
Given that $N$ photons arrived at times $\{ t_n \}_{n=1}^N$,
discuss the inference of $a$, $b$, $\omega$, and $\phi$.
[Further reading: \citeasnoun{Gregory_Loredo}.]
}
\fakesection{Exercises Harry}
% stolen from bayes_intermediate.tex
%
\exercisaxB{2}{ex.harrydata}{
A data file consisting of two columns of numbers has been printed
in such a way that the boundaries between the columns are unclear.
Here are the resulting strings.
% two columns that have run together.
\begin{center}
\begin{tabular}{r|r|r|r|r|r}
891.10.0 &
912.20.0 &
874.10.0 &
870.20.0 &
836.10.0 &
861.20.0 \\
903.10.0 &
937.10.0 &
850.20.0 &
916.20.0 &
899.10.0 &
907.10.0 \\
924.20.0 &
861.10.0 &
899.20.0 &
849.10.0 &
887.20.0 &
840.10.0 \\
849.20.0 &
891.10.0 &
916.20.0 &
891.10.0 &
912.20.0 &
875.10.0 \\
898.20.0 &
924.10.0 &
950.20.0 &
958.10.0 &
971.20.0 &
933.10.0 \\
966.20.0 &
908.10.0 &
924.20.0 &
983.10.0 &
924.20.0 &
908.10.0 \\
950.20.0 &
911.10.0 &
913.20.0 &
921.25.0 &
912.20.0 &
917.30.0 \\
923.50.0 &
& & & & \\
\end{tabular}
\end{center}
Discuss how probable it is, given these data, that the correct
parsing of each item is:
\ben
\item $891.10.0 \rightarrow 891. \:\: 10.0$, etc.
\item $891.10.0 \rightarrow 891.1 \:\: 0.0$, etc.
\een
%\marginpar[c]{\footnotesize
[A parsing of a string is
a grammatical\index{parse}\index{Punch}
% explanation
interpretation of the string. For
example, `Punch bores' could be parsed as
`Punch (noun) bores (verb)', or `Punch (imperative verb) bores (plural noun)'.]
%`I said the honorable member is a liar, it is true,
% and I regret it'
}
%
%
%
\exercisaxB{2}{ex.biexp}{
In an experiment, the measured quantities $\{x_n\}$
come independently
from a \ind{biexponential distribution} with mean $\mu$,
\[
P(x \given\mu ) = \frac{1}{Z}
\exp \! \left( - \left| x - \mu \right| \right) ,
\]
where $Z$ is the normalizing constant, $Z=2$.
The mean $\mu$ is not known. An example
of this distribution, with $\mu=1$, is shown in \figref{fig.biexpex}.
\amarginfig{t}{
\[
%\mbox{\raisebox{0.84in}{$P(x \given\mu=1)$} \psfig{figure=figs/biexp.ps,width=2in,angle=-90}}
\mbox{\psfig{figure=figs/biexp.ps,width=2in,angle=-90}}
\]
\caption[a]{The biexponential distribution $P(x \given \mu=1)$.}
\label{fig.biexpex}
}
Assuming the four datapoints are
\[
\{ x_n \} = \{ 0, 0.9 , 2, 6 \} ,
% \{ x_n \} = \{ 0.8, 0.9 , 1.1, 1.7 \} ,
\mbox{\hspace*{1in}\raisebox{-3mm}[9mm][0mm]{%
\psfig{figure=figs/biexpdat.ps,width=2.3in,angle=-90}}}
\]
what do these data tell us about $\mu$?
Include detailed sketches in your answer.
Give a range of plausible values of $\mu$.
}
%\subsection*{Counting combinations}
%\input{tex/profhinton.tex}
\section{Solutions}
% for inference-exs.tex
% see also book/blue.tex
\soln{ex.luriadelbruck}{
A population of size $N$ has $N$ opportunities to mutate.
The probability of the number of mutations that occurred,
$r$, is roughly Poisson
\beq
P(r\given a,N) = e^{-aN} \frac{(aN)^r}{r!} .
\eeq
(This is slightly inaccurate because the descendants
of a mutant cannot themselves undergo the same mutation.)
Each mutation gives rise to a number of final
mutant cells $n_i$ that depends on the generation time
of the mutation.
If multiplication went like clockwork then
the probability of $n_i$ being 1 would be
$1/2$, the probability of 2 would be $1/4$,
the probability of $4$ would be $1/8$,
and $P(n_i) = 1/(2n_i)$ for all $n_i$ that are powers
of two. But we don't expect the mutant progeny to
divide in exact synchrony, and we don't know the
precise timing of the end of the experiment compared to
the division times. A smoothed version of this distribution
that permits all integers to occur is
\beq
P(n_i) = \frac{1}{Z}\frac{1}{n_i^2} ,
\label{eq.Pni}
\eeq
where $Z= \pi^2/6 =1.645$.
[This distribution's moments are all wrong, since
$n_i$ can never exceed $N$, but who cares about moments? -- only sampling
theory statisticians who are barking up the wrong tree, constructing
`\ind{unbiased estimator}s' such as $\hat{a} = (\bar{n}/N) / \log N$.
The error that we introduce in the likelihood function by using the
approximation to $P(n_i)$ is negligible.]
% This smoothing crudely takes account
% of the fact that we don't know whether
The observed number of mutants $n$ is the sum
\beq
n = \sum_{i=1}^r n_i .
\eeq
The probability distribution of $n$ given $r$
is the convolution of $r$ identical distributions of the form (\ref{eq.Pni}).
For example,
\beq
P(n\,|\, r\eq 2) = \sum_{n_1 = 1}^{n-1}
\frac{1}{Z^2} \frac{1}{n_1^2}\frac{1}{(n-n_1)^2} \:\: \mbox{for $n\geq 2$}.
\eeq
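As a numerical check on this convolution (a sketch using $Z = \pi^2/6$ from above;
values rounded to two significant figures), the first two terms are
\beq
	P(n \eq 2 \given r \eq 2) = \frac{1}{Z^2} \simeq 0.37 ,
	\:\:\:\:\:
	P(n \eq 3 \given r \eq 2) = \frac{1}{Z^2}
	\left( \frac{1}{1^2}\frac{1}{2^2} + \frac{1}{2^2}\frac{1}{1^2} \right)
	= \frac{1}{2 Z^2} \simeq 0.18 .
\eeq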
The probability distribution of $n$ given
$a$, which
is what we need for the Bayesian inference, is
given by summing
% $P(n \given r) P(r \given a)$
over $r$.
\beq
P(n \given a) = \sum_{r=0}^{N} P(n \given r) P(r \given a,N) .
\eeq
This quantity can't be evaluated analytically, but for small $a$,
it's easy to evaluate to any desired numerical precision
by explicitly summing over $r$ from $r=0$ to some $r_{\max}$,
with $P(n \given r)$ also being found for each $r$ by $r_{\max}$ explicit convolutions
for all required values of $n$. If $r_{\max} = n_{\max}$,
the largest value of $n$ encountered in the data, then $P(n \given a)$
is computed exactly, since every mutation contributes at least one mutant
and so $P(n \given r) = 0$ for $r > n$. But
for this question's data, $r_{\max} = 9$ is
plenty for an accurate result; I used $r_{\max}=74$ to make the graphs
in \figref{fig.luria}. {\tt{Octave}} source code is
available.\footnote{{\tt{www.inference.phy.cam.ac.uk/itprnn/code/octave/luria0.m}}}
% www.inference.phy.cam.ac.uk/itila/code/octave/luria0.m
\amarginfig{t}{
\begin{center}
\begin{tabular}{ccc}
&&\mbox{\psfig{figure=figs/luria.ps,width=1.52in,angle=-90}}\\[0.15in]
&&\mbox{\psfig{figure=figs/lurial.ps,width=1.52in,angle=-90}}\\[0.15in]
\end{tabular}
\end{center}
\caption[a]{Likelihood of the mutation rate $a$
on a linear scale and log scale,
 given Luria and Delbr\"uck's data.
Vertical axis: likelihood/$10^{-23}$;
horizontal axis: $a$.
}
\label{fig.luria}
}
% end figure
Incidentally, for data sets like the one in this exercise, which
have a substantial number of zero counts,
very little is lost by making Luria and Delbr\"uck's second
approximation, which is to retain only the count of
how many $n$ were equal to zero, and how many were non-zero.
The likelihood function found using
this weakened data set,
\beq
L(a) = (e^{-aN})^{11} (1-e^{-aN} )^9 ,
\eeq
is scarcely distinguishable
from the likelihood computed using full information.
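The maximum of this weakened likelihood can be found by hand
(a short check, writing $u = aN$ for convenience; the symbol $u$ is introduced
here only for this calculation).
Setting the derivative of $\ln L = -11 u + 9 \ln ( 1 - e^{-u} )$
with respect to $u$ to zero gives
\beq
	-11 + \frac{ 9 e^{-u} }{ 1 - e^{-u} } = 0
	\:\:\: \Rightarrow \:\:\:
	e^{-u} = \frac{11}{20} ,
\eeq
so $aN = \ln (20/11) \simeq 0.60$ and, with $N = 2.4 \times 10^8$,
the maximum-likelihood value is $a \simeq 2.5 \times 10^{-9}$ per cell per generation.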
% see luria.gnu
% also discussed wth
%From stavare@gnome.usc.edu Fri Jun 15 22:25:32 2001
% Subject: Luria Delbruck
%I promised you a list of papers on the LD law. There have been a couple of=
%=20
%new ones recently in the math bio literature:
%Angerer W.P.
% An explicit representation of the Luria-Delbruck distribution. J. Math. Biol. 2001 Feb;42(2):145-74.
%W. P. Angerer (2001) J Math Biol, 42, 145-174.
%http://www.univie.ac.at/Krebsforschung/molgen/molgen.htm
%Q. Zheng (1999) Math Biosci, 162, 1-32
% Zheng, Q. (1999). Progress of a half century in the study of the Luria-Delbr?uck distrib
% ution, Mathematical Biosciences 162:1-32
%and an unpublished one by Tom Kepler at
%
%http://www.santafe.edu/~mihaela/estimation.html
% now published!
%Kepler, T.B. & Oprea, M. Improved inference of mutation rates: I. An integral representation of the Luria-Delbruck distribution.
%Theoretical Population Biology 59, 41-48 (2001) (pdf).
%
%Oprea, M. & Kepler, T.B. Improved inference of mutation rates: II. Generalization of the Luria-Delbruck distribution for realistic cell-cycle time distributions.
%Theoretical Population Biology 59, 49-59 (2001) (pdf).
%
% These idiots are still using the ``unbiased estimator'' a = n/(2 log(N)) !!???
}
\soln{ex.causation}{
From the six terms of the form
\beq
P(\bF \given \a\bm) =
\frac{ \prod_i \Gamma (F_i + \a m_i) }{ \Gamma(\sum_i F_i + \a ) }
\frac{ \Gamma( \a ) }{ \prod_i \Gamma (\a m_i) } ,
\label{lang.evidence}
\eeq
most factors cancel and
all that remains is
\beq
\frac{
P(\H_{A \rightarrow B} \given \mbox{Data} ) }{ P( \H_{B \rightarrow A} \given \mbox{Data} ) }
= \frac{ (765+1)( 235+1) }{ (950+1)(50+1) }
 \simeq \frac{3.7}{1} .
\eeq
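The cancellation can be made explicit (a sketch, assuming equal prior probabilities
for the two hypotheses; $F_0$ and $F_1$ here denote the two counts entering any one term).
With a uniform prior over a binary probability, each term of the form
(\ref{lang.evidence}) reduces to $F_0! \, F_1! / (F_0 + F_1 + 1)!$, so
\beq
	P( \mbox{Data} \given \H_{A \rightarrow B} ) =
	\frac{ 950! \, 50! }{ 1001! } \,
	\frac{ 760! \, 190! }{ 951! } \,
	\frac{ 5! \, 45! }{ 51! } ,
\eeq
\beq
	P( \mbox{Data} \given \H_{B \rightarrow A} ) =
	\frac{ 765! \, 235! }{ 1001! } \,
	\frac{ 760! \, 5! }{ 766! } \,
	\frac{ 190! \, 45! }{ 236! } .
\eeq
In the ratio, the factor $1001!$ and the four cell factorials cancel, leaving
\beq
	\frac{ 950! \, 50! }{ 951! \, 51! } \times \frac{ 766! \, 236! }{ 765! \, 235! }
	= \frac{ 766 \times 236 }{ 951 \times 51 } .
\eeq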
There is modest evidence in favour of $\H_{A \rightarrow B}$ because
the three probabilities inferred for that hypothesis (roughly 0.95, 0.8, and 0.1)
are more typical of the prior than are the three probabilities
inferred for the other (0.24, 0.008, and 0.19).
This statement sounds absurd if we think of the priors as `uniform'
over the three probabilities -- surely, under a uniform prior,
any settings of the probabilities are equally probable? But in the natural basis,
the logit basis, the prior is proportional to $p(1-p)$,
and the posterior probability ratio can be estimated
by
\beq
\frac{ 0.95 \times 0.05 \times 0.8 \times 0.2 \times 0.1 \times 0.9 }
{ 0.24 \times 0.76 \times 0.008 \times 0.992 \times 0.19 \times 0.81 }
% pr 0.95 * 0.05 * 0.8 * 0.2 * 0.1 * 0.9 / ( 0.24 * 0.76 * (6*761/(767*767.0)) * 0.19 * 0.81 )
\simeq \frac{3}{1} ,
\eeq
which is not exactly right, but it does illustrate where the
preference for ${A \rightarrow B}$ is coming from.
}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%5
\dvips
% {Decision theory}
\chapter{Decision Theory}
\label{ch.decision}
Decision theory is\indexs{decision theory}
trivial, apart from computational details (just like playing chess!).
You have
% One has
a choice of various actions, $a$.
The world may be in one of many states $\bx$;
which one occurs may be influenced by your action.
The world's state has a probability
distribution $P(\bx \given a)$.
Finally, there is a \ind{utility} function $U(\bx,a)$ which
specifies the payoff you receive when the world is
in state $\bx$ and you chose action $a$.
The task of decision theory is to select the action that
maximizes the expected utility,
\beq
\Exp[U \given a] = \int \d^K\bx \:\, U(\bx,a) P(\bx \given a) .
\eeq
That's all.
The
% The solution to this task is, tautologically, to solve the
computational
problem is to
% of maximizing
maximize $\Exp[U \given a]$ over $a$.
[Pessimists may prefer to define a loss function $L$
instead of a utility function $U$ and
% , then they can
minimize the expected loss.]
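
 As a toy illustration of this recipe (the scenario and all the numbers
 are hypothetical, chosen only to make the arithmetic transparent),
 suppose the world has two states, rain and dry, with $P(\mbox{rain}) = 0.3$
 whatever you do, and two actions: carry an umbrella ($a \eq 1$),
 with utility $-1$ in either state, or leave it at home ($a \eq 0$),
 with utility $-5$ if it rains and $0$ if it is dry.
 The integral becomes a sum over the two states:
\beq
	\Exp[U \given a \eq 1] = 0.3 \times (-1) + 0.7 \times (-1) = -1 ,
	\:\:\:\:\:
	\Exp[U \given a \eq 0] = 0.3 \times (-5) + 0.7 \times 0 = -1.5 ,
\eeq
 so the action that maximizes the expected utility is to carry the umbrella.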
% instead of maximizing the expected utility.]
Is there anything more to be said about decision theory?
Well, in a real problem, the choice of an appropriate utility
function may be quite difficult.
Furthermore, when a sequence of actions is to be taken, with each action
providing information about $\bx$, we have to take into
account the effect that this anticipated information
may have on our subsequent actions.
% -- moves in a board game, for example --
The resulting mixture of forward probability and inverse probability
computations
% detailed
% way in which the computations
in a decision problem is distinctive.
% use probability distributions is a little different from the manipulations we are used to in inference.
In a realistic problem such as playing a board game,\index{game-playing}\index{do the right thing}
the tree of possible cogitations
and actions that must be considered becomes enormous, and `doing\index{do the right thing}
the right thing' is not simple, because the expected
utility of an action cannot be computed exactly
\cite{Russell_Wefald,baum93best,baum97bayesian}.
Let's explore an example.
\section{Rational prospecting}
Suppose you have the task of choosing the site for a \ind{Tanzanite}
{mine}.\index{mine (hole in ground)}\index{prospecting}
% \marginpar{\footnotesize{Tanzanite is a mineral found in East Africa.}}.
Your final
action will be to select the site from a list of $N$ sites. The $n$th
site has a net value called the return $x_n$ which is initially unknown,
and will be found out exactly only after site $n$ has been chosen.
[$x_n$ equals the revenue
earned from selling the Tanzanite from that site, minus the costs
of buying the site, paying the staff, and so forth.]
At the outset, the return $x_n$ has a probability distribution $P(x_n)$,
based on the information already available.
Before you take your final action you have the opportunity
to do some \ind{prospecting}.
Prospecting at the $n$th site has a cost $c_n$ and yields
data $d_n$ which reduce the uncertainty about $x_n$.
[We'll assume that the
% returns
returns of the $N$ sites are unrelated
to each other, and that prospecting at one site only yields
information about that site and doesn't affect the return
from that site.]
Your decision problem is:
\begin{quotation}
\noindent given the initial probability
distributions $P(x_1)$, $P(x_2)$, \ldots, $P(x_N)$, first, decide
whether to prospect, and at which sites; then,
in the light of your prospecting results,
choose which site to mine.
\end{quotation}
For simplicity, let's make everything in the problem Gaussian\marginpar{\small\raggedright{The
notation $P(y) = \Normal ( y ; \mu , \sigma^2 )$ indicates
 that $y$ has a Gaussian distribution with mean $\mu$ and variance
$\sigma^2$.}}
and focus on the question of whether to prospect once or not.
We'll assume our utility function is linear in $x_n$; we
wish to maximize our expected return.
The utility function is
\beq
U = x_{{n_a}},
\eeq
if no prospecting is done, where ${n_a}$ is the chosen `action' site;
and, if prospecting is done, the utility is
\beq
U = - c_{{n_p}} + x_{{n_a}},
\eeq
where ${n_p}$ is the site at which prospecting took place.
The prior distribution of the return
% value
of site $n$ is
\beq
P(x_n) = \Normal ( x_n ; \mu_n , \sigma^2_n ) .
\eeq
If you prospect at site $n$, the datum $d_n$ is a noisy version of $x_n$:
\beq
P(d_n \given x_n) = \Normal ( d_n ; x_n , \sigma^2 ) .
\eeq
\exercisxB{2}{ex.decisionPd}{
Given these assumptions, show that the prior probability distribution
of $d_n$ is
\beq
P(d_n) = \Normal ( d_n ; \mu_n , \sigma^2 \!+\! \sigma^2_n )
% \:\:\:\mbox{(mnemonic: variances add]}
\label{eq.decision.dn}
\eeq
(mnemonic: when independent variables add, variances add),
and that the posterior distribution of $x_n$ given $d_n$ is
\beq
P(x_n \given d_n) = \Normal \left( x_n ; \mu'_n , {\sigma^2_n}'
\right)
\eeq
where
\beq
\mu'_n
= \frac{ \linefrac{d_n}{\sigma^2} + \linefrac{\mu_n}{\sigma_n^2} }
{ \linefrac{1}{\sigma^2} + \linefrac{1}{\sigma_n^2} }
\:\:\:\:\:\mbox{and}\:\:\:\:\:
\frac{1}
{{\sigma^2_n}'}
= \frac{1}{\sigma^2} + \frac{1}{\sigma_n^2}
%\:\:\:\mbox{[mnemonic: precisions add]}
\label{newmu.decision}
\eeq
(mnemonic: when Gaussians multiply, precisions add).
}
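The two mnemonics in the exercise above are easy to check numerically.
The following Python sketch is an illustration only: the function names
\verb|gaussian_posterior| and \verb|predictive_variance|, their argument names, and the
numbers in the usage line are assumptions made here, not part of the problem statement.
\begin{verbatim}
```python
def gaussian_posterior(mu_n, var_n, d_n, var_noise):
    """Posterior mean and variance of x_n after observing the datum d_n.

    Prior:      x_n ~ Normal(mu_n, var_n)
    Likelihood: d_n ~ Normal(x_n, var_noise)
    """
    precision = 1.0 / var_noise + 1.0 / var_n        # precisions add
    mean = (d_n / var_noise + mu_n / var_n) / precision
    return mean, 1.0 / precision


def predictive_variance(var_n, var_noise):
    """Prior variance of d_n: independent variables add, so variances add."""
    return var_noise + var_n


if __name__ == "__main__":
    # Hypothetical numbers: prior mean 0, prior variance 4, noise variance 1.
    print(gaussian_posterior(0.0, 4.0, 2.0, 1.0))
    print(predictive_variance(4.0, 1.0))
```
\end{verbatim}
With a broad prior and a precise measurement, the posterior mean sits close to the datum;
with a noisy measurement it stays close to the prior mean.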
To start with let's evaluate the expected utility if
we do no prospecting (\ie, choose the site immediately); then
we'll evaluate the expected utility
if we first prospect at one site and then make our choice.
From these two results we will be able to decide whether
to prospect once or zero times, and, if we prospect once,
at which site.
So, first we consider the expected
utility without any prospecting.
\exercisxA{2}{ex.decisionUa}{
Show that the optimal action, assuming no prospecting,
is to select the site with biggest mean
% HELP!!
% want the over n to be underneath!
\beq
{n_a} = \argmax_{n} \, \mu_n,
\label{opt.n}
\eeq
and the expected utility of this action is
\beq
\Exp[U \given {\mbox{optimal $n$}}] = \max_n \mu_n .
\eeq
[If your intuition says `surely the optimal decision
should take into account the different uncertainties $\sigma_n$
too?', the answer is `that is reasonable; but if so, the utility function
should be {\em nonlinear\/} in $x$'.]
}
Now the exciting bit. Should we prospect?
Once we have prospected at site ${n_p}$, we will choose the site
using the decision rule (\ref{opt.n}) with the
prior mean $\mu_{{n_p}}$ replaced by
the updated mean $\mu'_{{n_p}}$
given by (\ref{newmu.decision}).
What makes the problem
% (mildly)
exciting
is that we don't yet know the value of $d_n$,
so we don't know what our action ${n_a}$ will be; indeed
the whole value of doing the prospecting
comes from the fact that the outcome $d_n$
may alter the action from the one that we would have taken
in the absence of the experimental information.
From the expression for the new mean in terms of $d_n$
(\ref{newmu.decision}), and the known variance of
$d_n$ (\ref{eq.decision.dn}), we can compute
the probability distribution of the key quantity,
$\mu'_n$, and can work out the expected
utility by integrating over all possible outcomes
and their associated actions.
\exercisxA{2}{ex.decisionPm}{
Show that the probability distribution of the new mean $\mu'_n$
(\ref{newmu.decision}) is Gaussian with mean $\mu_n$
and variance
\beq
% \Normal( \mu_n
s^2 \equiv
\sigma^2_n \frac{ \sigma^2_n }{ \sigma^2 + \sigma^2_n } .
\eeq
}
Consider prospecting at site $n$. Let the biggest mean of the other
sites be $\mu_1$.
When we obtain the new value of the mean, $\mu'_n$,
we will choose site $n$ and get an expected return of $\mu'_n$
if $\mu'_n > \mu_1$, and we will choose site 1 and get an expected return of $\mu_1$
if $\mu'_n < \mu_1$.
So the expected utility of prospecting at site $n$, then picking the best site, is
\beq
\Exp[U \given \mbox{prospect at $n$}]
= - c_n
+ P( \mu'_n < \mu_1 ) \, \mu_1
+ \int_{\mu_1}^{\infty} \d\mu'_n \: \mu'_n \,
\Normal( \mu'_n ; \mu_n , s^2 ) .
\label{eq.before}
\eeq
The difference in utility between prospecting and not prospecting
is the quantity of interest, and it depends on
what we would have done without prospecting;
and that depends on whether $\mu_1$ is bigger than $\mu_n$.
%If $\mu_1$ is not only the biggest
% of the rest, but is also bigger than $\mu_n$, then
% we would have chosen $\mu_1$; if $\mu_n$, we would
% have chosen $n$.
\beqan
{
\Exp[U \given \mbox{no prospecting}]
}
&=&
\left\{ \begin{array}{cc}
 \mu_1
	   &  \mbox{if $\mu_1 \geq \mu_n$}
\\
 \mu_n & \mbox{if $\mu_1 \leq \mu_n$} .
\end{array}
\right.
\eeqan
So
\beqan
\lefteqn{
\Exp[U \given \mbox{prospect at $n$}]
-
\Exp[U \given \mbox{no prospecting}]
} \nonumber \\
&=&
\left\{ \begin{array}{cc}
-c_n
+ \displaystyle \int_{\mu_1}^{\infty} \d\mu'_n \:
(\mu'_n - \mu_1 ) \,
\Normal( \mu'_n ; \mu_n , s^2 )
& \mbox{if $\mu_1 \geq \mu_n$}
\\
-c_n
+\displaystyle \int_{-\infty}^{\mu_1} \d\mu'_n
\: ( \mu_1 - \mu'_n ) \,
\Normal( \mu'_n ; \mu_n , s^2 )
& \mbox{if $\mu_1 \leq \mu_n$} .
\end{array}
\right.
\eeqan
We can plot the change in expected utility due to
% produced by
prospecting (omitting $c_n$)
as a function of
the difference $(\mu_n - \mu_1)$ (horizontal axis)
% between the prospected site's mean and the best of the rest;
and the initial standard deviation $\sigma_n$ (vertical axis).
In the figure the noise variance
% in the data
is $\sigma^2=1$.
%\begin{figure}
%\figuremargin{
\marginfig{\small
\begin{center}
\setlength{\unitlength}{0.875mm}
{
\begin{picture}(70,49)
\put(-20,-10){\makebox(0,0)[bl]{\psfig{figure=decision/utility.ps,width=3.5in,angle=-90}}}
%\put(65,35){\makebox(0,0)[l]{$\sigma_n$}}% sensible location to the right hand side but falls off right of page
\put(60,46.7){\makebox(0,0)[b]{$\sigma_n$}}
\put(30,2){\makebox(0,0){$(\mu_n - \mu_1)$}}
\end{picture}
}
\end{center}
%}{
\caption[a]{Contour plot of the gain in expected utility due to prospecting.
The contours are equally spaced from 0.1 to 1.2 in steps of 0.1.
To decide whether it is worth prospecting at site $n$,
find the contour equal to $c_n$ (the cost of prospecting);
all points $[(\mu_n \!-\! \mu_1),\sigma_n]$
above that contour are worthwhile.
}
}
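Both cases of the difference in expected utility reduce to a single closed form.
Writing $\delta = |\mu_n - \mu_1|$, and letting $\phi$ and $\Phi$ denote the
standard normal density and cumulative distribution (notation introduced here
only for this illustration), a standard Gaussian integral gives the gain as
$-c_n + s\,\phi(\delta/s) - \delta\,\Phi(-\delta/s)$.
The Python sketch below evaluates this expression; the function names and the
choice to pass variances rather than standard deviations are assumptions of the sketch.
\begin{verbatim}
```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative distribution."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def prospecting_gain(mu_n, mu_1, var_n, var_noise, cost=0.0):
    """Expected utility of prospecting at site n minus that of not prospecting.

    mu_1 is the biggest prior mean among the other sites; var_n must be positive.
    """
    # Standard deviation s of the updated mean before the datum is seen.
    s = math.sqrt(var_n * var_n / (var_noise + var_n))
    delta = abs(mu_n - mu_1)
    # The two branches of the piecewise integral share this closed form.
    return -cost + s * normal_pdf(delta / s) - delta * normal_cdf(-delta / s)


if __name__ == "__main__":
    # Noise variance 1, as in the figure; the other numbers are hypothetical.
    for diff in (-2.0, 0.0, 2.0):
        print(diff, prospecting_gain(diff, 0.0, var_n=4.0, var_noise=1.0))
```
\end{verbatim}
The gain before costs depends only on $|\mu_n - \mu_1|$ and on $s$, so it is symmetric
about $\mu_n = \mu_1$ and largest there; prospecting is worthwhile when the value
returned with \verb|cost| set to $c_n$ is positive.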
%\end{figure}
\section{Further reading}
If the world in which we act is a little more
complicated than the prospecting problem -- for example,
if multiple iterations of prospecting are possible, and the
cost of prospecting is uncertain -- then finding the optimal
balance between \index{explore}\index{exploit}exploration
and exploitation becomes a much harder computational problem.
{\dem\index{reinforcement learning}{Reinforcement learning}\/} addresses approximate
methods for this problem \cite{sutton98reinforcement}.
\section{Further exercises}
% Perhaps a simple example is worth exploring.
\exercisxB{2}{ex.threedoorsmulti}{
{\sf The four doors problem}.
A new \ind{game show}\index{game!three doors} uses rules similar to
those of the three doors (\exerciseref{ex.3doors}),
but there are four doors, and
the host explains:
`First you will point to one of the doors,
and then I will open one of the other doors, guaranteeing to
choose a non-winner.
Then you decide whether to stick with your original pick
or switch to one of the remaining doors.
Then I will open another
% (other than the current pick)
non-winner (but never the current pick). You will then make your final decision by
sticking with the door picked on the previous
decision or by switching to the only other remaining door.'
What is the optimal strategy? Should you switch on the
first opportunity? Should you switch on the
second opportunity?
}
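If you want to check your reasoning on the four doors problem, the game is small enough
to enumerate exactly. The Python sketch below is one possible encoding, under assumptions
that are not in the problem statement: the host picks uniformly at random whenever he has
a choice, and a player who switches at the first opportunity picks uniformly between the
two other closed doors. The function name \verb|win_probability| is hypothetical.
\begin{verbatim}
```python
from fractions import Fraction


def win_probability(switch_first, switch_second, n_doors=4):
    """Exact winning probability for a (first decision, second decision) strategy."""
    doors = range(n_doors)
    first_pick = 0                       # by symmetry, the first pick is door 0
    total = Fraction(0)
    for prize in doors:                  # the prize is equally likely behind each door
        p_prize = Fraction(1, n_doors)
        # Host opens a non-winner other than the current pick.
        host1 = [d for d in doors if d not in (first_pick, prize)]
        for h1 in host1:
            p_h1 = Fraction(1, len(host1))
            if switch_first:
                picks = [d for d in doors if d not in (first_pick, h1)]
            else:
                picks = [first_pick]
            for pick in picks:
                p_pick = Fraction(1, len(picks))
                # Host opens another non-winner, never the current pick.
                host2 = [d for d in doors if d not in (pick, h1, prize)]
                for h2 in host2:
                    p_h2 = Fraction(1, len(host2))
                    if switch_second:
                        # Exactly one other door is still closed.
                        final = next(d for d in doors if d not in (pick, h1, h2))
                    else:
                        final = pick
                    if final == prize:
                        total += p_prize * p_h1 * p_pick * p_h2
    return total


if __name__ == "__main__":
    for first in (False, True):
        for second in (False, True):
            print(first, second, win_probability(first, second))
```
\end{verbatim}
Running the loop prints the winning probability of each of the four strategies, which
you can compare with your own answer.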
% [Ans: should stick first then switch, and have a prob of 0.75 of winning]
% this ex illustrates that a greedy strategy is not optimal -- don't switch
% immediately.
%
\exercisaxB{3}{ex.utils}{
One of the challenges of decision theory is
figuring out exactly what the utility function is.
The utility of money, for example, is notoriously nonlinear for most people.
% http://www.statslab.cam.ac.uk/~rrw1/stats/digress.ps
In fact, the behaviour of many people cannot be captured by
a coherent utility function, as illustrated by
 the {\dem\ind{Allais paradox}}, which runs as follows.\index{paradox!Allais}
\begin{quote}
Which of these choices do you find most attractive?
\begin{center}
\begin{tabular}{rp{3in}}
A. &$\pounds 1$ million guaranteed. \\
B. &89\% chance of $\pounds$1 million;\\
& 10\% chance of $\pounds$2.5 million;\\
& 1\% chance of nothing.\\
\end{tabular}
\end{center}
Now consider these choices:
\begin{center}
\begin{tabular}{rp{3in}}
C. &89\% chance of nothing; \\
&11\% chance of $\pounds$1 million. \\
D. &90\% chance of nothing; \\
&10\% chance of $\pounds$2.5 million. \\
\end{tabular}
\end{center}
\end{quote}
Many people prefer A to B, and, at the same time, D to C. Prove that
these preferences are inconsistent
with any utility function $U(x)$ for money.
}
\exercisxC{4}{ex.optimalstopping}{
{\sf Optimal stopping}.
% http://plus.maths.org/issue3/marriage/index.html
A large \ind{queue}\index{optimal stopping}\index{marriage}\index{secretary problem}
of $N$ potential partners is waiting
at your door,
all asking to marry you. They have arrived in random order.
As you meet each partner, you have to decide on the spot, based
on the information so far, whether
to marry them or say no.
Each potential partner has a desirability $d_n$, which
you find out if and when you meet them.
You must marry one of them, but you are not allowed to go
back to anyone you have said no to.
There are several ways to define the precise problem.
\ben
\item
Assuming your aim is to maximize the desirability $d_n$,
\ie, your utility function is $d_{\hat{n}}$, where $\hat{n}$ is the
partner selected, what strategy should you use?
\item
Assuming you wish very much to marry {\em the most desirable\/} person
(\ie, your utility function is 1 if you achieve that,
and zero otherwise); what strategy should you use?
\item
Assuming you wish very much to marry the most desirable person,
and that your strategy will be `strategy $M$':
\begin{quote}
{Strategy $M$} -- Meet the first $M$ partners
and say no to all of them. Memorize the maximum desirability $d_{\max}$
among them. Then meet the others in sequence, waiting until a partner
with $d_n > d_{\max}$ comes along, and marry them. If none more desirable
comes along, marry the final $N$th partner (and feel miserable).
\end{quote}
-- what is the optimal value of $M$?
% This version final problem is solveable (difficulty rating 3), but
% my view is that
\een
}
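For small $N$, the performance of strategy $M$ in part (c) can be found by brute force,
which is a useful check on any formula you derive. The Python sketch below enumerates all
$N!$ arrival orders, so it is only practical for $N$ up to about eight or nine. The function name
\verb|strategy_m_success| is hypothetical, and desirabilities are represented simply by
the ranks $0, \ldots, N-1$, which assumes there are no ties.
\begin{verbatim}
```python
from fractions import Fraction
from itertools import permutations


def strategy_m_success(n_partners, m):
    """Exact probability that strategy M marries the most desirable partner.

    Intended for 0 <= m < n_partners; rank n_partners - 1 is the most desirable.
    """
    wins = 0
    count = 0
    for order in permutations(range(n_partners)):
        count += 1
        d_max = max(order[:m], default=-1)   # best among the first m, all rejected
        chosen = order[-1]                   # fallback: the final partner
        for d in order[m:]:
            if d > d_max:
                chosen = d
                break
        wins += (chosen == n_partners - 1)
    return Fraction(wins, count)


if __name__ == "__main__":
    n = 6                                    # hypothetical queue length
    for m in range(n):
        print(m, strategy_m_success(n, m))
```
\end{verbatim}
Tabulating the output for a few values of $N$ shows how the best $M$ grows with the
length of the queue.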
\exercisxC{3}{ex.regret}{
{\sf Regret as an objective function?}
% This is a philosophical problem.
The preceding exercise (parts b and c) involved a utility function
based on regret. If one married the tenth most desirable
candidate, the utility function asserts that one would feel regret for
having not chosen the most desirable.
Many people working in learning theory and decision theory
use `minimizing the maximal possible regret' as an objective function, but does
this make sense?%
\margintab{
\begin{center}
\begin{tabular}{rcc} \toprule
& \multicolumn{2}{c}{Action} \\
& Buy & Don't \\
Outcome & & buy \\ \midrule
No win & $-1$ & 0 \\
Wins & $+9$ & 0 \\ \bottomrule
\end{tabular}
\end{center}
\caption[a]{Utility in the lottery ticket problem.}
\label{tab.utilticket}
}
% If you are investing money, is your aim to
% maximize the expected return, or to
% achieve an expected return that proves to be nearly
% as good as the best return you could have got?
%
% If you are choosing a route for getting quickly
% from A to B, is your aim to choose the route
% whose expected time is smallest, or
% is your aim to choose the route that has the
% highest probability of being the quickest?
%
% These aims lead to different behaviours.
Imagine that Fred has bought a lottery ticket, and offers to
sell it to you before it's known whether the ticket is a winner.
For simplicity say the probability that the ticket is a winner is $1/100$,
and if it is a winner, it is worth $\pounds 10$. Fred offers to sell
you the ticket for $\pounds 1$.
Do you buy it?
The possible actions are `buy' and `don't buy'.
The utilities of the four possible action--outcome pairs
are shown in \tabref{tab.utilticket}. I have assumed
that the utility of small amounts of money for you
is linear. If you don't buy the ticket then the utility
is zero regardless of whether the ticket
proves to be a winner.%
\margintab{
\begin{center}
\begin{tabular}{rcc} \toprule
& \multicolumn{2}{c}{Action} \\
& Buy & Don't \\
Outcome & & buy \\ \midrule
No win & $1$ & 0 \\
Wins & 0 & $9$ \\ \bottomrule
\end{tabular}
\end{center}
\caption[a]{Regret in the lottery ticket problem.}
\label{tab.regretticket}
}
If you do buy the ticket you either end up losing
one pound (with probability $99/100$) or gaining nine (with probability $1/100$).
In the \ind{minimax} \ind{regret} community,
actions are chosen to minimize the maximum possible regret.
The four possible regret outcomes are shown in \tabref{tab.regretticket}.
If you buy the ticket and it doesn't win, you
have a regret of $\pounds 1$, because if you
had not bought it you would have
been $\pounds 1$ better off.
If you do not buy the ticket and it wins, you
have a regret of $\pounds 9$, because if you
had bought it you would have
been $\pounds 9$ better off.
The action that minimizes the maximum possible regret
is thus to buy the ticket.
Discuss whether this use of regret to choose actions
can be philosophically justified.
The above problem can be turned into an \ind{investment portfolio}
decision problem by imagining that you have been given
one pound to invest in two possible funds for one day:
Fred's lottery fund, and the cash fund. If you put $\pounds f_1$
into Fred's lottery fund, Fred promises to return $\pounds 9 f_1$
to you if the lottery ticket is a winner, and otherwise nothing.
The remaining $\pounds f_0$ (with $f_0 = 1 - f_1$) is kept as cash.
What is the best investment? Show that the minimax regret community
will invest $f_1 = 9/10$ of their money in the high risk, high return
lottery fund, and only $f_0 = 1/10$ in cash.
Can this investment method be
justified?
}
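The two tables in the lottery ticket problem are related by a mechanical rule: the regret
of an action under a given outcome is the best utility available under that outcome minus
the utility actually obtained. The Python sketch below applies that rule to the utility
table and also computes the expected utilities for comparison. The function names and the
dictionary layout are assumptions made for this illustration.
\begin{verbatim}
```python
def regret_table(utility):
    """Regret = best utility for that outcome minus the utility obtained."""
    return {
        outcome: {a: max(row.values()) - u for a, u in row.items()}
        for outcome, row in utility.items()
    }


def minimax_regret_action(utility):
    """Action with the smallest worst-case regret, and the worst cases themselves."""
    regret = regret_table(utility)
    actions = list(next(iter(utility.values())))
    worst = {a: max(regret[o][a] for o in regret) for a in actions}
    return min(worst, key=worst.get), worst


def expected_utility(utility, prob):
    """Expected utility of each action under outcome probabilities prob."""
    actions = list(next(iter(utility.values())))
    return {a: sum(prob[o] * utility[o][a] for o in utility) for a in actions}


# Utilities and probabilities from the lottery ticket problem.
UTILITY = {"no win": {"buy": -1, "don't buy": 0},
           "wins":   {"buy": 9,  "don't buy": 0}}
PROB = {"no win": 0.99, "wins": 0.01}

if __name__ == "__main__":
    print(regret_table(UTILITY))
    print(minimax_regret_action(UTILITY))
    print(expected_utility(UTILITY, PROB))
```
\end{verbatim}
Note that the minimax regret rule never looks at \verb|PROB|, whereas the expected
utility does; comparing the two outputs is a concrete starting point for the discussion
the exercise asks for.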
%
%\subsection{Gambling oddities}
\exercisxC{3}{ex.gambling}{{\sf Gambling oddities} (from \citeasnoun{Cover&Thomas}).
A horse race involving $I$ horses
occurs repeatedly, and you are obliged to \ind{bet} all
your money each time. Your bet at time $t$ can be represented by a
normalized probability vector $\bb$ multiplied by your money
$m(t)$. The odds offered
by the bookies are such that if horse $i$ wins then your
return is $m(t\!+\!1) = b_i o_i m(t)$.
Assuming the bookies' odds are `fair', that is,
\beq
\sum_i \frac{1}{o_i} = 1 ,
\eeq
and assuming that the probability that horse $i$ wins is $p_i$,
work out the optimal betting strategy if your aim is {\dem{{Cover}'s aim}},\index{Cover, Thomas}
namely,
to maximize the {\em expected value of\/} $\log m(T)$.
Show that the optimal strategy sets $\bb$ equal to $\bp$, independent
of the bookies' odds ${\bf o}$.
 Show that when this strategy is used, the money is expected to grow exponentially with the number of races $n$ as:
\beq
2^{n W(\bb,\bp)}
\eeq
where $W = \sum_i p_i \log b_i o_i$.
If you only bet once, is the optimal strategy any different?
Do you think this optimal strategy makes sense?
Do you think that it's `optimal',
% investment strategy,
in common language, to ignore the bookies' odds?
What can you conclude about `Cover's aim'?
}
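The quantity $W$ in the gambling exercise is simple to evaluate, which makes it easy to
experiment with different bets. The Python sketch below is an illustration only: the
three-horse race, its probabilities, and its odds are hypothetical numbers chosen so that
the odds are fair, and the function name \verb|doubling_rate| is an assumption.
Logarithms are taken to base 2, to match the growth factor $2^{nW}$.
\begin{verbatim}
```python
import math


def doubling_rate(b, p, o):
    """W(b, p) = sum_i p_i log2(b_i o_i), the expected log2 growth per race.

    Raises ValueError if a horse with p_i > 0 is given a zero bet.
    """
    return sum(pi * math.log2(bi * oi) for bi, pi, oi in zip(b, p, o) if pi > 0)


# Hypothetical race with fair odds: 1/2 + 1/4 + 1/4 = 1.
P = [0.5, 0.3, 0.2]
O = [2.0, 4.0, 4.0]

if __name__ == "__main__":
    candidates = {
        "b = p": P,
        "uniform": [1 / 3, 1 / 3, 1 / 3],
        "b = 1/o": [1 / oi for oi in O],
    }
    for name, b in candidates.items():
        print(name, doubling_rate(b, P, O))
```
\end{verbatim}
Trying other bet vectors $\bb$ against the same $\bp$ and ${\bf o}$ gives a numerical
feel for the claim the exercise asks you to prove.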
\exercisaxC{3}{ex.joeshark}{
Two ordinary dice are thrown repeatedly; the outcome of each throw is
the sum of the two numbers. Joe Shark, who says that 6 and 8 are
his lucky numbers, bets even money that a 6 will be thrown before the
first 7 is thrown. If you were a gambler, would you take the bet?
What is your probability of winning?
%
Joe then bets even money that an 8 will be thrown before the first 7
is thrown. Would you take the bet?
Having gained your confidence, Joe suggests combining the two bets
into a single bet: he bets a larger sum, still at even odds, that an 8
and a 6 will be thrown before two 7s have been thrown. Would you take
the bet? What is your probability of winning?
%
% take care- there are a lot of wrong answers
%
%
% Optional part --- {\em not\/} necessary to solve the above problem:
% what is the probability that you win on the $n$th roll?
%
}
% Choose one of these:
% is it ok to have nothing on latent variable models here?
% and nothing on EM? and hmms? (until later)
\dvips
% bayesvfreq fvb freqversusbayes
\chapter{Bayesian Inference and Sampling Theory\nonexaminable}
\label{ch.sampling}
%
% all about sampling theory
%
\fakesection{sampling theory}
% {\em This chapter still being written.}
There are two schools of statistics.\index{classical statistics!criticisms}
Sampling theorists concentrate on having methods\index{Bayesian inference}
guaranteed to work most of the time, given minimal
assumptions.
Bayesians try to make inferences that take
into account all available information and
answer the question
% that is
of interest
given the particular data set.
As you have probably gathered, I strongly recommend
the use of Bayesian methods.
Sampling theory is the widely used approach to
statistics, and most papers in most journals report
their experiments using quantities like \ind{confidence
interval}s, \ind{significance level}s, and $p$-values. A \index{p-value}{$p$-value} (\eg\ $p=0.05$) is
the probability, given a null hypothesis for the
probability distribution of the data, that
the outcome would be as extreme as, or more extreme
than, the observed outcome.
Untrained readers
-- and perhaps, more worryingly, the authors of many papers
-- usually
interpret such a $p$-value as if it is a Bayesian
probability (for example, the posterior probability
of the null hypothesis), an interpretation that both sampling
theorists and Bayesians would agree is incorrect.
In this chapter we study a couple of
simple inference problems in order to compare
these two approaches to statistics.
While in some cases, the answers from a Bayesian
approach and from sampling theory are very similar,
we can also find cases where there are significant
differences. We have already seen such an example in
\exerciseref{ex.eurotoss}, where a sampling theorist got a
$p$-value smaller than 7\%, and viewed this
as strong evidence {\em{against}\/}
the null hypothesis, whereas the data actually {\em{favoured}\/} the
null hypothesis over the simplest alternative.
On \pref{sec.sampling5percent}, another example was given
where the $p$-value was smaller than the mystical value of
5\%, yet the data again {{favoured}} the
null hypothesis. Thus in some cases, sampling theory can be trigger-happy,
declaring results to be `sufficiently improbable that
the null hypothesis should be rejected', when those results
actually weakly support the null hypothesis.
As we will now see, there are also inference problems
where sampling theory fails to detect `significant' evidence
where a Bayesian approach and everyday intuition agree that
the evidence is strong.
Most telling of all are the inference problems
where the `significance' assigned by sampling theory
changes depending on irrelevant factors concerned
with the design of the experiment.
This chapter is only provided for those readers
who are curious about the sampling theory$\,$/$\,$Bayesian methods
debate.
If you find any of this chapter tough to understand,
please skip it. There is no point trying to understand
the debate. Just use Bayesian methods -- they are
much easier to understand than the debate itself!
\section{A medical example}
\label{sec.microsoftus}
\begin{quote}
We are trying to reduce the incidence of an unpleasant \ind{disease} called
{\em\ind{microsoftus}}.
Two \ind{vaccination}s, $A$ and $B$, are tested on a group of volunteers.
Vaccination $B$ is a control treatment, a placebo treatment
with no active ingredients.
% believed to be susceptible to {\em microsoftus}.
Of the 40 subjects,
30 are randomly assigned to have treatment $A$ and the other
10 are given the \ind{control treatment} $B$. We observe the subjects for
one year after their vaccinations.
Of the 30 in group $A$, one contracts {\em microsoftus}.
% the disease
Of the 10 in group $B$, three contract {\em microsoftus}.
Is treatment $A$ better than treatment $B$?
% Is it a good idea to treat everyone in Tanzania
%% (population 30 million)
% with vaccination $A$ if the cost of the vaccination is 1 unit,
% and the cost of treating someone after they get the disease is 100 units?
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% (The `unit' here could be a unit of human suffering or
%% or a unit measuring
% financial cost,
% depending on one's philosophy of health care.)
% or a combination of both.)
\end{quote}
% TShs. 616=US$l on 5 December 1997
\subsection{Sampling theory has a go}
The standard sampling theory approach to the question `is A better than B?'
is to construct a {\dem\ind{statistical test}}\index{test!statistical}.
%\marginpar{\footnotesize{If
% you don't understand the sampling theory approach, don't worry.
% It is only important that you understand Bayesian methods.
%}}
The test usually compares a hypothesis such as
\begin{quote}
$\H_1$: `A and B have different effectivenesses'
\end{quote}
with a null hypothesis such as
\begin{quote}
$\H_0$:
`A and B have exactly the
same effectivenesses as each other'.
\end{quote}
A novice might object `no, no, I want to compare the hypothesis
``A is better
than B''
with the alternative
``B is better
than A''!' but
such objections are not welcome in sampling theory.
Once the two hypotheses have been defined, the first
hypothesis is scarcely mentioned again -- attention focuses
solely on the null hypothesis. It makes me laugh to write
this, but it's true! The null hypothesis is
accepted or rejected purely on the
basis of how unexpected the data were to $\H_0$,
not on how much better $\H_1$ predicted the data.
One chooses a {\dem\ind{statistic}} which measures
how much a data set deviates from
the null hypothesis.
In the example here, the standard statistic
to use would be one called $\chi^2$ (\ind{chi-squared}).\index{$\chi^2$}
To compute $\chi^2$,
we take the difference between each data measurement
and its {\em expected\/} value
{\em assuming the null hypothesis to be true},
and divide the square of that difference by the
{\em variance} of the measurement,
{\em assuming the null hypothesis to be true}.
In the present problem, the four data measurements
are the integers
$F_{A+}$,
$F_{A-}$,
$F_{B+}$, and
$F_{B-}$, that is,
the number of subjects given treatment $A$ who contracted
{\em microsoftus} ($F_{A+}$), the
number of subjects given treatment $A$ who didn't ($F_{A-}$),
and so forth.
% A generally roughly correct notion of chi-squared is
%\beq
%\chi^2
%= \sum_{i}
% \frac{\left( F_i - \langle F_i \rangle \right)^2 }{ \var( F_i) } .
%\label{eq.chi2}
%\eeq
% Actually, reading \cite{Spiegel}, I find this definition:
The definition of $\chi^2$ is:
\beq
\chi^2
= \sum_{i}
\frac{\left( F_i - \langle F_i \rangle \right)^2 }{ \langle F_i \rangle } .
\label{eq.chi2B}
\eeq
Actually, in my
elementary statistics book
\cite{Spiegel} I find Yates's correction is recommended:\marginpar{\small\raggedright{If you want
to know about Yates's correction, read a sampling theory textbook. The point of this
chapter is not to teach sampling theory; I merely mention Yates's correction
because it is what a professional sampling theorist might use.}}
\beq
\chi^2
= \sum_{i}
\frac{\left( |F_i - \langle F_i \rangle | -0.5 \right)^2 }{ \langle F_i \rangle } .
\label{eq.chi2C}
\eeq
In this case, given the null hypothesis that treatments
$A$ and $B$ are equally effective, and have rates $f_{+}$
and $f_{-}$ for the two outcomes, the expected counts are:
\beq
\begin{array}{c@{}c@{}ccc@{}c@{}c}
\langle F_{A+} \rangle &=& f_{+} N_A &\ &
\langle F_{A-} \rangle &=& f_{-} N_A \\
\langle F_{B+} \rangle &=& f_{+} N_B &\ &
\langle F_{B-} \rangle &=& f_{-} N_B .
%\\
% \var( F_{A+} ) &=& f_{+} f_{-} N_A \\
% \var( F_{A-} ) &=& f_{+} f_{-} N_A \\
% \var( F_{B+} ) &=& f_{+} f_{-} N_B \\
% \var( F_{B-} ) &=& f_{+} f_{-} N_B
\end{array}
\eeq
The test accepts or rejects the null hypothesis
on the basis of how big $\chi^2$ is.
%The rough idea is,
% if the null hypothesis
% were true and if the parameters
% $f_{\pm}$ were known, then the expectation
% of each term in the sum (\ref{eq.chi2})
% is 1, since the average square deviation is the variance.
% So if $\chi^2$ is much bigger than the number of measurements,
% we ought to be suspicious of the null hypothesis.
To make this test precise, and give it a `significance level',
we have to work out what the {\dem\ind{sampling distribution}\/}\marginpar{\small\raggedright{%
The sampling distribution of a statistic is the probability distribution
of its value under repetitions of the experiment, assuming
that the null hypothesis is true.}}
of $\chi^2$
is, taking into account the fact that the four data points
are not independent (they satisfy the two constraints
$F_{A+}+F_{A-} = N_A$ and
$F_{B+}+F_{B-} = N_B$)
and the fact that the parameters
$f_{\pm}$ are not known.
These three constraints reduce the {\dem{number of
\ind{degrees of freedom}\/}} in the data from four to one. [If you
want to learn more about computing the `number
of degrees of freedom', read a sampling theory book;
in Bayesian methods we don't need to know all that,
and quantities equivalent to the number of degrees of
freedom pop straight out of a Bayesian analysis
when they are appropriate.]
These sampling distributions are
tabulated by sampling theory gnomes and come accompanied by
warnings about the conditions under which they are
accurate. For example, standard tabulated distributions for $\chi^2$
 are only accurate if the expected numbers $\langle F_i \rangle$ are
 about 5 or more.
% Ripley personal communication.
% In the present
Once the data arrive, sampling theorists \ind{estimate} the unknown parameters
$f_{\pm}$ of the null hypothesis from the data:
\beq
\hat{f}_{+} = \frac{F_{A+}+F_{B+}}{N_A+N_B}, \:\:\:\:\:
\hat{f}_{-} = \frac{F_{A-}+F_{B-}}{N_A+N_B},
\eeq
and evaluate $\chi^2$. At this point, the sampling theory
school divides itself into two camps. One camp
uses the following protocol: first, before looking
at the data, pick the significance level of the test (\eg\ 5\%),
and determine the critical value of $\chi^2$
above which the null hypothesis will be rejected.
(The significance level is the fraction of times
that the statistic $\chi^2$ would exceed the critical
value, if the null hypothesis were true.)
Then evaluate $\chi^2$, compare with the critical value,
and declare the outcome of the test, and its
significance level (which was fixed beforehand).
The second camp looks at the data, finds $\chi^2$,\index{$\chi^2$}
then looks in the table of $\chi^2$-distributions
for the significance level, $p$, for which
the observed value of $\chi^2$ would be the critical value.
The result of the test is then reported by giving this
value of $p$, which is the fraction of times that
a result as extreme as the one observed, or more extreme,
would be expected to arise if the null hypothesis
were true.
Let's apply these two methods.
First camp: let's pick 5\% as our significance level.
The critical value for $\chi^2$ with one degree of freedom
is $\chi^2_{0.05} = 3.84$.
% [Actually, we've forgotten one issue here, which
% is whether we are going for a one-tailed test or
% a two-tailed test.]
The estimated values of $f_{\pm}$ are
\beq
	\hat{f}_{+} = 1/10, \:\:\:\:\: \hat{f}_{-} = 9/10 .
\eeq
The expected values of the four measurements are
\beqan
\langle F_{A+} \rangle &=& 3 \\ % 1
\langle F_{A-} \rangle &=& 27 \\ % 29
\langle F_{B+} \rangle &=& 1 \\ % 3
\langle F_{B-} \rangle &=& 9
%\\ % 10
% \var( F_{A+} ) &=& f_{+} f_{-} N_A = 27/10\\
% \var( F_{A-} ) &=& f_{+} f_{-} N_A = 27/10 \\
% \var( F_{B+} ) &=& f_{+} f_{-} N_B = 9/10\\
% \var( F_{B-} ) &=& f_{+} f_{-} N_B = 9/10
\eeqan
and $\chi^2$ (as defined in \eqref{eq.chi2B}) is
%pr 4.0/(3.0) +4.0/(27.0) + 4.0/(9.0) + 4.0/(1.0)
% 5.926
% WRONG: (my defn!)
%pr 4.0*2.0/(2.7) + 4.0*2.0/(0.9)
%11.8518518518519
%pr 4.0/(2.7) + 4.0/(0.9)
% 5.926
% yates's correction
%pr ( 1.0/(3.0) +1.0/(27.0) + 1.0/(9.0) + 1.0/(1.0) ) * ( 1.5**2 )
% 3.333333333333
\beq
\chi^2
% wrong!
% \frac{4}{27/10} + \frac{4}{27/10} + \frac{4}{9/10} + \frac{4}{9/10}
= 5.93 .
\eeq
Since this value exceeds 3.84, we reject the
null hypothesis that the two treatments are equivalent
at the 0.05 significance level. However, if we use Yates's correction,
we find $\chi^2 = 3.33$, and therefore accept the null hypothesis.
Camp two runs a finger across the $\chi^2$ table found at the back
of any good sampling theory book and finds $\chi^2_{.10} = 2.71$.
 Interpolating between $\chi^2_{.10}$ and $\chi^2_{.05}$
 to the Yates-corrected value $\chi^2 = 3.33$,
 camp two reports `the $p$-value is $p=0.07$'.
Notice that this answer does not say how much more
effective $A$ is than $B$, it simply says that $A$
is `significantly' different from $B$.
And here, `significant' means only `statistically
significant', not practically significant.
The man in the street, reading the statement
that `the treatment was significantly different
from the control ($p=0.07$)',
might come to the conclusion that `there is a 93\%
chance that the treatments differ in effectiveness'.
But what `$p=0.07$' actually means
is `if you did this experiment many times,
and the two treatments {\em had\/} equal effectiveness,
 then 7\% of the time you would find a value of $\chi^2$
 as extreme as, or more extreme than, the one that happened here'.
%
This has almost nothing to do with what we want
to know, which is how likely it is that treatment A
is better than B.
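The arithmetic of the sampling theory test in this subsection can be reproduced in a few
lines. The Python sketch below is an illustration, not a recommended procedure: the
function names are assumptions, and the tail probability uses the fact that a $\chi^2$
variable with one degree of freedom is the square of a standard normal variable, in place
of the printed tables.
\begin{verbatim}
```python
import math


def expected_counts(f_a_plus, f_a_minus, f_b_plus, f_b_minus):
    """Expected counts under the null hypothesis, with f estimated from the data."""
    n_a = f_a_plus + f_a_minus
    n_b = f_b_plus + f_b_minus
    f_plus = (f_a_plus + f_b_plus) / (n_a + n_b)
    f_minus = 1.0 - f_plus
    return [f_plus * n_a, f_minus * n_a, f_plus * n_b, f_minus * n_b]


def chi_squared(observed, expected, yates=False):
    """Chi-squared statistic, optionally with Yates's correction of 0.5."""
    correction = 0.5 if yates else 0.0
    return sum((abs(f - e) - correction) ** 2 / e
               for f, e in zip(observed, expected))


def tail_probability_one_dof(chi2):
    """Probability that a chi-squared variable with one degree of freedom exceeds chi2."""
    return math.erfc(math.sqrt(chi2 / 2.0))


# Observed counts F_A+, F_A-, F_B+, F_B- from the medical example.
OBSERVED = [1, 29, 3, 7]
EXPECTED = expected_counts(*OBSERVED)

if __name__ == "__main__":
    for yates in (False, True):
        chi2 = chi_squared(OBSERVED, EXPECTED, yates)
        print(yates, round(chi2, 2), round(tail_probability_one_dof(chi2), 3))
```
\end{verbatim}
The statistic comes out at about 5.93 without the correction and about 3.33 with it, and
the corrected value has a tail probability between 0.05 and 0.10, consistent with the
interpolated figure reported by camp two.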
\subsection{Let me through, I'm a Bayesian}
OK, now let's {\em infer\/} what we really want to know.
We scrap the hypothesis that the two treatments have
exactly equal effectivenesses, since
we do not believe it.
There are two unknown parameters, $p_{A+}$ and
$p_{B+}$,
which are the probabilities that people given treatments
$A$ and $B$, respectively, contract the disease.
Given the data, we can infer these two probabilities,
and we can answer questions of interest
by examining the posterior distribution.
The posterior distribution is
\beq
P( p_{A+} , p_{B+} \given \{ F_i \} )
=
\frac{
P( \{ F_i \} \given p_{A+} , p_{B+} )
P( p_{A+} , p_{B+} )
} { P( \{ F_i \} ) } .
\eeq
The likelihood function is
\beqan
P( \{ F_i \} \given p_{A+} , p_{B+} )
& =&
{{N_A}\choose{F_{A+}}}
p_{A+}^{F_{A+}} p_{A-}^{F_{A-}}
{{N_B}\choose{F_{B+}}}
p_{B+}^{F_{B+}} p_{B-}^{F_{B-}}
\\ & = &
{{30}\choose{1}}
p_{A+}^{1} p_{A-}^{29}
{{10}\choose{3}}
p_{B+}^{3} p_{B-}^{7} .
\label{eq.like.microsoftus}
\eeqan
What prior distribution should we use? The prior distribution
gives us the opportunity to include knowledge from
other experiments, or a prior belief that the
two parameters $p_{A+}$ and $p_{B+}$, while different
from each other, are expected to have similar values.
Here we will use the simplest vanilla prior distribution,
a uniform distribution over each parameter.
\beq
P( p_{A+} , p_{B+} ) = 1 .
\eeq
We can now plot the posterior distribution.
Given the assumption of a separable prior on $p_{A+}$
and $p_{B+}$, the posterior distribution is
also separable:
\beq
P( p_{A+} , p_{B+} \given \{ F_i \} )
=
P( p_{A+} \given F_{A+},F_{A-} )
P( p_{B+} \given F_{B+},F_{B-} ) .
\eeq
The two
posterior distributions
% likelihood functions
are shown in
\figref{fig.pApBm} (except the
graphs are not normalized)
and the joint posterior probability is
shown in \figref{fig.pApBj}.
% see itp/bernplot
% bernplot2.p r1=1 n1=30 r2=3 n2=10
\begin{figure}
\figuremargin{%
\[
{\raisebox{-7.42pt}[1.64in][0in]{\psfig{figure=bernplot/1.30.3.10.ps,width=2.7in,angle=-90} }}
\]
}{%
\caption[a]{Posterior probabilities of the two effectivenesses.
Treatment A -- solid line; B -- dotted line.}
%\label{fig1}
\label{fig.pApBm}
}%
\end{figure}
\begin{figure}
\figuremargin{%
\[
\mbox{
{\raisebox{0.97in}{$p_{B+}$}\hspace*{-0.54025in}\raisebox{-0.7in}[1.4in][0.1in]{\psfig{figure=bernplot/microsoftus.c.ps,width=2.7in,angle=-90}}\hspace*{-0.455325in}\raisebox{-12pt}[0in][0in]{$p_{A+}$}}
\hspace{0.2in}
{\hspace*{-0.54025in}\raisebox{-0.26in}[1.352in][0in]{\psfig{figure=bernplot/microsoftus.ps,width=3.4in,angle=-90}}\hspace*{-0.425in}}
}
\]
}{%
\caption[a]{Joint posterior probability of the two effectivenesses --
contour plot and surface plot.}
%\label{fig1}
\label{fig.pApBj}
}%
\end{figure}
If we want to know the answer to the question
`how probable is it that $p_{A+}$ is
smaller than $p_{B+}$?',
we can answer exactly that question
by computing
the posterior probability
\beq
P( p_{A+} < p_{B+} \given \mbox{Data} ),
\eeq
which%
\amarginfig{t}{
\begin{center}\mbox{\epsfbox{metapost/sampling.2}}\end{center}
\caption[a]{The proposition $p_{A+} < p_{B+}$ is true
for all points in the shaded triangle.
To find the probability of this proposition we
integrate the joint posterior probability
$P( p_{A+}, p_{B+} \given \mbox{Data} )$
(\protect\figref{fig.pApBj}) over this region.
}
\label{fig.pApBj2}
}
is the integral of the joint posterior probability
$P( p_{A+}, p_{B+} \given \mbox{Data} )$
shown in \figref{fig.pApBj}
over the region in which $p_{A+} < p_{B+}$, \ie,
% That region is
the shaded triangle
in \figref{fig.pApBj2}.
The value of this integral (obtained by a straightforward
numerical integration of the
likelihood function (\ref{eq.like.microsoftus})
over the relevant region, divided by its integral over the whole unit square)
is
$P( p_{A+}\! <\! p_{B+} \given \mbox{Data} ) = 0.990$.
Thus there is a 99\% chance, given the data and
our prior assumptions, that treatment $A$
is superior to treatment $B$. In conclusion, according to
our Bayesian model, the data (1 out of 30 contracted the disease
after vaccination A, and 3 out of 10 contracted the disease
after vaccination B) give very strong evidence -- about 99 to one --
that treatment $A$
is superior to treatment $B$.
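The numerical integration described above can be sketched in a few lines of Python. This is a minimal illustration, not the program used to produce the figures: the function name, the midpoint grid and its resolution are assumptions made here, and the uniform prior is the one adopted above. For the data above it returns a value close to 0.99.

```python
def prob_a_less_than_b(fa_pos, fa_neg, fb_pos, fb_neg, n_grid=2000):
    """Posterior probability that p_A+ < p_B+ under independent uniform priors.

    Assumption: midpoint-rule integration over the unit square; the
    unnormalized posterior of each parameter is p^F+ (1-p)^F-.
    """
    h = 1.0 / n_grid
    grid = [(k + 0.5) * h for k in range(n_grid)]
    post_a = [p ** fa_pos * (1.0 - p) ** fa_neg for p in grid]
    post_b = [p ** fb_pos * (1.0 - p) ** fb_neg for p in grid]
    # The cell widths cancel between numerator and normalizer.
    norm = sum(post_a) * sum(post_b)
    mass = 0.0
    below = 0.0  # unnormalized posterior mass of p_A+ strictly below the current cell
    for wa, wb in zip(post_a, post_b):
        # Cells on the diagonal are counted with weight one half.
        mass += wb * (below + 0.5 * wa)
        below += wa
    return mass / norm


# Data from the text: 1 of 30 after treatment A, 3 of 10 after treatment B.
print(round(prob_a_less_than_b(1, 29, 3, 7), 3))
```

Integrating one parameter analytically (via a running sum) keeps the cost linear in the grid size; a direct double loop over the square would give the same answer more slowly.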
% It's interesting to note that in this case, whereas
% the sampling theory method gave an equivocal answer with $p$-values
% close to the mystical threshold of 0.95, the Bayesian
% answer is firmly in favour of
% As in the bent coin example, the answer from a Bayesian approach
% differs from the sampling theory answer. And whereas
% in the example on
% \pref{sec.pvalue05}, ... here....
In the Bayesian approach, it is also easy to answer other relevant
questions. For example, if we want to know `how likely is
it that treatment A is ten times more effective than treatment B?', we
can integrate the joint posterior probability
$P( p_{A+}, p_{B+} \given \mbox{Data} )$
over the region in which $p_{A+} < 10\, p_{B+}$ (\figref{fig.pApBj3}).
%an administrator might
% need to decide whether it is a good idea to treat everyone in Tanzania
% with vaccination $A$ if the cost of the vaccination is 1 unit,
% and the cost of treating someone after they get the disease is 100 units?
%
\amarginfig{t}{
\begin{center}\mbox{\epsfbox{metapost/sampling.1}}\end{center}
\caption[a]{The proposition $p_{A+} < 10\, p_{B+}$ is true
for all points in the shaded triangle.
}
\label{fig.pApBj3}
}
\subsection{Model comparison}
If there were a situation in which we really did want to
compare the two hypotheses $\H_0$: $p_{A+} = p_{B+}$
and $\H_1$: $p_{A+} \neq p_{B+}$, we could of course do
this directly with Bayesian methods also.
As an example, consider the data set:
\begin{realcenter}
$D$: One subject, given treatment A, subsequently contracted
{\em microsoftus}.
One subject, given treatment B, did not.
\end{realcenter}
%\marginpar{% moved into text because chapter length ok, and this page was looking crowded.
\begin{center}
\begin{tabular}{rcc} \toprule
Treatment & A & B \\ \midrule
Got disease& 1 & 0 \\
Did not & 0 & 1 \\ \midrule
Total treated & 1 & 1 \\ \bottomrule
\end{tabular}
\end{center}
%}
How strongly does this data set
favour $\H_1$ over $\H_0$?
We answer this question by computing the evidence for
each hypothesis.
Let's assume uniform priors over the unknown parameters
of the models. The first hypothesis $\H_0$: $p_{A+} = p_{B+}$
has just one unknown parameter, let's call it $p$.
\beq
P(p \given \H_0) = 1 \:\: \:\: \:\: p \in (0,1) .
\eeq
We'll use the uniform prior over the two parameters
of model $\H_1$ that we used before:
\beq
P(p_{A+},p_{B+} \given \H_1) = 1 \:\: \:\: \:\: p_{A+} \in (0,1), \: p_{B+} \in (0,1) .
\eeq
Now, the probability of the data $D$ under model $\H_0$
is the normalizing constant from the inference of $p$ given $D$:
\beqan
P(D \given \H_0)& =& \int \d p \: P(D \given p) P(p \given \H_0)\\
& =& \int \d p \: p (1-p) \times 1 \\
& =& 1/6 .
\eeqan
The probability of the data $D$ under model $\H_1$
is given by a simple two-dimensional integral:
\beqan
\!\!\!\!
\!\!\!\!
\!\!\!\!
P(D \given \H_1)& =& \int \! \int \! \d p_{A+} \, \d p_{B+} \: P(D \given p_{A+},p_{B+})
P(p_{A+},p_{B+} \given \H_1) \\
& =& \int \d p_{A+}\, p_{A+} \: \int \d p_{B+} \, (1-p_{B+}) \\
& =& 1/2 \times 1/2 \\
& =& 1/4 .
\eeqan
Thus the evidence ratio in favour of model $\H_1$, which
asserts that the two effectivenesses are unequal,
is
\beq
\frac{P(D \given \H_1)}{P(D \given \H_0)} = \frac{1/4}{1/6} = \frac{0.6}{0.4} .
\eeq
So if the prior probability over the two hypotheses was 50:50,
the posterior probability is 60:40 in favour of $\H_1$.\ENDsolution
Is it not easy to get sensible answers to well-posed questions
using Bayesian methods?
[The sampling theory answer to this question would involve
the identical significance test that was used in the
preceding problem; that test would yield a `not significant'
result. I think it is greatly preferable to acknowledge
what is obvious to the intuition, namely that the
data $D$ do give {\em weak\/} evidence in favour of $\H_1$.
Bayesian methods quantify how weak
the evidence is.]
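The evidence calculation above can be reproduced exactly with rational arithmetic. The sketch below is illustrative only: the function names are assumptions, the data are treated as an ordered sequence (so no binomial coefficients appear, as in the integrals above), and it relies on the standard result that the integral of $p^a(1-p)^b$ over $(0,1)$ is $a!\,b!/(a+b+1)!$ for non-negative integers $a$ and $b$.

```python
from fractions import Fraction
from math import factorial


def beta_integral(a, b):
    """Exact integral of p^a (1-p)^b over (0,1) for integers a, b >= 0."""
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))


def evidences(fa_pos, fa_neg, fb_pos, fb_neg):
    """Return (P(D|H0), P(D|H1)) under uniform priors."""
    # H0: one shared parameter p, so the counts are pooled.
    ev_h0 = beta_integral(fa_pos + fb_pos, fa_neg + fb_neg)
    # H1: two independent parameters, so the integral factorizes.
    ev_h1 = beta_integral(fa_pos, fa_neg) * beta_integral(fb_pos, fb_neg)
    return ev_h0, ev_h1


def posterior_h1(ev_h0, ev_h1, prior_h1=Fraction(1, 2)):
    """Posterior probability of H1 given the two evidences and a prior."""
    return ev_h1 * prior_h1 / (ev_h1 * prior_h1 + ev_h0 * (1 - prior_h1))


# Data set D: one subject on A got the disease, one subject on B did not.
ev_h0, ev_h1 = evidences(1, 0, 0, 1)
print(ev_h0, ev_h1, posterior_h1(ev_h0, ev_h1))  # 1/6 1/4 3/5
```

The same functions can be applied to larger tables of counts; the binomial coefficients that would appear for unordered data are common to both hypotheses and cancel in the evidence ratio.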
\section{Dependence of $\boldmath p$-values on irrelevant information}
In an expensive laboratory,\index{p-value}
Dr.\ Bloggs\index{Dr.\ Bloggs} tosses a coin labelled $a$ and $b$
twelve times and the outcome is the string
\[
aaabaaaabaab,
\]
which contains three $b$s and nine $a$s.
What evidence do these data give that the coin is biased in favour of $a$?
Dr.\ Bloggs consults his sampling theory friend
who says `let $r$ be the number of $b$s and $n=12$ be the total
number of tosses; I view $r$ as the random variable and
find the probability of $r$ taking on the value $r=3$ or a more extreme value,
assuming the null hypothesis $p_a=0.5$ to be true'.
He thus computes
\beqan
P(r\leq 3 \given n\eq 12,\H_0) &\!\!=\!\!&
\sum_{r= 0}^{3} {{n}\choose{r}} \mbox{\dhalf}^{n}
= \left(
{\textstyle {12\choose 0} + {12\choose 1} + {12\choose 2} + {12\choose 3}}
\right) \mbox{\dhalf}^{12}
\nonumber \\
&=& 0.07,
\label{eq.samplingsum}
\eeqan
and reports `at the significance level of 5\%,
there is not significant evidence of bias in favour
of $a$'. Or, if the friend prefers to report $p$-values
rather than simply compare $p$ with 5\%, he would
report `the $p$-value is 7\%, which is not conventionally viewed
as significantly small'.
If a two-tailed test seemed more appropriate,
he might compute the two-tailed area, which is twice the above probability,
and report `the $p$-value is 15\%, which is not significantly small'.
We won't focus on the issue of the choice between the one-tailed
and two-tailed tests, as we have bigger fish to catch.
% pr 0.072998 * 2
% 0.145996
Dr.\ Bloggs pays careful attention to the calculation (\ref{eq.samplingsum}),
and responds `no, no, the \ind{random variable} in the experiment
was not $r$: I decided before running the experiment that I would
keep tossing the coin until I saw three $b$s; the random variable
is thus $n$'.
\begin{aside}
Such experimental designs are not unusual. In my experiments
on error-correcting codes I often simulate the decoding of
a code until a chosen number $r$ of block errors ($b$s)
has occurred,
since the error on the inferred value of $\log \pb$
goes roughly as $1/\sqrt{r}$, independent of $n$.
\end{aside}
\exercisxA{2}{ex.stoppingrule}{%
Find the Bayesian inference about the bias $p_a$ of the coin
given the data, and determine whether a Bayesian's
inferences depend on what \ind{stopping rule}\index{experimental design}\index{sermon!stopping rule}
was in force.
}
% ~/bin/sampling.p n=12 r=3
% Data: r = 3; n= 12; p0=0.5
% =========================
% Sampling theory result 1: assuming stopping rule was 'stop after n=12'
% Sanity check: this should be 1.0: 1
%
% Prob of r or fewer events = 0.072998046875
%
% -------------------------
%
%
% Sampling theory result 2: assuming rule was 'stop on the r'th 1, and data ended with a 1'
% Sanity check: this should be about 1.0: 0.999999999999998
%
% Prob of n being <= 12 = 0.980712890625
% Prob of n being >= 12 = 0.03271484375
%
% Ratio = 0.448160535117057
%
% The sampling theorist knows that, a
According to sampling theory,
a different calculation is required
in order to assess the `\index{significance}{significance}' of the result $n=12$.
The probability distribution of $n$ given $\H_0$ is
the probability that the first $n\!-\!1$ tosses contain exactly $r\!-\!1$
$b$s and then the $n$th toss is a $b$.
\beq
P(n \given \H_0, r) = {{n\!-\!1}\choose{r\!-\!1}} \mbox{\dhalf}^{n} .
\eeq
The sampling theorist thus computes
\beq
P(n \geq 12 \given r\eq 3,\H_0) = 0.03 .
\eeq
%
He reports back to Dr.\ Bloggs, `the $p$-value is 3\% --
there {\em is\/} significant evidence of bias after all!'
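The two sampling theory calculations can be checked with a short Python sketch. The function names are illustrative assumptions; the formulae are the binomial sum (\ref{eq.samplingsum}) and the distribution of $n$ given above, both evaluated under the null hypothesis $p_a = 0.5$.

```python
from fractions import Fraction
from math import comb


def p_value_fixed_n(r, n):
    """P(number of b's <= r | n tosses, fair coin): rule 'toss n times'."""
    return sum(Fraction(comb(n, k), 2 ** n) for k in range(r + 1))


def p_value_fixed_r(r, n):
    """P(number of tosses >= n | fair coin): rule 'stop at the r-th b'."""
    # P(N = m) = C(m-1, r-1) / 2^m for m = r, r+1, ...
    return 1 - sum(Fraction(comb(m - 1, r - 1), 2 ** m) for m in range(r, n))


print(float(p_value_fixed_n(3, 12)))  # 0.072998046875
print(float(p_value_fixed_r(3, 12)))  # 0.03271484375
```

The same data, $r=3$ and $n=12$, thus fall on either side of the 5\% threshold depending only on which stopping rule is assumed.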
What do you think Dr.\ Bloggs should do? Should he publish
the result, with this marvellous $p$-value,
in one of the journals that insist that all experimental\index{journal publication policy}
results have their `significance' assessed using sampling theory?\index{rant!p-value}\index{sermon!p-value}
Or should he boot the sampling theorist out of the door
and seek a coherent method of assessing significance,
one that does not depend on the stopping rule?
At this point the audience divides in two. Half the audience intuitively
feel that the stopping rule is irrelevant, and don't need
any convincing that the answer to \exerciseref{ex.stoppingrule} is
`the inferences about $p_a$ do not depend on the stopping rule'.
The other half, perhaps on account of a thorough training
in sampling theory, intuitively feel that Dr.\ Bloggs's
stopping rule, which stopped tossing the moment
the third $b$ appeared, may have biased the experiment somehow.
%
If you are in the second group, I encourage you to reflect on
the situation, and hope you'll eventually come round to the
view that is consistent with the \ind{likelihood principle},
which is that the stopping rule is {\em not\/} relevant to
what we have learned about $p_a$.
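To see what the likelihood principle implies here, it may help to write down the probability of the observed data under each stopping rule, now as a function of $p_a$ rather than at the null value (a short sketch, using the observed values $r=3$ and $n=12$):
\beqan
P(r\eq 3 \given n\eq 12, p_a) &=& {{12}\choose{3}} \, p_a^{9} (1-p_a)^{3} , \\
P(n\eq 12 \given r\eq 3, p_a) &=& {{11}\choose{2}} \, p_a^{9} (1-p_a)^{3} .
\eeqan
The two expressions differ only by a constant factor that does not involve $p_a$, so as functions of $p_a$ they are proportional, and any prior on $p_a$ leads to the same normalized posterior under both rules.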
As a thought experiment, consider some onlookers who (in order to
save money) are spying\index{spy} on Dr.\ Bloggs's experiments: each time
he tosses the coin, the spies update the values of $r$ and $n$.
The spies are eager to make inferences from the data as soon as each new result occurs.
Should the spies' beliefs about the bias of the coin depend
on Dr.\ Bloggs's intentions regarding the continuation of the experiment?
%
%
The fact that the $p$-values of sampling theory {\em{do}\/}
depend on the stopping rule (indeed, whole
volumes of the sampling theory literature are concerned
with the task of assessing `significance' when a complicated
stopping rule is required -- `\ind{sequential probability ratio test}s',
for example)
seems to me a compelling argument
for having nothing to do with $p$-values at all.
A Bayesian solution to this inference problem was
given in sections \ref{sec.bentcoin}
and \ref{sec.bentcoin2}
and \exerciseref{ex.eurotoss}.
% indeed we've been looking at the same problem ever since {ex.ip.urns} p.33
Would it help clarify this issue if I added one more scene
to the story?
The janitor, who's been eavesdropping on Dr.\ Bloggs's
conversation, comes in and says `I happened to notice that
just after you stopped doing the experiments on the coin,
the Officer for Whimsical Departmental Rules
ordered the immediate destruction of all such coins. Your coin was
therefore destroyed by the departmental safety officer.
There is no way you could have continued the experiment much beyond
$n=12$ tosses. Seems to me, you need to recompute your $p$-value?'
%
\section{Confidence intervals}
In an experiment in which data $D$ are obtained
from a system with an unknown parameter $\theta$,
a standard concept in sampling theory is
the idea of a {\dem\ind{confidence interval}} for $\theta$.
Such an interval $(\theta_{\min}(D),\theta_{\max}(D))$
has associated with it a {\dem\ind{confidence level}\/}
% \footnote{check terminology}
such as $95\%$
which is informally interpreted
as `the probability that $\theta$ lies in the confidence interval'.
Let's make precise what the confidence level really
means, then give an example.
A confidence interval is a function
$(\theta_{\min}(D),\theta_{\max}(D))$ of the data set $D$.
The confidence level of the confidence interval is a property
that we can compute before the data arrive.
We imagine
% that the
generating many data sets from a particular true value of $\theta$,
and calculating the interval $(\theta_{\min}(D),\theta_{\max}(D))$,
and then checking whether the true value of $\theta$ lies in that
interval. If, averaging over all these imagined repetitions of the experiment,
the true value of $\theta$ lies in the confidence interval
a fraction $f$ of the time, and this property holds
for all true values of $\theta$, then the confidence level of
the confidence interval is $f$.
For example, if $\theta$ is the mean of a Gaussian distribution
which is known to have standard deviation 1, and $D$ is a sample
from that Gaussian, then $(\theta_{\min}(D),\theta_{\max}(D))$
= $(D\!-\!2,D\!+\!2)$ is a 95\% confidence interval for $\theta$.
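As a quick check of this claim (a sketch, using only the fact that $D - \theta$ has a Gaussian distribution with mean 0 and standard deviation 1 whatever the value of $\theta$):
\beq
P( D\!-\!2 < \theta < D\!+\!2 \given \theta ) = P( |D - \theta| < 2 ) = \mbox{erf}(\sqrt{2}) \simeq 0.954 ,
\eeq
so the interval contains the true value about 95\% of the time for every $\theta$; replacing 2 by 1.96 would give 95\% more precisely.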
Let us now look at a simple example where the
meaning of the confidence level becomes clearer.
Let the parameter $\theta$ be an integer, and let the
data be a pair of points $x_1,x_2$, drawn independently from the following
distribution:
\beq
P(x \given \theta) = \left\{ \begin{array}{cl}
\dhalf & x = \theta \\
\dhalf & x = \theta +1 \\
0 & \mbox{for other values of $x$.}
\end{array} \right.
\eeq
For example, if $\theta$ were 39, then
we could expect the following data sets:
\beq
\begin{array}{rl}
D= (x_1,x_2) = (39,39) & \mbox{with probability $\dfrac{1}{4}$;}\\
(x_1,x_2) = (39,40) & \mbox{with probability $\dfrac{1}{4}$;}\\
(x_1,x_2) = (40,39) & \mbox{with probability $\dfrac{1}{4}$;}\\
(x_1,x_2) = (40,40) & \mbox{with probability $\dfrac{1}{4}$.}
\end{array}
\label{eq.four-data-con}
\eeq
We now consider the following confidence interval:
\beq
[\theta_{\min}(D),\theta_{\max}(D)] = [ \min(x_1,x_2) , \min(x_1,x_2) ] .
\eeq
For example, if $(x_1,x_2) = (40,39)$, then the
confidence interval for $\theta$ would be $ [\theta_{\min}(D),\theta_{\max}(D)] = [39,39]$.
Let's think about this confidence interval.
What is its confidence level?
By considering the four possibilities shown in
(\ref{eq.four-data-con}),
we can see that there is a 75\% chance that the confidence
interval will contain the true value.
The confidence interval therefore has a confidence level
of 75\%, by definition.
Now, what if the data we acquire are $(x_1,x_2) = (29,29)$?
Well, we can compute the confidence interval,
and it is $[29,29]$.
So shall we report this interval, and its associated confidence level, 75\%?
This would be correct by the rules of sampling theory. But
does this make sense?\index{rant!confidence level}\index{sermon!confidence level}
What do we actually know in this case?
Intuitively, or by \Bayes\ theorem,
it is clear that $\theta$ could either be $29$ or $28$,
and both possibilities are equally likely (if the prior probabilities of 28 and 29 were equal).
The posterior probability of $\theta$ is 50\% on 29 and 50\%
on 28.
What if the data are $(x_1,x_2) = (29,30)$?
In this case, the confidence interval is
still $[29,29]$, and its associated confidence level is 75\%.
But in this case, by \Bayes\ theorem, or common sense,
we are 100\% sure that $\theta$ is 29.
In neither case is the probability that $\theta$ lies in the
`75\% confidence interval' equal to 75\%!
Thus
\ben
\item
the way in which many people interpret
the confidence levels of sampling theory
is {\em incorrect};
\item
given some data,
what people usually want to know (whether they know
it or not)
is a Bayesian posterior probability distribution.
\een
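The numbers in this example can be verified by brute-force enumeration. The following Python sketch is illustrative: the function names and the finite range of candidate values of $\theta$ are assumptions, and the posterior uses a prior that is flat over those candidates, as in the discussion above.

```python
from fractions import Fraction
from itertools import product


def likelihood(x, theta):
    """P(x | theta): one half on theta and on theta + 1, zero elsewhere."""
    return Fraction(1, 2) if x in (theta, theta + 1) else Fraction(0)


def coverage(theta):
    """Fraction of data sets whose interval [min, min] contains theta."""
    total = Fraction(0)
    for x1, x2 in product((theta, theta + 1), repeat=2):
        if min(x1, x2) == theta:
            total += likelihood(x1, theta) * likelihood(x2, theta)
    return total


def posterior(x1, x2, candidates):
    """Posterior over theta with a prior that is flat over the candidates."""
    weights = {t: likelihood(x1, t) * likelihood(x2, t) for t in candidates}
    z = sum(weights.values())
    return {t: w / z for t, w in weights.items() if w > 0}


print(coverage(39))                       # 3/4
print(posterior(29, 29, range(20, 40)))   # mass 1/2 on 28 and 1/2 on 29
print(posterior(29, 30, range(20, 40)))   # all mass on 29
```

The coverage is a property of the procedure averaged over data sets, whereas the posterior depends on the data set actually observed; the enumeration makes the gap between the two explicit.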
% This is not a contrived example. It is commonly the case
% that the shape of a posterior distribution will depend
% on the details of the data, so it is to be expected
Are all these examples contrived?
% Are sampling methods actually
Am I making a fuss about nothing?
If you are sceptical about the dogmatic views I have expressed,
I encourage you to look at a case study: look in depth
at \exerciseref{ex.luriadelbruck} and the reference \cite{kepleroprea01a},
in which sampling theory
estimates and confidence intervals for a mutation rate are constructed.
Try both methods on simulated data -- the Bayesian approach based
on simply computing the likelihood function, and
the confidence interval from sampling theory; and let me know
if you don't find that the Bayesian answer is always better than
the sampling theory answer; and often much, much better.
This suboptimality of sampling theory, achieved
with great effort, is why I am passionate about Bayesian methods.
Bayesian methods are straightforward, and they optimally
use all the information in the data.
%
% It upsets me to see people
% using suboptimal tools.
%
% Sampling theorists who create
% confidence intervals are wasting their time, since one
% likelihood function
%
% I wonder how many incorrect conclusions have been reached
% because Bayesian methods were not used; and how many unnecessary experiments
% have been performed because poor confidence intervals were used.
\section{Some compromise positions}
Let's end on a conciliatory note.
Many sampling theorists are pragmatic -- they are happy to choose
from a selection of statistical methods, choosing whichever has
the `best' long-run properties. In contrast, I have no problem with
the idea that there is only {\em{one}\/} answer to
a well-posed problem; but it's not essential to convert sampling theorists
% from their viewpoint on how to solve inference problems:
to this viewpoint:
instead, we can offer them Bayesian estimators
and Bayesian confidence intervals, and request
that the sampling theoretical properties of these
methods be evaluated. We don't need to mention
that the methods are derived from a Bayesian perspective.
If the sampling properties are good
then the pragmatic sampling theorist will choose to use
the Bayesian methods.
% We'll all end up using the same methods
It is indeed the case that
many Bayesian methods have good sampling-theoretical properties.
Perhaps it's not surprising that a method that
gives the optimal answer for each individual case
should also be good in the long run!
% So people who haven't understood or accepted the Bayesian
% perspective may nevertheless pragmatically agree
% that the Bayesian approach is an efficient way to
% construct good answers to inference problems.
Another piece of common ground can be conceded:
while I believe that most well-posed inference problems
have a unique correct answer, which can be found
by Bayesian methods, not all problems are well-posed.
A common question arising in data modelling is
`am I using an appropriate model?'
Model criticism, that is, hunting for defects
in a current model, is a task that may be aided
by sampling theory tests, in which the null hypothesis
(`the current model is correct') is well defined,
but the alternative model is not specified. One could
use sampling theory measures such as $p$-values
to guide one's search for the aspects of the model
most in need of scrutiny.
%\section{More amusing P-value stories}
% meter measurement - if X happens then we use a different meter; oh it was not available
%
% coin toss with different stopping rules. stop after r heads or stop after
% N tosses/ Different answers for p-values?
% \section*{Summary} moved to NOTES.tex
%\subsection*{Further reading}
% Bayesian methods are contrasted with
% sampling theory statistics in \cite{Jaynes.intervals,G1,Loredo}.
\section*{Further reading}
My favourite reading on this topic includes \cite{Jaynes.intervals,G1,Loredo,Berger,Jaynes2003}.
Treatises on Bayesian statistics from the
statistics community include \cite{Box_and_Tiao_text,ohagan94}.
\section{Further exercises}
\exercisxB{3C}{ex.traffic}{
A traffic survey records traffic on two
successive days.
On Friday morning, there are 12 vehicles in one hour.
On Saturday morning, there are 9 vehicles in half an hour.
Assuming that the vehicles are Poisson
distributed with rates $\l_F$ and $\l_S$ (in vehicles per hour) respectively,
% and assuming a priori that the two rates are different,
\ben
\item
is $\l_S$ greater than $\l_F$?
\item
by what factor is $\l_S$ bigger or smaller than $\l_F$?
\een
% itp/traffic.gnu contains likelihood plot
}
\exercisxB{3C}{ex.microsoftus}{
Write a program to compare treatments A and B given data $F_{A+}$, $F_{A-}$, $F_{B+}$,
$F_{B-}$ as described in \secref{sec.microsoftus}.
The outputs of the program should be (a) the probability that treatment A is more
effective than treatment B; (b) the probability that $p_{A+} < 10\, p_{B+}$;
(c) the probability that $p_{B+} < 10\, p_{A+}$.
}
%
\dvips
%
% NEURAL NETWORKS
%
\renewcommand{\partfigure}{\poincare{8.frag2}}
\part{Neural networks}
%\chapter{Introduction to neural networks}
\chapter{Introduction to Neural Networks}
\label{ch.nn.intro}
% \chapter{Some neural networks that learn}
%
% \chapter{Some neural networks that learn}
%
% This used to be a chapter with both single neuron and hopfield net in it.
% that is saved in old/9702
%
In the field of neural networks, we study the properties of
networks of idealized `neurons'.
% The field of neural networks is broad and interdisciplinary.
Three motivations underlie work
% on neural networks.
in this broad and interdisciplinary field.
\begin{description}
\item[Biology\puncspace] The task of understanding how the \ind{brain} works
is one of the outstanding unsolved problems in science.
Some neural network models are intended to shed light on the
way in which computation and memory are performed by brains.
\item[Engineering\puncspace] Many researchers would like to create
machines that can `learn', perform `pattern recognition' or
`discover patterns in data'.
% Neural networks
% automate
\item[Complex systems\puncspace]
% Neural nets as c
A third motivation for being interested in neural networks
is that they are complex adaptive systems whose properties
are interesting in their own right.
\end{description}
I should emphasize several points at the outset.
\bit
\item
This book gives only a taste of
this field. There are many interesting neural network models which
we will not have time to touch on.
\item
The models that we discuss
are not intended to be faithful models of biological systems.
% Rather,
If they are at all relevant to biology, their relevance
is on an abstract level.
%
\item
I will describe some neural network methods
that are widely used in nonlinear data modelling, but I
will not be able to give a full description of the
state of the art.
% in neural network modelling.
If you wish to solve
real problems with neural networks, please read the relevant papers.
% find out about
\eit
% Having given these cautions, let's start!
\section{Memories}
In the next few chapters we will meet several
\ind{neural network} models
% , the `single neuron'
% (also known as perceptron
% and the ``\ind{Hopfield network}". Both come with
which come with simple \ind{learning
algorithms} which make them function as {\dem memories}. Perhaps we should
dwell for a moment on the conventional idea of memory in digital
computation. A memory (a string of 5000 bits describing the name of
a person and an image of their face, say) is stored in a digital
computer at an {\dem address}. To retrieve the \ind{memory} you need to know the
address. The \ind{address} has nothing to do with the memory itself. Notice
the properties that this scheme does {\em not\/} have:
\ben
\item
Address-based memory is
{\em not\/} associative.\index{memory!associative}\index{memory!address-based}\index{associative memory}
Imagine you know half of a memory, say someone's face, and you would
like to recall the rest of the memory -- their name. If your memory is address-based then you
can't get at a memory without knowing the address. [Computer
scientists have devoted effort to wrapping
traditional address-based memories inside cunning
software to produce \index{content-addressable memory}content-addressable memories, but
content-addressability does not come naturally. It has to be added
on.]
\item
Address-based memory is {\em not\/} robust or fault-tolerant.
% [Note connection to ecc]
If a one-bit mistake is made in specifying the
{\em address\/} then a completely different memory will be retrieved. If one
bit of a {\em memory\/} is flipped then whenever that memory is retrieved the
error will be present. Of course, in all modern computers,
error-correcting codes are used in the
memory, so that small numbers of errors can be detected and corrected.
But this error-tolerance is not an intrinsic property of the
memory system. If minor
damage occurs to certain hardware that implements memory retrieval,
it is likely that all functionality will be catastrophically lost.
% SHOULD I HAVE CUT THIS?
% It is interesting to ask whether there are models of computation
% and memory that have in-built error correcting capabilities.
%
%\item
% Serial computers are not simple. Well they are.
\item
Address-based memory is {\em not\/} distributed. In a
serial computer that is accessing a particular memory, only a
tiny fraction of the devices participate in the memory recall:
the CPU and the circuits that are storing the required byte.
All the other millions of devices in the machine are sitting
idle.
% of , explicit. [There are parallel computers, but
% leopard has not changed its spots.]
% It is interesting to ask whether there are
Are there models of truly parallel
computation, in which multiple devices participate in
all computations? [Present-day parallel computers scarcely differ
from serial computers from this point of view. Memory retrieval
works in just the same way, and control of the computation
process resides in CPUs. There are simply a few more CPUs.
Most of the devices sit idle most of the time.]
\een
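The first two properties can be made concrete with a toy sketch in Python. Everything here is hypothetical and chosen only for illustration: the four-bit addresses, the ten-bit memories, and the nearest-neighbour wrapper, which is one simple way of adding content-addressability in software on top of an address-based store.

```python
def address_lookup(store, address):
    """Address-based recall: the address must be exactly right."""
    return store.get(address)


def hamming(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


def content_lookup(store, cue):
    """Software wrapper: return the stored memory closest to the cue."""
    return min(store.values(), key=lambda memory: hamming(memory, cue))


# Two memories stored at addresses that differ in a single bit.
store = {0b0101: "1100110011", 0b0100: "0000011111"}

print(address_lookup(store, 0b0101))        # the intended memory
print(address_lookup(store, 0b0100))        # one-bit address error: a different memory
print(content_lookup(store, "1100110001"))  # corrupted cue still finds the first memory
```

Note that the wrapper has to scan every stored memory; the associative behaviour is added on by software rather than being a property of the memory itself.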
Biological memory systems are completely different.
\ben
\item
\index{memory!content-addressable}\index{content-addressable memory}Biological
memory is associative. Memory recall is {\em{content-addressable}}.
Given a person's name, we can often recall their
face; and {\em vice versa}.
Memories are apparently recalled spontaneously, not
just at the
request of some CPU.
\item
Biological memory recall is error-tolerant and robust.
\bit
\item
Errors in the cues for memory recall can be corrected.
% NEED A NEW JOKE HERE.
% HELP
An example asks you to recall `An American politician
who was very intelligent and
whose politician father did not like broccoli'.
Many people think of President Bush --
even though one of the cues contains an error.
\item
Hardware faults can also be tolerated.
Our brains are noisy lumps of meat that
% are alive and
are in a continual
state of change, with cells being damaged by natural
processes, alcohol, and boxing.
% [all the proteins change in... all thecells change in...]
While the cells in our brains and the proteins in our cells are
continually changing, many of our memories
persist unaffected.
\eit
\item
Biological memory is parallel and distributed -- not {\em completely\/} distributed
throughout the whole brain: there does appear to be
some functional specialization
-- but in the parts of the brain where memories are stored,
it seems that many neurons participate in the storage of multiple
memories.
\een
% Where are we going: we are going to view nets as memory devices,
% which means they can be channels, when we go from the teacher to
% the test situation. Can also view the net itself as a channel if
% it has inputs and outputs. Then learning is adapting the channel to
% some purpose. Can also view learning as statistical inference,
% inferring what the world is doing. World is input to channel.
These properties of biological memory systems motivate the
study of `artificial neural networks' -- parallel distributed
computational systems consisting of many interacting
simple elements. The hope is that
these model systems might give some hints
as to how neural computation is achieved in real biological neural networks.
%The
%one thing I think it is lacking is a clear and obvious definition of what
%is meant by a neuron - I was quite confused when you started talking about
%functions and fitting. Also, I would have found a concrete example of
%what is meant by inputs x and targets t (such examples were in later
%chapters, but I would have definitely found it useful to have explicit
%examples in this introduction).
%
\section{Terminology}
Each time we describe a \ind{neural network}
% model
algorithm we will typically specify three things.
[If any of this terminology is hard to understand,
it's probably best to dive straight into the next chapter.]
\begin{description}
\item[Architecture\puncspace] The \ind{architecture} specifies what variables are involved
in the network and their topological relationships --
for example, the variables involved in a
neural net might be the {\dbf weights\/} of the connections between the
neurons, along with the {\dbf activities\/} of the neurons.
% ,
% which neuron is connected to which,
% and the {\dbf weights\/} of the connections between them.
\item[Activity rule\puncspace] Most neural network models have short
time-scale dynamics: local rules define how
the {\dbf activities\/} of the neurons change in response to each other.
Typically the \ind{activity rule}
depends on the {\dbf weights\/} (the parameters)
in the network.
\item[Learning rule\puncspace] The \ind{learning rule} specifies the way in which the
neural network's {\dbf weights\/} change with time. This learning is
usually viewed as taking place on a longer time scale than the
time scale of the dynamics under the activity rule. Usually the
learning rule will depend on the {\dbf activities\/} of the neurons. It may
also depend on {\dbf target\/} values supplied by a {\dbf teacher\/}
and on the current values of the weights.
\end{description}
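To make the three ingredients concrete, here is a minimal Python sketch of a single unit. It is an illustration under stated assumptions, not a definition taken from this chapter: the class name, the logistic activity rule, the error-proportional learning rule and the learning rate are all choices made here for the sake of the example.

```python
import math


class TinyNeuron:
    """One unit with an architecture, an activity rule and a learning rule."""

    def __init__(self, n_inputs):
        # Architecture: one weight per input, plus a bias.
        self.weights = [0.0] * n_inputs
        self.bias = 0.0

    def activity(self, x):
        # Activity rule (assumed): logistic function of the weighted sum.
        a = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return 1.0 / (1.0 + math.exp(-a))

    def learn(self, x, target, eta=0.1):
        # Learning rule (assumed): change each weight in proportion to the
        # error between the teacher's target and the current activity.
        error = target - self.activity(x)
        self.weights = [w + eta * error * xi for w, xi in zip(self.weights, x)]
        self.bias += eta * error


neuron = TinyNeuron(2)
neuron.learn([1.0, 0.0], target=1.0)
print(neuron.weights, neuron.bias)
```

The activity rule reads the weights and acts on the fast time scale; the learning rule reads the activities and the target and changes the weights on the slow time scale.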
Where do these rules come from?
Often, activity rules and learning rules are invented by imaginative
researchers. Alternatively, activity rules and learning rules
may be {\em derived\/} from carefully chosen {\dbf objective functions}.
% Sometimes we will pull interesting
% activity rules and learning rules out of the
% air and study their properties; on other occasions we will
% {\em derive\/} these rules from well-defined {\dbf objective functions}.
% \section{
Neural network algorithms can be roughly divided into two classes.
\begin{description}
\item[Supervised neural networks] are given data in the form of
{\dbf inputs} and {\dbf targets}, the targets being
a {\dbf teacher}'s specification
of what the neural network's response to the
input should be.
\item[Unsupervised neural networks] are given data in an undivided
form -- simply a set of examples $\{\bx\}$. Some learning
algorithms
% of some unsupervised neural networks
are intended
simply to memorize these data in such a way that the examples
can be recalled in the future. Other algorithms are intended
to `generalize',
% to process the data into
to discover `patterns' in the data, or extract the underlying `features'
from them.
Some unsupervised algorithms are able to make predictions --
for example, some algorithms can `fill in' missing variables
in an example $\bx$ -- and
so can also be viewed as supervised networks.
% learning as a special case.
\end{description}
\dvips
%\chapter{The single neuron as a classifier}
\chapter{The Single Neuron as a Classifier}
\label{ch.single.neuron.class}
% \newcommand{\FIGSlearning}{/home/mackay/book/FIGS/learning}
%\noindent
% Supervised
% neural networks are parameterized nonlinear models used for empirical
% regression and classification modelling. Their flexibility makes
% them able to discover more general relationships in data than traditional
% statistical models.
%
% solns in _s13.tex
%
%%%%%%%%%%%%%%%%%%5
\section{The single neuron}
We will study a single \ind{neuron} for two reasons. First, many neural
network models are built out of single neurons, so it is good to understand
them in detail. And second, a single neuron is itself capable of
`\ind{learning}' -- indeed, various standard statistical methods can be
viewed in terms of single neurons -- so this model will serve as a
first example of a {\dbf supervised neural network}.
% introduction to supervised
\subsection{Definition of a single neuron}
We will start by defining the architecture and the \ind{activity rule} of
a single neuron, and we will then derive a
% since there are a variety of learning rules.
learning rule.
%\begin{figure}
%\figuremargin{%
\marginfig{
\begin{center}\small
{\setlength{\unitlength}{0.021in}
\begin{picture}(48,61)(-1,-2)
\put(20,3){\line(1,6){4.56}}% 5 less a bit
\put(30,3){\line(-1,6){4.56}}
\put(10,3){\line(1,2){13.8}}%15 less a bit
\put(40,3){\line(-1,2){13.8}}
%
% inputs
\multiput(10,2)(10,0){4}{\circle{2}}
% bias
\put(0,24){\circle{2}}
\put(1,24.5){\line(2,1){18}}
\put(10,32){\makebox(0,0)[r]{\small$w_0$}}
%
% neuron
\put(25,37){\circle{12}}
\put(25,44){\vector(0,1){16}}
\put(24,47.5){\makebox(0,0)[r]{\small$y$}}
%
\put(10,0){\makebox(0,0)[t]{\small$x_1$}}
\put(40,0){\makebox(0,0)[t]{\small$x_I$}}
\put(12,10){\makebox(0,0)[r]{\small$w_1$}}
\put(38,10){\makebox(0,0)[l]{\small$w_I$}}
\put(25,-1){\makebox(0,0)[t]{\small$\ldots$}}
\end{picture}}
%\mbox{\psfig{figure=\mjofigs/node.ps}}
\end{center}
%}{%
\caption[a]{A single neuron.}
\label{fig.neuron}
}%
%\end{figure}
\begin{description}
\item[Architecture\puncspace]
A single neuron has a number $I$ of {\dbf inputs} $x_i$ and one {\dem output}
which we will here call $y$. (See \figref{fig.neuron}.)
Associated with each input is a {{\dem{weight}}}\index{weight!in neural net}
$w_i$ ($i = 1 ,\ldots, I$).
There may be an
additional parameter $w_0$ of the neuron called a {\dbf bias}\index{bias!in neural net}
which we may
view as being the weight associated with an input $x_0$ which is
permanently set to 1.
The single neuron is a {\dbf feedforward} device -- the connections
are directed from the inputs to the output of the neuron.
\item[Activity rule\puncspace]
The activity rule has two steps.
%\begin{description}
\ben
\item
% []{\sl Activation.}
First, in response to the imposed inputs
$\bx$, we compute the {\dbf \ind{activation}} of the neuron,
\beq
a = \sum_i w_i x_i ,
\eeq
where the sum is over
$i = 0 ,\ldots, I$ if there is a bias and $i = 1 ,\ldots, I$
otherwise.
\item
% []{\sl Activity.}
Second, the {\dbf output} $y$ is set as a function $f(a)$ of the
activation.
The output is also called the {\dbf \ind{activity}} of the neuron, not to
be confused with the activation $a$.
\marginpar{
\small%footnotesize
\begin{tabular}{rcl}
activation & & activity\\
$a$ & $\rightarrow$ & $y(a)$ \\
\end{tabular}
}
There are several possible {\dbf \ind{activation function}s};
here are the most popular.
\newcommand{\dinkyfigsig}[1]{\begin{center}\mbox{\psfig{figure=#1,angle=-90,width=0.75in}}\end{center}}
\ben
\item Deterministic activation functions:
\ben
\item Linear.
\beq
y(a) = a .
\eeq
\item Sigmoid (logistic function).
\amarginfignocaption{t}{
\dinkyfigsig{figs/sigmoidl.ps}
}
\beq
y(a) = \frac{1}{1+e^{-a}} \hspace{0.3in} (y \in (0,1)) .
\eeq
\item Sigmoid (tanh).
\amarginfignocaption{t}{
\dinkyfigsig{figs/sigmoidt.ps}
}
\beq
y(a) = \tanh(a) \hspace{0.3in} (y \in (-1,1)) .
\label{eq.single.tanh}
\eeq
\item Threshold function.
\amarginfignocaption{t}{
\dinkyfigsig{figs/sigmoids.ps}
}
\beq
y(a) = \Theta(a) \equiv \left\{ \begin{array}{rl}
1 & a > 0 \\
-1 & a \leq 0 . \end{array} \right.
\label{eq.single.thresh}
\eeq
\een
\item Stochastic\index{stochastic}
activation functions: $y$ is stochastically selected
from $\pm 1$.
\ben
\item Heat bath.
\beq
y(a) = \left\{ \begin{array}{ll}
1 & \mbox{with probability $\displaystyle \frac{1}{1+e^{-a}}$} \\
-1 & \mbox{otherwise.} \end{array} \right.
\eeq
\item The Metropolis
rule produces the output in a way that depends on the
previous output state $y$:
% If this quantity is positive then the neuron's activity
% agrees with its activation; if it is negative then the two
% disagree and the neuron's activity is flipped.
\[%beq
\begin{array}{l}
\mbox{\sf Compute}\:\: \Delta = a y
\\
\mbox{\sf If}\:\:\Delta \leq 0, \:\:\mbox{flip $y$ to the other state}
\\
\mbox{{\sf{Else}} flip $y$ to the other state with probability $e^{-\Delta}$.}
\end{array}
\]%eeq
%\def\sp{\hspace*{0.2in}}
%\beq
% \parbox{3in}{
%if $y = 1 $ \{ \\
%\sp if $a \leq 0$ \\
%\sp \sp $y=-1$ \\
%\sp else if $a>0$ \{\\
%\sp \sp $y = -1$ with probability $e^{-|a|}$ \\
%\sp otherwise \\
%\sp \sp $y = 1$ \\
%% \sp \} \\
%\}
% else if $y=-1$ \{ \\
%\sp if $a \geq 0$ \\
%\sp \sp $y=1$ \\
%\sp else if $a < 0$ \\ % \{\\
%\sp \sp $y = 1$ with probability $e^{-|a|}$ \\
%\sp % \}
% otherwise \\ % \{ \\
%\sp \sp $y = - 1$ \\
%% \sp \} \\
%\} \\
%}
%%
%\eeq
\een
\een
\een
%\end{description}
\end{description}
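\medskip
\noindent
As an illustration, the activity rules listed above can be sketched in
 {\tt Octave} as follows. This is a minimal sketch rather than code from the
 original text: the numerical values of {\tt x} and {\tt w} and all the
 variable names are made up for the example, and the bias is handled by
 setting the first input to 1.
\begin{verbatim}
# Illustrative sketch; x, w and the variable names are assumptions.
x = [ 1.0 ; 0.5 ; -2.0 ] ;   # inputs; x(1) = 1 is the bias input x_0
w = [ 0.1 ; 2.0 ;  1.0 ] ;   # weights; w(1) is the bias w_0
a = w' * x ;                 # activation a = sum_i w_i x_i

y_linear   = a ;                           # linear
y_logistic = 1.0 / ( 1.0 + exp ( - a ) ) ; # sigmoid (logistic), in (0,1)
y_tanh     = tanh ( a ) ;                  # sigmoid (tanh), in (-1,1)
if ( a > 0 )                               # threshold function
  y_thresh = 1 ;
else
  y_thresh = -1 ;
endif

if ( rand () < y_logistic )                # heat bath
  y_heat = 1 ;
else
  y_heat = -1 ;
endif

y = 1 ;                                    # previous output state, +1 or -1
Delta = a * y ;                            # Metropolis rule
if ( Delta <= 0 || rand () < exp ( - Delta ) )
  y = - y ;                                # flip y to the other state
endif
\end{verbatim}
With these made-up numbers the activation is $a = -0.9$, so the threshold
 function returns $-1$, and the Metropolis rule, starting from $y=+1$,
 has $\Delta \leq 0$ and flips $y$ with certainty.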
\section{Basic neural network concepts}
A neural network implements a function $y(\bx;\bw)$; the `output' of
the network, $y$, is a nonlinear function of the `inputs' $\bx$; this
function is parameterized by `weights' $\bw$.
We will study a single neuron which
% A very simple neural network
produces an output between 0 and 1 as
the following function of $\bx$:
\beq
y(\bx;\bw) = \frac{1}{1+e^{-\bw \cdot \bx}} .
\label{lin.log}
\eeq
\exercisaxA{1}{ex.logitremind}{
In what contexts have we encountered the function
$y(\bx;\bw) = \linefrac{1}{(1+e^{-\bw \cdot \bx})}$
already?
%
% mft, gaussian channel, general probs
%
}
\subsection*{Motivations for the linear logistic function}
% We have already encountered this function.
In section \secpulse\ we studied `the best detection
of pulses', assuming that one of two signals $\bx_0$ and $\bx_1$
had been transmitted over a Gaussian channel with variance--covariance
matrix $\bAI$. We found that the
probability that the source signal was $s \eq 1$ rather than $s \eq 0$,
given the received signal $\by$,
was
\beq
P( s\eq 1 \given \by ) = \frac{1}{1+\exp (-a(\by) )} ,
\eeq
where $a(\by)$ was a linear function of the
received vector,
\beq
a(\by) = \bw^{\T} \by + \theta ,
\eeq
with $\bw \equiv \bA ( \bx_1 -\bx_0)$.
The linear logistic function can be motivated in several other ways --
see the exercises.
\subsection{Input space and weight space}
For convenience let us study the case where the input vector $\bx$
and the parameter vector $\bw$ are both two-dimensional: $\bx = (x_1,x_2)$,
$\bw = (w_1,w_2)$. Then we can spell out the function performed by the
neuron
% $f$
thus:
\beq
y(\bx;\bw) = \frac{1}{1+e^{-(w_1 x_1 + w_2 x_2)}} .
\label{lin.log2}
\eeq
Figure \ref{one_output} shows the output of the neuron as a function
of the input vector, for $\bw = (0,2)$. The two horizontal axes of
this figure are the inputs $x_1$ and $x_2$, with the output $y$ on
the vertical axis. Notice that on any line perpendicular to $\bw$,
the output is constant; and along a line in the direction of $\bw$,
the output is a \ind{sigmoid} function.
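Both observations follow from \eqref{lin.log2}, since $y$ depends on
 $\bx$ only through $\bw \cdot \bx$. As a short check (the second input
 $\bx'$ and the scalar $s$ are notation introduced only for this remark):
 if $\bx'$ differs from $\bx$ by a displacement perpendicular to $\bw$, then
\beq
 \bw \cdot \bx' = \bw \cdot \bx + \bw \cdot ( \bx' - \bx ) = \bw \cdot \bx ,
\eeq
 so $y(\bx';\bw) = y(\bx;\bw)$. Along the direction of $\bw$ we may write
 $\bx = s \, \bw / |\bw|$, which gives $\bw \cdot \bx = s |\bw|$ and
\beq
 y = \frac{1}{1+e^{- s |\bw|}} ,
\eeq
 a sigmoid function of the distance $s$.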
We now introduce the idea of {\dbf\ind{weight
space}}, that is, the parameter space of the network. In this case,
there are two parameters $w_1$ and $w_2$, so the weight space is two
dimensional. This weight space is shown in figure
\ref{fig.points.in.w.space}. For a selection of values of
the parameter vector $\bw$, smaller inset figures show the function
of $\bx$ performed by the network when $\bw$ is set to those
values. Each of these smaller figures is equivalent to figure
\ref{one_output}. Thus each {\em point} in $\bw$ space corresponds
to a {\em function} of $\bx$. Notice that the gain of the sigmoid
function (the gradient of the ramp) increases as the magnitude of
$\bw$ increases.
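The dependence of the gain on the magnitude of $\bw$ can be made explicit.
 Writing $s$ for the distance from the origin measured along the direction
 of $\bw$ (notation introduced only for this remark), the activation is
 $a = s |\bw|$. The logistic function satisfies
 $\d y / \d a = y(1-y)$, so
\beq
 \frac{\d y}{\d s} = |\bw| \, y (1-y) ,
\eeq
 which on the contour $y = 0.5$ takes the value $|\bw|/4$.
 For $\bw = (0,2)$, for example, the slope of the ramp at $y = 0.5$ is $1/2$.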
\begin{figure}
\figuremargin{%
\begin{center}
\begin{tabular}{@{}c@{}}
\setlength{\unitlength}{1in}\begin{picture}(2.6,2.2)(0.5,0.3)
\put(0,0){\makebox{\psfig{figure=\FIGSlearning/f.0.2.ps,height=3in,width=3in,angle=-90}}}\end{picture} \\
$\bw = (0,2)$ \\
\end{tabular}
\end{center}
}{%
\caption[a]{{Output of a simple neural network as a function of its input.}}
\label{one_output}
}%
\end{figure}
\begin{figure}
%\figuremargin{%
\figuredangle{
\begin{raggedright}
\setlength{\unitlength}{0.67in}% this is the original value
% \setlength{\unitlength}{0.6in}
\begin{picture}(10,8)(-3.2,-2.4)\thinlines
\put(-3,0){\vector(1,0){9.5}}
\put(0,-2){\vector(0,1){7}}
\put(-3,-2.2){\makebox(0,0)[t]{$-3$}}
\put(-2,-2.2){\makebox(0,0)[t]{$-2$}}
\put(-1,-2.2){\makebox(0,0)[t]{$-1$}}
\put( 0,-2.2){\makebox(0,0)[t]{ 0}}
\put( 1,-2.2){\makebox(0,0)[t]{ 1}}
\put( 2,-2.2){\makebox(0,0)[t]{ 2}}
\put( 3,-2.2){\makebox(0,0)[t]{ 3}}
\put( 4,-2.2){\makebox(0,0)[t]{ 4}}
\put( 5,-2.2){\makebox(0,0)[t]{ 5}}
\put( 6,-2.2){\makebox(0,0)[t]{ 6}}
\put(-3.2,-2){\makebox(0,0)[r]{$-2$}}
\put(-3.2,-1){\makebox(0,0)[r]{$-1$}}
\put(-3.2, 0){\makebox(0,0)[r]{ 0}}
\put(-3.2, 1){\makebox(0,0)[r]{ 1}}
\put(-3.2, 2){\makebox(0,0)[r]{ 2}}
\put(-3.2, 3){\makebox(0,0)[r]{ 3}}
\put(-3.2, 4){\makebox(0,0)[r]{ 4}}
\put(-3.2, 5){\makebox(0,0)[r]{ 5}}
%\put(-6.5,-4.45){\makebox(0,0)[bl]{\psfig{figure=\FIGSlearning/wspace.ps,%orig
%angle=-90,height=6.55in,width=9in} }}
% new version:
%\put(-6.5,-4.45){\makebox(0,0)[bl]{\psfig{figure=\FIGSlearning/wspace.ps,%
%angle=-90,height=5.8657in,width=8.059in} }}
\put(-0.4,4.864){\makebox(0,0)[bl]{$w_2$}}
\put(6,-0.4){\makebox(0,0)[bl]{$w_1$}}
%\put(-4,4){\makebox(0,0)[bl]{$w_2$}}
%\put(4,-4){\makebox(0,0)[bl]{$w_1$}}
\funcfig{-2}{3} \funcfig{0}{2}
\funcfig{1}{4} \funcfig{5}{4}
\funcfig{2}{2} \funcfig{2}{-2}
\funcfig{5}{1}
\funcfig{-2}{-1} \funcfig{1}{0} \funcfig{3}{0}
\end{picture}
\end{raggedright}
}{%
\caption{Weight space.\index{weight space}}
\label{fig.funcs.in.x.space}
\label{fig.points.in.w.space}
\label{fig.w.space}
}%
\end{figure}
Now, the central idea of supervised neural networks is this. Given
 {\dbf examples} of a relationship between an input vector $\bx$ and
a target $t$, we hope to make the neural network
`learn' a model of the relationship between $\bx$ and $t$.
A successfully trained network will, for any given $\bx$, give an
output $y$ that is close (in some sense) to the target value $t$.
{\dbf Training} the network involves searching in the weight
space of the network for a value of $\bw$ that produces a function
that fits the provided training data well.
Typically an {\dbf \ind{objective
function}} or {\dbf \ind{error function}} is defined, as a function of $\bw$,
to measure how well the network
with weights set to $\w$
% $M(\w)$
solves the task. The objective function is a sum of terms, one for
each input/target pair $\{\bx,t\}$, measuring
how close the output $y(\bx;\bw)$ is to the target $t$.
The training process is an exercise in {\em function \ind{minimization}}
-- \ie, adjusting $\bw$ in such a way as to find a $\bw$ that
minimizes the objective function. Many function-minimization\index{algorithm!function minimization}\index{function minimization}
 algorithms make use not only of the objective function, but also of its
{\em gradient} with respect to the parameters $\bw$. For
general feedforward neural networks
the {\dbf \ind{backpropagation}} algorithm
% \cite{backprop}
efficiently evaluates the gradient of the output $y$ with respect
to the parameters $\bw$, and thence the gradient of the objective function
with respect to $\w$.
\section{Training the single neuron as a binary classifier}
\label{sec.single.neuron.class}
% Without further ado, let us put a neuron to work.
% A neural network with one output can be
% trained to classify input patterns as belonging to one of
% two classes.
We assume we have a data set of inputs $\{\bx^{(n)}\}_{n=1}^N$ with
binary labels $\{t^{(n)}\}_{n=1}^N$, and\index{learning algorithms!classification}
% We can define an objective function for
a neuron\index{learning algorithms!single neuron}
whose output
$y(\bx;\bw)$ is bounded between 0 and 1.
We can then write down the following {\dbf error function}:
\beq
G(\bw) =
-
\sum_n
\left[
t^{(n)} \ln y(\bx^{(n)};\bw) + (1-t^{(n)}) \ln (1-y(\bx^{(n)};\bw))
\right] .
\label{eq.single.neuron.G}
\eeq
Each term in
this objective function may be recognized as
the {\em information content\/} of one outcome.
It may also be described as the \ind{relative entropy} between
the empirical probability distribution $(t^{(n)},1-t^{(n)})$
and the probability distribution implied by the output of the
neuron $(y,1-y)$.
The objective function is bounded below by zero and only attains this
value if $y(\bx^{(n)};\bw) = t^{(n)}$ for all $n$.
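A single term of this sum can be evaluated by hand; the numbers used here
 are illustrative values chosen for this remark, not data from the text.
 If $t = 1$, the term reduces to $-\ln y$, and
\beq
 - \ln 0.9 \simeq 0.105 , \hspace{0.3in} - \ln 0.1 \simeq 2.30 ,
\eeq
 so a confident correct output is penalized lightly and a confident wrong
 output is penalized heavily. The relative entropy description can be
 checked in the same way: for $t \in \{0,1\}$, with the convention
 $0 \ln 0 = 0$,
\beq
 t \ln \frac{t}{y} + (1-t) \ln \frac{1-t}{1-y}
 = - t \ln y - (1-t) \ln (1-y) ,
\eeq
 because the empirical distribution $(t,1-t)$ has zero entropy.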
We now differentiate this objective function with respect to $\bw$.
\exercisxA{2}{ex.gradG}{
{\sf The \ind{backpropagation} algorithm.}
Show that the derivative $\bg = \partial G/\partial \bw$ is given by:
\beq
g_j = \frac{\partial G}{\partial w_j} = \sum_{n=1}^N
- (t^{(n)} - y^{(n)}) x_j^{(n)} .
\label{eq.gradG}
\eeq
}
Notice that the quantity $e^{(n)} \equiv t^{(n)} - y^{(n)}$ is the {\dbf error\/}
on example $n$ -- the difference between the target and the output.
The simplest thing to do with a gradient of an error function is
to {\em descend\/} it (even though this is often dimensionally incorrect, since
a gradient has dimensions [1/parameter], whereas a change in
a parameter has dimensions [parameter]).
% so \eqref{eq.gradG}
%% {This expression
% motivates the following learning algorithm.
Since the derivative $\partial G/\partial \bw$ is a sum of terms $\bg^{(n)}$
defined by
\beq
% \frac{\partial G}{\partial w_j} = \sum_{n=1}^N \bg
g_j^{(n)} \equiv - (t^{(n)} - y^{(n)}) x_j^{(n)}
\eeq
for $n=1,\ldots, N$, we can obtain a simple on-line algorithm by putting
each input through the network one at a time, and adjusting
$\bw$ a little in a direction opposite to $\bg^{(n)}$.
%
% add figure here showing a single little step for one example.
%
We summarize the whole learning algorithm.
\subsection{The on-line gradient-descent learning algorithm}
% What follows is a paradigm for all supervised learning models.
\begin{description}
\item[Architecture\puncspace]
A single neuron has a number $I$ of {\dbf inputs} $x_i$ and one {\dbf output}
$y$.
Associated with each input is a weight $w_i$ ($i = 1 ,\ldots, I$).
% There may be an
% additional parameter $w_0$ of the neuron called a {\dbf bias} which we may
% view as being the weight associated with an input $x_0$ which happens to be
% permanently equal to 1.
\item[Activity rule\puncspace]
\ben
\item
% []{\sl Activation.}
First, in response to the received inputs
$\bx$ (which may be arbitrary real numbers),
we compute the {\dbf activation} of the neuron,
\beq
a = \sum_i w_i x_i ,
\eeq
where the sum is over
$i = 0 ,\ldots, I$ if there is a bias and $i = 1 ,\ldots, I$
otherwise.
\item
 Second, the {\dbf output} $y$ is set as a sigmoid
 function of the activation,
\beq
y(a) = \frac{1}{1+e^{-a}} .
\eeq
\een
% The output can also be called the {\dbf activity} of the neuron, not to
% be confused with the activation $a$.
This output might be viewed as stating the probability,
according to the neuron,
that the given input is in class 1 rather than class 0.
\item[Learning rule\puncspace]
The {\dbf teacher} supplies a {\dbf target} value $t \in \{0,1\}$ which says
what the correct answer is for the given input.
We compute the {\dbf error signal}
\beq
 e = t - y ,
\eeq
then adjust the weights $\bw$ in a direction that would reduce the
magnitude of this error:
\beq
\upDelta w_i = \eta e x_i ,
\label{eq.single.dw}
\eeq
where $\eta$ is the `learning rate'. Commonly $\eta$ is set by trial and
error to a constant value or to a decreasing function
of simulation time $\tau$ such as $\eta_0/\tau$.
\end{description}
%
% Could also call this
% Heteroassociative memory as well as Classifier / Supervised learning.
%
The activity rule and learning rule
% above rules
are repeated for each input/target pair $(\bx,t)$
that is presented. If there is a fixed data set of size $N$, we can cycle
through the data multiple times.
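\medskip
\noindent
The on-line algorithm just described can be sketched in {\tt Octave} in the
 style of \algref{fig.octave.grad}. This is a minimal sketch under stated
 assumptions, not code from the original text: it assumes that the
 $N \times I$ matrix {\tt x}, the column vector of targets {\tt t}, the
 initial weight vector {\tt w}, and the learning rate {\tt eta} are already
 defined, and the number of passes {\tt P} is a name introduced here.
\begin{verbatim}
# Sketch only: on-line gradient descent for a single neuron.
# Assumes x (N * I), t (N * 1), w (I * 1), eta and P are defined.
N = rows ( x ) ;
for p = 1:P                     # cycle through the data P times
  for n = 1:N                   # present one example at a time
    a = x(n,:) * w ;            # activation for example n
    y = 1.0 / ( 1.0 + exp ( - a ) ) ;  # output
    e = t(n) - y ;              # error signal
    w = w + eta * e * x(n,:)' ; # Delta w_i = eta e x_i
  endfor
endfor
\end{verbatim}
If a bias is wanted, one option is to include a column of ones in {\tt x},
 so that the corresponding element of {\tt w} plays the role of $w_0$.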
% \newcommand{\learnfigstuff}{width=1.8in,angle=-90}
\begin{figure}% 30 , 80 , 500 , 3000 , 40000 its. see neuron/README
%\figuremargin{%
\fullwidthfigureright{
% \begin{center}
%\fbox{
\begin{tabular}{c@{}c}
\begin{tabular}{*{2}{c@{}}}
\makebox[0in][l]{\raisebox{5mm}{\footnotesize(a)}}\makebox[1.675in][r]{$x_1$}\hspace{-1.675in}\raisebox{1in}{\makebox[0in][l]{$x_2$}}\hspace{-4mm}\psfig{figure=neuron/ps/dat01.ps,width=1.8in,angle=-90}
&
\makebox[0in][l]{\raisebox{5mm}{\footnotesize(c)}}%
\raisebox{0.1in}{\makebox[1.6in][r]{$w_1$}\hspace{-1.6in}%
\raisebox{1in}{\makebox[0in][l]{$w_2$}}%
\hspace{-1.5mm}%
\psfig{figure=neuron/ps/wl.ps,width=1.6in,angle=-90}}
\\
\makebox[0in][l]{\raisebox{5mm}{\footnotesize(f)}}\hspace{-4mm}\psfig{figure=neuron/ps/dw30.ps,width=1.8in,angle=-90}
&
\makebox[0in][l]{\raisebox{5mm}{\footnotesize(g)}}\hspace{-4mm}\psfig{figure=neuron/ps/dw80.ps,width=1.8in,angle=-90}
\\
\makebox[0in][l]{\raisebox{5mm}{\footnotesize(h)}}\hspace{-4mm}\psfig{figure=neuron/ps/dw500.ps,width=1.8in,angle=-90}
&
\makebox[0in][l]{\raisebox{5mm}{\footnotesize(i)}}\hspace{-4mm}\psfig{figure=neuron/ps/dw3000.ps,width=1.8in,angle=-90}
\\
\makebox[0in][l]{\raisebox{5mm}{\footnotesize(j)}}\hspace{-4mm}\psfig{figure=neuron/ps/dw10000.ps,width=1.8in,angle=-90}
&
\makebox[0in][l]{\raisebox{5mm}{\footnotesize(k)}}\hspace{-4mm}\psfig{figure=neuron/ps/dw40000.ps,width=1.8in,angle=-90}
\\
\end{tabular}
&
\begin{tabular}{*{1}{c}}
\makebox[0in][l]{\raisebox{5mm}{\footnotesize(b)}}\psfig{figure=neuron/ps/wltimelog.ps,width=2.2in,angle=-90}
\\
\makebox[0in][l]{\raisebox{5mm}{\footnotesize(d)}}\psfig{figure=neuron/ps/Gtimelog.ps,width=2.2in,angle=-90}
\\
\makebox[0in][l]{\raisebox{5mm}{\footnotesize(e)}}\psfig{figure=neuron/ps/EWtimelog.ps,width=2.2in,angle=-90}
\\
\end{tabular}
\end{tabular}
%}% fbox
% \end{center}
}{%
\caption[a]{ {A single neuron learning to classify by gradient descent.}
The neuron has two weights $w_1$ and $w_2$ and a bias $w_0$.
The learning rate was set to $\eta=0.01$ and batch-mode gradient descent
was performed using the code displayed in \protect\algref{fig.train.algm}.
(a) The training data.
(b) Evolution of weights $w_0$, $w_1$ and $w_2$ as a function of
number of iterations (on log scale).
(c) Evolution of weights $w_1$ and $w_2$ in weight space.
(d) The objective function $G(\bw)$ as a function of number of
iterations. (e) The magnitude of the weights $E_W(\bw)$
as a function of time.
(f--k) The function performed by the neuron (shown by three of its contours)
after 30, 80, 500, 3000, $10\,000$ and $40\,000$ iterations.
The contours shown are those corresponding to
 $a=0,\pm 1$, namely $y=0.5, 0.73$ and $0.27$. Also shown is a vector
proportional to $(w_1,w_2)$. The larger the weights are, the bigger
this vector becomes, and the closer together are the contours.
}
\label{fig.neuron.learns}
}%
\end{figure}
\subsection{Batch learning versus on-line learning}
Here we have described the {\dbf on-line} learning
algorithm, in which a change in the weights is made after every example
is presented. An alternative paradigm is to go through a {\dbf batch}
of examples, computing the outputs and errors and accumulating the
changes specified in
\eqref{eq.single.dw} which are then made at the end of the batch.
\subsection{Batch learning for the single neuron classifier}
% \setlength{\fboxsep}{10pt}
\begin{framedalgorithm}
\begin{description}
\item[For each input/target pair] $(\bx^{(n)},t^{(n)})$ ($n=1,\ldots,\, N$),
compute $y^{(n)} = y(\bx^{(n)}; \bw)$, where
\beq
y(\bx; \bw) = \frac{1}{1+\exp\! \left( - \sum_i w_i x_i \right) } ,
\eeq
define $e^{(n)} = t^{(n)} - y^{(n)}$, and compute for
each weight $w_i$
\beq
g^{(n)}_i = - e^{(n)} x^{(n)}_i .
\eeq
\item[Then] let
\beq
\upDelta w_i = - \eta \sum_n g^{(n)}_i .
\eeq
% The minus signs in parentheses are omitted in the source code
% shown in \figref{fig.octave.grad}.
\end{description}
\end{framedalgorithm}
\medskip
\noindent
% Ends up memorizing. (also generalizing, bt that's another lecture)
%
% Demo.
This batch learning algorithm is a {\dem\ind{gradient descent}
algorithm}, whereas the on-line algorithm\index{optimization!gradient descent}
is a {\dem\ind{stochastic gradient} descent\/} algorithm.
Source code implementing batch learning is given
in \algref{fig.octave.grad}.
\begin{algorithm}%%%%%%%%%%%%%%% this needs updating?
\begin{framedalgorithmwithcaption}
%\figuremargin{%\margincaption
{%
\caption[a]{{\tt Octave} source code for a gradient descent optimizer
of a single neuron, batch learning,
with optional weight decay (rate
% . The weight decay rate is controlled by the scalar
{\tt alpha}).
{\tt Octave}\index{octave}
% is a convenient language for matrix operations. N
notation: the
% single
instruction
%
{\tt{a = x * w}}
% \verb|a = x * w|
%
causes the ($N \times I$)
{\em matrix\/} {\tt{x}} consisting of all the input vectors to be multiplied
by the weight vector {\tt w}, giving the {\em vector\/} {\tt a} listing
the activations for all $N$ input vectors;
{\tt x'} means {\tt x}-transpose; the single command {\tt y = sigmoid(a)}
computes the sigmoid function of all
elements of the vector {\tt a}.
% $L$ is the total number of gradient descent steps made.
}
\label{fig.octave.grad}
\label{fig.train.algm}
}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%
\footnotesize
% # The input space has dimension I and there are N data points
\begin{verbatim}
global x ; # x is an N * I matrix containing all the input vectors
global t ; # t is a vector of length N containing all the targets
for l = 1:L # loop L times
a = x * w ; # compute all activations
y = sigmoid(a) ; # compute outputs
e = t - y ; # compute errors
g = - x' * e ; # compute the gradient vector
w = w - eta * ( g + alpha * w ) ; # make step, using learning rate eta
# and weight decay alpha
endfor
function f = sigmoid ( v )
f = 1.0 ./ ( 1.0 + exp ( - v ) ) ;
endfunction
\end{verbatim}
% if ( [ log required ] )
% M = findM ( w ) ;
% wl(logtime,:) = [w',T,G,EW,M] ; # Keep log of the weight vector,
% # time, G, E_W and M
% logtime ++ ;
% endif
%
%function ret = findM ( w )
%# computes objective function assuming that y contains the relevant activities
% G = - (t' * log(y) + (1-t') * log( 1-y )) ;
% EW = w' * w / 2 ;
% M = G + alpha * EW ;
% ret = M ;
%endfunction
\end{framedalgorithmwithcaption}
%
%
%%%%%%%%%%%%%%%%%%%%%%%5 README
% try to get this figure in a good place. batch/on-line confusion!?
%%%%%%%%%%%%%%%%%%%%%%%
\end{algorithm}
This algorithm
is demonstrated in \figref{fig.neuron.learns} for a neuron
with two inputs with weights $w_1$ and $w_2$ and a bias $w_0$,
performing the function
\beq
y(\bx;\bw) = \frac{1}{1+e^{-(w_0 + w_1 x_1 + w_2 x_2)}} .
\label{lin.log3}
\eeq
The bias $w_0$ is included, in contrast to figure \ref{fig.w.space},
where it was omitted.
The neuron is trained on a data set of ten labelled examples.
\begin{figure}% 30 , 80 , 500 , 3000 , 40000 its. see neuron/README
\figuremargin{%
\begin{center}
\hspace*{-0.2in} \begin{tabular}{c@{\hspace*{-0.1in}}*{3}{c@{\hspace*{-0.05in}}}}
& $\a=0.01$ & $\a=0.1$ & $\a = 1$ \\
\raisebox{0.5in}{\footnotesize(a)}
& \makebox[1.8in][l]{\psfig{figure=neuron/ps/wlwd01timelog.ps,width=1.8in,angle=-90}}
& \makebox[1.8in][l]{\psfig{figure=neuron/ps/wlwdtimelog.ps,width=1.8in,angle=-90}}
& \makebox[1.8in][l]{\psfig{figure=neuron/ps/wlwd1timelog.ps,width=1.8in,angle=-90}}
\\
\raisebox{0.5in}{\footnotesize(b)}
& \makebox[1.8in][l]{\psfig{figure=neuron/ps/wlwlwd01.ps,width=1.8in,angle=-90}}
& \makebox[1.8in][l]{\psfig{figure=neuron/ps/wlwlwd.ps,width=1.8in,angle=-90} }
& \makebox[1.8in][l]{\psfig{figure=neuron/ps/wlwlwd1.ps,width=1.8in,angle=-90} }
\\
\raisebox{0.5in}{\footnotesize(c)}
& \makebox[1.8in][l]{\psfig{figure=neuron/ps/GMwd01timelog.ps,width=1.8in,angle=-90,height=1.3in}}
& \makebox[1.8in][l]{\psfig{figure=neuron/ps/GMwdtimelog.ps,width=1.8in,angle=-90,height=1.3in} }
& \makebox[1.8in][l]{\psfig{figure=neuron/ps/GMwd1timelog.ps,width=1.8in,angle=-90,height=1.3in}}
\\
\raisebox{0.5in}{\footnotesize(d)}
& \makebox[1.8in][l]{\psfig{figure=neuron/ps/dwd40000.0.01.ps,width=1.8in,angle=-90}}
& \makebox[1.8in][l]{\psfig{figure=neuron/ps/dwd40000.0.1.ps,width=1.8in,angle=-90}}
& \makebox[1.8in][l]{\psfig{figure=neuron/ps/dwd40000.1.ps,width=1.8in,angle=-90} }
\\
\end{tabular}\hspace{0.1in}
\end{center}
}{%
\caption[a]{ {The influence of weight decay on a single neuron's
learning.}
The objective function is $M(\bw) = G(\bw) + \a E_W(\bw)$.
The learning method was as in
\protect\figref{fig.neuron.learns}.
% rate was set to $\eta=0.01$ and batch-mode gradient descent
% was performed using the code displayed in \protect\figref{fig.train.algm}
(a) Evolution of weights $w_0$, $w_1$ and $w_2$.
% as a function of
% number of iterations (on log scale).
(b) Evolution of weights $w_1$ and $w_2$ in weight space shown by points,
contrasted with the trajectory followed in the case of zero weight decay,
shown by a thin line (from
\protect\figref{fig.neuron.learns}). Notice that for this problem
weight decay has an effect very similar to `early stopping'.
(c) The objective function $M(\bw)$ and the error function $G(\bw)$
as a function of number of
iterations.
(d) The function performed by the neuron
% (shown by three of its contours)
after $40\,000$ iterations.
% The contours shown are those corresponding to
% $a=0,\pm 1$, namely $y=0.5, 0.27$ and $0.73$. Also shown is a vector
% proportional to $(w_1,w_2)$.
}
\label{fig.neuron.learns.decay}
}%
\end{figure}
% gradient
\section{Beyond descent on the error function: regularization}
If the parameter $\eta$ is set to an appropriate value, this algorithm
works: the algorithm finds a setting of $\bw$ that correctly classifies
as many of the examples as possible.
If the examples are in fact {\dbf linearly separable} then the neuron
finds this linear separation and its weights diverge to
ever-larger values as the simulation continues. This can be seen
happening in \figref{fig.neuron.learns}(f--k). This is an
example of {\dbf overfitting}, where a model fits the data so well
that its generalization performance is likely to be adversely
affected.
This behaviour may be viewed as undesirable.
How can it be rectified?
An \adhoc\ solution to overfitting is to use {\dbf early stopping},
that is, use an algorithm originally intended to minimize the error
function $G(\bw)$, then prevent it from doing so by halting
the algorithm at some point.
% yes, this spell is right
A more principled solution to overfitting makes use of {\dbf regularization}.
% (also known as `weight decay').
Regularization involves modifying the
objective function in such a way as to incorporate a bias against
the sorts of solution $\bw$ which we dislike. In the above example,
what we dislike is the development of a very sharp decision boundary
in \figref{fig.neuron.learns}k;
this sharp boundary is associated with large weight values, so
we use a regularizer that penalizes large weight values.
We modify the objective function to:
\beq
 M(\w) = G(\w) + \a E_W(\w) ,
\eeq
where the simplest choice of regularizer is the {\dbf\ind{weight decay}\/}
regularizer
\beq
E_W(\w)= \frac{1}{2} \sum_i w_i^2.
\eeq
The {\dem\ind{regularization constant}\/} $\alpha$ is called the
{weight decay rate}.
This additional
term favours small values of $\bw$ and
decreases the tendency of a model to overfit fine details of the
training data.
%\footnote{In real problems involving larger networks
% the use of more complex regularizers with multiple hyperparameters
% is recommended.}
The quantity $\alpha$ is known as a {\dbf \ind{hyperparameter}}. Hyperparameters
play a role in the learning algorithm but play no role in the
activity rule of the network.
\exercisxA{1}{ex.derivWD}{
Compute the derivative of $M(\w)$ with respect to $w_i$. Why is the above
regularizer known as the `\ind{weight decay}'
regularizer?
% and write down a steepest descents algorithm for minimizing $M$.
}
 The gradient descent source code of \algref{fig.octave.grad} implements
weight decay. This gradient descent algorithm is demonstrated in
\figref{fig.neuron.learns.decay} using weight decay rates
$\a=0.01$, $0.1$, and 1. As the weight decay rate
is increased the solution becomes biased towards broader sigmoid
functions with decision boundaries that are closer to the origin.
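\medskip
\noindent
To monitor learning, the quantities $G(\bw)$, $E_W(\bw)$ and $M(\bw)$
 can be evaluated alongside the gradient descent steps. The following
 {\tt Octave} function is a minimal sketch; the function name
 {\tt objective} and its argument list are assumptions introduced here,
 with {\tt t} and {\tt y} taken to be column vectors of targets and outputs.
\begin{verbatim}
# Sketch: evaluate M(w) = G(w) + alpha E_W(w).
# t, y are column vectors of length N; w is the weight vector.
function M = objective ( w , t , y , alpha )
  G  = - ( t' * log ( y ) + ( 1 - t )' * log ( 1 - y ) ) ; # error function
  EW = w' * w / 2 ;               # weight decay regularizer
  M  = G + alpha * EW ;           # regularized objective
endfunction
\end{verbatim}
A call might look like {\tt M = objective ( w , t , sigmoid ( x * w ) , alpha )},
 using the {\tt sigmoid} function of \algref{fig.octave.grad}. Note that if
 any output saturates numerically to exactly 0 or 1, the logarithms are
 infinite and the result may not be a number, so this sketch is only
 suitable for moderate weight values.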
\subsection*{Note}
Gradient\index{gradient descent}\index{optimization}
descent with a step size $\eta$ is in general {\em not\/}
the most efficient way to minimize a function.
A modification of gradient descent known as {\dem\ind{momentum}},
while improving convergence, is also not
recommended.
Most neural network
experts use more advanced optimizers such as
\ind{conjugate gradient} algorithms.
[Please do not confuse momentum, which
is sometimes given the symbol $\a$, with \ind{weight decay}.]
%
% Originally, I here had a ``Bayes interpretation of learning''
% but now I have put it into a sep chapter. the original is preserved
% in old/9702 .
%
% \newpage
%
% solutions are in
\section{Further exercises}
\subsection*{More motivations for the linear neuron}
\exercisxB{2}{ex.neuronmotiv}{
Consider the task of recognizing which of two Gaussian distributions
a vector $\bz$ comes from. Unlike the case studied in section \secpulse,
where the distributions had different means but a common variance--covariance
matrix, we will assume that the two distributions have exactly the
same mean but different variances.
Let the probability of $\bz$ given $s$ ($s \in \{0,1\}$) be
\beq
P(\bz\given s) = \prod_{i=1}^I \Normal( z_i ; 0,\sigma^2_{si} ),
\eeq
where $\sigma^2_{si}$ is the variance of $z_i$ when the source symbol
is $s$.
Show that $P(s \eq 1\given \bz)$ can be written in the form
\beq
P( s \eq 1\given \bz ) = \frac{1}{1+\exp (-\bw^{\T} \bx + \theta )} ,
\eeq
where $x_i$ is an appropriate function of $z_i$, $x_i=g(z_i)$.
}
\exercisaxA{2}{ex.LED}{ {\bf The noisy LED}.
\begin{center}
\raisebox{-0.3in}{\psfig{figure=figs/led.eps,height=0.6in,angle=-90}}
\hspace{0.5in}
$\:\: \bc(2)=$ \raisebox{-0.2in}{\psfig{figure=figs/led2.eps,height=0.4in,angle=-90}}
$\:\:\bc(3)=$ \raisebox{-0.2in}{\psfig{figure=figs/led3.eps,height=0.4in,angle=-90}}
$\:\:\bc(8)=$ \raisebox{-0.2in}{\psfig{figure=figs/led8.eps,height=0.4in,angle=-90}}
\end{center}
Consider an LED display with 7 elements numbered as shown
above. The state of the
display is a vector $\bx$. When the controller wants the display to
show character number $s$, \eg\ $s \eq 2$, each element $x_j$ ($j=1,\ldots, 7$)
either adopts
its intended state $c_j(s)$, with probability $1-f$, or is flipped, with
probability $f$. Let's call the two states of $x$ `$+1$' and `$-1$'.
\ben
\item
Assuming that the intended character $s$ is actually
a 2 or a 3, what is the probability of $s$, given
the state $\bx$? Show that $P(s \eq 2\given \bx)$ can be written in the form
\beq
P( s \eq 2\given \bx ) = \frac{1}{1+\exp (-\bw^{\T} \bx + \theta )} ,
\eeq
and compute the values of the weights $\w$ in the case $f=0.1$.
\item
Assuming that $s$ is one of $\{0,1,2,\ldots, 9\}$, with prior
probabilities $p_s$,
% values equiprobable {\em a priori\/},
what is the probability of $s$, given
the state $\bx$?
Put your answer in the form
\beq
P(s\given \bx) = \frac{ \displaystyle e^{a_s} }
{ \displaystyle \sum_{s'} e^{a_{s'}} } ,
\label{eq.softmax.a}
\eeq
%
where $\{a_s\}$ are functions of $\{c_j(s)\}$ and $\bx$.
% and $\{c_j(3)\}$.)
\een
Could you make a better alphabet of 10 characters
for a noisy LED, \ie, an alphabet less susceptible to
confusion?
\amargintab{b}{
\begin{center}
\begin{tabular}{cc} \toprule
0 & \hammingdigit{0} \\
1 & \hammingdigit{1} \\
2 & \hammingdigit{2} \\
3 & \hammingdigit{3} \\
4 & \hammingdigit{4} \\
5 & \hammingdigit{5} \\
6 & \hammingdigit{6} \\
7 & \hammingdigit{7} \\
8 & \hammingdigit{8} \\
9 & \hammingdigit{9} \\
10 &\hammingdigit{10} \\
11 &\hammingdigit{11} \\
12 &\hammingdigit{12} \\
13 &\hammingdigit{13} \\
14 &\hammingdigit{14} \\
\bottomrule
\end{tabular}
\end{center}
\caption[a]{An alternative 15-character alphabet
% mapping of integers 0--14 to 15 codewords
for the 7-element LED display.
}
}% end fig
}
\exercisaxB{2}{ex.LED31}{
A $(3,1)$ error-correcting code consists of the
two codewords $\bx^{(1)} = (1,0,0)$ and
$\bx^{(2)} = (0,0,1)$.
A source bit $s \in \{ 1,2 \}$ having probability distribution $\{ p_1,p_2
\}$ is used to select one of the two codewords
for transmission over a binary symmetric channel with noise level
$f$. The received vector is $\br$. Show that the
posterior probability of $s$ given $\br$ can be written in the form
\[
P(s \eq 1 \given \br ) = \frac{1}{ 1 + \exp\! \left( - w_0 - \sum_{n=1}^3
w_n r_n
\right) },
\]
and give expressions for the coefficients $\{ w_n \}_{n=1}^3$
and the bias, $w_0$.
Describe, with a diagram, how this optimal decoder can be expressed
in terms of a `neuron'.
}
\index{connection between!pattern recognition and error-correction}
\dvips
%\section{Solutions}% to Chapter \protect\ref{ch.single.neuron.class}'s exercises}
%\input{tex/_s13.tex}
%\dvipsb{solutions neuron}
\prechapter{Problems to look at before Chapter}% \chcover
% move to lecture 6
% (Please look at these before lecture \chcover.)% \ref{lec.cover}.)
% counting arguments and separation
% for before the cover lecture
\exercisxB{2}{ex.sumCNK}{
What is $\sum_{K=0}^N {N \choose K}$?
\noindent
[The symbol ${N \choose K}$ means the combination
$\smallfrac{N!}{K!(N-K)!}$.]
}
\exercisxB{2}{ex.sumpascal}{
If the top row of Pascal's triangle (which contains the single number
`1') is denoted row zero, what is the sum of all the numbers
in the triangle above row $N$?
}
\exercisxB{2}{ex.T33}{
 Three points are selected at random on the surface of a sphere. What is
the probability that all of them lie on a single hemisphere? }
%\subsubsection{General position}
%\begin{definc}
% A set of points $\{\bx_n\}$ in $K$--dimensional
% space are in {\em general position\/} if
%% they are distinct and non-zero and
% any subset of size $\leq K$ is linearly independent.
%\label{defn.generalposition}
%\end{definc}
\begin{aside}
{This chapter's material is originally due to
\citeasnoun{Polya} and \citeasnoun{Cover65}\index{Cover, Thomas}\index{Abu-Mostafa, Yaser}
and the exposition that follows is Yaser Abu-Mostafa's.\nocite{HKP}
% Another explanation of this material can be found in
% \citeasnoun{HKP}, page 111.
}
\end{aside}
% {Capacity of a single neuron}
\ENDprechapter
\chapter{Capacity of a Single Neuron}% neural network memories}
\label{ch.single.neuron.capacity}
\label{ch.cover}
% cover.tex
% 2 bits per synapse
% \label{ch.single.neuron.capacity}
\begin{figure}[htbp]
\figuremargin{%
\begin{center}
\setlength{\unitlength}{0.07in}
\begin{picture}(57,12.5)(5,2.5)% tweaked from 57,15) on Sun 20/10/02
\put(9,12){\makebox(0,0)[r]{$\{t_n\}_{n=1}^N$}}
\put(20,4){\makebox(0,0)[t]{$\{\bx_n\}_{n=1}^N$}}
\put(10,12){\vector(1,0){5}}
\put(20,5){\vector(0,1){4}}
\put(20,12){\makebox(0,0){\begin{tabular}{c}Learning\\algorithm\end{tabular}}}
\put(25,12){\vector(1,0){5}}
\put(32.5,12){\makebox(0,0){$\bw$}}
\put(35,12){\vector(1,0){5}}
\put(42.5,12){\makebox(0,0){$\bw$}}
\put(42.5,4){\makebox(0,0)[t]{$\{\bx_n\}_{n=1}^N$}}
\put(42.5,5){\vector(0,1){5}}
\put(45,12){\vector(1,0){5}}
\put(52,12){\makebox(0,0)[l]{$\{\hat{t}_n\}_{n=1}^N$}}
\end{picture}
\end{center}
}{%
\caption[a]{Neural network learning viewed as communication.}
\label{fig.nn.as.comm}
% To put it another way, here is our communication scenario.
}%
\end{figure}
% \section{The capacity of a single neuron}
% In the last chapter we made an overview of neural networks.
\section{Neural network learning as communication}
Many\index{neural network!learning as communication}\index{learning!as communication}\index{communication!perspective on learning}
neural network models involve the adaptation of a set of weights $\w$
in response to a set of data points, for example a set of $N$ target
values $D_N = \{t_n\}_{n=1}^N$ at given locations
$\{\bx_n\}_{n=1}^N$. The adapted weights are then used to process
subsequent input data. This process can be viewed as a communication
process, in which the sender examines the data $D_N$ and creates a
message $\w$ that depends on those data. The receiver then uses
$\w$;
%
for example, the
% the simplest example use of $\w$ is the
receiver might use the weights to try to reconstruct what
 the data $D_N$ were. [In neural network parlance, this is
using the neuron for `memory' rather than for `\ind{generalization}';
`generalizing'
means extrapolating from
the observed data to the value of $t_{N+1}$ at some new location
$\bx_{N+1}$.]
Just as a disk drive
% or a piece of RAM
is a communication channel,
the adapted network weights $\w$
therefore play the role of a communication channel, conveying
information about the training data to a future user
of that neural net. The question
we now address is, `what is the capacity of this channel?' -- that is,
`how much information can be stored by training a neural network?'\index{capacity!neural network}\index{neural network!capacity}\index{neuron!capacity}
If
we had a learning algorithm that either produces a network whose
response to all inputs is $+1$ or a network whose response to all
inputs is $0$, depending on the training data, then the weights allow
us to distinguish between just two sorts of data set. The maximum
information such a learning algorithm could convey about the data is
therefore 1 bit, this information content being achieved if the two
sorts of data set are equiprobable. How much more information can be
conveyed if we make full use of a neural network's ability to
represent other functions?
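
 To make the one-bit figure explicit, here is a short check. The symbol $p$
 is introduced only for this illustration: it denotes the probability that
 the training data are of the first sort. The weights then take one of two
 values, so the information they convey is the binary entropy
\[
	p \log_2 \frac{1}{p} + (1-p) \log_2 \frac{1}{1-p} \: \leq \: 1 \mbox{ bit} ,
\]
 with equality if and only if $p = 1/2$.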
% to distinguish The network conveys more
\section{The capacity of a single neuron}
We will look at the simplest case, that of a single binary
threshold neuron. We will find that the capacity\index{capacity!neuron}
of such a neuron is\index{neuron!capacity}
{\em two bits per weight}. A neuron with $K$ inputs can store $2 K$ bits
of information.
To obtain this interesting
result we lay down some rules to exclude
less interesting answers, such as:
% of which the most obvious is this:
`the
capacity of a neuron is infinite, because each of its weights is a
real number and so can convey an infinite number of bits'. We exclude
this answer by saying that the receiver is not able to examine the
weights directly, nor is the receiver allowed to probe the weights by
observing the output of the neuron for arbitrarily chosen inputs. We
constrain the receiver to observe the output of the neuron at
the same fixed set of $N$ points $\{\bx_n\}$ that were in the
training set. What matters now is how many different distinguishable
functions our neuron can produce, given that we can only observe the
function at these $N$ points. How many different binary labellings of
$N$ points can a linear threshold function produce? And how does
this number compare with the maximum possible number of binary
labellings, $2^N$? If nearly all of the $2^N$ labellings can be
realized by our neuron, then it is a communication channel that can
convey all $N$ bits (the target values $\{t_n\}$) with small probability of
error. We will identify the capacity of the neuron as
the maximum value that $N$ can have such that the probability of
error is very small. [We are departing a little from the
definition of \ind{capacity} in \chref{ch5}.]
We thus examine the following scenario.
The sender
is given a neuron with $K$ inputs and a data set $D_N$ which is a
labelling of $N$ points.
% in general position.
The sender uses an
adaptive algorithm to try to find a $\w$ that can reproduce this
labelling exactly. We will assume the
algorithm finds such a $\w$ if it exists.
% Otherwise, $\w$ is set to some other value.
The receiver then evaluates the
threshold function on the $N$ input values. What is the probability
that {\em all\/} $N$ bits are correctly reproduced? How large can $N$
become, for a given $K$, without this probability becoming
substantially less than one?
% is not vanishingly small?
\subsection{General position}
One technical detail needs to be pinned down: what set of inputs
$\{\bx_n\}$ are we considering? Our answer might depend on this
choice. We will assume that the points are in {\em\ind{general position}}.
% (\pref{defn.generalposition}),
% which means in $K=3$ dimensions, for example, that no three points
% are colinear and no four points are coplanar.
%\footnote{We could
% get another result for the capacity of a neuron assuming that the
% input set is the set of all binary input points. This might be helpful
% if we're going to go and play with binary Hopfield networks in
% the next lecture.}
\begin{definc}% see also pascal.tex - cut from there
A set of points $\{\bx_n\}$ in $K$-dimensional
space are in {\em general position} if
%% they are distinct and non-zero and
any subset of size $\leq K$ is linearly
independent, and
no $K+1$ of them lie in a $(K-1)$-dimensional plane.
% http://www.sciencedaily.com/encyclopedia/general_position
\label{defn.generalposition}
\end{definc}
In $K=3$ dimensions, for example, a set of points are in general position
if no three points are colinear and no four points are coplanar.
The intuitive idea is that points in general position
are like random points in the space, in terms of the linear
dependences between points. You don't expect three random points
in three dimensions to lie on a straight line.
\subsection{The linear threshold function}
The neuron we will consider performs the
function
\beq
y = f \left( \sum_{k=1}^K w_k x_k \right)
\eeq
where
\beq
f(a) = \left\{ \begin{array}{cc} 1 & a > 0 \\
0 & a \leq 0 .
\end{array} \right.
\eeq
We will not have a bias $w_0$; the capacity
for a neuron with a bias can be obtained by replacing
$K$ by $K+1$ in the final result below, \ie,
% a bound on the result for a neuron
% with a bias can be obtained by replacing $K$ by $K+1$ and
considering
one of the inputs to be fixed to 1. (These input points
would not then be in general position; the derivation still works.)
% but it is actually sufficient for the original input vectors to be in general position.)
% as long as $K \geq 1$).
%
% , which is why the result
% obtained would be an upper bound on the capacity of the neuron with
% bias.)
% from cover.tex
\section{Counting threshold functions}
Let us denote by $T(N,K)$ the number of
distinct threshold functions on $N$ points
in general position in $K$ dimensions.
We will derive a formula for $T(N,K)$.\index{shattering}% points in $K$ dimensions}
% (not rigorously
To start with, let us work out a few cases by hand.
\subsection{In $K=1$ dimension, for any $N$}
The $N$ points lie on a line. By changing
the sign of the one weight $w_1$ we can
label all points on the right side of the origin
1 and the others 0, or {\em vice versa}.
Thus there are
two distinct threshold functions.
$T(N,1) = 2$.
\subsection{With $N=1$ point, for any $K$}
If there is just one point $\bx^{(1)}$
then we can realize both possible labellings
by setting $\w = \pm \bx^{(1)}$. Thus
$T(1,K) = 2$.
\subsection{In $K=2$ dimensions}
In two dimensions with $N$ points, we are free to spin the separating
line around the origin. Each time the line passes over a point we
obtain a new function. Once we have spun the line through 360
degrees
% 2$\pi$
we reproduce the function we started from. Because the points are in
general position, the separating plane (line) crosses only one point at a
time. In one revolution, every point is passed over twice. There are
therefore $2 N$ distinct threshold functions. $T(N,2) = 2N$.
Comparing with the total number of binary functions, $2^N$, we may
note that for $N\geq 3$, not all binary functions can be realized by
a linear threshold function. One famous example of an unrealizable
function with $N=4$ and $K=2$
is the exclusive-or function on the points $\bx = (\pm
1,\pm 1)$. [{These points are not in general position, but you
may confirm that the function remains unrealizable even if the points are
perturbed into general position.}]
\begin{figure}
\figuremargin{%
\begin{center}
\setlength{\unitlength}{0.1in}
\raisebox{0.5in}{\makebox[0in][l]{\footnotesize(a)}
\begin{picture}(10,10)(0,0)
\put(0,0){\makebox(0,0)[bl]{\psfig{figure=figs/T12x.eps,angle=-90,width=1in}}}
\put(10,4){\makebox(0,0)[t]{$x_1$}}
\put(5,10){\makebox(0,0)[r]{$x_2$}}
\put(7,3){\makebox(0,0)[t]{$\bx^{(1)}$}}
\end{picture}
}
\hspace{0.3in}\raisebox{0.5in}{\makebox[0in][l]{\footnotesize(b)}}
\begin{picture}(20,20)(0,0)
\put(0,0){\makebox(0,0)[bl]{\psfig{figure=figs/T12.eps,angle=-90,width=2in}}}
\put(20,9){\makebox(0,0)[t]{$w_1$}}
\put(10,20){\makebox(0,0)[r]{$w_2$}}
\put(15,6){\makebox(0,0)[l]{(1)}}
\put(4,12){\makebox(0,0)[br]{(0)}}
\end{picture}
\end{center}
}{%
\caption[a]{One data point in a two-dimensional input space,
and the two regions of weight space that give the two
alternative labellings of that point.}
\label{fig.T12x}
\label{fig.T12}
}%
\end{figure}
\begin{figure}
\figuremargin{%
\begin{center}
\setlength{\unitlength}{0.1in}
\raisebox{0.5in}{\makebox[0in][l]{\footnotesize(a)}
\begin{picture}(10,10)(0,0)
\put(0,0){\makebox(0,0)[bl]{\psfig{figure=figs/T22x.eps,angle=-90,width=1in}}}
\put(10,4){\makebox(0,0)[t]{$x_1$}}
\put(5,10){\makebox(0,0)[r]{$x_2$}}
\put(7,3){\makebox(0,0)[t]{$\bx^{(1)}$}}
\put(1,3){\makebox(0,0)[tl]{$\bx^{(2)}$}}
\end{picture}
}
\hspace{0.3in}\raisebox{0.5in}{\makebox[0in][l]{\footnotesize(b)}}
\begin{picture}(20,20)(0,0)
\put(0,0){\makebox(0,0)[bl]{\psfig{figure=figs/T22.eps,angle=-90,width=2in}}}
\put(20,9){\makebox(0,0)[t]{$w_1$}}
\put(10,20){\makebox(0,0)[r]{$w_2$}}
\put(12,3){\makebox(0,0)[r]{(1,1)}}
\put(11,15){\makebox(0,0)[r]{(0,0)}}
\put(16,8.5){\makebox(0,0)[l]{(1,0)}}
\put(4,13){\makebox(0,0)[r]{(0,1)}}
\end{picture}
\end{center}
}{%
\caption[a]{Two data points in a two-dimensional input space,
and the four regions of weight space that give the four
alternative labellings.}
\label{fig.T22x}
\label{fig.T22}
}
\end{figure}
\begin{figure}
\figuremargin{%
\begin{center}
\setlength{\unitlength}{0.1in}
\raisebox{0.5in}{\makebox[0in][l]{\footnotesize(a)}
\begin{picture}(10,10)(0,0)
\put(0,0){\makebox(0,0)[bl]{\psfig{figure=figs/T32x.eps,angle=-90,width=1in}}}
\put(10,4){\makebox(0,0)[t]{$x_1$}}
\put(5,10){\makebox(0,0)[r]{$x_2$}}
\put(7,3){\makebox(0,0)[t]{$\bx^{(1)}$}}
\put(1,3){\makebox(0,0)[tl]{$\bx^{(2)}$}}
\put(5.6,7){\makebox(0,0)[l]{$\bx^{(3)}$}}
\end{picture}
}
\hspace{0.3in}\raisebox{0.5in}{\makebox[0in][l]{\footnotesize(b)}}
\begin{picture}(20,20)(0,0)
\put(0,0){\makebox(0,0)[bl]{\psfig{figure=figs/T32.eps,angle=-90,width=2in}}}
\put(20,9){\makebox(0,0)[t]{$w_1$}}
\put(10,20){\makebox(0,0)[r]{$w_2$}}
\put(8,3){\makebox(0,0)[l]{(1,1,0)}}
% \put(11,15){\makebox(0,0)[l]{(0,0,0)}}
\put(17,5){\makebox(0,0)[l]{(1,0,0)}}
\put(4,14){\makebox(0,0)[r]{(0,1,1)}}
% \put(8,5){\makebox(0,0)[r]{(1,1,1)}}
\put(11,17){\makebox(0,0)[r]{(0,0,1)}}
\put(16,13){\makebox(0,0)[l]{(1,0,1)}}
\put(4,8){\makebox(0,0)[r]{(0,1,0)}}
\end{picture}
\end{center}
}{%
\caption[a]{Three data points in a two-dimensional input space,
and the six regions of weight space that give
alternative labellings of those points. In this case, the labellings
$(0,0,0)$ and $(1,1,1)$ cannot be realized. For any three points
in general position there are always two labellings that
cannot be realized.}
\label{fig.T32x}
\label{fig.T32}
}%
\end{figure}
\subsection{In $K=2$ dimensions, from the point of view of weight space}
There is another way of visualizing this problem. Instead of
visualizing a plane separating points in the two-dimensional input
space, we can consider the two-dimensional {\em weight space},
colouring
regions in weight space different colours if they label the given datapoints differently.
We can then count
the number of threshold functions by counting how many
distinguishable regions there are in weight space.
Consider first the set of weight vectors in weight space that
classify a particular example $\bx^{(n)}$ as a 1. For example,
figure \ref{fig.T12x}a shows a single point in our two-dimensional
$\bx$-space, and figure \ref{fig.T12}b shows the two corresponding
 sets of points in $\bw$-space. One set of weight vectors
%. These vectors
 occupies the half space
\beq
	\bx^{(n)}{\bf \cdot} \w > 0,
\eeq
 and the other
 occupies $\bx^{(n)} {\bf \cdot} \w < 0$. In figure \ref{fig.T22x}a we have
added a second point in the input space. There are now 4 possible
labellings: $(1,1)$, $(1,0)$, $(0,1)$, and $(0,0)$. Figure \ref{fig.T22}b
shows the two hyperplanes $\bx^{(1)} {\bf \cdot} \w = 0$ and $\bx^{(2)} {\bf \cdot} \w = 0$
which separate the
sets of weight vectors that produce each of these labellings.
When $N=3$ (figure \ref{fig.T32}), weight space is divided by three
hyperplanes into six regions. Not all of the eight conceivable labellings
can be realized. Thus $T(3,2)=6$.
\subsection{In $K=3$ dimensions}
% \subsubsection{$K=3$}
We now use this weight space visualization to study the
three dimensional case.
% we may find a new description
% of the problem helpful. Instead of visualizing a plane separating
% points in the three dimensional input space, we can consider the
% three-dimensional weight space and count the number of threshold
% functions by counting how many distinguishable regions there are in
% weight space.
% Consider weight space and consider the set of weight vectors which
% classify a particular example $\bx^{(n)}$ as a 1. These vectors
% occupy the half space $\bx^{(n)}\cdot \w \geq 0$.
Let us imagine
adding one point at a time and count the number of threshold
functions as we do so. When $N=2$, weight space is divided by two
hyperplanes $\bx^{(1)}{\bf \cdot} \w = 0$ and $\bx^{(2)}{\bf \cdot} \w = 0$ into
four regions; in any one region all vectors $\w$ produce the same
function on the 2 input vectors. Thus $T(2,3)=4$.
Adding a third point in general position produces a third plane in
$\w$ space, so that there are 8 distinguishable regions. $T(3,3)=8$.
The three bisecting planes are shown in figure \ref{fig.T33}a.
\begin{figure}
\figuremargin{%
\begin{center}
\mbox{\raisebox{0.6in}{\footnotesize(a)}$\:$%
\psfig{figure=figs/T33.eps,angle=-90,width=2.1in}
\hspace{0.35in}\raisebox{0.6in}{\footnotesize(b)}$\:$%
\psfig{figure=figs/T43_14.eps,angle=-90,width=2.1in}
}
\end{center}
}{%
\caption[a]{Weight space\index{weight space} illustrations for $T(3,3)$ and $T(4,3)$.
(a) $T(3,3)=8$. Three hyperplanes
(corresponding to three points in
general position) divide 3-space into 8 regions, shown
here by colouring the relevant part of the surface of a
hollow, semi-transparent cube
centred on the origin.
(b) $T(4,3)=14$. Four hyperplanes
divide 3-space into 14 regions, of which this figure shows 13 (the
 14th region is out of view on the right-hand face).
Compare with figure \protect\ref{fig.T33}a:
all of the regions that are
not coloured white have been cut into two.}
\label{fig.T33}
\label{fig.T43_14}
}%
\end{figure}
\begin{table}
\figuremarginb{%
\begin{center}
\begin{tabular}{c@{\hspace{0.642in}}*{8}{p{4mm}}} \toprule
% \begin{tabular}{c|*{8}{p{4mm}}}
& & & & $K$ & & & \\ \cline{2-9}
$N$ &1&2 &3 &4 &5 &6 &7 &8 \\[0.04in] \midrule
1&2&2 &2 &2 &2 &2 &2 &2 \\
2&2&4 &4 & & & & & \\
3&2&6 &8 & & & & & \\
4&2&8 &14& & & & & \\
5&2&10& & & & & & \\
6&2&12& & & & & & \\ \bottomrule
\end{tabular}
\end{center}
}{%
\caption{Values of $T(N,K)$ deduced by hand.}
\label{tab.T}
}%
\end{table}
\begin{figure}
\figuremargin{%
\begin{tabular}{ll}
\mbox{%
\raisebox{0.5in}{\footnotesize(a)}
\psfig{figure=figs/T43.eps,angle=-90,width=2.1in}
\hspace{0.2in}
}&
\mbox{
\hspace*{-0.2in}\raisebox{0.5in}{\footnotesize(b)}
\psfig{figure=figs/T43cut.eps,angle=-90,width=2.1in}
}\\
\mbox{%
\raisebox{0.5in}{\footnotesize(c)}
\psfig{figure=figs/T43_26.eps,angle=-90,width=2.1in}
}\\
\end{tabular}
}{%
\caption[a]{Illustration of the cutting process
going from $T(3,3)$ to $T(4,3)$.
%
(a) The
eight regions of figure \ref{fig.T33}a with one added
hyperplane. All of the regions that are not coloured white
% (and the hidden region)
have been cut into two.
(b) Here, the hollow cube has been made solid, so we can see
which regions are cut by the fourth plane. The front half
of the cube has been cut away.
% Counting how many regions are cut by the new hyperplane.
(c) This figure shows the new two dimensional hyperplane, which
is divided into six regions by the three one-dimensional hyperplanes
(lines) which cross it. Each of these regions corresponds to
one of the three-dimensional regions in figure \ref{fig.T43}a
which is cut into two by this new hyperplane. This shows
that $T(4,3)- T(3,3) = 6$.
%
Figure
\protect\ref{fig.T43_26}c should be compared with figure
\protect\ref{fig.T32}b.
}
\label{fig.T43}
\label{fig.T43_26}
}%
\end{figure}
At this point matters become slightly more tricky.
As figure \ref{fig.T43_14}b illustrates,
the fourth plane
in the three-dimensional $\w$ space cannot transect all eight of the
sets created by the first three planes.
% A sketch confirms that for points in general position s
Six of the existing regions are cut in
two and the remaining two are unaffected.
So $T(4,3)=14$. Two of the binary functions on 4 points in 3
dimensions cannot be realized by a linear threshold function.
We have now filled in the values of $T(N,K)$ shown in table
\ref{tab.T}. Can we obtain any insights into our derivation
of $T(4,3)$ in order to fill in the rest of the table for $T(N,K)$?
Why was $T(4,3)$ greater than $T(3,3)$ by six?
Six is the number of
regions that the new hyperplane bisected in $\w$-space (figure
\ref{fig.T43}a$\,$b). Equivalently, if we look in the $K\!-\!1$ dimensional
subspace that is the $N$th hyperplane, that subspace is divided
into six regions by the $N\!-\!1$ previous hyperplanes (figure
\ref{fig.T43_26}c). Now this is a concept we have met before.
Compare figure
\ref{fig.T43_26}c with figure
\ref{fig.T32}b. How many
 regions are created by $N\!-\!1$ hyperplanes in a $K\!-\!1$ dimensional
space? Why,
$T(N\!-\!1,K\!-\!1)$, of course! In the present case $N=4$, $K=3$, we can look up
$T(3,2)=6$ in the previous section. So
\beq
T(4,3) = T(3,3) + T(3,2) .
\eeq
\subsection{Recurrence relation for any $N,K$}
Generalizing this picture, we see that when we add an $N$th
hyperplane in $K$ dimensions, it will bisect $T(N\!-\!1,K\!-\!1)$ of the
$T(N\!-\!1,K)$ regions that were created by the previous $N\!-\!1$
hyperplanes. Therefore, the total number of regions obtained after
adding the $N$th hyperplane is $2 T(N\!-\!1,K\!-\!1)$
(since $T(N\!-\!1,K\!-\!1)$ out of
$T(N\!-\!1,K)$ regions are split in two) plus the remaining
$T(N\!-\!1,K) - T(N\!-\!1,K\!-\!1)$
regions not split by the $N$th hyperplane, which gives the following
equation for $T(N,K)$:
% in terms of $T(N\!-\!1,K\!-\!1)$ and $T(N\!-\!1,K)$:
\beq
% T(N,K) = T(N\!-\!1,K\!-\!1) + T(N\!-\!1,K).
T(N,K) = T(N\!-\!1,K) + T(N\!-\!1,K\!-\!1) .
\label{eq.T.rec}
\eeq
Now all that remains is to solve this recurrence relation given the
boundary conditions $T(N,1)=2$ and $T(1,K)=2$.
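
 Before solving it, we can check the recurrence against the values found
 by hand, and use it to extend table \ref{tab.T} by one entry.
 The particular entries below are chosen only for illustration.
\begin{eqnarray*}
	T(3,2) &=& T(2,2) + T(2,1) \:=\: 4 + 2 \:=\: 6 , \\
	T(5,3) &=& T(4,3) + T(4,2) \:=\: 14 + 8 \:=\: 22 .
\end{eqnarray*}
 The first line agrees with the value obtained by spinning the separating
 line; the second is a new entry.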
Does the recurrence relation (\ref{eq.T.rec}) look familiar? Maybe you
remember building Pascal's triangle by adding together two adjacent
numbers in one row to get the number below. The $N,K$ element of
Pascal's triangle is equal to
\beq
C(N,K) \equiv
{N \choose K }\equiv \frac{N!}{(N-K)!K!}.
\eeq
\begin{table}[htbp]
\figuremargin{%
\begin{center}
%\begin{tabular}{c@{\hspace{0.642in}}*{8}{p{4mm}}} \hline
\begin{tabular}{c@{\hspace{0.642in}}*{8}{p{3mm}@{\hspace{0.1in}}}} \toprule
& & & & $K$ & & & & \\ \cline{2-9}
$N$ &0&1 &2 &3&4&5&6&7 \\[0.04in] \midrule
0&1& & & & & & & \\
1&1&1 & & & & & & \\
2&1&2 &1 & & & & & \\
3&1&3 &3 &1 & & & & \\
4&1&4 &6 &4 &1 & & & \\
5&1&5 &10&10 &5 &1 & & \\ \bottomrule
\end{tabular}
\end{center}
}{%
\caption{Pascal's triangle.}
\label{tab.pascal}
}%
\end{table}
\noindent
Combinations $N\choose K$ satisfy the equation
\beq
C(N,K) = C(N\!-\!1,K\!-\!1) +
C(N\!-\!1,K) , \:\mbox{ for all $N > 0$}.
% \mbox{for all $N...K...$}.
\label{eq.C.rec}
%\eeq
%\beq
% {N \choose K} = {N\!-\!1 \choose K\!-\!1}
% + {N\!-\!1 \choose K}, \:\mbox{ for all $N > 0$}.
\label{eq.Com.rec}
\eeq
[Here we are adopting the convention that ${N \choose K }\equiv 0$ if
$K>N$ or $K<0$.]
So ${N \choose K}$ satisfies the required recurrence relation
(\ref{eq.T.rec}). This doesn't mean $T(N,K)= {N \choose K}$,
since many
functions can satisfy one recurrence relation.
% maybe ${N \choose K}$
% is the Green's function for this problem, that is,
But perhaps we can
express $T(N,K)$ as a linear superposition of combination functions
of the form $C_{\a,\b}(N,K) \equiv {N+\alpha \choose K+\beta}$.
% Notice that
% these functions all satisfy the same recurrence relation.
%\beq
% C_{\a,\b}(N,K) = C_{\a,\b}(N\!-\!1,K\!-\!1) +
% C_{\a,\b}(N\!-\!1,K) \mbox{for all $N...K...$}.
%\label{eq.C.rec}
%\eeq
%
% CHRIS SAYS THIS IS NOT CLEAR
%
By comparing tables \ref{tab.pascal} and \ref{tab.T} we can see how
to satisfy the boundary conditions:
% if we define $N \choose K$ to be
% zero outside the triangle then
we simply need to translate Pascal's
triangle to the right by 1, 2, 3, $\ldots$; superpose; add;
multiply by two, and drop the whole table by one line. Thus:
\beq
T(N,K) = 2 \sum_{k=0}^{K\!-\!1} {{N\!-\!1} \choose k} .
\eeq
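 One can confirm directly that this expression meets the requirements.
 The following is a verification sketch that uses only the convention
 ${N \choose K} \equiv 0$ for $K<0$ or $K>N$. For the boundary conditions,
\[
	T(N,1) = 2 {N\!-\!1 \choose 0} = 2
	\:\:\: \mbox{and} \:\:\:
	T(1,K) = 2 \sum_{k=0}^{K\!-\!1} {0 \choose k} = 2 .
\]
 For the recurrence (\ref{eq.T.rec}), with $N \geq 2$ and $K \geq 2$,
\begin{eqnarray*}
	T(N\!-\!1,K) + T(N\!-\!1,K\!-\!1)
	&=& 2 \sum_{k=0}^{K-1} {N\!-\!2 \choose k}
	  + 2 \sum_{k=0}^{K-2} {N\!-\!2 \choose k} \\
	&=& 2 \sum_{k=0}^{K-1} \left[ {N\!-\!2 \choose k}
	  + {N\!-\!2 \choose k\!-\!1} \right] \\
	&=& 2 \sum_{k=0}^{K-1} {N\!-\!1 \choose k} \:=\: T(N,K) ,
\end{eqnarray*}
 where the second line shifts the index of the second sum by one
 and the third line uses equation (\ref{eq.C.rec}).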
 Using the fact that the $N$th row of Pascal's triangle sums to $2^N$,
 so that $\sum_{k=0}^{N\!-\!1} {N\!-\!1 \choose k}=2^{N\!-\!1}$, we
 can simplify the cases where $K\!-\!1 \geq N\!-\!1$, in which the sum
 runs over the whole of row $N\!-\!1$:
\beq
T(N,K) = \left\{ \begin{array}{cc}2^N & K\geq N \\
2 \sum_{k=0}^{K-1} {{N\!-\!1} \choose k} & K < N .
\end{array} \right.
\label{eq.TNK}
\eeq
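 As a quick numerical check of equation (\ref{eq.TNK}), with entries picked
 only for illustration, the second case reproduces the values found by hand:
\begin{eqnarray*}
	T(4,3) &=& 2 \left[ {3 \choose 0} + {3 \choose 1} + {3 \choose 2} \right]
	\:=\: 2 ( 1 + 3 + 3 ) \:=\: 14 , \\
	T(N,2) &=& 2 \left[ {N\!-\!1 \choose 0} + {N\!-\!1 \choose 1} \right]
	\:=\: 2 N .
\end{eqnarray*}
 It also gives entries that were not worked out by hand, for example
 $T(6,3) = 2 ( 1 + 5 + 10 ) = 32$, which is exactly half of $2^6 = 64$.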
% see /home/mackay/_courses/itprnn/gnu for the method
\begin{figure}
%\figuremargin{%
\fullwidthfigureright{%
%\begin{center}
\mbox{%restore to mbox
\makebox[0in][l]{\raisebox{0.1in}{\footnotesize(a)}}\hspace{-0.2in}
%\fbox{%
\raisebox{-0.53in}[1.5in][0.2in]{\psfig{figure=figs/erfNK.ps,angle=-90,width=3in}}%
%}%fbox
%\fbox{%
\raisebox{0.2in}{
\makebox[0in][l]{\raisebox{-0.1in}{\footnotesize(b)}} \hspace{-0.45in}
\raisebox{-0.340in}[1.25in][0in]{\psfig{figure=figs/erfKN.ps,angle=-90,width=3.64in}}
}
%}%fbox
}
\\
\mbox{
\makebox[0in][l]{\raisebox{0.25in}{\footnotesize(c)}}\hspace{-0.15in}\psfig{figure=figs/erf1000.ps,angle=-90,width=4in}%}\\{%
$\:\:$\makebox[0in][l]{\raisebox{0.25in}{\footnotesize(d)}}\hspace{0.1in}\raisebox{0.2in}{\psfig{figure=figs/erfbits.ps,angle=-90,width=2.5in}}
}
%\end{center}
}{%
\caption[a]{The fraction of functions on $N$ points in $K$ dimensions
that are linear threshold functions, $T(N,K)/2^N$, shown
from various viewpoints.
In (a) we see the dependence on $K$, which is
approximately an error function passing through 0.5 at $K=N/2$;
the fraction reaches 1 at $K=N$.
In (b) we see the dependence on $N$, which is 1 up to $N=K$
and drops sharply at $N=2K$. Panel (c) shows the
dependence on $N/K$ for $K=1000$. There is a
sudden drop in the fraction of realizable labellings when $N=2K$.
Panel (d) shows the values of $\log_2 T(N,K)$ and $\log_2 2^N$
as a function of $N$ for $K=1000$.
These figures were plotted using the approximation of $T/2^N$ by the
error function.}
\label{fig.erfNK}
}
\end{figure}
\subsection{Interpretation}
It is natural to compare $T(N,K)$ with the total number of binary
functions on $N$ points, $2^N$.
% (figure \ref{fig.erfNK}).
The ratio $T(N,K)/2^N$ tells us the
probability that an arbitrary labelling $\{t_n\}_{n=1}^N$
can be memorized by our neuron.
The two functions are equal for all
$N \leq K$. The line $N=K$ is thus a special line, defining
the maximum number of points on which {\em any\/}
arbitrary labelling can be
realized. This number of points is referred to as the
{\dem\ind{Vapnik--Chervonenkis dimension}\/}
(\ind{VC dimension}) of the class of functions. The VC
dimension of a binary threshold function on $K$ dimensions is thus
$K$.
What is interesting is (for large $K$) the number of points $N$ such that
{\em almost\/} any labelling can be realized. The ratio $T(N,K)/2^N$
is, for $N < 2K$, still greater than 1/2, and for large $K$ the ratio
is very close to 1.
% less than 1 by an exponentially small quantity.
% We are familiar with the sum and have the inequality....
For our purposes the
% familiar
sum in equation
(\ref{eq.TNK}) is well approximated by the
\ind{error function},\index{combination}
\beq
 \sum_{k=0}^K {N \choose k} \simeq \: 2^N \: \Phi\!
	\left( \frac{K - (N/2)}{\sqrt{N}/2} \right) ,
\eeq
 where
 $\Phi(z) \equiv \int_{-\infty}^{z} \exp ( - u^2/2 )/\sqrt{2 \pi} \: {\rm d}u$.
Figure \ref{fig.erfNK} shows the realizable
fraction $T(N,K)/2^N$ as a function of
$N$ and $K$.
The take-home message is shown in figure \ref{fig.erfNK}c:
although the fraction $T(N,K)/2^N$ is less than 1 for $N>K$, it is
 only negligibly less than 1 up to $N=2K$; at that point there is a catastrophic
drop to zero, so that for $N>2K$, only a tiny fraction
of the binary labellings can be realized by the threshold function.
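
 A rough numerical illustration of this drop may help. It takes
 $K=1000$, with values of $N$ chosen only for illustration, and neglects
 corrections of order $1/\sqrt{N}$ in the argument of $\Phi$.
 From equation (\ref{eq.TNK}), for $K<N$,
\[
	\frac{T(N,K)}{2^N} = \frac{1}{2^{N-1}} \sum_{k=0}^{K-1} {N\!-\!1 \choose k}
	\simeq \Phi\! \left( \frac{2K - N}{\sqrt{N}} \right) .
\]
 For $N=1900$ the argument is about $2.3$ and the fraction is roughly $0.99$;
 for $N=2100$ the argument is about $-2.2$ and the fraction is roughly $0.015$.
 At $N=2K$ itself no approximation is needed: by the symmetry
 ${2K\!-\!1 \choose k} = {2K\!-\!1 \choose 2K\!-\!1\!-\!k}$, the sum over
 $k=0,\ldots,K\!-\!1$ is exactly half of row $2K\!-\!1$ of Pascal's triangle,
 so $T(2K,K)/2^{2K} = 1/2$.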
\subsection{Conclusion}
The capacity of a linear threshold neuron, for large $K$,
% (as defined here)
is 2 bits per weight.
% connection.
A single neuron can almost certainly memorize up to $N=2K$ random
binary labels perfectly, but will almost certainly fail to memorize more.
%
% Far side cartoon `please can I leave my brain is full'
%
% Another description. What is the probability
% that a random set of patterns will be
% linearly separable.
\section{Further exercises}
\exercisxB{2}{ex.T2N2}{
Can a finite set of $2N$ distinct points in a two-dimensional space
be split in half by a straight line
\bit
\item if the points are in general position?
\item if the points are not in general position?
\eit
Can $2N$ points in a $K$ dimensional space be split in half by a
$K-1$ dimensional hyperplane?
}
\exercissxA{2}{ex.T43}{
Four points are selected at random on the surface of a sphere. What is
the probability that all of them lie on a single hemisphere? How does
this question relate to $T(N,K)$? }
\exercisaxA{2}{ex.T43b}{
Consider the binary threshold neuron in $K=3$ dimensions, and the
set of points $\{\bx \} = \{ (1,0,0), (0,1,0), (0,0,1),(1,1,1) \}$.
Find a parameter vector $\bw$ such that the neuron memorizes the
labels:
%\ben
%\item
(a) $\{ t \} = \{ 1,1,1,1 \}$; (b) $\{ t \} = \{ 1,1,0,0 \}$.
Find an unrealizable labelling $\{ t \}$.
}
\exercisxB{3}{ex.polya}{
In this chapter we constrained all our hyperplanes to go through the
origin. In this exercise, we remove this constraint.%
\amarginfig{c}{\begin{center}\mbox{\epsfbox{metapost/cover.1}}\end{center}
\caption[a]{Three lines in a plane create seven regions.}
}
How many regions in a plane are created by $N$ lines in general position?
}
\exercisxA{2}{ex.sensorybits}{
Estimate in bits the total sensory experience that you have had in
your life -- visual information, auditory information, etc.
Estimate how much information you have memorized.
Estimate the information content of the works of Shakespeare.
Compare these with the capacity of your brain assuming you have $10^{11}$
% KNM says:
% 10^10 is frequently given for human, but granule cells in cerebellum, numbr
% up to 10^11
neurons each making 1000 synaptic connections, and that the
capacity result for one neuron (two bits per connection) applies.
Is your brain full yet?
}
% shakes is about 1 Meg uncompressed?
\exercisxB{3}{ex.spikingneuron}{
What is the capacity of the axon of
a spiking neuron, viewed as a communication channel,
in bits per second?
[See \citeasnoun{MacKayMcCulloch1952} for an early publication
on this topic.]
Multiply by the number of axons in the \ind{optic nerve}\index{ganglion cells} (about $10^6$) or
cochlear nerve (about $50\,000$ per ear)
% http://psych.athabascau.ca/html/Psych402/Biotutorials/25/nerve.shtml
 to estimate again the rate of acquisition of sensory experience.
}
\dvips
\section{Solutions}% to Chapter \protect\ref{ch.single.neuron.capacity}'s exercises} %
\soln{ex.T43}{% 4 points selected
The probability that all four points lie on a single hemisphere
is
\beq
T(4,3)/2^4 = 14 / 16 = 7/8 .
\eeq
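 To see where this comes from, here is a sketch. It assumes that the four
 points are drawn independently and uniformly, so that with probability
 one they are in general position. The points lie on a single hemisphere if
 there is a $\w$ such that $\bx^{(n)} {\bf \cdot} \w > 0$ for all $n$,
 that is, if the labelling $(1,1,1,1)$ is realizable. Flipping the sign of
 any $\bx^{(n)}$ leaves the distribution of the points unchanged and
 turns this labelling into any other, so each of the $2^4$ labellings
 is realizable with the same probability $P$. The expected number of
 realizable labellings is therefore
\[
	2^4 P = T(4,3) = 14 ,
\]
 which gives $P = 7/8$.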
}
\dvipsb{solutions neuroncap}
% \chapter{Single neuron learning as Inference}
\chapter{Learning as Inference}
\label{ch.single.neuron.bayes}
% \newcommand{\FIGSlearning}{/home/mackay/book/FIGS/learning}
%\noindent
% Supervised
% neural networks are parameterized nonlinear models used for empirical
% regression and classification modelling. Their flexibility makes
% them able to discover more general relationships in data than traditional
% statistical models.
%
%
%%%%%%%%%%%%%%%%%%5
\section{Neural network learning as inference}
In \chapterref{ch.single.neuron.class} we trained a simple neural network
as a classifier by minimizing an objective function\index{neural network!learning as inference}\index{learning!as inference}
\beq
M(\w) = G(\w) + \a E_W(\w)
\eeq
made up of an error function
\beq
G(\bw) =
-
\sum_n
\left[
t^{(n)} \ln y(\bx^{(n)};\bw) + (1-t^{(n)}) \ln (1-y(\bx^{(n)};\bw))
\right]
\eeq
and a regularizer
\beq
E_W(\w)= \frac{1}{2} \sum_i w_i^2.
\eeq
%
%
This neural network learning process
% e neural network learning process described in chapter \chthirteen\
can be given the following
probabilistic interpretation.
We interpret the output
$y(\bx;\bw)$ of the neuron literally as defining (when its
parameters $\bw$ are specified) the probability that an input $\bx$
belongs to class $t=1$, rather than the alternative $t=0$. Thus
$y(\bx;\bw) \equiv P(t\eq 1\given \bx,\bw)$. Then each value of $\bw$
% in the picture of weight space
defines a different
hypothesis about the probability of class 1 relative to class 0
as a function of $\bx$.
We define the observed data $D$ to be the {\em targets\/}
$\{t\}$ -- the inputs $\{\bx\}$ are assumed to be given,
and not to be modelled.
To infer $\bw$ given the data, we require a likelihood function and
a prior probability over $\bw$.
The likelihood function
% is the probability of the observed data given the parameters.
measures how well the parameters $\w$
predict the observed data; it is the probability assigned to the
observed $t$ values by the model with parameters set to $\bw$.
% ; if and obtain the probability that the target is 1 from the output
% $y$ of the neuron then
Now the two equations
\beq
\begin{array}{ccl}
P(t=1\given \bw,\bx) &=& y\\
P(t=0\given \bw,\bx) &=& 1-y
\end{array}
\eeq
can be rewritten as the single equation
\beq
% \Rightarrow
P(t\given \bw,\bx)
= y^{t}(1-y)^{1-t} = \exp \! \left[ t \ln y + (1-t) \ln (1-y) \right] .
\eeq
%
So the error function $G$ can be interpreted
as minus the log likelihood:
\beq
P(D\given \w
% ,\H
) = \exp [-G(\bw)] .
\label{eq.neu.basic.like}
\eeq
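To see this, note that if the targets are assumed to be independent
given $\bw$ and the inputs, the probability of the whole data set is the
product of the probabilities of the individual targets:
\beqan
P(D\given \w) &=& \prod_n P(t^{(n)}\given \bw,\bx^{(n)}) \\
&=& \exp \! \left[ \sum_n \left( t^{(n)} \ln y(\bx^{(n)};\bw)
+ (1-t^{(n)}) \ln (1-y(\bx^{(n)};\bw)) \right) \right] \\
&=& \exp [-G(\bw)] .
\eeqan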
Similarly the regularizer can be
interpreted in terms of a log prior probability
distribution over the parameters:
\beq
P(\w\given \a
% ,\H
) = \frac{1}{Z_W(\a)} \exp (-\a E_W ) .
\label{eq.neu.basic.prior}
\eeq
If $E_W$ is quadratic as defined above, then the corresponding prior
distribution is a Gaussian with variance $\sigW^2 = 1/\a$,
and $1/Z_W(\a)$ is equal to $(\a/2\pi)^{K/2}$, where $K$ is the
number of parameters in the vector $\w$.\index{learning!as inference}
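To spell this out, the quadratic regularizer gives a prior that
factorizes over the $K$ weights,
\beq
P(\w\given \a) = \prod_{i=1}^{K} \sqrt{\frac{\a}{2\pi}} \,
\exp \! \left( - \frac{\a}{2} w_i^2 \right) ,
\eeq
and the normalizing constant is the product of $K$ one-dimensional
Gaussian integrals,
\beq
Z_W(\a) = \int \! \d^K\w \, \exp ( -\a E_W ) = \left( \frac{2\pi}{\a} \right)^{K/2} .
\eeq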
% The
%probabilistic model $\H$ specifies the functional form $\A$ of the
%network, the likelihood (\ref{basic.like}), and the prior
%(\ref{basic.prior}).
The objective function $M(\w)$ then corresponds to the {\em inference\/}\index{inference!and learning}
of the parameters $\w$, given the data:\index{learning!as inference}
\beqan
\label{eq.neu.level1}
P(\w\given D,\a
% ,\H
) &=& \frac{ P(D\given \w
% ,\H
) P(\w\given \a
% ,\H
) }{ P(D\given \a
% ,\H
) } \\
&=& \frac{ \displaystyle e^{-G(\w)} \:e^{-\a E_W(\w)} / Z_W(\a) }{ P(D\given \a ) } \\
&=& \frac{1}{Z_M} \exp( - M(\w) ).
\label{eq.posterior.single.neuron}
\label{eq.neu.level1b}
\eeqan
{\em So the $\w$ found by (locally) minimizing $M(\w)$ can be interpreted as
the (locally) most probable parameter vector, $\w^*$.}
From
now on we will refer to $\w^*$ as $\wmp$.
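Comparing the last two lines of the derivation above also identifies
the normalizing constant of the posterior:
\beq
Z_M = Z_W(\a) \, P(D\given \a) = \int \! \d^K\w \, \exp ( - M(\w) ) ,
\eeq
so the probability of the data given $\a$ is the ratio
$P(D\given \a) = Z_M / Z_W(\a)$.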
Why is it natural to interpret the error functions as {\em
log\/} probabilities?
% Could an alternative mapping be constructed?
Error functions are usually additive. For example, $G$ is a {\em
sum\/} of information contents, and $E_W$ is a {\em sum\/}
of squared weights. Probabilities, on the other hand, are
multiplicative: for independent events $x$ and $y$, the joint probability
is $P(x,y)=P(x)P(y)$. The logarithmic mapping maintains this
correspondence.
The interpretation of $M(\w)$ as a log probability has numerous
benefits, some of which we will discuss in a moment.
%
\begin{figure}
%\figuremargin{%
\fullwidthfigureright{%
\begin{center}
\begin{tabular}{*{4}{c@{}}}
{Data set} & {Likelihood}
&\multicolumn{2}{c}{Probability of parameters} \\ \hline
$N=0$ & {\em (constant)}
&\wsurfig{post.0s.ps} & \wflatfig{post.0c.ps} \\
\begin{tabular}{@{}c@{}}
$N=2$ \\[0.1in] \datfig{data.2b.ps}
\end{tabular} & \wsurfig{like.2s.ps}
&\wsurfig{post.2s.ps} & \wflatfig{post.2c.ps} \\
\begin{tabular}{@{}c@{}}
$N=4$ \\[0.1in] \datfig{data.4b.ps}
\end{tabular} & \wsurfig{like.4s.ps}
&\wsurfig{post.4s.ps} & \wflatfig{post.4c.ps} \\
\begin{tabular}{@{}c@{}}
$N=6$ \\[0.1in] \datfig{data.6.ps}
\end{tabular} & \wsurfig{like.6s.ps}
&\wsurfig{post.6s.ps} & \wflatfig{post.6c.ps} \\
\end{tabular}
% \begin{tabular}{c*{4}{c@{}}}
% {$N$} & {Data set} & {Likelihood}
% &\multicolumn{2}{c}{Probability of parameters} \\ \hline
% 0 & & {\em (constant)}
% &\wsurfig{post.0s.ps} & \wflatfig{post.0c.ps} \\
% 2 & \datfig{data.2b.ps} & \wsurfig{like.2s.ps}
% &\wsurfig{post.2s.ps} & \wflatfig{post.2c.ps} \\
% 4 & \datfig{data.4b.ps} & \wsurfig{like.4s.ps}
% &\wsurfig{post.4s.ps} & \wflatfig{post.4c.ps} \\
% 6 & \datfig{data.6.ps} & \wsurfig{like.6s.ps}
% &\wsurfig{post.6s.ps} & \wflatfig{post.6c.ps} \\
% \end{tabular}
\end{center}
}{%
\caption[a]{ The Bayesian interpretation and generalization of traditional
neural network learning.
{Evolution of the probability distribution over
parameters as data arrive.
} }
\label{fig.incremental.data}
\label{fig.prior}
}%
\end{figure}
\section{Illustration for a neuron with two weights}
In the case of a neuron with just two inputs and no bias,
\beq
y(\bx;\bw) = \frac{1}{1+e^{-(w_1 x_1 + w_2 x_2)}} ,
\label{lin.log2.again}
\eeq
we can plot the posterior probability of $\bw$, $P(\bw\given D,\a)
\propto \exp(-M(\bw))$.
Imagine that we receive some data as shown in the left
column of figure \ref{fig.incremental.data}. Each data point
consists of a two-dimensional input vector $\bx$ and a
$t$ value indicated by $\times\!$ ($t=1$) or $\Box\!$ ($t=0$).
%
% In the traditional view of learning, a single parameter vector
% $\w$ evolves under the learning rule from an initial starting point
% $\w^0$ to a final optimum $\w^*$,
%% (figure \ref{trad.learning}a),
% in such a way as to minimize the {objective function} $M(\bw)=G(\bw)
% + \a E_W(\bw)$.
The {likelihood function} $\exp(-G(\bw))$
% is the {\em likelihood\/},
is shown as a
function of $\bw$ in the second column.
It is a product of functions of the form (\ref{lin.log2.again}).
The product of traditional learning is a point in $\bw$-space, the
estimator $\w^*$, which maximizes the posterior probability
density. In contrast, in the Bayesian view, the product of
learning is an {\em ensemble\/} of plausible parameter values (bottom
right of figure \ref{fig.incremental.data}). We do not choose one
particular
% sub-
hypothesis $\w$; rather we evaluate their posterior
probabilities.\index{Bayes' theorem}
% , which by \Bayes\ theorem are:
% , the posterior probability of
% one of the sub-hypotheses $\bw$ is:
% \beq
% P(\bw \given \{ t \} , \{ \bx \} , \H ) =
% \frac{ P( \{ t \} \given \bw , \{ \bx \} , \H ) P(\bw \given \H ) }
% { P( \{ t \} \given \{ \bx \} , \H ) }
%\label{eq.posterior.single.neuron}
%\eeq
The posterior distribution is obtained by multiplying the
likelihood by a prior distribution over $\bw$ space (shown as a broad
Gaussian at the upper right of figure \ref{fig.prior}). The posterior
ensemble (within a multiplicative constant) is shown in the third
column of figure \ref{fig.prior}, and as a contour plot in the fourth
column.
As the amount of data increases (from top to bottom), the
posterior ensemble becomes increasingly concentrated around the
most probable value $\w^*$.
\begin{figure}% 30 , 80 , 500 , 3000 , 40000 its. see neuron/README
\fullwidthfigureright{\small%
%\figuremargin{%
\begin{center}
\begin{tabular}{*{3}{l}}% @{}}}
(a)\hspace{-0.4in}\makebox[2in][l]{\psfig{figure=neuron/ps/dwd40000.0.01.AB.ps,width=2in,angle=-90}}
&
(b)\hspace{-0.245in}\mbox{
\setlength{\unitlength}{1in}
\begin{picture}(1.9,1.4)(0,0)
\put(-0.25,-0.55){\makebox(0,0)[bl]{\psfig{figure=neuron/ps/predlangavg.ps,height=2.676in,width=2.526in,angle=-90}}}
\put(-0.25,-0.55){\makebox(0,0)[bl]{\psfig{figure=neuron/ps/preddata.AB.ps,height=2.676in,width=2.526in,angle=-90}}}
\end{picture}
}
&
\makebox[0 in][l]{(c)}\psfig{figure=\handfigs/cl.wc.ps,%
width=1.7in,angle=-90}
\\
\end{tabular}
\end{center}
}{%
\caption[a]{Making predictions.
%
(a)
The function performed
by an optimized neuron $\wmp$ (shown by three of its contours)
trained with weight decay, $\a=0.01$ (from \protect\figref{fig.neuron.learns.decay}).
The contours shown are those corresponding to
$a=0,\pm 1$, namely $y=0.5, 0.27$ and $0.73$.
% Also shown is a vector proportional to $(w_1,w_2)$.
(b) Are these predictions more reasonable? (Contours shown are for
$y=0.5, 0.27$, $0.73$, $0.12$ and $0.88$.) (c) The posterior
probability of $\w$ (schematic); the Bayesian predictions shown in
(b) were obtained by averaging together the predictions made by each
possible value of the weights $\bw$, with each value of $\bw$
receiving a vote proportional to its probability under the posterior
ensemble. The method used to create (b) is described in section
\protect\ref{sec.langevin.neuron}. }
%
\label{fig.neuron.map}
}%
\end{figure}
\section{Beyond optimization: making predictions}
% benefits of the inference viewpoint}
% Marginalization.
Let us consider the task of making predictions with the neuron
which we trained as a classifier in section \ref{sec.single.neuron.class}.
This was a neuron with two inputs and a bias,
\beq
y(\bx;\bw) = \frac{1}{1+e^{-(w_0 + w_1 x_1 + w_2 x_2)}} .
\label{lin.log3.again}
\eeq
When we last played with it, we trained it by minimizing
the objective function
\beq
M(\bw) = G(\bw) + \alpha E_W(\bw) .
\eeq
The resulting optimized function for the case $\alpha = 0.01$
is reproduced in \figref{fig.neuron.map}a.
We now consider the task of predicting the class $t^{\bssNN}$
corresponding to a new input $\bx^{\bssNN}$. It is common practice, when
making predictions, simply to use a neural network with its weights
fixed to their optimized value
$\wmp$, but this is not optimal, as can be seen intuitively
by considering the predictions shown in \figref{fig.neuron.map}a.
% The sub-optimality issue is illustrated schematically
% for a simple two-class problem in figure
% \ref{marg.fig}. Figure \ref{marg.fig}a shows a binary data set,
% which, in figure \ref{marg.fig}b is modelled with a linear logistic
% function. The best fit parameter values give predictions which are
% shown by three contours.
Are these reasonable predictions?
\newcommand{\predictionA}{{\sf A}}
\newcommand{\predictionB}{{\sf B}}
Consider
new data arriving at points \predictionA\ and \predictionB.
The best fit model assigns both
of these examples probability 0.2 of being in class 1, because
they have the same value of $\wmp \bf \cdot \bx$. If we really knew that
$\w$ was equal to $\wmp$, then these predictions would be correct.
% the data were generated from a model with
But we do not know $\w$. The parameters are {\em uncertain}.
Intuitively we might be inclined to assign a less confident
probability (closer to 0.5) at \predictionB\ than at \predictionA,
as shown in \figref{fig.neuron.map}b, since point \predictionB\ is far from
the training data.
{\em The best fit parameters $\wmp$ often give
over-confident predictions.}
A non-Bayesian approach to this problem is to
downweight all predictions uniformly, by an empirically determined
factor \cite{Copas:83}. This is not ideal, since intuition suggests
the strength of the predictions at \predictionB\ should be downweighted more than
those at \predictionA.
% Copas, Vapnik, Geoff Hinton all are into this.
A Bayesian viewpoint helps us to understand the cause of
the problem, and provides a straightforward solution.
% that is demonstrably superior to this ad hoc procedure.
In a nutshell,
we obtain Bayesian predictions by taking into account
the whole posterior ensemble, shown schematically in
\figref{fig.neuron.map}c.
%We average the predictions
% made by each possible setting of the weights $\bw$ together, with
% each setting of $\bw$ receiving a vote proportional to its
% posterior probability.
The {Bayesian\/} prediction of a
new datum $\bt^{\bssNN}$ involves {\em marginalizing\/} over
the parameters (and over anything else about which we
are uncertain). For simplicity, let us assume that
the weights $\bw$ are the only uncertain quantities -- the weight
decay rate $\a$ and the model $\H$ itself are assumed to
be fixed. Then by the sum rule, the predictive probability
of a new target $\bt^{\bssNN}$ at a location $\bx^{\bssNN}$
is:
\beq
P(\bt^{\bssNN} \given \bx^{\bssNN}, D,\a
% ,\H
) = \int \! \d^K\w \,
P(\bt^{\bssNN} \given \bx^{\bssNN} ,\w,\a
% ,\H
) P(\w\given D,\a
% ,\H
) ,
\eeq
where $K$ is the dimensionality of $\bw$, three in the toy problem.
Thus the predictions are obtained by weighting
the prediction for each possible $\w$,
\beq
\begin{array}{rcl}
P(\bt^{\bssNN}\eq 1 \given \bx^{\bssNN} ,\w,\a )& =&
y( \bx^{\bssNN} ; \w ) \\
P(\bt^{\bssNN}\eq 0 \given \bx^{\bssNN} ,\w,\a )& =&
1 - y( \bx^{\bssNN} ; \w ) ,
\end{array}
\eeq
with a weight given by the posterior probability of $\w$,
$P(\w\given D,\a)$, which we
% first encountered in \eqref{eq.posterior.single.neuron}.
most recently wrote down in \eqref{eq.posterior.single.neuron}.
This posterior probability is
% most compactly written as
\beq
P(\w\given D,\a
% ,\H
) = \frac{1}{Z_M} \, { \exp(-M(\w)) } ,
\eeq
where
\beq
Z_M = \int \d^K \w \exp(-M(\w)) .
\eeq
% $Z_M$ is
In summary, we can get the Bayesian predictions if we can find
a way of computing the integral
\beq
P(\bt^{\bssNN}\eq 1 \given \bx^{\bssNN}, D,\a
% ,\H
) = \int \! \d^K\w \:\:
y( \bx^{\bssNN} ; \w ) \, \frac{1}{Z_M} \, { \exp(-M(\w)) } ,
\label{eq.marg.integral}
\eeq
which is the average of the output of the neuron at $\bx^{\bssNN}$
under the posterior distribution of $\bw$.
% How shall we compute the integral \eqref{eq.marg.integral}?
% This is not a straightforward problem to solve, and we will shortly
% devote several pages to it, so to avoid
% losing our train of thought, let us anticipate the outcome.
%
%
% Figure \ref{marg.fig} gives a summary of the whole story. Figure
% \ref{marg.fig}a shows a binary data set, which, in figure
% \ref{marg.fig}b is modelled with a linear logistic function. The
% best fit parameter values give predictions which are shown by three
% contours. Are these reasonable predictions? Consider new data
% arriving at points A and B. The best-fit model assigns both of these
% examples probability 0.9 of being in class 1. But intuitively we
% might be inclined to assign a less confident probability (closer to
% 0.5) at B than at A, since point B is far from the training data.
% Precisely this result is obtained when we marginalize over the
% parameters, whose posterior probability distribution is depicted in
% figure \ref{marg.fig}c. Two random samples from the posterior define
% two different classification surfaces, which are illustrated in
% figures \ref{marg.fig}d, e. The point B is classified differently by
% these different plausible classifiers, whereas the classification of
% A is relatively stable. We obtain the Bayesian predictions (figure
% \ref{marg.fig}f) by averaging together the predictions of the
% plausible classifiers. The resulting 0.5 contour remains similar to
% that for the best-fit parameters. However, the width of the decision
% boundary increases as we move away from the data, in full accordance
% with intuition.
%
%\subsection{Implementation of Bayesian inference for neural networks}
% Figure \ref{marg.fig} is only a schematic figure. How can we actually
% implement Bayesian inference for the probability distribution
% $P(\bw\given D,\a,\H) \propto \exp( -M(\bw))$?
\subsection{Implementation}
How shall we compute the integral (\ref{eq.marg.integral})?
% Such marginalizations can rarely be done analytically.
For our toy problem, the weight space is three dimensional;
for a realistic neural network the dimensionality $K$ might be
in the thousands.
Bayesian inference for general data modelling problems
may be implemented by exact methods (\chref{ch.exact}),
by Monte Carlo sampling (\chref{ch.mc}), or
by deterministic approximate methods, for example,
methods that make Gaussian approximations to $P(\bw\given D,\a)$ using\index{approximation!by Gaussian}
\ind{Laplace's method} (\chref{ch.laplace})
or \ind{variational methods} (\chref{ch.mft}).
For neural networks there are few exact
% analytic
methods.
The two main approaches to implementing Bayesian inference for neural
networks are the
% Sophisticated
Monte Carlo methods developed by \citeasnoun{Radford_book}\index{Neal, Radford}
and the Gaussian approximation methods developed by \citeasnoun{MacKay91}.\index{MacKay, David}%
\nocite{MacKay92b}\nocite{MacKay92d}\nocite{Neal_nips5}\nocite{MacKay95:network}
\section{Monte Carlo implementation of a single neuron\nonexaminable}
First we will use a Monte Carlo approach in which
the task of evaluating the integral (\ref{eq.marg.integral})
is solved by treating $y( \bx^{\bssNN} ; \w )$ as a function $f$
of $\w$ whose mean we compute using
\beq
\left< f(\bw) \right> \simeq \frac{1}{R} \sum_{r} f( \bw^{(r)} )
\label{eq.neu.mc.est}
\eeq
where $\{ \bw^{(r)} \}$ are
% random
samples from
the posterior distribution
$\frac{1}{Z_M} { \exp(-M(\w)) }$ (\cf\ \eqref{eq.mc.est}). We obtain
the samples using a Metropolis method
(section \ref{sec.metropolis}).
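Applied to the integral (\ref{eq.marg.integral}), the estimator
(\ref{eq.neu.mc.est}) reads
\beq
P(\bt^{\bssNN}\eq 1 \given \bx^{\bssNN}, D,\a ) \simeq
\frac{1}{R} \sum_{r=1}^{R} y( \bx^{\bssNN} ; \bw^{(r)} ) ,
\eeq
that is, the Bayesian prediction is approximated by averaging the
outputs of $R$ neurons, one for each sampled weight vector.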
{As an aside, a possible
% \subsection{A final note about marginalization by sampling}
disadvantage of this Monte Carlo approach is that it is a poor
way of estimating the probability of an improbable event, \ie, a
$P(t\given D,\H)$ that is very close to zero, if the improbable event is
most likely to occur in conjunction with improbable parameter values.
}
How shall we generate the samples $\{ \bw^{(r)} \}$?
Radford Neal\index{Neal, Radford} introduced the
{\dem\ind{Hamiltonian Monte Carlo}\/} method to
neural networks. We met this sophisticated \ind{Metropolis method},
which makes use of gradient information, in \chref{ch.mc2}.
The method we now demonstrate
is a simple version of \Hybrid\ Monte Carlo called
the {\dem{Langevin Monte Carlo method}}.
% , and source code that implements it is shown in \figref{}.
% The alternatives are the Gaussian approximations described here,
% and Monte Carlo methods \cite{Neal_nips5}.
%% Precisely this result is obtained by marginalizing over parameters,
%% whose posterior probability distribution is depicted in figure
%% \ref{marg.fig}c. Two random samples from the posterior define two
%%different classification surfaces, which are illustrated in figures
%%\ref{marg.fig}d, e. The point B is classified differently by
%%these different plausible classifiers, whereas the classification of A
%%is relatively stable. We obtain the Bayesian predictions (figure
%%\ref{marg.fig}f) by averaging together the predictions of the
%%plausible classifiers. The resulting 0.5 contour remains similar to
%%that for the best-fit parameters. However, the width of the decision
%%boundary increases as we move away from the data, in full accordance with
%%intuition.
\begin{figure}% ``22'' means 2 contours in 2 dimensions
%\figuredangle{%
\figuremarginb{\small%
%\figuremargin{%
\begin{center}
%\framebox{
{
(a)\makebox[0in][l]{\raisebox{-5mm}{Dumb Metropolis}}%
\setlength{\unitlength}{0.6666mm}
\begin{picture}(45,75)(0,0)
\put(0,0){\makebox(0,0)[bl]{\psfig{figure=figs/metrop22displaced.eps,%
width=2in,angle=-90}}}
\put(18,41){\makebox(0,0)[l]{$\bx^{(1)}$}}
\put(9,37){\makebox(0,0)[r]{$Q(\bx;\bx^{(1)})$}}
\put(37,64){\makebox(0,0)[l]{$P^*(\bx)$}}
\put(16,52){\makebox(0,0)[b]{$\epsilon$}}
\end{picture}
(b)\makebox[0in][l]{\raisebox{-5mm}{Gradient descent}}%
\setlength{\unitlength}{0.6666mm}
\begin{picture}(45,75)(0,0)
\put(0,0){\makebox(0,0)[bl]{\psfig{figure=figs/gradient22.eps,%
width=2in,angle=-90}}}
\put(19,33){\makebox(0,0)[r]{$- \eta \bg$}}
\end{picture}
(c)\makebox[0in][l]{\raisebox{-5mm}{Langevin}}%
\setlength{\unitlength}{0.6666mm}
\begin{picture}(45,75)(0,0)
\put(0,0){\makebox(0,0)[bl]{\psfig{figure=figs/langevin22.eps,%
width=2in,angle=-90}}}
\end{picture}
}
\end{center}
}{%
\caption[a]{{One step of the Langevin method in two dimensions (c), contrasted with a
traditional `dumb' Metropolis method (a) and with\index{Monte Carlo methods!Metropolis method!dumb Metropolis}\index{dumb Metropolis}
gradient descent (b). The proposal density of the Langevin
method is given by `gradient descent with noise'.
}}
% For a complete description of the method see the text.}}
\label{fig.langevin22}
}%
\end{figure}
\begin{algorithm}
\begin{framedalgorithmwithcaption}%\figuremargin{%\margincaption
{%
\caption[a]{{\tt Octave} source code for the Langevin Monte Carlo method.
% Variable declarations have been omitted.
%
%, as have some parts of the code
% that are identical to that in \figref{fig.octave.grad}.
To obtain the \ind{Hamiltonian Monte Carlo} method, we
repeat the four lines marked {\tt{*}} multiple times
(\protect\algref{fig.octave.hmc}).}
\label{fig.octave.langmc}
}%
%{%%%%%%%%%%%%%%%%
\footnotesize
\begin{verbatim}
g = gradM ( w ) ;              # set gradient using initial w
M = findM ( w ) ;              # set objective function too

for l = 1:L                    # loop L times
   p = randn ( size(w) ) ;     # initial momentum is Normal(0,1)
   H = p' * p / 2 + M ;        # evaluate H(w,p)

*  p = p - epsilon * g / 2 ;   # make half-step in p
*  wnew = w + epsilon * p ;    # make step in w
*  gnew = gradM ( wnew ) ;     # find new gradient
*  p = p - epsilon * gnew / 2 ;  # make half-step in p

   Mnew = findM ( wnew ) ;     # find new objective function
   Hnew = p' * p / 2 + Mnew ;  # evaluate new value of H
   dH = Hnew - H ;             # decide whether to accept
   if ( dH < 0 )                  accept = 1 ;
   elseif ( rand() < exp(-dH) )   accept = 1 ;  # compare with a uniform
   else                           accept = 0 ;  #             variate
   endif
   if ( accept )  g = gnew ;  w = wnew ;  M = Mnew ;  endif
endfor

function gM = gradM ( w )      # gradient of objective function
   a = x * w ;                 # compute activations
   y = sigmoid(a) ;            # compute outputs
   e = t - y ;                 # compute errors
   g = - x' * e ;              # compute the gradient of G(w)
   gM = alpha * w + g ;        # add the gradient of alpha * EW
endfunction

function M = findM ( w )       # objective function
   y = sigmoid ( x * w ) ;     # recompute outputs for this w
                               # (x, t, alpha: declarations omitted)
   G = - (t' * log(y) + (1-t') * log( 1-y )) ;
   EW = w' * w / 2 ;
   M = G + alpha * EW ;
endfunction
\end{verbatim}
% assuming
% # that y contains the relevant activities
\end{framedalgorithmwithcaption}
%}
\end{algorithm}
%
% this demo actually had 38299 / 41000 accepts = 0.93412 acceptance rate
%
% see also neuron_hmc.tex
\subsection{The Langevin Monte Carlo method}
The Langevin method (\algref{fig.octave.langmc})\index{algorithm!Langevin Monte Carlo}\indexs{Hamiltonian Monte Carlo}\index{Monte Carlo methods!Hamiltonian Monte Carlo}\index{algorithm!Hamiltonian Monte Carlo}
may be summarized as `gradient descent with
added noise', as shown pictorially in \figref{fig.langevin22}.
% One way of describing the Langevin method is as follows.
A noise vector $\bp$ is generated from
a Gaussian with unit variance.
The gradient $\bg$ is computed, and a step in $\bw$ is made,
given by
\beq
\upDelta \bw = - {\textstyle\frac{1}{2}} \epsilon^2 \bg + \epsilon \bp .
\eeq
Notice that if the $\epsilon \bp$ term were omitted
this would simply be gradient descent with
learning rate $\eta = \frac{1}{2} \epsilon^2$.
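For the single neuron with the quadratic regularizer used here,
the gradient of the objective function
(the quantity returned by the routine {\tt gradM} in
\algref{fig.octave.langmc}) is
\beq
\bg = \frac{\partial M}{\partial \bw}
= - \sum_n \left( t^{(n)} - y(\bx^{(n)};\bw) \right) \bx^{(n)} + \a \bw .
\eeq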
%
This step in $\bw$ is accepted or rejected depending on the
change in the value of the objective function $M(\bw)$
and on the change in gradient, with a probability of acceptance
such that detailed balance holds.
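In the notation of \algref{fig.octave.langmc}, the acceptance rule can
be written in terms of the function
$H(\bw,\bp) = M(\bw) + \frac{1}{2} \sum_i p_i^2$.
If $\bw' = \bw + \upDelta \bw$ is the proposed state, $\bg'$ is the
gradient there, and
$\bp' = \bp - \frac{1}{2} \epsilon ( \bg + \bg' )$ is the momentum
after the two half-steps, then the move is accepted with probability
\beq
P(\mbox{accept}) = \min \! \left( 1 , \,
\exp \! \left[ - \left( H(\bw',\bp') - H(\bw,\bp) \right) \right] \right) .
\eeq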
% A more detailed description of the Langevin method is given
% in \figref{fig.octave.langmc}.
%, where the acceptance rule is given in detail.
% epsilon = sqrt( 2 * eta ) ;
%\begin{figure}
%\begin{center}
%\mbox{
%\makebox[0 in][l]{\bf (a)}\psfig{figure=\handfigs/cl.dat.ps,%
%width=1.25 true in,angle=-90}
%\hspace{0.2in}
%\makebox[0 in][l]{\bf (b)}\psfig{figure=\handfigs/cl.mpb.ps,%
%width=1.25 true in,angle=-90}
%\hspace{0.2in}
%\makebox[0 in][l]{\bf (c)}\psfig{figure=\handfigs/cl.wc.ps,%
%width=1.25 true in,angle=-90}
%}\\[0.1in]
%\mbox{
%\makebox[0 in][l]{\bf (d)}\psfig{figure=\handfigs/cl.s1b.ps,%
%width=1.25 true in,angle=-90}
%\makebox[0 in][l]{\bf (e)}\psfig{figure=\handfigs/cl.s2c.ps,%
%width=1.25 true in,angle=-90}
%\hspace{0.4in}
%\makebox[0 in][l]{\bf (f)}\psfig{figure=\handfigs/cl.mab.ps,%
%width=1.25 true in,angle=-90}
%}\\
%\end{center}
%\caption{ {\bf Taking into account uncertainty when making predictions
% with a classifier.}
%{\bf (a)} {\bf A binary data set.} The two classes are denoted
%by $\times$=1, $\circ$=0.
%{\bf (b)}
%The data are modelled with a linear logistic function.
%Here the {\bf best-fit} model is shown by its 0.1, 0.5 and 0.9 predictive
%contours.
%% Is it reasonable to use this optimum model for predictions?
%% Consider new data arriving at points A and B.
%The best fit model
%assigns probability 0.9 of being in class
%1 to both inputs A and B.
%{\bf (c)}
%The posterior probability distribution of the model parameters,
%$P(\w\given D,\H)$ (schematic; the third parameter, the bias, is not shown).
%The parameters are not perfectly determined by the data.
%Two typical samples from the posterior are indicated by the points labeled 1
%and 2.
%The following two panels show the corresponding classification contours.
%{\bf (d)}~Sample 1.
%{\bf (e)}~Sample 2. Notice how the point B is classified differently by
%these different plausible classifiers, whereas the classification of
%A is relatively stable.
%{\bf (f)}~We obtain the Bayesian predictions by integrating over the
%posterior distribution of $\bw$. The width of the decision boundary
%increases as we move away from the data (point B).
%% in accordance with intuition.
%See text for further discussion.
%}
%\label{marg.fig}
%\end{figure}
\begin{figure}% 30 , 80 , 500 , 3000 , 40000 its. see neuron/README
\fullwidthfigureright{\small%
% saved to graveyard Tue 31/12/02
\begin{center}
\hspace*{-0.0in} \begin{tabular}{cccc}
\makebox[0in][l]{\raisebox{-4mm}{(a)}}\makebox[1.7in][l]{\hspace{-0.3in}\psfig{figure=neuron/ps/wlangwd01time.ps,width=1.9in,angle=-90}}
&
\makebox[0in][l]{\raisebox{-4mm}{(b)}}\makebox[1.55in][l]{\hspace{-0.4in}\psfig{figure=neuron/ps/wlangwlwd01.ps,width=1.9in,angle=-90}}{\hspace{-0.1in}}
&
\makebox[0in][l]{\raisebox{-4mm}{(c)}}\makebox[1.7in][l]{\hspace{-0.3in}\psfig{figure=neuron/ps/Gwlangwd01time.ps,width=1.9in,angle=-90}}
&
\makebox[0in][l]{\raisebox{-4mm}{(d)}}\makebox[1.7in][l]{\hspace{-0.3in}\psfig{figure=neuron/ps/Mwlangwd01time.ps,width=1.9in,angle=-90}}
\\
\end{tabular}
\end{center}
}{%
\caption[a]{ {
A single neuron learning under the \index{Langevin method}Langevin Monte Carlo method.\index{Monte Carlo methods!Langevin method}}
(a) Evolution of weights $w_0$, $w_1$ and $w_2$ as a function of
number of iterations.
% (a1: linear scale; a2: log scale).
(b) Evolution of weights $w_1$ and $w_2$ in weight space.
Also shown by a line is the evolution of the weights using the
optimizer of figure \protect\ref{fig.neuron.learns.decay}.
% b2: every 100th state, connected by lines to show the simulation
% sequence.
(c) The error function $G(\bw)$
as a function of number of
iterations. Also shown is the error function during the optimization
of \figref{fig.neuron.learns.decay}.
(d) The objective function $M(\bw)$
as a function of number of
iterations.
See also figures \ref{fig.neuron.langevin.samples} and
\ref{fig.neuron.langevin.pred}.
}
\label{fig.neuron.langevin}
}%
\end{figure}
\begin{figure}% 30 , 80 , 500 , 3000 , 40000 its. see neuron/README
\figuremargin{%
\begin{center}
\mbox{\hspace*{-0.38in}\begin{tabular}{*{5}{c@{\hspace{-0.38in}}}}% with \hspace{-0.3in},
% these were perfectly spaced horizontally and vertically
\psfig{figure=neuron/ps/langmc0.ps,width=1.35in,height=1.055in,angle=-90}
&
\psfig{figure=neuron/ps/langmc1.ps,width=1.35in,height=1.055in,angle=-90}
&
\psfig{figure=neuron/ps/langmc2.ps,width=1.35in,height=1.055in,angle=-90}
&
\psfig{figure=neuron/ps/langmc3.ps,width=1.35in,height=1.055in,angle=-90}
&
\psfig{figure=neuron/ps/langmc4.ps,width=1.35in,height=1.055in,angle=-90}
\\[-0.08in]
\psfig{figure=neuron/ps/langmc5.ps,width=1.35in,height=1.055in,angle=-90}
&
\psfig{figure=neuron/ps/langmc6.ps,width=1.35in,height=1.055in,angle=-90}
&
\psfig{figure=neuron/ps/langmc7.ps,width=1.35in,height=1.055in,angle=-90}
&
\psfig{figure=neuron/ps/langmc8.ps,width=1.35in,height=1.055in,angle=-90}
&
\psfig{figure=neuron/ps/langmc9.ps,width=1.35in,height=1.055in,angle=-90}
\\[-0.08in]
\psfig{figure=neuron/ps/langmc10.ps,width=1.35in,height=1.055in,angle=-90}
&
\psfig{figure=neuron/ps/langmc11.ps,width=1.35in,height=1.055in,angle=-90}
&
\psfig{figure=neuron/ps/langmc12.ps,width=1.35in,height=1.055in,angle=-90}
&
\psfig{figure=neuron/ps/langmc13.ps,width=1.35in,height=1.055in,angle=-90}
&
\psfig{figure=neuron/ps/langmc14.ps,width=1.35in,height=1.055in,angle=-90}
\\[-0.08in]
\psfig{figure=neuron/ps/langmc15.ps,width=1.35in,height=1.055in,angle=-90}
&
\psfig{figure=neuron/ps/langmc16.ps,width=1.35in,height=1.055in,angle=-90}
&
\psfig{figure=neuron/ps/langmc17.ps,width=1.35in,height=1.055in,angle=-90}
&
\psfig{figure=neuron/ps/langmc18.ps,width=1.35in,height=1.055in,angle=-90}
&
\psfig{figure=neuron/ps/langmc19.ps,width=1.35in,height=1.055in,angle=-90}
\\[-0.08in]
\psfig{figure=neuron/ps/langmc20.ps,width=1.35in,height=1.055in,angle=-90}
&
\psfig{figure=neuron/ps/langmc21.ps,width=1.35in,height=1.055in,angle=-90}
&
\psfig{figure=neuron/ps/langmc22.ps,width=1.35in,height=1.055in,angle=-90}
&
\psfig{figure=neuron/ps/langmc23.ps,width=1.35in,height=1.055in,angle=-90}
&
\psfig{figure=neuron/ps/langmc24.ps,width=1.35in,height=1.055in,angle=-90}
\\[-0.08in]
\psfig{figure=neuron/ps/langmc25.ps,width=1.35in,height=1.055in,angle=-90}
&
\psfig{figure=neuron/ps/langmc26.ps,width=1.35in,height=1.055in,angle=-90}
&
\psfig{figure=neuron/ps/langmc27.ps,width=1.35in,height=1.055in,angle=-90}
&
\psfig{figure=neuron/ps/langmc28.ps,width=1.35in,height=1.055in,angle=-90}
&
\psfig{figure=neuron/ps/langmc29.ps,width=1.35in,height=1.055in,angle=-90}
\\[-0.08in]
% \psfig{figure=neuron/ps/langmc30.ps,width=1.35in,height=1.055in,angle=-90}
\end{tabular}\hspace*{0.38in}}
\end{center}
}{%
\caption[a]{{Samples
% from the posterior probability distribution of neural network weights,
obtained by the Langevin Monte Carlo method.}
The learning rate was set to $\eta=0.01$ and the weight decay rate
to $\alpha=0.01$. The step size is given by $\epsilon = \sqrt{2 \eta}$.
The function performed by the neuron is shown by three of its contours
every 1000 iterations from iteration $10\,000$ to $40\,000$.
%starting at iteration 10500, and continuing to $39\,500$.
The contours shown are those corresponding to
$a=0,\pm 1$, namely $y=0.5, 0.27$ and $0.73$. Also shown is a vector
proportional to $(w_1,w_2)$.
}
\label{fig.neuron.langevin.samples}
}%
\end{figure}
\begin{figure}% 30 , 80 , 500 , 3000 , 40000 its. see neuron/README
%\figuremargin{%
\fullwidthfigureright{\small
\begin{center}
%\fbox{
\begin{tabular}{*{2}{c}}
\raisebox{0.6in}{(a)}\hspace{-0.4in}%
\setlength{\unitlength}{1in}
\begin{picture}(2.1,1.6)(0,0)% Two superposed ps files
\put(-0.3,-0.75){\makebox(0,0)[bl]{\psfig{figure=neuron/ps/predlangavg.ps,height=3.2in,width=3in,angle=-90}}}
\put(-0.3,-0.75){\makebox(0,0)[bl]{\psfig{figure=neuron/ps/preddata.ps,height=3.2in,width=3in,angle=-90}}}
\end{picture}
&
\raisebox{0.6in}{(b)}\hspace{-0.4in}%
\setlength{\unitlength}{1in}
\begin{picture}(2.1,1.6)(0,0)
\put(-0.3,-0.75){\makebox(0,0)[bl]{\psfig{figure=neuron/ps/predmap.ps,height=3.2in,width=3in,angle=-90}}}
\put(-0.3,-0.75){\makebox(0,0)[bl]{\psfig{figure=neuron/ps/preddata.ps,height=3.2in,width=3in,angle=-90}}}
\end{picture}
\\
\end{tabular}
%}%fbox
\end{center}
}{%
\caption[a]{Bayesian predictions found by the Langevin Monte Carlo method
compared with the
predictions using the optimized parameters.
(a) The predictive function obtained by averaging the
predictions for 30 samples uniformly spaced
between iterations $10\,000$ and $40\,000$, shown in
\figref{fig.neuron.langevin.samples}.
The contours shown are those corresponding to
$a=0,\pm 1,\pm 2$, namely $y=0.5, 0.27$, $0.73$, $0.12$ and $0.88$.
(b) For contrast, the predictions given by the `most probable'
setting of the neuron's parameters, as given by optimization of
$M(\bw)$.
}
\label{fig.neuron.langevin.pred}
}%
\end{figure}
The Langevin method has one free parameter, $\epsilon$,
which controls the typical step size. If $\epsilon$ is set to too
large a value, most proposed moves may be rejected. If it is set to a very small
value, progress around the state space will be slow.
\subsection{Demonstration of Langevin method}
\label{sec.langevin.neuron}
The {Langevin method} is demonstrated in figures
\ref{fig.neuron.langevin}, \ref{fig.neuron.langevin.samples} and
\ref{fig.neuron.langevin.pred}. Here, the objective function is
$M(\bw) = G(\bw) + \a E_W(\bw)$, with $\a=0.01$. These figures
include, for comparison, the results of the previous optimization
method using gradient descent on the same objective function
(\protect\figref{fig.neuron.learns.decay}). It can be seen that the
mean evolution of $\bw$ is similar to the evolution of the parameters
under gradient descent. The \index{Monte Carlo methods}{Monte Carlo} method appears to have
converged to the posterior distribution after about $10\,000$
iterations.
% the same number of
% iterations as it took the gradient descent algorithm.
The average acceptance rate during this simulation was 93\%;
only 7\% of the proposed moves were rejected. Probably, faster progress
around the state space
would have been made if a larger step size $\epsilon$ had been used,
but the value was chosen so that
the `descent rate'\index{gradient descent}
$\eta = \frac{1}{2} \epsilon^2$ matched
the step size of the earlier simulations.
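As a quick check of this correspondence (a sketch, assuming the Langevin proposal
is a single leapfrog step, with momentum $\bp$ drawn from a unit Gaussian and
$\mathbf{g}$ denoting the gradient of $M$ at $\bw$): the half-step in $\bp$
followed by the step in $\bw$ gives
\beq
 \bw' = \bw + \epsilon \left( \bp - \frac{\epsilon}{2} \mathbf{g} \right)
 = \bw - \frac{1}{2} \epsilon^2 \mathbf{g} + \epsilon \bp ,
\eeq
so the deterministic part of the move is a gradient-descent step of size
$\frac{1}{2} \epsilon^2$. Matching this to $\eta = 0.01$ gives
\beq
 \epsilon = \sqrt{2 \eta} = \sqrt{0.02} \simeq 0.14 .
\eeq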
\subsection{Making Bayesian predictions}
From iteration $10\,000$ to $40\,000$, the weights were sampled every
1000 iterations and the corresponding functions of $\bx$ are
plotted in \figref{fig.neuron.langevin.samples}. There is a considerable
variety of plausible functions.
We obtain a Monte Carlo approximation to
the Bayesian predictions by averaging these thirty
functions of $\bx$ together. The result is shown in
\figref{fig.neuron.langevin.pred} and contrasted with
the predictions given by the optimized parameters.
The Bayesian predictions become satisfyingly moderate\index{moderation}
as we move away from the region of highest data density.
% Comparing figures \ref{} and \ref{},
% we notice that when we obtain the Bayesian predictions,
%% the Bayesian approach is superior because
% the best-fit model's
% predictions are {\em selectively} downweighted, to a different degree
% for each test case. The consequence is that a
The Bayesian classifier is
better able to identify the points where the classification is
uncertain. This pleasing behaviour results simply from a mechanical
application of the rules of probability.
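Written out, the Monte Carlo approximation used here is simply the arithmetic
mean of the sampled functions. As a sketch, with $R=30$ and with $\bw^{(r)}$
denoting the $r$th retained sample (notation introduced only for this
illustration),
\beq
 P(t \eq 1 \given \bx, D, \a) \simeq \frac{1}{R} \sum_{r=1}^{R} y(\bx ; \bw^{(r)}) .
\eeq
At an input $\bx$ where the sampled functions disagree, some giving $y$ near 0
and others giving $y$ near 1, the average is pulled towards intermediate values;
where they all agree, the average is as confident as any one of them.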
\subsection{Optimization and typicality}
A final observation concerns the behaviour of the functions
$G(\bw)$ and $M(\bw)$ during the Monte Carlo sampling process,
compared with the values of $G$ and $M$ at the optimum $\wmp$
(\figref{fig.neuron.langevin}).
The function $G(\bw)$ fluctuates around the value of $G(\wmp)$,
though not in a symmetrical way.
The function $M(\bw)$ also fluctuates, but
it does not fluctuate {\em around\/} $M(\wmp)$
-- obviously it cannot, because $M$ is minimized at
$\wmp$, so $M$ cannot be any smaller --
furthermore, $M$ only rarely drops
close to $M(\wmp)$. In the language of information
theory, {\em the typical set of $\bw$ has different
properties from the most probable state $\wmp$}.
% Put main theme here --
A general message therefore emerges -- applicable to
all data models, not just neural networks:
one should be cautious about making
use of {\em optimized\/} parameters, as the properties of
% predictions given by\index{optimization}\index{typicality}
optimized parameters may be unrepresentative
of the properties of typical, plausible parameters; and the
predictions obtained using optimized parameters alone will
often be unreasonably over-confident.
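A rough calculation indicates how large the gap between typical and optimal
values of $M$ might be. This is only a sketch: it assumes that the posterior over
the $K$ parameters is well approximated by a Gaussian centred on $\wmp$ whose
inverse covariance matrix $\bA$ is the matrix of second derivatives of $M$ at
$\wmp$, which is not exact for this model. Under that assumption,
$(\bw - \wmp)^{\T} \bA (\bw - \wmp)$ has a $\chi^2$ distribution with $K$
degrees of freedom, so
\beq
 \left< M(\bw) - M(\wmp) \right> \simeq \frac{K}{2} ,
 \hspace{0.3in} \mbox{with standard deviation } \sqrt{K/2} .
\eeq
For a model with $K=3$ parameters, for example, typical samples would have $M$
about $1.5 \pm 1.2$ above $M(\wmp)$; the larger $K$ is, the more rarely a sample
comes close to $M(\wmp)$.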
\fakesection{lang versus hmc}
% full story in \input{tex/neuron_hmc.tex}
% make sure you are editing the right one!!!
%
%% changed gwnew to gnew Wed 6/12/00 and gw to g
\begin{algorithm}
\begin{framedalgorithmwithcaption}%\figuremargin{%\margincaption
{
\caption[a]{{\tt Octave} source code for the \hybrid\ Monte Carlo method.
The algorithm is identical to the Langevin method
in \algref{fig.octave.langmc}, except for the replacement of the
four lines marked {\tt{*}} in that algorithm by the fragment shown here.}
\label{fig.octave.hmc}
}%
%%%%%%%%%%%%
\footnotesize
\begin{verbatim}
wnew = w ;
gnew = g ;
for tau = 1:Tau
    p = p - epsilon * gnew / 2 ;   # make half-step in p
    wnew = wnew + epsilon * p ;    # make step in w
    gnew = gradM ( wnew ) ;        # find new gradient
    p = p - epsilon * gnew / 2 ;   # make half-step in p
endfor
\end{verbatim}
\end{framedalgorithmwithcaption}
\end{algorithm}
%
\begin{figure}% lang versus hmc
\figuremargin{%
\begin{center}
\footnotesize
\begin{tabular}{l*{1}{c}}
\raisebox{4mm}{Langevin} &
%\makebox[0in][l]{\raisebox{-4mm}{()}}
\makebox[5in][l]{\hspace{-0.4in}\psfig{figure=neuron/ps/wlangtime.1.10000.ps,width=4.5in,angle=-90}}
\\[0.1in]
\raisebox{4mm}{HMC} &
%\makebox[0in][l]{\raisebox{-4mm}{()}}
\makebox[5in][l]{\hspace{-0.4in}\psfig{figure=neuron/ps/whmctime.1.10000.ps,width=4.5in,angle=-90}}
\\[0.1in]
\end{tabular}
% to graveyard
\end{center}
}{%
\caption[a]{ {Comparison of sampling properties of the
Langevin Monte Carlo method and the \hybrid\ Monte Carlo (HMC) method.}
The horizontal axis is the number of gradient evaluations made.
%(a)
Each figure shows the weights during the first $10\,000$ iterations.
% (b) The weights during iterations 10000--20000.
The rejection rate during this \hybrid\ Monte
Carlo simulation was 8\%.
}
\label{fig.neuron.hmc.v.lang}
}%
\end{figure}
\subsection{Reducing random walk behaviour using \hybrid\ Monte Carlo}
As a final study of Monte Carlo methods, we now compare the Langevin
Monte Carlo method with its big brother, the \hybrid\ Monte Carlo method.
The change to \hybrid\ Monte Carlo is simple to implement, as shown
in \algref{fig.octave.hmc}. Each single proposal makes use of
multiple gradient evaluations along a dynamical trajectory
in $\bw,\bp$ space, where $\bp$ are the extra `momentum' variables
of the Langevin and \hybrid\ Monte Carlo methods.
The number of steps `{\tt{Tau}}' was set at
random to a number between 100 and 200 for each trajectory.
The step size $\epsilon$
was kept fixed so as to retain comparability with the simulations that
have gone before; it is recommended that one randomize the step
% \index{Neal, Radford}
size in practical applications, however.
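In equations, the loop in \algref{fig.octave.hmc} is a leapfrog discretization
of Hamiltonian dynamics. As a sketch, assuming the Hamiltonian is
$H(\bw,\bp) = M(\bw) + \frac{1}{2} \bp^{\T} \bp$ (which is what makes the
momentum updates below gradient steps in $M$), each of the {\tt{Tau}} steps
maps $(\bw,\bp)$ to $(\bw',\bp')$ by
\beq
\begin{array}{rcl}
 {\bp}_{1/2} & = & \bp - \frac{\epsilon}{2} \nabla M(\bw) \\[0.05in]
 \bw' & = & \bw + \epsilon \, {\bp}_{1/2} \\[0.05in]
 \bp' & = & {\bp}_{1/2} - \frac{\epsilon}{2} \nabla M(\bw') .
\end{array}
\eeq
Each step requires one new gradient evaluation, so a trajectory of between 100
and 200 steps costs between 100 and 200 gradient evaluations per proposal. A
fair comparison with the Langevin method therefore counts gradient evaluations
rather than proposals, as the horizontal axis of \figref{fig.neuron.hmc.v.lang} does.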
\Figref{fig.neuron.hmc.v.lang} compares the sampling properties of the
Langevin and \hybrid\ Monte Carlo methods.
% The results are shown in
% figures \ref{fig.neuron.hmc} and \ref{fig.neuron.hmc.v.lang}.
The
autocorrelation of the state of the \hybrid\ Monte Carlo
simulation falls much more rapidly with simulation time
than that of the Langevin method. For this toy problem,
\Hybrid\ Monte Carlo is at least ten times more efficient in its
use of computer time.
% full story in \input{tex/neuron_hmc.tex}
% Bayesian
\section{Implementing inference with Gaussian approximations\nonexaminable}
Physicists love to take nonlinearities and locally linearize them,
and they love to approximate probability distributions by
Gaussians. Such approximations offer an alternative strategy
for dealing with the integral
\beq
P(\bt^{\bssNN} \eq 1 \given \bx^{\bssNN}, D,\a
% ,\H
) = \int \! \d^K\w \:
y( \bx^{\bssNN} ; \w ) \frac{1}{Z_M} \, { \exp(-M(\w)) } ,
\label{eq.marg.integral.again}
\eeq
which we just evaluated using Monte Carlo methods.
We start by making a Gaussian approximation to the
posterior probability. We go to the minimum of $M(\w)$ (using a
gradient-based optimizer)
and Taylor-expand $M$ there:
\beq
M(\w) \simeq
M(\wmp) + \frac{1}{2} (\w - \wmp)^{\T} \bA
(\w - \wmp)
+ \cdots ,
\eeq
% \beq
% \frac{1}{Z_M} { \exp(-M(\w)) } \simeq
% \frac{1}{Z_M} { \exp\left[- M(\wmp) - \frac{1}{2} (\w - \wmp)^{\T} \bA
% (\w - \wmp) + \ldots \right] } ,
% \eeq
where $\bA$ is the matrix of second derivatives, also known as
the {\dbf\ind{Hessian}}, defined by
\beq
A_{ij} \equiv \left.
\frac{\partial^2}{\partial w_i \partial w_j} M(\w) \right|_{\w = \wmp}.
\eeq
We thus define our {Gaussian}\index{Laplace's method} approximation:\index{Gaussian distribution!approximation}
\beq
Q(\w;\wmp,\bA) = \left[ \det \! \left( \bA / 2 \pi \right) \right]^{1/2}
\exp\left[- \frac{1}{2} (\w - \wmp)^{\T} \bA
(\w - \wmp) \right] .
\eeq
We can think of the matrix $\bA$ as defining {\dbf\ind{error bars}\/} on
$\w$. To be precise, $Q$ is a normal distribution whose
variance--covariance matrix is $\bAI$.
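To make the idea of error bars concrete, here is a two-parameter illustration
with made-up numbers (they are not taken from the neuron demonstration). If
\beq
 \bA = \left[ \begin{array}{cc} 2 & 1 \\ 1 & 2 \end{array} \right] ,
 \hspace{0.3in} \mbox{then} \hspace{0.3in}
 \bAI = \frac{1}{3} \left[ \begin{array}{rr} 2 & -1 \\ -1 & 2 \end{array} \right] ,
\eeq
and the marginal error bar on each parameter is
$\sqrt{ (\bAI)_{ii} } = \sqrt{2/3} \simeq 0.82$. This is larger than
$1/\sqrt{A_{ii}} \simeq 0.71$, which is the width of $Q$ along $w_i$ when the
other parameter is held fixed at its most probable value; the two widths
coincide when the off-diagonal elements $A_{ij}$ vanish.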
\exercisxA{2}{ex.neuron2deriv}{
Show that the second derivative of $M(\w)$ with respect to $\w$ is given by
\beq
% \bA =
\frac{\partial^2}{\partial w_i \partial w_j} M(\w)
= \sum_{n=1}^N
f'(a^{(n)}) x_i^{(n)} x_j^{(n)} + \alpha \delta_{ij},
\label{eq.hessian}
\eeq
where $f'(a)$ is the first derivative of
%\beq
$f(a) \equiv 1/(1+e^{-a})$,
%\eeq
which is
% given by
\beq
f'(a) = \frac{\d}{\d a} f(a) = f(a) (1-f(a)),
\eeq
and
\beq
a^{(n)} = \sum_j w_j x_j^{(n)} .
\eeq
}
Having computed the Hessian, our task is then to perform the
integral (\ref{eq.marg.integral.again}) using
our Gaussian approximation.
\subsection{Calculating the marginalized probability}% Skippable section
The output $y(\bx;\bw)$ only depends on $\bw$ through the scalar
$a(\bx;\bw)$, so we can reduce the dimensionality of the integral
by finding the probability density of $a$.
We are assuming a locally Gaussian posterior probability
distribution
over $\bw=\wmp+\upDelta \bw$,
$P(\bw\given D,\a)\simeq (1/Z_Q) \exp (-\half \upDelta \bw^{\T} \bA \upDelta \bw)$.
For our single neuron, the activation $a(\bx;\bw)$ is a linear
function of $\bw$ with $\partial a/ \partial \bw = \bx$,
%
% maybe I will regret letting g be a gradient instead of a sensitivity?
%
so for any $\bx$, the activation $a$ is Gaussian-distributed.
\exercisaxB{2}{ex.gaussianmarg}{
Assuming $\w$ is Gaussian-distributed with mean $\wmp$ and
variance--covariance matrix $\bAI$, show that
the probability distribution of $a(\bx)$ is
\beq
P(a\given\bx,D,\a) = {\rm Normal} (\aamp, s^2) = \frac{1}{\sqrt{2\pi s^2}}\exp\left(-\frac{(a-\aamp)^2}{2 s^2}\right),
\eeq
where $\aamp \eq a(\bx;\wmp)$ and $s^2 \eq \bx^{\T} \bAI \bx$.
}
This means that the marginalized output is:
\beq
P(t\! = \! 1\given\bx,D,\a)=\psi(\aamp, s^2) \equiv \int \d a \: f(a) \: {\rm Normal} (\aamp, s^2).
\label{mod.out}
\eeq
This is to be contrasted with
$y(\bx;\wmp)\eq f(\aamp)$,
the output of the most probable network. The integral of a sigmoid times a Gaussian
can be approximated by\nocite{MacKay92d}:
%The suggested approximation is
\beq
\psi(\aamp, s^2) \simeq \phi(\aamp, s^2) \equiv
f(\kappa(s) \aamp)
\label{approx.mod.out}
\eeq
with $\kappa = 1/\sqrt{1+\pi s^2/8}$ (\figref{approx.assess}).
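As a worked example (the input values are chosen for illustration; $s^2=4$ is
the value used in \figref{approx.assess}), take $\aamp = 2$ and $s^2 = 4$:
\beq
 \kappa = \frac{1}{\sqrt{1 + \pi/2}} \simeq 0.62 ,
 \hspace{0.3in}
 \phi(2,4) = f(0.62 \times 2) \simeq 0.78 ,
\eeq
whereas the most probable network gives $f(2) \simeq 0.88$. The approximate
marginalized output is closer to $0.5$; and as $s^2 \rightarrow 0$ we recover
$\kappa \rightarrow 1$ and $\phi \rightarrow f(\aamp)$.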
\begin{figure}
\figuremargin{\small%
\[
\begin{array}[b]{c@{\hspace{-0.42in}}c}
\makebox[0pt][l]{\raisebox{0.5cm}[0pt][0pt]{(a)}}
\makebox[0in][l]{{\raisebox{1.75in}{$\psi(a, s^2)$}}}\hspace*{-0.15321in}
\psfig{figure=thesisfigs/class/psi3d.ps,%
width=75mm,height=48mm,%
bbllx=13mm,bblly=70mm,%
bburx=204mm,bbury=192mm}&
\makebox[0pt][l]{\hspace*{1cm}\raisebox{3.54cm}[0pt][0pt]{(b)}}%\hspace{0.1in}
\psfig{figure=thesisfigs/class/s2.p.f.ps,%
width=75mm,height=48mm,%
bbllx=13mm,bblly=78mm,%
bburx=204mm,bbury=200mm}
% \\
% \mbox{\footnotesize(a)} & \mbox{\footnotesize(b)}
%\\
% \makebox[0pt][l]{\raisebox{0.5cm}[0pt][0pt]{(c)}}
%\psfig{figure=thesisfigs/class/s2.f-p.ps,%
%width=75mm,height=24mm,%
%bbllx=12mm,bblly=109mm,%
%bburx=204mm,bbury=170mm}&
% \makebox[0pt][l]{\raisebox{1.5cm}[0pt][0pt]{(d)}}
%\psfig{figure=thesisfigs/class/deriv.log.ps,%
%width=75mm,height=24mm,%
%bbllx=12mm,bblly=109mm,%
%bburx=204mm,bbury=170mm}
\end{array}
\]
}{%
\caption[A]{{The marginalized probability, and an approximation to it.}
(a) The function $\psi(a, s^2)$, evaluated numerically.
In (b) the functions $\psi(a, s^2)$ and $\phi(a, s^2)$
defined in the text are shown as a function of $a$ for $s^2=4$.
%In (c), the difference $\phi - \psi$ is shown for the same parameter values.
%In (d), the breakdown of the approximation is emphasized by showing
%$\ln \phi'$ and $\ln \psi'$ (derivatives with respect to $a$).
% The errors become significant when $a\! \gg \! s$.
From \citeasnoun{MacKay92d}.}
\label{approx.assess}
}%
\end{figure}
\begin{figure}% 30 , 80 , 500 , 3000 , 40000 its. see neuron/README
\figuremargin{\small%
\begin{center}
\begin{tabular}{*{2}{c}}
\raisebox{0.6in}{(a)}\hspace{-0.4in}%
\psfig{figure=neuron/ps/wwwgaussian.ps,width=2.3in,angle=-90}
&
\raisebox{0.6in}{(b)}\hspace{-0.4in}%
\setlength{\unitlength}{1in}
\begin{picture}(2.1,1.6)(0,0)
\put(-0.3,-0.75){\makebox(0,0)[bl]{\psfig{figure=neuron/ps/predgaussian.ps,height=3.2in,width=3in,angle=-90}}}
\put(-0.3,-0.75){\makebox(0,0)[bl]{\psfig{figure=neuron/ps/preddata.AB.ps,height=3.2in,width=3in,angle=-90}}}
\end{picture}
\\
\end{tabular}
\end{center}
}{%
\caption[a]{The Gaussian approximation in weight space and its approximate
predictions in input space. (a) A projection of the Gaussian approximation
onto the $(w_1,w_2)$ plane of weight space. The one- and
two-standard-deviation contours are shown. Also shown are the
trajectory of the optimizer, and the Monte Carlo method's samples.
% The Monte Carlo samples are {\em not\/} involved in the creation
% of the Gaussian approximation.
(b) The predictive function obtained from the
Gaussian approximation and \protect\eqref{approx.mod.out}.
% The contours shown are those corresponding to
% $a=0,\pm 1,\pm 2$, namely $y=0.5, 0.27$, $0.73$, $0.12$ and $0.88$.
(\cf\ \protect\figref{fig.neuron.map}.)
}
\label{fig.neuron.gaussian}
}%
\end{figure}
\subsection{Demonstration}
\Figref{fig.neuron.gaussian} shows the result of fitting a Gaussian
approximation at the optimum $\wmp$, and the results of using
that Gaussian approximation and \eqref{approx.mod.out}
to make predictions.
Comparing these predictions with those of
the Langevin Monte Carlo method (\figref{fig.neuron.langevin.pred})
we observe that, whilst qualitatively the same,
the two are clearly numerically different. So at least
one of the two methods
% presumably the Gaussian approximation,
is not completely accurate.
\exercisaxB{2}{ex.gaussianapprox}{
Is the Gaussian approximation to $P(\w\given D,\a)$
too heavy-tailed\index{tail}
or too light-tailed, or both? It may help
to consider $P(\w\given D,\a)$ as a function of one parameter $w_i$
% a one-dimensional problem
and to think of the two distributions
% $P$ and $Q$
on a logarithmic scale.
%
Discuss the conditions under which the Gaussian approximation is most
accurate.
}
%
% Discuss contrast between discriminative training and full modelling
%
\subsection{Why marginalize?}
If the output is immediately used to make a ({\tt0}/{\tt1}) decision and the
costs associated with error are symmetrical, then the use of
marginalized outputs under this Gaussian approximation will make no
difference to the performance of the classifier, compared with using
the outputs given by the most probable parameters, since both
functions pass through $0.5$ at $\aamp\eq 0$. But these Bayesian
outputs will make a difference if, for example, there is an option of
saying `I don't know', in addition to saying `I guess {\tt0}' and `I guess {\tt1}'.
And even if there are just the two choices `{\tt0}' and `{\tt1}', if the
costs associated with error are unequal, then the decision
boundary will be some contour other than the 0.5 contour,
and the boundary will be affected by marginalization.
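For instance (a sketch with hypothetical costs), suppose that wrongly guessing
`{\tt1}' costs three times as much as wrongly guessing `{\tt0}'. The expected cost
is then minimized by guessing `{\tt1}' only where
$P(t \eq 1 \given \bx, D, \a) > 3/4$. Using the most probable parameters, the
decision boundary is the contour $f(\aamp) = 3/4$, that is,
$\aamp = \ln 3 \simeq 1.10$ everywhere. Using the approximation
(\ref{approx.mod.out}), the boundary is instead
\beq
 f(\kappa(s) \aamp) = \frac{3}{4}
 \hspace{0.2in} \Rightarrow \hspace{0.2in}
 \aamp = \frac{\ln 3}{\kappa(s)} = \ln 3 \, \sqrt{1 + \pi s^2 / 8} ,
\eeq
which depends on $s^2 = \bx^{\T} \bAI \bx$ and so lies further from the
$\aamp = 0$ contour wherever the error bars are larger; at $s^2 = 4$, for
example, it sits at $\aamp \simeq 1.76$.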
% Now in real life, we do have to make decisions, but often
% there are more choices available than just two
\dvips
%\section{Solutions}% to Chapter \protect\ref{ch.single.neuron.bayes}'s exercises} %
% empty file
\chapter*{Postscript on Supervised Neural Networks}
One of my students, Robert,
%{\tt rkh23@hermes.cam.ac.uk> Harle
asked:
\begin{quote}
Maybe I'm missing something fundamental, but supervised neural networks
seem equivalent to fitting a pre-defined function to some given data, then
extrapolating -- what's the difference?
\end{quote}
I agree with Robert. The
supervised neural networks we have studied so far
are simply parameterized nonlinear functions
which can be fitted to data.
%
Hopefully you will agree with another comment that
Robert made:
\begin{quote}
Unsupervised networks seem much more interesting than their supervised
counterparts. I'm amazed that it works!
\end{quote}
% So, to unsupervised networks.
% {The Hopfield network}
\chapter{Hopfield Networks}
\label{ch.hopfield}
% \label{ch.hopfield}
% \section{Connecting neurons together}
We have now spent three chapters studying the single
neuron. The time has come to connect multiple neurons together, making
the output of one neuron be the input to another, so as to
make neural networks.
Neural networks can be divided into two classes on the basis
of their connectivity.
\begin{figure}[hbtp]\small
\figuremargin{%
\begin{center}
(a)\mbox{\psfig{figure=figs/ffnet.eps,width=1.2in}}
\hspace{0.3in}
(b)\mbox{\psfig{figure=figs/fbnet.eps,width=1in}}
\end{center}
}{%
\caption[a]{(a) A feedforward network. (b) A feedback network.}
\label{fig.ffnet}
\label{fig.fbnet}
}%
\end{figure}
\begin{description}
\item[Feedforward networks\puncspace]
In a feedforward network, all the connections are directed such that
the network forms a {directed acyclic graph}.
\item[Feedback networks\puncspace]
Any network that is not a feedforward network will be called
a feedback network.
\end{description}
In this chapter we will discuss a fully connected feedback network
called the \inds{Hopfield network}. The weights in the Hopfield
network are constrained to be {\em symmetric}, \ie, the weight from
neuron $i$ to neuron $j$ is equal to the weight from
neuron $j$ to neuron $i$.
Hopfield\nocite{Hopfield82}\nocite{Hopfield84}\nocite{Hopfield87}
networks have two applications. First,\index{content-addressable memory}
they can act
as {\em associative memories}.\index{associative memory}\index{memory!associative}
Second, they can be used to solve {\em \ind{optimization} problems}.
We will first discuss the idea of associative memory,
also known as \ind{content-addressable memory}.
\section{\ind{Hebbian learning}}\index{learning!Hebbian}\index{learning algorithms!Hopfield network}
In \chapterref{ch.nn.intro}, we discussed the contrast between
traditional digital memories and biological memories.
Perhaps the most striking difference is the {\em associative\/}
nature of biological memory.
A\nocite{Hebb49}
simple model due to Donald \index{Hebb, Donald}Hebb (1949) captures the
idea of associative memory.
Imagine that
% all neurons are connected to each other, and
% imagine that the weights are adjusted so that
the weights between neurons whose activities are {\em positively correlated\/}
are {\em increased\/}:\index{correlations}\index{Hebbian learning}\index{learning!Hebbian}
\beq
\frac{\d w_{ij}}{\d t} \sim \mbox{Correlation}(x_i,x_j) .
\label{eq.hebb}
\eeq
Now imagine that
% neuron $m$ is associated with stimulus $m$
%, such that
when stimulus $m$ is present (for example, the smell of a banana),
the activity of neuron
$m$ increases; and that neuron $n$ is associated
with another stimulus, $n$ (for example, the sight of a yellow object).
If these two stimuli -- a yellow sight and a banana smell --
co-occur in the environment, then the Hebbian
learning rule (\ref{eq.hebb}) will increase the weights
$w_{nm}$ and
$w_{mn}$.
This means that when, on a later occasion, stimulus $n$ occurs in
isolation, making the activity $x_n$ large,
the positive weight from $n$ to $m$ will cause neuron $m$
also to be activated.
% stimulated.
Thus the response to the sight of
a yellow object is an automatic association with the smell of a banana.
We could call this `pattern completion'. No teacher is required for this
associative memory to work. No signal is needed to indicate that
a correlation has been detected or that an association should
be made. The unsupervised, local learning algorithm and the
unsupervised, local activity rule spontaneously produce
associative memory.
This idea seems so simple and so effective that it must be relevant
to how memories work in the brain.
% Demo should assess stability of desired memories, basin of attraction.
% Show error-correcting capability and also robustness to damage. <<<<<<<
%
% Learning as communication - what is the capacity? How many examples
% xn,tn could you retain in memory? Capacity = number of bits = number
% of random pattern labels.
% Objective function as useful description.
% But for the moment, let's get a broader feel
% for neural networks.
%
% \section{Hopfield network}
% An autoassociative memory. Feedback network. Unsupervised learning.
%
% Hopfield networks also have other uses which
% % that make use of different
% % learning rule;
% we'll come to later.
%
% Also, later, show that activity rule has lyapunov function.
\section{Definition of the binary Hopfield network}
\begin{description}
\item[Convention for weights\puncspace] Our convention
in general will be that $w_{ij}$ denotes the connection {\em from\/} neuron
$j$ {\em to} neuron $i$.
\item[Architecture\puncspace]
A \ind{Hopfield network} consists of $I$ neurons. They are fully connected
through {\em symmetric, bidirectional\/} connections with weights
$w_{ij}=w_{ji}$. There are no self-connections, so $w_{ii}=0$ for
all $i$. Biases $w_{i0}$ may be included\index{bias} (these may be viewed as weights
from a neuron `0' whose activity is permanently $x_0=1$). We will
denote the activity of neuron $i$ (its output) by $x_i$.
\item[Activity rule\puncspace]
Roughly, a Hopfield network's activity rule is for each neuron
to update its state as if it were a single neuron with
the threshold activation function
% of equation (\ref{eq.single.thresh}).
\beq
x(a) = \Theta(a) \equiv \left\{ \begin{array}{ll}
1 & a \geq 0 \\
-1 & a < 0 . \end{array} \right.
\label{eq.single.thresh.again}
\eeq
Since there is {\em \ind{feedback}\/} in a Hopfield network (every neuron's
output is an input to all the other neurons)
we will have to specify
an order for the updates to occur. The
updates may be synchronous or asynchronous.
\begin{description}
\item[Synchronous updates] -- all neurons compute their activations
\beq
a_i = \sum_j w_{ij} x_j
\label{eq.hopfield.activation.defn}
\eeq
then update their states simultaneously to
\beq
x_i = \Theta( a_i ) .
\eeq
\item[Asynchronous updates] -- one neuron at a time
computes its activation and updates its state.
% before the next neuron does its thing.
The sequence of selected neurons may be a fixed sequence
or a random sequence.
\end{description}
The properties of a Hopfield network may be
% are not necessarily in
sensitive to the above choices.
% \problema{
% Where have we seen an equation like
% }
\item[Learning rule\puncspace]
The learning rule is intended to make a set of
desired {\dbf memories\/} $\{ \bx^{(n)} \}$
be {\dbf stable states\/} of the
Hopfield network's activity rule. Each memory is a binary pattern, with $x_i \in \{ -1 , 1 \}$.
The weights are set using the {\dbf sum of outer products\/} or {\dbf Hebb rule},\index{Hebbian learning}\index{learning!Hebbian}
\beq
w_{ij} = \eta \sum_n x^{(n)}_i x^{(n)}_j ,
\label{eq.hebb.hopfield}
\eeq
where $\eta$ is an unimportant constant. To prevent the largest
possible weight from growing with the number of memories $N$ we might choose to
set
% out Some theoretical analyses have
$\eta = 1/N$.
\end{description}
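As a small worked illustration of the learning rule (the patterns are invented
for this example), take $I=4$ neurons, $\eta = 1$, no biases, and the $N=2$
memories $\bx^{(1)} = (1,1,-1,-1)$ and $\bx^{(2)} = (1,-1,1,-1)$. Summing the
two outer products and setting the self-connections $w_{ii}$ to zero gives
\beq
 \bW = \left[ \begin{array}{rrrr}
 0 & 0 & 0 & -2 \\
 0 & 0 & -2 & 0 \\
 0 & -2 & 0 & 0 \\
 -2 & 0 & 0 & 0
 \end{array} \right] .
\eeq
Presenting $\bx^{(1)}$, the activations (\ref{eq.hopfield.activation.defn}) are
$\ba = (2,2,-2,-2)$, so $\Theta(a_i) = x^{(1)}_i$ for every $i$; presenting
$\bx^{(2)}$ gives $\ba = (2,-2,2,-2)$, and again no neuron changes state. Both
memories are therefore stable states of the activity rule.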
\exercisxA{1}{ex.hopeta}{
Explain why the value of $\eta$ is not important for the Hopfield network
defined above.
}
\section{Definition of the continuous Hopfield network}
Using the identical architecture and learning rule we can define
a Hopfield network whose activities are real numbers between $-1$ and 1.
\begin{description}
\item[Activity rule\puncspace]
A Hopfield network's activity rule is for each neuron
to update its state as if it were a single neuron with
a sigmoid
% (tanh)
activation function.
% of equation (\ref{eq.single.tanh}).
%
The updates may be
synchronous or asynchronous, and involve the equations
\beq
a_i = \sum_j w_{ij} x_j
\label{eq.cont.hop.a}
\eeq
and
\beq
x_i = \tanh( a_i ) .
\label{eq.cont.hop.b}
\eeq
\end{description}
%% \item[Learning rule\puncspace]
The learning rule is the same as in the binary Hopfield network,
but the value of $\eta$ becomes relevant.
Alternatively, we may fix $\eta$ and introduce a {\dbf \ind{gain}\/}
$\beta \in (0,\infty)$ into the activation function:
\beq
x_i = \tanh( \beta a_i ) .
\label{eq.cont.hop.c}
\eeq
\exercisxA{1}{ex.remindising}{
Where have we encountered equations \ref{eq.cont.hop.a}, \ref{eq.cont.hop.b},
and \ref{eq.cont.hop.c} before?
}
\section{Convergence of the Hopfield network}
\begin{figure}
\figuredanglenudge{\small%
%\margincaption{%
%%%%%%%%%%%%%%%%%%
\begin{center}\small
\begin{tabular}{lll}
(a) \ \begin{tabular}{|l|} \hline
{\tt{moscow------russia}} \\ \hline
{\tt{lima----------peru}} \\ \hline
{\tt{london-----england}} \\ \hline
{\tt{tokyo--------japan}} \\ \hline
{\tt{edinburgh-scotland}} \\ \hline
{\tt{ottawa------canada}} \\ \hline
{\tt{oslo--------norway}} \\ \hline
{\tt{stockholm---sweden}} \\ \hline
{\tt{paris-------france}} \\ \hline
\end{tabular}% end of list (a)
& \ \ &
\begin{tabular}{l}
%\rule{0pt}{15mm}(b)% was this when all in one vertical table
\rule{0pt}{5mm}(b)%
\begin{tabular}{lll}
{\tt{moscow---:::::::::}}
&
$\Longrightarrow$
&
{\tt{moscow------russia}} \\ % \hline
{\tt{::::::::::--canada}}
&
$\Longrightarrow$
&
{\tt{ottawa------canada}} \\
\end{tabular}% end of b
\\
\rule{0pt}{14mm}(c)\begin{tabular}{lll}
{\tt{otowa-------canada}}
&
$\Longrightarrow$
&
{\tt{ottawa------canada}} \\
{\tt{egindurrh-sxotland}}
&
$\Longrightarrow$
&
{\tt{edinburgh-scotland}} \\
\end{tabular}% end of b
\end{tabular}% end of bc column
\end{tabular}% end of left left
\end{center}
}{\caption[a]{Associative memory (schematic).
(a) A list of desired memories.
(b) The first purpose of an associative memory is
pattern completion, given a partial pattern.
(c) The second purpose of a memory is error correction.
% We could view these two functions as
}
\label{fig.assoc.mem}
}{3.5mm}
%
\end{figure}
The hope is that the Hopfield networks we have defined
will perform associative memory recall, as shown
schematically in \figref{fig.assoc.mem}.\index{associative memory} We hope that the
activity rule of a Hopfield network will take a partial
memory or a corrupted memory, and perform pattern completion
or error correction to restore the original memory.
But why should we expect {\em any\/} pattern to be stable under the
activity rule, let alone the desired memories?
% Why should we expect the activity rule to have stable states at all?
% to be useful for memory
We address the continuous Hopfield network, since the binary network
is a special case of it.
We have already encountered the activity rule
(\ref{eq.cont.hop.a}, \ref{eq.cont.hop.c}) when we discussed
variational methods (\secref{sec.vfeising}):\index{variational methods}
when we approximated the spin system
% \beq
% P( \bx| \beta, \bJ) = \frac{1}{Z(\beta,\bJ)}
% \exp \left[ - \beta E( \bx ; \bJ ) \right] ,
% \label{eq.ising.p.again2}
% \eeq
whose energy function was
\beq
E(\bx;\bJ) = -
\frac{1}{2}
\sum_{m,n} J_{mn} x_m x_n - \sum_n h_n x_n
\label{eq.ising.e.again2}
\eeq
with a separable distribution
% $Q(\bx;\ba)$
\beq
Q(\bx; \ba) = \frac{1}{Z_Q} \exp \left({ \sum_n a_n x_n }\right)
\eeq
and optimized the latter so as to minimize
the variational free energy
% objective function
\beq
\beta \tF(\ba) = \beta \sum_{\bx} \: Q(\bx;\ba)
E( \bx ; \bJ )
- \sum_{\bx} \: Q(\bx;\ba) \ln\frac{1}
{ Q(\bx;\ba) } ,
\eeq
we found that the pair of iterative equations
\beq
a_m = \b\left( \sum_{n} J_{mn} \bar{x}_n + h_m \right)
\label{eq.mfta2}
\eeq
and
\beq
\bar{x}_n = \tanh( a_n )
\label{eq.mftb2}
\eeq
were guaranteed to decrease the variational free energy
\beq
\b \tF(\ba)
% = \b \left< E( \bx ; \bJ ) \right>_Q - S_Q
= \b \left(- \frac{1}{2}
\sum_{m,n} J_{mn} \bar{x}_m \bar{x}_n - \sum_n h_n \bar{x}_n \right) - \sum_n H_2^{(e)}(q_n) .
\eeq
% This objective function is lower-bounded by the
% true free energy $\b F$.
If we simply replace $J$ by $w$, $\bar{x}$ by $x$, and $h_n$ by $w_{i0}$,
we see that the equations of the Hopfield network are
identical to a set of mean-field equations that minimize
\beq
\b \tF(\bx) = - \b \frac{1}{2} \bx^{\T} \bW \bx
- \sum_i H_2^{(e)}[(1+x_i)/2] .
% x = 2 q - 1 ; q = (1+x)/2
\label{eq.hn.vfe}
\eeq
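As a check on this identification (a sketch, using the symmetry of $\bW$ and
$w_{ii}=0$, and taking
$H_2^{(e)}(q) = q \ln \frac{1}{q} + (1-q) \ln \frac{1}{1-q}$ to be the binary
entropy measured in nats), we can differentiate (\ref{eq.hn.vfe}) with respect
to a single activity $x_i$:
\beq
 \frac{\partial}{\partial x_i} \, \b \tF(\bx)
 = - \b \sum_j w_{ij} x_j + \frac{1}{2} \ln \frac{1+x_i}{1-x_i}
 = - \b a_i + \tanh^{-1} ( x_i ) .
\eeq
Setting this derivative to zero gives $x_i = \tanh( \b a_i )$, which is the
activity rule (\ref{eq.cont.hop.c}).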
There is a general name for a function that decreases
under the dynamical evolution of a system and that is bounded
below: such a function is a {\dem\inds{Lyapunov function}\/} for the system.
It is useful to be able to prove the existence of Lyapunov
functions:
if a system has a Lyapunov function then its dynamics are bound
to settle down to a {\dem\ind{fixed point}}, which is a local minimum of the
Lyapunov function, or a {\dem\ind{limit cycle}},
along which the Lyapunov function
is a constant. Chaotic behaviour is not possible for a system
with a Lyapunov function.
% If additionally the Lyapunov function
If a system has a {Lyapunov function} then its state space can be divided
into {\dem basins of attraction}, one basin associated with each
attractor.
So, the continuous Hopfield network's activity rules (if implemented
asynchronously) have a Lyapunov function. This Lyapunov
function is a convex function of each parameter $a_i$ so a Hopfield
network's dynamics will always converge to a stable fixed point.
This convergence proof depends crucially on the fact that the
Hopfield network's connections are {\em symmetric}.
%If they were not,
% we would not be able to make the connection to free energy minimization.
It also depends on the updates being made asynchronously.
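For the binary network the corresponding argument can be made directly. As a
sketch, assume the energy function $E(\bx) = -\frac{1}{2} \bx^{\T} \bW \bx$
(the first term of (\ref{eq.hn.vfe}) without the factor $\b$). When neuron $i$
alone is updated from $x_i$ to $x_i' = \Theta(a_i)$, the conditions
$w_{ij}=w_{ji}$ and $w_{ii}=0$ give
\beq
 E(\bx') - E(\bx) = - (x_i' - x_i) \sum_j w_{ij} x_j = - (x_i' - x_i) \, a_i \leq 0 ,
\eeq
because $x_i'$ takes the sign of $a_i$, so $(x_i' - x_i) \, a_i$ is never negative.
If the weights were not symmetric, the change in $E$ would involve
$\frac{1}{2} \sum_j (w_{ij} + w_{ji}) x_j$ rather than $a_i$, and the inequality
would no longer follow; if several neurons were updated at once, cross terms
between the changed neurons would appear.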
\exercissxA{2}{ex.hopasymm}{
Show by constructing an example that if a feedback network does not
have symmetric connections then its dynamics may fail to converge to
a fixed point.}
\exercissxA{2}{ex.hopasynch}{
Show by constructing an example that if a Hopfield network is updated
synchronously then, from
some initial conditions, it may fail to converge to a fixed point.
}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%5
\fakesection{hopfield-djcm}
\newcommand{\hdjcmfig}[1]{\setlength{\unitlength}{0.3in}%
\begin{picture}(1,1.2)%
\put(0.05,0.05){\makebox(0.9,0.9)[bl]{%
\psfig{figure=hopfield/figs/#1.ps,width=0.27in,height=0.33in}}}%
\put(0,0){\framebox(1,1.2)[bl]{}}\end{picture}}
%\newcommand{\hdjcmfig}[1]{\framebox{%
%\psfig{figure=hopfield/figs/#1.ps,width=0.3in}}}
\begin{figure}
\figuremarginb{%
\small
\begin{center}
% two columns
% col 1: DJCM. weights.
% col 2: all the easy decodes
\begin{tabular}{@{}cc}%%%%%%%%%%%%%%%%% top half
\begin{tabular}{@{}c}%%%%%%%%%%%%%%%%% left col
(a)
\begin{tabular}[t]{*{4}{c}}
\hdjcmfig{d1}
&
\hdjcmfig{d2}
&
\hdjcmfig{d3}
&
\hdjcmfig{d4}
\\
\end{tabular}
\\[0.1in]
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% choose a size:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
{\footnotesize % 1.1mm is good space
\begin{tabular}{*{25}{r@{\hspace{1.1mm}}}}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%{\tiny% .5mm is good
%\begin{tabular}{*{25}{r@{\hspace{0.5mm}}}}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
.& 0& 0& 0& 0&-2& 2&-2& 2& 2&-2& 0& 0& 0& 2& 0& 0&-2& 0& 2& 2& 0& 0&-2&-2\\
0& .& 4& 4& 0&-2&-2&-2&-2&-2&-2& 0&-4& 0&-2& 0& 0&-2& 0&-2&-2& 4& 4& 2&-2\\
0& 4& .& 4& 0&-2&-2&-2&-2&-2&-2& 0&-4& 0&-2& 0& 0&-2& 0&-2&-2& 4& 4& 2&-2\\
0& 4& 4& .& 0&-2&-2&-2&-2&-2&-2& 0&-4& 0&-2& 0& 0&-2& 0&-2&-2& 4& 4& 2&-2\\
0& 0& 0& 0& .& 2&-2&-2& 2&-2& 2&-4& 0& 0&-2& 4&-4&-2& 0&-2& 2& 0& 0&-2& 2\\
-2&-2&-2&-2& 2& .& 0& 0& 0& 0& 4&-2& 2&-2& 0& 2&-2& 0&-2& 0& 0&-2&-2& 0& 4\\
2&-2&-2&-2&-2& 0& .& 0& 0& 4& 0& 2& 2&-2& 4&-2& 2& 0&-2& 4& 0&-2&-2& 0& 0\\
-2&-2&-2&-2&-2& 0& 0& .& 0& 0& 0& 2& 2& 2& 0&-2& 2& 4& 2& 0& 0&-2&-2& 0& 0\\
2&-2&-2&-2& 2& 0& 0& 0& .& 0& 0&-2& 2& 2& 0& 2&-2& 0& 2& 0& 4&-2&-2&-4& 0\\
2&-2&-2&-2&-2& 0& 4& 0& 0& .& 0& 2& 2&-2& 4&-2& 2& 0&-2& 4& 0&-2&-2& 0& 0\\
-2&-2&-2&-2& 2& 4& 0& 0& 0& 0& .&-2& 2&-2& 0& 2&-2& 0&-2& 0& 0&-2&-2& 0& 4\\
0& 0& 0& 0&-4&-2& 2& 2&-2& 2&-2& .& 0& 0& 2&-4& 4& 2& 0& 2&-2& 0& 0& 2&-2\\
0&-4&-4&-4& 0& 2& 2& 2& 2& 2& 2& 0& .& 0& 2& 0& 0& 2& 0& 2& 2&-4&-4&-2& 2\\
0& 0& 0& 0& 0&-2&-2& 2& 2&-2&-2& 0& 0& .&-2& 0& 0& 2& 4&-2& 2& 0& 0&-2&-2\\
2&-2&-2&-2&-2& 0& 4& 0& 0& 4& 0& 2& 2&-2& .&-2& 2& 0&-2& 4& 0&-2&-2& 0& 0\\
0& 0& 0& 0& 4& 2&-2&-2& 2&-2& 2&-4& 0& 0&-2& .&-4&-2& 0&-2& 2& 0& 0&-2& 2\\
0& 0& 0& 0&-4&-2& 2& 2&-2& 2&-2& 4& 0& 0& 2&-4& .& 2& 0& 2&-2& 0& 0& 2&-2\\
-2&-2&-2&-2&-2& 0& 0& 4& 0& 0& 0& 2& 2& 2& 0&-2& 2& .& 2& 0& 0&-2&-2& 0& 0\\
0& 0& 0& 0& 0&-2&-2& 2& 2&-2&-2& 0& 0& 4&-2& 0& 0& 2& .&-2& 2& 0& 0&-2&-2\\
2&-2&-2&-2&-2& 0& 4& 0& 0& 4& 0& 2& 2&-2& 4&-2& 2& 0&-2& .& 0&-2&-2& 0& 0\\
2&-2&-2&-2& 2& 0& 0& 0& 4& 0& 0&-2& 2& 2& 0& 2&-2& 0& 2& 0& .&-2&-2&-4& 0\\
0& 4& 4& 4& 0&-2&-2&-2&-2&-2&-2& 0&-4& 0&-2& 0& 0&-2& 0&-2&-2& .& 4& 2&-2\\
0& 4& 4& 4& 0&-2&-2&-2&-2&-2&-2& 0&-4& 0&-2& 0& 0&-2& 0&-2&-2& 4& .& 2&-2\\
-2& 2& 2& 2&-2& 0& 0& 0&-4& 0& 0& 2&-2&-2& 0&-2& 2& 0&-2& 0&-4& 2& 2& .& 0\\
-2&-2&-2&-2& 2& 4& 0& 0& 0& 0& 4&-2& 2&-2& 0& 2&-2& 0&-2& 0& 0&-2&-2& 0& .\\
\end{tabular}
}
\end{tabular}%%%%%%%%%%%%%%%%%%%% end left col
%%%%%%%%%%%%%%%%%%%%%%%
\begin{tabular}{*{1}{l}}%%%%%%%% right col
(b)
\begin{tabular}[t]{*{3}{@{}c@{}}}
\hdjcmfig{d5}
&
$\:\rightarrow\:$
&
\hdjcmfig{d1}
\\
\end{tabular}
\\[0.2in]% & %%%%%%%%%%%%%%%%%%%%%%%%
(c)
\begin{tabular}[t]{*{3}{@{}c@{}}}
\hdjcmfig{d7}
&
$\:\rightarrow\:$
&
\hdjcmfig{d2}
\\
\end{tabular}
\\[0.2in]% & %%%%%%%%%%%%%%%%%%%%%%%%
%\multicolumn{2}{l}{
{(d)~\begin{tabular}[t]{*{5}{@{}c@{}}}
\hdjcmfig{d8}
&
$\:\rightarrow\:$
&
\hdjcmfig{d9}
&
$\:\rightarrow\:$
&
\hdjcmfig{d2}
\\
\end{tabular}
}
\\[0.2in]
%%%%%%%%%%%%%%%%%%%%%%%%
(e)
\begin{tabular}[t]{*{3}{@{}c@{}}}
\hdjcmfig{d10}
&
$\:\rightarrow\:$
&
\hdjcmfig{d3}
\\
\end{tabular}
\\[0.2in]% & %%%%%%%%%%%%%%%%%%%%%%%%
(f)
\begin{tabular}[t]{*{3}{@{}c@{}}}
\hdjcmfig{d11}
&
$\:\rightarrow\:$
&
\hdjcmfig{d4}
\\
\end{tabular}
\\[0.2in]% & %%%%%%%%%%%%%%%%%%%%%%%%
(g)
\begin{tabular}[t]{*{3}{@{}c@{}}}
\hdjcmfig{d13}
&
$\:\rightarrow\:$
&
\hdjcmfig{d1}
\\
\end{tabular}
\\[0.2in] % & %%%%%%%%%%%%%%%%%%%%%%%%
(h)
\begin{tabular}[t]{*{3}{@{}c@{}}}
\hdjcmfig{d12}
&
$\:\rightarrow\:$
&
\hdjcmfig{d4}
\\
\end{tabular}
\end{tabular}%%%%%%%%%%%%% end right col
\end{tabular}%%%%%%%%%%%%% end top half
\\[0.2in]
%%%%%%%%%%%%%%%%%%%%%%%%
\begin{tabular}{*{3}{r@{$\,$}l}}%%%%%%%%%%%% bot half
(i)&
\begin{tabular}[t]{*{3}{@{}c@{}}}
\hdjcmfig{d14}
&
$\:\rightarrow\:$
&
\hdjcmfig{d15}
\\
\end{tabular}
& %%%%%%%%%%%%%%%%%%%%%%%%
(j)&
\begin{tabular}[t]{*{3}{@{}c@{}}}
\hdjcmfig{d16}
&
$\:\rightarrow\:$
&
\hdjcmfig{d17}
\\
\end{tabular}
& %%%%%%%%%%%%%%%%%%%%%%%%
%\multicolumn{2}{l}{
(k)&
\begin{tabular}[t]{*{5}{@{}c@{}}}
\hdjcmfig{d18}
&
$\:\rightarrow\:$
&
\hdjcmfig{d19}
$\:\rightarrow\:$
&
\hdjcmfig{d20}
\\
\end{tabular}
\\[0.2in]
%%%%%%%%%%%%%%%%%%%%%%%%
%\multicolumn{2}{l}{
(l)&
\begin{tabular}[t]{*{5}{@{}c@{}}}
\hdjcmfig{d24}
&
$\:\rightarrow\:$
&
\hdjcmfig{d25}
$\:\rightarrow\:$
&
\hdjcmfig{d26}
\\
\end{tabular}
& %%%%%%%%%%%%%%%%%%%%%%%%
%\multicolumn{2}{l}{
(m)&
\begin{tabular}[t]{*{5}{@{}c@{}}}
\hdjcmfig{d21}
&
$\:\rightarrow\:$
&
\hdjcmfig{d22}
$\:\rightarrow\:$
&
\hdjcmfig{d23}
\\
\end{tabular}
\\ % \\[0.2in]
\end{tabular}
%%%%%%%%%%%%%%%%%%%%%%%%
\end{center}
}{%
\caption[a]{Binary Hopfield network storing four memories.
(a) The four memories, and the weight matrix.
(b--h) Initial states that differ by one, two, three,
four, or even five bits
from a desired memory are restored to that memory
in one or two iterations.
%(e,f,g,h) Even states that are 4 or 5 bits from a memory
% are restored to that memory in one iteration.
(i--m) Some initial conditions that are
far from the memories lead to stable states other than the four
memories; in (i),
the stable state looks like a mixture of
two memories, `D' and `J'; stable state (j)
is like a mixture of `J' and `C'; in (k),
we find a corrupted version of the `M' memory (two bits
distant); in (l) a corrupted version of `J' (four
bits distant) and in (m), a state which looks
spurious until we recognize that it is the inverse of
the stable state (l).
% The activation rule sets the activity to 1 if the
% activation is $\geq 0$.
}
\label{fig.hopfield.memory.djcm}
}%
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{The associative memory in action}
\Figref{fig.hopfield.memory.djcm} shows the dynamics
of a 25-unit binary Hopfield network that has learnt four patterns
by Hebbian learning. The four patterns
% , `D', `J', `C', and `M',
are displayed as five by five
binary images in \figref{fig.hopfield.memory.djcm}a.
%
For twelve initial conditions, panels (b--m)
show the state of the network, iteration by
iteration, all 25 units being updated
asynchronously in each iteration. For an initial condition randomly
perturbed from a memory, it often only takes one iteration
for all the errors to be corrected.
%
 The network
 has further stable states besides the four desired memories:
the inverse of any stable state is also a stable state;
% , so there are four stable states corresponding to the inverses of the four memories.
and there are several stable states
that can be interpreted as mixtures of the memories.
% As shown schematically in \figref{fig.hopfield.failures}, there
% could also be spurious stable states unrelated to the
% memories.
\subsection{Brain damage}
The network can be severely damaged and still work fine
as an associative memory.
If we take the 300 weights of the network shown in
\figref{fig.hopfield.memory.djcm} and randomly set
50 or 100 of them to zero, we still find that
the desired memories are attracting stable states.
Imagine a digital computer that still works fine
even when 20\% of its components are destroyed!
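 The figure of 300 weights can be checked by a short count, assuming
 (as elsewhere in this chapter) symmetric connections $w_{ij}=w_{ji}$
 and no self-connections, so that each unordered pair of the $I=25$ units
 contributes one weight:
\beq
 \frac{I(I-1)}{2} = \frac{25 \times 24}{2} = 300 ,
 \:\:\:\:
 \frac{50}{300} \simeq 17\% ,
 \:\:\:\:
 \frac{100}{300} \simeq 33\% .
\eeq
 So deleting 50 or 100 weights removes roughly one sixth or one third of
 the connections.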
% , though the basin of attraction (the number of flipped bits that can be corrected) is reduced.
\exercisxB{2}{ex.hopdo}{
Implement a Hopfield network and confirm this amazing robust
error-correcting capability.
}
\subsection{More memories}
We can squash more memories into the network too.
\Figref{fig.hopfield.djcmx}a shows a set of five memories.
When we train the network with Hebbian learning,
all five memories are stable states, even when
26 of the weights are randomly deleted (as shown by the `x's
in the weight matrix).
However, the basins of attraction are smaller than before:
% .when we have more memories:
figures \ref{fig.hopfield.djcmx}(b--f) show
the dynamics resulting from randomly chosen starting
states close to each of the memories (3 bits flipped).
Only three of the memories are recovered correctly.
If we try to store too many patterns, the associative
memory fails catastrophically. When we add a sixth pattern,
as shown in \figref{fig.hopfield.djcmxs}, only one of the
patterns is stable; the others all flow into
one of two spurious stable states.
% lie in one or other of two basins of attraction.
\fakesection{hopfield-djcmb}
\begin{figure}
\figuremargin{%
\small
\begin{center}
% two columns
% col 1: DJCMX. weights.
% col 2: all the easy decodes
\begin{tabular}{@{}cc}%%%%%%%%%%%%%%%%% top half
\begin{tabular}{@{}c}%%%%%%%%%%%%%%%%% left col
(a)
\begin{tabular}[t]{*{5}{c}}
\hdjcmfig{d1}
&
\hdjcmfig{d2}
&
\hdjcmfig{d3}
&
\hdjcmfig{d4}
&
\hdjcmfig{f1}
\\
\end{tabular}
\\[0.1in]
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% choose a size:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%{\footnotesize % 1.1mm is good space
%\begin{tabular}{*{25}{r@{\hspace{1.1mm}}}}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
{\tiny% .5mm is good
\begin{tabular}{*{25}{r@{\hspace{0.5mm}}}}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
.&-1& 1&-1& 1& x& x&-3& 3& x& x&-1& 1&-1& x&-1& 1&-3& x& 1& 3&-1& 1& x&-1\\
-1& .& 3& 5&-1&-1&-3&-1&-3&-1&-3& 1& x& 1&-3& 1&-1&-1&-1&-1&-3& 5& 3& 3&-3\\
1& 3& .& 3& 1&-3&-1& x&-1&-3&-1&-1& x&-1&-1&-1& 1&-3& 1&-3&-1& 3& 5& 1&-1\\
-1& 5& 3& .&-1&-1&-3&-1&-3&-1&-3& 1&-5& 1&-3& 1&-1&-1&-1&-1&-3& 5& x& 3&-3\\
1&-1& 1&-1& .& 1&-1&-3& x& x& 3&-5& 1&-1&-1& 3& x&-3& 1&-3& 3&-1& 1&-3& 3\\
x&-1&-3&-1& 1& .&-1& 1&-1& 1& 3&-1& 1&-1&-1& 3&-3& 1& x& 1& x&-1&-3& 1& 3\\
x&-3&-1&-3&-1&-1& .&-1& 1& 3& 1& 1& 3&-3& 5&-3& 3&-1&-1& x& 1&-3&-1&-1& 1\\
-3&-1& x&-1&-3& 1&-1& .&-1& 1&-1& 3& 1& x&-1&-1& 1& 5& 1& 1&-1& x&-3& 1&-1\\
3&-3&-1&-3& x&-1& 1&-1& .&-1& 1&-3& 3& 1& 1& 1&-1&-1& 3&-1& 5&-3&-1& x& 1\\
x&-1&-3&-1& x& 1& 3& 1&-1& .&-1& 3& 1&-1& 3&-1& x& 1&-3& 5&-1&-1&-3& 1&-1\\
x&-3&-1&-3& 3& 3& 1&-1& 1&-1& .&-3& 3&-3& 1& 1&-1&-1&-1&-1& 1&-3&-1&-1& 5\\
-1& 1&-1& 1&-5&-1& 1& 3&-3& 3&-3& .&-1& 1& 1&-3& 3& x&-1& 3&-3& 1&-1& 3&-3\\
1& x& x&-5& 1& 1& 3& 1& 3& 1& 3&-1& .&-1& 3&-1& 1& 1& 1& 1& 3&-5&-3&-3& 3\\
-1& 1&-1& 1&-1&-1&-3& x& 1&-1&-3& 1&-1& .& x& 1&-1& 3& 3&-1& 1& 1&-1&-1&-3\\
x&-3&-1&-3&-1&-1& 5&-1& 1& 3& 1& 1& 3& x& .& x& 3&-1&-1& 3& 1&-3&-1&-1& 1\\
-1& 1&-1& 1& 3& 3&-3&-1& 1&-1& 1&-3&-1& 1& x& .&-5&-1&-1&-1& 1& 1&-1&-1& 1\\
1&-1& 1&-1& x&-3& 3& 1&-1& x&-1& 3& 1&-1& 3&-5& .& 1& 1& 1&-1&-1& 1& 1&-1\\
-3&-1&-3&-1&-3& 1&-1& 5&-1& 1&-1& x& 1& 3&-1&-1& 1& .& 1& 1&-1&-1&-3& 1&-1\\
x&-1& 1&-1& 1& x&-1& 1& 3&-3&-1&-1& 1& 3&-1&-1& 1& 1& .&-3& 3&-1& 1&-3&-1\\
1&-1&-3&-1&-3& 1& x& 1&-1& 5&-1& 3& 1&-1& 3&-1& 1& 1&-3& .& x&-1&-3& 1&-1\\
3&-3&-1&-3& 3& x& 1&-1& 5&-1& 1&-3& 3& 1& 1& 1&-1&-1& 3& x& .&-3&-1&-5& 1\\
-1& 5& 3& 5&-1&-1&-3& x&-3&-1&-3& 1&-5& 1&-3& 1&-1&-1&-1&-1&-3& .& 3& x&-3\\
1& 3& 5& x& 1&-3&-1&-3&-1&-3&-1&-1&-3&-1&-1&-1& 1&-3& 1&-3&-1& 3& .& 1&-1\\
x& 3& 1& 3&-3& 1&-1& 1& x& 1&-1& 3&-3&-1&-1&-1& 1& 1&-3& 1&-5& x& 1& .&-1\\
-1&-3&-1&-3& 3& 3& 1&-1& 1&-1& 5&-3& 3&-3& 1& 1&-1&-1&-1&-1& 1&-3&-1&-1& .\\
\end{tabular}
}
\end{tabular}%%%%%%%%%%%%%%%%%%%% end left col
%%%%%%%%%%%%%%%%%%%%%%%
\begin{tabular}{*{1}{l}}%%%%%%%% right col
(b)
\begin{tabular}[t]{*{5}{@{}c@{}}}
\hdjcmfig{f2}
&
$\:\rightarrow\:$
&
\hdjcmfig{f3}
&
$\:\rightarrow\:$
&
\hdjcmfig{f4}
\\
\end{tabular}
\\[0.2in]% & %%%%%%%%%%%%%%%%%%%%%%%%
(c)
\begin{tabular}[t]{*{5}{@{}c@{}}}
\hdjcmfig{f5}
&
$\:\rightarrow\:$
&
\hdjcmfig{f6}
&
$\:\rightarrow\:$
&
\hdjcmfig{f7}
\\
\end{tabular}
\\[0.2in]% & %%%%%%%%%%%%%%%%%%%%%%%%
(d)
\begin{tabular}[t]{*{3}{@{}c@{}}}
\hdjcmfig{f8}
&
$\:\rightarrow\:$
&
\hdjcmfig{d3}
\\
\end{tabular}
\\[0.2in]% & %%%%%%%%%%%%%%%%%%%%%%%%
%\multicolumn{2}{l}{
{(e)~\begin{tabular}[t]{*{5}{@{}c@{}}}
\hdjcmfig{f9}
&
$\:\rightarrow\:$
&
\hdjcmfig{d4}
\\
\end{tabular}
}
\\[0.2in]
%%%%%%%%%%%%%%%%%%%%%%%%
{(f)~\begin{tabular}[t]{*{5}{@{}c@{}}}
\hdjcmfig{f10}
&
$\:\rightarrow\:$
&
\hdjcmfig{f11}
&
$\:\rightarrow\:$
&
\hdjcmfig{f1}
\\
\end{tabular}
}
\\[0.2in]
%%%%%%%%%%%%%%%%%%%%%%%%
\end{tabular}%%%%%%%%%%%%% end right col
\end{tabular}%%%%%%%%%%%%% end top half
\\[0.2in]
%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%
\end{center}
}{%
\caption[a]{Hopfield network storing five memories, and
suffering deletion of 26 of its 300 weights.
(a) The five memories, and the weights of the network,
with deleted weights shown by `x'.
(b--f) Initial states that differ by three random bits
from a memory: some are restored, but some
converge to other states.
}
\label{fig.hopfield.djcmx}
}%
\end{figure}
\begin{figure}
\figuremargin{%
\small
\begin{center}
% two columns
% col 1: DJCMX. weights.
% col 2: all the easy decodes
\begin{tabular}{@{}l}%%%%%%%%%%%%%%%%% whole thing
\begin{tabular}{@{}c}%%%%%%%%%%%%%%%%% left col
% (a)
\raisebox{0.23in}{Desired memories:}
\begin{tabular}[b]{*{6}{c}}
\hdjcmfig{d1}
&
\hdjcmfig{d2}
&
\hdjcmfig{d3}
&
\hdjcmfig{d4}
&
\hdjcmfig{f1}
&
\hdjcmfig{g1}
\\
\end{tabular}
\\[0.1in]
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% choose a size:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%{\footnotesize % 1.1mm is good space
%\begin{tabular}{*{25}{r@{\hspace{1.1mm}}}}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\end{tabular}%%%%%%%%%%%%%%%%%%%% end left col
%%%%%%%%%%%%%%%%%%%%%%%
\\[0.2in]\hline
~\\
%(b)
\begin{tabular}[t]{l@{\hspace{0.6in}}l@{\hspace{0.6in}}l}%%%%%%%% 2nd row
% (b)
\begin{tabular}[t]{*{5}{@{}c@{}}}
\hdjcmfig{d1}
&
$\:\rightarrow\:$
&
\hdjcmfig{g2}
&
$\:\rightarrow\:$
&
\hdjcmfig{g3}
\\
\end{tabular}
& %%%%%%%%%%%%%%%%%%%%%%%%
% (c)
\begin{tabular}[t]{*{5}{@{}c@{}}}
\hdjcmfig{d2}
&
$\:\rightarrow\:$
&
\hdjcmfig{d2}
\\
\end{tabular}
& %%%%%%%%%%%%%%%%%%%%%%%%
% (d)
\begin{tabular}[t]{*{5}{@{}c@{}}}
\hdjcmfig{d3}
&
$\:\rightarrow\:$
&
\hdjcmfig{g4}
&
$\:\rightarrow\:$
&
\hdjcmfig{g3}
\\
\end{tabular}
\\[0.2in]% & %%%%%%%%%%%%%%%%%%%%%%%%
%\multicolumn{2}{l}{
%(e)~
\begin{tabular}[t]{*{5}{@{}c@{}}}
\hdjcmfig{d4}
&
$\:\rightarrow\:$
&
\hdjcmfig{g5}
&
$\:\rightarrow\:$
&
\hdjcmfig{g6}
\\
\end{tabular}
&
% \\[0.2in]
%%%%%%%%%%%%%%%%%%%%%%%%
% (f)
\begin{tabular}[t]{*{5}{@{}c@{}}}
\hdjcmfig{f1}
&
$\:\rightarrow\:$
&
\hdjcmfig{g7}
&
$\:\rightarrow\:$
&
\hdjcmfig{g3}
\\
\end{tabular}
&
% \\[0.2in]
%%%%%%%%%%%%%%%%%%%%%%%%
%(g)~
\begin{tabular}[t]{*{5}{@{}c@{}}}
\hdjcmfig{g1}
&
$\:\rightarrow\:$
&
\hdjcmfig{g3}
\\
\end{tabular}
\\[0.2in]
%%%%%%%%%%%%%%%%%%%%%%%%
\end{tabular}%%%%%%%%%%%%% end right col
\end{tabular}%%%%%%%%%%%%% end whole thing
%%%%%%%%%%%%%%%%%%%%%%%%
\end{center}
}{%
\caption[a]{An overloaded Hopfield network trained on six memories,
% fails to recall them.
most of which are not stable.
}
\label{fig.hopfield.djcmxs}
}%
\end{figure}
\section{The continuous-time continuous Hopfield network}
The fact that the Hopfield network's properties are
not robust to the minor change
from asynchronous to synchronous updates might be a cause for concern;
can this model be a useful model of biological networks?
It turns out that once we move to a continuous-time version of
 the Hopfield network, this issue melts away.
% Here we will study one way of making a continuous-time model.
We assume that each neuron's activity $x_i$ is a continuous function
of time $x_i(t)$ and that the activations $a_i(t)$ are
computed instantaneously in accordance with
\beq
a_i(t) = \sum_{j} w_{ij} x_j(t) .
\label{eq.hn.continuous}
\eeq
The neuron's response to its activation is assumed to be mediated
by the differential equation:
\beq
\frac{\d}{\d t} x_i(t) = - \frac{1}{\tau} ( x_i(t) - f(a_i) ) ,
\eeq
where $f(a)$ is the activation function, for example $f(a) = \tanh(a)$.
For a steady activation $a_i$, the activity $x_i(t)$ relaxes
exponentially to $f(a_i)$ with time-constant $\tau$.
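 As a check on this statement, if $a_i$ is held fixed the differential
 equation is linear and can be integrated directly; writing $x_i(0)$
 for the activity at time zero (a symbol introduced here only for
 illustration), the solution is
\beq
 x_i(t) = f(a_i) + \left[ x_i(0) - f(a_i) \right] e^{-t/\tau} ,
\eeq
 so the distance from the target value $f(a_i)$ shrinks by a factor
 of $e$ every $\tau$ units of time.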
Now, here is the nice result:
as long as the weight matrix is symmetric, this system
has the variational free energy (\ref{eq.hn.vfe}) as its Lyapunov function.
\exercisxB{1}{ex.hn.lyapunov.contin}{
By computing $\frac{\d}{\d t} \tilde{F}$,
prove that the variational free energy $\tilde{F}(\bx)$
is a Lyapunov function for the continuous-time
Hopfield network.
}
 It is particularly easy to prove that a function $L$ is
 a Lyapunov function if the system's dynamics perform steepest descent on $L$,
 with $\frac{\d}{\d t} x_i(t) \propto
 - \frac{\partial}{\partial x_i} L$.
 In the case of the continuous-time continuous Hopfield
 network, it is not quite so simple, but
 every component of $\frac{\d}{\d t} x_i(t)$ does have the
 same sign as $- \frac{\partial}{\partial x_i} \tilde{F}$, which
 means that with an appropriately defined \ind{metric},
 the Hopfield network dynamics do perform steepest descent on
 $\tilde{F}(\bx)$.
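 The general principle behind such arguments can be sketched with the
 chain rule. For any differentiable function $L(\bx)$ evaluated along
 a trajectory $\bx(t)$,
\beq
 \frac{\d}{\d t} L(\bx(t)) = \sum_i \frac{\partial L}{\partial x_i}
 \, \frac{\d x_i}{\d t} ,
\eeq
 so $L$ cannot increase if every product
 $\frac{\partial L}{\partial x_i} \frac{\d x_i}{\d t}$ is non-positive,
 that is, if each component of the motion points downhill along its own
 coordinate. Exact proportionality to the gradient is sufficient for this
 but not necessary.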
\begin{figure}\small
%\figuremargin{%
\margincaption{%
\caption[a]{Failure modes of a Hopfield network (highly
schematic).
A list of desired memories, and the resulting list of attracting
stable states. Notice (1) some memories that are retained
with a small number of errors; (2) desired memories that are completely
lost (there is no attracting stable state at the desired memory or near it);
(3) spurious stable states unrelated to the original list;
(4) spurious stable states that are confabulations of desired memories.
}
\label{fig.hopfield.failures}
}{%%%%%%%%%%%%%%%%%%
\begin{center}
\begin{tabular}[t]{lcl}
\begin{tabular}[t]{|l|}
\multicolumn{1}{c}{Desired memories} \\ \hline
{\tt{moscow------russia}} \\ \hline
{\tt{lima----------peru}} \\ \hline
{\tt{london-----england}} \\ \hline
{\tt{tokyo--------japan}} \\ \hline
{\tt{edinburgh-scotland}} \\ \hline
{\tt{ottawa------canada}} \\ \hline
{\tt{oslo--------norway}} \\ \hline
{\tt{stockholm---sweden}} \\ \hline
{\tt{paris-------france}} \\ \hline
\end{tabular}
&
\raisebox{-1in}{$\rightarrow \: \bW \: \rightarrow$}
&
\begin{tabular}[t]{|c|l}
\multicolumn{1}{c}{Attracting stable states} & \\ \cline{1-1}
{\tt{moscow------russia}}& \\ \cline{1-1}
{\tt{lima----------peru}}& \\ \cline{1-1}
{\tt{londog-----englard}}& (1) \\ \cline{1-1}
{\tt{tonco--------japan}}&(1) \\ \cline{1-1}
{\tt{edinburgh-scotland}}& \\ \cline{1-1}
\multicolumn{1}{c}{}
% \ \ completely blank line
& (2) \\ \cline{1-1}
{\tt{oslo--------norway}}& \\ \cline{1-1}
{\tt{stockholm---sweden}}& \\ \cline{1-1}
{\tt{paris-------france}}& \\ \cline{1-1}
{\tt{wzkmhewn--xqwqwpoq}}&(3) \\ \cline{1-1}
{\tt{paris-------sweden}}&(4) \\ \cline{1-1}
{\tt{ecnarf-------sirap}}&(4) \\ \cline{1-1}
\end{tabular}% end of list reconstructed
\end{tabular}% end of 5 col list
\end{center}
}%
\end{figure}
\section{The capacity of the Hopfield network}
One way in which we viewed
% looked at
learning in the single neuron was
as communication -- communication of the
labels of the training data set from one point in time to
a later point in time.\index{communication!perspective on learning}
%
% now, did he call that a 2 or a 3?
%
We found that the capacity of a linear threshold neuron was
2 bits per weight.
Similarly, we might view the Hopfield associative memory as
a communication channel (\figref{fig.hopfield.failures}).
A list of desired memories is encoded into a set of weights $\bW$ using
the Hebb rule of \eqref{eq.hebb.hopfield}, or perhaps
some other learning rule. The receiver, receiving the weights $\bW$
only, finds the stable
states of the Hopfield network, which he interprets as
the original memories. This communication
system can fail in various ways, as
illustrated in the figure.
% \figref{fig.hopfield.failures}.
\ben
\item Individual bits in some memories might be corrupted, that is, a
% This corresponds
stable state of the Hopfield network is displaced
a little from the desired memory.
\item Entire memories might be absent from the list of attractors
of the network; or a stable state might be present
% exist as a fixed point
but have such a small basin of attraction that it is of no use
for pattern completion and error correction.
\item Spurious additional memories unrelated to the desired memories
might be present.
\item Spurious additional memories derived from the desired memories
by operations such as mixing and inversion may also be present.
\een
Of these failure modes, modes 1 and 2 are clearly undesirable, mode 2
especially so. Mode 3 might not matter so much as long as each of
the desired memories has a large basin of attraction. The fourth
failure mode might in some contexts actually be viewed as beneficial. For
example, if a network is required to memorize examples of valid
sentences such as `John loves Mary' and `John gets cake', we might be
happy to find that `John loves cake' was also a stable state of the
network. We might call this behaviour `generalization'.
The capacity of a Hopfield network with $I$ neurons might be defined to be
the number of random patterns $N$ that can be stored without failure-mode
2 having substantial probability. If we also
require failure-mode 1 to have tiny probability then the resulting capacity
is much smaller.
We now study these alternative definitions of
the capacity.
% of the Hopfield network in the next section.
% chapter.
% (Check this.)]
%
% The memory storage capabilities
% of the binary Hopfield network have been investigated for large $I$
% using methods from statistical physics by Elizabeth
% Gardner and colleagues. It is found that as $N$ is increased
% the probability of failure suddenly increases at a critical value of
% $N/I$. This critical value depends on the learning algorithm used.
% If the Hebb rule is used, the critical value is $N/I=0.14$.
% If an ideal learning algorithm is used, the critical value is $N/I=1$.
%
% Just below
% the critical value of 0.14 for the Hebb rule, each desired memory
% will have an associated stable state differing in about 1.6\%
% of its bits. This constitutes a loss of $H_2(0.016) = 0.12$ bits
% per bit.
%
% The capacity of the binary Hopfield network is thus
% $0.14 I$ random binary patterns,
%% each of size $I$,
% which is $0.14 [1- H_2(0.016)] I^2$ bits, or 0.24 bits per weight,
% since there are roughly $I^2/2$ weights.
%% 0.88165 * 0.138 * 2 = 0.243335
% quote result of gardner? alpha = .14 or give mcelice snr result?
%
% \section{Other applications of Hopfield nets}
% Optimization.
%
% \section{The Boltzmann machine}
% Can view the continuous Hopfield network as an approximation to
% a related probabilistic model -- Boltzmann machine
%
% convert -monochrome -geometry 1000%x1000% figs/d20.pbm figs/d20.ps
% convert -monochrome -geometry 1000%x1000% figs/d21.pbm figs/d21.ps
% convert -monochrome -geometry 1000%x1000% figs/d22.pbm figs/d22.ps
% convert -monochrome -geometry 1000%x1000% figs/d23.pbm figs/d23.ps
% convert -monochrome -geometry 1000%x1000% figs/d24.pbm figs/d24.ps
% convert -monochrome -geometry 1000%x1000% figs/d25.pbm figs/d25.ps
% convert -monochrome -geometry 1000%x1000% figs/d26.pbm figs/d26.ps
% convert -monochrome -geometry 1000%x1000% figs/d1.pbm figs/d1.ps
% convert -monochrome -geometry 1000%x1000% figs/d2.pbm figs/d2.ps
% convert -monochrome -geometry 1000%x1000% figs/d3.pbm figs/d3.ps
% convert -monochrome -geometry 1000%x1000% figs/d4.pbm figs/d4.ps
% convert -monochrome -geometry 1000%x1000% figs/d5.pbm figs/d5.ps
% convert -monochrome -geometry 1000%x1000% figs/d6.pbm figs/d6.ps
% convert -monochrome -geometry 1000%x1000% figs/d7.pbm figs/d7.ps
% convert -monochrome -geometry 1000%x1000% figs/d8.pbm figs/d8.ps
% convert -monochrome -geometry 1000%x1000% figs/d9.pbm figs/d9.ps
% convert -monochrome -geometry 1000%x1000% figs/d10.pbm figs/d10.ps
% convert -monochrome -geometry 1000%x1000% figs/d11.pbm figs/d11.ps
% convert -monochrome -geometry 1000%x1000% figs/d12.pbm figs/d12.ps
% convert -monochrome -geometry 1000%x1000% figs/d13.pbm figs/d13.ps
% convert -monochrome -geometry 1000%x1000% figs/d14.pbm figs/d14.ps
% convert -monochrome -geometry 1000%x1000% figs/d15.pbm figs/d15.ps
% convert -monochrome -geometry 1000%x1000% figs/d16.pbm figs/d16.ps
% convert -monochrome -geometry 1000%x1000% figs/d17.pbm figs/d17.ps
% convert -monochrome -geometry 1000%x1000% figs/d18.pbm figs/d18.ps
% convert -monochrome -geometry 1000%x1000% figs/d19.pbm figs/d19.ps
%
% useless - I did it by hand with xv instead
\subsection{The capacity of the Hopfield network -- stringent definition}
We will first explore the information storage capabilities
of a binary Hopfield network that learns using the Hebb rule
by considering the stability of just {one bit} of {one}
of the desired
patterns, assuming that the state of the network is set to that desired
pattern $\bx^{(n)}$. We will assume that the patterns to be stored
are randomly selected binary patterns.
%, and we will evaluate the probability that the selected bit is stable.
The activation of a particular neuron $i$ is
\beq
a_i = \sum_j w_{ij} x_j^{(n)} ,
\eeq
where the weights are, for $i \neq j$,
\beq
w_{ij} = x_i^{(n)} x_j^{(n)} +
\sum_{
% m \in 1 ,\ldots, N \wo n
m \neq n} x_i^{(m)} x_j^{(m)} .
\eeq
Here we have split $\bW$ into two terms, the first of which will
contribute `signal', reinforcing the desired memory, and
the second
%is viewed as
`noise'.
Substituting for $w_{ij}$, the activation is
\beqan
a_i &=& \sum_{j \neq i}
% \in \{1 ,\ldots, I\} \wo i}
x_i^{(n)} x_j^{(n)} x_j^{(n)}
+ \sum_{j \neq i}
\sum_{m \neq n} x_i^{(m)} x_j^{(m)} x_j^{(n)} \\
&=& (I-1) x_i^{(n)} + \sum_{j\neq i}
\sum_{m \neq n } x_i^{(m)} x_j^{(m)} x_j^{(n)} .
\eeqan
The first term is $(I-1)$ times the desired state $x_i^{(n)}$.
If this were the only term, it would keep the neuron
firmly clamped in the desired state.
The second term is a sum of $(I-1)(N-1)$ random quantities
$x_i^{(m)} x_j^{(m)} x_j^{(n)}$. A moment's reflection confirms
that these quantities are independent random binary variables with
mean 0 and variance 1.
Thus, considering the statistics of $a_i$ under the ensemble of random
patterns, we conclude that $a_i$ has mean $(I-1) x_i^{(n)}$
and variance $(I-1)(N-1)$.
For brevity, we will now assume $I$ and $N$
are large enough that we can neglect the distinction between $I$ and
$I-1$, and between $N$ and $N-1$.
 Then, invoking the central-limit theorem for this sum of many
 independent terms, we can sharpen our conclusion:
 $a_i$ is approximately Gaussian-distributed with mean $I x_i^{(n)}$
 and variance $IN$.
\marginfig{\small
%\makebox[0in][l]{\hspace{0.3in}{\raisebox{0.65in}{$P(a_i)$}}}
\makebox[0in][l]{\hspace{0.83in}{\raisebox{0.27in}{$\sqrt{IN}$}}}
\makebox[0in][l]{\hspace{0.82in}{\raisebox{-0.23in}{$I$}}}
%\makebox[0in][l]{\hspace{0.8in}{\raisebox{-0.2in}{$I x_i^{(n)}$}}}
\mbox{
\psfig{figure=figs/phi.ps,angle=-90,width=1.2in}
}\raisebox{0.0in}{$a_i$}
\caption[a]{The probability density of
the activation $a_i$ in the case $x_i^{(n)} \eq 1$;
the probability that bit $i$ becomes flipped is the area of the tail. }
}
What then is the probability that the selected bit is stable, if
we put the network into the state $\bx^{(n)}$?
The probability that bit $i$ will flip on the
first iteration of the Hopfield network's dynamics is
\beq
P(\mbox{$i$ unstable}) = \Phi \left( -\frac{I}{\sqrt{IN}} \right)
= \Phi \left( -\frac{1}{\sqrt{N/I}} \right) ,
\label{eq.phi.hop.flip}
\eeq
where\index{$\Phi(z)$}\index{error function}
\beq
 \Phi(z) \equiv \int_{-\infty}^{z}
 \d u \: {\smallfrac{1}{\sqrt{2 \pi}}} \, e^{-u^2/2} .
\eeq
The important quantity $N/I$ is the ratio of the number of patterns stored to
the number of neurons.
If, for example, we try to store $N \simeq 0.18 I$ patterns in the
Hopfield network then there is a chance of 1\% that a specified bit
in a specified pattern will be unstable on the first iteration.
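 This figure can be checked numerically by substituting $N/I = 0.18$
 into \eqref{eq.phi.hop.flip}; the values below are rounded, and the
 Gaussian approximation is itself only accurate for large $I$ and $N$:
\beq
 \frac{1}{\sqrt{N/I}} = \frac{1}{\sqrt{0.18}} \simeq 2.36 ,
 \:\:\:\:
 \Phi ( -2.36 ) \simeq 0.009 .
\eeq
 The argument of $\Phi$ is the ratio of the signal, $I$, to the standard
 deviation of the noise, $\sqrt{IN}$, so the 1\% level corresponds to a
 signal that is about 2.4 standard deviations from the threshold.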
We are now in a position to derive our first capacity result, for the
case where no corruption of the desired memories is permitted.
\exercisxB{2}{ex.hopexact}{
Assume that we wish all the desired patterns to be
completely stable -- we don't want any of the bits
to flip when the network is put into any desired
pattern state -- and the total probability of any error at all is required
to be less than a small number $\epsilon$.
Using the approximation to the error function for large $z$,
\beq
\Phi ( -z ) \simeq {\frac{1}{\sqrt{2 \pi}}} \frac{ e^{-z^2/2} }{z},
\eeq
show that the maximum number of patterns that can be stored, $N_{\max}$,
is
\beq
N_{\max} \simeq \frac{I}{4 \ln I + 2 \ln (1/\epsilon)} .
\eeq
}
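 To get a feel for the size of this stringent capacity, here is an
 illustrative evaluation of the formula stated in the exercise; the values
 $I = 1000$ and $\epsilon = 0.01$ are arbitrary choices made for this
 example only:
\beq
 N_{\max} \simeq \frac{1000}{4 \ln 1000 + 2 \ln 100}
 \simeq \frac{1000}{27.6 + 9.2} \simeq 27 .
\eeq
 This corresponds to $N/I \simeq 0.027$, far below the loading
 $N \simeq 0.18 I$ at which an individual bit is unstable with
 probability 1\%.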
%
% no solution has been written for this one.
%
If, however, we allow a small amount of corruption of memories
to occur, the number of patterns that can be stored increases.
\subsection{The statistical physicists' capacity}
The analysis that led to\index{capacity!Hopfield network}\index{Hopfield network!capacity}
\eqref{eq.phi.hop.flip} tells us that
if we try to store $N \simeq 0.18 I$ patterns in the
Hopfield network then, starting from a desired memory, about
1\% of the bits will be unstable on the first iteration.
Our analysis does not shed light on what is expected to happen
on subsequent iterations. The flipping of
these bits might make some of the other bits unstable too, causing
an increasing number of bits to be flipped. This process might
lead to an avalanche in which the network's state ends up a long way from
the desired memory.
\begin{figure}
\figuremargin{%
\begin{center}
\mbox{\psfig{figure=hopfield/amy.ps,angle=-90,width=2.63in}%
\psfig{figure=hopfield/amy.det.ps,angle=-90,width=2.63in}}
\end{center}
}{%
\caption[a]{Overlap between a desired memory and the stable
state nearest to it as a function of the loading fraction $N/I$.
The overlap is defined to be the scaled
inner product $\sum_i x_i x_i^{(n)}/I$, which
is 1 when recall is perfect and zero when the stable state has
50\% of the bits flipped.
There is an abrupt transition at $N/I = 0.138$, where
the overlap drops from 0.97 to zero.}
\label{fig.amit}
}%
\end{figure}
In fact, when $N/I$ is large,
such avalanches do happen. When $N/I$ is small, they tend
not to -- there is a stable state near to each desired memory.
For the limit of large $I$, \citeasnoun{Amit85b}
% Amit, Gutfreund and Sompolinsky
have used methods
from statistical physics to find numerically the transition between
these two behaviours. {There is
a sharp discontinuity at}
\beq N_{\rm{crit}} = 0.138 I.
\eeq
Below this critical value, there is likely to be a stable state near
every desired memory, in which
a small fraction of the bits are flipped.
When $N/I$ exceeds 0.138, the system has only spurious stable states,
known as {\dem spin glass states}, none of which is correlated
with any of the desired memories.
Just below the critical value, the fraction of bits that are flipped
when a desired memory has evolved to its associated stable state is
1.6\%. \Figref{fig.amit} shows the overlap between the desired memory
and the nearest stable state as a function of $N/I$.
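 The relation between the fraction of flipped bits and the overlap is
 simple arithmetic. If a fraction $q$ of the $I$ bits differ from the
 desired memory (the symbol $q$ is introduced here only for illustration),
 then the agreeing bits each contribute $+1$ to the inner product and the
 flipped bits each contribute $-1$, so
\beq
 \frac{1}{I} \sum_i x_i x_i^{(n)} = (1-q) - q = 1 - 2q .
\eeq
 With $q = 0.016$ the overlap is $0.968 \simeq 0.97$, and with
 $q = 0.5$ it is zero.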
Some other
transitions in properties of the model occur at some additional
values of $N/I$, as summarized below.
\begin{description}
\item[For all $N/I$\mycomma] stable spin glass states exist, uncorrelated with
the desired memories.
\item[For $N/I>0.138$\mycomma] these spin glass states are the only stable states.
\item[For $N/I \in (0,0.138)$\mycomma] there are stable states close to the
desired memories.
% , and spurious spin glass states.
\item[For $N/I \in (0,0.05)$\mycomma] the stable states associated with
the desired memories have lower energy than the spurious spin glass states.
\item[For $N/I \in (0.05,0.138)$\mycomma] the spin glass states dominate --
there are spin glass states that have lower energy than
the stable states associated with the desired memories.
\item[For $N/I \in (0,0.03)$\mycomma] there are additional {\dem mixture\/} states,
which are combinations of several desired memories. These stable
states do not have as low energy as
the stable states associated with the desired memories.
\end{description}
In conclusion, the capacity of the Hopfield network with $I$ neurons,
if we define the capacity
in terms of the abrupt discontinuity discussed above, is $0.138 I$
random binary patterns, each of length $I$, each of which is received
with 1.6\% of its bits flipped.
In bits, this capacity is\marginpar{\small\raggedright{This expression for the capacity
omits a smaller negative term of order $N \log_2 N$ bits, associated with the
arbitrary order of the memories.
}}
\beq
0.138 I^2 \times ( 1 - H_2(0.016) ) = 0.122 \, I^2 \mbox{ bits}.
\eeq
 Since there are roughly $I^2/2$ weights in the network,
we can also express the capacity as
{\em 0.24 bits per weight.}
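 The arithmetic behind these two numbers can be spelled out as follows,
 with $H_2$ the binary entropy function measured in bits and all values
 rounded:
\beqan
 H_2(0.016) &=& 0.016 \log_2 \frac{1}{0.016} + 0.984 \log_2 \frac{1}{0.984}
 \simeq 0.095 + 0.023 = 0.118 , \\
 0.138 \times ( 1 - 0.118 ) &\simeq& 0.122 , \\
 \frac{0.122 \, I^2}{I^2/2} &\simeq& 0.24 .
\eeqan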
\section{Improving on the capacity of the Hebb rule}
The capacities discussed in the previous section are the capacities
of the Hopfield network whose weights are set using
the Hebbian learning rule.\index{connection between!supervised and
unsupervised learning}
We can do better than the Hebb rule by defining an
objective function that measures how well the
network stores all the memories, and minimizing it.
For an associative memory to be useful, it must be able to
correct at least one flipped bit. Let's make an
objective function that measures whether flipped bits
tend to be restored correctly.
Our intention is that, for every neuron $i$ in the network,
the weights to that neuron should satisfy this rule:
\begin{quote}
for every pattern $\bx^{(n)}$, if the neurons other than $i$
are set correctly to $x_j = x^{(n)}_j$, then the activation of neuron $i$
should be such that its preferred output is $x_i= x^{(n)}_i$.
\end{quote}
Is this rule a familiar idea? Yes, it is precisely what
we wanted the single neuron of \chref{ch.single.neuron.class}
to do. Each pattern $\bx^{(n)}$ defines an
input, target pair for the single neuron $i$.
And it defines an
input, target pair for all the other neurons too.
% see itp/hopfield/ALTERNATE.m
% see itp/hopfield/trainlog.m
So, just as we defined an objective function (\ref{eq.single.neuron.G})
for the training of a single neuron as a classifier,
we can define
\beq
 G(\bW) = - \sum_i \sum_n \left[ t^{(n)}_i \ln y^{(n)}_i + (1-t^{(n)}_i )
 \ln ( 1 - y^{(n)}_i ) \right]
\label{eq.multi.neuron.G}
\eeq
where
\beq
t^{(n)}_i = \left\{ \begin{array}{cc} 1 & x^{(n)}_i = 1\\
0 & x^{(n)}_i = -1 \end{array} \right.
\eeq
and
\beq
y^{(n)}_i = \frac{1}{1+\exp( - a^{(n)}_i ) },
\:\:\mbox{where $a^{(n)}_i = \sum_j w_{ij} x^{(n)}_j$ }.
\eeq
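The gradient of this objective function has the same form as in the
single-neuron case. As a sketch, treating each $w_{ij}$ temporarily
as an independent parameter (an assumption made here for simplicity; the
symmetry constraint can be imposed afterwards by symmetrizing the gradient),
the chain rule gives
\beq
 \frac{\partial G}{\partial w_{ij}}
 = - \sum_n \left( t^{(n)}_i - y^{(n)}_i \right) x^{(n)}_j ,
\eeq
since $\partial y^{(n)}_i / \partial a^{(n)}_i = y^{(n)}_i ( 1 - y^{(n)}_i )$
and $\partial a^{(n)}_i / \partial w_{ij} = x^{(n)}_j$.
A step down the gradient therefore adds to $w_{ij}$ a multiple of
the error $t^{(n)}_i - y^{(n)}_i$ times the input $x^{(n)}_j$, summed over the patterns.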
We can then steal the algorithm (\algref{fig.train.algm}, \pref{fig.train.algm}) which we wrote
for the single neuron,
to write an
algorithm for optimizing a Hopfield network,
algorithm \ref{hopfield.train}.
The convenient syntax of {\tt{Octave}}
requires very few changes; the extra lines
enforce the constraints that the self-weights $w_{ii}$ should
all be zero and that
the weight matrix should be symmetrical ($w_{ij}=w_{ji}$).
\begin{algorithm}
\begin{framedalgorithmwithcaption}%\figuremargin{%\margincaption
{%
\caption[a]{{\tt Octave} source code for optimizing
the weights of a Hopfield network, so that
it works as an associative memory. \cf\
\protect\algref{fig.train.algm}.
The data matrix {\tt{x}} has $I$ columns and $N$ rows.
The matrix {\tt{t}} is identical to {\tt{x}}
except that $-1$s are replaced by $0$s.
}
\label{hopfield.train}
}
%%%%%%%%%%%%%%%
\footnotesize
\begin{verbatim}
w = x' * x ;                          # initialize the weights using Hebb rule
for l = 1:L                           # loop L times
  for i=1:I                           #
    w(i,i) = 0 ;                      # ensure the self-weights are zero.
  end                                 #
  a = x * w ;                         # compute all activations
  y = sigmoid(a) ;                    # compute all outputs
  e = t - y ;                         # compute all errors
  gw = x' * e ;                       # compute the gradients
  gw = gw + gw' ;                     # symmetrize gradients
  w = w + eta * ( gw - alpha * w ) ;  # make step
endfor
\end{verbatim}
\end{framedalgorithmwithcaption}
\end{algorithm}
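To run the loop in \algref{hopfield.train}, the variables
{\tt{x}}, {\tt{t}}, {\tt{eta}}, {\tt{alpha}}, {\tt{L}} and the function
{\tt{sigmoid}} must already exist. The following self-contained sketch shows
one possible setup; the sizes, the random patterns, and the values of
{\tt{eta}}, {\tt{alpha}} and {\tt{L}} are untuned assumptions chosen only for
illustration, and {\tt{sigmoid}} is defined here as a function handle for the
logistic function.
\begin{verbatim}
I = 25 ; N = 6 ;                        # sizes (arbitrary assumption)
x = 2 * ( rand(N,I) > 0.5 ) - 1 ;       # N random patterns, entries +1/-1
t = ( x + 1 ) / 2 ;                     # targets: -1s replaced by 0s
sigmoid = @(a) 1 ./ ( 1 + exp(-a) ) ;   # logistic function
eta = 0.01 ; alpha = 0.01 ; L = 1000 ;  # untuned guesses

w = x' * x ;                            # Hebb rule initialization
for l = 1:L
  for i = 1:I
    w(i,i) = 0 ;                        # zero the self-weights
  end
  a = x * w ;                           # activations
  y = sigmoid(a) ;                      # outputs
  e = t - y ;                           # errors
  gw = x' * e ;                         # gradients
  gw = gw + gw' ;                       # symmetrize
  w = w + eta * ( gw - alpha * w ) ;    # make step
endfor

w = w - diag(diag(w)) ;                 # remove residual self-weights
stable = all( all( sign( x * w ) == x ) )   # 1 if every pattern is a fixed point
\end{verbatim}
The final two lines are an addition for checking the result: because the loop
zeroes the self-weights at the start of each pass, the last step can leave
non-zero diagonal entries, so they are removed before testing whether each
pattern is reproduced when every neuron is set to the sign of its activation.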
% DEMO6
% load 'weights.gnu'
%
% silly figure, no meaning!!
%
%\begin{figure}
%\figuremargin{%
%\begin{center}
%\mbox{\psfig{figure=hopfield/weights.ps,angle=-90,width=2.63in}%
%}
%\end{center}
%}{%
%\caption[a]{The six patterns
% of \figref{fig.hopfield.djcmxs}, which cannot be memorized by the Hebb rule,
% were learned using \algref{hopfield.train}.
% Horizontal axis shows each weight's value under the Hebb rule;
% vertical axis shows the value after training.
%}
%\label{fig.hop.learnt.wts}
%}%
%\end{figure}
As expected, this learning algorithm does a better job than
the {one-shot} Hebbian learning rule.
% \Figref{fig.hop.learnt.wts} shows how much the 625 weights
%
% change from the Hebb prescription w
When the six patterns
of \figref{fig.hopfield.djcmxs}, which cannot be memorized by the Hebb rule,
are learned using \algref{hopfield.train},
all six patterns become stable states.
% , though some have tiny
% basins of attraction.
%
% !!!!!!!!!!!!!!!!11 add the name of this method
\exercisxC{4C}{ex.hopdob}{
Implement this learning rule
and investigate empirically
its capacity for memorizing random patterns; also
compare its avalanche properties with those
of the Hebb rule.
}
\section{Hopfield networks for optimization problems}
\label{sec.tsp}
Since a Hopfield network's dynamics minimize an energy function,
it is natural to ask whether we can map interesting \ind{optimization}
{problems} onto Hopfield networks. Biological data processing problems
often involve an element of {\dem\ind{constraint satisfaction}} -- in scene
interpretation, for example, one might wish to infer the spatial location,
orientation, brightness and texture of each visible element, and
which visible elements are connected together in objects. These
inferences are constrained by the given data and by prior knowledge
about continuity of objects.
\citeasnoun{Hopfield_Tank}
% Hopfield and Tank
\index{Hopfield, John J.}\index{Tank, David W.}suggested that one might take an interesting
constraint satisfaction problem and design the weights
of a binary or continuous
\ind{Hopfield network} such that the settling process of the
network would minimize the objective function of the problem.
\begin{figure}\small
\figuremargin{\small%
\begin{center}
\mbox{\hspace{6mm}%hack to compensate for hack below
\begin{tabular}{c}
%\raisebox{40mm}{\makebox[0in][l]{(a1)}}%
%\hspace{-0.052in}%
\setlength{\unitlength}{0.6mm}
\begin{picture}(65,55)(0,0)% hack to make look nice in column
%\begin{picture}(65,55)(-15,0)
\put(0,0){\makebox(0,0)[bl]{\psfig{figure=figs/tsp2.eps,width=30mm}}}
\put(25,51){\makebox(0,0)[b]{Place in tour}}
\put(-2,27.5){\makebox(0,0)[r]{City}}
\end{picture}
\\[0.1in]
\hspace{-0.53in}\psfig{figure=figs/tspACDB.eps,width=30mm}
\\
(a1)
\end{tabular}
%%%%%%%%%%%%%%%%
\hspace{-0.1in}
\begin{tabular}{c}
%\raisebox{40mm}{\makebox[0in][l]{(a2)}}%
%\hspace{-0.052in}%
\setlength{\unitlength}{0.6mm}
\begin{picture}(65,55)(0,0)% hack to make look nice in column
%\begin{picture}(65,55)(-15,0)
\put(0,0){\makebox(0,0)[bl]{\psfig{figure=figs/tsp3.eps,width=30mm}}}
\put(25,51){\makebox(0,0)[b]{Place in tour}}
\put(-2,27.5){\makebox(0,0)[r]{City}}
\end{picture}
\\[0.1in]
\hspace{-0.53in}\psfig{figure=figs/tspABCD.eps,width=30mm}
\\
(a2)%
\end{tabular}
\hspace*{-0.5in}
\begin{tabular}{r}
\makebox[0.2in][l]{(b)}%
\setlength{\unitlength}{0.65mm}
\begin{picture}(50,55)(0,0)
\put(0,0){\makebox(0,0)[bl]{\psfig{figure=figs/tspvalid.eps,width=32.5mm}}}
%\put(25,51){\makebox(0,0)[b]{Place in tour}}
%\put(0,27.5){\makebox(0,0)[r]{City}}
\end{picture}
\\
\hspace{0.083in}
\makebox[0.2in][l]{(c)}%
\setlength{\unitlength}{0.65mm}
\begin{picture}(50,55)(0,0)
\put(0,0){\makebox(0,0)[bl]{\psfig{figure=figs/tspdist.eps,width=32.5mm}}}
\put(36.5,11.5){\makebox(0,0)[l]{$-d_{BD}$}}
\end{picture}
\end{tabular}
}
\end{center}
}{%
\caption[a]{Hopfield network for solving a travelling salesman problem
with $K=4$ cities. (a1,2) Two solution states of the 16-neuron network,
 with activities represented
by black = 1, white = 0; and the
% . (a2) The
tours corresponding to these network states.
% (a2) Another valid state and its tour.
% corresponding to another network state.
(b) The negative weights between node $B2$
and other nodes; these weights enforce validity of a tour. (c) The negative weights
that embody the distance objective function.}
\label{fig.tsp}
}%
\end{figure}
\subsection{The travelling salesman problem}
A classic constraint satisfaction problem
to which Hopfield networks have been applied is the
\ind{\TSP}.
A set of $K$ cities is given, and a matrix of the $K(K-1)/2$ distances
between those cities. The task is to find a closed tour of the
cities, visiting each city once, that has the smallest
total distance. The \TSP\ is equivalent in difficulty
to an \ind{NP-complete} problem.
The method suggested by Hopfield and Tank is to represent a tentative
solution to the problem by the state of a network with
$I=K^2$ neurons arranged in a square, with each neuron representing
the hypothesis that a particular city comes at a particular point
in the tour. It will be convenient to consider the states of the
neurons as being between 0 and 1 rather than $-1$ and 1.
Two solution states for a four-city \TSP\ are shown in \figref{fig.tsp}a.
The weights in the Hopfield network play two roles. First, they
must define an energy function which is minimized only when the
state of the network represents a {\em valid\/} tour. A valid
state is one that looks like a permutation matrix, having
exactly one `1' in every row and one `1' in every column.
This rule can be enforced by putting large negative weights between
any pair of neurons that are in the same row or the same column,
and setting a positive bias for all neurons to ensure that
$K$ neurons do turn on.
\Figref{fig.tsp}b shows the negative weights that are connected
to one neuron, `$\!B2$', which represents the
statement `city B comes second in the tour'.
% , which we can abbreviate as neuron $B2$.
Second, the weights must encode the objective function
that we want to minimize -- the total
distance. This can be done by putting negative weights
proportional to the appropriate distances between
the nodes in adjacent columns. For example, between the $B$ and $D$ nodes
in adjacent columns, the weight would be $-d_{BD}$.
The negative weights that are connected to neuron $B2$ are shown in
\figref{fig.tsp}c. The result is that when the network is
in a valid state,
its total energy will be the total distance of
the corresponding tour, plus a constant given by the energy associated with the biases.
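To make this claim concrete, here is one way to write the energy; the
notation is an assumption introduced for this sketch, not taken from
Hopfield and Tank. Let $x_{cp} \in \{0,1\}$ denote the state of the neuron
for `city $c$ comes at place $p$', let $-\gamma$ be the validity-enforcing
weight, and let $b$ be the bias given to every neuron. With the places
taken modulo $K$ so that the tour closes (and $K \geq 3$),
\beqan
 E(\bx) &=& \frac{\gamma}{2} \sum_{c} \sum_{p \neq p'} x_{cp} x_{cp'}
 + \frac{\gamma}{2} \sum_{p} \sum_{c \neq c'} x_{cp} x_{c'p} \nonumber \\
 & & + \frac{1}{2} \sum_{c \neq c'} \sum_{p} d_{cc'} \, x_{cp}
 \left( x_{c',p+1} + x_{c',p-1} \right)
 - b \sum_{c,p} x_{cp} .
\eeqan
In a valid state the first two terms vanish; the third term counts each leg of
the tour twice and halves the result, giving the tour length; and the last term
equals the constant $-Kb$.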
Now, since a Hopfield network minimizes its energy,
it is hoped that the binary or continuous
Hopfield network's dynamics will take the
state to a minimum that is a valid tour and which might be
an optimal tour. This hope is not fulfilled for large
{\TSP}s, however, without
some careful modifications.
We have not specified the size of the weights that enforce the
tour's validity, relative to the size of the distance weights,
and setting this scale factor poses difficulties.
% really, we would rather not do so -- there is no natural
% scale factor relating them.
If `large' validity-enforcing weights
are used, the network's dynamics will rattle into a valid state
with little regard for the distances.
% objective function.
If `small' validity-enforcing weights are used, it is possible that
the distance weights will cause the network to adopt an {\em invalid\/}
state that has lower energy than any valid state. Our original
formulation of the energy function puts the objective function and
the solution's validity in potential conflict with each other.
This difficulty has been resolved by the work of
% Aiyer,
Sree \citeasnoun{Aiyer_thesis},\index{Aiyer, Sree}\index{Balakrishnan, Sree} who showed
how to modify the distance weights so that they would not interfere
with the solution's validity, and how to define a continuous Hopfield
network whose dynamics are at all times confined to a `valid
subspace'. Aiyer used a {\dem\ind{graduated non-convexity}\/}
or {\dem\ind{deterministic annealing}\/} approach to find good solutions using
these Hopfield
networks. The deterministic annealing\index{annealing!deterministic} approach involves gradually
increasing the gain $\b$ of the neurons in the network from 0
to $\infty$, at which point the state of the network corresponds
to a valid tour. A sequence of trajectories generated by applying
this method to a thirty-city \TSP\
% as the gain is increased
is shown in \figref{fig.scholar}a.
% I grabbed this from here:
%http://www-svr.eng.cam.ac.uk/reports/abstracts/aiyer_tr89.html
% page 2.
% page 70 has a sequence of expanding tours.
A solution to the `travelling scholar problem' found by Aiyer
using a continuous Hopfield network is shown
in \figref{fig.scholar}b.
\begin{figure}\small
\figuremargin{%
\begin{center}
\begin{tabular}{cc}
\mbox{\psfig{figure=figs/ps/amoeba1.ps,width=1.52in,height=4.5in}}%
& \mbox{\psfig{figure=figs/ps/scholar1.ps,height=4.15in}}%
\\
(a) &
(b)
\\
\end{tabular}
\end{center}
}{%
\caption[a]{(a) Evolution of the state of a
continuous Hopfield network solving a \TSP\
using \quotecite{Aiyer_thesis} graduated non-convexity
method; the state of the network is projected into the
two-dimensional space in which the cities are located
by finding the centre of mass for each point in the tour, using the
neuron activities as the mass function.
(b) The travelling scholar problem. The shortest tour linking the
27 Cambridge Colleges, the Engineering Department,
the University Library, and Sree Aiyer's house.
% The solution
% was found using a continuous Hopfield network.
From \citeasnoun{Aiyer_thesis}.}
\label{fig.scholar}
}%
\end{figure}
\fakesection{Hopfield networks for cypher-cracking}
% Demo software: crack.tcl.
\section{Further exercises}
\exercisxB{3}{ex.hopfieldtwomemory}{
{\sf{Storing two memories.}}
% Describe a {\em Hopfield network\/}
% with $I$ neurons having binary activities $x_i = \pm 1$,
% explaining the r\^ole of the {\em biases\/} and {\em weights\/}
% in the {\em activity rule}, and stating the constraints
% satisfied by the weights.
% \marginpar{[4]}
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
%{\em [This paragraph can be cut to shorten the question:]}
%
% Explain the relationship between the dynamics
% of the Hopfield network and the energy function
%\beq
% E(\bx) = - \frac{1}{2} \sum_{ij} w_{ij} x_i x_j - \sum_i b_i x_i ,
%\eeq
% distinguishing between the cases of synchronous
% and asynchronous dynamics.
% \marginpar{[4]}
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Two
binary memories
% $\bx^{(n)}$
% $\bx^{(1)}$ and $\bx^{(2)}$
$\bm$ and $\bn$ ($m_i, n_i \in \{ -1,+1\}$)
are stored by Hebbian learning in a
Hopfield network using
\beq
w_{ij} = \left\{
\begin{array}{cl} m_i m_j + n_i n_j & \mbox{for $i \not = j$}
\\
0 & \mbox{for $i = j$.}
\end{array} \right.
% \sum_{n=1}^{n=2} x^{(n)}_i x^{(n)}_j
% ;\:\: b_i = 0 .
\eeq
The biases $b_i$ are set to zero.
The network is put in the state $\bx = \bm$.
Evaluate the activation $a_i$ of neuron $i$ and show that
 it can be written in the form
\beq
a_i = \mu m_i + \nu n_i .
\eeq
By comparing the signal strength, $\mu$, with the magnitude
of the noise strength, $|\nu|$,
show that
$\bx = \bm$ is a stable state
% Show that both the intended memories are indeed stable
% states
of the dynamics of the network.
% \marginpar{[4]}
The network is put in a state $\bx$ differing in $D$ places
from $\bm$,
\beq
\bx = \bm + 2 \bd ,
\eeq
where the perturbation $\bd$ satisfies $d_i \in \{ -1,0,+1 \}$.
$D$ is the number of components of
$\bd$ that are non-zero, and
for each $d_i$ that is non-zero, $d_i = -m_i$.
Defining the overlap between $\bm$ and $\bn$ to be
\beq
o_{\bm \bn} = \sum_{i=1}^{I} m_i n_i ,
\eeq
evaluate the activation $a_i$ of neuron $i$ again and show that
the dynamics of the network will restore $\bx$ to $\bm$ if
the number of flipped bits satisfies
\beq
D < \frac{1}{4} ( I - |o_{\bm \bn}| - 2 ) .
\eeq
% \marginpar{[9]}
How does this number compare with the maximum number of
flipped bits that can be corrected by the optimal decoder,
assuming the vector $\bx$ is either a noisy version of $\bm$ or of $\bn$?
}
\exercisxC{3}{ex.hopfieldclass}{ {\sf{Hopfield network as a collection of binary classifiers.}}
This exercise explores the
link between unsupervised networks and supervised networks.
If a Hopfield network's desired memories are all attracting stable states,
then every neuron
%$i$
in the network
has weights going to it that solve a classification
problem personal to that neuron.
%$i$
Take the set of memories and write them
in the form ${\bx'}^{(n)},x_i^{(n)}$, where $\bx'$ denotes all the
components $x_{i'}$ for all $i' \neq i$, and let $\bw'$ denote the
vector of weights $w_{ii'}$, for $i' \neq i$.
Using what we know about the capacity of the single neuron,
show that it is almost certainly impossible to store more than $2I$ random memories
in a Hopfield network of $I$ neurons.
%
% By taking the objective function $G$ for the supervised
% training of a single neuron, and adding up $I$ of them,
% write down an objective function $G(\bw)$ for
% optimizing the weights of a Hopfield network with the
% objective of storing $N$ memories, and
% write down a learning algorithm. Write
% a computer program and test whether this learning
% algorithm does a better job than the Hebb rule of storing random
% memories.
}
\subsection{Lyapunov functions}
%\soln{ex.hopfieldclass}{
% Could include octave code and demonstration here.
%%%%%%%%%%%%%%%%% TODO!!!!!!!!!!!!!!!
%}
\exercisxC{3}{ex.eric}{
\index{Winfree, Erik}{\sf Erik's puzzle}.
In a stripped-down version of Conway's\index{Conway, John H.} game of \ind{life},\index{game!life}\index{puzzle!life}\index{Lyapunov function}
cells are arranged on a square grid.
% in an $N \times N$ square.
Each cell is either alive or dead.
Live cells do not die.
Dead cells become alive if two or more of their
immediate neighbours are alive. (Neighbours to north, south,
east and west.)
What is the smallest number of live cells needed
in order that these rules lead to an entire $N \times N$ square
being alive?
\amarginfig{t}{
\begin{center}\mbox{%
\epsfbox{metapost/erik.1}\raisebox{7mm}{$\rightarrow$}%
\epsfbox{metapost/erik.2}\raisebox{7mm}{$\rightarrow$}%
\epsfbox{metapost/erik.3}}\end{center}
\caption[a]{Erik's dynamics.}
}
In a $d$-dimensional version of the same game,
the rule is that if $d$ neighbours are alive then you come to life.
What is the smallest number of live cells needed
in order that an entire $N \times N \times \cdots \times N$ hypercube
becomes alive? (And how should those live cells be arranged?)
}
% Add Eric's puzzle:
%Noone dies.
%What's the smallest number of beans needed to fill the NxN square?
%Lyapunov function is the surface area (or length).
% itprnnchapter.tex
\subsection*{The southeast puzzle}
\begin{figure}[htbp]
\figuredangle{\footnotesize
\setlength{\unitlength}{2.53mm}
\thinlines
\begin{picture}(53,10)(-3,-10)
% (a)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\put(-2,-2){\makebox(0,0){(a)}}
\gridlet{0,0}
\piece{1,-1}
% a->b
%\hnextposition{7,-3}
% (b)
\gridlet{10,0}
\movingpiece{11,-1}
% b->c
\lhnextposition{17.5,-3}{(b)}
\gridlet{20,0}
\movingpiece{21,-3}
\piece{23,-1}
%\vnextposition{13,-7}
%\ldnextposition{23,-7}{(b2)}
\lhnextposition{27.5,-3}{(c)}
% (c)
\gridlet{30,0}
\movingpiece{33,-3}
\piece{33,-1}
\piece{31,-5}
\lhnextposition{37.5,-3}{(d)}
% (d)
\gridlet{40,0}
\piece{43,-5}
\piece{45,-3}
\piece{43,-1}
\piece{41,-5}
\lhnextposition{47.5,-3}{}
%\lhnextposition{49.5,-4}{$\ldots$}
% (e)
%\gridlet{50,0}
%\piece{53,-3}
%\piece{51,-5}
%\piece{53,-1}
% % this VV was underneath, see NOTES.tex
\put(50.5,-3){\makebox(0,0){$\ldots$}}%...
%\put(52.5,-2){\makebox(0,0){(z)}}
\lhnextposition{51.5,-3}{(z)}
\gridletfive{54,0}
\opiece{55,-7}
\opiece{55,-5}
\opiece{55,-3}
\opiece{55,-1}
\opiece{57,-5}
\opiece{57,-3}
\opiece{57,-1}
\opiece{59,-3}
\opiece{59,-1}
\opiece{61,-1}
%
\end{picture}
}{
\caption[a]{The {\tt southeast} puzzle.\index{puzzle!southeast}\index{puzzle!chessboard}}
% Thanks to Tadashi for this one.}
\label{fig.southeast}
}
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
The {\tt{southeast}}\index{southeast puzzle}
puzzle\index{Lyapunov function}
% \index{puzzle!{\tt{southeast}}}
is played on a semi-infinite \ind{chess board}, starting at its northwest
(top left) corner. There are three rules:
\ben
\item
In the starting position, one piece is placed in the northwest-most
square (\figref{fig.southeast}a).
\item
It is not permitted for more than one piece to be on any given square.
\item
At each step, you remove one piece from the board, and replace
it with two pieces, one in the square immediately to the east,
 and one in the square immediately to the south,
as illustrated in \figref{fig.southeast}b. Every such step
increases the number of pieces on the board by one.
\een
After move (b) has been made, either piece may be selected for
the next move. \Figref{fig.southeast}c shows the outcome of moving
the lower piece.
At the next move, either the lowest piece or the
middle piece of the three may be selected; the uppermost
piece may not be selected, since that would violate
rule 2.
At move (d) we have selected the middle piece. Now any of
the pieces may be moved, except for the leftmost piece.
Now, here is the puzzle:
\exercissxB{4}{ex.southeast}{
Is it possible to obtain a position in
which all the ten squares closest to the northwest
corner, marked in \figref{fig.southeast}z,
are empty?
[{\sf Hint:}
this puzzle has a connection to data compression.]
% ? {\sf Hint:} Symbol codes.]
}
\dvips
%\section{Solutions to Chapter \protect\ref{ch.hopfield}'s exercises}
\section{Solutions}% to Chapter \protect\ref{ch.hopfield}'s exercises}
% 133
\soln{ex.hopasymm}{
Take a binary feedback network with 2 neurons and let $w_{12} = 1$
and $w_{21} = -1$. Then
whenever neuron 1 is updated, it will match neuron 2,
and whenever neuron 2 is updated, it will flip to the opposite
state from neuron 1.
There is no stable state.
}
\soln{ex.hopasynch}{
Take a binary Hopfield network with 2 neurons and let $w_{12} = w_{21}
= 1$, and let the initial condition be $x_1 = 1$, $x_2=-1$.
Then if the dynamics are synchronous, on every iteration
both neurons will flip their state. The dynamics do
not converge to a fixed point.
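 To trace one cycle explicitly (assuming the convention
 $a_i = \sum_j w_{ij} x_j$, with $x_i$ set to the sign of $a_i$):
 from $(x_1,x_2) = (1,-1)$ the activations are
 $a_1 = w_{12} x_2 = -1$ and $a_2 = w_{21} x_1 = 1$, so the synchronous
 update gives $(x_1,x_2) = (-1,1)$; from there $a_1 = 1$ and $a_2 = -1$,
 which returns the state to $(1,-1)$.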
%
}
%
\soln{ex.southeast}{
The key to this problem is to
% When you played with this problem, I hope that you
notice its similarity to the construction of a binary
symbol code. Starting from the empty string, we can build
a binary tree by repeatedly splitting a codeword into two.
% An important idea in source codes was the Kraft equality.
Every codeword has an implicit probability $2^{-l}$, where
$l$ is the depth of the codeword in the binary tree.
Whenever we split a codeword in two
and create two new codewords whose length is increased by one,
the two new codewords each have implicit probability equal to
half that of the old codeword.
For a complete binary code, the Kraft equality affirms that
the sum of these implicit
probabilities is 1.\index{Kraft inequality}
Similarly, in {\tt{southeast}},
we can associate a `weight' with each piece on the board.
If we assign a weight of 1 to any piece sitting on the
top left square; a weight of 1/2 to any piece on a square
whose distance from the top left is one; a weight
of 1/4 to any piece whose distance from the top left is two;
and so forth, with `distance' being the city-block
distance; then every legal move in {\tt{southeast}}
leaves unchanged the total weight of all pieces on the
board.
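 To see the conservation explicitly: a piece at city-block distance $l$
 from the top left has weight $2^{-l}$, and a move replaces it by two pieces
 at distance $l+1$, so the weight added equals the weight removed,
\beq
 2^{-(l+1)} + 2^{-(l+1)} = 2^{-l} .
\eeq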
\ind{Lyapunov function}s come
in two flavours: the function
may be a function of state whose value is known to stay constant;
or it may be a function of state that
is bounded below, and whose value always decreases or stays constant.
 The total weight is a Lyapunov function of the first type.
The starting weight is 1, so now we have a powerful tool:
a conserved function of the state.
Is it possible to find a position in which the ten highest-weight
squares are vacant, and the total weight is 1?
What is the total weight if {\em all\/} the other squares on
the board are occupied (\figref{fig.southeast.sol})?
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\begin{figure}
%\figuremargin{
\marginfig{
\footnotesize
\begin{center}
\setlength{\unitlength}{2.53mm}
\thinlines
\begin{picture}(13,13)(0,-13)
\gridletfive{0,0}
\piece{9,-9}
\piece{9,-7}
\piece{9,-5}
\piece{9,-3}
\piece{9,-1}
\piece{7,-5}
\piece{7,-3}
\piece{7,-7}
\piece{7,-9}
\piece{5,-5}
\piece{5,-7}
\piece{5,-9}
\piece{3,-7}
\piece{3,-9}
\piece{1,-9}
\put(11,-10){\makebox(0,0){$\ddots$}}
\put(11,-5){\makebox(0,0){$\ldots$}}
\put(11,-1){\makebox(0,0){$\ldots$}}
\put(5,-11){\makebox(0,0){$\vdots$}}
\put(1,-11){\makebox(0,0){$\vdots$}}
%
\end{picture}
\end{center}
%}{
\caption[a]{A possible position for the {\tt southeast} puzzle?}
\label{fig.southeast.sol}
}
%\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% sum( l*2^(-l) , l = 4...infinity ) ;
% 5/8
% sum( (l+1)*2^(-l) , l = 4...infinity ) ;
% 3/4
The total weight would be $\sum_{l=4}^{\infty} (l+1) 2^{-l}$, which
is equal to $3/4$.
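 One way to check this sum (a sketch using the standard identity
 $\sum_{l=0}^{\infty} (l+1) z^l = 1/(1-z)^2$ for $|z|<1$) is
\beq
 \sum_{l=4}^{\infty} (l+1) 2^{-l}
 = \frac{1}{(1-1/2)^2} - \sum_{l=0}^{3} (l+1) 2^{-l}
 = 4 - \left( 1 + 1 + \frac{3}{4} + \frac{1}{2} \right)
 = \frac{3}{4} .
\eeq
 Here the factor $(l+1)$ is the number of squares at city-block distance $l$
 from the corner, and the ten excluded squares are those with $l \leq 3$.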
% e d Sun 31/12/00
 Since this is less than the conserved total weight of 1,
 it is impossible to empty all ten of those squares.
% Functions of the state of a dynamical system
% like the total weight, which we used here,
% are known as \ind{Lyapunov function}s.
}
\dvipsb{solutions hopfield networks}
\chapter{Boltzmann Machines} % From Hopfield Networks to
\label{ch.boltzmann}
\section{From Hopfield networks to Boltzmann machines}
%
We have noticed that the binary Hopfield network minimizes an
energy function\index{Boltzmann machine}
\beq
E(\bx) = - \frac{1}{2} \bx^{\T} \bW \bx
\eeq
and that the continuous Hopfield network with activation
function $x_n = \tanh(a_n)$ can be viewed as
{\em approximating\/} the probability distribution
associated with that energy function,
\beq
P(\bx\given \bW) = \frac{1}{Z(\bW)} \exp [ - E(\bx) ]
= \frac{1}{Z(\bW)} \exp \left[
\frac{1}{2} \bx^{\T} \bW \bx \right] .
\label{eq.bm}
\eeq
These observations motivate the idea of working with
a neural network model that actually {\em implements\/} the above
probability distribution.\index{Hinton, Geoffrey E.}\index{learning algorithms!Boltzmann machine}
The {\dbf stochastic Hopfield network} or {\dbf Boltzmann machine}
\cite{BM}\index{Hinton, Geoffrey E.}\index{Sejnowski, Terry J.}
has the following activity rule:\medskip
\begin{framedalgorithm}
\begin{description}
\item[Activity rule of Boltzmann machine:] after computing the activation
$a_i$ (\ref{eq.hopfield.activation.defn}),
\beq
\begin{array}{l}
\mbox{set $x_i = +1$ with probability
$\displaystyle\frac{1}{1+e^{-2a_i}}$}\\
\mbox{else set $x_i = -1$}.
\end{array}
\label{eq.bma}
\eeq
\end{description}
\end{framedalgorithm}\medskip
%
\noindent
This rule implements Gibbs sampling for the probability distribution
(\ref{eq.bm}).
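To see why, here is a sketch of the conditional distribution of one neuron,
assuming $w_{ii} = 0$, $w_{ij} = w_{ji}$, and an activation
$a_i = \sum_j w_{ij} x_j$ with no bias. The terms in the exponent of
(\ref{eq.bm}) that involve $x_i$ are $x_i \sum_j w_{ij} x_j = x_i a_i$, so
\beq
 P( x_i = +1 \given \{ x_j \}_{j \neq i} , \bW )
 = \frac{ e^{a_i} }{ e^{a_i} + e^{-a_i} }
 = \frac{1}{1 + e^{-2 a_i}} ,
\eeq
which is the probability used in the activity rule.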
\subsection{Boltzmann machine learning}
Given a set of examples $\{ \bx^{(n)} \}_{1}^{N}$ from the real world, we
might be interested in adjusting the weights $\bW$ such
that the generative model
\beq
P(\bx\given \bW) = \frac{1}{Z(\bW)} \exp \left[
\frac{1}{2} \bx^{\T} \bW \bx \right]
\label{eq.bm.a}
\eeq
is well matched to those examples.
%
We can derive a learning algorithm by writing down \Bayes\ theorem
to obtain the posterior probability of the weights given\index{Bayes' theorem}
the data:
\beq
 P(\bW\given \{ \bx^{(n)} \}_{1}^{N} )
 = \frac{ \displaystyle \left[ \prod_{n=1}^{N}
 P(\bx^{(n)} \given \bW) \right]
 P(\bW) }{ P( \{ \bx^{(n)} \}_{1}^{N} ) } .
\eeq
We concentrate on the first term in the numerator, the likelihood,
and derive a maximum likelihood algorithm (though
there might be advantages in pursuing a full Bayesian approach
as we did in the case of the single neuron).
%; indeed \citeasnoun{BM}
% used weight decay in their work on Boltzmann machines).
We differentiate the logarithm of the
likelihood,
\beq
\ln \left[ \prod_{n=1}^{N}
P(\bx^{(n)} \given \bW) \right]
= \sum_{n=1}^{N} \left[ \frac{1}{2} {\bx^{(n)}}^{\T} \bW \bx^{(n)}
- \ln Z(\bW)
\right]
,
\eeq
with respect to $w_{ij}$, bearing
in mind that $\bW$ is defined to be symmetric with $w_{ji} = w_{ij}$.
\exercisxA{2}{ex.bmderiv}{
Show that the derivative of $\ln Z(\bW)$ with respect to $w_{ij}$
is
\beq
\frac{\partial}{\partial w_{ij}} \ln Z(\bW) = \sum_{\bx} x_i x_j P(\bx\given \bW)
= \left< x_i x_j \right>_{P(\bx\given \bW)} .
\label{eq.bmdZ}
\eeq
[This exercise is similar to \exerciseref{ex.mlmaxenta}.]
}
The derivative of the log likelihood is therefore:
%
%\noindent
%\begin{framedalgorithm}
\beqan
 \frac{\partial}{\partial w_{ij}} \ln P( \{ \bx^{(n)} \}_{1}^{N} \given \bW )
&=&
\sum_{n=1}^{N} \left[ x_i^{(n)} x_j^{(n)}
- \left< x_i x_j \right>_{P(\bx\given \bW)}
\right] \\
&=& N \left[ \left< x_i x_j \right>_{\rm Data}
- \left< x_i x_j \right>_{P(\bx\given \bW)} \right] .
\label{eq.bm.learn}
\eeqan
%\end{framedalgorithm}\medskip
%
This gradient is proportional to the difference of two terms.
The first term is the {\em empirical\/} correlation between
$x_i$ and $x_j$,
\beq
\left< x_i x_j \right>_{\rm Data} \equiv
\frac{1}{N} \sum_{n=1}^{N} \left[ x_i^{(n)} x_j^{(n)}
\right] ,
\eeq
and the second term is the correlation between $x_i$ and $x_j$
under the current model,
\beq
\left< x_i x_j \right>_{P(\bx\given \bW)}
\equiv \sum_{\bx} x_i x_j P(\bx\given \bW) .
\eeq
The first correlation $\left< x_i x_j \right>_{\rm Data}$
is readily evaluated -- it is just the empirical correlation
between the activities in the real world.
The second correlation, $\left< x_i x_j \right>_{P(\bx\given \bW)}$,
is not so easy to evaluate, but it can be estimated by Monte Carlo
methods, that is, by observing the average value of $x_i x_j$
while the activity rule of the Boltzmann machine, \eqref{eq.bma},
is iterated.
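As an illustration of such a Monte Carlo estimate, here is a minimal
{\tt{Octave}} sketch. It is not taken from the original work on Boltzmann
machines; the two-neuron weight matrix, the number of sweeps {\tt{T}}, and the
variable names are assumptions made for this example, and no burn-in period is
discarded. For this two-neuron example the exact value is
$\left< x_1 x_2 \right> = \tanh(1) \simeq 0.76$, which gives something to
compare the estimate against.
\begin{verbatim}
w = [ 0 1 ; 1 0 ] ;                    # symmetric weights, zero diagonal
I = rows(w) ;                          # number of neurons
x = 2 * ( rand(I,1) > 0.5 ) - 1 ;      # random initial state, entries +1/-1
C = zeros(I,I) ;                       # running sum of x * x'
T = 1000 ;                             # number of sweeps (arbitrary choice)
for s = 1:T
  for i = 1:I
    a = w(i,:) * x ;                   # activation of neuron i
    if rand() < 1 / ( 1 + exp(-2*a) )  # activity rule of Boltzmann machine
      x(i) = 1 ;
    else
      x(i) = -1 ;
    endif
  endfor
  C = C + x * x' ;                     # accumulate pairwise products
endfor
C = C / T ;                            # estimate of <x_i x_j> under the model
\end{verbatim}
The entry {\tt{C(i,j)}} then estimates the model correlation that is
subtracted from the empirical correlation in the gradient.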
In the special case $\bW = 0$, we can evaluate the gradient exactly
because, by symmetry, the correlation $\left< x_i x_j \right>_{P(\bx\given \bW)}$
must be zero.
 If the weights, starting from $\bW = 0$,
 are adjusted by gradient ascent on the log likelihood with learning rate $\eta$,
 then, after one iteration,
 the weights will be
\beq
w_{ij} = \eta \sum_{n=1}^{N} \left[ x_i^{(n)} x_j^{(n)} \right] ,
\eeq
precisely
the value of the weights given by the Hebb rule, equation (\eqsixteenfive),
with which we trained the Hopfield network.
\subsection{Interpretation of Boltzmann machine learning}
One way of viewing the two terms in
the gradient (\ref{eq.bm.learn})
is as `waking' and `sleeping' rules.
While the network is `awake', it measures the correlation
between $x_i$ and $x_j$ in the real world, and weights are {\em increased\/}
in proportion. While the network is `a\ind{sleep}', it `\ind{dream}s'
about the world using the generative model
(\ref{eq.bm.a}), and measures the correlations between
$x_i$ and $x_j$ in the model world; these correlations determine
a proportional {\em decrease\/} in the weights. If the second-order
correlations in the
dream world match the correlations in the real world, then
the two terms balance and the weights do not change.
%\begin{figure}\small
%\figuremargin
\marginfig{\small%
\begin{center}
\begin{tabular}{c@{\hspace*{0.4in}}c}
\psfig{figure=figs/shifterhard.ps,width=0.71in}
&
\psfig{figure=figs/shifter3.ps,width=0.71in}\\
(a)&(b)\\
\end{tabular}
\end{center}
%}{%
\caption[a]{The `shifter' ensembles. (a) Four samples
from the plain shifter ensemble. (b) Four corresponding samples
from the labelled shifter ensemble.}
\label{fig.shiftera}
\label{fig.shifterb}
}%
%\end{figure}
\subsection{Criticism of Hopfield networks and simple Boltzmann machines}
Up to this point we have discussed Hopfield networks and Boltzmann
machines in which all of the neurons correspond to {\dbf visible}
variables $x_i$.
The result is a probabilistic model that, when optimized, can capture
the second-order statistics of the environment.
[The second-order statistics of an ensemble $P(\bx)$ are
the expected values $\langle x_ix_j \rangle$
of all the pairwise products $x_ix_j$.]
The real world, however, often has higher-order correlations
that must be included if our description of it is to be effective.
Often the second-order correlations in themselves may carry little or
no useful information.
Consider, for example, the ensemble of binary images of chairs. We can
imagine images of chairs with various designs -- four-legged chairs,
comfy chairs, chairs with five legs
and wheels, wooden chairs, cushioned chairs, chairs with
rockers instead of legs. A child
can easily learn to distinguish these images from
images of carrots and parrots.
% , beetles and beets.
But I expect the second-order statistics of the raw data are useless for describing the
ensemble. Second-order statistics only capture whether
two pixels are likely to be in the same state as each other.
Higher-order concepts are needed to
make a good generative model of images of chairs.\index{models of images}
A simpler ensemble of images in which high-order statistics
are important
is the `\ind{shifter ensemble}', which comes in two flavours.
\Figref{fig.shiftera}a shows a few samples from the `plain shifter
ensemble'. In each image, the bottom eight pixels are a copy
of the top eight pixels, either shifted one pixel to the left,
or unshifted, or shifted one
pixel to the right. (The top eight pixels are set at random.)
This ensemble is a simple model of the visual signals from the two eyes\index{stereoscopic vision}
arriving at early levels of the brain. The signals from
the two eyes
are similar to each other but may differ by small translations
because of the varying depth of the visual world.
This ensemble is simple to describe, but its second-order statistics
convey no useful information.
The correlation between one pixel and any of the three pixels above it
is $1/3$.
The correlation between any other two pixels is zero.
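To see where the value $1/3$ comes from, here is a quick check. It assumes that the pixels are coded as $\pm 1$, that the top pixels are independent and equiprobable, and that the three shifts are equally likely; these assumptions are consistent with the description above but are not stated there. Let $x'$ be a bottom pixel and $x$ one of the three top pixels above it. With probability $1/3$ the shift makes $x'$ a copy of $x$, so $x x' = 1$; otherwise $x'$ is a copy of a different, independent top pixel, and the product averages to zero. Thus
\beq
\langle x x' \rangle = \frac{1}{3} \times 1 + \frac{2}{3} \times 0 = \frac{1}{3} .
\eeq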
% {\em (It would
% be nice here if I had an example with exactly zero correlation between
% all pixels.)}
\Figref{fig.shifterb}b shows a few samples from the `labelled shifter
ensemble'. Here, the problem has been made easier by including an extra
three neurons that label the visual image as being
an instance of either the `shift left', `no shift', or `shift right'
sub-ensemble. But even with this extra information, the ensemble is
still not learnable using second-order statistics alone.
The second-order correlation between any label neuron
and any image neuron is zero.
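A short check of this claim, again assuming $\pm 1$ pixels set at random: whichever shift $s$ is selected, each image pixel $x_i$ is still equally likely to be $+1$ or $-1$, so $\langle x_i \rangle_{s} = 0$ for every $s$. For any label neuron $c$, whose state $c(s)$ is determined by the shift,
\beq
\langle c \, x_i \rangle = \sum_{s} P(s) \, c(s) \, \langle x_i \rangle_{s} = 0 .
\eeq
The label is informative only about {\em pairs\/} of image pixels, that is, through statistics of third order.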
% Even if you know that a particular
% image is a `shift left' image,
We need models that can capture higher-order statistics of an environment.
So, how can we develop such models? One idea
% nstinct
might be
to create models that directly capture higher-order correlations,\index{correlations!high-order}
such as:
\beq
P'(\bx\given \bW,\bV,\ldots)
=\frac{1}{Z'} \exp \left( \frac{1}{2}\sum_{ij} w_{ij} x_i x_j
+ \frac{1}{6} \sum_{ijk} v_{ijk} x_i x_j x_k
+\cdots \right) .
\eeq
Such {\dem{higher-order Boltzmann machines}\/}\nocite{Sej}
are equally easy to simulate using stochastic updates,
and the learning rule for the higher-order parameters $v_{ijk}$
is equivalent to the learning rule for $w_{ij}$.
\exercisxB{2}{ex.bmderiv3}{
Derive the gradient of the log likelihood with respect to $v_{ijk}$.
}
It is possible that the \ind{spines} found on biological neurons
are responsible for detecting correlations between small numbers of
incoming signals.
However, to capture statistics of high enough order to describe the
ensemble of images of chairs well would require an unimaginable number of
terms. To capture merely the fourth-order statistics in a $128 \times 128$
pixel image, we need more than $10^7$ parameters.
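As a rough count, assuming one parameter for each distinct set of four pixels: the image has $N = 128^2 = 2^{14} = 16\,384$ pixels, so the number of fourth-order terms is
\beq
{N \choose 4} = \frac{N(N-1)(N-2)(N-3)}{24} \simeq \frac{7.2 \times 10^{16}}{24} \simeq 3 \times 10^{15} ,
\eeq
which is far beyond $10^7$; indeed the $N(N-1)/2 \simeq 1.3 \times 10^{8}$ pairwise terms alone already exceed that figure.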
So measuring moments of images is {\em not\/} a good way to describe
their underlying structure. Perhaps what we need
instead or in addition are {\dbf hidden variables},
also known to statisticians as {\dbf latent variables}.
This is the important innovation introduced by \citeasnoun{BM}.
% Hinton and Sejnowski.
The idea is that the high-order correlations among the
visible variables are described by including extra hidden variables and sticking
to a model that has only second-order interactions between its variables;
the hidden variables induce higher-order correlations between the visible variables.
\section{Boltzmann machine with hidden units}
% The simplest way to turn our stochastic model into
% a latent variable model is to add {\dbf hidden neurons} to it.
We now add {\dbf\ind{hidden neurons}} to our stochastic model.
These are neurons that do not correspond to observed variables;
they are free to play any role in the probabilistic model defined by
equation
(\ref{eq.bm.a}). They might actually take on interpretable
roles, effectively performing `feature extraction'.
%, or discovering the
% underlying structure of the given ensemble.
\subsection{Learning in Boltzmann machines with hidden units}
The activity rule of a Boltzmann machine with hidden units
is identical to that of the original Boltzmann machine.
%
The learning rule can again be derived by maximum likelihood,
but now we need to take into account the fact that the
states of the hidden units are unknown. We will denote the
states of the visible units by $\bx$, the states of
the hidden units by $\bh$, and the generic state of
a neuron (either visible or hidden) by $y_i$, with $\by \equiv
(\bx, \bh)$. The state of the network when
the visible neurons are clamped in state $\bx^{(n)}$ is $\by^{(n)} \equiv
(\bx^{(n)}, \bh)$.
The likelihood of $\bW$ given a single data example $\bx^{(n)}$ is
\beq
P( \bx^{(n)} \given \bW )
= \sum_{\bh} P( \bx^{(n)}, \bh \given \bW )
= \sum_{\bh} \frac{1}{Z(\bW)}
\exp\left[ \frac{1}{2} [\by^{(n)}]^{\T}
\bW \by^{(n)} \right] ,
\label{eq.bmh}
\eeq
where
\beq
Z(\bW) = \sum_{\bx,\bh} \exp\left[ \frac{1}{2} \by^{\T}
\bW \by \right] .
\eeq
\Eqref{eq.bmh} may also be written
\beq
P( \bx^{(n)} \given \bW ) = \frac{ Z_{\bx^{(n)}}(\bW)}{ Z(\bW)}
\eeq
where
\beq
Z_{\bx^{(n)}}(\bW) = \sum_{\bh}
\exp\left[ \frac{1}{2} [\by^{(n)}]^{\T}
\bW \by^{(n)} \right] .
\eeq
Differentiating the log likelihood as before, we find that the
derivative with respect to any weight $w_{ij}$ is
again the difference between a `waking' term and a `sleeping'
term,
\beq
\frac{\partial}{\partial w_{ij}}
\ln P( \{ \bx^{(n)} \}_{1}^{N} \given \bW )
= \sum_n \left\{ \left< y_i y_j \right>_{P(\bh\given \bx^{(n)},\bW)}
- \left< y_i y_j \right>_{P(\bx,\bh\given \bW)}
\right\} .
\label{eq.bmh.learn}
\eeq
The first term $\left< y_i y_j \right>_{P(\bh\given \bx^{(n)},\bW)}$ is
the correlation between $y_i$ and $y_j$ if the Boltzmann machine
is simulated with the visible variables clamped to $\bx^{(n)}$
and the hidden variables freely sampling from their conditional
distribution.
The second term $\left< y_i y_j \right>_{P(\bx,\bh\given \bW)}$
is the correlation between $y_i$ and $y_j$ when the Boltzmann machine
generates samples from its model distribution.
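For readers who want the intermediate step, here is a sketch. It assumes that $w_{ij}$ and $w_{ji}$ are treated as a single parameter, so that $\partial [ \frac{1}{2} \by^{\T} \bW \by ] / \partial w_{ij} = y_i y_j$. Since $\ln P( \bx^{(n)} \given \bW ) = \ln Z_{\bx^{(n)}}(\bW) - \ln Z(\bW)$, we need only two derivatives:
\beqan
\frac{\partial}{\partial w_{ij}} \ln Z_{\bx^{(n)}}(\bW)
&=& \frac{1}{Z_{\bx^{(n)}}(\bW)} \sum_{\bh} y_i y_j
\exp\left[ \frac{1}{2} [\by^{(n)}]^{\T} \bW \by^{(n)} \right]
= \left< y_i y_j \right>_{P(\bh\given \bx^{(n)},\bW)} , \nonumber \\
\frac{\partial}{\partial w_{ij}} \ln Z(\bW)
&=& \frac{1}{Z(\bW)} \sum_{\bx,\bh} y_i y_j
\exp\left[ \frac{1}{2} \by^{\T} \bW \by \right]
= \left< y_i y_j \right>_{P(\bx,\bh\given \bW)} , \nonumber
\eeqan
where in the first line $y_i$ and $y_j$ are components of the clamped state $\by^{(n)}$. Subtracting the second line from the first and summing over $n$ gives equation (\ref{eq.bmh.learn}).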
Hinton and Sejnowski demonstrated that non-trivial ensembles
such as the labelled shifter ensemble can be learned using a
Boltzmann machine with hidden units. The hidden units take on
the role of feature detectors that spot patterns likely to be
associated with one of the three shifts.
The Boltzmann machine is time-consuming to simulate because the
computation of the gradient of the log likelihood depends on taking
the difference of two gradients, both found by Monte Carlo
methods. So Boltzmann machines are not in widespread use. It is an
area of active research to create models that embody the same
capabilities using more efficient computations
\cite{Hinton1995,Dayan1995,HintonGhahramani97,hinton00training,HinTeh2001}.
% In my opinion, Hinton and Sejnowski's conception of the
% Boltzmann machine is an
% important step towards understanding
% how the brain works.
\section{Exercise}
\exercisxB{3}{ex.barsstripes}{
Can%
\amarginfig{c}{
\begin{center}
\begin{tabular}{c}
\mbox{\epsfbox{metapost/bm.1}}\\[0.05in]
\mbox{\epsfbox{metapost/bm.2}}\\[0.05in]
\mbox{\epsfbox{metapost/bm.3}}\\[0.05in]
\mbox{\epsfbox{metapost/bm.4}}\\
\end{tabular}
\end{center}
\caption[a]{Four samples from the `bars and stripes' ensemble. Each sample
is generated by first picking an orientation, horizontal or vertical;
then, for each row of spins in that orientation (each bar or stripe respectively),
switching all spins on with probability $\dhalf$.}
\label{fig.barstripe}
}
the `bars and stripes' ensemble (\figref{fig.barstripe}) be learned by a Boltzmann
machine with no hidden units? [You may be surprised!]
}
\dvips
%\chapter{Supervised learning in multilayer networks}
\chapter{Supervised Learning in Multilayer Networks}
% see also /home/mackay/_doc/thesis/summary.tex and ./mlpold.tex
%\section{The Backpropagation algm}
\section{Multilayer perceptrons}
No course on neural networks could be complete without a
discussion of supervised multilayer networks,
also known as backpropagation networks.
The {\dbf multilayer perceptron\/} is a feedforward network.
It has input neurons, hidden neurons and output neurons.
The hidden neurons may be arranged in a sequence of layers.
The most common multilayer perceptrons have a single hidden
layer, and are known as `two-layer' networks, the number `two'
counting the number of layers of neurons not including the inputs.
%\section{Neural networks as probabilistic models}
%\label{sec.nn_as_pm}
Such a feedforward network defines a nonlinear parameterized mapping
from an input $\bx$ to an output $\by = \by(\bx;\bw,\A)$. The output
is a continuous function of the input and of the parameters $\bw$;
the architecture of the net, \ie, the functional form of the mapping,
is denoted by $\A$. {Feedforward networks can be `trained' to perform
regression and classification tasks.}
\subsection{Regression networks}
\marginfig{
\begin{center}
\mbox{\psfig{figure=figs/network.eps,width=2in}}
\end{center}
\caption[abbr]{A typical two-layer network, with six inputs, seven
hidden units, and three outputs. Each line represents one weight.}
\label{fig.net}
\label{fig.two.layer.net}
}
In the case of a regression problem, the mapping for a network with one
hidden layer may have the form:
%
\beq
\mbox{Hidden layer:} \:\:\:\:\:
a_j^{(1)} = \sum_l w_{jl}^{(1)} x_l + \theta^{(1)}_j ; \:\:\:\:\:
h_j = f^{(1)} ( a_j^{(1)} )
%h_j = f^{(1)}\left( a_j^{(1)} \right)
\label{two.layer.net.a}
\eeq
\beq
\mbox{Output layer:}\:\:\: \:\:
a_i^{(2)} = \sum_j w_{ij}^{(2)} h_j + \theta^{(2)}_i ; \:\:\: \:\:
y_i = f^{(2)} ( a_i^{(2)} )
\label{two.layer.net}
\eeq
where, for example, $f^{(1)}(a) = \tanh (a)$, and $f^{(2)}(a) = a$.
Here $l$ runs over the inputs $x_1 ,\ldots, x_L$, $j$ runs over the
hidden units, and $i$ runs over the outputs. The `weights' $w$ and
`biases' $\theta$ together make up the parameter vector $\bw$. The
nonlinear \ind{sigmoid} function $f^{(1)}$ at the hidden layer gives the
neural network greater computational flexibility than a standard
\ind{linear regression} model. Graphically, we can represent the neural
network as a set of layers of connected neurons (\figref{fig.two.layer.net}).
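As a small worked count (the symbol $I$ for the number of outputs is notation assumed here, alongside $L$ inputs and $H$ hidden units): the network has $LH + HI$ weights and $H + I$ biases. For the network of \figref{fig.two.layer.net}, with $L=6$, $H=7$ and $I=3$,
\beq
\underbrace{6 \times 7 + 7 \times 3}_{\mbox{\small weights}}
+ \underbrace{7 + 3}_{\mbox{\small biases}} = 63 + 10 = 73
\eeq
parameters make up $\bw$, of which the 63 weights are the lines drawn in the figure.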
% This is called a two layer network because
% it has two layers of weights. In terms of layers of neurons, it has one input
% layer, one hidden layer and one output layer.
\subsection{What sorts of functions can these networks implement?}
Just as we explored the weight space of
the single neuron in \chapterref{ch.single.neuron.class},
examining the functions it could produce, let us
% regression
explore the weight space of a multilayer network.
In
figures \ref{movie1}
and
\ref{newmovie}
I take a network with one input and one output%
\amarginfig{b}{
\begin{center}
\hspace*{-0.12in}\raisebox{-10pt}{%
\mbox{\psfig{figure=\handfigs/movie2.ps,width=2.24in,angle=-90}}}
\end{center}
%}{%
\caption[abbrev]{{Samples from the prior over functions
of
a one-input network.}
%
For each of a sequence
of values of $\sigma_{\rm bias} = $
8, 6, 4, 3, 2, 1.6, 1.2, 0.8, 0.4,
0.3, 0.2, and $\sigma_{\rm in} = 5 \siguW_{\rm bias}$,
one random function is shown. The other hyperparameters
of the network were $H= 400$, $\siguW_{\rm out}=0.05$.
% The larger values of
% $\siguW_{\rm in}$
%% and $\siguW_{\rm bias}$
% produce the more complex
% functions with more fluctuations.
}
\label{movie1}
}%
\begin{figure}
\figuremargin{\small%
\begin{center}
\mbox{\raisebox{0.421in}{
\setlength{\unitlength}{1.752pt}
\begin{picture}(80,50)(-12,-4)
\put(0,20){\line(1,0){40}}
\put(43,20){\makebox(0,0)[l]{Hidden layer}}
\put(43,0){\makebox(0,0)[l]{Input}}
\put(43,40){\makebox(0,0)[l]{Output}}
\put(0,20){\circle*{3}}
\put(5,20){\circle*{3}}
\put(10,20){\circle*{3}}
\put(15,20){\circle*{3}}
\put(20,20){\circle*{3}}
\put(40,20){\circle*{3}}
%
\put(0,20){\line(1,1){20}}
\put(20,40){\line(1,-1){20}}
\put(20,40){\circle*{3}}
\put(20,44){\makebox(0,0)[b]{$y$}}
\put(0,20){\line(1,-1){20}}
\put(20,0){\line(1,1){20}}
\put(20,0){\circle*{3}}
\put(20,-4){\makebox(0,0)[t]{$x$}}
\put(0,0){\line(0,1){20}}
\put(0,0){\line(2,1){40}}
\put(0,0){\circle*{3}}
\put(0,-3){\makebox(0,0)[t]{$1$}}
\put(-5,5){\vector(1,0){8}}
\put(-6,5){\makebox(0,0)[r]{$\sigbias$}}
\put(35,5){\vector(-1,0){14}}
\put(36,5){\makebox(0,0)[l]{$\sigin$}}
\put(35,30){\vector(-1,0){14}}
\put(36,30){\makebox(0,0)[l]{$\sigout$}}
\end{picture}
}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\mbox{\setlength{\unitlength}{1.12mm}
\begin{picture}(70,50)(0,0)
% !generate33.bkp genspec3 400 484 8 4.0 0.5
% sigma_{\mbox{bias}} = 4 , \sigma_{\mbox{in}} = 8, \sigma_{\mbox{out}}
% = 0.5
\put(0,0){\makebox(0,0)[bl]{ \psfig{figure=\handfigs/one_sample.ps,%
width=78.4mm,angle=-90}}}
% 70 * unitlength
\put(33,10){\makebox(0,0)[bl]{\footnotesize$\sim\sigbias/\sigin$}}
\put(20,30){\makebox(0,0)[bl]{\footnotesize$\sqrt{H} \sigout$}}
\put(36,32.5){\makebox(0,0)[tl]{\footnotesize$\sim 1/\sigin$}}
% this last one needs tweak, done Wed 18/7/01
\end{picture}
}}
\end{center}
}{%
\caption[abbrev]{ {Properties of a function produced by a random network.}
The vertical scale of a typical function produced by
the network with random weights is of order
$\sqrt{H} \sigout$;
% that's 10
the horizontal range in which the function varies
significantly is of order
$\sigbias/\sigin$;
% which is 0.5
and the shortest horizontal
length scale is of order $1/\sigin$.
% that's 0.125
%
The function shown was produced by making
a random network with $H=400$ hidden units, and Gaussian weights
% were generated
with $\sigbias = 4$,
$\sigin = 8$, and $\sigout = 0.5$.
}
\label{newmovie}
}%
\end{figure}
and a
large number $H$ of hidden units, set the biases and weights
$\theta^{(1)}_j$, $w_{jl}^{(1)}$, $\theta^{(2)}_i$ and $w_{ij}^{(2)}$ to
random values, and plot the resulting function $y(x)$. I set the hidden unit
biases $\theta^{(1)}_j$ to random values from a Gaussian with zero mean
and standard deviation $\sigma_{\rm bias}$; the input to hidden
weights $w_{jl}^{(1)}$ to random values with standard deviation
$\sigma_{\rm in}$; and the bias and output weights $\theta^{(2)}_i$ and
$w_{ij}^{(2)}$ to random values with standard deviation $\sigma_{\rm
out}$.
The sort of functions that we obtain depends on the values of
$\sigma_{\rm bias}$, $\sigma_{\rm in}$ and $\sigma_{\rm out}$\nocite{Radford_book}.
As the weights and biases are made
bigger we obtain more complex functions with more features and a
greater sensitivity to the input variable.
The vertical scale of a typical function produced by
the network with random weights is of order
$\sqrt{H} \sigout$;
% that's 10
the horizontal range in which the function varies
significantly is of order
$\sigbias/\sigin$;
% which is 0.5
and the shortest horizontal
length scale is of order $1/\sigin$.
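For the function shown in figure \ref{newmovie}, with $H=400$, $\sigbias = 4$, $\sigin = 8$ and $\sigout = 0.5$, these three scales evaluate to
\beq
\sqrt{H} \sigout = 20 \times 0.5 = 10 , \:\:\:\:
\sigbias/\sigin = 4/8 = 0.5 , \:\:\:\:
1/\sigin = 0.125 .
\eeq
The first scale can be motivated by a one-line argument, offered here as a sketch that assumes the output weights and bias are independent with zero mean and variance $\sigout^2$. For a fixed input, $y = \theta^{(2)} + \sum_j w^{(2)}_j h_j$, and $|h_j| \leq 1$ for tanh hidden units, so
\beq
\langle y^2 \rangle = \sigout^2 \Bigl( 1 + \sum_{j} \langle h_j^2 \rangle \Bigr)
\leq (H+1) \, \sigout^2 .
\eeq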
Radford \citeasnoun{Radford_book} has also shown that in the limit
as $H \rightarrow \infty$ the statistical properties of the functions
generated by randomizing the weights are independent of the number of
hidden units; so, interestingly, the complexity of the functions becomes
independent of the number of parameters in the model. What determines
the complexity of the typical functions is the characteristic
magnitude of the weights. Thus we anticipate that when we fit these
models to real data, an important way of controlling the complexity of the
fitted function will be to control the characteristic magnitude of the
weights.
\amarginfig{b}{
\begin{center}
\hspace*{-0.4in}\mbox{\psfig{figure=\handfigs/io.2d.a.1.ps,%
width=3in,height=2.0in,angle=-90}}\\[-0.3in]
\end{center}
%}{%
\caption[abbrev]{{One sample from the prior of
a two-input network}
with
$\{ H,\siguW_{\rm in},\siguW_{\rm bias},\siguW_{\rm out} \} =
\{400, 8.0, 8.0, 0.05\}$.
}
\label{movie3}
}%
\Figref{movie3} shows one typical function produced by a network
with two inputs and one output. This should be contrasted with the
function produced by a traditional linear regression model, which
is a flat plane. Neural networks can create functions
with more complexity than a linear regression.
%
% Wed 25/8/04 third printing
% to try to force same pagination as printing 2,
% add meskip
\medskip
\medskip
\section{How a regression network is traditionally trained}
This network is trained using a data
set $D=\{ \bx^{(n)}, \bt^{(n)} \}$ by adjusting $\bw$ so as to
minimize an error function,
\eg,
\beq
E_D(\w) = \frac{1}{2} \sum_n \sum_i \left( t^{(n)}_i - y_i(\bx^{(n)};\bw)
\right)^2.
\label{E_D}
\eeq
This objective function is a sum of terms, one for
each input/target pair $\{\bx,\bt\}$, measuring
how close the output $\by(\bx;\bw)$ is to the target $\bt$.
This minimization is based on repeated evaluation of the gradient of
$E_D$.\index{learning algorithms!multilayer perceptron}
This gradient can be efficiently computed using
the {\dbf\ind{backpropagation}} algorithm \cite{backprop}, which
uses the \ind{chain rule} to find the derivatives.\index{learning algorithms!backpropagation}
Often, \ind{regularization} (also known as \ind{weight decay}) is included,
modifying the
objective function to:
\beq
M(\w) = \b E_D + \a E_W
\eeq
where, for example, $E_W= \frac{1}{2} \sum_i w_i^2$. This additional
term favours small values of $\bw$ and
decreases the tendency of a model to \index{overfitting}overfit noise in the
\ind{training data}.
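To make the name `weight decay' concrete, here is a sketch of one gradient descent step on $M$, assuming a learning rate $\eta$ (a symbol not used above) and the quadratic $E_W$ just given. Since $\partial E_W / \partial w_i = w_i$,
\beq
w_i \rightarrow w_i - \eta \frac{\partial M}{\partial w_i}
= (1 - \eta \a) \, w_i - \eta \b \frac{\partial E_D}{\partial w_i} ,
\eeq
so that, for small $\eta$ and in the absence of any gradient from the data, each weight decays geometrically towards zero.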
\citeasnoun{backprop} showed that \ind{multilayer perceptron}s can be trained,
by \ind{gradient descent} on $M(\w)$, to discover solutions to non-trivial
problems such as deciding whether an image is symmetric or not.
These networks have been successfully applied to
real-world tasks as varied as pronouncing English text\index{reading aloud}
\cite{nettalk} and \ind{focus}sing
multiple-\ind{mirror} \ind{telescope}s \cite{Angel}.%
\index{Angel, J. R. P.}% and P. Wizinowich and M. Lloyd-Hart and D. Sandler
\section{Neural network learning as inference}
The neural network learning process above can be given the following
probabilistic interpretation. [Here we repeat and generalize the
discussion of \chapterref{ch.single.neuron.bayes}.]
The \ind{error function} is interpreted
as defining a noise model. $\b E_D$ is
the negative log likelihood:
\beq
P(D \given \w,\b,\H) = \frac{1}{Z_D(\b)} \exp (-\b E_D ) .
\label{basic.like}
\eeq
Thus, the use of the sum-squared error $E_D$ (\ref{E_D}) corresponds
to an assumption of Gaussian noise on the target variables, and the
parameter $\b$ defines a noise level $\sigma^2_{\nu} = 1/\beta$.
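Spelling this out, under the assumption that the data set contains $N$ examples, each with $I$ real-valued targets ($I$ is notation introduced here):
\beq
\frac{1}{Z_D(\b)} \exp (-\b E_D )
= \prod_{n} \prod_{i} \sqrt{\frac{\b}{2\pi}}
\exp\left( - \frac{\b}{2} \bigl( t^{(n)}_i - y_i(\bx^{(n)};\bw) \bigr)^2 \right) ,
\:\:\: \mbox{so} \:\:\:
Z_D(\b) = \left( \frac{2\pi}{\b} \right)^{NI/2} .
\eeq
Each factor is a Gaussian density for $t^{(n)}_i$ with mean $y_i(\bx^{(n)};\bw)$ and variance $1/\b$.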
Similarly the regularizer is interpreted in terms of a log prior probability
distribution over the parameters:
\beq
P(\w \given \a,\H) = \frac{1}{Z_W(\a)} \exp (-\a E_W ) .
\label{basic.prior}
\eeq
If $E_W$ is quadratic as defined above, then the corresponding prior
distribution is a Gaussian with variance $\sigW^2 = 1/\a$. The
probabilistic model $\H$ specifies the \ind{architecture} $\A$ of the
\ind{network}, the \ind{likelihood} (\ref{basic.like}), and the \ind{prior}
(\ref{basic.prior}).
The objective function $M(\w)$ then corresponds to the {\dbf\ind{inference}\/}
of the parameters $\w$, given the data:
\beqan
\label{level1}
P(\w \given D,\a,\b,\H) &=& \frac{ P(D \given \w,\b,\H) P(\w \given \a,\H) }{ P(D \given \a,\b,\H) } \\
&=& \frac{1}{Z_M} \exp( - M(\w) ).
\label{level1b}
\eeqan
The $\w$ found by (locally) minimizing $M(\w)$ is then interpreted as
the (locally) most probable parameter vector, $\wmp$.
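One consequence follows in one line from (\ref{basic.like}) and (\ref{basic.prior}) and is worth writing down. The product of likelihood and prior is $\exp( -\b E_D - \a E_W ) / ( Z_D(\b) Z_W(\a) )$, so comparing the two expressions for the posterior probability gives
\beq
\frac{1}{Z_M} = \frac{1}{Z_D(\b) \, Z_W(\a) \, P(D \given \a,\b,\H)}
\:\:\: \Rightarrow \:\:\:
P(D \given \a,\b,\H) = \frac{Z_M(\a,\b)}{Z_D(\b) \, Z_W(\a)} ,
\eeq
where $Z_M(\a,\b) = \int \exp( - M(\w) ) \, {\rm d}\w$. The normalizing constant of the posterior thus determines the probability of the data given the control parameters.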
The interpretation of $M(\w)$ as a negative log probability adds little new at
this stage. But new tools will emerge when we proceed to other
inferences.
%
First, though, let us establish the probabilistic interpretation
of classification networks, to which the same tools apply.
\subsection{Binary classification networks}
If the targets $t$ in a data set are binary classification labels
$(0,1)$, it is natural to use a neural network whose output
$y(\bx;\bw,\A)$ is bounded between 0 and 1, and is interpreted as a
probability $P(\mbox{$t\!=\!1$} \given \bx,\bw,\A)$. For example, a network
with one hidden layer could be described by the feedforward equations
(\ref{two.layer.net.a}) and (\ref{two.layer.net}),
%equations (\ref{two.layer.net}),
with $f^{(2)}(a) = 1/(1+e^{-a})$.
The error function $\b E_D$ is replaced by the negative log likelihood:
\beq
G(\bw) = - \left[
\sum_n t^{(n)} \ln y(\bx^{(n)};\bw) + (1-t^{(n)}) \ln (1-y(\bx^{(n)};\bw))
\right].
\eeq
The total objective function is then $M = G + \a E_W$. Note that
this includes no parameter $\b$ (because there is no Gaussian noise).
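A convenient property of this pairing of output function and error function can be checked in two lines. Write $a^{(n)}$ for the output activation on example $n$ and $y^{(n)} \equiv y(\bx^{(n)};\bw)$, and use $\partial y / \partial a = y(1-y)$ for the logistic function:
\beq
\frac{\partial G}{\partial a^{(n)}}
= - \left[ \frac{t^{(n)}}{y^{(n)}} - \frac{1-t^{(n)}}{1-y^{(n)}} \right]
y^{(n)} ( 1 - y^{(n)} )
= y^{(n)} - t^{(n)} .
\eeq
So the error signal at the output is simply the difference between the predicted probability and the target.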
\subsection{Multi-class classification networks}
For a multi-class classification problem, we can represent the targets
by a vector, $\bt$, in which a single element is set to 1, indicating
the correct class, and all other elements are set to 0.
%
% If there are several classes and each target vector $\bt$ consists of
% one `1' and all other entries `0', then
In this case it is appropriate to use a `softmax' network\nocite{Bridle}
having coupled outputs which sum to one
and are interpreted as class probabilities $y_i =
P(\mbox{$t_i\!=\!1$} \given \bx,\bw,\A)$. The last part of equation (\ref{two.layer.net}) is
replaced by:
%
\beq
y_i = \frac{ e^{a_i} }{ \displaystyle \sum_{i'} e^{a_{i'}} } .
%y_i = \frac{\exp(a_i)}{\sum_{i'} \exp(a_{i'}) } .
\label{soft.max.a}%\label{soft.max}
\eeq
%
The negative log likelihood in this case is
\beq
G(\bw) = - \sum_n \sum_i t^{(n)}_i \ln y_i(\bx^{(n)};\w) .
\eeq
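The same simplification holds for the softmax, as a quick derivation shows. From (\ref{soft.max.a}), $\partial y_i / \partial a_k = y_i ( \delta_{ik} - y_k )$, where $\delta_{ik}$ is the Kronecker delta; so for a single example, dropping the superscript $(n)$ and using $\sum_i t_i = 1$,
\beq
\frac{\partial G}{\partial a_k}
= - \sum_i \frac{t_i}{y_i} \, y_i ( \delta_{ik} - y_k )
= - t_k + y_k \sum_i t_i
= y_k - t_k .
\eeq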
As in the case of the regression network, the minimization
of the objective function $M(\w) = G + \a E_W$ corresponds to
an inference of the form (\ref{level1b}).
% (equation (\ref{level1b})).
A variety of useful results
can be built on this interpretation.
%The results will refer to regression models; the corresponding results
%for classification models are obtained by replacing $\b E_D$ by $G$,
%and $Z_D(\b)$ by 1.
\section{Benefits of the Bayesian approach to supervised feedforward neural networks}
From the statistical perspective,
supervised neural networks are nothing more than
nonlinear curve-fitting devices.
\begin{figure}[tbp]
\figuremargin{\footnotesize%
\begin{center}
\begin{tabular}{ll}
\begin{tabular}{l}
\makebox[0.0in]{(a)} \psfig{figure=\figsinter/sc_a1.blank.ps,%
width=4.96cm,height=3.78cm,angle=-90,%
bbllx=2.4cm,bblly=4.3cm,%
bburx=17cm,bbury=26.4cm}
\\
\makebox[0.0in]{(b)} \psfig{figure=\figsinter/sc_a2b.blank.ps,%
width=4.96cm,height=3.78cm,angle=-90,%
bbllx=2.4cm,bblly=4.3cm,%
bburx=17cm,bbury=26.4cm}
\\
\makebox[0.0in]{(c)} \psfig{figure=\figsinter/sc_a3.blank.ps,%
width=4.96cm,height=3.78cm,angle=-90,%
bbllx=2.4cm,bblly=4.3cm,%
bburx=17cm,bbury=26.4cm} \\
\end{tabular}
&
\begin{tabular}{l}
\mbox{(d)$\,$}\psfig{figure=\figs/test_train.eps,%
width=2.3in,height=1.95in,angle=90}
\\
\mbox{(e)$\,$}\psfig{figure=\figs/probability.eps,%
width=2.3in,height=1.95in,angle=90}
\\
\end{tabular}
\end{tabular}
\end{center}
}{%
\caption[abbrev]{\index{optimization!of model complexity}{Optimization} of model \ind{complexity}.
Panels (a--c) show a radial basis function model
interpolating a simple data set with one input variable and
one output variable. As the regularization constant is varied
to increase the complexity of the model (from (a) to (c)), the
interpolant is able to fit the training data increasingly
well, but beyond a certain point the generalization ability
(test error) of the model deteriorates. Probability theory
allows us to optimize the control parameters without needing a
test set.
}
\label{figmlp1}
}%
\end{figure}
Curve fitting is not a trivial task, however.
The effective complexity of an interpolating model is of crucial
importance, as illustrated
in figure \ref{figmlp1}. Consider a control parameter that
influences the complexity of a model, for example
a regularization constant $\a$ (weight decay parameter).
As the control parameter is varied to increase
the complexity of the model (descending from figure \ref{figmlp1}{a--c} and
going from left to right across figure \ref{figmlp1}d), the best fit to
the {\dbf training\/} data that the model can achieve becomes
increasingly good. However, the empirical performance of the model,
the {\dbf test error}, first decreases then {increases again}.
% beyond which the model is too complex.
{\em An over-complex model overfits the data and generalizes poorly.}
This problem may also complicate the choice of architecture
in a multilayer perceptron, the radius of the basis functions
in a radial basis function network, and the choice of the input variables
themselves in any multidimensional regression problem.
Finding values for model control parameters that are appropriate for
the data is therefore an important and non-trivial problem.
The {\dbf \ind{overfitting}
problem\/} can be solved by using a Bayesian approach to control model
complexity.
% Bayesian model comparison can be used for example
% to optimize weight decay rates, and to infer automatically which
% are the relevant input variables for a problem.
% An important message is illustrated
% in figure \ref{figmlp1}$e$.
If we give a probabilistic interpretation to the model, then we can
evaluate the \ind{evidence} for alternative values of the control
parameters. As was explained in \chref{ch.occam},
over-complex
models turn out to be less probable, and the evidence
$P({\rm Data} \given \mbox{Control
Parameters})$ can be used as an objective function for optimization
of model control parameters (figure \ref{figmlp1}e).
The setting of $\a$ that maximizes the evidence
is displayed in figure \ref{figmlp1}b.
Bayesian \ind{optimization} of model control parameters has four important
advantages. (1) No `test set' or `validation set' is involved, so
all available training data can be devoted to both model fitting and model
comparison. (2) Regularization constants can be optimized on-line,
\ie, simultaneously with the optimization of ordinary model
parameters. (3) The Bayesian objective function is not noisy, in
contrast to a \ind{cross-validation} measure. (4) The gradient of the
evidence with respect to the control parameters can be evaluated,
making it possible to simultaneously optimize a large number of
control parameters.
% \subsection{Further advantages of a Bayesian approach to supervised neural networks}
% Bayesian probability theory provides a unifying framework for
% data modeling which offers several benefits.
%
Probabilistic modelling also handles
{\dbf uncertainty\/} in a natural manner. It offers a unique prescription, {\dbf
marginalization}, for incorporating uncertainty about parameters into
predictions; this procedure yields better predictions, as
we saw in \chapterref{ch.single.neuron.bayes}.
\Figref{fig.eb} shows error bars
on the predictions of a trained neural network.
%\begin{figure}
\marginfig{\begin{center}
\mbox{\hspace*{-0.32in}\psfig{figure=\handfigs/error_bar_net.ps,%
width=6cm,angle=-90,height=3.14cm}}
\end{center}
\caption[abbrev]{ {Error bars on the predictions of a trained regression network.}
The solid line gives the predictions of the best fit
parameters of a multilayer perceptron trained on the data
points. The error bars (dotted lines) are
those produced by the uncertainty of the parameters
$\bw$. Notice that the error bars become larger where the data
are sparse. }
\label{fig.eb}
}
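In symbols, marginalization replaces the single prediction made with $\wmp$ by an average over the posterior distribution (\ref{level1}). For a new input $\bx^{(N+1)}$ with target $\bt^{(N+1)}$ (the superscript $N\!+\!1$ is notation assumed here),
\beq
P( \bt^{(N+1)} \given \bx^{(N+1)}, D, \a,\b,\H )
= \int P( \bt^{(N+1)} \given \bx^{(N+1)}, \w,\b,\H ) \,
P( \w \given D,\a,\b,\H ) \: {\rm d}\w .
\eeq
The spread of this predictive distribution has two sources: the noise on the targets, and the uncertainty about $\w$ that remains after seeing the data.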
\subsection{Implementation of Bayesian inference}
As was mentioned in \chref{ch.single.neuron.bayes},
Bayesian inference for multilayer networks
may be implemented by Monte Carlo sampling, or
by deterministic methods employing Gaussian approximations \cite{Radford_book,MacKay92b}.
% Sophisticated
% Monte Carlo methods that make use of gradient information
% have been developed by \citeasnoun{Radford_book}.
% Methods based on Gaussian approximations to the posterior
% distribution have been developed by \citeasnoun{MacKay92b}.
% The methods of section \ref{bugs} employ Monte Carlo methods.
%
%\subsection*{Summary}
Within the Bayesian framework for
data modelling,
it is easy to improve our probabilistic models.
For example, if we believe that some input
variables in a problem may be irrelevant to the
predicted quantity, but we don't know which,
we can define a new model with multiple hyperparameters
that captures the idea of uncertain input variable
relevance \cite{MacKay94:pred_ashrae,Radford_book,MacKay95:network};
these models then infer automatically from the data which
are the relevant input variables for a problem.
\section{Exercises}
% _doc/dirichlet/mutinfo.tex
\ExercisxC{4}{ex.mutinfoclassifier}{
%{\sf {How to Measure a Classifier's Quality}}.
{\sf {How to measure a classifier's quality}}.
You've just written a new classification algorithm and
want to measure how well it performs on a test set, and
compare it with other classifiers.
What performance measure should you use?
There are several standard answers.
Let's assume the classifier gives an output $y(\bx)$, where $\bx$ is
the input, which we won't discuss further, and that the
true target value is $t$. In the simplest discussions of classifiers,
both $y$ and $t$ are binary variables, but you might
care to
consider cases where $y$ and $t$ are more general objects also.
%\begin{description}
%\item[Error rate]
The most widely used measure of performance on a test set is
the {\dem{error rate}\/}\index{classifier}
-- the fraction of {\em misclassifications\/} made by the
classifier. This measure forces the classifier to give a 0/1
output and ignores any additional information that the classifier
might be able to offer -- for example, an indication of the
firmness of a prediction.
% \end{description}
Unfortunately, the error rate does not necessarily measure how {\em informative\/}
a classifier's output is.
Consider frequency tables showing the
joint frequency of the 0/1 output of a classifier (horizontal axis), and
the true 0/1 variable (vertical axis). The numbers that we'll show
are percentages.
The error rate $e$ is the sum of the two off-diagonal numbers,
which we could call the false positive rate $e_+$ and the false negative
rate $e_-$.
Of the following three classifiers, A and B have the same error rate
of 10\% and C has a greater error rate of 12\%.
\newcommand{\fourthreetable}[9]{\begin{tabular}[b]{lc|ccc}
{} & {$y$} & {#1} & {#2} & {#3} \\ %\cline{1-1}
\multicolumn{2}{l|}{$t$} & & & \\ \hline
\multicolumn{2}{l|}{0} & {#4} & {#5} & {#6} \\
\multicolumn{2}{l|}{1} & {#7} & {#8} & {#9} \\ % \hline
\end{tabular}}
\newcommand{\fourfourtableB}[9]{\begin{tabular}[b]{lc|cc}
{#1} & {#2} & {#3} & {#4} \\ %\cline{1-1}
\multicolumn{2}{l|}{#5} & & \\ \hline
\multicolumn{2}{l|}{#3} & {#6} & {#7} \\
\multicolumn{2}{l|}{#4} & {#8} & {#9} \\ % \hline
\end{tabular}}
\newcommand{\ofourthreetable}[9]{\begin{tabular}[b]{lc|c|c|c|}
{} & {$y$} & {#1} & {#2} & {#3} \\ %\cline{1-1}
\multicolumn{2}{l|}{$t$} & & & \\ \hline
\multicolumn{2}{l|}{0} & {#4} & {#5} & {#6} \\ \hline
\multicolumn{2}{l|}{1} & {#7} & {#8} & {#9} \\ \hline
\end{tabular}}
\newcommand{\ofourfourtable}[9]{\begin{tabular}[b]{lc|c|c|}
{#1} & {#2} & {#3} & {#4} \\ %\cline{1-1}
\multicolumn{2}{l|}{#5} & & \\ \hline
\multicolumn{2}{l|}{#3} & {#6} & {#7} \\ \hline
\multicolumn{2}{l|}{#4} & {#8} & {#9} \\ \hline
\end{tabular}}
\begin{realcenter}
\begin{tabular}{p{1.5in}p{1.5in}p{1.5in}}
Classifier A & Classifier B & Classifier C \\
\fourfourtableB{}{$y$}{0}{1}{$t$}{90}{0}{10}{0} &
\fourfourtableB{}{$y$}{0}{1}{$t$}{80}{10}{0}{10} &
\fourfourtableB{}{$y$}{0}{1}{$t$}{78}{12}{0}{10} \\
\end{tabular}
\end{realcenter}
But clearly classifier A, which simply guesses that the outcome
is 0 for all cases, is conveying no information at all about $t$;
whereas classifier B has an informative output: if $y=0$ then
we are sure that $t$ really is zero; and if $y \eq 1$ then there
is a 50\% chance that $t \eq 1$, as compared to the prior probability
$P(t\eq 1)=0.1$. Classifier C is slightly less informative than B, but it
is still much more useful than the information-free classifier A.
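These statements can be read straight off the tables. Treating the percentages as joint probabilities $P(y,t)$,
\beq
P(t \eq 1 \given y \eq 1) = \frac{P(y \eq 1, t \eq 1)}{P(y \eq 1)}
= \frac{10}{10+10} = 0.5 \:\: \mbox{(B)} , \:\:\:\:\:
\frac{10}{12+10} \simeq 0.45 \:\: \mbox{(C)} ,
\eeq
and for both B and C, $P(t \eq 1 \given y \eq 0) = 0$. For A, the output is always $y \eq 0$, and $P(t \eq 1 \given y \eq 0) = 10/100 = 0.1$, which is just the prior probability.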
\marginpar{\small
\begin{center}
How common sense ranks the classifiers:\\ \smallskip
$\mbox{(best)}\:\:B>C>A \:\:\mbox{(worst).}$\\ \medskip
How error rate ranks the classifiers:\\ \smallskip
$\mbox{(best)}\:\:A = B>C \:\:\mbox{(worst).}$
\end{center}
}
One way to improve on the error rate as a performance measure is
to report the pair $(e_+,e_-)$, the false positive error
rate and the false negative error rate, which
are $(0,0.1)$ and $(0.1,0)$ for classifiers A and B.
It is especially important to distinguish between these two
error probabilities in applications where the two sorts
of error have different associated costs.
However, there are a couple of problems with the `error rate pair':
\bit
\item
First, if I simply told you that classifier A
has error rates $(0,0.1)$ and B has error rates $(0.1,0)$,
it would not be immediately evident that classifier A is
actually utterly worthless. Surely we should have a performance
measure that gives the worst possible score to A!
\item
Second, if we turn to a multiple-class classification problem
such as digit recognition, then the number of types of
error increases from two to $10 \times 9=90$ -- one for each
possible confusion of class $t$ with $t'$.
It would be nice to have some sensible way
of collapsing these 90 numbers into a single rankable number
that makes more sense than the error rate.
%-- their sum --
% which we have already seen can give silly scores to worthless
% classifiers.
\eit
% {\sf Rejections as an indication of uncertainty}.
Another reason for not liking the error rate is
that it doesn't give a classifier credit for accurately
specifying its uncertainty.
Consider classifiers that
% improves on classifier B by
% splitting the original output class into a `1' class and
have three outputs available, `0', `1' and a
{\dem\ind{rejection}\/} class, `?', which indicates that the classifier is not sure.
%thinks
% the example might be a 1, but it is not sure.
% If the frequency table for these outcomes is, in percentages:
Consider classifiers D and E with the following frequency tables,
in percentages:
\begin{realcenter}
\begin{tabular}{p{1.5in}p{1.5in}}
Classifier D & Classifier E \\
\fourthreetable{0}{\hspace{0.4mm}?\hspace{0.4mm}}{\hspace{0.4mm}1\hspace{0.4mm}}{74}{10}{6}{0}{1}{9} &
\fourthreetable{0}{\hspace{0.4mm}?\hspace{0.4mm}}{\hspace{0.4mm}1\hspace{0.4mm}}{78}{6}{6}{0}{5}{5}
\end{tabular}
\end{realcenter}
Both of these classifiers have $(e_+,e_-,r)=(6\%,0\%,11\%)$.
But are they equally good classifiers? Compare
classifier E with C. The two classifiers are equivalent. E is
just C in disguise -- we could make E by taking the output
of C and tossing a coin when C says `1' in order to decide whether
to give output `1' or `?'.
So E is equal to C and thus inferior to B.
Now compare D with B.
Can you justify the suggestion
that D is a more informative classifier than B, and thus
is superior to E?
Yet D and E have the same $(e_+,e_-,r)$ scores.
% http://iris.usc.edu/Vision-Notes/bibliography/pattern586.html
\amarginfig{t}{
\begin{center}
\mbox{\psfig{figure=figs/roc.eps,width=1.5in}}
\end{center}
\caption[a]{An error-reject curve.
Some people use
the area under this curve as a
% single
measure of classifier quality.}
\label{fig.roc}
}
People often plot {\dem\ind{error-reject curves}\/} (also known as \ind{ROC} curves;
ROC stands for `\ind{receiver
operating characteristic}')
which show the total $e= (e_+ + e_-)$
versus $r$ as $r$ is allowed to vary from 0 to 1, and use these curves
to compare classifiers (\figref{fig.roc}). [In the special case of
binary classification problems, $e_+$ may be plotted versus $e_-$
instead.] But as we have seen, error rates
can be undiscerning performance measures.
Does plotting one
error rate as a function of another make this weakness
of error rates
go away?
For this exercise, either construct an explicit example demonstrating
that the {error-reject curve}, and the
area under it, are not necessarily good ways to compare
classifiers; or prove that they {\em are}.
As a suggested alternative
method for comparing classifiers, consider
the {\em mutual information\/}
between the output and the target,
\beq
 I(T;Y) \equiv H(T) - H(T\given Y) = \sum_{y,t} P(y)P(t\given y)
		\log \frac{P(t\given y)}
			{P(t)} ,
\eeq
which measures how many {\em bits\/} the classifier's output conveys about
the target.
Evaluate the mutual information for classifiers A--E above.
Investigate this performance measure and discuss whether it is
a useful one. Does it have practical drawbacks?
}
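The evaluation requested in the exercise above can be sketched numerically.
The following Python fragment is only an illustration and is not part of the
original exercise: it assumes that NumPy is available, the function name
{\tt mutual\_information} is hypothetical, and it uses the standard identity
$I(T;Y) = \sum_{t,y} P(t,y) \log_2 [ P(t,y) / ( P(t) P(y) ) ]$ with the
convention $0 \log 0 = 0$. The frequency tables are those given above for
classifiers A--E.

```python
import numpy as np

def mutual_information(freq):
    """Mutual information I(T;Y) in bits from a table of joint frequencies.

    freq[t][y] is the (unnormalized) frequency of target t with output y.
    """
    joint = np.asarray(freq, dtype=float)
    joint = joint / joint.sum()                # P(t, y)
    p_t = joint.sum(axis=1, keepdims=True)     # P(t), column vector
    p_y = joint.sum(axis=0, keepdims=True)     # P(y), row vector
    nonzero = joint > 0                        # 0 log 0 is taken to be 0
    ratio = joint[nonzero] / (p_t @ p_y)[nonzero]
    return float(np.sum(joint[nonzero] * np.log2(ratio)))

# Frequency tables from the text: rows are t = 0, 1; columns are the outputs.
tables = {
    "A": [[90, 0], [10, 0]],
    "B": [[80, 10], [0, 10]],
    "C": [[78, 12], [0, 10]],
    "D": [[74, 10, 6], [0, 1, 9]],   # outputs 0, ?, 1
    "E": [[78, 6, 6], [0, 5, 5]],    # outputs 0, ?, 1
}
scores = {name: mutual_information(tab) for name, tab in tables.items()}
for name, bits in scores.items():
    print(f"Classifier {name}: I(T;Y) = {bits:.3f} bits")
```

The ranking produced by these scores can then be compared with the
common-sense ranking and the error-rate ranking discussed above.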
%\begin{center}
%\begin{tabular}{l*{5}{c}}
%Classifier & A & B & C & D & E \\ \hline
%$H(T;Y)$ & 0 & 0.269 & 0.250 & 0.275 & 0.250 \\
%\end{tabular}
%\end{center}
\dvips
% {Gaussian processes}
\prechapter{About Chapter}
% _pgp.tex
% prechapter for
% for _gpB.tex
% \begin{abstract}
Feedforward neural networks such as multilayer perceptrons are
popular tools for nonlinear regression and classification problems.
From a Bayesian perspective, a choice of a neural network model can
be viewed as defining a prior probability distribution over
nonlinear functions, and the neural network's learning process can
be interpreted in terms of the posterior probability distribution
over the unknown function. (Some learning algorithms search for the
function with maximum posterior probability; others, such as Monte Carlo
methods, draw samples from this posterior distribution.)
In the limit of large but otherwise standard networks, \citeasnoun{Radford_book}
% Neal (1996)
has shown that the prior distribution over nonlinear functions
implied by the Bayesian neural network falls in a class of
probability distributions known as Gaussian processes. The
hyperparameters of the neural network model determine the
characteristic lengthscales of the Gaussian process. Neal's
observation motivates the idea of discarding parameterized networks
and working directly with Gaussian processes. Computations in which
the parameters of the network are optimized are then replaced by
simple matrix operations using the covariance matrix of the Gaussian
process.
In this chapter
I will review work on this idea by \citeasnoun{williams_rasmussen:96},
% Neal, Williams, Rasmussen, Barber, Gibbs and MacKay,
\citeasnoun{Neal_gp}, \citeasnoun{williams:96} and
\citeasnoun{Gibbs_MacKay97b},
and will assess whether, for
supervised regression and classification tasks, the feedforward
network has been superseded.
%\end{abstract}
\exercisxB{3}{ex.spiceupGP}{
 I regret that this chapter is rather dry. There are no
 simple explanatory examples in it, and few pictures.
This exercise asks you to create interesting pictures
to explain to yourself this chapter's ideas.
}
% My lectures on neural networks and Gaussian processes
% feature a sequence of computer demonstrations
% written in the free language {\tt octave}. The s
Source code for computer demonstrations\index{software!Gaussian processes}
written in the free language {\tt octave} is
available at:\\
{\tt http://www.inference.phy.cam.ac.uk/mackay/itprnn/software.html}.
Radford Neal's software for Gaussian processes is
available at:\\
{\verb+http://www.cs.toronto.edu/~radford/+}.
% These lecture notes are based on the
% work of \citeasnoun{Radford_book}, \citeasnoun{williams_rasmussen:96}
% and \citeasnoun{GibbsPhd}.
% \chapter{Gaussian processes}
\ENDprechapter
\chapter{Gaussian Processes \nonexaminable}
% A review of work by Neal, Williams, Rasmussen and Gibbs on
% Gaussian processes as a replacement for supervised neural networks.
%
% This is the one to use for the book.
% _gpB.tex was copied from _gp.tex may 1998
% it was used in the Bishop volume, and backed up in itp/bak.
% newcommands were moved to graveyard.
%
% Gaussian processes
%
\renewcommand{\figs}{figshandbk}%/home/mackay/handbook/figs
\newcommand{\mngfigs}{/data/tiree/mng10/tex/figs}
% Acknowledgement}
%
%tutorial
% \prechapter{About Chapter }
% _pgp.tex
% \section{Overview}
After the publication of Rumelhart, Hinton and Williams's (1986)
% \quotecite{backprop} paper
\nocite{backprop} paper\index{backpropagation}
on supervised learning in neural networks
% 1986
there was a surge of interest in the empirical modelling
of relationships in high-dimensional
data using nonlinear parametric models such as \ind{multilayer perceptron}s
and \ind{radial basis function}s.\index{Gaussian processes}
In the Bayesian interpretation
of these modelling methods, a \ind{nonlinear} function $y(\bx)$
parameterized by parameters $\bw$ is assumed to
underlie the data $\{ \bx^{(n)}, \tn \}_{n=1}^{N}$,
and the adaptation of the model
to the data
corresponds to an {\em inference\/} of the function given the data.
We will denote the set of input vectors
by $\bXN \equiv \{ \bx^{(n)} \}_{n=1}^{N}$ and the
set of corresponding
target values by the vector $\btN \equiv \{ \tn \}_{n=1}^{N}$.
The inference of $y(\bx)$ is
described by the posterior probability distribution
\beq
P( y(\bx) \given \btN , \bXN )
= \frac{
P( \btN \given y(\bx) , \bXN )
P( y(\bx) )
}{ P( \btN \given \bXN ) }
.
\label{eq.infer.y}
\eeq
Of the two terms on the right-hand side, the first,
$P( \btN \given y(\bx) , \bXN )$, is the probability
of the target values given the function $y(\bx)$, which in the case of
regression problems is often
% implicitly
assumed to be a separable Gaussian distribution; and the
second term, $P( y(\bx) )$, is the prior distribution on functions
assumed by the model. This prior is implicit in the choice
of parametric model and the choice of regularizers used during
the model fitting.
% adaptation.
The prior typically specifies that the
function $y(\bx)$ is expected to be continuous and smooth, and has
less high frequency power than low frequency power,
but the precise meaning of the prior is somewhat obscured
by the use of the parametric model.
Now,
% from the point of view of prediction of future values of $t$,
for the prediction of future values of $t$,
all that matters is the assumed prior $P( y(\bx) )$ and the assumed
noise model $P( \btN \given y(\bx) , \bXN )$
-- the parameterization of the function $y(\bx;\bw)$ is
irrelevant.
% might think too complex (inf dim) and too simple (gaussian) but snot
%
The idea of Gaussian process modelling is to place a prior $P( y(\bx) )$
directly on the space of functions, without
% bother
% discard the idea of
% representing
parameterizing $y(\bx)$. The simplest type of
prior over functions is called a Gaussian process. It can be thought
of as the generalization of a Gaussian distribution over a finite vector
space to a function
% vector
space of infinite dimension. Just as a Gaussian
distribution is
fully specified by its mean and covariance matrix, a Gaussian process
is specified by a mean and a {\dem\ind{covariance function}}. Here, the mean is
a function of $\bx$ (which we will often take to be the zero function),
and the covariance is a function $C(\bx,\bx')$ that expresses
the expected covariance between the values of the function $y$
at the points $\bx$ and $\bx'$. The function $y(\bx)$ in any one
data modelling problem is
assumed to be a {\em single\/} sample from this Gaussian distribution.
Gaussian processes are already well established models for
various spatial and temporal problems
\nocite{Ripley91}
-- for example, \ind{Brownian motion},
\ind{Langevin process}es and \ind{Wiener process}es are all examples of Gaussian
processes; \ind{Kalman filter}s, widely used to model speech waveforms,
also correspond to Gaussian process models; the
method of `\ind{kriging}'
in \ind{geostatistics} is a Gaussian process \ind{regression} method.
\subsection{Reservations about Gaussian processes}
It might be thought that it is not possible to reproduce the interesting
properties of neural network interpolation methods with
something so simple as a Gaussian distribution, but as
we shall now see, many popular nonlinear interpolation methods
are equivalent to particular Gaussian processes. (I use the
term `interpolation' to cover both the problem of `regression' -- fitting a
curve through noisy data -- and the task of fitting an interpolant
that passes exactly through the given data points.)
% -- for particular choices of the covariance function.
% Thus many parametric models are special cases of Gaussian processes.
It might also be thought that the computational complexity of
inference when we work with priors over infinite-dimensional
function spaces might be infinitely large. But by concentrating
on the joint probability distribution of the observed data and
the quantities we wish to predict, it is possible to make
predictions with resources that scale as polynomial functions of
$N$, the number of data points.
% graveyard.tex
% Fri 3/1/03
\section{Standard methods for nonlinear regression}
\label{sec2}
\subsection{The problem}
We are given $N$ data points $\bXN,\btN = \{ \bx^{(n)}, \tn \}_{n=1}^{N}$.
The inputs $\bx$ are vectors of some fixed input dimension $I$. The targets
$t$ are either real numbers, in which case the task will be a
regression or interpolation task,
or they are categorical variables, for example $t \in \{ 0, 1\}$,
in which case the task is a classification task. We will concentrate
on the case of regression for the time being.
Assuming that a function $y(\bx)$ underlies the observed data,
the task is to infer the function from the given data, and predict the function's value
-- or the value of the observation $\tNN$ -- at a new point
$\bx^{(N+1)}$.
% We will denote the set of input vectors by
% $\bXN \equiv \{ \bx^{(n)} \}_{n=1}^{N}$ and the set of target values
% by the vector $\btN \equiv \{ \tN \}_{n=1}^{N}$. (Please
% do not confuse the single observation $t_N$ with $\btN$.)
%
\subsection{Parametric approaches to the problem}
In a parametric approach to regression we express the
unknown function $y(\bx)$ in terms of a nonlinear function $y(\bx;\bw)$
parameterized by parameters $\bw$.
\exampl{ex.fbf}{{\sf Fixed basis functions.}
Using a set of basis functions $\{ \phi_h(\bx)\}_{h=1}^H$, we
can write
\beq
y(\bx ; \bw) = \sum_{h=1}^H w_h \phi_h(\bx)
.
\label{eq.fbf}
\eeq
If the basis functions are nonlinear functions of $\bx$ such
as
\ind{radial basis function}s centred at fixed points $\{ \bc_h \}_{h=1}^{H}$,
\beq
\phi_h(\bx) = \exp\left[ -\frac{(\bx - \bc_h)^2}{2 r^2} \right] ,
\label{eq.rbf.phi}
\eeq
then $y(\bx ; \bw)$ is a nonlinear function of $\bx$; however, since
the dependence of $y$ on the parameters $\bw$ is linear, we might
sometimes refer to this as a `linear' model. In neural network terms,
this model is like a multilayer network whose connections from the
input layer to the nonlinear hidden layer are fixed; only the output weights
$\bw$ are adaptive.
Other possible sets of fixed basis functions include polynomials such
as $\phi_h(\bx) = x_i^p x_j^q$ where $p$ and $q$ are integer powers that
depend on $h$.
}
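The fixed-basis-function model of \eqref{eq.fbf} and \eqref{eq.rbf.phi}
can be sketched in a few lines of Python. This is an illustration only: it
assumes one-dimensional inputs and NumPy, and the function names, the centres
and the width $r$ are hypothetical choices of mine. The final lines check
numerically that $y$ is linear in $\bw$ even though it is nonlinear in $x$.

```python
import numpy as np

def rbf_design_matrix(x, centres, r):
    """R[n, h] = phi_h(x_n) = exp(-(x_n - c_h)^2 / (2 r^2)) for 1-D inputs."""
    x = np.asarray(x, dtype=float)[:, None]
    c = np.asarray(centres, dtype=float)[None, :]
    return np.exp(-((x - c) ** 2) / (2.0 * r ** 2))

def y_fixed_basis(x, w, centres, r):
    """y(x; w) = sum_h w_h phi_h(x): nonlinear in x, linear in w."""
    return rbf_design_matrix(x, centres, r) @ np.asarray(w, dtype=float)

centres = np.linspace(-2.0, 2.0, 5)   # illustrative choice of H = 5 centres
r = 0.7                               # illustrative basis-function width
rng = np.random.default_rng(0)
w1, w2 = rng.normal(size=5), rng.normal(size=5)
x = np.linspace(-3.0, 3.0, 7)

# Linearity in the parameters: y(x; a w1 + b w2) = a y(x; w1) + b y(x; w2).
lhs = y_fixed_basis(x, 2.0 * w1 - 3.0 * w2, centres, r)
rhs = 2.0 * y_fixed_basis(x, w1, centres, r) - 3.0 * y_fixed_basis(x, w2, centres, r)
print(np.allclose(lhs, rhs))
```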
\exampl{ex.abf}{{\sf Adaptive basis functions.}
Alternatively, we might make a function $y(\bx)$ from basis functions that
depend on additional parameters included in the vector $\bw$.
In a two-layer feedforward
neural network with nonlinear hidden units and a linear output,
the function can be written
\beq
y(\bx;\bw) =
\sum_{h=1}^H w_{h}^{(2)} \tanh \left(
\sum_{i=1}^I w^{(1)}_{hi} x_i + w^{(1)}_{h0} \right)
+ w^{(2)}_{0}
% f^{(2)} \left( \right)
\label{eq.mlp}
\eeq
where $I$ is the dimensionality of the input space and the weight vector
$\bw$ consists of the input weights
$\{ w^{(1)}_{hi} \}$, the hidden unit
biases $\{ w^{(1)}_{h0} \}$, the output weights $\{ w_{h}^{(2)} \}$
and the output bias $w^{(2)}_{0}$.
In this model, the dependence of $y$ on $\bw$ is nonlinear.
}
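For comparison, here is a minimal Python sketch of the two-layer network of
\eqref{eq.mlp}. The function name {\tt mlp} and all the numerical values are
hypothetical, chosen only for illustration, and NumPy is assumed. The check at
the end illustrates that $y$ scales linearly with the output weights but not
with the input weights and hidden biases.

```python
import numpy as np

def mlp(x, W1, b1, w2, b2):
    """y = sum_h w2[h] * tanh(sum_i W1[h, i] x[i] + b1[h]) + b2."""
    hidden = np.tanh(W1 @ np.asarray(x, dtype=float) + b1)
    return float(w2 @ hidden + b2)

# Illustrative network with I = 2 inputs and H = 2 hidden units.
W1 = np.array([[1.0, -0.5], [0.3, 0.8]])   # input weights w^(1)_{hi}
b1 = np.array([0.1, -0.2])                 # hidden biases w^(1)_{h0}
w2 = np.array([1.5, -0.7])                 # output weights w^(2)_h
b2 = 0.0                                   # output bias w^(2)_0
x = np.array([0.4, 1.2])

y = mlp(x, W1, b1, w2, b2)
# Doubling the output weights doubles y (with zero output bias) ...
linear_in_w2 = np.isclose(mlp(x, W1, b1, 2 * w2, b2), 2 * y)
# ... but doubling the input weights and hidden biases does not.
linear_in_W1 = np.isclose(mlp(x, 2 * W1, 2 * b1, w2, b2), 2 * y)
print(linear_in_w2, linear_in_W1)
```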
Having chosen the parameterization,
we then infer the function $y(\bx;\bw)$ by inferring the parameters
$\bw$.
% Many methods for inferring the parameters
% can be interpreted in terms of a Bayesian model for the problem,
% in which t
The posterior probability of the parameters is
% given by
\beq
% P(\bw \given \{ \bx^{(n)}, t^{(n)} \} )
% = \frac{
% P( \{ t^{(n)} \} \given \bw , \bXN )
% P( \bw )
% }{ P( \{ t^{(n)} \} \given \bXN ) }
P(\bw \given \btN , \bXN )
= \frac{
P( \btN \given \bw , \bXN ) P( \bw )
}{ P( \btN \given \bXN ) }
.
\label{eq.infer.w}
\eeq
The factor $P( \btN \given \bw , \bXN )$
states the
% assumed
probability
% distribution
of the observed
data points when the parameters $\bw$ (and hence, the
function $y$) are known. This probability distribution
is often taken to be a separable
Gaussian, each data point $\tn$ differing from the underlying value
$y(\bx^{(n)};\bw)$ by additive noise. The factor $P(\bw)$
specifies the prior probability distribution of the parameters.
This too is often taken to be a separable Gaussian distribution.
If the dependence of $y$ on $\bw$ is nonlinear
the posterior distribution $P(\bw \given \btN , \bXN )$ is in general not
a Gaussian distribution.
The inference can be implemented in various ways.
In the Laplace method,\index{Laplace's method}
we minimize an objective
function
\beq
M(\bw) = - \ln \left[ P( \btN \given \bw , \bXN )
P( \bw ) \right]
\eeq
with respect to $\bw$, locating the locally most probable parameters,
then use the curvature of $M$, $\partial^2 M(\bw) /\partial w_i \partial w_j$,
to define error bars on $\bw$. Alternatively we can use more general
%, which
% works in cases where the noise model and the prior $P(\bw)$
% are not simple Gaussians,
Markov chain Monte Carlo techniques
to create samples from the posterior distribution
$P(\bw \given \btN , \bXN )$.
Having obtained one of these representations of the inference of
$\bw$ given the data, predictions are then made by marginalizing
over the parameters:
\beq
P( \tNN \given \btN , \bXNN )
= \int \! \d^{H} \bw \: P( \tNN \given \bw , \bx^{(N+1)} )
P(\bw \given \btN ,
\bXN ) .
\label{eq.predictive}
\eeq
If we have a Gaussian representation of
the posterior $P(\bw \given \btN , \bXN )$, then
this integral can typically be evaluated directly.
In the alternative Monte Carlo approach, which generates
$R$ samples $\bw^{(r)}$ that are intended to be samples from
the posterior distribution $P(\bw \given \btN , \bXN )$,
we approximate the predictive distribution by
\beq
P( \tNN \given \btN , \bXNN )
\simeq \frac{1}{R} \sum_{r=1}^{R}
P( \tNN \given \bw^{(r)} , \bx^{(N+1)} )
.
\label{eq.approx.mc}
\eeq
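As a hedged illustration of \eqref{eq.approx.mc}, the following Python sketch
uses a model that is linear in $\bw$, so that the posterior is exactly Gaussian
and the Monte Carlo estimate can be compared with the exact predictive density.
Everything specific here is an assumption made for the sake of the example:
the polynomial basis, the toy data, the values of $\sigma_w$ and $\sigma_{\nu}$,
and the use of exact posterior samples in place of a Markov chain.

```python
import numpy as np

def gauss_pdf(t, mean, var):
    """Density of a one-dimensional Gaussian with the given mean and variance."""
    return np.exp(-0.5 * (t - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def phi(x):
    """Hypothetical fixed basis with H = 3 functions: 1, x, x^2."""
    return np.array([1.0, x, x ** 2])

rng = np.random.default_rng(0)
sigma_w, sigma_nu = 1.0, 0.3                  # illustrative prior and noise scales

# Toy data set (made up for this sketch).
x_data = np.array([-1.0, -0.3, 0.2, 0.9])
t_data = np.array([0.8, 0.1, 0.0, 0.7])
R_mat = np.stack([phi(x) for x in x_data])    # N x H matrix of basis values

# Exact Gaussian posterior over w for this linear model.
A = R_mat.T @ R_mat / sigma_nu ** 2 + np.eye(3) / sigma_w ** 2
Sigma = np.linalg.inv(A)
Sigma = 0.5 * (Sigma + Sigma.T)               # remove rounding asymmetry
m = Sigma @ R_mat.T @ t_data / sigma_nu ** 2

# Monte Carlo approximation of the predictive density at (x_new, t_new).
x_new, t_new = 0.5, 0.3
w_samples = rng.multivariate_normal(m, Sigma, size=20000)   # the samples w^(r)
mc = np.mean(gauss_pdf(t_new, w_samples @ phi(x_new), sigma_nu ** 2))

# Exact predictive density, obtained by direct integration over w.
exact = gauss_pdf(t_new, phi(x_new) @ m,
                  sigma_nu ** 2 + phi(x_new) @ Sigma @ phi(x_new))
print(mc, exact)
```

The two printed numbers should agree to within the Monte Carlo error, which
shrinks as $1/\sqrt{R}$.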
\subsection{Nonparametric approaches}
In nonparametric\index{nonparametric data modelling}
methods, predictions are obtained without explicitly
parameterizing the unknown function $y(\bx)$;
$y(\bx)$ lives in the infinite-dimensional space of all continuous
functions of $\bx$.
One well known nonparametric approach to the regression problem
is the \ind{spline} smoothing method \cite{Kimeldorf_Wahba}.
% Splines can be defined in various ways.
% One way of defining
A spline solution to
a one-dimensional regression problem can be described as follows:
we define the estimator of $y(\bx)$ to be
the function $\hat{y}(\bx)$ that minimizes the
functional
\beq
M( y(x) ) = \half \, \b \sum_{n=1}^N
( y(x^{(n)}) - \tn )^2
+ \half \, \a \int \d x \: [y^{(p)}(x)]^2 ,
\label{eq.M.splines}
\eeq
where $y^{(p)}$ is the $p$th derivative of $y$ and $p$ is a positive number.
If $p$ is set to 2
then the resulting function $\hat{y}(\bx)$ is a cubic spline,
that is, a piecewise cubic function that has `knots'
 -- discontinuities in its third derivative -- at the data points
$\{x^{(n)} \}$.
This estimation method can be interpreted as a Bayesian method
% \cite{Kimeldorf_Wahba}
by identifying the prior for the function $y(x)$ as:
% It can be made proper by adding terms
% corresponding to boundary conditions to (\ref{tractable.eq1a}).
\beq
\ln P(y(x) \given \a) = - \half \, \a \int \d x \: [y^{(p)}(x)]^2 + {\rm const} ,
\label{tractable.eq1a}
\eeq
and the probability of the data measurements $\btN = \{\tn\}_{n=1}^N$
assuming independent Gaussian noise as:
\beq
\ln P \left(\left. \btN \rightgiven y(x), \b \right)
= - \half \, \b \sum_{n=1}^N
( y(x^{(n)}) - \tn )^2
\! + {\rm const} .
\label{tractable.eq1b}
\eeq
[The constants in equations (\ref{tractable.eq1a}) and
(\ref{tractable.eq1b}) are functions of $\a$ and $\b$
respectively. {Strictly the prior (\ref{tractable.eq1a}) is improper
since addition of an arbitrary polynomial of degree $(p-1)$ to $y(x)$
is not constrained. This impropriety is easily rectified by the
addition of $p$ appropriate terms to (\ref{tractable.eq1a}).}]
Given this interpretation of the functions in \eqref{eq.M.splines},
$M( y(x) )$ is equal to minus the log of the posterior probability $P(
y(x) \given \btN,\a,\b)$, within an additive constant, and the splines
estimation procedure can be interpreted as yielding a Bayesian \index{maximum {\em a posteriori}}{MAP}
estimate. The Bayesian perspective allows us additionally to put error
bars on the splines estimate and to draw typical samples from the posterior
distribution, and it gives an automatic method for inferring the
hyperparameters $\a$ and $\b$.\nocite{MacKay92a}
% is an important one,
% whose Bayesian solution is
% as reviewed in \citeasnoun{MacKay92a}.
\subsection{Comments}
\subsubsection{Splines priors are Gaussian processes}
The prior distribution defined in \eqref{tractable.eq1a} is our first
example of a Gaussian process.
Throwing mathematical precision to the winds, a Gaussian process can be
defined as
a probability distribution on a space of functions $y(x)$
that can be written in the form
\beq
P( y(x) \given \mu(x) , A )
= \frac{1}{Z} \exp\left[ -
\half (y(x)-\mu(x))^{\T}A(y(x)-\mu(x)) \right],
\label{eq.impreciseGP}
\eeq
where $\mu(x)$ is the mean function
% of the distribution
and $A$ is a
linear operator, and where the inner product of two functions
$y(x)^{\T}z(x)$ is defined by, for example, $\int \d x \: y(x)z(x)$.
% , with $w(x)$ being a measure on $x$.
Here, if we denote by
% $D^{(p)}$
$D$ the linear operator that
maps $y(x)$ to the
% $p$th
derivative of $y(x)$, we can write \eqref{tractable.eq1a} as
\beq
\ln P(y(x) \given \a) = - \half \, \a \int \d x \: [D^{p}y(x)]^2 + {\rm const}
= - \half \, y(x)^{\T}\! Ay(x) + {\rm const} ,
\label{eq1a.Tp}
\eeq
which has the same form as \eqref{eq.impreciseGP} with $\mu(x) =0$,
and $A \equiv [D^{p}]^{\T} D^{p}$.
In order for the prior in \eqref{eq.impreciseGP} to be a \ind{proper}
prior, $A$ must be a \ind{positive definite} operator, \ie, one satisfying
$y(x)^{\T}\! Ay(x) > 0$ for all functions $y(x)$ other than $y(x)=0$.
\subsubsection{Splines can be written as parametric models}
Splines may be written in terms of an infinite set of fixed basis
functions, as in \eqref{eq.fbf}, as follows. First rescale the $x$ axis
so that the interval $(0,2\pi)$ is much wider than the range of
$x$ values of interest.
Let the basis functions
be a Fourier set $\{ \cos hx, \sin hx$, $h\!=\!0,1,2,\ldots \}$,
so the function is
\beq
y(x) = \sum_{h=0}^{\infty} w_{h(\cos)} \cos(hx)
+ \sum_{h=1}^{\infty} w_{h(\sin)} \sin(hx)
.
\eeq
Use the
regularizer
\beq
E_W(\bw) = \sum_{h=0}^{\infty} \half h^{2p} w_{h(\cos)}^2
	+ \sum_{h=1}^{\infty}  \half h^{2p} w_{h(\sin)}^2
\eeq
to define a Gaussian prior on $\bw$,
\beq
P(\bw \given \a) = \frac{1}{Z_W(\a)} \exp (-\a E_W).
\eeq
If $p\eq 2$ then
% in the limit $k \! \rightarrow \! \infty$
we have
the cubic splines regularizer $E_W(\bw) \eq \int y^{(2)}(x)^2 \, \d x$,
as in \eqref{eq.M.splines}; if $p\eq 1$ we have the regularizer
$E_W(\bw)\eq \int y^{(1)}(x)^2 \, \d x$, etc. (To make the prior proper
we must add an extra regularizer on the term $w_{0(\cos)}$.)
Thus in terms of the prior $P(y(x))$
there is no fundamental difference between the `nonparametric'
splines approach and other parametric approaches.
\subsubsection{Representation is irrelevant for prediction}
From
the point of view of prediction at least, there are two objects
of interest. The first is the conditional distribution
$P( \tNN \given \btN , \bXNN )$
defined in \eqref{eq.predictive}.
The other object of interest, should we wish to compare one
model with others,
% two predictive models with each other,
is the joint probability of all the observed data given the model,
the evidence
$P( \btN \given \bXN )$, which appeared as the normalizing constant in
\eqref{eq.infer.w}.
Neither of these quantities makes any reference to the
representation of the unknown function $y(x)$.
So at the end of the day, our choice of representation is
irrelevant.
The question we now address is, in the case of popular parametric models,
what form do these two quantities take?
% It is not difficult to see that
We will see that for standard models with fixed basis functions
and Gaussian distributions on the unknown parameters, the
joint probability of all the observed data given the model,
$P( \btN \given \bXN )$,
is a multivariate Gaussian
distribution with mean zero and with a covariance matrix
determined by the basis functions;
this implies that the conditional distribution
$P( \tNN \given \btN , \bXNN )$
is also a Gaussian distribution, whose mean depends linearly
on the values of the targets $\btN$.
Standard parametric models are simple examples of Gaussian processes.
\section{From parametric models to Gaussian processes}
\label{sec3}\subsection{Linear models}
Let us consider a
% one-dimensional
regression problem using $H$ fixed
% radial
basis functions, for example one-dimensional radial
basis functions as defined in \eqref{eq.rbf.phi}.
% We will then consider what happens as we increase $H$.
Let us assume that a list of $N$ input points $\{\bx^{(n)} \}$
has been specified and define the $N \times H$ matrix $\bR$ to be the
matrix of values of the basis functions $\{ \phi_h(\bx) \}_{h=1}^{H}$
 at the points $\{\bx^{(n)}\}$,
\beq
R_{nh} \equiv \phi_h(\bx^{(n)}) .
\eeq
We define the vector $\byN$ to be the vector of values of
$y(\bx)$ at the $N$ points,
\beq
y_n \equiv \sum_h R_{nh} w_h
.
\eeq
If the prior distribution of $\bw$ is Gaussian with zero mean,
\beq
P( \bw ) = \Normal( \bw ; {\bf 0} , \sigma_w^2 \bI ) ,
\eeq
then $\by$, being a linear function of $\bw$, is
also Gaussian distributed, with mean zero.
The covariance matrix of $\by$ is
\beqan
\bQ& =& \left< \by \by^{\T} \right> = \left< \bR \bw \bw^{\T} \bR^{\T}
\right> =\bR \left< \bw \bw^{\T} \right> \bR^{\T}
\\ & = &
\sigma_w^2 \bR \bR^{\T}.
\eeqan
So the prior distribution of $\by$ is:
\beq
P( \by ) = \Normal( \by;{\bf 0} , \bQ ) = \Normal(\by; {\bf 0} , \sigma_w^2 \bR \bR^{\T} ) .
\eeq
This result, that the vector of $N$ function values
$\by$ has a Gaussian distribution, is true for
any selected points $\bXN$.
This is the defining property of a Gaussian process. {\em The probability
distribution of a function $y(\bx)$ is a Gaussian process if
for any finite selection of points $\bx^{(1)}, \bx^{(2)}, \ldots ,
\bx^{(N)}$, the
% marginal
density $P( y(\bx^{(1)}) , y(\bx^{(2)}) , \ldots ,
y(\bx^{(N)}))$ is a Gaussian.}
Now, if the number of basis functions $H$ is smaller than the number
of data points $N$, then the matrix $\bQ$ will not have full rank. In
this case the probability distribution of $\by$ might be thought of
as a flat elliptical pancake confined to an $H$-dimensional subspace
in the $N$-dimensional space in which $\by$ lives.
What about the target values?
If each target $\tn$ is assumed to differ by additive Gaussian noise
of variance $\sigma^2_{\nu}$ from the corresponding function value $y_n$
then $\bt$ also has a Gaussian prior distribution,
\beq
P( \bt ) = \Normal( \bt ; {\bf 0} , \bQ + \sigma_{\nu}^2 \bI ) .
\eeq
We will denote the covariance matrix of $\bt$ by $\bC$:
\beq
\bC = \bQ + \sigma_{\nu}^2 \bI =
\sigma_w^2 \bR \bR^{\T} + \sigma_{\nu}^2 \bI .
\eeq
Whether or not $\bQ$ has full rank, the covariance matrix $\bC$
has full rank since $\sigma_{\nu}^2 \bI$ is full rank.
% positive definite.
What does the covariance matrix $\bQ$ look like? In general,
the $(n,n')$ entry of $\bQ$ is
\beq
Q_{n n'} = [ \sigma_w^2 \bR \bR^{\T} ]_{n n'}
= \sigma_w^2 \sum_h \phi_h( \bx^{(n)} ) \phi_h( \bx^{(n')} )
\label{eq.phi.phi}
\eeq
and the $(n,n')$ entry of $\bC$ is
\beq
C_{n n'}
= \sigma_w^2 \sum_h \phi_h( \bx^{(n)} ) \phi_h( \bx^{(n')} )
+ \delta_{n n'} \sigma_{\nu}^2 ,
\eeq
where $\delta_{n n'} = 1$ if $n=n'$ and 0 otherwise.
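The matrices $\bR$, $\bQ$ and $\bC$ are easy to construct numerically. The
Python sketch below is an illustration under assumptions of my own choosing
(one-dimensional inputs, $H=3$ radial basis functions, $N=8$ input points, and
arbitrary values for $r$, $\sigma_w$ and $\sigma_{\nu}$); it checks that $\bQ$
has rank $H$ when $H<N$, whereas $\bC$ has full rank.

```python
import numpy as np

def rbf_design_matrix(x, centres, r):
    """R[n, h] = phi_h(x_n) = exp(-(x_n - c_h)^2 / (2 r^2)) for 1-D inputs."""
    x = np.asarray(x, dtype=float)[:, None]
    c = np.asarray(centres, dtype=float)[None, :]
    return np.exp(-((x - c) ** 2) / (2.0 * r ** 2))

x = np.linspace(0.0, 5.0, 8)             # N = 8 input points
centres = np.array([1.0, 2.5, 4.0])      # H = 3 < N basis-function centres
r, sigma_w, sigma_nu = 1.0, 1.0, 0.1     # illustrative values

R = rbf_design_matrix(x, centres, r)     # N x H
Q = sigma_w ** 2 * R @ R.T               # prior covariance of y
C = Q + sigma_nu ** 2 * np.eye(len(x))   # prior covariance of t

rank_Q = np.linalg.matrix_rank(Q)
rank_C = np.linalg.matrix_rank(C)
print(rank_Q, rank_C)
```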
\Exampl{ex.onedGP}{
Let's take as an
example a one-dimensional case, with radial basis functions.
%
The expression for $Q_{n n'}$
becomes simplest if we assume we have uniformly-spaced basis functions
with the basis function labelled $h$ centred on the point $x=h$,
and take the limit $H \rightarrow
\infty$, so that the sum over $h$ becomes an integral; to avoid having
a covariance that diverges with $H$, we had better make $\sigma_w^2$
scale as $S/(\Delta H)$, where $\Delta H$ is the number of
basis functions per unit length of the $x$-axis, and $S$ is a constant;
then
\beqan
\hspace*{-0.342in} Q_{n n'} &=& S \int_{h_{\min}}^{h_{\max}} \! \d h \:
\phi_h( x^{(n)} ) \phi_h( x^{(n')} )
\\ &=&
S \int_{h_{\min}}^{h_{\max}} \! \d h \:
\exp\left[ - \frac{(x^{(n)}-h)^2}{2 r^2} \right]
\exp\left[ - \frac{(x^{(n')}-h)^2}{2 r^2} \right] .
\eeqan
If we let the limits of integration be $\pm \infty$, we can solve this
integral:
\beqan
Q_{n n'} &=&
\sqrt{\pi r^2} \, S
\exp\left[ - \frac{(x^{(n')}-x^{(n)})^2}{4 r^2} \right]
.
\eeqan
% int( exp( (-(x/2+h)**2 -(-x/2+h)**2)/2 ) , h=-infinity..infinity ) ;
% exp(-1/4*x^2)*Pi^(1/2)
}
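The limit taken in this example can be checked numerically. In the Python
sketch below, which is only an illustration, a dense but finite grid of basis
functions stands in for the limit $H \rightarrow \infty$; the values of $r$,
$S$, the density of basis functions and the range of centres are arbitrary
assumptions, and the function names are hypothetical.

```python
import numpy as np

r, S = 0.8, 1.5                  # illustrative width and scale constant
density = 50.0                   # basis functions per unit length (Delta H)
centres = np.arange(-20.0, 20.0, 1.0 / density)
sigma_w2 = S / density           # sigma_w^2 scales as S / (Delta H)

def phi(x):
    """Values of all the radial basis functions at the point x."""
    return np.exp(-((x - centres) ** 2) / (2.0 * r ** 2))

def Q_sum(x1, x2):
    """Finite-H covariance sigma_w^2 * sum_h phi_h(x1) phi_h(x2)."""
    return sigma_w2 * np.sum(phi(x1) * phi(x2))

def Q_limit(x1, x2):
    """Closed-form limit sqrt(pi r^2) S exp(-(x1 - x2)^2 / (4 r^2))."""
    return np.sqrt(np.pi * r ** 2) * S * np.exp(-((x1 - x2) ** 2) / (4.0 * r ** 2))

print(Q_sum(0.3, 1.1), Q_limit(0.3, 1.1))
```

With basis functions spaced much more finely than $r$, the sum and the
closed-form expression should agree closely for points well inside the range
of the centres.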
We are arriving at a new perspective on the interpolation problem.
Instead of specifying
the prior distribution on functions
% of the standard radial basis function model
in terms of basis functions and priors on parameters, the prior
can be summarized
simply by a covariance function,
\beq
C( x^{(n)} , x^{(n')} ) \equiv
\theta_1 \exp\left[ - \frac{(x^{(n')}-x^{(n)})^2}{4 r^2} \right]
,
\label{eq.Cgaussian1}
\eeq
where we have given a new name, $\theta_1$, to
the constant out front.
% A Gaussian process is a probability distribution over functions
% def
Generalizing from this particular case, a vista of interpolation
methods opens up. Given any valid covariance function $C( \bx , \bx' )$
-- we'll discuss in a moment what `valid' means --
we can define the covariance matrix for $N$ function values
at locations $\bXN$ to
be the matrix $\bQ$ given by
\beq
Q_{n n'} = C( \bx^{(n)} , \bx^{(n')} )
\eeq
and the covariance matrix for $N$ corresponding target values, assuming
Gaussian noise,
to be the matrix $\bC$ given by
\beq
C_{n n'} = C( \bx^{(n)} , \bx^{(n')} ) + \sigma^2_{\nu}
\delta_{n n'} .
\label{eq.C.gp}
\eeq
In conclusion, the prior probability of the $N$ target values $\bt$
in the data set is:
\beq
P(\bt) = \Normal( \bt ; {\bf 0} , \bC ) = \smallfrac{1}{Z} e^{- \half \bt^{\T} \bC^{-1} \bt } .
\label{eq.fundamentalgp}
\eeq
Samples from this Gaussian process and a few other simple
Gaussian processes are displayed in \figref{fig:GPsamp}.
\begin{figure}
%\fullwidthfigureright{\small
\figuremargin{\small
\begin{center}
\mbox{
\begin{tabular}{cc}
% \put(-20,255){
\raisebox{0in}[2in]{
\psfig{figure=\mngfigs/GPsamp1.eps,%
width=2.32in,height=2.32in,angle=-90}
}
&
% \put(195,255){
\raisebox{0in}[2in]{
\psfig{figure=\mngfigs/GPsamp2.eps,%
width=2.32in,height=2.32in,angle=-90}
}
\\
%\put(100,230)
{\makebox(0,0){(a) $2 \exp \left( -\smallfrac{(x - x')^2}{2 (1.5)^2} \right)$} } &
%\put(320,230)
{\makebox(0,0){(b) $2 \exp \left( -\smallfrac{(x - x')^2}{2 (0.35)^2} \right)$} }
\\
% \put(-20,25){
\psfig{figure=\mngfigs/GPsamp3.eps,%
width=2.32in,height=2.32in,angle=-90}
&
% \put(195,25){
\psfig{figure=\mngfigs/GPsamp4.eps,%
width=2.32in,height=2.32in,angle=-90}
\\
%\put(100,0)
{\makebox(0,0){(c) $2 \exp \left( - \smallfrac{ \sin^2( \pi(x - x') / 3.0 ) }{2 (0.5)^2} \right)$} }
&
%\put(320,0)
{\makebox(0,0){(d) $2 \exp \left( -\smallfrac{(x - x')^2}{2 (1.5)^2} \right) + x x'$} }
\\
\end{tabular}
}
\end{center}
}{
\caption[a]{{Samples drawn from Gaussian process priors.}
Each panel shows
two functions drawn from a Gaussian process prior. The
four corresponding covariance functions
are given below each plot. The decrease in lengthscale from (a) to
(b) produces more rapidly fluctuating functions. The periodic
properties of the covariance function in (c) can be seen.
The covariance function in (d) contains the non-stationary term $x
x'$ corresponding to the covariance of a straight line, so that typical
functions include linear trends. From \citeasnoun{GibbsPhD}.
}
\label{fig:GPsamp}
}
\end{figure}
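Samples like those in \figref{fig:GPsamp} can be generated by evaluating the
covariance function on a grid of input points and drawing from the resulting
multivariate Gaussian. The Python sketch below is a hedged illustration, not
the code used to make the figure: the function names, the grid, and the small
diagonal `jitter' added for numerical stability are all assumptions of mine.

```python
import numpy as np

def cov_a(x1, x2):
    """Covariance function of panel (a): 2 exp(-(x - x')^2 / (2 * 1.5^2))."""
    return 2.0 * np.exp(-((x1 - x2) ** 2) / (2.0 * 1.5 ** 2))

def sample_gp_prior(cov, x, n_samples, rng, jitter=1e-6):
    """Draw function values y ~ Normal(0, Q), where Q[n, n'] = cov(x_n, x_n')."""
    Q = cov(x[:, None], x[None, :])
    # The jitter is an assumption of this sketch: it keeps the Cholesky
    # factorization stable when Q is nearly singular.
    L = np.linalg.cholesky(Q + jitter * np.eye(len(x)))
    return (L @ rng.standard_normal((len(x), n_samples))).T

x = np.linspace(-3.0, 5.0, 60)
samples = sample_gp_prior(cov_a, x, n_samples=2, rng=np.random.default_rng(0))
print(samples.shape)   # one row per sampled function
```

Replacing {\tt cov\_a} by another valid covariance function, for example one
with a shorter lengthscale, changes the character of the sampled functions.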
\subsection{Multilayer neural networks and Gaussian processes}
% The recent interest among neural network researchers in Gaussian
% processes was initiated by the work of \citeasnoun{Radford_book}
% on priors for infinite networks.
Figures \ref{newmovie} and \ref{movie1} show
some random samples from the prior distribution
over functions defined by a selection of standard multilayer perceptrons
with large numbers of hidden units.
Those samples don't seem a million miles away from
the Gaussian process samples of \figref{fig:GPsamp}.
And indeed
\citeasnoun{Radford_book}
% Neal
showed that the properties of a
neural network with one hidden layer (as in \eqref{eq.mlp})
converge to those of a Gaussian
process as the number of hidden neurons tends to infinity, if standard
`weight decay' priors are assumed. The covariance function of this
Gaussian process depends on the details of the priors assumed for the
weights in the network and the activation functions of the hidden units.
\section{Using a given Gaussian process model in regression}
\label{sec4}
We have spent some time talking about priors. We now return to
our data and the problem of prediction. How do we make
predictions with a Gaussian process?
Having formed the covariance matrix $\bC$ defined in \eqref{eq.C.gp},
our task is to infer $\tNN$ given the observed
vector $\btN$.
% \equiv ( t^{(1)} , \ldots , \tN )^{\T}$.
The
% inference of $\tNN$ given $\btN$ is simple because
% the
joint density $P( \tNN , \btN )$ is a Gaussian; so the
conditional distribution
\beq
P( \tNN \given \btN ) = \frac{ P( \tNN , \btN ) }
{ P( \btN ) }
\label{eq.tNN}
\eeq
is also a Gaussian.
We now distinguish between different sizes of covariance matrix $\bC$
with a subscript, such that $\bC_{N+1}$ is the $(N+1)\times(N+1)$
covariance matrix for the vector $\btNN \equiv (t_1 , \ldots ,
\tNN )^{\T}$.
We define submatrices of $\bCNN$ as follows:
\beq
\bCNN \equiv
\left[
\begin{array}{@{}c@{}c@{}}
\left[ \begin{array}{@{}ccc@{}}
& & \\
& \bCN & \\
& &
\end{array}
\right]
&
\left[ \begin{array}{@{}c@{}}
\\
\:\bk\: \\
\makebox{ }
\end{array}
\right]
\\[1mm]
\rule{0cm}{13.5pt}% strut to create space
\left[ \begin{array}{@{}ccc@{}}
\:&
\bk^{\T} &\:\,
\end{array}
\right]
&
\left[ \begin{array}{@{}c@{}}
\:\kappa\:\,
\end{array}
\right]
\end{array}
\right] .
\label{eq.CN.NN}
\eeq
The posterior distribution (\ref{eq.tNN}) is given by
\beq
P( \tNN \given \btN ) \propto \exp \left[ -\frac{1}{2}
\left[ \begin{array}{@{}c@{\,}c@{}}
\btN\, & \tNN \end{array} \right]
\bCNN^{-1}
\left[ \begin{array}{@{}c@{}}
\btN \\ \tNN \end{array} \right] \right] .
\label{eq.the.post}
\eeq
% Similarly, we can write
We can evaluate the mean and standard deviation of the
posterior distribution of $t_{N+1}$ by brute-force
inversion of $\bCNN$. There is a more elegant expression for
the predictive distribution, however, which is useful whenever
predictions are to be made at a number of new points on the basis of
the data set of size $N$.
We can write $\bCinv_{N+1}$ in terms of $\bC_N$ and $\bCinv_{N}$
using the \ind{partitioned inverse} equations \cite{Barnett}:
\beq
\bCinv_{N+1} = \left[ \begin{array}{cc} \bM & \bm \\ \bm^{\T} &
m \\ \end{array} \right]
\label{eq.CNN.inv}
\eeq
where
\begin{eqnarray}
m & = & \left( \kappa - \bkNN^{\T} \bCinv_{N} \bkNN \right)^{-1} \\
\bm & = & - m \: \bCinv_{N} \bkNN \\
\bM & = & \bCinv_{N} + \frac{1}{m} \bm \bm^{\T} .
\label{eq.CNN.inv.last}
\end{eqnarray}
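As a check on these expressions (a short verification added here for
illustration), we can multiply the last column of the proposed inverse
by $\bCNN$, partitioned as in (\ref{eq.CN.NN}), and confirm
that the result is the last column of the identity matrix:
\begin{eqnarray}
 \bC_{N} \bm + \bkNN \, m & = & - m \, \bkNN + m \, \bkNN \: = \: {\bf 0} \\
 \bkNN^{\T} \bm + \kappa \, m & = &
 m \left( \kappa - \bkNN^{\T} \bCinv_{N} \bkNN \right) \: = \: 1 .
\end{eqnarray}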
When we substitute this matrix into \eqref{eq.the.post}
we find
\beq
P( t_{N+1} \given \btN ) =
\frac{1}{Z} \exp \left[ -\frac{( t_{N+1} - \hat{t}_{N+1}
)^{2}}{2 \sigma_{\hat{t}_{N+1}}^{2}} \right]
\eeq
\noindent where
\begin{eqnarray}
\hat{t}_{N+1} & = & \bkNN^{\T} \bCinv_{N} \bt_{N} \label{eq.mean} \\
\sigma^{2}_{\hat{t}_{N+1}} & = & \kappa -
 \bkNN^{\T} \bCinv_{N} \bkNN .
\label{eq.sig}
\end{eqnarray}
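To see where these two expressions come from, it may help to spell out
the substitution. Keeping only the terms in the exponent of
(\ref{eq.the.post}) that depend on $t_{N+1}$, and writing the inverse
in the block form (\ref{eq.CNN.inv}), we have
\beq
 -\frac{1}{2} \left[ m \, t_{N+1}^{2} + 2 \, t_{N+1} \, \bm^{\T} \bt_{N} \right]
 + \mbox{const}
 =
 -\frac{m}{2} \left( t_{N+1} + \frac{ \bm^{\T} \bt_{N} }{ m } \right)^{\!2}
 + \mbox{const} .
\eeq
Completing the square in this way shows that the mean is
$- \bm^{\T} \bt_{N} / m = \bkNN^{\T} \bCinv_{N} \bt_{N}$
(using the symmetry of $\bCinv_{N}$) and that the variance is
$1/m = \kappa - \bkNN^{\T} \bCinv_{N} \bkNN$.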
The predictive
mean at the new point is given by $\hat{t}_{N+1}$ and
$\sigma_{\hat{t}_{N+1}}$ defines the error bars on this
prediction. Notice that we do not need to invert $\bC_{N+1}$ in order
to make predictions at $\bx^{(N+1)}$. Only $\bC_{N}$ needs to be
inverted.
Thus Gaussian processes
allow one to
% effectively
implement a model with a number of
basis functions $H$ much larger than the number of data points $N$,
with the computational requirement being of order $N^3$, independent of $H$.
[We'll discuss ways of reducing this cost later.]
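As a minimal worked illustration of (\ref{eq.mean}) and (\ref{eq.sig})
(a sketch added here; the scalar symbols $c_{11}$ and $k$ are introduced only for
this example), suppose there is a single data point, $N=1$.
Then $\bC_{1}$ is the scalar $c_{11}$, the vector $\bkNN$ is the scalar $k$,
the covariance between $t_1$ and $t_2$, and
\beq
 \hat{t}_{2} = \frac{k}{c_{11}} \, t_1 ,
 \hspace{0.3in}
 \sigma^{2}_{\hat{t}_{2}} = \kappa - \frac{k^2}{c_{11}} .
\eeq
For a covariance function that decays with separation, a new point far
from $\bx^{(1)}$ has $k \simeq 0$, so the prediction reverts to the
prior mean of zero with variance $\kappa$; when $k$ is large, the
prediction is pulled towards $t_1$ and the error bars shrink.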
The predictions produced by a Gaussian process depend entirely on the
covariance matrix $\bC$. We now discuss the sorts of covariance functions
one might choose to define $\bC$, and how we can automate
the selection of the covariance function in response to data.
% \subsection{Large neural networks are also equivalent to Gaussian processes}
\section{Examples of covariance functions}
\label{sec5}
% \subsection{General points}
The only constraint on our choice of covariance function is that it
must generate a non-negative-definite covariance matrix for any set
of points $\{\bx_{n}\}_{n=1}^{N}$. We will denote the parameters of
a covariance function by $\btheta$. The covariance matrix of $\bt$
has entries given by
%
\beq
C_{mn} = {C}(\bx^{(m)},\bx^{(n)};\btheta) + \delta_{mn}
{\cal N}(\bx^{(n)};\btheta)
\label{eq.cov}
\eeq
where $C$ is the covariance function
% on which we concentrate from now on,
and $\cal N$ is a noise model which might be stationary or spatially
varying, for example,
\beq
{\cal N}(\bx;\btheta) = \left\{ \begin{array}{ll}
\theta_{3} & \mbox{for input-independent noise} \\
\exp \! \left( \sum_{j=1}^{J} \beta_{j}
\phi_{j}(\bx) \right) & \mbox{for input-dependent noise.}
\end{array} \right.
\eeq
\noindent
The continuity properties of $C$ determine the continuity
properties of typical samples from the Gaussian process prior.
% If $C(\bx,\bx')$ is a continuous function of its arguments then typical
% functions $y(x)$ are continuous too.
An encyclopaedic paper on
Gaussian processes giving many valid covariance functions
has been written by \citeasnoun{Abrahamsen97}.
\subsection{Stationary covariance functions}
A {\em stationary\/} covariance function is one that is translation invariant
in that it satisfies
\beq
C(\bx,\bx';\btheta) = D(\bx-\bx';\btheta)
\eeq
for some function $D$, known as the autocovariance function;
\ie, the covariance is a function of separation only.
If additionally $C$ depends only on the {\em magnitude\/}
of the distance between
$\bx$ and $\bx'$ then the covariance function is said to be
{\dem\ind{homogeneous}}. Stationary covariance functions may also be described
in terms of the \ind{Fourier transform} of the function $D$, which is
known as the power spectrum of the Gaussian process. This Fourier
transform is necessarily a positive function of frequency.
% and it describes the power spectrum of the Gaussian process.
One way of
constructing a valid stationary covariance function is to invent
a positive function of frequency and define $D$ to be its inverse Fourier
transform.
\exampl{ex.gaussianspec}{
Let the power spectrum be a Gaussian function of frequency.
Since the Fourier transform of a Gaussian is a Gaussian, the
autocovariance function corresponding to this power spectrum is a Gaussian
function of separation. This argument rederives the covariance function
we derived at \eqref{eq.Cgaussian1}.
}
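For concreteness, here is the one-dimensional calculation, under an
assumed Fourier convention and with an assumed lengthscale symbol $r$
(neither is fixed by the argument above). If the power spectrum is
$S(\omega) = \exp ( - r^2 \omega^2 / 2 )$, then
\beq
 D(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} S(\omega) \, e^{i \omega x} \, \d\omega
 = \frac{1}{\sqrt{2\pi} \, r} \exp \left( - \frac{x^2}{2 r^2} \right) ,
\eeq
which is a Gaussian function of the separation $x$ with lengthscale $r$.
A narrow power spectrum (large $r$) thus corresponds to a broad
autocovariance function, and vice versa.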
Generalizing slightly, a popular form for $C$ with hyperparameters
$\btheta=(\theta_1,\theta_2,\{ r_i \})$ is
\beq
C(\bx,\bx';\btheta) = \theta_{1} \exp \left[ -\frac{1}{2}
\sum_{i=1}^{I} \frac{\left( x_{i} - x'_{i} \right)^{2}}
{r^{2}_{i}} \right] + \theta_{2} .
\label{eq.Cf}
\label{eq.Cstandard}
\eeq
% where $x_{i}$ is the $i^{th}$ component of
Here $\bx$ is an
$I$-dimensional vector and
$r_{i}$ is a lengthscale associated with input $x_i$,
the lengthscale in direction $i$ on which
$y$ is expected to vary significantly. A very large lengthscale
means that $y$ is expected to be essentially a constant function
of that input. Such an input could be said to be irrelevant, as in
the \ind{automatic relevance determination} method for neural
networks \cite{MacKay94:springer,Radford_book}. The $\theta_{1}$
hyperparameter defines the vertical scale of variations of a typical
function. The $\theta_{2}$ hyperparameter allows the whole function
to be offset away from zero by some unknown constant -- to
understand this term, examine \eqref{eq.phi.phi} and consider the
basis function $\phi(\bx) = 1$.
Another stationary covariance function is
\beq
C(x,x' ) = \exp ( - | x-x'|^{\nu} ) \ \ \ \ 0 < \nu \leq 2.
\eeq
For $\nu=2$, this is a special case of the previous covariance
function. For $\nu \in (1,2)$, the typical functions from this
prior are smooth but not analytic functions. For $\nu \leq 1$
typical functions are continuous but not smooth.
A covariance
function that models a function that is periodic with known
period $\lambda_{i}$ in the $i^{\rm{th}}$ input direction is
\beq
C(\bx,\bx' ;\btheta) = \theta_{1} \exp \left[ -\frac{1}{2} \sum_{i}
\left( \frac{\sin \left( \smallfrac{\pi}{\lambda_{i}}
(x_{i} - x'_{i}) \right)}{r_{i}} \right)^{\!\!\!2\,} \right] .
\label{eq.Cperiodic}
\eeq
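A quick check, added for illustration, that this covariance function
has the advertised period: shifting $x_i$ by $\lambda_i$ adds $\pi$ to the
argument of the sine, and $\sin^2 ( u + \pi ) = \sin^2 u$, so
\beq
 C( \bx + \lambda_i {\bf e}_i , \bx' ; \btheta ) = C( \bx , \bx' ; \btheta ) ,
\eeq
where ${\bf e}_i$ denotes the unit vector in the $i^{\rm{th}}$ input
direction (a symbol introduced only for this remark).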
Figure~\ref{fig:GPsamp} shows some random samples drawn from
% $P(\bt_N \given \bC_N, \{\bx_n\})$
Gaussian processes with a variety of different covariance functions.
% Many other stationary covariance functions can be found
% in \cite{Abrahamsen97}.
\subsection{Nonstationary covariance functions}
The simplest nonstationary covariance function is the one
corresponding to a linear trend. Consider the plane $y(\bx) =
\sum_{i} w_i x_{i} + c$. If the $\{w_i\}$ and $c$ have Gaussian
distributions with zero mean and variances $\sigma_w^2$ and $\sigma_c^2$
respectively then the plane has a covariance function
%
\beq
C_{\rm lin}( \bx,\bx';\{ \sigma_w, \sigma_c\}) = \sum_{i=1}^I
\sigma^2_w x_i x'_i
+ \sigma_c^2 .
\eeq
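This result can be obtained in one line if we assume, in addition, that the
$\{w_i\}$ and $c$ are mutually independent (an assumption made explicit
here). All the cross terms then vanish, and
\beq
 \left< y(\bx) \, y(\bx') \right>
 = \sum_{i,j} \left< w_i w_j \right> x_i x'_j
 + \sum_{i} \left< w_i c \right> ( x_i + x'_i )
 + \left< c^2 \right>
 = \sigma_w^2 \sum_{i} x_i x'_i + \sigma_c^2 .
\eeq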
%
% predictions. Far away from the data, this will give mean predictions lying
% on the plane
%
% \beq
% y(\bx) = \sum_{l=1}^L \left( \sum_{n=1}^N \alpha_l x^{(l)}_n (\bCinv_N
% \bt_N)_n \right) x^{(l)} + \left( \sum_{n=1}^N \theta_2 (\bCinv_N
% \bt_N)_n \right).
% \eeq
\noindent An example of random sample functions
incorporating the linear term can be seen in
\figref{fig:GPsamp}d.
% cut spatially varying lengthscales from here
\begin{figure}
%\fullwidthfigureright{\small
\figuremargin{\small
\begin{center}
\mbox{\small
(a)\raisebox{-0.20in}[1.9in]{
\psfig{figure=\mngfigs/multimin2.eps,%
width=2.32in,height=2.32in,angle=-90}
}
\small(b)\raisebox{-0.20in}[1.9in]{
\psfig{figure=\mngfigs/multimin3.eps,%
width=2.32in,height=2.32in,angle=-90}
}
}
\small(c)\setlength{\unitlength}{1pt}%
\begin{picture}(220,162)(100,12)% was 140),20)
\put(50,-30){
\hbox{
\psfig{figure=\mngfigs/multimin1.ps,%
width=4.2in,height=3.5in,angle=-90}
}}
\put(322,95){\makebox(0,0){$\theta_3$} }
\put(210,10){\makebox(0,0){$r_1$} }
\put(163,35){\makebox(0,0){$\times$} }
\put(275,125){\makebox(0,0){$\times$} }
\end{picture}
\end{center}
}{
\caption[g]{{Multimodal likelihood functions for Gaussian processes.}
A data set of five points is modelled with
the simple covariance function (\ref{eq.Cf}),
with one hyperparameter $\theta_3$ controlling the noise variance.
Panels a and b show the
most probable interpolant and its $1\sigma$ error bars
when the hyperparameters $\btheta$ are set to
two different values that (locally) maximize the likelihood $P(\bt_{N} \given \bXN , \btheta)$:
(a) $r_1 = 0.95$, $\theta_3=0.0$; (b) $r_1 = 3.5$, $\theta_3 = 3.0$.
Panel c shows a contour plot of the likelihood as a function
of $r_1$ and $\theta_3$, with the two maxima shown by crosses.
From \citeasnoun{GibbsPhD}.
}
\label{fig:multimin}
}
\end{figure}
\section{Adaptation of Gaussian process models}
\label{sec6}
Let us assume that a form of covariance function has been chosen,
but that it depends on undetermined hyperparameters $\btheta$.
We would like to `learn' these hyperparameters from the
data. This learning process is equivalent to the inference
of the hyperparameters of a neural network, for example,
weight decay hyperparameters.
It is a complexity-control problem, one that is solved
nicely by the Bayesian Occam's razor.
Ideally we would
like to define a prior distribution on the hyperparameters
and integrate over them in order to make our predictions, \ie,
we would like to find
%
\beq
P(t_{N+1} \given \bx_{N+1},\cD) = \int P(t_{N+1} \given \bx_{N+1},\btheta,\cD)
P(\btheta \given \cD) \, \d\btheta .
\label{eq.predict}
\eeq
But this integral is usually intractable. There are two
approaches we can take.
\ben
\item We can approximate the integral by using the most probable
values of hyperparameters.
% (cf. \cite{MacKay92a}).
%
\beq
P(t_{N+1} \given \bx_{N+1},\cD) \simeq P(t_{N+1} \given \bx_{N+1},\cD,\btheta_{\MP})
\eeq
\item Or we can perform the integration over $\btheta$ numerically using Monte Carlo methods
\cite{williams_rasmussen:96,Neal_gp}.
\een
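To make the second approach concrete (a sketch, with the sample index $r$ and
the number of samples $R$ introduced here for illustration): if
hyperparameter vectors $\btheta^{(r)}$, $r = 1, \ldots, R$, are drawn from
the posterior $P(\btheta \given \cD)$, then the integral
(\ref{eq.predict}) is approximated by the average
\beq
 P(t_{N+1} \given \bx_{N+1},\cD) \simeq
 \frac{1}{R} \sum_{r=1}^{R} P(t_{N+1} \given \bx_{N+1},\btheta^{(r)},\cD) .
\eeq
Each term in the sum is a Gaussian whose mean and variance are given by
(\ref{eq.mean}) and (\ref{eq.sig}), so the approximate predictive
distribution is a mixture of $R$ Gaussians.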
Either of these approaches is implemented most efficiently if the gradient
of the posterior probability of $\btheta$ can be evaluated.
\subsection{Gradient}
The posterior probability of $\btheta$ is
\beq
P(\btheta \given \cD) \propto P(\bt_{N} \given \bXN , \btheta)
P(\btheta) .
\eeq
%
The log of the first term (the evidence for the hyperparameters)
is
% $\cL$, is
\beq
\ln P(\bt_{N} \given \bXN , \btheta)
 = -\frac{1}{2} \ln \det \bC_{N} - \frac{1}{2} \bt_{N}^{\T} \bCinv_{N}
\bt_{N} - \frac{N}{2}\ln 2\pi ,
\eeq
and its derivative with respect to a hyperparameter $\theta$ is
\beq
\frac{\partial }{\partial \theta}
\ln P(\bt_{N} \given \bXN , \btheta)
= -\frac{1}{2} \Trace \left( \bCinv_{N}
 \dCda \right) + \frac{1}{2} \bt_{N}^{\T} \bCinv_{N} \dCda \bCinv_{N} \bt_{N} .
\label{eq.deriv}
\eeq
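The derivative (\ref{eq.deriv}) follows from two standard matrix
identities, stated here for a symmetric, invertible $\bC_{N}$:
\beq
 \frac{\partial }{\partial \theta} \ln \det \bC_{N}
 = \Trace \left( \bCinv_{N} \dCda \right) ,
 \hspace{0.3in}
 \frac{\partial \bCinv_{N} }{\partial \theta}
 = - \bCinv_{N} \dCda \bCinv_{N} .
\eeq
As a simple instance, take $\theta = \theta_2$ in the covariance function
(\ref{eq.Cf}), and assume that the noise model does not depend on
$\theta_2$. Every entry of $\bC_{N}$ then has derivative 1, so the
derivative matrix is ${\bf 1}{\bf 1}^{\T}$, where ${\bf 1}$ denotes
the vector of $N$ ones (a symbol introduced for this example), and
\beq
 \frac{\partial }{\partial \theta_2}
 \ln P(\bt_{N} \given \bXN , \btheta)
 = -\frac{1}{2} \, {\bf 1}^{\T} \bCinv_{N} {\bf 1}
 + \frac{1}{2} \left( {\bf 1}^{\T} \bCinv_{N} \bt_{N} \right)^{2} .
\eeq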
\subsection{Comments}
% Multimodality of the data-dependent term}
% We
% can place prior distributions on all our hyperparameters $\btheta$ to
% reflect any prior knowledge that we may have, about the expected
% lengthscales of the function, for example.
%
Assuming that finding the derivatives of the priors is
straightforward, we can now search for $\btheta_{\MP}$. However there
are two problems that we need to be aware of. Firstly, as illustrated in
\figref{fig:multimin}, the
evidence may be multimodal. Suitable priors and sensible
optimization strategies often eliminate poor optima.
Secondly, and perhaps most importantly, the evaluation of the
gradient of the log likelihood requires the evaluation of
$\bCinv_{N}$. Any exact inversion method (such as Cholesky
decomposition, LU decomposition or Gauss--Jordan elimination) has an associated
computational cost that is of order $N^3$ and so calculating gradients
becomes time consuming for large training data sets.
Approximate methods for implementing the predictions (equations
(\ref{eq.mean}) and (\ref{eq.sig})) and gradient computation
(\eqref{eq.deriv}) are an active research area.
One approach based on the ideas of \citeasnoun{Skilling_clouds} makes
approximations to $\bC^{-1}\bt$ and $\Trace \bC^{-1}$ using
iterative methods
with cost ${\cal O}(N^{2})$
\cite{Gibbs_MacKay97a,GibbsPhD}. Further references on this topic are given at
the end of the chapter.
\section{Classification}
\label{sec10}
Gaussian processes can be integrated into classification modelling
once we identify a variable that can sensibly be given a
Gaussian process prior.
In a binary classification problem, we can define a quantity $a_n
\equiv a(\bx^{(n)})$ such that the probability that the
class is 1 rather than 0 is
\beq
P( \tn \eq 1 \given a_n ) = \frac{1}{1 + e^{-a_n} } .
\label{eq.class.like}
\eeq
Large positive values of $a$ correspond to probabilities close to one;
large negative values of $a$ define probabilities that are close to zero.
In a classification problem, we typically intend that the
probability $P( \tn \eq 1 )$ should be a smoothly varying function of
$\bx$. We can embody this prior belief by defining $a(\bx)$ to have
a Gaussian process prior.
\subsection{Implementation}
It is harder to perform inferences and adapt
the Gaussian process model to
data in a classification model than in
regression problems, because the likelihood function (\ref{eq.class.like})
is not a Gaussian function of $a_n$. So the posterior
distribution of $\ba$ given some observations $\bt$
is not Gaussian and the normalization constant $P(\btN \given \bXN)$ cannot
be written down analytically.
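To make the non-Gaussianity explicit (a short calculation added for
illustration), note that for $\tn \in \{ 0 , 1 \}$ the likelihood
(\ref{eq.class.like}) and its complement can be written together as
\beq
 \ln P( \tn \given a_n ) = \tn \, a_n - \ln \left( 1 + e^{a_n} \right) ,
\eeq
which is not a quadratic function of $a_n$. Adding this to the
quadratic log of a Gaussian process prior on $\ba$ therefore does not
give a quadratic log posterior.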
\citeasnoun{williams:96} have implemented classifiers based on Gaussian\index{approximation!Laplace}
process priors using Laplace
\index{Laplace's method}approximations (\chref{ch.laplace}). \citeasnoun{Neal_gp}
 has implemented a Gaussian process classifier using Monte Carlo
 methods.
\citeasnoun{Gibbs_MacKay97b} have implemented another cheap and cheerful
approach based on the methods of
\index{Jaakkola, Tommi S.}\index{Jordan, Michael I.}
Jaakkola and Jordan
% \citeasnoun{jaakkola_jordan:bounds}
(\secref{sec.jaak}). In this\index{VGC (variational
Gaussian process classifier)}\index{variational methods!variational
Gaussian process classifier}\index{Gaussian processes!variational
Gaussian process classifier}
{\dem variational
Gaussian process classifier\/}, we obtain tractable upper and
lower bounds for the unnormalized posterior density over $\ba$,
$P(\btN \given \ba)P(\ba)$. These bounds are parameterized by variational
parameters which are adjusted in order to obtain the tightest
possible fit. Using normalized versions of the optimized bounds
we then compute approximations to the predictive distributions.
% \subsection{Conclusion}
% Gaussian processes can be used to produce
% effective binary classifiers.
Multi-class classification problems can also be solved with Monte Carlo
methods \cite{Neal_gp} and variational methods \cite{GibbsPhD}.
\section{Discussion}
\label{sec11}
Gaussian processes are moderately simple to
implement and use. Because very few parameters of the
model need to be determined by hand (generally only the
priors on the hyperparameters), Gaussian processes are
useful tools for automated tasks where fine tuning for
each problem is not possible. We do not appear to sacrifice any
performance for this simplicity.
It is easy to construct
Gaussian processes that have particular desired properties; for
example we can make a straightforward automatic relevance
determination model.
One obvious problem with Gaussian processes is
the computational cost associated with inverting an $N \times N$
matrix. The cost of direct methods of inversion becomes
prohibitive when the number of data points $N$ is greater than
about $1000$.
%In \citeasnoun{Gibbs_MacKay97a} efficient methods for
% matrix inversion \cite{Skilling_clouds} are developed that allow large
% data sets to be tackled. But further research is going to be needed
% before Gaussian processes can be applied to more than about $10\,000$
% data points.
\subsection{Have we thrown the baby out with the bath water?}
According to the hype of 1987,
neural networks were meant to be intelligent models
that discovered features and patterns in data.
Gaussian processes in contrast are simply smoothing devices.
How can Gaussian processes possibly replace
neural networks? Were neural networks
over-hyped, or have we underestimated the power of smoothing methods?
I think both these propositions are true.
The success of Gaussian processes
% work of \citeasnoun{williams_rasmussen:96}
shows that many real-world data modelling problems are perfectly well
solved by sensible smoothing methods. The most interesting
problems, the task of feature discovery for example, are
not ones that Gaussian processes will solve. But maybe
multilayer perceptrons can't solve them either.
Perhaps a fresh start is needed, approaching the problem
of machine learning from a paradigm different from the
supervised feedforward mapping.
\subsection{Further reading}
%\subsection{Literature}
The study of Gaussian processes for regression is far from new. Time
series analysis was being performed by the astronomer T.N.\ Thiele\index{Thiele, T.N.}
using Gaussian processes in 1880 \cite{Lauritzen81}. In the 1940s,
\index{Wiener, Norbert}\index{Kolmogorov, Andrei Nikolaevich}{Wiener}--{Kolmogorov}
prediction theory was introduced for prediction of
trajectories of military targets \cite{Cybernetics}. Within the
geostatistics field, \citeasnoun{Matheron63b} proposed a framework
for regression using optimal linear estimators which he called
`kriging' after D.G.\ Krige, a South African mining engineer. This
framework is identical to the Gaussian process approach to
regression. Kriging has been developed considerably in the last
thirty years (see \citeasnoun{Cressie} for a review)
including several Bayesian treatments
\cite{Omre87,Kitanidis86}. However the \ind{geostatistics} approach to the
Gaussian process model has concentrated mainly on low-dimensional
problems and has largely ignored any probabilistic interpretation of
the model.
% and any interpretation of the individual parameters of the
% covariance function.
Kalman filters are widely used to implement
inferences for stationary one-dimensional Gaussian processes,
and are popular models for speech and music modelling
\cite{Bayes.Kalman}. Generalized radial basis functions
\cite{Poggio3}, ARMA models \cite{Wahba90} and variable metric kernel
methods \cite{Lowe95} are all closely related to Gaussian processes.
See also \citeasnoun{OHagan78}.
% introduced an approach similar to Gaussian processes.
The idea of
replacing supervised neural networks by Gaussian processes
was first explored by \citeasnoun{williams_rasmussen:96}
and \citeasnoun{Neal_gp}. A thorough comparison of Gaussian
processes with other methods such as neural networks and MARS was
made by \citeasnoun{rasmussen:phd}.
Methods for reducing the complexity of data modelling with
Gaussian processes remain an active research area
% add ckiw cites here (and cf. earlier discussion already in comments section
\cite{poggio-girosi-90,luo-wahba-97,tresp-00,williams-seeger-01,smola-bartlett-01,rasmussen-02,seeger-williams-lawrence-03,opper00gaussian}.
% the opper00gaussian paper is a mean field approach so probably not reducing
% complexity of the matrix computations, but they do offer ideas for model selection using
% cheap mean field methods.
A longer review of Gaussian processes is in \cite{MacKay98:gp}.
A review paper on regression with
% parametric models and
\ind{complexity} control using hierarchical\index{hierarchical model} Bayesian models
is \cite{MacKay92a}.
% \citeasnoun{Neal_gp} discusses various ways
% of connecting Gaussian processes to classification models.
Gaussian processes and {\dem\ind{support vector} learning
machines} \cite{scholkopf95,vapnik95} have a lot in common. Both are
kernel-based predictors, the \ind{kernel} being another
name for the covariance function.
A Bayesian version of support vectors, exploiting this connection,
can be found in
\cite{Chu01a,Chu02b,Chu02c,Chu02d}.\nocite{ohagan94}\nocite{Ichikawa96}
\dvips
\chapter{Deconvolution \nonexaminable}
% deconvoln.tex
%
% \chapter{Deconvolution}
%
\section{Traditional image reconstruction methods}
\subsection*{Optimal linear filters} % for image deconvolution}
In many imaging problems, the data measurements $\{ d_n \}$ are linearly
related to the underlying image $\bf f$:
\beq
d_n = \sum_k R_{nk} f_k + \noisenu_n.
\eeq
The vector $\mathbf \noisenu$ denotes the inevitable noise that corrupts real
data. In the case of a \ind{camera}\index{blur} which produces a blurred picture, the
vector $\bf f$ denotes the true \ind{image}, $\bf d$ denotes the blurred
and noisy picture, and the linear operator $\bf R$ is a convolution
defined by the \ind{point spread function} of the camera. In this special
case, the true image and the data vector reside in the same space;
but it is important to maintain a distinction between them. We will
use the subscript $n = 1 ,\ldots, N$ to run over data measurements,
and the subscripts $k,k' = 1 ,\ldots, K$ to run over image pixels.
One might speculate that since the blur was created by a linear
operation, then perhaps it might be deblurred by another linear
operation. We can derive the {\dem\ind{optimal linear filter}\/} in two ways.
% from a Bayesian perspective, or from a minimum square error standpoint.
% Both are reviewed here, since the latter relates as follows.
% The latter gives the connection to neural networks
\subsection*{Bayesian derivation}
%We define our hypothesis space as follows.
We assume that the linear operator $\bf R$ is known, and that the
noise $\bf \noisenu$ is Gaussian and independent, with a known standard
deviation $\snu$.
\beq
P(\bd \given \f , \snu, \H ) = \frac{1}{(2 \pi \snu^2)^{N/2}}
\exp \left(
-\sum_n \left. \left( d_n -
\textstyle \sum_k R_{nk} f_k \right)^2 \right/ (2 \snu^2)
\right) .
\eeq
We assume that the prior probability of the image is also \index{Gaussian distribution}{Gaussian},
with a scale parameter $\sigma_{f}$.
\beq
 P(\f \given \sigma_f, \H) = \frac{\det^{\half}\bC}{(2 \pi \sigma_f^2)^{K/2}}
\exp \left(
 - \sum_{k,k'} \left. f_k C_{kk'} f_{k'} \right/ (2 \sigma_f^2)
\right) .
\label{images.prior}
\eeq
If we assume no correlations\index{correlations!in images} among the pixels then the symmetric,
full rank matrix $\bC$ is equal to the identity matrix $\bI$. The
more sophisticated `intrinsic correlation function' model uses $\bC =
[\bG \bG^{\T}]^{-1}$, where $\bG$ is a convolution that takes us from
an imaginary `hidden' image, which is uncorrelated, to the real
correlated image. The \ind{intrinsic correlation function} should not be
confused with the \ind{point spread function} $\bR$ which defines the image-to-data
mapping. A zero-mean Gaussian prior is clearly a poor
assumption if it is known that all elements of the image $\f$ are
positive,
% in every pixel,
but let us proceed. We can now write down the posterior
probability of an image $\f$ given the data $\bd$.
\beq
P(\f \given \bd , \snu, \sigma_f, \H) =
\frac{ P(\bd \given \f , \snu, \H ) P(\f \given \sigma_f, \H) }{
P(\bd \given \snu, \sigma_f, \H ) } .
\eeq
In words,
\beq
{\rm Posterior} = \frac{ \mbox{Likelihood} \, \times \, \mbox{Prior} }{
\rm Evidence } .
\eeq
The `evidence' $P(\bd \given \snu, \sigma_f, \H )$ is the normalizing
constant for this posterior distribution. Here it is unimportant, but
it is used in a more sophisticated analysis to compare, for example,
different values of $\snu$ and $\sigma_f$, or different point spread
functions $\bR$.
Since the posterior distribution is the product of two Gaussian
functions of $\f$, it is also a Gaussian, and can therefore be
summarized by its mean, which is also the {\em most probable image},
$\fmp$, and its covariance matrix:
\beq
\bSigma_{\f | \bd} \equiv \left[
- \grad \grad \log P(\f \given \bd , \snu, \sigma_f, \H)
\right]^{-1},
\eeq
which defines the joint error bars on $\f$. In this equation, the symbol
$\grad$ denotes differentiation with respect to the image parameters $\f$.
We can find $\fmp$ by differentiating the log of the posterior and setting
the derivative to zero. We obtain:
\beq
\fmp = \left[ \bR^{\T} \bR + \frac{\snu^2}{\sigma_f^2} \bC \right]^{-1} \bR^{\T} \bd .
\label{images.olf1}
\eeq
The operator $\left[ \bR^{\T} \bR + \frac{\snu^2}{\sigma_f^2} \bC
\right]^{-1} \bR^{\T}$ is called the optimal linear filter. When the
term $\frac{\snu^2}{\sigma_f^2} \bC$ can be neglected, the optimal
linear filter is the \ind{pseudoinverse}
% `$\bR^{-1}\mbox{'} =
$\left[
\bR^{\T} \bR \right]^{-1} \bR^{\T}$. The term
$\frac{\snu^2}{\sigma_f^2} \bC$ regularizes\index{regularization} this ill-conditioned
inverse.
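The intermediate step, spelled out here for convenience, uses the
likelihood and prior defined above and the symmetry of $\bC$. The
gradient of the log posterior is
\beq
 \grad \log P(\f \given \bd , \snu, \sigma_f, \H)
 = \frac{1}{\snu^2} \bR^{\T} \left( \bd - \bR \f \right)
 - \frac{1}{\sigma_f^2} \bC \f .
\eeq
Setting this to zero at $\f = \fmp$ and multiplying through by $\snu^2$
gives $\left[ \bR^{\T} \bR + \frac{\snu^2}{\sigma_f^2} \bC \right] \fmp
= \bR^{\T} \bd$, which is (\ref{images.olf1}). Differentiating once
more gives the covariance matrix of the posterior,
\beq
 \bSigma_{\f | \bd}
 = \left[ \frac{1}{\snu^2} \bR^{\T} \bR + \frac{1}{\sigma_f^2} \bC \right]^{-1} .
\eeq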
%\subsection*{Alternative expression}
The optimal linear filter can also be manipulated into the form:
\beq
\mbox{Optimal linear filter} = \bC^{-1} \bR^{\T}
\left[ \bR \bC^{-1} \bR^{\T} + \frac{\snu^2}{\sigma_f^2} \bI \right]^{-1} .
\label{images.olf2}
\eeq
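One way to verify that the two forms agree (a sketch; the shorthand
$\rho \equiv \snu^2 / \sigma_f^2$ is introduced only for this check) is
to note that both sides of
\beq
 \bR^{\T} \left[ \bR \bC^{-1} \bR^{\T} + \rho \bI \right]
 =
 \left[ \bR^{\T} \bR + \rho \bC \right] \bC^{-1} \bR^{\T}
\eeq
expand to $\bR^{\T} \bR \bC^{-1} \bR^{\T} + \rho \bR^{\T}$.
Multiplying on the left by $\left[ \bR^{\T} \bR + \rho \bC \right]^{-1}$
and on the right by $\left[ \bR \bC^{-1} \bR^{\T} + \rho \bI \right]^{-1}$
then turns the filter of (\ref{images.olf1}) into the form
(\ref{images.olf2}).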
% My favourite derivation of this is to do it for the case C = I,
% which is not hard. (Get an eqn like Rd = RBAd for all d, therefore
% A = B inverse + additive stuff which is irrelevant anyway. )
% Then obtain the general case by defining f' = C^{-1/2} f, and substituting
% R = R' C^{-1/2} everywhere. Finally strip out all 's.
%
%, then the result is transformed to image space using $\bR^{\T}$.
\subsection*{Minimum square error derivation}
\label{sec.olf.ss}
The non-Bayesian derivation of the optimal linear filter starts by assuming
that we will `estimate' the true image $\f$ by a linear function of the
data:
\beq
\hat{\f} = \bW \bd.
\label{linear.filter}
\eeq
The linear operator $\bW$ is then `optimized' by minimizing the expected
sum-squared error between $\hat{\f}$ and the unknown true image
 $\f$.
% The sum squared error $E= \sum_i \left( \hat{f}_i - f_i
% \right)^2 /2 $ is commonly assumed.
% (Interestingly, any quadratic metric using a symmetric positive
% definite matrix gives the same optimal linear filter.)
% bC This symbol is chosen deliberately to suggest a
% connection to the matrix in the prior of equation (\ref{images.prior}).
In the following equations, summations over repeated indices
$k$, $k'$, $n$ are implicit. The expectation $\left< \cdot \right>$
is over both the statistics of the random variables $\{ \noisenu_n \}$,
and the ensemble of images $\f$ which we expect to bump into. We assume
that the noise is zero mean
and uncorrelated to second order with itself and everything else,
with $\left< \noisenu_n \noisenu_{n'} \right> = \snu^2 \delta_{nn'}$.
\beqan
% \langle E \rangle &=& \frac{1}{2} \left( \hat{f}_k - f_k \right)^2 \\
\langle E \rangle &=& \frac{1}{2} \left< \left( W_{kn} d_n - f_k \right)^2 \right> \\
% &=& \frac{1}{2} \left< \left( W_{kn} (R_{nj}f_j + \noisenu_n ) - f_k \right)^2 \right> \\
&=& \half \left< \left( W_{kn} R_{nj} f_j - f_k \right)^2
\right>
+ \frac{1}{2} W_{kn} W_{kn} \snu^2 .
\eeqan
Differentiating with respect to $\bW$, and introducing the matrix $\bF$ with elements $F_{j'j} \equiv \left< f_{j'} f_j \right>$
(\cf\ $\sigma_f^2 \bC^{-1}$ in the Bayesian derivation above),
% \beqan
% \label{images.olf.wd}
% \frac{ \partial \langle E \rangle }{\partial W_{im} } &=&
% \left< \left( W_{im'} R_{m'j'} f_{j'} - f_i ) \right) R_{mj} f_j \right>
% + W_{im} \snu^2 \\
% &=& W_{im'} \left( R_{m'j'} \left< f_{j'} f_j \right> R^{\T}_{jm}
% + \snu^2 \delta_{m'm} \right)
% - \left< f_i f_j \right> R^{\T}_{jm} .
% \eeqan
we find that the optimal linear filter is:
\beq
\bW_{\rm opt} = \bF \bR^{\T} \left[ \bR \bF \bR^{\T} + \snu^2 \bI \right]^{-1}
.
\eeq
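The differentiation can be written out in index notation as follows
(summation over repeated indices is again implicit, and
$F_{j'j}$ denotes $\left< f_{j'} f_j \right>$):
\beqan
 \frac{ \partial \langle E \rangle }{\partial W_{km} } &=&
 \left< \left( W_{kn'} R_{n'j'} f_{j'} - f_k \right) R_{mj} f_j \right>
 + W_{km} \snu^2 \\
 &=& W_{kn'} \left( R_{n'j'} F_{j'j} R_{mj}
 + \snu^2 \delta_{n'm} \right)
 - F_{kj} R_{mj} .
\eeqan
Setting this derivative to zero gives
$\bW \left[ \bR \bF \bR^{\T} + \snu^2 \bI \right] = \bF \bR^{\T}$,
from which the expression for $\bW_{\rm opt}$ follows.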
If we identify $\bF = \sigma_f^2 \bC^{-1}$, we obtain the optimal
linear filter (\ref{images.olf2}) of the Bayesian derivation. The ad
hoc assumptions made in this derivation were the choice of a
quadratic error measure, and the decision to use a linear estimator.
It is interesting that without explicit assumptions of Gaussian
distributions, this derivation has reproduced the same estimator as
the Bayesian posterior mode, $\fmp$.
The advantage of a Bayesian approach is that we can criticize these
assumptions and modify them in order to make better reconstructions.
\subsection*{Other image models}
The better matched our model of images $P(\f \given \H)$ is to the real
world, the better our image reconstructions will be, and the less
data we will need to answer any given question. The Gaussian models
which lead to the optimal linear filter are spectacularly
poorly matched to the real world. For example, the
Gaussian prior (\ref{images.prior}) fails to specify that all
pixel intensities in an
image are positive.\index{positivity}
This omission leads to the most pronounced artefacts where
the image under observation has high contrast or large black
patches. Optimal linear
filters applied to \index{astronomy}\index{image reconstruction}astronomical
data give reconstructions with
negative areas in them, corresponding to patches of sky that suck
energy out of telescopes! The {\dem\ind{maximum entropy}\/} model for
image deconvolution \cite{Gull.nature} was a great success
principally because this model forced the reconstructed image to be
positive. The spurious negative areas and complementary spurious
positive areas are eliminated, and the quality of the
reconstruction is greatly enhanced.
\begin{sloppypar}
The `classic maximum entropy' model assigns an \index{entropic distribution}{entropic prior}
%
\beq
P(\f \given \a,\bm,\H_{\rm Classic})=\exp(\a S(\f,\bm))/Z,
\eeq
where
\beq
S(\f,\bm) = \sum_i( f_i \ln (m_i/f_i) + f_i - m_i)
\eeq
\cite{GS1}.
This model enforces positivity; the parameter $\a$ defines a
characteristic dynamic range by which the pixel values are expected
to differ from the default image $\bm$.
\end{sloppypar}
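As a quick check on why $\bm$ plays the role of a default image
(a worked step added here for illustration, using only the definition
of $S$ above and assuming all $f_i > 0$), differentiate $S$ with respect to one pixel:
\[
\frac{\partial S}{\partial f_i} = \ln (m_i/f_i) - 1 + 1 = \ln (m_i/f_i) ,
\qquad
\frac{\partial^2 S}{\partial f_i^2} = - \frac{1}{f_i} < 0 .
\]
So $S$ is concave, it has a unique maximum at $\f = \bm$, where $S = 0$,
and the prior density falls away from $\bm$ at a rate set by $\a$.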
The `\index{intrinsic correlation function}{intrinsic-correlation-function}\index{ICF
(intrinsic correlation function)} maximum-entropy' model \cite{GS2} introduces an expectation
of spatial correlations into the prior on $\f$ by writing $\f = \bG
\bh$, where $\bG$ is a convolution with an intrinsic correlation
function, and putting a classic maxent prior on the underlying hidden image $\bh$.\index{MemSys}
% The `Fermi-Dirac' model generalizes the entropy function so as to
% enforce an upper bound on intensity as well as the lower bound of
% positivity. This model is appropriate where the underlying image is
% bounded between two grey levels, as in the case of printed text.
%
% All these models are implemented in the MemSys package.
\subsection{Probabilistic movies}
Having found not only the most probable image $\f_{\MP}$
but also error bars on it, $\bSigma_{\f|\bd}$, one task
is to visualize those error bars.
Whether or not we use Monte Carlo methods\index{Monte Carlo methods!for visualization}
to infer $\f$,
a correlated random walk around the posterior distribution can
be used to visualize the uncertainties and
correlations.\index{probabilistic movie}\index{movie}
For a Gaussian posterior distribution,
we can create a correlated sequence of unit normal random vectors $\bn$ using
\beq
\bn^{(t+1)} =
c \bn^{(t)} + s \bz ,
\eeq
where $\bz$ is a unit normal random vector and $c^2+s^2 =1$ ($c$ controls how persistent
the memory of the sequence is).
We then render the image sequence defined by
\beq
\f^{(t)} = \f_{\MP} + \bSigma_{\f|\bd}^{1/2} \bn^{(t)}
\eeq
where $\bSigma_{\f|\bd}^{1/2}$ is a square root of $\bSigma_{\f|\bd}$,
for example the triangular factor given by its \ind{Cholesky decomposition}.
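The two equations above can be sketched in a few lines of code. What follows is a
minimal illustration, not a definitive implementation: it uses plain Python lists,
a hand-written Cholesky routine, and a toy two-pixel posterior whose numbers are
invented for the example. The names {\tt cholesky} and {\tt movie\_frames} are
hypothetical.

```python
import math
import random

def cholesky(A):
    """Lower-triangular L with L L^T = A, for symmetric positive-definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def movie_frames(f_mp, Sigma, c=0.95, T=100, seed=0):
    """Return T frames f^(t) = f_MP + L n^(t), where n^(t+1) = c n^(t) + s z."""
    rng = random.Random(seed)
    s = math.sqrt(1.0 - c * c)            # enforces c^2 + s^2 = 1
    L = cholesky(Sigma)
    d = len(f_mp)
    # Start from a unit normal vector, so every n^(t) is unit normal too.
    n = [rng.gauss(0.0, 1.0) for _ in range(d)]
    frames = []
    for _ in range(T):
        frame = [f_mp[i] + sum(L[i][k] * n[k] for k in range(i + 1))
                 for i in range(d)]
        frames.append(frame)
        n = [c * n[i] + s * rng.gauss(0.0, 1.0) for i in range(d)]
    return frames

# Toy two-pixel posterior (illustrative numbers only).
f_mp = [1.0, 2.0]
Sigma = [[1.0, 0.8], [0.8, 1.0]]
frames = movie_frames(f_mp, Sigma, c=0.95, T=5)
```

Displaying successive frames as images gives the movie; values of $c$ close to 1
give slowly varying frames, and $c=0$ gives independent samples from the posterior.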
\section{Supervised neural networks for image deconvolution}
%\section{Relationships to Neural Networks}
Neural network researchers often exploit the following
strategy. Given a problem currently solved with a standard algorithm:
interpret the computations performed by the algorithm as a
parameterized mapping from an input to an output, and call this
mapping a neural network; then adapt the parameters to data
so as to produce another mapping that solves the
task better. By construction, the neural network can reproduce the
standard algorithm, so this data-driven adaptation can only make the
performance better.
There are several reasons why standard algorithms can be bettered
in this way.
\ben
\item
Algorithms are often not designed to optimize the real objective
function. For example, in speech recognition, a hidden Markov model
is designed to model the speech signal, and is fitted so as
to maximize the generative probability given the known
string of words in the training data; but the real objective
is to {\em discriminate\/} between different words. If an inadequate
model is being used, the neural-net-style training of the model
will focus the limited resources of the model on the aspects relevant to
the discrimination task. Discriminative training\index{discriminative training}
of hidden Markov models for
speech recognition does improve their performance.
\item
% (2)
The neural network can be more flexible than the standard model;
some of the adaptive parameters might have been viewed as fixed
features by the original designers.
%\item
%% (3)
A flexible network can find properties in the data that were not included
in the original model.
\een
% added Tue 15/4/03
% extradeconvoln.tex
% added after draft 4.0 Tue 15/4/03
\section{Deconvolution in humans}
A huge fraction of our brain is devoted to vision.
One of the neglected features of our visual system is that
the raw image falling on the retina is severely blurred:
while
most people can see with a resolution of about {\em 1 arcminute\/}
(one sixtieth
of a degree) under any daylight conditions, bright or
dim, {\em the image on our
retina is blurred through a point spread function of width as large as 5
arcminutes} \cite{ChromAb1947,ChromAb1986}.
% under dim lighting.
It is amazing that we are able to resolve pixels that
are twenty-five times smaller in area than the blob produced on our retina
by any point source.
\index{Newton, Isaac}Isaac Newton was aware of this conundrum.
It's hard to make a lens that does not have \ind{chromatic aberration},
and our cornea and lens, like a lens made of ordinary glass,
refract blue light more strongly than red. Typically
our eyes focus correctly for the middle of the visible spectrum (green),
% so green
so if we look
at a single white dot made of red, green, and blue light,
the image on our retina consists of a sharply focussed green dot
surrounded by a broader red blob superposed on an even broader blue blob.
%
The width of the red and blue blobs is proportional to the diameter
of the pupil, which is largest under dim lighting conditions.
[The blobs are roughly concentric, though most people have a slight
bias, such that in one eye the red blob is centred a tiny distance
to the left and the blue is centred a tiny distance to the right,
and in the other eye it's the other way round. This slight bias
explains why when we look at blue and red writing on a dark background
most people perceive the blue writing to be at a slightly greater depth
than the red. In a minority of people, this small bias is the
other way round and the red/blue depth perception is reversed.
But this effect (which many people are aware of, having noticed it in
cinemas, for example) is {\em tiny\/} compared with
the chromatic aberration we are discussing.]
You can vividly demonstrate to yourself how enormous the chromatic
aberration in your eye is with the help of a sheet of card
and a colour computer screen.
For the most impressive results -- I guarantee you will be
amazed
% , if you have not seen this effect before --
--
use a dim room with no light apart from the computer screen;
a pretty strong effect will still be seen even if the room has daylight
coming into it, as long as it is not bright sunshine.
Cut a slit about 1.5$\,$mm wide in the card. On the screen, display
a few small coloured objects on a black background. I especially
recommend thin vertical objects coloured pure red, pure blue,
magenta (\ie, red plus blue), and white (red plus blue plus green).\footnote{%
{\tt{http://www.inference.phy.cam.ac.uk/mackay/itila/Files.html}}}
Include a little black-and-white text on the screen too.
Stand or sit sufficiently far away that you can only just read the text
-- perhaps a distance of four metres or so, if you have normal vision.
Now, hold the slit vertically in front of one of your eyes,
and close the other eye. Hold the slit near to your eye -- brushing your eyelashes --
and look through it. Waggle the slit slowly to the left
and to the right, so that the slit is alternately in front
of the left and right sides of your pupil. What do you see?
I see the red objects waggling to and fro, and the blue
objects waggling to and fro, through {\em huge\/}
distances and in opposite directions, while white objects
appear to stay still and are negligibly distorted. Thin magenta objects
can be seen splitting into their constituent red and blue parts.
Measure how large the motion of the red and blue objects is --
it's more than 5 minutes of arc for me, in a dim room.
Then check how sharply you can see under these conditions -- look
at the text on the screen, for example: is it not the case
that you can see (through your whole pupil)
features far smaller than the distance through which the red and blue
components were waggling?
Yet when you are using the whole pupil, what is falling on your retina
must be an image blurred with a blurring diameter equal to the waggling amplitude.
One of the main functions of early visual processing must
be to deconvolve this chromatic aberration.
Neuroscientists sometimes conjecture that the reason why retinal ganglion cells
and cells in the lateral
geniculate nucleus (the main brain area to which retinal ganglion
cells project) have
centre-surround receptive fields with colour opponency (long
wavelength in the centre and medium wavelength in the surround, for example)
is in order to perform `feature extraction' or `edge detection',
but I think this view is mistaken. The reason we have
centre-surround filters at the first stage of visual processing (in the fovea
at least) is for the huge task of
deconvolution of chromatic aberration.
I speculate that the {\dem McCollough effect}, an
extremely long-lasting association of colours with orientation
\cite{McCollough1965,MacKayMacKay1974},
is produced by the adaptation mechanism that tunes
our chromatic-aberration-deconvolution circuits.
% not McCulloch effect
% It's a perfect match.
Our deconvolution circuits must be
rapidly tuneable, because the point spread function of our eye
changes with our pupil diameter, which can change within seconds;
and indeed the McCollough effect can be induced within 30 seconds.
At the same time, the effect is long-lasting when an eye is covered,
because it's in our interests that our deconvolution circuits should
stay well-tuned while we \ind{sleep}, so that we can see sharply the
instant we wake up.
I also wonder whether the main reason that we evolved colour
vision was not `in order to see fruit better' but\index{evolution!colour vision}
{\em `so as to be able to see black and white sharper'} --
% it's certainly the case that
deconvolving chromatic aberration is easier, even
in an entirely black and white world, if one
has access to chromatic information in the image.
And a final speculation: why do our eyes make \ind{micro-saccades}
when we look at things? These miniature eye-movements\index{eye movements}\index{saccades}\index{speculation about vision}
are of an angular size bigger than the spacing between the \ind{cones} in the \ind{fovea} (which are
spaced at roughly 1 minute of arc, the perceived resolution of the eye).
The typical size of a microsaccade is 5--10$\,$minutes of arc
\cite{SaccadeSize1950}. Is it a coincidence
that this is the same as the size of chromatic aberration? Surely
micro-saccades must play an essential role in the deconvolution mechanism that
delivers our high-resolution vision.
\section{Exercises}
\exercisxC{3}{ex.deconv}{
Blur an image with a circular (top hat) point spread function and add noise.
Then deconvolve the blurry noisy image using the optimal linear filter.
Find error bars and visualize them by making a probabilistic movie.
}
\dvips
\renewcommand{\partfigure}{\poincare{8.6}}
\part{Sparse Graph Codes \nonexaminable}
% {Introduction to sparse graph codes \nonexaminable}
\subchapter{About Part VI}% Introduction to Sparse Graph Codes \nonexaminable}
% Sat 4/1/03 I turned this from being a self-contained intro
% appropriate for a review paper to being the book chapter.
% see bak0301/ for the original
\index{sparse-graph code}\index{error-correcting code!sparse graph}
\label{sec:intro}
The central problem of \ind{communication} theory
is to construct an encoding and a decoding system
that make it possible to communicate reliably over
a noisy channel.
During the 1990s, remarkable progress was made towards
the Shannon limit, using codes that are defined in terms
of sparse random graphs, and which are decoded by a simple
probability-based message-passing algorithm.
In a {\dbf sparse-graph code},\index{graph!of code}
the nodes in the graph represent the transmitted
bits and the constraints they satisfy. For a linear code with a codeword
length $N$ and rate $R=K/N$, the number of constraints
is of order $M=N-K$. Any linear code can be described
by a graph, but what makes a sparse-graph code
special is that each constraint only involves a small number of
variables in the graph: so the number of edges in the graph scales
roughly linearly with $N$, rather than quadratically.
% and $K$.
% cut figure from here
% to graveyard.tex
In the following four chapters we will look at four families
of sparse-graph codes:
three families that are excellent for error-correction ({\dbf\ind{low-density parity-check code}s},
{\dbf\ind{turbo code}s},\nocite{berrou-glavieux-96}
and {\dbf repeat--accumulate codes}\nocite{Divsalar1998});
and the family of {\dbf{digital fountain codes}}, which are outstanding for
erasure-correction.
%\subsection*{Decoding performance}
All these codes can be decoded
by a local message-passing algorithm on the
graph, the \ind{sum--product algorithm},\index{loopy}\index{factor graph}
and, while this algorithm is not a perfect maximum likelihood
decoder, the empirical results are record-breaking.\nocite{mncEL,McElieceMacKay96,mncN}
%{Low-density parity-check codes}
\chapter{Low-Density Parity-Check Codes \nonexaminable}
\label{ch.gallager}
\label{ch.ldpcc}
A \inds{low-density parity-check code} (or \inds{Gallager code}) is a block code
that has a parity-check matrix, $\bH$, every row and column of which
is `sparse'.\indexs{error-correcting code!low-density parity-check}\indexs{error-correcting code!Gallager}
% The codewords of the code are those words which
% satisfy all the parity constraints defined by the matrix.
A {\dem\ind{regular}\/} Gallager code is a low-density parity-check code in which
every column of $\bH$ has the same weight $j$ and every row has the
same weight $k$; regular Gallager codes are constructed at random
subject to these constraints.
A low-density parity-check code with $j=3$ and $k=4$ is illustrated
in \figref{fig.ldpccb}.
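As a check on these weights (a worked count added here, using only the
definitions of $j$, $k$, $N$ and $M$): counting the 1s in $\bH$ column by column
and then row by row gives
\[
N j = M k , \qquad \mbox{so} \qquad \frac{M}{N} = \frac{j}{k} .
\]
With $j=3$, $k=4$ and a blocklength of $N=16$, this gives $M=12$ constraints and
$Nj = 48$ edges in the graph. Since the number of source bits satisfies
$K \geq N-M$, the rate satisfies $R = K/N \geq 1 - j/k = 1/4$, with equality when
the rows of $\bH$ are linearly independent.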
%(Its block length is not long enough for it to be a good code.)
\amarginfig{c}{
\[
\raisebox{0.425in}{ \bH \hspace{0.02in} =}\hspace{-0.1in}
\psfig{figure=MNCfigs/12.4.3.111/A.ps,angle=-90,width=1.5in,height=1in}
% {\large \left[
% \begin{array}{*{16}{p{0.53cm}}}
% {}& {}& 1 & {}& {}& {}& {}& 1 & {}& 1 & {}& {}& 1 & {}& {}& {}\\
% {}& {}& {}& 1 & {}& {}& 1 & {}& 1 & {}& {}& {}& 1 & {}& {}& {}\\
% {}& 1 & {}& {}& 1 & {}& 1 & {}& {}& 1 & {}& {}& {}& {}& {}& {}\\
% {}& {}& {}& 1 & {}& 1 & {}& {}& {}& {}& 1 & {}& {}& 1 & {}& {}\\
% {}& {}& 1 & {}& {}& {}& {}& {}& 1 & {}& {}& {}& {}& {}& 1 & 1 \\
% 1 & {}& {}& {}& {}& 1 & {}& {}& 1 & 1 & {}& {}& {}& {}& {}& {}\\
% {}& {}& {}& 1 & {}& {}& {}& 1 & {}& {}& {}& 1 & {}& {}& 1 & {}\\
% {}& 1 & {}& {}& {}& 1 & {}& {}& {}& {}& {}& 1 & {}& {}& {}& 1 \\
% 1 & {}& {}& {}& {}& {}& 1 & {}& {}& {}& {}& {}& {}& 1 & {}& 1 \\
% {}& {}& 1 & {}& 1 & {}& {}& {}& {}& {}& {}& 1 & {}& 1 & {}& {}\\
% {}& 1 & {}& {}& {}& {}& {}& {}& {}& {}& 1 & {}& 1 & {}& 1 & {}\\
% 1 & {}& {}& {}& 1 & {}& {}& 1 & {}& {}& 1 & {}& {}& {}& {}& {}
% \end{array}
% \right] } .
% \: \mbox{($M=12$ rows, $N=16$ columns)}
% MNC:
% method ../../../A2dat.p < A
\]
\begin{center}
\mbox{
\psfig{figure=/home/mackay/itp/figs/gallager/16.12.ps,width=2in,angle=-90}
}\end{center}
\caption[a]{A low-density parity-check matrix
and the corresponding graph of a rate-\dfrac{1}{4}
low-density parity-check code with
% $(j,k) = (3,4)$,
blocklength $N=16$, and $M=12$ constraints.
Each white circle represents a transmitted bit. Each bit
participates in $j=3$ constraints, represented by
\plusnode\ squares. Each
% \plusnode\
constraint forces the
sum of the $k=4$ bits to which it is connected to
be even.
}
\label{fig.ldpccb}
}
\section{Theoretical properties}
Low-density parity-check codes lend themselves to theoretical study.
The following results are proved in \citeasnoun{Gallager63} and
\citeasnoun{mncN}.
Low-density parity-check codes, in spite of their simple construction,
are good codes, {\em given an optimal decoder\/} (good codes in the
sense of \secref{sec.good.codes}).
Furthermore, they have good distance (in the sense
of \secref{sec.bad.dist.def}).
% - indeed {ex.ldpcgood} {sec.good.codes}
These two results hold for any column weight $j \geq 3$.
Furthermore, there are sequences of low-density parity-check codes
in which $j$ increases gradually with $N$, in such a way that
the ratio $j/N$ still
goes to zero, that are {\em very good\/} and that have very good distance.
% in spite of their simple construction,
% are very good codes, {\em given an optimal decoder}.
However, we don't have an optimal decoder, and
decoding low-density parity-check codes is an NP-complete problem.
So what can we do in practice?
\section{Practical decoding}
Given a channel output $\br$, we wish to find the
codeword $\bt$ whose likelihood $P(\br \given \bt)$ is biggest.
All the effective decoding strategies for low-density parity-check
codes are message-passing algorithms. The best algorithm known
is the \ind{sum--product algorithm},
also known as \ind{iterative probabilistic decoding}
or \ind{belief propagation}.
We'll assume that the channel is a memoryless channel (though
more complex channels\index{channel!complex}\index{channel!with memory}\index{channel!bursty}
can easily be handled by
running the sum--product algorithm on a more complex graph that
represents the expected correlations\index{correlations!among errors} among the errors \cite{Worthen}).
For any memoryless channel, there are two approaches to the
decoding problem, both of which lead to
the generic problem `find the $\bx$ that maximizes
% and that satisfies
\beq
P^*(\bx) = P(\bx) \, \truth[ \bH \bx = \bz ]\mbox{'} ,
\eeq
where $P(\bx)$ is a separable distribution on a binary vector $\bx$,
and $\bz$ is another binary vector.
Each of these two approaches represents
the decoding problem in terms of a
\ind{factor graph} (\chref{sec.sumproduct}).
\newcommand{\gallboxw}{3.95in}
\newcommand{\galltn}{\put(20.5,8.85){\makebox(0,0)[l]{$t_n$}}}
\newcommand{\gallplus}{\put(18,1){\makebox(0,0)[l]{$\truth\left[ \sum_{n \in \Nm} t_n \!=\! 0\right]$}}}
% \put(18,1){\makebox(0,0)[l]{$\truth[ \sum t_n \!=\! 0]$}}
\begin{figure}
%\figuremargin{\small
%\fullwidthfigureright{\small
\figuredangle{\small
\begin{center}
\begin{tabular}{c@{$\:\:\:\:$}p{\gallboxw}}
{\setlength{\unitlength}{0.1in}
\begin{picture}(28.5,11.5)(-1,1)
%\put(-1,2){\makebox(0,0)[bl]{\small(a)}}
% raise about 2/16 in = 0.125in to get alignment with (c)
\put(0,1.25){\makebox(0,0)[bl]{%
\psfig{figure=/home/mackay/itp/figs/gallager/16.12.ps,width=2in,angle=-90}}}
\galltn
%\gallplus
\end{picture}}&
\begin{minipage}[b]{\gallboxw}
\makebox[0in][r]{(a)~}The prior distribution over codewords $$P(\bt) \propto
\truth[ \bH\bt= {\bf 0} ]. $$
The variable nodes are the transmitted bits $\{ t_n \}$.
\par Each
\plusnode\ node represents the factor $\truth[ \sum_{n \in \Nm} t_n \!=\!
0 \mod 2]$.
\end{minipage}
\\
\setlength{\unitlength}{0.1in}
\begin{picture}(29.95,12.0)(-1,1)
%\put(-1,2){\makebox(0,0)[bl]{\small(b)}}
\put(0,1.25){\makebox(0,0)[bl]{%
\psfig{figure=/home/mackay/itp/figs/gallager/16.12.tobs.eps,width=2in,angle=0}}}
\galltn
\put(20.5,10.5){\makebox(0,0)[l]{$P(r_n\given t_n)$}}
%\gallplus
\end{picture}
&
\raisebox{18pt}{
\begin{minipage}[b]{\gallboxw}
\makebox[0in][r]{(b)~}The posterior distribution over codewords,
$$P(\bt\given \br) \propto P(\bt) P(\br \given \bt).$$
\par Each
% additional square
upper function node
% at the top
represents a likelihood factor $P(r_n\given t_n)$.
\end{minipage}
}
\\
\setlength{\unitlength}{0.1in}
\begin{picture}(28.5,12.5)(-1,-0.1)
%\put(-1,0){\makebox(0,0)[bl]{\small(c)}}
\put(0,0){\makebox(0,0)[bl]{%
\psfig{figure=/home/mackay/itp/figs/gallager/16.12.zobs.eps,width=2in,angle=0}}}
\put(20.5,8.85){\makebox(0,0)[l]{$n_n$}}
\put(20.5,10.5){\makebox(0,0)[l]{$P(n_n)$}}
%\put(18,2){\makebox(0,0)[l]{$\truth\left[ z_m = \sum_{n \in \Nm} n_n\right]$}}
\put(17.3,0){\makebox(0,0)[l]{$z_m$}}
\end{picture}
&
\begin{minipage}[b]{\gallboxw}
\makebox[0in][r]{(c)~}The joint probability of the noise $\bn$
and syndrome $\bz$,
$$P(\bn,\bz) = P(\bn) \, \truth[ \bz \!=\! \bH \bn ].$$
The top variable nodes are now the noise bits $\{ n_n \}$.
\par
The added variable nodes at the base are the syndrome
values $\{ z_m \}$.
\par Each definition $z_m = \sum_{n} H_{mn} n_n \mod 2$
is enforced by a \plusnode\ factor.
% unction $\truth[ \sum_{n \in \Nm} n_n + z_m \!=\! 0 \mod 2]$.
\end{minipage}
\\
\end{tabular}
\end{center}
}{
\caption[a]{Factor graphs associated with a
low-density parity-check code.
% graveyard.tex
}
\label{fig.ldpcc.factor}
}
\end{figure}
\subsubsection{The codeword decoding viewpoint}
First, we note that the prior distribution
over codewords,
\beq
P(\bt) \: \propto \:
\truth[ \bH\bt={\bf 0} \mod 2 ],
\eeq
can be represented by a factor graph (\figref{fig.ldpcc.factor}a),
with the factorization being
\beq
P(\bt) \: \propto \:
\prod_{m}
%\truth\left[ \sum_{n \in \Nm} t_n \!=\! 0\right]
\truth\lbrack \sum_{n \in \Nm} \hspace{-0.81em} t_n = 0 \mod 2 \rbrack
.
\eeq
(We'll omit the `mod~2's from now on.)
The posterior distribution over codewords
is given by multiplying this prior by the likelihood,
which introduces another $N$ factors, one for each received bit.
\beqan
P(\bt\given \br) &\propto& P(\bt) P(\br \given \bt) \nonumber \\
& \propto &
\prod_{m}
\truth\lbrack \sum_{n \in \Nm} \hspace{-0.81em} t_n = 0 \, \rbrack
\, \prod_n P(r_n\given t_n)
\eeqan
The factor graph corresponding to this function is shown in
\figref{fig.ldpcc.factor}b. It is the same as the graph
for the prior, except for the addition of \ind{likelihood} `\ind{dongle}s'
to the transmitted bits.
In this viewpoint, the received signal $r_n$ can live in
any alphabet; all that matters are the values of $P(r_n \given t_n)$.
\subsubsection{The syndrome decoding viewpoint}
Alternatively, we can view the channel output in terms of
a binary received vector $\br$ and a noise vector $\bn$, with
a probability distribution $P(\bn)$ that can be derived from the
channel properties and whatever additional information
is available at the channel outputs.
% As long as the channel is symmetric, this
For example, with
a binary symmetric channel,
we define the noise by $\br = \bt + \bn$, the syndrome
$\bz = \bH \br$, and the noise model $P( n_n \eq 1 ) = f$.
For other channels such as the Gaussian channel with output $\by$,
we may define a received binary vector $\br$ however we wish
and obtain an effective binary
noise model $P(\bn)$ from $\by$
(exercises \ref{ex.GC} (\pref{ex.GC}) and \ref{ex.gc.bsc} (\pref{ex.gc.bsc})).
The joint probability of the
noise $\bn$
and syndrome $\bz = \bH \bn$ can be factored as
\beqan
P(\bn,\bz) &=& P(\bn) \, \truth[ \bz \eq \bH \bn ] \nonumber \\
&=& \prod_n P(n_n) \,
\prod_m \truth\lbrack z_m \eq \!\!\!\! \sum_{n \in \Nm} \hspace{-0.81em} n_n \, \rbrack .
% &=& \prod_n P(n_n) \prod_m \truth\left[ z_m = \sum_{n \in \Nm} n_n\right] .
\eeqan
The factor graph of this function is shown in
\figref{fig.ldpcc.factor}c.
The variables $\bn$ and $\bz$ can also
be drawn in a `belief network' (also known
as a `Bayesian network', `causal network', or `influence diagram')
similar to \figref{fig.ldpcc.factor}a,
but with arrows
on the edges from the upper circular nodes (which represent the variables
$\bn$) to the lower
square nodes (which now represent the variables $\bz$).
We can say that every bit $x_{\fel}$ is the \ind{parent} of $j$
\checks\ $z_{\fen}$, and each \check\ $z_{\fen}$ is the child of
$k$ bits.
% figure half done in 16.12.bn.fig
Both decoding viewpoints involve essentially the same graph.
Either version of the decoding problem can
% , as we said before,
be expressed
as the generic decoding problem `find the $\bx$ that maximizes
% and that satisfies
\beq
P^*(\bx) = P(\bx) \, \truth[ \bH \bx \eq \bz ]\mbox{'} ;
\eeq
in the codeword decoding viewpoint, $\bx$ is the codeword $\bt$,
and $\bz$ is ${\bf 0}$; in the syndrome decoding viewpoint,
$\bx$ is the noise $\bn$, and $\bz$ is the syndrome.
It doesn't matter which viewpoint we take when we apply the
sum--product algorithm.
The two decoding algorithms are isomorphic and will give equivalent
outcomes (unless numerical errors intervene).
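The equivalence of the two viewpoints can be checked by brute force on a code small
enough to enumerate. The sketch below is an illustration only: it uses the
parity-check matrix of the $(7,4)$ Hamming code (which is not low-density) and a
binary symmetric channel with an assumed flip probability $f = 0.1$, and it solves
the generic problem by exhaustive search rather than by message passing. The
function names are hypothetical.

```python
from itertools import product

# Parity-check matrix of the (7,4) Hamming code: small enough to enumerate,
# though not low-density. Chosen purely for illustration.
H = [[1, 1, 1, 0, 1, 0, 0],
     [0, 1, 1, 1, 0, 1, 0],
     [1, 0, 1, 1, 0, 0, 1]]

def syndrome(H, x):
    """H x mod 2, as a tuple."""
    return tuple(sum(h * xi for h, xi in zip(row, x)) % 2 for row in H)

def generic_decode(H, p1, z):
    """Exhaustive search for the x maximizing P(x) [Hx = z], where P(x) is
    separable with P(x_n = 1) = p1[n]."""
    best, best_p = None, -1.0
    for x in product((0, 1), repeat=len(p1)):
        if syndrome(H, x) != tuple(z):
            continue                       # the truth function is zero here
        p = 1.0
        for xn, pn in zip(x, p1):
            p *= pn if xn else 1.0 - pn
        if p > best_p:
            best, best_p = x, p
    return best

f = 0.1                                    # assumed flip probability
t = (1, 0, 0, 0, 1, 0, 1)                  # a codeword: H t = 0
r = (1, 0, 1, 0, 1, 0, 1)                  # received vector, third bit flipped

# Codeword viewpoint: x is t, z is 0, P(x_n) is the normalized likelihood.
p1_codeword = [(1 - f) if rn else f for rn in r]
t_hat = generic_decode(H, p1_codeword, z=(0, 0, 0))

# Syndrome viewpoint: x is the noise n, z is H r, P(n_n = 1) = f.
n_hat = generic_decode(H, [f] * 7, z=syndrome(H, r))
t_hat_from_noise = tuple(rn ^ nn for rn, nn in zip(r, n_hat))
```

In this example both routes recover the transmitted codeword; the syndrome route
never needs to know $\bt$, only $\bH \br$.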
\begin{aside}
I tend to use the syndrome decoding viewpoint
because it has one advantage: one does not need
to implement an {\em encoder\/} for a code in order to be able to
simulate a decoding problem realistically.
\end{aside}
We'll now talk in terms of the generic decoding problem.
\section{Decoding with the sum--product algorithm}
\label{sec.spgall}
We aim, given the observed \checks, to compute the marginal posterior
probabilities $P(x_{\fel} \eq 1 \given \bz , \bH )$ for each $\fel$.
It is hard to compute these exactly because the graph contains many
cycles. However, it is interesting to implement the decoding
algorithm that would be appropriate if there were no cycles, on the
assumption that the errors introduced might be relatively small.
This approach of ignoring cycles has been used in the artificial
intelligence literature but is now frowned upon because
it produces inaccurate probabilities. However, if we
are decoding a good error-correcting code,
we don't care about accurate marginal probabilities --
we just want the correct codeword. Also, the posterior probability, in the
case of a good code communicating at an achievable rate, is expected
typically to be hugely concentrated on the most probable decoding;
so we are dealing with a distinctive probability distribution
to which experience gained in other fields may not apply.
The sum--product algorithm
was presented in \chref{sec.sumproduct}.
We now write out explicitly how it works for solving the decoding problem
\[
\bH \bx = \bz \:\: (\!\mod 2) .
\]
For brevity, we reabsorb the dongles hanging off the $x$ and $z$ nodes
in \figref{fig.ldpcc.factor}c and modify
the sum--product algorithm accordingly.
The graph in which $\bx$ and $\bz$ live is then
the original graph (\figref{fig.ldpcc.factor}a) whose edges are
defined by the 1s in $\bH$. The graph contains
nodes of two types, which we'll call checks and bits.
The graph connecting the checks and bits
is a bipartite graph: bits connect only to checks, and {\em vice versa}.
On each iteration, a probability ratio is propagated along each edge in the
graph, and each bit node $x_n$ updates its probability that it should
be in state 1.\nocite{pearl,lauritzen-spiegelhalter-88}
% \subsection{The algorithm}
We denote the set of bits $\fel$ that participate in
\check\ $\fen$ by $\feKn \equiv \{ \fel \! : H_{\fen \fel} \eq 1 \}$.
Similarly we define the set of \checks\ in which bit $\fel$
participates, $\feNk \equiv \{ \fen \!: H_{\fen \fel} \eq 1 \}$. We
denote a set $\feKn$ with bit $\fel$ excluded by $\feKn\wo \fel$.
The algorithm has two alternating parts, in which quantities $q_{\fen
\fel}$ and $\fer_{\fen \fel}$ associated with each edge
in the graph
% non-zero element in the $\bH$ matrix
are iteratively updated. The quantity $q^x_{mn}$
is meant to be the probability that bit $n$ of $\bx$ has the value $x$, given
the information obtained via checks other than check $m$. The
quantity $r^x_{mn}$ is meant to be the probability of check $m$ being
satisfied if bit $n$ of $\bx$ is considered fixed at $x$ and the
other bits have a separable distribution given by the probabilities
$\{q_{mn'}\! :n' \in \feKn\wo \fel\}$. The algorithm would produce the
exact posterior probabilities of all the bits after a fixed number
of iterations if the bipartite graph
defined by the matrix $\bH$ contained no cycles.
%\medskip
%\noindent
\paragraph{Initialization\nocolon}
% {\sf Initialization\llncspunc}
Let $p_{\fel}^0=P(x_{\fel} \eq 0)$ (the prior probability that bit
$x_{\fel}$ is 0), and let $p_{\fel}^1=P(x_{\fel}\eq 1)=1-p_{\fel}^0$.
If we are taking the syndrome decoding viewpoint
and the channel is a \BSC\ then $p_{\fel}^1$ will equal $f$.
If the noise level varies in a known way (for example if the channel
is a binary input Gaussian channel with a real output) then
$p_{\fel}^1$ is initialized to the appropriate normalized likelihood.
% Let the prior probability that bit $x_{\fel}$ is 0 be $p^0_{\fel}$,
% and let $P(x_{\fel}=1) = p^1_{\fel}$.
For every $(\fel,\fen)$ such that $H_{\fen \fel} \eq 1$ the variables
$q^0_{\fen \fel}$ and $q^1_{\fen \fel}$ are initialized to the values
$p^0_{\fel}$ and $p^1_{\fel}$ respectively.
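The initialization can be sketched as follows. This is a minimal illustration with
hypothetical function names. For the Gaussian channel it assumes that bit 1 is
transmitted as $+a$ and bit 0 as $-a$ with noise standard deviation $\sigma$ (a
convention chosen for this example), so that the normalized likelihood is
$1/(1+\exp(-2 a y_n/\sigma^2))$.

```python
import math

def init_bsc(N, f):
    """Syndrome viewpoint, binary symmetric channel: p1[n] = f for every bit."""
    return [f] * N

def init_gaussian(y, sigma, amplitude=1.0):
    """Codeword viewpoint, binary-input Gaussian channel.
    Assumed convention: bit 1 is sent as +amplitude, bit 0 as -amplitude.
    Returns the normalized likelihoods p1[n] = P(y_n|1) / (P(y_n|0) + P(y_n|1))."""
    return [1.0 / (1.0 + math.exp(-2.0 * amplitude * yn / sigma ** 2))
            for yn in y]

def init_q(H, p1):
    """Set q^0_{mn} and q^1_{mn} to the priors on every edge (H[m][n] == 1)."""
    q0, q1 = {}, {}
    for m, row in enumerate(H):
        for n, h in enumerate(row):
            if h:
                q0[m, n] = 1.0 - p1[n]
                q1[m, n] = p1[n]
    return q0, q1

# Toy example: a two-check, three-bit matrix and a BSC with f = 0.1.
H = [[1, 1, 0],
     [0, 1, 1]]
p1 = init_bsc(3, 0.1)
q0, q1 = init_q(H, p1)
```

Storing the messages in dictionaries keyed by edge $(m,n)$ keeps the storage
proportional to the number of 1s in $\bH$ rather than to $MN$.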
\paragraph{Horizontal step\nocolon}
In the {\dem{horizontal}\/} step of the algorithm (horizontal from the
point of view of the matrix $\bH$), we run through the \checks\
$\fen$ and compute for each $\fel \in \feKn$ two probabilities:
first, $\fer_{\fen \fek}^0$, the probability of
the observed value of $z_{\fen}$ arising when $x_{\fek}=0$, given that
the other bits $\{ x_{\fek'} :
% \fen' \in \feNm ,
\fek' \not = \fek \}$
have a separable distribution given by the probabilities $\{q^0_{\fen \fek'} ,
q^1_{\fen \fek'}\}$, defined by:
\beq
\fer_{\fen \fek}^0 =
\sum_{\left\{x_{\fek'}\! :\, \fek' \in \feKn \wo \fek
\right\} }\!\!\!
P\left(z_{\fen}\given x_{\fek} \eq 0,\,\left\{x_{\fek'}\! :\,
\fek' \in \feKn \wo \fek
\right\}\right)
\prod_{ \fek' \in \feKn \wo \fek
} q_{\fen \fek'}^{x_{\fek'}}
\label{bnd.step2a}
\eeq
and second, $\fer^1_{\fen \fek}$, the probability of the observed value of $z_{\fen}$ arising when
$x_{\fek}=1$, defined by:
\beq
\fer^1_{\fen \fek} =
\sum_{\left\{x_{\fek'}\! :\, \fek' \in \feKn \wo \fek
\right\} }\!\!\!
P\left(z_{\fen}\given x_{\fek} \eq 1,\,\left\{x_{\fek'}\! :\,
\fek' \in \feKn \wo \fek
\right\}\right) \prod_{ \fek' \in \feKn \wo \fek }
q_{\fen \fek'}^{x_{\fek'}} .
\label{bnd.step2}
\eeq
The conditional probabilities in
these summations are either zero or one, depending on whether
the observed $z_m$ matches the hypothesized values for $x_n$ and the
$\{ x_{n'} \}$.
These probabilities can be computed in various obvious ways based on
equations (\ref{bnd.step2a}) and (\ref{bnd.step2}). The computations
may be done most efficiently (if $|\feKn|$ is large) by regarding
$z_m + x_n$ as the final state of a Markov chain with states 0 and
1, this chain being started in state 0, and undergoing transitions
corresponding to additions of the various $x_{n'}$, with transition
probabilities given by the corresponding $q^0_{mn'}$ and $q^1_{mn'}$.
The probabilities for $z_m$ having its observed value given either
$x_n=0$ or $x_n=1$ can then be found efficiently by use of the
forward--backward
% Viterbi
algorithm (\secref{sec.trellisfb})\nocite{Baum_Welch_orig,viterbi,BCJR}.
A particularly convenient implementation of this
method uses
forward and backward passes in which products of the differences
$\delta q_{\fen \fek} \equiv q^0_{\fen \fek} - q^1_{\fen \fek}$ are
computed. We obtain $\delta \fer_{\fen \fek} \equiv \fer^0_{\fen
\fek} - \fer^1_{\fen \fek}$ from the identity:
\beq
\delta \fer_{\fen \fek} = (-1)^{z_{\fen}}
\prod_{\fek' \in \feKn \wo \fek}
\delta q_{\fen \fek'} .
\label{eq.ft.gallager}
\eeq
This identity is derived by iterating the following observation:
if $\zeta = x_{\mu} + x_{\nu} \mod 2$, and $x_{\mu}$ and $x_{\nu}$ have
probabilities $q_{\mu}^0,q_{\nu}^0$ and $q_{\mu}^1,q_{\nu}^1$ of being
0 and 1, then $P( \zeta \eq 1 ) = q_{\mu}^1 q_{\nu}^0 + q_{\mu}^0 q_{\nu}^1$
and $P(\mbox{$\zeta \eq 0$} ) = q_{\mu}^0 q_{\nu}^0 + q_{\mu}^1 q_{\nu}^1$.
Thus $P( \mbox{$\zeta \eq 0$}) - P(\mbox{$ \zeta \eq 1$} ) = ( q_{\mu}^0 - q_{\mu}^1 )
( q_{\nu}^0 - q_{\nu}^1 )$.
We recover
$r_{\fen \fek}^0$ and $r_{\fen \fek}^1$ using
\beq
\fer^0_{\fen \fek} = \dfrac{1}{2}(1 + \delta \fer_{\fen \fek}), \:\:\:
\fer^1_{\fen \fek} = \dfrac{1}{2}(1- \delta \fer_{\fen \fek}).
\eeq
The transformations into differences $\delta q$ and back from $\delta r$ to $\{ r\}$
may be viewed as a Fourier transform and an inverse Fourier transform.
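The horizontal step for a single check can be sketched in two ways: by direct
summation over the other bits, as in equations (\ref{bnd.step2a}) and
(\ref{bnd.step2}), and by the product of differences in
equation (\ref{eq.ft.gallager}). The code below is an illustration with
hypothetical function names and invented message values; it is meant only to show
that the two routes agree.

```python
from itertools import product

def horizontal_bruteforce(q1_others, z_m, x_n):
    """r^{x_n}_{mn} by direct summation over the other bits in check m.
    q1_others[i] is q^1_{mn'} for the i-th bit n' in N(m) excluding n."""
    total = 0.0
    for xs in product((0, 1), repeat=len(q1_others)):
        if (x_n + sum(xs)) % 2 != z_m:
            continue                       # P(z_m | x) is zero for this term
        p = 1.0
        for x, q1 in zip(xs, q1_others):
            p *= q1 if x else 1.0 - q1
        total += p
    return total

def horizontal_delta(q1_others, z_m):
    """(r^0_{mn}, r^1_{mn}) from delta r = (-1)^{z_m} prod (q^0 - q^1)."""
    dr = -1.0 if z_m else 1.0
    for q1 in q1_others:
        dr *= (1.0 - q1) - q1              # delta q = q^0 - q^1
    return 0.5 * (1.0 + dr), 0.5 * (1.0 - dr)

# Invented incoming messages q^1_{mn'} from three other bits; observed z_m = 1.
q1_others = [0.1, 0.3, 0.25]
r0, r1 = horizontal_delta(q1_others, z_m=1)
r0_check = horizontal_bruteforce(q1_others, z_m=1, x_n=0)
r1_check = horizontal_bruteforce(q1_others, z_m=1, x_n=1)
```

The direct sum has $2^{|\feKn|-1}$ terms, whereas the product of differences needs
only $|\feKn|-1$ multiplications per edge.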
\paragraph{Vertical step\nocolon}
The {\dem{vertical\/}} step takes the computed values of $\fer^0_{\fen \fek}$ and
$\fer^1_{\fen \fek}$ and updates the values of the probabilities
$q^0_{\fen \fek}$ and
$q^1_{\fen \fek}$. For each $\fek$ we compute:
\beqan
q^0_{\fen \fek} & = & \alpha_{\fen \fek} \:
p^0_{\fek}
\prod_{{\fen' \in \feNk \wo \fen}}
\fer^0_{\fen' \fek}
\\
q^1_{\fen \fek} & = & \alpha_{\fen \fek} \:
p^1_{\fek}
\prod_{{\fen' \in \feNk \wo \fen}}
\fer^1_{\fen' \fek}
\eeqan
where $\alpha_{\fen \fek}$ is chosen such that $q^0_{\fen \fek} +
q^1_{\fen \fek} = 1$. These products can be efficiently
computed in a downward pass and an upward pass.
We can also compute the `pseudoposterior probabilities'
$q^0_{\fek}$ and $q^1_{\fek}$ at this iteration, given by:
\beqan
q^0_{\fek} & = & \alpha_{\fek} \: p^0_{\fek} \prod_{\fen \in \feNk}
\fer^0_{\fen \fek} ,
\\
q^1_{\fek} & = & \alpha_{\fek} \: p^1_{\fek} \prod_{\fen \in \feNk}
\fer^1_{\fen \fek} .
\eeqan
These quantities are used to create a tentative decoding $\hat{\bx}$,
the consistency of which is used to decide whether the decoding algorithm
can halt. (Halt if $\bH \hat{\bx} = \bz$.)
At this point, the algorithm repeats from the horizontal step.
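 As a numerical illustration of the vertical step (with made-up values), suppose bit $\fek$ has
 $p^0_{\fek} = 0.9$ and $p^1_{\fek} = 0.1$ and participates in three \checks, which send
 $\fer^0 = 0.74$, $0.4$ and $0.6$ (so $\fer^1 = 0.26$, $0.6$ and $0.4$).
 The pseudoposterior probabilities are proportional to
\[
 0.9 \times 0.74 \times 0.4 \times 0.6 = 0.15984 \:\:\: \mbox{and} \:\:\:
 0.1 \times 0.26 \times 0.6 \times 0.4 = 0.00624 ,
\]
 so $q^0_{\fek} \simeq 0.962$ and $q^1_{\fek} \simeq 0.038$, and the tentative decoding sets $\hat{x}_{\fek} = 0$.
 The message returned to the first of the three \checks\ omits that \check's own factor:
 it is proportional to $0.9 \times 0.4 \times 0.6 = 0.216$ and $0.1 \times 0.6 \times 0.4 = 0.024$,
 which normalize to $q^0_{\fen \fek} = 0.9$ and $q^1_{\fen \fek} = 0.1$.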
\paragraph{The {stop-when-it's-done} decoding method\nocolon}\index{stop-when-it's-done}
% If the belief network really were a tree without cycles, the values of
% the pseudoposterior probabilities $q^0_{\fek}$ and $q^1_{\fek}$ at
% each iteration would correspond exactly to the posterior
% probabilities of bit $\fek$ given the states of all the checks
% in a truncated belief network centred on bit
% $\fek$ and extending out to a radius equal to twice the number of
% iterations.
The recommended
decoding procedure is to set $\hat{x}_{\fek}$ to 1 if
$q^1_{\fek} > 0.5$ and see if the \checks\ $\bH \hat{\bx} = \bz \mod 2$ are
all satisfied, halting when they are, and declaring a failure if some
maximum number of iterations (\eg\ 200 or 1000) occurs without successful
decoding. In the event of a failure, we may still report $\hat{\bx}$,
but we flag the whole block as a failure.
We note in passing the difference between this decoding procedure
and the widespread practice in the turbo code community, where
the decoding algorithm is run for a {\em fixed\/} number of iterations (irrespective
of whether the decoder finds a consistent state at some earlier time).
This practice is wasteful of computer time,
and it blurs the distinction between undetected and detected
errors.
%, and it means that many
% researchers fail to distinguish between detected and undetected
% errors.
In our procedure, `undetected' errors occur
if the decoder finds an $\hat{\bx}$ satisfying $\bH \hat{\bx} = \bz \mod 2$
which is not equal to the true $\bx$. `Detected' errors
occur if the algorithm runs for the maximum number of iterations
without finding a valid decoding.
Undetected errors are of scientific interest because
they reveal distance properties of a code.
And in engineering practice,
it would seem preferable for the blocks that are known to contain
detected errors
to be so labelled if practically possible.
% and bad engineering.
\paragraph{Cost\nocolon}
In a brute force approach, the time to create the generator matrix scales as
$N^3$, where $N$ is the block size. The encoding time scales as
$N^2$, but encoding involves only binary arithmetic, so for the block
lengths studied here it takes considerably less time than the
simulation of the Gaussian channel.
 Decoding involves approximately $6 N j$ floating-point multiplies
 per iteration, so the total number of operations per
 decoded bit (assuming 20 iterations) is about $120 j / R$,
 independent of blocklength. For the codes
 presented in the next section, this is about 800 operations.
% per decoded bit.
% %
The encoding complexity can be reduced by
clever encoding tricks invented by \citeasnoun{Urbanke00} or by
specially constructing the parity-check matrix \cite{MacKayWilsonDavey98}.
The decoding complexity can be reduced, with only a small loss in performance, by
passing low-precision messages in place of real numbers
\cite{Richardson98}.
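 The decoding cost per decoded bit given above follows from simple arithmetic,
 assuming that the cost of a block is shared among its $K = NR$ source bits
 and writing $j$ for the column weight:
\[
 \frac{ 6 N j \times 20 }{ N R } = \frac{120 \, j}{R} .
\]
 For example, a code with $j=3$ and $R=1/2$ gives $120 \times 3 / 0.5 = 720$ multiplies
 per decoded bit, which is of the same order as the figure of about 800 operations quoted above.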
\section{Pictorial demonstration of Gallager codes}
\label{sec.pictures}
Figures \ref{fig.20000.encode}--\ref{fig.t20000}
% The following figures
illustrate visually
the conditions under which
low-density parity-check codes can give reliable
\ind{communication} over binary symmetric channels and Gaussian channels.
These demonstrations may be viewed
as animations on
the world wide web.\footnote{{\tt http://www.inference.phy.cam.ac.uk/mackay/codes/gifs/}}
% \cite{MacKay97:ipd}.
\begin{figure}
\fullwidthfigureright{\small
\begin{center}
\begin{tabular}{ccc}
\raisebox{1.22in}{(a)\ \psfig{figure=bitmaps/dilbert.ps,width=1.22in}
$\rightarrow$\hspace*{0.2in}}%
\raisebox{0.565in}{\makebox[0in][r]{{\large parity bits} $\left.\rule[-0.57in]{0pt}{0.56in} \right\{$}}%
\psfig{figure=bitmaps/t20000.ps,width=1.22in} &
\ (b)\ \psfig{figure=_is/20000.1.ps,width=1.22in} &
\ (c)\ \psfig{figure=_is/20000.10.ps,width=1.22in} \\
\end{tabular}
\end{center}
}{
\caption[a]{Demonstration of encoding with a rate-$1/2$
Gallager code. The encoder is derived from a very sparse $10\,000\times20\,000$
parity-check matrix with three 1s per column (\figref{fig.hugeHmatrix}).
(a) The code creates transmitted vectors consisting of $10\,000$ source bits
and $10\,000$ parity-check bits. (b) Here, the source sequence has been altered
by changing the first bit. Notice that many of the parity-check bits are
changed.
Each parity bit depends on about half of the
source bits.
(c) The transmission for the case
$\bs = (1,0,0,\ldots, 0)$. This vector is the difference (modulo 2)
between transmissions (a) and (b).
\dilbertcopy
}
\label{fig.20000.encode}
}
\end{figure}
\begin{figure}
\fullwidthfigureright{
\[
\raisebox{1.5in}{ \bH \hspace{0.2in}=\hspace{-0.4in} }
\psfig{figure=MNCfigs/10000.10000.3.631/A.ps,angle=-90,width=6.6in,height=3.3in}
% I wonder about doing a death star sequence to show how big this is
% plot "A.dat" u 1:2 w l
% lx=2140 ; rx=2200 ; yl = 6040; yh=6070 ; set xrange [lx:rx]; set yrange [yl:yh];rep
\]
}{
\caption[a]{A low-density parity-check matrix with $N=20\,000$ columns
of weight $j=3$ and $M=10\,000$ rows of weight $k=6$.}
\label{fig.hugeHmatrix}
}
\end{figure}
\subsection{Encoding}
Figure \ref{fig.20000.encode} illustrates the encoding operation for
the case of a Gallager code whose parity-check matrix
is a $10\,000\times20\,000$
matrix with three 1s per column
(\figref{fig.hugeHmatrix}). The high density of the {\em{generator}\/}
matrix is illustrated in \figref{fig.20000.encode}b and c
by showing the change in the transmitted
vector when one of the $10\,000$ source bits is altered.
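 The dimensions of this matrix can be checked by counting its 1s in two ways.
 With $N$ columns of weight $j$ and $M$ rows of weight $k$ we have $Nj = Mk$, so
\[
 M = \frac{N j}{k} = \frac{20\,000 \times 3}{6} = 10\,000 ,
\]
 and, assuming that the rows of $\bH$ are linearly independent, the rate is $R = 1 - M/N = 1/2$.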
Of course, the source images shown here are highly redundant,
and such images should really be compressed before encoding. Redundant
images are chosen in these demonstrations to make
it easier to see the correction process during the
iterative decoding. The decoding algorithm does {\em not\/} take advantage
of the redundancy of the source vector,
and it would work in exactly the same way irrespective of the choice of source vector.
\begin{figure}
\fullwidthfigureright{
\begin{center}
\begin{tabular}{rrrr}
{\sc{received:}}& & & \\
0 \psfig{figure=_is/20000.075.ps,width=1.22in} &
1 \psfig{figure=_is/10000.075.01.ps,width=1.22in} &
2 \psfig{figure=_is/10000.075.02.ps,width=1.22in} &
3 \psfig{figure=_is/10000.075.03.ps,width=1.22in} \\
%5 \psfig{figure=_is/10000.075.05.ps,width=1.22in} \\
%9 \psfig{figure=_is/10000.075.09.ps,width=1.22in} &
10 \psfig{figure=_is/10000.075.10.ps,width=1.22in} &
11 \psfig{figure=_is/10000.075.11.ps,width=1.22in} &
12 \psfig{figure=_is/10000.075.12.ps,width=1.22in} &
13 \psfig{figure=_is/10000.075.13.ps,width=1.22in} \\
&\raisebox{0.6in}{$\rightarrow$}
&\raisebox{0.6in}{{\sc decoded:}}
&
\psfig{figure=bitmaps/dilbert.ps,width=1.22in} \\
\end{tabular}
\end{center}
}{
\caption[a]{Iterative probabilistic decoding of a
\ldpc\ code
for a transmission received over a channel with
noise level $f={7.5\%}$.
The sequence of figures shows the best guess,
bit by bit, given by the iterative decoder, after
0, 1, 2, 3, 10, 11, 12, 13 iterations. The decoder halts after the 13th
iteration when the best guess violates no parity checks.
This final decoding is error free.
}
\label{fig.ipd2000}
}
\end{figure}
\subsection{Iterative decoding}
The transmission is sent over a channel with
noise level $f={7.5\%}$ and the received vector is shown in the upper
left of \figref{fig.ipd2000}.
The subsequent pictures in \figref{fig.ipd2000} show the iterative
probabilistic decoding process. The sequence of figures shows the best guess,
bit by bit, given by the iterative decoder, after
0, 1, 2, 3, 10, 11, 12, 13 iterations. The decoder halts after the 13th
iteration when the best guess violates no parity checks.
This final decoding is error free.
In the case of an unusually noisy transmission, the decoding algorithm
% occasionally
fails to find a valid decoding. For this code and a channel
with $f={7.5\%}$, such failures happen
about once in every $100\,000$ transmissions.
% $10^{-5}$
\Figref{fig.rate5perf} shows this error rate compared with the
block error rates of classical error-correcting codes.
\begin{figure}
\figuremargin{
\begin{center}
\mbox{
% \mbox{\psfig{figure=/home/mackay/_doc/code/ps/ps/smnbch075.bnd.ps,angle=-90,width=3.95in}}
% was 7.9, 2.2, 4, 3.4 cm
\mbox{\psfig{figure=/home/mackay/_doc/code/ps/ps/repshangl.075.l.ps,angle=-90,width=5cm}}
\makebox[0cm][l]{\raisebox{1.39cm}{\footnotesize{Shannon limit}}}%
\makebox[0cm][l]{\raisebox{2.53cm}{\footnotesize{low-density}}}%
\makebox[1.39cm][l]{\raisebox{2.15cm}{\footnotesize{parity-check code}}}%
}
\end{center}
}{
\caption[a]{Error probability of the \ldpc\ code (with error bars)
for binary symmetric channel with $f={7.5\%}$,
compared with algebraic codes. Squares: repetition codes and
Hamming $(7,4)$ code; other points: Reed--Muller and BCH codes.
% Point with error bars: the code demonstrated on this poster.
% Curve at right: the Shannon limit.
}
\label{fig.rate5perf}
}
\end{figure}
% , in this case.
\begin{figure}
\figuremargin{\small
% fullwidthfigureright{\small
\begin{center}
\begin{tabular}{ccc}
(a1)\ \ \psfig{figure=bitmaps/20000.1.185.255.ps,width=1.22in} &
(b1)\ \ \psfig{figure=bitmaps/20000.1.0.255.ps,width=1.22in} \\
(a2)\psfig{figure=figs/gc.1.185.ps,width=1.72in,angle=-90} &
(b2)\psfig{figure=figs/gc.1.0.ps,width=1.72in,angle=-90} \\
\end{tabular}
\end{center}
}{
\caption[a]{Demonstration of a
Gallager code for a Gaussian channel.
(a1) The received vector after transmission
over a Gaussian channel with $x/\sigma=1.185$ ($E_b/N_0 = 1.47\,$dB).
The greyscale represents the value of the normalized
likelihood. This transmission
can be perfectly decoded by the \sumproduct\ decoder.
The empirical probability of decoding failure
is about $10^{-5}$. (a2) The probability distribution of the
output $y$ of the channel with $x/\sigma=1.185$
for each of the two possible inputs.
(b1) The received transmission
over a Gaussian channel with $x/\sigma=1.0$, which corresponds
to the Shannon limit. (b2) The probability distribution of the
output $y$ of the channel with $x/\sigma=1.0$
for each of the two possible inputs.
% \dilbertcopy
}
\label{fig.t20000}
}
\end{figure}
% Figure \ref{fig.13298.encode} shows the encoding operation for
% a Gallager code with blocklength $13\,298$ and rate
% roughly 1/4.
\subsection{Gaussian channel}
% Figure \ref{fig.13298.gc}
%
In \figref{fig.t20000} the left picture
% shows the transmitted vector and the second
shows the received vector after
transmission over a Gaussian channel with $x/\sigma=1.185$. The
greyscale represents the value of the normalized likelihood,
$\smallfrac{P(y\given t\eq 1)}{P(y\given t\eq 1)+P(y\given t \eq 0)}$.
This signal-to-noise ratio $x/\sigma=1.185$ is a noise level at which
this rate-1/2 Gallager
code communicates reliably (the probability of error is $\simeq 10^{-5}$).
To show how close we are to the Shannon limit, the right panel shows
the received vector when the signal-to-noise ratio is reduced
to $x/\sigma=1.0$, which corresponds
to the Shannon limit for codes of rate 1/2.
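 These numbers can be related as follows; the sign convention is an assumption made for this illustration.
 Suppose the two inputs are transmitted as $\pm x$, with $t=1$ sent as $+x$, and the noise is
 Gaussian with standard deviation $\sigma$. Then the normalized likelihood is
\[
 \frac{P(y\given t=1)}{P(y\given t=1)+P(y\given t=0)} = \frac{1}{1+\exp( -2xy/\sigma^2 )} ,
\]
 and, using the standard definition $E_b/N_0 = x^2/(2 \sigma^2 R)$, a rate-1/2 code has
 $E_b/N_0 = (x/\sigma)^2$. For $x/\sigma=1.185$ this is $1.404$, that is,
 $10 \log_{10} 1.404 \simeq 1.47\,$dB; for $x/\sigma=1.0$ it is $0\,$dB.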
\subsection{Variation of performance with
code parameters}
% see allRegN, allRegt
\begin{figure}
\figuremargin{\small%
\begin{tabular}{c@{\hspace{0.5942in}}c}
\mbox{\psfig{figure=../../code/ps/33.Reg.1.2.N.ps,angle=-90,width=2.25in}}
&
\mbox{\psfig{figure=../../code/ps/816.Reg.1.2.varyt.ps,angle=-90,width=2.25in}}
\\[0.2in]
\small(a)&(b)
\\
\end{tabular}
}{
\caption[a]{Performance of rate-\dhalf\ Gallager codes on the
Gaussian channel. Vertical axis: block error probability.
Horizontal axis: signal-to-noise ratio $E_b/N_0$.
(a) Dependence on blocklength $N$ for
$(j,k)=(3,6)$ codes. From left to right: $N=816$,
$N=408$, $N=204$, $N=96$.
The dashed lines show the frequency of undetected errors,
which is measurable only
when the blocklength is as small as $N=96$ or $N=204$.
(b) Dependence on column weight $j$ for codes of blocklength $N=816$.
% $(3,6)$, $(4,8)$, $
}
\label{fig.varyNt}
}
\end{figure}
\Figref{fig.varyNt} shows how the parameters $N$ and $j$ affect the performance
of \ldpc\ codes.
As Shannon would predict, increasing the blocklength leads to
improved performance.
The dependence on $j$ follows a different pattern. Given
an {\em optimal\/} decoder, the best performance would be obtained for
the codes closest to random codes, that is, the\index{random code}
codes with largest $j$. However, the sum--product decoder
makes poor progress in dense graphs, so the
best performance is obtained for a small value of $j$.
Among the values of $j$ shown in the figure, $j=3$ is the best, for a blocklength of 816,
down to a block error probability of $10^{-5}$.
This observation motivates construction of
Gallager codes with some\hyphenation{col-umns} columns of weight 2.
A construction with $M/2$ columns of
weight 2 is shown in \figref{fig.cons}b.
 With too many columns of weight 2, however, the code becomes
 much poorer.
\begin{figure}
\figuremargin{\small
\begin{center}
\begin{tabular}{cc}
\psfig{figure=GHps/33.ps,height=0.8in}%,width=1.5in}
&
\psfig{figure=GHps/3.2A.ps,height=0.8in}%width=1.5in}
\\ (a) & (b) \\
\end{tabular}
\end{center}
}{
\caption{Schematic illustration of constructions (a)
of a completely regular Gallager code with $j=3$, $k=6$ and $R=1/2$;
(b) of a nearly-regular Gallager code with rate $1/3$.
%
Notation: an integer represents a number of permutation
matrices superposed on the surrounding square.
A diagonal line represents an identity matrix.
}
\label{fig.cons}
}
\end{figure}
As we'll discuss later,
we can do even better by making the code even more irregular.
% \newcommand{\bndips}{bndips}
% was /home/mackay/code/bndips
\begin{figure}
\fullwidthfigureright{
\begin{center}
\begin{tabular}{@{}c@{}}
\psfig{figure=\bndips/H.4.8.ps,width=2.5in,angle=-90} \\
\end{tabular}
% how to produce these figs - see ~/bin/bndiperf.p
% and ~/neuron/code/SUMMARY
\end{center}
}{
\caption[a]{%
% Analysis of sum--product decoding by
Monte Carlo simulation
of density evolution, following
% Time-course of
the decoding process for $j\!=\!4, k\!=\!8$.
Each curve shows the average entropy of a bit as a function
of number of iterations, as estimated by a Monte Carlo algorithm
using $10\,000$ samples per iteration. The noise level
of the binary symmetric channel $f$ increases by steps of
$0.005$
from bottom graph ($f\!=\!0.010$) to top graph ($f\!=\!0.100$).
There is evidently a threshold at about $f\!=\!0.075$, above which
the algorithm cannot determine $\bx$. From \citeasnoun{mncN}.
}
\label{fig.bndi.4.8}
\label{fig.bndi.3-8}
}
\end{figure}
% see ~/bin/bndiperf.p
\section{Density evolution}
\index{density evolution}\index{sparse-graph code!density evolution}\index{error-correcting code!density
evolution}\index{error-correcting code!sparse graph!density evolution}One way to study the decoding algorithm
is to imagine it running on an infinite tree-like graph with the
same local topology as the Gallager code's graph.
\marginfig{
\begin{center}
\mbox{\psfig{figure=_poincare/figs/ps8.ps,angle=-90,height=2.3in}}% yippee
\end{center}
\caption[a]{Local topology of the graph of a Gallager
code with column weight $j=3$ and row weight $k=4$.
White nodes represent bits, $x_l$; black nodes represent checks,
$z_m$; each edge corresponds to a 1 in $\bH$.
}
\label{fig.infinitegraph}
}%
%
%\subsubsection{Analysis of decoding of infinite networks by Monte Carlo methods}
\label{sec.bnd_theory}
The larger the matrix $\bH$, the closer its
decoding properties should approach those of the infinite graph.
Imagine an infinite belief network with no loops, in which every
bit $x_{\fek}$ connects to $j$ \checks\ and every \check\ $z_{\fen}$
connects to $k$ bits (\figref{fig.infinitegraph}).
We consider the iterative flow of
information in this network, and examine the average entropy of
one bit as a function of number of iterations. At each iteration,
a bit has accumulated information from its local network out to a radius
equal to the number of iterations. Successful decoding will
occur only if the average entropy of a bit decreases to zero
as the number of iterations increases.
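 One natural way to quantify the entropy of a single bit (an assumption of this illustration,
 since the estimator is not spelled out here) is the binary entropy of its posterior probability $q^1$,
\[
 H_2(q^1) = q^1 \log_2 \frac{1}{q^1} + (1-q^1) \log_2 \frac{1}{1-q^1} .
\]
 A bit with $q^1 = 0.5$ has entropy 1 bit; $q^1 = 0.1$ gives $H_2 \simeq 0.47$ bits;
 and $q^1 = 0.01$ gives $H_2 \simeq 0.08$ bits. On this measure, successful decoding corresponds to
 the average of these values tending to zero.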
The iterations of an infinite belief network can be simulated
by
Monte Carlo methods -- a technique first used by
% Gallager
\citeasnoun{Gallager63}. Imagine a network of radius $I$ (the total
number of iterations) centred on one bit. Our aim is to compute the
conditional entropy of the central bit $x$ given the state $\bz$ of all \checks\ out
to radius $I$. To evaluate the probability that the central bit is 1
given a {\em particular\/} syndrome $\bz$ involves an $I$-step
propagation from the outside of the network into the centre. At the $i$th
iteration, probabilities $\fer$ at radius $I-i+1$ are transformed into
$q$s and then into $\fer$s at radius $I-i$ in a way that depends on
the states $x$ of the unknown bits at radius $I-i$.
In the Monte Carlo method, rather than simulating this network
exactly, which would take a time that grows exponentially with
$I$, we create for each iteration a representative sample (of size 100, say)
of the values of $\{\fer,x\}$.%
\marginfig{\small
\setlength{\unitlength}{0.6mm}
\begin{picture}(66,50)(-1,-3)
\put(2,43){\makebox(0,0)[b]{$x$}}
\put(3,32){\makebox(0,0)[b]{$r$}}
\multiput(2,40)(10,0){3}{\circle{4}}
\multiput(35,40)(10,0){3}{\circle{4}}
\multiput(2,38)(33,0){2}{\vector(1,-1){8}}
\multiput(12,38)(33,0){2}{\vector(0,-1){8}}
\multiput(22,38)(33,0){2}{\vector(-1,-1){8}}
\multiput(12,25)(33,0){2}{\makebox(0,0)[t]{\plusnode}}
\put(12,24.5){\vector(1,-1){15}}
\put(45,24.5){\vector(-1,-1){15}}
\put(28.5,6){\circle{4}}
\put(28.5,3){\vector(0,-1){5}}
\put(31.5,6){\makebox(0,0)[l]{$x$}}
\put(30.72,-1){\makebox(0,0)[l]{$r$}}
\put(60,35){\makebox(0,0)[l]{\mbox{\small$\left]\begin{array}{c} \mbox{iteration}\\
i\!-\!1 \end{array}\right.$}}}% emacs bracket error
\put(40,3){\makebox(0,0)[l]{\mbox{\small$\left]\begin{array}{c} \mbox{iteration}\\
i \end{array}\right.$}}}% emacs bracket error
\end{picture}
\caption[a]{A tree-fragment constructed during
Monte Carlo simulation of density evolution.
This fragment is appropriate for a regular $j \eq 3$, $k\eq 4$
Gallager code.
}
\label{fig.treefragment}
}
In the case of a regular
network with parameters $j,k$, each new pair $\{\fer,x\}$ in the list
at the $i$th iteration is created
by drawing the new $x$ from its distribution
and drawing at random with replacement
$(j-1)(k-1)$ pairs $\{\fer,x\}$ from the list at the $(i\!-\!1)$th iteration;
these are assembled into a tree fragment (\figref{fig.treefragment})
and the \sumproduct\ algorithm
is run from top to bottom
 to find the new $\fer$ value associated with the new node.
%%%%%%%%%%%%%%%%%%%%% end of text of decoding THEORY
As an example, the results of runs with $j\eq 4$, $k\eq 8$
and noise densities $f$ between
0.01 and 0.10, using $10\,000$ samples at each iteration,
are shown in \figref{fig.bndi.4.8}. Runs with low enough noise level show a
collapse to zero entropy after a small
number of iterations, and those with high noise level decrease to a non-zero
entropy corresponding to a failure to decode.
The boundary between these two behaviours is called
the {\dem\ind{threshold}\/} of the decoding algorithm
for the binary symmetric channel.
\Figref{fig.bndi.4.8} shows by Monte Carlo simulation that
 the threshold for regular $(j,k) = (4,8)$ codes
is about 0.075.
\citeasnoun{Richardson98}
have derived thresholds for regular codes
by a tour de force of direct analytic methods. Some of these thresholds
are shown in \tabref{tab.thresh}.
\amargintab{c}{
\begin{center}
\begin{tabular}{cr}\toprule
$(j,k)$ & $f_{\max}$ \\ \midrule
(3,6) & 0.084 \\
(4,8) & 0.076 \\
(5,10) & 0.068 \\ \bottomrule
\end{tabular}
\end{center}
\caption[a]{Thresholds $f_{\max}$ for
 regular \ldpc\ codes, assuming
 the sum--product decoding algorithm,
from \citeasnoun{Richardson98}.
The Shannon limit for rate-\dhalf\ codes is $f_{\max}=0.11$.}
\label{tab.thresh}
}%
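 The Shannon limit quoted in \tabref{tab.thresh} can be checked directly.
 The capacity of the binary symmetric channel is $C = 1 - H_2(f)$, where $H_2$ is the binary
 entropy function, so a rate-\dhalf\ code can communicate reliably only if $H_2(f) < 1/2$.
 At $f = 0.11$,
\[
 H_2(0.11) = 0.11 \log_2 \frac{1}{0.11} + 0.89 \log_2 \frac{1}{0.89} \simeq 0.350 + 0.150 = 0.500 ,
\]
 so $f_{\max} \simeq 0.11$.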
% A cheap method for approximating density evolution in sparse-graph codes uses `EXIT charts'
% \cite{brink99}.
% table belongs here but moved later
% to make numbering right.
% Sun 12/1/03 graveyard
\subsection{Approximate density evolution}
For practical purposes, the computational cost of density
evolution can be reduced by making Gaussian
approximations to the probability distributions over the
messages in \ind{density evolution},\index{approximation!of density evolution}
and updating only the parameters of these approximations.
For further information about these techniques,
which produce diagrams known as {\dem\ind{EXIT chart}s},
see \cite{brink99,Chung2001,brink02}.
% A web-based density evolution service for
% \ldpc\ codes is also provided by {ChungAppletb
\section{Improving Gallager codes}
Since the rediscovery of Gallager codes,
two methods have been found for enhancing their performance.
\amargintab{b}{\small
\[
\begin{array}{c@{ \:\:\leftrightarrow\:\: }c}
GF(4) & \mbox{binary} \\ \midrule
0& 00 \\
1& 01 \\
A& 10 \\
B& 11
\end{array}
\]\vskip -0.1in
\caption[a]{Translation between $GF(4)$ and binary for message symbols. }
\label{tab.gf4trans}
}
\subsection{Clump bits and checks together}
First, we can make Gallager codes
in which the variable nodes are grouped together
 into metavariables consisting of, say, 3 binary variables,
and the check nodes\index{parity-check nodes} are similarly grouped together
into metachecks.
As before, a sparse graph can be constructed connecting
metavariables to metachecks, with a lot of freedom
about the details of how the variables and checks
within are wired up.
One way to set the wiring is to work in a finite field $GF(q)$
such as $GF(4)$ or $GF(8)$,\index{Galois field}
% \index{finite field}
define low-density parity-check matrices using elements of $GF(q)$,
and translate our binary messages into $GF(q)$ using a mapping
such as the one for $GF(4)$ given in \tabref{tab.gf4trans}.
Now, when messages are
passed during decoding, those messages are
probabilities and likelihoods over
 {\em conjunctions\/} of binary variables. For example, if each clump
 contains three binary variables then the messages will
 describe the likelihoods of the eight alternative states of those
 bits.
With carefully optimized constructions,%
\amargintab{b}{\small
\[
\begin{array}{c@{ \:\:\rightarrow\:\: }c}
GF(4) & \mbox{binary} \\ \midrule
0& \footnotesize\begin{array}{c@{}c} 0&0\\[-0.05in]0&0 \end{array} \\[0.1in]
1& \footnotesize\begin{array}{c@{}c} 1&0\\[-0.05in]0&1 \end{array} \\[0.1in]
A& \footnotesize\begin{array}{c@{}c} 1&1\\[-0.05in]1&0 \end{array} \\[0.1in]
B& \footnotesize\begin{array}{c@{}c} 0&1\\[-0.05in]1&1 \end{array}
\end{array}
\]\vskip -0.1in
\caption[a]{Translation between $GF(4)$ and binary for matrix entries.
An $M \times N$ parity-check matrix over $GF(4)$ can be turned into
a $2M \times 2N$ binary parity-check matrix in this way.}
}
the resulting codes over $GF(4)$, $GF(8)$, and $GF(16)$
perform nearly one decibel better than comparable binary Gallager codes.
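 As a check on the translation for matrix entries, and assuming the usual arithmetic of $GF(4)$,
 in which $1 + A = B$, $A \times A = B$ and $A \times B = 1$, the binary matrices
 add and multiply, modulo 2, in the same way as the field elements they represent:
\[
 \left[\begin{array}{cc} 1&1\\ 1&0 \end{array}\right]
 \left[\begin{array}{cc} 1&1\\ 1&0 \end{array}\right]
 = \left[\begin{array}{cc} 0&1\\ 1&1 \end{array}\right] , \:\:\:
 \left[\begin{array}{cc} 1&1\\ 1&0 \end{array}\right]
 \left[\begin{array}{cc} 0&1\\ 1&1 \end{array}\right]
 = \left[\begin{array}{cc} 1&0\\ 0&1 \end{array}\right] \mod 2 ,
\]
 that is, $A \times A = B$ and $A \times B = 1$. Similarly, the sum of the matrices for $1$ and $A$
 is the matrix for $B$.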
\begin{algorithm}
\begin{framedalgorithmwithcaption}{
\caption[a]{The \ind{Fourier transform} over $GF(4)$.
% \index{GF($q$)|see{Galois field}}
The \ind{Fourier transform} $F$ of a function $f$ over $GF(2)$ is given by
$F^0=f^0+f^1$, $F^1=f^0-f^1$. Transforms over $GF(2^k)$ can be viewed
as a sequence of binary transforms in each of $k$ dimensions.\index{Galois field}
%
The inverse transform is identical to the Fourier transform,
except that we also divide by $2^k$.
}
\label{alg.ftgfq}
}
\begin{center}$\displaystyle
%\begin{eqnarray}
\begin{array}{rcl}
F^0 &=& [f^0 + f^1] + [f^A + f^B]\\
F^1 &=& [f^0 - f^1] + [f^A - f^B]\\
F^A &=& [f^0 + f^1] - [f^A + f^B]\\
F^B &=& [f^0 - f^1] - [f^A - f^B]
\end{array}
%\end{eqnarray}
$
\end{center}
\end{framedalgorithmwithcaption}
\end{algorithm}
The computational cost for decoding in $GF(q)$ scales as $q \log q$,
if the appropriate \ind{Fourier transform} is used in the check nodes:
the update rule for the check-to-variable message,
\begin{equation}
r^a_{mn} \: = \:
\sum_{\bx:x_n=a}\truth\left[\sum_{n' \in {\cal N}(m)} \hspace{-0.8em} H_{mn'}x_{n'} = z_m\right]
\prod_{j\in{\cal N}(m)\wo n} q^{x_j}_{mj} ,
\end{equation}
is a \ind{convolution} of the quantities $q^{a}_{mj}$, so
the summation can be replaced by a product of the Fourier transforms
of $q^{a}_{mj}$ for
$j\in{\cal N}(m)\wo n$, followed by an inverse Fourier transform.
The Fourier transform for $GF(4)$ is shown in \algref{alg.ftgfq}.
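 As a numerical illustration of \algref{alg.ftgfq} (the input values are arbitrary), take
 $(f^0,f^1,f^A,f^B) = (0.4,0.3,0.2,0.1)$. Then
\[
\begin{array}{rcl}
 F^0 &=& [0.4 + 0.3] + [0.2 + 0.1] \:=\: 1.0 \\
 F^1 &=& [0.4 - 0.3] + [0.2 - 0.1] \:=\: 0.2 \\
 F^A &=& [0.4 + 0.3] - [0.2 + 0.1] \:=\: 0.4 \\
 F^B &=& [0.4 - 0.3] - [0.2 - 0.1] \:=\: 0.0 .
\end{array}
\]
 Applying the same transform to $(F^0,F^1,F^A,F^B)$ and dividing by 4 recovers the input;
 for example, $f^1 = \dfrac{1}{4} \left( [1.0 - 0.2] + [0.4 - 0.0] \right) = 0.3$.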
\subsection{Make the graph {irregular}}
The second way of improving Gallager codes, introduced by \citeasnoun{Luby2001b},\index{Luby, Michael G.}\index{Mitzenmacher, Michael}\index{Shokrollahi, M. Amin}\index{Spielman, Daniel A.}
is to make their graphs {\dem\ind{irregular}}.
Instead of giving all variable nodes the same degree $j$,
we can have some variable nodes with degree 2, some 3, some 4, and a few with degree 20.
Check nodes can also be given unequal\index{degree}\index{profile, of random graph}
degrees -- this helps\index{parity-check nodes}
improve performance on erasure channels, but it turns out
that for the Gaussian channel, the best graphs have regular check degrees.
% connectivity.
\begin{figure}
%\figuremargin{%
\fullwidthfigureright{\small%
\begin{center}
\mbox{\psfig{figure=figs/GC-survey2.R0.25.eps,%angle=-90,
width=4.2in}}
\end{center}
}{%
\caption[a]{Comparison of regular binary Gallager codes
with irregular codes, codes over $GF(q)$,
and other outstanding codes of rate \dquarter.
From left (best performance) to right: Irregular \ldpc\ code over $GF(8)$,
blocklength $48\,000$ bits~\cite{mcdthesis}; JPL turbo code~\cite{WWWturbo}
blocklength $65\,536$; Regular \ldpc\ over $GF(16)$, blocklength $24\,448$
bits~\cite{DaveyMacKay96};
Irregular binary \ldpc\ code, blocklength $16\,000$ bits~\cite{mcdthesis};
%{SpielLDPC}
Luby {\em{et al.}} (1998)\nocite{spielman-98-ISIT}
irregular binary \ldpc\ code,
blocklength $64\,000$ bits; JPL
code for Galileo (in 1992, this
was the best known code of rate 1/4); Regular
binary \ldpc\ code: blocklength $40\,000$ bits~\cite{mncN}. The Shannon limit
is at about $-0.79$\,dB. As of 2003, even better sparse-graph codes
have been constructed.
}
% identical to \label{fig.gl.gc}
\label{fig:GCResults}
% \end{center}
}%
\end{figure}
\Figref{fig:GCResults} illustrates the benefits offered by these two
methods for improving Gallager codes, focussing on codes
of rate \dquarter. Making the binary code irregular gives
a win of about 0.4$\,$dB;
switching from $GF(2)$ to $GF(16)$ gives about 0.6$\,$dB;
and Matthew Davey's\index{Davey, Matthew C.} code that combines
both these features -- it's irregular over $GF(8)$ -- gives
a win of about 0.9$\,$dB over the regular binary Gallager code.
% Combining these two ideas (irregular graphs, and
% grouping nodes together), Matthew Davey produced the
% excellent code of rate 0.25 shown in \figref{}.
Methods for optimizing the {\dem{profile}\/} of a\index{error-correcting code!low-density parity-check!profile}\index{sparse graph!profile}
 Gallager code (that is, its number of rows and columns of each degree)
have been developed by \citeasnoun{Richardson2001b} and have
led to \ldpc\ codes whose performance, when decoded by
the sum--product algorithm, is within a hair's breadth of
the Shannon limit.\index{degree sequence|see{profile}}
% Another name for the profile of a Gallager code
% is the degree sequence
\begin{figure}
\figuremargin{%
\begin{center}
\begin{tabular}[b]{c@{}c}
\begin{tabular}[b]{c*{6}{r}} \toprule
\multicolumn{7}{c}{\sc difference set cyclic codes} \\ \midrule
% & s=1 & s=2 & s=3 & s=4 & s=5 & s=6 \\
$N$ & 7 & 21 & 73 & 273 & 1057 & 4161\\
$M$ & 4 &{\bf 10}&{\bf 28}&{\bf 82}&{\bf 244}&{\bf 730}\\
% marks codes with huge N/M (lots of extra fortuitous constraints)
$K$ & 3 & 11 & 45 & 191 & 813 & 3431\\
$d$ & 4 &{\bf 6}& {\bf 10}&{\bf 18}& 34 & 66\\
% <- "*" marks codes with good distance
$k$ & 3 & {\bf 5}& {\bf 9}& 17 & 33 & 65\\ \bottomrule
%<- "*" marks codes with nice low row weight
\end{tabular}
&
\raisebox{-0.2in}{\psfig{figure=/home/mackay/pub/codes/EN/DSC273compare.ps,width=2.4in,angle=-90}}
\\
\end{tabular}
\end{center}
}{%
\caption{An algebraically constructed
low-density parity-check code satisfying many redundant constraints
outperforms an equivalent random Gallager code.
% Data on DSC code performance courtesy of R. Lucas and M. Fossorier.
The table
shows the $N$, $M$, $K$, distance $d$, and row weight $k$ of
some difference-set cyclic codes, highlighting the
codes that have large $d/N$, small $k$, and large $N/M$.
In the comparison the Gallager code had $(j,k) = (4,13)$,
and rate identical to the $N\eq 273$ difference-set cyclic code.
}
\label{fig.dsc}
}
\end{figure}
\subsection{Algebraic constructions of Gallager codes}
% Making codes with redundant constraints -- `the Tanner challenge'}
The performance of regular Gallager codes can be enhanced
in a third manner: by designing the
code to have {\dbf redundant sparse constraints}.\nocite{Tanner1981}
% (R. Lucas and M. Fossorier, personal communication).
There is a {\dem\ind{difference-set cyclic code}}, for example,
that\index{error-correcting code!difference-set cyclic}
has $N=273$ and $K=191$, but the code satisfies not $M=82$
 but $N$, \ie, {\em 273\/} low-weight constraints (\figref{fig.dsc}).
It is impossible
to make random Gallager codes that have anywhere near this much
redundancy among their checks.
The {difference-set cyclic code} performs about 0.7$\,$dB better
than an equivalent random Gallager code.
An open problem is to discover codes sharing the
remarkable properties of the
\ind{difference-set cyclic code}s but
with different blocklengths
and rates.
I call this task {\dbf the Tanner challenge}.\index{Tanner, Michael}
% Fast encoding of LDPCC
\section{Fast encoding of low-density parity-check codes}
\label{sec.fastencode}
\index{error-correcting code!low-density parity-check!fast encoding}%
We now discuss methods for fast encoding of low-density parity-check codes --
faster than the standard method, in which a generator matrix $\bG$ is
found by Gaussian elimination (at a cost of order $M^3$)
and then each block is encoded by multiplying it by $\bG$ (at a cost of order $M^2$).
\subsection{Staircase codes}
\label{sec.staircase}
Certain low-density parity-check matrices with $M$ columns of weight 2 or
less can be encoded easily in linear time. For example, if the
matrix\index{low-density parity-check code!staircase}
has a
{\dem\ind{staircase}\/} structure as illustrated by the right-hand side of
\beq
\bH = \left[
\mbox{\begin{tabular}[c]{c}{\raisebox{-1.5mm}{\psfig{figure=figs/gallager/A28.12.ps,width=2in,angle=-90}}}\\ \end{tabular}}
\right],
\eeq
and if the data $\bs$ are loaded into the first $K$ bits,
then the $M$ parity bits $\bp$ can be computed from
left to right in linear time.
\beq
\begin{array}{rcl@{}ll}
p_1 &=& & & \sum_{n=1}^{K} H_{1n} s_{n} \\
p_2 &=& p_1 & + & \sum_{n=1}^{K} H_{2n} s_{n} \\
p_3 &=& p_2 & + & \sum_{n=1}^{K} H_{3n} s_{n} \\
& \vdots& \\
p_M &=& p_{M-1} & + & \sum_{n=1}^{K} H_{Mn} s_{n} .
\end{array}
\label{eq.staircasemethod}
\eeq
 If we call the two parts of the $\bH$ matrix $[\bH_s | \bH_p ]$,
we can describe the encoding operation in two steps: first
compute an intermediate parity vector $\bv = \bH_s \bs$;
then pass $\bv$ through an \ind{accumulator} to create $\bp$.
The cost of this encoding method is linear if the sparsity of $\bH$
is exploited when computing the sums in (\ref{eq.staircasemethod}).
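 As a toy illustration (the matrix and source vector are invented for this example), take $K = M = 3$,
\[
 \bH_s = \left[\begin{array}{ccc} 1&1&0\\ 0&1&1\\ 1&0&1 \end{array}\right] , \:\:\:
 \bH_p = \left[\begin{array}{ccc} 1&0&0\\ 1&1&0\\ 0&1&1 \end{array}\right] , \:\:\:
 \bs = (1,0,1) .
\]
 The intermediate parity vector is $\bv = \bH_s \bs = (1,1,0) \mod 2$, and the accumulator gives
 $p_1 = v_1 = 1$, $p_2 = p_1 + v_2 = 0$ and $p_3 = p_2 + v_3 = 0$.
 Each row of $[\bH_s | \bH_p ]$ is then satisfied:
 $v_1 + p_1 = 0$, $v_2 + p_1 + p_2 = 0$ and $v_3 + p_2 + p_3 = 0$, all modulo 2.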
\subsection{Fast encoding of general low-density parity-check codes}
\citeasnoun{Urbanke00}\index{Richardson, Thomas J.}\index{Urbanke, R\"udiger}
demonstrated an elegant method by which the
encoding cost of any low-density parity-check code
can be reduced from the straightforward method's
$M^2$ to a cost of $N+ g^2$, where $g$, the {\dem{gap}},
is hopefully a small constant, and in the worst cases
scales as a small fraction of $N$.
\begin{figure}[htbp]
\figuremargin{
\setlength{\unitlength}{0.3mm}
\begin{picture}(220,135)(-100,-13)
% g = 30
%
\thinlines
\put(15,15){\makebox(0,0)[c]{$\bD$}}
\put(15,65){\makebox(0,0)[c]{$\bB$}}
\put(65,15){\makebox(0,0)[c]{$\bE$}}
%
\put(55,55){\makebox(0,0)[c]{$\bT$}}
%
\put(75,75){\makebox(0,0)[c]{${\bf 0}$}}
%
\put(-45,15){\makebox(0,0)[c]{$\bC$}}%
\put(-45,65){\makebox(0,0)[c]{$\bA$}}%
% verticals
\put(30,0){\line(0,1){100}}
\put(100,0){\line(0,1){100}}
\put(0,0){\line(0,1){100}}
\put(-90,0){\line(0,1){100}}%
% horizontal
\put(-90,30){\line(1,0){190}}%
\put(-90,0){\line(1,0){190}}%
\put(-90,100){\line(1,0){190}}%
% arrows
\put(120,60){\vector(0,1){40}}
\put(120,40){\vector(0,-1){40}}
\put(120,50){\makebox(0,0)[c]{$M$}}
%
\put(60,120){\vector(1,0){40}}
\put(40,120){\vector(-1,0){40}}
\put(50,120){\makebox(0,0)[c]{$M$}}
%
\put(110,20){\vector(0,1){10}}
\put(110,10){\vector(0,-1){10}}
\put(110,15){\makebox(0,0)[c]{$g$}}
%
\put(20,-10){\vector(1,0){80}}
\put(0,-10){\vector(-1,0){90}}%
\put(10,-10){\makebox(0,0)[c]{$N$}}
%
\put(20,110){\vector(1,0){10}}
\put(10,110){\vector(-1,0){10}}
\put(15,110){\makebox(0,0)[c]{$g$}}
% diagonal
\thicklines
\put(30,100){\line(1,-1){70}}
\end{picture}
}{
\caption[a]{The parity-check matrix in approximate
lower-triangular form.}
\label{fig.lower.triang}
}
\end{figure}
In the first step, the parity-check matrix is
rearranged, by row-interchange and column-interchange,
into the {\dem{approximate lower-triangular form}\/}
shown in \figref{fig.lower.triang}.
The original matrix $\bH$ was very sparse, so
the six matrices $\bA$, $\bB$, $\bT$,
$\bC$, $\bD$, and $\bE$
are also very sparse.
The matrix $\bT$ is lower triangular and has {\tt{1}}s
everywhere on the diagonal.
\beq
\bH = \left[\begin{array}{ccc}
\bA& \bB& \bT \\
\bC& \bD& \bE
\end{array}
\right] .
\eeq
The source vector $\bs$ of length $K = N - M$
is encoded into a transmission $\bt = [\bs , \bp_1, \bp_2 ]$
as follows.
\ben
\item
Compute the upper syndrome of the source vector,
\beq
\bz_A = \bA \bs.
\eeq
This can be done in linear time.
\item
Find a setting of the second parity bits, ${\bp}_2^{A}$,
such that the upper syndrome is zero.
\beq
{\bp}_2^{A} = - \bT^{-1} \bz_A .
\eeq
This vector can be found in linear time by back-substitution,
\ie, computing the first bit of ${\bp}_2^{A}$, then the second,
then the third, and so forth.
\item
Compute the lower syndrome of the vector $[\bs, {\bf 0}, {\bp}_2^{A}]$:
\beq
\bz_B = \bC \bs + \bE {\bp}_2^{A} .
\eeq
This can be done in linear time.
\item
Now we get to the clever bit.
Define the matrix
\beq
\bF \equiv -\bE \bT^{-1} \bB + \bD ,
\eeq
and find its inverse, $\bF^{-1}$.
This computation needs to be done once only,
and its cost is of order $g^3$. This inverse $\bF^{-1}$
is a dense $g \times g$ matrix. [If $\bF$ is
not invertible then either $\bH$ is not of full rank,
or else further column permutations of $\bH$ can produce an $\bF$
that is invertible.]
Set the first parity bits, ${\bp}_1$,
to
\beq
{\bp}_1 = - \bF^{-1} \bz_B .
\eeq
This operation has a cost of order $g^2$.
{\sf Claim:} At this point, we have found the correct
setting of the first parity bits, $\bp_1$.
\item
Discard the tentative parity bits ${\bp}_2^{A}$
and find the new upper syndrome,
\beq
\bz_C = \bz_A + \bB \bp_1 .
\eeq
This can be done in linear time.
\item
Find a setting of the second parity bits, ${\bp}_2$,
such that the upper syndrome is zero,
\beq
{\bp}_2 = - \bT^{-1} \bz_C .
\eeq
This vector can be found in linear time by back-substitution.
\een
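The claim made in step 4 can be checked directly; the following short
derivation is a sketch added for illustration.
A valid codeword $\bt = [\bs , \bp_1, \bp_2 ]$ satisfies $\bH \bt = {\bf 0}$,
that is,
\beq
\bA \bs + \bB \bp_1 + \bT \bp_2 = {\bf 0}
\:\:\: \mbox{and} \:\:\:
\bC \bs + \bD \bp_1 + \bE \bp_2 = {\bf 0} .
\eeq
The first equation gives $\bp_2 = - \bT^{-1} ( \bA \bs + \bB \bp_1 )$,
which is what steps 5 and 6 compute.
Substituting this expression into the second equation, we obtain
\beq
\left( \bD - \bE \bT^{-1} \bB \right) \bp_1
= - \left( \bC - \bE \bT^{-1} \bA \right) \bs .
\eeq
The matrix on the left is $\bF$, and, since ${\bp}_2^{A} = - \bT^{-1} \bA \bs$,
the vector $( \bC - \bE \bT^{-1} \bA ) \bs$ on the right is the
lower syndrome of $[\bs, {\bf 0}, {\bp}_2^{A}]$ found in step 3;
hence ${\bp}_1 = - \bF^{-1} \bz_B$.
For binary codes all arithmetic is modulo 2, so the signs are immaterial.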
% section{Fast decoding of non-binary {\ldpcc}s}
% in graveyard.tex
\section{Further reading}
{Low-density parity-check codes}
were first studied in 1962 by Gallager,\nocite{Gallager62,Gallager63}
then were
generally forgotten by the coding theory community.
\citeasnoun{Tanner1981} generalized Gallager's work by
introducing more general constraint nodes;
the codes that are now called \ind{turbo product code}s
should in fact be called \ind{Tanner product code}s,
since Tanner proposed them, and his colleagues \cite{Karplus1991}
implemented them in hardware.
Publications on Gallager codes contributing to their 1990s rebirth
include \cite{wiberg95,MacKay_Neal_Codes:95,mncEL,wiberg:phd,mncN,spielman-96,Sipser1996}.
Low-precision decoding algorithms
and fast encoding algorithms for Gallager codes
are discussed in \cite{Richardson98,Urbanke00}.
\citeasnoun{MacKayHighRate98} showed that \ldpc\ codes can outperform
\ind{Reed--Solomon code}s, even on the Reed--Solomon codes' home turf: high rate
and short blocklengths.\index{error-correcting code!Reed--Solomon code}
Other important papers include
\cite{Luby2001a,Luby2001b,SpielPLRC,DaveyMacKay96,Richardson2001b,Chung2001}.
% A hypertext archive of sparse graph codes -- mainly Gallager codes --
% is MacKay99ENC,
Useful tools for the design of irregular \ldpc\ codes include
\cite{ChungApplet,UrbankeApplet}.% was ChungAppletb
See \cite{wiberg:phd,frey-98,McElieceMacKay96} for further
discussion of the
\sumproduct\ algorithm.
For a view of \ldpc\ code decoding in terms of group theory and coding theory,
see \cite{Forney2001,Offer2000,Offer2001}; and for background reading on this topic see
\cite{hartmann1976,Terras99}.
% # LDPC codes: a group algebra formulation (E. Offer and E. Soljanin)
% Proc., Internat. Workshop on Coding and Cryptography WCC 2001, 8-12 Jan. 2001, Paris. (ps-file)
There is a growing literature on the practical design of {\ldpcc}s \cite{mao00,mao01,brink02};
they are now being adopted for applications from hard drives to satellite
communications.
For \ldpc\ codes applicable to \ind{quantum error-correction},
see \citeasnoun{mackaymitchisonmcfadden2003}.\index{error-correcting code!quantum}
\section{Exercises}
\exercisxC{2}{ex.tanhGall}{
{\sf The `hyperbolic tangent' version of the decoding algorithm}.
In \secref{sec.spgall}, the sum--product decoding algorithm for
\ldpc\ codes\index{sum--product algorithm}
was presented first in terms of quantities
$q^{0/1}_{\fen \fel}$ and $r^{0/1}_{\fen \fel}$,
then in terms of quantities $\delta q$ and $\delta r$.
There is a third description, in which
the $\{ q \}$ are replaced by log probability-ratios,
\beq
l_{\fen \fel} \equiv \ln \frac{q^0_{\fen \fel} }{ q^1_{\fen \fel} } .
\eeq
Show that
\beq
\delta q_{\fen \fel} \equiv q^0_{\fen \fel} - q^1_{\fen \fel} = \tanh ( l_{\fen \fel} / 2 ) .
\eeq
Derive the update rules for $\{r\}$ and $\{l\}$.
}
\exercissxC{2}{ex.howmanyGH}{
I am sometimes asked `why not decode
{\em{other}\/} linear codes, for example algebraic codes,
by transforming their parity-check matrices
so that they are low-density, and applying the
sum--product algorithm?' [Recall that any linear
combination of rows of $\bH$, $\bH'= \bP \bH$,
is a valid parity-check matrix for a code, as long as
the matrix $\bP$ is invertible; so there are many
parity-check matrices for any one code.]
Explain why a random linear code does not have a low-density
parity-check matrix. [Here, low-density means `having
row-weight at most $k$', where $k$ is some small constant $\ll N$.]
}
%\exercisxC{4}{ex.vfegall}{
% A variational free energy minimization decoder
% can be made for Gallager codes {MacKay_Neal_Codes:95}
% method -- this gives a connection to Hopfield networks for
% constraint satisfaction problems.
%}
\exercisxC{3}{ex.gallj2}{
Show that if a \ldpcc\ has more than $M$ columns of weight 2 --
say $\alpha M$ columns, where $\alpha>1$ -- then the code will have
words with weight of order $\log M$.
}
\exercisxC{5}{ex.typicalwef}{
In \secref{sec.wef.random} we found the expected value of
the weight enumerator function $A(w)$, averaging over\index{weight enumerator!typical}
the ensemble of all random linear codes. This calculation
can also be carried out for the ensemble of low-density parity-check codes
\cite{Gallager63,mncN,LitsynEnumerator}.\index{Litsyn, Simon}\index{Shevelev, Vladimir}
It is plausible, however, that the
mean value of $A(w)$ is not always a good indicator of
the {\em{typical}\/} value of $A(w)$ in the ensemble.
For example, if, at a particular value of $w$,
99\% of codes have $A(w) = 0$, and 1\% have $A(w) = 100\,000$,
then while we might say the typical value of $A(w)$ is zero, the mean
is found to be $1000$.
Find the {\em{typical}\/} weight enumerator function of
low-density parity-check codes.
}
\section{Solutions}
\soln{ex.howmanyGH}{
Consider codes of rate $R$ and blocklength $N$,
having $K = RN$ source bits and $M = (1\!-\!R)N$ parity-check bits.
Let all the codes have their bits ordered so that the first $K$ bits
are independent, so that we could if we wish put the
code in systematic form,
\beq
\bG = [ {\bf 1}_K | \bP^{\T} ] ; \:\:
\bH = [ \bP | {\bf 1}_M ] .
\eeq
The number of {\em distinct\/} linear codes is the number of matrices $\bP$,
which is ${\cal N}_1 = 2^{MK} = 2^{N^2 R(1-R)}$.
\marginpar[b]{
$ \log {\cal N}_1 \simeq N^2 R(1-R)
$
}%
Can these all be expressed as distinct \ldpc\ codes?
The number of low-density parity-check matrices with row-weight $k$ is
\beq
{N \choose k}^M
\eeq
and the number of distinct codes that they define is at most
\beq
{\cal N}_2 = \left. { {N \choose k}^M } \right/ { M! } ,
\eeq
%(since it is smaller than $N^{Nk}$ -- take
% logarithms if this is not clear),
\marginpar[b]{
$ \log {\cal N}_2 < N k \log N
$
}%
which is much smaller than ${\cal N}_1$,
so, by the \ind{pigeon-hole} principle, it is not possible
for every random linear code to map on to a low-density $\bH$.
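To get a feel for the sizes involved, consider the illustrative values
$N=1000$, $R=1/2$ and $k=6$ (chosen for this example only). Then
\beq
\log_2 {\cal N}_1 = N^2 R(1-R) = 250\,000 ,
\:\:\: \mbox{whereas} \:\:\:
\log_2 {\cal N}_2 < N k \log_2 N \simeq 60\,000 ,
\eeq
so at most a fraction of about $2^{-190\,000}$ of these linear codes can have
a parity-check matrix with row-weight $k=6$.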
}
\nocite{Battail93_cando}
\nocite{Divsalar1998}
\nocite{DaveyMacKay99a}
\nocite{MacKayHighRate98}
\nocite{BMT78}
\nocite{berrou-glavieux-96}
\nocite{DaveyMacKay96}
%\nocite{dietterich91error-correcting}%
\nocite{Forney66}
\nocite{frey-98}
\nocite{Gallager62}
\nocite{Gallager63}
\nocite{Gallager68}
\nocite{Golomb1994}
\nocite{spielman-98-ISIT}
\nocite{MacKay94:fes}
\nocite{MacKay97:ipd}
\nocite{MacKay_Neal_Codes:95}
\nocite{mncEL}
\nocite{mncN}
\nocite{MacKayAllerton98}
\nocite{McEliece77}
\nocite{McElieceMacKay96}
\nocite{pearl}
\nocite{Shannon48}
\nocite{spielman-96}
\nocite{JPLcode}
\nocite{Tanner1981}
%\nocite{ChungApplet}
\nocite{wiberg:phd}
\nocite{wiberg95}
\dvips
%{Convolutional codes}
\chapter{Convolutional Codes and Turbo Codes\nonexaminable}
\label{ch.convol}
\label{ch.turbo}
This chapter follows tightly on from \chref{ch.exact}.
It makes use of the ideas of codes and
trellises and the forward--backward algorithm.
\medskip
% convolutional codes chapter
%
% newcommands moved to itprnnchapter
%
%\section{Introduction to Convolutional Codes}
\section{Introduction to convolutional codes}
When we studied linear block codes, we described them
in three ways:
\ben
\item
The generator matrix
describes how to turn a string of $K$ arbitrary source bits into a
transmission of $N$ bits.
\item
The parity-check matrix specifies the
% We also described them in terms of the
$M = N - K$ parity-check constraints that a valid codeword satisfies.
\item
The trellis of the code describes its valid codewords in terms
of paths through a trellis with labelled edges.
% We then described the same constraints
% graveyard cut
\een
A fourth way of describing some block codes, the algebraic
approach, is not covered in this book
(a) because it has been well covered by numerous other books in coding theory;
(b) because,
as this part of the book discusses,
the state of the art in error-correcting codes
makes little use of algebraic coding theory; and (c) because I am not
competent to teach this subject.
We will now describe convolutional codes in two ways:
first,
% A convolutional code can also be described
in terms of
mechanisms for generating transmissions $\bt$ from source bits $\bs$;
and second, in terms of trellises that describe the constraints
satisfied by valid transmissions.
% We will give both descriptions now for the case of a rate-1/2 code.
%\section{The generator description: Linear shift registers}
\section{Linear-feedback shift-registers}
We generate a transmission with a convolutional code by putting
a source stream through a linear filter.
This filter makes use of
% three components:
a shift register, linear output functions, and, possibly, linear feedback.
% Various configurations with and without feedback.
I will draw the shift-register in a right-to-left orientation:
bits roll from right to left as time goes on.
% This unconventional orientation has some notational advantages.
%the advantage that the state can be written with high bit first, and the taps
% can be written high bit first, as is conventional, and no mirrors are needed.
% made using draw.p, all in one line as follows:
% draw.p K=7 g=249 f=167 completetex=0 ; draw.p K=7 g=235 f=1 completetex=0 ; draw.p K=2 g=7 f=5 completetex=0 ; draw.p K=2 g=7 f=5 completetex=0 labels=2 texfile=tex/kl ; draw.p completetex=0 ; draw.p K=7 g=249 f=167 completetex=0 labels=2 texfile=tex/kl
%draw.p K=7 g=249 f=167 completetex=0
% draw.p K=7 g=249 f=167 completetex=0 labels=1 texfile=tex/kl
% draw.p K=2 g=7 f=5 completetex=0
% draw.p completetex=0
\Figref{fig.convol.lfsr7} shows three linear-feedback shift-registers
% with two outputs
which could be used to define convolutional codes.
% with rate $1/2$.
The rectangular box surrounding the bits $\z_1 \ldots \z_7$
indicates the {\em memory\/} of the filter, also known as its {\em state\/}.
% of the filter.
% finite state machine that is the filter.
All three filters have one input and two outputs.
On each clock cycle, the source supplies one bit, and the filter
outputs two bits $\cta$ and $\ctb$.
By concatenating together these bits we can obtain from
our source stream $s_1 s_2 s_3 \ldots$ a
transmission stream $\cta_1 \ctb_1 \cta_2 \ctb_2 \cta_3 \ctb_3 \hspace{-0.5ex}\ldots.$
Because there are two transmitted bits for every source bit,
the codes shown in \figref{fig.convol.lfsr7} have rate $\dhalf$.\pagebreak[4]
Because these filters require $k=7$ bits of memory,
the codes they define are known as {\em constraint-length 7 codes}.
\begin{figure}
\figuremargin{\small%
\begin{tabular}[b]{lll}
& & Octal name \\
(a)&\mbox{\input{convol/tex/k7_1_235sn.tex}} & $(1,353)_8$ \\
\\
(b)&\mbox{\input{convol/tex/k7_167_249nn.tex}} & $(247,371)_8$ \\
\\
(c)&\mbox{\input{convol/tex/k7_167_249sr.tex}} & $\left(1, \frac{247}{371}\right)_8$ \\
\\
\end{tabular}
%\begin{tabular}[b]{l}
%\mbox{(d)\psfig{figure=convol/ps/K7.ps,angle=-90,height=8in}}
%\end{tabular}
}{%
\caption[a]{Linear-feedback shift-registers for generating convolutional codes
with rate $1/2$.
%And the trellis corresponding to one section of (b) or (c).
The symbol $\!\!$\leftD$\:$ indicates a copying with a delay of one clock cycle.
The symbol $\oplus$ denotes linear addition modulo 2 with no delay.
% After Frey (1998).
}
\label{fig.convol.lfsr7}
}%
\end{figure}
Convolutional codes come in three flavours, corresponding to the three
types of filter in \figref{fig.convol.lfsr7}.
\subsection{Systematic nonrecursive}
The filter shown in \figref{fig.convol.lfsr7}a
% is a special case of \ind{linear feedback shift register} with
has no feedback.
It also has the property that one of the output bits, $\cta$,
is identical to the source bit $s$.
This encoder is thus called {\dem\ind{systematic}}, because
the source bits are reproduced transparently in the transmitted
stream, and {\dem\ind{nonrecursive}}, because it has no feedback.
The other transmitted bit $\ctb$ is a linear function of the state
of the filter. One way of describing that function is as a dot product (modulo 2)
between two binary vectors of length $k+1$:
a binary vector $\bg^{(b)} = ( 1,1,1,0,1,0,1,1 )$
and the state vector $\bz = (z_k,z_{k-1},\ldots,z_1,z_0)$.
We include in the state vector the bit $z_0$ that will
be put into the first bit of the memory on the next cycle.
The vector $\bg^{(b)}$
has $g^{(b)}_{\kappa} = 1$ for every ${\kappa}$ where there is a tap (a downward
pointing arrow) from state bit $z_{\kappa}$ into the transmitted bit $\ctb$.
A convenient way to describe these binary tap vectors is in \inds{octal}\index{tap}.
Thus, this filter makes use of the tap vector $353_8$.
\margintab{\begin{center}\begin{tabular}{c@{$\,$}c@{$\,$}c}\tt11&\tt101&\tt011\\
$\downarrow$&
$\downarrow$&
$\downarrow$\\ 3&5&3 \\
\end{tabular}\end{center}
\caption[a]{How taps in the delay line are converted to octal.}
}
I have drawn the \ind{delay line}s from right to left
to make it easy to relate the diagrams to these octal numbers.
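Written out explicitly for the tap vector $353_8$
(this expansion simply restates the dot product described above),
the second transmitted bit of the filter in \figref{fig.convol.lfsr7}a is
\beq
\ctb = \sum_{\kappa=0}^{k} g^{(b)}_{\kappa} z_{\kappa} \bmod 2
 = z_7 \oplus z_6 \oplus z_5 \oplus z_3 \oplus z_1 \oplus z_0 ,
\eeq
where the six terms correspond to the six {\tt{1}}s in
$\bg^{(b)} = ( 1,1,1,0,1,0,1,1 )$, read against
$\bz = (z_7,z_6,\ldots,z_1,z_0)$.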
\subsection{Nonsystematic nonrecursive}
The filter shown in \figref{fig.convol.lfsr7}b also has no
feedback, but it is not systematic. It makes use of two tap vectors
$\bg^{(a)}$ and $\bg^{(b)}$ to create its two transmitted bits. This
encoder is thus {\em nonsystematic} and {\em nonrecursive}. Because
of their added complexity, nonsystematic codes
can have error-correcting abilities superior to those of systematic
nonrecursive codes with the same constraint length.
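The same octal conversion applied to the name $(247,371)_8$ of this filter
gives its two tap vectors of length $k+1 = 8$; the leading zero of each
nine-bit expansion is dropped:
\beq
\begin{array}{rclcl}
247_8 &=& {\tt 010\,100\,111} & \rightarrow & ( 1,0,1,0,0,1,1,1 ) \\
371_8 &=& {\tt 011\,111\,001} & \rightarrow & ( 1,1,1,1,1,0,0,1 ) .
\end{array}
\eeq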
\subsection{Systematic recursive}
The filter shown in \figref{fig.convol.lfsr7}c is
similar to the nonsystematic nonrecursive
filter shown in \figref{fig.convol.lfsr7}b, but it
uses the taps that formerly made up $\bg^{(a)}$
to make a linear signal that is fed back into the
shift register along with the source bit.
The output $\ctb$ is a linear function of the state
vector as before. The other output is $\cta = s$,
so this filter is systematic.
%
% could tighten this into one para.
A recursive code is conventionally identified by an octal ratio,
\eg, \figref{fig.convol.lfsr7}c's code is denoted by
$(247/371)_8$.
% eg 21/37.
%\begin{figure}
%\begin{tabular}{ll}
%(b)&\mbox{\input{convol/tex/kl7_167_249nn.tex}}
%\\
%(c)&\mbox{\input{convol/tex/kl7_167_249sr.tex}}
%\\
%\end{tabular}
%\caption[a]{Linear feedback shift registers for generating convolutional codes.
% Labelled figure showing `$p$'}
%\label{fig.convol.lfsr7l}
%\end{figure}
%\begin{figure}
%\figuremargin{%
\amarginfig{t}{\small
$\:\:\:\:$\begin{tabular}{c@{}}
{\input{convol/tex/kl2_5_7nn.tex}} \\ (a) $(5,7)_8$ \\[0.123in]
{\input{convol/tex/kl2_5_7sr.tex}} \\ (b) $(5/7)_8$ \\[0.123in]
\end{tabular}
%}{%
\caption[a]{Two rate-1/2 convolutional codes with constraint length $k=2$:
% Labelled figure showing `$p$'.
(a) nonrecursive; (b) recursive. The two codes
are equivalent.
%
}
\label{fig.convol.K2}
}%
%\end{figure}
\subsection{Equivalence of systematic recursive and
nonsystematic nonrecursive codes}
%
The two filters in
\figref{fig.convol.lfsr7}b,c are {\dem\ind{code-equivalent}\/}
in that the {\em sets\/} of codewords that they define are
identical.
For
every codeword of the nonsystematic nonrecursive
code we can choose a source stream
for the other encoder such that
its output is identical (and {\em vice versa}).
To prove this, we denote by
$p$ the quantity
$\sum_{\kappa=1}^k g^{(a)}_{\kappa} \z_{\kappa}$,
as shown in \figref{fig.convol.K2}a and b, which
shows a pair of smaller but otherwise equivalent filters.
If the two transmissions are to be equivalent --
that is, the $\cta$s are equal in both figures and so are the
$\ctb$s --
then on every cycle the source bit in the systematic code
must be $s=\cta$. So now we must simply confirm that
for this choice of $s$, the systematic code's
shift register will follow the same state sequence as
that of the nonsystematic code, assuming that the states match initially.
In \figref{fig.convol.K2}a we have
\beq
\cta = p \oplus z_0^{\rm nonrecursive}
\eeq
whereas in \figref{fig.convol.K2}b we have
\beq
z_0^{\rm recursive} = \cta \oplus p .
\eeq
Substituting for $\cta$, and using $p \oplus p = 0$ we immediately find
\beq
z_0^{\rm recursive} = z_0^{\rm nonrecursive} .
\eeq
Thus, any codeword of a nonsystematic nonrecursive code
is a codeword of a systematic recursive code with the same taps --
the same taps in the sense that there are vertical arrows
in all the same places in figures \ref{fig.convol.K2}(a) and (b),
though one of the arrows points up instead of down
% in the opposite direction
in (b).
\begin{figure}
\figuremargin{\small%
\begin{tabular}{ll}
(a)&\mbox{\psfig{figure=convol/ps/K2NN.001source.ps,angle=-90,height=2in}} \\[0.43in]
(b)&\mbox{\psfig{figure=convol/ps/K2.001source.ps,angle=-90,height=2in}} \\[0.13in]
\end{tabular}
%\begin{tabular}{ll}
%(c)&\mbox{\psfig{figure=convol/ps/K2.paths.ps,angle=-90,height=2in}}
%\\
%(d)&\mbox{\psfig{figure=convol/ps/K2.transmitted.path.ps,angle=-90,height=2in}}
%\\
%\end{tabular}
}{%
\caption[a]{Trellises of the rate-1/2 convolutional codes
of \protect\figref{fig.convol.K2}.
It is assumed that
the initial state of the filter is $(z_2,z_1)=(0,0)$.
Time is on the horizontal axis and
the state of the filter at each time step is the
vertical coordinate.
On the line segments are shown the emitted symbols $\cta$
and $\ctb$, with stars for `1' and boxes for `0'.
The paths
taken through the trellises when the source sequence is {\tt{00100000}} are
highlighted with a solid line.
The light dotted lines show the state trajectories that
are possible for other source sequences.
% NN first transmission =
% 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0. Note finite impulse response.
% SR second, transmission =
% 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1.
% But note that there is a path through either trellis corresponding to the
% other.
}
\label{fig.convol.K2.trellises}
}%
\end{figure}
\begin{figure}
\figuremargin{%
\begin{tabular}{l}
% (a)&\mbox{\psfig{figure=convol/ps/K2NN.111source.ps,angle=-90,height=2in}} \\[0.3in]
\mbox{\psfig{figure=convol/ps/K2.111source.ps,angle=-90,height=2in}} \\[0.203in]
\end{tabular}
}{%
\caption[a]{The source sequence for the systematic recursive
code {\tt{00111000}} produces the same
path through the trellis as {\tt{00100000}}
does in the nonsystematic nonrecursive case.
% Path taken through SR trellis when source = (00111000).
% SR transmission =
% 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0,
}
\label{fig.convol.K2NN.trans3}
}%
\end{figure}
Now, while these two codes are equivalent, the two encoders
behave differently.
The nonrecursive encoder has a {\em finite impulse response},
that is, if one puts in a string that is all zeroes except for a
single one, the resulting output stream contains a finite number of
ones.
Once the one bit
has passed through all the states of the memory, the
delay line returns to the all-zero state.
\Figref{fig.convol.K2.trellises}a
% and there is
% no record of that bit's passing. The figure
shows the state sequence resulting
from the source string $\bs=$(0, 0, 1, 0, 0, 0, 0, 0).
\Figref{fig.convol.K2.trellises}b shows
the trellis of the recursive code of \figref{fig.convol.K2}b
and
% it shows by a solid line
the response of this filter
to the same source string $\bs=$(0, 0, 1, 0, 0, 0, 0, 0).
The filter has an {\em infinite impulse response}.
The response settles into a periodic state
% after the first `11' is periodic
with period equal to three
clock cycles.
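The period of three can be traced by hand. The table below is a worked
sketch; it assumes that the feedback signal of the recursive filter is
$p = z_1 \oplus z_2$ (taps $7_8$), that $z_0 = s \oplus p$ and $\cta = s$,
and that the second output is $\ctb = z_0 \oplus z_2$ (taps $5_8$). This
assignment of the taps is an assumption, chosen to be consistent with the
period quoted above and with the caption of \figref{fig.convol.K2NN.trans3}.
\begin{center}
\begin{tabular}{cccccc}
Time & $s$ & $(z_2,z_1)$ & $z_0$ & $\cta$ & $\ctb$ \\
1 & 0 & (0,0) & 0 & 0 & 0 \\
2 & 0 & (0,0) & 0 & 0 & 0 \\
3 & 1 & (0,0) & 1 & 1 & 1 \\
4 & 0 & (0,1) & 1 & 0 & 1 \\
5 & 0 & (1,1) & 0 & 0 & 1 \\
6 & 0 & (1,0) & 1 & 0 & 0 \\
7 & 0 & (0,1) & 1 & 0 & 1 \\
8 & 0 & (1,1) & 0 & 0 & 1 \\
\end{tabular}
\end{center}
From time 4 onwards the state cycles through $(0,1)$, $(1,1)$ and $(1,0)$,
which are all three non-zero states of a filter with $k=2$,
and the output $\ctb$ repeats the pattern {\tt{110}} indefinitely.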
\exercisxB{1}{ex.inputSR}{
What is the input to the
recursive filter such that its state
sequence and the transmission are the same as those of the nonrecursive
filter? (Hint: see \figref{fig.convol.K2NN.trans3}.)
}
In general a \lfsr\ with $k$ bits of memory
has an impulse response that is periodic with a
period that is at most $2^k -1$, corresponding to
the filter visiting every non-zero state in its state space.
% If $2^K -1$ is factorizable into a product of smaller integers, then
% depending on the choice of the taps the period may be equal
% to one of those smaller integers.
Incidentally, cheap pseudorandom number generators
and cheap cryptographic products make use of exactly these periodic
sequences, though with larger values of $k$ than 7;
the random number seed or cryptographic key selects
the initial state of the memory.
There is thus a close connection between certain \ind{cryptanalysis}
problems and the decoding of convolutional codes.
% removed stuff to graveyard Wed 1/1/03
\section{Decoding convolutional codes}
% To keep things general, let's discuss the case where
% the initial start state is not known.
The receiver receives a bit stream, and wishes to infer the
state sequence and thence the source stream.
The posterior probability of each bit can be found by
the \ind{sum--product algorithm} (also known as the forward--backward or
\ind{BCJR} algorithm),
which was introduced in \secref{sec.trellisfb}.
The most probable state sequence can be found using the
\ind{min--sum algorithm} of \secref{sec.viterbi}
(also known as the \ind{Viterbi algorithm}).
The nature of this task is illustrated in \figref{fig.convol.K4.trans},
which shows the cost associated with each
edge in the trellis for the case of a sixteen-state
code; the channel is assumed to be a binary symmetric channel
and the received vector is equal to a codeword
except that one bit has been flipped.
There are three line styles, depending on the value of the likelihood:
thick solid lines show the
edges in the trellis that match the corresponding
two bits of the received string exactly; thick dotted lines
show edges that match one bit but mismatch the other; and
thin dotted lines show the edges that mismatch both bits.
The
% Viterbi
min--sum algorithm seeks the path through the trellis that
uses as many solid lines as possible; more precisely, it minimizes
the cost of the path, where the cost is zero for a solid line,
one for a thick dotted line, and two for a thin dotted line.
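The link between this cost and the likelihood can be made explicit.
As a sketch, let $f < 1/2$ denote the flip probability of the binary symmetric
channel (the symbol $f$ is introduced here for illustration).
An edge whose two bits disagree with the received bits in
$d \in \{0,1,2\}$ positions has likelihood
\beq
P( \mbox{received pair} \,|\, \mbox{edge} ) = f^{d} (1-f)^{2-d} ,
\eeq
so that
\beq
- \log P = - 2 \log (1-f) + d \, \log \frac{1-f}{f} .
\eeq
The first term is the same for every edge, and $\log [(1-f)/f] > 0$,
so the path of greatest likelihood is the one that minimizes the
total number of mismatches, summed along the path.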
\exercissxB{1}{ex.spotbit}{
Can you spot the most probable path and the flipped bit?
}
\begin{figure}
\figuremargin{%
\begin{tabular}{c}
\mbox{\input{convol/tex/k4_17_31sr.tex}} $(21/37)_8$ \\[0.2in]
\mbox{\psfig{figure=convol/ps/K4s2like.ps,angle=-90,height=2in}}
\\
\end{tabular}
}{%
\caption[a]{The trellis for a $k=4$ code
painted with the likelihood function when the received vector is
equal to a codeword with just one bit flipped.
There are three line styles, depending on the value of the likelihood:
thick solid lines show the
edges in the trellis that match the corresponding
two bits of the received string exactly; thick dotted lines
show edges that match one bit but mismatch the other; and
thin dotted lines show the edges that mismatch both bits.
% (b2) the likelihood function for the received vector
% equal to a codeword with a different one bit flipped.
% Notice also in b1 that there are pairs of paths which are identical
% for many successive bits.
% This figure illustrates the unequal protection of bits
% and motivates termination of the trellis.
}
\label{fig.convol.K4.trans}
\label{fig.convol.lfsr4}
}%
\end{figure}
\subsection{Unequal protection}
A defect of the convolutional codes presented thus far
is that they offer unequal protection to the source bits.
\Figref{fig.convol.K4.source} shows
two paths through the trellis that differ in only
two transmitted bits.
% Their source strings differ in the last source
% bit only.
The last source bit is less well protected than the other source bits.
This unequal protection of bits
motivates the {\dem\ind{termination}\/} of the trellis.\index{trellis!termination}
\begin{figure}
\figuremargin{%
\begin{tabular}{l}
\mbox{\psfig{figure=convol/ps/K4s4asource.ps,angle=-90,height=1.52in}} \\[0.2in]
\mbox{\psfig{figure=convol/ps/K4s4bsource.ps,angle=-90,height=1.52in}}
\\
\end{tabular}
}{%
\caption[a]{Two paths that differ in two transmitted bits only.}
\label{fig.convol.K4.source}
}%
\end{figure}
\begin{figure}
\figuremargin{%
%\begin{tabular}{ll}
%(a)&\mbox{\psfig{figure=convol/ps/K2term.ps,angle=-90,height=2in}}
%\\
%(b)&
\mbox{\psfig{figure=convol/ps/K4term.ps,angle=-90,height=2in}}
%\\
%\end{tabular}
}{%
\caption[a]{A terminated trellis.
When any codeword is completed, the filter state is {\sf{0000}}.}
\label{fig.convol.K2term}
}%
\end{figure}
A terminated trellis is shown in \figref{fig.convol.K2term}.
% Note that terminating makes the protection uniform.
% Seems that the end bits get the best protection now.
Termination slightly reduces the number of source bits used
per codeword. Here, four source bits are turned into
parity bits because the $k=4$ memory bits must be returned to zero.
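As a rough illustration of the cost of termination (the blocklength here is
an assumed figure, not one taken from the text), suppose a rate-$1/2$ code with
$k=4$ is run for $L=1000$ clock cycles, the last four of which are used to
return the memory to zero. The number of free source bits is then $L-k$, the
number of transmitted bits is $2L$, and the rate is
\beq
R = \frac{L-k}{2L} = \frac{996}{2000} = 0.498 ,
\eeq
compared with $0.5$ for the unterminated code.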
% \clearpage% added Thu 19/6/03 to get lfsr pics to clear out.
% TURBO CODES % used by convol.tex
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% \section{Turbo Codes \nonexaminable}
%
\begin{figure}
\fullwidthfigureright{\small%figuremargin
\begin{center}
\hspace*{0.5in}\mbox{(a) \psfig{figure=TCC/ps/T3.ps,height=1.5in}
(b) \psfig{figure=TCC/ps/T.ps,height=1.5in} }\\[0.015in]
% (c) \psfig{figure=convol/ps/K4.ps,height=1.5in,angle=-90}
% \\ \mbox{\input{convol/tex/k4_17_31sr.tex}} \\
\end{center}
}{%
\caption[a]{Rate-1/3 (a) and rate-1/2 (b) turbo codes represented as
\ind{factor graph}s. The circles represent the codeword bits. The two rectangles represent
trellises of rate-1/2 convolutional codes, with the
systematic bits occupying the left half of the rectangle and the
parity bits occupying the right half. The puncturing of these constituent
codes in the rate-1/2 turbo code
is represented by the lack of connections to half of the parity bits in each
trellis.
% (c) The contents of the rectangles. Each trellis is generated
% by the recursive filter shown in \protect\figref{fig.convol.lfsr4}
% which has a state space of 4 bits. In each pair of adjacent bits,
% one is a systematic bit and one is a parity bit.}
}
\label{fig.TCC.T}
}
\end{figure}
\section{Turbo codes}
\label{sec.turbo}
%\label{ch.turbo}
\label{sec:turbo.intro}
An $(N,K)$ turbo code
is defined by a number of constituent convolutional
encoders (often, two) and
an equal number of \index{interleaving}{\dem{interleavers}\/} which are $K \times K$ permutation matrices.
Without loss of generality, we take the first interleaver to be the
identity matrix.
\amarginfig{b}{
\begin{center}
\setlength{\unitlength}{1mm}
\begin{picture}(25,30)(0,8)%
\put(15,18){\framebox(8,8){$C_1$}}
\put(15, 8){\framebox(8,8){$C_2$}}
\put( 9,12){\circle{6}}
\put( 5, 8){\framebox(8,8){$\pi$}}
\put(9.7,14.875){\vector(1,0){0.1}}% right pointing circle vector % was 975
\put(23,22){\vector(1,0){3}}
\put(23,12){\vector(1,0){3}}
\put(13,12){\line(1,0){2}}
\put( 2,12){\vector(1,0){3}}
\put( 0,22){\vector(1,0){15}}
\put( 2,22){\line(0,-1){10}}
%
\end{picture}%
\end{center}
\caption[a]{The encoder of a turbo code.
Each box $C_1$, $C_2$, contains a convolutional code.
The source bits are reordered using a permutation $\pi$ before
they are fed to $C_2$. The transmitted codeword is obtained
by concatenating or interleaving the outputs of the two
convolutional codes.}
}
% The constituent encoders are often convolutional codes.
A string of $K$ source bits is encoded by feeding them into each constituent encoder
in the order defined by the associated interleaver, and transmitting
the bits that come out of each constituent encoder.
% For simplicity, let us concentrate on turbo codes with two constituent
% codes that are both convolutional codes.
Often the first constituent encoder is chosen to be a systematic encoder,
just like the recursive filter shown in \protect\figref{fig.convol.lfsr4}, and the
second is a nonsystematic one of rate 1 that emits parity bits only.
%so the
% source stream contains one copy of the source bits.
The transmitted
codeword then consists of $K$ source bits followed by $M_1$ parity bits
generated by the first convolutional code and $M_2$ parity bits
from the second. The resulting turbo code has rate $\dthird$.
The turbo code can be represented by a factor graph
in which the two trellises are represented by two
large rectangular nodes (\figref{fig.TCC.T}a); the $K$ source
bits and the first $M_1$ parity bits participate in the first trellis
and the $K$ source
bits and the last $M_2$ parity bits participate in the second trellis.
Each codeword bit participates in either one or two trellises,
depending on whether it is a parity bit or a source bit.
Each trellis node contains a trellis
exactly like the terminated trellis shown in \figref{fig.convol.K2term},
except one thousand times as long.
[There are other factor graph representations for turbo codes
that make use of more elementary nodes, but the factor graph
given here yields the standard version of the sum--product
algorithm used for turbo codes.]
% See \figref{fig.code.space}.
If a turbo code of smaller rate such as $\dhalf$ is
required, a standard modification to the rate-$\dthird$
code is to {\dem\index{puncturing}{puncture}\/} some of the parity bits (\figref{fig.TCC.T}b).
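As a check on these rates, suppose that each constituent code contributes one parity bit per source bit, so that $M_1 = M_2 = K$ (an assumption made here for illustration, which ignores the few extra bits needed to terminate the trellises). The unpunctured code then has
\[
 R = \frac{K}{K + M_1 + M_2} = \frac{K}{3K} = \frac{1}{3} .
\]
If, for example, every second parity bit in each stream is punctured, only $K/2$ parity bits per trellis are transmitted, and
\[
 R = \frac{K}{K + K/2 + K/2} = \frac{1}{2} .
\]
Any other puncturing pattern that deletes half of the parity bits gives the same rate.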
% I really wanted the wide figure to be here
Turbo codes are decoded using the sum--product algorithm
described in \chref{ch.sumproduct}.
On the first iteration,
each trellis receives the channel likelihoods, and runs
the forward--backward algorithm to compute, for each
bit,
the relative likelihood of its being {\tt{1}} or {\tt{0}},
given the information about the other bits.
These likelihoods are then passed across from each
trellis to the other, and multiplied by the channel
likelihoods on the way. We are then ready for the second iteration:
the forward--backward algorithm is run again in each trellis
using the updated probabilities. After about ten or twenty
such iterations, it's hoped that the correct
decoding will be found. It is common practice to stop
after some fixed number of iterations, but we can do better.
As a stopping criterion,
the following procedure can be used at every iteration.
For each time-step in each trellis, we identify the most probable edge,
according to the local messages. If these most probable edges
join up into two valid paths, one in each trellis,
and if these two paths are consistent with each other,
it is reasonable to stop, as subsequent iterations are unlikely
to take the decoder away from this codeword.
If a maximum number of iterations is reached without this
stopping criterion being satisfied, a decoding error can be reported.
This stopping procedure is recommended for several reasons:
it allows a big saving in decoding time with no loss in error probability;
it allows decoding failures that are detected by the decoder to
be so identified -- knowing that a particular block
is definitely corrupted is surely useful information for the receiver!
And when we distinguish between detected and undetected errors,
the undetected errors give helpful insights into
the low-weight codewords of the code, which may improve the
process of code design.\index{sermon!turbo codes}
Turbo codes as described here have excellent performance
down to decoded error probabilities of about $10^{-5}$, but
randomly-constructed turbo codes tend to have an {\dem\ind{error floor}\/}
starting at that level. This error floor is caused by
low-weight codewords.
To reduce the height of the error floor, one can attempt
to modify the random construction to increase the weight of
these low-weight codewords. The tweaking of turbo codes is a black art,
and it never succeeds in totally eliminating low-weight codewords;
more precisely, the low-weight codewords can only be eliminated
by sacrificing the turbo code's excellent performance.
In contrast, low-density parity-check codes rarely have error floors.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Parity-check matrices of convolutional codes and turbo codes}
% Note different meaning for parity-check matrix
% from convolutional code literature. Here we are talking
%\begin{figure}[htbp]
%\figuremargin{%
\amarginfig{c}{\small
\begin{center}
\begin{tabular}{ll}
\leavevmode
(a) & \psfig{figure=GHps/KL1.ps,width=1.079in}\\%0.83in}\\
(b) & \psfig{figure=GHps/T13.ps,width=1.625in}\\%1.25in}\\
%(d) & \psfig{figure=GHps/G745.ps,width=1.25in}\\
\end{tabular}
\end{center}
%}{%
\caption{Schematic pictures of the parity-check matrices
of
%(a) a regular Gallager code, rate 1/2, (a') an almost regular
% Gallager code, rate 1/3,
% (b)
(a) a convolutional code, rate 1/2, and
(b) a turbo code, rate 1/3.
%
Notation:
A diagonal line represents an identity matrix.
A band of diagonal lines represents a band of diagonal 1s.
A circle inside a square represents the random permutation of
all the columns in that square.
A number inside a square represents the number of random
permutation matrices superposed in that square.
Horizontal and vertical lines indicate
the boundaries of the blocks within the matrix.
%
% (d) shows another code with roughly the same profile as a Turbo code.
}
\label{fig:turbo13}
}%
%\end{figure}
We close by discussing the
% literal
parity-check matrix of a rate-$\dhalf$ convolutional code viewed
as a linear block code.
We adopt the convention that
the $N$ bits of one block are made up of the $N/2$ bits $\cta$ followed
by the $N/2$ bits $\ctb$.
\exercisxB{2}{ex.convH}{
Prove that a convolutional code has
a low-density parity-check matrix as shown schematically in
\figref{fig:turbo13}a.
{\sf Hint:}
It's easiest to figure out the parity constraints satisfied by
a convolutional code by
thinking about the nonsystematic nonrecursive
encoder (\figref{fig.convol.lfsr7}b).
Consider putting through filter $a$
a stream that's been through convolutional filter $b$,
and {\em vice versa}; compare the two resulting streams.
Ignore termination of the trellises.
}
% If we pass stream $b$ through the convolutional filter that generated
% stream $a$ and vice versa, then the two resulting streams are
% identical.
% So the parity-check matrix of a single convolutional code may be
% written as a low-density parity-check matrix as shown in
% figure \ref{}b.
% There are many parity-check matrices for any given block code;
% if any of those parity-check matrices is a low density
% parity-check matrix then we say that the code
% is a LDPCC.
%
% Issue neglected here: termination. Termination simply adds an extra $k$
% constraints, where $k$ is the constraint length. Not a big deal.
The parity-check matrix of a turbo code can be written down
by listing the constraints satisfied by the two constituent
trellises (\figref{fig:turbo13}b).
So turbo codes are also special cases of low-density parity-check codes.
If a turbo code is punctured, it no longer necessarily has a low-density
parity-check matrix, but it always has a
{\dem\ind{generalized parity-check matrix}}\index{parity-check matrix!generalized}
that is sparse, as explained in the next chapter.
\section*{Further reading}
For further reading about convolutional codes,
\citeasnoun{JoZigangirov} is highly recommended. One topic I would have
liked to include is {\dem\ind{sequential decoding}}. Sequential decoding
explores only the most promising paths in the trellis,
and backtracks when evidence accumulates that a wrong turning has been
taken. Sequential decoding is used when the trellis is too big
for us to be able to apply the maximum likelihood algorithm,
the \ind{min--sum algorithm}. You can read about sequential decoding in
\citeasnoun{JoZigangirov}.
For further information
about the use of the sum--product
algorithm in turbo codes, and
the rarely-used but highly recommended stopping criteria for halting
their decoding, \citeasnoun{frey-98} is essential reading.
(And there's lots more good stuff in the same book!)
\section{Solutions}
\soln{ex.spotbit}{
The first bit was flipped.
The most probable path is the upper one in
\figref{fig.convol.K4.source}.
}
\dvips
%\section{Solutions to chapter \protect\ref{ch.convol}'s exercises} %
%\input{tex/_sconvol.tex}
%\dvipsb{solutions convol}
\chapter{Repeat--Accumulate Codes}
\label{ch.ra}
% To finish this book, I'd like to bring us back
%% , full circle,
% to where we came in.
In \chref{chone} we discussed a very simple and not very effective method
for communicating over a noisy channel:
the repetition code.
We now discuss a code that is almost as simple,
and whose performance is outstandingly good.
{\dem{Repeat--accumulate codes}\/} were studied by\indexs{repeat--accumulate code}\indexs{error-correcting code!repeat--accumulate}
\citeasnoun{Divsalar1998} for theoretical purposes,
as simple turbo-like codes that might be more amenable
to analysis than messy turbo codes. Their practical performance
turned out to be just as good as that of other sparse-graph codes.
\section{The encoder}
\begin{framedalgorithm}
\ben
\item
Take $K$ source bits.
$$s_1 s_2 s_3 \ldots s_K$$
\item
Repeat each bit three times, giving $N=3K$ bits.
$$s_1 s_1 s_1 s_2 s_2 s_2 s_3 s_3 s_3 \ldots s_K s_K s_K$$
\item
Permute these $N$ bits using a random
permutation (a fixed random permutation -- the same
one for every codeword).
% , and it's known to the receiver).
Call the permuted
string $\bu$.
\[
u_1 u_2 u_3 u_4 u_5 u_6 u_7 u_8 u_9 \ldots
% u_{N-2} u_{N-1}
u_N
\]
\item
Transmit the {\dem{accumulated sum}}.\index{accumulator}
\beqan t_1 &=& u_1 \nonumber \\
t_2 &=& t_1 + u_{2} \: (\!\mod 2) \nonumber \\
\ldots\:\:\:
t_n &=& t_{n-1} + u_{n} \: (\!\mod 2)\:\:\: \ldots \\
t_N &=& t_{N-1} + u_{N} \: (\!\mod 2). \nonumber
\eeqan
\item
That's it!
\een
\end{framedalgorithm}
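\noindent
As a small illustration of these steps (the source string and the permutation are arbitrary choices made for this example, not taken from any particular code), take $K=2$ and source bits {\tt{1\,0}}.
Repetition gives {\tt{1\,1\,1\,0\,0\,0}}.
Suppose the fixed permutation reads these six bits in the order $4,1,5,2,6,3$, so that the permuted string $\bu$ is {\tt{0\,1\,0\,1\,0\,1}}.
The accumulator then gives, with all sums modulo 2,
\[
\begin{array}{rclcl}
 t_1 &=& u_1 &=& 0 \\
 t_2 &=& t_1 + u_2 = 0 + 1 &=& 1 \\
 t_3 &=& t_2 + u_3 = 1 + 0 &=& 1 \\
 t_4 &=& t_3 + u_4 = 1 + 1 &=& 0 \\
 t_5 &=& t_4 + u_5 = 0 + 0 &=& 0 \\
 t_6 &=& t_5 + u_6 = 0 + 1 &=& 1 ,
\end{array}
\]
so the transmitted string is {\tt{0\,1\,1\,0\,0\,1}}.
Equivalently, $u_n = t_n + t_{n-1} \: (\!\mod 2)$: each step of the accumulator is a parity constraint linking two successive transmitted bits and one permuted bit.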
\begin{figure}
\figuremargin{\small
\begin{center}
\mbox{(a)\psfig{figure=figs/gallager/R3.ps,width=3.9in,angle=-90}}\\[0.2in]
\mbox{(b)\psfig{figure=figs/gallager/R3d.eps,width=3.009in}%
\raisebox{0.5in}{\psfig{figure=figs/gallager/paritytrellis.eps,width=1.1379in}}%,angle=-90
}%,angle=-90
\end{center}
}{
\caption[a]{Factor graphs for a repeat--accumulate code with rate 1/3.
(a) Using elementary nodes.
Each white circle represents a transmitted bit.
Each \plusnode\ constraint\index{parity-check nodes} forces the
sum of the 3 bits to which it is connected to
be even.
Each black circle represents an intermediate binary variable.
Each \equalnode\ constraint forces the three variables to which
it is connected to be equal.
(b) Factor graph normally used for decoding.
The top rectangle represents the trellis of the accumulator,
shown in the inset.
}
\label{fig.ragraph}
\label{fig.ragraph.dec}
}
\end{figure}
\section{Graph}
\Figref{fig.ragraph}a shows the graph of
a repeat--accumulate code, using
four types of node: equality constraints \equalnode,
intermediate binary variables (black circles), parity constraints \plusnode,
and the transmitted bits (white circles).
% The encoder sets each group
% of intermediate bits to values read from the source.
% These bits are put through a fixed random permutation.
% The \plusnode\ constraints cause the transmitted stream,
% working from left to right, to be the accumulated sum (modulo 2)
% of the permuted intermediate bits.
The source sets the values of the black bits at the bottom, three at a time,
and the accumulator computes the transmitted bits along the top.
This graph is a \ind{factor graph} for the prior probability
over codewords, with the circles being binary variable
nodes, and the squares representing two types of factor nodes.
As usual, each \plusnode\ contributes a factor of the
form $\truth[ \sum x \!=\! 0 \mod 2]$;
each \equalnode\ contributes a factor of the
form $\truth[ x_1 \!=\! x_2 \!=\! x_3]$.
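Written out in full, the prior might be sketched as follows. The notation is introduced here for illustration only: $u_n$ denotes the $n$th intermediate variable in permuted order, $t_0 \equiv 0$ by convention, and $a_k$, $b_k$, $c_k$ denote the three positions to which the permutation sends the copies of source bit $k$.
\[
 P( \bu , \bt ) \propto
 \prod_{n=1}^{N} \truth[ t_{n-1} + u_n + t_n \!=\! 0 \mod 2 ]
 \:
 \prod_{k=1}^{K} \truth[ u_{a_k} \!=\! u_{b_k} \!=\! u_{c_k} ] .
\]
Exactly $2^K$ configurations satisfy all the constraints, one for each setting of the source bits, so each valid configuration has prior probability $2^{-K}$.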
\section{Decoding}
The repeat--accumulate code is normally decoded using the sum--product
algorithm on the factor graph depicted in
\figref{fig.ragraph.dec}b. The top box represents the trellis
of the accumulator, including the channel likelihoods.
In the first half of each iteration,
the top trellis receives likelihoods for every transition in the trellis,
and runs the forward--backward
algorithm so as to produce likelihoods for each
variable node. In the second half of the iteration,
these likelihoods are multiplied together at the
\equalnode\ nodes to produce new likelihood messages
to send back to the trellis.
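For concreteness, the second half of the iteration can be written as follows, in notation introduced here for illustration. If $q_1(x)$, $q_2(x)$ and $q_3(x)$, for $x \in \{ 0 , 1 \}$, denote the three likelihood messages arriving from the trellis at one \equalnode\ node, then the standard sum--product rule for an equality constraint gives the messages returned to the trellis as
\[
 r_1(x) \propto q_2(x) \, q_3(x) , \:\:\:
 r_2(x) \propto q_1(x) \, q_3(x) , \:\:\:
 r_3(x) \propto q_1(x) \, q_2(x) .
\]
Each outgoing message omits the incoming message on its own edge. The decoder's current belief about the corresponding source bit is proportional to the product of all three, $q_1(x) \, q_2(x) \, q_3(x)$.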
% graveyard has separate figure
% Sun 12/1/03
As with Gallager codes and turbo codes,
the \ind{stop-when-it's-done} decoding method can
be applied, so it is possible to distinguish between
undetected errors (which are caused by
low-weight codewords in the code) and
detected errors (where the decoder gets stuck and
knows that it has failed to find a valid answer).
\begin{figure}
\figuremargin{
\mbox{\psfig{figure=/home/mackay/code/allR3N.ps,width=2.3in,angle=-90}}\\[0.1in]
}{
\caption[a]{Performance of six rate-\dthird\
repeat--accumulate codes on the Gaussian channel.
The blocklengths range from $N=204$ to $N=30\,000$.
Vertical axis: block error probability; horizontal axis: $E_b/N_0$.
The dotted lines show the frequency of undetected errors.}
\label{fig.raperfb}
}
\end{figure}
% /home/mackay/code/allR3N.ps
% is included in _doc/code
%ibis.tex:\ebnowide{ps/allR3N}
%sparsecodes.tex:\ebnowide{ps/allR3N}
%talk.tex:\ebnotitle{ps/allR3N}{Repeat--accumulate codes}
%talk.tex:\label{fig.allR3N}
\Figref{fig.raperfb} shows the performance of six
randomly-constructed repeat--accumulate
codes on the Gaussian channel. If
one does not mind the error floor which kicks
in at about a block error probability of $10^{-4}$,
the performance is staggeringly good for such a
simple code (\cf\ \figref{fig:GCResults}).
\newcommand{\RAdirectory}{/data/tiree/mackay/code/RA/RA}
\begin{figure}
% nondangle version moved to graveyard.tex Wed 9/4/03
\figuredangle{\small
\begin{center}
\begin{tabular}{cc@{}c}
\hspace*{-0.2in}%
\mbox{\psfig{figure=\RAdirectory/ps/30000.793.ps,width=2.35in,angle=-90}} &
\mbox{\psfig{figure=\RAdirectory/r/h.0.89.ps,width=2.55in,angle=-90}}&
\mbox{\psfig{figure=\RAdirectory/r/h.0.90.ps,width=2.55in,angle=-90}}\\
(a) &
(ii.b) $E_b/N_0 = 0.749$\,dB &
(ii.c) $E_b/N_0 = 0.846$\,dB \\[0.1in]
&
\mbox{\psfig{figure=\RAdirectory/r/fit.0.89.ps,width=2.4035in,angle=-90}} &
\mbox{\psfig{figure=\RAdirectory/r/fit.0.90.ps,width=2.4035in,angle=-90}}\\
&
(iii.b)&
(iii.c)\\
\end{tabular}
\end{center}
}{
\caption[a]{{Histograms of number of iterations to find a valid
decoding for a repeat--accumulate code with source block
length $K=10\,000$ and transmitted blocklength
$N=30\,000$.}
(a) {Block error probability versus signal-to-noise ratio
for the RA code.}
%(iia) Channel signal to noise ratio $x/\sigma
% = 0.88$, $E_b/N_0 = 0.651$ dB.
(ii.b) Histogram for
$x/\sigma = 0.89$, $E_b/N_0 = 0.749$\,dB.
(ii.c) $x/\sigma = 0.90$, $E_b/N_0 = 0.846$\,dB.
% see code/RA/READRA
% Most errors in this experiment were detected errors, but not
% quite all.
{(iii.b, iii.c) Fits of power laws to (ii.b) $(1/\tau^6)$ and (ii.c)
$(1/\tau^9)$.}
%to the histograms of \protect\figref{fig.hist}.}
% Both the axes are shown
% on a log scale so that a power-law curve appears as a straight line.
}
\label{fig.pb.ebno}
\label{fig.hist}
}
\end{figure}
\section{Empirical distribution of decoding times}
It is interesting to study the number of iterations $\tau$
of the sum--product algorithm
required to decode a sparse-graph code.
% Gallager codes and repeat--accumulate codes.
Given one code and a
set of channel conditions, the decoding time varies randomly from
trial to trial.
We find that the histogram of decoding times
follows a power law, $P(\tau) \propto \tau^{-p}$, for large $\tau$.
The power $p$ depends on the signal-to-noise ratio
and becomes smaller (so that the distribution is more heavy-tailed)\index{tail}
as the signal-to-noise ratio decreases.
%
We have observed power laws in repeat--accumulate codes
and in irregular and regular Gallager codes.
%
%\subsubsection{Decoding times}
Figures~\ref{fig.hist}(ii) and (iii) show the
% heavy tailed
distribution
of decoding times of a repeat--accumulate code at two different
signal-to-noise ratios. The \ind{power law}s extend
over several orders of magnitude.
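To see what such a power law implies, take logarithms of $P(\tau) \propto \tau^{-p}$:
\[
 \ln P(\tau) = - p \ln \tau + \mbox{const} ,
\]
so on logarithmic axes the histogram is a straight line of slope $-p$.
With the fitted exponents quoted in the caption of \figref{fig.hist}, $p=6$ and $p=9$, doubling $\tau$ reduces $P(\tau)$ by a factor of $2^6 = 64$ at the lower signal-to-noise ratio and by $2^9 = 512$ at the higher one.
(These factors are simple arithmetic from the quoted exponents, not additional measurements.)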
\exercisxC{5}{ex.powerlaw}{
Investigate these power laws.
Does density evolution predict them?
Can the design of a code be used to manipulate the power law
in a useful way?
}
\section{Generalized parity-check matrices}
I find that it is helpful when relating sparse-graph codes to each
other to use a common representation for them
all. \citeasnoun{Forney2001} introduced the idea of a {\dem\ind{normal graph}}
in which the only nodes are $\plusnode$ and $\equalnode$
and all variable nodes have degree one or two;
variable nodes with degree two can be represented on edges that connect a $\plusnode$ node
to a $\equalnode$ node. The {\sl generalized parity-check matrix\/}
is a graphical way of representing normal graphs.
In a parity-check matrix, the columns are transmitted bits,
and the rows are linear constraints.
In a {generalized parity-check matrix}, additional columns
may be included, which represent state variables that are
not transmitted.
One way of thinking of these state variables is that they
are punctured from the code before transmission.
State variables are indicated by a horizontal line
above the corresponding columns.
The other pieces of diagrammatic notation for
generalized parity-check matrices are,
as in \cite{mncN,MacKayAllerton98}:
\bit
\item A diagonal line in a square indicates that that
part of the matrix contains an identity matrix.
\item Two or more parallel diagonal lines indicate
a band-diagonal matrix with a corresponding
number of 1s per row.
\item A horizontal ellipse with an arrow on it indicates
that the corresponding columns in a block are randomly permuted.
% [The columns are only permuted within the block indicated
\item A vertical ellipse with an arrow on it indicates
that the corresponding rows in a block are randomly permuted.
\item An integer surrounded by a circle represents
that number of superposed random permutation matrices.
\eit
\noindent
{\sf Definition.} A generalized parity-check matrix is a pair $\{ \bA , \bp \}$,
where $\bA$ is a binary matrix and $\bp$ is a list of
the punctured bits. The matrix defines a set of
{\sl valid vectors\/} $\bx$, satisfying
\beq
\bA \bx = 0 ;
\eeq
for each valid vector there is a codeword $\bt(\bx)$ that
is obtained by puncturing from $\bx$ the bits indicated by $\bp$.
For any one code there are many
generalized parity-check matrices.
\medskip
The {\sl rate\/}
of a code with generalized parity-check matrix $\{ \bA , \bp \}$
can be estimated as follows.
If $\bA$ is $M' \times L$ (that is, $M'$ rows and $L$ columns), and $\bp$ punctures $S$ bits
and selects $N$ bits for
transmission ($L=N+S$), then the effective number of constraints on
the codeword, $M$, is
\beq
M = M' - S,
\eeq
the number of source bits is
\beq
K = N-M = L - M',
\eeq
and the rate is greater than or equal to
\beq
R = 1 - \frac{M}{N} = 1 - \frac{ M' - S }{ L - S } .
\eeq
% In the special case where there are $S=K$ state variables
%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}
\figuremargin{
\begin{tabular}{rclcrcl}
$\bG^{\T}$ & $ =$ & \raisebox{-0.4386in}{\mbox{\GHfigthird{r3G.ps}}} &
\hspace*{0.02in}&
$\bH$ & $= $ & \raisebox{-0.254in}{\mbox{\GHfig{r3a.ps}}}
\\[-0.3in]
$\{ \bA, \bp \}$ & $ = $ & \multicolumn{5}{l}{
\raisebox{-0.4386in}[1.2in]{\mbox{\GHfigtwo{r3.ps}}}
} \\
\end{tabular}
}{
\caption[a]{%\captionsize
The generator matrix, parity-check matrix,
and a generalized parity-check matrix of a repetition code
with rate $\dthird$.
}
\label{fig.simple.r3}
}
\end{figure}
%
\begin{figure}
\figuremargin{
\begin{tabular}{rclcrcl}
$\bG^{\T}$&=& \raisebox{-0.561in}{ \mbox{\GHfigthird{sourlasG.ps}}} &
&
$\bH$&=& \raisebox{-0.255in}{ \mbox{\GHfig{sourlas.ps}}} \\
\end{tabular}
}{%
\caption[a]{%\captionsize
The generator matrix and parity-check matrix of
a systematic low-density generator-matrix
code. The code has rate $\dthird$.}
\label{fig.ldgm-s}
}
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}
\figuremargin{
\begin{tabular}{rclcrcl}
$\bG^{\T}$&=& \raisebox{-0.255in}{ \mbox{\GHfigthird{sourlasGN.ps}}} &
&
$\bA,\bp$&=& \raisebox{-0.255in}{ \mbox{\GHfig{sourlasN.ps}}} \\
\end{tabular}
}{%
\caption[a]{%\captionsize
The generator matrix and generalized parity-check matrix of
a {\em non-systematic\/} low-density generator-matrix
code. The code has rate $\dhalf$.}
\label{fig.ldgm-ns}
}
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Examples}
%\begin{description}
%\subsubsection*{Repetition code.}
{\sf Repetition code.}
The generator matrix, parity-check matrix, and
generalized parity-check matrix of a simple rate-$\dthird$
repetition code are shown in \figref{fig.simple.r3}.
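As a concrete instance, consider the case $K=1$. The matrices below are written out here for illustration, with one of several equivalent choices of $\bH$; the figure shows the general pattern.
\[
 \bG^{\T} = \left[ \begin{array}{c} 1 \\ 1 \\ 1 \end{array} \right] ,
 \:\:\:
 \bH = \left[ \begin{array}{ccc} 1 & 1 & 0 \\ 1 & 0 & 1 \end{array} \right] ,
 \:\:\:
 \bA = \left[ \begin{array}{cccc} 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \end{array} \right] .
\]
Here the first column of $\bA$ corresponds to the single source bit $s$, treated as a punctured state variable.
One can check that $\bH \bG^{\T} = 0$ modulo 2, and that $\bA \bx = 0$ with $\bx = (s, t_1, t_2, t_3)$ forces $t_1 = t_2 = t_3 = s$.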
\medskip
\noindent
%\subsubsection*{Systematic low-density generator-matrix code.}
{\sf Systematic low-density generator-matrix code.}
In an $(N,K)$ systematic low-density generator-matrix code,
there are no state variables.
A transmitted codeword $\bt$ of length $N$ is given by
\beq
\bt = \bG^{\T} \bs,
\eeq
where
\beq
\bG^{\T} = \left[ \begin{array}{c} \bI_K \\ \bP \end{array} \right] ,
\eeq
with $\bI_K$ denoting the $K \times K$ identity matrix, and
$\bP$ being a very sparse $M \times K$ matrix, where $M=N-K$.
The parity-check matrix of this code is
\beq
\bH = [ \bP | \bI_M ] .
% \bH = [ \begin{array}{c} \bP | \bI_M \end{array} ] .
\eeq
In the case of a rate-$\dthird$ code, this parity-check matrix might
be represented as shown in \figref{fig.ldgm-s}.
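That this $\bH$ is a valid parity-check matrix for the code can be verified directly, using arithmetic modulo 2:
\[
 \bH \bG^{\T} = [ \bP | \bI_M ] \left[ \begin{array}{c} \bI_K \\ \bP \end{array} \right]
 = \bP \bI_K + \bI_M \bP = \bP + \bP = 0 ,
\]
so every codeword $\bt = \bG^{\T} \bs$ satisfies $\bH \bt = 0$.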
\medskip
\noindent
%\subsubsection*{Non-systematic low-density generator-matrix code.}
{\sf Non-systematic low-density generator-matrix code.}
In an ($N,K$) non-systematic low-density generator-matrix code,
a transmitted codeword $\bt$ of length $N$ is given by
\beq
\bt = \bG^{\T} \bs,
\eeq
where $\bG^{\T}$ is a very sparse $N \times K$ matrix.
%
The generalized parity-check matrix of this code is
\beq
\bA = \left[ \overline{ \bG^{\T} } | \bI_N \right] ,
% \bA = \left[ \begin{array}{c} \overline{ \bG^{\T} } | \bI_N \end{array} \right] ,
\eeq
and the corresponding generalized parity-check equation
is
\beq
\bA \bx = 0 ,
\:\: \mbox{ where $\bx = \left[ \begin{array}{c} \bs \\ \bt \end{array} \right]$.}
\eeq
%
Whereas the parity-check matrix of this simple code
is typically a complex, dense matrix, the
generalized parity-check matrix retains the underlying
simplicity of the code.
In the case of a rate-$\dhalf$ code, this generalized parity-check matrix might
be represented as shown in \figref{fig.ldgm-ns}.
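As a check on the rate estimate given earlier in this section: this $\bA$ has $M' = N$ rows and $L = K + N$ columns, of which the $S = K$ columns corresponding to $\bs$ are punctured, so
\[
 M = M' - S = N - K , \:\:\:
 R = 1 - \frac{M}{N} = \frac{K}{N} ,
\]
as expected; for $N = 2K$ this is the rate $\dhalf$ of \figref{fig.ldgm-ns}.
The product $\bA \bx = \bG^{\T} \bs + \bt$ vanishes modulo 2 precisely when $\bt = \bG^{\T} \bs$.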
\medskip
\noindent
%%%%%%%%%%%%%%%%%%%%%%%%
%
%\subsubsection*{Low-density parity-check codes and linear MN codes.}
{\sf Low-density parity-check codes and linear MN codes.}
%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\begin{figure}
\marginfig{\small
\begin{tabular}{cc}
\mbox{\GHfigone{3.2A.ps}} &
\mbox{\GHfigone{3.2A.MN.ps}} \\
(a) &
(b) \\
\end{tabular}
%
\caption[a]{%\captionsize
The generalized parity-check matrices of
(a) a rate-$\dthird$
Gallager code with $M/2$ columns of weight 2;
(b) a rate-$\dhalf$ linear MN code.}
\label{fig.ldpc}
\label{fig.mn}
}% \end{figure}
The parity-check matrix of a rate-1/3 low-density parity-check code
is shown in \figref{fig.ldpc}a.
A linear MN code is a
non-systematic low-density parity-check code. The $K$ state bits
of an MN code are the source bits.
\Figref{fig.mn}b shows the generalized parity-check matrix
of a rate-$1/2$ linear MN code.
\medskip
\noindent
%\subsubsection*{Convolutional codes.}
{\sf Convolutional codes.}
\marginfig{\small%\begin{figure}
\begin{tabular}{c}
(a) \mbox{\GHfigE{con.ps}}% convolutional code
\\[0.2in]
\mbox{\GHfigdoubleE{turbo.ps}}% turbo code
\\
(b) \\
\end{tabular}
%
\caption[a]{%\captionsize
The generalized parity-check matrices of
(a) a convolutional code with rate \dhalf.
(b) a rate-\dthird\ turbo code built by parallel concatenation
of two convolutional codes.
}
\label{fig.turbo}
\label{fig.con}
}%\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
In a non-systematic, non-recursive convolutional code,
the source bits, which play the role
of state bits, are fed into a delay-line and two
linear functions of the delay-line are transmitted.
In \figref{fig.con}a, these two parity streams
are shown as two successive vectors of length $K$.
[It is common to interleave these two parity streams,
a bit-reordering that is not relevant here, and is not illustrated.]
\medskip
\noindent
%\subsubsection*{Concatenation.}
{\sf Concatenation.}
`Parallel concatenation' of two codes is represented in one
of these diagrams by aligning the matrices of two codes
in such a way that the `source bits' line up, and
by adding blocks of zero-entries to the matrix
such that the state bits and parity bits
of the two codes occupy separate columns. An example is given
by the turbo code below.
In `serial concatenation', the columns corresponding to
the {\em transmitted\/} bits of the first code are aligned with
the columns corresponding to the {\em source\/} bits of the second code.
\medskip
\noindent
%\subsubsection*{Turbo codes.}
{\sf Turbo codes.}
A turbo code is the parallel concatenation of two convolutional
codes. The generalized parity-check matrix of
a rate-1/3 turbo code is shown in \figref{fig.turbo}b.
\medskip
\noindent
%\subsubsection*{Repeat--accumulate codes.}
{\sf Repeat--accumulate codes.}
The generalized parity-check matrix of a rate-1/3
repeat--accumulate code is shown in \figref{fig.ra}.
Repeat--accumulate codes are equivalent to \ind{staircase} codes
(\secref{sec.staircase}, \pref{sec.staircase}).\index{repeat--accumulate code!connection to low-density parity-check code}
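As an illustration of the rate estimate for this code, assume that the $K$ source bits are taken to be the punctured state variables (one natural choice, which the figure may or may not follow exactly). Each of the $N = 3K$ accumulator equations then contributes one row, involving a single source bit and at most two transmitted bits, so $M' = 3K$, $S = K$ and $L = 4K$, giving
\[
 M = M' - S = 2K , \:\:\:
 R = 1 - \frac{M}{N} = 1 - \frac{2K}{3K} = \frac{1}{3} .
\]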
%%%%%%%%%%%%%%%%%%%%%%%%%%%
\amarginfig{t}{\small%\begin{figure}
\begin{tabular}{c}
\mbox{\GHfigtwo{r3sa.ps}}% RA code
\\
\end{tabular}
%
\caption[a]{%\captionsize
The generalized parity-check matrix of a repeat--accumulate code with rate \dthird.
}
\label{fig.ra}
}%\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%
\medskip
\noindent
{\sf Intersection.}
The generalized parity-check matrix of the intersection
of two codes
% (that is, the code whose codewords are
is made by stacking their generalized parity-check matrices on top
of each other in such a way that all the transmitted bits' columns
are correctly aligned, and any punctured bits associated with
the two component codes occupy separate columns.
%%%%%%%%%
\dvips
\prechapter{About Chapter}
The following exercise provides a helpful background for digital fountain codes.
\exercisxB{3}{ex.ballsbins}{
An author proofreads his $K=700$-page book
by inspecting random pages. He makes
$N$ page-inspections, and does not take any
precautions to avoid inspecting the same page twice.
\ben
\item
After $N=K$ page-inspections,
what fraction of the pages do you expect
never to have been inspected?
\item
After $N>K$ page-inspections,
what is the probability that {one or more} pages
have never been inspected?
\item
Show that in order for the probability
that all $K$ pages have been inspected to be $1- \delta$,
we require
$N \simeq K \ln ( K/\delta )$ page-inspections.
\een
[This problem is commonly presented in terms of
throwing $N$ balls at random into $K$ bins; what's
the probability that every bin gets at least one ball?]
}
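A sketch of the calculation follows; it uses the approximation $(1 - 1/K)^N \simeq e^{-N/K}$ and a union bound, so the answers are approximate rather than exact.
The probability that one particular page is never inspected is
\[
 ( 1 - 1/K )^N \simeq e^{-N/K} ,
\]
which for $N=K$ is $e^{-1} \simeq 0.37$; this is the expected fraction of uninspected pages.
The probability that one or more of the $K$ pages is missed is at most $K e^{-N/K}$; setting this bound equal to $\delta$ gives
\[
 N \simeq K \ln ( K/\delta ) .
\]
For $K=700$ and an illustrative $\delta = 0.01$, this is $N \simeq 700 \times 11.2 \simeq 7800$ page-inspections.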
\ENDprechapter
\chapter{Digital Fountain Codes}
\label{chdfountain}
Digital fountain codes are record-breaking sparse-graph codes
for channels with erasures.
Channels with erasures are of great importance. For example,
files
sent over the \ind{internet} are chopped into packets,
and each packet is either received without error
or not received. A simple channel model
describing this situation is a $q$-ary
\ind{erasure channel},\index{channel!erasure}\index{error-correcting code!erasure channel}
which has
(for all inputs in the input alphabet $\{ 0,1,2,\ldots, q\!-\!1\}$)
a probability $1-f$ of transmitting the input
without error, and probability $f$ of delivering the
output `{\tt{?}}'.
The alphabet size $q$ is $2^l$, where $l$ is the number
of bits in a packet.
Common methods for communicating over such channels
employ a feedback channel from receiver to sender
that is used to control the \ind{retransmission} of
erased packets. For example, the receiver might send
back messages that identify the {\em missing\/} packets,
which are then retransmitted. Alternatively,
the receiver might send back messages that
acknowledge each {\em received\/} packet; the sender
keeps track of which packets have been acknowledged
and retransmits the others until
all packets have been acknowledged.
% The second method is
% often used when a sender is using a broadcast
% channel to communicate the same information to many receivers.
% Mike L said
%(0) The second paragraph mentions common methods for communicating over
%channels. It would be good to point out that the first method only works
%well for a sender communicating to a single receiver, and that the second
%method is often the method of choice when a sender is using a broadcast
%channel to communicate the same information to many receivers, and to give a
%forward pointer to the broadcast application in section 1.5. This would
%make it more obvious up front why these codes are valuable in
%broadcast/satellite/wireless one-to-many environments.
%
% so I added a bit more below.
These simple retransmission protocols have the advantage that
they will work regardless of the erasure probability $f$,
but purists who have learned their Shannon theory
will feel that these retransmission protocols are wasteful.
If the erasure probability $f$ is large, the number of
feedback messages sent by the first protocol will be
large.
Under the second \ind{protocol},
% If we repeatedly send packets until an acknowledgement is received,
it's likely that the receiver will end up receiving
multiple redundant copies of some packets,
and heavy use is made of
% the number of acknowledgement messages that have to be sent over
the feedback channel.
% is proportional to the .
According to Shannon, there is
no need for the feedback channel:
the capacity of the forward channel
is $(1-f)l$ bits, whether or not we have feedback.
The wastefulness of the simple retransmission protocols is especially
evident in the case of a broadcast channel with erasures -- channels where
one sender broadcasts to many receivers, and each receiver receives a
random fraction $(1-f)$ of the packets. If every packet that
is missed by one or more receivers has to be retransmitted,
those retransmissions will be terribly redundant. Every receiver
will have already received most of the retransmitted packets.
% , but they have to be retransmitted for the sake of other receivers who missed them.
So, we would like to make erasure-correcting codes that require
no feedback or almost no feedback.
The classic block codes for erasure correction are called
\ind{Reed--Solomon code}s. An $(N,K)$ Reed--Solomon code (over an alphabet
of size $q=2^l$) has the ideal\index{perfect code}
property that if any $K$ of the $N$ transmitted
symbols are received then the original $K$ source
symbols can be recovered. [See \citeasnoun{Berlekamp}
or \citeasnoun{lincostello83} for further information; Reed--Solomon
codes exist for $N<q$.]
[\ldots]
\ldots\ (for $d>1$) the
expected number of packets of degree $d$ that have their degree
reduced to $d-1$ is $h_0(d) \linefrac{d}{K}$;
and at the $t$th iteration, when $t$ of the $K$ packets have been
recovered and the number of
packets of degree $d$ is $h_t(d)$, the expected number of packets of
degree $d$ that have their degree
reduced to $d-1$ is $h_t(d) \linefrac{d}{K-t}$.
Hence show that in order to have the expected number of packets
of degree 1 satisfy $h_t(1) = 1$ for all $t \in \{ 0, \ldots, K-1 \}$,
we must, to start with, have $h_0(1) =1$ and $h_0(2) = K/2$;
and more generally, $h_t(2) = (K-t)/2$; then by recursion
solve for $h_0(d)$ for $d=3$ upwards.
}
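A sketch of the recursion, built from the expressions in the exercise: for $d \geq 2$, the expected number of packets of degree $d$ evolves as
\[
 h_{t+1}(d) = h_t(d) \left( 1 - \frac{d}{K-t} \right) + h_t(d+1) \, \frac{d+1}{K-t} .
\]
Requiring $h_t(2) = (K-t)/2$ for all $t$ gives $h_t(3) = (K-t)/6$, and in general the trial solution $h_t(d) = (K-t) / ( d(d-1) )$ satisfies the recursion, since
\[
 \frac{K-t}{d(d-1)} - \frac{1}{d-1} + \frac{1}{d} = \frac{K-t-1}{d(d-1)} .
\]
Setting $t=0$ and dividing by $K$ gives the degree distribution $\rho(1) = 1/K$ and $\rho(d) = 1/( d(d-1) )$ for $d = 2, \ldots, K$, which sums to 1 because $\sum_{d=2}^{K} \left[ 1/(d-1) - 1/d \right] = 1 - 1/K$.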
This degree distribution works poorly in practice, because
fluctuations around the expected behaviour make it very likely that
at some point in the decoding process there will be
no degree-one check nodes; and, furthermore, a few source
nodes will receive no connections at all. A small modification
% , slightly increasing the bias towards small degrees,
fixes these problems.
The {\dem{robust soliton distribution}\/} has two extra parameters, $c$
and $\delta$; it is designed to ensure that the
% aiming to make the
expected number of degree-one checks is about
\beq
\Ripple \equiv c \ln (K/\delta) \sqrt{K},
\eeq
rather than 1,
throughout the decoding process.
The parameter $\delta$ is a bound on the probability
that the decoding fails to run to completion
after a certain number $K'$ of packets have been received.
The parameter $c$ is a constant of order 1, if our aim
is to prove Luby's main theorem about LT codes; in practice,
however, it can be viewed as a free parameter, with
a value somewhat smaller than 1 giving good results.
% is defined by introducing
% an allowable failure probability $\delta$, and
% ]
% The The probability that a random walk of length $K$ deviates
We define a positive function
\beq
\tau(d) = \left\{ \begin{array}{ll}
\smallfrac{\Ripple}{K} \smallfrac{1}{d} & \mbox{for $d = 1,2,\ldots, (K/\Ripple) \!-\!1 $} \\[0.04in]
\smallfrac{\Ripple}{K} \ln(\Ripple/\delta) & \mbox{for $d = K/\Ripple$} \\[0.02in]
0 & \mbox{for $d> K/\Ripple$}
\end{array}
\right.
\label{eq.actualrobustcorrection}
\eeq
(see \figref{fig.dfdensity} and \exerciseref{ex.dfproverobustsoliton})
then add the ideal soliton distribution $\rho$ to $\tau$ and normalize
to obtain the robust soliton distribution, $\mu$:
\beq
\mu(d) = \frac{ \rho(d) + \tau(d) }{ Z },
\eeq
where $Z=\sum_{d} \left[ \rho(d) + \tau(d) \right].$
% FALSE:
% The expected number of encoded packets required at the receiving end
% before the decoding can run to completion is $\bar{N} = KZ$.
The number of encoded packets required at the receiving end
to ensure that the decoding can run to completion, with probability at
least $1-\delta$, is ${K'} = KZ$.
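As a rough numerical check, the normalizer $Z$ can be evaluated directly from the definition of $\tau$.
The sketch below assumes that $\rho$ is normalized ($\sum_d \rho(d) = 1$), treats $K/\Ripple$ as an integer,
and approximates the harmonic sum by $\sum_{d=1}^{n} 1/d \simeq \ln n + 0.58$ (the constant is Euler's constant, rounded):
\beq
Z = 1 + \frac{\Ripple}{K} \left[ \sum_{d=1}^{K/\Ripple - 1} \frac{1}{d} + \ln ( \Ripple/\delta ) \right]
\simeq 1 + \frac{\Ripple}{K} \left[ \ln ( K/\delta ) + 0.58 \right] .
\eeq
For example, with $K = 10\,000$, $c = 0.01$ and $\delta = 0.5$ we have $\Ripple \simeq 9.9$ and $\ln(K/\delta) \simeq 9.9$,
so $Z \simeq 1 + 9.9 \times 10^{-4} \times 10.5 \simeq 1.01$ and $K' = KZ \simeq 10\,100$.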
\marginfig{
\begin{center}\small
\begin{tabular}{r}
% $\Ripple$
\hspace*{7.5mm}\raisebox{0.2in}{\psfig{figure=figs/gallager/dfountainR.ps,%
%\hspace*{5.5mm}\raisebox{0.2in}{\psfig{figure=figs/gallager/dfountainKonR.ps,%
width=40.5mm,angle=-90}}
\\
% \raisebox{$K'$
\hspace*{5.5mm}\raisebox{0.102in}{\psfig{figure=figs/gallager/dfountainKprime.ps,%
width=40.5mm,angle=-90}}
\\
$c$
\end{tabular}
\end{center}
\caption[a]{The number of degree-one checks $\Ripple$ (upper figure) and the quantity
$K'$
% (c,\delta)$
(lower figure) as a function of the
two parameters $c$ and $\delta$,
for $K=10\,000$.
Luby's main theorem proves that there exists a value of $c$
such that, given $K'$ received packets, the decoding algorithm will
recover the $K$ source packets with probability at least $1-\delta$.
}
\label{fig.dfcurves}
}
\quotecite{luby2002} analysis explains how the small-$d$ end of $\tau$ has the
role of ensuring that the decoding process gets started, and
the spike in $\tau$ at $d = K/\Ripple$ is included to ensure that every source packet
is likely to be connected to a check at least once.
% $\tau( K/\Ripple )$
%
% gnuplot load 'histo.gnu2'
%# does the alternate three-graph version, wider
% gnuplot load 'histo.gnu'
%# runs the ! histo.p min=10000 max=11500 < c0.03d0.5K10000 automatically
\amarginfig{c}{
\begin{center}
\begin{tabular}{l}
~\\
\hspace*{-5mm}\mbox{\psfig{figure=dfountain/histo3.ps,%
width=49.5mm,angle=-90}}
\end{tabular}
\end{center}
\caption[a]{Histograms of the actual number of
packets $N$ required in order to recover a file of size
$K=10\,000$ packets.
% R=9.90348755253613 ; dspecial=1010 ; Zr=1 ; Zt=0.0103793108098562 ; Z=1.01037931080986
The parameters were as follows:
top histogram: $c=0.01$, $\delta=0.5$ ($\Ripple=10$, $K/\Ripple=1010$, and $Z\simeq 1.01$);
middle: $c=0.03$, $\delta=0.5$ ($\Ripple=30$, $K/\Ripple=337$, and $Z\simeq 1.03$);
bottom: $c=0.1$, $\delta=0.5$ ($\Ripple=99$, $K/\Ripple=101$, and $Z\simeq 1.1$).
% Upper histogram: $c=0.03$, $\delta=0.5$ ($\Ripple=30$, $K/\Ripple=337$, and $Z\simeq 1.03$).
% Lower histogram: $c=0.1$, $\delta=0.5$ ($\Ripple=99$, $K/\Ripple=101$, and $Z\simeq 1.1$).
}
\label{fig.dfhisto}
}
Luby's key result is that (for an appropriate value of the constant $c$)
receiving
$K' = K + 2 \ln(\Ripple/\delta)\Ripple$ checks ensures that all packets can be recovered
with probability at least $1-\delta$.
In the illustrative figures I have set
the allowable decoder failure probability $\delta$ quite large, because
the actual failure probability is much smaller than is suggested by
Luby's conservative analysis.
In practice, LT
% digital fountain
codes can be tuned
so that a file of original size $K\simeq 10\,000$ packets
is recovered with an overhead of about 5\%.
\Figref{fig.dfhisto} shows histograms of the actual number of packets
required for a couple of settings of the parameters, achieving
mean overheads smaller than 5\% and 10\% respectively.
% are $K=10\,000$, $c=0.03$, $\delta=0.5$
%# see itp/dfountain and bin/dfountain.p
\section{Applications}
Digital fountain codes
are an excellent solution
in a wide variety of situations. Let's mention two.
\subsection{Storage}
You wish to make a backup of a large file, but you are aware
that your magnetic tapes and hard drives are all unreliable in the
sense that catastrophic failures, in which some stored packets are
permanently
lost within one device, occur at a rate of something like $10^{-3}$
per day. How should you store your file?
A digital fountain can be used to spray encoded packets
all over the place, on every storage device available.
Then to recover the backup file, whose size was $K$ packets,
one simply needs to find $K' \simeq K$ packets from anywhere.
Corrupted packets do not matter; we simply skip over them
and find more packets elsewhere.
This method of storage also has advantages in terms of
{\em speed\/} of file recovery. On a hard drive,
it is standard practice to store a file in successive
sectors, to allow rapid reading of the file;
but if, as occasionally happens, a packet is lost (owing to the
reading head being off track for a moment, giving
a burst of errors that cannot be corrected by the packet's error-correcting code),
a whole revolution
of the drive must be performed to bring back the packet to the
head for a second read. The time taken for one revolution
produces an undesirable delay in the file system.\index{seek time}\index{hard drive}\index{magnetic recording}
If files were instead stored
using the digital fountain principle, with the digital
drops stored in one or more consecutive sectors
on the drive, then one would never need to endure the delay
of re-reading a packet; packet loss would become less important,
and the hard drive could consequently be operated faster, with
higher noise level,
and with fewer resources devoted to noisy-channel coding.
\exercisxB{2}{ex.DFRaid}{
Compare the digital fountain method of robust storage
on multiple hard drives with RAID (the redundant
array of independent disks).
}
\subsection{Broadcast}
Imagine that ten thousand \ind{subscriber}s in an area wish to
receive a \index{digital video broadcast}{digital movie} from a broadcaster. The broadcaster
can send the movie in packets over a \index{broadcast channel}{broadcast} network\index{channel!broadcast} --
for example, by a wide-bandwidth \ind{phone} line, or by \ind{satellite}.
Imagine that not all packets are received at all the houses. Let's say
$f=0.1$\% of them are lost at each house.
In a standard approach in which the file is transmitted
as a plain sequence of packets with no encoding, each house would
have to notify the broadcaster of the $fK$ missing packets,
and request that they be retransmitted. And with ten thousand
subscribers all requesting such retransmissions, there would be a
retransmission request for almost every packet. Thus the broadcaster
would have to send the entire broadcast twice in order to ensure that
most subscribers have received the whole movie, and most users would
have to wait roughly twice as long as the ideal time before the download was complete.
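A rough estimate shows why almost every packet is requested again. Assuming, as a simplification,
that losses are independent from house to house, the probability that a given packet
reaches all $10\,000$ subscribers is
\beq
(1-f)^{10\,000} = (0.999)^{10\,000} \simeq e^{-10} \simeq 5 \times 10^{-5} ,
\eeq
so only about one packet in twenty thousand escapes a retransmission request; on average each packet
is missed by $10\,000 f = 10$ houses.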
If the broadcaster uses a digital fountain to encode the
movie, each subscriber can recover the movie from {\em any\/} $K' \simeq K$
packets. So the broadcast needs to last for only, say, 1.1$K$ packets,
and every house is very likely to have successfully recovered the whole file.
Another application is broadcasting data to cars. Imagine that we
want to send updates to in-car \ind{navigation}\index{in-car navigation}\index{car data reception}
\index{automobile data reception}databases by satellite.
There are hundreds of thousands of vehicles, and they can only
receive data when they are out on the open road; there are no feedback
channels.
A standard method for sending the data
is to put it in a {\dem carousel}, broadcasting the packets
in a fixed periodic sequence. `Yes, a car may go through a tunnel,
and miss out on a few hundred packets, but it will be
able to collect those missed packets an hour later when the carousel
has gone through a full revolution (we hope); or maybe the following day$\ldots$'
If instead the satellite uses a digital fountain, each car needs
to receive only an amount of data equal to the original file
size (plus 5\%).
\section*{Further reading}
The encoders and decoders sold by Digital Fountain
have even higher efficiency than the LT codes described here, and they
work well for all blocklengths, not only large lengths such as $K \gtrsim 10\,000$.
% Some of their tricks are revealed in
\citeasnoun{shokrollahiRaptor} presents
{\dem\ind{Raptor codes}}, which are an extension of LT codes with linear-time encoding and
decoding.
\section{Further exercises}
\exercisxB{2}{ex.dfproverobustsoliton}{
{\sf Understanding the robust soliton distribution}.
Repeat the analysis of \exerciseref{ex.dfprovesoliton}
but now aim to have the expected number of packets
of degree 1 be $h_t(1) = 1 + S$ for all $t$,
instead of 1. Show that
the initial required number of packets is
\beq
h_0(d) = \frac{K}{d(d-1)} + \frac{S}{d} \:\:\:\: \mbox{for $d>1$.}
\label{eq.robustsolitonheavy}
\eeq
The reason for truncating the second term beyond $d=K/S$
and replacing it by the spike at $d=K/S$
(see \eqref{eq.actualrobustcorrection}) is
to ensure that the decoding complexity does not grow larger than
$O(K \ln K)$.
Estimate the expected number of packets $\sum_d h_0(d)$
and the expected number of edges in the sparse graph
$\sum_d h_0(d) d$ (which determines the decoding complexity)
if the histogram of packets is
as given in (\ref{eq.robustsolitonheavy}).
Compare with the expected numbers of packets and edges
when the robust soliton distribution (\ref{eq.actualrobustcorrection}) is
used.
}
\exercisxC{4}{ex.dfproverobustsolitonII}{
Show that the spike at $d=K/S$
(\eqref{eq.actualrobustcorrection})
is an adequate replacement for the tail of high-weight packets
in (\ref{eq.robustsolitonheavy}).
%
% As a first step, show that expected degree $d_{\rm spike}$
% of packets in the spike
% evolves with time as
%\beq
% d_{\rm spike} = (K-t)/S.
%\eeq
% indeed this holds for the expected degree of any packet.
% So its expected time of having zero degree is t=K (for any initial degree)
}
\exercisxC{3C}{ex.dfproverobustsolitonIII}{
Investigate experimentally how necessary the spike at $d=K/S$
(\eqref{eq.actualrobustcorrection})
is for successful decoding. Investigate also whether the tail of $\rho(d)$
beyond $d=K/S$ is necessary. What happens if all high-weight
degrees are removed, both the spike at $d=K/S$ and the tail of $\rho(d)$
beyond $d=K/S$?
}
\exercisxC{4}{ex.dfprove}{
Fill in the details in the proof of Luby's main theorem, that
receiving
$K' = K + 2 \ln(\Ripple/\delta)\Ripple$ checks ensures that all
the source packets can be recovered
with probability at least $1-\delta$.
}
\exercisxC{4C}{ex.dfdo}{
Optimize the degree distribution of a digital fountain code
for a file of $K=10\,000$ packets.
Pick a sensible objective function for your optimization, such
as minimizing the mean of $N$, the
number of packets required for complete decoding, or
the 95th percentile of the histogram of $N$
(\figref{fig.dfhisto}).
}
\exercisxB{3}{ex.dfcar}{
Make a model of the situation where a data stream is broadcast
to cars, and quantify the advantage that the digital fountain
has over the carousel method.
}
\exercisxC{2}{ex.dfdecodersubopt}{
Construct a simple example to illustrate the fact that the
digital fountain decoder of \secref{sec.dfdecode}
is suboptimal -- it sometimes
gives up even though the information available is
sufficient to decode the whole file. How does the cost of the
optimal decoder compare?
}
\exercisxB{2}{ex.overheadG}{
If every transmitted packet were created by adding together
% with degree $K/2$,
source packets at random with probability $\dhalf$ of each
source packet's being included,
% so that the code is effectively a random linear code,
show that the probability that $K' = K$ received packets
suffice for the optimal decoder to be able to recover the $K$
source packets
is only about $0.29$ for large $K$. [To put it another way,
what is the probability that a random $K \times K$ matrix
has full rank?]
Show that if $K' = K + \Delta$ packets are received,
the probability that they will not suffice
for the optimal decoder is roughly $2^{-\Delta}$.
}
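A sketch of the counting argument behind the preceding exercise, assuming that the connection
patterns of the received packets are independent, uniformly distributed binary vectors of length $K$:
the $k$th row of the $K \times K$ matrix must lie outside the span of the previous $k-1$ rows,
which contains $2^{k-1}$ of the $2^K$ possible vectors, so
\beq
P( \mbox{full rank} ) = \prod_{k=1}^{K} \left( 1 - 2^{k-1-K} \right)
= \prod_{j=1}^{K} \left( 1 - 2^{-j} \right) ,
\eeq
which converges rapidly to about $0.289$ as $K$ increases.
With $K' = K + \Delta$ received packets, the same argument applied to the $K$ columns (each of length $K+\Delta$) gives
\beq
P( \mbox{rank $K$} ) = \prod_{j=\Delta+1}^{K+\Delta} \left( 1 - 2^{-j} \right)
\geq 1 - \sum_{j > \Delta} 2^{-j} = 1 - 2^{-\Delta} ,
\eeq
so the probability of failure is at most $2^{-\Delta}$.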
\exercisxB{4C}{ex.raptor}{
Implement an optimal digital fountain decoder that
uses the method of
\citeasnoun{Urbanke00}\index{Richardson, Thomas J.}\index{Urbanke, R\"udiger}
derived for fast {\em encoding\/}
of sparse-graph codes (\secref{sec.fastencode})
to handle the matrix inversion required for optimal
decoding.
% of a digital fountain code.
Now that you have changed the decoder, you can reoptimize the
degree distribution, using higher-weight packets.
By how much can you reduce the overhead?
Confirm the assertion that this approach makes
digital fountain codes viable as erasure-correcting codes
for all blocklengths, not
just the large blocklengths for which LT codes are excellent.
}
\exercisxB{5}{ex.ratelessnoisy}{
Digital fountain codes are excellent rateless
codes for erasure channels.
Make a rateless code for a channel that has both
erasures and {\em noise}.
}
\newpage
% \section{Conclusion}
\section{Summary of sparse-graph codes}
A simple
method for designing error-correcting codes for noisy channels,
pioneered by \citeasnoun{Gallager62}, has recently been
rediscovered and generalized,
and \ind{communication} theory has been transformed.
The practical performance of\index{error-correcting code!low-density parity-check}
Gallager's {\ldpcc}s and their modern cousins is vastly better
than the performance of the codes with which textbooks
have been filled in the intervening years.
Which sparse-graph code is
`best' for a noisy channel depends on the chosen rate and blocklength,
the permitted encoding and decoding complexity, and
the question of whether occasional undetected errors
are acceptable.
%(turbo codes and repeat--accumulate codes
% both typically make occasional undetected errors because they
% have a small number of low weight codewords; \ldpc\ codes
% do not typically show such an error floor).
\Ldpc\ codes
are the most versatile; it's easy to make a competitive \ldpc\ code
with almost any rate and blocklength, and \ldpc\ codes
virtually never make undetected errors.
For the special case of the erasure channel,
the sparse-graph codes that are best are digital fountain codes.
\section{Conclusion}
The best solution to the communication problem is:
\begin{conclusionbox}
\begin{realcenter}
Combine a simple, {pseudo-random} code\\
with a message-passing decoder.\index{key points!communication}
% n approximate {\bf probability-based} decoder.
\end{realcenter}
\end{conclusionbox}
\dvips
% \dvipsb{endpart7}
% \chapter{What You Missed}
% \input{tex/nextedition.tex}
\part{Appendices}
\appendix
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Notation}
\label{app.notation}\index{notation}
% {app.notation}
% \section{Basic notation}
\begin{description}
\item[What does $P(A \given B,C)$ mean?]
$P(A \given B,C)$ is pronounced `the probability that\index{terminology}
$A$ is true {\em given that\/} $B$ is true {\em and\/} $C$ is true'.
Or, more briefly, `the probability of $A$ given $B$ and $C$'.
\item[What do $\log$ and $\ln$ mean?]
In this book, $\log x$ means the base-two logarithm, $\log_2 x$;
$\ln x$ means the natural logarithm, $\log_e x$.
\item[What does $\hat{s}$ mean?]
Usually, a `hat' over a variable denotes a guess or
estimator. So $\hat{s}$ is a guess at the value of $s$.
\item[Integrals\puncspace] There is no difference between
$\int f(u) \, \d u$ and $\int \d u \, f(u)$. The integrand is $f(u)$ in both cases.
%
% should I modify this? page 31 e.g.
% is defined to be $\int_{a}^{b} \d v \:
\item[What does $\displaystyle \prod_{n=1}^N$ mean?]
This is like the summation $\sum_{n=1}^N$ but it denotes a product.
It's pronounced `product over $n$ from 1 to $N$'.
So, for example,
\beq
\prod_{n=1}^N n = 1 \times 2 \times 3 \times \cdots \times N = N!
%\eeq
% and
%\beq
% \prod_{n=1}^N n
= \exp \left[ \sum_{n=1}^N \ln n \right] .
\eeq
I like to choose the name of the free variable in a sum or a product --
here, $n$ -- to be the lower case version of the range of the sum.
So $n$ usually runs from 1 to $N$, and $m$ usually runs from
1 to $M$. This is a habit I learnt from Yaser Abu-Mostafa,
and I think it makes formulae easier to understand.
\item[What does $\displaystyle {N \choose n}$ mean?]
This is pronounced `$N$ choose $n$', and it is the number of
ways of selecting an unordered set of $n$ objects from a set of
size $N$.
\beq
{{N} \choose {n}} = \frac{ N! }{ (N-n)! \, n! } .
\eeq
This function is known as the \ind{combination} function.
\item[What is $\Gamma(x)$?]
The {\dem\ind{gamma function}}\index{$\Gamma$}
is defined by $\Gamma(x) \equiv \int_0^{\infty} \d u \:
u^{x-1} e^{-u}$, for $x>0$.
The gamma function is an extension of the factorial
function to real number arguments.
In general, $\Gamma(x+1)=x \Gamma(x)$,
and for integer arguments, $\Gamma(x+1) = x!$. The digamma function
is defined by $\Psi(x) \equiv \frac{\d}{\d x} \ln \Gamma(x)$.
For large $x$ (for practical purposes, $0.1 \leq x \leq \infty$),
% the following approximations are useful:
\beq
\ln \Gamma(x) \simeq \textstyle \left(x-\half\right) \ln (x) - x
+ \half \ln 2\pi + O(1/x) ;
\eeq
% % \rightarrow \infty$
%\beq
% \Psi(x) =
% \frac{\d}{\d x} \log \Gamma(x) \simeq \log(x) - \frac{1}{2x} + O(1/x^2)
%\label{digam_1}
%\eeq
and for small $x$ (for practical purposes, $0 \leq x \leq 0.5$):
\beq
\ln \Gamma(x) \simeq \ln \frac{1}{x} - \gamma_e x + O(x^2)
\eeq
%\beq
% \Psi(x) \simeq - \frac{1}{x} - \gamma_e + O(x)
%\label{digam_2}
%\eeq
where $\gamma_e$ is Euler's constant.
%The digamma function satisfies
% the following recurrence relation exactly:
%\beq
% \Psi(x+1) = \Psi(x) + \frac{1}{x} .
%\label{digma_3}
%\eeq
% The gamma function is an extension of the factorial
% function to real number arguments.
% For integer arguments, it satisfies $\Gamma(n) = (n-1)!$.
% It can be defined by
%\beq
% \Gamma(x) \equiv \int_0^{\infty} t^{x-1} e^{-t} \, \d t .
%\eeq
\item[What does $H_2^{-1}(1- R/C)$ mean?]
Just as $\sin^{-1}(s)$ denotes the inverse function
to $s = \sin(x)$, so
$H_2^{-1}(h)$ is the inverse function to
$h = H_2(x)$.
There is potential confusion when people use
$\sin^2 x$ to denote $(\sin x)^2$, since then we
might expect $\sin^{-1} s$ to denote $1/\sin(s)$;
I therefore like to avoid using the notation $\sin^2 x$.
\item[What does $f'(x)$ mean?]
The answer depends on the context.
Often, a `prime' is used to denote differentiation:
\beq
f'(x) \equiv \frac{\d}{\d x } f(x) ;
\eeq
similarly, a dot denotes differentiation with respect to time, $t$:
\beq
\dot{x} \equiv \frac{\d}{\d t} x .
\eeq
However, the prime is also a useful indicator for `another variable',
for example `a new
value for a variable'.
So, for example, $x'$ might denote `the new value of $x$'.
Also, if there are two integers that both range from 1 to $N$,
I will often name those integers $n$ and $n'$.
So my rule is: if a prime occurs in an expression that
could be a function, such as $f'(x)$ or $h'(y)$, then
it denotes differentiation; otherwise it indicates `another variable'.
\item[What is the \ind{error function}?]
Definitions of this function vary. I define it to be the
cumulative probability of a standard (variance $=1$) normal distribution,
\beq
\Phi(z) \equiv \int_{-\infty}^{z} \exp ( - u^2/2 )/\sqrt{2 \pi} \,\, \d u.
\eeq
\item[What does $\Exp ( r )$ mean?]
%
% I had to use () not [] because \item is messed up by []
%
$\Exp[r]$ is pronounced `the expected value of $r$'
or `the expectation of $r$',
and it is the mean value of $r$.
Another symbol for `expected value' is the pair of
angle-brackets,
$\langle r \rangle.$
\item[What does $|x|$ mean?]
The vertical bars `$|\cdot|$'
have two meanings.\index{notation!absolute value}\index{notation!set size}
If $\A$ is a set, then
$|\A|$ denotes the number of elements in the set;
if $x$ is a number,
then $|x|$ is the absolute value of $x$.
\item[What does \mbox{$[\bA | \bP]$} mean?]
% In the expression $[\bA | \bP ]$
Here, $\bA$ and $\bP$ are matrices with the same number
of rows.
$[\bA | \bP ]$ denotes the double-width matrix obtained
by putting $\bA$ alongside $\bP$.
The vertical bar is used to avoid confusion with the product $\bA \bP$.
\item[What does $\bx^{\T}$ mean?]
The superscript $\T$ is pronounced `transpose'.
Transposing a row-vector
%$(1,2,3)$
turns it into
a column vector:
\beq
\left( 1,2,3 \right)^{\T}
= \left( \begin{array}{c} 1\\2\\3 \end{array} \right) ,
\eeq
and {\em vice versa}. [Normally my vectors, indicated by
bold face type ($\bx$), are column vectors.]
Similarly, matrices can be transposed. If $M_{ij}$ is
the entry in row $i$ and column $j$ of matrix $\bM$,
and $\bN = \bM^{\T}$, then $N_{ji} = M_{ij}$.
\item[What are $\mbox{\rm$\Trace$} \bM$ and $\det \bM$?]
The trace of a matrix is the sum of its diagonal
elements,
\beq
\Trace \bM = \sum_i M_{ii} .
\eeq
The determinant of $\bM$ is denoted $\det \bM$.
\item[What does $\delta_{mn}$ mean?]
The $\delta$ matrix is the identity matrix.
\beqa
\delta_{mn} = \left\{ \begin{array}{cl}1 & \:\:\mbox{if $m=n$} \\
0 & \:\:\mbox{if $m\neq n$.} \end{array}\right.
\eeqa
Another name for the \ind{identity matrix} is $\bI$ or ${\bf 1}$.
Sometimes I include a subscript on this symbol -- ${\bf 1}_K$ --
which indicates the size of the matrix ($K \times K$).
\item[What does $\delta(x)$ mean?]
The \ind{delta function} has the property
\beq
\int \d x \: f(x) \delta(x) = f(0) .
\eeq
\begin{aside}
Another possible meaning for $\delta(S)$ is the truth function,
which is 1 if the proposition $S$ is true and 0 otherwise,
but I have adopted another notation for that.
After all, the symbol $\delta$ is quite busy already, with the two roles mentioned
above in addition to its role as a small real number $\delta$ and an increment
operator (as in $\delta x$)!
\end{aside}
\item[What does \mbox{$\truth[S]$} mean?]
% 1[S]
$\truth[S]$ is the \ind{truth function},\index{indicator function}
which is 1 if the proposition $S$ is true and 0 otherwise.
For example, the number of positive numbers in the set $T=\{ -2,1,3 \}$
can be written
\beq
\sum_{x \in T} \truth[x>0] .
\eeq
\item[What is the difference between `\mbox{$:=$}' and `$=$'?]
In an algorithm, $x\, := \, y$
means that the variable $x$ is updated by assigning it the value of $y$.\index{:=}
In contrast, $x=y$ is a proposition, a statement that $x$ is equal to $y$.
\end{description}
See Chapters \ref{ch.distributions} and \ref{ch.mc} for
further definitions and notation relating to probability distributions.
%\subsection*{Octave}
% The notation of the {\tt{octave}} computer language is
% explained in
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Some Physics}
%Useful Formulae, and More}
% \section{Representations}
% \label{app.gaussian.rep}
% \input{tex/gaussian.tex}
% \section{Statistical physics}
\label{app.statphy}
% _statphys.tex
\section{About phase transitions}
A system with states $\bx$ in contact with a \ind{heat bath}
at \ind{temperature} $T = 1/\b$ has probability distribution
\beq
P(\bx|\b) = \frac{1}{Z(\b)} \exp( - \b E(\bx)) .
\eeq
The \ind{partition function} is\index{phase transition}
\beq
Z(\b) = \sum_{\bx} \exp( - \b E(\bx)) .
\label{eq.Z1}
\eeq
The inverse \ind{temperature} $\b$ can be interpreted as defining
an \ind{exchange rate} between \ind{entropy} and \ind{energy}.
$(1/\beta)$ is the amount of energy that must be given to
a heat bath to increase its entropy by one {nat}.\index{nat (unit)}
Often, the system will be affected by some other parameters
such as the volume of the box it is in, $V$,
in which case $Z$ is a function of $V$ too, $Z(\b,V)$.
For any system with a finite number of states, the function $Z(\b)$
is evidently a continuous function of $\b$, since it is simply
a sum of exponentials. Moreover,
all the derivatives of $Z(\b)$ with respect to $\b$
are continuous too.\index{thermodynamics}
What phase transitions are all about, however, is this:
phase transitions correspond to values of $\b$ and $V$ (called critical points)
at which
the derivatives of $Z$ have discontinuities or divergences.
Immediately we can deduce:
\begin{conclusionbox}
Only systems with an infinite number of states can show
\ind{phase transition}s.
\end{conclusionbox}
Often, we include a parameter $N$ describing the size of the system.
Phase transitions may appear in the limit $N \rightarrow \infty$.
Real systems may have a value of $N$ like $10^{23}$.
If we make the system large by simply grouping together
$N$ independent systems whose \ind{partition function} is $Z_{(1)}(\b)$,
then nothing interesting happens. The partition function
for $N$ independent identical systems is
simply
\beq
Z_{(N)}(\b) = [Z_{(1)}(\b)]^{N} .
\eeq
Now, while this function $Z_{(N)}(\b)$ may be a very rapidly
varying function of $\beta$, that doesn't mean it is showing
phase transitions. The natural way to look at the partition
function is in the logarithm
% (which is roughly the same as the free energy)
\beq
\ln Z_{(N)}(\b) = N \ln Z_{(1)}(\b) .
\eeq
Duplicating the original system $N$ times simply scales up
all properties like the energy and heat capacity of the system
by a factor of $N$. So if the original system showed no
phase transitions then the scaled up system won't have any either.
\begin{conclusionbox}
Only systems with long-range correlations\index{correlations!and phase transitions} show
phase transitions.
\end{conclusionbox}
%%%%%%%%%% LOOOK, check, action HELP!!!!!!!!!!!!!
% This is a vague assertion, sorry.... can it be tightened?
% Should emphasize that there do not have to be direct long-range
% couplings; short-range energetic couplings are enough to
% give rise to long-range correlations.
Long-range correlations do not require long-range energetic couplings;
for example, a magnet has only short-range couplings (between
adjacent spins) but these are sufficient to create long-range order.
\subsection{Why are points at which derivatives diverge interesting?}
The derivatives of $\ln Z$ describe properties
like the heat capacity of the system (that's the second derivative)
or its fluctuations in energy. If the second derivative of $\ln Z$
diverges at a temperature $1/\b$, then the heat capacity of the
system diverges there, which means it can absorb or release
energy without changing temperature (think of ice melting in ice water);
when the system is at equilibrium at that temperature, its energy
fluctuates a lot,
in contrast to the normal law-of-large-numbers behaviour,
where the energy only varies by one part in $\sqrt{N}$.
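For reference, the standard relations behind these statements can be written as follows.
This is a sketch in units where Boltzmann's constant is 1, as implied by $T = 1/\b$;
the symbols $\bar{E}$ (mean energy) and $C$ (heat capacity) are introduced here for illustration only:
\beq
\bar{E} = - \frac{\partial \ln Z}{\partial \b} , \:\:\:\:
\mbox{var}(E) = \frac{\partial^2 \ln Z}{\partial \b^2} , \:\:\:\:
C \equiv \frac{\partial \bar{E}}{\partial T} = \b^2 \, \mbox{var}(E) .
\eeq
A divergence of the second derivative of $\ln Z$ is thus simultaneously a divergence of
the energy fluctuations and of the heat capacity.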
\subsection{A toy system that shows a phase transition}
Imagine a collection of $N$ coupled spins that have the following
energy as a function of their state $\bx \in \{0,1\}^N$:
\beq
E(\bx) = \left\{
\begin{array}{ccl}
- N \epsilon & & \bx = (0,0,0,\ldots,0) \\
0 & & \mbox{otherwise.}
\end{array} \right.
\eeq
This energy function describes a ground state
in which all the spins are aligned in the zero direction; the energy
per spin in this state is $-\epsilon$.
If any spin changes state then the energy is zero.
This model is like an extreme version of a magnetic interaction,
which encourages pairs of spins to be aligned.
We can contrast it with an ordinary system of $N$ {\em independent\/} spins
whose energy is:
\beq
E^{0}(\bx) = \epsilon \sum_n (2 x_n -1 ) .
\eeq
Like the first system, the system of independent
spins has a single ground state $(0,0,0,\ldots,0)$ with
energy $-N \epsilon$, and it has roughly $2^{N}$ states with
energy very close to 0, so the low-temperature and high-temperature
properties of the independent-spin system and the coupled-spin system
are virtually identical.
\begin{figure}
\figuremargin{\small%
\begin{center}
\begin{tabular}{cc}
(a)\hspace{-0.3in}%
\mbox{\psfig{figure=statphys/toyZ.ps,angle=-90,width=2in}}
&
(c)\hspace{-0.3in}%
\mbox{\psfig{figure=statphys/toyvarE.ps,angle=-90,width=2in}}
\\
(b)\hspace{-0.3in}%
\mbox{\psfig{figure=statphys/toyZ24.ps,angle=-90,width=2in}}
&
\\% \midrule
\end{tabular}
\end{center}
}{%
\caption[a]{(a) Partition function of toy system which shows
a phase transition for large $N$.
The arrow marks the point $\b_c = \linefrac{\ln 2}{\epsilon}$.
(b) The same, for larger $N$.
(c) The variance of the energy of the system as a function of $\beta$
for two system sizes. As $N$ increases the variance has
an increasingly sharp peak at the critical point $\b_c$.
Contrast with \protect\figref{fig.toyZb}.
}
\label{fig.toyZ}
}%
\end{figure}
The partition function of the coupled-spin system is given by
\beq
Z(\b) = e^{\b N \epsilon} + 2^{N} - 1 .
\eeq
The function
\beq
\ln Z(\b) = \ln \left( e^{\b N \epsilon} + 2^{N} - 1 \right)
\eeq
is sketched in \figref{fig.toyZ}a along with its
low temperature behaviour,
\beq
\ln Z(\b) \simeq N \b \epsilon, \:\:\: \b \rightarrow \infty,
\eeq
and its high temperature behaviour,
\beq
\ln Z(\b) \simeq N \ln 2, \:\:\: \b \rightarrow 0.
\eeq
%
The arrow marks the point
\beq
\b = \frac{\ln 2}{\epsilon}
\eeq
at which these two asymptotes intersect.
In the limit $N \rightarrow \infty$, the
graph of $\ln Z(\b)$ becomes more and more sharply
bent at this point (\figref{fig.toyZ}b).
The second derivative of $\ln Z$, which describes the
variance of the energy of the system, has a peak value,
at $\b = \linefrac{\ln 2}{\epsilon}$, roughly
equal to
\beq
\frac{N^2 \epsilon^2}{4} ,
\eeq
which corresponds to the system spending half of its
time in the ground state and half its time in the other
states.
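A short check of this value (the symbol $p_0$, the probability of the ground state, is introduced here for illustration):
at $\b = \linefrac{\ln 2}{\epsilon}$ we have $e^{\b N \epsilon} = 2^N$, so
\beq
p_0 = \frac{ e^{\b N \epsilon} }{ e^{\b N \epsilon} + 2^N - 1 } = \frac{2^N}{2^{N+1} - 1} \simeq \frac{1}{2} .
\eeq
The energy is $-N\epsilon$ with probability $p_0$ and zero otherwise, so
\beq
\mbox{var}(E) = N^2 \epsilon^2 \, p_0 ( 1 - p_0 ) \simeq \frac{N^2 \epsilon^2}{4} .
\eeq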
% \Figref{fig.toyZ}b show the heat capacity.
At this critical point, the heat capacity of this system
is thus proportional to
$N^2$; the heat capacity per spin is proportional to
$N$, which, for infinite $N$, is infinite,
in contrast to the behaviour of
% ordinary non-critical
systems away from phase transitions, whose heat capacity per atom is a finite number.
For comparison, \figref{fig.toyZb} shows the partition function and energy-variance
of the ordinary independent-spin system.
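The corresponding quantities for the independent-spin system follow directly from the energy $E^{0}(\bx)$;
as a sketch, each spin contributes energy $\pm \epsilon$, so
\beq
Z(\b) = \left[ e^{\b \epsilon} + e^{-\b \epsilon} \right]^N , \:\:\:\:
\ln Z(\b) = N \ln \left( 2 \cosh \b \epsilon \right) ,
\eeq
and differentiating twice with respect to $\b$ gives
\beq
\mbox{var}(E) = \frac{\partial^2 \ln Z}{\partial \b^2} = \frac{ N \epsilon^2 }{ \cosh^2 \b \epsilon } .
\eeq
This variance is a smooth function of $\b$, is largest at $\b = 0$, and grows only linearly with $N$.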
\begin{figure}
\figuremargin{\small%
\begin{center}
\begin{tabular}{cc}
(a)\hspace{-0.3in}%
\mbox{\psfig{figure=statphys/dullZ.ps,angle=-90,width=2in}}
&
(b)\hspace{-0.3in}%
\mbox{\psfig{figure=statphys/dullvarE.ps,angle=-90,width=2in}}
\\
\end{tabular}
\end{center}
}{%
\caption[a]{The partition function (a)
and energy-variance (b) of a system consisting of $N$ independent spins.
The \ind{partition function} changes gradually from one asymptote to the
other, regardless of how large $N$ is; the variance of the energy
does not have a peak. The fluctuations are largest at high
temperature (small $\beta$) and scale linearly with system size $N$.
}
\label{fig.toyZb}
}%
\end{figure}
\subsection{More generally}
Phase transitions can be categorized
into `first-order' and `continuous' transitions.
% by the degree of the
% derivative of $\ln Z$ that diverges.
In a first-order phase transition, there is
a discontinuous change of one or more order-parameters;\index{order parameter}
in a continuous transition, all order-parameters change continuously.
[What's an order-parameter?
-- a scalar function of the state of the system;
or, to be precise, the expectation of such a function.]
In the vicinity of a critical point, the concept of
`typicality' defined in \chref{chtwo} does not
hold. For example, our toy system, at its critical point, has a 50\% chance of
being in a state with energy $- N \epsilon$, and
roughly a $1/2^{N+1}$ chance of being in each of the other states
that have energy zero. It is thus not the case that $\ln 1/P(\bx)$
is very likely to be close to the entropy of the system
at this point, unlike a system with $N$ i.i.d.\ components.
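To make this concrete, a rough calculation using the probabilities just quoted
(ground state $\simeq \linefrac{1}{2}$; each of the other $2^N - 1$ states $\simeq 1/2^{N+1}$)
gives for the entropy, in nats,
\beq
H \simeq \frac{1}{2} \ln 2 + \frac{1}{2} (N+1) \ln 2 = \left( \frac{N}{2} + 1 \right) \ln 2 ,
\eeq
whereas $\ln 1/P(\bx)$ takes either the value $\ln 2$ or the value $(N+1) \ln 2$, each with probability about one half;
for large $N$ neither value is close to $H$.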
Remember that information content ($\ln 1/P(\bx)$) and
energy are very closely related. If typicality holds, then
the system's energy has negligible fluctuations, and
{\em{vice versa}}.
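\par\smallskip
\noindent
As a rough check of this claim, we can take the probabilities quoted above at face value (an idealization: the zero-energy states are assumed to share the remaining probability of $1/2$ equally). The information content $\ln 1/P(\bx)$ then takes only two values,
\beq
 \ln \frac{1}{P(\bx)} \simeq \left\{ \begin{array}{ll}
 \ln 2 & \mbox{with probability $1/2$} \\
 (N+1) \ln 2 & \mbox{with probability $1/2$,}
 \end{array} \right.
\eeq
so the entropy, which is the average of these two values, is roughly
\beq
 \half \ln 2 + \half (N+1) \ln 2 = \left( \frac{N}{2} + 1 \right) \ln 2 .
\eeq
Neither value of $\ln 1/P(\bx)$ is close to this average when $N$ is large: every outcome has an information content that differs from the entropy by about $(N/2) \ln 2$.
\par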
% see:
%
% replica.tex
%\newpage
\chapter{Some Mathematics}
\section{Finite field theory}
\label{app.GF}
%\newpage
\subsection{Most linear codes are expressed in the language of Galois theory}
Why are \ind{Galois field}s an appropriate language for linear codes?
First, a definition and some examples.
\begin{description}
\item[A field] $F$ is a set\index{field}
%
$%\beq
F = \{ 0 , F' \}$
%\eeq
%
such that
\ben
\item
$F$ forms an Abelian group
under an addition operation `$+$', with 0
being the identity; [Abelian means all elements
commute, \ie, satisfy $a+b=b+a$.]
\item
$F'$ forms an Abelian group
under a multiplication operation `$\cdot$'; multiplication
of any element by 0 yields 0;
% the zero element yields zero.
\item
these operations satisfy the distributive
rule $(a+b) \cdot c = a\cdot c + b \cdot c$.%
%
\margintab{
\[
\begin{array}{cc}
\begin{array}{c|cc|}
+ & 0 & 1 \\ \hline
0 & 0 & 1 \\
1 & 1 & 0 \\ \hline
\end{array} &
\begin{array}{c|cc|}
\cdot
& 0 & 1 \\ \hline
0 & 0 & 0 \\
1 & 0 & 1 \\ \hline
\end{array}
\end{array}
\]
\caption[a]{Addition and multiplication tables for $GF(2)$.}
\label{tab.gf2}
}
\een
For example, the real numbers form a field, with `$+$' and `$\cdot$'
denoting ordinary addition and multiplication.
%
\item[A Galois field] $GF(q)$ is a field with a finite number of elements $q$.
A unique \ind{Galois field} exists for any $q = p^m$, where $p$ is a prime number and
$m$ is a positive integer;
% \index{finite field}
% and not for any other $q$.
there are no other finite fields.
\item[$GF(2).$] The addition and multiplication tables for $GF(2)$ are shown in
\tabref{tab.gf2}.
These are the rules of addition and multiplication modulo 2.
\item[$GF(p).$] For any prime number $p$,
the addition and multiplication rules are those for ordinary
addition and multiplication, modulo $p$.
\item[$GF(4).$]
The rules for $GF(p^m)$, with $m>1$, are
{\em not \/} those of ordinary
addition and multiplication.
For example the tables for $GF(4)$ (\tabref{tab.gf4}) are%
\amargintab{b}{
\[
\begin{array}{c}
\begin{array}{c|cccc|}
+ & 0 & 1 & A & B \\ \hline
0 & 0 & 1 & A & B \\
1 & 1 & 0 & B & A \\
A & A & B & 0 & 1 \\
B & B & A & 1 & 0 \\ \hline
\end{array} \\[0.41in]
\begin{array}{c|cccc|}
\cdot
& 0 & 1 & A & B \\ \hline
0 & 0 & 0 & 0 & 0 \\
1 & 0 & 1 & A & B \\
A & 0 & A & B & 1 \\
B & 0 & B & 1 & A \\ \hline
\end{array}
\end{array}
\]
\caption[a]{Addition and multiplication tables for $GF(4)$.}
\label{tab.gf4}
}
{\em not\/} the rules of addition and multiplication modulo 4. Notice that $1+1=0$, for example.
So how can $GF(4)$ be described? It turns out that the elements
can be related to {\em polynomials}. Consider polynomial functions of $x$
of degree at most 1, with coefficients that are elements of
$GF(2)$.
% integers modulo 2.
The polynomials shown in \tabref{tab.gf4e} obey the addition and multiplication rules of $GF(4)$ {\em if\/}
addition and multiplication are modulo the polynomial $x^2 + x + 1$,
and the coefficients of the polynomials are from $GF(2)$.
For example, $B \cdot B = x^2 + (1+1)x + 1 = x = A$.
% [ \mod ( x^2 + x + 1 )]
%
Each element may also be represented as a bit pattern as shown
in \tabref{tab.gf4e}, with addition being bitwise modulo 2, and multiplication
defined with an appropriate carry operation.%
\amargintab{b}{\footnotesize
\[
\begin{array}{ccc} \toprule
\mbox{Element} & \mbox{Polynomial} & \mbox{Bit pattern} \\ \midrule
0 & 0 & {\tt 00} \\
1 & 1 & {\tt 01} \\
A &x & {\tt 10} \\
B &x+1 & {\tt 11} \\ \bottomrule
\end{array}
\]
\caption[a]{Representations of the elements of $GF(4)$.}
\label{tab.gf4e}
}
\item[$GF(8)${\puncspace}]
%\subsection{A code over $GF(8)$}
We\label{sec.gf8}
can denote the elements of $GF(8)$ by $\{0,1,A,B,C,D,E,F\}$.
Each element can be mapped onto a polynomial over $GF(2)$.
The multiplication and addition operations are given by
multiplication and addition of the polynomials, modulo
$x^3+x+1$. The multiplication table is given below.
\[%beq
\begin{array}{ccc} \toprule
\mbox{element} & \mbox{polynomial} & \mbox{binary representation}
\\ \midrule
0 & 0 & {\tt 000} \\
1 & 1 & {\tt 001}\\
A & x & {\tt 010} \\
B & x + 1 & {\tt 011} \\
C & x^2 & {\tt 100} \\
D & x^2 + 1 & {\tt 101} \\
E & x^2 +x & {\tt 110} \\
F & x^2 +x + 1 & {\tt 111} \\ \bottomrule
\end{array}
%\eeq
%
%Here is the multiplication table:
%\beq
\hspace*{0.53in}
\begin{array}{c|*{8}{c}|} \cdot
& 0&1&A&B&C&D&E&F\\
\hline
%%%%%%%%%%%
0& 0&0&0&0&0&0&0&0\\
1& 0&1&A&B&C&D&E&F\\
A& 0&A&C&E&B&1&F&D\\
B& 0&B&E&D&F&C&1&A\\
C& 0&C&B&F&E&A&D&1\\
D& 0&D&1&C&A&F&B&E\\
E& 0&E&F&1&D&B&A&C\\
F& 0&F&D&A&1&E&C&B\\ \hline
\end{array}
\]%eeq
\end{description}
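\par\smallskip
\noindent
To illustrate how an entry of the $GF(8)$ multiplication table arises (a worked example that uses only the representations tabulated above), consider the product $B \cdot D$. As polynomials over $GF(2)$,
\beq
 (x+1)(x^2+1) = x^3 + x^2 + x + 1 .
\eeq
Working modulo $x^3+x+1$ means that $x^3$ may be replaced by $x+1$ (in $GF(2)$, subtraction is the same as addition), so
\beq
 x^3 + x^2 + x + 1 = (x+1) + x^2 + x + 1 = x^2 + (1+1)x + (1+1) = x^2 ,
\eeq
which is the element $C$, in agreement with the table. Addition needs no reduction: $B + D = (x+1) + (x^2+1) = x^2 + x = E$, which in the binary representation is the bitwise sum modulo 2 of {\tt 011} and {\tt 101}, namely {\tt 110}.
\par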
Why are Galois fields relevant to linear codes? Imagine generalizing
a binary generator matrix $\bG$ and binary vector $\bs$ to a matrix
and vector with elements from a larger set, and generalizing the
addition and multiplication operations that define the product $\bG
\bs$. In order to produce an appropriate input for a symmetric
channel, it would be convenient if, for random $\bs$, the product
$\bG \bs$ produced all elements in the enlarged set with equal
probability. This uniform distribution
is easiest to guarantee if these elements form a group
under addition and, with zero excluded, under multiplication, because then these
operations do not break the symmetry among the elements. When two
random elements of a multiplicative group are multiplied together,
all elements are produced with equal probability. This is not true of
other sets such as the integers, for which the multiplication
operation is more likely to give rise to some elements (the composite
numbers) than others. Galois fields, by their definition, avoid such
symmetry-breaking
effects.
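\par\smallskip
\noindent
A small comparison makes the point concrete (an illustrative calculation, not a statement about any particular code). Take two elements independently and uniformly from the three non-zero elements of $GF(4)$ and multiply them using \tabref{tab.gf4}: of the nine equiprobable products, three equal $1$, three equal $A$, and three equal $B$, so the product is again uniformly distributed over the non-zero elements. Now do the same with the non-zero integers $\{1,2,3\}$ and multiplication modulo 4:
\beq
\begin{array}{c|ccc}
\cdot & 1 & 2 & 3 \\ \hline
1 & 1 & 2 & 3 \\
2 & 2 & 0 & 2 \\
3 & 3 & 2 & 1
\end{array}
\eeq
The nine products contain one $0$, two $1$s, four $2$s, and two $3$s, so the symmetry among the elements is broken; indeed $2 \cdot 2 = 0$ shows that the non-zero integers modulo 4 do not form a group under multiplication.
\par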
%\section{Matrices}
%\section{Eigenvectors}
%
% eigen.tex
%
\section{Eigenvectors and eigenvalues}
A {\dem right-eigenvector} of a square matrix
$\bA$ is a non-zero vector $\eR$ that satisfies
\beq
\bA \eR = \lambda \eR ,
\eeq
where $\lambda$ is the eigenvalue associated with
that eigenvector. The eigenvalue may be a real number or
complex number and it may be zero. Eigenvectors may
be real or complex.
A {\dem left-eigenvector} of a matrix
$\bA$ is a non-zero vector $\eL$ that satisfies
\beq
\eL^{\T} \bA = \lambda \eL^{\T} .
\eeq
The following statements for right-eigenvectors
also apply to left-eigenvectors.
\bit
\item
If a matrix has two or more linearly independent right-eigenvectors
with the same eigenvalue then that eigenvalue is called
a degenerate eigenvalue of the matrix, or a repeated eigenvalue.
Any linear combination of those eigenvectors is another
right-eigenvector with the same eigenvalue.
\item
The principal right-eigenvector of a matrix
is, by definition, the right-eigenvector with the largest
associated eigenvalue.
\item
If a real matrix has a right-eigenvector with complex eigenvalue $\lambda
= x + yi$ then it also has a right-eigenvector with the
conjugate eigenvalue $\lambda^*
= x - yi$.
% {\em(Check.) }
\eit
\subsection{Symmetric matrices}
If $\bA$ is a real symmetric $N \times N$ matrix then
\ben
\item
all the eigenvalues and eigenvectors of $\bA$ are real;
\item
every left-eigenvector of $\bA$ is also a right-eigenvector of $\bA$
with the same eigenvalue,
and \viceversa;
\item
a set of $N$ eigenvectors and eigenvalues $\{ \be^{(a)} ,
\lambda_a \}_{a=1}^N$
can be found that are orthonormal, that is,
\beq
\be^{(a)} {\bf \cdot} \be^{(b)} = \delta_{ab} ;
\eeq
the matrix can be expressed as a weighted sum of outer
products of the eigenvectors:
\beq
\bA = \sum_{a=1}^{N} \lambda_a [\be^{(a)}] [\be^{(a)}]^{\T} .
\eeq
\een
(Whereas I often use $i$ and $n$ as indices for sets of size $I$
and $N$, I will use the indices $a$ and $b$ to run over eigenvectors,
even if there are $N$ of them. This is to avoid confusion with the
components of the eigenvectors, which are
indexed by $n$, \eg\ $e^{(a)}_{n}$.)
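\par\smallskip
\noindent
As a check on these properties (a worked example using the second matrix of \tabref{tab.eigsforyou}; the symbol $\phi$ is introduced here for convenience), take
$\bA = \left[ \begin{array}{cc} 0 & 1 \\ 1 & 1 \end{array} \right]$
and let $\phi = (1+\sqrt{5})/2 \simeq 1.62$, which satisfies $\phi^2 = \phi + 1$. The eigenvalues are $\lambda_1 = \phi$ and $\lambda_2 = 1 - \phi = -1/\phi \simeq -0.62$, with orthonormal eigenvectors
\beq
 \be^{(1)} = \frac{1}{\sqrt{1+\phi^2}} \left[ \begin{array}{c} 1 \\ \phi \end{array} \right] , \:\:\:
 \be^{(2)} = \frac{1}{\sqrt{1+\phi^2}} \left[ \begin{array}{r} \phi \\ -1 \end{array} \right] ,
\eeq
whose numerical values are $(.53,.85)$ and $(.85,-.53)$. Their inner product is proportional to $\phi - \phi = 0$, and the weighted sum of outer products is
\beq
 \frac{\phi}{1+\phi^2} \left[ \begin{array}{cc} 1 & \phi \\ \phi & \phi^2 \end{array} \right]
 - \frac{1}{\phi (1+\phi^2)} \left[ \begin{array}{rr} \phi^2 & -\phi \\ -\phi & 1 \end{array} \right]
 = \frac{1}{1+\phi^2} \left[ \begin{array}{cc} 0 & 1+\phi^2 \\ 1+\phi^2 & \phi^3 - 1/\phi \end{array} \right]
 = \bA ,
\eeq
where the last step uses $\phi^3 - 1/\phi = (2\phi+1) - (\phi-1) = \phi + 2 = 1 + \phi^2$.
\par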
\subsection{General square matrices}
An $N \times N$ matrix can have up to $N$ distinct eigenvalues.
Generically, there are $N$ eigenvalues, all distinct,
and each has one left-eigenvector
and one right-eigenvector. In cases where two or more eigenvalues
coincide, for each distinct eigenvalue that is non-zero there
is at least one left-eigenvector and one right-eigenvector.
% (CHECK.)
Left- and right-eigenvectors that have different eigenvalues are orthogonal,
that is,
\beq
\mbox{if $\l_a \neq \l_b$ then} \:\: \eL^{(a)} {\bf \cdot} \eR^{(b)} = 0.
\eeq
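\par\smallskip
\noindent
For instance (a worked check using the first matrix of \tabref{tab.eigsforyou}), the non-symmetric matrix
$\bA = \left[ \begin{array}{ccc} 1 & 2 & 0 \\ 1 & 1 & 0 \\ 0 & 0 & 1 \end{array} \right]$
has eigenvalues $1+\sqrt{2} \simeq 2.41$, $1$, and $1-\sqrt{2} \simeq -0.41$. The left-eigenvector with eigenvalue $1+\sqrt{2}$ and the right-eigenvector with eigenvalue $1-\sqrt{2}$ are
\beq
 \eL = \frac{1}{\sqrt{3}} \left[ \begin{array}{c} 1 \\ \sqrt{2} \\ 0 \end{array} \right] , \:\:\:
 \eR = \frac{1}{\sqrt{3}} \left[ \begin{array}{r} -\sqrt{2} \\ 1 \\ 0 \end{array} \right] ,
\eeq
that is, $(.58,.82,0)$ and $(-.82,.58,0)$, and indeed $\eL {\bf \cdot} \eR = (-\sqrt{2} + \sqrt{2})/3 = 0$. By contrast, the left- and right-eigenvectors that share the eigenvalue $1+\sqrt{2}$, namely $(1,\sqrt{2},0)/\sqrt{3}$ and $(\sqrt{2},1,0)/\sqrt{3}$, have inner product $2\sqrt{2}/3 \neq 0$.
\par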
\subsection{Non-negative matrices}
%\subsubsection
{\sf Definition.}
If all the elements of a non-zero matrix $\bC$ satisfy $C_{mn} \geq 0$
then $\bC$ is a non-negative matrix.
%% (Further conditions? $\bC \neq 0$ \eg,?)
% The statement $\bC \geq 0$ means that $\bC$ is a non-negative matrix.
% (Don't confuse with the statement that the determinant is $\geq 0$
% which would be written $|\bC| \geq 0$, nor with the
% statement that $\bC$ is positive definite, for which I
% have no notation.)
%
Similarly, if all the elements of a non-zero vector $\bc$ satisfy $c_{n} \geq 0$
then $\bc$ is a non-negative vector.
\smallskip
% \subsubsection
\noindent {\sf Properties.}
% If the principal eigenvalue of a non-negative matrix is
% not degenerate, then the matrix has a principal eigenvector
% that is non-negative.
% If the principal eigenvalue of a non-negative matrix is
% not degenerate,
A non-negative matrix has a principal eigenvector
that is non-negative. It may also have other eigenvectors with
the same eigenvalue that are not non-negative.
But if the principal eigenvalue of a non-negative matrix is
not degenerate, then the matrix has only one principal eigenvector $\be^{(1)}$,
and it is non-negative.
%(Of course, any
% linear multiple of that vector,
% for example, $-\be^{(1)}$, is also an eigenvector; so the statement
% just made should really be worded..... (shall I be pedantic?).)
Generically,
all the other eigenvalues are smaller in absolute magnitude.
% (Check; actually, there could be one with equal and opposite magnitude?)
[There can be several eigenvalues of identical magnitude in
special cases.]
% multiple principal eigenvectors if the
% graph is not connected, and there can be eigenvectors with
% opposite eigenvalue if it is not ergodic in that it doesn't mix.
%\subsubsection{Examples}
% cd itp/eigen
% source makeall
\begin{table}
{\small
\begin{tabular}{cc} \toprule
Matrix & Eigenvalues and eigenvectors $\eL,\eR$ \\ \midrule
%%%%
%%%% written by matrix2tex.p
%%%%
%%%% beginning of matrix
%%%%
$
%%%%
\left[
\begin{array}{@{\,\,}*{2}{c@{\,\,\,\,}}c@{\,\,}}
1 & 2 & 0 \\
1 & 1 & 0 \\
0 & 0 & 1 \\
\end{array}
\right]
%%%%
%%%% DIVIDER between matrix and eigs
%%%%
$ & $
%%%%
%% sort order left eigs:
%021
%% sort order right eigs:
%021
\begin{array}{*{3}{c}}
%%%% left eigenvalue 0 (0)
2.41 & %%%% right eigenvalue 0 (0)
%%%% 2.41 & %%%% left eigenvalue 1 (2)
1 & %%%% right eigenvalue 1 (2)
%%%% 1 & %%%% left eigenvalue 2 (1)
-0.41 \\
%%%% right eigenvalue 2 (1)
%%%% -0.41\\
%%%% left eigenvector 0(0)
\left[ \begin{array}{@{}r@{}}
.58 \\
.82 \\
0
\end{array}
\right]
%%%% right eigenvector 0(0)
\left[ \begin{array}{@{}r@{}}
.82 \\
.58 \\
0
\end{array}
\right]
& %%%% left eigenvector 1(2)
\left[ \begin{array}{@{}r@{}}
0 \\
0 \\
1
\end{array}
\right]
%%%% right eigenvector 1(2)
\left[ \begin{array}{@{}r@{}}
0 \\
0 \\
1
\end{array}
\right]
& %%%% left eigenvector 2(1)
\left[ \begin{array}{@{}r@{}}
-.58 \\
.82 \\
0
\end{array}
\right]
%%%% right eigenvector 2(1)
\left[ \begin{array}{@{}r@{}}
-.82 \\
.58 \\
0
\end{array}
\right]
\\
\end{array}
%%%%
%%%% end of eigs
%%%%
$
%%%%
%\input{eigen/eg2e.tex}
\\ \midrule
%\input{eigen/rl3e.tex} \\ \midrule
%\input{eigen/rl2e.tex} \\ \midrule
%%%%
%%%% written by matrix2tex.p
%%%%
%%%% beginning of matrix
%%%%
$
%%%%
\left[
\begin{array}{@{\,\,}*{1}{c@{\,\,\,\,}}c@{\,\,}}
0 & 1 \\
1 & 1 \\
\end{array}
\right]
%%%%
%%%% DIVIDER between matrix and eigs
%%%%
$ & $
%%%%
%% sort order left eigs:
%10
%% sort order right eigs:
%10
\begin{array}{*{2}{c}}
%%%% left eigenvalue 0 (1)
1.62 & %%%% right eigenvalue 0 (1)
%%%% 1.62 & %%%% left eigenvalue 1 (0)
-0.62 \\
%%%% right eigenvalue 1 (0)
%%%% -0.62\\
%%%% left eigenvector 0(1)
\left[ \begin{array}{@{}r@{}}
.53 \\
.85
\end{array}
\right]
%%%% right eigenvector 0(1)
\left[ \begin{array}{@{}r@{}}
.53 \\
.85
\end{array}
\right]
& %%%% left eigenvector 1(0)
\left[ \begin{array}{@{}r@{}}
.85 \\
-.53
\end{array}
\right]
%%%% right eigenvector 1(0)
\left[ \begin{array}{@{}r@{}}
.85 \\
-.53
\end{array}
\right]
\\
\end{array}
%%%%
%%%% end of eigs
%%%%
$
\\ \midrule
%%%%
%%%% written by matrix2tex.p
%%%%
%%%% beginning of matrix
%%%%
$
%%%%
\left[
\begin{array}{@{\,\,}*{3}{c@{\,\,\,\,}}c@{\,\,}}
1 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 \\
1 & 0 & 0 & 0 \\
0 & 0 & 1 & 1 \\
\end{array}
\right]
%%%%
%%%% DIVIDER between matrix and eigs
%%%%
$ & $
%%%%
%% sort order left eigs:
%3120
%% sort order right eigs:
%3120
\begin{array}{*{4}{c}}
%%%% left eigenvalue 0 (3)
1.62 & %%%% right eigenvalue 0 (3)
%%%% 1.62 & %%%% left eigenvalue 1 (1)
0.5\!+\! 0.9i & %%%% right eigenvalue 1 (1)
%%%% 0.5\!+\! 0.9i & %%%% left eigenvalue 2 (2)
0.5\!-\! 0.9i & %%%% right eigenvalue 2 (2)
%%%% 0.5\!-\! 0.9i & %%%% left eigenvalue 3 (0)
-0.62 \\
%%%% right eigenvalue 3 (0)
%%%% -0.62\\
%%%% left eigenvector 0(3)
\left[ \begin{array}{@{}r@{}}
.60 \\
.37 \\
.37 \\
.60
\end{array}
\right]
%%%% right eigenvector 0(3)
\left[ \begin{array}{@{}r@{}}
.60 \\
.37 \\
.37 \\
.60
\end{array}
\right]
& %%%% left eigenvector 1(1)
\left[ \begin{array}{@{}r@{}}
.1\!-\! .5i \\
-.3\!-\! .4i \\
.3\!+\! .4i \\
-.1\!+\! .5i
\end{array}
\right]
%%%% right eigenvector 1(1)
\left[ \begin{array}{@{}r@{}}
.1\!-\! .5i \\
.3\!+\! .4i \\
-.3\!-\! .4i \\
-.1\!+\! .5i
\end{array}
\right]
& %%%% left eigenvector 2(2)
\left[ \begin{array}{@{}r@{}}
.1\!+\! .5i \\
-.3\!+\! .4i \\
.3\!-\! .4i \\
-.1\!-\! .5i
\end{array}
\right]
%%%% right eigenvector 2(2)
\left[ \begin{array}{@{}r@{}}
.1\!+\! .5i \\
.3\!-\! .4i \\
-.3\!+\! .4i \\
-.1\!-\! .5i
\end{array}
\right]
& %%%% left eigenvector 3(0)
\left[ \begin{array}{@{}r@{}}
.37 \\
-.60 \\
-.60 \\
.37
\end{array}
\right]
%%%% right eigenvector 3(0)
\left[ \begin{array}{@{}r@{}}
.37 \\
-.60 \\
-.60 \\
.37
\end{array}
\right]
\\
\end{array}
%%%%
%%%% end of eigs
%%%%
$
%%%%
%\input{eigen/C111.tex}
\\ \bottomrule
\end{tabular}
}\\
\caption[a]{Some matrices and their eigenvectors.}
\label{tab.eigsforyou}
\end{table}
\label{sec.rll.eigenvectors}
\subsection{Transition probability matrices}
An important example of a non-negative matrix is
a \ind{transition probability matrix} $\bQ$.
\smallskip
\noindent {\sf Definition.}
A transition probability matrix $\bQ$ has columns that
are probability vectors, that is, it satisfies
$\bQ \geq 0$ and
\beq
\sum_i Q_{ij} = 1 \:\: \mbox{for all $j$}.
\eeq
%\subsubsection{Properties}
This property can be rewritten in terms of the all-ones vector
$\bn = (1,1,\dots,1)^{\T}$:
\beq
\bn^{\T} \bQ = \bn^{\T} .
\eeq
So $\bn$ is the principal left-eigenvector of $\bQ$ with eigenvalue
$\lambda_1 = 1$.
\beq
\eL^{(1)} = \bn .
\eeq
Because it is a non-negative matrix, $\bQ$
% know that a transition probability matrix
has a principal right-eigenvector
that is non-negative, $\eR^{(1)}$.
Generically, for Markov processes that are ergodic,
this eigenvector is the only right-eigenvector
with eigenvalue of magnitude 1 (see \tabref{tab.nonergodic} for illustrative exceptions).
% , but in special cases where the
This vector, if we normalize it such that
$\eR^{(1)} {\bf \cdot} \bn =1$, is called the invariant distribution
of the transition probability matrix.
It is the probability density that is left unchanged
under $\bQ$. Unlike the principal left-eigenvector,
which we explicitly identified above, we can't usually
identify the principal right-eigenvector without
computation.
% (To be rigorous I need to mention degenerate cases where
% there are multiple invariant distributions. These are chains
% that are not ergodic.)
%Give examples.
% If the $\bQ$ is `connected' (define) then it's ergodic
% and the invariant distribution is unique.)
The matrix may have up to $N-1$ other right-eigenvectors
all of which are orthogonal to the left-eigenvector
$\bn$, that is, they are zero-sum vectors.
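\par\smallskip
\noindent
As a small worked example (the symbol $q$ is introduced here purely for illustration), consider the two-state matrix
\beq
 \bQ = \left[ \begin{array}{cc} 0 & q \\ 1 & 1-q \end{array} \right] , \:\:\: 0 < q < 1 .
\eeq
Its columns sum to one, so $\bn^{\T} \bQ = \bn^{\T}$. Solving $\bQ \eR^{(1)} = \eR^{(1)}$ and normalizing so that $\eR^{(1)} {\bf \cdot} \bn = 1$ gives the invariant distribution
\beq
 \eR^{(1)} = \frac{1}{1+q} \left[ \begin{array}{c} q \\ 1 \end{array} \right] ,
\eeq
as can be checked directly: $\bQ \eR^{(1)} = (q, \, q + 1 - q)^{\T}/(1+q) = \eR^{(1)}$. The other eigenvalue is $-q$ (the trace of $\bQ$ is $1-q$), and its right-eigenvector is proportional to $(-1,1)^{\T}$, which is a zero-sum vector. For $q = 0.38$ the invariant distribution is roughly $(.28,.72)$.
\par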
% \subsubsection{Examples}
% \input{tex/hypercube.tex} % cut Mon 14/4/03
\begin{table}
\fullwidthfigureright{
{\small
\begin{tabular}{cc} \toprule
Matrix & Eigenvalues and eigenvectors $\eL,\eR$ \\ \midrule
%%%%
%%%% written by matrix2tex.p
%%%%
%%%% beginning of matrix
%%%%
$
%%%%
\left[
\begin{array}{@{\,\,}*{1}{c@{\,\,\,\,}}c@{\,\,}}
0 & .38 \\
1 & .62 \\
\end{array}
\right]
%%%%
%%%% DIVIDER between matrix and eigs
%%%%
$ & $
%%%%
%% sort order left eigs:
%10
%% sort order right eigs:
%10
\begin{array}{*{2}{c}}
%%%% left eigenvalue 0 (1)
1 & %%%% right eigenvalue 0 (1)
%%%% 1 & %%%% left eigenvalue 1 (0)
-0.38 \\
%%%% right eigenvalue 1 (0)
%%%% -0.38\\
%%%% left eigenvector 0(1)
\left[ \begin{array}{@{}r@{}}
.71 \\
.71
\end{array}
\right]
%%%% right eigenvector 0(1)
\left[ \begin{array}{@{}r@{}}
.36 \\
.93
\end{array}
\right]
& %%%% left eigenvector 1(0)
\left[ \begin{array}{@{}r@{}}
-.93 \\
.36
\end{array}
\right]
%%%% right eigenvector 1(0)
\left[ \begin{array}{@{}r@{}}
-.71 \\
.71
\end{array}
\right]
\\
\end{array}
%%%%
%%%% end of eigs
%%%%
$
\\ \midrule
%%%%
%%%% written by matrix2tex.p
%%%%
%%%% beginning of matrix
%%%%
$
%%%%
\left[
\begin{array}{@{\,}*{2}{c@{\,\,}}c@{\,}}
0 & .35 & 0 \\
0 & 0 & .46 \\
1 & .65 & .54 \\
\end{array}
\right]
%%%%
%%%% DIVIDER between matrix and eigs
%%%%
$ & $
%%%%
%% sort order left eigs:
%210
%% sort order right eigs:
%210
\begin{array}{*{3}{c}}
%%%% left eigenvalue 0 (2)
1 & %%%% right eigenvalue 0 (2)
%%%% 1 & %%%% left eigenvalue 1 (1)
-0.2\!-\! 0.3i & %%%% right eigenvalue 1 (1)
%%%% -0.2\!-\! 0.3i & %%%% left eigenvalue 2 (0)
-0.2\!+\! 0.3i \\
%%%% right eigenvalue 2 (0)
%%%% -0.2\!+\! 0.3i \\
%%%% left eigenvector 0(2)
\left[ \begin{array}{@{}r@{}}
.58 \\
.58 \\
.58
\end{array}
\right]
%%%% right eigenvector 0(2)
\left[ \begin{array}{@{}r@{}}
.14 \\
.41 \\
.90
\end{array}
\right]
& %%%% left eigenvector 1(1)
\left[ \begin{array}{@{}r@{}}
-.8\!+\! .1i \\
-.2\!-\! .5i \\
.2\!+\! .2i
\end{array}
\right]
%%%% right eigenvector 1(1)
\left[ \begin{array}{@{}r@{}}
.2\!-\! .5i \\
-.6\!+\! .2i \\
.4\!+\! .3i
\end{array}
\right]
& %%%% left eigenvector 2(0)
\left[ \begin{array}{@{}r@{}}
-.8\!-\! .1i \\
-.2\!+\! .5i \\
.2\!-\! .2i
\end{array}
\right]
%%%% right eigenvector 2(0)
\left[ \begin{array}{@{}r@{}}
.2\!+\! .5i \\
-.6\!-\! .2i \\
.4\!-\! .3i
\end{array}
\right]
\\
\end{array}
%%%%
%%%% end of eigs
%%%%
$
%%%%
%\midrule
%\input{eigen/qrl3e.tex}
\\[0.051in]
\bottomrule
\end{tabular}
}\\ \medskip
}
{
\caption[a]{
Transition probability matrices for generating random paths through
trellises.
}
}
\end{table}
\begin{table}
\fullwidthfigureright{
{\small
\begin{tabular}{ccc} \toprule & Matrix & Eigenvalues and eigenvectors $\eL,\eR$ \\ \midrule
(a)& \input{eigen/Ta.tex} \\[0.051in] \midrule
(a$'$)& \input{eigen/Tb.tex} \\[0.051in] \midrule
(b)& \input{eigen/Tc.tex} \\[0.051in] \midrule
%\input{eigen/Td.tex} \\[0.051in] \bottomrule
\end{tabular}
}\\
}{
\caption[a]{
Illustrative transition probability matrices and their eigenvectors
showing the two
ways of being non-ergodic. (a) More than one principal eigenvector
with eigenvalue 1 because the state space falls into
two unconnected pieces.
(a$'$) A small perturbation breaks the degeneracy
of the principal eigenvectors.
(b) Under this chain, the density may oscillate
between two parts of the state space. In addition to the
invariant distribution, there is another right-eigenvector with eigenvalue
$-1$. In general such circulating densities correspond to
complex eigenvalues with
magnitude 1.
% Corresponds to a circulating density, which
% doesn't relax to a single stationary distribution.
}
\label{tab.nonergodic}
}
\end{table}
%\section{Transforms}
%\section{Probability densities}
\section{Perturbation theory}
%
% perturbation theory for square matrices
%
Perturbation theory is not used in this book, but
it is a useful tool in the fields that the book covers.
In this section we derive first order perturbation theory for
the eigenvectors and eigenvalues of square, {\em not necessarily symmetric},
matrices. Most presentations of perturbation theory
focus on symmetric matrices, but non-symmetric matrices (such
as transition matrices) also deserve to be perturbed!
We assume that we have an $N \times N$ matrix $\bH$ that is a function $\bH(\epsilon)$ of
a real parameter $\epsilon$, with $\epsilon = 0$ being our starting point.
We assume that a Taylor expansion of $\bH(\epsilon)$ is appropriate:
\beq
\bH(\epsilon) = \bH(0) + \epsilon \bV + \cdots
\label{eq.pt.V}
\eeq
where
\beq
 \bV \equiv \left. \frac{ \partial \bH}{ \partial \epsilon} \right|_{\epsilon = 0} .
\eeq
We assume that for all $\epsilon$ of interest, $\bH(\epsilon)$ has
a complete set of $N$ right-eigenvectors and left-eigenvectors,
and that these eigenvectors and their eigenvalues are continuous
functions of $\epsilon$. This last assumption is not necessarily
a good one: if $\bH(0)$ has degenerate eigenvalues then it is
possible for the eigenvectors to be discontinuous in $\epsilon$;
in such cases, degenerate perturbation theory is needed. That's a
fun topic, but let's stick with the non-degenerate case here.
We write the eigenvectors and eigenvalues as follows:
\beq
\bH(\epsilon) \eRa(\epsilon) = \lambda^{(a)}(\epsilon) \eRa(\epsilon) ,
\label{eq.pt.defn}
\eeq
and we Taylor-expand
\beq
\lambda^{(a)}(\epsilon) = \lambda^{(a)}(0) + \epsilon \mu^{(a)} + \cdots
\eeq
with
\beq
\mu^{(a)} \equiv \frac{ \partial \lambda^{(a)}(\epsilon)}{ \partial \epsilon}
\label{eq.pt.mua}
\eeq
and
\beq
\eRa(\epsilon) = \eRa(0) + \epsilon \fRa + \cdots
\eeq
with
\beq
\fRa \equiv \frac{ \partial \eRa}{ \partial \epsilon} ,
\label{eq.pt.fRa}
\eeq
and similar definitions for $\eLa$ and $\fLa$. We define
these left-vectors to be row vectors, so that
the `transpose' operation is not needed and can be banished.
We are free to constrain the magnitudes of the eigenvectors in whatever
way we please. Each left-eigenvector and each right-eigenvector
has an arbitrary magnitude. The natural constraints to use are as follows. First,
we constrain the inner products with:
\beq
 \eLa(\epsilon) \eRa(\epsilon) = 1, \:\:\:\mbox{for all $a$} .
\label{eq.pt.norm}
\eeq
% This constraint does not constrain the magnitudes of $\{ \eRa(0) \}_{a=1}^{N}$, but
% if those vectors are chosen, then it constrains $\{ \eRa(\epsilon) \}_{a=1}^{N}$
% and $\{ \eLa(\epsilon) \}_{a=1}^{N}$.
Expanding the eigenvectors in $\epsilon$, \eqref{eq.pt.norm} implies
\beq
( \eLa(0) + \epsilon \fLa + \cdots )( \eRa(0) + \epsilon \fRa + \cdots)
= 1 ,
\eeq
from which we can extract the terms in $\epsilon$, which say:
\beq
\eLa(0) \fRa + \fLa \eRa(0)
 = 0 .
\eeq
We are now free to choose the two constraints:
\beq
\eLa(0) \fRa = 0 , \:\:\: \fLa \eRa(0) = 0 ,
\label{eq.pt.constraint}
\eeq
which in the special case of a symmetric matrix correspond to
constraining the eigenvectors to be of constant length, as defined by
the Euclidean norm.
% These constraints do not constrain the magnitudes of $\{ \eRa(0) \}_{a=1}^{N}$, but
% if those vectors are chosen, then they constrain $\{ \eRa(\epsilon) \}_{a=1}^{N}$
% and $\{ \eLa(\epsilon) \}_{a=1}^{N}$.
OK, now that we have defined our cast of characters, what do the defining
equations (\ref{eq.pt.defn}) and (\ref{eq.pt.V}) tell us about our Taylor expansions
(\ref{eq.pt.mua}) and (\ref{eq.pt.fRa})?
We expand \eqref{eq.pt.defn} in $\epsilon$.
\beq
(\bH(0) + \epsilon \bV + \cdots ) ( \eRa(0) + \epsilon \fRa + \cdots )
=
(\lambda^{(a)}(0) + \epsilon \mu^{(a)} + \cdots )
( \eRa(0) + \epsilon \fRa + \cdots ) .
\label{eq.pt.mess}
\eeq
Identifying the terms of order $\epsilon$, we have:
%\beq
% \bH(0) \epsilon \fRa + \epsilon \bV \eRa(0)
% =
% \lambda^{(a)}(0) \epsilon \fRa +
% \epsilon \mu^{(a)} \eRa(0)
%\eeq
\beq
% \Rightarrow
\bH(0) \fRa + \bV \eRa(0) = \lambda^{(a)}(0) \fRa + \mu^{(a)} \eRa(0) .
\eeq
We can extract interesting results from this equation by hitting it
with $\eLb(0)$:
%
% half-hearted formatting mod here:
\[%beq
\eLb\!(0) \bH(0) \fRa + \eLb\!(0) \bV \eRa(0) =
\eLb(0) \lambda^{(a)}(0) \fRa + \mu^{(a)}\eLb(0) \eRa(0) .
\]%eeq
\beq
\Rightarrow \
 \lambda^{(b)}(0) \eLb(0) \fRa + \eLb(0) \bV \eRa(0) =
\lambda^{(a)}(0) \eLb(0) \fRa + \mu^{(a)}\delta_{ab} .
\eeq
Setting $b=a$ we obtain
\beq
\eLa(0) \bV \eRa(0) =
\mu^{(a)} .
\label{eq.pt.lsoln}
\eeq
Alternatively, choosing $b \neq a$, we obtain:
\beq
\eLb(0) \bV \eRa(0) =
\left[
\lambda^{(a)}(0) - \lambda^{(b)}(0)
\right] \eLb(0) \fRa
\eeq
\beq
\Rightarrow
\eLb(0) \fRa = \frac{1}{\lambda^{(a)}(0) - \lambda^{(b)}(0) } \eLb(0) \bV \eRa(0) .
\label{eq.pt.wb0}
\eeq
Now, assuming that the right-eigenvectors
$\{ \eRb(0) \}_{b=1}^{N}$
form a complete basis, we must be able to write
\beq
\fRa = \sum_{b} w_{b} \eRb\!(0),
\eeq
where
\beq
w_b = \eLb\!(0) \fRa ,
\label{eq.pt.wb}
\eeq
so, comparing (\ref{eq.pt.wb0}) and (\ref{eq.pt.wb}), and noting that $w_a = 0$ by the constraints (\ref{eq.pt.constraint}), we have:
\beq
\fRa = \sum_{b \neq a} \frac{\eLb(0) \bV \eRa(0) }{\lambda^{(a)}(0) - \lambda^{(b)}(0) } \eRb(0) .
\label{eq.pt.esoln}
\eeq
Equations (\ref{eq.pt.lsoln}) and (\ref{eq.pt.esoln}) are the solution to the first-order
perturbation
theory problem, giving respectively the first derivatives
of the eigenvalues and of the eigenvectors.
\subsection{Second-order perturbation theory}
If we expand the eigenvector equation (\ref{eq.pt.defn}) to second order in $\epsilon$,
and assume that the equation
\beq
\bH(\epsilon) = \bH(0) + \epsilon \bV
\label{eq.pt.V2}
\eeq
is exact, that is, $\bH$ is a purely linear function of $\epsilon$, then we have:
\beqan
\lefteqn{ (\bH(0) + \epsilon \bV ) ( \eRa(0) + \epsilon \fRa + \half \epsilon^2 \gRa + \cdots )
} \nonumber \\
=& (\lambda^{(a)}(0) + \epsilon \mu^{(a)} + \half \epsilon^2 \nu^{(a)} + \cdots )
( \eRa(0) + \epsilon \fRa + \half \epsilon^2 \gRa + \cdots )
&
\label{eq.pt.mess2}
\eeqan
where $\gRa$ and $\nu^{(a)}$ are the second derivatives of the eigenvector
and eigenvalue.
Equating the second-order terms in $\epsilon$ in \eqref{eq.pt.mess2},
%\beq
% \epsilon^2 \bV \fRa + \half \epsilon^2 \bH(0) \gRa
% =
% \lambda^{(a)}(0) \half \epsilon^2 \gRa + \half \epsilon^2 \nu^{(a)} \eRa(0)
% + \epsilon^2 \fRa \mu^{(a)}
%\label{eq.pt.mess22}
%\eeq
\beq
% \Rightarrow
\bV \fRa + \half \bH(0) \gRa
=
\half \lambda^{(a)}(0) \gRa + \half \nu^{(a)} \eRa(0)
+ \mu^{(a)} \fRa .
\label{eq.pt.mess222}
\eeq
Hitting this equation on the left with $\eLa(0)$, we obtain:
\beqan
\lefteqn{ \eLa(0) \bV \fRa + \half \lambda^{(a)}(0) \eLa(0) \gRa }\nonumber
\\
=&
\half \lambda^{(a)}(0) \eLa(0) \gRa + \half \nu^{(a)} \eLa(0) \eRa(0)
+ \mu^{(a)} \eLa(0) \fRa . &
\label{eq.pt.mess2222}
\eeqan
The term $\eLa(0) \fRa $ is equal to zero because of our constraints (\ref{eq.pt.constraint}), so
\beq
\eLa(0) \bV \fRa
=
\half \nu^{(a)} ,
\label{eq.pt.2}
\eeq
so the second derivative of the eigenvalue with respect to $\epsilon$
is given by
\beqan
\half \nu^{(a)} &=& \eLa(0) \bV
\sum_{b \neq a} \frac{\eLb(0) \bV \eRa(0) }{\lambda^{(a)}(0) - \lambda^{(b)}(0) } \eRb(0)
\\
&=& \sum_{b \neq a} \frac{[\eLb(0) \bV \eRa(0)][ \eLa(0) \bV \eRb(0)] }
{\lambda^{(a)}(0) - \lambda^{(b)}(0) } .
\eeqan
This is as far as we will take the perturbation expansion.
\subsection*{Summary}
If we introduce the abbreviation
$V_{ba}$ for $\eLb(0) \bV \eRa(0)$,
we can write the eigenvectors of $\bH(\epsilon) = \bH(0) + \epsilon \bV$
to first order as
\beq
\eRa(\epsilon) = \eRa(0) + \epsilon
\sum_{b \neq a} \frac{V_{ba} }{\lambda^{(a)}(0) - \lambda^{(b)}(0) } \eRb(0)
+ \cdots
\label{eq.pt.esoln.sum}
\eeq
and the eigenvalues to second order as
\beq
\lambda^{(a)}(\epsilon) = \lambda^{(a)}(0) + \epsilon V_{aa}
+ \epsilon^2 \sum_{b \neq a} \frac{ V_{ba} V_{ab} }
{\lambda^{(a)}(0) - \lambda^{(b)}(0) }
+ \cdots .
\label{eq.pt.suml}
\eeq
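\par\smallskip
\noindent
As a sanity check of these formulae (an illustrative example; the particular matrices are chosen here for convenience), take the non-symmetric perturbation
\beq
 \bH(0) = \left[ \begin{array}{cc} 2 & 0 \\ 0 & 1 \end{array} \right] , \:\:\:
 \bV = \left[ \begin{array}{cc} 0 & 1 \\ 3 & 0 \end{array} \right] .
\eeq
The unperturbed eigenvalues are $\lambda^{(1)}(0) = 2$ and $\lambda^{(2)}(0) = 1$; the right-eigenvectors are the unit column vectors and the left-eigenvectors are the unit row vectors, so $V_{ba}$ is simply the element of $\bV$ in row $b$, column $a$: $V_{11} = V_{22} = 0$, $V_{12} = 1$, $V_{21} = 3$. Equation (\ref{eq.pt.suml}) then predicts
\beq
 \lambda^{(1)}(\epsilon) = 2 + 3 \epsilon^2 + \cdots , \:\:\:
 \lambda^{(2)}(\epsilon) = 1 - 3 \epsilon^2 + \cdots .
\eeq
The exact eigenvalues are the roots of $\lambda^2 - 3 \lambda + 2 - 3 \epsilon^2 = 0$, namely
\beq
 \lambda = \frac{3 \pm \sqrt{1 + 12 \epsilon^2}}{2} = \frac{3 \pm ( 1 + 6 \epsilon^2 + \cdots )}{2} ,
\eeq
in agreement with the prediction. Similarly, \eqref{eq.pt.esoln.sum} gives $\eR^{(1)}(\epsilon) = (1, 3\epsilon)^{\T} + \cdots$, and one can verify that $\bH(\epsilon) (1, 3\epsilon)^{\T} = (2 + 3 \epsilon^2, 6 \epsilon)^{\T}$ agrees with $\lambda^{(1)}(\epsilon) (1, 3 \epsilon)^{\T}$ up to terms of order $\epsilon^3$.
\par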
\newpage
\section{Some numbers}
% \thispagestyle{empty}
\label{ch.numbers}
\label{sec.numbers}
% a table of big and small numbers
\begin{tabular}{cccp{5.5in}} \toprule
&$2^{8192}$ & $10^{2466}$ & Number of distinct 1-kilobyte files \\
&$2^{1024}$ & $10^{308}$ & Number of states of a 2D Ising
model with $32\!\times\!32$ spins \\
$2^{1000}$ & & $10^{301}$ & Number of binary strings of length 1000 \\
$2^{500}$ & & $3 \!\times\! 10^{150}$ & \\ \midrule
& $2^{469}$ & $10^{141}$ & Number of binary strings of length
1000 having 100 {\tt 1}s
and 900 {\tt 0}s \\
& $2^{266}$ & $10^{80}$ & Number of electrons
% protons and neutrons
in universe \\
$2^{200}$ & & $1.6 \!\times\! 10^{60}$ & \\
& $2^{190}$ & $10^{57}$ & Number of electrons
% protons and neutrons
in solar system \\
& $2^{171}$ & $3 \!\times\! 10^{51}$ & Number of electrons
% protons and neutrons
in the earth \\
$2^{100}$ & & $10^{30}$ & \\ \midrule
& $2^{98}$ & $3\!\times\! 10^{29}$ & Age of universe/picoseconds \\ % pico = 10^{-12}
% \midrule \midrule \midrule
\midrule
& $2^{58}$ & $3\!\times\! 10^{17}$ & Age of universe/seconds \\
$2^{50}$ & & $10^{15}$ & \\ \midrule
$2^{40}$ & & $10^{12}$ & \\ \midrule
& & $10^{11}$ & Number of neurons in human brain \\ % octopus = 170 million
& & $10^{11}$ & Number of bits stored on a DVD \\ % (a bit more than 10G)
% 16,000 Mb
% wheat: 32,000 x 10^6 bits = 3 x 10^10
& & $3\!\times\!10^{10}$ & Number of bits in the wheat genome \\% 3e9 base pairs? 100 K genes
& & $6\!\times\!10^{9}$ & Number of bits in the human genome \\% 3e9 base pairs, 30 K genes
& $2^{32}$ & $6\!\times\! 10^{9}$ & Population of earth \\
$2^{30}$ & & $10^{9}$ & \\ \midrule
% 120 Mb {\it{Arabidopsis thaliana}} (flowering plant related to broccoli)
% i.e. 240 \times 10^6 bits, Nature Dec 14th 2000
 & & $2.5\!\times\!10^{8}$ & Number of fibres in the corpus callosum \\
& & $2\!\times\!10^{8}$ & Number of bits in {\em C. Elegans} ({a worm}) genome \\
& & $2\!\times\!10^{8}$ & Number of bits in {\it{Arabidopsis thaliana}} ({a flowering plant related to broccoli}) genome \\
& $2^{25}$ & $3\!\times\!10^{7}$ & One year/seconds \\
& & $2\!\times\!10^{7}$ & Number of bits in the compressed PostScript file that is this book \\
& & $2\!\times\!10^{7}$ & Number of bits in {\tt{unix}} kernel \\
% 2224608 bytes Apr 1 1998 /vmunix 17796864 bits sunos
& & $10^{7}$ & Number of bits in the {\em E.\ Coli} genome, or in a floppy disk \\
& & $4\!\times\!10^{6}$ & Number of years since human/chimpanzee divergence \\% was 6e6, changed to gene.tex
$2^{20}$ & & $10^{6}$ & $1\,048\,576$ \\ \midrule
& & $2\!\times\!10^{5}$ & Number of generations since human/chimpanzee divergence \\
% was 10^5 in 2000!
% brains...
% http://faculty.washington.edu/chudler/facts.html
& & $3 \times 10^{4}$ & Number of genes in human genome \\
% 26,000 genes
& & $3 \times 10^{4}$ & Number of genes in {\it{Arabidopsis thaliana}} genome \\
& & $1.5\!\times\!10^{3}$ & Number of base pairs in a gene \\% 1523 (nature jan 1999)
$2^{10}$ & $e^7$& $10^{3}$ & $2^{10} = 1024$; $e^7 = 1096$ \\ \midrule
$2^{0}$ & & $10^0$ & 1 \\ \midrule
&$2^{-2}$ & $2.5\!\times\!10^{-1}$ & Lifetime probability of dying from smoking one pack of
cigarettes per day. \\
& & $10^{-2}$ & Lifetime probability of dying in a motor vehicle accident \\
$2^{-10}$ & & $10^{-3}$ & \\ \midrule
% http://www.seattlecentral.org/qelp/sets/037/037.html
% 12 parts per billion (ppb) concentration of benzene in drinking water at the rate of two liters a day amounts to a lifetime risk of 1 in 100,000 of developing cancer
& & $10^{-5}$ & Lifetime probability of developing cancer because of drinking
2 litres per day of water containing 12$\,$p.p.b.\ benzene \\
$2^{-20}$ & & $10^{-6}$ & \\ \midrule
& & $3\!\times\!10^{-8}$ & Probability of error in transmission of coding DNA,
per nucleotide, per generation \\% (nature jan 1999)
% a.k.a. 1.3e-9 per nucleotide per year, generation time = 25 years.
$2^{-30}$ & & $10^{-9}$ & \\ \midrule
$2^{-60}$ & & $10^{-18}$ & Probability of undetected error in a hard disk drive, after error correction
% and detection
\\ \midrule
\end{tabular}
% also useful?
%
% one year in seconds: 3e7 = 2^{25}
% one year in picoseconds: 3e19 = 2^{65}
% one life in seconds: 2e9 = 2^{31}
% one life in picoseconds: 2e21 = 2^{71}
%
% mass of earth: 5.976 * 10**27 (g)
% = 5.976e27 / 1.6e-24 = 3.735e51 mass units
% mass of earth's crust :
% light-year 9.463e17 cm
% 1.989e30 kg -- Mass of Sun
% 384401 km -- Mean Earth-Moon distance
% age of Earth 1.6e17 s
% age of Universe 3.3e17 s
% area of continents 1.49e14 m2
% base pairs per average human chromosome 1.55e6
% so number of bits per human = 2 * 46 * 1.5e6 ~= 100,000000
% mass of Sun 2e30 kg
% mass of Universe 2e53 kg
% mass of proton 1.67e-27 kg
%
% protons in sun: 1.2e57
% protons in universe: 1.2e80
% number of 1 kilobyte files = 2**(1024 * 8)
% chars per page = 80*24 = 1920
% the Environmental Protection Agency (EPA) provides cancer risk figures based on a 70-kilogram (154-pound) person consuming a carcinogen for 70 years. Putting its figures together, the EPA calculates that the risk of drinking a 12 parts per billion (ppb) concentration of benzene in drinking water at the rate of two liters a day amounts to a lifetime risk of 1 in 100,000 of developing cancer.
% the death rate from all forms of cancer in the year 1990 was 202.1 per 100,000.
% prob of dying of cancer (whole life) is about 0.13.
% benzene in water scandal: it had 15 ppb instead of 12.
\dvipsb{appendices}
%
% this file contains a list of commands for the index
\fakesection{this file contains a list of commands for the index}
%
%
\index{channel!coding theorem|see{noisy-channel coding theorem}}
\index{Slepian--Wolf|see{dependent sources}}
\index{Laplace approximation|see{Laplace's method}}
\index{notation|see{conventions}}
\index{expectation--maximization algorithm|see{EM algorithm}}
\index{central-limit theorem|see{law of large numbers}}
\index{noise|see{channel}}
\index{normal|see{Gaussian}}
\index{source code!software|see{software}}
\index{DSC|see{difference-set cyclic code}}
\index{ICA|see{independent component analysis}}
\index{MDL|see{minimum description length}}
\index{MML|see{minimum description length}}
\index{MLP|see{multilayer perceptron}}
\index{MAP|see{maximum {\em a posteriori}}}
\index{maxent|see{maximum entropy}}
\index{conventions|see{notation}}
\index{t-distribution|see{Student-$t$}}% distribution}}
\index{average|see{expectation}}
\index{puzzle|see{game}}
\index{game|see{puzzle}}
\index{Kullback--Leibler divergence|see{relative entropy}}
%
% other ideas - ``paradoxes'' fallacies
%
\index{error correction|see{error-correcting code}}
\index{good|see{error-correcting code, good}}
\index{bad|see{error-correcting code, bad}}
\index{very good|see{error-correcting code, very good}}
\index{practical|see{error-correcting code, practical}}
\index{belief propagation|see{sum--product algorithm}}
\index{probability propagation|see{sum--product algorithm}}
\index{probability distributions|see{distribution}}
%
\index{rant|see{sermon}}
\index{caution|see{sermon}}
\index{sermon|see{caution}}
\index{data modelling|see{modelling}}
\index{hybrid Monte Carlo|see{Hamiltonian Monte Carlo}}
\index{stochastic dynamics|see{Hamiltonian Monte Carlo}}
\index{Monte Carlo methods!hybrid Monte Carlo|see{Hamiltonian Monte Carlo}}
\index{hypothesis testing|see{model comparison, sampling theory}}
\index{frequentist|see{sampling theory}}
\index{orthodox statistics|see{sampling theory}}
\index{finite field theory|see{Galois field}}
\index{GF($q$)|see{Galois field}}
%
%
% error correcting codes
\index{errors|see{channel}}
\index{BSC|see{channel, binary symmetric}}
\index{noisy channel|see{channel}}
\index{normalizing constant|see{partition function}}
\index{block code|see{source code or error-correcting code}}
% \index{maximum distance separable|see{error-correcting code, maximum distance separable}}
% \index{MDS|see{error-correcting code, maximum distance separable}}
\index{R$_3$|see{repetition code}}
% general cross-references
%
\index{minimum distance|see{distance of code}}
\index{error-correcting code!distance|see{distance of code}}
\index{Shannon|see{noisy-channel coding theorem, source coding theorem, information content}}
%\index{Shannon's noisy-channel coding theorem|see{noisy-channel coding theorem}}
%\index{Shannon's source coding theorem|see{source coding theorem}}
\index{Simpson, O.J.|see{wife-beaters}}
%
% warnings
\index{warning|see{caution}}
\index{caveat|see{caution}}
%
% Ambiguous words in info theory
%
\index{simulated annealing|see{annealing}}
% \index{perfect simulation|see{Monte Carlo methods, exact sampling}}
% \index{coupling from the past|see{Monte Carlo methods, exact sampling}}
\index{paramagnetic|see{Ising model}}
\index{importance sampling|see{Monte Carlo methods}}
\index{rejection sampling|see{Monte Carlo methods}}
\index{exact sampling|see{Monte Carlo methods}}
\index{slice sampling|see{Monte Carlo methods}}
% \index{Glauber dynamics|see{Monte Carlo methods, Gibbs sampling}}
% \index{heat bath|see{Monte Carlo methods, Gibbs sampling}}
\index{Gibbs sampling|see{Monte Carlo methods}}
%\index{Langevin method|see{Monte Carlo methods}}
\index{Metropolis method|see{Monte Carlo methods}}
\index{MCMC (Markov chain Monte Carlo)|see{Monte Carlo methods}}
\index{Markov chain Monte Carlo|see{Monte Carlo methods}}
%
\index{codeword|see{source code, symbol code, or error-correcting code}}
%
% Source codes
%
\index{source code!Huffman|see{Huffman code}}
% at the moment some topics are maintained with double entries
% eg
% \index{source code!stream codes|(}\index{stream codes|(}
% \ind{convolutional codes}} \index{error-correcting code!convolutional}
% \ind{Fading channels}\/} \index{channel!fading}
% min--sum and viterbi
%
% E C Codes
%
\index{code|see{error-correcting code, source code (for data compression), symbol code, arithmetic coding, linear code, random code or hash code}}
%\index{code!linear|see{error-correcting codes}}
% \index{error-correcting code!dodecahedron|see{dodecahedron code}}
% \index{Gallager code|see{error-correcting codes, low-density parity-check}}
% \index{error-correcting code!Gallager|see{error-correcting codes, low-density parity-check}}
\index{data compression|see{source code}}
\index{compression|see{source code}}
\index{code!dual|see{error-correcting code, dual}}
%
% Probabilistic models
%
\index{Markov model|see{Markov chain and hidden Markov model}}
% MCMC:
% see monte carlo methods
\index{moderation|see{marginalization}}
\index{marginal likelihood|see{evidence}}
\nocite{Ripley96}
\nocite{Bishop95}
\nocite{Shannon&Weaver}
\nocite{goldie91}
\nocite{Blahut}
\nocite{Cover&Thomas}
\nocite{NielsenChuang}
\nocite{durbin1998}
%
% BIBLIOGRAPHY
%
\small
\footnotesize
\clearpage
\bibliography{bibs}
\dvipsb{bibliography}
%\restoremargins
\clearpage
% \newpage
% INDEX
% to get alternative alignment (so chapter titles aligned), use this:
% \addcontentsline{toc}{chapter}{\protect \numberline{}Index}
\addcontentsline{toc}{chapter}{Index}%
\footnotesize\raggedright
\begin{theindex}
\item $\Gamma$, 598
\item $\Phi(z)$, 514
\item $\chi^2$, 40, 323, 458, 459
\item $\lambda $, 119
\item $\sigma _{\scriptscriptstyle N}$ and $\sigma _{\scriptscriptstyle N\mskip -\thinmuskip -\mskip -\thinmuskip 1}$,
320
\item :=, 600
\item {\tt{?}}, 418
\item 2s, 156
\indexspace
\item Abu-Mostafa, Yaser, 482
\item acceptance rate, 365, 367, 369, 380, 383, 394
\item acceptance ratio method, 379
\item accumulator, 254, 570, 582
\item activation, 471
\item activation function, 471
\item activity, 471
\item activity rule, 470, 471
\item adaptive direction sampling, 393
\item adaptive models, 101
\item adaptive rejection sampling, 370
\item address, 201, 468
\item Aiyer, Sree, 518
\item Alberto, 56
\item alchemists, 74
\item algorithm
\subitem covariant, 442
\subitem EM, 432
\subitem exact sampling, 413
\subitem expectation--maximization, 432
\subitem function minimization, 473
\subitem genetic, 395, 396
\subitem Hamiltonian Monte Carlo, 387, 496
\subitem independent component analysis, 443
\subitem Langevin Monte Carlo, 496
\subitem leapfrog, 389
\subitem max--product, 339
\subitem perfect simulation, 413
\subitem sum--product, 334
\subitem Viterbi, 340
\item Alice, 199
\item Allais paradox, 454
\item alphabetical ordering, 194
\item America, 354
\item American, 238, 260
\item amino acid, 201, 204, 279, 362
\item anagram, 200
\item Angel, J. R. P., 529
\item annealed importance sampling, 379
\item annealing, 379, 392, 397
\subitem deterministic, 518
\subitem importance sampling, 379
\item antiferromagnetic, 400
\item ape, 269
\item approximation
\subitem by Gaussian, 2, 301, 341, 350, 496
\subitem Laplace, 341, 547
\subitem of complex distribution, 185, 282, 364, 422, 433
\subitem of density evolution, 567
\subitem saddle-point, 341
\subitem Stirling, 1
\subitem variational, 422
\item Arabic, 127
\item architecture, 470, 529
\item arithmetic coding, 101, \bold{110}, 111
\subitem decoder, 118
\subitem software, 121
\subitem uses beyond compression, 118, 250, 255
\item arithmetic progression, 344
\item arms race, 278
\item artificial intelligence, 121, 129
\item associative memory, 468, 505, 507
\item assumptions, 26
\item astronomy, 551
\item asymptotic equipartition, 80, 384
\subitem why it is a misleading term, 83
\item Atlantic, 173
\item AutoClass, 306
\item automatic relevance determination, 544
\item automobile data reception, 594
\item average, 26, \see{expectation}{612}
\item AWGN, 177
\indexspace
\item background rate, 307
\item backpropagation, 473, 475, 528, 535
\item backward pass, 244
\item bad, \see{error-correcting code, bad}{612}
\item Balakrishnan, Sree, 518
\item balance, 66
\item Baldwin effect, 279
\item ban (unit), 264
\item Banburismus, 265
\item band-limited signal, 178
\item bandwidth, 178, 182
\item bar-code, 262, 399
\item base transitions, 373
\item base-pairing, 280
\item basis dependence, 306, 342
\item bat, 213, 214
\item battleships, 71
\item Bayes' theorem, 6, 24, 25, 27, 28, 48--50, 53, 148, 324, 344,
347, 446, 493, 522
\item Bayes, Rev.\ Thomas, 51
\item Bayesian, 26
\item Bayesian belief networks, 293
\item Bayesian inference, 457
\item BCH codes, 13
\item BCJR, 578
\item BCJR algorithm, 330
\item Belarusian, 238
\item belief, 57
\item belief propagation, 330, 557, \see{sum--product algorithm}{612}
\item Benford's law, 446
\item bent coin, 51
\item Berlekamp, Elwyn, 172, 213
\item Bernoulli distribution, 117
\item Berrou, C., 186
\item bet, 200, 209, 455
\item beta distribution, 316
\item beta function, 316
\item beta integral, 30
\item Bethe free energy, 434
\item Bhattacharyya parameter, 215
\item bias, 345, 506
\subitem in neural net, 471
\subitem in statistics, 306, 307, 321
\item biased, 321
\item biexponential distribution, 88, 313, 448
\item bifurcation, 89, 291
\item binary entropy function, 2, 15
\item binary erasure channel, \bold{148}, 151
\item binary images, 399
\item binary representations, 132
\item binary symmetric channel, 4, 148, \bold{148}, 149, 151, 211,
215, 229
\item binding DNA, 201
\item binomial distribution, 1, 311
\item bipartite graph, 19
\item birthday, 156, 157, 160, 198, 200
\item bit (unit), 264
\item bits back, 104, 108, 353
\item bivariate Gaussian, 388
\item black, 355
\item Bletchley Park, 265
\item Blind Watchmaker, 269, 396
\item block code, 9, \see{source code or error-correcting code}{612}
\item block-sorting, 121
\item blow up, 306
\item blur, 549
\item Bob, 199
\item Boltzmann entropy, 85
\item Boltzmann machine, 522
\item bombes, 265
\item book ISBN, 235
\item Bottou, Leon, 121
\item bound, 85
\item bounded-distance decoder, 207, 212
\item bounding chain, 419
\item box, 343, 351
\item boyish matters, 58
\item brain, 468
\item Braunstein, A., 340
\item Bridge, 126
\item British, 260
\item broadcast channel, 237, 239, 594
\item Brody, Carlos, 246
\item Brownian motion, 280, 316, 535
\item BSC, \see{channel, binary symmetric}{612}
\item budget, 94, 96
\item Buffon's needle, 38
\item {\tt{BUGS}}, 371, 431
\item burglar alarm and earthquake, 293
\item Burrows--Wheeler transform, 121
\item burst errors, 185, 186
\item bus-stop paradox, 39, 46, 107
\indexspace
\item cable labelling, 175
\item calculator, 320
\item camera, 549
\item canonical, 88
\item capacity, 14, \bold{146}, \bold{150}, 151, 183, 484
\subitem channel with synchronization errors, 187
\subitem constrained channel, 251
\subitem Gaussian channel, 182
\subitem Hopfield network, 514
\subitem neural network, 483
\subitem neuron, 483
\subitem symmetry argument, 151
\item car data reception, 594
\item card, 233
\item casting out nines, 198
\item Cauchy distribution, 85, 88, 313, 362
\item caution, \see{sermon}{612}
\subitem equipartition, 83
\subitem Gaussian distribution, 312
\subitem importance sampling, 362, 382
\subitem sampling theory, 64
\item cave, 214
\item caveat, \see{caution}{612}
\item cellphone, \see{mobile phone}{186}
\item cellular automaton, 130
\item central-limit theorem, 36, 41, 88, 131,
\see{law of large numbers}{612}
\item centre of gravity, 35
\item chain rule, 528
\item challenges, 246
\item channel
\subitem AWGN, 177
\subitem binary erasure, \bold{148}, 151
\subitem binary symmetric, 4, 146, 148, \bold{148}, 149, 151, 206,
211, 215, 229
\subitem broadcast, 237, 239, 594
\subitem bursty, 185, 557
\subitem capacity, 14, 146, 150, 250
\subsubitem connection with physics, 257
\subitem coding theorem, \see{noisy-channel coding theorem}{612}
\subitem complex, 184, 557
\subitem constrained, 248, 255, 256
\subitem continuous, 178
\subitem discrete memoryless, 147
\subitem erasure, 188, 219, 589
\subitem extended, 153
\subitem fading, 186
\subitem Gaussian, 155, 177, 186
\subitem input ensemble, 150
\subitem multiple access, 237
\subitem multiterminal, 239
\subitem noiseless, 248
\subitem noisy, 3, 146
\subitem noisy typewriter, \bold{148}, 152
\subitem symmetric, 171
\subitem two-dimensional, 262
\subitem unknown noise level, 238
\subitem variable symbol durations, 256
\subitem with dependent sources, 236
\subitem with memory, 557
\subitem Z channel, \bold{148}, 149, 150, 172
\item cheat, 200
\item Chebyshev inequality, 81, 85
\item checkerboard, 404
\item Chernoff bound, 85
\item chess board, 520
\item chi-squared, 27, 40, 323, 458
\item Cholesky decomposition, 552
\item chromatic aberration, 552
\item cinema, 187
\item circle, 316
\item classical statistics, 64
\subitem criticisms, 32, 50, 457
\item classifier, 532
\item Claude Shannon, 3
\item Clockville, 39
\item clustering, 284, \bold{284}, 303
\item coalescence, 413
\item cocked hat, 307
\item code,
\see{error-correcting code, source code (for data compression), symbol code, arithmetic coding, linear code, random code or hash code}{612}
\subitem dual, \see{error-correcting code, dual}{612}
\subitem for constrained channel, 249
\subsubitem variable-length, 249, 255
\item code-equivalent, 576
\item codebreakers, 265
\item codeword,
\see{source code, symbol code, or error-correcting code}{612}
\item coding theory, 4, 205, 215
\item coin, 38, 63
\item coincidence, 267, 343, 351
\item collective, 403
\item collision, 200
\item coloured noise, 179
\item combination, 2, 490, 598
\item commander, 241
\item communication, v, 3, 16, 138, 146, 156, 162, 167, 178, 182, 186,
192, 205, 210, 215, 394, 556, 562, 596
\subitem broadcast, 237
\subitem of dependent information, 236
\subitem over noiseless channels, 248
\subitem perspective on learning, 483, 512
\item competitive learning, 285
\item complexity, 531, 548
\item complexity control, 289, 346, 347, 349
\item {\tt{compress}}, 119
\item compression, \see{source code}{612}
\subitem future methods, 129
\subitem lossless, 74
\subitem lossy, 74, 284, 285
\subitem of already-compressed files, 74
\subitem of {\em any\/} file, 74
\subitem universal, 121
\item computer, 370
\item concatenation, 185, 214, 220
\subitem error-correcting codes, 16, 21, 184, 185
\subitem in compression, 92
\subitem in Markov chains, 373
\item concave$\,\frown $, 35
\item conditional entropy, 138, 146
\item cones, 554
\item confidence interval, 457, 464
\item confidence level, 464
\item confused gameshow host, 57
\item conjugate gradient, 479
\item conjugate prior, 319
\item conjuror, 233
\item connection between
\subitem channel capacity and physics, 257
\subitem error-correcting code and latent variable model, 437
\subitem pattern recognition and error-correction, 481
\subitem supervised and unsupervised learning, 515
\subitem vector quantization and error-correction, 285
\item connection matrix, 253, 257
\item constrained channel, 248, 257, 260, 399
\item constraint satisfaction, 516
\item content-addressable memory, 192, 193, 469, 505
\item continuous channel, 178
\item control treatment, 458
\item conventions, \see{notation}{612}
\subitem error function, 156
\subitem logarithms, 2
\subitem matrices, 147
\subitem vectors, 147
\item convex hull, 102
\item convex$\,\smile $, 35
\item convexity, 370
\item convolution, 568
\item convolutional code, 184, 186
\item Conway, John H., 86, 520
\item Copernicus, 346
\item correlated sources, 237
\item correlations, 505
\subitem among errors, 557
\subitem and phase transitions, 602
\subitem high-order, 524
\subitem in images, 549
\item cost function, 180
\item cost of males, 277
\item counting, 241
\item counting argument, 21, 222
\item coupling from the past, 413
\item covariance, 440
\item covariance function, 535
\item covariance matrix, 176
\item covariant algorithm, 442
\item Cover, Thomas, 456, 482
\item Cox axioms, 26
\item crib, 265, 268
\item critical fluctuations, 403
\item critical path, 246
\item cross-validation, 353, 531
\item crossover, 396
\item crossword, 260
\item cryptanalysis, 578
\item cryptography, 200
\subitem digital signatures, 199
\subitem tamper detection, 199
\item cumulative probability function, 156
\item cycles in graphs, 242
\item cyclic, 19
\indexspace
\item Dasher, 119
\item data compression, 73, \see{source code}{612}
\item data entry, 118
\item data modelling, \see{modelling}{612}
\item data set, 288
\item Davey, Matthew C., 569
\item death penalty, 354, 355
\item deciban (unit), 264
\item decibel, 186
\item decibels, 178
\item decision theory, 346, \bold{451}
\item decoder, 4, 146, 152
\subitem bitwise, 220, 324
\subitem bounded-distance, 207
\subitem codeword, 220, 324
\subitem probability of error, 221
\item degree, 568
\item degree sequence, \see{profile}{569}
\item degrees of belief, 26
\item degrees of freedom, 322, 459
\item d\'ej\`a vu, 121
\item delay line, 575
\item Delbr\"uck, Max, 446
\item deletions, 187
\item delta function, 438, 600
\item density evolution, 566, 567, 592
\item density modelling, 284, 303
\item dependent sources, 237
\item depth of lake, 359
\item design theory, 209
\item detailed balance, 391
\item detection of forgery, 199
\item deterministic annealing, 518
\item dictionary, 72, 119
\item difference-set cyclic code, 569
\item differentiator, 254
\item diffusion, 316
\item digital cinema, 187
\item digital fountain, 590
\item digital signature, 199, 200
\item digital video broadcast, 593
\item dimensions, 180
\item dimer, 204
\item directory, 193
\item Dirichlet distribution, 316
\item Dirichlet model, 117
\item discriminant function, 179
\item discriminative training, 552
\item disease, 25, 458
\item disk drive, 3, 188, 215, 248, 255
\item distance, 205
\subitem $D_{\rm KL}$, 34
\subitem bad, 207, 214
\subitem distance distribution, 206
\subitem entropy distance, 140
\subitem Gilbert--Varshamov, 212, 221
\subitem good, 207
\subitem Hamming, 206
\subitem isn't everything, 215
\subitem of code, 206, 214, 220
\subsubitem good/bad, 207
\subitem of code, and error probability, 221
\subitem of concatenated code, 214
\subitem of product code, 214
\subitem relative entropy, 34
\subitem very bad, 207
\item distribution
\subitem beta, 316
\subitem biexponential, 313
\subitem binomial, 311
\subitem Cauchy, 88, 312
\subitem Dirichlet, 316
\subitem exponential, 311, 313
\subitem gamma, 313
\subitem Gaussian, 312
\subsubitem sample from, 312
\subitem inverse-cosh, 313
\subitem log-normal, 315
\subitem Luria--Delbr\"uck, 446
\subitem normal, 312
\subitem over periodic variables, 315
\subitem Poisson, 175, 311, 315
\subitem Student-$t$, 312
\subitem useful, 311
\subitem Von Mises, 315
\item divergence, 34
\item DjVu, 121
\item DNA, 3, 55, 201, 204, 257, 421
\subitem replication, 279, 280
\item do the right thing, 451
\item dodecahedron code, 20, 206, 207
\item dongle, 558
\item doors, on game show, 57
\item Dr.\ Bloggs, 462
\item draw straws, 233
\item dream, 524
\item DSC, \see{difference-set cyclic code}{612}
\item dual, 216
\item dumb Metropolis, 394, 496
\indexspace
\item $E_{\rm b}/N_0$, 177, 178, 223
\item earthquake and burglar alarm, 293
\item earthquake, during game show, 57
\item Ebert, Todd, 222
\item edge, 251
\item eigenvalue, 409
\item Elias, Peter, 111, 135
\item EM algorithm, 283, 432
\item email, 201
\item empty string, 119
\item encoder, 4
\item energy, 291, 401, 601
\item English, 72, 110, 260
\item Enigma, 265, 268
\item ensemble, \bold{67}
\subitem extended, 76
\item ensemble learning, 429
\item entropic distribution, 318, 551
\item entropy, 67, 601
\subitem Boltzmann, 85
\subitem conditional, 138
\subitem Gibbs, 85
\subitem joint, 138
\subitem marginal, 139
\subitem mutual information, 139
\subitem of continuous variable, 180
\subitem relative, 34
\item entropy distance, 140
\item epicycles, 346
\item equipartition, 80
\item erasure channel, 219, 589
\item erasure-correction, 188, 190, 220
\item erf, 156
\item ergodic, 120, 373
\item error bars, 301, 501
\item error correction, \see{error-correcting code}{612}
\subitem in DNA replication, 280
\subitem in protein synthesis, 280
\item error detection, 198, 199, 203
\item error floor, 581
\item error function, 156, 473, 490, 514, 529, 599
\item error probability
\subitem block, 152
\subitem in compression, 74
\item error-correcting code, 188, 203
\subitem bad, 183, 207
\subitem block code, 9, \bold{151}, 183
\subitem concatenated, 184--186, 214
\subitem convolutional, 184
\subitem cyclic, 19
\subitem decoding, 184
\subitem density evolution, 566
\subitem difference-set cyclic, 569
\subitem distance, \see{distance of code}{612}
\subitem dodecahedron, 20, 206, 207
\subitem dual, 216, 218
\subitem erasure channel, 589
\subitem Gallager,
\see{error-correcting code, low-density parity-check}{612}
\subitem Golay, 209
\subitem good, 183, 184, 207, 214, 218
\subitem Hamming, 19, 214
\subitem in DNA replication, 280
\subitem in protein synthesis, 280
\subitem interleaving, 186
\subitem linear, \bold{9}, 171, 183, 184, 229
\subsubitem noisy-channel coding theorem, 229
\subitem low-density generator-matrix, 218, 590
\subitem low-density parity-check, 20, 187, 218, 557, 596
\subsubitem fast encoding, 569
\subsubitem profile, 569
\subitem LT code, 590
\subitem maximum distance separable, 220
\subitem nonlinear, 187
\subitem P$_3$, 218
\subitem parity-check code, 220
\subitem pentagonful, 221
\subitem perfect, 208, 211, 212
\subitem practical, 183, 187
\subitem product code, 184, 214
\subitem quantum, 572
\subitem random, 184
\subitem random linear, 211, 212
\subitem rate, 152, 229
\subitem rateless, 590
\subitem rectangular, 184
\subitem Reed--Solomon code, 571, 589
\subitem repeat--accumulate, \bold{582}
\subitem repetition, 183
\subitem simple parity, 218
\subitem sparse graph, 556
\subsubitem density evolution, 566
\subitem syndrome decoding, 371
\subitem variable rate, 238, 590
\subitem very bad, 207
\subitem very good, 183
\subitem weight enumerator, 206
\subitem with varying level of protection, 239
\item error-reject curves, 533
\item errors, \see{channel}{612}
\item estimate, 459
\item estimator, 48, 307, 320, 446
\item eugenics, 273
\item euro, 63
\item evidence, 29, 53, 298, 322, 347, 531
\subitem typical behaviour of, 54, 60
\item evolution, 269, 279
\subitem as learning, 277
\subitem Baldwin effect, 279
\subitem colour vision, 554
\subitem of the genetic code, 279
\item evolutionary computing, 394, 395
\item exact sampling, 413, \see{Monte Carlo methods}{612}
\item exchange rate, 601
\item exchangeability, 263
\item exclusive or, 590
\item EXIT chart, 567
\item expectation, 27, 35, 37
\item expectation propagation, 340
\item expectation--maximization algorithm, 432,
\see{EM algorithm}{612}
\item experimental design, 463
\item experimental skill, 309
\item explaining away, 293, 295
\item exploit, 453
\item explore, 453
\item exponential distribution, 45, 313
\subitem on integers, 311
\item exponential-family, 307, 308
\item expurgation, 167, 171
\item extended channel, 153, 159
\item extended code, \bold{92}
\item extended ensemble, 76
\item extra bit, \bold{98}, 101
\item extreme value, 446
\item eye movements, 554
\indexspace
\item factor analysis, 437, 444
\item factor graph, 334--336, 434, 556, 557, 580, 583
\item factorial, 2
\item fading channel, 186
\item feedback, 506
\item female, 277
\item ferromagnetic, 400
\item Feynman, Richard, 422
\item Fibonacci, 253
\item field, 605
\item file storage, 188
\item finger, 119
\item finite field theory, \see{Galois field}{612}
\item fitness, 269, 279
\item fixed point, 508
\item Florida, 355
\item fluctuation analysis, 446
\item fluctuations, 401, 404
\item focus, 529
\item football pools, 209
\item forensic, 421
\item forgery, 199, 200
\item forward pass, 244
\item forward probability, 27
\item forward--backward algorithm, 326, 330
\item Fotherington--Thomas, 241
\item Fourier transform, 88, 219, 339, 544, 568
\item fovea, 554
\item free energy, 257, 407, 409, 410
\subitem minimization, 423
\subitem variational, 423
\item frequency, 26
\item frequentist, 320, \see{sampling theory}{612}
\item Frey, Brendan J., 353
\item Frobenius--Perron theorem, 410
\item frustration, 406
\item full probabilistic model, 156
\item function minimization, 473
\item functions, 246
\indexspace
\item gain, 507
\item Galileo code, 186
\item Gallager code, 557,
\see{error-correcting code, low-density parity-check}{612}
\item Gallager, Robert, 170, 172, 187
\item Galois field, 185, 224, 567, 568, 605
\item game, \see{puzzle}{612}
\subitem Bridge, 126
\subitem guess that tune, 204
\subitem guessing, 110
\subitem life, 520
\subitem {\tt{sixty-three}}, 70
\subitem {\tt{submarine}}, 71
\subitem three doors, 57, 60, 454
\subitem twenty questions, 70
\item game show, 57, 454
\item game-playing, 451
\item gamma distribution, 313, 319
\item gamma function, 598
\item ganglion cells, 491
\item Gaussian channel, 155, \bold{177}
\item Gaussian distribution, 2, 36, \bold{176}, 312, 321, 398, 549
\subitem $N$--dimensional, 124
\subitem approximation, 501
\subitem parameters, 319
\subitem sample from, 312
\item Gaussian processes, 535
\subitem variational Gaussian process classifier, 547
\item general position, 484
\item generalization, 483
\item generalized parity-check matrix, 581
\item generating function, 88
\item generative model, 27, 156
\item generator matrix, 9, 183
\item genes, 201
\item genetic algorithm, 395, 396
\item genetic code, 279
\item genome, 201, 280
\item geometric progression, 258
\item George, E.I., 393
\item geostatistics, 536, 548
\item GF($q$), \see{Galois field}{612}
\item Gibbs entropy, 85
\item Gibbs sampling, \bold{370}, 391, 418,
\see{Monte Carlo methods}{612}
\item Gibbs' inequality, 34, 37, 44
\item Gilbert--Varshamov conjecture, 212
\item Gilbert--Varshamov distance, 212, 221
\item Gilbert--Varshamov rate, 212
\item Gilks, W.R., 393
\item girlie stuff, 58
\item Glauber dynamics, 370
\item Glavieux, A., 186
\item Golay code, 209
\item golden ratio, 253
\item good, \see{error-correcting code, good}{612}
\item Good, Jack, 265
\item gradient descent, 476, 479, 498, 529
\subitem natural, 443
\item graduated non-convexity, 518
\item Graham, Ronald L., 175
\item grain size, 180
\item graph, 251
\subitem factor graph, 334
\subitem of code, 19, 20, 556
\item graphs and cycles, 242
\item guerilla, 242
\item guessing game, 110, 111, 115
\item {\tt{gzip}}, 119
\indexspace
\item Haldane, J.B.S., 278
\item Hamilton, William D., 278
\item Hamiltonian Monte Carlo, \bold{387}, 397, 496, \bold{496}, 497
\item Hamming code, 8, 9, 12, 13, 17--19, 183, 184, 190, 208, 209,
214, 219
\subitem graph, 19
\item Hamming distance, 206
\item handwritten digits, 156
\item hard drive, 593
\item hash code, 193, 231
\item hash function, 195, 200, 228
\subitem linear, 231
\subitem one-way, 200
\item hat puzzle, 222
\item heat bath, 370, 601
\item heat capacity, 401, 404
\item Hebb, Donald, 505
\item Hebbian learning, 505, 507
\item Hertz, 178
\item Hessian, 501
\item hidden Markov models, 437
\item hidden neurons, 525
\item hierarchical clustering, 284
\item hierarchical model, 379, 548
\item high dimensions, life in, 37, 124
\item hint for computing mutual information, 149
\item Hinton, Geoffrey E., 353, 429, 432, 522
\item hitchhiker, 280
\item homogeneous, 544
\item Hooke, Robert, 200
\item Hopfield network, 283, \bold{505}, 506, 517
\subitem capacity, 514
\item Hopfield, John J., 246, 280, 517
\item hot-spot, 275
\item Huffman code, 91, \bold{99}, 103
\subitem `optimality', 99, 101
\subitem disadvantages, 100, 115
\subitem general alphabet, 104, 107
\item human, 269
\item human--machine interfaces, 119, 127
\item hybrid Monte Carlo, 387, \see{Hamiltonian Monte Carlo}{612}
\item hydrogen bond, 280
\item hyperparameter, 64, 318, 319, 379, 479
\item hypersphere, 42
\item hypothesis testing,
\see{model comparison, sampling theory}{612}
\indexspace
\item i.i.d., 80
\item ICA, \see{independent component analysis}{612}
\item ICF (intrinsic correlation function), 551
\item identical twin, 111
\item identity matrix, 600
\item ignorance, 446
\item ill-posed problem, 309, 310
\item image, 549
\subitem integral, 246
\item image analysis, 343, 351
\item image compression, 74, 284
\item image models, 399
\item image processing, 246
\item image reconstruction, 551
\item implicit assumptions, 186
\item implicit probabilities, \bold{97}, 98, 102
\item importance sampling, \bold{361}, 379,
\see{Monte Carlo methods}{612}
\item improper, 314, 316, 319, 320, 342
\item improper prior, 353
\item in-car navigation, 594
\item independence, 138
\item independent component analysis, 313, \bold{437}, 443
\item indicator function, 600
\item inequality, 35, 81
\item inference, 27, 529
\subitem and learning, 493
\item inference problems
\subitem bent coin, 51
\item information, 66
\item information content, 72, 73, 91, 97, 115, 349
\subitem how to measure, 67
\subitem Shannon, 67
\item information maximization, 443
\item information retrieval, 193
\item information theory, 4
\item inner code, 184
\item Inquisition, 346
\item insertions, 187
\item instantaneous, 92
\item integral image, 246
\item interleaving, 184, 186, 579
\item internet, 188, 589
\item intersection, 66, 222
\item intrinsic correlation function, 549, 551
\item invariance, 445
\item invariant distribution, 372
\item inverse probability, 27
\item inverse-arithmetic-coder, 118
\item inverse-cosh distribution, 313
\item inverse-gamma distribution, 314
\item inversion of hash function, 199
\item investment portfolio, 455
\item irregular, 568
\item ISBN, 235
\item Ising model, 130, 283, 399, 400
\item iterative probabilistic decoding, 557
\indexspace
\item Jaakkola, Tommi S., 433, 547
\item Jacobian, 320
\item Jeffreys prior, 316
\item Jensen's inequality, 35, 44
\item Jet Propulsion Laboratory, 186
\item Johnson noise, 177
\item joint ensemble, 138
\item joint entropy, 138
\item joint typicality, 162
\item joint typicality theorem, 163
\item Jordan, Michael I., 433, 547
\item journal publication policy, 463
\item judge, 55
\item juggling, 15
\item junction tree algorithm, 340
\item jury, 26, 55
\indexspace
\item K-means clustering, \bold{285}, \bold{303}
\subitem derivation, 303
\subitem soft, 289
\item kaboom, 306, 433
\item Kalman filter, 535
\item kernel, 548
\item key points
\subitem communication, 596
\subitem how much data needed, 53
\subitem how to solve probability problems, 61
\subitem likelihood principle, 32
\subitem model comparison, 53
\subitem Monte Carlo, 358, 367
\item keyboard, 119
\item Kikuchi free energy, 434
\item KL distance, 34
\item Knowlton--Graham partitions, 175
\item Knuth, Donald, xii
\item Kolmogorov, Andrei Nikolaevich, 548
\item Kraft inequality, \bold{94}, 521
\item Kraft, L.G., 95
\item kriging, 536
\item Kullback--Leibler divergence, 34, \see{relative entropy}{612}
\indexspace
\item Lagrange multiplier, 174
\item lake, 359
\item Langevin method, 498
\item Langevin process, 535
\item language model, 119
\item Laplace approximation, \see{Laplace's method}{612}
\item Laplace model, 117
\item Laplace prior, 316
\item Laplace's method, 341, 354, 496, 501, 537, 547
\item Laplace's rule, 52
\item latent variable, 437
\item latent variable model, 283
\subitem compression, 353
\item law of large numbers, 36, 81, 82, 85
\item lawyer, 55, 58, 61
\item Le Cun, Yann, 121
\item leaf, 336
\item leapfrog algorithm, 389
\item learning, 471
\subitem as communication, 483
\subitem as inference, 492, 493
\subitem Hebbian, 505, 507
\subitem in evolution, 277
\item learning algorithms, 468
\subitem backpropagation, 528
\subitem Boltzmann machine, 522
\subitem classification, 475
\subitem competitive learning, 285
\subitem Hopfield network, 505
\subitem K-means clustering, \bold{286}, \bold{289}, 303
\subitem multilayer perceptron, 528
\subitem single neuron, 475
\item learning rule, 470
\item Lempel--Ziv coding, 110, 119--122
\subitem criticisms, 128
\item life, 520
\item life in high dimensions, 37, 124
\item likelihood, 6, 28, 49, 152, 324, 529, 558
\subitem contrasted with probability, 28
\subitem subjectivity, 30
\item likelihood equivalence, 447
\item likelihood principle, 32, 61, 464
\item limit cycle, 508
\item linear block code, 9, 11, 19, 171, 183, 186
\subitem decoding, 184
\subitem noisy-channel coding theorem, 229
\item linear feedback shift register, 184
\item linear regression, 342, 527
\item Litsyn, Simon, 572
\item little 'n' large data set, 288
\item log-normal, 315
\item logarithms, 2
\item logit, 316
\item long thin strip, 409
\item loopy, 340, 556
\item loopy belief propagation, 434
\item loopy message-passing, 338
\item lossy compression, 168, 284, 285
\item low-density generator-matrix code, 207, 590
\item low-density parity-check code, 556, \bold{557}
\subitem staircase, 569
\item LT code, 590
\item Luby, Michael G., 568, 590
\item Luria, Salvador, 446
\item Lyapunov function, 287, 291, \bold{508}, 520, 521
\indexspace
\item machine learning, 246
\item macho, 319
\item MacKay, David, 187, 496
\item magician, 233
\item magnetic recording, 593
\item majority vote, 5
\item male, 277
\item Mandelbrot, Benoit, 262
\item MAP, \see{maximum {\em a posteriori}}{612}
\item MAP decoding, 325
\item mapping, 92
\item marginal entropy, 139
\item marginal likelihood, 29, 298, 322, \see{evidence}{612}
\item marginalization, 29, 295, 319
\item Markov chain, 141, 168
\item Markov chain Monte Carlo, \see{Monte Carlo methods}{612}
\item Markov model, 111,
\see{Markov chain and hidden Markov model}{612}
\item marriage, 454
\item matrix, 409
\item matrix identities, 438
\item max--product, 339
\item maxent, 308, \see{maximum entropy}{612}
\item maximum distance separable, 219
\item maximum entropy, 308, 551
\item maximum likelihood, 6, 300, 347
\item maximum likelihood decoder, 152
\item maximum {\em a posteriori\/} decoder, 325
\item maximum {\em a posteriori}, 6, 307, 325, 538
\item MCMC (Markov chain Monte Carlo), \see{Monte Carlo methods}{612}
\item McMillan, B., 95
\item MD5, 200
\item MDL, \see{minimum description length}{612}
\item MDS, 220
\item mean, 1
\item mean field theory, 422, 425
\item melody, 201, 203
\item memory, 468
\subitem address-based, 468
\subitem associative, 468, 505
\subitem content-addressable, 192, 469
\item MemSys, 551
\item message passing, 187, 241, 248, 283, 324, 407, 591
\subitem BCJR, 330
\subitem belief propagation, 330
\subitem forward--backward, 330
\subitem in graphs with cycles, 338
\subitem loopy, 338
\subitem sum--product algorithm, 336
\subitem Viterbi, 329
\item metacode, 104, 108
\item metric, 512
\item Metropolis method, 496, \see{Monte Carlo methods}{612}
\item M\'ezard, Marc, 340
\item micro-saccades, 554
\item microsoftus, 458
\item microwave oven, 127
\item min--sum algorithm, 245, 325, 329, 578, 581
\item mine (hole in ground), 451
\item minimax, 455
\item minimization, 473
\item minimum description length, 352, \bold{352}
\item minimum distance, 206, 214, \see{distance of code}{612}
\item Minka, Thomas, 340
\item mirror, 529
\item Mitzenmacher, Michael, 568
\item mixing coefficients, 298, 312
\item mixture
\subitem in Markov chains, 373
\item mixture distribution, 373
\item mixture modelling, 282, 284, 303, 437
\item mixture of Gaussians, 312
\item MLP, \see{multilayer perceptron}{612}
\item MML, \see{minimum description length}{612}
\item mobile phone, 182, 186
\item model
\subitem latent variable, 437
\item model comparison, 198, 346, 347, 349
\subitem typical behaviour of evidence, 60
\subitem typical evidence, 54
\item modelling, 285
\subitem density modelling, 284, 303
\item models of images, 524
\item moderation, 29, 498, \see{marginalization}{612}
\item molecules, 201
\item Molesworth, 241
\item momentum, 387, 479
\item Monte Carlo methods, 357, 498
\subitem acceptance rate, 394
\subitem acceptance ratio method, 379
\subitem annealed importance sampling, 379
\subitem coalescence, 413
\subitem dependence on dimensionality, 358
\subitem exact sampling, 413
\subitem for visualization, 551
\subitem Gibbs sampling, \bold{370}, 391, 418
\subitem Hamiltonian Monte Carlo, 387, 496
\subitem hybrid Monte Carlo, \see{Hamiltonian Monte Carlo}{612}
\subitem importance sampling, \bold{361}, 379
\subsubitem weakness of, 382
\subitem information communication in, 394
\subitem Langevin method, 498
\subitem Markov chain Monte Carlo, \bold{365}, 366
\subitem Metropolis method
\subsubitem dumb Metropolis, 394, 496
\subitem Metropolis--Hastings, \bold{365}
\subitem multi-state, 392, 395, 398
\subitem overrelaxation
\subsubitem ordered, 391
\subitem perfect simulation, 413
\subitem random walk suppression, 370
\subitem random-walk Metropolis, 388
\subitem rejection sampling, \bold{364}
\subsubitem adaptive, 370
\subitem reversible jump, 379
\subitem simulated annealing, 379, 392
\subitem thermodynamic integration, 379
\subitem umbrella sampling, 379
\item Monty Hall problem, 57
\item Morse, 256
\item motorcycle, 110
\item movie, 551
\item multilayer perceptron, 529, 535
\item multiple access channel, 237
\item multiterminal networks, 239
\item multivariate Gaussian, 176
\item Munro--Robbins theorem, 441
\item murder, 26, 58, 61, 354
\item music, 201, 203
\item mutation rate, 446
\item mutual information, 139, 146, 150, 151
\subitem how to compute, 149
\item myth, 347
\subitem compression, 74
\indexspace
\item nat (unit), 264, 601
\item natural gradient, 443
\item natural selection, 269
\item navigation, 594
\item Neal, Radford, 111, 121, 187, 374, 379, 391, 392, 397, 419, 420,
429, 432, 496
\item needle, Buffon's, 38
\item network, 529
\item neural network, 468, 470
\subitem capacity, 483
\subitem learning as communication, 483
\subitem learning as inference, 492
\item neuron, 471
\subitem capacity, 483
\item Newton algorithm, 441
\item Newton, Isaac, 200, 552
\item Newton--Raphson, 303
\item nines, 198
\item noise, 3, \see{channel}{612}
\subitem coloured, 179
\subitem spectral density, 177
\subitem white, 177, 179
\item noisy channel, \see{channel}{612}
\item noisy typewriter, \bold{148}, 152, 154
\item noisy-channel coding theorem, \bold{15}, 152, 162, 171, 229
\subitem Gaussian channel, 181
\subitem linear codes, 229
\subitem poor man's version, 216
\item noisy-or, 294
\item non-confusable inputs, 152
\item noninformative, 319
\item nonlinear, 535
\item nonlinear code, 20, 187
\item nonparametric data modelling, 538
\item nonrecursive, 575
\item noodle, Buffon's, 38
\item normal, 312, \see{Gaussian}{612}
\item normal graph, 219, 584
\item normalizing constant, \see{partition function}{612}
\item not-sum, 335
\item notation, 598, \see{conventions}{612}
\subitem absolute value, 33, 599
\subitem conventions of this book, 147
\subitem convex/concave, 35
\subitem entropy, 33
\subitem expectation, 37
\subitem intervals, 90
\subitem logarithms, 2
\subitem matrices, 147
\subitem probability, 22, 30
\subitem set size, 33, 599
\subitem transition probability, 147
\subitem vectors, 147
\item NP-complete, \bold{184}, 325, 517
\item nucleotide, 201, 204
\item nuisance parameters, 319
\item numerology, 208
\item Nyquist sampling theorem, 178
\indexspace
\item objective function, 473
\item Occam factor, 322, 345, \bold{348}, 350, 352
\item Occam's razor, \bold{343}
\item octal, \bold{575}
\item {\tt{octave}}, 478
\item Ode to Joy, 203
\item Oliver, 56
\item one-way hash function, 200
\item optic nerve, 491
\item optimal decoder, 152
\item optimal input distribution, \bold{150}, 162
\item optimal linear filter, 549
\item optimal stopping, 454
\item optimization, 169, 392, 429, 479, 505, 516, 531
\subitem gradient descent, 476
\subitem Newton algorithm, 441
\subitem of model complexity, 531
\item ordered overrelaxation, 391
\item orthodox statistics, 320, \see{sampling theory}{612}
\item outer code, 184
\item overfitting, 306, 322, 529, 531
\item overrelaxation, 390
\subitem ordered, 391
\indexspace
\item $p$-value, 64, 457, 462
\item packet, 188
\item paradox, 107
 \subitem Allais, 454
\subitem bus-stop, 39
\subitem heat capacity and fluctuations, 401
\item paramagnetic, \see{Ising model}{612}
\item paranormal, 233
\item parasite, 278
\item parent, 559
\item parity, 9
\item parity-check bits, 9, 203
\item parity-check constraints, 20
\item parity-check matrix, 12, 183, 229
\subitem generalized, 581
\item parity-check nodes, 19, 219, 567, 568, 583
\item parse, 119, 448
\item Parsons code, 204
\item parthenogenesis, 273
\item partial order, 418
\item partial partition functions, 407
\item particle filter, 396
\item partition, 174
\item partition function, 401, 407, 409, 422, 423, 601, 603
\subitem analogy with lake, 360
\item partitioned inverse, 543
\item Pasco, Richard, 111
\item path-counting, 244
\item pattern recognition, 156, 179, 201
\item pentagonful code, 21, 221
\item perfect code, 208, 210, 211, 219, 589
\item perfect simulation, 413
\item periodic variable, 315
\item permutation, 19, 268
\item phase transition, 361, 403, 601
\item philosophy, 26, 119, 384
\item phone, 594
\subitem cellular, \see{mobile phone}{186}
\item phone directory, 193
\item phone number, 58, 129
\item photon counter, 307, 342, 448
\item physics, 85
\item pigeon-hole, 573
\item pigeon-hole principle, 86
\item pitchfork bifurcation, 291
\item plaintext, 265
\item plankton, 359
\item point estimate, 432
\item point spread function, 549
\item pointer, 119
\item Poisson distribution, 2, 175, 307, 311, 342
\item Poisson process, 39, 46, 448
\item Poissonville, 39, 313
\item polymer, 257
\item poor man's coding theorem, 216
\item porridge, 280
\item positive definite, 539
\item positivity, 551
\item posterior probability, 6, 152
\item power cost, 180
\item power law, 584
\item practical, 183, \see{error-correcting code, practical}{612}
\item precision, \bold{176}, 181, 312, 320, 383
\item precisions add, 181
\item prediction, 29, 52
\item predictive distribution, 111
\item prefix code, \bold{92}, 95
\item prior, 6, 308, 529
\subitem assigning, 308
\subitem improper, 353
\subitem subjectivity, 30
\item prior equivalence, 447
\item priority of bits in a message, 239
\item prize, on game show, 57
\item probabilistic model, 111, 120
\item probabilistic movie, 551
\item probability, 26, 38
\subitem Bayesian, 50
\subitem contrasted with likelihood, 28
\subitem density, 30, 33
\item probability distributions, 311, \see{distribution}{612}
\item probability of block error, 152
\item probability propagation, \see{sum--product algorithm}{612}
\item product code, 184, 214
\item profile, of random graph, 568
\item pronunciation, 34
\item proper, 539
\item proposal density, \bold{364}, 365
\item Propp, Jim G., 413, 418
\item prosecutor's fallacy, 25
\item prospecting, 451
\item protein, 201, 204
\subitem regulatory, 201, 204
\item protein synthesis, 280
\item protocol, 589
\item pseudoinverse, 550
\item Punch, 448
\item puncturing, 222, 580
\item puzzle, \see{game}{612}
\subitem cable labelling, 173
\subitem chessboard, 520
\subitem fidelity of DNA replication, 280
\subitem hat, 222, 223
\subitem life, 520
\subitem magic trick, 233, 234
\subitem poisoned glass, 103
\subitem {\tt{southeast}}, 520
\subitem transatlantic cable, 173
\subitem weighing 12 balls, 68
\indexspace
\item quantum error-correction, 572
\item queue, 454
\item QWERTY, 119
\indexspace
\item R$_3$, \see{repetition code}{612}
\item race, 354
\item radial basis function, 535, 536
\item radio, 186
\item radix, 104
\item RAID, 188, 190, 219
\item random, 26, 357
\item random cluster model, 418
\item random code, 156, 161, 164, 165, 184, 192, 195, 214, 565
\subitem for compression, 231
\item random variable, 26, 463
\item random walk, 367
\subitem suppression, 370
\item random-coding exponent, 171
\item random-walk Metropolis method, 388
\item rant, \see{sermon}{612}
\subitem confidence level, 465
\subitem $p$-value, 463
\item Raptor codes, 594
\item rate, \bold{152}
\item rate-distortion theory, 167
\item reading aloud, 529
\item receiver operating characteristic, 533
\item recognition, 204
\item record breaking, 446
\item rectangular code, 184
\item reducible, 373
\item redundancy, 4, 33
\subitem in channel code, 146
\item redundant array of independent disks, 188, 190
\item redundant constraints in code, 20
\item Reed--Solomon code, 185, 186, 571, 589
\item regression, 342, 536
\item regret, 455
\item regular, 557
\item regularization, 529, 550
\item regularization constant, 479
\item reinforcement learning, 453
\item rejection, 364, 366, 533
\item rejection sampling, \bold{364}, \see{Monte Carlo methods}{612}
\item relative entropy, 34, 98, 102, 142, 422, 429, 435, 475
\item reliability function, 171
\item repeat--accumulate code, \bold{582}
\subitem connection to low-density parity-check code, 587
\item repetition code, 5, 13, 15, 16, 46, 183
\item responsibility, 289
\item retransmission, 589
\item reverse, 110
\item reversible jump, 379
\item Richardson, Thomas J., 570, 595
\item Rissanen, Jorma, 111
\item Roberts, Gareth O., 393
\item ROC, 533
\item roman, 127
\item rule of thumb, 380
\item runlength, 256
\item runlength-limited channel, 249
\indexspace
\item saccades, 554
\item saddle-point approximation, 341
\item sample, 312, 356
\subitem from Gaussian, 312
\item sampler density, 362
\item sampling distribution, 459
\item sampling theory, 38, 320
\subitem criticisms, 32
\item sandwiching method, 419
\item satellite, 594
\item satellite communications, 186
\item scaling, 203
\item Sch\"onberg, 203
\item Schottky anomaly, 404
\item secret, 200
\item secretary problem, 454
\item security, 199, 201
\item seek time, 593
\item Sejnowski, Terry J., 522
\item self-delimiting, 132
\item self-dual, 218
\item self-orthogonal, 218
\item self-punctuating, 92
\item separation, 242, 246
\item sequence, 344
\item sequential decoding, 581
\item sequential probability ratio test, 464
\item sermon, \see{caution}{612}
\subitem classical statistics, 64
\subitem confidence level, 465
\subitem dimensions, 180
\subitem gradient descent, 441
\subitem illegal integral, 180
\subitem importance sampling, 382
\subitem interleaving, 189
\subitem MAP method, 283
\subitem maximum entropy, 308
\subitem maximum likelihood, 306
\subitem maximum {\em a posteriori\/} method, 306
\subitem most probable is atypical, 283
\subitem $p$-value, 463
\subitem sampling theory, 64
\subitem sphere-packing, 209, 212
\subitem stopping rule, 463
\subitem turbo codes, 581
\subitem unbiased estimator, 307
\subitem worst-case-ism, 207
\item set, 66
\item Shannon,
\see{noisy-channel coding theorem, source coding theorem, information content}{612}
\item shannon (unit), 265
\item Shannon information content, 67, 91, 97, 115
\item Shannon, Claude, 14, 15, 152, 164, 212, 215, 262
\item shattering, 485
\item Shevelev, Vladimir, 572
\item shifter ensemble, 524
\item Shokrollahi, M. Amin, 568
\item shortening, 222
\item Siegel, Paul, 262
\item sigmoid, 473, 527
\item signal-to-noise ratio, 177, 178
\item significance, 463
\item significance level, 51, 64, 457
\item simplex, 173, 316
\item Simpson's paradox, 355
\item Simpson, O.J., \see{wife-beaters}{612}
\item simulated annealing, 379, 392, \see{annealing}{612}
\item Skilling, John, 392
\item sleep, 524
\item Slepian--Wolf, \see{dependent sources}{612}
\item slice sampling, 374, \see{Monte Carlo methods}{612}
\subitem multi-dimensional, 378
\item soft K-means clustering, 289
\item softmax, softmin, 289, 316, 339
\item software, xi
\subitem arithmetic coding, 121
\subitem {\tt{BUGS}}, 371
\subitem Dasher, 119
\subitem free, xii
\subitem Gaussian processes, 534
\subitem hash function, 200
\subitem {\tt{VIBES}}, 431
\item solar system, 346
\item soldier, 241
\item soliton distribution, 592
\item sound, 187
\item source code, 73, 75
\subitem algorithms, 119, 121
\subitem block code, 76
\subitem block-sorting compression, 121
\subitem Burrows--Wheeler transform, 121
\subitem for complex sources, 353
\subitem for constrained channel, 249
\subitem for integers, 132
\subitem Huffman, \see{Huffman code}{612}
\subitem implicit probabilities, 102
\subitem optimal lengths, 97, 102
\subitem prefix code, 95
\subitem software, \see{software}{612}
\subitem stream codes, 110--130
\subitem supermarket, 96, 104, 112
\subitem symbol code, 91
\subsubitem optimal, 91
\subitem uniquely decodeable, 94
\subitem variable symbol durations, 125, 256
\item source coding theorem, 78, 91, 229, 231
\item {\tt{southeast}} puzzle, 520
\item span, 331
\item sparse graph
\subitem profile, 569
\item sparse-graph code, 338, 556
\subitem density evolution, 566
\item sparsifiers, 255
\item species, 269
\item speculation about vision, 554
\item spell, 201
\item sphere packing, 182, 205
\item sphere-packing exponent, 172
\item Spielman, Daniel A., 568
\item spin system, 400
\item spines, 525
\item spline, 538
\item spread spectrum, 182, 188
\item spring, 291
\item spy, 464
\item square, 38
\item staircase, 569, 587
\item stalactite, 214
\item standard deviation, 320
\item stars, 307
\item state diagram, 251
\item statistic, 458
\subitem sufficient, 300
\item statistical physics, 257, 401
\item statistical test, 51, 458
\item steepest descents, 441
\item stereoscopic vision, 524
\item stiffness, 289
\item Stirling's approximation, 1, 8
\item stochastic, 472
\item stochastic dynamics, \see{Hamiltonian Monte Carlo}{612}
\item stochastic gradient, 476
\item stop-when-it's-done, 561, 583
\item stopping rule, 463
\item straws, drawing, 233
\item stream codes, 110--130
\item student, 125
\item Student-$t$ distribution, 312, 323
\item subjective probability, 26, 30
\item {\tt{submarine}}, 71
\item subscriber, 593
\item subset, 66
\item substring, 119
\item sufficient statistics, 300
\item sum rule, 39, 46
\item sum--product algorithm, 187, 245, 326, 334, \bold{336}, 407,
434, 556, 557, 572, 578
\item summary, 335
\item summary state, 418
\item summation convention, 438
\item super-channel, 184
\item supermarket for codewords, 96, 104, 112
\item support vector, 548
\item surprise value, 264
\item survey propagation, 340
\item suspicious coincidences, 351
\item symbol code, \bold{91}
\subitem budget, 94, \bold{96}
\subitem codeword, \bold{92}
\subitem disadvantages, 100
\subitem optimal, 91
\subitem self-delimiting, 132
\subitem supermarket, 112
\item symmetric channel, 171
\item symmetry argument, 151
\item synchronization, 249
\item synchronization errors, 187
\item syndrome, 10, 11, 20
\item syndrome decoding, 11, 216, 229, 371
\item system, 4
\item systematic, 575
\indexspace
\item $t$-distribution, \see{Student-$t$}{612}
\item tail, 85, 312, 313, 440, 446, 503, 584
\item tamper detection, 199
\item Tank, David W., 517
\item Tanner product code, 571
\item Tanner, Michael, 569
\item Tanzanite, 451
\item tap, 575
\item telephone, 125
\item telephone directory, 193
\item telephone number, 58, 129
\item telescope, 529
\item temperature, 392, 601
\item termination, 579
\item terminology, 598
\subitem Monte Carlo methods, 372
\item test
\subitem fluctuation, 446
\subitem statistical, 51, 458
\item text entry, 118, 119
\item thermal distribution, 88
\item thermodynamic integration, 379
\item thermodynamics, 404, 601
\subitem third law, 406
\item Thiele, T.N., 548
\item thin shell, 37, 125
\item third law of thermodynamics, 406
\item Thitimajshima, P., 186
\item three cards, 142
\item three doors, 57
\item threshold, 567
\item tiling, 420
\item time-division, 237
\item timing, 187
\item training data, 529
\item transatlantic, 173
\item transfer matrix method, 407
\item transition, 251
\item transition probability matrix, 147, 356, 607
\item translation-invariant, 409
\item travelling salesman problem, 246, 517
\item tree, 242, 336, 343, 351
\item trellis, 251
\subitem termination, 579
\item trellis section, 251, 257
\item triangle, 307
\item truth function, 211, 600
\item tube, 257
\item turbo code, 186, 556
\item turbo product code, 571
\item Turing, Alan, 265
\item twenty questions, 70, 103
\item twin, 111
\item twos, 156
\item typical set, \bold{80}, 154, 363
\subitem for compression, 80
\subitem for noisy channel, 154
\item typical-set decoder, 165, 230
\item typicality, \bold{78}, 80, 162
\subitem behaviour of evidence, 54, 60
\indexspace
\item umbrella sampling, 379
\item unbiased estimator, 307, 321, 449
\item uncompression, 231
\item union, 66
\item union bound, 166, 216, 230
\item uniquely decodeable, \bold{93}, 94
\item units, 264
\item universal, 110, 120, 121, 135
\item universality, 400
\item Urbanke, R\"udiger, 570, 595
\item urn, 31
\item user interfaces, 118
\item utility, 451
\indexspace
\item vaccination, 458
\item Vapnik--Chervonenkis dimension, 489
\item variable-length code, 249, 255
\item variable-rate error-correcting codes, 238, 590
\item variance, 1, 27, 88, 321
\item variance--covariance matrix, 176
\item variances add, 1, 181
\item variational Bayes, 429
\item variational free energy, 422, 423
\subitem minimization, 423
\item variational methods, 422, 433, 496, 508
\subitem typical properties, 435
\subitem variational Gaussian process classifier, 547
\item VC dimension, 489
\item vector quantization, 284, 290
\item very good, \see{error-correcting code, very good}{612}
\item VGC (variational Gaussian process classifier), 547
\item {\tt{VIBES}}, 431
\item Virtakallio, Juhani, 209
\item Viterbi algorithm, 245, 329, 340, 578
\item volume, 42
\item Von Mises distribution, 315
\indexspace
\item Wainwright, Martin, 340
\item waiting for a bus, 39, 46
\item warning, \see{caution}{612}
\item Watson--Crick base pairing, 280
\item weather collator, 236
\item weighing babies, 164
\item weighing problem, 66, 68
\item weight
\subitem importance sampling, 362
\subitem in neural net, 471
\subitem of binary vector, 20
\item weight decay, 479, 529
\item weight enumerator, 206, 211, 214, 216
\subitem typical, 572
\item weight space, 473, 474, 487
\item Wenglish, 72, 260
\item what number comes next?, 344
\item white, 355
\item white noise, 177, 179
\item Wiberg, Niclas, 187
\item Wiener process, 535
\item Wiener, Norbert, 548
\item wife-beater, 58, 61
\item Wilson, David B., 413, 418
\item window, 307
\item Winfree, Erik, 520
\item Wolf, Jack, 262
\item word-English, 260
\item world record, 446
\item worst-case-ism, 207, 213
\item writing, 118
\indexspace
\item Yedidia, Jonathan, 340
\indexspace
\item Z channel, \bold{148}, 149--151, 155
\item Zecchina, R., 340
\item Zipf plot, 262, 263, 317
\item Zipf's law, 40, 262, 263
\item Zipf, George K., 262
\end{theindex}
\dvipsb{index and that is the end of the book}
\normalsize
\restoremargins
\clearpage
\justifying
\setcounter{page}{701}
\stepcounter{chapter}
\chapter*{Extra Solutions to Exercises}
\chapterheadhack{Solutions manual}
% reminder for bulk of book:
% x means a rating is included
% \exercissx{2}{ex.R3ep}{ An A-rated exercise, with solution linked }
% \exercisxB{2}{ex0}{ A B-rated exercise, with no solution at all }
% \exercisaxB{2}{ex0}{ A B-rated exercise, with solution available in the unpub chapter }
%
% the text BORDERLINE is used to mark
% ex's whose solns I would perhaps like to
% cut or restore.
%
% a good way to search for remaining tasks is to search for cisx in thebook.tex in reftex-mode
%
% at this point we should undefine the index command
\renewcommand{\index}[1]{}
\renewcommand{\ind}[1]{#1}
%
This is the solutions manual for {\em Information Theory, Inference, and Learning Algorithms}.
Solutions to many of the exercises are provided in the book itself.
This manual contains solutions to most of the other core exercises.
These solutions are supplied on request to instructors using this book in their
teaching; please email {\tt{solutions@cambridge.org}} to obtain the latest version.
For the benefit of
instructors, please do not circulate this document
to students.
\begin{center}
\copyright 2003 David J.C. MacKay. Version \thedraft\ -- \today.
\end{center}
Please send corrections or additions to these solutions
to David MacKay, {\tt{mackay@mrao.cam.ac.uk}}.
\section*{Reminder about internet resources}
The website
\begin{realcenter}
{\tt{http://www.inference.phy.cam.ac.uk/mackay/itila}}
\end{realcenter}
contains several resources for this book.
\section*{Extra Solutions for Chapter \ref{chone}}
\soln{ex.GHis0}{
The matrix $\bH \bG^{\T} \mod 2$ is equal to the all-zero
$3\times4$ matrix, so for any codeword $\bt = \bG^{\T}\bs$,
$\bH \bt = \bH \bG^{\T}\bs = (0,0,0)^{\T}$.
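 As a cross-check that avoids multiplying the matrices out entry by entry,
 suppose the two matrices are written in systematic block form,
 $\bG^{\T} = \left[ \begin{array}{c} \mathbf{I}_4 \\ \mathbf{P} \end{array} \right]$
 and $\bH = \left[ \begin{array}{cc} \mathbf{P} & \mathbf{I}_3 \end{array} \right]$,
 where $\mathbf{P}$ is a $3 \times 4$ binary matrix. (The block symbols
 $\mathbf{P}$, $\mathbf{I}_3$ and $\mathbf{I}_4$ are notation assumed here
 for illustration.) Then
\beq
 \bH \bG^{\T}
 = \left[ \begin{array}{cc} \mathbf{P} & \mathbf{I}_3 \end{array} \right]
 \left[ \begin{array}{c} \mathbf{I}_4 \\ \mathbf{P} \end{array} \right]
 = \mathbf{P} + \mathbf{P} = 2 \mathbf{P} = \mathbf{0} \mod 2 ,
\eeq
 whatever the entries of $\mathbf{P}$ are.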
}
%
\soln{ex.Hdecode}{
(a) {\tt 1100}
(b) {\tt 0100}
(c) {\tt 0100}
(d) {\tt 1111}.
%\ben
%\item {\tt 1100}
%\item {\tt 0100}
%\item {\tt 0100}
%\item {\tt 1111}
%\een
}
%
\soln{ex.H74detail}{% show that hamming...
To be a valid hypothesis, a decoded pattern must be a codeword
of the code.
If there were a decoded pattern in which the parity bits differed
from the transmitted parity bits, but the source bits didn't differ,
that would mean that there are two codewords with the same source bits
but different parity bits. But since the parity bits are
a deterministic function of the source bits, this is a contradiction.
So if any linear code is decoded with its optimal decoder, and a
decoding error occurs anywhere in the block, some of the source bits
must be in error.
% {\em REWRITE THIS!}
}
\section*{Extra Solutions for Chapter \ref{ch.prob.ent}}
\soln{ex.postpa}{
 Tips for sketching the posteriors: the best technique for sketching $p^{29}(1-p)^{271}$ is to sketch the logarithm
 of the posterior, differentiating to find where its maximum is. Take the second derivative
 at the maximum in order to approximate the peak as $\propto \exp[ - ( p-p_{\MAP} )^2/2 s^2 ]$ and
 find the width $s$.
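 As a sketch of how this recipe works out for $p^{29}(1-p)^{271}$,
 write $L(p) = 29 \ln p + 271 \ln (1-p)$ for the log posterior up to
 an additive constant (the symbol $L$ and the use of natural logarithms
 are choices made here for illustration). Then
\beqan
 \frac{\d L}{\d p} &=& \frac{29}{p} - \frac{271}{1-p} = 0
 \ \ \Rightarrow \ \ p_{\MAP} = \frac{29}{300} \simeq 0.097 , \\
 \left. \frac{\d^2 L}{\d p^2} \right|_{p_{\MAP}}
 &=& - \frac{29}{p_{\MAP}^2} - \frac{271}{(1-p_{\MAP})^2}
 = - \frac{300^3}{29 \times 271} .
\eeqan
 Identifying the second derivative with $-1/s^2$ gives
 $s^2 = 29 \times 271 / 300^3 \simeq 2.9 \times 10^{-4}$, so $s \simeq 0.017$:
 the peak sits near $p = 0.097$ and is a few hundredths wide.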
Assuming the uniform prior (which of course is not fundamentally
`right' in any sense, indeed it doesn't look very uniform
in other bases, such as the logit basis),
the probability that the next outcome is a head is
\beq
\frac{n_H+1}{N+2}
\eeq
\ben
\item $N=3$ and $n_H=0$: $\frac{1}{5}$;
\item $N=3$ and $n_H=2$: $\frac{3}{5}$;
\item $N=10$ and $n_H=3$: $\frac{4}{12}$;
\item
$N=300$ and $n_H=29$: $\frac{30}{302}$.
\een
}
\soln{ex.entropydecompose}{
Define, for each $i>1$, $p^*_i = p_i/(1-p_1)$.
\beqan
\!\!\!\!\!\! H(\bp)
&\!\!=\!\!& p_1 \log 1/p_1 + \sum_{i>1} p_i \log 1/p_i \\
&\!\!=\!\!& p_1 \log 1/p_1 + (1-p_1) \sum_{i>1} p^*_i [ \log 1/(1-p_1) + \log 1/p^*_i ] \\
&\!\!=\!\!& p_1 \log 1/p_1 + (1\!-\!p_1) \log 1/(1\!-\!p_1) + (1\!-\!p_1) \sum_{i>1} p^*_i [ \log 1/p^*_i ]
\eeqan
Similar approach for the more general formula.
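 A quick numerical check of the decomposition, with a distribution chosen
 here purely as an example: take $\bp = (1/2, 1/4, 1/4)$ and logarithms to
 base 2, so that $p^*_2 = p^*_3 = 1/2$. Direct evaluation and the
 decomposition agree:
\beqan
 H(\bp) &=& \frac{1}{2} \log 2 + \frac{1}{4} \log 4 + \frac{1}{4} \log 4 = 1.5 \mbox{ bits} , \\
 H(\bp) &=& H_2(1/2) + \left( 1 - \frac{1}{2} \right) H_2(1/2) = 1 + \frac{1}{2} = 1.5 \mbox{ bits} .
\eeqan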
}
\continuedsoln{ex.decomposeexample}{
$P(0) = fg$;
$P(1) = f(1-g)$;
$P(2) = (1-f)h$;
$P(3) = (1-f)(1-h)$;
$H(X) = H_2(f) + f H_2(g) + (1-f) H_2(h)$.
$\d H(X)/\d f = \log [(1-f)/f] + H_2(g) - H_2(h)$.
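 The derivative follows from $\d H_2(f)/\d f = \log [(1-f)/f]$, the
 other two terms being linear in $f$.
 As a sanity check, with values chosen here only for illustration,
 set $f=g=h=1/2$, so that the four outcomes are equiprobable:
\beq
 H(X) = 1 + \frac{1}{2} + \frac{1}{2} = 2 \mbox{ bits} ,
 \hspace{0.3in}
 \frac{\d H(X)}{\d f} = \log 1 + 1 - 1 = 0 ,
\eeq
 consistent with the uniform distribution over four outcomes having the
 maximum entropy, $\log 4 = 2$ bits.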
}
\continuedsoln{ex.waithead0}{
Direct solution:
$H(X) = \sum_i p_i \log 1/p_i = \sum_{i=1}^{\infty} (1/2^i) i = 2$.
[The final step, summing the series, requires mathematical skill, or a computer algebra system;
one strategy is to define $Z(\beta) = \sum_{i=1}^{\infty} (1/2^{\beta i})$,
a series that is easier to sum (it's $Z = 1/(2^{\beta} - 1)$),
then differentiate $\log Z$ with respect to $\beta$, evaluating at $\beta =1$.]
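 One way to carry out the bracketed strategy, using natural logarithms
 for the differentiation, is
\beqan
 \sum_{i=1}^{\infty} i \, 2^{-\beta i}
 &=& - \frac{1}{\ln 2} \frac{\d Z}{\d \beta}
 = - \frac{Z}{\ln 2} \frac{\d \ln Z}{\d \beta} \\
 &=& \frac{1}{2^{\beta} - 1} \cdot \frac{2^{\beta}}{2^{\beta} - 1}
 = \frac{2^{\beta}}{(2^{\beta} - 1)^2} ,
\eeqan
 which equals 2 at $\beta = 1$, in agreement with the direct solution.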
Solution using decomposition: the entropy of the string of outcomes, $H$, is
the entropy of the first outcome, plus (1/2)(the entropy of the remaining outcomes,
assuming the first is a tail). The final expression in parentheses is identical to $H$.
So
$H = H_2(1/2) + (1/2) H$.
Rearranging, $(1/2)H=1$ implies $H=2$.
}
\soln{ex.balls}{
$P(\mbox{first is white}) = w/(w+b)$.
$P(\mbox{first is white, second is white}) = \frac{w}{w+b} \frac{w-1}{w+b-1}$.
$P(\mbox{first is black, second is white}) = \frac{b}{w+b} \frac{w}{w+b-1}$.
Now use the sum rule:
$P(\mbox{second is white}) = \frac{w}{w+b} \frac{w-1}{w+b-1} + \frac{b}{w+b} \frac{w}{w+b-1}
= \frac{ w(w-1) + bw }{ (w+b) (w+b-1) }
= \frac{ w }{ (w+b) }$.
}
\soln{ex.buffon}{
The circle lies in a square if the centre of the circle is in a smaller
square of size $b-a$. The probability distribution of the centre of the
circle is uniform over the plane, and these smaller squares make up
a fraction $(b-a)^2/b^2$ of the plane, so this is the probability
required. $(b-a)^2/b^2 = (1-a/b)^2$.
}
\soln{ex.buffon2}{ {\sf \ind{Buffon's needle}}.\index{needle, Buffon's}
The angle $t$ of the needle relative to the parallel lines
is chosen at random. Once the
angle is chosen, there is a probability $a \sin t /b$ that the needle
crosses a line, since the distance between crossings of the parallel lines by
the line aligned with the needle is $b/\sin t$.
So the probability of crossing is
\beq
 \frac{ \int_{t=0}^{\pi/2} \d t \, (a/b) \sin t }{ \int_{t=0}^{\pi/2} \d t }
 = \frac{ (a/b) \, [- \cos t]^{\pi/2}_0 }{ \pi/2 } = \frac{2}{\pi} \, \frac{a}{b} .
\eeq
}
\soln{ex.barycentriccoordinate}{
Let the three segments have lengths $x$, $y$, and $z$.
If $x+y>z$, and $x+z>y$, and $y+z>x$, then
they can form a triangle.
Now let the two points be located at $a$ and $b$ with $b>a$,
and define $x=a$, $y=b-a$, and $z=1-b$.
Then the three constraints imply
$b>1-b \ \Rightarrow \ b>1/2$,
similarly
$a<1/2$, and
$b-a<1/2$.
Plotting these regions in the permitted $(a,b)$ plane,
we find that the three constraints are satisfied in a triangular region
of area $1/4$ of the full area ($a>0, b>0, b>a$),
so the probability is $1/4$.
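 The areas can be made explicit as follows.
 The permitted region $0<a<b<1$ is a triangle of area $1/2$.
 The region in which $a<1/2$, $b>1/2$, and $b-a<1/2$ all hold is the triangle
 with vertices $(a,b) = (0,1/2)$, $(1/2,1/2)$, and $(1/2,1)$, which has
 two perpendicular sides of length $1/2$, so
\beq
 \frac{ \mbox{favourable area} }{ \mbox{permitted area} }
 = \frac{ \frac{1}{2} \times \frac{1}{2} \times \frac{1}{2} }{ \frac{1}{2} }
 = \frac{ 1/8 }{ 1/2 } = \frac{1}{4} .
\eeq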
}
\soln{ex.brothers}{
Assuming ignorance about the order of the ages $F$, $A$, and $B$,
the six possible hypotheses have equal probability.
The probability that $F>B$ is $\dhalf$.
The conditional probability that $F>B$ given that $F>A$
is given by the joint probability divided by the marginal probability:
\beq
P( F>B \given F>A ) = \frac{ P( F>B , F>A ) }
{ P( F>A )}
= \frac{ \dfrac{2}{6} }{ \dhalf }
% 2/6 / 1/2
= \frac{2}{3} .
\eeq
(The joint probability that $F>B$ and $F>A$ is the probability that
Fred is the oldest, which is $\dthird$.)
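 The same numbers can be read off by listing the six orderings, oldest
 first (the three-letter labels are shorthand introduced here):
 $FAB$, $FBA$, $AFB$, $ABF$, $BFA$, $BAF$.
 The event $F>A$ holds in $FAB$, $FBA$, and $BFA$; of these three,
 $F>B$ holds in $FAB$ and $FBA$. So
\beq
 P( F>B \given F>A ) = \frac{ \mbox{number of orderings with $F>B$ and $F>A$} }
 { \mbox{number of orderings with $F>A$} } = \frac{2}{3} .
\eeq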
}
%
\soln{ex.liars}{
1/5.
}
\section*{Extra Solutions for Chapter \ref{ch.bayes}}
\soln{ex.evidencebounds}{
The idea that complex models can win (in log evidence) by
an amount linear in the number of data, $F$, and can lose by only
a logarithmic amount is important and general.
For the biggest win by $\H_1$, let $F_{\ta}=F$ and $F_{\tb}=0$.
\beq
\log \frac{ P( \bs \given F,\H_1 ) }
{ P( \bs \given F,\H_0 ) }
 = \log \frac{ 1/(F+1) }{ p_0^{F} }
= - \log (F+1) + F \log 1/p_0 .
% \frac{ \smallfrac{ F_{\ta}! F_{\tb}! }{ (F_{\ta} + F_{\tb} + 1)! } }{ p_0^{F_{\ta}} (1-p_0)^{F_{\tb}} } .
\eeq
The second term dominates, and the win for $\H_1$ is growing linearly with $F$.
 For the biggest win by $\H_0$, let $F_{\ta}=p_0 F$ and $F_{\tb}=(1-p_0) F$, and write $p_1 \equiv 1-p_0$.
We now need to use an accurate version of Stirling's approximation (\ref{eq.H2approxaccurate}),
because things
are very close. The difference comes down to the square root terms in Stirling.
\beqan
\!\!\!\!\!\!\!
\log \frac{ P( \bs \given F,\H_1 ) }
{ P( \bs \given F,\H_0 ) }
&=& \log \linefrac{ \smallfrac{ F_{\ta}! F_{\tb}! }{ (F_{\ta} + F_{\tb} + 1)! } }{ p_0^{F_{\ta}} (1-p_0)^{F_{\tb}} }
\\
&=& \log \frac{1}{F+1} - \log {{F}\choose{F_{\ta}}} - \log { p_0^{p_0 F} p_1^{p_1 F} }
\\
&=& - \log (F+1) + \frac{1}{2} \log \left[ {2\pi F \, \frac{p_0F}{F} \,
\frac{p_1F}{F}} \right]
\\
% &=& - \frac{1}{2} \log \frac{(F+1)^2}{F} + \frac{1}{2} \log \left[ {2\pi} \, p_0 \,
% p_1 \right] .
&=& - \frac{1}{2} \log \left[ (F+1) \left(1+\frac{1}{F}\right) \right] + \frac{1}{2} \log \left[ {2\pi} \, p_0 \,
p_1 \right] .
\eeqan
Of these two terms, the second is asymptotically independent of $F$, and the first is negative,
with a magnitude that grows as half the logarithm of $F$; so the win for $\H_0$ grows only logarithmically with $F$.
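For a feel for the magnitudes, here is an illustrative evaluation (the values $F=100$ and
$p_0=1/2$ are assumptions chosen for this example, and logarithms are taken to base 2).
The biggest win for $\H_1$ is
\beq
 - \log_2 101 + 100 \log_2 2 \simeq -6.7 + 100 = 93.3 \ubits ,
\eeq
whereas the biggest win for $\H_0$ is only
\beq
 \frac{1}{2} \log_2 \left[ 101 \times 1.01 \right]
 - \frac{1}{2} \log_2 \left[ 2 \pi \times \frac{1}{4} \right]
 \simeq 3.3 - 0.3 = 3.0 \ubits .
\eeq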
}
\soln{ex.girlboy}{
Let the variables be $l,m,n$, denoting the sex of the child
who lives behind each of the three doors, with $l=0$ meaning the first child is male.
We'll assume the prior distribution is uniform,
$P(l,m,n) = (1/2)^3$, over all eight possibilities.
(Strictly, this is not a perfect assumption, since genetic causes
do sometimes lead to some parents producing only one sex or the other.)
The first data item establishes that $l=1$;
the second item establishes that at least one of the
three propositions $l=0$, $m=0$, and $n=0$ is true.
The viable hypotheses are
\begin{center}\begin{tabular}{c}
$l=1,\ m=0,\ n=0$;\\
$l=1,\ m=1,\ n=0$;\\
$l=1,\ m=0,\ n=1$.\\
\end{tabular}\end{center}
These had equal prior probability. The posterior probability that
there are two boys and one girl is $1/3$.
% was in error until Tue 16/12/03
}
\soln{ex.bagcounter}{
There are two hypotheses: let $H=0$ mean that the original counter in the bag
was white and $H=1$ that it was black. Assume the prior probabilities are equal.
The data is that when a randomly selected counter was
drawn from the bag, which contained a white one and the unknown one,
it turned out to be white.
The probability of this result according to each hypothesis is:
\beq
P(D\given H\eq0) = 1 ; \ P(D\given H\eq1) = 1/2 .
\eeq
So by Bayes' theorem, the posterior probability of $H$ is
\beq
P(H\eq0\given D ) = 2/3 ; \
P(H\eq1\given D ) = 1/3 .
\eeq
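Written out in full, with the equal priors $P(H\eq0)=P(H\eq1)=1/2$ assumed above,
\beq
 P(H\eq0\given D) = \frac{ 1 \times \frac{1}{2} }{ 1 \times \frac{1}{2} + \frac{1}{2} \times \frac{1}{2} }
 = \frac{ 1/2 }{ 3/4 } = \frac{2}{3} .
\eeq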
}
\soln{ex.othercoin}{
It's safest to enumerate all four possibilities.
Call the four equiprobable outcomes
$HH,
HT,
TH,
TT$.
In the first three cases, Fred will declare he has won;
in the first case, $HH$, whichever coin he points to, the other is a head;
in the second and third cases, the other coin is a tail.
So there is a $1/3$ probability that `the other coin' is a head.
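In symbols, writing $W$ for the event that Fred declares he has won (a label introduced here for convenience),
\beq
 P( \mbox{other coin is a head} \given W ) = \frac{ P(HH) }{ P(HH)+P(HT)+P(TH) } = \frac{ 1/4 }{ 3/4 } = \frac{1}{3} .
\eeq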
}
\section*{Extra Solutions for Chapter \ref{ch.two}}
\continuedsoln{ex.Hadditive}{
\beqan
H(X,Y) &=& \sum_{x,y} P(x,y) h(x,y)
= \sum_{x,y} P(x,y) ( h(x)+h(y) ) \\
&=&
\left[ \sum_{x,y} P(x,y) h(x) \right]+ \left[ \sum_{x,y} P(x,y) h(y) \right].
\eeqan
Because $h(x)$ has no dependence on $y$,
it's easy to sum over $y$ in the first term.
$\sum_{y} P(x,y) = P(x)$.
Summing over $x$ in the second term similarly, we have
$$H(X,Y) = \sum_{x} P(x) h(x) + \sum_y P(y) h(y)
= H(X)+H(Y).$$
}
\soln{ex.weighexplain}{
If six are weighed against six,
then the first weighing conveys no information
about the question `which is the odd ball?'
All 12 balls are equally likely, both before and after.
If six are weighed against six,
then the first weighing conveys exactly one bit of information
about the question `which is the odd ball and is it heavy or light?'
There are 24 viable hypotheses before, all equally likely;
and after, there are 12. A halving of the number of (equiprobable)
possibilities corresponds to gaining one bit. (Think of playing {\tt{sixty-three}}.)
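In terms of entropies (a sketch, with all hypotheses equiprobable both before and after the weighing), the information gained about the second question is
\beq
 \log_2 24 - \log_2 12 = \log_2 2 = 1 \ubit ,
\eeq
whereas for the first question it is $\log_2 12 - \log_2 12 = 0$.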
}
\soln{ex.weighthirtynine}{
Let's use our rule of thumb: always maximize the entropy.
At the first step we weigh 13 against 13, since that
maximizes the entropy of the outcome.
If they balance, we weigh 5 of the remainder against 4 of the
remainder (plus one good ball). The outcomes have probabilities
$8/26$ (balance), $9/26$, and $9/26$, which is the most uniform distribution possible.
Let's imagine that the `5' are heavier than the `4 plus 1'.
We now ensure that the next weighing has probability 1/3 for
each outcome: leave out any three of the nine suspects,
and allocate the others appropriately. For example, leaving out HHH,
weigh HLL against HLL, where H denotes a possibly heavy ball
and L a possibly light one. Then if those balance, weigh
an omitted pair of H's; if they do not balance, weigh the
two L's against each other.
John Conway's solution on page \pageref{ex.twelve.generalize.weigh.sol}
of the book gives an explicit and
more general solution.
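The bookkeeping behind this strategy can be sketched as follows (assuming, as the 13-against-13 first step suggests, 39 balls of which exactly one is odd, giving $2 \times 39 = 78$ hypotheses).
Each weighing has three outcomes, so four weighings can distinguish at most $3^4 = 81 \geq 78$ hypotheses,
and the numbers of surviving hypotheses along any path are
\beq
 78 \:\rightarrow\: 26 \:\rightarrow\: 9 \mbox{ or } 8 \:\rightarrow\: 3 \mbox{ or fewer} \:\rightarrow\: 1 .
\eeq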
}
\soln{ex.binaryweigh}{
Going by the rule of thumb that the most efficient strategy is the
most informative strategy, in the sense of having all possible
outcomes as near as possible to equiprobable, we want the first
weighing to have outcomes `the two sides balance' in eight cases and
`the two sides do not balance' in eight cases. This is achieved by
initially weighing 1,2,3,4 against 5,6,7,8, leaving the other eight
balls aside. Iterating this binary division of the
possibilities, we arrive at a strategy requiring 4 weighings.
The above strategy for designing a sequence of binary
experiments by constructing a binary tree from the top down
is actually not always optimal; the optimal
method of constructing a binary tree will be explained in the
next chapter.
}
\soln{ex.flourforty}{
The weights needed are 1, 3, 9, and 27. Four weights in total.
The set of 81 integers from $-40$ to $+40$ can be represented
in ternary, with the three symbols being interpreted as
`weight on left',
`weight omitted', and
`weight on right'.
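Equivalently, every integer $n$ with $|n| \leq 40$ can be written in balanced ternary (the digit symbols $d_k$ are notation introduced here):
\beq
 n = d_0 + 3 d_1 + 9 d_2 + 27 d_3 , \:\:\: d_k \in \{ -1, 0, +1 \} ,
\eeq
and the largest such integer is $1+3+9+27 = 40$.
For example, $20 = 27 - 9 + 3 - 1$: the weights 27 and 3 go on one pan, and the weights 9 and 1 go on the other pan alongside the flour.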
}
\begincuttable
\soln{ex.twelve.two.weigh}{
\ben
\item
A sloppy answer to this question counts the number of possible
states, ${{12}\choose{2}} 2^2 = 264$, and takes its base 3 logarithm,
which is 5.07, which exceeds 5.
We might estimate that six weighings
suffice to find the state of the two odd balls among 12. If there
are three odd balls then there are ${{12}\choose{3}} 2^3 = 1760$
states, whose logarithm is 6.80, so seven weighings might
be estimated to suffice.
However, these answers neglect the possibility
that we will learn something more from our experiments than
just which are the odd balls.
Let us define the oddness of an odd ball to be the absolute
value of the difference between its weight and the regular weight.
There is a good chance that we will
also learn something about the relative oddnesses
of the two odd balls.
%
% If, say, balls A and B are both heavy,
% and A is heavier than B,
% there is a good chance that the optimal weighing strategy
% will at some point put ball A on one side of the balance
% and ball B on the other, along with a load of regular balls;
% the outcome of this weighing
% reveals, at the end of the day, that A was heavier than B, which
% is not something we were asked to find out. From the
If balls $m$ and $n$ are the odd balls,
there is a good chance that the optimal weighing strategy
will at some point put ball $m$ on one side of the balance
and ball $n$ on the other, along with a load of regular balls;
if $m$ and $n$ are both heavy balls, say,
the outcome of this weighing will
% allow us to deduce
reveal, at the end of the day, whether $m$ was heavier than $n$, or lighter,
or the same, which
is not something we were asked to find out. From the
point of view of the task, finding the relative oddnesses
of the two balls is a waste of experimental capacity.
A more careful estimate takes this annoying possibility into account.
In the case of two odd balls,
a complete description of the balls, including a ranking of their
oddnesses, has three times as many states as we counted above (the
two odd balls could be odd by the same amount, or by amounts
that differ), \ie,
$264\times 3 = 792$ outcomes, whose logarithm is 6.07.
Thus to identify the {\em full\/} state
of the system in 6 weighings is impossible -- at least seven are needed.
I don't know whether the original
problem can be solved in 6 weighings.
%with a strategy that
% sometimes avoids finding the ranking of the oddnesses.
In the case of three odd balls, there are $3!=6$ possible rankings
of the oddnesses if the oddnesses are different (\eg,
$0 set term postscript
% Terminal type set to 'postscript'
% Options are 'landscape monochrome dashed "Helvetica" 14'
% gnuplot> set output 'figs/hd/1.2.3.100.ps
% gnuplot> replot
The curves $\frac{1}{N} H_{\delta}(Y^N)$
as a function of $\delta$ for $N=1,2,3$ and 100 are shown in \figref{fig.hd.1.2.3.100}.
% and table \ref{tab.Hdelta.0.5}.
Note that $H_2(0.5) = 1$ bit.
\begin{figure}[htbp]
\figuredangle{%
\begin{center}
\mbox{%
\begin{tabular}[t]{r}\vspace{0in}\\% alignment hack
\mbox{\psfig{figure=Hdelta/figs/hd/1.2.3.100.ps,%
width=60mm,angle=-90}}\end{tabular}
%
\hspace{0in}
%%%%%%%%%%%%%%%%%%%
\begin{tabular}[t]{r@{--}lcc} \toprule % {r@{--}lcc} \midrule
\multicolumn{4}{c}{$N=2$} \\ \midrule
% delta 1/N Hdelta 2^{Hdelta}
\multicolumn{2}{c}{$\delta$} & $\frac{1}{N} H_{\delta}(\bY)$ & $2^{H_{\delta}(\bY)}$
% raise the roof!
%{\rule[-3mm]{0pt}{8mm}}
\\ \midrule
%
0 & 0.25 & 1 & 4 \\
0.25 & 0.5 & 0.79248 & 3 \\
0.5 & 0.75 & 0.5 & 2 \\
0.75 & 1 & 0 & 1 \\ \bottomrule
\end{tabular}
\hspace{0.1in}
\begin{tabular}[t]{r@{--}lcc} \toprule
\multicolumn{4}{c}{$N=3$} \\ \midrule
% delta 1/N Hdelta 2^{Hdelta}
\multicolumn{2}{c}{$\delta$} & $\frac{1}{N} H_{\delta}(\bY)$ & $2^{H_{\delta}(\bY)}$
% raise the roof!
%{\rule[-3mm]{0pt}{8mm}}
\\ \midrule
%
0& 0.125 & 1 & 8 \\
0.125& 0.25 & 0.93578 & 7 \\
0.25 & 0.375 & 0.86165 & 6 \\
0.375 & 0.5 & 0.77398 & 5 \\
0.5 & 0.625 & 0.66667 & 4 \\
0.625 & 0.75 & 0.52832 & 3 \\
0.75 & 0.875 & 0.33333 & 2 \\
0.875 & 1 & 0 & 1 \\ \bottomrule
\end{tabular}
%%%%%%%%%%%%%%%%%%%%%%%%%
%
}
\end{center}
}{%
\caption[a]{$\frac{1}{N} H_{\delta}(\bY)$ (vertical axis) against $\delta$ (horizontal),
for $N=1, 2, 3, 100$ binary variables with $p_1=0.5$.}
\label{fig.hd.1.2.3.100}
\label{tab.Hdelta.0.5}
}%
\end{figure}
%
%\begin{table}[htbp]
%\figuremargin{%
%\begin{center}
%\end{center}
%}{%
%\caption[a]{Values of $\frac{1}{N} H_{\delta}(\bY)$ against $\delta$.}
%% add 0.5 to this caption
%\label{tab.Hdelta.0.5}
%}
%\end{table}
}
\soln{ex.chernoff}{ {\sf\ind{Chernoff bound}.}
Let $t = \exp( sx)$ and $\alpha = \exp(s a)$. If we assume $s>0$
then $x \geq a$ implies $t \geq \alpha$.
Assuming $s>0$,
$P(x \geq a) = P( t \geq \alpha ) \leq \bar{t}/\alpha
= \sum_{x} P(x) \exp(sx) / \exp(sa) = e^{-sa} g(s)$.
Changing the sign of $s$
means that instead $x \leq a$ implies $t \geq \alpha$;
so assuming $s<0$,
$P(x \leq a) = P( t \geq \alpha )$; the remainder of the calculation is as above.
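The inequality $P( t \geq \alpha ) \leq \bar{t}/\alpha$ used above is Markov's inequality for the non-negative variable $t$; a one-line sketch of why it holds is
\beq
 \bar{t} = \sum_{t} P(t) \, t \:\geq\: \sum_{t \geq \alpha} P(t) \, t \:\geq\: \alpha \sum_{t \geq \alpha} P(t) = \alpha \, P( t \geq \alpha ) .
\eeq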
}
\section*{Extra Solutions for Chapter \ref{ch.three}}
\soln{ex.Cnud}{
The code $\{ {\tt 00}, {\tt 11}, {\tt 0101}, {\tt 111}, {\tt 1010},
{\tt 100100},$ ${\tt 0110} \}$ is not
uniquely decodeable because ${\tt 11111}$ can be realized from $c(2)c(4)$
and $c(4)c(2)$.
}
\soln{ex.Ctern}{
The ternary code
$\{ {\tt 00},{\tt 012},{\tt 0110},{\tt 0112},$ ${\tt 100},{\tt 201},{\tt 212},{\tt 22} \}$
% $\{ 00,012,0110,0112,100,201,$ $212,22 \}$
{\em is\/} uniquely decodeable
because it is a prefix code.
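As a consistency check, the codeword lengths $\{2,3,4,4,3,3,3,2\}$ satisfy the Kraft inequality for a ternary alphabet, as the lengths of any prefix code must:
\beq
 \sum_i 3^{-l_i} = 2 \times 3^{-2} + 4 \times 3^{-3} + 2 \times 3^{-4} = \frac{18+12+2}{81} = \frac{32}{81} \leq 1 .
\eeq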
}
\soln{ex.Huffambigb}{
Probability vectors leading to a free choice in the Huffman
coding algorithm satisfy $p_1 \geq p_2 \geq p_3 \geq p_4 \geq 0$ and
\beq
p_1 = p_3 + p_4 .
\label{eq.Huffambig}
\eeq
% The
% % reason for this is that the
% first step of the Huffman coding algorithm always combines the
% symbols with smallest probability giving a new symbol with
% probability $p_3 + p_4$. The only way we can get alternative
% lengths is if this probability is equal to
The convex hull of $\cal Q$ is most easily obtained by
turning two of the three inequalities
$p_1 \geq p_2 \geq p_3 \geq p_4$ into equalities, and then solving
\eqref{eq.Huffambig} for $\bp$. Each choice of equalities gives
rise to one of the set of three vectors
\beq
\{ \dthird,\dthird,\dsixth,\dsixth\} , \:
\{ \dtwofifth,\dfifth,\dfifth,\dfifth\} \mbox{ and } \{ \dthird ,\dthird,\dthird,0\}.
\eeq
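For instance, the three choices work out as follows, using the normalization $\sum_i p_i = 1$:
\beqan
 p_1 = p_2 , \: p_3 = p_4 & \Rightarrow & p_1 = 2 p_3 , \:\: 6 p_3 = 1 ; \\
 p_2 = p_3 = p_4 & \Rightarrow & p_1 = 2 p_4 , \:\: 5 p_4 = 1 ; \\
 p_1 = p_2 = p_3 & \Rightarrow & p_4 = 0 , \:\: 3 p_1 = 1 .
\eeqan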
}
\soln{ex.twenty.questions}{
An optimal strategy asks questions that
have a 50:50 chance of being answered yes or no.
An essay on this topic should discuss practical ways
of approaching this ideal.
}
\soln{ex.powertwogood}{
Let's work out the optimal codelengths. They are all integers.
Now, the question is, can a set of integers satisfying the Kraft equality
be arranged in an appropriate binary tree? We can do this constructively
by going to the codeword supermarket and buying the shortest codewords first.
Having bought them in order, they must define a binary tree.
}
\continuedsoln{ex.huffman.uniform}{
% More details for this solution:
\begin{center}
%codewords:
\begin{tabular}{clrrl} \toprule
$a_i$ & $p_i$ & \multicolumn{1}{c}{$\log_2 \frac{1}{p_i}$} & $l_i$ & $c(a_i)$ \\[0.1in] \midrule
{\tt a} & 0.09091 & 3.5 & 4 & {\tt 0000} \\
{\tt b} & 0.09091 & 3.5 & 4 & {\tt 0001} \\
{\tt c} & 0.09091 & 3.5 & 4 & {\tt 0100} \\
{\tt d} & 0.09091 & 3.5 & 4 & {\tt 0101} \\
{\tt e} & 0.09091 & 3.5 & 4 & {\tt 0110} \\
{\tt f} & 0.09091 & 3.5 & 4 & {\tt 0111} \\
{\tt g} & 0.09091 & 3.5 & 3 & {\tt 100} \\
{\tt h} & 0.09091 & 3.5 & 3 & {\tt 101} \\
{\tt i} & 0.09091 & 3.5 & 3 & {\tt 110} \\
{\tt j} & 0.09091 & 3.5 & 3 & {\tt 111} \\
{\tt k} & 0.09091 & 3.5 & 3 & {\tt 001} \\
\bottomrule
\end{tabular}
\end{center}
%total count 11
%expected length 3.5455
The entropy is $\log_2 11 = 3.4594$
and the expected length is $L=3 \times \frac{5}{11} + 4 \times \frac{6}{11}$
which is $3\frac{6}{11} = 3.54545$.
}
\soln{ex.huffman.uniform2}{
The key steps in this exercise are all spelled out in the
problem statement.
Difficulties arise with these concepts:
(1)~When you run the Huffman algorithm,
all these equiprobable symbols will end up having
one of just two lengths, $l^{+} = \lceil \log_2 I \rceil$
and
$l^{-} = \lfloor \log_2 I \rfloor$.
The steps up to (\ref{eq.HIL}) then involve working out how many
have each of these two adjacent lengths, which depends on
how close $I$ is to a power of 2.
(2)~The excess length was only defined for integer $I$,
but we are free to find the maximum value it attains for
any real $I$; this maximum will certainly not be exceeded by
any integer $I$.
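For step (1), one way to count the codewords of each length (a sketch; $n^{+}$ and $n^{-}$ are names introduced here for the numbers of codewords of length $l^{+}$ and $l^{-}$, and $I$ is assumed not to be a power of 2)
is to solve $n^{+} + n^{-} = I$ together with the Kraft equality $n^{+} 2^{-l^{+}} + n^{-} 2^{-l^{-}} = 1$, which gives
\beq
 n^{+} = 2 I - 2^{l^{+}} , \:\:\: n^{-} = 2^{l^{+}} - I .
\eeq
For $I = 11$ this gives $n^{+} = 6$ codewords of length 4 and $n^{-} = 5$ of length 3.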
}
% BORDERLINE
\begincuttable
\soln{ex.Huff99}{
The sparse source ${\cal P}_X = \{ 0.99 , 0.01 \}$
could be compressed with a Huffman code based on blocks of
length $N$, but $N$ would need to be quite large
for the code to be efficient. The probability of the all-{\tt{0}} sequence
of length $N$
has to be reduced to about 0.5 or smaller for the code to be efficient.
This sets $N \simeq \log 0.5/\log 0.99 = 69$.
The Huffman code would then have $2^{69}$ entries in its tree,
which probably exceeds the memory capacity of all the computers
in this universe and several others.
There are other ways that we could describe the data stream. One
is run-length encoding. We could chop the source into
the substrings ${\tt{1}},{\tt{01}},{\tt{001}},{\tt{0001}},{\tt{00001}},\ldots$ with the last elements
in the set being, say, two strings of equal maximum length
${\tt{00}}\ldots{\tt{01}}$ and ${\tt{00}}\ldots{\tt{00}}$.
We can give names to each of these strings and compute their
probabilities, which are not hugely dissimilar to each other.
This list of probabilities starts $\{ 0.01, 0.0099, 0.009801 , \ldots\}$.
For this code to be efficient, the string with largest probability
should have probability about 0.5 or smaller; this means that we would
make a code out of about 69 such strings. It is perfectly feasible to
make such a code. The only difficulty with this code is the issue
of termination. If a sparse file ends with a string of 20 {\tt 0}s
still left to transmit, what do we do? This problem has arisen
because we failed to include the end-of-file character
in our source alphabet. The best solution to this
problem is to use an arithmetic code as described in the next chapter.
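The two numerical claims above can be checked directly (natural logarithms are used in this sketch; any base gives the same ratio):
\beq
 0.99^N = 0.5 \:\: \Rightarrow \:\: N = \frac{ \ln 0.5 }{ \ln 0.99 } = \frac{ -0.6931 }{ -0.01005 } \simeq 69 ,
\eeq
and the run-length string consisting of $k$ {\tt{0}}s followed by a {\tt{1}} has probability
\beq
 0.99^{k} \times 0.01 , \:\:\: k = 0, 1, 2, \ldots ,
\eeq
which reproduces the list $0.01, 0.0099, 0.009801, \ldots$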
}
\ENDcuttable
\soln{ex.poisonglass}{
The poisoned glass problem was intended to have the solution `129',
this being the only number of the form $2^m + 1$
% power-of-two plus one
between 100 and 200.
However the optimal strategy, assuming all glasses have equal probability,
is to design a Huffman code for the glasses. This produces a binary
tree in which each pair of branches have almost equal weight.
On the first measurement, either
64 or 65 of the glasses are tested. (Given the
assumption that one of the glasses is poisoned, it makes no difference
which; however, going for 65 might be viewed as preferable if there
were any uncertainty over this assumption.) There is a 2/129 probability
that an extra test is needed after seven tests have occurred. So the
expected number of tests is 7$\frac{2}{129}$, whereas the
strategy of the professor takes 8 tests with probability $128/129$
and one test with probability $1/129$, giving
a mean number of tests $7\frac{122}{129}$. The expected waste is $40/43$
tests.
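The arithmetic behind these figures is as follows: with 129 equiprobable glasses the Huffman code has 127 codewords of length 7 and two of length 8, so
\beq
 7 \times \frac{127}{129} + 8 \times \frac{2}{129} = 7\frac{2}{129} , \:\:\:\:\:
 8 \times \frac{128}{129} + 1 \times \frac{1}{129} = \frac{1025}{129} = 7\frac{122}{129} ,
\eeq
and the difference is $120/129 = 40/43$.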
% glasses, pairing them
}
\section*{Extra Solutions for Chapter \ref{ch.four}}
\soln{ex.ac.vs.huffman}{
Let's assume there are 128 viable ASCII characters.
Then the Huffman method has to start by communicating
128 integers, each of which could in principle be as large as 127 or as small as 1,
but plausible values will range from 2 to 17. There are correlations among these integers:
if one of them is equal to 1, then none of the others can be 1.
For practical purposes we might say that all the integers must be between 1 and 32
and use a binary code to represent them in 5 bits each.
Then the header will have a size of $5 \times 128 = 640$ bits.
If the file to be compressed is short -- 400 characters, say --
then (taking 4 as a plausible entropy per character, if the frequencies are
known)
the compressed length would be 640 (header) + 1600 (body) $\simeq 2240$, if the
compression of the body is optimal.
For any file much shorter than this, the header is clearly going to dominate the
file length.
When we use the Laplace model, the probability distribution over characters
starts out uniform and remains roughly so until roughly 128 characters have
been read from the source.
In contrast, the Dirichlet model with $\alpha = 0.01$ only requires about 2 characters
to be read from the source for its predictions to be strongly swung in favour of
those characters.
For sources that do use just a few characters with high probability, the Dirichlet model
will be better. If actually all characters are used with near-equal probability
then $\alpha=1$ will do better.
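To make the contrast concrete, here is a sketch that assumes the predictive rule has the form
$P(a \given \mbox{counts}) = (F_a + \alpha)/(\sum_{a'} F_{a'} + 128 \alpha)$, where $F_a$ is the number of times character $a$ has been seen so far
and $\alpha=1$ gives the Laplace model (this form is an assumption about the models being compared).
After a single character {\tt a} has been read, the probability that the next character is also {\tt a} is
\beq
 \frac{ 1 + 1 }{ 1 + 128 } \simeq 0.016 \:\: (\alpha = 1) , \:\:\:\:\:
 \frac{ 1 + 0.01 }{ 1 + 1.28 } \simeq 0.44 \:\: (\alpha = 0.01) .
\eeq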
The special case of a large file made entirely of equiprobable 0s and 1s is interesting.
The Huffman algorithm has to assign codewords to all the other characters.
It will assign one of the two used characters a codeword of length 1,
and the other gets length 2. The expected filelength is thus more than $(3/2)N$,
where $N$ is the source file length. The arithmetic codes will give an expected filelength
that asymptotically is $\sim N$.
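The figure of $3/2$ comes from averaging the two codeword lengths, each used half the time, and should be compared with an entropy of one bit per character:
\beq
 \frac{1}{2} \times 1 + \frac{1}{2} \times 2 = \frac{3}{2} \:\: \mbox{bits per character} , \:\:\:\:\: H_2(1/2) = 1 \ubit .
\eeq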
It is also interesting to talk through the case where one character has
huge probability, say 0.995. Here, the arithmetic codes give
a filelength that's asymptotically less than $N$, and the Huffman
method tends to $N$ from above.
}
\soln{ex.Clengthen}{
Assume a code maps all strings onto strings of the same length or shorter.
Let $L$ be the length of the {\em shortest\/} string that is made shorter
by this alleged code, and let that string be mapped to an output string of
length $l$.
Take the set of all input strings of length less than or equal to
$l$, and count them. Let's say there are $n^{\rm in}(l)$ of length $l$.
[$n^{\rm in}(l) = A^{l}$, where $A$ is the alphabet size.]
Now, how many output strings of length $l$ do these strings generate?
Well, for any length $ 2^{100}$)
uses the same idea but
first encodes the number of levels of recursion that the
encoder will go through, using any convenient prefix code
for integers, for example $C_{\omega}$; then the encoder
can use $c_B(n)$ instead of $c_b(n)$ to encode
each successive integer in the recursion, and can omit the terminal zero.
% does not need to transmit all those initial 1's and the final zero. #
% Maybe this code should be called $C_{\omega^{\omega}}$?
%
%I would
% need to check with John Conway whether there is a connection between
% these codes and transcendental numbers.
%%%%%%%%%%%%%
% This could go on for ever. I thought that the Elias system put a cap on it.
% Pity! This transition at $2^{100}$: that's a big number, for an integer,
% but for a file, it's just 100 bits. A decent file is about 1Meg, say,
% that's $8*1024^{2} = 2^{23}$ bits, which is a number about $2^{2^{23}}$.
% How many levels of recursion? 1 gives a meg; 2 gives 23; 3 gives 4;
% 4 gives 2; 5 gives 1. 6 is it. By how much does the new scheme win? Presumably
% about 6 bits at most.
%
% Elias has a good way of overlining the strings which makes clear
% how they are made.
}
\section*{Extra Solutions for Chapter \ref{ch.prefive}}
\soln{ex.Huvw}{
\beq
H(X,Y) = H(U,V,V,W) = H(U,V,W) = H_u + H_v + H_w.
\eeq
\beq
H(X\given Y) = H_u.
\eeq
\beq
\I(X;Y) = H_v .
\eeq
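These follow from the chain rule, in a sketch that assumes (as the first line suggests) $X=(U,V)$ and $Y=(V,W)$ with $U$, $V$, and $W$ independent:
\beqan
 H(X \given Y) &=& H(X,Y) - H(Y) = (H_u + H_v + H_w) - (H_v + H_w) = H_u , \\
 \I(X;Y) &=& H(X) - H(X \given Y) = (H_u + H_v) - H_u = H_v .
\eeqan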
}
\soln{ex.Hdist}{
The {entropy distance}:
\beq
D_H(X,Y) \equiv H(X,Y) - \I(X;Y) = \sum_{x,y} P(x,y) \log \frac{P(x)P(y)}
{P(x,y)^2}
\eeq
is fairly easily shown to satisfy
the first three axioms $D_H(X,Y)\geq 0$, $D_H(X,X)=0$,
$D_H(X,Y)\eq D_H(Y,X)$.
A proof that it obeys the triangle inequality is not
so immediate. It helps to know in advance what the difference
$D(X,Y) + D(Y,Z) - D(X,Z)$ should add up to; this is most easily
seen by first making a picture in which the quantities
$H(X), H(Y),$ and $H(Z)$ are represented by overlapping areas,
\cf\ \figref{fig.venn} and \exerciseref{ex.venn}. Such a picture indicates that
$D(X,Y) + D(Y,Z) - D(X,Z)= 2 H(Y\given X,Z) + 2 \I(X;Z\given Y)$.
% 1.11.
%
% Here is my proof:
\beqan
\lefteqn{ D(X,Y) + D(Y,Z) - D(X,Z)}
\nonumber \\
&=& \sum_{x,y,z} P(x,y,z)
\log \frac{ P(x)P(y)P(y)P(z)P(x,z)^2}
{P(x,y)^2 P(x) P(z) P(y,z)^2}
\\ &=& 2 \sum_{x,y,z} P(x,y,z)
\log \frac{P(x,z)P(x,z\given y)}
{P(x,y,z)P(x\given y)P(z\given y)}
% corrected dec 98::::::::::
\\ &=& 2 \sum_{x,y,z} P(x,y,z) \left[
\log \frac{1}{P(y\given x,z)}+\log \frac{P(x,z\given y)}{P(x\given y)P(z\given y)}
\right]
\\ &=& 2 \sum_{x,z} P(x,z) \sum_{y} P(y\given x,z)
\log \frac{1}{P(y\given x,z)}+
\\ & &
\:\:\:\:\:
2\sum_{y} P(y) \sum_{x,z} P(x,z\given y)
\log \frac{P(x,z\given y)}{P(x\given y)P(z\given y)}
\\ &=& 2 \sum_{x,z} P(x,z) H(Y\given x,z) + 2 \sum_{y} P(y) \I(X;Z\given y)
\\ &=& 2 H(Y\given X,Z) + 2 \I(X;Z\given Y) .
\eeqan
The quantity $\I(X;Z\given Y)$ is a conditional mutual information, which
like a mutual information is non-negative. The other term
$H(Y\given X,Z)$
% and $\I(X;Z\given Y)$ are
is also non-negative, so $D(X,Y) + D(Y,Z) - D(X,Z)\geq 0$.
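The same result can be read off from the picture (a sketch using the identity $D_H(X,Y) = H(X\given Y) + H(Y\given X)$, which follows from the definition).
Each conditional entropy splits into two regions, for example $H(X\given Y) = H(X\given Y,Z) + \I(X;Z\given Y)$, and after cancellation
\beqan
 \lefteqn{ D(X,Y) + D(Y,Z) - D(X,Z) } \nonumber \\
 &=& H(X\given Y) + H(Y\given X) + H(Y\given Z) + H(Z\given Y) - H(X\given Z) - H(Z\given X) \\
 &=& 2 H(Y\given X,Z) + 2 \I(X;Z\given Y) .
\eeqan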
}
% the above is from \input{tex/entropy_soln.tex}
\soln{ex.threecards}{
Seeing the top of the card {\em does\/} convey information
about the colour of its other side. \Bayes\ theorem
allows us to draw the correct inference in any given case,
and Shannon's mutual information is the measure of how much
information is conveyed, on average.
This inference problem is
equivalent to the three doors problem.
One quick way to justify the answer without writing down \Bayes\
theorem is `The probability that the lower face
is opposite in colour to the upper face is always $1/3$,
since only one of the three cards has two opposite colours
on it'.
The joint probability of the two colours is
\beq
\begin{array}{c|cc}
\multicolumn{1}{c}{P(u,l)} & u=0 & u=1 \\ \cline{2-3}
l=0 & \dfrac{1}{3} & \dfrac{1}{6} {\rule{0cm}{14pt}}
\\
l=1 & \dfrac{1}{6} & \dfrac{1}{3}
\end{array}
\eeq
The marginal entropies are $H(U) = H(L) = 1\ubit$,
and the mutual information is
\beq
I(U;L) = 1 - H_2(\dfrac{1}{3}) = 0.08\ubits.
\eeq
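Spelled out, the conditional distribution from the table is $P(l\eq1 \given u\eq1) = (1/3)/(1/2) = 2/3$, and
\beq
 H(L \given U) = H_2(\dfrac{1}{3}) \simeq 0.918 \ubits , \:\:\:
 I(U;L) = H(L) - H(L \given U) \simeq 1 - 0.918 = 0.082 \ubits .
\eeq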
%
% It is intriguing how different a unit of value information
% is from the expected return on a bet. A bookie or gambler
% would pay a lot for information that turns 1/2 into 1/3 or 1/4,
% even though the change in info content is only 1 - 0.918
%
}
\section*{Extra Solutions for Chapter \ref{ch.five}}
\soln{ex.fiveC}{
The conditional entropy of $Y$ given $X$ is
$H(Y\given X) = \log 4$.
The entropy of $Y$ is at most $H(Y) = \log 10$,
which is achieved by using a uniform input distribution.
The capacity is therefore
\beq
C \:=\: \max_{\P_X}\, H(Y) - H(Y\given X) \:=\: \log \dfrac{10}{4} \:=\: \log \dfrac{5}{2} \ubits.
\eeq
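Numerically, taking logarithms to base 2,
\beq
 C = \log_2 10 - \log_2 4 \simeq 3.32 - 2 = 1.32 \ubits .
\eeq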
}
\newpage
\soln{ex.twos}{
%\begin{figure}[htbp]
\amarginfig{t}{
\begin{center}
\mbox{\psfig{figure=figs/random2.ps}}
\\[0.15in]%\hspace{0.42in}
\mbox{\psfig{figure=figs/2random2.ps}}
\\[0.15in]%\hspace{0.42in}
\mbox{\psfig{figure=figs/6random2.ps}}
\\[0.15in]%\hspace{0.42in}
\mbox{\psfig{figure=figs/7random2.ps}}
\end{center}
\caption[a]{Four random samples from the set of $2^{219}$ `miniature 2s'
defined in the text.}
\label{fig.random2}
%\end{figure}
}%end{marginfig}
% made using figs/random2.ps seed=7
\begincuttable
The number of recognizable `2's is best estimated by\index{2s}\index{counting 2s}
concentrating on the type of patterns
that make the greatest contribution
to this sum. These are patterns
in which just a small patch of
the pixels make the shape of a 2 and most of the other pixels are
set at random.
It is unlikely that the random
pixels will take on some other
recognizable shape, as we will confirm later.
A recognizable letter 2 surrounded by a white border
can be written in $6\times 7$ pixels. This leaves
% 256 - 42
214 pixels that can be set arbitrarily, and
there are also $12 \times 11$ possible locations for
the miniature 2 to be placed, and two colourings (white on black / black on white).
There are thus
about $12 \times 11 \times 2 \times 2^{214} \simeq 2^{219}$
% 6 \times 10^{66}$
% 3.475
miniature 2 patterns, almost all of which are
recognizable only as the character 2.
This claim that the noise pattern will not look like some other
character is confirmed by noting what a small fraction of
all possible patterns the above number of 2s is. Let's assume there
are 127 other characters to worry about.
Only a fraction $2^{-37}$ of the $2^{256}$ random patterns
are recognizable 2s, so similarly, of the
$2^{219}$ miniature 2 patterns identified above,
only a fraction of about $127 \times 2^{-37}$
of them also contain another recognizable character. These double-hits
decrease undetectably the above answer, $2^{219}$.
Another way of estimating the entropy of a 2, this time banning
the option of including background noise, is to consider the number of
{\em decisions\/} that are made in the construction of a font.
A font may be {\bf bold (2)} or not bold; {\em italic (2)\/}
or not;
% {calligraphic $\cal (2)$} or not;
% it may be
{\sf sans-serif (2)} or not. It may be
normal size (2), {\small small (2)} or {\tiny tiny (2)}. It may be
{calligraphic},
futuristic, modern, or gothic. Most of these choices are independent.
So we have at least $2^3 \times 3 \times 4 = 96$ distinct fonts.
I imagine that \index{Knuth, Donald}{Donald Knuth}'s {\sc metafont}, with the aid of which
this document was produced, could turn each of these axes of variation
into a continuum so that arbitrary intermediate fonts can also
be created. If we can distinguish, say, five degrees of boldness, ten degrees
of italicity, and so forth, then we can imagine creating perhaps
$10^{6}\simeq 2^{20}$
distinguishable fonts, each with a distinct 2. Extra
parameters such as loopiness and spikiness could further increase
this number. It would
be interesting to know how many distinct 2s {\sc metafont} can actually
produce in a 16 $\times$ 16 box.
The entropy of the probability distribution $P(y \given x\eq 2)$ depends
on the assumptions about noise and character size. If we assume that
noise is unlikely, then the entropy may be roughly equal to the number
of bits to make a clean 2 as discussed above.
The possibility of noise increases the entropy. The largest it could
plausibly be is the logarithm of the number derived above
for the number of patterns that are recognizable as a 2, though I suppose
one could argue that when someone writes a 2, they may end up producing
a pattern $\by$ that is not recognizable as a 2. So the entropy
could be even larger than 220 bits. It should be noted, however, that
if there is a 90\% chance that the 2 is a clean 2, with entropy
20 bits, and only a 10\% chance that it is a miniature 2 with noise,
with entropy 220 bits,
then the entropy of $y$ is $H_2(0.1) + 0.1 \times 220 + 0.9 \times
20 \simeq 40$ bits, so the entropy would be much smaller
than 220 bits.
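The arithmetic here uses the decomposition of the entropy of a mixture of two components (a sketch; it assumes the set of clean 2s and the set of noisy miniature 2s do not overlap):
\beq
 H_2(0.1) + 0.9 \times 20 + 0.1 \times 220 \simeq 0.47 + 18 + 22 \simeq 40 \ubits .
\eeq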
% Notice also, in this case, that the probability
% distribution $P(y \given x\eq 2)$ would not be at all uniform over this
% set of recognizable 2s. The probability of a $y$ in the `clean' set
% would be about $2^{-20}$, and the probability of a noisy
% miniature 2 pattern would be much smaller, about $2^{-220}$.
\ENDcuttable
%
% point to CKIW paper
%
}
\soln{ex.birthdaycode}{
The probability of error is the probability that the selected message
is not uniquely decodeable by the receiver, \ie, it is the probability that
one or more of the $S\!-\!1$ other people has the same birthday
as our selected person, which is
\beq
1 - \left(\frac{A-1}{A}\right)^{S-1} \:\: = \:\:
1-0.939 \:\: = \:\: 0.061 .
\eeq
The capacity of the communication channel is $\log 365 \simeq 8.5$ bits.
The rate of communication attempted is $\log 24 \simeq 4.6$ bits.
So we are transmitting substantially below the capacity of this
noiseless channel, and our communication scheme has an appreciable
probability of error (6\%). Random coding looks a rather silly idea.
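With the values used here, $A=365$ and $S=24$, the numbers work out as
\beq
 \left( \frac{364}{365} \right)^{23} = \exp \left( 23 \ln \frac{364}{365} \right) \simeq \exp( -0.063 ) \simeq 0.939 .
\eeq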
}
\soln{ex.birthdaycodeb}{
The number of possible $K$-tuples is $A^K$, and we select
$q^K$ such $K$-tuples, where $q$ is the number of people in
each of the $K$ rooms.
The probability of error is the probability that the selected message
is not uniquely decodeable by the receiver,
\beq
1 - \left(\frac{A^K-1}{A^K}\right)^{q^K-1} .
\label{eq.Krooms}
\eeq
In the case $q=364$ and $K=1$ this probability of error is
\beq
1 - \left(1-\frac{1}{A}\right)^{q-1}
\:\: \simeq \:\: 1 - e^{-(q-1)/A} \:\: \simeq \:\: 1-e^{-1} \:\: =
\:\: 0.63 .
\eeq
[The exact answer found from \eqref{eq.Krooms} is 0.631.]
Thus random coding is highly likely to lead to a communication failure.
As $K$ gets large, however, we can approximate
\beq
1 - \left(\frac{A^K-1}{A^K}\right)^{q^K-1}
= \:\: 1 - \left(1 - \frac{1}{A^K}\right)^{q^K-1}
\simeq \:\:
\frac{q^K-1}{A^K} \:\: \simeq
\:\: \left(\frac{q}{A}\right)^K \!\! .
\eeq
In the example $q=364$ and $A=365$, this probability of error
decreases as $10^{-0.0012 K}$, so, for example, if $K \simeq 6000$ then
the probability of error is smaller than $10^{-6}$.
For sufficiently large blocklength $K$, \index{random code}{random coding}
becomes a reliable, albeit bizarre, coding strategy.
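The exponent can be checked as follows:
\beq
 \left( \frac{364}{365} \right)^{K} = 10^{ - K \log_{10} (365/364) } \simeq 10^{-0.0012 K} ,
\eeq
so the approximate probability of error falls below $10^{-6}$ once $K$ exceeds about $6/0.0012 = 5000$; $K \simeq 6000$ gives a comfortable margin.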
}
\section*{Extra Solutions for Chapter \ref{ch.six}}
\soln{ex.m.s.I.aboveC}{
Consider a string of bit pairs $b_k,\hat{b}_k$, having the property that
$\sum_{k=1}^K P( \hat{b}_k \neq b_k ) / K = p_{\rm b}$.
These bits are concatenated in blocks of size $K = NR$ to define the
quantities $s$ and $\hat{s}$. Also, $P(b_k\eq 1) = 1/2$.
We wish to show that these properties imply
$\I(\cwm;\hat{\cwm }) \geq
K ( 1 - H_2(p_{\rm b}))$,
regardless of whether there are correlations among the bit errors.
%
% $\I(\cwm;\hat{\cwm }) = H( \cwm ) - H( \cwm \given \hat{\cwm })$.
% If we now show an {\em upper\/} bound on the second quantity $H( \cwm \given \hat{\cwm })$,
% we are establishing a lower bound on $\I(\cwm;\hat{\cwm })$, which is
% what we are after.
% Let's denote by $\be = \{ e_k \}_{k=1}^K$ the string of errors: $e_k = 1$ if
% $\hat{b}_k \neq b_k$.
% Then, since $\cwm$ $H( \cwm \given \hat{\cwm }) = H($
More to come here.
}
\soln{ex.exam01}{
$I(X;Y) = H(X)-H(X\given Y)$.
$I(X;Y) = H_2(p_0)- q H_2(p_0)$.
Maximize over $p_0$, get $C=1-q$.
The $(2,1)$ code is $\{ {\tt{01}}, {\tt{10}} \}$.
With probability $q$, the {\tt{1}} is lost, giving
the output {\tt{00}}, which is equivalent to the ``{\tt{?}}''
output of the Binary Erasure Channel. With probability
$(1-q)$ there is no error; the two input words and the same
two output words are identified with the {\tt{0}} and
{\tt{1}} of the BEC. The equivalent BEC has erasure probability $q$.
Now, this shows the capacity of the Z channel is at least
half that of the BEC.
This result is a bound, not an equality, because
our code constrains the input distribution to be 50:50,
which is not necessarily optimal, and because
we've introduced simple anticorrelations among successive bits, which
optimal codes for the channel would not do.
}
\section*{Extra Solutions for Chapter \ref{ch.ecc}}
\soln{ex.productorder}{
In a nutshell, the encoding operations involve `additions' and `multiplies',
and these operations are associative.
Let the source block be $\{ s_{k_2 k_1} \}$
and the transmitted block be $\{ t_{n_2 n_1} \}$.
Let the two generator matrices be $\bG^{(1)}$ and $\bG^{(2)}$.
To conform to convention, these matrices have to be transposed if they
are to right-multiply.
If we encode horizontally first, then
the intermediate vector is
\beq
u_{k_2 n_1} = \sum_{k_1} G^{(1)\T}_{n_1 k_1} s_{k_2 k_1}
\eeq
and the transmission is
\beqan
t_{n_2 n_1} &=& \sum_{k_2} G^{(2)\T}_{n_2 k_2} u_{k_2 n_1} \\
&=& \sum_{k_2} G^{(2)\T}_{n_2 k_2} \sum_{k_1} G^{(1)\T}_{n_1 k_1} s_{k_2 k_1} .
\eeqan
Now, by the associative property of addition and multiplication,
we can reorder the summations and multiplications:
\beqan
t_{n_2 n_1}
&=& \sum_{k_1} \sum_{k_2} G^{(2)\T}_{n_2 k_2} G^{(1)\T}_{n_1 k_1} s_{k_2 k_1} \\
&=& \sum_{k_1} G^{(1)\T}_{n_1 k_1} \sum_{k_2} G^{(2)\T}_{n_2 k_2} s_{k_2 k_1} .
\eeqan
This is identical to what happens if we encode vertically first, getting intermediate
vector
\beq
v_{n_2 k_1} = \sum_{k_2} G^{(2)\T}_{n_2 k_2} s_{k_2 k_1}
\eeq
then transmitting
\beq
t_{n_2 n_1} = \sum_{k_1} G^{(1)\T}_{n_1 k_1} v_{n_2 k_1} .
\eeq
}
\soln{ex.codeslinear}{
The fraction of all codes that are linear is absolutely tiny.
We can estimate the fraction by counting how many linear codes there are
and how many codes in total.
A linear $(N,K)$ code can be defined by the $M=N-K$ constraints
that it satisfies.
The constraints can be defined by an $M \times N$ parity-check matrix.
Let's count how many parity-check matrices there are,
then correct for over-counting in a moment.
There are $2^{MN}$ distinct parity-check matrices.
Most of these have nearly full rank.
If the rows of the matrix are rearranged, that makes no difference to
the code. Indeed, you can multiply the matrix $\bH$ by
any square invertible matrix, and there is no change
to the code. Row-permutation is a special case of multiplication by
a square matrix. So the size of the equivalence classes of
parity-check matrix is $2^{M^2}$. (For every parity-check matrix,
there are $2^{M^2}$ ways of expressing it.)
So the number of different linear codes is $2^{MN}/2^{M^2} = 2^{MK}$.
The total number of codes is the number of choices of $2^K$ words
from the set of $2^N$ possible words,
which is
${2^N} \choose {2^K}$,
which is approximately
\beq
\frac{ (2^N)^{2^K} }{ (2^K)! } = \frac{ 2^{N2^K} }{ (2^K)! } .
\eeq
The fraction required is thus
\beq
\frac{2^{N^2 R(1-R)} (2^K)! }{ 2^{N2^K} } .
\eeq
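To get a feel for the size of this fraction, here is a rough evaluation for $N=7$, $K=4$, the parameters of the $(7,4)$ Hamming code. This choice is purely illustrative and not part of the exercise, and the binomial approximation above is crude for such small $N$:
\beq
 \log_2 \frac{2^{N^2 R(1-R)} (2^K)! }{ 2^{N2^K} }
 = MK + \log_2 (16!) - N 2^K
 \simeq 12 + 44.3 - 112 \simeq -56 ,
\eeq
so only about one code in $2^{56}$ is linear, and the fraction shrinks rapidly as $N$ grows.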
}
\soln{ex.qeccode}{
%\subsection
{\sf A code over $GF(8)$}
\label{sec.extra.gf8}
We can denote the elements of $GF(8)$ by $\{0,1,A,B,C,D,E,F\}$.
Each element can be mapped onto a polynomial over $GF(2)$.
\beq
\begin{array}{ccc} \toprule
\mbox{element} & \mbox{polynomial} & \mbox{binary representation}
\\ \midrule
0 & 0 & {\tt 000} \\
1 & 1 & {\tt 001}\\
A & x & {\tt 010} \\
B & x + 1 & {\tt 011} \\
C & x^2 & {\tt 100} \\
D & x^2 + 1 & {\tt 101} \\
E & x^2 +x & {\tt 110} \\
F & x^2 +x + 1 & {\tt 111} \\ \bottomrule
\end{array}
\eeq
The multiplication and addition operations are given by
multiplication and addition of the polynomials, modulo
$x^3+x+1$.
Here is the multiplication table:
\beq
\begin{array}{c|*{8}{c}|} \cdot
& 0&1&A&B&C&D&E&F\\
\hline
%%%%%%%%%%%
0& 0&0&0&0&0&0&0&0\\
1& 0&1&A&B&C&D&E&F\\
A& 0&A&C&E&B&1&F&D\\
B& 0&B&E&D&F&C&1&A\\
C& 0&C&B&F&E&A&D&1\\
D& 0&D&1&C&A&F&B&E\\
E& 0&E&F&1&D&B&A&C\\
F& 0&F&D&A&1&E&C&B\\ \hline
\end{array}
\eeq
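As an illustration of how the entries of this table arise (a worked check, using only the reduction $x^3 = x+1$ implied by the modulus), consider the product $A \cdot D$:
\beq
 A \cdot D = x ( x^2 + 1 ) = x^3 + x = (x+1) + x = 1 ,
\eeq
in agreement with the entry in row $A$, column $D$. Similarly, $C \cdot C = x^4 = x \cdot x^3 = x^2 + x = E$.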
Here is a (9,2) code over $GF(8)$ generated by the
generator matrix
\beq
\bG = \left[ \begin{array}{*{9}{c}}
1 &0 &1 &A &B &C &D &E &F \\
0 &1 &1 &1 &1 &1 &1 &1 &1 \\
\end{array} \right]
\eeq
{\tiny\tt
\begin{center}
\begin{tabular}{*{8}{c}}
000000000 &
011111111 &
0AAAAAAAA &
0BBBBBBBB &
0CCCCCCCC &
0DDDDDDDD &
0EEEEEEEE &
0FFFFFFFF
\\
101ABCDEF &
110BADCFE &
1AB01EFCD &
1BA10FEDC &
1CDEF01AB &
1DCFE10BA &
1EFCDAB01 &
1FEDCBA10
\\
A0ACEB1FD &
A1BDFA0EC &
AA0EC1BDF &
AB1FD0ACE &
ACE0AFDB1 &
ADF1BECA0 &
AECA0DF1B &
AFDB1CE0A
\\
B0BEDFC1A &
B1AFCED0B &
BA1CFDEB0 &
BB0DECFA1 &
BCFA1B0DE &
BDEB0A1CF &
BED0B1AFC &
BFC1A0BED
\\
C0CBFEAD1 &
C1DAEFBC0 &
CAE1DC0FB &
CBF0CD1EA &
CC0FBAE1D &
CD1EABF0C &
CEAD10CBF &
CFBC01DAE
\\
D0D1CAFBE &
D1C0DBEAF &
DAFBE0D1C &
DBEAF1C0D &
DC1D0EBFA &
DD0C1FAEB &
DEBFAC1D0 &
DFAEBD0C1
\\
E0EF1DBAC &
E1FE0CABD &
EACDBF10E &
EBDCAE01F &
ECABD1FE0 &
EDBAC0EF1 &
EE01FBDCA &
EF10EACDB
\\
F0FDA1ECB &
F1ECB0FDA &
FADF0BCE1 &
FBCE1ADF0 &
FCB1EDA0F &
FDA0FCB1E &
FE1BCF0AD &
FF0ADE1BC
\\
\end{tabular}
\end{center}
}
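Each of the 64 words listed is $a$ times the first row of $\bG$ plus $b$ times the second, for a source pair $(a,b)$. As a worked check (the two source pairs are chosen arbitrarily), $(a,b)=(1,1)$ gives
\beq
 (1,0,1,A,B,C,D,E,F) + (0,1,1,1,1,1,1,1,1) = (1,1,0,B,A,D,C,F,E) ,
\eeq
since addition in $GF(8)$ is the bitwise exclusive-or of the binary representations (for example, $A+1 = {\tt 010} + {\tt 001} = {\tt 011} = B$); and $(a,b)=(A,0)$ gives $A \cdot (1,0,1,A,B,C,D,E,F) = (A,0,A,C,E,B,1,F,D)$, read off from row $A$ of the multiplication table.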
Further exercises that can be based on this example:
\exercisaxB{2}{ex.gf8.perfect}{
% (a)
Is this code a perfect code?
}
%(b)
\exercisaxB{2}{ex.gf8.mds}{
Is this code a maximum distance separable code?
}
}
\subsubsection*{Extra Solutions}
\soln{ex.gf8.perfect}{
The $(9,2)$ code has $M=7$ parity checks, and its
distance is $d=8$. If the code were perfect, then
all points would be at a distance of at most $d/2$
from the nearest codeword, and each point would only have
one nearest codeword.
% which would be unique for each point.
The $(9,2)$ code is not a perfect code.
Any code with even distance cannot be a perfect code
because it must have vectors that are equidistant
from the two nearest codewords, for example,
{\tt 000001111} is at Hamming distance 4 from both {\tt 000000000}
and {\tt 011111111}.
We can also find words that are at a distance
greater than $d/2$ from all codewords, for example
{\tt 111110000}, which is at a distance of five or more
from all codewords.
}
\soln{ex.gf8.mds}{
The $(9,2)$ code is maximum distance separable.
It has $M=7$ parity checks, and when any 7 characters
in a codeword are erased we can restore them from the remaining two.
{\sf Proof:} any two by two submatrix of $\bG$ is invertible.
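To spell out this step (a sketch): the columns of $\bG$ are $(1,0)^{\T}$ together with $(x,1)^{\T}$ for each of the eight elements $x$ of $GF(8)$. Two columns of the second kind give the determinant
\beq
 \det \left[ \begin{array}{cc} x & y \\ 1 & 1 \end{array} \right] = x - y \neq 0 \:\:\: \mbox{for $x \neq y$} ,
\eeq
and the column $(1,0)^{\T}$ paired with any $(x,1)^{\T}$ gives determinant 1. So any two surviving characters determine the two source characters.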
% {\em (more here).}
}
\section*{Extra Solutions for Chapter \ref{ch.hash}}
\soln{ex.address}{
$\log_{36} 6{,}000{,}000{,}000 = 6.3$,
so a 7-character address could suffice, if we had no redundancy.
One useful internet service provided by shortURL.com is the
service of turning huge long URLs into tiny ones, using the
above principle.
Email addresses can be as short as four characters (I know m@tc),
but roughly 15 is typical.
% djcm1@cam.ac.uk
}
\section*{Extra Solutions for Chapter \ref{ch.linearecc}}
\soln{ex.poormancoding}{
With $\beta(f) = 2 f^{1/2} (1-f )^{1/2}$, combining (\ref{eq.wef.random}) and (\ref{eq.unionB}),
the average probability of error of all linear codes is bounded by
\beq
\langle P(\mbox{block error}) \rangle
\leq \sum_{w>0} \langle A(w) \rangle [\beta(f)]^w
\simeq
\sum_{w>0} 2^{N [ H_2(w/N) - (1-R) ] } [\beta(f)]^w
\label{eq.unionBsoln}
\eeq
This is a sum of terms that either grow or shrink exponentially with $N$,
depending whether the first factor or the second dominates.
We find the dominant term in the sum over $w$ by differentiating the exponent.
\beq
\frac{\d}{\d w} \Bigl\{ N [ H_2(w/N) - (1-R) ] + w \log \beta(f) \Bigr\}
= \log \frac{ 1-(w/N) }{ w/N } + \log \beta(f) .
\eeq
Setting this derivative to zero, we find that the maximum is at
\beq
\frac{ w/N }{ 1-(w/N) } = \beta(f)
\eeq
\ie,
\beq
{ w/N } = \frac{ \beta(f) }{ 1+ \beta(f) } = \frac{1}{1+1/ \beta(f) } .
\eeq
We require the exponent
\beq
N [ H_2(w/N) - (1-R) ] + w \log \beta(f)
\eeq
to be negative at this point; we can then guarantee that
the average error probability vanishes as $N$ increases.
\marginfig{
\begin{center}
\mbox{%
\small
\hspace{-0.01in}%
\begin{tabular}{c}
\hspace{0.015in}\mbox{\psfig{figure=/home/mackay/itp/gnu/PoorManCoding.ps,%
width=41.5mm,angle=-90}}\\[-0.01in]
\end{tabular}
}
\end{center}
%}{%
\caption[a]{Poor man's capacity (\ref{eq.poormanresult}) compared with Shannon's.
}
\label{fig.poormanresult}
% load 'gnuR'
}
Plugging in the maximum-achieving $w/N$,
we have shown that the average error probability vanishes if
\beq
H_2\left( \frac{1}{1+1/ \beta(f) } \right) + \frac{1}{1+1/ \beta(f) } \log \beta(f) < (1-R) ,
\eeq
and we have thus proved a coding theorem, showing that reliable communication can be
achieved over the binary symmetric channel at rates up to at least
\beq
R_{\rm poor\, man} = 1 - \left[
H_2\left( \frac{1}{1+1/ \beta(f) } \right) + \frac{1}{1+1/ \beta(f) } \log \beta(f) \right] .
\label{eq.poormanresult}
\eeq
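The bracketed expression simplifies if the logarithms are taken to base 2, as they must be for consistency with $H_2$. Writing $x^* = \beta/(1+\beta)$, so that $1-x^* = 1/(1+\beta)$,
\beqan
 H_2(x^*) + x^* \log_2 \beta
 &=& - x^* \log_2 \frac{\beta}{1+\beta} - (1-x^*) \log_2 \frac{1}{1+\beta} + x^* \log_2 \beta \nonumber \\
 &=& \log_2 ( 1 + \beta ) ,
\eeqan
so $R_{\rm poor\, man} = 1 - \log_2 [ 1 + 2 f^{1/2} (1-f)^{1/2} ]$. As an illustrative value (the noise level is chosen here purely as an example), $f=0.1$ gives $\beta = 0.6$ and $R_{\rm poor\, man} \simeq 0.32$, compared with the Shannon capacity $1 - H_2(0.1) \simeq 0.53$.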
}
\section*{Extra Solutions for Chapter \ref{ch.linear}}
\soln{ex.HammingD}{
All the Hamming codes have
% minimum
distance $d=3$.
}
\soln{ex.estimate.wef}{
A code has a word of weight 1 if an entire column
of the parity-check matrix is zero. There is a chance of $2^{-M} = 2^{-360}$
that all entries in a given column are zero.
There are $M=360$ columns.
So the expected value at $w=1$ is
\beq
A(1) = M 2^{-M} = 360 \times
2^{-360} \simeq 10^{-106} .
\eeq
}
\soln{ex.handshakecode}{
This (15,5) code is unexpectedly good:
While the Gilbert distance for a (15,5) code is 2.6, the minimum distance of the code is 7.
The code
% it has minimum distance 7, so
can correct all errors of weight 1, 2, or 3.
The weight enumerator function is (1,0,0,0,0,0,0,15,15,0,0,0,0,0,0,1).
}
\soln{ex.findAwmonodec}{
%\begin{figure}
%\figuremargin{%
\marginfig{
\footnotesize
%\begin{tabular}{cccc}
\begin{tabular}{c}
%# weight enumerator of (15,6) code (monodec10.4)
%# w A(w)
\begin{tabular}{rr}
\toprule
$w$ & $A(w)$ \\ \midrule
%%%%%%%%%%%%%%%%%%%%%%%%%%%%
0 & 1 \\
5 & 12 \\
6 & 10 \\
8 & 15 \\
9 & 20 \\
10 & 6 \\ \midrule
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%55
Total & 64 \\ \bottomrule
\end{tabular}
%%%%%%%%%%%%%%%%%%
\\%&
\begin{tabular}{@{}c@{}}
\buckypsgraphb{monodec10.3Aw.ps}
\\[0.2in]
\buckypsgraphb{monodec10.3Aw.l.ps}
\end{tabular}
% &
% \buckypsfigw{pentagon.eps}
\\% see /home/mackay/code/bucky and ~/LOG
\end{tabular}
%}{
\caption[a]{
The
weight enumerator function of the pentagonful code
(solid lines). The
dotted lines show the
average weight enumerator function of all random linear codes with the
same size of generator matrix.
The lower
figure shows the same functions on a log scale.
While the Gilbert distance is 2.2, the minimum distance of the code is 5.
%%%%%%%%%%%%%%% CHECK %%%%%%%%%%%%%%%
% {\em (Check for cross-reference to earlier occurrence?)}
}
\label{fig.Awmonodec}
}
%\end{figure}
See \figref{fig.Awmonodec}.
}
\soln{ex.selforthog}{
Here's a suggested attack on this still-open problem.
[I use dual-containing as an abbreviation for
``having a self-orthogonal dual''.]
Pick an ensemble of low-density parity-check codes -- for example,
defined by making an $M \times N$ matrix in which every column
is a random vector of weight $j$.
Each column involves $j \choose 2$ pairs of rows.
There are a total of $N { j \choose 2}$ such pairs.
If the code is dual-containing, every
such pair must occur an even number of times, most probably twice.
Estimate the probability of every pair's occurring twice.
Multiply this probability by the total number of codes
in the ensemble to estimate the number that are dual-containing.
}
\soln{ex.one.word.ebno}{
\begin{figure}
%\mbox{\psfig{figure=gnu/gauerrorVdist.ps,width=5in}}
\mbox{\psfig{figure=gnu/gau4errorVdist.ps,angle=-90,width=4.5in}}
\caption[a]{Error probability associated with a single codeword of weight $d$
as a function of the rate-compensated signal-to-noise ratio $E_{\rm b}/N_0$.
Curves are shown for $d=10, 20, \ldots$ and for $R=1/2$, $2/3$, $3/4$,
and $5/6$. In each plot the Shannon limit for a code of that rate is indicated
by a vertical mark.
}
\end{figure}
The formula for the error probability produced by a single codeword of weight $d$ is
$\tilde{\Phi}( \sqrt{d} x )$, where $x$ is the signal-to-noise ratio and
$\tilde{\Phi}(u) = 1- \Phi(u)$
is the tail area of a unit normal distribution. In decibels, $E_{\rm b}/N_0= 10 \log_{10} \frac{x^2}{2R}$.
}
%\subsection*{Further stuff for
% \exerciseonlyref{ex.hat.puzzle}}
% Mathematicians credit the problem to Dr. Todd Ebert, a computer
% science instructor at the University of California at Irvine, who
% introduced it in his Ph.D. thesis at the University of California at
% Santa Barbara in 1998.
%\section*{Extra Solutions for Chapter \ref{ch.lineartypical}}
\section*{Extra Solutions for Chapter \ref{ch_fInfo}}
\soln{ex.X100}{
\ben
\item
% To have this many distinct codewords requires
$\lceil \log_2 166751 \rceil =$
18 bits.
\item
$1.67 \times 10^{-3}$
\een
}
\soln{ex.0001}{
\ben
\item
$H_2(0.4804) = 0.998891$.
\item
$%\beq
0.496 \times H_2(0.5597) + 0.504 \times H_2(0.6) = 0.9802 \,\mbox{bits}
$%\eeq
\item
1 bit.
\item
$H_2(0.6) = 0.9709$ bits.
\een
}
\soln{ex.dicetree}{
The optimal
symbol code (\ie, questioning strategy) has expected
length $3 \frac{11}{36}$.
}
%\soln{ex.Hinfty}{}
%\soln{ex.C0000}{}
%\soln{ex.sourcechannel}{}
\soln{ex.typical2488}{
$%\beq
|T|
% = 4\times{7} \,+\, 2\times{8}\times{7}\times{5} \,+\, 4\times{7}\times{6}\times{5}\,+\, 4\times{7}\times{6}\times{5}\,+\,
% 2\times{8}\times{7} \,+\, {8}\times{7}\times{6}
= 2716 .
$%\eeq
}
\soln{ex.fairstraws}{
An arithmetic coding
solution: use the coin to generate the
% post-decimal-point
bits of
a binary real number between $0.000\ldots$ and $0.11111\ldots$;
keep tossing until the number's position relative to
$0.010101010\ldots$ and
$0.101010101\ldots$ is apparent.
Interestingly, I think that the simple method
\begin{center}
HH: Tom wins;
HT: Dick wins;
TH: Harry wins;
TT: do over\end{center}
is slightly more efficient in terms of the expected number of tosses.
}
\soln{ex.C3channel}{
By symmetry, the \optens\ for the channel
\beq
Q = \left[ \begin{array}{ccc} 1 & 0 & 0 \\
0 & 1\!-\!\q & \q \\
0 & \q & 1\!-\!\q \end{array}\right]
\eeq
has the form $((1-p),p/2,p/2)$. The \optens\ is
given by
\beq
p^* = \frac{1}{1 + \displaystyle 2^{ H_2(\q)-1 } } .
\eeq
In the case $\q=1/3$,
$%\beq
p^* =
% \frac{1}{1 + \displaystyle 2^{ H_2(\q)-1 } } =
0.514
$
% so the \optens\ is $(0.486,0.257,0.257)$
and the capacity is
% print H(1/(1+2**(H(1/3.0)-1))) + 1/(1+2**(H(1/3.0)-1)) * (1-H(1/3.0))
% 1.04143
$C = 1.041$ bits.
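A sketch of where this comes from, assuming the input distribution has the symmetric form above: the output then has the same distribution as the input, so $H(Y) = H_2(p) + p$, while $H(Y \given X) = p H_2(\q)$. Thus
\beq
 I(X;Y) = H_2(p) + p \left[ 1 - H_2(\q) \right] ,
\eeq
and setting the derivative with respect to $p$ to zero,
\beq
 \log_2 \frac{1-p}{p} + 1 - H_2(\q) = 0 ,
\eeq
gives $(1-p^*)/p^* = 2^{ H_2(\q) - 1 }$, which rearranges to the expression for $p^*$ above.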
}
%\soln{ex.Herrors}{
%...
%}
\soln{ex.two.bsc.choose}{
% If $f=1/2$ and $g=0$ then $p=1/3$, t
The optimal input distribution
is $(1/6,1/6,1/3,1/3)$, and the capacity is $\log_2 3$ bits.
}
%
\subsection*{More details for \exerciseref{ex.isbn}}
For the first part, for any $x_{10}$,
$10 x_{10} = (11-1) x_{10} = - x_{10} \mod 11$, so
$\sum_{n=1}^{9} n x_n = x_{10} \mod 11$ implies $\sum_{n=1}^{9} n x_n + 10 x_{10} = 0\mod 11$.
For the ISBN code, any change to a single digit violates the checksum.
Any interchange of two digits
equal to $a$ and $b$,
separated by distance $s$ in the word (for example, $s=1$ for
adjacent digits)
produces a change in the checksum given by
\[
[ a n + b (n+s) - ( b n + a(n+s) ) ] \mod 11
= [ b s- a s ] \mod 11
= (b-a)s \mod 11
\]
Here $s$ is between 1 and 9. And $b-a$ is between $\pm 9$.
If $b-a=0$ then the digits are identical and their interchange
doesn't matter.
Now since 11 is prime, if $(b-a)s = 0 \mod 11$, then $b-a=0$.
So all interchanges of two digits that matter can be detected.
If we used modulo 10 arithmetic then several things would go wrong.
First, we would be unable to detect an interchange of the last
two adjacent digits. For example 91 and 19 both check out, if they
are the last two digits. Second, there would be other interchanges
of pairs of digits which would be undetected because
10 is not prime. So for example, \ldots005\ldots\ and \ldots500\ldots\
would be indistinguishable. (This example uses two digits differing
by 5 and separated by a space of size 2.)
Third, a minor point: the probability of detecting a completely bogus
ISBN is slightly higher (10/11) in the modulo 11 system than
in the modulo 10 system (9/10).
\subsection*{More details for \exerciseref{ex.85channel}}
Let the transmitted string be $\bt$ and the received string $\br$.
The mutual information is:
\begin{equation}
I(\bt;\br) = H(\br) - H(\br\given \bt) .
\end{equation}
Given the channel model, the conditional entropy $H(\br\given \bt)$ is
$\log_2(8) = 3$ bits, independent of the distribution chosen for
$\bt$.
By symmetry, the optimal input distribution is
the uniform distribution,
and this gives
$H(\br) = 8$ bits.
So the capacity, which is the maximum mutual information, is
\marginpar{[1]}
\beq
C(Q) = 5\, \mbox{bits.}
\eeq
{\sf Encoder:}
A solution exists using a linear $(8,5)$ code in which
the first seven bits are constrained to be a codeword
of the $(7,4)$ Hamming code, which encodes 4 bits
into 7 bits. The eighth transmitted bit is simply set to the
fifth source bit.
{\sf Decoder:}
The decoder computes the syndrome of the first seven received bits
using the $3 \times 7$ parity-check matrix of the Hamming code,
and uses the normal Hamming code decoder
% (details should be given)
to detect any single error in bits 1--7. If such an error is detected,
the corresponding received bit is flipped, and the five source bits
are read out. If on the other hand the syndrome is zero, then
the final bit must be flipped.
%\section*{Extra Solutions for Chapter \ref{ch_f8}}
%\section*{Extra Solutions for Chapter \ref{ch.message}}
%\section*{Extra Solutions for Chapter \ref{ch.noiseless}}
%\section*{Extra Solutions for Chapter \ref{ch.xword}}
\section*{Extra Solutions for Chapter \ref{ch.sex}}
\subsection*{Theory of sex when the fitness is a sum of exclusive-ors}
The following theory gives a reasonable fit
to empirical data on evolution where the fitness
function is a sum of exclusive-ors of independent pairs of bits.
Starting from random genomes, learning is initially
slow because the population has to decide, for each
pair of bits, in which direction to break the symmetry:
should they go for {\tt{01}} or {\tt{10}}?
We approximate the situation by assuming that
at time $t$, the fraction of the population
that has {\tt{01}} at a locus
is $a(t)$, for all loci, and the fraction that
have {\tt{10}} is $d(t)$, for all loci. We thus
assume that the symmetry gets broken in the same way
at all loci. To ensure that this assumption loses
no generality, we reserve the right
to reorder the two bits.
We assume that the other states at a locus,
{\tt{00}} and {\tt{11}}, both appear in
a fraction $b(t) \equiv \frac{1}{2} ( 1- (a(t)+d(t)) )$.
Now, we assume that all parents' genomes are drawn independently
at random from this distribution.
The probability distribution of the state of one locus in one child
is then
\beqan
P( {\tt{00}} ) &=& b'(t) \: \equiv \: \frac{1}{2}( b + (a+b)(b+d) )
\nonumber \\
P( {\tt{01}} ) &=& a'(t) \: \equiv \: \frac{1}{2}( a + (a+b)^2 )
\nonumber \\
P( {\tt{10}} ) &=& d'(t) \: \equiv \: \frac{1}{2}( d + (d+b)^2 ) \\
P( {\tt{11}} ) &=& b'(t) \nonumber
\eeqan
where the first terms ($\frac{1}{2}b$, $\frac{1}{2}a$, etc.)
come from the event that both bits are inherited from a single parent.
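As a consistency check on these expressions (using $a + d + 2b = 1$, which follows from the definition of $b$), the four probabilities sum to one:
\beq
 a' + d' + 2 b' = \frac{1}{2} ( a + d + 2b ) + \frac{1}{2} \left[ (a+b) + (b+d) \right]^2 = \frac{1}{2} + \frac{1}{2} = 1 .
\eeq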
The mean fitness of one locus in an offspring is then
\beq
p \equiv ( a'(t) +d'(t) ) ,
\eeq
and the total fitness, which is the sum of $G/2$ such terms,
has a binomial distribution with parameters $(N,p) = (G/2,p)$,
\ie, mean $\mu = Np$ and variance
$\sigma^2 = Np(1-p)$.
Approximating this distribution by a Gaussian,
and assuming truncation selection keeps the top half
of the distribution, the mean fitness after truncation
will be $\mu + \sqrt{2/\pi} \sigma$, and the fractions
at one locus are adjusted, by this selection, to:
\beq
a''(t) \equiv a'(t) \frac{ p'' }{p } , \:\:\:
d''(t) \equiv d'(t) \frac{ p'' }{p }
\eeq
where
\beq
p'' = p + \sqrt{2/\pi} \frac{1}{\sqrt{G/2}} \sqrt{ p(1-p) } .
\eeq
The parents of the next generation thus have fractions
given by
$a(t+1) = a''(t)$ and
$d(t+1) = d''(t)$.
{\em
add graphs here from gene/xortheory
}
%\section*{Extra Solutions for Chapter \ref{ch.clustering}}
%\section*{Extra Solutions for Chapter \ref{ch.enumerate}}
\section*{Extra Solutions for Chapter \ref{ch.ml}}
\soln{ex.manyparams}{
The likelihood has $N$ maxima: it is infinitely large
if $\mu$ is set equal to any datapoint
$x_n$ and $\sigma_n$ is decreased to zero,
the other $\sigma_{n'}$ being left at non-zero values.
Notice also that the data's mean and median both give
lousy answers to the question `what is $\mu$?'
We'll discuss the straightforward Bayesian solution
to this problem later.
}
%\section*{Extra Solutions for Chapter \ref{ch.distributions}}
%\section*{Extra Solutions for Chapter \ref{ch.bayes.int}}
%\section*{Extra Solutions for Chapter \ref{ch.exact}}
%\section*{Extra Solutions for Chapter \ref{ch.factorgraphs}}
%\section*{Extra Solutions for Chapter \ref{ch.laplace}}
%\section*{Extra Solutions for Chapter \ref{ch.occam}}
\section*{Extra Solutions for Chapter \ref{ch.mc}}
% p 384 soln 29.1
\continuedsoln{ex.Phiconverge}{
The solution in the book is incomplete, as the expression
for the variance of
\beq
\hat{\Phi} \equiv \frac{ \sum_{r} w_r \phi( \xfromq^{(r)} ) }{ \sum_r w_r } ,
% \label{eq.is}
\eeq
where
\beq
w_r \equiv \frac{ P^*\!(\xfromq^{(r)}) }{ Q^*(\xfromq^{(r)}) } ,
\label{eq.mc.is.weight.def.again}
\eeq
is not given. We focus on the variance of the numerator. (The variance of the
ratio is messier.)
But first, let's note the key insight here: what is the optimal $Q(x)$
going to look like? If $\phi(x)$ is a positive function, then
the magic choice
\beq
Q(x) = \frac{1}{Z_Q} P^*\!(x) \phi(x)
\label{eq.Qopt.is}
\eeq
(if we could make it) has the perfect property that
every numerator term will evaluate to the same constant,
\beq
\frac{ P^*\!(\xfromq^{(r)}) }{ Q^*(\xfromq^{(r)}) } \phi( \xfromq^{(r)} )
=
\frac{ P^*\!(\xfromq^{(r)}) Z_Q }{ P^*\!(\xfromq^{(r)}) \phi(\xfromq^{(r)}) } \phi( \xfromq^{(r)} )
=
Z_Q ,
\eeq
which is the required answer $Z_Q = \int \! \d x \, P^*\!(x) \phi(x)$.
The choice (\ref{eq.Qopt.is}) for $Q$ thus minimizes the variance of the numerator.
The denominators meanwhile would have the form
\beq
w_r \equiv \frac{ P^*\!(\xfromq^{(r)}) }{ Q^*(\xfromq^{(r)}) } = \frac{ Z_Q }{ \phi(\xfromq^{(r)}) } .
\eeq
It's intriguing to note that for this special choice of $Q$, where the numerator, even for just
a single random point,
is exactly the required answer, so that the best choice of denominator would be unity,
the denominator created by the standard method is not unity (in general).
This niggle exposes a general problem with importance sampling, which is
that there are multiple possible expressions for the estimator, all of which
are consistent asymptotically. Annoying, hey? The main motivation for estimators that
include the denominator is so that the normalizing constants of the distributions $P$ and $Q$
do not need to be known.
So, to the variance.
The variance of a single term in the numerator is, for normalized $Q$,
\beq
{\rm var}\left[ \frac{ P^*\!(x) }{ Q(x) } \phi( x ) \right]
=
\int \! \d x \, \left[ \frac{ P^*\!(x) }{ Q(x) } \phi( x ) \right]^2 Q(x) - \Phi^2
=
\int \! \d x \, \frac{ P^*\!(x)^2 }{ Q(x) } \phi( x )^2 - \Phi^2
\eeq
To minimize this variance with respect to $Q$,
we can introduce a Lagrange multiplier $\lambda$ to enforce normalization.
The functional derivative with respect to $Q(x)$ is then
\beq
- \frac{ P^*\!(x)^2 }{ Q(x)^2 } \phi( x )^2 - \lambda ,
\eeq
which is zero if
\beq
Q(x) \propto P^*(x) |\phi(x)| .
\eeq
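To fill in this last step (a sketch, assuming $Q(x) > 0$ wherever $P^*\!(x) \phi(x) \neq 0$): setting the derivative to zero gives $Q(x)^2 = P^*\!(x)^2 \phi(x)^2 / (-\lambda)$, so $\lambda$ must be negative and $Q(x) = P^*\!(x) |\phi(x)| / \sqrt{-\lambda}$. Normalization then fixes the constant,
\beq
 Q(x) = \frac{ P^*\!(x) |\phi(x)| }{ \int \! \d x' \, P^*\!(x') |\phi(x')| } ,
\eeq
which reduces to (\ref{eq.Qopt.is}) when $\phi$ is positive.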
}
% 115
\soln{ex.metFred}{
Fred's proposals would be appropriate if the target
density $P(x)$ were half as great on the two end states
as on all other states. If this were the target density,
then the factor of two difference in $Q$ for a transition
in or out of an end state would be balanced by the factor
of two difference in $P$, and the acceptance probability
would be 1. Fred's algorithm therefore samples from the
distribution
\beq
P'(x) = \left\{
\begin{array}{ll}
\dfrac{1}{20} & x \in \{ 1,2,\ldots,19 \} \\
\dfrac{1}{40} & x \in \{ 0,20 \} \\
0 & \mbox{otherwise} \end{array}
\right.
.
\label{eq.fred}
\eeq
If Fred wished to retain the new proposal density, he would have to
change the acceptance rule such that transitions {\em out of\/} the
end states would only be accepted with probability 0.5.
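This can be checked against detailed balance. Assume, as the factor of two above implies, that Fred's proposal from an end state always moves inward, $Q(1;0) = 1$, while the proposal from state 1 to state 0 has the usual probability $Q(0;1) = 1/2$. (The notation $Q(x';x)$ for the probability of proposing $x'$ from $x$ is adopted here for illustration.) Then
\beq
 P'(0) \, Q(1;0) = \frac{1}{40} \times 1 = \frac{1}{20} \times \frac{1}{2} = P'(1) \, Q(0;1) ,
\eeq
so every proposal is accepted and $P'$ is the invariant distribution. For the uniform target, the acceptance probability for the move $0 \rightarrow 1$ would instead be $Q(0;1)/Q(1;0) = 1/2$.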
%
}
\soln{ex.walkE}{
Typical samples differ in their value of $\log P(\bx)$ by
a standard deviation of order $\sqrt{N}$, let's say $c \sqrt{N}$.
But the value of $\log P(\bx)$ varies during a Metropolis simulation
by a random walk whose steps when negative are roughly of unit size;
and thus by detailed balance the steps when
positive are also roughly of unit size.
So modelling the random walk of $\log P(\bx)$
as a drunkard's
walk, it will take a time $T \simeq c^2 N$ to go a distance $c \sqrt{N}$ using
unit steps.
Gibbs sampling will not necessarily take so long to generate
independent samples because in Gibbs sampling it is possible for
the value of $\log P(\bx)$ to change by a large quantity up or
down in a single iteration. All the same, in many problems each
Gibbs sampling update only changes $\log P(\bx)$ by a small amount of order
1,
so $\log P(\bx)$ evolves by a random walk which takes a time $T \simeq c^2 N$
to traverse the typical set. However this linear scaling with the system size, $N$,
is not unexpected -- since Gibbs sampling updates only one coordinate at
a time, we know that at least $N$ updates
(one for each variable) are needed to get to an independent point.
}
%\section*{Extra Solutions for Chapter \ref{ch.mc2}}
\section*{Extra Solutions for Chapter \ref{ch.ising}}
\soln{ex.isingmemories}{% how to make stable states.
This is the problem of creating a system whose stable
states are a desired set of memories.
See later chapters for some ideas.
}
%\section*{Extra Solutions for Chapter \ref{ch.mcexact}}
%\section*{Extra Solutions for Chapter \ref{ch.mft} }
\section*{Extra Solutions for Chapter \ref{ch.ignorance}}
\soln{ex.recordbreaking}{
To answer this question,
$P(x)$ can be transformed to a uniform density.
Any property of intervals between record-breaking
events that holds for the uniform density also
holds for a general $P(x)$, since we can associate
with any $x$ a variable $u$ equal to the cumulative
probability $\int^x \! \d x' \, P(x')$, and $u$'s distribution
is uniform. Whenever a record for $x$ is broken,
a record for $u$ is broken also.
}
\soln{ex.harrydata}{
Let's discuss the two possible parsings. The first parsing $\H_a$ produces
a column of numbers all of which end in a decimal point. This might be
viewed as a somewhat improbable parsing. Why is the decimal point
there if no decimals follow it? On the other hand, this parsing makes every number
four digits long, which has a pleasing and plausible simplicity to it.
However, if we use the second parsing $\H_b$ then the second column of
numbers consists almost entirely of the number `0.0'. This also seems
odd.
We could assign subjective priors to all these possibilities
and suspicious coincidences.
%
The most compelling evidence, however, comes from the fourth column of
digits which are either the initial digits of a second list of numbers,
or the final, post-decimal digits of the first list of numbers.
What is the probability distribution of initial digits, and what
is the probability distribution of final, post-decimal digits?
It is often our experience that initial digits have a non-uniform
distribution, with the digit `1' being much more probable than
the digit `9'.
% (This observation is a part of `Zipf's law)
Terminal digits often have a uniform distribution, or
if they have a non-uniform distribution,
it would be expected to be dominated either by `0' and `5' or by
`0', `2', `4', `6', `8'. We don't generally expect the distribution of
{\em terminal\/}
digits to be asymmetric about `5', for example, we don't expect
`2' and `8' to have very different probabilities.
The empirical distribution seems highly non-uniform
and asymmetric, having 20 `1's,
21 `2's, one `3' and one `5'. This fits well with the hypothesis that
the digits are initial digits (\cf\ \secref{sec.whatyouknow}), and does not fit well with
any of the terminal digit distributions we thought of.
We can quantify the evidence in favour of the first hypothesis
by picking a couple of crude assumptions: First, for initial
digits,
\beq
P(n\given \H_a) = \left\{ \begin{array}{ll} \smallfrac{1}{Z} \smallfrac{1}{n} & n\geq1\\[0.05in]
\smallfrac{1}{Z} \smallfrac{1}{10} & n = 0
\end{array} \right. ,
\eeq
where $Z = 2.93$,
% 2.92897
and second, for terminal digits,
\beq
P(n\given \H_b) = \frac{1}{10} .
\eeq
Then the probability of the given 43 digits is
\beq
P(\{n\} \given \H_a ) = 2.71 \times 10^{-28} .
% 2.71355e-28
\eeq
\beq
P(\{n\} \given \H_b ) = 10^{-43} .
% 2.71355e-28
\eeq
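The first of these numbers can be reproduced from the digit counts quoted above (20 `1's, 21 `2's, one `3' and one `5'); the intermediate values here are rounded:
\beq
 \log_{10} P(\{n\} \given \H_a )
 = - 43 \log_{10} Z - 21 \log_{10} 2 - \log_{10} 3 - \log_{10} 5
 \simeq - 20.07 - 6.32 - 1.18 = -27.57 ,
\eeq
with $Z = \sum_{n=1}^{9} 1/n + 1/10 \simeq 2.929$.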
So the data consisting of the fourth column of digits
favour $\H_a$ over $\H_b$ by about $10^{15}$ to 1.
This is an unreasonably extreme conclusion, as is typical of carelessly
constructed Bayesian models \cite{Wallace_book}. But the conclusion is
correct; the data are real data that I received from a colleague,
and the correct parsing is that of $\H_a$.
}
%
% exam question
%
\soln{ex.biexp}{
\Bayes\ theorem:
\beq
P(\mu\given \{x_n\}) = \frac{ P(\mu) \prod_n P(x_n\given \mu ) }{ P(\{x_n \}) }
\eeq
The likelihood function contains a complete summary of
what the experiment tells us about $\mu$.
The log likelihood,
\beq
L (\mu ) = - \sum_n | x_n - \mu | ,
\eeq
is sketched in \figref{fig.likekink}.
%
The most probable values of $\mu$ are 0.9--2, and
the posterior probability falls by a factor of $e^2$
once we reach $-0.1$ and 3, so a range of plausible
values for $\mu$ is $(-0.1,3)$.
\amarginfig{b}{\footnotesize
\[ \mbox{\psfig{figure=figs/biexpansl.ps,width=2in,angle=-90}} \]
\[ \mbox{\psfig{figure=figs/biexpans.ps,width=2in,angle=-90}} \]
\caption[a]{Sketches of likelihood function.
Top: likelihood function on a log scale.
The gradient changes by 2 as we pass each data point.
Gradients are 4, 2, 0, $-2$, $-4$.
Bottom: likelihood function on a linear scale.
The exponential functions have lengthscales \dquarter,
\dhalf,
\dhalf, \dquarter.
}
\label{fig.likekink}
}
}
%
\section*{Extra Solutions for Chapter \ref{ch.decision}}
\soln{ex.utils}{
Preference of A to B means
\beq
u(1) > .89u(1) + .10u(2.5) + .01u(0)
\label{eq.conAB}
\eeq
Whereas preference of D to C means
\beqan
.89u(0) + .11u(1) &<& .90u(0) + .10u(2.5) \\
.11u(1) &<& .01u(0) + .10u(2.5) \\
u(1) &<& .89u(1) + .10u(2.5) + .01u(0)
\eeqan
which contradicts (\ref{eq.conAB}).
}
% shortened version of
% \input{tex/s-combns-and-shark.tex}
\soln{ex.joeshark}{
The probability of winning either of the first two bets
is $6/11$ = 0.54545.
The probability that you win the third bet is
0.4544.
% m gardner gives prob joe wins = 4225/7744
% so p you win 0.454416322314, joe 0.545583677686
% good. 7744-4225=3519
Joe simply needs to make
the third bet with a stake that is bigger than the sum of the first
two stakes to have a positive expectation on the sequence of three
bets.
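To see this, here is a sketch using the probabilities quoted above and assuming that all three bets are at even odds. On each of the first two bets Joe's expected gain per unit staked is $5/11 - 6/11 = -1/11 \simeq -0.091$; on the third it is $0.5456 - 0.4544 \simeq +0.091$. With stakes $s_1$, $s_2$ and $s_3$, Joe's total expectation is therefore
\beq
 0.091 \, [ s_3 - ( s_1 + s_2 ) ] ,
\eeq
to this accuracy, which is positive when $s_3 > s_1 + s_2$.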
%
%
}% end of Soln
%
\subsection*{The Las Vegas trickster}% combns moved to cutstuff.tex
\continuedsoln{ex.joeshark}{
%
%\begin{quotation}
%\input{tex/joeshark.tex}
%\end{quotation}
%
On a single throw of the two dice,
let the outcomes $6$ and $7$ have probabilities
$P(6)=p_6$ and $P(7)=p_7$. Note $P(8)=p_6$. The
values are $p_6 = 5/36$ and $p_7 = 6/36 = 1/6$. For the first bet,
we can ignore other outcomes apart from the winning and losing
outcomes 7 and 6 and compute the probability that the outcome is a 7,
given that the game has terminated,
\beq
\frac{p_7}{p_6+p_7} = 6/11 = 0.54545 .
\eeq
The second bet is identical. Both are favourable bets.
%
The third bet is the interesting one, because it is not a favourable
bet for you, even though it sounds similar to the two bets that have
gone before. The essential intuition for why two sevens
are less probable than an 8 and a 6 is that the 8 and the 6 can come
in either of two orders, so a rough factor of two appears in the
probability for 8 and 6.
Computing the probability of winning is quite tricky if
a neat route is not found. The probability is most easily
computed if, as above, we {\em discard all the irrelevant events\/} and
just compute the conditional probability of the different ways in
which the state of the game can advance by one `step'. The possible
paths taken by this `pruned' game with their probabilities are shown in the
figure as a Markov process. (The original unpruned game is
a similar Markov process in which an extra path emerges
from each node, giving a transition back to the same node.)
\begin{figure}
\figuremargin{%dangle{
\setlength{\unitlength}{0.6mm}
\newcommand{\fracpsix}{\frac{p_6}{p_6+p_7}}% 5/11
\newcommand{\fracpseven}{\frac{p_7}{p_6+p_7}}% 6/11
\newcommand{\fracpeight}{\frac{p_6+p_8}{p_6+p_7+p_8}}% 10/16
\newcommand{\fracpseveneight}{\frac{p_7}{p_6+p_7+p_8}}% 6/16
\newcommand{\mylabel}[3]{\put(#1){\makebox(0,0)[#2]{$#3$}}}
\newcommand{\mynode}[2]{\put(#1){\makebox(0,0){#2}}\put(#1){\circle{13}}}
\newcommand{\juvector}[1]{\put(#1){\vector(3,2){20}}}
\newcommand{\jdvector}[1]{\put(#1){\vector(3,-2){20}}}
\begin{center}
\mbox{
% included by l9.tex
\begin{picture}(100,90)(-5,-45)
\mynode{0,0}{A}
\mynode{30,-20}{7}
\mynode{30,20}{E}
\mynode{60,40}{68}
\mynode{60,0}{E7}
\mynode{60,-40}{77}
\mynode{90,20}{678}
\mynode{90,-20}{E77}
% bottom two
\mylabel{15,-15}{ur}{\fracpseveneight}% was 15,-10
\mylabel{45,-35}{ur}{\fracpseveneight}
% top two
\mylabel{15,15}{br}{\fracpeight}
\mylabel{45,35}{br}{\fracpsix}
% into E7
\mylabel{45,-15}{ul}{\fracpeight}
\mylabel{45,15}{bl}{\fracpseven}
% from E7
\mylabel{75,15}{br}{\fracpsix}
\mylabel{75,-5}{bl}{\fracpseven}
\juvector{5,3}
\jdvector{5,-3}
%
\juvector{35,23}
\jdvector{35,17}
\juvector{35,-17}
\jdvector{35,-23}
\juvector{65,3}
\jdvector{65,-3}
\end{picture}
\renewcommand{\fracpsix}{\frac{5}{11}}
\renewcommand{\fracpseven}{\frac{6}{11}}
\renewcommand{\fracpeight}{\frac{10}{16}}
\renewcommand{\fracpseveneight}{\frac{6}{16}}
% included by l9.tex
\begin{picture}(100,90)(-5,-45)
\mynode{0,0}{A}
\mynode{30,-20}{7}
\mynode{30,20}{E}
\mynode{60,40}{68}
\mynode{60,0}{E7}
\mynode{60,-40}{77}
\mynode{90,20}{678}
\mynode{90,-20}{E77}
% bottom two
\mylabel{15,-15}{ur}{\fracpseveneight}% was 15,-10
\mylabel{45,-35}{ur}{\fracpseveneight}
% top two
\mylabel{15,15}{br}{\fracpeight}
\mylabel{45,35}{br}{\fracpsix}
% into E7
\mylabel{45,-15}{ul}{\fracpeight}
\mylabel{45,15}{bl}{\fracpseven}
% from E7
\mylabel{75,15}{br}{\fracpsix}
\mylabel{75,-5}{bl}{\fracpseven}
\juvector{5,3}
\jdvector{5,-3}
%
\juvector{35,23}
\jdvector{35,17}
\juvector{35,-17}
\jdvector{35,-23}
\juvector{65,3}
\jdvector{65,-3}
\end{picture}
}
\end{center}
}{%
\caption[a]{Markov process describing the Las Vegas dice game, pruned of
all irrelevant outcomes. The end states $68$ and $678$
are wins for Joe. States $E77$ and $77$ are wins for you.
Please do not confuse this state diagram, in which arrows indicate
which states can follow from each other, with a graphical model,
in which each node represents a different variable and arrows
indicate causal relationships between them.}
}%
\end{figure}
%
The node labelled `A' denotes the initial state in which no 6s, 7s or
8s have been thrown. From here transitions are possible to state `7'
in which exactly one 7 has been thrown, and no 6s or 8s; and to state
`E', in which either [one or more 8s have occurred and no 6s or 7s]
or [one or more 6s have occurred and no 7s or 8s]. The probabilities
of these transitions are shown. We can progress from state E only if
Joe's winning 6 or 8 (whichever it is) is thrown, or if a 7
occurs. These events take us to the states labelled `68' and `E7'
respectively. From state `7' the game advances when a 6 or 8 is
thrown, or when a 7 is thrown, taking us to states `E7' and `77'
respectively. Finally, from state E7, if a 7 is thrown we transfer to
state E77, and if Joe's required 6 or 8 is thrown, we move to state
678. States 68 and 678 are wins for Joe; states 77 and E77 are wins
for you.
We first need the probability of state E7,
\beq
 (10/16)(6/11) + (6/16)(10/16) = 405/704 = 0.5753 .
\eeq
The probability that you win is
\beq
 P(77)+P({\rm E77}) = (6/16)^2 + P({\rm E7})(6/11)
 = 3519/7744 = 0.4544 .
\eeq
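As a check on the arithmetic, the two terms can be put over the common
denominator $7744 = 704 \times 11$, and Joe's winning probability can be
computed in the same way from the end states 68 and 678 of the figure:
\begin{eqnarray*}
 P(77)+P({\rm E77}) &=& \frac{1089}{7744} + \frac{2430}{7744}
 \:\: = \:\: \frac{3519}{7744} ,
\\
 P(68)+P(678) &=& (10/16)(5/11) + P({\rm E7})(5/11)
 \:\: = \:\: \frac{2200}{7744} + \frac{2025}{7744}
 \:\: = \:\: \frac{4225}{7744} .
\end{eqnarray*}
The two probabilities sum to one, as they must.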
% m gardner gives prob joe wins = 4225/7744
% so p you win 0.454416322314, joe 0.545583677686
% good. 7744-4225=3519
The bet is not favourable. Notice that Joe simply needs to make
the third bet with a stake that is bigger than the sum of the first
two stakes to have a positive expectation on the sequence of three
bets.
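To quantify this, here is a sketch that assumes all three bets are made
at even money, with stakes $s_1$, $s_2$ and $s_3$ (symbols introduced
here for illustration). Joe wins the third bet with probability
$1 - 3519/7744 = 4225/7744$. His expected loss on each of the first two
bets is $(6/11 - 5/11)$ times the stake, and his expected gain on the
third is $(4225-3519)/7744$ times the stake, so his expected net gain is
\[
 \frac{ 706 \, s_3 - 704 \, ( s_1 + s_2 ) }{ 7744 } ,
\]
which is positive if $s_3 > (704/706)(s_1+s_2) \simeq 0.997 \, (s_1+s_2)$.
A third stake bigger than the sum of the first two is therefore sufficient.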
%
%
}% end of Soln
%
% Tue 5/2/02 I cut out the solution to Joe Shark, even though it is beautiful,
% as a space saver? Also cut the combns solutions.
%\section*{Extra Solutions for Chapter \ref{ch.sampling} }
%\section*{Extra Solutions for Chapter \ref{ch.nn.intro}}
\section*{Extra Solutions for Chapter \ref{ch.single.neuron.class}}
%
\soln{ex.logitremind}{ One answer, given in the text on
page \pageref{ex.logitremind}, is that the single neuron
function was encountered under `the best detection of pulses'.
The same function has also appeared in the chapter on variational
methods when we derived mean field theory for a spin system.
Several of the solutions to the inference problems in
chapter 1 were also written in terms of this function.
}
%\soln{ex.gradG}{ ... }
%\soln{ex.derivWD}{ ... }
%% 123
%\soln{ex.neuronmotiv}{
% ...
%}
% 124
\soln{ex.LED}{
If we let $\bx$ and $\bs$ be binary vectors in $\{ \pm 1\}^7$, the
likelihood is $(1-f)^N f^M$, where
$N=( \bs^{\T}\bx + 7)/2$
and
$M=( 7 - \bs^{\T}\bx )/2$. From here, it is straightforward to
obtain the log posterior probability ratio, which is the activation.
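As a sketch of that step (the second candidate pattern $\bs'$ is notation
introduced here), substituting $N$ and $M$ into the log likelihood gives
\[
 \log P(\bx \given \bs) = N \log (1-f) + M \log f
 = \frac{7}{2} \log [ f (1-f) ] + \frac{ \bs^{\T}\bx }{2} \log \frac{1-f}{f} .
\]
The first term does not depend on $\bs$, so for two candidate characters
$\bs$ and $\bs'$,
\[
 \log \frac{ P(\bs \given \bx) }{ P(\bs' \given \bx) }
 = \frac{1}{2} \log \left( \frac{1-f}{f} \right) ( \bs - \bs' )^{\T} \bx
 + \log \frac{ P(\bs) }{ P(\bs') } ,
\]
which is linear in $\bx$, with weights proportional to $\bs - \bs'$
and a bias set by the prior ratio.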
The LED displays a binary code of length 7 with 10
codewords. Some codewords are very confusable -- 8 and 9
differ by just one bit, for example.
A superior binary code of length 7 is the $(7,4)$ Hamming
code. This code has 15 non-zero codewords, all separated
by a distance of at least 3 bits.%
\index{connection between!pattern recognition and error-correcting codes}
Here are those 15 codewords, along with a suggested mapping to
the integers 0--14.
% defns are in itprnnchapter.tex
\begin{realcenter}
\begin{tabular}{*{15}{@{\hspace{0.095in}}c@{\hspace{0.095in}}}} \toprule
\hammingdigit{0} &
\hammingdigit{1} &
\hammingdigit{2} &
\hammingdigit{3} &
\hammingdigit{4} &
\hammingdigit{5} &
\hammingdigit{6} &
\hammingdigit{7} &
\hammingdigit{8} &
\hammingdigit{9} &
\hammingdigit{10} &
\hammingdigit{11} &
\hammingdigit{12} &
\hammingdigit{13} &
\hammingdigit{14} \\
0 &
1 &
2 &
3 &
4 &
5 &
6 &
7 &
8 &
9 &
10 &
11 &
12 &
13 &
14 \\ \bottomrule
\end{tabular}
\end{realcenter}
}
\soln{ex.LED31}{
\begin{eqnarray*}
\log \frac{P(s=1 \given \br )}
{P(s=2\given \br)}
&=& \log \frac{ P(\br\given s=1) P(s=1)}
{ P(\br\given s=2) P(s=2) }
\\
&=&
\log \left( \frac{1-f}{f} \right)^{2r_1-1} +
\log \left( \frac{1-f}{f} \right)^{-(2r_3-1)}
+ \log \frac{ P(s=1)}
{ P(s=2) }
\\
&=&
w_1 r_1 + w_3 r_3 + w_0 ,
\end{eqnarray*}
where
\beq
w_1 = 2 \log \left( \frac{1-f}{f} \right) , \:\:
w_3 = - 2 \log \left( \frac{1-f}{f} \right) , \:\:
w_0 = \log \frac{ P(s=1)}{ P(s=2) } , \:\:
\eeq
and $w_2 = 0$,
which we can rearrange to give
\begin{eqnarray*}
{P(s=1 \,| \, \br )} &=& \frac{1}{ 1 + \exp \left( - w_0 - \sum_{n=1}^3
w_n r_n
\right) }.
\end{eqnarray*}
\marginfig{
\begin{center}\small
{\setlength{\unitlength}{0.021in}
\begin{picture}(48,61)(-1,-2)
%\put(20,3){\line(1,6){4.56}}% 5 less a bit
%\put(30,3){\line(-1,6){4.56}}
\put(10,3){\line(1,2){13.8}}%15 less a bit
\put(40,3){\line(-1,2){13.8}}
%
% inputs
\multiput(10,2)(30,0){2}{\circle{2}}
% bias
\put(0,24){\circle{2}}
\put(1,24.5){\line(2,1){18}}
\put(10,32){\makebox(0,0)[r]{\small$w_0$}}
%
% neuron
\put(25,37){\circle{12}}
\put(25,44){\vector(0,1){16}}
\put(24,48.5){\makebox(0,0)[r]{\small$P(s=1 \,| \, \br )$}}
%
\put(10,0){\makebox(0,0)[t]{\small$r_1$}}
\put(40,0){\makebox(0,0)[t]{\small$r_3$}}
\put(12,10){\makebox(0,0)[r]{\small$w_1$}}
\put(38,10){\makebox(0,0)[l]{\small$w_3$}}
%\put(25,-1){\makebox(0,0)[t]{\small$\ldots$}}
\end{picture}}
\end{center}
%}{%
%\caption[a]{The neuron}
%\label{fig.neuron}
}%
This can be viewed as a neuron with two inputs (or three, counting $r_2$ with its zero weight $w_2$),
one from $r_1$ with a positive weight, and
one from $r_3$ with a negative weight,
and a bias.
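To make the middle step of the derivation explicit (writing
$L \equiv \log [ (1-f)/f ]$, a shorthand introduced here), the two
likelihood terms combine as
\[
 (2r_1-1) L - (2r_3-1) L = 2 L r_1 - 2 L r_3 ,
\]
so the constant terms cancel and the bias $w_0$ contains only the log
prior ratio. If $f < 1/2$ then $L > 0$, which gives the signs of
$w_1$ and $w_3$ stated above. The final rearrangement assumes that
$s=1$ and $s=2$ are the only two hypotheses, so that
$P(s=1 \given \br) + P(s=2 \given \br) = 1$.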
}
\section*{Extra Solutions for Chapter \ref{ch.single.neuron.capacity}}
%\section*{Extra Solutions for Chapter \ref{ch.cover}}
\soln{ex.T43b}{
\ben
\item
$\bw = (1,1,1) .$
\item
$\bw = (1/4, 1/4 , -1) .$
\een
The two unrealizable labellings are $\{ 0,0,0,1 \}$ and $\{ 1,1,1,0 \}$.
}
\soln{ex.sensorybits}{
With just a little compression of the raw data,
it's possible your brain
could memorize everything.
}
\section*{Extra Solutions for Chapter \ref{ch.single.neuron.bayes}}
\soln{ex.gaussianmarg}{
When
$\w \sim {\rm Normal} (\wmp, \bAI)$, the scalar
$a = a(\bx;\wmp) + (\w-\wmp) \cdot \bx$
is Gaussian distributed with
mean $a(\bx;\wmp)$ and variance $s^2 \!=\! \bx^{\T} \bAI \bx$.
This is easily shown by simply computing the mean
and variance of $a$, then arguing that $a$'s distribution
must be Gaussian, because the marginals of
a multivariate Gaussian are Gaussian. (See page \pageref{sec.gaussian.props}
for a recap of multivariate Gaussians.)
The mean is
\[%beq
\left< a \right> = \left< a(\bx;\wmp) + (\w-\wmp) \cdot \bx \right>
= a(\bx;\wmp) + \left< (\w-\wmp) \right> \cdot \bx
= a(\bx;\wmp) .
\]%eeq
The variance is
\beqan
\left< (a - a(\bx;\wmp))^2 \right>
&=&
\left< \bx \cdot (\w-\wmp) (\w-\wmp) \cdot \bx \right>
\\ & = &
\bx^{\T} \left< (\w-\wmp) (\w-\wmp)^{\T} \right> \bx
\:\: = \:\: \bx^{\T} \bAI \bx . \nonumber
\eeqan
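In components (writing $\Delta w_i$ for the $i$th component of
$\w-\wmp$, a shorthand introduced here), the same calculation reads
\[
 s^2 = \sum_{i,j} x_i \left< \Delta w_i \, \Delta w_j \right> x_j
 = \sum_{i,j} x_i \, [ \bAI ]_{ij} \, x_j ,
\]
using the fact that the covariance matrix of $\w$ is $\bAI$.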
}
\begin{figure}% see onen.gnu
\figuremargin{\small%
\begin{raggedright}
\begin{tabular}{@{}*{2}{c@{}}}
\raisebox{1.96in}{(a)}\hspace{-0.452in}%
\psfig{figure=figs/logisticfg.ps,width=2.863in,angle=-90}
&
\raisebox{1.96in}{(b)}\hspace{-0.452in}%
\psfig{figure=figs/logisticfgQ.ps,width=2.863in,angle=-90}
\end{tabular}
\end{raggedright}
}{%
\caption[a]{(a) The log of the sigmoid function $f(a)=1/(1+e^{-a})$ and the
log of a Gaussian $g(a)\propto \Normal(0,4^2)$.
(b) The product $P = f(a)g(a)$ and a Gaussian approximation
to it, fitted at its mode. Notice that for a range of negative
values of $a$, the Gaussian approximation $Q$ is bigger than
$P$, while for values of $a$ to the right of the mode, $Q$ is smaller than $P$.}
\label{fig.logistic.gaussianapprox}
}%
\end{figure}
\soln{ex.gaussianapprox}{
In the case of a single data point, the likelihood function,
as a function of one parameter $w_i$, is a sigmoid
function; an example of a sigmoid function
is shown on a logarithmic scale in
\figref{fig.logistic.gaussianapprox}a. The
same figure shows a Gaussian distribution on a
log scale. The prior distribution in this problem is
assumed to be Gaussian; and the approximation $Q$ is
also a Gaussian, fitted at the maximum of the
sum of the log likelihood and the log prior.
The log likelihood and log
prior are both concave functions, so the curvature
of $\log Q$
% the approximating distribution
must necessarily
be greater than the curvature of the log prior. But asymptotically
the log likelihood function is linear, so the curvature of the log posterior
for large $|a|$ decreases to the curvature of the log prior.
Thus for sufficiently large values of $w_i$,
the approximating distribution is {\em lighter-tailed\/}
than the true posterior.
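The curvature argument can be made explicit in terms of the activation
$a$, for the one-dimensional case shown in
\figref{fig.logistic.gaussianapprox}; the prior standard deviation
$\sigma$ is a symbol introduced for this sketch.
With $f(a) = 1/(1+e^{-a})$,
\[
 \frac{{\rm d}}{{\rm d} a} \log f(a) = 1 - f(a) , \:\:\:\:
 \frac{{\rm d}^2}{{\rm d} a^2} \log f(a) = - f(a) [ 1 - f(a) ] ,
\]
so the negative second derivative of the log posterior is
\[
 f(a) [ 1 - f(a) ] + \frac{1}{\sigma^2} .
\]
At the mode this exceeds $1/\sigma^2$, and it is this value that fixes
the curvature of $\log Q$ everywhere; in the true log posterior the
first term decays to zero as $|a| \rightarrow \infty$.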
% So at large distances from the
% mode, the Gaussian approximation is light-tailed.
This conclusion may be a little misleading, however.
If we multiply the likelihood and the prior and
find the maximum and fit a Gaussian there, we might obtain
a picture like \figref{fig.logistic.gaussianapprox}b.
Here issues of normalization have been ignored. The
important point to note is that
since the Gaussian is fitted at a point
where the log likelihood's curvature is not
very great, the approximating Gaussian's curvature
is {\em too small\/} for $a$ between $a_{\MP}$
and $-a_{\MP}$, with the consequence that
the approximation $Q$ is substantially {\em larger\/}
than $P$ for a wide range of negative values of $a$.
On the other hand, for values of $a$ greater than $a_{\MP}$,
the approximation $Q$ is smaller in value than $P$.
Thus
whether $Q$ is for practical purposes
a heavy-tailed or light-tailed approximation to $P$
depends on which direction one looks in, and how far one looks.
The Gaussian approximation becomes more accurate as the amount of
data increases, because the log of the posterior is a sum of more and
more bent functions all of which contribute curvature to the log
posterior, making it more and more Gaussian
(\cf\ \figref{fig.incremental.data}). The greatest curvature is
contributed by data points that are close (in terms of $a$) to the
decision boundary, so the Gaussian approximation becomes good fastest
if the optimized parameters are such that
all the points are close to the decision boundary, that is, if the
data are noisy.
%
}
%\section*{Extra Solutions for Chapter \ref{ch.hopfield}}
%\section*{Extra Solutions for Chapter \ref{ch.boltzmann} }
%\section*{Extra Solutions for Chapter \ref{ch.gallager}}
%\section*{Extra Solutions for Chapter \ref{ch.convol} }
%\section*{Extra Solutions for Chapter \ref{ch.ra}}
%\section*{Extra Solutions for Chapter \ref{chdfountain}}
%\section*{Extra half-baked solutions from all over the book}% none Wed 12/11/03
\dvipsb{extra solutions}
% \input{backpage}
% \typeout{backpage: book.dvi}
% \dvips
\end{document}