This document is (c) David J.C. MacKay, 2001

It originates from http://www.inference.phy.cam.ac.uk/mackay/itprnn/book.html

It contains the text of David MacKay's book, Information theory, inference, and learning algorithms. (latex source)

Copying and distribution of this file are NOT PERMITTED.

The file is provided for convenience of anyone wishing to make a web-based search of the text of the book.

% This document is (c) David J.C. MacKay, 2001
%
% It originates from http://www.inference.phy.cam.ac.uk/mackay/itprnn/
%                    http://www.inference.phy.cam.ac.uk/mackay/itprnn/book.html
%
% It contains the text of David MacKay's book,
%   Information theory, inference, and learning algorithms.
% (latex source)
%
% Copying and distribution of this file are NOT PERMITTED.
%
% The file is provided for convenience of anyone wishing to
% make a web-based search of the text of the book.

% was book2e.tex is now book.tex   (and still latex2e)
\documentclass[11pt]{book}%
\usepackage{floatflt}
\usepackage{hangingsecnum}% makes sec numbers sit in the left margin
%\usepackage{mparhack}
\usepackage{mparhackright-209}% makes all margin pars go in right margin
\usepackage{marginfig}%   Defines many macros for making various styles of figure with captions
\usepackage{symbols}
%\usepackage{twoside}
\usepackage{myalgorith}%   defines the Algorithm environment as a float
\usepackage{aside}%        defines the {aside} environment
\usepackage{chapsummary}%  helps me compile index-like objects
\usepackage{chapternotes}% lots of assorted stuff
\usepackage{lsalike}%      defines citation commands
\usepackage{booktabs}%     makes nice quality tables
\usepackage{prechapter}%   defines a chapter-like object
\usepackage{mycaption}%    defines ``\indented''and \@makecaption
% additions post-Sat 5/10/02
\usepackage{latexsym}%  needed in order to make use of the \Box command
\usepackage{tocloft}% implements my look of table of contents
\usepackage{tocloftcomp}% implements my look of table of contents
\usepackage{mychapter}% defines chapter command, including the look of the new chapter page
                      % also defines the look of the section and subsection commands
\usepackage{mycenter}%  modifies center to reduce vertical space waste
\usepackage{mypart}%    modifies part to not cleardoublepage 
\usepackage{myheadings}% redefines the pagestyle ``headings''
% \usepackage{headingmods}% redefines the pagestyle ``headings'' (similar to myheadings)
% \usepackage{myindents}% defines parindent and leftmargin
\usepackage{graphics}% enables rotating of boxes
% \usepackage{boldmathgk}%  provides bold alpha etc. (doesn't work)
\usepackage{fixmath}%  provides bold alpha etc.
\usepackage{makeidx}
\makeindex
%
\newcommand{\thedraft}{3.141} 
\renewcommand{\textfraction}{0.10}
\pagestyle{headings}
\begin{document}
\bibliographystyle{lsalike}
%\newcommand{\bf}{\textbf}
%\newcommand{\sf}{\textsf}
%%\newcommand{\em}{\textem}
%\newcommand{\rm}{\textrm}
%\newcommand{\tt}{\texttt}
%\newcommand{\sl}{\textsl}
%\newcommand{\sc}{\textsc}
%
%
%
 

 
 
% chapter.tex
% 
% this contains a few common definitions for all chapters 
% of the itprnn book
\newcommand{\adhoc}{ad hoc}
\newcommand{\busstop}{bus-stop}
\newcommand{\mynewpage}{\newpage}% switch this off later Sun 3/2/02
% see also tex/inputs/itchapter.sty
% chapternotes.sty is where there is an index
\newcommand{\fN}{f\!N}
\newcommand{\exercisetitlestyle}{\sf}
%
% used in sumproduct.tex and gallager.tex
\newcommand{\Mn}{{\cal M}(n)}
\newcommand{\Nm}{{\cal N}(m)}
%\newcommand{\N}{{\cal N}}
%
% the delta function that is 1 if true (defined in notation.tex)
\newcommand{\truth}{1}
% 
% used in gene.tex
\newcommand{\deltaf}{\delta\! f}
\newcommand{\tI}{\tilde{I}}
\newcommand{\Kp}{K_{\rm{p}}}
\newcommand{\Ks}{K_{\rm{s}}}
%
% end
% lang4.tex - distributions.tex
\newcommand{\lI}{I}
%
% clust.tex
\newcommand{\rnk}{r^{(n)}_k}
\newcommand{\hkn}{\hat{k}^{(n)}}
% good sizes: 
% -0.45: 1.25
% -0.25: 0.65
% -0.4 0.8
\newcommand{\softfig}[1]{\hspace{-0.4in}\psfig{figure=octave/kmeansoft/ps1/#1.ps,width=0.8in,angle=-90}}
\newcommand{\softtfa}[3]{\begin{tabular}{c}{$t=#2$}\\
\hspace*{-0.4in}\mbox{\psfig{figure=octave/kmeansoft/#3/#1.ps,width=1.2in,angle=-90}\hspace*{-0.2in}}\\
\end{tabular}}
\newcommand{\softtfabig}[3]{\begin{tabular}{c}{$t=#2$}\\
\hspace*{-0.6in}\mbox{\psfig{figure=octave/kmeansoft/#3/#1.ps,width=1.5in,angle=-90}\hspace*{-0.2in}}\\
\end{tabular}}
\newcommand{\softtfabigb}[3]{\begin{tabular}{c}{$t=#2$}\\
\hspace*{-0.45in}\mbox{\psfig{figure=octave/kmeansoft/#3/#1.ps,width=1.625in,angle=-90}\hspace*{-0.2in}}\\
\end{tabular}}
\newcommand{\softtf}[2]{\softtfa{#1}{#2}{ps1}}
\newcommand{\softtfbig}[2]{\softtfabig{#1}{#2}{ps1}}
\newcommand{\softtfbigb}[2]{\softtfabigb{#1}{#2}{ps1}}
\newcommand{\softtfb}[2]{\softtfa{#1}{#2}{ps3}}
\newcommand{\softtfbbig}[2]{\softtfabigb{#1}{#2}{ps3}}
\newcommand{\softfc}[1]{\begin{tabular}{c}%
\hspace*{-0.2in}\mbox{\psfig{figure=octave/kmeansoft/ps5/#1.ps,width=1.32in,angle=-90}\hspace*{-0.2in}}\\
\end{tabular}}
% end
%
% used in _p1 and _l2
\newcommand{\hpheight}{26mm}
\newcommand{\wow}{\marginpar{\raisebox{-12pt}{\psfig{figure=figs/wow.eps,width=1in}}}}
%
% used in _l1.tex:::::::
\renewcommand{\q}{{f}}
\newcommand{\obr}[3]{\overbrace{{#1}\,{#2}\,{#3}}}
\newcommand{\ubr}[3]{\underbrace{{#1}\,{#2}\,{#3}}}
\newcommand{\nbr}[3]{{{#1}\,{#2}\,{#3}}}
%
%
\newcommand{\fitpath}{/home/mackay/octave/fit/ps}% used in fit.tex (gaussian fitting, octave)
% CUP style: 
\renewcommand{\ie}{i.e.}
\renewcommand{\eg}{e.g.}
\renewcommand{\NB}{N.B.}
%
% symbols i e and d in maths (operators)
\newcommand{\im}{{\rm i}}
\newcommand{\e}{{\rm e}}
% \d is already defined
%
\newenvironment{conclusionbox}%
{\vskip 0.1pt \noindent\rule{\textwidth}{0.1pt}\vskip -18pt\begin{quote}\vskip -8pt}%
{\end{quote}\vskip -14pt \noindent\rule{\textwidth}{0.1pt}\vskip 6pt}
% {\vskip 0.1pt \noindent\rule{\textwidth}{0.1pt}\vskip -12pt\begin{quote}}%
% {\end{quote}\vskip -12pt \noindent\rule{\textwidth}{0.1pt}}
\newcommand{\dy}{\d y}
\newcommand{\plus}{+}
\newcommand{\Wenglish}{Wenglish}% winglish
\newcommand{\wenglish}{\Wenglish}% winglish
\newcommand{\percent}{{per cent}}% in USA only: percent
%
%\newcommand{\nonexaminable}{$^{*}$}
\newcommand{\nonexaminable}{}
%
% for exact sampling chapter
\newcommand{\envelope}{summary state}
%
\def\unit#1{\,{\rm #1}}
\def\cm{\unit{cm}}
\def\grams{\unit{g}}
% this is a 209 versus 2e problem: (huffman.latex edited instead)
%\def\tenrm{\rm}
%\def\tenit{\it}
%
% other problems: \pem
\renewcommand{\textfraction}{0.1}
%
% for use in free text: 
\newcommand{\bits}{{\rm bits}}
\newcommand{\bita}{{\rm bit}}
% for use in equations or in '1 bit'
\newcommand{\ubits}{\,{\bits}}
\newcommand{\ubit}{\,{\bita}}
%
%
% used in ch 1:
\newcommand{\pB}{p_{\rm B}}
\newcommand{\pb}{p_{\rm b}}
%
% ch 2:
\newcommand{\sixtythree}{{\tt sixty-three}}
\newcommand{\aep}{`asymptotic equipartition' principle}
%
% used in alpha: 
\newcommand{\sla}{\sqrt{\lambda_a}}
\newcommand{\kga}{\kappa\gamma}
\newcommand{\kkgg}{\kappa^2\gamma^2}
\newcommand{\skg}{\sqrt{\kappa\gamma}}
\newcommand{\TYP}{{\rm \scriptscriptstyle TYP}}
%
\newcommand{\bb}{{\bf b}} 
\newcommand{\eq}{\mbox{$=$}}
%
% used in ising.tex and _s4.tex
% J=+1 are in states1, J=-1 are in states
%\newcommand{\risingsample}[1]{\psfig{figure=isingfigs/states1/#1.ps,width=1.82in}}
\newcommand{\risingsample}[1]{\psfig{figure=isingfigs/states1/#1.ps,width=1in}}% was 1.75
\newcommand{\smallrisingsample}[1]{\psfig{figure=isingfigs/states1/#1.ps,width=0.6in}}% was 1.2 was 0.9
\newcommand{\Hisingsample}[1]{\psfig{figure=isingfigs/states1/#1.ps,width=2.6in}}
\newcommand{\hisingsample}[1]{\psfig{figure=isingfigs/states/#1.ps,width=2.6in}}
\newcommand{\bighisingsample}[1]{\psfig{figure=isingfigs/states/#1.ps,width=3.86in}}
%
% used in _noiseless.tex
\newcommand{\Connectionmatrix}{Connection matrix}
\newcommand{\connectionmatrix}{connection matrix}
\newcommand{\connectionmatrices}{connection matrices}
%\newcommand{\cwM}{M}% codeword number
%\newcommand{\cwm}{m}% codeword number
\newcommand{\cwM}{S}% codeword number
\newcommand{\cwm}{s}% codeword number
\newcommand{\sa}{\alpha}% signal amplitude in gaussian channel
%
\newcommand{\cmA}{A}% connection matrix symbol
\newcommand{\bcmA}{{\bf \cmA}}% connection matrix symbol
\newcommand{\bAcm}{{\bcmA}}
\newtheorem{ctheorem}{Theorem}[chapter]
\newtheorem{definc}{Definition}[chapter]
\newcommand{\appendixref}[1]{appendix \ref{#1}}
\newcommand{\Appendixref}[1]{Appendix \ref{#1}}
\newcommand{\sectionref}[1]{section \ref{#1}}
\newcommand{\Sectionref}[1]{Section \ref{#1}}
\newcommand{\secref}[1]{section \ref{#1}}
\newcommand{\Secref}[1]{Section \ref{#1}}
%\newcommand{\chapterref}[1]{chapter #1}% why?
%\newcommand{\Chapterref}[1]{Chapter #1}% why?
\newcommand{\chapterref}[1]{chapter \ref{#1}}
\newcommand{\Chapterref}[1]{Chapter \ref{#1}}
\newcommand{\chref}[1]{chapter \ref{#1}}
\newcommand{\Chref}[1]{Chapter \ref{#1}}
\newcommand{\chone}{\ref{ch.one}}
\newcommand{\chtwo}{\ref{ch.two}}
\newcommand{\chthree}{\ref{ch.three}}
\newcommand{\chfour}{\ref{ch.four}}
\newcommand{\chfive}{\ref{ch.five}}
\newcommand{\chsix}{\ref{ch.six}}
\newcommand{\chseven}{\ref{ch.ecc}}
\newcommand{\cheight}{\ref{ch.bayes}}
\newcommand{\chthirteen}{\ref{ch.single.neuron.class}}% single neuron
\newcommand{\chfourteen}{\ref{ch.single.neuron.bayes}}% single neuron bayes? 
\newcommand{\chtwelve}{\ref{ch.nn.intro}}% intro to nn
\newcommand{\chcover}{\ref{ch.cover}}
\newcommand{\chbayes}{\ref{ch.bayes}}
\newcommand{\secpulse}{\ref{sec.pulse}}% 7.2.1?}
\newcommand{\secthirteenthree}{13.3?}
\newcommand{\secmetrop}{\ref{sec.metrop}}% 11.3?}
\newcommand{\figooo}{?1.11?}
\newcommand{\eqgamma}{8.27?}
\newcommand{\TSP}{travelling salesman problem} 
\newcommand{\vfe}{variational free energy}
\newcommand{\vfem}{variational free energy minimization}
% could make this \ch6 = \ref{ch6}
% author, title etc is in here....
% {headerinfo.tex}% uses special commands
\setcounter{secnumdepth}{2}%
\newcommand{\indep}{\bot}% upside down pi desired
\newcommand{\dbf}{\sl}% boldface in definitions
\newcommand{\dem}{\sl}% emphasized definitions in text
\newcommand{\solutionb}[2]{\setcounter{solution_number}{#1}
\solutiona{#2}}
\newcommand{\lsolution}[2]{\section{Solution to exercise {#1}}{#2}}
%
%
\newcommand{\FIGS}{/home/mackay/book/FIGS}
\newcommand{\bookfigs}{/home/mackay/book/figs}
\newcommand{\figsinter}{/home/mackay/handbook/figs/inter}
\newcommand{\exburglar}{\exerciseref{ex.burglar}}
\newcommand{\exnine}{\exerciseref{ex.invP}}%10}
\newcommand{\exseven}{\exerciseonlyref{ex.weigh}}% use deprecated!
% was \exseven ....  \exerciseref{ex.expectn}}%9}
\newcommand{\exaseven}{\exerciseref{ex.R9}}%{7}
\newcommand{\exten}{\exerciseref{ex.expectng}}%{11}
\newcommand{\exfourteen}{\exerciseref{ex.Hadditive}}%{15}
\newcommand{\exfifteen}{\exerciseref{ex.Hcondnal}}%{16}
\newcommand{\exeighteen}{\exerciseref{ex.Hmutualineq}}%{19}
\newcommand{\extwenty}{\exerciseref{ex.rel.ent}}%{21}
\newcommand{\extwentyone}{\exerciseref{ex.joint}}%{22}% the joint ensemble
\newcommand{\extwentytwo}{\exerciseref{ex.dataprocineq}}%{23}
\newcommand{\extwentythree}{\exerciseref{ex.zxymod2}}%{24}
\newcommand{\extwentyfour}{\exerciseref{ex.waithead}}%{25}
\newcommand{\extwentyfive}{\exerciseref{ex.sumdice}}%{26}
\newcommand{\extwentysix}{\exerciseref{ex.RN}}%{27}
\newcommand{\extwentyseven}{\exerciseref{ex.RNGaussian}}%{28}
\newcommand{\exthirtyone}{\exerciseref{ex.logit}}%{32}% logistic
\newcommand{\exthirtysix}{\exerciseref{ex.exponential}}%{37}% 
\newcommand{\exthirtyseven}{\exerciseref{ex.blood}}%{38}% forensic
\newcommand{\exfiftythree}{\exerciseref{ex.}}%{53}% integers
\newcommand{\eqsixteenfive}{16.5}
\newcommand{\Kraft}{Kraft}% Kraft--McMillan
\newcommand{\exrelent}{\exerciseref{ex.rel.ent}}%{20} %% \ref{ex.rel.ent}
\newcommand{\eqKL}{1.24} %% \eqref{eq.KL}
\newcommand{\bSigma}{{\mathbf{\Sigma}}}
\newcommand{\sumproduct}{sum-product}
%
% for cpi material
%
\newcommand{\sigbias}{\sigma_{\rm bias}}
\newcommand{\sigin}{\sigma_{\rm in}}
\newcommand{\sigout}{\sigma_{\rm out}}
\newcommand{\abias}{\alpha_{\rm bias}}
\newcommand{\ain}{\alpha_{\rm in}}
\newcommand{\aout}{\alpha_{\rm out}}
%\newcommand{\bff}{\bf}
\newcommand{\handfigs}{/home/mackay/handbook/figs} 
\newcommand{\mjofigs}{/home/mackay/figs/mjo} 
\newcommand{\FIGSlearning}{/home/mackay/book/FIGS/learning}
\newcommand{\codefigs}{/home/mackay/_doc/code/ps/ps} 
%
% mncEL stuff
%
\newcommand{\ebnowide}[1]{\mbox{\psfig{figure=../../code/#1.ps,width=2.8in,angle=-90}}}
\newcommand{\fem}{m}
\newcommand{\feM}{M}
\newcommand{\fel}{n}
\newcommand{\feL}{N}
\renewcommand{\L}{N}
\newcommand{\feLm}{{\cal N}(m)}
\newcommand{\feMl}{{\cal M}(n)}
\newcommand{\feK}{N}
\newcommand{\fek}{n}
\newcommand{\feKn}{{\cal N}(m)}
\newcommand{\feNk}{{\cal M}(n)}
\newcommand{\feN}{M}
\newcommand{\fen}{m}
\newcommand{\fer}{r}
\newcommand{\GL}{GL}
\newcommand{\SMN}{GL}
\newcommand{\NMN}{MN}
\newcommand{\MN}{MN}
\renewcommand{\check}{check}% was relationship
\newcommand{\checks}{checks}% was relationship
\newcommand{\fs}{f_{\rm s}}
\newcommand{\fn}{f_{\rm n}}
\newcommand{\llncspunc}{.}
\newcommand{\lcA}{{H}} 
\newcommand{\rmncNall}{/home/mackay/_doc/code/rmncNall} 
\newcommand{\oneA}{1A}
\newcommand{\twoA}{2A}
\newcommand{\thrA}{2A}
\newcommand{\oneB}{1B}
\newcommand{\twoB}{2B}
\newcommand{\thrB}{2B}
\newcommand{\bndips}{/home/mackay/_doc/code/bndips}
\newcommand{\codeps}{/home/mackay/_doc/code/ps}
\newcommand{\equalnode}{\raisebox{-1pt}[0in][0in]{\psfig{figure=figs/gallager/equal.eps,width=8pt}\hspace{0mm}}}
\newcommand{\plusnode}{\raisebox{-1pt}[0in][0in]{\psfig{figure=figs/gallager/plus.eps,width=8pt}\hspace{0mm}}}
% 
\newcommand{\fourfourtable}[9]{\begin{tabular}[b]{lcc@{\hspace{4pt}}c}
 \multicolumn{1}{l}{#1:} &  & \multicolumn{2}{c}{#2} \\[-0.1in]%  \cline{1-1}
 & & {#3} & {#4} \\ \cline{3-4}
{#5} &\multicolumn{1}{l|}{#3} & {#6} & {#7} \\[-7pt] 
     &\multicolumn{1}{l|}{#4} & {#8} & {#9} \\ 
\end{tabular}}
\newcommand{\fourfourtableb}[9]{\begin{tabular}[b]{l|c@{\hspace{1pt}}c@{\hspace{3pt}}c}
 {#1} & {#2}  & {#3} & {#4} \\ \cline{1-1}\cline{3-4}
\multicolumn{2}{l}{#5} & & \\ 
\multicolumn{1}{l|}{#3} & & {#6} & {#7} \\[-5pt] 
\multicolumn{1}{l|}{#4} & & {#8} & {#9} \\ 
\end{tabular}}
\newcommand{\fourfourtableold}[9]{\begin{tabular}[b]{l|c|c|c|}
 {#1} & {#2}  & {#3} & {#4} \\ \cline{1-1}
\multicolumn{2}{l|}{#5} & & \\ \hline
\multicolumn{2}{l|}{#3} & {#6} & {#7} \\ \hline
\multicolumn{2}{l|}{#4} & {#8} & {#9} \\ \hline
\end{tabular}}
\newcommand{\mathsstrut}{\rule[-3mm]{0pt}{8mm}}
%
% for ra.tex
%
\newcommand{\halfw}{0.35in}
\newcommand{\onew}{0.7in}
\newcommand{\onehalfw}{1.05in}
\newcommand{\twow}{1.4in}
\newcommand{\twohalfw}{1.75in}
\newcommand{\GHfig}[1]{\psfig{figure=GHps/#1,width=\onehalfw}}% for rate 1/3
\newcommand{\GHfigone}[1]{\psfig{figure=GHps/#1,width=\onew}}%
\newcommand{\GHfigthird}[1]{\psfig{figure=GHps/#1,width=\halfw}}
\newcommand{\GHfigquarter}[1]{\psfig{figure=GHps/#1,width=\twohalfw}}
\newcommand{\GHfigtwo}[1]{\psfig{figure=GHps/#1,width=\twow}}% for rate 1/2
\newcommand{\GHfigdouble}[1]{\psfig{figure=GHps/#1,width=\twohalfw}}% for five wide
% extra wide fitting::::::::::: (for turbo)
\newcommand{\GHfigdoubleE}[1]{\psfig{figure=GHps/#1,width=2in}}% for five wide
\newcommand{\GHfigE}[1]{\psfig{figure=GHps/#1,width=1.2in}}% for rate 1/3
%
\newcommand{\GHdrawfig}[1]{\psfig{figure=GHps/#1,width=1.5in}}% was 1.8
\newcommand{\standardfig}[1]{\psfig{figure=rirreg/#1,width=1.8in,angle=-90}}
\newcommand{\loopsfig}[1]{\psfig{figure=rirreg/loops.#1,height=1.85in,width=1.8in,angle=-90}}
\newcommand{\titledfig}[2]{\begin{tabular}{c}%
{#1}\\%
\standardfig{#2}\\%
\end{tabular}%
}

%
% for the single neuron chapters
%
\newcounter{funcfignum}
\setcounter{funcfignum}{1}
\newcommand{\funcfig}[2]{
	\put(#1,#2){\makebox(0,0)[b]{
	\begin{tabular}{@{}c@{}}
		\psfig{figure=\FIGSlearning/f.#1.#2.ps,height=1.3in,width=1.3in,angle=-90} \\[-0.15in]
 			$\bw = (#1,#2)$		
\\	\end{tabular}
		}
	}
}
\newcommand{\wflatfig}[1]{
	\begin{tabular}{@{}c@{}}\setlength{\unitlength}{1in}\begin{picture}(1.5,1.3)(0.30,0.40)
		\psfig{figure=\FIGSlearning/#1,height=2.43in,width=2.064in,angle=-90}
% was 1.3,1.3
	\end{picture}\\\end{tabular}
}
\newcommand{\wsurfig}[1]{
	\begin{tabular}{@{}c@{}}\setlength{\unitlength}{1in}\begin{picture}(1.5,1.5)(0,0)
		\psfig{figure=\FIGSlearning/#1,height=1.8in,width=1.8in,angle=-90}
% was 1.5,1.5
	\end{picture}\end{tabular}
}
\newcommand{\datfig}[1]{
	\begin{tabular}{@{}c@{}}\setlength{\unitlength}{1in}\begin{picture}(1,1)(0.30,0.1)
		\psfig{figure=\FIGSlearning/#1,height=1.2in,width=1.412in,angle=-90}
% was 1,1
	\end{picture}\end{tabular}
}
\newcommand{\optens}{optimal input distribution}% used in l5.tex, l6.tex, s5.tex
\newcommand{\dilbertcopy}{{[Dilbert image Copyright\copyright{1997} United Feature Syndicate, Inc.,
 used with permission.]}}
\newcommand{\Rnine}{\mbox{R}_9}
\newcommand{\Rthree}{\mbox{R}_3}
\newcommand{\eof}{{\Box}}
\newcommand{\teof}{\mbox{$\Box$}}% for use in text
\newcommand{\ta}{{\tt{a}}}
\newcommand{\tb}{{\tt{b}}}
%\newcommand{\dits}{dits}
%\newcommand{\dit}{dit}
\newcommand{\dits}{bans}
\newcommand{\dit}{ban}
%
% used in l5
%
\newcommand{\BSC}{binary symmetric channel}
\newcommand{\BEC}{binary erasure channel}
\newcommand{\subsubpunc}{}% change to . if subsubsections are given in-line headings
%
% convolutional code definitions
%
\newcommand{\cta}{t^{(a)}}
\newcommand{\ctb}{t^{(b)}}
\newcommand{\z}{z}
\newcommand{\lfsr}{linear feedback shift register} 
%
% definitions for including hinton diagrams from extended directory
%
\newcommand{\ecfig}[1]{\psfig{figure=extended/ps/#1.ps,silent=}}
% extra argument
\newcommand{\ecfigb}[2]{\psfig{figure=extended/ps/#1.ps,#2,silent=}}
%
% used in _s1 and in _linear maybe
%%%%%%%%%%% see /home/mackay/code/bucky
\newcommand{\buckypsfig}[1]{\mbox{\psfig{figure=buckyps/#1,width=1.2in}}}
\newcommand{\buckypsfigw}[1]{\mbox{\psfig{figure=buckyps/#1,width=1.75in}}}
\newcommand{\buckypsgraph}[1]{\mbox{\psfig{figure=buckyps/#1,width=1.2in,angle=-90}}}
\newcommand{\buckypsgraphb}[1]{\mbox{\psfig{figure=buckyps/#1,width=1.75in,angle=-90}}}
\newcommand{\buckypsgraphB}[1]{\mbox{\psfig{figure=buckyps/#1,width=2.2in,angle=-90}}}
%%%%%%%%%%%%%%

%%%%%%%%%%%%%%%55
% for l1a
%%%%%%%%%%%%%%%%%%
% example
% \bigrampicture{3.538mm}{hd_conbigram.ps}
% \bigrampicture{3.538mm}{hd_conbigram.ps,width=278pt}%%%%%%% 278 is the original size
% This used to work fine in latex209 then needed rejigging in 2e.
% (alignment of g,j,p,q,y wrong at the bottom) (saved to graveyard.tex
\newcommand{\bigrampicture}[3]%args are unitlength,picturename-and-picturesize,font-request
{%%%%%%%%%
\setlength{\unitlength}{#1}
\begin{picture}(30,30)(0,-30)% was 28,28   0,-28
\put(0.15,-27.8){\makebox(0,0)[bl]{\psfig{figure=bigrams/#2,angle=-90}}}
\put(1,-29){\makebox(0,0)[b]{{#3\tt a}}}
\put(2,-29){\makebox(0,0)[b]{{#3\tt b}}}
\put(3,-29){\makebox(0,0)[b]{{#3\tt c}}}
\put(4,-29){\makebox(0,0)[b]{{#3\tt d}}}
\put(5,-29){\makebox(0,0)[b]{{#3\tt e}}}
\put(6,-29){\makebox(0,0)[b]{{#3\tt f}}}
\put(7,-29){\makebox(0,0)[b]{\raisebox{0mm}[0mm][0mm]{#3\tt g}}}
\put(8,-29){\makebox(0,0)[b]{{#3\tt h}}}
\put(9,-29){\makebox(0,0)[b]{{#3\tt i}}}
\put(10,-29){\makebox(0,0)[b]{\raisebox{0mm}[0mm][0mm]{#3\tt j}}}
\put(11,-29){\makebox(0,0)[b]{{#3\tt k}}}
\put(12,-29){\makebox(0,0)[b]{{#3\tt l}}}
\put(13,-29){\makebox(0,0)[b]{{#3\tt m}}}
\put(14,-29){\makebox(0,0)[b]{{#3\tt n}}}
\put(15,-29){\makebox(0,0)[b]{{#3\tt o}}}
\put(16,-29){\makebox(0,0)[b]{\raisebox{0mm}[0mm][0mm]{#3\tt p}}}
\put(17,-29){\makebox(0,0)[b]{\raisebox{0mm}[0mm][0mm]{#3\tt q}}}
\put(18,-29){\makebox(0,0)[b]{{#3\tt r}}}
\put(19,-29){\makebox(0,0)[b]{{#3\tt s}}}
\put(20,-29){\makebox(0,0)[b]{{#3\tt t}}}
\put(21,-29){\makebox(0,0)[b]{{#3\tt u}}}
\put(22,-29){\makebox(0,0)[b]{{#3\tt v}}}
\put(23,-29){\makebox(0,0)[b]{{#3\tt w}}}
\put(24,-29){\makebox(0,0)[b]{{#3\tt x}}}
\put(25,-29){\makebox(0,0)[b]{\raisebox{0mm}[0mm][0mm]{#3\tt y}}}
\put(26,-29){\makebox(0,0)[b]{{#3\tt z}}}
\put(27,-29){\makebox(0,0)[b]{{#3--}}}
% they used to be at height -29 and were aligned  bottom
%\put(27,-29){\makebox(0,0)[b]{{#3\verb+-+}}}
%      
\put(29,-29){\makebox(0,0)[r]{#3$y$}}
%
\put(-0.2,-1){\makebox(0,0)[r]{{#3\tt a}}}
\put(-0.2,-2){\makebox(0,0)[r]{{#3\tt b}}}
\put(-0.2,-3){\makebox(0,0)[r]{{#3\tt c}}}
\put(-0.2,-4){\makebox(0,0)[r]{{#3\tt d}}}
\put(-0.2,-5){\makebox(0,0)[r]{{#3\tt e}}}
\put(-0.2,-6){\makebox(0,0)[r]{{#3\tt f}}}
\put(-0.2,-7){\makebox(0,0)[r]{{#3\tt g}}}
\put(-0.2,-8){\makebox(0,0)[r]{{#3\tt h}}}
\put(-0.2,-9){\makebox(0,0)[r]{{#3\tt i}}}
\put(-0.2,-10){\makebox(0,0)[r]{{#3\tt j}}}
\put(-0.2,-11){\makebox(0,0)[r]{{#3\tt k}}}
\put(-0.2,-12){\makebox(0,0)[r]{{#3\tt l}}}
\put(-0.2,-13){\makebox(0,0)[r]{{#3\tt m}}}
\put(-0.2,-14){\makebox(0,0)[r]{{#3\tt n}}}
\put(-0.2,-15){\makebox(0,0)[r]{{#3\tt o}}}
\put(-0.2,-16){\makebox(0,0)[r]{{#3\tt p}}}
\put(-0.2,-17){\makebox(0,0)[r]{{#3\tt q}}}
\put(-0.2,-18){\makebox(0,0)[r]{{#3\tt r}}}
\put(-0.2,-19){\makebox(0,0)[r]{{#3\tt s}}}
\put(-0.2,-20){\makebox(0,0)[r]{{#3\tt t}}}
\put(-0.2,-21){\makebox(0,0)[r]{{#3\tt u}}}
\put(-0.2,-22){\makebox(0,0)[r]{{#3\tt v}}}
\put(-0.2,-23){\makebox(0,0)[r]{{#3\tt w}}}
\put(-0.2,-24){\makebox(0,0)[r]{{#3\tt x}}}
\put(-0.2,-25){\makebox(0,0)[r]{{#3\tt y}}}
\put(-0.2,-26){\makebox(0,0)[r]{{#3\tt z}}}
\put(-0.2,-27){\makebox(0,0)[r]{{#3--}}}
%\put(-0.2,-27){\makebox(0,0)[r]{{#3\verb+-+}}}

\put(-0.2,1){\makebox(0,0)[r]{#3$x$}}
\end{picture}
}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% from theorems.tex for exact.tex
\newcommand{\PGB}{P^{G}_B}
\newcommand{\PGb}{P^{G}_b}
\newcommand{\PB}{P_B}
\newcommand{\Pb}{P_b}

% included by l2.tex
% definitions for weighings.tex and for text
% shows weighing trees, ternary
%
% decisions of what to weigh are shown in square boxes with 126 over 345 (l:r)
% state of valid hypotheses are listed in double boxes
% or maybe dashboxes?
% three arrows, up means left heavy,  straioght means right heavy, down is balance
%
\newcommand{\mysbox}[3]{\put(#1){\framebox(#2){\begin{tabular}{c}#3\end{tabular}}}}
\newcommand{\mydbox}[3]{\put(#1){\framebox(#2){\begin{tabular}{c}#3\end{tabular}}}}
\newcommand{\myuvector}[3]{\put(#1){\vector(#2){#3}}}
\newcommand{\mydvector}[3]{\put(#1){\vector(#2){#3}}}
\newcommand{\mysvector}[2]{\put(#1){\vector(1,0){#2}}}
\newcommand{\mythreevector}[4]{\myuvector{#1}{#2,#3}{#4}\mydvector{#1}{#2,-#3}{#4}\mysvector{#1}{#4}}
%
%\newcommand{\h1}{\mbox{$1^+$}}
%\newcommand{\l1}{\mbox{$1^-$}}
%\newcommand{\h2}{\mbox{$2^+$}}
%\newcommand{\l2}{\mbox{$2^-$}}
%\newcommand{\h3}{\mbox{$3^+$}}
%\newcommand{\l3}{\mbox{$3^-$}}
%\newcommand{\h4}{\mbox{$4^+$}}
%\newcommand{\l4}{\mbox{$4^-$}}
%\newcommand{\h5}{\mbox{$5^+$}}
%\newcommand{\l5}{\mbox{$5^-$}}
%\newcommand{\h6}{\mbox{$6^+$}}
%\newcommand{\l6}{\mbox{$6^-$}}
%\newcommand{\h7}{\mbox{$7^+$}}
%\newcommand{\l7}{\mbox{$7^-$}}
%\newcommand{\h8}{\mbox{$8^+$}}
%\newcommand{\l8}{\mbox{$8^-$}}
%\newcommand{\h9}{\mbox{$9^+$}}
%\newcommand{\l9}{\mbox{$9^-$}}
%\newcommand{\h10}{\mbox{$10^+$}}
%\newcommand{\l10}{\mbox{$10^-$}}
%\newcommand{\h11}{\mbox{$11^+$}}
%\newcommand{\l11}{\mbox{$11^-$}}
%\newcommand{\h12}{\mbox{$12^+$}}
%\newcommand{\l12}{\mbox{$12^-$}}

\title{Information Theory,  Inference, \& Learning Algorithms}
\shortlecturetitle{}
\shortauthor{David J.C. MacKay}
% the book - called by book.tex
% thebook.tex
% Mon 7/10/02
\setcounter{page}{0} % set to current value    
\setcounter{exercise_number}{1} % set to imminent value 
%
\setcounter{secnumdepth}{1}    % sets the level at which subsection numbering stops
\setcounter{tocdepth}{0}
\newcommand{\mysetcounter}[2]{}%was {\setcounter{#1}{#2}}
% useful for forcing pagenumbers in drafts
%\setcounter{tocdepth}{1}    
\renewcommand{\bs}{{\bf s}}
\newcommand{\figs}{/home/mackay/handbook/figs} % while in bayes chapter 
% \addtocounter{page}{-1}
\thispagestyle{empty} 
\begin{center}
~\\[1.5in]
{\Huge \bf Information Theory,  \\[0.2in]  
% Pattern Recognition  \\[0.1in] 
 Inference,\\[0.2in]
 and Learning Algorithms\\[1in]
% Probability \\[0.2in]
% and Neural Networks\\[1in]
}
{\Large\sf David J.C. MacKay } \\[0.3in]
\copyright  1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003\\[1.3in]
Draft \thedraft\ \today\\ 
\end{center}
\dvipsb{frontpage}
\newpage
% choose one of these: 
% \input{cambridgefrontstuff.tex}
%
% alternate
\fakesection{Roadmap}
% \subchapter{Part III Physics}
\section*{Information Theory, Pattern Recognition and Neural Networks}
% {\large  Handout number 3.}\medskip
%
% Do not be disturbed by missing pagenumbers: the handouts do
% not include all the  book's chapters.
%
% Because the handouts are based on  different drafts of the book, there may
% be occasional mismatches between pagereferences, etc., where
% cross-references occur between this and the other handouts.
%
% Nonexaminable material is indicated by the symbol
% \nonexaminable.
%

\section*{Approximate roadmap for the eight-week course in Cambridge}
 The course will cover about 16 chapters of this book.
 The rest of the book is provided for your interest.
 The book contains numerous exercises with worked solutions.
% less than half the material in this book.
\medskip

\noindent
\begin{tabular}{@{}lp{5in}}
 Lecture 1 & Introduction to Information Theory.
		 \Chref{ch.one}.
% The lecture is mirrored by
\\
 Before lecture 2 &  Work on \exerciseref{ex.weigh}.
\\
	 & Read chapters \ref{ch.prob.ent} and \ref{ch.two}
                  and work on  exercises  in \chref{ch.prob.ent}.
% listed in the introduction		to \chref{ch.two}.
\\
 Lecture 2--3 &  Information content \& typicality. \Chref{ch.two}.
\\
 Lecture 4 & Symbol codes.  \Chref{ch.three}.
\\
 Lecture 5 & Arithmetic codes.  \Chref{ch.ac} (sections \ref{sec.startofch4}--\ref{sec.stopbeforeLZ} only).
\\
	 &  Read \chref{ch.prefive} and do the exercises.
\\
 Lecture 6 & Noisy channels. Definition of mutual information and capacity.
		\Chref{ch.five}.
\\
 Lecture 7--8 & The noisy channel coding theorem.
		\Chref{ch.six} (but not section \ref{sec.ch6stop} onwards).
\\
 Lecture 9 & Clustering. Bayesian inference. \Chref{ch1b}, \ref{ch.clustering}, \ref{ch.ml}.\\
% as a data modelling problem. \\
          & Read \chref{ch.ising} (Ising models).
\\
 Lecture 10--11 & Monte Carlo methods. \Chref{ch.mc}, \ref{ch.mc2}. \\
 Lecture 12 & Variational methods. \Chref{ch.mft}. \\
 Lecture 13 & Neural networks -- the single neuron. 
		\Chref{ch.single.neuron.class}. \\
 Lecture 14 & Capacity of the single neuron.  \Chref{ch.single.neuron.capacity}. \\
 Lecture 15 & Learning as inference.  \Chref{ch.single.neuron.bayes}. \\
 Lecture 16 & The Hopfield network. Content-addressable memory.  \Chref{ch.hopfield}. \\
\end{tabular}



\subsection*{About the exercises}
 I firmly believe that one can only understand a subject by 
 recreating it for oneself. To this end, I think it is essential to
 work through some exercises on each topic.  For guidance, each exercise
 has a rating (similar to that used by \citeasnoun{Knuth_vol1})
 from 1 to 5 that indicates the level of difficulty.

 In addition, exercises that are especially recommended
 are marked by a marginal encouraging rat -- \dorat.
 Exercises that require the use of a computer may be
 marked with a {\sl C}.
% will have 
% a rating such as A1, A5, C1 or C5. 
% The letter  indicates how important I think the exercise is:
% A = very important $\ldots$ C = not essential to the flow of the
% book. The number indicates the difficulty of the problem: 
% 1 = easy, 5 = research project.

 I'll circulate detailed recommendations on exercises
 as the course progresses.

 Answers to many of the exercises are provided. Please use them
 wisely.

 

%\begin{table}[htbp]
%\caption[a]
\begin{realcenter}
\fbox{
\begin{tabular}{ll}
%\begin{minipage}{3in}
{\sf Summary of codes for exercises}\\%[0.2in]
% \hspace{0.2in}
\begin{tabular}[b]{cl}
\dorat & Especially recommended \\[0.2in]
{\ensuremath{\triangleright}} & Recommended \\
{\sl C}   & Some parts require a computer \\
\end{tabular}
%\end{minipage}
&
\begin{tabular}[b]{cl}
\pdifficulty{1} & Simple (one minute) \\
\pdifficulty{2} &  Medium (quarter hour) \\
\pdifficulty{3} &  Moderately hard \\
\pdifficulty{4} &  Hard \\
\pdifficulty{5} &  Research project \\[0.2in]
\end{tabular}
\\
\end{tabular}
}
\end{realcenter}
%\end{table}



%%%%%%%%%%%%%%
\newpage
\tableofcontents
\dvipsb{toc}
% \input{extrafrontstuff.tex}% aims dedication, about the author, etc
% see also tex/oldaims.tex
% for some good stuff.
% and tex/typicalreaders.tex
%
%% \input{tex/overview2001.tex}
 
\prechapter{About Chapter} 
\setcounter{page}{1} % set to current value    
\label{pch.one}
%
% pre-chapter 1
%
\fakesection{Before ch 1}
 I hope you will find the mathematics in the first chapter easy.
 You will need to be familiar with the \ind{binomial distribution}.
% , reviewed below.
 And to solve the exercises in the text --
 which I urge you to do -- you will need to remember {\dem\ind{Stirling's
 approximation}\/}\index{approximation, Stirling's}
 for the factorial function, $%\beq
	 x! \simeq  x^{x}  e^{-x} 
$,
 and be able to
 apply it to ${{N}\choose{r}} =
 \frac{N!}{(N-r)!r!}$.\marginpar{\footnotesize{Unfamiliar notation?\\ See
 appendix \ref{app.notation}, \pref{app.notation}.}}
% $x!$
 These topics are reviewed below.

\subsection*{The binomial distribution}
\label{sec.first.binomial}
\exampl{ex.binomial}{
 A bent coin has probability $f$ of coming up heads.
 The coin is tossed $N$ times.
 What is the  probability
 distribution of the number of heads, $r$?
 What are the \ind{mean} and \ind{variance} of $r$?
}

\amarginfig{t}{%
\begin{tabular}{r}
% $P(r|f,N)$\\
\mbox{\psfig{figure=bigrams/urn.f.g.ps,angle=-90,width=1.51in}}%
\\
\mbox{\psfig{figure=bigrams/urn.f.l.ps,angle=-90,width=1.64in}}%
\\[-0.1in]
\multicolumn{1}{c}{$r$}
\\
\end{tabular}
%}{%
\caption[a]{The binomial distribution $P(r|f\eq 0.3,\,N \eq 10)$,
 on a linear scale (top) and  a logarithmic scale (bottom).}
\label{fig.binomial}
}
% see bigrams/README

\noindent
%\begin{Sexample}{ex.binomial}
{\sf Solution:}
\label{sec.first.binomial.sol}
 The number of heads
 has a binomial distribution.
\beq P(r|f,N) = {N \choose r} f^{r} (1-f)^{N-r} \eeq
 The mean, $\Exp [ r ]$, and variance, $\var[r]$,
 of this distribution are
 defined by
\beq
 \Exp [ r ] \equiv \sum_{r=0}^{N} P(r|f,N) \, r
\label{eq.mean.def}
\eeq
\beqan
 \var[r] & \equiv &
\Exp \left[ \left( r  -   \Exp [ r ] \right)^2 \right] \\
& = &
\Exp [ r^2 ] -  \left( \Exp [ r ] \right)^2
 =  \sum_{r=0}^{N} P(r|f,N) r^2 -  \left( \Exp [ r ] \right)^2 .
\label{eq.var.sum}
\eeqan
%
 Rather than evaluating the sums over $r$ (\ref{eq.mean.def},\ref{eq.var.sum}) directly,
 it is easiest to  obtain the mean and variance by noting that $r$
 is the sum of $N$ {\em independent\/}
% , identically distributed
 random variables, namely, the number of heads in the
 first toss (which is either zero or one),
 the number of heads in the second toss, and so forth.
 In general, 
\beq
\begin{array}{rcll}
 \Exp [ x + y ] &=&  \Exp [ x ] +  \Exp [ y ]  & \mbox{for any random variables $x$ and $y$};
\\
 \var [ x + y ] &=&  \var [ x ] +  \var [ y ]  & \mbox{if $x$ and $y$ are independent}.
\end{array}
\eeq
 So the mean of $r$ is the sum of the means of those random
 variables, and the variance of $r$ is the sum of their variances.
% its mean and variance are given by adding the means and variances
% of those random variables, respectively.
 The mean number of heads in a single toss
 is $f\times 1 + (1-f)\times 0 = f$, and the variance of the
 number of heads in a single toss  is
\beq
 \left[ f\times 1^2 + (1-f)\times 0^2 \right] - f^2 = f - f^2 = f(1-f),
\eeq
 so the mean and variance of $r$ are:
\beq \Exp [ r ] = N f
%\eeq\beq
\hspace{0.5in} \mbox{and} \hspace{0.5in}
 \var[r] = N f (1-f) .
\eeq
%\end{Sexample}
% ADD END PROOF SYMBOL HERE !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

\subsection*{Approximating $x!$ and ${{N}\choose{r}}$}

\amarginfig{t}{%
\begin{tabular}{r}
\mbox{\psfig{figure=bigrams/poisson.g.ps,angle=-90,width=1.5in}}%
\\
\mbox{\psfig{figure=bigrams/poisson.l.ps,angle=-90,width=1.64in}}%
\\[-0.1in]
\multicolumn{1}{c}{$r$}
\\
\end{tabular}
%}{%
\caption[a]{The Poisson distribution $P(r\,|\,\l\eq 15)$,
 on a linear scale (top) and  a logarithmic scale (bottom).}
\label{fig.poisson}
}
% see bigrams/README
\label{sec.poisson}


% FAVOURITE BIT
\noindent
 Let's derive Stirling's approximation by an unconventional route.
 We start from the \ind{Poisson distribution},
\beq
	P( r | \l ) = e^{-\l} \frac{\l^r}{r!} \:\:\:\:
  \:\: r\in \{ 0,1,2,\ldots\} .
\label{eq.poisson}
\eeq
%
% \noindent
For large $\l$, this distribution is well approximated -- at least\index{approximation by Gaussian}
 in the vicinity of $r \simeq \l$ -- by
 a Gaussian distribution with mean $\l$ and variance $\l$:
% So,
\beq
  e^{-\l} \frac{\l^r}{r!} \simeq \frac{1}{\sqrt{2\pi \l}}
		e^{{\textstyle -\frac{(r-\l)^2}{2\l}}} .
\eeq
 Let's plug  $r=\l$ into this formula.
\beqan
  e^{-\l} \frac{\l^{\l}}{\l!} &\simeq& \frac{1}{\sqrt{2\pi \l}}
\\
\Rightarrow \l! &\simeq&  \l^{\l} \, e^{-\l}  \sqrt{2\pi \l}  .
\eeqan
 This is {\bf Stirling's approximation}
 for the \ind{factorial} function,
 including several of the correction
 terms that are usually forgotten.
\beq
	 x! \simeq  x^{x}  e^{-x}  \sqrt{2\pi x}  \:\:\:\Leftrightarrow\:\:\:
	\ln x! \simeq x \ln x - x + {\textstyle\frac{1}{2}} \ln {2\pi x} .
\label{eq.stirling}
\eeq
%
 We can use this approximation
% the approximation
%$%\beq
%	 x! \simeq  x^{x}  e^{-x} $
 to approximate\index{combination}
$%\beq
	{{N}\choose{r}} \equiv \frac{N!}{(N-r)!r!}
$.%\eeq
\beqan
\ln	{{N}\choose{r}} 
%	&	\simeq &
% N  [ \ln N - 1 ] - (N-r) [ \ln (N-r) - 1 ] - r [ \ln r - 1 ]
%\\
 & \simeq & (N-r) \ln\frac{N}{N-r} + r \ln\frac{N}{r}
 .
\label{eq.choose.approx}
\eeqan
 Since all the terms in this equation are logarithms,
 this result can be rewritten in any base.\marginpar{\small Recall that
$\displaystyle{ \log_2 x = \frac{ \log_e x }{ \log_e 2} }$.\\[0.03in]
 Note that $\displaystyle\frac{\partial  \log_2 x }{\partial x} =
 \frac{1}{\log_e 2}\,\frac{1}{x}$.
}
%\fakesubsection*{My rule about log and ln}
 We will denote\index{conventions!logarithms}\index{notation!logarithms}
 natural logarithms ($\log_e$) by `ln', and \ind{logarithms}
 to base 2 ($\log_2$)
 by `$\log$'.

 If we introduce the {\dbf\ind{binary entropy function}},
\beq
 H_2(x) \equiv x \log \frac{1}{x} + (1-x) \log \frac{1}{(1-x)}
\eeq
 then we can rewrite the approximation
 (\ref{eq.choose.approx})
%\beq
%$ \log	{{N}\choose{r}} 
%  \simeq  (N-r) \log \frac{N}{N-r} + r \log \frac{N}{r} 
%$
%\eeq
 as
\amarginfig{t}{\small%
\begin{center}
\mbox{
\hspace{-6mm}
% \hspace{6.2mm}
\raisebox{\hpheight}{$H_2(x)$}
% to put H at left:
\hspace{-7.5mm}
% \hspace{-20mm}
\mbox{\psfig{figure=figs/H2.ps,%
width=42mm,angle=-90}}$x$
}
% see also H2p.tex

\end{center}
\caption[a]{The  binary entropy function.}
% $H_2(x)$.}
\label{fig.h2x}
}
\beq
\log	{{N}\choose{r}} 
  \simeq N H_2(r/N) ,
\label{eq.stirling.choose.l}
\eeq
 or, equivalently,
% \:\:\:\Leftrightarrow\:\:\:
\beq
	{{N}\choose{r}} 
  \simeq 2^{N H_2(r/N)} .
\label{eq.stirling.choose}
\eeq
 If we need a more accurate approximation, we
 can include terms of the next order from
 Stirling's approximation
 (\ref{eq.stirling}):
\beq
\log	{{N}\choose{r}} 
  \simeq N H_2(r/N) -
 {\textstyle\frac{1}{2}} \log \left[ {2\pi N \, \frac{N\!-\!r}{N} \,
                                       \frac{r}{N}}  \right]
.
\label{eq.H2approxaccurate}
\eeq
%
% - {\textstyle\frac{1}{2}} \ln {2\pi N}
% + {\textstyle\frac{1}{2}} \ln {2\pi N-r}
% + {\textstyle\frac{1}{2}} \ln {2\pi r}
%
% ln += {\textstyle\frac{1}{2}} \ln {2\pi (N-r)(r)/N}
% log_2 += {\textstyle\frac{1}{2}} \log_2 {2\pi (N-r)(r)/N}
% or
% log_2 += {\textstyle\frac{1}{2}} \log_2 {2\pi N}
%        + {\textstyle\frac{1}{2}} \log_2 {\frac{(N-r)}{N}\frac{r}{N}} 
% log_2 += {\textstyle\frac{1}{2}} \log_2 {2\pi \frac{(N-r)}{N}\frac{r}{N} N} 


\chapter{Introduction to Information Theory}
\label{ch.one}
\label{chone}
\addtopic{3}{infotheory}
\addtopic{2}{inference}
\addtrack{1}{inferencecourse}
\addtrack{3}{infotheorycourse}
\addtrack{3}{itprnncourse}
% % \part{Information Theory}
% \chapter{Introduction to Information Theory}
\label{ch1}
%\section{Communication over noisy channels}
% One of the principal questions addressed by information theory is 
%  Shannon's ground-breaking paper on `The Mathematical Theory of 
%  Communication' opens thus:
\begin{quotation}
\noindent
 The fundamental problem of communication is that of reproducing at one point
 either exactly or approximately a message selected at another point.
\\
\mbox{~} \hfill {\em (Claude Shannon, 1948)}  \\
%
\end{quotation}

\noindent
 In the first half of 
 this book we
%are going to
 study how to measure information content; 
 we
% are going to
% learn by how much data from a given source 
% can be compressed; we
% are going to
 learn how
% , practically, to
% achieve  data compression;
 to compress data; and we
% are going to
 learn  how to communicate 
 perfectly over  imperfect communication channels. 

 We start by getting a feeling for this last problem. 

\section[How can we achieve perfect communication?]{How
 can we achieve perfect communication over an imperfect, noisy
 commmunication channel?}
 Some examples of noisy communication channels are:
\bit
\item
 an analogue telephone 
 line,\marginpar{\footnotesize
\setlength{\unitlength}{1mm}%
 \begin{picture}(45,10)(0,5)
\put(0,10){\makebox(0,0)[l]{\shortstack{modem}}}
\put(21,10){\makebox(0,0)[l]{\shortstack{phone\\line}}}
\put(39,10){\makebox(0,0)[l]{\shortstack{modem}}}
\put(15,10){\vector(1,0){3}}
\put(32,10){\vector(1,0){3}}
\end{picture}
}
 over which two modems communicate digital information;
\item
 the radio communication link from the  Jupiter-orbiting spacecraft,
  Galileo,\marginpar{\footnotesize
\setlength{\unitlength}{1mm}%
 \begin{picture}(45,10)(0,5)
\put(0,10){\makebox(0,0)[l]{\shortstack{Galileo}}}
\put(21,10){\makebox(0,0)[l]{\shortstack{radio\\waves}}}
\put(39,10){\makebox(0,0)[l]{\shortstack{Earth}}}
\put(15,10){\vector(1,0){3}}
\put(32,10){\vector(1,0){3}}
\end{picture}
}
 to earth;
\item
\marginpar[c]{\footnotesize
\setlength{\unitlength}{1mm}%
 \begin{picture}(30,20)(0,0)
\put(0,10){\makebox(0,0)[l]{\shortstack{parent\\cell}}}
\put(16,2){\makebox(0,0)[l]{\shortstack{daughter\\cell}}}
\put(16,16){\makebox(0,0)[l]{\shortstack{daughter\\cell}}}
\put(10,10){\vector(1,1){5}}
\put(10,10){\vector(1,-1){5}}
\end{picture}
}reproducing cells, in which the daughter cells's \ind{DNA}
 contains information from the parent
% cell or
 cells;
\item 
 \marginpar{\footnotesize
\setlength{\unitlength}{1mm}%
 \begin{picture}(45,10)(0,5)
\put(0,10){\makebox(0,0)[l]{\shortstack{computer\\ memory}}}
\put(20,10){\makebox(0,0)[l]{\shortstack{disc\\drive}}}
\put(33,10){\makebox(0,0)[l]{\shortstack{computer\\ memory}}}
\put(15,10){\vector(1,0){3}}
\put(29,10){\vector(1,0){3}}
\end{picture}
}a disc drive.
\eit
 The last example shows that communication doesn't have to involve 
 information going from one {\em place\/}  to another. When 
 we write a file on a disc drive, we'll
% typically
 read it off
% again 
 in the same location -- but at a later {\em time}.

 These channels are noisy. A telephone line  suffers
 from cross-talk with other lines; the hardware in the 
 line distorts and adds noise to the transmitted signal.  The deep
 space network that listens to Galileo's puny transmitter
% fairy-bulb power
 receives background radiation  from
 terrestrial and cosmic sources.
 DNA is subject to mutations and damage. 
 A \ind{disc drive}, which  writes
 a binary digit (a one or zero, also known as a {\dbf bit}) by aligning a patch of magnetic
 material in one of two orientations, may later
% , with some probability,
 fail to read out the stored binary digit:
% that was stored
 the patch of material might  spontaneously flip
 magnetization, or
 a glitch of
 background noise might cause the reading circuit
 to report the wrong 
 value for the binary digit, or  the writing head might not induce 
 the magnetization in the first place because of interference
 from neighbouring bits.

 In all these cases, if we transmit data, \eg, a string 
 of bits, over the channel, there is some probability that 
 the received message will not be identical to the transmitted message. 
% And in all cases,
 We would prefer to have a communication channel for
 which this probability was zero -- or so close to zero that 
 for practical purposes it is indistinguishable from zero.  

 Let's consider
% the example of
 a noisy disc drive
% having the property
 that transmits  each bit  correctly
% transmitted
 with probability
 $(1-f)$ and incorrectly  with probability $f$. 
 This model
% favourite
 communication channel  is known 
 as the {\dbf{\ind{binary symmetric channel}}} (\figref{fig.bsc1}).

\begin{figure}[htbp]
\figuremargin{%
\[
\begin{array}{c}
\setlength{\unitlength}{0.46mm}
\begin{picture}(30,20)(-5,0)
\put(-4,9){{\makebox(0,0)[r]{$x$}}}
\put(5,2){\vector(1,0){10}}
\put(5,16){\vector(1,0){10}}
\put(5,4){\vector(1,1){10}}
\put(5,14){\vector(1,-1){10}}
\put(4,2){\makebox(0,0)[r]{1}}
\put(4,16){\makebox(0,0)[r]{0}}
\put(16,2){\makebox(0,0)[l]{1}}
\put(16,16){\makebox(0,0)[l]{0}}
\put(24,9){{\makebox(0,0)[l]{$y$}}}
\end{picture}
\end{array}

\:\:\:
\begin{array}{ccl}%%%%% {c@{}c@{}l} %%%%% (for twocolumn style)
        P(y\eq 0|x\eq 0) &\!=\!& 1 - \q ; \\        P(y\eq 1|x\eq 0) &\!=\!& \q ;
\end{array} 
\begin{array}{ccl}
        P(y\eq 0|x\eq 1) &\!=\!&  \q ; \\ P(y\eq 1|x\eq 1) &\!=\!& 1 - \q .
\end{array} 
\]
}{%
\caption[a]{The binary symmetric channel. The 
 transmitted symbol is $x$ and the 
 received symbol $y$. The noise level, the probability of a bit's being
 flipped, is $f$.}
\label{fig.bsc1}
}%
\end{figure}
\begin{figure}[htbp]
\figuremargin{%
\begin{mycenter}
\begin{tabular}{rcl}
\psfig{figure=bitmaps/dilbert.ps,width=1.2in} 
&\hspace{0.1in}%
\raisebox{0.22in}{%
\setlength{\unitlength}{1.2mm}%
\begin{picture}(20,20)(0,0)%
\put(10,1){\makebox(0,0)[t]{$(1-f)$}}
\put(10,17){\makebox(0,0)[b]{$(1-f)$}}
\put(12,9.5){\makebox(0,0)[l]{$f$}}
% \put(10,16.5){\makebox(0,0)[b]{$(1-f)$}}
\put(5,2){\vector(1,0){10}}
\put(5,16){\vector(1,0){10}}
\put(5,4){\vector(1,1){10}}
\put(5,14){\vector(1,-1){10}}
\put(4,2){\makebox(0,0)[r]{{1}}}
\put(4,16){\makebox(0,0)[r]{{0}}}
\put(16,2){\makebox(0,0)[l]{{1}}}
\put(16,16){\makebox(0,0)[l]{{0}}}
\end{picture}%
}%
\hspace{0.385in}&
\psfig{figure=_is/10000.10.ps,width=1.2in} \\
% & & \makebox[0in][l]{\large 10\% of bits are flipped} \\
\end{tabular}
\end{mycenter}
}{%
\caption[a]{A binary data sequence of length 10000 transmitted over 
 a binary symmetric channel with  noise level $f=0.1$.
\dilbertcopy}
\label{fig.bsc.dil}
}%
\end{figure}

\noindent
 As an example,
% For the sake of argument,
 let's imagine that $f=0.1$, that is, ten \percent\ of the bits are 
 flipped (figure \ref{fig.bsc.dil}).
% For a disc drive to be useful, we would prefer that it should 
% flip no bits at all in its entire lifetime.
 A useful disc drive would  flip no bits at all in its entire lifetime.
%  
 If we expect to read and write a 
 gigabyte per day for ten years,  we require a bit error 
 probability  of the order of $10^{-15}$, or smaller.
 There are two approaches to this goal. 


\subsection{The physical solution}
 The physical solution is to improve the physical characteristics of 
 the communication channel to reduce its error probability. We could 
 improve our disc drive by
% , for example,
\ben
\item
  using more reliable components in its circuitry;
\item
  evacuating the air from the disc enclosure so as
 to eliminate the turbulence that perturbs the 
 reading head from the  track; 
\item
  using a larger magnetic patch to represent each bit;  or 
\item 
   using higher-power signals or cooling the 
 circuitry in order to reduce thermal noise. 
\een
 These physical modifications
 typically
 increase the cost of the communication 
 channel.
%   unit of area  making the disc spin at a slower rate

%
% the system solution
% 
\begin{figure}%[htbp]
\figuremargin{%
\setlength{\unitlength}{1.25mm}
\begin{mycenter}
\begin{picture}(50,40)(-10,5)
\put(0,5){\framebox(25,10){\begin{tabular}{c}Noisy\\ channel\end{tabular}}}
\put(-20,20){\framebox(25,10){\begin{tabular}{c}Encoder\end{tabular}}}
\put(20,20){\framebox(25,10){\begin{tabular}{c}Decoder\end{tabular}}}
%\put(-20,40){\framebox(25,10){\begin{tabular}{c}Compressor\end{tabular}}}
%\put(20,40){\framebox(25,10){\begin{tabular}{c}Decompressor\end{tabular}}}
%\put(-50,20){\makebox(25,10){\begin{tabular}{c}{\sc Source}\\{\sc coding}\end{tabular}}}
% \put(-50,40){\makebox(25,10){\begin{tabular}{c}{\sc Channel}\\{\sc coding}\end{tabular}}}
\put(-20,37){\makebox(25,12){Source}}
%
\put(-10,14){\makebox(0,0){$\bt$}}
\put(-10,34){\makebox(0,0){$\bs$}}
\put(35,14){\makebox(0,0){$\br$}}
\put(35,34){\makebox(0,0){$\hat{\bs}$}}

\put(-7.5,18){\line(0,-1){8}}  
\put(-7.5,10){\vector(1,0){6}} 
\put(32.5,10){\vector(0,1){8}}
\put(32.5,10){\line(-1,0){6}}
%
\put(32.5,31){\vector(0,1){8}}
%\put(32.5,51){\vector(0,1){5}}
\put(-7.5,39){\vector(0,-1){8}}
%\put(-7.5,55){\vector(0,-1){5}}
\end{picture}
\end{mycenter}
}{%
\caption[a]{The `system' solution for
         achieving 
% almost perfect 
        reliable communication
        over a noisy channel. The encoding system introduces
 systematic redundancy
%        in a systematic way 
        into the transmitted vector $\bt$. The decoding system 
   uses this known redundancy to deduce
 from  the 
        received vector $\br$
 {\em both\/}
 the original source vector
        {\em and\/}
 the noise introduced by the channel.
}
\label{system.solution}
}%
\end{figure}
\subsection{The `system' solution}
 Information theory\index{information theory} and
 \ind{coding theory}\index{system} offer
 an alternative (and much more exciting)
 approach: we accept the given noisy channel as it is
 and 
 add communication {\dem systems\/} to it so that we 
 can {detect\/} and {correct\/} the errors introduced by the 
% noise.
 channel.
 As shown in \figref{system.solution}, we   add an 
 {\dem\ind{encoder}\/} before the channel and a {\dem\ind{decoder}\/} after 
 it. The encoder encodes the source message $\bs$ 
 into a {\dem transmitted\/} message $\bt$,
% the idea is that the  encoder adds
 adding {\dem\ind{redundancy}\/} to the original message in some way. The 
 channel adds noise to the transmitted message, yielding a received 
 message $\br$. The decoder uses the  known redundancy 
 introduced by the encoding system to infer both the original signal 
 $\bs$ and the added noise.
% added by the channel was. 

 Whereas  physical solutions give incremental channel improvements
 only  at an ever-increasing cost,
% we hope to find
% there exist
 system solutions  can turn noisy channels into reliable
 communication channels
 with the only cost being a  {\em computational\/} requirement 
 at the encoder and decoder.
% (and the delay associated with those computations.
%
% suggested addition: 
% So, as the cost of computation falls, the cost of reliability will fall as well.

{\dbf Information theory} is concerned with the theoretical limitations and 
% theoretical 
 potentials of such  systems. `What is the best error-correcting 
 performance we could achieve?'

{\dbf Coding theory} is concerned with the creation of practical 
 encoding and decoding systems.
 
% Some
\section{Error-correcting codes for the binary symmetric channel}
 We now consider  examples of encoding and decoding systems. 
 What is the simplest way to  add useful redundancy to a transmission? 
 [To make the rules of the game clear:
 we want to be able to detect {\em and\/} correct errors;
 and retransmission is not an option. We  get only
one chance to encode, transmit,
 and decode.]

\subsection{Repetition codes}
\label{sec.r3}
 A straightforward idea is to repeat every bit of the message a prearranged
 number of times -- for example, three times, as shown in \figref{fig.r3}. 
 We call this {\dem \ind{repetition code}\/} `$\Rthree$'.

%\begin{figure}[htbp]
%\figuremargin{%
\amarginfig{c}{
\begin{mycenter}
\begin{tabular}{c@{\hspace{0.3in}}c} \toprule % \hline
%        Source sequence $\bs$ &  Transmitted sequence $\bt$ \\ \hline
        Source  &  Transmitted  \\[-0.02in] % was -0.1, which was to much
         sequence  &   sequence  \\ 
         $\bs$ &   $\bt$ \\ \midrule % \hline
        \tt 0 &\tt  000 \\
        \tt 1 &\tt  111  \\ \bottomrule % \hline
\end{tabular} 
\end{mycenter}
%}{%
\caption[a]{The repetition code {$\Rthree$}.}
\label{fig.r3}
}%
%\end{figure}

% \noindent
%
 Imagine that
% what might happen if 
 we transmit the source message
\[
 \bs = \mbox{\tt 0 0 1 0 1 1 0}
\]
 over a binary 
 symmetric channel with noise level $f=0.1$ using this repetition code. 
 We can describe the channel as `adding' a sparse noise vector $\bn$ to the 
 transmitted vector -- adding in modulo 2 arithmetic, \ie, the binary algebra in which 
 {\tt 1}+{\tt 1}={\tt 0}.  A possible noise
 vector $\bn$ and received vector $\br = \bt + \bn$
 are shown in 
 \figref{fig.r3.transmission}.
\begin{figure}[htbp]
%
% here i should switch the \[ \] for a display that oes not introduce
% white space at the top (about 0.1in)
%
\figuremargin{%
\[
        \begin{array}{rccccccc}
        \bs & {\tt 0}&{\tt 0}&{\tt 1}&{\tt 0}&{\tt 1}&{\tt 1}&{\tt 0} \\
        \bt & \obr{{\tt 0}}{{\tt 0}}{{\tt 0}}&\obr{{\tt 0}}{{\tt 0}}{{\tt 0}}&\obr{{\tt 1}}{{\tt 1}}{{\tt 1}}&\obr{{\tt 0}}{{\tt 0}}{{\tt 0}}&
                \obr{{\tt 1}}{{\tt 1}}{{\tt 1}}&\obr{{\tt 1}}{{\tt 1}}{{\tt 1}}& \obr{{\tt 0}}{{\tt 0}}{{\tt 0}} \\ 
        \bn & \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}}& \nbr{{\tt 0}}{{\tt 0}}{{\tt 1}}&   \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}}&  \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}}&
                \nbr{{\tt 1}}{{\tt 0}}{{\tt 1}}&   \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}}&  \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}} \\ \cline{2-8}
        \br &  \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}}& \nbr{{\tt 0}}{{\tt 0}}{{\tt 1}}&   \nbr{{\tt 1}}{{\tt 1}}{{\tt 1}}&  \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}}&
                \nbr{{\tt 0}}{{\tt 1}}{{\tt 0}}&   \nbr{{\tt 1}}{{\tt 1}}{{\tt 1}}&  \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}} 
        \end{array}
\]
}{%
\caption{An example transmission using $\mbox{R}_3$.}
\label{fig.r3.transmission}
}
\end{figure}

%\noindent
 How should we decode this received vector?
%
% optimality not clear - should justify? 
%
% Perhaps you can see  that
 The optimal algorithm looks at the received
 bits three at a time and takes 
 a \ind{majority vote} (\algref{alg.r3}).
%
\begin{aside}
% 
 At the risk of explaining the obvious, let's prove this result.
 The optimal decoding decision
 (optimal in the sense
 of having the smallest probability of being wrong)
 is to find which value of $\bs$
 is most probable, given $\br$.\index{MAP}
% to make clear the assumptions.
 Consider the decoding of a single bit $s$, which was encoded
 as
% after encoding as
 $\bt(s)$ 
 and  gave rise to three received bits $\br = r_1r_2r_3$.
 By \ind{Bayes's theorem},\label{sec.bayes.used} the {\dem posterior
 probability\/} of $s$ is
\beq
	P(s \,|\, r_1r_2r_3 ) = \frac{ P( r_1r_2r_3 \,|\, s ) P( s ) }
				{ P( r_1r_2r_3 ) } .
\label{eq.bayestheorem}
\eeq
 We can spell out the posterior probability of the two alternatives thus:
\beq
	P(s\!=\!{\tt 1} \,|\, r_1r_2r_3 ) = \frac{ P( r_1r_2r_3 \,|\, s\!=\!{\tt 1} )
						P( s\!=\!{\tt 1} ) }
				{ P( r_1r_2r_3 ) } ; 
\label{eq.post1}
\eeq
\beq
	P(s\!=\!{\tt 0} \,|\, r_1r_2r_3 ) = \frac{ P( r_1r_2r_3 \,|\, s\!=\!{\tt 0} )
						P( s\!=\!{\tt 0} ) }
				{ P( r_1r_2r_3 ) } .
\label{eq.post0}
\eeq
%
 This \ind{posterior probability} is determined by two factors:
  the
 {\dem{\ind{prior} probability\/}} $P(s)$, and 
 the data-dependent term $P( r_1r_2r_3 \,|\, s )$, which is called
 the {\dem{\ind{likelihood}\/}} of $s$.
 The normalizing constant $P( r_1r_2r_3 )$
% is irrelevant to
 needn't be computed when finding
 the optimal decoding decision,
 which is to guess $\hat{s}\!=\!{\tt 0}$
 if $P(s\!=\!{\tt 0} \,|\, \br ) > P(s\!=\!{\tt 1} \,|\, \br )$,
 and $\hat{s}\!=\!{\tt 1}$ otherwise.

 To find
 $P(s\!=\!{\tt 0} \,|\, \br )$ and $P(s\!=\!{\tt 1} \,|\, \br )$,
% the optimal decoding decision,
 we must  make an assumption about the prior probabilities of the
 two hypotheses ${s}\!=\!{\tt 0}$ and ${s}\!=\!{\tt 1}$, and we 
 must make an assumption about the probability of $\br$ given
 $s$.
% $\bt(s)$.
 We  assume that the prior probabilities are equal:
 $P( {s}\!=\!{\tt 0}) = P( {s}\!=\!{\tt 1}) = 0.5$; 
 then  maximizing the posterior probability $P(s\,|\,\br)$ is
 equivalent to maximizing the likelihood $P(\br\,|\,s)$.\index{maximum likelihood}
 And  we  assume that the
 channel is a binary symmetric channel with noise level $f<0.5$, so that
 the likelihood is
\beq
	P( \br \,|\, s ) = P(\br \,|\, \bt(s) ) = \prod_{n=1}^N
		P(r_n \,|\, t_n(s) ) ,
\eeq
 where $N=3$ is the number of transmitted bits in the block
 we are considering, and
\beq
 P(r_n\,|\,t_n) = \left\{ \begin{array}{lll}
 (1\!-\!f) & \mbox{if} &  r_n=t_n \\
 f & \mbox{if} & r_n \neq t_n. \end{array} \right.
\eeq
 Thus the likelihood ratio for the
 two hypotheses is
% if we define $
\beq
	\frac{P(\br\,|\, s\!=\!{\tt 1})}{P(\br\,|\, s\!=\!{\tt 0})}
%	= \left( \frac{ (1-f) }{f} \right)^{
	= \prod_{n=1}^N
		\frac{P(r_n \,|\, t_n({\tt 1}) )}{P(r_n \,|\, t_n({\tt 0}) )} ;
\label{eq.likelihood.bsc}
\eeq
 each factor
% $P(r_n \,|\, t_n(s) )$
 $\frac{P(r_n \,|\, t_n({\tt 1}) )}{P(r_n \,|\, t_n({\tt 0}) )}$
 equals $\frac{ (1-f) }{f}$ if $r_n=1$ and $\frac{f}{ (1-f) }$ if
 $r_n=0$. 
 The ratio $\gamma \equiv \frac{ (1-f) }{f}$ is greater than 1,
 since $f<0.5$, so the winning hypothesis is the one with the most
 `votes', each vote counting for a factor of $\gamma$ in the
% posterior probability.
 likelihood ratio. 

 Thus the majority-vote decoder shown in \algref{fig.r3d}
 is the optimal decoder if we assume that 
 the channel is  a binary symmetric channel and that the 
 two possible source messages {\tt 0} and {\tt 1} 
 have equal prior probability.
\end{aside}

\begin{algorithm}[htbp]
\algorithmmargin{%
\begin{mycenter}
\begin{tabular}{ccc} % \toprule % \hline
        Received sequence $\br$ &
 Likelihood ratio $\frac{P(\br\,|\, s\hspace{-0.2mm}=\hspace{-0.2mm}{\tt 1})}{P(\br\,|\, s\hspace{-0.2mm}=\hspace{-0.2mm}{\tt 0})}$
 &
 Decoded sequence $\hat{\bs}$ \\ \midrule
\tt      000 & $\gamma^{-3}$ &\tt 0 \\
\tt      001 & $\gamma^{-1}$ &\tt 0 \\
\tt      010 & $\gamma^{-1}$ &\tt 0 \\
\tt      100 & $\gamma^{-1}$ &\tt 0 \\
\tt      101 & $\gamma^{1}$  &\tt 1 \\
\tt      110 & $\gamma^{1}$  &\tt 1 \\
\tt      011 & $\gamma^{1}$  &\tt 1 \\
\tt      111 & $\gamma^{3}$  &\tt 1 \\
% \bottomrule
\end{tabular} 
\end{mycenter}
}{%
\caption[a]{Majority-vote decoding algorithm for {$\Rthree$}.
 Also shown are the likelihood ratios (\ref{eq.likelihood.bsc}), assuming
%  This is the optimal decoder if
 the channel is a binary symmetric channel; $\gamma \equiv (1-f)/f$.}
%
\label{fig.r3d}
\label{alg.r3}
}%
\end{algorithm}

%\noindent
 We now apply the majority vote decoder to the received vector of  \figref{fig.r3.transmission}.
 The first  three received bits are all ${\tt 0}$, so
 we decode this triplet 
 as a ${\tt 0}$. 
 In the second triplet of \figref{fig.r3.transmission},
 there are two {\tt 0}s and one {\tt 1}, so  we decode 
 this triplet as a ${\tt 0}$ -- which in this case corrects the error.
 Not all errors are corrected, however. If we are unlucky and 
 two errors fall in a single block, as in the fifth triplet of 
 \figref{fig.r3.transmission}, 
 then the decoding rule gets the wrong answer, as shown in 
 \figref{fig.decoding.R3}. 
% \Figref{fig.decoding.R3}
% shows the result of decoding the received vector 
% from \figref{fig.r3.transmission}.
\begin{figure}[htbp]
\figuremargin{%
\[
        \begin{array}{rccccccc}
        \bs & {\tt 0}&{\tt 0}&{\tt 1}&{\tt 0}&{\tt 1}&{\tt 1}&{\tt 0} \\
        \bt & \obr{{\tt 0}}{{\tt 0}}{{\tt 0}}&\obr{{\tt 0}}{{\tt 0}}{{\tt 0}}&\obr{{\tt 1}}{{\tt 1}}{{\tt 1}}&\obr{{\tt 0}}{{\tt 0}}{{\tt 0}}&
                \obr{{\tt 1}}{{\tt 1}}{{\tt 1}}&\obr{{\tt 1}}{{\tt 1}}{{\tt 1}}& \obr{{\tt 0}}{{\tt 0}}{{\tt 0}} \\ 
        \bn & \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}}& \nbr{{\tt 0}}{{\tt 0}}{{\tt 1}}&   \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}}&  \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}}&
                \nbr{{\tt 1}}{{\tt 0}}{{\tt 1}}&   \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}}&  \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}} \\ \cline{2-8} 
        \br &  \ubr{{\tt 0}}{{\tt 0}}{{\tt 0}}& \ubr{{\tt 0}}{{\tt 0}}{{\tt 1}}&   \ubr{{\tt 1}}{{\tt 1}}{{\tt 1}}&  \ubr{{\tt 0}}{{\tt 0}}{{\tt 0}}&
                \ubr{{\tt 0}}{{\tt 1}}{{\tt 0}}&   \ubr{{\tt 1}}{{\tt 1}}{{\tt 1}}&  \ubr{{\tt 0}}{{\tt 0}}{{\tt 0}} \\ 
        \hat{\bs} &     {\tt 0}&{\tt 0}&{\tt 1}&{\tt 0}&{\tt 0}&{\tt 1}&{\tt 0} \\
        \mbox{corrected errors} &
                         &\star & & & & & \\ 
        \mbox{undetected errors} &
                         & & & &\star & & 
        \end{array}
\]
}{%
\caption{Decoding
% Applying the maximum likelihood decoder for $\mbox{R}_3$ to 
 the received vector 
 from \protect\figref{fig.r3.transmission}.}
\label{fig.decoding.R3}
}%
\end{figure}

\noindent
% Thus the error probability is reduced by the use of this code. 
% It is easy to compute the error probability.

% Exercise 1.1. Could this be made an Example, i.e. worked through in
%        the text? -- for a beginner, there is a lot in it, and it seems to
%        be important.
%
% see exercise.sty
\exercissx{2}{ex.R3ep}{%%%%%%%% keep this as A2, but cut it from the ITPRNN list
 Show\marginpar{\footnotesize The exercise's rating, \eg,
% `{\em{A}}2'
 `[{\sl2}]'
 indicates its  difficulty:
 `1' exercises are the easiest.
% An exercise rated {\em{A}}2 is important and should not prove too difficult.
 Exercises that are accompanied by a marginal rat are especially recommended.
}
 that  the error probability is reduced by the use of {$\Rthree$}
 by computing the error probability of
 this code for a binary symmetric channel
 with noise level $f$.
%Do so.
}
%
% This fig is 0.1 inch too wide, 9801
%
\begin{figure}
%\fullwidthfigure{%
%\figuredangle{% this hung off the bottom of the page
\figuremarginb{% I think this may make a collision?
\begin{center}
\setlength{\unitlength}{0.8in}% was 0.75 98.12. changed to 0.8 99.01
\begin{picture}(7,4.3)(0,1.4)
\put(0,5){\makebox(0,0)[tl]{\psfig{figure=bitmaps/dilbert.ps,width=1in}}}
\put(0.625,5.4){\makebox(0,0){\Large$\bs$}}
\thicklines
\put(1.35,4.75){\vector(1,0){0.4}}
\put(1.55,5.4){\makebox(0,0){{\sc encoder}}}
\put(2,5){\makebox(0,0)[tl]{\psfig{figure=poster/10000.r3.ps,width=1in}}}
\put(2.625,5.4){\makebox(0,0){\Large$\bt$}}
\put(3.6,5.4){\makebox(0,0){{\sc channel}}}
\put(3.6,5.15){\makebox(0,0){$f={10\%}$}}
\put(3.4,4.75){\vector(1,0){0.4}}
\put(4,5){\makebox(0,0)[tl]{\psfig{figure=poster/10000.r3.0.10.ps,width=1in}}}
\put(4.625,5.4){\makebox(0,0){\Large$\br$}}
\put(5.6,5.4){\makebox(0,0){{\sc decoder}}}
%\put(5.6,3.4){\makebox(0,0)[tl]{\parbox[t]{1.75in}{{\em The decoder takes the majority vote of the three signals.}}}}
\put(5.4,4.75){\vector(1,0){0.4}}
\put(6,5){\makebox(0,0)[tl]{\psfig{figure=poster/10000.r3.0.10.d.ps,width=1in}}}
\put(6.625,5.4){\makebox(0,0){\Large$\hat{\bs}$}}
\end{picture}
\end{center}
}{%
\caption[a]{Transmitting 10000 source bits over a binary symmetric channel
 with $f=10\%$
% 0.1$
  using a repetition code and the majority vote decoding 
 algorithm. The  probability 
 of decoded bit error has fallen to about 3\%; the rate has fallen 
 to 1/3.}
% \dilbertcopy
\label{fig.r3.dilbert}
}%
\end{figure}

 The error probability is dominated by the probability that two
 bits in a block of three are flipped, which scales as $f^2$.
%
% JARGON??????
%
 In the
 case of the binary symmetric channel with $f=0.1$, the {$\Rthree$} code has a
 probability of error, after decoding, of $\pb \simeq 0.03$ per bit.
 \Figref{fig.r3.dilbert} shows the 
 result of transmitting a binary
 image over a binary symmetric channel
 using the repetition code.

%  Should `rate' be explicitly defined?

The repetition code $\Rthree$ has therefore reduced the probability of
 error, as desired.
 Yet we have lost something: our
 {\em rate\/} of information transfer has fallen by a factor of
 three. So if we use a repetition code to communicate data over a telephone
 line, it will reduce the error frequency, but it will also reduce our
 communication rate. We will have to pay three times as much for each
 phone call.
% there will also be a delay
 Similarly,
%As for our disc drive,
 we would need three of the original noisy gigabyte disc drives
 in order to create a one-gigabyte disc drive with $\pb=0.03$.

 Can we 
% What happens as we try to 
 push the error probability lower, to the
 values required for a
% quality
 sellable disc drive -- $10^{-15}$?
 We could achieve lower error probabilities by using repetition 
 codes with more repetitions. 

\exercissx{3}{ex.R60}{
\ben
\item
 Show that the probability of error of $\RN$, the repetition
 code with  $N$ repetitions,  is 
\beq
 p_{\rm b} = \sum_{n=(N+1)/2}^{N} {{N}\choose{n}} f^n (1-f)^{N-n} ,
\eeq
 for  odd $N$.
\item
 Assuming $f = 0.1$, which of the terms in this sum is the biggest?
 How much bigger is it than the second-biggest term? 
\item
 Use \ind{Stirling's approximation} to approximate
% get rid of
 the ${{N}\choose{n}}$
 in the largest term, and find,
 approximately, the probability of error of the repetition
 code with  $N$ repetitions.
\item
  Assuming $f = 0.1$, find how many repetitions
 are required
% show that it takes a repetition 
% code with rate about $1/60$
 to get the probability of error 
 down to $10^{-15}$. [Answer: about 60.]
\een
}
 So to build a {\em single\/}
 gigabyte disc drive 
 with the required reliability from noisy gigabyte drives with $f=0.1$, 
 we would need {\em sixty\/} of the  noisy disc drives.
 The tradeoff between error probability and rate for repetition 
 codes is shown in \figref{fig.pbR.R}.
%
%  see end of l1.tex for method, also see poster1.gnu
%
\newcommand{\pbobject}{\hspace{-0.15in}\raisebox{1.62in}{$\pb$}%
\hspace{-0.05in}}
\begin{figure}
\figuremargin{%
\begin{center}
\begin{tabular}{cc}
\hspace{-0.2in}\psfig{figure=\codefigs/rep.1.ps,angle=-90,width=2.6in} &
\pbobject\psfig{figure=\codefigs/rep.1.l.ps,angle=-90,width=2.6in}  \\
\end{tabular} 
\end{center}
}{%
\caption[a]{Error probability $\pb$ versus rate for repetition codes
 over a binary symmetric channel with $f=0.1$.
 The right hand figure shows $\pb$ on a logarithmic scale. We would like 
 the rate to be large and $\pb$ to be small.
}
\label{fig.pbR.R}
}%
\end{figure}
%  see end of this file for method


\subsection{Block codes -- the (7,4) Hamming code}
\label{sec.ham74}
 We would  like to  communicate with 
 tiny probability of error {\em and\/} at a substantial rate.
 Can we improve on repetition codes? What if we add redundancy to 
 {\dem blocks\/} of data instead of 
% redundantly 
 encoding one bit at a time?
% You may already have heard of the idea of `parity check bits'. 
 We now 
 study a simple {\dem{block code}}.

 A {\dem \ind{block code}\/} is a rule for converting a sequence of source
 bits $\bs$, of length $K$, say, into a transmitted sequence $\bt$ of length
 $N$ bits. To add redundancy, we make $N$ 
 greater than $K$. In a {\dem linear\/} block code,
 the extra $N-K$ bits are linear functions of the
 original $K$ bits; these extra bits are called {\dem\ind{parity check bits}}.
 An example of a \ind{linear block code}\index{error-correcting code!linear} is the \mbox{\dem(7,4)
 \ind{Hamming code}}, which transmits $N=7$ bits for every $K=4$ source
 bits.

\begin{figure}[htbp]
\figuremargin{\small%
\begin{center}
\begin{tabular}{cc}
(a)\psfig{figure=hamming/encode.eps,angle=-90,width=1.3in}  &
(b)\psfig{figure=hamming/correct.eps,angle=-90,width=1.3in}  \\
\end{tabular} 
\end{center}
}{
\caption[a]{Pictorial representation of encoding for the Hamming (7,4) 
 code.
% a and  b are not explained in the caption. Does this matter? 
%
% The parity check bits $t_5,t_6,t_7$ are set so that the parity within
%% each circle is even.
}
\label{fig.74h.pictorial}
\label{fig.hamming.pictorial}
}
\end{figure}
%
 The encoding operation for the code is shown pictorially
 in \figref{fig.74h.pictorial}.
%
% \subsubsection{Encoding}
 We arrange the seven transmitted bits in three intersecting circles.
%  as shown in \figref{fig.hamming.encode}.
 The first four
 transmitted bits,
 $t_1 t_2 t_3 t_4$, are set equal to the four source bits,
 $s_1 s_2 s_3 s_4$.
 The parity check bits\index{parity check bits}
 $t_5 t_6 t_7$ are set so that the {\dem\ind{parity}\/}
 within each circle is even:
 the first parity check bit is the parity of the first three source bits
 (that is, it is
%zero
 {\tt 0} if the sum of those bits is even, and
% one
 {\tt 1} if the sum  is odd); 
 the second is the parity of the last three; and the third parity bit 
 is the parity of source bits one, three and four. 

 As an example, \figref{fig.74h.pictorial}b shows the transmitted 
 codeword  for the case $\bs = {\tt 1000}$. 
% idea for rewriting this: go straight to pictorial story, leave out the 
% matrix description for another time.
%
%
%\noindent
%
 Table \ref{tab.74h} shows the codewords generated
 by each of the  $2^4=$ sixteen settings of the four source bits.
% Notice that the first four transmitted bits are 
% identical to the four source bits, and the remaining three bits 
% are parity bits:
 The special property of these codewords is that
 any pair 
 differ from each other in at least three bits.
\begin{table}[htbp]
\figuremargin{%
\begin{center}
\mbox{\small
\begin{tabular}{cc} \toprule
%       Source sequence
 $\bs$ & 
% Transmitted sequence
               $\bt$ \\ \midrule
\tt     0000 &\tt 0000000 \\
\tt     0001 &\tt 0001011 \\
\tt     0010 &\tt 0010111 \\
\tt     0011 &\tt 0011100 \\ \bottomrule
\end{tabular} \hspace{0.02in}
\begin{tabular}{cc} \toprule
         $\bs$ &   $\bt$ \\ \midrule
\tt     0100 &\tt 0100110 \\
\tt     0101 &\tt 0101101 \\
\tt     0110 &\tt 0110001 \\
\tt     0111 &\tt 0111010 \\ \bottomrule
\end{tabular} \hspace{0.02in}
\begin{tabular}{cc} \toprule 
         $\bs$ &   $\bt$ \\ \midrule
\tt     1000 &\tt 1000101 \\
\tt     1001 &\tt 1001110 \\
\tt     1010 &\tt 1010010 \\
\tt     1011 &\tt 1011001 \\ \bottomrule
\end{tabular} \hspace{0.02in}
\begin{tabular}{cc} \toprule
         $\bs$ &   $\bt$ \\ \midrule
\tt     1100 &\tt 1100011 \\
\tt     1101 &\tt 1101000 \\
\tt     1110 &\tt 1110100 \\
\tt     1111 &\tt 1111111 \\ \bottomrule
\end{tabular}
}%%%%%%%%% end of row of four tables
\end{center} 
}{%
\caption[a]{The sixteen codewords
  $\{ \bt \}$ of the (7,4) Hamming  code. Any pair of
  codewords 
% have the % beautiful % elegant property that they
 differ from each other in at least three bits.}
%\label{fig.hamming.encode}
\label{tab.74h}
\label{tab.h74}
\label{fig.h74}
\label{fig.74h}
}
\end{table}
%

\begin{aside}
 Because the Hamming code is a   {linear\/} code, it can\indexs{error-correcting code!linear}
 be written  compactly in terms of matrices as follows. 
% It is a 
% {\em linear\/} code; that is, t
 The transmitted codeword $\bt$ is
% can be
 obtained 
 from the source sequence $\bs$ by a linear operation,
\beq
        \bt = \bG^{\T} \bs,
\label{eq.encode}
\eeq
 where $\bG$ is the {\dem\ind{generator matrix}} of the code,
\beq
 \bG^{\T} = {\left[ \begin{array}{cccc} 
\tt 1 &\tt 0 &\tt 0 &\tt 0 \\
\tt 0 &\tt 1 &\tt 0 &\tt 0 \\
\tt 0 &\tt 0 &\tt 1 &\tt 0 \\
\tt 0 &\tt 0 &\tt 0 &\tt 1 \\
\tt 1 &\tt 1 &\tt 1 &\tt 0 \\
\tt 0 &\tt 1 &\tt 1 &\tt 1 \\
\tt 1 &\tt 0 &\tt 1 &\tt 1  \end{array} \right] } ,
\label{eq.h74.gen}
\eeq 
 and the encoding operation (\ref{eq.encode}) uses 
  modulo-2 arithmetic [${\tt 1}+{\tt 1}={\tt{0}}$, ${\tt 0}+{\tt 1}={\tt 1}$, etc.].
%\footnote{My notational 
% convention  is that  all  vectors -- $\bs$, $\bt$, etc.\ --
% are column vectors, except that in the figures where many 
% vectors are listed, they are displayed as row vectors. The 
% generator matrix $\bG$  is written ..... as to retain 
% consistency with established notation in coding texts.}

% \begin{aside}
 In the encoding operation
 (\ref{eq.encode}) I have assumed that $\bs$ and $\bt$ are 
 column vectors. If instead they are row vectors, then this equation 
 is replaced by
\beq
        \bt =  \bs \bG,	
\label{eq.encodeT}
\eeq
 where 
\beq
       \bG = \left[ \begin{array}{ccccccc} 
 \tt 1& \tt 0& \tt 0& \tt 0& \tt 1& \tt 0& \tt 1 \\
 \tt 0& \tt 1& \tt 0& \tt 0& \tt 1& \tt 1& \tt 0 \\
 \tt 0& \tt 0& \tt 1& \tt 0& \tt 1& \tt 1& \tt 1 \\
 \tt 0& \tt 0& \tt 0& \tt 1& \tt 0& \tt 1& \tt 1 \\
  \end{array} \right] .
\label{eq.Generator}
\eeq
% f you are like me, you may
 I find it easier to relate to
 the right-multiplication (\ref{eq.encode})
 than the left-multiplication (\ref{eq.encodeT}).
% -- I like my matrices to act to the right.
 Many coding theory texts use the left-multiplying conventions 
 (\ref{eq.encodeT}--\ref{eq.Generator}), however.

 The rows of the generator matrix (\ref{eq.Generator}) can be 
 viewed as defining four basis vectors lying in a seven-dimensional
 binary space. The sixteen codewords are obtained by making all 
 possible linear combinations
% binary sums
 of these vectors.
\end{aside}


%
% should I add a cast of characters here?
% s,t,r,s^

\subsubsection{Decoding the (7,4) Hamming code}
 When we invent a more complex encoder $\bs \rightarrow \bt$,
 the task of decoding the
 received vector $\br$ becomes less straightforward. Remember that
 {\em any\/} of the bits may have been flipped, including the parity bits. 
% We can't assume that the  three extra parity bits 
%(The reader who
% is eager to see the denouement of the plot may skip ahead to section
% \ref{sec.code.perf}.)
 

% General defn of optimal decoder 
 If we assume that the channel is a binary symmetric channel and that
 all source vectors are equiprobable, 
% {\em a priori},
 then  the
 optimal decoder
% is one that
 identifies the source vector $\bs$ whose
 encoding $\bt(\bs)$ differs from the received vector $\br$ in the
 fewest bits. [{Refer to the likelihood function 
% equation
% {eq.bayestheorem}--\ref{eq.likelihood.bsc}}
 \bref{eq.likelihood.bsc}} to see why this is so.]
 We could solve the decoding problem by measuring how far $\br$
 is from each of the 
 sixteen codewords in \tabref{tab.74h} then picking the closest.
 Is there a more efficient way of finding the most probable source vector?


\subsubsection{Syndrome decoding for the Hamming code}
\label{sec.syndromedecoding}
 For the (7,4) Hamming code there is a pictorial solution to the 
% syndrome
 decoding problem, based on the  encoding  picture,
 \figref{fig.74h.pictorial}. 
%
% \subsubsection{Decoding}
%
% sanjoy says this is CONFUSING - tried to importve it Sat 22/12/01
% also romke did not like it

 As a first example, let's assume the transmission was
 $\bt = {\tt 1000101}$ and the noise flips the second bit,
 so the received vector is
 $\br = {\tt 1000101}\oplus{\tt{0100000}} = {\tt{1100101}}$.
% \ie,  $\bn=({\tt 0},{\tt 1},{\tt 0},{\tt 0},{\tt 0}, {\tt 0},{\tt 0})$,
% and the received vector 
 We write the received vector  into the three circles
 as shown in \figref{fig.hamming.decode}(a), and
 look at each of the three circles to see whether its parity is even.
 The circles whose parity is {\em{not}\/} even are shown by
 dashed lines in \figref{fig.74h.pictorial}b.
% The fact that all codewords differ from each other in at least 
% three bits means that if the noise has flipped any one or two bits, 
% the received vector will no longer be a valid codeword, and some of 
% the parity checks  will be broken.
%
 The decoding task is
%We want
 to find the smallest
 set of flipped bits that can account for these violations
 of the parity rules.
% violated.
 [The  pattern of violations of the parity checks is called the {\dem\ind{syndrome}}, and can be written as a binary vector -- for example,
 in \figref{fig.hamming.decode}b, the syndrome is $\bz = ({\tt1},{\tt1},{\tt0})$,
 because the first two circles are `unhappy' (parity {\tt1}) and the
 third circle is `happy' (parity {\tt0}).]
% RESTORE ME:
%, and the task of  syndrome decoding 
% syndrome (just as a
% \ind{doctor} might seek the most probable underlying \ind{disease} to account for
% the symptoms shown by a \ind{patient}).


\begin{figure}% [htbp]
\figuremargin{\small%
\begin{center}
\begin{tabular}{ccc}
(a)\psfig{figure=hamming/decode.eps,angle=0,width=1.3in}  \\
(b)\psfig{figure=hamming/s2.eps,angle=-90,width=1.3in}  &
(c)\psfig{figure=hamming/t5.eps,angle=-90,width=1.3in}  &
(d)\psfig{figure=hamming/s3.eps,angle=-90,width=1.3in}  \\[0.3in]
\multicolumn{3}{c}{%
(e)\psfig{figure=hamming/s3.t7.eps,angle=0,width=1.3in}  
\setlength{\unitlength}{1in}
\begin{picture}(0.4,0.6)(0,0)
\put(0,0.6){\vector(1,0){0.6}}
\end{picture}
% \raisebox{0.6in}{$\rightarrow$}
(${\rm e}'$)\psfig{figure=hamming/s3.t7.d.eps,angle=0,width=1.3in}  
}\\
\end{tabular} 
\end{center}
}{%
\caption[a]{Pictorial representation of decoding of the Hamming (7,4) 
 code. The received vector is written into the diagram
 as shown in (a).
 In (b,c,d,e), the received vector is
 shown, assuming that the transmitted vector was
 as in
% The bits that are flipped relative to
 \protect
 \figref{fig.hamming.pictorial}(b) and the bits labelled by $\star$
 were flipped. The violated 
 parity checks are highlighted by dashed circles. One of the seven bits 
 is the most probable suspect to account for each `\ind{syndrome}', \ie, each 
 pattern of violated and satisfied parity checks. 

 In examples  (b), (c), and (d), the most probable suspect is
 the one bit that was flipped.

 In example (e), two bits  have been flipped, $s_3$ and $t_7$.
 The most probable suspect is $r_2$, marked by a circle in (${\rm e}'$),
 which shows the output of the decoding algorithm. 
% each circle is even.
}\label{fig.hamming.decode}
\label{fig.hamming.s2}% these labels were in the wrong place feb 2000
\label{fig.hamming.s3}
\label{fig.hamming.correct}
}
\end{figure}
%
% ACTION: sanjoy still thinks this part is hard to follow - fixed Sat 22/12/01?
 To solve the decoding task,
% problem,
 we ask the question:
 can we find  a unique bit that lies {\em inside\/}
 all the `unhappy' circles and {\em outside\/} all the
 `happy' circles? If so, the flipping of that bit
 would account for the observed
 syndrome. 
 In the case shown in   \figref{fig.hamming.s2}(b),
 the bit $r_2$
% that was flipped
 lies  inside the  two unhappy circles and outside the happy
 circle;
 no other single bit has this property, so
 $r_2$  is the only single bit capable of explaining the syndrome.

 Let's work through a couple more examples. 
  \Figref{fig.hamming.s2}(c) shows what happens if one of the
 parity bits, $t_5$, is flipped by the noise. Just one of the checks
 is violated. Only $r_5$ lies inside this unhappy circle and outside
the other two happy circles,
 so $r_5$ is  identified as the only single bit
 capable of explaining the syndrome.

 If the central bit $r_3$ is received flipped, 
  \figref{fig.hamming.s3}(d) shows that all three checks are violated;
 Only $r_3$ lies inside all three circles, so $r_3$  is
 identified as the  suspect bit.

 If you try flipping any one of the seven bits, you'll find 
 that a different syndrome is obtained in each case -- seven non-zero syndromes,
 one for each bit. There is only 
 one other syndrome, the all-zero syndrome. So if
 the channel is a binary symmetric channel with a
 small noise level $f$, the optimal 
 decoder unflips at most one bit, depending  on the 
 syndrome, as shown in \algref{tab.hamming.decode}.
 Each syndrome could have been caused by other noise patterns
 too, but any other noise pattern that has the same syndrome 
 must be less probable because it involves a larger number of 
 noise events.

%\begin{figure}
%\figuremargin{%
\begin{algorithm}
\algorithmmargin{%
\begin{center}
\begin{tabular}{c*{8}{c}}
% Fri 4/1/02 removed toprule and bottomrule because algorithm has its own frame
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% \toprule
Syndrome $\bz$ &            {\tt 000} & {\tt 001} & {\tt 010} & {\tt 011} & {\tt 100} & {\tt 101} & {\tt 110} & {\tt 111} \\ \midrule
Unflip this bit  & {\small{\em none}} &   $r_7$      & $r_6$          &        $r_4$  &  $r_5$       & $r_1$          & $r_2$         & $r_3$ \\
% \bottomrule
% Unflip this bit  & {\small{\em none}} &   7      & 6          &        4  &  5       & 1          & 2         & 3 \\
% \bottomrule
% this is appropriate only if z =z3,z2,z1: 
% Unflip this bit  & {\small{\em none}} & 5      & 6          &        2  &  7       & 1          & 4         & 3 \\  \hline
\end{tabular}
\end{center} 
%\begin{center}
%\begin{tabular}{cc}  \hline
%Syndrome $\bz$ & % 3 2 1 !!!!!!!!!!!!!!!!!!!
%Flip this bit  \\  \hline
% 000 &{\small{\em none}} \\
% 001 &5\\
% 010 &6\\
% 011 &2\\
% 100 &7\\
% 101 &1\\
% 110 &4\\
% 111 &3 \\  \hline
%\end{tabular}
%\end{center}
}{%
\caption[a]{Actions taken by the optimal decoder for the (7,4) Hamming 
 code, assuming a binary symmetric channel with small noise level $f$.
 The syndrome vector $\bz$ lists whether each parity check is 
 violated ({\tt 1}) or satisfied ({\tt 0}),
 going through the checks in the order
 of the bits $r_5$, $r_6$,
 and $r_7$. }
\label{tab.hamming.decode}
}%
\end{algorithm}

 What happens if the noise actually flips more than one bit? 
  \Figref{fig.hamming.s3}(e) shows the situation when two bits, 
 $r_3$ and $r_7$, are received flipped.  The syndrome, {\tt 110},
 makes us suspect the single bit $r_2$; so our optimal decoding algorithm 
 flips this bit, giving a decoded pattern with three errors 
 as shown in   \figref{fig.hamming.s3}(${\rm e}'$).
 If we use the optimal decoding algorithm, 
 any two-bit error pattern will lead to a decoded seven-bit vector 
 that contains three errors. 

\subsection{General view of decoding for linear codes: syndrome decoding}
\label{sec.syndromedecoding2}
\begin{aside}
% {\em (Does some of this stuff belong earlier in the pictorial area?)}
 We can also  describe the decoding problem
 for a linear code in terms of matrices.\index{syndrome decoding} 
% In the  case of a linear code and a  symmetric channel, 
% the decoding task can be re-expressed as {\bf syndrome decoding}.
% Let's assume that the noise level $f$ is less than $1/2$.
 The first four received bits, $r_1r_2r_3r_4$, purport to be  
 the four source bits; and the received bits $r_5r_6r_7$ purport
 to be the parities of the source bits, as defined by the generator 
 matrix $\bG$. We evaluate the three parity check bits for the 
 received bits, $r_1 r_2r_3 r_4$, and see whether
 they match the three received 
 bits, $r_5r_6r_7$. The differences (modulo 2) between 
 these two triplets are called the {\dbf\ind{syndrome}}
 of the received vector. 
 If the syndrome is zero -- if all three parity checks are happy
% agree with  the corresponding received bits
 -- then the received vector is a codeword, 
 and the most probable decoding is given by reading out its first four 
 bits.  If the syndrome is non-zero, then
% we are certain that
 the noise 
 sequence for this block was non-zero, and the syndrome is our 
 pointer to the most probable error pattern. 

 The computation of  the  syndrome vector is a
 linear operation. If we define the $3 \times 4$ matrix $\bP$
 such that  the matrix of 
 equation (\ref{eq.h74.gen})
is
\beq
        \bG^{\T} = \left[ \begin{array}{c}{\bI_4}\\
 \bP\end{array} \right], 
\eeq
 where $\bI_4$ is the $4\times 4$ identity matrix,  then 
 the syndrome vector is $\bz  = \bH \br$, where the {\dbf\ind{parity check matrix}}
 $\bH$ is given by $\bH =  \left[ \begin{array}{cc} -\bP & \bI_3 \end{array}
 \right]$; in  modulo 2 arithmetic, $-1 \equiv 1$, so
\beq
        \bH =   \left[ \begin{array}{cc} \bP & \bI_3 \end{array}
 \right] = \left[ 
 \begin{array}{ccccccc} 
\tt  1&\tt 1&\tt 1&\tt 0&\tt 1&\tt 0&\tt 0 \\
\tt  0&\tt 1&\tt 1&\tt 1&\tt 0&\tt 1&\tt 0 \\
\tt  1&\tt 0&\tt 1&\tt 1&\tt 0&\tt 0&\tt 1
 \end{array} \right] .
\label{eq.pcmatrix}
\eeq
 All the codewords $\bt = \bG^{\T} \bs$ of the code satisfy
\beq
	\bH \bt = \left[ {\tt \begin{array}{c} \tt0\\ \tt0\\ \tt0 \end{array} } \right] .
% (0,0,0)  .
\eeq
\exercisxB{2}{ex.GHis0}{
 Prove that this is so by evaluating the $3\times4$ matrix $\bH \bG^{\T}$.
}
 Since the received vector $\br$ is given by $\br = \bG^{\T}\bs + \bn$,
% and $\bH \bG^{\T}$=0, 
 the syndrome-decoding problem is  to find the
 most probable noise vector $\bn$ satisfying
 the equation 
\beq
        \bH \bn = \bz .
\eeq
 A decoding algorithm that solves this problem is called 
 a {\dem maximum-likelihood decoder}. We will discuss 
 decoding problems like this  in later  chapters. 
%\footnote{Somewhere in this book
% I need to spell out Bayes' theorem for decoding. Here would be 
% a good spot; but on the other hand, people can understand decoding
% intuitively, they don't need Bayes theorem and they might find it 
% a hindrance if they were not only being hit by 
% Shannon's theorem but also by likliehoods and priors.}
%
% ACTION NEEDED ????????????????????????????????????????
%
\end{aside}

\begin{figure}
%\fullwidthfigure{%
\figuredanglenudge{%
\begin{center}
\setlength{\unitlength}{0.8in}% was 1in, with figures 1.25 wide % then was 0.8 with 1in
\begin{picture}(7,2.7)(0,2.8)
\put(0,5){\makebox(0,0)[tl]{\psfig{figure=bitmaps/dilbert.ps,width=1in}}}
\put(0.625,5.4){\makebox(0,0){\Large$\bs$}}
\thicklines
\put(1.35,4.75){\vector(1,0){0.4}}
\put(1.55,5.4){\makebox(0,0){{\sc encoder}}}
\put(2,5){\makebox(0,0)[tl]{\psfig{figure=poster/10000.h74.ps,width=1in}}}
\put(1.982,3.75){\makebox(0,0)[tr]{{parity bits} $\left.\rule[-0.342in]{0pt}{0.342in} \right\{$}}
\put(2.625,5.4){\makebox(0,0){\Large$\bt$}}
\put(3.6,5.4){\makebox(0,0){{\sc channel}}}
\put(3.6,5.15){\makebox(0,0){$f={10\%}$}}
\put(3.4,4.75){\vector(1,0){0.4}}
\put(4,5){\makebox(0,0)[tl]{\psfig{figure=poster/10000.h74.0.10.ps,width=1in}}}
\put(4.625,5.4){\makebox(0,0){\Large$\br$}}
\put(5.6,5.4){\makebox(0,0){{\sc decoder}}}
%\put(5.6,3.5){\makebox(0,0)[tl]{\parbox[t]{1.75in}{{\em The decoder picks the $\hat{\bs}$ with maximum likelihood.}}}}
\put(5.4,4.75){\vector(1,0){0.4}}
\put(6,5){\makebox(0,0)[tl]{\psfig{figure=poster/10000.h74.0.10.d.ps,width=1in}}}
\put(6.625,5.4){\makebox(0,0){\Large$\hat{\bs}$}}
\end{picture}
\end{center}
}{%
\caption[a]{Transmitting 10,000 source bits over a binary symmetric channel
 with $f=10\%$
%0.1$
  using a (7,4) Hamming code. The  probability 
 of decoded bit error is about 7\%.}
% \dilbertcopy}
\label{fig.h74.dilbert}
}{0.7in}% third argument is the upward nudge of the caption
\end{figure}
\subsection{Summary of the (7,4) Hamming code's properties}
 Every possible received vector of length 7 bits is either a codeword,
 or it's one flip away from a codeword.\index{Hamming code}

 Since there are three parity constraints, each of which might
 or might not be violated, there are
 $2\times 2\times 2= 8$
% eight
 distinct syndromes. They can be divided 
 into seven non-zero syndromes --  one
 for each of the one-bit error patterns --
 and the all-zero syndrome, corresponding to the zero-noise case. 

 The optimal decoder takes no action if the syndrome is zero, 
 otherwise it uses this mapping of non-zero syndromes onto one-bit error 
 patterns to unflip the suspect bit. 

 There is a {\dbf decoding error}   if the four decoded bits $\hat{s}_1,
 \ldots, \hat{s}_4$ do not all match the source bits ${s}_1,
 \ldots, {s}_4$. The {\dbf probability of block error} $\pB$ is 
 the probability that one or more of the decoded bits in one block fail to 
 match the corresponding source bits,
\beq
 \pB = P( \hat{\bs} \neq \bs ) .
\eeq
 The {\dbf probability of bit error} $\pb$ is 
 the average probability
%  per decoded bit
 that a decoded bit fails to 
 match the corresponding source bit,
\beq
        \pb =  \frac{1}{K} \sum_{k=1}^K P( \hat{s}_k \neq s_k ) .
\eeq

 In the case of the Hamming code, 
 a decoding error will occur whenever  the noise has flipped more than 
 one bit in a block of seven. 
%  Any noise pattern that flips more than one bit will give rise to one of 
%  these syndromes, and our decoder will make an erroneous decision. 
%
 The probability of block error is thus the probability that two or more 
 bits are flipped in a block. This probability scales as $O(f^2)$, as did the 
 probability of error for the repetition code 
 $\Rthree$. But notice that the Hamming code 
 communicates at a greater rate, $R=4/7$. 

 \Figref{fig.h74.dilbert} shows a binary image transmitted over 
 a binary symmetric channel using the (7,4) Hamming code. 
 About 7\% of the decoded bits are in error. Notice that 
 the errors are correlated:
% with each other:
 often two or three successive
 decoded bits are flipped.

\exercissxA{1}{ex.Hdecode}{
 This exercise and the next three  refer to the 
  (7,4) \ind{Hamming code}.  Decode the received strings:
\ben
\item $\br = {\tt 1101011}$ % 10
\item $\br = {\tt 0110110}$ % 4
\item $\br = {\tt 0100111}$ % 4
\item $\br = {\tt 1111111}$. % 15
\een
}
\exercisxA{2}{ex.H74p}{
\ben \item
 Calculate the probability of block error $p_{\rm B}$ of the (7,4) Hamming 
 code 
 as a function of the noise level $f$ and show that to leading order
% \footnote{Do I need to explain what this means? Or use a different
%  terminology? Maybe only physicists are familiar?} 
%
% ACTION!!!
%
 it goes as $21 f^2$.
\item
% }
% \exercis{}{
 $^{B3}$
 Show that to leading order  the probability of 
 bit error $\pb$ goes as $9 f^2$. 
\een}
\exercisxA{2}{ex.H74zero}{
% Hamming (7,4) code.
 Find some noise vectors that give the all-zero syndrome (that is, 
 noise vectors that leave all the parity checks unviolated). 
 How many such noise vectors are there?
}
% they are the codewords. 
\exercisxB{2}{ex.H74detail}{
% Hamming (7,4) code.
 I asserted above that a block decoding error will result 
 whenever two or more bits are flipped in a single block. 
 Show that this is indeed so. [In principle, there might be 
 error patterns that, after decoding, led only to the corruption 
 of the parity bits, without the source bits's being incorrectly 
 decoded.] 
}
\exercisxB{2}{ex.R9}{
 Consider
% the encoder for
 the repetition code $\Rnine$. One way 
 of viewing this code is as a \index{concatenation!error-correcting codes}{\dbf{concatenation}} of  $\Rthree$ with 
 $\Rthree$. We first encode the source stream with $\Rthree$, then encode 
 the resulting output with $\Rthree$. We could call this code `$\Rthree^2$'.
 This idea motivates an alternative decoding algorithm, in which we decode the 
 bits three at a time using the decoder for $\Rthree$; then decode the
 decoded bits from that first decoder using the decoder for $\Rthree$. 

 Evaluate the probability of error for this decoder and compare it with the
 probability of error for the optimal decoder for $\Rnine$. 

 Do the concatenated encoder and decoder for $\Rthree^2$ have advantages over 
 those for $\Rnine$? 
}

\subsection{Summary of codes' performances}
\label{sec.code.perf}
 Figure \ref{fig.pbR.RH} shows the performance of \ind{repetition code}s and
 the \ind{Hamming code}. It also shows the performance of a family of linear
 block codes that are generalizations of Hamming codes, \ind{BCH codes}.  
% Reed-Muller codes, and
%  see end of this file for method
% 
\begin{figure}[htbp]
\figuremargin{%
\begin{center}
\begin{tabular}{cc}
\hspace{-0.2in}\psfig{figure=\codefigs/rephambch.1.ps,angle=-90,width=2.6in} &
\pbobject\psfig{figure=\codefigs/rephambch.1.l.ps,angle=-90,width=2.6in}  \\
\end{tabular} 
\end{center}
}{%
\caption[a]{Error probability $\pb$ versus rate $R$ for repetition codes,
  the (7,4) Hamming code and BCH codes with block lengths up to 1023
 over a binary symmetric channel with $f=0.1$.
 The righthand figure shows $\pb$ on a logarithmic scale.}
\label{fig.pbR.RH}
}
\end{figure}
%

%\noindent
% use this noindent if the ``h'' (here) works, otherwise new para.
 This figure shows that we can, using linear block codes, achieve better
 performance than repetition codes; but the asymptotic situation still
 looks  grim. 
\exercisxA{4}{ex.makecode}{
% invent your own code 
 Design an error-correcting code and a decoding algorithm for it,
 compute its probability of error, 
 and add it to figure \ref{fig.pbR.RH}.
 [Don't worry if you find it difficult to make a code better than the
 Hamming code, or if you find it difficult to find a good
 decoder for your code; that's the point of this exercise.]
}
\exercisxA{3}{ex.makecode2error}{
 A (7,4) Hamming code
 can correct any {\em one\/} error;  might there be a (10,4) code
 that can correct any two errors? What about a (9,4) code?

 {\sf Optional extra:} Does the answer to this question
 depend on whether the code is linear or nonlinear?
}
\exercisxA{4}{ex.makecode2}{
	 Design an error-correcting code, other than
 a repetition code, that can
 correct any {\em two\/} errors  in a block of size $N$.
}

\section{What performance can the best codes achieve?}
 There seems to be a trade-off between the decoded bit-error
 probability $\pb$ (which we would like to reduce) and the rate $R$ (which
 we would like to keep large).  How can this trade-off be
 characterized?
%  Can we do better than repetition codes?
 What points in
 the $(R,\pb)$ plane are achievable?  This question was addressed by
 \ind{Shannon} in his pioneering paper of 1948, in which he both created the
 field of information theory and solved most of its fundamental
 problems.
%  in the same paper.

 At that time  there was a widespread belief that the 
 boundary between achievable and nonachievable points in the 
 $(R,\pb)$ plane was a curve passing through the origin $(R,\pb) = (0,0)$; 
 if this were so, then,  in order to achieve a vanishingly small 
 error probability $\pb$, one would have to reduce the rate 
 correspondingly close to zero.
%  (figure ref here).
% This would seem a reasonable guess, 
% in accordance with the general rule that the better something works
% the more you have to pay for it. 
%
% ACTION: sanjoy doesn't like This
%
 `No pain, no gain.'

 However, Shannon proved the remarkable result that\wow\
% , for any given  channel,
 the boundary 
 between achievable and nonachievable points meets the $R$ 
 axis at a {\em non-zero\/} value $R=C$, as shown in \figref{fig.pbR.RHS}. 
\begin{figure}[htbp]
\figuremargin{%
\begin{center}
\begin{tabular}{cc}
\hspace{-0.2in}\psfig{figure=\codefigs/repshan.1.ps,angle=-90,width=2.6in} &
\pbobject\psfig{figure=\codefigs/repshan.1.l.ps,angle=-90,width=2.6in}  \\
\end{tabular} 
\end{center}
}{%
\caption[a]{Shannon's noisy-channel coding theorem.
 The solid curve shows  the Shannon limit 
 on achievable values of $(R,\pb)$  for
 the binary symmetric channel with $f=0.1$.
 Rates up to $R=C$ are achievable with arbitrarily small $\pb$.
 The  points show the performance of some textbook codes,
 as in \protect\figref{fig.pbR.RH}.


 The equation defining the Shannon limit (the solid curve) is 
%\[
        $R = \linefrac{C}{(1-H_2(\pb))},$
%\]
 where $C$ and $H_2$ are defined in \protect \eqref{eq.capacity}.
}
\label{fig.pbR.RHS}
}
\end{figure}
%  see end of this file for method
%
 For any channel, there exist codes that make it possible to 
 communicate  with {\em arbitrarily small\/} probability of 
 error $\pb$ at non-zero rates.  The first half of this book (parts I--III) will be 
 devoted  to understanding  this  remarkable result, which is called 
 the {\dbf\ind{noisy-channel coding theorem}}.

\subsection{Example: $f=0.1$}% A few details}
 The maximum rate at which communication is possible with arbitrarily
 small $\pb$ is called the {\dbf\ind{capacity}} of the channel.\index{channel!capacity} 
 The formula for the capacity of a binary
 symmetric channel  with noise level $f$ is\index{binary entropy function} 
\beq
        C(f) = 1 - H_2(f) = 1 - \left[ f \log_2
                 \frac{1}{f} + (1-f) \log_2 \frac{1}{1-f} \right] ;
\label{eq.capacity}
\eeq
 the channel we were discussing earlier with noise level $f=0.1$
 has capacity $C \simeq 0.53$. Let us consider what this means in terms 
 of noisy disc drives. The \ind{repetition code} $\Rthree$ could communicate over this
 channel with $\pb=0.03$ at a rate $R = 1/3$. Thus we know how
 to  build a single  gigabyte disc drive with $\pb = 0.03$
 from three noisy  gigabyte disc drives. We also know how to make 
 a single  gigabyte disc drive 
 with  $\pb \simeq 10^{-15}$ from sixty
 noisy one-gigabyte drives \exercisebref{ex.R60}.
 And now \ind{Shannon} passes by, notices us 
 \ind{juggling}
% tinkering 
 with disc drives and codes and says:
\begin{quotation}
\noindent
        `What performance are you trying to achieve? 
        $10^{-15}$? You don't need {\em sixty\/} disc drives   -- 
        you can get that performance with just  
         {\em two\/} disc drives (since 1/2 is  less than $0.53$).  
%       (The capacity is 0.53, so the number of disc drives needed at 
%       capacity is 1/0.53.)
%       `
        And if you want $\pb = 10^{-18}$
% , or $10^{-21}$,
         or $10^{-24}$ or anything,
        you can get there with  two disc drives too!'
\end{quotation}
%\begin{aside}
 [Strictly, the above statements might  not be quite right, since,
 as we shall see, Shannon
 proved his 
 noisy-channel coding theorem
%proves the achievability of ever smaller
% error probabilities at a given rate $R