This document is (c) David J.C. MacKay, 2001
It originates from http://www.inference.phy.cam.ac.uk/mackay/itprnn/book.html
It contains the text of David MacKay's book, Information theory, inference, and learning algorithms. (latex source)
Copying and distribution of this file are NOT PERMITTED.
The file is provided for convenience of anyone wishing to make a web-based search of the text of the book.
% This document is (c) David J.C. MacKay, 2001
%
% It originates from http://www.inference.phy.cam.ac.uk/mackay/itprnn/
% http://www.inference.phy.cam.ac.uk/mackay/itprnn/book.html
%
% It contains the text of David MacKay's book,
% Information theory, inference, and learning algorithms.
% (latex source)
%
% Copying and distribution of this file are NOT PERMITTED.
%
% The file is provided for convenience of anyone wishing to
% make a web-based search of the text of the book.
% was book2e.tex is now book.tex (and still latex2e)
\documentclass[11pt]{book}%
\usepackage{floatflt}
\usepackage{hangingsecnum}% makes sec numbers sit in the left margin
%\usepackage{mparhack}
\usepackage{mparhackright-209}% makes all margin pars go in right margin
\usepackage{marginfig}% Defines many macros for making various styles of figure with captions
\usepackage{symbols}
%\usepackage{twoside}
\usepackage{myalgorith}% defines the Algorithm environment as a float
\usepackage{aside}% defines the {aside} environment
\usepackage{chapsummary}% helps me compile index-like objects
\usepackage{chapternotes}% lots of assorted stuff
\usepackage{lsalike}% defines citation commands
\usepackage{booktabs}% makes nice quality tables
\usepackage{prechapter}% defines a chapter-like object
\usepackage{mycaption}% defines ``\indented''and \@makecaption
% additions post-Sat 5/10/02
\usepackage{latexsym}% needed in order to make use of the \Box command
\usepackage{tocloft}% implements my look of table of contents
\usepackage{tocloftcomp}% implements my look of table of contents
\usepackage{mychapter}% defines chapter command, including the look of the new chapter page
% also defines the look of the section and subsection commands
\usepackage{mycenter}% modifies center to reduce vertical space waste
\usepackage{mypart}% modifies part to not cleardoublepage
\usepackage{myheadings}% redefines the pagestyle ``headings''
% \usepackage{headingmods}% redefines the pagestyle ``headings'' (similar to myheadings)
% \usepackage{myindents}% defines parindent and leftmargin
\usepackage{graphics}% enables rotating of boxes
% \usepackage{boldmathgk}% provides bold alpha etc. (doesn't work)
\usepackage{fixmath}% provides bold alpha etc.
\usepackage{makeidx}
\makeindex
%
\newcommand{\thedraft}{3.141}
\renewcommand{\textfraction}{0.10}
\pagestyle{headings}
\begin{document}
\bibliographystyle{lsalike}
%\newcommand{\bf}{\textbf}
%\newcommand{\sf}{\textsf}
%%\newcommand{\em}{\textem}
%\newcommand{\rm}{\textrm}
%\newcommand{\tt}{\texttt}
%\newcommand{\sl}{\textsl}
%\newcommand{\sc}{\textsc}
%
%
%
% chapter.tex
%
% this contains a few common definitions for all chapters
% of the itprnn book
\newcommand{\adhoc}{ad hoc}
\newcommand{\busstop}{bus-stop}
\newcommand{\mynewpage}{\newpage}% switch this off later Sun 3/2/02
% see also tex/inputs/itchapter.sty
% chapternotes.sty is where there is an index
\newcommand{\fN}{f\!N}
\newcommand{\exercisetitlestyle}{\sf}
%
% used in sumproduct.tex and gallager.tex
\newcommand{\Mn}{{\cal M}(n)}
\newcommand{\Nm}{{\cal N}(m)}
%\newcommand{\N}{{\cal N}}
%
% the delta function that is 1 if true (defined in notation.tex)
\newcommand{\truth}{1}
%
% used in gene.tex
\newcommand{\deltaf}{\delta\! f}
\newcommand{\tI}{\tilde{I}}
\newcommand{\Kp}{K_{\rm{p}}}
\newcommand{\Ks}{K_{\rm{s}}}
%
% end
% lang4.tex - distributions.tex
\newcommand{\lI}{I}
%
% clust.tex
\newcommand{\rnk}{r^{(n)}_k}
\newcommand{\hkn}{\hat{k}^{(n)}}
% good sizes:
% -0.45: 1.25
% -0.25: 0.65
% -0.4 0.8
\newcommand{\softfig}[1]{\hspace{-0.4in}\psfig{figure=octave/kmeansoft/ps1/#1.ps,width=0.8in,angle=-90}}
\newcommand{\softtfa}[3]{\begin{tabular}{c}{$t=#2$}\\
\hspace*{-0.4in}\mbox{\psfig{figure=octave/kmeansoft/#3/#1.ps,width=1.2in,angle=-90}\hspace*{-0.2in}}\\
\end{tabular}}
\newcommand{\softtfabig}[3]{\begin{tabular}{c}{$t=#2$}\\
\hspace*{-0.6in}\mbox{\psfig{figure=octave/kmeansoft/#3/#1.ps,width=1.5in,angle=-90}\hspace*{-0.2in}}\\
\end{tabular}}
\newcommand{\softtfabigb}[3]{\begin{tabular}{c}{$t=#2$}\\
\hspace*{-0.45in}\mbox{\psfig{figure=octave/kmeansoft/#3/#1.ps,width=1.625in,angle=-90}\hspace*{-0.2in}}\\
\end{tabular}}
\newcommand{\softtf}[2]{\softtfa{#1}{#2}{ps1}}
\newcommand{\softtfbig}[2]{\softtfabig{#1}{#2}{ps1}}
\newcommand{\softtfbigb}[2]{\softtfabigb{#1}{#2}{ps1}}
\newcommand{\softtfb}[2]{\softtfa{#1}{#2}{ps3}}
\newcommand{\softtfbbig}[2]{\softtfabigb{#1}{#2}{ps3}}
\newcommand{\softfc}[1]{\begin{tabular}{c}%
\hspace*{-0.2in}\mbox{\psfig{figure=octave/kmeansoft/ps5/#1.ps,width=1.32in,angle=-90}\hspace*{-0.2in}}\\
\end{tabular}}
% end
%
% used in _p1 and _l2
\newcommand{\hpheight}{26mm}
\newcommand{\wow}{\marginpar{\raisebox{-12pt}{\psfig{figure=figs/wow.eps,width=1in}}}}
%
% used in _l1.tex:::::::
\renewcommand{\q}{{f}}
\newcommand{\obr}[3]{\overbrace{{#1}\,{#2}\,{#3}}}
\newcommand{\ubr}[3]{\underbrace{{#1}\,{#2}\,{#3}}}
\newcommand{\nbr}[3]{{{#1}\,{#2}\,{#3}}}
%
%
\newcommand{\fitpath}{/home/mackay/octave/fit/ps}% used in fit.tex (gaussian fitting, octave)
% CUP style:
\renewcommand{\ie}{i.e.}
\renewcommand{\eg}{e.g.}
\renewcommand{\NB}{N.B.}
%
% symbols i e and d in maths (operators)
\newcommand{\im}{{\rm i}}
\newcommand{\e}{{\rm e}}
% \d is already defined
%
\newenvironment{conclusionbox}%
{\vskip 0.1pt \noindent\rule{\textwidth}{0.1pt}\vskip -18pt\begin{quote}\vskip -8pt}%
{\end{quote}\vskip -14pt \noindent\rule{\textwidth}{0.1pt}\vskip 6pt}
% {\vskip 0.1pt \noindent\rule{\textwidth}{0.1pt}\vskip -12pt\begin{quote}}%
% {\end{quote}\vskip -12pt \noindent\rule{\textwidth}{0.1pt}}
\newcommand{\dy}{\d y}
\newcommand{\plus}{+}
\newcommand{\Wenglish}{Wenglish}% winglish
\newcommand{\wenglish}{\Wenglish}% winglish
\newcommand{\percent}{{per cent}}% in USA only: percent
%
%\newcommand{\nonexaminable}{$^{*}$}
\newcommand{\nonexaminable}{}
%
% for exact sampling chapter
\newcommand{\envelope}{summary state}
%
\def\unit#1{\,{\rm #1}}
\def\cm{\unit{cm}}
\def\grams{\unit{g}}
% this is a 209 versus 2e problem: (huffman.latex edited instead)
%\def\tenrm{\rm}
%\def\tenit{\it}
%
% other problems: \pem
\renewcommand{\textfraction}{0.1}
%
% for use in free text:
\newcommand{\bits}{{\rm bits}}
\newcommand{\bita}{{\rm bit}}
% for use in equations or in '1 bit'
\newcommand{\ubits}{\,{\bits}}
\newcommand{\ubit}{\,{\bita}}
%
%
% used in ch 1:
\newcommand{\pB}{p_{\rm B}}
\newcommand{\pb}{p_{\rm b}}
%
% ch 2:
\newcommand{\sixtythree}{{\tt sixty-three}}
\newcommand{\aep}{`asymptotic equipartition' principle}
%
% used in alpha:
\newcommand{\sla}{\sqrt{\lambda_a}}
\newcommand{\kga}{\kappa\gamma}
\newcommand{\kkgg}{\kappa^2\gamma^2}
\newcommand{\skg}{\sqrt{\kappa\gamma}}
\newcommand{\TYP}{{\rm \scriptscriptstyle TYP}}
%
\newcommand{\bb}{{\bf b}}
\newcommand{\eq}{\mbox{$=$}}
%
% used in ising.tex and _s4.tex
% J=+1 are in states1, J=-1 are in states
%\newcommand{\risingsample}[1]{\psfig{figure=isingfigs/states1/#1.ps,width=1.82in}}
\newcommand{\risingsample}[1]{\psfig{figure=isingfigs/states1/#1.ps,width=1in}}% was 1.75
\newcommand{\smallrisingsample}[1]{\psfig{figure=isingfigs/states1/#1.ps,width=0.6in}}% was 1.2 was 0.9
\newcommand{\Hisingsample}[1]{\psfig{figure=isingfigs/states1/#1.ps,width=2.6in}}
\newcommand{\hisingsample}[1]{\psfig{figure=isingfigs/states/#1.ps,width=2.6in}}
\newcommand{\bighisingsample}[1]{\psfig{figure=isingfigs/states/#1.ps,width=3.86in}}
%
% used in _noiseless.tex
\newcommand{\Connectionmatrix}{Connection matrix}
\newcommand{\connectionmatrix}{connection matrix}
\newcommand{\connectionmatrices}{connection matrices}
%\newcommand{\cwM}{M}% codeword number
%\newcommand{\cwm}{m}% codeword number
\newcommand{\cwM}{S}% codeword number
\newcommand{\cwm}{s}% codeword number
\newcommand{\sa}{\alpha}% signal amplitude in gaussian channel
%
\newcommand{\cmA}{A}% connection matrix symbol
\newcommand{\bcmA}{{\bf \cmA}}% connection matrix symbol
\newcommand{\bAcm}{{\bcmA}}
\newtheorem{ctheorem}{Theorem}[chapter]
\newtheorem{definc}{Definition}[chapter]
\newcommand{\appendixref}[1]{appendix \ref{#1}}
\newcommand{\Appendixref}[1]{Appendix \ref{#1}}
\newcommand{\sectionref}[1]{section \ref{#1}}
\newcommand{\Sectionref}[1]{Section \ref{#1}}
\newcommand{\secref}[1]{section \ref{#1}}
\newcommand{\Secref}[1]{Section \ref{#1}}
%\newcommand{\chapterref}[1]{chapter #1}% why?
%\newcommand{\Chapterref}[1]{Chapter #1}% why?
\newcommand{\chapterref}[1]{chapter \ref{#1}}
\newcommand{\Chapterref}[1]{Chapter \ref{#1}}
\newcommand{\chref}[1]{chapter \ref{#1}}
\newcommand{\Chref}[1]{Chapter \ref{#1}}
\newcommand{\chone}{\ref{ch.one}}
\newcommand{\chtwo}{\ref{ch.two}}
\newcommand{\chthree}{\ref{ch.three}}
\newcommand{\chfour}{\ref{ch.four}}
\newcommand{\chfive}{\ref{ch.five}}
\newcommand{\chsix}{\ref{ch.six}}
\newcommand{\chseven}{\ref{ch.ecc}}
\newcommand{\cheight}{\ref{ch.bayes}}
\newcommand{\chthirteen}{\ref{ch.single.neuron.class}}% single neuron
\newcommand{\chfourteen}{\ref{ch.single.neuron.bayes}}% single neuron bayes?
\newcommand{\chtwelve}{\ref{ch.nn.intro}}% intro to nn
\newcommand{\chcover}{\ref{ch.cover}}
\newcommand{\chbayes}{\ref{ch.bayes}}
\newcommand{\secpulse}{\ref{sec.pulse}}% 7.2.1?}
\newcommand{\secthirteenthree}{13.3?}
\newcommand{\secmetrop}{\ref{sec.metrop}}% 11.3?}
\newcommand{\figooo}{?1.11?}
\newcommand{\eqgamma}{8.27?}
\newcommand{\TSP}{travelling salesman problem}
\newcommand{\vfe}{variational free energy}
\newcommand{\vfem}{variational free energy minimization}
% could make this \ch6 = \ref{ch6}
% author, title etc is in here....
% {headerinfo.tex}% uses special commands
\setcounter{secnumdepth}{2}%
\newcommand{\indep}{\bot}% upside down pi desired
\newcommand{\dbf}{\sl}% boldface in definitions
\newcommand{\dem}{\sl}% emphasized definitions in text
\newcommand{\solutionb}[2]{\setcounter{solution_number}{#1}
\solutiona{#2}}
\newcommand{\lsolution}[2]{\section{Solution to exercise {#1}}{#2}}
%
%
\newcommand{\FIGS}{/home/mackay/book/FIGS}
\newcommand{\bookfigs}{/home/mackay/book/figs}
\newcommand{\figsinter}{/home/mackay/handbook/figs/inter}
\newcommand{\exburglar}{\exerciseref{ex.burglar}}
\newcommand{\exnine}{\exerciseref{ex.invP}}%10}
\newcommand{\exseven}{\exerciseonlyref{ex.weigh}}% use deprecated!
% was \exseven .... \exerciseref{ex.expectn}}%9}
\newcommand{\exaseven}{\exerciseref{ex.R9}}%{7}
\newcommand{\exten}{\exerciseref{ex.expectng}}%{11}
\newcommand{\exfourteen}{\exerciseref{ex.Hadditive}}%{15}
\newcommand{\exfifteen}{\exerciseref{ex.Hcondnal}}%{16}
\newcommand{\exeighteen}{\exerciseref{ex.Hmutualineq}}%{19}
\newcommand{\extwenty}{\exerciseref{ex.rel.ent}}%{21}
\newcommand{\extwentyone}{\exerciseref{ex.joint}}%{22}% the joint ensemble
\newcommand{\extwentytwo}{\exerciseref{ex.dataprocineq}}%{23}
\newcommand{\extwentythree}{\exerciseref{ex.zxymod2}}%{24}
\newcommand{\extwentyfour}{\exerciseref{ex.waithead}}%{25}
\newcommand{\extwentyfive}{\exerciseref{ex.sumdice}}%{26}
\newcommand{\extwentysix}{\exerciseref{ex.RN}}%{27}
\newcommand{\extwentyseven}{\exerciseref{ex.RNGaussian}}%{28}
\newcommand{\exthirtyone}{\exerciseref{ex.logit}}%{32}% logistic
\newcommand{\exthirtysix}{\exerciseref{ex.exponential}}%{37}%
\newcommand{\exthirtyseven}{\exerciseref{ex.blood}}%{38}% forensic
\newcommand{\exfiftythree}{\exerciseref{ex.}}%{53}% integers
\newcommand{\eqsixteenfive}{16.5}
\newcommand{\Kraft}{Kraft}% Kraft--McMillan
\newcommand{\exrelent}{\exerciseref{ex.rel.ent}}%{20} %% \ref{ex.rel.ent}
\newcommand{\eqKL}{1.24} %% \eqref{eq.KL}
\newcommand{\bSigma}{{\mathbf{\Sigma}}}
\newcommand{\sumproduct}{sum-product}
%
% for cpi material
%
\newcommand{\sigbias}{\sigma_{\rm bias}}
\newcommand{\sigin}{\sigma_{\rm in}}
\newcommand{\sigout}{\sigma_{\rm out}}
\newcommand{\abias}{\alpha_{\rm bias}}
\newcommand{\ain}{\alpha_{\rm in}}
\newcommand{\aout}{\alpha_{\rm out}}
%\newcommand{\bff}{\bf}
\newcommand{\handfigs}{/home/mackay/handbook/figs}
\newcommand{\mjofigs}{/home/mackay/figs/mjo}
\newcommand{\FIGSlearning}{/home/mackay/book/FIGS/learning}
\newcommand{\codefigs}{/home/mackay/_doc/code/ps/ps}
%
% mncEL stuff
%
\newcommand{\ebnowide}[1]{\mbox{\psfig{figure=../../code/#1.ps,width=2.8in,angle=-90}}}
\newcommand{\fem}{m}
\newcommand{\feM}{M}
\newcommand{\fel}{n}
\newcommand{\feL}{N}
\renewcommand{\L}{N}
\newcommand{\feLm}{{\cal N}(m)}
\newcommand{\feMl}{{\cal M}(n)}
\newcommand{\feK}{N}
\newcommand{\fek}{n}
\newcommand{\feKn}{{\cal N}(m)}
\newcommand{\feNk}{{\cal M}(n)}
\newcommand{\feN}{M}
\newcommand{\fen}{m}
\newcommand{\fer}{r}
\newcommand{\GL}{GL}
\newcommand{\SMN}{GL}
\newcommand{\NMN}{MN}
\newcommand{\MN}{MN}
\renewcommand{\check}{check}% was relationship
\newcommand{\checks}{checks}% was relationship
\newcommand{\fs}{f_{\rm s}}
\newcommand{\fn}{f_{\rm n}}
\newcommand{\llncspunc}{.}
\newcommand{\lcA}{{H}}
\newcommand{\rmncNall}{/home/mackay/_doc/code/rmncNall}
\newcommand{\oneA}{1A}
\newcommand{\twoA}{2A}
\newcommand{\thrA}{2A}
\newcommand{\oneB}{1B}
\newcommand{\twoB}{2B}
\newcommand{\thrB}{2B}
\newcommand{\bndips}{/home/mackay/_doc/code/bndips}
\newcommand{\codeps}{/home/mackay/_doc/code/ps}
\newcommand{\equalnode}{\raisebox{-1pt}[0in][0in]{\psfig{figure=figs/gallager/equal.eps,width=8pt}\hspace{0mm}}}
\newcommand{\plusnode}{\raisebox{-1pt}[0in][0in]{\psfig{figure=figs/gallager/plus.eps,width=8pt}\hspace{0mm}}}
%
\newcommand{\fourfourtable}[9]{\begin{tabular}[b]{lcc@{\hspace{4pt}}c}
\multicolumn{1}{l}{#1:} & & \multicolumn{2}{c}{#2} \\[-0.1in]% \cline{1-1}
& & {#3} & {#4} \\ \cline{3-4}
{#5} &\multicolumn{1}{l|}{#3} & {#6} & {#7} \\[-7pt]
&\multicolumn{1}{l|}{#4} & {#8} & {#9} \\
\end{tabular}}
\newcommand{\fourfourtableb}[9]{\begin{tabular}[b]{l|c@{\hspace{1pt}}c@{\hspace{3pt}}c}
{#1} & {#2} & {#3} & {#4} \\ \cline{1-1}\cline{3-4}
\multicolumn{2}{l}{#5} & & \\
\multicolumn{1}{l|}{#3} & & {#6} & {#7} \\[-5pt]
\multicolumn{1}{l|}{#4} & & {#8} & {#9} \\
\end{tabular}}
\newcommand{\fourfourtableold}[9]{\begin{tabular}[b]{l|c|c|c|}
{#1} & {#2} & {#3} & {#4} \\ \cline{1-1}
\multicolumn{2}{l|}{#5} & & \\ \hline
\multicolumn{2}{l|}{#3} & {#6} & {#7} \\ \hline
\multicolumn{2}{l|}{#4} & {#8} & {#9} \\ \hline
\end{tabular}}
\newcommand{\mathsstrut}{\rule[-3mm]{0pt}{8mm}}
%
% for ra.tex
%
\newcommand{\halfw}{0.35in}
\newcommand{\onew}{0.7in}
\newcommand{\onehalfw}{1.05in}
\newcommand{\twow}{1.4in}
\newcommand{\twohalfw}{1.75in}
\newcommand{\GHfig}[1]{\psfig{figure=GHps/#1,width=\onehalfw}}% for rate 1/3
\newcommand{\GHfigone}[1]{\psfig{figure=GHps/#1,width=\onew}}%
\newcommand{\GHfigthird}[1]{\psfig{figure=GHps/#1,width=\halfw}}
\newcommand{\GHfigquarter}[1]{\psfig{figure=GHps/#1,width=\twohalfw}}
\newcommand{\GHfigtwo}[1]{\psfig{figure=GHps/#1,width=\twow}}% for rate 1/2
\newcommand{\GHfigdouble}[1]{\psfig{figure=GHps/#1,width=\twohalfw}}% for five wide
% extra wide fitting::::::::::: (for turbo)
\newcommand{\GHfigdoubleE}[1]{\psfig{figure=GHps/#1,width=2in}}% for five wide
\newcommand{\GHfigE}[1]{\psfig{figure=GHps/#1,width=1.2in}}% for rate 1/3
%
\newcommand{\GHdrawfig}[1]{\psfig{figure=GHps/#1,width=1.5in}}% was 1.8
\newcommand{\standardfig}[1]{\psfig{figure=rirreg/#1,width=1.8in,angle=-90}}
\newcommand{\loopsfig}[1]{\psfig{figure=rirreg/loops.#1,height=1.85in,width=1.8in,angle=-90}}
\newcommand{\titledfig}[2]{\begin{tabular}{c}%
{#1}\\%
\standardfig{#2}\\%
\end{tabular}%
}
%
% for the single neuron chapters
%
\newcounter{funcfignum}
\setcounter{funcfignum}{1}
\newcommand{\funcfig}[2]{
\put(#1,#2){\makebox(0,0)[b]{
\begin{tabular}{@{}c@{}}
\psfig{figure=\FIGSlearning/f.#1.#2.ps,height=1.3in,width=1.3in,angle=-90} \\[-0.15in]
$\bw = (#1,#2)$
\\ \end{tabular}
}
}
}
\newcommand{\wflatfig}[1]{
\begin{tabular}{@{}c@{}}\setlength{\unitlength}{1in}\begin{picture}(1.5,1.3)(0.30,0.40)
\psfig{figure=\FIGSlearning/#1,height=2.43in,width=2.064in,angle=-90}
% was 1.3,1.3
\end{picture}\\\end{tabular}
}
\newcommand{\wsurfig}[1]{
\begin{tabular}{@{}c@{}}\setlength{\unitlength}{1in}\begin{picture}(1.5,1.5)(0,0)
\psfig{figure=\FIGSlearning/#1,height=1.8in,width=1.8in,angle=-90}
% was 1.5,1.5
\end{picture}\end{tabular}
}
\newcommand{\datfig}[1]{
\begin{tabular}{@{}c@{}}\setlength{\unitlength}{1in}\begin{picture}(1,1)(0.30,0.1)
\psfig{figure=\FIGSlearning/#1,height=1.2in,width=1.412in,angle=-90}
% was 1,1
\end{picture}\end{tabular}
}
\newcommand{\optens}{optimal input distribution}% used in l5.tex, l6.tex, s5.tex
\newcommand{\dilbertcopy}{{[Dilbert image Copyright\copyright{1997} United Feature Syndicate, Inc.,
used with permission.]}}
\newcommand{\Rnine}{\mbox{R}_9}
\newcommand{\Rthree}{\mbox{R}_3}
\newcommand{\eof}{{\Box}}
\newcommand{\teof}{\mbox{$\Box$}}% for use in text
\newcommand{\ta}{{\tt{a}}}
\newcommand{\tb}{{\tt{b}}}
%\newcommand{\dits}{dits}
%\newcommand{\dit}{dit}
\newcommand{\dits}{bans}
\newcommand{\dit}{ban}
%
% used in l5
%
\newcommand{\BSC}{binary symmetric channel}
\newcommand{\BEC}{binary erasure channel}
\newcommand{\subsubpunc}{}% change to . if subsubsections are given in-line headings
%
% convolutional code definitions
%
\newcommand{\cta}{t^{(a)}}
\newcommand{\ctb}{t^{(b)}}
\newcommand{\z}{z}
\newcommand{\lfsr}{linear feedback shift register}
%
% definitions for including hinton diagrams from extended directory
%
\newcommand{\ecfig}[1]{\psfig{figure=extended/ps/#1.ps,silent=}}
% extra argument
\newcommand{\ecfigb}[2]{\psfig{figure=extended/ps/#1.ps,#2,silent=}}
%
% used in _s1 and in _linear maybe
%%%%%%%%%%% see /home/mackay/code/bucky
\newcommand{\buckypsfig}[1]{\mbox{\psfig{figure=buckyps/#1,width=1.2in}}}
\newcommand{\buckypsfigw}[1]{\mbox{\psfig{figure=buckyps/#1,width=1.75in}}}
\newcommand{\buckypsgraph}[1]{\mbox{\psfig{figure=buckyps/#1,width=1.2in,angle=-90}}}
\newcommand{\buckypsgraphb}[1]{\mbox{\psfig{figure=buckyps/#1,width=1.75in,angle=-90}}}
\newcommand{\buckypsgraphB}[1]{\mbox{\psfig{figure=buckyps/#1,width=2.2in,angle=-90}}}
%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%55
% for l1a
%%%%%%%%%%%%%%%%%%
% example
% \bigrampicture{3.538mm}{hd_conbigram.ps}
% \bigrampicture{3.538mm}{hd_conbigram.ps,width=278pt}%%%%%%% 278 is the original size
% This used to work fine in latex209 then needed rejigging in 2e.
% (alignment of g,j,p,q,y wrong at the bottom) (saved to graveyard.tex
\newcommand{\bigrampicture}[3]%args are unitlength,picturename-and-picturesize,font-request
{%%%%%%%%%
\setlength{\unitlength}{#1}
\begin{picture}(30,30)(0,-30)% was 28,28 0,-28
\put(0.15,-27.8){\makebox(0,0)[bl]{\psfig{figure=bigrams/#2,angle=-90}}}
\put(1,-29){\makebox(0,0)[b]{{#3\tt a}}}
\put(2,-29){\makebox(0,0)[b]{{#3\tt b}}}
\put(3,-29){\makebox(0,0)[b]{{#3\tt c}}}
\put(4,-29){\makebox(0,0)[b]{{#3\tt d}}}
\put(5,-29){\makebox(0,0)[b]{{#3\tt e}}}
\put(6,-29){\makebox(0,0)[b]{{#3\tt f}}}
\put(7,-29){\makebox(0,0)[b]{\raisebox{0mm}[0mm][0mm]{#3\tt g}}}
\put(8,-29){\makebox(0,0)[b]{{#3\tt h}}}
\put(9,-29){\makebox(0,0)[b]{{#3\tt i}}}
\put(10,-29){\makebox(0,0)[b]{\raisebox{0mm}[0mm][0mm]{#3\tt j}}}
\put(11,-29){\makebox(0,0)[b]{{#3\tt k}}}
\put(12,-29){\makebox(0,0)[b]{{#3\tt l}}}
\put(13,-29){\makebox(0,0)[b]{{#3\tt m}}}
\put(14,-29){\makebox(0,0)[b]{{#3\tt n}}}
\put(15,-29){\makebox(0,0)[b]{{#3\tt o}}}
\put(16,-29){\makebox(0,0)[b]{\raisebox{0mm}[0mm][0mm]{#3\tt p}}}
\put(17,-29){\makebox(0,0)[b]{\raisebox{0mm}[0mm][0mm]{#3\tt q}}}
\put(18,-29){\makebox(0,0)[b]{{#3\tt r}}}
\put(19,-29){\makebox(0,0)[b]{{#3\tt s}}}
\put(20,-29){\makebox(0,0)[b]{{#3\tt t}}}
\put(21,-29){\makebox(0,0)[b]{{#3\tt u}}}
\put(22,-29){\makebox(0,0)[b]{{#3\tt v}}}
\put(23,-29){\makebox(0,0)[b]{{#3\tt w}}}
\put(24,-29){\makebox(0,0)[b]{{#3\tt x}}}
\put(25,-29){\makebox(0,0)[b]{\raisebox{0mm}[0mm][0mm]{#3\tt y}}}
\put(26,-29){\makebox(0,0)[b]{{#3\tt z}}}
\put(27,-29){\makebox(0,0)[b]{{#3--}}}
% they used to be at height -29 and were aligned bottom
%\put(27,-29){\makebox(0,0)[b]{{#3\verb+-+}}}
%
\put(29,-29){\makebox(0,0)[r]{#3$y$}}
%
\put(-0.2,-1){\makebox(0,0)[r]{{#3\tt a}}}
\put(-0.2,-2){\makebox(0,0)[r]{{#3\tt b}}}
\put(-0.2,-3){\makebox(0,0)[r]{{#3\tt c}}}
\put(-0.2,-4){\makebox(0,0)[r]{{#3\tt d}}}
\put(-0.2,-5){\makebox(0,0)[r]{{#3\tt e}}}
\put(-0.2,-6){\makebox(0,0)[r]{{#3\tt f}}}
\put(-0.2,-7){\makebox(0,0)[r]{{#3\tt g}}}
\put(-0.2,-8){\makebox(0,0)[r]{{#3\tt h}}}
\put(-0.2,-9){\makebox(0,0)[r]{{#3\tt i}}}
\put(-0.2,-10){\makebox(0,0)[r]{{#3\tt j}}}
\put(-0.2,-11){\makebox(0,0)[r]{{#3\tt k}}}
\put(-0.2,-12){\makebox(0,0)[r]{{#3\tt l}}}
\put(-0.2,-13){\makebox(0,0)[r]{{#3\tt m}}}
\put(-0.2,-14){\makebox(0,0)[r]{{#3\tt n}}}
\put(-0.2,-15){\makebox(0,0)[r]{{#3\tt o}}}
\put(-0.2,-16){\makebox(0,0)[r]{{#3\tt p}}}
\put(-0.2,-17){\makebox(0,0)[r]{{#3\tt q}}}
\put(-0.2,-18){\makebox(0,0)[r]{{#3\tt r}}}
\put(-0.2,-19){\makebox(0,0)[r]{{#3\tt s}}}
\put(-0.2,-20){\makebox(0,0)[r]{{#3\tt t}}}
\put(-0.2,-21){\makebox(0,0)[r]{{#3\tt u}}}
\put(-0.2,-22){\makebox(0,0)[r]{{#3\tt v}}}
\put(-0.2,-23){\makebox(0,0)[r]{{#3\tt w}}}
\put(-0.2,-24){\makebox(0,0)[r]{{#3\tt x}}}
\put(-0.2,-25){\makebox(0,0)[r]{{#3\tt y}}}
\put(-0.2,-26){\makebox(0,0)[r]{{#3\tt z}}}
\put(-0.2,-27){\makebox(0,0)[r]{{#3--}}}
%\put(-0.2,-27){\makebox(0,0)[r]{{#3\verb+-+}}}
\put(-0.2,1){\makebox(0,0)[r]{#3$x$}}
\end{picture}
}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% from theorems.tex for exact.tex
\newcommand{\PGB}{P^{G}_B}
\newcommand{\PGb}{P^{G}_b}
\newcommand{\PB}{P_B}
\newcommand{\Pb}{P_b}
% included by l2.tex
% definitions for weighings.tex and for text
% shows weighing trees, ternary
%
% decisions of what to weigh are shown in square boxes with 126 over 345 (l:r)
% state of valid hypotheses are listed in double boxes
% or maybe dashboxes?
% three arrows, up means left heavy, straioght means right heavy, down is balance
%
\newcommand{\mysbox}[3]{\put(#1){\framebox(#2){\begin{tabular}{c}#3\end{tabular}}}}
\newcommand{\mydbox}[3]{\put(#1){\framebox(#2){\begin{tabular}{c}#3\end{tabular}}}}
\newcommand{\myuvector}[3]{\put(#1){\vector(#2){#3}}}
\newcommand{\mydvector}[3]{\put(#1){\vector(#2){#3}}}
\newcommand{\mysvector}[2]{\put(#1){\vector(1,0){#2}}}
\newcommand{\mythreevector}[4]{\myuvector{#1}{#2,#3}{#4}\mydvector{#1}{#2,-#3}{#4}\mysvector{#1}{#4}}
%
%\newcommand{\h1}{\mbox{$1^+$}}
%\newcommand{\l1}{\mbox{$1^-$}}
%\newcommand{\h2}{\mbox{$2^+$}}
%\newcommand{\l2}{\mbox{$2^-$}}
%\newcommand{\h3}{\mbox{$3^+$}}
%\newcommand{\l3}{\mbox{$3^-$}}
%\newcommand{\h4}{\mbox{$4^+$}}
%\newcommand{\l4}{\mbox{$4^-$}}
%\newcommand{\h5}{\mbox{$5^+$}}
%\newcommand{\l5}{\mbox{$5^-$}}
%\newcommand{\h6}{\mbox{$6^+$}}
%\newcommand{\l6}{\mbox{$6^-$}}
%\newcommand{\h7}{\mbox{$7^+$}}
%\newcommand{\l7}{\mbox{$7^-$}}
%\newcommand{\h8}{\mbox{$8^+$}}
%\newcommand{\l8}{\mbox{$8^-$}}
%\newcommand{\h9}{\mbox{$9^+$}}
%\newcommand{\l9}{\mbox{$9^-$}}
%\newcommand{\h10}{\mbox{$10^+$}}
%\newcommand{\l10}{\mbox{$10^-$}}
%\newcommand{\h11}{\mbox{$11^+$}}
%\newcommand{\l11}{\mbox{$11^-$}}
%\newcommand{\h12}{\mbox{$12^+$}}
%\newcommand{\l12}{\mbox{$12^-$}}
\title{Information Theory, Inference, \& Learning Algorithms}
\shortlecturetitle{}
\shortauthor{David J.C. MacKay}
% the book - called by book.tex
% thebook.tex
% Mon 7/10/02
\setcounter{page}{0} % set to current value
\setcounter{exercise_number}{1} % set to imminent value
%
\setcounter{secnumdepth}{1} % sets the level at which subsection numbering stops
\setcounter{tocdepth}{0}
\newcommand{\mysetcounter}[2]{}%was {\setcounter{#1}{#2}}
% useful for forcing pagenumbers in drafts
%\setcounter{tocdepth}{1}
\renewcommand{\bs}{{\bf s}}
\newcommand{\figs}{/home/mackay/handbook/figs} % while in bayes chapter
% \addtocounter{page}{-1}
\thispagestyle{empty}
\begin{center}
~\\[1.5in]
{\Huge \bf Information Theory, \\[0.2in]
% Pattern Recognition \\[0.1in]
Inference,\\[0.2in]
and Learning Algorithms\\[1in]
% Probability \\[0.2in]
% and Neural Networks\\[1in]
}
{\Large\sf David J.C. MacKay } \\[0.3in]
\copyright 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003\\[1.3in]
Draft \thedraft\ \today\\
\end{center}
\dvipsb{frontpage}
\newpage
% choose one of these:
% \input{cambridgefrontstuff.tex}
%
% alternate
\fakesection{Roadmap}
% \subchapter{Part III Physics}
\section*{Information Theory, Pattern Recognition and Neural Networks}
% {\large Handout number 3.}\medskip
%
% Do not be disturbed by missing pagenumbers: the handouts do
% not include all the book's chapters.
%
% Because the handouts are based on different drafts of the book, there may
% be occasional mismatches between pagereferences, etc., where
% cross-references occur between this and the other handouts.
%
% Nonexaminable material is indicated by the symbol
% \nonexaminable.
%
\section*{Approximate roadmap for the eight-week course in Cambridge}
The course will cover about 16 chapters of this book.
The rest of the book is provided for your interest.
The book contains numerous exercises with worked solutions.
% less than half the material in this book.
\medskip
\noindent
\begin{tabular}{@{}lp{5in}}
Lecture 1 & Introduction to Information Theory.
\Chref{ch.one}.
% The lecture is mirrored by
\\
Before lecture 2 & Work on \exerciseref{ex.weigh}.
\\
& Read chapters \ref{ch.prob.ent} and \ref{ch.two}
and work on exercises in \chref{ch.prob.ent}.
% listed in the introduction to \chref{ch.two}.
\\
Lecture 2--3 & Information content \& typicality. \Chref{ch.two}.
\\
Lecture 4 & Symbol codes. \Chref{ch.three}.
\\
Lecture 5 & Arithmetic codes. \Chref{ch.ac} (sections \ref{sec.startofch4}--\ref{sec.stopbeforeLZ} only).
\\
& Read \chref{ch.prefive} and do the exercises.
\\
Lecture 6 & Noisy channels. Definition of mutual information and capacity.
\Chref{ch.five}.
\\
Lecture 7--8 & The noisy channel coding theorem.
\Chref{ch.six} (but not section \ref{sec.ch6stop} onwards).
\\
Lecture 9 & Clustering. Bayesian inference. \Chref{ch1b}, \ref{ch.clustering}, \ref{ch.ml}.\\
% as a data modelling problem. \\
& Read \chref{ch.ising} (Ising models).
\\
Lecture 10--11 & Monte Carlo methods. \Chref{ch.mc}, \ref{ch.mc2}. \\
Lecture 12 & Variational methods. \Chref{ch.mft}. \\
Lecture 13 & Neural networks -- the single neuron.
\Chref{ch.single.neuron.class}. \\
Lecture 14 & Capacity of the single neuron. \Chref{ch.single.neuron.capacity}. \\
Lecture 15 & Learning as inference. \Chref{ch.single.neuron.bayes}. \\
Lecture 16 & The Hopfield network. Content-addressable memory. \Chref{ch.hopfield}. \\
\end{tabular}
\subsection*{About the exercises}
I firmly believe that one can only understand a subject by
recreating it for oneself. To this end, I think it is essential to
work through some exercises on each topic. For guidance, each exercise
has a rating (similar to that used by \citeasnoun{Knuth_vol1})
from 1 to 5 that indicates the level of difficulty.
In addition, exercises that are especially recommended
are marked by a marginal encouraging rat -- \dorat.
Exercises that require the use of a computer may be
marked with a {\sl C}.
% will have
% a rating such as A1, A5, C1 or C5.
% The letter indicates how important I think the exercise is:
% A = very important $\ldots$ C = not essential to the flow of the
% book. The number indicates the difficulty of the problem:
% 1 = easy, 5 = research project.
I'll circulate detailed recommendations on exercises
as the course progresses.
Answers to many of the exercises are provided. Please use them
wisely.
%\begin{table}[htbp]
%\caption[a]
\begin{realcenter}
\fbox{
\begin{tabular}{ll}
%\begin{minipage}{3in}
{\sf Summary of codes for exercises}\\%[0.2in]
% \hspace{0.2in}
\begin{tabular}[b]{cl}
\dorat & Especially recommended \\[0.2in]
{\ensuremath{\triangleright}} & Recommended \\
{\sl C} & Some parts require a computer \\
\end{tabular}
%\end{minipage}
&
\begin{tabular}[b]{cl}
\pdifficulty{1} & Simple (one minute) \\
\pdifficulty{2} & Medium (quarter hour) \\
\pdifficulty{3} & Moderately hard \\
\pdifficulty{4} & Hard \\
\pdifficulty{5} & Research project \\[0.2in]
\end{tabular}
\\
\end{tabular}
}
\end{realcenter}
%\end{table}
%%%%%%%%%%%%%%
\newpage
\tableofcontents
\dvipsb{toc}
% \input{extrafrontstuff.tex}% aims dedication, about the author, etc
% see also tex/oldaims.tex
% for some good stuff.
% and tex/typicalreaders.tex
%
%% \input{tex/overview2001.tex}
\prechapter{About Chapter}
\setcounter{page}{1} % set to current value
\label{pch.one}
%
% pre-chapter 1
%
\fakesection{Before ch 1}
I hope you will find the mathematics in the first chapter easy.
You will need to be familiar with the \ind{binomial distribution}.
% , reviewed below.
And to solve the exercises in the text --
which I urge you to do -- you will need to remember {\dem\ind{Stirling's
approximation}\/}\index{approximation, Stirling's}
for the factorial function, $%\beq
x! \simeq x^{x} e^{-x}
$,
and be able to
apply it to ${{N}\choose{r}} =
\frac{N!}{(N-r)!r!}$.\marginpar{\footnotesize{Unfamiliar notation?\\ See
appendix \ref{app.notation}, \pref{app.notation}.}}
% $x!$
These topics are reviewed below.
\subsection*{The binomial distribution}
\label{sec.first.binomial}
\exampl{ex.binomial}{
A bent coin has probability $f$ of coming up heads.
The coin is tossed $N$ times.
What is the probability
distribution of the number of heads, $r$?
What are the \ind{mean} and \ind{variance} of $r$?
}
\amarginfig{t}{%
\begin{tabular}{r}
% $P(r|f,N)$\\
\mbox{\psfig{figure=bigrams/urn.f.g.ps,angle=-90,width=1.51in}}%
\\
\mbox{\psfig{figure=bigrams/urn.f.l.ps,angle=-90,width=1.64in}}%
\\[-0.1in]
\multicolumn{1}{c}{$r$}
\\
\end{tabular}
%}{%
\caption[a]{The binomial distribution $P(r|f\eq 0.3,\,N \eq 10)$,
on a linear scale (top) and a logarithmic scale (bottom).}
\label{fig.binomial}
}
% see bigrams/README
\noindent
%\begin{Sexample}{ex.binomial}
{\sf Solution:}
\label{sec.first.binomial.sol}
The number of heads
has a binomial distribution.
\beq P(r|f,N) = {N \choose r} f^{r} (1-f)^{N-r} \eeq
The mean, $\Exp [ r ]$, and variance, $\var[r]$,
of this distribution are
defined by
\beq
\Exp [ r ] \equiv \sum_{r=0}^{N} P(r|f,N) \, r
\label{eq.mean.def}
\eeq
\beqan
\var[r] & \equiv &
\Exp \left[ \left( r - \Exp [ r ] \right)^2 \right] \\
& = &
\Exp [ r^2 ] - \left( \Exp [ r ] \right)^2
= \sum_{r=0}^{N} P(r|f,N) r^2 - \left( \Exp [ r ] \right)^2 .
\label{eq.var.sum}
\eeqan
%
Rather than evaluating the sums over $r$ (\ref{eq.mean.def},\ref{eq.var.sum}) directly,
it is easiest to obtain the mean and variance by noting that $r$
is the sum of $N$ {\em independent\/}
% , identically distributed
random variables, namely, the number of heads in the
first toss (which is either zero or one),
the number of heads in the second toss, and so forth.
In general,
\beq
\begin{array}{rcll}
\Exp [ x + y ] &=& \Exp [ x ] + \Exp [ y ] & \mbox{for any random variables $x$ and $y$};
\\
\var [ x + y ] &=& \var [ x ] + \var [ y ] & \mbox{if $x$ and $y$ are independent}.
\end{array}
\eeq
So the mean of $r$ is the sum of the means of those random
variables, and the variance of $r$ is the sum of their variances.
% its mean and variance are given by adding the means and variances
% of those random variables, respectively.
The mean number of heads in a single toss
is $f\times 1 + (1-f)\times 0 = f$, and the variance of the
number of heads in a single toss is
\beq
\left[ f\times 1^2 + (1-f)\times 0^2 \right] - f^2 = f - f^2 = f(1-f),
\eeq
so the mean and variance of $r$ are:
\beq \Exp [ r ] = N f
%\eeq\beq
\hspace{0.5in} \mbox{and} \hspace{0.5in}
\var[r] = N f (1-f) .
\eeq
%\end{Sexample}
% ADD END PROOF SYMBOL HERE !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
\subsection*{Approximating $x!$ and ${{N}\choose{r}}$}
\amarginfig{t}{%
\begin{tabular}{r}
\mbox{\psfig{figure=bigrams/poisson.g.ps,angle=-90,width=1.5in}}%
\\
\mbox{\psfig{figure=bigrams/poisson.l.ps,angle=-90,width=1.64in}}%
\\[-0.1in]
\multicolumn{1}{c}{$r$}
\\
\end{tabular}
%}{%
\caption[a]{The Poisson distribution $P(r\,|\,\l\eq 15)$,
on a linear scale (top) and a logarithmic scale (bottom).}
\label{fig.poisson}
}
% see bigrams/README
\label{sec.poisson}
% FAVOURITE BIT
\noindent
Let's derive Stirling's approximation by an unconventional route.
We start from the \ind{Poisson distribution},
\beq
P( r | \l ) = e^{-\l} \frac{\l^r}{r!} \:\:\:\:
\:\: r\in \{ 0,1,2,\ldots\} .
\label{eq.poisson}
\eeq
%
% \noindent
For large $\l$, this distribution is well approximated -- at least\index{approximation by Gaussian}
in the vicinity of $r \simeq \l$ -- by
a Gaussian distribution with mean $\l$ and variance $\l$:
% So,
\beq
e^{-\l} \frac{\l^r}{r!} \simeq \frac{1}{\sqrt{2\pi \l}}
e^{{\textstyle -\frac{(r-\l)^2}{2\l}}} .
\eeq
Let's plug $r=\l$ into this formula.
\beqan
e^{-\l} \frac{\l^{\l}}{\l!} &\simeq& \frac{1}{\sqrt{2\pi \l}}
\\
\Rightarrow \l! &\simeq& \l^{\l} \, e^{-\l} \sqrt{2\pi \l} .
\eeqan
This is {\bf Stirling's approximation}
for the \ind{factorial} function,
including several of the correction
terms that are usually forgotten.
\beq
x! \simeq x^{x} e^{-x} \sqrt{2\pi x} \:\:\:\Leftrightarrow\:\:\:
\ln x! \simeq x \ln x - x + {\textstyle\frac{1}{2}} \ln {2\pi x} .
\label{eq.stirling}
\eeq
%
We can use this approximation
% the approximation
%$%\beq
% x! \simeq x^{x} e^{-x} $
to approximate\index{combination}
$%\beq
{{N}\choose{r}} \equiv \frac{N!}{(N-r)!r!}
$.%\eeq
\beqan
\ln {{N}\choose{r}}
% & \simeq &
% N [ \ln N - 1 ] - (N-r) [ \ln (N-r) - 1 ] - r [ \ln r - 1 ]
%\\
& \simeq & (N-r) \ln\frac{N}{N-r} + r \ln\frac{N}{r}
.
\label{eq.choose.approx}
\eeqan
Since all the terms in this equation are logarithms,
this result can be rewritten in any base.\marginpar{\small Recall that
$\displaystyle{ \log_2 x = \frac{ \log_e x }{ \log_e 2} }$.\\[0.03in]
Note that $\displaystyle\frac{\partial \log_2 x }{\partial x} =
\frac{1}{\log_e 2}\,\frac{1}{x}$.
}
%\fakesubsection*{My rule about log and ln}
We will denote\index{conventions!logarithms}\index{notation!logarithms}
natural logarithms ($\log_e$) by `ln', and \ind{logarithms}
to base 2 ($\log_2$)
by `$\log$'.
If we introduce the {\dbf\ind{binary entropy function}},
\beq
H_2(x) \equiv x \log \frac{1}{x} + (1-x) \log \frac{1}{(1-x)}
\eeq
then we can rewrite the approximation
(\ref{eq.choose.approx})
%\beq
%$ \log {{N}\choose{r}}
% \simeq (N-r) \log \frac{N}{N-r} + r \log \frac{N}{r}
%$
%\eeq
as
\amarginfig{t}{\small%
\begin{center}
\mbox{
\hspace{-6mm}
% \hspace{6.2mm}
\raisebox{\hpheight}{$H_2(x)$}
% to put H at left:
\hspace{-7.5mm}
% \hspace{-20mm}
\mbox{\psfig{figure=figs/H2.ps,%
width=42mm,angle=-90}}$x$
}
% see also H2p.tex
\end{center}
\caption[a]{The binary entropy function.}
% $H_2(x)$.}
\label{fig.h2x}
}
\beq
\log {{N}\choose{r}}
\simeq N H_2(r/N) ,
\label{eq.stirling.choose.l}
\eeq
or, equivalently,
% \:\:\:\Leftrightarrow\:\:\:
\beq
{{N}\choose{r}}
\simeq 2^{N H_2(r/N)} .
\label{eq.stirling.choose}
\eeq
If we need a more accurate approximation, we
can include terms of the next order from
Stirling's approximation
(\ref{eq.stirling}):
\beq
\log {{N}\choose{r}}
\simeq N H_2(r/N) -
{\textstyle\frac{1}{2}} \log \left[ {2\pi N \, \frac{N\!-\!r}{N} \,
\frac{r}{N}} \right]
.
\label{eq.H2approxaccurate}
\eeq
%
% - {\textstyle\frac{1}{2}} \ln {2\pi N}
% + {\textstyle\frac{1}{2}} \ln {2\pi N-r}
% + {\textstyle\frac{1}{2}} \ln {2\pi r}
%
% ln += {\textstyle\frac{1}{2}} \ln {2\pi (N-r)(r)/N}
% log_2 += {\textstyle\frac{1}{2}} \log_2 {2\pi (N-r)(r)/N}
% or
% log_2 += {\textstyle\frac{1}{2}} \log_2 {2\pi N}
% + {\textstyle\frac{1}{2}} \log_2 {\frac{(N-r)}{N}\frac{r}{N}}
% log_2 += {\textstyle\frac{1}{2}} \log_2 {2\pi \frac{(N-r)}{N}\frac{r}{N} N}
\chapter{Introduction to Information Theory}
\label{ch.one}
\label{chone}
\addtopic{3}{infotheory}
\addtopic{2}{inference}
\addtrack{1}{inferencecourse}
\addtrack{3}{infotheorycourse}
\addtrack{3}{itprnncourse}
% % \part{Information Theory}
% \chapter{Introduction to Information Theory}
\label{ch1}
%\section{Communication over noisy channels}
% One of the principal questions addressed by information theory is
% Shannon's ground-breaking paper on `The Mathematical Theory of
% Communication' opens thus:
\begin{quotation}
\noindent
The fundamental problem of communication is that of reproducing at one point
either exactly or approximately a message selected at another point.
\\
\mbox{~} \hfill {\em (Claude Shannon, 1948)} \\
%
\end{quotation}
\noindent
In the first half of
this book we
%are going to
study how to measure information content;
we
% are going to
% learn by how much data from a given source
% can be compressed; we
% are going to
learn how
% , practically, to
% achieve data compression;
to compress data; and we
% are going to
learn how to communicate
perfectly over imperfect communication channels.
We start by getting a feeling for this last problem.
\section[How can we achieve perfect communication?]{How
can we achieve perfect communication over an imperfect, noisy
commmunication channel?}
Some examples of noisy communication channels are:
\bit
\item
an analogue telephone
line,\marginpar{\footnotesize
\setlength{\unitlength}{1mm}%
\begin{picture}(45,10)(0,5)
\put(0,10){\makebox(0,0)[l]{\shortstack{modem}}}
\put(21,10){\makebox(0,0)[l]{\shortstack{phone\\line}}}
\put(39,10){\makebox(0,0)[l]{\shortstack{modem}}}
\put(15,10){\vector(1,0){3}}
\put(32,10){\vector(1,0){3}}
\end{picture}
}
over which two modems communicate digital information;
\item
the radio communication link from the Jupiter-orbiting spacecraft,
Galileo,\marginpar{\footnotesize
\setlength{\unitlength}{1mm}%
\begin{picture}(45,10)(0,5)
\put(0,10){\makebox(0,0)[l]{\shortstack{Galileo}}}
\put(21,10){\makebox(0,0)[l]{\shortstack{radio\\waves}}}
\put(39,10){\makebox(0,0)[l]{\shortstack{Earth}}}
\put(15,10){\vector(1,0){3}}
\put(32,10){\vector(1,0){3}}
\end{picture}
}
to earth;
\item
\marginpar[c]{\footnotesize
\setlength{\unitlength}{1mm}%
\begin{picture}(30,20)(0,0)
\put(0,10){\makebox(0,0)[l]{\shortstack{parent\\cell}}}
\put(16,2){\makebox(0,0)[l]{\shortstack{daughter\\cell}}}
\put(16,16){\makebox(0,0)[l]{\shortstack{daughter\\cell}}}
\put(10,10){\vector(1,1){5}}
\put(10,10){\vector(1,-1){5}}
\end{picture}
}reproducing cells, in which the daughter cells's \ind{DNA}
contains information from the parent
% cell or
cells;
\item
\marginpar{\footnotesize
\setlength{\unitlength}{1mm}%
\begin{picture}(45,10)(0,5)
\put(0,10){\makebox(0,0)[l]{\shortstack{computer\\ memory}}}
\put(20,10){\makebox(0,0)[l]{\shortstack{disc\\drive}}}
\put(33,10){\makebox(0,0)[l]{\shortstack{computer\\ memory}}}
\put(15,10){\vector(1,0){3}}
\put(29,10){\vector(1,0){3}}
\end{picture}
}a disc drive.
\eit
The last example shows that communication doesn't have to involve
information going from one {\em place\/} to another. When
we write a file on a disc drive, we'll
% typically
read it off
% again
in the same location -- but at a later {\em time}.
These channels are noisy. A telephone line suffers
from cross-talk with other lines; the hardware in the
line distorts and adds noise to the transmitted signal. The deep
space network that listens to Galileo's puny transmitter
% fairy-bulb power
receives background radiation from
terrestrial and cosmic sources.
DNA is subject to mutations and damage.
A \ind{disc drive}, which writes
a binary digit (a one or zero, also known as a {\dbf bit}) by aligning a patch of magnetic
material in one of two orientations, may later
% , with some probability,
fail to read out the stored binary digit:
% that was stored
the patch of material might spontaneously flip
magnetization, or
a glitch of
background noise might cause the reading circuit
to report the wrong
value for the binary digit, or the writing head might not induce
the magnetization in the first place because of interference
from neighbouring bits.
In all these cases, if we transmit data, \eg, a string
of bits, over the channel, there is some probability that
the received message will not be identical to the transmitted message.
% And in all cases,
We would prefer to have a communication channel for
which this probability was zero -- or so close to zero that
for practical purposes it is indistinguishable from zero.
Let's consider
% the example of
a noisy disc drive
% having the property
that transmits each bit correctly
% transmitted
with probability
$(1-f)$ and incorrectly with probability $f$.
This model
% favourite
communication channel is known
as the {\dbf{\ind{binary symmetric channel}}} (\figref{fig.bsc1}).
\begin{figure}[htbp]
\figuremargin{%
\[
\begin{array}{c}
\setlength{\unitlength}{0.46mm}
\begin{picture}(30,20)(-5,0)
\put(-4,9){{\makebox(0,0)[r]{$x$}}}
\put(5,2){\vector(1,0){10}}
\put(5,16){\vector(1,0){10}}
\put(5,4){\vector(1,1){10}}
\put(5,14){\vector(1,-1){10}}
\put(4,2){\makebox(0,0)[r]{1}}
\put(4,16){\makebox(0,0)[r]{0}}
\put(16,2){\makebox(0,0)[l]{1}}
\put(16,16){\makebox(0,0)[l]{0}}
\put(24,9){{\makebox(0,0)[l]{$y$}}}
\end{picture}
\end{array}
\:\:\:
\begin{array}{ccl}%%%%% {c@{}c@{}l} %%%%% (for twocolumn style)
P(y\eq 0|x\eq 0) &\!=\!& 1 - \q ; \\ P(y\eq 1|x\eq 0) &\!=\!& \q ;
\end{array}
\begin{array}{ccl}
P(y\eq 0|x\eq 1) &\!=\!& \q ; \\ P(y\eq 1|x\eq 1) &\!=\!& 1 - \q .
\end{array}
\]
}{%
\caption[a]{The binary symmetric channel. The
transmitted symbol is $x$ and the
received symbol $y$. The noise level, the probability of a bit's being
flipped, is $f$.}
\label{fig.bsc1}
}%
\end{figure}
\begin{figure}[htbp]
\figuremargin{%
\begin{mycenter}
\begin{tabular}{rcl}
\psfig{figure=bitmaps/dilbert.ps,width=1.2in}
&\hspace{0.1in}%
\raisebox{0.22in}{%
\setlength{\unitlength}{1.2mm}%
\begin{picture}(20,20)(0,0)%
\put(10,1){\makebox(0,0)[t]{$(1-f)$}}
\put(10,17){\makebox(0,0)[b]{$(1-f)$}}
\put(12,9.5){\makebox(0,0)[l]{$f$}}
% \put(10,16.5){\makebox(0,0)[b]{$(1-f)$}}
\put(5,2){\vector(1,0){10}}
\put(5,16){\vector(1,0){10}}
\put(5,4){\vector(1,1){10}}
\put(5,14){\vector(1,-1){10}}
\put(4,2){\makebox(0,0)[r]{{1}}}
\put(4,16){\makebox(0,0)[r]{{0}}}
\put(16,2){\makebox(0,0)[l]{{1}}}
\put(16,16){\makebox(0,0)[l]{{0}}}
\end{picture}%
}%
\hspace{0.385in}&
\psfig{figure=_is/10000.10.ps,width=1.2in} \\
% & & \makebox[0in][l]{\large 10\% of bits are flipped} \\
\end{tabular}
\end{mycenter}
}{%
\caption[a]{A binary data sequence of length 10000 transmitted over
a binary symmetric channel with noise level $f=0.1$.
\dilbertcopy}
\label{fig.bsc.dil}
}%
\end{figure}
\noindent
As an example,
% For the sake of argument,
let's imagine that $f=0.1$, that is, ten \percent\ of the bits are
flipped (figure \ref{fig.bsc.dil}).
% For a disc drive to be useful, we would prefer that it should
% flip no bits at all in its entire lifetime.
A useful disc drive would flip no bits at all in its entire lifetime.
%
If we expect to read and write a
gigabyte per day for ten years, we require a bit error
probability of the order of $10^{-15}$, or smaller.
There are two approaches to this goal.
\subsection{The physical solution}
The physical solution is to improve the physical characteristics of
the communication channel to reduce its error probability. We could
improve our disc drive by
% , for example,
\ben
\item
using more reliable components in its circuitry;
\item
evacuating the air from the disc enclosure so as
to eliminate the turbulence that perturbs the
reading head from the track;
\item
using a larger magnetic patch to represent each bit; or
\item
using higher-power signals or cooling the
circuitry in order to reduce thermal noise.
\een
These physical modifications
typically
increase the cost of the communication
channel.
% unit of area making the disc spin at a slower rate
%
% the system solution
%
\begin{figure}%[htbp]
\figuremargin{%
\setlength{\unitlength}{1.25mm}
\begin{mycenter}
\begin{picture}(50,40)(-10,5)
\put(0,5){\framebox(25,10){\begin{tabular}{c}Noisy\\ channel\end{tabular}}}
\put(-20,20){\framebox(25,10){\begin{tabular}{c}Encoder\end{tabular}}}
\put(20,20){\framebox(25,10){\begin{tabular}{c}Decoder\end{tabular}}}
%\put(-20,40){\framebox(25,10){\begin{tabular}{c}Compressor\end{tabular}}}
%\put(20,40){\framebox(25,10){\begin{tabular}{c}Decompressor\end{tabular}}}
%\put(-50,20){\makebox(25,10){\begin{tabular}{c}{\sc Source}\\{\sc coding}\end{tabular}}}
% \put(-50,40){\makebox(25,10){\begin{tabular}{c}{\sc Channel}\\{\sc coding}\end{tabular}}}
\put(-20,37){\makebox(25,12){Source}}
%
\put(-10,14){\makebox(0,0){$\bt$}}
\put(-10,34){\makebox(0,0){$\bs$}}
\put(35,14){\makebox(0,0){$\br$}}
\put(35,34){\makebox(0,0){$\hat{\bs}$}}
\put(-7.5,18){\line(0,-1){8}}
\put(-7.5,10){\vector(1,0){6}}
\put(32.5,10){\vector(0,1){8}}
\put(32.5,10){\line(-1,0){6}}
%
\put(32.5,31){\vector(0,1){8}}
%\put(32.5,51){\vector(0,1){5}}
\put(-7.5,39){\vector(0,-1){8}}
%\put(-7.5,55){\vector(0,-1){5}}
\end{picture}
\end{mycenter}
}{%
\caption[a]{The `system' solution for
achieving
% almost perfect
reliable communication
over a noisy channel. The encoding system introduces
systematic redundancy
% in a systematic way
into the transmitted vector $\bt$. The decoding system
uses this known redundancy to deduce
from the
received vector $\br$
{\em both\/}
the original source vector
{\em and\/}
the noise introduced by the channel.
}
\label{system.solution}
}%
\end{figure}
\subsection{The `system' solution}
Information theory\index{information theory} and
\ind{coding theory}\index{system} offer
an alternative (and much more exciting)
approach: we accept the given noisy channel as it is
and
add communication {\dem systems\/} to it so that we
can {detect\/} and {correct\/} the errors introduced by the
% noise.
channel.
As shown in \figref{system.solution}, we add an
{\dem\ind{encoder}\/} before the channel and a {\dem\ind{decoder}\/} after
it. The encoder encodes the source message $\bs$
into a {\dem transmitted\/} message $\bt$,
% the idea is that the encoder adds
adding {\dem\ind{redundancy}\/} to the original message in some way. The
channel adds noise to the transmitted message, yielding a received
message $\br$. The decoder uses the known redundancy
introduced by the encoding system to infer both the original signal
$\bs$ and the added noise.
% added by the channel was.
Whereas physical solutions give incremental channel improvements
only at an ever-increasing cost,
% we hope to find
% there exist
system solutions can turn noisy channels into reliable
communication channels
with the only cost being a {\em computational\/} requirement
at the encoder and decoder.
% (and the delay associated with those computations.
%
% suggested addition:
% So, as the cost of computation falls, the cost of reliability will fall as well.
{\dbf Information theory} is concerned with the theoretical limitations and
% theoretical
potentials of such systems. `What is the best error-correcting
performance we could achieve?'
{\dbf Coding theory} is concerned with the creation of practical
encoding and decoding systems.
% Some
\section{Error-correcting codes for the binary symmetric channel}
We now consider examples of encoding and decoding systems.
What is the simplest way to add useful redundancy to a transmission?
[To make the rules of the game clear:
we want to be able to detect {\em and\/} correct errors;
and retransmission is not an option. We get only
one chance to encode, transmit,
and decode.]
\subsection{Repetition codes}
\label{sec.r3}
A straightforward idea is to repeat every bit of the message a prearranged
number of times -- for example, three times, as shown in \figref{fig.r3}.
We call this {\dem \ind{repetition code}\/} `$\Rthree$'.
%\begin{figure}[htbp]
%\figuremargin{%
\amarginfig{c}{
\begin{mycenter}
\begin{tabular}{c@{\hspace{0.3in}}c} \toprule % \hline
% Source sequence $\bs$ & Transmitted sequence $\bt$ \\ \hline
Source & Transmitted \\[-0.02in] % was -0.1, which was to much
sequence & sequence \\
$\bs$ & $\bt$ \\ \midrule % \hline
\tt 0 &\tt 000 \\
\tt 1 &\tt 111 \\ \bottomrule % \hline
\end{tabular}
\end{mycenter}
%}{%
\caption[a]{The repetition code {$\Rthree$}.}
\label{fig.r3}
}%
%\end{figure}
% \noindent
%
Imagine that
% what might happen if
we transmit the source message
\[
\bs = \mbox{\tt 0 0 1 0 1 1 0}
\]
over a binary
symmetric channel with noise level $f=0.1$ using this repetition code.
We can describe the channel as `adding' a sparse noise vector $\bn$ to the
transmitted vector -- adding in modulo 2 arithmetic, \ie, the binary algebra in which
{\tt 1}+{\tt 1}={\tt 0}. A possible noise
vector $\bn$ and received vector $\br = \bt + \bn$
are shown in
\figref{fig.r3.transmission}.
\begin{figure}[htbp]
%
% here i should switch the \[ \] for a display that oes not introduce
% white space at the top (about 0.1in)
%
\figuremargin{%
\[
\begin{array}{rccccccc}
\bs & {\tt 0}&{\tt 0}&{\tt 1}&{\tt 0}&{\tt 1}&{\tt 1}&{\tt 0} \\
\bt & \obr{{\tt 0}}{{\tt 0}}{{\tt 0}}&\obr{{\tt 0}}{{\tt 0}}{{\tt 0}}&\obr{{\tt 1}}{{\tt 1}}{{\tt 1}}&\obr{{\tt 0}}{{\tt 0}}{{\tt 0}}&
\obr{{\tt 1}}{{\tt 1}}{{\tt 1}}&\obr{{\tt 1}}{{\tt 1}}{{\tt 1}}& \obr{{\tt 0}}{{\tt 0}}{{\tt 0}} \\
\bn & \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}}& \nbr{{\tt 0}}{{\tt 0}}{{\tt 1}}& \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}}& \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}}&
\nbr{{\tt 1}}{{\tt 0}}{{\tt 1}}& \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}}& \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}} \\ \cline{2-8}
\br & \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}}& \nbr{{\tt 0}}{{\tt 0}}{{\tt 1}}& \nbr{{\tt 1}}{{\tt 1}}{{\tt 1}}& \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}}&
\nbr{{\tt 0}}{{\tt 1}}{{\tt 0}}& \nbr{{\tt 1}}{{\tt 1}}{{\tt 1}}& \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}}
\end{array}
\]
}{%
\caption{An example transmission using $\mbox{R}_3$.}
\label{fig.r3.transmission}
}
\end{figure}
%\noindent
How should we decode this received vector?
%
% optimality not clear - should justify?
%
% Perhaps you can see that
The optimal algorithm looks at the received
bits three at a time and takes
a \ind{majority vote} (\algref{alg.r3}).
%
\begin{aside}
%
At the risk of explaining the obvious, let's prove this result.
The optimal decoding decision
(optimal in the sense
of having the smallest probability of being wrong)
is to find which value of $\bs$
is most probable, given $\br$.\index{MAP}
% to make clear the assumptions.
Consider the decoding of a single bit $s$, which was encoded
as
% after encoding as
$\bt(s)$
and gave rise to three received bits $\br = r_1r_2r_3$.
By \ind{Bayes's theorem},\label{sec.bayes.used} the {\dem posterior
probability\/} of $s$ is
\beq
P(s \,|\, r_1r_2r_3 ) = \frac{ P( r_1r_2r_3 \,|\, s ) P( s ) }
{ P( r_1r_2r_3 ) } .
\label{eq.bayestheorem}
\eeq
We can spell out the posterior probability of the two alternatives thus:
\beq
P(s\!=\!{\tt 1} \,|\, r_1r_2r_3 ) = \frac{ P( r_1r_2r_3 \,|\, s\!=\!{\tt 1} )
P( s\!=\!{\tt 1} ) }
{ P( r_1r_2r_3 ) } ;
\label{eq.post1}
\eeq
\beq
P(s\!=\!{\tt 0} \,|\, r_1r_2r_3 ) = \frac{ P( r_1r_2r_3 \,|\, s\!=\!{\tt 0} )
P( s\!=\!{\tt 0} ) }
{ P( r_1r_2r_3 ) } .
\label{eq.post0}
\eeq
%
This \ind{posterior probability} is determined by two factors:
the
{\dem{\ind{prior} probability\/}} $P(s)$, and
the data-dependent term $P( r_1r_2r_3 \,|\, s )$, which is called
the {\dem{\ind{likelihood}\/}} of $s$.
The normalizing constant $P( r_1r_2r_3 )$
% is irrelevant to
needn't be computed when finding
the optimal decoding decision,
which is to guess $\hat{s}\!=\!{\tt 0}$
if $P(s\!=\!{\tt 0} \,|\, \br ) > P(s\!=\!{\tt 1} \,|\, \br )$,
and $\hat{s}\!=\!{\tt 1}$ otherwise.
To find
$P(s\!=\!{\tt 0} \,|\, \br )$ and $P(s\!=\!{\tt 1} \,|\, \br )$,
% the optimal decoding decision,
we must make an assumption about the prior probabilities of the
two hypotheses ${s}\!=\!{\tt 0}$ and ${s}\!=\!{\tt 1}$, and we
must make an assumption about the probability of $\br$ given
$s$.
% $\bt(s)$.
We assume that the prior probabilities are equal:
$P( {s}\!=\!{\tt 0}) = P( {s}\!=\!{\tt 1}) = 0.5$;
then maximizing the posterior probability $P(s\,|\,\br)$ is
equivalent to maximizing the likelihood $P(\br\,|\,s)$.\index{maximum likelihood}
And we assume that the
channel is a binary symmetric channel with noise level $f<0.5$, so that
the likelihood is
\beq
P( \br \,|\, s ) = P(\br \,|\, \bt(s) ) = \prod_{n=1}^N
P(r_n \,|\, t_n(s) ) ,
\eeq
where $N=3$ is the number of transmitted bits in the block
we are considering, and
\beq
P(r_n\,|\,t_n) = \left\{ \begin{array}{lll}
(1\!-\!f) & \mbox{if} & r_n=t_n \\
f & \mbox{if} & r_n \neq t_n. \end{array} \right.
\eeq
Thus the likelihood ratio for the
two hypotheses is
% if we define $
\beq
\frac{P(\br\,|\, s\!=\!{\tt 1})}{P(\br\,|\, s\!=\!{\tt 0})}
% = \left( \frac{ (1-f) }{f} \right)^{
= \prod_{n=1}^N
\frac{P(r_n \,|\, t_n({\tt 1}) )}{P(r_n \,|\, t_n({\tt 0}) )} ;
\label{eq.likelihood.bsc}
\eeq
each factor
% $P(r_n \,|\, t_n(s) )$
$\frac{P(r_n \,|\, t_n({\tt 1}) )}{P(r_n \,|\, t_n({\tt 0}) )}$
equals $\frac{ (1-f) }{f}$ if $r_n=1$ and $\frac{f}{ (1-f) }$ if
$r_n=0$.
The ratio $\gamma \equiv \frac{ (1-f) }{f}$ is greater than 1,
since $f<0.5$, so the winning hypothesis is the one with the most
`votes', each vote counting for a factor of $\gamma$ in the
% posterior probability.
likelihood ratio.
Thus the majority-vote decoder shown in \algref{fig.r3d}
is the optimal decoder if we assume that
the channel is a binary symmetric channel and that the
two possible source messages {\tt 0} and {\tt 1}
have equal prior probability.
\end{aside}
\begin{algorithm}[htbp]
\algorithmmargin{%
\begin{mycenter}
\begin{tabular}{ccc} % \toprule % \hline
Received sequence $\br$ &
Likelihood ratio $\frac{P(\br\,|\, s\hspace{-0.2mm}=\hspace{-0.2mm}{\tt 1})}{P(\br\,|\, s\hspace{-0.2mm}=\hspace{-0.2mm}{\tt 0})}$
&
Decoded sequence $\hat{\bs}$ \\ \midrule
\tt 000 & $\gamma^{-3}$ &\tt 0 \\
\tt 001 & $\gamma^{-1}$ &\tt 0 \\
\tt 010 & $\gamma^{-1}$ &\tt 0 \\
\tt 100 & $\gamma^{-1}$ &\tt 0 \\
\tt 101 & $\gamma^{1}$ &\tt 1 \\
\tt 110 & $\gamma^{1}$ &\tt 1 \\
\tt 011 & $\gamma^{1}$ &\tt 1 \\
\tt 111 & $\gamma^{3}$ &\tt 1 \\
% \bottomrule
\end{tabular}
\end{mycenter}
}{%
\caption[a]{Majority-vote decoding algorithm for {$\Rthree$}.
Also shown are the likelihood ratios (\ref{eq.likelihood.bsc}), assuming
% This is the optimal decoder if
the channel is a binary symmetric channel; $\gamma \equiv (1-f)/f$.}
%
\label{fig.r3d}
\label{alg.r3}
}%
\end{algorithm}
%\noindent
We now apply the majority vote decoder to the received vector of \figref{fig.r3.transmission}.
The first three received bits are all ${\tt 0}$, so
we decode this triplet
as a ${\tt 0}$.
In the second triplet of \figref{fig.r3.transmission},
there are two {\tt 0}s and one {\tt 1}, so we decode
this triplet as a ${\tt 0}$ -- which in this case corrects the error.
Not all errors are corrected, however. If we are unlucky and
two errors fall in a single block, as in the fifth triplet of
\figref{fig.r3.transmission},
then the decoding rule gets the wrong answer, as shown in
\figref{fig.decoding.R3}.
% \Figref{fig.decoding.R3}
% shows the result of decoding the received vector
% from \figref{fig.r3.transmission}.
\begin{figure}[htbp]
\figuremargin{%
\[
\begin{array}{rccccccc}
\bs & {\tt 0}&{\tt 0}&{\tt 1}&{\tt 0}&{\tt 1}&{\tt 1}&{\tt 0} \\
\bt & \obr{{\tt 0}}{{\tt 0}}{{\tt 0}}&\obr{{\tt 0}}{{\tt 0}}{{\tt 0}}&\obr{{\tt 1}}{{\tt 1}}{{\tt 1}}&\obr{{\tt 0}}{{\tt 0}}{{\tt 0}}&
\obr{{\tt 1}}{{\tt 1}}{{\tt 1}}&\obr{{\tt 1}}{{\tt 1}}{{\tt 1}}& \obr{{\tt 0}}{{\tt 0}}{{\tt 0}} \\
\bn & \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}}& \nbr{{\tt 0}}{{\tt 0}}{{\tt 1}}& \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}}& \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}}&
\nbr{{\tt 1}}{{\tt 0}}{{\tt 1}}& \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}}& \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}} \\ \cline{2-8}
\br & \ubr{{\tt 0}}{{\tt 0}}{{\tt 0}}& \ubr{{\tt 0}}{{\tt 0}}{{\tt 1}}& \ubr{{\tt 1}}{{\tt 1}}{{\tt 1}}& \ubr{{\tt 0}}{{\tt 0}}{{\tt 0}}&
\ubr{{\tt 0}}{{\tt 1}}{{\tt 0}}& \ubr{{\tt 1}}{{\tt 1}}{{\tt 1}}& \ubr{{\tt 0}}{{\tt 0}}{{\tt 0}} \\
\hat{\bs} & {\tt 0}&{\tt 0}&{\tt 1}&{\tt 0}&{\tt 0}&{\tt 1}&{\tt 0} \\
\mbox{corrected errors} &
&\star & & & & & \\
\mbox{undetected errors} &
& & & &\star & &
\end{array}
\]
}{%
\caption{Decoding
% Applying the maximum likelihood decoder for $\mbox{R}_3$ to
the received vector
from \protect\figref{fig.r3.transmission}.}
\label{fig.decoding.R3}
}%
\end{figure}
\noindent
% Thus the error probability is reduced by the use of this code.
% It is easy to compute the error probability.
% Exercise 1.1. Could this be made an Example, i.e. worked through in
% the text? -- for a beginner, there is a lot in it, and it seems to
% be important.
%
% see exercise.sty
\exercissx{2}{ex.R3ep}{%%%%%%%% keep this as A2, but cut it from the ITPRNN list
Show\marginpar{\footnotesize The exercise's rating, \eg,
% `{\em{A}}2'
`[{\sl2}]'
indicates its difficulty:
`1' exercises are the easiest.
% An exercise rated {\em{A}}2 is important and should not prove too difficult.
Exercises that are accompanied by a marginal rat are especially recommended.
}
that the error probability is reduced by the use of {$\Rthree$}
by computing the error probability of
this code for a binary symmetric channel
with noise level $f$.
%Do so.
}
%
% This fig is 0.1 inch too wide, 9801
%
\begin{figure}
%\fullwidthfigure{%
%\figuredangle{% this hung off the bottom of the page
\figuremarginb{% I think this may make a collision?
\begin{center}
\setlength{\unitlength}{0.8in}% was 0.75 98.12. changed to 0.8 99.01
\begin{picture}(7,4.3)(0,1.4)
\put(0,5){\makebox(0,0)[tl]{\psfig{figure=bitmaps/dilbert.ps,width=1in}}}
\put(0.625,5.4){\makebox(0,0){\Large$\bs$}}
\thicklines
\put(1.35,4.75){\vector(1,0){0.4}}
\put(1.55,5.4){\makebox(0,0){{\sc encoder}}}
\put(2,5){\makebox(0,0)[tl]{\psfig{figure=poster/10000.r3.ps,width=1in}}}
\put(2.625,5.4){\makebox(0,0){\Large$\bt$}}
\put(3.6,5.4){\makebox(0,0){{\sc channel}}}
\put(3.6,5.15){\makebox(0,0){$f={10\%}$}}
\put(3.4,4.75){\vector(1,0){0.4}}
\put(4,5){\makebox(0,0)[tl]{\psfig{figure=poster/10000.r3.0.10.ps,width=1in}}}
\put(4.625,5.4){\makebox(0,0){\Large$\br$}}
\put(5.6,5.4){\makebox(0,0){{\sc decoder}}}
%\put(5.6,3.4){\makebox(0,0)[tl]{\parbox[t]{1.75in}{{\em The decoder takes the majority vote of the three signals.}}}}
\put(5.4,4.75){\vector(1,0){0.4}}
\put(6,5){\makebox(0,0)[tl]{\psfig{figure=poster/10000.r3.0.10.d.ps,width=1in}}}
\put(6.625,5.4){\makebox(0,0){\Large$\hat{\bs}$}}
\end{picture}
\end{center}
}{%
\caption[a]{Transmitting 10000 source bits over a binary symmetric channel
with $f=10\%$
% 0.1$
using a repetition code and the majority vote decoding
algorithm. The probability
of decoded bit error has fallen to about 3\%; the rate has fallen
to 1/3.}
% \dilbertcopy
\label{fig.r3.dilbert}
}%
\end{figure}
The error probability is dominated by the probability that two
bits in a block of three are flipped, which scales as $f^2$.
%
% JARGON??????
%
In the
case of the binary symmetric channel with $f=0.1$, the {$\Rthree$} code has a
probability of error, after decoding, of $\pb \simeq 0.03$ per bit.
\Figref{fig.r3.dilbert} shows the
result of transmitting a binary
image over a binary symmetric channel
using the repetition code.
% Should `rate' be explicitly defined?
The repetition code $\Rthree$ has therefore reduced the probability of
error, as desired.
Yet we have lost something: our
{\em rate\/} of information transfer has fallen by a factor of
three. So if we use a repetition code to communicate data over a telephone
line, it will reduce the error frequency, but it will also reduce our
communication rate. We will have to pay three times as much for each
phone call.
% there will also be a delay
Similarly,
%As for our disc drive,
we would need three of the original noisy gigabyte disc drives
in order to create a one-gigabyte disc drive with $\pb=0.03$.
Can we
% What happens as we try to
push the error probability lower, to the
values required for a
% quality
sellable disc drive -- $10^{-15}$?
We could achieve lower error probabilities by using repetition
codes with more repetitions.
\exercissx{3}{ex.R60}{
\ben
\item
Show that the probability of error of $\RN$, the repetition
code with $N$ repetitions, is
\beq
p_{\rm b} = \sum_{n=(N+1)/2}^{N} {{N}\choose{n}} f^n (1-f)^{N-n} ,
\eeq
for odd $N$.
\item
Assuming $f = 0.1$, which of the terms in this sum is the biggest?
How much bigger is it than the second-biggest term?
\item
Use \ind{Stirling's approximation} to approximate
% get rid of
the ${{N}\choose{n}}$
in the largest term, and find,
approximately, the probability of error of the repetition
code with $N$ repetitions.
\item
Assuming $f = 0.1$, find how many repetitions
are required
% show that it takes a repetition
% code with rate about $1/60$
to get the probability of error
down to $10^{-15}$. [Answer: about 60.]
\een
}
So to build a {\em single\/}
gigabyte disc drive
with the required reliability from noisy gigabyte drives with $f=0.1$,
we would need {\em sixty\/} of the noisy disc drives.
The tradeoff between error probability and rate for repetition
codes is shown in \figref{fig.pbR.R}.
%
% see end of l1.tex for method, also see poster1.gnu
%
\newcommand{\pbobject}{\hspace{-0.15in}\raisebox{1.62in}{$\pb$}%
\hspace{-0.05in}}
\begin{figure}
\figuremargin{%
\begin{center}
\begin{tabular}{cc}
\hspace{-0.2in}\psfig{figure=\codefigs/rep.1.ps,angle=-90,width=2.6in} &
\pbobject\psfig{figure=\codefigs/rep.1.l.ps,angle=-90,width=2.6in} \\
\end{tabular}
\end{center}
}{%
\caption[a]{Error probability $\pb$ versus rate for repetition codes
over a binary symmetric channel with $f=0.1$.
The right hand figure shows $\pb$ on a logarithmic scale. We would like
the rate to be large and $\pb$ to be small.
}
\label{fig.pbR.R}
}%
\end{figure}
% see end of this file for method
\subsection{Block codes -- the (7,4) Hamming code}
\label{sec.ham74}
We would like to communicate with
tiny probability of error {\em and\/} at a substantial rate.
Can we improve on repetition codes? What if we add redundancy to
{\dem blocks\/} of data instead of
% redundantly
encoding one bit at a time?
% You may already have heard of the idea of `parity check bits'.
We now
study a simple {\dem{block code}}.
A {\dem \ind{block code}\/} is a rule for converting a sequence of source
bits $\bs$, of length $K$, say, into a transmitted sequence $\bt$ of length
$N$ bits. To add redundancy, we make $N$
greater than $K$. In a {\dem linear\/} block code,
the extra $N-K$ bits are linear functions of the
original $K$ bits; these extra bits are called {\dem\ind{parity check bits}}.
An example of a \ind{linear block code}\index{error-correcting code!linear} is the \mbox{\dem(7,4)
\ind{Hamming code}}, which transmits $N=7$ bits for every $K=4$ source
bits.
\begin{figure}[htbp]
\figuremargin{\small%
\begin{center}
\begin{tabular}{cc}
(a)\psfig{figure=hamming/encode.eps,angle=-90,width=1.3in} &
(b)\psfig{figure=hamming/correct.eps,angle=-90,width=1.3in} \\
\end{tabular}
\end{center}
}{
\caption[a]{Pictorial representation of encoding for the Hamming (7,4)
code.
% a and b are not explained in the caption. Does this matter?
%
% The parity check bits $t_5,t_6,t_7$ are set so that the parity within
%% each circle is even.
}
\label{fig.74h.pictorial}
\label{fig.hamming.pictorial}
}
\end{figure}
%
The encoding operation for the code is shown pictorially
in \figref{fig.74h.pictorial}.
%
% \subsubsection{Encoding}
We arrange the seven transmitted bits in three intersecting circles.
% as shown in \figref{fig.hamming.encode}.
The first four
transmitted bits,
$t_1 t_2 t_3 t_4$, are set equal to the four source bits,
$s_1 s_2 s_3 s_4$.
The parity check bits\index{parity check bits}
$t_5 t_6 t_7$ are set so that the {\dem\ind{parity}\/}
within each circle is even:
the first parity check bit is the parity of the first three source bits
(that is, it is
%zero
{\tt 0} if the sum of those bits is even, and
% one
{\tt 1} if the sum is odd);
the second is the parity of the last three; and the third parity bit
is the parity of source bits one, three and four.
As an example, \figref{fig.74h.pictorial}b shows the transmitted
codeword for the case $\bs = {\tt 1000}$.
% idea for rewriting this: go straight to pictorial story, leave out the
% matrix description for another time.
%
%
%\noindent
%
Table \ref{tab.74h} shows the codewords generated
by each of the $2^4=$ sixteen settings of the four source bits.
% Notice that the first four transmitted bits are
% identical to the four source bits, and the remaining three bits
% are parity bits:
The special property of these codewords is that
any pair
differ from each other in at least three bits.
\begin{table}[htbp]
\figuremargin{%
\begin{center}
\mbox{\small
\begin{tabular}{cc} \toprule
% Source sequence
$\bs$ &
% Transmitted sequence
$\bt$ \\ \midrule
\tt 0000 &\tt 0000000 \\
\tt 0001 &\tt 0001011 \\
\tt 0010 &\tt 0010111 \\
\tt 0011 &\tt 0011100 \\ \bottomrule
\end{tabular} \hspace{0.02in}
\begin{tabular}{cc} \toprule
$\bs$ & $\bt$ \\ \midrule
\tt 0100 &\tt 0100110 \\
\tt 0101 &\tt 0101101 \\
\tt 0110 &\tt 0110001 \\
\tt 0111 &\tt 0111010 \\ \bottomrule
\end{tabular} \hspace{0.02in}
\begin{tabular}{cc} \toprule
$\bs$ & $\bt$ \\ \midrule
\tt 1000 &\tt 1000101 \\
\tt 1001 &\tt 1001110 \\
\tt 1010 &\tt 1010010 \\
\tt 1011 &\tt 1011001 \\ \bottomrule
\end{tabular} \hspace{0.02in}
\begin{tabular}{cc} \toprule
$\bs$ & $\bt$ \\ \midrule
\tt 1100 &\tt 1100011 \\
\tt 1101 &\tt 1101000 \\
\tt 1110 &\tt 1110100 \\
\tt 1111 &\tt 1111111 \\ \bottomrule
\end{tabular}
}%%%%%%%%% end of row of four tables
\end{center}
}{%
\caption[a]{The sixteen codewords
$\{ \bt \}$ of the (7,4) Hamming code. Any pair of
codewords
% have the % beautiful % elegant property that they
differ from each other in at least three bits.}
%\label{fig.hamming.encode}
\label{tab.74h}
\label{tab.h74}
\label{fig.h74}
\label{fig.74h}
}
\end{table}
%
\begin{aside}
Because the Hamming code is a {linear\/} code, it can\indexs{error-correcting code!linear}
be written compactly in terms of matrices as follows.
% It is a
% {\em linear\/} code; that is, t
The transmitted codeword $\bt$ is
% can be
obtained
from the source sequence $\bs$ by a linear operation,
\beq
\bt = \bG^{\T} \bs,
\label{eq.encode}
\eeq
where $\bG$ is the {\dem\ind{generator matrix}} of the code,
\beq
\bG^{\T} = {\left[ \begin{array}{cccc}
\tt 1 &\tt 0 &\tt 0 &\tt 0 \\
\tt 0 &\tt 1 &\tt 0 &\tt 0 \\
\tt 0 &\tt 0 &\tt 1 &\tt 0 \\
\tt 0 &\tt 0 &\tt 0 &\tt 1 \\
\tt 1 &\tt 1 &\tt 1 &\tt 0 \\
\tt 0 &\tt 1 &\tt 1 &\tt 1 \\
\tt 1 &\tt 0 &\tt 1 &\tt 1 \end{array} \right] } ,
\label{eq.h74.gen}
\eeq
and the encoding operation (\ref{eq.encode}) uses
modulo-2 arithmetic [${\tt 1}+{\tt 1}={\tt{0}}$, ${\tt 0}+{\tt 1}={\tt 1}$, etc.].
%\footnote{My notational
% convention is that all vectors -- $\bs$, $\bt$, etc.\ --
% are column vectors, except that in the figures where many
% vectors are listed, they are displayed as row vectors. The
% generator matrix $\bG$ is written ..... as to retain
% consistency with established notation in coding texts.}
% \begin{aside}
In the encoding operation
(\ref{eq.encode}) I have assumed that $\bs$ and $\bt$ are
column vectors. If instead they are row vectors, then this equation
is replaced by
\beq
\bt = \bs \bG,
\label{eq.encodeT}
\eeq
where
\beq
\bG = \left[ \begin{array}{ccccccc}
\tt 1& \tt 0& \tt 0& \tt 0& \tt 1& \tt 0& \tt 1 \\
\tt 0& \tt 1& \tt 0& \tt 0& \tt 1& \tt 1& \tt 0 \\
\tt 0& \tt 0& \tt 1& \tt 0& \tt 1& \tt 1& \tt 1 \\
\tt 0& \tt 0& \tt 0& \tt 1& \tt 0& \tt 1& \tt 1 \\
\end{array} \right] .
\label{eq.Generator}
\eeq
% f you are like me, you may
I find it easier to relate to
the right-multiplication (\ref{eq.encode})
than the left-multiplication (\ref{eq.encodeT}).
% -- I like my matrices to act to the right.
Many coding theory texts use the left-multiplying conventions
(\ref{eq.encodeT}--\ref{eq.Generator}), however.
The rows of the generator matrix (\ref{eq.Generator}) can be
viewed as defining four basis vectors lying in a seven-dimensional
binary space. The sixteen codewords are obtained by making all
possible linear combinations
% binary sums
of these vectors.
\end{aside}
%
% should I add a cast of characters here?
% s,t,r,s^
\subsubsection{Decoding the (7,4) Hamming code}
When we invent a more complex encoder $\bs \rightarrow \bt$,
the task of decoding the
received vector $\br$ becomes less straightforward. Remember that
{\em any\/} of the bits may have been flipped, including the parity bits.
% We can't assume that the three extra parity bits
%(The reader who
% is eager to see the denouement of the plot may skip ahead to section
% \ref{sec.code.perf}.)
% General defn of optimal decoder
If we assume that the channel is a binary symmetric channel and that
all source vectors are equiprobable,
% {\em a priori},
then the
optimal decoder
% is one that
identifies the source vector $\bs$ whose
encoding $\bt(\bs)$ differs from the received vector $\br$ in the
fewest bits. [{Refer to the likelihood function
% equation
% {eq.bayestheorem}--\ref{eq.likelihood.bsc}}
\bref{eq.likelihood.bsc}} to see why this is so.]
We could solve the decoding problem by measuring how far $\br$
is from each of the
sixteen codewords in \tabref{tab.74h} then picking the closest.
Is there a more efficient way of finding the most probable source vector?
\subsubsection{Syndrome decoding for the Hamming code}
\label{sec.syndromedecoding}
For the (7,4) Hamming code there is a pictorial solution to the
% syndrome
decoding problem, based on the encoding picture,
\figref{fig.74h.pictorial}.
%
% \subsubsection{Decoding}
%
% sanjoy says this is CONFUSING - tried to importve it Sat 22/12/01
% also romke did not like it
As a first example, let's assume the transmission was
$\bt = {\tt 1000101}$ and the noise flips the second bit,
so the received vector is
$\br = {\tt 1000101}\oplus{\tt{0100000}} = {\tt{1100101}}$.
% \ie, $\bn=({\tt 0},{\tt 1},{\tt 0},{\tt 0},{\tt 0}, {\tt 0},{\tt 0})$,
% and the received vector
We write the received vector into the three circles
as shown in \figref{fig.hamming.decode}(a), and
look at each of the three circles to see whether its parity is even.
The circles whose parity is {\em{not}\/} even are shown by
dashed lines in \figref{fig.74h.pictorial}b.
% The fact that all codewords differ from each other in at least
% three bits means that if the noise has flipped any one or two bits,
% the received vector will no longer be a valid codeword, and some of
% the parity checks will be broken.
%
The decoding task is
%We want
to find the smallest
set of flipped bits that can account for these violations
of the parity rules.
% violated.
[The pattern of violations of the parity checks is called the {\dem\ind{syndrome}}, and can be written as a binary vector -- for example,
in \figref{fig.hamming.decode}b, the syndrome is $\bz = ({\tt1},{\tt1},{\tt0})$,
because the first two circles are `unhappy' (parity {\tt1}) and the
third circle is `happy' (parity {\tt0}).]
% RESTORE ME:
%, and the task of syndrome decoding
% syndrome (just as a
% \ind{doctor} might seek the most probable underlying \ind{disease} to account for
% the symptoms shown by a \ind{patient}).
\begin{figure}% [htbp]
\figuremargin{\small%
\begin{center}
\begin{tabular}{ccc}
(a)\psfig{figure=hamming/decode.eps,angle=0,width=1.3in} \\
(b)\psfig{figure=hamming/s2.eps,angle=-90,width=1.3in} &
(c)\psfig{figure=hamming/t5.eps,angle=-90,width=1.3in} &
(d)\psfig{figure=hamming/s3.eps,angle=-90,width=1.3in} \\[0.3in]
\multicolumn{3}{c}{%
(e)\psfig{figure=hamming/s3.t7.eps,angle=0,width=1.3in}
\setlength{\unitlength}{1in}
\begin{picture}(0.4,0.6)(0,0)
\put(0,0.6){\vector(1,0){0.6}}
\end{picture}
% \raisebox{0.6in}{$\rightarrow$}
(${\rm e}'$)\psfig{figure=hamming/s3.t7.d.eps,angle=0,width=1.3in}
}\\
\end{tabular}
\end{center}
}{%
\caption[a]{Pictorial representation of decoding of the Hamming (7,4)
code. The received vector is written into the diagram
as shown in (a).
In (b,c,d,e), the received vector is
shown, assuming that the transmitted vector was
as in
% The bits that are flipped relative to
\protect
\figref{fig.hamming.pictorial}(b) and the bits labelled by $\star$
were flipped. The violated
parity checks are highlighted by dashed circles. One of the seven bits
is the most probable suspect to account for each `\ind{syndrome}', \ie, each
pattern of violated and satisfied parity checks.
In examples (b), (c), and (d), the most probable suspect is
the one bit that was flipped.
In example (e), two bits have been flipped, $s_3$ and $t_7$.
The most probable suspect is $r_2$, marked by a circle in (${\rm e}'$),
which shows the output of the decoding algorithm.
% each circle is even.
}\label{fig.hamming.decode}
\label{fig.hamming.s2}% these labels were in the wrong place feb 2000
\label{fig.hamming.s3}
\label{fig.hamming.correct}
}
\end{figure}
%
% ACTION: sanjoy still thinks this part is hard to follow - fixed Sat 22/12/01?
To solve the decoding task,
% problem,
we ask the question:
can we find a unique bit that lies {\em inside\/}
all the `unhappy' circles and {\em outside\/} all the
`happy' circles? If so, the flipping of that bit
would account for the observed
syndrome.
In the case shown in \figref{fig.hamming.s2}(b),
the bit $r_2$
% that was flipped
lies inside the two unhappy circles and outside the happy
circle;
no other single bit has this property, so
$r_2$ is the only single bit capable of explaining the syndrome.
Let's work through a couple more examples.
\Figref{fig.hamming.s2}(c) shows what happens if one of the
parity bits, $t_5$, is flipped by the noise. Just one of the checks
is violated. Only $r_5$ lies inside this unhappy circle and outside
the other two happy circles,
so $r_5$ is identified as the only single bit
capable of explaining the syndrome.
If the central bit $r_3$ is received flipped,
\figref{fig.hamming.s3}(d) shows that all three checks are violated;
Only $r_3$ lies inside all three circles, so $r_3$ is
identified as the suspect bit.
If you try flipping any one of the seven bits, you'll find
that a different syndrome is obtained in each case -- seven non-zero syndromes,
one for each bit. There is only
one other syndrome, the all-zero syndrome. So if
the channel is a binary symmetric channel with a
small noise level $f$, the optimal
decoder unflips at most one bit, depending on the
syndrome, as shown in \algref{tab.hamming.decode}.
Each syndrome could have been caused by other noise patterns
too, but any other noise pattern that has the same syndrome
must be less probable because it involves a larger number of
noise events.
%\begin{figure}
%\figuremargin{%
\begin{algorithm}
\algorithmmargin{%
\begin{center}
\begin{tabular}{c*{8}{c}}
% Fri 4/1/02 removed toprule and bottomrule because algorithm has its own frame
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% \toprule
Syndrome $\bz$ & {\tt 000} & {\tt 001} & {\tt 010} & {\tt 011} & {\tt 100} & {\tt 101} & {\tt 110} & {\tt 111} \\ \midrule
Unflip this bit & {\small{\em none}} & $r_7$ & $r_6$ & $r_4$ & $r_5$ & $r_1$ & $r_2$ & $r_3$ \\
% \bottomrule
% Unflip this bit & {\small{\em none}} & 7 & 6 & 4 & 5 & 1 & 2 & 3 \\
% \bottomrule
% this is appropriate only if z =z3,z2,z1:
% Unflip this bit & {\small{\em none}} & 5 & 6 & 2 & 7 & 1 & 4 & 3 \\ \hline
\end{tabular}
\end{center}
%\begin{center}
%\begin{tabular}{cc} \hline
%Syndrome $\bz$ & % 3 2 1 !!!!!!!!!!!!!!!!!!!
%Flip this bit \\ \hline
% 000 &{\small{\em none}} \\
% 001 &5\\
% 010 &6\\
% 011 &2\\
% 100 &7\\
% 101 &1\\
% 110 &4\\
% 111 &3 \\ \hline
%\end{tabular}
%\end{center}
}{%
\caption[a]{Actions taken by the optimal decoder for the (7,4) Hamming
code, assuming a binary symmetric channel with small noise level $f$.
The syndrome vector $\bz$ lists whether each parity check is
violated ({\tt 1}) or satisfied ({\tt 0}),
going through the checks in the order
of the bits $r_5$, $r_6$,
and $r_7$. }
\label{tab.hamming.decode}
}%
\end{algorithm}
What happens if the noise actually flips more than one bit?
\Figref{fig.hamming.s3}(e) shows the situation when two bits,
$r_3$ and $r_7$, are received flipped. The syndrome, {\tt 110},
makes us suspect the single bit $r_2$; so our optimal decoding algorithm
flips this bit, giving a decoded pattern with three errors
as shown in \figref{fig.hamming.s3}(${\rm e}'$).
If we use the optimal decoding algorithm,
any two-bit error pattern will lead to a decoded seven-bit vector
that contains three errors.
\subsection{General view of decoding for linear codes: syndrome decoding}
\label{sec.syndromedecoding2}
\begin{aside}
% {\em (Does some of this stuff belong earlier in the pictorial area?)}
We can also describe the decoding problem
for a linear code in terms of matrices.\index{syndrome decoding}
% In the case of a linear code and a symmetric channel,
% the decoding task can be re-expressed as {\bf syndrome decoding}.
% Let's assume that the noise level $f$ is less than $1/2$.
The first four received bits, $r_1r_2r_3r_4$, purport to be
the four source bits; and the received bits $r_5r_6r_7$ purport
to be the parities of the source bits, as defined by the generator
matrix $\bG$. We evaluate the three parity check bits for the
received bits, $r_1 r_2r_3 r_4$, and see whether
they match the three received
bits, $r_5r_6r_7$. The differences (modulo 2) between
these two triplets are called the {\dbf\ind{syndrome}}
of the received vector.
If the syndrome is zero -- if all three parity checks are happy
% agree with the corresponding received bits
-- then the received vector is a codeword,
and the most probable decoding is given by reading out its first four
bits. If the syndrome is non-zero, then
% we are certain that
the noise
sequence for this block was non-zero, and the syndrome is our
pointer to the most probable error pattern.
The computation of the syndrome vector is a
linear operation. If we define the $3 \times 4$ matrix $\bP$
such that the matrix of
equation (\ref{eq.h74.gen})
is
\beq
\bG^{\T} = \left[ \begin{array}{c}{\bI_4}\\
\bP\end{array} \right],
\eeq
where $\bI_4$ is the $4\times 4$ identity matrix, then
the syndrome vector is $\bz = \bH \br$, where the {\dbf\ind{parity check matrix}}
$\bH$ is given by $\bH = \left[ \begin{array}{cc} -\bP & \bI_3 \end{array}
\right]$; in modulo 2 arithmetic, $-1 \equiv 1$, so
\beq
\bH = \left[ \begin{array}{cc} \bP & \bI_3 \end{array}
\right] = \left[
\begin{array}{ccccccc}
\tt 1&\tt 1&\tt 1&\tt 0&\tt 1&\tt 0&\tt 0 \\
\tt 0&\tt 1&\tt 1&\tt 1&\tt 0&\tt 1&\tt 0 \\
\tt 1&\tt 0&\tt 1&\tt 1&\tt 0&\tt 0&\tt 1
\end{array} \right] .
\label{eq.pcmatrix}
\eeq
All the codewords $\bt = \bG^{\T} \bs$ of the code satisfy
\beq
\bH \bt = \left[ {\tt \begin{array}{c} \tt0\\ \tt0\\ \tt0 \end{array} } \right] .
% (0,0,0) .
\eeq
\exercisxB{2}{ex.GHis0}{
Prove that this is so by evaluating the $3\times4$ matrix $\bH \bG^{\T}$.
}
Since the received vector $\br$ is given by $\br = \bG^{\T}\bs + \bn$,
% and $\bH \bG^{\T}$=0,
the syndrome-decoding problem is to find the
most probable noise vector $\bn$ satisfying
the equation
\beq
\bH \bn = \bz .
\eeq
A decoding algorithm that solves this problem is called
a {\dem maximum-likelihood decoder}. We will discuss
decoding problems like this in later chapters.
%\footnote{Somewhere in this book
% I need to spell out Bayes' theorem for decoding. Here would be
% a good spot; but on the other hand, people can understand decoding
% intuitively, they don't need Bayes theorem and they might find it
% a hindrance if they were not only being hit by
% Shannon's theorem but also by likliehoods and priors.}
%
% ACTION NEEDED ????????????????????????????????????????
%
\end{aside}
\begin{figure}
%\fullwidthfigure{%
\figuredanglenudge{%
\begin{center}
\setlength{\unitlength}{0.8in}% was 1in, with figures 1.25 wide % then was 0.8 with 1in
\begin{picture}(7,2.7)(0,2.8)
\put(0,5){\makebox(0,0)[tl]{\psfig{figure=bitmaps/dilbert.ps,width=1in}}}
\put(0.625,5.4){\makebox(0,0){\Large$\bs$}}
\thicklines
\put(1.35,4.75){\vector(1,0){0.4}}
\put(1.55,5.4){\makebox(0,0){{\sc encoder}}}
\put(2,5){\makebox(0,0)[tl]{\psfig{figure=poster/10000.h74.ps,width=1in}}}
\put(1.982,3.75){\makebox(0,0)[tr]{{parity bits} $\left.\rule[-0.342in]{0pt}{0.342in} \right\{$}}
\put(2.625,5.4){\makebox(0,0){\Large$\bt$}}
\put(3.6,5.4){\makebox(0,0){{\sc channel}}}
\put(3.6,5.15){\makebox(0,0){$f={10\%}$}}
\put(3.4,4.75){\vector(1,0){0.4}}
\put(4,5){\makebox(0,0)[tl]{\psfig{figure=poster/10000.h74.0.10.ps,width=1in}}}
\put(4.625,5.4){\makebox(0,0){\Large$\br$}}
\put(5.6,5.4){\makebox(0,0){{\sc decoder}}}
%\put(5.6,3.5){\makebox(0,0)[tl]{\parbox[t]{1.75in}{{\em The decoder picks the $\hat{\bs}$ with maximum likelihood.}}}}
\put(5.4,4.75){\vector(1,0){0.4}}
\put(6,5){\makebox(0,0)[tl]{\psfig{figure=poster/10000.h74.0.10.d.ps,width=1in}}}
\put(6.625,5.4){\makebox(0,0){\Large$\hat{\bs}$}}
\end{picture}
\end{center}
}{%
\caption[a]{Transmitting 10,000 source bits over a binary symmetric channel
with $f=10\%$
%0.1$
using a (7,4) Hamming code. The probability
of decoded bit error is about 7\%.}
% \dilbertcopy}
\label{fig.h74.dilbert}
}{0.7in}% third argument is the upward nudge of the caption
\end{figure}
\subsection{Summary of the (7,4) Hamming code's properties}
Every possible received vector of length 7 bits is either a codeword,
or it's one flip away from a codeword.\index{Hamming code}
Since there are three parity constraints, each of which might
or might not be violated, there are
$2\times 2\times 2= 8$
% eight
distinct syndromes. They can be divided
into seven non-zero syndromes -- one
for each of the one-bit error patterns --
and the all-zero syndrome, corresponding to the zero-noise case.
The optimal decoder takes no action if the syndrome is zero,
otherwise it uses this mapping of non-zero syndromes onto one-bit error
patterns to unflip the suspect bit.
There is a {\dbf decoding error} if the four decoded bits $\hat{s}_1,
\ldots, \hat{s}_4$ do not all match the source bits ${s}_1,
\ldots, {s}_4$. The {\dbf probability of block error} $\pB$ is
the probability that one or more of the decoded bits in one block fail to
match the corresponding source bits,
\beq
\pB = P( \hat{\bs} \neq \bs ) .
\eeq
The {\dbf probability of bit error} $\pb$ is
the average probability
% per decoded bit
that a decoded bit fails to
match the corresponding source bit,
\beq
\pb = \frac{1}{K} \sum_{k=1}^K P( \hat{s}_k \neq s_k ) .
\eeq
In the case of the Hamming code,
a decoding error will occur whenever the noise has flipped more than
one bit in a block of seven.
% Any noise pattern that flips more than one bit will give rise to one of
% these syndromes, and our decoder will make an erroneous decision.
%
The probability of block error is thus the probability that two or more
bits are flipped in a block. This probability scales as $O(f^2)$, as did the
probability of error for the repetition code
$\Rthree$. But notice that the Hamming code
communicates at a greater rate, $R=4/7$.
\Figref{fig.h74.dilbert} shows a binary image transmitted over
a binary symmetric channel using the (7,4) Hamming code.
About 7\% of the decoded bits are in error. Notice that
the errors are correlated:
% with each other:
often two or three successive
decoded bits are flipped.
\exercissxA{1}{ex.Hdecode}{
This exercise and the next three refer to the
(7,4) \ind{Hamming code}. Decode the received strings:
\ben
\item $\br = {\tt 1101011}$ % 10
\item $\br = {\tt 0110110}$ % 4
\item $\br = {\tt 0100111}$ % 4
\item $\br = {\tt 1111111}$. % 15
\een
}
\exercisxA{2}{ex.H74p}{
\ben \item
Calculate the probability of block error $p_{\rm B}$ of the (7,4) Hamming
code
as a function of the noise level $f$ and show that to leading order
% \footnote{Do I need to explain what this means? Or use a different
% terminology? Maybe only physicists are familiar?}
%
% ACTION!!!
%
it goes as $21 f^2$.
\item
% }
% \exercis{}{
$^{B3}$
Show that to leading order the probability of
bit error $\pb$ goes as $9 f^2$.
\een}
\exercisxA{2}{ex.H74zero}{
% Hamming (7,4) code.
Find some noise vectors that give the all-zero syndrome (that is,
noise vectors that leave all the parity checks unviolated).
How many such noise vectors are there?
}
% they are the codewords.
\exercisxB{2}{ex.H74detail}{
% Hamming (7,4) code.
I asserted above that a block decoding error will result
whenever two or more bits are flipped in a single block.
Show that this is indeed so. [In principle, there might be
error patterns that, after decoding, led only to the corruption
of the parity bits, without the source bits's being incorrectly
decoded.]
}
\exercisxB{2}{ex.R9}{
Consider
% the encoder for
the repetition code $\Rnine$. One way
of viewing this code is as a \index{concatenation!error-correcting codes}{\dbf{concatenation}} of $\Rthree$ with
$\Rthree$. We first encode the source stream with $\Rthree$, then encode
the resulting output with $\Rthree$. We could call this code `$\Rthree^2$'.
This idea motivates an alternative decoding algorithm, in which we decode the
bits three at a time using the decoder for $\Rthree$; then decode the
decoded bits from that first decoder using the decoder for $\Rthree$.
Evaluate the probability of error for this decoder and compare it with the
probability of error for the optimal decoder for $\Rnine$.
Do the concatenated encoder and decoder for $\Rthree^2$ have advantages over
those for $\Rnine$?
}
\subsection{Summary of codes' performances}
\label{sec.code.perf}
Figure \ref{fig.pbR.RH} shows the performance of \ind{repetition code}s and
the \ind{Hamming code}. It also shows the performance of a family of linear
block codes that are generalizations of Hamming codes, \ind{BCH codes}.
% Reed-Muller codes, and
% see end of this file for method
%
\begin{figure}[htbp]
\figuremargin{%
\begin{center}
\begin{tabular}{cc}
\hspace{-0.2in}\psfig{figure=\codefigs/rephambch.1.ps,angle=-90,width=2.6in} &
\pbobject\psfig{figure=\codefigs/rephambch.1.l.ps,angle=-90,width=2.6in} \\
\end{tabular}
\end{center}
}{%
\caption[a]{Error probability $\pb$ versus rate $R$ for repetition codes,
the (7,4) Hamming code and BCH codes with block lengths up to 1023
over a binary symmetric channel with $f=0.1$.
The righthand figure shows $\pb$ on a logarithmic scale.}
\label{fig.pbR.RH}
}
\end{figure}
%
%\noindent
% use this noindent if the ``h'' (here) works, otherwise new para.
This figure shows that we can, using linear block codes, achieve better
performance than repetition codes; but the asymptotic situation still
looks grim.
\exercisxA{4}{ex.makecode}{
% invent your own code
Design an error-correcting code and a decoding algorithm for it,
compute its probability of error,
and add it to figure \ref{fig.pbR.RH}.
[Don't worry if you find it difficult to make a code better than the
Hamming code, or if you find it difficult to find a good
decoder for your code; that's the point of this exercise.]
}
\exercisxA{3}{ex.makecode2error}{
A (7,4) Hamming code
can correct any {\em one\/} error; might there be a (10,4) code
that can correct any two errors? What about a (9,4) code?
{\sf Optional extra:} Does the answer to this question
depend on whether the code is linear or nonlinear?
}
\exercisxA{4}{ex.makecode2}{
Design an error-correcting code, other than
a repetition code, that can
correct any {\em two\/} errors in a block of size $N$.
}
\section{What performance can the best codes achieve?}
There seems to be a trade-off between the decoded bit-error
probability $\pb$ (which we would like to reduce) and the rate $R$ (which
we would like to keep large). How can this trade-off be
characterized?
% Can we do better than repetition codes?
What points in
the $(R,\pb)$ plane are achievable? This question was addressed by
\ind{Shannon} in his pioneering paper of 1948, in which he both created the
field of information theory and solved most of its fundamental
problems.
% in the same paper.
At that time there was a widespread belief that the
boundary between achievable and nonachievable points in the
$(R,\pb)$ plane was a curve passing through the origin $(R,\pb) = (0,0)$;
if this were so, then, in order to achieve a vanishingly small
error probability $\pb$, one would have to reduce the rate
correspondingly close to zero.
% (figure ref here).
% This would seem a reasonable guess,
% in accordance with the general rule that the better something works
% the more you have to pay for it.
%
% ACTION: sanjoy doesn't like This
%
`No pain, no gain.'
However, Shannon proved the remarkable result that\wow\
% , for any given channel,
the boundary
between achievable and nonachievable points meets the $R$
axis at a {\em non-zero\/} value $R=C$, as shown in \figref{fig.pbR.RHS}.
\begin{figure}[htbp]
\figuremargin{%
\begin{center}
\begin{tabular}{cc}
\hspace{-0.2in}\psfig{figure=\codefigs/repshan.1.ps,angle=-90,width=2.6in} &
\pbobject\psfig{figure=\codefigs/repshan.1.l.ps,angle=-90,width=2.6in} \\
\end{tabular}
\end{center}
}{%
\caption[a]{Shannon's noisy-channel coding theorem.
The solid curve shows the Shannon limit
on achievable values of $(R,\pb)$ for
the binary symmetric channel with $f=0.1$.
Rates up to $R=C$ are achievable with arbitrarily small $\pb$.
The points show the performance of some textbook codes,
as in \protect\figref{fig.pbR.RH}.
The equation defining the Shannon limit (the solid curve) is
%\[
$R = \linefrac{C}{(1-H_2(\pb))},$
%\]
where $C$ and $H_2$ are defined in \protect \eqref{eq.capacity}.
}
\label{fig.pbR.RHS}
}
\end{figure}
% see end of this file for method
%
For any channel, there exist codes that make it possible to
communicate with {\em arbitrarily small\/} probability of
error $\pb$ at non-zero rates. The first half of this book (parts I--III) will be
devoted to understanding this remarkable result, which is called
the {\dbf\ind{noisy-channel coding theorem}}.
\subsection{Example: $f=0.1$}% A few details}
The maximum rate at which communication is possible with arbitrarily
small $\pb$ is called the {\dbf\ind{capacity}} of the channel.\index{channel!capacity}
The formula for the capacity of a binary
symmetric channel with noise level $f$ is\index{binary entropy function}
\beq
C(f) = 1 - H_2(f) = 1 - \left[ f \log_2
\frac{1}{f} + (1-f) \log_2 \frac{1}{1-f} \right] ;
\label{eq.capacity}
\eeq
the channel we were discussing earlier with noise level $f=0.1$
has capacity $C \simeq 0.53$. Let us consider what this means in terms
of noisy disc drives. The \ind{repetition code} $\Rthree$ could communicate over this
channel with $\pb=0.03$ at a rate $R = 1/3$. Thus we know how
to build a single gigabyte disc drive with $\pb = 0.03$
from three noisy gigabyte disc drives. We also know how to make
a single gigabyte disc drive
with $\pb \simeq 10^{-15}$ from sixty
noisy one-gigabyte drives \exercisebref{ex.R60}.
And now \ind{Shannon} passes by, notices us
\ind{juggling}
% tinkering
with disc drives and codes and says:
\begin{quotation}
\noindent
`What performance are you trying to achieve?
$10^{-15}$? You don't need {\em sixty\/} disc drives --
you can get that performance with just
{\em two\/} disc drives (since 1/2 is less than $0.53$).
% (The capacity is 0.53, so the number of disc drives needed at
% capacity is 1/0.53.)
% `
And if you want $\pb = 10^{-18}$
% , or $10^{-21}$,
or $10^{-24}$ or anything,
you can get there with two disc drives too!'
\end{quotation}
%\begin{aside}
[Strictly, the above statements might not be quite right, since,
as we shall see, Shannon
proved his
noisy-channel coding theorem
%proves the achievability of ever smaller
% error probabilities at a given rate $R