%Proof Complexity Scribe Notes template

\ifx\CompleteCourse\relax
\ClassScribeSetupA
\else
\documentclass[11pt]{article}
\newtheorem{theorem}{Theorem}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{homework}{Homework}
\newenvironment{definition}{\begin{trivlist}\item[]{\bf Definition}\ }%
  {\end{trivlist}}
\newenvironment{fact}{\begin{trivlist}\item[]{\bf Fact}\ }%
  {\end{trivlist}}
\newenvironment{example}{\begin{trivlist}\item[]{\bf Example}\ }%
  {\end{trivlist}}
\newenvironment{proof}{\begin{trivlist}\item[]{\bf Proof}\ }%
  {\end{trivlist}}
                                % Make the page large
\addtolength{\textwidth}{1.50in}
\addtolength{\textheight}{1.00in}
\addtolength{\evensidemargin}{-0.75in}
\addtolength{\oddsidemargin}{-0.75in}
\addtolength{\topmargin}{-.50in}

                                % \vdashsub{X}  makes a turnstyle with subscript "X"
                                % \vdashsup{X}  makes a turnstyle with superscript "X"

\newdimen\srbdimenA
\newcommand{\vdashsupsub}[2]{ \mathop{
    \setbox251 = \hbox{$\scriptstyle #1$}
    \setbox252 = \hbox{$\scriptstyle #2$}
    \ifdim \wd251<\wd252 \srbdimenA = \wd252 \else \srbdimenA = \wd251 \fi
    \setbox255 = \hbox {${\srbAvdash \vphantom( \kern -\srbdimenA \kern +.05em}
      ^{\hbox to\srbdimenA{\hfill \box251\hfill}}
      _{\hbox to\srbdimenA{\hfill \box252\hfill}}$}
    \box255 \kern .05em}}
\newcommand{\srbAvdash}{\hbox{ \vrule height1.4ex width0.02em
    \dimen255 = \srbdimenA
    \advance\dimen255 by 0.1em
    \vbox{\hrule width\srbdimenA height0.02em
      \kern .65ex  }}}
\newcommand{\vdashsup}[1]{\vdashsupsub{{#1}}{\mbox{~}}}
\newcommand{\vdashsub}[1]{\vdashsupsub{\mbox{~}}{#1}}
\fi

                                %   FOR THE SCRIBE: CUSTOMIZE THE ENTRIES BELOW:
                                %   Fill in the following information particular to these scribe notes:

\def\scribeone{Danny Heap}    % Who is the scribe?
\def\classdate{15 September 2005} % Date of the class
\def\classnumber{1}     % Is this the first, second, ...?

                                % Here are some commands that stay the same for the whole class.

\def\classinstructor{Toniann Pitassi}
\def\classtitle{Machine Learning Theory}
\def\doctitle{\textup{CS 2416 - Machine Learning Theory}}
\def\classid{\textup{Lecture \#\classnumber: \classdate}}

                                %   Put your macros for these scribe notes HERE
                                %  It is best to use as few as possible.
                                %  environments for "theorem", "corollary", "lemma", "fact" "definition"
                                %   "homework", "proof", "example", "proposition"
                                %   are already defined above.

                                % Start the document

\ifx\CompleteCourse\relax
\ClassScribeSetupB
\else
\def\makeatletter{\catcode`\@=11\relax}
\def\makeatother{\catcode`\@=12\relax}
\makeatletter
\def\ps@scribeheadings{\let\@mkboth\@gobbletwo
  \def\@oddhead{\sl\doctitle \hfill \classid
  }\def\@oddfoot{\hfil \rm \thepage \hfil}\def\@evenhead{\@oddhead}%
  \def\@evenfoot{\@oddfoot}\def\sectionmark##1{}\def\subsectionmark##1{}}
\makeatother
\pagestyle{scribeheadings}
\begin{document}
\bibliographystyle{siam}
\fi

\begin{center}
  \Large\bf\doctitle\\[1em]
  \Large\bf\classid\\[1em]
  {\large\bf Lecturer: \classinstructor}\\[.5em]
  {\large\bf Scribe Notes by: \scribeone}
\end{center}

\vspace*{.4in}

                                % HERE IS WHERE YOUR SCRIBE NOTES SHOULD START
                                %  DELETE ALL OF ROB'S TEXT AND ENTER YOUR OWN.


Machine Learning tackles real-world problems for which (we suspect)
no explicit algorithm exists, and tries to develop algorithms for them.
For example: distinguishing chairs from tables.

In Machine Learning Theory (MLT) we develop models for machine learning and
evaluate ML algorithms with respect to those models.  Benefits of MLT
include identifying intractable problems (which need to be
re-specified to have any hope of being tractable) and suggesting
approaches to developing new ML algorithms.

Here are some ML models we'll consider:

\begin{itemize}
\item Consistency model
\item On-line model
\item Occam
\item PAC
\end{itemize}

Some concepts we'll look at:

\begin{itemize}
\item VC-dimension (a combinatorial measure of the richness of a
  concept class that gives an idea of the necessary sample size for ML).
\item boosting (use an existing ML algorithm with error $\epsilon$ to
  build a better ML algorithm with error whose reciprocal is
  exponential in $1/\epsilon$).
\item hard-core sets and distributions, from complexity theory.
\end{itemize}

We're heading towards the PAC model.  But first, we'll look at some
other models in (chronological?) order.  Assume that we're restricted
to searching for boolean-valued functions on $\{0,1\}^n$.  We're given
access to instance examples that may include a label from $\{-,+\}$
indicating whether they are negative or positive instances.  Our
instances may be provided in a batch, after which we're expected to
make predictions about the labelling of subsequent instances (or
output a hypothesis function that produces such a label), or our
predictions may be required after each instance is supplied, in a
sequence.

\section*{Consistency model}

\begin{definition}
  A {\it concept} $c$ is a boolean function over $\{0,1\}^n$.
  The domain variables are denoted by $x_1,\ldots, x_n$.
  A {\it labelled example} is a pair $\langle \alpha, \beta\rangle$ where
  $\alpha\in \{0,1\}^n$ and $\beta \in \{-,+\}$ indicates whether
  $\alpha$ is a negative or positive example.
  A {\it concept class} ${\cal C}$ is a set of concepts over $x_1, \ldots,
  x_n$, \textbf{usually} with an associated representation.  For example,
  ${\cal C}$ could be the set of all CNF formulas over $x_1, \ldots, x_n$.
\end{definition}

\begin{definition}
  Algorithm $A$ learns ${\cal C}$ in the Consistency Model if,
  given any set $S$ of labelled examples, $A$ produces a concept
  $c\in {\cal C}$ consistent with $S$ if such a concept exists, and
  outputs ``there is no consistent concept'' otherwise.
  Furthermore, the runtime of $A$ should be polynomial in the
  size of $S$.
\end{definition}
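
\noindent To make ``consistent'' concrete (a sketch in notation introduced
here, writing $c(\alpha)\in\{0,1\}$ for the value of $c$ on $\alpha$; the
definition above leaves this implicit), one natural reading is that $c$ is
consistent with $S$ when
\[
  c(\alpha) = 1 \iff \beta = +
  \qquad \mbox{for every } \langle \alpha, \beta\rangle \in S .
\]
For instance, with $n=2$ the concept $x_1$ is consistent with
$\{\langle 10, +\rangle, \langle 01, -\rangle\}$ but not with
$\{\langle 10, +\rangle, \langle 11, -\rangle\}$, since it evaluates to 1 on
the negative example $11$.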

  Note that in the above definition we
  require that $c\in {\cal C}$; otherwise simply taking the disjunction
  of all the positive examples in $S$ would suffice (and would be ugly,
  slow, and of no predictive value).  If we don't care about run-time,
  devising $A$ is simple: just try every concept in ${\cal C}$.
  For example, if ${\cal C}$ is the set of all monotone conjunctions,
  then $A$ would simply try all $2^n$ possible conjunctions of
  variables in $\{x_1, \ldots, x_n\}$.  This is why we also
  require that $A$ run in time polynomial in
  the size of $S$ (each example in $S$ requires $n+1$ bits).
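
  As a rough illustration of the gap (the symbol $m$ for the number of
  examples in $S$ is introduced here and is not from the lecture): the
  input occupies $m(n+1)$ bits, while exhaustive search over monotone
  conjunctions may have to check up to $2^n$ candidates against all $m$
  examples,
  \[
    |S| = m(n+1) \mbox{ bits}, \qquad
    \mbox{candidate checks} \le 2^n \cdot m .
  \]
  With $n=8$ and $m=5$ this is $45$ bits of input against at most
  $256\cdot 5 = 1280$ checks, which is harmless.  With $n=100$ and the same
  $m$, the input grows only to $505$ bits while the number of candidates
  grows to $2^{100}$.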
                                %  We
                                %  might further learning polynomially-many samples in polynomial
                                %  time.\marginpar{what does this last complexity bound mean?}

\begin{itemize}
\item[1.] Example 1: monotone conjunctions over $x_1, \ldots, x_n$, so
  $|{\cal C}| = 2^n$.  Suppose $n=8$, the target concept (which
  distinguishes the positive and negative labels) is $x_3 x_4x_7$, and
  you're given labelled pairs:
  \begin{eqnarray*}
    && \langle 10100110, -\rangle \\
    && \langle 11111010, +\rangle \\
    && \langle 00110010, +\rangle \\
    && \langle 11011111, -\rangle \\
    && \langle 11001100, -\rangle
  \end{eqnarray*}
  Algorithm $A$ simply takes the conjunction of all the $x_i$ where
  all positive examples have a 1 in position $i$, and verifies that
  this conjunction is falsified by all the negative examples
  (otherwise output ``no consistent concept'').
  Note that any variable that does not appear in the conjunction
  had to be eliminated because it was set to zero by
  some positive example. Thus if our conjunction is
  not consistent with the negative examples, then no conjunction will be.

\item[2.] Example 2: monotone disjunctions over $x_1, \ldots, x_n$.
  Suppose the target concept $c= x_3\vee x_4\vee x_7$, and you're
  given labelled pairs:
  \begin{eqnarray*}
    && \langle 01000100, -\rangle \\
    && \langle 11001100, -\rangle \\
    && \langle 00001101, -\rangle \\
    && \langle 10101001, +\rangle \\
    && \langle 01011100, +\rangle
  \end{eqnarray*}
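  Algorithm $A$ simply takes the disjunction of all the $x_i$ where
  all negative examples have a zero in position $i$, and verifies that
  this disjunction satisfies all the positive examples (otherwise
  output ``no consistent concept'').
  Any variable left out of the disjunction is set to one by some
  negative example, so it cannot appear in any consistent disjunction.
  Thus if our disjunction fails to satisfy some positive example,
  then no monotone disjunction consistent with the negative examples
  will satisfy it either.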

\item[3.] Example 3: non-monotone conjunctions/disjunctions. These are
  taken over the literals $\{x_1, \overline{x_1}, \ldots, x_n,
  \overline{x_n}\}$. We can perform a reduction to the monotone case.
  Define $y_i = \overline{x_i}$ and extend each example with the values
  of $y_1, \ldots, y_n$. Under this transformation, the
  target function is a monotone conjunction (respectively, disjunction)
  over the space of $2n$ variables $x_1, \ldots, x_n$ and
  $y_1, \ldots, y_n$, so Example~1 (respectively, Example~2) applies.

\item[4.] Example 4: $k$-DNF formulas, for a constant $k$. These are
  disjunctions of terms, each term being the conjunction of at most $k$
  literals.  Again we can reduce this to the case of monotone
  disjunctions as follows.
  For each possible term of size at most $k$, introduce a new variable
  representing this term, and use Example~2.

\item[5.] Example 5: 2-term DNF. For example, $c= x_1\overline{x_3}x_5
  \vee x_1\overline{x_4}x_5\overline{x_3}$.  This class \textbf{cannot} be
  learned in polytime in the consistency model unless $P= NP$.  This
  (effective) intractability is due to the requirement that $A$ find a
  hypothesis that is itself a 2-term DNF. See the book for a proof of
  intractability.

\item[6.] Example 6: DNF formula. Find a consistent DNF concept, with no
  restriction on the number of terms. This is trivially learnable in
  the consistency model, since we can take the disjunction of terms,
  one term for each positive example. This example raises a red flag,
  since we are simply memorizing the data. We don't expect that the
  hypothesis returned will be very similar to the true DNF that the
  examples came from. We will see later that for true learning
  (being able to predict well on future examples), there needs to
  be compression. That is, the hypothesis returned should be of size
  that is sublinear in the size of the samples, $S$.

\end{itemize}
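
\noindent As a check on Examples~1 and~2, both procedures can be traced by
hand on the data given there (a worked sketch; the names $h_1$ and $h_2$
for the resulting hypotheses are introduced here). In Example~1, the
bitwise AND of the positive examples is
\[
  11111010 \wedge 00110010 = 00110010 ,
\]
so $h_1 = x_3x_4x_7$. Each negative example falsifies $h_1$: $10100110$ has
$x_4 = 0$, while $11011111$ and $11001100$ both have $x_3=0$. In
Example~2, the bitwise OR of the negative examples is
\[
  01000100 \vee 11001100 \vee 00001101 = 11001101 ,
\]
whose zero positions are $3$, $4$ and $7$, so $h_2 = x_3\vee x_4\vee x_7$.
Each positive example satisfies $h_2$: $10101001$ has $x_3=1$ and
$01011100$ has $x_4=1$. On this data both procedures happen to recover the
target exactly; in general they are only guaranteed to return some
consistent concept.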

\bigskip
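
\noindent The reductions in Examples~3 and~4 can also be made concrete (a
sketch; the order in which the new variables are appended is an arbitrary
choice made here, not fixed by the lecture). For Example~3, appending
$y_i = \overline{x_i}$ doubles the length of each example, e.g.
\[
  10100110 \;\mapsto\; 10100110\,01011001 ,
\]
and a conjunction such as $x_1\overline{x_2}$ becomes the monotone
conjunction $x_1 y_2$. For Example~4, the number of new variables is the
number of terms with at most $k$ literals drawn from the $2n$ available,
\[
  \sum_{i=0}^{k} {2n \choose i} \le (2n+1)^k ,
\]
which is polynomial in $n$ for constant $k$. For $n=8$ and $k=2$ the sum is
$1 + 16 + 120 = 137$.

\bigskip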

\noindent{\bf Some problems with the Consistency Model:}

\begin{itemize}
\item DNF learner is just memorizing the input (we'd like the output
  concept to be substantially smaller than the input).
\item Insisting on the hypothesis found by $A$ being in ${\cal C}$ seems
  a bit extreme.
\item If the examples aren't completely consistent with any $c\in
  \cal{C}$ there's no way to output something ``close.'' 
\end{itemize}
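
\noindent To see the memorization concretely (a worked sketch reusing the
data of Example~1), the trivial DNF learner outputs one full-length term
per positive example:
\[
  h = x_1x_2x_3x_4x_5\overline{x_6}x_7\overline{x_8}
      \;\vee\;
      \overline{x_1}\,\overline{x_2}x_3x_4\overline{x_5}\,\overline{x_6}x_7\overline{x_8} .
\]
This $h$ is positive on exactly the two inputs $11111010$ and $00110010$,
so it is consistent with the sample of Example~1. However, it uses
$2n = 16$ literals where the target $x_3x_4x_7$ uses three, and it labels
every unseen input negative.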


\section*{Online/Mistake Bound (MB) model}
\label{mistakeBound}

As we saw above, the consistency model has the problem that
a positive result doesn't always imply finding a rule with
good predictive power. The Mistake Bound model addresses
this problem by explicitly modelling learning as an on-line
process.

Predictions are interleaved with examples (as opposed to learning from
a batch of examples before making a prediction).  
Learning takes place in stages. 
Each stage consists of:
\begin{enumerate}
\item $A$ (the learner) is provided with an unlabelled example.
\item $A$ makes a prediction about the label (equivalently, produces a
  hypothesis $h\in {\cal C}$ whose value on the example is the prediction).
\item $A$ is told the correct label (the one produced by the
  target concept $c\in {\cal C}$).
\end{enumerate}

\begin{definition}
  Let ${\cal C}$ be a concept class over $x_1, \ldots, x_n$ together
  with an associated representation.
  $A$ learns ${\cal C}$ over $x_1, \ldots, x_n$ in the MB model if,
  for any sequence of examples consistent with some $c\in {\cal C}$, the
  total number of mistakes made by $A$ when
  run on this sequence of examples is polynomial in $n$ and $s$, where
  $s$ is the description size of $c$.
  (The extra parameter $s$ is necessary
  since the description of a concept $c$ may be exponentially long in
  $n$.)  $A$ learns
  ${\cal C}$ in polytime if it completes each stage/step in time polynomial
  in $n$ and $s$.
\end{definition}
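
\noindent In symbols (a sketch; the names $h_t$, $\alpha_t$ and $p$ are
notation introduced here, not from the lecture), if $\alpha_t$ is the
example presented at stage $t$ and $h_t$ is the learner's hypothesis at
that stage, the requirement is that for some fixed polynomial $p$,
\[
  \left| \{\, t : h_t(\alpha_t) \neq c(\alpha_t) \,\} \right| \le p(n,s)
\]
for every target $c\in {\cal C}$ and every sequence
$\alpha_1, \alpha_2, \ldots$ of examples. In particular the bound does not
depend on the length of the sequence.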

\bigskip

\noindent {\bf Example 1}: non-monotone conjunctions: The learner starts with
$h= x_1 \overline{x_1}x_2\overline{x_2} \cdots x_n\overline{x_n}$.
For each mistake on a positive example, remove all unsatisfied
literals from $h$ (there'll be $n$ of them removed on the first
mistake, and at least 1 removed on each subsequent mistake).  If $h$
makes a mistake on a negative example, output ``no consistent
concept.'' 
The number of mistakes on positive examples is at most
$n+1$: the first mistake removes $n$ of the $2n$ literals,
and each subsequent mistake removes at least one of the
$n$ literals that remain.
Further, there will be no mistakes on the negative examples
as long as the examples are consistent with some conjunction.
This follows because we only threw out literals
when we absolutely had to (to be consistent with the positive examples),
so if {\it any} conjunction is consistent with the negative examples as well,
then our current conjunction will be.
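
\bigskip

\noindent As an illustration, here is a hand trace of this learner (a
sketch; the target $c = x_1\overline{x_3}$ with $n=3$ and the sequence of
examples are made up here and are not from the lecture). The learner
starts with
$h = x_1\overline{x_1}x_2\overline{x_2}x_3\overline{x_3}$, which predicts
$-$ on every input.

\begin{center}
\begin{tabular}{clllll}
stage & example & label & prediction & mistake? & hypothesis afterwards \\
\hline
1 & $110$ & $+$ & $-$ & yes & $x_1x_2\overline{x_3}$ \\
2 & $011$ & $-$ & $-$ & no  & $x_1x_2\overline{x_3}$ \\
3 & $100$ & $+$ & $-$ & yes & $x_1\overline{x_3}$ \\
4 & $101$ & $-$ & $-$ & no  & $x_1\overline{x_3}$
\end{tabular}
\end{center}

\noindent The first mistake removes the $n=3$ literals
$\overline{x_1}$, $\overline{x_2}$ and $x_3$, and the second removes $x_2$.
The learner makes $2$ mistakes in total, within the bound $n+1 = 4$, and
never errs on a negative example.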



\end{document}



