CSC321 Video Notes
----1a Why do we need machine learning
machine learning approach: learn from large amount of data and produce a program for a specific job, same learning algorithm can be used to produce different variations of that program and afford flexibility etc.
----1b What are neural networks
huge number of synapses allows massively parallel computation => high degree of bandwidth in nets
----1c Some simple models of neurons
linear neurons: y = b + \Sigma x_i w_i
binary threshold neurons: y = (b + \Sigma x_i w_i >= 0) ? 1 : 0
linear threshold neurons (rectified lin neurons): y = (b + \Sigma x_i w_i >= 0) ? (b + \Sigma x_i w_i) : 0
sigmoid neurons
logistic neurons: y = 1 / [1 + e^-(b + \Sigma x_i w_i)]
stochastic binary neurons: y = probability of spike
stochastic linear neurons: y = poisson rate for spikes
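a minimal python sketch of the deterministic models above (function names are mine); the stochastic variants would sample from the logistic output / a poisson process instead of returning a value:

```python
import math

def linear(b, w, x):
    """Linear neuron: y = b + sum_i x_i w_i."""
    return b + sum(xi * wi for xi, wi in zip(x, w))

def binary_threshold(b, w, x):
    """Outputs 1 iff the total input is non-negative."""
    return 1 if linear(b, w, x) >= 0 else 0

def rectified_linear(b, w, x):
    """Linear above the threshold, 0 below."""
    z = linear(b, w, x)
    return z if z >= 0 else 0

def logistic(b, w, x):
    """Smooth, bounded output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-linear(b, w, x)))
```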
----1d A simple example of learning
example of neurons voting for a pixel value == too limited as imposes a rigid template => nets must learn features
----1e Three types of learning
--Supervised learning
> learn to predict an output when given input
training case: input vector and desired output
for regression: output (prediction) is vector of real numbers
for classification: output (prediction) is class label
start by choosing model-class y = f(x, W) where x is an input vector and W is a family of parameters used to map it to an output y. then for each training case adjust the parameters in W so that the discrepancy between the target output t and actual output y shrinks. for regression, the discrepancy is usually 1/2(y-t)^2 -- for classification, discrepancy measures vary
--Reinforcement learning
> learn to select actions to maximize payoff: output is produced by action(s), with occasional scalar reward. Goal = maximize expected sum of future rewards (nb: use a discounting factor to value close rewards more than delayed ones)
hard because difficult to know what went right/wrong in sequence of actions, scalar rewards don't supply much info
--Unsupervised learning
> learn good internal representation of the input for supervised or reinforcement
compact, low dimensional representation (high dim input repr by low dim manifold(s))
principal component analysis: linear (manifold is plane)
----2a Types of neural network architectures
--feed forward NN
input units -> hidden units -> output units
more than 1 layer of hidden units == deep NN
activities in each layer are a non-linear function of the layer below => things that were similar can become dissimilar and vice versa
--recurrent NN
more powerful + biological realism
model sequential data (remembering)
hard to exploit ability of memory, but recent algos have been good
--symmetrically connected network
more restricted than rec NN
easier to study than recurrent
sym connected net without hidden layer == hopfield net
sym connected net with hidden layer(s) == boltzmann machine
----2b Perceptrons: the first generation of neural nets
perceptrons are implementations of the std statistical pattern recognition paradigm
input units -(hand coded weights)> feature units -(learned weights)> decision unit
Minsky and Papert showed in 1969 that perceptrons are limited, but AI took it in a stronger sense: the whole NN approach is doomed to fail.
decision units: binary threshold neuron
to learn bias: treat it as extra input weight with input value set to 1
convergence procedure: if output correct, do nothing; if 0 instead of 1, add input vec to weight vec; otherwise subtract input vec from weight vec
-> guaranteed to find set of weights if it exists (depends on features)
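the convergence procedure above, sketched in python with the bias-as-extra-weight trick (the AND example in the test is mine, not from the lecture):

```python
def train_perceptron(cases, n_features, epochs=100):
    """Perceptron convergence procedure; the bias is weight 0 with
    a constant input of 1 appended to every input vector."""
    w = [0.0] * (n_features + 1)
    for _ in range(epochs):
        for x, t in cases:
            xa = [1.0] + list(x)  # bias trick
            y = 1 if sum(wi * xi for wi, xi in zip(w, xa)) >= 0 else 0
            if y == t:
                continue           # correct: do nothing
            elif t == 1:           # said 0 instead of 1: add input vec
                w = [wi + xi for wi, xi in zip(w, xa)]
            else:                  # said 1 instead of 0: subtract input vec
                w = [wi - xi for wi, xi in zip(w, xa)]
    return w
```

e.g. training on the 4 cases of AND (which is linearly separable) finds a separating weight vector in a handful of epochs.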
----2c A geometrical view of perceptrons
weight space: higher dimensional space
weight vector: vector of weights (point in the space)
input vector: vector of input values
training case: defines a hyperplane through the origin perp to the input vector (removes a degree of freedom => constrains weights to a half-space)
-> when the correct answer is 1, the correct weight vectors lie in the half-space where the scalar product of input vec and weight vec is positive; vice versa when 0
-> with multiple training cases, the correct region is the intersection of all the half-spaces, i.e. a hypercone: the "cone of feasibility"
-> problem is convex (avg of 2 correct weight vecs is correct)
----2d Why the learning works
generously feasible region: the weight vectors that get every training case right by a margin at least as large as the length of that case's input vector
perceptron convergence procedure: every time the weights change, they get closer to every weight vector in the generously feasible region
----2e What perceptrons can't do
must get the features right!
data space: ~inverse of weight space
input vector: vector of input values (point in the space)
weight vector: vector of weights
weight plane: hyperplane perp to weight vector, its distance to the origin is the bias
-> when no such plane exists (possible with as few as 4 training cases), we say that the set of training cases is not linearly separable
----3a Learning the weights of a linear neuron
multilayer neural nets convergence procedure: every time the weights change, the actual output value gets closer to the target output value
# true for non convex problems
# not true for perceptron learning
we do not want to solve analytically (although possible) since we want to generalize to NN with non linear neurons, thus we choose an iterative method
delta rule: increment or decrement the weight vector by the input vector scaled by the residual error and the learning rate
\Delta w_i = \epsilon * x_i * (t - y), where \epsilon is the learning rate and (t - y) the residual error
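the delta rule as code; the training cases in the test are a made-up consistent linear problem (true weights 2 and 3), not the lecture's example:

```python
def train_linear_neuron(cases, n_weights, epsilon=0.05, epochs=2000):
    """Iterative delta rule: delta w_i = epsilon * x_i * (t - y)."""
    w = [0.0] * n_weights
    for _ in range(epochs):
        for x, t in cases:
            y = sum(wi * xi for wi, xi in zip(w, x))   # actual output
            w = [wi + epsilon * xi * (t - y) for wi, xi in zip(w, x)]
    return w
```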
----3b The error surface for a linear neuron
extended weight space: extra vertical dimension for error
-> batch learning does steepest descent
-> online learning goes perp closer to current training case
pb: a lot of zigzagging when training cases are almost parallel
----3c Learning the weights of a logistic output neuron
output: y = 1/(1+e^-z)
logit: z = b + \Sigma w_i * x_i
dy/dz = y(1-y)
dy/dw_i = x_i * y(1-y)
then derive gradient descent rule for training logistic unit
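a quick numerical check of dy/dz = y(1-y) (the finite-difference helper is mine):

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_grad_check(z, h=1e-6):
    """Compare the analytic dy/dz = y(1-y) against a central
    finite difference of the logistic itself."""
    y = logistic(z)
    analytic = y * (1.0 - y)
    numeric = (logistic(z + h) - logistic(z - h)) / (2 * h)
    return analytic, numeric
```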
----3d The backpropagation algorithm
main idea: we don't know what hidden units should do, but we can compute how fast the error changes as we change a hidden activity
computing error derivatives:
define squared error E: (=(1/2)\Sigma(t-y)^2)
dE/dy_j = -(t_j-y_j)
dE/dz_j = y_j(1-y_j)dE/dy_j
then (1) how errdiv changes with weight at that level (i->j)
dE/dw_ij = y_i dE/dz_j
(2) backpropagate to layer (i) below
dE/dy_i = \Sigma w_ij dE/dz_j
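the four equations above, sketched as a python function for one hidden layer of logistic units (the weight-matrix layout is my choice):

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, W2):
    h = [logistic(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    y = [logistic(sum(w * hj for w, hj in zip(row, h))) for row in W2]
    return h, y

def backprop(x, t, W1, W2):
    """Error derivatives for E = 1/2 sum_j (t_j - y_j)^2 in a net of
    logistic units. W1[j][i]: input i -> hidden j; W2[k][j]: hidden j -> output k."""
    h, y = forward(x, W1, W2)
    # dE/dz at the output layer: dE/dy = -(t - y), dy/dz = y(1 - y)
    dz_out = [(yk - tk) * yk * (1 - yk) for yk, tk in zip(y, t)]
    # (1) dE/dw for hidden -> output weights: y_i * dE/dz_j
    dW2 = [[d * hj for hj in h] for d in dz_out]
    # (2) backpropagate: dE/dh_j = sum_k w_kj * dE/dz_k
    dh = [sum(W2[k][j] * dz_out[k] for k in range(len(dz_out)))
          for j in range(len(h))]
    dz_hid = [dhj * hj * (1 - hj) for dhj, hj in zip(dh, h)]
    dW1 = [[d * xi for xi in x] for d in dz_hid]
    return dW1, dW2
```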
----3e How to use the derivatives computed by backpropagation
optimization: how to use the weight derivatives to get good weights
generalization: how to ensure weights will generalize
sampling bias -> accidental regularities -> overfitting
how to reduce:
weight-decay - keep weights small
weight-sharing - many connections share the same weight value
early stopping - validation set
model averaging - aggregate results from multiple nn
bayesian fitting of nn - fancy model averaging
dropout - randomly omit hidden units
generative pre-training - ?
how often: online, full batch, mini batch
how much: fixed vs adaptive learning rate
----4a Learning to predict the next word
e.g. how to learn about relations in a family tree
-> training cases: 3 words (P1 relationship P2)
- predict 3rd from first 2
- predict if 3 words are true (need good src of negative facts)
-> distributed encoding of a person in hidden layer
- features are weights of hidden units
----4b A brief diversion in cognitive science
featural theory: concept is set of semantic features
structural theory: concept is node in relational graph
-> both theories need to be integrated (nn can implement relational graph with vectors of semantic features)
how to 'see' what the answer is? -> even in conscious reasoning we must have a way of 'seeing' which rules to use
how to implement -> not sure but need many to many mapping between concepts and neurons
----4c Another diversion the softmax output function
softmax: implement mutex == force output units to sum up to 1
softmax group: SG
softmax unit: y_i = e^z_i / \Sigma(j in SG) e^z_j
errderiv -> dy_i/dz_i = y_i(1-y_i)
cross-entropy: is the correct cost function for softmax
C = - \Sigma t_j log(y_j)
-> very big gradient because of neg log prob
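softmax and its cross-entropy gradient in python; the max-subtraction is a standard numerical-stability trick, not from the lecture. note how dC/dz_j collapses to y_j - t_j:

```python
import math

def softmax(z):
    m = max(z)                      # subtract max for numerical stability
    e = [math.exp(zi - m) for zi in z]
    s = sum(e)
    return [ei / s for ei in e]

def xent_grad(z, t):
    """For C = -sum_j t_j log y_j with y = softmax(z),
    the gradient simplifies to dC/dz_j = y_j - t_j."""
    y = softmax(z)
    return [yj - tj for yj, tj in zip(y, t)]
```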
----4d neuro-probabilistic language models
trigram method: use frequencies of word triplets to find relative probabilities of words given the previous 2
bengio's neural net:
layers 1->2: index of word: table lookup
layer 2: dist encoding of first 2 words
layer 3: hidden layer predicting 3rd word from dist repr of first 2
output unit: softmax group
----4e Ways to deal with the large number of possible outputs
serial architecture: add dist repr of candidate to input layers
pass forward with all candidate words to compute logit scores
use all logits in a softmax to get word probabilities
use cross-entropy errderivs to raise score for correct word and lower scores for high-scoring rivals
improvement: store words in a binary tree, but still slow
t-sne: multiscale method to produce 2D maps of high-dimensional vectors
----5a Why object recognition is difficult
segmentation: which pieces go together? hidden parts?
lighting: intensity of pixels is colour or lighting?
deformation: objects can be deformed in non affine ways
affordances: classes of objects often track affordances (what you do with them) more than shape
viewpoint -> leads to dimension hopping
----5b Ways to achieve viewpoint invariance
one of the main issues in CV, no acceptable general solution yet
> redundant invariant features
> extract large redundant set of features
> box around object to normalize pixels
> judicious normalization: define box and specify orientation
> brute force normalization: use data set with straight boxes then at recognizing time, try rotating the boxes in different ways to see if it fits
> replicated features with pooling
> hierarchy of parts with explicit poses relative to camera
----5c Convolutional neural networks
-> replicating feature detectors all around the image
backprop can be modified by constraining the update in weights to be the same for copies of the feature detectors
affords: equivariant activities (neural activity is translated when detected features are)
invariant knowledge (regardless of the position during testing, it will recognize the features)
-> pooling (aggregating info from several close by feature detectors)
at each level of the net, correlate (avg, max) of 4 close by feature detectors and send it as output
issue: after several levels of pooling, lost info about position
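a 1D toy version of replicated features + max pooling (real conv nets are 2D; this sketch is mine): shifting the input shifts the feature activities (equivariance), and pooling throws away the exact position:

```python
def conv1d(x, kernel):
    """Replicated feature detector: the same weights are applied
    at every position of the input."""
    k = len(kernel)
    return [sum(kernel[i] * x[p + i] for i in range(k))
            for p in range(len(x) - k + 1)]

def max_pool(a, size=2):
    """Aggregate nearby detector outputs; exact position is lost."""
    return [max(a[i:i + size]) for i in range(0, len(a) - size + 1, size)]
```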
----6a Overview of mini-batch gradient descent
full gradient descent is prone to take a long time when data set is highly redundant
online updating (one case at a time) wastes computation: no efficient matrix-matrix multiplies
> mini-batches are a good compromise (must be random) -> easy to do matrix-matrix multiply
turn down the learning rate at the end of learning to lower the fluctuations of the mini-batch updates
----6b A bag of tricks for mini-batch
GH: black art of nn
> shifting inputs can help: elongate ellipse > circle
-> make the mean of all input vecs equal to 0
recall: logistic can help sweep stuff (big negatives or positives) under the rug
> scaling the inputs: same
> better yet: decorrelate inputs -- principal component analysis
four methods for making learning faster:
> momentum: remember old velocity
> adaptive learning rate: for each param (weight) adjust learning rate wrt to its gradient
> rmsprop: divide learning rate by running avg of previous gradients
mini-batch version of rprop (only the sign)
> use curvature info
----6c The momentum method
stochastic gradient descent is online version of full batch gradient descent (where we follow the approximate gradient, that is one of the single training case (or mini-batch) instead of the full gradient).
momentum method: update rule takes into account accumulated momentum
at time t: \Delta w(t) = momentum * \Delta w(t-1) - gradient(t)
use small momentum (0.5) at beginning because of large gradients
think about what happens at next time step if previous momentum is too highly weighted
use big momentum (0.9) at end of learning
momentum allows us to use bigger learning rates without fear of sloshing to and fro
better type of momentum based on Nesterov: instead of updating weights on the spot; jump first in direction of previous accumulated momentum then make correction (i.e. compute gradient from there)
intuition: better to correct mistake after making it
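both update rules as code on a toy quadratic (function and parameter names are mine; the hyperparameter values are illustrative):

```python
def momentum_step(w, v, grad, lr=0.1, mom=0.9):
    """Classical momentum: v(t) = mom * v(t-1) - lr * grad(w(t-1))."""
    v = [mom * vi - lr * g for vi, g in zip(v, grad(w))]
    w = [wi + vi for wi, vi in zip(w, v)]
    return w, v

def nesterov_step(w, v, grad, lr=0.1, mom=0.9):
    """Nesterov: jump along the accumulated velocity first,
    then correct using the gradient measured there."""
    look = [wi + mom * vi for wi, vi in zip(w, v)]
    v = [mom * vi - lr * g for vi, g in zip(v, grad(look))]
    w = [wi + vi for wi, vi in zip(w, v)]
    return w, v
```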
----6d Adaptive learning rates for each connection
justification:
with small initial weights, gradients differ a lot by layer
fan-in of a unit determines size of overshoot effect (how much input is received)
----6e rmsprop
rprop -> adaptive learning rate depending on sign on gradient and step size
when gradients agree, multiplicatively increase; otherwise multiplicatively decrease
rmsprop -> mini-batch version of rprop (rms = root mean square); divide gradient by running average of its recent magnitude
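a per-weight rmsprop step, sketched on the same kind of toy quadratic (hyperparameter values are illustrative, not from the lecture):

```python
def rmsprop_step(w, ms, grad, lr=0.01, decay=0.9, eps=1e-8):
    """Divide each gradient by a running RMS of its recent magnitudes,
    so the effective step size adapts per weight."""
    g = grad(w)
    ms = [decay * m + (1 - decay) * gi * gi for m, gi in zip(ms, g)]
    w = [wi - lr * gi / (m ** 0.5 + eps) for wi, gi, m in zip(w, g, ms)]
    return w, ms
```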
----7a Modeling sequences
autoregressive models (predict next token from previous) => memoryless
2 standard (stochastic) models
> linear dynamical model: generative models with linear dynamics for hidden states with gaussian noise -> (since linear transform of gaussian is gaussian) we can infer a prob dist over the possible states of its hidden state (given the output so far) that is gaussian [use kalman filtering]
> hidden markov model: discrete hidden states (one-of-N), state transitions are stochastic and controlled by transition matrix -> easy to represent a prob dist over discrete space [use dynamic programming]
--> recurrent neural networks
> distributed hidden state
> non-linear
> turing-complete
> deterministic -- think of hidden state as equivalent to deterministic prob dist over hidden states
----7b Training RNNs with backpropagation
backprop through time: RNN is like layered (layer for each time step) feed forward net with weight constraints (each layer has identical weights)
problem of determining initial state of the system: give random values then backprop through time to learn them
----7c A toy example of training a RNN
relationship with finite state automaton: FSM is in exactly 1 state (corresponds to node) whereas RNN has hidden units in exactly one activity vector (corresponds to pattern of activity)
----7d Why it is difficult to train a RNN
backprop acts linearly -> exploding or vanishing of gradients
good initial weights can help
e.g. 2 attractors, if initial weights close to boundary (small changes affect which attractor will engulf system) -> exploding gradient; if not -> vanishing gradients
4 effective ways to learn a RNN
> Long short-term memory
> hessian free optimization
> echo state networks -- learn only hidden to output weights
> good initialization (cf echo state) with momentum
----7e Long short-term memory
make RNN better at long-term memory
----9a Overview of ways to improve generalization
method 1: get more data
2: use model with right capacity (enough to fit true regularities)
3: model averaging
4: bayesian fitting
method 2: regulate the capacity of the model
> architecture -- limit number of hidden layers and units per layer
> early stopping -- start with small weights and stop before overfitting
> weight decay --
> noise -- add noise
how to determine meta parameters (instead of brute force, which doesn't generalize well)
--> divide total dataset into 3 subsets.
training data > used to learn
validation data > used to set meta parameters (to avoid overfitting them by amalgamating with training)
test data > used solely to verify generalisation performance
->N-fold cross-validation > separate test set from data set, divide remaining in N sets, rotate the validation set and train on others for all sets to obtain N estimates of meta parameters
->start with small weights and stop when performance on validation set goes down.
->weights haven't had time to grow big; small weights => small capacity
->because logistic units with small weights are in their linear range (act linearly)
->early stopping with correct architecture will stop net from using non-linearity to store spurious regularities
----9b Limiting the size of the weights
> weight decay: add penalty (L1 or L2)
"don't have large weights that don't do anything"
->don't use unneeded weights => don't fit sampling error
->also gives a smoother model
> weight constraints
->limit maximum squared length of incoming vector to each unit
->lagrange multipliers
note:
L2 a.k.a. squared == Euclidean
L1 == Manhattan (GH calls it absolute value)
----9c Use noise as a regularizer
adding gaussian noise to input
>variance amplified by squared weight
>will add to squared error
equivalence with L2 weight decay: minimizing the expected squared error adds the term \Sigma w_i^2 \sigma_i^2, which is exactly an L2 penalty
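the equivalence can be seen by expanding the expected squared error for a linear output y = \Sigma_i w_i x_i when independent zero-mean noise \epsilon_i of variance \sigma_i^2 is added to each input:

```latex
y^{noisy} = \sum_i w_i (x_i + \epsilon_i), \qquad \epsilon_i \sim \mathcal{N}(0, \sigma_i^2)

E\!\left[(y^{noisy} - t)^2\right]
  = E\!\left[\Big((y - t) + \sum_i w_i \epsilon_i\Big)^2\right]
  = (y - t)^2 + \sum_i w_i^2 \sigma_i^2
```

the cross terms vanish because the noise is independent and zero-mean; the leftover term is an L2 penalty with per-weight coefficient \sigma_i^2.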
adding gaussian noise to weights
>good for big nn, rnns
adding noise to pattern of activity
>cf nonrigorous unpublished stochastic backprop
----9d Introduction to the Bayesian Approach
Bayesian philosophy 101: there is a prior dist for everything
probability is degree of belief => subjectivism
most rational interpretation of probability
when looking at data, combine prior dist with likelihood term (favours setting that makes data more likely) to get posterior
bayes theorem == bayesian updating rule [updated belief (= posterior) p(a|b) depends on prior p(a) and evidence contribution (normalized by p(b)) p(b|a)/p(b)]
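bayesian updating on a grid (the coin example is a standard illustration, not from the lecture):

```python
def bayes_update(prior, likelihood):
    """Bayes' rule on a grid: posterior proportional to prior * likelihood,
    normalized by the evidence p(b) = sum of the products."""
    post = [p * l for p, l in zip(prior, likelihood)]
    z = sum(post)
    return [p / z for p in post]

# hypothetical example: belief about a coin's head-probability after one head
grid = [i / 10 for i in range(11)]      # candidate values 0.0 .. 1.0
prior = [1.0 / 11] * 11                 # uniform prior
posterior = bayes_update(prior, grid)   # likelihood of "heads" is p itself
```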
----9e The Bayesian interpretation of weight decay
assume prior is gaussian and likelihood/output is gaussian !
supervised maximum likelihood learning
=> (finding weight vec to) minimize squared error is equiv to maximize log prob under gaussian
MAP == maximum a-posteriori
bayesian approach: find full dist over all weight vecs
> impossible when big non-linear net
----10a Why it helps to combine models
bias-variance trade-off
combining nets to reduce variance: the expected squared error of the average of the predictors is lower than the average expected squared error of a randomly chosen predictor, by the variance of the predictors' outputs
this assumes roughly gaussian noise in the outputs
=>want predictors that differ a lot but remain accurate
e.g. bagging == training different models on different subsets of the data (not disjoint)
boosting == train sequence of low capacity models by boosting the weights on training cases that previous models got wrong and caring less about those previous models got right
----10b Mixtures of experts
main idea: different nets are trained to focus on different aspects of data, managing net decides which expert to do the job
good for huge data sets
different from boosting since after learning every net has same weight for each training case
error function that encourages cooperation
> average of all predictors used to train single predictor
error function that encourages specialisation
> train single predictor normally
> but use manager to assign
> error is now sum of squared errors for each expert weighted by prob given by manager
->manager training
manager has softmax output layer
during learning, manager raises prob of models that give error less than average of errors of all experts
better yet, use mixture models:
probability of target under mixture of gaussians
----10c The idea of full Bayesian learning
frequentist vs bayesian (i.e. bayes theorem alludes to proportions over time vs updated degree of belief)
frequentist: when low on data, use simple model (to avoid overfitting) because makes sense
bayesian: the amount of data should not influence our prior beliefs about the complexity of the model
->e.g. if we have reasonable prior over all model variations of our model (that we also think is very relevant) we can average outcomes and have very good predictions
approximate full bayesian learning in a neural net:
construct a grid over parameter space (model should have few parameters)
evaluate every grid point: how well that combination of parameter settings fits the data
assign a posterior prob to every grid point proportional to (how well it fits the data) * (its prior prob)
better than maxlikelihood or MAP when learning with little data
----10d Making full Bayesian learning practical
monte carlo method for approximating full bayesian
> instead of full bayesian posteriors can sample a few weight vectors
> sample weight vecs from weight space with a probability equal to their posterior probabilities
makes full bayesian learning with models with many parameters possible
----10e Dropout
ways to average models
mixture == average output probs for all models
product == multiply them, then take the root (geometric mean) and normalize
=>a small prob from even one model has a big impact
................................watch again and previous
dropout makes units more likely to do something that's marginally useful given what co-workers have achieved
----11a Hopfield Nets
storing memories as pattern of activities
simple kind of memory model. energy-based model
rnn of binary threshold units
each binary conf of the net has energy state
> binary threshold rule causes net to settle in low energy states
quadratic energy function: E = -\Sigma_i s_i b_i - \Sigma_{i<j} s_i s_j w_ij
=>energy gap (for unit i, others fixed) = E(s_i=0) - E(s_i=1) = b_i + \Sigma_j s_j w_ij
update rule: sequentially, randomly set units to the state that agrees with energy gap
-> but that is equivalent to just trusting the binary threshold units' output function
-> just sequentially and randomly apply the output function to each unit's total input
goodness = negative energy
units cannot be updated purely in parallel (energy could go up => oscillations); random parallel updates over time can work
storing memories:
use activities of 1 and -1
storage rule -> increment weight by product of activities
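the update and storage rules as code (0/1 states for settling, +1/-1 activities for storage, as in the lecture; the tiny test net is mine):

```python
import random

def energy_gap(s, w, b, i):
    """E(s_i = 0) - E(s_i = 1) = b_i + sum_j s_j * w_ij."""
    return b[i] + sum(w[i][j] * s[j] for j in range(len(s)) if j != i)

def settle(s, w, b, steps=100, seed=0):
    """Sequentially apply the binary threshold rule to randomly chosen
    units; the quadratic energy never increases."""
    rng = random.Random(seed)
    s = list(s)
    for _ in range(steps):
        i = rng.randrange(len(s))
        s[i] = 1 if energy_gap(s, w, b, i) >= 0 else 0
    return s

def store(patterns, n):
    """Storage rule with +1/-1 activities: increment w_ij by s_i * s_j."""
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j]
    return w
```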
----11b Dealing with spurious minima
storage capacity of hopfield net with N units: 0.15N
spurious minima: merging local minima
memorizing using hopfield's storage rule creates new energy minimum for given binary conf (i.e. memory)
if 2 energy minima thus created have very similar configurations (very close), they will merge into a deeper spurious minimum
-> this limits the storage capacity of hopfield nets
-> can use unlearning (applying the opposite of the storage rule) to remove spurious minima
REM sleep == unlearning
physicists' way to increase storage: use perceptron convergence procedure to train each unit given state of others
statisticians' interpretation: pseudo-likelihood
----11c Hopfield nets with hidden units
different role for hopfield nets: use their capacity to settle to low energy states for interpreting input rather than storing memories
=> have a set of hidden units in rnn
=> badness of representation == energy
projecting 3D to 2D loses 2 degrees of freedom
make net configuration such that it represents how 3D edges correspond to 2D edges perceived in reality
=> the 2 local minima can correspond to the 2 interpretations of the Necker cube
2 computational issues:
> search: avoid getting trapped in local minima
> learning: how do we learn the weights
----11d Using stochastic units to improve search
hopnet: always reduces energy => cannot get out of a local minimum
-> add random noise to cross energy barriers
-> simulated annealing: slowly reducing noise to end up in a deep minimum
physical idea behind it: at high temperature particles cross energy barriers easily but could end up spread across several minima, while low temp will "guarantee" ending up in the absolute minimum but takes very long => start with high temp then slowly decrease
replace bin thresh by bin stochastic
-> raising noise == decreasing gaps
thermal equilibrium at temp 1
=> prob dist settles down over configurations
settles to stationary dist: prob dist over confs with lowest energy
-> each unit will oscillate, what remains constant is proportions of units in given states
----11e How a Boltzmann Machine models data
modeling binary data: learning representation of good=likely binary vecs, can be used to detect unusual=bad states of binary systems
causal model to generate data: 1. pick hidden states from their prior
2. pick visible states from their conditional dist given hidden state
thus the prob of a visible vec v is p(v) = \Sigma_h p(h) * p(v|h): sum over all hidden states of the prior times the conditional
Boltzmann machine: energy-based generative model. not causal!
-> probabilities of joint configurations of the hidden and visible vectors are proportional to e^(-E) of those configurations
alternatively: prob of a joint conf is prob of finding net after having updated all stochastic binary units repeatedly
=> goodness (-E) of joint config (v,h) = \Sigma_i v_i b_i + \Sigma_k h_k b_k + \Sigma_{i<j} v_i v_j w_ij + \Sigma_{k<l} h_k h_l w_kl + \Sigma_{i,k} v_i h_k w_ik (bias terms plus vis-vis, hid-hid and vis-hid pairwise terms)
the prob of a joint config is only proportional to e^(-E); to get equality, normalize by the partition function (sum over all possible configurations - exponential in the number of units)
hence can't compute the partition function with more than a few units, thus use MCMC to get samples from the model
----12a Boltzmann machine learning algorithm
unsupervised learning -> maximize product of probs assigned to bin vecs in training set
equivalent to: maximizing sum of log probabilities
letting the net settle to its stationary dist and sample visible vecs, N times
pb: it looks like every weight needs to know about all the others
solution: turns out that is not necessary; everything a weight needs to know is in the difference of two correlations:
=> the change in w_ij is proportional to the expected product of the activities of the 2 neurons at thermal equilibrium with the visible vecs clamped, minus the same expectation with nothing clamped
-> very similar to unlearning in hopnets.
positive statistics == data clamped
negative statistics == unclamped -> used for unlearning
inefficient way to get the data (why Boltzmann machines were slow in practice and thought to be useless):
positive stats -> clamp a visible vec and randomly update hidden units until thermal equilibrium at temp 1 is reached; sample <s_i s_j>; repeat for all vecs in the data set and average
negative stats -> set all units to random states and reach thermal eq; sample expected products of states; repeat many times and average
----12b Learning the statistics
skip
----12c Restricted Boltzmann machines
RBM: one layer of hidden units, no connections between them. no connection between visible units
inference and learning are much easier: probs are indep and can be computed in parallel
-> positive phase: clamp vis vec, compute exact expected val of pairs of vis-hid; average over mini-batch
negative phase: have a set of "fantasy particles" -> update them with alternating parallel updates then average
contrastive divergence: surprising shortcut: by using only one step of reconstruction instead of running to equilibrium, the learning still works
limitations: misses regions of data space that the model likes but that are far from any data
-> so use CD1 then increase to CD3 etc over time
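a CD-1 weight update sketched in python (biases omitted and names mine; a real RBM would also learn biases and average over mini-batches):

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cd1_update(v0, W, rng, lr=0.1):
    """One contrastive-divergence (CD-1) step for an RBM with weight
    matrix W[i][j] (visible i -> hidden j). Negative stats come from a
    single reconstruction step instead of a chain run to equilibrium."""
    nv, nh = len(v0), len(W[0])
    # positive phase: hidden probs and samples given the data
    ph0 = [sigmoid(sum(v0[i] * W[i][j] for i in range(nv))) for j in range(nh)]
    h0 = [1 if rng.random() < p else 0 for p in ph0]
    # one-step reconstruction of the visibles, then hidden probs again
    v1 = [1 if rng.random() < sigmoid(sum(h0[j] * W[i][j] for j in range(nh)))
          else 0 for i in range(nv)]
    ph1 = [sigmoid(sum(v1[i] * W[i][j] for i in range(nv))) for j in range(nh)]
    # delta w_ij = lr * (<v_i h_j>_data - <v_i h_j>_reconstruction)
    for i in range(nv):
        for j in range(nh):
            W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
    return W
```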
----12d Example of RBM learning
learn from data (increase weights), reconstruct data, unlearn from reconstruction (decrease weights)
-> every feature detector eventually converges to modeling local features
----12e RBMs for collaborative filtering
collaborative filtering: netflix competition
language model approach <=> matrix multiplication
-> alternative model == RBM for each user
----13a Ups and downs of backprop
ml = stats + AI
-> stats: low dimensional data, simple structure => simple model -> SVM, GP
-> ai: high dimensional data, complex structure => complex model -> backprop
----13b Belief nets
graphical models = graph theory + probability theory
RBM = undirected graphical models
belief net = directed acyclic graph with stochastic hidden causes and visible effects
inference problem: infer prob dist over states of unobserved vars
learning problem: adjust variables to make the net more likely to generate the data
2 kinds of generative neural nets with stochastic binary neurons:
-> energy-based: symmetric connections between them -> Boltzmann machine
-> causal: sigmoid belief net (belief net with stoch bin neurons)
----13c Learning with sigmoid belief nets
learning rule: \Delta w_ji = \epsilon * s_j * (s_i - p_i), where j is a parent of i and p_i is the probability that the binary state of i is 1 given the states of its parents
stochastic gradient descent is stochastic in this context bc of sampling from the posterior
explaining away: 2 hidden causes are indep in prior but become dependent when event is observed
radford neal -> approximate posterior with monte carlo
----13d The wake sleep algorithm
variational learning
assume hidden units are independent in sigmoid belief net (like restricted boltzmann machine) even though it is not true
=> so we have factorial dist over probs of hidden units
=> this brings down degrees of freedom from 2^N-1 to N
wake sleep :: generative model
wake:
forward pass: use recognition weights
=> stochastic binary states (independently determined for each unit)
treat those as a true sample from posterior distribution and do max likelihood learning
use maximum likelihood learning to train generative weights
sleep:
drive system downwards to generate unbiased sample of the data
train recognition weights to reconstruct activities in each layer given data
complete opposite of wake phase
limitations:
1. wasteful because recognition weights trained on parts of space where no data
2. rec weights do not follow real gradient -> incorrect mode-averaging
3. posterior dist is incorrectly assumed to have indep r.v.s
----14a Learning layers of feature
stacking RBMs to make a deep belief net:
-> ...
averaging factorial distributions does not give a factorial distribution
contrastive wake sleep to fine tune (after using transpose of weights learned) for better generation
----14b Discriminative fine-tuning for DBN
...
permutation invariant mnist task: applying same pixel permutation to all images in the data set will not change performance of net. false for convolutional nets.
generative pre-training + discriminative fine-tuning => amazing!
----14c What happens during discriminative fine-tuning
skip
----14d Modeling real-valued data with an RBM
skip
----15a Principal component analysis to autoencoders
PCA: project onto the directions of greatest variance and drop the rest. 2d->1d example: take the line in the direction of greatest variance, project all points onto that line; we lose the info in the direction of least variance.
linear autoencoders == pca == modelling data lying on a linear manifold; extension with nonlinear neurons -> deep autoencoders == data lying on a nonlinear manifold
----15b Deep autoencoders
really cool stuff with deep autoencoders
----15c Deep autoencoders for doc retrieval
zoned out at the end
basically encode documents from vectors of non-stop word (not 'the', etc.) counts and look at nearby docs in same encoding space (greedily pre-train layers with rbm learning then fine tune with backprop)
----15d Semantic hashing
use logistic units for code; but they will tend to use the middle part of the logistic func to store as much data as possible; add gaussian noise to force them to make more drastic decisions (-> almost binary, at the extremes of the logistic)
-> alex krizhevsky: wtf, just use some stoch bin units
semantic hashing: use code as mem address (hash function is autoencoder); look up similar docs by flipping bits (GH: supermarket search)
----15e Learning binary codes for image retrieval
2-stage method: use 28-bit to get list of images; then 256 bit to filter
pretty cool stuff to bypass pixel intensity
----15f Shallow autoencoders for pre-training
contrastive divergence to train rbm is approximation of max likelihood
denoising autoencoder: add noise to input vector (like dropout for inputs)
contractive autoencoder: make hidden layer insensitive to input -- penalize squared gradient
pre-training is cool because features can be discovered without using labels
GH vs Google -> having a lot of data doesn't mean you should forget about regularization and pre-training on small deep nets; start using much bigger deep nets like my brain. Google: ok, will 20 million dollars do for this neural net that makes neural nets; comes with a free propeller cap to use as a cooling device.
----random notes (stuff said during inverted classroom lectures)
Early 50s-60s => linear filters used linear neurons
binary stochastic neurons are really nice
white matter is elastic to optimize brain space (connecting components)
in neural network research, tweaking is necessary, which makes it hard to infer the impact of individual parameters
step by step processing is good: layers ==> modularity
lower layer is extracting featural structural info
upper layer random ad-hocness
"the meaning of life is literal recognition"
most important rule in ML: USE BIGGER TRAINING SET
system that gets 90% right can train a system that gets 99% of it right
stochastic noise: motivation
multiple layers of adaptive weights
coup of minsky: proving mathematically limitations of neural nets to get $
"no set of weights that will give right answer on all training cases: it will rattle around"
-> but the average can be useful
example:
A (+20)-> B (+20)-> C (+20)-> A
| | |
-10 -10 -10
A=1, B=0, C=0 => A,B,C will cycle a 1
A=.001, B=0, C=0 => A,B,C will die out
95' book on neural nets provides good mathematical foundation
catastrophic unlearning: mess up old weights by learning new info on top of them
SVM == perceptron++
loss function = error function = residual error
GH: after playing around with NNs for a bit, we understand w=5 as "big", w=8 "very big"
Tijmen: adapting using evolutionary approaches -> GH: hmm, certainly interesting but making things soft will make learning easier
GH harbours resentment towards CV ppl cause they told him for 30 years nn was not how to do obj rec
neuron drawing tricks: line == linear, curved line == logistic
Tijmen: Laptop-completeness ^_^