Currently working at Google in Mountain View, California.
My email: firstname.lastname@example.org. (I couldn't resist registering the domain.)
If you are looking for the CIFAR-10 and CIFAR-100 datasets, click here.
Code (very outdated stuff)
Here's some CUDA/C++ code that I wrote. The descriptions here are rather skimpy, so email me if you need help getting any of it to run.
For GTX 580-class GPUs (compute capability 2.0 and above):
Abstract convolutional neural network for CUDA 4.0 (Google Code project link) -- a C++/CUDA implementation of neural networks (with a Python front-end) using the back-propagation algorithm. Capable of modeling any directed acyclic graph of layers. Supports convolutional, pooling, fully-connected, locally-connected, and other layer types.
Includes a new implementation of convolution in CUDA which is several times faster than the 2D convolution routines posted below. Visit the project page for more information.
Note: This code has been subsumed by the convolutional neural network code above, which includes a faster version of this code with more features (such as sparse filter-channel connectivity).
A fast implementation of local, non-convolutional filters. These can be used to build an autoencoder, RBM, etc., with locally-connected, non-shared filters. The stride between filters can be arbitrary, but the catch is that the routines are only efficient if there are multiple filters per image location. That is to say, if a filter is to be applied at location (x,y) on the image, it's better to apply 16 or 32 filters there rather than just one. So instead of applying one filter at (x,y), another at (x+1,y), etc., consider applying 32 at (x,y), 32 at (x+stride,y), and so on, for some stride of your choosing.
The routines support color images with as many channels as you like. This means they'll also work for NORB, if you interpret stereo as two channels. You can also use them to build deep, locally-connected nets, in which the set of filter outputs at a particular image location at layer L can be interpreted as the set of channels in the input image to layer L+1.
Using this code, here's what an RBM with roughly 10000 11x11x2 local hidden units learned in just 10 minutes on NORB. And here's what another with 9216 11x11 local hidden units learned on the tiny images in under 10 minutes. No data preprocessing in either case.
Now used by AccelerEyes (makers of Jacket, a Matlab GPU framework).
For GTX 280-class GPUs (compute capability 1.3):
Note: this code was written for GTX 280-class GPUs. Parts of it may not work on current-generation GPUs.
2D convolution routines for CUDA 2.1-2.3 (Updated May 28, 2010) -- this is my implementation of 2D convolution in CUDA, good for convolving multiple filters with multiple images. It supports greyscale as well as color filters/images. It works for all image and filter sizes, but some size combinations are faster than others; the readme in the archive describes the cases in which it's very fast and the cases in which it's not. This code also includes:
Convolution-esque routines that are needed for computing gradients in convolutional neural nets (and RBMs) as well as sampling the visibles in convolutional RBMs.
Subsampling and supersampling routines, which are useful for building multi-layer convolutional nets.
Routines for dealing with non-overlapping sub-squares of an image (see readme for full description). One of the things they enable you to do easily is local softmax.
Routines for sampling from a bunch of multinomial distributions. These can be used to sample from the above-mentioned local softmaxes.
Convolutional neural network for CUDA 2.1-2.2 -- this is a simple convolutional neural net with one layer of convolution. It is specialized to the case of 32x32 color images and 8x8 color filters. It does a decent job of classifying the images in the CIFAR-10 dataset. It runs roughly 140x faster on a GTX 280 than a C implementation does on a 2.4 GHz Intel Core 2 CPU. But I suspect that if Intel implemented convolution in their MKL, the speedup wouldn't be quite as great.