2D Local Filter Routines
    for CUDA 3.2 and devices of compute capability >= 2.0
Alex Krizhevsky (akrizhevsky@gmail.com)
Last updated Apr 2, 2011
============================================

============================================
Change log:
============================================
Apr 2, 2011:
--------------------------------------------
- Added parameter to specify the order that the filter output should take.
  The default was always module-filter-image, but filter-module-image
  is useful for making deep, locally-connected nets. So now it's an option.

============================================
Compiling:
============================================
- Have the following environment variable set (change as necessary to fit actual SDK path):
    CUDA_SDK_PATH=/home/spoon/NVIDIA_GPU_Computing_SDK
- Compile the code like this (on a Linux machine):
    make exec=1
- Run it like this:
    ./local_units
- The output simply tests the three CUDA routines against their CPU equivalents (the CPU ones are by no means optimized; they're just there to test for correctness).

============================================
Description:
============================================
This code contains three CUDA routines:

1. computeFilterActs: this routine computes filter activities for locally-connected square filters on a square image. The filters are laid out on a grid over the image, with an arbitrary stride size between filters.
2. computeWeightActs: this routine can be used to compute the gradient with respect to the filter weights, given the filter activity gradients in the layer above as well as the source images. 
3. computeImgActs: this routine can be used to compute the gradients with respect to the source images, given the filter weights and the gradients with respect to the filter activities.

The catch is that the routines are only efficient if there are multiple filters per image location. That is to say, if a filter is to be applied at location (x,y) on the image, it's better to apply 16 or 32 filters there rather than just one. So instead of applying one filter at (x,y), another at (x+1,y), etc., consider applying 32 at (x,y), 32 at (x+stride,y), for some stride of your choosing.

The best way to understand exactly what the routines do may be to read the file src/localCPU.cpp, which contains CPU equivalents of the CUDA routines. Each CPU equivalent is only about 20 lines of code.

Together these three routines can be used to build an autoencoder or RBM (or whatever) with local receptive (and/or projective) fields and no weight-sharing.

The routines support color images with as many color channels as you like. This means they'll also work for NORB, if you interpret stereo as two color channels. You can also use them to build deep, locally-connected nets, in which the set of filter outputs at a particular image location at layer L can be interpreted as the set of "color channels" in the input image to layer L+1.

============================================
How to call the routines:
============================================
The file src/test.cu demonstrates how to call each of the routines.

The basic logic is this:
- Decide on the number of color channels, image size, filter size, and filter stride size you want.
- Given these, compute the minimal number of filter applications that will cover the entire image.
- The above computation will probably overshoot the size of the image. That is to say, if you apply the first filter at (0,0), the last filter on row 0 may fall off the far edge of the image. To avoid this asymmetry, the routines take a parameter called paddingStart. This should be a non-positive integer indicating where the first filter should be applied. So, for example, setting this value to -3 is just like padding the top and left edges of every image with a 3-pixel border of zeros. One sensible setting of paddingStart is half the amount by which the above computation overshot the image size. This corresponds to setting an equal padding of zeros around all edges of the image (or almost equal, if the above computation produced an odd number). You don't need to worry about the routines accessing improper memory locations because of overshoot -- any location beyond the size of the image will be taken as zero.
- Call the routines with these parameters.

============================================
Other notes:
============================================
- In the code, the word "module" refers to a group of filters applied at a particular image location.
- The code depends on a (provided) matrix library called NVMatrix. For the purposes of this code, this library doesn't provide much functionality other than wrapping the device data pointer and matrix dimensions into a tidy struct. So you can easily replace it with a matrix library (or whatever) of your own choosing.
- If your images have only one color channel, you will probably see a slight speedup if you delete the parts of the code
that deal with color channels entirely. This is because the current version of Nvidia's compiler (3.2) mysteriously does not unroll loops that have only one iteration.
