Deep Learning from Scratch to GPU - 6 - CUDA and OpenCL


There is only one thing we have to do to make this code completely general: use the general constructors from the core namespace instead of the convenience methods from the native namespace. In this case, that means ge (general matrix) instead of dge (double general native matrix), and vctr instead of dv (double native vector). The only difference is that the general constructors require an engine-specific factory as their first argument. We modify the fully-connected constructor to accept such a factory as an argument.

(defn fully-connected [factory activ-fn in-dim out-dim]
  (let-release [w (ge factory out-dim in-dim)
                bias (vctr factory out-dim)]
    (->FullyConnectedInference w bias activ-fn)))
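To make the contrast between the two flavors of constructors concrete, here is a minimal sketch of standalone REPL expressions, not part of the network code; only the explicit factory argument differs.

;; Convenience constructors from the native namespace: always double-precision,
;; always on the CPU.
(dge 4 2)              ;; 4x2 double general matrix in main memory
(dv 4)                 ;; double vector of dimension 4

;; General constructors from the core namespace: the factory argument decides
;; the engine and the precision.
(ge native-double 4 2) ;; equivalent to (dge 4 2)
(vctr native-double 4) ;; equivalent to (dv 4)
;; (in real code, these would be wrapped in with-release or let-release)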

Now, we repeat the example of running the network with native-double. That is the same factory that the dge and dv methods use, available in the native namespace. We could use native-float in its place for single-precision floating-point computations on the CPU, one of the GPU factories, a factory coded by a third party, or even the same code provided by Neanderthal, configured in a different way.

(with-release [x (ge native-double 2 2 [0.3 0.9 0.3 0.9])
               ones (vctr native-double 1 1)
               layer-1 (fully-connected native-double tanh! 2 4)
               a-1 (ge native-double 4 2)
               layer-2 (fully-connected native-double sigmoid! 4 1)
               a-2 (ge native-double 1 2)]
  (transfer! [0.3 0.1 0.9 0.0 0.6 2.0 3.7 1.0] (weights layer-1))
  (transfer! [0.7 0.2 1.1 2] (bias layer-1))
  (transfer! [0.75 0.15 0.22 0.33] (weights layer-2))
  (transfer! [0.3] (bias layer-2))
  (transfer (layer-2 (layer-1 x ones a-1) ones a-2)))
nil
#RealGEMatrix[double, mxn:1x2, layout:column, offset:0]
   ▥       ↓       ↓       ┓
   →       0.44    0.44
   ┗                       ┛

I modified the result display of this example a bit. Instead of doing a println as in the previous articles, I transfer the resulting matrix to main memory. I do this for convenience, since this blog post and its results are automatically generated from live code, and also to teach a few patterns in this type of coding.

Don't forget that, in this example, I have used with-release for all bindings, even the output a-2. I do this because the code should support both the CPU and the GPU. On the CPU, releasing the data is of great help, but it is optional in a REPL session, since the memory eventually gets released by the JVM (with a few caveats, since the JVM might not do it as soon as you'd hope). On the GPU, however, the JVM cannot do anything; a GPU buffer that is not released explicitly is not released at all until we release the whole context. Therefore, the habit I recommend is to always take care of this and release all vectors, matrices, and other structures as soon as possible.
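As a side note on the same habit, notice that the fully-connected constructor above uses let-release, while the example uses with-release. A minimal sketch of the difference (the bodies are just placeholders):

;; with-release always releases its bindings once the body finishes,
;; so it suits code that consumes w locally and returns something else.
(with-release [w (ge native-double 4 2)]
  (transfer w))

;; let-release releases its bindings only if the body throws an exception;
;; on success the structures stay alive, so they can be returned to the
;; caller, which is exactly what the fully-connected constructor needs.
(let-release [w (ge native-double 4 2)]
  w)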

However, we'd like to see the result in the REPL. But how, if the data stored in the returned result (a-2) is released just a moment before it needs to be printed? Here, the transfer method copies the data from wherever it resides (main memory or GPU memory) to an equivalent object in main memory.
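A minimal sketch of that pattern, with arbitrary values: the matrix created inside with-release is gone once the block exits, but the copy produced by transfer lives in main memory and stays valid.

(with-release [a (ge native-double 1 2 [0.44 0.44])]
  (transfer a))
;; => a new main-memory matrix with the same contents, safe to print even
;;    though a itself has already been released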