Neanderthal vs ND4J - vol 4 - Fast Vector Broadcasting in Java, CPU and CUDA

But why stop here? We can make this naive implementation less naive, and show that with a good tool in your toolbox and a little knowledge of textbook math you can work wonders.

What caught my eye in the previous measurements is that cases where the matrix is "flat" require a lot of loop iterations, each calling a native operation and wasting time by not giving it enough work to do. What I would prefer is, of course, to call fewer, more demanding operations, preferably just one :)

Take a look at the diagram in the ND4J user guide that explains what broadcasting does, and recall a bit of basic linear algebra. If you imagine the vector being broadcast as a \(1 \times{} n\) matrix, and multiply an \(m \times{} 1\) matrix of ones by it, you may recognize that the resulting \(m \times{} n\) product places the elements exactly where they need to end up.
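To make the product concrete, here is a small worked case with \(m = 2\) and \(n = 3\). The symbols \(\mathbf{1}_2\) (the column of ones) and \(x_1, x_2, x_3\) (the elements of the vector being broadcast) are just names chosen for this illustration:

\[
\mathbf{1}_2 \, x =
\begin{bmatrix} 1 \\ 1 \end{bmatrix}
\begin{bmatrix} x_1 & x_2 & x_3 \end{bmatrix} =
\begin{bmatrix} x_1 & x_2 & x_3 \\ x_1 & x_2 & x_3 \end{bmatrix}
\]

Every row of the result is a copy of the vector, so adding this matrix to any \(2 \times{} 3\) matrix is the same as adding the vector to each of its rows.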

This might sound a bit messy. Whaaat? I'd have to convert the vector to a matrix and whatnot? Although that would be easy in Neanderthal, and it would be hidden inside a function, there is something even simpler. Yep, whatever I end up needing for these kinds of tasks, Neanderthal turns out to already have a convenient function for!

In this case, it is the rk! function, which multiplies each combination of elements of two vectors (one treated as a column, the other as a row, that is, their outer product) and adds the resulting matrix element by element to another matrix.

Here is a less naive broadcasting implementation in Neanderthal. This time, a two-liner:

(defn less-naive-add-row-vector! [a x]
  (with-release [y (entry! (raw (col a 0)) 1)]
    (rk! y x a)))

(let [a (fge 2 3 [1 2 3 4 5 6])
      x (entry! (fv 3) 1)]
  (less-naive-add-row-vector! a x))
#RealGEMatrix[float, mxn:2x3, layout:column, offset:0]
   ▥       ↓       ↓       ↓       ┓
   →       2.00    4.00    6.00
   →       3.00    5.00    7.00
   ┗                               ┛
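The printed result can be checked by hand. As a sketch, write \(y\) for the column of ones and \(x\) for the row vector (also all ones in this example), and assume the scaling factor of the rank-1 update is left at its default of 1, so that the call computes \(a \leftarrow y x^T + a\):

\[
\begin{bmatrix} 1 \\ 1 \end{bmatrix}
\begin{bmatrix} 1 & 1 & 1 \end{bmatrix} +
\begin{bmatrix} 1 & 3 & 5 \\ 2 & 4 & 6 \end{bmatrix} =
\begin{bmatrix} 2 & 4 & 6 \\ 3 & 5 & 7 \end{bmatrix}
\]

The matrix uses column layout, so the source values 1 through 6 fill it column by column, which is why its rows read 1 3 5 and 2 4 6.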

It works correctly; what about the performance?