The Julia Language Challenge


I've been using the Julia language for around 5 years now, and I've grown rather fond of it. By now, I can't really imagine how to do the things I'm doing in any other language. This could, of course, be a misconception and the result of not knowing any other language well enough - a form of the Dunning–Kruger effect. But my impression is that Julia is the best language for writing high-performance numeric libraries for machine learning, data science, solvers, optimization, and so on.

Because of that, I get into a lot of discussions that go, "Why Julia? We always just use a scripting language and write the fast parts in C/C++," and all sorts of variations of that argument. I usually reply that this approach ends up requiring a lot of effort, and that mixing a low-level language with a scripting language doesn't actually compose very well! One ends up with a hard-to-maintain C/C++ library, and a scripting language that can't be used for all tasks, since its slowness becomes a bottleneck - or the C/C++ library doesn't work with the high-level constructs you've learned to love. Even Google has acknowledged this, and is trying to rewrite TensorFlow in a single language, namely Swift!

But how close to the truth am I, really? How big is the difference? And how fast and composable can an implementation become in modern C++?

Obviously, I can't just sit down and learn how to write the best and most elegant code in another language - it would take years until I reached the level of my current Julia skills. This is where the Julia challenge comes into play!

I put together a reference implementation for a problem that nicely illustrates the fundamental principles which make Julia so productive and scalable for numeric libraries. It's the foundation that allows one to freely combine packages and still get optimal performance. (If you're curious about such packages, have a look at this article: Some State of the Art Packages in Julia 1.0.)

I can't really imagine writing those in any other language, so I dare you to teach me! Use Python + Numba, Python + C, Swift, Cython, Go, LuaJIT, Rust, D - surprise us! If all works out, this can be a great learning experience for everyone following the challenge! :)

  • There are 3 categories: developer effort, performance and extensibility.
  • A successful implementation needs to shine in all 3 categories.
  • You shouldn't need to write a large helper library to produce concise code. This rule can be unfair, since Julia has a lot of functionality for array algorithms built in, but that is also part of the point. To not exclude any new languages, I'm open to discussing the details of this rule.

A common problem is that you have a bunch of arrays and scalars, and want to apply a function element-wise over the arrays. This is an extension of the regular map function:

In many programming languages, map is the name of a higher-order function that applies a given function to each element of a list, returning a list of results in the same order. It is often called apply-to-all when considered in functional form. 

It's a fundamental operation for any array library, and is used heavily in machine learning and many other areas of computer science. If you were ever in the situation of needing to implement a NumPy-like library, this would be one of your first challenges.
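To make the distinction concrete, here's what this looks like in plain Julia (standard functions, nothing from the challenge implementation yet):

# map requires collections of matching shape and applies f element-wise:
map(+, [1, 2, 3], [10, 20, 30]) # == [11, 22, 33]

# broadcast extends map to mixed scalars and arrays of different shapes,
# virtually expanding scalars and singleton dimensions as needed:
broadcast(+, [1, 2, 3], 10)        # == [11, 12, 13]
broadcast(+, [1 2; 3 4], [10, 20]) # == [11 12; 23 24]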

It's such a basic operation that Julia has its own syntax for it. You can apply any function element-wise by just putting a dot in front of it:

x = fill(0.0, (4, 4))
y = [1, 2, 3, 4]
z = 2
user_defined(x) = x + 1
println(user_defined.(x) .+ y .+ z)

A = 1:4
b = A'
println(A .+ b)

x = ComplexF64[1im, 2im, 3im]
println(getfield.(x, :im))

What makes it difficult to implement is that you can really throw anything at it! It works for n arguments and n-dimensional arrays with arbitrary user types and functions - and it pretty much needs to have optimal performance in all cases.

What I haven't even mentioned yet is that Julia takes multiple broadcast calls, e.g. user_defined.(x) .+ y .+ z, and fuses them into one loop without any temporary allocations. This mechanism is customizable, and one can influence how to fuse the call tree to allow domain specific optimizations. This is possible because Julia's implementation allows you to overload it, and passes you a lazy representation of the broadcast expression which you're free to modify appropriately.
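To make this concrete, here's a small sketch using Base's own broadcast machinery (this is standard Julia API, independent of the toy implementation below): dot syntax lowers to Broadcast.broadcasted calls, which build exactly such a lazy, inspectable call tree, and Broadcast.materialize fuses it into one loop.

using Base.Broadcast: broadcasted, materialize

x = [1.0, 2.0, 3.0]
# `sin.(x) .+ 1.0` lowers to nested `broadcasted` calls, producing a lazy,
# inspectable Broadcast.Broadcasted call tree instead of temporary arrays:
bc = broadcasted(+, broadcasted(sin, x), 1.0)
# `materialize` then fuses the whole tree into a single loop:
println(materialize(bc) == sin.(x) .+ 1.0) # true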

So, the spec for our little toy broadcast implementation is the following:

  • needs to work for n arguments
  • needs to work for any combinations of scalars and n-dimensional arrays
  • inside the implementation, one needs to have access to the whole call tree and it should be possible to rewrite that tree or flatten it into one loop
  • user defined functions and types need to be inlined and SIMD acceleration should be used whenever possible

The snippet below implements a (quite) fully featured lazy, n-dimensional, n-argument broadcast. You can actually play around with the implementation by creating a Nextjournal account with the signup code juliacon, and then clicking edit in the right corner of the article!

import Base: getindex, iterate, axes, eachindex, tail

struct LazyBroadcast{F, Args}
    f::F
    args::Args
end

br_getindex(scalar, I) = scalar # scalars don't need to be indexed

function br_getindex(A::AbstractArray, I)
    # drop the index in dimensions of size 1, to broadcast over singleton dims
    idx = ntuple(i-> ifelse(size(A, i) === 1, 1, I[i]), Val(ndims(A)))
    return A[CartesianIndex(idx)]
end

function br_getindex(x::LazyBroadcast, I)
    return x.f(getindex_arg(x.args, I)...)
end

getindex_arg(args::Tuple{}, I) = ()

function getindex_arg(args::NTuple{N, Any}, I) where N
    return (br_getindex(args[1], I), getindex_arg(tail(args), I)...)
end

getindex(x::LazyBroadcast, I) = br_getindex(x, Tuple(I))

function materialize!(out::AbstractArray, call_tree::LazyBroadcast)
    # flatten the call tree into one SIMD-accelerated, n-dimensional loop
    @simd for i in CartesianIndices(axes(out))
        @inbounds out[i] = call_tree[i]
    end
    return out
end

br_construct(x) = x

function br_construct(x::Expr)
    x.args .= br_construct.(x.args)
    if Meta.isexpr(x, :call) # replace every function call with a LazyBroadcast node
        x = :(LazyBroadcast($(x.args[1]), ($(x.args[2:end]...),)))
    end
    x
end

# turns an expression like `a + b - sin(c)` into a LazyBroadcast call tree
macro broadcast(call_expr)
    esc(br_construct(call_expr))
end

A first prototype took me half an hour to write, and a further 1.5 hours to optimize all cases to match Julia's SIMD-optimized broadcast implementation in terms of performance. This should be the baseline for truly winning this challenge.

  • Code must not be much longer than the reference implementation while still being readable. Exemptions apply for languages with tiny standard libraries.
  • One shouldn't need to spend a lot of time to implement this, so that it can be easily rewritten or improved.
  • materialize! needs to have access to the full call tree, and should be able to rewrite the tree without losing any performance.
  • It should be possible to construct the lazy broadcast object as close to the actual call syntax as possible, as in @broadcast a + b - sin(c) (sketched below).
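For illustration, this is the call tree that @broadcast a + b - sin(c) builds with the reference implementation above (a sketch - you can verify it yourself with @macroexpand):

# `a + b - sin(c)` parses as `(a + b) - sin(c)`, so the macro expands to:
LazyBroadcast(-, (LazyBroadcast(+, (a, b)), LazyBroadcast(sin, (c,))))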

To win the performance part, your implementation needs to be at least as fast as Julia's base broadcast implementation for arbitrary dimensions and argument combinations. That means it needs to use SIMD acceleration and compile away all those high level, lazy indexing constructs. It also needs to work with user defined types and functions that are out of the control of your implementation!

using BenchmarkTools
reference(out, a, b, c) = (out .= a .+ b .- sin.(c))

a = rand(1000, 1000);
b = rand(1000);
c = 1.0
out1 = similar(a);
out2 = similar(a);
br = @broadcast a + b - sin(c)
@btime materialize!($out1, $br)
@btime reference($out2, $a, $b, $c)
@assert out1 == out2
nothing

Now, for the out-of-control part, it's mandatory to use an existing package without changing it. This can also be emulated by putting the code into an isolated module that doesn't use any constructs defined in our implementation. This part is very important, since it's what lets Julia packages talk to each other and combine already existing code just like Lego. So it's important to use a normal, user-defined type/object/class in the language that you target!

pkg"add GeometryTypes" using GeometryTypes const Point3 = Point{3, Float32} module LibraryB super_custom_func(a, b) = sqrt(sum(a .* b))
end using .LibraryB: super_custom_func 
using BenchmarkTools a = rand(Point3, 10^6)
b = rand(Point3, 10^6)
out = fill(0f0, 10^6) $out .= super_custom_func.($a, $b)
br = super_custom_func(a, b) materialize!($out, $br)
nothing

The inverse of the above: now it's time for our type to be consumed by other functions. For simplicity, we just use Julia's Base sum function, which should already get us all the SIMD accelerated speed we wish for. For it to work, we just need to implement the iterator protocol.
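As a quick refresher, Julia's iteration protocol consists of just two methods: iterate(itr) starts iteration and iterate(itr, state) continues it, both returning an (element, state) tuple, or nothing once the iterator is exhausted. A hypothetical my_collect (not part of the challenge code) shows the shape of that contract:

# hypothetical helper, just to illustrate the protocol any iterator must satisfy
function my_collect(itr)
    out = Any[]
    next = iterate(itr)            # -> (element, state), or nothing if empty
    while next !== nothing
        element, state = next
        push!(out, element)
        next = iterate(itr, state) # advance using the explicit state
    end
    return out
end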

Implementing the iterator protocol for our lazy broadcast is a bit more challenging, since now we actually need to figure out its shape so we can iterate over it (we got around that before by using an output array).

# pick the axes of the argument with the most dimensions
biggest(a, b, c, rest...) = biggest(biggest(a, b), biggest(c, rest...))
biggest(a::NTuple{N1, Any}, b::NTuple{N2, Any}) where {N1, N2} = ifelse(N1 > N2, a, b)
biggest(a) = a

# flatten the call tree into a tuple of all leaf arguments
flatten_args(t::LazyBroadcast, rest...) = (flatten_args(t.args...)..., flatten_args(rest...)...)
flatten_args(t::Any, rest...) = (t, flatten_args(rest...)...)
flatten_args() = ()

axes(br::LazyBroadcast) = biggest(map(axes, flatten_args(br))...)
eachindex(br::LazyBroadcast) = CartesianIndices(axes(br))

With that, we can overload Julia's iteration protocol:

iterate(br::LazyBroadcast) = iterate(br, (eachindex(br),))

function iterate(bc::LazyBroadcast, s)
    istate = iterate(s...)
    istate === nothing && return nothing
    i, newstate = istate
    return (bc[i], (s[1], newstate))
end

This now enables us to feed a LazyBroadcast into any function that accepts iterators! Important for winning the challenge: the function we use needs to be out of our control, and should ideally be part of some standard library:

@btime sum($br)
@btime sum($out .= super_custom_func.($a, $b)) # reference via built-in broadcast
@assert sum(br) ≈ sum(super_custom_func.(a, b))
nothing

Now let's write a simple multi-threaded materialize!, just because we can :)

using Statistics
function threaded_materialize!(out::AbstractArray, x::LazyBroadcast)
    Threads.@threads for i in CartesianIndices(axes(out))
        out[i] = x[i]
    end
    return out
end

a = @benchmark threaded_materialize!($out, $br)
b = @benchmark materialize!($out, $br)
hyper_threads = Sys.CPU_THREADS
cores = hyper_threads / 2 # assume 2 hardware threads per physical core
println("best scaling ", minimum(b).time / minimum(a).time / cores)
println("average scaling ", mean(b).time / mean(a).time / cores)
nothing

The next bonus point could be to write a simple GPU-accelerated materialize!. To not bloat this article any further, I'll leave that for another challenge.

This is it - if you get through this just fine with your language, I'd be delighted to see your code! If you don't, but have some interesting fragments going, please publish them as well!

Results are accepted as a pull request (PR) to: https://github.com/SimonDanisch/julia-challenge. The GitHub repository can also be used to turn this into a community effort - so just make a PR with an incomplete solution and continue polishing it with others!

Good luck and happy coding!