First dive into Julia
I'm currently sitting in on a tutorial by Bogumił Kamiński on the basics of Julia and parallel processing in it. This is actually my first dive into Julia, so I thought I would write it down on the go here!
Let's jump right into it:
R = randn(10000, 10000)
Okay, that did what you'd expect. There's apparently a help statement that works the other way around from IPython's: the question mark goes before the name instead of after it:
?randn
What about global help?
?
And a leading semicolon drops you into shell mode:
;ls
We can use @time to benchmark this randn command:
@time randn(10000, 10000);
For comparison:
In [3]: %timeit np.random.normal(size=(10000, 10000))
3.8 s ± 25.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
And the presenter's R attempt took 5.81 s using system.time. Whew, this is pretty fast.
To start with, we're analyzing a function found on StackOverflow that sums the upper triangular part of a matrix, i.e. the entries above the diagonal (apparently, the code is pretty bad):
function upsum(M)
    n = size(M)[1]
    sum = 0
    for i = 1:n-1
        for j = i+1:n
            sum = sum + M[i,j]
        end
    end
    return sum
end
upsum(R)
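As an aside, I think one reason this code is "pretty bad" is that sum = 0 starts the accumulator as an Int, so it switches type to Float64 on the first addition (the other reason being the row-wise access pattern, which we'll fix below). A sketch of how one might check this with the built-in @code_warntype macro:

@code_warntype upsum(R)  # the accumulator should show up as Union{Float64, Int64}, a type instability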
Let's check the performance:
%timeit upsum(R);
... That errors, of course; %timeit is IPython syntax, not Julia. All right, I'm starting to like this cheeky little language. Trying again with the proper macro:
@time upsum(R);
To compare that with Python (np.tril keeps the lower triangle rather than the upper, and materializes a full copy, but as a ballpark comparison it will do):
In [8]: %timeit np.sum(np.tril(R))
245 ms ± 45.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Well, that was faster, but we can improve the Julia code. Let's first look at the built-in sum function:
sum
@time sum(R);
@time sum(R);
Okay, now this is badass. Julia is just-in-time compiled: the first @time includes compiling sum for this argument type, while the second call reuses the cached native code. It's as if Numba came out of Python and became its own language. Apparently there are ways of avoiding the first-call overhead, but that's a more advanced topic.
Note that the compiled-function cache is cleared whenever Julia restarts!
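Presumably one such way is the BenchmarkTools package: its @btime macro runs the expression many times and reports a statistic that isn't polluted by the one-off compilation cost. A quick sketch, assuming the package is installed:

using BenchmarkTools  # assumed installed, e.g. via Pkg
@btime sum($R);  # the $ interpolates R so that global-variable lookup isn't measured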
To compare with Python:
In [9]: %timeit np.sum(R)
53 ms ± 3.67 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Not too shabby for the ol' snake!
Let's try to improve the function, though:
function uppersum(M)
    n = size(M, 1)
    s = zero(eltype(M))  # a zero hard-typed to the element type of the matrix
                         # (eltype stands for ELement TYPE) - this is now fully generic
    for i in 2:n  # Julia uses column-major storage, so it's faster to traverse any matrix column-wise
        @simd for j in 1:(i-1)  # @simd promises the iterations are independent, so Julia may vectorize them
            @inbounds s += M[j, i]  # skip bounds checking and just access memory C-style
        end
    end
    s  # the last expression is the return value
end
uppersum(R)
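To convince myself of the column-major point in isolation, here's a minimal sketch (the function names are mine) comparing the two traversal orders over the same matrix:

function sum_colwise(M)
    s = zero(eltype(M))
    for col in 1:size(M, 2), row in 1:size(M, 1)
        s += M[row, col]  # the inner loop walks down one column: contiguous memory
    end
    s
end

function sum_rowwise(M)
    s = zero(eltype(M))
    for row in 1:size(M, 1), col in 1:size(M, 2)
        s += M[row, col]  # the inner loop strides across columns: cache-unfriendly
    end
    s
end

sum_colwise(R); sum_rowwise(R)  # warm-up calls, so the timings below exclude compilation
@time sum_colwise(R);
@time sum_rowwise(R);

If my understanding is right, the column-wise version should come out clearly ahead, since each inner iteration touches the next element in memory.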
We can look at these @simd and @inbounds annotations:
?@simd
?@inbounds
Right! Let's time this implementation:
@time uppersum(R);
function uppersum_boundcheck(M)
    n = size(M, 1)
    s = zero(eltype(M))
    for i in 2:n
        @simd for j in 1:(i-1)
            s += M[j, i]  # same as uppersum, but with bounds checking left on (no @inbounds)
        end
    end
    s  # this is sufficient for a `return` statement
end
uppersum_boundcheck(R)
Let's see what kind of gain we got from dropping bounds checking, by timing the version that keeps it:
@time uppersum_boundcheck(R);
Interestingly, Julia apparently uses LLVM under the hood.
Going parallel
The idea is: we have a triangle and we want to split it into pieces of equal "mass", i.e. the same number of entries for each thread. This is done in the code below, via an instruction that is relatively magical to me right now.
For threading to work, note that, as described in the docs, you need to set the environment variable JULIA_NUM_THREADS before starting Julia. Putting export JULIA_NUM_THREADS=4 in .bashrc worked fine for me.
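A quick sanity check that the setting took effect (as far as I know, the thread count can't be changed from inside a running session):

Threads.nthreads()  # should return 4 if JULIA_NUM_THREADS=4 was picked up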
using Base.Threads
function upsum_threads(M)
    n = size(M, 1)
    chunks = nthreads()
    sums = zeros(eltype(M), chunks)  # a separate subsum for each thread
    # split the jobs so that each thread has approx. the same number of numbers to add
    chunkend = round.(Int, n * sqrt.((1:chunks) ./ chunks))
    @assert minimum(diff(chunkend)) > 0
    chunkstart = [2; chunkend[1:end-1] .+ 1]
    @threads for job in 1:chunks  # tell Julia that this loop is safe for threading
        s = zero(eltype(M))
        for i in chunkstart[job]:chunkend[job]
            @simd for j in 1:(i-1)
                @inbounds s += M[j, i]
            end
        end
        sums[job] = s
    end
    return sum(sums)
end
upsum_threads(R)
@time upsum_threads(R);
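It also seemed worth sanity-checking that the threaded result matches the serial one; I'd compare with ≈ (isapprox) rather than ==, since threading reorders the floating-point additions:

upsum_threads(R) ≈ uppersum(R)  # should be true up to rounding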
Okay, now this is faster than NumPy. I'm reasonably impressed, but still confused about that one magical line. Let's dig into it.
chunks = 4
n = 10000
round.(Int, n*sqrt.((1:chunks) ./ chunks))
Huh. So those are the column indices where each chunk ends: the first c columns of the triangle hold roughly c^2/2 entries, so ending chunk k at n*sqrt(k/chunks) gives every chunk about the same number of entries. The syntax still puzzles me, though. Digging deeper:
(1:chunks)
Okay, this is a range...
(1:chunks) ./ chunks
And this is where it starts to hit me, as the presenter introduces the collect command:
collect((1:chunks) ./ chunks)
OOOOOOOOOOOOOH. So ./ is a lazy operator! In other words, if you do this:
sqrt((1:chunks) ./ chunks)
This errors because you're calling sqrt on the range itself, but if you instead broadcast with a dot:
sqrt.((1:chunks) ./ chunks)
Chains of broadcasting operations "materialize" only once - skipping plenty of unnecessary overhead.
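A minimal sketch of my understanding here (the exact output types are my reading of the docs, so treat them as an assumption):

typeof((1:4) ./ 4)       # a StepRangeLen: still a lazy range, no Array allocated
x = collect((1:4) ./ 4)  # materialize it into a Vector explicitly
y = sqrt.(x) .+ x .^ 2   # chained dotted ops fuse into a single loop and a single output array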
This is badass and I have to admit that I'm rather hyped up for Julia now!