I’m currently sitting in on a tutorial on the basics of Julia and parallel processing therein, given by Bogumił Kamiński. This is actually my first dive into Julia, so I thought I would write it down on the go here!
Generate a normally-distributed random number of type T with mean 0 and standard deviation 1. Optionally generate an array of normally-distributed random numbers. The Base module currently provides an implementation for the types Float16, Float32, and Float64 (the default), and their Complex counterparts. When the type argument is complex, the values are drawn from the circularly symmetric complex normal distribution.
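That block is the help entry for randn. Just to get a feel for the signatures it describes, a few throwaway calls of my own (these are not from the tutorial):

randn()                    # a single Float64 draw
randn(Float32, 3)          # a length-3 Vector{Float32}
randn(ComplexF64, 2, 2)    # a 2×2 matrix from the circularly symmetric complex normal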
Welcome to Julia 1.1.0. The full manual is available at
https://docs.julialang.org/
as well as many great tutorials and learning resources:
https://julialang.org/learning/
For help on a specific function or macro, type ? followed by its name, e.g. ?cos, or ?@time, and press enter. Type ; to enter shell mode, ] to enter package mode.
In [3]: %timeit np.random.normal(size=(10000, 10000))
3.8 s ± 25.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
And the presenter’s R attempt took 5.81 seconds, measured with system.time. Wew, this is pretty fast.
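I didn’t catch the exact line that built the Julia test matrix R used below, but judging from the NumPy call above it was presumably just a 10000×10000 standard-normal matrix, i.e. something like:

R = randn(10000, 10000);
@time randn(10000, 10000);   # time the generation itself, to compare with the NumPy run above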
To start with, we’re analyzing a function found on StackOverflow that sums the upper-triangular part (above the diagonal) of a matrix (apparently, the code is pretty bad):
function upsum(M)
    n = size(M)[1]
    sum = 0
    for i = 1:n-1
        for j = i+1:n
            sum = sum + M[i, j]
        end
    end
    return sum
end

upsum(R)
-7802.649783031088
Let’s check the performance:
%timeit upsum(R);
The analogue of IPython’s %time statement (also %timeit) in Julia is @time statement. The analogue of %%time ...code... is
@time begin
...code...
end
Note, however, that you should put all performance-critical code into a function, avoiding global variables, before doing performance measurements in Julia; see the performance tips in the Julia manual.
The @time macro prints the timing results, and returns the value of evaluating the expression. To instead return the time (in seconds), use @elapsed statement.
For more extensive benchmarking tools, including the ability to collect statistics from multiple runs, see the BenchmarkTools package.
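A tiny sketch of my own pulling those three together (the BenchmarkTools part assumes you have added the package first):

@time sum(rand(10^6));         # prints time + allocations, returns the value
t = @elapsed sum(rand(10^6))   # returns the elapsed time in seconds instead

# using Pkg; Pkg.add("BenchmarkTools")
using BenchmarkTools
x = rand(10^6)
@btime sum($x);                # many samples, reports the minimum; $ interpolates the global into the benchmark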
… All right, I’m starting to like this cheeky little language. Trying again:
@time upsum(R);
0.464638 seconds (5 allocations: 176 bytes)
To compare that with Python:
In [8]: %timeit np.sum(np.tril(R))
245 ms ± 45.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Well, that was faster, but we can improve the Julia code. Let’s first look at the inbuilt sum function:
sum
sum (generic function with 13 methods)
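Those “13 methods” are multiple dispatch at work; two quick asides of my own for poking at them (@which lives in InteractiveUtils, which the REPL and IJulia load for you):

methods(sum)     # list every method signature defined for sum
@which sum(R)    # show which of those methods this particular call dispatches to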
@time sum(R);
0.090729 seconds (89.03 k allocations: 4.748 MiB, 10.32% gc time)
@time sum(R);
0.100551 seconds (5 allocations: 176 bytes)
Okay, now this is badass. Julia is just-in-time compiled: the first call to sum compiled a specialized method (hence the extra allocations and the GC time), and the second call just ran the already-compiled code. It’s as if Numba came out of Python and became its own language. Apparently there are ways of avoiding the first-call overhead, but that’s a more advanced topic.
Note, though, that the compiled-function cache is cleared whenever Julia restarts!
To compare with Python:
In [9]: %timeit np.sum(R)
53 ms ± 3.67 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Not too shabby for the ol’ snake!
Let’s try to improve the function, though:
function uppersum(M)
    n = size(M, 1)
    s = zero(eltype(M))  # a zero hard-typed as the same type as the entries of the matrix; eltype stands for ELement TYPE - this is now fully generic
    for i in 2:n  # Julia uses column-major storage order - faster to traverse any matrix in Julia column-wise
        @simd for j in 1:(i-1)  # if I know I'm accessing a contiguous block of memory, I can tell Julia that using @simd
            @inbounds s += M[j, i]  # ignore bounds checking and just access memory C-style
        end
    end
    s
end

uppersum(R)
-7802.6497830305125
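The column-major comment deserves a quick aside. A little sanity-check sketch of my own (not from the tutorial): summing column-by-column walks memory contiguously, while summing row-by-row keeps jumping between columns, so on a big matrix the first should time noticeably better.

function sumcols(M)
    s = zero(eltype(M))
    for j in 1:size(M, 2), i in 1:size(M, 1)   # inner index i runs down a column - contiguous
        @inbounds s += M[i, j]
    end
    s
end

function sumrows(M)
    s = zero(eltype(M))
    for i in 1:size(M, 1), j in 1:size(M, 2)   # inner index j hops between columns - strided
        @inbounds s += M[i, j]
    end
    s
end

@time sumcols(R); @time sumrows(R);   # I would expect the first to win comfortably (after both are compiled)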
We can look at these @simd and @inbounds annotations:
?@simd
@simd
Annotate a for loop to allow the compiler to take extra liberties to allow loop re-ordering
!!! warning This feature is experimental and could change or disappear in future versions of Julia. Incorrect use of the @simd macro may cause unexpected results.
The object iterated over in a @simd for loop should be a one-dimensional range. By using @simd, you are asserting several properties of the loop:
It is safe to execute iterations in arbitrary or overlapping order, with special consideration for reduction variables.
Floating-point operations on reduction variables can be reordered, possibly causing different results than without @simd.
In many cases, Julia is able to automatically vectorize inner for loops without the use of @simd. Using @simd gives the compiler a little extra leeway to make it possible in more situations. In either case, your inner loop should have the following properties to allow vectorization:
The loop must be an innermost loop
The loop body must be straight-line code. Therefore, @inbounds is currently needed for all array accesses. The compiler can sometimes turn short &&, ||, and ?: expressions into straight-line code if it is safe to evaluate all operands unconditionally. Consider using the ifelse function instead of ?: in the loop if it is safe to do so.
Accesses must have a stride pattern and cannot be “gathers” (random-index reads) or “scatters” (random-index writes).
The stride should be unit stride.
!!! note The @simd does not assert by default that the loop is completely free of loop-carried memory dependencies, which is an assumption that can easily be violated in generic code. If you are writing non-generic code, you can use @simd ivdep for ... end to also assert that:
There exists no loop-carried memory dependencies
No iteration ever waits on a previous iteration to make forward progress.
?@inbounds
@inbounds(blk)
Eliminates array bounds checking within expressions.
In the example below the in-range check for referencing element i of array A is skipped to improve performance.
function sum(A::AbstractArray)
    r = zero(eltype(A))
    for i = 1:length(A)
        @inbounds r += A[i]
    end
    return r
end
!!! warning Using @inbounds may return incorrect results/crashes/corruption for out-of-bounds indices. The user is responsible for checking it manually. Only use @inbounds when it is certain from the information locally available that all accesses are in bounds.
Right! Let’s time this implementation:
@time uppersum(R);
0.054047 seconds (5 allocations: 176 bytes)
function uppersum_boundcheck(M)
    n = size(M, 1)
    s = zero(eltype(M))  # a zero hard-typed as the same type as the entries of the matrix, as before
    for i in 2:n  # still traversing column-wise
        @simd for j in 1:(i-1)
            s += M[j, i]  # same as uppersum, but without @inbounds - bounds checking stays on
        end
    end
    s  # the last expression is the return value, so this is sufficient as a `return` statement
end

uppersum(R)
-7802.6497830305125
Let’s see what kind of gain we got from dropping bounds checking (keep in mind this is the first call to uppersum_boundcheck, so the timing below also includes compiling it):
@time uppersum_boundcheck(R);
0.115098 seconds (51.16 k allocations: 2.663 MiB)
Interestingly, Julia apparently uses LLVM under the hood.
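You can actually peek at what LLVM gets handed - this is just me poking around with the built-in reflection macros, not part of the tutorial:

using InteractiveUtils       # already loaded in the REPL / IJulia
@code_llvm uppersum(R)       # dump the LLVM IR generated for this specific call
# @code_native uppersum(R)   # or go all the way down to machine code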
Going parallel
The idea is: we have a triangle and we want to split it into pieces of equal “mass”. This is done in the code below, via an instruction that is relatively magical to me right now.
For threading to work, note that as described in the docs, you need to set the environment variable JULIA_NUM_THREADS. export JULIA_NUM_THREADS=4 in .bashrc worked fine for me.
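A quick sanity check that the setting took effect (just how I verified it, not part of the tutorial code):

using Base.Threads
nthreads()   # should report 4 if JULIA_NUM_THREADS=4 was picked up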
using Base.Threads

function upsum_threads(M)
    n = size(M, 1)
    chunks = nthreads()
    sums = zeros(eltype(M), chunks)  # a separate subsum for each thread
    chunkend = round.(Int, n * sqrt.((1:chunks) ./ chunks))  # split jobs so that each thread has approx. the same number of numbers to add
    @assert minimum(diff(chunkend)) > 0
    chunkstart = [2; chunkend[1:end-1] .+ 1]
    @threads for job in 1:chunks  # tell Julia that this part is safe for threading
        s = zero(eltype(M))
        for i in chunkstart[job]:chunkend[job]
            @simd for j in 1:(i-1)
                @inbounds s += M[j, i]
            end
        end
        sums[job] = s
    end
    return sum(sums)
end

upsum_threads(R)
-7802.649783030595
@time upsum_threads(R);
0.037879 seconds (35 allocations: 2.000 KiB)
Okay, now this is faster than NumPy. I’m reasonably impressed - but still confused about that one magical chunkend line. Let’s dig into it.
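Before following the presenter there, here is my own back-of-the-envelope reading of the chunkend line: the strict upper triangle up to column c holds roughly c²/2 entries, so column boundaries proportional to n·√(k/chunks) give each thread an approximately equal number of additions. A quick check of that claim:

n, chunks = 10_000, 4
chunkend = round.(Int, n .* sqrt.((1:chunks) ./ chunks))   # [5000, 7071, 8660, 10000]
diff([0; chunkend .^ 2 .÷ 2])                              # each chunk covers roughly 12.5 million entries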
And this is where it starts to hit me, as the presenter introduces the collect command:
collect((1:chunks) ./ chunks)
4-element Array{Float64,1}:
0.25
0.5
0.75
1.0
OOOOOOOOOOOOOH. So ./ on a range is lazy - it gives back another range-like object (a StepRangeLen) rather than a materialized array! In other words, if you do this:
sqrt((1:chunks) ./ chunks)
MethodError: no method matching sqrt(::StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}})
Closest candidates are:
sqrt(!Matched::Float16) at math.jl:1018
sqrt(!Matched::Complex{Float16}) at math.jl:1019
sqrt(!Matched::Missing) at math.jl:1070
...
Stacktrace:
[1] top-level scope at In[43]:1
This errors because the plain sqrt has no method for a range type, but if you instead do this: