I’m currently sitting in on a tutorial on the basics of Julia and parallel processing therein, given by Bogumił Kamiński. This is actually my first dive into Julia, so I thought I would write it down on the go here!
Generate a normally-distributed random number of type T with mean 0 and standard deviation 1. Optionally generate an array of normally-distributed random numbers. The Base module currently provides an implementation for the types Float16, Float32, and Float64 (the default), and their Complex counterparts. When the type argument is complex, the values are drawn from the circularly symmetric complex normal distribution.
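That block is the help entry for randn. Just to get a feel for the signatures it describes, a few throwaway calls of my own (these are not from the tutorial):

randn()                    # a single Float64 draw
randn(Float32, 3)          # a length-3 Vector{Float32}
randn(ComplexF64, 2, 2)    # a 2×2 matrix from the circularly symmetric complex normal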
Welcome to Julia 1.1.0. The full manual is available at
https://docs.julialang.org/
as well as many great tutorials and learning resources:
https://julialang.org/learning/
For help on a specific function or macro, type ? followed by its name, e.g. ?cos, or ?@time, and press enter. Type ; to enter shell mode, ] to enter package mode.
In [3]: %timeit np.random.normal(size=(10000, 10000))
3.8 s ± 25.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
And the presenter’s R attempt took 5.81 seconds, measured with system.time. Wew, this is pretty fast.
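I didn’t catch the exact line that built the Julia test matrix R used below, but judging from the NumPy call above it was presumably just a 10000×10000 standard-normal matrix, i.e. something like:

R = randn(10000, 10000);
@time randn(10000, 10000);   # time the generation itself, to compare with the NumPy run above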
To start with, we’re analyzing a function found on StackOverflow that sums the upper-triangular part (above the diagonal) of a matrix (apparently, the code is pretty bad):
function upsum(M)
    n = size(M)[1]
    sum = 0
    for i = 1:n-1
        for j = i+1:n
            sum = sum + M[i, j]
        end
    end
    return sum
end

upsum(R)
-7802.649783031088
Let’s check the performance:
%timeit upsum(R);
The analogue of IPython’s %time statement (also %timeit) in Julia is @time statement. The analogue of %%time ...code... is
@time begin
...code...
end
Note, however, that you should put all performance-critical code into a function, avoiding global variables, before doing performance measurements in Julia; see the performance tips in the Julia manual.
The @time macro prints the timing results, and returns the value of evaluating the expression. To instead return the time (in seconds), use @elapsed statement.
For more extensive benchmarking tools, including the ability to collect statistics from multiple runs, see the BenchmarkTools package.
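A tiny sketch of my own pulling those three together (the BenchmarkTools part assumes you have added the package first):

@time sum(rand(10^6));         # prints time + allocations, returns the value
t = @elapsed sum(rand(10^6))   # returns the elapsed time in seconds instead

# using Pkg; Pkg.add("BenchmarkTools")
using BenchmarkTools
x = rand(10^6)
@btime sum($x);                # many samples, reports the minimum; $ interpolates the global into the benchmark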
… All right, I’m starting to like this cheeky little language. Trying again:
@time upsum(R);
0.464638 seconds (5 allocations: 176 bytes)
To compare that with Python:
In [8]: %timeit np.sum(np.tril(R))
245 ms ± 45.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Well, that was faster, but we can improve the Julia code. Let’s first look at the inbuilt sum function:
sum
sum (generic function with 13 methods)
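Those “13 methods” are multiple dispatch at work; two quick asides of my own for poking at them (@which lives in InteractiveUtils, which the REPL and IJulia load for you):

methods(sum)     # list every method signature defined for sum
@which sum(R)    # show which of those methods this particular call dispatches to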
@time sum(R);
0.090729 seconds (89.03 k allocations: 4.748 MiB, 10.32% gc time)
@time sum(R);
0.100551 seconds (5 allocations: 176 bytes)
Okay, now this is badass. Julia is just-in-time compiled: the first call to sum compiled a specialized method (hence the extra allocations and the GC time), and the second call just ran the already-compiled code. It’s as if Numba came out of Python and became its own language. Apparently there are ways of avoiding the first-call overhead, but that’s a more advanced topic.
Note, though, that the compiled-function cache is cleared whenever Julia restarts!
To compare with Python:
In [9]: %timeit np.sum(R)
53 ms ± 3.67 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Not too shabby for the ol’ snake!
Let’s try to improve the function, though:
function uppersum(M)
    n = size(M, 1)
    s = zero(eltype(M))  # a zero hard-typed as the same type as the entries of the matrix; eltype stands for ELement TYPE - this is now fully generic
    for i in 2:n  # Julia uses column-major storage order - faster to traverse any matrix in Julia column-wise
        @simd for j in 1:(i-1)  # if I know I'm accessing a contiguous block of memory, I can tell Julia that using @simd
            @inbounds s += M[j, i]  # ignore bounds checking and just access memory C-style
        end
    end
    s
end

uppersum(R)
-7802.6497830305125
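The column-major comment deserves a quick aside. A little sanity-check sketch of my own (not from the tutorial): summing column-by-column walks memory contiguously, while summing row-by-row keeps jumping between columns, so on a big matrix the first should time noticeably better.

function sumcols(M)
    s = zero(eltype(M))
    for j in 1:size(M, 2), i in 1:size(M, 1)   # inner index i runs down a column - contiguous
        @inbounds s += M[i, j]
    end
    s
end

function sumrows(M)
    s = zero(eltype(M))
    for i in 1:size(M, 1), j in 1:size(M, 2)   # inner index j hops between columns - strided
        @inbounds s += M[i, j]
    end
    s
end

@time sumcols(R); @time sumrows(R);   # I would expect the first to win comfortably (after both are compiled)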
We can look at these @simd and @inbounds annotations:
?@simd
@simd
Annotate a for loop to allow the compiler to take extra liberties to allow loop re-ordering
!!! warning This feature is experimental and could change or disappear in future versions of Julia. Incorrect use of the @simd macro may cause unexpected results.
The object iterated over in a @simd for loop should be a one-dimensional range. By using @simd, you are asserting several properties of the loop:
It is safe to execute iterations in arbitrary or overlapping order, with special consideration for reduction variables.
Floating-point operations on reduction variables can be reordered, possibly causing different results than without @simd.
In many cases, Julia is able to automatically vectorize inner for loops without the use of @simd. Using @simd gives the compiler a little extra leeway to make it possible in more situations. In either case, your inner loop should have the following properties to allow vectorization:
The loop must be an innermost loop
The loop body must be straight-line code. Therefore, @inbounds is currently needed for all array accesses. The compiler can sometimes turn short &&, ||, and ?: expressions into straight-line code if it is safe to evaluate all operands unconditionally. Consider using the ifelse function instead of ?: in the loop if it is safe to do so.
Accesses must have a stride pattern and cannot be “gathers” (random-index reads) or “scatters” (random-index writes).
The stride should be unit stride.
!!! note The @simd does not assert by default that the loop is completely free of loop-carried memory dependencies, which is an assumption that can easily be violated in generic code. If you are writing non-generic code, you can use @simd ivdep for ... end to also assert that:
There exists no loop-carried memory dependencies
No iteration ever waits on a previous iteration to make forward progress.
?@inbounds
@inbounds(blk)
Eliminates array bounds checking within expressions.
In the example below the in-range check for referencing element i of array A is skipped to improve performance.
function sum(A::AbstractArray)
    r = zero(eltype(A))
    for i = 1:length(A)
        @inbounds r += A[i]
    end
    return r
end
!!! warning Using @inbounds may return incorrect results/crashes/corruption for out-of-bounds indices. The user is responsible for checking it manually. Only use @inbounds when it is certain from the information locally available that all accesses are in bounds.
Right! Let’s time this implementation:
@time uppersum(R);
0.054047 seconds (5 allocations: 176 bytes)
function uppersum_boundcheck(M)
    n = size(M, 1)
    s = zero(eltype(M))  # a zero hard-typed as the same type as the entries of the matrix, as before
    for i in 2:n  # still traversing column-wise
        @simd for j in 1:(i-1)
            s += M[j, i]  # same as uppersum, but without @inbounds - bounds checking stays on
        end
    end
    s  # the last expression is the return value, so this is sufficient as a `return` statement
end

uppersum(R)
-7802.6497830305125
Let’s see what kind of gain we got from dropping bounds checking (keep in mind this is the first call to uppersum_boundcheck, so the timing below also includes compiling it):
@time uppersum_boundcheck(R);
0.115098 seconds (51.16 k allocations: 2.663 MiB)
Interestingly, Julia apparently uses LLVM under the hood.
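You can actually peek at what LLVM gets handed - this is just me poking around with the built-in reflection macros, not part of the tutorial:

using InteractiveUtils       # already loaded in the REPL / IJulia
@code_llvm uppersum(R)       # dump the LLVM IR generated for this specific call
# @code_native uppersum(R)   # or go all the way down to machine code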
Going parallel
The idea is: we have a triangle and we want to split it into pieces of equal “mass”. This is done in the code below, via an instruction that is relatively magical to me right now.
For threading to work, note that as described in the docs, you need to set the environment variable JULIA_NUM_THREADS. export JULIA_NUM_THREADS=4 in .bashrc worked fine for me.
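A quick sanity check that the setting took effect (just how I verified it, not part of the tutorial code):

using Base.Threads
nthreads()   # should report 4 if JULIA_NUM_THREADS=4 was picked up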
using Base.Threads

function upsum_threads(M)
    n = size(M, 1)
    chunks = nthreads()
    sums = zeros(eltype(M), chunks)  # a separate subsum for each thread
    chunkend = round.(Int, n * sqrt.((1:chunks) ./ chunks))  # split jobs so that each thread has approx. the same number of numbers to add
    @assert minimum(diff(chunkend)) > 0
    chunkstart = [2; chunkend[1:end-1] .+ 1]
    @threads for job in 1:chunks  # tell Julia that this part is safe for threading
        s = zero(eltype(M))
        for i in chunkstart[job]:chunkend[job]
            @simd for j in 1:(i-1)
                @inbounds s += M[j, i]
            end
        end
        sums[job] = s
    end
    return sum(sums)
end

upsum_threads(R)
-7802.649783030595
@time upsum_threads(R);
0.037879 seconds (35 allocations: 2.000 KiB)
Okay, now this is faster than NumPy. I’m reasonably impressed - but still confused about that one magical chunkend line. Let’s dig into it.
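Before following the presenter there, here is my own back-of-the-envelope reading of the chunkend line: the strict upper triangle up to column c holds roughly c²/2 entries, so column boundaries proportional to n·√(k/chunks) give each thread an approximately equal number of additions. A quick check of that claim:

n, chunks = 10_000, 4
chunkend = round.(Int, n .* sqrt.((1:chunks) ./ chunks))   # [5000, 7071, 8660, 10000]
diff([0; chunkend .^ 2 .÷ 2])                              # each chunk covers roughly 12.5 million entries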
And this is where it starts to hit me, as the presenter introduces the collect command:
collect((1:chunks) ./ chunks)
4-element Array{Float64,1}:
0.25
0.5
0.75
1.0
OOOOOOOOOOOOOH. So ./ on a range is lazy - it gives back another range-like object (a StepRangeLen) rather than a materialized array! In other words, if you do this:
sqrt((1:chunks) ./ chunks)
MethodError: no method matching sqrt(::StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}})
Closest candidates are:
sqrt(!Matched::Float16) at math.jl:1018
sqrt(!Matched::Complex{Float16}) at math.jl:1019
sqrt(!Matched::Missing) at math.jl:1070
...
Stacktrace:
[1] top-level scope at In[43]:1
This errors because the plain sqrt has no method for a range type, but if you instead do this: