I’ve recently come across the amazing CuPy library, and given that I haven’t updated this blog in a while, I figured this would be a great opportunity to showcase a few of its capabilities.

If you haven’t heard yet, CuPy is NumPy, but on the GPU, and it’s amazing how close that simple description is to reality. First things first! Make sure you’ve installed it (I used Conda with Python 3.6) and that your Nvidia drivers are on. On my laptop, running an integrated Intel and dedicated Nvidia GPU, I had to simply run sudo modprobe nvidia. Let’s see if that worked:

!nvidia-smi

Tue Jan 22 08:08:35 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 415.27       Driver Version: 415.27       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 106...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   65C    P0    24W /  N/A |      0MiB /  6078MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Yup! Let’s get to it. We’ll compare it with NumPy, of course:

import cupy as cp
import numpy as np

I’m mostly interested in operations on dense matrices, so let’s get ourselves a sample one. I’m not using an insanely large array due to MemoryErrors, but 2**20 floats should be a reasonable benchmark.

N = 1024
A = np.random.random((N, N))
A

array([[0.10388936, 0.27674225, 0.09349157, ..., 0.59858586, 0.01545899,
        0.20201765],
       [0.81588711, 0.19722361, 0.66885061, ..., 0.83687175, 0.15600763,
        0.6171922 ],
       [0.73374963, 0.66466975, 0.55082473, ..., 0.68605053, 0.93384799,
        0.84729118],
       ...,
       [0.76718438, 0.40130284, 0.81041205, ..., 0.42829758, 0.42465592,
        0.67533214],
       [0.11546777, 0.35548417, 0.645703  , ..., 0.24879487, 0.58897384,
        0.98993676],
       [0.96847189, 0.21391942, 0.70259718, ..., 0.32546387, 0.97123257,
        0.99439515]])

The CuPy API is basically Numpy’s API, with a few minor differences here and there:

B = cp.random.random((N, N))
B

array([[0.5967192 , 0.51631595, 0.49980612, ..., 0.52830527, 0.4521689 ,
        0.27857874],
       [0.80999042, 0.32971922, 0.74034167, ..., 0.7316576 , 0.05339145,
        0.67494372],
       [0.66954774, 0.08282191, 0.06237442, ..., 0.85821394, 0.33912042,
        0.00146102],
       ...,
       [0.87827673, 0.58662314, 0.97428079, ..., 0.1239315 , 0.90813556,
        0.55808706],
       [0.59890383, 0.54480358, 0.59180028, ..., 0.03094922, 0.54241454,
        0.45274242],
       [0.34639887, 0.49254118, 0.28915567, ..., 0.86708966, 0.97695957,
        0.63873008]])

One thing that can be noticed already - that displayed numbers! Right there, in Jupyter! All the memory transfer is done for you as need be, though you can also force it as needed. To me, that’s pretty amazing! Let’s make sure this is actually on the GPU:

!nvidia-smi

Tue Jan 22 08:08:36 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 415.27       Driver Version: 415.27       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 106...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   66C    P2    24W /  N/A |     95MiB /  6078MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     12081      C   ...ik/.miniconda3/envs/nbody3.6/bin/python    85MiB |
+-----------------------------------------------------------------------------+

Clearly a bunch of memory is allocated.

A few benchmarks

All right, let’s get to the actual number crunching. Let’s take the simple element-wise log of each element in the array (on CPU that’s going to run with the MKL-accelerated Numpy on an i7):

%%timeit -o
np.log(A)

24.4 ms ± 1.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

<TimeitResult : 24.4 ms ± 1.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)>

Respectable, I suppose. Let’s see how CuPy fares against that:

%%timeit -o
cp.log(B)

453 µs ± 735 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

<TimeitResult : 453 µs ± 735 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)>

Instantly, I noticed two things:

My laptop fan started spinning up immediately after running that command. Clearly something more intense is going on there.
My screen went black. Fun fact: I wrote this post out of bed, without having plugged my laptop in - and my current system configuration did not enjoy having a power-hungry GPU try to run on battery, so it just switched off instantly. Consider yourself warned!

After rebooting and using the classic Restart and run all, a third fun fact occured to me: a different SI prefix on the GPU result!

__.average / _.average

53.877025466882685

That’s a pretty okay speedup for swapping n to c in the import statement.

Let’s see how well it’s going to parallelize a matrix multiplication:

cpu_operator = %timeit -o A @ A

25.2 ms ± 3.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

gpu_operator = %timeit -o B @ B

18.7 ms ± 48.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Note how that’s literally the same operation in terms of code, as we’re not using Numpy’s functions, rather - both of these classes define an @ operator. This is going to come up later…

cpu_operator.average / gpu_operator.average

1.3472357440051654

Well, suprisingly, this is nowhere near as large of a speedup as I would expect! My results seem to vary a bit, though:

gpu_operator_saved = %timeit -o B2 = B @ B

18.7 ms ± 4.83 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

gpu_dot = %timeit -o cp.dot(B, B)

9.37 ms ± 10.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

gpu_dot_saved = %timeit -o B3 = cp.dot(B, B)

17.4 ms ± 3.26 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

Btarget= cp.empty_like(B)
gpu_dot_out = %timeit -o cp.dot(B, B, Btarget)

18.7 ms ± 9.19 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

I think I may come back to the matrix multiplication issue in the future, because it seems like there are multiple ways to do it and it’s not clear which one is the best. Weirdly, the winner seems to be .dot(B, B), but…without saving. Let’s keep this post to an overview of CuPy’s functionality and possibly revisit that in the future. This may have been a BLAS/cuBLAS issue that I don’t quite understand yet.

Further functionality review

Okay, but what actually is B?

type(B)

cupy.core.core.ndarray

All right, some internal Cupy ndarray class. It’s pretty simple to turn it into something in host device memory, though:

type(cp.asnumpy(B))

numpy.ndarray

B[0], cp.asnumpy(B)[0]

(array([0.5967192 , 0.51631595, 0.49980612, ..., 0.52830527, 0.4521689 ,
        0.27857874]),
 array([0.5967192 , 0.51631595, 0.49980612, ..., 0.52830527, 0.4521689 ,
        0.27857874]))

Just to make sure this is actually the same array:

np.allclose(B, cp.asnumpy(B))

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-23-8ff31bdd3597> in <module>
----> 1 np.allclose(B, cp.asnumpy(B))

~/.local/lib/python3.6/site-packages/numpy/core/numeric.py in allclose(a, b, rtol, atol, equal_nan)
   2268 
   2269     """
-> 2270     res = all(isclose(a, b, rtol=rtol, atol=atol, equal_nan=equal_nan))
   2271     return bool(res)
   2272 

~/.local/lib/python3.6/site-packages/numpy/core/numeric.py in isclose(a, b, rtol, atol, equal_nan)
   2352             return less_equal(abs(x-y), atol + rtol * abs(y))
   2353 
-> 2354     x = asanyarray(a)
   2355     y = asanyarray(b)
   2356 

~/.local/lib/python3.6/site-packages/numpy/core/numeric.py in asanyarray(a, dtype, order)
    551 
    552     """
--> 553     return array(a, dtype, copy=False, order=order, subok=True)
    554 
    555 

ValueError: object __array__ method not producing an array

Close, but no cigar! I think this may be getting out of date relatively soon, but right now NumPy doesn’t know how to handle our B GPU array. Another example of this is:

np.log(B)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-24-31256e41f3b1> in <module>
----> 1 np.log(B)

ValueError: object __array__ method not producing an array

What you could do instead is compare this right on the GPU, going from GPU to host to GPU again:

cp.allclose(B, cp.asarray(cp.asnumpy(B)))

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-25-159d67e41f21> in <module>
----> 1 cp.allclose(B, cp.asarray(cp.asnumpy(B)))

AttributeError: module 'cupy' has no attribute 'allclose'

And this is, actually, the first time I saw cupy not implementing something in NumPy’s API! It’s pretty easy to get around this, in this instance:

cp.all(cp.isclose(B, cp.asarray(cp.asnumpy(B))))

array(True)

I guess that’s no proof that this works correctly, but it’s at least an argument :)

However… Wait a minute. What’s that array thing doing there? As far as I have been able to figure out, this is a single element array allocated in GPU memory that .all() reduces our boolean NxN isclose array to. It’s pretty simple to convert to a normal Python bool, though:

bool(_)

True

Reshape works, as does summing along an axis:

B.reshape(N, N, 1).sum(axis=1)

array([[504.85090841],
       [524.895922  ],
       [511.81485662],
       ...,
       [505.07597442],
       [505.44331639],
       [508.71877327]])

So do statistical functions:

B.mean(axis=0)

array([0.49501721, 0.51289292, 0.51160649, ..., 0.49777249, 0.50287185,
       0.49873693])

You can raise stuff to powers and sum to scalars:

(A**3).sum(), (B**3).sum()

(262231.4456098349, array(262330.20455528))

And of course, once again we need to force a cast to a Python float:

float(_[1])

262330.2045552799

You can also, if you want to, sum into previously allocated arrays (I was thinking of using this to test performance differences between cupy and numba.cuda, haven’t gotten to that yet, though):

Ax = np.linspace(0, 1, N)
Ax

array([0.00000000e+00, 9.77517107e-04, 1.95503421e-03, ...,
       9.98044966e-01, 9.99022483e-01, 1.00000000e+00])

A.shape, Ax.shape

((1024, 1024), (1024,))

A.sum(axis=1, out=Ax)

array([497.48857494, 510.25752741, 493.57497492, ..., 515.3009242 ,
       499.92205554, 512.74308963])

Bx = cp.linspace(0, 1, N)
Bx

array([0.00000000e+00, 9.77517107e-04, 1.95503421e-03, ...,
       9.98044966e-01, 9.99022483e-01, 1.00000000e+00])

B.shape, Bx.shape

((1024, 1024), (1024,))

B.sum(axis=1, out=Bx)

array([504.85090841, 524.895922  , 511.81485662, ..., 505.07597442,
       505.44331639, 508.71877327])

Bx

array([504.85090841, 524.895922  , 511.81485662, ..., 505.07597442,
       505.44331639, 508.71877327])

Random numbers start from different seeds:

cp.random.seed(0)
Rgpu = cp.random.random()
np.random.seed(0)
Rcpu = np.random.random()
Rgpu - Rcpu

array(0.01273565)

cp.random.seed(0)
Rgpu2 = cp.random.random()
Rgpu2 - Rgpu

array(0.)

Indexing works just like we know and love it from numpy:

Bx

array([504.85090841, 524.895922  , 511.81485662, ..., 505.07597442,
       505.44331639, 508.71877327])

Bx[0] = 3
Bx

array([  3.        , 524.895922  , 511.81485662, ..., 505.07597442,
       505.44331639, 508.71877327])

Bx[1::2] = -1
Bx

array([  3.        ,  -1.        , 511.81485662, ...,  -1.        ,
       505.44331639,  -1.        ])

The amazing power tool that is einsum works as well, let’s use it to compute the array’s trace:

cp.einsum('ii->', B), cp.sum(cp.diag(B))

(array(493.15631992), array(493.15631992))

Writing CPU and GPU agnostic code

This is a concept I found in CuPy’s library and absolutely fell in love.

In some cases, you can use array methods and operators to do what you need. This is where that A @ A and B @ B concept comes back. However, that’s not always possible. For example, there isn’t a .log() method.

HOWEVER, the CuPy folks had a pretty ingenious idea for solving that! Just watch:

def agnostic_log(array):
    xp = cp.get_array_module(array)
    return xp.log(array)

agnostic_log(A), agnostic_log(B)

(array([[-2.26442884, -1.28466872, -2.36988397, ..., -0.51318531,
         -4.16956465, -1.59940023],
        [-0.20347928, -1.62341712, -0.40219454, ..., -0.17808444,
         -1.85785034, -0.4825748 ],
        [-0.30958741, -0.40846499, -0.59633861, ..., -0.37680399,
         -0.0684416 , -0.16571087],
        ...,
        [-0.26502811, -0.91303893, -0.21021246, ..., -0.84793704,
         -0.85647603, -0.39255065],
        [-2.15876381, -1.03427455, -0.43741564, ..., -1.39112655,
         -0.52937351, -0.01011422],
        [-0.03203582, -1.54215589, -0.35297155, ..., -1.12250382,
         -0.02918932, -0.00562062]]),
 array([[-0.51630862, -0.66103639, -0.69353501, ..., -0.638081  ,
         -0.79369951, -1.27805454],
        [-0.21073286, -1.10951384, -0.30064348, ..., -0.31244264,
         -2.93010466, -0.39312597],
        [-0.40115281, -2.4910626 , -2.7746    , ..., -0.15290186,
         -1.08140002, -6.52862238],
        ...,
        [-0.12979355, -0.53337268, -0.02605573, ..., -2.08802627,
         -0.09636162, -0.5832403 ],
        [-0.51265425, -0.60732996, -0.52458607, ..., -3.47540747,
         -0.61172474, -0.79243193],
        [-1.06016435, -0.70817722, -1.24079009, ..., -0.1426129 ,
         -0.02331001, -0.44827332]]))

Same function handles two completely different array types!

cp.get_array_module(A), cp.get_array_module(B)

(<module 'numpy' from '/home/dominik/.local/lib/python3.6/site-packages/numpy/__init__.py'>,
 <module 'cupy' from '/home/dominik/.miniconda3/envs/nbody3.6/lib/python3.6/site-packages/cupy/__init__.py'>)

This is so simple, I absolutely love it. It’s not perfect (you still have to define a new function), but it’s a nice workaround. It may not be necessary for a lot longer, too…

And given that I’m ending on links, I’ll just add Matthew Rocklin’s post about prototype GPU arrays on Dask clusters.

To sum up: CuPy is awesome, if you’ve got a GPU lying around (for games, etc), you can very easily use it for your number crunching as well!