NumPy-ish GPU computations with CuPy
I've recently come across the amazing CuPy library, and given that I haven't updated this blog in a while, I figured this would be a great opportunity to showcase a few of its capabilities.
If you haven't heard yet, CuPy is NumPy, but on the GPU, and it's amazing how close that simple description is to reality.
First things first! Make sure you've installed it (I used Conda with Python 3.6) and that your Nvidia drivers are loaded. On my laptop, which runs an integrated Intel GPU alongside a dedicated Nvidia one, I simply had to run sudo modprobe nvidia. Let's see if that worked:
!nvidia-smi
Yup! Let's get to it. We'll compare it with NumPy, of course:
import cupy as cp
import numpy as np
I'm mostly interested in operations on dense matrices, so let's get ourselves a sample one. I'm not using an insanely large array due to MemoryErrors, but 2**20 floats should be a reasonable benchmark.
N = 1024
A = np.random.random((N, N))
A
The CuPy API is basically NumPy's API, with a few minor differences here and there:
B = cp.random.random((N, N))
B
One thing can be noticed already: the displayed numbers! Right there, in Jupyter! All the memory transfers are done for you behind the scenes, though you can also force them explicitly. To me, that's pretty amazing! Let's make sure this is actually on the GPU:
!nvidia-smi
Clearly a bunch of memory is allocated.
A few benchmarks
All right, let's get to the actual number crunching. Let's take the element-wise log of the array (on the CPU that's going to run with MKL-accelerated NumPy on an i7):
%%timeit -o
np.log(A)
Respectable, I suppose. Let's see how CuPy fares against that:
%%timeit -o
cp.log(B)
Instantly, I noticed two things:
- My laptop fan started spinning up immediately after running that command. Clearly something more intense is going on there.
- My screen went black. Fun fact: I wrote this post out of bed, without having plugged my laptop in - and my current system configuration did not enjoy having a power-hungry GPU try to run on battery, so it just switched off instantly. Consider yourself warned!
After rebooting and using the classic Restart and run all, a third fun fact occurred to me: a different SI prefix on the GPU result!
__.average / _.average
That's a pretty okay speedup for swapping n to c in the import statement.
Let's see how well it's going to parallelize a matrix multiplication:
cpu_operator = %timeit -o A @ A
gpu_operator = %timeit -o B @ B
Note how that's literally the same operation in terms of code: we're not calling NumPy's functions directly; rather, both of these classes define an @ operator. This is going to come up later...
cpu_operator.average / gpu_operator.average
Well, surprisingly, this is nowhere near as large a speedup as I would have expected! My results seem to vary a bit, though:
gpu_operator_saved = %timeit -o B2 = B @ B
gpu_dot = %timeit -o cp.dot(B, B)
gpu_dot_saved = %timeit -o B3 = cp.dot(B, B)
Btarget = cp.empty_like(B)
gpu_dot_out = %timeit -o cp.dot(B, B, Btarget)
I think I may come back to the matrix multiplication issue in the future, because it seems like there are multiple ways to do it and it's not clear which one is best. Weirdly, the winner seems to be cp.dot(B, B), but... without saving the result. Let's keep this post to an overview of CuPy's functionality and possibly revisit that later. This may have been a BLAS/cuBLAS issue that I don't quite understand yet.
Further functionality review
Okay, but what actually is B?
type(B)
All right, it's CuPy's own ndarray class. It's pretty simple to turn it into something in host memory, though:
type(cp.asnumpy(B))
B[0], cp.asnumpy(B)[0]
Just to make sure this is actually the same array:
np.allclose(B, cp.asnumpy(B))
Close, but no cigar! I think this may get out of date relatively soon, but right now NumPy doesn't know how to handle our GPU-resident B array. Another example of this:
np.log(B)
What you can do instead is run the comparison right on the GPU, round-tripping from GPU to host and back again:
cp.allclose(B, cp.asarray(cp.asnumpy(B)))
And this is actually the first time I've seen CuPy not implementing something from NumPy's API! It's pretty easy to work around in this instance:
cp.all(cp.isclose(B, cp.asarray(cp.asnumpy(B))))
I guess that's no proof that this works correctly, but it's at least an argument :)
However... wait a minute. What's that array thing doing there? As far as I've been able to figure out, it's a single-element array, allocated in GPU memory, that .all() reduces our boolean NxN isclose array to. It's pretty simple to convert it to a normal Python bool, though:
bool(_)
Reshape works, as does summing along an axis:
B.reshape(N, N, 1).sum(axis=1)
So do statistical functions:
B.mean(axis=0)
You can raise stuff to powers and sum to scalars:
(A**3).sum(), (B**3).sum()
And of course, once again we need to force a cast to a Python float:
float(_[1])
You can also, if you want to, sum into previously allocated arrays (I was thinking of using this to test performance differences between cupy and numba.cuda; I haven't gotten to that yet, though):
Ax = np.linspace(0, 1, N)
Ax
A.shape, Ax.shape
A.sum(axis=1, out=Ax)
Bx = cp.linspace(0, 1, N)
Bx
B.shape, Bx.shape
B.sum(axis=1, out=Bx)
Bx
The random number generators produce different streams on CPU and GPU, even with the same seed:
cp.random.seed(0)
Rgpu = cp.random.random()
np.random.seed(0)
Rcpu = np.random.random()
Rgpu - Rcpu
cp.random.seed(0)
Rgpu2 = cp.random.random()
Rgpu2 - Rgpu
Indexing works just as we know and love from NumPy:
Bx
Bx[0] = 3
Bx
Bx[1::2] = -1
Bx
The amazing power tool that is einsum works as well. Let's use it to compute the array's trace:
cp.einsum('ii->', B), cp.sum(cp.diag(B))
Writing CPU and GPU agnostic code
This is a concept I found in CuPy's library and absolutely fell in love with.
In some cases, you can use array methods and operators to do what you need. This is where that A @ A and B @ B concept comes back. However, that's not always possible. For example, there isn't a .log() method.
HOWEVER, the CuPy folks had a pretty ingenious idea for solving that! Just watch:
def agnostic_log(array):
    xp = cp.get_array_module(array)
    return xp.log(array)
agnostic_log(A), agnostic_log(B)
Same function handles two completely different array types!
cp.get_array_module(A), cp.get_array_module(B)
This is so simple, I absolutely love it. It's not perfect (you still have to define a new function), but it's a nice workaround. And it may not be necessary for much longer, either...
And given that I'm ending on links, I'll just add Matthew Rocklin's post about prototype GPU arrays on Dask clusters.
To sum up: CuPy is awesome. If you've got a GPU lying around (for games, etc.), you can very easily use it for your number crunching as well!