I’ve recently come across the amazing CuPy library, and given that I haven’t updated this blog in a while, I figured this would be a great opportunity to showcase a few of its capabilities.
If you haven’t heard yet, CuPy is NumPy, but on the GPU, and it’s amazing how close that simple description is to reality. First things first! Make sure you’ve installed it (I used Conda with Python 3.6) and that your Nvidia drivers are loaded. On my laptop, which has an integrated Intel GPU and a dedicated Nvidia one, all I had to do was run sudo modprobe nvidia. Let’s see if that worked:
!nvidia-smi
Tue Jan 22 08:08:35 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 415.27 Driver Version: 415.27 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 106... Off | 00000000:01:00.0 Off | N/A |
| N/A 65C P0 24W / N/A | 0MiB / 6078MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Yup! Let’s get to it. We’ll compare it with NumPy, of course:
import cupy as cp
import numpy as np
I’m mostly interested in operations on dense matrices, so let’s get ourselves a sample one. I’m not using an insanely large array, to avoid MemoryErrors, but 2**20 floats should make for a reasonable benchmark.
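Something along these lines, with a 1024 by 1024 matrix (that's our 2**20 floats), A on the CPU and B on the GPU; the exact shape and distribution are my guess, based on how the arrays are used below:
A = np.random.rand(1024, 1024)  # host (NumPy) array
B = cp.random.rand(1024, 1024)  # device (CuPy) array
B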
One thing that can be noticed already: the array's values get displayed, right there in Jupyter! All the memory transfers are handled for you behind the scenes, though you can also force them explicitly. To me, that's pretty amazing! Let's make sure this is actually on the GPU:
!nvidia-smi
Tue Jan 22 08:08:36 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 415.27 Driver Version: 415.27 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 106... Off | 00000000:01:00.0 Off | N/A |
| N/A 66C P2 24W / N/A | 95MiB / 6078MiB | 1% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 12081 C ...ik/.miniconda3/envs/nbody3.6/bin/python 85MiB |
+-----------------------------------------------------------------------------+
Clearly a bunch of memory is allocated.
A few benchmarks
All right, let’s get to the actual number crunching. Let’s start with a simple element-wise log of the array (on the CPU, that’s going to run through MKL-accelerated NumPy on an i7):
%%timeit -o
np.log(A)
24.4 ms ± 1.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
<TimeitResult : 24.4 ms ± 1.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)>
Respectable, I suppose. Let’s see how CuPy fares against that:
%%timeit -o
cp.log(B)
453 µs ± 735 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
<TimeitResult : 453 µs ± 735 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)>
Instantly, I noticed two things:
My laptop fan started spinning up immediately after running that command. Clearly something more intense is going on there.
My screen went black. Fun fact: I was writing this post from bed, without having plugged my laptop in, and my current system configuration did not enjoy having a power-hungry GPU try to run on battery, so it just switched off instantly. Consider yourself warned!
After rebooting and using the classic Restart and run all, a third fun fact occurred to me: a different SI prefix on the GPU result!
__.average / _.average
53.877025466882685
That’s a pretty okay speedup for swapping np to cp in the import statement.
Let’s see how well it’s going to parallelize a matrix multiplication:
cpu_operator = %timeit -o A @ A
25.2 ms ± 3.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
gpu_operator = %timeit -o B @ B
18.7 ms ± 48.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Note how that’s literally the same operation in terms of code: we’re not calling NumPy’s functions here; rather, both of these classes define an @ operator. This is going to come up later…
cpu_operator.average / gpu_operator.average
1.3472357440051654
Well, surprisingly, this is nowhere near as large a speedup as I would expect! My results seem to vary a bit, though:
gpu_operator_saved = %timeit -o B2 = B @ B
18.7 ms ± 4.83 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
gpu_dot = %timeit -o cp.dot(B, B)
9.37 ms ± 10.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
gpu_dot_saved = %timeit -o B3 = cp.dot(B, B)
17.4 ms ± 3.26 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
18.7 ms ± 9.19 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
I think I may come back to the matrix multiplication issue in the future, because it seems like there are multiple ways to do it and it’s not clear which one is the best. Weirdly, the winner seems to be .dot(B, B), but…without saving. Let’s keep this post to an overview of CuPy’s functionality and possibly revisit that in the future. This may have been a BLAS/cuBLAS issue that I don’t quite understand yet.
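Before blaming cuBLAS, though, it's worth remembering that CuPy launches its kernels asynchronously, so a timing loop that never synchronizes the device can end up measuring launch overhead rather than the actual computation. Here's a rough sketch of a synchronized timing (just an illustration on my part, not what was measured above):
import time

def timed_gpu_matmul(x, n_repeats=10):
    # Make sure no previously queued GPU work skews the measurement.
    cp.cuda.Device().synchronize()
    start = time.perf_counter()
    for _ in range(n_repeats):
        result = x @ x
    # Wait for the queued kernels to actually finish before stopping the clock.
    cp.cuda.Device().synchronize()
    return (time.perf_counter() - start) / n_repeats

timed_gpu_matmul(B)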
Further functionality review
Okay, but what actually is B?
type(B)
cupy.core.core.ndarray
All right, some internal CuPy ndarray class. It should be pretty simple to turn it into something in host memory, though.
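The obvious thing to try is handing it straight to NumPy (the exact call below is my guess at what goes here):
np.asarray(B)  # ask NumPy to wrap the GPU array directly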
Close, but no cigar! I suspect this may get out of date relatively soon, but right now NumPy doesn’t know how to handle our GPU-resident array B. Another example of this is:
np.log(B)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-24-31256e41f3b1> in <module>
----> 1 np.log(B)

ValueError: object __array__ method not producing an array
What you can do instead is do the comparison right on the GPU, after a round trip from GPU to host and back to GPU:
cp.allclose(B, cp.asarray(cp.asnumpy(B)))
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-25-159d67e41f21> in <module>
----> 1 cp.allclose(B, cp.asarray(cp.asnumpy(B)))

AttributeError: module 'cupy' has no attribute 'allclose'
And this is actually the first time I’ve seen CuPy not implementing something from NumPy’s API! It’s pretty easy to get around in this instance, though:
cp.all(cp.isclose(B, cp.asarray(cp.asnumpy(B))))
array(True)
I guess that’s no proof that this works correctly, but it’s at least an argument :)
However… Wait a minute. What’s that array thing doing there? As far as I’ve been able to figure out, this is a single-element array allocated in GPU memory, the one that .all() reduces our boolean NxN isclose array to. It’s pretty simple to convert to a normal Python bool, though.
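Something like this should do the trick (a sketch on my part, reusing the isclose check from above):
bool(cp.all(cp.isclose(B, cp.asarray(cp.asnumpy(B)))))  # plain Python True/False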
And of course, once again we need to force a cast to a Python float:
float(_[1])
262330.2045552799
You can also, if you want to, sum into previously allocated arrays (I was thinking of using this to test performance differences between CuPy and numba.cuda, though I haven’t gotten to that yet).
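A rough sketch of what that could look like, using the out parameter of cp.sum (row_sums is just a name I made up for illustration):
row_sums = cp.empty(B.shape[0])    # preallocated on the GPU
cp.sum(B, axis=1, out=row_sums)    # writes the row sums into row_sums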
The amazing power tool that is einsum works as well; let’s use it to compute the array’s trace:
cp.einsum('ii->', B), cp.sum(cp.diag(B))
(array(493.15631992), array(493.15631992))
Writing CPU and GPU agnostic code
This is a concept I found in CuPy’s library and absolutely fell in love with.
In some cases, you can use array methods and operators to do what you need. This is where that A @ A and B @ B concept comes back. However, that’s not always possible. For example, there isn’t a .log() method.
HOWEVER, the CuPy folks had a pretty ingenious idea for solving that! Just watch:
def agnostic_log(array):
    xp = cp.get_array_module(array)
    return xp.log(array)

agnostic_log(A), agnostic_log(B)
Same function handles two completely different array types!
cp.get_array_module(A), cp.get_array_module(B)
(<module 'numpy' from '/home/dominik/.local/lib/python3.6/site-packages/numpy/__init__.py'>,
<module 'cupy' from '/home/dominik/.miniconda3/envs/nbody3.6/lib/python3.6/site-packages/cupy/__init__.py'>)