CuPy speedup of naive N-Body vectorized force calculation

cupy
nbody
gpu
python
N-body simulation
Published

March 4, 2019

I had intended to write a post about speeding up our Numpy Ising implementation, which we found out gave reasonable numerical values, though the small grids we were able to use limited the accuracy a fair bit. However, a few difficulties came up, so I thought instead (to keep writing these a habit!) I would write a little bit about using CuPy to speed up force calculations in N-body simulations. This might be a point I’ll come back to later on this blog, as I have an ongoing project implementing that.

The part of the n-body simulation we’ll look at is the calculation of forces, where the force on the i-th point particle or celestial object is:

\[ F_i = \sum_j F_{ij} = G \sum_j \frac{m_i m_j}{|\vec{r_i}-\vec{r_j}|^3 } (\vec{r_i}-\vec{r_j}) \]

From this, Newton’s law gives \(\vec{a_i} = \vec{F_i} / m_i\). I was trying to use my own implementation of this vectorization, but I found a neat implementation by PMende on Stack that’s both more general and faster than what I had been doing. Let’s take a look!

import numpy
import cupy
import numpy_html

I basically copypasted the following:

def accelerations(positions, masses, G = 1):
    '''
    https://stackoverflow.com/a/52562874
    
    Params:
    - positions: numpy array of size (n,3)
    - masses: numpy array of size (n,)
    '''
    xp = cupy.get_array_module(positions)
    mass_matrix = masses.reshape((1, -1, 1))*masses.reshape((-1, 1, 1))
    disps = positions.reshape((1, -1, 3)) - positions.reshape((-1, 1, 3)) # displacements
    dists = xp.linalg.norm(disps, axis=2)
    dists[dists == 0] = 1 # Avoid divide by zero warnings
    forces = G*disps*mass_matrix/xp.expand_dims(dists, 2)**3
    return forces.sum(axis=1)/masses.reshape(-1, 1)

The main change I made was adding this line:

    xp = cupy.get_array_module(positions)

Which returns numpy if we pass in a numpy.ndarray and cupy if we pass in a CuPy array. This will make this function more generic for our purposes.

Let’s take a look at what each of those lines does. For the illustrations, we’ll take some particularly simple values:

N = 5
m = numpy.arange(N) + 1
r = (numpy.arange(N*3)**2).reshape((N, 3))
m
1
2
3
4
5

Pretty printing of arrays, by the way, is provided by the awesome numpy_html package. We can now start digging! Let’s first investigate the mass_matrix:

mass_matrix = m.reshape((1, -1, 1)) * m.reshape((-1, 1, 1))
mass_matrix.T
1 2 3 4 5
2 4 6 8 10
3 6 9 12 15
4 8 12 16 20
5 10 15 20 25

This was a (5, 5, 1)-shaped array, but I used a .T transposition so that it would print more nicely, as a (1, 5, 5)-shaped array. This shows that the (i, j)-th entry is just \(m_i m_j\) - with the shape it has, we should be able to take advantage of Numpy broadcasting in our calculation.

Let’s now look at the displacements - disps. The line is

    disps = r.reshape((1, -1, 3)) - r.reshape((-1, 1, 3))

but let’s break it down a little bit more. r is

r
0 1 4
9 16 25
36 49 64
81 100 121
144 169 196

while if you reshape it with a single added dimension (signified by 1) in the first place:

r.reshape((1, -1, 3))
0 1 4
9 16 25
36 49 64
81 100 121
144 169 196

while if we were to add a dimension in the second slot:

r.reshape((-1, 1, 3))
0 1 4
9 16 25
36 49 64
81 100 121
144 169 196

We thus have a (1, 5, 3)-shaped array and a (5, 1, 3) array. Numpy (and anything implementing Numpy’s broadcasting API by extension) is going to expand that into a (5, 5, 3) array:

disps = r.reshape((1, -1, 3)) - r.reshape((-1, 1, 3))
disps.T
0 -9 -36 -81 -144
9 0 -27 -72 -135
36 27 0 -45 -108
81 72 45 0 -63
144 135 108 63 0
0 -15 -48 -99 -168
15 0 -33 -84 -153
48 33 0 -51 -120
99 84 51 0 -69
168 153 120 69 0
0 -21 -60 -117 -192
21 0 -39 -96 -171
60 39 0 -57 -132
117 96 57 0 -75
192 171 132 75 0

Which I chose to print with a transpose (as a (3, 5, 5) array) so as to illustrate the structure a bit more. Each of the three (5, 5) arrays displays a different spatial component of \(\vec{r_i} - \vec{r_j}\). The arrays are antisymmetric as \(\vec{r_{ij}} = - \vec{r_{ji}}\).

Let’s continue with the calculations! The next line simply calculates the norms of those inter-particle distances, outputting an (N, N) array (summing over the “spatial dimensions” axis):

dists = np.linalg.norm(disps, axis=2)
dists
0. 27.33130074 84.85281374 173.35224256 292.95733478
27.33130074 0. 57.78408085 146.47866739 266.22359024
84.85281374 57.78408085 0. 88.74119675 208.53776636
173.35224256 146.47866739 88.74119675 0. 119.81235329
292.95733478 266.22359024 208.53776636 119.81235329 0.

The next step is pretty clever. Since in disps (see above) each diagonal element is zero (since that’s \(\vec{r_{ii}}\)), and we’ll be dividing those by distances, we’re going to have Numpy screaming obscenities at us for dividing by zero. But since 0 / 1 = 0, we lose nothing and gain peace of mind by doing:

dists[dists == 0] = 1
dists
1. 27.33130074 84.85281374 173.35224256 292.95733478
27.33130074 1. 57.78408085 146.47866739 266.22359024
84.85281374 57.78408085 1. 88.74119675 208.53776636
173.35224256 146.47866739 88.74119675 1. 119.81235329
292.95733478 266.22359024 208.53776636 119.81235329 1.

Simple and effective! In my own implementation I had np.inf instead of 1 so as to get anything / np.inf == 0, but if anything = 0 for the problematic cases, that’s fine as well.

The next line simply adds a dimension:

dists.shape, np.expand_dims(dists, 2).shape
((5, 5), (5, 5, 1))

For those curious as to why not just .reshape((-1, -1, 1)):

try:
    dists.reshape((-1, -1, 1))
except ValueError as e:
    print(e)
can only specify one unknown dimension

And now we can finally calculate the forces themselves, first getting each of \(\vec{F_{ij}}\):

G = 1  # because let's be real...
forces = G * disps * mass_matrix / np.expand_dims(dists, 2) ** 3
forces
0. 0. 0.
0.00088164 0.0014694 0.00205716
0.00017678 0.0002357 0.00029463
6.2195164e-05 7.60163116e-05 8.98374591e-05
2.86364625e-05 3.34092063e-05 3.81819501e-05
-0.00088164 -0.0014694 -0.00205716
0. 0. 0.
0.00083963 0.00102622 0.00121281
0.00018327 0.00021382 0.00024436
7.15474501e-05 8.10871102e-05 9.06267702e-05
-0.00017678 -0.0002357 -0.00029463
-0.00083963 -0.00102622 -0.00121281
0. 0. 0.
0.00077271 0.00087574 0.00097877
0.00017863 0.00019848 0.00021833
-6.2195164e-05 -7.60163116e-05 -8.98374591e-05
-0.00018327 -0.00021382 -0.00024436
-0.00077271 -0.00087574 -0.00097877
0. 0. 0.
0.0007326 0.00080237 0.00087214
-2.86364625e-05 -3.34092063e-05 -3.81819501e-05
-7.15474501e-05 -8.10871102e-05 -9.06267702e-05
-0.00017863 -0.00019848 -0.00021833
-0.0007326 -0.00080237 -0.00087214
0. 0. 0.

And we can contract that to \(\vec{F_i}\) by summing over the other-particle index:

forces.sum(axis = 1)
0.00114925 0.00181453 0.00247981
0.00021281 -0.00014827 -0.00050936
-6.50662895e-05 -0.0001877 -0.00031034
-0.00028558 -0.00036321 -0.00044083
-0.00101141 -0.00111535 -0.00121928

And from here, a simple division by the masses suffices to get the acceleration, but we need to remember to turn the mass into a (N, 1) array via a simple reshape:

forces.sum(axis=1)/m.reshape(-1, 1)
0.00114925 0.00181453 0.00247981
0.00010641 -7.4137417e-05 -0.00025468
-2.16887632e-05 -6.25669817e-05 -0.00010345
-7.13957375e-05 -9.08016861e-05 -0.00011021
-0.00020228 -0.00022307 -0.00024386

Let’s run a quick test first that also serves to illustrate the results. We’ll first write a simple function that prepares some reasonable parameters - masses and positions in arrays provided by our chosen packages.

def prep(N, np = numpy, seed = 0):
    np.random.seed(seed)
    m = np.abs(np.random.normal(loc=100, scale=20, size=N))
    r = np.random.normal(size=(N, 3))
    return r, m

For a test, we’ll set z = 0:

r, m = prep(10, numpy, seed = 17)
r[:, -1] = 0

ax, ay, az = accelerations(r, m).T
x, y, z = r.T

import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(x, y, m);
plt.quiver(x, y, ax, ay);

Looks just about right! The system is evidently self-gravitating. Let’s do the same for a few more bodies:

r, m = prep(500, numpy, seed = 17)
r[:, -1] = 0

ax, ay, az = accelerations(r, m).T
x, y, z = r.T

plt.scatter(x, y, m);
plt.quiver(x, y, ax, ay);

And that’s a proper mess, but the arrows seem to be oriented the right way (towards the system’s center of mass) and you get a bunch of very long arrows, signifying high forces at short distances - another issue that I might well come back to in another post!

And the nice thing is that our function works just as well on the GPU!

ax, ay, az = cupy.asnumpy(accelerations(cupy.asarray(r), cupy.asarray(m)).T)

plt.scatter(x, y, m);
plt.quiver(x, y, ax, ay);

Let’s now run a quick bechmark:

results = []
numbers_of_bodies = [2**n for n in range(4, 13)]
for np in [numpy, cupy]:
    for N in numbers_of_bodies:
        r, m = prep(N, np, seed=17)
        time = %timeit -oq accelerations(r, m)
        results.append({"library":np.__name__,
                        "N": N,
                        "average": time.average,
                        "stdev": time.stdev})
import pandas

df = pandas.DataFrame(results)
df
N average library stdev
0 16 0.000076 numpy 0.000003
1 32 0.000179 numpy 0.000002
2 64 0.000589 numpy 0.000020
3 128 0.002245 numpy 0.000139
4 256 0.012159 numpy 0.003883
5 512 0.044007 numpy 0.004827
6 1024 0.187316 numpy 0.007669
7 2048 0.718909 numpy 0.027526
8 4096 2.867943 numpy 0.055799
9 16 0.001288 cupy 0.000021
10 32 0.001316 cupy 0.000030
11 64 0.001463 cupy 0.000146
12 128 0.001430 cupy 0.000112
13 256 0.001321 cupy 0.000024
14 512 0.001656 cupy 0.000022
15 1024 0.005480 cupy 0.000082
16 2048 0.020252 cupy 0.000017
17 4096 0.080994 cupy 0.000289
fig, ax = plt.subplots()
ax.set_ylabel("Average runtime [s]")
for label, g in df.groupby('library'):
    g.plot('N', 'average', ax=ax, label=label, logx=True, logy=True, style="o--")
ax.grid()

Thus at low numbers of particles, CuPy has a performance overhead, but at larger numbers of particles (as limited by MemoryErrors on my device, with the regime shifting around 100 particles), the GPU (predictably) wins!

We can also calculate the runtime ratios for a speedup estimate:

speedups = df[df.library =='numpy'].set_index('N').average / df[df.library =='cupy'].set_index('N').average
speedups
N
16       0.059079
32       0.135752
64       0.402321
128      1.570570
256      9.204320
512     26.570801
1024    34.179128
2048    35.498837
4096    35.409139
Name: average, dtype: float64
speedups.plot(logx=True, style="o--", logy=True)
plt.ylabel("GPU over CPU speedup");
plt.grid()

While this is obviously not a full proper test (we don’t know how much host-device memory transfer would impact our timings, etc), it’s at least nice to see that we get 35 times the speed on the GPU for the pure acceleration stage basically for free!