CuPy speedup of naive N-Body vectorized force calculation
I had intended to write a post about speeding up our Numpy Ising implementation, which we found gives reasonable numerical values, though the small grids we were able to use limited the accuracy a fair bit. However, a few difficulties came up, so I thought instead (to keep writing these a habit!) I would write a little about using CuPy to speed up force calculations in N-body simulations. This is a topic I may well come back to on this blog, as I have an ongoing project implementing exactly that.
The part of the n-body simulation we'll look at is the calculation of forces, where the force on the i-th point particle or celestial object is:
$$ F_i = \sum_j F_{ij} = G \sum_j \frac{m_i m_j}{|\vec{r_i}-\vec{r_j}|^3 } (\vec{r_j}-\vec{r_i}) $$From this, Newton's second law gives $\vec{a_i} = \vec{F_i} / m_i$. I had been using my own implementation of this vectorization, but I found a neat implementation by PMende on Stack Overflow that's both more general and faster than what I had been doing. Let's take a look!
import numpy
import cupy
import numpy_html
I basically copypasted the following:
def accelerations(positions, masses, G = 1):
    '''
    https://stackoverflow.com/a/52562874

    Params:
    - positions: numpy array of size (n,3)
    - masses: numpy array of size (n,)
    '''
    xp = cupy.get_array_module(positions)
    mass_matrix = masses.reshape((1, -1, 1))*masses.reshape((-1, 1, 1))
    disps = positions.reshape((1, -1, 3)) - positions.reshape((-1, 1, 3))  # displacements
    dists = xp.linalg.norm(disps, axis=2)
    dists[dists == 0] = 1  # Avoid divide by zero warnings
    forces = G*disps*mass_matrix/xp.expand_dims(dists, 2)**3
    return forces.sum(axis=1)/masses.reshape(-1, 1)
The main change I made was adding this line:
xp = cupy.get_array_module(positions)
which returns numpy if we pass in a numpy.ndarray and cupy if we pass in a CuPy array. This makes the function generic over both backends.
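To see what that dispatch buys us, here's a minimal CPU-side stand-in for cupy.get_array_module (just a sketch; the real function also handles multiple arguments and sparse arrays):

```python
import numpy

def get_array_module_sketch(x):
    # Dispatch on the module that defines the array's type: CuPy arrays
    # live in the cupy namespace, everything else falls back to numpy.
    if type(x).__module__.split(".")[0] == "cupy":
        import cupy
        return cupy
    return numpy

xp = get_array_module_sketch(numpy.arange(3))
print(xp.__name__)  # numpy
```

Any function written against the returned module (here bound to xp) then runs unchanged on either backend.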
Let's take a look at what each of those lines does. For the illustrations, we'll take some particularly simple values:
N = 5
m = numpy.arange(N) + 1
r = (numpy.arange(N*3)**2).reshape((N, 3))
m
Pretty printing of arrays, by the way, is provided by the awesome numpy_html package. We can now start digging! Let's first investigate mass_matrix:
mass_matrix = m.reshape((1, -1, 1)) * m.reshape((-1, 1, 1))
mass_matrix.T
This is a (5, 5, 1)-shaped array, but I used a .T transposition so that it prints more nicely, as a (1, 5, 5)-shaped array. It shows that the (i, j)-th entry is just $m_i m_j$ - and with the shape it has, we can take advantage of Numpy broadcasting in the force calculation.
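A quick sanity check on that broadcasting, using the same toy masses as above:

```python
import numpy

m = numpy.arange(5) + 1
mass_matrix = m.reshape((1, -1, 1)) * m.reshape((-1, 1, 1))

# A (1, 5, 1) array times a (5, 1, 1) array broadcasts to (5, 5, 1),
# with entry (i, j, 0) equal to m_i * m_j.
assert mass_matrix.shape == (5, 5, 1)
assert mass_matrix[2, 3, 0] == m[2] * m[3] == 12
```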
Let's now look at the displacements, disps. The line is
disps = r.reshape((1, -1, 3)) - r.reshape((-1, 1, 3))
but let's break it down a little bit more. r is
r
while reshaping it with a single added dimension (signified by the 1) in the first slot gives:
r.reshape((1, -1, 3))
while if we were to add a dimension in the second slot:
r.reshape((-1, 1, 3))
We thus have a (1, 5, 3)-shaped array and a (5, 1, 3)-shaped array. Numpy (and, by extension, anything implementing Numpy's broadcasting API) is going to expand their difference into a (5, 5, 3) array:
disps = r.reshape((1, -1, 3)) - r.reshape((-1, 1, 3))
disps.T
which I chose to print with a transpose (as a (3, 5, 5) array) so as to illustrate the structure a bit more: each of the three (5, 5) slices displays a different spatial component of $\vec{r_j} - \vec{r_i}$. The slices are antisymmetric, as $\vec{r_{ij}} = -\vec{r_{ji}}$.
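The index convention and the antisymmetry are easy to verify directly (same toy positions as above):

```python
import numpy

r = (numpy.arange(15) ** 2).reshape((5, 3))
disps = r.reshape((1, -1, 3)) - r.reshape((-1, 1, 3))

# disps[i, j] holds r_j - r_i, so the array is antisymmetric
# under swapping the two particle indices.
assert disps.shape == (5, 5, 3)
assert numpy.array_equal(disps[1, 4], r[4] - r[1])
assert numpy.array_equal(disps, -disps.transpose((1, 0, 2)))
```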
Let's continue with the calculations! The next line simply computes the norms of those inter-particle displacements (summing over the "spatial dimensions" axis), producing an (N, N) array:
dists = numpy.linalg.norm(disps, axis=2)
dists
The next step is pretty clever. Each diagonal element of disps (see above) is zero (since that's $\vec{r_{ii}}$), and we'll be dividing by the corresponding distances, so Numpy is going to scream obscenities at us about division by zero. But since 0 / 1 = 0, we lose nothing and gain peace of mind by doing:
dists[dists == 0] = 1
dists
Simple and effective! In my own implementation I had numpy.inf instead of 1, so as to get anything / numpy.inf == 0, but since the numerator is 0 in the problematic cases anyway, dividing by 1 works just as well.
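The two variants really are interchangeable here; a quick check with some random positions:

```python
import numpy

r = numpy.random.default_rng(0).normal(size=(4, 3))
disps = r.reshape((1, -1, 3)) - r.reshape((-1, 1, 3))
dists = numpy.linalg.norm(disps, axis=2)

d_one = dists.copy()
d_one[d_one == 0] = 1            # the trick used above
d_inf = dists.copy()
d_inf[d_inf == 0] = numpy.inf    # my original variant

# The diagonal numerators are zero, so both give identical results.
f_one = disps / numpy.expand_dims(d_one, 2) ** 3
f_inf = disps / numpy.expand_dims(d_inf, 2) ** 3
assert numpy.array_equal(f_one, f_inf)
```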
The next line simply adds a trailing dimension:
dists.shape, numpy.expand_dims(dists, 2).shape
For those curious as to why not just .reshape((-1, -1, 1)): only one dimension may be left as -1, so that raises:
try:
    dists.reshape((-1, -1, 1))
except ValueError as e:
    print(e)
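With the size spelled out, reshape works fine, as does None-indexing; expand_dims is just the explicit spelling of the same operation:

```python
import numpy

dists = numpy.ones((5, 5))

# Three equivalent ways to append a trailing length-1 axis:
assert numpy.expand_dims(dists, 2).shape == (5, 5, 1)
assert dists.reshape((5, 5, 1)).shape == (5, 5, 1)   # size spelled out
assert dists[:, :, None].shape == (5, 5, 1)          # None adds an axis
```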
And now we can finally calculate the forces themselves, first getting each of $\vec{F_{ij}}$:
G = 1 # because let's be real...
forces = G * disps * mass_matrix / numpy.expand_dims(dists, 2) ** 3
forces
And we can contract that to $\vec{F_i}$ by summing over the other-particle index:
forces.sum(axis = 1)
And from here, a simple division by the masses suffices to get the acceleration, but we need to remember to turn the mass into a (N, 1)
array via a simple reshape:
forces.sum(axis=1)/m.reshape(-1, 1)
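Before plotting anything, it's worth checking the pipeline end-to-end against a case we can do by hand. Here's a CPU-only copy of the function above (so this snippet runs without a GPU), applied to two bodies a distance 2 apart; each acceleration should have magnitude $G m_{\text{other}} / d^2$ and point at the other body:

```python
import numpy

def accelerations_cpu(positions, masses, G=1):
    # Numpy-only copy of the accelerations function walked through above.
    mass_matrix = masses.reshape((1, -1, 1)) * masses.reshape((-1, 1, 1))
    disps = positions.reshape((1, -1, 3)) - positions.reshape((-1, 1, 3))
    dists = numpy.linalg.norm(disps, axis=2)
    dists[dists == 0] = 1
    forces = G * disps * mass_matrix / numpy.expand_dims(dists, 2) ** 3
    return forces.sum(axis=1) / masses.reshape(-1, 1)

# Two bodies a distance 2 apart along x: |a_i| = G * m_other / d**2.
r2 = numpy.array([[-1.0, 0, 0], [1.0, 0, 0]])
m2 = numpy.array([3.0, 5.0])
a = accelerations_cpu(r2, m2)
assert numpy.allclose(a[0], [5 / 4, 0, 0])   # pulled toward body 1
assert numpy.allclose(a[1], [-3 / 4, 0, 0])  # pulled toward body 0
```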
Let's run a quick test first that also serves to illustrate the results. We'll first write a simple function that prepares some reasonable parameters - masses and positions in arrays provided by our chosen packages.
def prep(N, np = numpy, seed = 0):
    np.random.seed(seed)
    m = np.abs(np.random.normal(loc=100, scale=20, size=N))
    r = np.random.normal(size=(N, 3))
    return r, m
For a test, we'll set z = 0:
r, m = prep(10, numpy, seed = 17)
r[:, -1] = 0
ax, ay, az = accelerations(r, m).T
x, y, z = r.T
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(x, y, m);
plt.quiver(x, y, ax, ay);
Looks just about right! The accelerations all point inward, as expected for a self-gravitating system. Let's do the same for a few more bodies:
r, m = prep(500, numpy, seed = 17)
r[:, -1] = 0
ax, ay, az = accelerations(r, m).T
x, y, z = r.T
plt.scatter(x, y, m);
plt.quiver(x, y, ax, ay);
And that's a proper mess, but the arrows do seem to be oriented the right way (towards the system's center of mass), and there are a few very long arrows, signifying large forces at short distances - another issue that I might well come back to in another post!
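For the record, the standard remedy for those short-distance force spikes is softening; here's a sketch of the Plummer variant (eps is a hypothetical smoothing length, not part of the code above):

```python
import numpy

def softened_inv_cube(dists, eps=0.1):
    # Plummer softening: replace 1/d**3 by 1/(d**2 + eps**2)**1.5,
    # which caps the force kernel at small separations.
    return (dists ** 2 + eps ** 2) ** -1.5

d = numpy.array([0.0, 0.01, 1.0])
f = softened_inv_cube(d)
assert numpy.isfinite(f).all()  # no blow-up, even at d == 0
```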
And the nice thing is that our function works just as well on the GPU!
ax, ay, az = cupy.asnumpy(accelerations(cupy.asarray(r), cupy.asarray(m)).T)
plt.scatter(x, y, m);
plt.quiver(x, y, ax, ay);
Let's now run a quick benchmark:
results = []
numbers_of_bodies = [2**n for n in range(4, 13)]
for np in [numpy, cupy]:
    for N in numbers_of_bodies:
        r, m = prep(N, np, seed=17)
        time = %timeit -oq accelerations(r, m)
        results.append({"library": np.__name__,
                        "N": N,
                        "average": time.average,
                        "stdev": time.stdev})
import pandas
df = pandas.DataFrame(results)
df
fig, ax = plt.subplots()
ax.set_ylabel("Average runtime [s]")
for label, g in df.groupby('library'):
g.plot('N', 'average', ax=ax, label=label, logx=True, logy=True, style="o--")
ax.grid()
Thus at low numbers of particles CuPy has a performance overhead, but at larger numbers (up to the MemoryErrors I ran into on my device) the GPU predictably wins, with the crossover around 100 particles.
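One caveat on these numbers: CuPy launches kernels asynchronously, so a strict GPU timing should synchronize the device before stopping the clock (e.g. via cupy.cuda.Stream.null.synchronize). A backend-agnostic sketch of such a timer:

```python
import time

def timed_best(fn, sync=lambda: None, repeats=5):
    # Run fn repeatedly and return the best wall-clock time, calling
    # sync() before stopping the clock so that asynchronous (e.g. GPU)
    # work has actually finished when we read the timer.
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        sync()
        best = min(best, time.perf_counter() - t0)
    return best

# On the GPU one would pass sync=cupy.cuda.Stream.null.synchronize.
t = timed_best(lambda: sum(x * x for x in range(10_000)))
```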
We can also calculate the runtime ratios for a speedup estimate:
speedups = df[df.library =='numpy'].set_index('N').average / df[df.library =='cupy'].set_index('N').average
speedups
speedups.plot(logx=True, style="o--", logy=True)
plt.ylabel("GPU over CPU speedup");
plt.grid()
While this is obviously not a full, proper benchmark (we don't know how much host-device memory transfer would impact the timings, for instance), it's nice to see that we get roughly 35 times the speed on the GPU for the pure acceleration stage, basically for free!