Wednesday, December 23, 2020

The force accumulation pattern

Game physics has resisted parallelization over the years, especially on the GPU. One reason is that it is not embarrassingly parallel like vertex and fragment shaders. Another is that the code becomes hairy as soon as you start optimizing it for SIMD or multi-threading. My goal is to keep the code both easy to read and as fast as possible. Also, I would like it to run both on CPU cores and on the GPU (not necessarily at the same time). This is usually done through abstractions that do all the optimizations behind the scenes. Starting with this blog post, I will try to identify some of these abstractions that appear often in physics simulation and see how they can be exploited.

I will start with two very simple examples applied to masses connected by springs: explicit integration and gradient descent. The former is used for dynamic simulation, while the latter is the most rudimentary form of static analysis, although it can be used for implicit integration too. In fact, mathematically speaking they are almost the same thing. There are differences too, but here I will focus only on the computational similarities.

Gradient descent:
$ p_{i+1} = p_i + \alpha  f $

Verlet integration:
$ p_{i+1} = 2 p_i - p_{i-1} + \frac{h^2}{m}  f $
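
For concreteness, here is a minimal sketch of the two update rules in NumPy; the names are mine: p, p_prev and f are (n, 3) arrays of positions and forces, alpha is the step size, h the time step and mass the (scalar) particle mass.

import numpy as np

def gradient_descent_step(p, f, alpha):
    # p_{i+1} = p_i + alpha * f, with f = -grad(E)
    return p + alpha * f

def verlet_step(p, p_prev, f, h, mass):
    # p_{i+1} = 2 p_i - p_{i-1} + (h^2 / m) * f
    return 2.0 * p - p_prev + (h * h / mass) * f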

By $f$ I denote the total force acting on the system, which is the negative gradient of the total energy; hence the plus sign in both formulas. Leaving external forces (like gravity) aside, the total force consists of internal forces only, that is, forces due to the springs. Therefore, we first calculate the spring forces, then we distribute them to the masses and finally update the positions using one of the formulas above. Here’s the picture:
[Figure: the three-step flow: compute the spring forces on the edges, accumulate them onto the nodes, update the node positions.]

This pattern will come up over and over again, so let’s analyze it. The first and last steps are embarrassingly parallel - essentially map operations. What differentiates them, though, is that the first one operates on the springs (edges), whereas the last one operates on the particles (nodes/vertices). For this reason, the first step also includes a gather operation, i.e. it reads the positions of the two endpoints:
p1 = p[edge(j, 1)]
p2 = p[edge(j, 2)]
This is just an indirect load operation, which can be done in parallel. (It could also be accelerated by using some fancy SIMD gather instructions.) After that, we compute the length of the spring and subtract the rest length - let’s call this value ɸ. I call ɸ the error function or the constraint function. For springs, it is simply the elongation. If we divide it by the rest length, it’s called the strain. We multiply ɸ by the spring constant k (or whatever nonlinear spring law we have) in order to obtain the spring force magnitude λ (or tension). Here’s the pseudocode:
d = p2 - p1   # spring vector
l = len(d)    # current length
n = d / l     # unit direction
ɸ = l - l0    # elongation (the constraint/error value)
λ = -k * ɸ    # tension (force magnitude)
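
The same step can also be written over all springs at once; here is a minimal vectorized sketch in NumPy, under my own assumptions about the data layout: p is an (n, 3) array of positions, edges an (m, 2) integer array, l0 an (m,) array of rest lengths and k a scalar spring constant.

import numpy as np

def spring_forces(p, edges, l0, k):
    # gather the endpoint positions of every spring (the indirect loads from above)
    p1 = p[edges[:, 0]]
    p2 = p[edges[:, 1]]
    d = p2 - p1                                    # spring vectors
    l = np.linalg.norm(d, axis=1, keepdims=True)   # current lengths
    n = d / l                                      # unit directions
    phi = l - l0[:, None]                          # elongations (constraint values)
    lam = -k * phi                                 # tensions (force magnitudes)
    return lam * n                                 # per-spring force vectors t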
Now comes the hardest part: step 2. (If you don’t want to do it in parallel, it’s not hard at all.) You simply compute the force along the spring, t = λ * n, and distribute it to the nodes according to the following rule: if the spring leaves the node you add the force, otherwise you subtract it. You do this using a force accumulator: reset it to zero for each node, then add to or subtract from it as described. You can do this per-edge, as a continuation of the code above:
f[edge(j, 1)] -= t
f[edge(j, 2)] += t
This is analogous to Kirchhoff’s law for currents entering and leaving a circuit node (more on this another time). But the code looks a lot like the gather operation in reverse - it’s really a scatter operation (or a mixture of the two). More than that, it’s a scatter-add operation. We could also call it a reduction, but that term applies more to accumulating into one or two variables, like the norm of a vector or the total energy of the system. And, done naively per edge, this won’t work in parallel: two springs sharing a node would race on the same accumulator.
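
As an aside, NumPy already ships such a primitive: np.add.at does an unbuffered scatter-add, i.e. repeated indices accumulate instead of overwriting each other (it runs serially, though). A sketch of step 2 with it, under the same array assumptions as above:

import numpy as np

def accumulate_forces(t, edges, num_nodes):
    # f[i] collects the signed forces of all springs incident to node i
    f = np.zeros((num_nodes, 3))
    np.add.at(f, edges[:, 0], -t)   # first endpoints: subtract
    np.add.at(f, edges[:, 1], +t)   # second endpoints: add
    return f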

There are several ways of attacking the scatter-add. Essentially, they all work on the underlying graph of springs and masses. What we need is the node-edge incidence table: for each node, store up front its incident edges and their orientations (plus or minus). Then go through the nodes in parallel and simply add/subtract (depending on the sign) all the incident spring forces. That was easy! This is the simplest method I could find that uses no locks, critical sections or atomic operations. The drawbacks are that you need extra memory for the incidence table (which is a bit redundant) and for the spring forces, and you have to update the incidence info if the graph changes (e.g. for contacts). Oh, and it might thrash the cache. Still, I don’t have a better solution for writing it from scratch.
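
Here is a minimal sketch of that incidence-based accumulation (the data layout and names are my own choice). The outer per-node loop is the part that can run in parallel, since each node writes only to its own accumulator:

import numpy as np

def build_incidence(edges, num_nodes):
    # for each node, the incident springs and their signs
    # (-1 for the first endpoint, +1 for the second, matching the per-edge code above)
    incidence = [[] for _ in range(num_nodes)]
    for j, (a, b) in enumerate(edges):
        incidence[a].append((j, -1.0))
        incidence[b].append((j, +1.0))
    return incidence

def accumulate_forces_per_node(t, incidence):
    # each node sums the signed forces of its incident springs; no write conflicts
    f = np.zeros((len(incidence), 3))
    for i, incident in enumerate(incidence):   # this loop is the parallel one
        for j, sign in incident:
            f[i] += sign * t[j]
    return f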

But the good thing is that we have identified the scatter-add primitive, and we can abstract it and optimize it later. In fact, some libraries offer it; for instance, I used torch-scatter for PyTorch. And to let you in on a secret, you can even do it with any sparse linear algebra library out there. You just need to build the difference matrix $A$, with -1 and +1 non-zero values on each row (one row per spring), and then multiply its transpose with the spring forces: $f = A^T t$. This matrix is equivalent to the incidence info, and you might even end up using less memory. I honestly don’t know whether it’s faster or slower than the hand-crafted version; I’m just glad we have options. Additionally, some libraries offer a SIMD version of scatter write, but you would have to write the scatter-add yourself.
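
For illustration, a possible SciPy version of the sparse route (my own names again; edges is the (m, 2) index array and t the (m, 3) array of spring forces from before):

import numpy as np
from scipy.sparse import coo_matrix

def difference_matrix(edges, num_nodes):
    # one row per spring: -1 at the first endpoint, +1 at the second
    m = len(edges)
    rows = np.repeat(np.arange(m), 2)
    cols = edges.reshape(-1)
    vals = np.tile([-1.0, 1.0], m)
    return coo_matrix((vals, (rows, cols)), shape=(m, num_nodes)).tocsr()

# the scatter-add is then just a sparse product: A = difference_matrix(edges, n); f = A.T @ t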

Let’s go back to the figure above and change it to make it more general. The point is that you can reuse it for all kinds of different computational codes, and not only for simulation - computing the vertex normals of a mesh, for instance. By local quantities I mean anything that can be defined over an edge of the graph, or over a generalization of an edge (a stencil or hyper-edge) like a triangle or a tetrahedron. The best examples are finite elements - you can implement a very simple FEM simulator with only these concepts. Finite differences should work too. And you can also use the same flow to implement a Jacobi-style constraint solver. The step of computing and assembling global gradient vectors gets called many times over by more advanced solvers, which is why it’s crucial that it is as optimized as possible.
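
To illustrate the reuse, here is the vertex-normal example written with exactly the same gather / scatter-add / per-node flow (a NumPy sketch with my own names; p is an (n, 3) array of vertex positions and tris an (m, 3) array of triangle indices):

import numpy as np

def vertex_normals(p, tris):
    # step 1 (map over triangles): gather the corners and compute face normals
    a, b, c = p[tris[:, 0]], p[tris[:, 1]], p[tris[:, 2]]
    face_n = np.cross(b - a, c - a)   # area-weighted face normals
    # step 2 (scatter-add): accumulate each face normal onto its three vertices
    n = np.zeros_like(p)
    for corner in range(3):
        np.add.at(n, tris[:, corner], face_n)
    # step 3 (map over vertices): normalize the accumulated normals
    return n / np.linalg.norm(n, axis=1, keepdims=True)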