Wednesday, December 23, 2020

The force accumulation pattern

Game physics has resisted parallelization over the years, and especially so on the GPU. One reason is that it is not embarrassingly parallel like vertex and fragment shaders are. Another is that the code becomes hairy as soon as you start optimizing it for SIMD or multi-threading. My goal is to keep the code both easy to read and as fast as possible. I would also like it to run both on CPU cores and on the GPU (not necessarily at the same time). This is usually done through abstractions that do all the optimizations behind the curtains. Starting with this blog post, I will try to identify some of the abstractions that appear often in physics simulation and see how they can be exploited.

I will start with some very simple examples applied to masses connected through springs: explicit integration and gradient descent. The former is used for dynamic simulation, while the latter is the most rudimentary form of static analysis, although it can be used for implicit integration too. In fact, they are almost the same thing mathematically speaking. There are also differences, so I will focus on the computational similarities only.

Gradient descent:
$ p_{i+1} = p_i + \alpha f $

Verlet integration:
$ p_{i+1} = 2 p_i - p_{i-1} + \frac{h^2}{m} f $
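As a quick sketch in Python (with NumPy, and assuming positions and forces are simply stored as arrays), the two update rules look like this:

```python
import numpy as np

def gradient_descent_step(p, f, alpha):
    # p_{i+1} = p_i + alpha * f, where f is the negative energy gradient
    return p + alpha * f

def verlet_step(p, p_prev, f, h, m):
    # p_{i+1} = 2 p_i - p_{i-1} + (h^2 / m) * f
    return 2.0 * p - p_prev + (h * h / m) * f
```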

By $f$ I denoted the total force acting on the system, which is the negative gradient of the total energy; hence the plus sign in both formulas. Leaving external forces (like gravity) aside, the total force consists of internal forces only (that is, forces due to springs). Therefore, we first calculate the spring forces, then we distribute them to the masses and finally update the positions using one of the formulas above. Here’s the picture:
This pattern will come up over and over again, so let’s analyze it. The first and last steps are embarrassingly parallel - essentially map operations. What differentiates them though is that the first one operates on the springs (edges), whereas the last one operates on the particles (nodes/vertices). For this reason, the first step also includes a gather operation, i.e. it reads the positions of the two end-points:
p1 = p[edge(j, 1)]
p2 = p[edge(j, 2)]
This is just an indirect load operation, which can be done in parallel. (It could also be accelerated by using some fancy SIMD gather instructions.) After that, we just compute the length of the spring and subtract the rest length - let’s call this value ɸ. I call ɸ the error function or the constraint function. For springs, it is simply the elongation; if we divide it by the rest length, it’s called the strain. We multiply ɸ by the spring constant k (or apply whatever nonlinear spring law we have) in order to obtain the spring force magnitudes λ (or tensions). Here’s the pseudocode:
d = p2 - p1
l = len(d)
n = d / l
ɸ = l - l0
λ = -k * ɸ
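The same per-edge computation can be vectorized over all springs at once. Here is a sketch in NumPy; the array layout (positions `p` of shape (n, dim), `edges` as an (m, 2) array of index pairs, rest lengths `l0`) is my own assumption:

```python
import numpy as np

def spring_tensions(p, edges, l0, k):
    p1 = p[edges[:, 0]]           # gather: positions of the first end-points
    p2 = p[edges[:, 1]]           # gather: positions of the second end-points
    d = p2 - p1
    l = np.linalg.norm(d, axis=1) # current spring lengths
    n = d / l[:, None]            # unit directions along the springs
    phi = l - l0                  # constraint/error function (elongation)
    lam = -k * phi                # spring force magnitudes (tensions)
    return lam, n
```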
Now comes the hardest part - step 2. (Or, if you don’t want to do it in parallel, it’s not hard at all.) You simply compute the force along the spring t = λ * n and distribute it to the nodes according to the following rule: if the spring leaves the node you add the force, otherwise you subtract it. You do this using a force accumulator: you reset it to zero for each node and then add/subtract to it as described. You can do this per-edge as a continuation of the code above:
f[edge(j, 1)] -= t
f[edge(j, 2)] += t
This is analogous to Kirchhoff’s law for currents entering and leaving a circuit node (more on this another time). But the code looks a lot like the gather operation in reverse - it’s really a scatter operation (or a mixture thereof). More than that, it’s a scatter-add operation. We could also call it a reduction, but that term applies more to accumulating into one or two variables, like the norm of a vector or the total energy of the system. And, as written, this won’t work in parallel.
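For illustration, NumPy exposes exactly this primitive as `np.add.at` (an unbuffered scatter-add); a sketch of step 2 using it, with the same assumed layout as before, could look like:

```python
import numpy as np

def accumulate_forces(n_nodes, edges, lam, n):
    t = lam[:, None] * n               # force along each spring
    f = np.zeros((n_nodes, n.shape[1]))
    # scatter-add: subtract at the first end-point, add at the second
    np.add.at(f, edges[:, 0], -t)
    np.add.at(f, edges[:, 1], t)
    return f
```

Note that `np.add.at` handles repeated indices correctly (unlike plain fancy-index assignment), which is the whole point of the scatter-add.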

There are several ways of attacking the scatter-add. Essentially, all of them work on the underlying graph of springs and masses. What we need is the node-edge incidence table: for each node, initially store each incident edge and its orientation (plus or minus). Then go through each node in parallel and simply add/subtract (depending on the sign) all the incident spring forces. That was easy! This is the simplest method I could find without using locks, critical sections or atomic operations. The drawbacks are that you need to store extra memory for the incidence table (which is a bit redundant) and the spring forces, and to update the incidence info if the graph changes (e.g. for contacts). Oh, and it might thrash the cache. Still, I don’t have a better solution for writing it from scratch.

But the good thing is that we have identified the scatter-add primitive and we can abstract it and optimize it later. In fact, some libraries offer it; for instance, I used torch-scatter for PyTorch. And to tell you a secret, you can even do it with any sparse linear algebra library out there. You just need to build the difference matrix A with -1 and 1 as the non-zero values on each row (corresponding to each spring) and then multiply its transpose with the spring forces: $f = A^T t$. This matrix is equivalent to the incidence info and you might even end up using less memory. I honestly don’t know if it’s faster or slower than the hand-crafted version. I’m just glad we have options. Additionally, some libraries offer a SIMD version of scatter write, but you would have to write the scatter-add yourself.
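Here is a sketch of the sparse-matrix route with SciPy (the function name and data layout are mine): build A once, then force assembly is a single transposed sparse matrix-vector product.

```python
import numpy as np
from scipy.sparse import csr_matrix

def difference_matrix(n_nodes, edges):
    # One row per spring: -1 at the first end-point, +1 at the second.
    m = len(edges)
    rows = np.repeat(np.arange(m), 2)
    cols = np.asarray(edges).ravel()
    vals = np.tile([-1.0, 1.0], m)
    return csr_matrix((vals, (rows, cols)), shape=(m, n_nodes))

# Assembly: t stacks the per-spring forces (one row per spring), then
# f = A^T t scatter-adds them onto the nodes.
```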
Let’s go back to the figure above and change it to make it more general. The point is that you can reuse it for all kinds of different computational codes - and not only simulation; take, for instance, computing the vertex normals of a mesh. By local quantities I mean anything that can be defined over an edge in the graph or a generalization thereof (a stencil or hyper-edge), like a triangle or tetrahedron. The best examples are finite elements - you can implement a very simple FEM simulator with only these concepts. Finite differences should work too. And you can also use the same flow to implement a Jacobi-style constraint solver. The step of computing and assembling global gradient vectors gets called many times over by more advanced solvers. This is why it’s crucial that it is as optimized as possible.
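To make the vertex-normal example concrete, here is a sketch following the same three-step flow (per-face map, scatter-add, per-vertex map); the layout, with `tris` as an (m, 3) array of vertex indices, is my assumption, and `np.add.at` again plays the scatter-add role:

```python
import numpy as np

def vertex_normals(p, tris):
    # Step 1 (per-triangle map): area-weighted face normals
    a, b, c = p[tris[:, 0]], p[tris[:, 1]], p[tris[:, 2]]
    fn = np.cross(b - a, c - a)
    # Step 2 (scatter-add): accumulate each face normal onto its 3 vertices
    vn = np.zeros_like(p)
    for i in range(3):
        np.add.at(vn, tris[:, i], fn)
    # Step 3 (per-vertex map): normalize, guarding against isolated vertices
    lens = np.linalg.norm(vn, axis=1, keepdims=True)
    return vn / np.where(lens > 0, lens, 1.0)
```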

Wednesday, May 6, 2020

Two algorithms from the '60s


The story goes that Michael (Mike) Powell was supposed to give a seminar at the University of Leeds on derivative-free optimization one day in the early 60s, probably 1963. In November 1959, William (Bill) Davidon had just released a technical report on what he called the “variable metric method”. He was a physicist at the Argonne National Laboratory who had been using gradient (and coordinate) descent methods and came to invent quasi-Newton methods. Powell had got his hands on the report in 1962, and so, it seems, had Colin Reeves, a chemist working on molecular structure problems and also the PhD advisor of Roger Fletcher at Leeds. Reeves gave the report to Fletcher, who implemented the method, as did Powell. After giving his talk, Powell was surprised to find out that Reeves already knew the method and had implemented it. They decided to work together on it, and out came the first paper in 1963 on what was later called the Davidon-Fletcher-Powell (DFP) formula. The irony is that Davidon’s own paper was rejected for publication and was finally published only in 1991, in the first issue of the SIAM Journal on Optimization.

The DFP work later led to the renowned BFGS method and the whole class of quasi-Newton algorithms. The funny thing about BFGS is that it was not a single paper with the acronym coming from the authors' initials, but actually four separate papers published independently. The matrix-free or limited-memory version of BFGS (L-BFGS) later became a very powerful and efficient method (due to Jorge Nocedal). Overall, the DFP, BFGS and similar methods brought significant improvements to the state of numerical optimization at the time, which consisted mostly of steepest descent and Newton’s method.

One has to keep in mind that numerical analysis was still incipient at the time and that software libraries were only just being written. Every university had its own clunky computer that filled entire rooms or even floors, and not even Fortran was yet standard. For instance, Davidon used block diagrams in his report and Fletcher coded in Algol. Some people used assembly, others early forms of compilers. These people were pioneers of numerical optimization, and many of them were not mathematicians per se. Colin Reeves started as a quantum chemist at Cambridge working on molecular orbitals and later became a computer scientist. Fletcher is described by chemists as a computer scientist who left molecular simulations for a career in optimization; in the end, though, he was an applied mathematician. Powell was for most of his career a mathematician working at the Atomic Energy Research Establishment at Harwell, close to Oxford; he later moved to Cambridge as a professor. Davidon was a physicist (although Fletcher majored in physics at Cambridge too) who later became known as a peace activist, anti-Vietnam-war protester and mastermind behind the FBI office burglary in 1971. Powell, on the other hand, was interested only in the mathematics and not the physics, as he admitted that he could not follow Davidon’s arguments on the variable metric inspired by general relativity and differential geometry.

Soon after the DFP work, Fletcher went on to develop the nonlinear conjugate gradient (CG) method for unconstrained nonlinear optimization. He claims that the idea was given to him by Reeves, who realized that the line search procedure could be reused inside the then-classic CG solver for linear systems. This resulted in the Fletcher-Reeves method. In fact, Fletcher is of the opinion that he was given two good ideas by other people. In this way, he helped develop two methods that have become cornerstones of numerical optimization. In a later interview, though, he shared his doubts about nonlinear CG being a reliable method for large-scale problems. But at the time, nonlinear CG and DFP/BFGS were pretty much the only efficient methods available in numerical libraries, computational quantum chemistry suites and probably other places too. And then followed augmented Lagrangian, SQP, interior point methods and all that.

Fletcher was a man who relished practical problems and considered them the drivers of the field of numerical optimization. He considered himself a physicist rather than a mathematician until the end. In an interview he gave in 2015, a year before passing away, he said the following: “I think for us in numerical analysis – in optimization in particular – because that’s the purpose of it: it’s to optimize; it’s not to produce pure mathematics... in my view, differing from other people’s view. But it’s to solve problems. And we just keep solving the same old problems – getting another 5 % here and another 5 % there: it’s a bit sterile.” I think these are very nice and true words coming from one of the giants of optimization.