John Tsiombikas nuclear@mutantstargoat.com
8 December 2013
This is a short post with my thoughts on what's the best way to design a transformation hierarchy for animation, and how the "ideal design" changes over time.
For a long time, I was a big fan of backwards (bottom-up) lazy evaluation of transformation hierarchies. Essentially having an XFormNode class for each node in the hierarchy, with a get_matrix(long msec) function which calculates the current node's transformation matrix for the current time, then calls parent->get_matrix(msec) and returns the concatenated matrix.
Matrix4x4 XFormNode::get_matrix(long msec) const
{
	Matrix4x4 xform;
	calc_matrix(&xform, msec);	// this node's local transformation at time msec

	if(parent) {
		xform = parent->get_matrix(msec) * xform;
	}
	return xform;
}
Of course, such a scheme would be wasteful if these matrices were calculated every time the get_matrix functions are called. For instance if a node is part of a hierarchy, then its get_matrix will be called when we need to draw the object corresponding to this node, and also every time the get_matrix of any one of its descendants is called, due to the recursive bottom-up evaluation of matrices outlined previously. If one considers the possibility of drawing an object multiple times for various multi-pass algorithms the problem gets worse, with the limiting worst-case scenario being ray tracing, which would require these functions to be called at the very least once per ray cast.
It follows then, that such a design goes hand in hand with lazy evaluation and caching of calculated node matrices. The XFormNode class would hold the last requested time and corresponding matrix, and when get_matrix is called, if the requested time matches the last one, we just return the cached matrix instead of recalculating it.
// cached_msec and cached_matrix are mutable members of XFormNode
const Matrix4x4 &XFormNode::get_matrix(long msec) const
{
	if(msec == cached_msec) {
		return cached_matrix;
	}

	calc_matrix(&cached_matrix, msec);
	cached_msec = msec;

	if(parent) {
		cached_matrix = parent->get_matrix(msec) * cached_matrix;
	}
	return cached_matrix;
}
This worked like magic for a long time. At any point I could just ask for the transform at time X and get the matrix automatically, including any effects of hierarchy, keyframe interpolation, etc. It was all good... until processors suddenly stopped getting any faster, Moore's law went belly up, and after the shock passed we all sooner or later realised that single-threaded graphics programs are a thing of the past.
In the brave new multithreaded world, lazy evaluation becomes a pain in the ass. The first knee-jerk reaction is to add a mutex in XFormNode, and lock the hell out of the cached matrices. And while that might be ok for an OpenGL program which won't have more than a couple of threads working with the scene database at any point (since rendering itself can only be done safely from a single thread), it throws out of the window a lot of concurrency that can take place in a raytracer, where at any point there could be 8 or more threads asking for the matrix of any arbitrary node.
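The mutex variant would look something like this (a rough sketch under my own assumptions: cache_lock would be a mutable pthread_mutex_t member alongside the cached values, and the matrix is returned by value to avoid handing out a reference that another thread could overwrite right after the unlock):
// cache_lock, cached_msec and cached_matrix are assumed mutable members of XFormNode
Matrix4x4 XFormNode::get_matrix(long msec) const
{
	pthread_mutex_lock(&cache_lock);
	if(msec != cached_msec) {
		calc_matrix(&cached_matrix, msec);
		cached_msec = msec;

		if(parent) {
			cached_matrix = parent->get_matrix(msec) * cached_matrix;
		}
	}
	Matrix4x4 res = cached_matrix;	// copy before unlocking, so the caller can't race with another update
	pthread_mutex_unlock(&cache_lock);
	return res;
}
Every thread touching any node in the hierarchy now serializes on these locks, which is exactly why this is only tolerable when very few threads walk the scene.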
A second way to deal with this issue is to have each thread keep its own copy of the matrix cache, in thread-specific memory. I'm ashamed to admit I never got around to doing any actual performance comparisons on this, though I've used it for quite some time in my programs. In theory it avoids having to wait for any other thread to access the cache, so it should be faster, but it needs a pthread_getspecific call in every get_matrix invocation, which comes with its own overhead.
const Matrix4x4 &XFormNode::get_matrix(long msec) const
{
	// cache_key created in the constructor for each node
	MatrixCache *cache = (MatrixCache*)pthread_getspecific(cache_key);
	if(!cache) {
		// first time we need a cache for this thread we'll have to create it
		cache = new MatrixCache;
		pthread_mutex_lock(&cache_list_lock);
		cache_list.push_back(cache);
		pthread_mutex_unlock(&cache_list_lock);
		pthread_setspecific(cache_key, cache);
	}

	if(msec == cache->msec) {
		return cache->matrix;
	}

	calc_matrix(&cache->matrix, msec);
	cache->msec = msec;

	if(parent) {
		cache->matrix = parent->get_matrix(msec) * cache->matrix;
	}
	return cache->matrix;
}
This works fine, and although we managed to avoid blocking concurrent use of get_matrix, we had to add some amount of overhead for the thread-specific storage calls, and the code became much more complex all over the place: invalidations must also access this thread-specific storage, we need cleanup for the per-thread MatrixCache objects, etc.
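To give an idea of the extra complexity, here is roughly what invalidation and cleanup end up looking like (a sketch under my assumptions, reusing the cache_list and cache_list_lock members from the snippet above; invalidate_cache is a hypothetical helper, not from the original code):
// hypothetical helper: mark every thread's cache for this node as stale,
// e.g. after keyframes are added or modified
void XFormNode::invalidate_cache()
{
	pthread_mutex_lock(&cache_list_lock);
	for(size_t i=0; i<cache_list.size(); i++) {
		cache_list[i]->msec = -1;	// sentinel meaning "no cached matrix"
	}
	pthread_mutex_unlock(&cache_list_lock);
}

// cleanup: the destructor has to free every per-thread cache ever created
XFormNode::~XFormNode()
{
	for(size_t i=0; i<cache_list.size(); i++) {
		delete cache_list[i];
	}
	pthread_key_delete(cache_key);
}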
So nowadays I'm starting to lean more towards the simpler, less automagic design of top-down evaluation. It boils down to just going through the hierarchy once to calculate all the matrices recursively, then at any point we can just grab the previously calculated matrix of any node and use it.
void XFormNode::eval(long msec)
{
	calc_matrix(&matrix, msec);
	if(parent) {
		matrix = parent->matrix * matrix;
	}

	for(size_t i=0; i<children.size(); i++) {
		children[i]->eval(msec);
	}
}
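In a typical OpenGL program, usage then collapses to a single eval at the start of each frame (a minimal sketch; scene_root and draw_scene are made-up names for illustration, not part of the code above):
// once per frame: evaluate the whole hierarchy top-down, then just draw
void render_frame(long msec)
{
	scene_root->eval(msec);	// fills in the matrix member of every node
	draw_scene();		// each object simply uses its node's matrix
}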
The simplicity of this two-pass approach is hard to overlook, however it's just not as good for some things as my original method. It works fine for OpenGL programs, where it suffices to calculate transformations once per frame; it even works fine for simple raytracers, where we again have a single time value for any given frame. However, it breaks down for raytracers doing distribution ray tracing for motion blur.
The best way to add motion blur to a ray tracer is through a Monte Carlo method invented by Cook, Porter and Carpenter, called "distribution ray tracing". In short, when spawning primary rays we have to choose a random time in an interval centered around the frame time and extending to the past and future, as far as dictated by the shutter speed of the camera. This time is then used both to calculate the position and direction of the ray, which thus might differ between sub-pixels if the camera is moving, and to calculate the positions of the objects we're testing for intersections against. Then, if we cast many rays per pixel and average the results, we'll get motion blur on anything that moves significantly while the virtual shutter is open (example from my old s-ray renderer).
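In code, the per-ray time jitter would be something along these lines (a rough sketch with assumed names: sample_pixel, get_primary_ray, trace_ray, Ray and Color are illustrative, not taken from s-ray; drand48 assumes a POSIX libc):
// trace one sub-pixel sample with a random time inside the shutter interval
Color sample_pixel(int x, int y, long frame_time, long shutter)
{
	// random time in [frame_time - shutter/2, frame_time + shutter/2]
	long t = frame_time + (long)((drand48() - 0.5) * shutter);

	// camera transform is also evaluated at time t, so a moving camera blurs too
	Ray ray = get_primary_ray(x, y, t);

	// intersection tests use each node's get_matrix(t), a different time per sample
	return trace_ray(ray, t);
}
Averaging many such samples per pixel gives the blur, which is exactly why every sample ends up asking the hierarchy for matrices at a different time.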
It's obvious that calculating matrices once per frame won’t cut it with advanced ray-tracers, so there's no getting rid of the complexity of the lazy bottom-up scheme in that case. Admittedly, however, caching won't do much for us either because every sub-pixel will request the matrix at a different time anyway, so we might as well just calculate matrices from scratch all the time, and skip the thread-specific access overhead. The jury is still out on that one.
Do you have a favourite design for hierarchical animation code? Feel free to share it by leaving a comment below!
This was initially posted in my old wordpress blog. Visit the original version to see any comments.