On Vector Math Libraries

December 28, 2013 · Coding, Graphics, Math · Comments

For graphics coders, vector and matrix math libraries are something we use nearly every day, and in just about every function we write. If you’ve been working in this field for long, you’ve probably used a dozen different vector libs in C++, not to mention the first-class vectors and matrices in HLSL and GLSL—and you’ve probably written your own vector lib, at least twice! It seems like every project and every coder has their own special flavor of this basic utility—nearly identical with, yet always slightly different from, all the other versions out there.

Unfortunately, despite the ubiquity of these libs and the fact that every seasoned graphics coder probably knows their vector lib better than their own spouse, still most of the vector math libs I’ve seen are pretty deficient. Sometimes these deficiencies are obvious, sometimes rather subtle, but vectors and matrices are used so often and so deeply that it’s clearly worth investing time to get them right.

In this article, I’m going to relate a number of lessons about vector/matrix lib design that I’ve taken from my ~10 years doing graphics programming (first as a hobbyist, later professionally). It’ll be pretty C++-centric, but many of these lessons can also apply to other languages. Obviously, all this stuff is ultimately just my personal opinion, and I don’t expect everyone to agree with me on every point—but I’m still going to argue that my opinions are right. ;)

Parameterize By Dimension
Parameterize By Element Type
Constructors
Operations
Row-Major, Column-Major, Row Vector, Column Vector
How To SIMD, And How Not To
It’s Affine With Me
Conclusion

Parameterize By Dimension

The first lesson is this: implement your vector and matrix structs in a generic way, so that the number of dimensions is a parameter you can fill in later. In C++, this means using templates.

template <int n>
struct Vector { float data[n]; };

template <int rows, int cols>
struct Matrix { float data[rows][cols]; };

Most libs I’ve worked with don’t do this—they have separate structs/classes for the most commonly-used vector and matrix sizes, i.e. 2-, 3-, and 4-vectors, and 3×3 and 4×4 matrices. However, there are several advantages to doing this generically:

You’ll write all your operations (add, mul, dot, transpose, etc.) generically as well, so you’re not violating DRY by re-implementing the same operations for each size.
More importantly, you’re guaranteed to have all operations available for all vector/matrix sizes. This is important. How many libs have you used where 3-vectors had the full complement of operations, but 2-vectors and 4-vectors were second-class citizens missing a bunch of functionality, which you then had to add or work around? If you do everything generically from the start, you avoid this whole issue.
You have access to arbitrary-size vectors/matrices if you need them. Sure, it may be rare verging on unheard-of for graphics coders to need more than four dimensions—but on that one day you need to solve a linear system with 6 variables, or work with a 4×2 matrix for some random reason, you’ll be able to do it easily.

Now, there are a couple of disadvantages to the template approach. First, no one wants to write Vector<3> or Matrix<4, 4> instead of Vec3 or Mat44. But this can be easily addressed with a few typedefs:

typedef Vector<2> Vec2;
typedef Vector<3> Vec3;
typedef Vector<4> Vec4;
typedef Matrix<3, 3> Mat33;
typedef Matrix<4, 4> Mat44;

By all means, create typedefs for the most common sizes. You’ll still need the template notation for the weird sizes, but they’re rare, so that’s OK.

The other disadvantage is that you have to access components using subscript notation, like myVec[0] instead of myVec.x. (Assuming you’ve implemented operator[], which of course you have, right?) However, we can recover named components for 2-, 3-, and 4-vectors using template specialization:

// Generic vector
template <int n> struct Vector { float data[n]; };

// Specializations for n = 2, 3, 4
template <> struct Vector<2> { union { float data[2]; struct { float x, y; };       }; };
template <> struct Vector<3> { union { float data[3]; struct { float x, y, z; };    }; };
template <> struct Vector<4> { union { float data[4]; struct { float x, y, z, w; }; }; };

It’s handy for the specializations to still have the same data member that’s defined in the generic version, as generic operations will make use of it. So, we use anonymous structs/unions to give names to the components (they work on all the compilers/platforms you care about). It’s also possible to implement accessor functions like x()—but that syntax is too clunky to tolerate.

For the same reason, the members of the vector should be public—data encapsulation can be good sometimes, but this is not the place for it. Don’t make getter/setter methods for vector components. As you’ve seen in the code samples here, I’m declaring things as structs instead of classes, so everything’s public.

A caveat about the specializations is that you’ll have to re-implement any constructors, methods and overloaded operators in each specialization—the specializations don’t inherit them from the generic case. This is a good reason to implement as many operations as possible as free functions instead of putting them in the struct. The only things that really have to be in the struct are constructors, implicit conversion operators, and operator[], since C++ doesn’t let you write those as free functions. But more about that later.

While we’re at it, in our template specializations we can also support rgba component names as well as xyzw. We can also add some multi-component names, like xyz:

template <>
struct Vector<4> {
    union {
        float data[4];
        struct { float x, y, z, w; };
        struct { float r, g, b, a; };
        Vector<2> xy;
        Vector<3> xyz;
        Vector<3> rgb;
    };
};

Note that prior to C++11, using vectors for multi-component names in the union is only possible if you don’t have any constructors in your vector type. In C++11, types with constructors can be put in a union. If you’re not using C++11, I think it’s fine to use member functions for these, like:

template <>
struct Vector<4> {
    ...
    Vector<3> & xyz() { return reinterpret_cast<Vector<3> &>(data); }
    const Vector<3> & xyz() const { return reinterpret_cast<const Vector<3> &>(data); }
};

It’s also possible, with heroic efforts, to make all possible multi-component swizzles work—but IMO it’s not worth the complexity, given how rarely you need anything beyond .xy or .xyz.

Parameterize By Element Type

Just as you should parameterize vector/matrix structs by size, you should also parameterize them by their element type.

template <typename T, int n>
struct Vector { T data[n]; };

template <typename T, int rows, int cols>
struct Matrix { T data[rows][cols]; };

Again, most libs I’ve worked with don’t do this. The argument here is similar to the previous one. Although you might be using float vectors 99% of the time, there are some places you’ll want vectors of ints—for example, to calculate integer pixel coordinates for UI elements, to work with 2D or 3D grids, or to represent coordinates in fixed-point format. It’s nice if all the same operations you have for float vectors also “just work” with int vectors. If you don’t parameterize by element type, int vectors are likely to end up as second-class citizens with missing functionality.

Similarly, if you’re doing graphics programming then sooner or later you’ll need to work with half-precision floats, and if you have a half-float class already, making vectors of them will be trivial. And, while it’s not incredibly common, it can be quite convenient to have comparison operators that work componentwise and return a vector of bools. Don’t forget about byte vectors for 8-bit RGB and RGBA tuples, either!

Just as before, you can create typedefs for the most common cases, so you can write float3 and int3 (or vec3 and ivec3, if that’s your thing) instead of Vector<float, 3> and Vector<int, 3> all the time.

Constructors

Before discussing constructors themselves, I should mention that prior to C++11, there’s a fairly big downside to having any constructors in vector/matrix structs: it means you can’t put them in a union. As mentioned earlier, in C++11 this restriction is lifted; but you might not be using a C++11-capable compiler yet. (Visual Studio, in particular, still doesn’t support unrestricted unions even in the November 2013 CTP, the latest version as of this writing.) There are a few options available to work around the union issue:

Don’t put constructors in your vector structs; use free “maker” functions instead. So in place of float3 v(x, y, z) you’d write float3 v = makefloat3(x, y, z). This solves the problem, but comes at the cost of some syntax bloat.
Split your vector struct into two: one that has just the data and can be used in unions, and another that inherits from it and adds all the constructors. You can add implicit conversions to try to make things reasonably seamless, but you’ll still have to convert explicitly in some cases.
Don’t use vectors in unions; use C arrays instead, with explicit conversion each time the union member is read or written.

All of these solutions are workable, in my opinion, but none of them is great. The only real solution is to use C++11.

Now, on to the constructors themselves. Most vector math libs I’ve worked with do a pretty good job with constructors, so I just have a few points to touch on here.

Use explicit when writing constructors that take a single parameter, unless implicit conversions are specifically desired. (This applies to everything in C++, not just vectors/matrices.)
Provide a constructor that fills a whole vector with a single value.
Provide vector constructors that take the most common combinations of vector and scalar values (for instance, a 4-vector constructor taking a 3-vector for xyz and a scalar for w). As with swizzles, it’s possible to heroically support every possible combination, but it’s really overkill and you’ll never use most of them.
For matrices, provide an easy way to get zero and identity matrices.
For matrices (at least the most common sizes), please do provide componentwise constructors. Everyone has these for vectors, but not so often for matrices; still, they do come in handy now and then (projection matrices, for example).
If you’re using C++11, consider supporting array initialization syntax (by adding a constructor from std::initializer_list<T>), as this solves the componentwise issue nicely.
Provide ways to build a matrix out of vectors, both row-wise and column-wise. These belong as free functions, not constructors, though, due to the ambiguity of what’s meant when you just see Matrix(vec0, vec1, vec2, vec3).
Don’t use constructors for operations more complex than just filling in components—for example, building a rotation matrix from Euler angles or a quaternion.
Provide constructors that take an array of floats, as that will ease conversion between other math libs’ vector/matrix types and yours.
If you’re using C++11, write constexpr constructors where applicable, to hint to the compiler that it can optimize the constructor away.

Operations

By “operations” I mean both overloaded operators, and other non-operator functions that do stuff to vectors and matrices—such as dot, cross, normalize, transpose, inverse, and so on. All these functions will have to be templatized as well, to match the templatized vector/matrix structs from the previous sections.

There will be a few operations that carry restrictions on the dimensions they work with: the cross product only makes sense for 3-vectors, and only square matrices can be inverted, for example. These restrictions can and should be expressed in template language, so that errors are found at compile time. Don’t use runtime asserts for this sort of thing.

Overloaded operators get a lot of flak (rightly), but a vector math lib is one of the few places where they are fine and good and you should go to town with them! No one wants to read code like add(mul(v, a), mul(w, 1-a)) when you could have v*a + w*(1-a) instead. However, overloaded operators should only be used for strictly componentwise operations—add, multiply, etc. but not dot, cross and suchlike. I think this is pretty uncontroversial these days.

The one exception I make here is for matrix multiplication. A lot of libraries (including HLSL) require you to use a separate mul function for multiplying two matrices, or a matrix and a vector. Personally, I think it’s fine to use operator * for this. It’s well-established in math notation that matrix multiplication is written just like regular multiplication. Besides, in ~10 years of graphics programming I’ve never once needed to componentwise-multiply two matrices. 😉

Other recommendations:

Provide a complete set of componentwise overloaded operators, including componentwise multiplication/division, all vector-scalar operations (including addition/subtraction, dividing scalar by vector, etc.), and compound assignment operators.
Non-operator functions should be free functions, not methods. In other words, I want to write dot(a, b), not the incredibly awkward and asymmetrical a.dot(b).
Operations should always return their result, not act in-place.
Don’t forget to support componentwise min, max, abs, clamp, saturate, and lerp (or mix if you prefer). It’s also handy to have minComponent and maxComponent, for both vectors and matrices.
Cross-compatibility with other math libs and APIs is important, so make it easy to get raw access to a vector or matrix as a pointer or C array, ideally with an implicit conversion.
As mentioned earlier, it’s handy to have comparison operators that operate componentwise and return vectors of bools. You’ll also need all() and any() functions to AND or OR together (respectively) all the elements; another handy feature is a select() function that acts like a componentwise ?: operator.

Row-Major, Column-Major, Row Vector, Column Vector

This topic has been discussed at length elsewhere, so I won’t belabor it. I wish I could make an argument that some choices of these conventions are better or more standard than others, but in honesty I can’t. So, use whatever conventions you find most cromulent!

Note that at the most basic level of the lib, you don’t need to make a choice between row vectors and column vectors, as you can provide multiplication operators for both vector-matrix and matrix-vector combinations. However, once you start writing routines to generate rotation matrices and suchlike, you’ll have to pick a side.

How To SIMD, And How Not To

You may have noticed that in a lot of vector math libraries, the definition of a vector does not look like this:

struct Vector { float x, y, z, w; };

Instead it looks like this:

struct Vector { __m128 data; };

The antipattern of using hardware SIMD vectors to implement xyzw vectors is unfortunately common. It may have made sense at one point in time, but it’s outlived its usefulness, and vector libs should not be written this way now.

To see why, let’s look at why people started writing vector math libs using this “naive SIMD” approach, where a hardware vector register holds the xzyw components of a geometric vector. Of course, it was to make math code faster, but there are two main reasons I’m aware of why using naive SIMD would have been faster than the alternative (on x86 CPUs) of scalar x87 FPU code:

You can operate on (up to) four components at once. For instance, if you want to add a couple of 4-vectors componentwise, you can do it with one instruction, and that would be faster than four scalar adds in a row.
You can use the much nicer SSE instruction set, rather than the weird old x87 FPU instruction set with its inefficiencies (and weirdnesses like 80-bit intermediate precision). This would lead to faster code overall.

The hope is that building SIMD into your vector math library, which you use throughout your codebase, would give you these speedups for free versus writing the math lib in scalar form. However, this assumption needs to be reassessed today.

Let me address the second reason first. Modern compilers are quite capable of generating code that uses SSE instead of the x87 FPU for scalar floating-point math code on x86. SSE2 penetration is essentially 100% today, and both Visual Studio 2012 and clang 3.3 generate SSE2 instructions for scalar math on the default settings; so does gcc 4.8 for 64-bit code, but not for 32-bit code (I’m not sure why). For 32-bit code, gcc can be persuaded to use SSE with the command-line arguments -msse2 -mfpmath=sse.

Note that I’m not talking about auto-vectorization: this is still scalar code, using only one of the four SIMD vector slots. However, you still reap the benefits of the SSE2 instruction set over the x87 FPU.

As for the first reason—the idea that you can get a speedup by operating on several components at once—well, this is certainly true, if your components are already organized into SIMD vectors in the right way, and you don’t need to do too much rearrangement to get data into the right shape for your operations. Unfortunately, with naive SIMD this often fails to be the case.

As a test of naive SIMD versus scalar performance, I wrote a quick benchmarking app (source here) to measure a couple of small tasks that I think are pretty representative of what game code written with a typical vector math lib looks like. No special attention has been given to optimization; the code is just written in a straightforward style, one version using scalar float math, and one using naive SIMD math. The two tasks are:

Computing a world-to-camera matrix (aka view matrix) from yaw, pitch, and camera position. Involves a little trig and a little vector math.
Transforming a batch of 1000 bounding boxes by 1000 corresponding matrices, and computing new bounds for the transformed corners. Involves constructing the AABB corners, and some vector math.

Each of the tasks was repeated a large number of times to get timings large enough to measure precisely, and then I took the minimum timing over several runs. Here are the results (VS 2012, Core i5 at 2.9 GHz):

		Scalar	Naive SIMD	Optimized SIMD
World-to-camera matrix	32-bit	941.4 ms	981.3 ms
World-to-camera matrix	64-bit	396.8 ms	384.4 ms
Transform AABBs	32-bit	698.8 ms	1050.7 ms	142.3 ms
Transform AABBs	64-bit	695.1 ms	1047.2 ms	100.0 ms

For the transform-AABBs task I also implemented an optimized SIMD version—more about that in a bit.

You can see from these results that using naive SIMD didn’t help much even in the best case (64-bit world-to-camera), and actually hurt in the other cases. I’m not going to analyze the generated code closely, but basically what’s going on is that we have to spend too much time moving data around (often via memory) to get it into the right shape for SIMD vector operations to work on it. The overhead negates all the benefit from the 4-wide vector operations, and the scalar code is actually faster.

Now, these are just a couple of quick example routines, and you probably shouldn’t draw too many conclusions from this. But it illustrates how naive SIMD in your vector math lib can actually hurt you more than help you, and I suspect you’re generally better off sticking to scalar math. I’d love to see some data on this from a real project, though.

Does all this mean you shouldn’t ever use SIMD? Of course not! It’s just that you need to be judicious about it: pick the spots where SIMD can make a big difference (because you have to do the same operations on a lot of data), then organize the data so it’s SIMD-friendly, and write the code to use SIMD explicitly.

Organizing the data so it’s SIMD-friendly means using a struct-of-arrays (SOA) layout rather than the array-of-structs (AOS) layout that I’ve been calling “naive SIMD”. In other words, instead of making a hardware vector register hold the xyzw components of a geometric vector, you make it hold the x-components of four different vectors. Another register holds four y-components, another holds four z-components, etc. This has various advantages:

You get much better utilization of the four slots in the vector registers. With naive SIMD, you couldn’t always do this; if you were working with 3-vectors, for instance, you’d leave one of the slots idle and not doing anything useful. With SOA, working with 3-vectors means you use fewer registers and execute fewer instructions than with 4-vectors, and you’re still using all four slots of the vector registers.
Similarly, you’re well-placed to expand to SIMD architectures that are more than 4-wide, such as AVX or GPUs—which you couldn’t do at all with naive SIMD, since you almost never use geometric vectors of more than four components.
You generally have much less shuffling to do. For instance, doing a dot product of two geometric vectors doesn’t require any shuffling. If the rest of your application doesn’t use SOA layout, you’ll have to do some shuffling to get data in and out of SOA on entry and exit of the optimized-SIMD parts. However, you can often keep internal buffers in SOA layout throughout their lifetime.

I implemented a version of the transform-AABBs task with everything in SOA layout, and (as shown by the results in the above table), it was 5× faster than the scalar version in 32-bit code, and 7× faster in 64-bit code. I only used 4-wide SIMD, yet we got more than a 4× speedup, probably due to some combination of better instruction scheduling and better memory access patterns. Compare this to the fact that we got significant slowdowns in this same task when coding it in naive SIMD.

So what are the implications of this for vector math lib design? First of all, stick to scalar math and don’t try to use the naive, AOS style of SIMD—it doesn’t actually help, and can even hurt you. Second, consider providing a separate SOA math library, for those places where optimized SIMD is needed. You can also just use SIMD intrinsics directly in those places (or wrappers around platform-specific SIMD intrinsics), which might be clearer.

One potentially interesting idea: since we already have a vector lib parameterized by element type, you could use a SIMD vector as the element type. For example, Vector<__m128, 3> would be a collection of four 3-vectors in SOA layout. If you define overloaded operators for __m128, such as

__m128 operator + (__m128 a, __m128 b)
{
    return _mm_add_ps(a, b);
}

then you can write SOA SIMD code in a very similar style to your scalar code.

Arrays of Vector<__m128, 3> and similar types wouldn’t quite be in SOA layout; they’d be in an “AOSOA” (array of struct of arrays) format. However, this might well give you most of the memory locality benefits of SOA; I haven’t tried it.

It’s Affine With Me

Vector/matrix math libraries implement the primitives and operations of linear algebra, which deals with vectors—quantities with a direction in space and a magnitude. However, there’s another geometric primitive that’s equally useful: the point. Points belong to the branch of mathematics called affine geometry, and they are not the same thing as vectors—vectors have a direction and a magnitude, while points have neither, except relative to another point.

Affine geometry expresses the notion that the choice of origin point in a space is arbitrary. This is very similar to how in linear algebra, the choice of basis vectors (coordinate axes) is arbitrary. You can represent the same abstract vector (say, the gravity vector) in different bases (say, local space and world space), and it’ll have different component values. However, a vector space always has a fixed origin point—a gravity vector of zero means the same thing, whichever basis you’re in.

In contrast, points live in an affine space, where there is no fixed origin point. While vector spaces model things like gravity vectors, for which zero has a special meaning, affine spaces model things like the positions of objects, where no point is special. A coordinate system in an affine space consists of a choice of origin point and basis vectors. Of course, once we have a coordinate system set up around a certain origin, we then represent other points by creating vectors to them from the origin, and in turn representing those vectors as components with respect to a basis.

Many vector/matrix libs don’t bother to make the distinction, using the same type to store both points and vectors. However, it’s worth considering making this distinction explicit by creating separate types for the two, because points and vectors actually have different operations:

Vectors can be added to each other, and multiplied by scalars. Points can’t be (in a way that means anything geometrically, outside of a particular coordinate system).
You can subtract two points, and you get the vector (translation) between them.
You can add a vector to a point to get a new point (i.e. translate the point).
When you apply an affine transformation to a point, translation should be included; but when applied to a vector, the translation should be ignored.

Likewise, affine spaces have their own set of transformations. A linear transformation on a vector space can be represented (in a particular basis) as a matrix; an affine transformation can be represented (in a particular coordinate system) as a matrix together with a translation vector. Affine transformations can be composed and inverted while remaining in this form. If points and vectors are different types, matrices and affine transformations should be separate as well.

The argument here is the same as always for strong typing: if you make the point/vector distinction known to the type system, it will prevent you from accidentally doing something that doesn’t make sense. Just as the compiler won’t let you add an integer to a string, it won’t let you add two points.

Of course, when taken too far, strong types become more of a hindrance than a help; it’s a matter of opinion where to draw the line. A good naming convention can often get you most of the benefits of strong types, as long as you adhere to it. I’ve worked with vector math libs both with and without built-in affine math support, and while I slightly prefer affine math to be supported, I can understand choosing to forego this feature.

If you do choose to implement affine math, there are two main components: a Point struct, which will be very similar to the Vector one but with only the operations that make sense for points, and an Affine struct representing an affine transformation, which will contain a Matrix for the linear part and a Vector for the translation. A few other guidelines:

Provide explicit conversions between points and vectors, for those (rare) cases where you need to break out of the type system.
Provide an overloaded operator * or mul that takes an Affine and either a point or a vector, and transforms it appropriately.
Provide an operation to convert a 3D Affine into the corresponding 4×4 matrix for homogeneous coordinates, and vice versa.

A final note: it’s also possible to embed affine math into homogeneous coordinates, by setting w = 1 for points, w = 0 for vectors. This makes them behave correctly with regard to addition/subtraction and affine transformations (embedded in 4×4 matrices), but it doesn’t have the benefit of actually restricting you from performing illegal operations like adding two points, and it wastes memory and time on a w-component that’s really a compile-time constant. It’s also not quite kosher, mathematically speaking, since it overloads the homogeneous coordinate system, using w = 0 to represent both vectors and projective points at infinity—which are different things.

Conclusion

Whew! We covered a lot of ground in this article, so let me finish by summarizing the most important lessons of the previous sections.

Parameterize vector/matrix structs by element type and dimension.
Provide specializations and typedefs for the most common types and sizes, for convenience.
Provide a nice, full set of constructors, overloaded operators and free functions (like dot, cross, etc). Seriously, go crazy with these and implement everything you can think of.
Make a decision about row-major vs. column-major and row-vector vs. column-vector, and stick to it.
Don’t try to build SIMD into your general-purpose vector/matrix structs. Do use SIMD judiciously, where it’s effective, and keep the relevant data in SOA layout.
Consider adding type-safe affine math (points and affine transformations) on top of the basic linear algebra features.

That’s it! Now go forth and build great vector math libraries! 😄

Nathan Reed