# On Vector Math Libraries

December 28, 2013 · Coding, Graphics, Math · 20 comments

For graphics coders, vector and matrix math libraries are something we use nearly every day, and in just about every function we write. If you’ve been working in this field for long, you’ve probably used a dozen different vector libs in C++, not to mention the first-class vectors and matrices in HLSL and GLSL—and you’ve probably written your own vector lib, at least twice! It seems like every project and every coder has their own special flavor of this basic utility—nearly identical with, yet always slightly different from, all the other versions out there.

Unfortunately, despite the ubiquity of these libs and the fact that every seasoned graphics coder probably knows their vector lib better than their own spouse, still most of the vector math libs I’ve seen are pretty deficient. Sometimes these deficiencies are obvious, sometimes rather subtle, but vectors and matrices are used so often and so deeply that it’s clearly worth investing time to get them right.

In this article, I’m going to relate a number of lessons about vector/matrix lib design that I’ve taken from my ~10 years doing graphics programming (first as a hobbyist, later professionally). It’ll be pretty C++-centric, but many of these lessons can also apply to other languages. Obviously, all this stuff is ultimately just my personal opinion, and I don’t expect everyone to agree with me on every point—but I’m still going to argue that my opinions are right. ;)

## Parameterize By Dimension

The first lesson is this: implement your vector and matrix structs in a generic way, so that the number of dimensions is a parameter you can fill in later. In C++, this means using templates.

template <int n> struct Vector { float data[n]; };   template <int rows, int cols> struct Matrix { float data[rows][cols]; };

Most libs I’ve worked with don’t do this—they have separate structs/classes for the most commonly-used vector and matrix sizes, i.e. 2-, 3-, and 4-vectors, and 3×3 and 4×4 matrices. However, there are several advantages to doing this generically:

• You’ll write all your operations (add, mul, dot, transpose, etc.) generically as well, so you’re not violating DRY by re-implementing the same operations for each size.
• More importantly, you’re guaranteed to have all operations available for all vector/matrix sizes. This is important. How many libs have you used where 3-vectors had the full complement of operations, but 2-vectors and 4-vectors were second-class citizens missing a bunch of functionality, which you then had to add or work around? If you do everything generically from the start, you avoid this whole issue.
• You have access to arbitrary-size vectors/matrices if you need them. Sure, it may be rare verging on unheard-of for graphics coders to need more than four dimensions—but on that one day you need to solve a linear system with 6 variables, or work with a 4×2 matrix for some random reason, you’ll be able to do it easily.

Now, there are a couple of disadvantages to the template approach. First, no one wants to write Vector<3> or Matrix<4, 4> instead of Vec3 or Mat44. But this can be easily addressed with a few typedefs:

typedef Vector<2> Vec2; typedef Vector<3> Vec3; typedef Vector<4> Vec4; typedef Matrix<3, 3> Mat33; typedef Matrix<4, 4> Mat44;

By all means, create typedefs for the most common sizes. You’ll still need the template notation for the weird sizes, but they’re rare, so that’s OK.

The other disadvantage is that you have to access components using subscript notation, like myVec[0] instead of myVec.x. (Assuming you’ve implemented operator[], which of course you have, right?) However, we can recover named components for 2-, 3-, and 4-vectors using template specialization:

// Generic vector template <int n> struct Vector { float data[n]; };   // Specializations for n = 2, 3, 4 template <> struct Vector<2> { union { float data[2]; struct { float x, y; }; }; }; template <> struct Vector<3> { union { float data[3]; struct { float x, y, z; }; }; }; template <> struct Vector<4> { union { float data[4]; struct { float x, y, z, w; }; }; };

It’s handy for the specializations to still have the same data member that’s defined in the generic version, as generic operations will make use of it. So, we use anonymous structs/unions to give names to the components (they work on all the compilers/platforms you care about). It’s also possible to implement accessor functions like x()—but that syntax is too clunky to tolerate.

For the same reason, the members of the vector should be public—data encapsulation can be good sometimes, but this is not the place for it. Don’t make getter/setter methods for vector components. As you’ve seen in the code samples here, I’m declaring things as structs instead of classes, so everything’s public.

A caveat about the specializations is that you’ll have to re-implement any constructors, methods and overloaded operators in each specialization—the specializations don’t inherit them from the generic case. This is a good reason to implement as many operations as possible as free functions instead of putting them in the struct. The only things that really have to be in the struct are constructors, implicit conversion operators, and operator[], since C++ doesn’t let you write those as free functions. But more about that later.

While we’re at it, in our template specializations we can also support rgba component names as well as xyzw. We can also add some multi-component names, like xyz:

template <> struct Vector<4> { union { float data[4]; struct { float x, y, z, w; }; struct { float r, g, b, a; }; Vector<2> xy; Vector<3> xyz; Vector<3> rgb; }; };

Note that prior to C++11, using vectors for multi-component names in the union is only possible if you don’t have any constructors in your vector type. In C++11, types with constructors can be put in a union. If you’re not using C++11, I think it’s fine to use member functions for these, like:

template <> struct Vector<4> { ... Vector<3> & xyz() { return reinterpret_cast<Vector<3> &>(data); } const Vector<3> & xyz() const { return reinterpret_cast<const Vector<3> &>(data); } };

It’s also possible, with heroic efforts, to make all possible multi-component swizzles work—but IMO it’s not worth the complexity, given how rarely you need anything beyond .xy or .xyz.

## Parameterize By Element Type

Just as you should parameterize vector/matrix structs by size, you should also parameterize them by their element type.

template <typename T, int n> struct Vector { T data[n]; };   template <typename T, int rows, int cols> struct Matrix { T data[rows][cols]; };

Again, most libs I’ve worked with don’t do this. The argument here is similar to the previous one. Although you might be using float vectors 99% of the time, there are some places you’ll want vectors of ints—for example, to calculate integer pixel coordinates for UI elements, to work with 2D or 3D grids, or to represent coordinates in fixed-point format. It’s nice if all the same operations you have for float vectors also “just work” with int vectors. If you don’t parameterize by element type, int vectors are likely to end up as second-class citizens with missing functionality.

Similarly, if you’re doing graphics programming then sooner or later you’ll need to work with half-precision floats, and if you have a half-float class already, making vectors of them will be trivial. And, while it’s not incredibly common, it can be quite convenient to have comparison operators that work componentwise and return a vector of bools. Don’t forget about byte vectors for 8-bit RGB and RGBA tuples, either!

Just as before, you can create typedefs for the most common cases, so you can write float3 and int3 (or vec3 and ivec3, if that’s your thing) instead of Vector<float, 3> and Vector<int, 3> all the time.

## Constructors

Before discussing constructors themselves, I should mention that prior to C++11, there’s a fairly big downside to having any constructors in vector/matrix structs: it means you can’t put them in a union. As mentioned earlier, in C++11 this restriction is lifted; but you might not be using a C++11-capable compiler yet. (Visual Studio, in particular, still doesn’t support unrestricted unions even in the November 2013 CTP, the latest version as of this writing.) There are a few options available to work around the union issue:

• Don’t put constructors in your vector structs; use free “maker” functions instead. So in place of float3 v(x, y, z) you’d write float3 v = makefloat3(x, y, z). This solves the problem, but comes at the cost of some syntax bloat.
• Split your vector struct into two: one that has just the data and can be used in unions, and another that inherits from it and adds all the constructors. You can add implicit conversions to try to make things reasonably seamless, but you’ll still have to convert explicitly in some cases.
• Don’t use vectors in unions; use C arrays instead, with explicit conversion each time the union member is read or written.

All of these solutions are workable, in my opinion, but none of them is great. The only real solution is to use C++11.

Now, on to the constructors themselves. Most vector math libs I’ve worked with do a pretty good job with constructors, so I just have a few points to touch on here.

• Use explicit when writing constructors that take a single parameter, unless implicit conversions are specifically desired. (This applies to everything in C++, not just vectors/matrices.)
• Provide a constructor that fills a whole vector with a single value.
• Provide vector constructors that take the most common combinations of vector and scalar values (for instance, a 4-vector constructor taking a 3-vector for xyz and a scalar for w). As with swizzles, it’s possible to heroically support every possible combination, but it’s really overkill and you’ll never use most of them.
• For matrices, provide an easy way to get zero and identity matrices.
• For matrices (at least the most common sizes), please do provide componentwise constructors. Everyone has these for vectors, but not so often for matrices; still, they do come in handy now and then (projection matrices, for example).
• If you’re using C++11, consider supporting array initialization syntax (by adding a constructor from std::initializer_list<T>), as this solves the componentwise issue nicely.
• Provide ways to build a matrix out of vectors, both row-wise and column-wise. These belong as free functions, not constructors, though, due to the ambiguity of what’s meant when you just see Matrix(vec0, vec1, vec2, vec3).
• Don’t use constructors for operations more complex than just filling in components—for example, building a rotation matrix from Euler angles or a quaternion.
• Provide constructors that take an array of floats, as that will ease conversion between other math libs’ vector/matrix types and yours.
• If you’re using C++11, write constexpr constructors where applicable, to hint to the compiler that it can optimize the constructor away.

## Operations

By “operations” I mean both overloaded operators, and other non-operator functions that do stuff to vectors and matrices—such as dot, cross, normalize, transpose, inverse, and so on. All these functions will have to be templatized as well, to match the templatized vector/matrix structs from the previous sections.

There will be a few operations that carry restrictions on the dimensions they work with: the cross product only makes sense for 3-vectors, and only square matrices can be inverted, for example. These restrictions can and should be expressed in template language, so that errors are found at compile time. Don’t use runtime asserts for this sort of thing.

Overloaded operators get a lot of flak (rightly), but a vector math lib is one of the few places where they are fine and good and you should go to town with them! No one wants to read code like add(mul(v, a), mul(w, 1-a)) when you could have v*a + w*(1-a) instead. However, overloaded operators should only be used for strictly componentwise operations—add, multiply, etc. but not dot, cross and suchlike. I think this is pretty uncontroversial these days.

The one exception I make here is for matrix multiplication. A lot of libraries (including HLSL) require you to use a separate mul function for multiplying two matrices, or a matrix and a vector. Personally, I think it’s fine to use operator * for this. It’s well-established in math notation that matrix multiplication is written just like regular multiplication. Besides, in ~10 years of graphics programming I’ve never once needed to componentwise-multiply two matrices. :)

Other recommendations:

• Provide a complete set of componentwise overloaded operators, including componentwise multiplication/division, all vector-scalar operations (including addition/subtraction, dividing scalar by vector, etc.), and compound assignment operators.
• Non-operator functions should be free functions, not methods. In other words, I want to write dot(a, b), not the incredibly awkward and asymmetrical a.dot(b).
• Operations should always return their result, not act in-place.
• Don’t forget to support componentwise min, max, abs, clamp, saturate, and lerp (or mix if you prefer). It’s also handy to have minComponent and maxComponent, for both vectors and matrices.
• Cross-compatibility with other math libs and APIs is important, so make it easy to get raw access to a vector or matrix as a pointer or C array, ideally with an implicit conversion.
• As mentioned earlier, it’s handy to have comparison operators that operate componentwise and return vectors of bools. You’ll also need all() and any() functions to AND or OR together (respectively) all the elements; another handy feature is a select() function that acts like a componentwise ?: operator.

## Row-Major, Column-Major, Row Vector, Column Vector

This topic has been discussed at length elsewhere, so I won’t belabor it. I wish I could make an argument that some choices of these conventions are better or more standard than others, but in honesty I can’t. So, use whatever conventions you find most cromulent!

Note that at the most basic level of the lib, you don’t need to make a choice between row vectors and column vectors, as you can provide multiplication operators for both vector-matrix and matrix-vector combinations. However, once you start writing routines to generate rotation matrices and suchlike, you’ll have to pick a side.

## How To SIMD, And How Not To

You may have noticed that in a lot of vector math libraries, the definition of a vector does not look like this:

struct Vector { float x, y, z, w; };

struct Vector { __m128 data; };

The antipattern of using hardware SIMD vectors to implement xyzw vectors is unfortunately common. It may have made sense at one point in time, but it’s outlived its usefulness, and vector libs should not be written this way now.

To see why, let’s look at why people started writing vector math libs using this “naive SIMD” approach, where a hardware vector register holds the xzyw components of a geometric vector. Of course, it was to make math code faster, but there are two main reasons I’m aware of why using naive SIMD would have been faster than the alternative (on x86 CPUs) of scalar x87 FPU code:

• You can operate on (up to) four components at once. For instance, if you want to add a couple of 4-vectors componentwise, you can do it with one instruction, and that would be faster than four scalar adds in a row.
• You can use the much nicer SSE instruction set, rather than the weird old x87 FPU instruction set with its inefficiencies (and weirdnesses like 80-bit intermediate precision). This would lead to faster code overall.

The hope is that building SIMD into your vector math library, which you use throughout your codebase, would give you these speedups for free versus writing the math lib in scalar form. However, this assumption needs to be reassessed today.

Let me address the second reason first. Modern compilers are quite capable of generating code that uses SSE instead of the x87 FPU for scalar floating-point math code on x86. SSE2 penetration is essentially 100% today, and both Visual Studio 2012 and clang 3.3 generate SSE2 instructions for scalar math on the default settings; so does gcc 4.8 for 64-bit code, but not for 32-bit code (I’m not sure why). For 32-bit code, gcc can be persuaded to use SSE with the command-line arguments -march=native -mfpmath=sse.

Note that I’m not talking about auto-vectorization: this is still scalar code, using only one of the four SIMD vector slots. However, you still reap the benefits of the SSE2 instruction set over the x87 FPU.

As for the first reason—the idea that you can get a speedup by operating on several components at once—well, this is certainly true, if your components are already organized into SIMD vectors in the right way, and you don’t need to do too much rearrangement to get data into the right shape for your operations. Unfortunately, with naive SIMD this often fails to be the case.

As a test of naive SIMD versus scalar performance, I wrote a quick benchmarking app (source here) to measure a couple of small tasks that I think are pretty representative of what game code written with a typical vector math lib looks like. No special attention has been given to optimization; the code is just written in a straightforward style, one version using scalar float math, and one using naive SIMD math. The two tasks are:

• Computing a world-to-camera matrix (aka view matrix) from yaw, pitch, and camera position. Involves a little trig and a little vector math.
• Transforming a batch of 1000 bounding boxes by 1000 corresponding matrices, and computing new bounds for the transformed corners. Involves constructing the AABB corners, and some vector math.

Each of the tasks was repeated a large number of times to get timings large enough to measure precisely, and then I took the minimum timing over several runs. Here are the results (VS 2012, Core i5 at 2.9 GHz):

Scalar Naive SIMD Optimized SIMD
World-to-camera matrix 32-bit 941.4 ms 981.3 ms
64-bit 396.8 ms 384.4 ms
Transform AABBs 32-bit 698.8 ms 1050.7 ms 142.3 ms
64-bit 695.1 ms 1047.2 ms 100.0 ms

For the transform-AABBs task I also implemented an optimized SIMD version—more about that in a bit.

You can see from these results that using naive SIMD didn’t help much even in the best case (64-bit world-to-camera), and actually hurt in the other cases. I’m not going to analyze the generated code closely, but basically what’s going on is that we have to spend too much time moving data around (often via memory) to get it into the right shape for SIMD vector operations to work on it. The overhead negates all the benefit from the 4-wide vector operations, and the scalar code is actually faster.

Now, these are just a couple of quick example routines, and you probably shouldn’t draw too many conclusions from this. But it illustrates how naive SIMD in your vector math lib can actually hurt you more than help you, and I suspect you’re generally better off sticking to scalar math. I’d love to see some data on this from a real project, though.

Does all this mean you shouldn’t ever use SIMD? Of course not! It’s just that you need to be judicious about it: pick the spots where SIMD can make a big difference (because you have to do the same operations on a lot of data), then organize the data so it’s SIMD-friendly, and write the code to use SIMD explicitly.

Organizing the data so it’s SIMD-friendly means using a struct-of-arrays (SOA) layout rather than the array-of-structs (AOS) layout that I’ve been calling “naive SIMD”. In other words, instead of making a hardware vector register hold the xyzw components of a geometric vector, you make it hold the x-components of four different vectors. Another register holds four y-components, another holds four z-components, etc. This has various advantages:

• You get much better utilization of the four slots in the vector registers. With naive SIMD, you couldn’t always do this; if you were working with 3-vectors, for instance, you’d leave one of the slots idle and not doing anything useful. With SOA, working with 3-vectors means you use fewer registers and execute fewer instructions than with 4-vectors, and you’re still using all four slots of the vector registers.
• Similarly, you’re well-placed to expand to SIMD architectures that are more than 4-wide, such as AVX or GPUs—which you couldn’t do at all with naive SIMD, since you almost never use geometric vectors of more than four components.
• You generally have much less shuffling to do. For instance, doing a dot product of two geometric vectors doesn’t require any shuffling. If the rest of your application doesn’t use SOA layout, you’ll have to do some shuffling to get data in and out of SOA on entry and exit of the optimized-SIMD parts. However, you can often keep internal buffers in SOA layout throughout their lifetime.

I implemented a version of the transform-AABBs task with everything in SOA layout, and (as shown by the results in the above table), it was 5× faster than the scalar version in 32-bit code, and 7× faster in 64-bit code. I only used 4-wide SIMD, yet we got more than a 4× speedup, probably due to some combination of better instruction scheduling and better memory access patterns. Compare this to the fact that we got significant slowdowns in this same task when coding it in naive SIMD.

So what are the implications of this for vector math lib design? First of all, stick to scalar math and don’t try to use the naive, AOS style of SIMD—it doesn’t actually help, and can even hurt you. Second, consider providing a separate SOA math library, for those places where optimized SIMD is needed. You can also just use SIMD intrinsics directly in those places (or wrappers around platform-specific SIMD intrinsics), which might be clearer.

One potentially interesting idea: since we already have a vector lib parameterized by element type, you could use a SIMD vector as the element type. For example, Vector<__m128, 3> would be a collection of four 3-vectors in SOA layout. If you define overloaded operators for __m128, such as

__m128 operator + (__m128 a, __m128 b) { return _mm_add_ps(a, b); }

then you can write SOA SIMD code in a very similar style to your scalar code.

Arrays of Vector<__m128, 3> and similar types wouldn’t quite be in SOA layout; they’d be in an “AOSOA” (array of struct of arrays) format. However, this might well give you most of the memory locality benefits of SOA; I haven’t tried it.

## It’s Affine With Me

Vector/matrix math libraries implement the primitives and operations of linear algebra, which deals with vectors—quantities with a direction in space and a magnitude. However, there’s another geometric primitive that’s equally useful: the point. Points belong to the branch of mathematics called affine geometry, and they are not the same thing as vectors—vectors have a direction and a magnitude, while points have neither, except relative to another point.

Affine geometry expresses the notion that the choice of origin point in a space is arbitrary. This is very similar to how in linear algebra, the choice of basis vectors (coordinate axes) is arbitrary. You can represent the same abstract vector (say, the gravity vector) in different bases (say, local space and world space), and it’ll have different component values. However, a vector space always has a fixed origin point—a gravity vector of zero means the same thing, whichever basis you’re in.

In contrast, points live in an affine space, where there is no fixed origin point. While vector spaces model things like gravity vectors, for which zero has a special meaning, affine spaces model things like the positions of objects, where no point is special. A coordinate system in an affine space consists of a choice of origin point and basis vectors. Of course, once we have a coordinate system set up around a certain origin, we then represent other points by creating vectors to them from the origin, and in turn representing those vectors as components with respect to a basis.

Many vector/matrix libs don’t bother to make the distinction, using the same type to store both points and vectors. However, it’s worth considering making this distinction explicit by creating separate types for the two, because points and vectors actually have different operations:

• Vectors can be added to each other, and multiplied by scalars. Points can’t be (in a way that means anything geometrically, outside of a particular coordinate system).
• You can subtract two points, and you get the vector (translation) between them.
• You can add a vector to a point to get a new point (i.e. translate the point).
• When you apply an affine transformation to a point, translation should be included; but when applied to a vector, the translation should be ignored.

Likewise, affine spaces have their own set of transformations. A linear transformation on a vector space can be represented (in a particular basis) as a matrix; an affine transformation can be represented (in a particular coordinate system) as a matrix together with a translation vector. Affine transformations can be composed and inverted while remaining in this form. If points and vectors are different types, matrices and affine transformations should be separate as well.

The argument here is the same as always for strong typing: if you make the point/vector distinction known to the type system, it will prevent you from accidentally doing something that doesn’t make sense. Just as the compiler won’t let you add an integer to a string, it won’t let you add two points.

Of course, when taken too far, strong types become more of a hindrance than a help; it’s a matter of opinion where to draw the line. A good naming convention can often get you most of the benefits of strong types, as long as you adhere to it. I’ve worked with vector math libs both with and without built-in affine math support, and while I slightly prefer affine math to be supported, I can understand choosing to forego this feature.

If you do choose to implement affine math, there are two main components: a Point struct, which will be very similar to the Vector one but with only the operations that make sense for points, and an Affine struct representing an affine transformation, which will contain a Matrix for the linear part and a Vector for the translation. A few other guidelines:

• Provide explicit conversions between points and vectors, for those (rare) cases where you need to break out of the type system.
• Provide an overloaded operator * or mul that takes an Affine and either a point or a vector, and transforms it appropriately.
• Provide an operation to convert a 3D Affine into the corresponding 4×4 matrix for homogeneous coordinates, and vice versa.

A final note: it’s also possible to embed affine math into homogeneous coordinates, by setting w = 1 for points, w = 0 for vectors. This makes them behave correctly with regard to addition/subtraction and affine transformations (embedded in 4×4 matrices), but it doesn’t have the benefit of actually restricting you from performing illegal operations like adding two points, and it wastes memory and time on a w-component that’s really a compile-time constant. It’s also not quite kosher, mathematically speaking, since it overloads the homogeneous coordinate system, using w = 0 to represent both vectors and projective points at infinity—which are different things.

## Conclusion

Whew! We covered a lot of ground in this article, so let me finish by summarizing the most important lessons of the previous sections.

• Parameterize vector/matrix structs by element type and dimension.
• Provide specializations and typedefs for the most common types and sizes, for convenience.
• Provide a nice, full set of constructors, overloaded operators and free functions (like dot, cross, etc). Seriously, go crazy with these and implement everything you can think of.
• Make a decision about row-major vs. column-major and row-vector vs. column-vector, and stick to it.
• Don’t try to build SIMD into your general-purpose vector/matrix structs. Do use SIMD judiciously, where it’s effective, and keep the relevant data in SOA layout.
• Consider adding type-safe affine math (points and affine transformations) on top of the basic linear algebra features.

That’s it! Now go forth and build great vector math libraries! :)

## 20 comments on “On Vector Math Libraries”

December 28, 2013

After reading your article I realized I had come to similar conclusions after three years of developing my humble hobby C++11 graphics library ( https://github.com/mosra/magnum ). The article is very inspirational and I’m looking forward to make use of your ideas :-)

For base implementation I’m using template classes parametrized by type and dimensions. Vectors store array of numeric types, matrices are storing array of these vectors. The base implementations are then subclassed (instead of specialization) for two-, three- and four-component vectors and common matrix sizes. The subclasses then provide only additional convenience methods without any additional data members, so I don’t need to bother with virtual destructors or similar ugly things. Apart from typedefs I’m using also C++11 template aliases to make the usage more convenient when the user wants to change only the type and not dimension count (e.g. template using Matrix2x2 = Matrix;).

For component access I’m not using the union because, as you said, the support for unrestricted unions is still poor and can’t be easily worked around to provide consistent API for all compilers. Instead I’m providing x(), y(), z(), xy(), xyz(), rgb() etc. accessors in specialized subclasses. They are returning mutable references, so the only difference is writing the additional (), which can be lived with. For more involved swizzle operations I’m providing rather crazy swizzle() function which makes use of variadic templates, so it’s able to generate the operation at compile time, allowing the compiler to optimize. The usage is e.g. swizzle(vec);, so it’s not exactly short, but better than nothing.

As I don’t use unrestricted unions, I can afford constructors. For component-wise constructors I’m abusing variadic templates instead of std::initializer_list, because with them it’s possible to check for value count at compile time and it’s possible to make the constructor constexpr. Also I don’t need to write additional pair of braces for construction (e.g. Vector3i a(3, 5, 7) instead of Vector3i a({3, 5, 7})). Fortunately C++14 will solve some of issues related to std::initializer_list.

Originally I had separate Point classes, but I decided to scrap them and use Vector everywhere. The library aims to support both 2D and 3D and all the overload combinations for Vector2, Vector3, Point2D, Vector4 and Point3D went out of hand. The distinction between point and vector gets lost when passing the data to GLSL anyway and the conversion of three-component position to four-component homogeneous coordinates is done implicitly, so I don’t need to do anything else.

To distinct between points and vectors, next to operator* which takes transformation matrix (dual quaternion etc.) and homogeneous vector I’m providing transformPoint() and transformVector() functions on all transformation classes, which are doing similar thing as the operator*, but in a more explicit manner.

These are just my observations and they may be incorrect and false. Thanks for the article again, I enjoyed reading it :-)

December 28, 2013

We’re trying what you describe as “AOSOA” in Ogre 2.0
(to compare, compile with OGRE_SIMD_SSE2 turned on vs turned off both OGRE_SIMD_SSE2 & OGRE_SIMD_NEON). There’s a lot on my website blog and on the Ogre forums.

Our memory layout is contiguous in the following way:
XYZXYZXYZXYZ XYZXYZXYZXYZ XYZXYZXYZXYZ ….

Particularly frustum culling works extremely efficient.

There are a few drawbacks:
* When objects have a parent-child relationship you still have to deal with some scalar overhead to get the data in the right structure. This is because Objs 12, 13, 14 & 15 may be contiguous, but their parents 2, 7, 5, & 3 are not.

* This also applies when your objects have their boundings boxes in one place, and their nodes with the world transform in another (i.e. in Ogre Entities hold the AABB, but Nodes hold the position; multiple entities may be attached to the same node)

* When the amount of objects is huge, you’ll hit the memory bandwidth barrier and SIMD vs scalar doesn’t make much difference for loops where this parent-child relationship exists (however it does make a difference before hitting the barrier).

* After benchmarking, just using scalar operations to move data from SoA -> AoS -> back to SoA is much faster than using shuffle instructions, which seemed odd. My guess is that shuffle instructions required movaps, which means more memory being loaded unnecessarily. (Or I did something wrong)

Overall it is still a win and we’re happy with it, but our biggest enemy so far has been memory bandwidth.

December 28, 2013

Nice work, Nathan!

I’ve always wanted to see a standard accessor scheme that would allow orthogonality of algorithm (e.g. matrix inverse) from the type, size, or data layout. It’d be nice if we could share algorithms without insisting on the same layout.

December 28, 2013

Makes a lot of senses. I like the ‘scalar’ style approach used in graphics programming, though it can be a pain to do complex lane sharing. Matt Phar’s ISPC (http://ispc.github.io/) work does this on CPU nicely, though the extra tool chain required is a bit of a barrier. It’s a lot better for my use cases than OpenCL on a CPU though as I need to control thread/tasking and scheduling.

The VC lib (http://code.compeng.uni-frankfurt.de/projects/vc) and their C++14 proposal for SIMD vector types (http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3759.html) seems a decent alternative to writing your own library; though I’ve not had time to check it out yet. Currently most of my SIMD work is pretty bespoke, so I haven’t yet gotten around to writing a more generic library – but if I did it would be AOSOA style ‘scalar’/SPMD.

It’s still handy to have a base standard math lib for times you just want to calculate a few numbers though.

December 28, 2013

Great information, but why constantly try to reimplement the same stuff when libraries like Eigen already provide it with a very permissive license?

December 28, 2013

Eigen’s license is MPL, which isn’t that permissive for a start – since you have to share changes any proprietary system code added to the library can’t be used, making it difficult to use for some games consoles for example.

From what I can see Eigen doesn’t provide a simple set of SIMD operations in an AOSOS style, which is what this article is partially about. It provides SIMD optimized operations on complex types (vectors, matrices etc.) in classic AOS style, gaining significant performance for the more complex operations which the library is mainly designed for.

December 29, 2013

Vc have been quoted, I will also remind people of the on-going effort on Boost.SIMD (currently a subpart of NT2 over https://github.com/MetaScale/nt2) which also have been presented at Bristol as an additionto the C++ standard (http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3571.pdf).

Boost.SIMD is dedicated to enable portable SIMD capability over a large range of processors, while NT2 is more like a higher level, matlab like interface on top of “some amount of data with semantic”. NT2 use Boost.Fusion to automatically derives AOSSOA transformation while Boost.SIMD supports SIMD pack of structure and store them the proper way. More info there in our BoostCon talk here: https://www.google.fr/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&cad=rja&ved=0CDYQtwIwAQ&url=http%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DnxsBPjDTpZc&ei=ody_Uvlj8JfRBYqqgKgN&usg=AFQjCNGiv1HwwBUT1DVanz5BmOoMJYfklA&sig2=1iMxu5sIZyEnQZl-NOMwuQ&bvm=bv.58187178,d.d2k

Both have a Boost like licence, making the permissive part rather trivial. I’ll have a look at the proposed benchmark and see how we fare on such tasks.

January 13, 2015

“I’m not going to analyze the generated code closely, but basically what’s going on is that we have to spend too much time moving data around (often via memory) to get it into the right shape for SIMD vector operations to work on it.”

This is an odd statement to make that basically sounds like “I didn’t take the time to write this properly so it’s possible it sucked.” Which is certainly easy to do, but there are reason why people do use simd-based Vector classes. For us it yielded a 1.5-2.5x speedup on common operations.

It is definitely easier to write a bad simd implementation and end up with code that is slower than non-simd. Dealing with the 16-byte alignment issues can also get annoying.

Everyone certainly has their opinion on what makes a good vector library. In the last 15 years we’ve never needed integer vectors for example, so having a clean set of non-templated Color/Vector2/3/4/Matrix types was perfect for us. The code is much easier to follow and we’re able to use SIMD if desired. We’re also not fans of trying to union colors and vectors :).

I don’t think we’ve actually touched these classes in a few years now, so maintenance isn’t really an issue as the set of operations on vectors is fairly well known.

January 14, 2015

A quick note for readers: Please don’t use march=native in any build that you will ship to somebody else’s computer. It allows GCC to use any instruction set supported by the processor you’re compiling on. If you’re on a haswell and sending code out to users that might be on an old Core 2, you’re gonna have a bad time.

January 14, 2015

This is an odd statement to make that basically sounds like “I didn’t take the time to write this properly so it’s possible it sucked.”

Sorry, I meant that I wasn’t going to go over the generated code in the article—I did go over it myself to confirm that it had a lot more memory operations, shuffle-type operations and such than the SOA SIMD version.

Please don’t use march=native in any build that you will ship to somebody else’s computer.

Yes, good point. :) I was being a bit lazy and didn’t want to track down exactly which march setting was needed to get SSE going.

January 19, 2015

Hi nathan!Good article.What about GPU accelerated math libs? Like CUBLAS or Cuda Math.Are those worth to try if the dev environment is GPU accelerated graphics programming?Have you tried any of those?IMHO,the overhead of uploading / downloading computational data from CPU to GPU and back could be a serious bottleneck.What do you think?

January 21, 2015

GPU-accelerated math is interesting if you have a lot of math to do, enough that the GPU speedup is worth the extra latency of uploading and downloading the data. Games are starting to use the GPU for running particle systems, fluid sims and even rigid body physics these days, which in previous years would have been done on the CPU.

March 11, 2015

Hi,great article with a lot a information. For the need of a Vector class in a scientific code, I was wondering if there is a good way to use SIMD for double precision with sufficient speed up?

March 11, 2015

Ugo, yes, there are SIMD instructions for double-precision, but they are half the width—e.g. 2 doubles, instead of 4 floats, in a single 128-bit register. They might also perform less well in terms of latency and throughput than single-precision instructions. But if doubles are a requirement for your application then it’s worth looking into.

June 18, 2015

Great article, Nathan. I had a thought on the workarounds for having constructor-containing classes within unions: It appears that an unnamed struct (containing the class) does the job also. Is there any reason you can think of why I would not want to do this?

VS2013 Community does not like constructor-containing classes as members of classes/structs, so I supposed I’m forced to look at the alternatives.

July 21, 2015

Braden, you mean something like this?

template <> struct Vector<4> { union { struct { float x, y, z, w; }; struct { Vector<3> xyz; }; }; };

I hadn’t thought of that, but if the compiler accepts it, that sounds like a great alternative!

Also, VS 2015 is out now and supports both unrestricted unions and (more of) constexpr, which should take away a lot of constructor woes.

July 27, 2016

What do you think about the aliasing problem that is inherent with using unions? Are compilers smarter than that, or does typical usage typically not mix named access with indexed access?

I bring it up because this point was harped on in a fairly old SO post I found when trying to figure out the best way to handle named access: http://stackoverflow.com/questions/494597/c-member-variable-aliases

July 27, 2016

Also “load-hit-store” is referenced. Still not sure how much of a problem it is in 2016.

July 28, 2016

Merlyn, I haven’t tested this exhaustively but AFAICT, the aliasing problem is just a language-lawyer “problem”, concerning technically undefined behavior and so on, and not a real problem for compilers in a case like this. Everything is just flat arrays of floats, so I can’t see how there would be any issues with it in real life.

Load-hit-store is something that applies to older PowerPC CPUs, as infamously used in the X360 and PS3, and it mainly applies when mixing float and integer operations, or SIMD and non-SIMD operations, on the same values. Since my recommendation is not to try to wire SIMD generally into your vector math library, but to use it explicitly in targeted cases where you want it, LHS shouldn’t be a big issue.

July 28, 2016

I’m a noob at this level of optimization (slash pessimization), so I apologize if I’m being verbose or needy :)

> Everything is just flat arrays of floats, so I can’t see how there would be any issues with it in real life.

Googling LHS, it looked like it was specifically a problem when accessing the same bit of memory via both individual data members and array indices. It sounds to me like you haven’t found that to be a problem, and that it’s more an issue with the xbox360 platform/compiler.

I tried doing some tests. I am not familiar enough with assembly code (I can just barely read it) or compiler code generation, so I don’t know exactly what sorts of tests to throw at the problem. However, it looks to me like the Arduino g++ compiler is generating basically identical code with or without union usage, with or without inlining, even if I was deliberately reading and writing the data from different structures in the union (accessed via pointer), and even if I had -Os enabled (which turns on “-fstrict-aliasing” – not sure how related this is). Storing and reading intermediates from global variables broke inlining, but it was doing it in both cases, and I’m pretty sure that isn’t unique to using unions this way.

Any ideas of what might “prove” this one way or another?