Nathan Reed’s coding blog

Reading Veach’s Thesis, Part 2

Nathan Reed — Sat, 25 Feb 2023 10:24:44 -0800

In this post, we’re continuing to read Eric Veach’s doctoral thesis. In our last installment, we covered the first half of the thesis, dealing with theoretical foundations for Monte Carlo rendering. This time we’re tackling chapters 8–9, including one of the key algorithms this thesis is famous for: multiple importance sampling. Without further ado, let’s tuck in!

As before, this isn’t going to be a comprehensive review of everything in the thesis—it’s just a selection of things that made me go “oh, that’s cool”, or “huh! I didn’t know that”.

Path-Space Integrals

We usually see the rendering equation expressed as a fixed-point integral equation. The radiance field $L$ appears on both sides: $$ L = L_e + \int L \, f \, |\cos\theta| \, \mathrm{d}\omega $$ There are some theorems showing that we can solve this as an infinite series: $$ L = L_e + TL_e + T^2 L_e + \cdots $$ where $T$ is an operator representing the integral over surfaces with their BSDFs. This series constructs the solution bounce-by-bounce: first directly emitted light, then light that’s been scattered once, then scattered twice, and so on.
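To see the bounce-by-bounce series at work, here is a toy sketch in Python. It is not a renderer: the radiance field is shrunk to a 2-element vector and $T$ to a 2×2 matrix, and every number (and the helper names `apply_T` and `neumann_series`) is made up for illustration. The only property that matters is that $T$ loses some energy on each bounce, so the series converges.

```python
# Toy discretization: "radiance" is a 2-vector, T is a 2x2 matrix.
# All numbers are invented for illustration; each row of T sums to less
# than 1, so energy is absorbed at every bounce and the series converges.

def apply_T(T, v):
    """Apply the (discretized) transport operator once: one more bounce."""
    return [sum(T[i][j] * v[j] for j in range(len(v))) for i in range(len(T))]

def neumann_series(T, L_e, bounces):
    """Sum L_e + T L_e + T^2 L_e + ... up to the given number of bounces."""
    total = list(L_e)
    term = list(L_e)
    for _ in range(bounces):
        term = apply_T(T, term)                      # scattered once more
        total = [a + b for a, b in zip(total, term)]
    return total

T = [[0.0, 0.5],
     [0.4, 0.0]]
L_e = [1.0, 0.0]

L = neumann_series(T, L_e, bounces=60)

# Check the fixed-point form of the equation: L should equal L_e + T L.
TL = apply_T(T, L)
residual = max(abs(L[i] - (L_e[i] + TL[i])) for i in range(2))
print(L, residual)
```

The summed series satisfies the fixed-point equation up to rounding error, which is the sense in which the two forms are the same solution.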

The trouble is, this series contains a separate integral for each possible path length. For the methods Veach is going to deploy later, he needs to be able to combine paths of all lengths in a single Monte Carlo estimator. In Chapter 8, he reformulates the rendering equation as an integral over a “space” of all possible paths: $$ L = \int L_e \, f \, \mathrm{d}\mu $$ The idea is that now we’re integrating a new kind of “variable”, which ranges over all paths (of any length) in the scene. Here, $f$ stands for the throughput along a whole path, and $L_e$ for the emitted light injected at its beginning.

By itself, this doesn’t really simplify anything; we’ve just moved the complexity from the rendering equation to the definition of the path space over which we’re integrating. This is a funny kind of “space” that actually consists of a disjoint union of an infinite sequence of subspaces, one for each possible path length. Those subspaces even have different dimensionalities, which is extra weird! But with Lebesgue measure theory, this is a legit space that can be integrated over in a mathematically rigorous way.

This sets us up for talking about probability distributions over all paths, combining different path sampling methods in an unbiased way, and so forth—which will be crucial in the following chapters.

The path-integral formulation of the rendering equation has also become quite popular in light transport theory papers today.

Non-Local Path Sampling

Veach gives an intriguing example of a potential new path sampling approach that’s facilitated by the path-integral formulation. Usually, paths are constructed incrementally starting from one end, by shooting a ray toward the next path vertex. But in the presence of specular surfaces such as a planar mirror, you could also algebraically solve for a point on the mirror that will connect two existing path vertices (say, one from a camera subpath and one from a light subpath). Even more exotically, we could consider solving for chains of multiple specular scattering events to connect a given pair of endpoints.

Veach calls this “non-local” path sampling, because it looks at vertices that aren’t just adjacent to each other on the path, but farther apart.

Veach merely sketches this idea and remarks that it could be useful. Since then, non-local sampling ideas have been researched in the manifold exploration family of techniques, such as Manifold Next-Event Estimation and Specular Manifold Sampling.

Extended Light Path Expressions

You may have seen “regular expression” syntax describing the vertices of paths, like $LS^*DE$ and suchlike. In this notation, $L$ stands for a light source, $S$ for a (Dirac) specular scattering event, $D$ a diffuse (or glossy) scattering event, and $E$ for the camera/eye. It’s a concise way to classify which kinds of paths are handled by different techniques. These “light path expressions” are widely used in the literature, as well as in production renderers to split off different lighting components into separate framebuffers.

Veach describes an extension to this notation in which extra $D$ and $S$ symbols are added to denote the continuity or discreteness of lights and cameras, in both position and directionality. For example, a point light (positionally “specular”) that radiates in all directions (“diffuse”) would be denoted $LSD$. A punctual directional light would be $LDS$, and an area light would be $LDD$. The camera is described likewise, but in the opposite order: $DSE$ is a pinhole camera, while $DDE$ is a camera with a physical lens area. These substrings are used as prefixes and suffixes for what he calls “full-path” regular expressions.

There’s a certain elegance to this idea, but I have to admit I found it confusing in practice, even after reading several chapters using these extended expressions. I had to keep looking up which symbol was the position and which was the direction, and stopping to think about what those labels mean in the context of a light source or camera.

This extended syntax doesn’t seem to have been adopted by much later literature, but I did see it used in the Path Space Regularization paper by Kaplanyan and Dachsbacher. They also print the light and camera substrings in different colors, to improve their readability.

Multiple Importance Sampling

Alright, now we’re getting into the real meat of Veach’s thesis! In a sense, all the foregoing material was just setup and preparation for the last three chapters, which contain the thesis’s major original contributions.

I’ll assume you’re familiar with the basic ideas of multiple importance sampling, the balance heuristic, and the power heuristic. If you need a refresher, here’s the relevant section of PBR.

The Balance Heuristic

There are some great insights here about the interpretation of the balance heuristic that I hadn’t seen before. Using the balance heuristic to combine samples from a collection of probability distributions $p_i(x)$ (e.g., light source sampling and BSDF sampling) turns out to be equivalent to sampling from a single distribution, whose probability density is the average of all the constituent ones (assuming each strategy takes the same number of samples): $$ p_\text{mis}(x) = \frac{1}{N} \sum_i p_i(x) $$ Intuitively, this is useful because the combined distribution inherits all of the peaks of the distributions contributing to it. If one sampling strategy is “good at” sampling a certain region of the integration domain, its $p_i(x)$ will tend to have a peak in that region. When several PDFs are averaged together, the resulting distribution has peaks (albeit smaller ones) everywhere any of the included strategies has a peak.
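The equivalence can be checked in a couple of lines. As a simplified setup (my assumption, not the thesis's general statement), take one sample $X_i$ from each of the $N$ strategies, with balance heuristic weights $w_i(x) = p_i(x) / \sum_k p_k(x)$ and integrand $f$. The MIS estimator $F$ is then: $$ F = \sum_{i=1}^{N} w_i(X_i) \, \frac{f(X_i)}{p_i(X_i)} = \sum_{i=1}^{N} \frac{f(X_i)}{\sum_k p_k(X_i)} = \frac{1}{N} \sum_{i=1}^{N} \frac{f(X_i)}{p_\text{mis}(X_i)} $$ The $p_i$ in the weight cancels the $p_i$ in the denominator, and what remains has the form of an ordinary $N$-sample Monte Carlo estimator with density $p_\text{mis}$.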

As an illustration, here are two fictitious “PDFs” I made up, and their average:

The third curve, which simulates MIS with the balance heuristic, combines the peaks of the first two.

Here’s all three curves together:

So, the balance heuristic combines the strengths of the sampling strategies within it: it’s “pretty good at” sampling all the regions that any of the constituent strategies are “good at”.

A corollary of this fact is that the balance heuristic will assign a given path the same contribution weight no matter which strategy generated it. This isn’t the case for other MIS weighting functions, such as the power heuristic.

The Power Heuristic

The power heuristic doesn’t have quite such a tidy interpretation; it’s not equivalent to sampling any single distribution. It intuitively does something similar to the balance heuristic, but also “sharpens” the weights, making small contributions smaller and large ones larger.
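Here is a small Python sketch comparing the two heuristics at a single sample point. The PDF values and the integrand value are invented, and the function names are my own. The exponent of 2 is the usual choice for the power heuristic.

```python
def balance_weight(pdfs, i):
    """Balance heuristic weight for strategy i, given all PDFs at one point."""
    return pdfs[i] / sum(pdfs)

def power_weight(pdfs, i, beta=2.0):
    """Power heuristic weight for strategy i (beta = 2 is the usual choice)."""
    return pdfs[i] ** beta / sum(p ** beta for p in pdfs)

# Made-up PDF values of two strategies at the same sample point x,
# and a made-up integrand value f(x).
pdfs = [0.9, 0.1]
f_x = 2.0

for name, weight in (("balance", balance_weight), ("power", power_weight)):
    weights = [weight(pdfs, i) for i in range(len(pdfs))]
    # Weighted contribution if x had been generated by strategy i.
    contribs = [weights[i] * f_x / pdfs[i] for i in range(len(pdfs))]
    print(name, weights, contribs)
```

With the balance heuristic both contributions come out the same, since each is just $f(x) / \sum_k p_k(x)$. With the power heuristic the larger weight moves closer to 1 and the smaller one closer to 0, and the contribution depends on which strategy produced the sample.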

According to Veach, this is helpful to reduce variance in areas where one of the included strategies is already a very close match for the integrand. In those cases, MIS isn’t really needed, and the balance heuristic can actually make things worse. The power heuristic makes things less worse.

There’s a great graph in the thesis (Figure 9.10) showing actual variance measurements for light source sampling, BSDF sampling, and the two combined with the balance heuristic or the power heuristic:

These are plotted logarithmically over several orders of magnitude in surface roughness, so they give some nice concrete evidence about the efficacy of MIS in reducing variance across a wide range of shading situations.

MIS Examples

We’ve all seen that classic MIS showcase image, with the different light source sizes versus material roughnesses. That comes from this thesis, of course! Here’s a neat Shadertoy rendition of it, created by Maxwell Planck:

Light source samples are color-coded red, and BSDF samples are green; this is a nice way to visualize how the two get weighted differently across the image.

However, I was interested to see that Veach also has a second demo scene, which I haven’t come across before. It’s simpler and less “pretty” than the more famous one above, but in my mind it demonstrates the value of MIS even more starkly.

This scene just consists of a large emissive surface at right angles to a diffuse surface:

(Shadertoy here, which I adapted from Planck’s.)

Depending how far you are from the light, either BSDF sampling or light source sampling is more effective at estimating the illumination. So, you don’t even need a whole range of material roughnesses to benefit from MIS; area lights and diffuse walls are enough!

Conclusion

I’ve known about multiple importance sampling for a long time, but I never felt like I quite got my head around it. I had the idea that it was something about shifting weight toward whichever sampling method gives you the “highest quality” samples in a given region, but it always seemed a little magical to me how you could determine that from purely local information (the PDFs at a single sample point).

I’m glad I took the time to read through Veach’s own explanation of this, as it goes into a lot more detail about the meaning and intuition behind the balance heuristic. I have a much better understanding of how and why it works, now.

One thing I didn’t get to address here (because I didn’t have much useful to say about it) was the optimality(-ish) proofs Veach gives. There are a few theorems proved in this chapter that roughly say something like “this heuristic might not be the best one, but it’s not that far behind the best one”. I’d like to contextualize these results better (what justifies saying it’s “not that far”?), but I haven’t yet found the right angle.

The last couple chapters in the thesis are about bidirectional path tracing and Metropolis light transport. This post has stretched long enough, so those will have to wait for another time!

Reading Veach’s Thesis

Nathan Reed — Sat, 03 Dec 2022 14:50:39 -0800

If you’ve studied path tracing or physically-based rendering in the last twenty years, you’ve probably heard of Eric Veach. His Ph.D. thesis, published in 1997, has been hugely influential in Monte Carlo rendering. Veach introduced key techniques like multiple importance sampling and bidirectional path tracing, and clarified a lot of the mathematical theory behind Monte Carlo rendering. These ideas not only inspired a great deal of later research, but are still used in production renderers today.

Recently, I decided to sit down and read this classic thesis in full. Although I’ve seen expositions of the central ideas in other places such as PBR, I’d never gone back to the original source. The thesis is available from Stanford’s site (scroll down to the very bottom for PDF links). It’s over 400 pages—a textbook in its own right—but I’ve found it very readable, with clearly presented ideas and incisive analysis. There’s a lot of formal math, too, but you don’t really need more than linear algebra, calculus, and some probability theory to understand it. I’m only about halfway through, but there’s already been some really interesting bits that I’d like to share. So hop in, and let’s read Veach’s thesis together!

This isn’t going to be a comprehensive review of everything in the thesis—it’s just a selection of things that made me go “oh, that’s cool”, or “huh! I didn’t know that”.

Unbiased vs Consistent Algorithms

You’ve probably heard people talk about “bias” in rendering algorithms and how unbiased algorithms are better. Sounds reasonable, bias is bad and wrong, right? But then there’s this other thing called “consistent” that algorithms can be, which makes them kind of okay even if they’re biased? I’ve encountered these concepts in the graphics world but never really saw a clear explanation of them (especially “consistent”).

Veach has a pretty nice one-page explanation of what this is and why it matters (§1.4.4). Briefly, bias is when the mean value of the estimator is wrong, independent of the noise due to random sampling. “Consistent” is when the algorithm’s bias approaches zero as you take more samples. An unbiased algorithm generates samples that are randomly spread around the true, correct answer from the very beginning. A consistent algorithm generates samples that are randomly spread around a wrong answer to begin with, but then they migrate toward the right answer over time.
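A toy example may help (this one is mine, not from the thesis). Suppose $X_1, \ldots, X_n$ are independent samples with mean $\mu$, and compare two estimators of $\mu$: $$ F_n = \frac{1}{n} \sum_{i=1}^{n} X_i, \qquad G_n = \frac{1}{n+1} \sum_{i=1}^{n} X_i $$ Taking expectations gives $$ E[F_n] = \mu, \qquad E[G_n] = \frac{n}{n+1} \, \mu, \qquad E[G_n] - \mu = -\frac{\mu}{n+1} \to 0 $$ So $F_n$ is unbiased at every sample count, while $G_n$ is off by $-\mu/(n+1)$ for any finite $n$. That bias shrinks to zero as $n$ grows, which makes $G_n$ biased but consistent.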

The reason it matters is that with an unbiased algorithm, you can track the variance in your samples and get a good idea of how much error there is, and you can accurately predict how many samples it’s going to take to get the error down to a given level. With a biased but consistent algorithm, you could have a situation where it looks like it’s converged because the samples have low variance, but it’s converged to an inaccurate value. You have no real way to detect that, and no way to tell how many more samples might be necessary to achieve a given error bound.

Photon Phase Space

The classic Kajiya rendering equation deals with this quantity called “radiance” that’s notoriously hard to get a handle on, both intuitively and mathematically. We’re usually shown a definition that has some derivative-looking notation like $$ L = \frac{\mathrm{d}^2\Phi}{\mathrm{d}A \, \mathrm{d}\omega \cos \theta} $$ which, like, what? What is the actual function that is being differentiated here? What are the variables? What does this even mean?

If you’re the sort of person who feels more secure when things like this are put on an explicit, formal mathematical footing, Chapter 3 is for you. Veach takes it back to physics by defining a phase space (state space) for photons. Each photon has a position, direction, and wavelength, so the phase space is 6-dimensional (3 + 2 + 1). We can imagine the photons in the scene as a cloud of points in this space, moving around with time, spawning at light sources and occasionally dying when absorbed at surfaces.

Then, all the usual radiometric quantities like flux, irradiance, radiance, and so on can be defined in terms of measuring the density of photons (or rather, their energy density) in various subsets of this space. For example, radiance is defined in terms of the photons flowing through a given patch of surface, with directions within a given cone, and then taking a limit as the surface patch and cone sizes go to zero. This kind of limiting procedure is formalized using measure theory, as a Radon–Nikodym derivative.

Incident and Exitant Radiance

Another thing we get from this notion of photon phase space is a precise distinction between incident and exitant radiance. The rendering equation describes how to calculate $L_o$ (exitant radiance, leaving the surface) in terms of the BSDF and $L_i$ (incident radiance, arriving at the surface). But then how are these $L_o$ and $L_i$ related to each other? There’s just one unified radiance field, not two; but trying to define it as a function of position and direction, $L(x, \omega)$, we run into some awkwardness at points on surfaces because the radiance changes discontinuously there.

Veach §3.5 gives a nice definition of incident and exitant radiance functions in terms of the photon phase space, by looking at trajectories moving toward the surface or away from it in time. (To be fair, I think this could be done as well by looking at one-sided limits of the 3D radiance field as you approach the surface from either direction.)

Reciprocity and Adjoint BSDFs

Much of the thesis in Chapters 4–7 is concerned with how to handle non-reciprocal BSDFs—or, as Veach calls them, non-symmetric. We’re often told that BSDFs “should” obey a reciprocity law, $f(\omega_i \to \omega_o) = f(\omega_o \to \omega_i)$, in order to be well-behaved. However, Veach points out that non-reciprocal BSDFs are commonplace and unavoidable in practice:

Refraction is non-reciprocal (§5.2). Radiance changes by a factor of $\eta_o^2 / \eta_i^2$ when refracted (more about this in the next section); reverse the direction of light, and it inverts this factor.
Shading normals are non-reciprocal (§5.3). Shading normals can be interpreted as a factor $|\omega_i \cdot n_s| / |\omega_i \cdot n_g|$ multiplied into the BSDF. Note that this expression involves only $\omega_i$ and not $\omega_o$, so if those directions are swapped, this value will in general be different.
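To make the shading-normal asymmetry concrete, here is a Python sketch that evaluates the factor $|\omega_i \cdot n_s| / |\omega_i \cdot n_g|$ for a pair of directions and then for the same pair swapped. The normals and directions are arbitrary values I picked, and the helper names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    length = math.sqrt(dot(v, v))
    return tuple(x / length for x in v)

def shading_normal_factor(w_i, n_s, n_g):
    """The factor |w_i . n_s| / |w_i . n_g| described above."""
    return abs(dot(w_i, n_s)) / abs(dot(w_i, n_g))

n_g = (0.0, 0.0, 1.0)                 # geometric normal
n_s = normalize((0.3, 0.0, 1.0))      # shading normal, tilted toward +x
w_a = normalize((1.0, 0.0, 1.0))      # direction leaning toward +x
w_b = normalize((-1.0, 0.0, 1.0))     # mirror-image direction

# Light arriving along w_a (leaving along w_b), then the reverse.
forward = shading_normal_factor(w_a, n_s, n_g)
backward = shading_normal_factor(w_b, n_s, n_g)
print(forward, backward)
```

The two values differ, so a BSDF carrying this factor gives different results when its arguments are exchanged.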

Does this spell doom for physically-based rendering algorithms? Surprisingly, no. According to Veach, it just means we have to be careful about the order of arguments to our BSDFs, and not treat them as interchangeable. The rendering will still work as long as we’re consistent about which direction light is flowing. (It’s a bit like working with non-commutative algebra; you can still do most of the same things, you just need to take care to preserve the order of multiplications.)

For photon mapping or bidirectional path tracing, we might need two separate importance-sampling routines: one to sample $\omega_i$ given $\omega_o$ (when tracing from the camera) and one to sample $\omega_o$ given $\omega_i$ (when tracing from a light source).

Another way to think about it is that when light is emitted and scatters through the scene, it uses the regular BSDF, but when “importance” is emitted by cameras and scatters through the scene, it uses the adjoint BSDF, which is just the BSDF with its arguments swapped (§3.7.6). Then both directions of scattering give consistent results and can be intermixed in algorithms.

Non-Reciprocity of Refraction

I was not previously aware that radiance should be scaled by $\eta_o^2 / \eta_i^2$ when a ray is refracted! This fact somehow passed me by in everything I’ve read about physically-based light transport (although when I looked, I found PBR discussing this issue in §16.1.3). The radiance changes because light gets compressed into a smaller range of directions when refracted, as this diagram (excerpted from the thesis) shows:

So, a ray entering a glass object should have its radiance more than doubled. However, the scaling is undone when the ray exits the glass again. That explains why you can often get away without modeling this radiance scaling in a renderer; if the camera and all light sources are outside of any refractive media, there’s no visible effect. This would only show up if, for instance, some light sources were inside a medium—and would only show up as those light sources being a little dimmer than they should be, which would be easy to overlook (and easy for an artist to compensate by bringing those lights up a bit).
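As a quick sanity check of the arithmetic, here is a Python sketch. I'm assuming an index of refraction of 1.5 for glass and 1.0 for air, reading $\eta_i$ as the index on the side the light arrives from and $\eta_o$ as the index on the side it continues into, and ignoring Fresnel reflection losses. The function name is hypothetical.

```python
def refraction_radiance_scale(eta_i, eta_o):
    """Radiance scale factor eta_o^2 / eta_i^2 for one refraction event."""
    return (eta_o * eta_o) / (eta_i * eta_i)

ETA_AIR = 1.0
ETA_GLASS = 1.5   # assumed typical value, not taken from the thesis

entering = refraction_radiance_scale(ETA_AIR, ETA_GLASS)   # air -> glass
exiting = refraction_radiance_scale(ETA_GLASS, ETA_AIR)    # glass -> air
print(entering, exiting, entering * exiting)
```

Entering gives a factor of 2.25, exiting gives its reciprocal, and the round trip multiplies out to 1.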

However, the radiance scaling does become important when we use things like photon mapping and bidirectional path tracing, where we have to use the adjoint BSDF when tracing from the light sources. Then, the $\eta^2$ factors apply inversely to these paths, which is important to get right, or else the bidirectional methods won’t be consistent with unidirectional ones.

Veach also derives (§6.2) a generalized reciprocity relationship that holds for BSDFs with refraction (in the absence of shading normals): $$ \frac{f(\omega_i \to \omega_o)}{\eta_o^2} = \frac{f(\omega_o \to \omega_i)}{\eta_i^2} $$ He proposes that instead of tracking radiance $L$ along paths, we instead track the quantity $L/\eta^2$. When BSDFs are written with respect to this modified radiance, the $\eta^2$ factors cancel out and the BSDF becomes symmetric again. In this case, no scaling needs to be done as the ray traverses different media, and paths in both directions can operate by the same rules; only at the ends of the path (at the camera and at lights) do some $\eta^2$ factors need to be incorporated. Veach argues that this is a simpler and easier-to-implement approach to path tracing overall.
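To spell out why $L/\eta^2$ needs no scaling, write $L_i$ for the radiance just before a refraction event and $L_o$ for the radiance just after (ignoring Fresnel losses, which is my simplification). The scaling rule says $$ L_o = \frac{\eta_o^2}{\eta_i^2} \, L_i \quad\Longrightarrow\quad \frac{L_o}{\eta_o^2} = \frac{L_i}{\eta_i^2} $$ so the quantity $L/\eta^2$ has the same value on both sides of the interface.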

It’s interesting to note, though, that PBRT doesn’t take Veach’s suggested approach here; it tracks unscaled radiance, and puts in the correct scaling factors due to refraction, for paths in both directions.

Conclusion

The refraction scaling business was the most surprising point for me in what I’ve read so far, but Veach’s argument for non-symmetric scattering being OK as long as you take care to handle it correctly was also very intriguing!

That brings us to the end of Chapter 7, which is about halfway through. The next chapters are about multiple importance sampling, bidirectional path tracing, and Metropolis sampling. I hope this was interesting, and maybe I’ll do a follow-up post when I’ve finished it!

Texture Gathers and Coordinate Precision

Nathan Reed — Sat, 15 Jan 2022 08:21:17 -0800

A few years ago I came across an interesting problem. I was trying to implement some custom texture filtering logic in a pixel shader. It was for a shadow map, and I wanted to experiment with filters beyond the usual hardware bilinear.

I went about it by using texture gathers to retrieve a neighborhood of texels, then performing my own filtering math in the shader. I used frac on the scaled texture coordinates to figure out where in the texel I was, emulating the logic the GPU texture unit would have used to calculate weights for bilinear filtering.

To my surprise, I noticed a strange artifact in the resulting image when I got the camera close to a surface. A grid of flickery, stipply lines appeared, delineating the texels in the soft edges of the shadows—but not in areas that were fully shadowed or fully lit. What was going on?

Dramatic reenactment of the artifact that started me on this investigation.

After some head-scratching and experimenting, I understood a little more about the source of these errors. In the affected pixels, there was a mismatch between the texels returned by the gather and the texels that the shader thought it was working with.

You see, the objective of a gather operation is to retrieve the set of four texels that would be used for bilinear filtering, if that’s what we were doing. You give it a UV position, and it finds the 2×2 quad of texels whose centers surround that point, and returns all four of them in a vector (one channel at a time).

As the UV position moves through the texture, when it crosses the line between texel centers, the gather will switch to returning the next set of four texels.

In this diagram, the large labeled squares are texels. Whenever the input UV position is within the solid blue box, the gather returns texels ABCD. If the input point moves to the right and crosses into the dotted blue box, then the gather will suddenly start returning BEDF instead. It’s a step function—a discontinuity.

Meanwhile, in my pixel shader I’m calculating weights for combining these texels according to some filter. To do that, I need to know where I am within the current gather quad. The expression for this is:

float2 texelFrac = frac(uv * textureSize - 0.5);

(The - 0.5 here is to make coordinates relative to texel centers instead of texel edges.)

This frac is supposed to wrap around from 1 back to 0 at the exact same place where the gather switches to the next set of texels. The frac has a discontinuity, and it needs to match exactly with the discontinuity in the gather result, for the filter calculation to be consistent.

But in my shader, they didn’t match. As I discovered, there was a region—a very small region, but large enough to be visible—where the gather switched to the next set of texels before the frac wrapped around to 0. Then, the shader blithely made its weight calculations for the wrong set of texels, with ugly results.

This diagram is not to scale—the actual mismatch is much smaller than depicted here—but it illustrates what was going on. It was as if the texel squares as judged by the gather were the yellow squares, ever so slightly offset from the blue ones that I got by calculating directly in the shader. Those flickery lines in the shadow will make their entrance whenever some pixels happen to fall into the tiny slivers of space between these two conflicting accounts of “where the texel grid is”.

Now on the one hand, this suggests a simple fix. We can add a small offset to our calculation:

const float offset = /* TBD */;
float2 texelFrac = frac(uv * textureSize + (-0.5 + offset));

Then we can empirically hand-tweak the value of offset, and see if we can find a value that makes the artifact go away.

On the other hand, we’d really like to understand why this mismatch exists in the first place. And as it turns out, once we understand it properly, we’ll be able to deduce the exact, correct value for offset—no hand-tweaking necessary.

Into the Texture-Verse

Texture gathers and samples are performed by a GPU’s “texture units”—fixed-function hardware blocks that shaders call out to. From a shader author’s point of view, texture units are largely a black box: put UVs in, get filtered results back. But to address our questions about the behavior of gathers, we’ll need to dig down a bit into what goes on inside that black box.

We won’t (and can’t) go all the way down to the exact hardware architecture, as those details are proprietary, and GPU vendors don’t share a lot about them. Fortunately, we won’t need to, as we can get a general logical picture of what’s happening on the basis of formal API specs, which all the vendors’ texture units need to comply with.

In particular, we can look at the Direct3D functional spec (written for D3D11, but applies to D3D12 as well), and the Vulkan spec. We could also look at OpenGL, but we won’t bother, as Vulkan generally specifies GPU behavior the same or more tightly than OpenGL.

Let’s start with Direct3D. What does it have to say about how texture sampling works?

Quite a bit—that’s the topic of a lengthy section, §7.18 Texture Sampling. There are numerous steps described for the sampling pipeline, including range reduction, texel addressing modes, mipmap selection and anisotropy, and filtering. Let’s focus in on how the texels to sample are determined in the case of (bi)linear filtering:

D3D §7.18.8 Linear Sample Addressing

…Linear sampling in 1D selects the nearest two texels to the sample location and weights the texels based on the proximity of the sample location to them.

Given a 1D texture coordinate in normalized space U, assumed to be any float32 value.

U is scaled by the Texture1D size, and 0.5f is subtracted. Call this scaledU.

scaledU is converted to at least 16.8 Fixed Point. Call this fxpScaledU.

The integer part of fxpScaledU is the chosen left texel. Call this tFloorU. Note that the conversion to Fixed Point basically accomplished: tFloorU = floor(scaledU).

The right texel, tCeilU is simply tFloorU + 1.

…

The procedure described above applies to linear sampling of a given miplevel of a Texture2D as well…
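The quoted steps can be sketched in Python for one coordinate. Two things here are assumptions on my part: the rounding mode used in the fixed-point conversion (the quoted steps don't state one, so this uses Python's `round`, which rounds to nearest with ties to even), and the reading of the fractional bits as the right texel's weight, as in ordinary linear interpolation. The function name is hypothetical.

```python
SUBTEXEL_BITS = 8   # the spec's minimum; real hardware may use more

def linear_sample_address_1d(u, texture_size):
    """Follow the quoted steps for one normalized coordinate u."""
    scaled_u = u * texture_size - 0.5
    # Convert to fixed point with 8 fractional bits. The rounding mode is
    # an assumption here: round() goes to nearest, ties to even.
    fxp_scaled_u = round(scaled_u * (1 << SUBTEXEL_BITS))
    t_floor_u = fxp_scaled_u >> SUBTEXEL_BITS        # integer part
    t_ceil_u = t_floor_u + 1
    # Fractional bits, read as the weight of the right texel.
    frac_bits = fxp_scaled_u & ((1 << SUBTEXEL_BITS) - 1)
    weight_right = frac_bits / (1 << SUBTEXEL_BITS)
    return t_floor_u, t_ceil_u, weight_right

print(linear_sample_address_1d(0.3, 16))
```

Python's `>>` on integers rounds toward negative infinity, so the integer part behaves like `floor` even for coordinates left of the first texel center.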

OK, here’s something interesting: “scaledU is converted to at least 16.8 Fixed Point.” What’s that about? Why would we want the texture sample coordinates to be in fixed-point, rather than staying in the usual 32-bit floating-point?

One reason is uniformity of precision. Another section of the D3D spec explains:

D3D §3.2.4 Fixed Point Integers

Fixed point integer representations are used in a couple of places in D3D11…

Texture coordinates for sampling operations are snapped to fixed point (after being scaled by texture size), to uniformly distribute precision across texture space, in choosing filter tap locations/weights. Weight values are converted back to floating point before actual filtering arithmetic is performed.

As you may know, floating-point values are designed to have finer precision when the value is closer to 0. That means texture coordinates would be more precise near the origin of UV space, and less elsewhere. However, image-space operations such as filtering should behave identically no matter their position within the image. Fixed-point formats have the same precision everywhere, so they are well-suited for this.

Illustration of fixed-point texture coordinates, if there were only 3 subpixel bits (2³ = 8 subdivisions). Each dot is a possible fixed-point value. Two adjacent bilinear/gather footprints are highlighted in yellow and cyan.

(Incidentally, you might wonder: don’t we already have non-uniform precision in the original float32 coordinates that the shader passed into the texture unit? Yes—but given current API limits on texture sizes, the 24-bit float mantissa gives precision equal or better than 16.8 fixed-point, throughout at least the [0,1]² UV rectangle. You can still lose too much precision if you work with too-large UV values in float32 format, though.)
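The claim in that parenthetical is easy to check with a few lines of Python. This sketch assumes a texture 16384 texels across (the D3D11 maximum for 2D textures) and computes float32 spacing from the exponent, since Python floats are 64-bit. The helper name is my own.

```python
import math

def float32_spacing(x):
    """Gap between adjacent float32 values near a positive normal x."""
    exponent = math.floor(math.log2(x))
    return 2.0 ** (exponent - 23)   # 23 stored mantissa bits (24 with the implicit one)

MAX_TEXTURE_SIZE = 16384            # D3D11 limit for 2D texture dimensions
FIXED_POINT_STEP = 1.0 / 256.0      # 8 fractional bits, in texels

# The coarsest float32 spacing inside [0, 1) occurs in [0.5, 1).
in_range = float32_spacing(0.75) * MAX_TEXTURE_SIZE
# A large UV value, e.g. from heavy tiling, has much coarser spacing.
far_out = float32_spacing(100.0) * MAX_TEXTURE_SIZE

print(in_range, in_range <= FIXED_POINT_STEP)
print(far_out, far_out <= FIXED_POINT_STEP)
```

Inside the unit square the float32 input is finer than 1/256 texel even at the largest texture size, while at a UV of 100 it is already coarser.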

Another possible reason for using fixed-point in texture units is just that integer ALUs are smaller and cheaper than floating-point ones. But there are a lot of other operations in the texture pipeline still done in full float32 format, so this likely isn’t a major design concern.

Precision, Limited Edition

At this point, we can surmise that our mysterious gather discrepancy may have something to do with coordinates being converted to “at least 16.8 fixed point”, per the D3D spec.

These are the scaled texel coordinates, so the integer part of the value (the 16 bits in front of the radix point) determines which texels we’re looking at, and then there are at least 8 more bits in the fractional part, specifying where we are within the texel.

The minimum 8 bits of sub-texel precision is also re-stated in various other locations in the spec, such as:

D3D §7.18.16.1 Texture Addressing and LOD Precision

The amount of subtexel precision required (after scaling texture coordinates by texture size) is at least 8-bits of fractional precision (2⁸ subdivisions).

The D3D spec text is also clear that conversion to fixed-point occurs before taking the integer part of the coordinate to determine which texels are filtered.

But how does this end up inducing a tiny offset to the locations of texel squares, when we compare the 32-bit float inputs to the fixed-point versions?

There’s one more ingredient we need to look at, which is how the conversion to fixed-point is accomplished. Specifically: how does it do rounding? The 16.8 fixed-point format has coarser precision than the input floats in most cases, so floats will need to be snapped to one of the available 16.8 values.

Back to our best friend, the D3D spec, which gives detailed rules about the various numeric formats, the arithmetic rules they need to satisfy, and the processes for conversion amongst them. Regarding conversion of floats to fixed-point:

D3D §3.2.4.1 FLOAT -> Fixed Point Integer

For D3D11 implementations are permitted 0.6f ULP tolerance in the integer result vs. the infinitely precise value n*2^f after the last step above.

The diagram below depicts the ideal/reference float to fixed conversion (including round-to-nearest-even), yielding 1/2 ULP accuracy to an infinitely precise result, which is more accurate than required by the tolerance defined above. Future D3D versions will require exact conversion like this reference.

[in the “float32 -> Fixed Point Conversion” diagram:]

Round the 32-bit value to a decimal that is extraBits to the left of the LSB end, using nearest-even.

There’s the answer: the conversion uses rounding to nearest-even (the same as the default mode for float math). This means floating-point values will be snapped to the nearest fixed-point value, with ties breaking to the even side.
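As a rough illustration of that reference conversion, here is a minimal Python sketch. It assumes the D3D minimum of 8 fraction bits, and it uses Python's 64-bit floats rather than float32, so it models the rounding rule rather than any real hardware. The names `FRAC_BITS`, `to_fixed`, and `texel_index` are made up for this example.

```python
FRAC_BITS = 8  # assumption: the D3D minimum of 8 subtexel bits


def to_fixed(x: float, frac_bits: int = FRAC_BITS) -> int:
    """Return the fixed-point integer n closest to x * 2**frac_bits.

    Python's round() on a float rounds half to even, which matches the
    round-to-nearest-even rule in the reference conversion.
    """
    return round(x * (1 << frac_bits))


def texel_index(x: float, frac_bits: int = FRAC_BITS) -> int:
    """Integer part of the fixed-point value, i.e. which texel we are in."""
    return to_fixed(x, frac_bits) >> frac_bits


# A coordinate in the last 1/512th of texel 2 snaps up into texel 3.
x = 3.0 - 1.0 / 1024.0
print(int(x), texel_index(x))
```

The tie cases are where nearest-even matters: 0.5/256 rounds down to 0, while 1.5/256 rounds up to 2.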

Now, we’re finally in a position to explain the artifact that started this whole quest. When we pass our float32 UVs into the texture unit, they get rounded to the nearest fixed-point value at 8 subtexel bits; in other words, the nearest 1/256th of a texel. This means that the last half a bit (the last 1/512th of a texel) will round up to the next higher integer texel value.

When fixed-point conversion is done by round-to-nearest, all the points in the yellow square end up rounded to one of the yellow dots, and assigned the corresponding set of texels; likewise the cyan ones.

Note how the squares are offset from the texel centers by half the grid spacing.

Therefore, in that last 1/512th, bilinear filtering operations and gathers will choose a one-higher set of texels to interpolate between—while the shader computing frac on the original float32 values will still think it’s in the original set of texels. This is exactly what we saw in the original artifact!

Accordingly, we can now see that the frac input needs to be shifted by exactly 1/512th texel in order to make its wrap point line up. It’s very much like the old C/C++ trick of adding 0.5 before converting a float to integer, to obtain rounding instead of truncation.

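// Half of one 1/256-texel fixed-point step; moves frac's wrap point to where the texture unit's rounding puts it.
const float offset = 1.0/512.0;
float2 texelFrac = frac(uv * textureSize + (-0.5 + offset));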

Lo and behold, the flickery lines on the shadow are now completely gone. 👌🎉😎
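To see the effect of the offset in isolation, the following Python sketch compares a simplified model of the texture unit (round-to-nearest-even at 8 fraction bits, then take the integer part) against the shader's floor, with and without the 1/512 shift. The model, the sampling grid, and the function names are assumptions for illustration; the input `u` stands for the scaled coordinate with the 0.5 texel shift already applied.

```python
import math

OFFSET = 1.0 / 512.0  # half of one 1/256-texel step


def hw_texel(u: float) -> int:
    """Assumed texture-unit model: snap to 8 fraction bits, keep integer part."""
    return round(u * 256) >> 8


def shader_texel(u: float, offset: float = 0.0) -> int:
    """What the shader infers from the original float coordinate."""
    return math.floor(u + offset)


def count_mismatches(offset: float, steps_per_texel: int = 4096, texels: int = 4) -> int:
    """Count sample points where shader and texture unit pick different texels."""
    bad = 0
    for i in range(texels * steps_per_texel):
        u = i / steps_per_texel  # exactly representable, so the math is exact
        if hw_texel(u) != shader_texel(u, offset):
            bad += 1
    return bad


print(count_mismatches(0.0), count_mismatches(OFFSET))
```

On this grid, the unshifted version disagrees at 8 samples just below each of the 4 texel boundaries (32 in total), and the shifted version disagrees nowhere.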

Eight is a Magic Number

All GPUs that support D3D11—which means essentially all PC desktop/laptop GPUs from the last decade and a half—should be compliant with the D3D spec, so they should all be rounding and converting their texture coordinates the same way. Except that there’s still some wiggle room there: the spec only prescribes 8 subtexel bits as a minimum. GPU designers have the option to use more than 8, if they wish. How many bits do they actually use?

Let’s see what Vulkan has to say about it. The Vulkan spec’s chapter §16 Image Operations describes much the same operations as the D3D spec, but at a more abstract mathematical level—it doesn’t nail down the exact sequence of operations and precision the way D3D does. In particular, Vulkan doesn’t say what numeric format should be used for the floor operation that extracts the integer texel coordinates. However, it does say:

VK §16.6 Unnormalized Texel Coordinate Operations

…the number of fraction bits retained is specified by VkPhysicalDeviceLimits::subTexelPrecisionBits.

So, Vulkan doesn’t out-and-out say that texture coordinates should be converted to a fixed-point format, but that seems to be implied or assumed, given the specification of a number of “fraction bits” retained.

Also, in Vulkan the number of subtexel bits can be queried in the physical device properties. That means we can use Sascha Willems’ fantastic Vulkan Hardware Database to get an idea of what subTexelPrecisionBits values are reported for actual GPUs out there.

The results as of this writing show about 89% of devices returning 8, and nearly all of the rest returning 4. There are no devices returning more than 8.

The distribution of subTexelPrecisionBits as reported by the Vulkan Hardware Database. The reports of values 0 and 6 look bogus, as do most of the reports of 4.

The Vulkan spec minimum for subTexelPrecisionBits is also 4, not 8 (see Table 53 – Required Limits). And it seems there’s a significant minority of GPUs that have only 4 subtexel bits. Or is there? Let’s poke at that a little further.

Of the reports that return 4 bits, a majority of them seem to be from Apple platforms. Now, Apple doesn’t implement Vulkan directly, so these must be going through MoltenVK. And it turns out that MoltenVK hardcodes subTexelPrecisionBits to 4, at the time of this writing. The associated comment suggests that Metal doesn’t publicly expose or specify this value, so they’re just setting it to the minimum. This value shouldn’t be taken as meaningful! In fact, I would bet money that all the Apple GPUs have 8 subtexel bits, just like everyone else. (The only one I’ve tested directly is the M1, and it indeed seems to be 8.) However, I don’t think there is any public documentation from Apple to confirm or refute this.

Many other reports of 4 subtexel bits come from older Linux drivers for GPUs that definitely have 8 subtexel bits; those might also be incomplete Vulkan implementations, or some other odd happenstance. Some Android GPUs also have both 4 and 8 reported in the database for the same GPU; I assume 8 is the correct value for those. Finally, there are software rasterizers such as SwiftShader and llvmpipe, which also seem to just return the spec minimum.

The fact that the Vulkan spec minimum is 4, rather than 8, suggests that there are (or were) some GPUs out there that actually only have 4 subtexel bits—or why wouldn’t the spec minimum be 8? But I haven’t been able to find out what GPUs those could be.

Moreover, there’s a very practical reason why 8 bits is the standard value! Subtexel precision is directly related to bilinear filtering, and most textures in 3D apps are in 8-bit-per-channel formats. If you’re going to interpolate 8-bit texture values and store them in an 8-bit framebuffer, then you need 8-bit subtexel precision; otherwise, you’re likely to see banding whenever a texture is magnified—whenever the camera gets close to a surface. Lots of effects like reflection cubemaps, skyboxes, and bloom filters would also be really messed up if you had less than 8 subtexel bits!
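The banding argument can be made concrete with a small Python sketch. It assumes a deliberately simplified filter: the interpolation weight between two neighboring texels is quantized to the given number of fraction bits, and the result is rounded to an 8-bit value. Real hardware will differ in its details, and `distinct_levels` is a name invented for this example.

```python
def distinct_levels(frac_bits: int, a: int = 0, b: int = 255, samples: int = 4096) -> int:
    """Count distinct 8-bit outputs when blending texel values a and b.

    The blend weight is quantized to frac_bits fraction bits, mimicking
    the subtexel precision of the coordinate.
    """
    scale = 1 << frac_bits
    levels = set()
    for i in range(samples):
        t = i / samples                 # position between the two texels
        w = round(t * scale) / scale    # weight after quantization
        levels.add(round(a + (b - a) * w))
    return len(levels)


print(distinct_levels(4), distinct_levels(8))
```

Under this model, a magnified ramp from 0 to 255 gets only 17 distinct output levels with 4 fraction bits, but all 256 with 8.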

Overall, it seems very safe to assume that any GPU you’d actually want to run on will have exactly 8 bits of subtexel precision—no more, no less.

What about the rounding mode? Unfortunately, as noted earlier, the Vulkan spec doesn’t actually say that texture coordinates should be converted to fixed-point, and thus doesn’t specify rounding behavior for that operation.

Given that the D3D behavior is more tightly specified here, we can expect that behavior to hold whenever we’re on a D3D-supporting GPU (even if we’re running with Vulkan or OpenGL on that GPU). The question is a little trickier for other GPUs, such as Apple’s and the assorted mobile GPUs. They don’t support D3D, so they’re under no obligation to follow D3D’s spec. That said, it seems probable that they do also use round-to-nearest here, especially Apple. (I’d be a little more hesitant to assume this across the board with the mobile crowd.)

I can tell you that from my experiments, the 1/512 offset consistently fixes the gather mismatch across all desktop GPU vendors, OSes, and APIs that I’ve been able to try, including Apple’s. However, I haven’t had the chance to test this on mobile GPUs so far.

Interlude: Nearest Filtering

I initially followed a bit of a red herring with this investigation. I wanted to verify whether the 1/512 offset was correct across a wider range of hardware, so I created a Shadertoy to test it, and asked people to run it and let me know the results. (By the way, thanks, everyone!)

The results I got were all over the place. For some GPU vendors an offset was required, and for others, it wasn’t. In some cases, it seemed like it might have changed between different architectures of the same vendor. There was even some evidence that it depended on which API you were using, with D3D and OpenGL giving different results on the same GPU—although I wasn’t able to conclusively verify that. Oh jeez. What the heck?

As it turns out, I’d taken a shortcut that was actually kind of a long-cut. You see, Shadertoy is built on WebGL, which doesn’t actually support texture gathers currently (they’re planned to be in the next version of WebGL). So, I substituted with something that’s similar in many ways: nearest-neighbor filtering mode.

Just like gathers, nearest-neighbor filtering also has to select a texel based on the texture unit’s judgement of which texel square your coordinates are in, and there is again the possibility of a mismatch versus the shader’s version of the calculation. The only difference is that there isn’t a 0.5 texel offset—otherwise, I expected it to work the same way as a gather, using the same math and rounding modes.

Surprise! It doesn’t. The results of nearest-neighbor filtering suggest that GPUs aren’t consistent in how they compute the nearest texel to the sample point. To find the nearest texel, we need to apply floor to the scaled texel coordinates; but it looks like some GPUs round off the coordinates to 8 subtexel bits before taking the floor, while others might truncate instead of rounding, or might just be applying floor to the floating-point value directly, rather than converting it to fixed-point at all.
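The three candidate behaviors can be written side by side. This Python sketch is only a guess at what different GPUs might be doing, based on the description above; none of these functions is taken from a spec or a driver.

```python
import math


def nearest_round(u: float) -> int:
    """Round to 8 fraction bits (nearest-even), then take the integer part."""
    return round(u * 256) >> 8


def nearest_truncate(u: float) -> int:
    """Truncate to 8 fraction bits, then take the integer part."""
    return math.floor(u * 256) >> 8


def nearest_float_floor(u: float) -> int:
    """Skip fixed-point entirely and floor the float coordinate."""
    return math.floor(u)


u = 3.0 - 1.0 / 1024.0  # inside the last 1/512th of texel 2
print(nearest_round(u), nearest_truncate(u), nearest_float_floor(u))
```

Truncating to fixed-point and flooring the float always pick the same texel, so a shader cannot tell those two apart; only the rounding variant shifts the boundary by 1/512th of a texel.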

Now, the D3D11 functional spec does say (§7.18.7 Point Sample Addressing) that point sampling (aka nearest filtering) is supposed to use the same fixed-point conversion and rounding as in the bilinear case. And some GPUs out there are definitely in violation of that, to the tune of 1/512th texel, unless I’ve misunderstood something!

Here’s the Shadertoy, if you want to check it out (see the code comments for an explanation).

Happily, however, if you’re actually interested in gathers, the behavior of those appears to be completely consistent. (Honestly, surprising for anything to do with GPU hardware!)

Conclusion

The inner workings of texture units are something we can usually gloss over as GPU programmers. For the most part, once we’ve prepared the mipmaps and configured the sampler settings, things Just Work™ and we don’t need to think about it a lot.

Once in a while, though, something comes along that brings the texture unit’s internal behavior to the fore, and this was a great example. If you ever try to build a custom filter in a shader using texture gathers, the mismatch in the texture unit’s internal precision versus the float32 calculations in the shader will create a very noticeable visual issue.

Fortuitously, we were able to get a good read on what’s going on from a close perusal of API specs, and hardware survey data plus a few directed tests helped to confirm that gathers really do work the way it says in the spec, across a wide range of GPUs. And best of all, the fix is simple and universal once we’ve understood the problem.

git-partial-submodule

Nathan Reed — Sat, 04 Sep 2021 11:47:29 -0700

View on GitHub

Have you ever thought about adding a submodule to your git project, but you didn’t want to bear the burden of downloading and storing the submodule’s entire history, or you only need a handful of files out of the submodule?

Git provides partial clone and sparse checkout features that can make this happen for top-level repositories, but so far they aren’t available for submodules. That’s a hole I aimed to fill with this project. git-partial-submodule is a tool for setting up submodules with blobless clones. It can also save sparse-checkout patterns in your .gitmodules file, allowing them to be managed by version control, and automatically applied when the submodules are cloned.

As a motivating example, a fresh clone of Dear ImGui consumes about 80 MB (of which 75 MB is in the .git directory) and takes about 10 seconds to clone on a fast connection. It also brings in roughly 200 files, including numerous examples and backends and various other ancillary files. The actual ImGui implementation—the part you need for your app—is in 11 files totaling 2.5 MB.

In contrast, a blobless, sparse clone of Dear ImGui requires only about 7 MB (4.5 MB in the .git directory), takes ~2 seconds to clone, and checks out only the files you want.

(This is not to pick on Dear ImGui at all! These issues arise with any healthy, long-lived project, and the history bloat in particular is an artifact of git’s design.)

One way developers might address this is by “vendoring”, or copying the ImGui files they need into their own repository and checking them in. That can be a legitimate solution, but it has various downsides.

Another solution supported out of the box by git is “shallow” clones, which essentially only download the latest commit and no history. Submodules can be configured to be cloned shallowly. This works, and is useful in some cases such as cloning on a build machine where you’re not going to be manipulating the repository at all. However, shallow clones make it difficult to do normal development workflows with the submodule. In contrast, a blobless clone functions normally with most workflows, as it can download missing data on demand.

Since git’s own submodule commands do not (yet) allow specifying blobless mode or sparse checkout, I built git-partial-submodule to work around this. It’s a single-file Python script that you use just for the initial setup of submodules. Instead of git submodule add, you do git-partial-submodule.py add. When cloning a repository with existing submodules, you use git-partial-submodule.py clone instead of recursively cloning or git submodule update --init.

It works by manually calling git clone with the blobless/sparse options, setting up the submodule repo in your .git/modules directory, and hooking everything up so git sees it as a legit submodule. Afterward, ordinary submodule operations such as fetches and updates should work normally—although I haven’t done super extensive testing on this, and I’ve been warned that blobless/sparse are still experimental git features that may have sharp edges.
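To give a flavor of what such a manual clone could look like, here is a hedged Python sketch that only builds the command line. It is not the tool's actual code: the function name and argument layout are invented, and the real script may use different options. The flags themselves (`--filter=blob:none`, `--sparse`, `--no-checkout`, `--separate-git-dir`) are standard `git clone` options.

```python
def blobless_clone_argv(url: str, name: str, path: str, sparse: bool = True) -> list:
    """Build a git clone command for a blobless (and optionally sparse) clone.

    Hypothetical helper: the repository data is placed under
    .git/modules/<name>, where git keeps submodule repositories.
    """
    argv = [
        "git", "clone",
        "--filter=blob:none",   # blobless: fetch file contents on demand
        "--no-checkout",        # defer checkout until sparse patterns are set
        f"--separate-git-dir=.git/modules/{name}",
    ]
    if sparse:
        argv.append("--sparse")
    argv += [url, path]
    return argv


print(" ".join(blobless_clone_argv("https://example.com/lib.git", "lib", "third_party/lib")))
```

The resulting list could be handed to `subprocess.run(argv, check=True)`; registering the submodule in `.gitmodules` and the index would still be a separate step.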

The other thing git-partial-submodule does is to save and restore sparse-checkout patterns in your .gitmodules for each submodule. When you only need a subset of the submodule’s file tree, this lets you manage those patterns under version control in the superproject, so that others who clone the project (and are also using git-partial-submodule) will automatically get the right set of files. You can configure this using the ordinary git sparse-checkout commands, but currently you have to remember to do the extra step of saving the patterns to .gitmodules when changing them, or restoring the patterns from .gitmodules after pulling/merging. It might be possible to automate this further with some git hooks, but I haven’t looked into it yet.

I’m excited to try out this workflow for some of my own projects, replacing vendored projects with partial submodules, and I hope it will be helpful to some others out there as well. Issues and PRs are open on GitHub, and contributions are welcome. If you end up trying this, let me know if it works for you!

Slope Space in BRDF Theory

Nathan Reed — Fri, 16 Jul 2021 15:34:37 -0700

When you read BRDF theory papers, you’ll often see mention of slope space. Sometimes, components of the BRDF such as NDFs or masking-shadowing functions are defined in slope space, or operations are done in slope space before being converted back to ordinary vectors or polar coordinates. However, the meaning and intuition of slope space is rarely explained. Since it may not be obvious exactly what slope space is, why it is useful, or how to transform things to and from it, I thought I would write down a gentler introduction to it.

Slope Refresher

First off, what even is this “slope” thing we’re talking about? If you think back to your high school algebra class, the slope of a line was defined as “rise over run”, or the ratio $\Delta y / \Delta x$ between some two points on the line.

The steeper the line, the larger the magnitude of its slope. The sign of the slope indicates which direction the line is sloping in. The slope is infinite if the line is vertical.

The concept of slope can readily be generalized to planes as well as lines. Planes have two slopes, one for $\Delta z / \Delta x$ and one for $\Delta z / \Delta y$ (using $z$-up coordinates, and assuming the surface is not vertical):

These values describe how much the surface rises or falls in $z$ if you take a step along either $x$ or $y$. This completely specifies the orientation of a planar surface, as steps in any other direction can be derived from the $x$ and $y$ slopes.

In calculus, the slope of a line is generalized to the derivative or “instantaneous slope” of a curve, $\mathrm{d}y/\mathrm{d}x$. For curved surfaces, so long as they can be expressed as a heightfield (where $z$ is a function of $x, y$), slopes become partial derivatives $\partial z / \partial x$ and $\partial z / \partial y$.

It’s worth noting that slopes are completely coordinate-dependent quantities. If you transform to a different coordinate system, the slopes of $z$ with respect to $x, y$ will be totally different values, or even infinite (if the surface is not a heightfield anymore in the new coordinates).

Normals and Slopes

We usually describe surfaces in 3D by their normal vector rather than their slopes, as the normal is able to gracefully handle surfaces in any orientation without infinities, and is easier to transform into different coordinate systems. However, there is a simple relationship between a surface’s normal and its slopes, as this diagram should hopefully convince you:

The two triangles with the dotted lines in the figure are congruent (same angles and sizes), but rotated by 90 degrees. As the normal is, by definition, perpendicular to the surface, the normal’s components have the same proportionality as coordinate deltas along the surface, just swapped around. This diagram shows the $xz$ projection, but the same holds true of the $yz$ components: $$ \begin{aligned} \frac{\Delta z}{\Delta x} &= -\frac{\mathbf{n}_x}{\mathbf{n}_z} \\[1em] \frac{\Delta z}{\Delta y} &= -\frac{\mathbf{n}_y}{\mathbf{n}_z} \end{aligned} $$ The negative sign is because $\Delta z$ is going down while $\mathbf{n}_z$ is going up (or vice versa, depending on the orientation).

Just for completeness, when you have a heightfield surface $z(x, y)$, the partial derivatives are related to its normal at a point in the same way: $$ \begin{aligned} \frac{\partial z}{\partial x} &= -\frac{\mathbf{n}_x}{\mathbf{n}_z} \\[1em] \frac{\partial z}{\partial y} &= -\frac{\mathbf{n}_y}{\mathbf{n}_z} \end{aligned} $$

Slope Space

Now we’re finally ready to define slope space. Due to the relationship between slopes and normal vectors, slopes act as an alternate parameterization of unit vectors in the $z > 0$ hemisphere. Given any vector, we can treat it as a normal and find the slopes of a surface perpendicular to it. “Slope space” refers to this domain: the 2D space of all the possible slope values. As slopes can be any real numbers, slope space is just the real plane, $\mathbb{R}^2$, but with a special meaning.

A good way to visualize slope space is to identify it with the plane $z = 1$. Then, vectors at the origin can be converted to slope space by intersecting them with the plane:

Here I’ve introduced the notation $\tilde{\mathbf{n}}$ for the 2D vector in slope space corresponding to the 3D vector $\mathbf{n}$. The tilde ($\sim$) notation for slope-space quantities is commonly used in the BRDF literature, and I’ll follow it here.

Intersecting a ray with the $z = 1$ plane is equivalent to rescaling the vector so that $\mathbf{n}_z = 1$, and then the slopes can be read off as the negated $x, y$ components of the rescaled vector. You can visualize the slope plane as having inverted $x, y$ axes compared to the base coordinates to take care of this. (Note the $x$-axis on the slope plane, pointing to the left, in the diagram above.)

So, you can picture the hemisphere being blown up and stretched onto the plane, by projecting each point away from the origin until it hits the plane. This establishes a bijection (one-to-one mapping) between the unit vectors with $z > 0$ and points on the plane.

To make it official, the slope-space parameterization of an arbitrary vector $\mathbf{v}$ with $\mathbf{v}_z > 0$ is defined by: $$ \begin{aligned} \tilde{\mathbf{v}}_x &= -\frac{\mathbf{v}_x}{\mathbf{v}_z} \\[1em] \tilde{\mathbf{v}}_y &= -\frac{\mathbf{v}_y}{\mathbf{v}_z} \end{aligned} $$ This assumes that the vector is upward-pointing, so that $\mathbf{v}_z > 0$. Finite slopes cannot represent horizontal vectors (normal to vertical surfaces), and they cannot distinguish between upward- and downward-pointing vectors, as slopes have no sense of orientation—reverse the normal, and you still get the same slopes.

Converting back from slopes to an ordinary unit normal vector is also simple: $$ \mathbf{v} = \text{normalize}(-\tilde{\mathbf{v}}_x, -\tilde{\mathbf{v}}_y, 1) $$
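Both directions of the conversion fit in a few lines. The Python sketch below follows the two formulas above directly; the function names are made up, and vectors are plain tuples to keep it dependency-free.

```python
import math


def to_slope(v):
    """Map a vector with v_z > 0 to its slope-space point (-v_x/v_z, -v_y/v_z)."""
    x, y, z = v
    if z <= 0.0:
        raise ValueError("slope space only covers vectors with z > 0")
    return (-x / z, -y / z)


def from_slope(s):
    """Map a slope-space point back to the unit vector normalize(-sx, -sy, 1)."""
    sx, sy = s
    inv_len = 1.0 / math.sqrt(sx * sx + sy * sy + 1.0)
    return (-sx * inv_len, -sy * inv_len, inv_len)


n = (0.6, 0.0, 0.8)          # a unit normal tilted toward +x
print(to_slope(n))           # the surface falls in z as x increases
print(from_slope(to_slope(n)))
```

Note that `to_slope` does not require its input to be normalized, since the division by `z` cancels any overall scale.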

Converting to Polar Coordinates

Another common parameterization of unit vectors is the polar coordinates $\theta, \phi$. It’s straightforward to work out the direct conversion between slope space and polar coordinates.

Following common conventions, we define the polar coordinates so that $\theta$ measures downward from the $+z$ axis, and $\phi$ measures counterclockwise from the $+x$ axis. The conversion between polar and 3D unit vectors is: $$ \begin{aligned} \theta &= \text{acos}(z) \\ \phi &= \text{atan2}(y, x) \end{aligned} \qquad \begin{aligned} x &= \sin\theta \cos\phi \\ y &= \sin\theta \sin\phi \\ z &= \cos\theta \end{aligned} $$ and the conversion between polar and slope space is: $$ \begin{aligned} \theta &= \text{atan}(\sqrt{\tilde x^2 + \tilde y^2}) \\ \phi &= \text{atan2}(-\tilde y, -\tilde x) \end{aligned} \qquad \begin{aligned} \tilde x &= -\!\tan\theta \cos\phi \\ \tilde y &= -\!\tan\theta \sin\phi \\ \end{aligned} $$ This can be derived by setting $\tilde x = -x/z$ and substituting the conversion from polar, then using the identity $\sin/\cos = \tan$.

A fact worth noting here is that the magnitude of a slope-space vector, $|\tilde{\mathbf{v}}|$, is equal to $\tan\theta_\mathbf{v}$.
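The polar conversions can be sketched the same way. This Python snippet transcribes the formulas above (the function names are invented) and checks the $\tan\theta$ fact with a round trip.

```python
import math


def polar_to_slope(theta: float, phi: float):
    """(theta, phi) -> slope-space point, for theta in [0, pi/2)."""
    t = math.tan(theta)
    return (-t * math.cos(phi), -t * math.sin(phi))


def slope_to_polar(sx: float, sy: float):
    """Slope-space point -> (theta, phi), with phi in (-pi, pi]."""
    theta = math.atan(math.hypot(sx, sy))
    phi = math.atan2(-sy, -sx)
    return (theta, phi)


sx, sy = polar_to_slope(0.7, 2.0)
print(math.hypot(sx, sy), math.tan(0.7))   # magnitude equals tan(theta)
print(slope_to_polar(sx, sy))              # recovers (0.7, 2.0) up to rounding
```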

Properties of Slope Space

Now we’ve seen how to define slope space and convert back and forth from it. But why is it useful? Why would we want to represent vectors or functions in this way?

In microfacet BRDF theory, we usually assume the microsurface is a heightfield for simplicity (which is a pretty reasonable assumption for a lot of everyday materials). If the microsurface is a heightfield, then its normals are constrained to the upper hemisphere. Slope space, which parameterizes exactly the upper hemisphere, is a good match for this.

From a performance perspective, slope space is also much cheaper to transform to and from than polar coordinates, which makes it nicer to use in shaders. It requires only some divides or a normalize, as opposed to a bunch of forward or inverse trigonometric functions.

Slope space also has no boundaries, in contrast to other representations of unit vectors. The origin (0, 0) of the slope plane represents a flat surface normal, and the farther away you get, the more extreme the slope, but you can’t make the surface turn upside down or produce an invalid normal. So, you can freely do various manipulations on vectors in slope space without worrying about exceeding any bounds.

Another useful fact about slope space is that many linear transformations of a surface, such as scaling or shearing, map to transformations of its slope space in simple ways. For example, scaling a surface by a factor $\alpha$ along its $z$-axis causes its normal vectors’ $z$-components to scale by $1/\alpha$ (due to normals taking the inverse transpose), but then since $\mathbf{n}_z$ is in the denominator in the definition of slope space, we have that the slopes of the surface are scaled by $\alpha$.

Here’s a table of how transformations of the microsurface map to transformations of slope space:

Surface	Slope Space
Horizontal scale by $(\alpha_x, \alpha_y)$	Scale by $(1/\alpha_x, 1/\alpha_y)$
Vertical scale by $\alpha$	Scale by $\alpha$
Horizontal rotate ($xy$) by $\theta$	Rotate by $\theta$
Vertical rotate ($xz, yz$)	Projective transform (not recommended)
Horizontal shear ($xy$) by $\begin{bmatrix} 1 & k_2 \\ k_1 & 1 \end{bmatrix}$	Shear by $\begin{bmatrix} 1 & -k_1 \\ -k_2 & 1 \end{bmatrix}$
Vertical shear by $\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ k_x & k_y & 1 \end{bmatrix}$	Translate by $(k_x, k_y)$
Vertical shear by $\begin{bmatrix} 1 & 0 & k_x \\ 0 & 1 & k_y \\ 0 & 0 & 1 \end{bmatrix}$	Projective transform (not recommended)

These transformations in slope space are often exploited by parameterized BRDF models; they can implement roughness, anisotropy, and such as transformations applied to a single canonical BRDF (see for example Heitz 2014, section 5).
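A few rows of the table can be checked numerically without any normal-vector algebra. The Python sketch below takes a plane $z = s_x x + s_y y$, pushes two points on it through a 3×3 matrix, and reads the slopes back off the transformed plane. The helper names are invented, and the test values are arbitrary.

```python
def apply(m, p):
    """Multiply a 3x3 matrix (nested lists) by a 3-vector."""
    return tuple(sum(m[r][c] * p[c] for c in range(3)) for r in range(3))


def slopes_after(m, slopes):
    """Slopes of the plane z = sx*x + sy*y after transforming it by m."""
    sx, sy = slopes
    x1, y1, z1 = apply(m, (1.0, 0.0, sx))   # two points on the plane;
    x2, y2, z2 = apply(m, (0.0, 1.0, sy))   # the origin maps to itself
    det = x1 * y2 - x2 * y1
    # Solve a*x + b*y = z for both points (Cramer's rule).
    return ((z1 * y2 - z2 * y1) / det, (x1 * z2 - x2 * z1) / det)


s = (0.3, -0.5)
vertical_scale = [[1, 0, 0], [0, 1, 0], [0, 0, 2]]
horizontal_scale = [[2, 0, 0], [0, 4, 0], [0, 0, 1]]
vertical_shear = [[1, 0, 0], [0, 1, 0], [0.1, 0.2, 1]]

print(slopes_after(vertical_scale, s))    # slopes scaled by 2
print(slopes_after(horizontal_scale, s))  # slopes scaled by (1/2, 1/4)
print(slopes_after(vertical_shear, s))    # slopes translated by (0.1, 0.2)
```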

Distributions in Slope Space

One of the key ingredients in a microfacet BRDF is its normal distribution function (NDF), and one of the key uses for slope space is defining NDFs. Because slope space is an unbounded 2D plane, we can import existing 1D or 2D distribution functions and manipulate them in various ways, just as we would in any 2D domain. As long as we end up with a valid, normalized probability distribution in the slope plane (sometimes called a slope distribution function, or a $P^{22}$ function—I’m not sure where the latter term comes from), we can transform it to a properly normalized NDF expressed in polar or vector form. Let’s see how to do that.

The Jacobian

When mapping distribution functions from one space to another, it’s important to remember that the values of these functions are not dimensionless numbers; they are densities with respect to the area or volume measure of the underlying space. Therefore, it’s not enough just to change variables to express the function in the new coordinates; you also have to correct for the way the mapping stretches or squeezes the volume, which can vary from place to place.

Symbolically, suppose we have a domain $A$ with a probability density $p(a)$ defined on it. We want to map this to a domain $B$ parameterized by some new coordinates $b$. What we want is not just $p(a) = p(b)$ when $a \mapsto b$ under the mapping. Rather, we need to maintain: $$ p(a) \, \mathrm{d}A = p(b) \, \mathrm{d}B $$ where $\mathrm{d}A, \mathrm{d}B$ are matching volume elements of the respective spaces, with $\mathrm{d}A \mapsto \mathrm{d}B$ under the mapping we’re using. This says that the amount of probability (or whatever thing whose density we’re measuring) in the infinitesimal volume $\mathrm{d}A$ is conserved under the mapping; the same amount of probability is present in $\mathrm{d}B$.

This equation can be rewritten: $$ p(b) = p(a) \frac{\mathrm{d}A}{\mathrm{d}B} $$ The factor $\mathrm{d}A / \mathrm{d}B$ here is called the Jacobian, referring to the determinant of the Jacobian matrix which contains all the derivatives of the change of variables from $a$ to $b$. Actually, this is the inverse Jacobian, as the forward Jacobian for $A \to B$ would be $\mathrm{d}B / \mathrm{d}A$. The forward Jacobian is the factor by which the mapping stretches or squeezes volumes locally around a point. Because a probability density has volume in the denominator, it transforms using the inverse Jacobian.

So, when converting a slope-space distribution to an NDF, we have to multiply by the appropriate Jacobian. But how do we find out what that is? First off, we have to recall that NDFs are defined not as a density over solid angle in the hemisphere, but as a density over projected area on the $xy$ plane. Thus, it’s not enough to just find the Jacobian from slope space to polar coordinates; we also need to find the Jacobian from polar coordinates to projected area.

To do this, I find it easiest to use the formalism of differential forms. Explaining how those work is out of the scope of this article, but here’s an exposition I found useful. They’re essentially fields of dual $k$-vectors.

First, we can write down the $xy$ projected area element, $\mathrm{d}x \wedge \mathrm{d}y$, in terms of polar coordinates by differentiating the mapping from polar to Cartesian, which I’ll repeat here for convenience: $$ \begin{gathered} \left\{ \begin{aligned} x &= \sin\theta \cos\phi \\ y &= \sin\theta \sin\phi \\ z &= \cos\theta \end{aligned} \right. \\[2em] \begin{aligned} \mathrm{d}x \wedge \mathrm{d}y &= (\cos\theta\cos\phi\,\mathrm{d}\theta - \sin\theta\sin\phi\,\mathrm{d}\phi) \ \wedge \\ &\qquad (\cos\theta\sin\phi\,\mathrm{d}\theta + \sin\theta\cos\phi\,\mathrm{d}\phi) \\[0.5em] &= \cos\theta\sin\theta\cos^2\phi\,(\mathrm{d}\theta \wedge \mathrm{d}\phi) \ - \\ &\qquad \cos\theta\sin\theta\sin^2\phi\,(\mathrm{d}\phi \wedge \mathrm{d}\theta) \\[0.5em] &= \cos\theta\sin\theta\,(\mathrm{d}\theta \wedge \mathrm{d}\phi) \end{aligned} \end{gathered} $$ Then, we can do the same thing with the slope-space area element: $$ \begin{gathered} \left\{ \begin{aligned} \tilde x &= -\!\tan\theta \cos\phi \\ \tilde y &= -\!\tan\theta \sin\phi \\ \end{aligned} \right. \\[1.5em] \begin{aligned} \mathrm{d}\tilde x \wedge \mathrm{d} \tilde y &= -(\cos^{-2}\theta\cos\phi\,\mathrm{d}\theta - \tan\theta\sin\phi\,\mathrm{d}\phi) \ \wedge \\ &\qquad -(\cos^{-2}\theta\sin\phi\,\mathrm{d}\theta + \tan\theta\cos\phi\,\mathrm{d}\phi) \\[0.5em] &= \tan\theta\cos^{-2}\theta\cos^2\phi\,(\mathrm{d}\theta \wedge \mathrm{d}\phi) \ - \\ &\qquad \tan\theta\cos^{-2}\theta\sin^2\phi\,(\mathrm{d}\phi \wedge \mathrm{d}\theta) \\[0.5em] &= \frac{\tan\theta}{\cos^2\theta} \, (\mathrm{d}\theta \wedge \mathrm{d}\phi) \end{aligned} \end{gathered} $$ Now, all we have to do is divide: $$ \begin{aligned} \frac{\mathrm{d}\tilde x \wedge \mathrm{d} \tilde y}{\mathrm{d}x \wedge \mathrm{d}y} &= \frac{\tan\theta}{\cos^2\theta} \frac{1}{\cos\theta\sin\theta} \\[1em] &= \frac{1}{\cos^4\theta} \end{aligned} $$ Et voilà! The Jacobian for converting densities from slope space to NDF form is $1/\cos^4\theta$. 
We’ll have to multiply by this factor in addition to changing variables.
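If differential forms feel like too much machinery, the result is easy to sanity-check numerically. The Python sketch below treats the map from projected coordinates $(x, y)$ to slope space as an ordinary function, estimates its Jacobian determinant with central differences, and compares it against $1/\cos^4\theta$. The step size and the test point are arbitrary choices.

```python
import math


def xy_to_slope(x: float, y: float):
    """Projected (x, y) of an upper-hemisphere unit vector -> slope space."""
    z = math.sqrt(1.0 - x * x - y * y)
    return (-x / z, -y / z)


def jacobian_det(x: float, y: float, h: float = 1e-6) -> float:
    """Central-difference estimate of det d(slope) / d(x, y)."""
    fx1, fx0 = xy_to_slope(x + h, y), xy_to_slope(x - h, y)
    fy1, fy0 = xy_to_slope(x, y + h), xy_to_slope(x, y - h)
    a = (fx1[0] - fx0[0]) / (2 * h)   # d(slope_x)/dx
    c = (fx1[1] - fx0[1]) / (2 * h)   # d(slope_y)/dx
    b = (fy1[0] - fy0[0]) / (2 * h)   # d(slope_x)/dy
    d = (fy1[1] - fy0[1]) / (2 * h)   # d(slope_y)/dy
    return a * d - b * c


theta, phi = 0.6, 1.1
x, y = math.sin(theta) * math.cos(phi), math.sin(theta) * math.sin(phi)
print(jacobian_det(x, y), 1.0 / math.cos(theta) ** 4)
```

The two printed numbers should agree to several digits, up to finite-difference error.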

Some Common Distributions

As an example of the conversion from slope space to NDF, let’s take the standard (bivariate) Gaussian distribution defined on slope space: $$ D(\tilde{\mathbf{m}}, \sigma) = \frac{1}{2\pi\sigma^2} \exp\left(-\frac{|\tilde{\mathbf{m}}|^2}{2\sigma^2}\right) $$ To turn this into an NDF, we need to change variables from $\tilde{\mathbf{m}}$ to $(\theta_\mathbf{m}, \phi_\mathbf{m})$, and also multiply by the Jacobian $1/\cos^4\theta_\mathbf{m}$. Recalling that $|\tilde{\mathbf{m}}| = \tan\theta_\mathbf{m}$, this becomes: $$ D(\mathbf{m}, \sigma) = \frac{1}{2\pi\sigma^2\cos^4\theta_\mathbf{m}} \exp\left(-\frac{\tan^2\theta_\mathbf{m}}{2\sigma^2}\right) $$ Hey, that looks familiar—it’s the Beckmann NDF! (Although it’s more usually seen with the roughness parameter $\alpha = \sqrt{2}\sigma$.) The Beckmann distribution is a Gaussian in slope space.
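The same steps can be followed in code. This Python sketch evaluates the slope-space Gaussian, applies the change of variables and the $1/\cos^4\theta$ factor, and compares the result with the Beckmann NDF written in its usual $\alpha$ form. The function names are made up, and the sample angle and $\sigma$ are arbitrary.

```python
import math


def gaussian_slope_pdf(sx: float, sy: float, sigma: float) -> float:
    """Isotropic bivariate Gaussian on the slope plane."""
    r2 = sx * sx + sy * sy
    return math.exp(-r2 / (2 * sigma * sigma)) / (2 * math.pi * sigma * sigma)


def ndf_from_slope_pdf(theta: float, phi: float, sigma: float) -> float:
    """Change variables to polar and multiply by the Jacobian 1/cos^4."""
    t = math.tan(theta)
    sx, sy = -t * math.cos(phi), -t * math.sin(phi)
    return gaussian_slope_pdf(sx, sy, sigma) / math.cos(theta) ** 4


def beckmann_ndf(theta: float, alpha: float) -> float:
    """Beckmann NDF in its common roughness-parameter form."""
    t2 = math.tan(theta) ** 2
    return math.exp(-t2 / (alpha * alpha)) / (math.pi * alpha * alpha * math.cos(theta) ** 4)


sigma = 0.3
print(ndf_from_slope_pdf(0.5, 1.3, sigma), beckmann_ndf(0.5, math.sqrt(2) * sigma))
```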

The isotropic GGX NDF (Walter et al 2007) looks like this: $$ D(\mathbf{m}, \alpha) = \frac{\alpha^2}{\pi \cos^4\theta_\mathbf{m} \bigl(\alpha^2 + \tan^2\theta_\mathbf{m} \bigr)^2 } $$ You might now recognize those familiar-looking $\cos^4\theta_\mathbf{m}$ and $\tan\theta_\mathbf{m}$ factors. Yep, this NDF is also a convert from slope space! Working backwards, we can see that it was originally: $$ D(\tilde{\mathbf{m}}, \alpha) = \frac{\alpha^2}{\pi \bigl(\alpha^2 + |\tilde{\mathbf{m}}|^2 \bigr)^2 } $$ Although this formula is probably less familiar, it matches the pdf of the bivariate Student’s $t$-distribution with the “normality” parameter $\nu$ set to 2, and scaled by $\alpha/\sqrt{2}$. (You can also create a family of NDFs that interpolate between GGX and Beckmann, by exposing a user parameter that controls $\nu$; see Ribardière et al 2017.)

(Incidentally, the GGX NDF is often seen written in this alternate form: $$ D(\mathbf{m}, \alpha) = \frac{\alpha^2}{\pi \bigl( (\alpha^2 - 1)\cos^2\theta_\mathbf{m} + 1 \bigr)^2 } $$ This is the same function as the form above (which is from the original GGX paper), but rearranged to make it cheaper to evaluate, as it eliminates the $\tan^2$ using the identity $\tan^2 = (1 - \cos^2)/\cos^2$. However, this form also introduces numerical precision problems, and Filament has a numerically stable form: $$ D(\mathbf{m}, \alpha) = \frac{\alpha^2}{\pi \bigl(\alpha^2 \cos^2\theta_\mathbf{m} + \sin^2\theta_\mathbf{m} \bigr)^2 } $$ which is again the same function, rearranged some more; you’re meant to calculate $\sin^2\theta_\mathbf{m}$ as the squared magnitude of the cross product $|\mathbf{n} \times \mathbf{m}|^2$. This has nothing to do with slope space; I just thought it was neat and worth knowing.)
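For the skeptical, here is a short Python check that the three GGX forms agree. One caveat: this sketch computes $\sin^2\theta_\mathbf{m}$ as $1 - \cos^2\theta_\mathbf{m}$ for simplicity, which throws away the precision benefit of the cross-product approach, so it demonstrates only the algebraic equivalence. The function names are invented.

```python
import math


def ggx_walter(cos_t: float, alpha: float) -> float:
    """Form from the original GGX paper, using tan^2."""
    c2 = cos_t * cos_t
    tan2 = (1.0 - c2) / c2
    return alpha ** 2 / (math.pi * c2 * c2 * (alpha ** 2 + tan2) ** 2)


def ggx_compact(cos_t: float, alpha: float) -> float:
    """Rearranged form with the tan^2 eliminated."""
    c2 = cos_t * cos_t
    return alpha ** 2 / (math.pi * ((alpha ** 2 - 1.0) * c2 + 1.0) ** 2)


def ggx_sin_form(cos_t: float, alpha: float) -> float:
    """Form with an explicit sin^2 term (here derived from cos, see caveat)."""
    c2 = cos_t * cos_t
    sin2 = 1.0 - c2
    return alpha ** 2 / (math.pi * (alpha ** 2 * c2 + sin2) ** 2)


for cos_t in (0.2, 0.7, 0.99):
    print(ggx_walter(cos_t, 0.4), ggx_compact(cos_t, 0.4), ggx_sin_form(cos_t, 0.4))
```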

Conclusion

To recap, the most important thing to take away about slope space is that it provides an alternate representation for unit vectors in the upper hemisphere, by projecting them out onto an infinite plane. This enables us to work with distributions in plain old 2D space, and then map them back into functions on the hemisphere. Slope space also provides convenient mappings from some linear transformations of the microsurface to linear or affine transformations in the slope plane.

I hope this has demystified the concept of slope space a little bit, and now you won’t be confused by it anymore when reading BRDF papers! 😄

Hash Functions for GPU Rendering

Nathan Reed — Fri, 21 May 2021 17:52:07 -0700

Back in 2013, I wrote a somewhat popular article about pseudorandom number generation on the GPU. In the eight years since, a number of new PRNGs and hash functions have been developed; and a few months ago, an excellent paper on the topic appeared in JCGT: Hash Functions for GPU Rendering, by Mark Jarzynski and Marc Olano. I thought it was time to update my former post in light of this paper’s findings.

Jarzynski and Olano’s paper compares GPU implementations of a large number of different hash functions along dual axes of performance (measured by time to render a quad evaluating the hash at each pixel) and statistical quality (quantified by the count of failures of TestU01 “Big Crush” tests). Naturally, there is quite a spread of results in both performance and quality. Jarzynski and Olano then identify the few hash functions that lie along the Pareto frontier, meaning they are the best choices along the whole spectrum of performance/quality trade-offs.

When choosing a hash function, we might sometimes prioritize performance, and other times might prefer to sacrifice performance in favor of higher quality (real-time versus offline applications, for example). The Pareto frontier provides the set of optimal choices for any point along that balance—ranging from LCGs at the extreme performance-oriented end, to some quite expensive but very high-quality hashes at the other end.

In my 2013 article, I recommended the “Wang hash” as a general-purpose 32-bit-to-32-bit integer hash function. The Wang hash was among those tested by Jarzynski and Olano, but unfortunately it did not lie along the Pareto frontier—not even close! The solution that dominates it—and one of the best balanced choices between performance and quality overall—is PCG. In particular, the 32-bit PCG hash used by Jarzynski and Olano goes as follows:

uint pcg_hash(uint input)
{
    uint state = input * 747796405u + 2891336453u;
    uint word = ((state >> ((state >> 28u) + 4u)) ^ state) * 277803737u;
    return (word >> 22u) ^ word;
}

This has slightly better performance and much better statistical quality than the Wang hash. It’s fast enough to be useful for real-time, while also being high-quality enough for almost any graphics use-case (if you’re not using precomputed blue noise, or low-discrepancy sequences). It should probably be your default GPU hash function.
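If you want to try the hash on the CPU (for tests, or to compare against shader output), here’s a minimal C++ port. The hash arithmetic is the same as above; the two helpers, `hash_to_unit_float` and `pixel_random`, are hypothetical additions for illustration and aren’t from the paper.

```cpp
#include <cassert>
#include <cstdint>

// Same arithmetic as the shader version, using a fixed-width unsigned type
// so that overflow wraps the same way a 32-bit GPU uint does.
inline std::uint32_t pcg_hash(std::uint32_t input)
{
    std::uint32_t state = input * 747796405u + 2891336453u;
    std::uint32_t word = ((state >> ((state >> 28u) + 4u)) ^ state) * 277803737u;
    return (word >> 22u) ^ word;
}

// Hypothetical helper: map a hash to a float in [0, 1) using its top
// 24 bits, which a 32-bit float can represent exactly.
inline float hash_to_unit_float(std::uint32_t h)
{
    return static_cast<float>(h >> 8) * (1.0f / 16777216.0f);
}

// Hypothetical helper: one random value per pixel, hashing a linear index.
inline float pixel_random(std::uint32_t x, std::uint32_t y, std::uint32_t width)
{
    assert(x < width);
    return hash_to_unit_float(pcg_hash(y * width + x));
}
```

Hashing a linear pixel index is just one way to form the input; the paper’s discussion of multi-dimensional inputs covers other options.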

Just to prove it works, here’s the bit pattern generated by a few thousand invocations of the above function on consecutive inputs:

Yep, looks random! 👍

PCG Variants

Incidentally, you might notice that the PCG function posted above doesn’t match that found in other sources, such as the minimal C implementation on the PCG website. This is because “PCG” isn’t a single function, but more of a recipe for constructing PRNG functions. It works by starting with an LCG, and then applying a permutation function to mix around the bits and improve the quality of the results. There are many possible permutation functions, and O’Neill’s original PCG paper provides a set of building blocks that can be combined in various ways to get generators with different characteristics. In particular, the PCG used by Jarzynski and Olano corresponds to the 32-bit “RXS-M-XS” variant described in §6.3.4 of O’Neill. (See also the list of variants on Wikipedia).

Hash or PRNG?

One of the main points I discussed in my 2013 article was the distinction between PRNGs and hash functions: the former are designed for a good distribution within a single stateful stream, but do not necessarily provide good distribution across streams with consecutive seeds; hash functions are stateless and designed to give a good distribution even with consecutive (or otherwise highly correlated) inputs.

PCG is actually designed to be a PRNG, not a hash function, so it may surprise you to see it being used as a hash here. What gives? Well, apparently PCG is just so good that it works well as a hash function too! ¯\_(ツ)_/¯

It’s worth noting that PCG does support more or less efficient jump-ahead, owing to the LCG at its core; it’s possible to advance an LCG by $n$ steps in only $O(\log n)$ work using modular exponentiation. However, that is not what Jarzynski and Olano’s code does: it’s not jumping ahead to the $n$th value in a single PCG sequence, but essentially just taking the first value from each of $n$ sequences with consecutive initial states. The fact that this works at all is somewhat surprising, and a testament to the power of permutation functions.
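For the curious, here’s a sketch of what that jump-ahead could look like for the LCG constants used above. It treats one LCG step as the affine map `state -> mul * state + inc` and composes that map with itself by repeated squaring, with all arithmetic wrapping modulo 2^32. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

const std::uint32_t kMul = 747796405u;
const std::uint32_t kInc = 2891336453u;

// One LCG step, matching the state update in the PCG code above.
inline std::uint32_t lcg_step(std::uint32_t state)
{
    return state * kMul + kInc;
}

// Advance the LCG by n steps using O(log n) multiplications.
inline std::uint32_t lcg_advance(std::uint32_t state, std::uint32_t n)
{
    std::uint32_t cur_mul = kMul, cur_inc = kInc;  // map for 2^k steps
    std::uint32_t acc_mul = 1u, acc_inc = 0u;      // accumulated map (identity)
    while (n > 0u)
    {
        if (n & 1u)
        {
            // Compose the accumulated map with the current 2^k-step map.
            acc_mul *= cur_mul;
            acc_inc = acc_inc * cur_mul + cur_inc;
        }
        // Square the current map (apply it twice). The increment must be
        // updated before the multiplier, since it uses the old multiplier.
        cur_inc = (cur_mul + 1u) * cur_inc;
        cur_mul *= cur_mul;
        n >>= 1u;
    }
    return acc_mul * state + acc_inc;
}
```

Calling `lcg_advance(s, n)` should give the same state as calling `lcg_step` on `s` a total of `n` times, just with far fewer operations when `n` is large.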

In my previous article, I also recommended that if you need multiple random values per pixel, you could start with a hash function and then iterate either LCG or Xorshift using the hash output as an initial state. You can still do that, using PCG as the initial hash—but it might be just as fast to iterate PCG. The interesting thing about PCG’s design is that only the LCG portion of it actually carries data dependencies from one iteration to the next, and LCGs are super fast. The permutation parts are independent of each other and can be pipelined to exploit instruction-level parallelism when doing multiple iterations.

For completeness, the “PRNG form” of the above PCG variant looks like:

uint rng_state;

uint rand_pcg()
{
    uint state = rng_state;
    rng_state = rng_state * 747796405u + 2891336453u;
    uint word = ((state >> ((state >> 28u) + 4u)) ^ state) * 277803737u;
    return (word >> 22u) ^ word;
}
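Here’s one way the “hash to seed, then iterate” idea could look on the CPU side. This is a sketch under assumptions: the `PcgRng` struct and the `make_pixel_rng` seeding scheme are hypothetical names, and seeding from a hash of the linear pixel index is just one reasonable choice.

```cpp
#include <cassert>
#include <cstdint>

inline std::uint32_t pcg_hash(std::uint32_t input)
{
    std::uint32_t state = input * 747796405u + 2891336453u;
    std::uint32_t word = ((state >> ((state >> 28u) + 4u)) ^ state) * 277803737u;
    return (word >> 22u) ^ word;
}

// The PRNG form, with the state held in a struct rather than a global.
struct PcgRng
{
    std::uint32_t state;

    std::uint32_t next()
    {
        std::uint32_t old = state;
        state = state * 747796405u + 2891336453u;  // LCG step
        std::uint32_t word = ((old >> ((old >> 28u) + 4u)) ^ old) * 277803737u;
        return (word >> 22u) ^ word;               // permuted output
    }
};

// Hypothetical seeding scheme: hash the pixel's linear index to pick the
// starting state, then draw as many values as that pixel needs.
inline PcgRng make_pixel_rng(std::uint32_t x, std::uint32_t y, std::uint32_t width)
{
    assert(x < width);
    return PcgRng{ pcg_hash(y * width + x) };
}
```

One small thing that falls out of the code: the second value drawn from a generator started at state `s` is exactly `pcg_hash(s)`, since the hash is the same LCG step followed by the same permutation.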

That’s about it! Be sure to check out Jarzynski and Olano’s paper for some more tidbits, including a discussion of hashes with multi-dimensional inputs and outputs.

Making Your Own Container Compatible With C++20 Ranges

Nathan Reed — Sat, 20 Mar 2021 17:23:15 -0700

With some of my spare time lately, I’ve been enjoying learning about some of the new features in C++20. Concepts and the closely-related requires clauses are two great extensions to template syntax that remove the necessity for all the SFINAE junk we used to have to do, making our code both more readable and more precise, and providing much better error messages (although MSVC has sadly been lagging in the error messages department, at the time of this writing).

Another interesting C++20 feature is the addition of the ranges library (also ranges algorithms), which provides a nicer, more composable abstraction for operating on containers and sequences of objects. At the most basic level, a range wraps an iterator begin/end pair, but there’s much more to it than that. This article isn’t going to be a tutorial on ranges, but here’s a talk to watch if you want to see more of what it’s all about.

What I’m going to discuss today is the process of adding “ranges compatibility” to your own container class. Many of the C++ codebases we work in have their own set of container classes beyond the STL ones, for a variety of reasons—better performance, more control over memory layouts, more customized interfaces, and so on. With a little work, it’s possible to make your custom containers also function as ranges and interoperate with the C++20 ranges library. Here’s how to do it.

Making Your Container an Input Range

At the high level, there are two basic ways that a container class can interact with ranges. First, it can be readable as a range, meaning that we can iterate over it, pipe it into views and pass it to range algorithms, and so forth. In the parlance of the ranges library, this is known as being an input range: a range that can provide input to other things.

The other direction is to accept output from ranges, storing the output into your container. We’ll do that later. To begin with, let’s see how to make your container act as an input range.

Range Concepts

The first decision we have to make is what particular kind of input range we can model. The C++20 STL defines a number of different concepts for ranges, depending on the capabilities of their iterators and other things. Several of these form a hierarchy from more general to more specific kinds of ranges with tighter requirements. Generally speaking, it’s best for your container to implement the most specific range concept it’s able to. This enables code that works with ranges to make better decisions and use more optimal code paths. (We’ll see some examples of this in a minute.)

The relevant input range concepts are:

std::ranges::input_range: the most bare-bones version. It requires only that you have iterators that can retrieve the contents of the range. In particular, it doesn’t require that the range can be iterated more than once: iterators are not required to be copyable, and begin/end are not required to give you the iterators more than once. This could be an appropriate concept for ranges that are actually generating their contents as the result of some algorithm that’s not easily/cheaply repeatable, or receiving data from a network connection or suchlike.
std::ranges::forward_range: the range can be iterated as many times as you like, but only in the forward direction. Iterators can be copied and saved off to later resume iteration from an earlier point, for example.
std::ranges::bidirectional_range: iterators can be decremented as well as incremented.
std::ranges::random_access_range: you can efficiently do arithmetic on iterators—you can offset them forward or backward by a given number of steps, or subtract them to find the number of steps between.
std::ranges::contiguous_range: the elements are actually stored as a contiguous array in memory; the iterators are essentially fancy pointers (or literally are just pointers).

In addition to this hierarchy of input range concepts, there are a couple of other standalone ones worth mentioning:

std::ranges::sized_range: you can efficiently get the size of the range, i.e. how many elements from begin to end. Note that this is a much looser constraint than random_access_range: the latter requires you be able to efficiently measure the distance between any pair of iterators inside the range, while sized_range only requires that the size of the whole range is known.
std::ranges::borrowed_range: indicates that a range doesn’t own its data, i.e. it’s referencing (“borrowing”) data that lives somewhere else. This can be useful because it allows references/iterators into the data to survive beyond the lifetime of the range object itself.

The reason all these concepts are important is that if I’m writing code that operates on ranges, I might need to require some of these concepts in order to do my work efficiently. For example, a sorting routine would be very difficult to write for anything less than a random_access_range (and indeed you’ll see that std::ranges::sort requires that). In other cases, I might be able to do things more optimally when the range satisfies certain concepts—for instance, if it’s a sized_range, I could preallocate some storage for results, while if it’s only an input_range and no more, then I’ll have to dynamically reallocate, as I have no idea how many elements there are going to be.

The rest of the ranges library is written in terms of these concepts (and you can write your own code that operates generically on ranges using these concepts as well). So, once your container satisfies the relevant concepts, it will automatically be recognized and function as a range!

In C++20, concepts act as boolean expressions, so you can check whether your container satisfies the concepts you expect by just writing asserts for them:

#include <ranges>
static_assert(std::ranges::forward_range<MyCoolContainer<int>>);
// int is just an arbitrarily chosen element type, since we
// can't assert a concept for an uninstantiated template

Checks like this are great to add to your test suite; I’m very much in favor of writing compile-time tests for generic/metaprogramming stuff, in addition to the usual runtime tests.

However, when you first drop that assert into your code, it will almost certainly fail. Let’s see now what you need to do to actually satisfy the range concepts.

Defining Range-Compatible Iterators

In order to satisfy the input range concepts, you need to do two things:

Have begin and end functions that return some iterator and sentinel types. (We’ll discuss these in a little bit.)
The iterator type must satisfy the iterator concept that matches your range concept.

Each one of the concepts from input_range down to contiguous_range has a corresponding iterator concept: std::input_iterator, std::forward_iterator, and so on. It’s these concepts that contain the real meat of the requirements that define the different types of ranges: they list all the operations each kind of iterator must support.

To begin with, there are a couple of member type aliases that any iterator class will need to define:

difference_type: some signed integer type, usually std::ptrdiff_t
value_type: the type of elements that the iterator references

The second one seems pretty understandable, but I honestly have no idea why the difference_type requirement is here. Taking the difference between iterators doesn’t make sense until you get to random-access iterators, which actually define that operation. As far as I can tell, the difference_type for more general iterators isn’t actually used by anything. Nevertheless, according to the C++ standard, it has to be there. It seems that the usual idiom is to set it to std::ptrdiff_t in such cases, although it can be any signed integer type.

(Technically you can also define these types by specializing std::iterator_traits for your iterator, but here we’re just going to put them in the class.)

Beyond that, the requirements for std::input_iterator are pretty straightforward:

The iterator must be default-initializable and movable. (It doesn’t have to be copyable.)
It must be equality-comparable with its sentinel (the value marking the end of the range). It doesn’t have to be equality-comparable with other iterators.
It must implement operator ++, in both preincrement and postincrement positions. However, the postincrement version does not have to return anything.
It must have an operator * that returns a reference to whatever the value_type is.

One point of interest here is that the default-initializable requirement means that the iterator class can’t contain references, e.g. a reference to the container it comes from. It can store pointers, though.

A prototype input iterator class could look like this:

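#include <cstddef>      // std::ptrdiff_t
struct Sentinel {};     // placeholder type for the end-of-range marker
template <typename T>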
class Iterator
{
public:
    using difference_type = std::ptrdiff_t;
    using value_type = T;
    Iterator();                 // default-initializable
    bool operator == (const Sentinel&) const;   // equality with sentinel
    T& operator * () const;     // dereferenceable
    Iterator& operator ++ ()    // pre-incrementable
        { /*do stuff...*/ return *this; }
    void operator ++ (int)      // post-incrementable
        { ++*this; }
private:
    // implementation...
};

For a std::forward_iterator, the requirements are just slightly tighter:

The iterator must be copyable.
It must be equality-comparable with other iterators of the same container.
The postincrement operator must return a copy of the iterator before modification.

A prototype forward iterator class could look like:

template <typename T>
class Iterator
{
public:
    // ...same as the previous one, except:
    bool operator == (const Iterator&) const;   // equality with iterators
    Iterator operator ++ (int)  // post-incrementable, returns prev value
        { Iterator temp = *this; ++*this; return temp; }
};

I’m not going to go through the rest of them in detail; you can read the details on cppreference.

Begin, End, Size

Once your container is equipped with an iterator class that satisfies the relevant concepts, you’ll need to provide begin and end functions to get those iterators. There are three ways to do this: they can be member functions on the container, they can be free functions that live next to the container in the same namespace, or they can be “hidden friends”. In the latter two cases, they just need to be findable by ADL.

The return types from begin and end don’t have to be the same. In some cases, it can be useful to have end return a different type of object, a “sentinel”, which isn’t actually an iterator; it just needs to be equality-comparable with iterators, so you can tell when you’ve gotten to the end of the container.

Also, these are the same begin/end used for range-based for loops.

One oddity worth mentioning here is that if you go the free/friend functions route, you’ll need to add overloads for both const and non-const versions of your container:

class MyCoolContainer;
auto begin(const MyCoolContainer& c);
auto end(const MyCoolContainer& c);
auto begin(MyCoolContainer& c);
auto end(MyCoolContainer& c);

You might think it would be enough to provide just the const overloads, but if you do that, only the const version of the container will be recognized as a range! The non-const overloads must be present as well for non-const containers to work.

Curiously, if you provide begin/end as member functions instead, then this doesn’t come up: const overloads will work for both.

This behavior is surprising, and I’m not sure if it was intended. However, it’s worth noting that iterators generally need to remember the constness of the container they came from: a const container should give you a “const iterator” that doesn’t allow mutating its elements. Therefore, the const and non-const overloads of begin/end will generally need to return different iterator types, and so you’ll need to have both in any case. (The exception would be if you’re building an immutable container; then it only needs a const iterator type.)

In addition to begin and end, you’ll also want to implement a size function, if applicable. Again, this can be either a member function, a free function, or a hidden friend. The presence of this function satisfies std::ranges::sized_range, which (as mentioned earlier) can enable range algorithms to operate more efficiently.

So, to sum up: to allow your custom container class to be readable as a range, you’ll need to:

Decide which range concept(s) you can model, which mainly comes down to what level of iterator capabilities you can provide;
Implement iterator classes (both const and non-const, if applicable) that fulfill all the requirements of the chosen iterator concept;
Implement begin, end, and size functions.

Once you’ve done this, the ranges library should recognize your container as a range. It will automatically be accepted by range algorithms, you can take views of it, you can iterate over it in range-for loops, and so on.

As before, you can test that you’ve done everything correctly by asserting that your container satisfies the expected range concepts. If you’re working with gcc or clang, this will even give you some pretty reasonable error messages if you didn’t get it right! (In MSVC, for the time being, you’ll have to narrow down errors by popping open the hood and asserting each of the concept’s sub-clauses one at a time, to see which one(s) failed.)

Accepting Output From Ranges

We’ve discussed how to make a custom container serve as input to the C++20 ranges library. Now, we need to come back to the other direction: how to let your container capture output from the ranges library.

There are a couple of different forms this can take. One way is to accept generic ranges as parameters to a constructor (or other methods, such as append or insert methods) of your container class. This allows, for example, easily converting other containers (that are also range-compatible) to your container. It also allows capturing the output of a ranges “pipeline” (a series of views chained together).

Another form of range output, which comes up with certain of the range algorithms, is via output iterators, which are iterators that allow storing or inserting values into your container.

Constructor From A Range

To write a constructor (or other method) that takes a generic range parameter, we can use the same range concepts we saw earlier. One neat new feature in C++20 is writing functions with a parameter type (or return type) constrained to match a given concept. The syntax looks like this:

#include <ranges>
class MyCoolContainer
{
public:
    explicit MyCoolContainer(std::ranges::input_range auto&& range)
    {
        for (auto&& item : range)
        {
            // process the item
        }
    }
};

The syntax concept-name auto for the parameter type reminds us that concepts aren’t types; this is still, under the hood, a template function that’s performing argument type deduction (hence the auto). In other words, the above is syntactic sugar for:

template <std::ranges::input_range R>
explicit MyCoolContainer(R&& range)
{
    // ...
}

which is in turn sugar for:

template <typename R>
requires(std::ranges::input_range<R>)
explicit MyCoolContainer(R&& range)
{
    // ...
}

I prefer the shorthand std::ranges::input_range auto syntax. When I first wrote this, MSVC’s support for it was still shaky, but that was fixed in 16.10. 😊 If in doubt, use the syntax template <std::ranges::input_range R>.

In any case, constraining the parameter type to satisfy input_range allows this constructor overload to accept anything out there that implements begin, end, and iterators, as we’ve seen in previous sections. You can then iterate over it generically and do whatever you want with the results.

The range parameter is declared as auto&& to make it a universal reference, meaning that it can accept either lvalues or rvalues; in particular, it can accept the result of a function call returning a range, and it can accept the result of a pipeline:

MyCoolContainer c{ another_range |
                   std::views::transform(blah) |
                   std::views::filter(blah) };

A completely generic range-accepting method like this might not be the most useful thing. If we have a container storing int values, for example, it wouldn’t make a lot of sense for us to accept ranges of strings or other arbitrary types. We’d like to be able to put some additional constraints on the element type of the range: perhaps we only want element types that are convertible to int.

Helpfully, the ranges library provides a template range_value_t that retrieves the element type of a range—namely, the value_type declared by the range’s iterator. With this, we can state additional constraints like so:

explicit MyCoolContainer(std::ranges::input_range auto&& range)
requires(std::convertible_to<std::ranges::range_value_t<decltype(range)>, int>)
{
    // ...
}

We can even define a concept that wraps up these requirements:

template <typename R, typename T>
concept input_range_of =
    std::ranges::input_range<R> &&
    std::convertible_to<std::ranges::range_value_t<R>, T>;

and then use it as follows:

explicit MyCoolContainer(input_range_of<int> auto&& range)
{
    // ...
}

Something like this should be in the standard library, IMO.

You can also choose to require one of the more specialized concepts, like forward_range or random_access_range, if you need those extra capabilities for whatever you’re doing. However, just as a container should generally implement the most specific range concept it can provide, a function that takes a range parameter should generally require the most general range concept it can deal with, or it will unduly restrict what kind of ranges can be passed to it.

That said, there might be cases where you can switch to a more efficient implementation if the range satisfies some extra requirements. For example, if it’s a sized_range, then you might be able to reserve storage before inserting the elements. You can test for this inside your function body using if constexpr:

explicit MyCoolContainer(input_range_of<int> auto&& range)
{
    if constexpr (std::ranges::sized_range<decltype(range)>)
    {
        reserve(std::ranges::size(range));
    }

    for (auto&& item : range)
    {
        // process the item
    }
}

Here, std::ranges::size is a convenience wrapper that knows how to call the range’s associated size function, whether it’s implemented as a method or a free function.

You could also do things like: check if the range is a contiguous_range and the item is something trivially copyable, and switch to memcpy rather than iterating over all the items.

Output Iterators

Range views and pipelines operate on a “pull” model, where the pipeline is represented by a proxy range object that generates its results lazily when you iterate it. Taking generic range objects as parameters to your container is an easy and useful way to consume such objects, and that probably suffices for most uses. However, there are a handful of bits in the ranges library that operate on a “push” model, where you call a function that wants to store values into your container via an output iterator. This comes up with certain ranges algorithms like ranges::copy, ranges::transform, and ranges::generate.

Personally, I don’t see a hugely compelling reason to worry about these, as it’s also possible to use views to express the same operations; but for the sake of completeness, I’ll discuss them briefly here.

At this point, it won’t surprise you to learn that just as there were concepts for input ranges, there are also concepts std::ranges::output_range and std::output_iterator. In this case there’s just the one pair of concepts, not a hierarchy of refinements; however, if you peruse the definitions of some of the ranges algorithms, you’ll find that many of them don’t actually use output_iterator, but state slightly different, less- or more-specific requirements of their own. (This part of the standard library feels a little less fully baked than the rest; I wouldn’t be surprised if some of this gets elaborated or polished a bit more in C++23 or later revisions.)

The requirements for an output iterator (broadly construed) are very similar to those for an input iterator, only adding that the value returned by dereferencing the iterator must be writable by assigning to it: you must be able to do *iter = foo; for some appropriate type of foo. If you’ve implemented a non-const input iterator, it probably satisfies the requirement already.

It’s also possible to do slightly more exotic things with an output iterator, like returning a proxy object that accepts assignment and does “something” with the value assigned. An example of this is the STL’s std::back_insert_iterator, which takes whatever is assigned to it and appends to its container (as opposed to overwriting an existing value in the container). The STL has a few more things like that, including an iterator that writes characters out to an ostream.

There are also some cases amongst the ranges algorithms of “input-output” iterators, such as for operations that reorder a range in place, like sorting. These often have a bidirectional or random-access iterator requirement, plus needing the dereferenced types to be swappable, movable, and varying other constraints. Those details probably aren’t going to be relevant to you unless you’re doing something tricky, like making a container that generates elements on the fly somehow, or returns proxy objects rather than direct references to elements (like std::vector<bool>).

Conclusion

The C++20 ranges library provides a lot of powerful, composable tools for manipulating sequences of objects, and a range of specificity from the most generic and abstract container-shaped things down to the very concrete, efficient, and practical. When working with your own container types, it would be nice to be able to take advantage of these tools.

As we’ve seen, it’s hardly an onerous task to implement ranges compatibility for your own containers. Most of the necessaries are things you were probably already doing: you probably already had an iterator class and begin/end methods. It only takes a little bit of attention to satisfying certain details—like adding the difference_type and value_type aliases, and making sure you can both preincrement and postincrement—to make your iterators satisfy the STL iterator concepts, and thus have your containers recognized as ranges. It’s also no sweat to write functions accepting generic ranges as input, letting you store the output of other range operations into your container.

I hope this has been a useful peek under the hood and has given you some ideas about how your container classes can benefit from the new C++20 features.

Python-Like enumerate() In C++17

Nathan Reed — Sat, 24 Nov 2018 22:42:04 -0800

Python has a handy built-in function called enumerate(), which lets you iterate over an object (e.g. a list) and have access to both the index and the item in each iteration. You use it in a for loop, like this:

for i, thing in enumerate(listOfThings):
    print("The %dth thing is %s" % (i, thing))

Iterating over listOfThings directly would give you thing, but not i, and there are plenty of situations where you’d want both (looking up the index in another data structure, progress reports, error messages, generating output filenames, etc).

C++ range-based for loops work a lot like Python’s for loops. Can we implement an analogue of Python’s enumerate() in C++? We can!

C++17 added structured bindings (also known as “destructuring” in other languages), which allow you to pull apart a tuple type and assign the pieces to different variables, in a single statement. It turns out that this is also allowed in range for loops. If the iterator returns a tuple, you can pull it apart and assign the pieces to different loop variables.

The syntax for this looks like:

std::vector<std::tuple<ThingA, ThingB>> things;
...
for (auto [a, b] : things)
{
    // a gets the ThingA and b gets the ThingB from each tuple
}

So, we can implement enumerate() by creating an iterable object that wraps another iterable and generates the indices during iteration. Then we can use it like this:

std::vector<Thing> things;
...
for (auto [i, thing] : enumerate(things))
{
    // i gets the index and thing gets the Thing in each iteration
}

The implementation of enumerate() is pretty short, and I present it here for your use:

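#include <cstddef>   // size_t
#include <iterator>  // std::begin, std::end
#include <tuple>     // std::tie
#include <utility>   // std::declval, std::forward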

template <typename T,
          typename TIter = decltype(std::begin(std::declval<T>())),
          typename = decltype(std::end(std::declval<T>()))>
constexpr auto enumerate(T && iterable)
{
    struct iterator
    {
        size_t i;
        TIter iter;
        bool operator != (const iterator & other) const { return iter != other.iter; }
        void operator ++ () { ++i; ++iter; }
        auto operator * () const { return std::tie(i, *iter); }
    };
    struct iterable_wrapper
    {
        T iterable;
        auto begin() { return iterator{ 0, std::begin(iterable) }; }
        auto end() { return iterator{ 0, std::end(iterable) }; }
    };
    return iterable_wrapper{ std::forward<T>(iterable) };
}

This uses SFINAE to ensure it can only be applied to iterable types, and will generate readable error messages if used on something else. It accepts its parameter as a forwarding reference (T && with a deduced T), so you can apply it to temporary values (e.g. directly to the return value of a function call) as well as to variables and members.

This compiles without warnings in C++17 mode on gcc 8.2, clang 6.0, and MSVC 15.9. I’ve banged on it a bit to ensure it doesn’t incur any extra copies, and it works as expected with either const or non-const containers. It seems to optimize away pretty cleanly, too! 🤘
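Here’s a small usage sketch showing both a mutable and a const container. The `enumerate()` function is repeated from above (with `std::size_t` spelled out) so that the example is self-contained; `scale_by_index` and `index_of` are hypothetical helpers written just for this demonstration.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <tuple>
#include <utility>
#include <vector>

template <typename T,
          typename TIter = decltype(std::begin(std::declval<T>())),
          typename = decltype(std::end(std::declval<T>()))>
constexpr auto enumerate(T && iterable)
{
    struct iterator
    {
        std::size_t i;
        TIter iter;
        bool operator != (const iterator & other) const { return iter != other.iter; }
        void operator ++ () { ++i; ++iter; }
        auto operator * () const { return std::tie(i, *iter); }
    };
    struct iterable_wrapper
    {
        T iterable;
        auto begin() { return iterator{ 0, std::begin(iterable) }; }
        auto end() { return iterator{ 0, std::end(iterable) }; }
    };
    return iterable_wrapper{ std::forward<T>(iterable) };
}

// Multiply each element by its index, in place. The binding `value` refers
// to the element inside the vector, so the write goes through.
inline void scale_by_index(std::vector<int>& values)
{
    for (auto [i, value] : enumerate(values))
    {
        value *= static_cast<int>(i);
    }
}

// Find the index of the first match, or values.size() if there is none.
inline std::size_t index_of(const std::vector<int>& values, int needle)
{
    assert(values.size() < static_cast<std::size_t>(-1));
    for (auto [i, value] : enumerate(values))
    {
        if (value == needle) return i;
    }
    return values.size();
}
```

Even though the loop variable is declared as `auto [i, value]` rather than `auto& [i, value]`, the tuple being copied holds references, which is why `scale_by_index` modifies the original vector.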

Using A Custom Toolchain In Visual Studio With MSBuild

Nathan Reed — Tue, 20 Nov 2018 13:34:01 -0800

Like many of you, when I work on a graphics project I sometimes have a need to compile some shaders. Usually, I’m writing in C++ using Visual Studio, and I’d like to get my shaders built using the same workflow as the rest of my code. Visual Studio these days has built-in support for HLSL via fxc, but what if we want to use the next-gen dxc compiler?

This post is a how-to for adding support for a custom toolchain—such as dxc, or any other command-line-invokable tool—to a Visual Studio project, by scripting MSBuild (the underlying build system Visual Studio uses). We won’t quite make it to parity with a natively integrated language, but we’re going to get as close as we can.

If you don’t want to read all the explanation but just want some working code to look at, jump down to the Example Project section.

This article is written against Visual Studio 2017, but it may also work in some earlier VSes (I haven’t tested).

MSBuild

Before we begin, it’s important you understand what we’re getting into. Not to mince words, but MSBuild is a stringly typed, semi-documented, XML-guzzling, paradigmatically muddled, cursed hellmaze. However, it does ship with Visual Studio, so if you can use it for your custom build steps, then you don’t need to deal with any extra add-ins or software installs.

To be fair, MSBuild is open-source on GitHub, so at least in principle you can dive into it and see what the cursed hellmaze is doing. However, I’ll warn you up front that many of the most interesting parts vis-à-vis Visual Studio integration are not included in the Git repo, but are hidden away in VS’s build extension DLLs. (More about that later.)

My jumping-off point for this enterprise was this blog post by Mike Nicolella. Mike showed how to set up an MSBuild .targets file to create an association between a specific file extension in your project, and a build rule (“target”, in MSBuild parlance) to process those files. We’ll review how that works, then extend it and jazz it up a bit to get some more quality-of-life features.

MSBuild docs (such as they are) can be found on MSDN here. Some more information can be gleaned by looking at the C++ build rules installed with Visual Studio; on my machine they’re in C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\Common7\IDE\VC\VCTargets. For example, the file Microsoft.CppCommon.targets in that directory contains most of the target definitions for C++ compilation, linking, resources and manifests, and so on.

Adding A Custom Target

As shown in Mike’s blog post, we can define our own build rule using a couple of XML files which will be imported into the VS project. (I’ll keep using shader compilation with dxc as my running example, but this approach can be adapted for a lot of other things, too.)

First, create a file dxc.targets—in your project directory, or really anywhere—containing the following:

<?xml version="1.0" encoding="utf-8"?>
<Project xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
  <ItemGroup>
    <!-- Include definitions from dxc.xml, which defines the DXCShader item. -->
    <PropertyPageSchema Include="$(MSBuildThisFileDirectory)dxc.xml" />
    <!-- Hook up DXCShader items to be built by the DXC target. -->
    <AvailableItemName Include="DXCShader">
      <Targets>DXC</Targets>
    </AvailableItemName>
  </ItemGroup>

  <Target
    Name="DXC"
    Condition="'@(DXCShader)' != ''"
    BeforeTargets="ClCompile">
    <Message Importance="High" Text="Building shaders!!!" />
  </Target>
</Project>

And another file dxc.xml containing:

<?xml version="1.0" encoding="utf-8"?>
<ProjectSchemaDefinitions xmlns="http://schemas.microsoft.com/build/2009/properties">
  <!-- Associate DXCShader item type with .hlsl files -->
  <ItemType Name="DXCShader" DisplayName="DXC Shader" />
  <ContentType Name="DXCShader" ItemType="DXCShader" DisplayName="DXC Shader" />
  <FileExtension Name=".hlsl" ContentType="DXCShader" />
</ProjectSchemaDefinitions>

Let’s pause for a moment and take stock of what’s going on here. First, we’re creating a new “item type”, called DXCShader, and associating it with the extension .hlsl. That way, any files we add to our project with that extension will automatically have this item type applied.

Second, we’re instructing MSBuild that DXCShader items are to be built with the DXC target, and we’re defining what that target does. For now, all it does is print a message in the build output, but we’ll get it doing some actual work shortly.

A few miscellaneous syntax notes:

Yes, you need two separate files. No, there’s no way to combine them, AFAICT. This is just the way MSBuild works.
The syntax @(DXCShader) means “the list of all DXCShader items in the project”. The Condition attribute on a target says under what conditions that target should execute: if the condition is false, the target is skipped. Here, we’re executing the target if the list @(DXCShader) is non-empty.
BeforeTargets="ClCompile" means this target will run before the ClCompile target, i.e. before C/C++ source files are compiled with cl.exe. This is because we’re going to output our shader bytecode to headers which will get included into C++, so the shader compile step needs to run earlier.
Importance="High" is needed on the <Message> task for it to show up in the VS IDE on the default verbosity setting. Lower importances will be masked unless you turn up the verbosity.

To get this into your project, in the VS IDE right-click the project → Build Dependencies… → Build Customizations, then click “Find Existing” and point it at dxc.targets. Alternatively, add this line to your .vcxproj (as a child of the root <Project> element, doesn’t matter where):

<Import Project="dxc.targets" />

Now, if you add a .hlsl file to your project it should automatically show up as type “DXC Shader” in the properties; and when you build, you should see the message Building shaders!!! in the output.

Incidentally, in dxc.xml you can also set up property pages that will show up in the VS IDE on DXCShader-type files. This lets you define your own metadata and let users configure it per file. I haven’t done this, but for example, you could have properties to indicate which shader stages or profiles the file should be compiled for. The <Target> element can then have logic that refers to those properties. Many examples of the XML to define property pages can be found in C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\Common7\IDE\VC\VCTargets\1033 (or a corresponding location depending on which version of VS you have). For example, custom_build_tool.xml in that directory defines the properties for the built-in Custom Build Tool item type.

Invoking The Tool

Okay, now it’s time to get our custom target to actually do something. Mike’s blog post used the MSBuild <Exec> task to run a command on each source file. However, we’re going to take a different tack and use the Visual Studio <CustomBuild> task instead.

The <CustomBuild> task is the same one that ends up getting executed if you manually set your files to “Custom Build Tool” and fill in the command/inputs/outputs metadata in the property pages. But instead of putting that in by hand, we’re going to set up our target to generate the metadata and then pass it in to <CustomBuild>. Doing it this way is going to let us access a couple handy features later that we wouldn’t get with the plain <Exec> task.

Add this inside the DXC <Target> element:

<!-- Setup metadata for custom build tool -->
<ItemGroup>
  <DXCShader>
    <Message>%(Filename)%(Extension)</Message>
    <Command>
      "$(WDKBinRoot)\x86\dxc.exe" -T vs_6_0 -E vs_main %(Identity) -Fh %(Filename).vs.h -Vn %(Filename)_vs
      "$(WDKBinRoot)\x86\dxc.exe" -T ps_6_0 -E ps_main %(Identity) -Fh %(Filename).ps.h -Vn %(Filename)_ps
    </Command>
    <Outputs>%(Filename).vs.h;%(Filename).ps.h</Outputs>
  </DXCShader>
</ItemGroup>

<!-- Compile by forwarding to the Custom Build Tool infrastructure -->
<CustomBuild Sources="@(DXCShader)" />

Now, given some valid HLSL source files in the project, this will invoke dxc.exe twice on each one—first compiling a vertex shader, then a pixel shader. The bytecode will be output as C arrays in header files (-Fh option). I’ve just put the output headers in the main project directory, but in production you’d probably want to put them in a subdirectory somewhere.

Let’s back up and look at the syntax in this snippet. First, the <ItemGroup><DXCShader> combo basically says “iterate over the DXCShader items”, i.e. the HLSL source files in the project. Then what we’re doing is adding metadata: each of the child elements—<Message>, <Command>, and <Outputs>—becomes a metadata key/value pair attached to a DXCShader.

The %(Foo) syntax accesses item metadata (within a previously established context for “which item”, which is here created by the iteration over the shaders). All MSBuild items have certain built-in metadata like path, filename, and extension; we’re building on those to construct additional metadata, in the format expected by the <CustomBuild> task. (It matches the metadata that would be created if you set up the command line etc. manually in the Custom Build Tool property pages.)

Incidentally, the $(WDKBinRoot) variable (“property”, in MSBuild-ese) is the path to the Windows SDK bin folder, where lots of tools like dxc live. It needs to be quoted because it can (and usually does) contain spaces. You can find out these things by running MSBuild with “diagnostic” verbosity (in VS, go to Tools → Options → Projects and Solutions → Build and Run → “MSBuild project build output verbosity”)—this will spit out all the defined properties plus a ton of logging about which targets are running and what they’re doing.

Finally, after setting up all the required metadata, we simply pass it to the <CustomBuild> task. (This task isn’t part of core MSBuild, but is defined in Microsoft.Build.CPPTasks.Common.dll—an extension plugin to MSBuild that comes with Visual Studio.) Again we see the @(DXCShader) syntax, meaning to pass in the list of all DXCShader items in the project. Internally, <CustomBuild> iterates over it and invokes your specified command lines.

Incremental Builds

At this point, we have a working custom build! We can simply add .hlsl files to our project, and they’ll automatically be compiled by dxc as part of the build process, without us having to do anything. Hurrah!

However, while working with this setup you will notice a couple of problems.

When you modify an HLSL source file, Visual Studio will not reliably detect that it needs to recompile it. If the project was up-to-date before, hitting Build will do nothing! However, if you have also modified something else (such as a C++ source file), then the build will pick up the shaders in addition.
Anytime anything else gets built, all the shaders get built. In other words, MSBuild doesn’t yet understand that if an individual shader is already up-to-date then it can be skipped.

Fortunately, we can easily fix these. But first, why are these problems happening at all?

VS and MSBuild depend on .tlog (tracker log) files to cache information about source file dependencies and efficiently determine whether a build is up-to-date. Somewhere inside your build output directory there will be a folder full of these logs, listing what source files have gotten built, what inputs they depended on (e.g. headers), and what outputs they generated (e.g. object files). The problem is that our custom target isn’t producing any .tlogs.

Conveniently for us, the <CustomBuild> task supports .tlog handling right out of the box; we just have to turn it on! Change the <CustomBuild> invocation in the targets file to this:

<!-- Compile by forwarding to the Custom Build Tool infrastructure,
     so it will take care of .tlogs -->
<CustomBuild
  Sources="@(DXCShader)"
  MinimalRebuildFromTracking="true"
  TrackerLogDirectory="$(TLogLocation)" />

That’s all there is to it—now, modified HLSL files will be properly detected and rebuilt, and unmodified ones will be properly detected and not rebuilt. This also takes care of deleting the previous output files when you do a clean build. This is one reason to prefer using the <CustomBuild> task rather than the simpler <Exec> task (we’ll see another reason a bit later).

Thanks to Olga Arkhipova at Microsoft for helping me figure out this part!

Header Dependencies

Now that we have dependencies hooked up for our custom toolchain, a logical next step is to look into how we can specify extra input dependencies—so that our shaders can have #includes, for example, and modifications to the headers will automatically trigger rebuilds properly.

The good news is that yes, we can do this by adding an <AdditionalInputs> metadata key to our DXCShader items. Files listed there will get registered as inputs in the .tlog, and the build system will do the rest. The bad news is that there doesn’t seem to be an easy way to detect on a file-by-file level which additional inputs are needed.

This is frustrating because Visual Studio actually includes a utility for tracking file accesses in an external tool! It’s called tracker.exe and lives somewhere in your VS installation. You give it a command line, and it’ll detect all files opened for reading by the launched process (presumably by injecting a DLL and detouring CreateFile(), or something along those lines). I believe this is what VS uses internally to track #includes for C++—and it would be perfect if we could get access to the same functionality for custom toolchains as well.

Unfortunately, the <CustomBuild> task explicitly disables this tracking functionality. I was able to find this out by using ILSpy to decompile the Microsoft.Build.CPPTasks.Common.dll. It’s a .NET assembly, so it decompiles pretty cleanly, and you can examine the innards of the CustomBuild class. It contains this snippet, in the ExecuteTool() method:

bool trackFileAccess = base.TrackFileAccess;
base.TrackFileAccess = false;
num = base.TrackerExecuteTool(pathToTool2, responseFileCommands, commandLineCommands);
base.TrackFileAccess = trackFileAccess;

That is, it’s turning off file access tracking before calling the base class method that would otherwise invoke the tracker. I’m sure there’s a reason why they did that, but sadly it’s stymied my attempts to get automatic #include tracking to work for shaders.

(We could also invoke tracker.exe manually in our command line, but then we face the problem of merging the tracker-generated .tlog into that of the <CustomBuild> task. They’re just text files, so it’s potentially doable…but that is way more programming than I’m prepared to attempt in an XML-based scripting language.)

Although we can’t get fine-grained file-by-file header dependencies, we can still set up conservative dependencies by making every HLSL source file depend on every header. This will result in rebuilding all the shaders whenever any header is modified—but better to rebuild too much than not enough. We can find all the headers using a wildcard pattern and an <ItemGroup>. Add this to the DXC <Target>, before the “setup metadata” section:

<!-- Find all shader headers (.hlsli files) -->
<ItemGroup>
  <ShaderHeader Include="*.hlsli" />
</ItemGroup>
<PropertyGroup>
  <ShaderHeaders>@(ShaderHeader)</ShaderHeaders>
</PropertyGroup>

You could also set this to find .h files under a Shaders subdirectory, or whatever you prefer. The ** wildcard is available for recursively searching subdirectories, too.

Then add this inside the <ItemGroup><DXCShader> section:

<AdditionalInputs>$(ShaderHeaders)</AdditionalInputs>

We have to do a little dance here, first forming the ShaderHeader item list, then expanding it into the ShaderHeaders property, and finally referencing that in the metadata. I’m not sure why, but if I try to use @(ShaderHeader) directly in the metadata it just comes out blank. Perhaps it’s not allowed to have nested iteration over item lists in MSBuild.

In any case, after making these changes and rebuilding, the build should now pick up any changes to shader headers. Woohoo!

Error/Warning Parsing

There’s just one more bit of sparkle we can easily add. When you compile C++ and you get an error or warning, the VS IDE recognizes it and produces a clickable link that takes you to the source location. If a custom build step emits error messages in the same format, they’ll be picked up as well—but what if your custom toolchain has a different format?

The dxc compiler emits errors and warnings in gcc/clang format, looking something like this:

Shader.hlsl:12:15: error: cannot convert from 'float3' to 'float4'

It turns out that Visual Studio already does recognize this format (at least as of version 15.9), which is great! But if it didn’t, or in case you’ve got a tool with some other message format, you can provide a regular expression to find errors and warnings in the tool output. The regex can even supply source file/line information, and the errors will become clickable in the IDE, just as with C++. (This is all totally undocumented and I only know about it because I spotted the code while browsing through the decompiled CPPTasks DLL. If you want to take a look for yourself, the juicy bit is the VCToolTask.ParseLine() method.)

This will use .NET regex syntax, and in particular, expects a certain set of named captures to provide metadata. By way of example, here’s the regex I wrote for gcc/clang-format errors:

(?'FILENAME'.+):(?'LINE'\d+):(?'COLUMN'\d+): (?'CATEGORY'error|warning): (?'TEXT'.*)

FILENAME, LINE, etc. are the names the parsing code expects for the metadata. There’s one more I didn’t use: CODE, for an error code (like C2440, etc.). The only required one is CATEGORY, without which the message won’t be clickable (and it must be one of the words “error”, “warning”, or “note”); all the others are optional.
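If you want to sanity-check a pattern like this before wiring it into MSBuild, a quick C++ harness can help. This is only a sketch under a couple of assumptions: std::regex has no named-group syntax, so numbered groups stand in for the .NET named captures, and I match the whole line with std::regex_match, which may not be exactly how Visual Studio applies the regex. The Diagnostic struct and parse_diagnostic function are hypothetical names.

```cpp
#include <regex>
#include <string>

// One parsed diagnostic; field names mirror the capture names in the text.
struct Diagnostic
{
    std::string filename;
    int line = 0;
    int column = 0;
    std::string category;
    std::string text;
};

// Returns true and fills `out` if `s` looks like a gcc/clang-style diagnostic.
// Groups 1..5 correspond to FILENAME, LINE, COLUMN, CATEGORY, TEXT.
bool parse_diagnostic(const std::string & s, Diagnostic & out)
{
    static const std::regex pattern(R"((.+):(\d+):(\d+): (error|warning): (.*))");
    std::smatch m;
    if (!std::regex_match(s, m, pattern))
        return false;
    out.filename = m[1].str();
    out.line = std::stoi(m[2].str());
    out.column = std::stoi(m[3].str());
    out.category = m[4].str();
    out.text = m[5].str();
    return true;
}
```

Because the filename group is a greedy `.+`, paths containing colons (such as a Windows drive letter) still end up entirely in the filename.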

To use it, pass the regex to the <CustomBuild> task like so:

<CustomBuild
  Sources="@(DXCShader)"
  MinimalRebuildFromTracking="true"
  TrackerLogDirectory="$(TLogLocation)"
  ErrorListRegex="(?'FILENAME'.+):(?'LINE'\d+):(?'COLUMN'\d+): (?'CATEGORY'error|warning): (?'TEXT'.*)" />

Example Project

Here’s a complete VS2017 project with all the features we’ve discussed, a couple demo shaders, and a C++ file that includes the compiled bytecode (just to show that works).

Download Example Project (.zip, 4.3 KB)

And for completeness, here’s the final contents of dxc.targets:

<?xml version="1.0" encoding="utf-8"?>
<Project xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
  <ItemGroup>
    <!-- Include definitions from dxc.xml, which defines the DXCShader item. -->
    <PropertyPageSchema Include="$(MSBuildThisFileDirectory)dxc.xml" />
    <!-- Hook up DXCShader items to be built by the DXC target. -->
    <AvailableItemName Include="DXCShader">
      <Targets>DXC</Targets>
    </AvailableItemName>
  </ItemGroup>

  <Target
    Name="DXC"
    Condition="'@(DXCShader)' != ''"
    BeforeTargets="ClCompile">

    <Message Importance="High" Text="Building shaders!!!" />

    <!-- Find all shader headers (.hlsli files) -->
    <ItemGroup>
      <ShaderHeader Include="*.hlsli" />
    </ItemGroup>
    <PropertyGroup>
      <ShaderHeaders>@(ShaderHeader)</ShaderHeaders>
    </PropertyGroup>

    <!-- Setup metadata for custom build tool -->
    <ItemGroup>
      <DXCShader>
        <Message>%(Filename)%(Extension)</Message>
        <Command>
          "$(WDKBinRoot)\x86\dxc.exe" -T vs_6_0 -E vs_main %(Identity) -Fh %(Filename).vs.h -Vn %(Filename)_vs
          "$(WDKBinRoot)\x86\dxc.exe" -T ps_6_0 -E ps_main %(Identity) -Fh %(Filename).ps.h -Vn %(Filename)_ps
        </Command>
        <AdditionalInputs>$(ShaderHeaders)</AdditionalInputs>
        <Outputs>%(Filename).vs.h;%(Filename).ps.h</Outputs>
      </DXCShader>
    </ItemGroup>

    <!-- Compile by forwarding to the Custom Build Tool infrastructure,
         so it will take care of .tlogs and error/warning parsing -->
    <CustomBuild
      Sources="@(DXCShader)"
      MinimalRebuildFromTracking="true"
      TrackerLogDirectory="$(TLogLocation)"
      ErrorListRegex="(?'FILENAME'.+):(?'LINE'\d+):(?'COLUMN'\d+): (?'CATEGORY'error|warning): (?'TEXT'.*)" />
  </Target>
</Project>

The Next Level

At this point, we have a pretty usable MSBuild customization for compiling shaders, or using other kinds of custom toolchains! I’m pretty happy with it. However, there are still a couple of areas for improvement.

As mentioned before, I’d like to get file access tracking to work so we can have exact dependencies for included files, rather than conservative (overly broad) dependencies.
I haven’t done anything with parallel building. Currently, <CustomBuild> tasks are run one at a time. There is a <ParallelCustomBuild> task in the CPPTasks assembly…unfortunately, it doesn’t support .tlog updating or the error/warning regex, so it’s not directly usable here.

To obtain these features, I think I’d need to write my own build extension in C#, defining a custom task and calling it in place of <CustomBuild> in the targets file. It might not be too hard to get that working, but I haven’t attempted it yet.

In the meantime, now that the hard work of circumventing the weird gotchas and reverse-engineering the undocumented innards has been done, it should be pretty easy to adapt this .targets setup to other needs for code generation or external tools, and have them act mostly like first-class citizens in our Visual Studio builds. Cheers!

Mesh Shader Possibilities

Nathan Reed — Sat, 29 Sep 2018 11:42:26 -0700

NVIDIA recently announced their latest GPU architecture, called Turing. Although its headlining feature is hardware-accelerated ray tracing, Turing also includes several other developments that look quite intriguing in their own right.

One of these is the new concept of mesh shaders, details of which dropped a couple weeks ago—and the graphics programming community was agog, with many enthusiastic discussions taking place on Twitter and elsewhere. So what are mesh shaders (and their counterparts, task shaders), why are graphics programmers so excited about them, and what might we be able to do with them?

The GPU Geometry Pipeline Has Gotten Cluttered

The process of submitting geometry—triangles to be drawn—to the GPU has a simple underlying paradigm: you put your vertices into a buffer, point the GPU at it, and issue a draw call to say how many primitives to render. The vertices get slurped linearly out of the buffer, each is processed by a vertex shader, the triangles are rasterized and shaded, and Bob’s your uncle.

But over decades of GPU development, various extra features have gotten bolted onto this basic pipeline in the name of greater performance and efficiency. Indexed triangles and vertex caches were created to exploit vertex reuse. Complex vertex stream format descriptions are needed to prepare data for shading. Instancing, and later multi-draw, allowed certain sets of draw calls to be combined together; indirect draws could be generated on the GPU itself. Then came the extra shader stages: geometry shaders, to allow programmable operations on primitives and even inserting or deleting primitives on the fly, and then tessellation shaders, letting you submit a low-res mesh and dynamically subdivide it to a programmable level.

While these features and more were all added for good reasons (or at least what seemed like good reasons at the time), the compound of all of them has become unwieldy. Which subset of the many available options do you reach for in a given situation? Will your choice be efficient across all the GPU architectures your software must run on?

Moreover, this elaborate pipeline is still not as flexible as we would sometimes like—or, where flexible, it is not performant. Instancing can only draw copies of a single mesh at a time; multi-draw is still inefficient for large numbers of small draws. Geometry shaders’ programming model is not conducive to efficient implementation on wide SIMD cores in GPUs, and its input/output buffering presents difficulties too. Hardware tessellation, though very handy for certain things, is often difficult to use well due to the limited granularity at which you can set tessellation factors, the limited set of baked-in tessellation modes, and performance issues on some GPU architectures.

Simplicity Is Golden

Mesh shaders represent a radical simplification of the geometry pipeline. With a mesh shader enabled, all the shader stages and fixed-function features described above are swept away. Instead, we get a clean, straightforward pipeline using a compute-shader-like programming model. Importantly, this new pipeline is both highly flexible—enough to handle the existing geometry tasks in a typical game, plus enable new techniques that are challenging to do on the GPU today—and it looks like it should be quite performance-friendly, with no apparent architectural barriers to efficient GPU execution.

Like a compute shader, a mesh shader defines work groups of parallel-running threads, and they can communicate via on-chip shared memory as well as wave intrinsics. In lieu of a draw call, the app launches some number of mesh shader work groups. Each work group is responsible for writing out a small, self-contained chunk of geometry, called a “meshlet”, expressed in arrays of vertex attributes and corresponding indices. These meshlets then get tossed directly into the rasterizer, and Bob’s your uncle.

(More details can be found in NVIDIA’s blog post, a talk by Christoph Kubisch, and the OpenGL extension spec.)

The appealing thing about this model is how data-driven and freeform it is. The mesh shader pipeline has very relaxed expectations about the shape of your data and the kinds of things you’re going to do. Everything’s up to the programmer: you can pull the vertex and index data from buffers, generate them algorithmically, or any combination.

At the same time, the mesh shader model sidesteps the issues that hampered geometry shaders, by explicitly embracing SIMD execution (in the form of the compute “work group” abstraction). Instead of each shader thread generating geometry on its own (which leads to divergence and large input/output data sizes), we have the whole work group outputting a meshlet cooperatively. This means we can use compute-style tricks, like: first do some work on the vertices in parallel, then have a barrier, then work on the triangles in parallel. It also means the input/output bandwidth needs are a lot more reasonable. And, because meshlets are indexed triangle lists, they don’t break vertex reuse, as geometry shaders often did.

An Upgrade Path

The other really neat thing about mesh shaders is that they don’t require you to drastically rework how your game engine handles geometry to take advantage of them. It looks like it should be pretty easy to convert most common geometry types to mesh shaders, making it an approachable upgrade path for developers.

(You don’t have to convert everything to mesh shaders straight away, though; it’s possible to switch between the old geometry pipeline and the new mesh-shader-based one at different points in the frame.)

Suppose you have an ordinary authored mesh that you want to load and render. You’ll need to break it up into meshlets, which have a static maximum size declared in the shader—NVIDIA’s blog post recommends 64 vertices and 126 triangles as a default. How do we do this?

Fortunately, most game engines currently do some form of vertex cache optimization, which already organizes the primitives by locality—triangles sharing one or two vertices will tend to be close together in the index buffer. So, a quite viable strategy for creating meshlets is: just scan the index buffer linearly, accumulating the set of vertices used, until you hit either 64 vertices or 126 triangles; reset and repeat until you’ve gone through the whole mesh. This could be done at art build time, or it’s simple enough that you could even do it in the engine at level load time.
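The linear scan described above might look something like the following on the CPU side. This is a sketch, not a tuned implementation: the Meshlet layout and the build_meshlets name are my own invention, it assumes a triangle-list index buffer, limits of at least 3 vertices and 1 triangle, and no more than 256 vertices per meshlet so local indices fit in a byte.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical meshlet layout for illustration.
struct Meshlet
{
    std::vector<std::uint32_t> vertices;  // indices into the original vertex buffer
    std::vector<std::uint8_t>  indices;   // 3 per triangle, local to `vertices`
};

std::vector<Meshlet> build_meshlets(const std::vector<std::uint32_t> & indexBuffer,
                                    std::size_t maxVertices = 64,
                                    std::size_t maxTriangles = 126)
{
    std::vector<Meshlet> meshlets;
    Meshlet current;
    std::unordered_map<std::uint32_t, std::uint8_t> localIndex;  // global -> local

    for (std::size_t t = 0; t + 2 < indexBuffer.size(); t += 3)
    {
        // Count the vertices of this triangle that the current meshlet lacks.
        std::size_t newVertices = 0;
        for (std::size_t k = 0; k < 3; ++k)
        {
            bool seen = localIndex.count(indexBuffer[t + k]) != 0;
            for (std::size_t j = 0; j < k; ++j)  // repeated within this triangle
                seen = seen || indexBuffer[t + j] == indexBuffer[t + k];
            if (!seen)
                ++newVertices;
        }

        // If the triangle would overflow either limit, emit and reset.
        if (current.vertices.size() + newVertices > maxVertices ||
            current.indices.size() / 3 + 1 > maxTriangles)
        {
            meshlets.push_back(std::move(current));
            current = Meshlet{};
            localIndex.clear();
        }

        // Append the triangle, assigning local indices to new vertices.
        for (std::size_t k = 0; k < 3; ++k)
        {
            auto [it, inserted] = localIndex.try_emplace(
                indexBuffer[t + k],
                static_cast<std::uint8_t>(current.vertices.size()));
            if (inserted)
                current.vertices.push_back(indexBuffer[t + k]);
            current.indices.push_back(it->second);
        }
    }

    if (!current.indices.empty())
        meshlets.push_back(std::move(current));
    return meshlets;
}
```

The quality of the resulting meshlets depends entirely on the locality of the incoming index buffer, which is why running this after vertex cache optimization makes sense.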

Alternatively, vertex cache optimization algorithms can probably be modified to produce meshlets directly. For GPUs without mesh shader support, you can concatenate all the meshlet vertex buffers together, and rapidly generate a traditional index buffer by offsetting and concatenating all the meshlet index buffers. It’s pretty easy to go back and forth.
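Going in the other direction, from meshlets back to a traditional index buffer, could be sketched as below. As before, the Meshlet and FlatMesh structs and the flatten_meshlets name are hypothetical, and I'm representing "vertex buffers" as lists of original vertex IDs to keep the example short.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical meshlet layout for illustration.
struct Meshlet
{
    std::vector<std::uint32_t> vertices;  // original vertex IDs used by this meshlet
    std::vector<std::uint8_t>  indices;   // 3 per triangle, local to `vertices`
};

// Hypothetical result: concatenated vertices plus an ordinary index buffer.
struct FlatMesh
{
    std::vector<std::uint32_t> vertices;
    std::vector<std::uint32_t> indices;
};

FlatMesh flatten_meshlets(const std::vector<Meshlet> & meshlets)
{
    FlatMesh flat;
    for (const Meshlet & m : meshlets)
    {
        // Each meshlet's local indices are offset by where its vertices land.
        const auto base = static_cast<std::uint32_t>(flat.vertices.size());
        flat.vertices.insert(flat.vertices.end(), m.vertices.begin(), m.vertices.end());
        for (std::uint8_t local : m.indices)
            flat.indices.push_back(base + local);
    }
    return flat;
}
```

One thing to be aware of: vertices shared between two meshlets appear twice in the concatenated buffer, so the flattened mesh has somewhat more vertices than the original.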

In either case, the mesh shader would be mostly just acting as a vertex shader, with some extra code to fetch vertex and index data from their buffers and plug them into the mesh outputs.

What about other kinds of geometry found in games?

Instanced draws are straightforward: multiply the meshlet count and put in a bit of shader logic to hook up instance parameters. A more interesting case is multi-draw, where we want to draw a lot of meshes that aren’t all copies of the same thing. For this, we can employ task shaders—a secondary feature of the mesh shader pipeline. Task shaders add an extra layer of compute-style work groups, running before the mesh shader, and they control how many mesh shader work groups to launch. They can also write output variables to be consumed by the mesh shader. A very efficient multi-draw should be possible by launching task shaders with a thread per draw, which in turn launch the mesh shaders for all the individual draws.
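For the instanced case, the "bit of shader logic" could be as simple as splitting the work group index into an instance and a meshlet. Here it is written as plain C++ rather than shader code; the instance-major ordering and the names are my own assumptions, and other mappings would work just as well.

```cpp
#include <cstdint>

// Hypothetical pair identifying what one mesh shader work group should draw.
struct WorkItem
{
    std::uint32_t instance;  // which instance's parameters to fetch
    std::uint32_t meshlet;   // which meshlet of the mesh to output
};

// Number of work groups to launch for an instanced draw.
constexpr std::uint32_t work_group_count(std::uint32_t meshletCount,
                                         std::uint32_t instanceCount)
{
    return meshletCount * instanceCount;
}

// Instance-major ordering: all meshlets of instance 0, then instance 1, etc.
// Assumes meshletCount is nonzero.
constexpr WorkItem decode_work_group(std::uint32_t groupIndex,
                                     std::uint32_t meshletCount)
{
    return WorkItem{ groupIndex / meshletCount, groupIndex % meshletCount };
}
```

In a real mesh shader, the instance number would then be used to look up a transform or other per-instance data from a buffer.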

If we need to draw a lot of very small meshes, such as quads for particles/imposters/text/point-based rendering, or boxes for occlusion tests / projected decals and whatnot, then we can pack a bunch of them into each mesh shader workgroup. The geometry can be generated entirely in-shader rather than relying on a pre-initialized index buffer from the CPU. (This was one of the original use cases that, it was hoped, could be done with geometry shaders—e.g. submitting point primitives, and having the GS expand them into quads.) There’s also a lot of flexibility to do stuff with variable topology, like particle beams/strips/ribbons, which would otherwise need to be generated either on the CPU or in a separate compute pre-pass.

(By the way, the other original use case that, it was hoped, could be done with geometry shaders was multi-view rendering: drawing the same geometry to, say, multiple faces of a cubemap or slices of a cascaded shadow map within a single draw call. You could do that with mesh shaders, too—but Turing actually has a separate hardware multi-view capability for these applications.)

What about tessellated meshes?

The two-layer structure of task and mesh shaders is broadly similar to that of tessellation hull and domain shaders. While it doesn’t appear that mesh shaders have any kind of access to the fixed-function tessellator unit, it’s also not too hard to imagine that we could write code in task/mesh shaders to reproduce tessellation functionality (or at least some of it). Figuring out the details would be a bit of a research project for sure—maybe someone has already worked on this?—and perf would be a question mark. However, we’d get the benefit of being able to change how tessellation works, instead of being stuck with whatever Microsoft decided on in the late 2000s.

New Possibilities

It’s great that mesh shaders can subsume our current geometry tasks, and in some cases make them more efficient. But mesh shaders also open up possibilities for new kinds of geometry processing that wouldn’t have been feasible on the GPU before, or would have required expensive compute pre-passes storing data out to memory and then reading it back in through the traditional geometry pipeline.

With our meshes already in meshlet form, we can do finer-grained culling at the meshlet level, and even at the triangle level within each meshlet. With task shaders, we can potentially do mesh LOD selection on the GPU, and if we want to get fancy we could even try dynamically packing together very small draws (from coarse LODs) to get better meshlet utilization.

In place of tile-based forward lighting, or as an extension to it, it might be useful to cull lights (and projected decals, etc.) per meshlet, assuming there’s a good way to pass the variable-size light list from a mesh shader down to the fragment shader. (This suggestion from Seb Aaltonen.)

Having access to the topology in the mesh shader should enable us to calculate dynamic normals, tangents, and curvatures for a mesh that’s deforming due to complex skinning, displacement mapping, or procedural vertex animation. We can also do voxel meshing, or isosurface extraction—marching cubes or tetrahedra, plus generating normals etc. for the isosurface—directly in a mesh shader, for rendering fluids and volumetric data.

Geometry for hair/fur, foliage, or other surface cover might be feasible to generate on the fly, with view-dependent detail.

3D modeling and CAD apps may be able to apply mesh shaders to dynamically triangulate quad meshes or n-gon meshes, as well as things like dynamically insetting/outsetting geometry for visualizations.

For rendering displacement-mapped terrain, water, and so forth, mesh shaders may be able to assist us with geometry clipmaps and geomorphing; they might also be interesting for progressive meshing schemes.

And last but not least, we might be able to render Catmull–Clark subdivision surfaces, or other subdivision schemes, more easily and efficiently than it can be done on the GPU today.

To be clear, a great deal of the above is speculation and handwaving on my part—I don’t want to mislead you that all of these things are for sure doable with the new mesh and task shader pipeline. There will certainly be algorithmic difficulties and architectural hindrances that will come up as graphics programmers have a chance to dig into this. Still, I’m quite excited to see what people will do with this capability over the next few years, and I hope and expect that it won’t be an NVIDIA-exclusive feature for too long.