# Quick and Easy GPU Random Numbers in D3D11

January 12, 2013 · Coding, Graphics · 15 comments

In games and graphics one often needs to generate pseudorandom numbers. Needless to say, PRNGs are an extremely well-researched topic; however, the majority of the literature focuses on applications with very exacting quality requirements: cryptography, high-dimensional Monte Carlo simulations, and suchlike. These PRNGs tend to have hundreds of bytes of state and take hundreds of instructions to update. That’s way overkill for many more modest purposes—if you just want to do a little random sampling in a game context, you can probably get away with much less.

To drive home just how much lower my random number standards will be for this article, I’m not going to run a single statistical test on the numbers I generate—I’m just going to look at them! The human visual system is pretty good at picking out patterns in what we see, so if we generate a bitmap with one random bit per pixel, black or white, it’s easy to see if we’re generating “pretty random” numbers—or if something’s going wrong.

The one on the left is a linear congruential generator (LCG), and on the right is Xorshift. We’re always told that LCGs are bad news, and now you can see just how bad! Xorshift, on the other hand, is much better. It’ll actually pass some medium-strength statistical tests, and it certainly looks random enough to the eye. Moreover, it’s quite fast compared to other PRNGs of similar quality.

Since D3D11 GPUs support integer operations natively, it’s easy to port these PRNGs to shader code. GPUs do things in parallel, so we’ll create an independent instance of the PRNG for each work item—vertex, pixel, or compute-shader thread. Then we just need to seed them with different values, e.g. using the vertex index, pixel screen coordinates, or thread index, and we’ll get different sequences. Here’s the shader code for the LCG and Xorshift versions I used:

```hlsl
uint rng_state;

uint rand_lcg()
{
    // LCG values from Numerical Recipes
    rng_state = 1664525 * rng_state + 1013904223;
    return rng_state;
}

uint rand_xorshift()
{
    // Xorshift algorithm from George Marsaglia's paper
    rng_state ^= (rng_state << 13);
    rng_state ^= (rng_state >> 17);
    rng_state ^= (rng_state << 5);
    return rng_state;
}

// Example of using Xorshift in a compute shader
[numthreads(256, 1, 1)]
void cs_main(uint3 threadId : SV_DispatchThreadID)
{
    // Seed the PRNG using the thread ID
    rng_state = threadId.x;

    // Generate a few numbers...
    uint r0 = rand_xorshift();
    uint r1 = rand_xorshift();
    // Do some stuff with them...

    // Generate a random float in [0, 1)...
    float f0 = float(rand_xorshift()) * (1.0 / 4294967296.0);

    // ...etc.
}
```
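As a sanity check, here is a minimal CPU reference of the same two generators in Python (my addition, not part of the original shader code); the explicit 32-bit masking emulates HLSL's uint wraparound:

```python
MASK32 = 0xFFFFFFFF  # emulate HLSL's 32-bit uint wraparound

def rand_lcg(state):
    # LCG step with the Numerical Recipes constants
    return (1664525 * state + 1013904223) & MASK32

def rand_xorshift(state):
    # Marsaglia's 32-bit xorshift, (13, 17, 5) variant
    state = (state ^ (state << 13)) & MASK32
    state = state ^ (state >> 17)
    state = (state ^ (state << 5)) & MASK32
    return state

def to_unit_float(x):
    # Map a 32-bit uint to a float in [0, 1), as in the shader
    return x / 4294967296.0

state = 1  # e.g. a thread ID (xorshift must be seeded nonzero)
state = rand_xorshift(state)
print(state, to_unit_float(state))
```

This is handy for comparing CPU and GPU sequences value-for-value when debugging.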

LCGs are really fast—updating the state takes just one imad instruction (in HLSL assembly, which is just an intermediate language, but still a reasonable proxy for machine code speed). Xorshift is a bit slower, requiring six instructions, but that’s not bad considering the quality of random numbers it gives you. Figure two or three more instructions to get the number into the range you need, and convert it to a float if necessary. On a high-end GPU, you can generate tens of billions of random numbers per second with these PRNGs, easy.

However, running the PRNGs in parallel this way and generating a bitmap with one 32-bit word from each thread gives an unexpected result:

Again, on the left is the LCG and on the right is Xorshift. The LCG doesn’t look too different from before, but Xorshift looks absolutely terrible! What’s going on?

## Wide and Deep

PRNGs are designed to be well-distributed when you “go deep”—draw many values from the same instance. Since this involves sequentially updating the state after each value, it doesn’t map well to the GPU. On the GPU we need to “go wide”—set up a lot of independent PRNG instances with different seeds so we can draw from each of them in parallel. But PRNGs aren’t designed to give good statistics across seeds. I tried several small PRNGs I found on the Web, and they all produced obvious artifacts when going wide, even if they were perfectly fine going deep.
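To make the "wide" failure concrete, here is a small Python illustration (my own, not from the article): with an LCG, the first outputs of adjacent seeds s and s+1 always differ by exactly the multiplier, modulo 2^32, so neighboring threads start out perfectly correlated.

```python
MASK32 = 0xFFFFFFFF
A, C = 1664525, 1013904223  # Numerical Recipes LCG constants

def lcg_first_output(seed):
    # First value drawn by a thread seeded with `seed`
    return (A * seed + C) & MASK32

# Differences between neighboring threads' first outputs:
diffs = {(lcg_first_output(s + 1) - lcg_first_output(s)) & MASK32
         for s in range(1000)}
print(diffs)  # every pairwise difference is the multiplier A
```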

The first thing I tried to fix this problem was to just throw away the first few values of each PRNG sequence, hoping the unwanted correlation would disappear after a few iterations. However, this doesn’t help much—once the sequences are correlated, they tend to stay that way. Here are the LCG and Xorshift after 64 iterations per thread:

It’s better, but there are still pretty obvious nonrandom patterns, even 64 iterations deep into each sequence.

We need another ingredient. Fortunately, there’s another kind of pseudorandom function that’s explicitly designed to be well-distributed when going wide: hash functions. If we hash the seed when initializing the PRNG, it should mix things up enough to decorrelate the sequences of nearby threads.

(Incidentally, I was confused for a long time about the distinction between PRNGs and hash functions—they seem to do the same thing, i.e. create an “unpredictable” but deterministic output by jumbling up input bits. It wasn’t until I started playing around with GPU PRNGs that I realized the difference: PRNGs are designed for going deep, and hashes for going wide.)

Researching hash functions on the Web presents a similar problem to researching PRNGs: most of the ones you hear about are crypto hashes, or hashes designed to digest large and variable-length data, and many produce a 64-bit or 128-bit output. For initializing a PRNG, we just want something that can hash one 32-bit integer into another. The fastest hash function I’ve found that fits the bill is one invented by Thomas Wang, as reported on Bob Jenkins’ website. The Wang hash goes as follows:

```hlsl
uint wang_hash(uint seed)
{
    seed = (seed ^ 61) ^ (seed >> 16);
    seed *= 9;
    seed = seed ^ (seed >> 4);
    seed *= 0x27d4eb2d;
    seed = seed ^ (seed >> 15);
    return seed;
}
```

It’s nine instructions in HLSL assembly, and does a fine job of randomizing the seeds. Here’s the output of the Wang hash, using thread index as a seed, without any additional PRNG iterations:

To the eye, it’s indistinguishable from true randomness. In fact, we could just iterate the Wang hash and it would make a perfectly good PRNG. However, it’s a few more instructions than Xorshift or the LCG, so it’ll be a little slower. It makes more sense to use the hash as a high-power tool to obliterate any correlation in the seeds, then turn things over to one of the PRNGs.
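For reference, here is the Wang hash ported to Python (my port, with explicit 32-bit masking). One property worth noting: every step is invertible modulo 2^32, so the hash is a bijection on 32-bit integers and distinct seeds can never collide.

```python
MASK32 = 0xFFFFFFFF

def wang_hash(seed):
    # Thomas Wang's 32-bit integer hash, matching the shader version
    seed = ((seed ^ 61) ^ (seed >> 16)) & MASK32
    seed = (seed * 9) & MASK32
    seed = seed ^ (seed >> 4)
    seed = (seed * 0x27d4eb2d) & MASK32
    seed = seed ^ (seed >> 15)
    return seed

# Adjacent thread IDs land far apart after hashing:
print([wang_hash(i) for i in range(4)])
```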

To sum up: if you need quick and easy GPU random numbers and don’t have stringent statistical requirements, I recommend either an LCG (faster) or Xorshift (better distributed when going deep), seeded using the Wang hash in either case. These are both great choices that are quite fast and will most likely give results plenty good enough for games and graphics applications.
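Putting the recommendation together, here is a self-contained CPU sketch of the full recipe (hash the per-thread seed once, then iterate the PRNG); the helper names are my own, not from the article:

```python
MASK32 = 0xFFFFFFFF

def wang_hash(seed):
    seed = ((seed ^ 61) ^ (seed >> 16)) & MASK32
    seed = (seed * 9) & MASK32
    seed = seed ^ (seed >> 4)
    seed = (seed * 0x27d4eb2d) & MASK32
    return seed ^ (seed >> 15)

def rand_xorshift(state):
    state = (state ^ (state << 13)) & MASK32
    state = state ^ (state >> 17)
    return (state ^ (state << 5)) & MASK32

def thread_random_floats(thread_id, count):
    # Hash once to decorrelate neighboring threads, then go "deep".
    # (Caveat: xorshift sticks at 0, and exactly one 32-bit input
    # hashes to 0, since the hash is a bijection.)
    state = wang_hash(thread_id)
    values = []
    for _ in range(count):
        state = rand_xorshift(state)
        values.append(state / 4294967296.0)
    return values

print(thread_random_floats(42, 3))
```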

## 15 comments on “Quick and Easy GPU Random Numbers in D3D11”

January 12, 2013

Cool! That’s really very useful.

Cheers,
Sam

January 12, 2013

Agreed, great post Nathan!

Also like the non-scientific comparison of the hashes; for graphics use cases our eyes are definitely our best tool, at least when performance-constrained (which we always are!)

January 14, 2013

Very interesting article, thanks!

I wonder about the relative speed of bit-operations versus floating-point on dx11-GPUs – do you have any profiled results on this? Also, I guess it would be interesting to compare it to the currently most commonly used shader random-function, that uses a different category of functions:

```glsl
// http://glsl.heroku.com/e#6002.0
float rand(vec2 n)
{
    return fract(sin(dot(n.xy, vec2(12.9898, 78.233))) * 43758.5453);
}
```

January 14, 2013

The original LCG looks much worse than I'd expect, unless this is using the low bit(s). Some of the suggested constants from one of L'Ecuyer's papers might perform better. Having said that, XorShift is a good choice. I'd be tempted to try MurmurHash2.

January 15, 2013

Sam and Johan: thanks!

Mikkel: I ran a quick test on my GTX 580, using a shader that does a long string of mads, and got roughly 1.75x faster results using the float one vs the integer one. That’s a little disappointing—I hoped they would be the same speed at least—but I think the advantages of using ints for proper PRNGs and hashes are worth it. I haven’t got a Kepler or any recent AMD GPUs handy to try, but it would be interesting to see how they compare.

As for the function you posted, I’ve seen a couple like that around. On my GTX 580 I profiled it as being about 4x slower than LCG, and comparable speed to Xorshift. It also has some artifacts (visible in the GLSL sandbox if you turn it up to 1:1 resolution). However, this function accepts 2D input and generates a [0,1) float natively, which is nice. It’s definitely interesting to see if anyone can come up with a purely floating-point PRNG with good distribution.

Marc: I’m using all 32 bits of the generated values (each 32×1 tile of the bitmap represents one generated uint). So the low bits are indeed in there. MurmurHash2 would probably be a good choice as well; looking at this code I counted 11 instructions for the 4-byte case, versus the Wang hash’s 9.

January 16, 2013

Geeky notes: The sinusoid based thing seems to stem from a stripped down Rey generator..never seen the article it’s based on, but the documentation of TestU01 gives an overview and the source has an implementation. Other pure float RNGs that come to mind are various flavors of Weyl generators (again basics can be found with TestU01) and permutation polynomials (https://github.com/ashima/webgl-noise). My gut is that none of these are worth investigating. Oh there are others, like you can make a float LCG, but in singles…seems tough. I’ll stop now because I’m being more noise than signal. :)

WRT: MurmurHash2 – Bad me, I was only thinking about the pre-hash part since it is separable. Dropping the final mix might be okay. Moving to a fragment shader, where you need to mix two or three values it should be a win.

March 16, 2013

Hi there.
Very nice writeup! :)
It's great to see this stuff being done.

I too am interested in random-ish number generation on the GPU. I’ve approached it from a slightly different angle and am doing it for use in procedural noise.

I have had very mixed results from integer hashing on GPUs. It seems that integer operations can still be very slow on many modern GPUs (even though they claim to support them in GLSL).
So I have stuck with floating-point-style RNGs. Floating-point RNG also allows compatibility with OpenGL ES and older hardware.

I looked into the Permutation Polynomial and didn't at all like the result. I found it was not very random at all, and it was still rather slow. I also looked into BBS as used by Marc Olano in MNoise, and found it to be even worse. (I too observe the wide vs. deep issue, so it was cool to see you mention it here. :) )
http://briansharpe.wordpress.com/2011/10/01/gpu-texture-free-noise/

So I came up with my own very basic/hacky hash function.
http://briansharpe.wordpress.com/2011/11/15/a-fast-and-simple-32bit-floating-point-hash-function/
It essentially multiplies+squares a 2D coordinate with a large number and calls fract() on the result. I have various 2D and 3D flavors of it. One nice thing about it is that we can get different results by simply changing the "large number" which is applied at the end, so the cost of generating e.g. 3 random numbers from a vec3 coord is not much more than the cost of generating 1 random number. (So it's ideal for noise, where multiple numbers are often required.)
Code can be found here.
https://github.com/BrianSharpe/GPU-Noise-Lib/blob/master/gpu_noise_lib.glsl

There is a lot of correlation which can be seen with my hash, but I've found it to be _ok_ for noise, because with e.g. Perlin noise we're combining 3 random numbers together to generate a final noise value. The overlaying of multiple numbers is enough to break up any correlation observed by the eye.

I did a quick test of the sin-based noise as posted by Mikkel. Even though it was slightly slower, the results were better. I may make use of it with value noises, where correlation is most visible due to only using 1 random number for the result. Making a 3D version of it would be needed too.

My next step is to make a 4D version.
At the moment texture-based permutation tables are still faster for GPU perlin noise than the stuff I have written. (about 30% faster)
But I have a hunch that hashing will be faster in the 4D case. We’ll see… ;)

If you have any thoughts on floating-point hashing, or on what I’ve done, then let me know.
Cheers!
Brian

March 21, 2013

I’m currently experimenting on a variation of the one Mikkel posted.

My target is D3D9, so I can't use bitwise operations.

HLSL:

```hlsl
float2 sincos_noise;
sincos(dot(tex.xy, float2(12.9898, 78.233)), sincos_noise.x, sincos_noise.y);
sincos_noise = frac(sincos_noise * 43758.5453 + tex.yx);
```

Sin and cos are both calculated with one sincos instruction so this gives 2 variations of the original formula for 2D noise.

I have yet to find one that looks more random, and/or runs faster while still working well with D3D9 shaders.

I wonder if the 2D noise it produces could be combined into a 1D noise with even better randomness, or if there is a clever and fast way to make it produce 3D noise.

July 3, 2013

Just to note: I’ve finished my 4D hash function.
Writeup can be found here:

Code can be found here:
https://github.com/BrianSharpe/GPU-Noise-Lib/blob/master/gpu_noise_lib.glsl

I use it to make fast table-free 4D Value and Perlin noise.
See functions:
FAST32_2_hash_4D()
Value4D()
Perlin4D()

cheers!

August 12, 2015

Awesomesauce!

September 16, 2015

LCGs with multiplier or increment values larger than about 20 bits behave badly when evaluated in double-precision math, because accuracy gets clipped beyond the 52-bit mantissa. The LCG bitmap here looks like it suffers from this clipping, although the code supplied seems fine. I expect the bitmap did not come from this code, or maybe the compiler/GPU uses doubles even for int math.
A decent LCG can look perfect to the eye, by the way, but testing toolkits can reveal many kinds of strong numeric patterns that are effectively invisible in raw bitmaps and have the potential to bork the performance of algorithms which rely on plain tempered numerical entropy.

May 11, 2016

Great ideas! It was not clear to me, however, in which part of the shader the wang_hash function should be run to avoid slowing down the whole per-vertex or per-pixel execution. I mean, if one is to use this slower function before running the LCG or Xorshift, it would slow down the process anyway.

Also, when you mention using the wang_hash result as a seed for the LCG or Xorshift, do you mean passing the value into what you've called rng_state?

May 11, 2016

AndradeS, which part of the shader you run it in will depend on what you’re trying to do: if you need random numbers per vertex, you’d run it in the vertex shader; if you need them per pixel, you’d run it in the pixel shader.

The hash would be run in the same shader that uses the RNG. You’d put together whatever seed data you have (such as vertex ID or screen-space pixel position, maybe incorporating a time-based seed as well) into a 32-bit integer, run that through the hash, and use the result as the initial state for the RNG.

July 6, 2016

Great post Nathan. I am new to RNGs and would like to evaluate them for the GPU and Xeon Phi. I really like your use of visuals for a quick evaluation. How did you generate those plots/graphics?

Thanks,
Greg

July 6, 2016

Hi Greg, I wrote a simple C++ program using Direct3D to run the RNGs as compute shaders and output the result as a bitmap file. It’s nothing special, but I put the code up on Github here if you want to look at it.