<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Nathan Reed&#039;s coding blog</title>
	<atom:link href="http://www.reedbeta.com/blog/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.reedbeta.com/blog</link>
	<description>Polygons and pixels and shaders, oh my!</description>
	<lastBuildDate>Mon, 13 Feb 2012 22:25:16 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>Understanding BCn Texture Compression Formats</title>
		<link>http://www.reedbeta.com/blog/2012/02/12/understanding-bcn-texture-compression-formats/</link>
		<comments>http://www.reedbeta.com/blog/2012/02/12/understanding-bcn-texture-compression-formats/#comments</comments>
		<pubDate>Sun, 12 Feb 2012 17:36:46 +0000</pubDate>
		<dc:creator>Nathan Reed</dc:creator>
				<category><![CDATA[Workflow]]></category>

		<guid isPermaLink="false">http://www.reedbeta.com/blog/?p=95</guid>
		<description><![CDATA[The current state of the art in GPU-supported texture compression is a set of seven formats called BC1 through BC7. These formats are used by almost all realistic 3D games to reduce the memory footprint of their texture maps. In this article, I&#8217;m going to get under the hood of each of the seven BCn [...]]]></description>
			<content:encoded><![CDATA[<p>The current state of the art in GPU-supported texture compression is a set of seven formats called BC1 through BC7.  These formats are used by almost all realistic 3D games to reduce the memory footprint of their texture maps.  In this article, I&#8217;m going to get under the hood of each of the seven BCn formats, explain how they work, and show how their features make them effective for compressing different kinds of images. <span id="more-95"></span> </p>
<h2>Why Texture Compression?</h2>
<p>Given that computer memories have gotten bigger, faster, and cheaper over the years, you might wonder whether there is still a need for textures to be compressed in memory at all.  (Compression on disk is another matter, but I&#8217;m not talking about that in this article.)  But at least for large-scale, realistic 3D games, aesthetic expectations have of course grown in step with the hardware improvements.  There&#8217;s a continuing push for more textures with higher resolutions, to increase visual detail and reduce repetition; and as game shading models become more sophisticated, each material requires a greater number of textures: diffuse color, normal map, specular color, gloss, emissive glow, and others. </p>
<p>Moreover, despite the vast advancements in hardware capabilities over the last couple of decades, texture sampling in a shader is becoming relatively <em>more</em> expensive, not less! Processors have gotten faster and faster over the years, and so has memory&#8212;but memory is falling behind, and the gap between memory and computation has grown wider.  Hardware designers are fighting an uphill battle to provide enough memory bandwidth to keep processors supplied with data to chew on.  Given this state of affairs, texture compression is not just about cramming more pixels into the game; it&#8217;s a crucial performance optimization, since it reduces by a large factor the memory bandwidth required to get those pixels into the GPU&#8217;s shader cores. </p>
<p>It&#8217;s clear that GPU-supported texture compression is not going to go away.  In fact, it is more vital than ever to building a game with high-quality graphics&#8212;and new, more sophisticated compression formats are a continuing area of research. </p>
<h2>Block Compression</h2>
<p>BC stands for &#8220;block compression&#8221;, and the BCn formats all operate in terms of 4&times;4 blocks of pixels.  All images are sliced up into these small blocks, and each block is self-contained&#8212;the data to decode it is all in one contiguous chunk in memory.  Moreover, the size of each compressed block is fixed&#8212;either 8 or 16 bytes, depending on which BCn format is being used.  If the source image is in 8-bit RGBA format (64 bytes per 4&times;4 block), this represents an 8:1 or 4:1 compression ratio, respectively. </p>
<p>This standard layout is designed to make it easy for the GPU to use these formats for rendering. GPUs need to be able to quickly access any part of a texture; they don&#8217;t read the entire image sequentially, so streaming compression algorithms and those that have variable compression ratios are poorly suited for this application.  The fixed block size makes it easy to locate the BCn block containing any given pixel, and the self-contained block data means that the GPU can decompress just the part of the image it needs to access. </p>
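<p>To make the addressing argument concrete, here is a minimal sketch of how a decoder could locate the block for a given pixel.  The function names are mine, and the row-by-row block order with no padding is an assumption of the sketch; real GPUs often use tiled or swizzled layouts in memory. </p>

```python
def bcn_block_offset(x, y, width, bytes_per_block):
    """Byte offset of the compressed block that contains pixel (x, y).

    Assumes blocks are stored row by row with no padding, which is an
    assumption of this sketch rather than a requirement of the formats.
    """
    blocks_per_row = (width + 3) // 4  # round up for widths not divisible by 4
    block_x, block_y = x // 4, y // 4
    return (block_y * blocks_per_row + block_x) * bytes_per_block


def pixel_index_in_block(x, y):
    """Row-major position (0..15) of the pixel inside its 4x4 block."""
    return (y % 4) * 4 + (x % 4)
```

<p>For a 256-pixel-wide texture with 8-byte blocks, pixel (5, 9) lives in block column 1 of block row 2, so its block starts at byte (2 &times; 64 + 1) &times; 8 = 1032.  No other part of the image needs to be read to decode it. </p>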
<p>Moreover, texture samples often exhibit locality in both dimensions&#8212;if one pixel from a texture is accessed, it&#8217;s likely that other pixels nearby in 2D will be accessed too.  The 4&times;4 block structure supports this well, since it&#8217;s convenient to decompress all 16 pixels at once and then store them in the texture cache, where they can be efficiently reused for additional sampling operations. </p>
<p>The first three BCn formats are better known as DXT1, DXT3, and DXT5, respectively, but BCn (for n between 1 and 7) are the more up-to-date names for these formats. </p>
<h2>Endpoints And Indices</h2>
<p>All of the BCn formats are based around a simple idea: in many images of interest, within any small area (such as a 4&times;4 block) there tends to be limited variation in the set of colors present. For example, take a look at this brick texture (from <a href="http://www.cgtextures.com/">CgTextures</a>).  An extract from the texture is shown here, together with several magnified 4&#215;4 blocks from it: </p>
<p><img src="http://www.reedbeta.com/blog/wp-content/uploads/2012/01/brick_extracts.png" title="A few 4x4 blocks extracted from a brick texture" width="368" height="256" class="aligncenter size-full wp-image-106" /> </p>
<p>If you look at these blocks, you can see what&#8217;s meant by &#8220;limited color variation&#8221;.  Within a  block, we often have just darker and lighter shades of a single color, or (at corners and edges) a gradient or blend between two contrasting colors.  The BCn formats are designed to exploit this redundancy by separating the definition of the colors in a block from their spatial distribution. </p>
<p>Each block in a BCn image has a very small color palette, with just a few colors to choose from. The pixels are then coded as indices into this palette, which requires only a handful of bits per pixel, since the palette is so small.  The palette is further compressed by assuming that all its colors are laid out evenly spaced along a line segment in RGB space.  Then the file only has to store the endpoints of that line, and all the other palette entries are reconstructed by blending the two endpoints in different proportions. </p>
<p><img src="http://www.reedbeta.com/blog/wp-content/uploads/2012/01/rgb_space.png" title="Four colors along a line in RGB space" width="388" height="316" class="aligncenter size-full wp-image-120" /> </p>
<p>All BCn blocks contain two main pieces of data: the endpoints of these color-space line segments, and the per-pixel palette indices that say how far along the line each pixel is. </p>
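<p>The endpoints-and-indices idea can be sketched in a few lines.  This is a conceptual illustration only: the names are hypothetical, the indices are assumed to be ordered along the line from the first endpoint to the second, and the rounding is my own choice rather than anything a particular format specifies. </p>

```python
def make_palette(endpoint0, endpoint1, size):
    """Colors evenly spaced along the segment from endpoint0 to endpoint1."""
    last = size - 1
    palette = []
    for i in range(size):
        # Integer blend with round-to-nearest; real decoders may round differently.
        color = tuple((a * (last - i) + b * i + last // 2) // last
                      for a, b in zip(endpoint0, endpoint1))
        palette.append(color)
    return palette


def decode_block(endpoint0, endpoint1, indices, palette_size=4):
    """Turn per-pixel palette indices back into colors."""
    palette = make_palette(endpoint0, endpoint1, palette_size)
    return [palette[i] for i in indices]
```

<p>With black and white endpoints and a four-entry palette, the two in-between entries come out as the grays 85 and 170; only the two endpoints and the 16 small indices ever need to be stored. </p>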
<p>Even without delving into the details of specific BCn formats, we can already get an idea of the kinds of situations in which they will fail.  Most of the BCn formats will give poor-quality results anywhere that three very different colors are present in a single block.  For example, a block containing a mix of red, green and blue pixels cannot be represented using the simpler BCn formats, because red, green, and blue do not lie along a straight line in RGB space.  This presents a problem for normal maps, which often have exactly this situation.  On the other hand, color textures like the brick shown above can often survive BCn compression with very little visible degradation. </p>
<p>As an example, here are the four blocks from above, compressed using BC1 format: </p>
<p><img src="http://www.reedbeta.com/blog/wp-content/uploads/2012/01/brick_blocks_compare.png" title="Comparison of BC1-compressed blocks vs uncompressed originals" width="376" height="112" class="aligncenter size-full wp-image-113" /> </p>
<p>Here you can clearly see how each block has collapsed down to no more than four distinct colors (BC1 uses a four-color palette), and subtle hue variations have vanished (because the four colors must lie on a line in RGB space).  The effects are obvious at the level of individual blocks, but at the level of the entire image they are much less apparent: </p>
<p><img src="http://www.reedbeta.com/blog/wp-content/uploads/2012/01/brick_compare.png" title="Comparison of BC1-compressed brick texture vs uncompressed original" width="528" height="292" class="aligncenter size-full wp-image-114" /> </p>
<p>If you look carefully, you can see definite differences between these two images&#8212;particularly along the edges between shadowed and bright areas&#8212;but on the whole, the compressed image looks very faithful to the original. </p>
<p>Now that we&#8217;ve gone over the basics, let&#8217;s jump into the details of specific BCn formats. </p>
<h2>BC1</h2>
<p>BC1 stores RGB data.  It technically supports an alpha channel, but the alpha is only 1-bit (that is, it must be either 0 or 255).  It uses 8 bytes to store each 4&times;4 block, giving it an average data rate of 0.5 bytes per pixel.  Each block consists of two color endpoints, which are stored in 2 bytes each, using RGB 5:6:5 format.  The palette contains four entries generated from those endpoints, so the indices require 2 bits per pixel, making up the other 4 bytes of the block. </p>
<p>BC1 is a good choice for most standard-issue color maps, unless there&#8217;s a specific reason to use one of the other formats.  One such reason could be that the image requires smooth gradients.  Due to the use of 5:6:5 colors, BC1 cannot represent smooth gradients well, as illustrated here: </p>
<p><img src="http://www.reedbeta.com/blog/wp-content/uploads/2012/01/gradient_compare_bc1.png"  title="Comparison of a gradient, uncompressed vs BC1" width="640" height="144" class="aligncenter size-full wp-image-127" /> </p>
<p>The top gradient should look relatively smooth (although on many monitors you will still see some banding&#8212;with today&#8217;s contrast ratios <a href="http://19lights.com/wp/2011/09/30/is-8-bits-enough-of-course-not/">eight bits aren&#8217;t enough</a> for perfectly smooth gradients).  However, the bottom gradient shows much more banding than the top one. Some bands are even visibly green, since the 5:6:5 encoding gives us colors like (57, 60, 57). </p>
<p>This isn&#8217;t an issue with fitting pixels into a four-color palette.  The gradients above are 512 pixels wide, which means that each 4&times;4 block contains exactly two colors&#8212;no issue for a four-color palette to handle!  The problem is that the endpoints aren&#8217;t stored with enough precision.  However, this is only an issue for certain kinds of textures that involve very smooth gradients or very subtle color variations, such as skies and human skin.  For these kinds of images, BC7 (described later) may be a better choice.  The majority of game textures undergo little visible degradation in BC1, like the brick texture in the previous section. </p>
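<p>The precision loss is easy to reproduce.  The sketch below rounds an 8-bit color to 5:6:5 and expands it back to 8 bits by bit replication; both the rounding rule and the expansion rule are assumptions on my part (compressors and hardware may differ slightly), and the function name is hypothetical. </p>

```python
def quantize_565(r, g, b):
    """Round an 8-bit RGB color to 5:6:5, then expand it back to 8 bits."""
    r5 = (r * 31 + 127) // 255  # nearest 5-bit level
    g6 = (g * 63 + 127) // 255  # nearest 6-bit level
    b5 = (b * 31 + 127) // 255
    # Expand by copying the high bits into the vacated low bits.
    return ((r5 << 3) | (r5 >> 2),
            (g6 << 2) | (g6 >> 4),
            (b5 << 3) | (b5 >> 2))
```

<p>Under these assumptions, the neutral gray (60, 60, 60) comes back as (57, 60, 57).  Green has one more bit than red and blue, so it lands on a different level, and the gray picks up a faint green tint. </p>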
<h3>Degeneracy, and Breaking It</h3>
<p>I mentioned earlier that BC1 supports a 1-bit alpha channel.  How does this fit in?  After the 4 bytes of endpoints and 4 bytes of indices described above, there&#8217;s apparently no more space in the block for additional data.  But there is a loophole that can be exploited to pack even more information into the same space: degeneracy!  The system described above is <em>degenerate</em>, meaning that there are multiple ways to encode the same image.  In this case, the degeneracy originates in the symmetry of the two color endpoints.  Nothing about the BC1 format thus far singles out in what order the endpoints should be stored: if you swap the two, and invert the indices to compensate, you end up with exactly the same 4&times;4 block of pixels as before.  So there is a twofold degeneracy: there are two equally good ways to store the same image. </p>
<p>BC1 cleverly exploits this by <em>breaking</em> the degeneracy: it defines an alternative mode that is triggered for a given block by the order of the endpoints.  Although the endpoints are colors in 5:6:5 format, they can also be interpreted as 16-bit unsigned integers.  If the first endpoint is numerically greater than the second, the above description of the format holds: the palette contains four colors spaced evenly from one endpoint to the other.  But if the first endpoint is less than or equal to the second, the palette is modified: its first three entries are three colors spaced evenly from one endpoint to the other (that is, the two endpoints and their average), and the fourth entry is transparent: black with zero alpha. </p>
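<p>Here is a rough sketch of a BC1 block decoder showing both modes.  It follows the commonly documented block layout (two little-endian 16-bit endpoints, then 32 bits of indices with the first pixel in the lowest bits), in which palette entries 0 and 1 are the two endpoints and entries 2 and 3 are the in-between colors, so the stored index order differs from the along-the-line order used in the explanation above.  The function names and the truncating division are my own simplifications; exact rounding varies between implementations. </p>

```python
import struct


def unpack_565(c):
    """Expand a 16-bit 5:6:5 color to an 8-bit-per-channel RGB tuple."""
    r, g, b = (c >> 11) & 0x1F, (c >> 5) & 0x3F, c & 0x1F
    return ((r << 3) | (r >> 2), (g << 2) | (g >> 4), (b << 3) | (b >> 2))


def decode_bc1_block(block):
    """Decode one 8-byte BC1 block into 16 RGBA tuples, row-major."""
    c0, c1, bits = struct.unpack("<HHI", block)
    e0, e1 = unpack_565(c0), unpack_565(c1)
    if c0 > c1:
        # Four-color mode: the endpoints plus two evenly spaced blends.
        mid_a = tuple((2 * a + b) // 3 for a, b in zip(e0, e1))
        mid_b = tuple((a + 2 * b) // 3 for a, b in zip(e0, e1))
        palette = [e0 + (255,), e1 + (255,), mid_a + (255,), mid_b + (255,)]
    else:
        # Three-color mode: the endpoints, their average, and transparent black.
        avg = tuple((a + b) // 2 for a, b in zip(e0, e1))
        palette = [e0 + (255,), e1 + (255,), avg + (255,), (0, 0, 0, 0)]
    return [palette[(bits >> (2 * i)) & 3] for i in range(16)]
```

<p>Note how the same index value 3 means a blended gray when the first endpoint is greater, and a transparent pixel otherwise.  The mode switch costs no extra bits. </p>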
<p>In this way, BC1 can support a 1-bit alpha by switching into this second mode for specific blocks that contain transparent pixels.  However, some precision must be sacrificed for the non-transparent values in these blocks because only three distinct colors remain, rather than four.  BC1 also cannot store any color information in the transparent areas.  This makes it suitable for storing texture maps for &#8220;cutout&#8221; materials, such as grates, fences, and vegetation, where alpha testing is used to discard the transparent parts of the image.  Care must be taken, however, when using bilinear filtering with BC1 cutout textures.  Since the color component of the transparent pixels is always black, a dark fringe will form around the transparent areas where bilinear filtering blends the colored pixels with the transparent ones.  This can be avoided by setting the alpha threshold high, so that the dark areas are culled away, or by dividing the interpolated color by the interpolated alpha in the shader, which will cancel out the darkening. </p>
<p><img src="http://www.reedbeta.com/blog/wp-content/uploads/2012/01/grate_compare.png"  title="Comparison of BC1 cutout texture filtered with and without alpha correction" width="514" height="285" class="aligncenter size-full wp-image-141" /> </p>
<p>Here is an example of an image with 1-bit alpha, upscaled with bilinear filtering, with an alpha test threshold of 128.  On the left, only plain bilinear filtering has been used, producing a dark fringe where the color of the texture is interpolated 50% toward the black of the transparent area. On the right, the interpolated alpha has been divided out. </p>
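<p>The arithmetic behind the fringe and its fix can be sketched as follows.  This models a single filtered sample halfway between an opaque texel and a transparent one, using floating-point channels in the range 0 to 1; the function names are hypothetical and the code stands in for what a shader would do. </p>

```python
def lerp(a, b, t):
    return a + (b - a) * t


def filtered_sample(opaque_rgb, t):
    """Blend an opaque texel toward a BC1 transparent texel (black, alpha 0)."""
    opaque = tuple(opaque_rgb) + (1.0,)
    transparent = (0.0, 0.0, 0.0, 0.0)
    return tuple(lerp(o, tr, t) for o, tr in zip(opaque, transparent))


def divide_out_alpha(rgba):
    """Undo the darkening by dividing color by the interpolated alpha."""
    r, g, b, a = rgba
    if a == 0.0:
        return (0.0, 0.0, 0.0, 0.0)  # fully transparent: no color to recover
    return (r / a, g / a, b / a, a)
```

<p>Halfway between the two texels, the color (0.8, 0.4, 0.2) is filtered down to (0.4, 0.2, 0.1) with alpha 0.5.  Dividing by that alpha restores the original color, while the alpha itself is still available for the alpha test. </p>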
<h2>BC4</h2>
<p>Rather than going through these formats in numerical order (which corresponds to the chronological order in which they were introduced, in successive generations of GPU hardware), I&#8217;m going to go in order from the simplest to most complicated.  After BC1, BC4 is the next logical step. </p>
<p>BC4 stores a grayscale image&#8212;no RGB, just a single color channel&#8212;and uses 8 bytes per block. Its endpoints are one byte each, and it uses an eight-element palette, so it has 3 bits of indices per pixel. </p>
<p>BC4 is the same size as BC1, but it gives much better quality than BC1 when storing a grayscale image, due to both the expanded palette (eight elements instead of four) and the extended endpoint precision (8 bits instead of 5&ndash;6).  This makes BC4 an excellent choice for height maps, gloss maps, and any other kind of grayscale texture.  Compare the quality of the gradient here with that of BC1, above: </p>
<p><img src="http://www.reedbeta.com/blog/wp-content/uploads/2012/01/gradient_compare_bc4.png"  title="Comparison of a gradient, uncompressed vs BC4" width="640" height="144" class="aligncenter size-full wp-image-128" /> </p>
<p>There is little or no visible difference between the BC4-compressed gradient and the uncompressed original. </p>
<p>Like BC1, BC4 makes use of degeneracy breaking by defining an alternative mode triggered based on the order of the endpoints.  If the first endpoint is numerically greater than the second, the palette consists of eight values evenly spaced from one endpoint to the other.  Otherwise, the first six entries in the palette are evenly spaced from one endpoint to the other, and the last two are, respectively, 0 and 255&#8212;black and white.  BC4 already has excellent quality due to its full 8-bit endpoints and large palette, but allowing this alternative mode can make it even better in certain cases, such as at sharp edges between black and white areas in the map. </p>
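<p>A sketch of the two BC4 palettes, plus a block decoder, might look like this.  As with BC1, the stored palette order puts the two endpoints first and the in-between values after them, and the indices are assumed to be packed 3 bits per pixel, little-endian, after the two endpoint bytes.  The names and the truncating division are my own; hardware rounding may differ. </p>

```python
def bc4_palette(e0, e1):
    """Eight-entry palette for a BC4 block with 8-bit endpoints e0 and e1."""
    if e0 > e1:
        # Eight values evenly spaced between the endpoints.
        return [e0, e1] + [((7 - i) * e0 + i * e1) // 7 for i in range(1, 7)]
    # Six evenly spaced values, plus explicit black and white.
    return [e0, e1] + [((5 - i) * e0 + i * e1) // 5 for i in range(1, 5)] + [0, 255]


def decode_bc4_block(block):
    """Decode one 8-byte BC4 block into 16 grayscale values, row-major."""
    palette = bc4_palette(block[0], block[1])
    bits = int.from_bytes(block[2:8], "little")  # 16 indices, 3 bits each
    return [palette[(bits >> (3 * i)) & 7] for i in range(16)]
```

<p>For example, <code>bc4_palette(100, 200)</code> yields the six grays 100, 200, 120, 140, 160, 180 followed by 0 and 255, so a block of mid-grays can still contain pure black or pure white pixels. </p>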
<h2>BC2, BC3, and BC5</h2>
<p>These formats are simply combinations of the previous two.  BC3 stores RGBA data, using BC1 for the RGB part and BC4 for the alpha part, for a total block size of 16 bytes, or an average of 1 byte per pixel.  It&#8217;s the most common format for textures that require a full alpha channel, and can also be used for packing a color texture together with any grayscale image, such as a height map or gloss map.  Since the alpha is stored separately from the color, BC3 does not use the BC1 1-bit alpha mode in the color part. </p>
<p>BC5 is a two-channel format in which each block is just two BC4 blocks.  This is very useful for tangent-space normal maps, if the X and Y components are stored and the Z component is reconstructed in the pixel shader.  Since each channel has its own endpoints and indices, normal maps&#8212;in which the X and Y components are often &#8220;doing different things&#8221;, so to speak&#8212;retain quite a bit more fidelity in BC5 than in BC1.  The downside is that BC5 requires twice as much memory, at 16 bytes per block; this can also make it slower for shaders to access because more memory bandwidth is needed to get the texture to the shader cores.  But this may be a price worth paying for the substantial increase in quality. </p>
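<p>Reconstructing Z relies on the normal being unit length.  The sketch below shows the math in Python rather than shader code; it assumes the two channels are stored as unsigned bytes mapped from the range &minus;1 to 1, which is one common convention but not the only one, and the function name is hypothetical. </p>

```python
import math


def reconstruct_normal(x_byte, y_byte):
    """Rebuild a unit tangent-space normal from two decoded BC5 channels."""
    x = x_byte / 255.0 * 2.0 - 1.0
    y = y_byte / 255.0 * 2.0 - 1.0
    # Compression error can push x*x + y*y slightly above 1, so clamp at zero.
    z = math.sqrt(max(0.0, 1.0 - x * x - y * y))
    return (x, y, z)
```

<p>This works because tangent-space normals point away from the surface, so Z is never negative and its sign does not need to be stored. </p>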
<p>BC2 is a bit of an odd duck, and frankly is never used nowadays.  It stores RGBA data, using BC1 for the RGB part, and a straight 4 bits per pixel for the alpha channel.  The alpha part doesn&#8217;t use any endpoints-and-indices scheme; it just stores explicit pixel values.  But since each alpha value is just 4 bits, there are only 16 distinct levels of alpha, which causes extreme banding and makes it impossible to represent a smooth gradient or edge even approximately.  Like BC3, it totals 16 bytes per block.  As far as I can tell, there&#8217;s no reason ever to use this format, since BC3 can do a better job in the same amount of memory.  I include it here just for historical reasons. </p>
<h2>BC6 and BC7</h2>
<p>The final two formats were introduced very recently, just within the last couple of years, and are only supported by D3D11-level graphics hardware.  They&#8217;re also vastly more complex than any of the other formats we&#8217;ve discussed.  As a result of their newness, complexity, and hardware requirements, they&#8217;re not yet well supported in texture compression tools and libraries, and aren&#8217;t yet well-known or widely used. </p>
<p>Both of them consume 16 bytes per block, the same as BC3 and BC5.  BC7 targets 8-bit RGB or RGBA data, and BC6 targets RGB <a href="http://en.wikipedia.org/wiki/Half-precision_floating-point_format">half-precision</a> floating-point data.  BC6 is therefore the only BCn format that can natively store HDR images, and is an excellent replacement for RGBM and other HDR encodings that rendering programmers have heretofore used to shoehorn HDR data into compressed textures. </p>
<p>The reason BC6 and BC7 are so complicated is that they allow a variety of different modes that change the details of the format, such as the palette size and the way the endpoints are stored. Modes are specified by the first few bits of each block, so each block can effectively have a different format!  This makes BC6 and BC7 very adaptable to the image contents, as they can choose the best mode for each individual 4&times;4 block.  But the downside is that compressing to BC6 or BC7 is much more difficult and slow, since the compressor has many more options to try to achieve the best-quality representation of each block. </p>
<p>The different modes essentially trade off various features of the format.  For example, they trade off endpoint precision versus index precision: some modes have larger palettes, but store the endpoints with fewer bits per component; other modes have higher-precision endpoints, but smaller palettes. </p>
<p>Another enhancement that BC6 and BC7 feature is the ability to have more than one line segment in each block.  Now, some of the formats described previously&#8212;namely BC3 and BC5&#8212;have two line segments per block, but they use them for different channels&#8212;in BC3, color and alpha, or in BC5, the two grayscale channels.  BC6 and BC7 introduce the concept of <em>partitioning</em>, which allows different line segments to be used for different pixels in the block.  There are a variety of spatial partitioning patterns available (the ones for BC6 can be seen <a href="http://msdn.microsoft.com/en-us/library/hh308952.aspx#bc6h_partition_set">here</a>).  These predefined patterns assign each pixel in the block to one line segment or another; the indices then control which color out of that line segment&#8217;s palette the pixel gets.  Partitioning can improve quality in cases where the colors in a block don&#8217;t fall very neatly along a single line in RGB space; with multiple line segments, the original range of colors in the block can be reproduced more faithfully. </p>
<p>The number of line segments is controlled by the per-block mode setting.  Modes with more line segments have more endpoints to store, so naturally, these modes also tend to have lower-precision endpoints and smaller palettes&#8212;everything still has to fit in 16 bytes per block.  When applicable, the chosen partition pattern is also stored by a few more bits in the block. </p>
<h3>Image Comparisons</h3>
<p>As an example of the higher image quality enabled by the sophisticated BC7 features, here are the four blocks extracted from the brick texture earlier in this article, compressed with BC7: </p>
<p><img src="http://www.reedbeta.com/blog/wp-content/uploads/2012/02/brick_blocks_compare_bc7.png"  title="Comparison of BC7-compressed blocks vs uncompressed originals" width="376" height="112" class="aligncenter size-full wp-image-154" /> </p>
<p>If you compare this with the earlier BC1 version of this image, there&#8217;s a massive difference.  The second block from the left still appears visibly degraded relative to the uncompressed version, but it&#8217;s much better than in BC1, and the other three blocks show hardly any visible differences from their uncompressed versions. </p>
<p>At the zoomed-out level, there is no perceptible difference between the uncompressed and BC7 versions of the image, even if you examine them quite closely: </p>
<p><img src="http://www.reedbeta.com/blog/wp-content/uploads/2012/02/brick_compare_bc7.png"  title="Comparison of BC7-compressed brick texture vs uncompressed original" width="528" height="292" class="aligncenter size-full wp-image-153" /> </p>
<p>Finally, BC7 also works extremely well on the gradient example, where BC1 failed miserably: </p>
<p><img src="http://www.reedbeta.com/blog/wp-content/uploads/2012/02/gradient_compare_bc7.png"  title="Comparison of a gradient, uncompressed vs BC7" width="640" height="144" class="aligncenter size-full wp-image-152" /> </p>
<h3>Compression Cleverness</h3>
<p>The slightly unfortunate thing about BC6 and BC7 is that because they have all these different per-block options, they have to use up some of the precious 16 bytes just to say which options were picked, leaving less space for the actual contents&#8212;the endpoints and indices. However, the smart people who invented these formats (presumably some engineers at NVIDIA and AMD&#8230;will we ever know who they are?) found a whole bag of tricks for squeezing more data into a block. </p>
<p>For example, recall that BC1 and BC4 both exploit degeneracy breaking to effectively gain one more bit of data, by using the order of the two endpoints in a block as a signal to switch between two modes.  BC6 and BC7 also use degeneracy breaking, but not to switch modes; they take advantage of it to eliminate one bit of the indices.  When you swap the two endpoints, you have to bitwise-invert all the indices to compensate: indices 00, 01, 10, and 11 become 11, 10, 01, and 00, respectively. But you can also turn this around: you can always invert the indices by swapping the endpoints.  So you can declare, for instance, that the most-significant bit of the upper-left pixel&#8217;s index will always be 0: if it&#8217;s not, swap the endpoints and then it will be!  Then you don&#8217;t have to actually store the 0 bit; that bit can be used for something else.  In partitioned modes, one bit can be saved for each line segment by swapping that segment&#8217;s endpoints. </p>
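<p>The swap-and-invert trick is simple enough to sketch.  This uses scalar endpoints and the conceptual index order (index k sits k/(n&minus;1) of the way from the first endpoint to the second); the function names are hypothetical, and a real BC6 or BC7 encoder applies the same idea per line segment with the bit widths of the chosen mode. </p>

```python
def decode(endpoint0, endpoint1, indices, index_bits=2):
    """Conceptual decode: blend the endpoints according to each index."""
    last = (1 << index_bits) - 1
    return [(endpoint0 * (last - i) + endpoint1 * i) / last for i in indices]


def canonicalize(endpoint0, endpoint1, indices, index_bits=2):
    """Swap endpoints if needed so the first index has a zero top bit."""
    top_bit = 1 << (index_bits - 1)
    if indices[0] & top_bit:
        mask = (1 << index_bits) - 1
        indices = [i ^ mask for i in indices]  # bitwise-invert every index
        endpoint0, endpoint1 = endpoint1, endpoint0
    return endpoint0, endpoint1, indices
```

<p>After <code>canonicalize</code>, the block decodes to exactly the same pixels as before, but the top bit of the first index is known to be 0, so the encoder can simply leave it out. </p>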
<p>Another space-saving trick, which BC6 uses in most of its modes, is delta compression for endpoints. Rather than storing all the endpoints at a high level of precision, these modes store one endpoint at relatively high precision, then represent the other endpoints as lower-precision delta vectors relative to that base endpoint.  (Unfortunately, the spec refers to this feature by the uninformative phrase &#8220;transformed endpoints&#8221;.)  This is an interesting approach because it restricts the possible lengths and orientations of the line segments in RGB space, but still allows their absolute position to be set precisely.  Quantized lengths and orientations for the line segments will naturally introduce error in color reproduction, but precise absolute positioning allows the error to be distributed more evenly over all the pixels, as opposed to concentrating in a few. </p>
<p>Incidentally, although BC6 deals in floating-point values, it nevertheless treats them as 16-bit integers throughout almost all stages of the decoding and interpolation process!  If you&#8217;re wondering how that can even work at all, take a look at <a href="http://randomascii.wordpress.com/2012/01/23/stupid-float-tricks-2/">this article</a>.  The key point is that magnitude order&#8212;from lower to higher values&#8212;is the same for floats and their integer representations.  So interpolating floats as if they were ints actually does something reasonable, although it does not always produce <em>linear</em> interpolation&#8212;it effectively interpolates along a piecewise-linear approximation of a logarithmic curve!  BC6 does involve some special-casing to handle negative numbers, NaNs and infinities, but it mostly treats its values as ints, relying on this fact about the IEEE float representation to make everything work. </p>
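<p>You can see this effect with a few lines of Python, using the half-precision format code of the <code>struct</code> module.  This sketch only illustrates the idea for positive, finite values; it is not BC6&#8217;s actual interpolation procedure, and the function names are hypothetical. </p>

```python
import struct


def half_bits(value):
    """Bit pattern of a half-precision float, as a 16-bit unsigned integer."""
    return struct.unpack("<H", struct.pack("<e", value))[0]


def bits_to_half(bits):
    """Half-precision float whose bit pattern is the given 16-bit integer."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def interpolate_as_ints(a, b, t):
    """Interpolate two positive half floats by blending their bit patterns."""
    ia, ib = half_bits(a), half_bits(b)
    return bits_to_half(round(ia + (ib - ia) * t))
```

<p>Halfway between 1.0 and 2.0 this gives 1.5, just as linear interpolation would, because both values share an exponent.  But halfway between 1.0 and 4.0 it gives 2.0 rather than 2.5: across exponent boundaries the result follows the logarithmic curve. </p>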
<p>BC7 does not use delta compression for its endpoints, but has a similar mechanism, referred to as &#8220;P-bits&#8221;.  A P-bit is a shared least-significant bit that gets tacked onto the end of all the RGB color values of the endpoints.  This is a bit of a head-scratcher, but the end result of it is very like that of BC6&#8217;s delta compression: the possible lengths and orientations of the RGB line segments are more coarsely quantized, but the whole line segment can be positioned more precisely, allowing error to be distributed more evenly over the pixels in the block. </p>
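<p>A P-bit can be sketched as follows.  The number of stored bits per channel depends on the BC7 mode, so it is left as a parameter here; the default of 7 bits, the function name, and the bit-replication step used to widen the result to 8 bits are assumptions for illustration. </p>

```python
def expand_with_pbit(channels, p_bit, stored_bits=7):
    """Append one shared low bit to every channel, then widen to 8 bits."""
    total_bits = stored_bits + 1
    out = []
    for c in channels:
        v = (c << 1) | p_bit   # the P-bit becomes the new least-significant bit
        v <<= 8 - total_bits   # left-align within 8 bits
        v |= v >> total_bits   # fill the low bits by replicating the high bits
        out.append(v)
    return tuple(out)
```

<p>With 7 stored bits, the channels (127, 0, 64) decode to (255, 1, 129) when the P-bit is 1 and to (254, 0, 128) when it is 0.  One extra bit shifts the whole endpoint by a single step in all channels at once, which is what lets the segment be positioned more finely than the stored bits alone would allow. </p>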
<p>Finally, since BC7 can store both color and alpha channels, it&#8217;s equipped with modes that offer a few choices for combining the two.  Some modes have one set of indices for all four channels, which (roughly speaking) requires color and alpha to have the same spatial distribution within the block.  Other modes include two distinct sets of indices, one for color and one for alpha, allowing the two to be relatively independent.  Moreover, the separate-index modes also include channel swapping flags: alpha can be swapped with red, green, or blue, or left in place.  This effectively allows any of the four channels to use distinct indices from the rest. </p>
<h2>Comparison Table</h2>
<p>To sum it all up, here&#8217;s a table listing the major differences between the seven BCn formats: </p>
<table class="datatable">
<tr>
<th></th>
<th>Type Of Data</th>
<th>Data Rate</th>
<th>Palette Size</th>
<th>Line Segments</th>
<th>Use For</th>
</tr>
<tr>
<th>BC1</th>
<td>RGB + optional 1-bit alpha</td>
<td>0.5 byte/px</td>
<td>4</td>
<td>1</td>
<td>Color maps<br/>Cutout color maps (1-bit alpha)<br/>Normal maps, if memory is tight</td>
</tr>
<tr>
<th>BC2</th>
<td>RGB + 4-bit alpha</td>
<td>1 byte/px</td>
<td>4</td>
<td>1</td>
<td>n/a</td>
</tr>
<tr>
<th>BC3</th>
<td>RGBA</td>
<td>1 byte/px</td>
<td>4 color + 8 alpha</td>
<td>1 color + 1 alpha</td>
<td>Color maps with full alpha<br/>Packing color and mono maps together</td>
</tr>
<tr>
<th>BC4</th>
<td>Grayscale</td>
<td>0.5 byte/px</td>
<td>8</td>
<td>1</td>
<td>Height maps<br/>Gloss maps<br/>Font atlases<br/>Any grayscale image</td>
</tr>
<tr>
<th>BC5</th>
<td>2 &times; grayscale</td>
<td>1 byte/px</td>
<td>8 per channel</td>
<td>1 per channel</td>
<td>Tangent-space normal maps</td>
</tr>
<tr>
<th>BC6</th>
<td>RGB, floating-point</td>
<td>1 byte/px</td>
<td>8&ndash;16</td>
<td>1&ndash;2</td>
<td>HDR images</td>
</tr>
<tr>
<th>BC7</th>
<td>RGB or RGBA</td>
<td>1 byte/px</td>
<td>4&ndash;16</td>
<td>1&ndash;3</td>
<td>High-quality color maps<br/>Color maps with full alpha</td>
</tr>
</table>
<h2>Compressors</h2>
<p>In this article, I described the data representation of the BCn formats, but not how to write compressors for them.  Like many compression techniques, the BCn formats are designed to be simple and fast to decompress&#8212;but that often comes at the cost of making compression difficult!  Writing high-quality BCn compressors is a big enough topic for its own article, and is a subject of ongoing research.  In particular, the BC6 and BC7 formats have a much greater &#8220;search space&#8221; because they offer so many more options for encoding each block of an image, and high-quality BC6 and BC7 compression involves a lot of brute-force searching for the best (lowest-error) combination of options, making these compressors quite a bit slower than those for BC1&ndash;5. </p>
<p>For compressing and decompressing BC1&ndash;5, the open-source <a href="http://code.google.com/p/nvidia-texture-tools/">NVIDIA Texture Tools</a> are probably your best bet.  NVTT includes both command-line utilities and a set of libraries that can be linked into your own projects. </p>
<p>It is still surprisingly difficult to find any publicly available tools that support BC6&ndash;7 at all.  The only publicly available tools I&#8217;ve found that can compress to these formats with high image quality are NVIDIA&#8217;s internal development compressors, which have been open-sourced and are hosted at the NVTT Google Code repository <a href="http://code.google.com/p/nvidia-texture-tools/downloads/list">here</a>.  These tools are not the user-friendliest!  They run <em>very</em> slowly&#8212;taking several minutes to encode a 256&times;256 image&#8212;and they don&#8217;t save to a standard file format like DDS, but simply dump the compressed blocks into a raw binary file.  Their license status is also unknown, so it may not be safe from a legal perspective to incorporate their source into your own projects.  However, they do give very high-quality results, and I used them for all the BC7 comparison images in this article. </p>
<p>Compression and decompression of all the BCn formats are also supported by the Direct3D 11 SDK in the form of the <a href="http://msdn.microsoft.com/en-us/library/ff476286.aspx">D3DX11CreateTextureFromFile</a> function.  However, in my experience the compressed image quality from this API is not very good, so I would not advise using it for compression (decompression should be fine). </p>
<h2>References</h2>
<p>For the full, bit-for-bit specifications of each of the BCn formats, see these references: </p>
<ul>
<li><a href="http://msdn.microsoft.com/en-us/library/hh308955.aspx">Overview of all the BCn formats</a></li>
<li><a href="http://msdn.microsoft.com/en-us/library/bb694531.aspx">Complete specs for BC1&ndash;5</a></li>
<li><a href="http://www.opengl.org/registry/specs/ARB/texture_compression_bptc.txt">Complete specs for BC6&ndash;7</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.reedbeta.com/blog/2012/02/12/understanding-bcn-texture-compression-formats/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>The Shunting-Yard Algorithm</title>
		<link>http://www.reedbeta.com/blog/2011/12/11/the-shunting-yard-algorithm/</link>
		<comments>http://www.reedbeta.com/blog/2011/12/11/the-shunting-yard-algorithm/#comments</comments>
		<pubDate>Sun, 11 Dec 2011 19:51:41 +0000</pubDate>
		<dc:creator>Nathan Reed</dc:creator>
				<category><![CDATA[Languages]]></category>

		<guid isPermaLink="false">http://www.reedbeta.com/blog/?p=71</guid>
		<description><![CDATA[In game development (or programming in general) it&#8217;s not uncommon to have a situation where you&#8217;d like to let a user enter an arithmetic formula that your code parses and evaluates. For example, in a shader you might like to have an annotation that specifies how a parameter is to be computed in the main [...]]]></description>
			<content:encoded><![CDATA[<p>In game development (or programming in general) it&#8217;s not uncommon to have a situation where you&#8217;d like to let a user enter an arithmetic formula that your code parses and evaluates.  For example, in a shader you might like to have an annotation that specifies how a parameter is to be computed in the main application.  In various kinds of authoring tools you might like to create a shape, image, or animation based on a mathematical function.  Embedding a full-fledged scripting language like Python or Lua is a bit overkill for these kinds of tasks.  So how can we handle arithmetic expressions without a large amount of infrastructure? </p>
<p><span id="more-71"></span> In textbooks and university computer-science courses, we often hear about a few classic approaches to parsing formulas.  One is the so-called <a href="http://en.wikipedia.org/wiki/Reverse_Polish_notation">reverse Polish notation</a>, where we write formulas in postfix form, with operators following their operands:</p>

<div class="wp_syntax"><div class="code"><pre class="cpp" style="font-family:monospace;"><span style="color: #0000dd;">2</span> <span style="color: #0000dd;">3</span> <span style="color: #000040;">+</span>           <span style="color: #666666;">// means 2 + 3</span>
a b <span style="color: #000040;">+</span> c d <span style="color: #000040;">+</span> <span style="color: #000040;">*</span>   <span style="color: #666666;">// means (a + b) * (c + d)</span>
pi <span style="color: #0000dd;">4</span> <span style="color: #000040;">/</span> <span style="color: #0000dd;">sin</span>      <span style="color: #666666;">// means sin(pi/4)</span></pre></div></div>

<p>The nice things about RPN are that (a) parentheses are not needed, nor the concepts of operator precedence and associativity, since the order of operations is fully specified by the notation; and (b) it can be parsed by a simple algorithm: scan the formula left-to-right, when you see an operand push it on a stack, and when you see an operator, pop the required operand(s) off the stack, apply the operator, and push the result back on.  When you&#8217;re done, the result is the only item left on the stack (assuming well-formed input). </p>
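<p>As a rough illustration of point (b), here is a minimal C++ sketch of an RPN evaluator.  The function name <tt>EvalRPN</tt>, the whitespace-separated token format, and the restriction to the four basic binary operators are all assumptions made for this example; it also assumes well-formed input and only guards against a missing operand with an assertion. </p>

```cpp
#include <cassert>
#include <sstream>
#include <stack>
#include <string>

// Evaluate a whitespace-separated RPN formula such as "2 3 +".
// Hypothetical helper: handles only numbers and the operators + - * /.
double EvalRPN(const std::string & formula)
{
    std::stack<double> operands;
    std::istringstream tokens(formula);
    std::string tok;
    while (tokens >> tok)
    {
        bool isOperator = (tok.size() == 1 &&
                           std::string("+-*/").find(tok[0]) != std::string::npos);
        if (isOperator)
        {
            // Pop the right-hand operand first, then the left-hand one
            assert(operands.size() >= 2);
            double rhs = operands.top(); operands.pop();
            double lhs = operands.top(); operands.pop();
            switch (tok[0])
            {
            case '+': operands.push(lhs + rhs); break;
            case '-': operands.push(lhs - rhs); break;
            case '*': operands.push(lhs * rhs); break;
            case '/': operands.push(lhs / rhs); break;
            }
        }
        else
        {
            // Anything else is treated as a numeric operand
            operands.push(std::stod(tok));
        }
    }
    // With well-formed input, the result is the only item left
    return operands.top();
}
```

<p>For instance, <tt>EvalRPN("1 2 + 3 4 + *")</tt> evaluates (1 + 2) * (3 + 4) and returns 21. </p>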
<p>That&#8217;s quite easy to add to an application; there&#8217;s almost no infrastructure needed.  But it has the disadvantage that it requires users to work in this unfamiliar and awkward notation.  Of course, users can be trained to work in RPN, and with experience it no longer appears unfamiliar or awkward. But let&#8217;s take pity on the poor users and let them work with standard mathematical notation.  What options are there? </p>
<p>The standard computer-science curriculum at this point would start talking about context-free grammars, abstract syntax trees, and syntax-directed parsers.  There are two main approaches to building parsers that are used in practice, i.e. for parsing programming languages: top-down (also known as recursive descent or LL) and bottom-up (aka shift-reduce, LR).  Unfortunately, neither of these is a good fit for embedding a simple arithmetic language in an application.  Top-down parsing isn&#8217;t a good fit for arithmetic in general, since each level of operator precedence requires its own nonterminal symbol in the grammar (each of which corresponds to a function call in the parser), and left-associative operators can&#8217;t be expressed directly in the grammar, because they require left recursion, which breaks the LL constraint; this necessitates some sort of extragrammatical fixup.  Bottom-up parsing works by using a large state machine whose transition rules are usually impractical to work out by hand, requiring a parser generator tool such as <a href="http://www.gnu.org/s/bison/">Bison</a> to compute.  Again, that&#8217;s a lot of infrastructure to throw at what is not such a complicated problem. </p>
<p>Fortunately, there is another way: the <a href="http://en.wikipedia.org/wiki/Shunting-yard_algorithm">shunting-yard algorithm</a>.  It is due to Edsger Dijkstra, and so named because it supposedly resembles the way <a href="http://www.youtube.com/watch?v=vop7UTPhdoc">trains are assembled and disassembled</a> in a railyard.  This algorithm processes infix notation efficiently, supports precedence and associativity well, and can be easily hand-coded. </p>
<h2>How It Works</h2>
<p>As in RPN, we scan the formula from left to right, processing each operand and operator in order. However, we now have two stacks: one for operands and another for operators.  Then, we proceed as follows: </p>
<ul>
<li>If we see an operand, push it on the operand stack.</li>
<li>If we see an operator:
<ul style="padding-bottom: 0">
<li>While there&#8217;s an operator on top of the operator stack of precedence <em>higher than or equal to</em> that of the operator we&#8217;re currently processing, pop it off and apply it. (That is, pop the required operand(s) off the stack, apply the operator to them, and push the result back on the operand stack.)</li>
<li>Then, push the current operator on the operator stack.</li>
</ul>
</li>
<li>When we get to the end of the formula, apply any operators remaining on the stack, from the top down.  Then the result is the only item left on the operand stack (assuming well-formed input).</li>
</ul>
<p>Note that &#8220;applying&#8221; an operator can mean a couple of different things in this context.  You could actually execute the operators, in which case the operands would be numerical values of all the terms and subexpressions; you could also build a syntax tree, in which case the operands would be subtrees.  The algorithm works the same way in either case. </p>
<p>That&#8217;s basically all there is to it, aside from some bells and whistles!  As you can see, it has a lot in common with the RPN algorithm, and is just a little more complicated. </p>
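<p>The steps above can be sketched in C++ roughly as follows.  This is only an illustration under some assumptions of mine: tokens are separated by whitespace, the only operators are the left-associative binary <tt>+ - * /</tt>, operators are executed directly rather than building a syntax tree, and the names <tt>Precedence</tt>, <tt>ApplyTop</tt>, and <tt>EvalInfix</tt> are made up for the example. </p>

```cpp
#include <cassert>
#include <sstream>
#include <stack>
#include <string>

// Precedence of a binary operator; higher binds tighter.
// Returns 0 for anything that is not one of + - * /.
int Precedence(char op)
{
    switch (op)
    {
    case '+': case '-': return 1;
    case '*': case '/': return 2;
    default:            return 0;
    }
}

// Pop one operator and its two operands, apply it, and push the result.
void ApplyTop(std::stack<double> & operands, std::stack<char> & operators)
{
    assert(!operators.empty() && operands.size() >= 2);
    char op = operators.top(); operators.pop();
    double rhs = operands.top(); operands.pop();
    double lhs = operands.top(); operands.pop();
    switch (op)
    {
    case '+': operands.push(lhs + rhs); break;
    case '-': operands.push(lhs - rhs); break;
    case '*': operands.push(lhs * rhs); break;
    case '/': operands.push(lhs / rhs); break;
    }
}

// Evaluate a whitespace-separated infix formula such as "2 + 3 * 4".
double EvalInfix(const std::string & formula)
{
    std::stack<double> operands;
    std::stack<char> operators;
    std::istringstream tokens(formula);
    std::string tok;
    while (tokens >> tok)
    {
        if (tok.size() == 1 && Precedence(tok[0]) > 0)
        {
            // Apply stacked operators of higher or equal precedence first
            while (!operators.empty() &&
                   Precedence(operators.top()) >= Precedence(tok[0]))
            {
                ApplyTop(operands, operators);
            }
            operators.push(tok[0]);
        }
        else
        {
            operands.push(std::stod(tok));
        }
    }
    // End of formula: apply whatever operators remain, from the top down
    while (!operators.empty())
    {
        ApplyTop(operands, operators);
    }
    return operands.top();
}
```

<p>With this sketch, <tt>EvalInfix("2 + 3 * 4")</tt> gives 14 and <tt>EvalInfix("10 - 4 - 3")</tt> gives 3, since the equal-precedence subtraction on the stack is applied before the second one is pushed. </p>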
<h2>Advanced Usage</h2>
<p>I described the algorithm above in its simplest form, but there are several enhancements that can be made to handle more complicated formulas. </p>
<p><strong>Associativity.</strong> Above, I said that when processing an operator, any operators of <em>equal</em> precedence at the top of the stack should be popped and applied.  This makes those operators left-associative, since the leftmost of the two operators will be applied first.  You can implement right-associativity by leaving equal-precedence operators on the stack. </p>
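<p>One way to express this rule in code is as a small predicate that decides whether the operator on top of the stack should be popped before the incoming one is pushed.  The <tt>OpInfo</tt> struct and the function name here are hypothetical, chosen just for this sketch. </p>

```cpp
// Hypothetical description of an operator for this sketch
struct OpInfo
{
    int  precedence;    // higher binds tighter
    bool rightAssoc;    // true for operators like exponentiation
};

// Should the operator on top of the stack be popped and applied
// before the incoming operator is pushed?
bool ShouldPopBefore(const OpInfo & stacked, const OpInfo & incoming)
{
    if (stacked.precedence > incoming.precedence)
        return true;
    if (stacked.precedence < incoming.precedence)
        return false;
    // Equal precedence: pop for left-associative operators,
    // leave it on the stack for right-associative ones
    return !incoming.rightAssoc;
}
```

<p>With this predicate, <tt>a - b - c</tt> groups as <tt>(a - b) - c</tt>, while a right-associative <tt>a ^ b ^ c</tt> groups as <tt>a ^ (b ^ c)</tt>. </p>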
<p><strong>Parentheses.</strong> Parens are a bit of a special case.  When you see a left paren, push it on the operator stack; no other operators can pop a paren (so it&#8217;s as if it has the <em>lowest</em> precedence).  Then when you see a right paren, pop-and-apply any operators on the stack until you get back to a left paren, which is popped and discarded. </p>
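<p>Here is a sketch of how the paren handling might look in a small evaluator.  As before, the whitespace-separated token format and the names <tt>Prec</tt> and <tt>EvalInfixWithParens</tt> are assumptions for illustration, and the code assumes balanced parens.  Note how giving the left paren a precedence of 0 means the ordinary precedence test never pops it. </p>

```cpp
#include <cassert>
#include <sstream>
#include <stack>
#include <string>

// Precedence table; a left paren gets 0, lower than any real operator,
// so the ordinary precedence test never pops it.
int Prec(char op)
{
    switch (op)
    {
    case '+': case '-': return 1;
    case '*': case '/': return 2;
    default:            return 0;
    }
}

// Evaluate a whitespace-separated formula such as "( 1 + 2 ) * 3".
double EvalInfixWithParens(const std::string & formula)
{
    std::stack<double> operands;
    std::stack<char> operators;

    // Pop one operator and its two operands, apply it, push the result
    auto applyTop = [&]()
    {
        assert(!operators.empty() && operands.size() >= 2);
        char op = operators.top(); operators.pop();
        double rhs = operands.top(); operands.pop();
        double lhs = operands.top(); operands.pop();
        operands.push(op == '+' ? lhs + rhs :
                      op == '-' ? lhs - rhs :
                      op == '*' ? lhs * rhs : lhs / rhs);
    };

    std::istringstream tokens(formula);
    std::string tok;
    while (tokens >> tok)
    {
        if (tok == "(")
        {
            operators.push('(');
        }
        else if (tok == ")")
        {
            // Pop-and-apply back to the matching left paren, then discard it
            while (operators.top() != '(')
                applyTop();
            operators.pop();
        }
        else if (tok.size() == 1 && Prec(tok[0]) > 0)
        {
            while (!operators.empty() && Prec(operators.top()) >= Prec(tok[0]))
                applyTop();
            operators.push(tok[0]);
        }
        else
        {
            operands.push(std::stod(tok));
        }
    }
    while (!operators.empty())
        applyTop();
    return operands.top();
}
```

<p>For example, <tt>EvalInfixWithParens("( 1 + 2 ) * 3")</tt> returns 9, whereas without the parens the result would be 7. </p>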
<p><strong>Unary operators.</strong> These generally work just like any binary operators except that they only pop one operand when they&#8217;re applied.  There is one extra rule that needs to be followed, though: when processing a unary operator, it&#8217;s only allowed to pop-and-apply other unary operators&#8212;never any binary ones, regardless of precedence.  This rule is to ensure that formulas like <tt>a ^ -b</tt> are handled correctly, where <tt>^</tt> (exponentiation) has a higher precedence than <tt>-</tt> (negation).  (In <tt>a ^ -b</tt> there&#8217;s only one correct parse, but in <tt>-a^b</tt> you want to apply the <tt>^</tt> first.) </p>
<p>Both prefix and postfix unary operators can be used.  The way to tell whether you&#8217;re in a position to allow prefix or postfix operators is to look at the previous token; if it&#8217;s an operand, you&#8217;re looking for binary and postfix unary operators, and if the previous token is an operator (or there&#8217;s no previous token) you&#8217;re looking for prefix unary operators.  Note that a left paren counts as an operator and a right paren as an operand for this purpose.  This rule also allows you to tell whether <tt>-</tt> is a negation (unary) or a subtraction (binary)&#8212;it&#8217;s a negation if it appears when looking for a prefix unary operator, and a subtraction otherwise. </p>
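<p>The previous-token rule is easy to see in code.  The sketch below labels each <tt>-</tt> in a formula as unary or binary; it assumes single-character tokens where letters and digits are operands, it does not handle postfix operators, and the function name is invented for the example. </p>

```cpp
#include <cctype>
#include <string>

// For each '-' in the formula, report 'u' if it is a negation (prefix unary)
// or 'b' if it is a subtraction (binary), in order of appearance.
// Tokens are assumed to be single characters: letters and digits are
// operands, and anything else other than a right paren is treated as
// an operator.
std::string ClassifyMinusSigns(const std::string & formula)
{
    std::string result;
    bool prevWasOperand = false;    // no previous token: expect a prefix operator
    for (char c : formula)
    {
        if (std::isspace(static_cast<unsigned char>(c)))
            continue;
        if (std::isalnum(static_cast<unsigned char>(c)) || c == ')')
        {
            // A right paren counts as an operand for this purpose
            prevWasOperand = true;
        }
        else
        {
            if (c == '-')
                result += prevWasOperand ? 'b' : 'u';
            // Operators and left parens both put us back in prefix position
            prevWasOperand = false;
        }
    }
    return result;
}
```

<p>So in <tt>a ^ -b</tt> the minus sign follows an operator and is a negation, while in <tt>-(a - b)</tt> the first is a negation and the second a subtraction. </p>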
<p><strong>Function calls.</strong>  The prefix/postfix rule also allows you to tell when an open paren designates a function call rather than grouping a subexpression (a grouping paren is like a prefix operator while a function-call one is like a postfix operator).  When a function-call paren is encountered, the operand at the top of the stack is the function to be called.  Push the paren on the operator stack as before, but also set up a list somewhere to hold the function arguments, and maintain a mapping that lets you find that list again from the paren on the stack.  (Note that with nested function calls, you could have multiple left parens on the stack.) </p>
<p>Then, when a comma is encountered, pop-and-apply operators back to a left paren; the operand on the top of the stack is then the next argument, and should be popped and added to the argument list. When the right paren is encountered, do the same, then pop and discard the left paren.  (Note that the arguments shouldn&#8217;t be left on the operand stack, at least not without some sort of sentinel between them; leaving them there would allow an ill-formed call like <tt>f(a, b, +)</tt> to be parsed as <tt>f(a + b)</tt>.) </p>
<p>Array subscripts using square brackets can be handled in the same way as function calls. </p>
<p>After all this, the algorithm has grown a bit, and is a little trickier to get right in all the corner cases&#8212;but it&#8217;s still pretty simple, and <em>certainly</em> less work than a full-blown syntax-directed parser!  In my opinion, it&#8217;s a shame that the shunting-yard algorithm isn&#8217;t more widely discussed as part of standard texts and computer-science university courses.  Even the <a href="http://en.wikipedia.org/wiki/Compilers:_Principles,_Techniques,_and_Tools">Dragon Book</a> doesn&#8217;t so much as mention it!  I never heard of this algorithm until I happened to see a forum post that referenced it, but its simplicity, elegance, and efficiency make it a superior solution for many use-cases of processing arithmetic expressions. </p>
]]></content:encoded>
			<wfw:commentRss>http://www.reedbeta.com/blog/2011/12/11/the-shunting-yard-algorithm/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>GPU Profiling 101</title>
		<link>http://www.reedbeta.com/blog/2011/10/12/gpu-profiling-101/</link>
		<comments>http://www.reedbeta.com/blog/2011/10/12/gpu-profiling-101/#comments</comments>
		<pubDate>Wed, 12 Oct 2011 07:09:03 +0000</pubDate>
		<dc:creator>Nathan Reed</dc:creator>
				<category><![CDATA[GPU Programming]]></category>

		<guid isPermaLink="false">http://www.reedbeta.com/blog/?p=3</guid>
		<description><![CDATA[In the screenshots in my previous post, you might have noticed this readout in one corner: As you can see, even though Idyll is at a very, very early stage (it has no textures, only ambient and directional lighting), it still has a fairly complete performance measurement system. I chose to implement this early on [...]]]></description>
			<content:encoded><![CDATA[<p>In the screenshots in my <a href="/blog/2011/10/09/a-first-look-at-idyll/">previous post</a>, you might have noticed this readout in one corner: </p>
<p><img src="/blog/wp-content/uploads/2011/10/perf001.png" title="Perf Data" width="223" height="153" class="aligncenter size-full wp-image-48" /> </p>
<p>As you can see, even though Idyll is at a very, very early stage (it has no textures, only ambient and directional lighting), it still has a fairly complete performance measurement system.  I chose to implement this early on in development because it&#8217;s my belief that although it can be too early to optimize, <strong>it&#8217;s never too early to profile.</strong> </p>
<p><span id="more-3"></span> Even at the very beginning of development, I want to know that the performance numbers I&#8217;m seeing are <em>reasonable</em>.  I don&#8217;t need to worry about the details&#8212;I&#8217;m not going to worry that 0.47 ms is too long to spend drawing a 2700-triangle, untextured city&#8212;but I do want to know the numbers are at about the right order of magnitude.  To put it another way, if I were spending 4.7 ms drawing a 2700-triangle, untextured city, then I&#8217;d be wondering what was going on!  More than likely, it would be because I was doing something dumb that was forcing the driver or GPU into a slow mode of execution.  This kind of bug can be hard to spot because the rendered frame is still visually correct, so you can&#8217;t <em>see</em> it (unless it causes you to drop from 60 to 30 Hz or something like that).  But if you&#8217;re measuring your performance and you at least sanity-check your numbers from time to time, you&#8217;ll notice something&#8217;s wrong. </p>
<p>The other reason to implement GPU profiling early is that when it comes time to do your performance optimizations for real, you&#8217;ll have more trust in your profiling system.  You&#8217;ll have ironed out most of the bugs, and you&#8217;ll have a feel for how much noise there is in the data and what sorts of artifacts there may be. </p>
<p>So now that you know <em>why</em> you should do GPU profiling ;), <em>how</em> do you actually do it?  In this article, I&#8217;ll show you how to set it up in Direct3D 11.  Similar commands should be available in any modern graphics API, although the details may vary. </p>
<p>First, let&#8217;s talk about how <em>not</em> to do GPU profiling.  <strong>Do not measure performance in terms of framerate!</strong>  This is something many beginners do, but it&#8217;s misleading at best, since framerate isn&#8217;t a linear measure of performance.  People will make statements like &#8220;turning on XYZ dropped my framerate by 10 fps&#8221;, but this is meaningless, since a drop from 100 to 90 fps is a very different thing than a drop from 30 to 20 fps.  You could clarify by reporting the fps you started from, but why bother?  Just report performance results using units of real time. Milliseconds are fine, although you can go one step further and express everything in fractions of your frame budget.  For instance, I&#8217;m currently targeting 60 Hz, so I have 16.67 ms to render a frame.  Instead of saying that my objects rendered in 0.47 ms, I could make Idyll report that they rendered in 2.8% of a frame. </p>
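<p>To make the arithmetic concrete, here is a small C++ sketch of both points: how unevenly a &#8220;10 fps drop&#8221; translates into real time, and how to express a timing as a percentage of the frame budget.  The function names are hypothetical and not part of Idyll. </p>

```cpp
// Extra milliseconds per frame implied by a drop from fpsBefore to fpsAfter.
float FrameTimeIncreaseMs(float fpsBefore, float fpsAfter)
{
    return 1000.0f / fpsAfter - 1000.0f / fpsBefore;
}

// Express a time in milliseconds as a percentage of the frame budget
// implied by a target refresh rate (60 Hz gives about 16.67 ms).
float PercentOfFrame(float milliseconds, float targetHz)
{
    float frameBudgetMs = 1000.0f / targetHz;
    return milliseconds / frameBudgetMs * 100.0f;
}
```

<p>A drop from 100 to 90 fps costs about 1.1 ms per frame, while a drop from 30 to 20 fps costs about 16.7 ms; and <tt>PercentOfFrame(0.47f, 60.0f)</tt> comes out to roughly 2.8. </p>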
<p>Another caveat that should be obvious: you can&#8217;t measure GPU performance by timing on the CPU. For instance, calling <tt>QueryPerformanceCounter</tt> before and after a draw call won&#8217;t measure how long the GPU took to draw your objects, only how long the graphics driver took to queue up that call in the various data structures it has under the hood.  That might be useful information to have in general, but it&#8217;s not GPU profiling. </p>
<h2>Placing Queries</h2>
<p>The tools we&#8217;ll use to get profiling data out of the GPU are <a href="http://msdn.microsoft.com/en-us/library/ff476578.aspx"><tt>ID3D11Query</tt></a> objects. As described in the docs, these can be created with <a href="http://msdn.microsoft.com/en-us/library/ff476515.aspx"><tt>ID3D11Device::CreateQuery</tt></a>, and come in various flavors.  There are two we&#8217;ll want: <tt>D3D11_QUERY_TIMESTAMP</tt> and <tt>D3D11_QUERY_TIMESTAMP_DISJOINT</tt>.  You&#8217;ll need one timestamp query for each stage of your renderer that you want to profile separately (in my case, one each for shadow clear, shadow objects, main clear, and main objects), plus two more to timestamp the beginning and ending of the whole frame.  You&#8217;ll also need one &#8220;timestamp disjoint&#8221; query; I&#8217;ll return to that one later. </p>
<p>Executing a timestamp query causes the GPU to read an internal clock and store the current value of that clock somewhere for later retrieval.  This is done by calling <a href="http://msdn.microsoft.com/en-us/library/ff476422.aspx"><tt>ID3D11DeviceContext::End</tt></a> and passing the query object you created.  (There&#8217;s no call to <tt>Begin</tt> for timestamp queries, because they don&#8217;t operate like a stopwatch&#8212;they don&#8217;t measure the elapsed time between two points; they just take a clock reading at a single point.)  You&#8217;ll want to place one of these calls at the beginning of the frame, and one after each chunk of rendering work you want to profile.  You&#8217;ll also need one at the very end, after your call to <tt>IDXGISwapChain::Present</tt>. Each of these is a separate <tt>ID3D11Query</tt> object, because all of these timestamps need to be collected and stored separately as the GPU chews its way through the frame. </p>
<p>The &#8220;timestamp disjoint&#8221; query should enclose the whole frame.  This one does require paired calls to <tt>ID3D11DeviceContext::Begin</tt> (just before your first timestamp) and <tt>ID3D11DeviceContext::End</tt> (just after your last one).  What does the disjoint query do? Its most important role for our purposes is to tell us the <em>clock frequency</em>.  All the timestamps we&#8217;re going to measure are expressed as counts of ticks of some internal GPU clock&#8230;but it&#8217;s the disjoint query that tells us how to convert those ticks into real time.  It also has a secondary purpose: if something went wrong with the clock, such as the counter overflowing and wrapping back to zero or the clock frequency changing in the middle of the frame (as can happen on laptops due to power-saving throttling), it&#8217;ll also tell us that.  However, in those cases we just have to throw our data out, as we don&#8217;t have enough information to recover accurate timings.  In practice, this rarely happens, and even when it does happen it only leads to one frame of missed data, so it&#8217;s not a big deal. </p>
<p>To recap, here&#8217;s what a render routine might look like after all these queries have been added:</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="cpp" style="font-family:monospace;"><span style="color: #0000ff;">void</span> Render <span style="color: #008000;">&#40;</span>ID3D11DeviceContext <span style="color: #000040;">*</span> pContext<span style="color: #008000;">&#41;</span>
<span style="color: #008000;">&#123;</span>
    <span style="color: #666666;">// Begin disjoint query, and timestamp the beginning of the frame</span>
    pContext<span style="color: #000040;">-</span><span style="color: #000080;">&gt;</span>Begin<span style="color: #008000;">&#40;</span>pQueryDisjoint<span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
    pContext<span style="color: #000040;">-</span><span style="color: #000080;">&gt;</span>End<span style="color: #008000;">&#40;</span>pQueryBeginFrame<span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
&nbsp;
    <span style="color: #666666;">// Draw shadow map</span>
    ClearShadowMap<span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
    pContext<span style="color: #000040;">-</span><span style="color: #000080;">&gt;</span>End<span style="color: #008000;">&#40;</span>pQueryShadowClear<span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
    DrawStuffInShadowMap<span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
    pContext<span style="color: #000040;">-</span><span style="color: #000080;">&gt;</span>End<span style="color: #008000;">&#40;</span>pQueryShadowObjects<span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
&nbsp;
    <span style="color: #666666;">// Draw main view</span>
    ClearMainView<span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
    pContext<span style="color: #000040;">-</span><span style="color: #000080;">&gt;</span>End<span style="color: #008000;">&#40;</span>pQueryMainClear<span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
    DrawStuffInMainView<span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
    pContext<span style="color: #000040;">-</span><span style="color: #000080;">&gt;</span>End<span style="color: #008000;">&#40;</span>pQueryMainObjects<span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
&nbsp;
    <span style="color: #666666;">// Display frame on-screen and finish up queries</span>
    pSwapChain<span style="color: #000040;">-</span><span style="color: #000080;">&gt;</span>Present<span style="color: #008000;">&#40;</span><span style="color: #0000dd;">1</span>, <span style="color: #0000dd;">0</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
    pContext<span style="color: #000040;">-</span><span style="color: #000080;">&gt;</span>End<span style="color: #008000;">&#40;</span>pQueryEndFrame<span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
    pContext<span style="color: #000040;">-</span><span style="color: #000080;">&gt;</span>End<span style="color: #008000;">&#40;</span>pQueryDisjoint<span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
<span style="color: #008000;">&#125;</span></pre></td></tr></table></div>

<h2>Retrieving The Data</h2>
<p>So far we&#8217;ve set up a number of query objects that tell the GPU when and how to collect the timing data we want.  Once the GPU actually renders the frame and collects the data, we need to retrieve it so we can analyze and display it. </p>
<p>The idea is simple enough.  We just use the <a href="http://msdn.microsoft.com/en-us/library/ff476428.aspx"><tt>ID3D11DeviceContext::GetData</tt></a> method to see if a particular query has finished yet, and return its result if it has.  As the docs say, <tt>GetData</tt> returns <tt>S_FALSE</tt> if the query is still in flight on the GPU, and <tt>S_OK</tt> when it&#8217;s done.  So, we just need to write a loop that waits (an idle wait, please, <strong>not a busy wait!</strong>) until <tt>GetData</tt> returns something other than <tt>S_FALSE</tt>.  Then our query data will be waiting for us! </p>
<p>The timestamp queries all send back a <tt>UINT64</tt> containing&#8212;you guessed it&#8212;the timestamp value, measured in some sort of clock tick.  The disjoint query returns a <a href="http://msdn.microsoft.com/en-us/library/ff476194.aspx"><tt>D3D11_QUERY_DATA_TIMESTAMP_DISJOINT</tt></a> struct, which contains the clock frequency and, as mentioned before, the rarely-encountered &#8220;disjoint&#8221; flag that means you have to throw out the last frame&#8217;s data because something weird happened with the clock. </p>
<p>So all we have to do now is subtract adjacent timestamp values to get the deltas, convert those into milliseconds and off we go!  The code for this might look like:</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="cpp" style="font-family:monospace;"><span style="color: #0000ff;">void</span> CollectTimestamps <span style="color: #008000;">&#40;</span>ID3D11DeviceContext <span style="color: #000040;">*</span> pContext<span style="color: #008000;">&#41;</span>
<span style="color: #008000;">&#123;</span>
    <span style="color: #666666;">// Wait for data to be available</span>
    <span style="color: #0000ff;">while</span> <span style="color: #008000;">&#40;</span>pContext<span style="color: #000040;">-</span><span style="color: #000080;">&gt;</span>GetData<span style="color: #008000;">&#40;</span>pQueryDisjoint, <span style="color: #0000ff;">NULL</span>, <span style="color: #0000dd;">0</span>, <span style="color: #0000dd;">0</span><span style="color: #008000;">&#41;</span> <span style="color: #000080;">==</span> S_FALSE<span style="color: #008000;">&#41;</span>
    <span style="color: #008000;">&#123;</span>
        Sleep<span style="color: #008000;">&#40;</span><span style="color: #0000dd;">1</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>       <span style="color: #666666;">// Wait a bit, but give other threads a chance to run</span>
    <span style="color: #008000;">&#125;</span>
&nbsp;
    <span style="color: #666666;">// Check whether timestamps were disjoint during the last frame</span>
    D3D11_QUERY_DATA_TIMESTAMP_DISJOINT tsDisjoint<span style="color: #008080;">;</span>
    pContext<span style="color: #000040;">-</span><span style="color: #000080;">&gt;</span>GetData<span style="color: #008000;">&#40;</span>pQueryDisjoint, <span style="color: #000040;">&amp;</span>tsDisjoint, <span style="color: #0000dd;">sizeof</span><span style="color: #008000;">&#40;</span>tsDisjoint<span style="color: #008000;">&#41;</span>, <span style="color: #0000dd;">0</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
    <span style="color: #0000ff;">if</span> <span style="color: #008000;">&#40;</span>tsDisjoint.<span style="color: #007788;">Disjoint</span><span style="color: #008000;">&#41;</span>
    <span style="color: #008000;">&#123;</span>
        <span style="color: #0000ff;">return</span><span style="color: #008080;">;</span>
    <span style="color: #008000;">&#125;</span>
&nbsp;
    <span style="color: #666666;">// Get all the timestamps</span>
    UINT64 tsBeginFrame, tsShadowClear<span style="color: #008080;">;</span> <span style="color: #666666;">// ... etc.</span>
    pContext<span style="color: #000040;">-</span><span style="color: #000080;">&gt;</span>GetData<span style="color: #008000;">&#40;</span>pQueryBeginFrame, <span style="color: #000040;">&amp;</span>tsBeginFrame, <span style="color: #0000dd;">sizeof</span><span style="color: #008000;">&#40;</span>UINT64<span style="color: #008000;">&#41;</span>, <span style="color: #0000dd;">0</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
    pContext<span style="color: #000040;">-</span><span style="color: #000080;">&gt;</span>GetData<span style="color: #008000;">&#40;</span>pQueryShadowClear, <span style="color: #000040;">&amp;</span>tsShadowClear, <span style="color: #0000dd;">sizeof</span><span style="color: #008000;">&#40;</span>UINT64<span style="color: #008000;">&#41;</span>, <span style="color: #0000dd;">0</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
    <span style="color: #666666;">// ... etc.</span>
&nbsp;
    <span style="color: #666666;">// Convert to real time</span>
    <span style="color: #0000ff;">float</span> msShadowClear <span style="color: #000080;">=</span> <span style="color: #0000ff;">float</span><span style="color: #008000;">&#40;</span>tsShadowClear <span style="color: #000040;">-</span> tsBeginFrame<span style="color: #008000;">&#41;</span> <span style="color: #000040;">/</span>
                            <span style="color: #0000ff;">float</span><span style="color: #008000;">&#40;</span>tsDisjoint.<span style="color: #007788;">Frequency</span><span style="color: #008000;">&#41;</span> <span style="color: #000040;">*</span> <span style="color:#800080;">1000.0f</span><span style="color: #008080;">;</span>
    <span style="color: #666666;">// ... etc.</span>
<span style="color: #008000;">&#125;</span></pre></td></tr></table></div>

<p>Now you have the timings in milliseconds and can do with them as you please!  Display them on screen, average them over time, log them to a file, or whatever else you can think of. </p>
<h2>Double-Buffered Queries</h2>
<p>Back toward the beginning of this article, I said that you would need <em>one</em> timestamp query for each rendering stage and <em>one</em> disjoint query.  Actually, I lied.  I wanted to keep things simple while explaining the basics, but the truth is that to do a proper GPU profiler, you&#8217;ll need to <em>double-buffer</em> your queries, meaning that you need two copies of everything. </p>
<p>This is because the CPU and GPU run simultaneously, not sequentially&#8212;or at least they should, for efficiency!  While the GPU is rendering frame n, the CPU is forging ahead and working on frame n + 1. When the GPU finishes frame n, it puts it up on screen and starts on frame n + 1; then the CPU gets the go-ahead to start on frame n + 2.  But this means that if we fire off a bunch of timestamp queries in frame n on the CPU, then wait for them to complete before we start the next frame, we&#8217;ll be serializing the system.  The CPU will have to wait for the GPU to finish the frame it just dispatched, and then the GPU will idle waiting for the CPU to finish the next frame.  This is a Bad Thing. </p>
<p>Double-buffering allows us to get around this by keeping separate queries for alternate frames. On frame n we&#8217;ll fire off one set of queries, then we&#8217;ll wait for the <em>other</em> set of queries to complete&#8212;these were the queries for frame n &#8211; 1.  Then we&#8217;ll re-use the queries we just collected and fire off a new set for frame n + 1, then finally wait for the queries from frame n.  This way we always give the GPU a whole frame to finish its work, and avoid dropping into the slow, serialized mode of execution. </p>
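<p>The bookkeeping for this can be quite small.  The sketch below only tracks which of the two query sets to issue into and which to collect from on each frame; the struct and method names are hypothetical, and the actual Direct3D calls would go wherever these indices are used. </p>

```cpp
// Tracks which of two query sets to issue into and which to collect from.
// Hypothetical helper for illustration; it holds no Direct3D objects itself.
struct QueryBuffers
{
    int frameIndex = 0;

    // Set to issue this frame's queries into
    int IssueSet() const { return frameIndex % 2; }

    // Set holding the previous frame's queries, to be collected this frame;
    // returns -1 on the very first frame, when nothing has been issued yet
    int CollectSet() const { return (frameIndex == 0) ? -1 : (frameIndex + 1) % 2; }

    // Call once at the end of each frame to swap the roles of the two sets
    void EndFrame() { ++frameIndex; }
};
```

<p>On frame n the renderer would issue its timestamps into <tt>IssueSet()</tt>, then wait on and read back <tt>CollectSet()</tt>, which holds the queries from frame n &#8211; 1, before calling <tt>EndFrame()</tt>. </p>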
<p>(By the way, this is a good example of what I mentioned at the top of this article about sanity-checking your performance numbers!  You might have noticed that I&#8217;m measuring the total frame time on the CPU as well as on the GPU&#8212;on the CPU, it&#8217;s just <tt>QueryPerformanceCounter</tt> as normal.  When the CPU and GPU are running simultaneously, as they should be, the CPU and GPU frame time will be equal (plus or minus some measurement noise).  But if you didn&#8217;t double-buffer your queries, or did something else to make them drop into the slow, serialized execution mode, you&#8217;d see the CPU frame time shoot up to substantially larger than the GPU frame time, since it&#8217;s doing its own work and then waiting for the GPU to finish all its work.) </p>
<p>At this point, you should have all the information you&#8217;ll need to go and write a GPU profiler of your own!  But in case some of the details aren&#8217;t clear, I&#8217;ve posted <a href="/downloads/gpuprofiler.zip">my GPU profiling code</a> from Idyll, which you&#8217;re free to include in your own project; it&#8217;s under the BSD license.  In that code, I&#8217;ve defined an enum called <tt>GTS</tt> (GPU Time-Stamp) to label all the timestamped points in the frame; you should modify it to contain the set of blocks you care about.  It handles all the double-buffering for you, and will also average all the deltas over a half-second, for smoother display.  Now, go forth and find out how long the GPU is taking to render your scenes! </p>
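<p>As for the half-second averaging, here is one way it could be done, sketched in Python.  This is not the code from the download: <tt>AveragedTimer</tt> and its fields are hypothetical names, and the approach (accumulate deltas until the window has elapsed, then publish the mean and start over) is just one reasonable assumption about how to smooth the display. </p>

```python
class AveragedTimer:
    """Averages per-frame timings over a fixed window (hypothetical helper)."""

    def __init__(self, window_s=0.5):
        self.window_s = window_s
        self.elapsed_s = 0.0   # wall-clock time accumulated in this window
        self.total_ms = 0.0    # sum of the deltas seen in this window
        self.count = 0         # number of deltas seen in this window
        self.average_ms = 0.0  # last published average, used for display

    def add(self, delta_ms, frame_dt_s):
        """Record one frame's delta and return the value to display."""
        self.total_ms += delta_ms
        self.count += 1
        self.elapsed_s += frame_dt_s
        if self.elapsed_s >= self.window_s:
            # Window is full: publish the mean and start a fresh window.
            self.average_ms = self.total_ms / self.count
            self.elapsed_s = 0.0
            self.total_ms = 0.0
            self.count = 0
        return self.average_ms


timer = AveragedTimer(window_s=0.5)
first = timer.add(2.0, 0.25)   # window not yet full, nothing published
second = timer.add(4.0, 0.25)  # window full, mean of 2.0 and 4.0 published
print(first, second)
```

<p>The displayed value only changes once per window, which keeps the on-screen numbers readable instead of flickering every frame. </p>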
]]></content:encoded>
			<wfw:commentRss>http://www.reedbeta.com/blog/2011/10/12/gpu-profiling-101/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>A First Look at Idyll</title>
		<link>http://www.reedbeta.com/blog/2011/10/09/a-first-look-at-idyll/</link>
		<comments>http://www.reedbeta.com/blog/2011/10/09/a-first-look-at-idyll/#comments</comments>
		<pubDate>Sun, 09 Oct 2011 22:06:51 +0000</pubDate>
		<dc:creator>Nathan Reed</dc:creator>
				<category><![CDATA[Eye Candy]]></category>

		<guid isPermaLink="false">http://www.reedbeta.com/blog/?p=29</guid>
		<description><![CDATA[Hello, and welcome to my new coding blog! I&#8217;m going to use this space to talk about lighting, shaders, GPU programming, and other graphics- and gamedev-related topics as they come up. In my spare time I&#8217;m working on a project that for now I&#8217;m calling &#8220;Idyll&#8221;. At the moment this is just a graphics engine, [...]]]></description>
			<content:encoded><![CDATA[<p>Hello, and welcome to my new coding blog!  I&#8217;m going to use this space to talk about lighting, shaders, GPU programming, and other graphics- and gamedev-related topics as they come up.  In my spare time I&#8217;m working on a project that for now I&#8217;m calling &#8220;Idyll&#8221;.  At the moment this is just a graphics engine, and it&#8217;s just in its infancy.  Eventually I hope to expand it into a more full-featured game engine&#8212;and maybe even make an actual game with it&#8212;but for my purposes, the graphics are paramount. </p>
<p><span id="more-29"></span>My primary goals with this project are:</p>
<ol>
<li>To teach myself the Direct3D 11 API.  Since my background is in OpenGL 3.0 and PS3 (which has an API similar to D3D9), I want to update my knowledge.  I actually started this project with D3D 10, but I intend to convert it to D3D 11 soon.</li>
<li>To have a place to experiment with new graphics techniques a little more freely than I might in the context of my day job.  The workplace is great for some sorts of things, but playing around with ideas just to see where they go is not one of them. :)</li>
</ol>
<p>That&#8217;s the general idea for where I&#8217;m going with this.  Just to give you something concrete, here are a couple of screenshots of Idyll in its current state.  I wasn&#8217;t kidding when I said it&#8217;s just in its infancy!  It doesn&#8217;t have textures or any materials other than Lambert, and the scene is just something I threw together in an afternoon. </p>
<p><a href="/blog/wp-content/uploads/2011/10/idyll001.png"> <img src="/blog/wp-content/uploads/2011/10/idyll001-300x187.png" title="Idyll City" width="300" height="187" class="floater size-medium wp-image-30" /> </a> <a href="/blog/wp-content/uploads/2011/10/idyll002.png"> <img src="/blog/wp-content/uploads/2011/10/idyll002-300x187.png" title="Idyll City Arch" width="300" height="187" class="floater size-medium wp-image-32" /> </a></p>
<p class="clearer"></p>
<p>I apologize for the shadow quality; it&#8217;s just a single 1024&sup2; shadow map stretched across the whole scene.  The lighting is a directional light, plus an ambient with a vertical ramp from blue to black.  It <em>is</em> lit in linear color space, though! </p>
]]></content:encoded>
			<wfw:commentRss>http://www.reedbeta.com/blog/2011/10/09/a-first-look-at-idyll/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

