
Sorry, But Size Does Count!

Haim Barad launches into an in-depth discussion of data optimization. Data optimization can liberate shackled artists from subjugating their talents to the limitations of commercial gaming platforms; here Barad presents some useful optimization guidelines and shows that size really does count after all.

Haim Barad, Blogger

November 2, 1999


First of all, I'd like to welcome all of you to the Optimizations Corner. This will be a regular monthly column at Gamasutra featuring articles on optimization methodologies, techniques, tool usage, and so on. This first article will explore the area of data optimizations. However, I'd like to start with a few introductory ideas about optimizations in general…

Firstly, I must confess that my area of expertise does not lie in game development, per se. That is, I have never written a game. (In fact, I'm not even that good at playing them.) However, I (and other members of my team) have been heavily involved in application optimizations, especially for 3D games. I also have another thing going for me in that I work for a company that knows a bit about the CPU and the PC platform. In addition, I am extremely fortunate to be surrounded by a lot of talented people who will be contributing to this series from time to time. Because of this, I hope to bring a valuable viewpoint to the game development community.

Enough said. I won't bore you with any more about the article series in general. But I do want to add a few comments about optimizations and their role in the gaming industry.

Don't Oppress Your Artists

Actually, this section could have been called, "Why should you still be interested in optimizations?" It used to be that a highly tuned program was required just so it could run at a decent level of performance. With PCs getting faster all the time, why are optimizations still so important? Even if you tune your code, will the overall speedup obtained be worth the effort?

Maybe it's time to look at optimizations in a different light.

Think of it this way. Optimizations are opportunities to free your artists a little more. Artists are creative people, but artists at game companies are creative people with chains on their paintbrushes. Basically, company management (possibly together with the publisher) specifies the target platforms for the product. The techno-wizards say that they can achieve a certain level of performance on those platforms, and then everyone looks at the artists and tells them, "Make it look great, but with only so many polygons, textures, and so on." That is, the artistic budget for the game is set by working backwards from performance, starting with the marketing decision to target particular platforms. Thus, artists are creatively oppressed.

Optimizations can be thought of as a way to loosen those chains on your artists. And you might be surprised to learn that many optimizations can be done with very little effort. In fact, an often overlooked place for optimizations is in the data itself.

And it is with this note that I finally head into the main subject of this first paper:

Data Optimizations: It's Not Just Code

It turns out that optimizing applications is not just about the code. It's very often about the way you organize, store, and present the data to the application. In this section, I'll explore how primitive size affects performance. More importantly, I'll show some possibilities for optimization that will let you increase your effective content at far less performance cost than you might think.

"Don't worry... it's not the size that counts."

Well, I won't try to verify or deny the above statement with regards to human relationships, but it's definitely a false statement when it comes to 3D models and the way in which they're organized. When measuring performance for a 3D engine, the size of each primitive processed drastically affects the performance of the engine.

Most 3D engines process vertex data as primitives. Let's use D3D as an example (although the principle I'm discussing applies to basically any engine), in which models are represented as collections of triangles in either discrete, strip, or fan form. This is done in either an indexed or ordered fashion, with the triangles batched up in collections that share the same geometry state. Although this paper focuses on indexed triangle lists, the principles still hold for other primitive types.

Below is an example of an indexed primitive. Notice that there are really two structures: the indices (which specify the connectivity of vertices) and the vertices themselves (the vertex pool). The indices are a collection of pointers to vertex structures that include position, normal, texture coordinates, and so on.

Notice that the three vertices of the first triangle (shown in the figure below) can appear at random places in the vertex pool. Also keep in mind that the vertex pool is usually a much larger structure (often a cache line per vertex), whereas each index is only a pointer.

Regardless of whether the structure below represents a discrete triangle list, triangle strip, or fan, the size of the primitive is related to the overall size of the vertex pool. It is these vertices that must be processed, and when I speak of the size of the primitive, I'm really talking about the size of the vertex pool.

An Example of an Index Primitive

This primitive represents a collection of vertices in the vertex pool and a collection of indices all using the same geometry state. The geometry state specifies things such as transformation matrices, lights, material properties and so on. From now on, I'll just refer to this as state. Very often, a primitive corresponds to an object (or a piece of one object) in a scene, but it doesn't have to. What's important is that all the content in the primitive has the same state. In fact, multiple primitives of the same state can be batched together to form a much larger primitive. This multi-primitive batching of smaller primitives is one performance optimization step for the data.
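
To make the two structures concrete, here is a minimal C++ sketch of an indexed triangle list. The field layout and names are illustrative assumptions rather than the actual D3D vertex format; the point is simply that the vertex pool is the heavyweight structure while the index list is small.

    #include <cstdint>
    #include <vector>

    // A vertex with position, normal, and texture coordinates -- 8 floats,
    // i.e. 32 bytes, which is one full cache line on a Pentium III.
    struct Vertex {
        float px, py, pz;   // position
        float nx, ny, nz;   // normal
        float u, v;         // texture coordinates
    };

    // The indexed primitive: a large vertex pool plus a small index list.
    // Three consecutive indices describe one triangle.
    struct IndexedPrimitive {
        std::vector<Vertex>        vertexPool;  // the heavyweight structure
        std::vector<std::uint16_t> indices;     // lightweight references into the pool
    };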

Since I'm so concerned about size in this article (no jokes please... unless you're obsessed with triangle strip envy...), let's take a look at throughput rates as they relate to primitive size. Looking below at Figure 2, you see a plot of throughput performance versus primitive size (for given transform and lighting modes of DX7, measured on a Pentium III). You can see that the graph is not flat at all; in fact, it shows that performance depends largely on primitive size.


A Wasted Opportunity

What do I mean by a wasted opportunity? Let's take the transform-only case as an example. All my primitives of size 32 could have been of size 64 with very little performance degradation. Why? Because the throughput of the engine is about 54% faster with primitives of size 64 than with size 32. In other words, it would cost me only a 30% increase in transformation time to effectively double the content. Consider that transformation often accounts for only 20-30% of the entire application, and you can see that an almost negligible decrease in performance (maybe 6-10% in frame rate) awaits you for doubling the content by increasing the size of the primitives.
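
To make the arithmetic explicit, here is a tiny sketch of that back-of-the-envelope estimate. The 1.54x throughput gain is the measurement quoted above; the 25% transform share is an assumed value picked from the 20-30% range mentioned, not a new measurement.

    #include <cstdio>

    int main() {
        // Assumptions taken from the discussion above:
        const double contentScale   = 2.0;   // twice as many vertices per primitive
        const double throughputGain = 1.54;  // size-64 primitives transform ~54% faster than size-32
        const double transformShare = 0.25;  // transform is ~20-30% of total frame time

        // Transform time scales with (vertices processed) / (throughput).
        const double transformScale = contentScale / throughputGain;                  // ~1.30
        const double frameTimeScale = 1.0 + transformShare * (transformScale - 1.0);  // ~1.075
        const double frameRateDrop  = 1.0 - 1.0 / frameTimeScale;                     // ~7%

        std::printf("transform time +%.0f%%, frame time +%.1f%%, frame rate -%.1f%%\n",
                    (transformScale - 1.0) * 100.0,
                    (frameTimeScale - 1.0) * 100.0,
                    frameRateDrop * 100.0);
    }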

In one case study of a real game coming out this winter holiday season (sorry, I'm not at liberty to say which one), we performed an experiment that effectively doubled the content (that's 2X the number of vertices per primitive for the entire scene... we could have been more selective and only doubled the interesting content). We noticed a drop in frame rate of about 1.2 frames per second (only 5% in this case). That's not much of a price to pay for much greater content.

Their artists could have had twice the triangle budget... and we haven't even addressed other types of optimization yet!

Why is an Engine Sensitive to Primitive Size?

Any engine requires a certain amount of overhead to process a primitive. Many of the auxiliary data structures (e.g. transform matrices, light structures, etc.) have to be fetched from memory, thus accounting for some of this overhead. Also, new generation games are using SIMD capabilities of newer CPUs (e.g. Streaming SIMD Extensions). These engines restructure the game's data into a SIMD friendly format that promotes efficient fetching and processing of the data.

Specifically regarding floating-point SIMD extensions, a certain amount of overhead is associated with SIMD-style processing (such as matrix expansion for 4-wide matrix-vector calculations done in Structure of Arrays style). As you can see, this overhead can overwhelm the actual vertex processing if the primitive size is too small. In fact, the Direct3D Processor Specific Graphics Pipeline for the Pentium III processor (starting with DX6.1) uses a path with scalar Streaming SIMD Extensions instructions whenever the primitive is too small. The relative overhead in primitive processing becomes smaller as the size of the primitive increases, so performance improves up to a point of diminishing returns.

You can also see from the figure that vertex lighting makes the engine less sensitive to primitive size. The decrease in sensitivity is because the cost of the vertex lighting itself tends to reduce the relative cost of the overhead talked about previously. Pixel based lighting (via texturing) doesn't affect this sensitivity.

I spoke earlier about engines restructuring their data in a SIMD-friendly format. Let's now take a closer look at data organization, including the effects of SIMD capabilities.

So... What About Data Organization?

There are other issues regarding your data that do affect performance. Consider the fact that most multimedia processing (of which 3D graphics is a part) can be characterized by streaming lots of data into the CPU, operating on it in the CPU and then streaming some results out of the CPU.

When you consider the above, you'll find that a lot of potential performance is wasted on inefficient data movement. The data must be organized so that it can be moved efficiently in and out of the CPU. Caches go a long way toward avoiding data reloading. However, prefetching into the proper cache level should be done in order to hide cache latencies: by loading the data before we really need it, it appears to the CPU that the data is already there, ready and waiting to be operated on.

Some guidelines for data organization are:

1. Proper data alignment. A cache miss causes a cache line of memory (32 bytes on the Pentium III) to be fetched into an assigned cache line. If your data is spread over multiple cache lines (and didn't have to be), then you'll be causing unnecessary cache misses. Proper alignment is also necessary when loading into registers, to avoid misalignment penalties and to make use of fast aligned loads.
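
As a sketch of what this can look like in practice, here is one way to keep each vertex on its own cache line in C++. The struct layout and names are my own illustration, assuming the 32-byte line size of the Pentium III (newer processors typically use 64 bytes, so only the constant would change).

    #include <cstddef>

    // One cache line is 32 bytes on the Pentium III.
    constexpr std::size_t kCacheLine = 32;

    // Align each vertex so it occupies exactly one cache line
    // instead of straddling two and costing a second miss.
    struct alignas(kCacheLine) AlignedVertex {
        float px, py, pz;   // position
        float nx, ny, nz;   // normal
        float u, v;         // texture coordinates -- 8 floats = 32 bytes, no padding wasted
    };

    // With alignas on the type, operator new uses the over-aligned
    // allocation path (C++17), so the pool starts on a line boundary too.
    AlignedVertex* AllocatePool(std::size_t count) {
        return new AlignedVertex[count];
    }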

2. Vertical parallelism data organization. The vertical-versus-horizontal parallelism model best describes the way in which SIMD extensions (e.g., MMX Technology, Streaming SIMD Extensions, 3DNow!) exploit parallelism within an algorithm. Either the computation itself can be done in parallel steps ("horizontal"), or the algorithm can be done serially but on multiple data elements at once ("vertical"). Because data dependencies in an algorithm limit the extent to which a computation can be done in parallel, it's usually more efficient to process the algorithm serially while operating on many pieces of data in parallel. Below is a figure of the vertical process. Note that the vector X is composed of four separate data points, each going through the same serial operation, but all four are operated on in parallel, producing four results at once. The data dependency chain of the calculation doesn't limit the parallelism as it would with horizontal parallelism.

The Vertical Process


Another way of referring to this type of data representation is SOA (Structure Of Arrays) for vertical parallelism and AOS (Array Of Structures) for horizontal parallelism. While this has often been written about with respect to geometry, it also applies to just about any type of operation (e.g. physical deformation, procedural texturing, and so on).

One other practical advantage of using vertical parallelism is the ease with which serial code can be turned into parallel code. If you look at simple scalar transform code, you'll see that it is very similar to the vertical SIMD code that operates on four vertices simultaneously; a sketch of both follows below. For more information about this, I'll refer you to an earlier Gamasutra paper on this topic from one of the members of my team.
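
Here is a minimal sketch of that similarity, written with Streaming SIMD Extensions intrinsics. It assumes a row-major 4x4 matrix and position-only vertices, and it is my own illustration rather than the code from the paper referenced above.

    #include <xmmintrin.h>  // Streaming SIMD Extensions intrinsics

    // Scalar version: transform one vertex (w assumed to be 1).
    void TransformOne(const float m[4][4], const float in[3], float out[3]) {
        for (int r = 0; r < 3; ++r)
            out[r] = m[r][0] * in[0] + m[r][1] * in[1] + m[r][2] * in[2] + m[r][3];
    }

    // SOA ("vertical") version: the same serial recipe, but each register holds
    // one component of four different vertices, so four results come out per pass.
    // The _mm_set1_ps broadcasts are the "matrix expansion" overhead mentioned earlier.
    struct Soa4 { __m128 x, y, z; };  // x, y, z of four vertices

    static inline __m128 Row(const float m[4][4], int r, const Soa4& in) {
        return _mm_add_ps(
            _mm_add_ps(_mm_mul_ps(_mm_set1_ps(m[r][0]), in.x),
                       _mm_mul_ps(_mm_set1_ps(m[r][1]), in.y)),
            _mm_add_ps(_mm_mul_ps(_mm_set1_ps(m[r][2]), in.z),
                       _mm_set1_ps(m[r][3])));
    }

    Soa4 TransformFour(const float m[4][4], const Soa4& in) {
        Soa4 out;
        out.x = Row(m, 0, in);
        out.y = Row(m, 1, in);
        out.z = Row(m, 2, in);
        return out;
    }

Note how the inner recipe is identical in both versions; only the data type changes, from a single float to a register holding four of them.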

3. Proper use of prefetches and streaming stores. This goes back to efficient data movement. All the parallel processing in the world won't do any good if you can't move the data in and out of the CPU fast enough. I'll refer you here to a previously published Gamasutra paper by a member of my team on the subject of efficient data movement through prefetches and streaming stores.
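
As an illustration of both techniques together, here is a minimal sketch using Streaming SIMD Extensions intrinsics. The function, the prefetch distance, and the trivial scaling loop are assumptions for the example, not code from the referenced paper.

    #include <xmmintrin.h>  // _mm_prefetch, _mm_stream_ps, _mm_sfence

    // Scale an array of floats, prefetching ahead and streaming results out.
    // Assumes src and dst are 16-byte aligned and count is a multiple of 4.
    void ScaleStream(const float* src, float* dst, int count, float s) {
        const __m128 scale = _mm_set1_ps(s);
        const int bytesAhead = 2 * 32;  // prefetch two cache lines ahead; tune per platform
        for (int i = 0; i < count; i += 4) {
            // Start fetching data we'll need shortly, hiding the memory latency.
            _mm_prefetch(reinterpret_cast<const char*>(src + i) + bytesAhead, _MM_HINT_T0);

            __m128 v = _mm_mul_ps(_mm_load_ps(src + i), scale);

            // Non-temporal store: results go straight to memory without
            // dirtying cache lines we're never going to read again.
            _mm_stream_ps(dst + i, v);
        }
        _mm_sfence();  // make the streaming stores visible before the data is reused
    }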

4. Avoid dirty writebacks. When you write out data to cacheable memory, the cache line to which you write becomes "dirty" and if you write all over the place in memory, then writebacks of these dirty lines to main memory are required. Try to keep your updates to memory in as compact a manner as possible to avoid unnecessary writebacks.

The above is not an exhaustive list. However, I do hope that it will provoke performance programmers into examining these and other issues. In future articles, we will drill down even further into these topics.

Conclusion

Obviously, I wouldn't tell you or your artists to just add content where it doesn't make sense. That certainly isn't my desire or intention. But I hope this article gave you some information about the effects of primitive size and data organization on 3D performance. It should provide you with some guidelines on how to better estimate the platform's performance, since your content budgeting process must take this information into account.

It's also my intention to set a tone for optimizations in this article series and provide some insight into data organization and its effect on performance. I hope I've been successful at this, and I look forward to your comments.

Haim has a Ph.D. in Electrical Engineering (1987) from the University of Southern California. His areas of concentration are in 3D graphics, video and image processing. Haim was on the Electrical Engineering faculty at Tulane University before joining Intel in 1995. Haim is a staff engineer and currently leads the Media Team at Intel's Israel Design Center (IDC) in Haifa, Israel.
