In this in-depth technical article, Neversoft co-founder Mick West discusses performance concerns when optimizing asset processing for games, including the basic nature of the common problems and in-depth solutions for keeping the pipeline efficient.
The fundamental building block of any game asset pipeline is the asset processing tool. An asset processing tool is a program or piece of code that takes data in one format and performs some operations on it, such as converting it into a target-specific format, or performing some calculation, such as lighting or compression. This article discusses the performance issues with these tools and gives some ideas for optimization with a focus on minimizing I/O.
Asset conversion tools are too often neglected during development. Since they are usually well specified and discrete pieces of code, they are often tasked to junior programmers. Generally, any programmer can easily create a tool that works to a simple specification, and at the start of a project the performance of the tool is not so important because the size of the data involved is generally small and the focus is simply on getting things up and running.
However, toward the end of the project, the production department often realizes that a large amount of time is being wasted waiting for these tools to complete their tasks. The accumulation of near-final game data and the more rapid iterations in the debugging and tweaking phase of the project make the speed of these tools of paramount importance.
Further, time may be wasted trying to optimize the tools at this late stage, and there’s a significant risk that bugs will be introduced into the asset pipeline (and the game) when making significant changes to processes and code during the testing phase.
Hence, it’s highly advisable to devote sufficient time to optimizing your asset pipeline early in development. It’s also advisable to use the people who are highly experienced in doing the types of optimizations needed. This early application of optimization is another example of what I call mature optimization (see “Mature Optimization,” Game Developer, January 2006).
There’s a limited number of man hours available in the development of a game. If you wait until the need for optimization becomes apparent, you will have already wasted hundreds of hours.
Asset processing tools come in three flavors: converters, calculators, and packers. Converters take data that are arranged in a particular set of data structures and rearrange them into another set of data structures, which are often machine- or engine-specific. A good example here is a texture converter, which might take textures in .PNG format and convert them to a form that can be directly loaded into the graphics memory of the target hardware.
Asset calculators take an asset or group of assets and perform some set of calculations on them such as calculating lighting and shadows or creating normal maps. Since these operations involve a lot of calculations and several passes over the data, they typically take a lot longer than the asset conversion tools. Sometimes they take large assets, such as high-resolution meshes, and produce smaller assets, such as displacement maps.
The third type of processing tool, the asset packer, takes the individual assets and packages them into data sets for use in particular instances in the game, generally without changing them much.
Using an asset packer might involve simply gathering all the files used by one level of the game and arranging them into a .WAD file. Or it might involve grouping files in such a way that streaming can be effectively performed when moving from one area of the game to another. Since the amount of data can be very large, the packing process might take a lot of time and be very resource intensive, requiring lots of memory and disk space, especially for final builds.
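As a minimal sketch of the simplest kind of packer described above, the following C++ gathers a set of in-memory assets into one contiguous block plus a directory of offsets. The layout and the names PackedEntry, Pack, and pack_assets are invented for illustration; this is not the .WAD format or any particular engine's packer.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical directory entry: where one asset lives inside the packed block.
struct PackedEntry {
    std::string name;
    std::size_t offset;
    std::size_t size;
};

// Hypothetical pack: a directory followed by all payloads laid end to end.
struct Pack {
    std::vector<PackedEntry> directory;
    std::vector<unsigned char> data;
};

// Concatenates the assets in the order given and records each one's offset.
Pack pack_assets(
    const std::vector<std::pair<std::string, std::vector<unsigned char>>>& assets) {
    Pack pack;
    std::size_t total = 0;
    for (const auto& asset : assets) total += asset.second.size();
    pack.data.reserve(total);  // one allocation for the whole payload
    for (const auto& asset : assets) {
        pack.directory.push_back({asset.first, pack.data.size(), asset.second.size()});
        pack.data.insert(pack.data.end(), asset.second.begin(), asset.second.end());
    }
    assert(pack.data.size() == total);
    return pack;
}
```

A packer built for streaming would probably choose the asset order and alignment with more care; this sketch only shows the gathering step.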
You may be surprised how often the simplest method of optimization is overlooked. Are you letting the content creators use the debug version of a tool? It’s a common mistake for junior programmers, but even the most experienced among us sometimes overlook this simple step.
So before you do anything, try turning the optimization settings on and off to make sure there’s a noticeable speed difference. Then, in release mode, try tweaking some settings, such as “optimize for speed” and “optimize for size.” Depending on the nature of the data (and the hardware your tools are running on), you might actually get faster code if you use “optimize for size.” The optimal optimization setting can vary from tool to tool.
Be careful when tweaking the optimization settings to test the speed of your code. In a multitasking operating system like Windows XP, a lot is going on, so your timings might vary dramatically from one run to the next. Taking the average is not always a useful measure either, as it can be greatly skewed by random events. A more accurate way is to compare the lowest times of multiple runs of two different settings, as that will be closest to the “pure” run.
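The "lowest time of multiple runs" idea can be sketched as a small C++ helper. The name best_time_seconds and its parameters are assumptions made for this example, not part of any existing tool.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Runs `work` `runs` times and returns the fastest wall-clock time in seconds.
// The minimum is reported rather than the mean because background activity
// can only make a run slower, never faster.
double best_time_seconds(const std::function<void()>& work, int runs) {
    assert(runs > 0);
    using clock = std::chrono::steady_clock;
    double best = 0.0;
    for (int i = 0; i < runs; ++i) {
        const auto start = clock::now();
        work();
        const std::chrono::duration<double> elapsed = clock::now() - start;
        if (i == 0 || elapsed.count() < best) best = elapsed.count();
    }
    return best;
}
```

To compare two optimization settings, build the tool once with each setting, time the same conversion with this helper in both builds, and compare the two minimums.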
Most PCs now have some kind of multicore processor and/or hyper-threading. If your tools are written in the traditional mindset of a single processing thread, you’re wasting a significant amount of the silicon you paid for, as well as the time of the artists and level designers as they wait for their assets to be converted.
Since the nature of asset data is generally to be large chunks of homogeneous data, such as lists of vertices and polygons, it’s generally very amenable to data-level parallelization with worker threads, where the same code is run on multiple chunks of similar data concurrently, taking advantage of the cache. For details on this approach see “Particle Tuning” (Game Developer, April 2006).
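A minimal C++ sketch of data-level parallelization with worker threads follows. It assumes the work on each byte is independent, and uses a zero-byte replacement as a stand-in for a real conversion. The function name replace_zeros_parallel is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Replaces every zero byte in [data, data + size) with 0xFF. The buffer is
// split into contiguous chunks and each chunk is handled by its own thread.
void replace_zeros_parallel(unsigned char* data, std::size_t size,
                            unsigned worker_count) {
    assert(worker_count > 0);
    const std::size_t chunk = (size + worker_count - 1) / worker_count;  // round up
    std::vector<std::thread> workers;
    for (unsigned w = 0; w < worker_count; ++w) {
        const std::size_t begin = std::min(size, w * chunk);
        const std::size_t end = std::min(size, begin + chunk);
        if (begin == end) break;  // no data left for further workers
        workers.emplace_back([data, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                if (data[i] == 0) data[i] = 0xFF;
        });
    }
    for (std::thread& t : workers) t.join();  // wait for every chunk to finish
}
```

A reasonable starting value for worker_count is std::thread::hardware_concurrency(), falling back to 1 if it returns 0. Contiguous chunks keep each thread working on its own region of memory, which suits the cache.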
Antivirus software should be configured so that it does not scan the directories that your assets reside in, nor the actual tools. Poorly written antivirus and other security tools can significantly degrade the speed of a machine that performs a lot of file operations. Try running a build both with and without the antivirus software and see if there is any difference in speed. Then consider removing the antivirus software entirely.
If you have any form of distributed “farm” of machines in the asset pipeline, beware of any screensaver other than “turn off monitor.” Some screensavers use a significant chunk of processing power. You need to be especially careful of this problem when repurposing a machine; the previous user may have installed her favorite screensaver, which doesn’t kick in for several hours, and then slows the machine to a crawl.
In-house tools don’t always need to be up to the same code standards as the code you use in your commercially released games. Sometimes you can get performance benefits by making certain dangerous assumptions about the data you’re processing and the hardware it will be running on.
Instead of constantly allocating buffers as needed, try allocating a “reasonable” chunk of memory as a general purpose buffer. If you have debugging code, make sure you can switch it off. Logging or other instrumenting functions can end up taking more time than the code they are logging. If earlier stages in the pipeline are robust enough, then (very carefully) consider removing error and bounds checking from later stages if you can see they are a significant factor.
If you have a bunch of separate programs, consider bunching them together into one uber-tool to cut the load times. All these are bad practices, but for their limited lifetime, the risks may be outweighed by the rewards.
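The general purpose buffer idea can be sketched as a simple bump allocator in C++: one allocation at startup, handed out piece by piece and reset between assets. The class name ScratchBuffer and its interface are assumptions for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical scratch buffer: allocated once, reused for every asset.
class ScratchBuffer {
public:
    explicit ScratchBuffer(std::size_t capacity) : storage_(capacity), used_(0) {
        assert(capacity > 0);
    }

    // Returns `bytes` of scratch space, or nullptr if the buffer is exhausted.
    // Each block starts at an offset that is a multiple of the strictest
    // fundamental alignment, measured from the start of the buffer.
    void* allocate(std::size_t bytes) {
        const std::size_t align = alignof(std::max_align_t);
        const std::size_t start = (used_ + align - 1) / align * align;
        if (start > storage_.size() || bytes > storage_.size() - start) return nullptr;
        used_ = start + bytes;
        return storage_.data() + start;
    }

    // Forgets all allocations; call between assets. No memory is freed.
    void reset() { used_ = 0; }

    std::size_t used() const { return used_; }

private:
    std::vector<unsigned char> storage_;
    std::size_t used_;
};
```

In keeping with the trade-off described above, this sketch does no per-block bookkeeping, so nothing stops a caller from using a pointer after reset().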
Older programmers tend to write conversion tools using the standard C I/O functions: fopen, fread, fwrite, fclose, etc. The standard method is to open an input file and an output file, then read in chunks of data from the input file (with fread or fgetc), and write them to the output file (with fwrite or fputc).
This approach has the advantage of being simple, easy to understand, and easy to implement. It also uses very little memory, so quite often tools are written like this. The problem is it’s insanely slow. It’s a holdover from the (really) bad old days of computing, when processing large amounts of data meant reading from one spool of tape and writing to another.
Younger programmers learn to use C++ I/O “streams,” which are intended to make it easy for data structures to be read and written into a binary format. But when used to read and write files, they still suffer from the same problems as our older C programmer’s approach. They are still stuck in the same serial model of “read a bit, write a bit” that’s not only excessively slow, but also mostly unnecessary on modern hardware.
Unless you’re doing things like encoding .MPEG data, you will generally be dealing with files that are smaller than a few tens of megabytes. Most developers will now have a machine with at least 1GB of memory. If you’ll be processing the whole file a piece at a time, then there’s no reason you should not load the entire file into memory.
Similarly, there’s no reason you should have to write your output file a few bytes at a time. Build the file in memory, and write it out all at once.
You might counter that that’s what the file cache is for. It’s true: The OS will buffer reads and writes in memory, and very few of those reads or writes will actually cause physical disk access. But the overhead associated with using the OS to buffer your data versus simply storing it in a raw block of memory is very significant.
Listing 1 shows a simple file conversion program that takes a file and writes out a version of it with all the zero bytes replaced with 0xFF. It’s simple for illustration purposes, but many file format converters do not do significantly more CPU work than this simple example.
LISTING 1 Old-fashioned file I/O
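/* Requires <stdio.h>. Error handling is omitted for brevity. */
FILE *f_in = fopen("IMAGE.JPG","rb");
FILE *f_out = fopen("IMAGE.BIN","wb");
/* Find the file size by seeking to the end. */
fseek(f_in,0,SEEK_END);
long size = ftell(f_in);
rewind(f_in);
/* Copy one byte at a time, replacing zero bytes with 0xFF. */
for (long b = 0;b<size;b++) {
    int c = fgetc(f_in); /* fgetc returns an int so that EOF can be detected */
    if (c == EOF) break;
    if (c == 0) c = 0xff;
    fputc(c,f_out);
}
fclose(f_in);
fclose(f_out);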
Listing 2 shows the same program converted to read the whole file into a buffer, process it, and write it out again. The code is slightly more complex, yet this version executes approximately ten times as fast as the version in Listing 1.
LISTING 2 Reading the Whole File into Memory
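/* Requires <stdio.h> and <stdlib.h>. Error handling is kept minimal. */
FILE *f_in = fopen("IMAGE.JPG","rb");
fseek(f_in,0,SEEK_END);
long size = ftell(f_in);
rewind(f_in);
/* Read the entire file with a single call. */
unsigned char *p_buffer = (unsigned char*) malloc(size);
size_t bytes_read = fread(p_buffer,1,size,f_in);
fclose(f_in);
/* Process the data in place. */
unsigned char *p = p_buffer;
for (size_t x=0;x<bytes_read;x++,p++)
    if (*p == 0) *p = 0xff;
/* Write the result with a single call. */
FILE *f_out = fopen("IMAGE.BIN","wb");
fwrite(p_buffer,1,bytes_read,f_out);
fclose(f_out);
free(p_buffer);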
The use of serial I/O is a throwback to the days of limited memory and tape drives. But a combination of factors means it’s still useful to think of your file conversion essentially as a serial process.
First, since file operations can proceed asynchronously, you can be processing data while it’s being read in and begin writing it out as soon as some is ready. Second, memory is slow, and processors are fast. This can lead us to think of normal random access memory as just a very fast hard disk, with your processor’s cache memory as your actual working memory.
While you could write some complex multi-threaded code to take advantage of the asynchronous nature of file I/O, you can get the full advantages of both this and optimal cache usage using Windows’ memory mapped file functions to read in your files.
The process of memory mapping a file is really very simple. All you are doing is telling the OS that you want a file to appear as if it is already in memory. You can then process the file exactly as if you just loaded it yourself, and the OS will take care of making sure that the file data actually shows up as needed.
This gives you the advantage of asynchronous I/O because you can immediately start processing once the first page of the file is loaded, and the OS will take care of reading the rest of the file as needed. It also makes best use of the memory cache, especially if you process the file in a serial manner. The act of memory mapping a file also ensures that the moving of data is kept to the minimum. No buffers need to be allocated.
Listing 3 shows the same program converted to use memory mapped I/O. Depending on the state of virtual memory and the file cache, this is several times faster than the “whole file” approach in Listing 2. It looks annoyingly complex, but you only have to write it once. The amount of speed-up will depend on the nature of the data, the hardware, and the size and architecture of your build pipeline.
LISTING 3 Using Memory Mapped Files
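// Requires <windows.h> and a build with UNICODE defined, since the file
// names are wide-string literals. Error handling is omitted for brevity.
HANDLE hInFile = ::CreateFile(L"IMAGE.JPG",
    GENERIC_READ,FILE_SHARE_READ,NULL,
    OPEN_EXISTING,FILE_ATTRIBUTE_READONLY,NULL);
DWORD dwFileSize = ::GetFileSize(hInFile, NULL);
// Map the whole input file read-only.
HANDLE hMappedInFile = ::CreateFileMapping(hInFile,
    NULL,PAGE_READONLY,0,0,NULL);
LPBYTE lpMapInAddress = (LPBYTE) ::MapViewOfFile(
    hMappedInFile,FILE_MAP_READ,0,0,0);
// Create the output file; the mapping grows it to dwFileSize bytes.
HANDLE hOutFile = ::CreateFile(L"IMAGE.BIN",
    GENERIC_WRITE | GENERIC_READ,0,NULL,
    CREATE_ALWAYS,FILE_ATTRIBUTE_NORMAL,NULL);
HANDLE hMappedOutFile = ::CreateFileMapping(hOutFile,
    NULL,PAGE_READWRITE,0,dwFileSize,NULL);
LPBYTE lpMapOutAddress = (LPBYTE) ::MapViewOfFile(
    hMappedOutFile, FILE_MAP_WRITE,0,0,0);
// Copy input to output, replacing zero bytes with 0xFF.
const unsigned char *p_in = lpMapInAddress;
unsigned char *p_out = lpMapOutAddress;
for (DWORD x=0;x<dwFileSize;x++,p_in++) {
    unsigned char c = *p_in;
    if (c == 0) c = 0xff;
    *p_out++ = c;
}
// Unmap the views before closing the mapping and file handles.
::UnmapViewOfFile(lpMapInAddress);
::UnmapViewOfFile(lpMapOutAddress);
::CloseHandle(hMappedInFile);
::CloseHandle(hMappedOutFile);
::CloseHandle(hInFile);
::CloseHandle(hOutFile);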
Llopis, Noel. “Optimizing the Content Pipeline,” Game Developer, April 2004.
Carter, Ben. “The Game Asset Pipeline: Managing Asset Processing,” Gamasutra, Feb. 21, 2005.
[EDITOR'S NOTE: This article was independently published by Gamasutra's editors, since it was deemed of value to the community. Its publishing has been made possible by Intel, as a platform and vendor-agnostic part of Intel's Visual Computing microsite.]