
Profiling, Data Analysis, Scalability, and Magic Numbers: Meeting the Minimum Requirements for Age of Empires II: The Age of Kings

This is the first of a two-part article that describes the tips, tricks, tools, and pitfalls that went into raising the performance profile of Age of Empires II: The Age of Kings. All of the techniques and tools used to measure and improve AoK are fully capable of improving the performance of other games.

Herb Marselas, Blogger

August 9, 2000


Age of Empires II: The Age of Kings (AoK), a tile-based, 2D isometric, real-time strategy game, was built on the code base used in the original Age of Empires (AoE) and extended in its Rise of Rome expansion pack. In these games, players guide one of many civilizations from the humble beginning of a few villagers to an empire of tens or hundreds of military and non-military units, while competing against other human or computer-controlled opponents in single or multiplayer modes.


Beginning the Diagnosis

In some ways, the AoK development team was fortunate because we had the benefit of an existing code base to work with. Many performance improvements went into AoE, including extensive optimization of its graphics drawing core, and this work gave us a good starting point for AoK.

Still, a significant amount of new functionality was added over the course of the sequel's two-year development cycle. This new functionality, as well as new requirements placed on existing functionality, meant that there was a large amount of new work to do in order to meet the minimum system requirements for shipping AoK. As such, a dedicated performance improvement phase began in April 1999 to ready AoK for its September 1999 release. The purpose of this phase was to identify and resolve the game's remaining outstanding performance issues, and to determine whether AoK would perform well on the intended minimum system configuration.

Our team had some ideas as to which parts of the code were taking a long time to execute, and we used Intel's VTune, NuMega's TrueTime, and our own profiling code to verify these hunches and see exactly where time was being spent during program execution. Often these performance results alone were enough to determine solutions, but sometimes it wasn't clear why the AoK code was underperforming, and in these cases we analyzed the data and data flow to determine the nature of the problem.

Once a performance problem is identified, several options are available to fix it. The most straightforward and widely recognized solution is direct code optimization: tuning the existing C code, translating it to hand-coded x86 Assembly, rearranging data layouts, or implementing an alternative algorithm.

Sometimes we found that an algorithm, though optimal for the situation, was executing too often. In one case, unit pathing had been highly optimized, but it was being called too often by other subsystems. In these cases, we fixed the problem by capping the number of times the code could be called by other systems or by capping the amount of time the code could spend executing. Alternately, we might change the algorithm so its processing could occur over multiple game updates instead of all at once.
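The capping idea described above can be sketched roughly as follows. This is not AoK's code: the `BudgetedQueue` class, its method names, and the use of `std::chrono` are all assumptions made for illustration. Callers queue expensive work (a path search, say) instead of running it immediately, and each game update services the queue only until a call cap or a time cap is reached, so leftover work spills into later updates.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical scheduler (not from AoK): defers expensive requests and
// services only a bounded amount of them per game update.
class BudgetedQueue {
public:
    using Clock = std::chrono::steady_clock;

    BudgetedQueue(std::size_t maxCallsPerUpdate,
                  std::chrono::microseconds maxTimePerUpdate)
        : maxCalls_(maxCallsPerUpdate), maxTime_(maxTimePerUpdate) {
        assert(maxCallsPerUpdate > 0);
    }

    // Other subsystems enqueue work instead of calling the costly code directly.
    void Request(std::function<void()> job) { pending_.push_back(std::move(job)); }

    // Called once per game update. Runs queued jobs until the call cap or the
    // time cap is hit; the rest waits for a later update.
    // Returns the number of jobs that ran.
    std::size_t Update() {
        const Clock::time_point start = Clock::now();
        std::size_t ran = 0;
        while (!pending_.empty() && ran < maxCalls_) {
            // Always run at least one job so the queue cannot stall forever.
            if (ran > 0 && Clock::now() - start >= maxTime_) break;
            std::function<void()> job = std::move(pending_.front());
            pending_.pop_front();
            job();
            ++ran;
        }
        return ran;
    }

    std::size_t Pending() const { return pending_.size(); }

private:
    std::size_t maxCalls_;
    std::chrono::microseconds maxTime_;
    std::deque<std::function<void()>> pending_;
};
```

With a cap of two calls per update, five queued requests would be serviced over three updates. The trade-off is latency: a unit may wait an update or two for its result, which is why the caps have to be tuned against how the game feels.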

We also found that some functionality, no matter how much we optimized it, still executed too slowly. For example, supporting eight players in a game required too much processor time on the minimum system, so we specified that the minimum system could support only four players. We presented scalability features such as this as facets of game play or as options that players could adjust to their liking. These scalable features ultimately allowed AoK to run well on its stated minimum system while rewarding users who had better computers.

And then there were AoK's approximately 30 single-player scenarios. We evaluated the performance of these scenarios slightly differently from other game functionality. Instead of trying to optimize offending code, we first examined the scenario for performance problems that had been inadvertently introduced by the scenario designers in their construction of the scenario and its elements. In many cases, performance improved significantly with slight changes to the scenario, for example, reducing the number of player units, shrinking the game map, or making sections of the maps inaccessible to players. Above all, we made sure that we did not change the designer's vision of the scenario as we optimized it.

Shopping for Old Hardware

One of the goals of AoK was to keep the system requirements as low as possible. This was necessary in order to reach the broadest audience possible and to stay on the same incremental processor performance ramp set by the original Age of Empires and its Rise of Rome expansion pack. Our overriding concern was to meet these minimum system requirements, yet still provide an enjoyable game experience.

The original Age of Empires was released in September 1997, and required a 90MHz Pentium processor with 16MB RAM and a 2D graphics card capable of handling 8-bit palettized color. The Rise of Rome expansion pack shipped a year later and raised the minimum system processor to a 120MHz Pentium. Based on this information, the AoK minimum processor was pegged as a 133MHz Pentium with 32MB of physical RAM (Figure 1). The additional RAM was required due mainly to the increased size and number of graphics and sound files used by AoK. There was also a greater amount of game data and an executable that grew from approximately 1.5MB for AoE to approximately 2.4MB for AoK.

To make sure AoK worked on the minimum system, we had to shop for old hardware. We purchased systems matching the minimum system specification from a local system reseller - we no longer used systems that slow. When the "new" computers arrived, we decided not to wipe the hard drives, nor did we reinstall software and hardware with the latest driver versions. We did this because we expected that most AoK users wouldn't optimize their computer's configuration or settings, either. Optimizing these systems would have undoubtedly improved our performance numbers, but it would not have translated into true performance gains on other minimally-configured computers. On the other hand, for normal in-house play-testing, we used computers that were significantly more powerful than the minimum system configuration, which made up for performance issues caused by unoptimized code and enabled logging functions during play-testing (Figure 1).

Figure 1. AoK minimum PC and play-test PC system configurations.

| Minimum System Spec Test PC | Ensemble Play-Test PC |
| --- | --- |
| 133 MHz Pentium processor* | 450 MHz Pentium II processor |
| S3 Virge 2MB graphics card | Nvidia TNT 8MB graphics card |
| 32 MB RAM | 128 MB RAM |
| Windows 98 | Windows 98 |

*later upgraded to 166 MHz

A precedent set by the original Age of Empires was the use of options and settings playable on the minimum system (Figure 2). A list of the specific options supported by the minimum system was needed due to the large number of them available in AoK (Figure 3). These were also the default options for the single-player and multiplayer games, and were used to guide the creation of approximately 30 single-player scenarios.

Figure 2. AoK minimum system game play specifications.

4 players; any combination of human and computer players

4-player map size

75 unit population cap

800X600 resolution

Low-detail terrain graphics quality

*added as part of scalability effort

Figure 3. Game play and feature scalability.

| Feature | Options |
| --- | --- |
| Number of Players | 2 to 8, in any combination of human or computer |
| Size of Map | 2 to 8 player sizes and "giant" size |
| Type of Map | All land (Arabia); mostly water (Islands); nine others in between (Coastal, Baltic, and so on) |
| Unit Population Cap | 25 to 200 units per player |
| Civilization Sets | Western European, Eastern European, Middle Eastern, Asian |
| Resolution | 800X600; 1024X768; 1280X1024 |
| Three Terrain Detail Modes | High detail -- multi-pass, anisotropic filtering, RGB color calculation; Medium detail -- multi-pass, fast, lower-quality filtering, RGB color calculation |

One of the first tasks of this dedicated performance phase was to determine the largest performance problems, the improvements that we could hope to make, and the likelihood that AoK would meet the minimum system specification in terms of processor and physical memory. This initial profiling process led us to increase the minimum required processor speed from 133 to 166MHz. We also felt that meeting the 32MB memory size could be difficult, but we were fairly certain that the memory footprint could be reduced enough to meet that goal.

Grist for Profiling

No matter how good or bad a program looks when viewed through the lens of profiling statistics, the only true test of satisfactory performance is how players feel about their game experience. To help correlate player responses with game performance in AoK, we used several on-screen counters that displayed the average and peak performance. Of these counters, the ones that calculated the average frame rate and lowest frame rate over the last several hundred frames were used most to determine performance problems. Additional statistics included average and peak game simulation time (in milliseconds) over the last several hundred game updates.
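A counter of this kind can be approximated with a small ring buffer of recent frame times. The sketch below is an illustration only: the `FrameStats` class and its interface are assumed names, not the AoK implementation, and the window size is left to the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical rolling counter (not from AoK): keeps the last `window` frame
// times and reports the average and the lowest frame rate in that window.
class FrameStats {
public:
    explicit FrameStats(std::size_t window) : samples_(window, 0.0) {
        assert(window > 0);
    }

    // Record how long the latest frame took, overwriting the oldest sample.
    void AddFrame(double milliseconds) {
        assert(milliseconds > 0.0);
        samples_[next_] = milliseconds;
        next_ = (next_ + 1) % samples_.size();
        if (count_ < samples_.size()) ++count_;
    }

    // Average frame rate: frames recorded divided by the total time they took.
    double AverageFps() const {
        if (count_ == 0) return 0.0;
        double total = 0.0;
        for (std::size_t i = 0; i < count_; ++i) total += samples_[i];
        return 1000.0 * static_cast<double>(count_) / total;
    }

    // Lowest frame rate: the rate implied by the single slowest frame.
    double LowestFps() const {
        if (count_ == 0) return 0.0;
        double worst = 0.0;
        for (std::size_t i = 0; i < count_; ++i) {
            if (samples_[i] > worst) worst = samples_[i];
        }
        return 1000.0 / worst;
    }

private:
    std::vector<double> samples_;  // frame times in milliseconds
    std::size_t next_ = 0;         // slot the next sample will overwrite
    std::size_t count_ = 0;        // number of valid samples so far
};
```

Showing the lowest rate next to the average matters because a single long hitch barely moves the average over several hundred frames, yet it is exactly what a player notices.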

Identifying symptoms of play-testing performance problems and making saved games of these problem situations was very useful. We replayed saved games in the profiler, and routines that took too long could be identified quickly. Unfortunately, some problems were difficult to track down, such as memory leaks and other programs running on the play-tester's computer.

We also created scenarios that stressed specific situations. For instance, we stressed the terrain engine's hill-drawing by using a special scenario consisting of a large game map covered mostly with hills. Other special scenarios were created that included many buildings, walls, or attempts to path units long distances between difficult obstacles. These scenarios were easy to build, and it was obvious the first time the scenario was run whether a given issue needed to be targeted for optimization.

The final set of data came in the form of recorded AoK games. AoK has a feature that allows human or computer player commands to be recorded to a file. This data can then be played back later as if the original player were issuing the commands. These recorded games helped diagnose pathfinding problems when it was unclear how a unit had arrived at a particular destination.

Since AoK was able to load scenarios, saved games, and recorded games from the command line, the game could be run automatically by a profiler. This simplified the profiling process by allowing the profiler to run AoK and have it jump directly into the problem. This command-line process bypassed the startup and pregame option screens. (Some profilers slowed the game down so much that manually loading a saved game from the profiler would have been impossible.) And since performance profiling and logging significantly slowed game play, analyzing recorded games was a much better solution from the tester's perspective. Multiplayer games could be recorded and then played back command-for-command under the profiler overnight to investigate performance issues.

Some performance issues from AoE needed to be resolved while we were working on AoK, the biggest of which was AoE's 2D graphics pipeline. The graphics for AoK are created through a combination of software rendering and hardware composition. This pipeline had been highly optimized for AoE by hand-coding most of the system in Assembly, so there was not much additional need to optimize it for AoK.

But there were new features to integrate into the 2D pipeline. For one thing, AoK had more detailed terrain. Also, units that were visually obscured behind buildings and other obstructions appeared as outlines so players could see where they were. Both of these systems were implemented as a mixture of C/C++ and hand-coded Assembly.

The biggest challenge in keeping the performance up for the graphics system was making sure that the sprites used for graphics in the game were properly tagged as belonging in system memory or video memory. If a sprite was in the wrong memory type, a significant performance hit or even an error could occur, but these graphics memory location problems were usually hard to identify. The typical symptom was a drawing problem on-screen, such as a shadow drawing on top of a unit instead of under it.

Sprites used by the software rendering engine needed to be in system memory so that they could be read and processed. If they resided in video memory instead, the limited throughput from video memory caused a significant performance hit to the game. Conversely, sprites bltted by the hardware that accidentally ended up in system memory would render slowly and could fail to render at all if the hardware bltter didn't support blts from system memory.

Pathfinding problems from AoE also had to be fixed. In AoE, there was a single unit-pathing system, which was known as "tile pathing" because it broke the game map down into individual tiles and classified them as "passable" or "nonpassable." This tile-pathing system was fairly good at moving units short distances, but it often took too long to find paths (if it could find one at all), so we created two additional pathing systems for AoK.

The first of these two systems, "MIP-map pathing," quickly approximated distant paths across the map. The basis for MIP-map pathing was the construction of compact bit vectors that described the passability of each tile on the map. This system allowed the game to determine quickly whether it was even possible for a unit to get from its current location to the general target area. The only way to determine whether the general area could be reached was through the resolution of the bit vectors.
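The article does not show the data layout, so the following is only a guess at the general idea of passability bit vectors at more than one resolution. `PassabilityBits`, `Downsample`, and the rule that a coarse cell counts as passable when any tile under it is passable are all assumptions made for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical bit-vector grid (not from AoK): one bit per tile, 1 = passable.
class PassabilityBits {
public:
    PassabilityBits(std::size_t width, std::size_t height)
        : width_(width), height_(height), bits_((width * height + 31) / 32, 0u) {}

    std::size_t Width() const { return width_; }
    std::size_t Height() const { return height_; }

    void Set(std::size_t x, std::size_t y, bool passable) {
        assert(x < width_ && y < height_);
        const std::size_t i = y * width_ + x;
        const std::uint32_t mask = std::uint32_t(1) << (i % 32);
        if (passable) bits_[i / 32] |= mask;
        else          bits_[i / 32] &= ~mask;
    }

    bool Get(std::size_t x, std::size_t y) const {
        assert(x < width_ && y < height_);
        const std::size_t i = y * width_ + x;
        return ((bits_[i / 32] >> (i % 32)) & 1u) != 0;
    }

private:
    std::size_t width_;
    std::size_t height_;
    std::vector<std::uint32_t> bits_;  // 32 tiles packed per word
};

// Builds the next-coarser level. Each coarse cell covers a 2x2 block of tiles
// and is marked passable if ANY tile in the block is passable. This is an
// assumed, optimistic rule: it can only answer "might be reachable".
PassabilityBits Downsample(const PassabilityBits& fine) {
    PassabilityBits coarse((fine.Width() + 1) / 2, (fine.Height() + 1) / 2);
    for (std::size_t y = 0; y < fine.Height(); ++y) {
        for (std::size_t x = 0; x < fine.Width(); ++x) {
            if (fine.Get(x, y)) coarse.Set(x / 2, y / 2, true);
        }
    }
    return coarse;
}
```

A search over the coarse level touches a quarter as many cells per step down in resolution, which is where a quick reachability estimate would get its speed. The price is precision, which is presumably why a more accurate system takes over near the target.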

Once a unit was within a short distance of its target area, another new pathing system called "low-level pathing" was used. Low-level pathing allowed very accurate pathing over short distances. When low-level pathing failed, the pathing system fell back and used the original tile pathing from AoE.

Changing the pathing system from a single, general-purpose system to three special-purpose systems improved the performance of AoK and also significantly improved game play since it virtually eliminated the problem of stuck and stopped units caused by pathing failures.

While we were able to improve the pathing system for AoK, enhancing the unit-class hierarchy system was a much more onerous task. The unit-class hierarchy system from AoE couldn't be changed easily since so many game systems and so much functionality relied on the old implementation. At its heart, the game's unit-class system is a hierarchy of derived classes and each derived class is more specialized than its parent. The special functions of each derived class are supported by virtual functions exposed by the classes in the hierarchy. A simplified version of the hierarchy is shown in Figure 4.

From a programming standpoint, calling a virtual function by itself costs little more than calling a regular class function: typically one extra indirection through a virtual function table.

If each class could implement only its own version of the virtual functions, then this hierarchy wouldn't cause any function overhead problems. However, since each level of the hierarchy implements its own special code, it must also call its parent's version of the same function to perform its work. In a hierarchy four classes deep, that means calling three additional functions. This may not sound like much, but it can add up when code is executed hundreds of thousands or millions of times.
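The chaining cost is easy to see in a toy hierarchy. The class names below are hypothetical stand-ins (Figure 4 is not reproduced here), and a global counter takes the place of real per-level work.

```cpp
#include <cassert>

// Counts how many Update() bodies are entered for one logical update.
int g_updateCalls = 0;

// Hypothetical four-level hierarchy; the names are assumptions, not AoK's.
struct StaticUnit {
    virtual ~StaticUnit() = default;
    virtual void Update() { ++g_updateCalls; }  // base-level work
};

struct AnimatedUnit : StaticUnit {
    void Update() override {
        StaticUnit::Update();  // the parent's work must run first
        ++g_updateCalls;       // then this level's specialized work
    }
};

struct MovingUnit : AnimatedUnit {
    void Update() override {
        AnimatedUnit::Update();
        ++g_updateCalls;
    }
};

struct CombatUnit : MovingUnit {
    void Update() override {
        MovingUnit::Update();
        ++g_updateCalls;
    }
};
```

One virtual call on a `CombatUnit` through a base reference enters four function bodies: the virtual dispatch itself plus three chained parent calls. Multiplied across every unit and every game update, that is the overhead the text describes.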

Some performance improvement could have been gained by circumventing the hierarchy using "special case" units. For example, walls are a type of building unit that do not attack other units and only need to scan their vicinity for enemy units every few game updates unless they are under attack. To handle this special case, we could specifically check whether the current unit being processed is a wall, and if so, skip the code that is only executed for other buildings. Unfortunately, coding in too many special cases can also lead to performance losses, because you end up checking to see whether a unit is one of your many special cases. In the end, we left the unit-class hierarchy in place and made specific changes to shortcut functionality that didn't apply to specific units.
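As a rough illustration of the wall shortcut, the sketch below skips the vicinity scan for idle walls on most updates. The `Building` struct, its fields, and the scan interval of eight updates are invented for the example; the article gives no actual figures.

```cpp
#include <cassert>

// Hypothetical building unit (not AoK's class); the interval is an assumption.
struct Building {
    static constexpr unsigned kWallScanInterval = 8;

    bool isWall = false;
    bool underAttack = false;
    int scansPerformed = 0;

    // Returns true if the enemy scan ran during this update.
    bool Update(unsigned updateNumber) {
        // Special case: a wall that is not under attack scans its vicinity
        // only once every kWallScanInterval game updates.
        if (isWall && !underAttack && (updateNumber % kWallScanInterval) != 0) {
            return false;
        }
        ScanForEnemies();
        return true;
    }

private:
    void ScanForEnemies() { ++scansPerformed; }  // stand-in for the real scan
};
```

The check itself is cheap here, but it runs for every building on every update. Each additional special case adds another such test, which is the performance trap the paragraph above warns about.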

Commercial Profiling Tools: The Good, the Bad, and the Ugly

Performance analysis extends beyond evaluating the execution speed of program functions and subsystems. It also includes measuring memory usage and evaluating the way the program interacts with other programs and the operating system. In order to determine where the performance problems were in AoK, four separate tools were used: Intel's VTune, NuMega's TrueTime, the Windows NT performance counters, and our own profiling and memory instrumentation code.

Although we used Microsoft Visual C++, we did not use the bundled Microsoft Profiler. There were two reasons for this: we found it difficult to get the Microsoft product to work correctly (or at all) and the data format from their profiler was either inadequate or needed post-processing in a spreadsheet to be minimally useful. Using VTune, TrueTime, and the NT performance counters we were able to collect, analyze, and present data in a reasonable fashion.

VTune is a sampling profiler, which means it has a component that wakes up every few milliseconds (or whatever amount of time you specify) and looks at what processes are executing on the CPU(s). When you decide enough time has elapsed, you can stop VTune and look at the statistics it produces for each process executed during that time. If you've compiled your program with debug information, VTune can display which lines of code were called and what percentage of the elapsed time was consumed by the executing code.

VTune is great because you don't need to compile a special version of your program, it doesn't slow your program down while it runs, and it lets you see the amount of time the CPU spent executing processes besides your own. The only major drawback is that you can end up with spurious data due to this sampling. This can be caused by other processes that are running in the system, or by running VTune for too long a period. To improve VTune's accuracy on your own program, it comes with an API to turn VTune on and off programmatically. This is a very useful feature, especially when drilling down into the performance of specific subsystems and smaller sections of code.

We found that VTune's call-graph functionality couldn't be used with a program that linked either explicitly or implicitly with DirectDraw. Also, some applications (including AoK) were too large in terms of code and debug information for VTune to resolve its data back to the correct line of code. It seems that some of these problems have been fixed in VTune 4.5, however.

Another commercial product that we used was NuMega's TrueTime, which is an instrumenting profiler. To use this product, you have to make a special TrueTime compilation of your program that inserts timing code into each module. This can sometimes be a slow build process, but it's worth it. As the TrueTime build of your program runs, TrueTime logs which functions are entered, when they are entered, and when they are exited. This process can be significantly slower than VTune's effectively real-time performance but it's a useful second opinion nonetheless. The only big drawback (and it can be very severe) is that TrueTime can slow down your program so much that it's impossible to use it for profiling network code. This problem can also skew profiling statistics for time-based game actions such as AI or updates that are scheduled to occur at a certain interval of time.

This performance hit from TrueTime also made it impractical to use it to analyze the performance of the graphics subsystem. When system performance relies on two independent processors (such as the main CPU and the graphics card), efficient cooperation between both processors is critical so that they run concurrently and perform operations in parallel. When TrueTime slowed the CPU (and consequently the AoK rendering load which the CPU governed), it made the graphics card appear to give better performance than it actually did.

There were four drawbacks to both programs. First, neither program can be run in batch mode, so the programmer has to baby-sit the programs while they run through each performance test case. Even though we worked on performance test cases one at a time, it would have been convenient to run each program in batch mode overnight to gather results from other test cases. VTune has since added a batch interface in version 4.5 but support is still lacking in TrueTime.

Second, performance numbers gathered during the execution of a program need to be taken with a grain of salt. Due to the multi-threaded nature of the Windows operating system, other programs (including the performance tool itself) are effectively running at the same time, and that can skew performance. Fortunately, multiple performance runs with the same tool or with different tools can help identify specific problem areas by correlating all of the results, and analyzing performance over smaller sections of code can improve accuracy and reduce the time required by some performance tools.

The third drawback to these profilers is that it's difficult to use both TrueTime and VTune together when using Visual C++ as your development environment. TrueTime cannot instrument code from Visual C++ with VTune installed because VTune renames certain underlying compile and link programs.

Finally, although both tools display call graphs, we found it difficult at times to ascribe performance timings to specific subsystems. For instance, pathing was sometimes called from both movement and retargeting code, but we were not able to determine which subsystem was using more of the pathing code. TrueTime was generally accurate about this, but in some cases, the numbers it reported just didn't seem to add up. In this type of case, we had to place our own timing code directly into AoK to get complete results.

Regardless of how good today's profiling tools are, they have no understanding of or insight into the underlying program they profile; profiling and analysis tools would be significantly more useful if they had some understanding of what the application was attempting to accomplish. With that kind of additional functionality, these tools could provide performance statistics that would greatly enhance the programmer's ability to improve the application performance. Until that day arrives, you'll have to add profiling and analysis code to your application for even the most mundane performance information beyond simple timings and call graphs.

Performance on the Minimum System

Since performance statistics can change based on the platform on which the application is running, it was especially critical to get computer systems that matched the minimum system specification. To demonstrate this performance differential and the scalability of AoK, two test cases were run on the minimum system configuration and one was run on a regular development workstation (Figure 5). To contrast the data as much as possible in this example, the first test case uses the maximum AoK settings for players (eight) and map size (giant). The second test case conforms to the game settings for the minimum system configuration: four players on a four-player-sized game map.

Figure 5. Performance Analysis.

| Test PC 1 | Test Case 1 | Test PC 2 | Test Case 2 |
| --- | --- | --- | --- |
| 166 MHz Pentium | 60 seconds of game play | Dual 450 MHz Pentium III | 60 seconds of game play |
| 32 MB RAM | Eight-player game | 128 MB RAM | Four-player game |
| S3 Virge GX | Giant map, largest map available | Nvidia TNT2 Ultra | Four-player map size |
| Windows 98 | One civ from each of the four civ art sets | Windows 2000 | All civs share same civ art set |

Using VTune, the percentage of CPU clock cycles spent in each process during an AoK game was calculated for a 60-second period at 30-minute intervals. This was done on the 166MHz Pentium minimum system (Figure 6), and on a dual 450MHz Pentium III development workstation (Figure 7).

As you can see, the four-player game performs well on the 166MHz Pentium. The AoK process starts at approximately 60 percent of the CPU and increases to about 75 percent after 30 minutes. The additional time devoted to the virtual memory manager (VMM) process at startup is caused by AoK data as it is paged in and accessed for the first time. In contrast, the amount of CPU time used by AoK in the eight-player game degrades over time. This is due to the additional memory requirements to support so many players and such a large game map. The CPU reaches the point where it's spending almost as much time managing the virtual memory system as actually executing the game itself.

Since the development workstation (Test PC 2) is a dual-processor system and AoK is single-threaded, the second CPU is idle as the kernel runs. This is why the NTOSKRNL is shown as approximately 50 percent of the CPU.

As both the four- and eight-player games progress, the AoK process continues to use more and more of the CPU. There is no downward pressure being applied from other processes as there was for the 166MHz Pentium for eight players.

If it had not already been established that four players was the number of players to support on the minimum system, these same statistics could have been collected for a varying number of players. Then we could have set the maximum number of players based on how many players could fit within the memory footprint of 32MB.

To complement and augment the results from the commercial profilers we were using, we developed an in-house profiling tool as well. This tool logged the execution time of high-level game systems and functions (telling us how much time was spent by each one) and told us the absolute amount of time a section of code took to execute - a sanity check for performance optimizations that we sorely needed. Our profiling system consisted of four simple functions that were easily inserted and removed for profiling purposes and relied on a simple preprocessor directive, _PROFILE, that compiled the profiling code in or out of the executable. This let us keep our profiling calls in the code, instead of forcing us to add and remove them to create nonprofiled builds. You can download an abbreviated example of the profiling code from the Game Developer web site (http://www.gdmag.com).
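The compile-in, compile-out arrangement might look something like the sketch below. It is not the downloadable AoK code: `PROFILE_BUILD`, `PROF_START`, `PROF_STOP`, and the group names are stand-ins chosen for this example (names beginning with an underscore and a capital letter, such as `_PROFILE`, are reserved in standard C++), and the profiling functions are reduced to a call counter.

```cpp
#include <cassert>

// Normally supplied on the compiler command line (for example -DPROFILE_BUILD)
// rather than defined in the source. Stand-in for the article's _PROFILE.
#define PROFILE_BUILD

// Hypothetical profile groups for the example.
enum ProfileGroup { PROF_GAME_UPDATE, PROF_PATHING, PROF_GROUP_COUNT };

// Stand-in for real timing data: counts how often each group was started.
int g_startCalls[PROF_GROUP_COUNT] = {0};

#ifdef PROFILE_BUILD
inline void ProfStart(ProfileGroup group) { ++g_startCalls[group]; }
inline void ProfStop(ProfileGroup) {}
#define PROF_START(group) ProfStart(group)
#define PROF_STOP(group)  ProfStop(group)
#else
// In a non-profiled build the calls vanish, so they can stay in the code.
#define PROF_START(group) ((void)0)
#define PROF_STOP(group)  ((void)0)
#endif

// Example call site: the profiling statements bracket the work being timed.
int UpdateGame() {
    PROF_START(PROF_GAME_UPDATE);
    int work = 0;
    for (int i = 0; i < 10; ++i) work += i;  // placeholder for real game work
    PROF_STOP(PROF_GAME_UPDATE);
    return work;
}
```

Because the macros expand to nothing in a normal build, the instrumented call sites cost nothing at ship time and never have to be stripped out by hand.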

While VTune told us how much of the CPU AoK was using (Figure 6), our custom profiler told us how much time was being spent on each of AoK's major subsystems (Figure 8). This additional information told us interesting things about the performance of AoK and where we might be able to improve performance. You can see in Figure 8 that the amount of time devoted to game world simulation and unit AI increases from approximately 33 percent to approximately 57 percent of the AoK process over the course of three samples at 30-minute intervals during an eight-player game.

Looking back at the process statistics from VTune in Figure 6, you see that the amount of time spent in the VMM increases while the time spent on AoK decreases. Since AoK spends more time in simulation/AI and the operating system spends more time manipulating virtual memory, we can propose some theories to explain this:

  • The simulation/AI code is allocating more memory over time without freeing memory, stressing the VMM. However, skipping ahead to Figure 5, we see this probably isn't the case since the memory footprint isn't skyrocketing.

  • The simulation/AI code is allocating and deallocating so much memory that as time goes on, the memory heap is becoming fragmented, and that's slowing memory allocation. The only way to confirm this theory is to instrument the code and determine where, when, and how often memory is allocated.

  • The data being processed by the simulation/AI is so large or being accessed so randomly that it constantly causes the VMM to flush data from memory and read in new data from the virtual memory swap file.

 

More data would be required to determine the cause of this problem. It would also be good to break the "simulation/AI" group down into more discrete components for timing.

Our timing code relies on the Assembly instruction Read Time-Stamp Counter (RDTSC), but it could also have used the Win32 QueryPerformanceCounter, or another fine-grained counter or timer. We chose RDTSC because it was simple to use, it works on all Pentium (and later) processors (except some very early Cyrix Pentium-class parts), and these profiling functions were based on extending existing code.


As I stated earlier, it was difficult to assign performance timings to specific subsystems based on the results of the commercial profilers that we used. To remedy this, we built functionality into our custom profiler to determine how much of each system's time was spent in, say, pathing.

Here's how our profiler works. During profiler initialization (_ProfInit), the static array of profiling information (groupList) is initialized to zero, and the CPU frequency is calculated. The size of the groupList array matches the number of profile group entries in the ProfileGroup enum list in the prof header file. The CPU frequency is calculated with a simple, albeit slow, function called GetFrequency. (Alternately, this could have used one of the specific CPU identification libraries available from Intel or AMD, but this code works transparently on Windows 95/98 and NT and across processors without problems.)

The final part of initialization seeds each groupList array entry with its parent group. Since the groupList array entries match the ProfileGroup enums in order, the ProfileGroup enum can be used as an index into the groupList array to set the parent group value. Using the SetMajorSection macro significantly simplifies this code by encapsulating the array offset and assignment. More importantly, it allows us to use the stringizing operator (#) to record the parent group's ProfileGroup declaration as a string (const char *) for use when formatting our output.
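A stripped-down version of this initialization could look like the following. Only `groupList`, `ProfileGroup`, `SetMajorSection`, and `mqwStart` are names taken from the article; the struct layout, the other field names, the group names, and `ProfInit` (standing in for `_ProfInit`) are assumptions, and the CPU frequency step is left out.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical profile groups; the real list lives in the profiler header.
enum ProfileGroup {
    PROF_GAME,
    PROF_SIMULATION,
    PROF_PATHING,
    PROF_RENDER,
    PROF_GROUP_COUNT
};

// Assumed layout for one profiling record.
struct ProfileEntry {
    unsigned long long mqwStart;  // timestamp of the pending start, 0 if idle
    unsigned long long elapsed;   // accumulated ticks
    unsigned long calls;          // completed start/stop pairs
    ProfileGroup parent;          // group this one is reported against
    const char* parentName;       // parent's enum name, for output
};

// One entry per enum value, so the enum doubles as the array index.
static ProfileEntry groupList[PROF_GROUP_COUNT];

// Records the parent group and, via the stringizing operator (#), the text
// of the parent's enum name as a string literal.
#define SetMajorSection(group, parentGroup)          \
    do {                                             \
        groupList[group].parent = parentGroup;       \
        groupList[group].parentName = #parentGroup;  \
    } while (0)

void ProfInit() {
    std::memset(groupList, 0, sizeof(groupList));
    SetMajorSection(PROF_GAME, PROF_GAME);  // the root is its own parent here
    SetMajorSection(PROF_SIMULATION, PROF_GAME);
    SetMajorSection(PROF_PATHING, PROF_SIMULATION);
    SetMajorSection(PROF_RENDER, PROF_GAME);
}
```

The stringizing operator saves maintaining a separate table of group names: the name printed in the report is generated from the same token used as the array index, so the two cannot drift apart.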

The second requirement for our custom profiler was that its profiling code had to be smart enough to make sure that the profiling start (_ProfStart) and stop (_ProfStop) statements were inserted around a function or group of functions in correct pairings. We wanted to avoid the situation where profiling starts multiple times on the same group, or where a _ProfStop appears before its corresponding _ProfStart. To ensure the correct pairing, _ProfStart first checks that it has not already been called for the group by examining the starting time value, mqwStart. The current time is then recorded into mqwStart using the GetTimeStamp function, a wrapper for RDTSC. The _ProfStop function first makes sure that profiling was started, and then records the current time, which is used to calculate and store the elapsed time. Finally, the number of calls made is incremented, and the starting time is reset to zero.
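The pairing checks can be sketched as below. This is an approximation under stated assumptions: `ProfStart` and `ProfStop` stand in for `_ProfStart` and `_ProfStop`, they return `false` on a pairing error (the article does not say how errors were reported), and `GetTimeStamp` wraps a portable `std::chrono` clock rather than RDTSC so the example runs anywhere.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

enum ProfileGroup { PROF_SIMULATION, PROF_PATHING, PROF_GROUP_COUNT };

struct ProfileEntry {
    std::uint64_t mqwStart = 0;  // 0 means "not currently being timed"
    std::uint64_t elapsed = 0;   // accumulated ticks
    std::uint32_t calls = 0;     // completed start/stop pairs
};

static ProfileEntry groupList[PROF_GROUP_COUNT];

// Portable stand-in for the article's RDTSC wrapper. The +1 keeps the value
// nonzero, because zero is reserved to mean "not started".
std::uint64_t GetTimeStamp() {
    const auto now = std::chrono::steady_clock::now().time_since_epoch();
    return static_cast<std::uint64_t>(
               std::chrono::duration_cast<std::chrono::nanoseconds>(now).count()) + 1;
}

// Returns false if the group is already being timed (a doubled start).
bool ProfStart(ProfileGroup group) {
    ProfileEntry& entry = groupList[group];
    if (entry.mqwStart != 0) return false;
    entry.mqwStart = GetTimeStamp();
    return true;
}

// Returns false if there was no matching start.
bool ProfStop(ProfileGroup group) {
    ProfileEntry& entry = groupList[group];
    if (entry.mqwStart == 0) return false;
    entry.elapsed += GetTimeStamp() - entry.mqwStart;
    ++entry.calls;
    entry.mqwStart = 0;  // ready for the next start
    return true;
}
```

Using the start timestamp as the "in progress" flag means the pairing check needs no extra state per group, at the cost of treating a timestamp of exactly zero as invalid.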

Note that GetTimeStamp uses the eax and edx registers to return the current 64-bit timing value as two 32-bit values, which are subsequently shifted and combined. In this case, there is no need to push and pop the scratch registers, since the compiler is smart enough to recognize the inline assembly use. However, if this timing code were encapsulated in a macro, the compiler might not recognize it, and it would be necessary to push and pop the registers.
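The shift-and-combine step is small enough to show on its own. The sketch below assumes the two halves have already been copied out of edx (high) and eax (low); CombineTimeStamp is a hypothetical helper name, not a function from the AoK source.

```cpp
#include <cassert>
#include <cstdint>

// Builds the 64-bit timestamp from the two 32-bit halves that RDTSC
// leaves in edx (high) and eax (low). The high half is widened to
// 64 bits before the shift; shifting a 32-bit value by 32 would be
// undefined behavior in C++.
constexpr std::uint64_t CombineTimeStamp(std::uint32_t high,
                                         std::uint32_t low) {
    return (static_cast<std::uint64_t>(high) << 32) | low;
}

// Compile-time check of the bit layout.
static_assert(CombineTimeStamp(1u, 0u) == 0x100000000ULL,
              "high half must land in bits 32..63");
```

For example, CombineTimeStamp(0x12345678, 0x9ABCDEF0) yields 0x123456789ABCDEF0.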

Another issue we confronted with our custom profiling system was the accuracy and resolution of timing available from a system that uses two function calls from the calling code (first to _ProfStart and then to GetTimeStamp). Since we use this timing code to profile larger subsystems and functions, there will be timing variations due to system factors, such as the execution of other processes by the operating system. If we time significantly smaller portions of code, down to a few lines, it's preferable to inline the RDTSC call or use it within a macro.

Using RDTSC as a high-resolution timer can present another problem, too. RDTSC is not a serializing instruction. In other words, the CPU can reschedule RDTSC just like any other instruction, so it may actually execute before, during, or after the block of code you're attempting to time. Issuing a serializing (fencing) instruction such as CPUID immediately before RDTSC solves this.

At the end of the program, the _ProfSave function saves the recorded profiling information out to a file. For each profile group, it lists the name of the group, the number of calls, the elapsed time spent in the group, the average time per call, its percentage of its parent group, and the parent group name. This output is formatted using the complicated proftrace macro, which once again uses the stringizing operator (#) to print out the character version of the profile group followed by its information.
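To make the report columns concrete, here is a simplified sketch of how one output line could be produced. It is not the actual proftrace macro: GroupStats, FormatGroupLine, PROFTRACE_SKETCH, and the line layout are all hypothetical, elapsed time is left in raw ticks rather than converted with the CPU frequency, and the stringizing operator is applied to variable names rather than to ProfileGroup enum values.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Assumed minimal per-group totals needed for one report line.
struct GroupStats {
    std::uint64_t elapsed;  // accumulated ticks
    std::uint32_t calls;    // completed start/stop pairs
};

// Formats name, calls, elapsed, average per call, percentage of parent,
// and parent name. Division by zero is guarded for idle groups.
std::string FormatGroupLine(const char* name, const GroupStats& g,
                            const char* parentName,
                            const GroupStats& parent) {
    const double avg = g.calls
        ? static_cast<double>(g.elapsed) / g.calls : 0.0;
    const double pct = parent.elapsed
        ? 100.0 * static_cast<double>(g.elapsed) / parent.elapsed : 0.0;
    char buf[256];
    std::snprintf(buf, sizeof(buf),
                  "%s calls=%u elapsed=%llu avg=%.1f pct=%.1f%% parent=%s",
                  name, static_cast<unsigned>(g.calls),
                  static_cast<unsigned long long>(g.elapsed),
                  avg, pct, parentName);
    return std::string(buf);
}

// #group and #parentGroup supply the printable names, so each report
// line needs only the two identifiers.
#define PROFTRACE_SKETCH(group, parentGroup) \
    FormatGroupLine(#group, (group), #parentGroup, (parentGroup))
```

With a parent holding 1000 ticks over 10 calls and a child holding 250 ticks over 5 calls, the child's line reports an average of 50.0 ticks per call and 25.0% of its parent.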

Next month we'll wrap up talking about our profiling tools by discussing the memory instrumentation we created for AoK. Then, we'll take an in-depth look at a number of performance issues facing AoK, including unit movement and pathing, and see how they were addressed.

For More Information

Ensemble Studios

http://www.ensemblestudios.com

Intel VTune and C/C++ Compiler

http://developer.intel.com/vtune

MicroQuill HeapAgent and SmartHeap

http://www.microquill.com

NuMega TrueTime

http://www.numega.com

Performance Analysis and Tuning

Baecker, Ron, Chris DiGiano, and Aaron Marcus. "Software Visualization for Debugging." Communications of the ACM (Vol. 40, No. 4): April 1997.

Marselas, Herb. "Advanced Direct3D Performance Analysis." Microsoft Meltdown Proceedings, 1998.

Marselas, Herb. "Don't Starve that CPU! Making the Most of Memory Bandwidth." Game Developers Conference Proceedings, 1999.

Pottinger, Dave. "Coordinated Unit Movement." Game Developer (January and February 1999).

Shanley, Tom. Pentium Pro and Pentium II System Architecture, 2nd ed. Colorado Springs, Colo.: Mindshare Inc., 1997.

Acknowledgements: Creating and optimizing AoK was a team effort. I'd like to thank the AoK team, and specifically the other AoK programmers, for help in getting the details of some of that effort into this article. I'd also like to thank everyone at Ensemble Studios for reviewing this article.

Herb Marselas currently works at Ensemble Studios. He helped out on Age of Empires II: The Age of Kings. Shhhh! Please don't tell anyone he's working on a secret 3D-engine project called [deleted]. Previously, he worked at the Intel Platform Architecture Lab where he created the IPEAK Graphics Performance Toolkit. You can reach him at [email protected].
