
Sponsored: Improving Android game performance with Intel INDE GPA


Game Developer, Staff

August 13, 2015


This tutorial presents a step-by-step guide to performance analysis, bottleneck identification, and rendering optimization of an OpenGL ES 3.0 application on Android. The sample application, entitled “City Racer,” simulates a road race through a stylized urban setting.  Performance analysis of the application is done using the Intel INDE Graphics Performance Analyzers (Intel INDE GPA) tool suite.

Acknowledgements

This tutorial is an Android and OpenGL ES 3.0 version of the Intel Graphics Performance Workshop for 3rd Generation Intel Core Processor (Ivy Bridge) (PDF) created by David Houlton.  It ships with Intel INDE GPA.

Tutorial Organization

This tutorial guides you through four successive optimization steps.  At each step the application is analyzed with Intel INDE GPA to identify a specific performance bottleneck.  An appropriate optimization is then toggled within the application to overcome the bottleneck, and the application is analyzed again to measure the performance gained.  The optimizations applied are generally in line with the guidelines provided in the Developer’s Guide for Intel Processor Graphics (PDF).

Over the course of the tutorial, the applied optimizations reduce City Racer's GPU frame time by 83%, from 55.2ms to 9.2ms.


The combined city and vehicle geometry consists of approximately 230K polygons (690K vertices) with diffuse mapped materials lit by a single non-shadow casting directional light.  The provided source material includes the code, project files, and art assets required to build the application, including the source code optimizations identified throughout this tutorial.

Prerequisites

City Racer Sample Application

City Racer is logically divided into race simulation and rendering subcomponents.  Race simulation includes modeling vehicle acceleration, braking, turning parameters, and AI for track following and collision avoidance.  The race simulation code is in the track.cpp and vehicle.cpp files and is not affected by any of the optimizations applied over the course of this tutorial.

The rendering component draws the vehicles and scene geometry using OpenGL ES 3.0 and our internally developed CPUT framework.  The initial version of the rendering code represents a first-pass effort, containing several performance-limiting design choices.

Mesh and texture assets are loaded from the Media/defaultScene.scene file.  Individual meshes are tagged as either pre-placed scenery items, instanced scenery with per-instance transformation data, or vehicles for which the simulation provides transformation data.  There are several cameras in the scene:  one follows each car and an additional camera allows the user to freely explore the scene.  All performance analysis and code optimizations are targeted at the vehicle-follow camera mode.

For the purpose of this tutorial, City Racer is designed to start in a paused state, which allows you to walk through each profiling step with identical data sets.  City Racer can be unpaused by unchecking the Pause check box in the City Racer HUD or by setting g_Paused = false at the top of CityRacer.cpp.

Optimization Potential

Consider the City Racer application as a functional but non-optimized prototype.  In its initial state it provides the desired visual result, but not the desired rendering performance.  It uses a number of techniques and design choices that limit rendering performance and that are representative of those you’d find in a typical game in development.  The goal of the optimization phase of development is to identify the performance bottlenecks one by one, make code changes to overcome them, and measure the improvements achieved.

Note that this tutorial addresses only a small subset of all possible optimizations that could be applied to City Racer.  In particular, it only considers optimizations that can be applied completely in source code, without any changes to the model or texture assets.  Other asset-changing optimizations are excluded here simply because they become somewhat cumbersome to implement in tutorial format, but they can be identified using Intel INDE GPA tools and should be considered in a real-world game optimization.

Performance numbers shown in this document were captured on an Intel Atom processor-based system (codenamed Bay Trail) running Android.  The numbers may differ on your system, but relative performance relationships should be similar and logically lead to the same performance optimizations.

The optimizations to be applied over the course of the tutorial are found in CityRacer.cpp. They can be toggled through City Racer’s HUD or through direct modification in CityRacer.cpp.

CityRacer.cpp

bool g_Paused = true;
bool g_EnableFrustumCulling = false;
bool g_EnableBarrierInstancing = false;
bool g_EnableFastClear = false;
bool g_DisableColorBufferClear = false;
bool g_EnableSorting = false;

They are enabled one by one as you progress through the optimization steps.  Each variable controls the substitution of one or more code segments to achieve the optimization for that step of the tutorial.

Optimization Tutorial

The first step is to build and deploy City Racer on an Android device.  If your Android environment is set up correctly, the buildandroid.bat file located in CityRacer/Game/Code/Android will perform these steps for you. 

Next, launch Intel INDE GPA Monitor, right click the system tray icon, and select System Analyzer.

System Analyzer will show a list of possible platforms to connect to. Choose your Android x86 device and press “Connect.”

System Analyzer - Choose your Android x86 device

When System Analyzer connects to your Android device, it will display a list of applications available for profiling. Choose City Racer and wait for it to launch.

System Analyzer - a list of applications available for profiling

While City Racer is running, press the frame capture button to capture a snapshot of a GPU frame to use for analysis.

Capture a snapshot of a GPU frame to use for analysis

Examine the Frame

Open Frame Analyzer for OpenGL and choose the City Racer frame you just captured, which will allow you to examine GPU performance in detail.

Open Frame Analyzer for OpenGL* to examine GPU performance

The timeline corresponds to an OpenGL draw call

The timeline at the top is laid out in equally spaced ‘ergs’ of work, each of which usually corresponds to an OpenGL draw call.  For a more traditional timeline display, select GPU Duration on both the X and Y axes. This quickly shows which ergs consume the most GPU time and where we should initially focus our efforts.  If no ergs are selected, the panel on the right shows the GPU time for the entire frame, which is 55ms.

GPU duration

Optimization 1 – Frustum Culling

When viewing all of the draws, we can see that many of the items drawn never appear on the screen.  Changing the Y-axis to Post-Clip Primitives makes this visible: the gaps in the view mark draws that are wasted because their geometry is entirely clipped.

A view-frustum culling routine

The buildings in City Racer are combined into groups according to spatial locations. We can cull out the groups not visible and thus eliminate the GPU work associated with them. By toggling the Frustum Culling check box, each draw will be run through a view-frustum culling routine on the CPU before being submitted to the GPU.
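The CPU-side test described above can be sketched as follows. This is a minimal illustration and not City Racer's actual routine: the Plane and Aabb types and the isVisible and cullGroups functions are hypothetical names, and the sketch assumes each building group has a precomputed axis-aligned bounding box and that the six frustum planes are given with inward-facing normals.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Plane n.p + d = 0 with the normal pointing into the frustum (assumed convention).
struct Plane { float nx, ny, nz, d; };

// Axis-aligned bounding box of one building group (hypothetical type).
struct Aabb { float minX, minY, minZ, maxX, maxY, maxZ; };

// Returns false only when the box lies entirely behind at least one plane.
// The test is conservative: a box near a frustum corner may pass even
// though none of it ends up on screen.
bool isVisible(const Aabb& box, const std::array<Plane, 6>& frustum) {
    for (const Plane& p : frustum) {
        // Pick the box corner that lies farthest along the plane normal.
        const float x = (p.nx >= 0.0f) ? box.maxX : box.minX;
        const float y = (p.ny >= 0.0f) ? box.maxY : box.minY;
        const float z = (p.nz >= 0.0f) ? box.maxZ : box.minZ;
        if (p.nx * x + p.ny * y + p.nz * z + p.d < 0.0f) {
            return false;  // Even the most favorable corner is outside.
        }
    }
    return true;
}

// Collects the indices of the groups that survive culling.
// Only these groups would be submitted to the GPU as draws.
std::vector<std::size_t> cullGroups(const std::vector<Aabb>& groups,
                                    const std::array<Plane, 6>& frustum) {
    std::vector<std::size_t> visible;
    for (std::size_t i = 0; i < groups.size(); ++i) {
        if (isVisible(groups[i], frustum)) {
            visible.push_back(i);
        }
    }
    return visible;
}
```

Testing whole groups rather than individual buildings keeps the CPU cost of culling small relative to the GPU work it removes.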

Turn on the Frustum Culling check box and use System Analyzer to capture another frame.  Once the frame is captured, open it again in Frame Analyzer.

Frame Analyzer after frustum culling option enabled

By viewing this frame we can see the number of draws is reduced by 22% from 740 to 576 and our overall GPU time is reduced by 18%.

Frustum Culling Draw calls

Frustum Culling GPU duration

Optimization 2 – Instancing

While frustum culling reduced the overall number of ergs, there are still a great many small ergs (highlighted in yellow) which, taken cumulatively, add up to a non-trivial amount of GPU time.

A non-trivial amount of GPU time

By examining the geometry for these ergs we can see the majority of them are the concrete barriers which line the sides of the track.

Concrete barriers which line the sides of the track

We can eliminate much of the overhead involved in these draws by combining them into a single instanced draw.  Toggling the Barrier Instancing check box combines the barriers into one instanced draw, so the CPU no longer needs to submit each barrier to the GPU as a separate draw call.
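The CPU side of this change can be sketched as gathering the per-barrier transforms into one contiguous array. The sketch below is illustrative only and not the code in CityRacer.cpp: Mat4 and packInstanceData are hypothetical names, and it assumes each barrier instance is described by a single 4x4 world matrix. The OpenGL ES 3.0 calls that would consume the packed data appear only in comments, since they need a live GL context.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// One 4x4 world transform per barrier instance (hypothetical layout).
using Mat4 = std::array<float, 16>;

// Packs all per-instance transforms into one tightly packed float array.
// The result is suitable for uploading into a single vertex buffer that
// the vertex shader reads once per instance.
std::vector<float> packInstanceData(const std::vector<Mat4>& transforms) {
    std::vector<float> packed;
    packed.reserve(transforms.size() * 16);
    for (const Mat4& m : transforms) {
        packed.insert(packed.end(), m.begin(), m.end());
    }
    return packed;
}

// With the packed buffer bound, the matrix attributes would be marked as
// per-instance with glVertexAttribDivisor(attributeIndex, 1), and all
// barriers would be submitted with a single call such as:
//   glDrawElementsInstanced(GL_TRIANGLES, indexCount, GL_UNSIGNED_SHORT,
//                           nullptr, instanceCount);
// Here attributeIndex, indexCount, and instanceCount are placeholders.
```

The per-draw overhead is then paid once for the whole set of barriers instead of once per barrier.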

Turn on the Barrier Instancing check box and use System Analyzer to capture another frame.  Once the frame is captured, open it with Frame Analyzer.

Frame Analyzer - after barrier instancing enabled

By viewing this frame we can see the number of draws is reduced by 90% from 576 to 60.


Draw calls before concrete barrier instancing (top) and after instancing (bottom)

Additionally, the GPU duration is reduced by 71% to 13ms.

Instancing gpu duration

Optimization 3 – Front to Back Sorting

The term “overdraw” refers to writing to each pixel multiple times; this can impact pixel fill rate and increase frame rendering time.  Examining the Samples Written metric shows us that each pixel is being written to approximately 1.8 times per frame (Samples Written / Resolution).
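As a worked illustration, the average overdraw is the number of samples written divided by the number of pixels in the render target. The function name overdrawFactor and the numbers in the usage note are hypothetical, since the article does not state the capture resolution, and the sketch assumes one sample per pixel (no multisampling).

```cpp
#include <cstdint>

// Average number of times each pixel is written during one frame.
// Assumes a single sample per pixel; returns 0 for an empty target.
double overdrawFactor(std::uint64_t samplesWritten,
                      std::uint32_t width,
                      std::uint32_t height) {
    const double pixels =
        static_cast<double>(width) * static_cast<double>(height);
    if (pixels <= 0.0) {
        return 0.0;
    }
    return static_cast<double>(samplesWritten) / pixels;
}
```

For example, at an assumed 1920x1080 target (2,073,600 pixels), a frame that writes 3,732,480 samples has an overdraw factor of 1.8.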

Output Merger

Sorting the draws from front to back before rendering is a relatively straightforward way to reduce overdraw because the GPU pipeline will reject any pixels occluded by previous draws.
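A front-to-back sort can be sketched as ordering the draw list by distance from the camera before submission. This is a minimal sketch under stated assumptions, not City Racer's implementation: DrawItem, Vec3, and sortFrontToBack are hypothetical names, each draw is represented by the world-space center of its bounding volume, and the geometry is assumed to be opaque and rendered with depth testing enabled.

```cpp
#include <algorithm>
#include <vector>

struct Vec3 { float x, y, z; };

// One queued draw (hypothetical type).
struct DrawItem {
    int meshId;   // Handle identifying what to draw.
    Vec3 center;  // World-space center of the mesh's bounding volume.
};

// Squared distance is enough for ordering and avoids a square root.
float distanceSquared(const Vec3& a, const Vec3& b) {
    const float dx = a.x - b.x;
    const float dy = a.y - b.y;
    const float dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}

// Orders draws nearest-first so that later, farther draws can be
// rejected by the depth test wherever they are occluded.
void sortFrontToBack(std::vector<DrawItem>& draws, const Vec3& cameraPos) {
    std::sort(draws.begin(), draws.end(),
              [&cameraPos](const DrawItem& a, const DrawItem& b) {
                  return distanceSquared(a.center, cameraPos) <
                         distanceSquared(b.center, cameraPos);
              });
}
```

Sorting by bounding-volume center is only an approximation of true occlusion order, but it is cheap and usually sufficient to reduce overdraw.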

Turn on the Sort Front to Back check box and use System Analyzer to capture another frame.  Once the frame is captured, open it with Frame Analyzer.

Frame Analyzer after enabling sort front to back

By viewing this frame we can see the Samples Written metric decreased by 6% and our overall GPU time is reduced by 8%.

Output Merger after enabling sort front to back

GPU duration after enabling sort front to back

 

Optimization 4 – Fast Clear

A final look at our draw times shows the first erg is taking the longest individual GPU time.  Selecting this erg reveals that it’s not a draw but a glClear call.

First erg taking the longest individual GPU time

glClear call

Intel’s GPU hardware has an optimization path that performs a ‘fast clear’ in a fraction of the time it takes a traditional clear.  A fast clear can be performed by setting the clear color with glClearColor to all black or all white (0, 0, 0, 0 or 1, 1, 1, 1).
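The rule stated above can be captured in a small helper that checks a candidate clear color before it is passed to glClearColor. This is a sketch only: ClearColor and qualifiesForFastClear are hypothetical names, the check encodes nothing beyond the two colors named in this article, and whether the fast path is actually taken depends on the hardware and driver.

```cpp
// RGBA clear color as it would be passed to glClearColor (hypothetical type).
struct ClearColor { float r, g, b, a; };

// True when every channel is exactly 0 or every channel is exactly 1,
// the two cases this article names as eligible for a fast clear.
bool qualifiesForFastClear(const ClearColor& c) {
    const bool allZero =
        c.r == 0.0f && c.g == 0.0f && c.b == 0.0f && c.a == 0.0f;
    const bool allOne =
        c.r == 1.0f && c.g == 1.0f && c.b == 1.0f && c.a == 1.0f;
    return allZero || allOne;
}

// A qualifying color would then be applied before the clear, for example:
//   glClearColor(c.r, c.g, c.b, c.a);
//   glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
```

A check like this could run in a debug build to flag clear colors that silently fall back to the slower path.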

Turn on the Fast Clear check box and use System Analyzer to capture another frame.  Once the frame is captured, open it with Frame Analyzer.

Frame Analyzer after enabling fast clear

By viewing this frame we can see the GPU duration for the clear has decreased by 87% over the regular clear, from 1.2ms to 0.2ms.

GPU duration for the clear decreased


As a result, the overall frame duration of the GPU is decreased by 24% to 9.2ms.

The overall frame duration of the GPU decreased

 

Conclusion

This tutorial has taken a representative early-stage game application and used the Intel INDE GPA to analyze application behavior and make targeted changes to improve performance.  The changes made and improvements realized were:

| Optimization | Before | After | % Improvement |
| --- | --- | --- | --- |
| Frustum Culling | 55.2ms | 45.0ms | 18% |
| Instancing | 45.0ms | 13.2ms | 71% |
| Sorting | 13.2ms | 12.1ms | 8% |
| Fast Clear | 12.1ms | 9.2ms | 24% |
| Overall GPU Optimizations | 55.2ms | 9.2ms | 83% |

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance.

Overall, from the initial implementation of City Racer to the best optimized version, we demonstrate rendering performance improvement of 300%, from 11 fps to 44 fps.  Since this implementation starts out significantly sub-optimal, a developer applying these techniques will probably not see the same absolute performance gain on a real-world game.
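The two kinds of percentage quoted in this conclusion are computed differently: the GPU times are reductions relative to the earlier, larger value, while the frame rate is an increase relative to the earlier, smaller value. The helper names below are hypothetical and are shown only to make the arithmetic explicit.

```cpp
// Percentage by which a duration shrank, relative to the starting value.
double percentReduction(double beforeMs, double afterMs) {
    return (beforeMs - afterMs) / beforeMs * 100.0;
}

// Percentage by which a rate grew, relative to the starting value.
double percentIncrease(double beforeFps, double afterFps) {
    return (afterFps - beforeFps) / beforeFps * 100.0;
}
```

With the figures reported here, percentReduction(55.2, 9.2) gives roughly 83 and percentIncrease(11.0, 44.0) gives 300.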

Nevertheless, the primary goal of this tutorial is not to optimize this specific sample application but to demonstrate the potential performance gains you can find by following the recommendations in the Developer’s Guide for Intel Processor Graphics, and the usefulness of Intel INDE GPA in finding and measuring those improvements.

Visit the Intel Developer Zone for more articles like this
