How to Analyze the Performance Cost of Your Unity Shaders

Ruben Torres Bonet, Blogger

June 23, 2020

In this post, you will learn how to analyze the complexity of your Unity shaders with numbers so you can finally:

  • Stop being fragment-bound in your Unity game

  • Compare the GPU complexity between two shaders

  • Analyze shader costs in terms of texture, arithmetic and load/store GPU cycles

  • ... And reach 60, 90 or higher FPS

For this, we will use a little-known tool called Mali Offline Compiler.

With this free software, you'll finally be able to see how you're spending your GPU cycles in your Unity shaders.

So let's get started with this exciting topic.

Is It Important to Count GPU Cycles Nowadays?

Now more than ever, it is crucial to understand the impact of your shaders on the performance of your game.

With ever-increasing resolutions (I'm looking at you, VR), more and more games are bottlenecked by the fragment shading stage.

“The more pixels you render, the more attention you have to pay to the cost of your fragment shaders.”

Rubén (The Gamedev Guru)

To get an idea of how expensive your shaders are, there are two approaches:

  • Making guesstimates, e.g. "this shader looks expensive".

  • Measuring: either through static analysis or in-game profiling.

In this blog post, we will measure the cost of your shaders through static analysis. Guesstimates will work better once you gain more experience measuring 😉

In the next sections, you and I will get to compile your shaders in a matter of minutes.

With that, we will get valuable performance information about them that will guide your future decisions.

Setting Up Your Mali Offline Compiler

You can download Mali Offline Compiler as part of Arm Mobile Studio.

On that page, you'll want to download the latest release for your target platform.

Download arm Mobile Studio

Once you've gone through the setup, the Mali Offline Compiler should be part of your PATH variable, i.e. you'll be able to invoke it through the command line.

If that is not the case, you can add it yourself. You can find the malioc executable in the installation path.
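A quick way to confirm that the executable is reachable is to look it up on PATH from a script. Here is a minimal Python sketch; the function name find_malioc is made up for illustration, and the only thing it assumes is that the executable is called malioc, as mentioned above.

```python
import shutil


def find_malioc(executable="malioc"):
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(executable)


path = find_malioc()
if path is None:
    print("malioc is not on PATH; add its installation folder to PATH")
else:
    print(f"malioc found at {path}")
```

If the lookup fails, add the folder that contains malioc to PATH and open a new terminal before trying again.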

Compiling Your Unity Shaders

Before we can start using the Mali Offline Shader Compiler, we need to instruct Unity to compile the shader you want to analyze.

You see, the Mali Offline Compiler knows nothing about Unity's shader format.

It just wants the shader in GLSL format.

Luckily, this is pretty easy in Unity.

Navigate to a material of your choice and click on the wheel icon on its right. Then, click on Select Shader.

Unity: Finding Your Shader

Doing so will show you the inspector of your shader, which includes its name, some metadata and the option to compile it.

Unity: Compiling Your Shader

(You might need to select GLES3x, as this is the graphics API that the Mali Offline Compiler works well with.)

Guess which button you will press?

Getting Your Unity Shader Performance Metrics

Once you have pressed Compile and show code, your code editor will open the possibly long list of shaders that Unity compiled for you.

This temporary file contains all the vertex and fragment shader variants Unity produced for you.

Each vertex shader starts with #ifdef VERTEX and ends at its matching #endif.

Fragment shaders are delimited the same way, by #ifdef FRAGMENT and its matching #endif.

Here's what you'll want to do:

  • Copy the inner code of either a vertex or a fragment shader

  • Paste it into a new file and save it with its proper extension (.vert or .frag)

  • Kindly ask the Mali Offline Compiler to give you the performance metrics
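If you do this often, the copy-and-paste steps above can be scripted. The Python sketch below is only an illustration under a few assumptions: the function name extract_stage is hypothetical, the stage block starts with a line that reads exactly #ifdef VERTEX or #ifdef FRAGMENT, and only the first matching variant in the file is returned. It counts nested #if/#endif pairs so that it stops at the matching #endif rather than the first one it sees.

```python
def extract_stage(source, stage):
    """Return the code between '#ifdef <stage>' and its matching '#endif'.

    Only the first block for the given stage is returned.
    """
    body = []
    depth = 0  # 0 means we have not entered the stage block yet
    for line in source.splitlines():
        token = line.strip()
        if depth == 0:
            if token == f"#ifdef {stage}":
                depth = 1
            continue
        if token.startswith("#if"):  # also covers #ifdef and #ifndef
            depth += 1
        elif token.startswith("#endif"):
            depth -= 1
            if depth == 0:
                return "\n".join(body) + "\n"
        body.append(line)
    raise ValueError(f"no complete '#ifdef {stage}' block found")


# Possible usage, assuming the temporary file was saved as compiled.shader:
#   from pathlib import Path
#   text = Path("compiled.shader").read_text()
#   Path("shader.vert").write_text(extract_stage(text, "VERTEX"))
#   Path("shader.frag").write_text(extract_stage(text, "FRAGMENT"))
```

The resulting .vert and .frag files can then be passed to malioc exactly like the hand-copied ones.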

Let me show you two examples with the standard shader.

Vertex Shader Performance Metrics

Here's the code I am saving to shader.vert:


//#ifdef VERTEX
#version 300 es
#define HLSLCC_ENABLE_UNIFORM_BUFFERS 1
#if HLSLCC_ENABLE_UNIFORM_BUFFERS
#define UNITY_UNIFORM
#else
#define UNITY_UNIFORM uniform
#endif
#define UNITY_SUPPORTS_UNIFORM_LOCATION 1
#if UNITY_SUPPORTS_UNIFORM_LOCATION
#define UNITY_LOCATION(x) layout(location = x)
#define UNITY_BINDING(x) layout(binding = x, std140)
#else
#define UNITY_LOCATION(x)
#define UNITY_BINDING(x) layout(std140)
#endif
uniform vec3 _WorldSpaceCameraPos;
uniform mediump vec4 unity_SHBr;
uniform mediump vec4 unity_SHBg;
uniform mediump vec4 unity_SHBb;
uniform mediump vec4 unity_SHC;
uniform vec4 hlslcc_mtx4x4unity_ObjectToWorld[4];
uniform vec4 hlslcc_mtx4x4unity_WorldToObject[4];
uniform vec4 hlslcc_mtx4x4unity_MatrixVP[4];
uniform vec4 _MainTex_ST;
uniform vec4 _DetailAlbedoMap_ST;
uniform mediump float _UVSec;
in highp vec4 in_POSITION0;
in mediump vec3 in_NORMAL0;
in highp vec2 in_TEXCOORD0;
in highp vec2 in_TEXCOORD1;
out highp vec4 vs_TEXCOORD0;
out highp vec4 vs_TEXCOORD1;
out highp vec4 vs_TEXCOORD2;
out highp vec4 vs_TEXCOORD3;
out highp vec4 vs_TEXCOORD4;
out mediump vec4 vs_TEXCOORD5;
out highp vec4 vs_TEXCOORD7;
out highp vec3 vs_TEXCOORD8;
vec4 u_xlat0;
mediump vec4 u_xlat16_0;
bool u_xlatb0;
vec4 u_xlat1;
mediump float u_xlat16_2;
mediump vec3 u_xlat16_3;
float u_xlat12;
void main()
{
u_xlat0 = in_POSITION0.yyyy * hlslcc_mtx4x4unity_ObjectToWorld[1];
u_xlat0 = hlslcc_mtx4x4unity_ObjectToWorld[0] * in_POSITION0.xxxx + u_xlat0;
u_xlat0 = hlslcc_mtx4x4unity_ObjectToWorld[2] * in_POSITION0.zzzz + u_xlat0;
u_xlat0 = u_xlat0 + hlslcc_mtx4x4unity_ObjectToWorld[3];
u_xlat1 = u_xlat0.yyyy * hlslcc_mtx4x4unity_MatrixVP[1];
u_xlat1 = hlslcc_mtx4x4unity_MatrixVP[0] * u_xlat0.xxxx + u_xlat1;
u_xlat1 = hlslcc_mtx4x4unity_MatrixVP[2] * u_xlat0.zzzz + u_xlat1;
gl_Position = hlslcc_mtx4x4unity_MatrixVP[3] * u_xlat0.wwww + u_xlat1;
#ifdef UNITY_ADRENO_ES3
u_xlatb0 = !!(_UVSec==0.0);
#else
u_xlatb0 = _UVSec==0.0;
#endif
u_xlat0.xy = (bool(u_xlatb0)) ? in_TEXCOORD0.xy : in_TEXCOORD1.xy;
vs_TEXCOORD0.zw = u_xlat0.xy * _DetailAlbedoMap_ST.xy + _DetailAlbedoMap_ST.zw;
vs_TEXCOORD0.xy = in_TEXCOORD0.xy * _MainTex_ST.xy + _MainTex_ST.zw;
u_xlat0.xyz = in_POSITION0.yyy * hlslcc_mtx4x4unity_ObjectToWorld[1].xyz;
u_xlat0.xyz = hlslcc_mtx4x4unity_ObjectToWorld[0].xyz * in_POSITION0.xxx + u_xlat0.xyz;
u_xlat0.xyz = hlslcc_mtx4x4unity_ObjectToWorld[2].xyz * in_POSITION0.zzz + u_xlat0.xyz;
u_xlat0.xyz = hlslcc_mtx4x4unity_ObjectToWorld[3].xyz * in_POSITION0.www + u_xlat0.xyz;
vs_TEXCOORD1.xyz = u_xlat0.xyz + (-_WorldSpaceCameraPos.xyz);
vs_TEXCOORD8.xyz = u_xlat0.xyz;
vs_TEXCOORD1.w = 0.0;
vs_TEXCOORD2 = vec4(0.0, 0.0, 0.0, 0.0);
vs_TEXCOORD3 = vec4(0.0, 0.0, 0.0, 0.0);
u_xlat0.x = dot(in_NORMAL0.xyz, hlslcc_mtx4x4unity_WorldToObject[0].xyz);
u_xlat0.y = dot(in_NORMAL0.xyz, hlslcc_mtx4x4unity_WorldToObject[1].xyz);
u_xlat0.z = dot(in_NORMAL0.xyz, hlslcc_mtx4x4unity_WorldToObject[2].xyz);
u_xlat12 = dot(u_xlat0.xyz, u_xlat0.xyz);
u_xlat12 = inversesqrt(u_xlat12);
u_xlat0.xyz = vec3(u_xlat12) * u_xlat0.xyz;
vs_TEXCOORD4.xyz = u_xlat0.xyz;
vs_TEXCOORD4.w = 0.0;
u_xlat16_2 = u_xlat0.y * u_xlat0.y;
u_xlat16_2 = u_xlat0.x * u_xlat0.x + (-u_xlat16_2);
u_xlat16_0 = u_xlat0.yzzx * u_xlat0.xyzz;
u_xlat16_3.x = dot(unity_SHBr, u_xlat16_0);
u_xlat16_3.y = dot(unity_SHBg, u_xlat16_0);
u_xlat16_3.z = dot(unity_SHBb, u_xlat16_0);
vs_TEXCOORD5.xyz = unity_SHC.xyz * vec3(u_xlat16_2) + u_xlat16_3.xyz;
vs_TEXCOORD5.w = 0.0;
vs_TEXCOORD7 = vec4(0.0, 0.0, 0.0, 0.0);
return;
}
//#endif

Note that you have to exclude the opening #ifdef VERTEX and the closing #endif. I left them in as comments for your reference.

Then invoke the Mali Offline Compiler with "malioc shader.vert", which produces this output:


C:\Users\rtorresb\Desktop\Tmp>malioc shader.vert
Mali Offline Compiler v7.1.0 (Build 7a3538)
Copyright 2007-2020 Arm Limited, all rights reserved

Configuration
=============
Hardware: Mali-G76 r0p0
Driver: Bifrost r19p0-00rel0
Shader type: OpenGL ES Vertex (inferred)

Main shader
===========
Work registers: 32
Uniform registers: 82
Stack spilling: False

                                A     LS      V      T   Bound
Total instruction cycles:     2.9   16.0    0.0    0.0      LS
Shortest path cycles:         2.9   16.0    0.0    0.0      LS
Longest path cycles:          2.9   16.0    0.0    0.0      LS

A = Arithmetic, LS = Load/Store, V = Varying, T = Texture

As you can see, this specific shader is load/store-bound, with 16 load/store cycles on a Mali-G76 GPU.

It's a pretty expensive one, but that's what you get when using the standard shader.

To optimize this shader, you'll want to reduce its load/store operations. Then, redo this step to see how much you improved it.
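To make that before-and-after check repeatable, you could parse the cycle table out of the text report and subtract one run from the other. The Python sketch below assumes the four-column layout (A, LS, V, T) printed above for the Mali-G76; other GPUs or compiler versions may print different columns. The names parse_cycles and compare are made up for this example and are not part of the tool.

```python
import re

UNITS = ("A", "LS", "V", "T")

# Matches rows such as "Longest path cycles: 2.9 16.0 0.0 0.0 LS".
ROW = re.compile(
    r"^\s*(Total instruction|Shortest path|Longest path) cycles:\s+"
    r"([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+(.+?)\s*$",
    re.MULTILINE,
)


def parse_cycles(report):
    """Map each row name to its per-unit cycle counts and its bound unit."""
    rows = {}
    for match in ROW.finditer(report):
        values = [float(v) for v in match.groups()[1:5]]
        rows[match.group(1)] = {
            "cycles": dict(zip(UNITS, values)),
            "bound": match.group(6),
        }
    return rows


def compare(before, after, row="Longest path"):
    """Return after minus before for each unit; negative means cheaper."""
    return {
        unit: round(after[row]["cycles"][unit] - before[row]["cycles"][unit], 2)
        for unit in UNITS
    }


# Possible usage, with two report files you saved yourself:
#   old = parse_cycles(open("before.txt").read())
#   new = parse_cycles(open("after.txt").read())
#   print(compare(old, new))
```

Saving each malioc report to a text file (for example by redirecting its output) gives you something to feed into parse_cycles after every change.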

Fragment Shader Performance Metrics

Let's go through the same procedure with the fragment shader below:


//#ifdef FRAGMENT
#version 300 es
#ifdef GL_EXT_shader_texture_lod
#extension GL_EXT_shader_texture_lod : enable
#endif
precision highp float;
precision highp int;
#define HLSLCC_ENABLE_UNIFORM_BUFFERS 1
#if HLSLCC_ENABLE_UNIFORM_BUFFERS
#define UNITY_UNIFORM
#else
#define UNITY_UNIFORM uniform
#endif
#define UNITY_SUPPORTS_UNIFORM_LOCATION 1
#if UNITY_SUPPORTS_UNIFORM_LOCATION
#define UNITY_LOCATION(x) layout(location = x)
#define UNITY_BINDING(x) layout(binding = x, std140)
#else
#define UNITY_LOCATION(x)
#define UNITY_BINDING(x) layout(std140)
#endif
uniform mediump vec4 _WorldSpaceLightPos0;
uniform mediump vec4 unity_SHAr;
uniform mediump vec4 unity_SHAg;
uniform mediump vec4 unity_SHAb;
uniform mediump vec4 unity_SpecCube0_HDR;
uniform mediump vec4 _LightColor0;
uniform mediump vec4 _Color;
uniform float _GlossMapScale;
uniform mediump float _OcclusionStrength;
UNITY_LOCATION(0) uniform mediump sampler2D _MainTex;
UNITY_LOCATION(1) uniform mediump sampler2D _MetallicGlossMap;
UNITY_LOCATION(2) uniform mediump sampler2D _OcclusionMap;
UNITY_LOCATION(3) uniform mediump samplerCube unity_SpecCube0;
in highp vec4 vs_TEXCOORD0;
in highp vec4 vs_TEXCOORD1;
in highp vec4 vs_TEXCOORD4;
in mediump vec4 vs_TEXCOORD5;
layout(location = 0) out mediump vec4 SV_Target0;
vec3 u_xlat0;
vec3 u_xlat1;
mediump vec4 u_xlat16_1;
mediump vec3 u_xlat16_2;
vec4 u_xlat3;
mediump float u_xlat16_4;
mediump vec3 u_xlat16_5;
mediump vec3 u_xlat16_6;
mediump vec3 u_xlat16_7;
mediump vec3 u_xlat16_8;
vec3 u_xlat9;
mediump vec3 u_xlat16_9;
mediump vec3 u_xlat16_13;
mediump vec3 u_xlat16_15;
float u_xlat18;
float u_xlat20;
mediump float u_xlat16_24;
float u_xlat27;
mediump float u_xlat16_27;
float u_xlat28;
void main()
{
u_xlat0.x = dot(vs_TEXCOORD1.xyz, vs_TEXCOORD1.xyz);
u_xlat0.x = inversesqrt(u_xlat0.x);
u_xlat9.xyz = (-vs_TEXCOORD1.xyz) * u_xlat0.xxx + _WorldSpaceLightPos0.xyz;
u_xlat1.xyz = u_xlat0.xxx * vs_TEXCOORD1.xyz;
u_xlat0.x = dot(u_xlat9.xyz, u_xlat9.xyz);
u_xlat0.x = max(u_xlat0.x, 0.00100000005);
u_xlat0.x = inversesqrt(u_xlat0.x);
u_xlat0.xyz = u_xlat0.xxx * u_xlat9.xyz;
u_xlat27 = dot(_WorldSpaceLightPos0.xyz, u_xlat0.xyz);
#ifdef UNITY_ADRENO_ES3
u_xlat27 = min(max(u_xlat27, 0.0), 1.0);
#else
u_xlat27 = clamp(u_xlat27, 0.0, 1.0);
#endif
u_xlat27 = max(u_xlat27, 0.319999993);
u_xlat16_2.xy = texture(_MetallicGlossMap, vs_TEXCOORD0.xy).xw;
u_xlat28 = (-u_xlat16_2.y) * _GlossMapScale + 1.0;
u_xlat20 = u_xlat28 * u_xlat28 + 1.5;
u_xlat27 = u_xlat27 * u_xlat20;
u_xlat20 = dot(vs_TEXCOORD4.xyz, vs_TEXCOORD4.xyz);
u_xlat20 = inversesqrt(u_xlat20);
u_xlat3.xyz = vec3(u_xlat20) * vs_TEXCOORD4.xyz;
u_xlat0.x = dot(u_xlat3.xyz, u_xlat0.xyz);
#ifdef UNITY_ADRENO_ES3
u_xlat0.x = min(max(u_xlat0.x, 0.0), 1.0);
#else
u_xlat0.x = clamp(u_xlat0.x, 0.0, 1.0);
#endif
u_xlat0.x = u_xlat0.x * u_xlat0.x;
u_xlat9.x = u_xlat28 * u_xlat28;
u_xlat18 = u_xlat9.x * u_xlat9.x + -1.0;
u_xlat0.x = u_xlat0.x * u_xlat18 + 1.00001001;
u_xlat0.x = u_xlat0.x * u_xlat27;
u_xlat0.x = u_xlat9.x / u_xlat0.x;
u_xlat16_4 = u_xlat28 * u_xlat9.x;
u_xlat16_4 = (-u_xlat16_4) * 0.280000001 + 1.0;
u_xlat0.x = u_xlat0.x + -9.99999975e-05;
u_xlat0.x = max(u_xlat0.x, 0.0);
u_xlat0.x = min(u_xlat0.x, 100.0);
u_xlat16_9.xyz = texture(_MainTex, vs_TEXCOORD0.xy).xyz;
u_xlat16_5.xyz = u_xlat16_9.xyz * _Color.xyz;
u_xlat16_13.xyz = _Color.xyz * u_xlat16_9.xyz + vec3(-0.220916301, -0.220916301, -0.220916301);
u_xlat16_13.xyz = u_xlat16_2.xxx * u_xlat16_13.xyz + vec3(0.220916301, 0.220916301, 0.220916301);
u_xlat16_6.x = (-u_xlat16_2.x) * 0.779083729 + 0.779083729;
u_xlat16_15.xyz = u_xlat16_5.xyz * u_xlat16_6.xxx;
u_xlat16_6.x = (-u_xlat16_6.x) + 1.0;
u_xlat16_6.x = u_xlat16_2.y * _GlossMapScale + u_xlat16_6.x;
#ifdef UNITY_ADRENO_ES3
u_xlat16_6.x = min(max(u_xlat16_6.x, 0.0), 1.0);
#else
u_xlat16_6.x = clamp(u_xlat16_6.x, 0.0, 1.0);
#endif
u_xlat16_7.xyz = (-u_xlat16_13.xyz) + u_xlat16_6.xxx;
u_xlat0.xyz = u_xlat0.xxx * u_xlat16_13.xyz + u_xlat16_15.xyz;
u_xlat0.xyz = u_xlat0.xyz * _LightColor0.xyz;
u_xlat3.w = 1.0;
u_xlat16_8.x = dot(unity_SHAr, u_xlat3);
u_xlat16_8.y = dot(unity_SHAg, u_xlat3);
u_xlat16_8.z = dot(unity_SHAb, u_xlat3);
u_xlat16_8.xyz = u_xlat16_8.xyz + vs_TEXCOORD5.xyz;
u_xlat16_8.xyz = max(u_xlat16_8.xyz, vec3(0.0, 0.0, 0.0));
u_xlat16_2.xyz = log2(u_xlat16_8.xyz);
u_xlat16_2.xyz = u_xlat16_2.xyz * vec3(0.416666657, 0.416666657, 0.416666657);
u_xlat16_2.xyz = exp2(u_xlat16_2.xyz);
u_xlat16_2.xyz = u_xlat16_2.xyz * vec3(1.05499995, 1.05499995, 1.05499995) + vec3(-0.0549999997, -0.0549999997, -0.0549999997);
u_xlat16_2.xyz = max(u_xlat16_2.xyz, vec3(0.0, 0.0, 0.0));
u_xlat16_27 = texture(_OcclusionMap, vs_TEXCOORD0.xy).y;
u_xlat16_6.x = (-_OcclusionStrength) + 1.0;
u_xlat16_6.x = u_xlat16_27 * _OcclusionStrength + u_xlat16_6.x;
u_xlat16_8.xyz = u_xlat16_2.xyz * u_xlat16_6.xxx;
u_xlat16_15.xyz = u_xlat16_15.xyz * u_xlat16_8.xyz;
u_xlat27 = dot(u_xlat3.xyz, _WorldSpaceLightPos0.xyz);
#ifdef UNITY_ADRENO_ES3
u_xlat27 = min(max(u_xlat27, 0.0), 1.0);
#else
u_xlat27 = clamp(u_xlat27, 0.0, 1.0);
#endif
u_xlat0.xyz = u_xlat0.xyz * vec3(u_xlat27) + u_xlat16_15.xyz;
u_xlat16_15.x = (-u_xlat28) * 0.699999988 + 1.70000005;
u_xlat16_15.x = u_xlat28 * u_xlat16_15.x;
u_xlat16_15.x = u_xlat16_15.x * 6.0;
u_xlat16_24 = dot(u_xlat1.xyz, u_xlat3.xyz);
u_xlat16_24 = u_xlat16_24 + u_xlat16_24;
u_xlat16_8.xyz = u_xlat3.xyz * (-vec3(u_xlat16_24)) + u_xlat1.xyz;
u_xlat27 = dot(u_xlat3.xyz, (-u_xlat1.xyz));
#ifdef UNITY_ADRENO_ES3
u_xlat27 = min(max(u_xlat27, 0.0), 1.0);
#else
u_xlat27 = clamp(u_xlat27, 0.0, 1.0);
#endif
u_xlat16_24 = (-u_xlat27) + 1.0;
u_xlat16_24 = u_xlat16_24 * u_xlat16_24;
u_xlat16_24 = u_xlat16_24 * u_xlat16_24;
u_xlat16_13.xyz = vec3(u_xlat16_24) * u_xlat16_7.xyz + u_xlat16_13.xyz;
u_xlat16_1 = textureLod(unity_SpecCube0, u_xlat16_8.xyz, u_xlat16_15.x);
u_xlat16_15.x = u_xlat16_1.w + -1.0;
u_xlat16_15.x = unity_SpecCube0_HDR.w * u_xlat16_15.x + 1.0;
u_xlat16_15.x = u_xlat16_15.x * unity_SpecCube0_HDR.x;
u_xlat16_15.xyz = u_xlat16_1.xyz * u_xlat16_15.xxx;
u_xlat16_6.xyz = u_xlat16_6.xxx * u_xlat16_15.xyz;
u_xlat16_6.xyz = vec3(u_xlat16_4) * u_xlat16_6.xyz;
u_xlat0.xyz = u_xlat16_6.xyz * u_xlat16_13.xyz + u_xlat0.xyz;
SV_Target0.xyz = u_xlat0.xyz;
SV_Target0.w = 1.0;
return;
}
//#endif

 

We save that to shader.frag and invoke the Mali Offline Compiler:


C:\Users\rtorresb\Desktop\Tmp>malioc shader.frag
Mali Offline Compiler v7.1.0 (Build 7a3538)
Copyright 2007-2020 Arm Limited, all rights reserved

Configuration
=============
Hardware: Mali-G76 r0p0
Driver: Bifrost r19p0-00rel0
Shader type: OpenGL ES Fragment (inferred)

Main shader
===========
Work registers: 32
Uniform registers: 24
Stack spilling: False

                                A     LS      V      T   Bound
Total instruction cycles:     1.7    0.0    1.0    2.0       T
Shortest path cycles:         1.7    0.0    1.0    2.0       T
Longest path cycles:          1.7    0.0    1.0    2.0       T

A = Arithmetic, LS = Load/Store, V = Varying, T = Texture

In this case, our shader is texture-bound. To optimize it, you need to reduce the number of texture channels and texture accesses in your shader.
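The Bound column is, as far as I understand it, simply the unit with the highest cycle count, since that is the unit the others end up waiting for. The small Python sketch below redoes that arithmetic with the fragment shader numbers from the report above; the function name bound_units is hypothetical.

```python
def bound_units(cycles):
    """Return the unit or units with the highest cycle count."""
    worst = max(cycles.values())
    return [unit for unit, value in cycles.items() if value == worst]


# Cycle counts copied from the fragment shader report above.
fragment = {"A": 1.7, "LS": 0.0, "V": 1.0, "T": 2.0}
print(bound_units(fragment))  # ['T']
```

Note how close arithmetic (1.7) is to texture (2.0) here: shaving off a texture access would quickly make this shader arithmetic-bound instead.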

(By the way, these numbers are MUCH worse on other variants of the standard shader).

What Does This Teach Us?

Here are a few key lessons you can get from this post:

  • Everything counts towards performance: instructions, texture channels, variants. Everything.

  • You have a neat tool to measure the cost of your shaders

  • And more importantly, you can now compare shaders' performance when in doubt

  • You are now a step closer to 60 FPS.

However, keep in mind:

  • These estimates vary greatly across architectures and even driver versions...

  • Yet, these metrics will be incredibly useful for your optimization journey

What do you think?

To get more useful tips to improve your Unity game performance and avoid 1-star reviews, grab my Unity performance checklist now.

~Ruben
