Bringing SIMD-accelerated vector math to the web through Dart

In this reprinted #AltDevBlogADay in-depth piece, SCEA's senior developer support engineer John McCutchan digs into Dart, and looks at how to bring SIMD-accelerated math to the web using the programming language.

June 20, 2012

5 Min Read

Recently, I have been working on a vector math library for Dart. Boringly, I named it Dart Vector Math; the latest version can be found on GitHub. My two biggest goals for Dart Vector Math are the following:

  • Near-100-percent GLSL-compatible syntax. This includes the awesome vector shuffle syntax and flexible construction of vectors and matrices.

  • Performance in terms of both CPU time and memory usage / garbage collection load.

Aside from a couple of quirks, Dart Vector Math is GLSL syntax compatible. It is possible to copy and paste GLSL code into Dart and, after a couple of tweaks, have it compile with Dart Vector Math. This makes debugging shader code easy. Since Dart is a garbage-collected language, being optimal in terms of space means avoiding the creation of lots of objects. To facilitate that, Dart Vector Math offers many functions that work directly on already-allocated vectors and matrices.

This weekend I started to look at the CPU performance of Dart Vector Math versus glMatrix.dart (a port of glMatrix, the current champ of JavaScript vector math libraries, from JavaScript to Dart). The initial results are heavily in favor of Dart Vector Math:

                                          Avg (ms)   Min (ms)   Max (ms)
  Matrix multiplication                     14.590     10.161     22.927
  Matrix multiplication (glmatrix.dart)    283.353    272.062    287.988
  mat4x4 inverse                            28.289     21.019     34.891
  mat4x4 inverse (glmatrix.dart)           318.909    315.435    325.831
  Vector transform                           4.324      2.811     14.859
  Vector transform (glmatrix.dart)         144.431    138.263    153.798

The code for 4×4 matrix multiplication in Dart Vector Math and glMatrix is practically identical, so on closer inspection the above numbers didn't make much sense. There is one key difference: Dart Vector Math uses a native Dart object to store the matrix, while glMatrix uses a Float32Array as storage. Digging into the disassembly, I discovered that indexing into a Float32Array is currently a slow path in the VM, skewing the results against glMatrix.dart. Not that big of a deal; Dart is a new language and the VM needs time to mature.

Once the performance issue with Float32Arrays is fixed, I want Dart Vector Math to use them, for two reasons. First, they take up 50 percent less space (single- vs. double-precision floats). Second, WebGL needs Float32Arrays for uniform data, which means the matrix is eventually going to end up inside a Float32Array anyway; might as well keep it in one the whole time. There is no CPU performance benefit from using a Float32Array as storage, because every operation promotes the floats to doubles, operates on them, and then stores them back as floats.

My intention to move to Float32Array got me thinking, and I ended up asking myself: why doesn't the browser offer an API for common vector math operations on Float32Array, implemented efficiently with SIMD instruction sets? I'm not sure why it isn't offered, but I ended up spending the weekend implementing it for the Dart VM. The API follows:

class SimdFloat32Array {
 // Invert count 4x4 matrices read from src, writing the results to dst.
 static void matrix4Inv(Float32List dst, int dstIndex,
     Float32List src, int srcIndex, int count);
 // Multiply count pairs of 4x4 matrices from a and b, writing to dst.
 static void matrix4Mult(Float32List dst, int dstIndex,
     Float32List a, int aIndex, Float32List b, int bIndex, int count);
 // Transform vCount 4-component vectors in v by the matrix at Mindex.
 static void transform(Float32List M, int Mindex,
     Float32List v, int vIndex, int vCount);
}
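To make the shape of such an API concrete, here is a rough sketch, in C with SSE intrinsics, of what a native backing for a bulk 4×4 matrix multiply could look like. This is my own illustration under stated assumptions (column-major matrices stored contiguously, 16 floats apiece, and a hypothetical function name), not the actual Dart VM patch:

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical native backing for a bulk matrix4Mult:
   computes dst = a * b for `count` column-major 4x4 matrices
   stored contiguously in flat float arrays (16 floats each). */
static void matrix4_mult_sse(float* dst, const float* a,
                             const float* b, int count) {
  for (int m = 0; m < count; ++m) {
    const float* A = a + 16 * m;
    const float* B = b + 16 * m;
    float* D = dst + 16 * m;
    /* Load the four columns of A once per matrix. */
    __m128 a0 = _mm_loadu_ps(A + 0);
    __m128 a1 = _mm_loadu_ps(A + 4);
    __m128 a2 = _mm_loadu_ps(A + 8);
    __m128 a3 = _mm_loadu_ps(A + 12);
    for (int c = 0; c < 4; ++c) {
      /* Column c of D is a linear combination of A's columns,
         weighted by the four entries of column c of B. */
      __m128 r = _mm_mul_ps(a0, _mm_set1_ps(B[4 * c + 0]));
      r = _mm_add_ps(r, _mm_mul_ps(a1, _mm_set1_ps(B[4 * c + 1])));
      r = _mm_add_ps(r, _mm_mul_ps(a2, _mm_set1_ps(B[4 * c + 2])));
      r = _mm_add_ps(r, _mm_mul_ps(a3, _mm_set1_ps(B[4 * c + 3])));
      _mm_storeu_ps(D + 4 * c, r);
    }
  }
}
```

Note that the count parameter maps directly onto the outer loop, which is exactly where bulk calls amortize the per-call overhead of crossing the VM boundary, and the floats are operated on as floats throughout, with no promotion to doubles.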

I do not want anyone to get hung up on the specific API or naming convention (let's avoid bikeshedding). My three biggest goals for this API are the following: 

  • Offer the important operations used by vector math libraries

  • Operate directly on floats instead of promoting to doubles

  • Design for bulk processing

So far I have exposed three of the important operations, but there are many more. Each of these functions is backed by an SSE implementation that operates directly on the Float32Array data. Notice that each method takes a count parameter; this allows a single call to do bulk work. The results of my implementation were very encouraging:

                                 Avg (ms)   Min (ms)   Max (ms)
  Matrix multiplication (SIMD)      8.702      8.475      9.217
  mat4x4 inverse (SIMD)             7.107      6.890      7.754
  Vector transform (SIMD)           6.415      6.204      7.006

Aside from the vector transform operation (I think my SSE vector transform code is just slow), I got speedups of between 2x and 4x.

Does this have legs? I hope so, but it's not my call. If you see value in exposing this acceleration architecture in the browser, speak up. Anticipating some questions:

  • What about JavaScript? The API would be easy to expose in JavaScript.

  • What about hardware without SIMD instruction sets? Probably not an issue, since ARM, x86, and PPC all have excellent SIMD instruction sets. Other platforms can implement the API using scalar floating-point instructions.

  • What about other browsers? Again, this API would be easy to expose if it gained support.

Fast vector math operations are a requirement if we are going to start writing amazing games in the browser; I hope my proposal can help make this possible.

[This piece was reprinted from #AltDevBlogADay, a shared blog initiative started by @mike_acton devoted to giving game developers of all disciplines a place to motivate each other to write regularly about their personal game development passions.]
