Tag Archive for 'inline'

New Apparat Example

Good news, everyone. After fixing some minor bugs, the Apparat inline expansion now works to its full extent. A complete example is also available. Just change the paths in the build.properties file and compile everything using Ant.

Apparat Example

Use the inline feature with care. Apparat does not try to optimize your code and performs nothing but dead simple inlining. This can lead to slower code due to the creation of lots of local registers. Your code also gets much bigger and requires more space in memory. I am actually not a fan of manual inlining at all. I think it only makes sense to inline code if you have a powerful optimizer available that will clean up the whole mess.

The fun story about this example is that the inlined version is slower with the latest Flash Player release candidate if you have only 40,000 particles. That is why I increased the number of particles to 80,000 ;). I developed the example using an old standalone player, where the inlined version was nearly twice as fast. However, when I watched the example in the browser with the latest release candidate, it was a completely different game. Kudos to Adobe for significantly improving the Flash Player performance!

Macro Expansion

Apparat has another new feature called Macro Expansion. I talked about this with Nico Zimmermann at FFK in Cologne. Nico was using TDSI for a project, but he was not very satisfied with it because you have to inline all inverse square root tricks manually.
This is why Apparat now has macro expansion. I am actually not a big fan of it. I think a good compiler would do this for you without you having to go through all the steps. Unfortunately, writing this compiler will take longer than the couple of hours I have spent on the macro expansion today.

So if you want to have quick and dirty inlining capabilities: this is for you. It is an easy fix for a feature a lot of people have asked for. I will continue working on TAAS to implement this much better in the future.

TurboDieselSportInjection

I am definitely not good at choosing names for software projects. However, TurboDieselSportInjection is a release of my experiments from yesterday. It is a spinoff from the whole framework and allows you to inline __bytecode and of course to use the new Memory API.

Hopefully you are kind enough to provide me with some feedback. I am especially interested in Exceptions that occur when reading or writing ABC files. Have fun!

Update: TDSI is now open source!

Alchemy for ActionScript

Today I had to do something other than backend development. Since FOTB is getting closer and I could not really continue working on TAAS, I decided to add something that is easy to implement and has a huge benefit: Alchemy support in ActionScript.

So what is the idea? TAAS is part of a framework I developed to manipulate SWF, SWC and ABC files. The main focus is of course ABC files, since they contain the bytecode that gets executed.
Part of the framework are tools for control flow analysis, various bytecode analyzers and also a search-and-replace system that works on a bytecode level. There are, for instance, pattern matchers that search for bad code produced by the ASC and replace the match with a more performant set of instructions.

With all those weapons in my arsenal I thought it should be a walk in the park to implement the Alchemy features in a way that makes sense. So the first idea is to have the old functionality AS3C had, but more robust. AS3C had the __asm function, which allowed you to inline instructions. The new framework comes with the old __asm and also another cool method: __bytecode! This will inline raw bytes. This also means you have to know in advance the constant pool indices of all the variables you want to use, so __asm will still be your friend.

With the __bytecode method it is already possible to use all Alchemy features again. It would also be possible with the __asm method, but writing plain bytes is simply more elitist. In order to make it easy for the developer I want a high-level API. Having a class with some static methods is nice of course, but also slow. Alchemy is fast because the opcodes that read from and write to a ByteArray are not method calls. They are low-level Flash Player features.

The first attempt was to write a Memory class that allows you to use the Alchemy features. Each method of this class contains a raw bytecode implementation followed by an equivalent ActionScript implementation. This means that if you do not use the optimizer everything will still work, only 1000 times slower. When looking at the Memory class there is another tool of the framework that becomes very helpful. The __bytecode and ActionScript implementations should not co-exist in the final method. So when we inline the bytecode, a dead-code elimination will simply clean up afterwards. Since the 0x47 byte, for instance, is “ReturnVoid”, the ActionScript code that follows it can be dropped. That code is now unreachable.

Step two is to replace all calls to the Memory class with the correct Alchemy opcode. This was really simple, and the result is a really fast way to access a ByteArray while still keeping a comfortable API. Of course one might think now that the __bytecode method becomes useless, since no methods of the Memory class are called at all. But if anyone is crazy enough to access the Memory class untyped, with a runtime namespace for instance, you will still be happy to have the optimized code inside. In some circumstances it is simply impossible to figure out that someone called Memory.writeByte(). End of the story: your calls to a ByteArray are always optimized in the best way possible.
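A minimal sketch of what such a call replacement could look like, written in Python purely for illustration. It is not the framework's actual code: the tuple representation, the `MEMORY_OPS` table and the `replace_memory_calls` function are hypothetical names, and the pass assumes every `GetLex Memory` belongs to a statically resolvable call, which is exactly the case that cannot always be guaranteed.

```python
# Hypothetical peephole pass, not the real framework implementation.
# Instructions are modelled as (opcode name, operand) tuples.
MEMORY_OPS = {
    "writeFloat": "SetFloat",
    "writeInt": "SetInt",
    "readFloat": "GetFloat",
    "readInt": "GetInt",
}


def replace_memory_calls(code):
    """Swap calls on the Memory class for the matching Alchemy opcode."""
    out = []
    for op, arg in code:
        if op == "GetLex" and arg == "Memory":
            continue  # the receiver object is not needed by a raw opcode
        if op in ("CallProperty", "CallPropVoid") and arg in MEMORY_OPS:
            out.append((MEMORY_OPS[arg], None))
        else:
            out.append((op, arg))
    return out


# Fragment shaped like the compiler output for
# Memory.writeInt( 0x5f3759df - ( Memory.readInt( 0 ) >> 1 ), 0 )
write_int = [
    ("GetLex", "Memory"),
    ("PushInt", 0x5F3759DF),
    ("GetLex", "Memory"),
    ("PushByte", 0),
    ("CallProperty", "readInt"),
    ("PushByte", 1),
    ("ShiftRight", None),
    ("Subtract", None),
    ("PushByte", 0),
    ("CallPropVoid", "writeInt"),
]

rewritten = [op for op, _ in replace_memory_calls(write_int)]
print(rewritten)
```

The arguments are already on the stack in the order the opcode expects, so in this simplified model only the receiver lookup and the call itself need to change.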

This is an example of the Memory.readByte() method before applying optimizations:

0x000000       GetLocal0
0x000001       PushScope
0x000002       FindPropStrict       QName(PackageNamespace("com.joa_ebert.abc.bytecode.asbridge"), "__bytecode")
0x000004       PushShort            0xd1
0x000007       PushByte             0x35
0x000009       PushByte             0x48
0x00000b       CallPropVoid         QName(PackageNamespace("com.joa_ebert.abc.bytecode.asbridge"), "__bytecode"), 3
0x00000e       GetLex               QName(PackageNamespace("flash.system"), "ApplicationDomain")
0x000010       GetProperty          QName(PackageNamespace(""), "currentDomain")
0x000012       GetProperty          QName(PackageNamespace(""), "domainMemory")
0x000014       GetLocal1
0x000015       SetProperty          QName(PackageNamespace(""), "position")
0x000017       GetLex               QName(PackageNamespace("flash.system"), "ApplicationDomain")
0x000019       GetProperty          QName(PackageNamespace(""), "currentDomain")
0x00001b       GetProperty          QName(PackageNamespace(""), "domainMemory")
0x00001d       CallProperty         QName(PackageNamespace(""), "readUnsignedByte"), 0
0x000020       ReturnValue

The same method after inlining the bytes and applying various other analyses such as dead-code elimination:

0x000000       GetLocal0
0x000001       PushScope
0x000000       GetLocal1
0x000001       GetByte
0x000002       ReturnValue
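The transformation between the two listings can be sketched roughly as follows, again in Python for illustration only. The function names and the tuple format are hypothetical. The opcode table holds just the four byte values that appear in this post, and the dead-code step only handles straight-line code without branches; a real pass would work on the control flow graph.

```python
# Hypothetical mini-model of __bytecode inlining; not the framework's code.
OPCODES = {0xD1: "GetLocal1", 0x35: "GetByte", 0x47: "ReturnVoid", 0x48: "ReturnValue"}
RETURNS = {"ReturnVoid", "ReturnValue"}
PUSHES = {"PushByte", "PushShort", "PushInt"}


def inline_bytecode(code):
    """Replace FindPropStrict / pushes / CallPropVoid __bytecode with decoded ops."""
    out, i = [], 0
    while i < len(code):
        op, arg = code[i]
        if op == "FindPropStrict" and arg == "__bytecode":
            i += 1
            raw = []
            while code[i][0] in PUSHES:  # every pushed constant is one raw byte
                raw.append(code[i][1])
                i += 1
            assert code[i] == ("CallPropVoid", "__bytecode")
            out.extend((OPCODES[b], None) for b in raw)
        else:
            out.append((op, arg))
        i += 1
    return out


def drop_after_return(code):
    """Straight-line dead-code elimination: cut everything after the first return."""
    for i, (op, _) in enumerate(code):
        if op in RETURNS:
            return code[: i + 1]
    return code


read_byte = [
    ("GetLocal0", None),
    ("PushScope", None),
    ("FindPropStrict", "__bytecode"),
    ("PushShort", 0xD1),
    ("PushByte", 0x35),
    ("PushByte", 0x48),
    ("CallPropVoid", "__bytecode"),
    ("GetLex", "ApplicationDomain"),
    ("GetProperty", "currentDomain"),
    ("GetProperty", "domainMemory"),
    ("GetLocal1", None),
    ("SetProperty", "position"),
    ("GetLex", "ApplicationDomain"),
    ("GetProperty", "currentDomain"),
    ("GetProperty", "domainMemory"),
    ("CallProperty", "readUnsignedByte"),
    ("ReturnValue", None),
]

optimized = [op for op, _ in drop_after_return(inline_bytecode(read_byte))]
print(optimized)
```

The three pushed bytes 0xd1, 0x35 and 0x48 decode to GetLocal1, GetByte and ReturnValue, so the ActionScript fallback behind them becomes unreachable and is removed.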

This is an example of the famous inverse square root using the Memory API:

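private function invSqrt( value: Number ): Number
{
	var half: Number = 0.5 * value;
	// Store the value as a 32-bit float at address 0.
	Memory.writeFloat( value, 0 );
	// Reinterpret the same four bytes as an int and apply the magic constant.
	Memory.writeInt( 0x5f3759df - ( Memory.readInt( 0 ) >> 1 ), 0 );
	// Read the first approximation back as a float.
	value = Memory.readFloat( 0 );
	// Refine the approximation with one Newton-Raphson step.
	value = value * ( 1.5 - half * value * value );
	return value;
}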
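To see what the reads and writes at address 0 actually do, here is a small Python sketch of the same trick. The `memory` bytearray is only a stand-in for the domain memory ByteArray, `inv_sqrt` is a hypothetical name, and a little-endian layout is assumed. It is meant for positive, finite inputs that fit into a 32-bit float.

```python
import struct

# Stand-in for the domain memory (assumption): four bytes at address 0.
memory = bytearray(4)


def inv_sqrt(value):
    """Approximate 1 / sqrt(value) by reinterpreting float bits as an int."""
    half = 0.5 * value
    struct.pack_into("<f", memory, 0, value)        # like Memory.writeFloat( value, 0 )
    bits = struct.unpack_from("<i", memory, 0)[0]   # like Memory.readInt( 0 )
    struct.pack_into("<i", memory, 0, 0x5F3759DF - (bits >> 1))  # like Memory.writeInt
    value = struct.unpack_from("<f", memory, 0)[0]  # like Memory.readFloat( 0 )
    return value * (1.5 - half * value * value)     # one Newton-Raphson step


print(inv_sqrt(4.0))  # close to 0.5, not exact
```

No conversion happens between the float write and the int read. The same four bytes are simply interpreted in two different ways, which is why cheap raw memory access matters so much for this method.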

The same method before optimization in bytecode representation:

0x000000       GetLocal0
0x000001       PushScope
0x000002       PushDouble           0.5
0x000004       GetLocal1
0x000005       Multiply
0x000006       ConvertDouble
0x000007       SetLocal2
0x000008       GetLex               QName(PackageNamespace("com.joa_ebert.abc.bytecode.asbridge"), "Memory")
0x00000a       GetLocal1
0x00000b       PushByte             0x0
0x00000d       CallPropVoid         QName(PackageNamespace(""), "writeFloat"), 2
0x000010       GetLex               QName(PackageNamespace("com.joa_ebert.abc.bytecode.asbridge"), "Memory")
0x000012       PushInt              0x5f3759df
0x000014       GetLex               QName(PackageNamespace("com.joa_ebert.abc.bytecode.asbridge"), "Memory")
0x000016       PushByte             0x0
0x000018       CallProperty         QName(PackageNamespace(""), "readInt"), 1
0x00001b       PushByte             0x1
0x00001d       ShiftRight
0x00001e       Subtract
0x00001f       PushByte             0x0
0x000021       CallPropVoid         QName(PackageNamespace(""), "writeInt"), 2
0x000024       GetLex               QName(PackageNamespace("com.joa_ebert.abc.bytecode.asbridge"), "Memory")
0x000026       PushByte             0x0
0x000028       CallProperty         QName(PackageNamespace(""), "readFloat"), 1
0x00002b       ConvertDouble
0x00002c       SetLocal1
0x00002d       GetLocal1
0x00002e       PushDouble           1.5
0x000030       GetLocal2
0x000031       GetLocal1
0x000032       Multiply
0x000033       GetLocal1
0x000034       Multiply
0x000035       Subtract
0x000036       Multiply
0x000037       ConvertDouble
0x000038       SetLocal1
0x000039       GetLocal1
0x00003a       ReturnValue

The same method after inlining the Memory API:

0x000000       GetLocal0
0x000001       PushScope
0x000002       PushDouble           0.5
0x000004       GetLocal1
0x000005       Multiply
0x000006       ConvertDouble
0x000007       SetLocal2
0x00000a       GetLocal1
0x00000b       PushByte             0x0
0x000000       SetFloat
0x000012       PushInt              0x5f3759df
0x000016       PushByte             0x0
0x000000       GetInt
0x00001b       PushByte             0x1
0x00001d       ShiftRight
0x00001e       Subtract
0x00001f       PushByte             0x0
0x000000       SetInt
0x000026       PushByte             0x0
0x000000       GetFloat
0x00002b       ConvertDouble
0x00002c       SetLocal1
0x00002d       GetLocal1
0x00002e       PushDouble           1.5
0x000030       GetLocal2
0x000031       GetLocal1
0x000032       Multiply
0x000033       GetLocal1
0x000034       Multiply
0x000035       Subtract
0x000036       Multiply
0x000037       ConvertDouble
0x000038       SetLocal1
0x000039       GetLocal1
0x00003a       ReturnValue

As you can see, the method calls are gone and only the raw Alchemy opcodes remain, which is blazing fast. Now the next job is to finish TAAS. Once TAAS is complete, even a method like the inverse square root might be inlined and optimized much better. I did a simple test using the Lorenz attractor from before, and replacing the Vector.<uint> buffer with a ByteArray gave a performance boost of about 5 fps. Afterwards I tried getting rid of the Particle class completely and the framerate dropped a little bit. But imagine having the x, y and z coordinates of 300,000 particles stored in an Array. It was still faster than the old version, but not as fast as combining the power of Alchemy with simple ActionScript optimizations like linked lists.