Today I had to do something else than backend development and since FOTB is getting closer and I could not really continue working on TAAS I decided to add something which is easy to implement and has a huge benifit: Alchemy support in ActionScript.
So what is the idea? TAAS is part of a framework I developed to manipulate SWF, SWC and ABC files. The main focus are of course ABC files since they contain the bytecode which gets executed.
Part of the framework are tools for control flow analysis, various bytecode analyzers and also a search-and-replace system which work on a bytecode level. There are for instance pattern matchers that search for bad code produced by the ASC and replace the match with a more performant set of instructions.
With all those weapons in my arsenal I thought it should be a walk in the park to implement the Alchemy features in a way that makes sense. So the first idea is to have the old functionality AS3C had but more robust. AS3C had a feature that was the __asm function which allowed you to inline instructions. The new framework comes with the old __asm and also another cool method: __bytecode! This will inline raw bytes. This means also you would have to know all the indices for variables you want to use from the constant pool in advance so __asm will still be your friend.
With the __bytecode method it is already possible to use all Alchemy features again. It would also be possible with the __asm method but writing plain bytes is simply more elitist. In order to make it easy for the developer I want a high-level API. Having a class with some static methods is nice of course but also slow. Alchemy is fast because those opcodes that write and read from a ByteArray are no method calls. They are low-level FlashPlayer features.
The first attempt was to write a Memory class that allows you to use the Alchemy features. This class contains raw bytecode implementations and ActionScript code. This means if you do not use the optimizer everything will still work — only 1000 times slower. When looking at the memory class there is another tool of the framework that becomes very helpful. Both the __bytecode and ActionScript stuff should not co-exist with each other. So when we inline the bytecode a dead-code-elimination will simply cleanup afterwards. Since the 0x47 byte for instance is “ReturnVoid” the ActionScript code which would follow afterwards can be dropped. That code is now unreachable.
Step two is to replace all calls to the Memory class with the correct Alchemy opcode. This was really simple and the result is a really really fast way to access a ByteArray while still maintaining a high comfort. Of course one might think now that the __bytecode method becomes useless since no methods of the Memory class are called at all. But if anyone is crazy enough to access the Memory class untyped with a runtime namespace for instance you are still happy to have the code optimized inside. In some circumstances it is simply impossible to figure out that someone called Memory.writeByte(). End of the story: your calls to a ByteArray are always optimized in the best way possible.
This is an example of the Memory.readByte() method before applying optimizations:
0x000000 GetLocal0
0x000001 PushScope
0x000002 FindPropStrict QName(PackageNamespace("com.joa_ebert.abc.bytecode.asbridge"), "__bytecode")
0x000004 PushShort 0xd1
0x000007 PushByte 0x35
0x000009 PushByte 0x48
0x00000b CallPropVoid QName(PackageNamespace("com.joa_ebert.abc.bytecode.asbridge"), "__bytecode"), 3
0x00000e GetLex QName(PackageNamespace("flash.system"), "ApplicationDomain")
0x000010 GetProperty QName(PackageNamespace(""), "currentDomain")
0x000012 GetProperty QName(PackageNamespace(""), "domainMemory")
0x000014 GetLocal1
0x000015 SetProperty QName(PackageNamespace(""), "position")
0x000017 GetLex QName(PackageNamespace("flash.system"), "ApplicationDomain")
0x000019 GetProperty QName(PackageNamespace(""), "currentDomain")
0x00001b GetProperty QName(PackageNamespace(""), "domainMemory")
0x00001d CallProperty QName(PackageNamespace(""), "readUnsignedByte"), 0
0x000020 ReturnValue
The same method after inlining the bytes and applying various other analysis like dead-code-elimination:
0x000000 GetLocal0
0x000001 PushScope
0x000000 GetLocal1
0x000001 GetByte
0x000002 ReturnValue
This is an example of the famous inverse square root using the Memory API:
private function invSqrt( value: Number ): Number
{
var half: Number = 0.5 * value;
Memory.writeFloat( value, 0 );
Memory.writeInt( 0x5f3759df - ( Memory.readInt( 0 ) >> 1 ), 0 );
value = Memory.readFloat( 0 );
value = value * ( 1.5 - half * value * value );
return value;
}
The same method before optimization in bytecode representation:
0x000000 GetLocal0
0x000001 PushScope
0x000002 PushDouble 0.5
0x000004 GetLocal1
0x000005 Multiply
0x000006 ConvertDouble
0x000007 SetLocal2
0x000008 GetLex QName(PackageNamespace("com.joa_ebert.abc.bytecode.asbridge"), "Memory")
0x00000a GetLocal1
0x00000b PushByte 0x0
0x00000d CallPropVoid QName(PackageNamespace(""), "writeFloat"), 2
0x000010 GetLex QName(PackageNamespace("com.joa_ebert.abc.bytecode.asbridge"), "Memory")
0x000012 PushInt 0x5f3759df
0x000014 GetLex QName(PackageNamespace("com.joa_ebert.abc.bytecode.asbridge"), "Memory")
0x000016 PushByte 0x0
0x000018 CallProperty QName(PackageNamespace(""), "readInt"), 1
0x00001b PushByte 0x1
0x00001d ShiftRight
0x00001e Subtract
0x00001f PushByte 0x0
0x000021 CallPropVoid QName(PackageNamespace(""), "writeInt"), 2
0x000024 GetLex QName(PackageNamespace("com.joa_ebert.abc.bytecode.asbridge"), "Memory")
0x000026 PushByte 0x0
0x000028 CallProperty QName(PackageNamespace(""), "readFloat"), 1
0x00002b ConvertDouble
0x00002c SetLocal1
0x00002d GetLocal1
0x00002e PushDouble 1.5
0x000030 GetLocal2
0x000031 GetLocal1
0x000032 Multiply
0x000033 GetLocal1
0x000034 Multiply
0x000035 Subtract
0x000036 Multiply
0x000037 ConvertDouble
0x000038 SetLocal1
0x000039 GetLocal1
0x00003a ReturnValue
The same method after inlining the Memory API:
0x000000 GetLocal0
0x000001 PushScope
0x000002 PushDouble 0.5
0x000004 GetLocal1
0x000005 Multiply
0x000006 ConvertDouble
0x000007 SetLocal2
0x00000a GetLocal1
0x00000b PushByte 0x0
0x000000 SetFloat
0x000012 PushInt 0x5f3759df
0x000016 PushByte 0x0
0x000000 GetInt
0x00001b PushByte 0x1
0x00001d ShiftRight
0x00001e Subtract
0x00001f PushByte 0x0
0x000000 SetInt
0x000026 PushByte 0x0
0x000000 GetFloat
0x00002b ConvertDouble
0x00002c SetLocal1
0x00002d GetLocal1
0x00002e PushDouble 1.5
0x000030 GetLocal2
0x000031 GetLocal1
0x000032 Multiply
0x000033 GetLocal1
0x000034 Multiply
0x000035 Subtract
0x000036 Multiply
0x000037 ConvertDouble
0x000038 SetLocal1
0x000039 GetLocal1
0x00003a ReturnValue
As you can see this is blazing fast. Now the next job is to finish TAAS. Once TAAS is complete even a method like the inverse square root might be inlined and optimized much better. I did a simple test using the Lorenz attractor from before and replacing the Vector.<uint> buffer with a ByteArray gave a performance boost of about 5fps. Afterwards I tried getting rid of the Particle class completly and the framerate dropped a little bit. But imagine having 300.000 particle’s x, y and z coodrinates stored in an Array. It was still faster than the old version but not as fast as combining the power of Alchemy with simple ActionScript optimizations like linked lists.

Nice work. I’m really excited to see this go out into the world. :)
Hi Joa,
this is all incredible stuff! About 8 years ago I wrote a JVM bytecode optimiser in perl because I noticed that the java compiler tended to ignore stack based operations. Unfortunately I don’t have the code anymore :-( However I can still remember some of the ideas. I’ll share them with you.
Unfortunatley abc bytecode does not seem to have very good stack manipulation operands (no ROT or OVER as far as I can tell), however it does have DUP and SWAP.
Looking at you inlined byte code I am wondering whether all your fast pushbyte-getint and pushbyte-setint etc could all be done on the stack itself. (as a side note, and I am probably not understanding something because I am new to this, but couldn’t you implement TAAS using the stack? Sorry if that is a stupid question)
Here is an attempt at an example
0×000000 GetLocal0
0×000001 PushScope
0×000002 PushDouble 0.5
0×000004 GetLocal1
0×000005 Multiply
0×000006 ConvertDouble
0×000007 SetLocal2
0x00000a GetLocal1
could be (without address label as I am not sure of opcode sizes)
GetLocal0
PushScope
GetLocal1
Dup
PushDouble 0.5
Multiply
ConvertDouble
Swap
Here you end up with local1 at the top of the stack and local 2 below it. Now I don’t understand how the writefloat(0) and then readint(0) works unfortunately and so I have not attempted to optimize that BUT I am certain that using the stack more will increase you optimisations further.
If you could point me at the docs for the readfloat stuff I will try to write opcodes for the above example that use the stack more.
Do you think this is a fruitful approach?
Yes absolutely. Do you know about the work of the Sable research group at McGill University? They have developed Soot which is a Java optimization framework that converts the stack based representation into three address code and furthermore converts the TAC into SSA representation.
This is exactly what TAAS is all about as well. The stack based code gets converted into a TAC representation with stack SSA for branches. I am also working on a SSA for local registers but unfortunately I am currently pretty busy since we are working on the AudioTool 1.0.
Branching into pruned SSA using liveness analysis and dominance frontiers is nearly complete. Getting out of SSA is a different story and not yet implemented as well as converting the TAAS control flow graph back into ca ontrol flow graph of bytecode vertices which might then be converted back into a linear list of operations.
The only cool thing is that the framework is very robust in terms of bytecode. This is absolutely independent of TAAS and works already very well as this example demonstrates.
You can read more about the memory API here:
Wow this looks like it coduld be some really impressive work!
I may be a little confused but is this all haxe based or have you written a sepparate compiler?
It is like AS3C a handwritten post-compile time system and has nothing to do with the ActionScript or haXe at all. Only the ActionScript APIs of course. But you can write the same APIs for haXe as well.
The same tool will be able to optimize any SWF file. It does not make a difference if you used the haXe compiler or ASC.