A couple more HPROF runs really helped dial in the Render() call path, which we assume is the most frequently executed one. That assumption holds when the Timer Base Tick (TBT) interval is longer than the frame interval; if the TBT interval is shorter, less so. To recap, the demo runs at 30 FPS (1000/30 ≈ 33.3 ms per frame, which the code currently truncates to 33 ms) with a 100 ms TBT.
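The arithmetic behind those demo settings can be sketched as follows. This is a minimal illustration, assuming the truncation is plain integer division of 1000 ms by the frame rate; the class and the helpers frameIntervalMillis, effectiveFps and framesPerTick are hypothetical names, not AGE APIs.

```java
public class FrameTiming {
    /** Frame interval in whole milliseconds; integer division truncates 33.33 to 33. */
    static long frameIntervalMillis(int fps) {
        return 1000L / fps;
    }

    /** Frame rate actually achieved when the truncated interval is used. */
    static double effectiveFps(int fps) {
        return 1000.0 / frameIntervalMillis(fps);
    }

    /** Roughly how many Render() calls happen per Timer Base Tick. */
    static double framesPerTick(long tbtMillis, int fps) {
        return (double) tbtMillis / frameIntervalMillis(fps);
    }

    public static void main(String[] args) {
        final int fps = 30;          // demo frame rate
        final long tbtMillis = 100;  // demo Timer Base Tick interval (assumed in ms)
        System.out.println("frame interval (ms): " + frameIntervalMillis(fps));
        System.out.println(String.format("effective FPS: %.2f", effectiveFps(fps)));
        System.out.println(String.format("frames per tick: %.2f", framesPerTick(tbtMillis, fps)));
    }
}
```

With these numbers, Render() runs about three times per tick, which is why it dominates the profile. Note that truncating the interval nudges the effective rate slightly above 30 FPS (1000/33 ≈ 30.3).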
Our goal is simple: do not generate per-frame "garbage". Even something as simple as ArrayList.iterator() allocates a small object, and these objects accumulate between GC cycles and ultimately trigger a GC sooner than it would otherwise occur. The fastest allocation is the one that is avoided.
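One common way to avoid the iterator allocation is to walk an ArrayList by index instead of with a for-each loop, which calls iterator() under the hood. The sketch below contrasts the two; the Renderable interface and the RenderLoop class are hypothetical stand-ins, not AGE types. Whether the iterator actually reaches the heap depends on the VM (some JITs can eliminate it via escape analysis), so the indexed form is simply the one that never asks for it.

```java
import java.util.ArrayList;

public class RenderLoop {
    /** Hypothetical per-frame callback; not an AGE type. */
    interface Renderable {
        void render();
    }

    // Created once and reused every frame, so the list itself is not per-frame garbage.
    private final ArrayList<Renderable> renderables = new ArrayList<>();

    void add(Renderable r) {
        renderables.add(r);
    }

    /** For-each over a List calls iterator(), requesting a new Iterator object each frame. */
    void renderWithIterator() {
        for (Renderable r : renderables) {
            r.render();
        }
    }

    /** Indexed access on an ArrayList needs no Iterator object at all. */
    void renderIndexed() {
        final ArrayList<Renderable> list = renderables;
        for (int i = 0, n = list.size(); i < n; i++) {
            list.get(i).render();
        }
    }

    public static void main(String[] args) {
        RenderLoop loop = new RenderLoop();
        final int[] calls = {0};
        // The lambdas are allocated once here, outside the per-frame path.
        loop.add(() -> calls[0]++);
        loop.add(() -> calls[0]++);
        loop.renderIndexed();
        loop.renderWithIterator();
        System.out.println("render calls: " + calls[0]);
    }
}
```

One trade-off: the indexed loop gives up the iterator's ConcurrentModificationException check, so it assumes the list is not modified while a frame is being rendered.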
GC is inevitable, but we don't want the framework to be a cause of it; we leave that privilege to the game-specific code. For example, after the most recent optimizations, our initial rotating-cube demo incurs a concurrent GC every 25-30 seconds.
We are concentrating on the top 10 methods (by CPU time) in the HPROF output. Only 7 of those are now AGE methods; the remaining 3 come from JRE java.util.concurrent classes.