[Pharo-project] Cog VM -- Thanks and Performance / Optimization Questions
John B Thiel
jbthiel at gmail.com
Thu Feb 17 15:21:01 CET 2011
To everyone: thanks for your great work on Pharo and Squeak. To
Eliot Miranda, Ian Piumarta, and all the VM/JIT gurus: special thanks
for the Squeak Cog VM and its precursors. I have been keenly
anticipating Cog for a decade or so, and it is really hitting its
stride with the latest builds.
I like to code with awareness of performance issues. Can you tell me,
or point me to, some performance and efficiency tips for Cog and the
Squeak compiler: detail on which methods are inlined, which is best
among alternatives, etc.? For example, I understand #to:do: is
inlined; what about #to:by:do:, #timesRepeat: and #repeat? Basically,
I would like to read a full overview of which core methods are
specially optimized (or planned to be).
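To make the question concrete, here is a minimal workspace sketch of
the loop forms in question, each timed with Time millisecondsToRun:.
The iteration count and the names n, count and timings are purely
illustrative; they are not taken from my actual code.

```smalltalk
"Workspace sketch: one counting loop written three ways, each timed.
 n, count and timings are illustrative names only."
| n count timings |
n := 10000000.
timings := OrderedCollection new.

count := 0.
timings add: (#to:do: -> (Time millisecondsToRun: [
    1 to: n do: [:i | count := count + 1]])).

count := 0.
timings add: (#to:by:do: -> (Time millisecondsToRun: [
    1 to: n by: 1 do: [:i | count := count + 1]])).

count := 0.
timings add: (#timesRepeat: -> (Time millisecondsToRun: [
    n timesRepeat: [count := count + 1]])).

timings   "inspect or print this to compare the three forms"
```

#repeat is left out of the sketch because it needs an explicit exit
from the loop, which would blur the comparison.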
I know about the list of NoLookup primitives, as per Object
class>>howToModifyPrimitives. Is that list still valid?
What do you think is a reasonable speed factor for number-crunching
Squeak code vs C? I am seeing about 20x slower at the semi-large
scale, which surprised me a bit because I got about 10x on smaller
tests, and a simple fib: with beautiful Cog is now about 3x (wow!).
That range, from 3x for a tiny tight loop to 20x for general
multi-class computation, seems a bit wide. Is it about what you
would expect?
My profiling does not reveal any hotspots as such. It is basically
2, 3, 5% scattered around, so I imagine this is just the general
VM/JIT overhead as you scale up: referencing distant objects, slots,
dispatch lookups, more cache misses, etc. But maybe I am using some
backwater loop/control methods or techniques that could be tuned up.
For example, I seem to recall a trace at some point showing
#timesRepeat: taking 10% of the time (?!). Also, I recall reading
about an anomaly with BlockClosures, something like a block being
rebuilt every time through the loop. Has that been fixed? Are there
any other gotchas to watch for currently?
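If the closure anomaly is about a block being created afresh on each
iteration (my recollection is vague, so this is an assumption), then
the obvious workaround would be to create the block once, outside the
loop. A minimal sketch, with illustrative names only:

```smalltalk
"Workspace sketch: the helper block is created once, before the loop,
 so no new closure is allocated per iteration.
 square and sum are illustrative names."
| square sum |
square := [:x | x * x].
sum := 0.
1 to: 1000 do: [:i | sum := sum + (square value: i)].
sum
```

Whether this still makes a measurable difference under Cog is part of
what I am asking.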
(Also, any notoriously slow subsystems? For example, Transcript
writing is glacial.)
The Squeak bytecode compiler looks fairly straightforward and
non-optimizing, just a statement-by-statement translation. So it
misses chances, for example, to store and reuse a value instead of
popping it. I see lots of redundant sequences emitted. Are those
kinds of things now optimized out by Cog, or would tighter bytecode
be another potential optimization path? (Is that what the Opal
project is targeting?)
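For reference, the emitted bytecodes of a method can be listed from a
workspace. A minimal sketch, using Integer>>benchFib only as a
convenient example method; any compiled method would do:

```smalltalk
"Print this expression to get a symbolic listing of the bytecodes
 the compiler emitted for the method."
(Integer >> #benchFib) symbolic
```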