[Pharo-project] Cog VM -- Thanks and Performance / Optimization Questions

Schwab,Wilhelm K bschwab at anest.ufl.edu
Thu Feb 17 16:06:06 CET 2011


A nice mix is to do memory management and logic (e.g., deciding when to stop iterating) in Smalltalk and to have C-callable "primitives" for the heavy loops.  A great way to get the latter is to define the functions with extern "C" -- then you can still use C++ features (streams, templates) in the function bodies.  IMHO, C++ with some suitable operator overloading does a fairly nice job of formula translation, and it is a good fit for fixed-size arithmetic.
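
For illustration, here is roughly what I mean -- a minimal sketch, with
a made-up function name and signature (any real primitive would have to
follow your FFI's calling conventions):

    // C linkage on the outside, so Smalltalk's FFI can find and call
    // it; full C++ remains available on the inside.
    #include <cstddef>
    #include <numeric>

    extern "C" double dotProduct(const double* a, const double* b,
                                 std::size_t n)
    {
        // The heavy loop lives here; <numeric> does the work.
        return std::inner_product(a, a + n, b, 0.0);
    }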

If Cog can make the above optional, so much the better.  


________________________________________
From: pharo-project-bounces at lists.gforge.inria.fr [pharo-project-bounces at lists.gforge.inria.fr] On Behalf Of John B Thiel [jbthiel at gmail.com]
Sent: Thursday, February 17, 2011 9:21 AM
To: pharo-project at lists.gforge.inria.fr
Subject: [Pharo-project] Cog VM -- Thanks and Performance / Optimization Questions

Cog VM -- Thanks and Performance / Optimization Questions


To everyone: thanks for your great work on Pharo and Squeak.  And to
Eliot Miranda, Ian Piumarta, and all the VM/JIT gurus, special thanks
for the Cog Squeak VM and its precursors, which I had been keenly
anticipating for a decade or so and which is really hitting its stride
with the latest builds.

I like to code with awareness of performance issues.  Can you give me,
or point me to, some performance and efficiency tips for Cog and the
Squeak compiler -- details on which methods are inlined, which are the
best among alternatives, etc.?  For example, I understand #to:do: is
inlined -- what about #to:by:do:, #timesRepeat:, and #repeat?
Basically, I would like to read a full overview of which core methods
are specially optimized (or planned to be).
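
(To make the kind of difference I mean concrete, a sketch -- assuming
#to:do: is compiler-inlined and #timesRepeat: is an ordinary send,
which is exactly the thing I would like confirmed:)

    | sum |
    sum := 0.
    "Inlined: the compiler emits a plain counted loop -- no block
    object, no send for the iteration itself."
    1 to: 1000000 do: [:i | sum := sum + i].

    sum := 0.
    "Ordinary send: a real BlockClosure, dispatched like any message."
    1000000 timesRepeat: [sum := sum + 1].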

I know about the list of NoLookup primitives, per Object
class>>howToModifyPrimitives -- assuming that is still valid?

What do you think is a reasonable speed factor for number-crunching
Squeak code vs. C?  I am seeing about 20x slower at the semi-large
scale, which surprised me a bit because I got about 10x on smaller
tests, and a simple fib: with beautiful Cog is now about 3x (wow!).
That range -- 3x for a tiny tight loop to 20x for general multi-class
computation -- seems a bit wide; is it about what you would expect?
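
(For reference, the fib: I am timing is the usual naive doubly
recursive version -- my sketch, defined on whatever class is handy and
timed with something like [someBenchObject fib: 30] timeToRun:)

    fib: n
        "Naive Fibonacci: send-heavy, so it mostly measures dispatch
        and SmallInteger arithmetic."
        ^n < 2
            ifTrue: [1]
            ifFalse: [(self fib: n - 1) + (self fib: n - 2)]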

My profiling does not reveal any hotspots as such -- it's basically
2, 3, 5% scattered around, so I imagine this is just the general
VM/JIT overhead as you scale up: referencing distant objects, slots,
dispatch lookups, more cache misses, etc.  But maybe I am using some
backwater loop/control methods or techniques that could be tuned up.
For example, I seem to recall a trace at some point showing
#timesRepeat: taking 10% of the time (?!).  Also, I recall reading
about an anomaly with BlockClosures -- something like the closure
being rebuilt every time through the loop -- has that been fixed?
Any other gotchas to watch for currently?
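
(The closure gotcha I mean, sketched -- whether the inner block really
is rebuilt per pass is exactly my question:)

    | data sum inner |
    data := 1 to: 100.
    sum := 0.
    "Inner block written out inside the loop body:"
    1 to: 10000 do: [:i |
        data do: [:each | sum := sum + each]].

    "Same work with the block hoisted, in case it is otherwise
    re-created on every iteration:"
    sum := 0.
    inner := [:each | sum := sum + each].
    1 to: 10000 do: [:i | data do: inner].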

(Also, any notoriously slow subsystems?  For example, Transcript
writing is glacial.)
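
(Buffering works around it, unless I am missing something better --
collect the output in a stream and show it once:)

    | buf |
    buf := WriteStream on: String new.
    1 to: 1000 do: [:i | buf print: i; cr].
    Transcript show: buf contents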

The Squeak bytecode compiler looks fairly straightforward and
non-optimizing -- just statement-by-statement translation.  So it
misses, e.g., chances to store and reuse a value instead of popping
it, and I see lots of redundant sequences emitted.  Are those kinds of
things now optimized out by Cog, or would tighter bytecode be another
potential optimization path?  (Is that what the Opal project is
targeting?)
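
(Easy to see for yourself -- a print-it on something like

    (Integer >> #factorial) symbolic

shows the bytecodes for a method, push/pop churn and all.)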

-- jbthiel




