[Pharo-project] Cog VM -- Thanks and Performance / Optimization Questions [Micro-Bench Loop]

John B Thiel jbthiel at gmail.com
Thu Feb 24 23:13:39 CET 2011


Thanks Eliot, Stef and all for the good answers on Cog VM inlining and
implementation details.  It is also exciting news about Eliot and
MarkusD's work on an adaptive optimizer.  By "completely close the gap
to C", how close do you mean?  The Strongtalk work posits a theoretical
best of about 2x, so I suppose you mean somewhere in the 2x-5x
neighborhood vs machine code?

The 20x slower macro algorithm I mentioned is too large to post.
Below is a simple micro-loop I made up to explore some aspects of
closures, loops and relative VM timings.  The C-equivalent machine
code takes 400msec on my test machine.  In Squeak/Pharo, I see a huge
range of speed factors with this, from 7x to 220x (!!) slower than
machine code, on current and near-history VMs (Cog and standard
Squeak).

Consider:

"A = to:by:do:  loop"
benchLooperA
	| sum |
	sum := 1e8.
	sum to: 1 by: -1 do: [:x |
		sum := (sum bitShift: -1) + x].
	^sum

"B = timesRepeat loop"
benchLooperB
	| sum count |
	count := sum := 1e8.
	sum timesRepeat: [
		sum := (sum bitShift: -1) + count.
		count := count - 1].
	^sum


A and B compute the same result, except that one uses #to:by:do: and
the other #timesRepeat:.  The near-machine code for this computation
runs in about 400 msec (0.4 sec) on my test system (x86 Windows 7,
N450 Atom CPU, 1.66GHz).
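
The near-machine-code version itself is not included in this post.  A
minimal C sketch of an equivalent loop might look like the following.
The function name bench_looper and the use of 64-bit integers are
assumptions made here for illustration; this is not the code that was
actually timed.

```c
#include <stdint.h>

/* Hypothetical C counterpart of benchLooperA / benchLooperB.
   n plays the role of 1e8 in the Smalltalk methods. */
int64_t bench_looper(int64_t n)
{
    int64_t sum = n;
    for (int64_t x = n; x >= 1; x--) {
        /* sum never goes negative here, so >> 1 halves it
           the same way bitShift: -1 does in the Smalltalk code */
        sum = (sum >> 1) + x;
    }
    return sum;
}
```

Calling bench_looper(100000000) and timing it with the platform's clock
would correspond to the 1e8-iteration runs above.  Because each
iteration depends on the previous value of sum, the loop cannot be
trivially skipped by the compiler.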

On the following 3 test VM/image combinations:

1.  Squeak 4.1 stock (vm = Squeak 4.0.2, 2010apr3)
2.  Pharo 1.1.1 stock (vm = Cog 2010sep21)
3.  Pharo 1.2rc2 + Cog2011feb6

I get the following times for
   A   Time millisecondsToRun: [ Sandbox benchLooperA ]
   B   Time millisecondsToRun: [ Sandbox benchLooperB ]

(A, B) execution times rounded to seconds

1.   (24, 88)
2.   (34, 11)
3.   (3, 8)


Here we see a 3x-4x difference A to B, with an anomaly: Pharo 1.1.1
is actually much *slower* on #to:by:do: (is it the closure bug?  See
below).  (Also we see an 8x-11x speedup with the latest Cog-2011feb6,
great!)

Now here is something odd -- if I invoke the loops via workspace DoIt,
the timing changes in Pharo 1.1.1,  like this:

Time millisecondsToRun: [
	| sum |
	sum := 1e8.
	sum to: 1 by: -1 do: [:x |
		sum := (sum bitShift: -1) + x].
	sum]
	
(highlight and DoIt)

(A, B) invoked from workspace, time in seconds

1.  (24, 88)
2.  (5, 11)
3.  (3, 8)


The Pharo 1.1.1 timing anomaly for case 2A disappeared - it's now 5
seconds instead of 34.

So, 4 questions:

* What causes the Pharo 1.1.1 timing anomaly for #to:by:do: when it is
invoked via workspace DoIt vs in the #benchLooperA method (5 sec vs 34
sec)?

* Why is the #timesRepeat: loop 3x-4x slower than #to:by:do: ?  Is that
simply the difference between inlined and non-inlined methods, or
other factors?

* Even the best Cog time here is 7x slower vs machine-code (and 20x
slower for the timesRepeat case), whereas latest Cog fib: runs at 3x.
What factors make this micro-bench loop not as optimizable as fib: for
Cog?

* Why is the stock Squeak 4.1 VM  **220x slower** than machine code
for case B = timesRepeat loop,  (88 sec vs 0.4 sec) ?
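
The slowdown factors quoted in the questions above (7x, 20x, 220x)
come from dividing each measured time by the roughly 0.4 sec native
baseline.  A small C sketch of that arithmetic follows; the function
names are chosen here for illustration, and the times are the
method-invocation results from the first table.

```c
#include <stdio.h>

/* Slowdown relative to a native baseline; both arguments in seconds. */
double slowdown_factor(double vm_seconds, double native_seconds)
{
    return vm_seconds / native_seconds;
}

/* Prints the factor for each (VM, loop) pair from the first table. */
void print_slowdown_table(void)
{
    const double native = 0.4; /* ~400 msec reported for the native loop */
    const char *vm[3] = {
        "Squeak 4.1 stock",
        "Pharo 1.1.1 stock",
        "Pharo 1.2rc2 + Cog2011feb6"
    };
    /* (A, B) times in seconds, as rounded in the post */
    const double secs[3][2] = { {24, 88}, {34, 11}, {3, 8} };

    for (int i = 0; i < 3; i++) {
        printf("%-28s A: %6.1fx  B: %6.1fx\n", vm[i],
               slowdown_factor(secs[i][0], native),
               slowdown_factor(secs[i][1], native));
    }
}
```

Since the times are rounded to whole seconds, the resulting factors are
approximate.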


Thanks to everyone for insights and comments.

-- jbthiel



