[SimGrid-user] Simgrid Insider News June 2017

Martin Quinson martin.quinson at ens-rennes.fr
Fri Jul 7 17:04:11 CEST 2017


Hello,

I will try to start a little newsletter summarizing the things going
on in the SimGrid community. Please do not hesitate to react if I
forget things, if I misunderstood things, if I am not clear, or for any
other reason.


Version 3.16 was released on time. No last-minute fix dot release was
necessary this time :) My only regret about it is that we had to ship
with a small bug in the sharing of multi-core VMs in the case of
overcommit (e.g., several dual-core VMs on a 3-core PM). This is fixed
in git now, but since it is not a regression (multi-core VMs are new)
it will have to wait until the next release, at the end of September.


Things already evolved since the release. We had a meeting in Lyon
gathering the core team of SimGrid (Arnaud, Fred and me) with the
developers of derivative tools. It was mostly about Wrench, a workflow
simulation framework developed by Henri Casanova and Rafael Ferreira
da Silva using our new C++ interface, S4U, but we also managed to talk
with Pierre François Dutot about BatSim (a batch system simulator) and
others. Wrench is under heavy development on GitHub and is not really
ready for external use yet, but this project is a strong driver for
the S4U interface.

For the S4U interface, we added a bunch of functions. Many are still
missing compared to e.g. MSG, and we will keep adding them as people
request them. If you like the idea of simulating distributed
applications but HATE the MSG interface (as many people do), consider
S4U. The API is more coherent and sensible. It's still in progress but
already usable. File bugs about the MSG features you're missing in S4U
and we will fix them ASAP (but not sooner).

For the S4U internals, we spent more time on the design of Activities
(i.e., how to make a coherent set of classes for execution,
communication, disk, etc.). This is obviously central in SimGrid, with
many impacts on the rest. To complicate things, the model-checker uses
process transitions as a central concept. These transitions are often
fired because of an activity, but not always. We are looking for a way
to make all this coherent, easy to use, extensible and efficient...

We spoke of parallel tasks, which are very important in BatSim. These
Activities consume more than one resource (e.g., several CPU hosts and
several links), aggregating a parallel application into a simple and
abstract view. The discussion was about making them usable with our
regular performance model (it seems theoretically impossible), more
generic so that you can mix disk with CPU and/or network (this will
work), and more efficient (we have optimization ideas). The new name
for these things is CoActivity, loosely inspired by coroutines.
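
To make the idea of an activity spanning several resources concrete,
here is a minimal Python sketch. It is not SimGrid code: the names
Resource and bottleneck_duration are hypothetical, and the "most loaded
resource sets the pace" rule is only an assumption for an activity that
runs alone on its resources.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen so that instances can be used as dict keys
class Resource:
    """A CPU, link or disk with a fixed capacity (flop/s or byte/s)."""
    name: str
    capacity: float


def bottleneck_duration(amounts):
    """Duration of one activity that uses several resources at once.

    `amounts` maps each Resource to the work (flops or bytes) it handles.
    Assumption: the activity is alone and progresses at the same relative
    rate everywhere, so the most loaded resource determines the duration.
    """
    return max(work / res.capacity for res, work in amounts.items())


# Hypothetical platform: two CPUs, one link and one disk.
cpu_a = Resource("cpu_a", 1e9)
cpu_b = Resource("cpu_b", 2e9)
link = Resource("link", 1e8)
disk = Resource("disk", 5e7)

duration = bottleneck_duration({cpu_a: 2e9, cpu_b: 2e9, link: 5e8, disk: 1e8})
print(duration)
```

Mixing the disk with CPUs and the link needs nothing special here,
since every resource is just a capacity. Sharing resources between
concurrent activities is deliberately left out of the sketch.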

We spoke of platform automatic discovery and calibration. We came up
with crucial experiments that can be used to automatically determine
the message size at which the MPI stack switches between eager,
buffered and rendez-vous modes. Once implemented, it should greatly
ease the use of the automatic platform calibration scripts. More work
is needed on the statistical side to automatically characterize the
performance discontinuity in each of these modes. This builds upon
ongoing discussions with the G5K team to make it possible to use
SimGrid for platform acceptance tests and for detecting other platform
misbehaviors.
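
As an illustration of what detecting a mode switch could look like,
the sketch below searches for the message size at which a two-segment
linear fit of the timings is best. This is an assumed technique, not
the experiment designed at the meeting: it handles a single breakpoint
only, the function names are hypothetical, and the timings are
synthetic (noise-free, with an arbitrary threshold of 65536 bytes).

```python
def line_sse(xs, ys):
    """Sum of squared errors of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


def find_breakpoint(sizes, times, min_points=2):
    """Return the first size of the second segment in the best two-line fit."""
    best_sse, best_size = float("inf"), None
    for i in range(min_points, len(sizes) - min_points + 1):
        sse = line_sse(sizes[:i], times[:i]) + line_sse(sizes[i:], times[i:])
        if sse < best_sse:
            best_sse, best_size = sse, sizes[i]
    return best_size


def synthetic_time_us(size, threshold=65536):
    """Made-up timing model: the latency jumps when the mode changes."""
    latency_us = 1.0 if size < threshold else 5.0
    return latency_us + size / 1000.0


sizes = list(range(8192, 131072 + 1, 8192))
times = [synthetic_time_us(s) for s in sizes]
print(find_breakpoint(sizes, times))
```

Real measurements are noisy and have more than one switch point, which
is exactly where the statistical work mentioned above comes in.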

We redesigned the disk resource, for the sake of simplicity. The
read/write performance will be completely decorrelated from the
filesystem abstraction. A disk will have no idea of the remaining
space but only of how fast the ongoing read and write actions are.
Several filesystem abstractions will be provided, ranging from "no FS,
write as much as you want" to "keep track of the head position in all
open files to detect when the file is full". Others could follow.
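
The separation between disk and filesystem can be pictured with the
following Python sketch. All class names are hypothetical and this is
not the SimGrid design itself: BoundedFS is a simplified stand-in for
the stricter end of the range (it tracks file sizes and a global
capacity rather than head positions).

```python
class Disk:
    """Knows only its bandwidths (bytes/s), nothing about files or space."""

    def __init__(self, read_bw, write_bw):
        self.read_bw = read_bw
        self.write_bw = write_bw

    def read_time(self, size):
        return size / self.read_bw

    def write_time(self, size):
        return size / self.write_bw


class NoFS:
    """'No FS' end of the range: every write is accepted."""

    def __init__(self, disk):
        self.disk = disk

    def write(self, path, size):
        return self.disk.write_time(size)


class BoundedFS:
    """Tracks how much each file holds and refuses to exceed the capacity."""

    def __init__(self, disk, capacity):
        self.disk = disk
        self.capacity = capacity
        self.file_sizes = {}

    def used(self):
        return sum(self.file_sizes.values())

    def write(self, path, size):
        if self.used() + size > self.capacity:
            raise OSError("no space left on the simulated filesystem")
        self.file_sizes[path] = self.file_sizes.get(path, 0) + size
        return self.disk.write_time(size)
```

Both filesystems delegate the timing to the same Disk object, so a new
abstraction can be added without touching the performance side.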

We worked on the documentation. The level-0 page "getting started" is
complete and ready, introducing all concepts in one page only. Now we
have to write the level-1 pages, detailing each of the main concepts
(such as "describing the platform" or "describing your application")
in a single page each. Each of these pages will point to the level-2
documentation, consisting of the API reference and/or of scientific
articles detailing our modeling and design choices.

We came up with a possible plan to implement SimDAG on top of S4U.
SimDAG differs from MSG in several aspects: (1) there is no actor and
everything is centralized, so the main thread must be able to create
s4u activities just like actors do (but it cannot block on these, of
course); (2) SimDAG activities (called "tasks") have dependencies, and
once everything is set up, the end of a dependency automatically
starts other tasks; (3) you can create an activity without saying
where it should take place, and later schedule it onto a given
resource.
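
The three points above can be illustrated by a small Python sketch. It
is neither the SimDAG nor the S4U API: Task, depends_on, schedule and
run are hypothetical names, and simulated time is ignored so that only
the ordering logic remains.

```python
class Task:
    def __init__(self, name):
        self.name = name
        self.host = None        # (3) created without a location
        self.pending = set()    # predecessors that have not finished yet
        self.successors = []
        self.state = "created"

    def depends_on(self, other):
        if other not in self.pending:
            self.pending.add(other)
            other.successors.append(self)

    def schedule(self, host):
        self.host = host        # (3) the location is decided later

    def runnable(self):
        return (self.state == "created" and self.host is not None
                and not self.pending)


def run(tasks):
    """(1) Centralized loop: no actors, the main thread drives everything."""
    order = []
    ready = [t for t in tasks if t.runnable()]
    while ready:
        task = ready.pop(0)
        task.state = "done"
        order.append((task.name, task.host))
        for succ in task.successors:
            succ.pending.discard(task)   # (2) the end of a dependency...
            if succ.runnable():          # ...may start other tasks
                ready.append(succ)
    return order
```

A task that was never scheduled, or that still has unfinished
predecessors, is simply never picked up by the loop.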

In s4u, activities are little state machines (inspired by MPI
requests). For example, no communication occurs until comm.start() is
called. send() is then a simple wrapper to comm_create()+start(). This
way, optional parameters are removed from the constructor, and users
are expected to use specific setters before they start the activity.
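
Here is a minimal Python sketch of such a state machine. It mimics the
idea only and is not the actual S4U C++ API: the Comm class, its states
and the set_rate setter are hypothetical names chosen for the example.

```python
from enum import Enum


class State(Enum):
    INITED = 1
    STARTED = 2
    FINISHED = 3


class Comm:
    def __init__(self, mailbox, payload):
        # Only mandatory parameters go through the constructor.
        self.mailbox = mailbox
        self.payload = payload
        self.rate = None            # optional, configured with a setter
        self.state = State.INITED

    def set_rate(self, rate):
        if self.state is not State.INITED:
            raise RuntimeError("setters must be called before start()")
        self.rate = rate
        return self                 # returning self allows chaining

    def start(self):
        if self.state is not State.INITED:
            raise RuntimeError("activity already started")
        self.state = State.STARTED  # nothing happens before this point
        return self

    def wait(self):
        if self.state is State.INITED:
            raise RuntimeError("cannot wait on a non-started activity")
        self.state = State.FINISHED
        return self


def send(mailbox, payload):
    """Convenience wrapper: create the activity and start it right away."""
    return Comm(mailbox, payload).start()
```

With this shape, a rate-limited communication reads
Comm("mb", data).set_rate(1e6).start(), while the common case stays a
one-liner through send().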

To implement (2) and (3) of SimDAG, we will add vetoing observers to
the start() transition. If some dependencies or the schedule are still
missing, the start() will be canceled, and will automatically be
re-attempted once the blocking condition disappears. The same mechanism
could be used to implement 1-port models (with a vetoer blocking
activities if the resource is already used), etc.
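
A possible shape for this mechanism, sketched in Python with the
1-port model as the vetoer. Everything here is hypothetical (Engine,
OnePortModel and the observer methods veto/on_start/on_finish are
names invented for the example), and retrying blocked activities each
time an activity finishes is just one simple assumption about when the
blocking condition may have disappeared.

```python
class Activity:
    def __init__(self, name, resource):
        self.name = name
        self.resource = resource
        self.state = "inited"


class OnePortModel:
    """Vetoes any start on a resource that is already in use."""

    def __init__(self):
        self.in_use = set()

    def veto(self, activity):
        return activity.resource in self.in_use

    def on_start(self, activity):
        self.in_use.add(activity.resource)

    def on_finish(self, activity):
        self.in_use.discard(activity.resource)


class Engine:
    def __init__(self, observers):
        self.observers = observers
        self.blocked = []           # activities whose start() was vetoed

    def start(self, activity):
        if any(obs.veto(activity) for obs in self.observers):
            if activity not in self.blocked:
                self.blocked.append(activity)
            return False            # start canceled, retried later
        if activity in self.blocked:
            self.blocked.remove(activity)
        activity.state = "started"
        for obs in self.observers:
            obs.on_start(activity)
        return True

    def finish(self, activity):
        activity.state = "finished"
        for obs in self.observers:
            obs.on_finish(activity)
        for waiting in list(self.blocked):
            self.start(waiting)     # automatic re-attempt
```

A vetoer for missing dependencies or a missing schedule would plug
into the same observers list without any change to Engine.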

We also ironed out a bunch of bugs together. About #189 (nice config
format for host energy consumption), we figured out how to do it, but
it remains to be implemented. We did not find a nice way to turn
waitany and waitall into proper activities. This makes the
model-checker a mess, but the week was too short unfortunately. This
point is my main goal for the 3.17 release, but this is *hard*.

As a conclusion of this meeting, we now have a lot to implement in
the coming weeks, even if time will become scarce because of the
vacations and then because of the new university term...


Not related to this meeting, we are currently working with Marie
Duflot-Kremer and Samuel Thibault on a tool for statistical
model-checking, sampling executions to assess performance properties.
We are making progress on the design, but I am slow on the
implementation side.

The Anh Pham is working on a new reduction algorithm to improve the
performance of the model-checker. That's the topic of his Ph.D.

The master internship of Betsegaw Lemma is now over, and I still have
to integrate his work to model the network energy consumption.
Meanwhile, he's working on a scientific publication detailing this.

The master internship of Tom Cornebize is also over, and the things he
needed to efficiently simulate HPL at very large scale are already
integrated in v3.16. The corresponding publication is almost ready.


Voilà, this is it for this month. Sorry for the long mail, I strive to
make these status reports short, but many interesting things keep
occurring to me...

Bye, Mt.

-- 
You have a problem and decide to use haskell.  
Now you have an endofunctor in the problems category.