[Cado-nfs-discuss] Help starting out using two machines

Emmanuel Thomé Emmanuel.Thome at inria.fr
Sat Mar 7 19:15:38 CET 2020


On Sat, Mar 07, 2020 at 12:37:14PM -0500, David Willmore wrote:
> To state it a different way, even if one were to compile the software
> properly on the slave/client box with all the optimizations necessary
> to make good use of the hardware, the client script will still use the
> server provided binary if the --binddir option is not provided!  The
> user may end up inadvertently using a less than optimal client binary
> without being aware of it.

Yes, exactly. That's one of my points against this feature.
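For the record, here is a sketch of what running a self-compiled client looks like. The paths and server URL are made up, and the quoted message writes --binddir while my tree spells it --bindir; check ./cado-nfs-client.py --help for the exact option name in your version:

```shell
# Hypothetical invocation: point the client at your locally compiled
# binaries instead of the server-provided ones. Paths are placeholders.
./cado-nfs-client.py --server=https://myserver.example:41953 \
                     --bindir=/home/me/cado-nfs/build/myhost
```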

> > [[ --override -t auto ]]
>
> It is at this point that I'm running into problems.  It seems that if
> I ever have any override of the threads, I get errors.  In tracking
> down what was going wrong, I ran into cado-nfs-client.py not
> respecting the workdir parameter for 'downloads' which effects
> downloaded binaries, polynomials, and roots files.

I'm not sure which workdir you're speaking of. There's an optional
WORKDIR in the workunits, but I don't think we ever use it, and anyway
it's a path that is relative to the cado-nfs-client.py's "basepath".

At any rate, if you want to adjust where the client puts its stuff, you
want to adjust its "basepath": it defaults to $CWD and can be set with
the --basepath option to cado-nfs-client.py.
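As a sketch (--basepath is the real option; the server URL and directory here are made up):

```shell
# Root the client's downloads and work files at a chosen directory
# instead of $CWD. Server URL and path are placeholders.
./cado-nfs-client.py --server=https://myserver.example:41953 \
                     --basepath=/scratch/cado-client
```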

If you wonder why the client seems to ignore paths that are specified in
the server-level parameter file, here's a simple explanation: if you go
the route where you're running the clients by yourself (as opposed to
having the server run them, which we both agree is a potential source of
problems), then there's a chicken-and-egg problem: the client needs to be
somewhere before it contacts the server, so there's no way it can obey
paths that are set in the server-level parameter file. (Other parameters,
of a more algorithmic nature, naturally appear in the workunits.)
 
> [...]
> That still leaves the main problem.  I get strange errors like:
> NFO:root:Running  ' d o w n l o a d / p o l y s e l e c t '  - P  1 0
> 0 0 0  - N  3 5 3 4 9 3 7 4 9 7 3 1 2 3 6 2 7 3 0 1 4 6 7 8 0 7 1 2 6
> 0 9 2 0 5 9 0 6 0 2 8 3 6 4 7 1 8 5 4 3 5 9 7 0 5 3 5 6 6 1 0 4 2 7 2
> 1 4 8 0 6 5 6 4 1 1 0 7 1 6 8 0 1 8 6 6 8 0 3 4 0 9  - d e g r e e  4
> - v  - t  2  - a d m i n  1 4 5 0 0 0  - a d m a x  1 5 0 0 0 0  - i n
> c r  6 0  - n q  2 5 6  - s o p t e f f o r t  0  >  ' 2 / c 9 0 . p o
> l y s e l e c t 1 . 1 4 5 0 0 0 - 1 5 0 0 0 0 '
> ERROR:root:Command resulted in exit code 1
> ERROR:root:Stderr: b'/bin/sh:  2 / c 9 0 . p o l y s e l e c t 1 . 1 4
> 5 0 0 0 - 1 5 0 0 0 0 : No such file or directory'

Oh, that seems to be a hilarious one!

> I am guessing the regex used in implementing the override option is
> doing something strange.  I'm not a Python person, but I'll try to take
> a look into this to see what I could find.  As always, any advice is
> very much welcome.

">
The " >  ' 2 / c 9 0" part is most weird. Which commit did you try? I
changed some stuff this week, and got rid of one /bin/sh middle man (see
https://gitlab.inria.fr/cado-nfs/cado-nfs/issues/21718 and commits
referenced at the end of the page). I'd be interested in the WU file that
got downloaded by your client.

If you find something, do tell me. It seems puzzling enough. I'll try and
see if I can reproduce.

> > This should normally get you going for the sieving part of
> > factorizations. As long as you don't point too many machines to the
> > server, it should cope. You might want to adjust the adrange and qrange
> > so that the clients don't ask for work too often
> 
> That is a very good caution and I will be aware of it.  Is there a
> downside to making those values larger?  Memory usage, etc.?  I'm
> guessing they may just make the work units take longer and that can
> have some effect on smaller machines assisting in the calculation.

Longer runs on the clients. This might become awkward if you have to deal
with scheduling systems and so on. Yes, it's also a potential problem
when you have machines of vastly different power (and this is an
acknowledged issue I want to work on someday). Memory usage does not
depend on workunit size.
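For reference, these knobs live in the server-side parameter file; a hypothetical fragment (the values are illustrative only; the params files shipped under parameters/factor in the source tree give sensible defaults per input size):

```
# Illustrative values only -- larger ranges mean longer-running
# workunits and fewer requests to the server.
tasks.polyselect.adrange = 10000
tasks.sieve.qrange = 20000
```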

One thing I forgot to add is that at some point, if you hammer the server
too much, the sqlite3 backend is going to get in the way, as it has
several shortcomings. We've had much better success using the mysql
backend for large projects (I think there's some doc in
scripts/cadofactor/README.md). But probably the first thing to do before
that is to strive to be gentle on the database load on the server.
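I haven't double-checked the exact syntax here, so treat this as a hypothetical fragment and take scripts/cadofactor/README.md as authoritative, but the database backend is selected with something along these lines in the parameter file:

```
# Hypothetical: user, password, host and db name are placeholders.
# See scripts/cadofactor/README.md for the exact URI syntax.
database = db:mysql://cado:secret@dbhost:3306/c90
```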

> > Filtering, then, happens on the server. Linear algebra too. If you want
> > to do linear algebra on several machines, it's a different topic.
> 
> Indeed.  I think I'll be okay as it is for the small numbers I plan to
> run--512 bits or so.

Depending on your interconnect, a distributed block Wiedemann can be a
useful strategy even for 512 bits. But that's clearly a "next-level"
exercise.

E.

