[Cado-nfs-discuss] Help starting out using two machines

David Willmore davidwillmore at gmail.com
Sat Mar 7 18:37:14 CET 2020


First, sorry for the delay in responding.  I wrote to Emmanuel off
thread to let him know that I would take a little while to work
through this, to be sure I understood everything as I went along.
Please don't take my slowness to reply as a lack of respect for
Emmanuel's efforts to assist me.  Quite the opposite: I took the
time to make the most of what was offered.  Thank you for your
understanding.

On Thu, Mar 5, 2020 at 6:31 PM Emmanuel Thomé <Emmanuel.Thome at inria.fr> wrote:
> This is an excellent occasion to complement the README.

If my situation can be a learning experience for others, then it will
not be a waste.

> In my opinion, we should be almost silent regarding slaves.hostnames and
> friends.  They're here as a convenience means, in order to spawn all jobs
> from the server, which comes in handy for baby size factorizations. But
> for anything beyond that "testing-only" approach, you need to do
> otherwise.

I am inclined to agree.

> The process is simple.
>
>     1 - build all binaries on the server node, and run the server script.
>     2 - build all binaries on the client nodes, and run the client
>     scripts to point to the server. (the server node can also have the
>     role of a client).
>
> Concerning 1, you need binaries, because some steps are run on the server
> no matter what (creating the factor base files, but also linear algebra,
> unless you choose to do linear algebra separately). To run a "bare"
> server, you must do:
>
>     make
>     ./cado-nfs.py 90377629292003121684002147101760858109247336549001090677693 server.whitelist=0.0.0.0/0 --server --workdir /some/path/to/a/fresh/directory/
>
> (adjust the network mask according to the client that will connect.)
>
> The server's standard output contains two lines that you really want to
> pay attention to, namely:
>
>     Info:root: If this computation gets interrupted, it can be resumed with ./cado-nfs.py /some/path/to/a/fresh/directory/c60.parameters_snapshot.0
>
> and
>
>     Info:HTTP server: You can start additional cado-nfs-client.py scripts with parameters: --server=https://localhost:40047 --certsha1=108fcb0e4961f195b3011d4c2e0c6077c82b238a
>
>
> After going through the very early setup, the server will be essentially
> idle, and wait for clients to connect. You might want to have it in a
> screen. As the message I just quoted above says, it is perfectly fine if
> you interrupt and restart the server, provided that you use the suggested
> command line.

I have made it this far and everything works as you have stated.  The
only issue I had was that my OS distro needed port 8001 opened in the
firewall so that the clients and server could communicate.
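
A quick way to check the firewall from a client machine is to attempt a
TCP connection to the server's port.  This is only a sketch: the helper
name port_is_open is made up for illustration, and the host and port to
test are whatever the server reports at startup (8001 in the setup
described here).

```python
import socket

def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

Calling port_is_open with the server's host name and 8001 from each
client should return True once the firewall hole is in place.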

> Now for 2. Clients are not necessarily the same machines as the server
> machine. They may have different architectures and so on. So, despite the
> fact that the server does have the functionality to ship a binary to the
> client, I very much advise against using it. Instead, you should build
> your binaries on the client, and instruct the client script to use them,
> and skip the server-shipped binaries. This is done as follows
> (./cado-nfs-client.py --help might be worth a visit)
>
>     ./cado-nfs-client.py --basepath $wdir --server=https://localhost:40047 --certsha1=108fcb0e4961f195b3011d4c2e0c6077c82b238a --bindir $(eval `make show` ; echo $build_tree)
>
> where --server and --certsha1 are as suggested by the server, and $wdir
> is some local working directory that you create specifically for the
> client. Side note, you can even automate this temp dir stuff with this
> helper script from the cado-nfs testing suite:
>
>     ./tests/provide-wdir.sh --arg --basepath -- ./cado-nfs-client.py --server=.......[rest of the cmdline above]
>
>
> The part:
>     --bindir $(eval `make show` ; echo $build_tree)
> uses cado-nfs's top-level makefile to retrieve the build directory, so
> that you can pass it to the script. Of course, you can specify the build
> directory with any less-automatic means you see fit.

This is all very good advice.  I am fortunate enough to have nearly
identical hardware for both server and client machines--I swapped
their roles for my testing just to ensure that was not going to be a
problem.  In the future, I may use some other hardware as clients
which have completely different CPU complexes and the above
instructions will be very helpful.  I had not been aware that the
server could push binaries to clients and that may have tripped me up
if I hadn't built them properly on the client first--the binaries
would not have run and I would have spent countless hours tracking
down bugs that were unrelated to the real problem.

To state it a different way, even if one compiles the software
properly on the slave/client box with all the optimizations necessary
to make good use of the hardware, the client script will still use the
server-provided binary if the --bindir option is not given!  The
user may end up inadvertently using a less than optimal client binary
without being aware of it.
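
One way to avoid forgetting the option is to build the client command
line in a small wrapper that refuses to run without a local build
directory.  A sketch, assuming only the option names quoted above
(--server, --certsha1, --bindir, --basepath); the function name
client_command is hypothetical and not part of cado-nfs.

```python
import os

def client_command(server, certsha1, bindir, basepath):
    """Return the argv list for cado-nfs-client.py with an explicit --bindir."""
    if not os.path.isdir(bindir):
        # Fail early rather than silently fall back to server-shipped binaries.
        raise FileNotFoundError("build directory not found: " + bindir)
    return [
        "./cado-nfs-client.py",
        "--server=" + server,
        "--certsha1=" + certsha1,
        "--bindir", bindir,
        "--basepath", basepath,
    ]
```

The returned list can be handed to subprocess.run(cmd, check=True)
from the top of the cado-nfs source tree.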

> Clients work mostly fine as is with these settings, but will still obey
> all parameters that are suggested by the server. That includes in
> particular the number of threads per client, which is the server's
> --client-threads argument.
>
> Again, deciding the number of threads per client should rather be the
> client's business. Hence there's a way, in cado-nfs-client.py, to
> override the server-suggested setting. You just add, for example:
>
>     --override -t 4
>
> to the cado-nfs-client.py command line, to force 4-threads clients. Note
> that you may have several clients per node, and at least for polyselect
> this is preferable to having a client with many threads.
>
> Once your computation is in the sieving phase, you can switch to a setup
> where you have only one client per node, and use automatic thread
> placement (alas, this is _only_ for sieving, for the moment). You just
> add this to the cado-nfs-client.py command line:
>
>     --override -t auto

It is at this point that I'm running into problems.  It seems that
whenever I override the number of threads, I get errors.  In tracking
down what was going wrong, I ran into cado-nfs-client.py not
respecting the workdir parameter for 'downloads', which affects
downloaded binaries, polynomials, and roots files.  Those directories
always seem to be referenced relative to the CWD when the command is
invoked.  That's probably not the desired behavior.  It certainly led
to difficulty in debugging, as the client was storing state in
locations outside the work dir--which I was clearing out between runs.
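
The difference between the two behaviours can be sketched in a few
lines of Python.  This is not code from cado-nfs-client.py;
resolve_download_dir is a hypothetical helper, and "download" is simply
the directory name that shows up in the client's log.

```python
import os

def resolve_download_dir(workdir, downloaddir="download"):
    """Anchor a relative download directory under workdir.

    An absolute downloaddir is returned unchanged, because os.path.join
    discards earlier components when a later one is absolute.
    """
    return os.path.normpath(os.path.join(workdir, downloaddir))

# A bare relative path, by contrast, is resolved against the current
# directory of whoever launched the script:
cwd_relative = os.path.abspath("download")
```

With the first form, clearing the work dir also clears the downloaded
files; with the second, they stay behind wherever the command was run.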

That still leaves the main problem.  I get strange errors like:
INFO:root:Running  ' d o w n l o a d / p o l y s e l e c t '  - P  1 0
0 0 0  - N  3 5 3 4 9 3 7 4 9 7 3 1 2 3 6 2 7 3 0 1 4 6 7 8 0 7 1 2 6
0 9 2 0 5 9 0 6 0 2 8 3 6 4 7 1 8 5 4 3 5 9 7 0 5 3 5 6 6 1 0 4 2 7 2
1 4 8 0 6 5 6 4 1 1 0 7 1 6 8 0 1 8 6 6 8 0 3 4 0 9  - d e g r e e  4
- v  - t  2  - a d m i n  1 4 5 0 0 0  - a d m a x  1 5 0 0 0 0  - i n
c r  6 0  - n q  2 5 6  - s o p t e f f o r t  0  >  ' 2 / c 9 0 . p o
l y s e l e c t 1 . 1 4 5 0 0 0 - 1 5 0 0 0 0 '
ERROR:root:Command resulted in exit code 1
ERROR:root:Stderr: b'/bin/sh:  2 / c 9 0 . p o l y s e l e c t 1 . 1 4
5 0 0 0 - 1 5 0 0 0 0 : No such file or directory'

There seems to be a space between every character in that string when
the --override parameter is specified.  When it's not there, the above
entry looks something like:
INFO:root:Running 'download/polyselect' -P 10000 -N
353493749731236273014678071260920590602836471854359705356610427214806564110716801866803409
-degree 4 -v -t 2 -admin 0 -admax 5000 -incr 60 -nq 256 -sopteffort 0
> '/home/willmore/factoring/cado-nfs-2.3.0/2/c90.polyselect1.0-5000'

I am guessing the regex used in implementing the override option is
doing something strange.  I'm not a Python person, but I'll try to take
a look into this to see what I can find.  As always, any advice is
very much welcome.
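
For what it is worth, one ordinary way to get this kind of output in
Python, with no regex involved, is to call str.join on a string where a
list of arguments was expected: join iterates over its argument, and
iterating over a string yields single characters.  Whether this is what
happens in cado-nfs-client.py is an assumption that has not been
checked against the source; render is a hypothetical stand-in.

```python
def render(command):
    """Join command-line pieces with spaces, as a logger or shell call might."""
    return " ".join(command)

argv = ["'download/polyselect'", "-P", "10000", "-t", "2"]

good = render(argv)   # list in: one space between arguments
bad = render(good)    # string in: one space between every character

print(good)
print(bad)
```

A check such as isinstance(command, str) before the join would catch
the mistake in this sketch.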

> This should normally get you going for the sieving part of
> factorizations. As long as you don't point too many machines to the
> server, it should cope. You might want to adjust the adrange and qrange
> so that the clients don't ask for work too often

That is a very good caution and I will be aware of it.  Is there a
downside to making those values larger?  Memory usage, etc.?  I'm
guessing they may just make the work units take longer and that can
have some effect on smaller machines assisting in the calculation.
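
The trade-off can be put in numbers.  Assuming, purely for
illustration, a polynomial selection range of 0 to 150000 (the logs
above only show individual work units of width 5000, not the overall
range), the number of work units the server has to hand out is the
range divided by adrange, rounded up.  work_unit_count is a made-up
helper, not a cado-nfs function.

```python
def work_unit_count(start, end, step):
    """Number of work units of width `step` needed to cover [start, end)."""
    if step <= 0:
        raise ValueError("step must be positive")
    return -(-(end - start) // step)  # ceiling division on integers

for adrange in (5000, 25000):
    print(adrange, work_unit_count(0, 150000, adrange))
```

Fewer, larger work units mean fewer requests to the server, at the
price of each unit taking roughly proportionally longer on a given
client.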

> Filtering, then, happens on the server. Linear algebra too. If you want
> to do linear algebra on several machines, it's a different topic.

Indeed.  I think I'll be okay as it is for the small numbers I plan to
run--512 bits or so.

Emmanuel,  thank you so much for your help.  You have moved me much
further down the path of understanding.  I truly appreciate the effort
you have put into this and I hope I can help out in some way in the
future.

Cheers,
David

