[Ecm-discuss] plans for a new release

Torbjorn Granlund tg at swox.com
Fri May 2 13:25:33 CEST 2008


Feedback about 6.2-rc1:

1. "configure -help" prints
     ...
     --enable-sse2           use SSE2 instruction in NTT code
                             [default=yes], [for], [Pentium], [4,], [no], ...
     ...
   Somewhat ugly, but not a critical bug perhaps.  :-)

2. You include your own mul_fft.c file with many interfaces that are
   also in the GMP version of the file.  This is not good: imagine if
   somebody uses your ecm library in a GMP application, they might get
   surprises.  I see you mangle one entry point, but it seems you need
   to mangle a few more, along the lines sketched below.
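
   A sketch of the kind of renaming I mean (the ecm_ prefix and the
   exact symbol list here are illustrative, not taken from your
   sources):

     /* Hypothetical prefixing header, included before the private
        mul_fft.c is compiled, so that its symbols cannot clash with
        those of an installed libgmp.  */
     #define mpn_mul_fft        ecm_mpn_mul_fft
     #define mpn_mul_fft_full   ecm_mpn_mul_fft_full
     #define mpn_fft_next_size  ecm_mpn_fft_next_size
     #define mpn_fft_best_k     ecm_mpn_fft_best_k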

3. The B2 parameter has been somewhat freely interpreted in previous
   releases, but now it is getting silly:

     ecm -pm1 1000000 81400000
     ...
     Using B1=1000000, B2=342622942

   I realize there are technical reasons for this rounding, but a
   factor of more than 4?!  Does that improve efficiency so much that
   it is worth it?

4. For large numbers, NTT is slower than non-NTT.  It would be nice
   if the code chose the fastest variant automatically, using a
   tunable parameter; see the sketch below.
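
   Something along these lines (the threshold name and default value
   are made up; the real value would come from a tune program):

     #include <gmp.h>	/* for mp_size_t */

     /* Hypothetical crossover: use the NTT only below a tuned size,
        since for large operands the non-NTT code wins.  */
     #ifndef NTT_THRESHOLD
     #define NTT_THRESHOLD 65536	/* limbs; set by tuning */
     #endif

     static int
     want_ntt (mp_size_t limbs)
     {
       return limbs < NTT_THRESHOLD;
     }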

5. How to use ecm efficiently?  Your advice should probably include
   reasoning about 32-bit vs 64-bit builds, since a 64-bit binary
   brings by far the greatest speedup on processors capable of
   running one.  There are several aspects here.  (A) With a 32-bit
   OS on a PC, there is no hope of running 64-bit binaries.  This is
   a common problem, both for Windows XP (32-bit only) and for
   GNU/Linux and BSD, where people often install the i386 OS instead
   of the amd64 OS.  (B) GMP-ECM defaults to whatever the found
   compiler defaults to, I think, while GMP defaults to the fastest
   ABI, i.e., some 64-bit ABI.

   It would be a shame if people spent a lot of time running a 32-bit
   GMP-ECM on a 64-bit processor.
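
   A cheap way to catch this (my own quick check, not part of
   GMP-ECM): ask the GMP you link against for its limb size.

     #include <stdio.h>
     #include <gmp.h>

     int
     main (void)
     {
       /* 64 means a 64-bit ABI; 32 on a 64-bit processor means the
          big speedup is being left on the table.  */
       printf ("GMP limb bits: %d\n", (int) mp_bits_per_limb);
       return 0;
     }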

6. The sp_sub and sp_add asm code isn't too clever.  Firstly, the
   cmovc instruction isn't available merely because __i386__ is
   defined.  Secondly, I strongly doubt these are optimizations
   compared to what the compiler can generate; in particular sp_add
   looks like a pessimization.  (Please don't compare generic gcc
   output to this on a modern processor; compare, say,
   gcc -march=athlon to this.)
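
   For comparison, here is a plain C version (a sketch; I assume sp_t
   keeps a nail bit free, as your sp code does, so a + b cannot wrap):

     #include "sp.h"	/* for sp_t */

     static inline sp_t
     sp_add_c (sp_t a, sp_t b, sp_t m)
     {
       /* a, b < m and no wraparound, so one conditional subtract
          suffices; gcc -march=athlon turns the ?: into cmov.  */
       sp_t t = a + b;
       return t >= m ? t - m : t;
     }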

7. I am surprised that the improved NTT code is not a bigger
   improvement over the old code than it is.  Considerable time is
   spent in mpn_mod_1, at least in my measurements.  Since you're
   using a nail by default in sp, you may want to use the code below
   instead, which works only in the presence of a nail.  (Or at
   least use mpn_preinv_mod_1, even though it is declared obsolete.
   The tricky thing is that you'd need to shift the dividend before
   calling mpn_preinv_mod_1, just like you need to shift the divisor;
   that shifting requirement was one reason for declaring the
   function obsolete.  A sketch of the dance follows.)
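
/* Not from the ecm sources: a sketch of using the obsolete
   mpn_preinv_mod_1, which wants a normalized divisor.  Divisor and
   dividend are shifted alike by cnt, and the remainder is shifted
   back down.  Relies on GMP's internal count_leading_zeros,
   invert_limb, and TMP_* macros, and needs cnt >= 1, which the nail
   in b guarantees.  */
mp_limb_t
mod_1_via_preinv (mp_srcptr ap, mp_size_t n, mp_limb_t b)
{
  mp_limb_t bi, r;
  mp_ptr tp;
  int cnt;
  TMP_DECL;

  ASSERT (b <= GMP_NUMB_MAX / 2);	/* ensures cnt >= 1 */

  TMP_MARK;
  tp = TMP_ALLOC_LIMBS (n + 1);

  count_leading_zeros (cnt, b);
  b <<= cnt;				/* normalize the divisor */
  invert_limb (bi, b);			/* in practice, table bi per prime */
  tp[n] = mpn_lshift (tp, ap, n, cnt);	/* shift the dividend, keep carry */
  r = mpn_preinv_mod_1 (tp, n + 1, b, bi);
  TMP_FREE;
  return r >> cnt;			/* unshift the remainder */
}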

   Here is an alternative method.  In your case, you should table
   cnt, bi, B1modb, and B2modb, and pass them to the function instead
   of computing them therein; a sketch of such a setup follows the
   function below.  If you do that, expect a great speedup: the O(1)
   term will go down by around 70-150 cycles and the O(n) term will
   at least be halved.  ECM's stage 2 should run > 10% faster as a
   result.

   Note that your tabled ->mul_c is probably not accurate enough,
   since it doesn't use an implicit msb.  The tabling could be
   accommodated at the start of mpzspv_from_mpzv; it seems len is
   always large enough to hide such computations.

/* The code below uses GMP's internal headers, for umul_ppmm,
   add_ssaaaa, count_leading_zeros, invert_limb, ASSERT, and
   udiv_qrnnd_preinv.  */
#include "gmp.h"
#include "gmp-impl.h"
#include "longlong.h"

#define udiv_rnd_preinv(r, nh, d, di)					\
  do {									\
    mp_limb_t _qh, _ql, _r;						\
    umul_ppmm (_qh, _ql, (nh), (di));					\
    _qh += (nh) + 1;							\
    _r = - _qh * (d);							\
    if (_r > _ql)							\
      _r += (d);							\
    (r) = _r;								\
  } while (0)

/* Return ap[0..n-1] mod b.  Requires n >= 2 and b <= GMP_NUMB_MAX / 2,
   i.e. at least one nail bit in b, as in sp.  */
mp_limb_t
mpn_mod_1small (mp_srcptr ap, mp_size_t n, mp_limb_t b)
{
  mp_limb_t rh, rl, bi, q, ph, pl, r;
  mp_limb_t B1modb, B2modb;
  mp_size_t i;
  int cnt;
  mp_limb_t mask;

  ASSERT (b <= GMP_NUMB_MAX / 2);

  count_leading_zeros (cnt, b);

  b <<= cnt;
  invert_limb (bi, b);

  B1modb = -b * ((bi >> (GMP_LIMB_BITS-cnt)) | (CNST_LIMB(1) << cnt));
  ASSERT (B1modb <= b);		/* NB: not fully reduced mod b */
  udiv_rnd_preinv (B2modb, B1modb, b, bi);

  B1modb >>= cnt;
  B2modb >>= cnt;

  umul_ppmm (ph, pl, ap[n - 1], B1modb);
  add_ssaaaa (rh, rl, ph, pl, 0, ap[n - 2]);

  for (i = n - 3; i >= 0; i -= 1)
    {
      /* rr = ap[i]				< B
	    + LO(rr)  * (B mod b)		<= (B-1)(b-1)
	    + HI(rr)  * (B^2 mod b)		<= (B-1)(b-1)
      */
      umul_ppmm (ph, pl, rl, B1modb);
      add_ssaaaa (ph, pl, ph, pl, 0, ap[i]);

      umul_ppmm (rh, rl, rh, B2modb);
      add_ssaaaa (rh, rl, rh, rl, ph, pl);
    }

  r = (rh << cnt) | (rl >> (GMP_LIMB_BITS - cnt));
  mask = -(mp_limb_t) (r >= b);
  r -= mask & b;

  udiv_qrnnd_preinv (q, r, r, rl << cnt, b, bi);
  return r >> cnt;
}
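
Here is what the tabling I suggest might look like (a sketch with
made-up names; mpn_mod_1small would then take a pointer to the struct
and start directly at the umul_ppmm on ap[n-1], reading the constants
from it):

struct sp_mod1
{
  mp_limb_t b, bi, B1modb, B2modb;
  int cnt;
};

/* One-time setup per sp prime, e.g. at the start of
   mpzspv_from_mpzv; hoists the whole O(1) prelude of mpn_mod_1small
   out of the per-call path.  */
void
sp_mod1_init (struct sp_mod1 *t, mp_limb_t b)
{
  ASSERT (b <= GMP_NUMB_MAX / 2);

  count_leading_zeros (t->cnt, b);
  t->b = b << t->cnt;
  invert_limb (t->bi, t->b);

  t->B1modb = -t->b * ((t->bi >> (GMP_LIMB_BITS - t->cnt))
                       | (CNST_LIMB(1) << t->cnt));
  udiv_rnd_preinv (t->B2modb, t->B1modb, t->b, t->bi);
  t->B1modb >>= t->cnt;
  t->B2modb >>= t->cnt;
}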

-- 
Torbjörn



More information about the Ecm-discuss mailing list