[Cryptominisat-devel] I need a bit of help debugging CryptoMiniSat

Martin Maurer meinemailingliste at online.de
Mer 19 Jan 12:43:45 CET 2011


Hello Mate,

can you create list files of the one and the other case ?

I don't know how to do this with normal gcc (non-arm / embedded system) 
version,
but perhaps

http://www.gp32x.com/board/index.php?/topic/39134-how-to-get-gcc-to-output-list-files/

gives a hint ?

Perhaps '-save-temps' ?

Output shall be a list files, which contains mixed C and assembler code for 
each cpp file.
Then it perhaps easier to compare working and non-working version...

Best regards,

Martin

----- Original Message ----- 
From: "Mate Soos" <soos.mate at gmail.com>
To: <cryptominisat-devel at lists.gforge.inria.fr>
Cc: "Oliver Kullmann" <O.Kullmann at Swansea.ac.uk>
Sent: Wednesday, January 19, 2011 12:22 PM
Subject: [Cryptominisat-devel] I need a bit of help debugging CryptoMiniSat


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Dear All,

I need a bit of your help. I have been putting quite some time to find a
very strange bug in CryptoMiniSat, and I finally have found it, but I am
not sure what to do with it. The problem was that gcc 4.5.X (4.5.1 for
sure) seemed to miscompile CryptoMiniSat: the generated executable was
behaving incorrectly. Through a 32-bit Fedora core 14 virtual machine I
have traced the problem to this piece of code:

inline const bool Solver::propNormalClause(Watched* &i, Watched* &j,
Watched *end, const Lit p, PropBy& confl, const bool update)
{
    [..]
    const uint32_t offset = i->getNormOffset();
    Clause& c = *clauseAllocator.getPointer(offset);

    // Make sure the false literal is data[1]:
    if (c[0] == ~p) {
        std::swap(c[0], c[1]);
    }

    assert(c[1] == ~p);

    // If 0th watch is true, then clause is already satisfied.
    if (value(c[0]).getBool()) {
        printf("Zeroth watch is true\n");
        j->setNormClause();
        j->setNormOffset(offset);
        j->setBlockedLit(c[0]);
        j++;
        return true;
    }
    // Look for new watch:
    [..]

The interesting part here is "std::swap(c[0], c[1])". The 'miscompiled'
code indeed swaps c[0] and c[1], but then the check for value(c[0]) goes
wrong: it actually checks c[1]'s value, as if the swap didn't take
place. If we put in these code pieces:

// Make sure the false literal is data[1]:
const Lit lit2 = c.size()>2 ? c[2] : c[0];
if (c[0] == ~p) {
    std::swap(c[0], c[1]);
}
if (c.size() > 2) assert(lit2 == c[2]);

then the compiled executable is correct (and, by the way, checks that
c[2] hasn't been touched). This to me sounds very fishy. Furthermore,
this "fix" seems to fix CryptoMiniSat for all problems. I.e. this is the
only piece of 'miscompiled' code.


To ease the pain of debugging, I have put a full .tar.bz2 file here:

http://planete.inrialpes.fr/~soos/tmp/cryptominisat.tar.bz2

If you extract it, you will find the CryptoMiniSat sources, and three
extra directories:

"problems/" -- a set of problems. "fuzz.cnf" is what interests us the
most. It's small, and CryptoMS fails on it with assert

"buildO0/" -- a build created with "CXXFLAGS=O0 ../configure", which
disables all optimisations. This works perfectly well. Executable
included. Output of "./cryptominisat --nosortwatched  --nosatelite
- --noclausevivif --nonormxorfind --nobinxorfind  --novarreplace
- --nofailedlitsearch ../problems/fuzz.cnf > output" included

"build/" -- a build created with "../configure", i.e. a regular build.
Fails for "fuzz.cnf". Executable included. Output of "./cryptominisat
- --nosortwatched  --nosatelite --noclausevivif --nonormxorfind
- --nobinxorfind  --novarreplace --nofailedlitsearch ../problems/fuzz.cnf
> output" included

The diff between the two outputs is (executed from "buildO0/"):

diff -U 40 output ../build/output | less
 [..uselesess stuff..]
 Propagating lit -48
 [..]
 clause num 262752 as i of prop: 48 -51 290 346 446 -800
- -Zeroth watch is true
- -clause num 262752 after propNorm: -51 48 290 346 446 -800
+Skip watch
+new watch
+clause num 262752 after propNorm: -51 290 48 346 446 -800

Note that it must swap c[0] and c[1] as p = -48, so ~p = 48, so the "if
(c[0] == ~p)" triggers. The correct line (with "-" in front) swaps c[0]
and c[1] finds that the new c[0] is true, and moves along. The incorrect
one swaps c[0] and c[1], but then checks the value of c[1], noting that
it's false, then finds another watch which seems to be c[3], but then
proceeds to set c[1] to c[2] and c[2] to ~p = 48:

// Look for new watch:
for (Lit *k = c.getData() + 2, *end2 = c.getDataEnd(); k != end2; k++) {
    printf("Skip watch\n");
    if (value(*k) != l_False) {
        printf("new watch\n");
        c[1] = *k;
        *k = ~p;
        watches[(~c[1]).toInt()].push(Watched(offset, c[0]));
        return true;
     }
}

ending up with:

 48  -51  290 346 446 -800  --> swap c[0] with c[1] =
- -51  48   290 346 446 -800  --> set c[1] to c[2], and c[2] to ~p =
- -51  290  48  346 446 -800

But this hypothesis may not be too good. It doesn't explain why it
treats things as if it has found c[2] instead of c[3] to be the new watch.

Anyone has any clue why all this happens? Could this be a gcc bug? Here
is what I think:

* shouldn't be a strict-alias bug, as there are no two clauses treated
at the same time,  and there seems to be no type-prunings going on here
other than that to get Clause "c" from memory space. But we are not
dealing with two clauses here.

* Lits should be able to alias themselves (as they are the same class,
so strict-alias doesn't play a part), and if you have a look, all
references to lits are with c[X] or through a Lit*, so the compiler must
think they can alias.

* Maybe using the Clause struct the way it is now is dangerous. However,
it's meant to be supported by gcc:
http://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html. Currently, we have
something like:

struct Clause {
  uint32_t size;
  uint32_t otherdata;
  Lit data[0];
};

which corresponds to that described in the webpage.


Unfortunately, I am not an expert of assembly, disassembly, or
assembly-level debugging. The executable is there, though, so if you are
better and have some spare time, I would really appreciate if you could
have a go at it with gdb. Putting the right breakpoint should be easy
with all the debugging output, just don't forget to give all the options
to CryptoMiniSat, as detailed above. Breaking on clause num 262752 (it
propagates exactly once) should be sufficient to break at the right
moment and examine what the hell is going on. The gdb debugging commands
to print literals' variables are:

"print c.data[0].x/2+1" --> prints c[0]'s variable
"print c.data[1].x/2+1" --> prints c[1]'s variable
etc.

I must be going mad, maybe someone with a fresh mind has some ideas
about this?

Thank you in advance,

Mate


- -- 
Mate Soos
Security Research Labs
http://www.srlabs.de
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk02yYwACgkQsTOOstKb0jk/6wCgjve6sDFJKwJ3y93mpoIhrqNz
qroAn2rury10J27zf7JRE5SsCEUkNG7V
=T3gr
-----END PGP SIGNATURE-----

_______________________________________________
Cryptominisat-devel mailing list
Cryptominisat-devel at lists.gforge.inria.fr
http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/cryptominisat-devel 




More information about the Cryptominisat-devel mailing list