EuCrypt: Correcting MPI Implementation

~ An unexpected part of the EuCrypt library series. Start with Introducing EuCrypt. ~

This is a sad but necessary interruption in the EuCrypt series itself: although coming immediately after chapter 2, this is not chapter 3 at all, I’m sorry to say. Instead of adding another useful part of smg-rsa as the actual chapter 3 does, I’m forced by the uncovered reality in the field to publish this correction: a correction of a coding error that has lived for years in the very much used GnuPG 1.4.10 open sore code. As a wonderful and truly epic example of all the great qualities of the open-source approach that thrives on thousands of eyes that surely quickly and accurately find and correct any bug, this error has survived perfectly fine until now, undetected and even papered over and buried deeper when it tried to manifest! It is quite natural after all, not to mention according to the “law”: given enough eyeballs, any bug will be quickly plastered over and buried as deep as it can be. There, Mr. Raymond, I fixed that Bazaar description for you. Let it be said again and again and forever that in anything that matters, it’s never quantity that matters, but always quality.

The error itself is a tiny thing with a huge impact: a single wrong letter in the wrong place, transforming a piece of code meant to copy over a whole MPI into a piece of code that does precisely… nothing1. Obviously, typing the wrong letter can and does happen from time to time to anyone who types. However, it takes a very sloppy sort of monkey pushing the keys to fail to notice its effect, namely that the resulting code does… nothing. You’d think that anyone would at least read their code once more at a later time to catch such silly mistakes, pretty much like one reads any other text for mere proofreading if nothing more. You’d think that anyone at anytime would at least run their code once to see it doing the thing they wanted it to do, wouldn’t you? Well, it clearly shows this is not the way of the great Bazaar, no. So let me show you the resulting code that achieves nothing but still actually eats up some resources whenever called, here you are, straight from mpi/include/mpi-internal.h:

#define MPN_COPY_INCR( d, s, n)   \
    do {        \
  mpi_size_t _i;      \
  for( _i = 0; _i < (n); _i++ ) \
      (d)[_i] = (d)[_i];    \
    } while (0)


Speaking of eyeballs, take your time a bit and spot the error. Then call your child or your grandpa over, ask them to spot the error too and let me know how this addition of another pair of eyeballs clearly helped.

Once the error is spotted, the reader might ask the obvious question, of course: “why does the caller appear to work?” The answer to this is… layered let’s say. The caller does not actually work in fact, unsurprisingly. The caller is a method called mpi_tdiv_q_2exp(MPI w, MPI u, unsigned count). This method supposedly shifts u by count bits to the right and stores the result in w. Except that it doesn’t actually *always* do this: in some cases, all it does is to trim from w count bits and nothing more because it relies on the above good-for-nothing code to copy the relevant bits from u to w. Here, let’s look at its code that is found in mpi/mpi-div.c:

mpi_tdiv_q_2exp( MPI w, MPI u, unsigned count )
    mpi_size_t usize, wsize;
    mpi_size_t limb_cnt;

    usize = u->nlimbs;
    limb_cnt = count / BITS_PER_MPI_LIMB;
    wsize = usize - limb_cnt;
    if( limb_cnt >= usize )
  w->nlimbs = 0;
    else {
  mpi_ptr_t wp;
  mpi_ptr_t up;

  RESIZE_IF_NEEDED( w, wsize );
  wp = w->d;
  up = u->d;

  count %= BITS_PER_MPI_LIMB;
  if( count ) {
      mpihelp_rshift( wp, up + limb_cnt, wsize, count );
      wsize -= !wp[wsize - 1];
  else {
      MPN_COPY_INCR( wp, up + limb_cnt, wsize);

  w->nlimbs = wsize;

Let’s actually read that a bit and see how it works: usize stores the number of “limbs” of an MPI, while limb_cnt stores the number of full limbs that need to be discarded as a result of a shift right by count bits. Those limbs are essentially machine words, groups of bits that the machine can work with directly. Their length in bits is that BITS_PER_MPI_LIMB, hence the integer division of count by it to find out just how many full limbs a shift by count bits implies. Once that is known, the code sets the remaining size (as number of limbs) of the result w and proceeds to copy over from u the bits that remain after the shift (the wp and up are simply pointers to the internal array of limbs of w and u respectively). However, at this point, the path splits by means of that if (count): the shift can further affect another limb by discarding (shifting right) only some of its bits and in this case there is in fact a shift-and-copy operation by means of calling mpihelp_rshift; if this is not the case and the original shift was made by a multiple of BITS_PER_MPI_LIMB, there is only a straightforward copy of limbs to make and that is supposedly achieved by our do-nothing-by-error code, called MPN_COPY_INCR.

Considering the above, it’s hopefully no surprise to you when I say that mpi_tdiv_q_2exp fails of course, as soon as it makes use of the buggy code of MPN_COPY_INCR. And you’d think the error would at least be caught at this point if it wasn’t caught earlier, wouldn’t you? Basic testing of the most obvious kind, meaning both branches of an if, something everyone always does in one way or another, right? Bless your heart and sweet youth and all that but wake up already, as this is open sore we are talking about: no, it’s not caught here either, of-course-not, it’s just-not-enough-eyeballs-yet. Honestly, I suspect they meant “given enough eyeballs LOST as a result, any bug becomes shallow.” We might even live to see this, you know?

Still, there is more! This defective mpi_tdiv_q_2exp is actually used by another method in the original GnuPG so surely the error is exposed there! It has to be, it’s by now the third layer and presumably it’s really not all that many layers that can be built on code-that-does-nothing + code-that-works-only-sometimes. Oh, but the Bazaar is all mighty, the Bazaar can2, unlike those pesky cathedrals that actually had any foundation. Let me illustrate this to you, with code taken straight from GnuPG 1.4.10, from the primegen.c file, from method is_prime that supposedly does probabilistic primality testing and such little things on which the security of your RSA key depends:

    q = mpi_copy( nminus1 );
    k = mpi_trailing_zeros( q );
    mpi_tdiv_q_2exp(q, q, k);

What’s that, you ask? Why, nothing terrible, just a bit of code meant to calculate q as nminus1 / (2^k). Since a division by 2^k is simply a right shift by k bits, that’s really the perfect job for mpi_tdiv_q_2exp, isn’t it? So then why does this code do first a copy of nminus1 into q and only then a shift right by k bits of resulting q into… same variable q. Why exactly doesn’t it do simply a shift right by k bits of nminus1 into q? Like this:

  k = mpi_trailing_zeros( nminus1 );
  mpi_tdiv_q_2exp( q, nminus1, k );

Two method calls instead of three, two lines instead of three, overall much clearer and easier to follow code anyway, since we are dividing indeed nminus1 by 2^k, not some “q” that is meant to be the result, really. So why do it in a complicated way when you can do it in a simple way? Except, you can’t, you see, because you don’t actually read the code you use and the code you use is broken but nobody actually bothered to check it. And when the error that is by now 3 layers deep manifests itself through unexpected results of the simple and straightforward code that should have been there, you don’t spend like me the hours needed to track down the error and actually, finally, mercifully put it out of its misery correcting the code. Oh no, that would be wasted time, wouldn’t it? Instead you are being productive and you find a workaround, simply papering the error over and dancing around it with an idiotic extra-copy that makes *this* code more difficult to follow and further *hides* the previous error, pushing it one layer deeper. Oh, now I plainly see the main advantage of open source: since you are not responsible in any way for the code you write, you can get away with such despicable behaviour, can’t you?

Well, here is not open source. Here code gets actually read and errors get dissected, documented and corrected when found, not swept under the carpet. So, to correct this error, first thing written is a basic test that runs all the branches of that if in mpi_tdiv_q_2exp. Here it is, together with its helper print function, both to be found from now on in mpi/tests/test_mpi.c:

void print_results(MPI in, MPI out, char * title)
  fprintf(stdout, "******** %s ********", title);

  fprintf(stdout, "input : ");
  mpi_print(stdout, in, 1);

  fprintf(stdout, "output: ");
  mpi_print(stdout, out, 1);


 * Test that will fail on original code and will pass after EuCrypt fix is applied.
void test_rshift()
  MPI out, in, copy_in;
  out = mpi_alloc(0);
  in = mpi_alloc(0);
  copy_in = mpi_alloc(0);

  mpi_fromstr(out, "0x20E92FE28E1929");   /* some value */
  mpi_fromstr(in, "0x2000000010000001000000002");
  mpi_fromstr(copy_in, "0x2000000010000001000000002"); /* to make sure the actual input is print, since call can modify in */

  /* print value of BITS_PER_MPI_LIMB */
  fprintf(stdout, "BITS_PER_MPI_LIMB is %d\n", BITS_PER_MPI_LIMB);

  /* shift by 0 */
  mpi_tdiv_q_2exp(out, in, 0);
  print_results(copy_in, out, "TEST: right shift by 0");

  /* shift by multiple of BITS_PER_MPI_LIMB */
  mpi_fromstr(in, "0x2000000010000001000000002");

  mpi_tdiv_q_2exp(out, in, BITS_PER_MPI_LIMB);
  print_results(copy_in, out, "TEST: right shift by BITS_PER_MPI_LIMB");
  /* shift by non-multiple of BITS_PER_MPI_LIMB */
  mpi_fromstr(in, "0x2000000010000001000000002");
  mpi_tdiv_q_2exp(out, in, BITS_PER_MPI_LIMB - 3);
  print_results(copy_in, out, "TEST: right shift by BITS_PER_MPI_LIMB - 3");


Running this with the old code from mpi (so the code from chapter 2, earlier) will show just how mpi_tdiv_q_2exp “works” so that there is no doubt remaining about that. After this and before doing any change, the next step is to search in the code for *any other* users of either mpi_tdiv_q_2exp or the MPN_COPY_INCR itself, since those users for all I know might even rely by now on the wrong behaviour. Mercifully, this search returned empty and so I can proceed finally to the fix itself, which is of course very easy to do *after all this work* i.e. once it was found and tracked down and isolated to ensure that a change does not break down something else. And after this, the last but mandatory step is of course to run the previously written test again and check its output. The corrected code in mpi/include/mpi-internal.h:

#define MPN_COPY_INCR( d, s, n)   \
    do {        \
  mpi_size_t _i;      \
  for( _i = 0; _i < (n); _i++ ) \
      (d)[_i] = (s)[_i];    \
    } while (0)

After all this work that was caused not by the original tiny error itself, not by the wrong letter at the wrong place, but by the total and repeated failure to correct it afterwards, even 3 layers up, tell me again about all those eyeballs and about how productive it is to write a quick workaround instead of searching for the error and eliminate it at source. Then go and run a grep -r “workaround” . in any source directory of any big open source project, just for… fun let’s say, what else.

To wrap this up, here is the vpatch with all the above changes, and my signature for it:

Hopefully no other similar trouble will surface until next week, so that I can finally move on to Chapter 3 that is all about the Miller-Rabin algorithm. Stay tuned!

  1. You know, by this point one might even be relieved to find that it does at least nothing. As opposed to doing something worse. 

  2. “Yes, we can”, right?  

Categorie: Coding.

Lasati un comentariu.

5 Comentarii

  1. - says:

    Have you been living under a rock?

    Yes, that bug used to exist in GnuPG, but it has been spotted and fixed four bloody years ago…

    In general I really like your rants about open sores, but here the pot is calling the kettle black…

  2. Diana Coman says:

    How does that change the fact that it WAS papered over in primegen.c to start with?

    In any case, thank you for your contribution and let me preserve this here because it’s rather juicy: “This bug has been with us since the version 0.0.0 of GnuPG.” (from: ). For the innocent: version 0.0.0 was released in 1997 (as per: so yes, great work there, it took the Bazaar ONLY 6 years to have someone actually read the code used in checking primality of RSA numbers. Can restate the law: given enough years of use, even code that does nothing at all may finally be spotted by someone. Does that sound better?

    Also: mind pointing me to the clear dissection of this, complete with the author’s negrating and/or apologies for it? No, the “oh, but it’s not all that bad” doesn’t cut it I’m afraid, quite on the contrary, it’s usually well related to the very same despicable approach of papering it over.

  3. Mercifully, this search returned empty and so I can proceed finally to the fix itself, which is of course very easy to do *after all this work*

    Reminds me of that 27b/6 bit,

    Actually, you were asking me to design a logotype which would have taken me a few hours and fifteen years experience.

  4. Anonymous says:

    So I had a longer answer, but I realised that posting just the executive summary conveys all the immediately useful points while minimising the damage to the rant.

    Yes, bad code bad, bad code mungers also bad. But there’s really no point complaining about the fact that you could see just how bad. I don’t know what to do to make “the community” care about code quality more and not satisfy itself with a mere “well it seems to work”, though I’ll readily agree it should, but I do think that railing against openness and development model both aren’t helpful. A cathedral full of idiots (and we have a few) is no better than a bazaar filled with same. And, of course, this might be an oopsie layered in less than stellar workarounds, it might also be an “underhanded C contest” entry found in the wild. Perhaps not, but the enterprising researcher might take a gander at the available evidence regardless.

  5. Diana Coman says:

    Anonymous V + WoT is the solution and it’s already available, no need to wonder really. The only way to make people “care” is, of course, individual responsibility that comes inevitably with ownership, yes, what else?

    I think you might have missed one point there: I’m not attacking the availability of the code at all. Quite on the contrary, I’m saying that merely throwing tons of “code” out there is not making anything (other than problems) available. As you can see from this EuCrypt series and in other places on this blog (and on other TMSR blogs for that matter), I am making my code available. The main difference though is that I’m retaining responsibility for it and ALSO for any changes to it that I sign (if any).

    Mircea Popescu Quite, yes. And let’s link the eyeballing capacity model too since we have it anyway:

One Trackback

  1. […] 4090 pseudo-randomly generated bits and so on until you might even find as I did last time that bits and parts of the implementation do nothing in fact. Don’t take my word for it either: go and check for yourself, it’s a very […]

Lasati un comentariu

Puteti folosi aceste taguri si atribute HTML: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>

RSS Subscribe to Ossasepia


Recent comments: