zn_poly-0.9.p9 fails at least one of its tests on power7 #14098

Closed
kiwifb opened this issue Feb 11, 2013 · 30 comments

Comments

@kiwifb
Member

kiwifb commented Feb 11, 2013

On the login node of our power7 cluster (beatrice), zn_poly fails make check:

(sage-sh) frb15@p2n14-c:src$ make check
test/test -quick all
mpn_smp_basecase()... ok
mpn_smp_kara()... make: *** [check] Segmentation fault (core dumped)

Here is a detailed backtrace:

(gdb) r mpn_smp_kara
Starting program: /hpc/scratch/frb15/sandbox/sage-5.7.beta4/spkg/build/zn_poly-0.9.p9/src/test/test mpn_smp_kara
mpn_smp_kara()... 
Program received signal SIGSEGV, Segmentation fault.
0x00000400000dd3f0 in gmp_rrandomb (rp=0x0, rstate=0x40000134db8, nbits=17779444199848231480) at random2.c:67
67      random2.c: No such file or directory.
        in random2.c
(gdb) bt
#0  0x00000400000dd3f0 in gmp_rrandomb (rp=0x0, rstate=0x40000134db8, nbits=17779444199848231480) at random2.c:67
#1  0x00000400000dd360 in __gmpn_random2 (rp=0x0, n=-5198573331259894519) at random2.c:54
#2  0x000000001002c634 in ZNP_mpn_random2 (res=0x0, n=13248170742449657097) at test/support.c:107
#3  0x0000000010027224 in testcase_mpn_smp_kara (n=6624085371224828549) at test/mpn_mulmid-test.c:89
#4  0x0000000010027434 in test_mpn_smp_kara (quick=0) at test/mpn_mulmid-test.c:125
#5  0x00000000100210dc in run_test (target=0x10041488, quick=0) at test/test.c:187
#6  0x0000000010021450 in main (argc=2, argv=0xfffffffe5a8) at test/test.c:235
(gdb) bt full
#0  0x00000400000dd3f0 in gmp_rrandomb (rp=0x0, rstate=0x40000134db8, nbits=17779444199848231480) at random2.c:67
        bi = 4398113622176
        ranm = 268711088
        cap_chunksize = 0
        chunksize = 0
        i = 277803815622628616
#1  0x00000400000dd360 in __gmpn_random2 (rp=0x0, n=-5198573331259894519) at random2.c:54
        rstate = 0x40000134db8
        bit_pos = 8
        ran = 3915822088
        ranm = 3915822088
#2  0x000000001002c634 in ZNP_mpn_random2 (res=0x0, n=13248170742449657097) at test/support.c:107
        i = 0
#3  0x0000000010027224 in testcase_mpn_smp_kara (n=6624085371224828549) at test/mpn_mulmid-test.c:89
        buf1 = 0x0
        buf2 = 0x0
        ref = 0x0
        res = 0x0
        success = 1
#4  0x0000000010027434 in test_mpn_smp_kara (quick=0) at test/mpn_mulmid-test.c:125
        success = 1
        n = 6624085371224828549
        trial = 0
#5  0x00000000100210dc in run_test (target=0x10041488, quick=0) at test/test.c:187
        success = 4095
#6  0x0000000010021450 in main (argc=2, argv=0xfffffffe5a8) at test/test.c:235
        found = 1
        all_success = 1
        any_targets = 1
        quick = 0
        i = 33
        j = 1
(gdb) q

It seems to point the finger at mpir.
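
One detail in the backtrace is worth noting: every frame from testcase_mpn_smp_kara downwards already carries a null pointer (buf1, buf2, ref and res are all 0x0, and rp=0x0 is what reaches __gmpn_random2), so the buffers were null before mpir was entered. A minimal sketch of a checked allocation follows, assuming the test obtains its buffers with malloc from a byte count derived from n. The helper name alloc_limbs_checked is hypothetical and not part of zn_poly or mpir.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper, not part of zn_poly: allocate `count` limbs of
   `limb_size` bytes each. Returns NULL when the byte count would
   overflow size_t or when malloc itself fails. */
static void *
alloc_limbs_checked (size_t count, size_t limb_size)
{
   if (limb_size != 0 && count > SIZE_MAX / limb_size)
      return NULL;              /* count * limb_size would wrap around */
   return malloc (count * limb_size);
}
```

A caller would test the result against NULL before handing it to a routine such as ZNP_mpn_random2, which turns a segfault deep inside the library into an early, readable failure.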

New spkg:

CC: @jdemeyer

Component: porting

Author: François Bissey, David Harvey

Reviewer: Paul Zimmermann, Jeroen Demeyer

Merged: sage-5.8.beta1

Issue created by migration from https://trac.sagemath.org/ticket/14098

@kiwifb
Member Author

kiwifb commented Feb 11, 2013

comment:1

Note this is the quick test always run with zn_poly. It passes in 5.7beta3 without debug and it fails in beta4 with SAGE_DEBUG=yes.

@zimmermann6
Contributor

comment:2

The __gmpn_random2 (rp=0x0, n=-5198573331259894519) call is very suspicious, since the second argument should be a size in limbs.

Paul
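
The negative size in frame #1 and the huge unsigned size in frame #2 of the backtrace are the same 64-bit pattern read two ways, and the value in frame #2 equals 2 * n - 1 for the n shown in frame #3. The sketch below checks the first relation, assuming the size type in frame #1 is a signed 64-bit integer on this platform (the backtrace prints it as negative, which is consistent with that). The helper as_signed64 is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: the value an unsigned 64-bit count takes when it
   is read as a signed 64-bit integer in two's complement. Written
   without an out-of-range cast so the result does not depend on the
   compiler. */
static int64_t
as_signed64 (uint64_t x)
{
   if (x <= (uint64_t) INT64_MAX)
      return (int64_t) x;
   /* x >= 2^63: the signed reading is x - 2^64 = -(UINT64_MAX - x) - 1 */
   return -(int64_t) (UINT64_MAX - x) - 1;
}
```

Feeding it 13248170742449657097 (frame #2) gives -5198573331259894519 (frame #1), so the bad size originates above mpir, in the caller.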

@kiwifb
Member Author

kiwifb commented Feb 12, 2013

comment:3

Hi Paul,

I suspect that the problem is triggered when enabling the debugging code. Furthermore, zn_poly itself is built with -DNDEBUG regardless of SAGE_DEBUG=yes. I am wondering if that could cause the problem.

Francois

@kiwifb
Member Author

kiwifb commented Feb 12, 2013

comment:4

Very odd. The main code is always compiled with -DNDEBUG, with no option to turn it off. But the code for the failing test is all compiled with -DDEBUG, with no way to turn that off either. So it must be happening when SAGE_DEBUG is turned on for some other component of sage. Since no one else seems to have seen it before, it has to be a power7-specific problem.

@kiwifb
Member Author

kiwifb commented Feb 12, 2013

comment:5

To continue on what you started Paul in

testcase_mpn_smp_kara (n=6624085371224828549)

n is supposed to be a size_t so I think we have a gross overflow somewhere earlier. The value originates from here:

/*
   Tests mpn_smp_kara for a range of n.
*/
int
test_mpn_smp_kara (int quick)
{
   int success = 1;
   size_t n;
   ulong trial;

   // first a dense range of small problems
   for (n = 2; n <= 30 && success; n++)
   for (trial = 0; trial < (quick ? 300 : 30000) && success; trial++)
      success = success && testcase_mpn_smp_kara (n);

   // now a few larger problems too
   for (trial = 0; trial < (quick ? 100 : 3000) && success; trial++)
   {
      n = random_ulong (3 * ZNP_mpn_smp_kara_thresh) + 2;      // <== n is generated here
      success = success && testcase_mpn_smp_kara (n);
   }

   return success;
}

@kiwifb
Member Author

kiwifb commented Feb 13, 2013

comment:6

On power7 it appears that ZNP_mpn_smp_kara_thresh is equal to SIZE_MAX which according to /usr/include/stdint.h is

/* Limit of `size_t' type.  */
# if __WORDSIZE == 64
#  define SIZE_MAX              (18446744073709551615UL)
# else
#  define SIZE_MAX              (4294967295U)
# endif

random_ulong is defined by

ulong
random_ulong (ulong max)
{
   return gmp_urandomm_ui (randstate, max);
}

So n needs to be a size_t, which is at most SIZE_MAX, but the test asks for a random number between 0 and 3 * SIZE_MAX + 2.

Oh dear! I wonder why that doesn't work.

I guess it is potentially fine if ZNP_mpn_smp_kara_thresh is not SIZE_MAX. I don't know how it is on other systems.
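
For reference, the product 3 * ZNP_mpn_smp_kara_thresh is evaluated in unsigned arithmetic, so it never actually exceeds SIZE_MAX: it is reduced modulo SIZE_MAX + 1. The sketch below shows what the bound becomes when the threshold is SIZE_MAX. The function name kara_test_bound is made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Upper bound that the test hands to random_ulong(), computed in size_t
   arithmetic just like the expression 3 * thresh. Unsigned
   multiplication is reduced modulo SIZE_MAX + 1, so the result wraps
   instead of exceeding SIZE_MAX. */
static size_t
kara_test_bound (size_t thresh)
{
   return 3 * thresh;
}
```

With thresh equal to SIZE_MAX the bound wraps to SIZE_MAX - 2. Since gmp_urandomm_ui returns a value in the range 0 to max - 1, n can then land almost anywhere in the size_t range, which matches the n of roughly 6.6e18 seen in the backtrace.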

@zimmermann6
Contributor

comment:7

Francois, can you see how ZNP_mpn_smp_kara_thresh is defined on other 64-bit systems,
and which kinds of values are generated by n = random_ulong (3 * ZNP_mpn_smp_kara_thresh) + 2?

Paul

@kiwifb
Member Author

kiwifb commented Feb 13, 2013

comment:8

I am certainly poking at that. The value of ZNP_mpn_smp_kara_thresh is computed by the tuning code, and it is clearly allowed to be equal to SIZE_MAX:

   // generate tuning.c file
   printf (header);

   x = ZNP_mpn_smp_kara_thresh;
   printf ("size_t ZNP_mpn_smp_kara_thresh = ");
   printf (x == SIZE_MAX ? "SIZE_MAX;\n" : "%lu;\n", x);
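
As an aside, the quoted printf relies on an extra argument being ignored when the "SIZE_MAX;\n" branch is taken, and on %lu matching size_t. A more explicit variant could look like the sketch below. It is not the upstream code, and format_thresh is a hypothetical helper that writes into a caller-supplied buffer so the result can be inspected.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: write the C initializer text for a threshold
   into buf. SIZE_MAX is spelled symbolically; anything else uses %zu,
   the C99 conversion for size_t. Returns what snprintf returns. */
static int
format_thresh (char *buf, size_t len, size_t x)
{
   if (x == SIZE_MAX)
      return snprintf (buf, len, "SIZE_MAX;\n");
   return snprintf (buf, len, "%zu;\n", x);
}
```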

So someone potentially set themselves up for trouble in the test. However, after inserting a few printf calls in the code, the mystery deepens:

mpn_smp_basecase()... ok
mpn_smp_kara()... test: src/mpn_mulmid.c:241: ZNP_mpn_smp_kara: Assertion `n >= 2' failed.
maxtrial= 98 SIZE_MAX= 18446744073709551615
maxtrial= 98
n=31
n=24
n=38
n=40
n=74
n=24
n=28
n=32
n=77
n=67
n=76
n=64
n=13
n=17
n=90
n=42
n=47
n=79
n=21
n=82
n=32
n=10
n=67
n=25
n=26
n=39
n=77
n=90
n=97
n=7
n=74
n=59
n=70
n=87
n=23
n=6
n=70
n=97
n=78
n=74
n=57
n=53
n=28
n=21
n=51
n=33
n=41
n=2
n=88
n=57
n=56
n=96
n=46
n=38
n=69
n=93
n=11
n=61
n=24
n=25
n=45
n=46
n=6
n=44
n=32
n=93
n=59
n=45
n=46
n=31
n=91
n=32
n=45
n=45
n=90
n=61
n=78
n=47
n=33
n=75
n=71
n=37
n=92
n=94
n=50
n=84
n=8
n=43
n=15
n=31
n=31
make: *** [check] Aborted (core dumped)
Error running zn_poly's quick test suite ('make check').

I didn't hit the assertion before; after putting these in, we abort rather than segfault.

@zimmermann6
Contributor

comment:9

I guess there is a bug in the tuning code, which should not give ZNP_mpn_smp_kara_thresh a huge value.

Paul

@sagetrac-dmharvey
Mannequin

sagetrac-dmharvey mannequin commented Feb 13, 2013

comment:10

I am the author.... thanks Paul for drawing my attention to this.

I haven't looked at this code for years so it's almost as mysterious to me as to everyone else here!

My guess is that the bug is in the test code rather than in the tuning code. I suspect that the threshold is allowed to be SIZE_MAX, but that the line

n = random_ulong (3 * ZNP_mpn_smp_kara_thresh) + 2; 

should be replaced by e.g.

if (ZNP_mpn_smp_kara_thresh == SIZE_MAX)
   n = random_ulong (100) + 2;
else
   n = random_ulong (3 * ZNP_mpn_smp_kara_thresh) + 2; 

It could also be a bug in the tuning code, but that would be much harder to fix. If I remember correctly what this threshold means, it is very surprising to me that its optimal value is SIZE_MAX on any real system.
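
The guard suggested above can be exercised without GMP. In this sketch a raw random word r stands in for the generator, and both helper names are hypothetical. It assumes the threshold is nonzero and only handles the SIZE_MAX sentinel: any other threshold larger than SIZE_MAX / 3 would still wrap.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the bound passed to random_ulong(): a cap
   of 100 when the threshold is the SIZE_MAX sentinel, else 3 * thresh.
   Assumes thresh > 0. */
static size_t
kara_test_bound_guarded (size_t thresh)
{
   if (thresh == SIZE_MAX)
      return 100;
   return 3 * thresh;
}

/* Map a raw random word r to a test size the way the test does:
   a value in [0, bound) plus 2. */
static size_t
kara_test_size (size_t thresh, size_t r)
{
   return r % kara_test_bound_guarded (thresh) + 2;
}
```

With the sentinel threshold, every size produced this way stays between 2 and 101, small enough for the test buffers to be allocated.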

@kiwifb
Member Author

kiwifb commented Feb 13, 2013

comment:11

Thanks for the code. My last error was due to me trying to do something similar and failing to read the original code properly (putting the +2 inside the bracket).
power7 is a strange beast, but it is unlikely that SIZE_MAX is the optimal value. The tuning probably assumes something that is wrong on this platform, and that would indeed be difficult to find.
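
One reading of the +2 slip, as a sketch: with the offset outside the call the smallest possible size is 2, while moving it inside the bracket lets sizes 0 and 1 through, which is what the `n >= 2` assertion rejects. Both helpers are hypothetical and model only the ranges, with r standing for a raw random word.

```c
#include <stddef.h>

/* Offset applied outside the call: result lies in [2, max + 1]. */
static size_t
size_offset_outside (size_t max, size_t r)
{
   return r % max + 2;
}

/* Offset moved inside the call: result lies in [0, max + 1], so sizes
   0 and 1 become possible. */
static size_t
size_offset_inside (size_t max, size_t r)
{
   return r % (max + 2);
}
```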

@kiwifb
Member Author

kiwifb commented Feb 13, 2013

comment:12

Not sure what happened. I wanted to do another run to post tuning.c, but the value of ZNP_mpn_smp_kara_thresh is now 133. I swear it was SIZE_MAX before. There are still plenty of SIZE_MAX values in the file:

#include "zn_poly_internal.h"

size_t ZNP_mpn_smp_kara_thresh = 133;
size_t ZNP_mpn_mulmid_fallback_thresh = 4868;

tuning_info_t tuning_info[] = 
{
   {  // bits = 0
   },
   {  // bits = 1
   },
   {  // bits = 2
         94,   // KS1 -> KS2 multiplication threshold
   SIZE_MAX,   // KS2 -> KS4 multiplication threshold
   SIZE_MAX,   // KS4 -> FFT multiplication threshold
        270,   // KS1 -> KS2 squaring threshold
   SIZE_MAX,   // KS2 -> KS4 squaring threshold
   SIZE_MAX,   // KS4 -> FFT squaring threshold
        206,   // KS1 -> KS2 middle product threshold
   SIZE_MAX,   // KS2 -> KS4 middle product threshold
   SIZE_MAX,   // KS4 -> FFT middle product threshold
       1000,   // nussbaumer multiplication threshold
       1000    // nussbaumer squaring threshold
   },
   {  // bits = 3
        105,   // KS1 -> KS2 multiplication threshold
   SIZE_MAX,   // KS2 -> KS4 multiplication threshold
   SIZE_MAX,   // KS4 -> FFT multiplication threshold
        270,   // KS1 -> KS2 squaring threshold
       9634,   // KS2 -> KS4 squaring threshold
   SIZE_MAX,   // KS4 -> FFT squaring threshold
        120,   // KS1 -> KS2 middle product threshold
   SIZE_MAX,   // KS2 -> KS4 middle product threshold
   SIZE_MAX,   // KS4 -> FFT middle product threshold
       1000,   // nussbaumer multiplication threshold
       1000    // nussbaumer squaring threshold
   },
   {  // bits = 4
        123,   // KS1 -> KS2 multiplication threshold
   SIZE_MAX,   // KS2 -> KS4 multiplication threshold
   SIZE_MAX,   // KS4 -> FFT multiplication threshold
        154,   // KS1 -> KS2 squaring threshold
   SIZE_MAX,   // KS2 -> KS4 squaring threshold
   SIZE_MAX,   // KS4 -> FFT squaring threshold
        132,   // KS1 -> KS2 middle product threshold
   SIZE_MAX,   // KS2 -> KS4 middle product threshold
   SIZE_MAX,   // KS4 -> FFT middle product threshold

@zimmermann6
Contributor

comment:13

Francois, anyway it does not hurt to implement what David suggests in comment 10.
This should fix this ticket once and for all.

Paul

@kiwifb
Member Author

kiwifb commented Feb 14, 2013

comment:14

I can say it worked nicely, so I'll prepare a new spkg with it so this kind of thing cannot happen again. I think I found out what made things different. In the original build I used gcc-4.7.1; in this build the compiler was the gcc shipped with the distro, gcc-4.3.4. There could be some subtle bugs lurking in gcc itself or in the standard used to compile the tuning code.

#include "zn_poly_internal.h"

size_t ZNP_mpn_smp_kara_thresh = SIZE_MAX;
size_t ZNP_mpn_mulmid_fallback_thresh = SIZE_MAX;

tuning_info_t tuning_info[] = 
{
   {  // bits = 0
   },
   {  // bits = 1
   },
   {  // bits = 2
         94,   // KS1 -> KS2 multiplication threshold
   SIZE_MAX,   // KS2 -> KS4 multiplication threshold
   SIZE_MAX,   // KS4 -> FFT multiplication threshold
        218,   // KS1 -> KS2 squaring threshold
   SIZE_MAX,   // KS2 -> KS4 squaring threshold
   SIZE_MAX,   // KS4 -> FFT squaring threshold
        216,   // KS1 -> KS2 middle product threshold
   SIZE_MAX,   // KS2 -> KS4 middle product threshold
   SIZE_MAX,   // KS4 -> FFT middle product threshold
       1000,   // nussbaumer multiplication threshold
       1000    // nussbaumer squaring threshold
   },
   {  // bits = 3
        107,   // KS1 -> KS2 multiplication threshold
   SIZE_MAX,   // KS2 -> KS4 multiplication threshold
   SIZE_MAX,   // KS4 -> FFT multiplication threshold
        167,   // KS1 -> KS2 squaring threshold
   SIZE_MAX,   // KS2 -> KS4 squaring threshold
   SIZE_MAX,   // KS4 -> FFT squaring threshold
        146,   // KS1 -> KS2 middle product threshold
       6889,   // KS2 -> KS4 middle product threshold
   SIZE_MAX,   // KS4 -> FFT middle product threshold
       1000,   // nussbaumer multiplication threshold
       1000    // nussbaumer squaring threshold
   },
   {  // bits = 4
         68,   // KS1 -> KS2 multiplication threshold
   SIZE_MAX,   // KS2 -> KS4 multiplication threshold
   SIZE_MAX,   // KS4 -> FFT multiplication threshold
        187,   // KS1 -> KS2 squaring threshold
   SIZE_MAX,   // KS2 -> KS4 squaring threshold
   SIZE_MAX,   // KS4 -> FFT squaring threshold
         95,   // KS1 -> KS2 middle product threshold
       7367,   // KS2 -> KS4 middle product threshold
   SIZE_MAX,   // KS4 -> FFT middle product threshold
       1000,   // nussbaumer multiplication threshold
       1000    // nussbaumer squaring threshold
   },
   {  // bits = 5
         60,   // KS1 -> KS2 multiplication threshold
      18841,   // KS2 -> KS4 multiplication threshold
   SIZE_MAX,   // KS4 -> FFT multiplication threshold
        192,   // KS1 -> KS2 squaring threshold
   SIZE_MAX,   // KS2 -> KS4 squaring threshold
   SIZE_MAX,   // KS4 -> FFT squaring threshold
        128,   // KS1 -> KS2 middle product threshold
       5037,   // KS2 -> KS4 middle product threshold
   SIZE_MAX,   // KS4 -> FFT middle product threshold

@kiwifb
Member Author

kiwifb commented Feb 14, 2013

Attachment: mpn_mulmid-test.c.patch.gz

patch added to zn_poly for review purposes

@kiwifb
Member Author

kiwifb commented Feb 14, 2013

comment:15

OK new spkg ready for review. I also attached the patch for review but it is just David's code.

@kiwifb
Member Author

kiwifb commented Feb 14, 2013

Author: Francois Bissey, David Harvey

@kiwifb kiwifb modified the milestones: sage-5.7, sage-5.8 Feb 14, 2013
@zimmermann6
Contributor

comment:16

The patch looks fine to me. However, since I have no access to a power7, I can only check the patch and new package on another computer. Jeroen, how should we proceed in that case, assuming Francois (the author of the patch and new package) is the only person to have access to a power7?

Paul

@jdemeyer
Contributor

comment:17

I don't mind giving positive_review in this case. We can reasonably expect that the author has tested the package on the failing machine.

@jdemeyer
Contributor

Reviewer: Paul Zimmermann

@zimmermann6
Contributor

comment:18

I don't mind giving positive_review in this case.

However, I'd first like to check that the new spkg works on my machine.

Paul

@jdemeyer
Contributor

comment:19

This doesn't even build:

Applying patches to upstream sources...
makemakefile.py.patch
patching file makemakefile.py
mpn_mulmid-test.c.patch
patching file test/mpn_mulmid-test.c
Hunk #1 FAILED at 121.
1 out of 1 hunk FAILED -- saving rejects to file test/mpn_mulmid-test.c.rej
Error: '../patches/mpn_mulmid-test.c.patch' failed to apply.

real    0m0.011s
user    0m0.010s
sys     0m0.000s
************************************************************************
Error installing package zn_poly-0.9.p10
************************************************************************

@kiwifb
Member Author

kiwifb commented Feb 19, 2013

comment:21

Sorry, I made a big mistake when preparing the final spkg (the source was not pristine, and that is rather unforgiving). It should be OK now (I double-checked).

@zimmermann6
Contributor

comment:22

All tests now pass on my computer (on top of Sage 5.6).

Paul

@zimmermann6
Contributor

Changed reviewer from Paul Zimmermann to Paul Zimmermann, Jeroen Demeyer

@jdemeyer
Contributor

Merged: sage-5.8.beta1

@nexttime
Mannequin

nexttime mannequin commented Feb 24, 2013

comment:24

zn_poly's tuning (and apparently due to that its test suite, too) is flaky on other systems as well: #13947

It would be nice if some of you could also take a look at that... :P

@kiwifb
Member Author

kiwifb commented Feb 25, 2013

comment:25

Yes it looks similar. Turning debugging on was somewhat helpful here.

@fchapoton
Contributor

Changed author from Francois Bissey, David Harvey to François Bissey, David Harvey
