Lapack tests slower with OPENBLAS_NUM_THREADS>1 #192
Comments
Hi @bashseb , Thank you for the feedback. What's your CPU? I think it has hyper-threading enabled, which is why the performance of i=32 is the same as i=16. How do you run the LAPACK test? What's your system BLAS? Is it Intel MKL? Xianyi
thanks @xianyi for your assistance. My CPU is a "Xeon(R) CPU E5-2690 0 @ 2.90GHz". It has 32 threads: 16 physical cores across 2 sockets. I ran the LAPACK test that is included in lapack-3.4.2. I guess it's not really a benchmark, just a verification of whether it works or not, but it outputs numbers on the runtime (see the pastebin). I'm trying to run the HPL benchmark (http://www.netlib.org/benchmark/hpl/) and will report the numbers - is this a good idea? My system BLAS is the default of CentOS release 6.3 (Final). EDIT: I looked at [...] EDIT2: I guess this CPU counts as 'sandybridge'. I think in this case my compiler is too old and I need AVX, right? thanks a lot!
Hi @bashseb , I just checked the outputs of the LAPACK test. Because OpenBLAS uses multithreading and the input matrices are very small, the performance is lower than the default BLAS. This issue is the same as #103 . HPL is a very good benchmark. Yes, it needs MPI. I don't know of another basic benchmark for BLAS. I think Red Hat already applied the Sandy Bridge patch to this gcc version, so you don't need to update your compiler. Xianyi
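Xianyi's point can be seen directly: for tiny matrices the per-call cost of dispatching work to BLAS threads can exceed the arithmetic itself, which is why a test suite made of many small problems slows down as threads increase. A minimal sketch (assuming numpy is installed; whether its BLAS backend is actually OpenBLAS depends on the build, and `OPENBLAS_NUM_THREADS` must be set before numpy loads the library):

```python
import os
# OpenBLAS reads OPENBLAS_NUM_THREADS once at load time, so the
# env var must be set before numpy is imported. (Illustrative:
# it only has an effect if numpy is linked against OpenBLAS.)
os.environ.setdefault("OPENBLAS_NUM_THREADS", "1")

import time
import numpy as np

def time_dgemm(n, repeats=5):
    """Return the best wall-clock time for one n x n double matmul."""
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        a @ b
        best = min(best, time.perf_counter() - t0)
    return best

# For tiny matrices, threading overhead can dominate; the gap between
# 1 thread and N threads typically only pays off at larger sizes.
for n in (32, 256, 1024):
    print(f"n={n:5d}  best={time_dgemm(n):.6f}s")
```

Comparing these numbers for runs launched with different `OPENBLAS_NUM_THREADS` values reproduces the effect seen in the LAPACK tests.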
@xianyi thanks a lot. HPL benchmarks clearly show a run-time advantage for OpenBLAS. I multiplied the default [...]
HPL is perhaps the best benchmark, but the MPI layer makes it difficult to measure BLAS performance in isolation.
Hi @xianyi -- can you elaborate on OpenBLAS' behavior when it encounters hyper-threading on a virtual machine (e.g. Amazon EC2 instances)? A concrete scenario: say we have 4 physical cores and, with HT, 8 threads (2 logical cores per physical core). Is it then valid to set [...]? What if I used [...]?
Hi @concretevitamin , It depends on your benchmark. For example, DGEMM and the other BLAS3 functions are compute-intensive, so those functions cannot benefit from HT. The performance of [...] Xianyi
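A practical corollary of this advice is to cap the thread count at the number of physical cores rather than logical CPUs. One rough way to discover that count is the sketch below (Linux-only, parsing /proc/cpuinfo; `physical_cores` is a hypothetical helper, falling back to the logical count where the fields are missing):

```python
import os

def physical_cores():
    """Best-effort physical-core count: collect distinct
    (physical id, core id) pairs from /proc/cpuinfo.
    Linux-only sketch; falls back to the logical count elsewhere."""
    pairs = set()
    phys = core = None
    try:
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("physical id"):
                    phys = line.split(":", 1)[1].strip()
                elif line.startswith("core id"):
                    core = line.split(":", 1)[1].strip()
                elif not line.strip() and core is not None:
                    pairs.add((phys, core))
                    phys = core = None
        if core is not None:  # file may not end with a blank line
            pairs.add((phys, core))
    except OSError:
        pass
    return len(pairs) or os.cpu_count() or 1

print("logical CPUs  :", os.cpu_count())
print("physical cores:", physical_cores())
```

On the Xeon E5-2690 box above this would report 32 logical CPUs but 16 physical cores, matching the observation that i=32 behaves like i=16.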
Yes, HPL is a good benchmark for a distributed machine, but it requires MPI and a whole host of other tuning to do seriously. If the goal is just to measure BLAS performance, I would just time the DGEMMs, or run [...]
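Timing the DGEMMs directly, as suggested, can be sketched like this (assuming numpy; the sizes are arbitrary, and square n x n DGEMM costs roughly 2*n^3 flops):

```python
import time
import numpy as np

def dgemm_gflops(n, repeats=3):
    """Best-of-N timing of an n x n double-precision matmul,
    reported as GFLOP/s (square DGEMM is ~2*n^3 flops)."""
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)
    a @ b  # warm-up so thread pools and caches are initialized
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        a @ b
        best = min(best, time.perf_counter() - t0)
    return 2.0 * n**3 / best / 1e9

for n in (256, 512, 1024):
    print(f"n={n:5d}  {dgemm_gflops(n):8.2f} GFLOP/s")
```

Running this under different `OPENBLAS_NUM_THREADS` settings gives a direct read on BLAS scaling without the MPI overhead of HPL.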
Thanks for your response. In my case, I am wondering if GEQRF could benefit from HT.
|
Hello,
thank you for providing the OpenBLAS library. My interest is in using your optimized library. I'm not sure if this is the correct place to ask such questions; if not, please let me know.
I'm trying to use OpenBLAS (git) and as a first test tried to compile lapack-3.4.2. I added

```
BLASLIB = /path/to/libopenblas.a -lpthread
```

to make.inc. I then tested the LAPACK tests with different `export OPENBLAS_NUM_THREADS=i`. I did `make clean` testing and `make`. I can observe that i cores are utilized, but the test results for `i>1` are slower than for `i=1` or with the system BLAS (which beats OpenBLAS on most tests for `i=1`). Furthermore, a value of `i=32` is interpreted as `i=16`. I must obviously have done something wrong, but I don't know exactly how to narrow it down. I'd appreciate any hints.

I've put the result of `i=32` on a pastebin: http://pastebin.com/BQATuymz
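The thread-count sweep described above can also be scripted, so each run gets a fresh environment. A sketch (the command is a placeholder to be replaced by the actual LAPACK test invocation; assumes a POSIX environment where child processes inherit the modified environment):

```python
import os
import subprocess
import sys
import time

def run_with_threads(cmd, nthreads):
    """Run cmd once with OPENBLAS_NUM_THREADS=nthreads set in its
    environment, returning the wall-clock time in seconds."""
    env = dict(os.environ, OPENBLAS_NUM_THREADS=str(nthreads))
    t0 = time.perf_counter()
    subprocess.run(cmd, env=env, check=True)
    return time.perf_counter() - t0

if __name__ == "__main__":
    # Placeholder command: substitute the real test driver here.
    cmd = [sys.executable, "-c", "pass"]
    for i in (1, 2, 4, 8, 16):
        print(f"OPENBLAS_NUM_THREADS={i}: {run_with_threads(cmd, i):.3f}s")
```

Setting the variable per-child this way avoids the pitfall of exporting it once in a shell where a long-lived process has already loaded the library.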