too many LOCK/UNLOCK in memory.c, low efficiency. #1782
Comments
Could you post it as a pull request? It is hard to untangle the formatting in the post.
.... 1st loop is invoked for x86_32 and mips64, and looks disabled for x86_64 ...
I guess I made a wrong assumption about the likelihood of the condition being true (I am not sure whether I benchmarked this when I made the very first PR that introduced these locks, but back then I did not even expect xianyi to merge my PR without any discussion or changes). Hopefully the new TLS code can become the default for memory.c in the near future.
A short lock is good for letting a concurrent thread kick in. But this malloc happens, if at all, once on each thread per function call. I don't think small locks have any benefit: in effect, if another thread proceeds to _alloc, this one will be locked out anyway. The total time is all allocs plus all checks either way. I don't think a swarm of locks in every thread speeds it up at all. The 2nd loop has a known loop start and will be unrolled.
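To put a number on the "swarm of locks" point, here is a minimal single-threaded sketch that counts lock acquisitions for one search. It is not code from memory.c: `NUM_SLOTS`, `counted_lock`, `find_free_lock_per_slot`, `find_free_lock_once` and `count_locks` are hypothetical names, a plain pthread mutex stands in for whatever `LOCK_COMMAND` expands to, and the search only reports a free slot without claiming it.

```c
#include <assert.h>
#include <pthread.h>

#define NUM_SLOTS 8   /* hypothetical; stands in for NUM_BUFFERS */

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static long lock_calls = 0;   /* acquisition counter, for illustration only */

static void counted_lock(void) {
    pthread_mutex_lock(&table_lock);
    lock_calls++;
}

static void counted_unlock(void) {
    assert(lock_calls > 0);   /* an unlock must follow a counted lock */
    pthread_mutex_unlock(&table_lock);
}

/* One lock/unlock pair per slot examined. */
static int find_free_lock_per_slot(const int *used) {
    for (int i = 0; i < NUM_SLOTS; i++) {
        counted_lock();
        int is_free = !used[i];
        counted_unlock();
        if (is_free) return i;
    }
    return -1;
}

/* One lock/unlock pair around the whole scan. */
static int find_free_lock_once(const int *used) {
    int found = -1;
    counted_lock();
    for (int i = 0; i < NUM_SLOTS; i++) {
        if (!used[i]) { found = i; break; }
    }
    counted_unlock();
    return found;
}

/* Returns how many times the lock was taken by a single search. */
static long count_locks(int (*find)(const int *), const int *used) {
    lock_calls = 0;
    (void)find(used);
    return lock_calls;
}
```

With the first free slot at index 5, `count_locks(find_free_lock_per_slot, table)` reports 6 acquisitions while `count_locks(find_free_lock_once, table)` reports 1. Whether that difference is measurable in practice depends on contention and on the cost of the lock primitive, which this sketch does not model.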
I think it is fixed as far as it can be (not counting the chance of a major rewrite in the shadow of the incoming TLS allocator).
@brada4 @martin-frbg,
You can comment out
Thanks for your quick reply. I will test with WHEREAMI commented out.
Just post a PR against the top-level header if you can measure that the 1st loop yields no performance improvement.
I do not think we will want to remove "WHEREAMI" unconditionally on all MIPS64, or do we? (And I believe the main reason why USE_PTHREAD_LOCK is unset on Windows is that there is no libpthread there. I am still worried about introducing races on someone's Loongson3A-based compute cluster, if these still exist.)
Only MIPS64 and x86_32 use the 1st loop. Other platforms work well without it. This topic and modification have nothing to do with the LOONGSON3A platform. I actually have another pull request about LOONGSON3A; we can discuss it further on that thread.
Pull request #1800 modifies the 1st loop on MIPS64 so that it shares the same code as X86_64, arm, power, ....
Seems that first loop fell out of fashion on x86/x86_64 with 2021d0f "remove expensive function calls", and the MIPS port had only inherited it from the original libGotoBLAS2 implementation.
From that it can be deduced that x86(_32) will gain in the same way.
Problem fixed.
In blas_memory_alloc there are too many LOCK_COMMAND calls. I checked the code:

    do {
      if (!memory[position].used && (memory[position].pos == mypos)) {
        LOCK_COMMAND(&alloc_lock);
        /* blas_lock(&memory[position].lock); */

        if (!memory[position].used) goto allocation;

        UNLOCK_COMMAND(&alloc_lock);
        /* blas_unlock(&memory[position].lock); */
      }

      position ++;

    } while (position < NUM_BUFFERS);

    position = 0;

    do {
      /* if (!memory[position].used) { */
        LOCK_COMMAND(&alloc_lock);
        /* blas_lock(&memory[position].lock); */

        if (!memory[position].used) goto allocation;

        UNLOCK_COMMAND(&alloc_lock);
        /* blas_unlock(&memory[position].lock); */
      /* } */

      position ++;

    } while (position < NUM_BUFFERS);

Every single check of if (!memory[position].used) needs one LOCK/UNLOCK pair. Why not move LOCK/UNLOCK outside the loop?
A single malloc requires many LOCK/UNLOCK operations, which makes memory allocation very inefficient.
The modification below passes a test with 5 threads.

    /* Take the lock once for the whole first scan. */
    LOCK_COMMAND(&alloc_lock);
    do {
      if (!memory[position].used && (memory[position].pos == mypos)) {
        /* blas_lock(&memory[position].lock); */

        /* goto leaves the loop with alloc_lock still held */
        if (!memory[position].used) goto allocation;

        /* blas_unlock(&memory[position].lock); */
      }

      position ++;

    } while (position < NUM_BUFFERS);
    UNLOCK_COMMAND(&alloc_lock);

    position = 0;

    /* Take the lock once for the whole second scan. */
    LOCK_COMMAND(&alloc_lock);
    do {
      /* if (!memory[position].used) { */
        /* blas_lock(&memory[position].lock); */

        /* goto leaves the loop with alloc_lock still held */
        if (!memory[position].used) goto allocation;

        /* blas_unlock(&memory[position].lock); */
      /* } */

      position ++;

    } while (position < NUM_BUFFERS);
    UNLOCK_COMMAND(&alloc_lock);

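For readers who want to try the single-lock scan outside OpenBLAS, the following is a minimal standalone sketch under stated assumptions, not the code from memory.c. `NUM_SLOTS`, `ROUNDS`, `MAX_THREADS`, `slot_alloc`, `slot_free`, `worker` and `run_threads` are hypothetical names, a pthread mutex is assumed in place of `LOCK_COMMAND`, and the slot is claimed inside the same critical section as the scan. The C11 atomics are only there to detect two threads holding the same slot at once.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define NUM_SLOTS   4       /* hypothetical; stands in for NUM_BUFFERS */
#define ROUNDS      10000   /* hypothetical number of alloc/free rounds per thread */
#define MAX_THREADS 16      /* hypothetical upper bound for this sketch */

static pthread_mutex_t alloc_lock = PTHREAD_MUTEX_INITIALIZER;
static int used[NUM_SLOTS];              /* protected by alloc_lock */
static atomic_int holders[NUM_SLOTS];    /* threads currently holding each slot */
static atomic_int conflicts;             /* times a slot was held twice at once */

/* Claim the first free slot: one lock acquisition covers the scan and the claim. */
static int slot_alloc(void) {
    int found = -1;
    pthread_mutex_lock(&alloc_lock);
    for (int i = 0; i < NUM_SLOTS; i++) {
        if (!used[i]) { used[i] = 1; found = i; break; }
    }
    pthread_mutex_unlock(&alloc_lock);
    return found;
}

/* Return a slot to the table. */
static void slot_free(int slot) {
    assert(slot >= 0 && slot < NUM_SLOTS);
    pthread_mutex_lock(&alloc_lock);
    used[slot] = 0;
    pthread_mutex_unlock(&alloc_lock);
}

/* Repeatedly claim and release a slot, recording any double ownership. */
static void *worker(void *arg) {
    (void)arg;
    for (int r = 0; r < ROUNDS; r++) {
        int slot = slot_alloc();
        if (slot < 0) continue;   /* table full, try again next round */
        if (atomic_fetch_add(&holders[slot], 1) != 0)
            atomic_fetch_add(&conflicts, 1);
        atomic_fetch_sub(&holders[slot], 1);
        slot_free(slot);
    }
    return NULL;
}

/* Run nthreads workers and return the number of conflicts observed. */
static int run_threads(int nthreads) {
    pthread_t tid[MAX_THREADS];
    int started = 0;
    if (nthreads > MAX_THREADS) nthreads = MAX_THREADS;
    for (int t = 0; t < nthreads; t++) {
        if (pthread_create(&tid[t], NULL, worker, NULL) != 0) break;
        started++;
    }
    for (int t = 0; t < started; t++) pthread_join(tid[t], NULL);
    return atomic_load(&conflicts);
}
```

Calling `run_threads(5)` mirrors the 5-thread test mentioned above; a return value of 0 means no slot was ever handed to two threads at the same time. The sketch says nothing about relative speed, which would still have to be measured on the target platform.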