too many LOCK/UNLOCK in memory.c, low efficiency. #1782

Closed

fengrl opened this issue Sep 30, 2018 · 15 comments

fengrl (Contributor) commented Sep 30, 2018

In blas_memory_alloc there are too many LOCK_COMMAND calls. Looking at the code:
    do {
      if (!memory[position].used && (memory[position].pos == mypos)) {
        LOCK_COMMAND(&alloc_lock);
        /* blas_lock(&memory[position].lock); */

        if (!memory[position].used) goto allocation;

        UNLOCK_COMMAND(&alloc_lock);
        /* blas_unlock(&memory[position].lock); */
      }

      position ++;

    } while (position < NUM_BUFFERS);

    ...

    position = 0;

    do {
      /* if (!memory[position].used) { */
        LOCK_COMMAND(&alloc_lock);
        /* blas_lock(&memory[position].lock); */

        if (!memory[position].used) goto allocation;

        UNLOCK_COMMAND(&alloc_lock);
        /* blas_unlock(&memory[position].lock); */
      /* } */

      position ++;

    } while (position < NUM_BUFFERS);

A single atomic check of if (!memory[position].used) needs one LOCK/UNLOCK pair, so why not move the LOCK/UNLOCK outside of the loop?
A single malloc requires many LOCK/UNLOCK operations, which makes memory allocation very inefficient.

The modification below passes a test with 5 threads.
    LOCK_COMMAND(&alloc_lock);
    do {
      if (!memory[position].used && (memory[position].pos == mypos)) {

        if (!memory[position].used) goto allocation;

      }

      position ++;

    } while (position < NUM_BUFFERS);
    UNLOCK_COMMAND(&alloc_lock);

    ...

    position = 0;
    LOCK_COMMAND(&alloc_lock);
    do {

      if (!memory[position].used) goto allocation;

      position ++;

    } while (position < NUM_BUFFERS);
    UNLOCK_COMMAND(&alloc_lock);

brada4 (Contributor) commented Sep 30, 2018

Could you post it as a pull request? It is hard to untangle the formatting in the post.
Best done in 2 parts; IMO the 2nd loop should benefit from being unrolled by the compiler, I have doubts about the 1st.

brada4 (Contributor) commented Sep 30, 2018

.... 1st loop is invoked for x86_32 and mips64, and looks disabled for x86_64 ...

martin-frbg (Collaborator) commented Sep 30, 2018

Guess I made a wrong assumption about the likelihood of the condition being true (not sure if I benchmarked this when I made the very first PR that introduced these locks, but back then I did not even expect xianyi to merge my PR without any discussion or changes). Hopefully the new TLS code can become the default for memory.c in the near future.
@brada4 as I read it this is "only" about moving the locking outside of the loop, and mips64 appears to be fengrl's main platform (which would be great, as we seem to have few if any mips64 developers around lately).

brada4 (Contributor) commented Sep 30, 2018

A short lock is good for letting a concurrent thread kick in. But this malloc, if it happens at all, happens once per thread per function call. I don't think the small locks have any benefit: in effect, if another thread proceeds to _alloc, this one will be locked out anyway, and the sum of time is all allocs + all checks either way. I don't think a swarm of locks in every thread speeds it up at all.

The 2nd loop has a known loop start and will be unrolled.
The 1st will remain a loop, still better than the same loop with re-locking...
...though it tries to start in memory warmer/closer to the particular core; the core detection (assembly or pthreads depending on the architecture) does not look very clean...
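To illustrate the trade-off being discussed, here is a minimal standalone sketch (hypothetical names such as buf_used, NBUF and find_free_*; not OpenBLAS code) contrasting a lock per probed slot with one lock around the whole scan:

    #include <pthread.h>
    #include <stdbool.h>

    #define NBUF 64

    static bool buf_used[NBUF];                 /* hypothetical buffer-usage table */
    static pthread_mutex_t scan_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Fine-grained: lock and unlock once per probed slot (the pattern in the issue). */
    int find_free_fine(void) {
      for (int i = 0; i < NBUF; i++) {
        pthread_mutex_lock(&scan_lock);
        if (!buf_used[i]) {
          buf_used[i] = true;
          pthread_mutex_unlock(&scan_lock);
          return i;
        }
        pthread_mutex_unlock(&scan_lock);
      }
      return -1;
    }

    /* Coarse-grained: lock once, scan all slots, unlock once (the proposed change). */
    int find_free_coarse(void) {
      pthread_mutex_lock(&scan_lock);
      for (int i = 0; i < NBUF; i++) {
        if (!buf_used[i]) {
          buf_used[i] = true;
          pthread_mutex_unlock(&scan_lock);
          return i;
        }
      }
      pthread_mutex_unlock(&scan_lock);
      return -1;
    }

Either way, threads that actually allocate serialize on the same mutex; the coarse version just avoids paying the lock/unlock overhead once per probed slot.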

martin-frbg added a commit that referenced this issue Oct 5, 2018

brada4 (Contributor) commented Oct 5, 2018

I think it is fixed to the maximum extent (not considering the chance of a big rewrite in the shadow of the incoming TLS allocator).

fengrl (Contributor, Author) commented Oct 8, 2018

@brada4 @martin-frbg,
I notice that ".... 1st loop is invoked for x86_32 and mips64, and looks disabled for x86_64 ...".
As I understand it, x86_64 works well with only the 2nd loop, so I guess mips64 can also work well with only the 2nd loop.

brada4 (Contributor) commented Oct 8, 2018

You can comment out #define WHEREAMI in common_mips64.h to disable the 1st loop.
Both code segments have to do with serializing memory allocations. The 1st loop tries to find an unused piece of memory without a lock, then locks and allocates; the 2nd locks, finds, and allocates.
If you can measure a difference it might be worth committing. But keep in mind that a parallel non-blocking allocator is in the works, and this code will probably fade away.
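As a rough illustration of that structure, here is a simplified standalone sketch (not the literal memory.c source; the #if WHEREAMI placement and the helper name search_slot are assumptions for illustration):

    #include <pthread.h>
    #include <stdbool.h>

    #define NUM_BUFFERS 16

    struct slot { bool used; int pos; };          /* stand-in for the memory[] table */
    static struct slot memory[NUM_BUFFERS];
    static pthread_mutex_t alloc_lock = PTHREAD_MUTEX_INITIALIZER;

    static int search_slot(int mypos) {
      int position = 0;

    #if defined(WHEREAMI)
      /* Pass 1: prefer a slot last used at this core position; check without the
         lock first, then re-check under the lock before claiming it. */
      do {
        if (!memory[position].used && memory[position].pos == mypos) {
          pthread_mutex_lock(&alloc_lock);
          if (!memory[position].used) goto allocation;
          pthread_mutex_unlock(&alloc_lock);
        }
        position++;
      } while (position < NUM_BUFFERS);
      position = 0;
    #endif

      /* Pass 2: take any unused slot, checking under the lock. */
      do {
        pthread_mutex_lock(&alloc_lock);
        if (!memory[position].used) goto allocation;
        pthread_mutex_unlock(&alloc_lock);
        position++;
      } while (position < NUM_BUFFERS);

      return -1;                                  /* nothing free */

    allocation:
      memory[position].used = true;
      memory[position].pos  = mypos;
      pthread_mutex_unlock(&alloc_lock);
      return position;
    }

Commenting out #define WHEREAMI then simply compiles the whole position-aware first pass away.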

fengrl (Contributor, Author) commented Oct 8, 2018

Thanks for your quick reply. I will test with WHEREAMI commented out.
Looking forward to the non-blocking allocator. Thanks.

brada4 (Contributor) commented Oct 8, 2018

Just post a PR against the top-level header if you can measure that the 1st loop yields no performance improvement.
Probably x86, with its biggest deployment being syswow64, will move along the same way; it depends on the speed of polishing the TLS allocator.

martin-frbg (Collaborator) commented

I do not think we will want to remove "WHEREAMI" unconditionally on all MIPS64, or do we? (And I believe the main reason why USE_PTHREAD_LOCK is unset on Windows is that there is no libpthread there. I am still worried about introducing races on someone's Loongson3A-based compute cluster, if these still exist.)
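For background, LOCK_COMMAND/UNLOCK_COMMAND are macros selected by these build flags. A rough sketch of the kind of mapping involved (the exact definitions live in OpenBLAS's common headers and may differ in detail; treat this as an assumption for illustration):

    /* Illustrative sketch only; the real definitions are in OpenBLAS's common headers. */
    #ifdef USE_PTHREAD_LOCK
      /* pthread mutexes, hence unusable where there is no libpthread */
      #define LOCK_COMMAND(x)   pthread_mutex_lock(x)
      #define UNLOCK_COMMAND(x) pthread_mutex_unlock(x)
    #else
      /* fall back to OpenBLAS's own spinlock primitives */
      #define LOCK_COMMAND(x)   blas_lock(x)
      #define UNLOCK_COMMAND(x) blas_unlock(x)
    #endif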

fengrl (Contributor, Author) commented Oct 9, 2018

For the 1st loop, only MIPS64 and x86_32 use it; other platforms work well without this part.
Could you please help check whether this loop is platform specific?

This topic and modification have no relation to the LOONGSON3A platform. Actually I have another pull request about LOONGSON3A; we can talk more in that thread.

fengrl (Contributor, Author) commented Oct 9, 2018

Pull request #1800 modifies the 1st loop on MIPS64 so that it shares the same code path as x86_64, ARM, POWER, ...

martin-frbg (Collaborator) commented

Seems that the first loop fell out of fashion on x86/x86_64 with 2021d0f "remove expensive function calls", and the MIPS port had only inherited it from the original libGotoBLAS2 implementation.

brada4 (Contributor) commented Oct 9, 2018

From that it can be deduced that x86(_32) will gain in the same way...

fengrl (Contributor, Author) commented Oct 10, 2018

Problem fixed.
