too many LOCK/UNLOCK in memory.c, low efficiency. #1782

Closed

fengrl opened this issue Sep 30, 2018 · 15 comments

fengrl (Contributor) commented Sep 30, 2018

In blas_memory_alloc there are too many LOCK_COMMAND calls. Looking at the code:
    do {
      if (!memory[position].used && (memory[position].pos == mypos)) {
        LOCK_COMMAND(&alloc_lock);
        /* blas_lock(&memory[position].lock); */

        if (!memory[position].used) goto allocation;

        UNLOCK_COMMAND(&alloc_lock);
        /* blas_unlock(&memory[position].lock); */
      }

      position ++;

    } while (position < NUM_BUFFERS);

    ...

    position = 0;

    do {
      /* if (!memory[position].used) { */
        LOCK_COMMAND(&alloc_lock);
        /* blas_lock(&memory[position].lock); */

        if (!memory[position].used) goto allocation;

        UNLOCK_COMMAND(&alloc_lock);
        /* blas_unlock(&memory[position].lock); */
      /* } */

      position ++;

    } while (position < NUM_BUFFERS);

A single atomic check of if (!memory[position].used) needs one LOCK/UNLOCK pair, so why not move the LOCK/UNLOCK outside of the loop?
A single malloc requires many LOCK/UNLOCK operations, which makes memory allocation very inefficient.

The modification below passes a test with 5 threads.
    LOCK_COMMAND(&alloc_lock);
    do {
      if (!memory[position].used && (memory[position].pos == mypos)) {

        if (!memory[position].used) goto allocation;

      }

      position ++;

    } while (position < NUM_BUFFERS);
    UNLOCK_COMMAND(&alloc_lock);

    ...

    position = 0;
    LOCK_COMMAND(&alloc_lock);
    do {

      if (!memory[position].used) goto allocation;

      position ++;

    } while (position < NUM_BUFFERS);
    UNLOCK_COMMAND(&alloc_lock);

brada4 (Contributor) commented Sep 30, 2018

Could you post it as a pull request? It is hard to untangle the formatting in the post.
Best done in 2 parts; IMO the 2nd loop should benefit from being unrolled by the compiler, I have doubts about the 1st.

brada4 (Contributor) commented Sep 30, 2018

.... 1st loop is invoked for x86_32 and mips64, and looks disabled for x86_64 ...

martin-frbg (Collaborator) commented Sep 30, 2018

Guess I made a wrong assumption about the likelihood of the condition being true (not sure if I benchmarked this when I made the very first PR that introduced these locks, but back then I did not even expect xianyi to merge my PR without any discussion or changes). Hopefully the new TLS code can become the default for memory.c in the near future.
@brada4 as I read it this is "only" about moving the locking outside of the loop, and mips64 appears to be fengrl's main platform (which would be great, as we seem to have few if any mips64 developers around lately).

brada4 (Contributor) commented Sep 30, 2018

A short lock is good for letting a concurrent thread kick in. But this malloc, if it happens at all, happens once per thread per function call. I don't think the small locks have any benefit: in effect, if another thread proceeds to _alloc, this one will be locked out anyway, and the sum of time is all allocs + all checks either way. I don't think a swarm of locks in every thread speeds it up at all.

The 2nd loop has a known loop start and will be unrolled.
The 1st will remain a loop, still better than the same loop with re-locking...
...though it tries to start in memory warmer/closer to the particular core; the core detection (assembly or pthreads depending on the architecture) does not look very clean...
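To illustrate the trade-off being discussed, here is a minimal standalone sketch (hypothetical names such as buf_used, NBUF and find_free_*; not OpenBLAS code) contrasting a lock per probed slot with one lock around the whole scan:

    #include <pthread.h>
    #include <stdbool.h>

    #define NBUF 64

    static bool buf_used[NBUF];                 /* hypothetical buffer-usage table */
    static pthread_mutex_t scan_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Fine-grained: lock and unlock once per probed slot (the pattern in the issue). */
    int find_free_fine(void) {
      for (int i = 0; i < NBUF; i++) {
        pthread_mutex_lock(&scan_lock);
        if (!buf_used[i]) {
          buf_used[i] = true;
          pthread_mutex_unlock(&scan_lock);
          return i;
        }
        pthread_mutex_unlock(&scan_lock);
      }
      return -1;
    }

    /* Coarse-grained: lock once, scan all slots, unlock once (the proposed change). */
    int find_free_coarse(void) {
      pthread_mutex_lock(&scan_lock);
      for (int i = 0; i < NBUF; i++) {
        if (!buf_used[i]) {
          buf_used[i] = true;
          pthread_mutex_unlock(&scan_lock);
          return i;
        }
      }
      pthread_mutex_unlock(&scan_lock);
      return -1;
    }

Either way, threads that actually allocate serialize on the same mutex; the coarse version just avoids paying the lock/unlock overhead once per probed slot.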

martin-frbg added a commit that referenced this issue Oct 5, 2018

brada4 (Contributor) commented Oct 5, 2018

I think it is fixed to the maximum extent (not considering the chance of a big rewrite in the shadow of the incoming TLS allocator).

fengrl (Contributor, Author) commented Oct 8, 2018

@brada4 @martin-frbg,
I notice that ".... 1st loop is invoked for x86_32 and mips64, and looks disabled for x86_64 ...".
As I understand it, x86_64 works well with only the 2nd loop, so I guess mips64 can also work well with only the 2nd loop.

brada4 (Contributor) commented Oct 8, 2018

You can comment out #define WHEREAMI in common_mips64.h to disable the 1st loop.
Both code segments have to do with serializing memory allocations. The 1st loop tries to find an unused piece of memory without a lock, then locks and allocates; the 2nd locks, finds, and allocates.
If you can measure a difference it might be worth committing. But keep in mind that a parallel non-blocking allocator is in the works, and this code will probably fade away.
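As a rough illustration of that structure, here is a simplified standalone sketch (not the literal memory.c source; the #if WHEREAMI placement and the helper name search_slot are assumptions for illustration):

    #include <pthread.h>
    #include <stdbool.h>

    #define NUM_BUFFERS 16

    struct slot { bool used; int pos; };          /* stand-in for the memory[] table */
    static struct slot memory[NUM_BUFFERS];
    static pthread_mutex_t alloc_lock = PTHREAD_MUTEX_INITIALIZER;

    static int search_slot(int mypos) {
      int position = 0;

    #if defined(WHEREAMI)
      /* Pass 1: prefer a slot last used at this core position; check without the
         lock first, then re-check under the lock before claiming it. */
      do {
        if (!memory[position].used && memory[position].pos == mypos) {
          pthread_mutex_lock(&alloc_lock);
          if (!memory[position].used) goto allocation;
          pthread_mutex_unlock(&alloc_lock);
        }
        position++;
      } while (position < NUM_BUFFERS);
      position = 0;
    #endif

      /* Pass 2: take any unused slot, checking under the lock. */
      do {
        pthread_mutex_lock(&alloc_lock);
        if (!memory[position].used) goto allocation;
        pthread_mutex_unlock(&alloc_lock);
        position++;
      } while (position < NUM_BUFFERS);

      return -1;                                  /* nothing free */

    allocation:
      memory[position].used = true;
      memory[position].pos  = mypos;
      pthread_mutex_unlock(&alloc_lock);
      return position;
    }

Commenting out #define WHEREAMI then simply compiles the whole position-aware first pass away.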

fengrl (Contributor, Author) commented Oct 8, 2018

Thanks for your quick reply. I will test with WHEREAMI commented out.
Looking forward to the non-blocking allocator. Thanks.

brada4 (Contributor) commented Oct 8, 2018

Just post a PR against the top-level header if you can measure that the 1st loop yields no performance improvement.
Probably x86, with its biggest deployment being syswow64, will move along the same way; it depends on the speed of polishing the TLS allocator.

martin-frbg (Collaborator) commented

I do not think we will want to remove "WHEREAMI" unconditionally on all MIPS64, or do we? (And I believe the main reason why USE_PTHREAD_LOCK is unset on Windows is that there is no libpthread there. I am still worried about introducing races on someone's Loongson3A-based compute cluster, if these still exist.)
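For background, LOCK_COMMAND/UNLOCK_COMMAND are macros selected by these build flags. A rough sketch of the kind of mapping involved (the exact definitions live in OpenBLAS's common headers and may differ in detail; treat this as an assumption for illustration):

    /* Illustrative sketch only; the real definitions are in OpenBLAS's common headers. */
    #ifdef USE_PTHREAD_LOCK
      /* pthread mutexes, hence unusable where there is no libpthread */
      #define LOCK_COMMAND(x)   pthread_mutex_lock(x)
      #define UNLOCK_COMMAND(x) pthread_mutex_unlock(x)
    #else
      /* fall back to OpenBLAS's own spinlock primitives */
      #define LOCK_COMMAND(x)   blas_lock(x)
      #define UNLOCK_COMMAND(x) blas_unlock(x)
    #endif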

fengrl (Contributor, Author) commented Oct 9, 2018

For the 1st loop, only MIPS64 and x86_32 use it; other platforms work well without this part.
Could you please help check whether this loop is platform specific?

This topic and modification have no relation to the LOONGSON3A platform. Actually I have another pull request about LOONGSON3A; we can talk more in that thread.

fengrl (Contributor, Author) commented Oct 9, 2018

Pull request #1800 modifies the 1st loop on MIPS64 so that it shares the same code path as x86_64, ARM, POWER, ...

martin-frbg (Collaborator) commented

Seems that the first loop fell out of fashion on x86/x86_64 with 2021d0f "remove expensive function calls", and the MIPS port had only inherited it from the original libGotoBLAS2 implementation.

brada4 (Contributor) commented Oct 9, 2018

From that it can be deduced that x86(_32) will gain in the same way...

fengrl (Contributor, Author) commented Oct 10, 2018

Problem fixed.
