Simple Block Reduce Fails when using while loops #330

Open
anicusan opened this issue Nov 15, 2022 · 6 comments

@anicusan
Member

Hi, thank you for developing this library - I would like to write optimised kernels for common GPU algorithms such as reduce, scan, radix sort, etc. similar to CUB but available on all KernelAbstractions platforms. The resulting "KA standard library" (KALib? Caleb?) could be used as a benchmark for future KA development & optimisation - and I can use the lessons along the way to populate the "Writing Kernels" section in the documentation. Big plans, but...

I'm implementing the block-wise reduce following this tutorial with this simple-looking code:

using KernelAbstractions
using CUDA
using CUDAKernels


@kernel function block_reduce(out, in)

    # Get block / workgroup size
    bs = @uniform @groupsize()[1]

    # Block / group index, thread index within block, global thread index
    bi = @index(Group, Linear)
    ti = @index(Local, Linear)
    gi = @index(Global, Linear)

    # Copy each thread's corresponding item from global to shared memory
    cache = @localmem eltype(out) (bs,)
    cache[ti] = in[gi]
    @synchronize

    # Reduce elements in shared memory using sequential addressing following
    # https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf
    @private s = bs ÷ 2
    while s > 0
        if ti < s
            cache[ti] += cache[ti + s]
        end

        @synchronize

        s = s >> 1
    end

    # Copy result back to global memory
    if ti == 1
        out[bi] = cache[1]
    end
end


num_blocks = 10
block_size = 32
num_elements = num_blocks * block_size

in = rand(1:10, num_elements) |> CuArray
out = zeros(num_blocks) |> CuArray


kernel_reduce = block_reduce(CUDADevice(), block_size)
ev = kernel_reduce(out, in, ndrange=num_elements)

wait(ev)
println(out)

It shouldn't be more exotic than the example code in the docs - however, these two lines:

    @private s = bs ÷ 2
    while s > 0

Produce the following errors:

Reason: unsupported use of an undefined name (use of 'bs')
Stacktrace:
 [1] macro expansion
   @ ~/Prog/Julia/KALib/prototype/reduce.jl:31
 [2] gpu_block_reduce
   @ ~/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:81
 [3] gpu_block_reduce
   @ ./none:0
Reason: unsupported dynamic function invocation (call to div)
Stacktrace:
 [1] macro expansion
   @ ~/Prog/Julia/KALib/prototype/reduce.jl:31
 [2] gpu_block_reduce
   @ ~/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:81
 [3] gpu_block_reduce
   @ ./none:0
Reason: unsupported use of an undefined name (use of 's')
Stacktrace:
 [1] macro expansion
   @ ~/Prog/Julia/KALib/prototype/reduce.jl:32
 [2] gpu_block_reduce
   @ ~/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:81
 [3] gpu_block_reduce
   @ ./none:0
Reason: unsupported dynamic function invocation (call to >)

[...Stacktrace...]

I tried stepping through the code with Cthulhu.jl, but the errors look simple: the kernel ends up calling div(::Any, ::Int64) and >(::Any, ::Int64), so I assume bs = @uniform @groupsize()[1] and @private s = bs ÷ 2 are not being inferred as integers.
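
For comparison, a while loop is no problem in a plain CUDA.jl kernel, so the pattern itself should be sound; here is a minimal, untested sketch of the same sequential-addressing reduction written directly against CUDA.jl (CuDynamicSharedArray is the dynamic shared-memory API in recent CUDA.jl versions). Note that with 1-based thread indices the guard is ti <= s, whereas the kernel above follows the 0-based tutorial with ti < s:

using CUDA

# Same sequential-addressing reduction, written directly against CUDA.jl
# as a sanity check; plain CUDA.jl kernels support while loops.
function cuda_block_reduce!(out, in)
    bs = blockDim().x
    ti = threadIdx().x
    bi = blockIdx().x
    gi = (bi - 1) * bs + ti

    cache = CuDynamicSharedArray(eltype(out), bs)
    cache[ti] = in[gi]
    sync_threads()

    s = bs ÷ 2
    while s > 0
        if ti <= s                       # threads 1..s stay active
            cache[ti] += cache[ti + s]
        end
        sync_threads()
        s >>= 1
    end

    ti == 1 && (out[bi] = cache[1])
    return nothing
end

# @cuda threads=block_size blocks=num_blocks shmem=block_size * sizeof(Float64) cuda_block_reduce!(out, in)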

If I switch the arrays and device to CPU() I get the following error:

    nested task error: MethodError: no method matching isless(::Int64, ::NTuple{32, Int64})

Would you know why these errors appear, or how I could investigate (and fix...) them?

Thanks,
Leonard

@vchuravy
Member

vchuravy commented Nov 16, 2022

Hm I need to think through the semantics of while loops on the CPU... #262

You can use @macroexpand to debug the lowering of the kernel. You should see two functions being generated: one for the GPU, the other for the CPU.

This looks like a scope issue.

In general, I was thinking that instead of implementing block reduce in KA, we would provide a reduce function or macro that the backends need to implement.
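
Purely illustrative - no such macro exists today, and @groupreduce is a made-up name - but the user-facing side could look something like:

@kernel function block_reduce(out, in)
    gi = @index(Global, Linear)
    val = in[gi]
    # Hypothetical built-in: each backend lowers this to its native
    # workgroup reduction (shared-memory tree on CUDA, a plain loop on the CPU).
    res = @groupreduce(+, val)
    if @index(Local, Linear) == 1
        out[@index(Group, Linear)] = res
    end
end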

@anicusan
Member Author

On the CPU, the while-loop condition is missing an index into the thread-local @private variable:

            while s > 0
                begin
                    var"##N#486" = length((KernelAbstractions.__workitems_iterspace)(__ctx__))
                    begin
                        #= /home/andreinicusan/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:264 =#
                        for var"##I#485" = (KernelAbstractions.__workitems_iterspace)(__ctx__)
                            #= /home/andreinicusan/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:265 =#
                            (KernelAbstractions.__validindex)(__ctx__, var"##I#485") || continue
                            #= /home/andreinicusan/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:266 =#
                            bi = KernelAbstractions.__index_Group_Linear(__ctx__, var"##I#485")
                            ti = KernelAbstractions.__index_Local_Linear(__ctx__, var"##I#485")
                            gi = KernelAbstractions.__index_Global_Linear(__ctx__, var"##I#485")
                            #= /home/andreinicusan/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:267 =#
                            if ti < s[(KernelAbstractions.__index_Local_Linear)(__ctx__, var"##I#485")]
                                #= REPL[18]:21 =#
                                cache[ti] += cache[ti + s[(KernelAbstractions.__index_Local_Linear)(__ctx__, var"##I#485")]]
                            end
                            #= /home/andreinicusan/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:268 =#
                        end
                    end
                end

I notice that @private s is normally accessed as s[(KernelAbstractions.__index_Local_Linear)(__ctx__, var"##I#485")] (the NTuple simulation of GPU semantics is really clever!), but the while s > 0 condition is not rewritten with that index.

Does length((KernelAbstractions.__workitems_iterspace)(__ctx__)) need to be recomputed so many times? Could it be computed just once, outside the user-code loops?
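
Illustratively, I would have expected something like this hand-written sketch (not actual KA output), with the length hoisted out of the emitted loops:

iterspace = KernelAbstractions.__workitems_iterspace(__ctx__)
N = length(iterspace)                        # computed once per kernel
while s_condition                            # placeholder condition
    for var"##I" = iterspace
        (KernelAbstractions.__validindex)(__ctx__, var"##I") || continue
        #= ... user body ... =#
    end
end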

Also, would a break then work correctly in a @kernel?

@anicusan
Member Author

On the GPU I see:

[...]
                    bs = ((#= /home/andreinicusan/.julia/packages/KernelAbstractions/DqITC/src/KernelAbstractions.jl:127 =#
                            (KernelAbstractions.groupsize)(__ctx__)))[1]
[...]
                    var"#177#s" = KernelAbstractions.bs ÷ 2
                    while s > 0
[...]

Why is bs the only variable accessed as KernelAbstractions.bs rather than as the locally defined bs? That explains the undefined-name error and the unsupported dynamic call to div.

The same happens with var"#177#s" and s. I find the macros.jl source a bit hard to follow - I need to wrap my head around macros better - but would you know where to look next to fix these transformations?

Thanks for the quick reply, I'll be using @macroexpand in the future! This could go into a "Debugging Kernels" section in the docs...
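
For anyone landing here later, the whole recipe is just (small usage sketch; demo is a placeholder kernel):

using KernelAbstractions

# Prints both generated variants (cpu_demo and gpu_demo) without launching:
ex = @macroexpand @kernel function demo(a)
    i = @index(Global, Linear)
    a[i] = 2 * a[i]
end
println(ex)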

@vchuravy
Member

So one thing to think through is what a while loop with a @synchronize inside should look like.

@vchuravy
Member

vchuravy commented Nov 16, 2022

Yeah, the @synchronize makes while loops hard...

# Pseudocode: one `s` slot per workitem, plus a mask of still-active workitems
s = MVector(Int, length(wkgrp))
mask = map(x -> x > 0, s)
while any(mask)
    for tid in wkgrp
        mask[tid] || continue
        if tid < s[tid]
            cache[tid] += cache[tid + s[tid]]
        end
    end
    for tid in wkgrp
        mask[tid] || continue
        s[tid] = s[tid] >> 1
    end
    mask = map(x -> x > 0, s)
end

But we certainly couldn't support break.

@vchuravy
Member

We could solve break by introducing a mask...
I like this direction, but it is something that the current architecture doesn't easily support.
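
Roughly, the masking idea as a standalone CPU simulation (illustrative only, not the actual KA transform; the steps[tid] <= 0 test stands in for whatever user condition would trigger the break):

# Each workitem that "breaks" is retired from the mask; the outer loop
# keeps iterating until every workitem has retired.
function masked_break_demo(steps::Vector{Int})
    wkgrp = eachindex(steps)
    active = trues(length(steps))
    while any(active)
        for tid in wkgrp
            active[tid] || continue
            steps[tid] -= 1
            if steps[tid] <= 0       # a user-level break lowers to:
                active[tid] = false  # retiring only this workitem
            end
        end
    end
    return steps
end

masked_break_demo([3, 1, 2])  # workitems retire after different trip counts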
