Simple Block Reduce Fails when using while loops #330

Open
anicusan opened this issue Nov 15, 2022 · 6 comments

@anicusan
Member

Hi, thank you for developing this library - I would like to write optimised kernels for common GPU algorithms such as reduce, scan, radix sort, etc. similar to CUB but available on all KernelAbstractions platforms. The resulting "KA standard library" (KALib? Caleb?) could be used as a benchmark for future KA development & optimisation - and I can use the lessons along the way to populate the "Writing Kernels" section in the documentation. Big plans, but...

I'm implementing the block-wise reduce following this tutorial with this simple-looking code:

using KernelAbstractions
using CUDA
using CUDAKernels


@kernel function block_reduce(out, in)

    # Get block / workgroup size
    bs = @uniform @groupsize()[1]

    # Block / group index, thread index within block, global thread index
    bi = @index(Group, Linear)
    ti = @index(Local, Linear)
    gi = @index(Global, Linear)

    # Copy each thread's corresponding item from global to shared memory
    cache = @localmem eltype(out) (bs,)
    cache[ti] = in[gi]
    @synchronize

    # Reduce elements in shared memory using sequential addressing following
    # https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf
    @private s = bs ÷ 2
    while s > 0
        if ti < s
            cache[ti] += cache[ti + s]
        end

        @synchronize

        s = s >> 1
    end

    # Copy result back to global memory
    if ti == 1
        out[bi] = cache[1]
    end
end


num_blocks = 10
block_size = 32
num_elements = num_blocks * block_size

in = rand(1:10, num_elements) |> CuArray
out = zeros(num_blocks) |> CuArray


kernel_reduce = block_reduce(CUDADevice(), block_size)
ev = kernel_reduce(out, in, ndrange=num_elements)

wait(ev)
println(out)

It shouldn't be more exotic than the example code in the docs - however, these two lines:

    @private s = bs ÷ 2
    while s > 0

Produce the following errors:

Reason: unsupported use of an undefined name (use of 'bs')
Stacktrace:
 [1] macro expansion
   @ ~/Prog/Julia/KALib/prototype/reduce.jl:31
 [2] gpu_block_reduce
   @ ~/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:81
 [3] gpu_block_reduce
   @ ./none:0
Reason: unsupported dynamic function invocation (call to div)
Stacktrace:
 [1] macro expansion
   @ ~/Prog/Julia/KALib/prototype/reduce.jl:31
 [2] gpu_block_reduce
   @ ~/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:81
 [3] gpu_block_reduce
   @ ./none:0
Reason: unsupported use of an undefined name (use of 's')
Stacktrace:
 [1] macro expansion
   @ ~/Prog/Julia/KALib/prototype/reduce.jl:32
 [2] gpu_block_reduce
   @ ~/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:81
 [3] gpu_block_reduce
   @ ./none:0
Reason: unsupported dynamic function invocation (call to >)

[...Stacktrace...]

I tried stepping through the code with Cthulhu.jl, but the errors look simple: the kernel ends up calling div(::Any, ::Int64) and >(::Any, ::Int64), so I assume bs = @uniform @groupsize()[1] and @private s = bs ÷ 2 are not being inferred as integers.
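
For comparison, a while loop is no problem in a plain CUDA.jl kernel, so the pattern itself should be sound; here is a minimal, untested sketch of the same sequential-addressing reduction written directly against CUDA.jl (CuDynamicSharedArray is the dynamic shared-memory API in recent CUDA.jl versions). Note that with 1-based thread indices the guard is ti <= s, whereas the kernel above follows the 0-based tutorial with ti < s:

using CUDA

# Same sequential-addressing reduction, written directly against CUDA.jl
# as a sanity check; plain CUDA.jl kernels support while loops.
function cuda_block_reduce!(out, in)
    bs = blockDim().x
    ti = threadIdx().x
    bi = blockIdx().x
    gi = (bi - 1) * bs + ti

    cache = CuDynamicSharedArray(eltype(out), bs)
    cache[ti] = in[gi]
    sync_threads()

    s = bs ÷ 2
    while s > 0
        if ti <= s                       # threads 1..s stay active
            cache[ti] += cache[ti + s]
        end
        sync_threads()
        s >>= 1
    end

    ti == 1 && (out[bi] = cache[1])
    return nothing
end

# @cuda threads=block_size blocks=num_blocks shmem=block_size * sizeof(Float64) cuda_block_reduce!(out, in)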

If I switch the arrays and device to CPU() I get the following error:

    nested task error: MethodError: no method matching isless(::Int64, ::NTuple{32, Int64})

Would you know why these errors appear, or how I could investigate (and fix...) them?

Thanks,
Leonard

@vchuravy
Member

vchuravy commented Nov 16, 2022

Hm I need to think through the semantics of while loops on the CPU... #262

You can use @macroexpand to debug the lowering of the kernel. You should see two functions being generated: one for the GPU, the other for the CPU.

This looks like a scope issue.

In general, I was thinking that instead of implementing block reduce in KA, we would provide a reduce function or macro that the backends need to implement.
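
Purely illustrative - no such macro exists today, and @groupreduce is a made-up name - but the user-facing side could look something like:

@kernel function block_reduce(out, in)
    gi = @index(Global, Linear)
    val = in[gi]
    # Hypothetical built-in: each backend lowers this to its native
    # workgroup reduction (shared-memory tree on CUDA, a plain loop on the CPU).
    res = @groupreduce(+, val)
    if @index(Local, Linear) == 1
        out[@index(Group, Linear)] = res
    end
end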

@anicusan
Member Author

On the CPU, the while-loop condition is missing an index into the thread-local @private variable:

            while s > 0
                begin
                    var"##N#486" = length((KernelAbstractions.__workitems_iterspace)(__ctx__))
                    begin
                        #= /home/andreinicusan/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:264 =#
                        for var"##I#485" = (KernelAbstractions.__workitems_iterspace)(__ctx__)
                            #= /home/andreinicusan/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:265 =#
                            (KernelAbstractions.__validindex)(__ctx__, var"##I#485") || continue
                            #= /home/andreinicusan/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:266 =#
                            bi = KernelAbstractions.__index_Group_Linear(__ctx__, var"##I#485")
                            ti = KernelAbstractions.__index_Local_Linear(__ctx__, var"##I#485")
                            gi = KernelAbstractions.__index_Global_Linear(__ctx__, var"##I#485")
                            #= /home/andreinicusan/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:267 =#
                            if ti < s[(KernelAbstractions.__index_Local_Linear)(__ctx__, var"##I#485")]
                                #= REPL[18]:21 =#
                                cache[ti] += cache[ti + s[(KernelAbstractions.__index_Local_Linear)(__ctx__, var"##I#485")]]
                            end
                            #= /home/andreinicusan/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:268 =#
                        end
                    end
                end

I notice that @private s is normally accessed as s[(KernelAbstractions.__index_Local_Linear)(__ctx__, var"##I#485")] (the NTuple simulation of GPU semantics is really clever!), but the while s > 0 condition is not rewritten with that index.

Does length((KernelAbstractions.__workitems_iterspace)(__ctx__)) need to be recomputed so many times? Could it be computed just once, outside the user-code loops?
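
Illustratively, I would have expected something like this hand-written sketch (not actual KA output), with the length hoisted out of the emitted loops:

iterspace = KernelAbstractions.__workitems_iterspace(__ctx__)
N = length(iterspace)                        # computed once per kernel
while s_condition                            # placeholder condition
    for var"##I" = iterspace
        (KernelAbstractions.__validindex)(__ctx__, var"##I") || continue
        #= ... user body ... =#
    end
end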

Also, would a break then work correctly in a @kernel?

@anicusan
Member Author

On the GPU I see:

[...]
                    bs = ((#= /home/andreinicusan/.julia/packages/KernelAbstractions/DqITC/src/KernelAbstractions.jl:127 =#
                            (KernelAbstractions.groupsize)(__ctx__)))[1]
[...]
                    var"#177#s" = KernelAbstractions.bs ÷ 2
                    while s > 0
[...]

Why is bs the only variable accessed as KernelAbstractions.bs rather than as the locally defined bs? That explains the undefined-name error and the unsupported dynamic call to div.

The same happens with var"#177#s" and s. I find the macros.jl source a bit hard to follow - I need to wrap my head around macros better - but would you know where to look next to fix these transformations?

Thanks for the quick reply, I'll be using @macroexpand in the future! This could go into a "Debugging Kernels" section in the docs...
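
For anyone landing here later, the whole recipe is just (small usage sketch; demo is a placeholder kernel):

using KernelAbstractions

# Prints both generated variants (cpu_demo and gpu_demo) without launching:
ex = @macroexpand @kernel function demo(a)
    i = @index(Global, Linear)
    a[i] = 2 * a[i]
end
println(ex)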

@vchuravy
Member

So one thing to think through is what a while loop with a @synchronize inside should look like.

@vchuravy
Member

vchuravy commented Nov 16, 2022

Yeah, the @synchronize makes while loops hard...

# Pseudocode: one `s` slot per workitem, plus a mask of still-active workitems
s = MVector(Int, length(wkgrp))
mask = map(x -> x > 0, s)
while any(mask)
    for tid in wkgrp
        mask[tid] || continue
        if tid < s[tid]
            cache[tid] += cache[tid + s[tid]]
        end
    end
    for tid in wkgrp
        mask[tid] || continue
        s[tid] = s[tid] >> 1
    end
    mask = map(x -> x > 0, s)
end

But we certainly couldn't support break.

@vchuravy
Member

We could solve break by introducing a mask...
I like this direction, but it is something that the current architecture doesn't easily support.
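
Roughly, the masking idea as a standalone CPU simulation (illustrative only, not the actual KA transform; the steps[tid] <= 0 test stands in for whatever user condition would trigger the break):

# Each workitem that "breaks" is retired from the mask; the outer loop
# keeps iterating until every workitem has retired.
function masked_break_demo(steps::Vector{Int})
    wkgrp = eachindex(steps)
    active = trues(length(steps))
    while any(active)
        for tid in wkgrp
            active[tid] || continue
            steps[tid] -= 1
            if steps[tid] <= 0       # a user-level break lowers to:
                active[tid] = false  # retiring only this workitem
            end
        end
    end
    return steps
end

masked_break_demo([3, 1, 2])  # workitems retire after different trip counts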
