
Lower-level kernel form? #578

Open

anicusan opened this issue Mar 5, 2025 · 6 comments

Comments

@anicusan
Member

anicusan commented Mar 5, 2025

This is mainly to start a conversation around the KA kernel language, which has been accumulating more functionality / cruft; for example, if I want a high-performance kernel like one written in raw CUDA C++ (but backend- and type-agnostic and with all the Julia niceties), it ends up looking like:

@kernel unsafe_indices=true cpu=false inbounds=true function somekernel(arg1, @Const(arg2))
    ...
end

What I'd expect by default - a GPU kernel with performance comparable to CUDA - is not really what the language guides me towards, as I need to add @kernel unsafe_indices=true cpu=false inbounds=true to get close. Even then, with the recent @synchronize lane checks, we see big performance hits in previously well-performing code (e.g. from 540 ms to 1.54 s for a sum - see issue).
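
For concreteness, here is a minimal sketch of the kind of workgroup-sum kernel where that barrier cost shows up (this is not the benchmarked sum from that issue; the groupsum! name, the fixed 256 workgroup size, and the assumption that length(a) is a multiple of it are purely illustrative):

using KernelAbstractions

@kernel function groupsum!(out, @Const(a))
    li = @index(Local, Linear)
    gi = @index(Group, Linear)
    i  = @index(Global, Linear)

    # stage one element per work-item into workgroup-local memory
    tmp = @localmem eltype(a) 256
    @inbounds tmp[li] = a[i]
    @synchronize()  # the barrier that now also carries the injected lane checks

    # the first work-item folds the staged values into one per-group partial sum
    if li == 1
        acc = zero(eltype(a))
        for j in 1:256
            @inbounds acc += tmp[j]
        end
        @inbounds out[gi] = acc
    end
end

# launched as, e.g.: groupsum!(get_backend(a), 256)(out, a; ndrange=length(a))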

Perhaps this is the point where I should emphasise how much I appreciate KernelAbstractions and the titanic work put into it and the JuliaGPU ecosystem. I hope this post does not come across as sweeping criticism, but as a discussion of possible future improvements (of course, "improvements" here being simply my personal opinion, based on the work I do and how I'm using KA for HPC code).

Having followed KA development for a few years now, I understand the constraints that evolved the current KA interface - implicit boundschecks, separate CPU and GPU compilation pipelines, ndrange being, well, a range and not the blocksize and nblocks seen in CUDA, divergent synchronize, etc.

Would there be a possibility for, say, a @rawkernel, with more minimal functionality:

@rawkernel function somekernel(arg1, @const(arg2))
    # Closely mimic the typical GPU API (CUDA, OpenCL) only exposing the local and block indices
    bi = get_group_id(1)
    li = get_local_id(1)
    @synchronize()
    ...
end

# Example syntax to get point across - I don't care much for that now, just the functionality
block_size = 128
somekernel{backend, block_size}(arg1, arg2, nblocks=18)

Or a more JuliaGPU-like kernel syntax:

function somekernel(arg1, @const(arg2))
    bi = get_group_id(1)
    li = get_local_id(1)
    @synchronize()
    ...
end

result = @ka backend block_size=128 blocks=18 somekernel(arg1, arg2)

# Or create callable object
fkernel = @ka backend block_size=128 somekernel
fkernel(arg1, arg2, blocks=18)

This would map very closely to the GPU backend's kernel language; I think it would have a few advantages:

  • Simpler to implement and maintain: e.g. no need to inject divergent synchronization checks.
  • Simpler to transpile to the right GPU backend (maybe even transpile Julia-to-Julia, then let the backend do the work?).
  • Simpler, more concise syntax.
  • More consistent usage with the corresponding JuliaGPU @cuda, @metal, etc. kernels (see the CUDA.jl sketch after this list).
  • And most importantly, performance as you'd expect from the equivalent CUDA C++ kernel.
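
For comparison with that last point, here is a rough sketch (the vadd_cuda! name and the sizes are just for illustration, not from this thread) of how the same kind of kernel is written directly against CUDA.jl and launched with @cuda:

using CUDA

function vadd_cuda!(c, a, b)
    # reconstruct the global linear index from block/thread indices, as in CUDA C++
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(c)
        @inbounds c[i] = a[i] + b[i]
    end
    return nothing
end

# a, b, c are CuArrays of equal length
@cuda threads=128 blocks=cld(length(c), 128) vadd_cuda!(c, a, b)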

What are your thoughts?

@vchuravy
Member

vchuravy commented Mar 5, 2025

This is essentially the evolution I have in mind with #562

Aligning KA relatively closely to OpenCL/SPIRV semantics

My milestones for KA 0.10 are essentially:

  1. CPU to POCL transition
  2. Finish KernelIntrinsics
  3. Add a low-level "launch" interface.

But without touching the kernel language itself. I would then expect users like you to start using the lower-level interface directly.

KA v1.0 would then remove the deprecated functionality from the KA kernel language.

@anicusan
Member Author

anicusan commented Mar 7, 2025

What will the low-level interface look like?

Also, I’m concerned about maintaining the current performance levels as new features are added to KA. The 285% performance regression we experienced with the 0.9.34 semantics change was significant, and it would be great if we could avoid similar impacts in the future.

@vchuravy
Member

What will the low-level interface look like?

Much more like programming OpenCL.

import KernelIntrinsics

function vadd(a, b, c)
    i = KernelIntrinsics.get_global_id()
    @inbounds c[i] = a[i] + b[i]
    return
end
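
For contrast, a sketch of how the same vadd is written and launched with today's high-level @kernel form (the 256 workgroup size is only an example):

using KernelAbstractions

@kernel function vadd!(c, @Const(a), @Const(b))
    i = @index(Global, Linear)
    @inbounds c[i] = a[i] + b[i]
end

# with backend = get_backend(c) and arrays of equal length:
vadd!(backend, 256)(c, a, b; ndrange=length(c))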

The 285% performance regression we experienced with the 0.9.34 semantics change was significant, and it would be great if we could avoid similar impacts in the future.

Full agreement on this. I have been trying to be very cautious with changes like that, but in this case it was unavoidable to correctly map kernels onto existing GPU architectures.

@pxl-th
Member

pxl-th commented Mar 13, 2025

The 285% performance regression we experienced with the 0.9.34 semantics change was significant, and it would be great if we could avoid similar impacts in the future.

Haven't looked into why yet, but #564 completely hangs my machine with GaussianSplatting.jl.
From the quick logs I was able to see before the hang, it looks like it happens in the render kernel.
Haven't tried unsafe_indices yet.

@vchuravy
Member

Since you don't use global indices, you should be able to add unsafe_indices.
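
For example, a rough sketch of what that looks like (the render! name and body are stand-ins, not the actual GaussianSplatting.jl kernel): with unsafe_indices=true only the group/local indices are queried, and a global index, if needed, is reconstructed by hand.

using KernelAbstractions

@kernel unsafe_indices=true function render!(out, @Const(data))
    gi = @index(Group, Linear)
    li = @index(Local, Linear)
    # reconstruct a linear global index from the group/local indices
    i = (gi - 1) * prod(@groupsize()) + li
    if i <= length(out)
        @inbounds out[i] = data[i]
    end
end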

@pxl-th
Member

pxl-th commented Mar 13, 2025

I also saw that the kernel now uses the malloc intrinsic (thus spawning a hostcall).
Do you know why that is? And why doesn't every kernel that doesn't use unsafe_indices=true do this now?

I guess I'm not sure when/why unsafe_indices makes a difference now.
