
Lower-level kernel form? #578

Open

anicusan opened this issue Mar 5, 2025 · 6 comments

Comments

@anicusan
Member

anicusan commented Mar 5, 2025

This is mainly to start a conversation around the KA kernel language, which has been accumulating more functionality / cruft; for example, if I want a high-performance kernel like one written in raw CUDA C++ (but backend- and type-agnostic and with all the Julia niceties), it ends up looking like:

@kernel unsafe_indices=true cpu=false inbounds=true function somekernel(arg1, @Const(arg2))
    ...
end

What I'd expect by default - a GPU kernel with performance comparable to CUDA - is not really what the language guides me towards, as I need to add @kernel unsafe_indices=true cpu=false inbounds=true to get close. Even then, with the recent @synchronize lane checks, we see big performance hits in previously well-performing code (e.g. from 540 ms to 1.54 s for a sum - see issue).
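
For concreteness, here is a minimal sketch of the kind of workgroup-sum kernel where that barrier cost shows up (this is not the benchmarked sum from that issue; the groupsum! name, the fixed 256 workgroup size, and the assumption that length(a) is a multiple of it are purely illustrative):

using KernelAbstractions

@kernel function groupsum!(out, @Const(a))
    li = @index(Local, Linear)
    gi = @index(Group, Linear)
    i  = @index(Global, Linear)

    # stage one element per work-item into workgroup-local memory
    tmp = @localmem eltype(a) 256
    @inbounds tmp[li] = a[i]
    @synchronize()  # the barrier that now also carries the injected lane checks

    # the first work-item folds the staged values into one per-group partial sum
    if li == 1
        acc = zero(eltype(a))
        for j in 1:256
            @inbounds acc += tmp[j]
        end
        @inbounds out[gi] = acc
    end
end

# launched as, e.g.: groupsum!(get_backend(a), 256)(out, a; ndrange=length(a))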

Perhaps this is the point where I should emphasise how much I appreciate KernelAbstractions and the titanic work put into it and the JuliaGPU ecosystem. I hope this post does not come across as sweeping criticism, but as a discussion of possible future improvements (of course, "improvements" here being simply my personal opinion, based on the work I do and how I'm using KA for HPC code).

Having followed KA development for a few years now, I understand the constraints that evolved the current KA interface - implicit boundschecks, separate CPU and GPU compilation pipelines, ndrange being, well, a range and not the blocksize and nblocks seen in CUDA, divergent synchronize, etc.

Would there be a possibility for, say, a @rawkernel, with more minimal functionality:

@rawkernel function somekernel(arg1, @const(arg2))
    # Closely mimic the typical GPU API (CUDA, OpenCL) only exposing the local and block indices
    bi = get_group_id(1)
    li = get_local_id(1)
    @synchronize()
    ...
end

# Example syntax to get point across - I don't care much for that now, just the functionality
block_size = 128
somekernel{backend, block_size}(arg1, arg2, nblocks=18)

Or a more JuliaGPU-like kernel syntax:

function somekernel(arg1, @const(arg2))
    bi = get_group_id(1)
    li = get_local_id(1)
    @synchronize()
    ...
end

result = @ka backend block_size=128 blocks=18 somekernel(arg1, arg2)

# Or create callable object
fkernel = @ka backend block_size=128 somekernel
fkernel(arg1, arg2, blocks=18)

This would map very closely to the GPU backend's kernel language; I think it would have a few advantages:

  • Simpler to implement and maintain: e.g. no need to inject divergent synchronization checks.
  • Simpler to transpile to the right GPU backend (maybe even transpile Julia-to-Julia, then let the backend do the work?).
  • Simpler, more concise syntax.
  • More consistent usage with the corresponding JuliaGPU @cuda, @metal, etc. kernels (see the CUDA.jl sketch after this list).
  • And most importantly, performance as you'd expect from the equivalent CUDA C++ kernel.
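
For comparison with that last point, here is a rough sketch (the vadd_cuda! name and the sizes are just for illustration, not from this thread) of how the same kind of kernel is written directly against CUDA.jl and launched with @cuda:

using CUDA

function vadd_cuda!(c, a, b)
    # reconstruct the global linear index from block/thread indices, as in CUDA C++
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(c)
        @inbounds c[i] = a[i] + b[i]
    end
    return nothing
end

# a, b, c are CuArrays of equal length
@cuda threads=128 blocks=cld(length(c), 128) vadd_cuda!(c, a, b)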

What are your thoughts?

@vchuravy
Member

vchuravy commented Mar 5, 2025

This is essentially the evolution I have in mind with #562

Aligning KA relatively closely to OpenCL/SPIRV semantics

My milestones for KA 0.10 are essentially:

  1. CPU to POCL transition
  2. Finish KernelIntrinsics
  3. Add a low-level "launch" interface.

But without touching the kernel language itself. I would then expect users like you to start using the lower-level interface directly.

KA v1.0 would then remove the deprecated functionality from the KA kernel language.

@anicusan
Member Author

anicusan commented Mar 7, 2025

What will the low-level interface look like?

Also, I’m concerned about maintaining the current performance levels as new features are added to KA. The 285% performance regression we experienced with the 0.9.34 semantics change was significant, and it would be great if we could avoid similar impacts in the future.

@vchuravy
Member

What will the low-level interface look like?

Much more like programming OpenCL.

import KernelIntrinsics

function vadd(a, b, c)
    i = KernelIntrinsics.get_global_id()
    @inbounds c[i] = a[i] + b[i]
    return
end
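
For contrast, a sketch of how the same vadd is written and launched with today's high-level @kernel form (the 256 workgroup size is only an example):

using KernelAbstractions

@kernel function vadd!(c, @Const(a), @Const(b))
    i = @index(Global, Linear)
    @inbounds c[i] = a[i] + b[i]
end

# with backend = get_backend(c) and arrays of equal length:
vadd!(backend, 256)(c, a, b; ndrange=length(c))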

The 285% performance regression we experienced with the 0.9.34 semantics change was significant, and it would be great if we could avoid similar impacts in the future.

Full agreement on this. I have been trying to be very cautious with changes like that, but in this case it was unavoidable to correctly map kernels onto existing GPU architectures.

@pxl-th
Member

pxl-th commented Mar 13, 2025

The 285% performance regression we experienced with the 0.9.34 semantics change was significant, and it would be great if we could avoid similar impacts in the future.

Haven't looked into why yet, but #564 completely hangs my machine with GaussianSplatting.jl.
From the quick logs I was able to see before the hang, it looks like it happens in the render kernel.
Haven't tried unsafe_indices yet.

@vchuravy
Member

Since you don't use global indices, you should be able to add unsafe_indices.
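
For example, a rough sketch of what that looks like (the render! name and body are stand-ins, not the actual GaussianSplatting.jl kernel): with unsafe_indices=true only the group/local indices are queried, and a global index, if needed, is reconstructed by hand.

using KernelAbstractions

@kernel unsafe_indices=true function render!(out, @Const(data))
    gi = @index(Group, Linear)
    li = @index(Local, Linear)
    # reconstruct a linear global index from the group/local indices
    i = (gi - 1) * prod(@groupsize()) + li
    if i <= length(out)
        @inbounds out[i] = data[i]
    end
end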

@pxl-th
Member

pxl-th commented Mar 13, 2025

I also saw that the kernel now uses the malloc intrinsic (thus spawning a hostcall).
Do you know why that is? And why doesn't every kernel that doesn't use unsafe_indices=true do this now?

I guess I'm not sure when/why unsafe_indices makes a difference now.
