Share buffers between CPU and GPU #1696

Merged (5 commits into ggml-org:master) on Jun 5, 2023

Conversation

@kiltyj (Contributor) commented on Jun 5, 2023

This updates ggml_metal_add_buffer to use MTLDevice.newBufferWithBytesNoCopy to attempt to share buffers between the CPU and GPU rather than re-allocating them for the GPU.

I've been following #1642 and noticed that this might help some of the swapping-related issues some have been seeing with larger models and devices with < 96GB memory. With this change, I'm no longer seeing any swapping or odd/corrupted output.

Apologies if I missed any contribution steps, or if this change is missing something obvious. One thing I'm not sure about is whether this covers all possible cases. One thing to note is that newBufferWithBytesNoCopy requires a page-aligned source buffer. This seems to be the case with mmap, but I'm not sure whether it will remain true for all possible configurations / code paths.
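
For readers unfamiliar with the API, the core of the approach looks roughly like this (a minimal sketch, not the actual patch; the helper name and error handling are illustrative):

```objc
#import <Metal/Metal.h>
#include <assert.h>
#include <stdint.h>
#include <unistd.h>

// Wrap an existing, page-aligned host allocation in a Metal buffer without copying,
// so the CPU and GPU share the same memory. Hypothetical helper, for illustration only.
static id<MTLBuffer> wrap_host_buffer(id<MTLDevice> device, void * data, size_t size) {
    const size_t page_size = (size_t) getpagesize();

    // newBufferWithBytesNoCopy requires a page-aligned pointer and a length
    // that is a multiple of the page size
    assert(((uintptr_t) data % page_size) == 0);
    const size_t size_aligned = (size + page_size - 1) & ~(page_size - 1);

    return [device newBufferWithBytesNoCopy:data
                                     length:size_aligned
                                    options:MTLResourceStorageModeShared
                                deallocator:nil];
}
```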

@ggerganov (Member) commented

Ah very interesting!
I did see the MTLDevice.newBufferWithBytesNoCopy function and tried it but it kept crashing.
I guess I was running with --no-mmap at that time.

> One thing to note is that newBufferWithBytesNoCopy requires a page-aligned source buffer. This seems to be the case with mmap, but I'm not sure whether it will remain true for all possible configurations / code paths.

Currently, it's not the case without mmap. Should be easy to fix I think
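
For the non-mmap path, one straightforward way to get page alignment would be something along these lines (a sketch under that assumption, not necessarily how it was ultimately fixed):

```c
#include <stdlib.h>
#include <unistd.h>

// Allocate a page-aligned block so Metal can later wrap it with newBufferWithBytesNoCopy.
// Illustrative only; the helper name is made up.
static void * alloc_page_aligned(size_t size) {
    void * ptr = NULL;
    if (posix_memalign(&ptr, (size_t) getpagesize(), size) != 0) {
        return NULL;
    }
    return ptr;
}
```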

@kiltyj (Contributor, Author) commented on Jun 5, 2023

@ggerganov I took my best shot at it, but I'm definitely stepping into unfamiliar territory. Even without this code, these buffers seemed to be page-aligned when running with --no-mmap on my M1 Max, so I'm not sure how to best test this other than confirming that the modified code still seems to work.

@akx (Contributor) commented on Jun 5, 2023

👍 Anecdotally seems to work fine on my laptop (M2 Max). Initialization seems to be a lot faster, at least.

@x4080 commented on Jun 5, 2023

Hi, I tried this and haven't seen any disk swapping on an M2 Pro with 16GB, using llama-13b-supercot.ggmlv3.q4_0.bin.
Result:

llama_print_timings:        load time =  1501.98 ms
llama_print_timings:      sample time =     6.33 ms /   327 runs   (    0.02 ms per token)
llama_print_timings: prompt eval time =  1140.83 ms /    16 tokens (   71.30 ms per token)
llama_print_timings:        eval time = 23182.15 ms /   326 runs   (   71.11 ms per token)

@ggerganov (Member) left a comment

Great work!

@ggerganov merged commit 9d0693b into ggml-org:master on Jun 5, 2023
@kiltyj deleted the metal-shared-memory branch on June 5, 2023 at 20:30
@ggerganov mentioned this pull request on Jun 6, 2023
@Johnhersh commented

After this PR was merged I now get this error:
ggml_metal_add_buffer: buffer 'data' size 18300780544 is larger than buffer maximum of 17179869184

Any idea why?

@kiltyj (Contributor, Author) commented on Jun 6, 2023

@Johnhersh The error you shared was actually added in #1706. Was it running without issues before #1696? I suspect that before that it was actually failing silently when creating that buffer, e.g. hanging or producing garbage output.

The root cause is that macOS is refusing to create a buffer big enough to hold the model you're loading. It seems that macOS caps an individual MTLBuffer's size at roughly half of the total unified memory, which a 30B model exceeds on machines with 32GB or less.

Since the limit seems to apply to individual buffer size rather than total memory usage, last night I started playing with ideas for splitting the data buffer in a way that downstream Metal code would be happy with, but I haven't gotten anything working yet, as I'm not very familiar with how this buffer is (or can be) segmented.
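
For context, the guard that produces the error message quoted above is roughly of this shape (an illustrative sketch; the helper name is hypothetical and the exact message comes from #1706):

```objc
#import <Metal/Metal.h>
#include <stdbool.h>
#include <stdio.h>

// Returns true if a buffer of the given size can be created on this device.
// Hypothetical helper illustrating the per-buffer limit check.
static bool metal_buffer_fits(id<MTLDevice> device, const char * name, size_t size) {
    if (size > device.maxBufferLength) {
        fprintf(stderr, "ggml_metal_add_buffer: buffer '%s' size %zu is larger than buffer maximum of %zu\n",
                name, size, (size_t) device.maxBufferLength);
        return false;
    }
    return true;
}
```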

@Johnhersh commented

> Was it running without issues before #1696? I suspect that before that it was actually failing silently when creating that buffer, e.g. hanging or producing garbage output.

Yes, you're correct; before these changes it was outputting ?? repeatedly. So this error message is quite correct then. Thank you!

@kiltyj (Contributor, Author) commented on Jun 6, 2023

To confirm that this is not related to the newBufferWithBytes => newBufferWithBytesNoCopy change, I just tried running the pre-#1696 version on my 32GB machine with a 33B model. Even with newBufferWithBytes, Metal refuses to allocate a buffer large enough for the model (it returns nil).

Not too surprising. I just wanted to make sure that the MTLDevice.maxBufferLength limit applies to both methods of creating an MTLBuffer. So @ggerganov, it seems we need to figure out how to split up buffers that exceed this limit.

@davidliudev commented

Is there any way to overcome the limit and allocate more than half of the system RAM?

@ggerganov (Member) commented on Jun 9, 2023

Yes, we have to split the buffer into smaller parts.
Unless I am missing something, it should be quite easy to fix, with changes only in the ggml_metal_add_buffer() call.

I will probably fix this over the weekend, but if anyone figures it out in the meantime, I will be happy to see a PR.

@kiltyj (Contributor, Author) commented on Jun 9, 2023

> if anyone figures it out in the meantime, I will be happy to see a PR

I took a crack at it a few nights ago, but didn't finish. I had code in place to split up the buffer, but was encountering issues somewhere downstream of that. My suspicion at the time (based on a very limited understanding of what's going on in the shaders) was that the shaders sometimes reach beyond the length of the buffer segments passed to them: whereas before they could reach anywhere within the single large buffer, now they get garbage data when reaching beyond the end of the particular segment passed to them.

Again, this is just a theory, based rather shakily on a (so far) pretty limited grasp of things and a couple of hours of hacking and tinkering. @ggerganov I'm sure you could easily confirm or refute it. I'll probably try again after hours at some point, unless you beat me to it this weekend. If my theory is correct, I think we'd actually need to pass all buffer segments to the shaders and figure out which segment's pointer to use from within the shader, which I'm assuming would have performance impacts.

@ggerganov (Member) commented

> My suspicion at the time (based on a very limited understanding of what's going on in the shaders) was that the shaders sometimes reach beyond the length of the buffer segments passed to them: whereas before they could reach anywhere within the single large buffer, now they get garbage data when reaching beyond the end of the particular segment passed to them.

Yes, that might very well be the case.

I think you can still work around that by creating multiple views with overlap.
For example, say you want to map a 16GB buffer. Create 3 views, each 8GB in size:

  • view 0 has offset 0, i.e. range [0GB, 8GB]
  • view 1 has offset 4GB, i.e. range [4GB, 12GB]
  • view 2 has offset 8GB, i.e. range [8GB, 16GB]

ggml_metal_get_buffer() will then search through all the views and select the one that contains the tensor entirely.
This guarantees that tensors of up to 4GB in size will be "mappable", which is plenty for all intents and purposes.

Again, I could still be missing something
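
For concreteness, a minimal sketch of this overlapping-views idea (illustrative only; the helper name, the page-size handling, and the exact overlap are assumptions, not the eventual implementation):

```objc
#import <Metal/Metal.h>
#include <unistd.h>

// Map one large, page-aligned host region into several overlapping MTLBuffer views,
// each no larger than the device's per-buffer limit. Hypothetical helper.
static void add_buffer_views(id<MTLDevice> device, void * data, size_t size,
                             NSMutableArray<id<MTLBuffer>> * views) {
    const size_t page_size = (size_t) getpagesize();
    const size_t max_size  = (size_t) device.maxBufferLength;

    // step by half a view so that any tensor up to max_size/2 bytes lies
    // entirely inside at least one view; keep offsets page-aligned
    size_t step = max_size / 2;
    step -= step % page_size;

    for (size_t offs = 0; offs < size; offs += step) {
        size_t view_size = (size - offs < max_size) ? (size - offs) : max_size;
        // Metal wants the length to be a multiple of the page size; this assumes
        // the underlying allocation is padded to a page boundary
        view_size = (view_size + page_size - 1) & ~(page_size - 1);

        id<MTLBuffer> view = [device newBufferWithBytesNoCopy:(char *) data + offs
                                                       length:view_size
                                                      options:MTLResourceStorageModeShared
                                                  deallocator:nil];
        [views addObject:view];

        if (offs + view_size >= size) {
            break; // reached the end of the region
        }
    }
    // ggml_metal_get_buffer() would then pick the view that contains a tensor's
    // [offset, offset + size) range entirely and return it together with the
    // tensor's offset inside that view.
}
```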

@kiltyj (Contributor, Author) commented on Jun 10, 2023

> I think you can still work around that by creating multiple views with overlap.
> For example, say you want to map a 16GB buffer. Create 3 views, each 8GB in size:

This is pretty close to what I tried the other night, except that the view size was a function of maxBufferLength rather than fixed at 8GB, and I simply selected the view that contained the tensor's starting address in the first half of the view.

The view size on my Mac would have been larger than 8GB, but I can try with fixed 8GB segments in case I had a bug with alignment or something, unless you can see an obvious issue with what I described above.
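
For what it's worth, the selection rule described above can be expressed as follows (hypothetical helper; here step is the distance between consecutive view offsets, i.e. half the view size):

```c
#include <stddef.h>

// Pick the view whose first half contains the tensor's starting offset.
// With views at offsets 0, step, 2*step, ... and each view 2*step bytes long,
// a tensor of up to step bytes starting in the first half can never run past
// the end of the chosen view.
static size_t find_view(size_t tensor_offs, size_t step) {
    return tensor_offs / step;
}
```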

@ggerganov (Member) commented

The 8GB view is just an example - it should be dynamic, similar to what you describe.
Not sure why it wouldn't work
