Edit: It looks like I can get the control I need out of the high-level wrapper, so this is not a pressing issue. But I would still like to use the low-level API, so an answer would be helpful.
First off, thanks to all the contributors; this library has made querying local LLMs for small projects a breeze.
I've been having a lot of fun with this, and recently I've been trying to use the low-level API. It works well, but I would like to speed up generation by offloading the model to the GPU, just as I do from the high-level API. Unfortunately, I can't see any way to do this from the provided low-level API examples.
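For context, this is the kind of offloading I mean on the high-level side (the model path and layer count here are just placeholders):

```python
from llama_cpp import Llama

# High-level wrapper: n_gpu_layers is a plain constructor argument.
llm = Llama(model_path="./models/7B/ggml-model.bin", n_gpu_layers=32)
```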
I took a look at the llama_cpp.py reference, which includes an n_gpu_layers field in the llama_context_params structure, but I cannot figure out how to actually pass that value; I can't see anywhere that it is actually set. I'm not familiar with C++, so I might be missing something obvious.
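For what it's worth, here is my best guess at what should work, adapted from the low-level example in the README. Since llama_context_params is a ctypes structure, I assume the field can be set by plain attribute assignment before loading the model; the path and layer count are placeholders, and the exact field/function names may differ by version:

```python
import llama_cpp

params = llama_cpp.llama_context_default_params()
# Assumption: n_gpu_layers can be assigned like any other ctypes field.
params.n_gpu_layers = 32
# The low-level API takes bytes for char * arguments.
ctx = llama_cpp.llama_init_from_file(b"./models/7B/ggml-model.bin", params)
```

(I understand this would only take effect if the underlying llama.cpp was built with GPU support, e.g. cuBLAS or Metal.)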
Additionally, are there any other resources I should be referencing for the low-level API?
Thanks!