Command line switch to use F16 for memory_k and memory_v (refactor of #154) #294
Made the changes requested by @ggerganov in #154. Fixes #146.
With this change, you can halve the memory used for memory_k and memory_v:

llama_model_load: memory_size = 512.00 MB
-> memory_size = 256.00 MB

(7B, q4_0, ctx 512)
An informal, non-empirical comparison does not seem to show any degradation in prediction quality, but that might not mean much. (Waiting on #270.)