Hotfix prompt caching introduced in #1169, fixes #1257 #1260

ivanstepanovftw · 2023-05-01T00:35:39Z

I do not have much time trying to try to tie all the magic logic inside examples/main and refactor it, so there is another "hack", that needs // REVIEW before merging.

ejones · 2023-05-01T00:58:59Z

examples/main/main.cpp

@@ -404,7 +404,7 @@ int main(int argc, char ** argv) {
            // optionally save the session on first sample (for faster prompt loading next time)
            if (!path_session.empty() && need_to_save_session) {
                need_to_save_session = false;
-                llama_save_session_file(ctx, path_session.c_str(), session_tokens.data(), session_tokens.size());
+                llama_save_session_file(ctx, path_session.c_str(), session_tokens.data(), session_tokens.size() - 1); // FIXME: -1 is a hack


This could actually be the right approach in general. It will just have the effect of forcing eval one token earlier, which will of course have effectively the same performance. That said, ideally I'm wondering if we do this on the read side, like maybe try decrementing n_matching_session_tokens by one in the startup code above? The reason I say that is eventually I envision sessions being used to restore the full transcript vs. caching the prompt, and in this case the approach here would clip the last token. In addition, I believe that there's an (however unlikely) edge case here with empty session_tokens.

I'd actually been wondering about if we should do this anyway, because of the flow of state from eval to sampling, specifically the logits vector. In other words, if we should force an eval to ensure up-to-date logits. This corresponds to the fact that, in main at least, sampling is never done without first eval-ing. I haven't deeply explored this. Perhaps this could be related to your issue.

ejones · 2023-05-01T01:00:38Z

llama.cpp

@@ -2724,6 +2724,7 @@ size_t llama_load_session_file(struct llama_context * ctx, const char * path_ses
    const size_t n_orig_state_size = llama_get_state_size(ctx);
    if (n_state_size != n_orig_state_size) {
        fprintf(stderr, "%s : failed to validate state size\n", __func__);
+        return 0;


Thanks! Just noticed this, oops...

ejones · 2023-05-01T01:02:27Z

llama.cpp

@@ -2739,7 +2740,7 @@ size_t llama_save_session_file(struct llama_context * ctx, const char * path_ses
    llama_copy_state_data(ctx, state_data.get());

    file.write_u32('ggsn'); // magic
-    file.write_u32(0); // version
+    file.write_u32(1); // version


nit: it's harmless but I don't believe it's necessary here to change the version? The binary format isn't affected by the decision of how many tokens to store.

ejones · 2023-05-01T01:21:03Z

The one thing I want to dig into is the logs from your cache hit case in your ticket:

main: attempting to load saved session from session/7B/hello.bin..
main: could not load session file, will recreate
main: session file has exact match for prompt!

There's definitely unwanted behavior here where the session tokens are read out as a side effect that can succeed while the session fails to load (+ main not doing anything about it). I imagine there'd be some strangeness there because it would probably n_past up all the session tokens while still having the initial context.

That said, it's not obvious to me why your session loading failed or what code path this took. I'm not seeing a path where there wouldn't be a prior log or assertion.

ivanstepanovftw · 2023-05-01T10:16:55Z

Fixed in #1263

Hotfix prompt caching introduced in ggml-org#1169, fixes ggml-org#1257

c9fdebc

ivanstepanovftw requested review from ejones and ggerganov May 1, 2023 00:35

ejones reviewed May 1, 2023

View reviewed changes

ivanstepanovftw closed this May 1, 2023

Bearsaerker mentioned this pull request Mar 12, 2025

Eval bug: Gemma 3 extremly slow prompt processing when using quantized kv cache. #12352

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Hotfix prompt caching introduced in #1169, fixes #1257 #1260

Hotfix prompt caching introduced in #1169, fixes #1257 #1260

ivanstepanovftw commented May 1, 2023 •

edited

Loading

ejones May 1, 2023

ejones May 1, 2023

ejones May 1, 2023

ejones commented May 1, 2023

ivanstepanovftw commented May 1, 2023

Hotfix prompt caching introduced in #1169, fixes #1257 #1260

Hotfix prompt caching introduced in #1169, fixes #1257 #1260

Conversation

ivanstepanovftw commented May 1, 2023 • edited Loading

ejones May 1, 2023

Choose a reason for hiding this comment

ejones May 1, 2023

Choose a reason for hiding this comment

ejones May 1, 2023

Choose a reason for hiding this comment

ejones commented May 1, 2023

ivanstepanovftw commented May 1, 2023

ivanstepanovftw commented May 1, 2023 •

edited

Loading