
tool-call: fix Qwen 2.5 Coder support, add micro benchmarks, support trigger patterns for lazy grammars #12034

Merged
merged 48 commits into ggml-org:master on Mar 5, 2025

Conversation

ochafik
Collaborator

@ochafik ochafik commented Feb 22, 2025

TL;DR: fixes tool calling of Qwen 2.5 Coder 0.5B/1.5B/3B/7B/32B... at any temperature

Follow-up to #9639

Instructions to build this branch:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git remote add ochafik https://github.com/ochafik/llama.cpp
git fetch ochafik
git checkout ochafik/tool-bench-prod
cmake -B build -DLLAMA_CURL=1
cmake --build build -t llama-server --parallel --config Release
alias llama-server=./build/bin/llama-server
llama-server --jinja -fa -c 0 -hf unsloth/Qwen2.5-Coder-7B-Instruct-128K-GGUF
  • Added support for regex grammar triggers, and now respect the flag that restricts matching to the start of the output only (it was already declared but not implemented; this should avoid spurious triggering when triggers were defined as wide catch-alls). A minimal illustrative sketch of such a trigger pattern follows this list.
    • In llama.h, deprecating llama_sampler_init_grammar_lazy (which used to take tokens or words) in favour of llama_sampler_init_grammar_lazy_patterns (which takes tokens or full-string regex patterns w/ a group that marks from where the grammar is triggered)
  • Dramatically improved the tool-call success rate of Qwen 2.5 Coder (Hermes 2 format) w/ more triggers that match what the model tends to output (esp. at higher temperatures) / looser triggers w/ regular expressions
  • Added scripts/tool_bench.py to evaluate the tool-call compliance probability of llama-server & ollama on different models, at different temperatures
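To illustrate the trigger-pattern idea, a minimal sketch (not the exact patterns shipped in this PR; the Hermes-style pattern, the std::regex usage and the offset handling here are simplified, hypothetical stand-ins): the lazy grammar stays dormant until the output matches a full-string regex, and a capture group marks the position from which the grammar starts constraining sampling.

#include <iostream>
#include <regex>
#include <string>

int main() {
    // Hypothetical Hermes-style trigger: any preamble, then a <tool_call> tag.
    // The capture group marks where the lazy grammar would take over.
    const std::regex trigger(R"([\s\S]*?(<tool_call>[\s\S]*))");

    const std::string output =
        "Sure, let me check the weather.\n<tool_call>{\"name\": \"get_weather\"}";

    std::smatch m;
    if (std::regex_match(output, m, trigger)) {
        std::cout << "grammar triggered at offset " << m.position(1) << "\n";
        std::cout << "constrained text: " << m[1] << "\n";
    } else {
        std::cout << "no trigger, sampling stays unconstrained\n";
    }
}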

The following heatmap shows the compliance ratio on two super basic tool-call tests (the hello world & weather tests from examples/server/tests/unit/test_tool_call.py, now shared w/ the bench tool). There are 3 pairs of columns: llama-server from this PR, baseline llama-server (master), and ollama.

[heatmap image]

[heatmap image: qwenc1.5b]

export ARGS=( --n 30 --llama-baseline="$(which llama-server)" --temp -1 --temp 0 --temp 0.5 --temp 0.75 --temp 1 --temp 1.5 --temp 2 --temp 5 ) 

./scripts/tool_bench.py run ${ARGS[@]} --model "Qwen 2.5 Coder 7B Q4_K_M"             --output ../qwenc7b.jsonl   --hf unsloth/Qwen2.5-Coder-7B-Instruct-128K-GGUF:Q4_K_M   --ollama qwen2.5-coder:7b-instruct-q4_K_M
./scripts/tool_bench.py run ${ARGS[@]} --model "Qwen 2.5 Coder 1.5B Q4_K_M"           --output ../qwenc1.5b.jsonl --hf unsloth/Qwen2.5-Coder-1.5B-Instruct-128K-GGUF:Q4_K_M --ollama qwen2.5-coder:1.5b-instruct-q4_K_M

See gist with results for many more models

Notes about results:

  • the failures of llama-server at temp = 2 are model humour / stylistic choices ("Sure! You can use the following Python code..." instead of a tool call)
  • ollama seems to only recognize the tool call format from the template, but a model like Qwen 2.5 Coder 7B is quite... creative in its tool call outputs, esp. at higher temperatures.
  • ollama's default temperature seems to be 0.6 (hence why the @ None row roughly matches the results of the lower rows)
  • The tests may need further tweaking to accept arguably “correct” answers. The framing of the hello world test is questionable; sometimes models just explain how they would write the code.
  • The benchmark tool also supports running test_calc_results, which evaluates how well a model follows up on tool results. This seems to have more varied failure modes, so it's not evaluated by default.

TODO:

  • Run & share more bench results (esp. other Qwen Coder variants!)
  • Stabilize tests / ci
  • Analyze bench times

@github-actions github-actions bot added the script (Script related), testing (Everything test related), examples, python (python script changes) and server labels Feb 22, 2025
@GuuD

GuuD commented Feb 22, 2025

Was Qwen 2.5 Coder even trained for tool use? 🤯

@ochafik
Collaborator Author

ochafik commented Feb 23, 2025

Was Qwen 2.5 Coder even trained for tool use? 🤯

@GuuD I guess all models must be to some extent, these days. Their technical report only mentions in passing that BigCodeBench is "primarily aimed at evaluating the ability of tool-use and complex instruction following", and their results on that benchmark look quite decent. But given the variety of outputs the model wraps tool calls in, I doubt they stuck to the syntax used in their Jinja template.

Collaborator Author

@ochafik ochafik left a comment


Thanks!

@ochafik
Collaborator Author

ochafik commented Mar 5, 2025

In general this LGTM, although some parts I don't understand 100%.

Btw, since the complexity of chat.cpp has been growing recently, I think it would be beneficial to have less dependency on the json type. The problem is that the json type is known to be bad for performance, and it also makes static analysis of the code impossible.

Thanks @ngxson !

I wonder if we could use some of the static json tricks used by alibaba/yalantinglibs (nifty macros that make data structures statically reflectable, for faster & safer serialization / deserialization). But if performance is the main concern, I'd start w/ some benchmarks once the tool / mcp things stabilize (incl. streaming); there are larger bits I'd like to optimize / cache (grammar parsing, schema generation...).

@ggerganov
Member

For me the main concerns with json.hpp are the static analysis difficulties (as @ngxson mentioned) and also the increased compilation time. I don't have a good feel for how much json affects the performance of applications, but it's a second-tier concern on my list compared to the former.

}
}
}
if (params.sampling.grammar_lazy) {
GGML_ASSERT(params.sampling.grammar_trigger_tokens.size() > 0 || params.sampling.grammar_trigger_words.size() > 0);
GGML_ASSERT(!params.sampling.grammar_triggers.empty());
Collaborator


We should never use GGML_ASSERT inside params_from_json_cmpl as it will crash the whole server and malicious users can DDoS the server easily. Throw an std::runtime_error instead
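A minimal sketch of what that suggestion could look like (hypothetical stand-in code, not the diff that actually landed): validate the request and throw, so the handler can turn it into an error response instead of aborting the process.

#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for the request-parsing path: GGML_ASSERT would abort
// the whole server on malformed client input, while a thrown std::runtime_error
// can be caught by the HTTP handler and returned as a 400 error.
static void validate_lazy_grammar(bool grammar_lazy,
                                  const std::vector<std::string> & grammar_triggers) {
    if (grammar_lazy && grammar_triggers.empty()) {
        throw std::runtime_error("error: no triggers set for lazy grammar");
    }
}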

std::string value;
llama_token token = LLAMA_TOKEN_NULL;

template <class T> T to_json() const;
Collaborator


Hmm ok I didn't notice that we cannot include json in this file. Then maybe change it to:

template <class T> T to() const;

Then use with to<json> ?
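For context, a sketch of why the templated member helps here (assuming the header must not include json.hpp; the struct name and fields below are illustrative, not the actual llama.cpp declarations): the declaration never names the json type, and only the .cpp that defines the specialization needs the json header.

#include <nlohmann/json.hpp>
#include <string>

using json = nlohmann::ordered_json;

// "header" part: declaring the template needs no json include at all.
struct grammar_trigger_sketch {
    std::string value;
    template <class T> T to() const; // specialized in a .cpp that includes json
};

// ".cpp" part: the only place that has to see json.hpp.
template <> json grammar_trigger_sketch::to<json>() const {
    return json{{"value", value}};
}

int main() {
    grammar_trigger_sketch t{"<tool_call>"};
    json j = t.to<json>(); // reads nicely at the call site
    return j.at("value") == "<tool_call>" ? 0 : 1;
}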

Collaborator Author


While it does read better at the call site, I think naming it just to makes the interface harder for readers to understand, esp. given the unconventional template use (if anything, I'd name it serialize / deserialize, or provide operator<< / operator>>). Happy to revisit in a follow-up / to batch-update all the to_json* to something :-)

@@ -559,29 +587,29 @@ static common_chat_msg parse_json_tool_calls(
return result;
}

static common_chat_tool_call process_tool_call(const json & tool_call) {
Collaborator


We could also apply the same to/from<json> pattern that I discussed above for common_chat_tool_call

@ngxson
Collaborator

ngxson commented Mar 5, 2025

I wonder if we could use some of the static json tricks used by alibaba/yalantinglibs (nifty macros that make data structures statically reflectable, for faster & safer serialization / deserialization)

IMO this doesn't seem to be beneficial for us, as it implies an even bigger dependency on json (which we currently want to avoid).

Also, I doubt it is even faster. The benchmark shown is for struct_pack and not struct_json; I don't think we can go much further when parsing/serializing json.

Looking at their example:

struct person {
  std::string name;
  int age;
};
YLT_REFL(person, name, age);

struct_json::to_json(p, str);
struct_json::from_json(p1, str);

Already, this is quite bad for us since sometimes we want to have optional fields or fields with multiple possible types. For example, "prompt" can be a string or an array of tokens in our case, which breaks the "statically reflectable" property that you mentioned.

Also, having explicit to_json / from_json seems safer since you can explicitly validate the input data.
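For instance, a hand-written parser can accept both shapes and reject everything else explicitly (a sketch under those assumptions, not the server's actual parsing code):

#include <nlohmann/json.hpp>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <variant>
#include <vector>

using json = nlohmann::ordered_json;

// "prompt" may be a plain string or an array of token ids; validate explicitly.
static std::variant<std::string, std::vector<int32_t>>
parse_prompt(const json & body) {
    const json & prompt = body.at("prompt");
    if (prompt.is_string()) {
        return prompt.get<std::string>();
    }
    if (prompt.is_array()) {
        std::vector<int32_t> tokens;
        for (const auto & t : prompt) {
            if (!t.is_number_integer()) {
                throw std::runtime_error("prompt array must contain token ids");
            }
            tokens.push_back(t.get<int32_t>());
        }
        return tokens;
    }
    throw std::runtime_error("prompt must be a string or an array of tokens");
}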

But if performance is the main concern I'd start w/ some benchmarks once the tool / mcp things stabilize (incl. streaming), there's larger bits I'd like to optimize / cache (grammar parsing, schema generation...).

The performance loss is not about serializing / deserializing the json; it comes from the fact that you need to get values via .at("...") and set them via json_obj["..."] = .... This is bad because these set/get operations rely on an [un]ordered_map lookup, which has complexity O(log n)

For example, in these 2 versions, json is clearly slower:

json data = {
  {"name", "abc"},
  {"age", 123}
};

std::string salutation = "Hello, " + data.at("name"); // O(log n) each time you use "at"

// versus struct
struct person {
  std::string name;
  int age;
};
person p = person{"abc", 123};

std::string salutation = "Hello, " + p.name; // always O(1)

@ochafik
Collaborator Author

ochafik commented Mar 5, 2025

IMO this doesn't seem to be beneficial for us, as it implies an even bigger dependency on json (which we currently want to avoid).
Also, I doubt it is even faster. The benchmark shown is for struct_pack and not struct_json; I don't think we can go much further when parsing/serializing json.

@ngxson I didn't test this yet tbh, but even without static json metaprogramming tricks (which on paper could unlock some speed) there's definitely room for faster parsing and serialisation, cf. @jart's https://github.com/jart/json.cpp (a serious option to also lower compilation times btw, with a maybe slightly more verbose interface but also maybe fewer syntax gotchas)

Already, this is quite bad for us since sometimes we want to have optional fields or fields with multiple possible types. For example, "prompt" can be a string or an array of tokens in our case, which breaks the "statically reflectable" property that you mentioned.

Yeah, we'd need to evaluate the feasibility of supporting std::variant and std::optional to make this work (potentially a cool weekend exploration - deciding on variant branches, if possible, could be interesting - might require extra metadata declarations, and runtime backtracking and/or a static disambiguation strategy - assuming we can do enough constexpr magic).

I think the main value in adding a bit of static typing would be to reduce the surface for misuse. (and also, would move the json dependency entirely in a single generic serialization compilation unit)

is bad because this set/get operation relies on [un]ordered_map, which has complexity O(log n)

Actually it's a hashmap (Edit: my bad, looks like ordered_json uses a vector so it's all O(n) (ouch!), and the default nlohmann::json - which I don't think we use - uses std::map so indeed O(log n))

@ngxson
Collaborator

ngxson commented Mar 5, 2025

there's definitely room for faster parsing and serialisation

As said, there are not many places in the code where we parse / serialize json, so just a reminder that it may not be beneficial to optimize this part.

Actually it's a hashmap so average O(1), worst case O(n) (with n very small anyway).

Ok, maybe I confused it with std::map, which uses a red-black tree.

But anyway, that O(1) is still slower than accessing a struct member. Don't forget that with an std::unordered_map<std::string, ...> you also need to calculate the hash of the string, which depends on the string length. By contrast, accessing a class member can be done in just one pointer dereference, and can even be optimized at compile time.

I think the main value in adding a bit of static typing would be to reduce the surface for misuse. (and also, would move the json dependency entirely in a single generic serialization compilation unit)

I'm not sure if I understand this correctly; aren't structs already used for static typing?

@ngxson
Collaborator

ngxson commented Mar 5, 2025

static json metaprogramming

Maybe a bit off-topic, but IMO doing this with C macros should not be a big problem. The bigger issue is that we try not to use too many macros (or nested macros), as they are impossible to debug with something like gdb. When possible, use templates instead.

@ochafik
Collaborator Author

ochafik commented Mar 5, 2025

Ok, maybe I confused it with std::map, which uses a red-black tree.

@ngxson cf. my update above, the default nlohmann::json (which we mostly don't use) does use an std::map, but nlohmann::ordered_json uses... a vector 😅 (probably not a huge issue given they're all smallish anyway)

I think the main value in adding a bit of static typing would be to reduce the surface for misuse. (and also, would move the json dependency entirely in a single generic serialization compilation unit)

I'm not sure if I understand this correctly; aren't structs already used for static typing?

Yes, but not in places like the to_json_oaicompat* helpers, where I've been caught many times by an extraneous pair of curly braces that completely changes the meaning of a json array, etc.
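A sketch of the kind of gotcha meant here (illustrative values, not code from the server): with nlohmann's initializer lists, one extra pair of braces silently turns an object into an array containing that object.

#include <nlohmann/json.hpp>
#include <iostream>

using json = nlohmann::ordered_json;

int main() {
    json a = { {"role", "user"} };     // object: {"role":"user"}
    json b = { { {"role", "user"} } }; // array:  [{"role":"user"}]

    std::cout << a.dump() << "  (" << (a.is_object() ? "object" : "array") << ")\n";
    std::cout << b.dump() << "  (" << (b.is_object() ? "object" : "array") << ")\n";
}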

static json metaprogramming

Maybe a bit off-topic, but IMO doing this with C macros should not be a big problem. The bigger issue is that we try not to use too many macros (or nested macros), as they are impossible to debug with something like gdb. When possible, use templates instead.

Absolutely! Those variadic macros give me bad chills haha, and I'm not sure I like the idea of maintaining variadic templates either, but I'm keeping an open mind ;-)

@ngxson
Collaborator

ngxson commented Mar 5, 2025

Yes, but not in places like the to_json_oaicompat* helpers, where I've been caught many times by an extraneous pair of curly braces that completely changes the meaning of a json array, etc.

I see. Yeah, in fact the main purpose of to_json_oaicompat* is to convert the struct data into an OAI-compatible schema. The problem is that the original schema is in OpenAPI format, so I'm not sure how we can translate it into cpp. For now, we rely mostly on the server pytest scripts to check that.

@ochafik ochafik merged commit 669912d into ggml-org:master Mar 5, 2025
50 checks passed
mglambda pushed a commit to mglambda/llama.cpp that referenced this pull request Mar 8, 2025
…t trigger patterns for lazy grammars (ggml-org#12034)

* sampler: turn lazy grammar trigger words to regexes

* add scripts/tool_bench.sh & .py

* constrain llama json output regardless of function name if matches at beginning

* update relaxed newline space rule in grammar tests

* support add_generation_prompt query parameter (useful for /apply_template)

* Update src/llama-grammar.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

---------

Co-authored-by: Georgi Gerganov <[email protected]>
@codefromthecrypt

Ran into some snags and listed them here: #12279. One case almost passed. If anyone can take a look, I'd love to get something stable working, as I'm demonstrating this in about 10 days.

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Mar 19, 2025
…t trigger patterns for lazy grammars (ggml-org#12034)
