tool-call: fix Qwen 2.5 Coder support, add micro benchmarks, support trigger patterns for lazy grammars #12034

Conversation
Was Qwen 2.5 Coder even trained for tool use? 🤯

@GuuD I guess all models must be to some extent, these days. Their technical report only mentions in passing the fact that BigCodeBench is "primarily aimed at evaluating the ability of tool-use and complex instruction following" and their results on that benchmark look quite decent. But given the variety of outputs the model wraps tool calls in, I doubt they stuck to the syntax used in their jinja template.
Thanks!
Thanks @ngxson! I wonder if we could use some of the static json tricks used by alibaba/yalantinglibs (nifty macros that make data structures statically reflectable, for faster & safer serialization / deserialization). But if performance is the main concern I'd start w/ some benchmarks once the tool / mcp things stabilize (incl. streaming); there are larger bits I'd like to optimize / cache (grammar parsing, schema generation...).
For me the main concerns with …
examples/server/server.cpp (outdated)

```cpp
        }
    }
}
if (params.sampling.grammar_lazy) {
    GGML_ASSERT(params.sampling.grammar_trigger_tokens.size() > 0 || params.sampling.grammar_trigger_words.size() > 0);
    GGML_ASSERT(!params.sampling.grammar_triggers.empty());
```
We should never use `GGML_ASSERT` inside `params_from_json_cmpl` as it will crash the whole server and malicious users can DDoS the server easily. Throw an `std::runtime_error` instead.
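For illustration, a minimal sketch of the suggested pattern; the `sampling_params` struct, `validate_request` helper, and field names below are stand-ins based on the diff above, not the actual server.cpp code:

```cpp
#include <cstdio>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for the relevant sampling parameters (illustration only).
struct sampling_params {
    bool grammar_lazy = false;
    std::vector<std::string> grammar_triggers;
};

// Validate the request instead of aborting the whole process with GGML_ASSERT.
static void validate_request(const sampling_params & params) {
    if (params.grammar_lazy && params.grammar_triggers.empty()) {
        throw std::runtime_error("Error: no triggers set for lazy grammar!");
    }
}

int main() {
    sampling_params bad;
    bad.grammar_lazy = true; // lazy grammar requested, but no triggers -> invalid request

    try {
        validate_request(bad);
    } catch (const std::runtime_error & e) {
        // The HTTP layer can turn this into an error response for the one
        // offending client instead of bringing the server down for everyone.
        std::printf("rejected request: %s\n", e.what());
    }
    return 0;
}
```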
```cpp
std::string value;
llama_token token = LLAMA_TOKEN_NULL;

template <class T> T to_json() const;
```
Hmm ok, I didn't notice that we cannot include `json` in this file. Then maybe change it to:

```cpp
template <class T> T to() const;
```

Then use it with `to<json>()`?
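A rough sketch of that idea, assuming the fields shown in the diff above; the `llama_token` stand-in typedef, the nlohmann specialization, and the example values are illustrative, not the exact code that landed:

```cpp
// "header" part: no json include required here.
#include <cstdint>
#include <cstdio>
#include <string>

using llama_token = int32_t;                  // stand-in for the real typedef
constexpr llama_token LLAMA_TOKEN_NULL = -1;  // stand-in for the real constant

struct common_grammar_trigger {
    std::string value;
    llama_token token = LLAMA_TOKEN_NULL;

    // Declared generically so the header stays json-free; only the .cpp that
    // defines the specialization needs the json header.
    template <class T> T to() const;
};

// "implementation" part: this translation unit may include json.
#include <nlohmann/json.hpp>
using json = nlohmann::ordered_json;

template <>
json common_grammar_trigger::to<json>() const {
    return json {
        {"value", value},
        {"token", token},
    };
}

int main() {
    common_grammar_trigger trigger;
    trigger.value = "<tool_call>";
    trigger.token = 42;
    // Call site reads as suggested: trigger.to<json>()
    std::printf("%s\n", trigger.to<json>().dump().c_str());
    return 0;
}
```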
While it does read better at the call site, I think naming it just `to` makes the interface harder to understand for readers, esp. given the unconventional template use (if anything, I would name it `serialize` / `deserialize`, or provide `operator<<` / `operator>>`). Happy to revisit in a follow up / to batch update all the `to_json*` to something :-)
```cpp
@@ -559,29 +587,29 @@ static common_chat_msg parse_json_tool_calls(
    return result;
}

static common_chat_tool_call process_tool_call(const json & tool_call) {
```
> We could also apply the same `to/from<json>` pattern that I discussed above for `common_chat_tool_call`

IMO this doesn't seem to be beneficial for us, as it implies even more dependency on json (which we currently want to avoid). Also, I doubt it is even faster. The benchmark shown is for … Looking at their example:

```cpp
struct person {
    std::string name;
    int age;
};
YLT_REFL(person, name, age);

struct_json::to_json(p, str);
struct_json::from_json(p1, str);
```

Already, this is quite bad for us since sometimes we want to have optional fields or fields with multiple possible values. For example, … Also, having explicit …

The performance loss is not about serializing / deserializing the json, but it comes from the fact that you do need to get something by … For example, in these 2 versions, json is clearly slower:

```cpp
json data = {
    {"name", "abc"},
    {"age", 123}
};
std::string salutation = "Hello, " + data.at("name"); // O(log n) each time you use "at"

// versus struct
struct person {
    std::string name;
    int age;
};
person p = person{"abc", 123};
std::string salutation = "Hello, " + p.name; // always O(1)
```
@ngxson I didn't test this yet tbh, but even without static json metaprogramming tricks (which on paper could unlock some speed) there's definitely room for faster parsing and serialisation, cf. @jart's https://github.com/jart/json.cpp (a serious option to also lower compilation times btw, with a maybe slightly more verbose interface but also maybe fewer syntax gotchas).
Yeah, we'd need to evaluate feasibility of support … I think the main value in adding a bit of static typing would be to reduce the surface for misuse (and also, it would move the json dependency entirely into a single generic serialization compilation unit).
As said, there are not many places in the code where we use parsing / serializing, so just a reminder that it may not be beneficial to optimize this part.
Ok, maybe I confused it with std::map, which uses a red-black tree. But anyway, that …
I'm not sure if I understand this correctly, aren't structs already used for static typing?
Maybe a bit off-topic, but IMO doing this with C macros should not be a big problem. The biggest problem is that we try not to use too many macros (or nested macros), as they are impossible to debug with something like gdb. When possible, use …
@ngxson cf. my update above, the default …
Yes, but not in places like the …
Absolutely! Those variadic macros give me bad chills haha, and I'm not sure I like the idea of maintaining variadic templates either, but I'm keeping an open mind ;-)
I see. Yeah, in fact the main purpose of …
…t trigger patterns for lazy grammars (ggml-org#12034)

* sampler: turn lazy grammar trigger words to regexes
* add scripts/tool_bench.sh & .py
* constrain llama json output regardless of function name if matches at beginning
* update relaxed newline space rule in grammar tests
* support add_generation_prompt query parameter (useful for /apply_template)
* Update src/llama-grammar.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

---------

Co-authored-by: Georgi Gerganov <[email protected]>
Ran into some snags and listed them here: #12279. One case almost passed. If anyone can take a look, I'd love to get something stable working, as I'm demonstrating this in about 10 days.
TL;DR: fixes tool calling of Qwen 2.5 Coder 0.5B/1.5B/3B/7B/32B... at any temperature
Follow up to #9639
instructions to build this branch
- `llama.h`: deprecating `llama_sampler_init_grammar_lazy` (which used to take tokens or words) in favour of `llama_sampler_init_grammar_lazy_patterns` (which takes tokens or full-string regex patterns w/ a group that marks from where the grammar is triggered)
- `scripts/tool_bench.py`: to evaluate tool call compliance probability of `llama-server` & `ollama` on different models, at different temperatures

The following heatmap shows compliance ratio on two super basic tool call tests (hello world & weather tests from `examples/server/tests/unit/test_tool_call.py`, now shared w/ the bench tool). 3 pairs of columns for llama-server of this PR, baseline llama-server (master), and ollama.

See gist with results for many more models.
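To make the "full-string regex pattern with a group that marks where the grammar is triggered" idea concrete, here is a small standalone sketch; the `<tool_call>` pattern and the sample output are made up for illustration and are not the exact patterns the chat handlers register:

```cpp
#include <cstdio>
#include <regex>
#include <string>

int main() {
    // A lazy grammar stays dormant until the generated text matches a trigger
    // pattern. The pattern is matched against the full output so far, and the
    // first capture group marks the position from which the grammar starts
    // constraining sampling.
    const std::regex trigger(R"([\s\S]*?(<tool_call>[\s\S]*))");

    std::string output = "Let me check the weather.\n<tool_call>{\"name\": \"get_weather\"";

    std::smatch m;
    if (std::regex_match(output, m, trigger) && m.size() > 1) {
        std::size_t from = static_cast<std::size_t>(m.position(1));
        // Everything before `from` was free-form prose; from here on the
        // grammar would constrain the model to emit a valid tool call.
        std::printf("grammar triggered at offset %zu: %s\n", from, output.c_str() + from);
    } else {
        std::printf("no trigger yet; grammar stays inactive\n");
    }
    return 0;
}
```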
Notes about results:
- … (e.g. `Sure! You can use the following Python code...` instead of a tool call)
- … @ `None` (… kinda fits results of lower rows)
- … `test_calc_results`, which evaluates how well a model follows up on tool results. This seems to have more varied failure modes so it's not evaluated by default.

TODO: