
tool-call: add support for tool-calls using Model Context Protocol #11556

Open · wants to merge 84 commits into master

Conversation


@bandoti bandoti commented Jan 31, 2025

This PR adds support for tool-calls using a --tools switch to llama-cli.

It is currently ⚠Experimental!⚠

To test this, first build llama-cli using something like:

cmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Debug -DLLAMA_CURL=ON -DLLAMA_TOOLCALL=ON
cmake --build build --config Debug

Then run a Model Context Protocol server:

npm install @modelcontextprotocol/server-everything
npx -y supergateway --stdio "npx -y @modelcontextprotocol/server-everything"

In another terminal, launch llama-cli (remove the --single-turn switch to interact):

./build/bin/llama-cli.exe -c 2048 -ngl 8 -cnv --jinja -m 'C:/Users/bandoti/Downloads/Llama-3.2-3B-Instruct-Q6_K.gguf' --tools "http://localhost:8000/sse" -p "What is one plus nine?" --single-turn

Output:

...

{
    "type": "function",
    "function": {
        "name": "add",
        "description": "Adds two numbers",
        "parameters": {
            "properties": {
                "a": {
                    "description": "First number",
                    "type": "number"
                },
                "b": {
                    "description": "Second number",
                    "type": "number"
                }
            },
            "type": "object"
        }
    }
}
...

user

What is one plus nine?assistant

{"name": "add", "parameters": {"a": 1, "b": 9}}Accepted

The sum of 1 and 9 is 10. [end of text]

And the MCP server output:

[supergateway] New SSE connection from ::1
[supergateway] POST to SSE transport (session 0f6ff484-3557-4972-a8e3-451fd4c69f36)
[supergateway] SSE → Child (session 0f6ff484-3557-4972-a8e3-451fd4c69f36): {"jsonrpc":"2.0","id":1,"method":"initialize","params":{"capabilities":{},"clientInfo":{"name":"llama.cpp","version":"1.0.0"},"protocolVersion":"2024-11-05"}}
[supergateway] Child → SSE: {
  result: {
    protocolVersion: '2024-11-05',
    capabilities: { prompts: {}, resources: [Object], tools: {}, logging: {} },
    serverInfo: { name: 'example-servers/everything', version: '1.0.0' }
  },
  jsonrpc: '2.0',
  id: 1
}
[supergateway] POST to SSE transport (session 0f6ff484-3557-4972-a8e3-451fd4c69f36)
[supergateway] SSE → Child (session 0f6ff484-3557-4972-a8e3-451fd4c69f36): {"jsonrpc":"2.0","method":"notifications/initialized"}
[supergateway] POST to SSE transport (session 0f6ff484-3557-4972-a8e3-451fd4c69f36)
[supergateway] SSE → Child (session 0f6ff484-3557-4972-a8e3-451fd4c69f36): {"jsonrpc":"2.0","id":2,"method":"tools/list"}
[supergateway] Child → SSE: {
  result: {
    tools: [ [Object], [Object], [Object], [Object], [Object], [Object] ]
  },
  jsonrpc: '2.0',
  id: 2
}
[supergateway] POST to SSE transport (session 0f6ff484-3557-4972-a8e3-451fd4c69f36)
[supergateway] SSE → Child (session 0f6ff484-3557-4972-a8e3-451fd4c69f36): {"jsonrpc":"2.0","id":3,"method":"tools/call","params":{"arguments":{"a":1,"b":9},"name":"add"}}
[supergateway] Child → SSE: { result: { content: [ [Object] ] }, jsonrpc: '2.0', id: 3 }
[supergateway] SSE connection closed (session 0f6ff484-3557-4972-a8e3-451fd4c69f36)
[supergateway] Client disconnected (session 0f6ff484-3557-4972-a8e3-451fd4c69f36)

Tasks:

Integrating toolcall support with llama-cli

  • Add a --tools option to pass in a JSON tools array (see the sketch after this list)
  • Add a --tool-choice option which defaults to "auto" (see this ref)
  • Add a --tool-parallel switch for parallel tool-calls.
  • Copy remaining logic from oaicompat_completion_params_parse in utils.hpp into common_chat_apply_template (common.cpp).
  • Some other grammar changes in the main.cpp algorithm?
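For the --tools/--tool-choice options, this is roughly the registration I have in mind for common/arg.cpp. It is only a sketch; the common_params members used here (tools, tool_choice) are placeholders for this PR rather than existing fields, and the common_arg overloads may differ:

add_opt(common_arg(
    {"--tools"}, "JSON_OR_URL",
    "JSON array of tool definitions, or an MCP server URL (e.g. http://localhost:8000/sse)",
    [](common_params & params, const std::string & value) {
        params.tools = value; // placeholder member
    }
));
add_opt(common_arg(
    {"--tool-choice"}, "CHOICE",
    "\"auto\" (default), \"required\", or \"none\"",
    [](common_params & params, const std::string & value) {
        params.tool_choice = value; // placeholder member
    }
));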

Implement toolcall handlers for Model Context Protocol (MCP).

  • Add C++ types for base MCP messages (a rough sketch follows this list).
  • Add C++ types and procedures for Lifecycle phase of MCP protocol.
  • Implement Stdio transport.
  • Implement HTTP SSE transport using cURL.
  • Add base types in the common library for abstracting out tool-call handlers. This should include types/functions for translating between the underlying tool-call implementation (OpenAI style) and other formats (MCP in this case). After the template gets applied in common_chat_apply_template via a call to common_chat_params_init, the resulting prompt member of common_chat_params will contain the JSON-formatted tool-calls. This should be translated and dispatched to the registered handlers (if one was specified).
  • Other refactoring to support receiving input from the handlers while simultaneously allowing the user's input/interjection between request/response in the handlers.
  • Add C++ types for MCP utility messages to ping, cancel, and receive progress updates for long-running tool-calls.
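For the base message and lifecycle types, the target shape follows the initialize request visible in the supergateway log above. A rough sketch using the nlohmann::json library already vendored by llama.cpp (the type and member names are illustrative, not the final ones):

#include <nlohmann/json.hpp>
#include <string>

using json = nlohmann::json;

// Illustrative MCP lifecycle message: the initialize request sent by the client.
struct mcp_initialize_request {
    int         id               = 1;
    std::string protocol_version = "2024-11-05";

    json to_json() const {
        return {
            {"jsonrpc", "2.0"},
            {"id", id},
            {"method", "initialize"},
            {"params", {
                {"capabilities", json::object()},
                {"clientInfo", {{"name", "llama.cpp"}, {"version", "1.0.0"}}},
                {"protocolVersion", protocol_version},
            }},
        };
    }
};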

@bandoti bandoti requested a review from ngxson as a code owner February 4, 2025 19:06
@github-actions bot added the testing and server labels Feb 4, 2025

bandoti commented Feb 4, 2025

@ochafik I am working on adding the tool calls to llama-cli, and at this point I have wired initial support (from what I can tell) into common_chat_apply_template for passing in the templates and the tool array/tool_choice.

However, I need some advice on how to handle the remaining fields of common_chat_params as returned by common_chat_params_init. My basic understanding is that each time the template gets applied, those fields need to be relayed back to the sampling parameters so they can get hooked into the main token-processing routine. Is this correct? If so, do I simply need to tokenize/push the grammar triggers like server.cpp does? At the moment common_chat_apply_template returns a string, but I can change that by adding an out parameter or something.
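Something like this is what I'm picturing, with placeholder field names that I haven't confirmed against common.h:

// Placeholder sketch of what I mean by relaying the template results back to
// the sampling parameters; the member names below are guesses, not the real ones.
common_chat_params chat_params = common_chat_params_init(tmpl, inputs);

sparams.grammar      = chat_params.grammar;       // lazy tool-call grammar
sparams.grammar_lazy = chat_params.grammar_lazy;
for (const auto & trigger : chat_params.grammar_triggers) {
    sparams.grammar_triggers.push_back(trigger);  // watched during sampling
}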

Thank you for your work on the core of this feature; I am excited to get it working on llama-cli! 😊


ochafik commented Feb 5, 2025

Hey @bandoti , sorry for the delay, some quick background questions first:

  • What use case do you have in mind for this? Is it to treat the CLI as a single-shot server?
  • How would you display the output of the tool calls to make it usable (in OpenAI format?). Could you add an example output to the PR description?

Have you considered going one step further and having the CLI call tools? @brucepro is looking into doing tool calls w/ MCP servers from the server's Web UI (ref); maybe you could join forces / do the same in C++ w/ cURL.


bandoti commented Feb 5, 2025

@ochafik I got this working in llama-cli now. Here's the command I ran, followed by the output:

 ./build/bin/llama-cli.exe -c 2048 -ngl 8 -cnv --jinja -m 'C:/Users/mtmcp/Downloads/Llama-3.2-3B-Instruct-Q6_K.gguf' --tools '[
    {
      "type":"function",
      "function":{
        "name":"get_current_weather",
        "description":"Get the current weather in a given location",
        "parameters":{
          "type":"object",
          "properties":{
            "location":{
              "type":"string",
              "description":"The city and state, e.g. San Francisco, CA"
            }
          },
          "required":["location"]
        }
      }
    }
  ]'

system

Environment: ipython
Cutting Knowledge Date: December 2023
Today Date: 05 Feb 2025

You have access to the following functions. To call a function, please respond with JSON for a function call.Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}.Do not use variables.

{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city and state, e.g. San Francisco, CA"
                }
            },
            "required": [
                "location"
            ]
        }
    }
}

You are a helpful assistant


> What is the weather like in Mumbai?
{"name": "get_current_weather", "parameters": {"location": "Mumbai"}}

>
llama_perf_sampler_print:    sampling time =       1.41 ms /    36 runs   (    0.04 ms per token, 25477.71 tokens per second)
llama_perf_context_print:        load time =    1731.11 ms
llama_perf_context_print: prompt eval time =   17904.77 ms /   204 tokens (   87.77 ms per token,    11.39 tokens per second)
llama_perf_context_print:        eval time =    1457.84 ms /    18 runs   (   80.99 ms per token,    12.35 tokens per second)
llama_perf_context_print:       total time =   29930.62 ms /   222 tokens
Interrupted by user


bandoti commented Feb 5, 2025

Hey @bandoti , sorry for the delay, some quick background questions first:

* What use case do you have in mind for this? Is it to treat the CLI as a single-shot server?

* How would you display the output of the tool calls to make it usable (in OpenAI format?). Could you add an example output to the PR description?

Have you considered going one step further and having the CLI call tools? @brucepro is looking into doing tool calls w/ MCP servers from the server's Web UI (ref); maybe you could join forces / do the same in C++ w/ cURL.

@ochafik Good timing, we responded at the exact same time haha. No worries on the delay—here are some general objectives:

  1. Testability. Having llama-cli able to process these function calls lends itself to some really useful automated tests using tools like expect and co. This can quickly validate the logic of the function-call behavior.
  2. I have actually been working on an ongoing effort to wrap llama-cli in a Tcl scripting environment, and the general idea here is that these function calls could be an extremely interesting way to create automation.

In both of these cases, the output can be processed and simply scanned for a valid JSON result. If it's valid, honor the function calls; otherwise, just print it to the console.
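Something along these lines, as a sketch using the nlohmann::json library that llama.cpp already vendors (not actual llama-cli code):

#include <nlohmann/json.hpp>
#include <iostream>
#include <string>

// Returns true if the model output parsed as a function call and was dispatched;
// otherwise the caller just prints the output as normal text.
static bool try_dispatch_toolcall(const std::string & output) {
    auto parsed = nlohmann::json::parse(output, /*cb=*/nullptr, /*allow_exceptions=*/false);
    if (parsed.is_discarded() || !parsed.contains("name") || !parsed.contains("parameters")) {
        return false;
    }
    std::cout << "tool call: " << parsed["name"] << "\n";
    // ... hand parsed["parameters"] to the registered handler here ...
    return true;
}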


bandoti commented Feb 5, 2025

I will track the MCP protocol work; it sounds interesting! I still think there's a lot of need for local-only tools however, and want to ensure these features are workable/testable without standing up endpoints and such. 😊

When you mention adding this capability in cURL, how do you mean? Setting up llama-cli as an MCP client?

EDIT: After reading more on MCP I see the potential flow, where the AI runs and communicates with the resource services. I'd imagine building that on top of the changes here would work well. A series of services can simply be passed into the llama-cli and it could dispatch to them when it needs something (at least that's how I'm understanding it).


brucepro commented Feb 5, 2025

I will track the MCP protocol work; it sounds interesting! I still think there's a lot of need for local-only tools however, and want to ensure these features are workable/testable without standing up endpoints and such. 😊

When you mention adding this capability in cURL, how do you mean? Setting up llama-cli as an MCP client?

For MCP, I am adding the SSE client support into the webui. This link was the best example I found: https://github.com/apify/tester-mcp-client/blob/main/src/mcpClient.ts
Then you can run one of the proxies that allows you to use MCP servers directly. This one seemed promising: https://github.com/punkpeye/mcp-proxy/, although I think writing a Python solution to handle the SSE API calls and just using the Python SDK directly (https://github.com/modelcontextprotocol) is where I will end up. So in the end the WebUI will be able to add any SSE server with a config of:

{
  "mcpServers": {
    "fetch": {
      "name": "Fetch",
      "type": "sse",
      "serverUrl": "http://localhost:8765/sse"
    }
  }
}

Still in progress. Once I hit debug mode I will update my repo and start testing.

@ochafik ochafik self-requested a review February 5, 2025 17:09

bandoti commented Feb 5, 2025

@brucepro thanks for the info on this. It seems to me, in general, a protocol like this is the way to go for the local AI in llama-cli to invoke actions as well. I'll take a closer look and see what it'll take to add it.


bandoti commented Feb 5, 2025

@ochafik As I understand it, the requirement to get this working is that I need to add a "translation" layer between the model's OpenAI function-call request/response and MCP, correct? This shouldn't be too difficult with cURL and the json library.
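Roughly the translation I mean, going off the request shapes in the supergateway log in the PR description (sketch only, using nlohmann::json):

#include <nlohmann/json.hpp>

using json = nlohmann::json;

// Wrap an OpenAI-style call emitted by the model, e.g.
// {"name": "add", "parameters": {"a": 1, "b": 9}},
// into a JSON-RPC tools/call request for the MCP server.
json oai_to_mcp_tools_call(const json & oai_call, int rpc_id) {
    json args = oai_call.contains("parameters") ? oai_call["parameters"] : json::object();
    return {
        {"jsonrpc", "2.0"},
        {"id", rpc_id},
        {"method", "tools/call"},
        {"params", {
            {"name",      oai_call.at("name")},
            {"arguments", args},
        }},
    };
}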

I really like the discovery aspect of the MCP protocol—will make managing a collection of functionality much easier.

So I will start working on it as I think this is an important part of the function call API. We can revisit the other aspects of MCP like prompts and the like—those are very powerful as well, albeit that's a fair amount of work so will have to be done gradually.

@brucepro
Contributor

@ochafik As I understand it, the requirement to get this working is that I need to add a "translation" layer between the model's OpenAI function-call request/response and MCP, correct? This shouldn't be too difficult with cURL and the json library.

I really like the discovery aspect of the MCP protocol—will make managing a collection of functionality much easier.

So I will start working on it as I think this is an important part of the function call API. We can revisit the other aspects of MCP like prompts and the like—those are very powerful as well, albeit that's a fair amount of work so will have to be done gradually.


Did you make any progress on the CLI MCP? I have a super basic React app that seems to work with llama.cpp here: https://github.com/brucepro/llamacppMCPClientDemo I tested with Llama 3.3 70B but not much else. I will be adding prompts and resources next and debugging. Once it is cleaned up, I will work on migrating it to the WebUI.


bandoti commented Feb 11, 2025

@brucepro I'm currently working on adding the types for the MCP protocol and initialization handshake. I have all the types defined; just going to add unit tests on them today.

Working in a different branch but I'll merge that piece in hopefully today.

I added a checklist in the PR description above to track these changes. 😊


bandoti commented Mar 4, 2025

@CISC I merged in your --single-turn changes here. Thanks for adding that, as it works well for the toolcall case. When using a base prompt it didn't make sense to apply the chat templates, so this gives a means to invoke toolcalls non-interactively.

If you would like to test the tool-calls and report any issues, I would also be most grateful. Please see the instructions above in the PR description. With the --single-turn option I ran a quick test like:
Console 1:
npx -y supergateway --stdio "npx -y @modelcontextprotocol/server-everything"

Console 2:
./build/bin/llama-cli.exe -c 2048 -ngl 8 -cnv --jinja -m 'C:/Users/bandoti/Downloads/Llama-3.2-3B-Instruct-Q6_K.gguf' --tools "http://localhost:8000/sse" -p "What is one plus nine?" --single-turn

Output and MCP server communication were identical to the logs shown in the PR description above.

@ochafik /cc @brucepro /cc


brucepro commented Mar 4, 2025

Wasn't able to get it working on my Windows system. It compiled using my win64devkit, but when I called --tools using the SSE server it just halted with no output. Using it without was just fine. Will debug a bit today after I move to my Linux system.


bandoti commented Mar 4, 2025

@brucepro Sounds good, please let me know if I can help. At least we know the client is crashing the server (they're communicating)! 😅


CISC commented Mar 5, 2025

@CISC I merged in your --single-turn changes here. Thanks for adding that, as it works well for the toolcall case. When using a base prompt it didn't make sense to apply the chat templates, so this gives a means to invoke toolcalls non-interactively.

Great!

As far as I can see this relies on the model outputting OAI-compatible JSON responses, right? So models that don't conform (or can't be properly coerced) to that might have issues.

There's a (currently paused) PR over at transformers that will add tool call parsing using jinja with an inverse template (I have a placeholder draft PR here), which will make it easy to handle mixed responses as well as non-JSON (code).

If you would like to test the tool-calls and report any issues, I would also be most grateful. Please see the instructions above in the PR description.

I'll set it up and run some tests this week. :)


bandoti commented Mar 5, 2025

@brucepro, @CISC There is currently a logical issue with how llama-cli is handling the AI response in the templates. I need to update this to use common_chat_parse (as is done in server.cpp:to_json_oaicompat_chat) in order to gain access to the underlying tool_call members. Please stay tuned; I'll let you know when this change is integrated.

There's a (currently paused) PR over at transformers that will add tool call parsing using jinja with an inverse template (I have a placeholder draft PR here), which will make it easy to handle mixed responses as well as non-JSON (code).

Have you taken a look at the minja templating? Just want to make sure no duplicate effort is happening. @ochafik might have already completed this logic and it should be working with server already.

In general, what I've been doing in this PR is creating a new MCP client (it currently only supports tool-calls, but we can later add resources and prompts); translating toolcalls from OpenAI format to MCP format; and porting over the changes that are happening on the server. Come to think of it, after calling common_chat_parse the OpenAI compatibility layer could probably be skipped when dispatching function calls, as the result can be converted directly to an MCP function-call.
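Roughly the flow I mean; treat it as pseudocode, since the exact common_chat_parse signature and member names may differ on master, and dispatch_to_mcp_handler is just a placeholder for the registered toolcall handler:

// Parse the assistant output into structured tool calls, then hand each one
// to the MCP side without round-tripping through the OpenAI JSON envelope.
common_chat_msg msg = common_chat_parse(assistant_output, chat_format);

for (const auto & tc : msg.tool_calls) {
    dispatch_to_mcp_handler(tc.name, tc.arguments);  // placeholder dispatch
}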


CISC commented Mar 5, 2025

Have you taken a look at the minja templating? Just want to make sure no duplicate effort is happening. @ochafik might have already completed this logic and it should be working with server already.

As inverse templates are not a thing yet, minja does not support them, but the hope is certainly that it will be able to once they're ready. :)

In case it wasn't clear, we are talking about parsing responses from the model; having an inverse template means we should be able to structurally recreate the chat messages, response (with the emitted tool calls properly denoted) and all, from the rendered conversation.

bandoti added 4 commits March 5, 2025 12:54
commit 7adfa18
Author: Mason M <[email protected]>
Date:   Thu Mar 6 17:19:09 2025 -0400

    Re-Prompt after toolcall

commit c8843da
Author: Mason M <[email protected]>
Date:   Thu Mar 6 13:41:45 2025 -0400

    Use format to extract toolcalls

bandoti commented Mar 7, 2025

@CISC, @brucepro I fixed the tool-call formatting/calling, and now it "works" with various models/templates. For some reason it's looping over and over though (it keeps calling the function). I'm not exactly sure why, but it could be due to how it re-prompts with tool output. The main improvement though is that the tool-calls are being invoked properly (for the most part—one model passes strings instead of ints to the "add" function, and it crashes the client because the MCP error response is not being handled at the moment).

I am tied up for the next couple of days so won't be able to work on it, but if anyone wants to take a crack at solving the issue, a first spot to step in the debugger is the chat_formatter::result chat_formatter::operator() (const std::string & role, const std::string & content) method (in main.cpp). This is a functor stored in the chat_add_and_format variable. 😉 I will be around to answer quick questions if they come up though.

@ochafik /cc


brucepro commented Mar 8, 2025

Sorry to take so long to get back. The issue on my Windows build box was the curl dev lib. On my Ubuntu system everything worked well according to your instructions. So maybe add a reminder about installing the curl dev lib.


bandoti commented Mar 9, 2025

@ochafik, @brucepro, @CISC, @ngxson, @ggerganov
After stepping back and taking a look at the bigger picture here, I realise that the Model Context Protocol probably is not a good fit for llama-cli (still good for llama.cpp though) and should simply exist as a separate library, acting as middleware between all the MCP-server-exposed resources and the application-level LLM.

While a CLI certainly provides a good user experience, and in fact we COULD directly inject it into llama-cli's main loop (as I've been working through here), I feel that is actually out of scope for that application. In our case, a new CLI application would instead call llama-server for all the LLM sampling requests, and so forth. It would pull resources/tools from its connected MCP servers, and those tools could also request sampling from the application, which it would route to llama-server, and so forth.

What I'm trying to say is that I am thinking to instead move this to a repo outside of llama.cpp and simply make it a C library for MCP clients/servers (because really, it should work with llama.cpp and any other providers).

So in general, applications will use llama-server, and they will link against the MCP library. The library will provide hooks for the user to accept/reject requests by tools to access the sampler, and provide the notification mechanism when prompts change, et cetera. This will provide the intended separation, like their TypeScript/Python SDKs, but in a portable C API. 😊
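Very roughly, the shape of the interface I have in mind; every name below is a placeholder and nothing here is implemented yet:

#include <functional>
#include <string>

struct mcp_client;  // opaque handle, one per connected MCP server

// Connect over stdio or SSE and run the initialize/initialized lifecycle.
mcp_client * mcp_connect(const std::string & endpoint);

// Application-provided hook: approve or reject a sampling request coming from a
// tool, and route it to llama-server (or any other provider) if accepted.
using mcp_sampling_hook = std::function<bool(const std::string & request_json,
                                             std::string & response_json)>;
void mcp_set_sampling_hook(mcp_client * client, mcp_sampling_hook hook);

// Notification hook for when the server's prompts/tools/resources change.
using mcp_notify_hook = std::function<void(const std::string & notification_json)>;
void mcp_set_notify_hook(mcp_client * client, mcp_notify_hook hook);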

I am open to suggestions if anyone sees another possibility here, but this seems the most straightforward outcome of the proof-of-concept.


CISC commented Mar 9, 2025

SGTM, but I think it would still be useful for llama-cli to interface with MCP as well.


bandoti commented Mar 9, 2025

SGTM, but I think it would still be useful for llama-cli to interface with MCP as well.

We can get there but there will have to be some significant refactors. For example, MCP servers can send sampling requests to the client that would occur in parallel. We will have to support that in llama-cli as well, to dispatch requests locally.

So, the capabilities ARE there, but is it in scope to make those changes to llama-cli?


bandoti commented Mar 10, 2025

@brucepro I am going to keep pursuing adding parallel support to get full MCP functionality locally (over time—for now I think all we can do is basic tool-calls).

I realize the whole notion of dispatching between MCP and llama-server is what you're working on in Python/TypeScript, so I don't want to duplicate effort there! But I would like a fully working MCP library in C/C++. We will get there eventually.

@brucepro
Contributor

Great. About 80% done in the webui. Lots of little things to work out, such as getting the resources into additional context, checking props to make sure they're supported, and having a subprocess to run prompts. Tools work pretty well, although the agent loves the echo function too much.
Should be ready for real testing in a day or so.
