Add support for stopping generation during a stream. #36

Merged: 4 commits merged into master on Mar 30, 2025

Conversation

@jmont-dev (Owner) commented Mar 30, 2025

Adds support for gracefully stopping an active stream when using either the generate or chat endpoints. The bound response function has been changed to return a bool, which determines whether or not to continue streaming for this response.

When used asynchronously, a simple atomic variable can be set from the calling thread so that the callback returns false and stops a stream launched in another thread. See the cases added in the tests for an example of doing so.
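As a rough sketch of the new callback shape (the handler and flag names below are illustrative only, not part of the library):

#include <atomic>
#include <functional>
#include <iostream>
#include "ollama.hpp"

// Set this from any thread to request that the stream stop.
std::atomic<bool> stop_requested{false};

// The bound callback now returns a bool: true keeps streaming, false cancels the request.
std::function<bool(const ollama::response&)> on_receive_token =
    [](const ollama::response& response)
    {
        std::cout << response.as_simple_string() << std::flush;
        return !stop_requested.load();
    };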

@jmont-dev jmont-dev merged commit be224a8 into master Mar 30, 2025
1 check passed
@Chadliu0806

Hi jmont-dev

Thanks for the quick feedback. The following is my program scenario:

  1. I package [ollama-hpp] as a DLL.
  2. I write an MFC program that handles these APIs.
  3. I create a thread and call generate(), which returns an ollama::response, to run inference.
  4. I added a stop() function to [ollama.hpp] that executes this->cli->Post("/api/stop"), but when I call it, the REST API call does not stop immediately; it waits about 2~3 seconds before disconnecting. Is this correct?

Thanks a lot.

@jmont-dev (Owner, Author) commented

I don't believe /api/stop is a supported endpoint in Ollama's API. You can view all of the available endpoints here: https://github.com/ollama/ollama/blob/main/docs/api.md.

In order to stop a generation, the client interacting with Ollama needs to cancel the request.

I provide an example of how to do so in the test Chat with Asynchronous Interrupted Streaming Response in test/test.cpp. It looks something like this:

std::atomic<bool> done{false};
std::string streamed_response;

bool on_receive_response(const ollama::response& response)
{   
    streamed_response+=response.as_simple_string();
    if (response.as_json()["done"]==true) done=true;

    // If this is true, continue streaming. If this is false, cancel the request and stop.
    return !done;
}

std::function<bool(const ollama::response&)> response_callback = on_receive_response;  
        
ollama::message message("user", "Why is the sky blue?");       
        
// test_model and options are defined in the surrounding test code in test/test.cpp.
std::thread new_thread( [message, response_callback]{ ollama::chat(test_model, message, response_callback, options); } );

unsigned int microsec_waited = 0;

// Interrupt the stream after two seconds by setting the atomic variable; the callback then returns false.
while (!done)
{
    std::this_thread::sleep_for( std::chrono::microseconds(100) );
    microsec_waited += 100;
    if (microsec_waited == 2000000) { done.store(true); }
}
new_thread.join();

Basically, the new changes allow the bound function used during a generation to return a bool that specifies whether or not to continue the stream. This gets passed to httplib and can be used to cancel the request via the content receiver, causing Ollama to stop immediately if false is returned. You can use a simple atomic variable from another thread as a switch to stop requests, as shown in this example.
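As a rough sketch of that wiring (this is not the library's actual implementation; httplib::ContentReceiver is the real cpp-httplib type and returning false from it aborts the in-flight request, but the ollama::response construction shown and the reuse of response_callback from the example above are assumptions for illustration):

// Hypothetical adapter: forward each streamed chunk to the user's callback
// and propagate its bool so httplib cancels the request when it is false.
httplib::ContentReceiver receiver =
    [&response_callback](const char* data, size_t data_length)
    {
        ollama::response response( std::string(data, data_length) );  // assumed constructor, for illustration
        return response_callback(response);  // false cancels the request
    };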

This example uses the chat endpoint, but the same technique can be used with the generate call. When the request is cancelled using this method, Ollama will stop immediately and will not take any additional time to finish the generation.
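For completeness, the same pattern with the generate endpoint might look like the sketch below. The model name and prompt are placeholders, and the streaming generate overload is assumed to take the same bool-returning callback as the chat call above:

#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>
#include "ollama.hpp"

std::atomic<bool> done{false};

bool on_receive_response(const ollama::response& response)
{
    std::cout << response.as_simple_string() << std::flush;
    if (response.as_json()["done"]==true) done=true;
    return !done;   // returning false cancels the generation
}

int main()
{
    // Placeholder model and prompt; substitute whatever you use locally.
    std::thread generation_thread( []{ ollama::generate("llama3:8b", "Why is the sky blue?", on_receive_response); } );

    // Let it stream for two seconds, then request cancellation from this thread.
    std::this_thread::sleep_for( std::chrono::seconds(2) );
    done.store(true);

    generation_thread.join();
}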

@Chadliu0806

Hi jmont

Your solution successfully stops the operation during Ollama inference.
Thanks a lot.
