
Commit 63632d7

Merge pull request #24 from jmont-dev/manual_requests
Update Readme for handling context.
2 parents 49c2c74 + 3ab08e7 commit 63632d7

1 file changed: +58 -2 lines changed

Diff for: README.md

@@ -57,8 +57,10 @@ The test cases do a good job of providing discrete examples for each of the API
- [Embedding Generation](#embedding-generation)
- [Debug Information](#debug-information)
- [Manual Requests](#manual-requests)
- [Handling Context](#handling-context)
- [Context Length](#context-length)
- [Single-header vs Separate Headers](#single-header-vs-separate-headers)
- [About this software](#about-this-software)
- [License](#license)

@@ -404,13 +406,67 @@ std::cout << ollama::generate(request) << std::endl;
This provides the most customization of the request. Users should take care to ensure that valid fields are provided; otherwise, an exception will likely be thrown when the response is received. Manual requests can be made for the generate, chat, and embedding endpoints.
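The example above targets the generate endpoint. Below is a minimal sketch of the same pattern applied to chat; it assumes that `ollama::request` behaves like an nlohmann json object (as the assignments above suggest), that `ollama::message_type::chat` selects the chat endpoint, and that `ollama::chat` accepts a raw request the same way `ollama::generate` does. The field names follow Ollama's REST API.

```C++
// Hedged sketch: a manual chat request built as raw JSON.
// Assumes ollama::message_type::chat exists and that ollama::chat(request)
// mirrors ollama::generate(request); the bundled nlohmann json is used directly.
ollama::request request(ollama::message_type::chat);
request["model"] = "llama3.1:8b";
request["stream"] = false;

nlohmann::json user_message = { {"role", "user"}, {"content", "Why is the sky blue?"} };
request["messages"] = nlohmann::json::array({ user_message });

std::cout << ollama::chat(request) << std::endl;
```
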
### Handling Context
Context from previous generate requests can be used by including a past `ollama::response` with `generate`:

```C++
std::string model = "llama3.1:8b";
ollama::response context = ollama::generate(model, "Why is the sky blue?");
ollama::response response = ollama::generate(model, "Tell me more about this.", context);
```

This will provide the past user prompt and response to the model when making a new generation. Context can be chained across multiple generations, and it will contain the entire conversation history starting from the first prompt:

```C++
ollama::response first_response = ollama::generate(model, "Why is the sky blue?");
ollama::response second_response = ollama::generate(model, "Tell me more about this.", first_response);
ollama::response third_response = ollama::generate(model, "What was the first question that I asked you?", second_response);
```

Context can also be added as JSON when creating manual requests:

```C++
ollama::response response = ollama::generate("llama3.1:8b", "Why is the sky blue?");

ollama::request request(ollama::message_type::generation);
request["model"] = "llama3.1:8b";
request["prompt"] = "Why is the sky blue?";
request["stream"] = false;
request["context"] = response.as_json()["context"];
std::cout << ollama::generate(request) << std::endl;
```

Note that the `chat` endpoint has no specialized context parameter; context is simply supplied through the message history of the conversation:

```C++
ollama::message message1("user", "What are nimbus clouds?");
ollama::message message2("assistant", "Nimbus clouds are dense, moisture-filled clouds that produce rain.");
ollama::message message3("user", "What was the first question I asked you?");

ollama::messages messages = {message1, message2, message3};

std::cout << ollama::chat("llama3.1:8b", messages) << std::endl;
```

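Because chat context lives entirely in the message history, a conversation is continued by appending the model's reply before the next call. The following is a hedged sketch rather than a documented pattern: it assumes `ollama::messages` is a standard container with `push_back`, and it reads the reply text from the `message.content` field of the response JSON via the `as_json()` accessor shown in the manual-request example above.

```C++
// Hedged sketch: carrying chat context forward by appending the assistant's
// reply to the message history before asking a follow-up question.
ollama::messages history = { ollama::message("user", "What are nimbus clouds?") };

ollama::response reply = ollama::chat("llama3.1:8b", history);

// Ollama's chat endpoint returns the reply under message.content; this assumes
// ollama::messages supports push_back like a std::vector.
std::string content = reply.as_json()["message"]["content"].get<std::string>();
history.push_back(ollama::message("assistant", content));
history.push_back(ollama::message("user", "What was the first question I asked you?"));

std::cout << ollama::chat("llama3.1:8b", history) << std::endl;
```
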
### Context Length
Most language models have a maximum input context length that they can accept. This length determines how many previous tokens can be provided along with the prompt before information is lost. Llama 3.1, for example, has a maximum context length of 128k tokens, but Ollama often enables a much smaller default of <b>2048</b> tokens in order to reduce memory usage. You can increase the size of the context window using the `num_ctx` parameter in `ollama::options` for tasks where you need to retain a long conversation history:

```C++
// Set the size of the context window to 8192 tokens.
ollama::options options;
options["num_ctx"] = 8192;

// Perform a simple generation which includes model options.
std::cout << ollama::generate("llama3.1:8b", "Why is the sky blue?", options) << std::endl;
```

Keep in mind that increasing the context length increases the amount of memory the model occupies when loaded onto a GPU. Ensure that your hardware has enough memory to hold the larger model when configuring it for long-context tasks.

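The same setting can also be carried on a raw request built through the manual-request interface shown earlier. This is a hedged sketch: it assumes the request body is forwarded to Ollama's REST API unchanged, where model parameters such as `num_ctx` are accepted under an `options` object; only `ollama::generate(request)` itself is taken from the examples above.

```C++
// Hedged sketch: setting the context window on a manual request. The "options"
// object follows Ollama's REST API, which accepts model parameters like num_ctx.
ollama::request request(ollama::message_type::generation);
request["model"] = "llama3.1:8b";
request["prompt"] = "Why is the sky blue?";
request["stream"] = false;
request["options"]["num_ctx"] = 8192;

std::cout << ollama::generate(request) << std::endl;
```
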
## Single-header vs Separate Headers
For convenience, ollama-hpp includes a single-header version of the library in `singleheader/ollama.hpp`, which bundles the core ollama.hpp code with single-header versions of nlohmann json, httplib, and base64.h. Each of these libraries is available under the MIT license, and their respective licenses are included.
The single-header include can be regenerated from these standalone files by running `./make_single_header.sh`.

If you prefer to include the headers for these libraries separately, you can do so by including the standard header located in `include/ollama.hpp`.

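A short illustrative sketch of the two include options follows; the paths are relative to the ollama-hpp repository root, so adjust them or your include directories to match your project layout.

```C++
// Hedged sketch: choosing between the single-header bundle and the standard header.
#include "singleheader/ollama.hpp"   // bundles nlohmann json, httplib, and base64.h
// #include "include/ollama.hpp"     // alternative: standard header with separate dependencies

#include <iostream>

int main() {
    std::cout << ollama::generate("llama3.1:8b", "Why is the sky blue?") << std::endl;
    return 0;
}
```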

## About this software

Ollama is a high-quality REST server and API providing an interface to run language models locally via llama.cpp.
