This provides the most customization of the request. Take care to provide valid fields; otherwise, an exception will likely be thrown on response. Manual requests can be made for the generate, chat, and embedding endpoints.

### Handling Context

Context from previous generate requests can be used by including a past `ollama::response` with `generate`:

```C++
std::string model = "llama3.1:8b";
ollama::response context = ollama::generate(model, "Why is the sky blue?");
ollama::response response = ollama::generate(model, "Tell me more about this.", context);
```

This provides the previous user prompt and response to the model when making the new generation. Context can be chained across multiple messages, and each response carries the entire conversation history back to the first prompt:

```C++
ollama::response first_response = ollama::generate(model, "Why is the sky blue?");
ollama::response second_response = ollama::generate(model, "Tell me more about this.", first_response);
ollama::response third_response = ollama::generate(model, "What was the first question that I asked you?", second_response);
```
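
To make the chaining behaviour concrete without a running server, here is a minimal toy model of it. The names `toy_response` and `toy_generate` are hypothetical stand-ins invented for this sketch and are not part of ollama-hpp. The sketch also stores history as readable strings, whereas the Ollama server returns context as an array of encoded tokens. Each call copies the history it was given and appends the new prompt and reply, so after three chained calls the last response holds six entries, the first of which is the original question.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for ollama::response: only the fields needed here.
struct toy_response {
    std::string text;                  // the "reply" produced for this turn
    std::vector<std::string> history;  // every prompt and reply so far
};

// Hypothetical stand-in for ollama::generate(model, prompt, context).
// It copies the history carried by `context`, then appends the new prompt
// and a canned reply. No model or server is involved.
toy_response toy_generate(const std::string& prompt,
                          const toy_response& context = toy_response{}) {
    toy_response next;
    next.history = context.history;             // inherit the earlier turns
    next.history.push_back("user: " + prompt);  // add the new prompt
    next.text = "reply #" + std::to_string(next.history.size() / 2 + 1);
    next.history.push_back("model: " + next.text);
    assert(next.history.size() % 2 == 0);       // turns always come in pairs
    return next;
}
```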

Context can also be added as JSON when creating manual requests:

```C++
ollama::response response = ollama::generate("llama3.1:8b", "Why is the sky blue?");
```

Most language models have a maximum input context length. This length determines how many previous tokens can be provided alongside the prompt before information is lost. Llama 3.1, for example, has a maximum context length of 128k tokens, but Ollama often enables a much smaller default of <b>2048</b> tokens to reduce memory usage. You can increase the size of the context window using the `num_ctx` parameter in `ollama::options` for tasks where you need to retain a long conversation history:

```C++
// Set the size of the context window to 8192 tokens.
ollama::options options;
options["num_ctx"] = 8192;

// Perform a simple generation which includes model options.
std::cout << ollama::generate("llama3.1:8b", "Why is the sky blue?", options) << std::endl;
```
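
One way to picture what "information is lost" means is a window that keeps only the most recent `num_ctx` tokens. The sketch below assumes the simplest possible policy, dropping the oldest tokens first. The helper `fit_to_window` is hypothetical, and the truncation strategy actually used by the server may differ. For a ten-token history `0..9` and a window of 4, it keeps `6, 7, 8, 9`. A history that already fits is returned unchanged.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Keep only the most recent `num_ctx` tokens of a conversation.
// Assumption: plain "drop the oldest" truncation, for illustration only.
std::vector<int> fit_to_window(const std::vector<int>& tokens,
                               std::size_t num_ctx) {
    assert(num_ctx > 0);
    if (tokens.size() <= num_ctx) {
        return tokens;  // everything fits, nothing is lost
    }
    // Copy the tail of the history; the earliest tokens are discarded.
    return std::vector<int>(
        tokens.end() - static_cast<std::ptrdiff_t>(num_ctx), tokens.end());
}
```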

Keep in mind that increasing the context length increases the memory the model requires when loaded onto a GPU. Make sure your hardware has sufficient memory to hold the larger footprint when configuring for long-context tasks.
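
Much of this extra memory goes to the attention key/value cache, which grows linearly with the context length. The following is a rough back-of-the-envelope estimate, not a measurement. The helper `kv_cache_bytes` is hypothetical. The figures used in the note after the code (32 layers, 8 key/value heads of dimension 128, 16-bit cache values) are assumptions based on the published Llama 3.1 8B architecture, and real runtimes add overhead on top.

```cpp
#include <cassert>
#include <cstdint>

// Rough size in bytes of the key/value cache for a given context length.
// Every parameter is an assumption supplied by the caller.
std::uint64_t kv_cache_bytes(std::uint64_t num_ctx,
                             std::uint64_t layers,
                             std::uint64_t kv_heads,
                             std::uint64_t head_dim,
                             std::uint64_t bytes_per_value) {
    assert(num_ctx > 0);
    // Factor of 2: one key vector and one value vector per token, per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * num_ctx;
}
```

Under those assumptions, `kv_cache_bytes(2048, 32, 8, 128, 2)` comes to 256 MiB, `kv_cache_bytes(8192, 32, 8, 128, 2)` to 1 GiB, and the full 128k window (`131072` tokens) to 16 GiB, in addition to the model weights themselves.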
## Single-header vs Separate Headers

For convenience, ollama-hpp includes a single-header version of the library in `singleheader/ollama.hpp`, which bundles the core ollama.hpp code with single-header versions of nlohmann json, httplib, and base64.h. Each of these libraries is available under the MIT license, and their respective licenses are included.

The single-header include can be regenerated from these standalone files by running `./make_single_header.sh`.

If you prefer to include the headers for these libraries separately, you can do so by including the standard header located in `include/ollama.hpp`.

## About this software

Ollama is a high-quality REST server and API providing an interface to run language models locally via llama.cpp.