
Commit 811ff85

crasm authored and YellowRoseCx committed
Add --n-predict -2 for stopping generation on full context (ggml-org#2565)
1 parent 37c9717 commit 811ff85

3 files changed (+12 −4 lines)

examples/common.cpp (+1 −1)

@@ -543,7 +543,7 @@ void gpt_print_usage(int /*argc*/, char ** argv, const gpt_params & params) {
     fprintf(stdout, "  --in-suffix STRING    string to suffix after user inputs with (default: empty)\n");
     fprintf(stdout, "  -f FNAME, --file FNAME\n");
     fprintf(stdout, "                        prompt file to start generation.\n");
-    fprintf(stdout, "  -n N, --n-predict N   number of tokens to predict (default: %d, -1 = infinity)\n", params.n_predict);
+    fprintf(stdout, "  -n N, --n-predict N   number of tokens to predict (default: %d, -1 = infinity, -2 = until context filled)\n", params.n_predict);
     fprintf(stdout, "  -c N, --ctx-size N    size of the prompt context (default: %d)\n", params.n_ctx);
     fprintf(stdout, "  -b N, --batch-size N  batch size for prompt processing (default: %d)\n", params.n_batch);
     fprintf(stdout, "  -gqa N, --gqa N       grouped-query attention factor (TEMP!!! use 8 for LLaMAv2 70B) (default: %d)\n", params.n_gqa);

examples/main/README.md (+6 −2)

@@ -160,9 +160,13 @@ The following options allow you to control the text generation process and fine-
 
 ### Number of Tokens to Predict
 
-- `-n N, --n-predict N`: Set the number of tokens to predict when generating text (default: 128, -1 = infinity).
+- `-n N, --n-predict N`: Set the number of tokens to predict when generating text (default: 128, -1 = infinity, -2 = until context filled)
 
-The `--n-predict` option controls the number of tokens the model generates in response to the input prompt. By adjusting this value, you can influence the length of the generated text. A higher value will result in longer text, while a lower value will produce shorter text. A value of -1 will cause text to be generated without limit.
+The `--n-predict` option controls the number of tokens the model generates in response to the input prompt. By adjusting this value, you can influence the length of the generated text. A higher value will result in longer text, while a lower value will produce shorter text.
+
+A value of -1 will enable infinite text generation, even though we have a finite context window. When the context window is full, some of the earlier tokens (half of the tokens after `--n-keep`) will be discarded. The context must then be re-evaluated before generation can resume. On large models and/or large context windows, this will result in a significant pause in output.
+
+If the pause is undesirable, a value of -2 will stop generation immediately when the context is filled.
 
 It is important to note that the generated text may be shorter than the specified number of tokens if an End-of-Sequence (EOS) token or a reverse prompt is encountered. In interactive mode text generation will pause and control will be returned to the user. In non-interactive mode, the program will end. In both cases, the text generation may stop before reaching the specified `n-predict` value. If you want the model to keep going without ever producing End-of-Sequence on its own, you can use the `--ignore-eos` parameter.

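To make the arithmetic in the new README text concrete, here is a minimal standalone sketch (not part of the commit; the `n_ctx` and `n_keep` values are assumptions) of how many tokens survive a context swap under `--n-predict -1`:

```cpp
// Illustrative sketch of the context-swap arithmetic described above.
// The concrete numbers are assumptions, not taken from the commit.
#include <cstdio>

int main() {
    const int n_ctx  = 512; // size of the context window (--ctx-size)
    const int n_keep = 48;  // prompt tokens that are never discarded (--n-keep)

    int n_past = n_ctx;     // the window has just filled up

    // Half of the tokens after --n-keep are discarded...
    const int n_left      = n_past - n_keep;
    const int n_discarded = n_left / 2;

    // ...and the surviving half must be re-evaluated before generation
    // resumes, which is the pause the README warns about.
    const int n_reeval = n_left - n_discarded;
    n_past = n_keep + n_reeval;

    printf("window: %d | kept verbatim: %d | discarded: %d | re-evaluated: %d\n",
           n_ctx, n_keep, n_discarded, n_reeval);
    return 0;
}
```
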
examples/main/main.cpp (+5 −1)

@@ -431,8 +431,12 @@ int main(int argc, char ** argv) {
         // - take the n_keep first tokens from the original prompt (via n_past)
         // - take half of the last (n_ctx - n_keep) tokens and recompute the logits in batches
         if (n_past + (int) embd.size() + std::max<int>(0, guidance_offset) > n_ctx) {
-            const int n_left = n_past - params.n_keep;
+            if (params.n_predict == -2) {
+                fprintf(stderr, "\n\n%s: context full, stopping generation\n", __func__);
+                break;
+            }
 
+            const int n_left = n_past - params.n_keep;
             // always keep the first token - BOS
             n_past = std::max(1, params.n_keep);
             n_past_guidance = std::max(1, params.n_keep + guidance_offset);
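The hunk above checks for the new mode before any swap bookkeeping runs. A minimal sketch of that control flow, with the generation loop reduced to a counter (evaluation and sampling are elided; only the `-2` check and the message mirror the patch):

```cpp
#include <cstdio>

int main() {
    const int n_ctx     = 512; // context window size
    const int n_predict = -2;  // the new mode added by this commit
    int       n_past    = 0;   // tokens evaluated so far

    while (true) {
        const int n_pending = 1; // one new token waiting to be evaluated

        // Same shape as the condition in main.cpp: the pending tokens
        // would overflow the context window.
        if (n_past + n_pending > n_ctx) {
            if (n_predict == -2) {
                fprintf(stderr, "\n\n%s: context full, stopping generation\n", __func__);
                break; // stop instead of swapping the context
            }
            // With -1 or a positive budget, main.cpp would instead perform
            // the context swap described in the README and keep generating.
        }

        n_past += n_pending; // stand-in for evaluating one token
    }
    return 0;
}
```

Because the check sits before `n_left` is computed, `--n-predict -2` never pays the re-evaluation cost of a swap; it simply ends the run.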
